Roundup of remote encrypted deduplicated backups in Linux

Since I last wrote about Linux backup tools, in a 2008 article about BackupPC and similar tools and a 2011 article about deduplicating filesystems, I’ve revisited my personal backup strategy a bit.

I still use ZFS, with my tool “simplesnap” that I wrote about in 2014 to perform local backups to USB drives, which get rotated offsite periodically. This has the advantage of being very fast and very secure, but I also wanted offsite backups over the Internet. I began compiling criteria, which ran like this:

  • Remote end must not need any special software installed. Storage across rsync, sftp, S3, WebDAV, etc. should all be good candidates. The remote end should not need to support hard links or symlinks, etc.
  • Cross-host deduplication, at least at the file level, is required, so that if I move a 4GB video file from one machine to another, my puny DSL doesn’t have to re-upload it.
  • All data that is stored remotely must be 100% encrypted 100% of the time. I must not need to have any trust at all in the remote end.
  • Each backup after the first must send only an incremental’s worth of data across the line. No periodic re-uploading of the entire data set can be done.
  • The repository format must be well-documented and stable.
  • POSIX attributes such as uid/gid, permission bits, symbolic links, and hard links must be preserved. Support for xattrs is also desirable but not required.

So, how did things stack up?

Didn’t meet criteria

A lot of popular tools didn’t meet the criteria. Here are some that I considered:

  • BackupPC requires software on the remote end and does not do encryption.
  • None of the rsync hardlink-tree-based tools are suitable here; they require hard link support on the remote end and store the data unencrypted.
  • rdiff-backup requires software on the remote end and does not do encryption or dedup.
  • duplicity requires a periodic re-upload of a full backup, or incremental chains become quite long and storage-inefficient. It also does not support dedup, although it does have an impressive list of “dumb” storage backends.
  • ZFS, if used to do backups the efficient way, would require software to be installed on the remote end. If simple “zfs send” images are used, the same limitations as with duplicity apply.
  • bup and zbackup are both interesting deduplicators, but do not yet have support for removing old data, so are impractical for this purpose.
  • burp requires software on the server side.

Obnam and Attic/Borg Backup

Obnam and Attic (and its fork Borg Backup) are both programs that have a similar concept at their heart, which is roughly this: the backup repository stores small chunks of data, indexed by a checksum. Directory trees are composed of files that are assembled out of lists of chunks, so if any given file matches another file already in the repository somewhere, the added cost is just a small amount of metadata.
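
The core idea is simple enough to sketch in a few lines of Python. This is only an illustration of the concept, not either tool’s actual repository format, and every name in it is made up:

  # Sketch of checksum-indexed chunk storage (illustrative only; not
  # Obnam's or Attic's real on-disk format).
  import hashlib

  CHUNK_SIZE = 1024 * 1024      # fixed-size chunks, as Obnam uses

  chunk_store = {}              # checksum -> chunk bytes (the shared repository)

  def store_file(path):
      """Split a file into chunks and return the list of checksums that
      reconstructs it.  Chunks already in the store cost only a reference."""
      refs = []
      with open(path, "rb") as f:
          while True:
              chunk = f.read(CHUNK_SIZE)
              if not chunk:
                  break
              digest = hashlib.sha256(chunk).hexdigest()
              chunk_store.setdefault(digest, chunk)   # dedup happens here
              refs.append(digest)
      return refs

  def restore_file(refs, path):
      with open(path, "wb") as f:
          for digest in refs:
              f.write(chunk_store[digest])

A second copy of a file, on any host backed up to the same repository, adds only another small list of checksums.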

Obnam was eventually my tool of choice. It has built-in support for sftp, and because it makes only minimal demands on local filesystem semantics, it also works fine atop davfs2 (and, I’d imagine, other S3-backed FUSE filesystems). Obnam’s repository format is carefully documented, and the program is very conservatively designed through and through: clearly optimized for integrity above all else, including speed. Just what a backup program should be. It has a lot of configurable options, including chunk size, caching information (dedup tables can be RAM-hungry), and so on. These default to fairly conservative values, and the performance of Obnam can be significantly improved with a few simple config tweaks.
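
To give a concrete idea of what those tweaks look like, the relevant part of my ~/.obnam.conf amounts to something like this (the repository location is a placeholder; the numbers are simply what worked for me and should be tuned to your own RAM and link):

  [config]
  # Where the backup goes; an sftp:// URL or a davfs2 mount point both work.
  repository = /mnt/davfs/obnam-repo
  compress-with = deflate
  # Larger chunks mean far less per-chunk metadata to shuffle around.
  chunk-size = 10485760
  # Bigger in-memory caches for the dedup lookup structures.
  lru-size = 8192
  upload-queue-size = 8192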

Attic was also a leading contender. It has a few advantages over Obnam, actually. One is that it uses an rsync-like rolling checksum method. This means that if you add 1 byte at the beginning of a 100MB file, Attic will upload a 1-byte chunk and then reference the other chunks after that, while Obnam will have to re-upload the entire file, since its chunks start at the beginning of the file in fixed sizes. (The only time Obnam has chunks smaller than its configured chunk size is with very small files or the last chunk in a file.) Another nice feature of Attic is its use of “packs”, where it groups chunks together into larger pack files. This can have significant performance advantages when backing up small files, especially over high-latency protocols and links.
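
The difference is easy to see with a toy content-defined chunker. This is nothing like Attic’s real chunker, just a sketch of the rolling-boundary principle, and all of the parameters below are arbitrary:

  # Toy content-defined chunking: cut a chunk wherever a hash of the last
  # WINDOW bytes hits a magic value.  Because boundaries depend on content,
  # inserting one byte near the start of a file only disturbs the first
  # chunk; every later chunk is found again unchanged.
  import hashlib, random

  WINDOW = 32     # bytes of context used to decide on a boundary
  MASK = 0x3FF    # boundary when the low 10 bits are zero => ~1 KiB chunks

  def cdc_chunks(data):
      out, start = [], 0
      for i in range(len(data)):
          if i - start >= WINDOW:
              h = int.from_bytes(hashlib.sha1(data[i - WINDOW:i]).digest()[:4], "big")
              if (h & MASK) == 0:
                  out.append(data[start:i])
                  start = i
      out.append(data[start:])
      return out

  random.seed(0)
  original = bytes(random.getrandbits(8) for _ in range(200_000))
  shifted = b"\x00" + original                 # "add 1 byte at the beginning"

  old = {hashlib.sha1(c).digest() for c in cdc_chunks(original)}
  new = [hashlib.sha1(c).digest() for c in cdc_chunks(shifted)]
  print(sum(d not in old for d in new), "of", len(new), "rolling chunks are new")

  # Fixed-size chunking for contrast: every chunk after the insertion shifts.
  old_f = {original[i:i + 4096] for i in range(0, len(original), 4096)}
  new_f = [shifted[i:i + 4096] for i in range(0, len(shifted), 4096)]
  print(sum(c not in old_f for c in new_f), "of", len(new_f), "fixed-size chunks are new")

With the rolling boundaries, only the first chunk of the shifted file is new; with fixed-size chunks, nearly all of them are.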

On the downside, Attic has a hardcoded, fairly small chunk size that gives it a heavy metadata load, and unlike Obnam there is nothing you can do to tune it. The biggest reason I avoided it, though, was that it uses a single monolithic index file that would have to be re-uploaded from scratch after each backup. I calculated that this would be many GB in size, if not tens of GB, for my intended use, and that is just not practical over the Internet. Attic assumes that if you are going remote, you run Attic on the remote end so that the rewrite of this file doesn’t have to send all of that data across the network. Although it does work atop davfs2, this support seemed like an afterthought and is clearly not very practical.

Attic did perform much better than Obnam in some ways, largely thanks to its pack support, but the monolithic index file was going to make it simply impractical to use.

There is a new fork of Attic called Borg that may, in the future, address some of these issues.

Brief honorable mentions: bup, zbackup, syncany

There are a few other backup tools that people are talking about which do dedup. bup is frequently mentioned, but one big problem with it is that it has no way to delete old data! In other words, it is more of an archive than a backup tool. zbackup is a really neat idea: it dedups anything you feed it, such as a tar stream or “zfs send” stream, and can encrypt, too. But it doesn’t (yet) support removing old data either.
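
zbackup’s interface is as simple as it sounds: it deduplicates whatever stream you pipe into it. Roughly like this (paths and the date are placeholders; an encrypted repository additionally wants --password-file passed to every command):

  # One-time repository setup (drop --non-encrypted for an encrypted repo).
  zbackup init --non-encrypted /srv/zbackup-repo

  # Back up: any stream works -- a tar of /home here, but "zfs send" too.
  tar -C /home -cf - . | zbackup backup /srv/zbackup-repo/backups/home-2015-06-22

  # Restore: the original stream comes back out on stdout.
  zbackup restore /srv/zbackup-repo/backups/home-2015-06-22 | tar -C /restored -xf -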

syncany is fundamentally a syncing tool, but can also be used from the command line to do periodic syncs to a remote. It supports encryption, sftp, WebDAV, etc. natively, and runs easily on quite a number of platforms. However, it doesn’t store a number of POSIX attributes, such as hard links, uid/gid owner, ACLs, xattrs, etc. This makes it impractical even for backing up my home directory; I make fairly frequent use of ln, both with and without -s. If there were some tool to create/restore archives of metadata, that might work out better.
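
As a rough illustration of the kind of metadata workaround I mean: ownership, permission bits, and ACLs (though not hard links or xattrs) can be dumped before a sync and restored afterward with the standard acl utilities, along these lines:

  # Record owners, permission bits and ACLs for everything under /home.
  getfacl -R -p /home > /home/.metadata.acl

  # Later, after file contents have been restored by the metadata-blind tool
  # (run as root so ownership is restored too):
  setfacl --restore=/home/.metadata.acl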

25 thoughts on “Roundup of remote encrypted deduplicated backups in Linux”

  1. Josh Triplett says:

    Can you provide your obnam configuration tweaks, and the corresponding performance you observed? I still seem to get unusably slow performance over sftp.

    1. John Goerzen says:

      Hi Josh,

      My .obnam.conf contains, among other things, these relevant bits:

      [config]

      compress-with=deflate
      checkpoint=536870912
      encrypt-with=[hidden]
      chunk-size=10485760
      lru-size=8192
      upload-queue-size=8192
      leave-checkpoints = False

      The performance is slow, but using something like strace is quite interesting to see *where* it’s slow. I’m using it over davfs2 — and tweaking davfs2.conf was MUCH harder than obnam.conf — and every file operation (open, rename, etc.) has a lot of latency in that situation. Even cleaning old generations can take a significant amount of time. However, uploading large files can saturate the pipe and the startup time, even for a significant fraction of a million files under storage, is reasonable (a few minutes).

      This isn’t a “get me what I want instantly on a daily basis” sort of thing. It’s a “my house burned down and I need all my data back” sort of thing.

  2. This is a strange article. You mention that you don’t want to use any “special software”, but you don’t qualify what that means. You give examples of software that isn’t considered “special”: rsync, sftp, S3, WebDAV.

    S3 is an interesting choice, because it requires an account with Amazon, a company that has questionable ethics and business practices. WebDAV is also an interesting choice, because it requires an HTTP server (Apache, NGINX, etc.). So the software stack in that case is unnecessarily complex.

    Then, despite already being familiar with ZFS, you dismiss it because you qualify it as “special software”, and you do the same with BackupPC, which just uses rsync(1) under the hood, even though you qualify rsync as non-special software.

    Then, you conclude that Obnam and Attic/Borg Backup fit your description of non-special software. True, they are in the Debian repositories, and a simple “apt-get install” away (not Borg Backup, however). You finish with “honorable mentions”, again, two of which are in the Debian repositories (bup and zbackup) and one that is not (syncany).

    I think this post would be better served if, instead of qualifying what would work based on “special” versus “non-special” software, it looked at the features provided by each, with a critical analysis of the pros and cons of running each. Then, based on that analysis, make recommendations on what works best.

    Personally, I prefer BackupPC on ZFS as my “cloud backup” storage solution, including for offsite disaster recovery. ZFS can provide transparent LZ4 compression, checksums, snapshots, and send/receive to an offsite host, while BackupPC can handle the deduplication (as well as checksums). I’m familiar with the software stack, it performs well, it is stable, and it Just Works with minimal config and setup.

    Just a thought.

    1. John Goerzen says:

      Aaron,

      I think you misunderstand. I mean “special software” in the sense of being able to use an arbitrary data-hosting service without having control over what’s installed there.

      This rules out BackupPC because BackupPC must run on the server end. Same with doing a “zfs receive”. S3 is a generic API that is supported by Amazon and tons of its competitors, and has numerous open-source implementations on the client and server.

  3. B. says:

    Not sure encryption is a relevant criterion… there is transparent encrypting filesystem software (like encfs) that can do that for any program, backup software included. And factoring functionality out into one place is good, isn’t it?

    1. John Goerzen says:

      That would be great, but encfs is known to not be all that secure these days. I’m not aware of any other solution that will layer atop a FUSE filesystem.

      Obnam, by the way, uses GPG for encryption so it’s not reinventing the wheel.

  4. Here’s another recent comparison of Attic vs Bup vs Obnam with a different conclusion:

    http://librelist.com/browser/attic/2015/3/31/comparison-of-attic-vs-bup-vs-obnam/

    1. John Goerzen says:

      Yep, I read that article. It’s not exactly a different conclusion; it’s a different use case. The key differentiator is that in that use case, it is not a problem to rewrite a multi-GB file on every backup. (Or the file may not grow to be that big.) Attic does have generally better performance than Obnam, particularly with many small files thanks to its pack files. But when backing up 100K of data would require uploading 10GB over the DSL, it doesn’t ;-)

    1. John Goerzen says:

      I didn’t much; it looks like it has been unmaintained for 6 years so I thought I’d give it a pass.

    2. Jack says:

      Last release of ‘brackup’ was 2009 (!) so I can’t imagine anybody in their right mind using that. I hope you don’t, either.

  5. fd0 says:

    Hi,

    I’ve started building another backup program called ‘restic’ over the last year, it can be found at https://restic.github.io and https://github.com/restic/restic.

    It’s not yet finished, but I’m planning a release around 1 July and will present it at FrOSCon (a conference about Free Software in Germany).

    If you have the time, I’d love to hear your opinion!

    – Alex

  6. Nikolaus Rath says:

    I think you may want to have a look at S3QL (https://bitbucket.org/nikratio/s3ql/). It satisfies most of the criteria you listed (I am not 100% sure what you mean by “cross-host deduplication” – S3QL de-duplicates at a (large) block level).

  7. I am working on improving the “lots of small chunks” issue of Attic and will fix it ASAP in Borg Backup; see:

    https://github.com/borgbackup/borg/issues/16#issuecomment-113764369

    1. John Goerzen says:

      That is great news. Are you also working on the monolithic index file issue? That to me is a bigger problem.

  8. Well, maybe the monolithic index file issue is less of an issue if we have far fewer chunks, so the index will be significantly smaller.

    If you still see an issue, and maybe even have ideas how to solve it, I’d appreciate it if you open an issue at: https://github.com/borgbackup/borg/issues

    BTW, I made the chunker configurable now; it’s in the master branch.

  9. Jos says:

    On the ZBackup site I read:
    “Possibility to delete old backup data”
    yet your article says that zbackup does not support removing old data.
    I do not see a manual, though, that explains how to remove data from the backup. The website talks about the desire for improved garbage collection, which implies that there is garbage collection.

    1. John Goerzen says:

      Install it, and you will see that there is not yet any option to delete old data.

      1. ZOG says:

        Zbackup does allow you to delete old data, but the method is not obvious and only mentioned in some forum posts.

        -> You delete the placeholder file for the backup you want to remove from repo/backups/

        -> then you run a “zbackup gc” garbage collection operation, and data not referenced by existing backups is removed.

  10. Abby says:

    Hello,

    I need help with the attic retention policy; I don’t get it, so maybe you can help. If I set prune to 14 days, then on the first day the backup is initialized, so the first backup is a full backup. Then every other day is incremental. On day 15 the first backup is deleted, which was the full backup, so are all of those files created again as an incremental on day 15? And what if I then want to access all of my files from day 3, but the full backup from day 1 was deleted? Where are the files from day 1 which are not in the incremental backups of day 2 and day 3?

    1. Daniel H says:

      Attic doesn’t distinguish between full and incremental backups. Every backup uploads file contents in chunks, and files as descriptions of which chunks go where. Your day 3 files contain mostly the same chunks as your day 1 files, but presumably have some new ones. If you delete the day 1 backup you only delete the chunks which are only referenced on day 1, i.e. things you changed on day 2.

  11. Abby, deduplicating backup programs’ approach is different from the older full/incremental (or differential) approach.

    Each backup you create with attic (or borgbackup) is a full backup that references all data chunks it needs in a storage shared between all backup archives.

    The space saving comes from identical chunks being stored only once in that storage (though they can be referenced as often as needed).

    So, don’t worry about deleting backups you do not need any more. The chunks used by other backups will still be kept in the storage and are only removed when they are not referenced any more.

  12. Faheem Mitha says:

    Hi John,

    Interesting article. Have you given Borgbackup a try? If so, what did you think of it?

    1. Daniel H says:

      It’s mentioned in the article. By now I think the issues have been addressed, but I’m not certain.
