Since I last wrote about Linux backup tools, back in a 2008 article about BackupPC and similar tools and a 2011 article about deduplicating filesystems, I’ve revisited my personal backup strategy a bit.
I still use ZFS, with my tool “simplesnap” that I wrote about in 2014 to perform local backups to USB drives, which get rotated offsite periodically. This has the advantage of being very fast and very secure, but I also wanted offsite backups over the Internet. I began compiling criteria, which ran like this:
- Remote end must not need any special software installed. Storage across rsync, sftp, S3, WebDAV, etc. should all be good candidates. The remote end should not need to support hard links or symlinks, etc.
- Cross-host deduplication, at least at the file level, is required, so if I move a 4GB video file from one machine to another, my puny DSL wouldn’t have to re-upload it.
- All data that is stored remotely must be 100% encrypted 100% of the time. I must not need to have any trust at all in the remote end.
- Each backup after the first must send only an incremental’s worth of data across the line. No periodic re-uploading of the entire data set can be done.
- The tools must preserve POSIX attributes such as uid/gid, permission bits, symbolic links, hard links, etc. Support for xattrs is also desirable but not required.
- The repository format must be well-documented and stable.
So, how did things stack up?
Didn’t meet criteria
A lot of popular tools didn’t meet the criteria. Here are some that I considered:
- BackupPC requires software on the remote end and does not do encryption.
- None of the rsync hardlink-tree-based tools are suitable here; they need hard link support on the remote end and provide no encryption.
- rdiff-backup requires software on the remote end and does not do encryption or dedup.
- duplicity requires a periodic re-upload of a full backup, or incremental chains become quite long and storage-inefficient. It also does not support dedup, although it does have an impressive list of “dumb” storage backends.
- ZFS, if used to do backups the efficient way, would require software to be installed on the remote end. If simple “zfs send” images are used, the same limitations as with duplicity apply.
- bup and zbackup are both interesting deduplicators, but do not yet have support for removing old data, so are impractical for this purpose.
- burp requires software on the server side.
Obnam and Attic/Borg Backup
Obnam and Attic (and its fork Borg Backup) are both programs that have a similar concept at their heart, which is roughly this: the backup repository stores small chunks of data, indexed by a checksum. Directory trees are composed of files that are assembled out of lists of chunks, so if any given file matches another file already in the repository somewhere, the added cost is just a small amount of metadata.
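As a toy illustration of the idea (this is not the actual on-disk format of either tool, and the file names are just examples), you can think of it as splitting each file into chunks and storing every chunk once, under its checksum, while the file itself becomes an ordered list of checksums:
# split a file into 1 MiB chunks and store each under its SHA-256 hash;
# a second copy of the same data adds only another list of checksums
mkdir -p chunks
split -b 1M -d -a 6 bigfile.mkv /tmp/chunk.
for c in /tmp/chunk.*; do
    sum=$(sha256sum "$c" | awk '{print $1}')
    [ -e "chunks/$sum" ] || cp "$c" "chunks/$sum"
    echo "$sum"
done > bigfile.mkv.chunklist
rm -f /tmp/chunk.*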
Obnam was eventually my tool of choice. It has built-in support for sftp, and its use of local filesystem semantics is conservative enough that it works fine atop davfs2 (and, I’d imagine, other S3-backed FUSE filesystems). Obnam’s repository format is carefully documented and it is very conservatively designed through and through, clearly optimized for integrity above all else, including speed. Just what a backup program should be. It has a lot of configurable options, including chunk size, caching information (dedup tables can be RAM-hungry), etc. These default to fairly conservative values, and the performance of Obnam can be significantly improved with a few simple config tweaks.
Attic was also a leading contender. It has a few advantages over Obnam, actually. One is that it uses an rsync-like rolling checksum method. This means that if you add 1 byte at the beginning of a 100MB file, Attic will upload a 1-byte chunk and then reference the other chunks after that, while Obnam will have to re-upload the entire file, since its chunks start at the beginning of the file in fixed sizes. (The only time Obnam has chunks smaller than its configured chunk size is with very small files or the last chunk in a file.) Another nice feature of Attic is its use of “packs”, where it groups chunks together into larger pack files. This can have significant performance advantages when backing up small files, especially over high-latency protocols and links.
On the downside, Attic has a hardcoded, fairly small chunk size that gives it a heavy metadata load. It is nowhere near as configurable as Obnam, and unlike Obnam, there is nothing you can do about this. The biggest reason I avoided it, though, was that it uses a single monolithic index file that would have to be uploaded from scratch after each backup. I calculated that this would be many GB in size, if not tens of GB, for my intended use, and that is simply not practical over the Internet. Attic assumes that if you are going remote, you run Attic on the remote end so that the rewrite of this file doesn’t have to send all the data across the network. Although it does work atop davfs2, this support seemed like an afterthought and is clearly not very practical.
Attic did perform much better than Obnam in some ways, largely thanks to its pack support, but the monolithic index file was going to make it simply impractical to use.
There is a new fork of Attic called Borg that may, in the future, address some of these issues.
Brief honorable mentions: bup, zbackup, syncany
There are a few other backup tools that people are talking about which do dedup. bup is frequently mentioned, but one big problem with it is that it has no way to delete old data! In other words, it is more of an archive than a backup tool. zbackup is a really neat idea — it dedups anything you feed at it, such as a tar stream or “zfs send” stream, and can encrypt, too. But it doesn’t (yet) support removing old data either.
syncany is fundamentally a syncing tool, but can also be used from the command line to do periodic syncs to a remote. It natively supports encryption, sftp, WebDAV, etc., and runs easily on quite a number of platforms. However, it doesn’t store a number of POSIX attributes, such as hard links, uid/gid ownership, ACLs, xattrs, etc. This makes it impractical even for backing up my home directory; I make fairly frequent use of ln, both with and without -s. If there were some tool to create/restore archives of metadata, that might work out better.
Can you provide your obnam configuration tweaks, and the corresponding performance you observed? I still seem to get unusably slow performance over sftp.
Hi Josh,
My .obnam.conf contains, among other things, these relevant bits:
[config]
# deflate-compress data before upload
compress-with=deflate
# write a checkpoint every 512 MiB so an interrupted backup can resume
checkpoint=536870912
encrypt-with=[hidden]
# 10 MiB chunks (much larger than the default)
chunk-size=10485760
# larger in-memory node cache and upload queue; both trade RAM for speed
lru-size=8192
upload-queue-size=8192
# discard checkpoint generations after a successful backup
leave-checkpoints = False
The performance is slow, but using something like strace is quite interesting to see *where* it’s slow. I’m using it over davfs2 — and tweaking davfs2.conf was MUCH harder than obnam.conf — and every file operation (open, rename, etc.) has a lot of latency in that situation. Even cleaning old generations can take a significant amount of time. However, uploading large files can saturate the pipe and the startup time, even for a significant fraction of a million files under storage, is reasonable (a few minutes).
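For what it’s worth, the davfs2 side of that tuning lives in /etc/davfs2/davfs2.conf; the values below are only a sketch of the kind of knobs involved (tune them for your own setup and davfs2 version), not my exact settings:
# /etc/davfs2/davfs2.conf (illustrative values only)
use_locks   0        # skip WebDAV lock round-trips on every open
cache_size  2048     # local disk cache, in MiB
dir_refresh 60       # seconds before cached directory listings are re-fetched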
This isn’t a “get me what I want instantly on a daily basis” sort of thing. It’s a “my house burned down and I need all my data back” sort of thing.
This is a strange article. You mention that you don’t want to use any “special software”, but you don’t qualify what that means. You give examples of software that isn’t considered “special”: rsync, sftp, S3, WebDAV.
S3 is an interesting choice, because it requires an account with Amazon, a company that has questionable ethics and business practices. WebDAV is also an interesting choice, because it requires an HTTP server (Apache, NGINX, etc). So the software stack in that case is unnecessarily complex.
Then, despite already being familiar with ZFS, you dismiss it, because you qualify that as “special software”, and you do the same with BackupPC, which just uses rsync(1) under the hood, which you qualify as non-special software.
Then, you conclude that Obnam and Attic/Borg Backup fit your description of non-special software. True, they are in the Debian repositories, and a simple “apt-get install” away (not Borg Backup, however). You finish with “honorable mentions”, again, two of which are in the Debian repositories (bup and zbackup) and one of which is not (syncany).
I think this post would be better served if, instead of qualifying what would work based on “special” versus “non-special” software, it looked at the features provided by each, with a critical analysis of the pros and cons of running each, and then, based on that analysis, made recommendations on what works best.
Personally, I prefer BackupPC on ZFS as my “cloud backup” storage solution, including for offsite disaster recovery. ZFS can provide transparent LZ4 compression, checksums, snapshots, and sending/receiving to offsite, while BackupPC can handle the deduplication (as well as checksums). I’m familiar with the software stack, it performs well, it is stable, and Just Works with minimal config and setup.
Just a thought.
Aaron,
I think you misunderstand. I mean “special software” in the sense of being able to use an arbitrary data-hosting service without having control over what’s installed there.
This rules out BackupPC because BackupPC must run on the server end. Same with doing a “zfs receive”. S3 is a generic API that is supported by Amazon and tons of its competitors, and has numerous open-source implementations on the client and server.
Not sure if encryption is a relevant criterion… there is transparent encrypting filesystem software (like encfs) that can do that for any software, backup software included. And factoring out functionality is good, isn’t it?
That would be great, but encfs is known to not be all that secure these days. I’m not aware of any other solution that will layer atop a FUSE filesystem.
Obnam, by the way, uses GPG for encryption so it’s not reinventing the wheel.
Here’s another recent comparison of Attic vs Bup vs Obnam with a different conclusion:
http://librelist.com/browser/attic/2015/3/31/comparison-of-attic-vs-bup-vs-obnam/
Yep, I read that article. It’s not exactly a different conclusion; it’s a different use case. The key differentiator is that in that use case, it is not a problem to rewrite a multi-GB file on every backup. (Or the file may not grow to be that big.) Attic does have generally better performance than Obnam, particularly with many small files thanks to its pack files. But when backing up 100K of data would require uploading 10GB over the DSL, it doesn’t ;-)
Have you looked at brackup?
http://search.cpan.org/~bradfitz/Brackup/
I didn’t much; it looks like it has been unmaintained for 6 years so I thought I’d give it a pass.
Last release of ‘brackup’ was 2009 (!) so I can’t imagine anybody in their right mind using that. I hope you don’t, either.
Hi,
I’ve started building another backup program called ‘restic’ over the last year, it can be found at https://restic.github.io and https://github.com/restic/restic.
It’s not yet finished, but I’m planning a release around 1 July and present it on Froscon (a conference about Free Software in Germany).
If you have the time, I’d love to hear your opinion!
– Alex
I think you may want to have a look at S3QL (https://bitbucket.org/nikratio/s3ql/). It satisfies most of the criteria you listed (I am not 100% sure what you mean by “cross-host deduplication” – S3QL de-duplicates at a (large) block level).
I am working on improving the “lots of small chunks” issue of Attic and will fix it ASAP in Borg Backup, see there:
https://github.com/borgbackup/borg/issues/16#issuecomment-113764369
That is great news. Are you also working on the monolithic index file issue? That, to me, is a bigger problem.
Well, maybe the monolithic index file issue is less of an issue if we have far fewer chunks, since the index will then be significantly smaller.
If you still see an issue and maybe even have ideas how to solve it, I’d appreciate it if you open an issue at: https://github.com/borgbackup/borg/issues
BTW, I made the chunker configurable now; it’s in the master branch.
On the ZBackup site I read:
“Possibility to delete old backup data”
yet your article says that zbackup does not support removing old data.
I do not see a manual, though, that explains how to remove data from the backup. The website talks about the desire for improved garbage collection, which implies that there is garbage collection.
Install it, and you will see that there is not yet any option to delete old data.
Zbackup does allow you to delete old data, but the method is not obvious and only mentioned in some forum posts.
-> You delete the placeholder file for the backup you want to remove from repo/backups/
-> Then you run “zbackup gc”, a garbage-collection operation, and data not referenced by existing backups is removed.
Hello,
I need help with the attic retention policy; I don’t get it, so maybe you can help. If I set prune to 14 days, then on the first day the backup is initialized, so the first backup is a full backup. Then every other day is incremental. On day 15 the first backup, which was the full backup, is deleted, and all those files are created again as incremental on day 15, right? What if I then want to access all of my files from day 3, but the full backup of day 1 was deleted? Where are the files from day 1 which are not in the incremental backups of day 2 and day 3?
Attic doesn’t distinguish between full and incremental backups. Every backup uploads file contents as chunks, plus, for each file, a description of which chunks go where. Your day 3 files contain mostly the same chunks as your day 1 files, but presumably have some new ones. If you delete the day 1 backup, you only delete the chunks that are referenced solely by day 1, i.e. the old contents of the things you changed on day 2.
Abby, deduplicating backup programs’ approach is different from the older full/incremental (or differential) approach.
Each backup you create with attic (or borgbackup) is a full backup that references all data chunks it needs in a storage shared between all backup archives.
The space saving comes from identical chunks being stored only once in that storage (though they can be referenced as often as needed).
So, don’t worry about deleting backups you do not need any more. The chunks used by other backups will be still kept in the storage and are only removed when they are not referenced any more.
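For example, a retention policy of “keep the last 14 daily archives” looks something like this (repository path and archive names are placeholders; borg uses the same create/prune syntax):
# make today's archive, then drop everything but the 14 newest daily ones;
# chunks still referenced by the surviving archives stay in the repository
attic create /path/to/repo.attic::$(date +%Y-%m-%d) /home
attic prune /path/to/repo.attic --keep-daily=14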
Hi John,
Interesting article. Have you given Borgbackup a try? If so, what did you think of it?
It’s mentioned in the article. By now I think the issues have been addressed, but I’m not certain.
A good backup strategy needs to consider various threats to the integrity of data. For instance:
- Building catches fire
- Accidental deletion
- Equipment failure
- Security incident / malware / compromise
It’s that last one that is of particular interest today. A lot of backup strategies are such that if a user (or administrator) has their local account or network compromised, their backups could very well be destroyed as well. For instance, do you ssh from the account being backed up to the system holding the backups? Or rsync using a keypair stored on it? Or access S3 buckets, etc? It is trivially easy in many of these schemes to totally ruin cloud-based backups, and some other schemes as well. rsync can be run with --delete (and often is, to prune remotes), S3 buckets can be deleted, etc. And even if you try to lock down an over-network backup to be append-only, there are still vectors for attack (ssh credentials, OpenSSL bugs, etc). In this post, I try to explore how we can protect against them and still retain some modern conveniences.
A backup scheme also needs to strike a balance between:
- Cost
- Security
- Accessibility
- Efficiency (of time, bandwidth, storage, etc.)
My story so far…
About 20 years ago, I had an Exabyte tape drive, with the amazing capacity of 7GB per tape! Eventually as disk prices fell, I had external disks plugged in to a server, and would periodically rotate them offsite. I’ve also had various combinations of partial or complete offsite copies over the Internet as well. I have around 6TB of data to back up (after compression), a figure that is growing somewhat rapidly as I digitize some old family recordings and videos.
Since I last wrote about backups 5 years ago, my scheme has been largely unchanged; at present I use ZFS for local and to-disk backups and borg for the copies over the Internet.
Let’s take a look at some options that could make this better.
Tape
The original airgapped backup. You back up to a tape, then you take the (fairly cheap) tape out of the drive and put in another one. In cost per GB, tape is probably the cheapest medium out there. But of course it has its drawbacks.
Let’s start with cost. To get a drive that can handle capacities of what I’d be needing, at least LTO-6 (2.5TB per tape) would be needed, if not LTO-7 (6TB). New, these drives cost several thousand dollars, plus they need LVD SCSI or Fibre Channel cards. You’re not going to be hanging one off a Raspberry Pi; these things need a real server with enterprise-style connectivity. If you’re particularly lucky, you might find an LTO-6 drive for as low as $500 on eBay. Then there are tapes. A 10-pack of LTO-6 tapes runs more than $200, and provides a total capacity of 25TB – sufficient for these needs (note that, of course, you need to have at least double the actual space of the data, to account for multiple full backups in a set). A 5-pack of LTO-7 tapes is a little more expensive, while providing more storage.
So all-in, this is going to be — in the best possible scenario — nearly $1000, and possibly a lot more. For a large company with many TB of storage, the initial costs can be defrayed due to the cheaper media, but for a home user, not so much.
Consider that 8TB hard drives can be found for $150 – $200. A pair of them (for redundancy) would run $300-400, and then you have all the other benefits of disk (quicker access, etc.) Plus they can be driven by something as cheap as a Raspberry Pi.
Fancier tape setups involve auto-changers, but then you’re not really airgapped, are you? (If you leave all your tapes in the changer, they can generally be selected and overwritten, barring things like hardware WORM).
As useful as tape is, for this project, it would simply be way more expensive than disk-based options.
Fundamentals of disk-based airgapping
The fundamental thing we need to address with disk-based airgapping is that the machines being backed up have no real-time contact with the backup storage system. This rules out most solutions out there, that want to sync by comparing local state with remote state. If one is willing to throw storage efficiency out the window — maybe practical for very small data sets — one could just send a full backup daily. But in reality, what is more likely needed is a way to store a local proxy for the remote state. Then a “runner” device (a USB stick, disk, etc) could be plugged into the network, filled with queued data, then plugged into the backup system to have the data dequeued and processed.
Some may be tempted to short-circuit this and just plug external disks into a backup system. I’ve done that for a long time. This is, however, a risk, because it makes those disks vulnerable to whatever may be attacking the local system (anything from lightning to ransomware).
ZFS
ZFS is, it should be no surprise, particularly well suited for this. zfs send/receive can send an incremental stream that represents a delta between two checkpoints (snapshots or bookmarks) on a filesystem. It can do this very efficiently, much more so than walking an entire filesystem tree.
Additionally, with the recent addition of ZFS crypto to ZFS on Linux, the replication stream can optionally reflect the encrypted data. Yes, as long as you don’t need to mount them, you can mostly work with ZFS datasets on an encrypted basis, and can directly tell zfs send to just send the encrypted data instead of the decrypted data.
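A minimal sketch of what that looks like, assuming ZFS on Linux 0.8 or newer, a dataset tank/home whose previous snapshot has already been transferred, and a runner drive mounted at /mnt/runner (all names are placeholders):
# snapshot, then do a raw (-w) incremental (-i) send: only the still-encrypted
# blocks changed since the previous snapshot are written to the runner
zfs snapshot tank/home@2020-01-02
zfs send -w -i tank/home@2020-01-01 tank/home@2020-01-02 \
    > /mnt/runner/tank-home-2020-01-02.zfsincr
# later, on the airgapped machine:
zfs receive backuppool/home < /mnt/runner/tank-home-2020-01-02.zfsincr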
The downside of ZFS is the resource requirements at the destination, which in terms of RAM are higher than most of the older Raspberry Pi-style devices can offer. Still, one could perhaps just save off zfs send streams and restore them later if need be, but that implies a periodic resend of a full stream, an inefficient operation. Deduplicating software such as borg could be used on those streams (though with less effectiveness if they’re encrypted).
Tar
Perhaps surprisingly, tar in listed incremental mode can solve this problem for non-ZFS users. It will keep a local cache of the state of the filesystem as of the time of the last run of tar, and can generate new tarballs that reflect the changes since the previous run (even deletions). This can achieve a similar result to the ZFS send/receive, though in a much less elegant way.
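Roughly, with GNU tar (paths are just examples):
# the first run with a fresh .snar file produces a full archive; every later
# run with the same .snar file archives only what changed since the last run
tar --create --gzip \
    --listed-incremental=/var/lib/backup/home.snar \
    --file=/mnt/runner/home-$(date +%F).tar.gz \
    /home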
Bacula / Bareos
Bacula (and its fork Bareos) both have support for a FIFO destination. Theoretically this could be used to queue up data for transfer to the airgapped machine. This support is very poorly documented in both and is rumored to have bitrotted, however.
rdiff and xdelta
rdiff and xdelta can be used as sort of a non-real-time rsync, at least on a per-file basis. Theoretically, one could generate a full backup (with tar, ZFS send, or whatever), take an rdiff signature, and send over the file while keeping the signature. On the next run, another full backup is piped into rdiff, and on the basis of the signature file of the old and the new data, it produces a binary patch that can be queued for the backup target to update its stored copy of the file.
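With librsync’s rdiff tool, one cycle of that looks roughly like this (file names are placeholders):
# initial run: make a full backup, keep only its signature locally,
# and ship the full archive itself to the backup target
tar -cf full.tar /home
rdiff signature full.tar full.sig
# later run: build a fresh full backup and compute a delta against the old
# signature; only delta.bin needs to travel to the backup target
tar -cf new-full.tar /home
rdiff delta full.sig new-full.tar delta.bin
# on the backup target, which still holds the previous full.tar:
rdiff patch full.tar delta.bin updated-full.tar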
This leaves history preservation as an exercise to be undertaken on the backup target. It may not necessarily be easy and may not be efficient.
rsync batches
rsync can be used to compute a delta between two directory trees and express this as a single-file batch that can be processed by a remote rsync. Unfortunately this implies the sender must always keep an old tree around (barring a solution such as ZFS snapshots) in order to compute the delta, and of course it still implies the need for history processing on the remote.
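A sketch of that, assuming a copy of the previously-sent tree is kept locally at /srv/last-sent and the runner is mounted at /mnt/runner (paths are placeholders):
# update the local last-sent copy and record every change in a batch file
rsync -a --delete --write-batch=/mnt/runner/home.batch /home/ /srv/last-sent/
# on the backup target, replay the batch against its copy of the tree
rsync -a --read-batch=/mnt/runner/home.batch /backups/home/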
Getting the Data There
OK, so you’ve got an airgapped system, some sort of “runner” device for your sneakernet (USB stick, hard drive, etc). Now what?
Obviously you could just copy data onto the runner and move it back off at the backup target. But a tool like NNCP (sort of a modernized UUCP) offers a lot of help in automating the process, returning error reports, etc. NNCP can be used online over TCP, over reliable serial links, over ssh, with offline onion routing via intermediaries or directly, etc.
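For the sneakernet case, the flow looks roughly like this, assuming a remote node named “airgap” is already configured on both ends (node name and paths are placeholders):
# at home: queue a file for the airgapped node, then fill the USB runner
nncp-file /srv/staging/home-2020-01-02.tar.gz airgap:
nncp-xfer -node airgap /media/usb
# at the other end: ingest the packets from the runner and process them
nncp-xfer /media/usb
nncp-toss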
Imagine having an airgapped machine at a different location you go to frequently (workplace, a friend’s house, etc). Before leaving, you put a USB stick in your pocket. When you get there, you pop it in. It’s despooled and processed while you wait, and return emails or whatever are queued up to be sent when you get back home. Not bad, eh?
Future installment…
I’m going to try some of these approaches and report back on my experiences in the next few weeks.