I think most people reading my blog would agree that backups are extremely important. So much important data is on computers these days: family photos, emails, financial records. So I take backups seriously.
A little while back, I purchased two identical 400GB external hard disks. One is kept at home, and the other in a safe deposit box at a bank in a different town. Every week or two, I swap drives, so that neither one ever becomes too dated. This process is relatively inexpensive (safe deposit boxes big enough to hold the drive go for $25/year), and works well.
I have been using rdiff-backup to make these backups for several years now (since at least 2004, when I submitted a patch to make it record all metadata on Mac OS X). rdiff-backup is quite nice. It is designed for backing up to a hard disk: it stores a current filesystem mirror on the disk, along with some metadata files that include permissions information, and it keeps history as compressed rdiff (rsync) deltas going backwards in time. So restoring the “most recent” files is a simple copy plus application of metadata, while restoring older files means reversing history; rdiff-backup does both automatically.
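For anyone who hasn't used it, the whole thing is driven by commands roughly like these (the paths here are just illustrative, not my actual setup):

# Back up /home to the external disk. The destination ends up holding a
# plain mirror of /home plus an rdiff-backup-data/ directory containing
# the metadata and the compressed reverse increments.
rdiff-backup /home /mnt/backup/home

# Restoring the latest version of something is essentially a copy; older
# versions are rebuilt by walking the reverse deltas, e.g. as of two weeks ago:
rdiff-backup -r 2W /mnt/backup/home/somefile /tmp/somefile-2-weeks-ago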
This is a nice system and has served me well for quite some time. But it has its drawbacks. One is that you always have to have the current image, uncompressed, which uses up lots of space. Another is that you can’t encrypt these backups with something like gpg for storage on a potentially untrusted hosting service (say, rsync.net). Also, when your backup disk fills up, it takes forever to figure out what to delete, since rdiff-backup --list-increment-sizes must stat tens of thousands of files. So I went looking for alternatives.
The author of rdiff-backup actually wrote one, called duplicity. Duplicity works by, essentially, storing a tarball full backup along with its rdiff signature, then storing tarballs of rdiff deltas going forward in time. The reason rdiff-backup must keep the full mirror is that it generates its rdiff deltas “backwards”, which requires having the complete current copy of each file on hand; duplicity, by running its deltas forward, works around this.
However, the problem with duplicity is that if the full backup gets lost or corrupted, nothing newer than it can be restored. You must make new full backups periodically so that you can remove the old history. The other big problem with duplicity is that it doesn’t grok hard links at all. That makes it unsuitable for backing up /sbin, /bin, /usr, and my /home, in which I frequently use hard links for preparing CD images, linking DVCS branches, etc.
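For reference, driving duplicity looks roughly like this (the host, key ID, and paths are made-up placeholders; the gpg encryption is the part rdiff-backup can’t give me):

# Periodic full backup, gpg-encrypted, pushed to a remote host over scp.
duplicity full --encrypt-key DEADBEEF /home scp://user@backuphost/backups/home

# Later runs store forward deltas against that full backup.
duplicity incremental --encrypt-key DEADBEEF /home scp://user@backuphost/backups/home

# Old history can only be dropped in whole chains, which is why the
# periodic fulls are mandatory:
duplicity remove-older-than 2M --force scp://user@backuphost/backups/home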
So I went off searching out other projects and thinking about the problem myself.
One potential solution is to simply store tarballs and rdiff deltas going forward. That would mean performing what amounts to an entire full backup every day: every file gets read just to produce each day’s delta. That probably isn’t a problem for me now, but I worry about the load it would place on my hard disks and the additional power it would take to churn through all that data.
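Sketched out with plain tar and librsync’s rdiff tool, the idea would look something like this (filenames invented):

# Day 0: a full tarball, plus its rsync signature.
tar -cf /backup/full.tar /home
rdiff signature /backup/full.tar /backup/full.sig

# Day 1: build a fresh tarball (which means reading everything again),
# but keep only the forward delta against yesterday's signature, plus a
# new signature for tomorrow's run.
tar -cf /tmp/today.tar /home
rdiff delta /backup/full.sig /tmp/today.tar /backup/day1.delta
rdiff signature /tmp/today.tar /backup/day1.sig

# Restoring day 1 means patching forward from the full tarball:
rdiff patch /backup/full.tar /backup/day1.delta /tmp/restored-day1.tar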
So what other projects are out there? Two caught my attention. The first is Box Backup. It is similar in concept to rdiff-backup, though it uses its own archive format: it stores the most recent data in that format, compressed, along with its signatures, and then generates reverse deltas much as rdiff-backup does. It supports encryption out of the box, too. It sounded like a perfect solution. Then I realized it doesn’t store hard links, device entries, etc., and has a design flaw that causes it to miss some changes to config files in /etc on Gentoo. That’s a real bummer, because it sounded so nice otherwise. But I just can’t trust my system to a program that makes me avoid certain OS features because they won’t be backed up right.
The other interesting one is dar, the Disk ARchive tool, described by its author as the great-grandson of tar (a pretty legitimate claim at that). Traditionally, if you are going to back up a Unix box, you have to choose between two not-quite-perfect options. You could use something like tar, which backs up all your permissions, special files, hard links, etc., but doesn’t support random access; to extract just one file, tar has to read through whatever comes before it in the archive, possibly 5GB of data. Or you could use zip, which doesn’t handle all the special stuff, but does support random access. Over the years, many backup systems have improved upon this in various ways. Bacula, for instance, is incredibly fast with tapes because it creates new tape “files” every so often and stores the precise tape location of each file in its database.
But none seem quite as nice as dar for disk backups. In addition to supporting all the special stuff out there, dar sports built-in compression and encryption. Unlike tar, dar applies compression per file and encryption per 10K block, which is really slick; it lets you extract one file without having to decrypt and decompress the entire archive. dar also maintains a catalog that permits random access, has built-in support for splitting archives across removable media like CD-Rs, has a nice incremental backup feature, and ships with a host of tools for tweaking archives: removing files from them, changing compression schemes, and so on.
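Basic usage is along these lines (file names and options here are a sketch, not exact command lines):

# Full backup of /usr, gzip-compressed, split into 650MB slices suitable
# for burning to CD-R. dar names the slices usr-full.1.dar, usr-full.2.dar, ...
dar -c /backup/usr-full -R /usr -z -s 650M

# A later differential backup, using the full archive as its reference;
# only files changed since then get stored.
dar -c /backup/usr-diff1 -R /usr -z -A /backup/usr-full

# List the contents via the catalog, and pull out a single file without
# reading (or decrypting and decompressing) the rest of the archive:
dar -l /backup/usr-full
dar -x /backup/usr-full -R /tmp/restore -g bin/somefile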
But dar does not use binary deltas. I thought this would make it quite space-inefficient, so I decided to put it to the test against a real-world workload that would likely be close to a worst case for dar and a best case for rdiff-backup.
I track Debian sid and haven’t updated my home box in quite some time, so I have over 1GB of downloaded .debs representing pending updates. Many of these updates will touch tons of files in /usr, often making only small changes, or even none at all. Sounds like rdiff-backup heaven, right?
I ran rdiff-backup to a clean area before applying any updates, and used dar to create a full backup file of the same data. Then I ran apt-get upgrade, and made incrementals with both rdiff-backup and dar. Finally I ran apt-get dist-upgrade, and did the same thing. So I have three backups with each system.
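In rough terms, each round looked something like this (paths simplified, and the dar switches shown are a sketch; -y selected bzip2 compression on the dar version I was using):

# Baseline, before touching anything:
rdiff-backup /usr /backup/rdiff/usr
dar -c /backup/dar/usr-l00 -R /usr -y

apt-get upgrade

# Second round: rdiff-backup makes its increment automatically; dar gets
# a differential against the level-0 archive:
rdiff-backup /usr /backup/rdiff/usr
dar -c /backup/dar/usr-l01 -R /usr -y -A /backup/dar/usr-l00

apt-get dist-upgrade

# Third round, referencing the previous dar level:
rdiff-backup /usr /backup/rdiff/usr
dar -c /backup/dar/usr-l02 -R /usr -y -A /backup/dar/usr-l01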
Let’s look at how rdiff-backup did first.
According to rdiff-backup --list-increment-sizes, my /usr backup looks like this:
Time                           Size       Cumulative size
-----------------------------------------------------------------------------
Sun Apr 13 18:37:56 2008    5.15 GB       5.15 GB   (current mirror)
Sun Apr 13 08:51:30 2008     405 MB       5.54 GB
Sun Apr 13 03:08:07 2008     471 MB       6.00 GB
So what we see here is that we’re using 5.15GB for the mirror of the current state of /usr. The delta between the old state of /usr and the state after apt-get upgrade was 471MB, and the delta representing dist-upgrade was 405MB, for total disk consumption of 6GB.
But if I run du -s over rdiff-backup’s /usr storage area, it says that 7.0GB was used. du -s --apparent-size shows 6.1GB. The difference is that each of the tens of thousands of files wastes a little space at the end of its last block, and that adds up to nearly an entire gigabyte. rdiff-backup effectively consumed 7.0GB of space.
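(Those two numbers come from something like this, pointed at rdiff-backup’s storage area; the path is illustrative:)

du -sh /backup/rdiff/usr                   # blocks actually allocated: ~7.0G
du -sh --apparent-size /backup/rdiff/usr   # sum of the file sizes: ~6.1G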
Now, for dar:
-rw-r--r-- 1 root root 2.3G Apr 12 22:47 usr-l00.1.dar
-rw-r--r-- 1 root root 826M Apr 13 11:34 usr-l01.1.dar
-rw-r--r-- 1 root root 411M Apr 13 19:05 usr-l02.1.dar
This was using bzip2 compression, and backed up exactly the same files and data that rdiff-backup did. The initial archive was 2.3GB, much smaller than the 5.15GB mirror that rdiff-backup keeps around. The apt-get upgrade differential was 826MB, compared to 471MB for rdiff-backup; no real surprise there. But the dist-upgrade differential, still a pathologically bad case for dar though less so, was only 6MB larger than rdiff-backup’s 405MB increment. And dar’s total actual disk consumption was only 3.5GB, half of the 7.0GB that rdiff-backup actually used!
I still expect that, over an extended time, rdiff-backup could chip away at dar’s lead… or maybe not, if lots of small files change.
But this was a completely unexpected result. I am definitely going to give dar a closer look.
Also, before I started all this, I converted my external hard disk from ext3 to XFS because of ext3’s terrible performance with rdiff-backup.