rdiff-backup vs. ZFS
As I’ve been writing about backups, I’ve gone ahead and run some tests with rdiff-backup. I have been using rdiff-backup personally for many years now — probably since 2002, when I packaged it up for Debian. It’s a nice, stable system, but I always like to look at other options for things every so often.
rdiff-backup stores an uncompressed current mirror of the filesystem, similar to rsync. History is achieved by the use of compressed backwards binary deltas generated by rdiff (using the rsync algorithm). So, you can restore the current copy very easily — a simple cp will do if you don’t need to preserve permissions. rdiff-backup restores previous copies by applying all necessary binary deltas to generate the previous version.
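To make this concrete, restores look roughly like the following (paths are hypothetical):

    # Current version: it's just a mirror, so a plain copy is enough
    cp -a /backups/rdiff/home/somefile restored-somefile

    # An older version: rdiff-backup applies the reverse deltas for you;
    # -r (--restore-as-of) takes a time spec such as 10D (ten days ago)
    rdiff-backup -r 10D /backups/rdiff/home/somefile restored-somefile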
Things I like about rdiff-backup:
- Bandwidth-efficient
- Reasonably space-efficient, especially where history is concerned
- Easily scriptable and nice CLI
- Unlike tools such as duplicity, there is no need to periodically run full backups — old backups can be deleted without impacting the ability to restore more current backups
Things I don’t like about it:
- Speed. It can be really slow. Deleting 3 months' worth of old history takes hours; it mostly just has to unlink vast numbers of files, but it does that very slowly. Restores, backups, and the rest are slow as well. Even getting a list of your increment sizes, so you'd know how much space deleting them would save, can take a very long time (see the commands sketched after this list).
- The current backup copy is stored without any kind of compression, which is not at all space-efficient
- It creates vast numbers of little files that take forever to delete or summarize
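For reference, the operations I'm grumbling about above are the likes of these (repository path hypothetical):

    # How much space would deleting old increments reclaim? (slow)
    rdiff-backup --list-increment-sizes /backups/rdiff/home

    # Delete everything older than 3 months (also slow; --force is needed
    # because more than one increment gets removed)
    rdiff-backup --remove-older-than 3M --force /backups/rdiff/home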
So I thought I would examine how efficient ZFS would be. I wrote a script that would replay the rdiff-backup history — first it would rsync the current copy onto the ZFS filesystem and make a ZFS snapshot. Then each previous version was processed by my script (rdiff-backup’s files are sufficiently standard that a shell script can process them), and a ZFS snapshot created after each. This lets me directly compare the space used by rdiff-backup to that used by ZFS using actual history.
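My actual script parses rdiff-backup's increment files directly, but a minimal sketch of the general idea, with made-up paths and dataset names, looks something like this:

    #!/bin/bash
    # Replay rdiff-backup history onto a ZFS dataset, snapshotting each state.
    # All names here are invented for illustration.
    set -e
    REPO=/backups/rdiff/home        # rdiff-backup repository
    DS=tank/replay/home             # ZFS dataset mounted at /tank/replay/home
    MNT=/tank/replay/home

    # Current mirror first (skip rdiff-backup's own metadata directory)
    rsync -a --delete --exclude=rdiff-backup-data "$REPO/" "$MNT/"
    zfs snapshot "$DS@current"

    # Walk back through the increments, newest to oldest. --parsable-output
    # prints each increment time as seconds since the epoch, which rdiff-backup
    # also accepts as a time spec for -r.
    for t in $(rdiff-backup --list-increments --parsable-output "$REPO" \
               | awk '{print $1}' | sort -rn); do
        rdiff-backup --force -r "$t" "$REPO" "$MNT"
        zfs snapshot "$DS@$t"
    done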
I enabled gzip-3 compression and block dedup in ZFS.
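Assuming a pool named tank with a dataset for the backups, that amounts to:

    zfs set compression=gzip-3 tank/backups
    zfs set dedup=on tank/backups

    # The dedup and compression ratios mentioned below are the kind of
    # thing these report:
    zfs get compressratio tank/backups
    zpool get dedupratio tank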
My backups were nearly 1TB in size and the amount of space I had available for ZFS was roughly 600GB, so I couldn’t test all of them. As it happened, I tested the ones that were the worst-case scenario for ZFS: my photos, music collection, etc. These files had very little duplication and very little compressibility. Plus a backup of my regular server that was reasonably compressible.
The total size of the data backed up with rdiff-backup was 583GB. With ZFS, this came to 498GB. My dedup ratio on this was only 1.05 (meaning 5%, or 25GB, saved). The compression ratio was 1.12 (60GB saved). The combined ratio was 1.17 (85GB saved). Interestingly, 498 + 85 = 583.
Remember that the data under test here was mostly a worst-case scenario for ZFS. It would probably have done better had I had the time to throw the rest of my dataset at it (such as the 60GB backup of my iPod, which would have mostly deduplicated with the backup of my music server).
One problem with ZFS is that dedup is very memory-hungry. This is common knowledge, and it is commonly advertised that you need roughly 2GB of RAM per TB of disk when using dedup. I don't have quite that much to dedicate to it, so ZFS got VERY slow and thrashed the disk a lot after the ARC grew to about 300MB. I found some tunables in zfsrc and via the zfs command that let the ARC cache grow bigger, but the machine in question only has 2GB RAM and is doing lots of other things as well, so this barely improved anything. Note that this dedup RAM requirement is not out of line with what is expected from these sorts of solutions.
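For the record, the file in question is zfs-fuse's /etc/zfs/zfsrc; going from memory, the relevant knob is a max-arc-size setting in megabytes (the exact name may differ, so check the comments in the file itself):

    # /etc/zfs/zfsrc (zfs-fuse); option name quoted from memory, value in MB
    max-arc-size = 512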
Even if I got an absolutely stellar dedup ratio of 2:1, that would get me at most 1TB of effective capacity. The cost of buying a 1TB disk is less than the cost of upgrading my system to 4GB RAM, so dedup isn't worth it here.
I think the lesson is: think carefully about where dedup makes sense. If you’re storing a bunch of nearly-identical virtual machine images — the sort of canonical use case for this — go for it. A general fileserver — well, maybe you should just add more disk instead of more RAM.
Then that raises the question: if I don't need dedup from ZFS, is it worth bothering with at all, or should I just use ext4 and LVM snapshots? I think ZFS still makes sense, given its built-in support for compression and very fast snapshots; LVM snapshots are known to cause serious degradation to write performance while enabled, a penalty ZFS snapshots don't carry.
So I plan to switch my backups to use ZFS. A few observations on this:
- Some testing suggests that the time to delete a few months of old snapshots will be a minute or two with ZFS compared to hours with rdiff-backup (a pruning sketch follows this list).
- ZFS has shown itself to be more space-efficient than rdiff-backup, even without dedup enabled.
- There are clear performance and convenience wins with ZFS.
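A minimal sketch of the kind of pruning I have in mind, assuming snapshots are simply named by ISO date (e.g. tank/backups/myhost@2011-03-20) and GNU date is available:

    #!/bin/bash
    # Destroy snapshots older than roughly four months on one dataset.
    # Dataset and naming scheme are invented for illustration.
    DS=tank/backups/myhost
    CUTOFF=$(date -d '4 months ago' +%Y-%m-%d)

    zfs list -H -t snapshot -o name -r "$DS" | while read -r snap; do
        d=${snap#*@}                        # the part after the @
        if [[ "$d" < "$CUTOFF" ]]; then     # lexical compare works for ISO dates
            zfs destroy "$snap"
        fi
    done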
Backup Scripts
So now comes the question of backup scripts. rsync is obviously a pretty nice choice here, and if used with --inplace it may even play nicely with ZFS snapshots even when dedup is off. But let's say I'm backing up a few machines at home, or perhaps dozens at work. There is a need to automate all of this. Specifically, there's a need to:
- Provide scheduling, making sure that we don't hammer the server with 30 clients all at once
- Provide for "run before" jobs to do things like snapshot databases
- Be silent on success and scream loudly via emails to administrators on any kind of error… and keep backing up other systems when there is an error
- Create snapshots and provide an automated way to remove old snapshots (or mount them for reading, as ZFS-fuse doesn't support the .zfs snapshot directory yet)
To date I haven’t found anything that looks suitable. I found a shell script system called rsbackup that does a large part of this, but something about using a script whose homepage is a forum makes me less than 100% confident.
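Whatever I end up with, the per-host piece would presumably look something like this rough sketch (host, dataset layout, pre-backup command, and mail address are all invented; scheduling would sit on top of it):

    #!/bin/bash
    # Hypothetical per-host backup job: run a pre-backup command on the client,
    # pull the data with rsync, snapshot the dataset, and mail the admin on any
    # failure. Only a sketch of the requirements above, not a finished tool.
    HOST=$1
    DS=tank/backups/$HOST
    MNT=/tank/backups/$HOST
    ADMIN=root

    fail() {
        echo "backup of $HOST failed at step: $1" \
            | mail -s "backup FAILURE: $HOST" "$ADMIN"
        exit 1
    }

    # "Run before" job, e.g. snapshot a database on the client
    ssh "$HOST" /usr/local/sbin/pre-backup || fail "pre-backup job"

    # Pull the data; --inplace rewrites files in place, which keeps the
    # snapshot-to-snapshot deltas small
    rsync -a --delete --inplace --numeric-ids \
        --exclude=/proc --exclude=/sys \
        "$HOST":/ "$MNT/" || fail "rsync"

    # Freeze the result
    zfs snapshot "$DS@$(date +%Y-%m-%d)" || fail "zfs snapshot"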
On the securing the backups front, rsync comes with a good-looking rrsync script (inexplicably installed under /usr/share/doc/rsync/scripts instead of /usr/bin on Debian) that can help secure the SSH authorization. GNU rush also looks like a useful restricted shell.
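The usual rrsync pattern, assuming the script has been copied somewhere executable, is to pin the backup server's key to rrsync in authorized_keys on the machine being backed up, restricting that key to read-only rsync, for example:

    # ~/.ssh/authorized_keys on the machine being backed up (one line; key truncated)
    command="/usr/local/bin/rrsync -ro /",no-pty,no-agent-forwarding,no-port-forwarding ssh-rsa AAAA... backup@backupserver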
> Things I don’t like about it:
[…]
4. It wreaks havoc backing up e.g. /var/log if files are just being rotated.

I used rdiff-backup myself for years until I switched to backuppc, and I have never looked back. backuppc rocks for a single machine, but especially if you have a few. It uses content-addressed storage (like Git), so the same file is only ever saved once, even across systems.

The web interface requires a web server, which is a bit naff. But the ability to browse backups and restore files, directories and hosts interactively and intuitively quickly makes up for that.
That is an excellent point. Last time I looked at it, the requirement for the web server and the fact that they pretty much roll their own compression format and rsync server both bothered me. But it’s worth another look. I may miss having things directly accessible on the filesystem though.
How well does btrfs compare?
I haven’t tried it; see my comment on http://changelog.complete.org/archives/5547-research-on-deduplicating-disk-based-and-cloud-backups
As far as deduplication systems go, have you tested just dropping files into a Git repository and packing periodically? Git has some pretty impressive deduplication bits.
Git won’t be suitable for this. For one thing, you can’t delete the old backups very easily. For another, ever tried committing a 10GB file in Git? Yeah, it ain’t pretty. You’d need more RAM for that than I can afford.
Have you seen bup?
Yes. At present, not being able to remove old backups is a major showstopper. Concerns about how reliable it is, not preserving symlinks or hardlinks, etc. also rank up there. But I think they are actively working on all those things and it may well turn out to be a very good option in the future.
I have a somewhat different scenario in my small office: small files (Word docs, spreadsheets, MS Access databases), and high availability is really wanted. I ran into some difficulties with LVM snapshots as well, especially when I was working on a DRBD setup.
But I think a fix for snapshots will soon be within reach for me by using NILFS. It still has a bit to go before it's production-ready, but it seems to have what I'd need for good performance.
I envision using NILFS and DRBD for an asynchronous primary/secondary setup, where the secondary server mounts the NILFS snapshots and backs up to another off-site server using rdiff or some other diff tool.
Not suggesting this is any answer to your problems, but the developments with NILFS might be of interest to you.
Homepage: http://www.nilfs.org
Current discussion: http://www.mail-archive.com/linux-nilfs@vger.kernel.org/
WRT backup scripts, have you considered backupninja? It already has backends for backing up databases and for rsync.
As to rotating log files, I’ve switched to date-based rotation instead of number-based rotation, that pretty much eliminates the undesired behavior.
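If that's logrotate doing the rotation, the switch is presumably just the dateext directive, e.g. in /etc/logrotate.conf:

    # Rotate to names like syslog-20110320 rather than syslog.1
    dateext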
Interesting program — especially this in the description: “Backupninja is a silent flower blossom death strike to lost data.”
I’m really liking BackupPC’s scheduler and management interface. One problem with backupninja is that it doesn’t have centralized scheduling. It will not be practical for backing up dozens of servers as a result.
Thanks for your blog posts about this, I’m having a lot of trouble with our backup server, doing remote backups of the backup server itself, and you have given me some interesting ideas.
About backupninja, I'm surprised you have not heard of it before; it's been around for ages. You can schedule different backup jobs with backupninja to fire at different times.
Other things that backupninja can do that you were looking for:
1. Provides “run before” job capability
2. Can be silent on success and loud on failures (although if your mail system fails, you may never notice that failures are happening!)
3. Continues to back up when there is an error
4. You can easily add the capability to do snapshots and removal of snapshots, there is a shell handler that can be run with any commands you wish.
5. It's easily extendable, and simply written in bash
Interesting to see you evaluate rdiff-backup. I wrote a FUSE filesystem for browsing rdiff-backup increments, rdifffs, fueled largely by your excellent RWH book.
I think you may be wrong about "like" #4: from what I have been able to figure out, binary diffs are calculated against the next-newest version of the file, so if you delete an increment from the middle of the deck you might prevent restoring some files in older increments. But you can probably safely delete the oldest increment at any given time.
Well, I think I was right about it but maybe not sufficiently clear.
You are right that I can’t delete incrementals from the middle of history. What I meant is that I can say “delete all backups older than 4 months old” and this just happens without consequence to restoring newer files. That’s unlike duplicity, which uses forward history, so you have to periodically run full backups so that you can delete old backups.
Ah, interesting on the rdifffs — I’ll have to check it out. Glad you enjoyed RWH!
Hey there, I’ve been in the same boat as you–only in reverse. I started with ZFS on Solaris which worked beautifully. But because of the poor support and lack of active development on Solaris, I moved to Linux. I chose to use rdiff-backup as well, but I’ve noticed it slowing down quite a bit.
I’m running into the same problem as you: I want to use ZFS but it’s just not all there on Linux.
if you’e looking at zfs snaps, how about rsync with link-desk instead of rdiff ?
I found this quite interesting
http://www.sanitarium.net/unix_stuff/backups/readme.txt
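For reference, the kind of invocation presumably meant here builds a fresh dated tree on each run, hard-linking unchanged files against the previous one (paths invented):

    TODAY=$(date +%Y-%m-%d)
    rsync -a --delete \
        --link-dest=/backups/myhost/last \
        myhost:/home/ "/backups/myhost/$TODAY/"
    ln -sfn "$TODAY" /backups/myhost/last    # point "last" at the newest tree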
Since it's only 10 years later, let me drop this: https://github.com/psy0rz/zfs_autobackup
That's a program that does these things and more, automatically. There are many more programs that can do this already, but those work like a black box and don't have rigorous regression testing.