Why are we still backing up to hardlink farms?

You can find all sorts of backup implementations that use hardlink trees to save space on incrementals. Some are fairly rudimentary, using rsync --link-dest. Others, like BackupPC, are more sophisticated, doing file-level dedup into a storage pool indexed by hash.
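The basic pattern, sketched here with made-up paths and a simple date-based naming scheme, looks something like this:

```
# Minimal sketch of the hardlink-farm approach (paths are hypothetical).
# Files unchanged since the previous run become hardlinks into it, so
# only changed files consume new space; but every file still gets a
# fresh directory entry in every run's tree.
SRC=/home
DEST=/backups
TODAY=$(date +%Y-%m-%d)

rsync -a --delete \
    --link-dest="$DEST/latest" \
    "$SRC/" "$DEST/$TODAY/"

# Point "latest" at the run that just finished.
ln -sfn "$DEST/$TODAY" "$DEST/latest"
```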

While these are fairly space-efficient, they are really inefficient in other ways, because they create tons of directory entries. It would not be surprising to find millions of directory entries consumed very quickly. And while any given backup set can be deleted without impact on the others, the act of doing so can be very time-intensive, since often a full directory tree is populated with every day’s backup.

Much better is possible on modern filesystems. ZFS has been around for quite a while now, and is stable on Solaris, FreeBSD and derivatives, and Linux. btrfs is also being used for real workloads and is considered stable on Linux.

Both have cheap copy-on-write snapshot operations that would work well with a simple rsync --inplace to achieve the same effect as hardlink farms, but without all the performance penalties. Creating and destroying snapshots is a virtually instantaneous operation, snapshots work at the block level instead of the whole-file level, and they preserve changing permissions and the like (which rsync --link-dest can have issues with). So why are we not using them more?
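To illustrate, a snapshot-based cycle on btrfs might look roughly like this; it is only a sketch, not any particular tool's behavior, and the subvolume paths and remote host are made up:

```
# Sketch of a snapshot-based backup cycle on btrfs (names are hypothetical).
# /backups/current is assumed to be a btrfs subvolume.
SRC=backuphost:/home
CUR=/backups/current
SNAPDIR=/backups/snapshots

# Update the single working copy in place. Over the network, rsync's
# delta algorithm plus --inplace writes only the changed blocks into
# existing files, so unchanged blocks stay shared with older snapshots.
rsync -a --inplace --delete "$SRC/" "$CUR/"

# Take a cheap, read-only copy-on-write snapshot of the result.
btrfs subvolume snapshot -r "$CUR" "$SNAPDIR/$(date +%Y-%m-%d)"

# Expiring an old backup is a single fast operation:
#   btrfs subvolume delete "$SNAPDIR/2013-01-01"
```

The equivalent on ZFS would be zfs snapshot and zfs destroy on a dedicated dataset.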

BackupPC has a very nice scheduler and a helpful web interface, but its backend has no mode to take advantage of these more modern filesystems. The only tool I see like this is dirvish, for which someone wrote btrfs snapshot patches three years ago that, as far as I can tell, never got integrated.

A lot of folks are rolling homegrown solutions involving rsync and snapshots. Some are using zfs send or btrfs send, but those mechanisms require the same kind of filesystem on the machine being backed up as on the destination, and they do not permit excluding files from the backup set.
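For contrast, a bare-bones incremental zfs send pipeline might look roughly like this (pool, dataset, and host names are invented); the stream is an opaque, block-level replica of a snapshot, which is why the receiving end must also be ZFS and why there is nowhere to hang an exclude list:

```
# Sketch of incremental replication with zfs send (names are hypothetical).
# Both ends must run ZFS; the stream is opaque, so individual files
# cannot be excluded from it.
zfs snapshot tank/home@today

zfs send -i tank/home@yesterday tank/home@today | \
    ssh backupserver zfs receive -F backup/home
```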

Is this an area that needs work, or am I overlooking something?

Incidentally, hats off to liw’s obnam. It doesn’t exactly do this, but sort of implements its own filesystem with CoW semantics.

9 thoughts on “Why are we still backing up to hardlink farms?”

  1. Andre Fachat says:

    I’ve shared the link and added some comments here https://plus.google.com/108561624393340605896/posts/4ZH3nAM8Yzi

  2. Anonymous says:

    I’d love to use obnam; I’m looking forward to its performance approaching that of duplicity.

  3. Jon says:

    I use `rdiff-snapshot`, which has limitations but at least doesn’t do the hardlink-tree thing. I try to recommend it to people who use `rsnapshot` (or as a riposte when it is recommended on user lists).

    Yet to try obnam or backuppc, hopefully one day.

    Another useful thing with backup software would be flexibility over whether the client or server do the hard work of calculating the differential. With cheap, low-power devices becoming more popular (e.g. raspberry pi), the traditional arrangement of the server doing the hard work doesn’t always cut it (I frequently killed my rPi with `rdiff-backup`, despite having plenty of swap, because the sd driver starved regardless.)

    1. jgoerzen says:

      @Jon I suspect you meant rdiff-backup. I used it for quite a while, but its performance wasn’t up to snuff for me. Very nice program though.

  4. tobias3 says:

    Have a look at https://www.urbackup.org using btrfs as backup storage.
