Backup Software

I think most people reading my blog would agree that backups are extremely important. So much important data is on computers these days: family photos, emails, financial records. So I take backups seriously.

A little while back, I purchased two identical 400GB external hard disks. One is kept at home, and the other at a safe deposit box in a bank in a different town. Every week or two, I swap drives, so that neither one ever becomes too dated. This process is relatively inexpensive (safe deposit boxes big enough to hold the drive go for $25/year), and works well.

I have been using rdiff-backup to make these backups for several years now. (Since at least 2004, when I submitted a patch to make it record all metadata on MacOS X). rdiff-backup is quite nice. It is designed for storage to a hard disk. It stores on the disk a current filesystem mirror along with some metadata files that include permissions information. History is achieved by storing compressed rdiff (rsync) deltas going backwards in time. So restoring “most recent” files is a simple copy plus application of metadata, and restoring older files means reversing history. rdiff-backup does both automatically.

This is a nice system and has served me well for quite some time. But it has its drawbacks. One is that you always have to have the current image, uncompressed, which uses up lots of space. Another is that you can’t encrypt these backups with something like gpg for storage on a potentially untrusted hosting service (say, rsync.net). Also, when your backup disk fills up, it takes forever to figure out what to delete, since rdiff-backup –list-increment-sizes must stat tens of thousands of files. So I went looking for alternatives.

The author of rdiff-backup actually wrote one, called duplicity. Duplicity works by, essentially, storing a tarball full backup with its rdiff signature, then storing tarballs of rdiff deltas going forward in time. The reason rdiff-backup must have the full mirror is that it must generate rdiff deltas “backwards”, which requires the full prior file available. Duplicity works around this.

However, the problem with duplicity is that if the full backup gets lost or corrupted, nothing newer than it can be restored. You must make new full backups periodically so that you can remove the old history. The other big problem with duplicity is that it doesn’t grok hard links at all. That makes it unsuitable for backing up /sbin, /bin, /usr, and my /home, in which I frequently use hard links for preparing CD images, linking DVCS branches, etc.

So I went off searching out other projects and thinking about the problem myself.

One potential solution is to simply store tarballs and rdiff deltas going forward. That would require performing an entire full backup every day, which probably isn’t a problem for me now, but I worry about the load that will place on my hard disks and the additional power it would consume to process all that data.

So what other projects are out there? Two caught my attention. The first is Box Backup. It is similar in concept to rdiff-backup. It has its own archive format, and otherwise operates on a similar principle to rdiff-backup. It stores the most recent data in its archive format, compressed, along with the signatures for it. Then it generates reverse deltas similar to rdiff-backup. It supports encryption out of the box, too. It sounded like a perfect solution. Then I realized it doesn’t store hard links, device entries, etc., and has a design flaw that causes it to miss some changes to config files in /etc on Gentoo. That’s a real bummer, because it sounded so nice otherwise. But I just can’t trust my system to a program where I have to be careful not to use certain OS features because they won’t be backed up right.

The other interesting one is dar, the Disk ARchive tool, described by its author as the great grandson of tar — and a pretty legitimate claim at that. Traditionally, if you are going to back up a Unix box, you have to choose between two not-quite-perfect options. You could use something like tar, which backs up all your permissions, special files, hard links, etc, but doesn’t support random access. So to extract just one file, tar will read through the 5GB before it in the archive. Or you could use zip, which doesn’t handle all the special stuff, but does support random access. Over the years, many backup systems have improved upon this in various ways. Bacula, for instance, is incredibly fast for tapes as it creates new tape “files” every so often and stores the precise tape location of each file in its database.

But none seem quite as nice as dar for disk backups. In addition to supporting all the special stuff out there, dar sports built-in compression and encryption. Unlike tar, compression is applied per-file, and encryption is applied per 10K block, which is really slick. This allows you to extract one file without having to decrypt and decompress the entire archive. dar also maintains a catalog which permits random access, has built-in support for splitting archives across removable media like CD-Rs, has a nice incremental backup feature, and sports a host of tools for tweaking archives — removing files from them, changing compression schemes, etc.

But dar does not use binary deltas. I thought this would be quite space-inefficient, so I decided I would put it to the test, against a real-world scenario that would probably be pretty much a worst case scenario for it and a best case for rdiff-backup.

I track Debian sid and haven’t updated my home box in quite some time. I have over 1GB of .debs downloaded which represent updates. Many of these updates are going to touch tons of files in /usr, though often making small changes, or even none at all. Sounds like rdiff-backup heaven, right?

I ran rdiff-backup to a clean area before applying any updates, and used dar to create a full backup file of the same data. Then I ran apt-get upgrade, and made incrementals with both rdiff-backup and dar. Finally I ran apt-get dist-upgrade, and did the same thing. So I have three backups with each system.

Let’s look at how rdiff-backup did first.

According to rdiff-backup –list-increment-sizes, my /usr backup looks like this:

        Time                       Size        Cumulative size
-----------------------------------------------------------------------------
Sun Apr 13 18:37:56 2008         5.15 GB           5.15 GB   (current mirror)
Sun Apr 13 08:51:30 2008          405 MB           5.54 GB
Sun Apr 13 03:08:07 2008          471 MB           6.00 GB

So what we see here is that we’re using 5.15GB for the mirror of the current state of /usr. The delta between the old state of /usr and the state after apt-get upgrade was 471MB, and the delta representing dist-upgrade was 405MB, for total disk consumption of 6GB.

But if I run du -s over the /usr storage area in rdiff, it says that 7.0GB was used. du -s –apparent-size shows 6.1GB. The difference is that all the tens of thousands of files each waste some space at the end of their blocks, and that adds up to an entire gigabyte. rdiff-backup effectively consumed 7.0GB of space.

Now, for dar:

-rw-r--r-- 1 root root 2.3G Apr 12 22:47 usr-l00.1.dar
-rw-r--r-- 1 root root 826M Apr 13 11:34 usr-l01.1.dar
-rw-r--r-- 1 root root 411M Apr 13 19:05 usr-l02.1.dar

This was using bzip2 compression, and backed up the exact same files and data that rdiff-backup did. The initial mirror was 2.3GB, much smaller than the 5.1GB that rdiff-backup consumes. The apt-get upgrade differential was 826MB compared to the 471MB in rdiff-backup — not really a surprise. But the dist-upgrade differential — still a pathologically bad case for dar, but less so — was only 6MB larger than the 405MB rdiff-backup case. And the total actual disk consumption of dar was only 3.5GB — half the 7.0GB rdiff-backup claimed!

I still expect that, over an extended time, rdiff-backup could chip away at dar’s lead… or maybe not, if lots of small files change.

But this was a completely unexpected result. I am definitely going to give dar a closer look.

Also, before I started all this, I converted my external hard disk from ext3 to XFS because of ext3’s terrible performance with rdiff-backup.

13 thoughts on “Backup Software

  1. I have used dar for a couple years now, and I am pretty happy with it. I use compression (no encryption), and have been pretty happy with the filesize, although I don’t have as much binary data as you do.

    I combine dar with par2 (run par2 on the generated dar files), so hopefully par2 can repair local errors in the dar files, and then dar can locally extract the files. Without needing the whole file to be recovered.

    http://en.wikipedia.org/wiki/Par2
    http://parchive.sourceforge.net/

    1. Good post, I like one free software [url=http://www.softsea.com/review/Karens-Replicator.html]karens replicator[/url], easy for newbie, some tool under linux should be cool.

    1. Cliff, I was reading this post while eating raspberry-rhubarb pie for breakfast and thought “Cliff would really love this post”. Looks like I was right! :-)

  2. Take a look at BackupPC as well. Has web interface, maintains logarithmic backup history, provides browse of files for restore, etc.

    1. I too recommend backuppc. I ignore the GUI, but it is very efficient at space when you’re backing up multiple hosts.

      For example if you have ten copies of /bin/ls on ten machines it will only store one copy ..

  3. Not to take anything away from your informative post, but why not just use a mirrored RAID? You could run a software (mdadm) RAID 1, sync the disks, remove one of the disks for storage at another location, and then every so often add the disk back to the array and re-sync.

    You know you have an exact copy and if one fails it’s a simple matter of just plugging the other in it’s place, buying another hard drive, and then syncing again.

    Could it be this simple, or am I missing something?

    1. That doesn’t give you any incremental history, and so doesn’t protect you from “oh crap I delete this file a week ago and only just noticed”.

    2. The other problem is that you have a single point of failure when you have both disks at your house. Not a good thing in my book. In my setup, I have no single point of failure. The only time the two disks are at the same place is for 30 seconds in the bank vault when I swap them out, but then they’re not at the same place as my PC at home.

      1. Jon: Do you really need that for the entire partition? If not I would just use Subversion (or something similar) to version track a subset of the files.

        John: Good point, that’s why if I did it I would use 3 disks for the mirror. That way you have a rotation where there are always at least two in your box at home and one in a safe location. If you’re really paranoid you can scale this even further say up to 8 drives in a RAID 1, allowing you to have up to 6 alternate locations.

        You could also use dd to do a complete byte for byte backup, and then compress that image. That can then be wrapped in a PKCS#7 envelope.

        Anyways, I think your solution is all fine and good. I just thought I’d offer some more “brute force” methods.

  4. I have several TB worth of family photos, videos, and other data. This needs to be backed up — and archived.
    Backups and archives are often thought of as similar. And indeed, they may be done with the same tools at the same time. But the goals differ somewhat:
    Backups are designed to recover from a disaster that you can fairly rapidly detect.
    Archives are designed to survive for many years, protecting against disaster not only impacting the original equipment but also the original person that created them.
    Reflecting on this, it implies that while a nice ZFS snapshot-based scheme that supports twice-hourly backups may be fantastic for that purpose, if you think about things like family members being able to access it if you are incapacitated, or accessibility in a few decades’ time, it becomes much less appealing for archives. ZFS doesn’t have the wide software support that NTFS, FAT, UDF, ISO-9660, etc. do.
    This post isn’t about the pros and cons of the different storage media, nor is it about the pros and cons of cloud storage for archiving; these conversations can readily be found elsewhere. Let’s assume, for the point of conversation, that we are considering BD-R optical discs as well as external HDDs, both of which are too small to hold the entire backup set.
    What would you use for archiving in these circumstances?
    Establishing goals
    The goals I have are:

    Archives can be restored using Linux or Windows (even though I don’t use Windows, this requirement will ensure the broadest compatibility in the future)
    The archival system must be able to accommodate periodic updates consisting of new files, deleted files, moved files, and modified files, without requiring a rewrite of the entire archive dataset
    Archives can ideally be mounted on any common OS and the component files directly copied off
    Redundancy must be possible. In the worst case, one could manually copy one drive/disc to another. Ideally, the archiving system would automatically track making n copies of data.
    While a full restore may be a goal, simply finding one file or one directory may also be a goal. Ideally, an archiving system would be able to quickly tell me which discs/drives contain a given file.
    Ideally, preserves as much POSIX metadata as possible (hard links, symlinks, modification date, permissions, etc). However, for the archiving case, this is less important than for the backup case, with the possible exception of modification date.
    Must be easy enough to do, and sufficiently automatable, to allow frequent updates without error-prone or time-consuming manual hassle

    I would welcome your ideas for what to use. Below, I’ll highlight different approaches I’ve looked into and how they stack up.
    Basic copies of directories
    The initial approach might be one of simply copying directories across. This would work well if the data set to be archived is smaller than the archival media. In that case, you could just burn or rsync a new copy with every update and be done. Unfortunately, this is much less convenient with data of the size I’m dealing with. rsync is unavailable in that case. With some datasets, you could manually design some rsyncs to store individual directories on individual devices, but that gets unwieldy fast and isn’t scalable.
    You could use something like my datapacker program to split the data across multiple discs/drives efficiently. However, updates will be a problem; you’d have to re-burn the entire set to get a consistent copy, or rely on external tools like mtree to reflect deletions. Not very convenient in any case.
    So I won’t be using this.
    tar or zip
    While you can split tar and zip files across multiple media, they have a lot of issues. GNU tar’s incremental mode is clunky and buggy; zip is even worse. tar files can’t be read randomly, making it extremely time-consuming to extract just certain files out of a tar file.
    The only thing going for these formats (and especially zip) is the wide compatibility for restoration.
    dar
    Here we start to get into the more interesting tools. Dar is, in my opinion, one of the best Linux tools that few people know about. Since I first wrote about dar in 2008, it’s added some interesting new features; among them, binary deltas and cloud storage support. So, dar has quite a few interesting features that I make use of in other ways, and could also be quite helpful here:

    Dar can both read and write files sequentially (streaming, like tar), or with random-access (quick seek to extract a subset without having to read the entire archive)
    Dar can apply compression to individual files, rather than to the archive as a whole, faciliting both random access and resilience (corruption in one file doesn’t invalidate all subsequent files). Dar also supports numerous compression algorithms including gzip, bzip2, xz, lzo, etc., and can omit compressing already-compressed files.
    The end of each dar file contains a central directory (dar calls this a catalog). The catalog contains everything necessary to extract individual files from the archive quickly, as well as everything necessary to make a future incremental archive based on this one. Additionally, dar can make and work with “isolated catalogs” — a file containing the catalog only, without data.
    Dar can split the archive into multiple pieces called slices. This can best be done with fixed-size slices (–slice and –first-slice options), which let the catalog regord the slice number and preserves random access capabilities. With the –execute option, dar can easily wait for a given slice to be burned, etc.
    Dar normally stores an entire new copy of a modified file, but can optionally store an rdiff binary delta instead. This has the potential to be far smaller (think of a case of modifying metadata for a photo, for instance).

    Additionally, dar comes with a dar_manager program. dar_manager makes a database out of dar catalogs (or archives). This can then be used to identify the precise archive containing a particular version of a particular file.
    All this combines to make a useful system for archiving. Isolated catalogs are tiny, and it would be easy enough to include the isolated catalogs for the entire set of archives that came before (or even the dar_manager database file) with each new incremental archive. This would make restoration of a particular subset easy.
    The main thing to address with dar is that you do need dar to extract the archive. Every dar release comes with source code and a win64 build. dar also supports building a statically-linked Linux binary. It would therefore be easy to include win64 binary, Linux binary, and source with every archive run. dar is also a part of multiple Linux and BSD distributions, which are archived around the Internet. I think this provides a reasonable future-proofing to make sure dar archives will still be readable in the future.
    The other challenge is user ability. While dar is highly portable, it is fundamentally a CLI tool and will require CLI abilities on the part of users. I suspect, though, that I could write up a few pages of instructions to include and make that a reasonably easy process. Not everyone can use a CLI, but I would expect a person that could follow those instructions could be readily-enough found.
    One other benefit of dar is that it could easily be used with tapes. The LTO series is liked by various hobbyists, though it could pose formidable obstacles to non-hobbyists trying to aceess data in future decades. Additionally, since the archive is a big file, it lends itself to working with par2 to provide redundancy for certain amounts of data corruption.
    git-annex
    git-annex is an interesting program that is designed to facilitate managing large sets of data and moving it between repositories. git-annex has particular support for offline archive drives and tracks which drives contain which files.
    The idea would be to store the data to be archived in a git-annex repository. Then git-annex commands could generate filesystem trees on the external drives (or trees to br burned to read-only media).
    In a post about using git-annex for blu-ray backups, an earlier thread about DVD-Rs was mentioned.
    This has a few interesting properties. For one, with due care, the files can be stored on archival media as regular files. There are some different options for how to generate the archives; some of them would place the entire git-annex metadata on each drive/disc. With that arrangement, one could access the individual files without git-annex. With git-annex, one could reconstruct the final (or any intermediate) state of the archive appropriately, handling deltions, renames, etc. You would also easily be able to know where copies of your files are.
    The practice is somewhat more challenging. Hundreds of thousands of files — what I would consider a medium-sized archive — can pose some challenges, running into hours-long execution if used in conjunction with the directory special remote (but only minutes-long with a standard git-annex repo).
    Ruling out the directory special remote, I had thought I could maybe just work with my files in git-annex directly. However, I ran into some challenges with that approach as well. I am uncomfortable with git-annex mucking about with hard links in my source data. While it does try to preserve timestamps in the source data, these are lost on the clones. I wrote up my best effort to work around all this.
    In a forum post, the author of git-annex comments that “I don’t think that CDs/DVDs are a particularly good fit for git-annex, but it seems a couple of users have gotten something working.” The page he references is Managing a large number of files archived on many pieces of read-only medium. Some of that discussion is a bit dated (for instance, the directory special remote has the importtree feature that implements what was being asked for there), but has some interesting tips.
    git-annex supplies win64 binaries, and git-annex is included with many distributions as well. So it should be nearly as accessible as dar in the future. Since git-annex would be required to restore a consistent recovery image, similar caveats as with dar apply; CLI experience would be needed, along with some written instructions.
    Bacula and BareOS
    Although primarily tape-based archivers, these do also also nominally support drives and optical media. However, they are much more tailored as backup tools, especially with the ability to pull from multiple machines. They require a database and extensive configuration, making them a poor fit for both the creation and future extractability of this project.
    Conclusions
    I’m going to spend some more time with dar and git-annex, testing them out, and hope to write some future posts about my experiences.

Mentions

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.