Research on deduplicating disk-based and cloud backups

Yesterday, I wrote about backing up to the cloud, looking specifically at cloud backup services. I’ve been researching various options there, but also various options for disk-based backups. I’d like to have both onsite and offsite backups, so both types are needed, and it’s worth thinking about how the two can be combined with minimal overhead.

For the onsite backups, I’d want to see:

  1. Preservation of ownership, permissions, etc.
  2. Preservation of symlinks and hardlinks
  3. Space-efficient representation of changes — ideally binary deltas or block-level deduplication
  4. Ease of restoring
  5. Support for backing up Linux and Windows machines

Deduplicating Filesystems for Local Storage

Although I initially thought of block-level deduplicating file systems as something to use for offsite backups, they could also make an excellent choice for onsite disk-based backups.

rsync-based dedup backups

One way to use them would be to simply rsync data to them each night. Since copies are essentially free, we could do (or use some optimized version of) cp -r current snapshot/2011-01-20 or some such to save off historic backups. Moreover, we’d get dedup both across and within machines. And, many of these can use filesystem-level compression.
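
Something like this minimal nightly script is what I have in mind; it is only a sketch, and the hostnames, source path, backup mountpoint, and snapshot naming are all made up for illustration:

    #!/bin/sh
    # Nightly pull onto a dedup filesystem mounted at /backups (illustrative paths).
    DATE=$(date +%Y-%m-%d)
    for host in fileserver mailserver; do
        # -a preserves permissions/ownership/symlinks, -H preserves hardlinks,
        # --delete mirrors removals into the "current" tree.
        rsync -aH --delete root@$host:/home/ /backups/$host/current/
        # "Copy" current into a dated snapshot; on a dedup filesystem the data
        # blocks end up shared, so this mostly costs metadata (though a naive cp
        # still re-reads and re-writes everything, hence "some optimized version").
        mkdir -p /backups/$host/snapshots
        cp -a /backups/$host/current /backups/$host/snapshots/$DATE
    done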

The real upshot of this is that the entire history of the backups can be browsed as a mounted filesystem. It would be fast and easy to find files, especially when users call about that file they deleted at some point in the past but don’t remember when, exactly what it was called, or exactly where it was stored. We can do a lot more with find and grep to locate these things than we could with the restore console in Bacula (or any other backup program). Since it is a real mounted filesystem, we could also do fun things like make tarballs of it at will, zip parts up, scp them back to the file server, whatever. We could potentially even give users direct access to their files to restore things they need for themselves.
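
For instance, hunting down a half-remembered file across every nightly snapshot is just an ordinary find or grep; the paths and names below are illustrative:

    # Which snapshots still contain the spreadsheet the user vaguely remembers?
    find /backups/fileserver/snapshots -iname '*budget*.ods' -ls

    # Or search file contents when even the name is forgotten.
    grep -rl 'Q3 forecast' /backups/fileserver/snapshots/2011-01-*/alice/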

The downside of this approach is that rsync can’t store all the permissions unless it’s running as root on the system. Wrappers such as rdup around rsync could help with that. Another downside is that there isn’t a central scheduling/statistics service; we wouldn’t want the backup system to be hammered by 20 servers trying to send it data at once. So there’d be an element of rolling our own scripts, though it wouldn’t be too bad. I’d have preferred not to authorize a backup server with root-level access to dozens of machines, but that may be inescapable in this instance.

Bacula and dedup

The other alternative I thought of is a system such as Bacula with disk-based “volumes”. A Bacula volume is normally a tape, but Bacula can just as easily write them to disk files. This lets us use the powerful Bacula scheduling engine, logging service, pre-backup and post-backup jobs, etc. Normally this would be an egregious waste of disk space: Bacula, like most tape-heritage programs, will write out an entire new copy of a file if even one byte changes. I had thought that I could let block-level dedup reduce the storage size of Bacula volumes, but after looking at the Bacula block format spec, this won’t be possible, as each block has timestamps and such in it.
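
For reference, pointing Bacula at disk instead of tape is just a storage daemon Device resource whose Archive Device is a directory. This is only a sketch modeled on the stock file-storage example that ships with Bacula; the resource name and path are placeholders:

    # bacula-sd.conf excerpt (sketch): a "drive" that is really a directory
    Device {
      Name = FileStorage
      Media Type = File
      Archive Device = /srv/bacula/volumes   # volume files get written here
      LabelMedia = yes                       # let Bacula label new volumes itself
      Random Access = yes
      AutomaticMount = yes
      RemovableMedia = no
      AlwaysOpen = no
    }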

The good things about this setup revolve around using the central Bacula director. We need only install bacula-fd on each server to be backed up, and it has a fairly limited set of things it can do. Bacula already has built-in support for defining simple or complicated retention policies. Its director will email us if there is a problem with anything. And its logs and catalog are already extensive and enable us to easily find out things such as how long backups take, how much space they consume, etc. And it backs up Windows machines intelligently and comprehensively in addition to POSIX ones.

The downsides are, of course, that we don’t get the benefit of having the entire history browsable on the filesystem at once, and that space is used far less efficiently. Not only that, but recovering from a disaster would require a more extensive bootstrapping process.

A hybrid option may be possible: automatically unpacking Bacula backups onto the local filesystem after they’ve run. Dedup should ensure this doesn’t take additional space, provided the Bacula block size aligns with the filesystem block size, which is certainly not a given. It may also make sense to use Bacula for Windows and rsync/rdup for Linux systems.
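
If I pursued the hybrid, the unpacking would presumably be a post-backup job built around Bacula’s bextract tool, something along these lines; the volume name, device name, and target directory here are guesses I haven’t tested:

    # Hypothetical post-backup step: extract the night's volume back onto
    # the dedup filesystem so the history is browsable there too.
    bextract -c /etc/bacula/bacula-sd.conf -V Vol-2011-01-20 \
        FileStorage /backups/bacula-unpacked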

This hybrid seems, however, rather wasteful and not especially useful.

Evaluation of deduplicating filesystems

I set up and tested three deduplicating filesystems available for Linux: S3QL, SDFS, and zfs-fuse. (I did not examine lessfs.) I ran a similar set of tests for each; a rough script of the procedure follows the list:

  1. Copy /usr/bin into the fs with tar -cpf - /usr/bin | tar -xvpf - -C /mnt/testfs
  2. Run commands to sync/flush the disk cache. Evaluate time and disk used at this point.
  3. Rerun the tar command, putting the contents into a slightly different path in the test filesystem. This should consume very little additional space since the files will have already been there. This will validate that dedupe works as expected, and provide a hint about its efficiency.
  4. Make a tarball of both directories from the dedup filesystem, writing it to /dev/zero (to test read performance)
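
Concretely, the per-filesystem procedure looked roughly like this; the mountpoint and directory names are placeholders, and the plain sync stands in for whatever flush command each filesystem called for:

    # Sketch of the test procedure for one filesystem mounted at /mnt/testfs.
    cd /mnt/testfs && mkdir copy1 copy2

    time sh -c 'tar -cpf - /usr/bin | tar -xpf - -C copy1'   # 1. first copy
    time sync                                                # 2. flush write caches; note time and space used

    time sh -c 'tar -cpf - /usr/bin | tar -xpf - -C copy2'   # 3. copy again into a different path
    sync                                                     #    (should cost almost no extra space)

    time sh -c 'tar -cf - copy1 copy2 > /dev/zero'           # 4. read both copies back out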

I did not attempt to flush read caches during this, but I did flush write caches. The test system has 8GB RAM, 5GB of which was free or in use as cache. The CPU is a Core2 6420 at 2.13GHz. The filesystems that create their files atop an existing filesystem had ext4 (mounted noatime) beneath them. The ZFS pool was built on an LVM LV. I also benchmarked native performance on ext4 as a baseline. The data set consists of 3232 files totaling 516MB, and it contains hardlinks and symlinks.

Here are my results. Please note the comments below, as SDFS could not accurately complete the test.

Test                        ext4    S3QL    SDFS    zfs-fuse
First copy                  1.59s   6m20s   2m2s    0m25s
Sync/Flush                  8.0s    1m1s    0s      0s
Second copy + sync          N/A     0m48s   1m48s   0m24s
Disk usage after 1st copy   516MB   156MB   791MB   201MB
Disk usage after 2nd copy   N/A     157MB   823MB   208MB
Make tarball                0.2s    1m1s    2m22s   0m54s
Max RAM usage               N/A     150MB   350MB   153MB
Compression                 none    lzma    none    gzip-2

It should be mentioned that these tests pretty much ruled out SDFS. SDFS doesn’t appear to support local compression, and it severely bloated the data store, which ended up much larger than the original data. Moreover, it permitted any user to create and modify files, even when the permission bits said the user couldn’t. tar gave many errors unpacking symlinks onto the SDFS filesystem, and du -s on the result threw up errors as well. Besides that, I noted that find located 10 fewer files than in my source data. Between the huge memory consumption, the data integrity concerns, and the inefficient disk storage, SDFS is out of the running for this project.

S3QL is optimized for storage to S3, though it can also store its files locally or on an sftp server — a nice touch. I suspect part of its performance problem stems from being designed for network backends, and using slow compression algorithms. S3QL worked fine, however, and produced no problems. Creating a checkpoint using s3qlcp (faster than cp since it doesn’t have to read the data from the store) took 16s.
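
The invocation is just a source and a destination inside the mounted filesystem; the paths and snapshot name here are made up:

    # Dedup-aware copy inside an S3QL mountpoint; the data isn't re-read from the store.
    s3qlcp /mnt/s3ql/current /mnt/s3ql/snapshot-2011-01-20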

zfs-fuse appears to be the most-used ZFS implementation on Linux at the moment. I set up a 2GB ZFS pool for this test, and set dedup=on and compression=gzip-2. When I evaluated compression in the past, I hadn’t looked at lzjb. I found a blog post comparing lzjb to the gzip options supported by ZFS and wound up using gzip-2 for this test.
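
Setting that up amounts to a few commands; the pool name and LV path below are placeholders for whatever you actually use:

    # Create a pool on an LVM logical volume, then turn on dedup and gzip-2 compression.
    zpool create testpool /dev/vg0/zfstest
    zfs set dedup=on testpool
    zfs set compression=gzip-2 testpool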

ZFS really shone here. Compared to S3QL, it took 25s instead of over 6 minutes to copy the data over, and took only 28% more space. I suspect that if I had selected gzip-9 compression it would have been closer to S3QL in both time and space. But creating a ZFS snapshot was nearly instantaneous. Although zfs-fuse probably doesn’t have as many users as ZFS on Solaris, it is available in Debian and has good backing behind it. I feel safer using it than I do using S3QL. So I think ZFS wins this comparison.

I spent quite some time testing ZFS snapshots, which are instantaneous. (Incidentally, ZFS-fuse can’t mount them directly as documented, so you create a clone of the snapshot and mount that.) They worked out as well as could be hoped. Due to dedupe, even deleting and recreating the entire content of the original filesystem resulted in less than 1MB additional storage used. I also tested creating multiple filesystems in the zpool, and confirmed that dedupe even works between filesystems.
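
The snapshot-and-clone dance looks like this; the pool, filesystem, and snapshot names are again illustrative:

    # Take an (instant) snapshot of the backup filesystem.
    zfs snapshot testpool/backups@2011-01-20

    # zfs-fuse wouldn't mount the snapshot directly for me, so clone it instead;
    # the clone gets its own mountpoint under the pool and can be browsed normally.
    zfs clone testpool/backups@2011-01-20 testpool/restore-2011-01-20
    ls /testpool/restore-2011-01-20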

Incidentally — wow, ZFS has a ton of awesome features. I see now why you OpenSolaris people kept looking at us Linux folks with a sneer. Only our project hasn’t been killed by a new corporate overlord, so I guess that maybe didn’t work out so well for you… <grin>.

The Cloud Tie-In

That leaves another question: what to do about offsite backups? Assuming for the moment that I want to back them up over the Internet to some sort of cloud storage facility, there are roughly three options:

  1. Get an Amazon EC2 instance with EBS storage and rsync files to it. Perhaps run ZFS on that thing.
  2. Use a filesystem that can efficiently store data in S3 or Cloud Files (S3QL is the only contender here)
  3. Use a third-party backup product (JungleDisk appears to be the leading option)

There is something to be said for using a different tool for offsite backups: if some tool-level issue ever bites the onsite backups, the offsite copies wouldn’t share it.

One of the nice things about JungleDisk is that bandwidth is free, and disk is the same $0.15/GB-mo that RackSpace normally charges. JungleDisk also does block-level dedup, and has a central management interface. This all spells “nice” for us.

The only remaining question would be whether to just use JungleDisk to back up the backup server, or to put it on each individual machine as well. If it just backs up the backup server, then administrative burdens are lower; we can back up everything there by default and just not worry about it. On the other hand, if there is a problem with our main backups, we could be really stuck. So I’d say I’m leaning towards ZFS plus some sort of rsync solution onsite, with JungleDisk for offsite.

I had two people suggest CrashPlan Pro on my blog. It looks interesting, but it is a very closed product, which makes me nervous. I like using standard tools and formats; that gives me more peace of mind, control, and recovery options. CrashPlan Pro supports multiple destinations and says that they do cloud hosting, but they don’t list pricing anywhere. So I’ll probably not mess with it.

I’m still very interested in what comments people may have on all this. Let me know!

20 thoughts on “Research on deduplicating disk-based and cloud backups”

  1. Jason Riedy says:

    Have you considered the source-visible tarsnap or the closed-client, open-raw-storage-protocol SpiderOak? I know the latter has some level of block-based dedup, and both encrypt on the client side. I get very worried about non-free software, particularly for critical uses like backups, but comparing and contrasting could be interesting.

    1. John Goerzen says:

      I did look briefly at both. SpiderOak seems more geared to individual users or small workgroups rather than servers. In particular, their FAQ notes that they can’t back up symlinks, which really rules it out as a useful backup tool.

      Storage on tarsnap is twice as expensive as competitors, and bandwidth more than twice as expensive. It also explicitly doesn’t support Windows, which is at least somewhat of a drawback though not a showstopper for us.

      1. Jason Riedy says:

        Thank you! I hadn’t noticed the lack of symlink support for SpiderOak. I had half assumed their architecture was based on git-like methods.

        I hadn’t noticed the tarsnap differences, either. My uses tend to be either small or utterly massive (TB size datasets).

        Thank you again for the useful notes.

      2. A happy tarsnap user says:

        I read on Reddit that tarsnap works well under Cygwin (the poster mentioned that Colin Percival had been helpful on that topic). Maybe you should ask on the tarsnap-users mailing list?

        About price, you should consider the fact that tarsnap does compression and block-level deduplication, potentially reducing the amount of data you need to transfer and back up, while using full snapshots rather than deltas (thus allowing quick restoration). It also does not have monthly costs (you pay only for what you use). I guess it would be wise, at least, to give it a try and figure out how much it would cost you.

        On the other hand, it is definitely easier to browse an rsync backup than a tarsnap one, because the former is neither encrypted nor compressed. But then, the price issue strikes back.

        1. Gour says:

          I agree… I tried it yesterday with a 1.2GB mail archive, which tarsnap compressed to 481994351 bytes, using 508041300 bytes of bandwidth (initial archive).

          However, I do agree it can be expensive for hundreds of GBs, so I’ll still use Bacula & LTO2 tapes for that purpose.

        2. John Goerzen says:

          No disputing that, but it’s not a feature unique to tarsnap. I’ve specifically tested the block-level dedup on JungleDisk, and it works, at less than half the cost of tarsnap.

  2. Nikolaus Rath says:

    Hi,

    You can tell S3QL to use a different compression algorithm (the --compress option to mount.s3ql); this should make it significantly faster.

    Nevertheless, you are right in that it will never be as fast as ext4 or zfs because it is written under the assumption that performance will always be limited by the speed of the network connection.

    1. John Goerzen says:

      Hi Nikolaus! Thank you for your hard work on S3QL. Yes, I think you’re right, though of the compression methods supported, I think I picked the fastest (other than none). I am wondering: does it maintain a local cache to prevent it from having to look up a hash on the remote for every block?

  3. Laurent says:

    For offsite backups I use Amazon S3 via s3sync. It works fine.

    1. John Goerzen says:

      Way, way too inefficient for this. How would a person use it to maintain 20 days’ worth of snapshots? If a tool is just going to make things look like a regular FS on S3, I can’t imagine how it would avoid re-uploading the entire data set every day (if we want to be able to restore from, say, the state as of 13 days ago).

  4. djadala says:

    Can you explain why you did not examine lessfs?

    1. John Goerzen says:

      It doesn’t support remote destinations. And for local use, ZFS seems more mature, widely used, and reliable to me.

      1. Emmanuel Florac says:

        I’ve tested lessfs quite extensively and it works really well. The best setup is to use it to host filesystem images (possibly loop-mounted or iSCSI-exported), because lessfs is optimized for a small number of files.

        On the other hand, I’ve created 600 million files totaling 19 TB, and though lessfs slowed to a crawl, it didn’t break.

        Space saved with typical files is in the 40-60% range. Overall it’s quite impressive; it even manages to save some space when storing heavily compressed files like mpg or divx videos.

        All in all I’m very satisfied with lessfs; it only lacks slightly in multithreaded performance.

  5. Philip Hands says:

    You might want to have a look at bup.

    You’d want to sit it on an encrypting file system and then store that on the cloud, probably, but it ought to work. I’d be interested to see how it compares to your existing tests — I’ve been very impressed with the small tests I’ve done, particularly doing things like backing up complete LVM volumes (although doing it that way makes restoring individual files a bit of a pain)

    The most significant missing feature at present would seem to be that there’s no way to discard old backups.

  6. chad says:

    Why not try btrfs? It has built-in compression and snapshot ability, and it’s a Linux-native filesystem.
    Maybe it was ruled out by one of your criteria that I missed…

    1. John Goerzen says:

      Hi Chad,

      The two reasons I didn’t look at it were that it doesn’t yet support deduplication, and I’m not confident enough in it yet to trust it with critical backups.

  7. Zetta offers what you are looking for, but as a commercial service. We have Windows, Linux, and Mac clients, preserve permissions, and use an incremental-forever approach with sub-file change detection and transmission: transport (but not storage) deduplication, if you will.

    Zetta is designed for enterprises, with a single pane of glass to view the sync/replication status of multiple servers. We’re also really good with a large number of files and efficiently stuffing the WAN pipe.

    Disclaimer: I am a co-founder and CTO of Zetta. Nice blog!

  8. Ruben says:

    Hi,
    I had my HDD crash with all my data. Now I find that the BIOS does not detect it but Windows does, though it freezes each time I try to access the data. So now I am considering a fail-safe backup plan.

    I want the de-duplication feature and a compressed drive feature. The idea is to buy a 2TB HDD and use my 1TB as a single compressed volume to back up incremental data from the 2TB. My primary boot disk is a 256GB SSD, and I write to my 2TB for all projects and other files I use on a daily basis.

    I need advice on whether to go with SDFS or ZFS on my 1TB backup volume for best results. I am not very aware of tools that can do active or passive cloning, so I need your advice. I just want to make sure that if my 2TB crashes again, I will have some way to restore it without losing it all again.

    I may store photos, jars, zips, ISOs, etc., so I want the volume to be smart enough not to try to recompress these already-compressed formats.

    Thanks again for any advice in this regard,

    Warm Regards,
    Ruben.
