
Backing Up to the Cloud

I’ve recently been taking some big-picture looks at how we do things, and one thing I think could be useful would be backing up a limited set of data to an offsite location states away. Cloud storage prices have dropped enough to make this practical. Services such as Amazon S3 and Rackspace Cloud Files (I’ve heard particularly good things about the latter) seem perfect for this. I’m not quite finding software that does what I want, though. Here are my general criteria:

  1. Storage fees of $0.15 per gigabyte-month or less
  2. Free or cheap ($0.20 per gigabyte or less) bandwidth fees
  3. rsync-like protocol, so that a 20GB file with 20MB of nightly changes doesn’t have to be re-sent in its entirety every night (see the delta sketch after this list)
  4. Open Source and cross-platform (Linux, Windows, Mac, Solaris ideally; Linux and Windows at a minimum)
  5. Compression and encryption
  6. Easy way to restore the entire backup set or individual files
  7. Versatile include/exclude rules
  8. Must be runnable from scripts, cron, etc. without a GUI
  9. Nice to have: block or file-level de-duplication
  10. Nice to have: support for accurately backing up POSIX metadata (owner, group, permission bits, symlinks, hard links, sparse files) and Windows filesystem attributes
  11. Nice to have: a point-and-click interface for the non-Unix folks to use to restore Windows files and routine restore requests
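
To make criterion 3 concrete, here’s a minimal sketch of the delta cycle I have in mind, using the rdiff tool from librsync. The file names are invented, and this is just an illustration of the mechanism, not a backup script:

```python
# Sketch of the rsync-style delta cycle using the rdiff CLI from librsync.
# File names are hypothetical; this just shows why only ~20MB would cross
# the wire for a 20GB file with 20MB of changes.
import subprocess

# 1. The remote side summarizes the copy it already has.
subprocess.run(["rdiff", "signature", "bigfile.old", "bigfile.sig"], check=True)

# 2. The local side computes a delta against that signature; the delta is
#    roughly the size of the changes, not the size of the file.
subprocess.run(["rdiff", "delta", "bigfile.sig", "bigfile.new", "bigfile.delta"], check=True)

# 3. The remote side applies the delta to reconstruct the new file.
subprocess.run(["rdiff", "patch", "bigfile.old", "bigfile.delta", "bigfile.rebuilt"], check=True)
```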

So far, here’s what I’ve found. I should note that not a single one of these solutions appears to handle hard links or sparse files correctly, meaning I can’t rely on them for complete system-level backups. That doesn’t mean they’re useless — I could still use them to back up critical user data — just less useful.

Of the Free Software solutions, Duplicity is a leading contender. It has built-in support for Amazon S3 and Rackspace Cloud Files storage. It uses rdiff, a standalone implementation of the rsync binary delta algorithm: you send up one full backup, then binary deltas from it for each incremental. That makes incrementals efficient in both bandwidth and storage. However, periodic full backups still have to be run (perhaps not terribly *often*, but they will be needed), which cuts into the bandwidth savings. Duplicity doesn’t offer block-level de-duplication or a GUI for the point-and-click folks, but it DOES offer the most Unixy approach and feels like a decent match for the task overall.
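
For the curious, here’s roughly how I’d expect to drive Duplicity from cron. The bucket name, paths, and schedule are made up, and this is a sketch rather than a tested configuration:

```python
#!/usr/bin/env python
# Hypothetical cron-driven wrapper around duplicity. Bucket name, paths,
# and the 30-day full-backup cadence are assumptions, not a recommendation.
import os
import subprocess

env = dict(os.environ)
env["AWS_ACCESS_KEY_ID"] = "..."        # real keys would come from a secrets file
env["AWS_SECRET_ACCESS_KEY"] = "..."
env["PASSPHRASE"] = "..."               # duplicity encrypts with GnuPG using this

target = "s3+http://example-backup-bucket/fileserver"  # made-up bucket

# Incremental by default; force a fresh full chain monthly so the delta
# chain (and therefore a restore) never gets unreasonably long.
subprocess.run(
    ["duplicity", "--full-if-older-than", "30D",
     "--include", "/srv/critical",
     "--exclude", "**",
     "/srv", target],
    env=env, check=True)

# Restoring a single file looks like:
#   duplicity restore --file-to-restore critical/report.ods TARGET /tmp/report.ods
```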

The other service relying on Free Software is rsync.net, which supports rsync, sftp, scp, and similar protocols directly. That would be great: it could preserve hard links and work with any number of rsync-based backup systems. The downside is that it’s expensive, really expensive. Their cheapest rate is $0.32 per GB-month, and that only kicks in if you store more than 2TB with them; the base rate is $0.80 per GB-month. They promise premium support and such, but I just don’t think I can justify that for what is, essentially, a secondary backup.
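
Just to put numbers on that, here’s the monthly storage bill for a hypothetical 200GB backup set at the rates above:

```python
# Back-of-the-envelope monthly storage cost for a hypothetical 200GB
# backup set, using the per-GB-month rates quoted above.
size_gb = 200
print("S3-class storage:  $%.2f/month" % (size_gb * 0.15))  # $30.00
print("rsync.net base:    $%.2f/month" % (size_gb * 0.80))  # $160.00
# Even rsync.net's best rate ($0.32/GB-month, and only above 2TB)
# would be more than twice the $0.15 target from criterion 1.
```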

On the non-Open Source side, there’s JungleDisk, which has a Server Edition that looks like a good fit. The files are stored on either S3 or Rackspace, and it seems to be a very slick and full-featured solution. The client, however, is proprietary, though it does seem to offer a command-line interface in addition to the GUI. They claim to offer block-level de-duplication, which could be very nice. The other nice thing is that the server management is centralized, which presumably lets you easily automate things like never running more than one backup at a time, so a single Internet link isn’t monopolized. This can, of course, be managed with something like Duplicity and ssh jobs kicked off from a central machine, but it would be easier if the agent just handled it automatically.
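
To illustrate the do-it-yourself version, here’s a rough sketch (hostnames and the per-host backup script path are invented) of a central cron job that runs each host’s backup in turn, so only one upload hits the link at a time:

```python
#!/usr/bin/env python
# Sketch of the "ssh jobs from a central machine" approach: a single cron
# job walks the host list and runs each backup serially, so only one
# upload is on the wire at a time. Hostnames and script path are hypothetical.
import subprocess

HOSTS = ["fileserver", "mailserver", "dbserver"]

for host in HOSTS:
    # BatchMode keeps ssh from hanging on a password prompt when run
    # unattended from cron; key-based auth is assumed.
    result = subprocess.run(
        ["ssh", "-o", "BatchMode=yes", host, "/usr/local/sbin/run-offsite-backup"])
    if result.returncode != 0:
        print("backup failed on %s (exit %d)" % (host, result.returncode))
```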

What are people’s thoughts about this sort of thing?