Backing Up to the Cloud

I’ve recently been taking some big-picture looks at how we do things, and one thing I think could be useful would be for us to back up a limited set of data to an offsite location states away. Prices have gotten cheap enough to make this practical. Services such as Amazon S3 and Rackspace Cloud Files (I’ve heard particularly good things about the latter) seem perfect for this. I’m not quite finding software that does what I want, though. Here are my general criteria:

  1. Storage fees of $0.15 per gigabyte-month or less
  2. Free or cheap ($0.20 per gigabyte or less) bandwidth fees
  3. rsync-like delta transfer, so those 20GB files with 20MB of changes don’t have to be re-sent in their entirety every night
  4. Open Source and cross-platform (Linux, Windows, Mac, Solaris ideally; Linux and Windows at a minimum)
  5. Compression and encryption
  6. Easy way to restore the entire backup set or individual files
  7. Versatile include/exclude rules
  8. Must be runnable from scripts, cron, etc. without a GUI
  9. Nice to have: block or file-level de-duplication
  10. Nice to have: support for accurately backing up POSIX attributes (owner, group, permission bits, symlinks, hard links, sparse files) as well as Windows filesystem attributes
  11. Nice to have: a point-and-click interface the non-Unix folks can use for restoring Windows files and handling routine restore requests

So far, here’s what I’ve found. I should note that not a single one of these solutions appears to handle hard links or sparse files correctly, meaning I can’t rely on them for complete system-level backups. That doesn’t mean they’re useless — I could still use them to back up critical user data — just less useful.

Of the Free Software solutions, Duplicity is a leading contender. It has built-in support for Amazon S3 and Rackspace Cloud Files storage, and it uses rdiff, a standalone implementation of the rsync binary delta algorithm: you send up a full backup, then binary deltas from it for incrementals. That makes it both bandwidth-efficient and storage-efficient for incremental backups. However, periodic full backups still have to be run (perhaps not terribly often, but they will still be needed), which cuts into the bandwidth savings. Duplicity doesn’t offer block-level de-duplication or a GUI for the point-and-click folks, but it DOES offer the most Unixy approach and feels like a decent match for the task overall.
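
To make that concrete, here is a rough sketch of what a nightly Duplicity run to S3 could look like. The bucket name, paths, and schedule below are placeholders, not a recommendation of any particular setup:

    # Credentials and the GnuPG passphrase come from the environment.
    export AWS_ACCESS_KEY_ID=...        # placeholder
    export AWS_SECRET_ACCESS_KEY=...    # placeholder
    export PASSPHRASE=...               # placeholder

    # Incremental by default; start a fresh full chain once a month.
    duplicity --full-if-older-than 1M \
        --include /home --include /etc --exclude '**' \
        / s3+http://example-backup-bucket/host1

    # Restoring a single file from the latest backup:
    duplicity --file-to-restore home/alice/report.odt \
        s3+http://example-backup-bucket/host1 /tmp/report.odt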

The other service relying on Free Software is rsync.net, which supports rsync, sftp, scp, etc. protocols directly. That would be great, as it could preserve hard links and be compatible with any number of rsync-based backup systems. The downside is that it’s expensive — really expensive. Their cheapest rate is $0.32 per GB-month and that’s only realized if you store more than 2TB with them. The base rate is $0.80 per GB-month. They promise premium support and such, but I just don’t think I can justify that for what is, essentially, secondary backup.

On the non-Open Source side, there’s JungleDisk, which has a Server Edition that looks like a good fit. The files are stored on either S3 or Rackspace, and it seems to be a very slick and full-featured solution. The client, however, is proprietary, though it does offer a non-GUI command-line interface. They claim to offer block-level de-duplication, which could be very nice. The other nice thing is that the server management is centralized, which presumably lets you easily automate things like not running more than one backup at a time so as not to monopolize an Internet link. This can, of course, be managed with something like Duplicity and ssh jobs kicked off from the right places, but it would be easier if the agent just handled it automatically.

What are people’s thoughts about this sort of thing?

37 thoughts on “Backing Up to the Cloud”

  1. s3cmd has a sync option which uses s3 metadata to determine whether a file needs to be updated. I don’t think it can do anything but re-upload the whole file.

    Whatever you do, don’t learn the lesson I learnt by doing an rsync to an s3fs-fuse mount. s3fs sucks down every whole file and then re-uploads it. I got a pretty serious S3 bandwidth bill that month….
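
    For reference, a one-way s3cmd sync might look roughly like the sketch below (the bucket name and paths are examples). It skips unchanged files by comparing size/checksum metadata, but any file that has changed is re-uploaded in full:

        # Skips unchanged files, but re-uploads changed files in full.
        s3cmd sync --delete-removed /srv/data/ s3://example-bucket/data/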

  2. I’m currently running bacula to back up several machines at home (Windows and Debian) to a USB disk, and then use baculafs (a tool I wrote) to expose it as a FUSE file system, which is then backed up by duplicity to S3.

    I find duplicity to be rather finicky if the incremental backup chain grows too large. I get occasional checksum errors, which I haven’t been able to pin down – I find that I need to repeat a nightly backup to s3 every two to four weeks. Quite a pain.

    Lately, I’ve been playing around with s3ql (an S3 FUSE file system). I’m tempted to try it as an rsync target, but haven’t had the time to really stress-test it.

    1. rsync is not at all good with an s3-backed filesystem. It reads back the entire previously backed-up file to compute its deltas, and this is bandwidth-heavy. --whole-file and --inplace are probably going to be a must.
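
      Something along these lines, if you do go that route (the mount point is just an example):

          # Skip the delta pass and rewrite files in place on the S3-backed mount.
          rsync -a --whole-file --inplace /srv/data/ /mnt/s3-backup/data/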

    2. Some very interesting properties of s3ql there, especially the deduplication. This could make it useful as a disk-based backend for Bacula volume files for one thing.

    1. Hi Matt,

      I’m interested in your observations: what sort of things went wrong, how long ago you were using it, and on what platform.

      s3backer looks very interesting. Will have to look into that.

  3. I found those approaches too expensive. We went with CrashPlan Pro. There’s a seat license but the servers are free, so you can add as many backup destinations as you’d like. We have a couple of headless Debian instances with it running, as well as Mac and Windows clients. We have a local backup server and a remote one hosted in a distant colocation environment. We get a soup-to-nuts hardware-and-software-as-a-service with 6TB of storage for less than I could do anything with S3. Plus I get a nice central management interface, and a local backup endpoint for fast restores.

    1. That’s an interesting approach. Are you using their hosting, or, as it sounds, did you build your own server for the distant environment? I couldn’t find pricing on their hosted offering.

  4. I’m using a small script built around rsync with its --link-dest option to generate backups, roughly as in the sketch at the end of this comment. This way, each backup is a complete mirror, while unchanged files are just hard links to the previous backup.

    Most important to me is that the backup is just a regular file system mirror, no proprietary or special file format, easily indexable and searchable by standard tools.

    I’m using this mostly with USB disks, so bandwidth doesn’t matter much, but perhaps rdiff-backup supports the same functionality?
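
    A minimal sketch of the --link-dest rotation, with made-up paths and a daily schedule:

        #!/bin/sh
        # Each run creates a complete-looking snapshot; unchanged files are
        # hard links into the previous snapshot, so they take no extra space.
        SRC=/home/
        DEST=/mnt/usb-backup
        TODAY=$(date +%Y-%m-%d)

        rsync -a --delete \
            --link-dest="$DEST/latest" \
            "$SRC" "$DEST/$TODAY"

        # Re-point "latest" at the snapshot we just made.
        rm -f "$DEST/latest"
        ln -s "$TODAY" "$DEST/latest"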

  5. If you don’t mind having a base cost in addition to the per-GB charges, then get a server somewhere like Gandi, make a nice roomy disk, and rsync there. I currently pay $0.15/GB with a FOSS developer discount; they normally charge $0.25/GB.

    1. Forgot to say in the original comment: Note in particular that Gandi doesn’t have bandwidth charges by usage.

  6. There is another option for backing up data to cloud storage powered by Amazon S3: check out CloudBerry Backup, http://backup.cloudberrylab.com/ . It’s a one-time fee, and the rest is what you pay Amazon for S3. Besides, there is no proprietary data format, so you can access your data using other Amazon S3 tools. It supports all Amazon S3 regions and Reduced Redundancy Storage.

  7. For various other reasons, I have an account with Dreamhost. They have an account level which allows unlimited bandwidth and unlimited disk storage; I’m investigating using that for backup.

    They do allow rsync-over-ssh, which means Duplicity is the primary contender since the backups are stored encrypted. I certainly wouldn’t want to do unencrypted backups to *any* “cloud” provider.
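
    As an illustration only (the account name and paths are made up), a Duplicity run over sftp to such an account might look like this:

        # Backups are GPG-encrypted locally, so the provider only stores ciphertext.
        export PASSPHRASE=...   # placeholder
        duplicity --full-if-older-than 1M /home \
            sftp://backupuser@backup.example.com/backups/home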

    1. Dreamhost only provides 50GB that can be used for backup; if you upload non-website-related data anywhere other than your backup account, you will be violating the TOS. They do, however, offer unlimited bandwidth and extra storage for 10 cents per GB/month.

  8. I think s3ql is currently about the best-looking S3 option. It supports hard links and other interesting file attributes. It chunks files, but it has no accompanying backup tool that uses knowledge of the chunk hashes to upload only the changed chunks of a file.

    Brackup is neater in that it saves the chunk hashes locally, which optimises uploading changes. However, its development has stalled, it only recently got support for storing uid:gid, and it doesn’t track hard links (so it’s really no good for a full backup).

    S3cmd actually does nearly everything you need for a straight single snapshot though…

    Zmanda has support for S3 (I believe).

    Nothing is perfect, but S3QL is looking the most promising at the moment, perhaps in conjunction with a better backup script? (A rough sketch of what that pairing could look like follows below.)
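
    Using an S3QL filesystem as an rsync target, sketched with an example bucket name and paths; S3QL itself handles compression, encryption, and de-duplication:

        mkfs.s3ql s3://example-backup-bucket            # one-time: create the filesystem
        mount.s3ql s3://example-backup-bucket /mnt/s3ql
        rsync -aH --delete /srv/data/ /mnt/s3ql/data/   # -H preserves hard links
        umount.s3ql /mnt/s3ql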

  9. This is a great article. Safecopy backup works well for me and my family. They are cost-effective and very effective.

  10. I have a VPS or two at a host where I can purchase extra RAID 1 storage for $0.25/GB for the first chunk (something like 100GB or less; I don’t know the exact cutoff), with the price getting cheaper as more storage is purchased, and there is no bandwidth limitation on a 100Mbps port. I can set you up with storage at cost, or send you a link to get your own VPS starting at $10 a month with 20GB included.

    If you are going large on storage (several hundred GB), I can set you up with a dedicated server from them starting at $89 a month, including a little onboard storage and 300GB of RAID 1 SAN storage, with more available for $1/month per 100GB on an unlimited-bandwidth 100Mbit port. If you are willing to get the storage through me at a level where a dedicated server makes sense, I may be able to set up a dedicated server, give you a large storage allocation at a discount, and share the server with others. If you want the link, or want to work with me to set up storage on a shared dedicated server to save money (probably $10-$30 less per month, most likely $20), email me (changelog-complete@mcgurrin.net).

    With an actual server on the end holding the storage, you can use smb, nfs, rsync, sftp, scp, etc., whatever there is good software for, since it just has to be installed and configured on the server end. If there is significant interest from those who read this, I should be able to put together a shared server with a lot of storage and keep the cost low; it already is relatively low for high-storage setups.

    Thanks,
    Glenn McGurrin

  11. Might be useful to note that S3 traffic to EC2 hosts is free (provided you stay within the datacenter, i.e., not from S3 Europe to EC2 US). You could boot an EC2 host, download the S3 file there, rsync it, and upload it again, and you’d only be paying for the traffic for your diffs.
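
    Purely as a sketch of that idea, assuming an EC2 host in the same region with s3cmd installed (the host name, bucket, and file are placeholders):

        # Pull the current copy down next to S3 (free within the region),
        # send only the rsync delta from here, then push it back up.
        ssh ec2-user@ec2-host.example.com 's3cmd get --force s3://example-bucket/big.img /tmp/big.img'
        rsync -az --inplace /srv/big.img ec2-user@ec2-host.example.com:/tmp/big.img
        ssh ec2-user@ec2-host.example.com 's3cmd put /tmp/big.img s3://example-bucket/big.img'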

  12. I use BackupPC on my local network and have been satisfied to date. It uses rsync and lots of hard links. A plus is that it stores only one copy of a file, even across multiple machines. I don’t know much about the internals, but it works for me.

    I am looking to back up data online now that I have mastered network backup (I am a follow-the-tutorial kind of DIY admin)… I was thinking of using Amazon.

    Research shows that Amazon S3 does not accept hard links. Would I be able to store hard links if I use an EC2 instance with an EBS store? That way the hard links don’t go directly into S3 but into a block device formatted as an ext3 filesystem, which then persists into S3 on machine shutdown (the EBS store persists, not the hard-linked files themselves).

    Am I thinking along right lines?

    With regards.
    Sanjay.

    1. You could do that, or there are various services that let you remotely mount S3 or Rackspace Cloud Files as a FUSE filesystem. How reliable they are, I don’t know.

      I am leaning towards using JungleDisk to do backups. It doesn’t do hard links directly, but it does do block dedup and could back up our BackupPC directory.

  13. I find that Tarsnap does all the things you described: incremental backups, dedupe, and pricing based on used storage. I’ve been using it for a while and can’t complain.
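
    For anyone curious, creating an archive with Tarsnap is tar-like; a minimal example (the archive name and path are arbitrary):

        # Deduplicated, encrypted archive; only new data blocks are uploaded.
        tarsnap -c -f "home-$(date +%Y-%m-%d)" /home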

  14. I plan to set up my own stuff that uses rsync + encfs + EC2 + EBS + S3 snapshots. I’ve tested enough to be fairly certain it’s going to work. The rough idea will be:

    Set up an extra machine as a local backup server. Other machines will be nodes. The backup server will execute a backup script on each node via SSH. Each node backup will:

    – stop services as needed
    – snapshot (lvm) the filesystem
    – start services
    – mount lvm snapshot
    – mount a data-crypt dir with encfs (fuse)
    – rsync all needed data from lvm mount to the data-crypt dir
    – unmount data-crypt dir
    – rdiff-backup the data-crypt dir to the backup server
    – return 0 for success
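
    As a rough sketch only, one node’s script might look something like this (the volume group, service, paths, and backup server name are all placeholders, not my actual setup):

        #!/bin/sh
        # Sketch of a per-node backup step; all names below are examples.
        set -e

        /etc/init.d/mysql stop                              # stop services as needed
        lvcreate --snapshot --size 2G --name backsnap /dev/vg0/root
        /etc/init.d/mysql start                             # start services again

        mkdir -p /mnt/backsnap /srv/data-crypt
        mount -o ro /dev/vg0/backsnap /mnt/backsnap

        # /srv/data-crypt-raw holds the ciphertext on disk; /srv/data-crypt is the
        # cleartext view we rsync into. $ENCFS_PASSPHRASE is assumed to be provided.
        echo "$ENCFS_PASSPHRASE" | encfs -S /srv/data-crypt-raw /srv/data-crypt

        rsync -a --delete /mnt/backsnap/home/ /srv/data-crypt/home/

        fusermount -u /srv/data-crypt
        umount /mnt/backsnap
        lvremove -f /dev/vg0/backsnap

        # Ship the ciphertext (with rdiff-backup history) to the backup server.
        rdiff-backup /srv/data-crypt-raw backupserver::/backups/$(hostname)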

    Once all nodes are done running, my backup server will have an up to date, encrypted copy of my data from each node. It will then:

    – start an EC2 instance on demand
    – attach an EBS store to the instance
    – mount the EBS store
    – rsync current mirrors from each node to the EBS store
    — the rdiff-backup (differential) data will get excluded
    – unmount the EBS store
    – shut down the on demand EC2 instance
    – snapshot the EBS store to S3
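
    And the backup-server/EC2 side, sketched with the old EC2 API tool names; every ID, host name, and path here is a placeholder, and the wait-for-boot and mounting steps are hand-waved:

        #!/bin/sh
        set -e

        # Start an on-demand instance and attach the EBS data volume to it.
        INSTANCE=$(ec2-run-instances ami-XXXXXXXX -t t1.micro -k backup-key \
                   | awk '/^INSTANCE/ {print $2}')
        ec2-attach-volume vol-XXXXXXXX -i "$INSTANCE" -d /dev/sdf

        # ...wait for boot, then mount /dev/sdf on /mnt/ebs on the instance...

        # Push the current mirrors, leaving the rdiff-backup history behind.
        rsync -az --delete --exclude 'rdiff-backup-data/' \
            /backups/ ec2-user@ec2-host.example.com:/mnt/ebs/backups/

        ssh ec2-user@ec2-host.example.com 'umount /mnt/ebs'
        ec2-detach-volume vol-XXXXXXXX
        ec2-create-snapshot vol-XXXXXXXX -d "nightly backup"
        ec2-terminate-instances "$INSTANCE"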

    I have ~5GB of data that changes very little each month. My backups take <1 hour to run. Once my free micro tier runs out, I figure it will cost (roughly):

    – 5GB x $.15 = $.75
    – 30 hours x $.02 per hour micro instance = $.60
    – 8GB EBS root volume x $.10 = $.80
    – 10GB EBS data store x $.10 = $1
    —–
    $3.15 per month

    THE big thing for me though is that I can create a customized instance-store AMI that can be used to test my backups. The storage for the AMI should be about $1 per month and the small instances cost about $0.09 per hour. The $.02 per hour micro instances can only use EBS volumes, not instance-stores.

    One caveat: Testing in the cloud requires you to put your encryption keys in the cloud. You also have to (temporarily) provide your passphrase to unlock the keys. Don't do it if you don't understand the risks.

    The other thing is that once my data is sitting on my local backup server (encrypted), it's fairly easy to archive it to another off-site machine using rdiff-backup.

    Has anyone else done anything similar?

    I've looked at CrashPlan, JungleDisk, Carbonite, Dropbox, Mozy, SpiderOak, TurnKey Linux and none of them do what I want. Hence the reason for rolling my own. Of all those:

    CrashPlan – Really good for what I call ‘poor man’s peering’ since you can set up your own off-site box as a backup destination. I assume the encryption keys are local, but I didn’t research it.

    JungleDisk – I couldn't get a disk mounted in linux. The command line tools suck. No good docs on how the encryption works (at least not easy to find). $3 per month for the version I was trying.

    Carbonite – They claim client side encryption keys + remote access and don't explain how they manage both (FYI, it's impossible). That doesn't work for me. I only looked at them because they advertise on TWiT.

    Dropbox – Awesome, but not intended for the way I want to do backups. If you have technically challenged friends that need online backups, this is the best option. I didn’t research how / if they do encryption though.

    Mozy – No linux support. Online access to data. Couldn't find enough security info.

    SpiderOak – The most security conscious of everyone. Encryption keys never have to hit their servers. UI sucks. Restoring old versions of data is virtually impossible (one file at a time). Their de-duplication seems to be pretty good (I've used them for over a year).

    TurnKey Linux – Not strictly backups, but I was updating my virtual machines anyway, so I gave them a try. These guys have the right idea, but they use duplicity (incremental backups) to S3 instead of rsync (differential backups). I want differential backups. You can look at the code that makes it work (python IIRC).

    My biggest complaint about every existing backup service is the total lack of technical documentation about how the service works.

  15. CrashPlan exceeds the security requirements (you can use your own key/escrow policy), has the best de-duplication of the bunch, and is significantly less expensive than the constraints you’ve outlined.
    It also has something the others don’t: asynchronous/autonomous validation of backup sets at destinations, with automatic recovery/healing if there is bit rot.

    1. Hi Matthew,

      CrashPlan Pro does sound interesting for us. And because of that, I’ve looked into it a bit more.

      I actually sent CrashPlan Pro an email more than 24 hours ago with some questions. Other than a form letter, I have yet to hear back.

      Specifically, I asked:

      * About the pricing of your centralized storage for CrashPlan Pro customers.

      * What on-disk format you use for local backups, and if I can extract them with utilities such as tar

      * What workarounds you might have for CrashPlan Pro not dealing properly with hardlinks (which prevents a person from properly restoring a Unix machine from scratch using a backup)

      * What level of redundancy is provided with your offsite storage, and what level of geographic redundancy is provided.

      I must say that the lack of responsiveness on this doesn’t make me very comfortable with the product. I’ve had occasion to contact JungleDisk once, and despite it being about a more complex problem, heard back in less than 10 minutes. That’s what I’d be looking for from a company I’d be trusting critical data to.

      Thanks Matthew!

      — John

  16. You can back up to the Amazon S3 cloud using rsync. rsync has a bandwidth-efficient delta algorithm that can save time and money. For that, you need the gateway service s3rsync.com.
