Backing Up to the Cloud

January 18, 2011 | Uncategorized | John Goerzen

I’ve recently been taking some big-picture looks at how we do things, and one thing that I think could be useful would be for us to back up a limited set of data to an offsite location states away. Prices are now cheap enough to make this worthwhile. Services such as Amazon S3 and Rackspace Cloud Files (I’ve heard particularly good things about that one) seem to be perfect for this. I’m not quite finding software that does what I want, though. Here are my general criteria:

Storage fees of $0.15 per gigabyte-month or less
Free or cheap ($0.20 per gigabyte or less) bandwidth fees
rsync-like protocol to avoid having to re-send those 20GB files that have 20MB of changes in their entirety every night
Open Source and cross-platform (Linux, Windows, Mac, Solaris ideally; Linux and Windows at a minimum)
Compression and encryption
Easy way to restore the entire backup set or individual files
Versatile include/exclude rules
Must be runnable from scripts, cron, etc. without a GUI
Nice to have: block or file-level de-duplication
Nice to have: support for accurately backing up POSIX (user, group, permission bits, symlink, hard links, sparse files) and Windows filesystem attributes
Nice to have: a point-and-click interface for the non-Unix folks to use to restore Windows files and routine restore requests

So far, here’s what I’ve found. I should note that not a single one of these solutions appears to handle hard links or sparse files correctly, meaning I can’t rely on them for complete system-level backups. That doesn’t mean they’re useless — I could still use them to back up critical user data — just less useful.

Of the Free Software solutions, Duplicity is a leading contender. It has built-in support for Amazon S3 and Rackspace Cloud Files storage. It uses rdiff, which is a standalone implementation of the rsync binary delta algorithm. So you send up a full backup, then binary deltas from that for incrementals. That makes it bandwidth-efficient for incremental backups, and storage-efficient. However, periodic full backups will have to be run, which will make it less bandwidth-efficient. (Perhaps not incredibly *often*, but they will still be needed.) Duplicity doesn’t offer block-level de-duplication or a GUI for the point-and-click folks. But it DOES offer the most Unixy approach and feels like a decent match for the task overall.
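
To make the "full backup, then binary deltas" idea concrete, here is a minimal Python sketch. It is not Duplicity's or rdiff's actual code: the real rsync algorithm uses a rolling checksum so that inserted bytes do not shift every later block, whereas this simplification only compares blocks at fixed offsets. The function names and the tiny 4-byte block size are assumptions chosen for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # assumption: tiny for demonstration; real tools use far larger blocks


def block_signatures(data, block_size=BLOCK_SIZE):
    """Hash each fixed-size block of the previously backed-up file."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def make_delta(old_sigs, new_data, block_size=BLOCK_SIZE):
    """Return (new_length, {block_index: bytes}) holding only blocks that differ."""
    changed = {}
    for index, offset in enumerate(range(0, len(new_data), block_size)):
        block = new_data[offset:offset + block_size]
        digest = hashlib.sha256(block).hexdigest()
        if index >= len(old_sigs) or old_sigs[index] != digest:
            changed[index] = block
    return len(new_data), changed


def apply_delta(old_data, delta, block_size=BLOCK_SIZE):
    """Rebuild the new file from the old file plus the changed blocks."""
    new_length, changed = delta
    out = bytearray()
    for index, offset in enumerate(range(0, new_length, block_size)):
        if index in changed:
            out += changed[index]
        else:
            out += old_data[offset:offset + block_size]
    return bytes(out)
```

Only the signatures need to be available on the sending side, so the sender never has to download the old file. That is the property that makes this style of incremental attractive over a metered link.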

The other service relying on Free Software is rsync.net, which supports rsync, sftp, scp, etc. protocols directly. That would be great, as it could preserve hard links and be compatible with any number of rsync-based backup systems. The downside is that it’s expensive — really expensive. Their cheapest rate is $0.32 per GB-month and that’s only realized if you store more than 2TB with them. The base rate is $0.80 per GB-month. They promise premium support and such, but I just don’t think I can justify that for what is, essentially, secondary backup.

On the non-Open Source side, there’s JungleDisk, which has a Server Edition that looks like a good fit. The files are stored on either S3 or Rackspace, and it seems to be a very slick and full-featured solution. The client, however, is proprietary though it does seem to offer a non-GUI command-line interface. They claim to offer block-level de-duplication which could be very nice. The other nice thing is that the server management is centralized, which presumably lets you easily automate things like not running more than one backup at a time in order to not monopolize an Internet link. This can, of course, be managed with something like duplicity with appropriate ssh jobs kicked off from appropriate places, but it would be easier if the agent just handled it automatically.

What are people’s thoughts about this sort of thing?

37 thoughts on “Backing Up to the Cloud”

pratfall says:

January 18, 2011 at 3:12 pm

s3cmd has a sync option which uses s3 metadata to determine whether a file needs to be updated. I don’t think it can do anything but re-upload the whole file.

Whatever you do, don’t learn the lesson I learnt by doing an rsync to an s3fs-fuse mount. s3fs sucks down every whole file and then re-uploads it. I got a pretty serious S3 bandwidth bill that month….

Avi Rozen says:

January 18, 2011 at 3:25 pm

I’m currently running bacula to backup several machines at home (windows and debian) to a usb disk, and then use baculafs (a tool I wrote) to expose it as a fuse file system, which is then backed up by duplicity to s3.

I find duplicity to be rather finicky if the incremental backup chain grows too large. I get occasional checksum errors, which I haven’t been able to pin down – I find that I need to repeat a nightly backup to s3 every two to four weeks. Quite a pain.

Lately, I’ve been playing around with s3ql – an s3 fuse file system – I’m tempted to try it as an rsync target, but haven’t had the time to really stress-test it.

1. John Goerzen says:
  
  January 18, 2011 at 4:19 pm
  
  rsync is not at all good for use with an s3-backed filesystem. It reads the entire previous backed-up file to make its deltas, and this is bandwidth-heavy. --whole-file --inplace are probably going to be a must.
  
2. John Goerzen says:
  
  January 18, 2011 at 4:29 pm
  
  Some very interesting properties of s3ql there, especially the deduplication. This could make it useful as a disk-based backend for Bacula volume files for one thing.
  
Ryan Phillips says:

January 18, 2011 at 3:35 pm

I’ve been using tarsnap [1] to backup my websites. It runs 30 cents a gigabyte for bandwidth, and 30 cents a gigabyte per month for storage. I believe it has some support for hard links.

1. http://www.tarsnap.com/

1. James Vega says:
  
  January 19, 2011 at 8:49 am
  
  Tarsnap also just announced[1] a vulnerability in versions 1.0.22 – 1.0.27. On the plus side, a very detailed description of what happened and how to handle the problem was given and credit is being offered to cover the bandwidth charges of uploading the re-encrypted data. I think that speaks well for the service.
  
  1. http://www.daemonology.net/blog/2011-01-18-tarsnap-critical-security-bug.html
  
Matt Campbell says:

January 18, 2011 at 4:12 pm

My company tried using Jungle Disk Server Edition, but we found that it was generally pretty flaky; I wouldn’t recommend it.

Another option for your consideration is s3backer (http://code.google.com/p/s3backer/)

1. John Goerzen says:
  
  January 18, 2011 at 4:18 pm
  
  Hi Matt,
  
  I’m interested in your observations: what sort of things went wrong, how long ago you were using it, and on what platform.
  
  s3backer looks very interesting. Will have to look into that.
  
Luke Plant says:

January 18, 2011 at 5:59 pm

I know a guy who runs a small web hosting firm, and he has been very pleased with Duplicity, using a variety of backends, including Amazon S3.

Craig Bowers says:

January 18, 2011 at 7:04 pm

I found those approaches too expensive. We went with CrashPlan Pro. There’s a seat license but the servers are free, so you can add as many backup destinations as you’d like. We have a couple of headless Debian instances with it running, as well as Mac and Windows clients. We have a local backup server and a remote one hosted in a distant colocation environment. We get soup-to-nuts hardware and software as a service with 6TB of storage for less than I could do anything with S3. Plus I get a nice central management interface, and a local backup endpoint for fast restores.

1. John Goerzen says:
  
  January 18, 2011 at 9:23 pm
  
  That’s an interesting approach. Are you using their hosting, or, as it sounds, did you build your own server for the distant environment? I couldn’t find pricing on their hosted offering.
  
Neal says:

January 19, 2011 at 1:00 am

You might want to consider Tahoe.

Ketil says:

January 19, 2011 at 1:17 am

I’m using a small script using rsync with its --link-dest option to generate backups. This way, each backup is a complete mirror, while unchanged files are just hard links to the previous backup.

Most important to me is that the backup is just a regular file system mirror, no proprietary or special file format, easily indexable and searchable by standard tools.

I’m using this mostly to USB disks, so bandwidth doesn’t matter much, but perhaps rdiff-backup supports the same functionality?
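
The hard-link snapshot idea in this comment can be sketched in a few lines of Python. This is only an illustration of the concept, not rsync's implementation: rsync decides "unchanged" from size and modification time by default, while this sketch compares full file contents, and it ignores symlinks, ownership, and deletions. The function name `snapshot` and its parameters are hypothetical.

```python
import filecmp
import os
import shutil


def snapshot(source, dest, previous=None):
    """Mirror `source` into `dest`, hard-linking files unchanged since `previous`."""
    for root, _dirs, files in os.walk(source):
        rel = os.path.relpath(root, source)
        target_dir = os.path.normpath(os.path.join(dest, rel))
        os.makedirs(target_dir, exist_ok=True)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.join(target_dir, name)
            old = None
            if previous is not None:
                old = os.path.normpath(os.path.join(previous, rel, name))
            if old and os.path.isfile(old) and filecmp.cmp(src, old, shallow=False):
                os.link(old, dst)       # unchanged: share the inode with the last backup
            else:
                shutil.copy2(src, dst)  # new or changed: store a fresh copy
```

Each run produces a directory that looks like a full mirror, yet unchanged files consume no additional space because they share storage with the previous snapshot.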

Anonymous says:

January 19, 2011 at 1:40 am

If you don’t mind having a base cost in addition to the per-GB charges, then get a server at somewhere like Gandi, make a nice roomy disk, and rsync there. I currently pay $0.15/GB with a FOSS developer discount; they normally charge $0.25/GB.

1. Anonymous says:
  
  January 19, 2011 at 1:41 am
  
  Forgot to say in the original comment: Note in particular that Gandi doesn’t have bandwidth charges by usage.
  
Anna, CloudBerry Lab says:

January 19, 2011 at 2:39 am

There is another option for backing up data to cloud storage powered by Amazon S3. Check out CloudBerry Backup http://backup.cloudberrylab.com/ . It is a one-time fee; beyond that, you pay only for Amazon S3. Besides, there is no proprietary data format and you can access your data using other Amazon S3 tools. It supports all Amazon S3 regions and Reduced Redundancy Storage.

1. John Goerzen says:
  
  January 19, 2011 at 11:08 am
  
  Hi Anna,
  
  Looks like that is a Windows-only product. I won’t be able to use that.
  
Ben Finney says:

January 19, 2011 at 4:09 am

For various other reasons, I have an account with Dreamhost. They have an account level which allows unlimited bandwidth and unlimited disk storage; I’m investigating using that for backup.

They do allow rsync-over-ssh, which means Duplicity is the primary contender since the backups are stored encrypted. I certainly wouldn’t want to do unencrypted backups to *any* “cloud” provider.

1. Glenn McGurrin says:
  
  January 22, 2011 at 1:32 pm
  
  Dreamhost only provides 50GB that can be used for backup. If you upload non-website-related data anywhere other than your backup account, you will be violating the TOS, but they do offer unlimited bandwidth and extra storage for 10 cents per GB per month.
  
Pla says:

January 19, 2011 at 4:57 am

Brad Fitzpatrick’s Brackup? https://code.google.com/p/brackup/

1. John Goerzen says:
  
  January 19, 2011 at 1:11 pm
  
  That’s another very interesting option. I’ll probably have to give it a spin.
  
Ed W says:

January 19, 2011 at 3:31 pm

I think s3ql is currently about the best looking S3 option. Supports hardlinks and other interesting file attributes. It chunks files, but it has no optimised backup solution which uses knowledge of the chunk hashes to optimise changed files.

Brackup is neater in that it saves the chunk hashes locally, hence optimising uploading changes. However, its development has stalled and it only recently got support for storing uid:guid, and it doesn’t track hard links (so no good really for a full backup).

S3cmd actually does nearly everything you need for a straight single snapshot though…

Zmanda has support for S3 (I believe).

Nothing is perfect, but S3QL is looking the most promising at the moment? Perhaps in conjunction with a better backup script?
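
The chunk-hash idea discussed in this comment (remember which chunks have already been stored, and send only the ones not yet seen) can be sketched roughly as follows. This is not Brackup's or S3QL's code: the `ChunkStore` class is hypothetical, a dictionary stands in for the remote bucket, and fixed-size chunking is an assumption made to keep the example short.

```python
import hashlib


class ChunkStore:
    """In-memory stand-in for a remote bucket keyed by chunk hash (hypothetical)."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}    # sha256 hex digest -> chunk bytes
        self.uploaded = 0   # bytes actually "sent" to the store

    def put_file(self, data):
        """Store a file; return its recipe (ordered list of chunk digests)."""
        recipe = []
        for offset in range(0, len(data), self.chunk_size):
            chunk = data[offset:offset + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            if digest not in self.chunks:  # only unseen chunks cost bandwidth
                self.chunks[digest] = chunk
                self.uploaded += len(chunk)
            recipe.append(digest)
        return recipe

    def get_file(self, recipe):
        """Reassemble a file from its recipe."""
        return b"".join(self.chunks[digest] for digest in recipe)
```

Keeping the digest index on the client, as the comment says Brackup does, is what avoids a round trip to the remote store just to learn whether a chunk is already there.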

1. John Goerzen says:
  
  January 19, 2011 at 4:30 pm
  
  S3QL does look promising. I’m also looking at SDFS. Will be interesting to see how they compare.
  
David Claughton says:

January 20, 2011 at 6:16 am

Dreamhost only offer unlimited storage for web hosting, they cracked down on using it for backups a while ago. They do allow you to use up to 50GB for personal backup though – if you want more they charge $0.10 per GB/month.

http://wiki.dreamhost.com/Personal_Backup

Mecky says:

January 20, 2011 at 8:46 am

This is a great article. Safecopy backup works well for me and my family. They are cost-effective and so effective.

Kasumi_Ninja says:

January 20, 2011 at 11:01 am

I recommend Crashplan, it has excellent features for a very low price. Their support is excellent as well.

Pingback: Research on deduplicating disk-based and cloud backups | The Changelog
Glenn McGurrin says:

January 22, 2011 at 1:44 pm

I have a VPS or two at a host where I can purchase extra RAID 1 storage for $0.25/GB for the first little bit (something like 100GB or less; I don’t know the exact figure), and the price gets cheaper as more storage is purchased. There is no bandwidth limitation on a 100Mbps port. I can set you up with storage at cost, or send you a link to get your own VPS starting at $10 a month with 20GB included. If you are going large on storage (several hundred GB), I can set you up with a dedicated server from them starting at $89 a month, including a little onboard storage and 300GB of RAID 1 SAN storage, with more for $1/month per 100GB, on an unlimited-bandwidth 100Mbit port. If you are willing to get the storage through me at a level where a dedicated server makes sense, I may be able to set up a dedicated server, give you a large storage allocation at a discount, and share the server with others. If you want the link, or want to work with me to set up storage on a shared dedicated server to save money (probably $10-$30 less per month, most likely $20), email me (changelog-complete@mcgurrin.net). With an actual server at the other end of the storage, it is possible to use smb, nfs, rsync, sftp, scp, etc.: whatever has good software available on the server end, as it just has to be installed and configured. If there is significant interest from those who read this, I should be able to put together a shared server with a lot of storage and make the cost low; it is already relatively low for high-storage setups.

Thanks,
Glenn McGurrin

Wouter Verhelst says:

January 26, 2011 at 3:28 am

Might be useful to note that S3 traffic to EC2 hosts is free (provided you stay within the datacenter, i.e., not from S3 Europe to EC2 US). You could boot an EC2 host, download the S3 file there, rsync it, and upload it again, and you’d only be paying for the traffic for your diffs.

Sanjay Arora says:

January 28, 2011 at 4:27 pm

I use backuppc on local network. Have been satisfied till date. It uses rsync & lots of hardlinks. Plus is that it stores only one copy of a file, even from multiple machines. Don’t know much about internals, but it works for me.

I am looking to backup data online, now that I have mastered network backup (am a follow the tutorial kind of DIY admin)….I was thinking of using Amazon.

Research shows that Amazon S3 does not accept hardlinks. Would I be able to store hardlinks if I use an EC2 instance with an EBS store? That way hardlinks don’t go directly into S3, but into a block device that is formatted as an ext3 filesystem and then, on machine shutdown, persists into S3 (the EBS store persists, not the hardlinked files themselves).

Am I thinking along right lines?

With regards.
Sanjay.

1. John Goerzen says:
  
  January 28, 2011 at 7:44 pm
  
  You could do that, or there are various services that let you remotely mount S3 or Rackspace Cloud Files as a FUSE filesystem. How reliable they are, I don’t know.
  
  I am leaning towards using JungleDisk to do backups. It doesn’t do hardlinks directly, but does do block dedup and could back up our backuppc directory.
  
Kaspars says:

January 31, 2011 at 7:02 am

I find that Tarsnap does all the things you described: incremental backups, dedupe, and pricing based on used storage. I’ve been using it for a while and can’t complain.

1. John Goerzen says:
  
  January 31, 2011 at 12:02 pm
  
  tarsnap is also at least twice as expensive as the competition, and without providing any noticeable additional features, doesn’t seem worth it.
  
Donald McRonald says:

January 31, 2011 at 11:15 pm

I plan to set up my own stuff that uses rsync + encfs + EC2 + EBS + S3 snapshots. I’ve tested enough to be fairly certain it’s going to work. The rough idea will be:

Set up an extra machine as a local backup server. Other machines will be nodes. The backup server will execute a backup script on each node via SSH. Each node backup will:

– stop services as needed
– snapshot (lvm) the filesystem
– start services
– mount lvm snapshot
– mount a data-crypt dir with encfs (fuse)
– rsync all needed data from lvm mount to the data-crypt dir
– unmount data-crypt dir
– rdiff-backup the data-crypt dir to the backup server
– return 0 for success

Once all nodes are done running, my backup server will have an up to date, encrypted copy of my data from each node. It will then:

– start an EC2 instance on demand
– attach an EBS store to the instance
– mount the EBS store
– rsync current mirrors from each node to the EBS store
— the rdiff-backup (differential) data will get excluded
– unmount the EBS store
– shut down the on demand EC2 instance
– snapshot the EBS store to S3

I have ~5GB of data that changes very little each month. My backups take <1 hour to run. Once my free micro tier runs out, I figure it will cost (roughly):

– 5GB x $.15 = $.75
– 30 hours x $.02 per hour micro instance = $.60
– 8GB EBS root volume x $.10 = $.80
– 10GB EBS data store x $.10 = $1
-----
$3.15 per month
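
The estimate above can be reproduced with a short calculation. The quantities and rates are the ones quoted in this comment in 2011, not current pricing; the line-item labels and the `monthly_total` helper are made up for illustration.

```python
from decimal import Decimal

# (label, quantity, rate per unit per month) as quoted in the comment
LINE_ITEMS = [
    ("S3 snapshot storage, GB", Decimal("5"), Decimal("0.15")),
    ("micro instance, hours", Decimal("30"), Decimal("0.02")),
    ("EBS root volume, GB", Decimal("8"), Decimal("0.10")),
    ("EBS data store, GB", Decimal("10"), Decimal("0.10")),
]


def monthly_total(items):
    """Sum quantity * rate over all line items."""
    return sum(quantity * rate for _label, quantity, rate in items)


if __name__ == "__main__":
    for label, quantity, rate in LINE_ITEMS:
        print(f"{label}: ${quantity * rate:.2f}")
    print(f"total: ${monthly_total(LINE_ITEMS):.2f} per month")  # total: $3.15 per month
```

Using `Decimal` rather than floats keeps the cents exact, which matters little at this scale but avoids surprises when the same calculation is run over hundreds of gigabytes.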

THE big thing for me though is that I can create a customized instance-store AMI that can be used to test my backups. The storage for the AMI should be about $1 per month and the small instances cost about $0.09 per hour. The $.02 per hour micro instances can only use EBS volumes, not instance-stores.

One caveat: Testing in the cloud requires you to put your encryption keys in the cloud. You also have to (temporarily) provide your passphrase to unlock the keys. Don't do it if you don't understand the risks.

The other thing is that once my data is sitting on my local backup server (encrypted), it's fairly easy to archive it to another off-site machine using rdiff-backup.

Has anyone else done anything similar?

I've looked at CrashPlan, JungleDisk, Carbonite, Dropbox, Mozy, SpiderOak, TurnKey Linux and none of them do what I want. Hence the reason for rolling my own. Of all those:

CrashPlan – Really good for what I call 'poor man peering' since you can set up your own off-site box as a backup destination. I assume the encryption keys are local, but didn't research it.

JungleDisk – I couldn't get a disk mounted in linux. The command line tools suck. No good docs on how the encryption works (at least not easy to find). $3 per month for the version I was trying.

Carbonite – They claim client side encryption keys + remote access and don't explain how they manage both (FYI, it's impossible). That doesn't work for me. I only looked at them because they advertise on TWiT.

Dropbox – Awesome, but not intended for the way I want to do backups. If you have technically challenged friends that need online backups, this is best option. I didn't research how / if they do encryption though.

Mozy – No linux support. Online access to data. Couldn't find enough security info.

SpiderOak – The most security conscious of everyone. Encryption keys never have to hit their servers. UI sucks. Restoring old versions of data is virtually impossible (one file at a time). Their de-duplication seems to be pretty good (I've used them for over a year).

TurnKey Linux – Not strictly backups, but I was updating my virtual machines anyway, so I gave them a try. These guys have the right idea, but they use duplicity (incremental backups) to S3 instead of rsync (differential backups). I want differential backups. You can look at the code that makes it work (python IIRC).

My biggest complaint about every existing backup service is the total lack of technical documentation about how the service works.

Matthew Dornquast says:

February 1, 2011 at 10:17 am

CrashPlan exceeds your security requirements (you can use your own key/escrowing policy), has the best de-duplication in the bunch, and is significantly less expensive than the constraints you’ve outlined.
It also has something the others don’t: asynchronous, autonomous validation of backup sets at destinations, with automatic recovery/healing if there is bit rot.

1. John Goerzen says:
  
  February 1, 2011 at 10:49 am
  
  Hi Matthew,
  
  CrashPlan Pro does sound interesting for us. And because of that, I’ve looked into it a bit more.
  
  I actually sent CrashPlan Pro an email more than 24 hours ago with some questions. Other than a form letter, I have yet to hear back.
  
  Specifically, I asked:
  
  * About the pricing of your centralized storage for CrashPlan Pro customers.
  
  * What on-disk format you use for local backups, and if I can extract them with utilities such as tar
  
  * What workarounds you might have for CrashPlan Pro not dealing properly with hardlinks (which prevents a person from properly restoring a Unix machine from scratch using a backup)
  
  * What level of redundancy is provided with your offsite storage, and what level of geographic redundancy is provided.
  
  I must say that the lack of responsiveness on this doesn’t make me very comfortable with the product. I’ve had occasion to contact JungleDisk once, and despite it being about a more complex problem, heard back in less than 10 minutes. That’s what I’d be looking for from a company I’d be trusting critical data to.
  
  Thanks Matthew!
  
  — John
  
s3rsync says:

February 10, 2012 at 12:13 am

You can back up to the Amazon S3 cloud using rsync. rsync has a bandwidth-efficient backup algorithm that can save time and money. For that you need the gateway service s3rsync.com


The Changelog

Comments on family, technology, and society
