A Cloud Filesystem

A Slashdot question today about putting all the unused disk space on corporate desktops to use got me thinking. Before I start: commenters there raised valid points about performance, reliability, and so on.

But let’s say that we have a “cloud filesystem”. This filesystem would, at its core, have one configurable parameter: how many copies of each block of data must exist in the cloud. Now, we add servers with disk space to the cloud. As we add servers, the amount of available space on the cloud increases, subject to having enough space for replication according to our parameters.

Then, say we want a minimum of 3 copies of each block replicated. Each write to the filesystem will then cause a write to at least 3 different servers. Now, what if one server goes down? If the cloud filesystem is short on space, we may be down to only 2 copies of some blocks until that server comes back up. Otherwise, space permitting, it can rebuild that third copy on other servers.
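A toy sketch of that write/rebuild logic (all names here are illustrative; no real system's API is implied):

```python
import random

class CloudFS:
    """Toy model of the replication scheme: each block must live on
    `replicas` distinct servers; when a server dies, under-replicated
    blocks are copied to surviving servers, space permitting."""

    def __init__(self, replicas=3):
        self.replicas = replicas
        self.servers = {}   # server name -> set of block ids it holds
        self.blocks = {}    # block id -> set of server names holding it

    def add_server(self, name):
        # Adding a server grows the cloud's usable capacity.
        self.servers[name] = set()

    def write(self, block_id):
        # Fan the write out to `replicas` distinct servers.
        targets = random.sample(sorted(self.servers), self.replicas)
        for s in targets:
            self.servers[s].add(block_id)
        self.blocks[block_id] = set(targets)

    def fail_server(self, name):
        lost = self.servers.pop(name)
        for block_id in lost:
            self.blocks[block_id].discard(name)
        # Rebuild the missing copies on other servers, space permitting.
        for block_id in lost:
            holders = self.blocks[block_id]
            spare = [s for s in sorted(self.servers) if s not in holders]
            needed = self.replicas - len(holders)
            for s in spare[:needed]:
                self.servers[s].add(block_id)
                holders.add(s)
```

With 4 servers and 3 replicas, killing any one server leaves every block back at 3 copies after the rebuild; with only 3 servers, the cloud would sit at 2 copies until a new server joins.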

Now, has this been done before? As far as I can tell, no. Wouldn’t it be sweet?

But there are some projects that are close. Most notably, GlusterFS. GlusterFS does all of the above, except the automated bits. You can have this 3-copy redundancy, but you have to manually tell it where each copy goes, manually reconfigure if a server goes offline, etc. Other options such as NBD, OpenAFS, GFS, DRBD, Lustre, etc. aren’t really well-suited to this scenario for various reasons.

So, what does everyone think? Can this work? Has it been done outside of Google?

16 thoughts on “A Cloud Filesystem”

  1. Similar things have been done in P2P protocol research.

    Freenet is similar to that but focuses more on the anonymity of exchanges.

    OceanStore (http://oceanstore.cs.berkeley.edu/) and the protocol it builds on support information retrieval at speeds that can sustain file system operations.

    In fact, several P2P protocols I’ve read about claim aims similar to what you describe.

    On the other hand, I’m not aware of any practical implementation that would be usable in enterprises.

  2. Something like Tahoe (http://allmydata.org/trac/tahoe) might also fit the bill, although not just yet…

  3. Anonymous says:

    mogilefs, if you don’t need POSIX

  4. Axioplase says:

    Well, I have this strange feeling that you want your filesystem to rely on “mobile code”, where your data D spawns (here) three processes embedding D, which loop:
    1: checking whether both co-processes are still alive, and spawning a copy of any that is not (after electing the only spawner);
    2: waiting for the special message ‘contents’ in order to return the value they embed, that is, D.

    All in all, your FS idea can be stated very easily, and has probably been written thousands of times in Erlang (or should I say, as other instances of a particular mobility problem, of which your FS is one).
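    In Python rather than Erlang, the supervision loop this describes might be sketched roughly like this (a toy simulation, not real mobile code; all names are made up for illustration):

    ```python
    class Replica:
        """One of the three processes embedding the data D."""
        def __init__(self, data):
            self.data = data
            self.alive = True

        def handle(self, message):
            # Step 2: answer the special 'contents' message with D.
            if message == "contents":
                return self.data

    def supervise(replicas, target=3):
        """Step 1 of each loop: check the co-processes and respawn
        copies of any that have died. The first survivor acts as the
        elected 'only spawner'."""
        survivors = [r for r in replicas if r.alive]
        spawner_data = survivors[0].data
        while len(survivors) < target:
            survivors.append(Replica(spawner_data))
        return survivors
    ```

    In Erlang this would naturally be a supervisor with linked processes and message passing; here it is flattened into a single-threaded simulation just to show the shape of the loop.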


  5. More or less one year ago I attended a talk by some guys working at Google. From what I understood (and if I understood correctly), they use something similar to this. Data is spread and duplicated over several machines. If one of them fails, the data is “reconstructed” automatically on another healthy machine.
    Really interesting stuff… :-)

  6. Hmm… only now did I see the last line of your post, about Google :-/

  7. Aaron Kaplan says:

    [url=http://hadoop.apache.org/core/]Hadoop[/url]’s HDFS? I have no experience with it personally, but it’s advertised as an open-source version of Google’s distributed filesystem. (Hadoop also implements map-reduce.)

    1. John Goerzen says:

      From what I can tell, Hadoop is not a general-purpose filesystem; it requires a special API.

  8. Anonymous says:

    This presentation addresses exactly the issue of what happens if a server with a copy of yours goes down:

  9. I was pointed to this link by one of my friends. We are happy to see GlusterFS as your closest recommendation. I am just adding a little more info…

    GlusterFS allows you to create a cluster of mirror packs and schedule I/O based on round-robin, adaptive-least-usage, random, non-uniform-file-access, or custom logic. It is even possible to say *.mpeg 3 copies, *.dbm 4 copies, *.html 2 copies, and so on. GlusterFS has a self-heal feature to automatically fix inconsistencies while the file system is mounted and actively serving data.

    New features such as hot-add, hot-remove, hot-move, and hot-spare are in development. They will be integrated into the self-heal framework to achieve your automatic re-provisioning requirements.

    Anand Babu Periasamy
    GlusterFS Team
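    The per-pattern replica idea described above (“*.mpeg 3 copies, *.dbm 4 copies, …”) could be mimicked in a few lines. This is a hypothetical sketch, not GlusterFS configuration syntax:

    ```python
    import fnmatch

    # Hypothetical policy table: glob pattern -> replica count.
    POLICY = [("*.mpeg", 3), ("*.dbm", 4), ("*.html", 2)]
    DEFAULT_REPLICAS = 3

    def replicas_for(filename):
        """Return the replica count for the first matching pattern,
        falling back to a cloud-wide default."""
        for pattern, count in POLICY:
            if fnmatch.fnmatch(filename, pattern):
                return count
        return DEFAULT_REPLICAS
    ```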

  10. Wuala seems to be very intelligently implemented… The big thing to do to boost the effectiveness of replication is to use it in combination with erasure coding. I’ve [url=http://nikolasco.livejournal.com/394640.html]posted a technical summary[/url] on my blog, with links to relevant papers … I was given an invite by Dominik, which in turn lets me invite (a few) others, so let me know if you’d like to try it out. Sadly, write support is still forthcoming (Linux support is present), but the ability to mount Wuala as a network drive is pretty spiffy. Note that Wuala is aimed at ‘get that file, store this file’ operation, not general filesystem operations.

    I’m aware of GlusterFS, but it seems to be aimed at, well, clusters; in its case, uptime is high, latency is low, and nodes can be assumed to behave well.

    I’ve read a number of papers on things like what your post describes, and Wuala at least sorta implements … let me know if you’re interested and I can at least provide the titles.

  11. David says:


    A project that might be of interest to you: [url=http://danga.com/mogilefs/]MogileFS[/url]


  12. Paul Mineiro says:

    Well, we have done this in a trivial way in “walkenfs” by wrapping Erlang’s built-in Mnesia distributed database with FUSE, and then providing automatic node discovery and fragment rebalancing systems. (Details at http://dukesoferl.blogspot.com/).

    Full disclosure though: it is rather slow. It fits our need of storing our rrdtool data in a cluster-wide redundant fashion, but we can see with strace that the latencies are pretty poor (100 ms to write a 4 KB block is typical, because we wait for the remote replicas to write as well).

    Since it’s working for our modest workload we’re not working on improving it at this time, but the source code is out there for interested parties.

  13. Jeff Darcy says:

    This has even been done commercially, by MangoSoft (now dead). The closest current equivalents are probably Tahoe and Wuala.

  14. M S Prasad says:

    Most of the file systems we are mentioning use “erasure codes” to replicate the data in such a way that even if M out of N replicas fail, the data can still be retrieved.
    Hadoop is a DFS suitable for cloud applications, compatible with Google’s MapReduce. EUCALYPTUS, BigTable, Total Recall, Venti, etc. are other DFSs for the cloud.

    What they lack is built-in encryption, data security, and good key management.

    I recently came across a new idea, “Sector & Sphere”, a high-performance data cloud system design (Phil. Trans. R. Soc. A (2009) 367, 2429–2445, by Yunhong Gu and Robert L. Grossman, Univ. of Illinois).
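    As a minimal illustration of the erasure-code idea mentioned above (far simpler than the Reed–Solomon-style codes real systems use): with k equal-length data chunks plus one XOR parity chunk, any single lost chunk can be rebuilt from the survivors.

    ```python
    from functools import reduce

    def xor_bytes(a, b):
        # Bytewise XOR of two equal-length byte strings.
        return bytes(x ^ y for x, y in zip(a, b))

    def encode(chunks):
        """k equal-length data chunks -> k+1 chunks, the last being
        the XOR parity of all the data chunks."""
        return chunks + [reduce(xor_bytes, chunks)]

    def recover(chunks, lost_index):
        """Rebuild the chunk at lost_index by XOR-ing the survivors:
        XOR-ing everything except the lost chunk yields the lost chunk."""
        survivors = [c for i, c in enumerate(chunks) if i != lost_index]
        return reduce(xor_bytes, survivors)
    ```

    This toy code only tolerates one lost chunk; the M-of-N schemes the comment describes generalize the same principle with more parity.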

  15. Alex says:

    I had this idea some time ago too, based on a DHT, P2P, DFS, erasure codes, etc., as everyone has mentioned in the comments.
    I read a lot of papers about implementations of the different components, and research papers…
    But I hit a wall: to make such storage distributed over unreliable computers while providing encryption, ACLs, and key management at the same time, I could not find any cryptographic knowledge or theory able to achieve that.
