Category Archives: Uncategorized

How git-annex replaces Dropbox + encfs with untrusted providers

git-annex has been around for a long time, but I just recently stumbled across some of the work Joey has been doing to it. This post isn’t about its traditional roots in git or all the features it has for partial copies of large data sets, but rather about its live syncing capabilities, a la Dropbox. It takes a bit to wrap your head around, because git-annex is just a little different from everything else. It’s sort of like a different-colored smell.

The git-annex wiki has a lot of great information — both low-level reference and a high-level 10-minute screencast showing how easy it is to set up. I found I had to piece together the architecture between those two levels myself, so I’m writing it all down here in the hope that it benefits others who are curious.

If you just want to use it, you don’t need to know all this. But I like to understand how my tools work.


git-annex lets you set up a live syncing solution that requires no central provider at all, or can be used with a completely untrusted central provider. Depending on your usage pattern, this central provider could require only a few MBs of space even for repositories containing gigabytes or terabytes of data that is kept in sync.

Let’s take a look at the high-level architecture of the tool. Then I’ll illustrate how it works with some scenarios.

Three Layers

Fundamentally, git-annex takes the layers that Dropbox combines into a single service and separates them out. There is the storage layer, which stores the literal data bytes you are interested in; git-annex indexes the data in storage by a hash. There is metadata, which covers things like the filename-to-hash mapping and revision history. And then there is an optional third layer: live signaling, which drives the real-time syncing.

git-annex has several modes of operation, and the one that enables live syncing is called the git-annex assistant. It runs as a daemon, and is available for Linux/POSIX platforms, Windows, Mac, and Android. I’ll be covering it here.

The storage layer

The storage layer is simply blobs of data. These blobs are indexed by a hash, and can optionally be encrypted at rest on remote backends. git-annex has a large number of storage backends; some examples include rsync, a remote machine reachable over ssh with git-annex installed, WebDAV, S3, Amazon Glacier, a removable USB drive, etc. There’s a huge list.
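To make this concrete, here is roughly what setting up two special remotes looks like from the command line; the remote names and paths are my own invented examples:

# A removable USB drive, treated as a plain directory-backed remote
git annex initremote usbdrive type=directory directory=/media/usb/annex encryption=none

# An rsync server, with content encrypted before it ever leaves the client
git annex initremote offsite type=rsync rsyncurl=user@example.com:/srv/annex encryption=shared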

One of git-annex’s features is that each client knows the state of each storage repository, as well as the capability set of each storage repository. So let’s say you have a workstation at home and a laptop you take with you to work or the coffee shop. You’d like changes on one to be instantly recognized on the other. With something like Dropbox or OwnCloud, every file in the set you want synchronized has to reside on a server in the cloud. With git-annex, it can be configured such that the server in the cloud only contains a copy of a file until every client has synced it down, at which point it gets removed. Think about it – that is often what you want anyhow, so why maintain an unnecessary copy after it’s synced everywhere? (This behavior is, of course, configurable.) git-annex can also avoid storing in the cloud entirely if the machines are able to reach each other directly at least some of the time.
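You can see this per-repository knowledge at work with git annex whereis, which reports every repository known to hold a given file’s content. The file name, repository descriptions, and truncated UUIDs below are invented; the output is along these lines:

git annex whereis photos/img_1234.jpg
whereis photos/img_1234.jpg (2 copies)
  	8eea3a4e-... -- here (workstation)
  	5f1c2b9d-... -- server
ok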

The metadata layer

Metadata about your files includes a mapping from the file names to the storage location (based on hashes), change history, and information about the status of each machine that participates in the syncing. On your clients, git-annex stores this using git. This detail is very useful to some, and irrelevant to others.

Some of the git-annex storage backends can support only storage (S3, for instance). Some can support both storage and metadata (rsync, ssh, local drives, etc.). You can even configure a backend to support only metadata (more on why that may be useful in a bit). When you are working with a git-backed repository for git-annex, it can hold data, metadata, or both.

So, to have a working sync system, you must have a way to transport both the data and the metadata. The transport for the metadata is generally rsync or git, but it can also be XMPP, in which case git changesets are basically wrapped up in XMPP presence messages. Joey says, however, that there are known issues with XMPP servers sometimes dropping or reordering messages, so he doesn’t currently encourage that method.

The live signaling layer

So once you have your data and metadata, you can already do syncs via git annex sync --contents. But the real killer feature is automatic detection of changes, both local and remote. For that, you need some way of live signaling. git-annex supports two methods.
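In practice, the relevant commands look like this; the subcommands are real, and which ones you use depends on whether you want one-shot or continuous syncing:

# One-shot: push/pull the metadata and transfer file contents
git annex sync --contents

# Continuous: start the assistant daemon, which watches for changes
git annex assistant

# Optional: a local web interface for configuring and monitoring it
git annex webapp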

The first requires ssh access to a remote machine with git-annex installed. In this mode of operation, when the git-annex assistant fires up, it opens a persistent ssh connection to the remote and runs git-annex-shell there, which notifies it of changes to the git metadata repository. When a change is detected, a sync is initiated. This is considered the ideal setup.

A substitute can be XMPP, and git-annex actually converts git commits into a form that can be sent over XMPP. As I mentioned above, there are some known reliability issues with this and it is not the recommended option.


Encryption

When it comes to encryption, you generally are concerned about all three layers. In an ideal scenario, the encryption and decryption happen entirely on the client side, so no service provider ever has any details about your data.

The live signaling layer is encrypted pretty trivially: the ssh sessions are, of course, encrypted, and TLS support in XMPP is pervasive these days. However, this is not end-to-end encryption; those messages are decrypted by the service provider, so a service provider could theoretically spy on metadata, which may include change times and filenames, though not the contents of the files themselves.

The data layer can also be encrypted very trivially. In the case of the “dumb” backends like S3, git-annex can use symmetric encryption or a gpg keypair, and all that ever shows up on the server are arbitrarily-named encrypted objects.
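A sketch of the two options, with a made-up remote name and key ID; encryption=shared stores a symmetric key in the git repository itself, while the hybrid scheme encrypts to your gpg key:

# Symmetric encryption; anyone who can clone the repo can decrypt
git annex initremote cloud type=S3 encryption=shared

# Public-key encryption to a gpg keypair (the key ID is hypothetical)
git annex initremote cloud type=S3 encryption=hybrid keyid=0xDEADBEEF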

You can also use a gcrypt-based git repository. This can cover both data and metadata — and, if the target also has git-annex installed, the live signaling layer. Using a gcrypt-based git repository for the metadata and live signaling is the only way to accomplish live syncing with 100% client-side encryption.

All of these methods are implemented in terms of gpg, and can support symmetric or public-key encryption.

It should be noted here that the current release versions of git-annex need a one-character patch in order to fix live syncing with a remote using gcrypt. For those of you running jessie, I recommend the version in jessie-backports, which is presently 5.20151208. For your convenience, I have compiled an amd64 binary that can drop in over /usr/bin/git-annex if you have this version. You can download it and a gpg signature for it. Note that you only need this binary on the clients; the server can use the version from jessie-backports without issue.

Putting the pieces together: some scenarios

Now that I’ve explained the layers, let’s look at how they fit together.

Scenario 1: Central server

In this scenario, you might have a workstation and a laptop that sync up with each other by way of a central server that also has a full copy of the data. This is the scenario that most closely resembles Dropbox, Box, or OwnCloud.

Here you would basically follow the steps in the git-annex assistant screencast: install git-annex on a server somewhere, and point your clients at it. If you want full end-to-end encryption, I would recommend letting git-annex generate a gpg keypair for you, which you would then need to copy to both your laptop and workstation (but not the server).
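If you go the keypair route, something like this (with a hypothetical key ID) will move the secret key from one client to the other:

# Run on the machine where git-annex generated the key
gpg --export-secret-keys 0xDEADBEEF | ssh laptop gpg --import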

Every change you make locally will be synced to the server, and then from the server to your other PC. All three systems would be configured in the “client” transfer group.

Scenario 1a: Central server without a full copy of the data

In this scenario, everything is configured the same except the central server is configured with the “transfer” transfer group. This means that the actual data synced to it is deleted after it has been propagated to all clients. Since git-annex can verify which repository has received a copy of which data, it can easily enough delete the actual file content from the central server after it has been copied to all the clients. Many people use something like Dropbox or OwnCloud as a multi-PC syncing solution anyhow, so once the files have been synced everywhere, it makes sense to remove them from the central server.
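Configuring this is a matter of putting the server’s repository in the transfer group and using the standard preferred-content expression; run something like this in the repository on the server:

git annex group . transfer
git annex wanted . standard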

This is often a good setup for people. There are some obvious downsides that are sometimes relevant. For instance, to add a third sync client, it must be able to make its initial copy from one of the existing clients. Or, if you intend to access the data from a device such as a cell phone, where you don’t want a copy of all the data all the time, you won’t have as convenient a way to download your data.

Scenario 1b: Split data/metadata central servers

Imagine that you have a shell or rsync account on some remote system where you can run git-annex, but don’t have much storage space. Maybe you have a cheap VPS or shell account somewhere, but it’s just not big enough to hold your data.

The answer to this would be to use this shell or rsync account for the metadata, but put the data elsewhere. You could, for instance, store the data in Amazon S3 or Amazon Glacier. These backends aren’t capable of storing the git-annex metadata, so all you need is a shell or rsync account somewhere to sync up the metadata. (Or, as below, you might even combine a fully distributed approach with this.) Then you can have your encrypted data pushed up to S3 or some such service, which presumably will grow to whatever size you need.

Scenario 2: Fully distributed

Like git itself, git-annex does not actually need a central server at all. If your different clients can reach each other directly at least some of the time, that is good enough. Of course, a given client will not be able to do fully automatic live sync unless it can reach at least one other client, so changes may not propagate as quickly.

You can set this up simply by making ssh connections available between your clients. The git-annex assistant can automatically generate appropriate ~/.ssh/authorized_keys entries for you.
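The generated entries lock the key down so it can only run git-annex-shell. The exact text varies by version, but it is along these lines (the key material and comment are elided/invented here):

command="git-annex-shell -c",no-agent-forwarding,no-port-forwarding,no-pty,no-X11-forwarding ssh-rsa AAAA... annex@laptop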

Scenario 2a: Fully distributed with multiple disconnected branches

You can even have a graph of connections available. For instance, you might have a couple machines at home and a couple machines at work with no ability to have a direct connection between them (due to, say, firewalls). The two machines at home could sync with each other in real-time, as could the two machines at work. git-annex also supports things like USB drives as a transport mechanism, so you could throw a USB drive in your pocket each morning, pop it in to one client at work, and poof – both clients are synced up over there. Repeat when you get home in the evening, and you’re synced there. The USB drive’s repository can, of course, be in the “transfer” group, so data is automatically deleted from it once it’s been synced everywhere.

Scenario 3: Hybrid

git-annex can support LAN sync even if you have a central server. If your laptop, say, travels around but is sometimes on the same LAN as your PC, git-annex can easily sync directly between the two when they are reachable, saving a round-trip to the server. You can assign a cost to each remote, and git-annex will always try to sync first to the lowest-cost path that is available.
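Costs live in ordinary git config, and lower-cost remotes are tried first; the remote names here are hypothetical:

# Prefer the PC on the LAN, falling back to the central server
git config remote.desktop.annex-cost 150
git config remote.server.annex-cost 250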

Drawbacks of git-annex

There are some scenarios where git-annex with the assistant won’t be as useful as one of the more traditional instant-sync systems.

The first and most obvious one is if you want to access the files without the git-annex client. For instance, many of the other tools let you generate a URL that you can email to people, and then they can download files without any special client software. This is not directly possible with git-annex. You could, of course, make something like a public_html directory be managed with git-annex, but it wouldn’t provide things like obfuscated URLs, password-protected sharing, time-limited sharing, etc. that you get with other systems. While you can share your repositories with others that have git-annex, you can’t share individual subdirectories; for a given repository, it is all or nothing.

The Android client for git-annex is a pretty interesting thing: it is mostly a small POSIX environment, providing a terminal, git, gpg, and the same web interface that you get on a standalone machine. This means that the git-annex Android client is about as fully functional as a desktop one. It also has a quick setup process for syncing off your photos/videos. On the other hand, its integration with the Android ecosystem is poor compared to most other tools.

Other git-annex features

git-annex has a lot to offer besides the git-annex assistant. Besides the things I’ve already mentioned, any given git-annex repository — including your client repository — can have a partial copy of the full content. Say, for instance, that you set up a git-annex repository for your music collection, which is quite large. You want some music on your netbook, but don’t have room for it all. You can tell git-annex to get or drop files from the netbook’s repository without deleting them remotely. git-annex has quite a few ways to automate and configure this, including making sure that at least a certain number of copies of a file exist in your git-annex ecosystem.
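On the netbook, that workflow is just a couple of commands; the paths are made-up examples:

# Fetch the content you want locally
git annex get music/beethoven/

# Later, free the space; drop refuses if it would violate numcopies
git annex drop music/beethoven/

# Insist that at least 2 copies of everything exist somewhere
git annex numcopies 2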


Conclusion

I initially started looking at git-annex due to the security issues with encfs, and the difficulty of setting up ecryptfs for this use. (I had been layering encfs atop OwnCloud.) git-annex certainly ticks the box for me security-wise, and obviously anything encrypted with encfs wasn’t going to be shared with others anyhow. I’ll be using git-annex more in the future, I’m sure.

Update 2016-06-27: I had some issues with git-annex in this configuration.

Amtrak Airlines

I came downstairs this morning and found a surprise waiting for me. Chairs from all over had been gathered up and arranged in rows, airline style. Taped to the wall was a “food court” sign. At the front was a picture of an airplane, decked out with the Amtrak logo of all things, and a timetable taped to our dining room table.


Jacob soon got out string to be seatbelts, too. And, using his copy machine, printed out a picture of a wing to tape to the side of the “airplane”.


And here is the “food court” sign Oliver made:


This plane was, according to the boys, scheduled to leave at 9:30. It left a fashionable 2 hours late or so. They told me I would be the pilot, and had me find headphones to be my “headset”. (I didn’t wear my real headset on the grounds that then I wouldn’t be able to hear them.) Jacob decided he would be a flight attendant, his grandma would be the co-pilot, and Oliver would be the food court worker. The food court somehow seemed to travel with the plane.

Oliver made up a menu for the food court. It consisted of, and I quote: “trail mix, banana, trail mix, half banana, trail mix, trail mix, trail mix”. He’s already got the limited selection of airport food down pat, I can see.

Jacob said the flight would be from Chicago to Los Angeles, and so it was. Since it was Amtrak Airlines, we were supposed to pretend to fly over the train tracks the whole way.

If it’s not Christmas yet, we just invent some fun, eh? Pretty clever.

First steps: Debian on an Asus t100, and some negative experience with Gnome

The Asus t100 tablet is this amazing and odd little thing: it sells for under $200, yet has a full-featured Atom 64-bit CPU, 2GB RAM, 32 or 64GB SSD, etc. By default, it ships with Windows 8.1. It has a detachable keyboard, so it can be used as a tablet or a very small 10″ laptop.

I have never been a fan of Windows on it. It does the trick for web browsing and email, but I’d like to ssh into my machines sometimes, and I just can’t bring myself to type sensitive passwords into Windows.

I decided to try installing Debian on it. After a lot of abortive starts due to the UEFI-only firmware, I got jessie installed. (The installer was fine; it was Debian Live that wouldn’t boot.) I got wifi and battery status working via an upgrade to the 4.1 kernel. A little $10 Edimax USB adapter came in handy along the way, sparing a bunch of copying via USB disks.

I have been using XFCE with XMonad for so many years that I am something of a stranger to other desktop environments. XMonad isn’t really suitable for a tablet, however, so I thought I’d try Gnome, especially after a fairly glowing review about its use on a tablet.

I am already disappointed after just a few minutes. There is no suspend button on the menu. Some Googling showed that holding Alt while hovering over the power-off button will change it to a suspend button. And indeed it does. But… uh, what? That is such a common need, and such a non-obvious way to meet it. And pushing the power button does… nothing. That’s right, nothing. Apparently the way to enable some action when you push the power button is to type a settings command into a terminal; there’s no setting for it in the settings panel.
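For the record, the incantation appears to be a gsettings one-liner along these lines; the schema and key name have moved around between GNOME releases, so treat this as illustrative rather than definitive:

gsettings set org.gnome.settings-daemon.plugins.power button-power 'suspend'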

I initially ditched Gnome some years ago due to its penchant for removing features. I had hoped that this much time later, it would have passed that stage, but I’m already disappointed. I was hoping for some really nice integration with the system. But my XFCE setup has a very clear “When power button is pressed” setting. I have no idea why Gnome doesn’t.

Also, the touch screen works fine and it registers my touches, but whenever I touch anywhere, the cursor disappears. Weird, eh?

There are some things to fix yet on the tablet (sound, brightness adjustment, and making suspend reliable) but others have solved these in Ubuntu so I don’t think it’ll be too hard.

In the meantime, any suggestions regarding Gnome? Is it just going to annoy me? Maybe I should try KDE as well. I’ve heard good things about Plasma Active, but I don’t see it in Debian.

I Give Up on Google: Free is Too Expensive

I am really tired of things Google has done lately.

The most recent example is the retirement of Classic Maps. That’s a problem, because the current Maps mysteriously doesn’t show most of my saved (“starred”) places. Google has known about this since at least 2013. There are posts all over their forums about it, going back to when what is now “regular” Google Maps was in beta. Google employees even knew about it and did nothing. For someone who made heavy use of it, this was quite annoying.

But there have been plenty of others:

  • Removing My Places and My Maps from Maps for Android. Those features were used to, for instance, plan trips, highlight routes, add campground possibilities, etc. (They eventually brought this feature back months/years later, in limited form.)
  • Removing the 7-day and month views from Calendar for Android, claiming this was “better” for users. They finally re-added those views a few months later after many complaints. I even participated in a survey process with them where they were clearly struggling to understand why anybody wanted to see 7 days at once, when that feature had been there for years…
  • Removing the XMPP capabilities in Google Talk/Hangouts.
  • Pretty much shutting down Picasaweb, with very strong redirects to Google+ Photos — which still, to this day, doesn’t have a handy feature for embedding a photo in a blog post or anywhere else that’s not, well, Google+.
  • General creeping crapification of everything they touch. It’s almost like Microsoft in the 90s all over again. All of a sudden my starred places stop showing up in Google Maps, but show up in Google Drive — shared with the whole world. What? I never wanted them in Google Drive to start with.
  • All the products that are all-but-dead — Google Groups and the sad state of the Deja News archives. Maybe Google+ itself goes on this list soon?
  • Looks like they’re trying to kill off Google Voice and merge it into Hangouts, but I can’t send a text from the web with Hangouts.
  • And this massive list of discontinued services and products. Yeowch. Remember when Google Code was hot, and then they didn’t touch it at all for years?
  • And they still haven’t fixed some really basic things, such as letting people change their email address when they get married.
  • Dropping SIP from Grand Central, ActiveSync from Apps, etc.

I even used to use Flickr, then moved to Picasa when Yahoo stopped investing in Flickr. Now I’m back to Flickr, because Google stopped investing in Picasa.

The takeaway is that you can’t really rely on Google for anything. Counting on something being there for an upcoming trip and then having it be suddenly yanked away is a level of frustration that just makes the service not so useful. Never knowing when obvious things (7-day calendar view) will be removed means you just can’t depend on it.

So, are there good alternatives? Things I’m thinking of include:

  • Alternative calendar applications. Ideally it would support shared calendars for multiple people in a family, an Android app that lets you easily view some or all calendars, etc. I wonder whether there is really only one serious competitor here? Last I looked — a few years ago — none of the Open Source options really worked well.
  • Alternative mapping applications. Must-haves include directions, navigation in the car, saving points of interest, and offline storage on Android. Nice-to-haves would include restaurant review integration, etc. Looks like Nokia (HERE) and MapQuest, plus a few OSM spinoffs, are the leading contenders here.
  • Email is easily enough found elsewhere, and I’ve never used Gmail much anyhow.

Anybody else moving off Google?

ssh suddenly stops communicating with some hosts

Here’s a puzzle I’m having trouble figuring out. This afternoon, ssh from my workstation or laptop stopped working to any of my servers (at OVH). The servers are all running wheezy; the local machines, jessie. This happens both on my DSL and when tethered to my mobile phone. The servers had not had any updates applied since the last time ssh worked. When looking at it with ssh -v, the connections were all hanging after:

debug1: SSH2_MSG_KEXINIT sent
debug1: SSH2_MSG_KEXINIT received
debug1: kex: server->client aes128-ctr none
debug1: kex: client->server aes128-ctr none
debug1: sending SSH2_MSG_KEX_ECDH_INIT
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY

Now, I noticed that a server on my LAN — running wheezy — could successfully connect. It was a little different:

debug1: kex: server->client aes128-ctr hmac-md5 none
debug1: kex: client->server aes128-ctr hmac-md5 none
debug1: sending SSH2_MSG_KEX_ECDH_INIT
debug1: expecting SSH2_MSG_KEX_ECDH_REPLY

And indeed, if I run ssh -o MACs=hmac-md5, it works fine.
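To avoid typing that every time, the same workaround can live in ~/.ssh/config for the affected hosts (the host pattern here is just an example):

Host *.ovh.example
    MACs hmac-md5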

Now, I tried rebooting machines at multiple ends of this. No change. I tried connecting from multiple networks. No change. And then, as I was writing this blog post, all of a sudden it works normally again. Supremely weird! Any ideas what I can blame here?

Computer Without a Case

My desk today looks like this:

2014-11-12 11.58.45

Yep, that’s a computer. Motherboard to the right, floppy drives and CD drive stacked on top of the power supply, hard drive to the left.

And it’s an OLD computer. (I had forgotten just how loud these old power supplies are; wow.)

The point of this exercise is to read data off the floppies that I have made starting nearly 30 years ago now (wow). Many were made with DOS, some were made on a TRS-80 Color Computer II (aka CoCo 2). There are 5.25″ disks, 3.5″ disks, and all sorts of formats. Most are DOS, but the TRS-80 ones use a different physical format. Some of the data was written by Central Point Backup (from PC Tools), which squeezed more data on the disk by adding an extra sector or something, if my vague memory is working.

Reading these disks requires low-level playing with controller timing, and sometimes the original software to extract the data. It doesn’t necessarily work under Linux, and certainly doesn’t work with USB floppies or under emulation. Hence this system.

It’s a bridge: old enough to run DOS, new enough to use an IDE drive. I can then hook the IDE drive up to an IDE-to-USB converter and copy the data off it onto my Linux system.

But this was tricky. I started the project a few years ago, but life got in the way. Getting back to it now, with the same motherboard and drive, I just couldn’t get it to boot. I eventually began to suspect the disk geometry settings, and with some detective work from fdisk in Linux plus some research into old BIOS disk size limitations, discovered the problem was a 2GB limit. Through some educated trial and error, I programmed the BIOS with a number of cylinders that worked, set it to LBA mode, and finally my 3-year-old DOS 6.2 installation booted.

I had also forgotten how finicky things were back then. Pop a floppy from a Debian install set into the drive, type dir b:, and the system hangs. I guess there was a reason the reset button was prominent on the front of the computer back then…

I’m hiring a senior Linux sysadmin/architect

I’m never sure whether to post such things here, but I hope that it’s of interest to people: I’m trying to hire a top-notch Linux person for a 100% telecommute position. I’m particularly interested in people with experience managing 500 or more OS instances. It’s a shop with a lot of Debian, by the way. You can apply at that URL and mention you saw it in my blog if you’re interested.

Being Different

This evening, after the boys were in bed, Laura and I sat down to an episode of MASH (a TV series from the 70s) and leftover homemade pumpkin bars. She commented, “Sometimes I wonder what generation we’re in. This doesn’t seem to be something people our age are usually doing.” Probably true. I suppose people my age aren’t usually learning to play the penny whistle or put up antennas in trees either.

We’ve had a fun day today – a different sort of day in a lot of ways. We took the boys for their first Wichita Symphony Orchestra experience — they were doing their first-ever “family concert” (Beethoven Lived Upstairs, which combined Beethoven’s music with a two-person play aimed at kids). And they had an “instrument petting zoo” beforehand. Both boys loved it.


After that, we took them to a sushi place for the first time. We ordered different types of rolls for our table, encouraging them to start with the California roll. They loved it (though Oliver did complain it was a bit hard to eat). Jacob happily devoured everything he could that wasn’t spicy. He would have probably devoured the plate of California roll slices by himself if I hadn’t stopped him and encouraged him to slow down and try some other things too.

It doesn’t seem very common around here to take 5-year-olds to a sushi place and plan on them eating the same sort of food that the adults around them are. It is a lot of fun to be different. Jacob and Oliver both have their unique personalities and interests, and I hope that they continue to find strength and joy in all the ways they are unique.

The Thrill and Stress of Too Many Hobbies

Today, 4PM. Jacob and Oliver excitedly peer at the box in our kitchen – a really big box, taller than them. Inside is the first model airplane I’d ever purchased. The three of us hunkered down on the kitchen floor, opened the box, unpacked the parts, examined the controller, and found the manual with its cryptic assembly directions. Oliver turned some screws while Jacob checked out the levers on the controller. Then they both left for a bit to play with their toy buses.

A little while later, the three of us went outside. It was too windy to fly. I had never flown an RC plane before — only RC quadcopters (much easier to fly), plus some practice time on an RC simulator. But the excitement was too much. So out we went, and the plane took off perfectly, climbed, flew over the trees, and circled above our heads at my command. I even managed a good landing in the wind, despite about 5 aborted attempts due to coming in too high, at the wrong angle, too fast, or with last-minute gusts of wind throwing everything off. I am not sure how I pulled all that off on my first flight, but somehow I did! It was thrilling!

I’ve had a lot of hobbies in my life. Computers have run through many of them; I learned Pascal (a programming language) at about the same time I learned cursive handwriting and started with C at around age 10. It was all fun. I’ve been a Debian developer for some 18 years now, and have written a lot of code, and even books about code, over the years.

Photography, music, literature, history, philosophy, and theology have been interests for quite some time as well. In the last few years, I’ve picked up amateur radio, model aircraft, etc. And last month, Laura led me into Ada’s Technical Books during our visit to Seattle, resulting in me getting interested in Arduino. (The boys and I have already built a light-activated crossing gate for their HO-gauge model trains, and Jacob can now say he’s edited a few characters of C!)

Sometimes I find ways to merge hobbies; I’ve set up all sorts of amateur radio systems on Linux, take aerial photographs, and set up systems to stream music in my house.

But I also have a lot less time for hobbies overall than I once did; other things in life, such as my children, are more important. Some of the code I once worked on actively I no longer use or maintain, and I feel guilty about that when people send bug reports that I have no interest in fixing anymore.

Sometimes I feel a need to cut down, and perhaps have; and then, I get an interest in RC aircraft and find an airplane that is great for a beginner and fairly inexpensive.

Perhaps it is the curse of being a curious person living in an interesting world. Do any of the rest of you have a large number of hobbies? How do you feel about that?