Category Archives: Software

rdiff-backup, ZFS, and rsync scripts

rdiff-backup vs. ZFS

As I’ve been writing about backups, I’ve gone ahead and run some tests with rdiff-backup. I have been using rdiff-backup personally for many years now — probably since 2002, when I packaged it up for Debian. It’s a nice, stable system, but I like to look at other options every so often.

rdiff-backup stores an uncompressed current mirror of the filesystem, similar to rsync. History is achieved by the use of compressed backwards binary deltas generated by rdiff (using the rsync algorithm). So, you can restore the current copy very easily — a simple cp will do if you don’t need to preserve permissions. rdiff-backup restores previous copies by applying all necessary binary deltas to generate the previous version.
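
For reference, day-to-day usage looks something like this (the hostname and paths are hypothetical):

    # Back up /home to a remote host; increments accumulate under rdiff-backup-data/
    rdiff-backup /home backuphost::/backups/home

    # Restore a file as it existed 10 days ago
    rdiff-backup --restore-as-of 10D backuphost::/backups/home/somefile /tmp/somefile

    # See how much space each increment consumes
    rdiff-backup --list-increment-sizes backuphost::/backups/home

    # Delete history older than 3 months
    rdiff-backup --remove-older-than 3M backuphost::/backups/home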

Things I like about rdiff-backup:

  1. Bandwidth-efficient
  2. Reasonably space-efficient, especially where history is concerned
  3. Easily scriptable and nice CLI
  4. Unlike tools such as duplicity, it never needs a periodic full backup — old history can be deleted without impacting the ability to restore more recent backups

Things I don’t like about it:

  1. Speed. It can be really slow. Deleting 3 months’ worth of old history takes hours. It has to unlink vast numbers of files — and that’s pretty much it, but it does it really slowly. Restores, backups, etc. are all slow as well. Even just getting a list of your increment sizes so you’d know how much space would be saved can take a very long time.
  2. The current backup copy is stored without any kind of compression, which is not at all space-efficient
  3. It creates vast numbers of little files that take forever to delete or summarize

So I thought I would examine how efficient ZFS would be. I wrote a script that would replay the rdiff-backup history — first it would rsync the current copy onto the ZFS filesystem and make a ZFS snapshot. Then each previous version was processed by my script (rdiff-backup’s files are sufficiently standard that a shell script can process them), and a ZFS snapshot created after each. This lets me directly compare the space used by rdiff-backup to that used by ZFS using actual history.

I enabled gzip-3 compression and block dedup in ZFS.
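
For the curious, the ZFS setup and the replay amounted to roughly the following. This is a simplified sketch, not my actual script, and the pool name and source path are hypothetical:

    zfs create tank/backups
    zfs set compression=gzip-3 tank/backups
    zfs set dedup=on tank/backups

    # Current mirror first (the rdiff-backup destination minus its metadata directory)...
    rsync -aH --delete --exclude rdiff-backup-data /srv/rdiff-backup/ /tank/backups/
    zfs snapshot tank/backups@current
    # ...then, for each increment, reconstruct that older version in place and snapshot it, e.g.:
    # zfs snapshot tank/backups@2010-10-15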

My backups were nearly 1TB in size and the amount of space I had available for ZFS was roughly 600GB, so I couldn’t test all of them. As it happened, I tested the ones that were the worst-case scenario for ZFS: my photos, music collection, etc. These files had very little duplication and very little compressibility. Plus a backup of my regular server that was reasonably compressible.

The total size of the data backed up with rdiff-backup was 583GB. With ZFS, this came to 498GB. My dedup ratio on this was only 1.05 (meaning 5%, or 25GB, saved). The compression ratio was 1.12 (60GB saved). The combined ratio was 1.17 (85GB saved). Interestingly, 498 + 85 = 583.

Remember that the data under test here was mostly a worst-case scenario for ZFS. It would probably have done better had I had the time to throw the rest of my dataset at it (such as the 60GB backup of my iPod, which would have mostly deduplicated with the backup of my music server).

One problem with ZFS is that dedup is very memory-hungry. This is common knowledge, and the usual advice is that you need roughly 2GB of RAM per TB of disk when using dedup. I don’t have quite that much to dedicate to it, so ZFS got VERY slow and thrashed the disk a lot after the ARC grew to about 300MB. I found some knobs in zfsrc and the zfs command that let the ARC cache grow bigger. But the machine in question has only 2GB RAM, and is doing lots of other things as well, so this barely improved anything. Note that this dedup RAM requirement is not out of line with what is expected from these sorts of solutions.

Even if I got an absolutely stellar dedup ratio of 2:1, that would get me at most 1TB. The cost of buying a 1TB disk is less than the cost of upgrading my system to 4GB RAM, so dedup isn’t worth it here.

I think the lesson is: think carefully about where dedup makes sense. If you’re storing a bunch of nearly-identical virtual machine images — the sort of canonical use case for this — go for it. A general fileserver — well, maybe you should just add more disk instead of more RAM.

Then that raises the question: if I don’t need dedup from ZFS, do I bother with it at all, or just use ext4 and LVM snapshots? I think ZFS still makes sense, given its built-in support for compression and very fast snapshots — LVM snapshots are known to cause serious degradation to write performance once enabled, which ZFS doesn’t.

So I plan to switch my backups to use ZFS. A few observations on this:

  1. Some testing suggests that the time to delete a few months of old snapshots will be a minute or two with ZFS compared to hours with rdiff-backup.
  2. ZFS has shown itself to be more space-efficient than rdiff-backup, even without dedup enabled.
  3. There are clear performance and convenience wins with ZFS.

Backup Scripts

So now comes the question of backup scripts. rsync is obviously a pretty nice choice here — and if used with --inplace it may even play nicely with ZFS snapshots even when dedup is off. But let’s say I’m backing up a few machines at home, or perhaps dozens at work. There is a need to automate all of this. Specifically, there’s a need to:

  1. Provide scheduling, making sure that we don’t hammer the server with 30 clients all at once
  2. Provide for “run before” jobs to do things like snapshot databases
  3. Be silent on success and scream loudly via emails to administrators on any kind of error… and keep backing up other systems when there is an error
  4. Create snapshots and provide an automated way to remove old snapshots (or mount them for reading, as ZFS-fuse doesn’t support the .zfs snapshot directory yet); see the sketch after this list
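
A nightly per-host job covering most of these needs might look roughly like this; it is only a sketch, with hypothetical hostnames, paths, pool layout, and retention:

    #!/bin/bash
    set -e
    HOST=server1
    FS=tank/backups/$HOST

    # "Run before" hook on the client (e.g. snapshot databases); scream on failure
    if ! ssh root@$HOST /usr/local/sbin/pre-backup; then
        echo "pre-backup failed on $HOST" | mail -s "backup error: $HOST" admin@example.com
        exit 1
    fi

    # Pull the data; --inplace rewrites changed blocks rather than creating new files,
    # which keeps ZFS snapshots smaller
    rsync -aH --delete --inplace --exclude /proc --exclude /sys root@$HOST:/ /$FS/

    # Snapshot today's state, then prune snapshots older than 90 days
    zfs snapshot $FS@$(date +%Y-%m-%d)
    CUTOFF=$(date -d '90 days ago' +%Y-%m-%d)
    zfs list -H -t snapshot -o name -r $FS | while read -r snap; do
        if [[ "${snap#*@}" < "$CUTOFF" ]]; then
            zfs destroy "$snap"
        fi
    done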

To date I haven’t found anything that looks suitable. I found a shell script system called rsbackup that does a large part of this, but something about using a script whose homepage is a forum makes me less than 100% confident.

On the matter of securing the backups, rsync comes with a good-looking rrsync script (inexplicably installed under /usr/share/doc/rsync/scripts instead of /usr/bin on Debian) that can help secure the SSH authorization. GNU rush also looks like a useful restricted shell.
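
For example, after copying rrsync somewhere executable, a line like this in a client's authorized_keys restricts the backup server's key to read-only rsync access (the key and names are placeholders):

    command="/usr/local/bin/rrsync -ro /",no-agent-forwarding,no-port-forwarding,no-pty,no-X11-forwarding ssh-rsa AAAA...key... backup@backupserver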

Research on deduplicating disk-based and cloud backups

Yesterday, I wrote about backing up to the cloud. I was specifically looking at cloud backup services. I’ve been looking into various options there, but also at various options for disk-based backups. I’d like to have both onsite and offsite backups, so both types of backup are needed. Also, it is useful to think about how the two types of backups can be combined with minimal overhead.

For the onsite backups, I’d want to see:

  1. Preservation of ownership, permissions, etc.
  2. Preservation of symlinks and hardlinks
  3. Space-efficient representation of changes — ideally binary deltas or block-level deduplication
  4. Ease of restoring
  5. Support for backing up Linux and Windows machines

Deduplicating Filesystems for Local Storage

Although I initially thought of block-level deduplicating file systems as something to use for offsite backups, they could also make an excellent choice for onsite disk-based backups.

rsync-based dedup backups

One way to use them would be to simply rsync data to them each night. Since copies are essentially free, we could do (or use some optimized version of) cp -r current snapshot/2011-01-20 or some such to save off historic backups. Moreover, we’d get dedup both across and within machines. And, many of these can use filesystem-level compression.
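
Per machine, the nightly job would then be little more than this (paths are hypothetical; cp -a rather than cp -r so ownership and permissions survive):

    mkdir -p /backups/client1/snapshots
    rsync -aH --delete --exclude /proc --exclude /sys root@client1:/ /backups/client1/current/
    cp -a /backups/client1/current /backups/client1/snapshots/$(date +%Y-%m-%d)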

The real upshot of this is that the entire history of the backups can be browsed as a mounted filesystem. It would be fast and easy to find files, especially when users call about that file they deleted at some point in the past but don’t remember when, exactly what it was called, or exactly where it was stored. We can do a lot more with find and grep to locate these things than we could with the restore console in Bacula (or any other backup program). Since it is a real mounted filesystem, we could also do fun things like make tarballs of it at will, zip parts up, scp them back to the file server, whatever. We could potentially even give users direct access to their files to restore things they need for themselves.

The downside of this approach is that rsync can’t store all the permissions unless it’s running as root on the system. Wrappers such as rdup around rsync could help with that. Another downside is that there isn’t a central scheduling/statistics service. We wouldn’t want the backup system to be hammered by 20 servers trying to send it data at once. So there’d be an element of rolling our own scripts, though not too bad. I’d have preferred not to authorize a backup server with root-level access to dozens of machines, but that may be inescapable in this instance.

Bacula and dedup

The other alternative I thought of was a system such as Bacula with disk-based “volumes”. A Bacula volume is normally a tape, but Bacula can just write them to disk files. This lets us use the powerful Bacula scheduling engine, logging service, pre-backup and post-backup jobs, etc. Normally this would be an egregious waste of disk space. Bacula, like most tape-heritage programs, will write out an entire new copy of a file if even one byte changes. I had thought that I could let block-level dedupe reduce the storage size of Bacula volumes, but after looking at the Bacula block format spec, this won’t be possible, as each block will have timestamps and such in it.

The good things about this setup revolve around using the central Bacula director. We need only install bacula-fd on each server to be backed up, and it has a fairly limited set of things it can do. Bacula already has built-in support for defining simple or complicated retention policies. Its director will email us if there is a problem with anything. And its logs and catalog are already extensive and enable us to easily find out things such as how long backups take, how much space they consume, etc. And it backs up Windows machines intelligently and comprehensively in addition to POSIX ones.

The downsides are, of course, that we don’t have all the features we’d get from having the entire history on the filesystem all at once, and that space is used far less efficiently. Not only that, but recovering from a disaster would require a more extensive bootstrapping process.

A hybrid option may be possible: automatically unpacking Bacula backups after they’ve run onto the local filesystem. Dedupe should ensure this doesn’t take additional space — if the Bacula blocksize aligns with the filesystem blocksize. That is certainly not a given, however. It may also make sense to use Bacula for Windows and rsync/rdup for Linux systems.

This seems, however, rather wasteful and useless.

Evaluation of deduplicating filesystems

I set up and tested three deduplicating filesystems available for Linux: S3QL, SDFS, and zfs-fuse. I did not examine lessfs. I ran a similar set of tests for each (a sketch of the procedure follows the list):

  1. Copy /usr/bin into the fs with tar -cpf - /usr/bin | tar -xvpf - -C /mnt/testfs
  2. Run commands to sync/flush the disk cache. Evaluate time and disk used at this point.
  3. Rerun the tar command, putting the contents into a slightly different path in the test filesystem. This should consume very little additional space since the files will have already been there. This will validate that dedupe works as expected, and provide a hint about its efficiency.
  4. Make a tarball of both directories from the dedup filesystem, writing it to /dev/zero (to test read performance)
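
A minimal sketch of that procedure as a script (the mountpoint is hypothetical):

    cd /mnt/testfs
    time sh -c 'tar -cpf - /usr/bin | tar -xpf - -C /mnt/testfs'          # first copy
    time sync; du -sh /mnt/testfs                                         # flush, then measure
    mkdir second
    time sh -c 'tar -cpf - /usr/bin | tar -xpf - -C /mnt/testfs/second'   # second copy
    sync; du -sh /mnt/testfs
    time tar -cf /dev/zero .   # read test (/dev/zero rather than /dev/null, which GNU tar special-cases)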

I did not attempt to flush read caches during this, but I did flush write caches. The test system has 8GB RAM, 5GB of which was free or in use by a cache. The CPU is a Core2 6420 at 2.13GHz. The filesystems which created files atop an existing filesystem had ext4 mounted noatime beneath them. ZFS was mounted on an LVM LV. I also benchmarked native performance on ext4 as a baseline. The data set consists of 3232 files and 516MB. It contains hardlinks and symlinks.

Here are my results. Please note the comments below as SDFS could not accurately complete the test.

Test                        ext4    S3QL    SDFS    zfs-fuse
First copy                  1.59s   6m20s   2m2s    0m25s
Sync/Flush                  8.0s    1m1s    0s      0s
Second copy+sync            N/A     0m48s   1m48s   0m24s
Disk usage after 1st copy   516MB   156MB   791MB   201MB
Disk usage after 2nd copy   N/A     157MB   823MB   208MB
Make tarball                0.2s    1m1s    2m22s   0m54s
Max RAM usage               N/A     150MB   350MB   153MB
Compression                 none    lzma    none    gzip-2

It should be mentioned that these tests pretty much ruled out SDFS. SDFS doesn’t appear to support local compression, and it severely bloated the data store, which was much larger than the original data. Moreover, it permitted any user to create and modify files, even if the permissions bits said that the user couldn’t. tar gave many errors unpacking symlinks onto the SDFS filesystem, and du -s on the result threw up errors as well. Besides that, I noted that find found 10 fewer files than in my source data. Between the huge memory consumption, the data integrity concerns, and inefficient disk storage, SDFS is out of the running for this project.

S3QL is optimized for storage to S3, though it can also store its files locally or on an sftp server — a nice touch. I suspect part of its performance problem stems from being designed for network backends, and using slow compression algorithms. S3QL worked fine, however, and produced no problems. Creating a checkpoint using s3qlcp (faster than cp since it doesn’t have to read the data from the store) took 16s.

zfs-fuse appears to be the most-used ZFS implementation on Linux at the moment. I set up a 2GB ZFS pool for this test, and set dedup=on and compression=gzip-2. When I evaluated compression in the past, I hadn’t looked at lzjb. I found a blog post comparing lzjb to the gzip options supported by zfs and wound up using gzip-2 for this test.

ZFS really shone here. Compared to S3QL, it took 25s instead of over 6 minutes to copy the data over — and used only 28% more space. I suspect that if I had selected gzip-9 compression it would have been closer to S3QL in both time and space. But creating a ZFS snapshot was nearly instantaneous. Although ZFS-fuse probably doesn’t have as many users as ZFS on Solaris, it is still available in Debian and has good backing behind it. I feel safer using it than I do using S3QL. So I think ZFS wins this comparison.

I spent quite some time testing ZFS snapshots, which are instantaneous. (Incidentally, ZFS-fuse can’t mount them directly as documented, so you create a clone of the snapshot and mount that.) They worked out as well as could be hoped. Due to dedupe, even deleting and recreating the entire content of the original filesystem resulted in less than 1MB additional storage used. I also tested creating multiple filesystems in the zpool, and confirmed that dedupe even works between filesystems.
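
The clone workaround looks something like this (filesystem and snapshot names are hypothetical):

    zfs snapshot test/data@before
    # zfs-fuse can't browse .zfs/snapshot, so clone the snapshot and read from the clone
    zfs clone test/data@before test/restore
    ls /test/restore
    zfs destroy test/restore    # throw the clone away when finished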

Incidentally — wow, ZFS has a ton of awesome features. I see why you OpenSolaris people kept looking at us Linux folks with a sneer now. Only our project hasn’t been killed by a new corporate overlord, so I guess that maybe didn’t work out so well for you… <grin>.

The Cloud Tie-In

This discussion leaves open another question: what to do about offsite backups? Assuming for the moment that I want to back them up over the Internet to some sort of cloud storage facility, there are roughly three options:

  1. Get an Amazon EC2 instance with EBS storage and rsync files to it. Perhaps run ZFS on that thing.
  2. Use a filesystem that can efficiently store data in S3 or Cloud Files (S3QL is the only contender here)
  3. Use a third-party backup product (JungleDisk appears to be the leading option)

There is something to be said for using a different tool for offsite backups — if one tool turns out to have some tool-level issue, the backups made with the other won’t share it.

One of the nice things about JungleDisk is that bandwidth is free, and disk is the same $0.15/GB-mo that RackSpace normally charges. JungleDisk also does block-level dedup, and has a central management interface. This all spells “nice” for us.

The only remaining question would be whether to just use JungleDisk to back up the backup server, or to put it on each individual machine as well. If it just backs up the backup server, then administrative burdens are lower; we can back everything there up by default and just not worry about it. On the other hand, if there is a problem with our main backups, we could be really stuck. So I’d say I’m leaning towards ZFS plus some sort of rsync solution and JungleDisk for offsite.

I had two people suggest CrashPlan Pro on my blog. It looks interesting, but it is a very closed product, which makes me nervous. I like using standard tools and formats — that gives me more peace of mind, control, and recovery options. CrashPlan Pro supports multiple destinations, and the company says it does cloud hosting, but doesn’t list pricing anywhere. So I’ll probably not mess with it.

I’m still very interested in what comments people may have on all this. Let me know!

Wikis, Amateur Radio, and Debian

As I have been getting involved with amateur radio this year, I’ve been taking notes on what I’m learning about certain things: tips from people on rigging up a bicycle antenna to achieve a 40-mile range, setting up packet radio in Linux, etc. I have long run a personal, private wiki where I put such things.

But I really wanted a convenient place to put this stuff in public. There was no reason to keep it private. In fact, I wanted to share with others what I’ve learned. And, since I wanted to let others add their tips if they wish, I set up a public MoinMoin instance. So far, most of my attention has focused on the amateur radio section of it.

This has worked out pretty well for me. Sometimes I will cut and paste tips from emails into there, and then after trying them out, edit them into a more coherent summary based on my experiences.

Now then, on to packet radio and Debian. Packet radio is a digital communications mode that runs on the amateur radio bands. It is a routable networking protocol that typically runs at 300bps, 1200bps, or 9600bps. My packet radio page gives a better background on it, but essentially AX.25 — the packet protocol — is similar to a scaled-down TCP/IP. One interesting thing about packet is that, since it can use the HF bands, it can have direct transcontinental wireless links. More common are links spanning 30-50 miles on VHF and UHF, as well as those going across a continent on HF.

Linux is the only operating system I know of that has AX.25 integrated as a first-class protocol in the kernel. You can create AX.25 sockets and use them with the APIs you’re familiar with already. Not only that, but the Linux AX.25 stack is probably the best there is, and it interfaces easily with TCP/IP — there are global standards for encapsulating TCP/IP within AX.25 and AX.25 within UDP, and both are supported on Linux. Yes, I have telnetted to a machine to work on it over VHF. Of Linux distributions, Debian appears to have the best AX.25 stack built-in.
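
To give a flavor of how little glue is required, here is a rough sketch only, with a made-up callsign, assuming a KISS TNC on /dev/ttyUSB0 and the Debian ax25-tools/ax25-apps packages:

    # /etc/ax25/axports  (port name, callsign-SSID, serial speed, paclen, window, description)
    #   vhf  N0CALL-1  9600  255  2  2m packet port

    kissattach /dev/ttyUSB0 vhf    # attach the TNC to the "vhf" AX.25 port
    axcall vhf N0CALL-2            # connect to another station from the command line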

The AX.25 support in Linux is great, but it’s rather under-documented. So I set up a page for packet radio on Linux. I’ve had a great deal of fun with this. It’s amazing what you can do running a real networking protocol at 300bps over long-distance radio. I’ve had real-time conversations with people, connected to their personal BBSes and sent them mail, and even used AX.25 “nodes” (think of them as a kind of router or bridge; you can connect in to them and then connect back out on the same or different frequencies to extend your reach) to reach systems that I can’t contact directly.

MoinMoin has worked out well for this. It has an inviting theme and newbie-friendly interface (I want to encourage drive-by contributions).

Debconf10

Debconf10 ended a week ago, and I’m only now finding some time to write about it. Funny how it works that way sometimes.

Anyhow, the summary of Debconf has to be: this is one amazing conference. Despite being involved with Debian for years, this was my first Debconf. I often go to one conference a year that my employer sends me to. In the past, it’s often been OSCon, which was very good, but Debconf was much better than that even. For those of you considering Debconf11 next year, perhaps this post will help you make your decision.

First of all, as might be expected from a technical conference, Debconf was of course informative. I particularly appreciated the enterprise track, which was very relevant to me. Unlike many other conferences, Debconf has some rooms specifically set aside for BoFs. With a day or two warning, you can get your event in one of those rooms on the official schedule. That exact thing happened with a virtualization BoF — I thought the topic was interesting, given the recent shifts in various virtualization options. So I emailed the conference mailing list, and we got an event on the schedule a short while later — and had a fairly large group turn out to discuss it.

The “hallway track” — conversations struck up with others in hallways or hacklabs — was also better at Debconf than at other conferences. Partly that may be because, although there were fewer people at Debconf, they very much tended to be technical people whose interests aligned with my own. Partly it’s probably also because the keysigning party, which went throughout the conference, encouraged meeting random people. That was a great success, by the way.

So Debconf succeeded at informing, which is perhaps why many people go to these things. But it also inspired, especially Eben Moglen’s lecture. Who would have thought I’d come away from a conference enthused about the very real potential we have to alter the dynamics of some of the largest companies in the world today by using Free Software to its greatest potential?

And, of course, I had fun at Debconf. Meeting new people — or, more commonly, finally meeting in person people I’d known for years — was great. I got a real sense of the tremendously positive aspect of Debian’s community, which I must admit I have sometimes overlooked during certain mailing list discussions. This was a community of people, not just a bunch of folks attending a random conference for a week, and that point underlined a lot of things that happened.

Of course, it wasn’t 100% perfect, and it won’t ever be. But still, my thanks to everyone that organized, volunteered, and attended Debconf. I’m now wishing I’d been to more of them, and hope to attend next year’s.

Time to learn a new language

I have something of an informal goal of learning a new programming language every few years. It’s not so much a goal as it is something of a discomfort. There are so many programming languages out there, with so many niches and approaches to problems, that I get uncomfortable with my lack of knowledge of some of them after a while. This tends to happen every few years.

The last major language I learned was Haskell, which I started working with in 2004. I still enjoy Haskell and don’t see anything displacing it as my primary day-to-day workhorse.

Yet there are some languages that I’d like to learn. I have an interest in cross-platform languages; one of my few annoyances with Haskell is that it can’t (at least with production quality) be compiled into something like Java bytecode or something else that isn’t architecture-dependent. I have long had a soft spot for functional languages. I haven’t had such a soft spot for static type checking, but Haskell’s type inference changed that for me. Also I have an interest in writing Android apps, which means some sort of Java tie-in would be needed.

Here are my current candidates:

  • JavaScript. I have never learned the language but dislike it intensely based on everything I have learned about it (especially the diverging standards of implementation). Nevertheless, there are certain obvious reasons to try it — the fact that most computers and mobile phones can run it out of the box is an important one.
  • Scheme. Of somewhat less interest since I learned Common Lisp quite a while back. I’m probably pretty rusty at it, but I’m not sure Scheme would offer me anything novel that I can’t find in Haskell — except for the ability to compile to JVM bytecode.
  • Lua — it sounds interesting, but I’m not sure if it’s general-purpose enough to merit long-term interest.
  • Scala sounds interesting — an OOP and FP language that compiles to JVM bytecode.
  • Smalltalk. Seems sad I’ve never learned this one.
  • There are some amazing webapps written using Cappuccino. The Github issue interface is where I heard about this one.
  • Eclipse. I guess it’s mostly not a programming language but an IDE, but then there’s some libraries (RCP?) or something with it — so to be honest, I don’t know what it is. Some people seem very excited about it. I tried it once, couldn’t figure out how to just open a file and start editing already. Made me feel like I was working for Initech and wouldn’t get to actually compile something until my TPS coversheets were in order. Dunno, maybe it’s not that bad, but I never really understood the appeal of something other than emacs/vi+make.
  • A Haskell web infrastructure such as HSP or hApps. Not a new language, but might as well be…

Of some particular interest to me is that Haskell has interpreters for Scheme, Lua, and JavaScript as well as code generators for some of these languages (though not generic Haskell-to-foo compilers).

Languages not in the running because I already know them include: OCaml, POSIX shell, Python, Perl, Java, C, C++, Pascal, BASIC, Common Lisp, Prolog, SQL. Languages I have no interest in learning right now include Ruby (not different enough from what I already know plus bad experiences with it), any assembly, anything steeped in the Microsoft monoculture (C#, VB, etc.), or anything that is hard to work with outside of an Emacs or vim environment. (If your language requires or strongly encourages me to use your IDE or proprietary compiler, I’m not interested — that means you, flash.)

Brief Reviews of Languages I Have Used

To give you a bit of an idea of where I’m coming from:

  • C: Not much to say there. I think its pros and cons are well-known. I consider it to be too unwieldy for general-purpose use these days, though find myself writing code in it every few years for some reason or other.
  • Perl: The first major interpreted language I learned. Stopped using it entirely after learning Python (didn’t see any advantage of Perl over Python, and plenty of disadvantages.)
  • Python: Used to really like it. Both the language and I have changed. It is no longer the “clean” language it was in the 1.5 days. Too many lists-that-aren’t-lists, __underscorethings__, etc. Still cleaner than Perl. It didn’t scale up to large projects well, and the interpreted dynamic nature left me not all that happy to use it for some tasks. Haven’t looked at Python 3, but also it probably isn’t ready for prime time yet anyhow.
  • Java: Better than C++ for me. Cross-platform bytecode features. That’s about all that I have to say about it that’s kind. The world’s biggest source of carpal tunnel referrals in my book. Mysteriously manages to have web apps that require 2GB of RAM just to load. Dunno where that came from; Apache with PHP and mod_python takes less than 100M.
  • Haskell: A pretty stellar mix of a lot of nice features, type inference being one of the very nice ones in my book. My language of choice for almost all tasks. Laziness can be hard for even experienced Haskellers to understand at times, and the libraries are sometimes in a bit of flux (which should calm down soon). Still a very interesting language and, in fact, a decent candidate for my time, as there are parts of it I’ve never picked up, including some modules and web toolkits.
  • OCaml: Tried it, eventually discarded it. An I/O library that makes you go through all sorts of contortions to be able to open a file read/write isn’t cool in my book.

Jacob has a new computer — and a favorite shell

Earlier today, I wrote about building a computer with Jacob, our 3.5-year-old, and setting him up with a Linux shell.

We did that this evening, and wow — he loves it. While the Debian Installer was running, he kept begging to type, so I taught him how to hit Alt-F2 and fired up cat for him. That was a lot of fun. But even more fun was had once the system was set up. I installed bsdgames and taught him how to use worm. worm is a simple snake-like game where you use the arrow keys to “eat” the numbers. That was a big hit, as Jacob likes numbers right now. He watched me play it a time or two, then tried it himself. Of course he crashed into the wall pretty quickly, which exits the game.

I taught him how to type “worm” at the computer, then press Enter to start it again. Suffice it to say he now knows how to spell worm very well. Yes, that’s right: Jacob’s first ever Unix command was…. worm.

He’d play the game, and cackle if he managed to eat a number. If he crashed into a wall, he’d laugh much harder and run over to the other side of the room.

Much as worm was a hit, the Linux shell was even more fun. He sometimes has a problem with the keyboard repeat, and one time typed “worrrrrrrrrrrrrrrrrrm”. I tried to pronounce that for him, which he thought was hilarious. He was about to backspace to fix it, when I asked, “Jacob, what will happen if you press Enter without fixing it?” He looked at me with this look of wonder and excitement, as if to say, “Hey, I never thought of that. Let’s see!” And a second later, he pressed Enter.

The result, of course, was:

-bash: worrrrrrrrrrrrrrrrrrm: command not found

“Dad, what did it do?”

I read the text back, and told him it means that the computer doesn’t know what worrrrrrrrrrrrrrrrrrm means. Much laughter. At that point, it became a game. He’d bang at random letters, and finally press Enter. I’d read what it said. Pretty soon he was recognizing the word “bash”, and I heard one time, “Dad, it said BASH again!!!” Sometimes if he’d get semicolons at the right place, he’d get two or three “bashes”. That was always an exciting surprise. He had more fun at the command line than he did with worm, and I think at least half of it was because the shell was called bash.

He took somewhat of an interest in the hardware part earlier in the evening, though not quite as much. He was interested in opening up other computers to take parts out of them, but bored quickly. The fact that Terah was cooking supper probably had something to do with that. He really enjoyed the motherboard (and learned that word), and especially the CPU fan. He loved to spin it with his finger. He thought it interesting that there would be a fan inside his computer.

When it came time to assign a hostname, I told Jacob he could name his computer. Initially he was confused. Terah suggested he could name it “kitty”, but he didn’t go for it. After a minute’s thought, he said, “I will name it ‘Grandma Marla.’” Confusion from us — did he really understand what he was saying? “You want to name your computer ‘Grandma Marla’?” “Yep. That will be silly!” “Sure you don’t want to name it Thomas?” “That would be silly! No. I will name my computer ‘Grandma Marla.’” OK then. My DNS now has an entry for grandma-marla. I had wondered what he would come up with. You never know with a 3-year-old!

It was a lot of fun to see that sense of wonder and experimentation at work. I remember it from the TRS-80 and DOS machine, when I would just try random things to see what they would do. It is lots of fun to watch it in Jacob too, and hear the laughter as he discovers something amusing.

We let Jacob stay up 2 hours past his bedtime to enjoy all the excitement. Tomorrow the computer moves to his room. Should be loads of excitement then too.

Introducing the Command Line at 3 years

Jacob is very interested in how things work. He’s 3.5 years old, and into everything. He loves to look at propane tanks, checking the pressure meter and opening the lids on top to see the vent underneath. Last night, I showed him our electric meter and the spinning disc inside it.

And, more importantly, last night I introduced him to the Linux command line interface, which I called the “black screen.” Now, Jacob can’t read yet, though he does know his letters. He had a lot of fun sort of exploring the system.

I ran “cat”, which simply let him bash on the keyboard and, whenever he pressed Enter, echoed what he typed back at him. I taught him how to hold Shift and press a number key to get a fun symbol. His favorite is the “hat” above the 6.

Then I ran tr a-z A-Z for him, and he got to watch the computer convert every lowercase letter into an uppercase letter.

Despite the fact that Jacob enjoys watching Youtube videos of trains and even a bit of Railroad Tycoon 3 with me, this was some pure exploration that he loves. Sometimes he’d say, “Dad, what will this key do?” Sometimes I didn’t know; some media keys did nothing, and some other keys caused weird things to appear. My keyboard has back and forward buttons designed to use with a web browser. He almost squealed with delight when he pressed the forward button and noticed it printed lots of ^@^@^@ characters on the screen when he held it down. “DAD! It makes LOTS of little hats! And what is that other thing?” (The at-sign).

I’ve decided it’s time to build a computer for Jacob. I have an old Sempron motherboard lying around, and an old 9″ black-and-white VGA CRT that’s pretty much indestructible, plus an old case or two. So it will cost nothing. This evening, Jacob will help me find the parts, and then he can help me assemble them all. (This should be interesting.)

Then I’ll install Debian while he sleeps, and by tomorrow he should be able to run cat all by himself. I think that, within a few days, he can probably remember how to log himself in and fire up a program or two without help.

I’m looking for suggestions for text-mode games appropriate to a 3-year-old. So far, I’ve found worm from bsdgames that looks good. It doesn’t require him to have quick reflexes or to read anything, and I think he’ll pick up using the arrow keys to move it just fine. I think that tetris is probably still a bit much, but maybe after he’s had enough of worm he would enjoy trying it.

I was asked on Twitter why I’ll be using the command line for him. There are a few reasons. One is that it will actually be usable on the 9″ screen, but another one is that it will expose the computer at a different level than a GUI would. He will inevitably learn about GUIs, but learning about a CLI isn’t inevitable. He won’t have to master coordination with a mouse right away, and there’s pretty much no way he can screw it up. (No, I won’t be giving him root yet!) Finally, it’s new and different to him, so he’s interested in it right now.

My first computer was a TRS-80 Color Computer (CoCo) II. Its primary interface, a BASIC interpreter, I guess counts as a command-line interface. I remember learning how to use that, and later DOS on a PC. Some of the games and software back then had no documentation and crashed often. Part of the fun, the challenge, and sometimes the frustration, was figuring out just what a program was supposed to do and how to use it. It will be fun to see what Jacob figures out.

Review: Linux IM Software

I’ve been looking at instant messaging and chat software lately. Briefly stated, I connect to Jabber and IRC networks from at least three different computers. I don’t like having to sign in and out on different machines. One of the nice features about Jabber (XMPP) is that I can have clients signing in from all over the place and it will automatically route messages to the active one. If the clients are smart enough, that is.

Gajim

I have been using Gajim as my primary chat client for some time now. It has a good feature set, but has had a history of being a bit buggy for me. It used to have issues when starting up: sometimes it would try to fire up two copies of itself. It still has a bug when being fired up from a terminal: if you run gajim & and exit the shell right away, it will simply die; you have to wait a few seconds before closing the terminal you launched it from. It has also had issues with failing to reconnect properly after a dropped network connection and with generating spurious “resource already in use” errors. Upgrades sometimes fix bugs, and sometimes introduce them.

The latest one I’ve been dealing with is its auto-idle support. Sometimes it will fail to recognize that I am back at the machine. Even weirder, sometimes it will set one of my accounts to available status, but not the other.

So much for my complaints about Gajim; it also has some good sides. It has excellent multi-account support. You can have it present your multiple accounts as separate sections in the roster, or you can have them merged. Then, say, all your contacts in a group called Friends will be listed together, regardless of which account you use to contact them.

The Jabber protocol (XMPP) permits you to connect from multiple clients. Each client specifies a numeric priority for its connection. When someone sends you a message, it will be sent to the connection with the highest priority. The obvious feature, then, is to lower your priority when you are away (or auto-away due to being idle), so that you always get IMs at the device you are actively using. Gajim supports this via letting you specify timeouts that get you into different away states, and using the advanced configuration editor, you can also set the priority that each state goes to. So, if Gajim actually recognized your idleness correctly, this would be great.

I do also have AIM and MSN accounts which I use rarely. I run Jabber gateways to each of these on my server, so there is no need for me to use a multiprotocol client. That also is nice because then I can use a simple Jabber client on my phone, laptop, whatever and see all my contacts.

Gajim does not support voice or video calls.

Due to an apparent bug in Facebook, the latest Gajim release won’t connect to Facebook servers, but there is a patch that claims to fix it.

Psi

Psi is another single-protocol Jabber client, and like Gajim, it runs on Linux, Windows, and MacOS. Psi has a nicer GUI than Gajim, and is more stable. It is not quite as featureful, and one huge omission is that it doesn’t support dropping priority on auto-away (though it, weirdly, does support a dropped priority when you manually set yourself away).

Psi doesn’t support account merging, so it always shows my contacts from one account separately from those from another. I like having the option in Gajim.

There is a fork of Psi known variously as psi-dev or psi-plus or Psi+. It adds that missing priority feature and some others. Unfortunately, I’ve had it crash on me several times. Not only that, but the documentation, wiki, bug tracker, everything is available only in Russian. That is not very helpful to me, unfortunately. Psi+ still doesn’t support account merging.

Both branches of Psi support media calling.

Kopete

Kopete is a KDE multiprotocol instant messenger client. I gave it only about 10 minutes of time because it is far from meeting my needs. It doesn’t support adjustable priorities as far as I can tell. It also doesn’t support XMPP service discovery, which is used to do things like establish links to other chat networks using a Jabber gateway. It also has no way to access ejabberd’s “send message to all online users” feature (which can be accessed via service discovery), which I need in emergencies at work. It does offer multimedia calls, but that’s about it.

Update: A comment pointed out that Kopete can do service discovery, though it is in a very non-obvious place. However, it still can’t adjust priority when auto-away, so I still can’t use it.

Pidgin

Pidgin is a multiprotocol chat client. I have been avoiding it for years, with the legitimate fear that it was “jack of all trades, master of none.” Last I looked at it, it had the same limitations that Kopete does.

But these days, it is more capable. It supports all those XMPP features. It supports priority dropping by default, and with a plugin, you can even configure all the priority levels just like with Gajim. It also has decent, though not excellent, IRC protocol support.

Pidgin supports account merging — and in fact, it doesn’t support any other mode. You can, for instance, tell it that a given person on IRC is the same as a given Jabber ID. That works, but it’s annoying because you have to manually do it on every machine you’re running Pidgin on. Worse, they used to support a view without merged accounts, but don’t anymore, and they think that’s a feature.

Pidgin does still miss some nifty features that Gajim and Psi both have. Both of those clients will not only tell you that someone is away, but if you hover over their name, tell you how long they have been away. (Gajim says “away since”, while Pidgin shows “last status at”. Same data either way.) Pidgin has the data to show this, but doesn’t. You can manually find it in the system log if you like, but unhelpfully, it’s not in the log for an individual person.

Also, the Jabber protocol supports notifications while in a chat: that the contact is typing, is paying attention to the conversation, or has closed the chat window. Psi and Gajim have configurable support for these; you can send whatever notifications your privacy preferences allow. Pidgin, alas, removed that option, and again they see this as a feature.

Pidgin, as a result, makes me rather nervous. They keep removing useful features. What will they remove next?

It is difficult to change colors in Pidgin. It follows the Gtk theme, and there is a special plugin that will override some, but not all, Gtk options.

Empathy

Empathy supports neither priority dropping when away nor service discovery, so it’s not usable for me. Its feature set appears sparse in general, although it has a unique desktop sharing option.

Update: this section added in response to a comment.

On IRC

I also use IRC, and have been using Xchat for that for quite some time now. I tried IRC in Pidgin. It has OK IRC support, but not great. It can automatically identify to nickserv, but it is under-documented and doesn’t support multiple IRC servers for a given network.

I’ve started using xchat with the bip IRC proxy, which makes connecting from multiple machines easier.

Review: Free Software Project Hosting

I asked for suggestions a few days ago. I got several good ones, and investigated them. You can find my original criteria at the link above. Here’s what I came up with:

Google Code

Its very simple interface appeals to me. It has an issue tracker, a wiki, a download area. But zero integration with git. That’s not necessarily a big problem; I can always keep on hosting git repos at git.complete.org. It is a bit annoying, though, since I wouldn’t get to nicely link commit messages to automatic issue closing.

A big requirement of mine is being able to upload tarballs or ZIP files from the command line in an automated fashion. I haven’t yet checked to see if Google Code exports an API for this. Google Code also has a lifetime limit of 25 project creations, though rumor has it they may lift the limit if you figure out where to ask and ask nicely.

URL: googlecode.com

Gitorious

Gitorious is one of the two Git-based sites that put a strong emphasis on community. Like Github, Gitorious tries to make it easy for developers to fork projects, submit pull requests to maintainers, and work together. This aspect of it does hold some appeal to me, though I have never worked with one of these sites, so I am somewhat unsure of how I would use it.

The downside of Gitorious or Github is that they tie me to Git. While I’m happy with Git and have no plans to change now, I’ve changed VCSs many times over the years when better tools show up; I’ve used, in approximately this order, CVS, Subversion, Arch/tla, baz, darcs, Mercurial, and Git, with a brief use of Perforce at a job that required it. I may use Git for another 3 years, but after 5 years will Git still be the best VCS out there? I don’t know.

Gitorious fails several of my requirements, though. It has no issue tracker and no downloads area.

It can generate a tar.gz file on the fly from the head of any branch, but not a zip file. It is possible to provide a download of a specific revision, but this is not very intuitive for the end user.

Potential workarounds include using Lighthouse for bug tracking (they do support git integration for changelog messages) and my own server to host tarballs and ZIP files — which I could trivially upload via scp.

URL: gitorious.org

Github

At first glance, this is a more-powerful version of Gitorious. It has similar community features, has a wiki, but adds an issue tracker, download area, home page capability, and a bunch of other features. It has about a dozen pre-built commit hooks that do everything from integrating with Lighthouse to popping commit notices into Jabber.

But there are surprising drawbacks, limitations, and even outright bugs all throughout. And it all starts with the user interface.

On the main project page, the user gets both a download button and a download tab. But they don’t do the same thing. Talk about confusing!

The download button will make a ZIP or tarball out of any tag in the repo. The download tab will also do the same, though presented in a different way; but the tab can also offer downloads for files that the maintainer has manually uploaded. Neither one lets you limit the set of tags presented, so if you have an old project with lots of checkpoints, the poor end user has to sift through hundreds of tags to find the desired version. It is possible to make a tarball out of a given branch, so a link to the latest revision could be easy, but still.

Even worse, there’s a long-standing issue where several of the tabs get hidden under other on-screen elements. The wiki tab, project administration tab, and sometimes even the download tab are impacted. It’s been open since February with no apparent fix.

And on top of that, uploading arbitrary tarballs requires — yes — Flash. Despite requests to make it scriptable, they reply that there is no option but Flash and they may make some other option sometime.

The issue tracker is nice and simple. But it doesn’t support attachments. So users can’t attach screenshots, debug logs, or diffs.

I really wanted to like Github. It has so many features for developers. But all these surprising limitations make it a pain both for developers (I keep having to “view source” to find the link to the wiki or the project admin page) and for users (confusing download options, lack of issue attachments). In the end, I think the display bug is a showstopper for me. I could work around some of the others by having a wiki page with links to downloads and revisions and giving that out as the home page perhaps. But that’s a lot of manual maintenance that I would rather avoid.

URL: github.com

Launchpad

Launchpad is the project management service operated by Canonical, the company behind Ubuntu. While Launchpad can optionally integrate well with Ubuntu, that isn’t required, so non-developers like me can work with it fine.

Launchpad does offer issue tracking, but no wiki. It has a forum of sorts though (the “Answers” section). It has some other features, such as blueprints, that would likely only be useful for projects larger than the ones I would plan to use it for.

It does have a downloads area, and they say they have a Python API. I haven’t checked it out, but if it supports scriptable uploads, that would work for me.

Besides the lack of a wiki, Launchpad is also tied to the bzr VCS. bzr was one of the early players in DVCS, written as a better-designed successor to tla/Arch and baz, but has no compelling features over Git or Mercurial for me today. I have no intention of switching to or using it any time soon.

Launchpad does let you “import” branches from another VCS such as Git or svn. I set up an “import” branch for a test project yesterday. 12 hours later, it still hasn’t imported anything; it’s just sitting at “pending review.” I have no idea if it ever will, or why setting up a bzr branch requires no review but a git branch requires review. So I am unable to test the integration between it and the changesets, which is really annoying.

So, some possibilities here, but the bzr-only thing really bugs me. And having to have my git trees reviewed really goes against the “quick and simple” project setup that I would have preferred to see.

URL: launchpad.net

Indefero

Indefero is explicitly a Google Code clone, but aims to be a better Google Code than Google Code. The interface is similar to Google’s — very simple and clean. Unlike Google Code, Indefero does support Git. It supports a wiki, downloads area, and issue tracker. You can download the PHP-based code and run it yourself, or you can get hosting from the Indefero site.

I initially was favorably impressed by Indefero, but as I looked into it more, I am not very impressed right now. Although it does integrate with Git, and you can refer to an issue number in a Git commit, a Git commit can’t close an issue. Git developers use git over ssh to interact with it, but it supports only one ssh key per user — which makes it very annoying if I wish to push changes from all three of the machines I regularly do development on. Despite the fact that this is a “high priority” issue, it hasn’t been touched by the maintainer in almost a month, even though patches have been offered.

Indefero can generate files based on any revision in git, or based on the latest on any branch, but only in ZIP format (no tar.gz).

Although the program looks very nice and the developer clueful, Indefero has only one main active developer or committer, and he is a consultant that also works on other projects. That makes me nervous about putting too many eggs into the Indefero basket.

URL: indefero.net

Trac

Trac is perhaps the gold standard of lightweight project management apps. It has a wiki, downloads, issue tracking, and VCS integration (SVN only in the base version, quite a few others with 3rd-party plugins). I ran trac myself for a while.

It also has quite a few failings. Chief among them is that you must run a completely separate Trac instance for each project. So there is no possible way to go to some dashboard and see all bugs assigned to you from all projects, for instance. That is what drove me away from it initially. That and the serious performance problems that most of its VCS backends have.

URL: trac.edgewall.org

Redmine

Redmine is designed to be a better Trac than Trac. It uses the same lightweight philosophy in general, has a wiki, issue tracker, VCS integration, downloads area, etc. But it supports multiple projects in a sane and nice way. It’s what I currently use over on software.complete.org.

Redmine has no API to speak of, though I have managed to cobble together an automatic uploader using curl. It was unpleasant and sometimes breaks on new releases, but it generally gets the job done.

I have two big problems with Redmine. One is performance. It’s slow. And when web spiders hit it, it sometimes has been so slow that it takes down my entire server. Because of the way it structures its URLs, it is not possible to craft a robots.txt that does the right thing — and there is no plan to completely fix it. There is, however, a 3rd-party plugin that may help.

The bigger problem relates to maintaining and upgrading Redmine. This is the first Ruby on Rails app I have ever used, and let me say it has made me want to run away screaming from Ruby on Rails. I’ve had such incredible annoyances installing and upgrading this thing that I can’t even describe what was wrong. All sorts of undocumented requirements for newer software, gems that are supposed to work with it but don’t, having to manually patch things so they actually work, conflicts with what’s on the system, and nobody in the Redmine, Rails, or Ruby communities being able to help. I upgrade rarely because it is such a hassle and breaks in such spectacular ways. I don’t think this is even Redmine’s fault; I think it’s a Rails and Ruby issue, but nevertheless, I am stuck with it. My last upgrade was a real mess: bugs in the newer PostgreSQL driver — required by the newer gem, which the newer Redmine required — were sending invalid SQL to the database. I finally patched it myself, and this AFTER the whole pain that is installing gems in Ruby.

I’d take a CGI script written in Bash over Ruby on Rails after this.

That said, Redmine has the most complete set of the features I want of all the programs I’ve mentioned on this page.

URL: redmine.org

Savannah

Savannah is operated by the Free Software Foundation, and runs a fork of the SourceForge software. Its fork does support Git, but lacks a wiki. It has the standard *forge issue tracker, download area, home page support, integrated mailing lists, etc. It also has the standard *forge over-complexity.

There is a command-line SourceForge uploader in Debian that could potentially be hacked to work with Savannah, but I haven’t checked.

URL: savannah.nongnu.org

berlios.de

Appears to be another *forge clone. Similar to Savannah, but with a wiki, ugly page layout, and intrusive ads.

URL: berlios.de

SourceForge

Used to be the gold-standard of project hosting. Now looks more like a back alley in a trashy neighborhood. Ads all over the place, and intrusive and ugly ones at that. The ads make it hard to use the interface and difficult to navigate, especially for newbies. No thanks.

Conclusions

The four options that look most interesting to me are: Indefero, Github, Gitorious, and staying with Redmine. The community features of Github, Gitorious, and Launchpad all sound interesting, but I don’t have the experience to evaluate how well they work in practice — and how well they encourage “drive by commits” for small projects.

Gitorious + Lighthouse and my own download server merits more attention. Indefero still makes me nervous due to the level of development activity and single main developer. Github has a lot of promise, but an interface that is too confusing and buggy for me to throw at end users. That leaves me with Redmine, despite all the Rails aggravations. Adding the bot blocking plugin may just get me what I want right now, and is certainly the path of least resistance.

I am trying to find ways to build communities around these projects. If I had more experience with Github or Gitorious, and thought their community features could make a difference for small projects, I would try them.

How To Think About Compression, Part 2

Yesterday, I posted part 1 of how to think about compression. If you haven’t read it already, take a look now, so this post makes sense.

Introduction

In the part 1 test, I compressed a 6GB tar file with various tools. This is a good test if you are writing an entire tar file to disk, or if you are writing to tape.

For part 2, I will be compressing each individual file contained in that tarball individually. This is a good test if you back up to hard disk and want quick access to your files. Quite a few tools take this approach — rdiff-backup, rdup, and backuppc are among them.

We can expect performance to be worse both in terms of size and speed for this test. The compressor tool will be executed once per file, instead of once for the entire group of files. This will magnify any startup costs in the tool. It will also reduce compression ratios, because the tools won’t have as large a data set to draw on to look for redundancy.

To add to that, we have the block size of the filesystem — 4K on most Linux systems. Any file’s actual disk consumption is always rounded up to the next multiple of 4K. So a 5-byte file takes up the same amount of space as a 3000-byte file. (This behavior is not unique to Linux.) If a compressor can’t shrink a file enough to cross at least one 4K boundary, it effectively doesn’t save any disk space. On the other hand, in certain situations, saving one byte of data could free 4K of disk space.
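
You can see this rounding directly with any small file:

    echo hi > tiny.txt
    ls -l tiny.txt    # reports 3 bytes of data
    du -h tiny.txt    # reports 4.0K actually consumed on a 4K-block filesystem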

So, for the results below, I use du to calculate disk usage, which reflects the actual amount of space consumed by files on disk.
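
The test itself is conceptually just one compressor invocation per file, with du measuring the result. Here is a simplified sketch (the path is hypothetical), not the exact harness I used:

    # -n1 forces one compressor run per file, which is exactly what magnifies startup costs
    find /data/test -type f -print0 | xargs -0 -n1 gzip -6
    du -sh /data/test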

The Tools

Based on comments in part 1, I added tests for lzop and xz to this iteration. I attempted to test pbzip2, but it would have taken 3 days to complete, so it is not included here — more on that issue below.

The Numbers

Let’s start with the table, using the same metrics as with part 1:

Tool       MB saved   Space vs. gzip   Time vs. gzip   Cost
gzip       3081       100.00%          100.00%         0.41
gzip -1    2908       104.84%          82.34%          0.36
gzip -9    3091       99.72%           141.60%         0.58
bzip2      3173       97.44%           201.87%         0.81
bzip2 -1   3126       98.75%           182.22%         0.74
lzma -1    3280       94.44%           163.31%         0.63
lzma -2    3320       93.33%           217.94%         0.83
xz -1      3270       94.73%           176.52%         0.68
xz -2      3309       93.63%           200.05%         0.76
lzop -1    2508       116.01%          77.49%          0.39
lzop -2    2498       116.30%          76.59%          0.39

As before, in the “MB saved” column, higher numbers are better; in all other columns, lower numbers are better. I’m using clock seconds here on a dual-core machine. The cost column is clock seconds per MB saved.

Let’s draw some initial conclusions:

  • lzma -1 continues to be both faster and smaller than bzip2. lzma -2 is still smaller than bzip2, but unlike the test in part 1, is now a bit slower.
  • As you’ll see below, lzop ran as fast as cat. Strangely, lzop -3 produced larger output than lzop -1.
  • gzip -9 is probably not worth it — it saved less than 1% more space and took 42% longer.
  • xz -1 is not as good as lzma -1 in either respect, though xz -2 is faster than lzma -2, at the cost of some storage space.
  • Among the tools also considered in part 1, the differences in space and time were both smaller. Across all tools, the difference in time is still far more significant than the difference in space.

The Pretty Charts

Now, let’s look at an illustration of this. As before, the sweet spot is the lower left, and the worst spot is the upper right. First, let’s look at the compression tools themselves:

[Chart: compress2-zoomed — the compression tools by themselves]

At the extremely fast (but less thoroughly compressed) end is lzop. gzip is still the balanced performer, bzip2 still looks really bad, and lzma -1 is still the best high-compression performer.

Now, let’s throw cat into the mix:

[Chart: compress2-big — the same comparison with cat included]

Here’s something notable, that this graph makes crystal clear: lzop was just as fast as cat. In other words, it is likely that lzop was faster than the disk, and using lzop compression would be essentially free in terms of time consumed.

And finally, look at the cost:

[Chart: compress2-efficiency — cost in clock seconds per MB saved]

What happened to pbzip2?

I tried the parallel bzip2 implementation just like last time, but it ran extremely slow. Interestingly, pbzip2 < notes.txt > notes.txt.bz2 took 1.002 wall seconds, but pbzip2 notes.txt finished almost instantaneously. This 1-second startup time for pbzip2 was a killer, and the test would have taken more than 3 days to complete. I killed it early and omitted it from my results. Hopefully this bug can be fixed. I didn’t expect pbzip2 to help much in this test, and perhaps even expected to see a slight degradation, but not like THAT.

Conclusions

As before, the difference in time was far more significant than the difference in space. By compressing files individually, we lost about 400MB (about 7%) of space compared to making a tar file and compressing that. My test set contained 270,101 files.

gzip continues to be a strong all-purpose contender, posting fast compression time and respectable compression ratios. lzop is a very interesting tool, running as fast as cat and yet turning in reasonable compression — though 25% worse than gzip on its default settings. gzip -1 was almost as fast, though, and compressed better. If gzip weren’t fast enough with -6, I’d be likely to try gzip -1 before using lzop, since the gzip format is far more widely supported, and that’s important to me for backups.

These results still look troubling for bzip2. lzma -1 continued to turn in far better times and compression ratios than bzip2. Even bzip2 -1 couldn’t match the speed of lzma -1, and compressed barely better than gzip. I think bzip2 would be hard-pressed to find a comfortable niche anywhere by now.

As before, you can download my spreadsheet with all the numbers behind these charts and the table.