Category Archives: Technology

A 4-year-old, Linux command line, and microphone

There are certain times when I’m really glad that we have Linux in the house for our boys to play with. I’ve already written about how our 4-year-old Jacob has fun with bash and can chain together commands to draw ASCII animated steam locomotives. Today I thought it might be fun to install cw, a program that takes text on standard input and plays it on the console speaker or sound card as Morse code. Just the sort of thing I could see Jacob eventually getting a kick out of.

But his PC was mute. We opened it up and discovered it didn’t have a console speaker. So we traipsed downstairs, dug out an external speaker, and I figured out how to enable the on-board audio chipset in the BIOS. Now the cw command worked, and a lot of other possibilities opened up as well, so we also brought up a microphone.

While Jacob was busy with other things, I set to work getting things hooked up, volume levels adjusted, and wrote some shell scripts for him. I also printed out this reference sheet for Jacob:

He is good at reading but not so good at spelling. I intentionally didn’t write down what the commands do, hoping that this would give him some avenue for exploration. He is already generally familiar with the ones under the quiet category.

I wrote a shell script called “record”. It simply records from the microphone and drops a timestamped WAV file in a holding directory. He can then type “play” to hear whatever he recorded most recently. Easy enough.
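The scripts I actually used are the ones mentioned in the update at the end of this post; as a rough sketch, with the holding directory and audio format picked arbitrarily, the pair amounts to something like this:

    #!/bin/sh
    # "record" (sketch): capture from the default ALSA microphone into a
    # timestamped WAV file until interrupted with Ctrl-C.
    DIR="$HOME/recordings"
    mkdir -p "$DIR"
    arecord -t wav -f cd "$DIR/$(date +%Y%m%d-%H%M%S).wav"

    #!/bin/sh
    # "play" (sketch): play back the most recent recording.
    DIR="$HOME/recordings"
    aplay "$(ls -t "$DIR"/*.wav | head -n 1)"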

But what he really wanted was sound for his ASCII steam locomotive. So with the help of a Google search for “steam train mp3”, I wrote a script “ssl” (sound steam locomotive) that starts playing the sound in the background if it isn’t already going, and then runs sl to show the animation. This was a big hit.
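Again, the posted scripts are the real thing; the idea, roughly, assuming mpg123 is installed and the downloaded sound lives at ~/sounds/train.mp3:

    #!/bin/sh
    # "ssl" (sketch): start the steam-train sound in the background if it
    # isn't already playing, then run sl for the animation.
    if ! pgrep -f sounds/train.mp3 > /dev/null; then
        mpg123 -q "$HOME/sounds/train.mp3" &
    fi
    sl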

I also set it up so he can type “play train” to hear that audio, or “play song” to play our favorite train song (Always a Train in My Dreams by Steve Gillette). Jacob typed that in and sat still for the entire 3 minutes listening to it.

I had to hook up an Ethernet cable to his machine to do all this, and he was very interested that I was hooking his computer up to mine in some way. He thought all the stuff about cables in the walls was quite exciting.

The last thing I did was install flite, a speech synthesis program. I wrote a small shell script called “talk” which reads a line at a time from stdin and invokes flite for each one (this gives more immediate feedback than waiting until a large block of input has been read before starting playback). He had some fun hearing it say his name and other favorite words, but predictably the most fun was when he typed gibberish at it and heard it try to pronounce or spell nonsense words.
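Something along these lines (a sketch; the posted script is the authoritative one):

    #!/bin/sh
    # "talk" (sketch): speak each line of standard input as soon as it arrives.
    # Handing flite one line at a time gives feedback right away instead of
    # waiting for a whole block of input to be read first.
    while IFS= read -r line; do
        flite -t "$line"
    done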

In all, he was very excited about this new world of computer sound that had opened up to him. I’m sure there will be lots of happy experimentation and discovery going on.

Update Feb 10, 2012: I have posted the shell scripts behind this.

Geeks, Hobbies, and Free/Open Source: Feedback Wanted

I’ve been thinking lately about how to improve the ways in which I interact with Free Software projects, and the ways in which they interact with me. Before I proceed to take steps or make suggestions, I’d like to see if others share my traits and observations.

Here are some questions I have been thinking of. If you’d like to help give me anecdotal evidence, please post a comment below this post. Identify the question numbers you are answering. It helps me if you can give specific examples, but if you don’t have the time or memory for that, no problem.

I will post my own answers in a day or two, but the point of this post is listening, not talking, so I’ll not post them immediately.

Hobbies (General – any geeks)

  • H1: To what degree do you like your hobbies to be challenging vs. easy? If something isn’t challenging, does that make it a good, bad, or indifferent candidate for a hobby?
  • H2: To what degree do you like your hobbies to be educational or enlightening?
  • H3: How do you pick up new hobbies? Do you go looking for them? Do you stumble upon them? What excites you to commit time and/or money to them at the beginning?
  • H4: How does your interest wane? What causes you to lose interest in hobbies?
  • H5: For how long do you tend to maintain hobbies? Sub-hobbies?
  • H6: Are your hobbies or sub-hobbies cyclical? In other words, do you lose interest in a hobby for a time, then regain interest for a time, then lose it again? What is the length of time of these cycles, if any?
  • H7: Do you prefer social hobbies or solitary hobbies? (Note that many hobbies, including programming, video gaming, reading, knitting, etc. could be either social or solitary, depending on the inclination of individuals.)
  • H8: Have you ever felt guilt about wanting to stop a hobby or sub-hobby? (For instance, guilt over no longer supporting users of your software project, readers of your e-zine, etc.) Did the guilt keep you going? Was that a good thing?

Examples: video games might be a challenging hobby (depending on the person), but in most cases they aren’t educational.

A hobby might be “video game playing” or “being a Debian developer.” A sub-hobby might be “playing GTA IV”, “playing RPGs”, or “maintaining mutt”.

Free/Open Source Hobbies

  • F1: Considering your answers above, do your FLOSS activities follow the same general pattern as your other hobbies/interests, or are there differences? If there are differences, what are they?
  • F2: Has concern for being expected to support software longer than you will have an interest in it ever been a factor in a decision whether to release source code publicly, or how public to make a release?
  • F3: Has concern over the long-term interest of a submitter in maintaining their patch/contribution ever caused you to consider rejecting it? (Or caused you to avoid using software over the same concern about its author)
  • F4: In general, do you find requirements FLOSS projects place on first-time contributors to be too stringent, not stringent enough, or about right?
  • F5: Have you ever continued contributing to a project past the point where your interest would otherwise motivate you to do so? If so, what caused you to do this? Do you believe that cause is a general positive or negative force for members of the FLOSS community?
  • F6: Have there ever been factors that caused you to stop contributing to a project even though you still had an active interest in doing so? What were they?
  • F7: Have you ever wanted to be able to take a break as a contributor or maintainer of a project, and be able to return to contributing to it later? If so, have you found it easy to do so?
  • F8: What is your typical length of engagement with FLOSS projects (such as Debian) and sub-projects (such as maintaining a particular package)?
  • F9: Does a change in social group ever encourage or discourage you from changing hobbies or sub-hobbies?
  • F10: Have you ever wanted to stop working on a project/sub-project because the problems involved were no longer challenging or educational to you?
  • F11: Have you ever wanted to stop working on a project/sub-project because of issues with the people involved?

Examples on F9: If, say, you are a long-time Perl user and have gone to Perl conferences, but now you are interested in Ruby, would your involvement with the Perl community cause you to avoid taking up the Ruby programming hobby? Or would it cause you to cut your ties with Perl less quickly than your changing interest might dictate? (This is a completely arbitrary example and isn’t meant to start a $LANGUAGE thread.)

Changes over time

  • C1: Do you believe that your answers to any of the above questions have changed over time? If yes, then:
  • C2: What kinds of changes have happened?
  • C3: What caused the change?
  • C4: Do you believe the changes produced positive results for you? For the community?

APRS: World’s Best Social Mapping and Wide-Area Ad-Hoc Wireless Mesh Network

That was quite a headline, and I’m going to try to back it up below.

APRS is the Automatic Packet Reporting System. It’s a system for exchanging brief packets of information. It is most frequently used for mapping applications, but it really does a lot more than that. It has its biggest home in the amateur radio world, but isn’t limited to that, either.

The most common way to use APRS is to hook some device up to a GPS and have it transmit packets containing the position information. These packets can then be plotted on a map, in real time or with history. That in itself isn’t particularly newsworthy these days.

An interesting thing about APRS is that it’s not just positioning. Let’s say that there was a search-and-rescue operation. A person could draw a rectangle on the map indicating the search area, and within about 3 seconds everyone else’s map also shows that rectangle. People have even been known to play chess by sharing and moving objects on APRS!

The next piece that makes this interesting is that APRS is an ad-hoc mesh network. In its traditional implementation, VHF amateur radio, a radio emits a packet with a geolocation in it (a “beacon”), and any other radio within direct range can receive it. Radios can display basic information (such as distance to the other radio, heading, etc.) or be hooked up to a laptop or mapping device for a better display. So if everyone is within a few miles, APRS works without any pre-existing infrastructure at all. This makes it wonderful for use in disaster areas, and it was put to heavy use in Joplin after the tornado there.

But what about radios that are too far away? Any APRS station can also be a digipeater. When a packet is transmitted, it has a maximum hop count. A digipeater hears the packet, decrements the hop count, and re-transmits it. With this mechanism, packets can travel hundreds of miles. It creates a highly resilient network, one that can route around trouble without even having to have an explicit backup route. I could bring a digipeater along in my car — it can be small enough to hold in my hand — and instantly improve APRS reception in an area.

One interesting aspect is that a packet can end up being relayed more times than its hop count, because every digipeater that hears it repeats it. For instance, if a packet leaves my radio and is picked up by a digipeater to my west and one to my east, it can keep on traveling in both directions. This is part of what leads to resiliency.

APRS also functions over the Internet. There is a large network of interconnected Internet servers that exchange all global APRS traffic amongst themselves. Gateways between the radio (RF) and Internet (APRS-IS) services exist, and are called iGates. They are not generally required, but make useful websites like aprs.fi and email gateways possible. As long as an iGate is within a reasonable number of hops from you, you’re effectively linked. And again, if one iGate drops off, another iGate is probably monitoring your traffic too and you never notice. It’s an ad-hoc mesh network that is actually reliable – how about that?

On the PC side, there are many programs for using APRS. The most common one for Windows is called UI-View, but I don’t use Windows so I can’t comment. On Linux, there are programs (such as aprx) for running your digipeater, but the best-known program is Xastir. Xastir lets you download map files to your local disk, and can interface with the APRS-IS Internet service, radios, weather stations, or simply other arbitrary machines to exchange information. Xastir is a very nice program and is well worth the install, despite its somewhat dated-looking interface.

APRS clients, such as APRSDroid, exist for Android and iOS platforms as well.

So let’s say you’re doing something like helping handle food/water stations for a long bike ride. Even if you don’t have anybody with an amateur radio license, you can use APRS to great effect. At your headquarters, you can run Xastir and turn on its “server mode”. This puts everyone on a map. Then you can have everyone turn on APRS on their phone, and have it report to your custom server instead of APRS-IS. Now you have instant visibility into your entire team’s location and status. If you have transport people driving supplies between locations, that’s especially helpful.

In an amateur radio scenario, you would instead have people with radios at each location, and one laptop hooked up to a radio at HQ. This provides an added bonus of not relying on third-party infrastructure such as cellphone towers.

APRS also has a messaging system, similar in concept to text messaging. It works the same way as everything else in APRS. If I want to send a message to Jane, my radio simply emits a packet containing the message and listing Jane as the recipient. It’s digipeated up to its maximum hop count. If Jane is within RF range of one of those digipeaters, she gets the message and her radio ACKs it. Otherwise, it’s delivered into the APRS-IS network — probably several times, which isn’t a problem — and APRS-IS delivers it to the iGate closest to her, and from there it gets digipeated the rest of the way to her.
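For the curious, a raw message packet looks roughly like this (the callsigns are invented; the addressee field is padded to nine characters, and the {42 suffix is the message number that the ACK refers back to):

    N0CALL-7>APRS,WIDE1-1,WIDE2-1::N0CALL-9 :Water stop 3 is out of cups{42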

Here’s an example of something created with APRS. While I was on a bus choir tour last weekend, I had a radio with me that was beaconing all the while. Now it was a small handheld radio inside a large metal bus, so it didn’t always have a digipeater in range. But still, you can go see a detailed map with the trail and even see exactly what path each packet took before it hit the Internet.

If you want to try out Xastir, please grab at least version 2.0 – the version in squeeze has some bugs.

A Proud Dad

I saw this on my computer screen the other day, and I’ve got to say it really warmed my heart. I’ll explain below if it doesn’t provoke that reaction for you.

Evidence a 4-year-old has been using my computer

So here’s why that made me happy. Well for one, it was the first time Jacob had left stuff on my computer that I found later. And of course he left his name there.

But moreover, he’s learning a bit about the Unix shell. sl is a command that displays an animated steam locomotive. I taught him how to use the semicolon to combine commands. So he realized that he can chain calls to sl with semicolons to get a LOT of steam trains all at once, and he was very excited about this discovery.
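In other words, something like this at the prompt:

    sl; sl; sl; sl; sl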

Also he likes how error messages start with the word “bash”.

How do you hire programmers and sysadmins? How should employers evaluate you?

Reading job listings for any sort of IT job is depressing. It’s been quite some time since I’ve had to do that, but how many times have you seen something like this:

  • “5 years of Java experience required.”
  • “3 years of Java experience with modules X, Y, Z required.”
  • “6 years of experience administering Linux machines running RHEL 4 on a Windows 2000 domain with 1500 clients in an educational setting preferred.”

I could go on and on. As a job seeker, that sort of thing is fundamentally devaluing to someone whose strengths are adaptability and quickly learning new tools, languages, or even entire environments. As an employer, it sends a message that you’re not interested in more than a surface look at someone’s strengths, and probably don’t care to hire the best and the brightest. After all, would you turn away a rockstar programmer simply because he or she had been writing filesystem code in C for the last 3 years instead of the latest whizbang Java web widget that will probably be obsolete in a year and unsupported in two? I am quite certain that there are plenty of managers who do. Even if you are a company large enough to have an entire team of people that do nothing but work on that whizbang app, don’t you still want the best you can find, realizing that some of the best people to work on that app may not have even heard of it yet? (And that when the app goes obsolete in 5 years, you’d rather not have to lay off a large team of single-skill people.)

Some of you may know that I work in IT at a manufacturing company. We have a small IT team here, about seven people, and are a heavy Debian shop. And we have a vacancy opening up in our development/Linux admin group. I’m the manager of that group, which is why I’m thinking about this right now.

We’re too small for single-subject specialists to make sense, yet we’re big enough to appreciate skill, experience, flexibility, and rigor. Consequently, when the occasion arises for me to look for new employees, I don’t prepare a laundry list of things we use in-house and would like experience with.

The list of almost-required things generally begins with “Linux” and ends with “experienced”, and has nothing else in between. In other words, I’d like it if I don’t have to explain to you what a symlink or a hardlink is, but I’d be willing to do so if I think you’d internalize it quickly. On the “experienced” side, it would be nice if you already have a well-developed sense of fear at running rm when you’re root, or have designed a storage infrastructure for a network before, or are paranoid about security. But again, if people can pick up those traits on the job, we are usually still interested. If learning how to package up software for Debian, fix bugs in software you’ve never seen in a language you’ve never heard of, raise good questions about things you may not have lots of experience with, and write documentation for it all on a wiki sounds like fun, then that’s probably the kind of person I want, even if you’ve never used our particular tools before.

If I were to judge based on the stuff I normally see in job postings, I guess you might conclude I’m nuts. I don’t think I am, but then again I’m also the only person I know that formats his own resume in hand-crafted LaTeX. What do you all think?

The next question is: how should one evaluate candidates given this sort of philosophy? I’m not a fan of canned tests, or even “whiteboard tests” that tend to be some sort of canned topic that may test the applicant’s specific knowledge base more than overall skill and flexibility. Similarly, as an applicant in years past, I’ve struggled with how to present the “I’ve never used $LANGUAGE, but I know I could pick it up quickly and do it very well” vibe. To certain people, that might sound like BS. To the more geeky managers, perhaps it sounds like what they want.

We’ve built a fairly diverse team on the back of this approach, and it’s worked out well for us so far. I’m interested to hear your thoughts.

Oh, and if you’d like to work for us, you should probably be sending me an email. No, I’m not going to list the address here on this blog post. If you can’t figure it out, I don’t want to hear from you <grin>

Unix Password and Authority Management

One of the things that everyone seems to do differently is managing passwords. We haven’t revisited our approach in quite some time, despite growth in both the company and the IT department.

As I look toward moving some things to the cloud, and shifting offsite backups from carrying tapes to a bank to backups over the Internet, I’m aware that the potential for mischief — whether intentional or not — is magnified. With cloud hosting, a person could, with the press of a button, wipe out the equivalent of racks of machines in a few seconds. With disk-based local and Internet-based offsite backups, someone could pretty quickly wipe out both the local and the remote copies.

Add to that the mysterious fact that many enterprise-targeted services allow only a single username/password per account, and make no provision for ACLs to delegate permissions to others. Even Rackspace Cloud has this problem, as does their JungleDisk backup product, along with many, many other offsite backup products. Amazon AWS seems to be the only real exception to this rule, and its ACL support is more than a little complicated.

So one of the questions we will have to address is the balance of who holds these passwords. Too many people and the probability of trouble, intentional or not, rises. Too few and productivity is harmed, and potentially also the ability to restore. (If only one person has the password, and that person is unavailable, company data may be as well.) The company does have some storage locations, including locked vaults and safe deposit boxes, that no IT people have access to. Putting a record of the passwords in those locations may be a good first step; placing them in the control of people who can’t use them seems reasonable.

But we’ve been thinking of this as it pertains to our local systems as well. We have, for a number of years now, assigned a unique root password to every server. These passwords are then stored in a password-management tool, encrypted with a master password, and stored on a shared filesystem. Everyone in the department therefore can access every password.

Many places where I worked used this scheme, or some variant of it. The general idea was that if root on one machine was compromised and the attacker got root’s password, it would prevent the person from being able to just try that password on the other servers on the network and achieve a greater level of intrusion.

However, the drawback is that we now have more servers than anyone can really remember the passwords for. So many people are just leaving the password tool running. Moreover, while the attack described above is still possible, these days I worry more about automated intrusion attempts that most likely won’t try that attack vector.

A couple of ways we could go would be using a single root password everywhere, or a small set of root passwords. Another option would be to not log in to root accounts at all — possibly even disabling their passwords — and requiring the use of user accounts plus sudo. This hasn’t been practical to date. We don’t want a bunch of machines to depend on LDAP just to be able to use root, and we haven’t been using a tool such as puppet or cfengine to manage this sort of thing. Using such a tool is on our roadmap and could make that approach more manageable. But this approach has risks too. One is that if user accounts can get to root on many machines, then we’re not really more secure than with a standard root password. The second is that it makes it more difficult to detect and enforce password expiration and systematic password changes.
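For reference, the nuts and bolts of the “no direct root logins” option are simple enough; the user and group names here are just examples:

    # Lock root's password so nobody logs in as root directly
    passwd -l root

    # Put each admin into an administrative group (Debian's adduser syntax)
    adduser jane adm

    # Then, via visudo, give that group full sudo rights:
    #   %adm ALL=(ALL) ALL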

I’m curious what approaches other people are taking on this.

rdiff-backup, ZFS, and rsync scripts

rdiff-backup vs. ZFS

As I’ve been writing about backups, I’ve gone ahead and run some tests with rdiff-backup. I have been using rdiff-backup personally for many years now — probably since 2002, when I packaged it up for Debian. It’s a nice, stable system, but I always like to look at other options for things every so often.

rdiff-backup stores an uncompressed current mirror of the filesystem, similar to rsync. History is achieved by the use of compressed backwards binary deltas generated by rdiff (using the rsync algorithm). So, you can restore the current copy very easily — a simple cp will do if you don’t need to preserve permissions. rdiff-backup restores previous copies by applying all necessary binary deltas to generate the previous version.
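For reference, day-to-day use looks roughly like this (the paths are made up; remote repositories use the host::/path syntax):

    # Nightly backup of /home into a local repository
    rdiff-backup /home /backup/home

    # Restore the whole tree as it looked ten days ago
    rdiff-backup -r 10D /backup/home /tmp/home-10-days-ago

    # Expire increments older than three months (the slow part)
    rdiff-backup --remove-older-than 3M /backup/home

    # Show how much space each increment is consuming
    rdiff-backup --list-increment-sizes /backup/home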

Things I like about rdiff-backup:

  1. Bandwidth-efficient
  2. Reasonably space-efficient, especially where history is concerned
  3. Easily scriptable and nice CLI
  4. Unlike tools such as duplicity, there is no need to periodically run full backups — old backups can be deleted without impacting the ability to restore more current backups

Things I don’t like about it:

  1. Speed. It can be really slow. Deleting 3 months’ worth of old history takes hours. It has to unlink vast numbers of files — and that’s pretty much it, but it does it really slowly. Restores, backups, etc. are all slow as well. Even just getting a list of your increment sizes so you’d know how much space would be saved can take a very long time.
  2. The current backup copy is stored without any kind of compression, which is not at all space-efficient
  3. It creates vast numbers of little files that take forever to delete or summarize

So I thought I would examine how efficient ZFS would be. I wrote a script that would replay the rdiff-backup history — first it would rsync the current copy onto the ZFS filesystem and make a ZFS snapshot. Then each previous version was processed by my script (rdiff-backup’s files are sufficiently standard that a shell script can process them), and a ZFS snapshot created after each. This lets me directly compare the space used by rdiff-backup to that used by ZFS using actual history.
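My script isn’t reproduced here, but a rough equivalent, leaning on rdiff-backup’s own restore feature instead of parsing the increment files directly, would look something like this (the pool, paths, and dates are placeholders):

    #!/bin/sh
    # Replay an rdiff-backup repository's history onto a ZFS filesystem,
    # taking a snapshot after each version. A sketch only.
    REPO=/backup/rdiff-repo        # existing rdiff-backup repository
    FS=tank/replay                 # ZFS filesystem, mounted at /tank/replay
    MNT=/tank/replay

    # Oldest first; the dates would come from
    # "rdiff-backup --list-increments $REPO".
    for when in 2010-10-01 2010-11-01 2010-12-01; do
        rm -rf "$MNT"/*            # ignoring top-level dotfiles for brevity
        rdiff-backup --force -r "$when" "$REPO" "$MNT"
        zfs snapshot "$FS@$when"
    done

    # Finish with the current mirror, which is just the repository minus
    # its rdiff-backup-data directory.
    rsync -a --delete --exclude rdiff-backup-data "$REPO/" "$MNT/"
    zfs snapshot "$FS@current"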

I enabled gzip-3 compression and block dedup in ZFS.
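That part is just two property settings on the backup filesystem (the pool/filesystem name is an example):

    zfs set compression=gzip-3 tank/backups
    zfs set dedup=on tank/backups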

My backups were nearly 1TB in size and the amount of space I had available for ZFS was roughly 600GB, so I couldn’t test all of them. As it happened, I tested the ones that were the worst-case scenario for ZFS: my photos, music collection, etc. These files had very little duplication and very little compressibility. Plus a backup of my regular server that was reasonably compressible.

The total size of the data backed up with rdiff-backup was 583 GB. With ZFS, this came to 498GB. My dedup ratio on this was only 1.05 (meaning 5% or 25GB saved). The compression ratio was 1.12 (60GB saved). The combined ratio was 1.17 (85GB saved). Interestingly 498 + 85 = 583.

Remember that the data under test here was mostly a worst-case scenario for ZFS. It would probably have done better had I had the time to throw the rest of my dataset at it (such as the 60GB backup of my iPod, which would have mostly deduplicated with the backup of my music server).

One problem with ZFS is that dedup is very memory-hungry. This is common knowledge, and it is advertised that you need roughly 2GB of RAM per TB of disk when using dedup. I don’t have quite that much to dedicate to it, so ZFS got VERY slow and thrashed the disk a lot after the ARC grew to about 300MB. I found some knobs in zfsrc and the zfs command that let the ARC cache grow bigger. But the machine in question has only 2GB RAM, and is doing lots of other things as well, so this barely improved anything. Note that this dedup RAM requirement is not out of line with what is expected from these sorts of solutions.

Even if I got an absolutely stellar dedup ratio of 2:1, that would get me at most 1TB. The cost of buying a 1TB disk is less than the cost of upgrading my system to 4GB RAM, so dedup isn’t worth it here.

I think the lesson is: think carefully about where dedup makes sense. If you’re storing a bunch of nearly-identical virtual machine images — the sort of canonical use case for this — go for it. A general fileserver — well, maybe you should just add more disk instead of more RAM.

That raises the question: if I don’t need dedup from ZFS, do I bother with it at all, or just use ext4 and LVM snapshots? I think ZFS still makes sense, given its built-in support for compression and very fast snapshots — LVM snapshots are known to seriously degrade write performance once enabled, a problem ZFS doesn’t have.

So I plan to switch my backups to use ZFS. A few observations on this:

  1. Some testing suggests that the time to delete a few months of old snapshots will be a minute or two with ZFS compared to hours with rdiff-backup.
  2. ZFS has shown itself to be more space-efficient than rdiff-backup, even without dedup enabled.
  3. There are clear performance and convenience wins with ZFS.
Backup Scripts

So now comes the question of backup scripts. rsync is obviously a pretty nice choice here — and if used with --inplace, it may even play friendly with ZFS snapshots even if dedup is off. But let’s say I’m backing up a few machines at home, or perhaps dozens at work. There is a need to automate all of this. Specifically, there’s a need to:

  1. Provide scheduling, making sure that we don’t hammer the server with 30 clients all at once
  2. Provide for “run before” jobs to do things like snapshot databases
  3. Be silent on success and scream loudly via emails to administrators on any kind of error… and keep backing up other systems when there is an error
  4. Create snapshots and provide an automated way to remove old snapshots (or mount them for reading, as ZFS-fuse doesn’t support the .zfs snapshot directory yet)

To date I haven’t found anything that looks suitable. I found a shell script system called rsbackup that does a large part of this, but something about using a script whose homepage is a forum makes me less than 100% confident.
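Lacking a ready-made tool, a home-grown driver covering those four points needn’t start out large. A sketch only, with the host list, addresses, and dataset layout all invented:

    #!/bin/sh
    # Nightly pull-based backup driver -- a sketch, not a finished tool.
    # Running hosts one at a time is the "scheduling" for now; a smarter
    # version could run a few in parallel.
    HOSTS="web1 db1 files1"
    ADMIN=backup-admin@example.com
    POOL=backup/hosts              # one ZFS filesystem per host under here

    for host in $HOSTS; do
        # "Run before" hook on the client, e.g. to snapshot a database
        ssh "root@$host" "test -x /usr/local/bin/pre-backup && /usr/local/bin/pre-backup"

        # --inplace so that ZFS snapshots (and dedup, if on) see only changed blocks
        if ! rsync -a --delete --inplace \
               --exclude /proc --exclude /sys --exclude /dev \
               "root@$host:/" "/$POOL/$host/"; then
            # Scream on error, but keep going with the remaining hosts
            echo "rsync of $host failed" | mail -s "backup FAILED: $host" "$ADMIN"
            continue
        fi

        zfs snapshot "$POOL/$host@$(date +%Y-%m-%d)"
    done

    # Old snapshots would be pruned here with "zfs destroy $POOL/<host>@<date>"
    # once a retention policy is settled on.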

On the securing the backups front, rsync comes with a good-looking rrsync script (inexplicably installed under /usr/share/doc/rsync/scripts instead of /usr/bin on Debian) that can help secure the SSH authorization. GNU rush also looks like a useful restricted shell.

Research on deduplicating disk-based and cloud backups

Yesterday, I wrote about backing up to the cloud. I specifically was looking at cloud backup services. I’ve been looking into various options there, but also various options for disk-based backups. I’d like to have both onsite and offsite backups, so both types of backup are needed. Also, it is useful to think about how the two types of backups can be combined with minimal overhead.

For the onsite backups, I’d want to see:

  1. Preservation of ownership, permissions, etc.
  2. Preservation of symlinks and hardlinks
  3. Space-efficient representation of changes — ideally binary deltas or block-level deduplication
  4. Ease of restoring
  5. Support for backing up Linux and Windows machines

Deduplicating Filesystems for Local Storage

Although I initially thought of block-level deduplicating file systems as something to use for offsite backups, they could also make an excellent choice for onsite disk-based backups.

rsync-based dedup backups

One way to use them would be to simply rsync data to them each night. Since copies are essentially free, we could then run cp -r current snapshot/2011-01-20 (or some optimized version of it) to save off historic backups. Moreover, we’d get dedup both across and within machines. And many of these filesystems can use filesystem-level compression.
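Concretely, the nightly cycle per machine would be little more than this (paths invented):

    # Refresh the "current" tree on the deduplicating filesystem
    rsync -a --delete server1:/srv/data/ /dedupfs/server1/current/

    # Copies are nearly free once blocks dedup, so an ordinary recursive
    # copy is enough to freeze tonight's state as a browsable snapshot
    cp -a /dedupfs/server1/current /dedupfs/server1/snapshots/$(date +%Y-%m-%d)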

The real upshot of this is that the entire history of the backups can be browsed as a mounted filesystem. It would be fast and easy to find files, especially when users call about that file they deleted at some point in the past but don’t remember when, exactly what it was called, or exactly where it was stored. We can do a lot more with find and grep to locate these things than we could with the restore console of Bacula (or any other backup program). Since it is a real mounted filesystem, we could also do fun things like make tarballs of it at will, zip parts up, scp them back to the file server, whatever. We could potentially even give users direct access to their files to restore things they need for themselves.

The downside of this approach is that rsync can’t store all the permissions unless it’s running as root on the system. Wrappers such as rdup around rsync could help with that. Another downside is that there isn’t a central scheduling/statistics service. We wouldn’t want the backup system to be hammered by 20 servers trying to send it data at once, so there’d be an element of rolling our own scripts, though not too bad. I’d have preferred not to authorize a backup server with root-level access to dozens of machines, but that may be inescapable in this instance.

Bacula and dedup

The other alternative I thought of is a system such as Bacula with disk-based “volumes”. A Bacula volume is normally a tape, but Bacula can just write them to disk files. This lets us use the powerful Bacula scheduling engine, logging service, pre-backup and post-backup jobs, etc. Normally this would be an egregious waste of disk space: Bacula, like most tape-heritage programs, will write out an entire new copy of a file if even one byte changes. I had thought that I could let block-level dedupe reduce the storage size of Bacula volumes, but after looking at the Bacula block format spec, this won’t be possible, as each block has timestamps and such in it.

The good things about this setup revolve around using the central Bacula director. We need only install bacula-fd on each server to be backed up, and it has a fairly limited set of things it can do. Bacula already has built-in support for defining simple or complicated retention policies. Its director will email us if there is a problem with anything. And its logs and catalog are already extensive and enable us to easily find out things such as how long backups take, how much space they consume, etc. And it backs up Windows machines intelligently and comprehensively in addition to POSIX ones.

The downsides are, of course, that we don’t have all the features we’d get from having the entire history on the filesystem all at once, and far less efficient use of space. Not only that, but recovering from a disaster would require a more extensive bootstrapping process.

A hybrid option may be possible: automatically unpacking Bacula backups onto the local filesystem after they’ve run. Dedupe should ensure this doesn’t take additional space — if the Bacula blocksize aligns with the filesystem blocksize. That is certainly not a given, however. It may also make sense to use Bacula for Windows and rsync/rdup for Linux systems.

This seems, however, rather wasteful and useless.

Evaluation of deduplicating filesystems

I set up and tested three deduplicating filesystems available for Linux: S3QL, SDFS, and zfs-fuse. I did not examine lessfs. I ran a similar set of tests for each:

  1. Copy /usr/bin into the fs with tar -cpf - /usr/bin | tar -xvpf - -C /mnt/testfs
  2. Run commands to sync/flush the disk cache. Evaluate time and disk used at this point.
  3. Rerun the tar command, putting the contents into a slightly different path in the test filesystem. This should consume very little additional space since the files will have already been there. This will validate that dedupe works as expected, and provide a hint about its efficiency.
  4. Make a tarball of both directories from the dedup filesystem, writing it to /dev/zero (to test read performance)

I did not attempt to flush read caches during this, but I did flush write caches. The test system has 8GB RAM, 5GB of which was free or in use by a cache. The CPU is a Core2 6420 at 2.13GHz. The filesystems which created files atop an existing filesystem had ext4 mounted noatime beneath them. ZFS was mounted on an LVM LV. I also benchmarked native performance on ext4 as a baseline. The data set consists of 3232 files and 516MB. It contains hardlinks and symlinks.

Here are my results. Please note the comments below as SDFS could not accurately complete the test.

Test                        ext4    S3QL    SDFS    zfs-fuse
First copy                  1.59s   6m20s   2m2s    0m25s
Sync/Flush                  8.0s    1m1s    0s      0s
Second copy+sync            N/A     0m48s   1m48s   0m24s
Disk usage after 1st copy   516MB   156MB   791MB   201MB
Disk usage after 2nd copy   N/A     157MB   823MB   208MB
Make tarball                0.2s    1m1s    2m22s   0m54s
Max RAM usage               N/A     150MB   350MB   153MB
Compression                 none    lzma    none    gzip-2

It should be mentioned that these tests pretty much ruled out SDFS. SDFS doesn’t appear to support local compression, and it severely bloated the data store, which was much larger than the original data. Moreover, it permitted any user to create and modify files, even if the permissions bits said that the user couldn’t. tar gave many errors unpacking symlinks onto the SDFS filesystem, and du -s on the result threw up errors as well. Besides that, I noted that find found 10 fewer files than in my source data. Between the huge memory consumption, the data integrity concerns, and inefficient disk storage, SDFS is out of the running for this project.

S3QL is optimized for storage to S3, though it can also store its files locally or on an sftp server — a nice touch. I suspect part of its performance problem stems from being designed for network backends, and using slow compression algorithms. S3QL worked fine, however, and produced no problems. Creating a checkpoint using s3qlcp (faster than cp since it doesn’t have to read the data from the store) took 16s.

zfs-fuse appears to be the most-used ZFS implementation on Linux at the moment. I set up a 2GB ZFS pool for this test, and set dedup=on and compression=gzip-2. When I evaluated compression in the past, I hadn’t looked at lzjb. I found a blog post comparing lzjb to the gzip options supported by ZFS and wound up using gzip-2 for this test.
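Setting up the test pool amounted to something like this (the device path is whatever the 2GB LV happens to be called):

    zpool create testpool /dev/mapper/vg0-zfstest
    zfs set dedup=on testpool
    zfs set compression=gzip-2 testpool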

ZFS really shone here. Compared to S3QL, it took 25s instead of over 6 minutes to copy the data over — and took only 28% more space. I suspect that if I had selected gzip-9 compression it would have been closer to S3QL in both time and space. But creating a ZFS snapshot was nearly instantaneous. Although zfs-fuse probably doesn’t have as many users as ZFS on Solaris, it is available in Debian and has good backing behind it. I feel safer using it than I do using S3QL. So I think ZFS wins this comparison.

I spent quite some time testing ZFS snapshots, which are instantaneous. (Incidentally, ZFS-fuse can’t mount them directly as documented, so you create a clone of the snapshot and mount that.) They worked out as well as could be hoped. Due to dedupe, even deleting and recreating the entire content of the original filesystem resulted in less than 1MB additional storage used. I also tested creating multiple filesystems in the zpool, and confirmed that dedupe even works between filesystems.
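The clone workaround is just two commands (dataset names are examples):

    zfs snapshot testpool/fs1@before-delete
    # zfs-fuse can't mount the snapshot directly, so clone it and browse the
    # clone, which shows up at /testpool/restore by default
    zfs clone testpool/fs1@before-delete testpool/restore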

Incidentally — wow, ZFS has a ton of awesome features. I see now why you OpenSolaris people kept looking at us Linux folks with a sneer. Only our project hasn’t been killed by a new corporate overlord, so I guess that maybe didn’t work out so well for you… <grin>.

The Cloud Tie-In

This leaves one more question: what to do about offsite backups? Assuming for the moment that I want to send them over the Internet to some sort of cloud storage facility, there are about 3 options:

  1. Get an Amazon EC2 instance with EBS storage and rsync files to it. Perhaps run ZFS on that thing.
  2. Use a filesystem that can efficiently store data in S3 or Cloud Files (S3QL is the only contender here)
  3. Use a third-party backup product (JungleDisk appears to be the leading option)

There is something to be said for using a different tool for offsite backups — if some tool-level issue ever corrupts one set of backups, the other set is less likely to share the problem.

One of the nice things about JungleDisk is that bandwidth is free, and disk is the same $0.15/GB-mo that RackSpace normally charges. JungleDisk also does block-level dedup, and has a central management interface. This all spells “nice” for us.

The only remaining question would be whether to just use JungleDisk to back up the backup server, or to put it on each individual machine as well. If it just backs up the backup server, then administrative burdens are lower; we can back everything there up by default and just not worry about it. On the other hand, if there is a problem with our main backups, we could be really stuck. So I’d say I’m leaning towards ZFS plus some sort of rsync solution and JungleDisk for offsite.

I had two people suggest CrashPlan Pro on my blog. It looks interesting, but it is a very closed product, which makes me nervous. I like using standard tools and formats — that gives me more peace of mind, control, and recovery options. CrashPlan Pro supports multiple destinations and says that they do cloud hosting, but they don’t list pricing anywhere. So I’ll probably not mess with it.

I’m still very interested in what comments people may have on all this. Let me know!

Wikis, Amateur Radio, and Debian

As I have been getting involved with amateur radio this year, I’ve been taking notes on what I’m learning about certain things: tips from people on rigging up a bicycle antenna to achieve a 40-mile range, setting up packet radio in Linux, etc. I have long run a personal, private wiki where I put such things.

But I really wanted a convenient place to put this stuff in public. There was no reason to keep it private; in fact, I wanted to share with others what I’ve learned. And, since I wanted to let others add their own tips as well, I set up a public MoinMoin instance. So far, most of my attention has focused on the amateur radio section of it.

This has worked out pretty well for me. Sometimes I will cut and paste tips from emails into there, and then after trying them out, edit them into a more coherent summary based on my experiences.

Now then, on to packet radio and Debian. Packet radio is a digital communications mode that runs on the amateur radio bands. It is a routable networking protocol that typically runs at 300bps, 1200bps, or 9600bps. My packet radio page gives a better background on it, but essentially AX.25 — the packet protocol — is similar to a scaled-down TCP/IP. One interesting thing about packet is that, since it can use the HF bands, it can support direct transcontinental wireless links. More common are links spanning 30-50 miles on VHF and UHF, as well as HF links that span a continent.

Linux is the only operating system I know of that has AX.25 integrated as a first-class protocol in the kernel. You can create AX.25 sockets and use them with the APIs you’re familiar with already. Not only that, but the Linux AX.25 stack is probably the best there is, and it interfaces easily with TCP/IP — there are global standards for encapsulating TCP/IP within AX.25 and AX.25 within UDP, and both are supported on Linux. Yes, I have telnetted to a machine to work on it over VHF. Of Linux distributions, Debian appears to have the best AX.25 stack built-in.
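Getting a port on the air is mostly a matter of one config line and one command; the callsign, device, and address below are placeholders:

    # /etc/ax25/axports -- one line per port:
    #   name   callsign   speed  paclen  window  description
    #   radio  N0CALL-1   1200   255     2       144.390 MHz packet

    # Attach a KISS TNC on a serial port to that port definition; this creates
    # an ax0 network interface that can also carry TCP/IP (44.0.0.0/8 is the
    # amateur AMPRNet allocation)
    kissattach /dev/ttyUSB0 radio 44.0.0.1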

The AX.25 support in Linux is great, but it’s rather under-documented. So I set up a page for packet radio on Linux. I’ve had a great deal of fun with this. It’s amazing what you can do running a real networking protocol at 300bps over long-distance radio. I’ve had real-time conversations with people, connected to their personal BBSes and sent them mail, and even used AX.25 “nodes” (think of them as a kind of router or bridge; you can connect in to them and then connect back out on the same or a different frequency to extend your reach) to connect out to systems that I can’t reach directly.

MoinMoin has worked out well for this. It has an inviting theme and newbie-friendly interface (I want to encourage drive-by contributions).

Debconf10

Debconf10 ended a week ago, and I’m only now finding some time to write about it. Funny how it works that way sometimes.

Anyhow, the summary of Debconf has to be: this is one amazing conference. Despite being involved with Debian for years, this was my first Debconf. I often go to one conference a year that my employer sends me to. In the past, it’s often been OSCon, which was very good, but Debconf was even better than that. For those of you considering Debconf11 next year, perhaps this post will help you make your decision.

First of all, as might be expected from a technical conference, Debconf was of course informative. I particularly appreciated the enterprise track, which was very relevant to me. Unlike many other conferences, Debconf has some rooms specifically set aside for BoFs. With a day or two’s warning, you can get your event into one of those rooms on the official schedule. That exact thing happened with a virtualization BoF — I thought the topic was interesting, given the recent shifts in various virtualization options. So I emailed the conference mailing list, and we got an event on the schedule a short while later — and had a fairly large group turn out to discuss it.

The “hallway track” — conversations struck up with others in hallways or hacklabs — also was better at Debconf than other conferences. Partly that may be because, although there were fewer people at Debconf, they very much tended to be technical people whose interests aligned with my own. Partly it’s probably also because the keysigning party, which went throughout the conference, encouraged meeting random people. That was a great success, by the way.

So Debconf succeeded at informing, which is perhaps why many people go to these things. But it also inspired, especially Eben Moglen’s lecture. Who would have thought I’d come away from a conference enthused about the very real potential we have to alter the dynamics of some of the largest companies in the world today by using Free Software to its fullest?

And, of course, I had fun at Debconf. Meeting new people — or, more commonly, finally meeting in person people I’d known for years — was great. I got a real sense of the tremendously positive side of Debian’s community, which I must admit I have sometimes overlooked during certain mailing list discussions. This was a community of people, not just a bunch of folks attending a random conference for a week, and that fact underlined a lot of what happened.

Of course, it wasn’t 100% perfect, and it won’t ever be. But still, my thanks to everyone that organized, volunteered, and attended Debconf. I’m now wishing I’d been to more of them, and hope to attend next year’s.