Category Archives: Technology

Review: Free Software Project Hosting

I asked for suggestions a few days ago. I got several good ones, and investigated them. You can find my original criteria at the link above. Here’s what I came up with:

Google Code

Its very simple interface appeals to me. It has an issue tracker, a wiki, a download area. But zero integration with git. That’s not necessarily a big problem; I can always keep on hosting git repos at git.complete.org. It is a bit annoying, though, since I wouldn’t get to nicely link commit messages to automatic issue closing.

A big requirement of mine is being able to upload tarballs or ZIP files from the command line in an automated fashion. I haven’t yet checked to see if Google Code exports an API for this. Google Code also has a lifetime limit of 25 project creations, though rumor has it they may lift the limit if you figure out where to ask and ask nicely.

URL: googlecode.com

Gitorious

Gitorious is one of the two Git-based sites that put a strong emphasis on community. Like Github, Gitorious tries to make it easy for developers to fork projects, submit pull requests to maintainers, and work together. This aspect of it does hold some appeal to me, though I have never worked with one of these sites, so I am somewhat unsure of how I would use it.

The downside of Gitorious or Github is that they tie me to Git. While I’m happy with Git and have no plans to change now, I’ve changed VCSs many times over the years as better tools have shown up; I’ve used, in approximately this order, CVS, Subversion, Arch/tla, baz, darcs, Mercurial, and Git, with a brief use of Perforce at a job that required it. I may use Git for another 3 years, but will Git still be the best VCS out there in 5? I don’t know.

Gitorious fails several of my requirements, though. It has no issue tracker and no downloads area.

It can generate a tar.gz file on the fly from the head of any branch, but not a zip file. It is possible to provide a download of a specific revision, but this is not very intuitive for the end user.

Potential workarounds include using Lighthouse for bug tracking (they do support git integration for changelog messages) and my own server to host tarballs and ZIP files — which I could trivially upload via scp.
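That last piece is trivial to script; the upload step of a release script would be about one line (the account and remote path below are placeholders):

  # hypothetical release-upload step; the account and remote path are placeholders
  scp myproject-1.2.3.tar.gz myproject-1.2.3.zip \
      user@git.complete.org:/var/www/downloads/myproject/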

URL: gitorious.org

Github

At first glance, this is a more powerful version of Gitorious. It has similar community features and a wiki, but adds an issue tracker, download area, home page capability, and a bunch of other features. It has about a dozen pre-built commit hooks that do everything from integrating with Lighthouse to popping commit notices into Jabber.

But there are surprising drawbacks, limitations, and even outright bugs all throughout. And it all starts with the user interface.

On the main project page, the user gets both a download button and a download tab. But they don’t do the same thing. Talk about confusing!

The download button will make a ZIP or tarball out of any tag in the repo. The download tab will also do the same, though presented in a different way; but the tab can also offer downloads for files that the maintainer has manually uploaded. Neither one lets you limit the set of tags presented, so if you have an old project with lots of checkpoints, the poor end user has to sift through hundreds of tags to find the desired version. It is possible to make a tarball out of a given branch, so a link to the latest revision could be easy, but still.

Even worse, there’s a long-standing issue where several of the tabs get hidden under other on-screen elements. The wiki tab, project administration tab, and sometimes even the download tab are impacted. It’s been open since February with no apparent fix.

And on top of that, uploading arbitrary tarballs requires — yes — Flash. Despite requests to make it scriptable, their answer is that there is no option but Flash, and that they may add some other option sometime.

The issue tracker is nice and simple. But it doesn’t support attachments. So users can’t attach screenshots, debug logs, or diffs.

I really wanted to like Github. It has so many features for developers. But all these surprising limitations make it a pain both for developers (I keep having to “view source” to find the link to the wiki or the project admin page) and for users (confusing download options, lack of issue attachments). In the end, I think the display bug is a showstopper for me. I could work around some of the others by having a wiki page with links to downloads and revisions and giving that out as the home page perhaps. But that’s a lot of manual maintenance that I would rather avoid.

URL: github.com

Launchpad

Launchpad is the project management service operated by Canonical, the company behind Ubuntu. While Launchpad can optionally integrate well with Ubuntu, that isn’t required, so non-Ubuntu developers like me can work with it fine.

Launchpad does offer issue tracking, but no wiki. It has a forum of sorts though (the “Answers” section). It has some other features, such as blueprints, that would likely only be useful for projects larger than the ones I would plan to use it for.

It does have a downloads area, and they say they have a Python API. I haven’t checked it out, but if it supports scriptable uploads, that would work for me.

Besides the lack of a wiki, Launchpad is also tied to the bzr VCS. bzr was one of the early players in DVCS, written as a better-designed successor to tla/Arch and baz, but has no compelling features over Git or Mercurial for me today. I have no intention of switching to or using it any time soon.

Launchpad does let you “import” branches from another VCS such as Git or svn. I set up an “import” branch for a test project yesterday. 12 hours later, it still hasn’t imported anything; it’s just sitting at “pending review.” I have no idea if it ever will, or why setting up a bzr branch requires no review but a git branch requires review. So I am unable to test the integration between it and the changesets, which is really annoying.

So, some possibilities here, but the bzr-only thing really bugs me. And having to have my git trees reviewed really goes against the “quick and simple” project setup that I would have preferred to see.

URL: launchpad.net

Indefero

Indefero is explicitly a Google Code clone, but aims to be a better Google Code than Google Code. The interface is similar to Google’s — very simple and clean. Unlike Google Code, Indefero does support Git. It supports a wiki, downloads area, and issue tracker. You can download the PHP-based code and run it yourself, or you can get hosting from the Indefero site.

I was initially favorably impressed by Indefero, but the more I looked into it, the less impressed I became. Although it does integrate with Git, and you can refer to an issue number in a Git commit, a Git commit can’t close an issue. Developers interact with it using git over ssh, but it supports only one ssh key per user — which makes it very annoying if I wish to push changes from all three of the machines I regularly develop on. Despite the fact that this is a “high priority” issue, it hasn’t been touched by the maintainer in almost a month, even though patches have been offered.

Indefero can generate a download of any revision in git, or of the latest on any branch, but only in ZIP format (no tar.gz).

Although the program looks very nice and the developer seems clueful, Indefero has only one main active developer or committer, and he is a consultant who also works on other projects. That makes me nervous about putting too many eggs into the Indefero basket.

URL: indefero.net

Trac

Trac is perhaps the gold standard of lightweight project management apps. It has a wiki, downloads, issue tracking, and VCS integration (SVN only in the base version; quite a few others with 3rd-party plugins). I ran Trac myself for a while.

It also has quite a few failings. Chief among them is that you must run a completely separate Trac instance for each project. So there is no possible way to go to some dashboard and see all bugs assigned to you from all projects, for instance. That is what drove me away from it initially. That and the serious performance problems that most of its VCS backends have.

URL: trac.edgewall.org

Redmine

Redmine is designed to be a better Trac than Trac. It uses the same lightweight philosophy in general, has a wiki, issue tracker, VCS integration, downloads area, etc. But it supports multiple projects in a sane and nice way. It’s what I currently use over on software.complete.org.

Redmine has no API to speak of, though I have managed to cobble together an automatic uploader using curl. It was unpleasant to write and sometimes breaks on new releases, but it generally gets the job done.
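Since there is no real API, a script like that has to drive Redmine’s normal web forms. The sketch below shows the general shape of it; the base URL, project name, login path, and form field names are all guesses for illustration, not anything Redmine documents, which is exactly why such a script breaks on new releases:

  #!/bin/bash
  # Hypothetical sketch: drive Redmine's web forms with curl.
  # BASE, PROJECT, and all paths/field names are assumptions, not a documented API.
  BASE=https://software.example.org
  PROJECT=myproject
  JAR=$(mktemp)

  token() {  # scrape the Rails authenticity_token out of a form page
    curl -s -b "$JAR" -c "$JAR" "$1" |
      sed -n 's/.*name="authenticity_token" value="\([^"]*\)".*/\1/p' | head -n1
  }

  # log in, keeping the session cookie
  curl -s -b "$JAR" -c "$JAR" -d "authenticity_token=$(token "$BASE/login")" \
       -d "username=releasebot" -d "password=$REDMINE_PASS" "$BASE/login" >/dev/null

  # post the tarball through the project's "new file" form
  curl -s -b "$JAR" -F "authenticity_token=$(token "$BASE/projects/$PROJECT/files/new")" \
       -F "attachments[1][file]=@myproject-1.2.3.tar.gz" \
       "$BASE/projects/$PROJECT/files" >/dev/null

  rm -f "$JAR"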

I have two big problems with Redmine. One is performance. It’s slow. And when web spiders hit it, it has sometimes been so slow that it takes down my entire server. Because of the way it structures its URLs, it is not possible to craft a robots.txt that does the right thing — and there is no plan to completely fix that. There is, however, a 3rd-party plugin that may help.

The bigger problem relates to maintaining and upgrading Redmine. This is the first Ruby on Rails app I have ever used, and let me say it has made me want to run away screaming from Ruby on Rails. I’ve had such incredible annoyances installing and upgrading this thing that I can barely describe them all: undocumented requirements for newer software, gems that are supposed to work with it but don’t, having to manually patch things so they actually work, conflicts with what’s already on the system, and nobody in the Redmine, Rails, or Ruby communities being able to help. I upgrade rarely because it is such a hassle and breaks in such spectacular ways. I don’t think this is even Redmine’s fault; I think it’s a Rails and Ruby issue, but nevertheless, I am stuck with it. My last upgrade was a real mess: the newer PostgreSQL driver (required by the newer gem, required by the newer Redmine) was sending invalid SQL to the database. I finally patched it myself, and that was AFTER the whole pain that is installing gems in Ruby.

I’d take a CGI script written in Bash over Ruby on Rails after this.

That said, Redmine has the most complete set of the features I want of all the programs I’ve mentioned on this page.

URL: redmine.org

Savannah

Savannah is operated by the Free Software Foundation, and runs a fork of the SourceForge software. Its fork does support Git, but lacks a wiki. It has the standard *forge issue tracker, download area, home page support, integrated mailing lists, etc. It also has the standard *forge over-complexity.

There is a command-line SourceForge uploader in Debian that could potentially be hacked to work with Savannah, but I haven’t checked.

URL: savannah.nongnu.org

berlios.de

Appears to be another *forge clone. Similar to Savannah, but with a wiki, ugly page layout, and intrusive ads.

URL: berlios.de

SourceForge

Used to be the gold-standard of project hosting. Now looks more like a back alley in a trashy neighborhood. Ads all over the place, and intrusive and ugly ones at that. The ads make it hard to use the interface and difficult to navigate, especially for newbies. No thanks.

Conclusions

The four options that look most interesting to me are Indefero, Github, Gitorious, and staying with Redmine. The community features of Github, Gitorious, and Launchpad all sound interesting, but I don’t have the experience to evaluate how well they work in practice — and how well they encourage “drive-by commits” for small projects.

Gitorious + Lighthouse, with my own server hosting the downloads, merits more attention. Indefero still makes me nervous due to its low level of development activity and single main developer. Github has a lot of promise, but an interface that is too confusing and buggy for me to throw at end users. That leaves me with Redmine, despite all the Rails aggravations. Adding the bot-blocking plugin may just get me what I want right now, and is certainly the path of least resistance.

I am trying to find ways to build communities around these projects. If I had more experience with Github or Gitorious, and thought their community features could make a difference for small projects, I would try them.

My Week

It’s been quite the week.

Stomach Flu

Last Friday, my stomach was just starting to feel a little odd. I didn’t think much of it — a little food that didn’t go over well, or stress, I thought.

Saturday I got out of bed and almost immediately felt like throwing up. Ugh. I had probably caught some sort of stomach flu. I was nauseous all day and had some terrible diarrhea to boot. I spent parts of Saturday, Saturday night, Sunday, and Sunday night “supervising some emergency downloads,” as the BOFH would say. By Sunday afternoon, I thought I was doing well enough to attend a practice of the Kansas Mennonite Men’s Choir. I made it through, but I wasn’t quite as up to it as I thought.

Monday morning I woke up and thought the worst was behind me, so I went to work. By evening, the worst clearly was not behind me. I was extremely cold, and then got very hot a few hours later. Tuesday I left work a little early because of not feeling well.

Servers

Wednesday a colleague called me at home before I left to say that the ERP database had a major hiccup. That’s never good. The database is this creaky old dinosaur thing that has a habit of inventing novel ways to fail (favorite pastime: exceeding some arbitrary limit to the size of files that no OS has cared about for 5 years, then hanging without telling anybody why). My coworkers had been working on it since 5.

I went into the office and did what I could to help out, though they had mostly taken care of it. Then we went to reboot the server. It didn’t come back. I/O error on sda just after init started, and it hung. Puzzled, as it just used that disk to boot from. Try rebooting again.

This time, I/O error as the fibre channel controller driver loads. Again, puzzled as it just used that controller to load grub. Power cycle this time.

And now the server doesn’t see the fibre channel link at all. Eep. Check our fiber optic cables, and power cycle again.

And THIS time, the server doesn’t power back up. Fans whir for about a second, then an ominous red light I never knew was there shows up. Eeep!

So I call HP. They want me to remove one CPU. Yes, remove one CPU. I tried, and long story short, they dispatch a local guy with a replacement motherboard. “Can you send along a FC controller, in case it’s dead too?” “Nope, not until we diagnose a problem with it.”

Local guy comes out. He’s a sharp guy and I really like him. But the motherboard wasn’t in stock at the local HP warehouse, so he had to have it driven in from Oklahoma City. He gets here with it by about 4:30. At this point the single most important server to the company’s business has been down almost 12 hours.

He replaces the motherboard. The server now powers up — yay! And it POSTs, and it…. doesn’t see the disks. !#$!#$

He orders the FC controller, which is so very much not in stock that they can’t get it to us until 8:30AM the next morning (keep in mind this thing is on a 4-hour 24/7 contract).

Next morning rolls around. Outage now more than 24 hours. He pops the FC controller in, we tweak the SAN settings appropriately, we power up the machine, and….

still doesn’t see any disks, and the SAN switch still doesn’t see any link. EEP!

Even the BIOS firmware tool built into the controller doesn’t see a link, so we KNOW it’s not a software issue. We try plugging and unplugging cables, trying different ports, everything. Nothing makes a difference.

At this point, he ponders what else he can replace while we start migrating the server to a different blade. We get ERP back up on its temporary home an hour later, and he basically orders us every part he can think of now that we’ve bought him some room.

Several additional trips later, he’s replaced just about everything at least once, some things 2 or 3 times, and still no FC link. Meanwhile, I’ve asked my colleague to submit a new ticket to HP’s SAN team so we can check whether the switch has an issue. They take their sweet time answering, until he informs them this morning that it’s been *48 HOURS* since we first reported the outage. All of a sudden half a dozen people at HP take a keen interest in our case. As if they could smell this blog post coming…

So they advise us to upgrade the firmware in the SAN switch, but they also say “we really should send this to the blade group; the problem can’t be with the SAN” — and of course the blade people are saying “the problem’s GOT to be with the SAN”. We try to plan the firmware upgrade. In theory, we can lose a switch and nobody ever notices due to multipathing redundancy. In practice, we haven’t tested that in 2 years. None of this equipment had even been rebooted in 390 days.

While investigating this, we discovered that one of the blade servers could only see one path to its disks, not two. Strange. Fortunately, THAT blade wasn’t mission-critical on a Friday, so I power cycled it.

And it powered back up. And it promptly lost connection to its disks entirely, causing the SAN switches to display the same mysterious error they did with the first blade — the one that nobody at HP had heard of, could find in their documentation, or even on Google. Yes, that’s right. Apparently power cycling a server means it loses access to its disks.

Faced with the prospect of our network coming to a halt if anything else rebooted (or worse, if the problem started happening without a reboot), we decided we’d power cycle one switch now and see what would happen. If it worked out, our problems would be fixed. If not, at least things would go down in our and HP’s presence.

And that… worked? What? Yes. Power cycling the switch fixed every problem over the course of about 2 minutes, without us having to do anything.

Meanwhile, HP calls back to say, “Uhm, that firmware upgrade we told you to do? DON’T DO IT!” We power cycle the other switch, and have a normal SAN life again.

I let out a “WOOHOO!” My colleague, however, had the opposite reaction. “Now we’ll never be able to reproduce this problem to get it fixed!” Fair point, I suppose.

Then began the fairly quick job of migrating ERP back to its rightful home — it’s all on Xen already, designed to be nimble for just these circumstances. Full speed restored 4:55PM today.

So, to cap it all off, within the space of a few days, we had fail:

  • One ERP database
  • ERP server’s motherboard
  • Two fiber optic switches — but only regarding their ability to talk to machines recently rebooted
  • And possibly one FC controller

Murphy, I hate you.

The one fun moment out of this was this conversation:

Me to HP guy: “So yeah, that machine you’ve got open wasn’t rebooted in 392 days until today.”

HP guy: “WOW! That’s INCRED — oh wait, are you running Linux on it?”

Me: “Yep.”

HP: “Figures. No WAY you’d get that kind of uptime from Windows.”

And here he was going to be all impressed.

How To Think About Compression, Part 2

Yesterday, I posted part 1 of how to think about compression. If you haven’t read it already, take a look now, so this post makes sense.

Introduction

In the part 1 test, I compressed a 6GB tar file with various tools. This is a good test if you are writing an entire tar file to disk, or if you are writing to tape.

For part 2, I will be compressing each individual file contained in that tarball individually. This is a good test if you back up to hard disk and want quick access to your files. Quite a few tools take this approach — rdiff-backup, rdup, and backuppc are among them.

We can expect performance to be worse both in terms of size and speed for this test. The compressor tool will be executed once per file, instead of once for the entire group of files. This will magnify any startup costs in the tool. It will also reduce compression ratios, because the tools won’t have as large a data set to draw on to look for redundancy.

To add to that, we have the block size of the filesystem — 4K on most Linux systems. Any file’s actual disk consumption is always rounded up to the next multiple of 4K. So a 5-byte file takes up the same amount of space on disk as a 3000-byte file. (This behavior is not unique to Linux.) If a compressor can’t shrink a file enough to cross at least one 4K boundary, it effectively doesn’t save any disk space. On the other hand, in certain situations, saving one byte of data could free up 4K of disk space.
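You can watch the rounding happen with du and GNU stat; a quick demonstration:

  printf 'hello' > tiny.txt     # 5 bytes of actual data
  du -k tiny.txt                # reports 4 -- one full 4K block consumed
  stat -c '%s bytes, %b blocks of %B bytes' tiny.txt   # e.g. "5 bytes, 8 blocks of 512 bytes"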

So, for the results below, I use du to calculate disk usage, which reflects the actual amount of space consumed by files on disk.
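Each per-file run amounts to something like the following (a simplified sketch rather than the exact harness; the scratch path is arbitrary, and each tool gets a fresh copy of the extracted tree):

  # simplified sketch of one per-file run, here with gzip
  cp -a /scratch/usr-extracted /scratch/run       # fresh copy of the test tree
  time find /scratch/run -type f -exec gzip {} +  # compress every file individually
  du -sm /scratch/run                             # actual disk usage afterward, in MB
  rm -rf /scratch/run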

The Tools

Based on comments in part 1, I added tests for lzop and xz to this iteration. I attempted to test pbzip2, but it would have taken 3 days to complete, so it is not included here — more on that issue below.

The Numbers

Let’s start with the table, using the same metrics as with part 1:

Tool        MB saved   Space vs. gzip   Time vs. gzip   Cost
gzip            3081          100.00%         100.00%   0.41
gzip -1         2908          104.84%          82.34%   0.36
gzip -9         3091           99.72%         141.60%   0.58
bzip2           3173           97.44%         201.87%   0.81
bzip2 -1        3126           98.75%         182.22%   0.74
lzma -1         3280           94.44%         163.31%   0.63
lzma -2         3320           93.33%         217.94%   0.83
xz -1           3270           94.73%         176.52%   0.68
xz -2           3309           93.63%         200.05%   0.76
lzop -1         2508          116.01%          77.49%   0.39
lzop -2         2498          116.30%          76.59%   0.39

As before, in the “MB saved” column, higher numbers are better; in all other columns, lower numbers are better. I’m using clock seconds here on a dual-core machine. The cost column is clock seconds per MB saved.

Let’s draw some initial conclusions:

  • lzma -1 continues to be both faster and smaller than bzip2. lzma -2 is still smaller than bzip2, but unlike the test in part 1, is now a bit slower.
  • As you’ll see below, lzop ran as fast as cat. Strangely, lzop -3 produced larger output than lzop -1.
  • gzip -9 is probably not worth it — it saved less than 1% more space and took 42% longer.
  • xz -1 is not as good as lzma -1 on either count, though xz -2 is faster than lzma -2, at the cost of some storage space.
  • Among the tools also tested in part 1, the differences in both space and time were smaller than in that test. Across all tools, the difference in time is still far more significant than the difference in space.

The Pretty Charts

Now, let’s look at an illustration of this. As before, the sweet spot is the lower left, and the worst spot is the upper right. First, let’s look at the compression tools themselves:

[Chart: compress2-zoomed — space saved vs. time for the compression tools]

At the extremely fast but weaker-compressing end sits lzop. gzip is still the balanced performer, bzip2 still looks really bad, and lzma -1 is still the best high-compression performer.

Now, let’s throw cat into the mix:

[Chart: compress2-big — the same plot with cat included]

Here’s something notable, that this graph makes crystal clear: lzop was just as fast as cat. In other words, it is likely that lzop was faster than the disk, and using lzop compression would be essentially free in terms of time consumed.

And finally, look at the cost:

[Chart: compress2-efficiency — cost, in clock seconds per MB saved]

What happened to pbzip2?

I tried the parallel bzip2 implementation just like last time, but it ran extremely slowly. Interestingly, pbzip2 < notes.txt > notes.txt.bz2 took 1.002 wall seconds, but pbzip2 notes.txt finished almost instantaneously. That 1-second startup time for pbzip2 was a killer, and the test would have taken more than 3 days to complete. I killed it early and omitted it from my results. Hopefully this bug can be fixed. I didn’t expect pbzip2 to help much in this test — perhaps even to show a slight degradation — but not like THAT.

Conclusions

As before, the difference in time was far more significant than the difference in space. By compressing files individually, we lost about 400MB (about 7%) of space compared to making a tar file and then compressing that. My test set contained 270,101 files.

gzip continues to be a strong all-purpose contender, posting fast compression time and respectable compression ratios. lzop is a very interesting tool, running as fast as cat and yet turning in reasonable compression — though 25% worse than gzip on its default settings. gzip -1 was almost as fast, though, and compressed better. If gzip weren’t fast enough with -6, I’d be likely to try gzip -1 before using lzop, since the gzip format is far more widely supported, and that’s important to me for backups.

These results still look troubling for bzip2. lzma -1 continued to turn in far better times and compression ratios than bzip2. Even bzip2 -1 couldn’t match the speed of lzma -1, and it compressed barely better than gzip. I think bzip2 would be hard-pressed to find a comfortable niche anywhere by now.

As before, you can download my spreadsheet with all the numbers behind these charts and the table.

How To Think About Compression

… and the case against bzip2

Compression is with us all the time. I want to talk about general-purpose lossless compression here.

There is a lot of agonizing over compression ratios: the size of output for various sizes of input. For some situations, this is of course the single most important factor. For instance, if you’re Linus Torvalds putting your code out there for millions of people to download, the benefit of saving even a few percent of file size is well worth the cost of perhaps 50% worse compression performance. He compresses a source tarball once a month maybe, and we are all downloading it thousands of times a day.

On the other hand, when you’re doing backups, the calculation is different. Your storage media costs money, but so does your CPU. If you have a large photo collection or edit digital video, you may create 50GB of new data in a day. If you use a compression algorithm that’s too slow, your backup for one day may not complete before your backup for the next day starts. This is even more significant a problem when you consider enterprises backing up terabytes of data each day.

So I want to think of compression both in terms of resulting size and performance. Onward…

Starting Point

I started by looking at the practical compression test, which has some very useful charts. He has charted savings vs. runtime for a number of different compressors, and with the range of different settings for each.

If you look at his first chart, you’ll notice several interesting things:

  • gzip performance flattens at about -5 or -6, right where the manpage tells us it will, and in line with its defaults.
  • 7za -2 (the LZMA algorithm used in 7-Zip and p7zip) is both faster and smaller than any possible bzip2 combination. 7za -3 gets much slower.
  • bzip2’s performance is more tightly clustered than the others, both in terms of speed and space. bzip2 -3 is about the same speed as -1, but gains some space.

All this was very interesting, but had one limitation: it applied only to the gimp source tree, which is something of a best-case scenario for compression tools.

A 6GB Test

I wanted to try something a bit more interesting. I made an uncompressed tar file of /usr on my workstation, which comes to 6GB of data. My /usr contains highly compressible data such as header files and source code, ELF binaries and libraries, already-compressed documentation files, small icons, and the like. It is a large, real-world mix of data.

In fact, every compression comparison I saw was using data sets less than 1GB in size — hardly representative of backup workloads.
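The test itself is easy to reproduce; roughly this, with the same pattern repeated for bzip2, pbzip2, and lzma -2 (the /scratch paths are just illustrative):

  # rough sketch of the whole-tarball test
  tar cf /scratch/usr.tar /usr                             # about 6GB uncompressed
  time gzip -c    /scratch/usr.tar > /scratch/usr.tar.gz
  time lzma -1 -c /scratch/usr.tar > /scratch/usr.tar.lzma
  du -m /scratch/usr.tar*                                  # compare sizes in MB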

Let’s start with the numbers:

Tool        MB saved   Space vs. gzip   Time vs. gzip   Cost
gzip            3398          100.00%         100.00%   0.15
bzip2           3590           92.91%         333.05%   0.48
pbzip2          3587           92.99%         183.77%   0.26
lzma -1         3641           91.01%         195.58%   0.28
lzma -2         3783           85.76%         273.83%   0.37

In the “MB saved” column, higher numbers are better; in all other columns, lower numbers are better. I’m using clock seconds here on a dual-core machine. The cost column is clock seconds per MB saved.

What does this tell us?

  • bzip2 can do roughly 7% better than gzip, at a cost of a compression time more than 3 times as long.
  • lzma -1 compresses better than bzip2 -9 in less than twice the time of gzip. That is, it is significantly faster and marginally smaller than bzip2.
  • lzma -2 is significantly smaller and still somewhat faster than bzip2.
  • pbzip2 achieves better wall clock performance, though not better CPU time performance, than bzip2 — though even then, it is only marginally better than lzma -1 on a dual-core machine.

Some Pretty Charts

First, let’s see how the time vs. size numbers look:

[Chart: compress-zoomed — space saved vs. time for the compression tools]

Like the other charts, the best area is the lower left, and worst is upper right. It’s clear we have two outliers: gzip and bzip2. And a cluster of pretty similar performers.

This view somewhat magnifies the differences, though. Let’s add cat to the mix:

[Chart: compress-big — the same plot with cat included]

And finally, look at the cost:

[Chart: compress-efficiency — cost, in clock seconds per MB saved]

Conclusions

First off, the difference in time is far larger than the difference in space. We’re talking a difference of 15% at the most in terms of space, but orders of magnitude for time.

I think this pretty definitively is a death knell for bzip2. lzma -1 can achieve better compression in significantly less time, and lzma -2 can achieve significantly better compression in a little less time.

pbzip2 can help even that out in terms of clock time on multicore machines, but 7za already has a parallel LZMA implementation, and it seems only a matter of time before /usr/bin/lzma gets it too. Also, if I were to chart CPU time, the numbers would be even less kind to pbzip2 than to bzip2.

bzip2 does have some interesting properties, such as resetting everything every 900K, which could provide marginally better safety than any other compressor here — though I don’t know if lzma provides similar properties, or could.

I think a strong argument remains that gzip is most suitable for backups in the general case. lzma -1 makes a good contender when space is at more of a premium. bzip2 doesn’t seem to make a good contender at all now that we have lzma.

I have also made my spreadsheet (OpenOffice format) containing the raw numbers and charts available for those interested.

Update

Part 2 of this story is now available, which considers more compression tools, and looks at performance compressing files individually rather than the large tar file.

Real World Haskell update

There’s been quite the activity around this book lately.

Pat Eyler of On Ruby published an interview with the three of us RWH authors. It apparently got some very positive comments on Reddit.

Over on Amazon, the book is continuing its streak of 5-star reviews. When I see a review there titled “Finally a Haskell book that I could understand”, I am very glad that we took the time to write Real World Haskell. It is great to know that people find it useful.

Some of you may know that the book is licensed under a Creative Commons license. This is most certainly not common in the publishing world these days, and O’Reilly took a risk by letting us do it. We had tremendous community participation while we were writing it, and O’Reilly is quite pleased with the sales so far. I hope that this bodes well for future books released under such a license.

Server upgraded to Debian lenny

This afternoon, I finally decided to upgrade my main server from Debian etch to lenny. Lenny is still testing, but is nearing release. This server is colocated with Core Networks, and I have no physical or console access to it. (Well, I can request the IP KVM if needed.) It also hadn’t been rebooted in over 200 days.

The actual upgrade itself was incredibly smooth. For those of you that don’t use Debian, you might be interested to know that you can upgrade a running system in-place. A reboot is not even strictly necessary, though you won’t get kernel updates without it.
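For those who haven’t seen it, the core of an in-place Debian upgrade is just pointing apt at the new release; a minimal sketch (a real upgrade deserves a more careful read of sources.list and the release notes):

  # minimal sketch of the etch-to-lenny upgrade
  sed -i 's/\betch\b/lenny/g' /etc/apt/sources.list   # point apt at lenny
  apt-get update
  apt-get dist-upgrade                                # upgrade the running system in place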

There was a bit of config file tweaking for Exim and Apache, a small bit for PHP, and that was it for the entire thing.

EXCEPT for the two things that always really bug me: Horde/IMP and Ruby/Rails. Horde has the most annoying upgrade process of any web app I’ve used lately. You first go through reading two different upgrade docs. To upgrade imp, you pipe some SQL commands into a PostgreSQL psql process (though they only document the mysql command line). Ditto for horde. But for Turba, the address book, you have to run a PHP program from the command line. Only it doesn’t work from any place in your PATH, so you have to divine a location to copy it to, run it from there, hack up the database stuff in it, then remember to delete it.

And that concludes the documented upgrade process. Only — surprise — it’s not done. At this point, you’ll get weird PHP warnings all over your screen. Then you google them, and find you have to log in to the web app as an administrator, and run three different upgrade procedures from within it, each of which requires you to copy and paste a config file to disk.

A far cry from the WordPress single-click upgrade. And this is easier than Horde/IMP upgrades I’ve done in the past.

The other annoying thing is Ruby on Rails. I run one Rails app, Redmine, and it’s always annoying. You’ve got to get all sorts of gems just right. Today they decided they didn’t like my new PostgreSQL driver for Ruby, but they weren’t exactly obvious about it. Try upgrading the gems, and — surprise — AFTER they are upgraded, they say that I need a newer rubygems than is in Debian. Oooookaaayy…. restore gems from backup, google some more, find a patch, apply it, hack for awhile, and finally it works. But I have no idea why.

So, overall, kudos to all the Debian developers for a smooth upgrade process. I hope I can say that about Horde and Rails in the future.

Oh, and by the way, I did reboot the server. It came right up with the new kernel and OS, no problem.

More Wiki Annoyances

So today, I discovered that MoinMoin has an “all or nothing” attachment setting: either everyone with write permission to a page can upload attachments, or uploading attachments is disabled for the entire site. No exceptions. Period.

Not only that, but there is no maximum attachment size setting — unless the attachment was uploaded embedded in a ZIP file. How’s that for irony?

Not wanting to have my railroad site turn into a file trading site, I don’t really want to let everyone upload attachments.

Oh, also MoinMoin doesn’t maintain a history of attachments. So if somebody drives by and vandalizes an attachment, you get to…. restore the original version from tape. Yay?

So I decided I’d look back at MediaWiki, which has better attachment controls.

I still had my test installation, so I went to use it. I edited the main page. I wanted to read about the markup, so I clicked the “Editing help” link. Whoops, broken link. It links to Help:Editing on the local wiki. Which MediaWiki does not install for you. I asked about it on IRC. The answer was: copy from mediawiki.org. OK, fine. How? “Cut and paste.” Yes, that’s right. Every time a new version of MediaWiki comes out, you get to cut and paste dozens of pages from mediawiki.org to your wiki, or else you’ll have outdated help. Yay?

The MediaWiki folks on IRC seem to like it this way. “Not everyone wants the same help.” Fine, but provide a sane default for those that don’t care.

Who thought running a wiki would be so hard?

Update: since yesterday, I went to moinmo.in and fixed the ThemeMarket page to reflect what versions a MoinMoin theme works with. I’m happy to help out with fixing Free Software — though I don’t really have time to add fundamental features like working syntax help links on this particular project.

Wiki Software

I used to run a website for traveling by rail in the United States. I let it falter, and eventually took it down. But I still have the domain, and am working to bring it back as a wiki.

The first step in that process was selecting which wiki software to use. I have a few requirements for the site:

  • Availability of both WYSIWYG (friendly for beginners) and non-WYSIWYG editors
  • A number of nice-looking themes to choose from
  • Nice to have: a hierarchy or category system
  • The ability to search within only a particular section or category in the hierarchy
  • Easy to maintain software; not having tons of plugins to keep up-to-date for security
  • Stellar spam prevention
  • Nice to have: ability to redirect people to the new page after a rename

I’m frustrated that there is no wiki out there that does all of these. There are quite a few that do all but one, but which one they omit varies.

My two finalists were MoinMoin and MediaWiki.

MoinMoin

MoinMoin will let you easily define arbitrary categories (by creating a wiki page following a certain name). The search screen automatically presents checkboxes for restricting searches to a particular category. Some reviews have complained about its anti-spam features, but they are all talking about older versions and they seem to have done some work on this lately.

MoinMoin has tons of features and is easy to set up and maintain. But here’s where it falls down: themes.

Over at moinmo.in, there is a “theme market” for themes. Only most of the themes there haven’t been compatible with the current MoinMoin release since about 2005. Most of the rest have one download, then a long discussion page full of mixed bug reports, diffs, and non-diff “edit this to make it work” comments. Most of those comments don’t state which version of the theme they apply to. Most themes won’t work with current browsers and Moin releases without those fixes. UGH. After discussing it on #moin, I’ll probably go in there and at least organize the ThemeMarket page by release.

MediaWiki

Then there’s MediaWiki. It’s got a lot of features, and a lot of complexity. It has no current WYSIWYG feature, though apparently there is work on one.

MediaWiki has an amazing category system. It can generate sorted lists of pages in a category, supports subcategories, etc. Surprisingly, though, you can’t search in just one category. (Though it might be possible indirectly via some syntax; not sure.)

Searching in MediaWiki overall is less capable than in MoinMoin.

MediaWiki does offer namespaces, and namespaces are the sole way of searching just one part of a site. They’re used well over at, say, uesp.net. But namespaces are heavy-handed. You have to edit config files to define them, and they bring with them other associated namespaces for discussion and whatnot. It’s not as easy as creating a category in MoinMoin, and might not scale well to lots of future categories.

MediaWiki does appear to have good spam prevention, and it supports reCAPTCHA.

Conclusions

I eventually selected MoinMoin and have set up most of my content in it. But now that I am to the point of selecting themes, I’m having some second thoughts.

I also looked at DokuWiki. Its design makes me nervous. The user list is stored in a single file. You can’t search by category. You can search by namespace, but there aren’t checkboxes for it in the search screen; you have to know the syntax. WYSIWYG is a plugin. Categories are a plugin. So — too many plugins to maintain, and no real features above MoinMoin.

Administering Dozens of Debian Servers

At work, we have quite a few Debian servers. We have a few physical machines, then a number of virtual machines running under Xen. These servers are split up mainly along task-oriented lines: DNS server, LDAP server, file server, print server, mail server, several web app servers, ERP system, and the like.

In the past, we had fewer virtual instances and combined more services into a single OS install. This led to some difficulties, especially with upgrades. If we wanted to upgrade the OS for, say, the file server, we’d have to upgrade the web apps and test them along with it at the same time. This was not a terribly sustainable approach, hence the heavier reliance on smaller virtual environments.

All these virtual environments have led to their own issues. One of them is getting security patches installed. At present, that’s a mainly manual task. In the past, I used cron-apt a bit, but it seemed to be rather fragile. I’m wondering what people are using to get security updates onto servers in an automated fashion these days.
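Even something as small as a nightly cron job can at least report what is pending, though it still leaves the installing to a human; a sketch of that kind of minimal check (the mail address is a placeholder):

  #!/bin/sh
  # nightly report of pending upgrades; -s only simulates, nothing is installed
  apt-get -qq update
  apt-get -s upgrade | mail -s "pending updates on $(hostname)" admin@example.com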

The other issue is managing the configuration of these things. We have some bits of configuration that are pretty similar between servers — the mail system setup, for instance. Most of them are just simple SMTP clients that need to be able to send out cron reports and the like. We had tried using cfengine2 for this, but it didn’t work out well. I don’t know if it was our approach or not, but we found that updating the cfengine2 configuration after making changes on systems was too time-consuming, so that task slipped, and eventually cfengine2 wasn’t doing what it should anymore. And that was even with taking advantage of its ability to do things like put the local hostname in the right places.

I’ve thought a bit about perhaps providing some locally-built packages that establish these config files, or load them up with our defaults. That approach has worked out well for me before, though it also means that pushing out changes isn’t a simple hack of a config file somewhere anymore.

It seems like a lot of the cfengine2/bcfg2-style tools are designed for environments where servers are more homogeneous than ours. bcfg2, in particular, goes down that road; it makes it difficult to log on to a web server, apt-get install a few PHP modules that we need for a random app, and just proceed.

Any suggestions?

Why Do Web Applications Stink So Badly?

So today, I happen to be looking at wikis for two small to mid-sized public projects (MoinMoin and DokuWiki look like frontrunners right now — any suggestions?). Recently, I’ve also looked at blog and CMS software, and a host of other web apps. It’s as if these people have learned nothing about good software practices over the last 20 years.

Warning: Rant Ahead

So how many of you have been here before? You download WebApp X. It tells you to cd to your DocumentRoot and unzip/untar it there. At this point, most of them will tell you to chmod -R 777 the install directory. Some of the better ones, such as WordPress, will tell you to chmod it 777, or if that makes you nervous, to instead chown it to the user that your webserver runs as.
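Concretely, the ritual usually reads something like this (“WebApp X” and the paths are made up, but you’ve seen the pattern):

  # the typical install procedure, paraphrased
  cd /var/www                           # your DocumentRoot
  tar xzf webapp-x-1.0.tar.gz
  chmod -R 777 webapp-x                 # what most of the instructions ask for
  chown -R www-data:www-data webapp-x   # the "if that makes you nervous" variant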

It is at this point that you realize that the Java-based programs ship with their own webserver that takes 2 minutes to load and uses 2GB of RAM, while the PHP-based programs want you to give them 32MB RAM per process, and probably modify your global PHP settings in a way that breaks some other PHP web app you’re already using.

As if that isn’t enough to scare you off, generally speaking, config files — including passwords to databases — are stored in the same directory, along with .htaccess files. Many of these programs are also downloading and updating plugins over the Internet, usually without any kind of cryptographic authentication, and overwriting their own program files in the process.

Oh, and this is a class of app that is notorious for security problems to start with, and makes your server known to billions of people via search engines.

Absolutely no opportunity for trouble here, of course! That sentence was dripping with sarcasm, in case you didn’t get it.

It also makes it almost impossible for people such as Debian maintainers to package up some webapps (such as just about every single one that uses Ruby on Rails) because there is just no sane way to make it behave with respect to the Filesystem Hierarchy Standard.

I’d love to see web app developers do a few simple things:

  1. Separate code from data
  2. Separate code from configuration
  3. Separate all of the above from the DocumentRoot to the greatest extent possible

I realize that some of this is purportedly to make things easier to install when you have FTP access only. But to me it seems just really poor design. I’ve written webapps, and it’s not that hard to do this part right.
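For the record, the separation I’m asking for is nothing exotic; something along these lines, with the package name and paths purely illustrative and roughly following the FHS:

  /usr/share/webapp-x/       # code: owned by root, read-only to the webserver
  /etc/webapp-x/config.php   # configuration, including database credentials
  /var/lib/webapp-x/         # data the app writes: uploads, caches, attachments
  /var/www/webapp-x/         # DocumentRoot: only what must be directly web-visible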

Plus, doing the above right means that I would no longer have to do things like keep my WordPress installations in git just because it’s otherwise too much of a hassle to apply security and plugin updates to all three separate ones.