Category Archives: Software

How To Think About Compression

… and the case against bzip2

Compression is with us all the time. I want to talk about general-purpose lossless compression here.

There is a lot of agonizing over compression ratios: the size of output for various sizes of input. For some situations, this is of course the single most important factor. For instance, if you’re Linus Torvalds putting your code out there for millions of people to download, the benefit of saving even a few percent of file size is well worth the cost of perhaps 50% worse compression performance. He compresses a source tarball once a month maybe, and we are all downloading it thousands of times a day.

On the other hand, when you’re doing backups, the calculation is different. Your storage media costs money, but so does your CPU. If you have a large photo collection or edit digital video, you may create 50GB of new data in a day. If you use a compression algorithm that’s too slow, your backup for one day may not complete before your backup for the next day starts. This is even more significant a problem when you consider enterprises backing up terabytes of data each day.

So I want to think of compression both in terms of resulting size and performance. Onward…

Starting Point

I started by looking at the practical compression test, which has some very useful charts. He has charted savings vs. runtime for a number of different compressors, and with the range of different settings for each.

If you look at his first chart, you’ll notice several interesting things:

  • gzip performance flattens at about -5 or -6, right where the manpage tells us it will, and in line with its defaults.
  • 7za -2 (the LZMA algorithm used in 7-Zip and p7zip) is both faster and smaller than any possible bzip2 combination. 7za -3 gets much slower.
  • bzip2’s performance is more tightly clustered than the others, both in terms of speed and space. bzip2 -3 is about the same speed as -1, but gains some space.

All this was very interesting, but had one limitation: it applied only to the gimp source tree, which is something of a best-case scenario for compression tools.

A 6GB Test
I wanted to try something a bit more interesting. I made an uncompressed tar file of /usr on my workstation, which comes to 6GB of data. My /usr contains highly compressible data such as header files and source code, ELF binaries and libraries, already-compressed documentation files, small icons, and the like. It is a large, real-world mix of data.

In fact, every compression comparison I saw was using data sets less than 1GB in size — hardly representative of backup workloads.

Let’s start with the numbers:

Tool MB saved Space vs. gzip Time vs. gzip Cost
gzip 3398 100.00% 100.00% 0.15
bzip2 3590 92.91% 333.05% 0.48
pbzip2 3587 92.99% 183.77% 0.26
lzma -1 3641 91.01% 195.58% 0.28
lzma -2 3783 85.76% 273.83% 0.37

In the “MB saved” column, higher numbers are better; in all other columns, lower numbers are better. I’m using clock seconds here on a dual-core machine. The cost column is clock seconds per MB saved.

What does this tell us?

  • bzip2 can do roughly 7% better than gzip, at a cost of a compression time more than 3 times as long.
  • lzma -1 compresses better than bzip2 -9 in less than twice the time of gzip. That is, it is significantly faster and marginally smaller than bzip2.
  • lzma -2 is significantly smaller and still somewhat faster than bzip2.
  • pbzip2 achieves better wall clock performance, though not better CPU time performance, than bzip2 — though even then, it is only marginally better than lzma -1 on a dual-core machine.

Some Pretty Charts

First, let’s see how the time vs. size numbers look:

compress-zoomed

Like the other charts, the best area is the lower left, and worst is upper right. It’s clear we have two outliers: gzip and bzip2. And a cluster of pretty similar performers.

This view somewhat magnifies the differences, though. Let’s add cat to the mix:

compress-big

And finally, look at the cost:

compress-efficiency

Conclusions

First off, the difference in time is far larger than the difference in space. We’re talking a difference of 15% at the most in terms of space, but orders of magnitude for time.

I think this pretty definitively is a death knell for bzip2. lzma -1 can achieve better compression in significantly less time, and lzma -2 can achieve significantly better compression in a little less time.

pbzip2 can help even that out in terms of clock time on multicore machines, but 7za already has a parallel LZMA implementation, and it seems only a matter of time before /usr/bin/lzma gets it too. Also, if I were to chart CPU time, the numbers would be even less kind to pbzip2 than to bzip2.

bzip2 does have some interesting properties, such as resetting everything every 900K, which could provide marginally better safety than any other compressor here — though I don’t know if lzma provides similar properties, or could.

I think a strong argument remains that gzip is most suitable for backups in the general case. lzma -1 makes a good contender when space is at more of a premium. bzip2 doesn’t seem to make a good contender at all now that we have lzma.

I have also made my spreadsheet (OpenOffice format) containing the raw numbers and charts available for those interested.

Update

Part 2 of this story is now available, which considers more compression tools, and looks at performance compressing files individually rather than the large tar file.

Real World Haskell update

There’s been quite the activity around this book lately.

Pat Eyler of On Ruby published an interview with the three of us RWH authors. It apparently got some very positive comments on Reddit.

Over on Amazon, the book is continuing its streak of 5-star reviews. When I see a review there titled “Finally a Haskell book that I could understand”, I am very glad that we took the time to write Real World Haskell. It is great to know that people find it useful.

Some of you may know that the book is licensed under a Creative Commons license. This is most certainly not common in the publishing world these days, and O’Reilly took a risk by letting us do it. We had tremendous community participation while we were writing it, and O’Reilly is quite pleased with the sales so far. I hope that this bodes well for future books released under such a license.

Server upgraded to Debian lenny

This afternoon, I finally decided to upgrade my main server from Debian etch to lenny. Lenny is still testing, but is nearing release. This server is colocated with Core Networks, and I have no physical or console access to it. (Well, I can request the IP KVM if needed.) It also hadn’t been rebooted in over 200 days.

The actual upgrade itself was incredibly smooth. For those of you that don’t use Debian, you might be interested to know that you can upgrade a running system in-place. A reboot is not even strictly necessary, though you won’t get kernel updates without it.

There was a bit of config file tweaking for Exim and Apache, a small bit for PHP, and that was it for the entire thing.

EXCEPT for the two things that always really bug me: Horde/IMP and Ruby/Rails. Horde has the most annoying upgrade process of any web app I’ve used lately. You first go through reading two different upgrade docs. To upgrade imp, you pipe some SQL commands into a PostgreSQL psql process (only they only document the mysql command line). Ditto for horde. But for Turba, the address book, you have to run a PHP program from the command line. Only it doesn’t work from any place in your PATH, so you have to divine a location to copy it to, run it from there, hack up the database stuff in it, then remember to delete it.

And that concludes the documented upgrade process. Only — surprise — it’s not done. At this point, you’ll get weird PHP warnings all over your screen. Then you google them, and find you have to log in to the web app as an administrator, and run three different upgrade procedures from within it, each of which requires you to copy and paste a config file to disk.

A far cry from the WordPress single-click upgrade. And this is easier than Horde/IMP upgrades I’ve done in the past.

The other annoying thing is Ruby on Rails. I run one Rails app, Redmine, and it’s always annoying. You’ve got to get all sorts of these gems just right. Today they decided they didn’t like my new PostgreSQL driver for Ruby, but they weren’t exactly obvious about it. Try upgrading the Gems, and — surprise — AFTER they are upgraded, they say that I need a newer rubygems than’s in Debian. Oooookaaayy…. restore gems from backup, google some more, find a patch, apply it, hack for awhile, and finally it works. But I have no idea why.

So, overall, kudos to all the Debian developers for a smooth upgrade process. I hope I can say that about Horde and Rails in the future.

Oh, and by the way, I did reboot the server. It came right up with the new kernel an OS, no problem.

More Wiki Annoyances

So today, I discovered that MoinMoin has an “all or nothing” attachment setting: either everyone with write permission to a page can upload attachments, or uploading attachments is disabled for the entire site. No exceptions. Period.

Not only that, but there is no maximum attachment size setting — unless the attachment was uploaded embedded in a ZIP file. How’s that for irony?

Not wanting to have my railroad site turn into a file trading site, I don’t really want to let everyone upload attachments.

Oh, also MoinMoin doesn’t maintain a history of attachments. So if somebody drives by and vandalizes an attachment, you get to…. restore the original version from tape. Yay?

So I decided I’d look back at MediaWiki, which has better attachment controls.

I still had my test installation, so I went to use it. I edited the main page. I wanted to read about the markup, so I clicked the “Editing help” link. Whoops, broken link. It links to Help:Editing on the local wiki. Which MediaWiki does not install for you. I asked about it on . The answer was: copy from mediawiki.org. OK, fine. How? “Cut and paste.” Yes, that’s right. Every time a new version of MediaWiki comes out, you get to cut and paste dozens of pages from mediawiki.org to your wiki, or else you’ll have outdated help. Yay?

The MediaWiki folks on IRC seem to like it this way. “Not everyone wants the same help.” Fine, but provide a sane default for those that don’t care.

Who thought running a wiki would be so hard?

Update: since yesterday, I went to moinmo.in and fixed the ThemeMarket page to reflect what versions a MoinMoin theme works with. I’m happy to help out with fixing Free Software — though I don’t really have time to add fundamental features like working syntax help links on this particular project.

Wiki Software

I used to run a website for traveling by rail in the United States. I let it falter, and eventually took it down. But I still have the domain, and am working to bring it back as a wiki.

The first step in that process was selecting which wiki software to use. I have a few requirements for the site:

  • Availability of both WYSIWYG (friendly for beginners) and non-WYSIWYG editors
  • A number of nice-looking themes to choose from
  • Nice to have: a hierarchy or category system
  • The ability to search within only a particular section or category in the hierarchy
  • Easy to maintain software; not having tons of plugins to keep up-to-date for security
  • Stellar spam prevention
  • Nice to have: ability to redirect people to the new page after a rename

I’m frustrated that there is no wiki out there that does all of these. There are quite a few that do all but one, but which one they omit varies.

My two finalists were MoinMoin and MediaWiki.

MoinMoin

MoinMoin will let you easily define arbitrary categories (by creating a wiki page following a certain name). The search screen automatically presents checkboxes for restricting searches to a particular category. Some reviews have complained about its anti-spam features, but they are all talking about older versions and they seem to have done some work on this lately.

MoinMoin has tons of features and is easy to set up and maintain. But here’s where it falls down: themes.

Over at moinmo.in, there is a “theme market” for themes. Only most of the themes there haven’t been compatible with the current MoinMoin release since about 2005. Most of the rest have one download, then a long discussion page full of mixed bug reports, diffs, and non-diff “edit this to make it work” comments. Most of these don’t state what version of the theme they apply to. Most themes won’t work with current browsers and Moin releases without them. UGH. After discussing on #moin, I’ll probably go in there and at least organize the ThemeMarket page by release.

MediaWiki

Then there’s MediaWiki. It’s got a lot of features, and a lot of complexity. It has no current WYSIWYG feature, though apparently there is work on one.

MediaWiki has an amazing category system. It can generate sorted lists of pages in a category, supports subcategories, etc. Surprisingly, though, you can’t search in just one category. (Though it might be possible indirectly via some syntax; not sure.)

Searching in MediaWiki overall is less capable that in MoinMoin.

MediaWiki does offer namespaces, and namespaces are the sole way of searching just one part of a site. They’re used well over at, say, uesp.net. But namespaces are heavy-handed. You have to edit config files to define them, and they bring with them other associated namespaces for discussion and whatnot. It’s not as easy as creating a category in MoinMoin, and might not scale well to lots of future categories.

MediaWiki does appear to have good spam prevention, and support recaptcha.

Conclusions

I eventually selected MoinMoin and have set up most of my content in it. But now that I am to the point of selecting themes, I’m having some second thoughts.

I also looked at DokuWiki. Its design makes me nervous. The user list is stored in a single file. You can’t search by category. You can search by namespace, but there aren’t checkboxes for it in the search screen; you have to know the syntax. WYSIWYG is a plugin. Categories are a plugin. So — too many plugins to maintain, and no real features above MoinMoin.

Administering Dozens of Debian Servers

At work, we have quite a few Debian servers. We have a few physical machines, then a number of virtual machines running under Xen. These servers are split up mainly along task-oriented lines: DNS server, LDAP server, file server, print server, mail server, several web app servers, ERP system, and the like.

In the past, we had fewer virtual instances and combined more services into a single OS install. This led to some difficulties, especially with upgrades. If we wanted to upgrade the OS for, say, the file server, we’d have to upgrade the web apps and test them along with it at the same time. This was not a terribly sustainable approach, hence the heavier reliance on smaller virtual environments.

All these virtual environments have led to their own issues. One of them is getting security patches installed. At present, that’s a mainly manual task. In the past, I used cron-apt a bit, but it seemed to be rather fragile. I’m wondering what people are using to get security updates onto servers in an automated fashion these days.

The other issue is managing the configuration of these things. We have some bits of configuration that are pretty similar between servers — the mail system setup, for instance. Most of them are just simple SMTP clients that need to be able to send out cron reports and the like. We had tried using cfengine2 for this, but it didn’t work out well. I don’t know if it was our approach or not, but we found that hacking cfengine2 after making changes on systems was too time-consuming, and so that task slipped and eventually cfengine2 wasn’t doing what it should anymore. And that even with taking advantage of it being able to do things like put the local hostname in the right places.

I’ve thought a bit about perhaps providing some locally-built packages that establish these config files, or load them up with our defaults. That approach has worked out well for me before, though it also means that pushing out changes isn’t a simple hack of a config file somewhere anymore.

It seems like a lot of the cfengine2/bcfg tools are designed for environments where servers are more homogenous than ours. bcfg2, in particular, goes down that road; it makes it difficult to be able to log on to a web server, apt-get install a few PHP modules that we need for a random app, and just proceed.

Any suggestions?

Why Do Web Applications Stink So Badly?

So today, I happen to be looking at wikis for two small to mid-sized public proojects (MoinMoin and DokuWiki look like frontrunners right now — any suggestions?) Recently, I’ve also looked at blog and CMS software, and a host of other web apps. It’s as if these people have learned nothing about good software practices over the last 20 years.

Warning: Rant Ahead

So how many of you have been here before? You download WebApp X. It tells you to cd to your DocumentRoot and unzip/untar it there. At this point, most of them will tell you to chmod -R 777 the install directory. Some of the better ones, such as WordPress, will tell you to chmod it 777, or if that makes you nervous, to instead chown it to the user that your webserver runs as.

It is at this point that you realize that the Java-based programs ship with their own webserver that takes 2 minutes to load and uses 2GB of RAM, while the PHP-based programs want you to give them 32MB RAM per process, and probably modify your global PHP settings in a way that breaks some other PHP web app you’re already using.

As if that isn’t enough to scare you off, generally speaking, config files — including passwords to databases — are stored in the same directory, along with .htaccess files. Many of these programs are also downloading and updating plugins over the Internet, usually without any kind of cryptographic authentication, and overwriting their own program files in the process.

Oh, and this is a class of app that is notorious for security problems to start with, and makes your server known to billions of people via search engines.

Absolutely no opportunity for trouble here, of course! That sentence was dripping with sarcasm, in case you didn’t get it.

It also makes it almost impossible for people such as Debian maintainers to package up some webapps (such as just about every single one that uses Ruby on Rails) because there is just no sane way to make it behave with respect to the Filesystem Hierarchy Standard.

I’d love to see web app developers do a few simple things:

  1. Separate code from data
  2. Separate code from configuration
  3. Separate all of the above from the DocumentRoot to the greatest extent possible

I realize that some of this is purportedly to make things easier to install when you have FTP access only. But to me it seems just really poor design. I’ve written webapps, and it’s not that hard to do this part right.

Plus, doing the above right means that I no longer have to do something like use git on my WordPress installations because it’s too much of a hassle to apply security and plugin updates on all three separate ones otherwise.

If Programming Languages Were Christmas Carols

Last spring, I posted If Version Contol Systems Were Airlines, which I really enjoyed. Now, because I seem to have a desire to take a good joke way too far, it’s time for:

IF PROGRAMMING LANGUAGES WERE CHRISTMAS CAROLS

I apologize in advance. (Feel free to add your own verses/carols in the comments.)

Away in a Pointer (C)

(to Away in a Manger)

Away in a pointer, the bits in a row.
A little dereference to see where they go.
I look down upon thee, and what do I see?
A segfault and core dump, right there just for me.

I saw thy init there, a reaping away
My process, from its address space, so sorry to say.
I thought I had saved thee, from void pointers all,
But maybe I missed one, and doomed you to fall.

Be near me, debugger, I ask thee to stay
Close by my terminal, and help me, I pray;
To find all the bugs and the void pointers too,
And if my kernel oopses, help me reboot for you.

Joy to the Wall (Perl)

(to Joy to the World)

Joy to the Wall, the Perl is come!
Let awk receive her King;
Let every grep prepare him room,
And bash and sed shall sing,
And bash and sed shall sing,
And bash, and bash, and sed shall sing.

Joy to the keyboard, we’ll use it all!
Let men, shift keys, employ;
Implicit variables, and globals never fall.
Repeat the line noise now,
Repeat the line noise now,
Repeat, repeat, the line noise now.

Perl rules the world with truth and ASCII,
And makes the doctors prove
The glories of carpal tunnel hands,
And we do it more than one way,
And we do it more than one way,
And we do it, and we do it, more than one way.

Hark! The Herald Coders Sing (Haskell)

(to Hark! The Herald Angels Sing)

Hark! The herald coders sing,
“Map and fold, recursive King;
Recursion and patterns wild,
Pure and IO — they’re reconciled!”
Joyful, all ye functions rise,
Join the typeclasses of the types,
With recursion, do proclaim,
“Laziness is born in this domain.”

Refrain
Hark! The herald coders sing,
“Map and fold, recursive king!”

Monads, by highest Heav’n adored;
Monads, their depths still unexplored;
Late in time, behold they’re good,
Never once were understood.
Veiled in functions, the Monads stay,
Used for IO, and more, each day,
With excitement, Monads say,
“Arrows are stranger, so with us stay.”

(Refrain)

Hail the glorious compiler of Glasgow!
Hail the threaded run-time system!
Join the beautiful Cabal of Hackage,
Upload there thy perfect package.
We know best, what we will Handle,
You’re safe with us: no pointers, no vandals.
Born to make your exceptions throw,
Unless you unsafePerformIO.

(Refrain)

Lispy the Paren

(to Frosty the Snowman)

Lispy the paren was a jolly happy soul,
With a lot of cars and a little cons
And two ends made out of curves.
Lispy the paren is a fairy tale, they say,
He was just common, but the children know
how he came to life one day.
There must have been some magic in that
Old Symbolics they found.
For when they placed him on its disk,
It recursed around and ’round.

O, Lispy the paren,
Was recursive as can be.
And the coders say it would take a day
To put his parens away.
Clunkety clunk clunk,
Clunkety clunk clunk,
Look at Lispy go.
Clunkety clunk clunk,
Clunkety clunk clunk,
Consing on the car.

Lispy the snowman knew
The keyboard was hot the day,
So he said, “Let’s cons and we’ll have some fun
now before they Scheme away.”
Down to the function,
With a list there in his RAM,
Running here and there,
all around the LAN, saying
“cdr me if you can.”
He led them down the streets of disk
Right to the traffic bus.
And only paused a moment when
He heard them holler (quit).

Oh BASIC Night

(to O Holy Night)

Oh BASIC night, the LEDs are brightly glinting;
It is the night of the dear GOSUB’s birth!
Long lay the world in sin and error printing,
Till you appeared and the RAM felt its worth.
Shiver of fear, line numbers do inspire,
For yonder breaks a mostly harmless GOTO.
Fall on your bits, O hear the Visual voices!
O BASIC divine, O BASIC where GOTO was born!
O BASIC, O Holy BASIC, O BASIC, you’re mine!

Some want to say, “GOTO is harmful always,”
But what of them, in their post-modern world.
We PRINT the truth, in the line-numbered goodness,
But Dijkstra appeared, and the faith, it was lost.
A thrill of hope, when .NET BASIC announces,
But Visual BASIC, what kind of thing are you?
Fall on your GUI, O see the old line numbers!
Behold BASICA, O BASIC when DOS was born!
O numbers, O lines, spaghetti divine!

Guido We Have Heard on High (Python)

(to Angels We Have Heard on High)

Guido we have hard on high
Sweetly indenting o’re the code,
And the functions in reply
Their exceptions sweetly flowed.

Refrain

Indent….. in your whitespace careful!
Indent…… in your whitespace careful!

Spaces, why this jubilee?
Why semicolons have you so wronged?
What backslashes must we use
If we want our lines so long?

(Refrain)

Come to Guido here to see
“One Right Way” is good, of course.
There’s no need for Perl, you know,
We have to be more verbose.

(Refrain)

Now the PEP will show the way
To the future, we shall see.
Banish lambda and the rest
Of the things we liked the best.

(Refrain)

The Demise of PC Magazine

I just read the news that PC Magazine is being canceled. It’s not exactly a shock, given the state of technical magazines right now. I haven’t read one of those in years, since they turned to be more of a consumer than a technical publication.

But I hope I am not the only one out there that remembers PC Magazine from the mid to late 1980s. I had two favorite parts in each issue: the programming example, and the “Abort, Retry, Fail” page at the back of the magazine.

The programming example was usually some sort of DOS (or, on occasion, OS/2) utility. It was usually written in assembly, and would be accompanied by a BASIC program you could type in to get the resulting binary, as assemblers weren’t readily available. The BASIC program was line after line of decimal numbers that would decode them and write out the resulting binary — sort of a primitive uuencode for paper. Trying to type those in gave me some serious eyestrain on more than one occasion. By now, I forget what most of those utilities did, but I remember one: BatchMan. It was a collection of tools for use in DOS batch files, and could do things like display output in color or even — yes — play monophonic music. It came with an example that displayed some lyrics about batch programming on-screen, set to what I later realized was the Batman theme. Geek nirvana, right?

But Batchman was too big to publish the source code, or the BASIC decoder, in print. It might have been one of those things that eventually led me to a CompuServe account. PC Magazine had some deal with CompuServe that you could get their utilities for free, or reduced cost — I forget. CompuServe was probably where I sent my first email, from my account which was 71510,1421 — comma and all. In later years, you could pay a small fee to send email to the Internet, and I had the amazingly attractive email address of 71510.1421@cis.compuserve.com. Take that, gmail.

PC Magazine eventually stopped running utilities that taught people about assembly or batch programming and shifted more to the genre of Windows screensavers. They stopped their articles about how hard disks work and what SCSI is all about, and instead have cover stories like “Vista made easy!” I am, sadly, not making this up. Gone are the days of investigating alternative operating systems like OS/2.

It appears that “Abort, Retry, Fail” is gone, too. It was a one-page thing at the back of each magazine that featured braindead error messages and funny stories about people that did things like FAX an image of a floppy disk to a remote office — before such stories were cliche. Sort of like DailyWTF these days. The sad truth is that the people that would FAX an image of a floppy are probably the ones that are reading PC Magazine today.

I still have a bunch of PC Magazine issues — the good ones — in my parents’ basement. I also still have my floppies with the utilities on them somewhere. One day, when I get some time — I’m estimating this will be about when Jacob goes to college — I’ll go back and take another look at them.

Real World Haskell Update

Times are exciting. Our book, Real World Haskell, is now available in a number of venues. But before I get to that, I’ve got to talk about what a thrill this project has been.

I created our internal Darcs repository in May, 2007. Since then, the three of us has made 1324 commits — and that doesn’t count work done by copyeditors and others at O’Reilly.

We made available early drafts of the book online for commenting, which served as our tech review process. By the time we finished writing the book, about 800 people had submitted over 7,500 comments. I’ve never seen anything like it, and really appreciate all those that commented about it.

As for availability, RWH is available:

  • For immediate purchase with electronic delivery, from O’Reilly’s page
  • For immediate viewing on Safari Books Online, at its book page
  • Paper editing timing is still tentative, but we’re estimating arrival in bookstores the week of December 8.

People are talking about it on blogs, twitter, etc. We’re excited!