Category Archives: Software

Re-Examining Darcs & Mercurial

I recently wrote an article or two about distributed version control systems.

I’ve been using Darcs since 2005. I switched to Darcs, in fact, 10 days after the simultaneous founding announcements of git and Mercurial.

Overall, I have been happy. I continue to believe that it is the most distributed of the distributed VCSs, which is a Good Thing.

However, I have lately started having trouble with Darcs hanging while working on my Debian packages. My post to the Darcs user list drew out a few other people whith this problem, which is a design flaw of Darcs.

So I revisited the VCS landscape. I re-examined git, Mercurial, and bzr. I eventually decided to give Mercurial a try. I avoided git because I write some code that is portable to Windows, and git isn’t (or isn’t very well). Also, git is complex to pick up for me, and I certainly don’t want to force something complex onto my contributors. bzr seemed to still have some strange behaviors that it’s had for awhile, and I couldn’t find even one advantage of it over Mercurial. So off I went with Mercurial.

I quickly learned a bit of a philosophical difference from Darcs to Mercurial.

Darcs avoids conflicts at all costs. Mercurial makes handling conflict easy and, in many cases, automatic.

It is exactly this Darcs behavior that permits both is excellent “darcs send” feature (still unmatched in any other VCS), but also causes its hang problems.

I found Mercurial quite pleasant to work with, and *fast*. It seems to be edging out git in speed tests sometimes these days.

It is easy to get started with Mercurial. The mq system — similar to quilt or other patch-management programs — is really quite an amazing hybrid between patch management and version control. I frankly don’t see any need for other patch-management tools anymore.

Mercurial has a “patchbomb” feature where you can select a range of changesets to send off, and it will generate nice emails with one changeset per email, and send them to your selected destination, optionally with an introductory message. The normal way of interacting with other Mercurial users is via the hg export/import commands, which send around simple unified diffs plus some additional header information, optionally in the git extended diff format.

I am happy with Mercurial and am in the process of converting my Debian repositories from Darcs to Mercurial. I’m going to keep my personal code in Darcs for the moment because “darcs send” is still easier than “hg email”, but that may change before long, depending on how my experience goes.

I’d encourage others to give Mercurial a try. The community is also very nice and helpful.

I have contributed patches to Tailor to make it make exact copies of Darcs repos into Mercurial, which are now in its Darcs repo. There is also a thread on the Mercurial list with some of my initial questions/concerns coming from a Darcs perspective.

A better environment for shell scripting

Shell scripts are good for a lot of things. It’s quick and easy to design shell scripts that take input from one program, pass it to another program, munge it for filenames, etc.

But there are a few drawbacks to shell scripts.

The drawback, in my opinion, is that it is extremely difficult to get quoting and escaping right. I often see things like $@ in shell scripts (breaks if a parameter has a space in it). I also see people failing to check for errors properly (set -e helps that). It’s also difficult to do a more modern style of exception handling (do a sequence of actions in a temporary directory, and always remove that directory, even if there’s an error, but stop processing and propogate the error). Command-line parsing is esoteric and odd, even with getopt. That’s not to say that it’s impossible to make a secure shell script that handles filenames with spaces in them properly. Just that it’s difficult, and makes using common operators like backticks difficult.

Awhile back, I toyed with the idea of making Haskell a shell scripting language. This week, I spent some time to make this a reality. I released HSH, a shell scripting environment for Haskell.

HSH makes it easy to run shell commands, set up pipelines, etc. straight from Haskell. You can either use simple strings to invoke commands (they’ll be passed to sh -c), or you can specify arguments as a list (like exec…() takes), which eliminates the strange filename problems.

But the really cool thing is that HSH doesn’t just let you pipe from one external program to another. It also lets you pipe to/from pure Haskell functions. Yes, you can pipe the output of ls -l straight into a Haskell version of grep. I’ve found it to be very nice, especially for more complex processing tasks.

I put these simple examples on the HSH homepage:

run $ "echo /etc/pass*" :: IO String
 -> "/etc/passwd /etc/passwd-"

runIO $ "ls -l" -|- "wc -l"
 -> 12

runIO $ "ls -l" -|- wcL
 -> 12

In this example, wcL is a pure-Haskell line-counting function.

The results were surprising. According to SLOCCount, porting hg-buildpackage from a shell script to a HSH script achieved a 20% reduction in source lines of code. And at the same time, gained better error handling, better safety of filenames, better type safety (compile-time type checking), etc. Yet it does exactly the same thing in almost exactly the same way.

Even greater savings will occur too. I decided to reimplement a small part of sed just for fun, and that code is still in my tree. If I removed that and replaced it with a call to sed as in the shell version, that would probably buy another 5% savings.

I didn’t really expect to achieve a reduction in lines of code. I thought that I’d be lucky to come close to breaking even. After all, who’d expect something other than the shell to be better at shell scripting?

I don’t know if these results are generalizable, but I’m really excited about it.

Rebase Considered Harmful

Today I was musing about different version control systems and merge algorithms. I’ve been thinking specifically about how I maintain Debian packages in Darcs. I tend to import upstream tarballs into one branch, and maintain the Debian packages in another, simply merging when a new upstream is released.

Now, there seem to be two prevailing philosophies on how to handle merges in this case. I’m thinking here about merges back to upstream. Say I want to contribute my Debian patches to them.

  1. Commit “clean” patches upstream. Don’t have a bunch of history — the fixing typos commits, the fixing bugs commits, or the merging to track new upstream releases. Just something like a series of diffs against the current head.
  2. Bring across the full history, warts and all, and keep it around permanently.

git encourages option , with its rebase option. Darcs encourages option (though some use its amend-record option to work more like ).

As I got to thinking about it, it occured to me that git-rebase would be very nice if you are going to use philosophy . In short, rebase will remove your local patches from a repo, update it to the latest upstream, then re-apply your local changesets — aborting to have you fix any conflicts. This is as opposed to a more traditional merge, where you add the upstream changesets to your local branch and then commit new changesets to resolve conflicts. (So a rebase would be totally useless in situation )

I got to thinking about this, and started wondering what would happen to people that I’m working with that in turn work off my branches. And sure enough, the git-rebase manpage says, “When you rebase a branch, you are changing its history in a way that will cause problems for anyone who already has a copy of the branch in their repository and tries to pull updates from you.”

I maintain, therefore, that git-rebase is evil and should be avoided. It only works for a situation where someone maintains a private branch of a project, never shared in any way except to submit patches to an upstream. Forget it if you have a team maintaining that branch, or want to post that branch online for others to help with (as I do with my Debian darcs package). Even if you keep it private now, do you really want to adopt a work process that forces you to keep it private forever, or else completely change how you work?

And this brings me back to the original question of patch philosophy. Personally, I dislike philosophy . I’d much rather have the full history of a change, warts and all. Look at the Linux kernel example: changesets that introduced bugs that made it into the official tree have their fixes documented, but changesets that introduced bugs that were fixed before being merged into the official tree could be lost to the public due to rebasing by submitters. Is that really what we want? I don’t think so.

With Darcs, tagging is very cheap and it is quite trivial to write an “apply a changeset bundle” script that makes a before tag, applies a series of patches, and makes an after tag. One could then run a darcs diff between the two tags to see the net effect on the repository, or could still look at the individual patches. (Or, you can avoid tagging and manually specify the “from” and “to” patches.) I find that a much better model: you can have it both ways. I’d think that most modern VCSs ought to support some variant on that, too.

And I think that git-rebase should be removed on the grounds that it encourages poor version tracking practices.

Haskell Time Travel

There is something very cool about a language in which the easiest, most direct way to explain how it solves a problem is to say, “When we pass the output of [this function] into the input for the oracle we are actually sending the data backwards in time. So when [the code] queries the oracle we get a result from the future.”

Sweet.

The story goes on to say, however, “Time travel is a very dangerous business. One false move and you can create a temporal paradox that will destroy the universe (which in this case means that the computation will diverge). When programming with values from the future, it is important never, never, to do anything with the values that might change the future. This is the temporal prime directive.”

Saving Power with CPU Frequency Scaling

Yesterday I wrote about the climate crisis. Today, let’s start doing something about it.

Electricity, especially in the United States and China, turns out to be a pretty dirty energy source. Most of our electricity is generated using coal, which despite promises of “clean coal” to come, burns dirty. Not only does it contribute to global warming, but it also has been shown to have an adverse impact on health.

So let’s start simple: reduce the amount of electricity our computers consume. Even for an individual person, this can add up to quite a bit of energy (and money) savings in a year. When you think about multiplying this over companies, server rooms, etc., it adds up fast. This works on desktops, servers, laptops, whatever.

The easiest way to save power is with CPU frequency scaling. This is a technology that lets you adjust how fast a running CPU runs, while it’s running. When CPUs run at slower speeds, they consume less power. Most CPUs are set to their maximum speed all the time, even when the system isn’t using them. Linux has support for keeping the CPU at maximum speed unless it is idle. By turning on this feature, we can save power at virtually no cost to performance. The Linux feature to handle CPU frequency scaling is called cpufreq.

Set up modules

Let’s start by checking to see whether cpufreq support is already enabled in your kernel. These commands will need to be run as root.

# cd /sys/devices/system/cpu/cpu0
# ls -l

If you see an entry called cpufreq, you are good and can skip to the governor selection below.

If not, you’ll need to load cpufreq support into your kernel. Let’s get a list of available drivers:

# ls /lib/modules/`uname -r`/kernel/arch/*/kernel/cpu/cpufreq

Now it’s guess time. It doesn’t really hurt if you guess wrong; you’ll just get a harmless error message. One hint, though: try acpi-cpufreq last; it’s the option of last resort.

On my system, I see:

acpi-cpufreq.ko     longrun.ko      powernow-k8.ko         speedstep-smi.ko
cpufreq-nforce2.ko  p4-clockmod.ko  speedstep-centrino.ko
gx-suspmod.ko       powernow-k6.ko  speedstep-ich.ko
longhaul.ko         powernow-k7.ko  speedstep-lib.ko

For each guess, you’ll run modprobe with the driver name. I have an Athlon64, which is a K8 machine, so I run:

#modprobe powernow-k8

Note that you leave off the “.ko” bit. If you don’t get any error message, it worked.

Once you find a working module, edit /etc/modules and add the module name there (again without the “.ko”) so it will be loaded for you on boot.

Governor Selection

Next, we need to load the driver that tells the kernel what governor to use. The governor is the thing that monitors the system and adjusts the speed accordingly.

I’m going to suggest the ondemand governor. This governor keeps the system’s speed at maximum unless it is pretty sure that the system is idle. So this will be the one that will let you save power with the least performance impact.

Let’s load the module now:

# modprobe cpufreq_ondemand

You should also edit /etc/modules and add a line that says simply cpufreq_ondemand to the end of the file so that the ondemand governor loads at next boot.

Turning It On

Now, back under /sys/devices/system/cpu/cpu0, you should see a cpufreq directory. cd into it.

To turn on the ondemand governor, run this:

# echo echo ondemand > scaling_governor

That’s it, your governor is enabled. You can see what it’s doing like this:

# cat cpuinfo_min_freq
800000
# cat cpuinfo_max_freq
2200000
# cat cpuinfo_cur_freq
800000

That shows that my CPU can go as low as 800MHz, as high as 2.2GHz, and that at the present moment, it’s running at 800MHz presently.

Now, check your scaling governor settings:

# cat scaling_min_freq
800000
# cat scaling_max_freq
800000

This is showing that the system is constraining the governor to only ever operate on an 800MHz to 800MHz range. That’s not what I want; I want it to scale over the entire range of the CPU. Since my cpuinfo_max_freq was 2200000, I want to write that out to scaling_max_freq as well:

echo 2200000 > scaling_max_freq

Making This The Default

The last step is to make this happen on each boot. Open up your /etc/sysfs.conf file. If you don’t have one, you will want to run a command such as apt-get install sysfsutils (or the appropriate one for your distribution).

Add a line like this:

devices/system/cpu/cpu0/cpufreq/scaling_governor = ondemand
devices/system/cpu/cpu0/cpufreq/scaling_max_freq = 2200000

Remember to replace the 2200000 with your own cpu_max_freq value.

IMPORTANT NOTE: If you have a dual-core CPU, or more than one CPU, you’ll need to add a line for each CPU. For instance:

devices/system/cpu/cpu1/cpufreq/scaling_governor = ondemand
devices/system/cpu/cpu1/cpufreq/scaling_max_freq = 2200000

You can see what all CPU devices you have with ls /sys/devices/system/cpu.

Now, save this file, and you’ll have CPU frequency scaling saving you money, and helping the environment, every time you boot. And with the ondemand governor, chances are you’ll never notice any performance loss.

This article showed you how to save power using CPU frequency scaling on Linux. I have no idea if it’s possible to do the same on Windows, Mac, or the various BSDs, but it would be great if someone would leave comments with links to resources for doing that if so.

Updated: added scaling_max_freq info

The Haskell Blog Tutorial

The first installment of Mark C. Chu-Carroll’s Haskell tutorial series went up last week.

It begins this way:

Before diving in and starting to explain Haskell, I thought it would be good to take a moment and answer the most important question before we start:

Why should you want to learn Haskell?

It’s always surprised me how many people don’t ask questions like that.

Farther down:

So what makes Haskell so wonderful? Or, to ask the question in a slightly better way: what is so great about the pure functional programming model as exemplified by Haskell?

The answer is simple: glue.

Languages like Haskell have absolutely amazing support for modular development.

An interesting and though-provoking article, even for someone that’s been using Haskell for more than 2 years now. (Yikes, I had no idea it was that long)

You can also see all his posts on Haskell, which include a couple more installments.

hpodder to be multithreaded… done right.

I’ll be hacking on my hpodder program this weekend. hpodder is a full-featured podcast aggregator that runs on the command line, and has many features over other command-line podcatchers like bashpodder, and even over GUI tools like iPodder.

I originally envisioned hpodder to be something that I’d cron up and run in the background. But I have tended to run it in the foreground more than in the background. Some others have too, and the requested hpodder feature is parallel downloads.

So I am working on that. I already have code working, in fact, that will parallelize both the hpodder update (downloading the feeds) and the hpodder download (downloading the actual episodes) commands.

Unlike ipodder, my code will make sure that no more than 1 thread will ever be downloading from a given server at a given time. ipodder had the terribly annoying habit of pointing all of its threads at a single server, thus pounding it while also providing little benefit for someone with a pipe fatter than the server’s.

Before all this multithreaded stuff could be written, I needed to write my own status bar code instead of just letting curl display its own status bar. (That wouldn’t work when there are 5 curls running at once)

I decided that I would write some generic status bar code, rather than something specific to hpodder. I took the apt-get status bar as an example, and whipped one up in Haskell and added it to my MissingH Haskell library.

But a status bar just begged for another feature: a generalized progress tracker. Something that could keep track of where a task (and its sub-tasks) are, calculate ETA, estimated time remaining, speed, etc. So I wrote that and made the status bar use it.

AND, a status bar begged for a generalized numeric formatter: something that could render 512 as 512, 2048 as 2K, 1048576 as 1M, etc. So I wrote that, and it’s general enough that it can render into both SI and “binary” units by default (and others that users may want).

Finally, I wrote a function to take a number of seconds and render it in something friendly like 23m5s like apt uses, and shoved that in MissingH as well.

So now hpodder will have a status bar, and any other Haskell program can use the same status bar code in minutes because it’s all generic. Or if someone just needs to render a number in megabytes, they can do that.

I really enjoy it when a program needs a solution that is generic enough to put in a larger library. I try to put as much of my Haskell code in MissingH as I can, so as to make it useful to others (and my other programs).

Desktop Linux: NFS or something else?

Recently, I asked for opinions on desktop Linux. Thanks very much to those that replied. I’ve set up an old laptop as an experiment. I’m using Debian, Gnome, and Systemimager. It’s been an interesting project (especially getting SystemImager and a splash screen program to do what I want).

I’d like for my desktop machines to mount /home over the network. I could use NFS, but of course that has all the well-known security risks. Is there a better network filesystem that is easy to use, fast, and more secure than NFS?

Desktop Linux Opinions?

I’m brainstorming about ways of setting up Linux desktops machines for people used to Window users on a LAN. It could be any size of LAN.

I’d like people to be able to sit down at any Linux machine on the LAN and log in — probably use a LDAP directory for that, and NFS-mounted home directories. I wouldn’t want to NFS-mount the entire thing for performance reasons.

So, some of the things I’m thinking about are:

  • Desktop environment: KDE or Gnome? Which would give Windows users all the tools they’d want? Which would they feel most at home with? I’m thinking it’s KDE, but Gnome has a more polished “feel” too it.
  • Image management. How could the desktops be updated? Just rsync everything except fstab over? Can we actually have a single system image? Is XOrg powerful enough to just recognize hardware at boot and Do The Right Thing? Can we build a unified initrd somehow?
  • Distribution. Debian, Ubuntu, Kubuntu? Do the Ubuntus bring anything to the table, if we take as a given that an experienced Debian admin is managing all this?
  • Laptops. What do we do about the home directories there? Some sort of automated rsync thingy?
  • Installation. FAI? Or some homegrown thing that just boots up, partitions, and runs rsync?

Managing Software

Recently I mentioned that I hate releasing software. It’s true, and I’ve decided that the first part of fixing it is to tackle the presentation of software to the world.

My current scheme of darcs.complete.org for repositories plus bare directories on my gopher (yes, that gopher) site leaves a lot to be desired. There is no bug tracker, there are few screenshots, there is no consistency. It is also not easy to empower others to work on them directly.

At the same time, I am the sole or primary contributor to most of them. These are not huge kernel-sized projects. These are smaller, bite-size projects. So I don’t want or need a lot of overhead. I’ve been thinking about my options.

  • I could just use sourceforge.net. I poked around there today, and all the advertising there is a real eyesore. Plus I figure that if anyone is getting paid for all my hard work, why should it be some random people that no longer write free software? On the other hand, it would be an easy way for my projects to gain visibility. Or I could use alioth, and give up both the advertisements and the larger visibility. But I don’t like giving up control over my site’s appearance, or behing beholden to others for backups, uptime, etc.
  • I could use trac. It’s nice, and is the only option that supports darcs, and has a very cool wiki integrated into everything (even parsing out keywords from changelogs). On the other hand, downloads are — at best — attachments to wiki pages. There is no download manager. And you have to set up a separate trac instance for each project. That is a non-starter for me. If I can’t see all my bug reports at one place, the bug tracker will be too annoying to use.
  • I could use gforge or savane. These are the sourceforge forks. Neither seems to be as resource-hungry as I expected, and debs are available for both. I could just install them locally and use them for my projects, though that seems like overkill. Plus, like SourceForge and Alioth, they have a crappy web-only bug tracker. I’d rather use something like RT that works by email. (Though RT is too resource-intensive to run on my server). However, web-only is better than nothing so I could hold my nose and use it.
  • I could write my own. But I’d rather not, if there’s already something workable out there.

Is anyone else thinking about this? What are your thoughts?