Category Archives: Programming

More on Git, Mercurial, and Bzr

I’ve been writing a lot about this lately, I know, but it’s an interesting landscape.

I had previously discarded git, but in light of git-cvsserver (which provides a plausible way for Windows people to participate), I gave it a try.

The first thing I noticed is that git documentation, in general, is really poor. Some tutorials that claim to cover git actually cover cogito. Still others use commands that are much more complex than those in the current git, and these are just the ones linked to from the git homepage.

git’s manpages aren’t much better. There are quite a few git commands (such as log) that take arguments that other git commands accept. Sometimes this fact is documented with a pointer to these other commands, but often not; a person is left guessing what the full range of accepted arguments is.

My complaint that git is overly complex still stands. They’ve made progress, but still have a serious issue here. Part of it is because of the documentation, and part is because of the interface. I wanted to export to diffs all patches on the current branch in a repo. I asked on , and someone suggested using the revision specifier ..HEAD. Nope, didn’t work. A few other git experts chimed in, and none could come up with the correct recipe. I finally used -500, which worked but is hackish.

git’s lack of even offering support for a human to indicate renames also bothers me, though trustworthy people have assured me that it doesn’t generally cause a problem in practice.

git does have nicer intra-repo branching than Mercurial does, for the moment. But the Mercurial folks are working on that anyway, and branching to new directories still works fine for me.

But in general, git’s philosophy is to make things easy for the upstream maintainer, and doesn’t spend much effort making things easy for contributors (except to make it mildly easier to contribute to a large project like Linux). Most of my software doesn’t have a large developer community, and I want to make it as easy as possible for new developers to join in and participate. git still utterly fails on that.

I tried bzr again. It seems that every time I try it, after just a few minutes, I am repulsed. This time, I stopped when I realized that bzr doesn’t support tags and has no support for emailing changesets whatsoever. As someone that has really liked darcs send (and even used tags way back with CVS!), this is alarming. The tutorial on the bzr website referenced a command “bzr help topics”, which does not work.

So I’ll stick with my mercurial and darcs combination for now.

I announced the first version of a hg send extension yesterday as well. I think Mercurial is very close to having a working equivalent to darcs send.

Want to try living in vim

I’ve been an Emacs user for many years, though of course I know some vi and vim commands out of necessity.

I want to try taking the plunge by spending a month using vim only, no Emacs.

Sadly the vim documentation isn’t very helpful for me in a number of areas. I’m hoping someone can point me to some resources or recipes that will help with:

  • Turning off that stupid “hide most of the Debian changelog” thing. I have no idea why it does that or how to make it stop.
  • Turn on or off autoindent, syntax highlighting, etc. in various languages (really, I want to set global defaults for all of them)
  • Be able to edit another file without closing or saving the first (:e doesn’t seem to do what I want)
  • Integrate it with Mercurial and Darcs

Re-Examining Darcs & Mercurial

I recently wrote an article or two about distributed version control systems.

I’ve been using Darcs since 2005. I switched to Darcs, in fact, 10 days after the simultaneous founding announcements of git and Mercurial.

Overall, I have been happy. I continue to believe that it is the most distributed of the distributed VCSs, which is a Good Thing.

However, I have lately started having trouble with Darcs hanging while working on my Debian packages. My post to the Darcs user list drew out a few other people with this problem, which is a design flaw of Darcs.

So I revisited the VCS landscape. I re-examined git, Mercurial, and bzr. I eventually decided to give Mercurial a try. I avoided git because I write some code that is portable to Windows, and git isn’t (or isn’t very well). Also, git is complex to pick up for me, and I certainly don’t want to force something complex onto my contributors. bzr seemed to still have some strange behaviors that it’s had for awhile, and I couldn’t find even one advantage of it over Mercurial. So off I went with Mercurial.

I quickly learned a bit of a philosophical difference from Darcs to Mercurial.

Darcs avoids conflicts at all costs. Mercurial makes handling conflict easy and, in many cases, automatic.

It is exactly this Darcs behavior that both permits its excellent “darcs send” feature (still unmatched in any other VCS) and causes its hang problems.

I found Mercurial quite pleasant to work with, and *fast*. It seems to be edging out git in speed tests sometimes these days.

It is easy to get started with Mercurial. The mq system — similar to quilt or other patch-management programs — is really quite an amazing hybrid between patch management and version control. I frankly don’t see any need for other patch-management tools anymore.

Mercurial has a “patchbomb” feature where you can select a range of changesets to send off, and it will generate nice emails with one changeset per email, and send them to your selected destination, optionally with an introductory message. The normal way of interacting with other Mercurial users is via the hg export/import commands, which send around simple unified diffs plus some additional header information, optionally in the git extended diff format.

I am happy with Mercurial and am in the process of converting my Debian repositories from Darcs to Mercurial. I’m going to keep my personal code in Darcs for the moment because “darcs send” is still easier than “hg email”, but that may change before long, depending on how my experience goes.

I’d encourage others to give Mercurial a try. The community is also very nice and helpful.

I have contributed patches to Tailor to make it make exact copies of Darcs repos into Mercurial, which are now in its Darcs repo. There is also a thread on the Mercurial list with some of my initial questions/concerns coming from a Darcs perspective.

A better environment for shell scripting

Shell scripts are good for a lot of things. It’s quick and easy to design shell scripts that take input from one program, pass it to another program, munge it for filenames, etc.

But there are a few drawbacks to shell scripts.

The biggest drawback, in my opinion, is that it is extremely difficult to get quoting and escaping right. I often see things like an unquoted $@ in shell scripts (breaks if a parameter has a space in it). I also see people failing to check for errors properly (set -e helps with that). It’s also difficult to do a more modern style of exception handling (do a sequence of actions in a temporary directory, and always remove that directory, even if there’s an error, but stop processing and propagate the error). Command-line parsing is esoteric and odd, even with getopt. That’s not to say that it’s impossible to make a secure shell script that handles filenames with spaces in them properly. Just that it’s difficult, and makes using common operators like backticks difficult.
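To make the “temporary directory that always gets removed” pattern concrete, here is a minimal sketch in plain Haskell. It uses bracket from Control.Exception together with the standard directory and filepath packages. The helper name withScratchDir and the directory name are made up for illustration, and the sketch assumes no directory of that name already exists (createDirectory throws if it does).

```haskell
import Control.Exception (bracket)
import System.Directory
  ( createDirectory, doesDirectoryExist
  , getTemporaryDirectory, removeDirectoryRecursive )
import System.FilePath ((</>))

-- Hypothetical helper: run an action inside a fresh scratch directory and
-- remove that directory afterwards, whether the action returns or throws.
-- bracket re-raises the exception after the cleanup has run, so the error
-- still propagates to the caller.
withScratchDir :: String -> (FilePath -> IO a) -> IO a
withScratchDir name action = do
  tmp <- getTemporaryDirectory
  let dir = tmp </> name
  bracket (createDirectory dir >> return dir) removeDirectoryRecursive action

main :: IO ()
main = do
  tmp <- getTemporaryDirectory
  withScratchDir "scratch-example" $ \dir ->
    writeFile (dir </> "note.txt") "intermediate result\n"
  stillThere <- doesDirectoryExist (tmp </> "scratch-example")
  putStrLn ("scratch directory still present: " ++ show stillThere)
```

The shell equivalent needs a trap handler plus careful quoting of the directory name; here the cleanup is tied to the scope of the action.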

Awhile back, I toyed with the idea of making Haskell a shell scripting language. This week, I spent some time to make this a reality. I released HSH, a shell scripting environment for Haskell.

HSH makes it easy to run shell commands, set up pipelines, etc. straight from Haskell. You can either use simple strings to invoke commands (they’ll be passed to sh -c), or you can specify arguments as a list (like exec…() takes), which eliminates the strange filename problems.
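The benefit of the argument-list form can be shown without HSH at all. The sketch below uses readProcess from the standard process package rather than HSH’s own API, and assumes a Unix-like system with printf on the PATH. Each list element reaches the program as exactly one argument, so no shell ever re-splits or re-interprets it.

```haskell
import System.Process (readProcess)

main :: IO ()
main = do
  -- No shell is involved: spaces and quote characters inside an argument
  -- need no escaping, because each list element is passed through as-is.
  out <- readProcess "printf"
           ["%s\n", "name with spaces.txt", "it's quoted\"oddly"]
           ""   -- nothing is fed to the program's stdin
  putStr out
```

Doing the same through a single sh -c string would require getting the shell quoting of both filenames exactly right.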

But the really cool thing is that HSH doesn’t just let you pipe from one external program to another. It also lets you pipe to/from pure Haskell functions. Yes, you can pipe the output of ls -l straight into a Haskell version of grep. I’ve found it to be very nice, especially for more complex processing tasks.

I put these simple examples on the HSH homepage:

run $ "echo /etc/pass*" :: IO String
 -> "/etc/passwd /etc/passwd-"

runIO $ "ls -l" -|- "wc -l"
 -> 12

runIO $ "ls -l" -|- wcL
 -> 12

In this example, wcL is a pure-Haskell line-counting function.
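For readers wondering what such a function might look like, here is a rough stand-alone sketch. It is not HSH’s actual definition of wcL; countLines is a hypothetical name, and the String -> String type is simply the easiest one to plug into interact.

```haskell
-- Hypothetical stand-in for a pure line counter: takes the whole input as a
-- String and returns the number of lines, followed by a newline.
countLines :: String -> String
countLines input = show (length (lines input)) ++ "\n"

main :: IO ()
main = interact countLines
```

Compiled to a binary, this can stand in for wc -l at the end of an ordinary shell pipeline.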

The results were surprising. According to SLOCCount, porting hg-buildpackage from a shell script to an HSH script achieved a 20% reduction in source lines of code. At the same time, it gained better error handling, better safety of filenames, better type safety (compile-time type checking), etc. Yet it does exactly the same thing in almost exactly the same way.

Even greater savings are possible. I decided to reimplement a small part of sed just for fun, and that code is still in my tree. If I removed that and replaced it with a call to sed as in the shell version, that would probably buy another 5% savings.

I didn’t really expect to achieve a reduction in lines of code. I thought that I’d be lucky to come close to breaking even. After all, who’d expect something other than the shell to be better at shell scripting?

I don’t know if these results are generalizable, but I’m really excited about it.

Rebase Considered Harmful

Today I was musing about different version control systems and merge algorithms. I’ve been thinking specifically about how I maintain Debian packages in Darcs. I tend to import upstream tarballs into one branch, and maintain the Debian packages in another, simply merging when a new upstream is released.

Now, there seem to be two prevailing philosophies on how to handle merges in this case. I’m thinking here about merges back to upstream. Say I want to contribute my Debian patches to them.

  1. Commit “clean” patches upstream. Don’t have a bunch of history — the fixing typos commits, the fixing bugs commits, or the merging to track new upstream releases. Just something like a series of diffs against the current head.
  2. Bring across the full history, warts and all, and keep it around permanently.

git encourages option 1, with its rebase option. Darcs encourages option 2 (though some use its amend-record option to work more like option 1).

As I got to thinking about it, it occurred to me that git-rebase would be very nice if you are going to use philosophy 1. In short, rebase will remove your local patches from a repo, update it to the latest upstream, then re-apply your local changesets, aborting to have you fix any conflicts. This is as opposed to a more traditional merge, where you add the upstream changesets to your local branch and then commit new changesets to resolve conflicts. (So a rebase would be totally useless in situation 2.)

I got to thinking about this, and started wondering what would happen to people that I’m working with that in turn work off my branches. And sure enough, the git-rebase manpage says, “When you rebase a branch, you are changing its history in a way that will cause problems for anyone who already has a copy of the branch in their repository and tries to pull updates from you.”

I maintain, therefore, that git-rebase is evil and should be avoided. It only works for a situation where someone maintains a private branch of a project, never shared in any way except to submit patches to an upstream. Forget it if you have a team maintaining that branch, or want to post that branch online for others to help with (as I do with my Debian darcs package). Even if you keep it private now, do you really want to adopt a work process that forces you to keep it private forever, or else completely change how you work?

And this brings me back to the original question of patch philosophy. Personally, I dislike philosophy 1. I’d much rather have the full history of a change, warts and all. Look at the Linux kernel example: changesets that introduced bugs that made it into the official tree have their fixes documented, but changesets that introduced bugs that were fixed before being merged into the official tree could be lost to the public due to rebasing by submitters. Is that really what we want? I don’t think so.

With Darcs, tagging is very cheap and it is quite trivial to write an “apply a changeset bundle” script that makes a before tag, applies a series of patches, and makes an after tag. One could then run a darcs diff between the two tags to see the net effect on the repository, or could still look at the individual patches. (Or, you can avoid tagging and manually specify the “from” and “to” patches.) I find that a much better model: you can have it both ways. I’d think that most modern VCSs ought to support some variant on that, too.

And I think that git-rebase should be removed on the grounds that it encourages poor version tracking practices.

Haskell Time Travel

There is something very cool about a language in which the easiest, most direct way to explain how it solves a problem is to say, “When we pass the output of [this function] into the input for the oracle we are actually sending the data backwards in time. So when [the code] queries the oracle we get a result from the future.”

Sweet.

The story goes on to say, however, “Time travel is a very dangerous business. One false move and you can create a temporal paradox that will destroy the universe (which in this case means that the computation will diverge). When programming with values from the future, it is important never, never, to do anything with the values that might change the future. This is the temporal prime directive.”

The Haskell Blog Tutorial

The first installment of Mark C. Chu-Carroll’s Haskell tutorial series went up last week.

It begins this way:

Before diving in and starting to explain Haskell, I thought it would be good to take a moment and answer the most important question before we start:

Why should you want to learn Haskell?

It’s always surprised me how many people don’t ask questions like that.

Farther down:

So what makes Haskell so wonderful? Or, to ask the question in a slightly better way: what is so great about the pure functional programming model as exemplified by Haskell?

The answer is simple: glue.

Languages like Haskell have absolutely amazing support for modular development.

An interesting and thought-provoking article, even for someone that’s been using Haskell for more than 2 years now. (Yikes, I had no idea it was that long)

You can also see all his posts on Haskell, which include a couple more installments.

I Hate Releasing Software

I’ve written a bunch of software. I like coding, I like debugging. I like getting e-mail from people that have used my software and are happy.

I don’t like actually having to make a release.

To do a good and proper release of a program, I’d be doing approximately these tasks:

  • Upload to Debian
  • Push to my darcs repo
  • Upload a tar.gz to my server
  • Update a webpage with the latest tar.gz
  • Announce the release to freshmeat
  • Announce the release to a mailing list
  • Update/post screenshots, if things have changed

So I have two wishes. First, I want a tool that maintains a website with software listings. Each program should have its own page, with a description, links to mailing lists, download links, links to the darcs repo, screenshots, etc. It should be simple but I’m too lazy to write it.

Secondly, there should be a tool that will do all of the above tasks (except the screenshots) for me. It should infer the name of the project and the version from the data in my working directory. It should be able to automate this whole process without me having to lift a finger.

Sadly, no such thing seems to exist.

And, to date, I’ve been too lazy to write one. Does anyone know of such a thing?

Another Haskell Solution to Lars’ Problem

Yesterday, I posted an 18-line solution to Lars’ language problem. One problem with it was that it was not very memory-efficient (or time-efficient, for that matter). In other words, it was optimized for elegance.

Here is a 22-line solution that is much more memory-efficient and works well with his “huge” test case. Note to Planet readers: Planet seems to corrupt code examples at times; click on the original story to see the correct code.

import System.Environment
import Data.List
import Data.Char
import qualified Data.Map as Map

custwords = filter (/= "") . lines . map (conv . toLower)
    where iswordchar x = isAlphaNum x && isAscii x
          conv x = if iswordchar x then x else '\n'

wordfreq inp = Map.toList $ foldl' updmap (Map.empty::Map.Map String Int) inp
    where updmap nm word = case Map.lookup word nm of
                             Nothing -> Map.insert word 1 nm
                             Just x -> (Map.insert word $! x + 1) nm

freqsort (w1, c1) (w2, c2) = if c1 == c2
                                 then compare w1 w2
                                 else compare c2 c1

showit (word, count) = show count ++ " " ++ word
main = do args <- getArgs
          interact $ unlines . map showit . take (read . head $ args) .
                     sortBy freqsort . wordfreq . custwords

The main change from the previous example to this one is using a Map to keep track of the frequency of each word.
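As an aside, the same Map-based counting can be written more compactly. This is only a sketch of an alternative, not the code measured above, and it assumes a version of the containers package that provides Data.Map.Strict. insertWith does the lookup-and-update in one step, and the strict variant forces each count as it goes, playing the role of the $! above.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as Map

-- Count how often each word occurs; the result is ordered by word because
-- Map.toList returns keys in ascending order.
wordfreq :: [String] -> [(String, Int)]
wordfreq = Map.toList . foldl' bump Map.empty
  where bump counts word = Map.insertWith (+) word 1 counts

main :: IO ()
main = print (wordfreq (words "b a b"))
```

The rest of the program (custwords, freqsort, showit, and main) would be unaffected by this change.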

A Haskell solution to Lars’ Problem

Thanks to a little glitch in planet, one of Lars’ posts from 2004 came to my attention. In it, he proposes a test for language benchmarking:

Read text from the standard input and count the number of times each word occurs. Convert letters to lower case. Order the words according to frequency, words with the same frequency should be ordered in ascending lexicographic order according to character code. Print out the top N words, where N is a decimal number given on the command line. Each output line must contain the count, a space, and the word (in lower case), and end in an ASCII LINE FEED character. Output must contain exactly N such output lines and no other output lines.

A word contains only ASCII letters A through Z and a through z (convert upper case to lower case) and ASCII digits 0 through 9 and is not empty. All other characters separate words and are ignored except to notice word boundaries. Word boundaries only occur at the beginning and end of the file and at non-word characters. You may not assume a maximum length for the word, line, or input file.

He provides a tarball with sample implementations in C, Python, and Shell.

His C code is 183 lines long, Python 57, and Shell 11. The specs for this test seem particularly suited for shell.

I wrote a version in Haskell, commented and formatted approximately the same as his Python version, but using an algorithm more like the shell version. It comes in at 18 lines. Here it is:

import System.Environment
import Data.List
import Data.Char

custwords = filter (/= "") . lines . map (conv . toLower)
    where iswordchar x = isAlphaNum x && isAscii x
          conv x = if iswordchar x then x else '\n'

wordfreq = map (\x -> (head x, length x)) . group . sort

freqsort (w1, c1) (w2, c2) = if c1 == c2
                                 then compare w1 w2
                                 else compare c2 c1

showit (word, count) = show count ++ " " ++ word
main = do args <- getArgs
          interact $ unlines . map showit . take (read . head $ args) .
                     sortBy freqsort . wordfreq . custwords

Taking a look at this, one thing that might strike you is the function composition in main. This takes the output from one function and feeds it into the next -- and the Haskell syntactic sugar for this makes it look a lot like pipes in the shell version. The interact call takes, as a parameter, a function that takes a string and returns a string. interact supplies stdin as the input and prints the output to stdout. Note that, since Haskell is lazy, this does not mean buffering up the entire input or output -- it is read and written on demand.
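A much smaller program shows the same shape. This is just an illustrative sketch (shout is a made-up name): the composition reads right to left, so the input is split into lines, each line is upper-cased, and the lines are joined again, much like a three-stage shell pipeline.

```haskell
import Data.Char (toUpper)

-- Split into lines, upper-case each one, and glue the lines back together.
shout :: String -> String
shout = unlines . map (map toUpper) . lines

main :: IO ()
main = interact shout
```

Because shout is an ordinary String -> String function, it can be tested without doing any I/O; interact is the only part that touches stdin and stdout.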

The rest of the functions are also standard in Haskell, and you can find them in the index to the library reference if you want to learn more.

I understand and agree that short code doesn't necessarily mean good code, but I think that Haskell provides a very elegant and expressive solution to many problems -- one that also happens to be remarkably concise.

Updated 9/4: Changed isLower to isAlphaNum to fix a bug, and removed unnecessary Data.Map import