Category Archives: Technology

Desktop Linux Opinions?

I’m brainstorming about ways of setting up Linux desktop machines on a LAN for people used to Windows. It could be any size of LAN.

I’d like people to be able to sit down at any Linux machine on the LAN and log in, probably using an LDAP directory for that, and NFS-mounted home directories. I wouldn’t want to NFS-mount the entire system, for performance reasons.

So, some of the things I’m thinking about are:

  • Desktop environment: KDE or Gnome? Which would give Windows users all the tools they’d want? Which would they feel most at home with? I’m thinking it’s KDE, but Gnome has a more polished “feel” to it.
  • Image management. How could the desktops be updated? Just rsync everything except fstab over? Can we actually have a single system image? Is XOrg powerful enough to just recognize hardware at boot and Do The Right Thing? Can we build a unified initrd somehow?
  • Distribution. Debian, Ubuntu, Kubuntu? Do the Ubuntus bring anything to the table, if we take as a given that an experienced Debian admin is managing all this?
  • Laptops. What do we do about the home directories there? Some sort of automated rsync thingy?
  • Installation. FAI? Or some homegrown thing that just boots up, partitions, and runs rsync?

Managing Software

Recently I mentioned that I hate releasing software. It’s true, and I’ve decided that the first part of fixing it is to tackle the presentation of software to the world.

My current scheme of darcs.complete.org for repositories plus bare directories on my gopher (yes, that gopher) site leaves a lot to be desired. There is no bug tracker, there are few screenshots, there is no consistency. It is also not easy to empower others to work on them directly.

At the same time, I am the sole or primary contributor to most of them. These are not huge kernel-sized projects. These are smaller, bite-size projects. So I don’t want or need a lot of overhead. I’ve been thinking about my options.

  • I could just use sourceforge.net. I poked around there today, and all the advertising there is a real eyesore. Plus I figure that if anyone is getting paid for all my hard work, why should it be some random people that no longer write free software? On the other hand, it would be an easy way for my projects to gain visibility. Or I could use alioth, and give up both the advertisements and the larger visibility. But I don’t like giving up control over my site’s appearance, or being beholden to others for backups, uptime, etc.
  • I could use trac. It’s nice, and is the only option that supports darcs, and has a very cool wiki integrated into everything (even parsing out keywords from changelogs). On the other hand, downloads are — at best — attachments to wiki pages. There is no download manager. And you have to set up a separate trac instance for each project. That is a non-starter for me. If I can’t see all my bug reports at one place, the bug tracker will be too annoying to use.
  • I could use gforge or savane. These are the sourceforge forks. Neither seems to be as resource-hungry as I expected, and debs are available for both. I could just install them locally and use them for my projects, though that seems like overkill. Plus, like SourceForge and Alioth, they have a crappy web-only bug tracker. I’d rather use something like RT that works by email. (Though RT is too resource-intensive to run on my server). However, web-only is better than nothing so I could hold my nose and use it.
  • I could write my own. But I’d rather not, if there’s already something workable out there.

Is anyone else thinking about this? What are your thoughts?

I use more than one computer

I use more than one computer, and quite a bit. I use three regularly, and two or three more on occasion.

But this seems to be a surprise to many programs.

I want to carry certain things with me from machine to machine, access them from anywhere, and have changes propagate across.

Things such as:

  • Bookmarks
  • newsrc files (to mark which Usenet articles are read)
  • mail (solved with my OfflineIMAP program)
  • A small set of files
  • Contacts
  • Calendar/scheduler (appointments)

Now, MacOS X seems to do some of this with their for-pay mac.com service. But I wonder why so few other apps do this out of the box?

The newsrc question is a particularly difficult one to crack, it seems. There are various schemes for synchronizing bookmarks, but none seem to work reliably.

Sigh.

I Hate Releasing Software

I’ve written a bunch of software. I like coding, I like debugging. I like getting e-mail from people that have used my software and are happy.

I don’t like actually having to make a release.

To do a good and proper release of a program, I’d be doing approximately these tasks:

  • Upload to Debian
  • Push to my darcs repo
  • Upload a tar.gz to my server
  • Update a webpage with the latest tar.gz
  • Announce the release to freshmeat
  • Announce the release to a mailing list
  • Update/post screenshots, if things have changed

So I have two wishes. First, I want a tool that maintains a website with software listings. Each program should have its own page, with a description, links to mailing lists, download links, links to the darcs repo, screenshots, etc. It should be simple but I’m too lazy to write it.

Secondly, there should be a tool that will do all of the above tasks (except the screenshots) for me. It should infer the name of the project and the version from the data in my working directory. It should be able to automate this whole process without me having to lift a finger.
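To make the "infer the name and version" part concrete, here is a minimal sketch of one way it might work. It assumes the working directory holds a Debian-style changelog whose first line looks like "name (version) distribution; urgency=...". That assumption is mine, as are the function name parseHeader and the sample project name and version.

```haskell
import Data.Char (isSpace)

-- | Parse the first line of a Debian-style changelog, such as
--   "myproject (1.2.3) unstable; urgency=low", into (name, version).
--   Returns Nothing if the line does not have that shape.
parseHeader :: String -> Maybe (String, String)
parseHeader line =
    case break isSpace line of
      (name, rest)
        | not (null name)
        , ('(' : afterParen) <- dropWhile isSpace rest
        , (version, ')' : _) <- break (== ')') afterParen
        , not (null version)
        -> Just (name, version)
      _ -> Nothing

-- The sample line is hypothetical; a real tool would read the file instead.
main :: IO ()
main = print (parseHeader "myproject (1.2.3) unstable; urgency=low")
-- prints Just ("myproject","1.2.3")
```

A real release tool would feed this the first line of debian/changelog and then drive the upload, push, and announcement steps from the result. None of that is sketched here.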

Sadly, no such thing seems to exist.

And, to date, I’ve been too lazy to write one. Does anyone know of such a thing?

Disk encryption support in Etch

Well, I got my new MacBook Pro 15″ in yesterday. I’ll write something about that shortly. The main OS for this machine is not Mac OS X, though, but Debian.

I decided that, being a laptop, I would like to run dm-crypt on here. Much to my delight, the etch installers support dm-crypt out of the box.

Not only that, but they supported this setup out of the box, too:

  • Two partitions for Debian — one for /boot, everything else on the second one
  • The second partition is completely encrypted
  • Inside the encrypted container is an LVM physical volume
  • Inside the LVM physical volume are logical volumes for /, /home, /usr, /var, and swap
  • XFS is used for each filesystem

Not only that, but it set up proper boot sequence for all of this out of the box, too.

So I turn on the unit, enter the password for the encrypted partition, and then the system continues booting.

Nice. Very nice.

Kudos to the debian-installer and initramfs teams.

Another Haskell Solution to Lars’ Problem

Yesterday, I posted an 18-line solution to Lars’ language problem. One problem with it was that it was not very memory-efficient (or time-efficient, for that matter). In other words, it was optimized for elegance.

Here is a 22-line solution that is much more memory-efficient and works well with his “huge” test case. Note to Planet readers: Planet seems to corrupt code examples at times; click on the original story to see the correct code.

import System.Environment
import Data.List
import Data.Char
import qualified Data.Map as Map

custwords = filter (/= "") . lines . map (conv . toLower)
    where iswordchar x = isAlphaNum x && isAscii x
          conv x = if iswordchar x then x else '\n'

wordfreq inp = Map.toList $ foldl' updmap (Map.empty::Map.Map String Int) inp
    where updmap nm word = case Map.lookup word nm of
                             Nothing -> Map.insert word 1 nm
                             Just x -> (Map.insert word $! x + 1) nm

freqsort (w1, c1) (w2, c2) = if c1 == c2
                                 then compare w1 w2
                                 else compare c2 c1

showit (word, count) = show count ++ " " ++ word
main = do args <- getArgs
          interact $ unlines . map showit . take (read . head $ args) .
                     sortBy freqsort . wordfreq . custwords

The main change from the previous example to this one is using a Map to keep track of the frequency of each word.
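As an aside, the same bookkeeping can be written more compactly. This is a sketch of a variant, not the code measured above, and it assumes a containers library recent enough to provide Data.Map.Strict, which keeps the stored counts evaluated and so plays the role of the $! in updmap.

```haskell
import Data.List (foldl')
import qualified Data.Map.Strict as Map

-- | Count occurrences of each word. insertWith (+) w 1 stores 1 for a
--   word not yet in the map and adds 1 to the count of a word that is.
wordfreq :: [String] -> [(String, Int)]
wordfreq = Map.toList . foldl' bump Map.empty
    where bump m w = Map.insertWith (+) w 1 m

main :: IO ()
main = print (wordfreq (words "the cat and the hat and the bat"))
-- prints [("and",2),("bat",1),("cat",1),("hat",1),("the",3)]
```

Map.toList returns the pairs in ascending key order, which is why the output is alphabetical. The frequency ordering is still the job of sortBy freqsort.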

A Haskell solution to Lars’ Problem

Thanks to a little glitch in planet, one of Lars’ posts from 2004 came to my attention. In it, he proposes a test for language benchmarking:

Read text from the standard input and count the number of times each word occurs. Convert letters to lower case. Order the words according to frequency, words with the same frequency should be ordered in ascending lexicographic order according to character code. Print out the top N words, where N is a decimal number given on the command line. Each output line must contain the count, a space, and the word (in lower case), and end in an ASCII LINE FEED character. Output must contain exactly N such output lines and no other output lines.

A word contains only ASCII letters A through Z and a through z (convert upper case to lower case) and ASCII digits 0 through 9 and is not empty. All other characters separate words and are ignored except to notice word boundaries. Word boundaries only occur at the beginning and end of the file and at non-word characters. You may not assume a maximum length for the word, line, or input file.

He provides a tarball with sample implementations in C, Python, and Shell.

His C code is 183 lines long, Python 57, and Shell 11. The specs for this test seem particularly suited for shell.

I wrote a version in Haskell, commented and formatted approximately the same as his Python version, but using an algorithm more like the shell version. It comes in at 18 lines. Here it is:

import System.Environment
import Data.List
import Data.Char

custwords = filter (/= "") . lines . map (conv . toLower)
    where iswordchar x = isAlphaNum x && isAscii x
          conv x = if iswordchar x then x else '\n'

wordfreq = map (\x -> (head x, length x)) . group . sort

freqsort (w1, c1) (w2, c2) = if c1 == c2
                                 then compare w1 w2
                                 else compare c2 c1

showit (word, count) = show count ++ " " ++ word
main = do args <- getArgs
          interact $ unlines . map showit . take (read . head $ args) .
                     sortBy freqsort . wordfreq . custwords

Taking a look at this, one thing that might strike you is the function composition in main. This takes the output from one function and feeds it into the next -- and the Haskell syntactic sugar for this makes it look a lot like pipes in the shell version. The interact call takes, as a parameter, a function that takes a string and returns a string. interact supplies stdin as the input and prints the output to stdout. Note that, since Haskell is lazy, this does not mean buffering up the entire input or output -- it is read and written on demand.
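Here is a smaller sketch of the same idea, separate from the word-count program. The pipeline function and its line-numbering behaviour are made up for illustration. It is an ordinary String -> String function built with (.), read from right to left, and handed to interact.

```haskell
import Data.Char (toUpper)

-- | A pure pipeline: split into lines, number them, upper-case each
--   line, and join the lines back together.
pipeline :: String -> String
pipeline = unlines . map (map toUpper) . zipWith number [1 :: Int ..] . lines
    where number n l = show n ++ " " ++ l

-- interact feeds stdin to the function and writes its result to stdout.
main :: IO ()
main = interact pipeline
```

Given the input "foo\nbar\n", pipeline returns "1 FOO\n2 BAR\n". Laziness shows up if you hand it an endless input: take 6 (pipeline (cycle "ab\n")) still terminates, yielding "1 AB\n2", because only as much input is consumed as the requested output needs.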

The rest of the functions are also standard in Haskell, and you can find them in the index to the library reference if you want to learn more.

I understand and agree that short code doesn't necessarily mean good code, but I think that Haskell provides a very elegant and expressive solution to many problems -- one that also happens to be remarkably concise.

Updated 9/4: Changed isLower to isAlphaNum to fix a bug, and removed unnecessary Data.Map import

Lazy big-O and Haskell Answers

First, Evan has a host of interesting articles about Haskell, and I found his lazy big-O article particularly interesting.

Next, Eric Warmenhoven has recently taken up Haskell and posted some Haskell questions on his blog. Eric, here are some answers for you.

First, regarding shared libraries. While Haskell can be compiled to machine code, and GHC is a popular way to do that, a standard C way of representing information about a library (.h and .so files) is not really rich enough for Haskell. Consider, for instance, that functions may accept arguments of a wide range of types (or even things such as lists of any type). Haskell also performs type checking, and thus must know the type of arguments a function expects, as well as its return type, at compile time. So you do not generally compile Haskell code directly to .so files, but rather use the compiler’s module or package support to do that. See Cabal for more information on packages. Through the FFI (Foreign Function Interface), it is possible to both call into C and be called from C with Haskell code, if that’s where you want to go. It is actually easier in Haskell than in any other high-level language I’ve dealt with before.
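To give a flavor of the FFI, here is a minimal sketch of the "call into C" direction only. It binds cos from the C math library. The Haskell-side name c_cos is my own choice, and the example assumes the C math library is linked in, as it normally is with GHC on Linux.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign.C.Types (CDouble)

-- Bind the C library's cos() as a pure Haskell function. The type
-- signature is what tells the compiler how to marshal the argument
-- and the result.
foreign import ccall unsafe "math.h cos"
    c_cos :: CDouble -> CDouble

main :: IO ()
main = print (c_cos 0)
-- prints 1.0
```

The other direction, being called from C, uses foreign export declarations and is not shown here.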

Regarding circular module deps — I’ve never used them and can’t really comment. I can say, though, that the .boot files are internal files created by GHC.

Regarding practical stuff in tutorials — I share your complaint there. I have found a few that are better than the others: Yet Another Haskell Tutorial, and Haskell: The Craft of Functional Programming, 2nd ed., by Simon Thompson. Several of us are working intermittently on a project called Haskell V8 — take a look and darcs send me patches! I would say that Haskell’s I/O system is the most powerful I’ve seen in many ways — especially with regard to laziness — and in the upcoming GHC 6.6 release, it will be both lazy *and* blazingly fast. Very nice.

There isn’t much Debian-specific documentation, but there is a draft policy and a mailing list (link to it is in the policy doc).

Hope this helps!

Whose Distributed VCS Is The Most Distributed?

Lately I have been trying out a number of distributed version control systems (VCS or SCM).

One of my tests was a real problem: I wanted to track the Linux 2.6.16.x kernel tree, apply the Xen patches to it, and pull only specific patches (for the qla2xxx driver) from 2.6.17.x into this local branch. I wanted also to be able to upgrade to 2.6.17.x later (once Xen supports it) and have the version control system properly track which patches I already have.

But before going on, let’s establish what it means to be an ideal distributed VCS:

  • 1. The fundamental method of collaboration must be a branch. A checkout should mean creating a local branch on which a person can commit and work without having to involve the server. An update from some central server should take the form of a merge from that branch to the local branch.
  • 2. Branching should be cheap. It should be easy to create a local branch, the operation to do so should be fast, and it shouldn’t take an inordinate amount of space. It should also be as easy as possible to branch from a remote repository.
  • 3. Merging between branches is intelligent. It should be easy to merge another branch with your own. The VCS should know which changesets from the other branch are already on yours, and should not attempt to merge changesets that you have already merged previously.
  • 4. Individual changesets should be mergeable without bringing across the whole history. You should be able to bring across the minimum number of changesets necessary to effect a specific change. This corresponds to my test case above. Future merges from the whole branch should, of course, recognize that these changesets are present already.
  • 5. Branching preserves full history. A branch should be a first-class copy of a repository, even if the repository is remote. It should contain the full history of the branch it was made from, including diffs for each individual changeset and full commit logs, unless otherwise requested by the user.
  • 6. Merging preserves full history. A merge from one branch to another should also preserve full history. Changesets merged to the local branch should retain the individual, distinct diffs and commit logs for each changeset.

There are also some things that we would generally want:

  • 7. It is possible to commit, branch, merge, and work with history offline.
  • 8. The program is fast enough for general-purpose use.

Evaluation

Let’s look at some common VCSs against these criteria. I’ll talk about Arch (tla, baz, etc), bzr (bazaar-ng), Darcs, Git, Mercurial (hg), and Subversion (svn) for reference.

1. The fundamental method of collaboration must be a branch

All of the tools pass this test except for svn.

2. Branching should be cheap

Everyone except svn generally does this reasonably well.

The tla front end for Arch had a pretty terrible interface for this, so it took a while simply due to all the typing involved. That’s better these days.

Darcs supports hardlinking of history to other local repositories and will do this automatically by default. Git also supports that, but defaults to not doing it, or you can store a path to look in for changesets that aren’t in the current repo. I believe Mercurial also can use hardlinks, though I didn’t personally verify that. bzr appears to have some features in this area, but not hardlinks, and the features were too complex (or poorly documented) to learn about quickly.

svn does not support branching across repositories, so doesn’t really pass this test. Branches within a repository are not directly supported either, but are conventionally simulated by doing a low-cost copy into a specially-named area in the repository.

3. Merging between branches is intelligent

Arch was one of the early ones to work on this problem. It works reasonably well in most situations, but breaks in spectacular and unintelligible ways in some other situations.

When asked to merge one branch to another, Darcs will simply merge in any patches from the source branch onto the destination which the destination doesn’t already have. This goes farther than any of the other systems, which generally store a “head” pointer for each branch that shows how far you’ve gone. (Arch is closer to darcs here, though ironically bzr is more like the other systems)
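As a rough illustration of the set-of-patches view, here is a toy model. It is my own simplification and not how Darcs is implemented: it ignores patch dependencies and commutation entirely, and the names PatchId, Repo, missing, and pull are invented for the sketch.

```haskell
import qualified Data.Set as Set

type PatchId = String            -- hypothetical stand-in for a patch identity
type Repo    = Set.Set PatchId   -- a repository modelled as a set of patches

-- | Patches the source has that the destination lacks.
missing :: Repo -> Repo -> Repo
missing src dst = Set.difference src dst

-- | Pull everything missing from the source into the destination.
pull :: Repo -> Repo -> Repo
pull src dst = Set.union dst (missing src dst)

main :: IO ()
main = do
    let upstream = Set.fromList ["a", "b", "c", "d"]
        local    = Set.fromList ["a", "c", "x"]  -- "c" was pulled earlier, out of order
    print (Set.toList (missing upstream local))  -- prints ["b","d"]
    print (Set.toList (pull upstream local))     -- prints ["a","b","c","d","x"]
```

With a single “head” pointer there is no way to say "I have a and c but not b". With a set, that state is unremarkable.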

Merging between branches in svn is really poor, and has no support for recognizing changesets that have been applied both places, resulting in conflicts in many development models.

4. Individual changesets should be mergeable without bringing across the whole history

Darcs is really the only one that can do this right. I was really surprised that nobody else could, since it is such a useful and vital feature for me.

Both bzr and git have a cherry-pick mode that simulates this, but really these commands just get a diff from the specific changeset requested, then apply the diff as with patch. So you really get a different changeset committed, which can really complicate your history later, AND lead to potential conflicts in future merges. bzr works around some of the conflict problems because on a merge, it will silently ignore patches that attempt to perform an operation that has already occurred. But that leads to even more confusing results, as the merge of the patch is recorded for a commit that didn’t actually merge it. (That could even be a commit that doesn’t modify the source.) Sounds like a nightmare for later.
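A toy model of why this hurts, under loud assumptions: the Changeset type, the identifiers, and both functions are invented for illustration, and no real tool tracks merges this simply. The point is only that a cherry-pick recorded under a new identity is invisible to a later merge that compares identities.

```haskell
data Changeset = Changeset { csId :: String, csDiff :: String }
    deriving (Eq, Show)

-- | Cherry-pick modelled as "apply the diff, commit under a new identity".
--   The caller supplies the new identity for the sake of the example.
cherryPick :: String -> Changeset -> Changeset
cherryPick newId cs = cs { csId = newId }

-- | Changesets from the source whose identity the destination has not seen.
notYetMerged :: [Changeset] -> [Changeset] -> [Changeset]
notYetMerged src dst = [c | c <- src, csId c `notElem` map csId dst]

main :: IO ()
main = do
    let fix    = Changeset "src-2" "fix qla2xxx"
        source = [Changeset "src-1" "feature", fix]
        local  = [cherryPick "local-1" fix]
    -- The fix is present by content, yet it still counts as unmerged.
    print (map csId (notYetMerged source local))  -- prints ["src-1","src-2"]
```

If the picked changeset kept its original identity, only "src-1" would be left to merge, which is the behaviour criterion 4 asks for.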

Arch has some support for it, but in my experience, actually using this support tends to get it really confused when you do merges later.

Neither Mercurial nor svn has any support for this at all.

5. Branching preserves full history

git, darcs, and Mercurial get this right. Making a branch from one of these repos will give you full history, including individual diffs and commit logs for each changeset.

Arch and bzr preserve commit logs but not the individual changesets on a new branch. I was particularly surprised at this shortcoming with bzr, but sure enough, a standard bzr merge from a remote branch committed three original changesets as one and did not preserve the individual history on the one commit.

svn doesn’t support cross-repo branching at all.

6. Merging preserves full history

Again, darcs, git, and Mercurial get this right (I haven’t tested this in Mercurial, so I’m not 100% sure).

Arch and bzr have the same problem of preserving commit logs, but not individual changesets. A merge from one branch to another in Arch or bzr simply commits one big changeset on the target that represents all the changesets pulled in from the source. So you lose the distinctness of each individual changeset. This can result in the uncomfortable situation of being unable to re-create full history without access to dozens of repositories on the ‘net.

Subversion has no support for merging across repositories, and its support for merging across simulated local branches isn’t all that great, either.

7. It is possible to commit, branch, merge, and work with history offline

Everyone except Subversion does a good job of this.

8. The program is fast enough for general-purpose use

All tools here are probably fast enough for most people’s projects. Subversion can be annoying at times because many more svn commands hit the network than those from others.

In my experience, Arch was the slowest. Though it was still fine for most work, it really bogged down with the Linux kernel. bzr was next, somewhere between arch and darcs. bzr commands “felt” sluggish, but I haven’t used it enough to really see how it scales.

Darcs is the next. It used to be pretty slow, but has been improving rapidly since 1.0.0 was released. It now scales up to a kernel-sized project very well, and is quite usable and reasonably responsive for such a thing. The two main things that slow it down are very large files (10MB or above) and conflicts during a merge.

Mercurial and git appear to be fastest and pretty similar in performance.

All of these tools perform best with periodic intervention, either manual or through scheduled cron jobs, anywhere from once a month to once a year depending on your project’s size. Arch users have typically created a new repo each year. Darcs users periodically tag things (if things are tagged as part of normal work, no extra work is needed here) and can create checkpoints to speed checkouts over the net. git and Mercurial also use a form of checkpoints. (I’m not sure about bzr.)

Subversion works so differently from the others that it’s hard to compare. (For one, a checkout doesn’t bring down any history.)

Conclusions

I was surprised by a few things.

First, that only one system actually got #4 (merging individual changesets) right. Second, that if you had to pick losers among VCSs, it seems to be Arch and bzr — the lack of history in branching and merging is a really big issue, and they don’t seem to have any compelling features that git, darcs, or Mercurial lack. #4 was a unique feature to Darcs a few years ago, but I figured it surely would have been cloned by all the other new VCS projects that have popped up since. It seems that people have realized it is important, and have added token workaround support for it, but not real working support.

On the other hand, it was interesting to see how VCS projects have copied from each other. Everyone (except tla) seems to use a command-line syntax similar to CVS. The influence of tla Arch is, of course, plainly visible in baz and bzr, but you can also see pieces of it in all the other projects. I was also interested to see the Darcs notion of patch dependencies was visible (albeit in a more limited fashion) in bzr, git, and Mercurial.

So, I will be staying with Darcs. It seems to really take the idea of distributed VCS and run with it. Nobody else seems to have quite gotten the merging thing right yet — and if you are going to make it difficult to do anything but merge everything up to point x from someone’s branch, I just don’t see how your tool is as useful as Darcs. But I am glad to see ideas from different projects percolating across and getting reused — this is certainly good for the community.

Updates / Corrections

I got an e-mail explaining how to get the individual patch diffs out of bzr. This will work only for “regular”, non-cherry-picked merges, and requires some manual effort.

You’ll need to run bzr log, and find the patch IDs (these are the long hex numbers on the “merged:” line) of the changeset you’re interested in, plus the changeset immediately before it on the same branch (which may not be on the same patch and may not be obvious at all on busy projects.) Then, run bzr diff -r revid:old-revid-string..new-revid-string.

I think this procedure really stinks, though, since it requires people to manually find previous commits from the same branch in the log.