September 4th, 2006
Thanks to a little glitch in planet, one of Lars’ posts from 2004 came to my attention. In it, he proposes a test for language benchmarking:
Read text from the standard input and count the number of times each word occurs. Convert letters to lower case. Order the words according to frequency, words with the same frequency should be ordered in ascending lexicographic order according to character code. Print out the top N words, where N is a decimal number given on the command line. Each output line must contain the count, a space, and the word (in lower case), and end in an ASCII LINE FEED character. Output must contain exactly N such output lines and no other output lines.
A word contains only ASCII letters A through Z and a through z (convert upper case to lower case) and ASCII digits 0 through 9 and is not empty. All other characters separate words and are ignored except to notice word boundaries. Word boundaries only occur at the beginning and end of the file and at non-word characters. You may not assume a maximum length for the word, line, or input file.
He provides a tarball with sample implementations in C, Python, and Shell.
His C code is 183 lines long, Python 57, and Shell 11. The specs for this test seem particularly suited for shell.
I wrote a version in Haskell, commented and formatted approximately the same as his Python version, but using an algorithm more like the shell version. It comes in at 18 lines. Here it is:
custwords = filter (/= "") . lines . map (conv . toLower)
where iswordchar x = isAlphaNum x && isAscii x
conv x = if iswordchar x then x else '\n'
wordfreq = map (\x -> (head x, length x)) . group . sort
freqsort (w1, c1) (w2, c2) = if c1 == c2
then compare w1 w2
else compare c2 c1
showit (word, count) = show count ++ " " ++ word
main = do args <- getArgs
interact $ unlines . map showit . take (read . head $ args) .
sortBy freqsort . wordfreq . custwords
Taking a look at this, one thing that might strike you is the function composition in main. This takes the output from one function and feeds it into the next -- and the Haskell syntactic sugar for this makes it look a lot like pipes in the shell version. The interact call takes, as a parameter, a function that takes a string and returns a string. interact supplies stdin as the input and prints the output to stdout. Note that, since Haskell is lazily, this does not mean buffering up the entire input or output -- it is read and written on demand.
The rest of the functions are also standard in Haskell, and you can find them in the index to the library reference if you want to learn more.
I understand and agree that short code doesn't necessarily mean good code, but I think that Haskell provides a very elegant and expressive solution to many problems -- one that also happens to be remarkably concise.
Updated 9/4: Changed isLower to isAlphaNum to fix a bug, and removed unnecessary Data.Map import