A Haskell solution to Lars’ Problem

Thanks to a little glitch in planet, one of Lars’ posts from 2004 came to my attention. In it, he proposes a test for language benchmarking:

Read text from the standard input and count the number of times each word occurs. Convert letters to lower case. Order the words according to frequency, words with the same frequency should be ordered in ascending lexicographic order according to character code. Print out the top N words, where N is a decimal number given on the command line. Each output line must contain the count, a space, and the word (in lower case), and end in an ASCII LINE FEED character. Output must contain exactly N such output lines and no other output lines.

A word contains only ASCII letters A through Z and a through z (convert upper case to lower case) and ASCII digits 0 through 9 and is not empty. All other characters separate words and are ignored except to notice word boundaries. Word boundaries only occur at the beginning and end of the file and at non-word characters. You may not assume a maximum length for the word, line, or input file.

He provides a tarball with sample implementations in C, Python, and Shell.

His C code is 183 lines long, Python 57, and Shell 11. The specs for this test seem particularly suited for shell.

I wrote a version in Haskell, commented and formatted approximately the same as his Python version, but using an algorithm more like the shell version. It comes in at 18 lines. Here it is:

import System.Environment
import Data.List
import Data.Char

custwords = filter (/= "") . lines . map (conv . toLower)
    where iswordchar x = isAlphaNum x && isAscii x
          conv x = if iswordchar x then x else '\n'

wordfreq = map (\x -> (head x, length x)) . group . sort

freqsort (w1, c1) (w2, c2) = if c1 == c2
                                 then compare w1 w2
                                 else compare c2 c1

showit (word, count) = show count ++ " " ++ word
main = do args <- getArgs
          interact $ unlines . map showit . take (read . head $ args) .
                     sortBy freqsort . wordfreq . custwords

Taking a look at this, one thing that might strike you is the function composition in main. This takes the output from one function and feeds it into the next -- and the Haskell syntactic sugar for this makes it look a lot like pipes in the shell version. The interact call takes, as a parameter, a function that takes a string and returns a string. interact supplies stdin as the input and prints the output to stdout. Note that, since Haskell is lazily, this does not mean buffering up the entire input or output -- it is read and written on demand.

The rest of the functions are also standard in Haskell, and you can find them in the index to the library reference if you want to learn more.

I understand and agree that short code doesn't necessarily mean good code, but I think that Haskell provides a very elegant and expressive solution to many problems -- one that also happens to be remarkably concise.

Updated 9/4: Changed isLower to isAlphaNum to fix a bug, and removed unnecessary Data.Map import

7 thoughts on “A Haskell solution to Lars’ Problem

  1. I see some bugs and obvious improvements in the code:
    [list]
    [*] You don’t treat the ASCII digits as word characters, as specified in the problem.
    [*] You don’t seem to actually use Data.Map.
    [*] If you switched the order of the count and the word, you wouldn’t need to specify your own comparison function.
    [/list]

    Also, for some reason the ‘<' symbol did not show up on the Planet Debian feed.

    1. You’re right on #1 and #2 — though #2 is harmless. To fix #1, I’ve updated the example to replace isLower with isAlphaNum.

      On #3, I still would need to, since we need a descending sort on the count and an ascending sort on the word.

  2. Yesterday, I posted an 18-line solution to Lars’ language problem. One problem with it was that it was not very memory-efficient (or time-efficient, for that matter). In other words, it was optimized for elegance.

    Here is a 22-line solution that is m

Leave a Reply

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.