Thanks to a little glitch in planet, one of Lars’ posts from 2004 came to my attention. In it, he proposes a test for language benchmarking:
Read text from the standard input and count the number of times each word occurs. Convert letters to lower case. Order the words according to frequency, words with the same frequency should be ordered in ascending lexicographic order according to character code. Print out the top N words, where N is a decimal number given on the command line. Each output line must contain the count, a space, and the word (in lower case), and end in an ASCII LINE FEED character. Output must contain exactly N such output lines and no other output lines.
A word contains only ASCII letters A through Z and a through z (convert upper case to lower case) and ASCII digits 0 through 9 and is not empty. All other characters separate words and are ignored except to notice word boundaries. Word boundaries only occur at the beginning and end of the file and at non-word characters. You may not assume a maximum length for the word, line, or input file.
He provides a tarball with sample implementations in C, Python, and Shell.
His C code is 183 lines long, Python 57, and Shell 11. The specs for this test seem particularly suited for shell.
I wrote a version in Haskell, commented and formatted approximately the same as his Python version, but using an algorithm more like the shell version. It comes in at 18 lines. Here it is:
import System.Environment import Data.List import Data.Char custwords = filter (/= "") . lines . map (conv . toLower) where iswordchar x = isAlphaNum x && isAscii x conv x = if iswordchar x then x else '\n' wordfreq = map (\x -> (head x, length x)) . group . sort freqsort (w1, c1) (w2, c2) = if c1 == c2 then compare w1 w2 else compare c2 c1 showit (word, count) = show count ++ " " ++ word main = do args <- getArgs interact $ unlines . map showit . take (read . head $ args) . sortBy freqsort . wordfreq . custwords
Taking a look at this, one thing that might strike you is the function composition in main. This takes the output from one function and feeds it into the next -- and the Haskell syntactic sugar for this makes it look a lot like pipes in the shell version. The interact call takes, as a parameter, a function that takes a string and returns a string. interact supplies stdin as the input and prints the output to stdout. Note that, since Haskell is lazily, this does not mean buffering up the entire input or output -- it is read and written on demand.
The rest of the functions are also standard in Haskell, and you can find them in the index to the library reference if you want to learn more.
I understand and agree that short code doesn't necessarily mean good code, but I think that Haskell provides a very elegant and expressive solution to many problems -- one that also happens to be remarkably concise.
Updated 9/4: Changed isLower to isAlphaNum to fix a bug, and removed unnecessary Data.Map import
I see some bugs and obvious improvements in the code:
[list]
[*] You don’t treat the ASCII digits as word characters, as specified in the problem.
[*] You don’t seem to actually use Data.Map.
[*] If you switched the order of the count and the word, you wouldn’t need to specify your own comparison function.
[/list]
Also, for some reason the ‘<' symbol did not show up on the Planet Debian feed.
You’re right on #1 and #2 — though #2 is harmless. To fix #1, I’ve updated the example to replace isLower with isAlphaNum.
On #3, I still would need to, since we need a descending sort on the count and an ascending sort on the word.
Very interesting, thanks.
Using ghc 6.4.1 on ubuntu dapper I wasn’t able to complete the huge test within 400M of ram.. how much does it use for you ?
You are right — this one is optimized for elegance. I will post one with better memory usage later today.
Maybe using the new ByteString? :)
It is now here: [url]http://changelog.complete.org/posts/536-Another-Haskell-Solution-to-Lars-Problem.html[/ur]
Yesterday, I posted an 18-line solution to Lars’ language problem. One problem with it was that it was not very memory-efficient (or time-efficient, for that matter). In other words, it was optimized for elegance.
Here is a 22-line solution that is m