20100520

The 100 most common words in Icelandic, automatically generated from Wikipedia

Scroll down to download the file if you want to skip the "how I did it" part!

As you may already know, I'm travelling to Iceland this July, and I started learning Icelandic a few months ago. It's advancing slowly but surely, but I hit a problem: when you are teaching yourself a new language, an invaluable tool is a list of the most common words. I was able to find the 100 most common words in a research paper (Íslenskur Orðasjóður - Building a Large Icelandic Corpus). I don't want to dismiss their results, but for a published paper you shouldn't count Hann and hann twice, or count f. as a word, I think. They explain the procedure in the paper, and it looks pretty good; it's just that the list they give leaves a little to be desired, and I could not find a way to use the corpus they generated to get the frequency list myself.

I decided to do something different. First I thought of sampling a lot of Icelandic data myself (online news and such), but I didn't want to spend that much time, so I just downloaded is.wikipedia.org. A meagre 42 MB of compressed data! It could have been even smaller if I had been the one sampling it. The paper's crawl, in comparison, took in 142 GB. Truly an amazing corpus!
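If you want to reproduce the download, a wget mirror along these lines should work (a sketch, not necessarily the exact command I ran):
# Be gentle with the live site; the dumps at dumps.wikimedia.org are the polite source.
wget --mirror --no-parent --accept html --wait=1 http://is.wikipedia.org/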

After I had the data, I wrote a small script that moved all the HTML files into the same directory:
#!/bin/bash

# Walk the directory tree and move every HTML file into the
# current directory. Note the quotes around '*.html': without
# them the shell expands the glob before find ever sees it.

for FILE in $(find . -name '*.html' -type f); do
    mv "$FILE" ./
done
My idea was then to cat all these files into one big HTML file, and then do the word frequency analysis there. Problem: cat *.html > file does not work here, because *.html expanded to too many arguments (around 60 thousand files, I think) for a single command line. Instead of writing a script (the solution I should have used) I just cat-ed letter by letter, as in cat A*.html > is-a.dat. I should have used a script similar to the one I created for the Christmas postcard:
for i in $(seq 1 $FILES); do
    # Each F$i gets the next $COLUMNS file names: take the first
    # $i * $COLUMNS names and keep only the last $COLUMNS of them.
    let NUM=$i*$COLUMNS
    ls *.jpg | head -n $NUM | tail -n $COLUMNS > F$i
done
This is the original code: in the file F$i I would get the list of files to cat together in that batch. Anyway, I did it by hand. On the way to the final file I found several letter combinations with a lot of pages (Flokkur, which means Category, for instance) that cat also could not manage (the real problem is the operating system's limit on command-line length, ARG_MAX, rather than cat or bash themselves). I removed them, because the category pages could have a strong bias towards certain words.
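For the record, the clean solution avoids expanding *.html in the shell at all. A sketch with find and xargs, which hands cat the file names in safely sized batches instead of one giant command line:
# Everything is already in one directory, hence -maxdepth 1.
find . -maxdepth 1 -name '*.html' -type f -print0 | xargs -0 cat > WPislenska.dat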

Once I had this really big WPislenska.dat file, it was just a standard command line trick (which I got from the Linux Cookbook):
tr ' ' '\n' < WPislenska.dat | sort | uniq -c | sort -g -r > IslenskaWF-FromWP.dat
This turns spaces into newlines (one word per line), sorts the words alphabetically, counts the occurrences of each unique word, and orders the result by decreasing frequency.
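Note that this pipeline still counts Hann and hann separately, the very thing I complained about above. A slightly more careful sketch (assuming GNU awk and grep in a UTF-8 locale, so Icelandic letters like þ and ð survive; plain tr works byte by byte and would mangle them) folds case and splits on anything that is not a letter:
awk '{print tolower($0)}' WPislenska.dat | grep -oE '[[:alpha:]]+' | sort | uniq -c | sort -g -r > IslenskaWF-FromWP.dat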

Now IslenskaWF-FromWP.dat contains word-frequency counts for the data in this Wikipedia dump. The next step was the maddening one: removing all the HTML entities and Wikipedia boilerplate words (like page, visitors, users...) and finding the English translations, via Wiktionary, my Icelandic dictionary, my Icelandic learning course and Ubiquity.
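Most of the HTML part could have been scripted, by the way. A rough sketch to run before the counting step (it is line-based, so tags that span lines escape it, and the entity list here is just the usual suspects, nowhere near complete):
# Strip tags, then replace a few common entities.
sed -e 's/<[^>]*>//g' -e 's/&nbsp;/ /g; s/&amp;/\&/g; s/&quot;/"/g' WPislenska.dat > WPislenska-clean.dat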

The final result is this file, with the 100 most common words in the Icelandic Wikipedia. If you can help with some of the messier translations, feel free to add a comment in the comment box below.
