As you may already know, I'm travelling to Iceland this July, and I started learning Icelandic a few months ago. It is advancing slowly but steadily, but I ran into a problem: when you are teaching yourself a new language, a list of the most common words is an invaluable tool. I was able to find the 100 most common words in a research paper (Íslenskur Orðasjóður - Building a large Icelandic corpus). I don't want to dismiss their results, but in a published paper you shouldn't count Hann and hann as two separate words, or count f. as a word, I think. They do explain their procedure in the paper, and it looks pretty good. It's just that the list they give leaves a little to be desired, and I could not find a way to use the corpus they generated to get a frequency list.
I decided to do something different. At first I thought of sampling a lot of Icelandic data (online news and such), but I didn't want to spend that much time... so I downloaded is.wikipedia.org. A meagre 42Mb of compressed data! Well, it could have been even smaller if I had been the one doing the sampling. The paper sampled 142Gb, in comparison. Truly an amazing corpus!
# Navigate through the directory tree and move all HTML files here
find ./ -name '*.html' -type f | while read -r FILE; do
    mv "$FILE" ./
done
# For each i from 1 to $FILES, write a slice of the .jpg listing to file F$i
for i in $(seq 1 "$FILES"); do
    ls *.jpg | head -n "$NUM" | tail -n "$COLUMNS" > "F$i"
done
# One word per line, count the duplicates, sort by count (highest first)
tr ' ' '\n' < WPislenska.dat | sort | uniq -c | sort -g -r > IslenskaWF-FromWP.dat
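The counting step can be tightened a little so that it avoids the Hann/hann double count I complained about. This is only a rough sketch, not what I actually ran: word_freq is a hypothetical helper name, and I am assuming that stripping leading and trailing punctuation and lowercasing is good enough. Note that awk's tolower only folds letters such as Þ or Í when your awk and locale handle UTF-8; with a plain C locale only ASCII letters are lowercased.

```shell
#!/bin/sh
# Hypothetical helper (an assumption, not part of the original pipeline):
# reads text on stdin and prints "count word" lines, most frequent first.
word_freq() {
    # 1. put every whitespace-separated token on its own line
    # 2. strip punctuation stuck to the start or end of a token
    # 3. drop tokens that are now empty
    # 4. lowercase, so "Hann" and "hann" are counted together
    # 5. count, then sort by count (descending) and by word for ties
    tr -s '[:space:]' '\n' |
        sed 's/^[[:punct:]]*//; s/[[:punct:]]*$//' |
        grep -v '^$' |
        awk '{ print tolower($0) }' |
        sort | uniq -c | sort -k1,1nr -k2
}
```

With the file names used above it would be called as word_freq < WPislenska.dat > IslenskaWF-FromWP.dat. An abbreviation like f. still ends up in the list (as f), so those would have to be weeded out by hand or with a small stop list.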