Last week I wrote about whatlanguageis.com, my first simple try at creating a Django-powered site. How did it come to be?
In the last few months I have been writing and thinking about several ideas to process social data with Python, undoubtedly motivated by the great book Mining the Social Web. In the back of my mind, I wanted my code to work as a web application (either for me or for others), and thought that I should comply with everyone else, learn PHP and try to make Python and PHP be friends. Of course, this is like getting an Eskimo to talk to a Polynesian. It may work, but it needs a lot of set-up from both sides.
This suggests the next question: why me, PHP? Well, where I work the de facto language for web development is PHP. I'm not a developer there; I just do the data analytics and whatever else needs to be done, using the tools I need, and report back. I thought that PHP-ing would help me fit it into the workflow. But, even if my mother programming language may be C (although my first words were in GW-Basic), my languages of choice are Python and Lisp (and I have a great fondness for Forth). I guess you realise how ugly I feel PHP is.
On 29th May I decided enough was enough: if it is something that should almost work as a black box, to automate the part I was already doing with Python code and R data churning, why bother? Just make it work. I started reading the free, online Django Book, and I was hooked (you can buy a printed copy on Amazon). It is an awesomely readable book. I had just finished reading Zed Shaw's Learn Python the Hard Way to find things I might not know yet (for example, unit testing... silly me) and Django felt just like an extension of what I usually write and what I had read. Views, urls, models: they all have this Python feeling. At least for me, of course, as I'm not a seasoned Python developer at all. I don't even consider myself seasoned in any programming language; I just know I can write the code I need. But I am no expert.
I read the first 2 chapters very quickly, and on Wednesday 30 I started testing how Django worked: basic view-url combinations and a test on templating. Just the hello world tests, so to say. I set the book aside while I was thinking about what would make a good initial project.
Related to my social media endeavours I had a tickling question: how do I decide if a Twitter user's data is valid or not? For example, if I want to do some kind of language analysis, I need to know the user's language. I had a simple idea around that week (I think it was Monday 28), but didn't get to code it until the 4th of June (Monday): you can read about it in Language detection in Python with NLTK and Stopwords.
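To give a rough flavour of the stopwords idea without repeating that post, here is a minimal sketch. It is not the code behind the site: the tiny hand-picked stopword sets and the names STOPWORDS, language_scores and detect_language are assumptions made up for illustration, standing in for the much larger lists that NLTK provides.

```python
import re

# Tiny hand-picked stopword sets, for illustration only (an assumption).
# A real detector would use far larger lists, such as the ones in NLTK.
STOPWORDS = {
    "english": {"the", "and", "is", "of", "to", "in", "it", "that", "on", "with"},
    "spanish": {"el", "la", "y", "es", "de", "que", "en", "los", "con", "una"},
    "french": {"le", "la", "et", "est", "de", "que", "dans", "les", "avec", "une"},
}


def language_scores(text):
    """Count how many distinct stopwords of each language appear in the text."""
    # Lowercase the text and keep runs of letters only (no digits, no underscores).
    words = set(re.findall(r"[^\W\d_]+", text.lower()))
    return {lang: len(words & stops) for lang, stops in STOPWORDS.items()}


def detect_language(text):
    """Return the language with the most stopword hits, or None if there are none."""
    scores = language_scores(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    print(detect_language("The cat is on the table and it is happy"))
    print(detect_language("El perro es de la casa y los gatos"))
```

The language whose stopwords overlap most with the text wins. With so few words per language this is only a toy, but the shape of the idea is the same.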
On Tuesday 5, after work, I created the Django project whatlanguageis on my computer and read the chapter on Forms. I created a very basic, stripped-down HTML page (I think I didn't even add a head section...) where I could enter some text and get a language as a reply from the Django test server. Yay! It was very easy to do: Django does a lot of the heavy lifting and this was straightforward to code.
Then on Wednesday 6 I took the day slightly off (only answering urgent emails and sales enquiries) and wrote the whole site whatlanguageis.com, which involved:
- Reading the chapter on Models to add a little database of affiliate books, checking how it works
- Reading the chapter on the Admin interface to add some books easily, checking it
- Reading the chapter on Deployment...
- ... which involved a lot of VirtualHost tinkering with my httpd.conf file
- Purchasing a domain and setting up the DNS records
- Designing the page (this was the quickest part, Twitter Bootstrap to the rescue)
My initial plan for Wednesday was not to do everything and skip work. My original plan was just to hook up the database and check I could read and write the data as expected, which forced me to read the Models chapter. But as soon as I had finished this hooking and checking, the site was almost ready, needing only a neat dressing. Some Twitter Bootstrap and css-arrows later, the site was working wonderfully on my local machine. Getting it to work on my remote machine was trickier, due to the setup of static files in Django. It's not hard, it is just tricky to find where it fails when it does. In any case, I got it working.
Of course, as I have said many times, whatlanguageis.com is currently just a proof of concept of two things: language detection and setting up a Django site. Both worked very well, but now I have to sit down and refactor the whole of it.
The site works well as it is (no need for much refactoring except for the needed changes in views), but the language detector needs a proper module, unit tests and a major revamp. Among others, my future plans for whatlanguageis.com include:
- Add a lot of languages. More or less I will reuse the ideas from an old post: The 100 most common words in Icelandic, automatically generated from Wikipedia
- Add the ability to also detect programming languages (or at least the most common ones)
- Probably add the ability to detect language families (e.g. when unsure, classify as Scandinavian, or classify as Lisp-like)
- Improve the results: unless the analyzer is completely sure, I want to know what the text could be (e.g. for very close languages like Norwegian Bokmål and Danish detection is hard; thanks to @qimarc for this specific and useful example)
- Add checks for word diversity in the received text
- Unit-test all the things!
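Two of these plans are easy to sketch, as a rough idea of where they could go rather than what the site does today. The function names word_diversity and rank_candidates, the 20% margin and the scores in the usage example are all assumptions made up for illustration.

```python
import re


def word_diversity(text):
    """Ratio of distinct words to total words (0.0 for empty text)."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    return len(set(words)) / len(words) if words else 0.0


def rank_candidates(scores, margin=0.2):
    """Return every language whose score is within `margin` of the best one.

    `scores` maps language name to a non-negative score; `margin` is a
    relative tolerance (an arbitrary choice for this sketch).
    """
    if not scores:
        return []
    best = max(scores.values())
    if best <= 0:
        return []
    # Sort from highest to lowest score, then keep the near-ties.
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [lang for lang, score in ranked if score >= best * (1 - margin)]


if __name__ == "__main__":
    # Made-up scores, only to show the shape of the output.
    print(rank_candidates({"danish": 9, "norwegian": 8, "english": 1}))
    print(word_diversity("la la la la"))
```

A text that repeats a single word gets a low diversity ratio and could be rejected before detection, and close languages come back together as candidates instead of a single, overconfident answer.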
I hope you enjoyed the whatlanguageis.com tale and liked the memes lying around. Somehow I felt this post needed them. Thanks to memegenerator.net for the ability to create them :)