Monday, September 14, 2015

Processing en.wikipedia into n-grams

The website Memrise features a number of courses (prominently languages, but other subjects as well) and a method to help you learn them quickly and efficiently. Curious about how the courses are set up, I recently took a look at the ESPDIC 52,303 Vortoj course -- a course based upon the Esperanto-English dictionary. The dictionary contains 52k records of Esperanto-English translations such as:

aerohaveno : airport
aerokondukilo : air duct
aerolita : aerolitic
aerolito : meteoric stone, meteorite, aerolite

The purpose of the Memrise course is to help would-be Esperanto speakers learn the language effectively. I took a look at the first lesson and found a number of words of varying difficulty. Some were pretty obvious, such as kiloneŭtono for kilonewton. Others, however, were rather baffling.


It turns out belles-lettres is a form of writing which values aesthetic qualities. It's the "department of literature which implies literary culture and belongs to the domain of art, whatever the subject may be or the special form; it includes poetry, the drama, fiction, and criticism." I'm rather confident in saying that I had never heard of this term before in my thirty-some years, and in learning another language, this would be very low on my list of terms to learn.

Similarly, the very first lesson in this course also introduces the Esperanto word jemenano (for "Yemeni," a person from Yemen) -- also a word of extremely little value to me, judging by its frequency of use.

So I decided to see if I could come up with a more appropriate ordering for this course's astounding 3,495 lessons of 15 words each.

Assuming that the frequency of English words in a sufficiently large corpus is a good basis for such an ordering, I downloaded the full en.wikipedia data dump from 2015-08-05, a 12.2gb bz2-compressed XML file consisting primarily of the wiki markup for each article, along with some other information such as who made the most recent edit. The download itself was a pain on my internet connection, but in short order I had a 50.2gb XML file staring me in the face.

I figured the two biggest problems with this file were that it was entirely too large to work with easily and that the text I was interested in was wrapped in unnecessary wiki markup.

XML->TXT

Trying to validate such a huge file would be nearly impossible, but since Wikipedia is the seventh-most-visited website, I figure its XML is valid. Extracting the markup proved straightforward[1]:
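Roughly, the approach is to stream the XML with an XmlReader instead of loading it all at once, pulling out each <page>'s <title> and <text>; a minimal sketch (not the exact code) looks like this:

using System;
using System.Xml;

class ExtractPages
{
    static void Main()
    {
        // Stream the 50gb dump element by element; element names follow the
        // MediaWiki export schema: <page> contains <title> and, inside
        // <revision>, the raw wiki markup in <text>.
        using (var reader = XmlReader.Create("enwiki-20150805-pages-articles.xml"))
        {
            while (reader.ReadToFollowing("page"))
            {
                reader.ReadToFollowing("title");
                string title = reader.ReadElementContentAsString();

                reader.ReadToFollowing("text");
                string markup = reader.ReadElementContentAsString();

                // Hand the raw markup off to the cleanup stage below.
                Console.WriteLine("{0}: {1} chars of markup", title, markup.Length);
            }
        }
    }
}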



More painful, however, is removing the Wikipedia markup. There are already methods of doing this[2], but none as simple as invoking a single exe from a command prompt, which is what I was aiming for:
> ProcessWikipedia -input enwiki-20150805-pages-articles.xml -output plaintext.dat
I found no libraries for parsing wiki markup that actually worked; Wiki .NET Parser sounded like it might do the trick, but it failed, leaving horribly mangled results in its wake. And as exciting as writing a full-on parser with ANTLR sounded, I wanted an end result, not a process. So I started writing regular expressions to handle as many markup tags as I could find:

Remove [[Category:Constructed_languages]]: @"\[\[(?!(?:Category|File):)(?:[^|\[\]]+?\|)?([^\[\]]+?)\]\]"

Remove external links: @"\[[^\[\]]+(?: ([""']*)([^""\[\]]+)\1)\]"

Remove headings: @"(=+)([^'=]+)\1"

And so on.
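Applying the patterns is the easy part. Here's a sketch of the cleanup pass, with just two of the patterns above and replacement choices that are my best guess at the intent:

using System.Text.RegularExpressions;

static class MarkupCleaner
{
    // Unwrap internal links, keeping the display text; skip Category:/File: links.
    static readonly Regex InternalLink = new Regex(
        @"\[\[(?!(?:Category|File):)(?:[^|\[\]]+?\|)?([^\[\]]+?)\]\]",
        RegexOptions.Compiled);

    // Strip the ==...== markers around headings.
    static readonly Regex Heading = new Regex(@"(=+)([^'=]+)\1", RegexOptions.Compiled);

    public static string Clean(string markup)
    {
        string text = InternalLink.Replace(markup, "$1"); // [[page|label]] -> label
        text = Heading.Replace(text, "$2");               // == History == -> History (or string.Empty to drop headings entirely)
        // ...external links, templates, tables, and so on follow the same pattern.
        return text;
    }
}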

I saved the results to an extremely simple binary file, omitting disambiguation pages, special pages, and redirects, keeping only the article's title and the reasonably well-cleaned plaintext:
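Here, "extremely simple" really does mean simple; a layout of length-prefixed title/plaintext pairs (shown below as a sketch, not the exact format) is all BinaryWriter needs:

using System.Collections.Generic;
using System.IO;

static class PlaintextFile
{
    public static void Write(string path, IEnumerable<KeyValuePair<string, string>> articles)
    {
        using (var writer = new BinaryWriter(File.Create(path)))
        {
            foreach (var article in articles)
            {
                writer.Write(article.Key);    // article title (length-prefixed string)
                writer.Write(article.Value);  // cleaned plaintext
            }
        }
    }
}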


This processing took about 2hr6m on my already overloaded laptop, yielding an 11.4gb file.

TXT->NGrams

With such a huge corpus, I started with 1-grams -- that is, simple frequency counts of individual words ("the", "aeroplane", "flies") -- and will look to 2-grams ("the aeroplane", "aeroplane flies") in the future, for the simple reason that I don't have enough computer for such a huge task. Unfortunately, even for 1-grams over the entire corpus, a simple in-memory Dictionary<string, int> would not hold everything. (Indeed, trying this gave me an OutOfMemoryException.)
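"1-gram" and "2-gram" here mean nothing fancier than the output of a sketch like this (a naive tokenizer for illustration -- the real thing needs more care with punctuation and markup leftovers):

using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NGrams
{
    // Lowercase word-ish tokens: letters, digits, apostrophes, hyphens.
    public static List<string> Tokens(string plaintext)
    {
        var tokens = new List<string>();
        foreach (Match m in Regex.Matches(plaintext.ToLowerInvariant(), @"[\p{L}\p{N}'-]+"))
            tokens.Add(m.Value);
        return tokens;
    }

    // 2-grams are just adjacent pairs of 1-grams.
    public static IEnumerable<string> Bigrams(IList<string> tokens)
    {
        for (int i = 0; i + 1 < tokens.Count; i++)
            yield return tokens[i] + " " + tokens[i + 1];
    }
}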

Rather than use the sledgehammer that is Hadoop, I opted for a leaner solution in SQLite; processing the entire corpus took nearly 12hrs on my little laptop and, in the end, spit out a 17.5mb list of key-value pairs.
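In sketch form, the SQLite approach looks like the following (the idea rather than my exact code; the table layout and batch size here are made up): accumulate counts in an in-memory dictionary and periodically flush them into a table inside a transaction.

using System.Collections.Generic;
using System.Data.SQLite; // System.Data.SQLite NuGet package

class NGramCounter
{
    readonly Dictionary<string, int> _partial = new Dictionary<string, int>();
    readonly SQLiteConnection _conn;

    public NGramCounter(string dbPath)
    {
        _conn = new SQLiteConnection("Data Source=" + dbPath);
        _conn.Open();
        using (var cmd = new SQLiteCommand(
            "CREATE TABLE IF NOT EXISTS counts (term TEXT PRIMARY KEY, n INTEGER)", _conn))
        {
            cmd.ExecuteNonQuery();
        }
    }

    public void Add(string term)
    {
        int n;
        _partial.TryGetValue(term, out n);
        _partial[term] = n + 1;

        // Flush well before the dictionary gets anywhere near OutOfMemoryException territory.
        if (_partial.Count >= 1000000)
            Flush();
    }

    public void Flush()
    {
        using (var tx = _conn.BeginTransaction())
        {
            using (var cmd = new SQLiteCommand(
                "INSERT OR REPLACE INTO counts (term, n) VALUES (@t, " +
                "COALESCE((SELECT n FROM counts WHERE term = @t), 0) + @d)", _conn, tx))
            {
                foreach (var pair in _partial)
                {
                    cmd.Parameters.Clear();
                    cmd.Parameters.AddWithValue("@t", pair.Key);
                    cmd.Parameters.AddWithValue("@d", pair.Value);
                    cmd.ExecuteNonQuery();
                }
            }
            tx.Commit();
        }
        _partial.Clear();
    }
}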

Further Work

As of publication, the plaintext processing still has some kinks to work out, so the n-gram list contains terms that are invalid (incorrectly processed) or that are unintentional artifacts of processing. These are most noticeable at the very end of the histogram tail:

contractexpress     10
*chernushka         10
hirschig's          10
m_{\mathrm          10
\bar{m}_{\mathrm    10
furd’s              10
antennolaelaps      10
euepicrius          10
**euryparasitus     10
protocoltcp         10

I would argue that these should be eliminated, and the size of the file dramatically reduced, simply by setting a cutoff higher than 10. In fact, using a cutoff of 1000 eliminates all but 3.98% of the terms yet still gives you 94.53% coverage of the English language -- an excellent example of the Pareto Principle.
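Checking a candidate cutoff is a one-liner over the key-value list; something along these lines (illustrative only) does the arithmetic:

using System;
using System.Collections.Generic;
using System.Linq;

static class Coverage
{
    public static void Report(IList<KeyValuePair<string, long>> counts, long cutoff)
    {
        long totalTokens = counts.Sum(p => p.Value);
        var kept = counts.Where(p => p.Value >= cutoff).ToList();
        long keptTokens = kept.Sum(p => p.Value);

        Console.WriteLine("cutoff {0}: keeps {1:P2} of unique terms, covering {2:P2} of all word occurrences",
            cutoff,
            (double)kept.Count / counts.Count,
            (double)keptTokens / totalTokens);
    }
}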

Using this cutoff, you get rid of words such as "filosa", "muhsuds", "unmanoeuvrable" [sic], "anti-u-boat" and more, while still keeping "fasting", "dutton", "headlining", "cemented", and "barometric."

Here's a graph showing the cutoff and coverage of terms. A cutoff of ten thousand would again reduce the number of unique terms to a fifth, but then you lose terms such as "sandwich", "loops", and "decreases".




In an upcoming post, I'll discuss the use of word frequencies to more appropriately order the ESPDIC dictionary and build a Memrise course.

Saturday, September 5, 2015

Your app redesign sucks

Not terribly long ago, Google added their Google Play Music radio stations with a free, ad-supported version for everyone. Aside from the fact that their recommendations are terrible in my experience, they've nerfed their user experience. Just yesterday, in fact, I wanted to play the song The Pretender on my phone. Here's the result screen from my search:


While their auto-generated The Pretender Radio station is at least relevant, it's not what I am looking for -- the song I have played many times on my phone is nowhere to be seen.

I don't even get to my intended result until I scroll down. Here's what the full pane looks like:


Of the entire result pane, less the operating system's top bar and Music's top search bar and bottom music controls, about 11% of the real estate -- all of it below the initial view -- is devoted to what I'm looking for, and what I ultimately clicked on.

The radio stations have become a big pain point for me lately, as I tend to play music almost exclusively through Google Play Music, either on my phone or on my work or home laptops. The pain comes partly from the stations' A1 placement on nearly every screen, but also from the fact that Google apparently isn't taking the hint that I don't use them and won't let me collapse or hide the section.

Their desktop experience isn't much better, yielding about 26% after discounting the search bar and controls:


The first project I worked on in Bing measured dissatisfaction (referred to as DSAT) with search results in Xbox Music and Video. For example, when you search for "seven" in the context of videos, you might expect to see "Se7en", "Seven Years in Tibet", "Seven Samurai", and more. Ideally, the first result is the one you wanted; if you clicked on the fifth one, or if you didn't click any at all, that is a DSAT -- the results were poor. We took that data and learned from it to improve the quality of the results.
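In code, the rule is about as blunt as it sounds; a toy version (my paraphrase, certainly not Bing's actual metric) might be:

class SearchImpression
{
    // 1-based position of the result the user clicked; null means no click at all.
    public int? ClickedPosition { get; set; }

    // A DSAT: the user clicked nothing, or had to reach past the top result(s).
    public bool IsDsat(int satisfiedDepth = 1)
    {
        return ClickedPosition == null || ClickedPosition > satisfiedDepth;
    }
}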

I imagine Google has some sophisticated ML to do exactly that; however, this is not an engineering problem. It's a simple matter of understanding what your users want, and at least in my humble opinion, this ain't it.