Monday, September 14, 2015

Processing en.wikipedia into n-grams

The website Memrise features a number of courses (prominently languages, but other subjects as well) and a method to help you learn them quickly and efficiently. Curious about how the courses are set up, I recently took a look at the ESPDIC 52,303 Vortoj course -- a course based upon the ESPDIC Esperanto-English dictionary. The dictionary contains 52k records of Esperanto-English translations such as:

aerohaveno : airport
aerokondukilo : air duct
aerolita : aerolitic
aerolito : meteoric stone, meteorite, aerolite

The purpose of the Memrise course is to help would-be Esperanto speakers learn the language effectively. I took a look at the first lesson and found a number of words of varying difficulty. Some were pretty obvious, such as kiloneŭtono for kilonewton. Others, however, were rather baffling.


It turns out belles-lettres is a form of writing which values aesthetic qualities. It's the "department of literature which implies literary culture and belongs to the domain of art, whatever the subject may be or the special form; it includes poetry, the drama, fiction, and criticism." I'm rather confident in saying that I had never heard of this term before in my thirty-some years, and in learning another language, this would be very low on my list of terms to learn.

Similarly, the very first lesson in this course also introduces the Esperanto word jemenano (for "Yemeni," a person from Yemen) -- also a word of extremely little value to me, judging by its frequency of use.

So I decided to see if I could come up with a more appropriate ordering for this course's astounding 3,495 lessons of 15 words each.

Assuming the frequency of English words in a sufficiently large corpus is a good basis for such an ordering, I downloaded the full en.wikipedia data dump from 2015-08-05, a 12.2gb bz2-compressed XML file consisting primarily of the wiki markup for each article, along with some other information such as who made the most recent edit. The download itself was a pain on my internet connection, but once decompressed, I had a 50.2gb XML file staring me in the face.

I figured the two biggest problems with this file were that it was entirely too large to work with easily and that the text I was interested in was buried in unnecessary wiki markup.

XML->TXT

Trying to validate such a huge file would be nearly impossible, but since Wikipedia is the seventh-most-visited website, I figure its XML is valid. Extracting the markup proved straightforward[1]:



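Roughly, the idea is to stream the dump with XmlReader rather than load the whole file, pulling out each page's title and raw markup. A minimal sketch, assuming the standard MediaWiki export element names (page, title, text); the console output is just a stand-in for the cleanup step that follows:

using System;
using System.Xml;

class ExtractPages
{
    static void Main()
    {
        // Stream the dump so the 50gb file never has to fit in memory.
        using (var reader = XmlReader.Create("enwiki-20150805-pages-articles.xml"))
        {
            while (reader.ReadToFollowing("page"))
            {
                reader.ReadToFollowing("title");
                string title = reader.ReadElementContentAsString();

                reader.ReadToFollowing("text");
                string markup = reader.ReadElementContentAsString();

                // Hand the title and raw wiki markup to the markup-removal step below.
                Console.WriteLine("{0}: {1} characters of markup", title, markup.Length);
            }
        }
    }
}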
More painful, however, is removing the Wikipedia markup. There are already methods of doing this[2], but none as simple as invoking a single exe from a command prompt, which is what I was aiming for:
> ProcessWikipedia -input enwiki-20150805-pages-articles.xml -output plaintext.dat
I found no suitable library for parsing wiki markup: Wiki .NET Parser sounded like it might do the trick, but it failed, leaving horribly mangled results in its wake. And as exciting as writing a full-on parser with ANTLR sounded, I wanted an end result, not a process. So I started writing regular expressions to handle as many markup tags as I could find:

Unwrap internal links, leaving tags like [[Category:Constructed_languages]] for separate handling: @"\[\[(?!(?:Category|File):)(?:[^|\[\]]+?\|)?([^\[\]]+?)\]\]"

Remove external links: @"\[[^\[\]]+(?: ([""']*)([^""\[\]]+)\1)\]"

Remove headings: @"(=+)([^'=]+)\1"

And so on.
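
In code, each of these is just a Regex.Replace pass over an article's raw markup. A sketch along those lines -- the helper's name and the replacement strings are my guesses at the intent, not code from the actual tool:

using System.Text.RegularExpressions;

static class WikiMarkup
{
    // Internal links: keep the display text, but skip Category/File tags,
    // which other patterns remove outright.
    static readonly Regex InternalLink =
        new Regex(@"\[\[(?!(?:Category|File):)(?:[^|\[\]]+?\|)?([^\[\]]+?)\]\]", RegexOptions.Compiled);

    // External links: drop them entirely.
    static readonly Regex ExternalLink =
        new Regex(@"\[[^\[\]]+(?: ([""']*)([^""\[\]]+)\1)\]", RegexOptions.Compiled);

    // Headings: keep the text between the = signs.
    static readonly Regex Heading =
        new Regex(@"(=+)([^'=]+)\1", RegexOptions.Compiled);

    public static string ToPlaintext(string markup)
    {
        string text = InternalLink.Replace(markup, "$1");
        text = ExternalLink.Replace(text, "");
        text = Heading.Replace(text, "$2");
        // ...plus many more passes for templates, tables, refs, and so on.
        return text;
    }
}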

I saved the results to an extremely simple binary file, omitting disambiguation pages, special pages, and redirects, keeping only the article's title and the reasonably well-cleaned plaintext:


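Since only the title and the cleaned text survive, a pair of length-prefixed strings per article is enough. A sketch using BinaryWriter; the record layout is my assumption rather than the file's actual format:

using System.IO;

class PlaintextWriter
{
    readonly BinaryWriter _writer;

    public PlaintextWriter(string path)
    {
        // BinaryWriter.Write(string) length-prefixes each string for us.
        _writer = new BinaryWriter(File.Create(path));
    }

    // Called only for real articles; disambiguation pages, special pages,
    // and redirects are filtered out before reaching this point.
    public void WriteArticle(string title, string plaintext)
    {
        _writer.Write(title);
        _writer.Write(plaintext);
    }

    public void Close()
    {
        _writer.Close();
    }
}

Reading the file back is the mirror image: BinaryReader.ReadString twice per record until the end of the stream.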
This processing took about 2hr6m on my already overloaded laptop, yielding an 11.4gb file.

TXT->NGrams

With such a huge corpus, I started with 1-grams -- that is, simple frequency counts of individual words ("the", "aeroplane", "flies") -- and will look at 2-grams ("the aeroplane", "aeroplane flies") in the future, for the simple reason that I don't have enough computer for such a huge task. Unfortunately, even for 1-grams, a simple in-memory Dictionary<string, int> would not hold the entire corpus. (Indeed, trying this gave me an OutOfMemoryException.)
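
For scale, the naive version that blows up looks roughly like this; the tokenizing regex is a simplification I'm assuming, not the actual rule used:

using System.Collections.Generic;
using System.Text.RegularExpressions;

static class NaiveCounts
{
    static readonly Regex Word = new Regex(@"[a-z0-9'\-]+", RegexOptions.Compiled);

    // Fine for a handful of articles; throws OutOfMemoryException long before
    // it gets through 11.4gb of plaintext.
    public static Dictionary<string, int> Count(IEnumerable<string> articles)
    {
        var counts = new Dictionary<string, int>();
        foreach (string article in articles)
        {
            foreach (Match m in Word.Matches(article.ToLowerInvariant()))
            {
                int n;
                counts.TryGetValue(m.Value, out n);
                counts[m.Value] = n + 1;
            }
        }
        return counts;
    }
}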

Rather than reach for the sledgehammer that is Hadoop, I opted for a leaner solution in SQLite, which took nearly 12hrs on my little laptop to process the entire corpus and, in the end, spit out a 17.5mb list of key-value pairs.
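
One way to structure that is to keep counting into a bounded in-memory dictionary and periodically flush it into a SQLite table. A sketch using System.Data.SQLite; the table schema, flush threshold, and SQL are my own choices, not necessarily what actually ran for those 12 hours:

using System.Collections.Generic;
using System.Data.SQLite;

class NgramStore
{
    const int FlushThreshold = 500000;  // flush once the in-memory map gets this big

    readonly Dictionary<string, int> _pending = new Dictionary<string, int>();
    readonly SQLiteConnection _conn;

    public NgramStore(string path)
    {
        _conn = new SQLiteConnection("Data Source=" + path);
        _conn.Open();
        using (var cmd = new SQLiteCommand(
            "CREATE TABLE IF NOT EXISTS ngram (term TEXT PRIMARY KEY, count INTEGER)", _conn))
            cmd.ExecuteNonQuery();
    }

    public void Add(string term)
    {
        int n;
        _pending.TryGetValue(term, out n);
        _pending[term] = n + 1;
        if (_pending.Count >= FlushThreshold)
            Flush();
    }

    // Merge the in-memory counts into the table inside a single transaction,
    // then start over with an empty dictionary.
    public void Flush()
    {
        using (var tx = _conn.BeginTransaction())
        using (var update = new SQLiteCommand(
                   "UPDATE ngram SET count = count + @c WHERE term = @t", _conn))
        using (var insert = new SQLiteCommand(
                   "INSERT OR IGNORE INTO ngram (term, count) VALUES (@t, @c)", _conn))
        {
            foreach (var pair in _pending)
            {
                update.Parameters.Clear();
                update.Parameters.AddWithValue("@t", pair.Key);
                update.Parameters.AddWithValue("@c", pair.Value);
                if (update.ExecuteNonQuery() == 0)  // term not in the table yet
                {
                    insert.Parameters.Clear();
                    insert.Parameters.AddWithValue("@t", pair.Key);
                    insert.Parameters.AddWithValue("@c", pair.Value);
                    insert.ExecuteNonQuery();
                }
            }
            tx.Commit();
        }
        _pending.Clear();
    }
}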

Further Work

As of publication, the plaintext processing still has some kinks to work out, so the n-gram list contains terms that are invalid (incorrectly processed) or that are unintentional byproducts of processing. These are most noticeable at the very end of the histogram's tail:

contractexpress 10
*chernushka     10
hirschig's 10
m_{\mathrm 10
\bar{m}_{\mathrm 10
furd’s     10
antennolaelaps  10
euepicrius 10
**euryparasitus 10
protocoltcp     10

I would argue that these should be eliminated, and the size of the file dramatically reduced, simply by setting a cutoff higher than 10. In fact, a cutoff of 1000 eliminates all but 3.98% of the terms yet still gives you 94.53% coverage of the English language -- an excellent example of the Pareto Principle.
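
Those percentages fall out of a single pass over the key-value list; roughly like this, assuming the counts have already been loaded into a dictionary:

using System;
using System.Collections.Generic;
using System.Linq;

static class Coverage
{
    // counts: term -> corpus frequency; cutoff: minimum frequency for a term to be kept.
    public static void Report(Dictionary<string, int> counts, int cutoff)
    {
        long totalTokens = counts.Values.Sum(c => (long)c);
        var kept = counts.Where(kv => kv.Value >= cutoff).ToList();
        long keptTokens = kept.Sum(kv => (long)kv.Value);

        Console.WriteLine("cutoff {0}: keeps {1:P2} of unique terms and {2:P2} of all word occurrences",
            cutoff,
            (double)kept.Count / counts.Count,
            (double)keptTokens / totalTokens);
    }
}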

Using this cutoff, you get rid of words such as "filosa", "muhsuds", "unmanoeuvrable" [sic], "anti-u-boat" and more, while still keeping "fasting", "dutton", "headlining", "cemented", and "barometric."

Here's a graph showing coverage as a function of the cutoff. A cutoff of ten thousand would cut the number of unique terms to roughly a fifth again, but then you lose terms such as "sandwich", "loops", and "decreases".




In an upcoming post, I'll discuss the use of word frequencies to more appropriately order the ESPDIC dictionary and build a Memrise course.