Thursday, October 29, 2015

Another year at Microsoft, another position change

A little over a year ago, I moved from Bing to DX, part of Microsoft's evangelism group, to work on a project that still has not been disclosed. Tomorrow will be my last day within DX as I move to the Windows and Devices Group (WDG) to work yet again on some truly big data + machine learning.

In Bing, I worked on a few projects utilizing Microsoft's Cosmos distributed and massively parallel processing system before moving over to do some tented work on Azure and websites and other secret things. Starting next week, I'll once again be doing severely large-scale work in analytics and prediction, with some natural language stuff thrown in for good measure. Or so I'm led to understand.

The position sort of fell into my lap, so it's with some trepidation that I'm leaving a project where I get to do some cool stuff and implement a lot of neat features -- which, of course, I can't yet discuss. Maybe once my soon-to-be-former team goes public, I'll do some blog posts about what I've been doing.

Saturday, October 3, 2015

Climbing Mount Saint Helens

Yesterday, I climbed Mount Saint Helens with my friends Tim and Mike. We started at Climber's Bivouac at 6am. After about an hour, when starting to get above much of the dense forest, we got an amazing view of the sunrise.

Sunrise looking at Mt. Adams (left) and Mt. Hood (right)

Looking through the trees at a layered sunrise.

In short order, we reached the treeline and the start of the very rocky, ashy incline up the mountain.

Mike looking up the trail

It quickly turned to a straight shot up the mountain over decently sized boulders.

When we started off, it was rather foggy at Climber's Bivouac. Throughout the day, the clouds stayed down at the bottom, providing a beautiful blanket across the national forest.

Scrambling up all the volcanic rock made some form of gloves a very useful addition to my gear for the day.

A couple years ago, I bought the boys some small stuffed black bears at REI. Reed's bear "Junior" somehow got lost in some of my gear, so I had a stowaway for the day.

The last 1,000 or so vertical feet were through very soft ash and gravel. Here, I'm looking down the mountain at Mike, Tim, and the dozens of other climbers.

At the top, the wind varied from nil to extremely strong, blasting sand into our faces. Between that and the sun, I'm a super squinty guy.

We had an amazing view of the crater, lava dome, etc. I tried to get a few decent panoramas to try to illustrate how incredible the sight is. Here's a decent stitch; the steam coming out of the middle is, I believe, methane gas.

I got a quick shot at the top showing how windy it was:

It took us a little over five hours (6:00-11:20) to get to the top. The descent back to the Bivouac took until 15:00, but not without a few more photo ops:

Looking back up the mountain during the descent

Junior was on the mountain illegally, as he didn't have his own permit.

Monday, September 14, 2015

Processing en.wikipedia into n-grams

The website Memrise features a number of courses (prominently languages, but other subjects as well) and a method to help you learn them quickly and efficiently. Curious about how the courses are set up, I recently took a look at the ESPDIC 52,303 Vortoj course -- a course based upon the Esperanto-English dictionary. The dictionary contains 52k records of Esperanto-English translations such as:

aerohaveno : airport
aerokondukilo : air duct
aerolita : aerolitic
aerolito : meteoric stone, meteorite, aerolite

The purpose of the Memrise course is to help would-be Esperanto speakers with learning the language effectively. I took a look at the first lesson and found a number of words of varying difficulty. Some were pretty obvious, such as kiloneŭtono for kilonewton. Others, however, were rather baffling.

It turns out belles-lettres is a form of writing which values aesthetic qualities. It's the "department of literature which implies literary culture and belongs to the domain of art, whatever the subject may be or the special form; it includes poetry, the drama, fiction, and criticism." I'm rather confident in saying that I had never heard of this term before in my thirty-some years, and in learning another language, this would be very low on my list of terms to learn.

Similarly, the very first lesson in this course also introduces the Esperanto word jemenano (for "Yemeni," a person from Yemen) -- also a word that is of extremely little value to me according to its frequency of use.

So I decided to see if I could come up with a more appropriate ordering of this course's astounding 3495 lessons with 15 words/lesson.

Assuming the frequency of English words in a sufficiently large corpus is a good basis, I downloaded the full en.wikipedia data dump from 2015-08-05, a 12.2gb bz2 XML file consisting primarily of the wiki markup for each article as well as some other information such as who made the most recent edit. This download itself was a pain on my internet connection, but in short order, I had a 50.2gb XML file staring me in the face.

I figured the two biggest problems with this file were that it was entirely too large to easily work with and that the text in which I was interested contained unnecessary wiki markup.


Trying to validate such a huge file would be nearly impossible, but since Wikipedia is the seventh-most-visited website, I figure their XML is valid. Extracting the markup proved straightforward[1]:
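As a rough illustration of how a dump this size can be read without loading it into memory, here is a minimal streaming sketch. It is written in Python rather than the .NET tooling used for this project, and it is not the original extraction code. The function names (`iter_pages`, `local`) are hypothetical, and it assumes only that each article sits in a `<page>` element containing a `<title>` and a `<text>` element.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip any '{namespace}' prefix so tags can be matched by local name."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(source):
    """Yield (title, wiki markup) for each <page>, one page at a time."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event != "end" or local(elem.tag) != "page":
            continue
        title, text = None, ""
        for child in elem.iter():
            name = local(child.tag)
            if name == "title":
                title = child.text
            elif name == "text":
                text = child.text or ""
        yield title, text
        root.clear()  # drop pages already processed so memory stays flat


# Tiny in-memory stand-in for the real dump (illustrative only).
sample = io.BytesIO(
    b"<mediawiki><page><title>Esperanto</title>"
    b"<revision><text>'''Esperanto''' is ...</text></revision></page></mediawiki>"
)
for title, markup in iter_pages(sample):
    print(title, len(markup))
```

For the real file, `source` would be the path to the decompressed XML (or an open `bz2` file object) instead of the in-memory sample.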

More painful, however, is removing the Wikipedia markup. There are already methods of doing this[2], but none as simple as invoking a single exe from a command prompt, which is what I was aiming for:

> ProcessWikipedia -input enwiki-20150805-pages-articles.xml -output plaintext.dat

I found no suitable libraries for parsing wiki markup that worked; Wiki .NET Parser sounded like it may do the trick, but it failed, leaving horribly mangled results in its wake. And as exciting as writing a full-on parser with ANTLR sounded, I wanted an end result and not a process. So I started writing regular expressions to handle as many markup tags as I could find:

Remove [[Category:Constructed_languages]]: @"\[\[(?!(?:Category|File):)(?:[^|\[\]]+?\|)?([^\[\]]+?)\]\]"

Remove external links: @"\[[^\[\]]+(?: ([""']*)([^""\[\]]+)\1)\]"

Remove headings: @"(=+)([^'=]+)\1"

And so on.
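To show how patterns like these behave, here is a small Python sketch that applies the three expressions above in sequence. The patterns are carried over from the .NET string literals (with `""` unescaped to `"`). The replacement strings are assumptions, since only the patterns are given: links are replaced by their visible label and headings by their title text. The names `RULES` and `strip_markup` are hypothetical.

```python
import re

# (pattern, replacement) pairs; the replacements are assumed, not from the post.
RULES = [
    # [[Target|label]] or [[Target]] -> label/Target; Category: and File: links are skipped.
    (re.compile(r"\[\[(?!(?:Category|File):)(?:[^|\[\]]+?\|)?([^\[\]]+?)\]\]"), r"\1"),
    # [http://example.com Label] -> Label
    # NOTE: the first character class is greedy, so for a multi-word label
    # only the last word ends up in group 2.
    (re.compile(r'''\[[^\[\]]+(?: (["']*)([^"\[\]]+)\1)\]'''), r"\2"),
    # == Heading == -> " Heading " (surrounding spaces are kept)
    (re.compile(r"(=+)([^'=]+)\1"), r"\2"),
]


def strip_markup(text):
    """Apply each rule in order and return the partially cleaned text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


print(strip_markup("[[Esperanto|the language]] and [[grammar]]"))
# -> the language and grammar
```

A handful of regular expressions will never cover all of wiki markup (templates, tables, and nested links in particular), which is consistent with the leftover debris described under Further Work.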

I saved the results to an extremely simple binary file, omitting disambiguation pages, special pages, and redirects, keeping only the article's title and the reasonably well-cleaned plaintext:

This processing took about 2hr6m on my already overloaded laptop, yielding an 11.4gb file.


With such a huge corpus, I started with 1-grams -- that is, simple frequency counts of individual words ("the", "aeroplane", "flies") -- for the simple reason that I don't have enough computer for anything bigger. In the future, I will look to 2-grams ("the aeroplane", "aeroplane flies"). Unfortunately, even for 1-grams, a simple in-memory Dictionary<string, int> would not hold the entire corpus. (Indeed, trying this gave me an OutOfMemoryException.)

Rather than use the sledgehammer that is Hadoop, I opted for a leaner solution in SQLite, taking nearly 12hrs on my little laptop to process the entire corpus which, in the end, spit out a 17.5mb list of key-value pairs.
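A minimal sketch of the SQLite approach, in Python rather than .NET: count the words of one article in memory, then fold those counts into an on-disk table, so the full vocabulary never has to fit in RAM. The table name `unigrams`, the tokenizer, and the function names are all assumptions for illustration and not the schema actually used.

```python
import re
import sqlite3
from collections import Counter

# Assumed tokenizer: lowercase runs of letters, apostrophes, and hyphens.
TOKEN = re.compile(r"[a-z'\-]+")


def open_db(path=":memory:"):
    """Open (or create) the database holding one row per distinct term."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS unigrams ("
        "term TEXT PRIMARY KEY, count INTEGER NOT NULL)"
    )
    return conn


def add_counts(conn, text):
    """Count the 1-grams of one article and merge them into the table."""
    counts = Counter(TOKEN.findall(text.lower()))
    # Make sure every term has a row, then add this article's counts to it.
    conn.executemany(
        "INSERT OR IGNORE INTO unigrams (term, count) VALUES (?, 0)",
        ((term,) for term in counts),
    )
    conn.executemany(
        "UPDATE unigrams SET count = count + ? WHERE term = ?",
        ((n, term) for term, n in counts.items()),
    )


conn = open_db()
add_counts(conn, "The aeroplane flies")
add_counts(conn, "the aeroplane")
conn.commit()
print(conn.execute("SELECT term, count FROM unigrams ORDER BY term").fetchall())
```

Committing once per batch of articles rather than once per article is usually the biggest factor in how long a run like this takes.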

Further Work

As of publication, the plaintext processing still has some kinks to work out, so the n-grams contain terms that are invalid (incorrectly processed) or are unintentional results of processing. These are most noticeable at the very end of the histogram tail:

contractexpress 10
*chernushka     10
hirschig's 10
m_{\mathrm 10
\bar{m}_{\mathrm 10
furd’s     10
antennolaelaps  10
euepicrius 10
**euryparasitus 10
protocoltcp     10

I would argue that these should be eliminated and the size of the file dramatically reduced simply by setting a cutoff higher than 10. In fact, using a cutoff of 1000 eliminates all but 3.98% of all terms, but it still gives you 94.53% coverage of the English language -- an excellent example of the Pareto Principle.

Using this cutoff, you get rid of words such as "filosa", "muhsuds", "unmanoeuvrable" [sic], "anti-u-boat" and more, while still keeping "fasting", "dutton", "headlining", "cemented", and "barometric."

Here's a graph showing the cutoff and coverage of terms. Ten-thousand would reduce the number of unique terms again a fifth, but then you lose terms such as "sandwich", "loops", and "decreases".
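The two numbers behind that trade-off can be computed directly from the term counts. The sketch below is an illustration under assumptions: the function name `cutoff_stats` is hypothetical, "coverage" is taken to mean the share of all word occurrences belonging to the kept terms, and a term is kept when its count is at least the cutoff.

```python
def cutoff_stats(counts, cutoff):
    """Return (share of unique terms kept, share of occurrences covered)."""
    total_terms = len(counts)
    total_occurrences = sum(counts.values())
    kept = [n for n in counts.values() if n >= cutoff]
    return len(kept) / total_terms, sum(kept) / total_occurrences


# Toy counts for illustration only; these are not figures from the corpus.
toy = {"the": 1000, "of": 500, "filosa": 10, "muhsuds": 10}
terms_kept, coverage = cutoff_stats(toy, cutoff=100)
print(f"{terms_kept:.1%} of terms kept, {coverage:.1%} of occurrences covered")
```

Running this over a range of cutoffs and plotting the two values against each other gives a curve of the kind described above.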

In an upcoming post, I'll discuss the use of word frequencies to more appropriately order the ESPDIC dictionary and build a Memrise course.

Saturday, September 5, 2015

Your app redesign sucks

Not terribly long ago, Google added their Google Play Music radio stations with a free, ad-supported version for everyone. Aside from the fact that their recommendations are terrible in my experience, they've nerfed their user experience. Just yesterday, in fact, I wanted to play the song The Pretender on my phone. Here's the result screen from my search:

While their auto-generated The Pretender Radio station is at least relevant, it's not what I am looking for -- the song I have played many times on my phone is nowhere to be seen.

I don't even get to my intended result until I scroll down. Here's what the full pane looks like:

Of the entire result pane, less the operating system's top bar and Music's top search bar and bottom music controls, about 11% of the real estate -- all of it below the initial view -- is devoted to what I'm looking for, and what I ultimately clicked on.

The radio stations have become a big pain point for me lately, as I tend to play music almost exclusively from Google Play Music, either on my phone or on my work or home laptops. The pain comes partially from their A1 position on nearly every screen but also from the fact that Google apparently isn't taking the hint that I don't use it and isn't letting me collapse or hide this section.

Their desktop experience isn't much better, yielding about 26% after discounting the search bar and controls:

The first project I worked on in Bing measured dissatisfaction (referred to as DSAT) with Xbox Music and Video search results. For example, when you search for "seven" in the context of videos, you might expect to see the results "Se7en", "Seven Years in Tibet", "Seven Samurai", and more. Ideally, the first result is the one you wanted, and if you clicked on the fifth one or if you didn't click any, that is a DSAT -- the results were poor. We took those data and learned from them to improve the quality of the results.

I imagine Google has some sophisticated ML to do exactly that; however, this is not an engineering problem. It's a simple matter of understanding what your users want, and at least in my humble opinion, this ain't it.

Wednesday, July 22, 2015

Riding bikes: zero to sixty in one day

We've been trying to get our boys to learn to ride their bicycles for years. They never really liked riding with training wheels, and when we took away the training wheels, they wanted them back. (That didn't happen.)

When we lived in Magnolia, Andrea and I took the boys to the Discovery Park basketball court to learn to ride a few feet. Jacob dug his heels in so far that it took convincing Reed to ride to motivate Jake to try it. They each rode unassisted for a few feet before promptly declaring their retirement from the sport.

A couple weekends ago, when Andrea was away for the day with one of her clients, I decided we'd take some baby steps toward riding bikes. I was so committed to baby steps, in fact, that I told them they'd only be riding a few feet. I drew a start line and the finish line. By the time Jacob saw how short a distance he'd have to go (which apparently was motivating for him), he decided he'd write the word "FINISH."

As I'm sure is the case with most kids learning to ride, his body tensed up and he began to shake when he mounted his bike (which, by the way, he named Sophie). I told him that he would only have to try it, I'd help him, and he'd have to go no farther than the finish line.

He made it across once with me holding the bicycle seat the whole time. We returned to the start and did it again. And by the third time, I was able to let go for a fraction of a second. By then, he was so terrified that he declared he was done and went inside to read. Then he and Reed decided they wanted to go to the store on their scooters.

Maybe it was the three mile round-trip on scooters that made them see the folly of their scooting ways. Maybe the painfully slow uphill trek. Maybe Reed's exhaustion that required I backpack his scooter and try to ride with him on the seat and me dangling precariously off the edge of my seat. Or maybe just the incredibly small successes earlier in the day that had time to reassure him. Either way, not long after we got home -- and after Andrea had returned -- Jacob was ready for another go on his bike. Our second round of the day went from a second or two of solo riding to twenty feet to nearly the length of our street with an unassisted, controlled stop. 

It was an unqualified success -- so much so that a few minutes later, when I went inside for some water, Reed came inside looking for his helmet and exclaimed, "Jacob's going to help me learn to ride my bike!"

As exciting and heartwarming as that was -- not only because of the brotherly love from both sides but also because Jacob was so quickly turned around on his opinion of bikes -- I figured he wasn't ready to help his brother. I returned outside and ran Reed through the same process. After he was able to do the same, Andrea and I called the grandparents to let them know of our major breakthrough.

While on the phone -- and without any help from us -- they had both learned on their own to start from no momentum, to ride around in circles, and even to control their balance while riding uphill.

Tuesday, June 16, 2015

Don't punt parenting to teachers

As with many big companies, we have many non- or tangentially-work-related distribution lists. On one of those lists, someone yesterday asked this (paraphrased) question:
I have a two-year-old. How and when do I decide what education to give my child (public vs private; Montessori vs traditional; etc.)?
And because this person made it clear that they were soliciting any advice, I chimed in. Here is my verbatim response:

Okay, since you are soliciting any advice, I’ll go on my tirade, starting off with: Don’t fret.

“The study found that low-income students from urban public high schools generally did as well academically and on long-term indicators as their peers from private high schools, once key family background characteristics were considered," according to the findings. [emphasis mine]

I believe many studies show that genetics and family involvement are the biggest factors in students’ success. Which teachers your child has aren’t nearly as important. After all, your child will spend a few hours each day with a teacher, but he or she is spending much more with you. That said, this may change as kids get older and their peers become their bigger influences instead of Mom and Dad.

So for your two year old, just try your best every day. That doesn’t mean you have to do your best, just keep on trying to do well :)

FWIW, my focuses (caveat emptor!) have been on:
  • reading to my kids early and often (language skills, vocabulary), making sure they’re following along (“How many dwarves are in Bilbo’s house? Who do you think is at the door this time?”)
  • building emotional intelligence (when reading books or watching movies, stopping to ask, “What just happened? How do you think that makes so-and-so feel?”)
  • working on imagination, self-reliance, and a connection with nature (I really enjoyed Last Child In The Woods)

Perhaps not Dear Abby-caliber, but while I'm not an expert in real life, I do play one on email lists.

Today, another non-expert responded with, among other blasé comments, this scathing rebuttal to my email:
I disagree with Adam, as with my schedule my child often spends 10 hours a day with his teachers. It’s important to me that the time they spend with him is positive and will help in authentic ways. 
This person didn't say how old their child is, but if you assume somewhere in the 7-12 year range, they should be getting 10-11 hours of sleep per night. This means your 7-12 year old who is with their teacher 10 hours each day is spending less than 30% of their waking weekdays with you.

But then there's TV. A few years ago, Nielsen noted that kids 6-11 get an average of 28 hours per week of screen time. It's only that low for this age bracket "due in part that they are more likely to be attending school for longer hours." So if you assume a heavy skew toward the weekend -- nine hours on each of Saturday and Sunday and "only" two hours each weekday, that 30% of waking time drops to 15%.
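The arithmetic behind those two percentages can be checked with a few lines of Python. The inputs are assumptions taken from the paragraphs above: 10 hours of sleep (the low end of the 10-11 hour range), 10 hours with teachers, and 2 hours of weekday screen time. The function name is hypothetical.

```python
def share_with_parents(sleep_hours, school_hours, screen_hours=0.0):
    """Fraction of a weekday's waking hours left over for time with parents."""
    waking = 24 - sleep_hours
    return (waking - school_hours - screen_hours) / waking


print(f"{share_with_parents(10, 10):.1%}")     # 28.6%, i.e. "less than 30%"
print(f"{share_with_parents(10, 10, 2):.1%}")  # 14.3%, i.e. roughly 15%
```

With 11 hours of sleep instead of 10, both shares come out lower still.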

When Andrea was a teacher in Minnesota, she regularly told me of parents from the Catholic school that were on either end of the spectrum. Some were so heavily invested in their kids -- perhaps in large part because they were shelling out money for a private education -- that they knew all the teachers, knew their kids' classmates, and were involved in the classroom. Others apparently felt that the tuition they paid gave them a free ride to punt their responsibilities to the school. "My Timmy is so talented, I just hope you can see it. What are you doing to make sure he succeeds?"

If your kid is in school for ten hours each day, and if you don't think your parental involvement (or lack thereof) is still the biggest influence (along with genetics) on your child's success in school and in life, then you're doing it wrong.

Sunday, May 17, 2015

Flambe: a retrospective

When I was a kid, my Mom kept all of her recipes either in her head or on hand-written 3x5 cards in the cupboard. Over time, she got organized and put everything into some software for Mac called Mangia.

Years later, in the early 2000s, she found that Mangia was discontinued, and all of her recipes were stuck there. Being a computer science student at the time, it either fell on me or I volunteered to come up with something better.

I spent entirely too many hours writing what I planned to be the Next Great Recipe Program, Flambé, with the ability to share recipes online and request recipes from the worldwide community of cooking enthusiasts. I endeavored to write clever spiders that would collect tens or hundreds of thousands of recipes from other online databases. (To this day, I still have several zip files hundreds of megabytes in size containing gathered recipes.) I wrote interesting regular expressions to be able to split an ingredient such as "1T brown sugar, packed (optional)" into constituent components quantity (1), unit (T), item (brown sugar), remarks (packed) and isOptional (true). I put many of these recipes online on a searchable database for anyone to use, either with the program or without.
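For a sense of what such an ingredient-splitting expression might look like, here is a small Python sketch. It is not Flambé's actual pattern: the unit list, the field names, and the function `parse_ingredient` are assumptions, and it only handles lines shaped like the example above.

```python
import re

# Assumed, deliberately tiny unit list; case matters (T = tablespoon, t = teaspoon).
INGREDIENT = re.compile(
    r"""
    ^\s*(?P<quantity>\d+(?:[./]\d+)?)\s*             # 1, 1.5, 1/2
    (?P<unit>tbsp|tsp|cups?|oz|lb|T|t|g)\s+          # unit from the assumed list
    (?P<item>[^,(]+?)\s*                             # item name, up to a comma or "("
    (?:,\s*(?P<remarks>[^(]+?))?\s*                  # optional remarks after a comma
    (?P<optional>\(optional\))?\s*$                  # optional "(optional)" marker
    """,
    re.VERBOSE,
)


def parse_ingredient(line):
    """Split one ingredient line into its parts, or return None if it doesn't fit."""
    match = INGREDIENT.match(line)
    if match is None:
        return None
    return {
        "quantity": match.group("quantity"),
        "unit": match.group("unit"),
        "item": match.group("item"),
        "remarks": match.group("remarks"),
        "is_optional": match.group("optional") is not None,
    }


print(parse_ingredient("1T brown sugar, packed (optional)"))
```

Lines without a quantity and a recognized unit ("salt to taste", "2 eggs") return None here; real recipe data needs many more cases than a single pattern like this covers.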

Dat splash screen

I spent hundreds of hours doing this in part because I really enjoy cooking. I have fond memories of helping Mom in the kitchen with chocolate chip cookies, no bake cookies, peanut butter cookies (my favorite as a child), and more, and I still love spending time in the kitchen.

Just the same, my program kind of sucked. I focused heavily on features that weren't really that useful. (Peer-to-peer sharing of recipes via an HTTP server built into the program? What was I thinking?) So it comes as no surprise that the worldwide community of enthusiasts never showed up. My program was used pretty much exclusively by my immediate family. But I did learn a good deal and had a good time doing it.

Every now and then, though, my brother would email me asking how he could upload a recipe to my online database, or whether I have this or that recipe. He's the last known user of my terrible software, so rather than hasten the death of Flambé (which quite possibly is the correct path), I decided to drag it on by whipping up a "quick" revised version that has the core functionality.

While Flambé never really went anywhere, I was fortunate to work on a project at work this spring for the OneNote team in which we make it extremely easy to save recipes from websites into your OneNote notebook of choice. Maybe there's hope for my recipe-sharing utopian dream yet.