Monday, January 18, 2016

Take the train and see America (very slowly)!

I've been meaning to write up a little something about our Christmas travels this last year in which we took Amtrak home instead of our usual flight. Usually we spend something like $400/person for our airfare between SEA & MSP, and I think we've had to pay as much as $500. Turns out a lot of people travel around Christmas, and supply and demand means it costs more to travel at that time of year. Go figure.

On the other hand, I found that one-way airfare (SEA->MSP) was only about $125/person, and a one-way train ticket was $150/adult, $75/child. After taxes and fees, we spent $600 to fly to Minnesota and about $500 to take the train. For all four of us. The catch is that a return flight would be something like 3.5hrs versus 34hrs for the train, so we're trading a day and a third for $1000ish savings. And the experience of seeing the country.

The St. Cloud Amtrak station is about as big as one airport terminal gate seating area but with no coffee or gift shops. Instead, there was a broken water fountain and a (working!) 7-Up machine from the Taft administration.

We got to the station around 11:30pm after a busy day at my parents' house. Both boys zonked out for a while, though Jacob had been so extremely tired that before falling asleep, he got a severe case of the sillies and was Andrea's and my comic relief for about twenty minutes.

When the train arrived around 12:40am, we carried the boys outside, but they woke up on their own, extremely excited to take the train. Jacob told me, "Dad, it's real! I pinched myself and it hurt!"

Once on the train, we found the seats to be pretty comfortable, the ride smooth, accommodations... acceptable. The boys actually enjoyed it surprisingly well.

Though the ride took a day and a half, we were at least able to get out at several of the stops (though I only did so on one).

Unfortunately, the better part of our daylight on New Year's Eve day was through North Dakota and Montana -- all very flat and boring. We did have a nice sunset, however.

By the time we got to Glacier, it was nighttime and nothing to see. We got to Spokane around midnight-ish, and we didn't get any light until around the Cascades. The snow-covered trees and mountains were beautiful, though unfortunately, we were so close that we didn't get any good shots of the scenery.

The western side of the Cascades was a nice welcome home.

The final stretch from Everett to Seattle was right along the Sound and was gorgeous, though I was too busy gawking at it to take any pictures. The arrival at King Street Station was a nice break to stretch and walk around before we hopped into a taxi to get home.

All in all, it was a good experience. I'm glad we did it -- not just because it barely offset the cost of boarding Shasta for our two-week trip, but also because it was a fun experience in and of itself. Hard to say, though, whether and when we'll do it again.

Friday, November 27, 2015

A return to Mailbox Peak

After entirely too long, I returned to Mailbox Peak today with my hiking buddy. When I grabbed the leash and other hiking gear, she was elated, happily jumping into the car. The long drive out to and past North Bend was tiring, though.

We took the old trail up, which is a pretty grueling 4000' elevation gain over a mere 2.6 miles. Shasta had no problem whatsoever with the hike.

I, however, got wrecked, having done only two other hikes since my June summit of Mount Shuksan, one of which was last month's trip up Mount Saint Helens. I bonked around 1.5-2.0mi in, allowing Shasta to drag my ass up the rest of the way to the beautiful scenery of the Cascades and Mount Rainier. She was quite happy up there.

But I could barely walk.

I thought the descent might be made better by using the new trail, a 4.7mi descent instead of 2.6. That was all well and good, but it was significantly more snowy and icy than the old trail and took so long that I was stumbling all the way back to the car.

Up until today, I had been very adamant about the rule that dogs must always be on leashes. In fact, when Shasta was walking under a felled log, I didn't even let go of the leash while transferring it. However, having spent a good half hour without seeing another person, I decided to try letting her off-leash to see how she would do.

After maybe five minutes of her sprinting up ahead 50 yards or so then doubling-back to me, lather, rinse, repeat, she finally got into a good routine of staying within maybe 10-20 feet of me. (For what it's worth, I put her back on-leash immediately when we did eventually run into hikers. I then took her back off, and she did phenomenally every time.)

We got home and Shasta climbed onto the couch to curl up and rest. After my supremely weak-sauce performance for the day, so did I. Roughly 5.5hrs for the 7.3mi round-trip plus travel time, it left much to be desired. But it was good to be back on the mountain again.

Thursday, October 29, 2015

Another year at Microsoft, another position change

A little over a year ago, I moved from Bing to DX, part of Microsoft's evangelism group, to work on a project that has not yet been disclosed. Tomorrow will be my last day within DX as I move to the Windows and Devices Group (WDG) to work yet again on some truly big data + machine learning.

In Bing, I worked on a few projects utilizing Microsoft's Cosmos distributed and massively parallel processing system before moving over to do some tented work on Azure and websites and other secret things. Starting next week, I'll once again be doing severely large-scale work in analytics and prediction, with some natural language stuff thrown in for good measure. Or so I'm led to understand.

The position sort of fell into my lap, so it's with some trepidation that I'm leaving a project where I get to do some cool stuff and implement a lot of neat features -- which, of course, I can't yet discuss. Maybe once my soon-to-be-former team goes public, I'll do some blog posts about what I've been doing.

Saturday, October 3, 2015

Climbing Mount Saint Helens

Yesterday, I climbed Mount Saint Helens with my friends Tim and Mike. We started at Climber's Bivouac at 6am. After about an hour, when starting to get above much of the dense forest, we got an amazing view of the sunrise.

Sunrise looking at Mt. Adams (left) and Mt. Hood (right)

Looking through the trees at a layered sunrise.

In short order, we reached the treeline and the start of the very rocky, ashy incline up the mountain.

Mike looking up the trail

It quickly turned to a straight shot up the mountain over decently sized boulders.

When we started off, it was rather foggy at Climber's Bivouac. Throughout the day, the clouds stayed down at the bottom, providing a beautiful blanket across the national forest.

Scrambling up all the volcanic rock made some form of gloves a very useful addition to my gear for the day.

A couple years ago, I bought the boys some small stuffed black bears at REI. Reed's bear "Junior" somehow got lost in some of my gear, so I had a stowaway for the day.

The last 1000 or so vertical feet were through very soft ash and gravel. Here, I'm looking down the mountain at Mike, Tim, and the dozens of other climbers.

At the top, the wind varied from nil to extremely strong, blasting sand into our faces. Between that and the sun, I'm a super squinty guy.

We had an amazing view of the crater, lava dome, etc. I tried to get a few decent panoramas to try to illustrate how incredible the sight is. Here's a decent stitch; the steam coming out of the middle is, I believe, methane gas.

I got a quick shot at the top showing how windy it was:

It took us a little over five hours (6:00-11:20) to get to the top. The descent back to the Bivouac took until 15:00, but not without a few more photo ops:
Looking back up the mountain during the descent

Junior was on the mountain illegally, as he didn't have his own permit.

Monday, September 14, 2015

Processing en.wikipedia into n-grams

The website Memrise features a number of courses (prominently languages, but other subjects as well) and a method to help you learn them quickly and efficiently. Curious about how the courses are set up, I recently took a look at the ESPDIC 52,303 Vortoj course -- a course based upon the Esperanto-English dictionary. The dictionary contains 52k records of Esperanto-English translations such as:

aerohaveno : airport
aerokondukilo : air duct
aerolita : aerolitic
aerolito : meteoric stone, meteorite, aerolite

The purpose of the Memrise course is to help would-be Esperanto speakers with learning the language effectively. I took a look at the first lesson and found a number of words of varying difficulty. Some were pretty obvious, such as kiloneŭtono for kilonewton. Others, however, were rather baffling.

It turns out belles-lettres is a form of writing which values aesthetic qualities. It's the "department of literature which implies literary culture and belongs to the domain of art, whatever the subject may be or the special form; it includes poetry, the drama, fiction, and criticism." I'm rather confident in saying that I had never heard of this term before in my thirty-some years, and in learning another language, this would be very low on my list of terms to learn.

Similarly, the very first lesson in this course also introduces the Esperanto word jemenano (for "Yemeni," a person from Yemen) -- also a word that is of extremely little value to me according to its frequency of use.

So I decided to see if I could come up with a more appropriate ordering of this course's astounding 3495 lessons with 15 words/lesson.

Assuming the frequency of English words in a sufficiently large corpus is a good basis, I downloaded the full en.wikipedia data dump from 2015-08-05, a 12.2gb bz2 XML file consisting primarily of the wiki markup for each article as well as some other information such as who made the most recent edit. This download itself was a pain on my internet connection, but in short order, I had a 50.2gb XML file staring me in the face.

I figured the two biggest problems with this file were that it was entirely too large to easily work with and that the text in which I was interested contained unnecessary wiki markup.


Trying to validate such a huge file would be nearly impossible, but since Wikipedia is the seventh-most-visited website, I figure their XML is valid. Extracting the markup proved straightforward[1]:
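
For readers who want to try this kind of extraction, here is a minimal sketch of streaming a dump one page at a time. It is written in Python for brevity rather than the .NET used for the original tool, and it is not the original listing. The element names page, title, and text follow the MediaWiki export format; the function names iter_articles and local_name and the tiny sample document are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on qualified tags."""
    return tag.rsplit("}", 1)[-1]


def iter_articles(source):
    """Yield (title, wiki_markup) pairs one page at a time.

    `source` is a filename or a binary file object. Memory use stays
    modest because each <page> element is cleared once it is handled.
    """
    title, text = None, None
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # drop the finished page's children from memory


# Tiny stand-in for the real dump (illustrative data, not from Wikipedia).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Esperanto</title>
    <revision><text>'''Esperanto''' is a [[constructed language]].</text></revision>
  </page>
  <page><title>Empty</title><revision><text /></revision></page>
</mediawiki>"""

articles = list(iter_articles(io.BytesIO(SAMPLE)))
```

Because nothing is loaded up front, the same loop should work on a multi-gigabyte file by passing a filename instead of the in-memory sample.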

More painful, however, was removing the Wikipedia markup. There are already methods of doing this[2], but none as simple as invoking a single exe from a command prompt, which is what I was aiming for:

> ProcessWikipedia -input enwiki-20150805-pages-articles.xml -output plaintext.dat

I found no suitable libraries for parsing wiki markup that worked; Wiki .NET Parser sounded like it might do the trick, but it failed, leaving horribly mangled results in its wake. And as exciting as writing a full-on parser with ANTLR sounded, I wanted an end result and not a process. So I started writing regular expressions to handle as many markup tags as I could find:

Unwrap internal links, skipping Category and File links such as [[Category:Constructed_languages]]: @"\[\[(?!(?:Category|File):)(?:[^|\[\]]+?\|)?([^\[\]]+?)\]\]"

Remove external links: @"\[[^\[\]]+(?: ([""']*)([^""\[\]]+)\1)\]"

Remove headings: @"(=+)([^'=]+)\1"

And so on.
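
To see what these three patterns do on real input, here is a small sketch that applies them in Python. The patterns are the ones above, moved from C# verbatim strings to Python raw strings. The replacement strings are not given in the post, so the choices here are assumptions: internal links and headings keep their text, and external links are dropped entirely. The function name strip_markup is hypothetical.

```python
import re

# The three patterns from the post, moved from C# verbatim strings to Python
# raw strings (the doubled "" in C# is a single " here).
INTERNAL_LINK = re.compile(r"\[\[(?!(?:Category|File):)(?:[^|\[\]]+?\|)?([^\[\]]+?)\]\]")
EXTERNAL_LINK = re.compile(r"""\[[^\[\]]+(?: (["']*)([^"\[\]]+)\1)\]""")
HEADING = re.compile(r"(=+)([^'=]+)\1")


def strip_markup(text):
    """Apply the three substitutions in order (replacements are assumptions)."""
    text = INTERNAL_LINK.sub(r"\1", text)  # [[target|label]] -> label
    text = EXTERNAL_LINK.sub("", text)     # [url label] -> dropped entirely
    text = HEADING.sub(r"\2", text)        # == Title == -> " Title "
    return text


links = strip_markup("[[Esperanto]] is a [[constructed language|conlang]].")
external = strip_markup("See [http://example.org the site] for more.")
heading = strip_markup("== History ==")
category = strip_markup("[[Category:Constructed_languages]]")
```

Note that the first pattern leaves Category and File links untouched because of its negative lookahead, so those would need a pattern of their own. The heading pattern will also match any other run of text between equals signs, which is one way stray markup can survive into the output.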

I saved the results to an extremely simple binary file, omitting disambiguation pages, special pages, and redirects, keeping only the article's title and the reasonably well-cleaned plaintext:
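
The exact layout of that file isn't shown here, so the following is only a guess at what an "extremely simple" format could look like: each article stored as two length-prefixed UTF-8 strings, title then plaintext. The 4-byte little-endian length prefix and the names write_record and read_records are assumptions for illustration, not the format the tool actually wrote.

```python
import io
import struct


def write_record(stream, title, plaintext):
    """Append one article as two length-prefixed UTF-8 strings."""
    for field in (title, plaintext):
        data = field.encode("utf-8")
        stream.write(struct.pack("<I", len(data)))  # 4-byte little-endian length
        stream.write(data)


def read_records(stream):
    """Yield (title, plaintext) pairs until the stream is exhausted."""
    while True:
        fields = []
        for _ in range(2):
            header = stream.read(4)
            if not header:
                return  # clean end of file
            (length,) = struct.unpack("<I", header)
            fields.append(stream.read(length).decode("utf-8"))
        yield tuple(fields)


buffer = io.BytesIO()
write_record(buffer, "Esperanto", "Esperanto is a constructed language.")
write_record(buffer, "Aerolito", "meteoric stone")
buffer.seek(0)
records = list(read_records(buffer))
```

A format like this can be read back sequentially without parsing anything, which is all the later counting pass needs.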

This processing took about 2hr6m on my already overloaded laptop, yielding an 11.4gb file.


With such a huge corpus, I started with 1-grams -- that is, simple frequency counts of individual words ("the", "aeroplane", "flies"). I'll look at 2-grams ("the aeroplane", "aeroplane flies") in the future, for the simple reason that I don't have enough computer for such a huge task right now. Unfortunately, with the entire corpus, even a simple in-memory Dictionary<string, int> would not hold everything. (Indeed, trying this gave me an OutOfMemoryException.)
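
As a concrete illustration of 1-grams versus 2-grams, here is a short Python sketch that counts both for the example sentence. The tokenizer (lower-casing and keeping runs of letters, digits, apostrophes, and hyphens) is an assumption, not necessarily how the original tool split words, and the name ngrams is made up.

```python
import re
from collections import Counter

WORD = re.compile(r"[a-z0-9'-]+")  # crude tokenizer, an assumption


def ngrams(text, n):
    """Yield n-grams (as space-joined strings) from lower-cased text."""
    words = WORD.findall(text.lower())
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])


sentence = "The aeroplane flies"
unigrams = Counter(ngrams(sentence, 1))
bigrams = Counter(ngrams(sentence, 2))
```

The number of distinct 2-grams grows much faster than the number of distinct words, which is why an in-memory table that struggles with 1-grams has no chance with 2-grams.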

Rather than use the sledgehammer that is Hadoop, I opted for a leaner solution in SQLite, taking nearly 12hrs on my little laptop to process the entire corpus which, in the end, spit out a 17.5mb list of key-value pairs.
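
One way to do this kind of out-of-memory counting with SQLite is sketched below in Python: count a batch of words in memory, then merge the batch into an on-disk table and start over. The table name ngrams, its columns, the batch size, and the whitespace tokenizer are all assumptions for illustration; the post doesn't describe the actual schema.

```python
import sqlite3
from collections import Counter


def flush(conn, batch):
    """Merge one in-memory batch of counts into the on-disk table."""
    with conn:  # one transaction per batch
        conn.executemany(
            "INSERT OR IGNORE INTO ngrams (term, freq) VALUES (?, 0)",
            ((term,) for term in batch),
        )
        conn.executemany(
            "UPDATE ngrams SET freq = freq + ? WHERE term = ?",
            ((count, term) for term, count in batch.items()),
        )


def count_terms(conn, documents, batch_size=100_000):
    """Count words across documents, spilling to SQLite every batch_size terms."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS ngrams (term TEXT PRIMARY KEY, freq INTEGER NOT NULL)"
    )
    batch = Counter()
    for text in documents:
        batch.update(text.lower().split())
        if len(batch) >= batch_size:
            flush(conn, batch)
            batch.clear()
    flush(conn, batch)


conn = sqlite3.connect(":memory:")  # a file path would be used for the real corpus
count_terms(conn, ["the aeroplane flies", "the meteorite falls"], batch_size=2)
rows = dict(conn.execute("SELECT term, freq FROM ngrams"))
```

Batching keeps the number of transactions small, which tends to matter far more for SQLite throughput than the individual statements do.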

Further Work

As of publication, the plaintext processing still has some kinks to work out, so the n-grams contain terms that are invalid (incorrectly processed) or are unintentional results of processing. These are most noticeable at the very end of the histogram tail:

contractexpress      10
*chernushka          10
hirschig's           10
m_{\mathrm           10
\bar{m}_{\mathrm     10
furd’s               10
antennolaelaps       10
euepicrius           10
**euryparasitus      10
protocoltcp          10

I would argue that these should be eliminated and the size of the file dramatically reduced simply by setting a cutoff higher than 10. In fact, using a cutoff of 1000 eliminates all but 3.98% of all terms, but it still gives you 94.53% of coverage of the English language -- an excellent example of the Pareto Principle.
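
The two percentages are different measurements: one is the share of unique terms that survive the cutoff, the other is the share of all word occurrences those survivors account for. A small Python sketch of the calculation follows; the toy frequencies are invented purely to show the shape of the result, and the name cutoff_stats is hypothetical.

```python
def cutoff_stats(counts, cutoff):
    """Return (share of unique terms kept, share of all occurrences covered)."""
    kept = [freq for freq in counts.values() if freq >= cutoff]
    terms_kept = len(kept) / len(counts)
    coverage = sum(kept) / sum(counts.values())
    return terms_kept, coverage


# Made-up frequencies with a long tail, purely for illustration.
toy_counts = {"the": 900, "of": 60, "aeroplane": 20, "filosa": 10, "muhsuds": 10}
terms_kept, coverage = cutoff_stats(toy_counts, cutoff=50)
```

Even in this toy example, keeping 40% of the terms covers 96% of the occurrences, the same lopsided pattern as the real numbers above.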

Using this cutoff, you get rid of words such as "filosa", "muhsuds", "unmanoeuvrable" [sic], "anti-u-boat" and more, while still keeping "fasting", "dutton", "headlining", "cemented", and "barometric."

Here's a graph showing the cutoff and coverage of terms. Ten-thousand would reduce the number of unique terms again a fifth, but then you lose terms such as "sandwich", "loops", and "decreases".

In an upcoming post, I'll discuss the use of word frequencies to more appropriately order the ESPDIC dictionary and build a Memrise course.

Saturday, September 5, 2015

Your app redesign sucks

Not terribly long ago, Google added their Google Play Music radio stations with a free, ad-supported version for everyone. Aside from the fact that their recommendations are terrible in my experience, they've nerfed their user experience. Just yesterday, in fact, I wanted to play the song The Pretender on my phone. Here's the result screen from my search:

While their auto-generated The Pretender Radio station is at least relevant, it's not what I am looking for -- the song I have played many times on my phone is nowhere to be seen.

I don't even get to my intended result until I scroll down. Here's what the full pane looks like:

Of the entire result pane, less the operating system's top bar and Music's top search bar and bottom music controls, about 11% of the real estate -- all of it below the initial view -- is devoted to what I'm looking for, and what I ultimately clicked on.

The radio stations have become a big pain point for me lately, as I tend to play music almost exclusively from Google Play Music, either on my phone or on my work or home laptops. The pain comes partially from their A1 position on nearly every screen but also from the fact that Google apparently isn't taking the hint that I don't use it and isn't letting me collapse or hide this section.

Their desktop experience isn't much better, yielding about 26% after discounting the search bar and controls:

The first project I worked on in Bing measured dissatisfaction (referred to as DSAT) with Xbox Music and Video search results. For example, when you search for "seven" in the context of videos, you might expect to see the results "Se7en", "Seven Years in Tibet", "Seven Samurai", and more. Ideally, the first result is the one you wanted, and if you clicked on the fifth one or if you didn't click any, that is a DSAT -- the results were poor. We took those data and learned from them to improve the quality of the results.

I imagine Google has some sophisticated ML to do exactly that; however, this is not an engineering problem. It's a simple matter of understanding what your users want, and at least in my humble opinion, this ain't it.

Wednesday, July 22, 2015

Riding bikes: zero to sixty in one day

We've been trying to get our boys to learn to ride their bicycles for years. They never really liked riding with training wheels, and when we took away the training wheels, they wanted them back. (That didn't happen.)

When we lived in Magnolia, Andrea and I took the boys to the Discovery Park basketball court to learn to ride a few feet. Jacob dug his heels in so far that it took convincing Reed to ride to motivate Jake to try it. They each rode unassisted for a few feet before promptly declaring their retirement from the sport.

A couple weekends ago, when Andrea was away for the day with one of her clients, I decided we'd take some baby steps toward riding bikes. I was so committed to baby steps, in fact, that I told them they'd only be riding a few feet. I drew a start line and the finish line. By the time Jacob saw how short a distance he'd have to go (which apparently was motivating for him), he decided he'd write the word "FINISH."

As I'm sure is the case with most kids learning to ride, his body tensed up and he began to shake when he mounted his bike (which, by the way, he named Sophie). I told him that he would only have to try it, I'd help him, and he'd have to go no farther than the finish line.

He made it across once with me holding the bicycle seat the whole time. We returned to the start and did it again. And by the third time, I was able to let go for a fraction of a second. By then, he was so terrified that he declared he was done and went inside to read. Then he and Reed decided they wanted to go to the store on their scooters.

Maybe it was the three-mile round-trip on scooters that made them see the folly of their scooting ways. Maybe the painfully slow uphill trek. Maybe Reed's exhaustion that required I backpack his scooter and try to ride with him on the seat and me dangling precariously off the edge of my seat. Or maybe just the incredibly small successes earlier in the day that had time to reassure him. Either way, not long after we got home -- and after Andrea had returned -- Jacob was ready for another go on his bike. Our second round of the day went from a second or two of solo riding to twenty feet to nearly the length of our street with an unassisted, controlled stop.

It was an unqualified success -- so much so that a few minutes later, when I went inside for some water, Reed came inside looking for his helmet and exclaimed, "Jacob's going to help me learn to ride my bike!"

As exciting and heartwarming as that was -- not only because of the brotherly love from both sides but also because Jacob was so quickly turned around on his opinion of bikes -- I figured he wasn't ready to help his brother. I returned outside and ran Reed through the same process. After he was able to do the same, Andrea and I called the grandparents to let them know of our major breakthrough.

While we were on the phone -- and without any help from us -- they both learned on their own to start from no momentum, to ride around in circles, and even to control their balance while riding uphill.