Failed the Turing Test: April 2019

Sunday, April 7, 2019

Generating a bigram language model from the Wikipedia corpus

Motivation

While I was working through the Scala Principles class, the final unit included an example exercise that involved creating a "mnemonic" for a telephone number by mapping the number to an English "phrase". It was a fun exercise, but I was unsatisfied with the naive generation of mnemonic phrases. Yes, you could get "Scala is fun" for 7225247386, but you also get:

sack air fun
pack ah re to
pack bird to
Scala ire to
rack ah re to
pack air fun
sack bird to
rack bird to
sack ah re to
rack air fun

... which are decidedly less helpful. I thought that if I could apply even a simplistic language model, like a bigram model, to the options and then rank them, I could make the output a little more useful. So I went looking for some bigram data, and eventually ran across a Wikipedia corpus on corpusdata.org. The second paragraph there points out that prepping the data yourself is hard work and would require many hours, so paying them $295 for it is a steal. Challenge accepted, mofos.
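To make the ranking idea concrete, here is a minimal sketch in Python (not the Scala from the class, and not necessarily how the final version works). It scores each candidate phrase by summing add-one-smoothed log probabilities of its adjacent word pairs, then sorts by score. The tiny `corpus` string is made up purely for illustration and stands in for real bigram counts; `log_score` and `rank` are hypothetical helper names.

```python
import math
from collections import Counter

# Toy corpus standing in for real bigram data (invented for illustration only).
corpus = "scala is fun . scala is great . coding is fun . the air is cold".split()

unigrams = Counter(corpus)                   # count of each word
bigrams = Counter(zip(corpus, corpus[1:]))   # count of each adjacent word pair
vocab_size = len(unigrams)

def log_score(phrase):
    """Sum of add-one-smoothed log P(w2 | w1) over adjacent word pairs."""
    words = phrase.lower().split()
    total = 0.0
    for w1, w2 in zip(words, words[1:]):
        # Counter returns 0 for unseen words and pairs, so smoothing
        # keeps the probability non-zero.
        total += math.log((bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size))
    return total

def rank(candidates):
    """Order candidate phrases from most to least likely under the model."""
    return sorted(candidates, key=log_score, reverse=True)

candidates = ["sack air fun", "Scala is fun", "pack bird to"]
ranked = rank(candidates)
print(ranked[0])
```

One caveat with this sketch: a phrase with more words has more bigrams and so collects more negative log terms, which means comparing a three-word candidate against a four-word one would probably need some kind of length normalization.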

Functional Programming Principles in Scala

After a long break (Advanced Algos took a lot out of me), I decided it was time to take another class. When the topic of taking the Scala series of classes from Coursera came up in my programming group, I jumped at the chance. I needed a break from theory, and a chance to do some coding in a new language was just the ticket. This post covers the first course in the series, "Functional Programming Principles in Scala". tl;dr: Take this course, it's great!