Sunday, April 7, 2019

Generating a bigram language model from the Wikipedia corpus

Motivation


While I was working through the Scala Principles class, the final unit included an exercise that involved creating a "mnemonic" for a telephone number by mapping the number to an English "phrase". It was a fun exercise, but I was unsatisfied with the naive generation of mnemonic phrases. Yes, you could get "Scala is fun" for 7225247386, but you also get:
  • sack air fun
  • pack ah re to
  • pack bird to
  • Scala ire to
  • rack ah re to
  • pack air fun
  • sack bird to
  • rack bird to
  • sack ah re to
  • rack air fun

... which are decidedly less helpful. I thought that if I could apply even a simplistic language model, like a bigram model, to the options and then rank them, I could make the output a little more useful. So I went looking for some bigram data, and eventually ran across a Wikipedia corpus on corpusdata.org. The second paragraph on that site points out that prepping the data yourself is hard work and would require many hours, so paying them $295 for it is a steal. Challenge accepted, mofos.
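To make the ranking idea concrete, here is a minimal sketch of scoring candidate phrases with a bigram model. Everything in it is an assumption for illustration: the counts are made-up toy numbers rather than anything derived from Wikipedia, the names (BigramRank, logProb, score, rank) are hypothetical, and add-one smoothing is just one simple way to keep unseen word pairs from zeroing out a phrase.

```scala
object BigramRank {
  // Toy counts, invented for illustration; a real model would be
  // estimated from a large corpus. "<s>" marks the start of a phrase.
  val bigramCounts: Map[(String, String), Int] = Map(
    ("<s>", "scala") -> 5,
    ("scala", "is")  -> 8,
    ("is", "fun")    -> 6,
    ("<s>", "pack")  -> 2,
    ("pack", "air")  -> 1
  )

  val unigramCounts: Map[String, Int] = Map(
    "<s>" -> 20, "scala" -> 10, "is" -> 50, "fun" -> 7,
    "pack" -> 4, "sack" -> 3, "air" -> 3
  )

  val vocabSize: Int = unigramCounts.size

  // Add-one smoothed log P(word | prev), so unseen pairs get a small
  // but nonzero probability.
  def logProb(prev: String, word: String): Double = {
    val pairCount = bigramCounts.getOrElse((prev, word), 0)
    val prevCount = unigramCounts.getOrElse(prev, 0)
    math.log((pairCount + 1.0) / (prevCount + vocabSize))
  }

  // Sum of log-probabilities over each adjacent word pair in the phrase.
  def score(phrase: String): Double = {
    val words = "<s>" +: phrase.toLowerCase.split("\\s+").toList
    words.zip(words.tail).map { case (prev, word) => logProb(prev, word) }.sum
  }

  // Most plausible phrase first.
  def rank(candidates: Seq[String]): Seq[String] =
    candidates.sortBy(phrase => -score(phrase))

  def main(args: Array[String]): Unit =
    rank(Seq("sack air fun", "Scala is fun", "pack air fun")).foreach(println)
}
```

With these toy counts, "Scala is fun" sorts ahead of "pack air fun" and "sack air fun". Because the score is a sum of log-probabilities, phrases with more words pick up more negative terms, which conveniently pushes choppy options like "sack ah re to" further down the list.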

Functional Programming Principles in Scala

After a long break (Advanced Algos took a lot out of me), I decided it was time to take another class. When the topic of taking the Scala series of classes from Coursera came up in my programming group, I jumped at the chance. I needed a break from theory, and a chance to do some coding in a new language was just the ticket. This post covers the first course in the series, "Functional Programming Principles in Scala". tl;dr: Take this course, it's great!