Motivation
While working through the Scala Principles class, I came to an example exercise in the final unit that involved creating a "mnemonic" for a telephone number by mapping the number to an English "phrase". It was a fun exercise, but I was unsatisfied with the naive generation of mnemonic phrases. Yes, you could get "Scala is fun" for 7225247386, but you also get:
- sack air fun
- pack ah re to
- pack bird to
- Scala ire to
- rack ah re to
- pack air fun
- sack bird to
- rack bird to
- sack ah re to
- rack air fun
... which are decidedly less helpful. I thought that if I could apply even a simplistic language model, like a bigram model, to the options and then rank them, I could make the output a little more useful. So I went looking for some bigram data and eventually ran across a Wikipedia corpus on corpusdata.org. The second paragraph there points out that prepping the data yourself is hard work that would require many hours, so paying them $295 for it is a steal. Challenge accepted, mofos.