Friday, October 21, 2016

Intro to Natural Language Processing (Coursera)

While I had held out hope that Stanford or even Columbia would offer their NLP courses on Coursera again (I didn't find anything related to the subject on edX), it was not to be.  So I signed up for the University of Michigan's introductory offering.  The Stanford lectures (Jurafsky and Manning) and the Columbia lectures (Collins) are available on YouTube, and these assignments from Cairo University are based on this material as well, so there are plenty of additional resources available in the space.  For now I'm going to mostly concentrate on the Michigan material, just to save a bit of my sanity lol. An online version of the draft 3rd ed. of Jurafsky and Martin's book, Speech and Language Processing, is also available.


Overview (a.k.a. Post Mortem)


I thought I would have a lot more fun in this class than I did.  By the end, it was a real grind that I just wanted to be over.  The first couple of reviews I read [here] capture some of my frustration:  quizzes weren't very challenging (go treasure hunt the lecture transcripts!), the lectures offered pretty shallow coverage, and then the programming assignments were brutal with no real support from the lectures.  I had to lean heavily on the discussion board for assignment 2, and for assignment 3 I scratched out something "good enough" just to get it over with.  I think the contrast for me is really sharp because I just finished up the Berkeley AI class, which was worlds above this class.

Maybe I just didn't have the right expectations.  I mean, it does say "Intro" in the title, so a broad, shallow overview of the whole field is probably appropriate.  I dunno, seems like I remember Jurafsky's lectures being more interesting.  Maybe in the future if I want to take another stab at NLP I'll work through the Stanford and Columbia materials.  For now, I'm done with it lol...

Why pay for a certificate when you can print-screen?

I could probably bring up my overall score if I retook the final (got a 72%, ouch!). Buuuuut.... I choose life.

My notes follow, though they are pretty rough.  They are intended more as evidence I worked through the material than anything else, so don't expect a whole lot of value unless maybe you are taking the exact same course lol.



Week 1


Natural Language Processing (NLP) is the study of the computational treatment of natural (human) language, i.e. teaching computers how to understand and generate human language.

Modern Applications:

  • Search Engines
  • Question Answering (Watson)
  • Natural language assistants (Siri)
  • Translation Systems (Google Translate)
  • News digest, automatic earthquake reports.

NLP draws on linguistics, statistics, artificial intelligence, psychology, and other fields.

Quite a bit of the lectures for this week is spent building context and looking at examples, which isn't surprising considering it's an introductory module.  So I'll probably keep this week's notes pretty light.

Radev is involved with NACLO, which is a computational linguistics competition for high school students, and he'll sometimes suggest problems from old competitions to solve.  In "Why is NLP hard?" there are these two:  One, Two, Three and Fakepapershelfmaker.  "Linguistics" includes:  All in the family and Warlpiri Kinship.



Week 2


This week covers parts of speech, morphology, text similarity, and text preprocessing.

Parts of speech:

In English there are about eight parts of speech:  noun, pronoun, verb, adjective, adverb, preposition, conjunction, and interjection.  English does not use gender for nouns, whereas many other languages do.  Grammatical cases affect nouns and pronouns depending on their relationships in a clause (though Wikipedia notes that English has largely lost its case system).  Of course this part of the lecture made me think of Life of Brian, but I came to find out that Modern English only has three cases anymore:  subjective or nominative (he), objective or accusative (him), and possessive or genitive (his), and it seems they only apply to pronouns.

Determiners include words like the, a, this, that.

Adjectives are used to describe properties:  small house

Verbs describe actions, activities, and states, and take four forms (no examples?) and three tenses (past, present, future).  Also distinctions based on gerunds, infinitives, aspect, and voice.

  • Intransitive verbs do not have a direct object.  The dog sleeps
  • Transitive verbs take a direct object. The dog chased the cat
  • Ditransitive verbs take two objects.  Mary gave the dog a bone

Adverbs, prepositions, particles also discussed.  Coordinating and subordinating conjunctions.

Part of speech tagging is the process of automatically assigning parts of speech to words in a corpus. One well known tagger is the Stanford Tagger, which is built in Java.
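
Just to make the idea concrete, here's a quick sketch using NLTK's built-in tagger rather than the Stanford Tagger mentioned above (the example sentence is mine, and the tags shown in the comment are just what I'd expect, not course output):

import nltk

# One-time downloads for the tokenizer and tagger models.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("The dog chased the cat")
print(nltk.pos_tag(tokens))
# Roughly: [('The', 'DT'), ('dog', 'NN'), ('chased', 'VBD'), ('the', 'DT'), ('cat', 'NN')]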

Morphology and Lexicon

The mental lexicon is used to store the meaning, pronunciation, and part of speech of a word.

Derivational Morphology has to do with understanding how different forms of words are created from other words, by adding different affixes (prefixes and suffixes).  We can represent these combinations as a tree structure, and see how different combinations can change the part of speech of a word.  For example, adding "-able" to a verb makes it an adjective:  drink => drinkable.  Morphemes have different meanings: -ness, -able, -ing, re-, un-, -er.  These morpheme rules can be applied "recursively" - unconcernednesses.

Inflectional Morphology deals with tense, verb forms, etc.

Morphological Analysis takes a word and breaks it into morphological representations.  So it will break a word down into its part of speech, stem, and other morphological information (tense, case, etc.).  NACLO problem dealing with Turkish vowel harmony: Turkish Delight.

Semantics is the study of the meaning of words.  Lexical semantics has to do with the meaning of individual words, whereas Compositional semantics concerns the meaning of a sentence based on the meaning of its components.  Lexical semantics includes concepts like hypernyms and hyponyms (super and sub class relationships), antonyms and synonyms (opposite and same meaning), meronyms and holonyms (part-whole relationships), and homonyms (same spelling).

Pragmatics is the study of how knowledge about the world and language conventions interact with literal meaning.


Text Similarity


Two concepts that are very similar can be expressed in completely different ways.  One way to measure similarity between two words is to have human beings rate the similarity of two words.  Automatic word similarity can be accomplished using tools like word2vec and WordNet.
Types of similarity:

  • Morphological (respect-respectful)
  • Spelling (theater-theatre)
  • Synonym (talkative-chatty)
  • Homophony (raise-raze-rays)
  • Semantic (cat-tabby)
  • Sentence (paraphrase)
  • Document (two news stories on same event)
  • Cross-lingual (Japan-Nihon)


Morphological Similarity


We can identify that two words are morphologically related through a process called stemming.  One stemmer is the Porter Stemmer (online demo), which applies a series of rules to a word to reduce it down to a stem (which generally isn't a proper word itself, e.g. "computability" is stemmed to "comput", which is also the stem of "computer").  Porter's algorithm is based on a measure of a word, which is closely related to the number of syllables in the word (it uses vowel-consonant pairs).

Most of the lecture is a walk-through of Porter's algorithm.  Also a plug for Natural Language Toolkit (NLTK) http://www.nltk.org/
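
Here's the "computability" example above run through NLTK's implementation of the Porter stemmer (a minimal sketch; the word list is my own):

from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
for word in ["computability", "computer", "computing"]:
    print(word, "=>", stemmer.stem(word))
# All three should reduce to the same stem, "comput".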

NACLO problem in stemming: Thorny Stems


Spelling Similarity: Edit Distance


Typos, transliterations, and dialects can result in the same word being spelled different ways.  It is useful to have a measure of how close two words are in spelling.  Letters can be added, dropped, or substituted, operations that are all called "edits".  The Levenshtein method is used to determine the edit distance between two words (online demo, demo2).  An implementation of this algorithm that uses a dynamic programming approach can efficiently find the minimum edit distance between two strings. The lecture walks through an example using "strength" and "trend".
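
A bare-bones sketch of the dynamic programming table, using unit costs for every edit (the lecture may weight substitutions differently):

def edit_distance(a, b):
    m, n = len(a), len(b)
    # d[i][j] = edit distance between a[:i] and b[:j]
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # i deletions
    for j in range(n + 1):
        d[0][j] = j          # j insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution (or match)
    return d[m][n]

print(edit_distance("strength", "trend"))  # 4 with unit costs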

The Damerau modification treats transpositions, i.e. adjacent letter swaps, as a single operation.  Edit distance may also take into account visual similarity (such as lowercase m and lowercase r and n together), as well as typing patterns (fat finger errors).

The usefulness of edit distance is not limited to human language.  It can also be applied to genetics to align and compare RNA and DNA sequences, or protein sequences (amino acids).

NACLO problems: Nok-Nok, The Lost Tram


Preprocessing


Preprocessing is the step in the NLP pipeline that converts input text into a format that our NLP software can understand.

  • Remove non-text
  • Encoding (convert to/from Unicode)
  • Sentence segmentation
  • Normalization (use the same variant of a word)
  • Stemming
  • Morphological analysis (tense, case, inflection)
  • Capitalization
  • Named entity extraction

A type is any sequence of characters that represents a word, whereas a token is an instance of a type.  Tokenization works out where words end and begin (can get tricky depending on punctuation, use of dashes, and multi word phrases).  Languages like Japanese and Chinese can be difficult to segment because they do not use spaces or other delimiters.  German can be challenging because of the use of long compound words.

Sentence boundary recognition often uses decision trees.  Features include punctuation, formatting, fonts, spacing, capitalization, and case.



Week 3


Semantic Similarity: Synonymy and other Semantic Relations


Synonyms are words with similar, but not necessarily identical, meaning.  Whether two synonymous words can be substituted depends on context, e.g. big sister vs large sister.

Polysemy is the property of a word to have multiple senses.  A book can be a literary work, stack of pages, record of bets, etc.

A synset is used to group together words that share the same sense.  Princeton's WordNet database is a project that builds many tree-like structures grouping together synonymous words into synsets and models hyper- and hyponymous relationships between synsets.


Thesaurus Based Word Similarity Methods


Using the tree-like structure of WordNet, it is possible to calculate a metric for word distance based on the traversal distance to get from one word to the other.  That is, traveling up the tree to a common ancestor and then back down the tree to the other word.  Modifications of this approach look at the Lowest Common Subsumer (e.g. "ungulate" for horse and deer) and look at the probability of that word appearing in a corpus (Version 3).  This can be further modified by taking into account the probabilities of the compared words (Lin similarity).  NLTK implements many of these (examples of Lin algorithm).
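
NLTK's WordNet interface exposes several of these measures.  A rough sketch with the horse/deer example (this assumes the wordnet and wordnet_ic data have already been fetched via nltk.download()):

from nltk.corpus import wordnet as wn
from nltk.corpus import wordnet_ic

horse = wn.synset('horse.n.01')
deer = wn.synset('deer.n.01')

# Path-based similarity: inverse of the traversal distance through the hierarchy.
print(horse.path_similarity(deer))

# Lowest common subsumer of the two synsets (should be the "ungulate" synset, as in the lecture example).
print(horse.lowest_common_hypernyms(deer))

# Information-content-based measure (Lin), using Brown corpus frequencies.
brown_ic = wordnet_ic.ic('ic-brown.dat')
print(horse.lin_similarity(deer, brown_ic))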


The Vector Space Model


Documents are represented in multidimensional space as vectors, and document similarity is measured by the angle between the vectors (or the cosine of that angle).  The Jaccard coefficient divides the size of the intersection of the two term sets by the size of their union.
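
A tiny sketch of both measures over made-up term-count vectors (the "documents" are invented, not from the lecture):

import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def jaccard(set_a, set_b):
    return len(set_a & set_b) / len(set_a | set_b)

doc1 = {"the": 2, "cat": 1, "sat": 1}
doc2 = {"the": 1, "cat": 1, "slept": 1}
vocab = sorted(set(doc1) | set(doc2))
u = [doc1.get(w, 0) for w in vocab]
v = [doc2.get(w, 0) for w in vocab]

print(cosine(u, v))                      # angle-based document similarity
print(jaccard(set(doc1), set(doc2)))     # overlap of the term sets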

Distributional similarity:  two words that appear in similar contexts are likely to be semantically related. "You know a word by the company it keeps".  Proximity comes in to play when determining what constitutes the "context".  This can mean adjacent words, words in same sentence, or even just the same document.


Dimensionality Reduction


Polysemy and synonymy can affect the calculation of cosine similarity, causing words to appear more closely or less closely related than they should.  Relatedness of words is also not captured very well by the cosine approach (doctor/patient/nurse/treatment), and vectors tend to be very sparse.  Dimensionality reduction can address some of these weaknesses.

Eigenvectors and eigenvalues are a crucial component in the reduction process.  If a matrix is square, it can be decomposed into UΛU^-1, where U is the matrix of eigenvectors, U^-1 is "U inverse", and Λ is the diagonal matrix of eigenvalues.

If the matrix is not square, we can use Singular Value Decomposition.  The matrix A can be decomposed into UΣV^T, where

  • U is the matrix of orthogonal eigenvectors of AA' (A times A transpose)
  • V is the matrix of orthogonal eigenvectors of A'A (A transpose times A)
  • Σ is the diagonal matrix of singular values (the square roots of the eigenvalues of A'A)
In Matlab (or Octave) [U, S, V] = svd (A)
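
The NumPy equivalent, for the Python-inclined (the matrix here is just a made-up non-square example):

import numpy as np

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 3.0, 1.0]])        # a non-square example matrix

U, S, Vt = np.linalg.svd(A)            # NumPy returns V transposed and S as a vector
Sigma = np.zeros(A.shape)
Sigma[:len(S), :len(S)] = np.diag(S)

print(np.allclose(A, U @ Sigma @ Vt))  # True: A = U Σ V^T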


Latent Semantic Analysis is introduced.


NLP Tasks


Part of speech tagging.
Parsing takes a sentence as input and produces a syntactic representation. Stanford Parser
NACLO problem: This Problem is Pretty // Easy "Garden path sentences"

Information extraction - extracting named entities and relations.
NACLO problem: Bertrand and Russell
Reading comprehension - system can answer questions about a piece of text
Word sense disambiguation - determine which sense of a word applies.  Very important in Machine Translation
Named Entity Recognition - identify parts of a person or organization name. In biology this can identify names of proteins, genes, cells, etc.
Semantic Role Labeling - identifies the arguments associated with a verb or predicate (demo)
Coreference resolution - Identify when two references apply to the same entity (often pronouns or noun phrases)
Ellipsis, parallelism, underspecification - some words are omitted because they are implied or inferred.

Question answering - Watson playing Jeopardy
Sentiment analysis - recognize object being discussed and whether feeling is positive or negative
Machine Translation - Google Translate, Moses Machine Translation
MT uses the noisy channel model: find the most probable underlying sentence that could have been "encoded" into the observed text
Text summarization
Text to Speech (demo, demo2)
Entailment and paraphrasing - can you infer paraphrase from text?
Dialogue Systems




Week 4


Syntax


Syntax is the set of grammatical rules that apply to a group of words (as opposed to an individual word).  From a syntactic point of view we are more interested in the way words form sentences (e.g. "the" can precede a noun), rather than the semantics of the words (nouns are objects, verbs are actions, etc.)

A constituent is a word or a group of words that function(s) as a single unit within a hierarchical structure [wikipedia].  Some of the constituent tests (to find out if two constituents are the same type): coordination, pronoun, question by repetition, topicalization, question, deletion, semantic, and intuition.

A Grammar is a set of rules that can be used to create sentences. It may look something like this:

S => N V
N => Samantha | Min | Jorge
V => left | sang | walked

This isn't a terribly interesting grammar because it only allows us to build two word sentences, but it demonstrates the basic idea.  Adding other parts of speech (JJ adjectives, DT determiners, etc.), and grouping them into phrases, will allow a more robust grammar, especially when the grammar allows for recursion.
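
The same toy grammar can be played with in NLTK (note NLTK writes rules with -> rather than =>); just a sketch:

import nltk
from nltk.parse.generate import generate

grammar = nltk.CFG.fromstring("""
S -> N V
N -> 'Samantha' | 'Min' | 'Jorge'
V -> 'left' | 'sang' | 'walked'
""")

# Enumerate every two-word sentence the grammar can produce (9 in total).
for sentence in generate(grammar):
    print(" ".join(sentence))

# And parse one of them back into a tree.
parser = nltk.ChartParser(grammar)
for tree in parser.parse(["Min", "sang"]):
    print(tree)   # (S (N Min) (V sang))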

Adjective ordering is often determined by semantics. det < number < size < color < purpose < noun, strength < material < noun, origin < noun

The Three French Little Steadfast Tin Red Riding Hoods  ... oh my =S

Conjunctions can produce nested sentences.

Parsing

The process of taking a sentence and a grammar, and determining the parse tree that could produce the sentence. Unlike programming languages, human languages have many ambiguities that make them difficult to parse.

Applications - grammar checking, question answering, machine translation, information extraction, speech generation and understanding.

Context-free Grammar is a 4-tuple (N, Σ, R, S)
  • N: non-terminal symbols (NP, PP, etc.)
  • Σ: terminal symbols (words, disjoint from N) also called lexicon
  • R: rules (A => β), where β is a string from (Σ ∪ N)*
  • S: start symbol from N ("sentence" symbol is an example, but not only example)
Constituent order is important in language... sentences are not just a bag of words.  Some languages are Subject Verb Object (English, Mandarin), others are Subject Object Verb (Japanese, Korean)... other possible orderings as well.

Penn Treebank is a project at the University of Pennsylvania that contains a large collection of parsed and tagged corpora.



Classic Parsing Methods


One approach is to treat parsing as a search problem, constrained by the input sentence and the grammar.  This search can be done in two ways: a top down approach that starts with the whole sentence and tries to expand a tree structure to match, and a bottom up approach that builds up constituents from individual words until the entire sentence is added.  Both approaches can lead to dead ends.

Shift-reduce parser is a bottom up parser.  Basically works left to right, placing words on a stack and matching against the right hand side of grammar rules (shift).  When it gets a match, it replaces with left hand side of grammar rule (reduce).

Cocke-Kasami-Younger (CYK) parser is based on dynamic programming, and is very powerful when the grammar is binarized (right hand side has two non-terminal symbols).  Works bottom up, building a parse for a substring [i, j] based on all parses [i, k] and [k, j] that are included in it.  O(n³) for input string of length n.  (online demo)

Binary format is also called Chomsky Normal Form (CNF).
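
Here's a bare-bones CYK recognizer sketch for a grammar already in CNF.  The grammar and sentence are invented for illustration, and a real parser would also keep back-pointers so the tree can be recovered:

from collections import defaultdict

unary = {            # terminal rules:  A -> word
    "the": {"DT"}, "dog": {"N"}, "cat": {"N"}, "chased": {"V"},
}
binary = {           # binary rules:  A -> B C
    ("DT", "N"): {"NP"},
    ("V", "NP"): {"VP"},
    ("NP", "VP"): {"S"},
}

def cyk(words):
    n = len(words)
    # table[(i, j)] = set of non-terminals spanning words[i:j]
    table = defaultdict(set)
    for i, w in enumerate(words):
        table[(i, i + 1)] = set(unary.get(w, ()))
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):          # split [i, j] into [i, k] and [k, j]
                for b in table[(i, k)]:
                    for c in table[(k, j)]:
                        table[(i, j)] |= binary.get((b, c), set())
    return "S" in table[(0, n)]

print(cyk("the dog chased the cat".split()))   # True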


Earley Parser


The Earley parser is a type of chart parser.  It uses dynamic programming, working left to right through the input text.  Unlike the CYK parser, it is not necessary for the grammar to be converted to CNF first.

Both Earley and CYK have some weaknesses.  Agreement is one such area.  Many kinds of agreement are possible: number, tense, case, and gender.  Creating gender specific rules in the grammar can lead to a combinatorial explosion.  Another problem area is subcategorization frames.  Certain verbs take specific arguments.

Context Free Grammars (CFG) make assumptions about independence that may be inaccurate.  Certain parses are going to be more likely, based on their occurrence in the language, than others, but again this is not captured in the grammar itself.


The Penn Treebank


Human annotators create a collection of parses for the purpose of training automatic parsers.  It contains 40,000 training sentences and 2400 test sentences.  Contains mostly Wall Street Journal text and spoken conversations.



Week 5 - Parsing


Intro and Parsing noun sequences


Challenges to parsing human language:

  • coordinating scope: small boys and girls are playing (are girls also small?)
  • prepositional phrase attachment: I saw the man with the telescope (did I see with telescope, or does man have a telescope?)
  • gaps: Mary likes physics but hates chemistry (who hates chemistry?)
  • particles vs prepositions: She ran up a large bill/hill (is "up" a particle in "ran up" or not?)
  • Gerunds vs adjectives: playing cards can be expensive (is playing a verb or adjective?)
Noun-noun compounds are combinations of nouns that form one complete idea, though sometimes their interpretation can be ambiguous.  "Fish Sauce" can be a sauce for fish, or a sauce made from fish.  Grouping of nouns in a compound noun can be done using parentheses, like:

(((Leland Stanford) Junior) University)

It turns out that the number of possible groupings follows the Catalan number sequence: 2 words have 1, 3 words have 2, 4 words have 5, 5 => 14, 6 => 42, 7=> 132, etc.
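
Quick sanity check of those counts (n words have Catalan(n-1) possible bracketings):

from math import comb

def catalan(k):
    return comb(2 * k, k) // (k + 1)

for n_words in range(2, 8):
    print(n_words, "words:", catalan(n_words - 1), "groupings")
# 2 words: 1, 3: 2, 4: 5, 5: 14, 6: 42, 7: 132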


Prepositional Phrase Attachment


Prepositions can modify the noun that is closest to them, or the verb that is before that noun.  Two types of attachment:

  • High (verbal) - attaches to the verb (Lucy's plane leaves Detroit on Monday)
  • Low (nominal) - attaches to the noun (This painting must cost millions of dollars)
In binary classification problems, the attachment label is represented as 0 (high) and 1 (low), and rather than looking at an entire sentence, just the preposition, the nouns before and after the preposition, and the verb before the preposition are considered.  This is done because this tuple structure represents all the information needed, and it makes the machine learning structure consistent and scalable.  Standard corpus for training on PP attachment is RRR 1994 (motivating paper, dataset on github), named after scientists Ratnaparkhi, Reynar, and Roukos working for IBM Research at the time.
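
To make the setup concrete, here's a toy sketch of the (verb, noun1, preposition, noun2) => 0/1 formulation using scikit-learn.  The training tuples are made up for illustration, not taken from the RRR data:

from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

train = [
    (("leaves", "Detroit", "on", "Monday"), 0),   # attaches to the verb (high)
    (("cost", "millions", "of", "dollars"), 1),   # attaches to the noun (low)
    (("saw", "man", "with", "telescope"), 0),
    (("ate", "pizza", "with", "anchovies"), 1),
]

def features(tup):
    v, n1, p, n2 = tup
    return {"v": v, "n1": n1, "p": p, "n2": n2}

vec = DictVectorizer()
X = vec.fit_transform([features(t) for t, _ in train])
y = [label for _, label in train]

clf = LogisticRegression().fit(X, y)
test = ("leaves", "Boston", "on", "Friday")
print(clf.predict(vec.transform([features(test)])))   # hopefully 0 (verbal attachment)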

Spends a lot of time reviewing concepts surrounding model evaluation, and several algorithms for rule based attachment.


Statistical Parsing


Probabilistic Context-free Grammars (PCFGs) are similar to a deterministic CFG in structure, still consisting of a 4-tuple, with one addition: the rules include a probability of the right side given the left side.

  • N: non-terminal symbols (NP, PP, etc.)
  • Σ: terminal symbols (disjoint from N) 
  • R: rules (A => β) [p], 
    • β is a string from (Σ ∪ N)*
    • p is the probability P(β | A)
  • S: start symbol from N 

When rules have the same left hand side, their probabilities must sum to 1.  The main tasks when using a PCFG, given a grammar and a sentence:
  • find the parse tree with the maximum probability
  • find the probability of the sentence as a sum of all possible parse trees
Parsing methods include Earley and CKY, just with the addition of a probabilistic component.

Probabilities can be learned from a corpus, like Penn Treebank (Maximum likelihood estimates)


Lexicalized Parsing


Allows us to capture irregularities with specific words and phrases.  PCFGs don't take into account the fact that certain words may take a different number of arguments (give [someone something] vs see [something]), or differences in semantics.


Dependency Parsing


Previous sections were concerned with constituent parsing.  Dependency parsing, by contrast, represents a sentence in terms of head-modifier (dependency) relations between individual words.

Consider the phrase blue house.  This phrase is referencing a type of house, not a type of blue, so blue is said to modify house, blue <-- house. "Blue" can be called a modifier, dependent, child, or subordinate, and "house" is said to be the head, governor, parent, or regent.

Dependency grammars have the following characteristics:
  • They capture the lexical and semantic dependencies between words
  • The top level verb is the root of the dependency tree
  • Simpler to parse than CFG
  • Particularly good for free word order languages.
A dependency parser can use dynamic programming similar to CKY, with O(n³) complexity.  It can also be constructed as a constraint satisfaction problem; however, because CSP is NP-Complete, it requires good heuristics to make it work in practice.  A third approach is deterministic parsing, which is similar to the shift/reduce approach for CFGs (MaltParser is one example).  Maximum spanning trees can also be used, an approach that works with non-projective parses.

A parse is said to be projective if none of the dependencies cross.  MaltParser only works for projective parses, though this isn't really an issue in English.  Homework 1 basically has us implementing portions of a greedy transition-based parser like MaltParser.  Lecture 17-2  from the Stanford NLP series goes into great detail on how MaltParser works, and is very helpful in the first homework.

Evaluating a dependency parser uses two metrics, the Labeled dependency accuracy, and the Unlabeled dependency accuracy (attachment accuracy).  

Dependency Kernel determines how similar two sentences are based on the similarity of their dependency parses.

Prague Dependency Treebank has some interesting tools associated with it, such as TrEd (TRee EDitor).  I didn't play around with it though...


Alternatives to CFG


Tree Substitution Grammar (TSG) - terminals can generate entire tree fragment

Tree Adjoining Grammar (TAG) - like TSG but allow adjunction.  More powerful than CFG but less powerful than Context Sensitive Grammar (CSG).  Can handle cross-serial dependencies, e.g. "Mary gave a book and a magazine to Chen and Mike, respectively.".  CFG wouldn't be able to model this situation because book => Chen and magazine => Mike.

Combinatory Categorial Grammar has complex types, using forward and backward slash to denote if an argument appears on the right or left of the object, i.e. X/Y denotes that Y appears to the right of X and X\Y denotes that Y appears on the left of X.  Y is the argument, and X is the returned object. A rule would look like this:

sleep  S \ NP

meaning if you combine "sleep" with a noun phrase on left, it returns a sentence.  So combining "sleep" with the noun "I" on the left (i.e. "I sleep") returns a sentence. Found some slides that elaborate a bit: slides.

Briefly touches on semantic parsing.

Shares 6 NACLO problems on parsing:  Twodee, One Two Tree, CCG, Combining Categories, Grammar Rules, Sk8 Parser



Week 6


Good bit of review this week on basic probability stuff.  No need for me to rehash it all here since I'm getting much better depth on the subject in my Computational Probability class.

We are introduced to probabilistic models for language, which basically tell us the likelihood of a given sentence.  N-grams are the basis of one such model.  At its simplest, a 1-gram (unigram) model is just single words, with probabilities based on their frequency in the corpus.  2-grams (called bigrams) are two word sequences, and this is where it starts to get more interesting, as this allows us to condition the probability of one word on the word that came before it (i.e. likelihood of "Square" given "Times").  3, 4, 5 grams can be used, but as the grams get longer the data become much more sparse.

The "Markov Assumption" allows us to work with very sparse data using N-grams.  Rather than estimating the probability of a sentence based on the full joint probability, we use the probability of each N-Gram and take the product.  This is a trade off... it is technically less accurate than using the proper full joint probability, but the full joint may not be feasibly computable anyway.  This shortcut works well when the corpus the probabilities were calculated on is very similar to the genre of the sentence we are estimating.

Notes that because the product of many small numbers can lead to underflow, it is best to work in log space and then convert back with exponentiation.

Smoothing - important with sparse data sets, since the test data may have values that are not in the training set, and setting their probability to 0 would make the entire sentence have a probability of 0. I covered Laplace (add 1) smoothing in Week 11 of the AI class.  Good-Turing is another smoothing technique that estimates the value of unseen events based on the probability of seen events.
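
A minimal sketch of a bigram model with add-one smoothing on a toy corpus (a real model would also need to handle unknown words with something like an <unk> token):

from collections import Counter

corpus = [["<s>", "times", "square", "</s>"],
          ["<s>", "times", "change", "</s>"]]

unigrams = Counter(w for sent in corpus for w in sent)
bigrams = Counter((w1, w2) for sent in corpus for w1, w2 in zip(sent, sent[1:]))
V = len(unigrams)   # vocabulary size

def p_laplace(w2, w1):
    # P(w2 | w1) with add-one (Laplace) smoothing
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + V)

print(p_laplace("square", "times"))   # seen bigram: (1 + 1) / (2 + 5)
print(p_laplace("sundae", "times"))   # unseen bigram, but still non-zero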

Another way of dealing with sparse data is with backoff.  For a given N-gram model, if the higher order model does not have enough data, the estimate is "backed off" to the N-1 gram model (e.g. "pickle sundae" doesn't appear, so "pickle" and "sundae" are considered individually).  Interpolation is a more effective strategy than backoff.  With interpolation, a sparse 3-gram would be replaced with the linear combination of the 3, 2, and 1 gram probabilities.

One way of evaluating a language model is using a concept called perplexity.  Perplexity measures how well a model fits an actual sentence (effectively a branching factor).  Two distributions can be compared using cross-entropy, which basically compares the perplexity of the two distributions.

Word Error Rate is very much like the Levenshtein Edit Distance applied to sentences rather than individual words.  Measures distance based on the number of additions, subtractions, and substitutions between two sentences.

Long distance dependencies can trip up N-gram models.

Tools for doing language modeling:  SRI-LM and CMU-LM, as well as Google N-Grams


Word Sense Disambiguation


"Polysemy" is the property of words to have different senses.  "Let's have a drink in the bar" vs "I have to study for the bar".  Word sense disambiguation is used to figure out which of these senses is intended.

One method is the Dictionary Method, or Lesk Algorithm.  Basically this looks at the dictionary definitions of ambiguous words and finds the senses that are the closest match (think "plant" and "leaf").  Another method is Decision Lists, which relies on "collocation".  So if "plant" is next to "living", it means the foliage, and if it's next to "manufacturing", it means the factory.

K-nearest neighbor can be used for word sense classification.  Bootstrapping is a form of semi-supervised learning that can learn senses, starting with some labeled data.  It then trains on collocations based on those labels, and automatically labels examples with this new data.

Homework 2

Homework 2 consists of two parts. While it's kind of billed as an introduction to NLTK, the graded portions really don't require using the library that much.  There is a quick overview, though, so maybe I could have taken advantage of it more.  Who knows...  The first part of the assignment involves creating a language model.  I fumbled around with this a lot before I turned to the discussion forums and got some guidance.  The biggest obstacle with calculating the bigram and trigram probabilities was monkeying around with the start symbols.  This code is super ugly because of all the little "hacks" here and there... I really don't feel like it's an elegant solution

Part two has us creating a part of speech tagger and holy hell, this part hurt my brain.  Wrapping my brain around how to implement the Viterbi algorithm took some time.  It's one thing to understand the algorithm, it's something else entirely to make it work in code.  I did manage to get it working well enough to earn full marks, though it wasn't perfect.  Probably some missing special tag somewhere caused it to be off a tiny bit, who knows...



Week 7


Noisy Channel Model - Treats the subject of our task (foreign language string in machine translation, audio signal in speech recognition, etc.) as an "encoded" version of what we really want.  So if our model looks like this:

x => ENCODER => s

Where "s" is the signal, what we want is to guess the value of "x" that maximizes P(x) * P(s | x)

The probability P(s | x) is called the "translation model" and P(x) is the "language model"

Part of Speech tagging - associating words in a sentence with POS tags (nouns, verbs, prepositions, articles, etc.) Parts of speech are divided into two classes:

  • Open class (new words can be added): nouns, non-modal verbs, adjectives, adverbs
  • Closed class (generally no new words): prepositions, modal verbs, conjunctions, particles, determiners, pronouns

Ambiguity is a major challenge with POS tagging.  11% of types, 40% of tokens in Brown corpus are ambiguous.  Disambiguation can be done using n-gram models that look at either the neighboring word, or the neighboring part of speech.

Baseline for evaluating a POS tagger is to tag a word with its most likely tag, and tag everything else as a noun, which results in about 90% accuracy.  Current state of the art is about 97%, while human performance is about 98% (in 2% of cases humans don't agree).

Rule based POS tagging uses finite-state automata and hundreds of manually designed constraints.

Touches on Markov Models, which were covered in the AI class.  One modification we touch on here is adapting the MM to language by using a bigram model.  Where a joint probability of a sentence can be expressed as P(x1 ... xT) = P(x1) * P(x2 | x1) * P(x3 | x1, x2) * ... P(xT | x1 ... xT-1) , a MM makes assumptions about independence that let us formulate this as P(x1 ... xT) = P(x1) * P(x2 | x1) * P(x3 | x2) * ... P(xT | xT-1) , where each probability is conditioned only on the word before it (making it a bigram). This is a visible Markov Model.

A hidden Markov Model (HMM) assumes there is a Markov process as described above, however we do not directly observe the states, instead using inference over observations (emissions) to estimate the state.  The "states" in an HMM record the recent history (in bigram model, the prior word or POS).  The transition model assigns probabilities to the transitions (such as transitioning from article to a noun).  Transitions that are not grammatically legal are assigned a low probability, and transitions that are very common are given a high probability.

In an HMM, an observer can only see the emitted symbols and not the states that lead to their generation.  Given this sequence of observations, the task is to estimate the probability that these observations were generated by a given model.

Viterbi algorithm is based on dynamic programming, and finds the best path up until observation i and state s.  The forward algorithm is a closely related algorithm.
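
Since the homework leans so heavily on it, here's a compact sketch of Viterbi for an HMM tagger.  The states and probabilities are toy numbers I made up, not anything from the course:

def viterbi(observations, states, start_p, trans_p, emit_p):
    # V[t][s] = probability of the best path ending in state s after t+1 observations
    V = [{s: start_p[s] * emit_p[s].get(observations[0], 1e-12) for s in states}]
    back = [{}]
    for t in range(1, len(observations)):
        V.append({})
        back.append({})
        for s in states:
            prob, prev = max(
                (V[t - 1][ps] * trans_p[ps][s] * emit_p[s].get(observations[t], 1e-12), ps)
                for ps in states)
            V[t][s] = prob
            back[t][s] = prev
    # Trace back from the best final state to recover the full tag sequence.
    best = max(V[-1], key=V[-1].get)
    path = [best]
    for t in range(len(observations) - 1, 0, -1):
        best = back[t][best]
        path.insert(0, best)
    return path

states = ["DT", "N", "V"]
start_p = {"DT": 0.8, "N": 0.1, "V": 0.1}
trans_p = {"DT": {"DT": 0.01, "N": 0.9, "V": 0.09},
           "N":  {"DT": 0.1, "N": 0.2, "V": 0.7},
           "V":  {"DT": 0.6, "N": 0.3, "V": 0.1}}
emit_p = {"DT": {"the": 0.9},
          "N":  {"dog": 0.5, "cat": 0.5},
          "V":  {"chased": 1.0}}

print(viterbi("the dog chased the cat".split(), states, start_p, trans_p, emit_p))
# Expected: ['DT', 'N', 'V', 'DT', 'N']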

HMM can be trained with both supervised and unsupervised methods.  Unsupervised training commonly uses EM (expectation maximization) methods, of which the Baum-Welch algorithm (forward-backward) is of particular note.  Baum-Welch is not guaranteed to find an optimal solution, but in practice often does well enough.

Transformation based learning is another approach to POS tagging which uses transformation rules.

Johns Hopkins resource for learning HMMs: Interactive Spreadsheet.

Information Extraction is the process of identifying named entities from unstructured or semi-structured text.  Can also identify times and events from input.

Named Entity Recognition is divided into two tasks: segmentation and classification.  Segmentation involves identifying which words belong to a named entity, while classification is identifying the type of the segmented entity.  Very commonly used in the biomedical domain (gene identification example).

Last lecture this week covers Relation Extraction.  Covers some examples looking at identifying the relevant data in joint venture announcements (companies involved, location, dates, etc).  Quick overview of regular expressions.

Evaluating a relation extraction model can be done with Precision, Recall, and the F1 measure (a small worked example follows the list):

  • Precision (P) = correctly extracted relations / all extracted relations
  • Recall (R) = correctly extracted relations / all existing relations
  • F1 = 2PR / (P + R)
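
A tiny worked example with made-up counts:

correct_extracted = 40   # relations we extracted that are actually correct
all_extracted = 50       # everything we extracted
all_existing = 80        # gold-standard relations in the data

P = correct_extracted / all_extracted      # 0.8
R = correct_extracted / all_existing       # 0.5
F1 = 2 * P * R / (P + R)                   # ~0.615
print(P, R, F1)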

Homework 3


Another exercise in ultimate frustration.  Maybe it's meant to be a teaching tool... we learn by floundering around and kinda-sorta figuring crap out.  I dunno.

The last assignment is word sense disambiguation in three languages: English, Spanish, and Catalan.  The basic idea is to train k-nn and SVM classifiers so they can determine what "sense" of an ambiguous word is intended in a text document.  It would have been much easier to wrap my head around what they were looking for if the instructions had been clearer.  The data we get looks something like this:

{
  word1: [(instance_id, left_context, head, right_context, sense_id), ...]
  word2: [(instance_id, left_context, head, right_context, sense_id), ...]
}

or more concretely:

{
  'cat': [('cat_001', 'blah blah before word', 'cat', 'blah blah after word', 001), ...]
  'dog': [('dog_001', 'blah blah before word', 'dog', 'blah blah after word', 001), ...]
  ...
}

For part A, the basic algorithm is this:

  • get the k words that occur before and after the word, for every instance
    • this is the window or context of the word
  • aggregate these together into a combined set S.  This needs to be a set()
  • for each instance, count how many times each word in S appears in the context window
  • train classifiers based on these counts
First I got confused as to which direction the counting was happening (I was counting how many times each word in S was occurring in the context...).  I was also getting really crappy results until I figured out that S needed to be a python set (so no duplicates)... the instructions were counterproductive on that front because they sort of say S can have duplicates.
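
For what it's worth, here's a rough sketch of the part A counting step as I understood it.  The data layout mirrors the dict shown above, but the function names and the window size k are my own choices, not from the assignment spec:

from collections import Counter

def context_window(left_context, right_context, k):
    return left_context.split()[-k:] + right_context.split()[:k]

def build_features(instances, k=10):
    # instances: [(instance_id, left_context, head, right_context, sense_id), ...]
    windows = [context_window(left, right, k)
               for _, left, _, right, _ in instances]
    S = sorted(set(w for window in windows for w in window))   # must be a set: no duplicates
    vectors = []
    for window in windows:
        counts = Counter(window)
        vectors.append([counts[w] for w in S])   # count of each word of S in this window
    labels = [sense_id for *_, sense_id in instances]
    return vectors, labels    # ready to feed to a k-nn or SVM classifier

instances = [
    ("cat_001", "blah blah before word", "cat", "blah blah after word", 1),
    ("cat_002", "other words before", "cat", "other words after", 2),
]
X, y = build_features(instances, k=3)
print(X, y)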


Once part A is done, we try to add some additional features.  I implemented stemming for English and Spanish, as well as pos tagging for English.  Catalan didn't have a stemmer or pos tagger available, but it didn't matter because the part A SVM results were good enough for full credit on part B anyway.

This probably could have been a much more fun and interesting assignment if not for the vague instructions leading to enormous frustration.  I'm just glad to be done...



Week 8


Question Answering systems.  Siri, Ask-Jeeves, Wolfram-Alpha, and Watson are examples.

The Excite and AOL corpora are mentioned, and Murax looks at "English Encyclopedia" types of questions.

Question Type Taxonomy:
  • Yes/no vs wh-
  • Factual vs procedural
  • Single vs Multiple answers
  • Objective vs subjective
  • Context-specific vs generic
  • Known answer in the collection?
Most systems use information retrieval, statistical procedures, and do not use deep knowledge. Example systems include Eliza, SHRDLU, and Start.  

Factual QA systems evaluated by TREC QA.  This was organized by NIST and uses a 2GB corpus called the AQUAINT corpus.  Most of the questions are wh- questions with short, unambiguous answers.  Scoring is calculated with MRR (Mean reciprocal rank) early on, and confidence weighted rank later.

Components of a Question Answering system:

  • Data Source identification - semi-structured vs text sources
  • Query Modulation - paraphrasing question into syntax of search engine
  • Document retrieval - may use search engine
  • Sentence ranking - ngram matching, Okapi
  • Answer extraction - question type classification, phrase chunking
  • Answer ranking - question type, proximity, frequency
Question classification aided by taxonomies by IBM AnSel and University of Illinois (UIUC).  Classification task can be learned using machine learning techniques, or regular expressions.


Reviews the architecture of several systems developed since the late nineties.  


Challenges in QA:

  • Word sense disambiguation
  • Co-reference resolution
  • Semantic Role labeling
  • Temporal questions (who is the president of the United States)
  • Context (e.g. Categories in Jeopardy)



Week 9


Text Summarization - taking a document or collection of documents and returning a short summary.  Genres of summaries include headlines, outlines, minutes, biographies, abridgements, sound bites, movie summaries, etc.

There are typically three stages in the summarization process:

  • content identification - determine relevant info and facts from input document(s)
  • conceptual organization - combining data, deciding what to preserve, etc
  • realization - making readable summary, generating connectives ("for example" etc.)

Extractive summarization does not transform the identified snippets from the input document.  Usually whole sentences, not simplified or rewritten, and presented in the same order. Baseline is to just output the first few sentences (in news stories, for example).

Early work that stepped beyond this naive approach started to look at word frequency and word clustering to identify potential "content" words.  Also considers bonus words ("significant") and stigma words ("hardly", "impossible").

Knowledge based systems look to "fill in the blanks" of a script using manually selected keywords to determine which type of news story is being summarized (that is, which "script" to fill in).

Some of the problems with extracts include a lack of balance (only telling one side of the story), a lack of cohesion, and reference errors.  Using rhetorical structure and rhetorical connectives can address some of these weaknesses.

Kupiec in 1995 used a Naive Bayes classifier to determine the probability that a sentence would be included in a summary based on a set of features (sentence length, uppercase words, thematic words, sentence position, fixed phrases).  This model showed significant improvement over "Lead" based summarizer (taking first few sentences).

Graph based summarization looks at semantic links between sentences, with highly connected nodes corresponding to the most important passages.

Lexical chains provide another approach.  These chains are identified using several different relations:

  • extra strong - same word (repetition)
  • strong - WordNet relation (hyper/hypo nym)
  • medium - indirect WordNet relationship
Rhetorical structure theory based parser by Marcu looks at clauses or "utterances" and builds rhetorical connections between them (justification, elaboration, contrast, evidence, etc.) which it then uses to identify the most important ideas in the document.  Related utterances have a "nucleus" and "satellite" relationship, with the most important of the pair being the nucleus.  

Maximum Marginal Relevance is a metric for determining if a given sentence is adding new and relevant information to a summary (so you aren't getting the same basic idea repeated).

Mead is a clustering (centroid) based summarization system.  Similar approaches are used by Columbia's Newsblaster system, and Google News


LexRank is an interesting approach that creates a graph of the sentences based on sentence similarity.  The most highly connected sentences are the most relevant.

Evaluating Summarization System


Evaluation criteria:
  • appropriate length
  • fidelity
  • salience
  • grammatical
  • non-redundant
  • referentially well formed
  • structure and coherence
An ideal evaluation looks at the compression ratio and the retention ratio (say a 10% summary that retains 90% of the information).

Lengthy discussion of relative utility, subsumption, and equivalence.

Sentence Compression


Sentence simplification removes the least informative parts of the text:
  • Quotes
  • Appositions (e.g. Barack Obama, president of the United States)
  • Adjectives and adverbs
  • Embedded clauses
  • Attribution clauses
Using structured syntactic information, can use noisy channel and decision based approaches.  Important corpus is the "Simple English Wikipedia"



Week 10


Collocations are phrases that have a specific meaning that is different from the words that make up the phrase ("kick the bucket" meaning "died" as opposed to literally kicking a physical bucket).  Idioms and free-word combinations are related topics.  

The "Base" bears most of the meaning in a collocation ("base" is a part of speech).  In "Set the table", the base is the noun "table" and the collocator is the verb "set".  

Information Retrieval is the domain of building search engines.  Key terms:
  • Query - representation of what the user is looking for
  • Document - information the user wants
  • Collection - set of documents
  • Index - representation of information that makes querying easier
  • Term - word or concept that appears in a document
Document collections can be represented as a matrix of documents and terms, with either boolean or integer values representing whether and how often a term appears in the document.  However for large corpora (like the internet), this is infeasible.  The solution is to "invert" the index, and instead store a "posting table", where each term has an associated linked list of documents containing that term.
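
A small sketch of a posting table over three made-up "documents":

from collections import defaultdict

docs = {
    1: "new york is a city",
    2: "the city never sleeps",
    3: "new haven is in connecticut",
}

index = defaultdict(list)
for doc_id, text in docs.items():
    for term in set(text.split()):       # set(): record each document once per term
        index[term].append(doc_id)

print(index["new"])    # [1, 3]
print(index["city"])   # [1, 2]

# A boolean AND query is then just an intersection of posting lists.
print(set(index["new"]) & set(index["city"]))   # {1}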

Documents are represented as a vector.  This allows us to compare documents using the angle between the vectors, or the cosine of the angle (demo).  When queries are treated as documents, this angle can also allow us to test the similarity of documents to a query.

Positional indexing tracks the order in which a term appears.  This is an important concept for multi-word phrases such as "New York".

IDF or Inverse Document Frequency is a metric used to determine the importance of a term based on how infrequently it appears.  The intuition is that words that appear very often have less information value.
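
And a quick sketch of IDF on a tiny made-up collection:

import math

docs = ["new york is a city",
        "the city never sleeps",
        "new haven is in connecticut"]
N = len(docs)

def idf(term):
    df = sum(1 for text in docs if term in text.split())   # document frequency
    return math.log(N / df) if df else 0.0

print(idf("is"))     # in 2 of 3 documents -> low weight
print(idf("york"))   # in 1 of 3 documents -> higher weight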

Evaluation is difficult.  Some of the factors include:
  • Size of index
  • Speed of indexing
  • Speed of retrieval
  • Accuracy
  • Timeliness
  • Ease of use
  • Expressiveness of search language
Text Classification - hierarchy or flat, soft (multiple classes) or hard (only one)
Popular Techniques - generative (k-nn, naive Bayes) vs discriminative (SVM, regression)

Text Clustering - don't have any idea about what groupings (classifications) exist.  Unsupervised techniques like k-means.  Metrics for evaluating k-means are Purity and Rand index.

Hierarchical clustering methods:  single-linkage, complete-linkage, average-linkage.  Also called "agglomerative" clustering.  Clustering algorithms were covered in Week 12 of the AI class.



Week 11


Sentiment analysis - identify and extract subjective information in source materials.
Sentiment lexicon - words associated with a sentiment.  Dictionary methods start with a seed word and use WordNet to find similar words.  Pointwise mutual information is another method.

Semantics - Representing meaning.  Semantic analysis assigns meaning to words and combines these into a meaning for the sentence.
  • Entailment - one fact follows from the other
  • Presupposition
Understanding meaning - if an agent hears a sentence and can act accordingly, the agent is said to understand it. First Order Logic (FOL) can be used to represent meaning, as can Propositional Logic (with some limitations).

Knowledge Representation - storing and reasoning about semantic information.  Ontologies, categories and objects, Events, Times, and Beliefs.  
  • Object - Martin the cat
  • Categories - Cat
  • Ontology - Mammal includes cat, dog, whale; Cat includes PersianCat, ManxCat
  • ISA Relation - ISA (Martin, Cat)
  • AKO (A Kind Of) - AKO (PersianCat, Cat)
  • HASA - HASA (Cat, tail)  meronymy
Inference in logic can be done with a method called Modus Ponens.  

Semantic Parsing converts natural language to a logical representation.



Week 12


Discourse analysis


One issue with discourse analysis is anaphora (referring to an entity that was introduced earlier in a passage).  Seems to involve using pronouns in place of proper nouns.  Cataphora is a similar property where the pronoun appears before the noun it refers to.

Coreference resolution:

  • agreement constraints
  • syntactic constraints
  • sentence ordering
Salience weights of features for coreference resolution (in rank order) [source]:
  • sentence recency
  • head noun emphasis
  • subject emphasis
  • existential emphasis
  • accusative (direct object) emphasis
  • non-adverbial emphasis
  • indirect object emphasis



Coherence


Coherence is the property of sentences in the same paragraph "belonging" together.  That is to say they share a logical connection.

Examples in decreasing order of coherence:

  • I saw Mary in the street.
    • She was looking for a bookstore.
    • She has a cat
    • The Pistons won.

Touch on RST (Rhetorical Structure Theory) again...

Coherence relations:

  • Result - The carpenter worked all day.  The cabinets were finished.
  • Explanation - The carpenter was tired.  He worked all day building a cabinet.
  • Parallel - The carpenter worked all day.  The upholsterer took the day off.
  • Elaboration - The carpenter built the cabinet.  It was made of oak and had four drawers.

Centering identifies the most important concept in a group of sentences.


Dialogue systems


What makes dialog different?  Turn-taking, barge-in.

Conversational implicature is shared information in the dialog between the participants.

Grice's Maxims:

  • Quantity - make contribution informative, but not more than needed
  • Quality - do not speak false, or say things without evidence
  • Relevance - say things pertinent to discussion
  • Manner - avoid ambiguity and obscurity


Speech Acts:

  • Assertives
    • suggesting, putting forward, swearing, boasting, concluding
  • Directives
    • asking, ordering, requesting, inviting, advising, begging
  • Commissives
    • promising, planning, vowing, betting, opposing
  • Expressives
    • thanking, apologizing, welcoming, deploring
  • Declarations
    • I resign, you're fired (implies an action)


Dialog system architecture:

  • Understanding
  • Dialogue manager
  • Task Manager
  • Generation



Machine Translation


Statistical machine translation uses "parallel corpora" similar to the Rosetta Stone.

  • The Hansards Corpus (French-English from Canada)
  • The Bible (thousands of translations)


Translation as Decoding... can translation be treated as a cryptography problem?

Differences in languages

  • Word Order: SVO, VSO, SOV
  • Prepositions (Postpositions)
  • Inflection
  • Lexical distinctions
  • Brother (Japanese) - otooto (younger), oniisan (older)
  • Gender differences
  • Adjective
  • Vocabulary (one word in language A is multiple different words in lang B)
  • Phrases (one word => multiword phrases)
The "Direct Approach" to machine translation basically reads input from left to right and looks each word up in a dictionary.  This approach was tried in 50s and 60s, and is not very effective (naive approach).

Indirect transfer has grammatical rules linking two languages.  Interlingua uses first order logic.

Noisy channel methods can also be used (parametric probabilistic model).


Evaluating machine translation systems is difficult.  Human translators may produce multiple translations of the same text that are all judged to be equally good, and having humans evaluate the output of a translation system is inherently expensive.  One method for automated evaluation of MT is called BLEU, which uses n-gram overlap and several other features to evaluate MT output against multiple human translated references.  Penalizes brevity to discourage cheating.


Text Generation


Natural language generation is the process of deliberately constructing a natural language text in order to meet specified communicative goals.

Mapping meaning to text.  Stages:

  • Content selection
  • Lexical choice
  • Sentence structure: aggregation, referring expressions
  • Discourse structure

