Thursday, August 11, 2016

Berkeley CS188 - Artificial Intelligence (edX)

Having passed the 70-486 exam, I decided I wanted to try something different before I tackled the last cert test, so I poked around MIT OpenCourseware, Coursera, and edX looking for something interesting.  I decided that it was about time to try the AI class.  MIT and edX both had similar offerings, and I went with edX since it offered the autograded exercises (and I already owned the recommended book).

Artificial Intelligence on edX (Lectures) (AI Homepage)
Artificial Intelligence: A Modern Approach

Supplemental:
Artificial Intelligence on MIT Open Courseware (Course website)
Artificial Intelligence from NPTEL
Machine Learning from University of British Columbia



Week 1 [June 15]

The opening lecture is pretty standard.  A bit of "administrivia" as the Berkeley profs like to call it, some history, overview of what the class will cover, a few demos.  It follows the topics covered in the first chapter of the book quite faithfully.  Klein recommends the third edition of the textbook (current), but notes that the second edition is also acceptable, while the first edition (the one I own, of course) is "right out".  Doh.  So I found the second edition on Amazon, used, for less than $10. Score!

The programming exercises are in Python, and the first week homework is a basic introduction to Python.  My experience with Python is limited to a couple of introductory HackerRank challenges (I avoided it in MIT's algorithm class by doing generic exercises in Java instead...).  So it looks like I'll be learning Python too.  Fortunately PluralSight has some pretty good Python courses, and so far Cloud9 has proven to be a pretty decent development environment for Python.  Week one also has a math self check.  I got stuck on one question about probability equivalencies, but otherwise I think I'm in good shape. If I get stuck on the statistics I can always get a refresher at Khan Academy.

The initial Python tutorial homework was not difficult.  I'm pushing all my code to a GitHub repo for easy reference.



Week 2 [June 30]

This week covers the basics of search algorithms.  We start with depth-first search, move on to breadth-first search, then uniform-cost search, until finally we learn A* (ay-star).  These search algorithms were examined both as tree search and as graph search.  A* uses a heuristic to help focus the search, making it more efficient than BFS and UCS.  In a tree search, it is sufficient that a heuristic be admissible, meaning it never overestimates the cost to the goal.  For a graph search problem, a heuristic must also be consistent, which means the heuristic can't drop by more than the cost of each arc, i.e. if the cost from A to B is 2, then the difference in the heuristic value from A to B must be no more than 2.
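
Here's a rough sketch of A* graph search for reference, loosely modeled on the Pacman search project's interface (getStartState, isGoalState, getSuccessors returning (successor, action, stepCost) triples); I'm going from memory, so treat the names as assumptions and this as an illustration rather than my actual submission:

    import heapq
    from itertools import count

    def a_star_search(problem, heuristic):
        """Graph-search A*: expand nodes in order of f(n) = g(n) + h(n),
        skipping states that have already been expanded."""
        start = problem.getStartState()
        tie = count()  # tie-breaker so the heap never has to compare states directly
        frontier = [(heuristic(start, problem), next(tie), 0, start, [])]
        closed = set()
        while frontier:
            _f, _, g, state, path = heapq.heappop(frontier)
            if problem.isGoalState(state):
                return path
            if state in closed:
                continue
            closed.add(state)
            for successor, action, step_cost in problem.getSuccessors(state):
                if successor not in closed:
                    new_g = g + step_cost
                    new_f = new_g + heuristic(successor, problem)
                    heapq.heappush(frontier, (new_f, next(tie), new_g, successor, path + [action]))
        return []  # no path found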

This week kind of caught me off guard a bit.  It started off like any other, lectures were good, quick little quiz.  Then I got to the project.... holy crap it was hard!  I'm not sure if it was hard because I'm still learning Python, or because I barely knew where to start, or what, but the project for this week seriously kicked my ass.  Which just made it all the more rewarding when I finally passed all the tests.  Project 1 on Github.  The things I struggled with in particular:

  • There was a bit of a learning curve to figure out how the game code interacted with the search code, though to be fair this wasn't that hard
  • Figuring out an admissible and consistent heuristic... and then implementing it
  • Efficiency is a thing.  The first time I finished, I passed the tests fine on my laptop but timed out when I submitted to edX.  I had to rework my corner heuristic to do fewer list operations and flatten a loop, which let me pass the submission autograder.
The book readings for this week were all of chapter 3 and the first two sections of chapter 4 (from the 2nd edition, 3rd edition is different).  Basically just covering the material from the lecture, probably would have been good to read it first, then watch lectures and do homework... oh well.



Week 3 [July 4] 

This week covers Constraint Satisfaction Problems (CSP).  CSPs follow a standard structure, which allows them to be solved by generic algorithms and heuristics rather than problem-specific methods. Whereas the problems examined in week 2 are concerned with the path to the solution, CSPs are only concerned with the solution itself, which the lecture refers to as an identification problem.  That is, we only care about the actual values that satisfy the constraints, and not the search we did to find them.  A few key concepts:

  • Variables: What is being assigned to.  This could be the countries on a map coloring problem, the queens in the n-queens problem, or the letters in a cryptarithmetic problem.
  • Domain:  The allowable values for a variable. {red, green, blue} in the coloring problem, or row numbers for the queens.
  • Constraints:  Which assignments to the variables are allowed.  Can be implicit (computed) or explicit (enumerated); unary, binary, n-ary.
  • Backtracking Search:  Assigns one variable at a time, checking constraints as variables are assigned.  Basically DFS with these two modifications.
  • MRV: Minimum Remaining Values heuristic, selects the variable with the smallest domain for next assignment
  • LCV: Least Constraining Value heuristic, selects the value that removes the fewest values from the domains of neighboring variables.
  • Forward Checking:  When an assignment is made, check the connected variables and delete conflicting values from their domain. Acts as a filter.
  • Arc-consistency: A fast method of constraint propagation which provides a stronger check than forward checking.  Checks directed arcs after each assignment to ensure that for every value in the tail, there is some value in the head that can be assigned without violating a constraint.  Any values that would cause a violation are deleted from the tail (constraint police check the trunk).  Uses a queue to recheck arcs that may have lost consistency as domains shrink (rough sketch of AC-3 after this list).
  • Problem Structure: Identify independent subproblems, cutsets, and tree decompositions to make large problems smaller.  If graph has no loops, can be solved as a tree in O(nd²).
  • Iterative improvement: Complete assignment to start, possibly with constraint violations. Iteratively make the assignment better. Local search method.
  • Also touched on hill climbing and genetic algorithms.
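
Here's a rough sketch of the AC-3 arc-consistency idea from the list above.  The domains/neighbors dictionaries and the satisfies() constraint check are made up for illustration; this isn't the course's code:

    from collections import deque

    def ac3(domains, neighbors, satisfies):
        """Enforce arc consistency.
        domains:   dict mapping each variable to a set of candidate values
        neighbors: dict mapping each variable to the variables it shares a constraint with
        satisfies: function (x_var, x_val, y_var, y_val) -> bool (hypothetical constraint check)
        Returns False if some domain is wiped out, True otherwise."""
        queue = deque((x, y) for x in domains for y in neighbors[x])
        while queue:
            x, y = queue.popleft()
            if revise(domains, x, y, satisfies):
                if not domains[x]:
                    return False  # no consistent assignment exists
                # x's domain shrank, so arcs pointing at x need rechecking
                for z in neighbors[x]:
                    if z != y:
                        queue.append((z, x))
        return True

    def revise(domains, x, y, satisfies):
        """Delete values from the tail (x) that have no supporting value in the head (y)."""
        removed = False
        for x_val in set(domains[x]):
            if not any(satisfies(x, x_val, y, y_val) for y_val in domains[y]):
                domains[x].discard(x_val)
                removed = True
        return removed
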
For some of the pop-quiz questions I found the Consistency Based CSP solver from aispace.org useful.  I did have to add aispace.org to the Java "Exception Site List" in order for the app to run.


I tried reading the book before watching the lectures this time around and found that it worked out much better.  The book goes into great detail, while the lectures hit home the concepts.  No programming project this week, just a collection of multiple choice questions that weren't too painful.



Week 4 [July 12]

This week introduced adversarial search.  With adversarial search problems, there is at least one other agent whose goals are in conflict with our own.  This gives rise to competitive behavior, and these problems are often called games.  Some of the key concepts from this module:

  • When facing an adversarial opponent, minimax is the algorithm used to maximize our own outcomes.
  • minimax can be sped up considerably through the use of alpha-beta pruning, which keeps track of the best values for each agent throughout the search and eliminates nodes and branches from the search when they can no longer affect the outcome (see the sketch after this list).
  • When facing a probabilistic agent, we can use expectimax, which is similar to minimax but instead of a strictly minimizing opponent, there is an element of chance.
  • It's important not to be too optimistic (expectimax on an adversarial opponent) or too pessimistic (minimax on a probabilistic opponent).  In Lecture 7, around minute 32, we see a pretty funny demo of Pacman freaking out because he thought the random ghost was going to chase after him.
  • In non-zero sum games, cooperation and betrayal behavior can emerge as goals coalesce and diverge.
  • True minimax requires searching to the leaf nodes, but in many games this is impractical (the number of nodes is many orders of magnitude more numerous than the atoms in the universe), so an evaluation function estimates the value of an intermediate (non-terminal) state.  Minimax then determines actions based on these value estimates.
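
For reference, here's a stripped-down sketch of depth-limited minimax with alpha-beta pruning for a two-player, zero-sum game.  The game-state methods (is_terminal, get_legal_actions, generate_successor) and the evaluate function are invented for illustration, not the Pacman project's exact API:

    def alpha_beta(state, depth, alpha, beta, maximizing, evaluate):
        """Depth-limited minimax with alpha-beta pruning.  Returns (value, best_action)."""
        if depth == 0 or state.is_terminal():
            return evaluate(state), None

        best_action = None
        if maximizing:
            value = float('-inf')
            for action in state.get_legal_actions():
                child, _ = alpha_beta(state.generate_successor(action),
                                      depth - 1, alpha, beta, False, evaluate)
                if child > value:
                    value, best_action = child, action
                alpha = max(alpha, value)
                if alpha >= beta:
                    break  # beta cutoff: MIN will never let the game reach this node
        else:
            value = float('inf')
            for action in state.get_legal_actions():
                child, _ = alpha_beta(state.generate_successor(action),
                                      depth - 1, alpha, beta, True, evaluate)
                if child < value:
                    value, best_action = child, action
                beta = min(beta, value)
                if beta <= alpha:
                    break  # alpha cutoff: MAX already has a better option elsewhere
        return value, best_action

    # Typical call from the root:
    # value, action = alpha_beta(start_state, 3, float('-inf'), float('inf'), True, evaluate)
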
In addition to the Berkeley lectures (which were very good for this unit), I also watched lecture 6 of the MIT course, which covers the same general material.  It's a totally different approach (watching an actual lecture rather than narrated slides), and I rather liked experiencing both, so I'm going to go back and watch the first few lectures, then watch both series going forward.  Since the projects and homeworks are the major time sinks, it seems like a decent investment in time.

Project 2 covers several concepts from this module.  First we are tasked with implementing an evaluation function for a "Reflex agent", that is, one that makes a decision based only on the current state.  I used a combination of ghost position and average food distances to get consistent 1200 scores (more than enough for full credit), then proceeded to ignore the instructions not to spend too much time on it, and spent several hours trying to improve my scores.  A couple of things helped sometimes, but the more clever I tried to get, the less reliable my results became.  So I moved on.

The second question has us implement minimax.  This was made much simpler by actually cracking open the book and following the pseudocode, though it still required plenty of tweaking to make it general enough to work with multiple adversaries.  The tricky part was getting the depth calculation right.  I kept terminating early, but somehow half the tests would pass anyway. Go figure.

Once minimax is done, Alpha-Beta and Expectimax are pretty easy, just some minor tweaks to the code.  The last question has us write a better evaluation function.  I was able to reuse quite a bit of the code I wrote for the reflex agent eval function, and averaged over 1200 pretty consistently.  Whoo-hoo!  Code on Github.




Week 5 [July 20]


I was a bit surprised that the reading from AI:MA this week was actually the utility material covered at the end of last week's lecture.  Basics of utility theory, along with its 6 axioms (constraints on rational preferences):
  • Orderability:  Given two states, agent must prefer one to the other or be indifferent.  Cannot avoid deciding.
  • Transitivity:  If A > B and B > C, then A > C
  • Continuity: If A > B  > C, then some p exists such that [p, A; 1 - p, C] ~ B
  • Substitutability:  If A ~ B, then [p, A; 1 - p, C] ~ [p, B; 1 - p, C]
  • Monotonicity:  If A > B, then the agent must prefer the lottery with the higher chance of A
  • Decomposability:  Compound lotteries can be reduced to simpler ones using the laws of probability
The utility principle follows from these notions about preference.  Basically, the reason A > B is that the utility U(A) > U(B).  The Maximum Expected Utility principle simply sums these utilities weighted by their probabilities.  The book also covers the utility of money, another bit of material covered last week in lecture.

The reading from Sutton and Barto was brutal.  There is a lot of new vocabulary, and it seems to hint that the content of these chapters (3 and 4) sets the foundation for much of the learning material that follows.  Some of the material is pretty intuitive: an agent interacts with an environment, chooses available actions based on a policy which causes a transition to a new state and a consequent reward. Episodic tasks are those with natural breaks in the interaction (such as  solving a maze... when you win, you're done), whereas continuing tasks go on forever.  The notion of a return is a discounted reward stream (not unlike discounted cash flows from my finance days) that the agent tries to maximize.

The state signal satisfies the Markov Property if it "summarizes the past without degrading the ability to predict the future". They use the example of a checkers board.  The only state signal one needs to know which actions to take is the current state of the board, because how that state came to be (i.e. the past) has no effect on future outcomes.

Five minutes into the first lecture for this unit, and my suspicion was confirmed: the correct reading from AI: A Modern Approach was, in fact, Chapter 17 *facepalm*.  I'll be reading it soon enough I guess.  The lectures in this module really clarified the material.  Basically, MDPs are non-deterministic search problems.  The gridworld examples really helped illustrate this (again, probably would have helped if the right chapter was assigned...).  Compared to Pacman, where selecting action East was guaranteed to move me East, in gridworld there is an 80% chance I move East... or I might end up moving North or South instead.  So every action has this element of chance.
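
To make the MDP machinery concrete, here's a rough sketch of value iteration over a toy MDP representation.  The transition/reward dictionaries and the actions() helper are a format I made up for illustration, not the gridworld code:

    def value_iteration(states, actions, T, R, gamma=0.9, iterations=100):
        """Value iteration: repeatedly apply the Bellman update
            V(s) <- max_a sum_s' T(s,a,s') * [R(s,a,s') + gamma * V(s')]
        T[(s, a)] is a list of (s_next, probability) pairs and R is keyed by
        (s, a, s_next); both are hypothetical stand-ins for the real model."""
        V = {s: 0.0 for s in states}
        for _ in range(iterations):
            new_V = {}
            for s in states:
                q_values = []
                for a in actions(s):
                    q = sum(prob * (R[(s, a, s2)] + gamma * V[s2])
                            for (s2, prob) in T[(s, a)])
                    q_values.append(q)
                new_V[s] = max(q_values) if q_values else 0.0  # terminal states keep 0
            V = new_V
        return V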

The homework was pretty difficult.  Two of the problems use a three-node graph with clockwise and counterclockwise state transition possibilities... I had to create a spreadsheet for that one, since everything was randomized and thus I couldn't reuse and tweak hand-written calculations. That was frustrating.



Week 6 [July 30]


Since the optional reading from AI:MA for this week actually covers the week 5 material, I decided I'd be proactive and figure out which chapter would actually support the material for this week.  In the 2nd edition, this is chapter 21.  I decided I actually have an easier time reading the book after having seen the lecture (rather than the other way around).  I figure it's easy enough to rewatch the videos if need be.

This week introduces Reinforcement Learning.  The big change going from an MDP to an RL problem is that while in an MDP we know the likelihood of an action leading to a specific state (the transition function) and we know what reward we'll get for reaching the new state (the reward function), in an RL problem these are unknown to us, and the only way we can determine T(s, a, s') and R(s, a, s') is by interacting with the environment (exploration).

  • Passive Reinforcement Learning is similar to policy evaluation.  An agent takes a policy as given and runs trials to determine the expected utility of the non-terminal states
  • Direct utility estimation determines the value of a given state by examining the outcomes of every trial that starts in that state.  The big weakness of this approach is that it ignores the connections between states (so two states that have to pass through a common state to reach any outcome may have different values just based on luck)
  • In Temporal Difference learning we accumulate a running average of the rewards for each state.  The learning rate alpha is applied to the new information, while the old value is weighted as (1 - alpha) (see the sketch after this list).  Temporal difference learning is the focus of the Sutton and Barto reading (Chapter 6).
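
Here's the TD update from that last bullet as a tiny sketch (just the running-average idea, not the project code):

    def td_update(V, s, reward, s_next, alpha=0.1, gamma=0.9):
        """Temporal-difference update for passive RL under a fixed policy:
        blend the old estimate with the new sample using learning rate alpha."""
        sample = reward + gamma * V.get(s_next, 0.0)
        V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * sample
        return V[s]
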
In Active RL, since we want to maximize the total reward for all trials in the environment, we need to find a balance between exploration, i.e. trying new things, and exploitation, i.e. taking the actions we already know to be good.  This is accomplished through the use of an exploration function, which effectively makes less-explored areas of the state space more attractive.  The book describes the exploration function as a trade-off between greed and curiosity.

While TD can give us the running average value of a state, we can't do the necessary expectimax look-ahead to derive a policy from those values because we don't have the transition function.  Q-Learning solves this problem by learning the value of taking an action from a given state, Q(s, a).  Once we know the Q-values, we simply pick the best one from any given state.
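
The Q-learning update looks almost identical to the TD update, except the sample backs up the max over the next state's Q-values (again just a sketch; the legal_actions helper is assumed):

    def q_update(Q, s, a, reward, s_next, legal_actions, alpha=0.1, gamma=0.9):
        """Q-learning update:
            Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * [r + gamma * max_a' Q(s',a')]
        Q is a dict keyed by (state, action); legal_actions(s') is a hypothetical helper."""
        next_value = max((Q.get((s_next, a2), 0.0) for a2 in legal_actions(s_next)), default=0.0)
        sample = reward + gamma * next_value
        Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * sample
        return Q[(s, a)]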

Project 3 tasks us with implementing value iteration, Q-learning, and approximate Q-learning, along with a few simple "analysis" questions.  I was a bit intimidated by this one at first, but I found it helped to go back and walk through the algorithms and write them down as comments in the code (though I found that including the actual characters for big sigma and little gamma in the value iteration formulas made Python very upset...).  Once I had a clear idea of how the implementation should work, writing the code was much easier.  Code on Github.



Week 7 [August 2]


This week is relatively light, just two lectures and a few sections of reading.  The topic this week is Probability, which will start to lay the groundwork for Bayes Nets.  In addition to the Berkeley videos, the MIT Artificial Intelligence class has two lectures on Probabilistic Inference (lecture 21 and lecture 22) that also support the material and are definitely worth a look.  I think the important takeaways for week 7 are the probability rules and formulas that will eventually allow us to do some interesting things when we stitch probability tables together in Bayes Nets:

  • Conditional Probability: The probability of a given b is the joint probability of a and b divided by the probability of b:  P(a | b) = P(a, b) / P(b)
  • The Product Rule:  The joint probability of x and y is the probability of y times the probability of x given y:  P(x, y) = P(x | y) P(y)
  • Bayes' Rule: The conditional probability of x given y is the probability of y given x, divided by the probability of y, times the probability of x:  P(x | y) = P(y | x) P(x) / P(y)
  • Independence: x and y are independent if for all x and y the joint probability is equal to the probability of x times the probability of y:  P(x, y) = P(x) P(y)

    alternatively written as the probability of x given y is equal to the probability of x:  P(x | y) = P(x)
  • Conditional Independence: x is conditionally independent of y given z if for all x, y, and z the probability of x and y given z is equal to the probability of x given z times the probability of y given z:  P(x, y | z) = P(x | z) P(y | z)

    or it can be stated as the probability of x given y and z is equal to the probability of x given z:  P(x | y, z) = P(x | z)
  • Chain Rule: To calculate the joint probability over a set of variables, we can use the Product Rule to construct a series of multiplications:  P(x1, x2, ..., xn) = P(x1) P(x2 | x1) P(x3 | x1, x2) ...
A Bayes Net is essentially a way of arranging a series of conditional probability tables so that useful assumptions about variable independence and conditional independence let us save considerable space in representing the interactions of the random variables.  Because joint probabilities can be reconstructed from a series of conditional probabilities, we don't need to store a potentially gigantic joint probability table.
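
As a toy example of that chain-rule idea, here's a tiny rain/traffic/late style chain with made-up numbers; a joint entry is just the product of each variable's CPT entry given its parents, so the full joint table never needs to be stored:

    # Toy Bayes net: R -> T -> L (rain causes traffic, traffic causes being late).
    # The numbers are invented; the point is that a joint entry is the product of
    # the per-variable conditional tables (chain rule + conditional independence).
    P_R = {True: 0.1, False: 0.9}
    P_T_given_R = {(True, True): 0.8, (False, True): 0.2,    # keyed by (t, r)
                   (True, False): 0.1, (False, False): 0.9}
    P_L_given_T = {(True, True): 0.3, (False, True): 0.7,    # keyed by (l, t)
                   (True, False): 0.1, (False, False): 0.9}

    def joint(r, t, l):
        """P(r, t, l) = P(r) * P(t | r) * P(l | t)."""
        return P_R[r] * P_T_given_R[(t, r)] * P_L_given_T[(l, t)]

    print(joint(True, True, True))   # 0.1 * 0.8 * 0.3 = 0.024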

Chapter 13 in the book covers all the concepts from the lectures, including the above formulas, in all their gory detail.

One last note: I had some trouble with the videos not playing sometimes, but I was able to play the offending lectures via the Berkeley ai homepage.



Week 8 [August 3]


Like week 7, this week's material is not too bad, just the one full-blown lecture and a chapter from the book.  Like last week I had to go to the Berkeley ai homepage to watch the lecture.  The conditional probability and independence stuff I accidentally lumped into last week (I pretty much watched them all together to do the homework).  The Bayes Net tool he uses in the demo can be downloaded from aispace.org, once you've added the site to the Site Exception list... Java security, I swear...


This week goes over D-Separation, which gives us some simple rules for determining conditional independence in a Bayes net (I tried to use the "double up tack" symbol used in the slides but alas, Ubuntu seems to choke on it, so I improvised):
  • Causal Chain:  X → Y → Z
    • X ⊥ Z?  No guarantee
    • X ⊥ Z | Y? Yes
  • Common Cause: X ←  Y →  Z   ( "∧" structure)
    • X ⊥ Z?  No guarantee
    • X ⊥ Z | Y? Yes
  • Common Effect: X →  Y ←  Z   ( "V" structure)
    • X ⊥ Z? Yes
    • X ⊥ Z | Y? No guarantee
    • Also no guarantee if given any descendant of Y
    • This structure behaves differently because when an effect is explained, the parents are coupled by "Causal Competition".
When a path is guaranteed to be conditionally independent, it is said to be "inactive", whereas the opposite case leads to triples that are "active".  If there is any path made up of active triples between two nodes, then they are not independent.  If every path is inactive, then they are independent. (The last part of the homework covers this in detail).

The homework was a little bit of a pain, but Google Sheets made keeping track of all the probability tables and calculating the necessary conditional probability tables much more manageable.

I finally broke down and binge watched the rest of the MIT 6.034 lectures.  I might revisit some of the neural net and SVM stuff if it seems relevant later on, but for the most part it seems to diverge a bit from the direction of the Berkeley material.



Week 9 [August 4]


This week covers the subject of doing probabilistic inference with Bayes nets.  Inference is about querying the Bayes net, which will give us information like the probability of some variable Q given some collection of evidence.  It can also give us the most likely value for some query variables given some evidence (think of medical diagnosis).  Several ways to do this:

Inference by Enumeration:  calculate and combine all entries in the Bayes net that are consistent with our query.  The problem is that this approach is exponential in the number of hidden (non-query, non-evidence) variables.  We basically end up building most of the full joint distribution. Variable Elimination and Sampling are ways of avoiding this.

Variable Elimination:  Calculate and "sum out" variables in a smart way to avoid creating huge joint distributions.  Uses properties of a number of "factors":
    • Joint distribution P(X, Y):  
      • Entries P(x, y) for all x, y
      • Sums to 1
    • Selected joint  P(x, Y):
      • Entries P(x, y) for fixed x, all y
      • Sums to P(x)
    • Single conditionals P(Y | x):
      • Entries P(y | x) for fixed x, all y
      • Sums to 1
    • Family of conditionals P(X | Y):
      • Entries P(x | y) for all x, y
      • Sums to |Y|
    • Specified family P(y | X):
      • Entries P(y | x) for fixed y, all x
      • Sums to... ??
The initial factors are the conditional probability tables in the Bayes net.  Any evidence we have (observed variables) is then selected from the tables, effectively "shrinking" them.  We can then "join" these tables in a way similar to a database join.  To do a join, all the factors involving the joining variable are combined to create a new factor.  This union is done via a pointwise product: joining f1(X, Y) and f2(Y, Z) produces f3(X, Y, Z), where f3(x, y, z) = f1(x, y) · f2(y, z).





The join doesn't have to be limited to two variables, as above.   The second operation is called marginalization.  This is the part where we "sum out" the variables we don't care about (similar to how marginal probabilities are calculated).  By interleaving join and marginalization operations, we can (sometimes) keep the intermediate tables from growing too large.  It kind of reminds me of matrix multiplication optimization.

With evidence, variable elimination operates pretty much the same, we just reduce the initial tables by fixing certain values.  We may need to normalize the resultant distribution.
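
Here's how I think of join and marginalization in code, with a factor represented as a dict from assignment tuples to numbers plus a list of variable names (a toy representation for illustration, not the project's):

    def join(f1, vars1, f2, vars2):
        """Pointwise product of two factors.  Each factor is a dict mapping a tuple
        of values (ordered according to its variable list) to a number."""
        shared = [v for v in vars1 if v in vars2]
        out_vars = vars1 + [v for v in vars2 if v not in vars1]
        out = {}
        for a1, p1 in f1.items():
            for a2, p2 in f2.items():
                row1 = dict(zip(vars1, a1))
                row2 = dict(zip(vars2, a2))
                if all(row1[v] == row2[v] for v in shared):  # rows must agree on shared vars
                    merged = {**row1, **row2}
                    out[tuple(merged[v] for v in out_vars)] = p1 * p2
        return out, out_vars

    def sum_out(factor, variables, var):
        """Marginalize ("sum out") one variable from a factor."""
        keep = [v for v in variables if v != var]
        idx = [variables.index(v) for v in keep]
        out = {}
        for assignment, p in factor.items():
            key = tuple(assignment[i] for i in idx)
            out[key] = out.get(key, 0.0) + p
        return out, keep

For example, joining a factor over (R) with a factor over (R, T) and then summing out R leaves a factor over T, which is exactly the interleaving the lecture describes.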

Ordering is very important for variable elimination to realize substantial savings.  Unfortunately, because a Bayes net can be used to represent a 3SAT problem, being able to efficiently answer arbitrary queries would allow us to efficiently solve 3SAT, and we'd have thus proven P = NP and could collect our million dollar award (not likely). Bummer...  But in many cases, we can make smart choices that save us a lot of computation.

Sampling:

Second lecture this week covers sampling.  The basic idea is that we draw a bunch of samples, compute the approximate probabilities, and show that the approximation converges to the true probabilities.  Several ways to calculate samples:
  • Prior Sampling:  Start from the top of the graph, and randomly pick from each conditional distribution as parents are sampled.
  • Rejection Sampling: Throw out samples that are not consistent with a piece of evidence (based on a query we know we want an answer for).  One drawback is that it can reject a lot of samples for unlikely evidence
  • Likelihood Weighting: Instead of throwing out inconsistent samples, fix the evidence variables and multiply the sample by a weight based on the probability of the evidence given its parents.  Non-evidence (sampled) variables don't affect the weight (rough sketch after this list).
  • Gibbs sampling: Start with an arbitrary instantiation consistent with the evidence, then repeatedly resample one variable at a time conditioned on all the rest.  If you repeat an infinite number of times you are guaranteed to get a sample from the correct distribution (in practice you don't go on for infinity, duh)
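
And a rough sketch of likelihood weighting from the list above.  The parents and cpt lookups are invented for illustration:

    import random

    def likelihood_weighting_sample(topo_order, parents, cpt, evidence):
        """Generate one weighted sample.  Evidence variables are fixed and the weight
        is multiplied by P(value | parents); everything else is sampled from its
        conditional distribution.  `parents[var]` lists var's parents and
        cpt(var, parent_values) -> {value: probability} (hypothetical lookups)."""
        sample, weight = {}, 1.0
        for var in topo_order:                       # parents come before children
            parent_vals = tuple(sample[p] for p in parents[var])
            dist = cpt(var, parent_vals)
            if var in evidence:
                sample[var] = evidence[var]
                weight *= dist[evidence[var]]        # evidence contributes to the weight
            else:
                values, probs = zip(*dist.items())
                sample[var] = random.choices(values, weights=probs)[0]
        return sample, weight
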
The homework for this week was another major pain in the ass.  Not because the concepts are hard, there is just a lot of annoying bookkeeping that is really easy to get wrong.  It was especially troublesome because all of the questions are randomized.  If I'd have tried to do these questions on paper I'd have pulled my hair out lol.  Google Sheets to the rescue again.



Week 10 [August 7]


So week 10 is all about the Value of Perfect Information (VPI), Hidden Markov Models (HMM), and Particle filtering.

VPI comes into play in a decision network.  In a decision network, we have an action (take umbrella, leave it at home) and a Bayes' net representing the probability of some uncertain events that affect our utility function.  We are trying to maximize our expected utility (MEU) based on the actions we can take.  Now suppose there is another node in our BN which, if known, would allow us to increase the overall MEU of our decision.  The difference between the previous MEU and the MEU given the evidence is the value of that evidence.

HMMs and Particle filtering are methods for "Reasoning over time".  HMMs are a "temporal probabilistic model" with one discrete random variable (multiple variables can be combined into one "megavariable", as we see in questions 6-7 of project 4).  The way Klein describes the HMM is that it's a model for the world where we have a Markov state for each time step, we know how the world transitions from each time step to the next, and we have some kind of evidence variable that can inform our beliefs about the random (hidden) variable.  In the simplest model, the state for each time step depends only on the state at t-1 (transitions) and the evidence for that t (also called emissions).

Inference is based on two cases: first, we want to know the state of our model given the passage of time, and second, we want to know the state of our model given some new piece of evidence.  The first case looks at our distribution at time t-1 and computes a belief state based on the likelihood each state would be reached from the previous time step.  This is basically like the case where we knew where the ghost was in Pacman at time 1, and we have a noise model for how the ghost moves, so as time passes we can use this knowledge of how ghosts move to predict its location at each time step.  However, since each time step introduces more uncertainty, it tends to "smooth out" our belief state until we have no clue (uniform distribution).

The second case looks at a new piece of evidence and does a point-wise product of the evidence distribution and the prior distribution.  We see this in action in Pacman when we take a sensor reading and the belief model is immediately sharpened.
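
Putting those two steps into code helped them click for me: an elapse-time step that pushes the belief through the transition model, and an observe step that reweights by the evidence and normalizes.  Beliefs are plain dicts here, and the transition/emission models are made-up stand-ins, not the project's API:

    def elapse_time(belief, transition):
        """Prediction step: B'(s') = sum_s P(s' | s) * B(s).
        `transition[s]` is a dict {s_next: probability} (hypothetical model)."""
        new_belief = {}
        for s, prob in belief.items():
            for s_next, t_prob in transition[s].items():
                new_belief[s_next] = new_belief.get(s_next, 0.0) + prob * t_prob
        return new_belief

    def observe(belief, evidence, emission):
        """Update step: B(s) proportional to P(evidence | s) * B'(s), then normalize.
        `emission[s]` is a dict {observation: probability} (hypothetical model)."""
        weighted = {s: prob * emission[s].get(evidence, 0.0) for s, prob in belief.items()}
        total = sum(weighted.values())
        if total == 0:
            return belief  # evidence impossible under the model; keep the old belief
        return {s: p / total for s, p in weighted.items()}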

Particle filtering is an approximation method for calculating inference queries.  Conceptually the same steps take place:  from one time step to the next, a transition function is applied to smooth out the probabilities, and providing evidence will tend to sharpen beliefs.  The devil is in the details.  Rather than doing exact calculation using probability distributions from the Bayes net, a cloud of "particles" is sampled, weighted according to likelihood, and then resampled.  As evidence is accumulated, states that support that evidence will tend to be sampled more frequently.  Unlike in exact inference where the passage of time smoothed everything out, in Project 4 the probabilities of all the particles sometimes dropped to zero and had to be resampled uniformly, but they tended to cluster again quite quickly.
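
And the particle-filter version of the same two steps, plus the resampling that pulls the cloud back toward the weights (a sketch with invented sample_transition/emission_prob helpers, not my project code):

    import random

    def particle_filter_step(particles, evidence, sample_transition, emission_prob):
        """One step of particle filtering:
        1. move each particle by sampling from the transition model,
        2. weight each particle by the likelihood of the evidence,
        3. resample a new cloud in proportion to those weights.
        sample_transition(p) and emission_prob(evidence, p) are hypothetical helpers."""
        moved = [sample_transition(p) for p in particles]          # elapse time
        weights = [emission_prob(evidence, p) for p in moved]      # weight by evidence
        if sum(weights) == 0:
            # All particles inconsistent with the evidence; in the project this is
            # where you reinitialize uniformly over the legal positions.
            return moved
        return random.choices(moved, weights=weights, k=len(moved))  # resample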

Just a disclaimer... my understanding of particle filtering and HMMs generally is probably still a bit fuzzy, so take my analysis with a grain of salt here lol.

Speaking of projects...

The homework this week wasn't too bad, mostly an exercise in bookkeeping, but holy fuck Project 4 was harder than hell!  Specifically, the last two questions, which have us implementing particle filtering over tuples of ghost positions.  I'm sure my solution wouldn't scale well to more than 2 ghosts (it calculates permutations of legal positions for each ghost, so I get something like the number of legal positions raised to the power of the number of ghosts... yikes!).  The main problem with the last two questions isn't even really the problem itself, it's the fact that it isn't well specified, particularly around what a "belief state" or a "uniform distribution" meant in this context. As always my code is on Github.

Those ghosts are in the ends, I just know it!!




Week 11 [August 9]


I blitzed through the last few weeks of material pretty quickly, mostly owing to the fact that there was no new reading to do (that's usually one of the biggest time sucks).  Hopefully Berkeley will get the autograder back up and running on edX in time for me to submit my last project (if not I'll just include a screen grab of the local autograder).... [Update 8/10] edX Autograder back up, whoohoo!

After recapping particle filtering, the lecture turns to Dynamic Bayes' Nets, a segment that would have been super helpful when I was working on Project 4 (just fyi...).  Still basically just more review of particle filtering in another context though. Klein then does a couple demos of robot localization that use particle filtering.

Next the lecture talks about "Most likely explanation" queries and the Viterbi algorithm.  What the "most likely explanation" query is trying to determine is the series of states with the highest probability given some evidence.  Basically, we are looking for the most likely path through the states of an HMM, taking a max over the probabilities instead of a sum, since we are asking "What is the most likely thing that happened" rather than "What is the likelihood of things happening".

The speech recognition section is very cool, but I doubt I can summarize it very well.  Basically, a microphone captures an acoustic waveform, a Fourier transform is applied to this signal, which identifies active frequencies, and then these active frequencies are used as features to determine which phone is being heard (HMM does this bit).    Klein demonstrates this with vowel sounds using the Formant Synthesizer Demo.

The synthesizer demo

Transition model stitches phones together into words, then a language model ties the words together.  All of this is based on Dynamic Bayes Nets and Viterbi's algorithm (most likely phones to form words, most likely words to form sentences).

Naive Bayes' is the next piece, and it is discussed in the context of classifiers (e.g. email spam filter). This approach basically takes a series of "features", and treats them as though they are all independent (thus the "naive" bit).  To classify an email as spam or ham, we take the likelihood of each word given the classification (i.e. P(w|spam) and P(w|ham)), multiply them together (or sum the logs), and take the max.  One problem with this approach is that if any word has a probability of zero for a label, the probability for the whole label goes to zero.

This is where smoothing comes into play.  Laplace smoothing is an approach where we add a dummy count to each value when we calculate the probability of a variable appearing.  For example, if we have two values, A and B, and our samples are {A, A, B}, we would say that P(A) is 2/3 and P(B) is 1/3 without smoothing.  With k = 1 smoothing, it is like we have the samples {A, A, B, A, B}, so P(A) is now 0.6 and P(B) is 0.4.  As k increases, the probabilities approach the uniform distribution.  It's worth noting that while all the examples use whole-number k, there is really no reason you couldn't use any real number and plug it into the formula: P(x) = (count(x) + k) / (N + k|X|), where N is the number of samples and |X| is the number of possible values.
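
A quick sketch of that formula in code, using the {A, A, B} example from above:

    def laplace_estimate(counts, k=1.0):
        """Laplace-smoothed estimate: P(x) = (count(x) + k) / (N + k * |X|).
        `counts` is a dict from value to observed count; k can be any positive real."""
        n = sum(counts.values())
        num_values = len(counts)
        return {x: (c + k) / (n + k * num_values) for x, c in counts.items()}

    print(laplace_estimate({'A': 2, 'B': 1}, k=0))    # {'A': 0.666..., 'B': 0.333...}
    print(laplace_estimate({'A': 2, 'B': 1}, k=1))    # {'A': 0.6, 'B': 0.4}
    print(laplace_estimate({'A': 2, 'B': 1}, k=100))  # close to uniform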


Homework 9 (next week) has us doing Laplace smoothing.  The "Step-by-step" video is a lifesaver for the homework, because we have to do Laplace on conditional distributions.



Week 12 [August 9]



Start out with a quick recap of Naive Bayes' and smoothing.  Klein emphasizes (last week too) the danger of overfitting a model to the training set (reviews Laplace smoothing, and notes that more sophisticated smoothing methods are covered in 281a - Statistical Learning Theory or 288 - Statistical NLP). "Tuning" leverages our "held out" data to determine the best k to use, as a simulation of our test data.

So NB is a "model-driven" algorithm, where we build up these probabilities and use inference to make a prediction.  Perceptrons, on the other hand, are an "error-driven" approach, where we directly tune the weights on a feature vector and try to minimize errors.  This feature vector may be a sparse vector.  When we go to do the actual classification, we take a dot product between the sample and the weights for each classification.  In binary perceptrons, if the dot product is positive, it is considered a match (the opposite being true for a negative dot product).  This is basically measuring the angle between the vectors in a high-dimensional space (as long as the weights are normalized).  In multiclass perceptrons, we take the maximum over these activations, and that is our classification.

Updating weights is very simple:

  • initialize weights to 0
  • loop through examples:
    • try to classify sample
    • if classification is wrong, subtract the sample from w' (weights for the classification we guessed) and add the sample to w* (the weights for the correct classification)
    • if classification is right, do nothing
If the sample set is separable, the update algorithm will finish.  Otherwise it will just keep thrashing as it gets samples wrong on each loop.  One hack is to eventually stop and take an average perceptron.  
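
Here's the multiclass update described above as a sketch, with feature vectors and per-class weights as plain dicts (the project uses Counters; this is a simplified from-memory version, not the reference code):

    def dot(w, f):
        """Dot product of two sparse vectors represented as dicts."""
        return sum(w.get(k, 0.0) * v for k, v in f.items())

    def perceptron_train(weights, labeled_data, iterations=5):
        """Multiclass perceptron: guess with the highest-scoring weight vector; on a
        mistake, subtract the features from the guessed class's weights and add them
        to the correct class's weights.  `weights` maps each label to a (possibly
        empty) dict of feature weights."""
        for _ in range(iterations):
            for features, label in labeled_data:
                guess = max(weights, key=lambda y: dot(weights[y], features))
                if guess != label:
                    for k, v in features.items():
                        weights[guess][k] = weights[guess].get(k, 0.0) - v
                        weights[label][k] = weights[label].get(k, 0.0) + v
        return weights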

One way to improve the updates is to use an algorithm called MIRA.  Instead of adding or subtracting the entire mistaken feature vector as given, scale the feature vector by a factor called tau.  The aim is to make the update as small as possible while still making this sample "right".  We can also "cap" tau to ensure that updates never get too big.  Here is the formula for tau (you'll need it for project 5): tau = min(C, ((w_y' - w_y) · f + 1) / (2 (f · f))), where y is the correct label, y' is the label we mistakenly guessed, f is the feature vector, and C is the cap.




Touch on Support Vector Machines (SVM) - more thorough coverage in MIT lecture 16 and mega recitation 5.  One interesting insight was the comparison of MIRA and SVM.  Where MIRA is minimizing the squared error in the weight vectors, SVMs are minimizing the weight vector itself, which ends up maximizing the "margin" around the separator.

Apprenticeship is an interesting approach.  Basically, an "expert" performs a task, and the actions the expert takes for any given set of features become the "labels".  The apprentice then tries to learn which actions to take based on these features and labels.  Here the apprentice is not learning to be optimal, it is just learning to be the expert (the Pacman demo at the very end of Lecture 21 is pretty cool)

Also look at k-Nearest Neighbors (MIT looks at nearest neighbors in lecture 10).  kNN is a "non-parametric" algorithm.  In perceptrons, there is a fixed set of weights that defines the decision boundary.  In kNN, the decision boundary is defined by the data, so getting a good decision boundary is a function of getting more data. In simple NN (k=1), a sample is classified based on the closest example.  As k increases, the k closest examples are compared to the sample and they "vote".  The similarity function is crucial to making this work, and is usually based on a dot product.  Feature selection is often the most important step.

The "dual perceptron" approach maintains a list of "alpha"s for each training example, where alpha is the weight applied to that datum with respect to the weight vector (one for each classification).  Representing the perceptron this way allows us to do some math and separate the similarity function (kernel function).  Now instead of being tied to a dot product, we can use any function to determine similarity. I suspect this is related to the way SVM kernel function work, but I'm not entirely sure...

Finally, we look at "clustering" algorithms, which try to determine meaningful groups of samples without label data.  K-means does this by picking k random centers, assigning all the data points to the nearest center, recalculating the center for each cloud, reassigning the data to the nearest center, and repeating until no center assignments change.  K-means is not deterministic, and is very sensitive to the initial center assignment.  It is essentially an optimization problem, aiming to minimize the sum of the distances of points to their means (somewhat like hill climbing... complete with local optima).
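
Here's roughly what k-means looks like in code, for points given as tuples of numbers (a sketch, not tied to any assignment):

    import math
    import random

    def kmeans(points, k, iterations=100):
        """Lloyd's algorithm: assign each point to its nearest center, then move
        each center to the mean of its assigned points; repeat until stable."""
        centers = random.sample(points, k)  # results are sensitive to this initial choice
        for _ in range(iterations):
            clusters = [[] for _ in range(k)]
            for p in points:
                nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
                clusters[nearest].append(p)
            new_centers = [
                tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centers[i]
                for i, cluster in enumerate(clusters)
            ]
            if new_centers == centers:
                break  # no assignments changed, so we've converged (possibly to a local optimum)
            centers = new_centers
        return centers, clusters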

Agglomerative clustering builds up clusters from the bottom up, grouping the closest points and groups until everything is grouped.  Produces what is called a "dendrogram".  Measuring the similarity, or "closeness", between two clusters can be done different ways:  closest pair, farthest pair, average of all pairs, and Ward's method.


Week 13 [August 9]


First lecture this week starts with a discussion of "inductive learning", the trade off we make between "bias" and "variance".  Bias is the result of "underfitting", i.e. we haven't learned enough about our "hypothesis space" to make good predictions, but the trade off we get is that as bias increases our predictions become more consistent (better generalization).  Variance is the result of overfitting.  Our model knows too much about the training data and doesn't generalize well.  But we have to know something about our data.  So it's easy to see how these two measures are at cross purposes and need to be balanced.  Smoothing and limiting training are ways this balance can be achieved in NB and Perceptron.

Decision trees are another approach, using a series of rules (features) to classify a sample.  Decision trees that end in a true/false decision are called "truth tables", but they can also end with probabilities.  MIT lecture 11 discusses what Prof Winston calls "Identification Trees", but it's basically the same thing.  Decision trees can learn any function of the features, but they can get super large (no benefit over a huge table) and are very prone to overfitting. A decision "stump" is a single-feature decision tree, and was used in the boosting lecture from MIT.  Decision trees are also used in an approach called "random forest", which isn't discussed by MIT or Berkeley, but the UBC Machine Learning lecture series does explore it.  This answer on stack exchange offers an interesting analysis of the difference between boosting and random forests.  But I digress...

Learning in a decision tree is guided by Information Gain and Entropy.  Entropy is the measure of how uniform a distribution is.  A perfectly uniform distribution (all values equally likely) is going to have a maximum entropy, while a distribution where most or all variables are a single value (spiked) will have near 0 entropy.  Information gain is the difference between the entropy of a distribution before making a split and after the split.  To avoid overtraining in a decision tree, we compare the information gain from doing a split to the information gain from a random split.  If the random split does about as well, we don't do the split.  Basically we test for statistical significance.  Another approach is pruning, where we start from the bottom of the tree and eliminate splits that are not significant.
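
To make the entropy / information gain idea concrete, here's a small counts-based sketch using base-2 logs (not tied to any particular assignment):

    import math

    def entropy(counts):
        """Entropy of a distribution given as counts of each class label."""
        total = sum(counts)
        return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

    def information_gain(parent_counts, child_counts_list):
        """Entropy of the parent node minus the size-weighted entropy of the children."""
        total = sum(parent_counts)
        remainder = sum(sum(child) / total * entropy(child) for child in child_counts_list)
        return entropy(parent_counts) - remainder

    # A 50/50 node split into two pure children gives the maximum possible gain:
    print(entropy([5, 5]))                               # 1.0
    print(information_gain([5, 5], [[5, 0], [0, 5]]))    # 1.0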

It's not mentioned on edX, but the chapter from the book with the Decision Tree material is Chapter 18 - Learning from Observations.  Chapter 20 covers Neural Networks.

Neural Nets are next.  MIT did a two parter on Neural Networks, lecture 12a, and lecture 12b, as well as mega recitation 4 in particular, which collectively go into a great deal of depth.  Basically a neural net is a series of perceptrons glued together.  Instead of a true/false activation function, they use the sigmoid function to output a range from 0 to 1, which allows them to be trained using a hill climbing algorithm.  The MIT lectures (as well as the Stanford Machine Learning course) go into depth on how to do all the partial derivatives to train the neural net, whereas Berkeley keeps it more high level for this course.

The last lecture this week is advanced applications.  Abbeel discusses Search (web search), Natural Language Processing (machine translation), Games (Overmind agent for Starcraft),  Reinforcement Learning (walking robots), tracking (self driving cars, DARPA grand challenge).  Definitely worth watching.



Week 14 [August 11]


Short week for the last week.  More advanced applications:  Object detection in computer vision (Histogram of Gradients or HOG, and SVM), Deep learning (Google discovered the concept of a cat just using lots of pictures with no labels), and Apprenticeship (Abbeel's robotic helicopters).  Also looked at quadruped robots planning movement across difficult terrain, laundry folding robot, and the DARPA robotics challenge (Petman).

Project 5 has us doing classification of handwriting samples using perceptron and MIRA, and apprenticeship in Pacman. Getting full credit on the feature extraction bit was a pain, but overall it wasn't bad.  And this is why you never say anything about how hard something is until you actually do it.  Holy cow... I thrashed for the better part of two or three days trying to get all the features just right on the digits, and a solid day on Pacman.  I ended up peeking at two repos on Github (here and here) to give me an idea of what I needed to be doing differently on the digit classifier.  I basically had all the pieces, I was mostly just overdoing it and overfitting.  Pacman was frustrating because it wasn't really clear what kinds of features would make sense (the obvious ones didn't seem to quite get me there right away).  A bit of tweaking here and there, though, and I finally got the contest and suicide Pacmans doing their thing (my suicide pacman was especially good... how depressing is that lol).  My Code on Github.

And for all my troubles: 40%.  Sigh, sadly the midterms and final aren't available on edX, so the best I can do is 100% on the projects and homework, which make up 40% of the grade.  Oh well...


