Assignments (2009)
From LS2
Contents |
Assignment 1
- Out by: T 9-8
- Due: T 9-15
1. A dynamic programming problem. Consider a telephone keypad with digits 1-9 laid out in the usual way (1 2 3 / 4 5 6 / 7 8 9), and imagine placing a chess piece on the "1" key. The chess piece can move around the keypad according to the following rules:
- A rook can move horizontally or vertically one or more steps. (E.g., from 4 it can move to 1, 5, 6, or 7; from 5 it can move to 2, 4, 6, or 8, etc.)
- A bishop can move diagonally one or more steps. (E.g., from 4 it can move to 2 or 8; from 5 it can move to 1, 3, 7, or 9.)
- A knight moves exactly one horizontal or vertical step plus one diagonal step. (E.g., from 4 it can move to 3 or 9; from 5 it can't move anywhere.)
- A queen's set of valid moves is the union of those of the rook and the bishop. (E.g., from 4 it can move to 1, 2, 5, 6, 7, or 8; from 5 it can move to 1, 2, 3, 4, 6, 7, 8, or 9.)
- A king can move a single step horizontally, vertically, or diagonally. (E.g., from 4 it can move to 1, 2, 5, 7, or 8; from 5 it can move to 1, 2, 3, 4, 6, 7, 8, or 9.)
Consider moving the piece around the keypad according to its allowed moves, pressing numbers to produce digit sequences. For each chess piece (rook, bishop, knight, queen, king - don't worry about pawns), give recursive equations to count the number of unique length-9 digit sequences (possibly with repetitions of digits) that can be produced, always starting from the "1" key. That is, give a separate set of equations for each of the five pieces.
2. Design a semiring to output the full set of unique digit sequences for each piece, instead of simply counting them. Hint: use a string concatenation operator or a set of ordered pairs to represent a digit sequence (e.g.,
denotes "1, 4, 5"). Bonus: prove that it's a semiring.
3. Give a generalized Dyna program for counting that works for any chess piece (not the pawn), any starting key, and any number of steps. These three should be specified as axioms. Try to make your program as succinct as possible. Bonus: Let the dimension of the grid (assumed square) be specified by an axiom, as well. Assume a similar numbering convention (top row is 1 to N, next is N+1 to 2N, etc.).
4. Optimization. Consider the function
, defined by f(x) = (x − 1)2 + e0.3x − cos(5x + 6). This function is smooth, continuous, and differentiable. You are enouraged to plot it to get some idea of its properties. For each of two numerical optimization algorithms (your choice) and ten different starting points (your choice), solve for a local minimum of the function. Report precisely what you did, show the results in a table (columns should correspond to the initial value, and for each algorithm, the final value of x and f(x)), and explain the results. You are strongly encouraged to use an existing implementation of the optimization algorithm to save yourself time (Noah really likes the R language's support for optimization, and also the sorts of things you'll do in problems 5 and 6) - be sure to tell us what implementations you used!
5. Experiment. We are going to encapsulate the optimization algorithms above within a randomized algorithm that hides the initializer choice. The algorithm proceeds by first sampling the starting value from a probability distribution over the reals (we suggest a zero-mean Gaussian with a variance of your choice), then running the algorithm from that starting point. Clarification: Just one random starting point; you only initialize and run once - you don't initialize many times and somehow combine the output of many runs!
In order to compare the two algorithms from problem 4 (now wrapped in the above randomized algorithm), use (the same) 200 initializers generated from your random initialization distribution. Does one algorithm lead to better local minima than the other? Does one lead to faster convergence? Test significance for both questions, and explain clearly how you did it. Note: measure speed in the absolute number of function evaluations performed - not wall time (since it depends on external factors you may not be able to control for) or optimization algorithm iterations (since line searches may require more function evaluations that aren't counted directly that way).
6. Another experiment. Define another distribution over starting values. Generate 200 initial points from this distribution. Conduct an experiment that compares the performance of your first algorithm paired with the random initializer of problem 5 with the new random initializer. Does one random initializer lead to better local minima than the other? Does one lead to faster convergence for this function? Test significance for both questions, and explain clearly how you did it.
Assignment 2
- Out by: Th 9-17
- Due: T 9-29
- The first problem comes from the lecture on September 15. A bigram HMM defines a joint distribution over a label sequence
and an observation sequence
as
. Define an algorithm that runs in O( | Λ | 2(n + k)) time and generates k independent, random samples over the label variable Ci, drawn according to the distribution
, which can be derived from the HMM. (The inputs to the algorithm are the label set Λ, the HMM parameters
and
, the sequence
, and the position index i.) Addendum: It's actually possible to do it with O( | Λ | 2n + | Λ | k) runtime!
- Starting with the same HMM as problem 1, define an algorithm that runs in O( | Λ | 2nk) time and generates k independent, random samples over the label sequence variable
, drawn according to the distribution
. (The inputs to the algorithm are the label set Λ, the HMM parameters
and
, and the sequence
.)
- Suppose I have an HMM such that Λ = {B,I,O}. This model gives a joint distribution over word sequences bracketed with base noun phrases; this is also called "NP chunking." Given a structure
generated by this model, a base noun phrase corresponds to
whenever
(the first argument of the disjunction corresponds to a noun phrase followed by non-NP material, the second corresponds to a base NP that is followed by another base NP, and the third corresponds to a base NP at the end of the sentence). Define an algorithm that calculates the exact probability that subsequence
is a base noun phrase, given the sequence and the HMM. (The inputs to the algorithm are the HMM parameters
and
, i, j, and the sequence
.)
- Here is a new generative model over three sequences, D, which ranges over Δ + , C, which ranges over Λ + , and S, which ranges over Σ + . Δ, Λ, and Σ are finite, discrete sets. (In lecture, we considered two sequences, usually denoted C and S.) Here is the generative process:
-
;
- Generate the value of Di from the distribution
over the finite, discrete set Δ.
- Generate the value of Ci from the distribution
over the finite, discrete set Λ.
- Generate the value of Si from the distribution
over the finite, discrete set
.
- If si = stop, stop. Otherwise,
and go to step 2.
under this model. (The inputs to the algorithm are the sets Δ, Λ, and Σ the model parameters
,
, and
, and the sequence
.) Advice: draw a directed graphical model corresponding to this distribution (if you aren't familiar with graphical models, ignore this advice).
-
- Consider a log-linear model that defines p(X) where X ranges over the discrete, finite set
(superscripts are used here to denote types; they are not to be confused with subscripts, which are used in this class to index positions in a sequence or examples). In this log-linear model, there are v features, and for each
, the ith feature is defined as fi(x) = δ(x,xi). It should be clear that, for any x, the feature vector
contains a single 1 and the rest of the entries are 0. Given a histogram of empirical counts over
(such that ci is the count of xi), prove that one solution to this model's maximum likelihood estimate is to set
. (
is the empirical distribution, given by relative frequency estimates
.) Hint: imagine that
is the set of words and we're using the log-linear model to define a unigram distribution. You are proving that the closed-form maximum likelihood estimate for the classical unigram model gives a closed-form solution to the maximum likelihood estimate for the locally-normalized log-linear Markov model.
- You know that, in general, the maximum likelihood estimate of log-linear models doesn't have a closed-form solution. Explain why this special case of a log-linear model does have a closed form solution. Hint: think of some features to add in or remove that 'break' this property.
- Continuing problem 5, prove that the solution is not unique.
Assignment 3
- Out by: T 10-6
- Due: T 10-20
- Data: hw3-data-2009.tar.gz
- When submitting your report, please also submit (via email) your system's output on the test sets in the testdata subdirectory. Be sure to include in your report a description of the system (parameter settings, features, etc.) you used for these test sets.
For this assignment, you will implement and experiment with a statistical model for sequence labeling. We will provide you with training, development, and test data for two tasks and you will conduct experiments with your implementation to answer several questions. The problems are:
- Truecasing. Given lowercased text from English news articles, predict the true case pattern of each word. The problem we consider is actually simpler than the standard truecasing problem, since we simply ask you to label each word with one of the following four labels: AC (all caps), IC (initial cap), MIX (other mixed case pattern), and O (no change required).
- Accent Restoration. Given accent-stripped Spanish news commentary text, predict which characters should receive accent marks (and which kind for 'u').
For each problem, first implement a model that makes independent decisions for each symbol (a symbol is a word for the truecasing problem, but a character for the accent restoration problem). Then implement a competing model that reasons jointly about the labels. The second model must be decoded exactly using dynamic programming (which you must implement yourself). The two models can be trained however you like, though they should be designed to be similar up to the independence assumption. Some examples:
- A symbol-wise Naive Bayes model and an HMM;
- A symbol-wise log-linear model and an MEMM;
- A symbol-wise log-linear model and a CRF (approximate training methods are acceptable, but you must explain them);
- A multi-class symbol-wise SVM and a structured SVM (we have not covered this in class, so we advise against this!).
Use the development data for preliminary testing and any tuning that is required. Use test set 1 to perform these comparative experiments. Test set 2 (in the testdata subdirectory) is the final test set; you should run your final models on it and send the output to the instructor and the TA along with your report. We will report how your scores compare with those of the rest of the class. Further details about the data are available in a README file included with the tarball.
Deliverable: Write a short report. Describe your two models, including the features they use and how you trained them (optimization algorithms, regularization, amount of data, convergence criteria, step size, etc.). Provide enough detail that your results could be replicated without your code. Thorough experimentation is encouraged, but in particular you must present at least one carefully controlled experiment comparing your two models that aims to draw a conclusion. Acknowledge any code you used that you didn't write; remember that you must write the decoding algorithms yourself.
The following papers may be helpful for the implementation:
- Fei Sha and Fernando Pereira. Shallow Parsing with Conditional Random Fields. HLT-NAACL 2003. pdf
- Describes gradient computation and the use of several optimization algorithms for training
- Jenny Rose Finkel, Alex Kleeman and Christopher D. Manning. Efficient, Feature-based, Conditional Random Field Parsing. ACL/HLT-2008. pdf
- Discusses regularization and batch and step sizes when using stochastic gradient descent
Assignment 4
- Out by: Th 10-22
- Due: Th 11-5
- (2 points) When we talked about transformations on the treebank, it was noted that you could often describe the transformation in a different (but more or less equivalent) way. For example, the "parent annotation" trick (Johnson, 1998) in which each node is augmented with the label of its parent, can also be described as changing the "vertical Markov order" of the PCFG from 1 to 2. That is, the sequence of a node n's children depends not just on the parent node n (vertical Markov order 1), but also n's parent (vertical Markov order 2). In this exercise, you are to design a parsing algorithm for a PCFG with rules of the form
, indicating that Z with parent node W can rewrite as α with probability p. Note that for a given pair (Z,W), the probabilities of the rules sum to one. You can assume that α consists of either two nonterminals or one terminal symbol. Your parsing algorithm must be guaranteed to find the most probable parse, and it should run in O(n3N3 + n2N4) time, if there are N nonterminal symbols, and it should run in O(n2N3) space. You can describe the algorithm however you like (e.g., logic programs and graphical notation are fine), but the runtime behavior should be clear as well - a declarative specification is only a partial answer. (Hint: there is a solution that runs in O(n3N4) with only O(n2N2) space; it's easier to find, a good starting point, and will earn you partial credit.)
- (1 point) When we discussed large margin structure learning, the idea of a "loss augmented decoding" problem came up. The idea is to find
. Suppose that
are defined using a classic (bigram) HMM and e (the error function) is the number of mis-labeled symbols in x (which is given) if y * is the correct labeling (also given). Give a dynamic programming algorithm that solves this problem. (Hint: start with the Viterbi algorithm.)
- (1 point) Just like the last problem, but let
be defined by a CFG (in CNF if you like). Let the loss function e be defined by the number of constituents hypothesized in the parse that are not in y * (the correct parse, which you have access to). In addition to solving this problem, answer this question: does this loss function look more like recall or precision?
- (1 point) You are given a PCFG and a sentence
. For a query of the form (N,i,j), where N is a nonterminal and
is an inclusive span of the sentence, you are to return the posterior probability (under the PCFG) that an N constituent spans
. Now suppose the query does not specify j; in other words, it only asks the total probability that a N constituent starts at word i. How does your algorithm change? NOTE: we're looking for efficient algorithms here. It's okay to assume the PCFG is in CNF.
- (1 point) Now suppose we get a query consisting of
- we wish to know the total probability that an N constituent spans
and an M constituent spans
. (Note: you only know for sure that
and that
; you do not know any more than that.) Give an algorithm.
