Assignments (2008)
From LS2
Contents |
Assignment 1
- Out by: M 9-8
- Due: W 9-17
1. A dynamic programming problem. Consider a telephone keypad with digits 1-9 laid out in the usual way (1 2 3 / 4 5 6 / 7 8 9), and imagine placing a chess piece on the "1" key. The chess piece can move around the keypad according to the following rules:
- A rook can move horizontally or vertically one or more steps. (E.g., from 4 it can move to 1, 5, 6, or 7; from 5 it can move to 2, 4, 6, or 8, etc.)
- A bishop can move diagonally one or more steps. (E.g., from 4 it can move to 2 or 8; from 5 it can move to 1, 3, 7, or 9.)
- A knight moves exactly one horizontal or vertical step plus one diagonal step. (E.g., from 4 it can move to 3 or 9; from 5 it can't move anywhere.)
- A queen's set of valid moves is the union of those of the rook and the bishop. (E.g., from 4 it can move to 1, 2, 5, 6, 7, or 8; from 5 it can move to 1, 2, 3, 4, 6, 7, 8, or 9.)
- A king can move a single step horizontally, vertically, or diagonally. (E.g., from 4 it can move to 1, 2, 5, 7, or 8; from 5 it can move to 1, 2, 3, 4, 6, 7, 8, or 9.)
Consider moving the piece around the keypad according to its allowed moves, pressing numbers to produce digit sequences. For each chess piece (rook, bishop, knight, queen, king - don't worry about pawns), give recursive equations to count the number of unique length-9 digit sequences (possibly with repetitions of digits) that can be produced. That is, give a separate set of equations for each of the five pieces.
2. Design a semiring to output the full set of unique digit sequences for each piece, instead of simply counting them. Hint: use a string concatenation operator or a set of ordered pairs to represent a digit sequence (e.g.,
denotes "1, 4, 5"). Bonus: prove that it's a semiring.
3. Give a generalized Dyna program for counting that works for any chess piece (not the pawn), any starting key, and any number of steps. These three should be specified as axioms. Try to make your program as succinct as possible. Bonus: Let the dimension of the grid (assumed square) be specified by an axiom, as well. Assume a similar numbering convention (top row is 1 to N, next is N+1 to 2N, etc.).
4. Optimization. Consider the function
, defined by f(x) = (x − 1)2 + e0.3x − cos(5x + 6). This function is smooth, continuous, and differentiable. You are enouraged to plot it to get some idea of its properties. For each of two numerical optimization algorithms (your choice) and ten different starting points (your choice), solve for a local minimum of the function. Report precisely what you did, show the results in a table (columns should correspond to the initial value, and for each algorithm, the final value of x and f(x)), and explain the results. You are strongly encouraged to use an existing implementation of the optimization algorithm to save yourself time (Noah really likes the R language's support for optimization, and also the sorts of things you'll do in problems 5 and 6) - be sure to tell us what implementations you used!
5. Experiment. We are going to encapsulate the optimization algorithms above within a randomized algorithm that hides the initializer choice. The algorithm proceeds by first sampling the starting value from a probability distribution over the reals (we suggest a zero-mean Gaussian with a variance of your choice), then running the algorithm from that starting point. Clarification: Just one random starting point; you only initialize and run once - you don't initialize many times and somehow combine the output of many runs!
In order to compare the two algorithms from problem 4 (now wrapped in the above randomized algorithm), use (the same) 200 initializers generated from your random initialization distribution. Does one algorithm lead to better local minima than the other? Does one lead to faster convergence? Test significance for both questions, and explain clearly how you did it. Note: measure speed in the absolute number of function evaluations performed - not wall time (since it depends on external factors you may not be able to control for) or optimization algorithm iterations (since line searches may require more function evaluations that aren't counted directly that way).
6. Another experiment. Define another distribution over starting values. Generate 200 initial points from this distribution. Conduct an experiment that compares the performance of your first algorithm paired with the random initializer of problem 5 with the new random initializer. Does one random initializer lead to better local minima than the other? Does one lead to faster convergence for this function? Test significance for both questions, and explain clearly how you did it.
Frequently-Asked Questions
Note: please email your questions to the instructor or the TA. We will post them here if we deem them useful and appropriate for the class.
Q: In question 1, does "imagine placing a chess piece on the '1' key" imply that the start state for all the pieces is '1' or was that sentence given just as an example?
A: For question 1, give recursive equations assuming that each piece starts on the '1' key.
--Kgimpel 17:43, 12 September 2008 (EDT)
Assignment 2
- Out by: M 9-22
- Due: T 9-30 9pm
- The first problem comes from the lecture on September 17. A bigram HMM defines a joint distribution over a label sequence
and an observation sequence
as
. Define an algorithm that runs in O( | Λ | 2(n + k)) time and generates k independent, random samples over the label variable Ci, drawn according to the distribution
, which can be derived from the HMM. (The inputs to the algorithm are the label set Λ, the HMM parameters
and
, the sequence
, and the position index i.) Addendum: It's actually possible to do it with O( | Λ | 2n + | Λ | k) runtime!
- Starting with the same HMM as problem 1, define an algorithm that runs in O( | Λ | 2nk) time and generates k independent, random samples over the label sequence variable
, drawn according to the distribution
. (The inputs to the algorithm are the label set Λ, the HMM parameters
and
, and the sequence
.)
- Suppose I have an HMM such that Λ = {B,I,O}. This model gives a joint distribution over word sequences bracketed with base noun phrases; this is also called "NP chunking." Given a structure
generated by this model, a base noun phrase corresponds to
whenever
(the first argument of the disjunction corresponds to a noun phrase followed by non-NP material, the second corresponds to a base NP that is followed by another base NP, and the third corresponds to a base NP at the end of the sentence). Define an algorithm that calculates the exact probability that subsequence
is a base noun phrase, given the sequence and the HMM. (The inputs to the algorithm are the HMM parameters
and
, i, j, and the sequence
.)
- Here is a new generative model over three sequences, D, which ranges over Δ + , C, which ranges over Λ + , and S, which ranges over Σ + . Δ, Λ, and Σ are finite, discrete sets. (In lecture, we considered two sequences, usually denoted C and S.) Here is the generative process:
-
;
- Generate the value of Di from the distribution
over the finite, discrete set Δ.
- Generate the value of Ci from the distribution
over the finite, discrete set Λ.
- Generate the value of Si from the distribution
over the finite, discrete set
.
- If si = stop, stop. Otherwise,
and go to step 2.
under this model. (The inputs to the algorithm are the sets Δ, Λ, and Σ the model parameters
,
, and
, and the sequence
.) Advice: draw a directed graphical model corresponding to this distribution (if you aren't familiar with graphical models, ignore this advice).
-
- Consider a log-linear model that defines p(X) where X ranges over the discrete, finite set
(superscripts are used here to denote types; they are not to be confused with subscripts, which are used in this class to index positions in a sequence or examples). In this log-linear model, there are v features, and for each
, the ith feature is defined as fi(x) = δ(x,xi). It should be clear that, for any x, the feature vector
contains a single 1 and the rest of the entries are 0. Given a histogram of empirical counts over
(such that ci is the count of xi), prove that one solution to this model's maximum likelihood estimate is to set
. (
is the empirical distribution, given by relative frequency estimates
.) Hint: imagine that
is the set of words and we're using the log-linear model to define a unigram distribution. You are proving that the closed-form maximum likelihood estimate for the classical unigram model gives a closed-form solution to the maximum likelihood estimate for the locally-normalized log-linear Markov model.
- You know that, in general, the maximum likelihood estimate of log-linear models doesn't have a closed-form solution. Explain why this special case of a log-linear model does have a closed form solution. Hint: think of some features to add in or remove that 'break' this property.
- Continuing problem 5, prove that the solution is not unique.
Assignment 3
- Out by: M 10-6
- Due:
W 10-15F 10-17 - Data: hw3-data.tar.gz -- posted 10/6/08
- accent_restoration.tar.gz -- posted 10/15/08 -- updated accent restoration data (only change is to substitute ~'s with spaces to be like truecasing data)
- Test data: testdata.tar.gz -- posted 10/15/08 -- When submitting your report, please also submit (via e-mail) your system's output on these test sets. Be sure to include in your report a description of the system (parameter settings, features, etc.) you used for these test sets.
For this assignment, you will implement and experiment with a (linear-chain) CRF. We will provide you with training, development, and test data for two tasks and you will conduct experiments with your implementation to answer several questions. The problems are:
- Truecasing. Given lowercased text from English news articles, predict the true case pattern of each word. The problem we consider is actually simpler than the standard truecasing problem, since we simply ask you to label each word with one of the following four labels: AC (all caps), IC (initial cap), MIX (other mixed case pattern), and O (no change required).
- Accent Restoration. Given accent-stripped Spanish news commentary text, predict which characters should receive accent marks (and which kind for 'u').
Choose one question below for each problem (don't pick the same question twice), and choose a third question to answer for both problems. Answer the questions by performing experiments and reporting your results.
- Feature engineering: what features work best for this problem?
- How do different optimization algorithms compare? What is the effect of optimization algorithm parameters? (convergence criteria, number of iterations, step size, etc.)
- How do different decoding methods compare? (e.g., Viterbi vs. MBR/posterior decoding)
- How does performance change with the amount of training data used?
- How does regularization affect performance, either in form (e.g., L1 vs. L2) or strength (varying regularization coefficient)?
For each problem, you will be given development data and two test sets. Use the development data for preliminary testing and any tuning that is required. Use test set 1 to perform these comparative experiments. Test set 2 will be provided to you a day before the due date; use your best system to test on it and send the output to the instructor and the TA along with your write-up. We will report how your scores compare with those of the rest of the class. Further details about the data are available in a README file included with the tar.gz file.
Implementation: While you may use libraries for optimization routines, you must implement the dynamic programming algorithms for computing the function value/gradient and for decoding. Be prepared to provide us with your code if we ask.
The following papers may be helpful for the implementation:
- Fei Sha and Fernando Pereira. Shallow Parsing with Conditional Random Fields. HLT-NAACL 2003. pdf
- Describes gradient computation and the use of several optimization algorithms for training
- Jenny Rose Finkel, Alex Kleeman and Christopher D. Manning. Efficient, Feature-based, Conditional Random Field Parsing. ACL/HLT-2008. pdf
- Discusses regularization and batch and step sizes when using stochastic gradient descent
Report: Write a report summarizing your findings. Be sure to include important implementation details, the optimization method you used, your experimental set-up, and your results. Cite any code or libraries you used for optimization. Give enough detail so that someone else can replicate your results!
In your report, also answer the following question:
- The data we provided gives true case patterns at the word level and true accent marks at the character level, but either of these problems could be approached at the level of words or characters. What are the pros and cons of these approaches, in terms of computational complexity and modeling flexibility?
Assignment 4
- Out by: M 10-20
- Due: W 11-5
- (2 points) When we talked about transformations on the treebank, it was noted that you could often describe the transformation in a different (but more or less equivalent) way. For example, the "parent annotation" trick (Johnson, 1998) in which each node is augmented with the label of its parent, can also be described as changing the "vertical Markov order" of the PCFG from 1 to 2. That is, the sequence of a node n's children depends not just on the parent node n (vertical Markov order 1), but also n's parent (vertical Markov order 2). In this exercise, you are to design a parsing algorithm for a PCFG with rules of the form
, indicating that Z with parent node W can rewrite as α with probability p. Note that for a given pair (Z,W), the probabilities of the rules sum to one. You can assume that α consists of either two nonterminals or one terminal symbol. Your parsing algorithm must be guaranteed to find the most probable parse, and it should run in O(n3N3 + n2N4) time, if there are N nonterminal symbols, and it should run in O(n2N3) space. You can describe the algorithm however you like (e.g., logic programs and graphical notation are fine), but the runtime behavior should be clear as well - a declarative specification is only a partial answer. (Hint: there is a solution that runs in O(n3N4) with only O(n2N2) space; it's easier to find, a good starting point, and will earn you partial credit.)
- (1 point) When we discussed large margin structure learning, the idea of a "loss augmented decoding" problem came up. The idea is to find
. Suppose that
are defined using a classic (bigram) HMM and
(the loss function) is the number of mis-labeled symbols in x (which is given) if y * is the correct labeling (also given). Give a dynamic programming algorithm that solves this problem. (Hint: start with the Viterbi algorithm.)
- (1 point) Just like the last problem, but let
be defined by a CFG (in CNF if you like). Let the loss function
be defined by the number of constituents hypothesized in the parse that are not in y * (the correct parse, which you have access to). In addition to solving this problem, answer this question: does this loss function look more like recall or precision?
- (1 point) You are given a PCFG and a sentence
. For a query of the form (N,i,j), where N is a nonterminal and
is an inclusive span of the sentence, you are to return the posterior probability (under the PCFG) that an N constituent spans
. Now suppose the query does not specify j; in other words, it only asks the total probability that a N constituent starts at word i. How does your algorithm change? NOTE: we're looking for efficient algorithms here. It's okay to assume the PCFG is in CNF.
- (1 point) Now suppose we get a query consisting of
- we wish to know the total probability that an N constituent spans
and an M constituent spans
. (Note: you only know for sure that
and that
; you do not know any more than that.) Give an algorithm.
Assignment 5
- Out by: M 1-24
- Due: W
12-312-5, 3am
- (2 points) Let X be a discrete random variable whose value we can observe. Let Y and Z be hidden random variables over two different, disjoint, discrete class spaces. Assume that the class priors q(Y = y) and r(Z = z) are known. Let the model over the three random variables be:
. You are given training data
. Assume all probability distributions are multinomials. Give equations for the E step and the M step that (locally) maximize
with respect to the multinomial probabilities
(for all x, y, z).
- (2 points) In the model above, we assumed that q and r were known. Suppose instead that we have prior distributions over them, expressed as symmetric Dirichlets with (scalar) parameters α and β, respectively. More specifically, for each example xi, we believe that the multinomial qi was sampled from the symmetric Dirichlet with parameter α, the multinomial ri was sampled from the symmetric Dirichlet with parameter β, then Y was drawn from qi, Z was drawn from ri, and finally X is drawn from p given the values of Y and Z. Describe a Gibbs sampler that will permit us to get samples of Y and Z for each xi, assuming that p is fixed.
- (1 point) Describe how to use the Gibbs sampler from problem 2 to learn p. Hint: your answer to problem 2 is a subroutine within the solution to this problem.
- (1 point) In the lecture on approximate inference, we discussed sampling a single tag, given the rest of the tag sequence, the observation sequence, and the collection of Dirichlet distributions over the trigram HMM probabilities. The formula given in lecture was
where L is the length of the sequence and V(t) is the number of word types possible with tag t. It was pointed out that this was not correct; the correct formula can be found in Goldwater and Griffiths (2007) in figure 2. Explain the difference (i.e., what are the extra terms in Goldwater and Griffiths doing?).
- Bonus question Derive a variational inference algorithm that you could use instead of the Gibbs sampler in problem 2. To do this, you need to write out the form of
, select a variational model Q over the hidden variables (recommended:
, with variational parameters
), then maximize
. Here's a derivation of variational inference for a simpler model: pdf.
