Assignments (2007)

From LS2

Jump to: navigation, search

Contents

Assignment 1

  • Out by: Th 8-30
  • Due: T 9-11
  • Data for training
  • Data for testing (released 9-10-07)
  • Submission instructions: hard-copy due in class on 9-11-07. For each problem we want your answer in prose and (for questions 2-6) the average (over thirty test instances) of the natural logarithm of the probability of the instance for each of the models implemented: \frac{1}{30}\sum_{i=1}^{30} \log_e(p(\mathrm{testroute}_i)). You should be prepared to turn in your code, but don't turn it in by default.

Since this course deals mainly with text, and the most commonly used text databases come from newspapers, this assignment deals with a newspaper route. Pat is a grad student who delivers newspapers for extra cash. In this exercise, you will try to model Pat's behavior as best you can using statistical models. The goal is to get you to think about how you can capture the observed phenomena using a probabilistic model, and to get some exercise talking about models clearly.

Every Saturday, Pat delivers newspapers. These homes are distributed among several different neighborhoods in the suburban town of Slumberville. Because Pat gets bored easily, Pat likes to vary the route.

Let each house receive a numerical different code, from 0 to 999. These codes have nothing to do with Pat, who doesn't know about them; they are assigned by the Slumberville municipal government for the purpose of maintaining zoning and land use records, sales of property, and so forth. They correspond loosely with the geographical layout. We will use them to index the houses in Slumberville that Pat visits.

Your job is to predict Pat's route (sequence of house codes in the order visited) in real time on a given week. For each problem, you will get a little more information about Pat's route that should help you build a better model. The quality of your model will be measured by the probability it assigns to the route Pat chooses next week, which you won't see until the day before the deadline (i.e., this is the test data).

For each problem, you are to implement a computer program that, given Pat's route on a given week, outputs an estimate of the natural logarithm of the probability that Pat would take the observed route. You can (and should) use the data given to train a model for each problem. You are likely to need hidden variables to make use of the facts given in the problem set, and it's up to you how to deal with that during training and during testing. (Note that if you do not marginalize over hidden variables at test time, your log-likelihood scores will suffer!) Finally, you should be prepared to turn in your code if we have trouble believing in the quality of your estimates.

Pat's routes for the last year are available here.

  1. Before doing anything else, answer this question: why are Markov models inherently unsuited for describing the distribution over Pat's bicycle routes? What is the price your model will have to pay if you keep the Markov property with a small Markov order?
  2. Define and describe a probability model based on what you already know and the data. Implement a program whose input is a sequence of whitespace-separated integers in [0,1000) and whose output is the natural logarithm of your model's probability of Pat following the given sequence. (It's useful to think for a second about a baseline answer to this exercise that doesn't use the data. If Pat delivers a newspaper to each of the 1,000 houses, then there are 1000! possible routes. The uniform model over those routes will assign log probability -\ln(1000!) \approx -5912 to each route. You should be able to do better!)
  3. Pat doesn't live anywhere near Slumberville and usually chooses to take a bus to get there. The two bus stops that have direct lines from Pat's house into the suburb are located in front of houses 432 and 627. Pat can go home on bus lines that stop at the same bus stops. (He can take his bicycle on the bus.) Give the definition of your revised model and implement the scoring function as in (2).
  4. Along the same side of a street, the house codes run consecutively (unlike street addresses, which in the US usually alternate). When a corner is turned or when walking across the street, the numbers may or may not be consecutive. Give the definition of your revised model and implement the scoring function as in (2).
  5. Part of the suburb has grid-like streets that run north-south or east-west (where many streets are one-way), and part of the suburb has typical suburban streets that wind around in circles and frequently end in cul-de-sacs (where all streets are two-way). Pat prefers not to toss newspapers across a lane of oncoming traffic. Give the definition of your revised model and implement the scoring function as in (2).
  6. Pat usually starts the route at 9am, takes a lunch break at a cafe near house 101 from 12--1pm (during which Pat reads research articles on computational linguistics), then continues the route until 4pm. Give the definition of your revised model and implement the scoring function as in (2).

Remember, in addition to answering each question concisely and clearly in English, you need to report values of the log-probability of the test data (available the day before the deadline on the course web page) for questions 2-6! Part of your grade will be based on how well this value compares with your classmates.

7. NLP Question Suppose you must build a language model for a language like Chinese that is not normally whitespace separated. You are given a corpus of unsegmented text; you do not have a word list. Part 1: brainstorm some solutions to this problem and discuss. Part 2: how would your answer change if you had access to a (partial) word list? If you know about this problem, or if you look for relevant literature when answering the question, you must cite prior work!

8. Optional Bonus Question Back to Pat---how would you go about inferring a geographical map of the suburb, given observations of Pat's route over a period of time?

Assignment 2

  • Out by: Th 9-13
  • Due: T 9-25

Each question is worth one point, for a total of seven. To get 100%, answer six of the seven correctly; if you answer all seven correct, you'll get 7/6.

  1. The first problem comes from the lecture on September 4 ("HMM problem 3"). A bigram HMM defines a joint distribution over a label sequence c_1^n and an observation sequence s_1^n as p(c_1^n, s_1^n) = \left(\prod_{i=1}^{n} \gamma(c_i \mid c_{i-1}) \cdot \eta(s_i \mid c_i) \right) \cdot \gamma(\mathrm{stop} \mid c_n). Define an algorithm that runs in O( | Λ | 2(n + k)) time and generates k independent, random samples over the label variable Ci, drawn according to the distribution p(C_i \mid S_1^n = s_1^n), which can be derived from the HMM. (The inputs to the algorithm are the label set Λ, the HMM parameters \boldsymbol{\gamma} and \boldsymbol{\eta}, the sequence s_1^n, and the position index i.)
  2. Starting with the same HMM as problem 1, define an algorithm that runs in O( | Λ | 2nk) time and generates k independent, random samples over the label sequence variable C_1^n, drawn according to the distribution p(C_1^n \mid S_1^n = s_1^n). (The inputs to the algorithm are the label set Λ, the HMM parameters \boldsymbol{\gamma} and \boldsymbol{\eta}, and the sequence s_1^n.)
  3. Suppose I have an HMM such that Λ = {B,I,O}. This model gives a joint distribution over word sequences bracketed with base noun phrases; this is also called "NP chunking." Given a structure (c_1^n, s_1^n) generated by this model, a base noun phrase corresponds to s_i^j whenever (c_i^{j+1} = BI^{j-i}O) \vee (c_i^{j+1} = BI^{j-i}B) \vee ((j=n) \wedge (c_i^j = BI^{j-i})) (the first argument of the disjunction corresponds to a noun phrase followed by non-NP material, the second corresponds to a base NP that is followed by another base NP, and the third corresponds to a base NP at the end of the sentence). Define an algorithm that calculates the exact probability that subsequence s_i^j is a base noun phrase, given the sequence and the HMM. (The inputs to the algorithm are the HMM parameters \boldsymbol{\gamma} and \boldsymbol{\eta}, i, j, and the sequence s_1^n.)
  4. Here is a new generative model over three sequences, D, which ranges over Δ + , C, which ranges over Λ + , and S, which ranges over Σ + . Δ, Λ, and Σ are finite, discrete sets. (In lecture, we considered two sequences, usually denoted C and S.) Here is the generative process:
    1. i \leftarrow 1; s_0 \leftarrow \mathrm{start}
    2. Generate the value of Di from the distribution \phi( \bullet ) over the finite, discrete set Δ.
    3. Generate the value of Ci from the distribution \gamma(\bullet \mid S_{i-1} = s_{i-1}) over the finite, discrete set Λ.
    4. Generate the value of Si from the distribution \eta(\bullet \mid D_i =d_i, C_i = c_i) over the finite, discrete set \Sigma \cup \{\mathrm{stop}\}.
    5. If si = stop, stop. Otherwise, i \leftarrow i + 1 and go to step 2.
    Define an algorithm that calculates the exact probability of s_1^n under this model. NOTE CHANGE IN PREVIOUS SENTENCE, WHICH FORMERLY SAID "s_i^n"! (The inputs to the algorithm are the sets Δ, Λ, and Σ the model parameters \boldsymbol{\phi}, \boldsymbol{\gamma}, and \boldsymbol{\eta}, and the sequence s_1^n.) Advice: draw a directed graphical model corresponding to this distribution (if you aren't familiar with graphical models, ignore this advice).
  5. Consider a log-linear model that defines p(X) where X ranges over the discrete, finite set \mathcal{X} = \{x^1, x^2, ..., x^v\} (superscripts are used here to denote types; they are not to be confused with subscripts, which are used in this class to index positions in a sequence or examples). In this log-linear model, there are v features, and for each i \in \{1, 2, ..., v\}, the ith feature is defined as fi(x) = δ(x,xi). It should be clear that, for any x, the feature vector \mathbf{f}(x) contains a single 1 and the rest of the entries are 0. Given a histogram of empirical counts over \mathcal{X} (such that ci is the count of xi), prove that one solution to this model is to set \theta_i = \log \tilde{p}(x^i). (\tilde{p} is the empirical distribution, given by relative frequency estimates \tilde{p}(x^i) = \frac{c_i}{\sum_{j=1}^v c_j}.) Hint: imagine that \mathcal{X} is the set of words and we're using the log-linear model to define a unigram distribution. You are proving that the closed-form maximum likelihood estimate for the classical unigram model (blue formula on p. 60 of the slides from September 6, with m = 0) gives a closed-form solution to the maximum likelihood estimate for the locally-normalized log-linear Markov model (purple formula on p. 60 of the slides from September 6, with m = 0).
  6. You know that, in general, the maximum likelihood estimate of log-linear models doesn't have a closed-form solution. Explain why this special case of a log-linear model does have a closed form solution. Hint: think of some features to add in that "break" this property. Last minute addendum: the hint is misleading. It should read, instead, "think of some features to add in or remove that 'break' this property."
  7. Continuing problem 5, prove that the solution is not unique.

Assignment 3

  • Out by: Th 9-27
  • Due: T 10-9


In this assignment, you will be given data of the form (x,y), where x is a person's name in the little-known language Scenglish, and y is a login name for that person on a UNIX machine.

You are strongly encouraged to use finite-state transducers to solve this problem. Suggested tools include the AT&T FSM libraries and OpenFST. Note that the emphasis is on building the hypothesis set, not on scoring!

Training data (on each line, the login comes first, then a tab, then the user's name, which may be multiple words separated by spaces)

Test data (on aech line, the user's name, which may be multiple words separated by spaces)

  1. (4 points) Build a hypothesizer that takes a name (x) and returns a nonempty set \mathcal{S} of logins for that name. Your hypothesizer will be evaluated in two ways. The first is coverage on a set of test examples: the fraction of examples in the test data for which the true y is included in your hypothesis set \mathcal{S}. A perfect coverage score of 1 means that you always listed the true login among your hypotheses. The second evaluation measure is precision, which we define as the average (over test examples) of |\mathcal{S}|^{-1} (you must ensure that |\mathcal{S}| > 0, or you will get no credit for this problem). A perfect score of 1 means that you always returned exactly one hypothesis. The input to your program will be a list of newline-separated names. The output of your program should be the same number of lines as the input, with a list of space-separated logins on each line. The deliverables for this problem will be a clear description of your methods, and the list of hypotheses for each of the test example names (made available the day before the assignment is due). The lists should be emailed to us as an electronic file in the format described. Name your file your-login-name.hypotheses. Hint: The goal here is not for you to build a huge FST by hand. Instead, think about the different kinds of string-string relations fit the data well, and try to build little transducers by hand that capture your intuitions. Then, put those little ones together in ways that make sense (composition, concatenation, closure, etc.) to build a bigger FST. Remember, for this problem, you don't need to worry about weights!
  2. (2 points) Build a ranker that ranks the list of logins. Your goal here is to make the correct login go to the top of the list, for each name. The input to the program will be two files, the name list (same data given as input to the hypothesizer) and a file containing a list of hypotheses for each name (the output of the hypothesizer), and the output simply reorders each list of logins. Note that this task need not be carried out separately from problem 1; you could build a single program that ranks at the same time that it builds the hypothesis set. If you combine the tasks into one, you only need to write a single description of the system. Your ranker will be evaluated in one way: mean reciprocal rank, that is, the average of the inverse rank of the correct login. If the correct login is not in the hypothesized list, then the inverse rank is zero. (Note that coverage as defined in problem 1 is an upper bound on mean reciprocal rank, and precision equals MRR if you always recover the true login but rank it last.) The deliverables for this problem will be a clear description of your methods, and the ranked list of hypotheses for each test example. The lists should be emailed to us as an electronic file in the format described. Name your file your-login-name.ranked; it's okay if this file is identical to the one you submit for problem 1. Hint: I don't expect you to do anything extremely fancy here, but estimating weights will be helpful, and you will do best at this if you try to use the data.
  3. Bonus question: As it happens, Scenglish is just like English, except that the alphabet is scrambled. In other words, to write Scenglish, you just write in English and apply a substitution cipher. The data you've been working with were generated by the application of such a cipher. Extra points will be awarded if you use WFSTs to crack the substitution cipher and give a clear explanation of what you did. For us to evaluate your guess at the cipher, submit a file with 26 lines, each line formatted as: Scenglish-letter space English-letter newline. Getting the right cipher isn't enough; you must describe clearly how finite-state transducers were used to solve the problem.

Here is the Perl code for the evaluation script, in case you want to measure your performance on development data:

#!/usr/bin/perl
# Noah Smith
# 9/26/07
# eval.pl
# evaluation script for L&S2 fall 2007 assignment 3.
# usage:  eval.pl true-logins-file hypothesis-lists-file

# Takes true logins (first command line argument) and lists of
# hypotheses (second command line argument) and calculates coverage,
# precision, and mean reciprocal rank, as we have defined them.  This
# program fails if any hypothesis line is empty.

open(TRUE, "<$ARGV[0]");
open(HYP, "<$ARGV[1]");

$count = 0;
while(chomp($t = <TRUE>) and chomp($H = <HYP>)) {
    ++$count;
    %seen = ();
    @h = split /\s+/, $H;
    die "fail" if(scalar(@h) == 0);
    $precision_sum += (1.0 / scalar(@h));
    $k = 0;
    foreach $h (@h) { 
        ++$k;
        if($h eq $t) {
            ++$coverage_sum;
            $mrr_sum += (1.0 / $k);
            last;
        }
    }
}
print "coverage = ", (0.0 + $coverage_sum)/$count, "; precision = ", (0.0 + $precision_sum)/$count, "; m.r.r. = ", (0.0 + $mrr_sum)/$count, "\n";

Assignment 4

  • Out by: Th 10-11
  • Due: T 10-23
  1. (2 points) When we talked about transformations on the treebank, it was noted that you could often describe the transformation in a different (but more or less equivalent) way. For example, the "parent annotation" trick (Johnson, 1998) in which each node is augmented with the label of its parent, can also be described as changing the "vertical Markov order" of the PCFG from 1 to 2. That is, the sequence of a node n's children depends on just on the parent node n (vertical Markov order 1), but also n's parent (vertical Markov order 2). In this exercise, you are to design a parsing algorithm for a PCFG with rules of the form Z^W \rightarrow^p \alpha, indicating that Z with parent node W can rewrite as α with probability p. Note that for a given pair (Z,W), the probabilities of the rules sum to one. You can assume that α consists of either two nonterminals or one terminal symbol. Your parsing algorithm must be guaranteed to find the most probable parse, and it should run in O(n3N3 + n2N4) time, if there are N nonterminal symbols, and it should run in O(n2N3) space. You can describe the algorithm however you like (e.g., logic programs and graphical notation are fine), but the runtime behavior should be clear as well - a declarative specification is only a partial answer. (Hint: there is a solution that runs in O(n3N4) with only O(n2N2) space; it's easier to find, a good starting point, and will earn you partial credit.)
  2. (3 points) Find an existing statistical parser that's freely available. It can be one of the ones we discussed in class, or another one - but it needs to be a statistical parser that was trained on data. Describe how the parser works in a paragraph (what's the underlying formalism, what kinds of parse trees it predicts, what language it's intended for, what data was it trained on, and what kind of parsing algorithm is involved, what various parameter settings are possible and which ones you think are most appropriate for this exercise). Find 25 sentences that you are certain do not come from the parser's training set. These can be sentences you generate, or they can come from news stories, or blogs, or your own email, even this assignment - be creative. Annotate the 25 sentences by hand according to the conventions expected by the parser; in other words, create a gold-standard mini-treebank that is as high-quality as you can expect the parser to perform. (It is probably helpful to look at some of the parser's training data to get an idea of what the annotated data should look like!) You should do this in an electronic file. Finally, run the parser on the raw sentences (or, if it requires pre-tagging, you may feed your gold-standard tags as input - but make it clear in the write-up that you did so!). Provide a quantitative evaluation of the parser (e.g., use the evalb tool to measure labeled precision, recall, and crossing brackets, or measure the accuracy of dependency attachments), and a qualitative evaluation of how well the parser performed on your data (e.g., do the numbers give a fair analysis, or did the parser find reasonable parses that differed from yours, or did you do a poor job of hand-parsing the data?). Feel free to speculate on why the parser fails or succeeds where it does, and how you could improve the model to get better performance. Deliverables for this problem: the parser description and evaluation (in prose), the sentences you generated (electronic file), your gold-standard parses (electronic file), and the parser output (electronic file).
  3. (1 point) You are given a PCFG and a sentence s_1^n. For a query of the form (N,i,j), where N is a nonterminal and 1 \le i \le j \le n is an inclusive span of the sentence, you are to return the posterior probability (under the PCFG) that an N constituent spans s_i^j. Now suppose the query does not specify j; in other words, it only asks the total probability that a N constituent starts at word i. How does your algorithm change? NOTE: we're looking for efficient algorithms here. It's okay to assume the PCFG is in CNF.
  4. Bonus question: Now suppose we get a query consisting of (N, i, j, M, k, \ell) - we wish to know the total probability that an N constituent spans s_i^j and an M constituent spans s_k^\ell. (Note: you only know for sure that 1 \le i \le j \le n and that 1 \le k \le \ell \le n; you do not know any more than that.) Give an algorithm.

Assignment 5

  • Out by: T 10-30 Th 11-2
  • Due: Th 11-8 T 11-13
  1. (2 points) For each of the following measures of accuracy, show how it can be described as a factored loss function. That is, give the complete definition of a loss function that, if minimized, would be equivalent to obtaining maximum accuracy (as defined in each case). You will need to define some notation to make your answer clear!
    • For a POS tagger, the fraction of words correctly tagged (as compared to a gold standard).
    • For a dependency parser, the fraction of words whose parent is correctly identified (as compared to a gold standard). Assume the root word is linked to an invisible parentless parent, \$, at the left edge of the sentence.
    • For a word aligner that links words in English to their corresponding words in French (given parallel text), the fraction of links that are correct (i.e., in a gold standard).
    • For a named entity recognizer, the fraction of named entities (contiguous subsequences of a sentence) that are correctly bracketed (as compared to a gold standard). You may assume that named entities are not recursive in the gold standard or the hypothesis.
  2. A linear program in standard form looks like this: "\min_{\mathbf{x}}  \mathbf{c}^\top \mathbf{x} such that \mathbf{A} \mathbf{x} \le \mathbf{b},  \mathbf{x} \ge \mathbf{0}." Let N be the dimensionality of \mathbf{x} and let M be the number of linear constraints (not including the positivity constraints); \mathbf{A} \in \mathbb{R}^{M\times N} and \mathbf{b} \in \mathbb{R}^M. It is not terribly difficult to show that any linear program can be transformed to this form. (You are encouraged at this juncture to take a careful look at the Klein-Taskar tutorial on maximum margin methods at ACL 2005.)
    1. (1 point) Let \langle w_0 = \$, w_1, w_2, ..., w_n \rangle be a sentence. Let cost(wi,wj) be the cost associated with making wi the parent of wj, and let the cost of a tree be
      cost(wparent(j),wj)
      j
      . Your job is to define an LP such that the minimizing \mathbf{x} can be converted back into a dependency tree such that
      • every word in \langle w_1, ... w_n \rangle has exactly one parent in {w0,w1,...,wn};
      • w0 does not have a parent;
      • the solution to the LP will correspond to the lowest-cost dependency tree.

      To answer the question, you must define \mathbf{x} (in terms of the dependency tree), \mathbf{A}, \mathbf{b}, and \mathbf{c}. Hint: you should be able to define a solution in which N = O(n2) and M = O(n). You don't need to convert to standard form, and it's okay to use inequality constraints, as long as they are linear.

      Note: do not worry about cyclicity yet, unless you find it helpful to do so.

    2. (1 point) Do exercise 1 again, but this time add an overall projectivity constraint on the tree:
      • if wi is the parent of wj, then \forall k \in (i, j) \cup (j, i), wk is a descendent of wi the parent of wk is also \in [i,j] \cup [j,i]. (NOTE CORRECTION!) (Another (I believe) equivalent statement is that for any i, wi and all its descendents form a contiguous substring.)

      Hint: this will require at least another O(n2) linear constraints. Note: do not worry about cyclicity yet, unless you find it helpful to do so.

    3. (1 point) Give the dual of the LP in exercise 3.1 2.1 (NOTE CORRECTION).
    4. Bonus question: The dependency structures recovered above have not been constrained to be acyclic. In practice we typically want dependency structures to be trees. Can you define constraints that will enforce acyclicity? You may introduce more variables, too, if you need to. You may also use non-linear constraints.
  3. (1 point) Here are two news articles from CNN and from the LA Times. Calculate the total number of exactly matching substrings (of bytes) in these two files. Spaces and newlines count. For example, the strings "abaab " and "bbbab " have 17 matching substrings.