Sequence Guessing

From LS2

Jump to: navigation, search

For the present discussion, let's suppose we have a bigram hidden Markov model that assigns probabilities to sequences of labeled symbols, (\Sigma \times\Lambda)^+.

Suppose I give you a sequence s_1^n, and I want you to label it (that is, identify for each si a label ci from some label set Λ). How should you label the sequence?


Motivating Viterbi as minimizing the sequence loss

For most of us, the natural answer is to find the most likely label sequence given the observations:

\hat{c}_1^n = \arg \max_{c_1^n} p(c_1^n \mid s_1^n) = \arg \max_{c_1^n} p(c_1^n, s_1^n)

This can be accomplished by the Viterbi algorithm. It turns out that there is a very nice way to motivate this solution. Imagine that there is a true sequence g_1^{n} \in \Lambda^n. Suppose that if your guess, \hat{c}_1^n is not exactly the same as g_1^n, then I will take a dollar from you. Otherwise, I won't take anything. We can encode this as a loss function:

\ell(\hat{c}_1^n ; g_1^n) = 1 - \delta(\hat{c}_1^n, g_1^n)

I call this the "whole-sequence loss function." If you get the whole sequence right, you lose nothing, and otherwise you lose a dollar.

The idea is to imagine that G_1^n is actually a random variable, the value of which we are uncertain about. Now, if you believe in the distribution that your HMM provides, then you might assume that G_1^n is distributed according to the HMM's conditional distribution over label sequences (given s_1^n). Under that distribution, we can define the expected loss as a function of our possible guesses c_1^n:

\mathbb{E}[\ell(c_1^n; G_1^n] = \sum_{g_1^n \in \Lambda^n} p(G_1^n = g_1^n \mid s_1^n) (1 - \delta(c_1^n, g_1^n)) = 1 - p(G_1^n = c_1^n \mid s_1^n)

Turning this into a decision problem means choosing the sequence c_1^n that minimizes the expected loss, which it should be clear is equivalent to picking the c_1^n that is most likely given the observation sequence s_1^n. So you should clearly choose the label sequence that's most likely given the observation, because it minimizes the expected whole-sequence loss. This kind of decoding is sometimes called maximum a posteriori or MAP decoding.

(It's an open question whether you should trust the HMM, of course, but in this discussion it's all you have.)

A different loss function

Now, consider instead an alternative game. Suppose that, for every element of the sequence you get wrong, I will take a dollar. Here's the loss function:

\ell(\hat{c}_1^n ; g_1^n) = n - \sum_{i=1}^n \delta(\hat{c}_i, g_i)

I call this the "label loss function." This is actually much closer to what's commonly used to evaluate taggers in NLP.

Let's again imagine that G_1^n is a random variable distributed according to the HMM's implied distribution over label sequences given s_1^n.

\mathbb{E}[\ell(c_1^n; G_1^n)] = \sum_{g_1^n \in \Lambda^n} p(G_1^n = g_1^n \mid s_1^n) \left(n - \sum_{i=1}^n  \delta(c_i, g_i)\right) = n - \sum_{i=1}^n p(G_i = c_i \mid s_1^n)

Equivalently, we should choose each ci to maximize:

\hat{c}_i = \arg \max_c p(G_i = c \mid s_1^n)

The quantities above can be computed using forward and backward probabilities, as discussed in the lecture. This is sometimes called minimum risk decoding or posterior decoding; it's not widely used in NLP, though it's arguably more principled in some settings than the standard MAP decoding.

Final note

While these two sequence guessing algorithms are rather different, they are both motivated by the same trick, which is to assume that the hidden structure is really a random variable distributed according to a known model. The model in both cases is the same. The difference between the two approaches is in how we are penalized. Different loss functions lead to different guessing algorithms. It's interesting to note that minimizing the label loss function can result in a \hat{c}_1^n that could not generate s_1^n in its entirety under the HMM, and that MAP decoding can pick an extremely unlikely \hat{c}_i at a particular time step i, in the interest of having a coherent label sequence. Could you implement an algorithm that minimizes a linear interpolation of the two expected loss functions?

--Noah 18:26, 4 September 2007 (EDT)