Sequence Guessing
From LS2
For the present discussion, let's suppose we have a bigram hidden Markov model that assigns probabilities to sequences of labeled symbols,
.
Suppose I give you a sequence
, and I want you to label it (that is, identify for each si a label ci from some label set Λ). How should you label the sequence?
Motivating Viterbi as minimizing the sequence loss
For most of us, the natural answer is to find the most likely label sequence given the observations:
This can be accomplished by the Viterbi algorithm. It turns out that there is a very nice way to motivate this solution. Imagine that there is a true sequence
. Suppose that if your guess,
is not exactly the same as
, then I will take a dollar from you. Otherwise, I won't take anything. We can encode this as a loss function:
I call this the "whole-sequence loss function." If you get the whole sequence right, you lose nothing, and otherwise you lose a dollar.
The idea is to imagine that
is actually a random variable, the value of which we are uncertain about. Now, if you believe in the distribution that your HMM provides, then you might assume that
is distributed according to the HMM's conditional distribution over label sequences (given
). Under that distribution, we can define the expected loss as a function of our possible guesses
:
Turning this into a decision problem means choosing the sequence
that minimizes the expected loss, which it should be clear is equivalent to picking the
that is most likely given the observation sequence
. So you should clearly choose the label sequence that's most likely given the observation, because it minimizes the expected whole-sequence loss. This kind of decoding is sometimes called maximum a posteriori or MAP decoding.
(It's an open question whether you should trust the HMM, of course, but in this discussion it's all you have.)
A different loss function
Now, consider instead an alternative game. Suppose that, for every element of the sequence you get wrong, I will take a dollar. Here's the loss function:
I call this the "label loss function." This is actually much closer to what's commonly used to evaluate taggers in NLP.
Let's again imagine that
is a random variable distributed according to the HMM's implied distribution over label sequences given
.
Equivalently, we should choose each ci to maximize:
The quantities above can be computed using forward and backward probabilities, as discussed in the lecture. This is sometimes called minimum risk decoding or posterior decoding; it's not widely used in NLP, though it's arguably more principled in some settings than the standard MAP decoding.
Final note
While these two sequence guessing algorithms are rather different, they are both motivated by the same trick, which is to assume that the hidden structure is really a random variable distributed according to a known model. The model in both cases is the same. The difference between the two approaches is in how we are penalized. Different loss functions lead to different guessing algorithms. It's interesting to note that minimizing the label loss function can result in a
that could not generate
in its entirety under the HMM, and that MAP decoding can pick an extremely unlikely
at a particular time step i, in the interest of having a coherent label sequence. Could you implement an algorithm that minimizes a linear interpolation of the two expected loss functions?
--Noah 18:26, 4 September 2007 (EDT)
