Fall08:Bosaghzadeh and Schneider
Reza Bosagh Zadeh & Nathan Schneider
"Unsupervised Learning for Sequence Tagging, Morphology Induction, and Lexical Resource Acquisition"
My question is about the Haghighi paper on learning bilingual lexicons from monolingual text. I was skeptical when I first saw this paper presented since the authors use "orthographic features" and report high F1's on language pairs such as FR-EN and ES-EN. However, when they switch to pairs such as AR-EN, even the hand-picked "best F1" drops to the 30's. You also listed this method as having the highest "semantic gain", though it seems this method draws most of its power from substring matches. Am I missing the true power of the model or is it that this research is still in its early stages? --Jon
In addition to orthographic features, they use "word context" features, which perform well even across disparate language pairs. If you look at slide 27 in our slides (which was not presented, for lack of time), you'll see that for EN-ES, word context and orthographic features give the same amount of gain when used separately, and their gains add up when used together. When the orthographies don't match up, they still do reasonably well with just word context features. But you're right that they don't do too well overall; unsupervised bilingual lexicon induction is still pretty difficult. It's also worth noting that their baseline is edit-distance maximum bipartite matching, which they consistently beat, so it's not just easy substring matches. --Reza
Also, the "semantic gain" metric is an idealization assuming the method "works." It does not incorporate performance because it is impossible to compare performance across different types of tasks. (Qualitatively, the state of the art in POS tagging is indeed better than the performance for bilingual lexicon induction.) I suppose semantic gain is a measure of the hardness of a particular task (as defined by its input and output), not a measure of any approach or result. --Nathan
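For readers unfamiliar with the edit-distance matching baseline Reza mentions, here is a minimal sketch of the idea, not the authors' implementation. It assumes a toy pair of word lists and finds the one-to-one matching with the lowest total Levenshtein distance by brute force; the names `edit_distance` and `best_matching` and the example words are made up for illustration. A real baseline over thousands of words would need a proper assignment algorithm rather than enumeration.

```python
from itertools import permutations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard row-by-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def best_matching(source, target):
    """Brute-force one-to-one matching with minimum total edit distance.

    Only feasible for toy inputs: it enumerates every assignment.
    """
    if len(source) > len(target):
        raise ValueError("need at least as many target words as source words")
    best_cost, best_pairs = None, None
    for perm in permutations(target, len(source)):
        cost = sum(edit_distance(s, t) for s, t in zip(source, perm))
        if best_cost is None or cost < best_cost:
            best_cost, best_pairs = cost, list(zip(source, perm))
    return best_pairs, best_cost


# Hypothetical toy lexicons (accents stripped for simplicity).
spanish = ["nacion", "importante", "perro"]
english = ["dog", "nation", "important"]
pairs, cost = best_matching(spanish, english)
print(pairs, cost)
```

The cognates pair up easily, while "perro"/"dog" is matched only because nothing else is left. That is the kind of case where string similarity carries no signal and context features have to do the work.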
RE: the multilingual morpheme segmentation paper (Snyder & Barzilay). What would you say is the contribution of this paper? We already know that some families of languages share morphemes (e.g., Romance languages: French, Spanish, etc.). Is there any real-life need to find this mapping? --Jose
- They showed that modeling the implicit mapping between two languages' morphemes improved monolingual segmentation performance for those languages. In other words, the morphological models in the two languages mutually constrain each other via the parallel phrases. Note that they require only parallel phrases for training (morpheme segmentations/alignments are never observed), and only monolingual phrases at test time. Also, the two languages need not be related for an improvement; English-Arabic parallel phrases led to an improvement in Arabic segmentation, but not as much as Hebrew-Arabic phrases when phonological correspondences were exploited. --Nathan
I have a question somewhat related to your and our review. You have already compared the two nonparametric methods, one for word segmentation, and the other for multilingual morphological analysis. Given that you have seen other methods, how do you evaluate the generative models that these two methods have proposed? Did you find their base distributions reasonable, or do you think maybe a different base distribution in either case would work better? What about their generative model? That's a difficult question but I hope you can give me some intuitions. --narges
- An interesting question--we can only speculate, since we haven't done experiments with variants of the models. :) I would say the Goldwater et al. word segmentation model seems reasonable; however, the base distribution is a unigram model over characters. I suspect increasing the order of the character model would make estimation more difficult, and it's hard to know whether it would improve performance. As for Snyder & Barzilay, we opined on p. 17 of the review that a couple of the distributions in the generative process are suspect. But again, it's possible this was necessary to make estimation practical, and who knows how much it affects overall performance. --Nathan
- Thanks for the insight! I only want to mention that a different base distribution (bigram instead of unigram) wouldn't necessarily affect the sampler's time complexity. With a bigram base distribution we would need to estimate some additional parameters, but compared to the other counts that we need to keep, this is not a problem at all... --narges
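To make the unigram-versus-bigram point concrete, here is a hedged sketch of two character-level base distributions over words. It is a simplification for illustration, not the exact parameterization from Goldwater et al.: the unigram version assumes a geometric length model with a hypothetical stop probability `p_stop`, and the bigram version uses add-`alpha` smoothed counts with a boundary symbol `#`. All function names are invented here.

```python
from collections import Counter
import math

BOUNDARY = "#"  # assumed word-boundary symbol, not part of the alphabet


def unigram_base_logprob(word, char_probs, p_stop=0.5):
    """log P0(word): geometric length times independent character draws.

    Assumes a non-empty word and that char_probs covers every character.
    """
    n = len(word)
    logp = math.log(p_stop) + (n - 1) * math.log(1.0 - p_stop)
    return logp + sum(math.log(char_probs[c]) for c in word)


def train_bigram_counts(words):
    """Collect the extra count table a bigram base distribution needs."""
    counts, context = Counter(), Counter()
    for w in words:
        padded = BOUNDARY + w + BOUNDARY
        for prev, curr in zip(padded, padded[1:]):
            counts[(prev, curr)] += 1
            context[prev] += 1
    return counts, context


def bigram_base_logprob(word, counts, context, alphabet_size, alpha=1.0):
    """log P0(word) under an add-alpha smoothed character bigram model."""
    padded = BOUNDARY + word + BOUNDARY
    logp = 0.0
    for prev, curr in zip(padded, padded[1:]):
        num = counts[(prev, curr)] + alpha
        # +1 because the boundary symbol is also a possible next outcome
        den = context[prev] + alpha * (alphabet_size + 1)
        logp += math.log(num / den)
    return logp


uniform = {c: 0.25 for c in "abcd"}
lp_uni = unigram_base_logprob("ab", uniform)

counts, context = train_bigram_counts(["ab", "ab", "ba"])
lp_ab = bigram_base_logprob("ab", counts, context, alphabet_size=2)
lp_bb = bigram_base_logprob("bb", counts, context, alphabet_size=2)
print(lp_uni, lp_ab, lp_bb)
```

Both functions cost time linear in the word length per evaluation; the bigram variant only adds one lookup table, which is consistent with narges's remark. Whether the richer base distribution would help segmentation accuracy is a separate empirical question that this sketch does not address.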
Hey, I liked the presentation. (You probably won't see this since I am rather late. Nonetheless.) Re CCA: do you think the idea is applicable to something besides bilingual lexicon induction? I don't know whether this is a well-studied task, but I have not seen other work that tackles it in an unsupervised fashion. In many areas, unsupervised methods are employed when annotated resources are very scarce. Does lexicon induction suffer from this problem? I am a bit puzzled about what the motivation for unsupervised learning is here. There seems to be parallel data available (such as the data they used in the evaluation). Might it be usable with less sophisticated methods than a fully unsupervised analysis like H&K's? Does the paper have a good performance evaluation against existing supervised methods? On the other hand, I think CCA might have an interesting application to co-reference resolution. I have also heard that Tom Mitchell's group is applying CCA to map between brain images and the words that evoked them.
- There are plenty of resources when it comes to bilingual lexicons. Dictionaries have been written for many languages, almost entirely removing the need for us to learn them. However, for many rare languages this has not been done. For those languages, we also don't have much annotated data. Really the impressive thing is that they manage to get a lexicon from nothing. I wouldn't say that this paper in itself is of practical use, but extensions and improvements upon its ideas sound exciting to me.
- It's worth saying that when we have parallel corpora, it's very easy to extract a dictionary from them, and that actually works much better than this unsupervised method. But of course parallel data is hard to come by.
--Taey 16:28, 12 December 2008 (EST)
One other question, about the narrative event paper by Chambers and Jurafsky: what do they do with the graphs after they are constructed? Are the edges labeled so that the semantic relationships are explicit? Also, how did they evaluate the quality of the resulting graphs? Is it clear in the paper?
- What is learned are untyped pairwise "narrative similarity" scores between pairs of verbs (and pairwise temporal orderings). From these relations they are able to extract nice graphs for demonstration, but don't evaluate the graphs per se. Their evaluation is a cloze task where they show they can predict what events are likely to come next in a document given the previous ones. As for applications, the best graphs are similar to FrameNet frames, so one would imagine they might be used for semantic role labeling or semantic parsing--though it's unclear how many complex narratives are learned with any accuracy, so there would probably have to be human refinement of the results. --Nathan
--Taey 16:35, 12 December 2008 (EST)
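As an illustration of the cloze-style evaluation Nathan describes, here is a minimal sketch under stated assumptions: one event is held out of a chain, and each candidate is scored by summing its pairwise similarity to the remaining events. The score table, the event names, and the function `rank_candidates` are hypothetical toy values, not figures or code from the paper.

```python
def rank_candidates(chain, candidates, score):
    """Rank candidate events by total pairwise similarity to the chain.

    `score` maps an unordered pair of events (a frozenset) to a similarity
    value; unseen pairs default to 0.0.
    """
    def total(candidate):
        return sum(score.get(frozenset((candidate, event)), 0.0)
                   for event in chain)
    return sorted(candidates, key=total, reverse=True)


# Hypothetical untyped pairwise "narrative similarity" scores.
toy_scores = {
    frozenset(("arrest", "convict")): 2.0,
    frozenset(("charge", "convict")): 3.0,
    frozenset(("arrest", "eat")): 0.1,
}

observed_chain = ["arrest", "charge"]   # the held-out event was "convict"
ranking = rank_candidates(observed_chain, ["eat", "convict", "sleep"], toy_scores)
print(ranking)
```

The position of the held-out event in the ranking is the quantity such an evaluation would aggregate over many documents. Note that nothing here evaluates the extracted graphs themselves, which matches the caveat in the answer above.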
I wanted to pick your brains on a question I've had about the Haghighi & Klein approach. They learn to predict label sequences (POS tags/IE category labels) from small sets of prototypes, and they say that distributional similarity is what solves the problem. I was wondering if I could apply their approach to a more "difficult" labeling problem, as follows. I have a sequence of utterances spoken at a meeting, and I want to label each utterance with either "should be included in notes" or "should not". For example, "The deadline is tomorrow" would be a prototype for "should be included in notes". Now, the utterance "unit" is messier than words, but I could represent utterances as abstractions that make them less so (by using a small set of wisely chosen features, say). Do you think I could still apply the H&K approach to this problem of mine, or would it just not work? What do you guys feel, having read the paper in great detail? --Bano
- An interesting question--the key to their approach is that they are able to extract features that are good for scoring distributional similarity, and word context vectors happen to be good features for part-of-speech similarity. We didn't mention this in the presentation, but they also applied their method to a task called information field segmentation, which was basically tagging words from Craigslist-style housing ads according to one of a small set of topics (NEIGHBORHOOD, UTILITIES, etc. - see p. 8 of our report for details). Your problem seems much harder semantically, as it is not just a matter of individual words, and relying on long word sequences leads to obvious sparsity problems. Thus, I'm not sure what features could be used to measure similarity in your case. Secondly, the method assumes that each category to be predicted is effectively captured by a few prototypes and the similarity metric. In your case there could be many, wholly unrelated types of utterances that would be important (or unimportant), so a few prototypes might not do the trick. If there was a good measure of similarity to prototypes, and the categories were clustered tightly enough so that a few prototypes and the similarity metric are sufficient to describe the category, then I believe Haghighi & Klein's method would be successful. --Nathan
- Thanks Nathan for that detailed answer. I think I understand what you are saying - the utterances with the same label need to be distributed in a similar fashion for this algorithm to work. I don't know if that's the case for my problem, but I'll give it a shot. Thanks! --Bano
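The similarity-to-prototypes idea discussed in this thread can be sketched roughly as follows. This is a simplified illustration and not Haghighi & Klein's implementation: it assumes raw context-count vectors and cosine similarity, and it leaves out any dimensionality reduction and the sequence model that would consume the resulting features. The threshold value, the toy corpus, and all function names are assumptions made for the example.

```python
from collections import Counter
import math


def context_vectors(sentences, window=1):
    """Map each word to counts of (offset, neighbor) pairs around it."""
    vecs = {}
    for sent in sentences:
        for i, word in enumerate(sent):
            vec = vecs.setdefault(word, Counter())
            for off in range(-window, window + 1):
                j = i + off
                if off != 0 and 0 <= j < len(sent):
                    vec[(off, sent[j])] += 1
    return vecs


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def prototype_features(word, vecs, prototypes, threshold=0.35):
    """Return (label, prototype) pairs whose prototype is similar to `word`."""
    feats = []
    for label, protos in prototypes.items():
        for proto in protos:
            if word in vecs and proto in vecs:
                if cosine(vecs[word], vecs[proto]) >= threshold:
                    feats.append((label, proto))
    return feats


corpus = [["the", "dog", "runs"], ["the", "cat", "runs"],
          ["a", "dog", "sleeps"], ["the", "cat", "sleeps"]]
prototypes = {"NOUN": ["dog"], "VERB": ["runs"]}  # hypothetical prototype list
vecs = context_vectors(corpus)
print(prototype_features("cat", vecs, prototypes))
print(prototype_features("sleeps", vecs, prototypes))
```

For the meeting-notes problem, the word context vectors would have to be replaced by some utterance-level feature vector, and the sketch only helps if utterances sharing a label actually land close to a small number of prototypes under the chosen similarity, which is the open question raised in the answer above.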