Code for Unsupervised Grammar Induction

DAGEEM 1.0 - Dependency and Grammar Estimation with Expectation-Maximization
By Shay Cohen

Background

Unsupervised grammar induction is the task of learning a grammar when the only input is sentences in natural language, with no syntactic annotation. Besides being challenging and instructive in its own right, the task can benefit applications that need a parser at some stage in their pipeline, especially applications intended for languages for which annotated data is scarce or unavailable. The backbones of such parsers are probabilistic grammars, which encode the structure of the relevant language.

DAGEEM is a piece of software, originally written in C++ (the current release is a Java rewrite), designed to estimate the parameters of such probabilistic grammars. Currently, DAGEEM supports only the DMV ("dependency model with valence"), introduced by Klein and Manning (2004). The DMV is widely recognized as an effective model for dependency grammar induction and has recently been used in various settings for this task.
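To make the DMV more concrete, the following is a minimal sketch of how a DMV-style model scores a single dependency tree, following the generative story of Klein and Manning (2004): the root is chosen first, and then every head repeatedly decides, in each direction, whether to stop or to generate another child. The stop decision depends on the head, the direction, and whether a child has already been generated in that direction (the "valence" part). The function name, the parameter tables and all of the numbers are hypothetical and chosen only for illustration; this is not DAGEEM's code or interface.

```python
import math

def dmv_log_prob(tags, heads, p_root, p_stop, p_choose):
    """Log-probability of one dependency tree under a DMV-style model.

    tags:     list of part-of-speech tags, one per token
    heads:    heads[i] is the index of token i's head, or -1 for the root
    p_root:   p_root[tag] = probability that tag heads the sentence
    p_stop:   p_stop[(head_tag, direction, adjacent)] = probability of stopping;
              adjacent is True while no child has been generated in that direction
    p_choose: p_choose[(head_tag, direction)][child_tag] = probability of that child
    """
    logp = 0.0
    for h, h_tag in enumerate(tags):
        if heads[h] == -1:
            logp += math.log(p_root[h_tag])
        left = [c for c in range(len(tags)) if heads[c] == h and c < h]
        right = [c for c in range(len(tags)) if heads[c] == h and c > h]
        for direction, children in (("L", left), ("R", right)):
            adjacent = True
            for c in children:
                # Decide not to stop, then choose the child's tag.
                logp += math.log(1.0 - p_stop[(h_tag, direction, adjacent)])
                logp += math.log(p_choose[(h_tag, direction)][tags[c]])
                adjacent = False
            # Finally decide to stop in this direction.
            logp += math.log(p_stop[(h_tag, direction, adjacent)])
    return logp

# Toy sentence "the dog barks": DT <- NN <- VB, with VB as the root.
tags = ["DT", "NN", "VB"]
heads = [1, 2, -1]

# Hypothetical parameters; only the entries this tree needs are listed.
p_root = {"VB": 0.8, "NN": 0.15, "DT": 0.05}
p_stop = {
    ("DT", "L", True): 0.9, ("DT", "R", True): 0.9,
    ("NN", "L", True): 0.4, ("NN", "L", False): 0.8, ("NN", "R", True): 0.7,
    ("VB", "L", True): 0.2, ("VB", "L", False): 0.9, ("VB", "R", True): 0.5,
}
p_choose = {
    ("NN", "L"): {"DT": 0.7, "NN": 0.2, "VB": 0.1},
    ("VB", "L"): {"NN": 0.6, "DT": 0.3, "VB": 0.1},
}

log_prob = dmv_log_prob(tags, heads, p_root, p_stop, p_choose)
print("tree probability:", math.exp(log_prob))
```

Estimating a grammar amounts to choosing the entries of tables like these so that the observed sentences become likely, with the trees themselves treated as hidden.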

Central to DAGEEM is the use of a logistic normal distribution over the grammar parameters. The logistic normal distribution, which Blei and Lafferty (2006) used successfully for topic modeling, offers conceptual and empirically tested advantages over the Dirichlet distribution, which is more commonly used because of its mathematical elegance. DAGEEM extends the inference algorithm suggested by Blei and Lafferty and uses a variational approximation in a Bayesian setting to estimate the grammar parameters.
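As an illustration of what a logistic normal draw looks like, here is a minimal sketch under one common parameterization: sample a Gaussian vector with a given mean and covariance, then pass it through a softmax to obtain a point on the probability simplex. Unlike a Dirichlet, the covariance matrix can encode correlations between components (here, components 0 and 1 tend to rise and fall together). The function names and numbers are assumptions made for this example; the sketch only samples from the prior and does not implement DAGEEM's variational inference.

```python
import math
import random

def cholesky(cov):
    """Return lower-triangular L with L * L^T == cov (cov must be symmetric positive definite)."""
    n = len(cov)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(cov[i][i] - s)
            else:
                L[i][j] = (cov[i][j] - s) / L[j][j]
    return L

def softmax(eta):
    """Map a real vector to a probability vector (shifted by the max for numerical stability)."""
    m = max(eta)
    exps = [math.exp(x - m) for x in eta]
    z = sum(exps)
    return [e / z for e in exps]

def sample_logistic_normal(mean, cov, rng):
    """Draw eta ~ Normal(mean, cov) and return softmax(eta)."""
    L = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in mean]
    eta = [mean[i] + sum(L[i][k] * z[k] for k in range(i + 1))
           for i in range(len(mean))]
    return softmax(eta)

rng = random.Random(0)  # fixed seed so the run is repeatable
mean = [0.0, 0.0, 0.0]
cov = [[1.0, 0.8, 0.0],   # components 0 and 1 are positively correlated
       [0.8, 1.0, 0.0],
       [0.0, 0.0, 1.0]]

theta = sample_logistic_normal(mean, cov, rng)
print(theta)  # a probability vector, e.g. one multinomial of a grammar
```

In a grammar, each multinomial (for example, the distribution over children of a given head) would be one such probability vector.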

Further Reading

The main technical ideas behind this software appear in the following papers:
          S. B. Cohen and N. A. Smith. Covariance in Unsupervised Learning of Probabilistic Grammars. In Journal of Machine Learning Research, 11(Nov):3017-3051 (2010).
          S. B. Cohen and N. A. Smith. Shared Logistic Normal Distributions for Soft Parameter Tying in Unsupervised Grammar Induction. In Proceedings of NAACL-HLT (2009).
          S. B. Cohen, K. Gimpel and N. A. Smith. Logistic Normal Priors for Unsupervised Probabilistic Grammar Induction. In Advances in Neural Information Processing Systems 21 (NIPS 2008).

Download

The latest version of DAGEEM can be downloaded here. The package has been rewritten in Java, and the old C++ code is no longer distributed; contact me if you are interested in it. The package does not include the data sets used in the papers. The data sets are based on the Penn Treebank and therefore must be licensed separately through the LDC.

For questions, bug fixes and comments, please e-mail scohen [strudel] cs.cmu.edu.