TurboParser (Dependency Parser)

TurboParser 0.1 - Dependency Parsing with Integer Linear Programming
by André Martins
News: Stay tuned for TurboParser 0.2!

Background

Dependency parsing builds on a lightweight syntactic formalism that relies on lexical relationships between words. Nonprojective dependency grammars may generate languages that are not context-free, offering a formalism that is arguably more adequate for some natural languages. Statistical parsers, learned from treebanks, have achieved the best performance in this task. While only local (arc-factored) models allow for exact inference, it has been shown that including non-local features and performing approximate inference can greatly increase performance.
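
To make the terms above concrete, here is a minimal Python sketch (not part of the TurboParser package) of two notions from this paragraph: testing whether a dependency tree is projective, and scoring a tree under an arc-factored model. The head-array representation, the function names `is_projective` and `arc_factored_score`, and the example scores are illustrative assumptions, not the package's data structures or API.

```python
from typing import Dict, List, Tuple

# Assumed representation: heads[i] is the head of word i+1 (words are
# numbered from 1), and 0 denotes an artificial root placed before the
# first word.


def is_projective(heads: List[int]) -> bool:
    """Return True if no two arcs of the tree cross each other."""
    # Store each arc as an (left endpoint, right endpoint) span.
    arcs = [(min(h, m), max(h, m)) for m, h in enumerate(heads, start=1)]
    for a, b in arcs:
        for c, d in arcs:
            # Two arcs cross when exactly one endpoint of the second arc
            # lies strictly inside the span of the first.
            if a < c < b < d:
                return False
    return True


def arc_factored_score(heads: List[int],
                       arc_scores: Dict[Tuple[int, int], float]) -> float:
    """Score a tree as the sum of the scores of its (head, modifier) arcs."""
    return sum(arc_scores[(h, m)] for m, h in enumerate(heads, start=1))


# Word 2 is the root; words 1 and 3 attach to it.
projective_tree = [2, 0, 2]
# Word 3 is the root; the arcs 3 -> 1 and 4 -> 2 cross.
nonprojective_tree = [3, 4, 0, 3]

# Hypothetical arc scores, for illustration only.
example_scores = {(2, 1): 1.0, (0, 2): 2.0, (2, 3): 0.5}
```

In an arc-factored model the score decomposes over individual arcs, which is what makes exact inference tractable; the non-local features mentioned above break this decomposition.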

This package contains a C++ implementation of the unlabeled dependency parser described in the papers [1] and [2] below.

This package allows:

To run this software, you need to have ILOG CPLEX installed on your system. ILOG CPLEX is a commercial MILP solver. For more information regarding ILOG CPLEX, please go to http://www.ilog.com/products/cplex. You also need to have the Boost C++ libraries installed on your system. Boost is free software and can be obtained here.

Note: In addition, we provide a simple dependency labeler, implemented in Java, that produces labeled dependency parse trees from unlabeled ones. It is also possible to learn this labeler from a treebank. Thanks to Ryan McDonald for providing us with code, which we adapted for this purpose.

Further Reading

The main technical ideas behind this software appear in the papers:

[1]  André F. T. Martins, Noah A. Smith, and Eric P. Xing.
Concise Integer Linear Programming Formulations for Dependency Parsing.
Annual Meeting of the Association for Computational Linguistics (ACL'09), Singapore, August 2009.
[2]  André F. T. Martins, Noah A. Smith, and Eric P. Xing.
Polyhedral Outer Approximations with Application to Natural Language Parsing.
International Conference on Machine Learning (ICML'09), Montreal, Canada, June 2009.

Download

The latest version of TurboParser can be downloaded here [~2.5 MB, .tar.gz format]. The package includes instructions for compiling and running the parser. It does not include the data sets used in the papers; for information about how to get these data sets, please go to http://nextens.uvt.nl/~conll. Bear in mind that some data sets must be separately licensed through the LDC.
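
For readers preparing data, the following Python sketch shows one way to read word forms and head indices from a treebank file. It assumes the data follow the CoNLL-X layout (one token per line, ten tab-separated columns with the word form in column 2 and the head index in column 7, and a blank line between sentences); the function name `read_conllx` is hypothetical and is not part of the package, so check the instructions shipped with the package for the input format the parser actually expects.

```python
from typing import List, Tuple

# A sentence is a list of (form, head) pairs; head 0 denotes the root.
Sentence = List[Tuple[str, int]]


def read_conllx(text: str) -> List[Sentence]:
    """Parse CoNLL-X style text into sentences of (form, head) pairs."""
    sentences: List[Sentence] = []
    current: Sentence = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        # Column 2 (index 1) is FORM; column 7 (index 6) is HEAD.
        current.append((cols[1], int(cols[6])))
    if current:
        # Handle a final sentence with no trailing blank line.
        sentences.append(current)
    return sentences


# A made-up two-word sentence, for illustration only.
sample = (
    "1\tJohn\t_\tN\tN\t_\t2\tSBJ\t_\t_\n"
    "2\tsleeps\t_\tV\tV\t_\t0\tROOT\t_\t_\n"
    "\n"
)
```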

In addition, we provide the following pre-trained models (note that these are very large files):

For an example of how to apply any of these models to parse new data, take a look at this script.

For questions, bug reports, and comments, please e-mail afm [at] cs.cmu.edu.