TurboParser (Dependency Parser with Linear Programming)

This page provides a link to TurboParser, a free multilingual dependency parser developed by André Martins.
It is based on joint work with Noah Smith, Mário Figueiredo, Eric Xing, and Pedro Aguiar.

Background

Dependency parsing builds on a lightweight syntactic formalism that relies on lexical relationships between words. Nonprojective dependency grammars may generate languages that are not context-free, offering a formalism that is arguably more adequate for some natural languages. Statistical parsers, learned from treebanks, have achieved the best performance in this task. While only local (arc-factored) models allow for exact inference, it has been shown that including non-local features and performing approximate inference can greatly increase accuracy.
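
To make the term "arc-factored" concrete, here is a minimal Python sketch that scores a dependency tree as the sum of independent per-arc scores. The score matrix, its values, and the function name tree_score are hypothetical and chosen only for illustration; they are not part of TurboParser.

```python
def tree_score(arc_scores, heads):
    """Arc-factored score: the sum of the scores of all (head, modifier) arcs.

    arc_scores[h][m] is an invented score for the arc h -> m, where index 0
    stands for the artificial root. heads[m - 1] is the head of word m
    (words are numbered from 1).
    """
    return sum(arc_scores[h][m] for m, h in enumerate(heads, start=1))


# Toy scores for a 2-word sentence (values are made up for illustration).
arc_scores = [
    [0.0, 2.0, 1.0],  # arcs leaving the root
    [0.0, 0.0, 3.0],  # arcs leaving word 1
    [0.0, 0.5, 0.0],  # arcs leaving word 2
]

score_a = tree_score(arc_scores, [0, 1])  # root -> 1, 1 -> 2
score_b = tree_score(arc_scores, [2, 0])  # 2 -> 1, root -> 2
```

Because the score is a plain sum over arcs, the highest-scoring tree can be found exactly (for instance, with a maximum spanning tree algorithm in the nonprojective case). Features that look at several arcs at once, such as siblings or grandparents, break this decomposition, which is where approximate inference comes in.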

This package contains a C++ implementation of a dependency parser based on the papers [1,2,3,4,5] below. The latest version of this package also contains C++ implementations of a POS tagger, a semantic role labeler, an entity tagger, a coreference resolver, and a constituent (phrase-based) parser. The relevant references are the papers [6,7,8,9] below.

This package allows:


Demo

News

We released TurboParser v2.3 on November 6th, 2015! This version introduces some new features:

We released TurboParser v2.2 on June 26th, 2014! This version introduces some new features:

We released TurboParser v2.1 on May 23rd, 2013! This version introduces some new features:

We released TurboParser v2.0 on September 20th, 2012! This version introduces a number of new features:

Note: The runtimes above are approximate, and based on experiments on a desktop machine with an Intel Core i7 CPU (3.4 GHz) and 8 GB of RAM.

To run this software, you need a standard C++ compiler. This software has the following external dependencies: AD3, a library for approximate MAP inference; Eigen, a template library for linear algebra; google-glog, a library for logging; and gflags, a library for command-line flag processing. All these libraries are free software and are provided as tarballs in this package.

This software has been tested on several Linux platforms. It has also been compiled successfully on Mac OS X and MS Windows (using MSVC).


Further Reading

The main technical ideas behind this software appear in the papers:

[1]  André F. T. Martins, Noah A. Smith, and Eric P. Xing.
Concise Integer Linear Programming Formulations for Dependency Parsing.
Annual Meeting of the Association for Computational Linguistics (ACL'09), Singapore, August 2009.
[2]  André F. T. Martins, Noah A. Smith, and Eric P. Xing.
Polyhedral Outer Approximations with Application to Natural Language Parsing.
International Conference on Machine Learning (ICML'09), Montreal, Canada, June 2009.
[3]  André F. T. Martins, Noah A. Smith, Eric P. Xing, Mário A. T. Figueiredo, and Pedro M. Q. Aguiar.
TurboParsers: Dependency Parsing by Approximate Variational Inference.
Empirical Methods in Natural Language Processing (EMNLP'10), Boston, USA, October 2010.
[4]  André F. T. Martins, Noah A. Smith, Mário A. T. Figueiredo, and Pedro M. Q. Aguiar.
Dual Decomposition With Many Overlapping Components.
Empirical Methods in Natural Language Processing (EMNLP'11), Edinburgh, UK, July 2011.
[5]  André F. T. Martins, Miguel B. Almeida, and Noah A. Smith.
Turning on the Turbo: Fast Third-Order Non-Projective Turbo Parsers.
Annual Meeting of the Association for Computational Linguistics (ACL'13), Sofia, Bulgaria, August 2013.
[6]  André F. T. Martins and Mariana S. C. Almeida.
Priberam: A Turbo Semantic Parser with Second Order Features.
International Workshop on Semantic Evaluation (SemEval), task 8: Broad-Coverage Semantic Dependency Parsing, Dublin, August 2014.
[7]  Lev Ratinov and Dan Roth.
Design Challenges and Misconceptions in Named Entity Recognition.
Conference on Computational Natural Language Learning (CoNLL'09), 2009.
[8]  Greg Durrett and Dan Klein.
Easy Victories and Uphill Battles in Coreference Resolution.
Empirical Methods in Natural Language Processing (EMNLP'13), 2013.
[9]  Daniel F.-González and André F. T. Martins.
Parsing As Reduction.
Annual Meeting of the Association for Computational Linguistics (ACL'15), Beijing, China, August 2015.


Download

The latest version of TurboParser is TurboParser v2.3.0 [~5.4 MB, .tar.gz format]. See the README file for instructions on compiling and running the software and on the file formats it expects. The package does not include the data sets used in the papers; for information about how to get these data sets, please go to http://nextens.uvt.nl/~conll. Bear in mind that some data sets must be separately licensed through the LDC.

In addition, we separately provide the following pre-trained models (note that these are very large files):

Finally, this package provides a script, "parse.sh", that lets you tag and parse free text (in English, one sentence per line) with the models above. Just type:

./scripts/parse.sh <filename>

where <filename> is a text file with one sentence per line. If no filename is specified, the script reads from stdin. For example,
echo "I solved the problem with statistics." | ./scripts/parse.sh

yields
1       I               _       PRP     PRP     _       2       SUB
2       solved          _       VBD     VBD     _       0       ROOT
3       the             _       DT      DT      _       4       NMOD
4       problem         _       NN      NN      _       2       OBJ
5       with            _       IN      IN      _       2       VMOD
6       statistics      _       NNS     NNS     _       5       PMOD
7       .               _       .       .       _       2       P
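
The output above has one token per line with eight columns. As a rough illustration of how it could be consumed downstream, the following Python sketch reads such lines into (id, form, head, label) tuples. It assumes the CoNLL-X column order (ID, form, lemma, coarse POS, POS, features, head, dependency label) and that no field contains whitespace; the function name read_parse is hypothetical and not part of TurboParser.

```python
def read_parse(lines):
    """Turn CoNLL-style token lines into (id, form, head, label) tuples."""
    tokens = []
    for line in lines:
        fields = line.split()
        if not fields:  # skip blank lines (they separate sentences)
            continue
        # Assumed columns: ID FORM LEMMA CPOS POS FEATS HEAD DEPREL
        tokens.append((int(fields[0]), fields[1], int(fields[6]), fields[7]))
    return tokens


sample = """\
1  I           _  PRP  PRP  _  2  SUB
2  solved      _  VBD  VBD  _  0  ROOT
3  the         _  DT   DT   _  4  NMOD
4  problem     _  NN   NN   _  2  OBJ
5  with        _  IN   IN   _  2  VMOD
6  statistics  _  NNS  NNS  _  5  PMOD
7  .           _  .    .    _  2  P
"""

tokens = read_parse(sample.splitlines())
# The word whose head is 0 is attached to the artificial root.
root = next(form for _, form, head, _ in tokens if head == 0)
```

Splitting on arbitrary whitespace keeps the sketch short; for files where fields are strictly tab-separated, splitting on tabs would be the safer choice.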

Older versions:


Contributing to TurboParser

For questions, bug fixes and comments, please e-mail afm [at] cs.cmu.edu.

To contribute to TurboParser, you can fork the following github repository: http://github.com/andre-martins/TurboParser.

To receive announcements about updates to TurboParser, join the ARK-tools mailing list.


Acknowledgments

A. M. was supported by an FCT/ICTI grant through the CMU-Portugal Program, and by Priberam. This work was partially supported by the FET programme (EU FP7), under the SIMBAD project (contract 213250), by National Science Foundation grant IIS-1054319, and by the QNRF grant NPRP 08-485-1-083.