Lexical Semantics Resources for English

This page provides resources for computational analysis of English lexical semantics, including:

- annotations of comprehensive multiword expressions and of noun, verb, and preposition supersenses for a 55,000-token corpus of web reviews, and
- a tool (trained on the corpus) that identifies multiword expressions and noun/verb supersenses in context. (It does not currently predict preposition supersenses.)

These were developed by Nathan Schneider, Noah Smith, and others primarily at Carnegie Mellon University.

News:

STREUSLE is now maintained on GitHub.
The SemEval 2016 shared task on Detecting Minimal Semantic Units and their Meanings (DiMSUM) extended the joint MWE+supersense task to additional domains. You can download the full DiMSUM data (the training set includes a slightly simplified version of STREUSLE 2.1). Refer to the shared task paper for details.

Download

STREUSLE is a corpus of online reviews hand-annotated for comprehensive multiword expressions and for noun, verb, and preposition supersenses.

[Image: muffins with streusel topping]
Not to be confused with streusel, the delicious dessert topping shown above.

A partial example:
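I googled|communication restaurants|GROUP in|Locus the area|LOCATION and Fuji_Sushi|GROUP came_up|communication and reviews|COMMUNICATION were|stative great so I made_|communication a carry_out|possession _order
(The label after each "|" is the supersense of the preceding word or expression. Underscores join the words of a multiword expression, including the gappy expression made_ ... _order.)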
The annotations cover the reviews section of the English Web Treebank. The source text and gold part-of-speech tags are redistributed here with the permission of Google and LDC.
MWE Annotation Guidelines, new guidelines for prepositional verbs
Guidelines and examples for noun supersense annotation (from the AQMAR Arabic Wikipedia Supersense Corpus)
Verb Supersense Annotation Guidelines
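
To make the structure of the partial example above concrete, here is a minimal Python 3 sketch that transcribes it by hand into a list of lexical units. The LexicalUnit type and both helper functions are hypothetical names chosen for illustration. They are not part of the STREUSLE release, and this is not the corpus file format. The mapping from label case to category (uppercase for nouns, lowercase for verbs, capitalized for prepositions) is an assumption read off the example itself.

```python
from typing import List, NamedTuple, Optional


class LexicalUnit(NamedTuple):
    """Hypothetical container: one word or one multiword expression (MWE)."""
    tokens: List[str]          # a single token, or several for an MWE
    supersense: Optional[str]  # None where the example shows no label


# Hand transcription of the partial example, ordered by each unit's first token.
EXAMPLE = [
    LexicalUnit(["I"], None),
    LexicalUnit(["googled"], "communication"),
    LexicalUnit(["restaurants"], "GROUP"),
    LexicalUnit(["in"], "Locus"),
    LexicalUnit(["the"], None),
    LexicalUnit(["area"], "LOCATION"),
    LexicalUnit(["and"], None),
    LexicalUnit(["Fuji", "Sushi"], "GROUP"),
    LexicalUnit(["came", "up"], "communication"),
    LexicalUnit(["and"], None),
    LexicalUnit(["reviews"], "COMMUNICATION"),
    LexicalUnit(["were"], "stative"),
    LexicalUnit(["great"], None),
    LexicalUnit(["so"], None),
    LexicalUnit(["I"], None),
    LexicalUnit(["made", "order"], "communication"),  # gappy MWE: made ... order
    LexicalUnit(["a"], None),
    LexicalUnit(["carry", "out"], "possession"),      # sits inside the gap
]


def multiword_expressions(units: List[LexicalUnit]) -> List[LexicalUnit]:
    """Return the units that span more than one token."""
    return [u for u in units if len(u.tokens) > 1]


def category(label: str) -> str:
    """Guess the supersense category from the label's case (assumption)."""
    if label.isupper():
        return "noun"
    if label.islower():
        return "verb"
    return "preposition"


if __name__ == "__main__":
    for unit in multiword_expressions(EXAMPLE):
        print("_".join(unit.tokens), unit.supersense, category(unit.supersense))
```

Storing each expression as a list of tokens rather than a contiguous span is what lets the gappy expression (made ... order) and the expression nested in its gap (carry out) coexist in one flat list.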

English Multiword Expression Lexicons (README.md, LICENSE)

This includes 9 lexicons mapped to a common JSON format, as well as a hierarchical clustering of words in a large corpus.
The SAID lexicon is distributed by LDC. We provide a script to extract its entries in our JSON format. (The script depends on NLTK 2.0.4.)

AMALGrAM 2.0 (README.md, LICENSE) is a statistical tagger for multiword expressions and supersenses that was trained on the STREUSLE 2.0 corpus.

Dependencies: Python 2.7, Cython, NLTK 3.0.2+ with WordNet downloaded
The simplest way to run AMALGrAM is with the pretrained model for English. You will also need to install the lexicons (including SAID for best results).
The evaluation script in this release contains a bug affecting Exact Match MWE scores only. The bug is fixed in this version of mweval.py.
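
For readers unfamiliar with the metric mentioned above, the following Python 3 sketch shows one common reading of exact-match MWE scoring: a predicted expression counts as correct only if its full set of token positions equals that of a gold expression. This is an illustration under that assumption, not the code of mweval.py. The function name and the representation of an MWE as a frozenset of token offsets are hypothetical.

```python
from typing import FrozenSet, Set, Tuple

# Assumed representation: an MWE is the set of token offsets it covers,
# which handles gappy expressions as well as contiguous ones.
MWE = FrozenSet[int]


def exact_match_prf(gold: Set[MWE], pred: Set[MWE]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 where only identical MWEs count as correct."""
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = {frozenset({0, 1}), frozenset({3, 5})}
    pred = {frozenset({0, 1}), frozenset({3, 4})}  # second MWE overlaps but differs
    print(exact_match_prf(gold, pred))
```

In the usage example the second predicted expression shares a token with a gold expression but is not identical to it, so it earns no credit under exact match.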

POS tagging/syntactic dependency parsing models for web text: These were trained on the non-reviews sections of the English Web Treebank, i.e., the weblogs, newsgroups, email, and question-answers sections.

POS tagging model (1.3MB, requires ARK TweetNLP POS Tagger)
Dependency parsing model for Stanford basic dependencies (508MB, requires TurboParser) (Thanks to Lingpeng Kong)

Old Releases

The Comprehensive Multiword Expressions corpus: CMWE 1.0 (README.md, LICENSE) was described in the LREC paper and used for the TACL paper. It is superseded by STREUSLE 2.0.
The version of the multiword expression identification system used in the TACL paper: AMALGr 1.0
- Dependencies: Python 2.7, Cython, NLTK 2.0.4 with WordNet downloaded
- Pretrained model for English
- A script is provided to train and test with the above corpus (with splits) and lexicons to emulate the setup in the TACL paper. This also requires the ARK TweetNLP POS Tagger and the POS tagging model trained on the non-reviews sections of the English Web Treebank.
- The evaluation script in this release contained a bug affecting Exact Match MWE scores only. The bug is fixed in this version of mweval.py.
STREUSLE 2.0 (README.md, LICENSE)
- List of annotated MWEs: frequency count, lemmas, strength, POS tags
STREUSLE 2.1 (README.md, TAGSET.md, LICENSE) is a corpus of online reviews hand-annotated for comprehensive multiword expressions and for noun and verb supersenses.
- Train and test splits used in experiments for the TACL and NAACL-HLT papers.
- List of annotated MWEs: frequency count, lemmas, strength, POS tags
STREUSLE 3.0 (README.md, TAGSET.md, LICENSE)
- List of annotated MWEs: lowercased words, strength, POS tags, frequency count
- PrepWiki lexicon for preposition supersenses (DEPRECATED)
