STREUSLE Dataset
================
STREUSLE stands for Supersense-Tagged Repository of English with a Unified Semantics for Lexical Expressions. It supersedes the Comprehensive Multiword Expressions (CMWE) corpus [1], adding semantic supersenses to the MWE annotations. The supersense labels apply to single- and multiword noun and verb expressions, as described in [2].
STREUSLE and associated documentation and tools can be downloaded from:
This dataset's multiword expression and supersense annotations are licensed under a [Creative Commons Attribution-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-sa/4.0/) license (see LICENSE). The source sentences and part-of-speech annotations are redistributed with permission of Google and the Linguistic Data Consortium, respectively.
References:
- [1] Nathan Schneider, Spencer Onuffer, Nora Kazour, Emily Danchik, Michael T. Mordowanec, Henrietta Conrad, and Noah A. Smith. Comprehensive annotation of multiword expressions in a social web corpus. _Proceedings of the 9th Language Resources and Evaluation Conference_, Reykjavík, Iceland, May 26–31, 2014.
- [2] Nathan Schneider and Noah A. Smith. A corpus and model integrating multiword expressions and supersenses. _Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies_, Denver, Colorado, May 31–June 5, 2015.
Files
-----
- streusle.sst: Initial annotations, in human-readable and JSON formats, along with gold POS tags.
- streusle.tags: Automatic conversion of streusle.sst to the tagging scheme appropriate for training sequence models. A few intricately structured MWEs have been simplified to fit the tagging scheme, and lemmas from the WordNet lemmatizer have been added.
- streusle.tags.sst: Conversion of streusle.tags back to the .sst format, now with lemmas and tags.
.sst Format
-----------
(Based on CMWE's .mwe format.) One sentence per line, with 3 tab-separated columns: sentence ID; human-readable MWE annotation from CMWE; and a JSON data structure containing POS-tagged words, MWE groupings, and supersense annotations. Each supersense annotation is associated with the first token of the expression it applies to. Note that token indices are 1-based.
In .tags.sst files, the JSON object additionally contains lemmas and tags.
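
As a rough illustration of the three-column layout described above, the following Python sketch splits each line on tabs and decodes the third column as JSON. It is not part of the STREUSLE tools: the names `SstRecord`, `read_sst`, and `token_at` are hypothetical, and the sample line (including its `"tokens"` key and its markup column) is invented for demonstration. The keys in the real JSON objects are not spelled out here, so inspect an actual file before relying on any particular key. The `token_at` helper only shows how a 1-based token index maps onto a 0-based Python list.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List, NamedTuple


class SstRecord(NamedTuple):
    """One line of a .sst file (hypothetical container, not from the STREUSLE tools)."""
    sent_id: str          # column 1: sentence ID
    mwe_markup: str       # column 2: human-readable MWE annotation
    data: Dict[str, Any]  # column 3: decoded JSON object


def read_sst(lines: Iterable[str]) -> Iterator[SstRecord]:
    """Yield one SstRecord per non-empty line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Split at most twice so the JSON column is kept in one piece.
        sent_id, mwe_markup, json_str = line.split("\t", 2)
        yield SstRecord(sent_id, mwe_markup, json.loads(json_str))


def token_at(tokens: List[Any], index: int) -> Any:
    """Look up a token by its 1-based index, as used in the annotations."""
    if index < 1:
        raise IndexError("token indices are 1-based")
    return tokens[index - 1]


# Invented sample line: the ID, the markup, and the "tokens" key are
# assumptions for illustration only, not the real STREUSLE schema.
sample_line = 'ex.001\tIt was great\t{"tokens": ["It", "was", "great"]}\n'
records = list(read_sst([sample_line]))
first_token = token_at(records[0].data["tokens"], 1)
```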
.tags Format
------------
(CoNLL-esque format based on CMWE's .tags format.) 1 token per line, with blank lines separating sentences.
9 tab-separated columns:
1. token offset
2. word
3. lowercase lemma
4. POS
5. full MWE+class tag
6. offset of parent token (i.e. previous token in the same MWE), if applicable
7. strength level encoded in the tag, if applicable: `_` for strong, `~` for weak
8. class (usually supersense) label, if applicable
9. sentence ID
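
The column layout above can be read with a few lines of Python. The sketch below is a minimal, unofficial reader: `TagsToken` and `read_tags` are hypothetical names, and the sample rows (words, POS tags, MWE tags, sentence IDs) are invented for illustration. It also assumes that an empty string or `0` in column 6 means "no parent token"; check the actual data to confirm how non-applicable values are encoded.

```python
from typing import Iterable, Iterator, List, NamedTuple, Optional


class TagsToken(NamedTuple):
    """One row of a .tags file (hypothetical container, not from the STREUSLE tools)."""
    offset: int            # 1. token offset
    word: str              # 2. word
    lemma: str             # 3. lowercase lemma
    pos: str               # 4. POS
    tag: str               # 5. full MWE+class tag
    parent: Optional[int]  # 6. offset of parent token, if applicable
    strength: str          # 7. "_" (strong) or "~" (weak), if applicable
    label: str             # 8. class (usually supersense) label, if applicable
    sent_id: str           # 9. sentence ID


def read_tags(lines: Iterable[str]) -> Iterator[List[TagsToken]]:
    """Yield each sentence as a list of tokens; blank lines separate sentences."""
    sentence: List[TagsToken] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) != 9:
            raise ValueError(f"expected 9 columns, got {len(cols)}: {line!r}")
        offset, word, lemma, pos, tag, parent, strength, label, sent_id = cols
        # Assumption: "" or "0" in column 6 means the token has no parent.
        parent_offset = int(parent) if parent not in ("", "0") else None
        sentence.append(TagsToken(int(offset), word, lemma, pos, tag,
                                  parent_offset, strength, label, sent_id))
    if sentence:  # flush the last sentence if the input lacks a trailing blank line
        yield sentence


# Invented sample rows for illustration only.
sample_rows = [
    "1\tripped\trip\tVBD\tB\t0\t\t\tex.001",
    "2\toff\toff\tRP\tI_\t1\t_\t\tex.001",
    "",
    "1\tGreat\tgreat\tJJ\tO\t0\t\t\tex.002",
]
sentences = list(read_tags(sample_rows))
```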
History
-------
- STREUSLE 2.0: 2015-03-29. Added noun and verb supersenses.
- CMWE 1.0: 2014-03-26. Multiword expressions for 55k words of English web reviews.