AQMAR Arabic Wikipedia Supersense Corpus

This dataset contains text extracted from a small corpus of Arabic Wikipedia articles and hand-annotated for nominal supersenses. It is described in the paper

    Nathan Schneider, Behrang Mohit, Kemal Oflazer, and Noah A. Smith (2012),
    Coarse Lexical Semantic Annotation with Supersenses: An Arabic Case Study.
    Proceedings of ACL.

and can be downloaded at

    http://www.ark.cs.cmu.edu/AQMAR/

This dataset is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License (see LICENSE).

CONTENTS

== Data ==

articles.txt
    4-digit code, English title, and domain of each article in the data.
problem.sentences.txt
    Sentences that were removed from the dataset due to being flagged as problematic by one or both annotators.
sentences.txt
    Annotated sentences, one per line. Tab-separated fields:
    * sentence ID: first 4 digits are the article code; remaining digits are numbered sequentially in the main text of the article
    * Arabic sentence (UTF-8, tokenized)
    * tags from Annotator A, if available
    * tags from Annotator B, if available
tokens.txt
    Data in the token-based format. Each line contains the Arabic token, tag from Annotator A (if available), tag from Annotator B (if available), and the sentence ID.
tokens.agreement.bio
    Data used to measure inter-annotator agreement.
tokens.annA.bio
tokens.annB.bio
    Data from individual annotators.

== Documentation ==

examples.html
    Short descriptions and examples for each supersense tag. These were listed in a sidebar in the annotation interface.
guidelines.html
    Tagging guidelines used by annotators.
LICENSE
README
VERSION

== Scripts ==

agreementDataFilter.py
    Applied to sentences.txt, outputs sentences independently annotated by both annotators.
counts.sh
    Counts sentences, tokens, and supersense mentions in each domain and collectively.
sentences2tokens.py
    Converts sentences to a token-based format, with one token per line.
extenderTagScheme2BIO.py
    Converts the tagging to a BIO scheme. Usage:
        ./extenderTagScheme2BIO.py "<" tokens.txt | sed 's/[BI]-[-_]/O/g'

NOTES

The supersense tag symbols are included in the tagset documentation. Other symbols are:
    _ or -  = blank (not part of a nominal supersense)
    <       = extender (continues a multiword unit)
    ?       = unsure

Tokenization separated punctuation and the conjunction wa- from words.

For articles 0009-0012, the first 20 sentences (i.e. sentences numbered 0001-0020) were annotated cooperatively between the two annotators. For the remaining articles, the first 5 sentences were annotated cooperatively. All other sentences were annotated independently; those with tags from both annotators (not including articles 0001-0004, which were used in pilot annotation rounds) were used to compute inter-annotator agreement.
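
EXAMPLE

The sentences.txt field layout and the symbols listed in the notes can be read with a few lines of Python. The sketch below is illustrative only and is not one of the released scripts. It assumes that each tag field holds one whitespace-separated tag per token, that a missing annotator leaves the field empty, and that an extender with no open unit before it maps to O. The names Sentence, parse_sentence_line, and to_bio are hypothetical, and X and Y are placeholder tags rather than real supersense symbols.

```python
# Sketch only: the field layout follows the README; the whitespace-separated
# tag fields and the exact BIO mapping are assumptions, not the released scripts.
from typing import List, NamedTuple, Optional

BLANK = {"_", "-"}  # blank: not part of a nominal supersense
EXTENDER = "<"      # continues a multiword unit


class Sentence(NamedTuple):
    sentence_id: str
    article_code: str            # first 4 digits of the sentence ID
    tokens: List[str]
    tags_a: Optional[List[str]]  # None when Annotator A's field is empty
    tags_b: Optional[List[str]]  # None when Annotator B's field is empty


def parse_sentence_line(line: str) -> Sentence:
    """Split one tab-separated line of sentences.txt into its fields."""
    fields = line.rstrip("\n").split("\t")
    fields += [""] * (4 - len(fields))  # pad absent annotator fields
    sent_id, text, tags_a, tags_b = fields[:4]
    return Sentence(
        sentence_id=sent_id,
        article_code=sent_id[:4],
        tokens=text.split(),
        tags_a=tags_a.split() or None,
        tags_b=tags_b.split() or None,
    )


def to_bio(tags: List[str]) -> List[str]:
    """Map extender-style tags to BIO labels (assumed mapping)."""
    bio = []
    current = None  # label of the unit currently open, if any
    for tag in tags:
        if tag == EXTENDER:
            # Continue the open unit; with nothing open, fall back to O.
            bio.append(f"I-{current}" if current else "O")
        elif tag in BLANK:
            current = None
            bio.append("O")
        else:
            # Any other symbol (including "?") starts a new unit.
            current = tag
            bio.append(f"B-{tag}")
    return bio


if __name__ == "__main__":
    # Placeholder line: the ID, tokens, and tags are made up for illustration.
    demo = parse_sentence_line("00090021\tw1 w2 w3 w4\tX < _ Y\t")
    print(demo.article_code, to_bio(demo.tags_a))
    # prints: 0009 ['B-X', 'I-X', 'O', 'B-Y']
```

Mapping blanks straight to O has the same effect as the sed step in the command above, which rewrites B- and I- labels on blank tags to O.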