Twitter NLP and Part-of-Speech Tagging

We provide a fast and robust Java-based tokenizer and part-of-speech tagger for tweets, its training data of manually POS-annotated tweets, a web-based annotation tool, and hierarchical word clusters induced from unlabeled tweets.

These were created by Olutobi Owoputi, Brendan O'Connor, Kevin Gimpel, Nathan Schneider, Chris Dyer, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah Smith.

News

What the tagger does

./runTagger.sh --output-format conll examples/casual.txt
These are real tweets.
ikr smh he asked fir yo last name so he can add u on fb lololol
word    tag     confidence
ikr     !       0.8143
smh     G       0.9406
he      O       0.9963
asked   V       0.9979
fir     P       0.5545
yo      D       0.6272
last    A       0.9871
name    N       0.9998
so      P       0.9838
he      O       0.9981
can     V       0.9997
add     V       0.9997
u       O       0.9978
on      P       0.9426
fb      ^       0.9453
lololol !       0.9664
  • "ikr" means "I know, right?", tagged as an interjection.
  • "so" is being used as a subordinating conjunction, which our coarse tagset denotes P.
  • "fb" means "Facebook", a very common proper noun (^).
  • "yo" is being used as equivalent to "your"; our coarse tagset has possessive pronouns as D.
  • "fir" is a misspelling or spelling variant of the preposition "for".
  • Perhaps the only debatable errors in this example are for ikr and smh ("shake my head"): should they be G for miscellaneous acronym, or ! for interjection?
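
For downstream processing, the three-column output shown above can be read back into a program. The following is a minimal Python sketch, not part of the distributed tools. It assumes the file written by --output-format conll has one token per line with tab-separated word, tag and confidence fields, a blank line between tweets, and no header row; the function names parse_conll and low_confidence are hypothetical.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str, float]  # (word, tag, confidence)


def parse_conll(lines: Iterable[str]) -> List[List[Token]]:
    """Group tagger output lines into one token list per tweet.

    Assumption: columns are tab-separated and tweets are separated
    by blank lines. Adjust if your output differs.
    """
    tweets: List[List[Token]] = []
    current: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current tweet, if any.
            if current:
                tweets.append(current)
                current = []
            continue
        word, tag, conf = line.split("\t")
        current.append((word, tag, float(conf)))
    if current:
        tweets.append(current)
    return tweets


def low_confidence(tweet: List[Token], threshold: float = 0.7) -> List[Token]:
    """Return the tokens whose tag confidence is below the threshold."""
    return [tok for tok in tweet if tok[2] < threshold]


# A few rows copied from the example above.
sample = "ikr\t!\t0.8143\nsmh\tG\t0.9406\nfir\tP\t0.5545\nyo\tD\t0.6272\n\n"
tweets = parse_conll(sample.splitlines())
print(low_confidence(tweets[0]))
```

On these rows the filter keeps "fir" and "yo", the two least certain tags in the example. The 0.7 threshold is an arbitrary choice for illustration.
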
:o :/ :'( >:o (: :) >.< XD -__- o.O ;D :-) @_@ :P 8D :1 >:( :D =| ") :> ....
word    tag     confidence
:o      E       0.9387
:/      E       0.9983
:'(     E       0.9975
>:o     E       0.9964
(:      E       0.9994
:)      E       0.9997
>.<     E       0.9952
XD      E       0.9938
-__-    E       0.9956
o.O     E       0.9899
;D      E       0.9995
:-)     E       0.9992
@_@     E       0.9964
:P      E       0.9996
8D      E       0.9961
:       E       0.6925
1       $       0.9194
>:(     E       0.9715
:D      E       0.9996
=|      E       0.9963
"       ,       0.6125
)       ,       0.9078
:       ,       0.7460
>       G       0.7490
...     ,       0.5223
.       ,       0.9946
Challenge case for emoticon segmentation/recognition: 20/26 precision, 18/21 recall.
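
As a small worked example, the two fractions above can be converted to decimals in a few lines of Python. The F1 value (the harmonic mean of precision and recall) is computed here only for illustration; it is not a figure reported on this page.

```python
from fractions import Fraction


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


precision = Fraction(20, 26)  # from the line above
recall = Fraction(18, 21)     # from the line above

print(f"precision = {float(precision):.3f}")
print(f"recall    = {float(recall):.3f}")
print(f"F1        = {float(f1(precision, recall)):.3f}")  # derived, not reported
```
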

Download

To receive announcements about updates, join the ARK-tools mailing list.

Further Reading

Please cite the appropriate paper when using these resources in research.

Resources

Twitter Word Clusters

Here is an HTML viewer of the word clusters. They were produced by an unsupervised HMM: Percy Liang's Brown clustering implementation, run on tweets identified as English by Lui and Baldwin's langid.py; see Owoputi et al. (2012) for details.

We recommend the largest one:
filename    #wordtypes  #tweets     #tokens      #clusters  min count  tweet source
50mpaths2   216,856     56,345,753  847,372,038  1000       40         100k tweet/day sample, 9/10/08 to 8/14/12

Also, here are the smaller ones used in the experiments.
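
Hierarchical clusters are commonly used as features by taking prefixes of each word's cluster bit string, so that shorter prefixes give coarser classes. The following is a minimal Python sketch of loading a cluster file and deriving such prefixes. It assumes each line holds a bit-string path, a word and a count separated by tabs, which should be checked against the downloaded file. The function names, the prefix lengths and the toy rows are hypothetical and are not taken from the real data.

```python
from typing import Dict, Iterable, List, Sequence


def load_clusters(lines: Iterable[str]) -> Dict[str, str]:
    """Map each word to its cluster bit string.

    Assumption: every line is "path<TAB>word<TAB>count".
    Lines with a different number of fields are skipped.
    """
    clusters: Dict[str, str] = {}
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue
        path, word, _count = parts
        clusters[word] = path
    return clusters


def prefix_features(path: str, lengths: Sequence[int] = (2, 4, 6, 8)) -> List[str]:
    """Return the prefixes of a bit string at the given lengths.

    Lengths longer than the path itself are skipped.
    """
    return [path[:n] for n in lengths if len(path) >= n]


# Made-up rows for illustration only; not taken from the real file.
toy = ["0010\tlol\t120\n", "001011\thaha\t95\n", "1101\tmonday\t60\n"]
clusters = load_clusters(toy)
print(prefix_features(clusters["haha"]))
print(prefix_features(clusters["lol"]))
```

In the toy data, "lol" and "haha" share the prefix 0010 and therefore fall in the same coarse class, while "monday" does not. To read a downloaded file, pass an open file handle to load_clusters in place of the toy list.
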

Tagger Models

To use an alternate model, download the one you want and specify the flag: --model MODELFILENAME
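
When the tagger is called from a script, the command line can be assembled programmatically. The following Python sketch builds the argument list from the two flags shown on this page (--model and --output-format). It assumes the flags can be combined and are given before the input file; the function names tagger_command and run_tagger are hypothetical, and MODELFILENAME remains a placeholder for the model you downloaded.

```python
import subprocess
from typing import List, Optional


def tagger_command(
    input_path: str,
    model: Optional[str] = None,
    output_format: Optional[str] = None,
    script: str = "./runTagger.sh",
) -> List[str]:
    """Build the argument list for one tagger invocation.

    Assumption: options come before the input file name.
    """
    cmd = [script]
    if model is not None:
        cmd += ["--model", model]
    if output_format is not None:
        cmd += ["--output-format", output_format]
    cmd.append(input_path)
    return cmd


def run_tagger(input_path: str, **options) -> str:
    """Run the tagger and return its standard output as text.

    Requires the tagger distribution in the working directory.
    """
    result = subprocess.run(
        tagger_command(input_path, **options),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Only prints the command; nothing is executed here.
print(tagger_command("examples/casual.txt", model="MODELFILENAME", output_format="conll"))
```

Passing the arguments as a list avoids shell quoting problems when a model or input path contains spaces.
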

Other software

See also ARK Social Media Research.

Acknowledgments

This research was supported in part by the NSF through CAREER grant IIS-1054319, the U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant number W911NF-10-1-0533, Sandia National Laboratories (fellowship to K. Gimpel), the U.S. Department of Education under IES grant R305B040063 (fellowship to M. Heilman), an REU supplement to NSF grant IIS-0915187, and Google's support of the Worldly Knowledge project at CMU.