Twitter NLP and Part-of-Speech Tagging
We provide a fast and robust Java-based tokenizer and part-of-speech tagger for Twitter,
its training data of manually labeled, POS-annotated tweets,
a web-based annotation tool, and hierarchical word clusters from unlabeled tweets.
These were created by
and Noah Smith.
- July 2013: Added a Penn Treebank-style tagset model (see bottom of page).
- March 2013: New NAACL paper posted.
- Sept 2012: Tagger version 0.3 is released! It is 40 times faster, and more accurate. More annotated training data and unsupervised clusters are also available.
Download the tagger here.
What the tagger does
./runTagger.sh --output-format conll examples/casual.txt
These are real tweets.
ikr smh he asked fir yo last name so he can add u on fb lololol
word tag confidence
ikr ! 0.8143
smh G 0.9406
he O 0.9963
asked V 0.9979
fir P 0.5545
yo D 0.6272
last A 0.9871
name N 0.9998
so P 0.9838
he O 0.9981
can V 0.9997
add V 0.9997
u O 0.9978
on P 0.9426
fb ^ 0.9453
lololol ! 0.9664
- "ikr" means "I know, right?", tagged as an interjection.
- "so" is being used as a subordinating conjunction, which our coarse tagset denotes P.
- "fb" means "Facebook", a very common proper noun (^).
- "yo" is being used as equivalent to "your"; our coarse tagset has possessive pronouns as D.
- "fir" is a misspelling or spelling variant of the preposition for.
- Perhaps the only debatable errors in this example are for ikr and smh ("shake my head"):
should they be G for miscellaneous acronym, or ! for interjection?
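The three-column output above is easy to post-process. Below is a minimal sketch in Python that reads "word tag confidence" lines and lists the tokens the tagger was least sure about. The names `TaggedToken`, `parse_tagged_lines` and the 0.9 threshold are hypothetical choices for illustration, not part of the tagger. The sketch assumes fields are separated by whitespace and that the header line has been removed.

```python
from typing import List, NamedTuple


class TaggedToken(NamedTuple):
    word: str
    tag: str
    confidence: float


def parse_tagged_lines(text: str) -> List[TaggedToken]:
    """Parse 'word tag confidence' lines into TaggedToken records."""
    parsed = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines (assumed to separate tweets)
        # Split from the right: the last two fields are the tag and the confidence.
        word, tag, conf = line.rsplit(None, 2)
        parsed.append(TaggedToken(word, tag, float(conf)))
    return parsed


# The example output shown above, without the header line.
SAMPLE = """\
ikr ! 0.8143
smh G 0.9406
he O 0.9963
asked V 0.9979
fir P 0.5545
yo D 0.6272
last A 0.9871
name N 0.9998
so P 0.9838
he O 0.9981
can V 0.9997
add V 0.9997
u O 0.9978
on P 0.9426
fb ^ 0.9453
lololol ! 0.9664
"""

tokens = parse_tagged_lines(SAMPLE)

# Hypothetical threshold: flag anything tagged with confidence below 0.9.
low_confidence = [t for t in tokens if t.confidence < 0.9]
for t in low_confidence:
    print(f"{t.word}\t{t.tag}\t{t.confidence:.4f}")
```

On this tweet the flagged tokens are ikr, fir and yo, which are also the nonstandard spellings discussed in the notes above.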
:o :/ :'( >:o (: :) >.< XD -__- o.O ;D :-) @_@ :P 8D :1 >:( :D =| ") :> ....
word tag confidence
:o E 0.9387
:/ E 0.9983
:'( E 0.9975
>:o E 0.9964
(: E 0.9994
:) E 0.9997
>.< E 0.9952
XD E 0.9938
-__- E 0.9956
o.O E 0.9899
;D E 0.9995
:-) E 0.9992
@_@ E 0.9964
:P E 0.9996
8D E 0.9961
: E 0.6925
1 $ 0.9194
>:( E 0.9715
:D E 0.9996
=| E 0.9963
" , 0.6125
) , 0.9078
: , 0.7460
> G 0.7490
... , 0.5223
. , 0.9946
Challenge case for emoticon segmentation/recognition: 20/26 precision, 18/21 recall.
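As a check on these figures, here is a small sketch that recomputes the recall from the listing above and converts both fractions to decimals. It assumes the 21 whitespace-separated items before the trailing "...." are the intended emoticons, and that a hit is an output token tagged E that matches one of them exactly. The page does not say which 20 of the 26 output tokens count as correct, so the precision numerator is taken as reported.

```python
# Emoticons in the input tweet, split on whitespace (the trailing "...." is excluded).
gold = """:o :/ :'( >:o (: :) >.< XD -__- o.O ;D :-) @_@ :P 8D :1 >:( :D =| ") :>""".split()

# Output tokens that received the E (emoticon) tag in the listing above.
predicted = """:o :/ :'( >:o (: :) >.< XD -__- o.O ;D :-) @_@ :P 8D >:( :D =|""".split()

hits = [tok for tok in predicted if tok in set(gold)]
missed = [tok for tok in gold if tok not in set(predicted)]

recall = len(hits) / len(gold)   # 18 of 21 emoticons recovered intact
precision = 20 / 26              # as reported: 20 correct out of 26 output tokens

print(f"recall    = {len(hits)}/{len(gold)} = {recall:.3f}")
print(f"precision = 20/26 = {precision:.3f}")
print("missed:", " ".join(missed))
```

The three missed items are the ones the tokenizer split apart in the output: `:1`, `")` and `:>`.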
Releases of the tagger (and tokenizer), data, and annotation tool
are available here on Google Code.
The tagger source code (plus annotated data and web tool) is on GitHub.
The tokenizer code is self-contained in Twokenize.java; or use twokenize.sh in the tagger download.
A Python port of the tokenizer is available from Myle Ott: ark-twokenize-py. (There is also an older Python version from 2010, also called "twokenize," here.)
To receive announcements about updates, join the ARK-tools mailing list.
Please cite the appropriate paper when using these resources in research.
The newest paper (for version 0.3) is:
Improved Part-of-Speech Tagging for Online Conversational Text with Word Clusters
and Noah A. Smith.
In Proceedings of NAACL 2013.
- Tech report version, with more detail in some places and less in others:
Owoputi et al. (2012).
Machine Learning Department.
- The original paper:
Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments
and Noah A. Smith.
In Proceedings of ACL 2011.
The Annotation Guidelines, extensively revamped for 0.3.
Twitter Word Clusters
Here is an HTML viewer of the word clusters.
Produced by an unsupervised HMM:
Percy Liang's Brown clustering implementation
on Lui and Baldwin's langid.py-identified English tweets;
see Owoputi et al. (2012) for details.
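Brown clustering places every word at a leaf of a binary tree, so each word can be identified by the bit string of its path from the root, and shorter prefixes of that path name coarser clusters. The sketch below shows one way to turn such paths into features. It assumes a tab-separated file of "path, word, count" rows; the three sample rows, the prefix lengths and the function names are invented for illustration and are not taken from the released clusters.

```python
import os
from typing import Dict, Iterable, List


def load_cluster_paths(lines: Iterable[str]) -> Dict[str, str]:
    """Map each word to its bit-string path (assumed format: path<TAB>word<TAB>count)."""
    paths = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        path, word, _count = line.split("\t")
        paths[word] = path
    return paths


def prefix_features(path: str, lengths=(2, 4, 6, 8)) -> List[str]:
    """Emit one feature per prefix length that the path is long enough to support."""
    return [f"cluster_prefix_{n}={path[:n]}" for n in lengths if len(path) >= n]


# Hypothetical rows; real cluster files are far larger and use different paths.
SAMPLE_ROWS = [
    "0110100\tlol\t120",
    "0110101\tlololol\t45",
    "111010\tfacebook\t300",
]

paths = load_cluster_paths(SAMPLE_ROWS)
print(prefix_features(paths["lol"]))

# Words whose paths share a long prefix sit close together in the hierarchy.
shared = os.path.commonprefix([paths["lol"], paths["lololol"]])
print("shared prefix:", shared)
```

Using several prefix lengths at once lets a model back off from a fine-grained cluster to a coarser one when a word is rare.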
We recommend the largest one:
56,345,753 tweets; 847,372,038 tokens; 1000 clusters; minimum word count 40; drawn from a 100k tweet/day sample, 9/10/08 to 8/14/12.
Also, here are the smaller ones used in the experiments.
To use an alternate model, download the one you want and specify the flag:
The ritter_ptb and irc models are trained on datasets that were
annotated separately from the work described here.
Our annotation guidelines, and the various distinctions they describe (like constituent versus tag
uses of hashtags), do not apply if you are using the tagger with these models.
model.20120919 (2MB) -- the Twitter POS model with our coarse 25-tag tagset.
This is included with the tagger release and used by default.
model.ritter_ptb_alldata_fixed.20130723 (1.5 MB) -- a model that gives a Penn Treebank-style tagset for Twitter. Trained from a fixed version of Ritter et al. EMNLP 2011's annotated data. If you want Penn Treebank-style POS tags for Twitter, use this model. We documented issues and changes here. Also, here is an accuracy evaluation to compare with other work.
model.irc.20121211 (3MB) -- a model trained on the NPSChat IRC corpus, with a PTB-style tagset.
- gp-ark-tweet-nlp is: "a PL/Java Wrapper for Ark-Tweet-NLP, that
enables you to perform part-of-speech tagging on Tweets, using SQL.
If your environment is an MPP system like Pivotal's Greenplum Database you can piggyback on the MPP architecture and achieve implicit parallelism in your part-of-speech tagging tasks."
This research was supported in part by: the NSF through CAREER grant IIS-1054319, the U.S. Army Research Laboratory and the U.S. Army Research Office under contract/grant number W911NF-10-1-0533, Sandia National Laboratories (fellowship to K. Gimpel), the U.S. Department of Education under IES grant R305B040063 (fellowship to M. Heilman),
an REU supplement to NSF grant IIS-0915187, and
Google's support of the Worldly Knowledge project at CMU.