Twitter NLP and Part-of-Speech Tagging
We provide a fast and robust Java-based tokenizer and part-of-speech tagger for Twitter, along with its training data of manually POS-annotated tweets, a web-based annotation tool, and hierarchical word clusters induced from unlabeled tweets.
These were created by Olutobi Owoputi, Brendan O'Connor, Kevin Gimpel, Nathan Schneider, Chris Dyer, Dipanjan Das, Daniel Mills, Jacob Eisenstein, Michael Heilman, Dani Yogatama, Jeffrey Flanigan, and Noah Smith.
    ./runTagger.sh --output-format conll examples/casual.txt

These are real tweets.

    ikr smh he asked fir yo last name so he can add u on fb lololol

| word | tag | confidence |
|---|---|---|
| ikr | ! | 0.8143 |
| smh | G | 0.9406 |
| he | O | 0.9963 |
| asked | V | 0.9979 |
| fir | P | 0.5545 |
| yo | D | 0.6272 |
| last | A | 0.9871 |
| name | N | 0.9998 |
| so | P | 0.9838 |
| he | O | 0.9981 |
| can | V | 0.9997 |
| add | V | 0.9997 |
| u | O | 0.9978 |
| on | P | 0.9426 |
| fb | ^ | 0.9453 |
| lololol | ! | 0.9664 |
    :o :/ :'( >:o (: :) >.< XD -__- o.O ;D :-) @_@ :P 8D :1 >:( :D =| ") :> ....

| word | tag | confidence |
|---|---|---|
| :o | E | 0.9387 |
| :/ | E | 0.9983 |
| :'( | E | 0.9975 |
| >:o | E | 0.9964 |
| (: | E | 0.9994 |
| :) | E | 0.9997 |
| >.< | E | 0.9952 |
| XD | E | 0.9938 |
| -__- | E | 0.9956 |
| o.O | E | 0.9899 |
| ;D | E | 0.9995 |
| :-) | E | 0.9992 |
| @_@ | E | 0.9964 |
| :P | E | 0.9996 |
| 8D | E | 0.9961 |
| : | E | 0.6925 |
| 1 | $ | 0.9194 |
| >:( | E | 0.9715 |
| :D | E | 0.9996 |
| =\| | E | 0.9963 |
| " | , | 0.6125 |
| ) | , | 0.9078 |
| : | , | 0.7460 |
| > | G | 0.7490 |
| ... | , | 0.5223 |
| . | , | 0.9946 |

Challenge case for emoticon segmentation/recognition: 20/26 precision, 18/21 recall.
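The word/tag/confidence output shown above is easy to post-process. The following is a minimal Python sketch, not part of the distributed tools. It assumes that `--output-format conll` writes one token per line as tab-separated word, tag, and confidence, with a blank line ending each tweet. The names `TaggedToken`, `parse_conll`, and `low_confidence`, and the 0.7 threshold, are hypothetical choices made for illustration.

```python
from typing import Iterable, Iterator, List, NamedTuple


class TaggedToken(NamedTuple):
    word: str
    tag: str
    confidence: float


def parse_conll(lines: Iterable[str]) -> Iterator[List[TaggedToken]]:
    """Yield one list of tokens per tweet.

    Assumption: each non-blank line is word<TAB>tag<TAB>confidence,
    and a blank line marks the end of a tweet.
    """
    tweet: List[TaggedToken] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if tweet:
                yield tweet
                tweet = []
            continue
        word, tag, conf = line.split("\t")
        tweet.append(TaggedToken(word, tag, float(conf)))
    if tweet:  # last tweet may lack a trailing blank line
        yield tweet


def low_confidence(tweet: List[TaggedToken], threshold: float = 0.7) -> List[TaggedToken]:
    """Return tokens whose tag confidence is below the (hypothetical) threshold."""
    return [t for t in tweet if t.confidence < threshold]


# A few rows copied from the first example above.
sample = (
    "ikr\t!\t0.8143\n"
    "smh\tG\t0.9406\n"
    "fir\tP\t0.5545\n"
    "yo\tD\t0.6272\n"
    "\n"
)
tweets = list(parse_conll(sample.splitlines()))
for token in low_confidence(tweets[0]):
    print(token.word, token.tag, token.confidence)
```

On these rows the filter keeps `fir` and `yo`, the two tokens the tagger was least sure about. To process real output, pass an open file handle to `parse_conll` instead of the sample string.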
To receive announcements about updates, join the ARK-tools mailing list.
Twitter Word Clusters
Here is an HTML viewer of the word clusters. The clusters were produced by an unsupervised HMM, using Percy Liang's Brown clustering implementation on tweets identified as English by Lui and Baldwin's langid.py; see Owoputi et al. (2012) for details.
We recommend the largest one:
| filename | #wordtypes | #tweets | #tokens | #clusters | min count | tweet source |
|---|---|---|---|---|---|---|
| 50mpaths2 | 216,856 | 56,345,753 | 847,372,038 | 1000 | 40 | 100k tweet/day sample, 9/10/08 to 8/14/12 |
Also, here are the smaller ones used in the experiments:
| filename | #wordtypes | #tweets | #tokens | #clusters | min count | tweet source |
|---|---|---|---|---|---|---|
| 6mpaths | 111,844 | ~6,000,000 | 1,575,589 | 800 | 10 | 10k tweet/day sample, 9/10/08 to 7/18/12 |
| 3mpaths | 124,731 | 3,000,000 | 1,006,324 | 800 | 5 | subsample |
| 750kpaths | 50,780 | 750,000 | ? | 800 | 5 | subsample |
| 100kpaths | 21,345 | 100,000 | ? | 800 | 3 | subsample |
| 10kpaths | 6,944 | 10,000 | ? | 800 | 2 | subsample |
| 1kpaths | 4,142 | 1,000 | 15,159 | 800 | 1 | subsample |
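Because the clusters are hierarchical, a common way to use a paths file is to map each word to its cluster bit string and take prefixes of that string as coarser cluster identifiers. The sketch below illustrates the idea in Python. It assumes each line of a paths file is tab-separated as bit string, word, count. The function names, the prefix lengths, and the sample rows are hypothetical and are not taken from the released files.

```python
from typing import Dict, Iterable, List, Sequence


def load_clusters(lines: Iterable[str]) -> Dict[str, str]:
    """Map each word to its cluster bit string.

    Assumption: each non-blank line is path<TAB>word<TAB>count.
    """
    clusters: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        path, word, _count = line.split("\t")
        clusters[word] = path
    return clusters


def prefix_features(path: str, lengths: Sequence[int] = (2, 4, 6, 8)) -> List[str]:
    """Return bit-string prefixes; shorter prefixes name coarser clusters."""
    return [path[:n] for n in lengths if n <= len(path)]


# Made-up rows for illustration only; real bit strings and counts differ.
sample_paths = [
    "1110100\tlol\t120",
    "1110101\tlololol\t45",
    "0010\tthe\t9000",
]
clusters = load_clusters(sample_paths)
print(prefix_features(clusters["lol"]))
print(prefix_features(clusters["lololol"]))
print(prefix_features(clusters["the"]))
```

In this toy data, `lol` and `lololol` sit in different leaf clusters but share every prefix up to length 6, so a model that uses prefix features can generalize across the two spellings.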
Tagger Models
See also ARK Social Media Research.