This README describes data in the CMU Movie Summary Corpus, a collection of 42,306 movie plot summaries and metadata at both the movie level (including box office revenues, genre and date of release) and character level (including gender and estimated age). This data supports work in the following paper:

David Bamman, Brendan O'Connor and Noah Smith, "Learning Latent Personas of Film Characters," in: Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL 2013), Sofia, Bulgaria, August 2013.

All data is released under a Creative Commons Attribution-ShareAlike License. For questions or comments, please contact David Bamman (dbamman@cs.cmu.edu).

###
#
# DATA
#
###

1. plot_summaries.txt.gz [29 M]

Plot summaries of 42,306 movies extracted from the November 2, 2012 dump of English-language Wikipedia. Each line contains the Wikipedia movie ID (which indexes into movie.metadata.tsv) followed by the summary.

2. corenlp_plot_summaries.tar.gz [628 M, separate download]

The plot summaries from above, run through the Stanford CoreNLP pipeline (tagging, parsing, NER and coref). Each filename begins with the Wikipedia movie ID (which indexes into movie.metadata.tsv).

###
#
# METADATA
#
###

3. movie.metadata.tsv.gz [3.4 M]

Metadata for 81,741 movies, extracted from the November 4, 2012 dump of Freebase. Tab-separated; columns:

    1. Wikipedia movie ID
    2. Freebase movie ID
    3. Movie name
    4. Movie release date
    5. Movie box office revenue
    6. Movie runtime
    7. Movie languages (Freebase ID:name tuples)
    8. Movie countries (Freebase ID:name tuples)
    9. Movie genres (Freebase ID:name tuples)

4. character.metadata.tsv.gz [14 M]

Metadata for 450,669 characters aligned to the movies above, extracted from the November 4, 2012 dump of Freebase. Tab-separated; columns:

    1. Wikipedia movie ID
    2. Freebase movie ID
    3. Movie release date
    4. Character name
    5. Actor date of birth
    6. Actor gender
    7. Actor height (in meters)
    8. Actor ethnicity (Freebase ID)
    9. Actor name
    10. Actor age at movie release
    11. Freebase character/actor map ID
    12. Freebase character ID
    13. Freebase actor ID

##
#
# TEST DATA
#
##

tvtropes.clusters.txt

72 character types drawn from tvtropes.com, along with 501 instances of those types. The ID field indexes into the Freebase character/actor map ID in character.metadata.tsv.

name.clusters.txt

970 unique character names used in at least two different movies, along with 2,666 instances of those types. The ID field indexes into the Freebase character/actor map ID in character.metadata.tsv.
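
The column layouts described in this README can be read with a few lines of Python. The sketch below is illustrative only and is not part of the corpus distribution. The Python field names are our own labels for the columns listed above, the helper names (read_tsv, index_movies, read_summaries) are hypothetical, and the sample row is invented for the demonstration. It also assumes that a tab separates the movie ID from the summary in plot_summaries.txt, which this README does not state explicitly.

```python
import csv
import io

# Field names are our own labels; the order follows the column lists in this README.
MOVIE_COLUMNS = [
    "wikipedia_movie_id", "freebase_movie_id", "name", "release_date",
    "box_office_revenue", "runtime", "languages", "countries", "genres",
]
CHARACTER_COLUMNS = [
    "wikipedia_movie_id", "freebase_movie_id", "release_date",
    "character_name", "actor_date_of_birth", "actor_gender",
    "actor_height_m", "actor_ethnicity", "actor_name",
    "actor_age_at_release", "freebase_map_id",
    "freebase_character_id", "freebase_actor_id",
]


def read_tsv(handle, columns):
    """Yield one dict per tab-separated row, keyed by the given column names."""
    # QUOTE_NONE keeps quote characters inside fields as literal text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:
            continue  # skip blank lines
        if len(row) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(row)}")
        yield dict(zip(columns, row))


def index_movies(handle):
    """Map Wikipedia movie ID -> movie record, for joining with other files."""
    return {m["wikipedia_movie_id"]: m for m in read_tsv(handle, MOVIE_COLUMNS)}


def read_summaries(handle):
    """Yield (wikipedia_movie_id, summary) pairs.

    Assumption: the ID and the summary are separated by a single tab.
    """
    for line in handle:
        movie_id, _, summary = line.rstrip("\n").partition("\t")
        yield movie_id, summary


# Demonstration on an invented in-memory row (not real corpus data).
sample = io.StringIO("1\t/m/x\tExample Film\t2001\t\t90\t{}\t{}\t{}\n")
movies = index_movies(sample)
print(movies["1"]["name"])
```

To read the compressed files directly, a handle from gzip.open(path, "rt", encoding="utf-8", newline="") can be passed in place of the in-memory sample. Values are left as strings because fields may be empty when Freebase had no value; converting revenue, runtime, or age to numbers is left to the caller.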