Big Multilinguality for Data-Driven Lexical Semantics

A key challenge in natural language processing is defining the computational representation of words. Data-driven distributional approaches use corpora to induce vector-space representations for words, based on the contexts they occur in. This project goes beyond traditional approaches (e.g., latent semantic analysis; Deerwester et al., 1990), which use words that tend to occur near a word in corpora to define the context, by extending the types of contexts used in constructing semantic vectors. First, this project incorporates translation contexts, i.e., words readily available in multilingual parallel corpora, alongside traditional monolingual corpora. This allows evidence-sharing across languages, most importantly from resource-rich languages with large corpora to more resource-poor languages. Second, this project incorporates social context inferable from social network platforms, captured through author, time, geographic, and social connection metadata. Taken together, these additional features give a broader definition of a word's context and lead to a more unified approach to the distributional approach to modeling human language, moving in the direction of a language-independent semantics. The project focuses on ten typologically diverse languages representing several major language families (English, Arabic, Chinese, Spanish, Russian, German, Portuguese, Swahili, Malagasy, and Farsi). A key emphasis is scaling up algorithms for inferring distributional representations to web-scale corpora and dealing with much larger contextual vectors representing the expanded notion of context. The approach also leverages noisy syntactic processing to enable syntactic information, rather than just information about neighboring words, to be considered when defining context.

In addition to improving the quality of the learned lexico-semantic representations by including richer contextual information, this project creates lexical semantic representations that link word types across languages. These have direct use in text processing applications such as text categorization, machine translation, information extraction, and semantic analysis of text, and they will enable the construction of robust lexical semantic resources in lower-resource languages that benefit from the richness of resources in languages they are paired with. The multilingual vector representations produced will be released to the research community and will be used in undergraduate class projects. The project supports the education of two graduate students in a dynamic research environment.

Project Personnel

Downloads

Publications

Acknowledgments

This project is supported by the National Science Foundation (IIS-1251131).