Big Multilinguality for Data-Driven Lexical Semantics

A key challenge in natural language processing is defining the computational representation of words. Data-driven distributional approaches use corpora to induce vector-space representations for words, based on the contexts they occur in. Traditional approaches (e.g., latent semantic analysis; Deerwester et al., 1990) define a word's context as the words that tend to occur near it in corpora. This project goes beyond them by extending the types of contexts used in constructing semantic vectors. First, this project incorporates translation contexts, i.e., words readily available in multilingual parallel corpora, alongside traditional monolingual corpora. This allows evidence-sharing across languages, most importantly from resource-rich languages with large corpora to more resource-poor languages. Second, this project incorporates social context inferable from social network platforms, captured through author, time, geographic, and social connection metadata. Taken together, these additional features give a broader definition of a word's context and lead to a more unified distributional approach to modeling human language, moving in the direction of a language-independent semantics. The project focuses on ten typologically diverse languages representing several major language families (English, Arabic, Chinese, Spanish, Russian, German, Portuguese, Swahili, Malagasy, and Farsi). A key emphasis is scaling up algorithms for inferring distributional representations to web-scale corpora and dealing with the much larger contextual vectors that represent the expanded notion of context. The approach also leverages noisy syntactic processing so that syntactic information, rather than just information about neighboring words, can be considered when defining context.

In addition to improving the quality of the learned lexico-semantic representations by including richer contextual information, this project creates lexical semantic representations that link word types across languages. These have direct use in text processing applications such as text categorization, machine translation, information extraction, and semantic analysis of text, and they will enable the construction of robust lexical semantic resources in lower-resource languages that benefit from the richness of resources in languages they are paired with. The multilingual vector representations produced will be released to the research community and will be used in undergraduate class projects. The project supports the education of two graduate students in a dynamic research environment.

Summary of Research Findings

Faruqui and Dyer (2014) introduced a technique based on canonical correlation analysis that incorporates multilingual evidence into word vectors. The key idea is to use word-aligned parallel corpora and project word vectors learned separately for the two languages into the same space. The resulting word vectors are shown to be of higher quality in standard benchmark evaluations than vectors learned from monolingual text alone.
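
The projection step can be illustrated with a small NumPy sketch of canonical correlation analysis. It assumes that row i of X and row i of Y hold the monolingual vectors of a word pair linked through word alignment. The function name cca_project, the regularization constant reg, and the synthetic data are assumptions made for illustration, not the authors' released code.

```python
import numpy as np

def cca_project(X, Y, k, reg=1e-8):
    """Return projections A, B that map X and Y into a shared k-dim space.

    X: (n, d1) and Y: (n, d2) are row-aligned vectors for n translation pairs.
    """
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    # Covariance matrices, lightly regularized so they stay invertible.
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(C):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the
    # canonical correlations, in decreasing order.
    U, s, Vt = np.linalg.svd(Wx @ Cxy @ Wy)
    A = Wx @ U[:, :k]
    B = Wy @ Vt.T[:, :k]
    return A, B, s[:k], mx, my

# Synthetic stand-in for two languages that share a 3-dim latent signal.
rng = np.random.default_rng(0)
Z = rng.normal(size=(500, 3))
X = Z @ rng.normal(size=(3, 6)) + 0.1 * rng.normal(size=(500, 6))
Y = Z @ rng.normal(size=(3, 5)) + 0.1 * rng.normal(size=(500, 5))

A, B, corr, mx, my = cca_project(X, Y, k=3)
X_shared = (X - mx) @ A   # first-language vectors in the shared space
Y_shared = (Y - my) @ B   # second-language vectors in the shared space
```

Once A and B are estimated from the aligned pairs, the same projections can be applied to every word vector in each language, including words that never appeared in the aligned set.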

Bamman et al. (2014) introduced a model for incorporating geographical contextual information in learning vector-space representations of situated language. Using geolocated social media messages, this technique allows us to see how word meanings can depend on place. We developed a new quantitative evaluation that judges geographically informed semantic similarity, and used it to demonstrate the ability of the new method to make reasonable inferences about word meaning.

Faruqui et al. (2015) introduced a technique, "retrofitting," that uses semantic lexicons such as WordNet, FrameNet, and the Paraphrase Database to improve word vectors. Given such a lexicon of similarity relationships and a set of learned word vectors, the technique adapts word vectors, encouraging neighbors to be more similar. We find the new vectors show substantial improvement on benchmark tasks.
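
A minimal sketch of the idea follows: each word vector is repeatedly pulled toward the average of its lexicon neighbors while staying anchored to its original value. The equal weighting of the original vector and the neighbor average, the iteration count, and the toy vectors and lexicon are assumptions chosen for illustration.

```python
import numpy as np

def retrofit(vectors, lexicon, iterations=10):
    """Pull each word vector toward its lexicon neighbors.

    vectors: dict mapping word -> np.ndarray (the learned vectors)
    lexicon: dict mapping word -> list of related words
    """
    new = {w: v.copy() for w, v in vectors.items()}
    for _ in range(iterations):
        for word, neighbors in lexicon.items():
            nbrs = [n for n in neighbors if n in new]
            if word not in vectors or not nbrs:
                continue  # nothing to update for this entry
            # Assumed weighting: the original vector and the neighbor
            # average contribute equally to the updated vector.
            neighbor_mean = np.mean([new[n] for n in nbrs], axis=0)
            new[word] = (vectors[word] + neighbor_mean) / 2.0
    return new

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Toy example: "happy" and "glad" are linked in the lexicon, "table" is not.
vectors = {
    "happy": np.array([1.0, 0.0]),
    "glad": np.array([0.0, 1.0]),
    "table": np.array([-1.0, 0.0]),
}
lexicon = {"happy": ["glad"], "glad": ["happy"]}

retrofitted = retrofit(vectors, lexicon)
print(cosine(vectors["happy"], vectors["glad"]))
print(cosine(retrofitted["happy"], retrofitted["glad"]))
```

Words that do not appear in the lexicon, such as "table" here, keep their original vectors, so the procedure can be applied as a post-processing step to any set of pre-trained vectors.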

Yogatama et al. (2015) introduced a new technique for learning word representations using hierarchical regularization in sparse coding. This required significant algorithmic development. The resulting vectors were found to perform competitively on standard evaluation tasks, and the sparsity patterns suggest that greater interpretability in these representations may be possible.

Word vectors from this project were incorporated into several synergistic research projects. These include: (1) tests leading up to an entry in the SemEval 2014 competition on broad-coverage semantic dependency parsing (Thomson et al., 2014), which achieved second place; (2) developing a cross-lingual model of metaphor that discriminates between syntactic constructions with literal or metaphorical meaning in several languages, supporting the hypothesis that metaphors are conceptual rather than lexical (Tsvetkov et al., 2014); and (3) the development of a supersense taxonomy for adjectives (Tsvetkov et al., 2014).

Current distributed representations of words show little resemblance to theories of lexical semantics. The former are dense and uninterpretable, the latter largely based on familiar, discrete classes (e.g., supersenses) and relations (e.g., synonymy and hypernymy). Faruqui et al. (2015) proposed methods that transform word vectors into sparse (and optionally binary) vectors. The resulting representations are more similar to the interpretable features typically used in NLP, though they are discovered automatically from raw corpora. Because the vectors are highly sparse, they are computationally easy to work with. Most importantly, we find that they outperform the original vectors on benchmark tasks.
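
To make the dense-to-sparse transformation concrete, the sketch below computes L1-penalized sparse codes for dense vectors against a fixed dictionary, using iterative soft thresholding. Several parts are assumptions for illustration only: the random fixed dictionary (how the dictionary is obtained is not covered here), the penalty weight lam, the solver, and the binarization rule that maps positive entries to 1.

```python
import numpy as np

def soft_threshold(Z, t):
    # Shrink every entry toward zero by t; entries within [-t, t] become 0.
    return np.sign(Z) * np.maximum(np.abs(Z) - t, 0.0)

def sparse_codes(X, D, lam=1.0, iterations=300):
    """Approximately minimize 0.5 * ||X - A D||^2 + lam * ||A||_1 over A.

    X: (n, d) dense word vectors; D: (k, d) dictionary with k > d.
    Returns A: (n, k), one sparse code per word.
    """
    # Step size 1/L, where L is the largest eigenvalue of D D^T.
    step = 1.0 / np.linalg.norm(D @ D.T, 2)
    A = np.zeros((X.shape[0], D.shape[0]))
    for _ in range(iterations):
        grad = (A @ D - X) @ D.T          # gradient of the squared error
        A = soft_threshold(A - step * grad, step * lam)
    return A

def binarize(A):
    # Assumed rule for illustration: positive entries become 1, the rest 0.
    return (A > 0).astype(int)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))                    # stand-in dense vectors
D = rng.normal(size=(40, 10))                    # stand-in dictionary
D /= np.linalg.norm(D, axis=1, keepdims=True)    # unit-norm rows

A = sparse_codes(X, D)
B = binarize(A)
print("fraction of zero entries:", np.mean(A == 0))
```

Because most entries of each code are exactly zero, the codes can be stored and compared using only their few active dimensions, which is the source of the computational convenience noted above.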
