Project news
- 16.11.2018: ldt v. 0.4.0 (see what's new)
- 04.11.2018: ldt v. 0.3.9 (see what's new)
- 08.10.2018: ldt v. 0.3.0 (see what's new)
- 25.09.2018: ldt v. 0.2.0 (see what's new)
- 24.08.2018: initial release of ldt (see what's new)
Word embeddings are used in many NLP tasks: sentiment analysis, inference, sequence labeling, and more. The choice of word embeddings can have a dramatic impact on the performance of the whole system. However, there are dozens of models, each with numerous parameters, and it is not feasible to try them all for every task and every corpus. Reliable, task-independent evaluation of meaning representations is therefore a major open problem for the field.
LD (Linguistic Diagnostics) is a new methodology for quantitative and qualitative evaluation of word embeddings via automatic annotation of different types of relations between words in word vector neighborhoods. The LD methodology is implemented in the Linguistic Diagnostics Toolkit (LDT), a free and open-source Python library.
The core idea is to identify the kinds of information that a given model encodes as proximity in vector space, since that proximity is the key component of most systems relying on word embeddings. The workflow can be sketched as follows: take a vocabulary sample, retrieve the nearest neighbors of each word in the embedding space, automatically annotate the relations between each word and its neighbors, and aggregate the counts into per-relation LD scores.
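A minimal sketch of this workflow in plain Python is shown below. It assumes the embeddings are already loaded as a word-to-index vocabulary plus a vector matrix, and the `annotate_pair` helper is a hypothetical stand-in for LDT's dictionary-based annotation of a word pair; it is not the library's actual API.

```python
import numpy as np
from collections import Counter

def nearest_neighbors(word, vocab, vectors, n=10):
    """Return the n closest words to `word` by cosine similarity.

    `vocab` maps words to row indices in the `vectors` matrix.
    """
    target = vectors[vocab[word]]
    sims = vectors @ target / (
        np.linalg.norm(vectors, axis=1) * np.linalg.norm(target) + 1e-9)
    index_to_word = {i: w for w, i in vocab.items()}
    ranked = [index_to_word[i] for i in np.argsort(-sims)]
    return [w for w in ranked if w != word][:n]

def ld_profile(sample, vocab, vectors, annotate_pair, n=10):
    """Aggregate relation counts over the neighbors of each sampled word.

    `annotate_pair(word, neighbor)` is a hypothetical stand-in for the
    annotation step: it should return the relations holding between the
    two words (e.g. synonymy or a shared derivational pattern).
    """
    counts, total = Counter(), 0
    for word in sample:
        for neighbor in nearest_neighbors(word, vocab, vectors, n):
            counts.update(annotate_pair(word, neighbor))
            total += 1
    # The profile: the share of neighbor pairs exhibiting each relation, in %.
    return {rel: 100.0 * c / total for rel, c in counts.items()}
```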
For example, one word embedding model could specialize in synonyms, while another could favor morphological relations. LD currently quantifies over twenty such variables that represent different linguistic and distributional relations.
Once you have LD scores for your embeddings (based on our general vocabulary sample or your own task-relevant vocabulary sample), you can use them to:
- compare models and explain their performance on downstream tasks
- see the effect of a given parameter change
- test hypotheses in the development of word embedding models
- get a better idea of which model would fit your task, thanks to a growing repository of data on correlations between different tasks.
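As a purely illustrative follow-up to the list above, LD profiles from several models can be put side by side and correlated with downstream performance. All profile values, relation labels, and task scores below are made-up placeholders, not real results:

```python
from scipy.stats import spearmanr

# Hypothetical LD profiles (share of neighbor pairs per relation, in %)
# and downstream task scores for three models; every number is a placeholder.
profiles = {
    "model_a": {"Synonyms": 12.4, "SharedDerivation": 3.1},
    "model_b": {"Synonyms": 7.8, "SharedDerivation": 9.6},
    "model_c": {"Synonyms": 10.2, "SharedDerivation": 5.0},
}
task_scores = {"model_a": 0.81, "model_b": 0.74, "model_c": 0.79}

# Rank correlation between one LD variable and task performance.
models = sorted(profiles)
rho, p = spearmanr([profiles[m]["Synonyms"] for m in models],
                   [task_scores[m] for m in models])
print(f"Spearman rho between Synonyms share and task score: {rho:.2f} (p={p:.2f})")
```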
LD is the only intrinsic evaluation metric to date that does not rely on a pre-defined dataset, which can never be sufficient to reflect everything of interest for specialized downstream tasks. For example, suppose that you are trying to find the optimal representation for sentiment analysis. The standard relatedness tests focus on common nouns, but it may be that in your case the quality of the representations of adjectives matters more. With LD you can profile different models based on their behavior on your own vocabulary sample, and avoid relying on a test set that only covers a few relations, as happened with the word analogy task. LD also does not share the known methodological issues of the similarity/relatedness tests or the word analogy task.
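For the sentiment scenario above, a task-specific vocabulary sample could be assembled by pulling frequent adjectives out of your own corpus. The sketch below uses NLTK's off-the-shelf POS tagger and is only one possible way to build such a sample; the `adjective_sample` helper is hypothetical, not part of LDT.

```python
from collections import Counter
import nltk
# On first run you may need: nltk.download("averaged_perceptron_tagger")

def adjective_sample(sentences, size=1000):
    """Pick the most frequent adjectives from a tokenized corpus to serve
    as a task-specific vocabulary sample for LD profiling."""
    counts = Counter()
    for tokens in sentences:
        for word, tag in nltk.pos_tag(tokens):
            if tag.startswith("JJ"):  # JJ, JJR, JJS = adjectives
                counts[word.lower()] += 1
    return [word for word, _ in counts.most_common(size)]

# Usage (tokenized sentences from your sentiment corpus):
# sample = adjective_sample(corpus_sentences, size=500)
```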
LDT was presented at COLING 2018 (pdf):
@inproceedings{rogers-etal-2018-whats,
  author = "Rogers, Anna and Hosur Ananthakrishna, Shashwath and Rumshisky, Anna",
  title = "What's in Your Embedding, And How It Predicts Task Performance",
  booktitle = "Proceedings of the 27th International Conference on Computational Linguistics",
  year = "2018",
  publisher = "Association for Computational Linguistics",
  pages = "2690--2703",
  address = "Santa Fe, New Mexico, USA",
  url = "http://aclweb.org/anthology/C18-1228"
}