Task data

One of the strengths of the LD approach is that it is easy to extend to any vocabulary sample (i.e. whatever is relevant for your domain-specific task) and can be run on any set of word embeddings. That said, a fair comparison with models published by others must be conducted on the same data: you can only judge what a model retained if you know what data it started with.

Corpus

All LD scores and analyses currently on the website were obtained from the English Wikipedia dump of August 2013, as described in this paper. All pre-trained embeddings can be downloaded here.

The training corpus is available in 3 versions:

Filtering vocabulary

Since LD relies on the content of vector neighborhoods, it is not fair to compare embeddings with different vocabulary sizes. Our source embeddings were prepared from the same corpus but with different context types, so their vocabulary sizes differed significantly. We therefore filtered them down to the vocabulary present in all of the models. The vocabulary list is available here.
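
The filtering step can be sketched as a set intersection over the models' vocabularies. This is a minimal illustration, not the code used to produce the published vocabulary list. It assumes embeddings stored in a plain-text format with one word per line followed by its vector values, and the function names (load_vocab, shared_vocabulary, filter_embedding) are hypothetical.

```python
def load_vocab(lines):
    """Collect the words from a text-format embedding (assumed: 'word v1 v2 ...')."""
    vocab = set()
    for line in lines:
        parts = line.split()
        # Skip blank lines and an optional 'count dimensions' header line.
        # Assumption: vectors have at least two dimensions.
        if len(parts) > 2:
            vocab.add(parts[0])
    return vocab


def shared_vocabulary(vocabs):
    """Return the words present in every vocabulary."""
    vocabs = list(vocabs)
    if not vocabs:
        return set()
    return set.intersection(*vocabs)


def filter_embedding(lines, keep):
    """Keep only the embedding rows whose word is in the shared vocabulary."""
    filtered = []
    for line in lines:
        parts = line.split()
        if len(parts) > 2 and parts[0] in keep:
            filtered.append(line)
    return filtered
```

In practice the lines would come from reading each embedding file, and the filtered rows would be written back out, with any header line regenerated to match the new word count.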

Vocabulary sample

Our study was performed with ldt909, a sample of 909 common words balanced for part of speech (adjectives, adverbs, nouns, verbs) and for frequency in the Wikipedia corpus. Among nouns, only common nouns were included. To keep the POS balance clean, we also restricted the selection to words that have a single part of speech according to WordNet. See the paper for details.
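
For readers building a sample for their own domain, the general idea of balancing by part of speech and frequency can be sketched as stratified sampling. This is only an illustration under stated assumptions; the actual procedure behind ldt909 is described in the paper. The frequency bands, the per-cell size, and the function names (single_pos_words, balanced_sample) are hypothetical, and the POS lookup is passed in as a plain dictionary rather than queried from WordNet.

```python
import random
from collections import defaultdict


def single_pos_words(pos_by_word):
    """Keep words that have exactly one part of speech.

    pos_by_word maps a word to the set of its POS tags (hypothetical input;
    in practice this could be derived from WordNet).
    """
    return {word for word, tags in pos_by_word.items() if len(tags) == 1}


def balanced_sample(candidates, per_cell, seed=0):
    """Draw the same number of words from every (POS, frequency band) cell.

    candidates is an iterable of (word, pos, band) tuples, where band is a
    hypothetical frequency bucket label such as 'high' or 'low'.
    """
    cells = defaultdict(list)
    for word, pos, band in candidates:
        cells[(pos, band)].append(word)

    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    sample = {}
    for cell in sorted(cells):
        words = sorted(cells[cell])
        if len(words) < per_cell:
            raise ValueError(f"not enough candidates in cell {cell}")
        sample[cell] = rng.sample(words, per_cell)
    return sample
```

Sorting the cells and the words before sampling keeps the result independent of input order, so the same seed always yields the same sample.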

Download the vocabulary sample.