Task data
One of the strengths of the LD approach is that it is easy to extend to any vocabulary sample (i.e. whatever is relevant for your domain-specific task), and it can be run on any set of word embeddings. That being said, a fair comparison with models published by others must be conducted on the same data: it is only fair to evaluate what a model retained if you know what data it started with.
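For instance, the basic setup LD assumes (any embedding model plus any vocabulary sample) can be sketched as follows. This is only an illustration, assuming embeddings in word2vec text format and the gensim library; the file names are hypothetical placeholders, not files distributed with this study.

```python
from gensim.models import KeyedVectors

# Load any pre-trained embeddings in word2vec text format
# (hypothetical file name).
emb = KeyedVectors.load_word2vec_format("my_embeddings.txt", binary=False)

# Any domain-specific vocabulary sample, one word per line
# (hypothetical file name); keep only in-vocabulary words.
with open("my_vocab_sample.txt", encoding="utf-8") as f:
    sample = [w.strip() for w in f if w.strip() in emb.key_to_index]

# LD analyses start from the top-n neighborhood of each sample word.
neighborhoods = {w: emb.most_similar(w, topn=100) for w in sample}
```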
Corpus
All ld scores and analysis currently on the website were obtained on the basis of English Wikipedia dump of August 2013, as described in this paper. All pre-trained embeddings can be downloaded here.
The training corpus is available in 3 versions:
One-word-per-line, parser tokenization:
(the last link is the version used for the non-dependency-parsed embeddings in our study, so use this one if you would like directly comparable embeddings).
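If you need to regroup the one-word-per-line version into sentences for toolkits that expect one sentence per line, a minimal sketch might look like the following. It assumes sentence boundaries are marked by blank lines, which is an assumption about the file format; verify against the actual dump before relying on it.

```python
def sentences(path):
    """Yield sentences (lists of tokens) from a one-word-per-line file.
    Assumes a blank line marks a sentence boundary (unverified assumption)."""
    sent = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            token = line.strip()
            if token:
                sent.append(token)
            elif sent:
                yield sent
                sent = []
    if sent:
        yield sent

# Example: feed the stream to gensim's word2vec trainer
# (hypothetical file name).
# from gensim.models import Word2Vec
# model = Word2Vec(sentences=list(sentences("wiki_tokenized.txt")))
```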
Filtering vocabulary
Since LD relies on the content of vector neighborhoods, it is not fair to compare embeddings with different vocabulary sizes. Our source embeddings were prepared from the same corpus, but with different context types, and so their vocabulary sizes differed significantly. We therefore filtered them down to the vocabulary present in all of the models. The vocabulary list is available here.
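As an illustration, this shared-vocabulary filtering amounts to a set intersection over the models' vocabularies. The sketch below assumes gensim KeyedVectors and word2vec text format; the file names are placeholders, not the actual model files.

```python
from functools import reduce
from gensim.models import KeyedVectors

# Hypothetical file names for models trained on the same corpus
# with different context types.
paths = ["deps.txt", "bow2.txt", "bow5.txt"]
models = [KeyedVectors.load_word2vec_format(p, binary=False) for p in paths]

# Keep only the words present in every model's vocabulary.
shared = reduce(set.intersection, (set(m.key_to_index) for m in models))

with open("shared_vocab.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(sorted(shared)))
```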
Vocabulary sample
Our study was performed with ldt909, a sample of 909 common words balanced for parts of speech (adjectives, adverbs, nouns, verbs) and for frequency in the Wikipedia corpus. Only common nouns were included (no proper nouns). To keep the POS balance clean, we also restricted the selection to words that have no more than one part of speech according to WordNet. See the paper for details.
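The single-POS restriction can be checked against WordNet via NLTK. The sketch below is only an illustration of that check, not the exact selection script used in the study.

```python
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)

def wordnet_pos_tags(word):
    """Return the set of WordNet POS tags attested for a word.
    Adjective satellites ('s') are folded into adjectives ('a')."""
    return {"a" if s.pos() == "s" else s.pos() for s in wn.synsets(word)}

# Keep only words with exactly one WordNet part of speech.
candidates = ["dog", "run", "happy", "quickly"]
single_pos = [w for w in candidates if len(wordnet_pos_tags(w)) == 1]
print(single_pos)
```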