LDT Leaderboard
Each context type (Lin = linear, DEPS = dependency-based, -str = structured) is reported for three models: SG, CBOW, and GloVe.

LD Scores | SG (Lin) | CBOW (Lin) | GloVe (Lin) | SG (DEPS) | CBOW (DEPS) | GloVe (DEPS) | SG (Lin-str) | CBOW (Lin-str) | GloVe (Lin-str) | SG (DEPS-str) | CBOW (DEPS-str) | GloVe (DEPS-str) |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SharedMorphForm | 52.90 | 51.82 | 52.06 | 55.50 | 60.36 | 47.35 | 61.76 | 58.93 | 59.22 | 66.45 | 68.82 | 50.46 |
SharedDerivation | 5.08 | 4.47 | 3.94 | 7.28 | 11.17 | 3.02 | 11.70 | 11.08 | 6.89 | 14.67 | 15.38 | 2.82 |
SharedPOS | 31.71 | 30.06 | 35.51 | 34.89 | 45.57 | 34.50 | 50.05 | 47.73 | 52.73 | 58.47 | 63.41 | 39.22 |
ProperNouns | 27.86 | 30.44 | 27.28 | 28.31 | 25.28 | 34.14 | 23.53 | 25.74 | 26.66 | 21.93 | 20.56 | 38.52 |
Numbers | 3.64 | 4.31 | 3.15 | 3.84 | 2.64 | 3.31 | 3.31 | 3.95 | 2.95 | 2.73 | 2.87 | 3.30 |
ForeignWords | 1.79 | 2.15 | 1.98 | 1.86 | 1.53 | 3.37 | 1.51 | 2.12 | 1.90 | 1.50 | 1.17 | 4.42 |
Misspellings | 12.81 | 13.55 | 9.91 | 13.73 | 8.73 | 11.87 | 11.92 | 13.93 | 8.33 | 11.66 | 10.97 | 13.67 |
Synonyms | 0.45 | 0.41 | 0.44 | 0.43 | 0.41 | 0.45 | 0.42 | 0.36 | 0.41 | 0.37 | 0.32 | 0.33 |
Antonyms | 0.14 | 0.13 | 0.13 | 0.15 | 0.13 | 0.14 | 0.14 | 0.12 | 0.14 | 0.13 | 0.12 | 0.10 |
Meronyms | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
Hypernyms | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
Hyponyms | 0.04 | 0.04 | 0.04 | 0.04 | 0.03 | 0.04 | 0.04 | 0.03 | 0.03 | 0.03 | 0.03 | 0.03 |
OtherRelations | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 | 0.01 |
Associations | 0.61 | 0.63 | 1.39 | 0.56 | 0.67 | 1.41 | 0.57 | 0.46 | 1.04 | 0.46 | 0.41 | 0.69 |
ShortestPath | 0.08 | 0.07 | 0.07 | 0.08 | 0.07 | 0.07 | 0.08 | 0.08 | 0.07 | 0.08 | 0.08 | 0.07 |
GDeps | 16.53 | 16.39 | 37.27 | 14.69 | 22.66 | 33.60 | 16.48 | 13.94 | 29.50 | 13.93 | 14.72 | 16.60 |
LowFreqNeighbors | 96.11 | 94.78 | 66.51 | 96.45 | 88.67 | 71.27 | 95.36 | 96.50 | 74.71 | 96.87 | 96.43 | 87.99 |
HighFreqNeighbors | 2.51 | 3.42 | 15.70 | 2.24 | 7.09 | 15.51 | 3.49 | 2.65 | 17.19 | 2.30 | 2.91 | 9.24 |
NonCooccurring | 90.25 | 88.97 | 67.90 | 91.32 | 84.89 | 72.76 | 91.96 | 93.27 | 80.17 | 93.86 | 93.72 | 89.85 |
CloseNeighbors | 2.28 | 3.10 | 0.16 | 3.10 | 3.77 | 0.09 | 2.67 | 5.02 | 0.03 | 5.44 | 7.09 | 0.01 |
FarNeighbors | 32.57 | 45.83 | 95.72 | 24.22 | 38.85 | 97.15 | 32.16 | 19.24 | 99.02 | 10.53 | 8.82 | 98.47 |

LD score definitions:

- SharedMorphForm: % of neighbors of lemma words that are themselves lemmas
- SharedDerivation: % of neighbors that share an affix or stem with the target words
- SharedPOS: % of neighbors that have the same POS as the target words
- ProperNouns: % of neighbors that are proper nouns
- Numbers: % of neighbors that are or contain a number
- ForeignWords: % of neighbors that are foreign words
- Misspellings: % of neighbors that were misspelled or contained pre-processing noise
- Synonyms: % of neighbors that were synonyms of the target words
- Antonyms: % of neighbors that were antonyms of the target words
- Meronyms: % of neighbors that were meronyms of the target words
- Hypernyms: % of neighbors that were hypernyms of the target words
- Hyponyms: % of neighbors that were hyponyms of the target words
- OtherRelations: % of neighbors in a different lexicographic relation with the target words (esp. co-hyponyms)
- Associations: % of neighbors that were psychological associations of the target words
- ShortestPath: median minimum path between synsets of target:neighbor pairs in the WordNet ontology
- GDeps: % of neighbors that co-occurred with the target words in a larger corpus (Google dependency ngrams)
- LowFreqNeighbors: % of neighbors whose frequency in the source corpus is under 10,000
- HighFreqNeighbors: % of neighbors whose frequency in the source corpus is above 10,000
- NonCooccurring: % of neighbors that did not co-occur with the target words in the source corpus
- CloseNeighbors: % of neighbors that were closer than 0.8 to the target word
- FarNeighbors: % of neighbors that were farther than 0.7 from the target word
You can sort the table by column name (i.e., by model or by LD score).
The LDT leaderboard is unconventional: there is no implied binary of winning or losing across all scores. For intrinsic evaluation such a binary would be misleading, because there is no such thing as a representation that is simply "good" in a vacuum. What LDT provides instead is a detailed profile of the information a model actually encodes in its word vector neighborhoods. It is certainly possible to "win" on any combination of these scores, and our correlation data show that these profiles do predict what a given representation is good for. However, it is not clear how a model could "win" on all of these relations at once: specialization for a given relation or task usually harms generalizability.
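To make the neighborhood-profiling idea concrete, here is a minimal sketch of how a score such as SharedPOS could be computed: take the top-n nearest neighbors of each target word by cosine similarity and report the percentage that share the target's POS tag. This is plain numpy, not LDT's actual implementation, and all names here (`nearest_neighbors`, `shared_pos`, the `pos_tags` mapping) are illustrative assumptions.

```python
import numpy as np

def nearest_neighbors(vectors, vocab, target, n=10):
    """Return the n words most cosine-similar to `target`.
    `vectors` is a (V, d) matrix; `vocab` is a list of V words."""
    idx = vocab.index(target)
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed[idx]          # cosine similarity to the target
    sims[idx] = -np.inf                  # exclude the target itself
    return [vocab[i] for i in np.argsort(-sims)[:n]]

def shared_pos(vectors, vocab, pos_tags, targets, n=10):
    """% of the n nearest neighbors that share the target word's POS tag.
    `pos_tags` maps words to POS labels (assumed to be given)."""
    hits, total = 0, 0
    for target in targets:
        for neighbor in nearest_neighbors(vectors, vocab, target, n):
            hits += int(pos_tags.get(neighbor) == pos_tags[target])
            total += 1
    return 100.0 * hits / total
```

The other percentage-based scores differ only in the property checked for each target:neighbor pair (shared stem, proper-noun status, synonymy in WordNet, and so on).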
All embeddings were trained on a snapshot of English Wikipedia from August 2013. Two parameters are systematically varied: dimensionality and the type of syntactic context (structured vs. unstructured, linear vs. dependency-based, as described in this paper; structured contexts are the ones that take positional information into account). All embeddings shown here are size 500, and the corpus for training comparable models is available here.
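For reference, a comparable skip-gram model with linear unstructured contexts could be trained with gensim roughly as follows. This is a sketch under assumptions: the corpus path is a placeholder, the window and frequency settings are not taken from the original setup, and dependency-based or structured contexts require word2vecf-style tooling that gensim does not provide.

```python
from gensim.models import Word2Vec

# Sketch only: trains a 500-dimensional skip-gram model with plain linear
# contexts. The file path is a placeholder; window and min_count are
# assumptions, not the settings used for this leaderboard.
model = Word2Vec(
    corpus_file="wikipedia_aug2013_tokenized.txt",  # one tokenized sentence per line
    vector_size=500,  # matches the size-500 embeddings reported above
    sg=1,             # skip-gram; sg=0 would give CBOW
    window=2,         # assumed context window
    min_count=100,    # assumed frequency cutoff
    workers=4,
)
model.wv.save_word2vec_format("sg_linear_500.vec")
```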