LDT Leaderboard


                       Linear               DEPS         Linear (structured)  DEPS (structured)
LD Scores            SG  CBOW GloVe      SG  CBOW GloVe      SG  CBOW GloVe      SG  CBOW GloVe
-----------------------------------------------------------------------------------------------
SharedMorphForm   52.90 51.82 52.06   55.50 60.36 47.35   61.76 58.93 59.22   66.45 68.82 50.46
SharedDerivation   5.08  4.47  3.94    7.28 11.17  3.02   11.70 11.08  6.89   14.67 15.38  2.82
SharedPOS         31.71 30.06 35.51   34.89 45.57 34.50   50.05 47.73 52.73   58.47 63.41 39.22
ProperNouns       27.86 30.44 27.28   28.31 25.28 34.14   23.53 25.74 26.66   21.93 20.56 38.52
Numbers            3.64  4.31  3.15    3.84  2.64  3.31    3.31  3.95  2.95    2.73  2.87  3.30
ForeignWords       1.79  2.15  1.98    1.86  1.53  3.37    1.51  2.12  1.90    1.50  1.17  4.42
Misspellings      12.81 13.55  9.91   13.73  8.73 11.87   11.92 13.93  8.33   11.66 10.97 13.67
Synonyms           0.45  0.41  0.44    0.43  0.41  0.45    0.42  0.36  0.41    0.37  0.32  0.33
Antonyms           0.14  0.13  0.13    0.15  0.13  0.14    0.14  0.12  0.14    0.13  0.12  0.10
Meronyms           0.01  0.01  0.01    0.01  0.01  0.01    0.01  0.01  0.01    0.01  0.01  0.01
Hypernyms          0.01  0.01  0.01    0.01  0.01  0.01    0.01  0.01  0.01    0.01  0.01  0.01
Hyponyms           0.04  0.04  0.04    0.04  0.03  0.04    0.04  0.03  0.03    0.03  0.03  0.03
OtherRelations     0.01  0.01  0.01    0.01  0.01  0.01    0.01  0.01  0.01    0.01  0.01  0.01
Associations       0.61  0.63  1.39    0.56  0.67  1.41    0.57  0.46  1.04    0.46  0.41  0.69
ShortestPath       0.08  0.07  0.07    0.08  0.07  0.07    0.08  0.08  0.07    0.08  0.08  0.07
GDeps             16.53 16.39 37.27   14.69 22.66 33.60   16.48 13.94 29.50   13.93 14.72 16.60
LowFreqNeighbors  96.11 94.78 66.51   96.45 88.67 71.27   95.36 96.50 74.71   96.87 96.43 87.99
HighFreqNeighbors  2.51  3.42 15.70    2.24  7.09 15.51    3.49  2.65 17.19    2.30  2.91  9.24
NonCooccurring    90.25 88.97 67.90   91.32 84.89 72.76   91.96 93.27 80.17   93.86 93.72 89.85
CloseNeighbors     2.28  3.10  0.16    3.10  3.77  0.09    2.67  5.02  0.03    5.44  7.09  0.01
FarNeighbors      32.57 45.83 95.72   24.22 38.85 97.15   32.16 19.24 99.02   10.53  8.82 98.47

Score definitions:

SharedMorphForm: % of neighbors of lemma words that are themselves lemmas
SharedDerivation: % of neighbors that share an affix or stem with the target words
SharedPOS: % of neighbors that have the same POS as the target words
ProperNouns: % of neighbors that are proper nouns
Numbers: % of neighbors that are or contain a number
ForeignWords: % of neighbors that are foreign words
Misspellings: % of neighbors that were misspelled or had pre-processing noise
Synonyms: % of neighbors that were synonyms of the target words
Antonyms: % of neighbors that were antonyms of the target words
Meronyms: % of neighbors that were meronyms of the target words
Hypernyms: % of neighbors that were hypernyms of the target words
Hyponyms: % of neighbors that were hyponyms of the target words
OtherRelations: % of neighbors that were in a different lexicographic relation to the target words (esp. co-hyponyms)
Associations: % of neighbors that were psychological associations of the target words
ShortestPath: median minimum path between synsets of target:neighbor pairs in the WordNet ontology (see the sketch below)
GDeps: % of neighbors that co-occurred with the target words in a larger corpus (Google dependency ngrams)
LowFreqNeighbors: % of neighbors whose frequency in the source corpus is under 10,000
HighFreqNeighbors: % of neighbors whose frequency in the source corpus is above 10,000
NonCooccurring: % of neighbors that did not co-occur with the target words in the source corpus
CloseNeighbors: % of neighbors that were closer than 0.8 to the target word
FarNeighbors: % of neighbors that were further away than 0.7 from the target word

You can sort the table by column names (i.e., the models and the LD scores).
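As a concrete illustration of one of these scores, the sketch below estimates ShortestPath, the median shortest path between WordNet synsets of target:neighbor pairs, with NLTK. The word pairs are invented, and taking the closest synset pair per word pair is a simplification rather than LDT's exact procedure; the raw path lengths are also not directly comparable to the (apparently normalized) values in the table.

```python
# A minimal sketch, not LDT's exact procedure: median shortest path
# between WordNet synsets of target:neighbor pairs, via NLTK.
# Requires: pip install nltk; nltk.download("wordnet")
from statistics import median
from nltk.corpus import wordnet as wn

# made-up target:neighbor pairs, for illustration only
pairs = [("cat", "dog"), ("car", "truck"), ("walk", "run")]

path_lengths = []
for target, neighbor in pairs:
    candidates = [s1.shortest_path_distance(s2)
                  for s1 in wn.synsets(target)
                  for s2 in wn.synsets(neighbor)]
    candidates = [d for d in candidates if d is not None]  # None = no path
    if candidates:
        path_lengths.append(min(candidates))  # closest synset pair

print("median shortest path:", median(path_lengths))
```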

The LDT leaderboard is unconventional: it does not imply a binary notion of winning or losing across all scores. For intrinsic evaluation such a notion is misleading, because there is no such thing as a representation that is simply "good" in a vacuum. What LDT provides instead is a detailed profile of the information that a model actually encodes in its word vector neighborhoods. It is certainly possible to "win" on any particular combination of these scores, and our correlation data shows that these profiles do predict what a given representation is good for. However, it is not clear how a model could "win" on all of these relations at once: specialization for a given relation or task usually harms generalizability.
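To make the idea of a neighborhood profile concrete, here is a minimal sketch of how scores such as LowFreqNeighbors, CloseNeighbors and FarNeighbors could be computed for a single model with gensim. The file names, the target-word sample and the top-100 neighborhood size are illustrative assumptions, not the exact settings behind the table above (the LDT toolkit automates these analyses).

```python
# A minimal sketch, assuming a word2vec-format model and a plain-text
# training corpus; file names, the 1,000-word target sample and the
# top-100 neighborhood size are assumptions, not LDT's exact settings.
from collections import Counter
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("sg_linear_500.bin", binary=True)

# corpus frequencies, needed for the Low/HighFreqNeighbors scores
with open("wiki_2013_tokenized.txt", encoding="utf-8") as corpus:
    freq = Counter(word for line in corpus for word in line.split())

targets = vectors.index_to_key[:1000]  # illustrative target-word sample

low_freq = close = far = total = 0
for word in targets:
    for neighbor, similarity in vectors.most_similar(word, topn=100):
        total += 1
        if freq[neighbor] < 10_000:   # LowFreqNeighbors
            low_freq += 1
        if similarity > 0.8:          # CloseNeighbors ("closer than 0.8")
            close += 1
        if similarity < 0.7:          # FarNeighbors ("further than 0.7")
            far += 1

for name, count in (("LowFreqNeighbors", low_freq),
                    ("CloseNeighbors", close),
                    ("FarNeighbors", far)):
    print(f"{name}: {100 * count / total:.2f}%")
```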

All embeddings were trained on the August 2013 dump of English Wikipedia. Two parameters are systematically varied: dimensionality and the type of syntactic context (structured vs. unstructured, linear vs. dependency-based, as described in this paper; structured contexts are the ones that take positional information into account). The embeddings shown above are all of size 500, and the corpus for training comparable models is available here.
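For comparable experiments, the two linear-context word2vec models could be approximated with gensim along the following lines. The file names, window size and other hyperparameters (apart from the vector size of 500) are assumptions rather than the exact settings used here; GloVe and the dependency-based or structured contexts require other implementations (e.g. word2vecf for dependency contexts).

```python
# A hedged sketch of training the linear-context SG and CBOW models
# (size 500) with gensim; hyperparameters other than vector_size are
# assumptions, not the settings behind the leaderboard.
from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("wiki_2013_tokenized.txt")  # one sentence per line

for sg_flag, name in ((1, "sg"), (0, "cbow")):
    model = Word2Vec(sentences, vector_size=500, window=2,
                     sg=sg_flag, workers=4)
    model.wv.save_word2vec_format(f"{name}_linear_500.bin", binary=True)
```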