LD Scores

Linguistic Diagnostics considers morphological, semantic, psychological and distributional factors that may be relevant to evaluation of distributional meaning representations. The current collection includes 21 factors, as listed below.

Binary relations (e.g. synonymy is either detected or not) are quantified as a simple count of all cases of that relation in all target:neighbor pairs for each embedding. Directed lexicographic relations (hypernymy, hyponymy, meronymy) are counted when the target word is e.g. a hypernym of the neighbor. Continuous variables are broken down into bins, the size of which was chosen empirically.
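The counting and binning described above can be sketched as follows. This is a minimal illustration, not LDT's own code: the pair list, the lookup tables, the function names and the bin edges are all hypothetical, and the directed check follows the convention stated in the paragraph above (the target is the hypernym of the neighbor).

```python
from bisect import bisect_right
from collections import Counter

# Hypothetical toy data: (target, neighbor) pairs from one embedding.
PAIRS = [("big", "large"), ("animal", "dog"), ("dog", "animal"), ("big", "red")]

# Hypothetical lookup tables standing in for real dictionary resources.
SYNONYMS = {frozenset(("big", "large"))}
HYPERNYMS = {"dog": {"animal"}}  # maps a word to the set of its hypernyms


def is_synonym(target, neighbor):
    """Symmetric binary relation: order of the pair does not matter."""
    return frozenset((target, neighbor)) in SYNONYMS


def target_is_hypernym(target, neighbor):
    """Directed relation: true only if the target is a hypernym of the neighbor."""
    return target in HYPERNYMS.get(neighbor, set())


def count_relation(pairs, predicate):
    """Simple count of all pairs for which the relation is detected."""
    return sum(1 for target, neighbor in pairs if predicate(target, neighbor))


def bin_counts(values, edges):
    """Count values per half-open bin [edges[i], edges[i + 1]).

    Values outside the outermost edges are skipped.
    """
    counts = Counter()
    for value in values:
        i = bisect_right(edges, value) - 1
        if 0 <= i < len(edges) - 1:
            counts[(edges[i], edges[i + 1])] += 1
    return dict(counts)
```

With this toy data, only the pair animal:dog counts towards the directed relation, because dog:animal has the relation in the opposite direction.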

Distributional factors

LowFreqNeighbors
the frequency of the neighbor in the source corpus is under 10,000.
HighFreqNeighbors
the frequency of the neighbor in the source corpus is above 10,000.
NeighborsInGDeps
whether the two words co-occur in the Google dependency ngrams.
NonCooccurring
the number of word pairs that do not co-occur in the source corpus.
CloseNeighbors
the number of top 100 neighbors with cosine similarity to the target word over 0.8.
FarNeighbors
the number of top $n$ neighbors with cosine similarity to the target word less than 0.7.
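As a rough sketch of how the CloseNeighbors and FarNeighbors counts could be computed for one target word, assuming the 0.8 and 0.7 thresholds apply to cosine similarity (a higher value means a closer neighbor). The function names and the toy vectors are hypothetical and are not part of LDT.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighbor_similarity_scores(target_vec, neighbor_vecs, close=0.8, far=0.7):
    """Count neighbors above the `close` threshold and below the `far` threshold.

    Neighbors whose similarity falls between the two thresholds count
    towards neither score.
    """
    sims = [cosine_similarity(target_vec, vec) for vec in neighbor_vecs]
    return {
        "CloseNeighbors": sum(1 for s in sims if s > close),
        "FarNeighbors": sum(1 for s in sims if s < far),
    }
```

In practice `neighbor_vecs` would hold the vectors of the top neighbors retrieved for the target word.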

Semantic factors

Synonyms
the number of neighboring word vectors that are synonyms of the target;
Antonyms
the number of neighboring word vectors that are antonyms of the target;
Hypernyms
the number of neighboring word vectors that are hypernyms of the target;
Hyponyms
the number of neighboring word vectors that are hyponyms of the target;
Meronyms
the number of neighboring word vectors that are meronyms of the target;
Other
holonymy, troponymy, coordinate terms, and "otherwise related" in Wiktionary;
ShortestPath
the median of minimum paths between synsets of all target:neighbor pairs in the WordNet ontology (if they are both present in WN). Starting with LDT 0.3.0 this variable is named ShortestPathMedian.
CloseInOntology
the number of target:neighbor pairs that are closer than 0.5 in the WordNet ontology (new in LDT 0.3.0).
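The ShortestPath computation can be illustrated on a toy ontology. The sketch below uses a small hand-made graph and a breadth-first search instead of WordNet itself; the synset identifiers, the edge list and all function names are hypothetical. Pairs in which either word is missing from the ontology are skipped, as in the definition above.

```python
from collections import deque
from statistics import median

# Hypothetical toy ontology: undirected links between synset identifiers.
EDGES = [
    ("dog.n", "canine.n"),
    ("canine.n", "carnivore.n"),
    ("cat.n", "feline.n"),
    ("feline.n", "carnivore.n"),
    ("carnivore.n", "animal.n"),
]

# Hypothetical word-to-synsets mapping; "dog.v" has no links in the toy graph.
SYNSETS = {"dog": ["dog.n", "dog.v"], "cat": ["cat.n"], "animal": ["animal.n"]}


def build_graph(edges):
    """Turn an edge list into an adjacency dictionary."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def path_length(graph, start, goal):
    """Number of edges on the shortest path, or None if no path exists."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt == goal:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def min_path(graph, target, neighbor):
    """Minimum path over all synset pairs of the two words, or None."""
    lengths = []
    for s_target in SYNSETS.get(target, []):
        for s_neighbor in SYNSETS.get(neighbor, []):
            length = path_length(graph, s_target, s_neighbor)
            if length is not None:
                lengths.append(length)
    return min(lengths) if lengths else None


def shortest_path_median(pairs):
    """Median of minimum paths over all pairs found in the ontology."""
    graph = build_graph(EDGES)
    lengths = [min_path(graph, t, n) for t, n in pairs]
    lengths = [length for length in lengths if length is not None]
    return median(lengths) if lengths else None
```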

Other factors

ProperNouns
the neighbor is a proper noun;
Numbers
the neighbor is a numeral, or contains a number;
ForeignWords
the neighbor is not found in English, but found in other languages (German, French or Spanish in our experiments);
Misspellings
the neighbor is not found in dictionaries and contains an unusual combination of letters and punctuation or numbers;
Noise
the neighbor does not contain letters;
Associations
the two words constitute an associative pair (in either direction), according to EAT or USF-FAN.
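Two of the factors above, Numbers and Noise, depend only on the surface form of the neighbor, so they are easy to sketch. The code below is an illustrative approximation rather than LDT's implementation: the numeral word list is a hypothetical stand-in for a real dictionary lookup, and the function names are invented for this example.

```python
# Hypothetical, deliberately tiny list of numeral words.
NUMERAL_WORDS = {"one", "two", "three", "seven", "ten", "hundred", "thousand"}


def is_number(word):
    """True if the word is a known numeral or contains at least one digit."""
    return word.lower() in NUMERAL_WORDS or any(ch.isdigit() for ch in word)


def is_noise(word):
    """True if the word contains no letters at all."""
    return not any(ch.isalpha() for ch in word)
```

Note that under these two checks a string such as "2019" satisfies both; how such overlaps are resolved is not specified here, so the sketch makes no attempt to pick one category.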

Morphological factors

SharedMorphForm
the two words share their morphological form (in our experiments, both are lemmas);
SharedDerivation
the two words share affix(es) or stem(s), or are both compounds (based on Wiktionary and custom LDT tools);
SharedPOS
the two words have the same part of speech (any overlap counts).
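The "any overlap counts" rule for SharedPOS amounts to a set intersection: a word may have several possible parts of speech, and one shared tag is enough. A minimal sketch, with a hypothetical toy lexicon and function name:

```python
# Hypothetical toy lexicon mapping each word to its possible parts of speech.
POS_TAGS = {
    "run": {"noun", "verb"},
    "walk": {"noun", "verb"},
    "quickly": {"adverb"},
    "fast": {"adjective", "adverb", "noun", "verb"},
}


def shared_pos(target, neighbor, lexicon=POS_TAGS):
    """True if the two words have at least one part of speech in common.

    Words missing from the lexicon are treated as sharing nothing.
    """
    return bool(lexicon.get(target, set()) & lexicon.get(neighbor, set()))
```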

There are many other linguistic, psychological and distributional relations that may be relevant to evaluation of distributional meaning representations, and the LD test battery will grow over time.