Linguistic Diagnostics considers morphological, semantic, psychological and distributional factors that may be relevant to evaluation of distributional meaning representations. The current collection includes 21 factors, as listed below.
Binary relations (e.g. synonymy is either detected or not) are quantified as a simple count of all cases of that relation in all target:neighbor pairs for each embedding. Directed lexicographic relations (hypernymy, hyponymy, meronymy) are counted when the target word is e.g. a hypernym of the neighbor. Continuous variables are broken down into bins, the size of which was chosen empirically.
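The counting scheme described above can be sketched as follows. This is a minimal illustration, not the LDT implementation: the toy `HYPERNYMS` lexicon and the helpers `is_hypernym_of`, `count_relation` and `bin_values` are hypothetical names introduced here, and the bin width of 10,000 is only an example value.

```python
from collections import Counter

# Hypothetical toy lexicon: maps a word to the set of its hypernyms.
# A real run would draw this from a resource such as WordNet or Wiktionary.
HYPERNYMS = {
    "dog": {"animal", "canine"},
    "cat": {"animal", "feline"},
    "oak": {"tree"},
}


def is_hypernym_of(target, neighbor):
    """Directed check: True only if `target` is a hypernym of `neighbor`."""
    return target in HYPERNYMS.get(neighbor, set())


def count_relation(pairs, relation):
    """Count the target:neighbor pairs for which a binary relation holds."""
    return sum(1 for target, neighbor in pairs if relation(target, neighbor))


def bin_values(values, bin_size):
    """Group numeric values into fixed-width bins, keyed by bin index."""
    return Counter(int(v // bin_size) for v in values)


pairs = [("animal", "dog"), ("dog", "animal"), ("tree", "oak"), ("animal", "oak")]
hypernym_count = count_relation(pairs, is_hypernym_of)  # 2

# Example: neighbor frequencies grouped into bins of width 10,000.
freq_bins = bin_values([120, 9500, 15000, 250000], 10_000)
# Counter({0: 2, 1: 1, 25: 1})
```

Note that the pair `("dog", "animal")` is not counted: the relation is directed, so only pairs where the target is the hypernym of the neighbor contribute.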
- the frequency of the neighbor in the source corpus is under 10,000.
- the frequency of the neighbor in the source corpus is above 10,000.
- whether the two words co-occur in the Google dependency ngrams.
- the number of word pairs that do not co-occur in the source corpus.
- the number of top 100 neighbors with cosine distance to the target word over 0.8.
- the number of top $n$ neighbors with cosine distance to the target word less than 0.7.
- the number of neighboring word vectors that are synonyms of the target;
- the number of neighboring word vectors that are antonyms of the target;
- the number of neighboring word vectors that are hypernyms of the target;
- the number of neighboring word vectors that are hyponyms of the target;
- the number of neighboring word vectors that are meronyms of the target;
- holonymy, troponymy, coordinate terms, and "otherwise related" in Wiktionary;
- the median of minimum paths between synsets of all target:neighbor pairs in the WordNet ontology (if they are both present in WN). Starting with LDT 0.3.0 this variable is named ShortestPathMedian.
- the number of target:neighbor pairs that are closer than 0.5 in the WordNet ontology (new in LDT 0.3.0).
- the neighbor is a proper noun;
- the neighbor is a numeral, or contains a number;
- the neighbor is not found in English, but found in other languages (German, French or Spanish in our experiments);
- the neighbor is not found in dictionaries and contains an unusual combination of letters and punctuation or numbers;
- the neighbor does not contain letters;
- the two words constitute an associative pair (in either direction), according to EAT or USF-FAN;
- the two words share their morphological form (in our experiments, both are lemmas);
- the two words share affix(es) or stem(s), or are both compounds (based on Wiktionary and custom LDT tools);
- the two words have the same part of speech (any overlap counts).
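As an illustration of how the threshold-based distributional factors in the list could be tallied, here is a hedged sketch. The `Pair` record, the `profile` function and the result keys are hypothetical and are not LDT identifiers. The thresholds (10,000 for frequency, 0.8 and 0.7 for the cosine score) are taken from the descriptions above, and the cosine score is assumed to be precomputed by the caller.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    target: str
    neighbor: str
    score: float        # precomputed cosine score for the two word vectors
    neighbor_freq: int  # frequency of the neighbor in the source corpus


def profile(pairs, freq_threshold=10_000, close=0.8, far=0.7):
    """Tally threshold-based factors over all target:neighbor pairs."""
    return {
        # Hypothetical key names; strict inequalities follow the list above,
        # so a frequency of exactly `freq_threshold` falls in neither group.
        "low_freq": sum(p.neighbor_freq < freq_threshold for p in pairs),
        "high_freq": sum(p.neighbor_freq > freq_threshold for p in pairs),
        "close": sum(p.score > close for p in pairs),
        "far": sum(p.score < far for p in pairs),
    }


example_pairs = [
    Pair("cat", "kitten", 0.85, 500),
    Pair("cat", "dog", 0.75, 20_000),
    Pair("cat", "the", 0.60, 10_000),
    Pair("cat", "feline", 0.90, 12_000),
]
result = profile(example_pairs)
# {'low_freq': 1, 'high_freq': 2, 'close': 2, 'far': 1}
```

Pairs with a score between 0.7 and 0.8 count toward neither the close nor the far group, which matches the gap between the two thresholds in the list.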
There are many other linguistic, psychological and distributional relations that may be relevant to evaluation of distributional meaning representations, and the LD test battery will grow over time.