What LDT can do for you

This page gives a few examples of how LD scores can be used in practice: for error analysis, for explaining downstream task performance, and for guiding model development and optimization.

Explaining performance on downstream tasks

Question: why is model X beating model Y on task Z?

Answer: model X is doing something differently from model Y. Consider CBOW, GloVe, and Skip-Gram as an example; the table below shows a selection of their LD scores together with their downstream task performance.

                        CBOW     GloVe        SG
LD factors
SharedMorphForm       51.819    52.061    52.900
SharedPOS             30.061    35.507    31.706
SharedDerivation       4.468     3.938     5.084
Synonyms               0.413     0.443     0.447
Antonyms               0.128     0.133     0.144
Hyponyms               0.035     0.035     0.038
OtherRelations         0.013     0.013     0.013
Misspellings          13.546     9.914    12.809
ForeignWords           2.147     1.976     1.793
ProperNouns           30.442    27.278    27.864
Numbers                4.313     3.147     3.640
LowFreqNeighbors      94.778    66.510    96.109
HighFreqNeighbors      3.421    15.697     2.513
NonCooccurring        88.970    67.904    90.252
CloseNeighbors         3.102     0.160     2.278
FarNeighbors          25.209    49.934    21.410
Downstream tasks
POS-tagging           87.660    83.800    87.860
Chunking              77.530    66.100    78.230
NER                   75.210    69.620    75.720
Relation class.       74.780    71.050    74.800
Subjectivity class.   89.800    89.160    89.920
Polarity (sent.)      75.900    74.600    76.860
Sentiment (text)      82.220    82.240    82.730
SNLI                  69.290    69.510    69.740

Skip-Gram consistently outperforms both CBOW and GloVe on the downstream tasks, but the LD profiles suggest different reasons in each case:

  • LD scores for CBOW and Skip-Gram are mostly similar, but CBOW consistently trails slightly on meaningful relations such as synonyms, antonyms, and morphological relations, while picking up more noise (misspellings, foreign words, numbers). It also trails slightly on every downstream task.
  • GloVe differs markedly from both CBOW and Skip-Gram in the distributional factors: in particular, it is much worse at finding neighbor words that are low-frequency or that do not cooccur with the target word in the corpus. This suggests that it is less effective at inferring semantic information without direct distributional evidence; the sketch after this list ranks these differences using the numbers from the table.
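
To make this kind of comparison concrete, here is a toy illustration (plain Python, not the LDT API) that takes the distributional factors from the table above and ranks them by how much GloVe's profile deviates from Skip-Gram's. The values are copied from the table, only a subset of factors is included for brevity, and the profile_diff helper is purely illustrative.

```python
# Toy illustration (not the LDT API): rank LD factors by how much
# GloVe's profile deviates from Skip-Gram's. Values copied from the table above.

ld_profiles = {
    "CBOW":  {"LowFreqNeighbors": 94.778, "HighFreqNeighbors":  3.421,
              "NonCooccurring":   88.970, "CloseNeighbors":     3.102,
              "FarNeighbors":     25.209, "Synonyms":           0.413},
    "GloVe": {"LowFreqNeighbors": 66.510, "HighFreqNeighbors": 15.697,
              "NonCooccurring":   67.904, "CloseNeighbors":     0.160,
              "FarNeighbors":     49.934, "Synonyms":           0.443},
    "SG":    {"LowFreqNeighbors": 96.109, "HighFreqNeighbors":  2.513,
              "NonCooccurring":   90.252, "CloseNeighbors":     2.278,
              "FarNeighbors":     21.410, "Synonyms":           0.447},
}

def profile_diff(model_a, model_b, profiles=ld_profiles):
    """Absolute per-factor difference between two models, largest first."""
    diffs = {factor: abs(profiles[model_a][factor] - profiles[model_b][factor])
             for factor in profiles[model_a]}
    return sorted(diffs.items(), key=lambda kv: kv[1], reverse=True)

for factor, delta in profile_diff("GloVe", "SG"):
    print(f"{factor:<20} {delta:6.2f}")
```

Running this puts LowFreqNeighbors, FarNeighbors, and NonCooccurring at the top of the list, while the meaningful relations barely move, which is exactly the pattern described above.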

Hypothesis testing in model development

Question: my model is supposed to specialize in X; how do I make sure it actually does?

Answer: Profile your models for X.

Example: Dependency-based embeddings bring together words that can perform similar syntactic functions, as opposed to the standard linear bag-of-words contexts. This could mean that dependency-based embeddings have a higher ratio of synonyms in their word vector neighborhoods.
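
One way to run this check, if full LDT profiles are not at hand, is to sample vocabulary words, pull each word's top-n neighbors from both models, and count how many of those neighbors are synonyms of the target word. Below is a minimal sketch along those lines using gensim and WordNet; the file names bow_window2.txt and deps.txt are hypothetical placeholders, and WordNet is only one possible source of synonym pairs.

```python
# Minimal sketch, assuming two pre-trained models in word2vec text format;
# the file names are placeholders, and WordNet stands in as the synonym source.
# Requires: pip install gensim nltk; then nltk.download("wordnet")

from gensim.models import KeyedVectors
from nltk.corpus import wordnet as wn

def synonym_ratio(kv, vocab_sample, topn=10):
    """Share of top-n neighbors that are WordNet synonyms of the target word."""
    hits, total = 0, 0
    for word in vocab_sample:
        synonyms = {lemma.name().lower()
                    for synset in wn.synsets(word)
                    for lemma in synset.lemmas()} - {word.lower()}
        for neighbor, _ in kv.most_similar(word, topn=topn):
            total += 1
            if neighbor.lower() in synonyms:
                hits += 1
    return hits / total if total else 0.0

# Placeholder model files: a linear bag-of-words model and a dependency-based one.
bow  = KeyedVectors.load_word2vec_format("bow_window2.txt")
deps = KeyedVectors.load_word2vec_format("deps.txt")

# Profile both models on the same vocabulary sample so the ratios are comparable.
sample = [w for w in list(bow.key_to_index)[:1000] if w in deps.key_to_index]
print("BoW  synonym ratio:", synonym_ratio(bow, sample))
print("Deps synonym ratio:", synonym_ratio(deps, sample))
```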

[Figure (/assets/img/synonyms2.png): Effect of context type on specialization in synonymy]

As the figure shows, dependency-based embeddings do not actually outperform regular bag-of-words models in this respect.