Vector Semantics and Embedding (part 2)
[Figure: t-SNE visualization of word embeddings, in which related words cluster together (apple, grape, orange; one, two, three, four).]
[van der Maaten and Hinton, 2008. Visualizing data using t-SNE]
Named entity recognition example
Each word in the sentence gets a label: 1 if it is part of a name, 0 otherwise (e.g., 1 1 0 0 0 0).
1-hot representation: a word with vocabulary index i is the one-hot vector oi; multiplying by the embedding matrix E looks up its embedding, ei = E oi:
◦ I: o4343 → E → e4343
◦ a: o1 → E → e1
◦ of: o6163 → E → e6163
Possible contexts for prediction: the last 1 word, or a nearby 1 word
Word2vec
Popular embedding method
Very fast to train
Code available on the web
Idea: predict rather than count
Word2vec provides various options. We'll do:
skip-gram with negative sampling (SGNS)
Word2vec
Instead of counting how often each word w occurs near "apricot"
◦ Train a classifier on a binary prediction task:
◦ Is w likely to show up near "apricot"?
We don’t actually care about this task
◦ But we'll take the learned classifier weights as the word embeddings
Big idea: self-supervision:
◦ A word c that occurs near apricot in the corpus acts as the gold "correct
answer" for supervised learning
◦ No need for human labels
◦ Bengio et al. (2003); Collobert et al. (2011)
Approach: predict if candidate word c is a "neighbor"
1. Treat the target word t and a neighboring context word c
as positive examples.
2. Randomly sample other words in the lexicon to get
negative examples
3. Use logistic regression to train a classifier to distinguish
those two cases
4. Use the learned weights as the embeddings
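The four steps above start with building the training data. Here is a minimal Python sketch (function and parameter names are my own; real word2vec samples negatives from a smoothed unigram distribution rather than uniformly, and avoids sampling the target itself):

```python
import random

def training_pairs(tokens, window=2, k=2, vocab=None, seed=0):
    """Build (target, context, label) triples for SGNS.

    Each positive pair (label 1) from the context window gets
    k negative pairs (label 0) with randomly sampled words.
    """
    rng = random.Random(seed)
    vocab = vocab or sorted(set(tokens))
    data = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j == i:
                continue
            data.append((target, tokens[j], 1))        # positive example
            for _ in range(k):                          # k negative examples
                data.append((target, rng.choice(vocab), 0))
    return data

pairs = training_pairs("lemon a tablespoon of apricot jam".split())
```

With window 2 and k = 2, every target word contributes two negatives per positive, so the negatives outnumber the positives two to one.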
Skip-Gram Training Data
This is for one context word, but we have lots of context words.
We'll assume independence and just multiply them:
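In the standard SGNS notation (σ is the sigmoid, w the target embedding, ci the context-word embeddings), the classifier and the multiplied-out window probability are:

```latex
P(+\mid w, c) = \sigma(c \cdot w) = \frac{1}{1 + \exp(-c \cdot w)}
\qquad
P(-\mid w, c) = 1 - \sigma(c \cdot w)
\qquad
P(+\mid w, c_{1:L}) = \prod_{i=1}^{L} \sigma(c_i \cdot w)
```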
Skip-gram classifier: summary
A probabilistic classifier, given
• a test target word w
• its context window of L words c1:L
Estimates probability that w occurs in this window based
on similarity of w (embeddings) to c1:L (embeddings).
Word2vec: how to learn vectors
Given the set of positive and negative training instances,
and an initial set of embedding vectors
The goal of learning is to adjust those word vectors such
that we:
◦ Maximize the similarity of the target word, context word pairs
(w , cpos) drawn from the positive data
◦ Minimize the similarity of the (w , cneg) pairs drawn from the
negative data.
9/26/2024
Loss function for one w with cpos, cneg1 … cnegk
Maximize the similarity of the target with the actual context words,
and minimize the similarity of the target with the k negative sampled
non-neighbor words.
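Written out, the negative-sampling loss for one target word w with its positive context word cpos and k sampled negatives cneg1 … cnegk is:

```latex
L_{CE} = -\log\Big[P(+\mid w, c_{pos}) \prod_{i=1}^{k} P(-\mid w, c_{neg_i})\Big]
       = -\Big[\log \sigma(c_{pos} \cdot w) + \sum_{i=1}^{k} \log \sigma(-c_{neg_i} \cdot w)\Big]
```

The first term rewards similarity to the real neighbor; the sum rewards dissimilarity to the sampled non-neighbors.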
Learning the classifier
How to learn?
◦ Stochastic gradient descent!
```latex
w^{t+1} = w^{t} - \eta\,\frac{d}{dw}\,L\big(f(x; w^{t}),\, y\big)
```
(η is the learning rate)
The derivatives of the loss function
Update equation in SGD
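The derivatives of the loss with respect to the three kinds of vectors, from which the SGD updates follow (each vector moves a step of size η against its gradient):

```latex
\frac{\partial L_{CE}}{\partial c_{pos}}   = \big[\sigma(c_{pos} \cdot w) - 1\big]\, w
\qquad
\frac{\partial L_{CE}}{\partial c_{neg_i}} = \sigma(c_{neg_i} \cdot w)\, w
\qquad
\frac{\partial L_{CE}}{\partial w}         = \big[\sigma(c_{pos} \cdot w) - 1\big]\, c_{pos} + \sum_{i=1}^{k} \sigma(c_{neg_i} \cdot w)\, c_{neg_i}
```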
Start with randomly initialized C and W matrices, then incrementally do updates
Two sets of embeddings
SGNS learns two sets of embeddings
Target embeddings matrix W
Context embedding matrix C
It's common to just add them together,
representing word i as the vector wi + ci
Summary: How to learn word2vec (skip-gram)
embeddings
Start with V random d-dimensional vectors as initial
embeddings
Train a classifier based on embedding similarity
◦Take a corpus and take pairs of words that co-occur as positive
examples
◦Take pairs of words that don't co-occur as negative examples
◦Train the classifier to distinguish these by slowly adjusting all
the embeddings to improve the classifier performance
◦Throw away the classifier code and keep the embeddings.
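The whole recipe fits in a short pure-Python sketch (illustrative only: names and hyperparameters are my own, negatives are sampled uniformly instead of from the smoothed unigram distribution, and real implementations are vectorized rather than updating one pair at a time):

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sgd_step(w, c, g, lr):
    # Update both vectors; tuple assignment uses the old values on the right.
    for d in range(len(w)):
        w[d], c[d] = w[d] - lr * g * c[d], c[d] - lr * g * w[d]

def train_sgns(tokens, dim=8, window=2, k=2, lr=0.05, epochs=20, seed=0):
    """Minimal skip-gram-with-negative-sampling trainer."""
    rng = random.Random(seed)
    vocab = sorted(set(tokens))
    idx = {v: i for i, v in enumerate(vocab)}
    # V random d-dimensional vectors as initial embeddings
    W = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]  # target
    C = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in vocab]  # context
    for _ in range(epochs):
        for i, t in enumerate(tokens):
            w = W[idx[t]]
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j == i:
                    continue
                cpos = C[idx[tokens[j]]]
                sgd_step(w, cpos, sigmoid(dot(w, cpos)) - 1.0, lr)  # pull together
                for _ in range(k):
                    cneg = C[idx[rng.choice(vocab)]]
                    sgd_step(w, cneg, sigmoid(dot(w, cneg)), lr)    # push apart
    # Throw away the classifier machinery; keep wi + ci as the embedding.
    return {v: [a + b for a, b in zip(W[i], C[i])] for v, i in idx.items()}
```

At the end, the two matrices W and C are added together and everything else is discarded, exactly as the summary says.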
Properties of Embeddings
The kinds of neighbors depend on window size
Small windows (C = ±2): nearest words are syntactically
similar words in the same taxonomy
◦Hogwarts' nearest neighbors are other fictional schools:
◦Sunnydale, Evernight, Blandings
Large windows (C = ±5): nearest words are related
words in the same semantic field
◦Hogwarts' nearest neighbors are the Harry Potter world:
◦Dumbledore, half-blood, Malfoy
Analogical relations
The classic parallelogram model of analogical reasoning
(Rumelhart and Abrahamson 1973)
To solve: "apple is to tree as grape is to _____"
Add tree – apple to grape to get vine
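A toy illustration of the parallelogram method (the 2-D embeddings below are invented so the geometry works out; real embeddings have hundreds of dimensions):

```python
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def analogy(emb, a, b, c):
    """Solve 'a is to b as c is to ?': return the vocabulary word
    closest (by cosine) to c + (b - a), excluding the query words."""
    target = [vc + (vb - va) for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(emb[w], target))

# Hypothetical toy embeddings, hand-picked for illustration.
emb = {
    "apple": [1.0, 0.0],
    "tree":  [1.0, 1.0],
    "grape": [2.0, 0.0],
    "vine":  [2.0, 1.0],
    "fish":  [-1.0, -1.0],
}
print(analogy(emb, "apple", "tree", "grape"))  # → vine
```

Note the exclusion of the three query words from the candidates: without it, the answer is often one of the inputs themselves.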
Analogies
         Man     Woman   King    Queen   Apple   Orange
         (5391)  (9853)  (4914)  (7157)  (456)   (6257)
Gender   -1      1       -0.95   0.97    0.00    0.01
Royal    0.01    0.02    0.93    0.95    -0.01   0.00
Age      0.03    0.02    0.70    0.69    0.03    -0.02
Food     0.09    0.01    0.02    0.01    0.95    0.97
[Mikolov et al., 2013, Linguistic regularities in continuous space word representations]
Analogies using word vectors
[Figure: 2-D projection of word vectors in which the man→woman offset parallels king→queen, while dog, cat, and fish lie elsewhere in the space.]
Man:Woman as Boy:Girl
Ottawa:Canada as Nairobi:Kenya
Big:Bigger as Tall:Taller
Yen:Japan as Ruble:Russia
Structure in GloVe embedding space
Caveats with the parallelogram method
It only seems to work for frequent words, small
distances, and certain relations (relating countries to
capitals, or parts of speech), but not others (Linzen
2016, Gladkova et al. 2016, Ethayarajh et al. 2019a).
William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic Word Embeddings Reveal
Statistical Laws of Semantic Change. Proceedings of ACL.
Embeddings reflect cultural bias!
Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. "Man is to computer
programmer as woman is to homemaker? debiasing word embeddings." In NeurIPS, pp. 4349-4357. 2016.