
Vector Semantics and Embedding (2)

Assoc. Prof. Dr. Nguyễn Phương Thái
Dr. Hoàng Thanh Tùng
Dr. Trần Hồng Việt
NLP Laboratory, Institute of Artificial Intelligence

Adapted from slides of CS224N: Natural Language Processing with Deep Learning, Stanford
Word2vec
Sparse versus dense vectors

tf-idf (or PMI) vectors are
◦ long (length |V| = 20,000 to 50,000)
◦ sparse (most elements are zero)
Alternative: learn vectors which are
◦ short (length 50-1000)
◦ dense (most elements are non-zero)
Sparse versus dense vectors
Why dense vectors?
◦ Short vectors may be easier to use as features in machine learning (fewer weights to tune)
◦ Dense vectors may generalize better than explicit counts
◦ Dense vectors may do better at capturing synonymy:
◦ car and automobile are synonyms, but they are distinct dimensions in a sparse vector
◦ a word with car as a neighbor and a word with automobile as a neighbor should be similar, but aren't
◦ In practice, dense vectors work better
Featurized representation: word embedding

         Man     Woman   King    Queen   Apple   Orange
         (5391)  (9853)  (4914)  (7157)  (456)   (6257)
Gender   −1      1       -0.95   0.97    0.00    0.01
Royal    0.01    0.02    0.93    0.95    -0.01   0.00
Age      0.03    0.02    0.70    0.69    0.03    -0.02
Food     0.09    0.01    0.02    0.01    0.95    0.97

I want a glass of orange ______.
I want a glass of apple ______.
Visualizing word embeddings

[t-SNE visualization of word embeddings: related words cluster together, e.g. (man, woman, king, queen), (dog, cat, fish), (apple, grape, orange), (one, two, three, four)]

[van der Maaten and Hinton, 2008. Visualizing data using t-SNE]
Named entity recognition example

Sally Johnson is an orange farmer
  1     1     0  0    0      0

Robert Lin is an apple farmer

Having learned from "orange farmer" that Sally Johnson is a person name, a model using word embeddings can generalize to "Robert Lin is an apple farmer", because apple and orange have similar embeddings.
Transfer learning and word embeddings
1. Learn word embeddings from a large text corpus (1-100B words), or download pre-trained embeddings online.
2. Transfer the embeddings to a new task with a smaller training set (say, 100k words).
3. Optional: continue to fine-tune the word embeddings with new data.
Common methods for getting short dense vectors

"Neural Language Model"-inspired models
◦ Word2vec (skip-gram, CBOW), GloVe
Singular Value Decomposition (SVD)
◦ A special case of this is called LSA (Latent Semantic Analysis)
Alternative to these "static embeddings":
• Contextual embeddings (ELMo, BERT)
• Compute distinct embeddings for a word in its context
• Separate embeddings for each token of a word
Simple static embeddings you can download!

Word2vec (Mikolov et al.)
https://code.google.com/archive/p/word2vec/

GloVe (Pennington, Socher, Manning)
http://nlp.stanford.edu/projects/glove/
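For experimentation, pre-trained vectors like these can be loaded with off-the-shelf tools. A minimal sketch using gensim's downloader (the library and the model name "glove-wiki-gigaword-100" are our choices for illustration, not part of the slides):

```python
# Minimal sketch: load pre-trained static embeddings with gensim.
# "glove-wiki-gigaword-100" is one of gensim's hosted datasets;
# any other pre-trained vector set loads the same way.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # 100-dim GloVe vectors

print(wv["king"].shape)                    # (100,)
print(wv.most_similar("king", topn=3))     # nearest neighbors by cosine
```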
Neural language model
I want a glass of orange ______.
4343 9665 1 3852 6163 6257

Each input word is mapped from its one-hot vector $o_i$ to a dense embedding via the shared embedding matrix $E$: $e_i = E\,o_i$.

I      $e_{4343} = E\,o_{4343}$
want   $e_{9665} = E\,o_{9665}$
a      $e_{1} = E\,o_{1}$
glass  $e_{3852} = E\,o_{3852}$
of     $e_{6163} = E\,o_{6163}$
orange $e_{6257} = E\,o_{6257}$

The embeddings are fed to a neural network whose softmax output predicts the next word.

[Bengio et al., 2003, A neural probabilistic language model]
One-hot vector
V = [a, aaron, …, zulu, <UNK>]

1-hot representation: each word is a |V|-dimensional vector with a 1 at that word's index and 0 everywhere else:

Man (5391):   $o_{5391}$ = [0, …, 0, 1, 0, …, 0]  (1 in position 5391)
Woman (9853): $o_{9853}$ = [0, …, 0, 1, 0, …, 0]  (1 in position 9853)
King (4914), Queen (7157), Apple (456), Orange (6257): likewise.

I want a glass of orange ______.
I want a glass of apple ______.
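A tiny numpy sketch of this lookup (the sizes are toy values of our own choosing): multiplying the one-hot vector by the embedding matrix simply selects one column of the matrix, which is why real implementations use an index lookup instead of a matrix product.

```python
import numpy as np

V, d = 10_000, 50                  # toy vocabulary size and embedding dim
rng = np.random.default_rng(0)
E = rng.normal(size=(d, V))        # embedding matrix, one column per word

i = 6257                           # index of "orange" in the slides' example
o = np.zeros(V)
o[i] = 1.0                         # one-hot vector for word i

e = E @ o                          # embedding via matrix-vector product
assert np.allclose(e, E[:, i])     # identical to selecting column i directly
```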
Other context/target pairs
I want a glass of orange juice to go along with my cereal.

Possible choices of context:
◦ Last 4 words
◦ 4 words on left & right
◦ Last 1 word
◦ Nearby 1 word
Word2vec
Popular embedding method
Very fast to train
Code available on the web
Idea: predict rather than count
Word2vec provides various options. We'll do:
skip-gram with negative sampling (SGNS)
Word2vec
Instead of counting how often each word w occurs near "apricot"
◦ Train a classifier on a binary prediction task:
◦ Is w likely to show up near "apricot"?
We don’t actually care about this task
◦ But we'll take the learned classifier weights as the word embeddings
Big idea: self-supervision:
◦ A word c that occurs near apricot in the corpus acts as the gold "correct answer" for supervised learning
◦ No need for human labels
◦ Bengio et al. (2003); Collobert et al. (2011)
Approach: predict if candidate word c is a "neighbor"
1. Treat the target word t and a neighboring context word c
as positive examples.
2. Randomly sample other words in the lexicon to get
negative examples
3. Use logistic regression to train a classifier to distinguish
those two cases
4. Use the learned weights as the embeddings
Skip-Gram Training Data

Assume a +/- 2 word window, given training sentence:

…lemon, a [tablespoon of apricot jam, a] pinch…


c1 c2 [target] c3 c4
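Not from the slides, but a minimal Python sketch of how positive (target, context) pairs fall out of a ±2 window (function name and tokenization are ours):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) positive examples from a token list."""
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield (target, tokens[j])

sent = "lemon a tablespoon of apricot jam a pinch".split()
pairs = [p for p in skipgram_pairs(sent) if p[0] == "apricot"]
# [('apricot', 'tablespoon'), ('apricot', 'of'),
#  ('apricot', 'jam'), ('apricot', 'a')]
print(pairs)
```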
Skip-Gram Classifier
(assuming a +/- 2 word window)

…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 [target] c3 c4

Goal: train a classifier that, given a candidate (word, context) pair, such as
(apricot, jam)
(apricot, aardvark)
assigns each pair a probability:
P(+|w, c)
P(−|w, c) = 1 − P(+|w, c)
Similarity is computed from the dot product
Remember: two vectors are similar if they have a high dot product
◦ Cosine is just a normalized dot product
So:
◦ Similarity(w,c) ∝ w ∙ c
We'll need to normalize to get a probability
◦ (cosine isn't a probability either)
Turning dot products into probabilities
Sim(w,c) ≈ w ∙ c
To turn this into a probability, we use the sigmoid from logistic regression:

$$P(+|w,c) = \sigma(c \cdot w) = \frac{1}{1 + \exp(-c \cdot w)}$$
How the skip-gram classifier computes P(+|w, c)

$$P(+|w,c) = \sigma(c \cdot w) = \frac{1}{1 + \exp(-c \cdot w)}$$

This is for one context word, but we have lots of context words.
We'll assume independence and just multiply them:

$$P(+|w,c_{1:L}) = \prod_{i=1}^{L} \sigma(c_i \cdot w)$$
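A numpy sketch of these two formulas (the vector values are random toy placeholders, not trained embeddings):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
d, L = 50, 4
w = rng.normal(scale=0.1, size=d)        # target word embedding
C = rng.normal(scale=0.1, size=(L, d))   # one row per context word embedding

p_each = sigmoid(C @ w)     # P(+ | w, c_i) for each context word
p_window = p_each.prod()    # independence assumption: multiply them
print(p_each, p_window)
```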
Skip-gram classifier: summary
A probabilistic classifier, given
• a test target word w
• its context window of L words c1:L
estimates the probability that w occurs in this window, based on the similarity of the embedding of w to the embeddings of c1:L.

To compute this, we just need embeddings for all the words.
We'll need two sets of embeddings: one for target words w, one for context words c.
Word2vec: Learning the Embeddings
Skip-Gram Training data

…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 [target] c3 c4

Positive examples: (apricot, tablespoon), (apricot, of), (apricot, jam), (apricot, a)

For each positive example we'll grab k negative examples, sampling by frequency
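A sketch of frequency-based negative sampling. The slides say "sampling by frequency"; the 0.75 exponent, which the original word2vec implementation uses to dampen very frequent words, and all names and counts below are our additions for illustration:

```python
import numpy as np

def make_negatives(vocab, counts, k, exclude, rng, power=0.75):
    """Sample k negative context words, weighted by count ** power."""
    words = [v for v in vocab if v not in exclude]
    p = np.array([counts[v] ** power for v in words], dtype=float)
    p /= p.sum()
    return list(rng.choice(words, size=k, replace=True, p=p))

rng = np.random.default_rng(2)
counts = {"the": 500, "of": 300, "apricot": 5, "jam": 8, "aardvark": 1}
negs = make_negatives(list(counts), counts, k=2,
                      exclude={"apricot", "jam"}, rng=rng)
print(negs)   # frequent words like 'the' and 'of' are sampled more often
```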
Word2vec: how to learn vectors
Given the set of positive and negative training instances, and an initial set of embedding vectors, the goal of learning is to adjust those word vectors such that we:
◦ Maximize the similarity of the target word, context word pairs (w, cpos) drawn from the positive data
◦ Minimize the similarity of the (w, cneg) pairs drawn from the negative data
Loss function for one w with cpos, cneg1…cnegk
Maximize the similarity of the target with the actual context words, and minimize the similarity of the target with the k negative sampled non-neighbor words:

$$L_{CE} = -\log\left[P(+|w,c_{pos}) \prod_{i=1}^{k} P(-|w,c_{neg_i})\right] = -\left[\log \sigma(c_{pos} \cdot w) + \sum_{i=1}^{k} \log \sigma(-c_{neg_i} \cdot w)\right]$$
Learning the classifier
How to learn?
◦ Stochastic gradient descent!
We'll adjust the word weights to
◦ make the positive pairs more likely
◦ and the negative pairs less likely,
◦ over the entire training set.
Intuition of one step of gradient descent
Reminder: gradient descent
• At each step:
• Direction: we move in the reverse direction from the gradient of the loss function
• Magnitude: we move by the value of this gradient $\frac{d}{dw} L(f(x;w), y)$, weighted by a learning rate $\eta$
• Higher learning rate means move w faster

$$w^{t+1} = w^{t} - \eta\, \frac{d}{dw} L(f(x;w), y)$$
The derivatives of the loss function:

$$\frac{\partial L_{CE}}{\partial c_{pos}} = [\sigma(c_{pos} \cdot w) - 1]\, w$$
$$\frac{\partial L_{CE}}{\partial c_{neg}} = \sigma(c_{neg} \cdot w)\, w$$
$$\frac{\partial L_{CE}}{\partial w} = [\sigma(c_{pos} \cdot w) - 1]\, c_{pos} + \sum_{i=1}^{k} \sigma(c_{neg_i} \cdot w)\, c_{neg_i}$$

Update equation in SGD: start with randomly initialized C and W matrices, then incrementally do updates, e.g.

$$c_{pos}^{t+1} = c_{pos}^{t} - \eta\, [\sigma(c_{pos}^{t} \cdot w^{t}) - 1]\, w^{t}$$

(and similarly for each $c_{neg_i}$ and for $w$)
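A toy numpy transcription of one such SGD step for a single (w, cpos, cneg1..k) instance, directly following the derivatives above (dimensions, learning rate, and function names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(w, c_pos, C_neg, lr=0.1):
    """One SGD step on the SGNS loss; updates the arrays in place."""
    g_pos = sigmoid(c_pos @ w) - 1.0          # scalar, in [-1, 0)
    g_neg = sigmoid(C_neg @ w)                # one scalar per negative

    grad_w = g_pos * c_pos + C_neg.T @ g_neg  # dL/dw (uses old values)
    c_pos -= lr * g_pos * w                   # dL/dc_pos   = g_pos * w
    C_neg -= lr * np.outer(g_neg, w)          # dL/dc_neg_i = g_neg_i * w
    w -= lr * grad_w
    return w, c_pos, C_neg

rng = np.random.default_rng(3)
d, k = 50, 2
w, c_pos = rng.normal(size=d), rng.normal(size=d)
C_neg = rng.normal(size=(k, d))               # k negative context vectors
w, c_pos, C_neg = sgns_step(w, c_pos, C_neg)
```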
Two sets of embeddings
SGNS learns two sets of embeddings:
◦ Target embedding matrix W
◦ Context embedding matrix C
It's common to just add them together, representing word i as the vector wi + ci
Summary: How to learn word2vec (skip-gram) embeddings
Start with V random d-dimensional vectors as initial embeddings
Train a classifier based on embedding similarity
◦ Take a corpus and take pairs of words that co-occur as positive examples
◦ Take pairs of words that don't co-occur as negative examples
◦ Train the classifier to distinguish these by slowly adjusting all the embeddings to improve the classifier performance
◦ Throw away the classifier code and keep the embeddings
Properties of Embeddings
The kinds of neighbors depend on window size
Small windows (C = ±2): nearest words are syntactically similar words in the same taxonomy
◦ Hogwarts' nearest neighbors are other fictional schools:
◦ Sunnydale, Evernight, Blandings
Large windows (C = ±5): nearest words are related words in the same semantic field
◦ Hogwarts' nearest neighbors are in the Harry Potter world:
◦ Dumbledore, half-blood, Malfoy
Analogical relations
The classic parallelogram model of analogical reasoning (Rumelhart and Abrahamson 1973)
To solve "apple is to tree as grape is to _____":
add tree − apple to grape to get vine, i.e. $e_{grape} + (e_{tree} - e_{apple}) \approx e_{vine}$
Analogies

         Man     Woman   King    Queen   Apple   Orange
         (5391)  (9853)  (4914)  (7157)  (456)   (6257)
Gender   −1      1       -0.95   0.97    0.00    0.01
Royal    0.01    0.02    0.93    0.95    -0.01   0.00
Age      0.03    0.02    0.70    0.69    0.03    -0.02
Food     0.09    0.01    0.02    0.01    0.95    0.97

[Mikolov et al., 2013, Linguistic regularities in continuous space word representations]
Analogies using word vectors

[2-D embedding plot of man, woman, king, queen, dog, cat, fish, apple, grape, orange, one, two, three, four: the offset from man to woman parallels the offset from king to queen]
$$e_{man} - e_{woman} \approx e_{king} - e_{?}$$

To find the answer, search for the word w whose embedding maximizes the cosine similarity

$$\mathrm{sim}(e_w,\; e_{king} - e_{man} + e_{woman})$$

Man:Woman as Boy:Girl
Ottawa:Canada as Nairobi:Kenya
Big:Bigger as Tall:Taller
Yen:Japan as Ruble:Russia
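With pre-trained vectors loaded as in the earlier gensim sketch, the parallelogram method is a single call (the input words are automatically excluded from the candidates):

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")

# e_king - e_man + e_woman, then argmax over cosine similarity
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# typically returns something like [('queen', 0.78...)]
```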
Structure in GloVe embedding space
Caveats with the parallelogram method
It only seems to work for frequent words, small distances, and certain relations (relating countries to capitals, or parts of speech), but not others. (Linzen 2016, Gladkova et al. 2016, Ethayarajh et al. 2019a)

Understanding analogy is an open area of research (Peterson et al. 2020)
Embeddings as a window onto historical semantics
Train embeddings on different decades of historical text to see meanings shift
~30 million books, 1850-1990, Google Books data

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. Proceedings of ACL.
Embeddings reflect cultural bias!
Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. "Man is to computer programmer as woman is to homemaker? Debiasing word embeddings." In NeurIPS, pp. 4349-4357. 2016.

Ask "Paris : France :: Tokyo : x"
◦ x = Japan
Ask "father : doctor :: mother : x"
◦ x = nurse
Ask "man : computer programmer :: woman : x"
◦ x = homemaker
Algorithms that use embeddings as part of, e.g., hiring searches for programmers might therefore introduce gender bias into hiring.
Historical embeddings as a tool to study cultural biases
Garg, N., Schiebinger, L., Jurafsky, D., and Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences 115(16), E3635–E3644.

• Compute a gender or ethnic bias for each adjective: e.g., how much closer the adjective is to "woman" synonyms than "man" synonyms, or to names of particular ethnicities
• Embeddings for competence adjectives (smart, wise, brilliant, resourceful, thoughtful, logical) are biased toward men, a bias slowly decreasing 1960-1990
• Embeddings for dehumanizing adjectives (barbaric, monstrous, bizarre) were biased toward Asians in the 1930s, a bias decreasing over the 20th century
• These match the results of old surveys done in the 1930s