
Vector Semantics and Embedding (2)

Assoc. Prof. Dr. Nguyễn Phương Thái
Dr. Hoàng Thanh Tùng
Dr. Trần Hồng Việt
NLP Laboratory, Institute of Artificial Intelligence

Adapted from slides of CS224N: Natural Language Processing with Deep Learning, Stanford
Word2vec
Sparse versus dense vectors

tf-idf (or PMI) vectors are
◦ long (length |V| = 20,000 to 50,000)
◦ sparse (most elements are zero)
Alternative: learn vectors which are
◦ short (length 50-1000)
◦ dense (most elements are non-zero)
Sparse versus dense vectors
Why dense vectors?
◦ Short vectors may be easier to use as features in machine learning (fewer weights to tune)
◦ Dense vectors may generalize better than explicit counts
◦ Dense vectors may do better at capturing synonymy:
◦ car and automobile are synonyms, but they are distinct dimensions in a sparse vector
◦ a word with car as a neighbor and a word with automobile as a neighbor should be similar, but aren't
◦ In practice, dense vectors work better
Featurized representation: word embedding

         Man     Woman   King    Queen   Apple   Orange
         (5391)  (9853)  (4914)  (7157)  (456)   (6257)
Gender   −1      1       -0.95   0.97    0.00    0.01
Royal    0.01    0.02    0.93    0.95    -0.01   0.00
Age      0.03    0.02    0.70    0.69    0.03    -0.02
Food     0.09    0.01    0.02    0.01    0.95    0.97

I want a glass of orange ______.
I want a glass of apple ______.
Visualizing word embeddings

[t-SNE visualization of word embeddings: related words cluster together, e.g. (man, woman, king, queen), (dog, cat, fish), (apple, grape, orange), (one, two, three, four)]

[van der Maaten and Hinton, 2008. Visualizing data using t-SNE]
Named entity recognition example

Sally Johnson is an orange farmer
  1     1     0  0    0      0

Robert Lin is an apple farmer

Having learned from "orange farmer" that Sally Johnson is a person name, a model using word embeddings can generalize to "Robert Lin is an apple farmer", because apple and orange have similar embeddings.
Transfer learning and word embeddings
1. Learn word embeddings from a large text corpus (1-100B words), or download pre-trained embeddings online.
2. Transfer the embeddings to a new task with a smaller training set (say, 100k words).
3. Optional: continue to fine-tune the word embeddings with new data.
Common methods for getting short dense vectors

"Neural Language Model"-inspired models
◦ Word2vec (skip-gram, CBOW), GloVe
Singular Value Decomposition (SVD)
◦ A special case of this is called LSA (Latent Semantic Analysis)
Alternative to these "static embeddings":
• Contextual embeddings (ELMo, BERT)
• Compute distinct embeddings for a word in its context
• Separate embeddings for each token of a word
Simple static embeddings you can download!

Word2vec (Mikolov et al.)
https://code.google.com/archive/p/word2vec/

GloVe (Pennington, Socher, Manning)
http://nlp.stanford.edu/projects/glove/
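For experimentation, pre-trained vectors like these can be loaded with off-the-shelf tools. A minimal sketch using gensim's downloader (the library and the model name "glove-wiki-gigaword-100" are our choices for illustration, not part of the slides):

```python
# Minimal sketch: load pre-trained static embeddings with gensim.
# "glove-wiki-gigaword-100" is one of gensim's hosted datasets;
# any other pre-trained vector set loads the same way.
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")   # 100-dim GloVe vectors

print(wv["king"].shape)                    # (100,)
print(wv.most_similar("king", topn=3))     # nearest neighbors by cosine
```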
Neural language model
I want a glass of orange ______.
4343 9665 1 3852 6163 6257

Each input word is mapped from its one-hot vector $o_i$ to a dense embedding via the shared embedding matrix $E$: $e_i = E\,o_i$.

I      $e_{4343} = E\,o_{4343}$
want   $e_{9665} = E\,o_{9665}$
a      $e_{1} = E\,o_{1}$
glass  $e_{3852} = E\,o_{3852}$
of     $e_{6163} = E\,o_{6163}$
orange $e_{6257} = E\,o_{6257}$

The embeddings are fed to a neural network whose softmax output predicts the next word.

[Bengio et al., 2003, A neural probabilistic language model]
One-hot vector
V = [a, aaron, …, zulu, <UNK>]

1-hot representation: each word is a |V|-dimensional vector with a 1 at that word's index and 0 everywhere else:

Man (5391):   $o_{5391}$ = [0, …, 0, 1, 0, …, 0]  (1 in position 5391)
Woman (9853): $o_{9853}$ = [0, …, 0, 1, 0, …, 0]  (1 in position 9853)
King (4914), Queen (7157), Apple (456), Orange (6257): likewise.

I want a glass of orange ______.
I want a glass of apple ______.
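A tiny numpy sketch of this lookup (the sizes are toy values of our own choosing): multiplying the one-hot vector by the embedding matrix simply selects one column of the matrix, which is why real implementations use an index lookup instead of a matrix product.

```python
import numpy as np

V, d = 10_000, 50                  # toy vocabulary size and embedding dim
rng = np.random.default_rng(0)
E = rng.normal(size=(d, V))        # embedding matrix, one column per word

i = 6257                           # index of "orange" in the slides' example
o = np.zeros(V)
o[i] = 1.0                         # one-hot vector for word i

e = E @ o                          # embedding via matrix-vector product
assert np.allclose(e, E[:, i])     # identical to selecting column i directly
```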
Other context/target pairs
I want a glass of orange juice to go along with my cereal.

Possible choices of context:
◦ Last 4 words
◦ 4 words on left & right
◦ Last 1 word
◦ Nearby 1 word
Word2vec
Popular embedding method
Very fast to train
Code available on the web
Idea: predict rather than count
Word2vec provides various options. We'll do:
skip-gram with negative sampling (SGNS)
Word2vec
Instead of counting how often each word w occurs near "apricot"
◦ Train a classifier on a binary prediction task:
◦ Is w likely to show up near "apricot"?
We don’t actually care about this task
◦ But we'll take the learned classifier weights as the word embeddings
Big idea: self-supervision:
◦ A word c that occurs near apricot in the corpus acts as the gold "correct answer" for supervised learning
◦ No need for human labels
◦ Bengio et al. (2003); Collobert et al. (2011)
Approach: predict if candidate word c is a "neighbor"
1. Treat the target word t and a neighboring context word c
as positive examples.
2. Randomly sample other words in the lexicon to get
negative examples
3. Use logistic regression to train a classifier to distinguish
those two cases
4. Use the learned weights as the embeddings
Skip-Gram Training Data

Assume a +/- 2 word window, given training sentence:

…lemon, a [tablespoon of apricot jam, a] pinch…


c1 c2 [target] c3 c4
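Not from the slides, but a minimal Python sketch of how positive (target, context) pairs fall out of a ±2 window (function name and tokenization are ours):

```python
def skipgram_pairs(tokens, window=2):
    """Yield (target, context) positive examples from a token list."""
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield (target, tokens[j])

sent = "lemon a tablespoon of apricot jam a pinch".split()
pairs = [p for p in skipgram_pairs(sent) if p[0] == "apricot"]
# [('apricot', 'tablespoon'), ('apricot', 'of'),
#  ('apricot', 'jam'), ('apricot', 'a')]
print(pairs)
```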
Skip-Gram Classifier
(assuming a +/- 2 word window)

…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 [target] c3 c4

Goal: train a classifier that, given a candidate (word, context) pair, such as
(apricot, jam)
(apricot, aardvark)
assigns each pair a probability:
P(+|w, c)
P(−|w, c) = 1 − P(+|w, c)
Similarity is computed from the dot product
Remember: two vectors are similar if they have a high dot product
◦ Cosine is just a normalized dot product
So:
◦ Similarity(w,c) ∝ w ∙ c
We'll need to normalize to get a probability
◦ (cosine isn't a probability either)
Turning dot products into probabilities
Sim(w,c) ≈ w ∙ c
To turn this into a probability, we use the sigmoid from logistic regression:

$$P(+|w,c) = \sigma(c \cdot w) = \frac{1}{1 + \exp(-c \cdot w)}$$
How the skip-gram classifier computes P(+|w, c)

$$P(+|w,c) = \sigma(c \cdot w) = \frac{1}{1 + \exp(-c \cdot w)}$$

This is for one context word, but we have lots of context words.
We'll assume independence and just multiply them:

$$P(+|w,c_{1:L}) = \prod_{i=1}^{L} \sigma(c_i \cdot w)$$
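A numpy sketch of these two formulas (the vector values are random toy placeholders, not trained embeddings):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(1)
d, L = 50, 4
w = rng.normal(scale=0.1, size=d)        # target word embedding
C = rng.normal(scale=0.1, size=(L, d))   # one row per context word embedding

p_each = sigmoid(C @ w)     # P(+ | w, c_i) for each context word
p_window = p_each.prod()    # independence assumption: multiply them
print(p_each, p_window)
```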
Skip-gram classifier: summary
A probabilistic classifier, given
• a test target word w
• its context window of L words c1:L
estimates the probability that w occurs in this window, based on the similarity of the embedding of w to the embeddings of c1:L.

To compute this, we just need embeddings for all the words.
We'll need two sets of embeddings: one for target words w, one for context words c.
Word2vec: Learning the Embeddings
Skip-Gram Training data

…lemon, a [tablespoon of apricot jam, a] pinch…
c1 c2 [target] c3 c4

Positive examples: (apricot, tablespoon), (apricot, of), (apricot, jam), (apricot, a)

For each positive example we'll grab k negative examples, sampling by frequency
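A sketch of frequency-based negative sampling. The slides say "sampling by frequency"; the 0.75 exponent, which the original word2vec implementation uses to dampen very frequent words, and all names and counts below are our additions for illustration:

```python
import numpy as np

def make_negatives(vocab, counts, k, exclude, rng, power=0.75):
    """Sample k negative context words, weighted by count ** power."""
    words = [v for v in vocab if v not in exclude]
    p = np.array([counts[v] ** power for v in words], dtype=float)
    p /= p.sum()
    return list(rng.choice(words, size=k, replace=True, p=p))

rng = np.random.default_rng(2)
counts = {"the": 500, "of": 300, "apricot": 5, "jam": 8, "aardvark": 1}
negs = make_negatives(list(counts), counts, k=2,
                      exclude={"apricot", "jam"}, rng=rng)
print(negs)   # frequent words like 'the' and 'of' are sampled more often
```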
Word2vec: how to learn vectors
Given the set of positive and negative training instances, and an initial set of embedding vectors, the goal of learning is to adjust those word vectors such that we:
◦ Maximize the similarity of the target word, context word pairs (w, cpos) drawn from the positive data
◦ Minimize the similarity of the (w, cneg) pairs drawn from the negative data
Loss function for one w with cpos, cneg1…cnegk
Maximize the similarity of the target with the actual context words, and minimize the similarity of the target with the k negative sampled non-neighbor words:

$$L_{CE} = -\log\left[P(+|w,c_{pos}) \prod_{i=1}^{k} P(-|w,c_{neg_i})\right] = -\left[\log \sigma(c_{pos} \cdot w) + \sum_{i=1}^{k} \log \sigma(-c_{neg_i} \cdot w)\right]$$
Learning the classifier
How to learn?
◦ Stochastic gradient descent!
We'll adjust the word weights to
◦ make the positive pairs more likely
◦ and the negative pairs less likely,
◦ over the entire training set.
Intuition of one step of gradient descent
Reminder: gradient descent
• At each step:
• Direction: we move in the reverse direction from the gradient of the loss function
• Magnitude: we move by the value of this gradient $\frac{d}{dw} L(f(x;w), y)$, weighted by a learning rate $\eta$
• Higher learning rate means move w faster

$$w^{t+1} = w^{t} - \eta\, \frac{d}{dw} L(f(x;w), y)$$
The derivatives of the loss function:

$$\frac{\partial L_{CE}}{\partial c_{pos}} = [\sigma(c_{pos} \cdot w) - 1]\, w$$
$$\frac{\partial L_{CE}}{\partial c_{neg}} = \sigma(c_{neg} \cdot w)\, w$$
$$\frac{\partial L_{CE}}{\partial w} = [\sigma(c_{pos} \cdot w) - 1]\, c_{pos} + \sum_{i=1}^{k} \sigma(c_{neg_i} \cdot w)\, c_{neg_i}$$

Update equation in SGD: start with randomly initialized C and W matrices, then incrementally do updates, e.g.

$$c_{pos}^{t+1} = c_{pos}^{t} - \eta\, [\sigma(c_{pos}^{t} \cdot w^{t}) - 1]\, w^{t}$$

(and similarly for each $c_{neg_i}$ and for $w$)
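A toy numpy transcription of one such SGD step for a single (w, cpos, cneg1..k) instance, directly following the derivatives above (dimensions, learning rate, and function names are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_step(w, c_pos, C_neg, lr=0.1):
    """One SGD step on the SGNS loss; updates the arrays in place."""
    g_pos = sigmoid(c_pos @ w) - 1.0          # scalar, in [-1, 0)
    g_neg = sigmoid(C_neg @ w)                # one scalar per negative

    grad_w = g_pos * c_pos + C_neg.T @ g_neg  # dL/dw (uses old values)
    c_pos -= lr * g_pos * w                   # dL/dc_pos   = g_pos * w
    C_neg -= lr * np.outer(g_neg, w)          # dL/dc_neg_i = g_neg_i * w
    w -= lr * grad_w
    return w, c_pos, C_neg

rng = np.random.default_rng(3)
d, k = 50, 2
w, c_pos = rng.normal(size=d), rng.normal(size=d)
C_neg = rng.normal(size=(k, d))               # k negative context vectors
w, c_pos, C_neg = sgns_step(w, c_pos, C_neg)
```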
Two sets of embeddings
SGNS learns two sets of embeddings:
◦ Target embedding matrix W
◦ Context embedding matrix C
It's common to just add them together, representing word i as the vector wi + ci
Summary: How to learn word2vec (skip-gram) embeddings
Start with V random d-dimensional vectors as initial embeddings
Train a classifier based on embedding similarity
◦ Take a corpus and take pairs of words that co-occur as positive examples
◦ Take pairs of words that don't co-occur as negative examples
◦ Train the classifier to distinguish these by slowly adjusting all the embeddings to improve the classifier performance
◦ Throw away the classifier code and keep the embeddings
Properties of Embeddings
The kinds of neighbors depend on window size
Small windows (C = ±2): nearest words are syntactically similar words in the same taxonomy
◦ Hogwarts' nearest neighbors are other fictional schools:
◦ Sunnydale, Evernight, Blandings
Large windows (C = ±5): nearest words are related words in the same semantic field
◦ Hogwarts' nearest neighbors are in the Harry Potter world:
◦ Dumbledore, half-blood, Malfoy
Analogical relations
The classic parallelogram model of analogical reasoning (Rumelhart and Abrahamson 1973)
To solve "apple is to tree as grape is to _____":
add tree − apple to grape to get vine, i.e. $e_{grape} + (e_{tree} - e_{apple}) \approx e_{vine}$
Analogies

         Man     Woman   King    Queen   Apple   Orange
         (5391)  (9853)  (4914)  (7157)  (456)   (6257)
Gender   −1      1       -0.95   0.97    0.00    0.01
Royal    0.01    0.02    0.93    0.95    -0.01   0.00
Age      0.03    0.02    0.70    0.69    0.03    -0.02
Food     0.09    0.01    0.02    0.01    0.95    0.97

[Mikolov et al., 2013, Linguistic regularities in continuous space word representations]
Analogies using word vectors

[2-D embedding plot of man, woman, king, queen, dog, cat, fish, apple, grape, orange, one, two, three, four: the offset from man to woman parallels the offset from king to queen]
$$e_{man} - e_{woman} \approx e_{king} - e_{?}$$

To find the answer, search for the word w whose embedding maximizes the cosine similarity

$$\mathrm{sim}(e_w,\; e_{king} - e_{man} + e_{woman})$$

Man:Woman as Boy:Girl
Ottawa:Canada as Nairobi:Kenya
Big:Bigger as Tall:Taller
Yen:Japan as Ruble:Russia
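With pre-trained vectors loaded as in the earlier gensim sketch, the parallelogram method is a single call (the input words are automatically excluded from the candidates):

```python
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-100")

# e_king - e_man + e_woman, then argmax over cosine similarity
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# typically returns something like [('queen', 0.78...)]
```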
Structure in GloVe embedding space
Caveats with the parallelogram method
It only seems to work for frequent words, small distances, and certain relations (relating countries to capitals, or parts of speech), but not others. (Linzen 2016, Gladkova et al. 2016, Ethayarajh et al. 2019a)

Understanding analogy is an open area of research (Peterson et al. 2020)
Embeddings as a window onto historical semantics
Train embeddings on different decades of historical text to see meanings shift
~30 million books, 1850-1990, Google Books data

William L. Hamilton, Jure Leskovec, and Dan Jurafsky. 2016. Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change. Proceedings of ACL.
Embeddings reflect cultural bias!
Bolukbasi, Tolga, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T. Kalai. "Man is to computer programmer as woman is to homemaker? Debiasing word embeddings." In NeurIPS, pp. 4349-4357. 2016.

Ask "Paris : France :: Tokyo : x"
◦ x = Japan
Ask "father : doctor :: mother : x"
◦ x = nurse
Ask "man : computer programmer :: woman : x"
◦ x = homemaker
Algorithms that use embeddings as part of, e.g., hiring searches for programmers might therefore introduce gender bias into hiring.
Historical embeddings as a tool to study cultural biases
Garg, N., Schiebinger, L., Jurafsky, D., and Zou, J. (2018). Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences 115(16), E3635–E3644.

• Compute a gender or ethnic bias for each adjective: e.g., how much closer the adjective is to "woman" synonyms than "man" synonyms, or to names of particular ethnicities
• Embeddings for competence adjectives (smart, wise, brilliant, resourceful, thoughtful, logical) are biased toward men, a bias slowly decreasing 1960-1990
• Embeddings for dehumanizing adjectives (barbaric, monstrous, bizarre) were biased toward Asians in the 1930s, a bias decreasing over the 20th century
• These match the results of old surveys done in the 1930s