
GAPE module 2

April 2025

Three Marks Questions

1 What is a Generative Adversarial Network (GAN)? Could you explain its basic working principle?
Introduction to GANs
A Generative Adversarial Network (GAN) is an advanced deep learning framework designed to generate
synthetic data samples that closely resemble real data from a given training set. GANs can create realistic
images, music, text, and other types of data by learning the underlying patterns and distributions of the training
data.
Core Components
GANs consist of two competing neural networks that work in opposition:

• 1. Generator (G)
– Function: Creates synthetic data samples from random noise
– Input: Takes a random noise vector (latent vector) as input
– Output: Produces synthetic data (e.g., images, text)
– Initial Performance: Generates low-quality outputs that improve with training
– Goal: To produce data indistinguishable from real data
• 2. Discriminator (D)
– Function: Acts as a classifier distinguishing real from fake data
– Input: Receives both real training data and generated samples
– Output: Predicts probability (0 to 1) of input being real
– Goal: To accurately identify generated samples as fake

Working Principle
The two networks engage in an adversarial training process:

Figure: GAN architecture showing generator and discriminator in adversarial training

1. Generator’s Process:

• Takes random noise vector z from latent space
• Transforms it through neural network layers into synthetic data G(z)
• Attempts to make G(z) resemble real data distribution
2. Discriminator’s Process:
• Receives both real data samples x and generated samples G(z)
• Outputs probability D(x) or D(G(z)) indicating ”realness”
• Provides feedback to the generator through backpropagation

Adversarial Training Process


The training follows a minimax game framework:

• Generator’s Objective: Minimize the discriminator’s ability to detect fakes


• Discriminator’s Objective: Maximize its accuracy in distinguishing real from fake

This creates a dynamic equilibrium where:


• The generator continuously improves its outputs to fool the discriminator
• The discriminator continuously improves its detection capabilities
Mathematical Formulation
The GAN objective is expressed as a two-player minimax game:

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]

Where:
• D(x) = Discriminator’s probability that real data x is real
• G(z) = Generator’s output from noise z
• D(G(z)) = Discriminator’s probability that generated sample is real
• pdata = Distribution of real data
• pz = Distribution of generator’s input noise
Training Dynamics and Convergence

• Initial Phase:
– Generator produces obvious fakes
– Discriminator easily identifies them (D(G(z)) ≈ 0)
• Intermediate Phase:
– Generator improves quality
– Discriminator becomes more sophisticated
– Adversarial competition drives improvement
• Ideal Convergence:
– Generator produces outputs indistinguishable from real data
– Discriminator outputs 50% probability (random guessing)
– Nash equilibrium is reached

Practical Example: Generating Fake Faces


Training Progression:

• Early Stage:
– Generator: Produces random noise or blurry shapes

– Discriminator: Easily identifies fakes (high accuracy)
• Middle Stage:
– Generator: Creates face-like structures with basic features
– Discriminator: Begins to struggle with better fakes

• Final Stage:
– Generator: Produces photorealistic faces
– Discriminator: Cannot reliably distinguish real from fake (≈ 50% accuracy)

2 Describe the mathematical formulation of GANs and the objective function used in training, including alternative formulations.
Generative Adversarial Networks (GANs) are a powerful class of machine learning frameworks consisting of a
Generator (G) and a Discriminator (D), trained in a two-player minimax game setting. The formulation, loss
functions, training dynamics, and alternate approaches are outlined below.
1. Objective Function:

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]

2. Loss Functions:
• Discriminator:
LD = −Ex∼pdata (x) [log D(x)] − Ez∼pz (z) [log(1 − D(G(z)))]

• Generator (original):
LG = −Ez∼pz (z) [log(1 − D(G(z)))]

• Generator (non-saturating):
LG = −Ez∼pz (z) [log D(G(z))]

3. Training Dynamics:
• Discriminator update:

∇D V (D, G) = ∇D (Ex [log D(x)] + Ez [log(1 − D(G(z)))])

• Generator update:
∇G V (D, G) = ∇G Ez [log(1 − D(G(z)))]

• Heuristic alternative for Generator:

min_G −E_z[log D(G(z))]

4. Optimality: At equilibrium, D(x) = 0.5 and the generated distribution equals the real data distribution.
5. LSGAN Loss Functions:
• Discriminator:
L_D = (1/2) E_{x∼p_data(x)}[(D(x) − b)²] + (1/2) E_{z∼p_z(z)}[(D(G(z)) − a)²]
• Generator:
L_G = (1/2) E_{z∼p_z(z)}[(D(G(z)) − c)²]
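As an illustration, here is a minimal sketch in Python (assuming PyTorch is available) of the standard non-saturating losses and the LSGAN losses with the common choice a = 0, b = 1, c = 1. The tensors d_real and d_fake are placeholders for the discriminator's outputs on real and generated batches; they are not part of the original notes.

import torch

def gan_losses(d_real, d_fake, eps=1e-8):
    # Standard GAN: d_real and d_fake are probabilities in (0, 1)
    loss_d = -(torch.log(d_real + eps).mean() + torch.log(1.0 - d_fake + eps).mean())
    loss_g = -torch.log(d_fake + eps).mean()   # non-saturating generator loss
    return loss_d, loss_g

def lsgan_losses(d_real, d_fake, a=0.0, b=1.0, c=1.0):
    # LSGAN: d_real and d_fake are unbounded discriminator scores
    loss_d = 0.5 * ((d_real - b) ** 2).mean() + 0.5 * ((d_fake - a) ** 2).mean()
    loss_g = 0.5 * ((d_fake - c) ** 2).mean()
    return loss_d, loss_g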

3 Explain the difference between generative and discriminative models with examples and equations.
Generative Models
Objective: Generative models learn the joint probability distribution:

P (X, Y )

This models how data features X and labels Y are generated together.
Functionality:
• Estimate the data generation process, allowing creation of new data points.
• Compute conditional probabilities using Bayes’ Theorem:

P(Y | X) = P(X | Y) · P(Y) / P(X)

Examples:
• Naïve Bayes Classifier: Assumes feature independence.
• Hidden Markov Models (HMMs): For sequences with hidden states.
• Gaussian Mixture Models (GMMs): Mixtures of Gaussians.
• Variational Autoencoders (VAEs): Latent representations and generation.
• Generative Adversarial Networks (GANs): Competing networks for realistic data.
• Diffusion Models: Denoise random variables step-by-step.
Use Cases:
• Synthetic data generation
• Anomaly detection
• Semi-supervised learning
Example: Spam Detection (Generative)
• Learns P (X | Spam), P (X | Not Spam)
• Learns P (Spam), P (Not Spam)
• Uses Bayes’ theorem to compute P (Spam | X)
Discriminative Models
Objective: Discriminative models model the conditional probability:

P (Y | X)

They directly learn how to map inputs X to outputs Y .


Functionality:
• Learn the mapping from X to Y
• Optimize boundaries between classes
Examples:
• Logistic Regression: Models binary outcomes.
• SVM: Separates data with hyperplanes.
• Decision Trees: Feature-based splits.
• Random Forests: Ensemble of trees.

• Neural Networks: For complex mappings.
Use Cases:
• Classification and regression
• Real-time predictions
Example: Spam Detection (Discriminative)
• Learns P (Spam | X) and P (Not Spam | X)
• Predicts label with higher probability
• Does not model how data is generated
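To make the contrast concrete, here is a small illustrative sketch (Python, assuming scikit-learn); the toy texts and labels are invented for the example. Multinomial Naive Bayes estimates P(X | Y) and P(Y) and applies Bayes' theorem, while logistic regression estimates P(Y | X) directly.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression

texts = ["win money now", "meeting at noon", "free prize win", "lunch with team"]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = not spam

X = CountVectorizer().fit_transform(texts)

generative = MultinomialNB().fit(X, labels)            # models P(X | Y) and P(Y)
discriminative = LogisticRegression().fit(X, labels)   # models P(Y | X) directly

print(generative.predict_proba(X[:1]))        # posterior obtained via Bayes' theorem
print(discriminative.predict_proba(X[:1]))    # conditional probability estimated directly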

4 How do the Generator and Discriminator loss functions work in a GAN?
Generative Adversarial Networks (GANs) consist of two neural networks — a Generator (G) and a Discrim-
inator (D) — that play a two-player minimax game. The generator tries to create fake data that looks like real
data, while the discriminator tries to tell real from fake. These two networks are trained simultaneously in a
loop, each improving based on the other’s performance.

4.1 Original Minimax GAN Loss Function


The value function for the GAN is:

min_G max_D V(D, G) = E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))]

• x: Sample from real data distribution pdata


• z: Random noise vector from distribution pz
• G(z): Fake data generated by the generator
• D(x): Discriminator’s probability that x is real
• D(G(z)): Discriminator’s probability that generated data is real

4.2 Discriminator Loss Function


The Discriminator’s goal is to maximize:

log D(x) + log(1 − D(G(z)))


So the loss function to minimize is:

LD = −Ex∼pdata [log D(x)] − Ez∼pz [log(1 − D(G(z)))]


This is computed using binary cross-entropy:
• Real data: Label = 1
• Fake data: Label = 0

4.3 Generator Loss Function


The Generator’s original loss:

LG = −Ez∼pz [log(1 − D(G(z)))]


To prevent vanishing gradients, the non-saturating version is used:

LG = −Ez∼pz [log D(G(z))]


This encourages the generator to improve more effectively.

4.4 Training Dynamics
Discriminator Training:
• Input: Real data x and fake data G(z)
• Update to maximize: log D(x) + log(1 − D(G(z)))
Generator Training:
• Input: Noise z
• Generate G(z), evaluate via D
• Update to maximize: log D(G(z))
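A minimal sketch of how these losses are typically computed with binary cross-entropy (Python, assuming PyTorch; d_real and d_fake stand for the discriminator's probability outputs on real and generated batches):

import torch
import torch.nn.functional as F

def discriminator_loss(d_real, d_fake):
    # Real data gets label 1, fake data gets label 0 (binary cross-entropy)
    real_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real))
    fake_loss = F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    return real_loss + fake_loss          # equals -[log D(x) + log(1 - D(G(z)))]

def generator_loss(d_fake):
    # Non-saturating form: push D(G(z)) toward the "real" label 1
    return F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))   # equals -log D(G(z))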

5 What is word embedding, and why is it important in NLP?


5.1 What Is Word Embedding?
Word embedding maps words or phrases from a vocabulary to real-valued vectors in a lower-dimensional space.
These vectors are designed so that words with similar meanings or contexts are located close to each other in
this space. For instance, the vectors for ”king” and ”queen” would be near each other, reflecting their semantic
similarity.

5.2 Why Are Word Embeddings Important in NLP?


• Captures Semantic Relationships: Word embeddings effectively model semantic similarities. A classic
example is:
vec(king) − vec(man) + vec(woman) ≈ vec(queen)
• Reduces Dimensionality: Word embeddings reduce high-dimensional, sparse vectors into lower-dimensional
dense ones, improving computational efficiency.
• Improves Model Generalization: They boost the performance of models like RNNs, LSTMs, and
Transformers in tasks such as sentiment analysis and machine translation.
• Facilitates Contextual Understanding: Models like BERT produce embeddings that depend on con-
text, helping disambiguate words like “bank” in different sentences.
• Handles Out-of-Vocabulary Words: Techniques like fastText represent words by subword units (e.g.,
“unhappiness” → “un”, “happy”, “ness”).

5.3 Common Word Embedding Techniques


• Word2Vec: Developed by Google, using CBOW and Skip-Gram models to learn embeddings.
• GloVe: Created by Stanford, it builds embeddings from co-occurrence statistics.
• fastText: From Facebook AI, incorporates subword information to handle rare words.
• BERT: Google’s transformer-based model produces dynamic, context-aware embeddings.
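For illustration, a minimal sketch (Python, assuming the gensim library) that trains Word2Vec on a toy corpus and queries similarity and an analogy; with such a tiny corpus the results are not meaningful, but the calls show the idea:

from gensim.models import Word2Vec

sentences = [["king", "queen", "royal", "palace"],
             ["man", "woman", "person"],
             ["apple", "banana", "fruit", "eat"]]

model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)   # sg=0 -> CBOW

print(model.wv.most_similar("king", topn=3))
# Analogy query of the form vec(king) - vec(man) + vec(woman):
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))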

6 What Is One-Hot Encoding in NLP? Explain With Examples and Techniques
6.1 What Is One-Hot Encoding?
In this method, each word is represented as a binary vector where:
• The vector length equals the vocabulary size.
• Only one element is set to 1 (based on the word’s index).
• All other elements are set to 0.
This ensures a unique, distinct representation for each word without implying any similarity between different
words.

6.2 How It Works
Step 1: Vocabulary Creation
Given the sentences:

"I love NLP"


"NLP is fun"

We extract unique words and assign them indices:

• ”I” → 0
• ”love” → 1
• ”NLP” → 2

• ”is” → 3
• ”fun” → 4

Step 2: Vector Representation

Word One-Hot Vector


”I” [1, 0, 0, 0, 0]
”love” [0, 1, 0, 0, 0]
”NLP” [0, 0, 1, 0, 0]
”is” [0, 0, 0, 1, 0]
”fun” [0, 0, 0, 0, 1]

6.3 Key Properties


• High-Dimensional and Sparse: For large vocabularies, one-hot vectors become very long and mostly
filled with zeros.
• No Semantic Meaning: Vectors are orthogonal and equidistant, so words like “king” and “queen” are
treated as unrelated as “king” and “apple”.

• Simple and Interpretable: Easy to implement, often used in early models or for creating bag-of-words
representations.

6.4 Use Case: Text Classification


To classify the sentiment of sentences:

• ”I love NLP” → [1, 1, 1, 0, 0]


• ”NLP is fun” → [0, 0, 1, 1, 1]

These sentence-level vectors (bag-of-words) can be used as input to classifiers like logistic regression.

6.5 Visual Example


Vocabulary = ["cat", "dog", "fish"]

"cat" → [1, 0, 0]
"dog" → [0, 1, 0]
"fish" → [0, 0, 1]

Each word has a unique, sparse binary vector with no encoded semantic meaning.
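The same encoding can be written as a short sketch (Python with NumPy), using the vocabulary and index mapping from Section 6.2:

import numpy as np

vocab = ["I", "love", "NLP", "is", "fun"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    vec = np.zeros(len(vocab), dtype=int)
    vec[index[word]] = 1                  # single 1 at the word's index
    return vec

def bag_of_words(sentence):
    vec = np.zeros(len(vocab), dtype=int)
    for word in sentence.split():
        vec[index[word]] = 1              # presence-based bag-of-words
    return vec

print(one_hot("NLP"))                     # [0 0 1 0 0]
print(bag_of_words("I love NLP"))         # [1 1 1 0 0]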

7 What are the limitations of one-hot encoding, and how do word embeddings overcome them?
7.1 Limitations of One-Hot Encoding
7.1.1 High Dimensionality & Sparsity
One-hot vectors are as long as the vocabulary size. For large vocabularies (e.g., 50K words), this leads to huge
sparse vectors.

• Memory inefficient
• Slow computation
• Curse of dimensionality

7.1.2 Lack of Semantic Relationships


One-hot vectors are orthogonal and equidistant. For example, “king”, “queen”, and “apple” are all equally
unrelated:
”king” ⊥ ”queen” ⊥ ”apple”

7.1.3 No Contextual Awareness


The same vector is used for a word in every context:
• ”bank” in ”river bank” vs. ”bank account”

• Cannot distinguish polysemous words

7.1.4 Poor Handling of OOV Words


Words not in the vocabulary are often replaced by a generic [UNK] token, losing semantic information.

7.1.5 Poor ML Generalization


Each word is treated independently, making generalization in ML tasks difficult.

7.2 How Word Embeddings Solve These Problems


7.2.1 Dimensionality Reduction
Word embeddings reduce dimensions (e.g., 300) while preserving relationships.

7.2.2 Semantic Similarity


Similar words are close in the embedding space. Vector arithmetic becomes possible:

”king” − ”man” + ”woman” ≈ ”queen”

7.2.3 Contextual Understanding


• Static Embeddings: One vector per word (e.g., Word2Vec, GloVe)
• Contextual Embeddings: Vectors change based on sentence (e.g., BERT, ELMo)

7.2.4 OOV Handling


Methods like FastText break words into subwords, e.g., “unhappiness” → “un”, “happy”, “ness”.

7.2.5 Better Generalization


Pretrained embeddings enable transfer of language knowledge to downstream tasks (e.g., sentiment analysis).
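As a sketch of the OOV point above (Python, assuming gensim's FastText; the toy sentences are invented), an out-of-vocabulary word still receives a vector built from its character n-grams:

from gensim.models import FastText

sentences = [["happiness", "brings", "joy"],
             ["sadness", "brings", "tears"]]

model = FastText(sentences, vector_size=50, window=2, min_count=1, min_n=3, max_n=5)

print("unhappiness" in model.wv.key_to_index)        # False: never seen during training
print(model.wv["unhappiness"][:5])                   # vector built from shared subwords
print(model.wv.similarity("happiness", "unhappiness"))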

8 What is Word2Vec, and how does it help in word representation?
8.1 What Is Word2Vec?
Word2Vec is a group of models used to produce word embeddings—dense, continuous vector representations of
words—in a high-dimensional vector space. The foundational concept is based on the distributional hypothe-
sis, which states that words appearing in similar contexts tend to have similar meanings. Introduced by Mikolov
et al. at Google in 2013, Word2Vec analyzes massive text corpora to learn these contextual relationships.
This results in semantically similar words being located near each other in the embedding space. For example,
”king” and ”queen” would appear closer together in vector space than ”king” and ”apple”.

8.2 How Does Word2Vec Work?


Word2Vec utilizes a shallow, two-layer neural network to learn embeddings. It offers two training architectures:

8.2.1 Continuous Bag of Words (CBOW)


• Objective: Predict the target (center) word based on its surrounding context words.

• Mechanism: Given a context window (e.g., two words before and after), the model tries to predict the
word in the middle.
• Characteristics: Faster to train and generally performs better on high-frequency words.

8.2.2 Skip-Gram
• Objective: Predict the context words given a target (center) word.

• Mechanism: For every word in the corpus, the model attempts to predict words within a predefined
context window.
• Characteristics: Performs better with smaller datasets and rare or less frequent words.

8.3 Training Process


The process involves:
• A shallow neural network with a single hidden layer.
• The input is either the context words (CBOW) or a target word (Skip-Gram).

• The weights of the hidden layer become the word embeddings.


• Common optimization methods used:
– Negative Sampling: Helps in updating only a small subset of weights.
– Hierarchical Softmax: Speeds up computation for rare words.

8.4 Why Is Word2Vec Important?


8.4.1 Semantic Similarity
Word2Vec captures not just co-occurrence patterns but also semantic and syntactic relationships. Words that
appear in similar contexts end up with similar vector representations. For example:

vector(”king”) ≈ vector(”queen”), vector(”Paris”) ≈ vector(”France”)

8.4.2 Vector Arithmetic


Word2Vec supports operations on word vectors to find analogies. A famous example:

vector(”king”) − vector(”man”) + vector(”woman”) ≈ vector(”queen”)

8.4.3 Dimensionality Reduction
Instead of sparse, high-dimensional one-hot vectors, Word2Vec produces dense embeddings typically of 100–300
dimensions. This leads to:
• Reduced computational cost.

• Lower memory usage.


• Faster training and inference.

8.4.4 Improved NLP Task Performance


The embeddings generated by Word2Vec are transferable and enhance performance across many NLP tasks such
as:

• Sentiment analysis
• Machine translation
• Information retrieval

8.4.5 Efficiency and Scalability


• Word2Vec scales efficiently to billions of words.
• Once trained, the word embeddings can be reused across different tasks, enabling transfer learning.
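A brief configuration sketch (Python, assuming gensim 4.x parameter names) showing how the two architectures and the two optimization methods mentioned above map onto training parameters; the toy corpus is only illustrative:

from gensim.models import Word2Vec

corpus = [["the", "cat", "sat", "on", "the", "mat"],
          ["the", "quick", "brown", "fox", "jumps"]]

# CBOW (sg=0) trained with negative sampling (negative > 0, hs=0)
cbow = Word2Vec(corpus, sg=0, negative=5, hs=0,
                vector_size=100, window=2, min_count=1)

# Skip-gram (sg=1) trained with hierarchical softmax (hs=1, negative=0)
skipgram = Word2Vec(corpus, sg=1, negative=0, hs=1,
                    vector_size=100, window=2, min_count=1)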

9 Explain the basic idea behind the Continuous Bag of Words (CBOW) model.
9.1 What is the CBOW Model?
The Continuous Bag of Words (CBOW) model is a neural network-based architecture used in Natural Language
Processing (NLP) to learn word embeddings. It belongs to the Word2Vec framework and is one of the two main
models, the other being the Skip-Gram model.
The key idea behind CBOW is simple: predict a target word based on its surrounding context words.
This approach is rooted in the distributional hypothesis — the notion that words occurring in similar contexts
tend to have similar meanings.

9.2 Basic Concept of CBOW (With Example)


Suppose we have the sentence:

“The quick brown fox jumps over the lazy dog”

With a context window size of 2, and if the target word is “fox”, then:
• The context words are: ”quick”, ”brown”, ”jumps”, ”over”
• The CBOW model’s job is to use these context words to predict “fox”.
In another example, for the sentence:

“The cat sat on the mat”

With a window size of 2:


• To predict ”sat”, the context might be [”The”, ”cat”, ”on”, ”the”]

• To predict ”on”, the context could be [”cat”, ”sat”, ”the”, ”mat”]

9.3 Architecture of CBOW
The CBOW model consists of three main layers:

1. Input Layer:
• Each context word is represented using one-hot encoding — a vector with the size of the vocabulary
where only one index is ”1” and all others are ”0”.

2. Projection (Hidden) Layer:


• These one-hot vectors are multiplied by a shared embedding matrix to produce dense vector embed-
dings.
• The embeddings of all context words are then averaged (or summed) to form a single vector repre-
senting the overall context.

3. Output Layer:
• The averaged context vector is passed through a softmax layer to compute the probability distribution
over the entire vocabulary.
• The target word is predicted based on this probability distribution.

9.4 Training the CBOW Model


The training objective is to minimize the error in predicting the target word. This is typically done using
the cross-entropy loss function.

L = − log P(w_t | w_{t−k}, . . . , w_{t+k})


Where:

• w_t is the target word
• w_{t−k}, . . . , w_{t+k} are the context words
• k is the context window size

The optimization is carried out using techniques like Stochastic Gradient Descent (SGD).
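A minimal NumPy sketch of one CBOW forward pass and its cross-entropy loss; the vocabulary size, embedding dimension, and word indices below are arbitrary placeholders:

import numpy as np

V, d = 10, 4                        # toy vocabulary size and embedding dimension
rng = np.random.default_rng(0)
W = rng.normal(size=(V, d))         # input embedding matrix (the word vectors)
W_out = rng.normal(size=(d, V))     # output weight matrix

context_ids = [1, 3, 5, 7]          # indices of the context words
target_id = 4                       # index of the target word w_t

h = W[context_ids].mean(axis=0)     # projection layer: average of context embeddings
scores = h @ W_out                  # one score per vocabulary word
probs = np.exp(scores - scores.max())
probs /= probs.sum()                # softmax over the vocabulary

loss = -np.log(probs[target_id])    # cross-entropy: L = -log P(w_t | context)
print(loss)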

10 How does the Skip-gram model differ from CBOW in Word2Vec?


10.1 Overview
Word2Vec is a popular algorithm used for learning word embeddings, which are continuous vector representa-
tions of words. It has two major architectures: CBOW (Continuous Bag of Words) and Skip-gram. These
two differ fundamentally in training approach, performance, and use cases.

10.2 Core Concept: Predicting with Context vs. Predicting Context

Model     | Input → Output Example                      | Training Objective
CBOW      | ["The", "cat", ?, "the", "mat"] → "sat"     | Maximize the probability of the center word given its context
Skip-gram | "sat" → ["The", "cat", "on", "the", "mat"]  | Maximize the probability of the context words given the center word

Analogy:
• CBOW: Like a fill-in-the-blank exercise.
• Skip-gram: Like reverse dictionary lookup.

10.3 Architecture and Training Differences
Feature            | CBOW                                          | Skip-gram
Objective          | Predict target word from context              | Predict context words from target word
Input              | Multiple context words (averaged embeddings)  | Single target word embedding
Output             | Predicts one word (center word)               | Predicts multiple words (context words)
Computational Cost | Faster (uses averaging, fewer training pairs) | Slower (many training pairs)
Performance        | Better for frequent words                     | Better for rare words
Use Case           | Ideal for large datasets                      | Ideal for small datasets or rare word handling

10.4 Why the Performance Difference?


CBOW averages the context vectors, which smooths out the noise and provides robust representations for frequent
words. However, this averaging can cause the model to lose fine-grained distinctions, making it less effective for
rare words.
Skip-gram treats each context-target pair individually. This leads to better representation for rare words, as
the model doesn’t dilute information through averaging.

10.5 Examples
CBOW: Input (context): [”The”, ”cat”, ”on”, ”the”, ”mat”] Output (target): ”sat”
Skip-gram: Input (target): ”sat” Output (context): [”The”, ”cat”, ”on”, ”the”, ”mat”]

10.6 Illustrative Diagrams


CBOW Architecture:
Context Words: ["The", "cat", "on", "the", "mat"]

Averaged Embeddings

Neural Network

Predict Center Word: "sat"

Skip-gram Architecture:
Target Word: "sat"

Neural Network

Predict Context Words: ["The", "cat", "on", "the", "mat"]

11 What is the significance of context window size in CBOW and Skip-gram?
11.1 What Is Context Window Size?
The context window size is a hyperparameter in Word2Vec models (both CBOW and Skip-gram) that defines
how many words before and after a target word are considered as context during training.
A window size of 2 implies:
• CBOW: Uses 2 words before and 2 words after the target word (total of 4 context words).
• Skip-gram: The target word predicts 2 words before and 2 words after.
Example:

Sentence: ”The quick brown fox jumps over the lazy dog”
Target word: ”fox”
Window size = 2 → Context words: [”quick”, ”brown”, ”jumps”, ”over”]
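A small sketch (Python) of how a window of size k selects context words, reproducing the "fox" example above:

def context_window(tokens, target_index, k):
    left = tokens[max(0, target_index - k):target_index]
    right = tokens[target_index + 1:target_index + 1 + k]
    return left + right

tokens = "The quick brown fox jumps over the lazy dog".split()
print(context_window(tokens, tokens.index("fox"), k=2))
# ['quick', 'brown', 'jumps', 'over']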

11.2 Role in CBOW and Skip-Gram Models


Model     | Function                                 | Effect of Smaller Window          | Effect of Larger Window
CBOW      | Predicts target word from context words  | Captures syntactic relationships  | Captures semantic relationships
Skip-gram | Predicts context words from target word  | Focuses on fine-grained syntax    | Learns semantic associations

11.3 Detailed Behavior


11.3.1 CBOW
• Small Window (2–5): Learns strong local syntax (e.g., “eats apple”), ideal for frequent terms.

• Large Window (10+): Captures broader semantics (e.g., “fox” with “clever”), but may dilute patterns
due to over-averaging.

11.3.2 Skip-gram
• Small Window (2–5): Captures local syntactic relations like “jumps” → “fox”. Effective for rare words.
• Large Window (10+): Captures broader context like “fox” → “tail”, “hunt”. Risk of noisy, less relevant
words.

11.4 Trade-offs and Guidelines


Window Size | Best For                      | CBOW                   | Skip-gram
2–5         | Syntactic tasks (POS tagging) | Good                   | Excellent
5–10        | Balanced semantics/syntax     | Best                   | Good
10+         | Topic modeling                | Risk of over-averaging | Better than CBOW

12 How does Word2Vec capture the semantic meaning of words?


12.1 Core Principle: Contextual Semantics
Word2Vec is grounded in the distributional hypothesis — the idea that ”you shall know a word by the
company it keeps.” It captures the semantic meaning of words by examining the contexts in which they
appear within large text corpora. The method transforms words into dense vector representations (also
called embeddings) such that words appearing in similar contexts are positioned closely in the vector
space.
For example:

• Words like “king” and “queen” often occur near words like “royalty,” “crown,” and “palace”.
• Similarly, “apple” and “banana” may co-occur with words like “fruit,” “eat,” and “tree”.

12.2 Learning Semantic Relationships via Word2Vec


Word2Vec employs two neural network architectures:
• CBOW (Continuous Bag of Words): Predicts a target word based on surrounding context words.
• Skip-Gram: Predicts context words given a target word.
Using a sliding context window, the models analyze word co-occurrence patterns. They are trained to
maximize the probability of observing real word pairs (target-context) while adjusting vector values,
thereby capturing semantic relationships.

12.3 How It Works Internally
12.3.1 Neural Network Training
• Architecture: A shallow neural network with one hidden layer.

• CBOW: Aggregates vectors of surrounding words to predict the target.


• Skip-Gram: Uses the target word to predict surrounding words.

12.3.2 Training Objective


The model is trained to reduce prediction error. As a result, words appearing in similar contexts develop similar
vector embeddings. This allows us to quantify word similarity and perform algebraic operations on
vectors.

12.3.3 Vector Space Properties


Word2Vec embeddings display fascinating properties:
• Proximity: Similar words like “dog” and “puppy” have nearby vectors.
• Vector Arithmetic: Word relationships manifest as linear offsets.

vector(”king”) − vector(”man”) + vector(”woman”) ≈ vector(”queen”)
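These two properties can be checked with a few lines of NumPy; the vectors below are random placeholders rather than trained embeddings, so only the operations (cosine similarity and the linear offset) are meaningful:

import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
words = ["king", "queen", "man", "woman", "dog", "puppy"]
emb = {w: rng.normal(size=300) for w in words}     # placeholder vectors

print(cosine(emb["dog"], emb["puppy"]))            # proximity of similar words
analogy = emb["king"] - emb["man"] + emb["woman"]  # linear offset
print(cosine(analogy, emb["queen"]))               # high for well-trained embeddings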

12.4 Semantic Relationships Captured


Relationship Type | Example                     | Property in Vector Space
Synonyms          | “happy” ≈ “joyful”          | Near-identical vectors
Antonyms          | “hot” vs. “cold”            | Opposite directions on some axes
Categories        | “dog” → “animal”            | Hierarchical clustering
Analogies         | king − man + woman ≈ queen  | Linear vector offsets
Functional assoc. | “driver” → “car”            | Co-occurrence in similar contexts

12.5 Mechanisms Enhancing Semantic Capture


12.5.1 Context Window
Defines the span of words considered around a target word.

• Smaller Window (2–5): Captures syntactic relationships (e.g., adjective-noun).


• Larger Window (10+): Captures semantic or thematic relationships (e.g., topic-level).

12.5.2 Negative Sampling


Instead of computing probabilities across the entire vocabulary, the model samples a few negative (unrelated)
words to contrast against positive (context-related) words. This reduces computation and improves accuracy.

12.5.3 Subsampling of Frequent Words


Common words like “the” or “is” are downsampled to reduce their overwhelming influence. This helps the
model focus on more informative and content-rich words.

12.5.4 Dense Low-dimensional Vectors


Word2Vec creates embeddings typically in 100–300 dimensions, where each dimension encodes some latent
semantic or syntactic feature (e.g., gender, category, tense).

13 What is negative sampling in Word2Vec, and why is it used?
13.1 What Is Negative Sampling?
Negative sampling is a method used in the Word2Vec Skip-Gram model to speed up training. Instead of
computing the full softmax over all words in the vocabulary, the model updates weights for:
• The target word (e.g., ‘‘king’’),
• Its actual context word(s) (e.g., ‘‘queen’’),
• A few randomly sampled unrelated “negative” words (e.g., ‘‘apple’’, ‘‘car’’).

13.2 Why Is It Used?


• Computational Efficiency: Avoids the expensive computation of full softmax, reducing training time
significantly.

• Scalability: Can be applied to massive vocabularies (millions of words).


• Effective Embedding Quality: Helps the model learn to distinguish between similar and unrelated
words.

13.3 How It Works


For each (target, context) pair such as (‘‘king’’, ‘‘queen’’):
1. It treats the actual pair as a positive example.

2. It samples k negative examples such as ‘‘apple’’, ‘‘car’’, ‘‘book’’.


3. It maximizes the probability of the correct pair and minimizes the probability of the incorrect ones.

13.4 Loss Function


Let:
• v_king = vector for ‘‘king’’
• v_queen = vector for ‘‘queen’’
• v_neg_i = vector for the i-th negative sample

• σ = sigmoid function
• k = number of negative samples
The loss function is:

L = − log σ(v_king · v_queen) − Σ_{i=1}^{k} log σ(−v_king · v_neg_i)

13.5 Example
• Real context: “king” → “queen” ⇒ increase similarity.
• Negative samples: “king” → “apple”/“car” ⇒ decrease similarity.
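A NumPy sketch of the loss above for one (target, context) pair with k = 5 negative samples; the vectors are random placeholders:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
v_king = rng.normal(size=100)            # target word vector
v_queen = rng.normal(size=100)           # true context word vector
v_neg = rng.normal(size=(5, 100))        # k = 5 negative-sample vectors

loss = -np.log(sigmoid(v_king @ v_queen)) - np.sum(np.log(sigmoid(-(v_neg @ v_king))))
print(loss)                              # negative-sampling loss for this pair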

13.6 Semantic Vector Arithmetic


Word embeddings learned through this approach support analogies such as:

vector(“king”) − vector(“man”) + vector(“woman”) ≈ vector(“queen”)

14 How does Word2Vec handle polysemy, i.e., words with multiple meanings?
14.1 What is Polysemy in Word2Vec
In Word2Vec, every word is represented by a single vector, regardless of how many meanings that word may
have in different contexts. This creates a key limitation when it comes to polysemous words — words with
multiple meanings, such as:
• “Bank”: could refer to a financial institution or the side of a river.
• “Bat”: could mean a flying mammal or a cricket/baseball bat.
Since Word2Vec generates static embeddings, the word “bank” will have only one vector that tries to
represent both meanings.

14.2 Limitations of Word2Vec with Polysemy


1. Static Embeddings: Word2Vec does not consider sentence-level context. It averages the usage of a word
across the entire corpus.
• ‘‘Deposit money in the bank’’ → financial meaning
• ‘‘Fish swim near the bank’’ → geographic meaning
Word2Vec outputs the same vector for “bank” in both sentences.
2. Context Blindness: Word2Vec uses a fixed window and ignores syntactic structure, failing to distinguish
word senses accurately.

14.3 How Word2Vec Partially Handles Polysemy


• Averaging Multiple Meanings: Vectors become a weighted average of various contexts. Downside:
diluted semantics.
• Dominant Meaning Bias: More frequent meanings dominate the vector. E.g., “bank” in financial texts
skews embedding.
• Clustering of Contexts: Usage patterns sometimes cluster but are incidental.

14.4 Improvements in Variants and Modern Models


• FastText: Uses subwords to distinguish partially, e.g., “riverbank”, “banking”. Still context-blind.
• Word Sense Embeddings: Assigns multiple vectors for each meaning.
• BERT/ELMo: Generates contextual embeddings:
– BERT(“Deposit money in the bank”) → vector for “bank” aligns with financial.
– BERT(“Fish swim near the bank”) → vector for “bank” aligns with river.
Ten Marks Questions

15 Explain the architecture of Generative Adversarial Networks (GANs). How do the Generator and Discriminator interact during training?
15.1 Core Idea
Generative Adversarial Networks (GANs) are a class of machine learning frameworks introduced by Ian Goodfel-
low in 2014. The core concept involves two neural networks—the Generator (G) and Discriminator (D)—engaged
in a minimax game.
• Generator (G): Learns to generate fake data that resembles real data.
• Discriminator (D): Learns to distinguish between real and fake data.

15.2 GAN Architecture
Generator (G)
• Input: Random noise vector z
• Output: Synthetic data sample (e.g., image)
• Design: Transposed CNN or MLP
• Objective: Generate data indistinguishable from real data

Discriminator (D)
• Input: Real or fake data
• Output: Probability that input is real
• Design: CNN or MLP
• Objective: Correctly classify real vs. fake

Zero-Sum Game
G tries to fool D, while D tries to detect fakes—each network improves through competition.

15.3 Mathematical Formulation


min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]

15.4 Interaction During Training


Discriminator Training
• Train on real data (maximize log D(x))
• Train on fake data (maximize log(1 − D(G(z))))
Generator Training
• Train to maximize log D(G(z))
• Uses feedback from D to update weights

15.5 Loss Functions


Discriminator Loss:
LD = −Ex∼pdata [log D(x)] − Ez∼pz [log(1 − D(G(z)))]
Generator Loss (standard and non-saturating):

LG = −Ez∼pz [log(1 − D(G(z)))] or LG = Ez∼pz [log D(G(z))]

15.6 Training Algorithm


1. Sample real data x ∼ pdata
2. Generate fake data: G(z), z ∼ pz
3. Update D using both real and fake data
4. Update G to fool D using feedback
5. Repeat until D cannot distinguish real/fake

15.7 Training Dynamics


• Discriminator often trained more frequently
• Enhances stability and convergence

15.8 Training Progression
• Generator improves at realism
• Discriminator improves at detection
• At convergence, D cannot distinguish real from fake (50% accuracy)

15.9 Visual Diagram


                 +------------------+
                 | Real Data Source |----- real samples x ------+
                 +------------------+                           |
                                                                v
  noise z -----> +-------------+                           +--------+
                 | Generator G |--- generated samples ----> |   D    |---> Real/Fake?
                 +-------------+          G(z)              +--------+

Figure: GAN architecture showing generator and discriminator in adversarial training

16 How does backpropagation work in GANs? Explain how both the Generator and Discriminator are updated during training.
16.1 Introduction to GAN Training
Generative Adversarial Networks consist of two competing neural networks:
• Generator (G): Creates synthetic data from random noise
• Discriminator (D): Distinguishes real data from generated samples
The training follows a minimax game with the objective function:
min_G max_D V(D, G) = E_{x∼p_data}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))]

16.2 Training Process Overview


16.2.1 Forward Pass
1. Sample real data: x ∼ pdata (x)
2. Sample noise vector: z ∼ pz (z)
3. Generate fake data: x̂ = G(z)
4. Compute discriminator outputs: D(x) and D(x̂)

Figure 1: GAN training framework showing the adversarial relationship

16.3 Discriminator Update


16.3.1 Loss Function
LD = − [log D(x) + log(1 − D(G(z)))]

16.3.2 Backpropagation Steps


• Compute gradients of LD with respect to D’s parameters

• Update D using gradient descent:


θD ← θD − η∇θD LD

• Goal: Improve real/fake classification accuracy

16.4 Generator Update


16.4.1 Loss Function (Non-saturating version)
LG = − log D(G(z))

16.4.2 Backpropagation Steps


• Freeze D’s weights during G update
• Compute gradients through D into G

• Update G’s parameters:


θG ← θG − η∇θG LG

• Goal: Make generated samples more realistic

Figure 2: Gradient flow during generator update (backpropagation through D)
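Putting the two updates together, here is a sketch of one alternating training step (Python, assuming PyTorch; G and D are assumed to be modules where D outputs probabilities). G(z) is detached during the discriminator step so no gradients reach G, and during the generator step gradients flow through D but only G's optimizer takes a step.

import torch
import torch.nn.functional as F

def train_step(G, D, opt_G, opt_D, x_real, z):
    # Discriminator update: detach G(z) so no gradients flow back into G
    d_real = D(x_real)
    d_fake = D(G(z).detach())
    d_loss = F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
             F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
    opt_D.zero_grad(); d_loss.backward(); opt_D.step()

    # Generator update: gradients flow through D into G, but only opt_G steps
    d_fake = D(G(z))
    g_loss = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))   # non-saturating
    opt_G.zero_grad(); g_loss.backward(); opt_G.step()

    return d_loss.item(), g_loss.item()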

17 Explain the differences between the Generator and Discriminator in a GAN. What are their objectives, and how do they influence each other?
17.1 Introduction
Generative Adversarial Networks (GANs) consist of two neural networks locked in an adversarial competition:
• Generator (G): Creates synthetic data

• Discriminator (D): Evaluates data authenticity

Figure 3: GAN framework showing the adversarial relationship between G and D

17.2 Core Differences

Table 1: Comparison of Generator and Discriminator

Aspect        | Generator (G)                                          | Discriminator (D)
Role          | Counterfeiter - produces synthetic data                | Detective - evaluates data authenticity
Input         | Random noise vector z ∼ p_z(z)                         | Real data x or generated data G(z)
Output        | Fake sample G(z)                                       | Probability D(·) ∈ [0, 1] of being real
Objective     | Fool D into accepting fakes as real                    | Correctly classify real vs. generated samples
Loss Function | min_G E_z[log(1 − D(G(z)))] or max_G E_z[log D(G(z))]  | max_D E_x[log D(x)] + E_z[log(1 − D(G(z)))]

17.3 Detailed Objectives


17.3.1 Generator (G)
• Primary Goal: Learn data distribution pdata from noise pz
• Learning Signal: Receives gradients from D through backpropagation

• Improvement Mechanism: Adjusts to produce samples that increase D(G(z))

17.3.2 Discriminator (D)


• Primary Goal: Act as binary classifier
• Training Data:
– Real samples labeled 1
– Generated samples labeled 0
• Improvement Mechanism: Becomes better at detecting subtle artifacts in fakes


Figure 4: Evolution of generator output and discriminator decision boundary during training

17.4 Adversarial Dynamics


17.4.1 Training Process
1. Phase 1 - D Update:
• Freeze G, sample real x and fake G(z)
• Update D to maximize:
LD = log D(x) + log(1 − D(G(z)))
2. Phase 2 - G Update:
• Freeze D, sample new z
• Update G to minimize (original) or maximize (non-saturating):

LG = log(1 − D(G(z))) or − log D(G(z))

18 What is the role of the loss function in GANs? How are the Generator loss and Discriminator loss computed?
18.1 Role of the Loss Function in GANs
In a Generative Adversarial Network (GAN), two neural networks — the Generator (G) and Discriminator
(D) — are trained simultaneously in a competitive setting. The loss function:
• Guides the learning of both networks.
• Quantifies how well G fools D and how well D detects fakes.

18.2 GAN Objective: Minimax Game
min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))]

18.3 Discriminator Loss LD


 
L_D = −( E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))] )

18.4 Generator Loss LG


• Saturating loss:
LG = Ez∼pz (z) [log(1 − D(G(z)))]
• Non-saturating heuristic (preferred):
LG = −Ez∼pz (z) [log D(G(z))]

18.5 Objective Flow


Real Data x → D(x) → log D(x) → L_D
Fake Data G(z) → D(G(z)) → log(1 − D(G(z))) → L_D
Fake Data G(z) → D(G(z)) → − log D(G(z)) → L_G

18.6 Loss Function Summary Table


Component     | Goal               | Loss Function
Discriminator | Classify Real/Fake | L_D = −[log D(x) + log(1 − D(G(z)))]
Generator     | Fool Discriminator | L_G = − log D(G(z)) (heuristic)

19 How are GANs used in image synthesis, deepfake generation, and data augmentation? Discuss real-world applications and challenges.
Generative Adversarial Networks (GANs) are a class of machine learning models composed of two neural net-
works: a Generator that produces realistic data and a Discriminator that distinguishes real data from fake data.
The networks train together in a game-like setup where the Generator aims to fool the Discriminator.
Their ability to generate realistic synthetic data has made GANs valuable in:
• Image synthesis
• Deepfake generation
• Data augmentation

19.1 Image Synthesis Using GANs


19.1.1 Definition
Image synthesis is the process of generating new, high-quality images that resemble those in a given dataset.

19.1.2 How GANs Work


The Generator network starts with random noise and learns to produce realistic images, while the Discriminator
evaluates how close these are to real images. Over time, the Generator improves until it can create images
indistinguishable from real ones.
Input: Noise vector z → Generator G(z) → Fake image

19.1.3 Popular Architectures


• DCGAN (Deep Convolutional GAN): Introduces convolutional layers to improve image quality.
• StyleGAN: Enables control over style and features at various image layers.
• BigGAN: Scalable GANs trained on large datasets like ImageNet.

19.1.4 Applications
Domain    | Example
Art       | Style transfer and sketch-to-image conversion (e.g., NVIDIA GauGAN)
Gaming    | Texture generation and environment synthesis
Fashion   | AI-generated clothing and apparel
Web tools | ThisPersonDoesNotExist.com (photorealistic human faces)

19.2 Deepfake Generation Using GANs


19.2.1 Definition
Deepfakes use AI to generate highly realistic fake images, audio, or videos—most commonly through facial
manipulation.

19.2.2 How GANs Work


GANs, particularly conditional GANs (cGANs) and autoencoder-based GANs, learn facial features from a dataset
and use them to:
• Replace one face with another in video frames
• Synthesize new expressions, speech, or gestures
Input: Source + Target face data → GAN → Fake but realistic face/video

19.2.3 Applications
Use Case        | Example
Movies          | De-aging actors (e.g., The Irishman), reviving deceased actors (e.g., Peter Cushing in Rogue One)
Apps            | Reface, Zao for face-swapping
Virtual Avatars | Used in VR/AR to simulate expressions

19.2.4 Risks and Ethical Challenges


• Misinformation: Fake political/celebrity videos
• Fraud: Fake CEO calls used in scams
• Privacy Violation: Non-consensual content (e.g., deepfake pornography)

19.2.5 Detection Efforts


• Facebook’s Deepfake Detection Challenge
• Watermarking solutions (e.g., Project Origin)

19.3 Data Augmentation Using GANs


19.3.1 Definition
Data augmentation creates more data samples to help machine learning models generalize better. This is crucial
when original datasets are small or imbalanced.

19.3.2 How GANs Work


GANs generate synthetic samples (e.g., new images) to enrich the training dataset, preserving variations in
features like angle, lighting, and shape.
Input: Small training dataset → GAN → Synthetic, realistic training examples

19.3.3 Applications
Domain             | Example
Medical imaging    | Generate CT/MRI scans for rare diseases (e.g., MedGAN, DCGAN)
Autonomous driving | Simulate rain, snow, night-time conditions for testing
Retail/E-commerce  | Virtual try-on with Pix2Pix, CycleGAN
Security testing   | Stress-testing face recognition models with fake data

20 What are the ethical implications of GANs in terms of misuse, bias, and security risks? How can these risks be mitigated?
Generative Adversarial Networks (GANs) are powerful tools capable of generating highly realistic synthetic data.
While they enable impressive advances in image synthesis, data augmentation, and personalization, they also
raise significant ethical concerns. The main issues stem from their potential misuse, inherent biases in training
data, and associated security threats. This section consolidates and organizes the key concerns and possible
mitigation strategies.

20.1 Misuse of GANs


GANs can be used maliciously to create fake multimedia content, deceive systems, or even commit fraud.

Risks and Examples


• Deepfakes: Hyper-realistic fake videos or images of people without their consent. For example, celebrities’
faces being swapped into explicit content, or fake videos of politicians making false statements.

• Fraud and Disinformation: A 2019 case saw fraudsters using AI-generated CEO voices to authorize a
$243,000 wire transfer. In another instance, a deepfake video impersonating the Ukrainian president spread
misinformation during conflict.
• Pornographic Misuse: Non-consensual creation of explicit content using real individuals’ faces.

Mitigation Strategies
• Legal Regulations: Laws like California’s AB-730 and EU digital policies criminalize non-consensual
deepfakes.
• Watermarking and Cryptographic Signatures: Tools like Project Origin embed invisible watermarks
into synthetic media for traceability.
• Detection Tools: Platforms use AI-based systems (e.g., Microsoft’s Video Authenticator) to flag deep-
fakes.

• Platform Policies: Social media and video platforms can enforce terms prohibiting synthetic, misleading
content.

20.2 Bias and Fairness in GAN Outputs


GANs learn from the data they are trained on. If this data has skewed representations, the outputs will also
reflect these biases.

Risks and Examples


• Training Data Bias: GANs like StyleGAN have shown a tendency to generate predominantly light-
skinned faces, underrepresenting other demographics.

• Discriminatory Outputs: GANs used in hiring or advertising tools may reinforce gender or racial
stereotypes (e.g., job applicant faces filtered by skin tone or facial features).

Mitigation Strategies
• Diverse Training Data: Curate datasets that represent a wide range of demographics and environments
(e.g., use tools like FairGAN).

• Bias Detection and Auditing: Tools like IBM’s AI Fairness 360 or fairness metrics (e.g., demographic
parity, equalized odds) help identify and reduce bias.
• Adversarial Debiasing: Use training techniques like GANsanitization to mask sensitive features during
learning.
• Human Oversight: Involve ethicists, legal experts, and affected communities in design and deployment
stages.

20.3 Security Risks Associated with GANs
The ability of GANs to generate realistic data opens up potential attack surfaces on digital security systems.

Risks and Examples


• Biometric Spoofing: Fake faces or fingerprints can be used to trick facial recognition systems, potentially
breaching high-security environments.
• Adversarial Attacks: GANs can generate inputs specifically crafted to fool AI models, like tricking
self-driving cars or fraud detection systems.
• Data Leakage: If not properly trained, GANs can memorize and regenerate real data, compromising
privacy.
• Data Poisoning: Malicious synthetic data (e.g., TrojanGAN) may be injected to corrupt models.

Mitigation Strategies
• Robust Authentication: Use multi-factor authentication methods and liveness detection to prevent
spoofing.
• Adversarial Training: Train models with adversarial examples to build robustness against GAN-generated
fakes (e.g., FakeCatcher).
• Differential Privacy: Add noise during training to prevent memorization of sensitive data.
• Secure Data Pipelines: Validate the sources of training data to ensure integrity.

20.4 Role of Policy and Public Awareness


• Transparency: Developers and content creators should disclose when AI-generated content is being used.
• Education and Awareness: Public education campaigns can help people recognize synthetic content
and understand the risks.
• Interdisciplinary Approach: Combine AI innovation with legal, sociological, and ethical research to
ensure responsible development.

21 Compare and contrast one-hot encoding and word embeddings. Discuss their advantages and limitations with examples.
Natural Language Processing (NLP) involves processing and analyzing human language data. Since computers
can’t process words directly, we convert text into numbers using techniques like one-hot encoding and word
embeddings. These methods help us feed textual data into machine learning models.

21.1 One-Hot Encoding


Definition:
One-hot encoding represents each word as a binary vector of size V , where V is the total number of unique words
in the vocabulary. In this vector:
• All positions are zeros except for one.
• The index corresponding to the word is set to 1.
Example:
Assume the vocabulary is: ["cat", "dog", "fish", "bird"]

• ”cat” → [1, 0, 0, 0]
• ”dog” → [0, 1, 0, 0]
• ”fish” → [0, 0, 1, 0]
• ”bird” → [0, 0, 0, 1]

Advantages:
• Simple: Easy to implement and intuitive.
• Unique: Every word gets a distinct, non-overlapping representation.
Limitations:

• High dimensionality: For large vocabularies (e.g., 50K words), vectors become massive.
• No semantic similarity: Words like ”cat” and ”dog” are as distant as ”cat” and ”airplane”.
• Sparse vectors: Most values are zeros — inefficient storage and computation.

Use Cases:
• Useful in simple models like Bag-of-Words (BoW).
• Feasible for small datasets with limited vocabulary.

21.2 Word Embeddings


Definition:
Word embeddings are dense vectors (usually 50–300 dimensions) learned from text data using models such as
Word2Vec, GloVe, FastText, and BERT. These vectors capture semantic meaning — similar words have similar
vector representations.
Example:
Assuming 3-dimensional embeddings:

• ”cat” → [0.21, 0.54, -0.33]


• ”dog” → [0.20, 0.51, -0.29]
• ”bird” → [-0.10, 0.50, 0.20]

Advantages:
• Semantic richness: Similar words lie close in space.

• Compact representation: Lower dimensions than one-hot (e.g., 300D instead of 50K).
• Transfer learning: Pre-trained embeddings boost performance, especially for small datasets.
• Context-awareness (BERT, GPT): Adjusts word meaning based on sentence.
Limitations:

• Needs training: Requires large datasets (like Wikipedia).


• Opaque: Harder to interpret or debug compared to one-hot.
• Out-of-Vocabulary (OOV): Traditional models can’t handle unseen words unless they use subword tech-
niques (e.g., FastText).

Use Cases:
• NLP tasks like sentiment analysis, text classification, machine translation.
• Deep learning architectures like RNNs and Transformers.

22 Explain the Word2Vec architecture, including the CBOW and Skip-gram models. How do they work? Which one performs better for rare words?
Word2Vec is a neural network-based technique introduced by Mikolov et al. (2013) at Google for learning word
embeddings—dense vector representations of words that encode semantic and syntactic similarities.

22.1 Overview of Word2Vec
• Goal: Represent each word as a vector in high-dimensional space where similar words are close.
• Training Objective: Predict either the target word from surrounding context words or context from a
target word.
• Two Main Architectures:
– Continuous Bag of Words (CBOW)
– Skip-gram

22.2 Continuous Bag of Words (CBOW)


Objective: Predict the target word w_t given context words w_{t−2}, w_{t−1}, w_{t+1}, w_{t+2}:
P(w_t | w_{t−2}, w_{t−1}, w_{t+1}, w_{t+2})
Architecture:
• Input: One-hot encoded context words.
• Projection: Average of context embeddings.
• Output: Softmax to predict the target word.
Example: “The cat sat on the mat” with context window 2:
• Input: [“The”, “sat”, “on”, “the”]
• Target: “cat”
Advantages:
• Fast to train.
• Performs well for frequent words.
Disadvantages:
• Less effective for rare words.
• Ignores word order.

22.3 Skip-gram
Objective: Predict context words w_{t−2}, w_{t−1}, w_{t+1}, w_{t+2} given the target word w_t:
P(w_{t−2}, w_{t−1}, w_{t+1}, w_{t+2} | w_t)
Architecture:
• Input: One-hot encoded target word.
• Projection: Embedding of the target word.
• Output: Softmax to predict context words.
Example: “The quick brown fox jumps” with window size 2:
• Target: “fox”
• Predicted context: [“quick”, “brown”, “jumps”]
Advantages:
• Works better for rare words.
• Captures fine-grained semantic relationships.
Disadvantages:
• Slower training.
• Requires more data.

22.4 Performance Comparison for Rare Words
Skip-gram performs better for rare words:
• Directly updates the rare word’s embedding.
• Generates more training instances using multiple contexts.
• Avoids averaging over unrelated contexts, preserving meaning.
Example: Rare word “obfuscate”:
• CBOW: Averages “the”, “code”, “was”, “hard” — meaning diluted.
• Skip-gram: Uses “obfuscate” to predict meaningful context like “code”, “complexity”.

23 How is Word2Vec trained using neural networks? Discuss the training mechanism, loss function, and optimization techniques used.
Word2Vec is a technique to learn continuous vector representations (embeddings) for words using a shallow
neural network. It captures semantic relationships between words such that similar words have vectors close to
each other in the embedding space.
There are two main architectures:
- CBOW (Continuous Bag of Words): Predicts a word given its context.
- Skip-gram: Predicts context words given a center word.

23.1 Neural Network Architecture


Word2Vec uses a simple feedforward neural network with one hidden layer and no activation function. It contains:
• Input Layer: One-hot encoded vectors of size |V |, where V is the vocabulary size.
• Projection (Hidden) Layer: A weight matrix W ∈ R^{|V|×d} maps the input to a d-dimensional dense embedding space.
• Output Layer: A weight matrix W′ ∈ R^{d×|V|} maps the embedding back to a vocabulary-sized vector.

23.2 Training Mechanism


23.2.1 CBOW (Continuous Bag of Words)

v_context = (1/C) Σ_{i=1}^{C} v_i,    z = W′ · v_context,    ŷ = softmax(z)

23.2.2 Skip-gram
- Use center word’s embedding to predict each context word using softmax.

23.3 Loss Function

P(w_o | w_i) = exp(v_{w_o}^⊤ u_{w_i}) / Σ_{w∈V} exp(v_w^⊤ u_{w_i})

L = − log P(w_o | w_i)
CBOW: L = − log P(w_t | context)
Skip-gram: L = − Σ_{c=1}^{C} log P(w_c | w_t)

23.4 Optimization Techniques


23.4.1 Negative Sampling

L = − log σ(v_{w_o}^⊤ u_{w_i}) − Σ_{k=1}^{K} log σ(−v_{w_k}^⊤ u_{w_i})

23.4.2 Hierarchical Softmax
- Builds a binary tree (Huffman tree)
- Predicts using the path from root to leaf
- Time complexity: O(log |V|)

23.5 Optimization Algorithm


- Uses stochastic gradient descent (SGD)
- Learning rate decay
- Subsampling of frequent words

24 Explain the mathematical formulation of CBOW and Skip-gram. How does the difference in training objectives affect their performance?
Word2Vec captures both semantic (meaning-based) and syntactic (grammar-based) relationships between
words by learning how words are used in context from large text corpora. It is based on the Distributional
Hypothesis which states that “a word is characterized by the company it keeps.”

24.1 Learning Mechanism


Word2Vec uses two architectures:
• CBOW (Continuous Bag of Words): Predicts a word based on its surrounding context.

• Skip-gram: Predicts context words given a target word.


Words that appear in similar contexts (e.g., “cat” and “dog”) are mapped to similar vectors. Similarly, gram-
matical roles are captured through consistent vector shifts.

24.2 Capturing Semantic Relationships


Semantic relationships refer to meaning-based connections between words.
Examples of Semantic Analogies:

vec(king) − vec(man) + vec(woman) ≈ vec(queen)
vec(Paris) − vec(France) + vec(Italy) ≈ vec(Rome)

Examples of Semantic Similarity:


• Synonyms: “happy” and “joyful” have high cosine similarity.
• Thematic: “coffee” and “tea” are similar (both beverages).

24.3 Capturing Syntactic Relationships


Syntactic relationships involve grammatical transformations such as verb tenses and plurals.
Examples:

vec(running) − vec(run) ≈ vec(swimming) − vec(swim)
vec(apples) − vec(apple) ≈ vec(cars) − vec(car)
vec(children) − vec(child) ≈ vec(mice) − vec(mouse)

24.4 Visualization of Relationships


Using techniques like t-SNE or PCA, we can visualize:
• Clusters of semantically similar words (e.g., countries, fruits).
• Vector directions showing grammatical transformations (e.g., tense, gender).

24.5 Training Process
CBOW and Skip-gram models generate training pairs from text. For example:
For “The cat sat on the mat” with window size 2:
• Skip-gram: (”cat”, ”the”), (”cat”, ”sat”), (”cat”, ”on”)

• CBOW: [”the”, ”sat”, ”on”, ”the”] → predict ”cat”


The model uses Negative Sampling to distinguish real context pairs from random noise, optimizing the
vector space.
