GAPE_module_2
April 2025
• 1. Generator (G)
– Function: Creates synthetic data samples from random noise
– Input: Takes a random noise vector (latent vector) as input
– Output: Produces synthetic data (e.g., images, text)
– Initial Performance: Generates low-quality outputs that improve with training
– Goal: To produce data indistinguishable from real data
• 2. Discriminator (D)
– Function: Acts as a classifier distinguishing real from fake data
– Input: Receives both real training data and generated samples
– Output: Predicts probability (0 to 1) of input being real
– Goal: To accurately identify generated samples as fake
Working Principle
The two networks engage in an adversarial training process:
1. Generator’s Process:
• Takes random noise vector z from latent space
• Transforms it through neural network layers into synthetic data G(z)
• Attempts to make G(z) resemble real data distribution
2. Discriminator’s Process:
• Receives both real data samples x and generated samples G(z)
• Outputs probability D(x) or D(G(z)) indicating ”realness”
• Provides feedback to the generator through backpropagation
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
Where:
• D(x) = Discriminator’s probability that real data x is real
• G(z) = Generator’s output from noise z
• D(G(z)) = Discriminator’s probability that generated sample is real
• pdata = Distribution of real data
• pz = Distribution of generator’s input noise
Training Dynamics and Convergence
• Initial Phase:
– Generator produces obvious fakes
– Discriminator easily identifies them (D(G(z)) ≈ 0)
• Intermediate Phase:
– Generator improves quality
– Discriminator becomes more sophisticated
– Adversarial competition drives improvement
• Ideal Convergence:
– Generator produces outputs indistinguishable from real data
– Discriminator outputs 50% probability (random guessing)
– Nash equilibrium is reached
• Early Stage:
– Generator: Produces random noise or blurry shapes
– Discriminator: Easily identifies fakes (high accuracy)
• Middle Stage:
– Generator: Creates face-like structures with basic features
– Discriminator: Begins to struggle with better fakes
• Final Stage:
– Generator: Produces photorealistic faces
– Discriminator: Cannot reliably distinguish real from fake (≈ 50% accuracy)
1. Minimax Objective:
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
2. Loss Functions:
• Discriminator:
L_D = -\mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] - \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
• Generator (original):
L_G = -\mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
• Generator (non-saturating):
L_G = -\mathbb{E}_{z \sim p_z(z)}[\log D(G(z))]
3. Training Dynamics:
• Discriminator update (gradient ascent on V):
\nabla_D V(D, G) = \nabla_D \left( \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \right)
• Generator update (gradient descent on V):
\nabla_G V(D, G) = \nabla_G \, \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
4. Optimality: At equilibrium, D(x) = 0.5 and the generated distribution equals the real data distribution.
5. LSGAN Loss Functions:
• Discriminator:
L_D = \tfrac{1}{2}\,\mathbb{E}_{x \sim p_{\text{data}}(x)}[(D(x) - b)^2] + \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}[(D(G(z)) - a)^2]
• Generator:
L_G = \tfrac{1}{2}\,\mathbb{E}_{z \sim p_z(z)}[(D(G(z)) - c)^2]
Here a and b are the target labels for fake and real samples, and c is the value the generator wants the discriminator to assign to its fakes (commonly a = 0, b = c = 1).
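These loss functions translate almost directly into code. Below is a minimal sketch, assuming PyTorch, a discriminator whose output is a probability (sigmoid) for the cross-entropy losses, and placeholder models G and D; it illustrates the equations above rather than giving a full implementation.

import torch
import torch.nn.functional as F

def d_loss_standard(D, G, real, z):
    # L_D = -E[log D(x)] - E[log(1 - D(G(z)))]
    real_scores = D(real)
    fake_scores = D(G(z).detach())          # detach: G is not updated on D's step
    return (F.binary_cross_entropy(real_scores, torch.ones_like(real_scores))
            + F.binary_cross_entropy(fake_scores, torch.zeros_like(fake_scores)))

def g_loss_non_saturating(D, G, z):
    # L_G = -E[log D(G(z))]   (non-saturating form)
    fake_scores = D(G(z))
    return F.binary_cross_entropy(fake_scores, torch.ones_like(fake_scores))

def d_loss_lsgan(D, G, real, z, a=0.0, b=1.0):
    # L_D = 1/2 E[(D(x) - b)^2] + 1/2 E[(D(G(z)) - a)^2]
    # (for LSGAN, D usually outputs raw scores rather than probabilities)
    return (0.5 * ((D(real) - b) ** 2).mean()
            + 0.5 * ((D(G(z).detach()) - a) ** 2).mean())

def g_loss_lsgan(D, G, z, c=1.0):
    # L_G = 1/2 E[(D(G(z)) - c)^2]
    return 0.5 * ((D(G(z)) - c) ** 2).mean()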
3 Explain the difference between generative and discriminative models with examples and equations.
Generative Models
Objective: Generative models learn the joint probability distribution:
P (X, Y )
This models how data features X and labels Y are generated together.
Functionality:
• Estimate the data generation process, allowing creation of new data points.
• Compute conditional probabilities using Bayes’ Theorem:
P(Y \mid X) = \frac{P(X \mid Y) \cdot P(Y)}{P(X)}
Examples:
• Naïve Bayes Classifier: Assumes feature independence.
• Hidden Markov Models (HMMs): For sequences with hidden states.
• Gaussian Mixture Models (GMMs): Mixtures of Gaussians.
• Variational Autoencoders (VAEs): Latent representations and generation.
• Generative Adversarial Networks (GANs): Competing networks for realistic data.
• Diffusion Models: Denoise random variables step-by-step.
Use Cases:
• Synthetic data generation
• Anomaly detection
• Semi-supervised learning
Example: Spam Detection (Generative)
• Learns P (X | Spam), P (X | Not Spam)
• Learns P (Spam), P (Not Spam)
• Uses Bayes’ theorem to compute P (Spam | X)
Discriminative Models
Objective: Discriminative models model the conditional probability:
P (Y | X)
Examples:
• Logistic Regression: Directly models P(Y | X).
• Support Vector Machines (SVMs): Learn a decision boundary between classes.
• Neural Networks: For complex mappings.
Use Cases:
• Classification and regression
• Real-time predictions
Example: Spam Detection (Discriminative)
• Learns P (Spam | X) and P (Not Spam | X)
• Predicts label with higher probability
• Does not model how data is generated
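To make the generative/discriminative contrast concrete, here is a small hedged sketch using scikit-learn; the four-message toy dataset is invented purely for illustration. Multinomial naïve Bayes learns P(X | Y) and P(Y) and applies Bayes' theorem, while logistic regression fits P(Y | X) directly.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB          # generative: models P(X | Y) and P(Y)
from sklearn.linear_model import LogisticRegression    # discriminative: models P(Y | X)

texts = ["win money now", "meeting at noon", "cheap money offer", "lunch tomorrow"]
labels = [1, 0, 1, 0]                                   # 1 = spam, 0 = not spam (toy data)

X = CountVectorizer().fit_transform(texts)

generative = MultinomialNB().fit(X, labels)
discriminative = LogisticRegression().fit(X, labels)

# Both expose P(Y | X) at prediction time, but they get there differently:
# naive Bayes via Bayes' theorem, logistic regression by fitting the boundary directly.
print(generative.predict_proba(X[:1]))
print(discriminative.predict_proba(X[:1]))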
4.4 Training Dynamics
Discriminator Training:
• Input: Real data x and fake data G(z)
• Update to maximize: log D(x) + log(1 − D(G(z)))
Generator Training:
• Input: Noise z
• Generate G(z), evaluate via D
• Update to maximize: log D(G(z))
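A hedged sketch of one round of this alternation, assuming PyTorch; dataloader, latent_dim, the models G and D (with D ending in a sigmoid), and the optimizers opt_D and opt_G are placeholders the reader would define.

import torch
import torch.nn.functional as F

for real in dataloader:                                    # batches of real data x
    z = torch.randn(real.size(0), latent_dim)              # noise z ~ p_z

    # Discriminator step: push D(x) toward 1 and D(G(z)) toward 0
    opt_D.zero_grad()
    d_loss = (F.binary_cross_entropy(D(real), torch.ones(real.size(0), 1))
              + F.binary_cross_entropy(D(G(z).detach()), torch.zeros(real.size(0), 1)))
    d_loss.backward()
    opt_D.step()

    # Generator step: push D(G(z)) toward 1 (maximize log D(G(z)))
    opt_G.zero_grad()
    g_loss = F.binary_cross_entropy(D(G(z)), torch.ones(real.size(0), 1))
    g_loss.backward()
    opt_G.step()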
6.2 How It Works
Step 1: Vocabulary Creation
Given the sentences “I love NLP” and “NLP is fun”, each unique word is assigned an index:
• ”I” → 0
• ”love” → 1
• ”NLP” → 2
• ”is” → 3
• ”fun” → 4
• Simple and Interpretable: Easy to implement, often used in early models or for creating bag-of-words
representations.
These sentence-level vectors (bag-of-words) can be used as input to classifiers like logistic regression.
"cat" → [1, 0, 0]
"dog" → [0, 1, 0]
"fish" → [0, 0, 1]
Each word has a unique, sparse binary vector with no encoded semantic meaning.
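A minimal sketch of how such one-hot vectors can be built in NumPy, using the three-word toy vocabulary above:

import numpy as np

vocab = ["cat", "dog", "fish"]
index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    # Vector of length |V| with a single 1 at the word's index.
    vec = np.zeros(len(vocab), dtype=int)
    vec[index[word]] = 1
    return vec

print(one_hot("dog"))   # [0 1 0]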
7 What are the limitations of one-hot encoding, and how do word
embeddings overcome them?
7.1 Limitations of One-Hot Encoding
7.1.1 High Dimensionality & Sparsity
One-hot vectors are as long as the vocabulary size. For large vocabularies (e.g., 50K words), this leads to huge
sparse vectors.
• Memory inefficient
• Slow computation
• Curse of dimensionality
8 What is Word2Vec, and how does it help in word representation?
8.1 What Is Word2Vec?
Word2Vec is a group of models used to produce word embeddings—dense, continuous vector representations of
words—in a high-dimensional vector space. The foundational concept is based on the distributional hypothesis, which states that words appearing in similar contexts tend to have similar meanings. Introduced by Mikolov
et al. at Google in 2013, Word2Vec analyzes massive text corpora to learn these contextual relationships.
This results in semantically similar words being located near each other in the embedding space. For example,
”king” and ”queen” would appear closer together in vector space than ”king” and ”apple”.
8.2.1 1. CBOW (Continuous Bag of Words)
• Objective: Predict the target (center) word given its surrounding context words.
• Mechanism: Given a context window (e.g., two words before and after), the model tries to predict the word in the middle.
• Characteristics: Faster to train and generally performs better on high-frequency words.
8.2.2 2. Skip-Gram
• Objective: Predict the context words given a target (center) word.
• Mechanism: For every word in the corpus, the model attempts to predict words within a predefined
context window.
• Characteristics: Performs better with smaller datasets and rare or less frequent words.
8.4.3 Dimensionality Reduction
Instead of sparse, high-dimensional one-hot vectors, Word2Vec produces dense embeddings typically of 100–300
dimensions. This leads to:
• Reduced computational cost.
These dense embeddings are used as inputs to downstream NLP tasks such as:
• Sentiment analysis
• Machine translation
• Information retrieval
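In practice these embeddings are usually trained with an existing library. A hedged sketch using gensim's Word2Vec; the two-sentence toy corpus and the parameter values are purely illustrative:

from gensim.models import Word2Vec

# Tiny tokenized corpus for illustration only.
sentences = [["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"],
             ["the", "king", "and", "the", "queen", "rule", "the", "kingdom"]]

# sg=0 selects CBOW, sg=1 selects Skip-gram; vector_size is the embedding dimension.
model = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)

print(model.wv["king"].shape)                 # (100,) dense embedding
print(model.wv.similarity("king", "queen"))   # cosine similarity between embeddings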
9 Explain the basic idea behind the Continuous Bag of Words (CBOW)
model.
9.1 What is the CBOW Model?
The Continuous Bag of Words (CBOW) model is a neural network-based architecture used in Natural Language
Processing (NLP) to learn word embeddings. It belongs to the Word2Vec framework and is one of the two main
models, the other being the Skip-Gram model.
The key idea behind CBOW is simple: predict a target word based on its surrounding context words.
This approach is rooted in the distributional hypothesis — the notion that words occurring in similar contexts
tend to have similar meanings.
For example, consider the sentence “The quick brown fox jumps over the lazy dog”. With a context window size of 2, and if the target word is “fox”, then:
• The context words are: ”quick”, ”brown”, ”jumps”, ”over”
• The CBOW model’s job is to use these context words to predict “fox”.
9.3 Architecture of CBOW
The CBOW model consists of three main layers:
1. Input Layer:
• Each context word is represented using one-hot encoding — a vector with the size of the vocabulary
where only one index is ”1” and all others are ”0”.
2. Hidden (Projection) Layer:
• The one-hot context vectors are mapped to dense embeddings through a shared weight matrix, and these embeddings are averaged into a single context vector.
3. Output Layer:
• The averaged context vector is passed through a softmax layer to compute the probability distribution
over the entire vocabulary.
• The target word is predicted based on this probability distribution.
The optimization is carried out using techniques like Stochastic Gradient Descent (SGD).
Analogy:
• CBOW: Like a fill-in-the-blank exercise.
• Skip-gram: Like reverse dictionary lookup.
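A minimal NumPy sketch of the CBOW forward pass described above (look up, average, softmax); the vocabulary size, embedding dimension, and word indices are illustrative assumptions.

import numpy as np

V, d = 5000, 100                      # vocabulary size, embedding dimension (assumed)
W_in = np.random.randn(V, d) * 0.01   # input (projection) embeddings
W_out = np.random.randn(d, V) * 0.01  # output weights before softmax

def cbow_forward(context_ids):
    # 1) look up and average the context word embeddings
    h = W_in[context_ids].mean(axis=0)           # shape (d,)
    # 2) score every vocabulary word and apply softmax
    scores = h @ W_out                            # shape (V,)
    probs = np.exp(scores - scores.max())
    return probs / probs.sum()                    # P(target | context)

probs = cbow_forward([12, 47, 310, 9])            # four made-up context word indices
print(probs.argmax())                             # index of the predicted center word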
10.3 Architecture and Training Differences
Feature             | CBOW                                          | Skip-gram
Objective           | Predict target word from context              | Predict context words from target word
Input               | Multiple context words (averaged embeddings)  | Single target word embedding
Output              | Predicts one word (center word)               | Predicts multiple words (context words)
Computational Cost  | Faster (uses averaging, fewer training pairs) | Slower (many training pairs)
Performance         | Better for frequent words                     | Better for rare words
Use Case            | Ideal for large datasets                      | Ideal for small datasets or rare word handling
10.5 Examples
CBOW: Input (context): [”The”, ”cat”, ”on”, ”the”, ”mat”] Output (target): ”sat”
Skip-gram: Input (target): ”sat” Output (context): [”The”, ”cat”, ”on”, ”the”, ”mat”]
Skip-gram Architecture:
Target Word: "sat"
↓
Neural Network
↓
Predict Context Words: ["The", "cat", "on", "the", "mat"]
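The (target, context) training pairs in these examples can be generated mechanically. A small plain-Python sketch with window size 2, using the example sentence above:

def training_pairs(tokens, window=2):
    # Skip-gram pairs: (target word, one context word) for every position.
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))
    return pairs

print(training_pairs(["the", "cat", "sat", "on", "the", "mat"]))
# includes e.g. ('sat', 'the'), ('sat', 'cat'), ('sat', 'on'), ('sat', 'the'), ...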
Sentence: ”The quick brown fox jumps over the lazy dog”
Target word: ”fox”
Window size = 2 → Context words: [”quick”, ”brown”, ”jumps”, ”over”]
11.3.1 CBOW
• Large Window (10+): Captures broader semantics (e.g., “fox” with “clever”), but may dilute patterns due to over-averaging.
11.3.2 Skip-gram
• Small Window (2–5): Captures local syntactic relations like “jumps” → “fox”. Effective for rare words.
• Large Window (10+): Captures broader context like “fox” → “tail”, “hunt”. Risk of noisy, less relevant
words.
• Words like “king” and “queen” often occur near words like “royalty,” “crown,” and “palace”.
• Similarly, “apple” and “banana” may co-occur with words like “fruit,” “eat,” and “tree”.
12.3 How It Works Internally
12.3.1 Neural Network Training
• Architecture: A shallow neural network with one hidden layer.
13 What is negative sampling in Word2Vec, and why is it used?
13.1 What Is Negative Sampling?
Negative sampling is a method used in the Word2Vec Skip-Gram model to speed up training. Instead of
computing the full softmax over all words in the vocabulary, the model updates weights for:
• The target word (e.g., “king”),
• Its actual context word(s) (e.g., “queen”),
• A few randomly sampled unrelated “negative” words (e.g., “apple”, “car”).
The loss function is:
L = -\log \sigma(v_{\text{king}} \cdot v_{\text{queen}}) - \sum_{i=1}^{k} \log \sigma(-v_{\text{king}} \cdot v_{\text{neg}_i})
where:
• σ = sigmoid function
• k = number of negative samples
13.5 Example
• Real context: “king” → “queen” ⇒ increase similarity.
• Negative samples: “king” → “apple”/“car” ⇒ decrease similarity.
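A hedged NumPy sketch of the negative-sampling loss above; the vectors are random placeholders, whereas in a real model they would be rows of the input and output embedding matrices.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d = 100
v_king = np.random.randn(d)             # target word vector (placeholder)
v_queen = np.random.randn(d)            # true context word vector (placeholder)
v_negatives = np.random.randn(5, d)     # k = 5 sampled negative word vectors

# L = -log sigma(v_king . v_queen) - sum_i log sigma(-v_king . v_neg_i)
loss = -np.log(sigmoid(v_king @ v_queen)) \
       - np.sum(np.log(sigmoid(-v_negatives @ v_king)))
print(loss)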
14 How does Word2Vec handle polysemy, i.e., words with multiple
meanings?
14.1 What is Polysemy in Word2Vec
In Word2Vec, every word is represented by a single vector, regardless of how many meanings that word may
have in different contexts. This creates a key limitation when it comes to polysemous words — words with
multiple meanings, such as:
• “Bank”: could refer to a financial institution or the side of a river.
• “Bat”: could mean a flying mammal or a cricket/baseball bat.
Since Word2Vec generates static embeddings, the word “bank” will have only one vector that tries to
represent both meanings.
15.2 GAN Architecture
Generator (G)
• Input: Random noise vector z
• Output: Synthetic data sample (e.g., image)
• Design: Transposed CNN or MLP
• Objective: Generate data indistinguishable from real data
Discriminator (D)
• Input: Real or fake data
• Output: Probability that input is real
• Design: CNN or MLP
• Objective: Correctly classify real vs. fake
Zero-Sum Game
G tries to fool D, while D tries to detect fakes—each network improves through competition.
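A compact sketch of the two networks described above as MLPs, assuming PyTorch; the layer sizes and data dimension are illustrative, and for image data the linear layers would typically be replaced by (transposed) convolutions.

import torch.nn as nn

latent_dim, data_dim = 100, 784          # e.g. 28x28 flattened images (assumption)

generator = nn.Sequential(               # G: noise z -> synthetic sample
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, data_dim), nn.Tanh(),
)

discriminator = nn.Sequential(           # D: sample -> probability of being real
    nn.Linear(data_dim, 256), nn.LeakyReLU(0.2),
    nn.Linear(256, 1), nn.Sigmoid(),
)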
15.8 Training Progression
• Generator improves at realism
• Discriminator improves at detection
• At convergence, D cannot distinguish real from fake (50% accuracy)
Figure 1: GAN training framework showing the adversarial relationship
17 Explain the differences between the Generator and Discriminator
in a GAN. What are their objectives, and how do they influence
each other?
17.1 Introduction
Generative Adversarial Networks (GANs) consist of two neural networks locked in an adversarial competition:
• Generator (G): Creates synthetic data
• Discriminator (D): Distinguishes real data from the generator’s synthetic data
Figure 4: Evolution of generator output and discriminator decision boundary during training
18 What is the role of the loss function in GANs? How are Generator loss and Discriminator loss computed?
18.1 Role of the Loss Function in GANs
In a Generative Adversarial Network (GAN), two neural networks — the Generator (G) and Discriminator
(D) — are trained simultaneously in a competitive setting. The loss function:
• Guides the learning of both networks.
• Quantifies how well G fools D and how well D detects fakes.
18.2 GAN Objective: Minimax Game
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]
19.1.4 Applications
• Art: Style transfer and sketch-to-image conversion (e.g., NVIDIA GauGAN)
• Gaming: Texture generation and environment synthesis
• Fashion: AI-generated clothing and apparel
• Web tools: ThisPersonDoesNotExist.com (photorealistic human faces)
19.2.3 Applications
• Movies: De-aging actors (e.g., The Irishman), reviving deceased actors (e.g., Peter Cushing in Rogue One)
• Apps: Reface, Zao for face-swapping
• Virtual Avatars: Used in VR/AR to simulate expressions
19.3.3 Applications
• Medical imaging: Generate CT/MRI scans for rare diseases (e.g., MedGAN, DCGAN)
• Autonomous driving: Simulate rain, snow, and night-time conditions for testing
• Retail/E-commerce: Virtual try-on with Pix2Pix, CycleGAN
• Security testing: Stress-testing face recognition models with fake data
20 What are the ethical implications of GANs in terms of misuse,
bias, and security risks? How can these risks be mitigated?
Generative Adversarial Networks (GANs) are powerful tools capable of generating highly realistic synthetic data.
While they enable impressive advances in image synthesis, data augmentation, and personalization, they also
raise significant ethical concerns. The main issues stem from their potential misuse, inherent biases in training
data, and associated security threats. This section consolidates and organizes the key concerns and possible
mitigation strategies.
20.1 Misuse of GANs
• Fraud and Disinformation: A 2019 case saw fraudsters using AI-generated CEO voices to authorize a
$243,000 wire transfer. In another instance, a deepfake video impersonating the Ukrainian president spread
misinformation during conflict.
• Pornographic Misuse: Non-consensual creation of explicit content using real individuals’ faces.
Mitigation Strategies
• Legal Regulations: Laws like California’s AB-730 and EU digital policies criminalize non-consensual
deepfakes.
• Watermarking and Cryptographic Signatures: Tools like Project Origin embed invisible watermarks
into synthetic media for traceability.
• Detection Tools: Platforms use AI-based systems (e.g., Microsoft’s Video Authenticator) to flag deep-
fakes.
• Platform Policies: Social media and video platforms can enforce terms prohibiting synthetic, misleading
content.
20.2 Bias in GAN Outputs
• Discriminatory Outputs: GANs used in hiring or advertising tools may reinforce gender or racial
stereotypes (e.g., job applicant faces filtered by skin tone or facial features).
Mitigation Strategies
• Diverse Training Data: Curate datasets that represent a wide range of demographics and environments
(e.g., use tools like FairGAN).
• Bias Detection and Auditing: Tools like IBM’s AI Fairness 360 or fairness metrics (e.g., demographic
parity, equalized odds) help identify and reduce bias.
• Adversarial Debiasing: Use training techniques like GANsanitization to mask sensitive features during
learning.
• Human Oversight: Involve ethicists, legal experts, and affected communities in design and deployment
stages.
20.3 Security Risks Associated with GANs
The ability of GANs to generate realistic data opens up potential attack surfaces on digital security systems.
Mitigation Strategies
• Robust Authentication: Use multi-factor authentication methods and liveness detection to prevent
spoofing.
• Adversarial Training: Train models with adversarial examples to build robustness against GAN-generated
fakes (e.g., FakeCatcher).
• Differential Privacy: Add noise during training to prevent memorization of sensitive data.
• Secure Data Pipelines: Validate the sources of training data to ensure integrity.
21 Compare one-hot encoding and word embeddings.
One-hot encoding for a four-word vocabulary:
• ”cat” → [1, 0, 0, 0]
• ”dog” → [0, 1, 0, 0]
• ”fish” → [0, 0, 1, 0]
• ”bird” → [0, 0, 0, 1]
Advantages:
• Simple: Easy to implement and intuitive.
• Unique: Every word gets a distinct, non-overlapping representation.
Limitations:
• High dimensionality: For large vocabularies (e.g., 50K words), vectors become massive.
• No semantic similarity: Words like ”cat” and ”dog” are as distant as ”cat” and ”airplane”.
• Sparse vectors: Most values are zeros — inefficient storage and computation.
Use Cases:
• Useful in simple models like Bag-of-Words (BoW).
• Feasible for small datasets with limited vocabulary.
Word Embeddings
Advantages:
• Semantic richness: Similar words lie close in space.
• Compact representation: Lower dimensions than one-hot (e.g., 300D instead of 50K).
• Transfer learning: Pre-trained embeddings boost performance, especially for small datasets.
• Context-awareness (BERT, GPT): Adjusts word meaning based on sentence.
Limitations:
• Static embeddings (e.g., Word2Vec) assign a single vector per word, so multiple word senses are not distinguished (see the discussion of polysemy above).
Use Cases:
• NLP tasks like sentiment analysis, text classification, machine translation.
• Deep learning architectures like RNNs and Transformers.
22.1 Overview of Word2Vec
• Goal: Represent each word as a vector in high-dimensional space where similar words are close.
• Training Objective: Predict either the target word from surrounding context words or context from a
target word.
• Two Main Architectures:
– Continuous Bag of Words (CBOW)
– Skip-gram
22.3 Skip-gram
Objective: Predict context words w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} given the target word w_t:
P(w_{t-2}, w_{t-1}, w_{t+1}, w_{t+2} \mid w_t)
Architecture:
• Input: One-hot encoded target word.
• Projection: Embedding of the target word.
• Output: Softmax to predict context words.
Example: “The quick brown fox jumps” with window size 2:
• Target: “fox”
• Predicted context: [“quick”, “brown”, “jumps”]
Advantages:
• Works better for rare words.
• Captures fine-grained semantic relationships.
Disadvantages:
• Slower training.
• Requires more data.
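As a counterpart to the CBOW sketch earlier, a minimal NumPy sketch of how Skip-gram scores one (target, context) pair with the softmax above; sizes and word indices are illustrative assumptions.

import numpy as np

V, d = 5000, 100
W_in = np.random.randn(V, d) * 0.01    # target (input) embeddings
W_out = np.random.randn(V, d) * 0.01   # context (output) embeddings

def skipgram_prob(target_id, context_id):
    # P(context | target): softmax over all vocabulary words of v_context . v_target
    v_t = W_in[target_id]
    scores = W_out @ v_t                # shape (V,)
    scores -= scores.max()              # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[context_id]

print(skipgram_prob(target_id=42, context_id=17))   # made-up word indices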
22.4 Performance Comparison for Rare Words
Skip-gram performs better for rare words:
• Directly updates the rare word’s embedding.
• Generates more training instances using multiple contexts.
• Avoids averaging over unrelated contexts, preserving meaning.
Example: Rare word “obfuscate”:
• CBOW: Averages “the”, “code”, “was”, “hard” — meaning diluted.
• Skip-gram: Uses “obfuscate” to predict meaningful context like “code”, “complexity”.
23.2.2 Skip-gram
• Use the center word’s embedding to predict each context word using softmax:
L = -\log P(w_o \mid w_i)
CBOW: L = -\log P(w_t \mid \text{context})
Skip-gram: L = -\sum_{c=1}^{C} \log P(w_c \mid w_t)
23.4.2 Hierarchical Softmax
• Builds a binary tree (Huffman tree)
• Predicts using the path from root to leaf
• Time complexity: O(log |V|)
24.5 Training Process
CBOW and Skip-gram models generate training pairs from text. For example:
For “The cat sat on the mat” with window size 2:
• Skip-gram: (”cat”, ”the”), (”cat”, ”sat”), (”cat”, ”on”)