Word representations → Language modeling → Attention/Transformers → BERT (encoder) →
GPT (decoder).
I’ll do each in the 7 checkpoints you asked for.
1) Word Representations (one-hot, TF-IDF →
word2vec/GloVe → contextual vectors)
1. Why it arose
Computers need numbers, not words. We needed a way to turn text into vectors for any
ML model.
2. Without it, problem
No numeric input ⇒ can’t train. Early “bag-of-words” lost word order and meaning;
vectors were huge and sparse.
3. What it is
Techniques to map words to vectors:
● One-hot/TF-IDF (sparse counts)
● Dense embeddings (word2vec/GloVe): small, real-valued vectors capturing similarity.
4. What problems it solved (key concepts)
● Dense vectors capture distributional semantics (“you shall know a word by the
company it keeps”).
● word2vec: learn vectors by predicting context (CBOW/Skip-gram).
● GloVe: factorizes global co-occurrence statistics.
5. Idea/objective
Make words that appear in similar contexts have nearby vectors; reduce sparsity;
generalize better.
6. Applications
Every NLP task: classification, NER, search, recommendation, etc.
7. Limitations & how to overcome
● Static meaning: one vector per word can’t handle polysemy (“bank”).
Fix: contextual embeddings from Transformers (BERT/GPT) give a different vector per
usage.
● No long-range syntax in BoW/word2vec.
Fix: attention models that see full sentences.
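A toy sketch of the "nearby vectors" idea from points 3–5 above: cosine similarity between dense embeddings. The 4-dimensional vectors below are made up for illustration, not real word2vec/GloVe outputs.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: close to 1.0 = similar direction, near 0.0 = unrelated.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 4-d embeddings (real word2vec/GloVe vectors are 100-300-d).
vec = {
    "king":  np.array([0.8, 0.6, 0.1, 0.0]),
    "queen": np.array([0.7, 0.7, 0.1, 0.1]),
    "apple": np.array([0.0, 0.1, 0.9, 0.8]),
}

print(cosine(vec["king"], vec["queen"]))  # high: words seen in similar contexts
print(cosine(vec["king"], vec["apple"]))  # low: different contexts
```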
2) Transformers & Self-Attention (the bridge)
1. Why it arose
RNNs/LSTMs struggled with long dependencies and were slow to parallelize.
2. Without it, problem
Vanishing gradients, sequential bottlenecks, limited context capture.
3. What it is
An architecture that uses self-attention instead of recurrence to mix information across
all positions.
4. Key concepts
● Q, K, V projections; Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V, where dividing by √d_k keeps the dot products in a stable range (a minimal sketch appears at the end of this section).
● Multi-head attention: several heads attend in different subspaces; positional encodings add word-order information.
5. Idea/objective
Let each token “look” at any other token and decide what matters—fast, parallel,
long-range.
6. Applications
Backbone for BERT, GPT, T5, vision transformers, speech, protein folding.
7. Limitations & how to overcome
● Quadratic cost in sequence length.
Fixes: sparse/linear attention (Longformer, Performer), chunking, recurrence,
state-space models.
● Context window finite.
Fixes: long-context variants, retrieval-augmented models.
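A minimal PyTorch sketch of the attention formula from point 4 above (a single head, no multi-head split and no masking); tensor shapes and names are illustrative, not a faithful Transformer implementation.

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Q, K, V: [batch, seq_len, d_k]
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # [batch, seq_len, seq_len]
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ V                              # [batch, seq_len, d_k]

# Toy example: batch of 1, sequence of 5 tokens, head dimension 64.
x = torch.randn(1, 5, 64)
out = attention(x, x, x)   # self-attention: Q = K = V come from the same tokens
print(out.shape)           # torch.Size([1, 5, 64])
```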
3) BERT (Bidirectional Encoder Representations from
Transformers)
1. Why it arose
Static embeddings missed context; we needed contextual vectors and better transfer
learning for understanding tasks.
2. Without it, problem
Models had to be trained from scratch per task; poor sentence-level understanding;
limited data efficiency.
3. What it is
A Transformer encoder trained bidirectionally with:
● Masked Language Modeling (MLM): randomly mask ~15% tokens; predict them using
both left & right context.
● (Original) Next Sentence Prediction (NSP): predict if sentence B follows sentence A.
4. What problems it solved (concepts)
● Produces contextual token embeddings (different vector for “bank” in “river bank” vs
“loan bank”).
● Pretrain → fine-tune paradigm: one big model adapted to many tasks.
5. Idea/objective
Learn deep language understanding from large corpora; transfer that knowledge to
downstream tasks with little data.
6. Applications
Classification, NER, QA (extractive), sentence similarity, entailment, search ranking.
Often: add a small head and fine-tune.
7. Limitations & how to overcome
● Not generative (MLM sees both sides; awkward for open-ended text).
Fix: use decoder models (GPT) or encoder-decoder (T5).
● Pretrain–finetune mismatch (NSP often adds little; [MASK] tokens never appear in downstream inputs).
Fix: RoBERTa (no NSP, more data, dynamic masking), DeBERTa (disentangled
attention).
● Sequence length & quadratic cost.
Fix: Longformer/BigBird, distillation (DistilBERT), parameter sharing (ALBERT).
4) GPT (Generative Pretrained Transformer)
1. Why it arose
We needed fluent generation and strong few-shot behavior with a single, unified
objective.
2. Without it, problem
Encoder-only models excel at understanding but are clunky for open-ended generation
and instruction following.
3. What it is
A Transformer decoder trained with causal (autoregressive) LM: predict next token
given previous tokens; mask future positions.
4. What problems it solved (concepts)
● Unified objective yields strong generalization; few/zero-shot via prompting.
● Natural fit for writing, dialogue, code, reasoning chains (with the right prompting).
5. Idea/objective
Model the distribution P(next token | prior tokens) so well that useful behavior emerges
when prompted.
6. Applications
Chatbots, content generation, code completion, summarization, translation, tool-use via
function calling, RAG.
7. Limitations & how to overcome
● Hallucinations & factuality.
Fix: Retrieval-Augmented Generation (RAG), grounding, tool calling, system
prompts/guardrails.
● Bias/toxicity from data.
Fix: dataset curation, RLHF/RLAIF, safety filters.
● Context limits & cost.
Fix: long-context models, compression, planning + external memory.
● Weak at structured extraction vs encoders in some cases.
Fix: fine-tune for extraction or combine with encoder models.
TL;DR mental map
● Word vectors → gave meaning but were static.
● Transformers → solved long-range/context + parallelism.
● BERT (encoder, MLM) → best for understanding; contextual embeddings for
classification/QA.
● GPT (decoder, causal LM) → best for generation; few-shot via prompts.
Here’s a clean, viva-friendly way to understand why BERT was invented, what it is, how it
works, what it’s good/bad at, and how it compares to earlier methods (ELMo/GPT). I’ll
keep the flow you asked for.
1) Why did this topic arise? (the pain)
Before BERT, strong language models were directional:
● Left→Right (e.g., GPT): each token sees only the past.
● Right←Left (or two separate LSTMs as in ELMo): one pass each way, then combine.
That worked for generation, but it’s sub-optimal for understanding. Natural language meaning
is often decided by both left and right neighbors:
“I went to the bank to deposit cash.” vs “I sat on the bank of the river.”
To decide which “bank,” you want to look both left and right at once. Unidirectional LMs can’t
fully do that.
2) What is BERT (in one line)?
BERT = Bidirectional Encoder Representations from Transformers.
It’s a Transformer encoder-only model pre-trained to fill in masked words (and originally, to
judge if two sentences belong together), so it learns deep, bidirectional context. Then you
fine-tune it on your task with a small head on top.
3) Key idea & objective (how it avoids the
“seeing itself” cheat)
BERT creates a Cloze test during pre-training:
● Randomly mask ~15% of tokens in the input.
● Train the model to predict the masked tokens from both left and right context.
● (Original BERT also used Next Sentence Prediction (NSP): given two sentences A,B,
predict if B follows A.)
4) What problem does BERT solve
(conceptually)?
● It learns contextual word meanings that depend on both sides of a word (true
bidirectionality).
● It provides powerful pre-trained representations you can adapt with little labeled data.
● It’s focused on understanding (classification, tagging, QA), not free-form generation.
5) How BERT works (simple mental model
+ a concrete example)
Architecture.
Just the Transformer encoder stack (multi-head self-attention + feed-forward, repeated). No
decoder.
Tokens & special symbols.
● Add [CLS] at the start (sentence-level summary token).
● Use [SEP] to separate sentence A and B (for pair tasks).
● Some tokens are replaced by [MASK] during pre-training.
Example (masked LM).
Input: “I went to the [MASK] to deposit cash.”
BERT attends to all tokens on both sides and predicts “bank”.
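A quick way to see this in practice, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint are available:

```python
from transformers import pipeline

# Masked-LM inference: BERT fills in [MASK] using both left and right context.
fill = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill("I went to the [MASK] to deposit cash."):
    print(pred["token_str"], round(pred["score"], 3))
# Expect "bank" at or near the top of the ranked predictions.
```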
Fine-tuning heads.
● Text classification (sentiment, topic): put a small classifier on top of the final hidden
state of [CLS].
● Token labeling (NER, POS): put a token-wise classifier on each position.
● Extractive QA (SQuAD): add two classifiers to predict start and end indices of the
answer span.
6) Real-world applications (how it helps)
● Customer feedback sentiment: fine-tune BERT-base for POSITIVE/NEGATIVE; small
labeled set is enough because BERT already “knows” a lot of English.
● Entity extraction for KYC/compliance: tag names, locations, organizations from
documents.
● Search & semantic retrieval: encode queries and passages; use cosine similarity for
ranking (e.g., bi-encoder setups).
● Question answering on knowledge bases: pick answer spans from product manuals,
policies, FAQs.
● Intent detection & slot filling in chatbots.
● Document classification (legal, medical, support tickets).
Compared with traditional TF-IDF or word2vec, BERT’s embeddings change with context, so
polysemy (“bank”) and long-distance cues are handled far better.
7) How BERT relates to ELMo and GPT
(quick history anchor)
● ELMo (2018): two LSTMs (left→right and right←left) trained as language models;
combine their states. It’s contextual, but still uses RNNs and separate directions.
● GPT (2018): Transformer decoder, strictly left→right; great for generation and
zero-shot transfer, but not bidirectional.
● BERT (2018): Transformer encoder, bidirectional via masked LM; excellent for
understanding tasks.
A handy rule:
BERT = understand, GPT = generate.
(Modern models can blur this line, but it’s still a good intuition.)
8) Limitations of BERT (and practical ways
to overcome)
1. [MASK] mismatch: at fine-tune/test time, texts don’t contain [MASK], but BERT saw
many [MASK]s during pre-training.
Overcome: RoBERTa (no NSP, dynamic masking, more data) improves robustness;
ELECTRA trains a discriminator to detect replaced tokens instead of masking, further
reducing mismatch.
2. Not generative (can’t easily compute full sentence probabilities or write long text).
Overcome: use GPT/BART/T5 for generation or seq2seq tasks.
3. Input length ~512 tokens and quadratic attention cost.
Overcome: long-context encoder variants (Longformer, BigBird, RoPE-style rotary
with sliding windows) or chunking with retrieval.
4. Heavy to serve (multi-head attention over full sequence).
Overcome: model compression (DistilBERT), ALBERT (parameter sharing),
quantization, or use MiniLM, TinyBERT; add adapters/LoRA for light fine-tuning.
5. Domain shift (legal, medical, code).
Overcome: continue pre-training on in-domain text (DAPT/TAPT), or start from
domain models (BioBERT, Legal-BERT).
6. NSP objective often unhelpful.
Overcome: RoBERTa drops NSP; alternatives like Sentence Order Prediction
(ALBERT) can help when inter-sentence order matters.
9) Short, exam-ready comparison: BERT
vs GPT
● Directionality: BERT = bidirectional encoder (masked LM); GPT = left-to-right
decoder (causal LM).
● Best for: BERT = understanding (classification, NER, QA extractive); GPT =
generation (writing, dialogue, code).
● Objective: BERT = predict masked tokens (+/– NSP); GPT = predict next token.
● Outputs: BERT usually feeds a small task head; GPT directly generates tokens.
10) Minimal “how you’d use it” recipe
(HuggingFace, conceptually)
1. Load bert-base-uncased.
2. Tokenize text(s); prepend [CLS], append [SEP].
3. Add a tiny classifier (e.g., 2-way softmax) on top of the [CLS] vector.
4. Fine-tune for a few epochs—done.
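A hedged sketch of that recipe using the transformers Trainer API; the toy dataset, label count, and hyperparameters are placeholders for your own task, not a definitive setup.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Tiny toy dataset just to show the moving parts; substitute your labeled data.
raw = Dataset.from_dict({
    "text": ["I loved this movie", "Terrible, boring film",
             "Great acting and plot", "Worst thing I have watched"],
    "label": [1, 0, 1, 0],
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # tiny 2-way head on top of [CLS]

def tokenize(batch):
    # [CLS]/[SEP] are inserted by the tokenizer automatically.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=64)

ds = raw.map(tokenize, batched=True)

args = TrainingArguments(output_dir="bert-sentiment", num_train_epochs=3,
                         per_device_train_batch_size=2, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=ds).train()
```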
One-line takeaway
BERT brought true bidirectional context to pre-trained language understanding by training a
Transformer encoder to fill in masked words. That single idea made sentence representations
far richer, slashed labeled-data needs, and set new standards on classification, tagging, and
QA—while leaving open-ended generation to GPT-style models.
Great slide—let’s decode it piece by piece and tie it to how you actually use BERT.
1) What the architecture picture is saying
Left diagram (stack of “Trm” blocks):
● E₁ … Eₙ (yellow) are the input embeddings for each token position 1…N.
● Each oval Trm is a Transformer-encoder layer (multi-head self-attention +
feed-forward + residual + layer norm).
● The criss-cross arrows mean: in self-attention, every token can attend to every other
token (bidirectional). There’s no causal mask like GPT.
● After passing through L encoder layers, we get T₁ … Tₙ (green), the final contextual
vectors—one vector per input position. These are the features you give to tiny
task-specific heads.
Right diagram (BASE vs LARGE):
● BERT-base: 12 encoder layers, hidden size 768, 12 attention heads (~110M params).
● BERT-large: 24 encoder layers, hidden size 1024, 16 heads (~340M params).
● Deeper/wider → usually better but slower.
2) Input representation (what actually goes
in)
Before the first encoder layer, BERT builds an embedding for each position by summing three
pieces:
1. Token embeddings
○ Text is split by WordPiece (subword) tokenizer.
Example: “unaffordable” → ["un", "##affordable"] (exact splits vary).
○ BERT adds special tokens:
■ [CLS] at the very start (sentence-level summary token).
■ [SEP] as a separator/end marker (between sentence A and B, and at the
end).
For a pair input (e.g., NLI/QA):
[CLS] I liked the movie [SEP] It was funny [SEP]
2. Segment (token-type) embeddings
○ Marks which sentence a token belongs to.
○ By convention: 0 for sentence A, 1 for sentence B.
○ For single-sentence tasks, everything (except [CLS]/[SEP]) is type 0.
○ Purpose: lets the model know the boundary and do “A vs B” reasoning.
3. Positional embeddings
○ Absolute position indices 0…N−1 so the model knows token order.
○ In BERT, positions do not reset for sentence B; they continue counting after A.
Final embedding per position =
TokenEmbed[i] + SegmentEmbed[type(i)] + PositionEmbed[pos(i)]
Then LayerNorm + Dropout → into the first encoder layer.
Also provided at runtime:
● Attention mask to ignore [PAD] tokens when batching sequences of different lengths.
● (No causal mask) because BERT is bidirectional.
3) What comes out & how it’s used
● Per-token outputs T₁…Tₙ: contextual vectors (e.g., 768-dim for base).
● [CLS] output: treated as a summary for the whole sequence.
● Typical heads:
○ Classification (sentiment, topic, NLI): a small dense+softmax on top of [CLS].
○ Token labeling (NER, POS): a classifier on each token output.
○ Extractive QA: two classifiers over tokens for start and end positions.
4) Concrete mini-examples
A) Sentiment (single sentence)
Input IDs: [CLS] i loved this movie [SEP]
Seg. IDs: 0 0 0 0 0 0
Pos. IDs: 0 1 2 3 4 5
→ Take final hidden state of [CLS] → 2-way softmax (pos/neg).
B) Natural Language Inference (pair)
[CLS] A man is cooking [SEP] A person is making food [SEP]
Segments: “A man is cooking [SEP]” → A (type 0); “A person is making food [SEP]” → B (type 1)
→ [CLS] → 3-way softmax (entail/contradict/neutral).
Segment IDs (0 for left, 1 for right) let BERT reason “A vs B”.
C) QA (extractive)
[CLS] question tokens [SEP] passage tokens [SEP]
→ Predict start/end indices within the passage span.
5) Why these three embeddings matter
(intuition)
● Token: what word/subword is it?
● Position: where is it in the sequence?
● Segment: which sentence (A or B) does it belong to?
Summing them gives the model explicit signals for identity, order, and sentence
grouping—everything self-attention needs to disambiguate meaning (e.g., which “bank”) using
both left and right context.
6) Practical notes you can mention in a
viva
● Vocabulary ≈ 30k WordPiece tokens; special tokens include [CLS], [SEP], [PAD],
[MASK], [UNK].
● Dropout ~0.1, layer norm everywhere, GELU activation in FFN.
● Padding is masked out in attention; no future mask (unlike GPT).
● Pooling: original BERT adds a dense+tanh “pooler” on the final [CLS]; many tasks just
use the raw [CLS] hidden state.
7) Quick comparison call-outs (ties back
to earlier slides)
● ELMo: two RNN LMs (L→R and R→L) combined; not Transformer, not jointly
bidirectional inside a single layer.
● GPT: Transformer decoder, left-to-right only; great for generation.
● BERT: Transformer encoder, true bidirectional via masked LM; great for
understanding.
That’s the full story behind the figure and the input representation: sum of token + segment +
positional embeddings → stacked encoder layers with bidirectional self-attention → task
heads.
Here’s exactly what that Input Representation slide is doing, in plain words.
What you’re seeing
BERT builds one vector per token position by summing three embeddings of the same size
(e.g., 768 for BERT-base):
final_vector[i] = TokenEmbedding[i] + SegmentEmbedding[i] + PositionEmbedding[i]
Those per-token vectors go into the first Transformer encoder layer.
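A minimal PyTorch sketch of that sum. Dimensions follow BERT-base, but the embedding tables and token ids here are randomly initialized stand-ins for illustration, not pretrained weights.

```python
import torch
import torch.nn as nn

vocab_size, max_pos, n_types, hidden = 30522, 512, 2, 768  # BERT-base sizes

tok_emb = nn.Embedding(vocab_size, hidden)   # WordPiece id -> vector
seg_emb = nn.Embedding(n_types, hidden)      # segment/token-type id -> vector
pos_emb = nn.Embedding(max_pos, hidden)      # absolute position -> vector
norm, drop = nn.LayerNorm(hidden), nn.Dropout(0.1)

# "[CLS] my dog is cute [SEP]"  (illustrative ids)
input_ids = torch.tensor([[101, 2026, 3899, 2003, 10140, 102]])
token_type_ids = torch.zeros_like(input_ids)               # all segment A
positions = torch.arange(input_ids.size(1)).unsqueeze(0)   # 0..5, no reset

x = tok_emb(input_ids) + seg_emb(token_type_ids) + pos_emb(positions)
x = drop(norm(x))       # -> [1, 6, 768], fed into the first encoder layer
print(x.shape)
```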
1) Tokenization + special tokens
● Text is split into WordPiece subwords from a ~30k vocab.
Example: playing → ["play", "##ing"] (the ## means “continuation of a word”).
● BERT inserts special tokens:
○ [CLS] at the very start (its output is used as a sentence-level summary for
classification).
○ [SEP] to separate sentences and to mark the end.
In the slide:
[CLS] my dog is cute [SEP] he likes play ##ing [SEP]
This is a pair input: “Sentence A” = my dog is cute, “Sentence B” = he likes playing.
The Token Embeddings row shows the lookup vector for each token/subword (e.g., E_my,
E_dog, …, E_##ing, etc.).
2) Segment (token-type) embeddings
● Tell BERT which sentence each token belongs to.
● Original BERT has two types: A (id=0) and B (id=1).
○ All tokens from the first segment (including its [SEP]) get EA.
○ All tokens from the second segment (including its [SEP]) get EB.
● For single-sentence tasks, everything gets EA (token-type id 0).
Purpose: helps the model learn relationships between the two sentences (useful for NLI, QA
with question+passage, etc.).
3) Positional embeddings
● Tell BERT where each token is: positions 0,1,2,… across the whole sequence.
(Positions do not reset at [SEP]; they continue counting.)
● These are learned vectors E0, E1, … that BERT adds so self-attention knows order.
4) Why “each token is sum of three embeddings”?
Self-attention alone has no built-in sense of what token, which sentence, or which position.
Summing:
● Token = identity/meaning,
● Segment = sentence membership,
● Position = order,
gives the encoder everything it needs to reason bidirectionally.
5) “Single sequence is much more efficient”
Attention cost grows quadratically with length. If you don’t need two segments:
● Use one segment (all token-type ids = 0).
● Fewer tokens ⇒ less memory/compute, faster training/inference, less truncation.
Use pairs only when the task truly benefits (e.g., question+passage, premise+hypothesis).
6) Shapes you can quote in a viva
For a batch of size B and sequence length L (after padding/truncation):
● input_ids: [B, L] (WordPiece ids incl. [CLS], [SEP], [PAD])
● token_type_ids (segment ids): [B, L] (0 for A, 1 for B)
● attention_mask: [B, L] (1 = real token, 0 = pad)
● After the encoder:
○ last_hidden_state: [B, L, H] (H=768 base, 1024 large)
○ pooler_output (optional): [B, H] (tanh-transformed [CLS])
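A short sketch to verify those shapes with the transformers library (the checkpoint name and example texts are just placeholders):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["I loved this movie", "He likes playing"],
                  padding=True, return_tensors="pt")

print(batch["input_ids"].shape)       # [B, L]
print(batch["token_type_ids"].shape)  # [B, L] (all 0s for single sentences)
print(batch["attention_mask"].shape)  # [B, L] (0 marks [PAD] positions)

with torch.no_grad():
    out = model(**batch)
print(out.last_hidden_state.shape)    # [B, L, 768]
print(out.pooler_output.shape)        # [B, 768] (tanh-pooled [CLS])
```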
7) Tiny end-to-end example (classification)
Text: “I loved this movie”
Input: [CLS] I loved this movie [SEP]
Token ids: … … … … … …
Token types: 0 0 0 0 0 0
Positions: 0 1 2 3 4 5
Embed sum → Transformer encoders → take [CLS] → softmax (pos/neg)
That’s the slide: WordPiece tokens + [CLS]/[SEP] → add segment ids → add positions →
sum → encode.
Here’s the Masked Language Modeling (MLM) objective from the slides, explained in the
viva-friendly “why → what → how → examples → tricks → limits” flow.
Why do we need MLM?
● We want a model that understands a word using both left and right context (true
bidirectionality).
● A normal left-to-right LM can’t look right; if we let it, the word could “see itself,” so the
probability training breaks.
● Idea: hide (mask) a few words and ask the model to guess them from surrounding
context. Now it can safely look both ways because the true token is hidden.
What is MLM (in one line)?
Randomly choose k% (~15%) of tokens in the input, and train the model to predict the original
tokens at those positions using the full surrounding context.
How it’s done (mechanics)
1. Choose positions to predict: e.g., 15% of token positions (after WordPiece
tokenization). Do not select special tokens like [CLS], [SEP], [PAD].
2. Corrupt the chosen tokens using the 80/10/10 rule:
○ 80% of selected positions → replace token with [MASK]
went to the store → went to the [MASK]
○ 10% → replace with a random token
went to the store → went to the running
○ 10% → keep the original token unchanged
went to the store → went to the store
3. Compute loss only at the selected positions (even when kept the same):
\mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log p_\theta(x_i \mid x_{\setminus M})
where M is the set of chosen (masked) positions.
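A simplified sketch of that corruption procedure in plain PyTorch. Whole-word masking and other refinements are omitted, and the id constants in the usage example are placeholders rather than guaranteed vocabulary ids.

```python
import torch

def mlm_corrupt(input_ids, vocab_size, mask_id, special_ids, p=0.15):
    """Return (corrupted_ids, labels); labels are -100 except at chosen positions."""
    ids = input_ids.clone()
    labels = torch.full_like(ids, -100)          # -100 = ignored by the loss

    # 1) Choose ~15% of non-special positions as prediction targets.
    candidates = ~torch.isin(ids, torch.tensor(special_ids))
    chosen = (torch.rand(ids.shape) < p) & candidates
    labels[chosen] = ids[chosen]

    # 2) 80% -> [MASK], 10% -> random token, 10% -> keep unchanged.
    r = torch.rand(ids.shape)
    ids[chosen & (r < 0.8)] = mask_id
    random_tok = chosen & (r >= 0.8) & (r < 0.9)
    ids[random_tok] = torch.randint(vocab_size, ids.shape)[random_tok]
    # Remaining ~10%: left as-is, but still predicted via `labels`.
    return ids, labels

# Usage with illustrative ids: 101=[CLS], 102=[SEP], 0=[PAD], 103=[MASK]
ids = torch.tensor([[101, 1996, 2158, 2253, 2000, 1996, 3573, 102]])
corrupted, labels = mlm_corrupt(ids, vocab_size=30522, mask_id=103,
                                special_ids=[101, 102, 0])
print(corrupted, labels)
```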
Why the 80/10/10 trick?
If we replaced with [MASK] 100% of the time, the model would see lots of [MASK] during
pretraining but never during fine-tuning/inference → distribution mismatch. Mixing in random
and unchanged keeps inputs realistic and prevents the model from overfitting to the presence of
[MASK].
Choosing k (=15%)
● Too little masking (e.g., 1–5%): very few supervised targets per sequence ⇒ weak
learning signal and expensive training.
● Too much masking (e.g., 40–50%): the remaining context is too damaged ⇒ not
enough context to infer meaning, optimization gets harder.
● Around 15% is a good balance: enough targets per batch without destroying context.
Intuition with an example
“the man went to the [MASK] to buy a [MASK] of milk”
Using both sides, BERT can predict store and gallon.
This forces the model to encode semantics like places you go to buy things and
units used with milk, learned directly from context, not from rules.
What does MLM buy us in practice?
● Bidirectional understanding of each token (solves “bank” ambiguity by using both
neighbors).
● Data-efficient transfer: after pretraining, you add a tiny head and fine-tune with
relatively little labeled data for classification, NER, QA, etc.
Practical notes you can mention
● WordPiece vocab (~30k); masking happens after tokenization (can mask subwords like
##ing).
● Loss is only on the selected positions; other tokens just provide context.
● Use attention masks to ignore padding; no causal mask (BERT is bidirectional).
Known limitations (and how people
improved it)
1. [MASK] mismatch still exists (even with 80/10/10).
○ Fixes: RoBERTa (no NSP, dynamic masking, more data); ELECTRA (different
objective: detect replaced tokens instead of predicting masked ones) is more
sample-efficient.
2. Not generative / no proper sequence likelihood.
○ Fixes: use GPT for generation, or denoising seq2seq models (BART, T5) when
you need to produce text.
3. Quadratic attention cost & input length (~512).
○ Fixes: Longformer, BigBird, chunking + retrieval.
4. Domain shift (medical/legal/code).
○ Fixes: continue pretraining on in-domain text (DAPT/TAPT) or use domain
variants (BioBERT, Legal-BERT).
One-sentence viva takeaway
MLM trains BERT by hiding ~15% of tokens and asking the encoder to recover them from
both left and right context (using the 80/10/10 corruption rule), giving us powerful
bidirectional representations for downstream understanding tasks.
Here’s the Next Sentence Prediction (NSP) objective in the same “why → what → how →
examples → when it helps → limits & fixes” flow.
Why was NSP added?
BERT isn’t just about single sentences; many real tasks are pairwise:
● QA: Question ↔ Passage
● NLI: Premise ↔ Hypothesis
● Retrieval/Reranking: Query ↔ Document
The authors wanted BERT to learn relationships across two sentences/segments
during pre-training, not only within a single sentence via MLM.
What is NSP (one line)?
Given two text segments A and B, predict whether B is the actual next segment that follows A
in the corpus (IsNext) or just a random segment (NotNext).
How it’s set up (mechanics)
● Build inputs as:
[CLS] A [SEP] B [SEP]
● Provide segment IDs: tokens from A → 0, from B → 1.
● Attach a small binary classifier on the [CLS] output.
● Sampling: For each training pair:
○ 50% IsNext: B is the true continuation of A.
○ 50% NotNext: B is sampled randomly from elsewhere in the corpus.
● Loss during pre-training:
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}
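A toy sketch of the 50/50 pair sampling and the combined loss (plain Python; the corpus lists and the 0/1 label convention here are made up for illustration):

```python
import random

def make_nsp_pair(doc_sentences, all_sentences):
    """Return (sentence_a, sentence_b, label); label 0 = IsNext, 1 = NotNext."""
    i = random.randrange(len(doc_sentences) - 1)
    a = doc_sentences[i]
    if random.random() < 0.5:
        return a, doc_sentences[i + 1], 0       # true continuation (IsNext)
    return a, random.choice(all_sentences), 1   # random sentence (NotNext)

doc = ["The man went to the store.", "He bought a gallon of milk."]
corpus = doc + ["Penguins are flightless."]
print(make_nsp_pair(doc, corpus))

# Pre-training then sums the two objectives:
# total_loss = mlm_loss + nsp_loss   (nsp_loss = cross-entropy on the [CLS] logits)
```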
Example (like your slide)
● IsNext
A: “The man went to the store.”
B: “He bought a gallon of milk.”
● NotNext
A: “The man went to the store.”
B: “Penguins are flightless.”
BERT learns to use the [CLS] representation to judge cross-sentence coherence/relatedness.
Where NSP is supposed to help
● Gives the model a pair-level signal, which should transfer to:
○ Natural Language Inference (entailment/contradiction)
○ QA / Passage selection (is this passage relevant to the question?)
○ Sentence/Document pair classification tasks
Important limitations (what later papers
found)
1. Too easy/biased negatives. Random B is often obviously unrelated (topic mismatch);
the model may learn topic detection rather than true discourse order.
2. Mixed empirical value. RoBERTa removed NSP entirely (kept only MLM with more
data & dynamic masking) and got better results on many benchmarks.
3. Coherence vs. adjacency. “Next” ≠ “coherent”; Wikipedia paragraph boundaries aren’t
perfect, so labels can be noisy.
Better alternatives used later
● SOP (Sentence Order Prediction, ALBERT): take two consecutive segments and
swap order half the time; predict if order is correct. Harder than random negatives,
focuses on discourse order.
● Hard negatives / in-batch negatives: choose B that is topically similar but wrong
(harder).
● Contrastive objectives (e.g., SimCSE, DeCLUTR): pull true pairs together in
embedding space, push others apart.
● Drop NSP (RoBERTa) and rely on stronger MLM + more data.
● Different pretraining altogether (ELECTRA’s replaced-token detection, BART/T5
denoising seq2seq) depending on downstream goals.
Viva takeaway (one sentence)
NSP trains BERT to judge whether B follows A, giving a coarse cross-sentence signal during
pre-training; it’s simple (binary classification on [CLS]), but later work showed it can be
unnecessary or suboptimal, with SOP, hard/contrastive negatives, or just stronger MLM
often performing better.
Here’s Next Sentence Prediction (NSP) in a crisp, viva-ready way.
What NSP is
During BERT pre-training, the model sometimes sees two segments:
[CLS] A-sentence tokens [SEP] B-sentence tokens [SEP]
It must predict (from the [CLS] vector) whether B really follows A in the corpus (IsNext) or is a
random sentence (NotNext). This is a binary classification trained together with MLM:
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}
Why it was added
Many downstream tasks are pairwise: question↔passage, premise↔hypothesis,
query↔document. NSP gives BERT an explicit signal about cross-sentence relations during
pre-training instead of learning only within-sentence context.
How it’s built (mechanics)
● Input building: Segment A gets token-type id 0, segment B gets 1; positions continue
across A→B.
● Sampling: For each pair, with 50% probability B is the true continuation (IsNext); with
50%, B is a randomly sampled sentence (NotNext).
● Head: A small dense layer on top of [CLS] → softmax( IsNext / NotNext ).
● Training: Standard cross-entropy on this label, added to the MLM loss.
Tiny example
● IsNext:
A: “The man went to the store.”
B: “He bought a gallon of milk.”
● NotNext:
A: “The man went to the store.”
B: “Penguins are flightless.”
The model must encode in [CLS] whether B is a coherent continuation of A.
What it helps (intended)
● Gives BERT a prior for coherence and A↔B relevance, which should help:
○ NLI (entail/contradict/neutral),
○ Passage selection / reranking for QA,
○ Sentence-pair classification tasks.
Main limitations (what later work found)
1. Easy negatives: Random B is usually off-topic → the task reduces to topic match, not
true discourse order.
2. Mixed gains: Empirically, removing NSP and training longer on more data (RoBERTa)
often improves scores.
3. Label noise: Corpus boundaries aren’t perfect; “next” isn’t always the semantically right
continuation.
Better alternatives / fixes
● SOP (Sentence Order Prediction, ALBERT): use two consecutive segments and swap
their order half the time; predict if order is correct (harder than random negatives).
● Hard negatives / in-batch negatives: choose B that is semantically similar but
wrong.
● Contrastive pair objectives (SimCSE/InfoNCE-style) to learn pair similarity.
● Drop NSP entirely (RoBERTa) and rely on stronger MLM with more data.
When would you still use something
NSP-like?
● Your downstream task is pair-matching/ranking and you can generate hard negatives
(e.g., top-k BM25 passages that are not the answer).
● You want a pair-aware pre-train for small data regimes.
One-line takeaway
NSP trains BERT’s [CLS] to judge if B follows A, injecting cross-sentence knowledge;
it’s simple but often too easy—SOP, hard/contrastive negatives, or just stronger MLM
usually work better.
Let’s read that “Model Architecture” slide line-by-line and turn every bullet into an intuition you
can say in a viva.
Transformer encoder
BERT is encoder-only. One encoder layer =
(a) Multi-Head Self-Attention → Add & LayerNorm →
(b) Position-wise Feed-Forward → Add & LayerNorm.
Data used for pre-training
● Wikipedia (≈2.5B words) + BookCorpus (≈800M words).
No labels—BERT learns with Masked-LM (+ NSP).
Why these corpora? They’re large, clean English, general-domain: good prior knowledge for
transfer.
Training time: 1M steps (~40 epochs)
● BERT was trained for 1,000,000 update steps.
● Trick for efficiency: most steps with shorter sequences (128), last steps with 512 tokens
to expose the model to long contexts.
“~40 epochs” roughly means the model passes over the (chunked, randomly masked)
corpus about 40 times.
Optimizer and schedule
● AdamW (Adam with decoupled weight decay)—better generalization than L2 inside
Adam.
Typical hyperparameters from the paper:
LR 1e-4, β₁=0.9, β₂=0.999, ε=1e-6, weight decay 0.01, dropout 0.1, warmup for the
first ~10k steps then linear decay of LR to 0.
Why warmup + linear decay? Warmup stabilizes the start; linear decay avoids overshooting
later.
Model sizes you should remember
● BERT-Base: 12 layers, hidden 768, 12 heads → ~110M params.
● BERT-Large: 24 layers, hidden 1024, 16 heads → ~340M params.
Per-token output shape after the encoder is [batch, seq_len, hidden] (e.g., [B,
L, 768] for Base).
The [CLS] vector (the first position) is used for sentence-level classification.
Compute used in the paper
“4×4 or 8×8 TPU slice for ~4 days” ≈ 16 TPUs (Base) or 64 TPUs (Large). That’s the scale
needed to pre-train from scratch.
Fine-tuning on your task is much cheaper (minutes–hours on a single GPU).
How the little diagrams map to numbers
● Bottom right (Q, K, V → Scaled Dot-Product Attention): shows one head. BERT
projects the hidden state to Q/K/V, does attention, concatenates all heads, then a linear
layer.
● Top right (Add & Norm boxes): the residual connections + LayerNorm after attention
and after the feed-forward block. This stabilizes very deep stacks (12/24 layers).
● “Positional Encoding” circle: in BERT these are learned positional embeddings
(added to token + segment embeddings before layer 1).
A tiny numeric walk-through (BERT-Base)
Say batch B=8, sequence length L=128:
1. Input embeddings: shape [8,128,768] (token + segment + position sum).
2. In one attention head: Q/K/V each [8,128,64] (768 / 12 heads = 64 per head); attention
weights are [8,128,128] per head, i.e., [8,12,128,128] across all 12 heads.
3. After 12 stacked layers → output [8,128,768].
Use output at [CLS] (position 0) for classification, or all positions for NER/QA.
One-sentence viva summary
BERT is a 12/24-layer Transformer encoder trained on Wikipedia + BookCorpus with
AdamW (LR 1e-4, warmup→linear decay), using multi-head bidirectional self-attention
(12/16 heads, hidden 768/1024), pre-trained for ~1M steps (mostly short sequences, some
long), and then fine-tuned with small task heads.
Awesome—this slide is the “how to use BERT” slide. It shows the 4 common fine-tuning
patterns. In all cases you:
1. Tokenize with WordPiece and build inputs:
[CLS] ... [SEP] (optionally a second segment ending in another [SEP]) + token-type ids
(0 for segment A, 1 for segment B) + attention mask.
2. Feed through BERT (all layers).
3. Add a tiny task head and train end-to-end (usually 2–4 epochs).
(a) Sentence-pair classification
(MNLI/QQP/QNLI/STS-B/MRPC/RTE/SWAG)
Input: [CLS] sentence1 [SEP] sentence2 [SEP]
Token types: s1→0, s2→1
Head: a small classifier on top of [CLS] → label (e.g., entail/contradict/neutral, duplicate/not,
similarity score).
Loss: Cross-entropy (or MSE for STS-B).
Why pairs? Tasks where “A relates to B” (natural language inference, paraphrase,
question-premise).
(b) Single-sentence classification (SST-2, CoLA)
Input: [CLS] sentence [SEP] (single segment)
Head: classifier on [CLS] → label (sentiment, acceptability).
Loss: Cross-entropy.
Tip: You can also pool all token states (mean/max), but [CLS] is the default.
(c) Extractive Question Answering (SQuAD)
Input: [CLS] question [SEP] passage [SEP]
Head: two token-wise classifiers over the sequence → start logits and end logits for each
token.
Prediction: choose the best (start, end) span in the passage (end ≥ start, length cap).
Loss: sum of cross-entropies for gold start and end positions.
Notes: If the passage is long (>512), slide a window over it and pick the best span across
windows.
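A small sketch of picking the best (start, end) span from those start/end logits (plain PyTorch; a real pipeline would also restrict candidates to passage tokens and map subword indices back to text):

```python
import torch

def best_span(start_logits, end_logits, max_answer_len=30):
    """Pick (start, end) maximizing start_logits[s] + end_logits[e], with e >= s."""
    best, best_score = (0, 0), float("-inf")
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_answer_len, len(start_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Toy logits for a 6-token sequence (made up, not real model output).
start = torch.tensor([0.1, 2.5, 0.3, 0.0, 0.2, 0.1])
end   = torch.tensor([0.0, 0.1, 0.2, 3.0, 0.4, 0.1])
print(best_span(start, end))   # (1, 3): the answer spans tokens 1..3
```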
(d) Token-level tagging (NER: CoNLL-2003)
Input: [CLS] sentence [SEP]
Head: a classifier at each token → BIO labels (B-PER, I-ORG, O, …).
Loss: token-wise cross-entropy (mask out padding).
Subword labels: common choices are (i) label the first subword only and ignore the rest, or (ii)
copy the label to all subwords.
Upgrade (optional): add a CRF on top to enforce valid BIO transitions.
Hyperparameters that usually “just work”
● Batch size: 16–32 (adjust for memory).
● LR (AdamW): 2e-5, 3e-5, or 5e-5; weight decay 0.01; warmup ~10% steps; linear
decay.
● Epochs: 2–4 (watch validation; early stop).
● Max length: 128–512; truncate or use sliding windows for long texts.
● Gradient clip: e.g., 1.0 for stability.
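A hedged sketch of that optimizer/scheduler setup, assuming the transformers helper get_linear_schedule_with_warmup; the step count and the stand-in model are placeholders so the snippet runs on its own.

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Stand-in for your fine-tuning model (e.g., a BertForSequenceClassification).
model = torch.nn.Linear(768, 2)

num_training_steps = 1000   # epochs * batches_per_epoch for your dataset
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),   # ~10% warmup
    num_training_steps=num_training_steps)            # then linear decay to 0

# Inside the training loop, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clip
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```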
Common pitfalls & fixes
● 512-token limit: use sliding windows, retrieval to shorten context, or long-context
models (Longformer/BigBird) if needed.
● Imbalanced classes: use class-weighted loss or focal loss.
● Label/subword mismatch (NER): be explicit about how you align labels.
● Small data: try parameter-efficient tuning (Adapters/LoRA), freeze lower layers, or
augment data.
● Overfitting: monitor validation, use dropout 0.1, early stopping.
Metrics to report
● Classification: accuracy / F1 (macro if imbalanced).
● NER: token-level or entity-level F1.
● QA: Exact Match and F1 over answer tokens.
● STS-B: Pearson/Spearman correlation.
That’s the whole picture: build the input with [CLS]/[SEP], run BERT, attach the right head
([CLS] for sentence tasks, span heads for QA, per-token head for tagging), and fine-tune
everything jointly.
Here’s how to read that GLUE results slide and what it means for BERT.
What is GLUE?
GLUE is a bundle of diverse language understanding tasks. It checks if a model can do many
things with little task-specific change. (BERT just adds a tiny head and fine-tunes.)
Columns = tasks (sizes shown under each name):
● MNLI (m/mm) – multi-genre natural language inference; scores are accuracy on
matched / mismatched test sets.
● QQP – Quora duplicate-question pairs; GLUE reports F1 (older GLUE tables often show
a single F1 number).
● QNLI – question vs sentence (is the sentence the answer?) – accuracy.
● SST-2 – movie review sentiment – accuracy.
● CoLA – grammatical acceptability – Matthews corr. (MCC).
● STS-B – sentence similarity – average of Pearson & Spearman correlations (×100).
● MRPC – paraphrase detection – GLUE uses the average of F1 and accuracy.
● RTE – textual entailment – accuracy.
(WNLI is excluded here because it’s tiny/noisy in the original leaderboard.)
Rows (systems being compared)
● Pre-OpenAI SOTA – the best published systems before GPT.
● BiLSTM+ELMo+Attn – strong pre-Transformer baseline.
● OpenAI GPT – first big Transformer decoder (left-to-right).
● BERT-Base / BERT-Large – Transformer encoder (bidirectional, MLM+NSP).
What the numbers tell you
● Higher is better in every column.
● BERT-Large is bold across most tasks and has the best Average (81.9), beating GPT’s
75.2 by a wide margin.
● MNLI: BERT-Large 86.7/85.9 > GPT 82.1/81.4 — big jump on a large, hard NLI task.
● Low-data tasks (RTE 70.1, CoLA 60.5) show especially large gains, showing how
pre-training helps when labeled data is scarce.
● SST-2 (94.9) and QNLI (91.1) are also strong; STS-B (86.5) shows BERT’s semantic
similarity is solid.
● Average is a simple mean of task metrics (not all metrics are identical, so it’s an
indicative—but not perfect—summary).
Why BERT wins
1. Bidirectional self-attention (uses left + right context) → better token understanding
(“bank” the place vs river).
2. Pre-training at scale (Wikipedia + BookCorpus) with Masked LM (+ NSP) → strong
general language knowledge.
3. Minimal task heads + end-to-end fine-tuning → adapts quickly to each dataset.
How to use this in a viva (talking points)
● “GLUE evaluates broad understanding; BERT-Large set a new SOTA by 6–7 points over
GPT on the average, with notable gains on NLI (MNLI/QNLI) and low-resource tasks
(RTE/CoLA). The key reasons are bidirectional context and effective pre-training
objectives (MLM, NSP). Metrics differ per task (Acc, F1, MCC, correlations), but BERT
improves most of them with the same architecture plus small heads.”
Here’s what that Conclusions slide is getting at, unpacked and made viva-ready.
1) “Empirical results are great, but the biggest impact…”
What changed the field wasn’t just new SOTA numbers—it was the recipe.
BERT proved that you can:
1. Pre-train one generic model on huge unlabeled text (MLM+NSP), then
2. Fine-tune the same backbone with tiny task-specific heads.
This replaces dozens of bespoke architectures (custom CNN/LSTM + attention per task) with
one backbone + a small head. It gives:
● Label efficiency: far fewer labeled examples needed.
● Uniform workflow: same code path for sentiment, NLI, NER, QA, etc.
● Better transfer: knowledge learned on Wikipedia/Books helps everywhere.
Real-world effect: teams ship NLP features faster by fine-tuning BERT(-like) checkpoints instead
of re-inventing models for every task.
2) “With pre-training, bigger == better (so far)”
Once you adopt the pre-train→fine-tune recipe, scaling the model and/or data almost
monotonically improves quality:
● More layers/hidden size/heads → better representations.
● More pre-training tokens/steps → better generalization.
● Better optimizers/schedules → unlock scale (e.g., AdamW + warmup + decay).
This observation foreshadowed today’s scaling laws: performance rises predictably with
parameters, data, and compute. It’s why BERT-Large beat BERT-Base, and later
RoBERTa-Large/ELECTRA-Large beat earlier models.
Caveats to mention: returns are diminishing per dollar; you hit latency/memory/energy limits;
data quality matters as much as quantity.
3) “Unclear if adding things on top of BERT really helps a
lot”
After BERT, many tweaks tried to bolt on extra modules (task-specific attention, multi-task
towers, fancy pooling). In practice:
● Simple wins: a linear head on [CLS] (or span heads for QA, token head for NER) +
end-to-end fine-tuning already gives most of the gain.
● Bigger impact came from training, not gadgets:
○ RoBERTa: removed NSP, used more data, dynamic masking, longer training →
strong boost.
○ ALBERT/DistilBERT/MiniLM: parameter sharing or distillation → efficiency
gains.
○ ELECTRA: better pre-training objective (replaced-token detection) → sample
efficiency gains.
Implication: If you’re building systems, choose a good checkpoint and focus on data,
objective, and scale before inventing complex heads.
What this means for different audiences
For practitioners (product teams):
● Start from a reputable checkpoint (e.g., BERT/RoBERTa/ELECTRA for understanding).
● Pick a size that fits latency/memory constraints; use DistilBERT/MiniLM for
mobile/edge.
● Get more from data (cleaning, domain-adaptive pre-training) and PEFT (Adapters/LoRA)
than from exotic architectures.
● Measure and manage cost/latency; consider quantization, pruning, or distillation.
For researchers:
● The “bigger is better” trend raises the compute barrier; impactful work often targets:
○ Efficiency: sparse attention, better optimizers, distillation, quantization.
○ Objectives & data: improved pre-training tasks, high-quality or domain data.
○ Capabilities gaps: long-context reasoning, faithfulness, factuality, robustness,
safety.
Balanced view (good to say in a viva)
● Pros: unified pipeline, label efficiency, strong baselines across tasks, predictable gains
from scale.
● Cons/risks: high compute/energy, latency and memory at inference, bias & privacy
issues from web pre-training, and diminishing returns without better data.
One-liner:
BERT’s lasting contribution isn’t just better scores—it established the foundation-model
paradigm: pre-train big, fine-tune small. Since then, scaling and better pre-training (more data,
better objectives) have mattered more than adding fancy task-specific modules.