Word representations → Language modeling → Attention/Transformers → BERT (encoder) →
GPT (decoder).
I’ll do each in the 7 checkpoints you asked for.
1) Word Representations (one-hot, TF-IDF →
word2vec/GloVe → contextual vectors)
1. Why it arose
Computers need numbers, not words. We needed a way to turn text into vectors for any
ML model.
2. Without it, problem
No numeric input ⇒ can’t train. Early “bag-of-words” lost word order and meaning;
vectors were huge and sparse.
3. What it is
Techniques to map words to vectors:
● One-hot/TF-IDF (sparse counts)
● Dense embeddings (word2vec/GloVe): small, real-valued vectors capturing similarity.
4. What problems it solved (key concepts)
● Dense vectors capture distributional semantics (“you shall know a word by the
company it keeps”).
● word2vec: learn vectors by predicting context (CBOW/Skip-gram).
● GloVe: factorizes global co-occurrence statistics.
5. Idea/objective
Make words that appear in similar contexts have nearby vectors; reduce sparsity;
generalize better.
6. Applications
Every NLP task: classification, NER, search, recommendation, etc.
7. Limitations & how to overcome
● Static meaning: one vector per word can’t handle polysemy (“bank”).
Fix: contextual embeddings from Transformers (BERT/GPT) give a different vector per
usage.
● No long-range syntax in BoW/word2vec.
Fix: attention models that see full sentences.
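A toy sketch of the "nearby vectors" idea from points 3–5 above: cosine similarity between dense embeddings. The 4-dimensional vectors below are made up for illustration, not real word2vec/GloVe outputs.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity: close to 1.0 = similar direction, near 0.0 = unrelated.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 4-d embeddings (real word2vec/GloVe vectors are 100-300-d).
vec = {
    "king":  np.array([0.8, 0.6, 0.1, 0.0]),
    "queen": np.array([0.7, 0.7, 0.1, 0.1]),
    "apple": np.array([0.0, 0.1, 0.9, 0.8]),
}

print(cosine(vec["king"], vec["queen"]))  # high: words seen in similar contexts
print(cosine(vec["king"], vec["apple"]))  # low: different contexts
```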
2) Transformers & Self-Attention (the bridge)
1. Why it arose
RNNs/LSTMs struggled with long dependencies and were slow to parallelize.
2. Without it, problem
Vanishing gradients, sequential bottlenecks, limited context capture.
3. What it is
An architecture that uses self-attention instead of recurrence to mix information across
all positions.
4. Key concepts
● Q, K, V projections; Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V, where dividing by √d_k keeps the dot products in a stable range (a minimal sketch appears at the end of this section).
● Multi-head attention: several heads attend in different subspaces; positional encodings add word-order information.
5. Idea/objective
Let each token “look” at any other token and decide what matters—fast, parallel,
long-range.
6. Applications
Backbone for BERT, GPT, T5, vision transformers, speech, protein folding.
7. Limitations & how to overcome
● Quadratic cost in sequence length.
Fixes: sparse/linear attention (Longformer, Performer), chunking, recurrence,
state-space models.
● Context window finite.
Fixes: long-context variants, retrieval-augmented models.
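A minimal PyTorch sketch of the attention formula from point 4 above (a single head, no multi-head split and no masking); tensor shapes and names are illustrative, not a faithful Transformer implementation.

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Q, K, V: [batch, seq_len, d_k]
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # [batch, seq_len, seq_len]
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ V                              # [batch, seq_len, d_k]

# Toy example: batch of 1, sequence of 5 tokens, head dimension 64.
x = torch.randn(1, 5, 64)
out = attention(x, x, x)   # self-attention: Q = K = V come from the same tokens
print(out.shape)           # torch.Size([1, 5, 64])
```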
3) BERT (Bidirectional Encoder Representations from
Transformers)
1. Why it arose
Static embeddings missed context; we needed contextual vectors and better transfer
learning for understanding tasks.
2. Without it, problem
Models had to be trained from scratch per task; poor sentence-level understanding;
limited data efficiency.
3. What it is
A Transformer encoder trained bidirectionally with:
● Masked Language Modeling (MLM): randomly mask ~15% tokens; predict them using
both left & right context.
● (Original) Next Sentence Prediction (NSP): predict if sentence B follows sentence A.
4. What problems it solved (concepts)
● Produces contextual token embeddings (different vector for “bank” in “river bank” vs
“loan bank”).
● Pretrain → fine-tune paradigm: one big model adapted to many tasks.
5. Idea/objective
Learn deep language understanding from large corpora; transfer that knowledge to
downstream tasks with little data.
6. Applications
Classification, NER, QA (extractive), sentence similarity, entailment, search ranking.
Often: add a small head and fine-tune.
7. Limitations & how to overcome
● Not generative (MLM sees both sides; awkward for open-ended text).
Fix: use decoder models (GPT) or encoder-decoder (T5).
● Pretrain–finetune mismatch (NSP often adds little; [MASK] tokens never appear in downstream inputs).
Fix: RoBERTa (no NSP, more data, dynamic masking), DeBERTa (disentangled
attention).
● Sequence length & quadratic cost.
Fix: Longformer/BigBird, distillation (DistilBERT), parameter sharing (ALBERT).
4) GPT (Generative Pretrained Transformer)
1. Why it arose
We needed fluent generation and strong few-shot behavior with a single, unified
objective.
2. Without it, problem
Encoder-only models excel at understanding but are clunky for open-ended generation
and instruction following.
3. What it is
A Transformer decoder trained with causal (autoregressive) LM: predict next token
given previous tokens; mask future positions.
4. What problems it solved (concepts)
● Unified objective yields strong generalization; few/zero-shot via prompting.
● Natural fit for writing, dialogue, code, reasoning chains (with the right prompting).
5. Idea/objective
Model the distribution P(next token | prior tokens) so well that useful behavior emerges
when prompted.
6. Applications
Chatbots, content generation, code completion, summarization, translation, tool-use via
function calling, RAG.
7. Limitations & how to overcome
● Hallucinations & factuality.
Fix: Retrieval-Augmented Generation (RAG), grounding, tool calling, system
prompts/guardrails.
● Bias/toxicity from data.
Fix: dataset curation, RLHF/RLAIF, safety filters.
● Context limits & cost.
Fix: long-context models, compression, planning + external memory.
● Weak at structured extraction vs encoders in some cases.
Fix: fine-tune for extraction or combine with encoder models.
TL;DR mental map
● Word vectors → gave meaning but were static.
● Transformers → solved long-range/context + parallelism.
● BERT (encoder, MLM) → best for understanding; contextual embeddings for
classification/QA.
● GPT (decoder, causal LM) → best for generation; few-shot via prompts.
Here’s a clean, viva-friendly way to understand why BERT was invented, what it is, how it
works, what it’s good/bad at, and how it compares to earlier methods (ELMo/GPT). I’ll
keep the flow you asked for.
1) Why did this topic arise? (the pain)
Before BERT, strong language models were directional:
● Left→Right (e.g., GPT): each token sees only the past.
● Right←Left (or two separate LSTMs as in ELMo): one pass each way, then combine.
That worked for generation, but it’s sub-optimal for understanding. Natural language meaning
is often decided by both left and right neighbors:
“I went to the bank to deposit cash.” vs “I sat on the bank of the river.”
To decide which “bank,” you want to look both left and right at once. Unidirectional LMs can’t
fully do that.
2) What is BERT (in one line)?
BERT = Bidirectional Encoder Representations from Transformers.
It’s a Transformer encoder-only model pre-trained to fill in masked words (and originally, to
judge if two sentences belong together), so it learns deep, bidirectional context. Then you
fine-tune it on your task with a small head on top.
3) Key idea & objective (how it avoids the
“seeing itself” cheat)
BERT creates a Cloze test during pre-training:
● Randomly mask ~15% of tokens in the input.
● Train the model to predict the masked tokens from both left and right context.
● (Original BERT also used Next Sentence Prediction (NSP): given two sentences A,B,
predict if B follows A.)
4) What problem does BERT solve
(conceptually)?
● It learns contextual word meanings that depend on both sides of a word (true
bidirectionality).
● It provides powerful pre-trained representations you can adapt with little labeled data.
● It’s focused on understanding (classification, tagging, QA), not free-form generation.
5) How BERT works (simple mental model
+ a concrete example)
Architecture.
Just the Transformer encoder stack (multi-head self-attention + feed-forward, repeated). No
decoder.
Tokens & special symbols.
● Add [CLS] at the start (sentence-level summary token).
● Use [SEP] to separate sentence A and B (for pair tasks).
● Some tokens are replaced by [MASK] during pre-training.
Example (masked LM).
Input: “I went to the [MASK] to deposit cash.”
BERT attends to all tokens on both sides and predicts “bank”.
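A quick way to see this in practice, assuming the Hugging Face transformers library and the public bert-base-uncased checkpoint are available:

```python
from transformers import pipeline

# Masked-LM inference: BERT fills in [MASK] using both left and right context.
fill = pipeline("fill-mask", model="bert-base-uncased")

for pred in fill("I went to the [MASK] to deposit cash."):
    print(pred["token_str"], round(pred["score"], 3))
# Expect "bank" at or near the top of the ranked predictions.
```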
Fine-tuning heads.
● Text classification (sentiment, topic): put a small classifier on top of the final hidden
state of [CLS].
● Token labeling (NER, POS): put a token-wise classifier on each position.
● Extractive QA (SQuAD): add two classifiers to predict start and end indices of the
answer span.
6) Real-world applications (how it helps)
● Customer feedback sentiment: fine-tune BERT-base for POSITIVE/NEGATIVE; small
labeled set is enough because BERT already “knows” a lot of English.
● Entity extraction for KYC/compliance: tag names, locations, organizations from
documents.
● Search & semantic retrieval: encode queries and passages; use cosine similarity for
ranking (e.g., bi-encoder setups).
● Question answering on knowledge bases: pick answer spans from product manuals,
policies, FAQs.
● Intent detection & slot filling in chatbots.
● Document classification (legal, medical, support tickets).
Compared with traditional TF-IDF or word2vec, BERT’s embeddings change with context, so
polysemy (“bank”) and long-distance cues are handled far better.
7) How BERT relates to ELMo and GPT
(quick history anchor)
● ELMo (2018): two LSTMs (left→right and right←left) trained as language models;
combine their states. It’s contextual, but still uses RNNs and separate directions.
● GPT (2018): Transformer decoder, strictly left→right; great for generation and
zero-shot transfer, but not bidirectional.
● BERT (2018): Transformer encoder, bidirectional via masked LM; excellent for
understanding tasks.
A handy rule:
BERT = understand, GPT = generate.
(Modern models can blur this line, but it’s still a good intuition.)
8) Limitations of BERT (and practical ways
to overcome)
1. [MASK] mismatch: at fine-tune/test time, texts don’t contain [MASK], but BERT saw
many [MASK]s during pre-training.
Overcome: RoBERTa (no NSP, dynamic masking, more data) improves robustness;
ELECTRA trains a discriminator to detect replaced tokens instead of masking, further
reducing mismatch.
2. Not generative (can’t easily compute full sentence probabilities or write long text).
Overcome: use GPT/BART/T5 for generation or seq2seq tasks.
3. Input length ~512 tokens and quadratic attention cost.
Overcome: long-context encoder variants (Longformer, BigBird, RoPE-style rotary
with sliding windows) or chunking with retrieval.
4. Heavy to serve (multi-head attention over full sequence).
Overcome: model compression (DistilBERT), ALBERT (parameter sharing),
quantization, or use MiniLM, TinyBERT; add adapters/LoRA for light fine-tuning.
5. Domain shift (legal, medical, code).
Overcome: continue pre-training on in-domain text (DAPT/TAPT), or start from
domain models (BioBERT, Legal-BERT).
6. NSP objective often unhelpful.
Overcome: RoBERTa drops NSP; alternatives like Sentence Order Prediction
(ALBERT) can help when inter-sentence order matters.
9) Short, exam-ready comparison: BERT
vs GPT
● Directionality: BERT = bidirectional encoder (masked LM); GPT = left-to-right
decoder (causal LM).
● Best for: BERT = understanding (classification, NER, QA extractive); GPT =
generation (writing, dialogue, code).
● Objective: BERT = predict masked tokens (+/– NSP); GPT = predict next token.
● Outputs: BERT usually feeds a small task head; GPT directly generates tokens.
10) Minimal “how you’d use it” recipe
(HuggingFace, conceptually)
1. Load bert-base-uncased.
2. Tokenize text(s); prepend [CLS], append [SEP].
3. Add a tiny classifier (e.g., 2-way softmax) on top of the [CLS] vector.
4. Fine-tune for a few epochs—done.
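A hedged sketch of that recipe using the transformers Trainer API; the toy dataset, label count, and hyperparameters are placeholders for your own task, not a definitive setup.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

# Tiny toy dataset just to show the moving parts; substitute your labeled data.
raw = Dataset.from_dict({
    "text": ["I loved this movie", "Terrible, boring film",
             "Great acting and plot", "Worst thing I have watched"],
    "label": [1, 0, 1, 0],
})

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)   # tiny 2-way head on top of [CLS]

def tokenize(batch):
    # [CLS]/[SEP] are inserted by the tokenizer automatically.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=64)

ds = raw.map(tokenize, batched=True)

args = TrainingArguments(output_dir="bert-sentiment", num_train_epochs=3,
                         per_device_train_batch_size=2, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=ds).train()
```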
One-line takeaway
BERT brought true bidirectional context to pre-trained language understanding by training a
Transformer encoder to fill in masked words. That single idea made sentence representations
far richer, slashed labeled-data needs, and set new standards on classification, tagging, and
QA—while leaving open-ended generation to GPT-style models.
Great slide—let’s decode it piece by piece and tie it to how you actually use BERT.
1) What the architecture picture is saying
Left diagram (stack of “Trm” blocks):
● E₁ … Eₙ (yellow) are the input embeddings for each token position 1…N.
● Each oval Trm is a Transformer-encoder layer (multi-head self-attention +
feed-forward + residual + layer norm).
● The criss-cross arrows mean: in self-attention, every token can attend to every other
token (bidirectional). There’s no causal mask like GPT.
● After passing through L encoder layers, we get T₁ … Tₙ (green), the final contextual
vectors—one vector per input position. These are the features you give to tiny
task-specific heads.
Right diagram (BASE vs LARGE):
● BERT-base: 12 encoder layers, hidden size 768, 12 attention heads (~110M params).
● BERT-large: 24 encoder layers, hidden size 1024, 16 heads (~340M params).
● Deeper/wider → usually better but slower.
2) Input representation (what actually goes
in)
Before the first encoder layer, BERT builds an embedding for each position by summing three
pieces:
1. Token embeddings
○ Text is split by WordPiece (subword) tokenizer.
Example: “unaffordable” → ["un", "##affordable"] (exact splits vary).
○ BERT adds special tokens:
■ [CLS] at the very start (sentence-level summary token).
■ [SEP] as a separator/end marker (between sentence A and B, and at the
end).
For a pair input (e.g., NLI/QA):
[CLS] I liked the movie [SEP] It was funny [SEP]
2. Segment (token-type) embeddings
○ Marks which sentence a token belongs to.
○ By convention: 0 for sentence A, 1 for sentence B.
○ For single-sentence tasks, everything (except [CLS]/[SEP]) is type 0.
○ Purpose: lets the model know the boundary and do “A vs B” reasoning.
3. Positional embeddings
○ Absolute position indices 0…N−1 so the model knows token order.
○ In BERT, positions do not reset for sentence B; they continue counting after A.
Final embedding per position =
TokenEmbed[i] + SegmentEmbed[type(i)] + PositionEmbed[pos(i)]
Then LayerNorm + Dropout → into the first encoder layer.
Also provided at runtime:
● Attention mask to ignore [PAD] tokens when batching sequences of different lengths.
● (No causal mask) because BERT is bidirectional.
3) What comes out & how it’s used
● Per-token outputs T₁…Tₙ: contextual vectors (e.g., 768-dim for base).
● [CLS] output: treated as a summary for the whole sequence.
● Typical heads:
○ Classification (sentiment, topic, NLI): a small dense+softmax on top of [CLS].
○ Token labeling (NER, POS): a classifier on each token output.
○ Extractive QA: two classifiers over tokens for start and end positions.
4) Concrete mini-examples
A) Sentiment (single sentence)
Input IDs: [CLS] i loved this movie [SEP]
Seg. IDs: 0 0 0 0 0 0
Pos. IDs: 0 1 2 3 4 5
→ Take final hidden state of [CLS] → 2-way softmax (pos/neg).
B) Natural Language Inference (pair)
[CLS] A man is cooking [SEP] A person is making food [SEP]
Segments: “A man is cooking [SEP]” → A (type 0); “A person is making food [SEP]” → B (type 1)
→ [CLS] → 3-way softmax (entail/contradict/neutral).
Segment IDs (0 for left, 1 for right) let BERT reason “A vs B”.
C) QA (extractive)
[CLS] question tokens [SEP] passage tokens [SEP]
→ Predict start/end indices within the passage span.
5) Why these three embeddings matter
(intuition)
● Token: what word/subword is it?
● Position: where is it in the sequence?
● Segment: which sentence (A or B) does it belong to?
Summing them gives the model explicit signals for identity, order, and sentence
grouping—everything self-attention needs to disambiguate meaning (e.g., which “bank”) using
both left and right context.
6) Practical notes you can mention in a
viva
● Vocabulary ≈ 30k WordPiece tokens; special tokens include [CLS], [SEP], [PAD],
[MASK], [UNK].
● Dropout ~0.1, layer norm everywhere, GELU activation in FFN.
● Padding is masked out in attention; no future mask (unlike GPT).
● Pooling: original BERT adds a dense+tanh “pooler” on the final [CLS]; many tasks just
use the raw [CLS] hidden state.
7) Quick comparison call-outs (ties back
to earlier slides)
● ELMo: two RNN LMs (L→R and R→L) combined; not Transformer, not jointly
bidirectional inside a single layer.
● GPT: Transformer decoder, left-to-right only; great for generation.
● BERT: Transformer encoder, true bidirectional via masked LM; great for
understanding.
That’s the full story behind the figure and the input representation: sum of token + segment +
positional embeddings → stacked encoder layers with bidirectional self-attention → task
heads.
Here’s exactly what that Input Representation slide is doing, in plain words.
What you’re seeing
BERT builds one vector per token position by summing three embeddings of the same size
(e.g., 768 for BERT-base):
final_vector[i] = TokenEmbedding[i] + SegmentEmbedding[i] + PositionEmbedding[i]
Those per-token vectors go into the first Transformer encoder layer.
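A minimal PyTorch sketch of that sum. Dimensions follow BERT-base, but the embedding tables and token ids here are randomly initialized stand-ins for illustration, not pretrained weights.

```python
import torch
import torch.nn as nn

vocab_size, max_pos, n_types, hidden = 30522, 512, 2, 768  # BERT-base sizes

tok_emb = nn.Embedding(vocab_size, hidden)   # WordPiece id -> vector
seg_emb = nn.Embedding(n_types, hidden)      # segment/token-type id -> vector
pos_emb = nn.Embedding(max_pos, hidden)      # absolute position -> vector
norm, drop = nn.LayerNorm(hidden), nn.Dropout(0.1)

# "[CLS] my dog is cute [SEP]"  (illustrative ids)
input_ids = torch.tensor([[101, 2026, 3899, 2003, 10140, 102]])
token_type_ids = torch.zeros_like(input_ids)               # all segment A
positions = torch.arange(input_ids.size(1)).unsqueeze(0)   # 0..5, no reset

x = tok_emb(input_ids) + seg_emb(token_type_ids) + pos_emb(positions)
x = drop(norm(x))       # -> [1, 6, 768], fed into the first encoder layer
print(x.shape)
```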
1) Tokenization + special tokens
● Text is split into WordPiece subwords from a ~30k vocab.
Example: playing → ["play", "##ing"] (the ## means “continuation of a word”).
● BERT inserts special tokens:
○ [CLS] at the very start (its output is used as a sentence-level summary for
classification).
○ [SEP] to separate sentences and to mark the end.
In the slide:
[CLS] my dog is cute [SEP] he likes play ##ing [SEP]
This is a pair input: “Sentence A” = my dog is cute, “Sentence B” = he likes playing.
The Token Embeddings row shows the lookup vector for each token/subword (e.g., E_my,
E_dog, …, E_##ing, etc.).
2) Segment (token-type) embeddings
● Tell BERT which sentence each token belongs to.
● Original BERT has two types: A (id=0) and B (id=1).
○ All tokens from the first segment (including its [SEP]) get EA.
○ All tokens from the second segment (including its [SEP]) get EB.
● For single-sentence tasks, everything gets EA (token-type id 0).
Purpose: helps the model learn relationships between the two sentences (useful for NLI, QA
with question+passage, etc.).
3) Positional embeddings
● Tell BERT where each token is: positions 0,1,2,… across the whole sequence.
(Positions do not reset at [SEP]; they continue counting.)
● These are learned vectors E0, E1, … that BERT adds so self-attention knows order.
4) Why “each token is sum of three embeddings”?
Self-attention alone has no built-in sense of what token, which sentence, or which position.
Summing:
● Token = identity/meaning,
● Segment = sentence membership,
● Position = order,
gives the encoder everything it needs to reason bidirectionally.
5) “Single sequence is much more efficient”
Attention cost grows quadratically with length. If you don’t need two segments:
● Use one segment (all token-type ids = 0).
● Fewer tokens ⇒ less memory/compute, faster training/inference, less truncation.
Use pairs only when the task truly benefits (e.g., question+passage, premise+hypothesis).
6) Shapes you can quote in a viva
For a batch of size B and sequence length L (after padding/truncation):
● input_ids: [B, L] (WordPiece ids incl. [CLS], [SEP], [PAD])
● token_type_ids (segment ids): [B, L] (0 for A, 1 for B)
● attention_mask: [B, L] (1 = real token, 0 = pad)
● After the encoder:
○ last_hidden_state: [B, L, H] (H=768 base, 1024 large)
○ pooler_output (optional): [B, H] (tanh-transformed [CLS])
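A short sketch to verify those shapes with the transformers library (the checkpoint name and example texts are just placeholders):

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

batch = tokenizer(["I loved this movie", "He likes playing"],
                  padding=True, return_tensors="pt")

print(batch["input_ids"].shape)       # [B, L]
print(batch["token_type_ids"].shape)  # [B, L] (all 0s for single sentences)
print(batch["attention_mask"].shape)  # [B, L] (0 marks [PAD] positions)

with torch.no_grad():
    out = model(**batch)
print(out.last_hidden_state.shape)    # [B, L, 768]
print(out.pooler_output.shape)        # [B, 768] (tanh-pooled [CLS])
```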
7) Tiny end-to-end example (classification)
Text: “I loved this movie”
Input: [CLS] I loved this movie [SEP]
Token ids: … … … … … …
Token types: 0 0 0 0 0 0
Positions: 0 1 2 3 4 5
Embed sum → Transformer encoders → take [CLS] → softmax (pos/neg)
That’s the slide: WordPiece tokens + [CLS]/[SEP] → add segment ids → add positions →
sum → encode.
Here’s the Masked Language Modeling (MLM) objective from the slides, explained in the
viva-friendly “why → what → how → examples → tricks → limits” flow.
Why do we need MLM?
● We want a model that understands a word using both left and right context (true
bidirectionality).
● A normal left-to-right LM can’t look right; if we let it, the word could “see itself,” so the
probability training breaks.
● Idea: hide (mask) a few words and ask the model to guess them from surrounding
context. Now it can safely look both ways because the true token is hidden.
What is MLM (in one line)?
Randomly choose k% (~15%) of tokens in the input, and train the model to predict the original
tokens at those positions using the full surrounding context.
How it’s done (mechanics)
1. Choose positions to predict: e.g., 15% of token positions (after WordPiece
tokenization). Do not select special tokens like [CLS], [SEP], [PAD].
2. Corrupt the chosen tokens using the 80/10/10 rule:
○ 80% of selected positions → replace token with [MASK]
went to the store → went to the [MASK]
○ 10% → replace with a random token
went to the store → went to the running
○ 10% → keep the original token unchanged
went to the store → went to the store
3. Compute loss only at the selected positions (even when kept the same):
\mathcal{L}_{\text{MLM}} = -\sum_{i \in M} \log p_\theta(x_i \mid x_{\setminus M})
where M is the set of chosen (masked) positions.
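A simplified sketch of that corruption procedure in plain PyTorch. Whole-word masking and other refinements are omitted, and the id constants in the usage example are placeholders rather than guaranteed vocabulary ids.

```python
import torch

def mlm_corrupt(input_ids, vocab_size, mask_id, special_ids, p=0.15):
    """Return (corrupted_ids, labels); labels are -100 except at chosen positions."""
    ids = input_ids.clone()
    labels = torch.full_like(ids, -100)          # -100 = ignored by the loss

    # 1) Choose ~15% of non-special positions as prediction targets.
    candidates = ~torch.isin(ids, torch.tensor(special_ids))
    chosen = (torch.rand(ids.shape) < p) & candidates
    labels[chosen] = ids[chosen]

    # 2) 80% -> [MASK], 10% -> random token, 10% -> keep unchanged.
    r = torch.rand(ids.shape)
    ids[chosen & (r < 0.8)] = mask_id
    random_tok = chosen & (r >= 0.8) & (r < 0.9)
    ids[random_tok] = torch.randint(vocab_size, ids.shape)[random_tok]
    # Remaining ~10%: left as-is, but still predicted via `labels`.
    return ids, labels

# Usage with illustrative ids: 101=[CLS], 102=[SEP], 0=[PAD], 103=[MASK]
ids = torch.tensor([[101, 1996, 2158, 2253, 2000, 1996, 3573, 102]])
corrupted, labels = mlm_corrupt(ids, vocab_size=30522, mask_id=103,
                                special_ids=[101, 102, 0])
print(corrupted, labels)
```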
Why the 80/10/10 trick?
If we replaced with [MASK] 100% of the time, the model would see lots of [MASK] during
pretraining but never during fine-tuning/inference → distribution mismatch. Mixing in random
and unchanged keeps inputs realistic and prevents the model from overfitting to the presence of
[MASK].
Choosing k (=15%)
● Too little masking (e.g., 1–5%): very few supervised targets per sequence ⇒ weak
learning signal and expensive training.
● Too much masking (e.g., 40–50%): the remaining context is too damaged ⇒ not
enough context to infer meaning, optimization gets harder.
● Around 15% is a good balance: enough targets per batch without destroying context.
Intuition with an example
“the man went to the [MASK] to buy a [MASK] of milk”
Using both sides, BERT can predict store and gallon.
This forces the model to encode semantics like places you go to buy things and
units used with milk, learned directly from context, not from rules.
What does MLM buy us in practice?
● Bidirectional understanding of each token (solves “bank” ambiguity by using both
neighbors).
● Data-efficient transfer: after pretraining, you add a tiny head and fine-tune with
relatively little labeled data for classification, NER, QA, etc.
Practical notes you can mention
● WordPiece vocab (~30k); masking happens after tokenization (can mask subwords like
##ing).
● Loss is only on the selected positions; other tokens just provide context.
● Use attention masks to ignore padding; no causal mask (BERT is bidirectional).
Known limitations (and how people
improved it)
1. [MASK] mismatch still exists (even with 80/10/10).
○ Fixes: RoBERTa (no NSP, dynamic masking, more data); ELECTRA (different
objective: detect replaced tokens instead of predicting masked ones) is more
sample-efficient.
2. Not generative / no proper sequence likelihood.
○ Fixes: use GPT for generation, or denoising seq2seq models (BART, T5) when
you need to produce text.
3. Quadratic attention cost & input length (~512).
○ Fixes: Longformer, BigBird, chunking + retrieval.
4. Domain shift (medical/legal/code).
○ Fixes: continue pretraining on in-domain text (DAPT/TAPT) or use domain
variants (BioBERT, Legal-BERT).
One-sentence viva takeaway
MLM trains BERT by hiding ~15% of tokens and asking the encoder to recover them from
both left and right context (using the 80/10/10 corruption rule), giving us powerful
bidirectional representations for downstream understanding tasks.
Here’s the Next Sentence Prediction (NSP) objective in the same “why → what → how →
examples → when it helps → limits & fixes” flow.
Why was NSP added?
BERT isn’t just about single sentences; many real tasks are pairwise:
● QA: Question ↔ Passage
● NLI: Premise ↔ Hypothesis
● Retrieval/Reranking: Query ↔ Document
The authors wanted BERT to learn relationships across two sentences/segments
during pre-training, not only within a single sentence via MLM.
What is NSP (one line)?
Given two text segments A and B, predict whether B is the actual next segment that follows A
in the corpus (IsNext) or just a random segment (NotNext).
How it’s set up (mechanics)
● Build inputs as:
[CLS] A [SEP] B [SEP]
● Provide segment IDs: tokens from A → 0, from B → 1.
● Attach a small binary classifier on the [CLS] output.
● Sampling: For each training pair:
○ 50% IsNext: B is the true continuation of A.
○ 50% NotNext: B is sampled randomly from elsewhere in the corpus.
● Loss during pre-training:
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}
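A toy sketch of the 50/50 pair sampling and the combined loss (plain Python; the corpus lists and the 0/1 label convention here are made up for illustration):

```python
import random

def make_nsp_pair(doc_sentences, all_sentences):
    """Return (sentence_a, sentence_b, label); label 0 = IsNext, 1 = NotNext."""
    i = random.randrange(len(doc_sentences) - 1)
    a = doc_sentences[i]
    if random.random() < 0.5:
        return a, doc_sentences[i + 1], 0       # true continuation (IsNext)
    return a, random.choice(all_sentences), 1   # random sentence (NotNext)

doc = ["The man went to the store.", "He bought a gallon of milk."]
corpus = doc + ["Penguins are flightless."]
print(make_nsp_pair(doc, corpus))

# Pre-training then sums the two objectives:
# total_loss = mlm_loss + nsp_loss   (nsp_loss = cross-entropy on the [CLS] logits)
```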
Example (like your slide)
● IsNext
A: “The man went to the store.”
B: “He bought a gallon of milk.”
● NotNext
A: “The man went to the store.”
B: “Penguins are flightless.”
BERT learns to use the [CLS] representation to judge cross-sentence coherence/relatedness.
Where NSP is supposed to help
● Gives the model a pair-level signal, which should transfer to:
○ Natural Language Inference (entailment/contradiction)
○ QA / Passage selection (is this passage relevant to the question?)
○ Sentence/Document pair classification tasks
Important limitations (what later papers
found)
1. Too easy/biased negatives. Random B is often obviously unrelated (topic mismatch);
the model may learn topic detection rather than true discourse order.
2. Mixed empirical value. RoBERTa removed NSP entirely (kept only MLM with more
data & dynamic masking) and got better results on many benchmarks.
3. Coherence vs. adjacency. “Next” ≠ “coherent”; Wikipedia paragraph boundaries aren’t
perfect, so labels can be noisy.
Better alternatives used later
● SOP (Sentence Order Prediction, ALBERT): take two consecutive segments and
swap order half the time; predict if order is correct. Harder than random negatives,
focuses on discourse order.
● Hard negatives / in-batch negatives: choose B that is topically similar but wrong
(harder).
● Contrastive objectives (e.g., SimCSE, DeCLUTR): pull true pairs together in
embedding space, push others apart.
● Drop NSP (RoBERTa) and rely on stronger MLM + more data.
● Different pretraining altogether (ELECTRA’s replaced-token detection, BART/T5
denoising seq2seq) depending on downstream goals.
Viva takeaway (one sentence)
NSP trains BERT to judge whether B follows A, giving a coarse cross-sentence signal during
pre-training; it’s simple (binary classification on [CLS]), but later work showed it can be
unnecessary or suboptimal, with SOP, hard/contrastive negatives, or just stronger MLM
often performing better.
Here’s Next Sentence Prediction (NSP) in a crisp, viva-ready way.
What NSP is
During BERT pre-training, the model sometimes sees two segments:
[CLS] A-sentence tokens [SEP] B-sentence tokens [SEP]
It must predict (from the [CLS] vector) whether B really follows A in the corpus (IsNext) or is a
random sentence (NotNext). This is a binary classification trained together with MLM:
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{MLM}} + \mathcal{L}_{\text{NSP}}
Why it was added
Many downstream tasks are pairwise: question↔passage, premise↔hypothesis,
query↔document. NSP gives BERT an explicit signal about cross-sentence relations during
pre-training instead of learning only within-sentence context.
How it’s built (mechanics)
● Input building: Segment A gets token-type id 0, segment B gets 1; positions continue
across A→B.
● Sampling: For each pair, with 50% probability B is the true continuation (IsNext); with
50%, B is a randomly sampled sentence (NotNext).
● Head: A small dense layer on top of [CLS] → softmax( IsNext / NotNext ).
● Training: Standard cross-entropy on this label, added to the MLM loss.
Tiny example
● IsNext:
A: “The man went to the store.”
B: “He bought a gallon of milk.”
● NotNext:
A: “The man went to the store.”
B: “Penguins are flightless.”
The model must encode in [CLS] whether B is a coherent continuation of A.
What it helps (intended)
● Gives BERT a prior for coherence and A↔B relevance, which should help:
○ NLI (entail/contradict/neutral),
○ Passage selection / reranking for QA,
○ Sentence-pair classification tasks.
Main limitations (what later work found)
1. Easy negatives: Random B is usually off-topic → the task reduces to topic match, not
true discourse order.
2. Mixed gains: Empirically, removing NSP and training longer on more data (RoBERTa)
often improves scores.
3. Label noise: Corpus boundaries aren’t perfect; “next” isn’t always the semantically right
continuation.
Better alternatives / fixes
● SOP (Sentence Order Prediction, ALBERT): use two consecutive segments and swap
their order half the time; predict if order is correct (harder than random negatives).
● Hard negatives / in-batch negatives: choose B that is semantically similar but
wrong.
● Contrastive pair objectives (SimCSE/InfoNCE-style) to learn pair similarity.
● Drop NSP entirely (RoBERTa) and rely on stronger MLM with more data.
When would you still use something
NSP-like?
● Your downstream task is pair-matching/ranking and you can generate hard negatives
(e.g., top-k BM25 passages that are not the answer).
● You want a pair-aware pre-train for small data regimes.
One-line takeaway
NSP trains BERT’s [CLS] to judge if B follows A, injecting cross-sentence knowledge;
it’s simple but often too easy—SOP, hard/contrastive negatives, or just stronger MLM
usually work better.
Let’s read that “Model Architecture” slide line-by-line and turn every bullet into an intuition you
can say in a viva.
Transformer encoder
BERT is encoder-only. One encoder layer =
(a) Multi-Head Self-Attention → Add & LayerNorm →
(b) Position-wise Feed-Forward → Add & LayerNorm.
Data used for pre-training
● Wikipedia (≈2.5B words) + BookCorpus (≈800M words).
No labels—BERT learns with Masked-LM (+ NSP).
Why these corpora? They’re large, clean English, general-domain: good prior knowledge for
transfer.
Training time: 1M steps (~40 epochs)
● BERT was trained for 1,000,000 update steps.
● Trick for efficiency: most steps with shorter sequences (128), last steps with 512 tokens
to expose the model to long contexts.
“~40 epochs” roughly means the model passes over the (chunked, randomly masked)
corpus about 40 times.
Optimizer and schedule
● AdamW (Adam with decoupled weight decay)—better generalization than L2 inside
Adam.
Typical hyperparameters from the paper:
LR 1e-4, β₁=0.9, β₂=0.999, ε=1e-6, weight decay 0.01, dropout 0.1, warmup for the
first ~10k steps then linear decay of LR to 0.
Why warmup + linear decay? Warmup stabilizes the start; linear decay avoids overshooting
later.
Model sizes you should remember
● BERT-Base: 12 layers, hidden 768, 12 heads → ~110M params.
● BERT-Large: 24 layers, hidden 1024, 16 heads → ~340M params.
Per-token output shape after the encoder is [batch, seq_len, hidden] (e.g., [B,
L, 768] for Base).
The [CLS] vector (the first position) is used for sentence-level classification.
Compute used in the paper
“4×4 or 8×8 TPU slice for ~4 days” ≈ 16 TPUs (Base) or 64 TPUs (Large). That’s the scale
needed to pre-train from scratch.
Fine-tuning on your task is much cheaper (minutes–hours on a single GPU).
How the little diagrams map to numbers
● Bottom right (Q, K, V → Scaled Dot-Product Attention): shows one head. BERT
projects the hidden state to Q/K/V, does attention, concatenates all heads, then a linear
layer.
● Top right (Add & Norm boxes): the residual connections + LayerNorm after attention
and after the feed-forward block. This stabilizes very deep stacks (12/24 layers).
● “Positional Encoding” circle: in BERT these are learned positional embeddings
(added to token + segment embeddings before layer 1).
A tiny numeric walk-through (BERT-Base)
Say batch B=8, sequence length L=128:
1. Input embeddings: shape [8,128,768] (token + segment + position sum).
2. In one attention head: Q/K/V each [8,128,64] (768 / 12 heads = 64 per head); attention
weights are [8,128,128] per head, i.e., [8,12,128,128] across all 12 heads.
3. After 12 stacked layers → output [8,128,768].
Use output at [CLS] (position 0) for classification, or all positions for NER/QA.
One-sentence viva summary
BERT is a 12/24-layer Transformer encoder trained on Wikipedia + BookCorpus with
AdamW (LR 1e-4, warmup→linear decay), using multi-head bidirectional self-attention
(12/16 heads, hidden 768/1024), pre-trained for ~1M steps (mostly short sequences, some
long), and then fine-tuned with small task heads.
Awesome—this slide is the “how to use BERT” slide. It shows the 4 common fine-tuning
patterns. In all cases you:
1. Tokenize with WordPiece and build inputs:
[CLS] ... [SEP] (optionally a second segment ending in another [SEP]) + token-type ids
(0 for segment A, 1 for segment B) + attention mask.
2. Feed through BERT (all layers).
3. Add a tiny task head and train end-to-end (usually 2–4 epochs).
(a) Sentence-pair classification
(MNLI/QQP/QNLI/STS-B/MRPC/RTE/SWAG)
Input: [CLS] sentence1 [SEP] sentence2 [SEP]
Token types: s1→0, s2→1
Head: a small classifier on top of [CLS] → label (e.g., entail/contradict/neutral, duplicate/not,
similarity score).
Loss: Cross-entropy (or MSE for STS-B).
Why pairs? Tasks where “A relates to B” (natural language inference, paraphrase,
question-premise).
(b) Single-sentence classification (SST-2, CoLA)
Input: [CLS] sentence [SEP] (single segment)
Head: classifier on [CLS] → label (sentiment, acceptability).
Loss: Cross-entropy.
Tip: You can also pool all token states (mean/max), but [CLS] is the default.
(c) Extractive Question Answering (SQuAD)
Input: [CLS] question [SEP] passage [SEP]
Head: two token-wise classifiers over the sequence → start logits and end logits for each
token.
Prediction: choose the best (start, end) span in the passage (end ≥ start, length cap).
Loss: sum of cross-entropies for gold start and end positions.
Notes: If the passage is long (>512), slide a window over it and pick the best span across
windows.
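A small sketch of picking the best (start, end) span from those start/end logits (plain PyTorch; a real pipeline would also restrict candidates to passage tokens and map subword indices back to text):

```python
import torch

def best_span(start_logits, end_logits, max_answer_len=30):
    """Pick (start, end) maximizing start_logits[s] + end_logits[e], with e >= s."""
    best, best_score = (0, 0), float("-inf")
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_answer_len, len(start_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best

# Toy logits for a 6-token sequence (made up, not real model output).
start = torch.tensor([0.1, 2.5, 0.3, 0.0, 0.2, 0.1])
end   = torch.tensor([0.0, 0.1, 0.2, 3.0, 0.4, 0.1])
print(best_span(start, end))   # (1, 3): the answer spans tokens 1..3
```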
(d) Token-level tagging (NER: CoNLL-2003)
Input: [CLS] sentence [SEP]
Head: a classifier at each token → BIO labels (B-PER, I-ORG, O, …).
Loss: token-wise cross-entropy (mask out padding).
Subword labels: common choices are (i) label the first subword only and ignore the rest, or (ii)
copy the label to all subwords.
Upgrade (optional): add a CRF on top to enforce valid BIO transitions.
Hyperparameters that usually “just work”
● Batch size: 16–32 (adjust for memory).
● LR (AdamW): 2e-5, 3e-5, or 5e-5; weight decay 0.01; warmup ~10% steps; linear
decay.
● Epochs: 2–4 (watch validation; early stop).
● Max length: 128–512; truncate or use sliding windows for long texts.
● Gradient clip: e.g., 1.0 for stability.
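A hedged sketch of that optimizer/scheduler setup, assuming the transformers helper get_linear_schedule_with_warmup; the step count and the stand-in model are placeholders so the snippet runs on its own.

```python
import torch
from transformers import get_linear_schedule_with_warmup

# Stand-in for your fine-tuning model (e.g., a BertForSequenceClassification).
model = torch.nn.Linear(768, 2)

num_training_steps = 1000   # epochs * batches_per_epoch for your dataset
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5, weight_decay=0.01)
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),   # ~10% warmup
    num_training_steps=num_training_steps)            # then linear decay to 0

# Inside the training loop, after loss.backward():
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clip
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()
```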
Common pitfalls & fixes
● 512-token limit: use sliding windows, retrieval to shorten context, or long-context
models (Longformer/BigBird) if needed.
● Imbalanced classes: use class-weighted loss or focal loss.
● Label/subword mismatch (NER): be explicit about how you align labels.
● Small data: try parameter-efficient tuning (Adapters/LoRA), freeze lower layers, or
augment data.
● Overfitting: monitor validation, use dropout 0.1, early stopping.
Metrics to report
● Classification: accuracy / F1 (macro if imbalanced).
● NER: token-level or entity-level F1.
● QA: Exact Match and F1 over answer tokens.
● STS-B: Pearson/Spearman correlation.
That’s the whole picture: build the input with [CLS]/[SEP], run BERT, attach the right head
([CLS] for sentence tasks, span heads for QA, per-token head for tagging), and fine-tune
everything jointly.
Here’s how to read that GLUE results slide and what it means for BERT.
What is GLUE?
GLUE is a bundle of diverse language understanding tasks. It checks if a model can do many
things with little task-specific change. (BERT just adds a tiny head and fine-tunes.)
Columns = tasks (sizes shown under each name):
● MNLI (m/mm) – multi-genre natural language inference; scores are accuracy on
matched / mismatched test sets.
● QQP – Quora duplicate-question pairs; GLUE reports F1 (older GLUE tables often show
a single F1 number).
● QNLI – question vs sentence (is the sentence the answer?) – accuracy.
● SST-2 – movie review sentiment – accuracy.
● CoLA – grammatical acceptability – Matthews corr. (MCC).
● STS-B – sentence similarity – average of Pearson & Spearman correlations (×100).
● MRPC – paraphrase detection – GLUE uses the average of F1 and accuracy.
● RTE – textual entailment – accuracy.
(WNLI is excluded here because it’s tiny/noisy in the original leaderboard.)
Rows (systems being compared)
● Pre-OpenAI SOTA – the best published systems before GPT.
● BiLSTM+ELMo+Attn – strong pre-Transformer baseline.
● OpenAI GPT – first big Transformer decoder (left-to-right).
● BERT-Base / BERT-Large – Transformer encoder (bidirectional, MLM+NSP).
What the numbers tell you
● Higher is better in every column.
● BERT-Large is bold across most tasks and has the best Average (81.9), beating GPT’s
75.2 by a wide margin.
● MNLI: BERT-Large 86.7/85.9 > GPT 82.1/81.4 — big jump on a large, hard NLI task.
● Low-data tasks (RTE 70.1, CoLA 60.5) show especially large gains, showing how
pre-training helps when labeled data is scarce.
● SST-2 (94.9) and QNLI (91.1) are also strong; STS-B (86.5) shows BERT’s semantic
similarity is solid.
● Average is a simple mean of task metrics (not all metrics are identical, so it’s an
indicative—but not perfect—summary).
Why BERT wins
1. Bidirectional self-attention (uses left + right context) → better token understanding
(“bank” the place vs river).
2. Pre-training at scale (Wikipedia + BookCorpus) with Masked LM (+ NSP) → strong
general language knowledge.
3. Minimal task heads + end-to-end fine-tuning → adapts quickly to each dataset.
How to use this in a viva (talking points)
● “GLUE evaluates broad understanding; BERT-Large set a new SOTA by 6–7 points over
GPT on the average, with notable gains on NLI (MNLI/QNLI) and low-resource tasks
(RTE/CoLA). The key reasons are bidirectional context and effective pre-training
objectives (MLM, NSP). Metrics differ per task (Acc, F1, MCC, correlations), but BERT
improves most of them with the same architecture plus small heads.”
Here’s what that Conclusions slide is getting at, unpacked and made viva-ready.
1) “Empirical results are great, but the biggest impact…”
What changed the field wasn’t just new SOTA numbers—it was the recipe.
BERT proved that you can:
1. Pre-train one generic model on huge unlabeled text (MLM+NSP), then
2. Fine-tune the same backbone with tiny task-specific heads.
This replaces dozens of bespoke architectures (custom CNN/LSTM + attention per task) with
one backbone + a small head. It gives:
● Label efficiency: far fewer labeled examples needed.
● Uniform workflow: same code path for sentiment, NLI, NER, QA, etc.
● Better transfer: knowledge learned on Wikipedia/Books helps everywhere.
Real-world effect: teams ship NLP features faster by fine-tuning BERT(-like) checkpoints instead
of re-inventing models for every task.
2) “With pre-training, bigger == better (so far)”
Once you adopt the pre-train→fine-tune recipe, scaling the model and/or data almost
monotonically improves quality:
● More layers/hidden size/heads → better representations.
● More pre-training tokens/steps → better generalization.
● Better optimizers/schedules → unlock scale (e.g., AdamW + warmup + decay).
This observation foreshadowed today’s scaling laws: performance rises predictably with
parameters, data, and compute. It’s why BERT-Large beat BERT-Base, and later
RoBERTa-Large/ELECTRA-Large beat earlier models.
Caveats to mention: returns are diminishing per dollar; you hit latency/memory/energy limits;
data quality matters as much as quantity.
3) “Unclear if adding things on top of BERT really helps a
lot”
After BERT, many tweaks tried to bolt on extra modules (task-specific attention, multi-task
towers, fancy pooling). In practice:
● Simple wins: a linear head on [CLS] (or span heads for QA, token head for NER) +
end-to-end fine-tuning already gives most of the gain.
● Bigger impact came from training, not gadgets:
○ RoBERTa: removed NSP, used more data, dynamic masking, longer training →
strong boost.
○ ALBERT/DistilBERT/MiniLM: parameter sharing or distillation → efficiency
gains.
○ ELECTRA: better pre-training objective (replaced-token detection) → sample
efficiency gains.
Implication: If you’re building systems, choose a good checkpoint and focus on data,
objective, and scale before inventing complex heads.
What this means for different audiences
For practitioners (product teams):
● Start from a reputable checkpoint (e.g., BERT/RoBERTa/ELECTRA for understanding).
● Pick a size that fits latency/memory constraints; use DistilBERT/MiniLM for
mobile/edge.
● Get more from data (cleaning, domain-adaptive pre-training) and PEFT (Adapters/LoRA)
than from exotic architectures.
● Measure and manage cost/latency; consider quantization, pruning, or distillation.
For researchers:
● The “bigger is better” trend raises the compute barrier; impactful work often targets:
○ Efficiency: sparse attention, better optimizers, distillation, quantization.
○ Objectives & data: improved pre-training tasks, high-quality or domain data.
○ Capabilities gaps: long-context reasoning, faithfulness, factuality, robustness,
safety.
Balanced view (good to say in a viva)
● Pros: unified pipeline, label efficiency, strong baselines across tasks, predictable gains
from scale.
● Cons/risks: high compute/energy, latency and memory at inference, bias & privacy
issues from web pre-training, and diminishing returns without better data.
One-liner:
BERT’s lasting contribution isn’t just better scores—it established the foundation-model
paradigm: pre-train big, fine-tune small. Since then, scaling and better pre-training (more data,
better objectives) have mattered more than adding fancy task-specific modules.