BERT
(Bidirectional Encoder Representations from Transformers)
Happy 6th Birthday, BERT!!
2018 in Machine Learning and NLP
BERT vs. OpenAI GPT vs. ELMo
PRE-BERT: ELMo (Embeddings from Language Models)
Generates contextual word embeddings: the same word gets a different representation depending on the sentence it appears in.
PRE-BERT: OpenAI GPT
A language model that reads text in one direction (left-to-right) using a deep
Transformer decoder architecture.
BERT vs. OpenAI GPT vs. ELMo
Fine-Tuning Approach
BERT uses a deep Transformer encoder and is designed to be
fine-tuned for specific tasks.
Key Feature: It learns word representations using
bidirectional context, meaning it looks at both the words
before and after a target word.
● Why? Understanding both left and right contexts helps
clarify word meaning.
● Example 1: "We went to the river bank." (Here, 'bank'
refers to the river's edge.)
● Example 2: "I need to go to the bank to make a deposit."
(Here, 'bank' refers to a financial institution.)
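A quick way to see this in practice is the sketch below. It assumes the Hugging Face transformers and torch packages and the bert-base-uncased checkpoint (illustrative choices, not part of the slides), and compares the contextual vector BERT assigns to "bank" in the two example sentences; the cosine similarity should come out well below 1, showing the same word gets different representations in different contexts.

```python
# Minimal sketch: compare BERT's contextual embedding of "bank" in two sentences.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]                      # embedding of "bank" in this context

v_river = bank_vector("We went to the river bank.")
v_money = bank_vector("I need to go to the bank to make a deposit.")
print(torch.cosine_similarity(v_river, v_money, dim=0))      # < 1: same word, different vectors
```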
Why not plain bidirectional conditioning?
→ In a multi-layered model, each word could indirectly "see itself" in the context, making prediction trivial.
Masked Language Modeling (MLM)!
Solution: Mask out k% of the input words, and then
predict the masked words
[Figure: a masked sentence; the model predicts "store" and "gallon" at the [MASK] positions.]
k: usually 15%
- Too much masking → not enough context left to predict from
- Too little masking → too expensive to train (too few predictions per pass)
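Before the corruption details, here is a minimal sketch of what the MLM objective asks the model to do, assuming the Hugging Face transformers package; the example sentence is an illustration based on the figure's "store"/"gallon" predictions, not text from the slides.

```python
# Minimal sketch: a pretrained BERT fills in a masked position using both
# left and right context, mirroring the MLM pre-training objective.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The man went to the [MASK] to buy a gallon of milk."):
    print(pred["token_str"], round(pred["score"], 3))   # top candidates, e.g. "store"
```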
MLM (Continued)
Selection of masked tokens:
- 15% of tokens are uniformly sampled for prediction.
- 80-10-10 corruption of the selected tokens (see the sketch below):
● 80% are replaced with the [MASK] token.
Let’s go to the bank’s ATM → Let’s go to the [MASK] ATM
● 10% are replaced with a random word from the vocabulary.
Let’s go to the bank’s ATM → Let’s go to the boo ATM
● 10% are left unchanged.
Let’s go to the bank’s ATM → Let’s go to the bank’s ATM
→ Keeps the representation biased toward the actual observed word.
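The 80-10-10 rule is easy to state as code. Below is a minimal sketch in plain Python; the token list and the tiny vocabulary are illustrative assumptions, not BERT's WordPiece vocabulary.

```python
# Minimal sketch of MLM token selection with 80-10-10 corruption.
import random

def corrupt(tokens, vocab, mask_rate=0.15):
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:        # ~15% of tokens are selected for prediction
            targets[i] = tok                   # the loss is computed only at these positions
            r = random.random()
            if r < 0.8:                        # 80%: replace with [MASK]
                inputs[i] = "[MASK]"
            elif r < 0.9:                      # 10%: replace with a random vocabulary word
                inputs[i] = random.choice(vocab)
            # remaining 10%: keep the original token unchanged
    return inputs, targets

tokens = "let 's go to the bank 's atm".split()
print(corrupt(tokens, vocab=["boo", "store", "gallon", "river"]))
```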
Handling relationships between multiple
sentences:
QQP (Quora Question Pairs):
Q1: Where can I learn to invest in stocks?
Q2: How can I learn more about stocks?
{duplicate, not duplicate}
• For sentence pair tasks, use [SEP] to separate the two segments with segment
embeddings
• Add a linear classifier on top of [CLS] representation and introduce C × h new
parameters (C: # of classes, h: hidden size)
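The classifier head really is that small. A minimal sketch follows, assuming torch and Hugging Face transformers with the bert-base-uncased checkpoint; the head here is untrained, so its outputs are meaningless until fine-tuning.

```python
# Minimal sketch: sentence-pair classification with a C x h linear head on [CLS].
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

C, h = 2, 768                                    # {duplicate, not duplicate}; BERT-base hidden size
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(h, C)                     # the only new parameters: C x h weights (+ C biases)

q1 = "Where can I learn to invest in stocks?"
q2 = "How can I learn more about stocks?"
inputs = tokenizer(q1, q2, return_tensors="pt")  # inserts [SEP] and sets segment (token_type) ids
with torch.no_grad():
    cls = encoder(**inputs).last_hidden_state[:, 0]  # the [CLS] token is position 0
print(classifier(cls).softmax(dim=-1))           # class probabilities (untrained head)
```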
Token-Level Tasks
● Extractive question answering (e.g., SQuAD): predict the answer as a span of the passage.
[Figure: SQuAD example whose answer span is "MetLife Stadium".]
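For span extraction, the task-specific parameters are similarly tiny: one linear layer scores every token as a potential answer start and end. A minimal PyTorch sketch follows; the random hidden states stand in for BERT's encoder output, and the shapes are illustrative assumptions.

```python
# Minimal sketch of a SQuAD-style span-extraction head: start/end logits per token.
import torch
from torch import nn

hidden_size, seq_len = 768, 16
qa_head = nn.Linear(hidden_size, 2)                     # 2 scores per token: start, end

hidden_states = torch.randn(1, seq_len, hidden_size)    # stand-in for BERT's output
start_logits, end_logits = qa_head(hidden_states).split(1, dim=-1)
start = start_logits.squeeze(-1).argmax(dim=-1).item()  # predicted start position
end = end_logits.squeeze(-1).argmax(dim=-1).item()      # predicted end position
print(start, end)                                       # tokens in [start, end] form the answer span
```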
- MLM >> left-to-right LMs
- NSP (Next Sentence Prediction) improves performance on some tasks
- Later work (Joshi et al., 2020; Liu et al., 2019) argued that NSP is not useful.
Ablation Study: Model Size
The empirical results from BERT are great, but the biggest
impact on the field is:
- With pre-training, bigger == better, without clear limits
so far.
After BERT
RoBERTa (Liu et al., 2019)
Alammar, Jay. "The Illustrated BERT, ELMo, and Co. (How NLP Cracked Transfer Learning)." jalammar.github.io/illustrated-bert/. Accessed 23 Oct. 2024.