
BERT

(Bidirectional Encoder Representations from Transformers)

Happy 6th Birthday, BERT!!

2018 in Machine Learning and NLP
BERT vs. OpenAI GPT vs. ELMo
Pre-BERT: ELMo (Embeddings from Language Models)
Generates contextual word embeddings: the same word gets a different representation depending on the sentence it appears in.
Pre-BERT: OpenAI GPT
A language model that reads text in one direction (left-to-right) using a deep
Transformer decoder architecture.
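The directional difference can be made concrete with attention masks. A small PyTorch sketch (an illustration, not either model's actual code): a GPT-style decoder uses a lower-triangular (causal) mask so each position only sees tokens to its left, while a BERT-style encoder lets every token attend to every other token.

```python
import torch

seq_len = 6

# GPT-style causal mask: position i may attend only to positions <= i (left-to-right).
causal_mask = torch.tril(torch.ones(seq_len, seq_len)).bool()

# BERT-style mask: every position may attend to every position (bidirectional).
bidirectional_mask = torch.ones(seq_len, seq_len).bool()

print(causal_mask.int())         # lower-triangular matrix of 1s
print(bidirectional_mask.int())  # all 1s
```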
BERT vs. OpenAI GPT vs. ELMo
Fine-Tuning Approach
BERT uses a deep Transformer encoder and is designed to be
fine-tuned for specific tasks.
Key Feature: It learns word representations using
bidirectional context, meaning it looks at both the words
before and after a target word.
● Why? Understanding both left and right contexts helps
clarify word meaning.
● Example 1: "We went to the river bank." (Here, 'bank'
refers to the river's edge.)
● Example 2: "I need to go to the bank to make a deposit."
(Here, 'bank' refers to a financial institution.)
The Problem with Bidirectional Conditioning

In a multi-layered bidirectional model, a standard language-modeling objective breaks down: each word could indirectly "see itself" through the context, making prediction trivial.
Masked Language Modeling (MLM)
Solution: mask out k% of the input words, then predict the masked words.

the man went to the [MASK] to buy a [MASK] of milk
(the model should predict "store" for the first mask and "gallon" for the second)

k is usually 15%:
- Too much masking → not enough context
- Too little masking → too few prediction targets per pass, so training is computationally expensive
MLM (Continued)
Selection of masked tokens: 15% of positions are uniformly sampled.

80-10-10 corruption of the selected tokens (sketched in code below):
● 80% are replaced with the [MASK] token.
Let’s go to the bank’s ATM → Let’s go to the [MASK] ATM
● 10% are replaced with a random word from the vocabulary.
Let’s go to the bank’s ATM → Let’s go to the boo ATM
● 10% are left unchanged.
Let’s go to the bank’s ATM → Let’s go to the bank’s ATM
→ This keeps the model biased toward the correct, observed token.
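The 15% selection plus 80-10-10 corruption can be sketched in a few lines of Python. This is a toy illustration, not the official BERT preprocessing; the whitespace tokenization, the tiny `vocab` list, and the `mask_tokens` helper are assumptions made for the demo.

```python
import random

def mask_tokens(tokens, vocab, mask_prob=0.15):
    """Toy sketch of BERT-style MLM corruption.

    Selects ~15% of positions uniformly; of those, 80% become [MASK],
    10% become a random vocabulary word, and 10% are left unchanged.
    Returns the corrupted tokens and the positions the model must predict.
    """
    corrupted = list(tokens)
    targets = []  # (position, original token) pairs to predict
    for i, tok in enumerate(tokens):
        if random.random() >= mask_prob:
            continue
        targets.append((i, tok))
        r = random.random()
        if r < 0.8:                      # 80%: replace with [MASK]
            corrupted[i] = "[MASK]"
        elif r < 0.9:                    # 10%: replace with a random word
            corrupted[i] = random.choice(vocab)
        # else 10%: keep the original token unchanged
    return corrupted, targets

tokens = "the man went to the store to buy a gallon of milk".split()
print(mask_tokens(tokens, vocab=["boo", "bank", "river", "deposit"]))
```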
Handling Relationships Between Multiple Sentences: Two-Sentence Tasks

Given two sentences A and B, is B likely to be the sentence that follows A or not?

Next Sentence Prediction (NSP)
NSP is designed to reduce the gap between pre-training and fine-tuning.
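A minimal sketch of how NSP training pairs might be constructed (an illustration, not the exact BERT data pipeline; `make_nsp_example` and its arguments are hypothetical names): half the time sentence B is the true next sentence, half the time it is a random sentence from the corpus.

```python
import random

def make_nsp_example(doc_sentences, all_sentences):
    """Build one NSP training pair from a document's sentence list.

    With probability 0.5 take the true next sentence (label IsNext),
    otherwise sample a random sentence from the corpus (label NotNext).
    """
    i = random.randrange(len(doc_sentences) - 1)
    sent_a = doc_sentences[i]
    if random.random() < 0.5:
        sent_b, label = doc_sentences[i + 1], "IsNext"
    else:
        sent_b, label = random.choice(all_sentences), "NotNext"
    # Input is packed as: [CLS] sentence A [SEP] sentence B [SEP]
    return f"[CLS] {sent_a} [SEP] {sent_b} [SEP]", label

doc = ["The man went to the store.", "He bought a gallon of milk.", "Then he went home."]
print(make_nsp_example(doc, all_sentences=doc + ["Penguins live in Antarctica."]))
```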
BERT-base and BERT-large

• BERT-base: 12 layers, 768 hidden size, 12 attention heads, 110M parameters
- Same hidden size as OpenAI GPT; designed for a direct performance comparison with OpenAI GPT

• BERT-large: 24 layers, 1024 hidden size, 16 attention heads, 340M parameters
- A much larger model aimed at state-of-the-art results
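To make the two configurations concrete, here is a sketch that builds randomly initialized models of both sizes and checks their rough parameter counts. It assumes the Hugging Face transformers library is installed; the exact counts depend on the vocabulary size and whether the MLM/NSP heads are included.

```python
from transformers import BertConfig, BertModel

# BERT-base: 12 layers, hidden size 768, 12 attention heads
base_config = BertConfig(num_hidden_layers=12, hidden_size=768,
                         num_attention_heads=12, intermediate_size=3072)

# BERT-large: 24 layers, hidden size 1024, 16 attention heads
large_config = BertConfig(num_hidden_layers=24, hidden_size=1024,
                          num_attention_heads=16, intermediate_size=4096)

# Randomly initialized encoders, just to inspect the rough parameter counts
print(BertModel(base_config).num_parameters())   # on the order of 110M
print(BertModel(large_config).num_parameters())  # on the order of 340M
```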
BERT Pre-Training

• Training corpus: Wikipedia (2.5B words) + BooksCorpus (0.8B words)
- OpenAI GPT was trained on BooksCorpus only.
• Max sequence size: 512 word pieces (roughly 256 + 256 for two non-contiguous segments)
• Trained for 1M steps with a batch size of 128k tokens
BERT Pre-Training (Continued)

- MLM and NSP are trained together.
- [CLS] is pre-trained for NSP.
- The other token representations are trained for MLM.
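A simplified sketch of the joint objective, assuming PyTorch and that the model has already produced MLM logits for every position and NSP logits from the [CLS] representation (this is an illustration of the combined loss, not BERT's actual training code):

```python
import torch
import torch.nn.functional as F

def pretraining_loss(mlm_logits, mlm_labels, nsp_logits, nsp_labels):
    """Joint MLM + NSP loss.

    mlm_logits: (batch, seq_len, vocab_size) scores for every position
    mlm_labels: (batch, seq_len), with -100 at positions that were not masked,
                so only the ~15% selected tokens contribute to the MLM loss
    nsp_logits: (batch, 2) scores computed from the [CLS] representation
    nsp_labels: (batch,), 1 = IsNext, 0 = NotNext
    """
    mlm_loss = F.cross_entropy(mlm_logits.transpose(1, 2), mlm_labels,
                               ignore_index=-100)
    nsp_loss = F.cross_entropy(nsp_logits, nsp_labels)
    return mlm_loss + nsp_loss  # the two losses are simply summed
```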
Fine-Tuning BERT: “Pretrain once, fine-tune many times”
● Sentence Level Tasks
● Token Level Tasks
Sentence Level Tasks
● Sentence Pair Classification Tasks
MNLI:
Premise: A soccer game with multiple males playing.
Hypothesis: Some men are playing a sport.
{entailment, contradiction, neutral}

QQP:
Q1: Where can I learn to invest in stocks?
Q2: How can I learn more about stocks?
{duplicate, not duplicate}

● Single Sentence Classification Tasks

SST-2:
rich veins of funny stuff in this movie
{positive, negative}
Sentence Level Tasks

• For sentence pair tasks, use [SEP] to separate the two segments, with segment embeddings marking which segment each token belongs to.
• Add a linear classifier on top of the [CLS] representation, introducing C × h new parameters (C: number of classes, h: hidden size).
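A minimal PyTorch sketch of such a head, assuming `encoder` is any module that returns hidden states of shape (batch, seq_len, h), e.g. a BERT encoder. The class name and interface are illustrative, not BERT's actual fine-tuning code.

```python
import torch.nn as nn

class SentenceClassifier(nn.Module):
    """Sentence-level head: classify from the [CLS] representation."""
    def __init__(self, encoder, hidden_size, num_classes):
        super().__init__()
        self.encoder = encoder
        # The only new parameters: a C x h linear layer (plus bias)
        self.classifier = nn.Linear(hidden_size, num_classes)

    def forward(self, input_ids, **kwargs):
        hidden = self.encoder(input_ids, **kwargs)  # (batch, seq_len, h)
        cls_repr = hidden[:, 0]                     # [CLS] is the first token
        return self.classifier(cls_repr)            # (batch, num_classes)
```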
Token Level Tasks
● Extractive Question Answering

SQuAD: predict the answer span in the passage (e.g., “MetLife Stadium”)

● Named Entity Recognition

CoNLL 2003 NER:
John Smith lives in New York
B-PER I-PER O O B-LOC I-LOC
Token Level Tasks

• For token-level prediction tasks, add a linear classifier on top of the hidden representations.
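The token-level head differs from the sentence-level one only in that the classifier is applied at every position instead of just [CLS]. A sketch under the same assumptions as above (a generic `encoder` returning hidden states; names are illustrative):

```python
import torch.nn as nn

class TokenClassifier(nn.Module):
    """Token-level head for tasks like NER or span prediction."""
    def __init__(self, encoder, hidden_size, num_labels):
        super().__init__()
        self.encoder = encoder
        self.classifier = nn.Linear(hidden_size, num_labels)

    def forward(self, input_ids, **kwargs):
        hidden = self.encoder(input_ids, **kwargs)  # (batch, seq_len, h)
        return self.classifier(hidden)              # one label score per token
```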
Experimental Results: GLUE
Experimental Results: SQuAD
Ablation Study: Pre-training Tasks

- MLM >> left-to-right LMs
- NSP improves on some tasks
- Later work (Joshi et al., 2020; Liu et al., 2019) argued that NSP is not useful.
Ablation Study: Model Size

The bigger the better!!!


Ablation Study: Training Efficiency

MLM takes longer to converge because it only predicts 15% of the tokens.
Conclusions (in Early 2019)

The empirical results from BERT are great, but the biggest
impact on the field is:
- With pre-training, bigger == better, without clear limits
so far.
After BERT
RoBERTa (Liu et al., 2019)

ALBERT (Lan et al., 2020)

ELECTRA (Clark et al., 2020)


After BERT
• Models that handle long contexts (≫ 512 tokens)
- Longformer, Big Bird (this is really cute), …
• Multilingual BERT
- A single model trained on 104 languages from Wikipedia, with a shared 110k WordPiece vocabulary
• BERT extended to different domains
- SciBERT, BioBERT, FinBERT, ClinicalBERT, …
• Making BERT smaller to use
- DistilBERT, TinyBERT, …
After BERT
Text generation using BERT (generally less effective than OpenAI’s GPT).
New Rankings! (using GLUE)
Citations
Devlin, Jacob, et al. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” arXiv.org, 24 May 2019, arxiv.org/abs/1810.04805.

“Fall 2022 Lecture 2: BERT (Encoder-Only Models).” www.cs.princeton.edu/courses/archive/fall22/cos597G/lectures/lec02.pdf. Accessed 23 Oct. 2024.

Alammar, Jay. “The Illustrated BERT, ELMo, and Co. (How NLP Cracked Transfer Learning).” jalammar.github.io/illustrated-bert/. Accessed 23 Oct. 2024.

BERT image from Muppet Wiki (700 × 1,165); ELMo image from Muppet Wiki (800 × 979).
