BERT
(Bidirectional Encoder Representations from Transformers)
Happy 6th Birthday, BERT!!
2018 in Machine Learning and NLP
BERT vs. OpenAI GPT vs. ELMo
PRE-BERT: ELMo (Embeddings from Language Models)
Generates contextual word embeddings: the same word gets a different representation depending on the sentence it appears in.
PRE-BERT: OpenAI GPT
A language model that reads text in one direction (left-to-right) using a deep
Transformer decoder architecture.
BERT vs. OpenAI GPT vs. ELMo
Fine-Tuning Approach
BERT uses a deep Transformer encoder and is designed to be
fine-tuned for specific tasks.
Key Feature: It learns word representations using
bidirectional context, meaning it looks at both the words
before and after a target word.
● Why? Understanding both left and right contexts helps
clarify word meaning.
● Example 1: "We went to the river bank." (Here, 'bank'
refers to the river's edge.)
● Example 2: "I need to go to the bank to make a deposit."
(Here, 'bank' refers to a financial institution.)
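A quick way to see this in practice is the sketch below. It assumes the Hugging Face transformers and torch packages and the bert-base-uncased checkpoint (illustrative choices, not part of the slides), and compares the contextual vector BERT assigns to "bank" in the two example sentences; the cosine similarity should come out well below 1, showing the same word gets different representations in different contexts.

```python
# Minimal sketch: compare BERT's contextual embedding of "bank" in two sentences.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]        # (seq_len, hidden_size)
    tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
    return hidden[tokens.index("bank")]                      # embedding of "bank" in this context

v_river = bank_vector("We went to the river bank.")
v_money = bank_vector("I need to go to the bank to make a deposit.")
print(torch.cosine_similarity(v_river, v_money, dim=0))      # < 1: same word, different vectors
```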
Why not plain bidirectional conditioning?
→ In a multi-layered model, each word could indirectly "see itself" in the context, making prediction trivial.
Masked Language Modeling (MLM)!
Solution: Mask out k% of the input words, and then
predict the masked words
[Figure: a masked sentence; the model predicts "store" and "gallon" at the [MASK] positions.]
k: usually 15%
- Too much masking → not enough context left to predict from
- Too little masking → too expensive to train (too few predictions per pass)
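Before the corruption details, here is a minimal sketch of what the MLM objective asks the model to do, assuming the Hugging Face transformers package; the example sentence is an illustration based on the figure's "store"/"gallon" predictions, not text from the slides.

```python
# Minimal sketch: a pretrained BERT fills in a masked position using both
# left and right context, mirroring the MLM pre-training objective.
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("The man went to the [MASK] to buy a gallon of milk."):
    print(pred["token_str"], round(pred["score"], 3))   # top candidates, e.g. "store"
```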
MLM (Continued)
Selection of masked tokens:
- 15% of tokens are uniformly sampled for prediction.
- 80-10-10 corruption of the selected tokens (see the sketch below):
● 80% are replaced with the [MASK] token.
Let’s go to the bank’s ATM → Let’s go to the [MASK] ATM
● 10% are replaced with a random word from the vocabulary.
Let’s go to the bank’s ATM → Let’s go to the boo ATM
● 10% are left unchanged.
Let’s go to the bank’s ATM → Let’s go to the bank’s ATM
→ Keeps the representation biased toward the actual observed word.
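The 80-10-10 rule is easy to state as code. Below is a minimal sketch in plain Python; the token list and the tiny vocabulary are illustrative assumptions, not BERT's WordPiece vocabulary.

```python
# Minimal sketch of MLM token selection with 80-10-10 corruption.
import random

def corrupt(tokens, vocab, mask_rate=0.15):
    inputs, targets = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:        # ~15% of tokens are selected for prediction
            targets[i] = tok                   # the loss is computed only at these positions
            r = random.random()
            if r < 0.8:                        # 80%: replace with [MASK]
                inputs[i] = "[MASK]"
            elif r < 0.9:                      # 10%: replace with a random vocabulary word
                inputs[i] = random.choice(vocab)
            # remaining 10%: keep the original token unchanged
    return inputs, targets

tokens = "let 's go to the bank 's atm".split()
print(corrupt(tokens, vocab=["boo", "store", "gallon", "river"]))
```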
Handling relationships between multiple
sentences:
QQP (Quora Question Pairs):
Q1: Where can I learn to invest in stocks?
Q2: How can I learn more about stocks?
{duplicate, not duplicate}
• For sentence pair tasks, use [SEP] to separate the two segments with segment
embeddings
• Add a linear classifier on top of [CLS] representation and introduce C × h new
parameters (C: # of classes, h: hidden size)
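The classifier head really is that small. A minimal sketch follows, assuming torch and Hugging Face transformers with the bert-base-uncased checkpoint; the head here is untrained, so its outputs are meaningless until fine-tuning.

```python
# Minimal sketch: sentence-pair classification with a C x h linear head on [CLS].
import torch
from torch import nn
from transformers import AutoTokenizer, AutoModel

C, h = 2, 768                                    # {duplicate, not duplicate}; BERT-base hidden size
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(h, C)                     # the only new parameters: C x h weights (+ C biases)

q1 = "Where can I learn to invest in stocks?"
q2 = "How can I learn more about stocks?"
inputs = tokenizer(q1, q2, return_tensors="pt")  # inserts [SEP] and sets segment (token_type) ids
with torch.no_grad():
    cls = encoder(**inputs).last_hidden_state[:, 0]  # the [CLS] token is position 0
print(classifier(cls).softmax(dim=-1))           # class probabilities (untrained head)
```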
Token-Level Tasks
● Extractive question answering (e.g., SQuAD): predict the answer as a span of the passage.
[Figure: SQuAD example whose answer span is "MetLife Stadium".]
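For span extraction, the task-specific parameters are similarly tiny: one linear layer scores every token as a potential answer start and end. A minimal PyTorch sketch follows; the random hidden states stand in for BERT's encoder output, and the shapes are illustrative assumptions.

```python
# Minimal sketch of a SQuAD-style span-extraction head: start/end logits per token.
import torch
from torch import nn

hidden_size, seq_len = 768, 16
qa_head = nn.Linear(hidden_size, 2)                     # 2 scores per token: start, end

hidden_states = torch.randn(1, seq_len, hidden_size)    # stand-in for BERT's output
start_logits, end_logits = qa_head(hidden_states).split(1, dim=-1)
start = start_logits.squeeze(-1).argmax(dim=-1).item()  # predicted start position
end = end_logits.squeeze(-1).argmax(dim=-1).item()      # predicted end position
print(start, end)                                       # tokens in [start, end] form the answer span
```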
- MLM >> left-to-right LMs
- NSP (Next Sentence Prediction) improves performance on some tasks
- Later work (Joshi et al., 2020; Liu et al., 2019) argued that NSP is not useful.
Ablation Study: Model Size
The empirical results from BERT are great, but the biggest
impact on the field is:
- With pre-training, bigger == better, without clear limits
so far.
After BERT
RoBERTa (Liu et al., 2019)
Alammar, Jay. "The Illustrated BERT, ELMo, and Co. (How NLP Cracked Transfer Learning)." jalammar.github.io/illustrated-bert/. Accessed 23 Oct. 2024.