Modern Natural Language Processing:
Where do we go from here?
Antoine Bosselut
Section Outline
• Advances: NLP Successes, Pretraining, Scale
• New Problems: Prompting, Knowledge & Reasoning, Retrieval-Augmentation,
Robustness, Multimodality
Core Methods
Word Embeddings
• Words and other tokens become vectors; no longer discrete symbols!
• Need to define a vocabulary of words (or token types) V that our system can map to vectors
• Word embeddings can be learned in a self-supervised manner from large
quantities of raw text
• Learning word embeddings from scratch using labeled data for a task is data-
inefficient!
• Three main algorithms: Continuous Bag of Words (CBOW), Skip-gram, and
GloVe
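As a rough illustration, a minimal skip-gram-style training loop in PyTorch (the toy corpus, window size, and dimensions are made up for the example):

```python
import torch
import torch.nn as nn

# Toy corpus and vocabulary (illustrative only)
corpus = "i really enjoyed the movie we watched on saturday".split()
vocab = sorted(set(corpus))
word2id = {w: i for i, w in enumerate(vocab)}

# Skip-gram training pairs: (center word, context word) within a +/-2 window
pairs = [(word2id[corpus[i]], word2id[corpus[j]])
         for i in range(len(corpus))
         for j in range(max(0, i - 2), min(len(corpus), i + 3)) if j != i]

center_emb = nn.Embedding(len(vocab), 16)   # input (center) embeddings
context_emb = nn.Embedding(len(vocab), 16)  # output (context) embeddings
optimizer = torch.optim.Adam(list(center_emb.parameters()) + list(context_emb.parameters()), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(100):
    centers = torch.tensor([c for c, _ in pairs])
    contexts = torch.tensor([o for _, o in pairs])
    # Score every vocabulary word against each center word, then maximise the
    # probability of the observed context word (softmax over the vocabulary)
    logits = center_emb(centers) @ context_emb.weight.T
    loss = loss_fn(logits, contexts)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(center_emb(torch.tensor(word2id["movie"])))  # learned vector for "movie"
```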
LMs & RNNs
• Language models learn to estimate the distribution over the next word given a
context
• Early neural LMs (and n-gram models) suffered from fixed context windows
• Recurrent neural networks can theoretically learn to model an unbounded
context length
• no increase in model size because weights are shared across time steps
• Practically, however, vanishing gradients stop vanilla RNNs from learning useful
long-range dependencies
• LSTMs are variants of recurrent networks that mitigate the vanishing gradient
problem
• used for many sequence-to-sequence tasks
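A minimal LSTM language-model sketch in PyTorch (illustrative sizes): the same weights are applied at every time step, and the model predicts a distribution over the next token.

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Toy LSTM LM: embed tokens, run an LSTM over time, predict the next token."""
    def __init__(self, vocab_size: int, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):                 # (batch, seq_len)
        x = self.embed(token_ids)                 # (batch, seq_len, embed_dim)
        h, _ = self.lstm(x)                       # (batch, seq_len, hidden_dim)
        return self.out(h)                        # logits over the next token at each position

model = LSTMLanguageModel(vocab_size=1000)
tokens = torch.randint(0, 1000, (2, 10))          # fake batch of token ids
logits = model(tokens[:, :-1])                    # predict token t+1 from tokens <= t
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 1000), tokens[:, 1:].reshape(-1))
```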
Transformers
• Temporal Bottleneck: Vanishing gradients stop many RNN architectures from
learning long-range dependencies
• Parallelisation Bottleneck: Each RNN state depends on the previous time step's hidden state, so states must be computed in series
• Attention: Direct connections between output states and inputs (solves
temporal bottleneck)
• Self-Attention: Remove recurrence over input, allowing parallel computation
for encoding
• Transformers use self-attention to encode sequences, but now require position
embeddings to capture sequence order
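A minimal single-head scaled dot-product self-attention sketch in PyTorch (illustrative dimensions; real Transformers add multiple heads, masking, residual connections, and the position embeddings mentioned above):

```python
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention (no mask, no heads)."""
    def __init__(self, d_model: int = 64):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)

    def forward(self, x):                          # x: (batch, seq_len, d_model)
        Q, K, V = self.q(x), self.k(x), self.v(x)
        scores = Q @ K.transpose(-2, -1) / math.sqrt(Q.size(-1))  # (batch, seq, seq)
        attn = scores.softmax(dim=-1)              # each position attends to every other position
        return attn @ V                            # weighted sum of value vectors

x = torch.randn(2, 5, 64)                          # 2 sequences of 5 token vectors
out = SelfAttention()(x)                           # same shape as x: (2, 5, 64)
```

Because every position is computed from the same input tensor, all positions can be encoded in parallel, which is what removes the parallelisation bottleneck.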
Text Generation
• Text generation is the foundation of many useful NLP applications (e.g.,
translation, summarisation, dialogue systems)
• Autoregressive: models generate one token at a time, using the context and previously generated tokens as inputs to generate the next token
• Teacher forcing is the standard algorithm for training text generators
• A variety of decoding algorithms can be used to generate text from models,
each trading off expected quality vs. diversity in different ways.
• Automatic evaluation of NLG systems (content overlap, model-based, human)
is difficult as most metrics fall short of reliable estimates of output quality
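A rough sketch of two such decoding loops, greedy decoding and temperature sampling; `model` here stands in for any autoregressive LM that maps token ids to next-token logits (not a production decoder):

```python
import torch

def decode(model, prefix_ids, max_new_tokens=20, temperature=0.0, eos_id=None):
    """Greedy decoding when temperature == 0, otherwise temperature sampling.
    `model` is assumed to map a (1, t) tensor of token ids to (1, t, vocab) logits."""
    ids = prefix_ids.clone()
    for _ in range(max_new_tokens):
        logits = model(ids)[0, -1]                       # distribution over the next token
        if temperature == 0.0:
            next_id = logits.argmax()                    # greedy: highest-probability token
        else:
            probs = (logits / temperature).softmax(dim=-1)
            next_id = torch.multinomial(probs, 1)[0]     # sampling: more diverse output
        ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
        if eos_id is not None and next_id.item() == eos_id:
            break
    return ids
```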
Deep Learning Successes in NLP
Question
What did these ingredients propel?
Pretraining
A massive text corpus is used to learn a Transformer language model
(Radford et al., 2018, 2019, many others)
Pretraining: Two Approaches
• Causal (left-to-right) language modeling: "I really enjoyed the movie we watched on ____" (Radford et al., 2018, 2019, many others)
• Masked language modeling: "I really enjoyed the ____ we watched on Saturday!" (Devlin et al., 2018; Liu et al., 2020)
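A sketch contrasting how the two objectives build training targets from the same sentence (the token ids and mask id below are placeholders, not from any real tokenizer):

```python
import torch

tokens = torch.tensor([5, 23, 87, 12, 40, 9])   # a sentence as (made-up) token ids
MASK_ID = 103                                    # placeholder mask-token id

# Causal (left-to-right) LM: predict token t+1 from tokens <= t
causal_inputs, causal_targets = tokens[:-1], tokens[1:]

# Masked LM: corrupt a random subset of positions and predict only those
masked_inputs = tokens.clone()
is_masked = torch.rand(len(tokens)) < 0.15       # ~15% of positions, as in BERT
masked_inputs[is_masked] = MASK_ID
mlm_targets = torch.where(is_masked, tokens, torch.full_like(tokens, -100))  # -100 = ignored by the loss
```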
Fine-tuning a single model
• Prepend special token [CLS]: Classify output embedding for this token
• Can use same model for classification tasks, sentence pair tasks, sequence
labelling tasks, and many more!
Devlin et al. (2019)
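A sketch of [CLS]-based fine-tuning with the Hugging Face transformers library (the 2-way classifier head is an arbitrary example, not part of the original slides):

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")
classifier = nn.Linear(encoder.config.hidden_size, 2)   # e.g. 2-way sentiment

batch = tokenizer(["I really enjoyed the movie!"], return_tensors="pt")
hidden = encoder(**batch).last_hidden_state             # (batch, seq_len, hidden)
cls_vector = hidden[:, 0]                                # embedding of the prepended [CLS] token
logits = classifier(cls_vector)                          # fine-tune encoder + classifier jointly
```

The same encoder can be reused across tasks; only the small task-specific head (and the input format) changes.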
Pretraining Improvements!
Superhuman results on benchmark datasets!
All top models use pretrained transformers!
Scale: Parameters
[Plot: # parameters in model over time]
Scale: Data
ELMo: 1B training tokens
BERT: 3.3B training tokens
RoBERTa: ~30B training tokens
Slide Credit: Mohit Iyyer
Scale: FLOPs
Slide Credit: Mohit Iyyer
Scaling Laws
Kaplan et al. (2020)
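Kaplan et al. (2020) find that test loss follows power laws in model size, data size, and compute (when the other factors are not the bottleneck). Schematically, with the constants and exponents fitted empirically (values omitted here):

```latex
% Test loss as a power law in model size N, dataset size D, and compute C;
% N_c, D_c, C_c and the exponents \alpha_N, \alpha_D, \alpha_C are fitted constants.
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}
```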
Question
Why do we want to make these models as big as possible?
Efficient tuning: LoRA
• During fine-tuning:
- Keep all pretrained parameters frozen
- LoRA: initialise a new low-rank feedforward net (FFN) alongside components of the transformer blocks
- Keep these FFN layers limited in number of parameters
‣ Next to the frozen pretrained weight W ∈ ℝ^{d×d}, the added layers compute h = f(x) = Wx + BAx, with A ∈ ℝ^{r×d} initialised from 𝒩(0, σ²) and B ∈ ℝ^{d×r} initialised to 0
‣ # parameters in the FFN layers is 2 · d · r, so keep r small
‣ r is the hidden dimension of the FFN
- Only update these FFN layers
Hu et al. (2021)
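A minimal sketch of a LoRA-style linear layer in PyTorch (the rank, σ, and sizes are illustrative; real implementations also apply a scaling factor and handle the pretrained bias, omitted here for brevity):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained weight W plus a trainable low-rank update B @ A."""
    def __init__(self, pretrained: nn.Linear, r: int = 8, sigma: float = 0.01):
        super().__init__()
        d_out, d_in = pretrained.weight.shape
        self.weight = nn.Parameter(pretrained.weight.detach(), requires_grad=False)  # frozen W
        self.A = nn.Parameter(torch.randn(r, d_in) * sigma)   # A ~ N(0, sigma^2)
        self.B = nn.Parameter(torch.zeros(d_out, r))          # B = 0, so B @ A = 0 at initialisation

    def forward(self, x):
        # h = W x + B A x ; only A and B (2 * d * r parameters when d_in = d_out = d) get gradients
        return x @ self.weight.T + x @ self.A.T @ self.B.T

layer = LoRALinear(nn.Linear(768, 768), r=8)
out = layer(torch.randn(2, 768))
```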
In-context Learning: A new paradigm!
• At very large scale, language models exhibit emergent in-context learning abilities
• Providing examples as input that depict the desired behaviour is enough for the model to replicate it (see the prompt sketch below)
• No parameter updates are required, though further training can improve this ability
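A sketch of what such a few-shot prompt might look like (the sentiment task and examples are invented for illustration):

```python
prompt = """Review: The plot was predictable and the acting was flat.
Sentiment: negative

Review: A beautiful, moving film with a stellar cast.
Sentiment: positive

Review: I really enjoyed the movie we watched on Saturday!
Sentiment:"""

# A large LM simply continues the pattern, e.g. completion = lm.generate(prompt),
# with no gradient updates to the model's parameters.
```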
Chain-of-thought Reasoning
Standard Prompting
Input:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
Model Output:
A: The answer is 27.

Chain of Thought Prompting
Input:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
Model Output:
A: The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer is 9.
Model self-rationalizes through text generation
What do those two abilities remind you of?
ChatGPT!
Alignment
Large models can be aligned to new behaviours
Outcome: Many Tasks
Outcome: Personalization
Question
Why are these language models so effective at scale?
Encoded Knowledge
World knowledge is implicitly encoded in LM parameters! (e.g., that barbershops are places to get buzz cuts)
Input to BERT (24-layer Transformer): "Bob went to the <MASK> to get a buzz cut"
Top predictions: barbershop: 54%, barber: 20%, salon: 6%, stylist: 4%, …
Slide Credit: Mohit Iyyer
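The same kind of probe can be run with the Hugging Face fill-mask pipeline (exact probabilities will differ from the figures on the slide):

```python
from transformers import pipeline

fill = pipeline("fill-mask", model="bert-base-uncased")
for pred in fill("Bob went to the [MASK] to get a buzz cut."):
    print(f"{pred['token_str']:>12s}  {pred['score']:.2%}")   # top predicted fillers and their probabilities
```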
Question
Why might this be a bad idea?
[Figure slide] Guu et al., 2020 ("REALM")
Question
What could we do instead?
Retrieval-Augmented LLMs
Lewis et al., 2020; Chang et al., 2020; Borgeaud et al., 2021
[Figure slides; slide credit: Mohit Iyyer]
Retrieval-Augmented LLMs
• Text corpora are readily available at scale and require no processing
• We have powerful methods for encoding text (e.g., BERT)
• However, these methods don’t really work yet with larger units of text (e.g.,
books)
• “Long-context” NLP is an active area of research!
Slide credit: Mohit Iyyer
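A rough sketch of the retrieve-then-read pattern, assuming the sentence-transformers library as the text encoder (the model name and corpus are illustrative; large-scale systems use approximate nearest-neighbour indexes such as FAISS):

```python
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")        # illustrative encoder choice
corpus = [
    "EPFL is a technical university in Lausanne, Switzerland.",
    "Barbershops are places where people get haircuts such as buzz cuts.",
]
corpus_emb = encoder.encode(corpus, convert_to_tensor=True)

query = "Where can Bob get a buzz cut?"
query_emb = encoder.encode(query, convert_to_tensor=True)
best = util.cos_sim(query_emb, corpus_emb).argmax().item()   # nearest passage by cosine similarity

# Condition the LM on the retrieved passage instead of relying only on its parameters
prompt = f"Context: {corpus[best]}\n\nQuestion: {query}\nAnswer:"
```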
Recap Question
Why might we want to integrate retrieval components?
Remaining Problems!
Robustness
Deep learning models exploit biases (Bolukbasi et al., 2016), annotation
artifacts (Gururangan et al., 2018), surface patterns (Li & Gauthier, 2017), etc.
They struggle to learn robust understanding abilities
“All the impressive achievements of deep
learning amount to just curve fitting”
(Pearl, 2018)
Multimodality
Image captioning with attention
Xu et al., "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
Multimodality
Masked language modeling extended to paired image and text inputs: "I really enjoyed the ____ we watched on Saturday!"
Lu et al., 2019
Multimodality
• CLIP: using natural language training to improve computer vision (https://openai.com/blog/clip/)
• DALL-E: learning to generate images from natural language descriptions (https://openai.com/blog/dall-e/)
Thanks for a great semester!