Modern Natural Language Processing:

Where do we go from here?


Antoine Bosselut
Section Outline

• Advances: NLP Successes, Pretraining, Scale

• New Problems: Prompting, Knowledge & Reasoning, Retrieval-Augmentation, Robustness, Multimodality
Core Methods

Word Embeddings

• Words and other tokens become vectors; no longer discrete symbols!


• Need to define a vocabulary of words (or token types) V that our system can map to vectors
• Word embeddings can be learned in a self-supervised manner from large quantities of raw text
• Learning word embeddings from scratch using labeled data for a task is data-inefficient!
• Three main algorithms: Continuous Bag of Words (CBOW), Skip-gram, and GloVe (a minimal sketch follows below)
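To make the self-supervised idea concrete, here is a minimal PyTorch sketch of skip-gram-style training with negative sampling. The vocabulary size, dimensions, and the way (center, context, negative) ids are sampled are placeholder assumptions, and the full word2vec recipe (windowing, subsampling) is omitted.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    V, dim = 10_000, 100                    # assumed vocabulary size and embedding dimension
    center = nn.Embedding(V, dim)           # embeddings for center words
    context = nn.Embedding(V, dim)          # embeddings for context words
    opt = torch.optim.Adam(list(center.parameters()) + list(context.parameters()), lr=1e-3)

    def step(center_ids, context_ids, negative_ids):
        """One update: pull true (center, context) pairs together, push sampled negatives apart."""
        c = center(center_ids)                                               # (B, dim)
        pos = (c * context(context_ids)).sum(-1)                             # (B,)
        neg = torch.bmm(context(negative_ids), c.unsqueeze(-1)).squeeze(-1)  # (B, K)
        loss = -(F.logsigmoid(pos) + F.logsigmoid(-neg).sum(-1)).mean()
        opt.zero_grad(); loss.backward(); opt.step()
        return loss.item()
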
LMs & RNNs

• Language models learn to estimate the distribution over the next word given a context
• Early neural LMs (and n-gram models) suffered from fixed context windows
• Recurrent neural networks can theoretically learn to model an unbounded context length
  • no increase in model size because weights are shared across time steps
• Practically, however, vanishing gradients stop vanilla RNNs from learning useful long-range dependencies
• LSTMs are variants of recurrent networks that mitigate the vanishing gradient problem (a minimal sketch follows below)
  • used for many sequence-to-sequence tasks
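A minimal sketch of an LSTM language model in PyTorch, just to illustrate the shared-weights-across-time idea; the vocabulary size and dimensions are arbitrary and the random tokens stand in for a real tokenised corpus.

    import torch
    import torch.nn as nn

    class LSTMLanguageModel(nn.Module):
        """Toy LSTM LM: estimates a distribution over the next token at every position."""
        def __init__(self, vocab_size: int, dim: int = 256):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, dim)
            self.lstm = nn.LSTM(dim, dim, batch_first=True)   # same weights at every time step
            self.out = nn.Linear(dim, vocab_size)

        def forward(self, token_ids):                   # token_ids: (batch, seq_len)
            states, _ = self.lstm(self.embed(token_ids))
            return self.out(states)                     # next-token logits at each position

    model = LSTMLanguageModel(vocab_size=10_000)
    tokens = torch.randint(0, 10_000, (2, 16))          # stand-in for real data
    loss = nn.CrossEntropyLoss()(model(tokens[:, :-1]).flatten(0, 1), tokens[:, 1:].flatten())
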
Transformers

• Temporal Bottleneck: Vanishing gradients stop many RNN architectures from learning long-range dependencies
• Parallelisation Bottleneck: RNN states depend on the previous time step's hidden state, so must be computed in series
• Attention: Direct connections between output states and inputs (solves temporal bottleneck)
• Self-Attention: Remove recurrence over input, allowing parallel computation for encoding (a minimal sketch follows below)
• Transformers use self-attention to encode sequences, but now require position embeddings to capture sequence order
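A minimal single-head scaled dot-product self-attention sketch (not a full Transformer block): every position attends to every other in one parallel matrix product, which is also why position embeddings must be added to the inputs.

    import torch
    import torch.nn.functional as F

    def self_attention(x, Wq, Wk, Wv):
        """Single-head scaled dot-product self-attention; x is (seq_len, d)."""
        Q, K, V = x @ Wq, x @ Wk, x @ Wv                 # queries, keys, values for all positions
        scores = Q @ K.T / (K.shape[-1] ** 0.5)          # every position attends to every other
        return F.softmax(scores, dim=-1) @ V             # computed in parallel, no recurrence

    d = 64
    x = torch.randn(10, d) + torch.randn(10, d)          # token embeddings + position embeddings
    out = self_attention(x, *(torch.randn(d, d) for _ in range(3)))
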
Text Generation

• Text generation is the foundation of many useful NLP applications (e.g., translation, summarisation, dialogue systems)
• Autoregressive: models generate one token at a time, using the context and previously generated tokens as inputs to generate the next token
• Teacher forcing is the premier algorithm for training text generators
• A variety of decoding algorithms can be used to generate text from models, each trading off expected quality vs. diversity in different ways (a small decoding sketch follows below)
• Automatic evaluation of NLG systems (content overlap, model-based, human) is difficult as most metrics fall short of reliable estimates of output quality
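To make the quality-vs-diversity trade-off concrete, here is a small sketch of three common decoding strategies applied to one step of next-token logits; the function name and defaults are illustrative, not from any specific library.

    import torch

    def decode_step(logits, strategy="greedy", temperature=1.0, k=50):
        """Choose the next token id from one step of logits (shape: vocab_size)."""
        if strategy == "greedy":                          # highest probability, least diverse
            return int(logits.argmax())
        probs = torch.softmax(logits / temperature, dim=-1)
        if strategy == "top_k":                           # sample only among the k most likely tokens
            top_p, top_i = probs.topk(k)
            return int(top_i[torch.multinomial(top_p / top_p.sum(), 1)])
        return int(torch.multinomial(probs, 1))           # ancestral sampling, most diverse
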
Deep Learning Successes in NLP

Question

What did these ingredients propel?

Pretraining

A massive text corpus is used to learn a Transformer language model.

(Radford et al., 2018, 2019, many others)


Pretraining: Two Approaches

Causal (left-to-right) language modeling:
"I really enjoyed the movie we watched on ____"
(Radford et al., 2018, 2019, many others)

Masked language modeling:
"I really enjoyed the ____ we watched on Saturday!"
(Devlin et al., 2018; Liu et al., 2020)
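A toy sketch contrasting how training examples are built for the two objectives; the masking rate and [MASK] string follow the usual BERT-style convention, and the helper functions are illustrative.

    import random

    def causal_lm_example(tokens):
        """Left-to-right LM: predict each token from everything to its left."""
        return tokens[:-1], tokens[1:]                     # inputs, targets shifted by one

    def masked_lm_example(tokens, mask_token="[MASK]", p=0.15):
        """Masked LM: hide a random subset of tokens and predict only the hidden ones."""
        inputs, targets = list(tokens), [None] * len(tokens)
        for i, tok in enumerate(tokens):
            if random.random() < p:
                inputs[i], targets[i] = mask_token, tok
        return inputs, targets

    sentence = "I really enjoyed the movie we watched on Saturday !".split()
    print(causal_lm_example(sentence))
    print(masked_lm_example(sentence))
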
Fine-tuning a single model

• Prepend special token [CLS]: Classify output embedding for this token
• Can use same model for classification tasks, sentence pair tasks, sequence labelling tasks, and many more! (a minimal fine-tuning sketch follows below)

Devlin et al. (2019)
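A minimal sketch of [CLS]-based classification with the Hugging Face transformers API, assuming bert-base-uncased and a two-class task; in practice the encoder and the linear head are fine-tuned jointly on labeled examples.

    from transformers import AutoTokenizer, AutoModel
    import torch

    tok = AutoTokenizer.from_pretrained("bert-base-uncased")
    encoder = AutoModel.from_pretrained("bert-base-uncased")
    classifier = torch.nn.Linear(encoder.config.hidden_size, 2)   # e.g., a 2-class task

    batch = tok(["I really enjoyed the movie!"], return_tensors="pt", padding=True)
    out = encoder(**batch)                       # [CLS] is prepended automatically by the tokenizer
    cls_vec = out.last_hidden_state[:, 0]        # output embedding of the [CLS] token
    logits = classifier(cls_vec)                 # train encoder + head end to end on labels
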
Pretraining Improvements!

Superhuman results on benchmark datasets!

All top models use pretrained transformers!


Scale: Parameters

[Plot omitted: number of parameters in model (y-axis) over time (x-axis)]

Scale: Data
ELMo: 1B training tokens
BERT: 3.3B training tokens
RoBERTa: ~30B training tokens

Slide Credit: Mohit Iyyer
Scale: FLOPs

Slide Credit: Mohit Iyyer


Scaling Laws

Kaplan et al. (2020)
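For reference, the laws reported by Kaplan et al. (2020) take a power-law form: test loss falls smoothly as model size and data grow, with a similar law in training compute (the exponents below are approximate values from the paper).

    L(N) ≈ (N_c / N)^α_N      (loss vs. non-embedding parameters, α_N ≈ 0.076)
    L(D) ≈ (D_c / D)^α_D      (loss vs. dataset size in tokens, α_D ≈ 0.095)
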


Question

Why do we want to make these models as big as possible?

Fine-tuning a single model

• Prepend special token [CLS]: Classify output embedding for this token
• Can use same model for classification tasks, sentence pair tasks, sequence labelling tasks, and many more!

Devlin et al. (2019)
Efficient tuning: LoRA

• During fine-tuning:
  - Keep all pretrained parameters frozen
  - LoRA: initialise a new feedforward net (FFN) alongside components of the transformer blocks
    ‣ next to a frozen pretrained weight matrix W ∈ ℝ^(d×d) sit two low-rank matrices, B ∈ ℝ^(d×r) initialised to 0 and A ∈ ℝ^(r×d) initialised from 𝒩(0, σ²), so that h = f(x) = Wx + BAx
  - Keep these FFN layers limited in number of parameters
    ‣ # parameters in the FFN layers is 2 * d * r, so keep r small
    ‣ r is the hidden dimension of the FFN
  - Only update these FFN layers (a minimal sketch follows below)

Hu et al. (2021)
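A minimal LoRA-style sketch in PyTorch, wrapping a single pretrained linear layer; the rank r and init scale sigma are illustrative hyperparameters, and in practice such adapters are usually attached to the attention projection matrices, with only A and B passed to the optimiser.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        """A frozen pretrained linear layer plus a trainable low-rank update (illustrative)."""
        def __init__(self, pretrained: nn.Linear, r: int = 8, sigma: float = 0.02):
            super().__init__()
            d_out, d_in = pretrained.weight.shape
            self.base = pretrained
            for p in self.base.parameters():               # keep pretrained weights frozen
                p.requires_grad = False
            self.A = nn.Parameter(torch.randn(r, d_in) * sigma)   # A ~ N(0, sigma^2)
            self.B = nn.Parameter(torch.zeros(d_out, r))          # B = 0, so training starts at f(x) = Wx

        def forward(self, x):
            # h = Wx + BAx; only A and B (2 * d * r parameters when d_in = d_out = d) get gradients
            return self.base(x) + x @ self.A.T @ self.B.T

    layer = LoRALinear(nn.Linear(768, 768), r=8)
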
In-context Learning: A new paradigm!

• At very large scale, language models exhibit emergent in-context learning abilities
• Providing examples as input that depict the desired behaviour is enough for the model to replicate it
• No learning required, though learning can improve this ability

Chain-of-thought Reasoning
Standard Prompting

Input:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?

Model Output:
A: The answer is 27.

Chain of Thought Prompting

Input:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?

Model Output:
A: The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer is 9.

Model self-rationalizes through text generation
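Operationally, chain-of-thought prompting is just prompt construction: a worked exemplar is prepended to the new question and the model continues the pattern. The generate() call below is a placeholder for any large LM, not a specific API.

    EXEMPLAR = (
        "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
        "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
        "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
        "5 + 6 = 11. The answer is 11.\n\n"
    )

    def cot_prompt(question: str) -> str:
        # In-context learning: the exemplar's step-by-step answer demonstrates the desired behaviour.
        return EXEMPLAR + f"Q: {question}\nA:"

    prompt = cot_prompt("The cafeteria had 23 apples. If they used 20 to make lunch "
                        "and bought 6 more, how many apples do they have?")
    # answer = generate(prompt)   # `generate` is a placeholder for any LLM text-generation call
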


What do those two abilities remind you of?

ChatGPT!

Alignment

Large Models can be Aligned to new Behaviors


Outcome: Many Tasks

Outcome: Personalization

Question

Why are these language models so effective at scale?

Encoded Knowledge

World knowledge is implicitly encoded in LM parameters! (e.g., that barbershops are places to get buzz cuts)

Input: "Bob went to the <MASK> to get a buzz cut"
BERT (a 24-layer Transformer) predicts for <MASK>:
  barbershop: 54%
  barber: 20%
  salon: 6%
  stylist: 4%

Slide Credit: Mohit Iyyer
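This kind of encoded knowledge can be probed directly with a masked-LM fill-in; here is a short sketch using the Hugging Face fill-mask pipeline (the exact probabilities will differ from the figures above).

    from transformers import pipeline

    fill = pipeline("fill-mask", model="bert-large-uncased")     # a 24-layer BERT
    for pred in fill("Bob went to the [MASK] to get a buzz cut."):
        print(f"{pred['token_str']:>12}  {pred['score']:.1%}")
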
Question

Why might this be a bad idea?

Guu et al., 2020 ("REALM")
Question

What could we do instead?

Retrieval-Augmented LLMs

Lewis et al., 2020; Chang et al., 2020; Borgeaud et al., 2021

[Figure slides omitted. Slide credit: Mohit Iyyer]

Retrieval-Augmented LLMs
• Readily available at scale, requires no processing

• We have powerful methods for encoding text (e.g., BERT); a minimal retrieval sketch follows below

• However, these methods don't really work yet with larger units of text (e.g., books)

• "Long-context" NLP is an active area of research!

Slide credit: Mohit Iyyer
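A minimal sketch of the retrieval-augmentation recipe: embed the query and candidate passages with some text encoder, keep the nearest passages, and prepend them to the prompt. The encode() call is a stand-in for any encoder (e.g., mean-pooled BERT states), not a specific API.

    import numpy as np

    def retrieve(query_vec, passage_vecs, passages, k=3):
        """Return the k passages whose embeddings are most similar to the query (cosine)."""
        sims = passage_vecs @ query_vec / (
            np.linalg.norm(passage_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
        return [passages[i] for i in np.argsort(-sims)[:k]]

    # `encode` is a placeholder for any text encoder.
    # context = "\n".join(retrieve(encode(question), passage_vecs, passages))
    # prompt = context + "\n\nQuestion: " + question   # pass the augmented prompt to the LM
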


Recap Question

Why might we want to integrate retrieval components?

Remaining Problems!

Robustness

Deep learning models exploit biases (Bolukbasi et al., 2016), annotation artifacts (Gururangan et al., 2018), surface patterns (Li & Gauthier, 2017), etc.

They struggle to learn robust understanding abilities.

"All the impressive achievements of deep learning amount to just curve fitting" (Pearl, 2018)

Multimodality
Image captioning with attention

Xu et al., "Show, Attend, and Tell: Neural Image Caption Generation with Visual Attention", ICML 2015
Figure copyright Kelvin Xu, Jimmy Lei Ba, Jamie Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio, 2015. Reproduced with permission.

Multimodality

Masked language modeling:
"I really enjoyed the ____ we watched on Saturday!"

Lu et al., 2019

Multimodality

CLIP: using natural language training to improve computer vision (https://openai.com/blog/clip/)

DALL-E: learning to generate images from natural language descriptions (https://openai.com/blog/dall-e/)

Thanks for a great semester!
