Multi-Task Learning & Transfer Learning Basics

1
Plan for Today
Multi-Task Learning
- Problem statement
- Models, objectives, optimization
- Challenges
- Case study of real-world multi-task learning

Transfer Learning
- Pre-training & fine-tuning

Goals by the end of lecture:


- Know the key design decisions when building multi-task learning systems
- Understand the difference between multi-task learning and transfer learning
- Understand the basics of transfer learning

3
Multi-Task Learning

4
Some notation

[Diagram: a model fθ(y | x) with parameters θ, mapping inputs x (e.g. images of a tiger, cat, or lynx; a paper) to outputs y (e.g. the class label; the length of the paper).]

Single-task learning: 𝒟 = {(x, y)k}
[supervised] min_θ ℒ(θ, 𝒟)

Typical loss: negative log likelihood
ℒ(θ, 𝒟) = − 𝔼(x,y)∼𝒟[log fθ(y | x)]

What is a task? (more formally this time)
A task: 𝒯i ≜ {pi(x), pi(y | x), ℒi}
where pi(x), pi(y | x) are the data-generating distributions and ℒi is the loss.
Corresponding datasets: 𝒟i^tr, 𝒟i^test (will use 𝒟i as shorthand for 𝒟i^tr)
5
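To make the notation concrete, here is a minimal sketch (the class and field names are illustrative assumptions, not from the lecture) of how a task 𝒯i and its datasets might be represented in code:

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

@dataclass
class Task:
    """One task T_i: data sampled from p_i(x), p_i(y | x), plus its loss L_i."""
    train_data: List[Tuple[Any, Any]]        # D_i^tr (used as D_i below)
    test_data: List[Tuple[Any, Any]]         # D_i^test
    loss_fn: Callable[[Any, Any], float]     # L_i, e.g. negative log likelihood
```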
Examples of Tasks

A task: 𝒯i ≜ {pi(x), pi(y | x), ℒi}  (data-generating distributions and loss)
Corresponding datasets: 𝒟i^tr, 𝒟i^test (will use 𝒟i as shorthand for 𝒟i^tr)

Multi-task classification: ℒi is the same across all tasks
- e.g. per-language handwriting recognition
- e.g. personalized spam filter

Multi-label learning: ℒi and pi(x) are the same across all tasks
- e.g. CelebA attribute recognition
- e.g. scene understanding

When might ℒi vary across tasks?
- mixed discrete, continuous labels across tasks
- multiple metrics that you care about
6
[Diagram: a model fθ(y | x) with parameters θ, extended to a task-conditioned model fθ(y | x, zi) that can predict, e.g., the length of a paper, a summary of the paper, or a paper review from the same input x.]

zi: task descriptor
- e.g. one-hot encoding of the task index
- or, whatever meta-data you have:
  - personalization: user features/attributes
  - language description of the task
  - formal specifications of the task

Vanilla MTL Objective: min_θ ∑_{i=1}^T ℒi(θ, 𝒟i)

Decisions on the model, the objective, and the optimization:
- How should we condition on zi?
- What objective should we use?
- How to optimize our objective?
7
Model: How should the model be conditioned on zi? What parameters of the model should be shared?

Objective: How should the objective be formed?

Optimization: How should the objective be optimized?

8
Conditioning on the task

Let’s assume zi is the one-hot task index.

Question: How should you condition on the task in order to share as little as possible?

9
Conditioning on the task
[Diagram: zi gates among T separate networks x → y1, x → y2, …, x → yT.]

multiplicative gating:  y = ∑_j 1(zi = j) yj

—> independent training within a single network, with no shared parameters!
10
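A minimal PyTorch-style sketch of this extreme (class name, layer sizes, and hidden width are assumptions): one sub-network per task, combined by multiplicative gating on a one-hot zi, so no parameters are shared across tasks.

```python
import torch
import torch.nn as nn

class IndependentNetworks(nn.Module):
    """y = sum_j 1(z_i = j) * y_j: one sub-network per task, selected by a
    one-hot task index via multiplicative gating; no shared parameters."""
    def __init__(self, num_tasks: int, in_dim: int, out_dim: int, hidden: int = 64):
        super().__init__()
        self.nets = nn.ModuleList([
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))
            for _ in range(num_tasks)
        ])

    def forward(self, x: torch.Tensor, z_onehot: torch.Tensor) -> torch.Tensor:
        # z_onehot: (batch, num_tasks) one-hot task indicator.
        ys = torch.stack([net(x) for net in self.nets], dim=1)   # (batch, T, out_dim)
        return (z_onehot.unsqueeze(-1) * ys).sum(dim=1)          # gate by z_i and sum
```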
The other extreme

Concatenate zi with the input and/or activations.

All parameters are shared (except the parameters directly following zi, if zi is one-hot).

11
An Alternative View on the Multi-Task Architecture

Split θ into shared parameters θ^sh and task-specific parameters θ^i.

Then, our objective is: min_{θ^sh, θ^1, …, θ^T} ∑_{i=1}^T ℒi({θ^sh, θ^i}, 𝒟i)

Choosing how to condition on zi is equivalent to choosing how & where to share parameters.

12
Conditioning: Some Common Choices

1. Concatenation-based conditioning
2. Additive conditioning

These are actually equivalent! Question: why are they the same thing? (raise your hand)
(Hint: consider concat followed by a fully-connected layer.)

Diagram sources: distill.pub/2018/feature-wise-transformations/
13
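A short worked sketch of the equivalence: concatenating zi with x and applying a fully-connected layer is the same as adding a task-dependent bias to a linear transform of x, i.e. additive conditioning (the block-matrix split of W is the only assumption).

```latex
W \begin{bmatrix} x \\ z_i \end{bmatrix} + b
  = \begin{bmatrix} W_x & W_z \end{bmatrix} \begin{bmatrix} x \\ z_i \end{bmatrix} + b
  = W_x x + \underbrace{W_z z_i + b}_{\text{additive task conditioning}}
```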


Conditioning: Some Common Choices

3. Multi-head architecture (Ruder ‘17)
4. Multiplicative conditioning

Why might multiplicative conditioning be a good idea?
- more expressive per layer
- recall: multiplicative gating

Multiplicative conditioning generalizes independent networks and independent heads.

Diagram sources: distill.pub/2018/feature-wise-transformations/
14
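For concreteness, a minimal sketch of the multi-head architecture (class name, layer sizes, and hidden width are assumptions): a shared trunk θ^sh followed by one small task-specific head θ^i per task.

```python
import torch
import torch.nn as nn

class MultiHead(nn.Module):
    """Shared trunk (theta_sh) + one task-specific output head (theta_i) per task."""
    def __init__(self, num_tasks: int, in_dim: int, out_dim: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.heads = nn.ModuleList([nn.Linear(hidden, out_dim) for _ in range(num_tasks)])

    def forward(self, x: torch.Tensor, task_idx: int) -> torch.Tensor:
        return self.heads[task_idx](self.trunk(x))   # shared features, task-specific head
```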
Conditioning: More Complex Choices

Cross-Stitch Networks. Misra, Shrivastava, Gupta, Hebert ‘16
Deep Relation Networks. Long, Wang ‘15
Multi-Task Attention Network. Liu, Johns, Davison ‘18
Perceiver IO. Jaegle et al. ‘21

15
Conditioning Choices

Unfortunately, these design decisions are like neural network architecture tuning:
- problem dependent
- largely guided by intuition or knowledge of the problem
- currently more of an art than a science

16
Model: How should the model be conditioned on zi? What parameters of the model should be shared?

Objective: How should the objective be formed?

Optimization: How should the objective be optimized?

17
Vanilla MTL Objective: min_θ ∑_{i=1}^T ℒi(θ, 𝒟i)

Often want to weight tasks differently: min_θ ∑_{i=1}^T wi ℒi(θ, 𝒟i)

How to choose wi?
- manually, based on importance or priority
- dynamically adjust throughout training:
  a. various heuristics, e.g. encourage gradients to have similar magnitudes
     (Chen et al. GradNorm. ICML 2018)
  b. optimize for the worst-case task loss: min_θ max_i ℒi(θ, 𝒟i)
     (e.g. for task robustness, or for fairness)

18
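A minimal sketch of the two objectives above (the helper function names are hypothetical); each per-task loss ℒi is assumed to already be computed as a scalar tensor.

```python
from typing import List
import torch

def weighted_mtl_loss(task_losses: List[torch.Tensor], weights: List[float]) -> torch.Tensor:
    # min_theta sum_i w_i * L_i(theta, D_i); weights set manually or adjusted during training.
    return torch.stack([w * l for w, l in zip(weights, task_losses)]).sum()

def worst_case_mtl_loss(task_losses: List[torch.Tensor]) -> torch.Tensor:
    # min_theta max_i L_i(theta, D_i): focus on the worst-off task (robustness / fairness).
    return torch.stack(task_losses).max()
```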
Model: How should the model be conditioned on zi? What parameters of the model should be shared?

Objective: How should the objective be formed?

Optimization: How should the objective be optimized?

19
Optimizing the objective

Vanilla MTL Objective: min_θ ∑_{i=1}^T ℒi(θ, 𝒟i)

Basic Version:
1. Sample mini-batch of tasks ℬ ∼ {𝒯i}
2. Sample mini-batch of datapoints for each task 𝒟i^b ∼ 𝒟i
3. Compute loss on the mini-batch: ℒ̂(θ, ℬ) = ∑_{𝒯k∈ℬ} ℒk(θ, 𝒟k^b)
4. Backpropagate loss to compute gradient ∇θ ℒ̂
5. Apply gradient with your favorite neural net optimizer (e.g. Adam)
(a code sketch of this loop follows below)

Note: This ensures that tasks are sampled uniformly, regardless of data quantities.
Tip: For regression problems, make sure your task labels are on the same scale!
20
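A minimal sketch of this basic version as a PyTorch-style training step; `model`, the task objects, `sample_batch`, and the `descriptor` argument are assumptions for illustration, not part of the lecture.

```python
import random
import torch

def mtl_train_step(model, tasks, optimizer, tasks_per_step=4, batch_size=32):
    # 1. Sample a mini-batch of tasks uniformly, regardless of dataset sizes.
    task_batch = random.sample(tasks, tasks_per_step)
    total_loss = torch.zeros(())
    for task in task_batch:
        # 2. Sample a mini-batch of datapoints for this task.
        xs, ys = task.sample_batch(batch_size)
        # 3. Accumulate the per-task loss into the mini-batch loss.
        total_loss = total_loss + task.loss_fn(model(xs, task.descriptor), ys)
    # 4. Backpropagate the mini-batch loss.
    optimizer.zero_grad()
    total_loss.backward()
    # 5. Apply the gradient with your favorite optimizer (e.g. Adam).
    optimizer.step()
    return total_loss.item()
```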
Challenges

21
Challenge #1: Negative transfer

Negative transfer: Sometimes independent networks work the best.

[Figure: results on Multi-Task CIFAR-100, grouping recent approaches into multi-head architectures, a cross-stitch architecture, and independent training.]
(Yu et al. Gradient Surgery for Multi-Task Learning. 2020)

Why?
- optimization challenges
  - caused by cross-task interference
  - tasks may learn at different rates
- limited representational capacity
  - multi-task networks often need to be much larger than their single-task counterparts
22
If you have negative transfer, share less across tasks.
It’s not just a binary decision!
min_{θ^sh, θ^1, …, θ^T} ∑_{i=1}^T ℒi({θ^sh, θ^i}, 𝒟i) + ∑_{t'=1}^T ∥θ^t − θ^{t'}∥

“soft parameter sharing”

[Diagram: per-task networks x → y1, …, x → yT whose corresponding weights are softly constrained to stay close to one another.]

+ allows for more fluid degrees of parameter sharing
- yet another set of design decisions / hyperparameters
23
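A minimal sketch of the soft parameter sharing penalty (the function name and the choice of L2 norm are assumptions); the penalty would be added, scaled by a coefficient, to the sum of task losses. It assumes all task networks share the same architecture so parameters can be compared element-wise.

```python
from typing import List
import torch
import torch.nn as nn

def soft_sharing_penalty(task_models: List[nn.Module]) -> torch.Tensor:
    """Sum over pairs of task networks of the norm of their parameter differences."""
    penalty = torch.zeros(())
    for i, model_i in enumerate(task_models):
        for model_j in task_models[i + 1:]:
            for p_i, p_j in zip(model_i.parameters(), model_j.parameters()):
                penalty = penalty + (p_i - p_j).norm()
    return penalty

# total_loss = sum(task_losses) + lam * soft_sharing_penalty(task_models)
```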
Challenge #2: Overfitting
You may not be sharing enough!

Multi-task learning <-> a form of regularization

Solution: Share more.

24
Challenge #3: What if you have a lot of tasks?
Should you train all of them together? Which ones will be complementary?

The bad news: No closed-form solution for measuring task similarity.


The good news: There are ways to approximate it from one training run.

Fifty, Amid, Zhao, Yu, Anil, Finn. Efficiently Identifying Task Groupings for Multi-Task Learning. 2021

25
Case study

Goal: Make recommendations for YouTube

27
Case study

Goal: Make recommendations for YouTube

Conflicting objectives:
- videos that users will rate highly
- videos that users will share
- videos that users will watch

Implicit bias caused by feedback: a user may have watched a video because it was recommended!

28
Framework Set-Up
Input: what the user is currently watching (query video) + user features
1. Generate a few hundred candidate videos
2. Rank candidates
3. Serve the top-ranking videos to the user

Candidate videos: pool videos from multiple candidate generation algorithms
- matching topics of the query video
- videos most frequently watched with the query video
- and others

Ranking: the central topic of this paper

29
The Ranking Problem
Input: query video, candidate video, user & context features

Model output: engagement and satisfaction with the candidate video

Engagement:
- binary classification tasks, like clicks
- regression tasks related to time spent

Satisfaction:
- binary classification tasks, like clicking “like”
- regression tasks, such as rating

Weighted combination of engagement & satisfaction predictions -> ranking score
(score weights manually tuned)

Question: Are these objectives reasonable? What are some of the issues that might come up?
30
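A minimal sketch of that final scoring step (the prediction names and weight values are hypothetical; in the paper the weights are manually tuned).

```python
from typing import Dict

def ranking_score(predictions: Dict[str, float], weights: Dict[str, float]) -> float:
    # Weighted combination of engagement & satisfaction predictions -> ranking score.
    return sum(weights[name] * p for name, p in predictions.items())

score = ranking_score(
    predictions={"click": 0.82, "expected_watch_time": 0.45, "like": 0.10},
    weights={"click": 1.0, "expected_watch_time": 2.0, "like": 1.5},
)
```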
The Architecture
Basic option: “Shared-Bottom Model” (i.e. multi-head architecture)

-> can harm learning when correlation between tasks is low

31
The Architecture

Instead: use a form of soft parameter sharing, the “Multi-gate Mixture-of-Experts (MMoE)”.
Allow different parts of the network (expert neural networks) to “specialize”.

- Decide which expert to use for input x, task k
- Compute features from the selected expert
- Compute the output

32
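A minimal PyTorch-style sketch of the MMoE idea (class name, layer sizes, and expert/gate widths are assumptions): shared experts, plus one softmax gate per task that mixes expert features before the task's own head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MMoE(nn.Module):
    """Shared experts + per-task softmax gates + per-task output heads."""
    def __init__(self, num_experts: int, num_tasks: int, in_dim: int, hidden: int, out_dim: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU()) for _ in range(num_experts)])
        self.gates = nn.ModuleList([nn.Linear(in_dim, num_experts) for _ in range(num_tasks)])
        self.heads = nn.ModuleList([nn.Linear(hidden, out_dim) for _ in range(num_tasks)])

    def forward(self, x: torch.Tensor, task_k: int) -> torch.Tensor:
        # Decide how much each expert contributes for input x and task k.
        gate = F.softmax(self.gates[task_k](x), dim=-1).unsqueeze(-1)   # (B, E, 1)
        # Compute features as the gate-weighted mixture of expert outputs.
        expert_outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, E, hidden)
        features = (gate * expert_outs).sum(dim=1)                      # (B, hidden)
        # Compute the output with the task-specific head.
        return self.heads[task_k](features)
```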
Experiments

Set-Up:
- Implementation in TensorFlow, TPUs
- Train in temporal order, running training continuously to consume newly arriving data
- Offline AUC & squared error metrics
- Online A/B testing in comparison to the production system
  - live metrics based on time spent, survey responses, rate of dismissals
- Model computational efficiency matters

Results:
- Found a 20% chance of gating polarization during distributed training -> use drop-out on the experts

33
Plan for Today
Multi-Task Learning
- Problem statement
- Models & training
- Challenges
- Case study of real-world multi-task learning

Transfer Learning
- Pre-training & fine-tuning

34
Multi-Task Learning vs. Transfer Learning

Multi-Task Learning:
Solve multiple tasks 𝒯1, ⋯, 𝒯T at once: min_θ ∑_{i=1}^T ℒi(θ, 𝒟i)

Transfer Learning:
Solve target task 𝒯b after solving source task 𝒯a, by transferring knowledge learned from 𝒯a.
Key assumption: Cannot access data 𝒟a during transfer.
Side note: 𝒯a may include multiple tasks itself.

Transfer learning is a valid solution to multi-task learning. (but not vice versa)

Question: In what settings might transfer learning make sense? (answer in chat or raise hand)

35
Transfer learning via fine-tuning

θ ← θ − α∇θℒ(θ, 𝒟^tr)
where θ is initialized with the pre-trained parameters and 𝒟^tr is the training data for the new task
(typically for many gradient steps)

Where do you get the pre-trained parameters?
- ImageNet classification
- Models trained on large language corpora (BERT, LMs)
- Other unsupervised learning techniques
- Whatever large, diverse dataset you might have
Pre-trained models often available online.

Some common practices
- Fine-tune with a smaller learning rate
- Smaller learning rate for earlier layers
- Freeze earlier layers, gradually unfreeze
- Reinitialize last layer
- Search over hyperparameters via cross-validation
- Architecture choices matter (e.g. ResNets)

What makes ImageNet good for transfer learning? Huh, Agrawal, Efros. ‘16
Universal Language Model Fine-Tuning for Text Classification. Howard, Ruder. ‘18
36
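A minimal sketch combining several of these practices; the specific model (a torchvision ResNet-18), the 10 target classes, which layers are frozen, and the learning rates are all assumptions for illustration, not prescriptions from the slides.

```python
import torch
import torchvision

# Start from pre-trained parameters.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
# Reinitialize the last layer for the new task (assumed 10 classes).
model.fc = torch.nn.Linear(model.fc.in_features, 10)

# Freeze earlier layers (could gradually unfreeze them later in training).
for name, p in model.named_parameters():
    if name.startswith(("conv1", "bn1", "layer1")):
        p.requires_grad = False

# Smaller learning rates overall, and smaller still for earlier layers.
optimizer = torch.optim.Adam([
    {"params": model.layer2.parameters(), "lr": 1e-5},
    {"params": model.layer3.parameters(), "lr": 1e-4},
    {"params": model.layer4.parameters(), "lr": 1e-4},
    {"params": model.fc.parameters(),     "lr": 1e-3},
])
```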

Fine-tuning doesn’t work well with small target task datasets

Upcoming lectures: few-shot learning via meta-learning

37
Plan for Today
Multi-Task Learning
- Problem statement
- Models, objectives, optimization
- Challenges
- Case study of real-world multi-task learning

Transfer Learning
- Pre-training & fine-tuning

Goals by the end of lecture:


- Know the key design decisions when building multi-task learning systems
- Understand the difference between multi-task learning and transfer learning
- Understand the basics of transfer learning

38
