Multitask Transfer (1)
Plan for Today
Multi-Task Learning
- Problem statement
- Models, objectives, optimization
- Challenges
- Case study of real-world multi-task learning
Transfer Learning
- Pre-training & fine-tuning
Multi-Task Learning
Some notation
[Figure: a model fθ(y | x) with parameters θ, mapping an input x (e.g., an image of a cat, lynx, or tiger) to an output y (e.g., the object category, or the length of a paper)]
Single-task learning: 𝒟 = {(x, y)k}
[supervised] min_θ ℒ(θ, 𝒟)
Typical loss: negative log likelihood
ℒ(θ, 𝒟) = − 𝔼(x,y)∼𝒟[log fθ(y | x)]

What is a task? (more formally this time)
A task: 𝒯i ≜ {pi(x), pi(y | x), ℒi}, where pi(x) and pi(y | x) are the data-generating distributions and ℒi is the loss.
Corresponding datasets: 𝒟i^tr and 𝒟i^test (will use 𝒟i as shorthand for 𝒟i^tr).
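For concreteness, here is a minimal sketch of the single-task objective above in PyTorch (the model, dimensions, and data are placeholder assumptions, not from the slides); for a classifier, the negative log likelihood is the usual cross-entropy loss.

```python
# Minimal single-task sketch of L(θ, D) = -E_{(x,y)~D}[log f_θ(y | x)].
# For a classifier, the negative log likelihood is the cross-entropy loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 5))  # f_θ(y | x)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(x, y):
    """One gradient step of min_θ L(θ, D) on a mini-batch (x, y) ~ D."""
    optimizer.zero_grad()
    logits = model(x)                   # unnormalized log f_θ(y | x)
    loss = F.cross_entropy(logits, y)   # batch estimate of -E[log f_θ(y | x)]
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: 16 examples with 32-dimensional x and labels y in {0, ..., 4}.
train_step(torch.randn(16, 32), torch.randint(0, 5, (16,)))
```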
Examples of Tasks
A task: 𝒯i ≜ {pi(x), pi(y | x), ℒi} (data-generating distributions and loss; corresponding datasets 𝒟i^tr, 𝒟i^test, with 𝒟i as shorthand for 𝒟i^tr)

Multi-task classification: ℒi same across all tasks
- e.g. per-language handwriting recognition
- e.g. personalized spam filter

Multi-label learning: ℒi, pi(x) same across all tasks
- e.g. CelebA attribute recognition
- e.g. scene understanding

Task descriptor zi: e.g. the task index, or whatever meta-data you have
- personalization: user features/attributes
- language description of the task
- formal specifications of the task

Vanilla MTL Objective: min_θ ∑_{i=1}^T ℒi(θ, 𝒟i)
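A minimal sketch of the vanilla MTL objective as code, assuming a single model fθ that takes a task index and a list of per-task datasets (both names are mine): the objective is simply the sum of per-task losses.

```python
# Vanilla MTL objective: min_θ Σ_{i=1}^T L_i(θ, D_i).
# `model(x, task=i)` and `datasets` (a list of (x_i, y_i) tensor pairs) are
# placeholder assumptions; the point is that per-task losses are summed.
import torch.nn.functional as F

def mtl_loss(model, datasets):
    """Σ_i L_i(θ, D_i), with the same NLL loss form for every task."""
    total = 0.0
    for i, (x_i, y_i) in enumerate(datasets):
        total = total + F.cross_entropy(model(x_i, task=i), y_i)  # L_i(θ, D_i)
    return total  # call .backward() on this to update the shared θ
```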
Conditioning on the task
Question: How should you condition on the task in order to share as little as possible?
Conditioning on the task
One extreme: use independent subnetworks per task (x → y1, x → y2, …, x → yT) and let the task descriptor zi select the output via multiplicative gating:
y = ∑_j 1(zi = j) yj
This amounts to independent training within a single network, with no shared parameters.
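The gating equation translates directly into code. Below is a small sketch (module and variable names are assumptions) with one independent subnetwork per task and a one-hot task descriptor selecting the output.

```python
# Multiplicative gating: y = Σ_j 1(z_i = j) * y_j.
# One independent subnetwork per task; the one-hot task descriptor selects the
# output, so no parameters are shared across tasks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMultiTaskNet(nn.Module):
    def __init__(self, num_tasks, in_dim, out_dim, hidden=64):
        super().__init__()
        self.subnets = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))
             for _ in range(num_tasks)])

    def forward(self, x, z_onehot):
        ys = torch.stack([net(x) for net in self.subnets], dim=1)  # (batch, T, out_dim)
        return (z_onehot.unsqueeze(-1) * ys).sum(dim=1)            # picks out y_{z_i}

net = GatedMultiTaskNet(num_tasks=3, in_dim=16, out_dim=4)
x = torch.randn(8, 16)
z = F.one_hot(torch.full((8,), 2), num_classes=3).float()  # every example is task 2
y = net(x, z)  # equals subnetwork 2's output for every example
```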
An Alternative View on the Multi-Task Architecture
Split θ into shared parameters θ^sh and task-specific parameters θ^i.
Then, our objective is: min_{θ^sh, θ^1, …, θ^T} ∑_{i=1}^T ℒi({θ^sh, θ^i}, 𝒟i)
Choosing how to condition on zi is equivalent to choosing how and where to share parameters.
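The same split can be written as a module with a shared trunk (θ^sh) and per-task heads (θ^1, …, θ^T); a minimal sketch under assumed dimensions:

```python
# Shared parameters θ^sh (trunk) and task-specific parameters θ^i (heads).
# Optimizing everything jointly implements min_{θ^sh, θ^1, ..., θ^T} Σ_i L_i.
import torch
import torch.nn as nn

class SharedBottomNet(nn.Module):
    def __init__(self, num_tasks, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())  # θ^sh
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, out_dim) for _ in range(num_tasks)])       # θ^1..θ^T

    def forward(self, x, task: int):
        return self.heads[task](self.trunk(x))

model = SharedBottomNet(num_tasks=4, in_dim=16, hidden_dim=64, out_dim=3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # updates θ^sh and every θ^i
```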
Conditioning: Some Common Choices
1. Concatenation-based conditioning: concatenate zi with the input (or hidden activations), then apply a fully-connected layer.
2. Additive conditioning: add a linearly transformed zi to the activations.
These are actually equivalent! Question: why are they the same thing? (raise your hand)
Ruder ‘17; Cross-Stitch Networks. Misra, Shrivastava, Gupta, Hebert ‘16; Multi-Task Attention Network. Liu, Johns, Davison ‘18
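One way to see the equivalence (a short derivation, not spelled out on the slide): partition the fully-connected layer's weight matrix into the block acting on x and the block acting on zi.

```latex
% Concat followed by a fully-connected layer, with the weight matrix W
% partitioned into the block W_x acting on x and the block W_z acting on z_i:
W \begin{bmatrix} x \\ z_i \end{bmatrix} + b
  = \begin{bmatrix} W_x & W_z \end{bmatrix} \begin{bmatrix} x \\ z_i \end{bmatrix} + b
  = W_x x + W_z z_i + b
% i.e. concatenation-based conditioning equals additive conditioning with the
% linearly transformed task descriptor W_z z_i added to the pre-activation.
```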
Model
- How should the model be conditioned on zi?
- What parameters of the model should be shared?
Vanilla MTL Objective: min_θ ∑_{i=1}^T ℒi(θ, 𝒟i)
Often want to weight tasks differently: min_θ ∑_{i=1}^T wi ℒi(θ, 𝒟i)
How to choose the weights wi?
a. various heuristics, e.g. encourage gradients to have similar magnitudes (Chen et al. GradNorm. ICML 2018)
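As one illustration of the gradient-magnitude idea, here is a simplified sketch that rescales each task loss by the inverse of its gradient norm on the shared parameters. This is my own simplification in the spirit of GradNorm, not the actual algorithm from Chen et al.

```python
# Simplified gradient-magnitude balancing (NOT the actual GradNorm algorithm):
# weight each task's loss inversely to its gradient norm on the shared
# parameters, so that no single task dominates the shared representation.
import torch

def balanced_multitask_loss(task_losses, shared_params, eps=1e-8):
    """task_losses: list of scalar losses; shared_params: list of shared tensors,
    e.g. list(model.trunk.parameters())."""
    grad_norms = []
    for loss in task_losses:
        grads = torch.autograd.grad(loss, shared_params, retain_graph=True)
        grad_norms.append(torch.sqrt(sum((g ** 2).sum() for g in grads)).detach())
    mean_norm = torch.stack(grad_norms).mean()
    weights = [mean_norm / (n + eps) for n in grad_norms]   # w_i ∝ 1 / ||∇_θsh L_i||
    return sum(w * loss for w, loss in zip(weights, task_losses))
```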
Optimizing the objective
Vanilla MTL Objective: min_θ ∑_{i=1}^T ℒi(θ, 𝒟i)
Basic Version (see the sketch below):
1. Sample a mini-batch of tasks ℬ ∼ {𝒯i}
2. Sample a mini-batch of datapoints for each task: 𝒟i^b ∼ 𝒟i
3. Compute the loss on the mini-batch: ℒ̂(θ, ℬ) = ∑_{𝒯k ∈ ℬ} ℒk(θ, 𝒟k^b)
4. Backpropagate the loss to compute the gradient ∇θ ℒ̂
5. Apply the gradient with your favorite neural network optimizer (e.g. Adam)
Note: This ensures that tasks are sampled uniformly, regardless of data quantities.
Tip: For regression problems, make sure your task labels are on the same scale!
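A sketch of the basic version as a training loop (the model signature and dataset format are assumptions, matching the shared-bottom sketch earlier): the key point is that tasks are sampled uniformly rather than in proportion to their dataset sizes.

```python
# Basic MTL training loop: sample tasks uniformly, then sample datapoints per
# task, sum the losses, backpropagate, and step the optimizer.
# `model(x, task=i)` and the dataset format are assumptions (see the
# shared-bottom sketch above).
import random
import torch
import torch.nn.functional as F

def train(model, task_datasets, optimizer, num_steps=1000,
          tasks_per_step=4, batch_size=32):
    """task_datasets[i]: list of (x, y) example pairs for task i."""
    for _ in range(num_steps):
        # 1. Sample a mini-batch of tasks uniformly, regardless of data quantities.
        task_ids = random.sample(range(len(task_datasets)),
                                 k=min(tasks_per_step, len(task_datasets)))
        loss = 0.0
        for i in task_ids:
            # 2. Sample a mini-batch of datapoints D_i^b for task i.
            batch = random.sample(task_datasets[i], k=min(batch_size, len(task_datasets[i])))
            x = torch.stack([b[0] for b in batch])
            y = torch.stack([b[1] for b in batch])
            # 3. Accumulate the per-task loss L_i(θ, D_i^b).
            loss = loss + F.cross_entropy(model(x, task=i), y)
        # 4.-5. Backpropagate and apply the gradient.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```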
Challenges
Challenge #1: Negative transfer
Negative transfer: sometimes independent networks work best.
[Plot: Multi-Task CIFAR-100 accuracy, comparing recent approaches with multi-head architectures, the cross-stitch architecture, and independent training]
(Yu et al. Gradient Surgery for Multi-Task Learning. 2020)
If you observe negative transfer, share less across tasks. It's not all or nothing: one option is soft parameter sharing.
[Figure: per-task networks x → y1, …, x → yT with constrained weights, i.e. corresponding parameters in different task networks are encouraged to stay close]
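A minimal sketch of a soft parameter-sharing penalty (my own generic formulation of the "constrained weights" idea, not a specific published method): per-task networks are kept separate, but corresponding parameters are pulled toward each other.

```python
# Soft parameter sharing: keep one network per task, but add a penalty that
# pulls corresponding parameters toward each other, e.g.
#   Σ_i L_i(θ^i, D_i) + λ Σ_{i<j} ||θ^i − θ^j||².
import torch

def soft_sharing_penalty(task_models, lam=1e-3):
    """task_models: list of identically structured nn.Modules, one per task."""
    params = [list(m.parameters()) for m in task_models]
    penalty = 0.0
    for i in range(len(task_models)):
        for j in range(i + 1, len(task_models)):
            for p_i, p_j in zip(params[i], params[j]):
                penalty = penalty + ((p_i - p_j) ** 2).sum()
    return lam * penalty

# Used as: total_loss = sum(per_task_losses) + soft_sharing_penalty(task_models)
```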
Challenge #3: What if you have a lot of tasks?
Should you train all of them together? Which ones will be complementary?
Fifty, Amid, Zhao, Yu, Anil, Finn. Efficiently Identifying Task Groupings for Multi-Task Learning. 2021
Case study: a real-world video recommendation system
Framework Set-Up
Input: what the user is currently watching (query video) + user features
1. Generate a few hundred candidate videos
2. Rank candidates
3. Serve top-ranking videos to the user
The Ranking Problem
Input: query video, candidate video, user & context features
Output: predictions for a set of engagement objectives (e.g. clicks, time spent) and satisfaction objectives (e.g. likes, dismissals)
Question: Are these objectives reasonable? What are some of the issues that might come up?
The Architecture
Basic option: “Shared-Bottom Model” (i.e. multi-head architecture)
The Architecture
Instead: use a form of soft parameter sharing, allowing different parts of the network to “specialize”.
“Multi-gate Mixture-of-Experts (MMoE)”: a set of expert neural networks f1, …, fn shared across all tasks.
Decide which experts to use for input x, task k, with a softmax gate: g^k(x) = softmax(W^k x)
Compute output as the gated mixture of experts followed by a task-specific head: f^k(x) = ∑_i g^k(x)_i fi(x), y_k = h^k(f^k(x))
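A compact sketch of the MMoE computation described above (layer sizes and names are assumptions): shared experts, one softmax gate per task, and a task-specific head applied to the gated mixture.

```python
# MMoE sketch: experts f_1..f_n shared across tasks, a softmax gate g^k per
# task, the gated mixture f^k(x) = Σ_i g^k(x)_i f_i(x), and a task-specific
# head y_k = h^k(f^k(x)).
import torch
import torch.nn as nn

class MMoE(nn.Module):
    def __init__(self, in_dim, expert_dim, num_experts, num_tasks, out_dim):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, expert_dim), nn.ReLU())
             for _ in range(num_experts)])                               # f_i
        self.gates = nn.ModuleList(
            [nn.Linear(in_dim, num_experts) for _ in range(num_tasks)])  # g^k
        self.heads = nn.ModuleList(
            [nn.Linear(expert_dim, out_dim) for _ in range(num_tasks)])  # h^k

    def forward(self, x):
        expert_outs = torch.stack([f(x) for f in self.experts], dim=1)   # (B, n, d)
        outputs = []
        for gate, head in zip(self.gates, self.heads):
            g = torch.softmax(gate(x), dim=-1).unsqueeze(-1)             # (B, n, 1)
            outputs.append(head((g * expert_outs).sum(dim=1)))           # h^k(f^k(x))
        return outputs  # one prediction per task

ys = MMoE(in_dim=32, expert_dim=64, num_experts=4, num_tasks=2, out_dim=1)(torch.randn(8, 32))
```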
Experiments
Set-Up:
- Implementation in TensorFlow, TPUs
- Train in temporal order, running training continuously to consume newly arriving data
- Offline AUC & squared-error metrics
- Online A/B testing in comparison to production system
  - live metrics based on time spent, survey responses, rate of dismissals
- Model computational efficiency matters
[Results: comparison figure omitted]
Transfer Learning
- Pre-training & fine-tuning
Multi-Task Learning vs. Transfer Learning
Multi-task learning: solve a set of tasks at once: min_θ ∑_{i=1}^T ℒi(θ, 𝒟i)
Transfer learning: solve a target task 𝒯b after solving a source task 𝒯a, by transferring knowledge learned from 𝒯a.
Key assumption: Cannot access data 𝒟a during transfer.
Side note: 𝒯a may include multiple tasks itself.
Transfer learning is a valid solution to multi-task learning (but not vice versa).
Transfer learning via fine-tuning
Fine-tuning: start from pre-trained parameters θ and run gradient descent on the new task’s training data 𝒟^tr (typically for many gradient steps):
φ ← θ − α ∇θ ℒ(θ, 𝒟^tr)

Where do you get the pre-trained parameters?
- ImageNet classification
- Models trained on large language corpora (BERT, LMs)
- Other unsupervised learning techniques
- Whatever large, diverse dataset you might have
Pre-trained models often available online.
What makes ImageNet good for transfer learning? Huh, Agrawal, Efros. ‘16

Some common practices (see the sketch below):
- Fine-tune with a smaller learning rate
- Smaller learning rate for earlier layers
- Freeze earlier layers, gradually unfreeze
- Reinitialize last layer
- Search over hyperparameters via cross-validation
- Architecture choices matter (e.g. ResNets)

Universal Language Model Fine-Tuning for Text Classification. Howard, Ruder. ‘18
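A minimal fine-tuning sketch in PyTorch illustrating a few of the common practices above (it assumes torchvision ≥ 0.13 for the `weights` argument, and a hypothetical 10-class downstream task): reinitialize the last layer and use a smaller learning rate for the earlier, pre-trained layers.

```python
# Fine-tuning sketch: pre-trained parameters θ from ImageNet, new head for the
# target task, smaller learning rate for the earlier (pre-trained) layers.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")        # pre-trained parameters θ

# Reinitialize the last layer for the new task (here: 10 classes, an assumption).
model.fc = nn.Linear(model.fc.in_features, 10)

# Smaller learning rate for the pre-trained backbone; larger for the new head.
# (To freeze earlier layers instead, set p.requires_grad = False on them.)
backbone_params = [p for n, p in model.named_parameters() if not n.startswith("fc")]
optimizer = torch.optim.Adam([
    {"params": backbone_params, "lr": 1e-5},
    {"params": model.fc.parameters(), "lr": 1e-3},
])

# One fine-tuning step on a batch (x, y) from the new task's training data D^tr.
def finetune_step(x, y):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```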
Plan for Today
Multi-Task Learning
- Problem statement
- Models, objectives, optimization
- Challenges
- Case study of real-world multi-task learning
Transfer Learning
- Pre-training & fine-tuning