Multitask Transfer (1)
Plan for Today
Multi-Task Learning
- Problem statement
- Models, objectives, optimization
- Challenges
- Case study of real-world multi-task learning
Transfer Learning
- Pre-training & fine-tuning
Multi-Task Learning
Some notation
[Figure: a model fθ(y | x) with parameters θ, mapping an input x (e.g., an image of a cat, lynx, or tiger) to an output y (e.g., the object category, or the length of a paper)]
Single-task learning: 𝒟 = {(x, y)k}
[supervised] min_θ ℒ(θ, 𝒟)
Typical loss: negative log likelihood
ℒ(θ, 𝒟) = − 𝔼(x,y)∼𝒟[log fθ(y | x)]

What is a task? (more formally this time)
A task: 𝒯i ≜ {pi(x), pi(y | x), ℒi}, where pi(x) and pi(y | x) are the data-generating distributions and ℒi is the loss.
Corresponding datasets: 𝒟i^tr and 𝒟i^test (will use 𝒟i as shorthand for 𝒟i^tr).
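For concreteness, here is a minimal sketch of the single-task objective above in PyTorch (the model, dimensions, and data are placeholder assumptions, not from the slides); for a classifier, the negative log likelihood is the usual cross-entropy loss.

```python
# Minimal single-task sketch of L(θ, D) = -E_{(x,y)~D}[log f_θ(y | x)].
# For a classifier, the negative log likelihood is the cross-entropy loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 5))  # f_θ(y | x)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(x, y):
    """One gradient step of min_θ L(θ, D) on a mini-batch (x, y) ~ D."""
    optimizer.zero_grad()
    logits = model(x)                   # unnormalized log f_θ(y | x)
    loss = F.cross_entropy(logits, y)   # batch estimate of -E[log f_θ(y | x)]
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage: 16 examples with 32-dimensional x and labels y in {0, ..., 4}.
train_step(torch.randn(16, 32), torch.randint(0, 5, (16,)))
```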
Examples of Tasks
A task: 𝒯i ≜ {pi(x), pi(y | x), ℒi} (data-generating distributions and loss; corresponding datasets 𝒟i^tr, 𝒟i^test, with 𝒟i as shorthand for 𝒟i^tr)

Multi-task classification: ℒi same across all tasks
- e.g. per-language handwriting recognition
- e.g. personalized spam filter

Multi-label learning: ℒi, pi(x) same across all tasks
- e.g. CelebA attribute recognition
- e.g. scene understanding

Task descriptor zi: e.g. the task index, or whatever meta-data you have
- personalization: user features/attributes
- language description of the task
- formal specifications of the task

Vanilla MTL Objective: min_θ ∑_{i=1}^T ℒi(θ, 𝒟i)
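A minimal sketch of the vanilla MTL objective as code, assuming a single model fθ that takes a task index and a list of per-task datasets (both names are mine): the objective is simply the sum of per-task losses.

```python
# Vanilla MTL objective: min_θ Σ_{i=1}^T L_i(θ, D_i).
# `model(x, task=i)` and `datasets` (a list of (x_i, y_i) tensor pairs) are
# placeholder assumptions; the point is that per-task losses are summed.
import torch.nn.functional as F

def mtl_loss(model, datasets):
    """Σ_i L_i(θ, D_i), with the same NLL loss form for every task."""
    total = 0.0
    for i, (x_i, y_i) in enumerate(datasets):
        total = total + F.cross_entropy(model(x_i, task=i), y_i)  # L_i(θ, D_i)
    return total  # call .backward() on this to update the shared θ
```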
Conditioning on the task
Question: How should you condition on the task in order to share as little as possible?
Conditioning on the task
One extreme: use independent subnetworks per task (x → y1, x → y2, …, x → yT) and let the task descriptor zi select the output via multiplicative gating:
y = ∑_j 1(zi = j) yj
This amounts to independent training within a single network, with no shared parameters.
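The gating equation translates directly into code. Below is a small sketch (module and variable names are assumptions) with one independent subnetwork per task and a one-hot task descriptor selecting the output.

```python
# Multiplicative gating: y = Σ_j 1(z_i = j) * y_j.
# One independent subnetwork per task; the one-hot task descriptor selects the
# output, so no parameters are shared across tasks.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedMultiTaskNet(nn.Module):
    def __init__(self, num_tasks, in_dim, out_dim, hidden=64):
        super().__init__()
        self.subnets = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))
             for _ in range(num_tasks)])

    def forward(self, x, z_onehot):
        ys = torch.stack([net(x) for net in self.subnets], dim=1)  # (batch, T, out_dim)
        return (z_onehot.unsqueeze(-1) * ys).sum(dim=1)            # picks out y_{z_i}

net = GatedMultiTaskNet(num_tasks=3, in_dim=16, out_dim=4)
x = torch.randn(8, 16)
z = F.one_hot(torch.full((8,), 2), num_classes=3).float()  # every example is task 2
y = net(x, z)  # equals subnetwork 2's output for every example
```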
An Alternative View on the Multi-Task Architecture
Split θ into shared parameters θ^sh and task-specific parameters θ^i.
Then, our objective is: min_{θ^sh, θ^1, …, θ^T} ∑_{i=1}^T ℒi({θ^sh, θ^i}, 𝒟i)
Choosing how to condition on zi is equivalent to choosing how and where to share parameters.
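The same split can be written as a module with a shared trunk (θ^sh) and per-task heads (θ^1, …, θ^T); a minimal sketch under assumed dimensions:

```python
# Shared parameters θ^sh (trunk) and task-specific parameters θ^i (heads).
# Optimizing everything jointly implements min_{θ^sh, θ^1, ..., θ^T} Σ_i L_i.
import torch
import torch.nn as nn

class SharedBottomNet(nn.Module):
    def __init__(self, num_tasks, in_dim, hidden_dim, out_dim):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())  # θ^sh
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_dim, out_dim) for _ in range(num_tasks)])       # θ^1..θ^T

    def forward(self, x, task: int):
        return self.heads[task](self.trunk(x))

model = SharedBottomNet(num_tasks=4, in_dim=16, hidden_dim=64, out_dim=3)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # updates θ^sh and every θ^i
```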
Conditioning: Some Common Choices
1. Concatenation-based conditioning: concatenate zi with the input (or hidden activations), then apply a fully-connected layer.
2. Additive conditioning: add a linearly transformed zi to the activations.
These are actually equivalent! Question: why are they the same thing? (raise your hand)
Ruder ‘17; Cross-Stitch Networks. Misra, Shrivastava, Gupta, Hebert ‘16; Multi-Task Attention Network. Liu, Johns, Davison ‘18
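One way to see the equivalence (a short derivation, not spelled out on the slide): partition the fully-connected layer's weight matrix into the block acting on x and the block acting on zi.

```latex
% Concat followed by a fully-connected layer, with the weight matrix W
% partitioned into the block W_x acting on x and the block W_z acting on z_i:
W \begin{bmatrix} x \\ z_i \end{bmatrix} + b
  = \begin{bmatrix} W_x & W_z \end{bmatrix} \begin{bmatrix} x \\ z_i \end{bmatrix} + b
  = W_x x + W_z z_i + b
% i.e. concatenation-based conditioning equals additive conditioning with the
% linearly transformed task descriptor W_z z_i added to the pre-activation.
```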
Model
- How should the model be conditioned on zi?
- What parameters of the model should be shared?
Vanilla MTL Objective: min_θ ∑_{i=1}^T ℒi(θ, 𝒟i)
Often want to weight tasks differently: min_θ ∑_{i=1}^T wi ℒi(θ, 𝒟i)
How to choose the weights wi?
a. various heuristics, e.g. encourage gradients to have similar magnitudes (Chen et al. GradNorm. ICML 2018)
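As one illustration of the gradient-magnitude idea, here is a simplified sketch that rescales each task loss by the inverse of its gradient norm on the shared parameters. This is my own simplification in the spirit of GradNorm, not the actual algorithm from Chen et al.

```python
# Simplified gradient-magnitude balancing (NOT the actual GradNorm algorithm):
# weight each task's loss inversely to its gradient norm on the shared
# parameters, so that no single task dominates the shared representation.
import torch

def balanced_multitask_loss(task_losses, shared_params, eps=1e-8):
    """task_losses: list of scalar losses; shared_params: list of shared tensors,
    e.g. list(model.trunk.parameters())."""
    grad_norms = []
    for loss in task_losses:
        grads = torch.autograd.grad(loss, shared_params, retain_graph=True)
        grad_norms.append(torch.sqrt(sum((g ** 2).sum() for g in grads)).detach())
    mean_norm = torch.stack(grad_norms).mean()
    weights = [mean_norm / (n + eps) for n in grad_norms]   # w_i ∝ 1 / ||∇_θsh L_i||
    return sum(w * loss for w, loss in zip(weights, task_losses))
```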
Optimizing the objective
Vanilla MTL Objective: min_θ ∑_{i=1}^T ℒi(θ, 𝒟i)
Basic Version (see the sketch below):
1. Sample a mini-batch of tasks ℬ ∼ {𝒯i}
2. Sample a mini-batch of datapoints for each task: 𝒟i^b ∼ 𝒟i
3. Compute the loss on the mini-batch: ℒ̂(θ, ℬ) = ∑_{𝒯k ∈ ℬ} ℒk(θ, 𝒟k^b)
4. Backpropagate the loss to compute the gradient ∇θ ℒ̂
5. Apply the gradient with your favorite neural network optimizer (e.g. Adam)
Note: This ensures that tasks are sampled uniformly, regardless of data quantities.
Tip: For regression problems, make sure your task labels are on the same scale!
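A sketch of the basic version as a training loop (the model signature and dataset format are assumptions, matching the shared-bottom sketch earlier): the key point is that tasks are sampled uniformly rather than in proportion to their dataset sizes.

```python
# Basic MTL training loop: sample tasks uniformly, then sample datapoints per
# task, sum the losses, backpropagate, and step the optimizer.
# `model(x, task=i)` and the dataset format are assumptions (see the
# shared-bottom sketch above).
import random
import torch
import torch.nn.functional as F

def train(model, task_datasets, optimizer, num_steps=1000,
          tasks_per_step=4, batch_size=32):
    """task_datasets[i]: list of (x, y) example pairs for task i."""
    for _ in range(num_steps):
        # 1. Sample a mini-batch of tasks uniformly, regardless of data quantities.
        task_ids = random.sample(range(len(task_datasets)),
                                 k=min(tasks_per_step, len(task_datasets)))
        loss = 0.0
        for i in task_ids:
            # 2. Sample a mini-batch of datapoints D_i^b for task i.
            batch = random.sample(task_datasets[i], k=min(batch_size, len(task_datasets[i])))
            x = torch.stack([b[0] for b in batch])
            y = torch.stack([b[1] for b in batch])
            # 3. Accumulate the per-task loss L_i(θ, D_i^b).
            loss = loss + F.cross_entropy(model(x, task=i), y)
        # 4.-5. Backpropagate and apply the gradient.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```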
Challenges
Challenge #1: Negative transfer
Negative transfer: sometimes independent networks work best.
[Plot: Multi-Task CIFAR-100 accuracy, comparing recent approaches with multi-head architectures, the cross-stitch architecture, and independent training]
(Yu et al. Gradient Surgery for Multi-Task Learning. 2020)
If you observe negative transfer, share less across tasks. It's not all or nothing: one option is soft parameter sharing.
[Figure: per-task networks x → y1, …, x → yT with constrained weights, i.e. corresponding parameters in different task networks are encouraged to stay close]
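A minimal sketch of a soft parameter-sharing penalty (my own generic formulation of the "constrained weights" idea, not a specific published method): per-task networks are kept separate, but corresponding parameters are pulled toward each other.

```python
# Soft parameter sharing: keep one network per task, but add a penalty that
# pulls corresponding parameters toward each other, e.g.
#   Σ_i L_i(θ^i, D_i) + λ Σ_{i<j} ||θ^i − θ^j||².
import torch

def soft_sharing_penalty(task_models, lam=1e-3):
    """task_models: list of identically structured nn.Modules, one per task."""
    params = [list(m.parameters()) for m in task_models]
    penalty = 0.0
    for i in range(len(task_models)):
        for j in range(i + 1, len(task_models)):
            for p_i, p_j in zip(params[i], params[j]):
                penalty = penalty + ((p_i - p_j) ** 2).sum()
    return lam * penalty

# Used as: total_loss = sum(per_task_losses) + soft_sharing_penalty(task_models)
```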
Challenge #3: What if you have a lot of tasks?
Should you train all of them together? Which ones will be complementary?
Fifty, Amid, Zhao, Yu, Anil, Finn. Efficiently Identifying Task Groupings for Multi-Task Learning. 2021
Case study: a real-world video recommendation system
Framework Set-Up
Input: what the user is currently watching (query video) + user features
1. Generate a few hundred candidate videos
2. Rank candidates
3. Serve top-ranking videos to the user
The Ranking Problem
Input: query video, candidate video, user & context features
Output: predictions for a set of engagement objectives (e.g. clicks, time spent) and satisfaction objectives (e.g. likes, dismissals)
Question: Are these objectives reasonable? What are some of the issues that might come up?
The Architecture
Basic option: “Shared-Bottom Model” (i.e. multi-head architecture)
The Architecture
Instead: use a form of soft parameter sharing, allowing different parts of the network to “specialize”.
“Multi-gate Mixture-of-Experts (MMoE)”: a set of expert neural networks f1, …, fn shared across all tasks.
Decide which experts to use for input x, task k, with a softmax gate: g^k(x) = softmax(W^k x)
Compute output as the gated mixture of experts followed by a task-specific head: f^k(x) = ∑_i g^k(x)_i fi(x), y_k = h^k(f^k(x))
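A compact sketch of the MMoE computation described above (layer sizes and names are assumptions): shared experts, one softmax gate per task, and a task-specific head applied to the gated mixture.

```python
# MMoE sketch: experts f_1..f_n shared across tasks, a softmax gate g^k per
# task, the gated mixture f^k(x) = Σ_i g^k(x)_i f_i(x), and a task-specific
# head y_k = h^k(f^k(x)).
import torch
import torch.nn as nn

class MMoE(nn.Module):
    def __init__(self, in_dim, expert_dim, num_experts, num_tasks, out_dim):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(in_dim, expert_dim), nn.ReLU())
             for _ in range(num_experts)])                               # f_i
        self.gates = nn.ModuleList(
            [nn.Linear(in_dim, num_experts) for _ in range(num_tasks)])  # g^k
        self.heads = nn.ModuleList(
            [nn.Linear(expert_dim, out_dim) for _ in range(num_tasks)])  # h^k

    def forward(self, x):
        expert_outs = torch.stack([f(x) for f in self.experts], dim=1)   # (B, n, d)
        outputs = []
        for gate, head in zip(self.gates, self.heads):
            g = torch.softmax(gate(x), dim=-1).unsqueeze(-1)             # (B, n, 1)
            outputs.append(head((g * expert_outs).sum(dim=1)))           # h^k(f^k(x))
        return outputs  # one prediction per task

ys = MMoE(in_dim=32, expert_dim=64, num_experts=4, num_tasks=2, out_dim=1)(torch.randn(8, 32))
```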
Experiments
Set-Up:
- Implementation in TensorFlow, TPUs
- Train in temporal order, running training continuously to consume newly arriving data
- Offline AUC & squared-error metrics
- Online A/B testing in comparison to production system
  - live metrics based on time spent, survey responses, rate of dismissals
- Model computational efficiency matters
[Results: comparison figure omitted]
Transfer Learning
- Pre-training & fine-tuning
Multi-Task Learning vs. Transfer Learning
Multi-task learning: solve a set of tasks at once: min_θ ∑_{i=1}^T ℒi(θ, 𝒟i)
Transfer learning: solve a target task 𝒯b after solving a source task 𝒯a, by transferring knowledge learned from 𝒯a.
Key assumption: Cannot access data 𝒟a during transfer.
Side note: 𝒯a may include multiple tasks itself.
Transfer learning is a valid solution to multi-task learning (but not vice versa).
Transfer learning via fine-tuning
Fine-tuning: start from pre-trained parameters θ and run gradient descent on the new task’s training data 𝒟^tr (typically for many gradient steps):
φ ← θ − α ∇θ ℒ(θ, 𝒟^tr)

Where do you get the pre-trained parameters?
- ImageNet classification
- Models trained on large language corpora (BERT, LMs)
- Other unsupervised learning techniques
- Whatever large, diverse dataset you might have
Pre-trained models often available online.
What makes ImageNet good for transfer learning? Huh, Agrawal, Efros. ‘16

Some common practices (see the sketch below):
- Fine-tune with a smaller learning rate
- Smaller learning rate for earlier layers
- Freeze earlier layers, gradually unfreeze
- Reinitialize last layer
- Search over hyperparameters via cross-validation
- Architecture choices matter (e.g. ResNets)

Universal Language Model Fine-Tuning for Text Classification. Howard, Ruder. ‘18
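A minimal fine-tuning sketch in PyTorch illustrating a few of the common practices above (it assumes torchvision ≥ 0.13 for the `weights` argument, and a hypothetical 10-class downstream task): reinitialize the last layer and use a smaller learning rate for the earlier, pre-trained layers.

```python
# Fine-tuning sketch: pre-trained parameters θ from ImageNet, new head for the
# target task, smaller learning rate for the earlier (pre-trained) layers.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights="IMAGENET1K_V1")        # pre-trained parameters θ

# Reinitialize the last layer for the new task (here: 10 classes, an assumption).
model.fc = nn.Linear(model.fc.in_features, 10)

# Smaller learning rate for the pre-trained backbone; larger for the new head.
# (To freeze earlier layers instead, set p.requires_grad = False on them.)
backbone_params = [p for n, p in model.named_parameters() if not n.startswith("fc")]
optimizer = torch.optim.Adam([
    {"params": backbone_params, "lr": 1e-5},
    {"params": model.fc.parameters(), "lr": 1e-3},
])

# One fine-tuning step on a batch (x, y) from the new task's training data D^tr.
def finetune_step(x, y):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```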
Plan for Today
Multi-Task Learning
- Problem statement
- Models, objectives, optimization
- Challenges
- Case study of real-world multi-task learning
Transfer Learning
- Pre-training & fine-tuning