
Deep Transfer Learning and Multi-task Learning

Concepts are assembled from various online sources, with grateful acknowledgement to all those who made them available online.
Transfer Learning
• Transfer a model trained on source data A to
target data B
• Task transfer: in this case, the source and target data
can be the same
• Image classification -> image segmentation
• Machine translation -> sentiment analysis
• Time series prediction -> time series classification
• …
• Data transfer:
• Images of everyday objects -> medical images
• Chinese -> English
• Physiological signals of one patient -> another patient
• …
• Rationale: similar features can be useful in different tasks, or shared by different yet related data.
Taxonomy of Transfer Learning

                              Source Data
                     Labeled                          Unlabeled
Target    Labeled    Model fine-tuning,               Self-taught learning
Data                 Multi-task learning
          Unlabeled  Domain-adversarial training,     Self-taught clustering
                     Zero-shot learning

• Hongyi Li, Transfer Learning. https://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/transfer%20(v3).pdf.


Taxonomy of Transfer Learning

                              Source Data
                     Labeled                          Unlabeled
Target    Labeled    Model fine-tuning,               Self-taught learning
Data                 Multi-task learning
          Unlabeled  Domain-adversarial training,     Self-taught clustering
                     Zero-shot learning

• Hongyi Li, Transfer Learning. https://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/transfer%20(v3).pdf.


Model Fine Tuning
• For a model trained on a large amount of labeled source data, transfer it to target data for which only very little labeled data is available. E.g.

Application                  Source Data                           Target Data
Medical image segmentation   Segmentations of many images of       Segmentations of several
                             daily scenes                          medical images
Speech recognition           Audio data and transcriptions of      Limited audio data and
                             many historical speakers              transcriptions of a new speaker
Arrhythmia detection         Very long ECG signals of a large      ECG snippets from a new patient
                             number of historical patients

• Idea: Pre-train a model using labeled source data, then fine-tune the model with labeled target data.
• Caution: Do NOT overfit the limited amount of labeled target data!
• Hongyi Li, Transfer Learning. https://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/transfer%20(v3).pdf.
Conservative Training

[Diagram: a network pre-trained on source data (Input Layer → Hidden Layers → Output Layer) beside a new network with the same architecture for target data]

• Use parameters of the pre-trained model to initialize the parameters of the new model;
• Further train the new model on target data. Limit the number of epochs to avoid over-fitting!
• Hongyi Li, Transfer Learning. https://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/transfer%20(v3).pdf.
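• A minimal PyTorch sketch of conservative training (not from the original slides; the pre-trained model and a `target_loader` over the small labeled target set are assumed to exist): initialize the new model from the pre-trained one, then fine-tune with a small learning rate for only a few epochs.

```python
import copy
import torch
import torch.nn as nn

def conservative_finetune(pretrained_model: nn.Module, target_loader,
                          max_epochs: int = 3, lr: float = 1e-4) -> nn.Module:
    # Initialize the new model with the pre-trained parameters.
    new_model = copy.deepcopy(pretrained_model)

    # Fine-tune on target data with a small learning rate and few epochs
    # to avoid over-fitting the limited labeled target data.
    optimizer = torch.optim.Adam(new_model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()
    new_model.train()
    for _ in range(max_epochs):                  # limit the number of epochs!
        for inputs, labels in target_loader:
            optimizer.zero_grad()
            loss = criterion(new_model(inputs), labels)
            loss.backward()
            optimizer.step()
    return new_model
```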
Layer Transfer
[Diagram: a source-data network (Input Layer → Hidden Layers 1–3 → Output Layer) beside a target-data network of the same architecture, with Hidden Layers 1 and 2 frozen]

• Use parameters of the pre-trained model to initialize the parameters of the new model;
• Freeze the parameters of some hidden layers; only fine-tune the parameters of the other layers on target data. Limit the number of epochs to avoid over-fitting!
• Usually, freeze the first or last few layers.

• Hongyi Li, Transfer Learning. https://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/transfer%20(v3).pdf.
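• A minimal PyTorch sketch of layer transfer (not from the original slides; the sub-module names "layer1" and "layer2" are hypothetical): freeze some layers of the pre-trained model so they keep their pre-trained parameters, and fine-tune only the rest.

```python
import torch
import torch.nn as nn

def freeze_layers(model: nn.Module, frozen_prefixes=("layer1", "layer2")):
    # Freeze the parameters of the named sub-modules; they keep their
    # pre-trained values during fine-tuning.
    for name, param in model.named_parameters():
        if name.startswith(frozen_prefixes):     # e.g. "layer1.0.weight"
            param.requires_grad = False

    # Give the optimizer only the parameters that remain trainable.
    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.Adam(trainable, lr=1e-4)
```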


Open-source Pre-trained Models
• Using open-source pre-trained models for transfer
learning is an effective and efficient way to acquire
high-quality deep learning results for your
applications!
• Pre-trained Models for Natural Language
Processing (NLP)
• BERT
• GPT-3
• …
• Pre-trained Models for Computer Vision (CV)
• VGG-16
• ResNet50
• ViT
• …
BERT: Bidirectional Encoder Representations
from Transformers
• A natural language processing model proposed
by Google in 2018;
• pre-trained on 2,500 million words of Wikipedia
and 800 million words of Book Corpus;
• allows for training customized question answering models in a few hours using a single GPU;
• available at
https://github.com/google-research/bert.

• Variants: CodeBERT, RoBERTa, ALBERT, XLNet, …
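• A minimal sketch of reusing pre-trained BERT for a downstream classification task; it assumes the Hugging Face transformers library and the "bert-base-uncased" checkpoint rather than the Google repository linked above.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Load the pre-trained BERT encoder and attach a fresh classification head.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=2)

# Encode one toy sentence; in practice, fine-tune on labeled target data.
inputs = tokenizer("Transfer learning saves labeled data.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)   # torch.Size([1, 2])
```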


GPT-3: Generative Pre-trained Transformer 3
• A natural language processing model
proposed by OpenAI in 2020;
• has 175 billion parameters, about 10 times more than any previous non-sparse language model;
• strong at tasks such as translation and question answering, as well as on-the-fly reasoning tasks like unscrambling words;
• has been applied to writing news articles, generating code, …
• available at https://openai.com/api/.
VGG-16
• A computer vision model proposed by the
Visual Geometry Group from Oxford;
• pre-trained on the ImageNet corpus; first
runner-up of ILSVRC (ImageNet Large Scale
Visual Recognition Competition) 2014 in the
classification task
• a CNN model with 16 layers and about 138
million parameters;
• has been built into popular deep learning
frameworks such as PyTorch and Keras.

• Variant: VGG-19
ResNet50
• A variant of the ResNet model, a computer
vision model proposed by Microsoft in 2015;
• pre-trained on the ImageNet corpus;
• a CNN model with 50 layers and about 26 million parameters;
• has been built into popular deep learning
frameworks such as PyTorch and Keras.
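• A minimal sketch of transfer learning with the ResNet50 weights built into torchvision (assuming torchvision >= 0.13 for the `weights` argument; the number of target classes is illustrative): keep the pre-trained backbone and replace only the final fully connected layer.

```python
import torch.nn as nn
from torchvision import models

num_classes = 5                                   # illustrative target task

# Load ResNet50 pre-trained on ImageNet.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)

# Optionally freeze the backbone (layer transfer).
for param in model.parameters():
    param.requires_grad = False

# Replace the ImageNet classifier head; the new layer is trainable by default.
model.fc = nn.Linear(model.fc.in_features, num_classes)
```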
ViT: Vision Transformer
• A computer vision (CV) model proposed by
Google in 2020;
• introduces the Transformer architecture, which has achieved huge success in natural language processing, into CV; the idea is to treat patches in images as words in text;
• can achieve better accuracy and efficiency than CNNs such as ResNet50;
• available at https://github.com/google-research/vision_transformer.

• Variants: Swin Transformer, PVTv2…


Taxonomy of Transfer Learning
                              Source Data
                     Labeled                          Unlabeled
Target    Labeled    Model fine-tuning,               Self-taught learning
Data                 Multi-task learning
          Unlabeled  Domain-adversarial training,     Self-taught clustering
                     Zero-shot learning

• Hongyi Li, Transfer Learning. https://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/transfer%20(v3).pdf.


Multi-task Learning (MTL)
• Simultaneously undertaking multiple tasks
using a single network. E.g.
• Simultaneous ECG heartbeat segmentation and
classification
• …
• We do not necessarily need multiple main
tasks. Rather, we can have one main task and
several auxiliary tasks to support the main task.
• Domain adaptation
• Self-supervision
• ….
• Basic forms of MTL: hard or soft parameter
sharing.
Hard Parameter Sharing
[Diagram: Input Layer (Tasks A & B) → Shared Layers (Feature Extractor) → Task-specific layers (Task A / Task B) → Output Layer (Task A) / Output Layer (Task B)]

• Different tasks share some layers (i.e. the parameters of these layers), usually used for feature extraction from the input data.
• The output of the shared layers (usually learned features) is fed to different task-specific layers to obtain the final results.

• Sebastian Ruder, An Overview of Multi-Task Learning in Deep Neural Networks. https://ruder.io/multi-task/.
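• A minimal PyTorch sketch of hard parameter sharing (layer sizes and task heads are illustrative, not from the cited sources): one shared feature extractor feeds two task-specific heads, and the task losses are summed for a joint update.

```python
import torch
import torch.nn as nn

class HardSharingNet(nn.Module):
    def __init__(self, in_dim=32, hidden=64, n_classes_a=3, n_classes_b=2):
        super().__init__()
        # Shared layers (feature extractor) used by both tasks.
        self.shared = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        # Task-specific output layers.
        self.head_a = nn.Linear(hidden, n_classes_a)
        self.head_b = nn.Linear(hidden, n_classes_b)

    def forward(self, x):
        features = self.shared(x)
        return self.head_a(features), self.head_b(features)

model = HardSharingNet()
criterion = nn.CrossEntropyLoss()
x = torch.randn(8, 32)                            # toy batch shared by A and B
y_a, y_b = torch.randint(0, 3, (8,)), torch.randint(0, 2, (8,))
out_a, out_b = model(x)
loss = criterion(out_a, y_a) + criterion(out_b, y_b)   # joint MTL loss
loss.backward()
```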


Soft Parameter Sharing
[Diagram: two parallel networks, one per task (Input Layer → Constrained Layers → Unconstrained layers → Output Layer), with the constrained layers of Task A and Task B linked to each other]

• Replace shared layers (with identical parameters) with constrained layers, which have similar or related parameters.
• The similarity or relatedness of the parameters can be controlled by a regularization term in the loss function, or through connections between the constrained layers of different tasks.

• Sebastian Ruder, An Overview of Multi-Task Learning in Deep Neural Networks. https://ruder.io/multi-task/.
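• A minimal PyTorch sketch of soft parameter sharing (the towers and weighting factor are illustrative): each task keeps its own layers, and an L2 regularization term pulls the corresponding constrained-layer parameters of the two tasks toward each other.

```python
import torch
import torch.nn as nn

def make_tower(in_dim=32, hidden=64, out_dim=2):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                         nn.Linear(hidden, out_dim))

tower_a, tower_b = make_tower(out_dim=3), make_tower(out_dim=2)

def soft_sharing_penalty(model_a, model_b, weight=1e-2):
    # L2 distance between corresponding parameters of the two towers;
    # only shape-matching (constrained) layers are regularized.
    penalty = 0.0
    for p_a, p_b in zip(model_a.parameters(), model_b.parameters()):
        if p_a.shape == p_b.shape:
            penalty = penalty + (p_a - p_b).pow(2).sum()
    return weight * penalty

criterion = nn.CrossEntropyLoss()
x_a, x_b = torch.randn(8, 32), torch.randn(8, 32)   # task-specific inputs
y_a, y_b = torch.randint(0, 3, (8,)), torch.randint(0, 2, (8,))
loss = (criterion(tower_a(x_a), y_a) + criterion(tower_b(x_b), y_b)
        + soft_sharing_penalty(tower_a, tower_b))    # relatedness regularizer
loss.backward()
```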


Why does MTL work?
• Implicit data augmentation:
• If different tasks have different input data, then each task
can benefit from the extra knowledge encoded in the
input of other tasks.
• Even if all tasks share the same data, simultaneously
learning for multiple tasks can reduce the risk of
overfitting for each one of these tasks.
• Enhanced feature learning
• It may be the case that a specific task is so noisy that we
cannot learn the most relevant features if we only deal
with that particular task.
• Including other tasks makes it easier to uncover truly
relevant features.
• Besides, some features that are hard to learn for one task may be easy to learn for another. Training the tasks together lets the harder task benefit from features learned through the easier one.
• Sebastian Ruder, An Overview of Multi-Task Learning in Deep Neural Networks. https://ruder.io/multi-task/.
MTL Example: Image Segmentation and Depth Regression

Fusing semantic segmentation, instance segmentation and per-pixel depth regression tasks using
hard parameter sharing.

• Kendall, Alex, Yarin Gal, and Roberto Cipolla. "Multi-task learning using uncertainty to weigh losses for scene geometry and
semantics." Proceedings of the IEEE conference on computer vision and pattern recognition. 2018.
MTL Example: Cross Language Knowledge Transfer

Fusing language-specific tasks using multi-lingual feature transformation layers by hard parameter sharing.

• Huang, Jui-Ting, et al. "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers." 2013 IEEE International
Conference on Acoustics, Speech and Signal Processing. IEEE, 2013.
MTL Example: Correlated Time Series Forecasting

Fusing task-specific layers using shared layers by hard parameter sharing.

• Cirstea, Razvan-Gabriel, et al. "Correlated time series forecasting using multi-task deep neural networks." Proceedings of the 27th acm international
conference on information and knowledge management. 2018.
References
1. Hongyi Li, Transfer Learning.
https://speech.ee.ntu.edu.tw/~tlkagk/courses/ML_2016/Lecture/transfer%20(v3).pdf.
2. Sejuti Das. Top 8 Pre-Trained NLP Models Developers Must Know.
https://analyticsindiamag.com/top-8-pre-trained-nlp-models-developers-must-know/.
3. Devlin, Jacob, et al. "Bert: Pre-training of deep bidirectional transformers for language understanding."
arXiv preprint arXiv:1810.04805 (2018).
4. Brown, Tom, et al. "Language models are few-shot learners." Advances in Neural Information Processing Systems 33 (2020): 1877-1901.
5. Feng, Zhangyin, et al. "CodeBERT: A pre-trained model for programming and natural languages." arXiv preprint arXiv:2002.08155 (2020).
6. Liu, Yinhan, et al. "RoBERTa: A robustly optimized BERT pretraining approach." arXiv preprint arXiv:1907.11692 (2019).
7. Lan, Zhenzhong, et al. "ALBERT: A lite BERT for self-supervised learning of language representations." arXiv preprint arXiv:1909.11942 (2019).
8. Yang, Zhilin, et al. "XLNet: Generalized autoregressive pretraining for language understanding." Advances in Neural Information Processing Systems 32 (2019).
9. Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
10. He, Kaiming, et al. "Deep residual learning for image recognition." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
11. Dosovitskiy, Alexey, et al. "An image is worth 16x16 words: Transformers for image recognition at scale." arXiv preprint arXiv:2010.11929 (2020).
12. Liu, Ze, et al. "Swin Transformer: Hierarchical vision transformer using shifted windows." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021.
13. Wang, Wenhai, et al. "PVT v2: Improved baselines with pyramid vision transformer." Computational Visual Media 8.3 (2022): 415-424.
References
14. Sebastian Ruder, An Overview of Multi-Task Learning in Deep Neural Networks. https://ruder.io/multi-task/.
15. Kendall, Alex, Yarin Gal, and Roberto Cipolla. "Multi-task learning using uncertainty to weigh losses for scene geometry and semantics." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018.
16. Rebut, Julien, et al. "Raw high-definition radar for multi-task learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
17. Huang, Jui-Ting, et al. "Cross-language knowledge transfer using multilingual deep neural network with shared hidden layers." 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013.
18. Liu, Pengfei, Xipeng Qiu, and Xuanjing Huang. "Recurrent neural network for text classification with multi-task learning." arXiv preprint arXiv:1605.05101 (2016).
19. Cirstea, Razvan-Gabriel, et al. "Correlated time series forecasting using multi-task deep neural networks." Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 2018.
