Introduction to Deep Learning 17th January 2025 (2)
C. Chandra Sekhar
Dept. of Computer Science and Engineering
Indian Institute of Technology Madras
Chennai-600036
Regression and Classification Tasks
S.J.D.Prince, Understanding Deep Learning, MIT Press, 2023
Learning Tasks with Structured Outputs
Tiger
Giraffe
Horse
Bear
Pattern Classification Tasks in Speech Processing
I rO ju vAr ta lo lu mu khyam sa lu
Text Processing Tasks
• Sentence classification
• Parts-of-speech tagging
• Named entity recognition
• Sentiment analysis
Classification using Deep Learning Models
Representation Learning: Conventional machine learning models (Bayes
classifiers, MLFFNNs and SVMs) take hand-designed features as input.
Deep learning techniques instead learn the representation (features) from the
raw data given as input to the model.
Conventional: Raw Data → Feature Extraction → Representation → Classification Model (Bayes Classifier/MLFFNN/SVM) → Class Label
Deep learning: Raw Data → Feature Extraction and Classification → Class Label
Content based Image Retrieval
• Suitable method for matching
• Measure of dissimilarity: Distance metric learning
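A minimal sketch of matching by dissimilarity, assuming each image is already represented by a feature vector. Plain Euclidean distance is used here only as a stand-in for a learned distance metric; the repository contents and the names `euclidean` and `retrieve` are hypothetical.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def retrieve(query, repository, top_k=2):
    """Return the names of the top_k repository images closest to the query."""
    ranked = sorted(repository, key=lambda name: euclidean(query, repository[name]))
    return ranked[:top_k]

# Hypothetical 3-dimensional image features (not from the slides).
repository = {
    "tiger": [0.9, 0.1, 0.0],
    "horse": [0.1, 0.8, 0.1],
    "bear": [0.7, 0.2, 0.1],
}
closest = retrieve([1.0, 0.0, 0.0], repository)
```

With a learned metric, only `euclidean` would change; the ranking step stays the same.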
Content based Image Retrieval
• Images in the repository should be annotated
• Image annotation: Multi-label pattern classification
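Multi-label annotation can be sketched as one independent sigmoid decision per label, so an image may receive several labels at once. This is an illustrative sketch: the label set, the activation values, and the 0.5 threshold are assumptions, not taken from the slides.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def annotate(scores, labels, threshold=0.5):
    """Keep every label whose sigmoid probability reaches the threshold."""
    return [label for label, a in zip(labels, scores) if sigmoid(a) >= threshold]

labels = ["sky", "water", "tree", "person"]   # hypothetical annotation vocabulary
scores = [2.0, -1.5, 0.3, -0.2]               # hypothetical output-layer activations
tags = annotate(scores, labels)
```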
Image Captioning
Video Captioning
● Generate text descriptions by localizing interesting events in a video.
○ Event detection: Event Proposal Module
○ Event description: Captioning Module
Input → Event Proposal Module → Proposed Events → Captioning Module → Generated captions for different events
Visual Question Answering
[Figure: example question “How many children are in the bed?”; answer labels shown in the figure: Man, Woman, Yes, No, Two, One]
Visual Commonsense Reasoning
Deep Learning Models
• Deep Feedforward Neural Networks (DFNNs)
• Stacked Autoencoder based Pre-training for DFNNs
• Convolutional Neural Networks (CNNs)
• Recurrent Neural Networks (RNNs)
• Long Short Term Memory (LSTM) Networks
• Attention based Models: Transformers
– Pre-training of transformer model: BERT
• Generative Models
– Generative Pre-trained Transformers (GPT)
– Variational Autoencoders
– Generative Adversarial Networks (GANs)
– Diffusion Models
Multilayer Feedforward Neural Network
• Architecture of an MLFFNN
– Input layer: Linear neurons
– Hidden layers (1 or 2): Sigmoidal neurons
– Output layer: Sigmoidal neurons or Softmax neurons
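[Figure: MLFFNN with input nodes x_1, ..., x_d (index i), hidden nodes 1, ..., J (index j), and output nodes 1, ..., K (index k) producing outputs s_1, ..., s_K]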
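The MLFFNN architecture above can be sketched as a forward pass with one sigmoidal hidden layer and a softmax output layer. The weights, the layer sizes (d = 2, J = 3, K = 2), and the function names are hypothetical and chosen only for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def softmax(a):
    m = max(a)  # subtract the maximum for numerical stability
    e = [math.exp(v - m) for v in a]
    total = sum(e)
    return [v / total for v in e]

def affine(weights, bias, x):
    """Compute W x + b for a weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def forward(x, w_hidden, b_hidden, w_out, b_out):
    h = [sigmoid(a) for a in affine(w_hidden, b_hidden, x)]  # hidden layer (sigmoidal)
    return softmax(affine(w_out, b_out, h))                  # output layer (softmax)

# Hypothetical weights: d = 2 inputs, J = 3 hidden nodes, K = 2 classes.
w_hidden = [[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]]
b_hidden = [0.0, 0.1, -0.1]
w_out = [[1.0, -1.0, 0.5], [-0.5, 0.7, 0.2]]
b_out = [0.0, 0.0]
s = forward([1.0, 2.0], w_hidden, b_hidden, w_out, b_out)
```

The input layer is linear, so `x` is passed through unchanged; the softmax outputs `s` sum to 1 and can be read as class probabilities.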
Deep Feedforward Neural Network (DFNN)
Input X → Input Layer → Hidden Layer 1 → Hidden Layer 2 → Hidden Layer 3 → Hidden Layer 4 → Hidden Layer 5 → Output Layer → Output S
Optimization Methods for Training a DFNN
• Overfitting:
• L2 regularization method
• Dropout method
• Batch normalization
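The three methods listed above can each be sketched in a few lines. This is a simplified illustration under stated assumptions: the penalty uses the common (λ/2)·Σw² form, dropout is the "inverted" variant that rescales surviving activations, and batch normalization is shown for a single feature with fixed scale `gamma` and shift `beta`. All function names and values are hypothetical.

```python
import math
import random

def l2_penalty(weights, lam):
    """L2 regularization term: (lam / 2) * sum of squared weights."""
    return 0.5 * lam * sum(w * w for w in weights)

def dropout(activations, keep_prob, rng):
    """Inverted dropout: zero each activation with probability 1 - keep_prob
    and rescale the survivors so the expected value is unchanged."""
    return [a / keep_prob if rng.random() < keep_prob else 0.0 for a in activations]

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a mini-batch, then scale and shift."""
    mean = sum(batch) / len(batch)
    var = sum((v - mean) ** 2 for v in batch) / len(batch)
    return [gamma * (v - mean) / math.sqrt(var + eps) + beta for v in batch]

penalty = l2_penalty([1.0, -2.0, 3.0], lam=0.1)
dropped = dropout([1.0, 2.0, 3.0, 4.0], keep_prob=0.5, rng=random.Random(0))
normalized = batch_norm([1.0, 2.0, 3.0, 4.0])
```

The penalty is added to the training loss, dropout is applied only during training, and in practice `gamma` and `beta` are learned parameters.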
Auto-Association Neural Network (AANN)
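[Figure: AANN. The encoder maps the input layer (x_1, x_2, x_3, ..., x_d) to the dimension reduction layer; the decoder maps the dimension reduction layer to the output layer. The actual output is (s_1, s_2, s_3, ..., s_d) and the desired output is the input (x_1, x_2, x_3, ..., x_d).]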
Multiple AANNs for Stacked Autoencoder
AANN 1: Input x (dimension d) → Encoder 1 → Bottleneck features z1 (dimension l1) → Decoder 1 → Desired output x (dimension d)
AANN 2: Input z1 (dimension l1) → Encoder 2 → Bottleneck features z2 (dimension l2) → Decoder 2 → Desired output z1 (dimension l1)
AANN 3: Input z2 (dimension l2) → Encoder 3 → Bottleneck features z3 (dimension l3) → Decoder 3 → Desired output z2 (dimension l2)
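The chaining of AANNs can be sketched as a plan of layer dimensions: each AANN reconstructs its own input, and its bottleneck features become the input of the next AANN. The helper `layerwise_plan` and the example dimensions are hypothetical; no training is performed here.

```python
def layerwise_plan(input_dim, bottleneck_dims):
    """List (input, bottleneck, output) dimensions for each AANN in training order."""
    plan = []
    current = input_dim
    for bottleneck in bottleneck_dims:
        plan.append((current, bottleneck, current))  # desired output equals the input
        current = bottleneck  # bottleneck features feed the next AANN
    return plan

plan = layerwise_plan(100, [64, 32, 16])  # hypothetical d, l1, l2, l3
```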
Stacked Autoencoder for Pre-training a DFNN
Input X → Autoencoder 1 → Autoencoder 2 → Autoencoder 3 → Output Layer → Output S
LeNet5: CNN for Handwritten Character Recognition
Input (32x32) → Convolution → 6 feature maps (28x28) → Pooling → 6 feature maps (14x14) → Convolution → 16 feature maps (10x10) → Pooling → 16 feature maps (5x5) → Output (26 nodes)
• Weight sharing: All the nodes in a feature map in a convolutional layer have the same
synaptic weights (~278000 connections, but only ~1700 weight parameters)
• Output layer: 26 nodes with one node for each character. Each node in the output layer is
connected to the nodes in all the feature maps in the 4th hidden layer.
• W. Rawat and Z. Wang, “Deep convolutional neural networks for image classification:
A comprehensive survey,” Neural Computation, vol.29, pp.2352-2449, 2017.
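The feature map sizes above, and the effect of weight sharing, can be checked with a little arithmetic. This sketch assumes 5x5 kernels without padding and 2x2 non-overlapping pooling, which is consistent with the 32 → 28 → 14 → 10 → 5 progression but is not stated on the slide. Only the first convolutional layer is counted.

```python
def conv_output_size(size, kernel, stride=1):
    """Spatial size after a convolution without padding."""
    return (size - kernel) // stride + 1

def pool_output_size(size, window):
    """Spatial size after non-overlapping pooling."""
    return size // window

sizes = [32]
for layer in ["conv", "pool", "conv", "pool"]:
    prev = sizes[-1]
    sizes.append(conv_output_size(prev, 5) if layer == "conv" else pool_output_size(prev, 2))

# First convolutional layer: 6 feature maps, each with one shared 5x5 kernel plus a bias.
params_c1 = 6 * (5 * 5 + 1)            # shared weight parameters
connections_c1 = params_c1 * 28 * 28   # every node in a 28x28 map reuses the same weights
```

The gap between `connections_c1` and `params_c1` is the point of weight sharing: many connections, few free parameters.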
VGG-Net Architecture
• Deep CNN developed by the Visual Geometry Group (VGG) at the
University of Oxford
• Task: Classification of color images belonging to 1000 classes in
the ImageNet dataset
U-Net for Image Segmentation
O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” arXiv, 2015.
Faster Region-based CNN (Faster R-CNN)
for Object Detection
S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” arXiv, 2016.
Image Captioning
Encoder-Decoder Paradigm
for Image Captioning
Image → Deep Convolutional Neural Network (DCNN) [Encoder] → Representation of Image → Recurrent Neural Network (RNN) [Decoder] → Caption
Recurrent Neural Network (RNN)
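[Figure: RNN. The input at time t, x_t, and the previous hidden state, h_{t-1}, enter the hidden layer, which produces the state of the hidden layer at time t, h_t. The output layer maps h_t to the output at time t, s_t.]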
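One time step of the RNN above can be sketched as follows. The tanh hidden activation, the linear output layer, and all weight names and values are assumptions made for illustration; the slide itself only names x_t, h_{t-1}, h_t, and s_t.

```python
import math

def matvec(m, v):
    """Matrix-vector product for a matrix given as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def rnn_step(x_t, h_prev, w_xh, w_hh, b_h, w_hs, b_s):
    """One time step: compute the new hidden state h_t and the output s_t."""
    a = [p + q + b for p, q, b in zip(matvec(w_xh, x_t), matvec(w_hh, h_prev), b_h)]
    h_t = [math.tanh(v) for v in a]                          # hidden layer
    s_t = [p + b for p, b in zip(matvec(w_hs, h_t), b_s)]    # output layer (linear here)
    return h_t, s_t

# Hypothetical weights: 2 inputs, 2 hidden nodes, 1 output.
w_xh = [[0.5, -0.3], [0.2, 0.4]]
w_hh = [[0.1, 0.0], [0.0, 0.1]]
b_h = [0.0, 0.0]
w_hs = [[1.0, -1.0]]
b_s = [0.0]

h = [0.0, 0.0]  # initial hidden state
outputs = []
for x_t in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h, s_t = rnn_step(x_t, h, w_xh, w_hh, b_h, w_hs, b_s)
    outputs.append(s_t)
```

The same weights are reused at every time step; only the hidden state `h` carries information forward.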
Long Short-Term Memory (LSTM)
• Structure of an LSTM Cell
[Figure: LSTM cell with forget gate f_t, input gate i_t, output gate o_t, and hidden state h_t]
• An RNN that uses LSTM cells in its hidden layer has been shown to
mitigate the vanishing gradient problem, leading to faster convergence
during training
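A minimal sketch of one LSTM cell step, using the standard formulation with a forget gate f_t, input gate i_t, output gate o_t, a candidate value g_t, and a cell state c_t. The parameter names (`W_f`, `b_f`, and so on) and the all-zero example weights are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def gate(weights, bias, z, activation):
    """Apply activation(W z + b) element-wise."""
    return [activation(sum(w * v for w, v in zip(row, z)) + b)
            for row, b in zip(weights, bias)]

def lstm_step(x_t, h_prev, c_prev, params):
    z = h_prev + x_t  # concatenate previous hidden state and current input
    f_t = gate(params["W_f"], params["b_f"], z, sigmoid)    # forget gate
    i_t = gate(params["W_i"], params["b_i"], z, sigmoid)    # input gate
    o_t = gate(params["W_o"], params["b_o"], z, sigmoid)    # output gate
    g_t = gate(params["W_g"], params["b_g"], z, math.tanh)  # candidate cell value
    c_t = [f * c + i * g for f, c, i, g in zip(f_t, c_prev, i_t, g_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Hypothetical cell with 1 hidden unit and 1 input; all weights set to zero.
zero = {"W": [[0.0, 0.0]], "b": [0.0]}
params = {f"{k}_{g}": zero[k] for g in "fiog" for k in ("W", "b")}
h_t, c_t = lstm_step([1.0], [0.0], [1.0], params)
```

The additive update of `c_t` is what lets gradients flow across many time steps when the forget gate stays close to 1.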
Encoder-Decoder Paradigm
for Image Captioning
[Figure: Image → CNN-based encoder → Decoder (LSTM)]
• The output of the pre-final layer in the CNN-based encoder is used
as the initial state of the hidden layer of the LSTM-based decoder
Embedding Methods
Sequence-to-Sequence Mapping Tasks
• Neural Machine Translation: Translation of a sentence in
the source language to a sentence in the target language
– Input: A sequence of words
– Output: A sequence of words
• Video Captioning: Generation of a sentence as the
caption for a video represented as a sequence of frames
– Input: A sequence of feature vectors extracted from
the frames of a video
– Output: A sequence of words
• Each of the above tasks involves mapping an input
sequence to an output sequence
Encoder-Decoder Paradigm for
Sequence-to-Sequence Mapping
Input Sequence → Recurrent Neural Network (RNN) [Encoder] → Representation of Input Sequence → Recurrent Neural Network (RNN) [Decoder] → Output Sequence
Encoder-Decoder Paradigm for
Sequence-to-Sequence Mapping
• Sequence-to-Sequence Mapping using Encoder-Decoder Paradigm
• Encoder: Generate a representation of the input sequence
• Representation generated by Encoder is given as input to Decoder
• Decoder: Generate the output sequence (A sequence of words)
Bidirectional Encoder Representations from
Transformers (BERT)
• Pre-train a generic representation that can be used for several
Natural Language Processing (NLP) tasks
• Pre-training Methods:
– Masked Language Modelling (Mask LM)
– Next Sentence Prediction (NSP)
J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep
Bidirectional Transformers for Language Understanding,” NAACL, 2019.
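The input preparation for masked language modelling can be sketched as below. This is a simplified illustration: every selected token is replaced by `[MASK]`, whereas the BERT paper also leaves some selected tokens unchanged or swaps in random tokens. The 15% selection rate follows the paper; the function name and example sentence are hypothetical.

```python
import random

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace randomly chosen tokens with [MASK] and record the originals as targets."""
    rng = rng or random.Random()
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append("[MASK]")
            targets.append(tok)      # the model is trained to predict this token
        else:
            masked.append(tok)
            targets.append(None)     # no prediction loss at this position
    return masked, targets

tokens = "the children are sleeping in the bed".split()
masked, targets = mask_tokens(tokens, rng=random.Random(0))
```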
Generative Pre-trained Transformer (GPT)
• The transformer decoder is pre-trained using unlabeled text data
• GPT can be fine-tuned for downstream tasks that involve text
data
• Auto-regressive model: A word in a sentence is predicted
using all the words preceding that word in the sentence
• Masked multi-head self-attention (MSA) in each layer of the
transformer decoder takes as input only the sequence of words
preceding the word to be predicted
• The decoder is trained to predict the next word in the
sentence
• GPT-1, GPT-2 and GPT-3: Pre-trained models with different
numbers of layers trained with different corpora for different
pre-training tasks
A. Radford, J. Wu, R. Child, D. Luan, D. Amodei and I. Sutskever, “Language Models are
Unsupervised Multitask Learners,” 2019
T. Brown et al., “Language Models are Few-Shot Learners,” arXiv:2005.14165v4, 22nd July, 2020
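The masking in self-attention can be sketched as a row-wise softmax restricted to the current and earlier positions, so that no position receives weight from later words. The score matrix here is hypothetical; in a real decoder it would come from scaled dot products of queries and keys.

```python
import math

def causal_attention_weights(scores):
    """Row-wise softmax over scores[t][0..t]; positions after t get weight 0."""
    weights = []
    for t, row in enumerate(scores):
        visible = row[: t + 1]                 # mask out future positions
        m = max(visible)
        e = [math.exp(v - m) for v in visible]
        total = sum(e)
        weights.append([v / total for v in e] + [0.0] * (len(row) - t - 1))
    return weights

# Hypothetical attention scores for a 3-word sequence.
scores = [[0.0, 5.0, 5.0], [1.0, 1.0, 5.0], [0.0, 0.0, 0.0]]
weights = causal_attention_weights(scores)
```

Large scores at future positions (the 5.0 entries) have no effect, which is what makes next-word prediction a valid training task.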
Visual Question Answering (VQA) for Images
[Figure: example images with answer labels Man, Woman, Yes, No, Two, One]
Open Ended VQA
Image VQA Framework
Image → Image Encoder → Representation of Image
Question (Text): “How many children are in the bed?” → Question Encoder → Question Representation
Representation of Image + Question Representation → Fusion of Representations → Answer Generator → Answer (“Two”)
Image Encoder: CNN, ViT Encoder, Swin Transformer
Question Encoder: LSTM, Transformer encoder, BERT fine-tuned with questions in VQA dataset
Answer Generator: Classifier, Text generator such as GPT fine-tuned with answers in VQA dataset
Open Ended VQA Framework
Question and Partial Answer → Decoder → Representation
Image → Image Encoder → Representation of Image
Both representations → Fusion of Representations → Next Word Predictor → Next word
In open ended VQA, the answer is a sequence of words. The system generates
one word of the answer at a time. The next word in the answer is predicted using
the representations of image, question, and the partial answer corresponding to
the sequence of words generated so far.
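The word-by-word generation described above can be sketched as a simple loop. The function `predict_next_word` stands for the trained next word predictor; `toy_predictor`, the end token `<end>`, and the image representation values are hypothetical stand-ins used only to make the sketch runnable.

```python
def generate_answer(predict_next_word, image_repr, question, max_len=10, end_token="<end>"):
    """Generate an answer one word at a time until the end token or max_len."""
    answer = []
    for _ in range(max_len):
        # The prediction uses the image, the question, and the partial answer so far.
        word = predict_next_word(image_repr, question, answer)
        if word == end_token:
            break
        answer.append(word)
    return answer

def toy_predictor(image_repr, question, partial_answer):
    """Hypothetical stand-in that returns a canned sequence of words."""
    canned = ["two", "children", "<end>"]
    return canned[len(partial_answer)]

answer = generate_answer(
    toy_predictor,
    image_repr=[0.1, 0.2],
    question="How many children are in the bed?",
)
```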
NLP Benchmarks:
L.Yang et al., “Diffusion Models: A Comprehensive Survey of Methods and Applications,” arXiv, 2023.
Text-to-Image Translation
Output of Stable Diffusion Model
Word prompts: a dream of time gone by, oil painting, red blue white, canvas, watercolor, koi fish, and animals.
Retrieval Augmented Generation (RAG)
RAG Process for Textual Question Answering
Yunfan Gao et al., “Retrieval-Augmented Generation for Large Language Models: A Survey,” arXiv:2312.10997v5 [cs.CL], 27 March 2024
Coverage of Topics
1. Introduction to deep learning
2. Feedforward neural networks: Model of an artificial
neuron, Activation functions: Sigmoidal function, Rectified
linear unit (ReLU) function, Softmax function, Multi-layer
feedforward neural network, Backpropagation method,
Gradient descent method, Stochastic gradient descent
method
3. Optimization and regularization methods for deep
feedforward neural networks (DFNNs): Optimization
methods: Generalized delta rule, AdaGrad, RMSProp,
Adadelta, Adam, Second order methods; Regularization
methods: Dropout, DropConnect; Batch normalization
4. Autoencoders: Autoassociative neural network,
Stacked autoencoder, Greedy layer-wise training,
Pre-training of a DFNN using a stacked autoencoder
Coverage of Topics (Contd.)
5. Convolutional neural networks (CNNs): Basic CNN
architecture, Deep CNNs for image classification:
LeNet, VGGNet, GoogLeNet, ResNet; CNNs for image
segmentation: U-Net and Fast RCNN; 1-d CNNs, 3-d
CNNs
6. Recurrent neural networks (RNNs): Architecture of
an RNN, Unfolding an RNN, Backpropagation through
time, Vanishing and exploding gradient problems in
RNNs, Long short term memory (LSTM) units, Gated
recurrent units, Bidirectional RNNs
7. Embedding methods: Image and video embedding
methods; Word embedding methods: Word2Vec, GloVe
Coverage of Topics (Contd.)
8. Transformer models: Attention based models, Scaled
dot-product attention, Multi-head attention (MHA),
Self-attention MHA, Cross-attention MHA, Position encoding,
Encoder module in a transformer, Decoder module in a
transformer, Sequence to sequence mapping using
transformer, Bidirectional encoder representations from
transformers (BERT) model for text processing,
Pre-training a BERT model, Fine-tuning, Generative
pre-trained transformer (GPT), Introduction to large
language models (LLMs)
9. Generative Models: Variational autoencoder (VAE),
Generative adversarial networks (GANs), Introduction
to diffusion models
Books and Evaluation Pattern
Text Books:
1. C.M.Bishop and H.Bishop, Deep Learning: Foundations and Concepts, Springer,
2024
2. Charu C. Aggarwal, Neural Networks and Deep Learning, Springer, 2nd Ed., 2023
3. S.J.D.Prince, Understanding Deep Learning, MIT Press, 2023
Reference Books:
1. I.Goodfellow, Y.Bengio and A.Courville, Deep Learning, MIT Press, 2016
2. U.Kamath, J.Liu and J.Whitaker, Deep Learning for NLP and Speech Recognition,
Springer, 2019
3. Nithin Buduma, Nikhil Buduma, Joe Papa, Fundamentals of Deep Learning,
O’Reilly, 2nd Ed., 2022
4. I.Drori, The Science of Deep Learning, Cambridge University Press, 2022