Introduction to Deep Learning 17th January 2025 (2)

The document provides an introduction to deep learning, covering various machine learning techniques for classification, including K-Nearest Neighbours, Bayes Classifier, and neural networks. It discusses deep learning models such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs), along with applications in image classification, video captioning, and sequence-to-sequence mapping tasks. Additionally, it highlights optimization and regularization methods for training deep learning models.

Introduction to Deep Learning

C. Chandra Sekhar
Dept. of Computer Science and Engineering
Indian Institute of Technology Madras
Chennai-600036

[email protected]

Office Room: SSB 407

1
Regression and Classification Tasks

[Figure: five example tasks, each handled by a machine learning model]

S.J.D.Prince, Understanding Deep Learning, MIT Press, 2023
2
Learning Tasks with Structured Outputs

S.J.D.Prince, Understanding Deep Learning, MIT Press, 2023
3
Machine Learning Techniques for Classification
• K-Nearest Neighbours Method
• Bayes Classifier
– Statistical modeling
– Unimodal distribution modeling
– Multimodal distribution modeling: Gaussian Mixture Model
• Multilayer feedforward neural network based classification
• Support vector machine based classification
• Classification using decision tree
– Random forest based classification
• Classification of sequential or temporal patterns
– Hidden Markov model
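
As an illustration of the first technique in the list above, the following is a minimal sketch of K-nearest-neighbours classification. The function name `knn_predict`, the Euclidean distance, the choice k=3 and the toy data are all assumptions made for this example and are not taken from the slides.

```python
import numpy as np

def knn_predict(train_x, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Euclidean distance from the query to every training example
    dists = np.linalg.norm(train_x - query, axis=1)
    # Indices of the k closest training examples
    nearest = np.argsort(dists)[:k]
    # Majority vote over the labels of those neighbours
    labels, counts = np.unique(train_y[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy two-class data (hypothetical): class 0 near the origin, class 1 near (1, 1)
train_x = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                    [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
train_y = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(train_x, train_y, np.array([0.15, 0.15])))
print(knn_predict(train_x, train_y, np.array([0.95, 1.0])))
```

The method has no training phase: all the work happens at query time, which is one reason it is usually presented first among classification techniques.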
Image Classification

[Figure: example images with class labels Tiger, Giraffe, Horse and Bear]

5
Pattern Classification Tasks in Speech Processing

is bu le Tin ki mu khya sa mA chAr

mu nnAL mu da la mei ccar sel vi jey la li ta

I rO ju vAr ta lo lu mu khyam sa lu

• Speech Recognition
• Speech Emotion Recognition
• Speaker Recognition
• Spoken Language Identification

6
Text Processing Tasks

• Sentence classification
• Parts-of-speech tagging
• Named entity recognition
• Sentiment analysis

7
Classification using Deep Learning Models
Representation Learning: Conventional machine learning techniques (Bayes
classifiers, MLFFNNs and SVMs) take hand-designed features as input to the models.
The focus of deep learning techniques is to learn the representation (features) from
the raw data given as input to the models.

Conventional Approaches to Pattern Classification:

Raw Data -> Feature Extraction -> Representation -> Classification Model (Bayes Classifier/MLFFNN/SVM) -> Class Label

Deep Learning based Approaches to Pattern Classification:

Raw Data -> Feature Extraction and Classification (Deep Convolutional Neural Network) -> Class Label

8
Content based Image Retrieval

• Query-by-example (QBE) Approach

• Suitable method for matching
• Measure of dissimilarity: Distance metric learning

9
Content based Image Retrieval

• Query-by-semantics (QBS) Approach

• Images in the repository should be annotated
• Image annotation: Multi-label pattern classification

10
Image Captioning

A group of people shopping at an outdoor market.
There are many vegetables at the fruit stand.

O. Vinyals, A. Toshev, S. Bengio and D. Erhan, “Show and tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 652-663, April 2017.

A woman holding a camera in crowd

Fang et al., “From captions to visual concepts and back,” CVPR, 2015.

11
Video Captioning
● Generate text descriptions by localizing interesting events in a video.
○ Event detection: Event Proposal Module
○ Event description: Captioning Module

Input -> Event Proposal Module -> Proposed Events -> Captioning Module -> Generated captions for different events

12
Visual Question Answering

Question: Is there something to cut the vegetables with? Answers shown: Yes, No
Question: Who is wearing glasses? Answers shown: Man, Woman
Question: How many children are in the bed? Answers shown: Two, One

13
Visual Commonsense Reasoning

14
Deep Learning Models
• Deep Feedforward Neural Networks (DFNNs)
• Stacked Autoencoder based Pre-training for DFNNs
• Convolutional Neural Networks (CNNs)
• Recurrent Neural Networks (RNNs)
• Long Short Term Memory (LSTM) Networks
• Attention based Models: Transformers
– Pre-training of transformer model: BERT
• Generative Models
– Generative Pre-trained Transformers (GPT)
– Variational Autoencoders
– Generative Adversarial Networks (GANs)
– Diffusion Models

15
Multilayer Feedforward Neural Network
• Architecture of an MLFFNN
– Input layer: Linear neurons
– Hidden layers (1 or 2): Sigmoidal neurons
– Output layer: Sigmoidal neurons or Softmax neurons

[Figure: MLFFNN with an input layer (nodes 1, ..., i, ..., d receiving inputs x1, ..., xd), a hidden layer (nodes 1, ..., j, ..., J) and an output layer (nodes 1, ..., k, ..., K producing outputs s^o_1, ..., s^o_K)]
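
The architecture described above can be sketched as a single forward pass. This is a minimal illustration assuming one hidden layer of sigmoidal neurons and a softmax output layer; the weight names (`W_h`, `W_o`), the random initialization and the sizes d=4, J=5, K=3 are hypothetical choices for the example.

```python
import numpy as np

def sigmoid(a):
    """Logistic sigmoid, applied element-wise."""
    return 1.0 / (1.0 + np.exp(-a))

def softmax(a):
    """Softmax over a vector; subtracting the max keeps exp() stable."""
    e = np.exp(a - np.max(a))
    return e / e.sum()

def mlffnn_forward(x, W_h, b_h, W_o, b_o):
    """Forward pass: linear input layer -> sigmoidal hidden layer -> softmax output."""
    h = sigmoid(W_h @ x + b_h)   # outputs of the J hidden nodes
    s = softmax(W_o @ h + b_o)   # outputs of the K output nodes
    return h, s

rng = np.random.default_rng(0)
d, J, K = 4, 5, 3                                  # input, hidden and output sizes
W_h, b_h = rng.normal(size=(J, d)), np.zeros(J)
W_o, b_o = rng.normal(size=(K, J)), np.zeros(K)

x = rng.normal(size=d)
h, s = mlffnn_forward(x, W_h, b_h, W_o, b_o)
print(s, s.sum())
```

With softmax neurons the K outputs are non-negative and sum to one, so they can be read as class scores for a classification task.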
16
Deep Feedforward Neural Network
(DFNN)

[Figure: DFNN with an input layer (input X), five hidden layers (1 to 5) and an output layer (output Y)]

17
Optimization Methods for Training a DFNN

• Slow convergence of gradient descent method
• Problem addressed: How to reduce the number of epochs taken to reach a local minimum?
• Weight update methods that use the past history of updates have been shown to be effective.
• Generalized delta rule that uses a momentum factor
• Weight-specific learning rate scheduling methods (Adaptive learning rate methods)
– AdaGrad
– RMSProp
– AdaDelta
– Adam
• Second-order methods for optimization
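
To make the role of the update history concrete, here is a small sketch comparing plain gradient descent with the generalized delta rule (momentum) on a toy quadratic error surface. The objective, the learning rate 0.02, the momentum factor 0.9 and the step count are assumptions chosen for illustration only; the adaptive methods listed above are not shown.

```python
import numpy as np

# Toy objective (an assumption for illustration): E(w) = 0.5 * w^T A w,
# with very different curvature along the two weight axes.
A = np.diag([1.0, 50.0])

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

def gradient_descent(w0, lr, steps):
    """Plain gradient descent: each update uses only the current gradient."""
    w = w0.copy()
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def generalized_delta_rule(w0, lr, momentum, steps):
    """Gradient descent with a momentum term that reuses the previous update."""
    w = w0.copy()
    delta = np.zeros_like(w)          # previous weight update
    for _ in range(steps):
        delta = momentum * delta - lr * grad(w)
        w = w + delta
    return w

w0 = np.array([1.0, 1.0])
w_gd = gradient_descent(w0, lr=0.02, steps=200)
w_mom = generalized_delta_rule(w0, lr=0.02, momentum=0.9, steps=200)
print("gradient descent loss:", loss(w_gd))
print("momentum loss:        ", loss(w_mom))
```

On this surface the learning rate is limited by the steep direction, so plain gradient descent crawls along the shallow one; the momentum term accumulates past updates along that direction and reaches a lower error in the same number of steps.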
18
Regularization Methods for Training a DFNN

• Underfitting: Model complexity is low
• Overfitting:
– Model complexity is high
– Training dataset size is small
• L2 regularization method
• Dropout method
• DropConnect method
• Batch normalization
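
Two of the methods listed above can be sketched in a few lines. The dropout function below uses the common "inverted dropout" variant (rescaling at training time), which is an assumption since the slides do not specify a variant; the names `dropout`, `l2_penalty`, `keep_prob` and `lam` are hypothetical.

```python
import numpy as np

def dropout(h, keep_prob, rng, training=True):
    """Inverted dropout: zero each activation with probability 1 - keep_prob
    during training and rescale the survivors so the expected value is unchanged."""
    if not training:
        return h                       # no change at test time
    mask = rng.random(h.shape) < keep_prob
    return h * mask / keep_prob

def l2_penalty(weights, lam):
    """L2 regularization term added to the training error: (lam / 2) * sum of squared weights."""
    return 0.5 * lam * sum(np.sum(W ** 2) for W in weights)

rng = np.random.default_rng(0)
h = np.ones(100_000)                   # stand-in for hidden-layer outputs
h_train = dropout(h, keep_prob=0.8, rng=rng)
h_test = dropout(h, keep_prob=0.8, rng=rng, training=False)

print("fraction kept:", np.mean(h_train > 0))
print("mean activation after dropout:", h_train.mean())
print("L2 penalty:", l2_penalty([np.array([[1.0, 2.0], [3.0, 4.0]])], lam=0.1))
```

Both methods reduce overfitting in different ways: the L2 term discourages large weights, while dropout prevents hidden nodes from relying on specific other nodes.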
19
Auto-Association Neural Network (AANN)
[Figure: five-layer AANN. The encoder maps the input layer (x1, ..., xd) to the dimension reduction layer, and the decoder maps the dimension reduction layer to the output layer. The actual output (s1, ..., sd) is compared with the desired output (x1, ..., xd).]

• AANN uses linear neurons in the input layer, the dimension reduction layer and the output layer. It uses sigmoidal neurons in the other two hidden layers.
• AANN is trained using the backpropagation learning method.
• After the model is trained, the output of the bottleneck layer (dimension reduction layer) is used as the reduced dimension representation of the input.
• The encoder in an AANN, also called an autoencoder, is used in deep stacked autoencoder network models.
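
The layer arrangement described above can be sketched as a forward pass. This is only an illustration with untrained random weights; the layer sizes (d=6, hidden size 10, bottleneck size 2) and the parameter layout are assumptions, and training by backpropagation is not shown.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def aann_forward(x, params):
    """Forward pass of a five-layer AANN.
    Encoder: input -> sigmoidal hidden layer -> linear bottleneck.
    Decoder: bottleneck -> sigmoidal hidden layer -> linear output."""
    (W1, b1), (W2, b2), (W3, b3), (W4, b4) = params
    h1 = sigmoid(W1 @ x + b1)      # first hidden layer (sigmoidal)
    z = W2 @ h1 + b2               # dimension reduction layer (linear)
    h2 = sigmoid(W3 @ z + b3)      # second hidden layer (sigmoidal)
    s = W4 @ h2 + b4               # output layer (linear), actual output
    return z, s

rng = np.random.default_rng(0)
d, n_hidden, n_bottleneck = 6, 10, 2   # hypothetical layer sizes
shapes = [(n_hidden, d), (n_bottleneck, n_hidden),
          (n_hidden, n_bottleneck), (d, n_hidden)]
params = [(0.1 * rng.normal(size=s), np.zeros(s[0])) for s in shapes]

x = rng.normal(size=d)
z, s = aann_forward(x, params)
error = np.mean((s - x) ** 2)          # desired output is the input itself
print("reduced representation:", z)
print("reconstruction error:", error)
```

Training would adjust the weights to minimize this reconstruction error, after which `z` serves as the reduced dimension representation of `x`.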
20
Auto-Association Neural Network (AANN)

[Figure: AANN mapping the inputs x1, ..., xd through the encoder and the decoder to the outputs s1, ..., sd]

21
Multiple AANNs for Stacked Autoencoder
AANN 1: Input x (dimension d) -> Encoder 1 -> Bottleneck features z1 (dimension l1) -> Decoder 1 -> Desired output x (dimension d)

AANN 2: Input z1 (dimension l1) -> Encoder 2 -> Bottleneck features z2 (dimension l2) -> Decoder 2 -> Desired output z1 (dimension l1)

AANN 3: Input z2 (dimension l2) -> Encoder 3 -> Bottleneck features z3 (dimension l3) -> Decoder 3 -> Desired output z2 (dimension l2)

22
Stacked Autoencoder for Pre-training a DFNN

[Figure: Input X -> Autoencoder 1 -> Autoencoder 2 -> Autoencoder 3 -> Output layer -> Output]

• Weights of the autoencoders are learnt using unsupervised learning with unlabeled examples. These weights are used as the initial weights for the DFNN.

• Fine-tuning of the DFNN involves modification of the weights using the backpropagation learning method with a small set of labeled examples.
23
Convolutional Neural Networks (CNNs)

• A convolutional neural network (CNN) is a special type of multilayer feedforward neural network (MLFFNN) that is well suited for image classification.

• Development of the CNN is neuro-biologically motivated.

• A CNN is an MLFFNN designed specifically to recognize 2-dimensional shapes with a high degree of invariance to translation, scaling, skewing and other forms of distortion.

S. Haykin, Neural Networks and Learning Machines, Prentice-Hall of India, 2011

24
LeNet5: CNN for
Handwritten Character Recognition
Input (32x32) -> Convolution -> 6 feature maps (28x28) -> Pooling -> 6 feature maps (14x14) -> Convolution -> 16 feature maps (10x10) -> Pooling -> 16 feature maps (5x5) -> Output (26 nodes)

• Input: 32x32 pixel image of a character centered and normalized in size

• Weight sharing: All the nodes in a feature map in a convolutional layer have the same synaptic weights (~278000 connections, but only ~1700 weight parameters)

• Output layer: 26 nodes with one node for each character. Each node in the output layer is connected to the nodes in all the feature maps in the 4th hidden layer.

Y. LeCun, L. Bottou, Y. Bengio and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, November 1998.
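
The feature map sizes in the LeNet5 diagram can be checked with a short calculation. The slide does not state the kernel or pooling sizes; the sketch below assumes 5x5 convolution kernels with stride 1 and no padding, and 2x2 non-overlapping pooling, as in the usual description of LeNet-5. The helper names are hypothetical.

```python
def conv_output_size(size, kernel, stride=1):
    """Side length of a feature map after a convolution without padding."""
    return (size - kernel) // stride + 1

def pool_output_size(size, window):
    """Side length of a feature map after non-overlapping pooling."""
    return size // window

size = 32                                   # 32x32 input image
sizes = []
for layer in ["conv", "pool", "conv", "pool"]:
    if layer == "conv":
        size = conv_output_size(size, kernel=5)   # assumed 5x5 kernels
    else:
        size = pool_output_size(size, window=2)   # assumed 2x2 pooling
    sizes.append(size)

print(sizes)
```

Under these assumptions the four hidden layers have feature maps of side 28, 14, 10 and 5, matching the sizes shown for the network.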
25
CNN Models for Image Classification

• Image Classification (on ImageNet data):


• AlexNet
• VGG-Net
• ResNet
• GoogLeNet
• PReLU-Net
• Batch Normalization(BN)-Inception-ResNet

• W. Rawat and Z. Wang, “Deep convolutional neural networks for image classification:
A comprehensive survey,” Neural Computation, vol.29, pp.2352-2449, 2017.

26
VGG-Net Architecture
• Deep CNN developed by the Visual Geometry Group (VGG) of the University of Oxford
• Task: Classification of color images belonging to 1000 classes in the ImageNet dataset

27
U-Net for Image Segmentation

O.Ronneberger, P.Fischer, and T.Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation ”, arXiv, 2015.

28
Faster Region-based CNN (Faster R-CNN)
for Object Detection

S.Ren, K.He, R.Girshick and J.Sun, “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks,” arXiv, 2016.
29
Image Captioning

A group of people at an outdoor market.

O. Vinyals, A. Toshev, S. Bengio and D. Erhan, “Show and tell: Lessons learned from the 2015 MSCOCO Image Captioning Challenge,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 39, no. 4, pp. 652-663, April 2017.

A woman holding a camera in crowd

Fang et al., “From captions to visual concepts and back,” CVPR, 2015.

30
Encoder-Decoder Paradigm
for Image Captioning

Image -> Deep Convolutional Neural Network (DCNN) [Encoder] -> Representation of Image -> Recurrent Neural Network (RNN) [Decoder] -> Caption

31
Recurrent Neural Network (RNN)
[Figure: RNN. The input at time t, x_t, and the previous state of the hidden layer, h_{t-1}, feed the hidden layer. The state of the hidden layer at time t, h_t, feeds the output layer, which gives the output at time t, s_t.]

• The hidden layer uses sigmoidal neurons.
• The state of the hidden layer (outputs of the nodes in the hidden layer) at time t, h_t, is dependent on the input at time t and the state of the hidden layer at time t-1.
• The RNN that uses sigmoidal neurons in its hidden layer is shown to have the vanishing and exploding gradients problem, due to which the convergence during training is slow.
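
The recurrence described above can be sketched as follows. This is a minimal illustration of the hidden-layer update only, assuming the common form h_t = sigmoid(W_x x_t + W_h h_{t-1} + b); the weight names, sizes and random values are hypothetical, and the output layer and training are omitted.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rnn_hidden_states(xs, W_x, W_h, b, h0):
    """Run the hidden layer over a sequence.
    h_t depends on the input x_t and on the previous state h_{t-1}."""
    h = h0
    states = []
    for x_t in xs:
        h = sigmoid(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
T, d, n_hidden = 6, 3, 4                     # sequence length, input and hidden sizes
W_x = rng.normal(size=(n_hidden, d))
W_h = rng.normal(size=(n_hidden, n_hidden))
b = np.zeros(n_hidden)

xs = rng.normal(size=(T, d))
states = rnn_hidden_states(xs, W_x, W_h, b, h0=np.zeros(n_hidden))
print(states.shape)
```

Because the same `W_h` is applied at every time step, gradients propagated back through many steps are multiplied repeatedly by related factors, which is where the vanishing and exploding gradients problem comes from.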

32
Long Short-Term Memory (LSTM)
• Structure of an LSTM Cell

[Figure: LSTM cell with forget gate f_t, input gate i_t, output gate o_t and hidden state h_t]

• The RNN that uses LSTM neurons in its hidden layer is shown to avoid the vanishing gradients problem, leading to faster convergence during training
33
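
One step of an LSTM cell can be sketched as below. The equations follow the commonly used formulation with forget, input and output gates and a tanh candidate; the slide shows only the figure, so the exact variant, the stacked weight layout (`W`, `U`, `b`) and the sizes are assumptions for this example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM step. W: (4H, D), U: (4H, H), b: (4H,) hold the stacked
    parameters of the forget gate, input gate, output gate and candidate."""
    H = h_prev.shape[0]
    a = W @ x_t + U @ h_prev + b
    f_t = sigmoid(a[0:H])            # forget gate
    i_t = sigmoid(a[H:2 * H])        # input gate
    o_t = sigmoid(a[2 * H:3 * H])    # output gate
    g_t = np.tanh(a[3 * H:4 * H])    # candidate cell content
    c_t = f_t * c_prev + i_t * g_t   # cell state: gated sum of old and new content
    h_t = o_t * np.tanh(c_t)         # hidden state
    return h_t, c_t

rng = np.random.default_rng(0)
D, H = 3, 4                          # hypothetical input and hidden sizes
W = rng.normal(size=(4 * H, D))
U = rng.normal(size=(4 * H, H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x_t in rng.normal(size=(5, D)):  # run over a sequence of 5 inputs
    h, c = lstm_step(x_t, h, c, W, U, b)
print(h, c)
```

The additive update of the cell state `c_t` gives gradients a path through time that is scaled by the forget gate rather than squashed by a sigmoid at every step.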
Encoder-Decoder Paradigm
for Image Captioning

Image -> Encoder (VGG-Net) -> Decoder (LSTM)

Decoder input: <startseq> young person drawing face on sheet
Decoder output: young person drawing face on sheet <endseq>

• The output of the pre-final layer in the CNN based encoder is used as the initial state of the hidden layer of the LSTM based decoder
34
Embedding Methods

• Image Embedding Methods
– Output of pre-final layer of a deep CNN

• Word Embedding Methods
– Word2Vec
– GloVe
– FastText

35
Sequence-to-Sequence Mapping Tasks
• Neural Machine Translation: Translation of a sentence in the source language to a sentence in the target language
– Input: A sequence of words
– Output: A sequence of words
• Video Captioning: Generation of a sentence as the caption for a video represented as a sequence of frames
– Input: A sequence of feature vectors extracted from the frames of a video
– Output: A sequence of words
• Each of the above tasks involves mapping an input sequence to an output sequence

36
Encoder-Decoder Paradigm for
Sequence-to-Sequence Mapping

Input Sequence -> Recurrent Neural Network (RNN) [Encoder] -> Representation of Input Sequence -> Recurrent Neural Network (RNN) [Decoder] -> Output Sequence

37
Encoder-Decoder Paradigm for
Sequence-to-Sequence Mapping
• Sequence-to-Sequence Mapping using Encoder-Decoder Paradigm
– Encoder: Generate a representation of the input sequence
– Representation generated by Encoder is given as input to Decoder
– Decoder: Generate the output sequence (A sequence of words)

• Relationship among the elements of a sequence:
– Typically, an element in the input sequence is related to a few other elements in the input sequence
– Typically, a word in the output sequence to be generated is related to a few elements in the input sequence

• LSTM based approach to Sequence-to-Sequence Mapping
– Bidirectional LSTM based Encoder captures dependencies among elements in the input sequence
– Bidirectional LSTM based Decoder captures dependencies among elements in the output sequence
– Attention mechanism is introduced to capture dependencies of elements in the output sequence on elements in the input sequence

• Training the LSTM based Sequence-to-Sequence mapping systems is computationally intensive, and there is not much scope for parallelization of operations in the training process
38
Attention based Models for
Sequence-to-Sequence Mapping

• Attention based models try to capture and use
– Relations among elements in the input sequence (Self-Attention)
– Relations among elements in the output sequence (Self-Attention)
– Relations between elements in the input sequence and elements in the output sequence (Cross-Attention)

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” NIPS, 2017.
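
The difference between self-attention and cross-attention can be sketched with scaled dot-product attention. This is a simplified illustration: the learned query, key and value projections and the multiple heads of a real transformer are omitted, and the sequence lengths and feature size are hypothetical.

```python
import numpy as np

def softmax_rows(a):
    """Row-wise softmax; subtracting the row max keeps exp() stable."""
    a = a - a.max(axis=-1, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: each query row attends over all key rows."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = softmax_rows(scores)       # one weight per (query, key) pair
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))   # input sequence: 4 elements, 8 features each
Y = rng.normal(size=(3, 8))   # output sequence: 3 elements, 8 features each

# Self-attention: queries, keys and values all come from the same sequence
self_out, self_w = attention(X, X, X)

# Cross-attention: queries from the output sequence, keys and values from the input
cross_out, cross_w = attention(Y, X, X)

print(self_w.shape, cross_w.shape)
```

Row i of `cross_w` tells how strongly output element i draws on each input element, which is the relation between the two sequences that cross-attention is meant to capture.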
39
Attention-based Model: Transformer

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” NIPS, 2017.
40
Pre-training of Transformer

The encoder and/or decoder of a transformer can be pre-trained using a huge amount of unlabeled data, and then fine-tuned using a small amount of labeled data for a downstream task.

• Encoder pre-training for text data
– Bidirectional Encoder Representations from Transformers (BERT)

• Decoder pre-training for text data
– Generative Pre-trained Transformer (GPT)

41
Bidirectional Encoder Representations from Transformers (BERT)
• Pre-train the generic representation for several Natural Language Processing (NLP) tasks

• Pre-training Methods:
– Masked Language Modelling (Mask LM)
– Next Sentence Prediction (NSP)

• Fine-tuned for tasks such as
– Sentence classification
– Sentence relationship
– Textual question answering

Image source: BERT (Devlin et al., 2019)

J. Devlin, M.-W. Chang, K. Lee and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding,” NAACL, 2019.
42
Generative Pre-trained Transformer (GPT)
• Transformer decoder is pre-trained using unlabeled text data
• GPT can be fine-tuned for downstream tasks that involve text data
• Auto-regressive model: A word in a sentence is predicted using all the words preceding that word in the sentence
• Masked multi-head self-attention (MSA) in each layer of transformer decoder takes the sequence of words preceding a word in a sentence.
• The decoder is trained to predict the next word in the sentence.
• GPT-1, GPT-2 and GPT-3: Pre-trained models with different number of layers trained with different corpora for different pre-training tasks
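
The masking in the decoder's self-attention can be sketched as follows. This is a single-head illustration without learned projections, so it is an assumption-laden simplification rather than the GPT implementation; the function name and sizes are hypothetical.

```python
import numpy as np

def causal_self_attention(X):
    """Masked self-attention: position t may attend only to positions 0..t,
    so the representation used to predict the next word never sees future words."""
    T, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    future = np.triu(np.ones((T, T), dtype=bool), k=1)   # True above the diagonal
    scores = np.where(future, -np.inf, scores)           # block attention to the future
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)                              # exp(-inf) gives exactly 0
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ X, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))        # 5 word positions, 8 features each
out, weights = causal_self_attention(X)
print(np.round(weights, 2))
```

The printed weight matrix is lower triangular: the first position attends only to itself, and each later position spreads its attention over itself and the words before it.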

A.Radford, K.Narasimhan, T.Salimans and I.Sutskever, “Improving Language Understanding by Generative Pre-training,” 2018

A.Radford, J.Wu, R.Child, D.Luan, D.Amodei and I.Sutskever, “Language Models are Unsupervised Multitask Learners,” 2019

T.Brown et al., “Language Models are Few-Shot Learners,” arXiv:2005.14165v4, 22nd July, 2020
43
Visual Question Answering (VQA) for Images

Question: Is there something to cut the vegetables with? Answers shown: Yes, No
Question: Who is wearing glasses? Answers shown: Man, Woman
Question: How many children are in the bed? Answers shown: Two, One

44
Open Ended VQA

45
Image VQA Framework

Image -> Image Encoder -> Representation of Image
Question (Text), e.g. “How many children are in the bed?” -> Question Encoder -> Question Representation
Representation of Image + Question Representation -> Fusion of Representations -> Answer Generator -> Answer (e.g. “Two”)

Image Encoder: CNN, ViT Encoder, Swin Transformer

Question Encoder: LSTM, Transformer encoder, BERT fine-tuned with questions in VQA dataset

Fusion of Representations: Concatenation, Co-attention transformer

Answer Generator: Classifier, Text generator such as GPT fine-tuned with answers in VQA dataset
46
Open Ended VQA Framework

Image -> Image Encoder -> Representation of Image
Question and Partial Answer -> Transformer Decoder -> Representation of Question and Partial Answer
Representation of Image + Representation of Question and Partial Answer -> Fusion of Representations -> Next Word Predictor -> Next word

In open ended VQA, the answer is a sequence of words. The system generates one word of the answer at a time. The next word in the answer is predicted using the representations of image, question, and the partial answer corresponding to the sequence of words generated so far.

A.M.Bellini, N.Parde, M.Matteucci and M.J.Carman, “Towards Open-Ended VQA Models using Transformers,” EMNLP, 2020.
47
Generative Models

• Models capable of generation of data (Text, Image, Video, Music)
• Restricted Boltzmann machine (RBM)
• Variational autoencoder
• Generative pre-trained transformer (GPT)
• Large Language Models (LLMs)
• Generative adversarial network (GAN)
• Diffusion models
• Text-to-image
• Text-to-video
• Text-to-audio
• Text-to-music
48
LLMs: Evolution of GPT Models

NLP Benchmarks:

LAMBADA: LAnguage Modeling Broadened to Account for Discourse Aspects

GLUE: General Language Understanding Evaluation

SQuAD: Stanford Question Answering Dataset


49
Image Generation

Image-to-Image Translation

Sketch-to-Image Generation

Denoising Diffusion Models for Image Generation

Forward process (Data -> Noise): Destroying data by adding progressively increasing levels of noise

Reverse process (Noise -> Data): Generating a new sample by denoising

L.Yang et al., “Diffusion Models: A Comprehensive Survey of Methods and Applications,” arXiv, 2023.
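
The data-to-noise direction can be sketched numerically. The slide does not give the noising formula; the sketch below assumes the widely used Gaussian formulation in which a noisy sample at step t is sqrt(abar_t) * x0 + sqrt(1 - abar_t) * noise, with a linear noise schedule. The schedule values, the number of steps and the function name are hypothetical choices, and the learned denoising network of the reverse direction is not shown.

```python
import numpy as np

def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample a noisy version of x0 at step t (assumed Gaussian noising):
    x_t = sqrt(alpha_bar[t]) * x0 + sqrt(1 - alpha_bar[t]) * noise."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * noise

T = 1000
betas = np.linspace(1e-4, 0.02, T)     # assumed linear noise schedule
alpha_bar = np.cumprod(1.0 - betas)    # fraction of signal remaining at each step

rng = np.random.default_rng(0)
x0 = rng.normal(loc=5.0, scale=0.1, size=10_000)   # toy "data" clustered around 5

for t in [0, 250, 500, 999]:
    x_t = forward_diffuse(x0, t, alpha_bar, rng)
    print(t, round(float(x_t.mean()), 3), round(float(x_t.std()), 3))
```

As t grows, the mean of the samples drifts from the data value towards zero and the spread approaches one, so the final step is close to pure noise; a diffusion model is trained to undo these steps one at a time.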

53
Text-to-Image Translation

Output of Stable Diffusion Model

Word prompts: a dream of time gone by, oil painting, red blue white, canvas, watercolor, koi fish, and animals.
Retrieval Augmented Generation (RAG)
RAG Process for Textual Question Answering

Yunfan Gao et al, “Retrieval-Augmented Generation for Large Language Models: A Survey”, arxiv:
2312.10997v5 [cs.CL] 27 March 2024
Coverage of Topics
1. Introduction to deep learning
2. Feedforward neural networks: Model of an artificial neuron, Activation functions: Sigmoidal function, Rectified linear unit (ReLU) function, Softmax function, Multi-layer feedforward neural network, Backpropagation method, Gradient descent method, Stochastic gradient descent method
3. Optimization and regularization methods for deep feedforward neural networks (DFNNs): Optimization methods: Generalized delta rule, AdaGrad, RMSProp, AdaDelta, Adam, Second order methods; Regularization methods: Dropout, DropConnect; Batch normalization
4. Autoencoders: Autoassociative neural network, Stacked autoencoder, Greedy layer-wise training, Pre-training of a DFNN using a stacked autoencoder

57
Coverage of Topics (Contd.)
5. Convolutional neural networks (CNNs): Basic CNN architecture, Deep CNNs for image classification: LeNet, VGGNet, GoogLeNet, ResNet; CNNs for image segmentation: U-Net and Fast RCNN; 1-d CNNs, 3-d CNNs
6. Recurrent neural networks (RNNs): Architecture of an RNN, Unfolding an RNN, Backpropagation through time, Vanishing and exploding gradient problems in RNNs, Long short term memory (LSTM) units, Gated recurrent units, Bidirectional RNNs
7. Embedding methods: Image and video embedding methods; Word embedding methods: Word2Vec, GloVe

58
Coverage of Topics (Contd.)
8. Transformer models: Attention based models, Scaled dot-product attention, Multi-head attention (MHA), Self-attention MHA, Cross-attention MHA, Position encoding, Encoder module in a transformer, Decoder module in a transformer, Sequence to sequence mapping using transformer, Bidirectional encoder representations from transformers (BERT) model for text processing, Pre-training a BERT model, Fine-tuning, Generative pre-trained transformer (GPT), Introduction to large language models (LLMs)
9. Generative Models: Variational autoencoder (VAE), Generative adversarial networks (GANs), Introduction to diffusion models

59
Books and Evaluation Pattern
Text Books:
1. C.M.Bishop and H.Bishop, Deep Learning: Foundations and Concepts, Springer,
2024
2. Charu C. Aggarwal, Neural Networks and Deep Learning, Springer, 2nd Ed., 2023
3. S.J.D.Prince, Understanding Deep Learning, MIT Press, 2023

Reference Books:
1. I.Goodfellow, Y.Bengio and A.Courville, Deep Learning, MIT Press, 2016
2. U.Kamath, J.Liu and J.Whitaker, Deep Learning for NLP and Speech Recognition,
Springer, 2019
3. Nithin Buduma, Nikhil Buduma, Joe Papa, Fundamentals of Deep Learning,
O’Reilly, 2nd Ed., 2022
4. I.Drori, The Science of Deep Learning, Cambridge University Press, 2022

Evaluation Pattern (Tentative)


• Assignments: 30%
• Midsem Examination: 25% (12 noon to 1.30 PM, Friday, 14th March, 2025)
• Endsem Examination: 45% (9 AM to 12 noon, Thursday, 8th May, 2025)

60
