Dr. K. SUBBARAO, Prof & HOD, CSE-DS
Deep Learning
Unit-1 : Topics
Fundamentals of Deep Learning: Artificial Intelligence,
History of Machine learning: Probabilistic Modeling, Early
Neural Networks, Kernel Methods, Decision Trees, Random
forests and Gradient Boosting Machines,
Fundamentals of Machine Learning: Four Branches of
Machine Learning, Evaluating Machine learning Models,
Overfitting and Underfitting. [Text Book 2]
3rd Year 2nd Sem, CSE-Data Science
Introduction to Deep Learning
Deep learning is a specific subfield of machine learning: a new approach to
learning representations from data that puts an emphasis on
learning successive layers of increasingly meaningful representations.
The “deep” here stands for the idea of successive layers of
representations.
The number of layers that contribute to a model is called the depth of
the model.
Deep learning is also known as layered representations learning or
hierarchical representations learning, and it typically uses tens or hundreds of
layers.
These layered representations are learned using neural networks.
Artificial Intelligence, Machine
Learning, and Deep Learning
Artificial intelligence:
Artificial intelligence was born in the 1950s, with the intention of
making computers “think”.
It can be defined as “the effort to automate intellectual
tasks normally performed by humans”.
AI is a general field that encompasses machine learning and deep
learning.
Early chess programs used only hardcoded rules (written in the code
itself), which do not qualify as machine learning. Later, human
knowledge was integrated in the form of explicit rules
(stored in external files) so the machine could take decisions the way
humans do. This type of approach is called symbolic AI.
Hardcoded rules cannot be modified easily, whereas explicit rules can
be modified easily whenever needed.
Symbolic AI was the dominant paradigm from the 1950s to the 1980s, before the expert systems of the 1980s gave way to machine learning.
Machine learning:
Lady Ada Lovelace was a friend and collaborator of Charles Babbage,
the inventor of the Analytical Engine.
The Analytical Engine, designed in the 1830s, was meant to automate
mechanical operations in order to perform mathematical computations.
It used punch cards, similar to those used in weaving looms.
It contained a mill (comparable to a CPU), a store (the memory), a reader,
a control unit, and a printer.
The limitation of the Analytical Engine was that it merely assisted humans;
it could not take decisions on its own.
In 1950, Alan Turing introduced the Turing test and other
key concepts that shaped AI.
Machine learning arises from this question: could a computer go
beyond “what we know how to order it to perform” and learn on its own?
This would enable the computer to learn the data-processing rules from the data
itself.
In classical programming, that is symbolic AI, the programmer
inputs the rules and the data to be processed using these rules, and
the system produces output in the form of answers.
A machine-learning system is trained rather than explicitly
programmed: here, data and the expected answers are given, and the system produces the rules.
Machine learning started flourishing in the 1990s and has become the most
successful subfield of AI.
Symbolic AI Vs Machine Learning
Example: determining the grade of a student based on the percentage of marks. A small sketch contrasting the two approaches is given below.
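As a rough illustration (the grade boundaries and the scikit-learn classifier below are my own assumptions, not part of the syllabus), symbolic AI hand-codes the grading rule, while machine learning infers a comparable rule from (marks, grade) examples:

# Hypothetical grading scheme: symbolic AI vs. machine learning.

def grade_by_rules(percentage):
    """Symbolic AI: the programmer writes the rule explicitly."""
    if percentage >= 75:
        return "A"
    elif percentage >= 60:
        return "B"
    elif percentage >= 40:
        return "C"
    return "Fail"

# Machine learning: give data and answers, and let the model learn the rule.
from sklearn.tree import DecisionTreeClassifier

marks = [[35], [45], [55], [65], [72], [80], [90]]    # data
grades = ["Fail", "C", "C", "B", "B", "A", "A"]       # answers
model = DecisionTreeClassifier().fit(marks, grades)   # the rule is learned
print(grade_by_rules(68), model.predict([[68]])[0])   # both should print "B"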
Learning representations
from data:
Before understanding the difference between deep
learning and other learning approaches, it is good
to know what machine learning
algorithms actually do.
Every machine learning model expects THREE
things:
Input data points (data)
Examples of the expected output (answers)
A way to measure whether the algorithm is doing a good job (a measure of performance)
A machine-learning model transforms its input data into
meaningful outputs.
The central problem in machine learning and deep learning
is to meaningfully transform data.
Let us take an example to understand these three things.
Consider an x-axis, a y-axis, and some points
represented by their coordinates in the (x, y) system, as
shown in the figure.
As you can see, we have a few white points and a few black
points.
Let’s develop a model that can use the coordinates of a point
to determine whether that point is “black” or “white” (a simple
classification task).
In this case:
The inputs are the coordinates of the points.
The outputs are the two colours, “black” and “white”.
The measure of performance is the percentage of points that are
correctly classified.
What we need here is a new representation of the data that clearly separates
the white points from the black points.
If we systematically search over possible coordinate changes and come
up with one under which a good percentage of points is
classified correctly, then we are doing machine learning; a small sketch is given below.
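The following is a minimal sketch of that search (the point coordinates are synthetic data generated only for illustration): we try many candidate coordinate changes, apply the simple rule “new x-coordinate < 0 means black”, and keep the change with the best percentage of correctly classified points.

import numpy as np

rng = np.random.default_rng(0)
white = rng.normal(loc=[1.0, 1.0], scale=0.4, size=(50, 2))    # white points
black = rng.normal(loc=[-1.0, -1.0], scale=0.4, size=(50, 2))  # black points
points = np.vstack([white, black])
labels = np.array([0] * 50 + [1] * 50)    # 0 = white, 1 = black

best_angle, best_accuracy = 0.0, 0.0
for angle in np.linspace(0, np.pi, 180):                 # candidate coordinate changes
    new_x = points @ np.array([np.cos(angle), np.sin(angle)])  # project onto new axis
    predictions = (new_x < 0).astype(int)                # rule: new x < 0 means black
    accuracy = max((predictions == labels).mean(),       # the measure of performance
                   ((1 - predictions) == labels).mean())
    if accuracy > best_accuracy:
        best_angle, best_accuracy = angle, accuracy

print(f"best coordinate change: {best_angle:.2f} rad, "
      f"{best_accuracy:.0%} of points classified correctly")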
Deep learning:
Deep learning is a mathematical framework for learning representations
from data and is a subfield of machine learning.
Modern deep learning often involves tens or even hundreds of
successive layers of representations— and they’re all learned
automatically from exposure to training data.
In deep learning, these layered representations are (almost always) learned
via models called neural networks.
The term neural network is a reference to neurobiology; some of its
core concepts were inspired by our understanding of the human brain.
Let us look at an example of how deep learning recognizes a digit
from a handwritten image.
The network transforms the digit image into representations that are increasingly
different from the original image and increasingly informative about the final result.
Figure 3: A deep neural network for digit classification
The network can be thought of as a multistage information-distillation operation,
where information goes through successive filters and comes
out increasingly purified. A minimal sketch of such a network is given below.
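A minimal Keras sketch of such a network is shown below; the two Dense layers and the choice of five epochs are illustrative values, not prescribed by the slides.

from tensorflow import keras
from tensorflow.keras import layers

# Load the MNIST handwritten-digit images and flatten them to vectors in [0, 1].
(train_images, train_labels), (test_images, test_labels) = keras.datasets.mnist.load_data()
train_images = train_images.reshape((60000, 28 * 28)).astype("float32") / 255
test_images = test_images.reshape((10000, 28 * 28)).astype("float32") / 255

model = keras.Sequential([
    layers.Dense(512, activation="relu"),    # first layer of representations
    layers.Dense(10, activation="softmax"),  # second layer: scores for the 10 digits
])
model.compile(optimizer="rmsprop",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(train_images, train_labels, epochs=5, batch_size=128)
print(model.evaluate(test_images, test_labels))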
How Does Deep Learning Work?
Machine learning maps inputs to
targets by observing many examples (e.g., the features of a tomato
versus a cherry).
Deep learning does this input-to-target mapping
via a deep sequence of simple data transformations (layers),
and these transformations are learned from examples (by
extracting features from the examples).
The specification of what each layer does to its input is
stored in the layer’s weights, which are essentially a bunch of numbers.
Figure 4: The loss score is used as a feedback signal to adjust the weights
In technical terms, the transformation implemented by a layer is parameterized
by the layer’s weights. These weights are sometimes also
called the parameters of the layer. Initially they are set
to random values.
In this context, learning means finding a set of
values for the weights of all layers in a network (and
also the bias applied at each layer) such that the network
correctly maps example inputs to their associated
targets.
Finding the correct values for all of them can be a
daunting task, because a change in one
parameter affects the behaviour of the other layers.
To control a neural network, we first have to observe its predicted
value and measure how far this output is from what
we expected. This is the job of the loss function of the network,
also called the objective function.
The loss function takes the predictions of the network and the
true target (what you wanted the network to output) and
computes a distance score.
Since the weights are initialized randomly, the loss score is
obviously high at first.
But with every example (item or image) the network processes,
the weights are adjusted a little in the correct direction, and the
loss score decreases.
This is the training loop, which is repeated a sufficient number of
times to reduce the loss score; the outputs then come close to the
targets. A toy version of this loop is sketched below.
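Below is a toy version of this loop on assumed one-dimensional data: a single weight and bias are nudged a little in the correct direction at every step, and the loss (a mean-squared-error distance score) goes down.

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y_true = 2.0 * x + 1.0                 # targets the model should learn to reproduce

w, b = np.random.randn(), np.random.randn()     # weights start out random
learning_rate = 0.01
for step in range(1000):
    y_pred = w * x + b                          # forward pass (prediction)
    loss = np.mean((y_pred - y_true) ** 2)      # loss function: distance score
    grad_w = np.mean(2 * (y_pred - y_true) * x) # how the loss changes with w
    grad_b = np.mean(2 * (y_pred - y_true))     # how the loss changes with b
    w -= learning_rate * grad_w                 # adjust the weights a little
    b -= learning_rate * grad_b
print(w, b, loss)    # w approaches 2, b approaches 1, loss approaches 0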
Applications of Deep Learning
In particular, deep learning has achieved the following breakthroughs, all
in historically difficult areas of machine learning:
Near-human-level image classification
Near-human-level speech recognition
Near-human-level handwriting transcription
Improved machine translation
Improved text-to-speech conversion
Digital assistants such as Google Now and Amazon Alexa
Near-human-level autonomous driving
Improved ad targeting, as used by Google, Baidu, and Bing
Improved search results on the web
Ability to answer natural-language questions
Superhuman Go playing
Before Deep Learning: A brief
history of machine learning
Deep learning has reached a level of public attention and
industry investment never before seen in the
history of AI.
Deep learning may not solve every
problem; it needs sufficient data.
Sometimes other machine learning methods
can solve a problem more efficiently than deep
learning.
Probabilistic Modeling:
Probabilistic modeling is the process of applying the principles of
statistics to perform data analysis.
It was one of the earliest forms of machine learning. One of the best-known
algorithms in this category is the Naïve Bayes algorithm.
Naive Bayes is a type of machine-learning classifier based on
applying Bayes’ theorem while assuming that the features in the
input data are all independent.
This type of analysis was in use even before the first computer came into
existence; the foundations of Bayes’ theorem were laid in the 18th
century.
A closely related model is logistic regression, which is also used for
classification. A minimal Naïve Bayes example is sketched below.
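A minimal Naïve Bayes sketch with scikit-learn is shown below; the Iris dataset is just a convenient stand-in for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = GaussianNB()                 # assumes the input features are independent
model.fit(X_train, y_train)          # applies Bayes' theorem to the training data
print(model.score(X_test, y_test))   # fraction of test samples classified correctly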
Early Neural Networks:
Figure 5: Structure of Neural Networks
The early neural networks have since been replaced by the modern neural
networks used today.
The early neural networks laid the path to deep learning. The
core ideas of neural networks were investigated as early as the 1950s, but
the approach was ignored for decades.
Interest revived when several researchers independently rediscovered the
backpropagation algorithm in the mid-1980s.
Backpropagation is used to optimize the parameters (weights) of a neural
network using gradient-descent optimization, with the step size controlled
by the learning rate; a toy sketch is given below.
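The toy network below (the synthetic data and layer sizes are my own choices, not an example from the slides) shows the idea: gradients of the loss are propagated from the output layer back to the first layer, and every weight is moved by the learning rate times its gradient.

import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(float).reshape(-1, 1)   # XOR-like labels

W1, b1 = rng.normal(size=(2, 8)) * 0.5, np.zeros(8)   # first-layer weights
W2, b2 = rng.normal(size=(8, 1)) * 0.5, np.zeros(1)   # second-layer weights
lr = 0.1                                              # learning rate
for step in range(2000):
    h = np.tanh(X @ W1 + b1)                   # forward pass, layer 1
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))       # forward pass, layer 2
    loss = np.mean((p - y) ** 2)

    # Backward pass: apply the chain rule from the output back to layer 1.
    dp = 2 * (p - y) / len(X)
    dz2 = dp * p * (1 - p)
    dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
    dh = dz2 @ W2.T
    dz1 = dh * (1 - h ** 2)
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)

    W1 -= lr * dW1                             # gradient-descent updates
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2
print(loss)   # the loss decreases as training proceeds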
The first successful practical application of neural nets came in 1989
from Bell Labs, when Yann LeCun combined the earlier ideas of
convolutional neural networks and backpropagation, and
applied them to the problem of classifying handwritten digits.
Kernel Methods:
Kernel methods are a group of classification algorithms.
The support vector machine (SVM) is the best-known algorithm in this
category.
The SVM was developed by Vladimir Vapnik and Corinna Cortes at Bell Labs in the early 1990s.
SVMs aim at solving classification problems by finding good decision
boundaries between two sets of points belonging to two different categories.
This decision boundary can be linear or non-linear and separates the space
into two regions belonging to the two categories.
SVMs find these boundaries in two steps (a minimal example follows the list):
The data is mapped to a new high-dimensional representation where the decision
boundary can be expressed as a hyperplane.
A good decision boundary is computed by maximizing the distance between the
hyperplane and the closest data points on either side; this distance is called the “margin”.
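A minimal SVM sketch with scikit-learn is given below; the synthetic two-class dataset is assumed for illustration.

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=2, n_redundant=0,
                           random_state=0)   # two categories of points
model = SVC(kernel="rbf", C=1.0)   # kernel maps the data to a higher-dimensional space
model.fit(X, y)                    # finds a maximum-margin decision boundary
print(model.score(X, y))           # fraction of points classified correctly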
The process of mapping the data to a high-dimensional space
is carried out using kernel methods. An example of
a kernel mapping is given below.
Kernel methods can transform data that is not linearly separable
into a representation that is (e.g., by adding the feature y = power(x, 2)).
Let us consider a small dataset as shown below:
If we use only the first feature, x, the data does not appear to be linearly separable.

x      y = power(x, 2)
1.2    1.44
1.4    1.96
1.3    1.69
1.5    2.25
1.3    1.69
1.2    1.44

[Scatter plot of the single feature x]
But if we add a second feature using the polynomial
expression y = power(x, 2), then the dataset becomes
linearly separable, as shown below.
[Scatter plot of x versus y = power(x, 2), showing the two-feature representation]
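The mapping itself is easy to reproduce in code; the snippet below simply recomputes the table above by adding the second feature.

x_values = [1.2, 1.4, 1.3, 1.5, 1.3, 1.2]                 # the original feature x
features = [(x, x ** 2) for x in x_values]                # add y = power(x, 2)
for x, x_squared in features:
    print(f"x = {x:.1f}, power(x, 2) = {x_squared:.2f}")  # the two-feature representation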
Decision Trees
Decision trees are tree-like structures that let you classify input
data points or predict output values given inputs, as shown in
Figure 7.
The decision tree is a supervised learning technique that can be used
for both classification and regression problems, though it is mostly
preferred for solving classification problems.
They’re easy to visualize and interpret.
A tree contains three main elements: decision nodes, branches, and leaf
nodes.
Decision nodes can have multiple branches, whereas leaf
nodes do not branch any further. A minimal example is sketched below.
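A minimal decision-tree sketch with scikit-learn is shown below; the Iris dataset and the depth limit are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))   # prints the decision nodes, branches, and leaf nodes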
Figure 7: Decision Tree
Figure 8: Example Decision Tree to Accept an Offer
Random Forests
Random Forest is a popular machine learning algorithm that belongs to the
supervised learning technique.
It can be used for both Classification and Regression problems in ML.
It is a collection of a large number of specialized decision trees.
It is based on the concept of ensemble learning, which is a process of
combining multiple classifiers to solve a complex problem and to improve the
performance of the model.
A greater number of trees in the forest generally leads to higher accuracy and
helps prevent the problem of overfitting.
Different decision trees are created for the same data; instead of depending
on one decision tree, the random forest takes a decision from each tree, and
the final output is predicted based on the majority vote of those predictions. A minimal example is sketched below.
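A minimal random-forest sketch with scikit-learn follows; the dataset and the number of trees are illustrative.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)          # each tree is trained on a random sample of the data
print(forest.score(X_test, y_test))   # accuracy of the majority vote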
Figure 9: Random Forest
Advantages of Random Forests:
• It takes less time to train the model compared to other algorithms.
• It gives high accuracy.
• It can maintain accuracy even when a large portion of the data is missing.
Gradient Boosting Machines
A gradient boosting machine, much like a random forest,
is a machine learning technique based on ensembling
weak prediction models, generally decision trees.
It uses gradient boosting to iteratively improve the performance of
a machine learning model by addressing the
weak points of the previous models.
When gradient boosting is applied to decision trees, the resulting models
outperform random forests most of the time.
It is one of the best algorithms to deal with non-perceptual
data.
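A minimal gradient-boosting sketch with scikit-learn is given below; the dataset and hyperparameters are illustrative.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each new tree is fitted to correct the weak points of the trees built so far.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, random_state=0)
gbm.fit(X_train, y_train)
print(gbm.score(X_test, y_test))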
Fundamentals of Machine Learning
(Part-2)
Here we will cover concepts such as
model evaluation, data pre-processing for deep learning,
feature engineering, and tackling model overfitting, and fit
them into a seven-step workflow that applies to any machine learning
approach.
Four branches of machine learning:
Supervised learning
Unsupervised learning
Self-supervised learning
Reinforcement learning
Supervised learning:
This is the most familiar branch; it is used to map inputs to known targets.
Almost all the deep learning algorithms in wide use today belong to this category.
These are used for both classification and regression tasks. Some
applications of supervised learning are as follows:
Sequence generation – It is used to predict the caption describing a given
picture or image.
Syntax tree prediction – It is used to predict the Syntax tree for the given
sentence.
Object detection – Given a picture, draw a bounding box around
certain objects, based on their features within the picture or image.
Image segmentation – Divides an image into sub-parts based on
pixel intensity values.
Unsupervised learning:
This branch is used to find interesting transformations of the input data
without the help of any known targets.
It is mainly used for data visualization, data compression, data
denoising, and understanding the correlations present in the
data. It is often the bread and butter of data
analysts before they attempt to use any supervised learning
technique.
There are two well-known categories of unsupervised learning,
as follows:
Dimensionality reduction
Clustering
Self-supervised learning:
It is a specific type of supervised learning that deserves to be
considered a separate category. It is used to learn
patterns without human-annotated labels: labels are still
involved, but they are generated from the input data using
heuristic techniques.
For instance, autoencoders are a well-known example of self-
supervised learning, where the generated targets are the inputs,
unmodified.
In the same way, predicting the next frame of a video, given
some past frames, is self-supervised learning.
So is predicting the next word in a text, given the previous
words.
Reinforcement learning:
In reinforcement learning, an agent receives
information about its environment and learns to
choose actions that will maximize some reward.
This is widely used in games to predict the next
move that minimizes the loss and maximizes the
reward. Some of its application areas are as follows:
Self-driving cars
Robotics
Education
Evaluating Machine learning Models
Once a model is trained, it is evaluated to measure its
performance. The model should be evaluated on data it has
never seen before; if the evaluation is done on the data used for
training, the score is misleading and overfitting goes undetected.
Hence the available data is split into three sets.
Training, validation, and test sets:
The data is split into three sets: a training set, a validation set,
and a test set. The model is trained on the training set and is
evaluated on the validation set; this helps in fine-tuning
the model. One final evaluation is then conducted on the test set.
There are different techniques for splitting the data into these
sets. They are as follows:
Simple hold-out validation
K-fold validation
Iterated K-fold validation with shuffling
SIMPLE HOLD-OUT VALIDATION:
Here the dataset is divided into two parts: a training set and a hold-
out validation set.
The model is trained on the training set and evaluated on the
validation set.
The test set is kept apart; evaluating and tuning only on the validation set
prevents information leaks into the test set once the data has been divided
into three parts (training, validation, and test).
Before starting the process, the data can be randomly shuffled to
mix it well. A minimal sketch is given below.
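A minimal hold-out sketch with scikit-learn is shown below; the Iris dataset, the logistic-regression model, and the 20% split are illustrative choices.

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
# Shuffle the data and hold out 20% of it as the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(model.score(X_val, y_val))   # validation score used to tune the model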
Drawback of hold-out validation:
Though this is a simple protocol, it suffers
from one drawback: if the dataset is
small, then only a few
samples or records end up in the validation set.
Hence, the model may not be tuned well.
This can be addressed with the help of K-fold
validation and iterated K-fold validation.
K-FOLD VALIDATION
Here we split the data into K partitions of equal size. For each
partition i, we train a model on the remaining K − 1 partitions and
evaluate it on partition i.
The same process is repeated K times.
The final score of the model is the average of the K scores
obtained.
This method is preferred when the model's performance shows significant
variance depending on the train/test split; in that case a single validation
fold is not enough. A minimal sketch is given below.
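A minimal K-fold sketch with scikit-learn is shown below; the dataset, the model, and K = 5 are illustrative choices.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)   # one score per fold
print(scores, np.mean(scores))                 # final score = average of the K scores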
ITERATED K-FOLD VALIDATION WITH
SHUFFLING
This one is for situations in which you have relatively
little data available and you need to evaluate your
model as precisely as possible.
It consists of applying K-fold validation multiple
times, shuffling the data every time before
splitting it K ways.
The final score is the average of the scores obtained at
each run of K-fold validation.
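One way to realize this in scikit-learn is RepeatedKFold, which repeats K-fold validation several times with a different shuffle each time; the dataset, model, and repeat count below are illustrative.

import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

X, y = load_iris(return_X_y=True)
cv = RepeatedKFold(n_splits=5, n_repeats=3, random_state=0)   # 3 runs of 5-fold validation
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=cv)
print(np.mean(scores))   # average over every run of K-fold validation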
Overfitting and Underfitting
The central challenge in machine learning is that we must
perform well on new, previously unseen inputs—not just those
on which our model was trained. The ability to perform well
on previously unseen inputs is called generalization.
We can compute some error measure on the training set
called the training error, and we reduce this training error.
What separates machine learning from optimization is that we
want the generalization error, also called the test error, to be
low as well. The generalization error is defined as the
expected value of the error on a new input.
We typically estimate the generalization error of a machine
learning model by measuring its performance on a test set of
examples that were collected separately from the training set.
For a regression model, the test error can be computed using the MSE (mean
squared error) as follows:
Measure the distance of the observed y-values from the
predicted y-values at each value of x: (y − y′).
Square each of these distances: (y − y′)².
Calculate the mean of the squared distances: MSE = (1/n) Σ (y − y′)².
A small worked example is given below.
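A small worked example of the computation (with assumed observed and predicted values):

import numpy as np

y_true = np.array([3.0, 5.0, 7.0])           # observed y-values
y_pred = np.array([2.5, 5.5, 8.0])           # predicted y-values
mse = np.mean((y_true - y_pred) ** 2)        # mean of the squared distances
print(mse)                                   # (0.25 + 0.25 + 1.0) / 3 = 0.5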
The factors determining how well a machine learning
algorithm will perform are its ability to:
1. Make the training error small.
2. Make the gap between training and test error small.
These two factors correspond to the two central
challenges in machine learning: underfitting and
overfitting.
Underfitting occurs when the model is not able to obtain a
sufficiently low error value on the training set. That means the
model has not learned enough from the training data.
Overfitting occurs when the gap between the training error and
test error is too large.
To prevent a model from learning misleading or irrelevant
patterns found in the training data, the best solution is to
get more training data: a model trained on more data will
naturally generalize better. When that is not possible, the next-best
solution is to constrain what the model is allowed to learn; the
process of fighting overfitting this way is called regularization.
Let’s review some of the most common regularization techniques:
Reducing the network’s size
The simplest way to prevent overfitting is to reduce the size of the model: the
number of learnable parameters in the model, often referred to as its capacity.
Adding weight regularization
A simple model in this context is a model where the distribution of parameter
values has less entropy (or a model with fewer parameters, as in the
previous point). Thus a common way to mitigate overfitting is to put
constraints on the complexity of a network by forcing its weights to take only
small values, which makes the distribution of weight values more regular. This
is called weight regularization. It is done by adding to the network’s loss
function a cost associated with having large weights. This cost comes in two flavors
(a Keras sketch follows the list):
L1 regularization—The cost added is proportional to the absolute value of the
weight coefficients (the L1 norm of the weights).
L2 regularization—The cost added is proportional to the square of the value
of the weight coefficients (the L2 norm of the weights). L2 regularization is also
called weight decay in the context of neural networks
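A minimal Keras sketch of adding weight regularization is shown below; the layer sizes and the 0.001 penalty factor are illustrative choices.

from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Dense(16, activation="relu",
                 kernel_regularizer=regularizers.l2(0.001)),   # L2 penalty on the weights
    layers.Dense(16, activation="relu",
                 kernel_regularizer=regularizers.l1(0.001)),   # L1 penalty on the weights
    layers.Dense(1, activation="sigmoid"),
])
# l2(0.001) adds 0.001 * weight_value ** 2 to the total loss for every weight coefficient.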
Adding dropout
Dropout is one of the most effective and most commonly used
regularization techniques for neural networks, developed by
Geoff Hinton and his students at the University of Toronto.
Dropout, applied to a layer, consists of randomly dropping out
(setting to zero) a number of output features of the layer
during training.
Let’s say a given layer would normally return a vector [0.2,
0.5, 1.3, 0.8, 1.1] for a given input sample during training.
After applying dropout, this vector will have a few zero entries
distributed at random: for example, [0, 0.5, 1.3, 0, 1.1].
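A minimal Keras sketch of adding dropout is shown below; the layer sizes and the 0.5 dropout rate are illustrative choices.

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.5),   # randomly zeroes 50% of this layer's outputs during training
    layers.Dense(16, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),
])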
End of Unit-1