DEEP LEARNING
UNIT - 1
Basics of Deep Learning: Introduction to Machine Learning
Machine learning (ML) is a subset of artificial intelligence that
focuses on the development of algorithms and statistical models that
enable computers to perform tasks without being explicitly
programmed. It is a powerful tool used across various industries for
tasks such as data analysis, pattern recognition, and decision-making.
Here are some foundational concepts and components of machine
learning:
1. Definition and Core Concepts:
• Machine Learning (ML): At its core, ML involves training
algorithms to learn from and make predictions or decisions
based on data.
• Data-Driven Approach: Unlike traditional programming, where
rules are explicitly coded, ML uses data to identify patterns and
relationships.
• Model: A model is a mathematical representation of a process
that is used to make predictions or decisions based on input
data.
2. Types of Machine Learning:
• Supervised Learning: In this type, the algorithm learns from
labeled datasets, meaning each training example is paired with
an output label. It is used for tasks like classification and
regression.
• Classification: Predicting discrete labels (e.g., spam or not
spam).
• Regression: Predicting continuous values (e.g., house
prices).
• Unsupervised Learning: The algorithm learns from unlabeled
data, discovering hidden patterns or intrinsic structures.
Common applications include clustering and dimensionality
reduction.
• Clustering: Grouping similar data points together (e.g.,
customer segmentation).
• Dimensionality Reduction: Reducing the number of
variables under consideration (e.g., Principal Component
Analysis).
• Semi-supervised Learning: Combines a small amount of labeled
data with a large amount of unlabeled data during training.
• Reinforcement Learning: The algorithm learns by interacting
with an environment to maximize some notion of cumulative
reward. It is used in fields like robotics and game playing.
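To make the first two categories concrete, here is a minimal sketch using scikit-learn; the synthetic dataset and the specific model choices (logistic regression, k-means) are assumptions for illustration, not part of the definitions above.

```python
# Sketch: supervised vs. unsupervised learning on the same data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

# Synthetic dataset: 200 samples, 4 features, 2 classes.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Supervised learning: the model sees both inputs X and labels y.
clf = LogisticRegression().fit(X, y)
print("Supervised predictions:", clf.predict(X[:5]))

# Unsupervised learning: the model sees only X and discovers structure.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("Cluster assignments:", km.labels_[:5])
```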
3. Key Components of Machine Learning:
• Data: The foundation of any ML system, requiring
preprocessing to handle missing values, normalization, and
feature extraction.
• Features: Individual measurable properties or characteristics
used by models to make predictions.
• Algorithms: The set of rules or processes followed by the model
to learn from data. Examples include decision trees, neural
networks, and support vector machines.
• Training: The process of teaching a model by exposing it to
data, so it can learn the relationships between inputs and
outputs.
• Validation and Testing: After training, the model is validated
and tested on separate datasets to evaluate its performance
and its ability to generalize to unseen data.
4. Deep Learning:
• Deep Learning (DL): A specialized subset of ML involving neural
networks with many layers (hence "deep"). It is particularly
effective in areas like image and speech recognition.
• Neural Networks: Composed of interconnected nodes or
neurons, these networks mimic the human brain's structure
and are foundational to DL.
• Applications of DL: Include autonomous vehicles, natural
language processing, and medical image analysis.
Difference between Machine Learning and Deep
Learning
Machine learning and deep learning are both subsets of artificial
intelligence, but there are important differences between them.
| Machine Learning | Deep Learning |
| --- | --- |
| Applies statistical algorithms to learn the hidden patterns and relationships in the dataset. | Uses artificial neural network architectures to learn the hidden patterns and relationships in the dataset. |
| Can work on a smaller amount of data. | Requires a larger volume of data compared to machine learning. |
| Better for simpler, low-complexity tasks. | Better for complex tasks like image processing, natural language processing, etc. |
| Takes less time to train the model. | Takes more time to train the model. |
| A model is created from relevant features that are manually extracted from images to detect an object in the image. | Relevant features are automatically extracted from images; it is an end-to-end learning process. |
| Less complex and easier to interpret the results. | More complex; it works like a black box, so interpretations of the results are not easy. |
| Can work on a CPU, or requires less computing power compared to deep learning. | Requires a high-performance computer with a GPU. |
Linear Models: SVMs, Perceptron, and Logistic Regression
Linear models are fundamental components of machine learning,
providing simple yet powerful tools for classification and regression
tasks. These models assume a linear relationship between input
variables and the output, making them interpretable and easy to
implement. Let's explore three key linear models: Support Vector
Machines (SVMs), Perceptrons, and Logistic Regression.
1. Support Vector Machines (SVMs):
• Purpose: SVMs are primarily used for classification tasks. They
aim to find the optimal hyperplane that separates data points
of different classes with the maximum margin.
• Hyperplane: In an $n$-dimensional space, a hyperplane is a flat
affine subspace of dimension $n-1$. For a 2D dataset, the
hyperplane is a line; for 3D, it’s a plane, and so forth.
• Margin: The margin is the distance between the hyperplane
and the closest data points from each class. SVMs attempt to
maximize this margin to improve the model's robustness to new
data.
• Support Vectors: These are the data points that lie closest to
the hyperplane and influence its position. They are critical in
defining the hyperplane.
• Kernel Trick: SVMs can be extended to solve non-linear
problems using kernel functions, which transform the input
space into a higher-dimensional space where a linear separator
may exist.
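As a brief sketch of these ideas, scikit-learn's SVC exposes both linear and kernel SVMs; on data that is not linearly separable, the RBF kernel illustrates the kernel trick. The toy dataset and hyperparameter values below are assumptions for illustration.

```python
# Sketch: linear SVM vs. kernel SVM on a toy dataset (scikit-learn).
from sklearn.datasets import make_moons
from sklearn.svm import SVC

# Two interleaving half-moons: not linearly separable.
X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X, y)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)  # kernel trick

print("Linear kernel accuracy:", linear_svm.score(X, y))
print("RBF kernel accuracy:", rbf_svm.score(X, y))
print("Support vectors per class:", rbf_svm.n_support_)
```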
2. Perceptron:
• Purpose: The Perceptron is one of the simplest types of
artificial neural networks and is used for binary classification
tasks. It is the foundational building block for more complex
neural networks.
• Structure: A single-layer Perceptron consists of input nodes
connected to an output node. Each connection has an
associated weight, and the output is determined by a weighted
sum of inputs, followed by a threshold activation function.
• Learning Rule: The Perceptron uses a simple learning rule that
adjusts weights based on the error of the predictions. If the
prediction is incorrect, the weights are updated to reduce the
error.
• Limitations: The Perceptron can only classify linearly separable
data. For more complex datasets, multilayer Perceptrons (also
known as feedforward neural networks) are necessary.
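The Perceptron learning rule is simple enough to sketch in a few lines of NumPy; the toy AND dataset, learning rate, and epoch count below are illustrative assumptions.

```python
# Sketch: the Perceptron learning rule in NumPy (labels in {-1, +1}).
import numpy as np

def predict(w, b, x):
    # Threshold activation: +1 if the weighted sum is non-negative.
    return 1 if (w @ x + b) >= 0 else -1

def train_perceptron(X, y, lr=0.1, epochs=50):
    # Weights are updated only when a prediction is wrong.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            if predict(w, b, xi) != yi:
                w += lr * yi * xi
                b += lr * yi
    return w, b

# Linearly separable toy data: logical AND with labels -1/+1.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([-1, -1, -1, 1])
w, b = train_perceptron(X, y)
print([predict(w, b, xi) for xi in X])  # expect [-1, -1, -1, 1]
```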
3. Logistic Regression:
• Purpose: Despite its name, logistic regression is a classification
algorithm used to predict binary outcomes. It estimates the
probability that a given input belongs to a particular class.
• Logistic Function: It uses the logistic function (or sigmoid
function) to map predicted values to probabilities. The logistic
function is defined as:

$$\sigma(z) = \frac{1}{1 + e^{-z}}$$
• Decision Boundary: The decision boundary is a linear separator
determined by the weights learned during training. The
boundary is defined by the condition where the predicted
probability equals 0.5.
• Loss Function: Logistic regression uses the cross-entropy loss
function to measure the discrepancy between the predicted
probabilities and the actual class labels.
• Extensions: Logistic regression can be extended to multiclass
classification problems using techniques such as one-vs-rest
(OvR) or softmax regression.
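A short sketch tying these pieces together with scikit-learn: the sigmoid maps a linear score to a probability, and the decision boundary sits where that probability equals 0.5. The toy data is an assumption for illustration.

```python
# Sketch: logistic regression maps a linear score through the sigmoid.
import numpy as np
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    # Logistic (sigmoid) function: maps any real score into (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

# Toy 1-D data: class 1 becomes likelier as x grows.
X = np.array([[0.5], [1.0], [1.5], [3.0], [3.5], [4.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# The predicted probability is exactly sigmoid(w*x + b).
w, b = model.coef_[0, 0], model.intercept_[0]
print(np.allclose(model.predict_proba(X)[:, 1], sigmoid(w * X[:, 0] + b)))

# Decision boundary: where w*x + b = 0, i.e. predicted probability 0.5.
print("boundary at x =", -b / w)
```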
Neural Networks: Models, Types, and Training
Neural networks are a class of machine learning models inspired by
the structure and function of the human brain. They consist of
interconnected nodes, or neurons, that work together to process
input data and generate output. Neural networks have become a
cornerstone of deep learning, enabling significant advancements in
fields such as computer vision, natural language processing, and
more. Below, we explore the models, types, and training methods of
neural networks.
1. Models of Neural Networks:
• Artificial Neural Networks (ANNs): ANNs are the basic form of
neural networks, consisting of input layers, hidden layers, and
output layers. Each layer contains neurons interconnected with
the next layer through weighted connections.
• Feedforward Neural Networks (FNNs): In FNNs, connections
between nodes do not form cycles. Data flows in one direction
from input to output, making them suitable for tasks like image
and speech recognition.
• Recurrent Neural Networks (RNNs): RNNs are designed for
sequential data, where connections form cycles, allowing
information to persist. They are widely used in time-series
analysis and natural language processing.
• Convolutional Neural Networks (CNNs): CNNs are specialized
for processing grid-like data, such as images. They use
convolutional layers to automatically and adaptively learn
spatial hierarchies of features.
• Generative Adversarial Networks (GANs): GANs consist of two
neural networks, a generator and a discriminator, pitted against
each other. They are used to generate realistic data, such as
images and audio.
2. Types of Neural Networks:
• Single-Layer Perceptrons: The simplest form of neural network
with a single layer of output nodes. Suitable for linearly
separable data.
• Multilayer Perceptrons (MLPs): Consist of multiple layers of
neurons, allowing them to model complex, non-linear
relationships. They are the building blocks for many deep
learning models.
• Deep Neural Networks (DNNs): DNNs have many hidden layers
and are capable of learning intricate patterns in data. They
require large amounts of data and computational power.
• Long Short-Term Memory Networks (LSTMs): A type of RNN
designed to overcome the vanishing gradient problem, making
them effective for long-term dependencies in sequence
prediction tasks.
• Autoencoders: Unsupervised learning models used for tasks
such as dimensionality reduction and feature learning. They
consist of an encoder and a decoder network.
3. Training of Neural Networks:
• Data Preparation: Properly prepared and preprocessed data is
crucial. This includes normalization, handling missing values,
and data augmentation to improve model robustness.
• Forward Propagation: During training, input data is passed
through the network, with each neuron computing a weighted
sum of inputs and applying an activation function to produce
output.
• Loss Function: The loss function measures the discrepancy
between predicted outputs and actual targets. Common loss
functions include mean squared error for regression and cross-
entropy for classification.
• Backpropagation: An optimization algorithm used to minimize
the loss function by adjusting the weights of the network. It
involves computing gradients of the loss with respect to each
weight and updating them using techniques like gradient
descent.
• Activation Functions: Non-linear functions applied to each
neuron's output to introduce non-linearity into the network,
enabling it to model complex functions. Popular activation
functions include ReLU, sigmoid, and tanh.
• Regularization Techniques: Methods like dropout and L2
regularization help prevent overfitting by adding constraints to
the learning process.
• Evaluation and Tuning: The model's performance is evaluated
using metrics appropriate for the task. Hyperparameters such
as learning rate, batch size, and network architecture are tuned
to optimize performance.
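To make forward propagation, activation functions, and the loss concrete, here is a minimal sketch of a single forward pass in NumPy; the network size, data, and initialization are arbitrary assumptions.

```python
# Sketch: one forward pass through a tiny two-layer network in NumPy,
# showing weighted sums, activations, and a cross-entropy loss.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 3))            # batch of 4 samples, 3 features
y = np.array([0, 1, 1, 0])             # binary targets

W1, b1 = rng.normal(size=(3, 5)), np.zeros(5)   # hidden layer (5 units)
W2, b2 = rng.normal(size=(5, 1)), np.zeros(1)   # output layer

h = np.maximum(0, X @ W1 + b1)         # ReLU activation
p = 1 / (1 + np.exp(-(h @ W2 + b2)))   # sigmoid output in (0, 1)

# Binary cross-entropy loss averaged over the batch.
eps = 1e-12
loss = -np.mean(y * np.log(p[:, 0] + eps)
                + (1 - y) * np.log(1 - p[:, 0] + eps))
print("loss:", loss)
```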
Loss Function and Its Types
A loss function, also known as a cost function or objective function, is
a crucial component in the training of machine learning models. It
quantifies how well or poorly a model's predictions align with the
actual target values. By providing a measure of the discrepancy
between the predicted and actual values, the loss function guides the
optimization process, enabling the model to learn from data and
improve its performance. Below, we delve into the concept of loss
functions and explore some of the most common types used in
machine learning and deep learning.
1. Purpose of Loss Functions:
• Quantification of Error: The loss function provides a numerical
representation of the error in the model's predictions, which is
essential for assessing model performance.
• Guidance for Optimization: The optimization algorithm uses
the loss function to update the model's parameters (weights) in
a way that minimizes this error, thereby improving the model's
accuracy.
• Model Evaluation: By evaluating the loss on a validation set,
practitioners can gauge how well the model is expected to
perform on unseen data.
2. Types of Loss Functions:
The choice of loss function is dependent on the type of machine
learning task—whether it's regression, classification, or another
specialized task. Here are some commonly used loss functions:
For Regression:
• Mean Squared Error (MSE): MSE measures the average of the
squares of the errors, where the error is the difference between
predicted and actual values. It is given by:

$$\text{MSE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2$$

MSE is sensitive to outliers, as it squares the errors.
• Mean Absolute Error (MAE): MAE is the average of the
absolute differences between predicted and actual values. It is
defined as:

$$\text{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|$$

MAE is more robust to outliers compared to MSE.
• Huber Loss: A combination of MSE and MAE, Huber loss is less
sensitive to outliers while still penalizing large errors. It is
defined piecewise, with a parameter $\delta$ controlling the
transition between MAE and MSE behavior:

$$L_\delta(y, \hat{y}) = \begin{cases} \frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \le \delta \\ \delta\left(|y - \hat{y}| - \frac{1}{2}\delta\right) & \text{otherwise} \end{cases}$$
For Classification:
• Cross-Entropy Loss (Log Loss): Widely used in classification
tasks, cross-entropy loss measures the difference between the
true distribution and the predicted distribution. For binary
classification, it is given by:

$$L = -\frac{1}{n}\sum_{i=1}^{n}\left[y_i \log(\hat{y}_i) + (1 - y_i)\log(1 - \hat{y}_i)\right]$$
For multiclass classification, the formula extends to accommodate
multiple classes.
• Hinge Loss: Commonly used with Support Vector Machines,
hinge loss is suitable for "maximum-margin" classification. For
true labels $y \in \{-1, +1\}$ and raw model scores $\hat{y}$, it is
defined as:

$$L = \max(0,\, 1 - y \cdot \hat{y})$$

It focuses on correctly classifying data points with a margin.
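For concreteness, the sketch below implements these regression and classification losses directly in NumPy; the sample values at the end are illustrative only.

```python
# Sketch: the loss functions above implemented directly in NumPy.
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)

def mae(y_true, y_pred):
    return np.mean(np.abs(y_true - y_pred))

def binary_cross_entropy(y_true, p_pred, eps=1e-12):
    p = np.clip(p_pred, eps, 1 - eps)   # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def hinge(y_true, scores):
    # y_true in {-1, +1}; scores are raw (unthresholded) model outputs.
    return np.mean(np.maximum(0, 1 - y_true * scores))

y = np.array([1.0, 0.0, 1.0])
p = np.array([0.9, 0.2, 0.6])
print(mse(y, p), mae(y, p), binary_cross_entropy(y, p))
```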
Specialized Loss Functions:
• Kullback-Leibler Divergence (KL Divergence): Used in tasks
involving probability distributions, such as in Variational
Autoencoders, KL Divergence measures how one probability
distribution diverges from a second, expected probability
distribution.
• Contrastive Loss: Used in Siamese networks for tasks like face
verification, contrastive loss measures the similarity between
pairs of samples.
Propagation Learning Algorithm: Understanding Backpropagation
Backpropagation is a widely used propagation learning algorithm in
the training of artificial neural networks. It is a supervised learning
technique that involves propagating errors backward through the
network to update the model's weights, minimizing the loss function.
Backpropagation has been instrumental in the success of deep
learning, enabling neural networks to learn complex patterns from
data. Let's explore the principles and steps involved in the
backpropagation algorithm.
1. Overview of Backpropagation:
• Goal: Backpropagation aims to minimize the loss function by
adjusting the weights of the network based on the error
between predicted and actual outputs.
• Gradient Descent: The algorithm uses gradient descent, an
optimization technique, to find the set of weights that
minimizes the loss function.
• Chain Rule: Backpropagation relies on the chain rule of calculus
to compute gradients of the loss function with respect to each
weight in the network.
2. Steps in Backpropagation:
The backpropagation process involves several key steps, which are
typically repeated for each batch of training data until the model
converges:
• Forward Propagation:
1. Input Data: The process begins with feeding input data
into the network.
2. Activation Functions: Each neuron computes a weighted
sum of its inputs and applies an activation function to
produce an output.
3. Output Layer: The final layer produces the network's
predictions.
• Calculate Loss:
1. Loss Function: Compute the loss using a suitable loss
function (e.g., mean squared error for regression, cross-
entropy for classification).
• Backward Propagation:
1. Compute Gradients: Using the chain rule, compute the
gradient of the loss function with respect to each weight.
This involves:
• Calculating the gradient of the loss with respect to
the output of each neuron.
• Propagating these gradients backward through the
network.
2. Adjust Weights: Update the weights using the gradients
to reduce the loss. The weight update rule is typically:

$$w \leftarrow w - \eta \frac{\partial L}{\partial w}$$

where $w$ is the weight, $\eta$ is the learning rate, and $\frac{\partial L}{\partial w}$ is the
gradient of the loss with respect to the weight.
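Putting the steps together, below is a minimal sketch of backpropagation for a one-hidden-layer network in NumPy, trained on the classic XOR problem; the architecture, learning rate, and iteration count are illustrative assumptions.

```python
# Sketch: backpropagation for a one-hidden-layer network in NumPy,
# trained with plain gradient descent on the XOR dataset.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)   # XOR targets

W1, b1 = rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 1)), np.zeros(1)
lr = 0.5

for step in range(5000):
    # Forward propagation
    h = np.tanh(X @ W1 + b1)             # hidden activation
    p = 1 / (1 + np.exp(-(h @ W2 + b2))) # sigmoid output

    # Backward propagation (chain rule); for sigmoid + cross-entropy,
    # the gradient at the output simplifies to (p - y).
    dz2 = (p - y) / len(X)
    dW2, db2 = h.T @ dz2, dz2.sum(axis=0)
    dz1 = (dz2 @ W2.T) * (1 - h ** 2)    # tanh'(z) = 1 - tanh(z)^2
    dW1, db1 = X.T @ dz1, dz1.sum(axis=0)

    # Gradient descent update: w <- w - lr * dL/dw
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print(np.round(p[:, 0], 2))  # should approach [0, 1, 1, 0]
```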
Regularization in Machine Learning
Regularization is a set of techniques used to prevent overfitting in
machine learning models. Overfitting occurs when a model learns the
noise in the training data rather than the underlying pattern, leading
to poor generalization to unseen data. Regularization adds a penalty
to the loss function to constrain the complexity of the model,
encouraging it to focus on the most important features. Here, we
explore the concept of regularization and some common techniques
used to achieve it.
1. Importance of Regularization:
• Prevents Overfitting: By discouraging overly complex models,
regularization helps ensure that the model generalizes better to
new data.
• Improves Model Robustness: Regularization can lead to
simpler, more interpretable models that are less sensitive to
variations in the input data.
• Encourages Feature Selection: In some cases, regularization can
help identify the most relevant features, reducing the
dimensionality of the problem.
2. Common Regularization Techniques:
L1 and L2 Regularization:
• L1 Regularization (Lasso): Adds the absolute value of the
coefficients as a penalty term to the loss function. It can lead to
sparse models where some feature weights are reduced to zero,
effectively performing feature selection:

$$\text{Loss} = \text{Loss}_{\text{original}} + \lambda \sum_{j} |w_j|$$

where $\lambda$ is the regularization parameter and $w_j$ are the model
weights.
• L2 Regularization (Ridge): Adds the square of the coefficients as
a penalty term to the loss function, helping to prevent large
weights:

$$\text{Loss} = \text{Loss}_{\text{original}} + \lambda \sum_{j} w_j^2$$

L2 regularization tends to shrink weights more evenly, without setting
them to zero.
• Elastic Net: Combines L1 and L2 regularization, balancing the
benefits of both approaches. It is useful when multiple features
are correlated with the target.
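As a sketch of these penalties in practice, scikit-learn's Lasso and Ridge estimators implement L1 and L2 regularization for linear regression; here the alpha argument plays the role of $\lambda$, and the synthetic data is an assumption for illustration.

```python
# Sketch: L1 (Lasso) vs. L2 (Ridge) regularization with scikit-learn.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

# Only 3 of the 10 features actually influence the target.
X, y = make_regression(n_samples=100, n_features=10, n_informative=3,
                       noise=5.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)
ridge = Ridge(alpha=1.0).fit(X, y)

# L1 tends to zero out uninformative features; L2 only shrinks them.
print("Lasso zero weights:", np.sum(lasso.coef_ == 0))
print("Ridge zero weights:", np.sum(ridge.coef_ == 0))
```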
Dropout:
• Purpose: Dropout is a regularization technique used in neural
networks to prevent overfitting by randomly setting a fraction
of the neurons to zero during each training iteration.
• Mechanism: This forces the network to learn robust features
that are not reliant on specific neurons, improving
generalization.
• Implementation: During training, each neuron's output is
retained with a probability $p$, and during testing, the full
network is used with outputs scaled by $p$.
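A minimal sketch of the dropout mechanism as a random mask in NumPy, following the non-inverted formulation described above (keep with probability $p$ during training, scale by $p$ at test time):

```python
# Sketch: dropout as a random binary mask in NumPy.
import numpy as np

rng = np.random.default_rng(0)

def dropout_train(activations, p=0.8):
    mask = rng.random(activations.shape) < p   # keep with probability p
    return activations * mask

def dropout_test(activations, p=0.8):
    return activations * p                     # scale instead of masking

h = np.ones(10)
print(dropout_train(h))   # some entries zeroed at random
print(dropout_test(h))    # all entries scaled by p
```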
Early Stopping:
• Purpose: Early stopping is a simple yet effective regularization
technique that involves monitoring the model's performance on
a validation set during training.
• Mechanism: Training is halted when the validation performance
stops improving, preventing the model from overfitting to the
training data.
• Benefits: Early stopping reduces training time and can lead to
better generalization on unseen data.
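The early-stopping logic itself is only a few lines; in the sketch below, the per-epoch validation losses are a made-up sequence standing in for real evaluation on a validation set.

```python
# Sketch: early stopping on a sequence of validation losses.
val_losses = [0.90, 0.70, 0.55, 0.50, 0.52, 0.51, 0.53, 0.54, 0.56]

best_loss, best_epoch = float("inf"), -1
patience, wait = 3, 0
for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_loss:
        best_loss, best_epoch, wait = val_loss, epoch, 0  # improvement
    else:
        wait += 1
        if wait >= patience:  # no improvement for `patience` epochs
            print(f"stopping at epoch {epoch}; best was epoch {best_epoch}")
            break
```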
Data Augmentation:
• Purpose: Data augmentation involves artificially increasing the
size of the training dataset by applying transformations such as
rotation, scaling, and flipping to the input data.
• Mechanism: By exposing the model to a wider variety of data,
data augmentation helps the model learn more robust features
that generalize well.
• Applications: Commonly used in image and audio processing to
simulate real-world variations.
Batch Normalization: Enhancing Neural Network Training
Batch normalization is a technique used to improve the training of
deep neural networks by normalizing the inputs of each layer.
Introduced by Sergey Ioffe and Christian Szegedy in 2015, batch
normalization addresses issues related to internal covariate shift,
accelerates convergence, and can act as a regularizer to reduce the
need for other regularization techniques such as dropout. Here's an
exploration of batch normalization, its purpose, and its
implementation.
1. Purpose of Batch Normalization:
• Internal Covariate Shift: This refers to the change in the
distribution of network activations due to updates in the
parameters of previous layers during training. Batch
normalization reduces this shift, stabilizing the learning process.
• Faster Convergence: By normalizing inputs, batch normalization
allows the network to use higher learning rates, speeding up
convergence and reducing training time.
• Regularization Effect: It introduces a slight regularization effect,
which can improve generalization and reduce the need for
other forms of regularization.
2. How Batch Normalization Works:
Batch normalization is applied to each mini-batch of data during
training. The process involves the following steps:
• Normalization: For each feature in the mini-batch, compute the
mean and variance, and normalize the input by subtracting the
mean and dividing by the standard deviation. This transforms
the input to have a mean of 0 and a variance of 1:

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}$$

where $\mu_B$ and $\sigma_B^2$ are the mini-batch mean and variance, and $\epsilon$ is a
small constant for numerical stability.

• Scaling and Shifting: Apply learned scaling ($\gamma$) and shifting ($\beta$)
parameters to the normalized input. This allows the model to
learn the optimal mean and variance for each feature:

$$y_i = \gamma \hat{x}_i + \beta$$
• Learning Parameters: The parameters γ and β are learned
during training, allowing the model to retain the
representational power of the network.
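A minimal NumPy sketch of the batch-normalization forward pass for one mini-batch, following the two formulas above; the batch size and feature count are arbitrary assumptions.

```python
# Sketch: batch-normalization forward pass for one mini-batch.
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # per-feature mini-batch mean
    var = x.var(axis=0)                    # per-feature mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize: mean 0, variance 1
    return gamma * x_hat + beta            # learned scale and shift

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))  # batch of 32, 4 features
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(3))  # approximately 0 per feature
print(out.std(axis=0).round(3))   # approximately 1 per feature
```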
3. Benefits of Batch Normalization:
• Improved Stability: Reduces the sensitivity to the scale of
features and initial weights, leading to a more stable and robust
training process.
• Higher Learning Rates: By mitigating issues like vanishing and
exploding gradients, batch normalization enables the use of
higher learning rates, accelerating the convergence of the
model.
• Reduced Dependence on Initialization: Makes the network less
sensitive to the choice of initial parameters, simplifying the
model development process.
• Regularization Effect: Acts as a regularizer by adding noise to
the network through mini-batch statistics, potentially reducing
the need for dropout or other regularization techniques.
4. Considerations and Limitations:
• Mini-Batch Dependency: The normalization is based on mini-
batch statistics, which can introduce noise, especially with small
batch sizes. In such cases, careful tuning or techniques like layer
normalization might be needed.
• Inference Phase: During inference, batch normalization uses
the moving averages of mean and variance computed during
training instead of mini-batch statistics to ensure consistent
performance.
• Applicability: While batch normalization is widely used in
convolutional neural networks (CNNs), its effectiveness in other
architectures, like recurrent neural networks (RNNs), may vary.
Supervised Machine Learning
Supervised learning is a type of machine learning where a model is
trained on labeled data, meaning each input is paired with the
correct output. The model learns by comparing its predictions with
the actual answers provided in the training data. Over time, it adjusts
itself to minimize errors and improve accuracy. The goal of
supervised learning is to make accurate predictions when given new,
unseen data. For example, if a model is trained to recognize
handwritten digits, it will use what it learned to correctly identify
new numbers it hasn’t seen before.
Supervised learning can be applied in various forms, including
classification and regression, making it a crucial technique in
artificial intelligence and data mining.
How Supervised Machine Learning Works?
A supervised learning algorithm learns from input features and their
corresponding output labels. The process works as follows:
• Training Data: The model is provided with a training dataset
that includes input data (features) and corresponding output
data (labels or target variables).
• Learning Process: The algorithm processes the training data,
learning the relationships between the input features and the
output labels. This is achieved by adjusting the model’s
parameters to minimize the difference between its predictions
and the actual labels.
After training, the model is evaluated using a test dataset to measure
its accuracy and performance. Then the model’s performance is
optimized by adjusting parameters and using techniques like cross-
validation to balance bias and variance. This ensures the model
generalizes well to new, unseen data.
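As a sketch of this workflow end to end with scikit-learn, using the handwritten-digits example mentioned above; the model choice, split size, and iteration limit are illustrative assumptions.

```python
# Sketch: the supervised workflow: split labeled data, train,
# then evaluate on held-out examples.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)        # labeled handwritten digits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=2000).fit(X_train, y_train)
print("test accuracy:", model.score(X_test, y_test))

# Cross-validation gives a more stable performance estimate.
print("CV scores:", cross_val_score(model, X_train, y_train, cv=5))
```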
In summary, supervised machine learning involves training a model
on labeled data to learn patterns and relationships, which it then
uses to make accurate predictions on new data.
Shallow Neural Networks vs Deep Neural Networks
| Shallow Neural Networks | Deep Neural Networks |
| --- | --- |
| Few layers (usually one hidden layer). | Many layers (multiple hidden layers). |
| Complexity is low. | Complexity is high. |
| Limited learning capacity. | Higher learning capacity. |
| Lower risk of overfitting. | Higher risk of overfitting. |
| Requires less data. | Requires more data for effective training. |
| Fewer parameters. | Many more parameters. |
| Requires less computational resources. | Requires more computational resources (e.g., GPUs). |
| Easier to interpret. | More difficult to interpret. |
| Examples: single-layer Perceptron, logistic regression. | Examples: Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs). |