Deep Learning UNIT-II Part1

This document covers key concepts in deep learning, including the vanishing and exploding gradient problems, hyperparameters, and the building blocks of deep networks. It discusses various activation functions, weight initialization strategies, and the importance of mini-batch size and vectorization in training models. Additionally, it outlines the roles of input, hidden, and output layers, as well as the significance of loss functions and optimizers in the training process.

UNIT-II

Syllabus
• Vanishing Gradients and Exploding Gradients problem.
• Hyperparameters – layer size, magnitudes, regularization, activation
functions, weight initialization strategies, mini-batch size,
vectorization.
• Building blocks of deep networks – feed-forward multi-layer neural
networks.
• Unsupervised Pretrained Networks – Autoencoders: Sparse
Autoencoders, Denoising Autoencoders, Deep Belief Networks
(DBNs), Generative Adversarial Networks (GANs).
• Training a multi-layer artificial neural network and training deep
models.
Example: computation at a hidden neuron h1
h1 = w1 * x1 + w4 * x2 + b
Output at h1 = Sigmoid(h1)
Vanishing Gradient Problem
• Definition: The gradients become very small as they are propagated
backward, especially in deep networks.
• Effect:
• Early layers (close to the input) learn very slowly or stop learning.
• This happens because the small gradients essentially "vanish" during
backpropagation.
Why it happens?
• Activation functions like sigmoid or tanh squash their outputs into a
small range (e.g., 0 to 1 for sigmoid).
• Their derivatives are also small for most input values.
• When these small derivatives are multiplied across many layers, the
gradient becomes negligible.
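To see the effect numerically, here is a short NumPy sketch (illustrative, not from these slides): it multiplies one sigmoid derivative per layer, as backpropagation does, using an assumed depth of 10 layers.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # sigmoid'(x) = sigmoid(x) * (1 - sigmoid(x)), at most 0.25
    s = sigmoid(x)
    return s * (1.0 - s)

# Multiplying one such derivative per layer, as backpropagation does:
layers = 10
gradient = 1.0
for layer in range(layers):
    gradient *= sigmoid_derivative(0.0)   # 0.25 at x = 0, even smaller elsewhere

print(gradient)   # 0.25**10 ≈ 9.5e-07 -> the gradient has effectively vanished
```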
Vanishing Gradients & Exploding Gradients
• If the gradients are too small or too large, they can cause problems
for the model.
• During backpropagation, the gradient reaching an early-layer weight is a
product of one derivative term per layer, e.g.
∂L/∂w1 = ∂L/∂output × f′(h_n) × w_n × f′(h_(n-1)) × w_(n-1) × … (remaining terms)
• If we use the Sigmoid activation function, each derivative term f′(h) is at most 0.25.
• If we use the Tanh activation function, each derivative term f′(h) is at most 1.
• Multiplying many such small terms drives the gradient towards zero in deep networks.
Solution for the Vanishing Gradient Problem: ReLU Activation
• ReLU's derivative is 1 for positive inputs, so the derivative terms in the
product above do not shrink the gradient layer by layer.
Exploding Gradient Problem
• The gradients become extremely large as they are propagated
backward.
• Effect: The weights grow uncontrollably, leading to unstable training.
• Why it happens : If the weights or gradients are too large, the
multiplication during backpropagation causes them to grow
exponentially.
Solution for Exploding Gradients: Gradient Clipping
• Gradient clipping caps gradients at a chosen threshold during backpropagation.
• Generally, the clip value is specified while defining the optimizer.
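As an illustration, the sketch below assumes TensorFlow/Keras (the slides do not name a specific framework) and shows the two usual ways to set a clipping threshold on the optimizer; the threshold values are arbitrary.

```python
import tensorflow as tf

# Clip each gradient component to the range [-0.5, 0.5]
opt_clip_value = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=0.5)

# Or clip the whole gradient vector so its L2 norm is at most 1.0
opt_clip_norm = tf.keras.optimizers.Adam(learning_rate=0.001, clipnorm=1.0)

# Either optimizer is then passed to model.compile(optimizer=..., loss=...)
```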


Solutions

• Vanishing Gradient:
• Use activation functions with non-vanishing gradients (e.g., ReLU, Leaky
ReLU).
• Initialize weights properly (e.g., Xavier Initialization, He Initialization).
• Use techniques like Batch Normalization to stabilize training.
• Exploding Gradient:
• Apply gradient clipping to cap large gradients.
• Use regularization techniques like L2-norm to control weight growth.
• Ensure proper weight initialization.
Weight Initialization Techniques

• Xavier (Glorot) Initialization: designed for tanh activation, scales weights based on the number of input and output neurons.
• He Initialization: designed for ReLU activation, scales weights based on the number of input neurons.
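A minimal NumPy sketch of the two schemes is given below; it is an illustration under assumed layer sizes, not code from the slides.

```python
import numpy as np

def xavier_init(fan_in, fan_out):
    # Xavier/Glorot (normal variant): variance = 2 / (fan_in + fan_out)
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return np.random.randn(fan_in, fan_out) * std

def he_init(fan_in, fan_out):
    # He: variance = 2 / fan_in, suited to ReLU layers
    std = np.sqrt(2.0 / fan_in)
    return np.random.randn(fan_in, fan_out) * std

W1 = xavier_init(784, 128)   # e.g. a 784 -> 128 tanh layer
W2 = he_init(128, 64)        # e.g. a 128 -> 64 ReLU layer
```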
Example: model parameters for a single input feature and for multiple input features – each weight indicates the importance of its input feature.
Hyperparameters
• Hyperparameters are key settings or configurations that influence how a neural
network model is trained and how well it performs.
• Hyperparameters are set before training begins.
Different hyperparameters to be considered:
• Number of Epochs
• Learning rate
• Layer Size (Number of Neurons per Layer)
• Magnitudes (Weight Magnitudes)
• Regularization
• Activation Functions
• Weight Initialization Strategies
• Mini-Batch Size
• Vectorization
Number of Epochs
• An epoch refers to one complete pass through the entire training dataset during
the training process.
• The number of epochs determines how many times the model will iterate over
the entire dataset.
• If the number of epochs is too small, the model might not have enough
opportunity to learn and converge to a good solution (underfitting).
• If it is too large, the model might learn the training data too well, leading to
overfitting (memorizing the data and not generalizing well to new data).
• We can use early stopping (stopping training if performance stops improving) to
avoid overfitting.
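As a hedged illustration of early stopping, the sketch below assumes TensorFlow/Keras; the tiny model and random data exist only to make the example runnable and are not from the slides.

```python
import numpy as np
import tensorflow as tf

x_train = np.random.randn(500, 20).astype("float32")
y_train = np.random.randint(0, 2, size=(500, 1))

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(20,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss",          # watch validation loss
    patience=5,                  # stop after 5 epochs with no improvement
    restore_best_weights=True,   # roll back to the best epoch seen
)

model.fit(x_train, y_train,
          epochs=100,            # upper bound; early stopping usually ends sooner
          validation_split=0.2,
          callbacks=[early_stop])
```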
Learning Rate
• The learning rate controls how much the model's weights are
updated with respect to the error or loss after each training step.
new weight = old weight - (learning rate * gradient)
• A high learning rate can cause the model to converge too quickly,
potentially skipping over the optimal solution.
• A low learning rate ensures smaller updates to weights, which might
make the training process more stable but also slow, possibly getting
stuck in suboptimal solutions (local minima).
• Adaptive learning rate methods (like Adam) can be used to fine-tune
learning during training.
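The toy NumPy sketch below applies the update rule above to a one-parameter loss L(w) = w**2; the loss, starting weight and learning rate are arbitrary illustration choices, not from the slides.

```python
learning_rate = 0.1
weight = 5.0                      # arbitrary starting weight

for step in range(100):
    gradient = 2 * weight         # gradient of the toy loss L(w) = w**2
    weight = weight - learning_rate * gradient   # new weight = old weight - lr * gradient

print(weight)   # close to 0, the minimum of L(w) = w**2
```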
Layer Size (Number of Neurons per Layer)
• The layer size determines the capacity of the network to learn from
the data.
• Too few neurons may not allow the model to capture complex
patterns, while too many can lead to overfitting.
• A common practice is to start with a moderate number of neurons
and adjust based on performance.
Magnitudes (Weight Magnitudes)
• Refers to the size of the values assigned to the weights (connections
between neurons).
• If the weights are too large, they can lead to unstable gradients and
difficulty in training.
• Very small weights can result in the network being unable to learn
effectively.
• It is better to use proper weight initialization techniques (see Weight
Initialization Strategies below).
Regularization
• Techniques used to prevent the model from overfitting the training
data, i.e., the model becoming too specific to the training set and not
generalizing well to new data.
• Common examples include L2 weight penalties and early stopping (both
mentioned above).
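For example, assuming TensorFlow/Keras, an L2 penalty can be attached to a layer's weights as sketched below; the penalty factor 0.01 and the layer size are arbitrary illustration values.

```python
import numpy as np
import tensorflow as tf

# Attach an L2 penalty to a layer's weights
layer = tf.keras.layers.Dense(
    64,
    activation="relu",
    kernel_regularizer=tf.keras.regularizers.l2(0.01),  # adds 0.01 * sum(w**2) to the loss
)
out = layer(np.random.randn(8, 20).astype("float32"))    # build the layer on dummy input
print(layer.losses)                                       # the regularization term Keras tracks
```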
Activation Functions
• Mathematical functions applied to the output of each neuron to introduce non-
linearity into the model.
• These functions allow the neural network to learn more complex patterns.
• Commonly used Activation functions:
• ReLU (Rectified Linear Unit): Most commonly used, it returns 0 for negative
inputs and the input itself for positive ones.
• Sigmoid: Compresses values between 0 and 1, often used for binary classification
problems.
• Tanh: Similar to sigmoid but compresses values between -1 and 1.
• Choosing the right one helps the network capture complex relationships in the
data.
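The short NumPy sketch below writes out these activation functions (plus softmax, which appears later in the output-layer discussion); it is illustrative rather than code taken from the slides.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)          # 0 for negative inputs, x for positive ones

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))    # squashes values into (0, 1)

def tanh(x):
    return np.tanh(x)                  # squashes values into (-1, 1)

def softmax(x):
    e = np.exp(x - np.max(x))          # subtract max for numerical stability
    return e / e.sum()                 # outputs sum to 1 (multi-class outputs)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x), sigmoid(x), tanh(x), softmax(x))
```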
Weight Initialization Strategies
• Refers to how the initial weights are set before training begins.
• Proper initialization can help the model converge faster and avoid
problems like vanishing or exploding gradients.
• Common strategies:
• Random Initialization: Weights are initialized randomly, often using
small values from a normal distribution.
• Xavier/Glorot Initialization
• He Initialization
Mini-Batch Size
• Refers to the number of training examples used in one
forward/backward pass (one "iteration") of the model.
• The mini-batch size affects how often the model updates its weights.
• Smaller mini-batches lead to noisier gradient updates and require more
update steps to cover the dataset, which can slow training down.
• Larger mini-batches give smoother, more stable updates, but each
individual update takes longer to compute and uses more memory.
• A typical mini-batch size is between 32 and 128, but it can vary based
on the dataset and model.
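A minimal NumPy sketch of splitting a shuffled dataset into mini-batches is shown below; the data, labels and batch size of 32 are placeholder assumptions.

```python
import numpy as np

X = np.random.randn(1000, 20)              # 1000 samples, 20 features
y = np.random.randint(0, 2, size=1000)     # binary labels

batch_size = 32
indices = np.random.permutation(len(X))    # shuffle once per epoch

for start in range(0, len(X), batch_size):
    batch_idx = indices[start:start + batch_size]
    X_batch, y_batch = X[batch_idx], y[batch_idx]
    # one forward/backward pass and weight update would happen here
```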
Vectorization
• The process of converting data into a Vector(Matrix) format that
allows for efficient parallel computation.
• In deep learning, this often means representing inputs (like images or
text) as matrices or tensors and performing operations on these
structures efficiently.
• Deep learning models often involve millions of parameters, and
performing operations in a vectorized way (using matrix
multiplications, for instance) allows the model to leverage hardware
like GPUs for faster computations.
• Modern deep learning frameworks (like TensorFlow and PyTorch)
automatically take care of vectorization, but it’s important to
understand the concept since it significantly speeds up training.
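The sketch below compares a plain Python loop with the equivalent vectorized NumPy matrix-vector product; the layer sizes are arbitrary, but both versions compute the same result and the vectorized one runs far faster.

```python
import time
import numpy as np

W = np.random.randn(256, 512)
x = np.random.randn(512)

# Loop version: one output neuron at a time
start = time.time()
out_loop = np.array([sum(W[i, j] * x[j] for j in range(512)) for i in range(256)])
loop_time = time.time() - start

# Vectorized version: a single matrix-vector product
start = time.time()
out_vec = W @ x
vec_time = time.time() - start

print(np.allclose(out_loop, out_vec), loop_time, vec_time)
```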
Building blocks of deep networks
These components work together to form a deep
neural network.
• Input Layer
• Purpose: Accepts the input data in a structured format.
• Details:
• The number of neurons in this layer matches the number of features in the
dataset.
• Example: For an image of size 28×28, the input layer would have 784 neurons
if the image is flattened.
• In an image recognition task, each input neuron could represent a pixel of the
image.
Hidden Layers
• Purpose: Extract features and perform computations.
• Details:
• These layers are placed between the input and output layers.
• They consist of neurons that apply transformations to the data.
• The transformations are controlled by weights (learned during training) and
biases.
• Example: Dense (fully connected) layers, Convolutional layers, Recurrent
layers.
Depth of Network = Number of Hidden Layers + 2 (counting the input and output layers)
Neurons
• Purpose: Perform computations.
• Details:
• Each neuron computes a weighted sum of its inputs, adds a bias, and applies
an activation function.
• Output of a neuron: a = activation(z), where z = ∑ ( w ⋅ x ) + b
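A minimal NumPy sketch of a single neuron's computation, with arbitrary illustrative values:

```python
import numpy as np

x = np.array([0.5, -1.2, 3.0])   # inputs
w = np.array([0.4, 0.1, -0.6])   # weights
b = 0.2                          # bias

z = np.dot(w, x) + b             # weighted sum plus bias
a = 1.0 / (1.0 + np.exp(-z))     # sigmoid activation applied to z
print(z, a)
```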
Activation Functions
• Purpose: Introduce non-linearity to the model, enabling it to learn
complex patterns.
• Examples:
• ReLU, Sigmoid, Tanh: as described under Activation Functions above.
• Softmax: used in the output layer for multi-class classification; it turns the
raw outputs into probabilities that sum to 1.
Output Layer
• Purpose: Produces the final output of the model.
• Details:
• The number of neurons corresponds to the number of output classes or
predictions.

• Activation functions depend on the problem type:


• Regression: Linear activation.
• Binary classification: Sigmoid.
• Multi-class classification: Softmax.
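Putting the layers together, the sketch below assumes TensorFlow/Keras, a flattened 28×28 input and 10 output classes (as in the earlier input-layer example); the hidden-layer sizes are arbitrary assumptions.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(784,)),               # input layer: one value per pixel
    tf.keras.layers.Dense(128, activation="relu"),     # hidden layer
    tf.keras.layers.Dense(64, activation="relu"),      # hidden layer
    tf.keras.layers.Dense(10, activation="softmax"),   # output layer: multi-class classification
])
model.summary()
```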
Weights and Biases
• Purpose: Represent the learnable parameters of the model.
• Details:
• Weights: Control the importance of each input feature.
• Biases: Allow the activation functions to shift.
• Loss Function
• Purpose: Measures the difference between the predicted output and
the true output.
• Examples:
• Mean Squared Error (MSE) for regression.
• Cross-Entropy Loss for classification.
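Minimal NumPy versions of these two losses are sketched below with arbitrary example values (illustrative, not from the slides).

```python
import numpy as np

def mse(y_true, y_pred):
    return np.mean((y_true - y_pred) ** 2)          # Mean Squared Error for regression

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    y_pred = np.clip(y_pred, eps, 1 - eps)          # avoid log(0)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(mse(np.array([1.0, 2.0]), np.array([1.5, 1.5])))           # 0.25
print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.2])))
```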
Optimizer
• Purpose: Updates the weights and biases to minimize the loss
function.
• Examples:
• Gradient Descent
• Adam
• RMSProp
• Forward Propagation
• Purpose: Passes input data through the network to compute
predictions.
• input → weighted sum + bias → activation function → next layer

• Backward Propagation
• Purpose: Updates weights and biases using gradients computed from
the loss function via the chain rule.
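The end-to-end sketch below shows one forward and one backward pass for a tiny 2-2-1 network with sigmoid activations and a squared-error loss; all sizes and values are illustrative assumptions, not code from the slides.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = np.array([0.5, -0.3])            # input features
y = 1.0                              # target output
W1, b1 = rng.normal(size=(2, 2)), np.zeros(2)
W2, b2 = rng.normal(size=(2,)), 0.0
lr = 0.1

# Forward propagation: input -> weighted sum + bias -> activation -> next layer
z1 = W1 @ x + b1
a1 = sigmoid(z1)                     # hidden layer output
z2 = W2 @ a1 + b2
y_hat = sigmoid(z2)                  # prediction
loss = 0.5 * (y_hat - y) ** 2        # squared-error loss

# Backward propagation: gradients via the chain rule
dz2 = (y_hat - y) * y_hat * (1 - y_hat)
dW2, db2 = dz2 * a1, dz2
dz1 = dz2 * W2 * a1 * (1 - a1)
dW1, db1 = np.outer(dz1, x), dz1

# Gradient-descent update of weights and biases
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print(loss)
```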
