Deep Learning UNIT-II Part1
Syllabus
• Vanishing Gradients and Exploding Gradients problem.
• Hyperparameters – layer size, magnitudes, regularization, activation
functions, weight initialization strategies, mini-batch size,
vectorization.
• Building blocks of deep networks- Feed Forward multi-layer neural
networks
• Unsupervised Pretrained Networks – Autoencoders: Sparse
Autoencoders, Denoising Autoencoders, Deep Belief Networks
(DBNs), Generative Adversarial Networks (GANs).
• Training a multi-layer artificial neural network and training deep
models.
Example: forward computation at a hidden neuron h1 with inputs x1 and x2:
h1 = w1 * x1 + w4 * x2 + b
Output at h1 = Sigmoid(h1)
Vanishing Gradient Problem
• Definition: The gradients become very small as they are propagated
backward, especially in deep networks.
• Effect:
• Early layers (close to the input) learn very slowly or stop learning.
• This happens because the small gradients essentially "vanish" during
backpropagation.
Why it happens?
• Activation functions like sigmoid or tanh squash their outputs into a
small range (e.g., 0 to 1 for sigmoid).
• Their derivatives are also small for most input values.
• When these small derivatives are multiplied across many layers, the
gradient becomes negligible.
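A minimal NumPy sketch (values chosen only for illustration) of how multiplying one small sigmoid derivative per layer makes the gradient factor shrink with depth:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)          # at most 0.25 (reached at z = 0)

# Chain-rule product of one sigmoid derivative per layer.
# Even in the best case (derivative = 0.25) the product shrinks
# exponentially with the number of layers.
for depth in (5, 10, 20):
    grad_factor = np.prod([sigmoid_derivative(0.0)] * depth)
    print(f"depth={depth:2d}  gradient factor={grad_factor:.2e}")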
Vanishing Gradients & Exploding Gradients
• If the gradients are too small or too large, they can cause problems
for the model.
• If we use the Sigmoid activation function, each derivative term in the chain-rule product is at most 0.25.
• If we use the Tanh activation function, each derivative term is at most 1, and much smaller away from zero.
• Multiplying these derivative terms (together with the remaining weight terms of the chain rule) across many layers drives the gradient toward zero.
Solution for the Vanishing Gradient Problem: ReLU Activation (its derivative is 1 for all positive inputs, so the chain-rule product does not shrink toward zero)
Exploding Gradient Problem
• The gradients become extremely large as they are propagated
backward.
• Effect: The weights grow uncontrollably, leading to unstable training.
• Why it happens : If the weights or gradients are too large, the
multiplication during backpropagation causes them to grow
exponentially.
Solution for Exploding Gradients using Gradient Clipping
• Vanishing Gradient:
• Use activation functions with non-vanishing gradients (e.g., ReLU, Leaky
ReLU).
• Initialize weights properly (e.g., Xavier Initialization, He Initialization).
• Use techniques like Batch Normalization to stabilize training.
• Exploding Gradient:
• Apply gradient clipping to cap large gradients (see the sketch after this list).
• Use regularization techniques like L2-norm to control weight growth.
• Ensure proper weight initialization.
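A minimal sketch of gradient clipping by global norm, assuming NumPy and an illustrative max_norm of 1.0; deep learning frameworks provide this built in (e.g., torch.nn.utils.clip_grad_norm_ in PyTorch):

import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale all gradients down so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]
    return grads

# Example: an "exploded" gradient is capped before the weight update.
grads = [np.array([30.0, -40.0])]              # norm = 50, far too large
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(clipped[0])                              # [ 0.6 -0.8]  -> norm = 1.0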
Weight Initialization Techniques
• Xavier/Glorot Initialization: designed for tanh (and sigmoid) activations; scales weights based on the number of input and output neurons.
• He Initialization: designed for ReLU activation; scales weights based on the number of input neurons.
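A minimal NumPy sketch of the two schemes, using the common uniform Xavier and normal He variants (layer sizes are illustrative):

import numpy as np

def xavier_init(n_in, n_out):
    """Xavier/Glorot: variance scaled by both fan-in and fan-out (suits tanh/sigmoid)."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return np.random.uniform(-limit, limit, size=(n_in, n_out))

def he_init(n_in, n_out):
    """He: variance scaled by fan-in only (suits ReLU)."""
    std = np.sqrt(2.0 / n_in)
    return np.random.normal(0.0, std, size=(n_in, n_out))

W1 = xavier_init(784, 128)   # e.g. a tanh hidden layer
W2 = he_init(128, 64)        # e.g. a ReLU hidden layer
print(W1.std(), W2.std())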
Example for Model Parameters
• The model's parameters are the weights, one per input feature, where each weight reflects the importance of its input feature.
• This holds for a single input feature as well as for multiple input features.
Hyperparameters
• Hyperparameters are key settings or configurations that influence how a neural
network model is trained and how well it performs.
• Hyperparameters are set before training begins.
Different hyperparameters to be considered
• Number of Epochs
• Learning rate
• Layer Size (Number of Neurons per Layer)
• Magnitudes (Weight Magnitudes)
• Regularization
• Activation Functions
• Weight Initialization Strategies
• Mini-Batch Size
• Vectorization
Number of Epochs
• An epoch refers to one complete pass through the entire training dataset during
the training process.
• The number of epochs determines how many times the model will iterate over
the entire dataset.
• If the number of epochs is too small, the model might not have enough
opportunity to learn and converge to a good solution (underfitting).
• If it is too large, the model might learn the training data too well, leading to
overfitting (memorizing the data and not generalizing well to new data).
• We can use early stopping (stopping training if performance stops improving) to
avoid overfitting.
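A minimal sketch of early stopping; train_one_epoch and evaluate are hypothetical stand-ins for a real training pass and validation check, and patience = 3 is an illustrative choice:

import random

def train_one_epoch(model):           # stand-in for one full pass over the training data
    pass

def evaluate(model):                  # stand-in: returns a (noisy) validation loss
    return random.random()

model = None                          # placeholder for a real model object
best_val_loss = float("inf")
patience, wait = 3, 0                 # stop after 3 epochs with no improvement

for epoch in range(100):              # upper bound on the number of epochs
    train_one_epoch(model)
    val_loss = evaluate(model)        # measure performance on held-out data
    if val_loss < best_val_loss:      # improvement: reset the patience counter
        best_val_loss, wait = val_loss, 0
    else:
        wait += 1                     # no improvement this epoch
        if wait >= patience:
            print(f"early stopping at epoch {epoch}")
            break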
Learning Rate
• The learning rate controls how much the model's weights are
updated with respect to the error or loss after each training step.
new weight = old weight - (learning rate * gradient)
• A high learning rate makes large weight updates that can overshoot or skip
over the optimal solution, making training unstable.
• A low learning rate ensures smaller updates to weights, which might
make the training process more stable but also slow, possibly getting
stuck in suboptimal solutions (local minima).
• Adaptive learning rate methods (like Adam) can be used to fine-tune the
learning rate during training.
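A minimal sketch of the update rule above on a toy loss, comparing a low, a reasonable, and a too-high learning rate (the loss and the rates are illustrative):

# Minimise loss(w) = (w - 3)**2, whose gradient is 2*(w - 3); the optimum is w = 3.
def gradient(w):
    return 2.0 * (w - 3.0)

for lr in (0.01, 0.5, 1.1):           # low, reasonable, and too-high learning rates
    w = 0.0
    for _ in range(20):
        w = w - lr * gradient(w)      # new weight = old weight - (learning rate * gradient)
    print(f"lr={lr}: w after 20 steps = {w:.3f}")   # low lr: still far from 3; high lr: diverges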
Layer Size (Number of Neurons per Layer)
• The layer size determines the capacity of the network to learn from
the data.
• Too few neurons may not allow the model to capture complex
patterns, while too many can lead to overfitting.
• A common practice is to start with a moderate number of neurons
and adjust based on performance.
Magnitudes (Weight Magnitudes)
• Refers to the size of the values assigned to the weights (connections
between neurons).
• If the weights are too large, they can lead to unstable gradients and
difficulty in training.
• Very small weights can result in the network being unable to learn
effectively.
• Proper weight initialization techniques (discussed below) help keep the magnitudes in a suitable range.
Regularization
• Techniques used to prevent the model from overfitting the training
data, i.e., the model becoming too specific to the training set and not
generalizing well to new data.
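A minimal NumPy sketch of one common regularization technique, an L2 (weight decay) penalty added to the loss; the value lam = 0.01 is illustrative:

import numpy as np

def l2_penalty(weights, lam=0.01):
    """Adds lam * sum(w^2) to the loss, discouraging large weights."""
    return lam * sum(np.sum(w ** 2) for w in weights)

def l2_gradient(w, lam=0.01):
    """Extra gradient term contributed by the penalty: 2 * lam * w."""
    return 2.0 * lam * w

weights = [np.array([0.5, -1.2]), np.array([2.0])]
data_loss = 0.8                                   # illustrative data-fit loss
total_loss = data_loss + l2_penalty(weights)      # regularized objective
print(total_loss, l2_gradient(weights[0]))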
Activation Functions
• Mathematical functions applied to the output of each neuron to introduce non-
linearity into the model.
• These functions allow the neural network to learn more complex patterns.
• Commonly used Activation functions:
• ReLU (Rectified Linear Unit): Most commonly used, it returns 0 for negative
inputs and the input itself for positive ones.
• Sigmoid: Compresses values between 0 and 1, often used for binary classification
problems.
• Tanh: Similar to sigmoid but compresses values between -1 and 1.
• Choosing the right one helps the network capture complex relationships in the
data.
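A minimal NumPy sketch of the three activation functions listed above:

import numpy as np

def relu(z):
    return np.maximum(0.0, z)            # 0 for negative inputs, z itself otherwise

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))      # squashes values into (0, 1)

def tanh(z):
    return np.tanh(z)                    # squashes values into (-1, 1)

z = np.array([-2.0, 0.0, 2.0])
print(relu(z), sigmoid(z), tanh(z))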
Weight Initialization Strategies
• Refers to how the initial weights are set before training begins.
• Proper initialization can help the model converge faster and avoid
problems like vanishing or exploding gradients.
• Common strategies:
• Random Initialization: Weights are initialized randomly, often using
small values from a normal distribution.
• Xavier/Glorot Initialization
• He Initialization
Mini-Batch Size
• Refers to the number of training examples used in one
forward/backward pass (one "iteration") of the model.
• The mini-batch size affects how often the model updates its weights.
• Smaller mini-batches lead to noisier gradient updates and use hardware
parallelism less efficiently, which can make each epoch slower.
• Larger mini-batches give smoother updates, but each update takes longer to
compute and needs more memory.
• A typical mini-batch size is between 32 and 128, but it can vary based
on the dataset and model.
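A minimal NumPy sketch of one epoch of mini-batch iteration; the dataset size (1000) and batch size (32) are illustrative:

import numpy as np

X = np.random.randn(1000, 20)        # 1000 examples, 20 features each
y = np.random.randint(0, 2, 1000)    # binary labels
batch_size = 32

indices = np.random.permutation(len(X))      # shuffle once per epoch
for start in range(0, len(X), batch_size):
    batch_idx = indices[start:start + batch_size]
    X_batch, y_batch = X[batch_idx], y[batch_idx]
    # one forward/backward pass and weight update would happen here
print(f"{int(np.ceil(len(X) / batch_size))} weight updates per epoch")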
Vectorization
• The process of converting data into a vector or matrix (tensor) format that
allows for efficient parallel computation.
• In deep learning, this often means representing inputs (like images or
text) as matrices or tensors and performing operations on these
structures efficiently.
• Deep learning models often involve millions of parameters, and
performing operations in a vectorized way (using matrix
multiplications, for instance) allows the model to leverage hardware
like GPUs for faster computations.
• Modern deep learning frameworks (like TensorFlow and PyTorch)
automatically take care of vectorization, but it’s important to
understand the concept since it significantly speeds up training.
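A minimal NumPy sketch contrasting an explicit Python loop with the equivalent vectorized matrix multiplication (array sizes are illustrative):

import numpy as np
import time

X = np.random.randn(256, 512)     # a mini-batch of 256 inputs, 512 features each
W = np.random.randn(512, 128)     # weights of a layer with 128 neurons

# Loop version: compute each output element one at a time.
start = time.time()
out_loop = np.zeros((256, 128))
for i in range(256):
    for j in range(128):
        out_loop[i, j] = np.dot(X[i], W[:, j])
loop_time = time.time() - start

# Vectorized version: a single matrix multiplication.
start = time.time()
out_vec = X @ W
vec_time = time.time() - start

print(np.allclose(out_loop, out_vec), loop_time, vec_time)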
Building blocks of deep networks
These components work together to form a deep neural network.
• Input Layer
• Purpose: Accepts the input data in a structured format.
• Details:
• The number of neurons in this layer matches the number of features in the
dataset.
• Example: For an image of size 28×28, the input layer would have 784 neurons
if the image is flattened (see the sketch after this list).
• In an image recognition task, each input neuron could represent one pixel of the
image.
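A minimal NumPy sketch of flattening a 28×28 image into the 784-value input vector mentioned above:

import numpy as np

image = np.random.rand(28, 28)        # e.g. one grayscale 28x28 image
input_vector = image.reshape(-1)      # flatten to shape (784,)
print(input_vector.shape)             # (784,) -> one value per input neuron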
Hidden Layers
• Purpose: Extract features and perform computations.
• Details:
• These layers are placed between the input and output layers.
• They consist of neurons that apply transformations to the data.
• The transformations are controlled by weights (learned during training) and
biases.
• Example: Dense (fully connected) layers, Convolutional layers, Recurrent
layers.
Depth of Network = Number of Hidden Layers + 2 (the + 2 counts the input and output layers)
Neurons
• Purpose: Perform computations.
• Details:
• Each neuron computes a weighted sum of its inputs, adds a bias, and applies
an activation function.
• Output of a neuron:
z = ∑ ( w ⋅ x ) + b,   output a = activation(z)
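A minimal NumPy sketch of this neuron computation with an illustrative set of inputs, weights, and bias:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.5, -1.0, 2.0])     # inputs to the neuron
w = np.array([0.2, 0.4, -0.1])     # one weight per input
b = 0.3                            # bias

z = np.dot(w, x) + b               # weighted sum of inputs plus bias
a = sigmoid(z)                     # activation applied to z
print(z, a)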
Activation Functions
• Purpose: Introduce non-linearity to the model, enabling it to learn
complex patterns.
• Examples: ReLU, Sigmoid, Tanh (described under Activation Functions in the hyperparameters discussion above).
• Backward Propagation
• Purpose: Updates weights and biases using gradients computed from
the loss function via the chain rule.
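A minimal NumPy sketch of one forward and backward pass for a single sigmoid neuron with a squared-error loss (all values illustrative), showing the chain rule producing the gradients used in the weight update:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1.0, 2.0])       # inputs
w = np.array([0.1, -0.3])      # weights
b = 0.05                       # bias
target = 1.0                   # desired output
lr = 0.1                       # learning rate

# Forward pass
z = np.dot(w, x) + b
a = sigmoid(z)
loss = 0.5 * (a - target) ** 2

# Backward pass (chain rule): dL/dw = dL/da * da/dz * dz/dw
dL_da = a - target
da_dz = a * (1.0 - a)          # derivative of the sigmoid
dz_dw = x
grad_w = dL_da * da_dz * dz_dw
grad_b = dL_da * da_dz

# Weight update: new weight = old weight - (learning rate * gradient)
w = w - lr * grad_w
b = b - lr * grad_b
print(loss, w, b)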