
Module 3

Optimization for Training Deep Models: Empirical Risk Minimization, Challenges in Neural
Network Optimization, Basic Algorithms: Stochastic Gradient Descent, Parameter Initialization
Strategies,
Algorithms with Adaptive Learning Rates: The AdaGrad algorithm, The RMSProp algorithm,
Choosing the Right Optimization Algorithm.
Textbook 1: Chapter: 8.1-8.5

What Is Optimization in Deep Learning?


• In deep learning, optimization is like solving a puzzle where we adjust the pieces (model parameters)
to get the best result (make good predictions).
• Training a neural network is one of the hardest of these puzzles: doing it right can take a long time,
lots of data, and powerful computers.

What Are We Trying to Do?


• The main goal is to find the best settings (parameters) for the model so it performs well.
• We do this by minimizing something called the cost function, which measures how wrong the model
is. Smaller values mean the model is better at predicting.

How Is Deep Learning Different From Other Optimization?


• In regular optimization, you just focus on minimizing a function. Simple and direct.
• In deep learning, it's trickier because:
o What you really care about is how well the model performs on new (test) data.
o But you can only adjust the model using the training data you already have.
o So, we work on a "proxy" (an easier problem) and hope it helps with the real goal.

What Is a Cost Function?


• Imagine you want to teach the model to predict something, like if an email is spam or not.
• A cost function is a formula that measures how far off the predictions are from the actual answers.
o For example, if the model says an email is spam but it isn’t, the cost function gives it a penalty.
o The goal is to make these penalties as small as possible for all the training examples.
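
A toy version of such a cost function is binary cross-entropy (a minimal sketch assuming NumPy; the labels and predicted probabilities are illustrative):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # Clip predictions away from 0 and 1 to avoid log(0).
    y_pred = np.clip(y_pred, eps, 1 - eps)
    # Average penalty: confident wrong answers cost a lot,
    # confident right answers cost almost nothing.
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

y_true = np.array([1, 0, 1, 0])          # 1 = spam, 0 = not spam
y_pred = np.array([0.9, 0.2, 0.6, 0.4])  # model's predicted spam probabilities
print(binary_cross_entropy(y_true, y_pred))  # smaller = better predictions
```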

##MQP. Explain Empirical Risk Minimization.

Empirical Risk Minimization (ERM) is a method in machine learning where we try to train a model by
minimizing the average error (or loss) on the training data.

Breaking It Down:
1. What Is "Risk"?
o Risk is the expected error of the model when making predictions on new, unseen data. This
is what we ideally want to minimize.
2. Why Do We Use "Empirical" Risk?
o We don’t have access to all possible data (the true data-generating distribution, p_data).
o Instead, we work with a finite training set and calculate the average loss over this training
data. This is called empirical risk.
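
In symbols (following the usual textbook notation, where m is the number of training examples, L the per-example loss, and f(x; θ) the model's prediction):

```latex
\hat{R}(\theta) = \frac{1}{m} \sum_{i=1}^{m} L\!\left(f(x^{(i)}; \theta),\, y^{(i)}\right)
```

Minimizing this average over the training set stands in for minimizing the true risk, which is the same expectation taken under p_data.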
Why ERM Isn’t Perfect:
• Overfitting: If the model is too complex, it might just "memorize" the training data instead of learning
general patterns.
• No Guarantee for Generalization: Just because the model performs well on the training data doesn’t
mean it will work well on unseen data.

Why ERM Is Still Useful:


• It simplifies the machine learning problem by turning it into an optimization problem.
• We hope that by minimizing the error on the training set, the model also learns patterns that work well
on unseen data.

Benefits of ERM:
o Simple and Clear: It turns the problem of teaching a machine into a math problem.
o Practical: Easy to calculate using available training data.

Drawbacks of ERM:
o Overfitting: If the model is too complex, it might memorize the training data instead of learning
general patterns, which leads to poor performance on new data.
o Not Always Feasible: Some error measures, like "0-1 loss" (did the model get it right or wrong),
are hard to optimize directly because they provide no useful gradient.
o Doesn’t Focus on the Real Goal: The true goal is to do well on unseen data, but ERM focuses only
on the training data.

Why It’s Not Always Used in Deep Learning:


o In deep learning, we use advanced methods (like surrogate loss functions and regularization) to
overcome ERM’s problems, especially overfitting.

ERM is about teaching a machine by making it do well on training data. It’s useful, but it can lead to problems
like memorization (overfitting) and doesn’t always help the machine do well on new, unseen data.

##MQP. Explain the challenges that occur in neural network optimization in detail.


Identify and elaborate on all the key challenges in optimizing neural networks (NNs). (8M)
Key Challenges in Optimizing Neural Networks (NNs):
1. Ill-Conditioning:

• Neural network optimization often faces issues related to the condition number of the Hessian matrix,
which measures how strongly the curvature of the loss varies across different directions.
• If the Hessian matrix is poorly conditioned, gradient descent can become inefficient because the
gradient direction may lead to very small or very large steps.
• Impact: Ill-conditioning slows down the learning process significantly, even when there is a strong
gradient available.
• Example: In deep neural networks, certain directions in the parameter space may have steep curvature,
requiring the learning rate to be reduced to prevent overshooting.
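
For a Hessian H with largest and smallest eigenvalues λ_max and λ_min, the condition number is their ratio:

```latex
\kappa(H) = \frac{\lambda_{\max}}{\lambda_{\min}}
```

When κ(H) is large, a learning rate small enough to be safe along the steepest direction makes progress along the flattest direction extremely slow.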

2. Local Minima:

• Neural network loss functions are non-convex, meaning they have many local minima rather than a single
global minimum.
• Weight Space Symmetry:
o Neural networks often exhibit multiple equivalent local minima due to symmetry in the weight
space. For instance, swapping neurons in a layer does not change the output but creates new
minima.
• High-Cost Local Minima:
o While most local minima in deep networks have low costs, some high-cost local minima can
degrade performance.
• Impact: High-dimensional neural networks often require algorithms to bypass suboptimal local
minima efficiently.

3. Plateaus and Saddle Points:

• In high-dimensional spaces, saddle points are far more common than local minima or maxima.
• Saddle Points:
o These are areas where some directions lead uphill while others lead downhill. The gradient
near saddle points is close to zero, making optimization slow.
• Plateaus:
o Extended flat regions in the loss surface can significantly hinder the optimization process.
• Impact: Training may stagnate in these regions, and escaping them often requires advanced techniques
such as momentum or adaptive optimization methods.

4. Exploding and Vanishing Gradients:

• Vanishing Gradients:
o Gradients can diminish to near-zero when propagated backward through many layers,
especially in activation functions like sigmoid or tanh.
o Effect: This makes it difficult for earlier layers to learn, particularly in deep networks or
recurrent neural networks (RNNs).
• Exploding Gradients:
o Gradients can grow exponentially during backpropagation, causing unstable updates and
divergence in learning.
• Solutions:
o Using gradient clipping to handle exploding gradients.
o Employing advanced architectures like LSTM (Long Short-Term Memory) for RNNs to
mitigate vanishing gradients.
• Impact: Both issues make training deep networks challenging and require careful tuning of learning
rates and network initialization.
5. Cliffs in the Loss Surface:

• Loss functions for deep networks often contain steep, cliff-like regions caused by highly non-linear
transformations in parameter space.
• A single large update near a cliff can result in:
o Parameters moving too far off track.
o Loss of previously achieved optimization progress.
• Impact: This can destabilize training, especially for recurrent networks where such cliffs are more
common due to repeated multiplications over time steps.

6. Long-Term Dependencies:

• RNNs and other sequential models struggle with learning dependencies over long sequences due to
their deep computational graphs.
• Repeated operations over multiple time steps amplify issues such as:
o Vanishing gradients for long-term dependencies.
o Exploding gradients when the recurrent weight matrix has eigenvalues with magnitude greater than 1.
• Example: Predicting the outcome of a sentence based on words that appeared far back in the sequence.
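
The heart of the problem is repeated multiplication by the same recurrent weight matrix W. Assuming W is diagonalizable with eigendecomposition W = QΛQ⁻¹, after t time steps:

```latex
W^{t} = \left(Q \Lambda Q^{-1}\right)^{t} = Q \Lambda^{t} Q^{-1}
```

Eigenvalues with magnitude below 1 decay toward zero (vanishing gradients), while those above 1 grow exponentially (exploding gradients).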

7. Inexact Gradients:

• Gradients are usually estimated using minibatches instead of the entire dataset to reduce computation
time.
• Problem:
o These minibatch estimates introduce noise and variance in the optimization process.
o The stochastic nature of gradient updates can lead to instability or slow convergence.
• Impact: Larger batch sizes reduce noise but require more memory, while smaller batch sizes add
regularization but slow convergence.

8. Poor Correspondence Between Local and Global Structure:

• In some cases, the gradient's local direction does not lead toward the global minimum.
• Problem:
o Optimization trajectories can be inefficient and take long, winding paths around obstacles in
the loss surface.
• Example: The cost function may have no clear global minimum, instead asymptoting toward ever-lower
values. This makes it hard for the optimizer to navigate effectively.

9. Theoretical Limits of Optimization:

• Certain mathematical results suggest inherent limits in optimization:


o Intractable problems: Some problems are computationally too complex to solve in reasonable
time.
o Network Size Trade-offs: Smaller networks may not converge well, while larger networks
increase computational demand.
• Impact: Theoretical results often do not apply directly to practical scenarios but highlight the
fundamental difficulty of finding optimal solutions.
10. Initialization Sensitivity:

• Poor Initialization:
o If network weights are not initialized properly, training can fail to converge or get stuck in poor
local minima.
• Strategies:
o Random initialization from a uniform or Gaussian distribution.
o Heuristics like Xavier or He initialization to scale weights based on layer size.
• Impact: Good initialization can significantly improve convergence speed and model performance.

Practical Implications and Solutions:


To address these challenges, practitioners use:
• Advanced Optimizers:
o Adaptive methods like Adam or RMSProp to adjust learning rates dynamically.
o Momentum-based methods to overcome plateaus and saddle points.
• Gradient Clipping: Prevents exploding gradients by capping their magnitude.
• Batch Normalization: Mitigates vanishing gradients by normalizing layer inputs.
• Skip Connections: Used in ResNets to avoid vanishing gradients by introducing identity mappings.
• Careful Regularization: Techniques like dropout prevent overfitting and improve generalization.
• Good Initialization: Strategies like Xavier or orthogonal initialization set the foundation for smoother
training.

These solutions aim to make neural network training more stable and efficient, even for complex, high-
dimensional problems.
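
To make one of these solutions concrete, here is a minimal sketch of gradient clipping by global norm (assuming NumPy; the function name and the max_norm threshold are illustrative):

```python
import numpy as np

def clip_gradients(grads, max_norm=5.0):
    # Compute the global L2 norm across all gradient arrays.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # If the norm exceeds the cap, rescale every gradient by the same factor.
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads
```

Capping the magnitude this way leaves the gradient's direction unchanged, which is why it helps near cliffs without distorting the update.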

####q. Describe the Stochastic Gradient Descent (SGD) algorithm. How does it work,
and what are its advantages and disadvantages?
Stochastic Gradient Descent (SGD)
What is SGD?
• Stochastic Gradient Descent (SGD) is a popular algorithm used to train machine learning models,
especially for deep learning.
• Goal: To reduce the model’s error (also called loss) by adjusting its parameters step by step in the
right direction.
Example of the Algorithm in Action

1. Start with random values for model parameters.


2. Pick 10 random data points (minibatch) from your training set.
3. Calculate how wrong the model is for those 10 points (the loss).
4. Update the model’s parameters slightly to make it less wrong.
5. Repeat this process with other minibatches until the error is small or you’ve trained for enough steps (see the sketch below).
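
A minimal sketch of these five steps on a toy linear-regression problem (assuming NumPy; the data, learning rate, and batch size are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                  # toy inputs
true_w = rng.normal(size=5)
y = X @ true_w + 0.1 * rng.normal(size=1000)    # toy targets

w = np.zeros(5)                                 # step 1: starting parameters
lr, batch_size = 0.1, 10

for step in range(500):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # step 2: pick a minibatch
    Xb, yb = X[idx], y[idx]
    error = Xb @ w - yb                         # step 3: how wrong the model is
    grad = Xb.T @ error / batch_size            # gradient of the mean squared error
    w -= lr * grad                              # step 4: nudge the parameters downhill
# step 5: the loop repeats with fresh minibatches until the error is small
```

The key update is w ← w − ε·g, where g is the minibatch gradient estimate and ε the learning rate.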

Advantages of SGD
1. Fast for Large Datasets:
o Instead of processing the entire dataset, SGD uses small chunks (minibatches), making it
faster and efficient for very large datasets.
2. Quick Initial Progress:
o SGD often reduces the error significantly in the early stages of training, even with a few
updates.
3. Helps Escape Local Minima:
o The randomness in selecting minibatches allows SGD to avoid shallow local minima and find
better solutions.
4. Scalable:
o SGD works well even for models with millions of parameters and datasets with millions of
examples.

Disadvantages of SGD
1. Noisy Updates:
o Since it uses random minibatches, the updates can fluctuate, making the training process a bit
unstable.
2. Sensitive to Learning Rate:
o Choosing the right learning rate is tricky:
§ Too high: The error may oscillate or even increase.
§ Too low: The model may take forever to learn.
3. Slower Convergence:
o While SGD is fast at the beginning, it becomes slower as it gets closer to the best solution.
4. Requires Tuning:
o Hyperparameters like learning rate, minibatch size, and decay schedule must be carefully
tuned for good performance.

Summary of Advantages and Disadvantages


Advantages | Disadvantages
Faster and efficient for large datasets | Noisy updates can cause instability
Makes quick progress early | Sensitive to learning rate
Escapes shallow local minima | Slower as it nears the best solution
Scalable for big models | Requires tuning of hyperparameters

Why Use SGD?


Think of SGD as walking downhill to reach the bottom (minimum loss). Instead of looking at the entire
landscape (all data points), you use a flashlight to see a small part (minibatch) and decide your next step
based on that. It’s fast and works well for big problems, but you need to pick the right step size (learning
rate) to avoid stumbling.

###q. Discuss different parameter initialization strategies for neural networks. Why is
proper initialization important?
Parameter Initialization Strategies for Neural Networks

Why is Proper Initialization Important?

• Neural network training starts with an initial point, which significantly affects:
o Whether the optimization converges at all.
o The speed of convergence.
o The quality of the final solution in terms of cost and generalization.
• Poor initialization can lead to:
o Numerical instability and failure to converge.
o Stagnation in optimization or convergence to suboptimal solutions.

Key Initialization Strategies:

1. Random Initialization:

• Weights are set randomly (e.g., using Gaussian or uniform distributions).


• Why? To ensure that neurons in the network start off differently (this is called breaking symmetry).
• Challenge: If weights are too large, gradients can explode. If they’re too small, gradients can vanish.
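
2. Xavier and He Initialization:

o Weights are drawn randomly but scaled by the layer sizes: Xavier (Glorot) initialization uses variance 2/(n_in + n_out), while He initialization uses variance 2/n_in and suits ReLU layers.
o Why? Keeps the variance of activations and gradients roughly constant from layer to layer.

A minimal sketch of both schemes (assuming NumPy; fan_in and fan_out are the layer's input and output sizes):

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng=None):
    # Glorot/Xavier: variance 2 / (fan_in + fan_out)
    if rng is None:
        rng = np.random.default_rng()
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out, rng=None):
    # He: variance 2 / fan_in, suited to ReLU activations
    if rng is None:
        rng = np.random.default_rng()
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))
```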

3. Orthogonal Initialization:

o Weights are set to form an orthogonal matrix (rows and columns are mutually perpendicular unit
vectors), optionally with a scaling factor.
o Why? Preserves the size of activations and gradients across layers.

4. Sparse Initialization:

o Only a subset of the weights are initialized to non-zero values; the rest start at zero.


o Why? To ensure diversity among neurons while keeping the total input manageable.

5. Bias Initialization:

o Biases are usually set to 0.


o For specific cases:
§ ReLU Layers: Biases can be set slightly above zero (e.g., 0.1) to prevent early
saturation.
§ Output Layers: Set biases to match the marginal statistics of the outputs (e.g., for a
softmax classifier, the log of the class frequencies).

6. Pretrained Initialization:
o Start with weights from a previously trained model (e.g., transfer learning).
o Why? Speeds up training and often leads to better results, especially for similar tasks.

7. Special Cases for Recurrent Neural Networks (RNNs):

o For LSTMs, initialize the forget gate bias to 1 to ensure better handling of long-term
dependencies.

Why Random Initialization is Popular:


• Simple and Effective: Random initialization works well because it’s computationally cheap and
ensures neurons start differently.
• Balances Signal Flow: It prevents the network from losing or amplifying signals as they move
forward or backward.

Key Challenges with Initialization:


1. Exploding or Vanishing Gradients:
o Large weights → Gradients grow too big (exploding).
o Small weights → Gradients shrink to near zero (vanishing).
o Solution: Use scaled initialization methods (e.g., Xavier, He).
2. Layer Size Matters:
o Larger layers need smaller weights to avoid instability.
3. Generalization vs. Optimization:
o Some initializations improve learning speed but hurt the model’s ability to generalize to new
data.

Practical Tips for Initialization:


• Test and Adjust:
o Check if activations and gradients are reasonable after initialization. Adjust weights if signals
shrink or explode in early layers.
• Treat as Hyperparameter:
o Treat the weight scale as something to tune during training.
• Use Standard Methods:
o For most networks, methods like Xavier or He initialization work well.

By starting with the right initialization, training becomes faster, smoother, and more likely to succeed.

###MQP. Explain AdaGrad and write an algorithm of AdaGrad.

AdaGrad (Adaptive Gradient Algorithm) is an optimization technique used in machine learning and deep
learning to improve the training process. It adjusts the learning rate for each parameter in the model, allowing
the learning rate to adapt based on the parameter's updates over time.

Key Features of AdaGrad:


1. Learning Rate Adjustment:
o Instead of using a fixed learning rate for all parameters, AdaGrad adjusts the learning rate for
each parameter based on how frequently it is updated.
o Parameters that accumulate large gradients get smaller learning rates, while parameters with
small accumulated gradients keep higher learning rates.
2. Works for Sparse Data:
o AdaGrad is particularly effective for tasks with sparse data, like natural language processing
or text classification.
3. Limitations:
o Over time, the learning rate for frequently updated parameters can become too small, which
might slow down training or stop it entirely.

How Does AdaGrad Work? (Step-by-Step)


1. Start Training:
o Begin with the model's parameters (weights and biases) and a learning rate (ϵ).
2. Track Progress:
o Keep a running total of how much each parameter has been updated (sum of squared gradients).
3. Adjust the Learning Rate:
o For parameters that change a lot (frequent updates), reduce their learning rate.
o For parameters that don't change much, keep their learning rate higher.
4. Update the Parameters:
o Use the adjusted learning rates to make changes to the parameters and improve the model's
predictions.
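
A minimal sketch of one AdaGrad update following the steps above (assuming NumPy; ϵ is the global learning rate and δ a small constant for numerical stability):

```python
import numpy as np

def adagrad_update(theta, grad, accum, lr=0.01, delta=1e-7):
    # Step 2: accumulate the sum of squared gradients seen so far.
    accum = accum + grad ** 2
    # Steps 3-4: parameters with large accumulated gradients get smaller
    # effective learning rates; rarely-updated ones keep larger rates.
    theta = theta - lr * grad / (delta + np.sqrt(accum))
    return theta, accum
```

Because accum only ever grows, the effective learning rate shrinks monotonically, which is exactly the limitation RMSProp later addresses.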

Advantages of AdaGrad:
1. Adaptive Learning Rate:
o Adjusts learning rates for individual parameters, which is especially useful when parameters
have different importance.
2. Effective for Sparse Data:
o Performs well on sparse features or data.
3. No Manual Learning Rate Tuning:
o Reduces the need for tuning the learning rate during training.

Disadvantages of AdaGrad:
1. Learning Rate Decay:
o Over time, the learning rate can become too small, slowing or stopping training.
2. Not Always Suitable for Deep Learning:
o The excessive decrease in learning rate can make it less effective for deep neural networks.

AdaGrad helps your model learn faster by giving smaller updates to parameters that change a lot
and larger updates to parameters that change less. However, it can slow down over time as the learning rate
decreases too much.

In Simple Words
• Imagine you're a teacher adjusting your teaching style for each student:
o If a student is learning quickly (parameter updated a lot), you slow down and guide them
less (smaller learning rate).
o If a student is struggling (parameter updated less), you spend more time with them (larger
learning rate).

This is what AdaGrad does for model parameters—it focuses on improving areas that need more attention
while stepping back from areas that are already learning well.

###MQP. Explain Adam algorithm in detail.

Adam stands for Adaptive Moment Estimation. It’s an optimization algorithm used to train machine
learning models, especially deep learning. Adam combines the best features of two other
methods: Momentum (which smoothens updates) and RMSProp (which adjusts learning rates for each
parameter).

Key Features of Adam:


1. Momentum:
o Adam uses momentum to keep track of past gradients (changes in the loss function), helping the
model make smoother updates.
2. Adaptive Learning Rate:
o It adjusts the learning rate for each parameter individually, so frequently updated parameters get
smaller steps, and less frequently updated ones get larger steps.
3. Bias Correction:
o Early in training, Adam corrects for biases that arise because it starts with zero values for
tracking gradients. This ensures better updates in the initial stages.

The Adam Algorithm:
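A minimal sketch of one Adam update (assuming NumPy; the defaults β1 = 0.9, β2 = 0.999, ε = 1e-8 follow common practice):

```python
import numpy as np

def adam_update(theta, grad, m, v, t, lr=0.001,
                beta1=0.9, beta2=0.999, eps=1e-8):
    # Momentum: exponentially decaying first-moment (mean) estimate.
    m = beta1 * m + (1 - beta1) * grad
    # RMSProp-style second-moment (uncentered variance) estimate.
    v = beta2 * v + (1 - beta2) * grad ** 2
    # Bias correction: both estimates start at zero, so early values
    # are scaled up (t is the iteration count, starting at 1).
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```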
In Simple Words:
Adam is like a smart guide that:
• Uses past updates to smooth training (momentum).
• Adjusts step sizes for each parameter based on how frequently they change.
• Corrects mistakes in the beginning to ensure steady progress.

It’s fast, adaptive, and widely used because it performs well on a variety of machine learning tasks.

Advantages of Adam
1. Adaptive Learning Rates: Automatically adjusts the learning rate for each parameter, making it easier to
train models without manual tuning.
2. Efficient and Fast: Works well with large datasets and complex models.
3. Smoother Updates: Momentum helps smooth out noisy updates for stable training.
4. Bias Correction: Corrects biases in the initial stages of training for better updates.

Disadvantages of Adam
1. High Memory Usage: Requires storing additional values (momentum and scaling) for each parameter,
increasing memory demands.
2. Learning Rate Sensitivity: Although adaptive, the global learning rate (α) might still need fine-tuning
for some tasks.
3. Suboptimal Convergence: Adam can converge to suboptimal solutions (not the best minimum) in certain
cases.
4. Slower in Some Scenarios: May be slower compared to simpler optimizers like SGD in tasks with simple
loss landscapes.

###SIMP. Explain the RMSProp algorithm. How does it address the limitations of
AdaGrad?

RMSProp is an algorithm used to train machine learning models. It adjusts the learning rate for each
parameter during training, making learning faster and smoother. RMSProp improves on AdaGrad, which
tends to slow down too much as training progresses.

How Does RMSProp Work?


1. Start with Initial Settings:
o Begin with a learning rate (ϵ), a decay rate (ρ, typically 0.9), and a small constant (δ) to avoid
dividing by zero.
2. Track Gradients:
o RMSProp keeps a running average of the recent squared gradients for each parameter.
o This helps the algorithm focus on recent updates and ignore very old ones.
3. Adjust the Learning Rate:
o Parameters that change a lot (large gradients) get smaller updates.
o Parameters that don’t change much (small gradients) get larger updates.
4. Update Parameters:
o RMSProp adjusts the model's parameters step by step using the scaled gradients.
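
A minimal sketch of one RMSProp update following the steps above (assuming NumPy; ρ = 0.9 and δ = 1e-6 are typical values):

```python
import numpy as np

def rmsprop_update(theta, grad, avg_sq, lr=0.001, rho=0.9, delta=1e-6):
    # Step 2: exponentially decaying average of squared gradients,
    # so old gradients are gradually "forgotten" (unlike AdaGrad's sum).
    avg_sq = rho * avg_sq + (1 - rho) * grad ** 2
    # Steps 3-4: scale each parameter's step by its recent gradient history.
    theta = theta - lr * grad / np.sqrt(delta + avg_sq)
    return theta, avg_sq
```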

How RMSProp Fixes AdaGrad's Problem:


• AdaGrad's Issue:
o AdaGrad remembers all past gradients, which makes the learning rate shrink too much over time.
This slows down training or stops it completely.
• RMSProp's Solution:
o RMSProp "forgets" old gradients by focusing only on recent ones. This keeps the learning rate
stable and prevents it from becoming too small.

Why Use RMSProp?


1. Keeps Learning Rates Stable:
o It prevents learning from slowing down too much like it does in AdaGrad.
2. Adapts to Each Parameter:
o Each parameter gets its own learning rate based on how frequently it changes.
3. Works for Deep Learning:
o RMSProp is great for problems like neural networks where the loss function is complex and
constantly changing.

Advantages:
• Fixes AdaGrad's Limitations: RMSProp avoids shrinking learning rates by forgetting old gradients.
• Efficient for Deep Learning: Handles complex, non-convex problems (like neural networks) very well.

Disadvantages:
• Requires Tuning: You still need to fine-tune the learning rate (ϵ) and decay rate (ρ) for best
performance.
• Memory Usage: Needs additional memory to store the moving average of squared gradients.

Simple Analogy:
Imagine climbing a hill:
• AdaGrad keeps track of every step you’ve taken, so it slows down too much as you go.
• RMSProp only cares about your recent steps, so it adjusts your pace intelligently to help you keep
moving forward efficiently.
