Optimization in Deep Learning
• What is Optimization in Deep Learning?
• Optimization is the process of adjusting the weights
and biases of a neural network so that the model learns
patterns in the data. The goal is to minimize the loss
function (the error between predicted and actual
values) using optimization algorithms.
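• 🧪 The basic update rule can be sketched in a few lines of runnable code. This is plain gradient descent on a toy one-dimensional loss; the loss function, starting weight, and learning rate are illustrative choices, not from the slides:

```python
# Minimal gradient descent sketch: minimize the toy loss L(w) = (w - 3)**2,
# whose gradient is dL/dw = 2 * (w - 3). The minimum is at w = 3.

def grad(w):
    return 2 * (w - 3)

w = 0.0      # initial weight (illustrative)
lr = 0.1     # learning rate (illustrative)
for _ in range(100):
    w -= lr * grad(w)   # step against the gradient to reduce the loss

print(round(w, 4))  # converges to 3.0
```

• Every optimizer below is a variation on this same loop: what changes is how the step is computed from the gradient.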
• 1. Momentum
• 👉 Idea: Instead of moving only based on the current gradient, we
also keep some “memory” of past updates. This helps the model move
faster in the right direction and avoid zig-zagging.
• 📝 Example:
• Imagine you are rolling a ball down a hill. If the slope is steep, the ball
picks up speed and keeps moving due to momentum. Even if the hill
flattens a bit, the ball continues rolling forward.
• 💡 In training:
• Without momentum → the model takes small, shaky steps.
• With momentum → the model moves faster and more smoothly toward the
minimum.
• Example: Walking in a Crowded Market
• Imagine you are trying to walk through a crowded street to reach a shop at
the end (your goal = minimum loss).
• If you just look at people around you and take tiny steps to avoid bumping
(like plain SGD), you’ll move slowly, sometimes even zig-zagging.
• But if you keep walking in the same direction with some force (momentum),
even when small obstacles come, you don’t stop — you push through
smoothly and reach faster.
• 👉 Meaning in optimization:
• Momentum helps the model carry forward useful direction from past updates
so it doesn’t get stuck making small, shaky moves.
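• 🧪 The “memory of past updates” idea can be sketched on the same toy loss L(w) = (w - 3)**2. The velocity term v accumulates a decaying average of past gradients; beta and the other values are illustrative, not prescribed by the slides:

```python
# Momentum sketch: v carries a decaying "memory" of past gradients,
# so consistent directions build up speed and zig-zags cancel out.

def grad(w):
    return 2 * (w - 3)   # toy loss L(w) = (w - 3)**2

w, v = 0.0, 0.0
lr, beta = 0.1, 0.9      # beta = how much past velocity is kept
for _ in range(500):
    v = beta * v + grad(w)   # past updates + current gradient
    w -= lr * v

print(round(w, 4))
```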
2. Nesterov Accelerated Gradient (NAG)
• 👉 Idea: Similar to momentum, but instead of calculating the gradient
at the current position, we look a little ahead in the direction of
momentum before updating.
📝 Example:
• You are running down a hill with momentum. Instead of looking at the
slope where you are, you look slightly ahead of you to see where you’re
going, and adjust your path in advance.
• 💡 In training:
• This avoids overshooting the minimum and helps faster convergence.
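• 🧪 The “look ahead” step can be sketched by evaluating the gradient at the point where momentum is about to carry us, rather than at the current position. Same toy loss as before; all values are illustrative:

```python
# Nesterov sketch: the gradient is computed at the look-ahead point
# (w - lr * beta * v) instead of at w itself, so the update can
# correct course before overshooting.

def grad(w):
    return 2 * (w - 3)   # toy loss L(w) = (w - 3)**2

w, v = 0.0, 0.0
lr, beta = 0.1, 0.9
for _ in range(500):
    lookahead = w - lr * beta * v   # peek ahead along the momentum direction
    v = beta * v + grad(lookahead)
    w -= lr * v

print(round(w, 4))
```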
• 3. Adagrad (Adaptive Gradient)
• 👉 Idea: Adapts the learning rate for each parameter individually. Parameters
that have received few or small updates keep a larger effective learning rate;
parameters that are updated frequently get a smaller one.
Example:
• Suppose you’re preparing for two subjects:
• Subject A (Math) → you’re weak, so you need more study time (larger updates): a larger
learning rate.
• Subject B (English) → you’re already strong, so you need less revision (smaller updates): a
smaller learning rate.
• 💡 In training:
• Good for sparse data such as text or recommender systems. However, the accumulated
squared gradients only grow, so the learning rate keeps shrinking over time and learning
may eventually stall.
• Imagine walking down a rocky mountain trail: you slow down according to how rough
the path has been so far, and you never speed back up.
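• 🧪 The per-parameter adaptation can be sketched with a two-parameter toy loss. Each parameter keeps its own running sum of squared gradients and divides the learning rate by its square root; the base learning rate and loss are illustrative:

```python
# Adagrad sketch: each parameter accumulates its own squared gradients,
# so parameters with a large gradient history take smaller steps.

import math

def grads(w):
    # toy loss L(w) = (w[0] - 3)**2 + (w[1] - 3)**2
    return [2 * (w[0] - 3), 2 * (w[1] - 3)]

w = [0.0, 0.0]
cache = [0.0, 0.0]     # accumulated squared gradients, one per parameter
lr, eps = 1.0, 1e-8    # Adagrad tolerates a larger base lr (illustrative)
for _ in range(2000):
    g = grads(w)
    for i in range(2):
        cache[i] += g[i] ** 2                          # history only grows
        w[i] -= lr * g[i] / (math.sqrt(cache[i]) + eps)

print([round(x, 3) for x in w])
```

• Note that `cache` never shrinks, which is exactly the weakness RMSprop fixes next.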
• 4. RMSprop
• 👉 Idea: Fixes Adagrad’s problem of decreasing learning rate too much
by keeping a moving average of past squared gradients.
• 📝 Example:
• Imagine you are walking down a rocky mountain trail. You don’t want to
slow down forever (like Adagrad). Instead, you keep adjusting your
speed depending on how rough the path is recently, not the whole
journey.
• 💡 In training:
• RMSprop works well for recurrent neural networks (RNNs) and
non-stationary problems.
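• 🧪 The fix is a one-line change to the Adagrad sketch: replace the ever-growing sum with an exponential moving average, so old history decays. Hyperparameter values are illustrative:

```python
# RMSprop sketch: the squared-gradient cache is a moving average,
# so the step size reflects recent roughness, not the whole journey.

import math

def grad(w):
    return 2 * (w - 3)   # toy loss L(w) = (w - 3)**2

w, cache = 0.0, 0.0
lr, decay, eps = 0.01, 0.9, 1e-8
for _ in range(2000):
    g = grad(w)
    cache = decay * cache + (1 - decay) * g ** 2   # old history decays away
    w -= lr * g / (math.sqrt(cache) + eps)

print(round(w, 3))
```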
• 5. Adam (Adaptive Moment Estimation)
• 👉 Idea: Combines Momentum (moving average of gradients) + RMSprop
(moving average of squared gradients). It’s like getting the best of both
worlds.
• 📝 Example:
• Think of riding a bike downhill:
• You use momentum (speed from previous steps).
• You also adjust speed based on the roughness of the road (like RMSprop).
• Together, this helps you reach the destination faster and more safely.
• 💡 In training:
• Adam is the most popular optimizer—it adapts learning rates, uses
momentum, and converges faster in practice.
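• 🧪 The combination can be sketched directly: a first moment m (momentum-style average of gradients), a second moment v (RMSprop-style average of squared gradients), and a bias correction for the early steps. The hyperparameters below are common defaults, used here illustratively on the same toy loss:

```python
# Adam sketch: momentum (m) + adaptive scaling (v) + bias correction.

import math

def grad(w):
    return 2 * (w - 3)   # toy loss L(w) = (w - 3)**2

w, m, v = 0.0, 0.0, 0.0
lr, b1, b2, eps = 0.05, 0.9, 0.999, 1e-8
for t in range(1, 2001):
    g = grad(w)
    m = b1 * m + (1 - b1) * g          # moving average of gradients
    v = b2 * v + (1 - b2) * g ** 2     # moving average of squared gradients
    m_hat = m / (1 - b1 ** t)          # bias correction: m and v start at 0,
    v_hat = v / (1 - b2 ** t)          # so early averages are rescaled up
    w -= lr * m_hat / (math.sqrt(v_hat) + eps)

print(round(w, 2))
```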
How Learning Differs from Pure Optimization
• 🔹 1. Introduction
• In machine learning, optimization and learning are closely related but not
the same.
• Optimization is about minimizing (or maximizing) an objective function,
usually the loss function.
• Learning goes beyond optimization, as it not only reduces training error but
also ensures that the model can handle new, unseen data (generalization).
• 2. Optimization
• Definition: Optimization is the process of finding the set
of parameters (weights and biases) that minimize the
loss function.
• Goal: Achieve the best fit for the training dataset.
• Limitation: If we only optimize, the model may overfit,
meaning it performs well on training data but poorly on
new data.
3. Learning
• Definition: Learning is the process of building models that
can generalize knowledge from training data to unseen
data.
• Goal: Achieve both low training error and low
testing/validation error.
• Key Aspect: Learning requires optimization but also
includes generalization techniques like regularization,
dropout, early stopping, etc.
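• 🧪 One of the generalization techniques listed, early stopping, can be sketched with made-up loss numbers that mimic a model starting to overfit (training loss keeps falling while validation loss turns back up):

```python
# Early-stopping sketch: stop training when validation loss stops
# improving for `patience` epochs. Loss values are invented for
# illustration, not measured from a real model.

train_losses = [0.9, 0.6, 0.4, 0.3, 0.22, 0.17, 0.13, 0.10, 0.08, 0.06]
val_losses   = [1.0, 0.7, 0.5, 0.42, 0.40, 0.41, 0.45, 0.52, 0.60, 0.70]

patience, best, wait, stop_epoch = 2, float("inf"), 0, None
for epoch, vl in enumerate(val_losses):
    if vl < best:
        best, wait = vl, 0        # validation improved: reset patience
    else:
        wait += 1                 # no improvement this epoch
        if wait >= patience:
            stop_epoch = epoch    # pure optimization would keep going
            break

print(stop_epoch, best)  # prints: 6 0.4
```

• Pure optimization would follow `train_losses` all the way down; learning stops at the point where the model generalizes best.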
• 4. Detailed Example
🎓 Student Exam Example
• Pure Optimization (Memorization):
• A student memorizes answers from last year’s exam
papers.
• If the same questions appear, the student scores full
marks (perfect optimization).
• But if new or slightly different questions appear, the
student struggles because they didn’t actually
understand the subject.
• This is like a machine learning model that perfectly minimizes training error
but fails on test data (overfitting).
• Learning (Understanding Concepts):
• Another student studies concepts, practices different types of problems, and
understands the subject.
• They may not remember exact answers but can apply knowledge to solve new
questions.
• 👉 This is like a machine learning model that not only optimizes the training
loss but also learns general patterns that work on unseen data (generalization).
• HR Example: Employee Attrition Prediction
🔹 Scenario
• A company wants to use machine learning to predict employee attrition
(who is likely to leave the company).
• Pure Optimization
• The model is trained on past employee records (age, salary, job role,
years at company, performance rating, etc.).
• If we only optimize:
• The model learns patterns specific to past data (e.g., “employees under 30
with low salary left”).
• It achieves very high accuracy on the training dataset (minimum loss).
• Problem:
• When new employees join with different patterns (e.g.,
remote workers, new departments), the model fails.
• It overfits to the old HR data.
• 👉 This is pure optimization: the model is good at
memorizing past cases but not at handling new ones.
• Learning
• The model not only minimizes error on past data but also learns general
patterns about attrition.
• Example:
• It learns that low job satisfaction, limited growth opportunities, and poor
work-life balance contribute to attrition across different employee groups.
• Now, when a new employee with slightly different attributes comes in, the
model can still predict attrition correctly because it has understood the
broader relationships, not just memorized old data.
• 👉 This is learning: the system generalizes beyond past data and adapts
to new employee profiles.

unit 2 chapter 1 optimization in deep learning
