• What is Optimization in Deep Learning?
• Optimization is the process of adjusting the weights
and biases of a neural network so that the model learns
patterns in the data. The goal is to minimize the loss
function (the error between predicted and actual
values) using optimization algorithms.
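• To make this concrete, below is a minimal Python sketch of the basic idea: repeatedly step against the gradient to reduce the loss. The toy quadratic loss and the numbers are purely illustrative, not from the slides.
```python
# A minimal sketch: plain gradient descent on a toy quadratic loss
# L(w) = (w - 3)^2, whose minimum is at w = 3. All numbers are illustrative.
learning_rate = 0.1
w = 0.0  # initial weight

for step in range(50):
    grad = 2 * (w - 3)            # dL/dw
    w = w - learning_rate * grad  # step against the gradient to reduce the loss

print(w)  # approaches 3.0, the value that minimizes the loss
```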
• 1. Momentum
• 👉 Idea: Instead of moving only based on the current gradient, we
also keep some “memory” of the past updates. This helps to move
faster in the right direction and avoid zig-zagging.
• 📝 Example:
• Imagine you are rolling a ball down a hill. If the slope is steep, the ball
picks up speed and keeps moving due to momentum. Even if the hill
flattens a bit, the ball continues rolling forward.
• 💡 In training:
• Without momentum → model takes small, shaky steps.
• With momentum → model moves faster and smoother toward the
minimum.
• Example: Walking in a Crowded Market
• Imagine you are trying to walk through a crowded street to reach a shop at
the end (your goal = minimum loss).
• If you just look at people around you and take tiny steps to avoid bumping
(like plain SGD), you’ll move slowly, sometimes even zig-zagging.
• But if you keep walking in the same direction with some force (momentum),
even when small obstacles come, you don’t stop — you push through
smoothly and reach faster.
• 👉 Meaning in optimization:
• Momentum helps the model carry forward useful direction from past updates
so it doesn’t get stuck making small, shaky moves.
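• A minimal NumPy sketch of the classical momentum update, using an assumed elongated quadratic loss to mimic the zig-zag situation (the loss, constants, and variable names are illustrative, not from the slides):
```python
import numpy as np

# Classical momentum: v <- beta * v + grad;  w <- w - lr * v
def loss_grad(w):
    # Toy quadratic bowl L(w) = 0.5 * w^T A w, elongated to cause zig-zagging
    A = np.array([[10.0, 0.0], [0.0, 1.0]])
    return A @ w

w = np.array([2.0, 2.0])
v = np.zeros_like(w)
lr, beta = 0.02, 0.9

for step in range(200):
    g = loss_grad(w)
    v = beta * v + g   # accumulate a "memory" of past gradients
    w = w - lr * v     # step along the accumulated direction

print(w)  # converges toward the minimum at [0, 0] with fewer oscillations
```
• Setting beta = 0 recovers plain gradient descent; a larger beta keeps more “memory” of past updates.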
• 2. Nesterov Accelerated Gradient (NAG)
• 👉 Idea: Similar to momentum, but instead of calculating the gradient
at the current position, we look a little ahead in the direction of
momentum before updating.
• 📝 Example:
• You are running down a hill with momentum. Instead of looking at the
slope where you are, you look slightly ahead of you to see where you’re
going, and adjust your path in advance.
• 💡 In training:
• This avoids overshooting the minimum and helps faster convergence.
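• A minimal sketch of one common formulation of NAG on the same assumed toy loss; the only change from the momentum sketch above is that the gradient is evaluated at a look-ahead point:
```python
import numpy as np

def loss_grad(w):
    A = np.array([[10.0, 0.0], [0.0, 1.0]])  # same toy quadratic as before
    return A @ w

w = np.array([2.0, 2.0])
v = np.zeros_like(w)
lr, beta = 0.02, 0.9

for step in range(200):
    lookahead = w - lr * beta * v   # peek ahead in the momentum direction
    g = loss_grad(lookahead)        # gradient at the look-ahead point
    v = beta * v + g
    w = w - lr * v

print(w)  # approaches [0, 0], typically with less overshoot than plain momentum
```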
• 3. Adagrad (Adaptive Gradient)
• 👉 Idea: Adjusts the learning rate for each parameter individually. Parameters that don’t
change much (rarely or weakly updated) get a bigger learning rate; parameters that change a
lot get a smaller learning rate.
• Example:
• Suppose you’re preparing for two subjects:
• Subject A (Math) → you’re weak, so you need more study time (larger updates), i.e. a larger
learning rate.
• Subject B (English) → you’re already strong, so you need less revision (smaller updates), i.e. a
smaller learning rate.
• 💡 In training:
• Good for sparse data such as text or recommendation systems. But the learning rate keeps
shrinking over time, which can eventually stop learning.
• Imagine you are walking down a rocky mountain trail: you keep slowing down as the path gets
rougher, and once you have slowed down you never speed up again.
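• A minimal sketch of the Adagrad update on an assumed toy loss; the key point is the per-parameter accumulator of squared gradients in the denominator, which only ever grows:
```python
import numpy as np

def loss_grad(w):
    A = np.array([[10.0, 0.0], [0.0, 1.0]])  # illustrative toy quadratic
    return A @ w

w = np.array([2.0, 2.0])
accum = np.zeros_like(w)   # per-parameter sum of squared gradients
lr, eps = 0.5, 1e-8

for step in range(500):
    g = loss_grad(w)
    accum += g ** 2                       # keeps growing over training
    w -= lr * g / (np.sqrt(accum) + eps)  # frequently-updated params slow down

print(w)  # moves toward [0, 0]; note the steps shrink as accum grows
```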
• 4. RMSprop
• 👉 Idea: Fixes Adagrad’s problem of decreasing the learning rate too much
by keeping a moving average of past squared gradients.
• 📝 Example:
• Imagine you are walking down a rocky mountain trail. You don’t want to
slow down forever (like Adagrad). Instead, you keep adjusting your
speed depending on how rough the path is recently, not the whole
journey.
• 💡 In training:
• RMSprop works well for recurrent neural networks (RNNs) and non-
stationary problems.
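• A minimal sketch of the RMSprop update on the same assumed toy loss; the only change from the Adagrad sketch is that the accumulator becomes an exponentially decaying moving average:
```python
import numpy as np

def loss_grad(w):
    A = np.array([[10.0, 0.0], [0.0, 1.0]])  # illustrative toy quadratic
    return A @ w

w = np.array([2.0, 2.0])
sq_avg = np.zeros_like(w)
lr, rho, eps = 0.01, 0.9, 1e-8

for step in range(500):
    g = loss_grad(w)
    sq_avg = rho * sq_avg + (1 - rho) * g ** 2  # "recent roughness" only
    w -= lr * g / (np.sqrt(sq_avg) + eps)

print(w)  # hovers near [0, 0]; the effective step no longer shrinks to zero
```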
• 5. Adam (Adaptive Moment Estimation)
• 👉 Idea: Combines Momentum (moving average of gradients) + RMSprop
(moving average of squared gradients). It’s like getting the best of both
worlds.
• 📝 Example:
• Think of riding a bike downhill:
• You use momentum (speed from previous steps).
• You also adjust speed based on the roughness of the road (like RMSprop).
• Together, this helps you reach the destination faster and more safely.
• 💡 In training:
• Adam is one of the most widely used optimizers: it adapts learning rates per
parameter, uses momentum, and usually converges faster in practice.
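• A minimal sketch of the standard Adam update on the same assumed toy loss, combining a moving average of gradients (momentum part) with a moving average of squared gradients (RMSprop part), plus bias correction for the early steps:
```python
import numpy as np

def loss_grad(w):
    A = np.array([[10.0, 0.0], [0.0, 1.0]])  # illustrative toy quadratic
    return A @ w

w = np.array([2.0, 2.0])
m = np.zeros_like(w)   # first moment (momentum-style average of gradients)
v = np.zeros_like(w)   # second moment (RMSprop-style average of squared gradients)
lr, beta1, beta2, eps = 0.05, 0.9, 0.999, 1e-8

for t in range(1, 501):
    g = loss_grad(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction
    v_hat = v / (1 - beta2 ** t)
    w -= lr * m_hat / (np.sqrt(v_hat) + eps)

print(w)  # heads toward the minimum at [0, 0]
```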
• How Learning Differs from Pure Optimization
• 🔹 1. Introduction
• In machine learning, optimization and learning are closely related but not
the same.
• Optimization is about minimizing (or maximizing) an objective function,
usually the loss function.
• Learning goes beyond optimization, as it not only reduces training error but
also ensures that the model can handle new, unseen data (generalization).
• 2. Optimization
• Definition: Optimization is the process of finding the set
of parameters (weights and biases) that minimize the
loss function.
• Goal: Achieve the best fit for the training dataset.
• Limitation: If we only optimize, the model may overfit,
meaning it performs well on training data but poorly on
new data.
• 3. Learning
• Definition: Learning is the process of building models that
can generalize knowledge from training data to unseen
data.
• Goal: Achieve both low training error and low
testing/validation error.
• Key Aspect: Learning requires optimization but also
includes generalization techniques like regularization,
dropout, early stopping, etc.
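• A minimal sketch of early stopping, one of the generalization techniques mentioned above, on assumed toy regression data: the optimizer keeps lowering the training loss, but we keep the weights that did best on a held-out validation set, which is what learning cares about.
```python
import numpy as np

# Toy data: 10 features, only 2 of which actually matter (all assumed values).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 10))
true_w = np.zeros(10)
true_w[:2] = [2.0, -3.0]
y = X @ true_w + rng.normal(scale=1.0, size=60)

X_train, y_train = X[:40], y[:40]
X_val, y_val = X[40:], y[40:]

w = np.zeros(10)
lr, patience = 0.01, 20
best_val, best_w, bad_epochs = np.inf, w.copy(), 0

for epoch in range(2000):
    grad = X_train.T @ (X_train @ w - y_train) / len(y_train)
    w -= lr * grad                                # pure optimization step
    val_loss = np.mean((X_val @ w - y_val) ** 2)  # generalization check
    if val_loss < best_val:
        best_val, best_w, bad_epochs = val_loss, w.copy(), 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                # stop before overfitting further
            break

print(best_val)  # validation error of the weights we actually keep
```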
• 4. Detailed Example
🎓 Student Exam Example
• Pure Optimization (Memorization):
• A student memorizes answers from last year’s exam
papers.
• If the same questions appear, the student scores full
marks (perfect optimization).
• But if new or slightly different questions appear, the
student struggles because they didn’t actually
understand the subject.
• This is like a machine learning model that perfectly minimizes training error
but fails on test data (overfitting).
• Learning (Understanding Concepts):
• Another student studies concepts, practices different types of problems, and
understands the subject.
• They may not remember exact answers but can apply knowledge to solve new
questions.
• 👉 This is like a machine learning model that not only optimizes the training
loss but also learns general patterns that work on unseen data (generalization).
• HR Example: Employee Attrition Prediction
🔹 Scenario
• A company wants to use machine learning to predict employee attrition
(who is likely to leave the company).
• Pure Optimization
• The model is trained on past employee records (age, salary, job role,
years at company, performance rating, etc.).
• If we only optimize:
• The model learns patterns specific to past data (e.g., “employees under 30
with low salary left”).
• It achieves very high accuracy on the training dataset (minimum loss).
• Problem:
• When new employees join with different patterns (e.g.,
remote workers, new departments), the model fails.
• It overfits to the old HR data.
• 👉 This is pure optimization: the model is good at
memorizing past cases but not at handling new ones.
• Learning
• The model not only minimizes error on past data but also learns general
patterns about attrition.
• Example:
• It learns that low job satisfaction, limited growth opportunities, and poor
work-life balance contribute to attrition across different employee groups.
• Now, when a new employee with slightly different attributes comes in, the
model can still predict attrition correctly because it has understood the
broader relationships, not just memorized old data.
• 👉 This is learning: the system generalizes beyond past data and adapts
to new employee profiles.