Lecture 04 - Optimization - 4p
“Thinking is the hardest work there is, which is probably the reason why so few engage in it.”
-Henry Ford
CSE555
Deep Learning
Spring 2025
Optimization
Prepared by Dr. Erdogan Sevilgen & Dr. Yakup Genc
Optimization (these slides are a summarized version of the draft book by Bengio et al.)
© 2016-2025 Yakup Genç & Y. Sinan Akgul
• The expectation is taken across the data generating distribution rather than just over the finite training set
• However, we do not know the true distribution, so the following empirical approximation is used instead, where m is the number of training examples:
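The expressions referenced here did not survive the slide extraction; these are presumably the standard risk and empirical-risk definitions from the Goodfellow et al. text that the slides summarize:

J^*(\theta) = \mathbb{E}_{(x,y) \sim p_{\text{data}}} \, L(f(x;\theta), y)

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)};\theta), y^{(i)})

where L is the per-example loss, f(x;\theta) is the model prediction, and the data generating distribution p_data in the first line is replaced by the empirical distribution over the m training examples in the second.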
Cross-Entropy
[Figure: worked cross-entropy example with a one-hot target (0.0, 0.0, 0.0, 1.0) and corresponding per-class loss values (7.4186, 2.8201, 4.7915, 0.0710)]
https://gombru.github.io/2018/05/23/cross_entropy_loss/
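A minimal Python sketch of the computation behind such an example, assuming the per-class values above are -log of the softmax probabilities (they exponentiate to values summing to roughly 1); with a one-hot target, only the target class contributes to the loss:

import numpy as np

def cross_entropy(target_one_hot, probs):
    # Cross-entropy between a one-hot target and predicted class probabilities
    eps = 1e-12                              # guard against log(0)
    return -np.sum(target_one_hot * np.log(probs + eps))

# Probabilities implied by the per-class -log values in the example above
probs = np.exp(-np.array([7.4186, 2.8201, 4.7915, 0.0710]))
target = np.array([0.0, 0.0, 0.0, 1.0])
print(cross_entropy(target, probs))          # about 0.0710: only the target class term survives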
Halting optimization:
• Training algorithms:
– do not continue until a local minimum,
– halt while the surrogate loss function still has large derivatives
• E.g., halt when a convergence criterion based on early stopping is satisfied
– The early stopping criterion is usually based on the true underlying loss function, such as 0-1 loss measured on a validation set, and is designed to cause the algorithm to halt whenever overfitting begins to occur

• The objective function usually decomposes as a sum over the training examples
• Updates to the parameters are based on an expected cost computed using only a subset of the terms of the full cost function
• Ex: maximum likelihood estimation problems
– Maximizing expectations over the training set
– Gradient property (see the expression below)
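The "gradient property" expression did not survive extraction; for maximum likelihood problems it is presumably the standard identity that the gradient of the expected log-likelihood is the expectation of per-example gradients:

\nabla_{\theta} J(\theta) = \mathbb{E}_{x,y \sim \hat{p}_{\text{data}}} \, \nabla_{\theta} \log p_{\text{model}}(x, y; \theta)

which is what justifies estimating the gradient from a subset (minibatch) of the training examples.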
[Figure: plots of the cost J(Θ) versus Θ for the ideal cost, for individual training instances (Instance1, Instance2, Instance3), and for the batch cost]
It is also crucial that the minibatches be selected randomly.
• To compute an unbiased estimate of the gradient, samples should be independent.
• For two subsequent gradient estimates to be independent from each other, two subsequent minibatches should also be independent from each other.
• Many datasets are most naturally arranged in a way where successive examples are highly correlated.
– It is necessary to shuffle the examples before selecting minibatches.

• An interesting motivation for minibatch stochastic gradient descent is that it follows the gradient of the true generalization error so long as no examples are repeated.
• Suppose both x and y are discrete. In this case, the generalization error can be written as a sum,
• and the exact gradient has the same form (see the expressions below).
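The two expressions referenced above did not survive extraction; in the discrete case the generalization error and its exact gradient are presumably written, as in the Goodfellow et al. text, as

J^*(\theta) = \sum_{x} \sum_{y} p_{\text{data}}(x, y) \, L(f(x;\theta), y)

g = \nabla_{\theta} J^*(\theta) = \sum_{x} \sum_{y} p_{\text{data}}(x, y) \, \nabla_{\theta} L(f(x;\theta), y)

so a minibatch of examples drawn from p_data, with no example repeated, yields an unbiased estimate of g.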
• An unbiased estimator of the exact gradient of the generalization error is the gradient of the loss with respect to the parameters for that minibatch
• Updating θ in the direction of the calculated gradient performs stochastic gradient descent on the generalization error

• Most implementations of minibatch stochastic gradient descent pass through the data multiple times unless the training set is extremely large
• On the first pass, each minibatch is used to compute an unbiased estimate of the true generalization error. On the second pass, the estimate becomes biased because it is formed by re-sampling values that have already been used, rather than obtaining new fair samples from the data generating distribution
– This increases the gap between training error and test error
• When using an extremely large training set, overfitting is not an issue, so underfitting and computational efficiency become the predominant concerns
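A minimal sketch of the minibatch SGD loop described in these slides, assuming a user-supplied gradient function grad_loss(params, X_batch, y_batch) (a hypothetical name); the examples are reshuffled before every pass so that successive minibatches are not correlated:

import numpy as np

def minibatch_sgd(params, X, y, grad_loss, lr=0.01, batch_size=32, epochs=10):
    # Minibatch SGD: shuffle, slice into minibatches, step against the estimated gradient
    m = X.shape[0]
    for epoch in range(epochs):
        perm = np.random.permutation(m)              # shuffle so minibatches are (nearly) independent
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            g = grad_loss(params, X[idx], y[idx])    # unbiased gradient estimate from this minibatch
            params = params - lr * g                 # gradient descent step
    return params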
Challenges
• Optimization in general is an extremely difficult task
– Convex optimization is simpler but not easy, obtained by carefully designing the objective function
– Training neural networks: usually the more general non-convex case

Ill-Conditioning
• Ill-conditioning can cause SGD to get "stuck" in the sense that even very small steps increase the cost function
• The ill-conditioning problem is generally present in neural network training problems
• Ex: take a second-order Taylor series expansion of the cost function. A gradient descent step of −ɛg will add the following to the cost (see the expression below).
• In many cases, the gradient norm does not shrink significantly throughout learning, but the left (curvature) term grows by more than an order of magnitude.
• The result is that learning becomes very slow despite the presence of a strong gradient, because the learning rate must be shrunk to compensate for even stronger curvature.
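The added-cost expression referenced above did not survive extraction; with gradient g, Hessian H and learning rate ɛ it is, written here with the curvature term on the left so that it matches the "left term" wording above:

\frac{1}{2} \epsilon^{2} g^{\top} H g - \epsilon g^{\top} g

Ill-conditioning becomes a problem when the first (curvature) term exceeds the second, so ɛ must be shrunk even though the gradient g is still large.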
AdaGrad: empirically it has been found that, for training DNN models, the accumulation of squared gradients from the beginning of training results in a premature and excessive decrease in the effective learning rate (see the sketch below).
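A minimal sketch of the AdaGrad update this refers to, assuming a global learning rate lr and a small constant delta for numerical stability; the squared-gradient accumulator r never decays, which is exactly what shrinks the effective learning rate over time:

import numpy as np

def adagrad_step(params, grad, r, lr=0.01, delta=1e-7):
    # One AdaGrad update: accumulate squared gradients, scale the step by their root
    r = r + grad * grad                              # accumulation from the beginning of training
    params = params - lr * grad / (delta + np.sqrt(r))
    return params, r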
• RMSprop
– Addresses the deficiency of AdaGrad by changing the gradient accumulation into an exponentially weighted moving average
• ADAM
– Another adaptive learning rate optimization algorithm that is best seen as a variant of RMSprop+momentum with a few important distinctions:
• In Adam, momentum is incorporated directly as an estimate of the first-order moment (with exponential weighting) of the gradient
• Adam includes bias corrections to the estimates of both the first-order moments (the momentum term) and the (uncentered) second-order moments to account for their initialization at the origin (see the sketch below)
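A minimal sketch of the Adam update summarized above, assuming the commonly used decay rates beta1 = 0.9 and beta2 = 0.999; the divisions by (1 - beta^t) are the bias corrections for the moment estimates being initialized at zero:

import numpy as np

def adam_step(params, grad, s, r, t, lr=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    # One Adam update: exponentially weighted first and second moments with bias correction
    s = beta1 * s + (1 - beta1) * grad               # first-moment (momentum) estimate
    r = beta2 * r + (1 - beta2) * grad * grad        # second-moment (uncentered) estimate
    s_hat = s / (1 - beta1 ** t)                     # bias-corrected first moment
    r_hat = r / (1 - beta2 ** t)                     # bias-corrected second moment
    params = params - lr * s_hat / (np.sqrt(r_hat) + delta)
    return params, s, r

Here t is the 1-based step count, and s and r start as zero arrays with the same shape as params.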
• There is no consensus on which optimization algorithm to use
• Some of the most commonly used ones: SGD, SGD+momentum, RMSprop, RMSprop+momentum, AdaDelta, and Adam
• Hyperparameter tuning