Lecture 04 - Optimization - 4p

The document discusses optimization techniques in deep learning, focusing on the training of neural networks and the challenges associated with it. It covers topics such as empirical risk minimization, surrogate loss functions, and various optimization algorithms including stochastic gradient descent and its variants. Additionally, it highlights issues like ill-conditioning and local minima that can complicate the optimization process.



“Thinking is the hardest work there is, which is probably the reason why so few engage in it.”

-Henry Ford

CSE555 Deep Learning
Spring 2025
Optimization
Prepared by Dr. Erdogan Sevilgen & Dr. Yakup Genc
© 2016-2025 Yakup Genç & Y. Sinan Akgul

Optimization (these slides are a summarized version of the draft book by Bengio et al.)

Supervised Machine Learning

• Given N experiences, each an observed input x_i with an observed outcome y_i, select the hypothesis f from the hypothesis/program space H that maximizes a goodness/performance measure g:

  f* = arg max_{f ∈ H} Σ_{i=1..N} g(y_i, f(x_i))

  where f* is the selected hypothesis.

Introduction

• Neural Network Training: finding the parameters θ of a neural network that significantly reduce a cost function J(θ)
• Cost function typically includes a performance measure evaluated on the entire training set as well as additional regularization terms
• An optimization problem:
  – Important and expensive
  – Takes a long time
  – Special techniques are used, e.g., gradient-based optimization
Introduction

• Training problem vs optimization problem
• Challenges
• Optimization algorithms
  – Initialization of parameters
• Advanced techniques:
  – Adaptive learning rate
  – Use of second derivatives
  – Higher level procedures

Training vs Optimization

• Training in DL is an indirect optimization
  – A performance measure P is aimed to be optimized,
    • defined with respect to the test set
    • may also be intractable
  – But a cost function J(θ) is optimized, assuming it will improve P as well
• In pure optimization, minimizing J is a goal in itself

Empirical Risk Minimization

• Optimization algorithms for training include some specialization on the specific structure of the objective function
• A typical cost function is the expected loss under the data generating distribution
• The expectation is taken across the data generating distribution rather than just over the finite training set
• However, we do not know the true distribution, so an average over the m training examples is used instead (see the expressions below)
• The optimization problem then becomes empirical risk minimization
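The cost function equations on these two slides are not included in this extract. As a hedged reconstruction, the standard forms from the Deep Learning book (Goodfellow, Bengio, Courville) are:

```latex
% Expected (true) risk: expectation over the data generating distribution p_data
J^{*}(\theta) = \mathbb{E}_{(\mathbf{x}, y) \sim p_{\text{data}}} \, L\big(f(\mathbf{x}; \theta), y\big)

% Empirical risk: average over the m training examples, i.e. over \hat{p}_data
J(\theta) = \mathbb{E}_{(\mathbf{x}, y) \sim \hat{p}_{\text{data}}} \, L\big(f(\mathbf{x}; \theta), y\big)
          = \frac{1}{m} \sum_{i=1}^{m} L\big(f(\mathbf{x}^{(i)}; \theta), y^{(i)}\big)
```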
Surrogate Loss Functions

Problems:
• Overfitting!...
  – Models with high capacity can simply memorize the training set.
• Gradient descent is not useful for many loss functions
  – they have no useful derivatives
  – Ex: 0-1 loss
• Use a different cost function: a surrogate loss function

• Ex: the negative log-likelihood of the correct class can be used as a surrogate for the 0-1 loss (illustrated in the sketch below).
  – The negative log-likelihood allows the model to estimate the conditional probability of the classes, given the input, and if the model can do that well, then it can pick the classes that yield the least classification error in expectation.
• The negative log-likelihood may make it possible to learn more
  – the loss often continues to decrease for a long time after the training set 0-1 loss has reached zero
• The robustness can be improved by
  – further pushing the classes apart from each other
  – obtaining a more confident and reliable classifier
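A small sketch of the comparison above, not from the slides; the logits and labels are made up for illustration:

```python
# Sketch (not from the slides): 0-1 loss vs its negative log-likelihood
# surrogate on a tiny made-up batch of logits and integer class labels.
import numpy as np

logits = np.array([[2.0, 0.5, -1.0],     # confidently correct
                   [0.2, 0.1,  0.0],     # barely correct
                   [-0.5, 1.5, 0.3]])    # wrong prediction
labels = np.array([0, 0, 2])

# Softmax probabilities (stabilized by subtracting the row max).
z = logits - logits.max(axis=1, keepdims=True)
probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)

zero_one = (probs.argmax(axis=1) != labels).astype(float)     # non-differentiable
nll = -np.log(probs[np.arange(len(labels)), labels])          # smooth surrogate

print("0-1 loss per example:", zero_one, " mean:", zero_one.mean())
print("NLL per example:     ", nll.round(4), " mean:", nll.mean().round(4))
```

The 0-1 loss is flat almost everywhere, so its gradient carries no information, while the negative log-likelihood keeps decreasing as the correct-class probability grows, which matches the bullets above.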

Cross-Entropy

Categorical Cross-Entropy

[Figure: worked example of the categorical cross-entropy loss, pairing a one-hot target vector (0.0, 0.0, 0.0, 1.0) with the per-class values 7.4186, 2.8201, 4.7915 and 0.0710.]

https://gombru.github.io/2018/05/23/cross_entropy_loss/
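The cross-entropy equations on these slides are not in this extract; the following is a minimal sketch (softmax followed by categorical cross-entropy against a one-hot target), with made-up class scores:

```python
# Sketch: categorical cross-entropy of softmax probabilities vs a one-hot target.
import numpy as np

scores = np.array([1.2, -0.3, 0.8, 3.1])      # made-up class scores
target = np.array([0.0, 0.0, 0.0, 1.0])       # one-hot ground truth

z = scores - scores.max()                      # numerical stabilization
probs = np.exp(z) / np.exp(z).sum()            # softmax probabilities

# Categorical cross-entropy: -sum_c target_c * log(prob_c); with a one-hot
# target this reduces to the negative log-probability of the true class.
loss = -(target * np.log(probs)).sum()
print(probs.round(4), loss.round(4))
```

With a one-hot target only the true-class term survives, so this is exactly the negative log-likelihood surrogate from the previous slides.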

Surrogate Loss Functions

Halting optimization:
• Training algorithms:
  – do not continue until a local minimum,
  – halt while the surrogate loss function still has large derivatives
    • e.g., halt when a convergence criterion based on early stopping is satisfied
  – The early stopping criterion is usually based on the true underlying loss function, such as 0-1 loss measured on a validation set, and is designed to cause the algorithm to halt whenever overfitting begins to occur

Batch and Minibatch Algorithms

• The objective function usually decomposes as a sum over the training examples
• Update the parameters based on an expected cost computed using only a subset of the terms of the full cost function
• Ex: maximum likelihood estimation problems
  – Maximizing expectations over the training set
  – Gradient property (see the expressions below)
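The objective and the gradient property referenced above are not included in this extract; as a hedged reconstruction, the standard forms from the Deep Learning book are:

```latex
% Maximum likelihood as an expectation over the empirical distribution
J(\theta) = \mathbb{E}_{\mathbf{x}, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(\mathbf{x}, y; \theta)

% Gradient property: the gradient of the objective is itself an expectation,
% so it can be estimated from a subset (minibatch) of the training examples
\nabla_{\theta} J(\theta) = \mathbb{E}_{\mathbf{x}, y \sim \hat{p}_{\text{data}}} \nabla_{\theta} \log p_{\text{model}}(\mathbf{x}, y; \theta)
```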
Instance vs Batch

[Figures, slides 17-20: plots of the cost J(Θ) against the parameter Θ, comparing the ideal J(Θ) with the per-instance costs Instance1 J(Θ), Instance2 J(Θ), Instance3 J(Θ) and the batch cost Batch J(Θ).]
Batch and Minibatch Algorithms

• Instead of evaluating the model on every example in the entire dataset, which is expensive:
• In practice, compute these expectations for a random sample of a small number of examples from the dataset
• Recall that the standard error of the mean estimated from n samples is given by σ/√n, where σ is the true standard deviation of the value of the samples.
• Ex: two hypothetical estimates of the gradient, one based on 100 examples and another based on 10,000 examples. The latter requires 100 times more computation than the former, but reduces the standard error of the mean only by a factor of 10 (illustrated numerically below).
• Most optimization algorithms converge much faster if they are allowed to rapidly compute approximate estimates of the gradient rather than slowly computing the exact gradient.

• Optimization algorithms using the entire training set are called batch or deterministic gradient methods
• Optimization algorithms that use only a single example at a time are sometimes called stochastic or online methods
• Optimization algorithms using minibatches are called minibatch or minibatch stochastic methods
  – "batch size" is used to describe the size of a minibatch.
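A small numerical check of the σ/√n claim, not from the slides; the gradient component here is a made-up scalar with Gaussian per-example noise:

```python
# Illustration (not from the slides): the standard error of a minibatch mean
# shrinks as sigma/sqrt(n), so 100x more examples buy only a 10x smaller error.
import numpy as np

rng = np.random.default_rng(0)
sigma = 2.0        # assumed per-example standard deviation of a gradient component
true_grad = 1.5    # assumed true value of that gradient component

def empirical_std_error(n, trials=2000):
    # Draw `trials` minibatch estimates, each the mean of n noisy per-example values.
    means = np.array([(true_grad + sigma * rng.standard_normal(n)).mean()
                      for _ in range(trials)])
    return means.std()

for n in (100, 10_000):
    print(f"n={n:6d}  empirical={empirical_std_error(n):.4f}  "
          f"theory sigma/sqrt(n)={sigma / np.sqrt(n):.4f}")
```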

Batch and Minibatch Algorithms

Minibatch sizes are generally determined by the following factors:
• Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
• Multicore architectures are usually underutilized by extremely small batches.
• Memory size may be a limiting factor in batch size. If all examples in the batch are to be processed in parallel, then the amount of memory scales with the batch size.
• Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power-of-2 batch sizes to offer better runtime.
• Small batches can offer a regularizing effect, perhaps due to the noise they add to the learning process.

• Different kinds of algorithms use different kinds of information from the minibatch in different ways.
• Some algorithms are more sensitive to sampling error than others
  – they use information that is difficult to estimate accurately with few samples,
  – they use information in ways that amplify sampling errors more.
• Ex:
  – Methods that compute updates based only on the gradient g are usually relatively robust and can handle smaller batch sizes like 100.
  – Methods using the Hessian matrix to compute updates typically require much larger batch sizes like 10,000.
Batch and Minibatch Algorithms

It is also crucial that the minibatches be selected randomly.
• To compute an unbiased estimate of the gradient, samples should be independent.
• For two subsequent gradient estimates to be independent from each other, two subsequent minibatches should also be independent from each other.
• Many datasets are most naturally arranged in a way where successive examples are highly correlated.
  – it is necessary to shuffle the examples before selecting minibatches

• An interesting motivation for minibatch stochastic gradient descent is that it follows the gradient of the true generalization error so long as no examples are repeated.
• Suppose both x and y are discrete. In this case, the generalization error can be written as a sum, and the exact gradient is itself a sum over the data generating distribution (both expressions are reproduced below).
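The two expressions referenced above are not in this extract; as a hedged reconstruction, the standard forms from the Deep Learning book for discrete x and y are:

```latex
% Generalization error written as a sum over the data generating distribution
J^{*}(\theta) = \sum_{\mathbf{x}} \sum_{y} p_{\text{data}}(\mathbf{x}, y)\, L\big(f(\mathbf{x}; \theta), y\big)

% Exact gradient of the generalization error
\mathbf{g} = \nabla_{\theta} J^{*}(\theta)
           = \sum_{\mathbf{x}} \sum_{y} p_{\text{data}}(\mathbf{x}, y)\, \nabla_{\theta} L\big(f(\mathbf{x}; \theta), y\big)
```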

Batch and Minibatch Algorithms

• An unbiased estimator of the exact gradient of the generalization error is the gradient of the loss with respect to the parameters for that minibatch
• Updating θ in the direction of the calculated gradient performs stochastic gradient descent on the generalization error (see the sketch below)

• Most implementations of minibatch stochastic gradient pass through the data multiple times unless the training set is extremely large
• On the first pass, each minibatch is used to compute an unbiased estimate of the true generalization error. On the second pass, the estimate becomes biased because it is formed by re-sampling values that have already been used, rather than obtaining new fair samples from the data generating distribution.
  – increases the gap between training error and test error
• When using an extremely large training set, overfitting is not an issue, so underfitting and computational efficiency become the predominant concerns
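A minimal sketch of the shuffle-then-minibatch SGD loop described on these slides; the `grad_loss(params, X_batch, y_batch)` callable (returning the average gradient over the minibatch) and the hyperparameter values are hypothetical, not from the slides:

```python
# Sketch of minibatch SGD: shuffle once per pass, then step along the average
# gradient of each minibatch (an unbiased estimate on the first pass).
import numpy as np

def sgd(params, X, y, grad_loss, lr=0.01, batch_size=100, epochs=10, seed=0):
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    for _ in range(epochs):
        order = rng.permutation(m)        # shuffle: successive examples may be correlated
        for start in range(0, m, batch_size):
            idx = order[start:start + batch_size]
            g = grad_loss(params, X[idx], y[idx])   # minibatch gradient estimate
            params = params - lr * g                # move in the direction of -g
    return params
```

On the first pass each minibatch gradient is an unbiased estimate of the generalization-error gradient; later passes re-use the same examples, which is where the training/test gap mentioned above comes from.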
Challenges

• Optimization in general is an extremely difficult task
  – Convex optimization is simpler but not easy, obtained by carefully designing the objective function
  – Training neural networks: usually the more general non-convex case

Ill-Conditioning

• Ill-conditioning: causing SGD to get "stuck" in the sense that even very small steps increase the cost function
• The ill-conditioning problem is generally present in neural network training problems.
• Ex: consider a second-order Taylor series expansion of the cost function. A gradient descent step of −ɛg will add the following quantity to the cost (reproduced below).
• In many cases, the gradient norm does not shrink significantly throughout learning, but the left (curvature) term grows by more than an order of magnitude.
• The result is that learning becomes very slow despite the presence of a strong gradient because the learning rate must be shrunk to compensate for even stronger curvature.
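The expression added to the cost by the −ɛg step is not shown in this extract; in the Deep Learning book the second-order Taylor expansion gives:

```latex
% Change in cost predicted for a gradient step of -epsilon * g, to second order:
\frac{1}{2}\,\epsilon^{2}\, \mathbf{g}^{\top} \mathbf{H}\, \mathbf{g} \;-\; \epsilon\, \mathbf{g}^{\top} \mathbf{g}
```

Ill-conditioning of the Hessian H becomes a problem when the first (curvature) term exceeds the second, so the step increases the cost unless ɛ is shrunk.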

Ill-Conditioning

Local Minima

• A convex optimization problem can be reduced to the problem of finding a local minimum.
  – Some convex functions have more than a single global minimum point, but a good solution is obtained if a critical point of any kind is reached
• With non-convex functions, such as neural nets, it is possible to have many local minima.
  – Usually an extremely large number of local minima.
Basic Algorithms

• The gradient descent algorithm follows the gradient of an entire training set downhill
• This may be accelerated considerably by using stochastic gradient descent to follow the gradient of randomly selected minibatches downhill
• The true gradient of the total cost function becomes small and then 0 when we approach and reach a minimum using batch gradient descent, so batch gradient descent can use a fixed learning rate.
• However, the learning rate is a crucial parameter for the SGD algorithm.
  – using a fixed learning rate is not enough
  – gradually decrease the learning rate over time (since the minibatch gradient estimator introduces a source of noise that does not vanish even when we arrive at a minimum)
• In practice, it is common to decay the learning rate linearly until iteration τ (see the schedule below)
• After iteration τ, it is common to leave ɛ constant
• Choose the learning rate by monitoring learning curves that plot the objective function as a function of time
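The linear decay schedule itself is not shown in this extract; the usual form from the Deep Learning book, with ɛ_0 the initial and ɛ_τ the final learning rate, is:

```latex
% Linear learning rate decay until iteration tau, then constant at epsilon_tau
\epsilon_{k} = (1 - \alpha)\,\epsilon_{0} + \alpha\,\epsilon_{\tau},
\qquad \alpha = \frac{k}{\tau}
```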

Basic Algorithms

• Usually τ may be set to the number of iterations required to make a few hundred passes through the training set. Usually the final learning rate should be set to roughly 1% of the value of the initial learning rate.
• If the initial learning rate is too large, the learning curve will show violent oscillations, with the cost function often increasing significantly.
  – Gentle oscillations are fine, especially if training with a stochastic cost function such as the cost function arising from the use of dropout.
• If the learning rate is too low, learning proceeds slowly, and if the initial learning rate is too low, learning may become stuck with a high cost value.
• Typically, the optimal initial learning rate, in terms of total training time and the final cost value, is higher than the learning rate that yields the best performance after the first 100 iterations or so.
  – it is usually best to monitor the first several iterations and use a learning rate that is higher than the best-performing learning rate at this time, but not so high that it causes severe instability.

Momentum
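The momentum update rule on this slide is not included in the extract; a minimal sketch of the standard formulation (velocity v, momentum coefficient α, learning rate ɛ), with a hypothetical `grad_loss` callable and `batches` iterable:

```python
# Sketch of SGD with momentum: the velocity accumulates an exponentially
# decaying moving average of past gradients and the parameters follow it.
import numpy as np

def sgd_momentum(params, grad_loss, batches, lr=0.01, alpha=0.9):
    v = np.zeros_like(params)            # velocity, initialized to zero
    for X_batch, y_batch in batches:
        g = grad_loss(params, X_batch, y_batch)
        v = alpha * v - lr * g           # v <- alpha*v - epsilon*g
        params = params + v              # theta <- theta + v
    return params
```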
Nesterov accelerated gradient

Adagrad

Empirically it has been found that, for training DNN models, the accumulation of squared gradients from the beginning of training results in a premature and excessive decrease in the effective learning rate
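The update rules for these two slides are not included in the extract. Nesterov accelerated gradient is ordinary momentum with the gradient evaluated at the look-ahead point θ + αv rather than at θ. Below is a minimal AdaGrad sketch showing the accumulation of squared gradients that the empirical remark above refers to; the `grad_loss` callable and the constants are hypothetical:

```python
# Sketch of AdaGrad: each parameter gets its own effective learning rate,
# scaled by the inverse square root of the accumulated squared gradients.
import numpy as np

def adagrad(params, grad_loss, batches, lr=0.01, delta=1e-7):
    r = np.zeros_like(params)                 # running sum of squared gradients
    for X_batch, y_batch in batches:
        g = grad_loss(params, X_batch, y_batch)
        r = r + g * g                         # accumulate from the very beginning
        params = params - lr * g / (delta + np.sqrt(r))
    return params
```

Because r only ever grows, the per-parameter step size lr/√r shrinks monotonically, which is the premature decay mentioned on the slide.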

Other methods…

• RMSprop
  – Addresses the deficiency of AdaGrad by changing the gradient accumulation into an exponentially weighted moving average
• ADAM
  – Another adaptive learning rate optimization algorithm that is best seen as a variant of RMSprop+momentum with a few important distinctions:
    • In Adam, momentum is incorporated directly as an estimate of the first-order moment (with exponential weighting) of the gradient
    • Adam includes bias corrections to the estimates of both the first-order moments (the momentum term) and the (uncentered) second-order moments to account for their initialization at the origin

Adagrad
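A minimal Adam sketch matching the description above (first- and second-moment moving averages plus the bias corrections); the default constants follow common practice and the `grad_loss` callable is hypothetical:

```python
# Sketch of Adam: exponentially weighted first and second moment estimates,
# both bias-corrected for their initialization at zero.
import numpy as np

def adam(params, grad_loss, batches, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    s = np.zeros_like(params)                 # first moment (momentum-like)
    r = np.zeros_like(params)                 # second moment (uncentered)
    t = 0
    for X_batch, y_batch in batches:
        t += 1
        g = grad_loss(params, X_batch, y_batch)
        s = beta1 * s + (1 - beta1) * g
        r = beta2 * r + (1 - beta2) * g * g
        s_hat = s / (1 - beta1 ** t)          # bias correction, first moment
        r_hat = r / (1 - beta2 ** t)          # bias correction, second moment
        params = params - lr * s_hat / (np.sqrt(r_hat) + eps)
    return params
```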

RMSprop

RMSProp + Momentum

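The RMSProp pseudocode on these two slides is not in this extract; a minimal sketch of RMSProp with optional momentum (decay rate ρ; `grad_loss` and the constants are hypothetical):

```python
# Sketch of RMSProp with optional momentum: the squared-gradient accumulator
# of AdaGrad is replaced by an exponentially weighted moving average.
import numpy as np

def rmsprop(params, grad_loss, batches, lr=0.001, rho=0.9, alpha=0.0, delta=1e-6):
    r = np.zeros_like(params)                 # moving average of squared gradients
    v = np.zeros_like(params)                 # velocity (used when alpha > 0)
    for X_batch, y_batch in batches:
        g = grad_loss(params, X_batch, y_batch)
        r = rho * r + (1 - rho) * g * g       # exponentially weighted accumulation
        step = lr * g / np.sqrt(delta + r)    # per-parameter scaled step
        v = alpha * v - step                  # plain RMSProp when alpha == 0
        params = params + v
    return params
```

Setting alpha to a value such as 0.9 gives the RMSProp + momentum combination named on the slide.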

Choosing the Right Algorithm

• There is no consensus
• Some of the most used ones: SGD, SGD+momentum, RMSprop,
RMSprop+momentum, AdaDelta and Adam
• Hyperparameter tuning
