Lecture 04 - Optimization - 4p
“Thinking is the hardest work there is, which is probably the reason why so few engage in it.”
-Henry Ford
CSE555
Deep Learning
Spring 2025
Optimization
Prepared by Dr. Erdogan Sevilgen & Dr. Yakup Genc
Optimization (these slides are a summarized version of the draft book by Bengio et al.)
© 2016-2025 Yakup Genç & Y. Sinan Akgul
• The expectation is taken across the data generating distribution rather than just over the finite training set
• However, we do not know the true distribution, so the following empirical approximation is used instead, where m is the number of training examples:
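The expressions referenced here did not survive the slide extraction; these are presumably the standard risk and empirical-risk definitions from the Goodfellow et al. text that the slides summarize:

J^*(\theta) = \mathbb{E}_{(x,y) \sim p_{\text{data}}} \, L(f(x;\theta), y)

J(\theta) = \frac{1}{m} \sum_{i=1}^{m} L(f(x^{(i)};\theta), y^{(i)})

where L is the per-example loss, f(x;\theta) is the model prediction, and the data generating distribution p_data in the first line is replaced by the empirical distribution over the m training examples in the second.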
Cross-Entropy
[Figure: worked cross-entropy example with a one-hot target (0.0, 0.0, 0.0, 1.0) and corresponding per-class loss values (7.4186, 2.8201, 4.7915, 0.0710)]
https://gombru.github.io/2018/05/23/cross_entropy_loss/
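A minimal Python sketch of the computation behind such an example, assuming the per-class values above are -log of the softmax probabilities (they exponentiate to values summing to roughly 1); with a one-hot target, only the target class contributes to the loss:

import numpy as np

def cross_entropy(target_one_hot, probs):
    # Cross-entropy between a one-hot target and predicted class probabilities
    eps = 1e-12                              # guard against log(0)
    return -np.sum(target_one_hot * np.log(probs + eps))

# Probabilities implied by the per-class -log values in the example above
probs = np.exp(-np.array([7.4186, 2.8201, 4.7915, 0.0710]))
target = np.array([0.0, 0.0, 0.0, 1.0])
print(cross_entropy(target, probs))          # about 0.0710: only the target class term survives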
Halting optimization:
• Training algorithms:
– do not continue until a local minimum,
– halt while the surrogate loss function still has large derivatives
• E.g., halt when a convergence criterion based on early stopping is satisfied
– The early stopping criterion is usually based on the true underlying loss function, such as 0-1 loss measured on a validation set, and is designed to cause the algorithm to halt whenever overfitting begins to occur

• The objective function usually decomposes as a sum over the training examples
• Updates to the parameters are based on an expected cost computed using only a subset of the terms of the full cost function
• Ex: maximum likelihood estimation problems
– Maximizing expectations over the training set
– Gradient property (see the expression below)
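The "gradient property" expression did not survive extraction; for maximum likelihood problems it is presumably the standard identity that the gradient of the expected log-likelihood is the expectation of per-example gradients:

\nabla_{\theta} J(\theta) = \mathbb{E}_{x,y \sim \hat{p}_{\text{data}}} \, \nabla_{\theta} \log p_{\text{model}}(x, y; \theta)

which is what justifies estimating the gradient from a subset (minibatch) of the training examples.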
[Figure: plots of the cost J(Θ) versus Θ for the ideal cost, for individual training instances (Instance1, Instance2, Instance3), and for the batch cost]
It is also crucial that the minibatches be selected randomly.
• To compute an unbiased estimate of the gradient, samples should be independent.
• For two subsequent gradient estimates to be independent from each other, two subsequent minibatches should also be independent from each other.
• Many datasets are most naturally arranged in a way where successive examples are highly correlated.
– It is necessary to shuffle the examples before selecting minibatches.

• An interesting motivation for minibatch stochastic gradient descent is that it follows the gradient of the true generalization error so long as no examples are repeated.
• Suppose both x and y are discrete. In this case, the generalization error can be written as a sum,
• and the exact gradient has the same form (see the expressions below).
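The two expressions referenced above did not survive extraction; in the discrete case the generalization error and its exact gradient are presumably written, as in the Goodfellow et al. text, as

J^*(\theta) = \sum_{x} \sum_{y} p_{\text{data}}(x, y) \, L(f(x;\theta), y)

g = \nabla_{\theta} J^*(\theta) = \sum_{x} \sum_{y} p_{\text{data}}(x, y) \, \nabla_{\theta} L(f(x;\theta), y)

so a minibatch of examples drawn from p_data, with no example repeated, yields an unbiased estimate of g.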
• An unbiased estimator of the exact gradient of the generalization error is the gradient of the loss with respect to the parameters for that minibatch
• Updating θ in the direction of the calculated gradient performs stochastic gradient descent on the generalization error

• Most implementations of minibatch stochastic gradient descent pass through the data multiple times unless the training set is extremely large
• On the first pass, each minibatch is used to compute an unbiased estimate of the true generalization error. On the second pass, the estimate becomes biased because it is formed by re-sampling values that have already been used, rather than obtaining new fair samples from the data generating distribution
– This increases the gap between training error and test error
• When using an extremely large training set, overfitting is not an issue, so underfitting and computational efficiency become the predominant concerns
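A minimal sketch of the minibatch SGD loop described in these slides, assuming a user-supplied gradient function grad_loss(params, X_batch, y_batch) (a hypothetical name); the examples are reshuffled before every pass so that successive minibatches are not correlated:

import numpy as np

def minibatch_sgd(params, X, y, grad_loss, lr=0.01, batch_size=32, epochs=10):
    # Minibatch SGD: shuffle, slice into minibatches, step against the estimated gradient
    m = X.shape[0]
    for epoch in range(epochs):
        perm = np.random.permutation(m)              # shuffle so minibatches are (nearly) independent
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            g = grad_loss(params, X[idx], y[idx])    # unbiased gradient estimate from this minibatch
            params = params - lr * g                 # gradient descent step
    return params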
Challenges
• Optimization in general is an extremely difficult task
– Convex optimization is simpler but not easy, obtained by carefully designing the objective function
– Training neural networks: usually the more general non-convex case

Ill-Conditioning
• Ill-conditioning can cause SGD to get "stuck" in the sense that even very small steps increase the cost function
• The ill-conditioning problem is generally present in neural network training problems
• Ex: take a second-order Taylor series expansion of the cost function. A gradient descent step of −ɛg will add the following to the cost (see the expression below).
• In many cases, the gradient norm does not shrink significantly throughout learning, but the left (curvature) term grows by more than an order of magnitude.
• The result is that learning becomes very slow despite the presence of a strong gradient, because the learning rate must be shrunk to compensate for even stronger curvature.
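The added-cost expression referenced above did not survive extraction; with gradient g, Hessian H and learning rate ɛ it is, written here with the curvature term on the left so that it matches the "left term" wording above:

\frac{1}{2} \epsilon^{2} g^{\top} H g - \epsilon g^{\top} g

Ill-conditioning becomes a problem when the first (curvature) term exceeds the second, so ɛ must be shrunk even though the gradient g is still large.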
AdaGrad: empirically it has been found that, for training DNN models, the accumulation of squared gradients from the beginning of training results in a premature and excessive decrease in the effective learning rate (see the sketch below).
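A minimal sketch of the AdaGrad update this refers to, assuming a global learning rate lr and a small constant delta for numerical stability; the squared-gradient accumulator r never decays, which is exactly what shrinks the effective learning rate over time:

import numpy as np

def adagrad_step(params, grad, r, lr=0.01, delta=1e-7):
    # One AdaGrad update: accumulate squared gradients, scale the step by their root
    r = r + grad * grad                              # accumulation from the beginning of training
    params = params - lr * grad / (delta + np.sqrt(r))
    return params, r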
• RMSprop
– Addresses the deficiency of AdaGrad by changing the gradient accumulation into an exponentially weighted moving average
• ADAM
– Another adaptive learning rate optimization algorithm that is best seen as a variant of RMSprop+momentum with a few important distinctions:
• In Adam, momentum is incorporated directly as an estimate of the first-order moment (with exponential weighting) of the gradient
• Adam includes bias corrections to the estimates of both the first-order moments (the momentum term) and the (uncentered) second-order moments to account for their initialization at the origin (see the sketch below)
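A minimal sketch of the Adam update summarized above, assuming the commonly used decay rates beta1 = 0.9 and beta2 = 0.999; the divisions by (1 - beta^t) are the bias corrections for the moment estimates being initialized at zero:

import numpy as np

def adam_step(params, grad, s, r, t, lr=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    # One Adam update: exponentially weighted first and second moments with bias correction
    s = beta1 * s + (1 - beta1) * grad               # first-moment (momentum) estimate
    r = beta2 * r + (1 - beta2) * grad * grad        # second-moment (uncentered) estimate
    s_hat = s / (1 - beta1 ** t)                     # bias-corrected first moment
    r_hat = r / (1 - beta2 ** t)                     # bias-corrected second moment
    params = params - lr * s_hat / (np.sqrt(r_hat) + delta)
    return params, s, r

Here t is the 1-based step count, and s and r start as zero arrays with the same shape as params.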
• There is no consensus on which optimization algorithm to use
• Some of the most commonly used ones: SGD, SGD+momentum, RMSprop, RMSprop+momentum, AdaDelta, and Adam
• Hyperparameter tuning