Convex Function vs. Nonconvex Function: A Little Bit Theory
Shusen Wang
Global Extremum vs. Local Extremum
Local Minimum of a function 𝑓(𝐰)
If 𝑓(𝐰⋆) ≤ 𝑓(𝐰) for all 𝐰 in a neighborhood of 𝐰⋆, then 𝐰⋆ is a local minimum of 𝑓.
Global Minimum of a function 𝑓(𝐰)
If 𝑓(𝐰⋆) ≤ 𝑓(𝐰) for all 𝐰 in the domain of 𝑓, then 𝐰⋆ is a global minimum of 𝑓.
• A global minimum is also a local minimum.
• A global minimum may not be unique.
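A minimal sketch of the last point (my own toy example, not from the slides): the one-dimensional function 𝑓(𝑤) = (𝑤² − 1)² is nonnegative and vanishes at both 𝑤 = −1 and 𝑤 = +1, so its global minimum is not unique.

```python
import numpy as np

# Toy example: f(w) = (w^2 - 1)^2 has two global minima, w = -1 and w = +1,
# since f(w) >= 0 everywhere and f(-1) = f(+1) = 0.
f = lambda w: (w**2 - 1.0)**2

w_grid = np.linspace(-2.0, 2.0, 100001)
values = f(w_grid)
print(w_grid[np.isclose(values, values.min(), atol=1e-12)])  # points near -1 and +1
```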
Properties of Local Minimum
Assume 𝑓 is defined on ℝ^𝑑.
Properties of a local minimum 𝐰⋆:
1. The gradient at 𝐰⋆, ∇𝑓(𝐰⋆) ∈ ℝ^𝑑, is all-zeros.
2. The Hessian matrix at 𝐰⋆, ∇²𝑓(𝐰⋆) ∈ ℝ^{𝑑×𝑑}, is positive semidefinite (i.e., all of its 𝑑 eigenvalues are nonnegative).
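A quick numerical check of these two conditions (my own example, not from the slides), using 𝑓(𝐰) = (𝑤₁² − 1)² + 𝑤₂² and its local minimum 𝐰⋆ = (1, 0):

```python
import numpy as np

# Verify the two local-minimum conditions for f(w) = (w1^2 - 1)^2 + w2^2
# at w* = (1, 0).
def grad(w):
    w1, w2 = w
    return np.array([4.0 * w1 * (w1**2 - 1.0), 2.0 * w2])

def hessian(w):
    w1, _ = w
    return np.array([[12.0 * w1**2 - 4.0, 0.0],
                     [0.0,                2.0]])

w_star = np.array([1.0, 0.0])
print(grad(w_star))                        # [0. 0.]  -> gradient is all-zeros
print(np.linalg.eigvalsh(hessian(w_star))) # [2. 8.]  -> all eigenvalues >= 0
```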
Convex Function
• Convex function: the line segment between any two points on the graph of the function lies above or on the graph.
Properties of a convex function 𝑓:
1. Local minimum = global minimum.
2. The Hessian matrix ∇²𝑓(𝐰) is positive semidefinite everywhere.
3. ∇𝑓(𝐰⋆) = 𝟎 ⟹ 𝐰⋆ is a global minimum.
[Graph of a convex function]
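A small numerical illustration of property 3 (a sketch with my own example, not from the slides): for the convex quadratic 𝑓(𝐰) = ½𝐰ᵀA𝐰 − 𝐛ᵀ𝐰 with A positive definite, the stationary point solves A𝐰 = 𝐛 and is the global minimum.

```python
import numpy as np

# f(w) = 0.5 * w^T A w - b^T w with A positive definite is convex,
# so the stationary point w* = A^{-1} b is the global minimum.
rng = np.random.default_rng(0)
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, -1.0])

f = lambda w: 0.5 * w @ A @ w - b @ w
w_star = np.linalg.solve(A, b)           # solves grad f(w) = A w - b = 0

print(np.linalg.eigvalsh(A))             # both eigenvalues > 0: Hessian is PSD
random_points = rng.normal(size=(1000, 2))
print(np.all(f(w_star) <= np.array([f(w) for w in random_points])))  # True
```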
Nonconvex Function
For a nonconvex function, none of these properties is guaranteed:
1. Local minimum = global minimum.
2. The Hessian matrix ∇²𝑓(𝐰) is positive semidefinite everywhere.
3. ∇𝑓(𝐰⋆) = 𝟎 ⟹ 𝐰⋆ is a global minimum.
[Graph of a nonconvex function]
A Global Minimum Is Unlikely to Be Reached
• #local minima ≫ #global minima.
• The final solution depends on the initialization.
• Reaching one of the global minima is very unlikely.
[Graph of a nonconvex function]
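The following sketch (my own toy example, not from the slides) shows the dependence on initialization: gradient descent on a tilted double-well function reaches the global minimum only when it starts in the right basin.

```python
import numpy as np

# f(w) = (w^2 - 1)^2 + 0.3 w is a "tilted double well":
# global minimum near w = -1, a worse local minimum near w = +1.
f      = lambda w: (w**2 - 1.0)**2 + 0.3 * w
f_grad = lambda w: 4.0 * w * (w**2 - 1.0) + 0.3

def gradient_descent(w0, lr=0.01, steps=2000):
    w = w0
    for _ in range(steps):
        w -= lr * f_grad(w)
    return w

for w0 in (-1.5, +1.5):
    w = gradient_descent(w0)
    print(f"start {w0:+.1f} -> converged to {w:+.3f}, f = {f(w):+.3f}")
# Only the run started in the left basin reaches the global minimum.
```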
Saddle Point
Definition of a saddle point 𝐰_saddle:
1. The gradient of 𝑓 at the saddle point is all-zeros: ∇𝑓(𝐰_saddle) = 𝟎.
2. The Hessian matrix ∇²𝑓(𝐰_saddle) has both positive and negative eigenvalues.
[Graph of a nonconvex function with the saddle point 𝐰_saddle marked]
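A minimal check of this definition (my own example, not from the slides): for 𝑓(𝐰) = 𝑤₁² − 𝑤₂², the origin is a saddle point.

```python
import numpy as np

# The origin is a saddle point of f(w) = w1^2 - w2^2.
grad    = lambda w: np.array([2.0 * w[0], -2.0 * w[1]])
hessian = np.array([[2.0,  0.0],
                    [0.0, -2.0]])

w_saddle = np.array([0.0, 0.0])
print(grad(w_saddle))               # [ 0. -0.] -> the gradient is all-zeros
print(np.linalg.eigvalsh(hessian))  # [-2.  2.] -> one negative, one positive eigenvalue
```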
Saddle Point vs. Local Minimum
Saddle point 𝐰_saddle:
• Gradient: ∇𝑓(𝐰_saddle) = 𝟎.
• Hessian: ∇²𝑓(𝐰_saddle) has both positive and negative eigenvalues.
Local minimum 𝐰⋆:
• Gradient: ∇𝑓(𝐰⋆) = 𝟎.
• Hessian: ∇²𝑓(𝐰⋆) does not have negative eigenvalues.
• Full gradient descent stops at either a saddle point or a local minimum.
• In 2D, the numbers of saddle points and local minima are comparable, but this is not true in high dimensions.
• In high dimensions, #saddle points is much greater than #local minima (see the counting sketch after this list):
  • The Hessian has 𝑑 eigenvalues, each of which can be positive or negative.
  • There are 2^𝑑 combinations of positive and negative eigenvalues.
  • One of the 2^𝑑 combinations (all eigenvalues positive) corresponds to local minima; one (all negative) corresponds to local maxima.
  • The remaining 2^𝑑 − 2 combinations correspond to saddle points.
• If a neural net is optimized by full gradient descent, it will converge to a saddle point.
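The counting sketch mentioned above, as a back-of-the-envelope script (it assumes, purely for illustration, that every sign pattern of the 𝑑 nonzero eigenvalues can occur at some stationary point):

```python
# Count the sign patterns of the d Hessian eigenvalues at a stationary point
# (all eigenvalues assumed nonzero, for simplicity).
for d in (2, 10, 100):
    total         = 2.0**d        # all sign patterns
    local_minima  = 1             # all eigenvalues positive
    local_maxima  = 1             # all eigenvalues negative
    saddle_points = total - local_minima - local_maxima
    print(f"d = {d:3d}: saddle patterns = {saddle_points:.3g}")
# d = 2 gives 2 (comparable to the single minimum pattern);
# d = 100 gives ~1.3e30 (saddle patterns dominate).
```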
Be Careful When Optimizing a Nonconvex Function
Be careful about the initialization!
• A bad initialization results in convergence to a bad region.
• Because of the nonconvexity, the global minimum cannot be attained.
• Rule of thumb (see the sketch below):
  • The trainable parameters (e.g., the filters of a ConvNet) are randomly initialized with proper scaling.
  • Bad scaling leads to terrible results.
  • All-zero and all-one initializations are bad ideas.
  • Pretrained parameters can be a very good initialization.
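A minimal sketch of the "proper scaling" rule of thumb (illustrative only; the layer sizes and the He-style scaling factor √(2/fan_in) are my choices, not from the slides):

```python
import numpy as np

# Random initialization with He-style scaling for a layer with fan_in inputs.
rng = np.random.default_rng(0)

def init_layer(fan_in, fan_out):
    # Gaussian weights scaled by sqrt(2 / fan_in); biases start at zero.
    W = rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))
    b = np.zeros(fan_out)
    return W, b

# Bad ideas mentioned on the slide, for contrast:
#   W = np.zeros((fan_out, fan_in))  # all-zero: every unit computes the same thing
#   W = np.ones((fan_out, fan_in))   # all-one: likewise, and badly scaled
W, b = init_layer(fan_in=512, fan_out=256)
print(W.std())   # roughly sqrt(2 / 512) ~= 0.0625
```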
Be careful about the optimization algorithm!
• Full gradient descent will get stuck at a saddle point, because the gradient is near zero when approaching the saddle point.
• Stochastic gradient descent (SGD) can escape saddle points, because its updates are random and noisy (see the toy demonstration below).
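A toy demonstration of both points (my own sketch, not from the slides), using 𝑓(𝐰) = 𝑤₁² − 𝑤₂² + ½𝑤₂⁴, which has a saddle point at the origin and two local minima at (0, ±1):

```python
import numpy as np

# Starting on the w2 = 0 axis, the gradient never leaves that axis, so full
# gradient descent walks straight into the saddle; SGD-style noise escapes it.
grad = lambda w: np.array([2.0 * w[0], -2.0 * w[1] + 2.0 * w[1]**3])

def run(noise_std, steps=500, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.array([1.0, 0.0])   # w2 = 0: the stable manifold of the saddle
    for _ in range(steps):
        w = w - lr * (grad(w) + rng.normal(scale=noise_std, size=2))
    return w

print(run(noise_std=0.0))   # ~[0, 0]:  full GD is stuck at the saddle point
print(run(noise_std=0.1))   # ~[0, ±1]: the noisy iterate escapes to a minimum
```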
Be careful about the batch size!
• For parallel computing with multiple GPUs, a larger batch size ⟹ lower per-epoch runtime.
• A large batch size, e.g., 10K, may result in bad generalization.
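A schematic sketch of how the batch size enters the computation (my own least-squares example, not from the slides): each SGD step averages the gradient over one minibatch, so a larger batch means fewer, less noisy updates per epoch.

```python
import numpy as np

# Minibatch SGD on a least-squares loss f_i(w) = 0.5 * (x_i^T w - y_i)^2.
rng = np.random.default_rng(0)
n, d = 10_000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def sgd_epoch(w, batch_size, lr=0.1):
    perm = rng.permutation(n)
    for start in range(0, n, batch_size):   # larger batch -> fewer updates per epoch
        idx = perm[start:start + batch_size]
        g = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # gradient averaged over batch
        w = w - lr * g
    return w

for batch_size in (32, 1024):
    w = np.zeros(d)
    for _ in range(3):                       # same number of epochs for both
        w = sgd_epoch(w, batch_size)
    updates = 3 * int(np.ceil(n / batch_size))
    print(f"batch {batch_size:4d}: {updates} updates, "
          f"||w - w_true|| = {np.linalg.norm(w - w_true):.3f}")
```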
… More about the Batch Size
• Findings reported in the paper "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" (Goyal et al., Facebook):
  • A batch size larger than 8K results in poor generalization.
  • A large batch size is good for time-efficiency.
  • Lots of tricks are required in large-batch training.
[Figure: ImageNet top-1 validation error vs. minibatch size (64 to 64K); the error range of plus/minus two standard deviations is shown. The figure is from the paper [Link]: "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour".]
… More about the Batch Size
• Researchers' conjecture (a toy sketch of this intuition follows the figure below):
  • Small batch size ⟹ flat local minima; large batch size ⟹ sharp local minima.
  • Flat local minima generalize better (on the test set).
[Figure: visualizations of the minima reached with batch size 128 vs. batch size 8192 (SGD and Adam); e.g., SGD with batch 128 reaches 7.37% / 6.00% test error vs. 11.07% / 10.19% with batch 8192. The figure is from paper [Link].]
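A toy sketch of the intuition behind the conjecture (entirely my own illustration, not from the slides or the figure above): two minima with the same training loss, one flat and one sharp; a random perturbation of the weights, standing in for the train/test mismatch, hurts the sharp minimum far more.

```python
import numpy as np

# Two 1-D "losses" with the same minimum value 0 at w = 0,
# one with a wide flat basin and one with a narrow sharp basin.
flat  = lambda w: 1.0 - np.exp(-0.1 * w**2)    # flat basin around w = 0
sharp = lambda w: 1.0 - np.exp(-10.0 * w**2)   # sharp basin around w = 0

rng = np.random.default_rng(0)
perturbation = rng.normal(scale=0.3, size=10_000)   # random weight perturbations
print(flat(perturbation).mean())    # ~0.009: the loss barely moves
print(sharp(perturbation).mean())   # ~0.4:   the loss increases a lot
```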
… More about the Batch Size
• There are also papers in support of small-batch training, e.g.,
[Link]
Do Not Believe Deep Learning Theories Blindly
• Explanations
• Empirical study
Summary
• #global minima ≪ #local minima ≪ #saddle points.
• Full gradient descent converges to a saddle point.
• SGD converges to a local minimum.
• Initialization is crucial.
• Proper scaling.
• Pretrain.
• Batch size affects time efficiency and generalization.