Convex Function vs. Nonconvex Function: A Little Bit Theory

Shusen Wang
Global Extremum vs. Local Extremum

Local Minimum of a function 𝑓(𝐰)
If 𝑓(𝐰⋆) ≤ 𝑓(𝐰) for all 𝐰 in a neighborhood of 𝐰⋆, then 𝐰⋆ is a local minimum of 𝑓.

Global Minimum of a function 𝑓(𝐰)
If 𝑓(𝐰⋆) ≤ 𝑓(𝐰) for all 𝐰 in the domain of 𝑓, then 𝐰⋆ is a global minimum of 𝑓.

• A global minimum is a local minimum.
• A global minimum may not be unique.
Properties of Local Minimum

Assume 𝑓 is defined on ℝ^𝑑.

Properties of a local minimum 𝐰⋆:

1. The gradient at 𝐰⋆, ∇𝑓(𝐰⋆) ∈ ℝ^𝑑, is all-zeros.
2. The Hessian matrix at 𝐰⋆, ∇²𝑓(𝐰⋆) ∈ ℝ^(𝑑×𝑑), is positive semidefinite (i.e., all of its 𝑑 eigenvalues are nonnegative).
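As a quick numerical check (my addition, not from the slides), the two conditions can be verified with finite differences; the toy function f below and the helper routines are hypothetical examples.

```python
import numpy as np

def f(w):
    # Hypothetical toy function on R^2 with a local minimum at w = (1, 0).
    return (w[0]**2 - 1)**2 + w[1]**2

def numerical_gradient(f, w, eps=1e-5):
    # Central finite differences, one coordinate at a time.
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w)
        e[i] = eps
        g[i] = (f(w + e) - f(w - e)) / (2 * eps)
    return g

def numerical_hessian(f, w, eps=1e-4):
    # Finite differences of the gradient give the Hessian column by column.
    d = len(w)
    H = np.zeros((d, d))
    for i in range(d):
        e = np.zeros(d)
        e[i] = eps
        H[:, i] = (numerical_gradient(f, w + e) - numerical_gradient(f, w - e)) / (2 * eps)
    return H

w_star = np.array([1.0, 0.0])                             # candidate local minimum
print(numerical_gradient(f, w_star))                      # ≈ [0, 0]       (property 1)
print(np.linalg.eigvalsh(numerical_hessian(f, w_star)))   # ≈ [2, 8], all ≥ 0 (property 2)
```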
Convex Function

• Convex function: the line segment between any two points on the graph of the function lies above or on the graph.

Properties of a convex function 𝑓:

1. Local minimum = global minimum.
2. The Hessian matrix ∇²𝑓(𝐰) is positive semidefinite everywhere.
3. ∇𝑓(𝐰⋆) = 𝟎 ⟹ 𝐰⋆ is a global minimum.

Graph of a convex function
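To make properties 2 and 3 concrete, here is a minimal sketch (my example, not the slides') using a convex quadratic, whose Hessian is the constant matrix A:

```python
import numpy as np

# Hypothetical convex quadratic: f(w) = 0.5 * w^T A w - b^T w, with A positive definite.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])

def f(w):
    return 0.5 * w @ A @ w - b @ w

# Property 2: the Hessian of f is A everywhere; all eigenvalues are positive.
print(np.linalg.eigvalsh(A))

# Property 3: the stationary point (where the gradient A w - b = 0) is the global minimum.
w_star = np.linalg.solve(A, b)
rng = np.random.default_rng(0)
for _ in range(5):
    w = rng.normal(size=2)
    assert f(w) >= f(w_star)
```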
Nonconvex Function

The properties of a convex function no longer hold:
1. A local minimum is not necessarily a global minimum.
2. The Hessian matrix ∇²𝑓(𝐰) is not positive semidefinite everywhere.
3. ∇𝑓(𝐰⋆) = 𝟎 does not imply that 𝐰⋆ is a global minimum.

Graph of a nonconvex function
Global Minimum Is Unlikely to Be Reached

• #local minima ≫ #global minima.


• The final solution depends on the
initialization.
• Reaching one of the global minima is
very unlikely.

Graph of a nonconvex function
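A toy illustration (my example, not the slides'): gradient descent on a simple one-dimensional nonconvex function ends up in different local minima depending on where it starts.

```python
def f(w):
    # Hypothetical 1-D nonconvex function with two local minima of different depths.
    return (w**2 - 1)**2 + 0.3 * w

def grad(w):
    return 4 * w * (w**2 - 1) + 0.3

def gradient_descent(w0, lr=0.01, steps=2000):
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Different initializations converge to different local minima.
for w0 in (-2.0, 0.5, 2.0):
    w = gradient_descent(w0)
    print(f"init {w0:+.1f}  ->  w = {w:+.3f},  f(w) = {f(w):+.3f}")
```

Only the run started at -2.0 lands near the deeper (global) minimum; the other two converge to the shallower local minimum.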


Saddle Point

Definition of a saddle point 𝐰_saddle:

1. The gradient of 𝑓 at the saddle point is all-zeros: ∇𝑓(𝐰_saddle) = 𝟎.
2. The Hessian matrix ∇²𝑓(𝐰_saddle) has both positive and negative eigenvalues.

Graph of a nonconvex function
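The classic example f(x, y) = x² − y² (my illustration, not from the slides) satisfies both conditions at the origin:

```python
import numpy as np

# f(x, y) = x^2 - y^2 has a stationary point at the origin that is not a minimum.
def grad(w):
    x, y = w
    return np.array([2 * x, -2 * y])

hessian = np.array([[2.0, 0.0],
                    [0.0, -2.0]])          # the (constant) Hessian of f

w_saddle = np.array([0.0, 0.0])
print(grad(w_saddle))                      # [0, 0]: the gradient is all-zeros
print(np.linalg.eigvalsh(hessian))         # [-2, 2]: mixed signs, so a saddle point
```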


Saddle Point vs. Local Minimum

Saddle point 𝐰_saddle:
• Gradient: ∇𝑓(𝐰_saddle) = 𝟎.
• Hessian: ∇²𝑓(𝐰_saddle) has both positive and negative eigenvalues.

Local minimum 𝐰⋆:
• Gradient: ∇𝑓(𝐰⋆) = 𝟎.
• Hessian: ∇²𝑓(𝐰⋆) does not have negative eigenvalues.

• Full gradient descent stops at either a saddle point or a local minimum.


• In 2D, #saddle points and #local minima are comparable, but this is not true in high dimensions.
• In high dimensions, #saddle points ≫ #local minima (see the sketch after this list):
  • The Hessian has 𝑑 eigenvalues, each of which can be positive or negative.
  • There are 2^𝑑 combinations of positive and negative eigenvalues.
  • Only one of the 2^𝑑 combinations (all positive) corresponds to a local minimum.
  • 2^𝑑 − 2 combinations correspond to saddle points.
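A rough way to see the count (my sketch; it assumes, purely as a heuristic, that every sign pattern of the 𝑑 eigenvalues is equally likely):

```python
# Of the 2^d sign patterns, only the all-positive one corresponds to a local minimum;
# all-negative is a local maximum; everything else is a saddle point.
for d in (2, 10, 20, 50):
    total = 2 ** d
    saddles = total - 2
    print(f"d = {d:2d}: {total} sign patterns, {saddles} saddle-type, 1 local-min-type")
```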
• If a neural net is optimized by full gradient descent, it will converge to a saddle point.
Be Careful When Optimizing a Nonconvex Function

Be careful about the initialization!


• Bad initialization results in convergence to bad regions.
• Because of the nonconvexity, the global minimum cannot be attained.
• Rule of thumb (a minimal sketch follows this list):
  • The trainable parameters (e.g., the filters of a ConvNet) are randomly initialized with proper scaling.
  • Bad scaling leads to terrible results.
  • All-zero and all-one initializations are bad ideas.
  • Pretrained parameters can be a very good initialization.
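A minimal sketch of "random initialization with proper scaling" (my example; the 1/sqrt(fan_in) scale and the layer sizes are assumptions in the spirit of Xavier/He initialization, not taken from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(fan_in, fan_out):
    # Proper scaling: the weight variance shrinks with fan_in, so activations
    # neither blow up nor vanish as depth grows.
    scale = 1.0 / np.sqrt(fan_in)
    return rng.normal(0.0, scale, size=(fan_in, fan_out))

W1 = init_layer(784, 256)    # hypothetical layer sizes
W2 = init_layer(256, 10)

# All-zero (or all-one) initialization makes every hidden unit identical, and
# gradient updates cannot break that symmetry -- a bad idea.
W_bad = np.zeros((784, 256))
```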
Be careful about the optimization algorithm!

• Full gradient descent will be stuck at a saddle point,
  • because the gradient is near zero when approaching the saddle point.
• Stochastic gradient descent (SGD) can escape saddle points,
  • because it is random and noisy (see the sketch below).
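A toy sketch of why noise helps (my example, not the slides'): on f(x, y) = x² − y², plain gradient descent started on the x-axis converges to the saddle at the origin, while adding SGD-like noise pushes the iterate off the saddle along the y direction.

```python
import numpy as np

def grad(w):
    # Gradient of f(x, y) = x^2 - y^2.
    return np.array([2 * w[0], -2 * w[1]])

def descend(w0, lr=0.1, steps=50, noise=0.0, seed=0):
    rng = np.random.default_rng(seed)
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        g = grad(w) + noise * rng.normal(size=2)   # noise = 0 reproduces full GD
        w -= lr * g
    return w

print(descend([1.0, 0.0], noise=0.0))   # ends near (0, 0): stuck at the saddle
print(descend([1.0, 0.0], noise=0.1))   # |y| grows: the noise kicks it off the saddle
```

Here the escape direction y is one along which this toy f is unbounded below; for a real loss the iterate would instead slide down toward a nearby local minimum.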
Be careful about the batch size!

• For parallel computing with multiple GPUs, a larger batch size ⇒ lower per-epoch runtime (see the sketch below).
• A large batch size, e.g., 10K, may result in bad generalization.
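A back-of-the-envelope sketch of the per-epoch runtime claim (my numbers; the ImageNet-1k training-set size is used only for illustration, and the per-step wall time is assumed roughly constant under data parallelism):

```python
import math

n_examples = 1_281_167                    # ImageNet-1k training set, for illustration
for batch_size in (256, 1024, 8192):
    steps_per_epoch = math.ceil(n_examples / batch_size)
    print(f"batch {batch_size:5d}: {steps_per_epoch:5d} steps per epoch")
```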
… More about the Batch Size

• Batch size larger than 8K results in poor generalization.
• Large batch size is good for time-efficiency.
• Lots of tricks are required in large-batch training.

Figure 1 of the paper "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" (Goyal et al., Facebook) [Link]: ImageNet top-1 validation error vs. minibatch size (64 to 64K); the error range of plus/minus two standard deviations is shown.
… More about the Batch Size

• Researchers' conjecture:
  • Small batch size ⇒ flat local minima; big batch size ⇒ sharp local minima.
  • Flat local minima generalize better (on the test set).

Figure panels compare batch size = 128 and batch size = 8192: (a) SGD, 128, 7.37%; (b) SGD, 8192, 11.07%; (c) Adam, 128; (e) SGD, 128, 6.00%; (f) SGD, 8192, 10.19%; (g) Adam, 128. The figure is from paper [Link].
… More about the Batch Size

• There are papers supportive of small-batch training, e.g., [Link].
Do Not Believe Deep Learning Theories Blindly

Explanations

Empirical study
Summary

• #global minima ≪ #local minima ≪ #saddle points.


• Full gradient descent converges to a saddle point.
• SGD converges to a local minimum.

• Initialization is crucial.
  • Proper scaling.
  • Pretraining.
• Batch size affects time efficiency and generalization.
