Convex Function vs. Nonconvex Function: A Little Bit Theory
Shusen Wang
Global Extremum vs. Local Extremum
Local Minimum of a function 𝑓(𝐰)
If 𝑓(𝐰⋆) ≤ 𝑓(𝐰) for all 𝐰 in a neighborhood of 𝐰⋆, then 𝐰⋆ is a local minimum of 𝑓.
Global Minimum of a function 𝑓(𝐰)
If 𝑓(𝐰⋆) ≤ 𝑓(𝐰) for all 𝐰 in the domain of 𝑓, then 𝐰⋆ is a global minimum of 𝑓.
• A global minimum is also a local minimum.
• A global minimum may not be unique.
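A minimal sketch of the last point (my own toy example, not from the slides): the one-dimensional function 𝑓(𝑤) = (𝑤² − 1)² is nonnegative and vanishes at both 𝑤 = −1 and 𝑤 = +1, so its global minimum is not unique.

```python
import numpy as np

# Toy example: f(w) = (w^2 - 1)^2 has two global minima, w = -1 and w = +1,
# since f(w) >= 0 everywhere and f(-1) = f(+1) = 0.
f = lambda w: (w**2 - 1.0)**2

w_grid = np.linspace(-2.0, 2.0, 100001)
values = f(w_grid)
print(w_grid[np.isclose(values, values.min(), atol=1e-12)])  # points near -1 and +1
```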
Properties of Local Minimum
Assume 𝑓 is defined on ℝ^𝑑.
Properties of a local minimum 𝐰⋆:
1. The gradient at 𝐰⋆, ∇𝑓(𝐰⋆) ∈ ℝ^𝑑, is all-zeros.
2. The Hessian matrix at 𝐰⋆, ∇²𝑓(𝐰⋆) ∈ ℝ^{𝑑×𝑑}, is positive semidefinite (i.e., all of its 𝑑 eigenvalues are nonnegative).
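A quick numerical check of these two conditions (my own example, not from the slides), using 𝑓(𝐰) = (𝑤₁² − 1)² + 𝑤₂² and its local minimum 𝐰⋆ = (1, 0):

```python
import numpy as np

# Verify the two local-minimum conditions for f(w) = (w1^2 - 1)^2 + w2^2
# at w* = (1, 0).
def grad(w):
    w1, w2 = w
    return np.array([4.0 * w1 * (w1**2 - 1.0), 2.0 * w2])

def hessian(w):
    w1, _ = w
    return np.array([[12.0 * w1**2 - 4.0, 0.0],
                     [0.0,                2.0]])

w_star = np.array([1.0, 0.0])
print(grad(w_star))                        # [0. 0.]  -> gradient is all-zeros
print(np.linalg.eigvalsh(hessian(w_star))) # [2. 8.]  -> all eigenvalues >= 0
```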
Convex Function
• Convex function: the line segment between any two points on the graph of the function lies above or on the graph.
Properties of a convex function 𝑓:
1. Local minimum = global minimum.
2. The Hessian matrix ∇²𝑓(𝐰) is positive semidefinite everywhere.
3. ∇𝑓(𝐰⋆) = 𝟎 ⟹ 𝐰⋆ is a global minimum.
[Graph of a convex function]
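A small numerical illustration of property 3 (a sketch with my own example, not from the slides): for the convex quadratic 𝑓(𝐰) = ½𝐰ᵀA𝐰 − 𝐛ᵀ𝐰 with A positive definite, the stationary point solves A𝐰 = 𝐛 and is the global minimum.

```python
import numpy as np

# f(w) = 0.5 * w^T A w - b^T w with A positive definite is convex,
# so the stationary point w* = A^{-1} b is the global minimum.
rng = np.random.default_rng(0)
A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
b = np.array([1.0, -1.0])

f = lambda w: 0.5 * w @ A @ w - b @ w
w_star = np.linalg.solve(A, b)           # solves grad f(w) = A w - b = 0

print(np.linalg.eigvalsh(A))             # both eigenvalues > 0: Hessian is PSD
random_points = rng.normal(size=(1000, 2))
print(np.all(f(w_star) <= np.array([f(w) for w in random_points])))  # True
```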
Nonconvex Function
For a nonconvex function, none of these properties is guaranteed:
1. Local minimum = global minimum.
2. The Hessian matrix ∇²𝑓(𝐰) is positive semidefinite everywhere.
3. ∇𝑓(𝐰⋆) = 𝟎 ⟹ 𝐰⋆ is a global minimum.
[Graph of a nonconvex function]
A Global Minimum Is Unlikely to Be Reached
• #local minima ≫ #global minima.
• The final solution depends on the initialization.
• Reaching one of the global minima is very unlikely.
[Graph of a nonconvex function]
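The following sketch (my own toy example, not from the slides) shows the dependence on initialization: gradient descent on a tilted double-well function reaches the global minimum only when it starts in the right basin.

```python
import numpy as np

# f(w) = (w^2 - 1)^2 + 0.3 w is a "tilted double well":
# global minimum near w = -1, a worse local minimum near w = +1.
f      = lambda w: (w**2 - 1.0)**2 + 0.3 * w
f_grad = lambda w: 4.0 * w * (w**2 - 1.0) + 0.3

def gradient_descent(w0, lr=0.01, steps=2000):
    w = w0
    for _ in range(steps):
        w -= lr * f_grad(w)
    return w

for w0 in (-1.5, +1.5):
    w = gradient_descent(w0)
    print(f"start {w0:+.1f} -> converged to {w:+.3f}, f = {f(w):+.3f}")
# Only the run started in the left basin reaches the global minimum.
```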
Saddle Point
Definition of a saddle point 𝐰_saddle:
1. The gradient of 𝑓 at the saddle point is all-zeros: ∇𝑓(𝐰_saddle) = 𝟎.
2. The Hessian matrix ∇²𝑓(𝐰_saddle) has both positive and negative eigenvalues.
[Graph of a nonconvex function with the saddle point 𝐰_saddle marked]
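A minimal check of this definition (my own example, not from the slides): for 𝑓(𝐰) = 𝑤₁² − 𝑤₂², the origin is a saddle point.

```python
import numpy as np

# The origin is a saddle point of f(w) = w1^2 - w2^2.
grad    = lambda w: np.array([2.0 * w[0], -2.0 * w[1]])
hessian = np.array([[2.0,  0.0],
                    [0.0, -2.0]])

w_saddle = np.array([0.0, 0.0])
print(grad(w_saddle))               # [ 0. -0.] -> the gradient is all-zeros
print(np.linalg.eigvalsh(hessian))  # [-2.  2.] -> one negative, one positive eigenvalue
```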
Saddle Point vs. Local Minimum
Saddle point 𝐰_saddle:
• Gradient: ∇𝑓(𝐰_saddle) = 𝟎.
• Hessian: ∇²𝑓(𝐰_saddle) has both positive and negative eigenvalues.
Local minimum 𝐰⋆:
• Gradient: ∇𝑓(𝐰⋆) = 𝟎.
• Hessian: ∇²𝑓(𝐰⋆) does not have negative eigenvalues.
• Full gradient descent stops at either a saddle point or a local minimum.
• In 2D, the numbers of saddle points and local minima are comparable, but this is not true in high dimensions.
• In high dimensions, #saddle points is much greater than #local minima (see the counting sketch after this list):
  • The Hessian has 𝑑 eigenvalues, each of which can be positive or negative.
  • There are 2^𝑑 combinations of positive and negative eigenvalues.
  • One of the 2^𝑑 combinations (all eigenvalues positive) corresponds to local minima; one (all negative) corresponds to local maxima.
  • The remaining 2^𝑑 − 2 combinations correspond to saddle points.
• If a neural net is optimized by full gradient descent, it will converge to a saddle point.
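The counting sketch mentioned above, as a back-of-the-envelope script (it assumes, purely for illustration, that every sign pattern of the 𝑑 nonzero eigenvalues can occur at some stationary point):

```python
# Count the sign patterns of the d Hessian eigenvalues at a stationary point
# (all eigenvalues assumed nonzero, for simplicity).
for d in (2, 10, 100):
    total         = 2.0**d        # all sign patterns
    local_minima  = 1             # all eigenvalues positive
    local_maxima  = 1             # all eigenvalues negative
    saddle_points = total - local_minima - local_maxima
    print(f"d = {d:3d}: saddle patterns = {saddle_points:.3g}")
# d = 2 gives 2 (comparable to the single minimum pattern);
# d = 100 gives ~1.3e30 (saddle patterns dominate).
```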
Be Careful When Optimizing a Nonconvex Function
Be careful about the initialization!
• A bad initialization results in convergence to a bad region.
• Because of the nonconvexity, the global minimum cannot be attained.
• Rule of thumb (see the sketch below):
  • The trainable parameters (e.g., the filters of a ConvNet) are randomly initialized with proper scaling.
  • Bad scaling leads to terrible results.
  • All-zero and all-one initializations are bad ideas.
  • Pretrained parameters can be a very good initialization.
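A minimal sketch of the "proper scaling" rule of thumb (illustrative only; the layer sizes and the He-style scaling factor √(2/fan_in) are my choices, not from the slides):

```python
import numpy as np

# Random initialization with He-style scaling for a layer with fan_in inputs.
rng = np.random.default_rng(0)

def init_layer(fan_in, fan_out):
    # Gaussian weights scaled by sqrt(2 / fan_in); biases start at zero.
    W = rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_out, fan_in))
    b = np.zeros(fan_out)
    return W, b

# Bad ideas mentioned on the slide, for contrast:
#   W = np.zeros((fan_out, fan_in))  # all-zero: every unit computes the same thing
#   W = np.ones((fan_out, fan_in))   # all-one: likewise, and badly scaled
W, b = init_layer(fan_in=512, fan_out=256)
print(W.std())   # roughly sqrt(2 / 512) ~= 0.0625
```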
Be careful about the optimization algorithm!
• Full gradient descent will get stuck at a saddle point, because the gradient is near zero when approaching the saddle point.
• Stochastic gradient descent (SGD) can escape saddle points, because its updates are random and noisy (see the toy demonstration below).
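A toy demonstration of both points (my own sketch, not from the slides), using 𝑓(𝐰) = 𝑤₁² − 𝑤₂² + ½𝑤₂⁴, which has a saddle point at the origin and two local minima at (0, ±1):

```python
import numpy as np

# Starting on the w2 = 0 axis, the gradient never leaves that axis, so full
# gradient descent walks straight into the saddle; SGD-style noise escapes it.
grad = lambda w: np.array([2.0 * w[0], -2.0 * w[1] + 2.0 * w[1]**3])

def run(noise_std, steps=500, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w = np.array([1.0, 0.0])   # w2 = 0: the stable manifold of the saddle
    for _ in range(steps):
        w = w - lr * (grad(w) + rng.normal(scale=noise_std, size=2))
    return w

print(run(noise_std=0.0))   # ~[0, 0]:  full GD is stuck at the saddle point
print(run(noise_std=0.1))   # ~[0, ±1]: the noisy iterate escapes to a minimum
```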
Be careful about the batch size!
• For parallel computing with multiple GPUs, a larger batch size ⟹ lower per-epoch runtime.
• A large batch size, e.g., 10K, may result in bad generalization.
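A schematic sketch of how the batch size enters the computation (my own least-squares example, not from the slides): each SGD step averages the gradient over one minibatch, so a larger batch means fewer, less noisy updates per epoch.

```python
import numpy as np

# Minibatch SGD on a least-squares loss f_i(w) = 0.5 * (x_i^T w - y_i)^2.
rng = np.random.default_rng(0)
n, d = 10_000, 20
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def sgd_epoch(w, batch_size, lr=0.1):
    perm = rng.permutation(n)
    for start in range(0, n, batch_size):   # larger batch -> fewer updates per epoch
        idx = perm[start:start + batch_size]
        g = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # gradient averaged over batch
        w = w - lr * g
    return w

for batch_size in (32, 1024):
    w = np.zeros(d)
    for _ in range(3):                       # same number of epochs for both
        w = sgd_epoch(w, batch_size)
    updates = 3 * int(np.ceil(n / batch_size))
    print(f"batch {batch_size:4d}: {updates} updates, "
          f"||w - w_true|| = {np.linalg.norm(w - w_true):.3f}")
```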
… More about the Batch Size
• Findings reported in the paper "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" (Goyal et al., Facebook):
  • A batch size larger than 8K results in poor generalization.
  • A large batch size is good for time-efficiency.
  • Lots of tricks are required in large-batch training.
[Figure: ImageNet top-1 validation error vs. minibatch size (64 to 64K); the error range of plus/minus two standard deviations is shown. The figure is from the paper [Link]: "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour".]
… More about the Batch Size
• Researchers' conjecture (a toy sketch of this intuition follows the figure below):
  • Small batch size ⟹ flat local minima; large batch size ⟹ sharp local minima.
  • Flat local minima generalize better (on the test set).
[Figure: visualizations of the minima reached with batch size 128 vs. batch size 8192 (SGD and Adam); e.g., SGD with batch 128 reaches 7.37% / 6.00% test error vs. 11.07% / 10.19% with batch 8192. The figure is from paper [Link].]
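A toy sketch of the intuition behind the conjecture (entirely my own illustration, not from the slides or the figure above): two minima with the same training loss, one flat and one sharp; a random perturbation of the weights, standing in for the train/test mismatch, hurts the sharp minimum far more.

```python
import numpy as np

# Two 1-D "losses" with the same minimum value 0 at w = 0,
# one with a wide flat basin and one with a narrow sharp basin.
flat  = lambda w: 1.0 - np.exp(-0.1 * w**2)    # flat basin around w = 0
sharp = lambda w: 1.0 - np.exp(-10.0 * w**2)   # sharp basin around w = 0

rng = np.random.default_rng(0)
perturbation = rng.normal(scale=0.3, size=10_000)   # random weight perturbations
print(flat(perturbation).mean())    # ~0.009: the loss barely moves
print(sharp(perturbation).mean())   # ~0.4:   the loss increases a lot
```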
… More about the Batch Size
• There are also papers in support of small-batch training, e.g.,
[Link]
Do Not Believe Deep Learning Theories Blindly
• Explanations
• Empirical study
Summary
• #global minima ≪ #local minima ≪ #saddle points.
• Full gradient descent converges to a saddle point.
• SGD converges to a local minimum.
• Initialization is crucial.
• Proper scaling.
• Pretrain.
• Batch size affects time efficiency and generalization.