Gradient-based Optimization Methods

Hao Yan

January 23, 2024

Hao Yan Gradient-based Optimization Methods January 23, 2024 1 / 91


Outline

1 Complexity
Big O Notation

2 Convex and Complexity Analysis


Convexity

3 Gradient Descent and its variants


Introduction to Gradient Descent
Understand Gradient
Gradient Descent: How Machine Learns

4 Second-order Optimization Methods


Second-Order Optimization Methods

Hao Yan Gradient-based Optimization Methods January 23, 2024 2 / 91


Complexity

Section 1

Complexity

Hao Yan Gradient-based Optimization Methods January 23, 2024 3 / 91


Complexity Big O Notation

Von Neumann architecture: general-purpose processors
Components:
  Memory (RAM)
  Central processing unit (CPU)
  Input/output system
Memory stores both program and data
Program instructions execute sequentially

Hao Yan Gradient-based Optimization Methods January 23, 2024 4 / 91


Complexity Big O Notation

Random-access machine (RAM) model

A RAM machine consists of
  a fixed program
  an unbounded memory
  a read-only input tape and a write-only output tape
Assumptions
  Instructions are executed one after the other (no concurrency)
  Each tape cell can hold an arbitrary integer
  Each "simple" operation takes 1 step (+, -, =, if, call, memory access at a random location)

Hao Yan Gradient-based Optimization Methods January 23, 2024 5 / 91


Complexity Big O Notation

Space and Time Complexity

Time Complexity: count the number of flops of an algorithm with input size n
Space Complexity: memory required by the algorithm
Why is it important?
  How efficient is the algorithm?
  What happens if the problem scales up?

Hao Yan Gradient-based Optimization Methods January 23, 2024 6 / 91


Complexity Big O Notation

Asymptotic Growth

Asymptotic order of growth:
Upper bound: T(n) is O(g(n)) if there exist constants c, n0 such that $T(n) \le c \cdot g(n)$ for all $n \ge n_0$
Lower bound: T(n) is Ω(g(n)) if there exist constants c, n0 such that $T(n) \ge c \cdot g(n)$ for all $n \ge n_0$
Tight bound: T(n) is Θ(g(n)) if it is both O(g(n)) and Ω(g(n)), i.e., there exist constants c1, c2, and n0 such that $0 \le c_1 g(n) \le T(n) \le c_2 g(n)$ for all $n \ge n_0$

Hao Yan Gradient-based Optimization Methods January 23, 2024 7 / 91


Complexity Big O Notation

Mostly Used: Big-O-Notation

Count the number of flops of an algorithm with size n.


O(1) describes an algorithm that will always execute in the same time
(or space) regardless of the size of the input data set.
O(n) describes an algorithm whose performance will grow linearly and
in direct proportion to the size of the input data set.
O(n^2) represents an algorithm whose performance is directly
proportional to the square of the size of the input data set.

Hao Yan Gradient-based Optimization Methods January 23, 2024 8 / 91


Complexity Big O Notation

Examples

$2n^2 + 3n + 1 = O(n^2)$
$\tfrac{1}{2}n^2 = O(n^2)$
$n! = O(n^n)$
Hao Yan Gradient-based Optimization Methods January 23, 2024 9 / 91
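As a sanity check of these growth rates, the short sketch below (not from the slides) counts RAM-model "simple operations" for an O(n) scan and an O(n^2) double loop; doubling n roughly doubles the first count and quadruples the second. The function names and inputs are made up for illustration.

```python
# Count "simple operations" (RAM-model steps) for two algorithms as n grows.

def linear_scan(xs):
    """O(n): one comparison per element."""
    ops, best = 0, float("-inf")
    for x in xs:
        ops += 1
        best = max(best, x)
    return best, ops

def all_pairs_sum(xs):
    """O(n^2): one addition per ordered pair."""
    ops, total = 0, 0
    for a in xs:
        for b in xs:
            ops += 1
            total += a + b
    return total, ops

for n in [100, 200, 400]:
    data = list(range(n))
    _, ops_lin = linear_scan(data)
    _, ops_quad = all_pairs_sum(data)
    # Doubling n doubles the O(n) count and quadruples the O(n^2) count.
    print(n, ops_lin, ops_quad)
```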
Convex and Complexity Analysis

Section 2

Convex and Complexity Analysis

Hao Yan Gradient-based Optimization Methods January 23, 2024 10 / 91


Convex and Complexity Analysis Convexity

Convex Functions

f is convex if dom f is a convex set and
$f(\theta x + (1-\theta)y) \le \theta f(x) + (1-\theta)f(y)$
for all $x, y \in \operatorname{dom} f$, $0 \le \theta \le 1$
f is strictly convex if dom f is a convex set and
$f(\theta x + (1-\theta)y) < \theta f(x) + (1-\theta)f(y)$
for all $x, y \in \operatorname{dom} f$, $0 < \theta < 1$
Hao Yan Gradient-based Optimization Methods January 23, 2024 11 / 91
Convex and Complexity Analysis Convexity

Convex Functions for Differentiable Functions

f is differentiable if dom f is open and the gradient
$\nabla f(x) = \left( \frac{\partial f(x)}{\partial x_1}, \frac{\partial f(x)}{\partial x_2}, \ldots, \frac{\partial f(x)}{\partial x_n} \right)$
exists at each $x \in \operatorname{dom} f$

1st-order condition: a differentiable f with convex domain is convex iff
$f(y) \ge f(x) + \nabla f(x)^T (y - x)$ for all $x, y \in \operatorname{dom} f$
i.e., the first-order approximation of f is a global underestimator

2nd-order condition, for twice differentiable f:
Convex: Hessian $H = \nabla^2 f(x) \succeq 0$ for all x ($\succeq$: all eigenvalues are non-negative)
Strongly convex: $\nabla^2 f(x) \succ 0$ for all x ($\succ$: all eigenvalues are positive)
Hao Yan Gradient-based Optimization Methods January 23, 2024 12 / 91
Convex and Complexity Analysis Convexity

Convex Functions for Differential Functions

Theorem
Any locally optimal point of a convex problem is globally optimal.

Theorem
For an unconstrained differentiable convex optimization problem, x is optimal iff
$x \in \operatorname{dom} f_0$, $\nabla f_0(x) = 0$

Hao Yan Gradient-based Optimization Methods January 23, 2024 13 / 91
Convex and Complexity Analysis Convexity

Proof of These Theorems

Proof.
Suppose x is locally optimal; we prove $f_0(y) \ge f_0(x)$ for all feasible y.
Define $x_\theta = \theta y + (1-\theta)x$. By convexity,
$f_0(x_\theta) = f_0(\theta y + (1-\theta)x) \le \theta f_0(y) + (1-\theta) f_0(x)$
Therefore
$f_0(y) - f_0(x) \ge \frac{1}{\theta}\left(f_0(x_\theta) - f_0(x)\right)$
If we choose $\theta$ small enough, $x_\theta$ is close enough to x that $f_0(x_\theta) - f_0(x) \ge 0$ by local optimality. Therefore $f_0(y) \ge f_0(x)$ for all $y \in \operatorname{dom} f_0$.

Hao Yan Gradient-based Optimization Methods January 23, 2024 14 / 91


Convex and Complexity Analysis Convexity

Optimality Criterion

Theorem
x is optimal iff $\nabla f_0(x)^T (y - x) \ge 0$ for all feasible y

Nonnegativity-constrained problem: x is optimal iff $x \in \operatorname{dom} f_0$, $x \ge 0$, and
$\nabla f_0(x)_i \ge 0$ if $x_i = 0$
$\nabla f_0(x)_i = 0$ if $x_i > 0$

Theorem
x is optimal iff $x \in \operatorname{dom} f_0$, $\nabla f_0(x) = 0$ (unconstrained)

Hao Yan Gradient-based Optimization Methods January 23, 2024 15 / 91


Convex and Complexity Analysis Convexity

Strongly Convex

Theorem
For strictly or strongly convex functions, the global optimum is unique.

Ensuring a unique solution is important for model identifiability.

Hao Yan Gradient-based Optimization Methods January 23, 2024 16 / 91


Convex and Complexity Analysis Convexity

Examples of Convex Functions on R

Convex:
  Affine: $ax + b$
  Exponential: $\exp(ax)$
  Powers: $x^\alpha$ on $x > 0$ for $\alpha \ge 1$ or $\alpha \le 0$
  Powers of absolute value: $|x|^p$ on R for $p \ge 1$
  Negative entropy: $x \log x$ on $x > 0$
Concave:
  Affine: $ax + b$
  Logarithm: $\log x$
  Powers: $x^\alpha$ on $x > 0$ for $\alpha \in [0, 1]$

Hao Yan Gradient-based Optimization Methods January 23, 2024 17 / 91


Convex and Complexity Analysis Convexity

Examples of Convex Functions on $\mathbb{R}^n$ and $\mathbb{R}^{m \times n}$

Examples on $\mathbb{R}^n$
  Affine: $f(x) = a^T x + b$
  Norms: $\|x\|_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}$
Examples on $\mathbb{R}^{m \times n}$
  Affine: $f(X) = \operatorname{tr}(A^T X) + b$
  Spectral norm: $f(X) = \|X\|_2 = \sigma_{\max}(X)$

Hao Yan Gradient-based Optimization Methods January 23, 2024 18 / 91


Convex and Complexity Analysis Convexity

A Brief History

A. Convex optimization: before 1980
  Develop more efficient algorithms for convex problems
B. Convex optimization for machine learning: 1980-2000
  Apply mature convex optimization algorithms to machine learning models
C. Large-scale (high-dimensional) convex optimization: 2000-2010
  Algorithms become more specific to machine learning models/convex problems (can be non-smooth)
  Scalability is very important
D. Huge-scale machine learning: 2010-now; convexity is no longer a prerequisite for optimization
  Non-convex problems
  Parallel / distributed optimization

Hao Yan Gradient-based Optimization Methods January 23, 2024 19 / 91


Convex and Complexity Analysis Convexity

What About Non-convex Functions?

A stationary point $\nabla f(x) = 0$ can be
  a local minimum
  a local maximum
  a saddle point

Why is non-convex optimization hard in high dimensions?

Hao Yan Gradient-based Optimization Methods January 23, 2024 20 / 91
Convex and Complexity Analysis Convexity

Non-convex Optimization is Hard - Many Local Optima

Non-convex functions may have many local minima
Are the local minima equally good?
  No, but they are almost equally good for many structured problems; finding one is typically good enough.
Hao Yan Gradient-based Optimization Methods January 23, 2024 21 / 91


Convex and Complexity Analysis Convexity

Non-convex Optimization is Hard - Many More Saddle Points

Non-convex functions have many more saddle points in high dimensions
How to escape from saddle points in high-dimensional problems?
Avoiding saddle points becomes more important (there are many more saddle points in high dimensions)

Hao Yan Gradient-based Optimization Methods January 23, 2024 22 / 91


Convex and Complexity Analysis Convexity

Non-convex Problem: Active in Research

If f is non-convex, most algorithms can only converge to stationary points.
Case-by-case study is needed for global optimum guarantees
  Often based on solving convex subproblems
Classical non-convex problem that can be easily solved:
$\min_{\operatorname{rank}(X)=k} \|Y - X\|_F^2$
Other examples of non-convex optimization that cannot be solved easily (globally):
  Tensor factorization
  Neural networks

Hao Yan Gradient-based Optimization Methods January 23, 2024 23 / 91


Convex and Complexity Analysis Convexity

Example: Convexity of Linear Regression

Problem:
$\min_\theta \|y - X\theta\|^2$
Please compute the Hessian matrix.
  Is it convex?
  Is it strongly convex?
  How about ridge regression, $\min_\theta \|y - X\theta\|^2 + \lambda\|\theta\|^2$?

Hao Yan Gradient-based Optimization Methods January 23, 2024 24 / 91


(Note: the Hessian is $2X^T X \succeq 0$, so the least-squares loss is convex; if $X^T X$ is rank deficient, e.g. under multicollinearity or when $p > n$, some eigenvalues are 0, the loss is not strictly convex, and the solution is non-unique.)
Convex and Complexity Analysis Convexity

Example: Convexity of Logistic Regression

Logistic regression loss function (binary cross-entropy, BCE):
$L(\theta) = \sum_i -y_i \log(\sigma(\theta^T x_i)) - (1-y_i)\log(1 - \sigma(\theta^T x_i))$
Here $\sigma(x) = \frac{1}{1+\exp(-x)}$ and $\frac{d}{dx}\sigma(x) = \sigma(x)(1-\sigma(x))$.
Please compute the Hessian matrix.
  Is it convex?
  Is it strongly convex?
  How about ridge (L2-regularized) logistic regression?

Hao Yan Gradient-based Optimization Methods January 23, 2024 25 / 91


Convex and Complexity Analysis Convexity

Example: Support Vector Machine

Support Vector Machine

$\min_w \frac{1}{2}\|w\|^2 + C \sum_i \max\{0,\, 1 - y_i w^T x_i\}$

$\max\{0,\, 1 - w^T a\}$ is a convex function.
$\frac{1}{2}\|w\|^2$ is also a convex function.

Hao Yan Gradient-based Optimization Methods January 23, 2024 26 / 91


Gradient Descent and its variants

Section 3

Gradient Descent and its variants

Hao Yan Gradient-based Optimization Methods January 23, 2024 27 / 91


Gradient Descent and its variants Introduction to Gradient Descent

Why Gradient Descent?

Gradient descent lies at the heart of modern machine learning algorithms
  Simple to use
  Acceptable convergence properties, yet scalable to big data and high-dimensional problems
    Big n: stochastic optimization
    Big p: model parallelization

Hao Yan Gradient-based Optimization Methods January 23, 2024 28 / 91


Gradient Descent and its variants Introduction to Gradient Descent

Gradient Descent - How Machine Learns

Video from https://www.youtube.com/watch?v=V1eYniJ0Rnk

Hao Yan Gradient-based Optimization Methods January 23, 2024 29 / 91


Gradient Descent and its variants Introduction to Gradient Descent

Gradient Descent - How Machine Learns

Hao Yan Gradient-based Optimization Methods January 23, 2024 30 / 91


Gradient Descent and its variants Introduction to Gradient Descent

Gradient Descent - How Machine Learns

[Diagram: the input passes through a differentiable model to produce a prediction ŷ; the prediction is compared with the truth y, and gradient descent updates the model parameters θ of the learner.]

Hao Yan Gradient-based Optimization Methods January 23, 2024 31 / 91


Gradient Descent and its variants Introduction to Gradient Descent

Gradient Descent - How Machine Learns

Video from
https://www.youtube.com/watch?v=IHZwWFHWa-w\&t=707s
Hao Yan Gradient-based Optimization Methods January 23, 2024 32 / 91
Gradient Descent and its variants Understand Gradient

Vector Differentiation

x: a vector defined as $x = (x_1, \ldots, x_n)^T$
Assume $y = f(x)$, where x is a scalar and the output y is a vector:
$\frac{dy}{dx} = \left( \frac{\partial f_1}{\partial x}, \ldots, \frac{\partial f_n}{\partial x} \right)^T$

Our interest: vector differentiation, or the gradient, when $y = f(x)$ is a scalar and x is a vector (e.g., the model parameters):
$\frac{dy}{dx} = \nabla_x y = \left( \frac{\partial f}{\partial x_1}, \ldots, \frac{\partial f}{\partial x_n} \right)^T$

Why is this our interest?
Hao Yan Gradient-based Optimization Methods January 23, 2024 33 / 91
Gradient Descent and its variants Understand Gradient

Visualize Gradient

The gradient of $f(x_1, \cdots, x_p)$ with respect to $x = (x_1, \cdots, x_p)$ is given as
$\nabla f = \frac{df}{dx} = \left( \frac{\partial f}{\partial x_1}, \cdots, \frac{\partial f}{\partial x_p} \right)$
Hao Yan Gradient-based Optimization Methods January 23, 2024 34 / 91


Gradient Descent and its variants Understand Gradient

Derivative, Gradient

The gradient of $f(x_1, \cdots, x_p)$ with respect to $x = (x_1, \cdots, x_p)$ is given as
$\nabla f = \left( \frac{\partial f}{\partial x_1}, \cdots, \frac{\partial f}{\partial x_p} \right)$

Example: $f(x, y, z) = x + 2y + 3z$, $\nabla f(x, y, z) = (1, 2, 3)$
Example: $f(x, y, z) = x^2 + 2y^2 + 3z^2$, $\nabla f(x, y, z) = (2x, 4y, 6z)$

Hao Yan Gradient-based Optimization Methods January 23, 2024 35 / 91


Gradient Descent and its variants Understand Gradient

Types of Matrix Derivative

Types of Matrix Derivative

                  Scalar y       Vector y        Matrix Y
Scalar x          ∂y/∂x          ∂y/∂x           ∂Y/∂x
Vector x          ∂y/∂x (gradient)   ∂y/∂x (Jacobian)
Matrix X          ∂y/∂X

A vector-to-scalar derivative is the gradient; a vector-to-vector derivative is the Jacobian.
Hao Yan Gradient-based Optimization Methods January 23, 2024 36 / 91
Gradient Descent and its variants Understand Gradient

Useful Rules

$\frac{\partial f(x)}{\partial x} = \left( \frac{\partial f(x)}{\partial x_1}, \cdots, \frac{\partial f(x)}{\partial x_p} \right)^T$

$\frac{\partial a^T x}{\partial x} = \frac{\partial x^T a}{\partial x} = \left( \frac{\partial a^T x}{\partial x_1}, \cdots, \frac{\partial a^T x}{\partial x_p} \right)^T = (a_1, \cdots, a_p)^T = a$

$\frac{\partial Ax}{\partial x} = A^T$
Hao Yan Gradient-based Optimization Methods January 23, 2024 37 / 91
Gradient Descent and its variants Understand Gradient

Example: Linear Regression

Linear regression loss function:
$l(\theta) = \min_\theta \frac{1}{n}\|y - X\theta\|^2$
What is the gradient $\frac{\partial l(\theta)}{\partial \theta}$?
What is the optimal point, and is it unique?

Hao Yan Gradient-based Optimization Methods January 23, 2024 38 / 91
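One quick way to answer the quiz above is to derive the gradient and check it against finite differences. The sketch below (not from the slides) assumes the answer $\nabla_\theta l(\theta) = -\frac{2}{n}X^T(y - X\theta)$, which matches the gradient given later in the deck, and verifies it numerically on random data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
theta_true = rng.normal(size=p)
y = X @ theta_true + 0.1 * rng.normal(size=n)

def loss(theta):
    r = y - X @ theta
    return (r @ r) / n

def grad(theta):
    # Analytical gradient of (1/n) * ||y - X theta||^2
    return -(2.0 / n) * X.T @ (y - X @ theta)

theta = rng.normal(size=p)
eps = 1e-6
# Central finite differences, one coordinate at a time
fd = np.array([
    (loss(theta + eps * e) - loss(theta - eps * e)) / (2 * eps)
    for e in np.eye(p)
])
print(np.max(np.abs(fd - grad(theta))))  # tiny: the formulas agree
```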


Gradient Descent and its variants Understand Gradient

Jacobian

Assume $y = f(x)$, where x and y are vectors of size n and m.
$f = (f_1, \cdots, f_m)$, $x = (x_1, \cdots, x_n)$, $y = (y_1, \cdots, y_m)$, where $y_i = f_i(x)$
Define the Jacobian of the vector-to-vector mapping $\frac{\partial y}{\partial x}$ as
$\frac{\partial y}{\partial x} = \begin{pmatrix} \frac{\partial f_1(x)}{\partial x_1} & \cdots & \frac{\partial f_1(x)}{\partial x_n} \\ \vdots & & \vdots \\ \frac{\partial f_m(x)}{\partial x_1} & \cdots & \frac{\partial f_m(x)}{\partial x_n} \end{pmatrix}$
Special case: x and y are of the same size
The multivariable function f looks locally like a linear transformation of x
Known as the Jacobian matrix
Hao Yan Gradient-based Optimization Methods January 23, 2024 39 / 91


Gradient Descent and its variants Understand Gradient

Example of Jacobian

$x = r\cos\theta$, $y = r\sin\theta$
Define $a = (x, y)$, $b = (r, \theta)$
$J = \frac{\partial a}{\partial b}$
Meaning:
$\begin{pmatrix} dx \\ dy \end{pmatrix} = \begin{pmatrix} \frac{\partial x}{\partial r} & \frac{\partial x}{\partial \theta} \\ \frac{\partial y}{\partial r} & \frac{\partial y}{\partial \theta} \end{pmatrix} \begin{pmatrix} dr \\ d\theta \end{pmatrix} = \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix} \begin{pmatrix} dr \\ d\theta \end{pmatrix}$

If we compute the area:
$dx\,dy = \left| \det \begin{pmatrix} \cos\theta & -r\sin\theta \\ \sin\theta & r\cos\theta \end{pmatrix} \right| dr\,d\theta = r\,dr\,d\theta$

The absolute value of the Jacobian determinant times the area of the corresponding rectangle.

Hao Yan Gradient-based Optimization Methods January 23, 2024 40 / 91


Gradient Descent and its variants Understand Gradient

Derivative, Gradient, and Hessian


A special case: the Hessian matrix
$H = \frac{\partial}{\partial x}\left(\frac{\partial f}{\partial x}\right) = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix}$
The Hessian is symmetric: $H_{ij} = H_{ji}$.

Example: $f(x,y,z) = x^3 + 2y^2 + 3z^2 + xy$, $H = \begin{pmatrix} 6x & 1 & 0 \\ 1 & 4 & 0 \\ 0 & 0 & 6 \end{pmatrix}$

The Hessian encodes the curvature information of the function.
Hao Yan Gradient-based Optimization Methods January 23, 2024 41 / 91
Gradient Descent and its variants Understand Gradient

Hessian and Curvature

Curvature in 1D: for $y = f(x)$, $\kappa = \frac{y''}{(1 + y'^2)^{3/2}}$; a larger $y''$ means larger curvature. $\kappa = \frac{1}{R}$, where R is the radius of curvature.

Hao Yan Gradient-based Optimization Methods January 23, 2024 42 / 91


Gradient Descent and its variants Understand Gradient

Hessian and Curvature

Curvature in 2D: the Gaussian curvature is
$K = \kappa_1 \kappa_2 = \frac{\det(H)}{(1 + \|\nabla f\|^2)^2}$
At a critical point $\nabla f = 0$, so $K = \det(H)$.

Types: $\kappa_1 > 0, \kappa_2 > 0$: bowl-like. $\kappa_1 < 0, \kappa_2 < 0$: mountain-like. $\kappa_1 \kappa_2 < 0$: saddle-like.

Hao Yan Gradient-based Optimization Methods January 23, 2024 43 / 91


Gradient Descent and its variants Understand Gradient

Gradient Rules

Linear rule: $\nabla(\alpha f(x) + \beta g(x)) = \alpha \nabla f(x) + \beta \nabla g(x)$

Product rule:
$\frac{\partial (f(x) g(x))}{\partial x} = f(x)\frac{\partial g(x)}{\partial x} + g(x)\frac{\partial f(x)}{\partial x}$
or equivalently
$\nabla(f(x) g(x)) = f(x)\nabla g(x) + g(x)\nabla f(x)$

Hao Yan Gradient-based Optimization Methods January 23, 2024 44 / 91


Gradient Descent and its variants Understand Gradient

Chain Rule

Suppose we have a vector function g(x) and a scalar function f(x):
$\frac{\partial f(\mathbf{g}(x))}{\partial x} = \left( \frac{\partial \mathbf{g}(x)}{\partial x} \right)^T \frac{\partial f(a)}{\partial a}$
Here $a = g(x)$.
$\frac{\partial \mathbf{g}(x)}{\partial x}$ is the Jacobian of g; $\frac{\partial f(a)}{\partial a}$ is the gradient of f.

Hao Yan Gradient-based Optimization Methods January 23, 2024 45 / 91


Gradient Descent and its variants Understand Gradient

Product Rule (2)

Suppose we have two vector functions f(x) and g(x):
$\frac{\partial\, \mathbf{f}^T(x)\mathbf{g}(x)}{\partial x} = \frac{\partial\, \mathbf{f}^T(x)}{\partial x}\mathbf{g}(x) + \frac{\partial\, \mathbf{g}^T(x)}{\partial x}\mathbf{f}(x)$
Some special examples:
$\frac{\partial}{\partial x}(u^T v) = \frac{\partial u^T}{\partial x} v + \frac{\partial v^T}{\partial x} u$
$\frac{\partial}{\partial x}\|u\|^2 = \frac{\partial}{\partial x} u^T u = \frac{\partial u^T}{\partial x} u + \frac{\partial u^T}{\partial x} u = 2\frac{\partial u^T}{\partial x} u$
$\frac{\partial}{\partial x}(x^T A x) = 2Ax$ if $A^T = A$

Hao Yan Gradient-based Optimization Methods January 23, 2024 46 / 91


Gradient Descent and its variants Understand Gradient

Property of Gradients

Theorem
$\lim_{\epsilon \to 0} \frac{f(x + \epsilon h) - f(x)}{\epsilon} = \nabla f \cdot h$

Theorem
$\max_{\|v\|=1} (\nabla f \cdot v)$ achieves its maximum at $v = \frac{\nabla f}{\|\nabla f\|}$

Proof.
$\nabla f \cdot v = \|\nabla f\| \|v\| \cos\theta \le \|\nabla f\|$; equality holds when $v = c\nabla f$, where $c = \frac{1}{\|\nabla f\|}$. (Similarly, the minimum is attained at $v = -\nabla f / \|\nabla f\|$, the steepest-descent direction.)
Hao Yan Gradient-based Optimization Methods January 23, 2024 47 / 91
Gradient Descent and its variants Gradient Descent: How Machine Learns

Why Gradient Descent?

Gradient descent lies at the heart of modern machine learning algorithms
  Simple to use
  Acceptable convergence properties, yet scalable to big data and high-dimensional problems
    Big n: stochastic optimization
    Big p: model parallelization

Hao Yan Gradient-based Optimization Methods January 23, 2024 48 / 91


Gradient Descent and its variants Gradient Descent: How Machine Learns

Derivative and Gradient


The gradient of $f(x_1, \cdots, x_p)$ with respect to $x = (x_1, \cdots, x_p)$ is given as
$\nabla f = \left( \frac{\partial f}{\partial x_1}, \cdots, \frac{\partial f}{\partial x_p} \right)$

Hao Yan Gradient-based Optimization Methods January 23, 2024 49 / 91


Gradient Descent and its variants Gradient Descent: How Machine Learns

Derivative, Gradient

The gradient of $f(x_1, \cdots, x_p)$ with respect to $x = (x_1, \cdots, x_p)$ is given as
$\nabla f = \left( \frac{\partial f}{\partial x_1}, \cdots, \frac{\partial f}{\partial x_p} \right)$

Example: $f(x, y, z) = x + 2y + 3z$, $\nabla f(x, y, z) = (1, 2, 3)$
Example: $f(x, y, z) = x^2 + 2y^2 + 3z^2$, $\nabla f(x, y, z) = (2x, 4y, 6z)$

Hao Yan Gradient-based Optimization Methods January 23, 2024 50 / 91


Gradient Descent and its variants Gradient Descent: How Machine Learns

Derivative, Gradient, and Hessian


Hessian matrix:
$H = \begin{pmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{pmatrix}$

Example: $f(x, y, z) = x^3 + 2y^2 + 3z^2 + xy$, $H = \begin{pmatrix} 6x & 1 & 0 \\ 1 & 4 & 0 \\ 0 & 0 & 6 \end{pmatrix}$, which is not p.s.d. for all (x, y, z)
Hao Yan Gradient-based Optimization Methods January 23, 2024 51 / 91
Gradient Descent and its variants Gradient Descent: How Machine Learns

Property of Gradients

Theorem
$\lim_{\epsilon \to 0} \frac{f(x + \epsilon h) - f(x)}{\epsilon} = \nabla f \cdot h$

Theorem
$\max_{\|v\|=1} (\nabla f \cdot v)$ achieves its maximum at $v = \frac{\nabla f}{\|\nabla f\|}$

Proof.
$\nabla f \cdot v = \|\nabla f\| \|v\| \cos\theta \le \|\nabla f\|$; equality holds when $v = c\nabla f$, where $c = \frac{1}{\|\nabla f\|}$.

Hao Yan Gradient-based Optimization Methods January 23, 2024 52 / 91


Gradient Descent and its variants Gradient Descent: How Machine Learns

Gradient Descent

Assume that f and $\nabla f$ can be easily evaluated at each iteration.
Recall that we have $f: \mathbb{R}^n \to \mathbb{R}$, convex and differentiable, and want to solve
$\min_{x \in \mathbb{R}^n} f(x)$
i.e., find $x^*$ such that $f(x^*) = \min f(x)$.
Steepest descent:
$x_{k+1} = x_k - \alpha_k \nabla f(x_k)$
How to choose $\alpha_k$:
  Trial and error: select a fixed $\alpha_k$, or reduce $\alpha_k$ after $f(x_k)$ stabilizes
  Backtracking: try $\alpha_0, \frac{1}{2}\alpha_0, \frac{1}{4}\alpha_0, \cdots$ until a sufficient decrease in f is obtained
  Use information from the function f to decide
Hao Yan Gradient-based Optimization Methods January 23, 2024 53 / 91
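A minimal sketch of the steepest-descent loop above in Python (not from the slides), run on the quadratic $f(x) = (10x_1^2 + x_2^2)/2$ used on the next few slides. The step-size values and the Armijo constant 0.5 in the backtracking test are illustrative choices.

```python
import numpy as np

def f(x):
    return (10 * x[0] ** 2 + x[1] ** 2) / 2

def grad_f(x):
    return np.array([10 * x[0], x[1]])

def gradient_descent(x0, steps=100, alpha0=1.0, backtracking=True):
    x = np.array(x0, dtype=float)
    for _ in range(steps):
        g = grad_f(x)
        alpha = alpha0
        if backtracking:
            # Halve the step until a sufficient decrease is obtained
            while f(x - alpha * g) > f(x) - 0.5 * alpha * (g @ g):
                alpha /= 2
        x = x - alpha * g
    return x

print(gradient_descent([1.0, 10.0]))                                   # backtracking
print(gradient_descent([1.0, 10.0], alpha0=0.05, backtracking=False))  # small fixed step
```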
Gradient Descent and its variants Gradient Descent: How Machine Learns

Big Step Size

Fixed step size: take $t_k = t$ for all $k = 1, 2, 3, \ldots$; gradient descent can diverge if t is too big.
Consider $f(x) = (10x_1^2 + x_2^2)/2$, gradient descent after 8 steps:
[Figure: contour plot of f with the gradient descent iterates]

Hao Yan Gradient-based Optimization Methods January 23, 2024 54 / 91


Gradient Descent and its variants Gradient Descent: How Machine Learns

Small Step Size

Gradient descent can be slow if t is too small. Same example, gradient descent after 100 steps:
[Figure: contour plot of f with the gradient descent iterates]

Hao Yan Gradient-based Optimization Methods January 23, 2024 55 / 91


Gradient Descent and its variants Gradient Descent: How Machine Learns

Correct Step Size

Backtracking picks up roughly the right step size (13 steps).
Consider $f(x) = (10x_1^2 + x_2^2)/2$, gradient descent with backtracking after 13 steps:
[Figure: contour plot of f with the gradient descent iterates]

Hao Yan Gradient-based Optimization Methods January 23, 2024 56 / 91


Gradient Descent and its variants Gradient Descent: How Machine Learns

Line Search

Gradient step: $x_{k+1} = x_k - \alpha \nabla f(x_k)$
Exact line search: find the scalar $\alpha$ that minimizes f along the ray:
$\alpha_k = \arg\min_\alpha f(x_k - \alpha \nabla f(x_k))$
A simple bisection rule can be used to find $\alpha$
Inexact line search: look for a stepsize that yields a sufficient decrease
Decaying stepsize: $\alpha_k = 1/\sqrt{k}$
Fixed stepsize
Hao Yan Gradient-based Optimization Methods January 23, 2024 57 / 91


Gradient Descent and its variants Gradient Descent: How Machine Learns

Strong Convexity

If $\mu I \preceq \nabla^2 f(x) \preceq L I$, then
$\frac{\mu}{2}\|y - x\|_2^2 \le f(y) - f(x) - \nabla f(x)^T (y - x) \le \frac{L}{2}\|y - x\|_2^2$
For a strongly convex problem, $\mu > 0$
For a weakly convex problem, $\mu = 0$
Hao Yan Gradient-based Optimization Methods January 23, 2024 58 / 91
(Notes on convergence speed: how fast does $\delta_k = \|x_k - x^*\| \to 0$?
  Linear convergence: $\delta_{k+1}/\delta_k \le c < 1$
  Quadratic convergence: $\delta_{k+1}/\delta_k^2 \le c$
  Sublinear convergence: e.g. $\delta_k \le C/k$
How many iterations to achieve $\epsilon$ accuracy? Linear convergence needs on the order of $\log(1/\epsilon)$ iterations; quadratic convergence needs on the order of $\log\log(1/\epsilon)$; a sublinear rate $\delta_k \le C/k$ needs on the order of $1/\epsilon$.)
Gradient Descent and its variants Gradient Descent: How Machine Learns

Why Strong Convexity Matters?

A strongly convex function is also strictly convex


A strictly convex function has a unique global optimum

Hao Yan Gradient-based Optimization Methods January 23, 2024 59 / 91


Gradient Descent and its variants Gradient Descent: How Machine Learns

Stationary Point Convergence

Theorem
Assume $\nabla^2 f(x) \preceq LI$. Then $\nabla f(x_k) \to 0$, i.e., the algorithm converges to a stationary point.

Proof.
The proof sketch is laid out as follows:
Smoothness: $f(x_{k+1}) - f(x_k) - \nabla f(x_k)^T (x_{k+1} - x_k) \le \frac{L}{2}\|x_{k+1} - x_k\|_2^2$
Since $x_{k+1} = x_k - \alpha_k \nabla f(x_k)$,
$f(x_{k+1}) \le f(x_k) - \alpha_k \left(1 - \frac{\alpha_k L}{2}\right)\|\nabla f(x_k)\|_2^2$
Selecting step size $\alpha_k = 1/L$ gives $f(x_{k+1}) \le f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|_2^2$
Summing from $k = 0$ to $k = N$:
$\sum_{k=0}^{N} \|\nabla f(x_k)\|^2 \le 2L\left(f(x_0) - f(x_{N+1})\right)$

Hao Yan Gradient-based Optimization Methods January 23, 2024 60 / 91


Gradient Descent and its variants Gradient Descent: How Machine Learns

Stationary Point Convergence

Theorem
$\|x_k - x^*\|$ is a non-increasing sequence in k with $\alpha_k = \frac{1}{L}$

Proof.
$\|x_{k+1} - x^*\|^2 = \|x_k - x^* - \tfrac{1}{L}\nabla f(x_k)\|^2$
$= \|x_k - x^*\|^2 - \tfrac{2}{L}\nabla f(x_k)^T (x_k - x^*) + \tfrac{1}{L^2}\|\nabla f(x_k)\|_2^2$
$\le \|x_k - x^*\|^2 - \tfrac{1}{L^2}\|\nabla f(x_k)\|_2^2$
using co-coercivity, $\nabla f(x_k)^T (x_k - x^*) \ge \tfrac{1}{L}\|\nabla f(x_k)\|^2$.
Hence $\|x_k - x^*\|^2$ is non-increasing in k.

Hao Yan Gradient-based Optimization Methods January 23, 2024 61 / 91


Gradient Descent and its variants Gradient Descent: How Machine Learns

Stationary Point Convergence

Theorem
If $\nabla^2 f(x) \preceq LI$, then $f(x^*) \le f(x - \tfrac{1}{L}\nabla f(x)) \le f(x) - \tfrac{1}{2L}\|\nabla f(x)\|^2$

Proof.
First we can show that
$f(x^*) \le f(x - \tfrac{1}{L}\nabla f(x))$
$\le f(x) - \tfrac{1}{L}\nabla f(x)^T \nabla f(x) + \tfrac{L}{2}\|\tfrac{1}{L}\nabla f(x)\|_2^2$
$= f(x) - \tfrac{1}{2L}\|\nabla f(x)\|_2^2$
Hao Yan Gradient-based Optimization Methods January 23, 2024 62 / 91


Gradient Descent and its variants Gradient Descent: How Machine Learns

Weakly Convex: 1/k sublinear


Theorem
Assume $\mu I \preceq \nabla^2 f(x) \preceq LI$. If $\mu \ge 0$, then $f(x_k) - f(x^*) \le \frac{2L\|x_0 - x^*\|^2}{k}$, i.e., $f(x_k)$ has sublinear convergence. Here $x^*$ is the optimal solution.

Denote the optimum by $x^*$; it can be shown that $\{\|x_k - x^*\|\}$ is non-increasing.
Define $\Delta_k = f(x_k) - f(x^*)$. By convexity we have
$\Delta_k \le \nabla f(x_k)^T (x_k - x^*) \le \|\nabla f(x_k)\|\,\|x_k - x^*\| \le \|\nabla f(x_k)\|\,\|x_0 - x^*\|$
From the descent step,
$f(x_{k+1}) \le f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|_2^2$
Subtracting $f(x^*)$ from both sides, we have
$\Delta_{k+1} \le \Delta_k - \frac{1}{2L}\|\nabla f(x_k)\|^2 \le \Delta_k - \frac{\Delta_k^2}{2L\|x_0 - x^*\|^2}$
Hao Yan Gradient-based Optimization Methods January 23, 2024 63 / 91


Gradient Descent and its variants Gradient Descent: How Machine Learns

Weakly convex: 1/k sublinear

Take the inverse and apply $(1 - \epsilon)^{-1} \ge 1 + \epsilon$:
$\frac{1}{\Delta_{k+1}} \ge \frac{1}{\Delta_k} + \frac{1}{2L\|x_0 - x^*\|^2}$
$\frac{1}{\Delta_{k+1}} \ge \frac{1}{\Delta_0} + \frac{k+1}{2L\|x_0 - x^*\|^2} \ge \frac{k+1}{2L\|x_0 - x^*\|^2}$
which yields
$f(x_{k+1}) - f(x^*) \le \frac{2L\|x_0 - x^*\|^2}{k+1}$
The classical 1/k convergence rate (sublinear).

Hao Yan Gradient-based Optimization Methods January 23, 2024 64 / 91


Gradient Descent and its variants Gradient Descent: How Machine Learns

Strong Convex: Linear


Theorem
Assume $\mu I \preceq \nabla^2 f(x) \preceq LI$. If $\mu > 0$, $x_k$ has linear convergence:
$\|x_{k+1} - x^*\|_2^2 \le (1 - \alpha\mu)\|x_k - x^*\|^2$ if $0 < \alpha \le \frac{1}{L}$

Proof.
Plugging in strong convexity, $f(x^*) \ge f(x_k) + \nabla f(x_k)^T (x^* - x_k) + \frac{\mu}{2}\|x_k - x^*\|^2$, and $f(x^*) \le f(x_k) - \frac{1}{2L}\|\nabla f(x_k)\|_2^2$, we have

$\|x_{k+1} - x^*\|_2^2 = \|x_k - x^* - \alpha\nabla f(x_k)\|^2$
$= \|x_k - x^*\|^2 - 2\alpha\nabla f(x_k)^T (x_k - x^*) + \alpha^2\|\nabla f(x_k)\|^2$
$\le (1 - \alpha\mu)\|x_k - x^*\|^2 - 2\alpha(f(x_k) - f(x^*)) + \alpha^2\|\nabla f(x_k)\|^2$
$\le (1 - \alpha\mu)\|x_k - x^*\|^2 - 2\alpha(1 - \alpha L)(f(x_k) - f(x^*))$

If we let $\alpha \le \frac{1}{L}$, the term $-2\alpha(1 - \alpha L)(f(x_k) - f(x^*))$ is non-positive and can be dropped.

Hao Yan Gradient-based Optimization Methods January 23, 2024 65 / 91
Gradient Descent and its variants Gradient Descent: How Machine Learns

Strong Convex: Linear

Theorem
Assume $\mu I \preceq \nabla^2 f(x) \preceq LI$. If $\mu > 0$, then by setting $\alpha_k = \frac{2}{L+\mu}$, $f(x_k)$ has linear convergence:
$f(x_k) - f(x^*) \le \frac{L}{2}\left(1 - \frac{2}{\kappa + 1}\right)^{2k}\|x_0 - x^*\|^2$
where $\kappa = L/\mu$ is the condition number.

The strongly convex case yields a linear/geometric rate, which is generally much better than any sublinear rate.

Hao Yan Gradient-based Optimization Methods January 23, 2024 66 / 91


Gradient Descent and its variants Gradient Descent: How Machine Learns

What is the Problem of Gradient Descent?

The slow linear rate is typical, not just a pessimistic bound!
$f(x_k) - f(x^*) \le \frac{L}{2}\left(1 - \frac{2}{\kappa+1}\right)^{2k}\|x_0 - x^*\|^2$; when $\kappa \gg 1$, $1 - \frac{2}{\kappa+1} \approx 1$, so the progress per iteration is tiny.

Hao Yan Gradient-based Optimization Methods January 23, 2024 67 / 91


Gradient Descent and its variants Gradient Descent: How Machine Learns

FISTA Acceleration

Basic idea: aggregate the gradient over history (momentum):
$x_{k+1} = x_k + \alpha_k p_k$, where $p_k = -\nabla f(x_k) + \beta_k p_{k-1}$
Accelerated version (FISTA):
$x_k = y_k - \frac{1}{L}\nabla f(y_k)$
$t_{k+1} = \frac{1}{2}\left(1 + \sqrt{1 + 4t_k^2}\right)$
$y_{k+1} = x_k + \frac{t_k - 1}{t_{k+1}}(x_k - x_{k-1})$
For weakly convex f, it converges with $f(x_k) - f(x^*) \sim \frac{1}{k^2}$
$1/k^2$ is the optimal sublinear rate for this problem class.
Hao Yan Gradient-based Optimization Methods January 23, 2024 68 / 91
Gradient Descent and its variants Gradient Descent: How Machine Learns

Example: Linear Regression

Linear regression (ridge) loss function:
$l(\beta) = \min_\beta \frac{1}{n}\|y - X\beta\|^2 + \lambda\|\beta\|^2$
Gradient: $\nabla_\beta l(\beta) = -\frac{2}{n}X^T(y - X\beta) + 2\lambda\beta$
Gradient descent update:
$\beta_{k+1} = \beta_k + \alpha_k\left(\frac{2}{n}X^T(y - X\beta_k) - 2\lambda\beta_k\right)$
Hao Yan Gradient-based Optimization Methods January 23, 2024 69 / 91
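A minimal sketch of this gradient-descent update on synthetic data (not from the slides); X, y, λ, the fixed step size, and the iteration count are illustrative assumptions, and the result is compared against the closed-form ridge solution.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, lam = 200, 5, 0.1
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta = np.zeros(p)
alpha = 0.05                     # fixed step size (assumed)
for k in range(500):
    grad = -(2.0 / n) * X.T @ (y - X @ beta) + 2 * lam * beta
    beta = beta - alpha * grad

# Closed-form minimizer of (1/n)||y - X beta||^2 + lam*||beta||^2
closed = np.linalg.solve(X.T @ X / n + lam * np.eye(p), X.T @ y / n)
print(np.max(np.abs(beta - closed)))   # small after enough iterations
```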
Gradient Descent and its variants Gradient Descent: How Machine Learns

Linear Regression Loss Function

Hessian matrix:
$H = \nabla^2 l(\beta) = \nabla\left(-\frac{2}{n}X^T(y - X\beta) + 2\lambda\beta\right) = \frac{2}{n}X^T X + 2\lambda I \succeq 0$
Is it convex? Yes!
Is it strongly convex?
  If $\lambda > 0$: yes!
  If $\lambda = 0$:
    If $n > p$ and X is full rank (rank p), it is strongly convex
    If $p > n$, it is not strongly convex!

Hao Yan Gradient-based Optimization Methods January 23, 2024 70 / 91


Gradient Descent and its variants Gradient Descent: How Machine Learns

Complexity of Gradient Descent

Time complexity (per iteration):
  Computing $X\beta$ takes O(np)
  Computing $X^T(y - X\beta)$ takes O(np)
Space complexity:
  Storing X takes O(np)

Hao Yan Gradient-based Optimization Methods January 23, 2024 71 / 91


Gradient Descent and its variants Gradient Descent: How Machine Learns

Comparison of Gradient Descent and Analytical solution

Closed form: $\beta = (X^T X)^{-1} X^T y$

                      Space     Time
Gradient Descent      O(np)     O(np) per iteration
Analytical Solution   O(np)     O(np^2)

Question:
  Is gradient descent strictly better due to its smaller complexity?
  When should we use the analytical solution?

(Note: the total cost of gradient descent is the per-iteration cost times the number of iterations needed to reach accuracy $\epsilon$, so the comparison depends on the required accuracy and the conditioning of the problem.)
Hao Yan Gradient-based Optimization Methods January 23, 2024 72 / 91
Second-order Optimization Methods

Section 4

Second-order Optimization Methods

Hao Yan Gradient-based Optimization Methods January 23, 2024 73 / 91


Second-order Optimization Methods Second-Order Optimization Methods

Second-order Optimization Method - Newton’s Method


Suppose we want to solve

HEI
min f (x)
x
At x = x0 , f (x) can be approximated by
T 1
f (x) ⇡ h(x) := f (x0 ) + rf (x0 ) (x x0 ) + (x x0 )T H(x0 )(x x0 )
2

HEI
H(x0 ) is the Hessian of f (x) defined by
@2
Hij (x0 ) = f (x)
@xi @xj
1
arg min f (x) ⇡ arg min h(x) = x0 H(x0 ) rf (x0 )
x x

Take Home Message:


Newton’s Method considers the curvature information of the loss surfaces
by using the Heissian information
Hao Yan Gradient-based Optimization Methods January 23, 2024 74 / 91
Second-order Optimization Methods Second-Order Optimization Methods

Newton Algorithm

Algorithm
  Given $x_0$, set $k = 0$
  Solve for $d_k$ such that $H(x_k) d_k = -\nabla f(x_k)$
  Normally set $\alpha_k = 1$, $x_{k+1} = x_k + \alpha_k d_k$
  Repeat until convergence
Solving for $d_k$ requires assuming that $H(x_k)$ is nonsingular at each iteration
Newton's method considers the curvature of the original problem
Scale invariance of Newton's method:
  Newton's method is invariant to the affine transformation $x \to Dx$

Hao Yan Gradient-based Optimization Methods January 23, 2024 75 / 91
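A minimal sketch of this algorithm (not from the slides), assuming the gradient and Hessian are available as callables; the convergence tolerance and the 2-D test function are illustrative choices.

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    x = np.array(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        # Solve H(x_k) d_k = -grad f(x_k) instead of forming H^{-1} explicitly
        d = np.linalg.solve(hess(x), -g)
        x = x + d          # alpha_k = 1
    return x

# Example: f(x) = 5*x0^2 + 0.5*x1^2 + exp(x0)  (smooth, strictly convex)
grad = lambda x: np.array([10 * x[0] + np.exp(x[0]), x[1]])
hess = lambda x: np.array([[10 + np.exp(x[0]), 0.0], [0.0, 1.0]])
print(newton(grad, hess, [2.0, 3.0]))
```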


Second-order Optimization Methods Second-Order Optimization Methods

Descent Property

Theorem
Descent direction: if $\nabla^2 f \succ 0$, then the Newton step is a descent direction.

Proof.
If $\nabla^2 f \succ 0$, then $\nabla^2 f$ is positive definite and invertible. The Newton direction is
$\Delta x = x_{k+1} - x_k = -[\nabla^2 f(x_k)]^{-1}\nabla f(x_k)$, so
$\nabla f(x_k)^T \Delta x = -\nabla f(x_k)^T [\nabla^2 f(x_k)]^{-1}\nabla f(x_k) < 0$
This shows that the Newton step is a descent direction.

Hao Yan Gradient-based Optimization Methods January 23, 2024 76 / 91


Second-order Optimization Methods Second-Order Optimization Methods

Convergence Property

Theorem
Newton's method has a quadratic convergence rate.

We give a proof for the 1-D case, where x is a scalar.
Consider $g(x) = \nabla f(x)$ and its 2nd-order Taylor expansion. There exists $\xi_k$ between $x_k$ and $x^*$ such that
$0 = g(x^*) = g(x_k) + \nabla g(x_k)(x^* - x_k) + \frac{1}{2}\nabla^2 g(\xi_k)(x_k - x^*)^2$
Suppose that $[\nabla g(x_k)]^{-1}$ exists. Multiplying through by it:
$0 = [\nabla g(x_k)]^{-1} g(x_k) + (x^* - x_k) + \frac{1}{2}[\nabla g(x_k)]^{-1}\nabla^2 g(\xi_k)(x_k - x^*)^2$
$= -(x_{k+1} - x_k) + (x^* - x_k) + \frac{1}{2}[\nabla g(x_k)]^{-1}\nabla^2 g(\xi_k)(x_k - x^*)^2$
since the Newton step gives $x_{k+1} - x_k = -[\nabla g(x_k)]^{-1} g(x_k)$.
Hao Yan Gradient-based Optimization Methods January 23, 2024 77 / 91
(Illustration of quadratic convergence: a linearly convergent error sequence shrinks by a constant factor each step, e.g. 0.1, 0.05, 0.025, 0.0125, ..., while a quadratically convergent sequence roughly squares the error each step, e.g. 0.1, 0.01, 0.0001, ...)
Second-order Optimization Methods Second-Order Optimization Methods

Convergence Property

This gives
$\frac{e_{k+1}}{e_k^2} := \frac{x_{k+1} - x^*}{(x_k - x^*)^2} = \frac{1}{2}[\nabla g(x_k)]^{-1}\nabla^2 g(\xi_k)$

Let $M = \sup_{x,y} \frac{|\nabla^2 g(x)|}{|2\nabla g(y)|} < \infty$; therefore we know that
$\lim_{k\to\infty} \frac{|x_{k+1} - x^*|}{(x_k - x^*)^2} \le M < \infty$
If we select an initial point with $M|e_0| < 1$, then $e_k \to 0$ at a quadratic rate.

Hao Yan Gradient-based Optimization Methods January 23, 2024 78 / 91


Second-order Optimization Methods Second-Order Optimization Methods

Newton Convergence

Newton's method has local quadratic convergence:
$\|x_{k+1} - x^*\| = O(\|x_k - x^*\|^2)$

Theorem
If f is strongly convex on S with constant m, and $\nabla^2 f$ is Lipschitz continuous on S with constant $L > 0$, i.e.,
$\|\nabla^2 f(x) - \nabla^2 f(y)\|_2 \le L\|x - y\|_2$
then the number of iterations until $f(x) - f(x^*) \le \epsilon$ is bounded above by
$\frac{f(x_0) - f(x^*)}{\gamma} + \log_2 \log_2(\epsilon_0/\epsilon)$
where $\gamma$ and $\epsilon_0$ are constants depending on m and L.
Hao Yan Gradient-based Optimization Methods January 23, 2024 79 / 91


Second-order Optimization Methods Second-Order Optimization Methods

Scale Invariance

Gradient descent is not scale invariant
Newton's method considers the curvature of the original problem through the Hessian matrix
Scale invariance of Newton's method:
  Newton's method is invariant to the affine transformation $x \to Dx$

Take Home Message:
Newton's method is scale-invariant and less sensitive to scale normalization than gradient descent is.
Hao Yan Gradient-based Optimization Methods January 23, 2024 80 / 91
Second-order Optimization Methods Second-Order Optimization Methods

Can it be Practical?

High-dimensional problems, e.g. when $p > 10^4$:
  Computing $\nabla^2 f$ (a $p \times p$ matrix) is in general not practical
  $\nabla^2 f$ may have structure such as sparsity or low rank
Often we don't need $\nabla^2 f$; use an approximation instead
  Do we really need the full Hessian matrix? (Solving the Newton system costs $O(p^3)$.)
  Gradient descent is a special case of Newton's method with the identity Hessian approximation $H_k = \alpha_k^{-1} I$:
  $x_{k+1} = x_k - H_k^{-1}\nabla f(x_k) \to x_{k+1} = x_k - \alpha_k \nabla f(x_k)$
  Gradient descent with the identity Hessian approximation works great
Hao Yan Gradient-based Optimization Methods January 23, 2024 81 / 91


Second-order Optimization Methods Second-Order Optimization Methods

Quasi-Newton’s Method

BFGS: maintains a low-rank-update approximation to the inverse Hessian
L-BFGS (limited-memory version): doesn't explicitly store the Hessian matrix
Newton-CG: computes the Newton step via an iterative conjugate gradient approach ("Hessian-free" second-order optimization)
Hao Yan Gradient-based Optimization Methods January 23, 2024 82 / 91
Second-order Optimization Methods Second-Order Optimization Methods

Sampled Newton

$L(\xi; x)$ is separable: $L(\xi; x) = \sum_{i=1}^n L(\xi_i; x)$
Stochastic gradient descent uses a subset of the full sample to estimate the gradient
Sample a subset $B \subseteq \{1, 2, \cdots, n\}$ randomly:
$\nabla^2 f(\xi) \approx \frac{1}{|B|}\sum_{i \in B} \nabla^2 L(\xi_i; x)$
Problem: for high-dimensional problems (large p, $p > 10^4$),
$\nabla^2 L(\xi_i; x)$ is of size $p \times p$, which is too large to store or compute.

Hao Yan Gradient-based Optimization Methods January 23, 2024 83 / 91


Second-order Optimization Methods Second-Order Optimization Methods

Quasi-Newton Methods

Maintain an approximation to the Hessian that is filled in using information gained on successive steps.
Generate a sequence $\{B_k\}$ of Hessian approximations alongside the iterate sequence $x_k$, and calculate steps $d_k$ by solving
$B_k d_k = -\nabla f(x_k)$
Update $B_k \to B_{k+1}$ so that the approximate Hessian mimics the behavior of the true Hessian over this step:
$\nabla^2 f(x_{k+1}) s_k \approx y_k$
where $s_k = x_{k+1} - x_k$ and $y_k := \nabla f(x_{k+1}) - \nabla f(x_k)$, so we enforce the secant condition
$B_{k+1} s_k = y_k$
Hao Yan Gradient-based Optimization Methods January 23, 2024 84 / 91


Second-order Optimization Methods Second-Order Optimization Methods

Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm


Rank-2 update:
$B_{k+1} = B_k - \frac{B_k s s^T B_k}{s^T B_k s} + \frac{y y^T}{y^T s}$, where $s = s_k$ and $y = y_k$
Start with $B_0 = \rho I$ for some multiple $\rho$ that is consistent with the problem scaling, e.g. $s^T y / s^T s$
Can instead maintain an approximation $H_k$ to the inverse Hessian:
$H_{k+1} = (I - \rho s y^T) H_k (I - \rho y s^T) + \rho s s^T$, where $\rho = 1/(y^T s)$
Superlinear local convergence can be proved for BFGS and other quasi-Newton methods: $\|x_{k+1} - x^*\| / \|x_k - x^*\| \to 0$
Faster than gradient descent, though not as fast as Newton; cheaper per iteration than Newton
Problem: $H_k$ is a $p \times p$ matrix, which can be too large to store
Hao Yan Gradient-based Optimization Methods January 23, 2024 85 / 91
Second-order Optimization Methods Second-Order Optimization Methods

L-BFGS

L-BFGS doesn't store the $p \times p$ matrices $H_k$ or $B_k$ from BFGS explicitly
It only keeps track of $s_k$ and $y_k$ from the last few iterations (e.g. m = 5 or 10)
Take an initial matrix ($B_0$ or $H_0$) and assume that m steps have been taken since.
A simple procedure computes $H_k \nabla f(x_k)$ via a series of inner and outer products with the vectors $s_{k-j}$ and $y_{k-j}$ from the last m iterations, $j = 0, \cdots, m-1$
Requires 2mp storage and O(mp) linear algebra operations
No superlinear convergence is proved, but good behavior is observed on a wide range of applications

Hao Yan Gradient-based Optimization Methods January 23, 2024 86 / 91
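A minimal sketch of the two-loop recursion that computes the L-BFGS direction $-H_k \nabla f(x_k)$ from the stored pairs (not from the slides); the initial scaling $\gamma = s^T y / y^T y$ is a common default rather than something specified here, and the tiny quadratic usage example is made up.

```python
import numpy as np

def lbfgs_direction(grad_k, pairs):
    """Return -H_k @ grad_k using the stored (s_j, y_j) pairs, oldest first."""
    q = grad_k.copy()
    alphas = []
    for s, y in reversed(pairs):                    # newest to oldest
        rho = 1.0 / (y @ s)
        a = rho * (s @ q)
        alphas.append(a)
        q = q - a * y
    s_last, y_last = pairs[-1]
    gamma = (s_last @ y_last) / (y_last @ y_last)   # initial H_0 = gamma * I
    r = gamma * q
    for (s, y), a in zip(pairs, reversed(alphas)):  # oldest to newest
        rho = 1.0 / (y @ s)
        beta = rho * (y @ r)
        r = r + s * (a - beta)
    return -r                                       # descent direction

# Tiny usage example on the quadratic f(x) = 0.5 * x^T A x
A = np.diag([10.0, 1.0])
x_prev, x_curr = np.array([1.0, 1.0]), np.array([0.5, 0.9])
pairs = [(x_curr - x_prev, A @ x_curr - A @ x_prev)]   # (s, y) history
print(lbfgs_direction(A @ x_curr, pairs))
```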


Second-order Optimization Methods Second-Order Optimization Methods

Optimization in Logistic Regression

Loss function (we can assume b = 0 for simplicity):
$l(\theta) = \sum_{i=1}^N -y_i \log\sigma(\theta^T x_i) - (1-y_i)\log(1 - \sigma(\theta^T x_i))$
The gradient:
$l'(\theta) = \sum_i x_i (t_i - y_i)$
where $t_i^{(k)} = \sigma(\theta^{(k)T} x_i)$ is the predicted probability at iteration k.
Gradient descent:
$\theta^{(k+1)} = \theta^{(k)} - \alpha \sum_i x_i (t_i^{(k)} - y_i)$

Hao Yan Gradient-based Optimization Methods January 23, 2024 87 / 91


Second-order Optimization Methods Second-Order Optimization Methods

Logistic Loss
Assume
$P(y = 1 \mid x, \theta) = \sigma(\theta^T x + b)$
$P(y = 0 \mid x, \theta) = 1 - \sigma(\theta^T x + b)$
Equivalently,
$P(y \mid x; \theta) = P(y = 1 \mid x, \theta)^y\, P(y = 0 \mid x, \theta)^{1-y}$
Likelihood:
$P(y \mid x; \theta) = \prod_i P(y_i \mid x_i; \theta) = \prod_{i=1}^N P(y_i = 1 \mid x_i, \theta)^{y_i}\, P(y_i = 0 \mid x_i, \theta)^{1-y_i}$
$\mathrm{BCE} := -\log P(y \mid x; \theta) = -\sum_{i=1}^N \left( y_i \log P(y_i = 1 \mid x_i, \theta) + (1-y_i)\log P(y_i = 0 \mid x_i, \theta) \right)$
Hao Yan Gradient-based Optimization Methods January 23, 2024 88 / 91
Second-order Optimization Methods Second-Order Optimization Methods

Logistic Loss
The binary cross-entropy is given as
$\mathrm{BCE} = -\sum_{i=1}^N \left( y_i \log P(y_i = 1 \mid x_i, \theta) + (1-y_i)\log P(y_i = 0 \mid x_i, \theta) \right)$
BCE is a nonnegative combination of convex terms, which is also a convex function
Example: when $y_i = 1$, $-y_i \log P(y_i = 1 \mid x_i, \theta) = -\log P(y_i = 1 \mid x_i, \theta)$ (red curve in the figure)
  $-\log P(y_i = 1 \mid x_i, \theta)$ is 0 when $P(y_i = 1 \mid x_i, \theta) = 1$: correct prediction
  $-\log P(y_i = 1 \mid x_i, \theta)$ is $\infty$ when $P(y_i = 1 \mid x_i, \theta) = 0$: wrong prediction
Hao Yan Gradient-based Optimization Methods January 23, 2024 89 / 91
Second-order Optimization Methods Second-Order Optimization Methods

Optimization in Logistic Regression

Loss function (we can assume b = 0 for simplicity):
$l(\theta) = \sum_{i=1}^N -y_i \log\sigma(\theta^T x_i) - (1-y_i)\log(1 - \sigma(\theta^T x_i))$
The Hessian matrix is
$H = \sum_i x_i x_i^T\, t_i (1 - t_i) = X^T W X \succeq 0$

Theorem
The BCE loss function is a convex function.

Hao Yan Gradient-based Optimization Methods January 23, 2024 90 / 91


Second-order Optimization Methods Second-Order Optimization Methods

Optimization in Logistic Regression


Loss function (we can assume b = 0 for simplicity):
$l(\theta) = \sum_{i=1}^N -y_i \log\sigma(\theta^T x_i) - (1-y_i)\log(1 - \sigma(\theta^T x_i))$
The Hessian matrix is
$H = \sum_i x_i x_i^T\, t_i(1 - t_i) = X^T W X \succeq 0$
where $W = \operatorname{diag}(t_1(1-t_1), \cdots, t_n(1-t_n))$
Solution (Newton's step = iteratively reweighted least squares):
$\theta^{(k+1)} = \theta^{(k)} + (X^T W^{(k)} X)^{-1} X^T W^{(k)} Z^{(k)}$
$Z^{(k)} = W^{(k)-1}(y - t)$ is the working residual
Each Newton step solves a weighted linear regression in closed form
Hao Yan Gradient-based Optimization Methods January 23, 2024 91 / 91
(Note: because the linear regression loss is an exact quadratic, a single Newton step solves it in closed form, $X^T X\theta = X^T y$; the logistic-regression Newton step above is the weighted analogue, solving $\min_\theta (y - X\theta)^T W (y - X\theta)$ at each iteration.)
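A minimal sketch of the iteratively reweighted least squares update above (not from the slides). The synthetic data, the iteration count, and the tiny ridge term added inside the solve (only to keep $X^T W X$ well conditioned) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 500, 3
X = rng.normal(size=(n, p))
theta_true = np.array([1.0, -2.0, 0.5])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)

theta = np.zeros(p)
for _ in range(20):                          # Newton / IRLS iterations
    t = 1 / (1 + np.exp(-X @ theta))         # predicted probabilities t_i
    W = t * (1 - t)                          # diagonal of W
    z = (y - t) / np.maximum(W, 1e-9)        # working residual Z = W^{-1}(y - t)
    H = X.T @ (W[:, None] * X) + 1e-9 * np.eye(p)   # X^T W X (+ tiny ridge)
    theta = theta + np.linalg.solve(H, X.T @ (W * z))
print(theta)      # roughly recovers theta_true
```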
Stochastic Gradient Descent

Section 5

Stochastic Gradient Descent

Hao Yan Gradient-based Optimization Methods February 8, 2024 86 / 113


(Overview of stochastic optimization: with model parameters $\theta$ and data samples $\xi_1, \cdots, \xi_n$, we minimize $\frac{1}{n}\sum_i L(\theta; \xi_i)$. Full-batch GD uses all n samples to estimate the gradient, giving the smallest variance at cost O(np) per step; SGD picks a single sample, giving the largest variance at cost O(p) per step; mini-batch SGD with batch size B sits in between.)
Stochastic Gradient Descent Stochastic Gradient Descent

Problem of Gradient Methods

Data: $\xi \in S$, the sample space
Loss function: $F(x, \xi)$
Risk function: $E(F) = E_\xi F(x, \xi) = f(x)$
Empirical risk: $E_n(F) = \frac{1}{n}\sum_{i=1}^n F(x, \xi_i)$
Gradient descent:
$x_{k+1} = x_k - \alpha_k \frac{1}{n}\sum_{i=1}^n \nabla_x F(x_k, \xi_i)$
Problem: n can be very large ($n \approx 100{,}000$); evaluating $\nabla E_n(F)$ at each iteration is extremely time-consuming
Hao Yan Gradient-based Optimization Methods February 8, 2024 87 / 113


Stochastic Gradient Descent Stochastic Gradient Descent

Solution: Stochastic Gradient Descent

What if at each iteration k we pick $\xi_k$ randomly from $\{\xi_1, \cdots, \xi_n\}$?
Main idea: $\nabla_x F(x_k, \xi_k)$ is an unbiased estimator of $E_\xi(\nabla_x F(x_k, \xi))$
$x_{k+1} = x_k - \alpha_k \nabla_x F(x_k, \xi_k)$
Significantly reduces the time per iteration
Mini-batch: use a batch of more than one training sample:
$E_\xi(\nabla_x F(x, \xi)) \approx \frac{1}{B}\sum_{i=1}^B \nabla_x F(x, \tilde{\xi}_i)$

Hao Yan Gradient-based Optimization Methods February 8, 2024 88 / 113
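A minimal sketch of mini-batch SGD for the least-squares loss used in the earlier examples (not from the slides); the batch size, step size, and number of epochs are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, B = 10_000, 10, 32
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + 0.1 * rng.normal(size=n)

beta, alpha = np.zeros(p), 0.01
for epoch in range(5):
    idx = rng.permutation(n)                 # shuffle once per epoch
    for start in range(0, n, B):
        batch = idx[start:start + B]
        Xb, yb = X[batch], y[batch]
        # Unbiased estimate of the full gradient of (1/n)||y - X beta||^2
        g = -(2.0 / len(batch)) * Xb.T @ (yb - Xb @ beta)
        beta = beta - alpha * g
print(np.max(np.abs(beta - beta_true)))      # close to the true coefficients
```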


Stochastic Gradient Descent Stochastic Gradient Descent

Stochastic Optimization

Replace the gradient with cheaper noisy estimates [Robbins and Monro, 1951]:
$\theta_{t+1} = \theta_t + \rho_t \hat{g}_t$, where $\hat{g}_t$ is a noisy gradient estimate
What if at each iteration t we pick $x_i$ randomly from $\{x_1, \cdots, x_n\}$?
Any $\nabla_\theta l(\theta; x_i)$ is an unbiased estimator of $E_x(\nabla_\theta l(\theta; x))$
Requires unbiased gradients: $E[\hat{g}(\theta)] = \nabla_\theta L(\theta)$
Hao Yan Gradient-based Optimization Methods February 8, 2024 89 / 113
Stochastic Gradient Descent Stochastic Gradient Descent

Stochastic Gradient Descent

Algorithm: $\theta_{t+1} = \theta_t - \alpha_t \nabla_\theta l(\theta_t; x_k)$, where $x_k$ is picked randomly
Guaranteed to converge to a local optimum [Bottou, 1996]
Significantly reduces the time per iteration
Mini-batch: use a batch of more than one training sample:
$E_x(\nabla_\theta l(\theta; x)) \approx \frac{1}{B}\sum_{i=1}^B \nabla_\theta l(\theta; x_i)$
Has enabled modern machine learning

Hao Yan Gradient-based Optimization Methods February 8, 2024 90 / 113


Stochastic Gradient Descent Stochastic Gradient Descent

Stochastic Gradient Descent


Hao Yan Gradient-based Optimization Methods February 8, 2024 91 / 113


Stochastic Gradient Descent Stochastic Gradient Descent

Classical Stochastic Gradient Descent

Under the following assumptions, SGD converges:
  The loss function is differentiable and bounded below
  The learning rate satisfies $\sum_{k=1}^\infty \alpha_k = \infty$ and $\sum_{k=1}^\infty \alpha_k^2 < \infty$
  Other mild conditions on the distribution of the gradient hold
For a strongly convex function with $\alpha_k = \frac{1}{k\mu}$:
$E(\|\theta_k - \theta^*\|^2) \le \frac{\max(\|\theta_0 - \theta^*\|^2,\, M^2/\mu^2)}{2k}$
where $0 \prec \mu I \preceq \nabla^2 f(x) \preceq LI$ and $E(\|\nabla f(x)\|^2) \le M^2$
In practice, $\alpha_k = \alpha_0 (1 + \alpha_0 k)^{-1}$
For weakly convex functions, the convergence rate is $O(\frac{1}{\sqrt{k}})$ (sublinear convergence)

Hao Yan Gradient-based Optimization Methods February 8, 2024 92 / 113


Stochastic Gradient Descent Stochastic Gradient Descent

Proof
Let $a_k = \frac{1}{2}E(\|\theta_k - \theta^*\|^2)$. Assume $M > 0$ such that $E(\|\nabla_\theta F(\theta, x)\|^2) \le M^2$. Then
$\frac{1}{2}\|\theta_{k+1} - \theta^*\|_2^2 = \frac{1}{2}\|\theta_k - \theta^* - \alpha_k \nabla_\theta F(\theta_k, x_k)\|^2$
$= \frac{1}{2}\|\theta_k - \theta^*\|^2 - \alpha_k(\theta_k - \theta^*)^T \nabla_\theta F(\theta_k, x_k) + \frac{1}{2}\alpha_k^2\|\nabla_\theta F(\theta_k, x_k)\|^2$
Taking expectations over $x_k$:
$a_{k+1} \le a_k - \alpha_k E_{x_k}[(\theta_k - \theta^*)^T \nabla_\theta F(\theta_k, x_k)] + \frac{1}{2}\alpha_k^2 M^2$
We have
$E_{x_k}[(\theta_k - \theta^*)^T \nabla_\theta F(\theta_k, x_k)] = (\theta_k - \theta^*)^T E_{x_k}[\nabla_\theta F(\theta_k, x_k)] = (\theta_k - \theta^*)^T g_k$
where $g_k = E_{x_k}[\nabla_\theta F(\theta_k, x_k)]$ (unbiased estimator).

Hao Yan Gradient-based Optimization Methods February 8, 2024 93 / 113
Stochastic Gradient Descent Stochastic Gradient Descent

Proof

By strong convexity:
$(\theta_k - \theta^*)^T g_k \ge f(\theta_k) - f(\theta^*) + \frac{1}{2}\mu\|\theta_k - \theta^*\|^2 \ge \frac{1}{2}\mu\|\theta_k - \theta^*\|^2 + \frac{1}{2}\mu\|\theta_k - \theta^*\|^2 = \mu\|\theta_k - \theta^*\|^2$
Taking expectations, we have
$E((\theta_k - \theta^*)^T g_k) \ge 2\mu a_k$
Hence
$a_{k+1} \le (1 - 2\mu\alpha_k) a_k + \frac{1}{2}\alpha_k^2 M^2$
When $\alpha_k = \frac{1}{k\mu}$, we have $a_k \le \frac{Q}{2k}$, where $Q := \max(\|\theta_0 - \theta^*\|^2, \frac{M^2}{\mu^2})$
Hao Yan Gradient-based Optimization Methods February 8, 2024 94 / 113


Stochastic Gradient Descent Stochastic Gradient Descent

The Trade-offs of Large Scale Learning

$f^* = \arg\min_f E(f)$ is the best possible prediction
$f_F^* = \arg\min_{f \in F} E(f)$ is the best function in a certain parametrized family F
$f_n = \arg\min_{f \in F} E_n(f)$ is the empirical optimum
$\tilde{f}_n$ is the result achieved by the optimization algorithm
$\mathcal{E} = \underbrace{E[E(f_F^*) - E(f^*)]}_{\mathcal{E}_{app}} + \underbrace{E[E(f_n) - E(f_F^*)]}_{\mathcal{E}_{est}} + \underbrace{E[E(\tilde{f}_n) - E(f_n)]}_{\mathcal{E}_{opt}}$
Approximation error: $\mathcal{E}_{app}$ measures the error of restricting to $f \in F$
Estimation error: $\mathcal{E}_{est}$ measures the error of using the empirical risk rather than the expected risk
Optimization error: $\mathcal{E}_{opt}$ measures the impact of approximate optimization based on the empirical risk

Hao Yan Gradient-based Optimization Methods February 8, 2024 95 / 113


Stochastic Gradient Descent Stochastic Gradient Descent

Large-scale Optimization

Constraints: maximum computation time $T_{max}$, maximal training size $n_{max}$
Choose the family of functions F, the optimization accuracy $\rho$, and the number of examples n:
$\min_{F, \rho, n}\; \mathcal{E} = \mathcal{E}_{app} + \mathcal{E}_{est} + \mathcal{E}_{opt}$ subject to $n \le n_{max}$ and $T(F, \rho, n) \le T_{max}$
Small-scale learning problems: constrained by the maximal number of samples $n_{max}$. Optimization time is not an issue: drive $\mathcal{E}_{opt} \to 0$ by choosing a very small $\rho$ (optimization accuracy) and minimize $\mathcal{E}_{est}$ by choosing $n = n_{max}$. This is the classical approximation-estimation trade-off.
Large-scale learning problems: constrained by the computing time $T_{max}$ when $n_{max}$ is very large. Approximate optimization can achieve a better expected risk because more training examples can be processed during the allowed time.

Hao Yan Gradient-based Optimization Methods February 8, 2024 96 / 113


Stochastic Gradient Descent Stochastic Gradient Descent

Approximation-Estimation-Optimization Tradeoff

                                  F grows      n grows      ρ grows
E_app (approximation error)       decreases
E_est (estimation error)          increases    decreases
E_opt (optimization error)        ...          ...          increases
T (computation time)              increases    increases    decreases

Hao Yan Gradient-based Optimization Methods February 8, 2024 97 / 113


Stochastic Gradient Descent Stochastic Gradient Descent

Approximation-Estimation-Optimization Tradeoff

Approximation-Estimation-Optimization Tradeoff

n: number of samples, d: dimensionality, $\kappa$: condition number

Hao Yan Gradient-based Optimization Methods February 8, 2024 98 / 113


Stochastic Gradient Descent Stochastic Gradient Descent

Example 1: Compute Mean

$\arg\min_\theta E_x[\frac{1}{2}(x - \theta)^2]$, choose $\alpha_k = 1/k$
Each time, draw a sample $x_k$ and take a stochastic gradient step:
$\theta_k = \theta_{k-1} - \alpha_k(\theta_{k-1} - x_k) = \theta_{k-1} - \frac{1}{k}(\theta_{k-1} - x_k) = \frac{k-1}{k}\theta_{k-1} + \frac{1}{k}x_k$
$k\theta_k = (k-1)\theta_{k-1} + x_k = \cdots = \sum_{i=1}^k x_i$
$\theta_k = \frac{1}{k}\sum_{i=1}^k x_i$ is the sample-mean estimator
Convergence: $E(\|\theta_k - \theta^*\|^2) = O(\frac{1}{k})$

Hao Yan Gradient-based Optimization Methods February 8, 2024 99 / 113
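The derivation above can be checked numerically: with $\alpha_k = 1/k$, the SGD iterate equals the running sample mean. A small sketch with made-up data (not from the slides):

```python
import numpy as np

rng = np.random.default_rng(4)
xs = rng.normal(loc=3.0, scale=1.0, size=1000)

theta = 0.0
for k, x in enumerate(xs, start=1):
    theta = theta - (1.0 / k) * (theta - x)   # SGD step with alpha_k = 1/k
print(theta, xs.mean())                       # identical up to float error
```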


Stochastic Gradient Descent Stochastic Gradient Descent

Example 2: Linear Regression


$\min_\beta \|y - X\beta\|^2 = \min_\beta \sum_i (y_i - \beta^T x_i)^2$

Linear regression loss function:
$l(\beta) = \sum_i (y_i - \beta^T x_i)^2$
The gradient of a single term can be computed as
$\nabla_\beta (y_i - x_i^T\beta)^2 = 2(\beta^T x_i - y_i) x_i$
Stochastic gradient descent: at each step, sample one $(x_i, y_i)$ and compute the update
$\beta^{(\tau+1)} = \beta^{(\tau)} - 2\eta(\beta^{(\tau)T} x_i - y_i) x_i$

Hao Yan Gradient-based Optimization Methods February 8, 2024 100 / 113


Stochastic Gradient Descent Stochastic Gradient Descent

Complexity for LR

Space complexity: store $x_i, y_i, \beta$: O(p)
Time complexity (per update):
  $\beta^T x_i$: O(p)
  $\beta^T x_i - y_i$: O(1)
  $\beta^{(\tau)} - \eta(\beta^T x_i - y_i)x_i$: O(p)
$\beta^{(\tau+1)} = \beta^{(\tau)} - \eta(\beta^T x_i - y_i)x_i$

                      Space    Time
SGD                   O(p)     O(p) per update
Gradient Descent      O(np)    O(np) per iteration
Analytical solution   O(np)    O(np^2)
Table: Time and storage complexity

Hao Yan Gradient-based Optimization Methods February 8, 2024 101 / 113


Stochastic Gradient Descent Stochastic Gradient Descent

Scale Invariance

Scale invariance: under $x \to Dx$, the optimization algorithm behaves the same
GD and SGD are not scale invariant, since the gradient is not scale invariant

Hao Yan Gradient-based Optimization Methods February 8, 2024 102 / 113


Stochastic Gradient Descent Other Stochatic Gradient Optimizers

SGD with Momentum

Momentum:
$v_t = \rho v_{t-1} + (1 - \rho)\nabla_\beta l(\beta)$
$\Delta\beta = -\alpha v_t$

Hao Yan Gradient-based Optimization Methods February 8, 2024 103 / 113


Stochastic Gradient Descent Other Stochatic Gradient Optimizers

Why Momentum

Hao Yan Gradient-based Optimization Methods February 8, 2024 104 / 113


Stochastic Gradient Descent Other Stochatic Gradient Optimizers

Other Popular Variants

Adagrad: adapts the learning rate to the parameters; well suited to sparse data
Adadelta: an extension of Adagrad that reduces its aggressive, monotonically decreasing learning rate
Adam: adds momentum to adaptive per-parameter learning rates
Different views:
  Momentum accelerates the search in the direction of the minimum: a ball running down a slope
  RMSProp impedes the search in the direction of oscillations: a heavy ball with friction

Hao Yan Gradient-based Optimization Methods February 8, 2024 105 / 113


Stochastic Gradient Descent Other Stochatic Gradient Optimizers

RMSprop

Proposed by Geoffrey Hinton as an (unpublished) idea during a Coursera class.
RMSProp also tries to dampen the oscillations, but in a different way than momentum.
Update rule, for each parameter j:
$v_{j,t} = \rho v_{j,t-1} + (1 - \rho) g_{j,t}^2$, where $g_{j,t} = \frac{\partial l(\beta)}{\partial \beta_j}$
$\beta_{j,t+1} = \beta_{j,t} - \frac{\eta}{\sqrt{v_{j,t} + \epsilon}}\, g_{j,t}$
Uses an exponential average of the squared gradient
Exponential averaging: more recent gradients are more important
The squared gradient accumulates along the oscillating direction, which leads to a small effective learning rate there
Parameters that would ordinarily receive smaller or less frequent updates receive larger updates
Hao Yan Gradient-based Optimization Methods February 8, 2024 106 / 113
Stochastic Gradient Descent Other Stochatic Gradient Optimizers

Adam

Combine ideas from RMSprop and Momentum


Update rule, for each parameter j:
$v_{j,t} = \rho_1 v_{j,t-1} + (1 - \rho_1) g_{j,t}$, where $g_{j,t} = \frac{\partial l(\beta)}{\partial \beta_j}$
$s_{j,t} = \rho_2 s_{j,t-1} + (1 - \rho_2) g_{j,t}^2$
$\beta_{j,t+1} = \beta_{j,t} - \frac{\eta}{\sqrt{s_{j,t} + \epsilon}}\, v_{j,t}$

$v_{j,t}$: accounts for momentum
$s_{j,t}$: axis-dependent learning rate

Hao Yan Gradient-based Optimization Methods February 8, 2024 107 / 113
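The momentum, RMSProp, and Adam update rules from the last few slides can be written as small state-carrying update functions. The sketch below follows the equations as written above (in particular, Adam without bias correction, as on the slide); the hyperparameter defaults and the tiny quadratic usage example are illustrative assumptions, not values from the slides.

```python
import numpy as np

def momentum_update(theta, v, grad, alpha=0.01, rho=0.9):
    """SGD with momentum: v_t = rho*v_{t-1} + (1-rho)*grad, theta -= alpha*v_t."""
    v = rho * v + (1 - rho) * grad
    return theta - alpha * v, v

def rmsprop_update(theta, v, grad, alpha=0.01, rho=0.9, eps=1e-8):
    """RMSProp: exponential average of the squared gradient scales the step."""
    v = rho * v + (1 - rho) * grad ** 2
    return theta - alpha * grad / np.sqrt(v + eps), v

def adam_update(theta, v, s, grad, alpha=0.01, rho1=0.9, rho2=0.999, eps=1e-8):
    """Adam (no bias correction, as on the slide): momentum + per-axis rate."""
    v = rho1 * v + (1 - rho1) * grad          # first moment
    s = rho2 * s + (1 - rho2) * grad ** 2     # second moment
    return theta - alpha * v / np.sqrt(s + eps), v, s

# Tiny usage example on f(theta) = 0.5 * theta^T diag(10, 1) theta
A = np.array([10.0, 1.0])
theta, v, s = np.array([1.0, 1.0]), np.zeros(2), np.zeros(2)
for _ in range(200):
    theta, v, s = adam_update(theta, v, s, A * theta)   # gradient is A * theta
print(theta)      # moves toward the minimizer at the origin
```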


Stochastic Gradient Descent Other Stochatic Gradient Optimizers

Benchmark Methods

Hao Yan Gradient-based Optimization Methods February 8, 2024 108 / 113


Stochastic Gradient Descent Other Stochatic Gradient Optimizers

Benchmark Methods

Hao Yan Gradient-based Optimization Methods February 8, 2024 109 / 113


Stochastic Gradient Descent Other Stochatic Gradient Optimizers

Parallel Gradient Descent

Gradient computation:
$x \leftarrow x - \alpha\nabla f(x)$
Example: empirical risk minimization,
$\arg\min_x \frac{1}{n}\sum_{i=1}^n f_i(x)$
Map: compute $\nabla_x f_i(x)$
Reduce: compute
$\nabla f(x) = \frac{1}{n}\sum_i \nabla_x f_i(x)$
Problem: lots of communication and synchronization

Hao Yan Gradient-based Optimization Methods February 8, 2024 110 / 113


Stochastic Gradient Descent Other Stochatic Gradient Optimizers

Parallel SGD Methods

Parallelizing SGD is hard and an ongoing research problem
Hogwild!
  Allows performing SGD updates in parallel on CPUs with shared memory (works when the input data is sparse and each update only modifies a fraction of all parameters)
Asynchronous variant of SGD: Downpour SGD
  Used in the DistBelief framework at Google (the predecessor to TensorFlow)

Hao Yan Gradient-based Optimization Methods February 8, 2024 111 / 113


Stochastic Gradient Descent Other Stochatic Gradient Optimizers

Other gradient-based algorithms

Other 1st Order Method:


Proximal Gradient
Subgradient Descent
Second-order method:
(Quasi) Newton’s method
Limited-memory BFGS

Hao Yan Gradient-based Optimization Methods February 8, 2024 112 / 113


Stochastic Gradient Descent Other Stochatic Gradient Optimizers

Other Methods

Integer Programming
Cutting plane methods
Branch and bound methods
Stochastic Algorithms
Direct Monte-Carlo sampling-based Optimization
Heuristic algorithms:
Simulated annealing
Tabu search
Swarm-based optimization algorithms (e.g. Particle swarm
optimization)
Evolutionary algorithms (Genetic Algorithms)
Learn to optimize (RNN-based learning)

Hao Yan Gradient-based Optimization Methods February 8, 2024 113 / 113
