0% found this document useful (0 votes)

104 views136 pages

Gradient-based Optimization Techniques

The document discusses gradient-based optimization methods, including an introduction to complexity analysis using Big O notation and an overview of convex functions and optimization problems. Convex functions are defined and their properties are explained, including that any locally optimal point of a convex problem is globally optimal and optimality criteria for convex optimization problems.

Uploaded by

Olabiyi Ridwan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

104 views136 pages

Gradient-based Optimization Techniques

Uploaded by

Olabiyi Ridwan

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

Gradient-based Optimization Methods

Hao Yan

January 23, 2024

Hao Yan Gradient-based Optimization Methods January 23, 2024 1 / 91

Outline

1 Complexity
Big O Notation

2 Convex and Complexity Analysis

Convexity

3 Gradient Descent and its variants

Introduction to Gradient Descent
Understand Gradient
Gradient Descent: How Machine Learns

4 Second-order Optimization Methods

Second-Order Optimization Methods

Hao Yan Gradient-based Optimization Methods January 23, 2024 2 / 91

Complexity

Section 1

Complexity

Hao Yan Gradient-based Optimization Methods January 23, 2024 3 / 91

Complexity Big O Notation

Von Neumann architecture General-purpose processors

Components
Memory (RAM):
Central processing unit (CPU)
Input/output system
Memory stores program and data
Program instructions execute sequentially

Hao Yan Gradient-based Optimization Methods January 23, 2024 4 / 91

Complexity Big O Notation

Random-access machine (RAM) model

RAM machine consistes of

a fixed program
an unbounded memory
Read-only input tape and wrie only output tape
Assumptions
Instructions are executed one after the other (non concurrency)
Each tape can hold an arbitary integer
Each “simple” operation takes 1 step: (+, -, =, if, call, memory access
on a random location)

Hao Yan Gradient-based Optimization Methods January 23, 2024 5 / 91

Complexity Big O Notation

Space and Time Complexity

Time Complexity: count the number of flops of an algorithm with size

n.
Space Complexity: memory required for the algorithm
Why it is important?
How eﬃcient is the algorithm?
What happens if the problems scale up?

Hao Yan Gradient-based Optimization Methods January 23, 2024 6 / 91

Complexity Big O Notation

Assymptotic Growth

Line C it
Assumptotic Oder of Growth Tin 01961
Upper bound: T (n) is O(g (n)) if exists c, n0 such that
T (n)  c · g (n) for all n n0
Lower bound: T (n) is ⌦(g (n)) if exists c such that T (n) c · g (n)
for all n n0
Tight bound: T (n) is ⇥(g (n)) if it is both O(g (n)) and ⌦(g (n))
exists constants c1 , c2 and n0 such that

0  c1 g (n)  f (n)  c2 g (n)8n n0

Hao Yan Gradient-based Optimization Methods January 23, 2024 7 / 91

Complexity Big O Notation

Mostly Used: Big-O-Notation

Count the number of flops of an algorithm with size n.

O(1) describes an algorithm that will always execute in the same time
(or space) regardless of the size of the input data set.
O(n) describes an algorithm whose performance will grow linearly and
in direct proportion to the size of the input data set.
O(n2 ) represents an algorithm whose performance is directly
proportional to the square of the size of the input data set.

Hao Yan Gradient-based Optimization Methods January 23, 2024 8 / 91

Complexity Big O Notation

Examples

2
1 2

2n2 + 3n + 1 = O(n2 )
n = O(n2 )
n! = O(nn )
n n
men 1
adf.

Ii
Hao Yan Gradient-based Optimization Methods January 23, 2024 9 / 91
Convex and Complexity Analysis

Section 2

Convex and Complexity Analysis

Hao Yan Gradient-based Optimization Methods January 23, 2024 10 / 91

Convex and Complexity Analysis Convexity

Convex Functions
f is convex if domf is a convex set and
f (✓x + (1 ✓)y )  ✓f (x) + (1 ✓)f (y )
for all x, y 2 domf , 0  ✓  1
f is strictly convex if domf is a convex set and
f (✓x + (1 ✓)y ) < ✓f (x) + (1 ✓)f (y )
for all x, y 2 domf , 0 < ✓ < 1

É
if
Hao Yan Gradient-based Optimization Methods January 23, 2024 11 / 91
f is Convex and Complexity
differentiable if domAnalysis
f is openConvexity
and the gradient
✓ ◆
Convex Functions forrfDiﬀerential
(x) =
@x
, Functions
@f (x) @f (x)
@x
,...,
@f (x)
@x 1 2 n

f is convex
exists iﬀ
at each x 2 dom f
with *diﬀerentiable
1st-order condition:*fdifferentiable f 051
with convex domain is convex iff
T
f (y) f (y
f (x) + rf (y +x)rffor
) (x)f T(x) (x)all x,
(yy 2 x)
dom f

f (y)
f (x) + f (x)T (y x)

(x, f (x))
5
first-order approximation of f is global underestimator

with twice diﬀerentiable f

Convex functions 3–7

Hessian L
H = r2 f ⌫ 0 8x

Strong Convex r2 f __
E
0 8x
refers all eigenvalues are positive
⌫ refers all eigenvalues are non-negative
Hao Yan Gradient-based Optimization Methods January 23, 2024 12 / 91
Convex and Complexity Analysis Convexity

Convex Functions for Diﬀerential Functions

Theorem (Theorem)

Any locally optimal point of a convex problem is globally optima

Theorem (Theorem)

For unconstrained diﬀerential convex optimization Problem: x is optimal iﬀ

x 2 domf0 , rf0 (x) = 0

FEET
Hao Yan
t.EE
Gradient-based Optimization Methods
ay
January 23, 2024 13 / 91
Convex and Complexity Analysis Convexity

Proof of These Theorems

f
Proof.
suppose x is locally optimal. Prove f0 (y ) f0 (x).
Define x✓ = ✓y + (1 ✓)x and
F E 19
ily all E
f0 (x✓ ) = f0 (✓y + (1 ✓)x)  ✓f0 (y ) + (1 ✓)f0 (x)

Therefore
1 7712569
0
f0 (y ) f0 (x) (f0 (x✓ ) f0 (x))
✓ 70.0 0
If we choose small enough ✓, x✓ would be close enough to x such that
f0 (x✓ ) f0 (x) 0. Therefore, f0 (y ) f0 (x) 8y 2 dom(f0 )

Hao Yan Gradient-based Optimization Methods January 23, 2024 14 / 91

Convex and Complexity Analysis Convexity

Optimality Criterion

Theorem (Theorem)

x is optimal iﬀ rf0 (x)T (y x) 0 for all feasible y

Nonnegative Problem: x is2 optimal iﬀ x 2 domf0 , x 0,

(
rf0 (x)i 0 xi = 0
rf0 (x)i = 0 xi > 0

Theorem (Theorem)

x is optimal iﬀ x 2 domf0 , rf0 (x) = 0 (unconstrained)

Hao Yan Gradient-based Optimization Methods January 23, 2024 15 / 91

Convex and Complexity Analysis Convexity

Strongly Convex

1 W no
Theorem (Theorem) oof C of
For strictly or strongly convex functions, the global optimum is unique.

Ensuring unique solutions is important for identifiable model.

Hao Yan Gradient-based Optimization Methods January 23, 2024 16 / 91

Convex and Complexity Analysis Convexity

Example of Convex Functions on R 1

Convex
Aﬃne: ax + b
Exponential: exp(ax)
Powers: x ↵ on x > 0 for ↵ 1 or ↵  0
Powers of absolute value: |x|p on R for p 1
Negative entropy: x log x on x > 0
Concave:
Aﬃne: ax + b
Logarithm: log x
Powers: x ↵ on R for ↵ 2 [0, 1]

Hao Yan Gradient-based Optimization Methods January 23, 2024 17 / 91

Convex and Complexity Analysis Convexity

Example of Convex Functions on R n and R m⇥n

Example on R n
Aﬃne: f (x) = aTP
x +b
n
Norms: kxkp = ( i=1 |xi |p )1/p
Example on R m⇥n Pal
Aﬃne: f (X ) = tr (AT X ) + b
spectral f (X ) = kX k2 = max (X )

Hao Yan Gradient-based Optimization Methods January 23, 2024 18 / 91

Convex and Complexity Analysis Convexity

A Brief History

A. Convex optimization: before 1980

Develop more eﬃcient algorithms for convex problems
B. Convex optimization for machine learning: 1980-2000
Apply mature convex optimization algorithms on machine learning
models
C. Large scale (High-dimensional) convex optimization: 2000-2010
Algorithms become more specific to machine learning models/convex
problems (can be non-smooth)
Scalability is very important SI
D. Huge scale machine learning: 2010-now convexity is longer a
problem for optimization
Non-convex problems
Parallel \ distributed optimization

Hao Yan Gradient-based Optimization Methods January 23, 2024 19 / 91

Convex and Complexity Analysis Convexity

What About Non-convex function

7 23 2
Stationary Point rf (x) = 0 can be
Local Minimum 05 3
Local Maximum of 604
Saddle Point

HE.to
17
Eq
If
dimensus
p
Why is non-convex optimization hard in high dimensions?

Hao Yan
of Gradient-based Optimization Methods January 23, 2024 20 / 91
Convex and Complexity Analysis Convexity

Non-convex Optimization is Hard - Many Local Optimums

Non-convex functions may have many local minimums
Are the local minimums equally good?
Not, but almost equally good for many structured problems. Finding
one is typically good enough.
N

LEA
O

Hao Yan Gradient-based Optimization Methods January 23, 2024 21 / 91

Convex and Complexity Analysis Convexity

Non-convex Optimization is Hard - Much More Saddle

Points

Non-convex functions have much more saddle points in high

dimensions
How to escape from saddle points for high-dimensional problems
Avoiding saddle points is more important (much more saddle points in
high-dimensions)

Hao Yan Gradient-based Optimization Methods January 23, 2024 22 / 91

Convex and Complexity Analysis Convexity

Non-convex Problem: Active in Research

If f is non-convex, most algorithms can only converge to stationary

points.
Case-by-Case study for global optimum guarantee
Often based on solving convex subproblems
Classical non-convex problem that can be easily solved

X k2F
YEA
min kY
rank(X )=k

Other examples of non-convex optimization that cannot be solved

easily (globally)
Tensor factorization
Neural network

Hao Yan Gradient-based Optimization Methods January 23, 2024 23 / 91

Convex and Complexity Analysis Convexity

Example: Convexity of Linear Regression

III

Problem 48
min ky X ✓k2
✓

Please compute the Hessian matrix

Is it convex? T

Is it strongly convex?
How about Ridge regression

KER 1191112
Efi ITX

Hao Yan Gradient-based Optimization Methods January 23, 2024 24 / 91

20
A
Holfy 401 9 40
ExO2O Y
yty
257
2
21
20 05 2 7
0

OEIRP
OF 2
Ctxxc 1 4170
Δ

is rank define
Multi culinary
Oig Xix 0

non unique solutus

Have
Convex and Complexity Analysis Convexity

Example: Convexity of Logistic Regression

Logistic regression function

X
L(✓) = yi log( (✓T xi )) (1 yi ) log(1 (✓T xi ))

BCE
i

1
Here (x) = 1+exp( x) .

d
(x) = (x)(1 (x))
dx
Please compute the Hessian matrix

HE 0 Multi colienity
Is it convex?
Is it strongly convex?
How about Ridge regression Cig H O

Hao Yan Gradient-based Optimization Methods January 23, 2024 25 / 91

Convex and Complexity Analysis Convexity

Example: Support Vector Machine

Support Vector Machine

1 X
2
min kw k + C max{0, 1 yi w T xi }
w 2
i

max{0, 1 w T a} is a convex function.

1 2
2 kw k is also a convex function

Hao Yan Gradient-based Optimization Methods January 23, 2024 26 / 91

Gradient Descent and its variants

Section 3

Gradient Descent and its variants

Hao Yan Gradient-based Optimization Methods January 23, 2024 27 / 91

Gradient Descent and its variants Introduction to Gradient Descent

Why Gradient Descent?

Gradient descent lies in the heart of modern machine learning

algorithms
Simple to use
Acceptable convergence property but scalable to big data and
high-dimensional problems
Big n: Stochastic Optimization
Big p: Model parallelization

Hao Yan Gradient-based Optimization Methods January 23, 2024 28 / 91

Gradient Descent and its variants Introduction to Gradient Descent

Gradient Descent - How Machine Learns

Video from https://www.youtube.com/watch?v=V1eYniJ0Rnk

Hao Yan Gradient-based Optimization Methods January 23, 2024 29 / 91

Gradient Descent and its variants Introduction to Gradient Descent

Gradient Descent - How Machine Learns

Hao Yan Gradient-based Optimization Methods January 23, 2024 30 / 91

Gradient Descent and its variants Introduction to Gradient Descent

Gradient Descent - How Machine Learns

Differentiable
Input ! Model Prediction #$

Update Gradient
Parameter
%
descent Truth #
Learner

Hao Yan Gradient-based Optimization Methods January 23, 2024 31 / 91

Gradient Descent and its variants Introduction to Gradient Descent

Gradient Descent - How Machine Learns

Video from
https://www.youtube.com/watch?v=IHZwWFHWa-w\&t=707s
Hao Yan Gradient-based Optimization Methods January 23, 2024 32 / 91
Gradient Descent and its variants Understand Gradient

Vector Diﬀerentiation
0 1
x1
x: vector defined as x = @ ... A
B C

xn
Assume y = f (x), where x is a scalar Output is a vector
0 df 1
dy B dx. C
1
B f PD fall
= @ .. A
dx @f n
@x

Our interest: Vector diﬀerentiation or Gradient: y = f (x) is a scalar

0 @f 1
dy
@x1 Emodel parameters
B
= rx y = @ .. C
. A
dx scaler
@f
@xn I
Why this is our interest?
Hao Yan Gradient-based Optimization Methods January 23, 2024 33 / 91
Gradient Descent and its variants Understand Gradient

Visualize Gradient
Gradient off (x1 , · · · , xp ) according to x = (x1 , · · · , xp ) is given as
✓ ◆
df @f @f
rf = = ,··· ,
dx @x1 @xp

315 I
7 414

Hao Yan Gradient-based Optimization Methods January 23, 2024 34 / 91

Gradient Descent and its variants Understand Gradient

Derivative, Gradient

Gradient off (x1 , · · · , xp ) according to x = (x1 , · · · , xp ) is given as

✓ ◆
@f @f
rf = ,··· ,
@x1 @xp

Example: f (x, y , z) = x + 2y + 3z, rf (x, y , z) = (1, 2, 3)

Example: f (x, y , z) = x 2 + 2y 2 + 3z 2 ,rf (x, y , z) = (2x, 4y , 6z)

Hao Yan Gradient-based Optimization Methods January 23, 2024 35 / 91

Gradient Descent and its variants Understand Gradient

Types of Matrix Derivative

scalar

Types of Matrix Derivative

Types Scalar Vector Matrix
@y @y @Y
Scalar @x @x @x
@y @y
Vector @x @x
@y
Matrix @X

vector to vector
vector
Scatory

gradient Jacobian
Hao Yan Gradient-based Optimization Methods January 23, 2024 36 / 91
Gradient Descent and its variants Understand Gradient

Useful Rules

0
@f (x)
@x1
1 I
@f (x) B .. C Aij
=B . C
@x @
@f (x)
A
Ai

1441
@xp
0 1 0 1
@aT x
a
@xT a @aT x
B @x1
.. C B .1 C
@x = @x =B
@ . C = @ .. A = a
A
@aT x ap
@xp
@Ax
= AT
@x
Y AX Y
Five.ir d A
Hao Yan Gradient-based Optimization Methods January 23, 2024 37 / 91
Gradient Descent and its variants Understand Gradient

Example: Linear Regression

Linear Regression Loss Function

1
l(✓) = min ky X ✓k2
✓ n

What is the gradient

@l(✓)
=?
@✓
What is the optimal point and is it unique?

Hao Yan Gradient-based Optimization Methods January 23, 2024 38 / 91

Gradient Descent and its variants Understand Gradient

Jacobian
Assume y = f(x), both x, y are vectors of size n and m
FAI II
f = (f1 , · · · , fm ), x = (x1 , · · · , xn ), y = (y1 , · · · , ym ) where, yi = fi (x)
Define Jacobian: vector to vector mapping @y @x
0 @f1 (x) 1 0 @ @ 1
@x @x1 f1 (x) ··· @xn f1 (x)
@y B .. C B .. .. C
=@ . A=@ . . A
@x @fm (x) @ @
@x m @xm fm (x) @xn fm (x)
Special case: x and y are of the same size
Multivariable function f looks locally like as a linear transformation of x
Known as Jacobian matrix

HER

Hao Yan Gradient-based Optimization Methods January 23, 2024 39 / 91

Gradient Descent and its variants Understand Gradient

Example of Jacobian

x = r cos ✓, y = r sin ✓ 6 g riot

Define a = (x, y ), b = (r , ✓)
@a
J = @b
Meaning
✓ ◆ ✓ @x @x ◆✓ ◆ ✓ ◆✓ ◆
x @r @✓ r cos ✓ r sin ✓ r
= @y @y =
y @r @✓
✓ sin ✓ r cos ✓ ✓

If we count the area

✓ ◆
cos ✓ r sin ✓
dxdy = drd✓ = rdrd✓
sin ✓ r cos ✓

The absolute value of the Jacobian times the area of the corresponding
rectangle

Hao Yan Gradient-based Optimization Methods January 23, 2024 40 / 91

Gradient Descent and its variants Understand Gradient

Derivative, Gradient, and Hessian

A special case: Hessian Matrix vector
@ @f
What is Hessian H = @x
2 @x 2 3
@2f @2f

0
@ f
6 @x 2 ···
6 1 @x1 @x2 @x1 @xn 7
7
6
0 @ @f 1 6 @ 2 f 7
@2f 2
@ f 7
@x @x1 6 ··· 7
B . C 6 @x2 @x1 @x22 @x2 @xn 7
H=@ .. A=6 6
7
7
6 .. .. .. .. 7
@ @f
6 . . . . 7
@x @xn 6 7
6 7
4 @2f @2f 2
@ f 5

31
···
His @xn @x1 @xn @x2 @xn2
2 3
6x 1 Symethic
Hji
$f(x,y,z)=x3 +2y2 +3z2 +xy,$ H = 4 1 4 5
6
Hessian encodes the curvation information of the function
HI
Hao Yan Gradient-based Optimization Methods January 23, 2024 41 / 91
Gradient Descent and its variants Understand Gradient

Hessian and Curvature

y 00
Curvature in 1D: For y = f (x),  = 3 , a larger y 00 means a
(1+y 0 ) 2
1
larger curvature.  = R

Hao Yan Gradient-based Optimization Methods January 23, 2024 42 / 91

Gradient Descent and its variants Understand Gradient

Hessian and Curvature

Curature in 2D: The Gaussian Curature is

I
det(H)
K =  1 1 =
1 + krf k2

At critical point rf = 0, K = det(H)

Types 1 > 0, 2 > 0, Bowl-like. 1 < 0, 2 < 0, Mountain-like.
1 2 < 0, Saddle-like

Hao Yan Gradient-based Optimization Methods January 23, 2024 43 / 91

Gradient Descent and its variants Understand Gradient

Gradient Rules

Linear Rule: r(↵f (x) + g (x)) = ↵rf (x) + rg (x)

Product Rule:
@(f (x)g (x)) @(g (x)) @(f (x))
= f (x) + g (x)
@x @x @x
or equivalently

r(f (x)g (x)) = f (x)rg (x) + g (x)rf (x)

Hao Yan Gradient-based Optimization Methods January 23, 2024 44 / 91

Gradient Descent and its variants Understand Gradient

Chain Rule

Suppose we have a vector function g(x) and scalar function f (x).

✓ ◆T
@f (g(x)) @g(x) @f (a)
=
@x @x @x

Here,
⇣ a⌘ = g (x).
@g(x) gradient
@x is the Jacobian of g.

Hao Yan Gradient-based Optimization Methods January 23, 2024 45 / 91

Gradient Descent and its variants Understand Gradient

Product Rule (2)

Suppose we have two vector functions f(x) and g(x).

@ f T (x)g(x)
@x
=
@ f T (x)
@x
g(x) +
@ gT (x)
@x
f(x) 0
Some special examples
@
@x (u T
v) = @uT
@x v + @vT
@x u
un
@ 2 @ T @uT @uT T

@x kuk 2 = @x u u = @x u + @x u = 2 @u
@x u
@
@x (x T
Ax) = 2Ax if A T
=A

Hao Yan Gradient-based Optimization Methods January 23, 2024 46 / 91

Gradient Descent and its variants Understand Gradient

Property of Gradients

Theorem (Theorem)
feitehj frtiteof.li
lim✏!0 0✏
f (x+✏h) f (x)
= rf · h Inerprudat
of
455h
0
Theorem (Theorem)

IE
rf
maxkv k=1 (rf · v ) achieves its maximum at v = krf k
0
Proof.
rf · v = rf kv k cos ✓  krf k, the equality holds when v = crf , where
1
c = krf k.

MK of
minimum
9
Hao Yan Gradient-based Optimization Methods January 23, 2024 47 / 91
Gradient Descent and its variants Gradient Descent: How Machine Learns

Why Gradient Descent?

Gradient descent lies in the heart of modern machine learning

algorithms
Simple to use
Acceptable convergence property but scalable to big data and
high-dimensional problems
Big n: Stochastic Optimization
Big p: Model parallelization

Hao Yan Gradient-based Optimization Methods January 23, 2024 48 / 91

Gradient Descent and its variants Gradient Descent: How Machine Learns

Derivative and Gradient

Gradient off (x1 , · · · , xp ) according to x = (x1 , · · · , xp ) is given as
✓ ◆
@f @f
rf = ,··· ,
@x1 @xp

Hao Yan Gradient-based Optimization Methods January 23, 2024 49 / 91

Gradient Descent and its variants Gradient Descent: How Machine Learns

Derivative, Gradient

Gradient off (x1 , · · · , xp ) according to x = (x1 , · · · , xp ) is given as

✓ ◆
@f @f
rf = ,··· ,
@x1 @xp

Example: f (x, y , z) = x + 2y + 3z, rf (x, y , z) = (1, 2, 3)

Example: f (x, y , z) = x 2 + 2y 2 + 3z 2 ,rf (x, y , z) = (2x, 4y , 6z)

Hao Yan Gradient-based Optimization Methods January 23, 2024 50 / 91

Gradient Descent and its variants Gradient Descent: How Machine Learns

Derivative, Gradient, and Hessian

Hessian Matrix
2 3
@2f @2f @2f
6 @x 2 ···
6 1 @x1 @x2@x1 @xn 77
6 7
6 @2f 2
@ f 2
@ f 7
6 ··· 7
6 @x2 @x1 @x2 2 @x2 @xn 7
H=6
6
7
7
6 .. .. .. .. 7
6 . . . . 7
6 7
6 7
4 @2f 2
@ f 2
@ f 5
···
@xn @x1 @xn @x2 @xn2
2 3
6x 1
f (x, y , z) = x 3 + 2y 2 + 3z 2 + xy ,H = 4 1 4 5, which is not
6
p.s.d
Hao Yan Gradient-based Optimization Methods January 23, 2024 51 / 91
Gradient Descent and its variants Gradient Descent: How Machine Learns

Property of Gradients

Theorem (Theorem)
f (x+✏h) f (x)
lim✏!0 ✏ = rf · h

Theorem (Theorem)
rf
maxkv k=1 (rf · v ) achieves its maximum at v = krf k

Proof.
rf · v = rf kv k cos ✓  krf k, the equality holds when v = crf , where
1
c = krf k.

Hao Yan Gradient-based Optimization Methods January 23, 2024 52 / 91

Gradient Descent and its variants Gradient Descent: How Machine Learns

Gradient Descent
Assume that f and rf at each iteration can be easily evaluated
Recall that we have f : R n → R, convex and diﬀerentiable, want to
solve
Steepest Descent:

minn f (x)
x2R
i.e find x ⇤ such that f (x ⇤ ) = min f (x)
Steepest Descent:

xk+1 = xk ↵k rf (xk )
How to identify ↵k
Trial and Error: Select a fixed ↵k or reduce ↵k after f (xk ) is stable
Backtracking: ↵0 , 12 ↵0 , 14 ↵0 , · · · until a suﬃcient decrease in f is
obtained
Use information from function f to decide
Hao Yan Gradient-based Optimization Methods January 23, 2024 53 / 91
Fixed step size
Gradient Descent and its variants Gradient Descent: How Machine Learns

Bigtake
mply SteptkSize
= t for all k = 1, 2, 3, . . ., can diverge if t is too
nsider Consider
f (x) =f (x)
(10x 2 + x2 )/2, gradient descent after 8 steps:
1 2 +2x 2 )/2, gradient descent after 8 steps
= (10x 1 2

20 ●

E
10

*
0
−10
−20

Hao Yan Gradient-based Optimization Methods January 23, 2024 54 / 91

slow if t is too small. Same example, gradient desce
Gradient Descent and its variants Gradient Descent: How Machine Learns

ps:Small Step Size

Same example, gradient descent after 100 steps
20 ●
●
●
●●
●●
●
●
●
●

1
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
10

●
●
●
●
●
●
●
●
●
●
●

*
0
−10
20

Hao Yan Gradient-based Optimization Methods January 23, 2024 55 / 91

Gradient Descent and its variants Gradient Descent: How Machine Learns

Correct Step Size

acking picks up roughly the right step size (13 steps)
Consider f (x) = (10x12 + x22 )/2, gradient descent after 13 steps

20
●

●
●
10

●
●
●
●
jiffy ●●
●
●

*
0
−10

Hao Yan Gradient-based Optimization Methods January 23, 2024 56 / 91

0
Gradient Descent and its variants Gradient Descent: How Machine Learns

Line Search
min fix
Gradient step xk+1 = xk ↵rf (xk )
Look at which ↵ minimize f (xk ) 2 is a scalar

↵k = arg min f (xk ↵rf (xk ))

↵

Simple Bisection rule can be used to find ↵

Inexact Line Search: look for stepsize results in suﬃcient decrease
p
Decaying stepsize: ↵k = 1/ k
Fixed stepsize

Hao Yan Gradient-based Optimization Methods January 23, 2024 57 / 91

Gradient Descent and its variants Gradient Descent: How Machine Learns

Strong Convexity x̅05x µ

OF EMI
4421101112
0
10 0

eig of i M 70
If µI 4 r2 f (x) 4 LI

µ L
ky xk22  f (y ) f (x) T
rf (x) (y x)  ky xk22
2 2
For strong convex problem µ > 0
Weak convex problem µ = 0

fig fix 7ft g 1

1 19 01
Tff 9 1

45
Hao Yan Gradient-based Optimization Methods January 23, 2024 58 / 91
2412
9 1
y't 9 x 119

774TH
mysis my
Of LI 60 Stryly Coney

0 19 May 04
91
2 g x 219 09

09 7ft 19 0 0

19 2 If
y x
tof
AD
sealer
Convergence speed
she 0
45 sk 0 How fast

die 0

Ref 0

1 1 Linear emergence
IS 2

c Linear
longer

2
Irreverence
4,1 I

3 quadratic emerge
4 2
How iterations to achieve E accuracy
many

Linear convergence

XK
ICE knflog.cl
loy2

Quadratic

sublinar Die CE

k
Gradient Descent and its variants Gradient Descent: How Machine Learns

Why Strong Convexity Matters?

A strongly convex function is also strictly convex

A strictly convex function has a unique global optimum
+attrl atex :: width0.7[[file : Figs/highlowdim.pdf ]]

Hao Yan Gradient-based Optimization Methods January 23, 2024 59 / 91

Gradient Descent and its variants Gradient Descent: How Machine Learns

Stationary Point Convergence

Theorem (Theorem)

Assume r2 f (x) 4 LI , rf (xk ) ! 0. The algorithm will converge to a

stationary point.

Proof.
The proof sketch is layed as:
Convexity: f (xk+1 ) f (xk ) rf (xk )T (xk+1 xk )  L2 kxk+1 xk k22
since xk+1 = xk ↵k rf (xk ).
f (xk+1 )  f (xk ) ↵k (1 ↵2k L)krf (xk )k22
1 2
Select step size ↵k = 1/L, have f (xk+1 )  f (xk ) 2L krf (x k )k2
By
PNsummation from k = 0 to k = N and
krf (x )k2  2L(f (x ) f (xN+1 )]
k=1 k 0

Hao Yan Gradient-based Optimization Methods January 23, 2024 60 / 91

stationary point convergence

1 1 2 FEdpga

2141 Xk 2 11 t
that fÑt2 t 11 12 may
f ocean
fell F 110 14 f
912k

t.EE
110 1 115
genes 14 tails I

ota.nl sc
E
5 0fk
Gradient Descent and its variants Gradient Descent: How Machine Learns

Stationary Point Convergence

Theorem (Theorem) unique global optimal

I
1
kxk x ⇤ k is a decreasing sequence in k with ↵k = L

Proof.
Ike 2k 21 01 14 If't
⇤ 2 1
kxk+1 x k = kxk x ⇤
rf (xk )k2 2 7 1
L
⇤ 2 2 1
= kxk x k rf (xk ) (xk x ) + 2 krf (xk )k22
> ⇤
L L
⇤ 2 1
 kxk x k krf (xk )k2
L
We know that kxk x ⇤ k2 is non-decreasing in k.

Hao Yan Gradient-based Optimization Methods January 23, 2024 61 / 91

Gradient Descent and its variants Gradient Descent: How Machine Learns

Stationary Point Convergence

Theorem (Theorem)
1 1
r2 f (x) 4 LI , f (x ⇤ )  f (x L rf (x))  f (x) 2L krf (x)k 2

Proof.
First we can show that

⇤ 1
f (x )  f (x rf (x))
L
1 L 1
 f (x) rf (x) rf (x) + k rf (x)k22
>
L 2 L
1
= f (x) krf (x)k22
2L

Hao Yan Gradient-based Optimization Methods January 23, 2024 62 / 91

Let
2 405

flat f x Ef fin of of
11141T
for 1171T
I
fk 0
Gradient Descent and its variants Gradient Descent: How Machine Learns

Weakly Convex: 1/k sublinear

O
Theorem (Theorem)
2Lkx0 xk2
Assume µI 4 r2 f (x)
4 LI , if µ 0, f (xk ) f (x ⇤ )  k , f (xk ) has
sublinear convergence. x ⇤ is the optimal solution.

Denote optimal x ⇤ , it can be shown that {kxk x ⇤ k} is decreasing

Define k = f (xk ) f (x ⇤ ). By convexity we have

k  rf (xk )T (xk x ⇤ )  krf (xk )kkxk x ⇤ k  krf (xk )kkx0 x ⇤k

o
1
f (xk+1 )  f (xk ) krf (xk )k22
2L
Subtracting f (x ⇤ ) for both sides, we have
1 1
k+1  k krf (xk )k2  k
2
k
2L 2Lkx0 x ⇤ k2

Hao Yan Gradient-based Optimization Methods January 23, 2024 63 / 91

decresig sequence 1171 If

meet
flk fly 7ft folk convexity of f

d
fam f 571117 4101
Ok

ok
IT
1 KIT
0kt Ok a
Gradient Descent and its variants Gradient Descent: How Machine Learns

Weakly convex: 1/k sublinear

Take inverse and apply (1 ✏) 1 1+✏

1 1 1
0114
/(1 k)
k+1 k 2Lkx0 x ⇤ k2

a
1 1 1 k +1
+ +
k 2Lkx0 x ⇤ k2 0 2Lkx0 x ⇤ k2

which yields
2Lkx xk2
0
f (xk+1 ) f (x ⇤ ) 
k +1
The classical 1/k convergence rate (sublinear)

Hao Yan Gradient-based Optimization Methods January 23, 2024 64 / 91

Gradient Descent and its variants Gradient Descent: How Machine Learns

Strong Convex: Linear

Theorem (Theorem)

Assume µI 4 r2 f (x) 4 LI , if µ > 0, xk has linear convergence

0
kxk+1 x ⇤ k22  (1 ↵µ)kxk x ⇤ k2 if 0 < ↵  L1

Proof. onvergence
We know that xk , by plugging in strong convexity:
f (x ⇤ ) f (xk ) + rf (xk )> (xk x ⇤ ) + 12 µkxk x ⇤ k2 and
1
f (x ⇤ )  f (x) 2L krf (x)k22 , we have
Dec11 Dic 210 11
kxk+1 x ⇤ k22 = kxk x ⇤ ↵rf (xk )k2
= kxk x ⇤ k2 2↵rf (xk )> (xk x ⇤ ) + ↵2 krf (xk )k2
 (1 ↵µ)kxk x ⇤ k2 2↵(f (xk ) f (x ⇤ )) + ↵2 krf (xk )k2
 (1 ↵µ)kxk x ⇤ k2 2↵(1 ↵L)(f (xk ) f (x ⇤ ))

If we Yan↵  1 , we have
Haolet 2↵(1
Gradient-based ↵L) is
Optimization negative, and
Methods can
January be droped.
23, 2024 65 / 91
7
β
1101 11
fm 2 110th
i
Of fish f off
101

f
101k Conexity

076.5 An mg
IEEE

11ft f 217 1111 71 11

k
2 1 1

4
112
1 111 2

I
1
Gradient Descent and its variants Gradient Descent: How Machine Learns

8
Strong Convex: Linear

Theorem (Theorem) Strously Cunex

2
Assume µI 4 r2 f (x) 4 LI , if µ > 0, by set ↵k = L+µ , f (xk ) has linear
convergence
Kool I
L 2 2k
f (xk ) ⇤
f (x )  (1 ) kx0 x ⇤ k2
2 +1

Strong convex case yield a linear/geometric

F rate, which is generally
much better than any sublinear rate.

Linear convergence

Hao Yan Gradient-based Optimization Methods January 23, 2024 66 / 91

Gradient Descent and its variants Gradient Descent: How Machine Learns
The slow linear rate is typical!
What is the Problem of Gradient Descent

4 14 4
Not just
f (xk a
) pessimistic bound!
f (x ⇤ )  L2 (1 2 2k
x ⇤ k2 , when  1, 1 2
⇡1
+1 ) kx0 +1

not
ÉlI
f 15
431

Hao Yan Gradient-based Optimization Methods January 23, 2024 67 / 91

Gradient Descent and its variants Gradient Descent: How Machine Learns

FISTA Acceleration

ANNA AN
Basic step is Aggregate gradient over histry

xk+1 = xk + ↵k pk , pk = rf (xk ) + k pk 1

Accelerated version
xk = yk L1 rf (yk )
1
p
tk+1 = 2 (1 + 1 + 4tk2 )
yk+1 = xk + ttkk+11 (xk xk 1)

O
1
Weakly convex f , converges with f (xk ) f (x ⇤ ) ⇠ k2
1/k 2 is optimal rate for sublinear problem
u

G
Hao Yan Gradient-based Optimization Methods January 23, 2024 68 / 91
Gradient Descent and its variants Gradient Descent: How Machine Learns

Example: Linear Regression

Linear Regression Loss Function

1
l( ) = min ky X k2 + k k2
n
2 T
Gradient: r l( ) = n X (y X )+2

2 T
k+1 = k + ↵k ( X (y X k) 2 k)
n

011
Hao Yan Gradient-based Optimization Methods January 23, 2024 69 / 91
Gradient Descent and its variants Gradient Descent: How Machine Learns

Linear Regression Loss Function

Hessian Matrix:

2 2
H = r l( ) = r X T (y X )+2 I
n
2
= r XTX + 2 I
n
2 T
= X X +2 I ⌫0
n
Is it convex? Yes!
Is it strongly convex?
If > 0: Yes!
If <0
If n > p, if X is full rank (with rank p), it is strongly convex
if p > n, not strongly convex!

Hao Yan Gradient-based Optimization Methods January 23, 2024 70 / 91

Gradient Descent and its variants Gradient Descent: How Machine Learns

Complexity of Gradient Descent

KIT IR complexity

Time Complexity:
Compute X takes O(np) XP
Compute X T (y X ) takes O(np) 1 13 Olap
Space Complexity: a
Store X takes O(np)

2
XY XP
Olap

Hao Yan Gradient-based Optimization Methods January 23, 2024 71 / 91

H
O up
Gradient Descent and its variants Gradient Descent: How Machine Learns

Comparison of Gradient Descent and Analytical solution

closed
form β XX 2 XTY
I ocupy
Space Time
Gradient Descent O(np) O(np) iteration
per
Analytical Solution O(np) O(np 2 )
e
Question:
Is Gradient descent straightly better due to smaller complexity?
When should we use analytical solution?

total
ff If
of Iteration E C complexity
of a

Analitiial
6
1 peas C
Hao Yan Gradient-based Optimization Methods January 23, 2024 72 / 91
Second-order Optimization Methods

Section 4

Second-order Optimization Methods

Hao Yan Gradient-based Optimization Methods January 23, 2024 73 / 91

Second-order Optimization Methods Second-Order Optimization Methods

Second-order Optimization Method - Newton’s Method

Suppose we want to solve

HEI
min f (x)
x
At x = x0 , f (x) can be approximated by
T 1
f (x) ⇡ h(x) := f (x0 ) + rf (x0 ) (x x0 ) + (x x0 )T H(x0 )(x x0 )
2

HEI
H(x0 ) is the Hessian of f (x) defined by
@2
Hij (x0 ) = f (x)
@xi @xj
1
arg min f (x) ⇡ arg min h(x) = x0 H(x0 ) rf (x0 )
x x

Take Home Message:

Newton’s Method considers the curvature information of the loss surfaces
by using the Heissian information
Hao Yan Gradient-based Optimization Methods January 23, 2024 74 / 91
min

Th 7fk.lt 11x 17
e

Of Xo Hm off
HELIX
04101.7 GD

II
H Newton's director
Linear o
11h11

tik i ii.in

ind R

mint 05
Invariant scale
of

If
t
affect AD

H I 41
1

D.at C E
DX
Second-order Optimization Methods Second-Order Optimization Methods

Newton Algorithm

Algorithm
Given x0 , set k = 0
Solve dk such that H(xk )dk = rf (xk )
Normally set ↵k = 1, xk+1 = xk + ↵k dk
Repeat until convergence
Solving dk requires to assume that H(xk ) is nonsingular at each
iteration
Newton’s method considers curvature of the original problem
Scale invariant for Newton methods
Newton’s method is invariant to aﬃne transformation x Dx

Hao Yan Gradient-based Optimization Methods January 23, 2024 75 / 91

Second-order Optimization Methods Second-Order Optimization Methods

Descent Property

Theorem (Theorem)

Descent Direction: If r2 f 0, then the Newton’s step is a descent

direction

Proof.
If r2 f 2
4 20
1120
0, then we know that r f is positive definite and invertible. As a
result, we have
We know that the newton’s direction
x = xk+1 xk = [r2 f (xk )] 1 rf (xk ) so

rf (x)> x = rf (xk )[r2 f (xk )] 1

rf (xk ) < 0.

This shows that the Newton’s step is a descent direction

Hao Yan Gradient-based Optimization Methods January 23, 2024 76 / 91

Second-order Optimization Methods Second-Order Optimization Methods

Convergence Property

Theorem (Theorem)

Newton’s method has quadratic convergence rate

We will give a proof for 1-D case for example, where x is a scalar.
consider g (x) = rf (x). Consider the 2nd order Taylor approximation
of g (x). There exists ⇠k between xk and x ⇤ such as
1 2
⇤
0 = g (x ) = g (xk ) + rg (xk )(x ⇤
xk ) + r g (⇠k )(xk x ⇤ )2
2
Suppoe that r 1 g (x exists
k)

1 1
0 = [r g (xk )]g (xk ) + (x xk ) + [r 1 g (xk )]r2 g (⇠k )(xk x ⇤ )2
⇤
2
1
= (xk+1 xk ) + x ⇤
xk + [r 1 g (xk )]r2 g (⇠k )(xk x ⇤ )2
2
Hao Yan Gradient-based Optimization Methods January 23, 2024 77 / 91
Quadratic convergence

I Tac.tn

11dm 21 1 106 2 11

0.1 0.05 0.025 0.0125

0 I 0.01 0.000 0.000000

909 7512

Skf 01k 0H

7 9611914
0 109509611
Xia 2
76

EI.si t Ie
Second-order Optimization Methods Second-Order Optimization Methods

Convergence Property

This gives
ek+1 xk+1 x ⇤ 1 1
2
:= 2
= [r g (xk )]r2 g (xk )
ek ⇤
(xk x ) 2

|r2 f (x)|
Let M = supx,y |2rf (y )| < 1, therefore we know that
xk+1 x ⇤
limk!1 (x x ⇤ )2  M < 1. If we select the initial point
k
|e0 | < 1, ek ! 0 with quadratic rate.

Hao Yan Gradient-based Optimization Methods January 23, 2024 78 / 91

Second-order Optimization Methods Second-Order Optimization Methods

Newton Convergence
log ly I
Newton’s method has Local Quadratic convergence E 1
kxk+1 x ⇤ k = O(kxk x ⇤ k2 )

Theorem (Theorem)

If f is strongly convex on S with constant m, and r2 f is Lipschitz

continuous on S with constant L > 0 such that

kr2 f (x) r2 f (ydo)k2  Lkx y k2

Then the number of iterations until f (x) f (x ⇤ )  ✏ is bounded above by

f (x0 ) f (x ⇤ ) ✏i
+ log log( )
✏

Hao Yan Gradient-based Optimization Methods January 23, 2024 79 / 91

Second-order Optimization Methods Second-Order Optimization Methods

Scale Invariant
Gradient descent is not scale invariant
Newton’s method considers curvature of the original problem through
the Hessian matrix
Scale invariant for Newton methods
Newton’s method is invariant to aﬃne transformation x Dx

Take Home Message:

Newton’s Method is scale-invariant and less sensitive to scale normalization
as gradient descent does.
Hao Yan Gradient-based Optimization Methods January 23, 2024 80 / 91
Second-order Optimization Methods Second-Order Optimization Methods

Can it be Practical?

High-dimensional problem when p > 104

Computing r2 f is in general not possible
2
OP XP
r f may have structures such as sparsity or low rank
0

Often don’t need r2 f – Use approximation instead

Do we really need the full Hessian matrix? P3
Gradient descent is a special case of Newton’s method with identity
Hessian approximation Hk = ↵k 1 I
xk+1 = xk Hk 1 rf (xk ) ! xk+1 = xk ↵k rf (xk )
Gradient descent with identity Hessian approximation works great

Hao Yan Gradient-based Optimization Methods January 23, 2024 81 / 91

Second-order Optimization Methods Second-Order Optimization Methods

Quasi-Newton’s Method

BFGS: maintains a low-rank approximation to the inverse Hessian

L-BFGS: (Limited memory version) doesn’t explicitly store the Hessian
matrix
Newton-CG: Compute the newton’s step via iterative conjugate
gradient approach

Of plan Offa
Heissian
free optimism
for 2nd order
Hao Yan Gradient-based Optimization Methods January 23, 2024 82 / 91
Second-order Optimization Methods Second-Order Optimization Methods

Sampled Newton

Pn
L(⇠; x) is separable L(⇠; x) = i=1 L(⇠i ; x)
Stochastic Gradient descent use subset of full samples to estimate the
gradient
Sampled subset of B ✓ {1, 2, · · · , n} randomly

1 X
r2 f (⇠) = r2 L(⇠i ; x)
|B|
i2B

Problem: for high-dimensional problem (large p, p > 104 )

r2 L(⇠i ; x) is of size p ⇥ p, which is too large to store or compute.

Hao Yan Gradient-based Optimization Methods January 23, 2024 83 / 91

Second-order Optimization Methods Second-Order Optimization Methods

Quasi-Newton Methods

Maintains an approximation to the Hessian that’s filled in using

information gained on successive steps.
Generate sequence {Bk } of approximate Hessian alongside the iterate
sequence wk , and caculate steps dk by solving

Hk d k = rf (xk ) Nenton directors

Silvey
Update from Bk ! Bk+1 so that
Approximate Hessian mimics the behavior of true Hessian over this
step
r2 f (xk+1 )sk ⇡ yk
where sk = xk+1 xk , yk := rf (xk+1 ) rf (xk ) so we enforce

B k s k = yk

Hao Yan Gradient-based Optimization Methods January 23, 2024 84 / 91

Second-order Optimization Methods Second-Order Optimization Methods

Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm

tank 2 update
Bk ss T Bk yy T
Bk+1 = Bk s T Bk s
+ yT s
, where s = sk and y = yk
Start with B0 = ⇢I for some multiple ⇢ that is consistent with problem
scaling, e.g. s T y /s T s
Uv
Can maintain instead an Hk to inverse Hessian
5 YEX
Hk+1 = (I ⇢sy T )Hk (I ⇢ys T ) + ⇢ss T Hi Pap
Where ⇢ = 1/(y T s).
If sftth.gs
Can prove superlinear local convergence for BFGS and other
OCP
quasi-Newton methods: kxk+1 x ⇤ k/kxk x ⇤ k ! 0.
Slower than Gradient Descent but not as fast as Newton.
Cheaper than Newton
space Stre Matrix Hi
Problem: Hk is too large to store
p
Hao Yan Gradient-based Optimization Methods January 23, 2024 85 / 91
Second-order Optimization Methods Second-Order Optimization Methods

L-BFGS

1
5k Yi PI O mp
LBFGS doesn’t store the p ⇥ p matrices Hk or Bk from BFGS explicitly
Only keep track sk and yk from last few iterations (e.g. m = 5 or 10)
Take an initial matrix (B0 or H0 ) and assume that m steps have been
taken since.
A simple procedure computes Bk u via a series of inner and outer
products with the matrices sk j and yk j from last m iterations:
j = 0, · · · , m 1
Requires 2mp storage and O(mp) linear algebra operations
No superlinear convergence proved, but good behavior is observed on
a wide range of applications

Hao Yan Gradient-based Optimization Methods January 23, 2024 86 / 91

Second-order Optimization Methods Second-Order Optimization Methods

Optimization in Logistic Regression

Loss function: (we can assume b = 0 for simplicity)

N
X
l(✓) = yi log (✓T xi ) (1 yi ) log(1 (✓T xi ))
i=1

The gradient X
0
l (✓) = xi (ti yi )
i
(k)
Define ti = (✓(k)T xi ) as the target
Gradient Descent
X (k)
(k+1) (k)
✓ =✓ xi (ti yi )
i

Hao Yan Gradient-based Optimization Methods January 23, 2024 87 / 91

Second-order Optimization Methods Second-Order Optimization Methods

Likelihood:
Y
P(y |x; ✓) = P(yi |xi ; ✓)
i
N
Y
= P(yi = 1|x, ✓)yi P(y = 0|x, ✓)1 yi

i=1

BCE := log P(y |x; ✓)

N
X
= ( yi log P(yi = 1|x, ✓) + (1 yi ) log P(yi = 0|x, ✓))
i=1
Hao Yan Gradient-based Optimization Methods January 23, 2024 88 / 91
Second-order Optimization Methods Second-Order Optimization Methods

Logistic Loss
The binary cross
PNentropy is given
as:BCE = ( i=1 yi log P(yi = 1|x, ✓) + (1 yi ) log P(yi = 0|x, ✓))

We can see BCE is a linear combination of two convex terms, which is

a also a convex function
Example: when yi = 1, yi log P(yi = 1|x, ✓) = log P(yi = 1|x, ✓)
(red curve)
log P(yi = 1|x, ✓) is 0 when P(yi = 1|x, ✓) = 1 correct prediction
log P(yi = 1|x, ✓) is 1 when P(yi = 1|x, ✓) = 0 : wrong prediction
Hao Yan Gradient-based Optimization Methods January 23, 2024 89 / 91
Second-order Optimization Methods Second-Order Optimization Methods

Optimization in Logistic Regression

Loss function: (we can assume b = 0 for simplicity)

N
X
l(✓) = yi log (✓T xi ) (1 yi ) log(1 (✓T xi ))
i=1

The Hessian matrix is

X
H= xi xi0 ti (1 ti )
i
= X T WX ⌫ 0

Theorem (Theorem)

BCE loss function is a Convex function.

Hao Yan Gradient-based Optimization Methods January 23, 2024 90 / 91

Second-order Optimization Methods Second-Order Optimization Methods

Optimization in Logistic Regression

Loss function: (we can assume b = 0 for simplicity)
N
X
l(✓) = yi log (✓T xi ) (1 yi ) log(1 (✓T xi ))
i=1

The Hessian matrix is

X
H= xi xi0 ti (1 ti )
i
= X T WX ⌫ 0
W = diag(t1 (1 t1 ), · · · , tn (1 tn )) Weighted
Solution: Linear Regression
1
✓(k+1) = ✓(k) + (X T W(k) X ) X T W(k) Z(k)
Z(k) = W(k)1 (y t) is the residual
Solve weighted linear regression in closed form
Hao Yan Gradient-based Optimization Methods January 23, 2024 91 / 91
119
f 0112
9
Newton's stop for 1st Here

00 0
151 y
H X 7 0

0 0 Ht of.la
0 XX XXO 9

XD Xy XXXX
XX XTY

min y xojwly xoye

nto. O

0 0
Stochastic Gradient Descent

Section 5

Stochastic Gradient Descent

Hao Yan Gradient-based Optimization Methods February 8, 2024 86 / 113

Stochastic Optimization

Random Variable 3 µL
Model permet
so
11.551
Data or
ITSamples.IT Jn
If
m.int fo 5i L.ln

GD
016 4 74 E 0511.9
use all n to estimate 0L reduce variance

Stochastic GD

EEgrf9
one

si.fi
pick
Variance

one both of si.IE 0f619i Eg0fk.s

GD
M
sample B sample n samples
medium smallest
Variance large
orpletston np
1 1
Stochastic Gradient Descent Stochastic Gradient Descent

Problem of Gradient Methods

I ftp is O P
Data: ⇠ 2 S sample space
DIE
Loss function: F (x, ⇠)
Average of Irish
Risk function E (F ) = E⇠ F (x, ⇠) = f (x)
1 Pn
Empirical risk: En (F ) = n i=1 F (x, ⇠i )
Gradient Descent: _i.is
n
1X
xk+1 = xk ↵k
n
rx F (xk , ⇠i ) OCP
i=1 is too expensive
for each

III
step
Problem: n can be very large: n ⇡ 100000, evaluating En (Fe) at each
iteration is extremely time-consuming

Hao Yan Gradient-based Optimization Methods February 8, 2024 87 / 113

Stochastic Gradient Descent Stochastic Gradient Descent

Solution: Stochastic Gradient Descent

What if at each iteration k, we pick up ⇠k randomly from {⇠1 , · · · , ⇠n }

Main ideas: Any rx F (xk , ⇠k ) is an unbiased estimator of
E⇠ (rx F (xk , ⇠))
xk+1 = xk ↵k rx F (xk , ⇠k )
Significantly reduce the time for each iteration
Mini-Batch: use a batch of samples more than one training sample.
B
1 X
E⇠ (rx L(x, ⇠)) ⇡ F (x, ⇠˜i )
B
i=1

Hao Yan Gradient-based Optimization Methods February 8, 2024 88 / 113

Stochastic Gradient Descent Stochastic Gradient Descent

Stochastic Optimization
Stochastic Optimization

ML optimise 0
Replace
With the gradient
noisy gradients, with cheaper noisy estimates {[}Robbins and
update
Monro, 1951{]}
What if at each iteration
t+1 = t + t
ˆ up x(i trandomly
i, we pick ) from {x1 , · · · , xn }
Any r✓ l(✓; xk ) is an unbiased estimator of Ex (r✓ l(✓; x))
Requires unbiased gradients, ˆ ( ) = ( )
Hao Yan Gradient-based Optimization Methods February 8, 2024 89 / 113
Stochastic Gradient Descent Stochastic Gradient Descent

Stochastic Gradient Descent

die randomly
Algorithm: ✓t+1 = ✓t
I
↵t r✓ l(✓t ; xk )
Guaranteed to converge to a local optimum {[}Bottou, 1996{]}
Significantly reduce the time for each iteration
Mini-Batch: use a batch of samples more than one training sample.
B
1 X
Ex (r✓ l(✓; x)) ⇡ l(✓; xi )
B
i=1

Has enabled modern machine learning

Hao Yan Gradient-based Optimization Methods February 8, 2024 90 / 113

Stochastic Gradient Descent Stochastic Gradient Descent

Stochastic Gradient Descent

Hao Yan Gradient-based Optimization Methods February 8, 2024 91 / 113

poI E I

EIEEF

E.it o

ask local
a

i
E
Stochastic Gradient Descent Stochastic Gradient Descent

Classical Stochastic Gradient Descent

Under the following assumption, SGD converges

Loss function diﬀerentiable and P
bounded below P
1 1
Learning rate conditions satisfy k=1 ↵k = 1, k=1 ↵k2 < 1
Other mild conditions on the distribution of the gradient function
1
For strongly convex function, ↵k = kµ

max(k✓k ✓⇤ k2 , µM2 )
E (k✓k ✓ ⇤ k2 ) 
2k
where 0 µI 4 r2 f (x) 4 LI , E (krf (x)k2 )  M 2
In reality, ↵k = ↵0 (1 + ↵0 k) 1

For weakly convex, the convergence rate is O( p1k ), (sublinear

convergence)

Hao Yan Gradient-based Optimization Methods February 8, 2024 92 / 113

Stochastic Gradient Descent Stochastic Gradient Descent

Proof
ak = 12 E (k✓k ✓⇤ k2 ). Assume M > 0 such that
E (kr✓ f (✓, x)k2 )  M 2 . Thus
1
k✓k+1 ✓⇤ k22
2
1
= k✓k+1 ✓⇤ ↵k r✓ F (✓k , xk )k2
2
1 ⇤ 2 1 2
= k✓k+1 ✓ k ↵k (✓k ✓ ) r✓ F (✓k , xk ) + ↵k kr✓ F (✓k , xk )k2
⇤ T
2 2
Take expectations on xk
⇤ T 1 2 2
ak+1  ak ↵k Exk [(✓k ✓ ) r✓ F (✓k , xk )] + ↵k M
2
We have
Exk [(✓k ✓⇤ )T r✓ F (✓k , xk )] = (✓k ✓⇤ )T Exk [r✓ F (✓k , xk )]
= (✓k ✓ ⇤ ) T gk
where
Hao Yan
(✓k , xk )] (unbias
gk = Exk [r✓ FGradient-based estimator)
Optimization Methods February 8, 2024 93 / 113
Stochastic Gradient Descent Stochastic Gradient Descent

Proof

By strong convexity:

⇤ ⇤ 1 ⇤ 2 1 ⇤ 2 1
(✓k ✓ )gk f (✓k ) f (✓ ) + µk✓k ✓ k µk✓k ✓ k = µak
2 2 2
We have
E ((✓k ✓⇤ )gk ) 2µak
We have
1 2 2
ak+1  (1 2µ↵k )ak + ↵k M
2
1
When ↵k = kµ , we have ak  Q
2k , where Q := max(kxk x ⇤ k2 , µM2 )

Hao Yan Gradient-based Optimization Methods February 8, 2024 94 / 113

Stochastic Gradient Descent Stochastic Gradient Descent

The Trade-oﬀs of Large Scale Learning

f ⇤ = arg minf E (f ) is the best possible prediction

fF⇤ = arg minf 2F E (f ) is the best function in certain parametrized
family F
fn = arg minf 2F En (f ) is the empirical optimum
f˜n is the result achieved by Optimization algorithm
E = E [E (fF⇤ ) E (f ⇤ )] + E [E (fn⇤ ) E (fF⇤ )] + E [E (f˜n ) E (fn )]
| {z } | {z } | {z }
Eapp Eest Eopt
Approximation Error: Eapp measures the error of limiting f 2 F
Estimation Error: Eest measures the error of using emperical risk rather
than espected risk
Optimizaiton error: Eopt measures the impact of approximate
optimization bassed on emperical risk

Hao Yan Gradient-based Optimization Methods February 8, 2024 95 / 113

Stochastic Gradient Descent Stochastic Gradient Descent

Large-scale Optimization

Constraint: Maximum computation time Tma⇠ , Maximal training size

nmax
family of function F, optimization accuracy ⇢, the number of
exmaples n
(
n  nmax
minF ,⇢,n E = Eapp + Eest + Eopt s.t.
T (F, ⇢, n)  Tmax
Small-scale learning problems: constrained by the maximal number of
samples nmax . Optimization time is not an issue. Eopt ! 0 by choosing
very small ⇢ (optimization accuracy), minimizing Eest by choosing
n = nmax . Classical approximation-estimation trade-oﬀ.
Large-scale learning problems: constrained by the computing time Tmax
when nmax is very large. Approximate optimization can achieve better
expected risk because more training examples can be processed during
the allowed time

Hao Yan Gradient-based Optimization Methods February 8, 2024 96 / 113

Stochastic Gradient Descent Stochastic Gradient Descent

Approximation-Estimation-Optimization Tradeoﬀ

F n ⇢
Eapp (approximation error) &
Eest (estimation error) % &
Eopt (optimization error) ... ... %
T (computation time) % % &

Hao Yan Gradient-based Optimization Methods February 8, 2024 97 / 113

Stochastic Gradient Descent Stochastic Gradient Descent

Approximation-Estimation-Optimization Tradeoﬀ

n: number of samples, d: dimensionality,  condition number

Hao Yan Gradient-based Optimization Methods February 8, 2024 98 / 113

Stochastic Gradient Descent Stochastic Gradient Descent

Example 1: Compute Mean

arg min✓ Ex [ 12 (x ✓)2 ], choose ↵k = 1/k

Every time generate ✓k and use gradient descent

✓k = ✓k 1 ↵k (✓k 1 xk )
1 k 1 1
= ✓k 1 (✓k 1 xk ) = ✓k 1 + xk
k k k
k
X
k✓k = (k 1)✓k 1 + xk = · · · = xi
i=1
1 Pk
✓k = k i=1 xi is the mean estimator
Convergence E (k✓k ✓⇤ k2 ) = O( k1 )

Hao Yan Gradient-based Optimization Methods February 8, 2024 99 / 113

Stochastic Gradient Descent Stochastic Gradient Descent

Example 2: Linear Regression

X
min ky X k2 = min (yi T
xi ) 2
i

Linear Regression Loss Function

X
l( ) = min (yi T
xi ) 2
i

The gradient can be computed by

r (yi xiT )2 = 2( T
xi yi )xi

Stochastic Gradient descent: for each time, sample an xi and compute

the following update:
(⌧ +1) (⌧ ) T
= + 2⌘( xi yi )xi

Hao Yan Gradient-based Optimization Methods February 8, 2024 100 / 113

Stochastic Gradient Descent Stochastic Gradient Descent

Complexity for LR

Space complexity, save xi , yi , : O(p)

Time complexity:
T
xi : O(p)
T
xi yi : O(1)
(⌧ )
+ ⌘( T xi yi )xi : O(p)
(⌧ +1) (⌧ ) T
= + ⌘( xi yi )xi

Space Time
SGD O(p) O(p)
Gradient Descent O(np) O(np)
Analytical solution O(np) O(np 2 )
Table: Time and storage complexity table

Hao Yan Gradient-based Optimization Methods February 8, 2024 101 / 113

Stochastic Gradient Descent Stochastic Gradient Descent

Scale variance

Scale invariance: x ! Dx, the optimization algorithm behaves the

same
GD or SGD is not scalable invariance, since gradient is not scale
invariance

Hao Yan Gradient-based Optimization Methods February 8, 2024 102 / 113

Stochastic Gradient Descent Other Stochatic Gradient Optimizers

SGD with Momentum

Momentum
vt = vt 1 + (1 )r l( )
= ↵vt

Hao Yan Gradient-based Optimization Methods February 8, 2024 103 / 113

Stochastic Gradient Descent Other Stochatic Gradient Optimizers

Why Momentum

Hao Yan Gradient-based Optimization Methods February 8, 2024 104 / 113

Stochastic Gradient Descent Other Stochatic Gradient Optimizers

Other Popular Variants

Adagrad: adapts the learning rate to the parameters, well suited with
sparse data.
Adadelta: extension of adagrad to reduce its agressive, monotonically
decreasing learning rate
Adam: add momentum to adaptive learning rates for parameter
Diﬀerent approach:
Momentum accelerates our search in direction of minima: ball running
down a slope
RMSProp impedes our search in direction of oscillations: heavy ball
with friction

Hao Yan Gradient-based Optimization Methods February 8, 2024 105 / 113

Stochastic Gradient Descent Other Stochatic Gradient Optimizers

RMSprop
Derived by Geoﬀrey Hinton, while suggesting a random idea during a
Coursera class.
RMSProp also tries to dampen the oscillations, but in a diﬀerent way
than momentum
Update Rule, for each parameter j

2 @l( )
vj,t = ⇢vj,t 1 + (1 ⇢)gj,t , gj,t =
@ j
⌘
j,t+1 = j,t p gj,t
vj,t + ✏
Use exponential averaging of the squared gradient
Exponential averaging: More recent gradient is more important
This squared gradient ensure the gradient of the oscillating direction
will accumulate, which leads to a small learning rate
Parameters that would ordinarily receive smaller or less frequent
updates receive larger updates with
Hao Yan Gradient-based Optimization Methods February 8, 2024 106 / 113
Stochastic Gradient Descent Other Stochatic Gradient Optimizers

Adam

Combine ideas from RMSprop and Momentum

Update Rule, for each parameter j

@l( )
vj,t = ⇢1 vj,t 1 + (1 ⇢1 )gj,t , gj,t =
@ j
2
sj,t = ⇢2 sj,t 1+ (1 ⇢2 )gj,t
vj,t
j,t+1 = j,t p gj,t
sj,t + ✏

vj,t : consider momentum

sj,t : axis depedent learning rate

Hao Yan Gradient-based Optimization Methods February 8, 2024 107 / 113

Stochastic Gradient Descent Other Stochatic Gradient Optimizers

Benchmark Methods

Hao Yan Gradient-based Optimization Methods February 8, 2024 108 / 113

Stochastic Gradient Descent Other Stochatic Gradient Optimizers

Benchmark Methods

Hao Yan Gradient-based Optimization Methods February 8, 2024 109 / 113

Stochastic Gradient Descent Other Stochatic Gradient Optimizers

Parallel Gradient Descent

Gradient Computing
x x ↵rf (x)
Example: empirical risk minimization:
N
1X
arg min fi (x)
x n
i=1

Map: compute rx fi (x)

Reduce: compute
1 X
rf (x) = ( rx fi (x))
n
i

Problem: Lots of communication and synchronization

Hao Yan Gradient-based Optimization Methods February 8, 2024 110 / 113

Stochastic Gradient Descent Other Stochatic Gradient Optimizers

Paralell SGD methods

Parallelizing SG is hard and ongoing research problem

Hogwild
Allows performing SGD updates in parallel on CPUs with shared
memory (input data is sparse and each update only modify a fraction
of all parameters)
Asynchronous variant of SGD: Downpour SGD
used in DistBelief framework at Google (predecessor to TensorFlow)

Hao Yan Gradient-based Optimization Methods February 8, 2024 111 / 113

Stochastic Gradient Descent Other Stochatic Gradient Optimizers

Other gradient-based algorithms

Other 1st Order Method:

Proximal Gradient
Subgradient Descent
Second-order method:
(Quasi) Newton’s method
Limited-memory BFGS

Hao Yan Gradient-based Optimization Methods February 8, 2024 112 / 113

Stochastic Gradient Descent Other Stochatic Gradient Optimizers

Other Methods

Integer Programming
Cutting plane methods
Branch and bound methods
Stochastic Algorithms
Direct Monte-Carlo sampling-based Optimization
Heuristics Algorithms:
Simulated annealing
Tabu search
Swarm-based optimization algorithms (e.g. Particle swarm
optimization)
Evolutionary algorithms (Genetic Algorithms)
Learn to optimize (RNN-based learning)

Hao Yan Gradient-based Optimization Methods February 8, 2024 113 / 113

Unconstrained Numerical Optimization An Introduction For Econometricians
100% (1)
Unconstrained Numerical Optimization An Introduction For Econometricians
32 pages
Chapter 0: Introduction: 0.2.1 Examples in Machine Learning
No ratings yet
Chapter 0: Introduction: 0.2.1 Examples in Machine Learning
4 pages
ConvexSpring25 Week 1 2
No ratings yet
ConvexSpring25 Week 1 2
46 pages
OpTimIzation Overview
No ratings yet
OpTimIzation Overview
47 pages
CH 4
No ratings yet
CH 4
28 pages
Efficient Methods in Optimization
No ratings yet
Efficient Methods in Optimization
159 pages
Pink Blue Cute Rabbit Group Project Presentation
No ratings yet
Pink Blue Cute Rabbit Group Project Presentation
29 pages
Exam 2018
No ratings yet
Exam 2018
18 pages
Optimization for Engineers
No ratings yet
Optimization for Engineers
166 pages
ConvexSpring25 Week9
No ratings yet
ConvexSpring25 Week9
26 pages
Optimization: Optimization and Some Traditional Methods
No ratings yet
Optimization: Optimization and Some Traditional Methods
21 pages
Optimization Models
No ratings yet
Optimization Models
104 pages
Optimization for Engineers
No ratings yet
Optimization for Engineers
59 pages
Nesterov - Introductory Lectures Convex Programming Vol I
No ratings yet
Nesterov - Introductory Lectures Convex Programming Vol I
212 pages
Berkeley-Tutorial Optimization For Machine Learningpart2
No ratings yet
Berkeley-Tutorial Optimization For Machine Learningpart2
35 pages
Unconstrained Optimization Techniques
No ratings yet
Unconstrained Optimization Techniques
25 pages
Optimization With R - Tips and Tricks
No ratings yet
Optimization With R - Tips and Tricks
17 pages
02 Grad Desc
No ratings yet
02 Grad Desc
54 pages
Numerical Optimization Course Notes
No ratings yet
Numerical Optimization Course Notes
96 pages
Numopt 0
No ratings yet
Numopt 0
163 pages
Convex Optimization in Machine Learning
No ratings yet
Convex Optimization in Machine Learning
110 pages
Matinf 2360 Part 3
No ratings yet
Matinf 2360 Part 3
106 pages
Algorithms and Complexity
No ratings yet
Algorithms and Complexity
130 pages
Optimization Theory 2
No ratings yet
Optimization Theory 2
27 pages
Cheatsheet
No ratings yet
Cheatsheet
2 pages
Chapter 9st - Non-Linear Programming
No ratings yet
Chapter 9st - Non-Linear Programming
21 pages
Exam1Review Annotated
No ratings yet
Exam1Review Annotated
13 pages
Process Optimization
100% (1)
Process Optimization
70 pages
Concave + Convex
No ratings yet
Concave + Convex
37 pages
LGT2
No ratings yet
LGT2
32 pages
ML Optimization Techniques Explained
No ratings yet
ML Optimization Techniques Explained
25 pages
Stats 102B Cheat Sheet
No ratings yet
Stats 102B Cheat Sheet
4 pages
Optimization Based On Gradient Descent
No ratings yet
Optimization Based On Gradient Descent
24 pages
Gradient Based Optimization
No ratings yet
Gradient Based Optimization
24 pages
Week 10 Notes MLF
No ratings yet
Week 10 Notes MLF
20 pages
Unit 2-DLV
No ratings yet
Unit 2-DLV
84 pages
Convex Module B
No ratings yet
Convex Module B
29 pages
Bologna 07
No ratings yet
Bologna 07
315 pages
Power Systems Operation and Management: Second Lecture
No ratings yet
Power Systems Operation and Management: Second Lecture
35 pages
Op Tim Ization Uw 06
No ratings yet
Op Tim Ization Uw 06
29 pages
Convex Optimization Algorithms 1st Edition Dimitri P. Bertsekas Download PDF
No ratings yet
Convex Optimization Algorithms 1st Edition Dimitri P. Bertsekas Download PDF
49 pages
Unit 2
No ratings yet
Unit 2
76 pages
Concept of Optimization
No ratings yet
Concept of Optimization
34 pages
CS-6777 Liu Abs
100% (1)
CS-6777 Liu Abs
103 pages
Optimization Algorithms For Data Analysis Wright
No ratings yet
Optimization Algorithms For Data Analysis Wright
49 pages
55 Optimization
No ratings yet
55 Optimization
21 pages
CSE488 Lab6 Optimization
No ratings yet
CSE488 Lab6 Optimization
20 pages
NLP Slides
No ratings yet
NLP Slides
201 pages
Mathematical Methods of Optimization
No ratings yet
Mathematical Methods of Optimization
62 pages
Data Science - Convex Optimization and Examples PDF
No ratings yet
Data Science - Convex Optimization and Examples PDF
9 pages
Optimization Problems Overview
No ratings yet
Optimization Problems Overview
22 pages
Nonlinear Optimization with MATLAB
No ratings yet
Nonlinear Optimization with MATLAB
38 pages
Nonlinear Optimization: Benny Yakir
No ratings yet
Nonlinear Optimization: Benny Yakir
38 pages
Optimizing Deep Neural Network Parameters
No ratings yet
Optimizing Deep Neural Network Parameters
10 pages
EE227C Course Notes: Convex Optimization
No ratings yet
EE227C Course Notes: Convex Optimization
122 pages
Chapter 3. Graph Platforms and Processing: Platform Considerations
No ratings yet
Chapter 3. Graph Platforms and Processing: Platform Considerations
12 pages
Software Requirement Document Template
No ratings yet
Software Requirement Document Template
7 pages
Wicked Whims Tuning Errors Report
No ratings yet
Wicked Whims Tuning Errors Report
130 pages
ARM C Data Types and Efficiency
No ratings yet
ARM C Data Types and Efficiency
24 pages
MAD LAB MANUAL - R20 Prepard by P Bharah Kumar
No ratings yet
MAD LAB MANUAL - R20 Prepard by P Bharah Kumar
40 pages
II CSE CS3352 FDS QB Unit5
No ratings yet
II CSE CS3352 FDS QB Unit5
4 pages
Huawei Fusionserver rh2288h v3 Server Supported Disc
No ratings yet
Huawei Fusionserver rh2288h v3 Server Supported Disc
1 page
Silo - Tips Catia v5 Scripting With Python and Win32api
No ratings yet
Silo - Tips Catia v5 Scripting With Python and Win32api
25 pages
Bankers Algorithm Solution
No ratings yet
Bankers Algorithm Solution
11 pages
Information Systems Technology Guide
No ratings yet
Information Systems Technology Guide
16 pages
DPS TERM-2 Syllabus
No ratings yet
DPS TERM-2 Syllabus
2 pages
Digital Signal Processing - Unit V: 1. Explain The Key Features of TMS320C5x Key Features
No ratings yet
Digital Signal Processing - Unit V: 1. Explain The Key Features of TMS320C5x Key Features
2 pages
Tai Lieu Huong Dan Van Hanh Ocr-2112 Router For Scada-Spc v1.2023
No ratings yet
Tai Lieu Huong Dan Van Hanh Ocr-2112 Router For Scada-Spc v1.2023
34 pages
Acme Presentation
No ratings yet
Acme Presentation
25 pages
Pibus
No ratings yet
Pibus
35 pages
1x3 Router Project With Direct Explanation
No ratings yet
1x3 Router Project With Direct Explanation
3 pages
Python Programming Assignment 02
No ratings yet
Python Programming Assignment 02
4 pages
UE20CS352 Lab Assignment 8.1Q
No ratings yet
UE20CS352 Lab Assignment 8.1Q
5 pages
LTE Network Capacity Analysis
No ratings yet
LTE Network Capacity Analysis
37 pages
FPGA Adaptive Beamforming With HDL Coder and Zynq RFSoC
No ratings yet
FPGA Adaptive Beamforming With HDL Coder and Zynq RFSoC
36 pages
Crs112 8g 4s in Networking Switch Manual
No ratings yet
Crs112 8g 4s in Networking Switch Manual
8 pages
ALEOS 4.18.1 Software Configuration User Guide For AirLink RV55 r1
No ratings yet
ALEOS 4.18.1 Software Configuration User Guide For AirLink RV55 r1
591 pages
EIP-ET200 Configuration Tool User Reference Guide
No ratings yet
EIP-ET200 Configuration Tool User Reference Guide
54 pages
BIPARD Computer Exam Question Bank
100% (2)
BIPARD Computer Exam Question Bank
295 pages
Understanding BCD and 8421 Code
No ratings yet
Understanding BCD and 8421 Code
6 pages
Veritas DLO 9.4 BOI Setup and Configuration Guide
No ratings yet
Veritas DLO 9.4 BOI Setup and Configuration Guide
39 pages
Minepacks Configuration Guide
No ratings yet
Minepacks Configuration Guide
5 pages
Nutanix Files User Guide
No ratings yet
Nutanix Files User Guide
70 pages
Instrument Loop Check Guide
100% (1)
Instrument Loop Check Guide
1 page
Projects Computer 8th
No ratings yet
Projects Computer 8th
6 pages

Gradient-based Optimization Techniques

Uploaded by

Gradient-based Optimization Techniques

Uploaded by

Gradient-based Optimization Methods

January 23, 2024

Hao Yan Gradient-based Optimization Methods January 23, 2024 1 / 91

2 Convex and Complexity Analysis

3 Gradient Descent and its variants

4 Second-order Optimization Methods

Hao Yan Gradient-based Optimization Methods January 23, 2024 2 / 91

Hao Yan Gradient-based Optimization Methods January 23, 2024 3 / 91

Von Neumann architecture General-purpose processors

Hao Yan Gradient-based Optimization Methods January 23, 2024 4 / 91

Random-access machine (RAM) model

RAM machine consistes of

Hao Yan Gradient-based Optimization Methods January 23, 2024 5 / 91

Space and Time Complexity

Time Complexity: count the number of flops of an algorithm with size

Hao Yan Gradient-based Optimization Methods January 23, 2024 6 / 91

0  c1 g (n)  f (n)  c2 g (n)8n n0

Hao Yan Gradient-based Optimization Methods January 23, 2024 7 / 91

Mostly Used: Big-O-Notation

Count the number of flops of an algorithm with size n.

Hao Yan Gradient-based Optimization Methods January 23, 2024 8 / 91

Convex and Complexity Analysis

Hao Yan Gradient-based Optimization Methods January 23, 2024 10 / 91

with twice diﬀerentiable f

Convex Functions for Diﬀerential Functions

Any locally optimal point of a convex problem is globally optima

For unconstrained diﬀerential convex optimization Problem: x is optimal iﬀ

Proof of These Theorems

Hao Yan Gradient-based Optimization Methods January 23, 2024 14 / 91

x is optimal iﬀ rf0 (x)T (y x) 0 for all feasible y

Nonnegative Problem: x is2 optimal iﬀ x 2 domf0 , x 0,

x is optimal iﬀ x 2 domf0 , rf0 (x) = 0 (unconstrained)

Hao Yan Gradient-based Optimization Methods January 23, 2024 15 / 91

Ensuring unique solutions is important for identifiable model.

Hao Yan Gradient-based Optimization Methods January 23, 2024 16 / 91

Example of Convex Functions on R 1

Hao Yan Gradient-based Optimization Methods January 23, 2024 17 / 91

Example of Convex Functions on R n and R m⇥n

Hao Yan Gradient-based Optimization Methods January 23, 2024 18 / 91

A. Convex optimization: before 1980

Hao Yan Gradient-based Optimization Methods January 23, 2024 19 / 91

What About Non-convex function

Non-convex Optimization is Hard - Many Local Optimums

Hao Yan Gradient-based Optimization Methods January 23, 2024 21 / 91

Non-convex Optimization is Hard - Much More Saddle

Non-convex functions have much more saddle points in high

Hao Yan Gradient-based Optimization Methods January 23, 2024 22 / 91

Non-convex Problem: Active in Research

If f is non-convex, most algorithms can only converge to stationary

Other examples of non-convex optimization that cannot be solved

Hao Yan Gradient-based Optimization Methods January 23, 2024 23 / 91

Example: Convexity of Linear Regression

Please compute the Hessian matrix

Hao Yan Gradient-based Optimization Methods January 23, 2024 24 / 91

non unique solutus

Example: Convexity of Logistic Regression

Logistic regression function

Hao Yan Gradient-based Optimization Methods January 23, 2024 25 / 91

Example: Support Vector Machine

Support Vector Machine

max{0, 1 w T a} is a convex function.

Hao Yan Gradient-based Optimization Methods January 23, 2024 26 / 91

Gradient Descent and its variants

Hao Yan Gradient-based Optimization Methods January 23, 2024 27 / 91

Why Gradient Descent?

Gradient descent lies in the heart of modern machine learning

Hao Yan Gradient-based Optimization Methods January 23, 2024 28 / 91

Gradient Descent - How Machine Learns

Video from https://www.youtube.com/watch?v=V1eYniJ0Rnk

Hao Yan Gradient-based Optimization Methods January 23, 2024 29 / 91

Gradient Descent - How Machine Learns

Hao Yan Gradient-based Optimization Methods January 23, 2024 30 / 91

Gradient Descent - How Machine Learns

Hao Yan Gradient-based Optimization Methods January 23, 2024 31 / 91

Gradient Descent - How Machine Learns

*Our interest: *Vector diﬀerentiation or Gradient: y = f (x) is a scalar

Our interest: Vector diﬀerentiation or Gradient: y = f (x) is a scalar