Gradient-based Optimization Techniques
Hao Yan
Section 1
Complexity

Big O Notation
Components:
Memory (RAM): stores the program and data
Central processing unit (CPU): executes program instructions sequentially
Input/output system
Asymptotic Growth
Asymptotic Order of Growth of T(n)
Upper bound: T(n) is O(g(n)) if there exist constants c > 0 and n_0 such that T(n) ≤ c·g(n) for all n ≥ n_0
Lower bound: T(n) is Ω(g(n)) if there exists c > 0 such that T(n) ≥ c·g(n) for all n ≥ n_0
Tight bound: T(n) is Θ(g(n)) if it is both O(g(n)) and Ω(g(n)), i.e., there exist constants c_1, c_2 and n_0 such that c_1·g(n) ≤ T(n) ≤ c_2·g(n) for all n ≥ n_0
Examples
2n² + 3n + 1 = O(n²)
n = O(n²)
n! = O(nⁿ)
Section 2
Convex and Complexity Analysis

Convex Functions
f is convex if dom f is a convex set and
f(θx + (1 − θ)y) ≤ θf(x) + (1 − θ)f(y)
for all x, y ∈ dom f, 0 ≤ θ ≤ 1
f is strictly convex if dom f is a convex set and
f(θx + (1 − θ)y) < θf(x) + (1 − θ)f(y)
for all x, y ∈ dom f with x ≠ y, 0 < θ < 1
Convex Functions for Differentiable Functions
f is differentiable if dom f is open and the gradient
∇f(x) = (∂f(x)/∂x_1, ∂f(x)/∂x_2, ..., ∂f(x)/∂x_n)
exists at each x ∈ dom f
1st-order condition: a differentiable f with convex domain is convex iff
f(y) ≥ f(x) + ∇f(x)ᵀ(y − x) for all x, y ∈ dom f
[Figure: the tangent line at (x, f(x)) lies below the graph of f]
The first-order approximation of f is a global underestimator.
2nd-order condition: the Hessian satisfies
H = ∇²f(x) ⪰ 0 for all x
Strong convexity: ∇²f(x) ≻ 0 for all x
≻ means all eigenvalues are positive; ⪰ means all eigenvalues are non-negative (a quick numerical check is sketched below).
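To make the second-order condition concrete, here is a minimal numerical check (my own sketch, not from the slides): for a quadratic f(x) = ½xᵀAx + bᵀx the Hessian is the constant matrix A, so its eigenvalues decide convexity and strong convexity. The matrix A below is an arbitrary example.

```python
# Convexity check via Hessian eigenvalues for a quadratic f(x) = 0.5 x^T A x + b^T x.
# The Hessian of this f is the constant matrix A, so one eigendecomposition suffices.
import numpy as np

A = np.array([[2.0, 0.5],
              [0.5, 1.0]])          # symmetric matrix defining the quadratic (example choice)
eigvals = np.linalg.eigvalsh(A)     # eigenvalues of the (constant) Hessian

print("eigenvalues     :", eigvals)
print("convex          :", np.all(eigvals >= 0))   # H ⪰ 0  -> convex
print("strongly convex :", np.all(eigvals > 0))    # H ≻ 0  -> strongly convex
```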
Theorem
For a convex function, any locally optimal point is globally optimal.
Proof.
Suppose x is locally optimal; we prove f_0(y) ≥ f_0(x) for every y ∈ dom f_0.
Define x_θ = θy + (1 − θ)x. By convexity,
f_0(x_θ) = f_0(θy + (1 − θ)x) ≤ θ f_0(y) + (1 − θ) f_0(x)
Therefore
f_0(y) − f_0(x) ≥ (1/θ)(f_0(x_θ) − f_0(x))
If we choose θ small enough, x_θ is close enough to x that local optimality gives f_0(x_θ) − f_0(x) ≥ 0. Therefore f_0(y) ≥ f_0(x) for all y ∈ dom f_0.
Optimality Criterion
Strongly Convex
Theorem
For strictly or strongly convex functions, the global optimum is unique.
Convex:
Affine: ax + b
Exponential: exp(ax)
Powers: x^α on x > 0 for α ≥ 1 or α ≤ 0
Powers of absolute value: |x|^p on ℝ for p ≥ 1
Negative entropy: x log x on x > 0
Concave:
Affine: ax + b
Logarithm: log x
Powers: x^α on x > 0 for α ∈ [0, 1]
Examples on ℝ^n:
Affine: f(x) = aᵀx + b
Norms: ‖x‖_p = (Σ_{i=1}^n |x_i|^p)^{1/p}
Examples on ℝ^{m×n}:
Affine: f(X) = tr(AᵀX) + b
Spectral norm: f(X) = ‖X‖_2 = σ_max(X)
A Brief History
Why is non-convex optimization hard in high dimensions?
Example (non-convex): low-rank approximation
min_{rank(X)=k} ‖Y − X‖_F²
Problem: least squares
min_θ ‖y − Xθ‖²
Is it strongly convex? Under multicollinearity, XᵀX is rank deficient (some eigenvalues are zero), so the problem is convex but not strongly convex.
How about Ridge regression?
Here σ(x) = 1/(1 + exp(−x)), with derivative
d/dx σ(x) = σ(x)(1 − σ(x))
Please compute the Hessian matrix.
Is it convex? Is it strongly convex? (Under multicollinearity, H ⪰ 0 but not ≻ 0.)
How about Ridge regression?
Section 3
Gradient Descent and its Variants

Gradient Descent: How Machine Learns
[Diagram: Input → Model → Prediction ŷ, compared against the Truth y; gradient descent updates the model parameters θ, and the learner repeats the loop.]
Video from https://www.youtube.com/watch?v=IHZwWFHWa-w&t=707s
Vector Differentiation
x: a vector defined as x = (x_1, ..., x_n)ᵀ
Assume y = f(x), where x is a scalar and the output y is a vector:
dy/dx = (df_1/dx, ..., df_n/dx)ᵀ
Visualize Gradient
The gradient of f(x_1, ..., x_p) with respect to x = (x_1, ..., x_p) is given as
∇f = df/dx = (∂f/∂x_1, ..., ∂f/∂x_p)
Derivative, Gradient, Jacobian:
scalar → scalar: derivative
vector → scalar: gradient
vector → vector: Jacobian
Useful Rules
∂f(x)/∂x = (∂f(x)/∂x_1, ..., ∂f(x)/∂x_p)ᵀ
∂(xᵀa)/∂x = ∂(aᵀx)/∂x = (∂(aᵀx)/∂x_1, ..., ∂(aᵀx)/∂x_p)ᵀ = (a_1, ..., a_p)ᵀ = a
∂(Ax)/∂x = Aᵀ
Jacobian
Assume y = f(x), where x and y are vectors of size n and m
f = (f_1, ..., f_m), x = (x_1, ..., x_n), y = (y_1, ..., y_m), where y_i = f_i(x)
Define the Jacobian of the vector-to-vector mapping as
∂y/∂x = [∂f_i(x)/∂x_j], an m × n matrix whose (i, j) entry is ∂f_i(x)/∂x_j
Special case: x and y are of the same size
A multivariable function f looks locally like a linear transformation of x, given by the Jacobian matrix
Example of Jacobian
In a change of variables, the area of the image of a small rectangle is approximately the absolute value of the Jacobian determinant times the area of the corresponding rectangle.
Hessian
H = ∂/∂x (∂f/∂x) = [∂²f/∂x_i∂x_j], i.e.
    [ ∂²f/∂x_1²      ∂²f/∂x_1∂x_2   ···   ∂²f/∂x_1∂x_n ]
H = [ ∂²f/∂x_2∂x_1   ∂²f/∂x_2²      ···   ∂²f/∂x_2∂x_n ]
    [      ⋮               ⋮          ⋱         ⋮       ]
    [ ∂²f/∂x_n∂x_1   ∂²f/∂x_n∂x_2   ···   ∂²f/∂x_n²    ]
The Hessian is symmetric: H_ij = H_ji
Example: f(x, y, z) = x³ + 2y² + 3z² + xy, so
    [ 6x  1  0 ]
H = [ 1   4  0 ]
    [ 0   0  6 ]
The Hessian encodes the curvature information of the function.
Curvature in 1D: for y = f(x),
κ = y'' / (1 + y'²)^{3/2}
so a larger y'' means a larger curvature; κ = 1/R, where R is the radius of curvature.
For the graph of f(x_1, x_2), the Gaussian curvature is
K = det(H) / (1 + ‖∇f‖²)²
Gradient Rules
Chain rule: here a = g(x), and ∂g(x)/∂x is the Jacobian of g.
Product rule:
∂(fᵀ(x)g(x))/∂x = (∂fᵀ(x)/∂x) g(x) + (∂gᵀ(x)/∂x) f(x)
Some special examples (see the numerical check below):
∂(uᵀv)/∂x = (∂uᵀ/∂x) v + (∂vᵀ/∂x) u
∂‖u‖²/∂x = ∂(uᵀu)/∂x = 2 (∂uᵀ/∂x) u
∂(xᵀAx)/∂x = 2Ax if Aᵀ = A
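As a sanity check of the identities above, the following sketch (my own code, using central finite differences) verifies ∂(aᵀx)/∂x = a and ∂(xᵀAx)/∂x = 2Ax for a random symmetric A.

```python
# Finite-difference verification of two vector-calculus identities from the slide.
import numpy as np

def num_grad(f, x, eps=1e-6):
    """Central-difference gradient of a scalar-valued function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
a, x = rng.standard_normal(4), rng.standard_normal(4)
A = rng.standard_normal((4, 4))
A = (A + A.T) / 2                                     # make A symmetric so d(x^T A x)/dx = 2Ax

print(np.allclose(num_grad(lambda z: a @ z, x), a))                          # d(a^T x)/dx = a
print(np.allclose(num_grad(lambda z: z @ A @ z, x), 2 * A @ x, atol=1e-4))   # d(x^T A x)/dx = 2Ax
```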
Property of Gradients
Theorem
lim_{ε→0} (f(x + εh) − f(x)) / ε = ∇f · h  (the directional derivative of f along h)
Theorem
max_{‖v‖=1} (∇f · v) achieves its maximum at v = ∇f / ‖∇f‖
Proof.
∇f · v = ‖∇f‖ ‖v‖ cos θ ≤ ‖∇f‖, and equality holds when v = c∇f with c = 1/‖∇f‖.
The minimum is achieved at v = −∇f / ‖∇f‖.
Gradient Descent
Assume that f and ∇f can be easily evaluated at each iteration
Recall that we have f: ℝ^n → ℝ, convex and differentiable, and want to solve
min_{x∈ℝ^n} f(x)
i.e., find x* such that f(x*) = min f(x)
Steepest descent:
x_{k+1} = x_k − α_k ∇f(x_k)
How to identify α_k (a runnable sketch follows this list):
Trial and error: select a fixed α_k, or reduce α_k after f(x_k) stabilizes
Backtracking: try α_0, α_0/2, α_0/4, ... until a sufficient decrease in f is obtained
Use information from the function f to decide
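A minimal sketch of gradient descent with the backtracking schedule described above (α_0, α_0/2, α_0/4, ... until a sufficient decrease); the Armijo constant c and the test function are my own choices, not from the slides.

```python
# Gradient descent with backtracking (Armijo) step-size selection.
import numpy as np

def gradient_descent(f, grad, x0, alpha0=1.0, c=1e-4, tol=1e-8, max_iter=500):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:        # stop when the gradient is small
            break
        alpha = alpha0
        # backtracking: halve alpha until f decreases sufficiently
        while f(x - alpha * g) > f(x) - c * alpha * g @ g:
            alpha *= 0.5
        x = x - alpha * g
    return x

# Example: f(x) = (10 x1^2 + x2^2) / 2, the quadratic used on the next slide.
f = lambda x: 0.5 * (10 * x[0] ** 2 + x[1] ** 2)
grad = lambda x: np.array([10 * x[0], x[1]])
print(gradient_descent(f, grad, [1.0, 10.0]))   # converges to [0, 0]
```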
Fixed step size
Simply take t_k = t for all k = 1, 2, 3, ...; this can diverge if t is too big.
Consider f(x) = (10x_1² + x_2²)/2, gradient descent after 8 steps:
[Figure: contour plots of f showing the first 8 gradient-descent iterates for different fixed step sizes, from divergence (too big) to slow progress (too small).]
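The divergence behavior can be reproduced in a few lines (an illustrative sketch; the starting point and step sizes are my own choices): for this quadratic the gradient-Lipschitz constant is L = 10, so any fixed step t > 2/L = 0.2 makes the x_1 coordinate blow up.

```python
# Fixed-step gradient descent on f(x) = (10 x1^2 + x2^2)/2 for several step sizes.
import numpy as np

grad = lambda x: np.array([10 * x[0], x[1]])   # gradient of (10 x1^2 + x2^2)/2

def run(t, steps=8, x0=(1.0, 10.0)):
    x = np.array(x0)
    for _ in range(steps):
        x = x - t * grad(x)                    # x1 is multiplied by (1 - 10 t) each step
    return x

print(run(t=0.05))   # small step: stable but slow along x2
print(run(t=0.18))   # near 2/L = 0.2: x1 oscillates while shrinking
print(run(t=0.25))   # t > 2/L: the x1 coordinate diverges
```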
Line Search
Gradient step: x_{k+1} = x_k − α∇f(x_k)
Look for the scalar α that minimizes f(x_k − α∇f(x_k))
If μI ⪯ ∇²f(x) ⪯ LI, then
(μ/2)‖y − x‖₂² ≤ f(y) − f(x) − ∇f(x)ᵀ(y − x) ≤ (L/2)‖y − x‖₂²
For a strongly convex problem μ > 0; for a weakly convex problem μ = 0.
Convergence speed
Let δ_k = f(x_k) − f(x*) (or ‖x_k − x*‖); how fast does δ_k → 0?
Linear convergence: δ_{k+1}/δ_k ≤ c for some c < 1
Quadratic convergence: δ_{k+1}/δ_k² ≤ c
How many iterations to achieve ε accuracy? (see the sketch below)
Linear convergence: about log(1/ε)/log(1/c), i.e., O(log(1/ε)) iterations
Quadratic convergence: O(log log(1/ε)) iterations
Sublinear convergence (δ_k ≤ C/k): about C/ε iterations
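A back-of-the-envelope sketch (my own numbers, assuming δ_0 = 1, a linear contraction factor c = 0.9, and C = 1 in the sublinear bound) of how many iterations each rate needs to reach ε accuracy:

```python
# Rough iteration counts for linear, quadratic, and sublinear convergence rates.
import math

eps = 1e-6
c = 0.9                                            # linear contraction factor (assumed)
k_linear = math.log(1 / eps) / math.log(1 / c)     # O(log(1/eps)) iterations
k_quadratic = math.log2(math.log2(1 / eps))        # O(log log(1/eps)) iterations
k_sublinear = 1 / eps                              # delta_k <= C/k needs k ~ C/eps

print(f"linear    : {k_linear:12.1f} iterations")
print(f"quadratic : {k_quadratic:12.1f} iterations")
print(f"sublinear : {k_sublinear:12.1f} iterations")
```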
Theorem
If ∇²f ⪯ LI and the step size is α_k = 1/L, then gradient descent satisfies
Σ_{k=0}^N ‖∇f(x_k)‖₂² ≤ 2L (f(x_0) − f(x_{N+1}))
so the smallest gradient norm tends to zero.
Proof.
The proof sketch is laid out as:
Smoothness (∇²f ⪯ LI): f(x_{k+1}) ≤ f(x_k) + ∇f(x_k)ᵀ(x_{k+1} − x_k) + (L/2)‖x_{k+1} − x_k‖₂²
Since x_{k+1} = x_k − α_k ∇f(x_k),
f(x_{k+1}) ≤ f(x_k) − α_k (1 − α_k L/2) ‖∇f(x_k)‖₂²
Selecting the step size α_k = 1/L gives f(x_{k+1}) ≤ f(x_k) − (1/2L)‖∇f(x_k)‖₂²
Summing from k = 0 to k = N,
Σ_{k=0}^N ‖∇f(x_k)‖₂² ≤ 2L (f(x_0) − f(x_{N+1}))
‖x_k − x*‖ is a non-increasing sequence in k with α_k = 1/L.
Proof.
‖x_{k+1} − x*‖² = ‖x_k − x* − (1/L)∇f(x_k)‖²
= ‖x_k − x*‖² − (2/L) ∇f(x_k)ᵀ(x_k − x*) + (1/L²) ‖∇f(x_k)‖₂²
≤ ‖x_k − x*‖² − (1/L²) ‖∇f(x_k)‖₂²
using ∇f(x_k)ᵀ(x_k − x*) ≥ (1/L)‖∇f(x_k)‖₂² (co-coercivity of the gradient of an L-smooth convex f).
Hence ‖x_k − x*‖² is non-increasing in k.
Theorem
If ∇²f(x) ⪯ LI, then f(x*) ≤ f(x − (1/L)∇f(x)) ≤ f(x) − (1/2L)‖∇f(x)‖²
Proof.
First we can show that
f(x*) ≤ f(x − (1/L)∇f(x))
≤ f(x) − (1/L)∇f(x)ᵀ∇f(x) + (L/2)‖(1/L)∇f(x)‖₂²
= f(x) − (1/2L)‖∇f(x)‖₂²
By convexity of f, f(x_k) − f(x*) ≤ ∇f(x_k)ᵀ(x_k − x*) ≤ ‖∇f(x_k)‖ ‖x_k − x*‖ ≤ ‖∇f(x_k)‖ ‖x_0 − x*‖ (the last step uses the non-increasing distance shown above).
Let δ_k = f(x_k) − f(x*). Combining with the descent inequality f(x_{k+1}) ≤ f(x_k) − (1/2L)‖∇f(x_k)‖² gives
δ_{k+1} ≤ δ_k − δ_k² / (2L‖x_0 − x*‖²)
Dividing by δ_k δ_{k+1} and using δ_{k+1} ≤ δ_k,
1/δ_{k+1} ≥ 1/δ_k + 1/(2L‖x_0 − x*‖²) ≥ ··· ≥ 1/δ_0 + (k + 1)/(2L‖x_0 − x*‖²)
which yields
f(x_{k+1}) − f(x*) ≤ 2L‖x_0 − x*‖² / (k + 1)
The classical 1/k convergence rate (sublinear).
Linear convergence under strong convexity
Proof.
Plugging in strong convexity, f(x*) ≥ f(x_k) + ∇f(x_k)ᵀ(x* − x_k) + (μ/2)‖x_k − x*‖², together with f(x*) ≤ f(x) − (1/2L)‖∇f(x)‖₂², we have
‖x_{k+1} − x*‖₂² = ‖x_k − x* − α∇f(x_k)‖²
= ‖x_k − x*‖² − 2α∇f(x_k)ᵀ(x_k − x*) + α²‖∇f(x_k)‖²
≤ (1 − αμ)‖x_k − x*‖² − 2α(f(x_k) − f(x*)) + α²‖∇f(x_k)‖²
≤ (1 − αμ)‖x_k − x*‖² − 2α(1 − αL)(f(x_k) − f(x*))
If we let α ≤ 1/L, the term −2α(1 − αL)(f(x_k) − f(x*)) is non-positive and can be dropped, giving
‖x_{k+1} − x*‖² ≤ (1 − αμ)‖x_k − x*‖²
Strongly Convex: Linear convergence
f(x_k) − f(x*) ≤ (L/2)(1 − 2/(κ + 1))^{2k} ‖x_0 − x*‖², where κ = L/μ is the condition number
When κ ≫ 1, 1 − 2/(κ + 1) ≈ 1, so ill-conditioned problems converge slowly.
Not just a pessimistic bound!
FISTA Acceleration
The basic momentum step aggregates the gradient over history:
x_{k+1} = x_k + α_k p_k,   p_k = −∇f(x_k) + β_k p_{k−1}
Accelerated version (FISTA):
x_k = y_k − (1/L)∇f(y_k)
t_{k+1} = (1 + √(1 + 4t_k²)) / 2
y_{k+1} = x_k + ((t_k − 1)/t_{k+1})(x_k − x_{k−1})
For weakly convex f, it converges with f(x_k) − f(x*) ~ 1/k²
1/k² is the optimal sublinear rate for first-order methods (a sketch follows).
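A minimal FISTA sketch implementing the three update rules above; the test problem (the ill-conditioned quadratic from the earlier slides) and the iteration count are my own choices.

```python
# FISTA for smooth f with gradient-Lipschitz constant L.
import numpy as np

def fista(grad, L, x0, iters=500):
    x_prev = x = np.asarray(x0, dtype=float)
    y, t = x.copy(), 1.0
    for _ in range(iters):
        x = y - grad(y) / L                            # x_k = y_k - (1/L) grad f(y_k)
        t_next = 0.5 * (1 + np.sqrt(1 + 4 * t ** 2))   # t_{k+1} update
        y = x + ((t - 1) / t_next) * (x - x_prev)      # extrapolation (momentum) step
        x_prev, t = x, t_next
    return x

# Example: f(x) = (10 x1^2 + x2^2)/2, so L = 10.
grad = lambda x: np.array([10 * x[0], x[1]])
print(fista(grad, L=10.0, x0=[1.0, 10.0]))   # close to the minimizer [0, 0]
```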
Ridge regression example: for l(β) = (1/n)‖y − Xβ‖² + λ‖β‖², the gradient step is
β_{k+1} = β_k + α_k ((2/n) Xᵀ(y − Xβ_k) − 2λβ_k)
Hessian Matrix:
H = ∇²l(β) = ∇(−(2/n)Xᵀ(y − Xβ) + 2λβ)
  = (2/n)XᵀX + 2λI ⪰ 0
Is it convex? Yes!
Is it strongly convex?
If λ > 0: Yes!
If λ = 0:
If n > p and X is full rank (with rank p), it is strongly convex
If p > n, it is not strongly convex!
Time Complexity:
Computing Xβ takes O(np)
Computing Xᵀ(y − Xβ) takes O(np)
Space Complexity:
Storing X takes O(np)
Closed form: β = (XᵀX + nλI)^{−1} Xᵀy

                      Space    Time
Gradient Descent      O(np)    O(np) per iteration
Analytical Solution   O(np)    O(np²)
Question:
Is gradient descent strictly better due to its smaller complexity?
When should we use the analytical solution?
Total cost of gradient descent = (number of iterations to reach ε accuracy) × (O(np) per iteration); compare this against the one-time O(np²) cost of the analytical solution.
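The trade-off can be seen in a small experiment (an illustrative sketch with synthetic data; variable names and sizes are my own choices): gradient descent pays O(np) per iteration and needs many iterations, while the closed form pays O(np²) once.

```python
# Ridge regression: gradient descent vs. the closed-form solution.
import numpy as np

rng = np.random.default_rng(0)
n, p, lam = 500, 20, 0.1
X = rng.standard_normal((n, p))
beta_true = rng.standard_normal(p)
y = X @ beta_true + 0.1 * rng.standard_normal(n)

# Closed form: (X^T X + n*lam*I) beta = X^T y  -- forming X^T X costs O(n p^2).
beta_cf = np.linalg.solve(X.T @ X + n * lam * np.eye(p), X.T @ y)

# Gradient descent on l(beta) = (1/n)||y - X beta||^2 + lam ||beta||^2: O(np) per iteration.
beta, alpha = np.zeros(p), 0.1
for _ in range(1000):
    grad = -(2 / n) * X.T @ (y - X @ beta) + 2 * lam * beta
    beta -= alpha * grad

print(np.linalg.norm(beta - beta_cf))   # the two solutions agree closely
```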
Section 4
Second-order Optimization Methods
min_x f(x)
At x = x_0, f(x) can be approximated by
f(x) ≈ h(x) := f(x_0) + ∇f(x_0)ᵀ(x − x_0) + (1/2)(x − x_0)ᵀH(x_0)(x − x_0)
H(x_0) is the Hessian of f(x), defined by
H_ij(x_0) = ∂²f(x)/∂x_i∂x_j
arg min_x f(x) ≈ arg min_x h(x) = x_0 − H(x_0)^{−1} ∇f(x_0)
Newton Algorithm
Algorithm:
Given x_0, set k = 0
Solve for d_k such that H(x_k) d_k = −∇f(x_k)
Normally set α_k = 1, x_{k+1} = x_k + α_k d_k
Repeat until convergence
Solving for d_k requires H(x_k) to be nonsingular at each iteration
Newton's method considers the curvature of the original problem
Scale invariance: Newton's method is invariant to the affine transformation x → Dx
(A minimal sketch of the iteration follows.)
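A minimal sketch of the Newton iteration above; the test function exp(x_1) + x_1² + x_2² and the stopping rule are my own choices.

```python
# Newton's method: solve H(x_k) d_k = -grad f(x_k) instead of forming H^{-1} explicitly.
import numpy as np

def newton(grad, hess, x0, tol=1e-10, max_iter=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            break
        d = np.linalg.solve(hess(x), -g)   # Newton direction from a linear solve
        x = x + d                          # alpha_k = 1 (pure Newton step)
    return x

# Example: f(x) = exp(x1) + x1^2 + x2^2 (smooth and strictly convex).
grad = lambda x: np.array([np.exp(x[0]) + 2 * x[0], 2 * x[1]])
hess = lambda x: np.array([[np.exp(x[0]) + 2, 0.0], [0.0, 2.0]])
print(newton(grad, hess, [2.0, 3.0]))      # converges in a handful of iterations
```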
Descent Property
Theorem
If ∇²f ≻ 0, the Newton direction is a descent direction.
Proof.
If ∇²f ≻ 0, then ∇²f is positive definite and invertible. As a result, the Newton direction
Δx = x_{k+1} − x_k = −[∇²f(x_k)]^{−1} ∇f(x_k)
satisfies ∇f(x_k)ᵀΔx = −∇f(x_k)ᵀ[∇²f(x_k)]^{−1}∇f(x_k) < 0 whenever ∇f(x_k) ≠ 0, so f decreases locally along Δx.
Convergence Property
Theorem
Newton's method converges locally at a quadratic rate.
We will give a proof for the 1-D case as an example, where x is a scalar.
Consider g(x) = ∇f(x) and the 2nd-order Taylor approximation of g(x). There exists ξ_k between x_k and x* such that
0 = g(x*) = g(x_k) + ∇g(x_k)(x* − x_k) + (1/2)∇²g(ξ_k)(x_k − x*)²
Suppose that ∇g(x_k)^{−1} exists. Multiplying through by ∇g(x_k)^{−1},
0 = [∇g(x_k)^{−1}]g(x_k) + (x* − x_k) + (1/2)[∇g(x_k)^{−1}]∇²g(ξ_k)(x_k − x*)²
  = −(x_{k+1} − x_k) + (x* − x_k) + (1/2)[∇g(x_k)^{−1}]∇²g(ξ_k)(x_k − x*)²
  = −(x_{k+1} − x*) + (1/2)[∇g(x_k)^{−1}]∇²g(ξ_k)(x_k − x*)²
Convergence Property (continued)
This gives
e_{k+1}/e_k² := (x_{k+1} − x*)/(x_k − x*)² = (1/2)[∇g(x_k)^{−1}]∇²g(ξ_k)
Let M = sup_{x,y} |∇²g(x)| / |2∇g(y)| < ∞. Therefore
lim_{k→∞} |x_{k+1} − x*| / (x_k − x*)² ≤ M < ∞
If we select the initial point such that M|e_0| < 1, then e_k → 0 at a quadratic rate.
Newton Convergence
Newton's method has local quadratic convergence:
‖x_{k+1} − x*‖ = O(‖x_k − x*‖²)
Theorem
The number of Newton iterations needed to reach accuracy f(x) − f(x*) ≤ ε is bounded by
(f(x_0) − f(x*))/γ + log₂ log₂(ε_0/ε)
where γ and ε_0 are constants determined by the problem parameters.
Scale Invariance
Gradient descent is not scale invariant
Newton's method considers the curvature of the original problem through the Hessian matrix
Newton's method is scale invariant: invariant to the affine transformation x → Dx
Can it be practical?
Quasi-Newton methods; Hessian-free optimization for approximate 2nd-order information
Sampled Newton
L(ξ; x) is separable: L(ξ; x) = Σ_{i=1}^n L(ξ_i; x)
Stochastic gradient descent uses a subset of the full samples to estimate the gradient
Sample a subset B ⊆ {1, 2, ..., n} randomly and estimate the Hessian by
∇²f(ξ) ≈ (1/|B|) Σ_{i∈B} ∇²L(ξ_i; x)
Quasi-Newton Methods
Maintain a Hessian approximation B_k satisfying the secant equation B_k s_k = y_k, where s_k = x_{k+1} − x_k and y_k = ∇f(x_{k+1}) − ∇f(x_k)
L-BFGS
L-BFGS doesn't store the p × p matrices H_k or B_k from BFGS explicitly
Only keep track of s_k and y_k from the last few iterations (e.g., m = 5 or 10)
Take an initial matrix (B_0 or H_0) and assume that m steps have been taken since
A simple procedure computes B_k u via a series of inner and outer products with the vectors s_{k−j} and y_{k−j} from the last m iterations, j = 0, ..., m − 1 (see the sketch below)
Requires 2mp storage and O(mp) linear algebra operations
No superlinear convergence proved, but good behavior is observed on a wide range of applications
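The procedure referred to above is commonly implemented as the standard L-BFGS two-loop recursion; the sketch below (an assumed implementation, not the lecture's code) applies the inverse-Hessian approximation to the current gradient using only the stored (s, y) pairs, so storage stays at 2mp numbers.

```python
# L-BFGS two-loop recursion: compute the search direction -H_k * grad_k.
import numpy as np

def lbfgs_direction(grad_k, s_list, y_list):
    """Return -H_k * grad_k from stored curvature pairs (most recent pair last)."""
    q = grad_k.copy()
    alphas, rhos = [], []
    for s, y in zip(reversed(s_list), reversed(y_list)):   # first loop: newest -> oldest
        rho = 1.0 / (y @ s)
        alpha = rho * (s @ q)
        q = q - alpha * y
        rhos.append(rho)
        alphas.append(alpha)
    if s_list:                                             # initial scaling H_0 = gamma * I
        s, y = s_list[-1], y_list[-1]
        q = ((s @ y) / (y @ y)) * q
    for (s, y), rho, alpha in zip(zip(s_list, y_list), reversed(rhos), reversed(alphas)):
        beta = rho * (y @ q)                               # second loop: oldest -> newest
        q = q + (alpha - beta) * s
    return -q

# With no stored pairs the result reduces to the steepest-descent direction -grad.
print(lbfgs_direction(np.array([1.0, -2.0]), [], []))
```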
Logistic Regression
The gradient:
l'(θ) = Σ_i x_i (t_i − y_i)
where t_i^(k) = σ(θ^(k)ᵀ x_i) is the current prediction
Gradient Descent:
θ^(k+1) = θ^(k) − α Σ_i x_i (t_i^(k) − y_i)
Logistic Loss
Assume
P(y = 1 | x, θ) = σ(θᵀx + b)
P(y = 0 | x, θ) = 1 − σ(θᵀx + b)
Equivalently
P(y | x; θ) = P(y = 1 | x, θ)^y P(y = 0 | x, θ)^{1−y}
Likelihood:
P(y | x; θ) = Π_i P(y_i | x_i; θ) = Π_{i=1}^N P(y_i = 1 | x_i, θ)^{y_i} P(y_i = 0 | x_i, θ)^{1−y_i}
The binary cross entropy is given as:
BCE = −(Σ_{i=1}^N y_i log P(y_i = 1 | x_i, θ) + (1 − y_i) log P(y_i = 0 | x_i, θ))
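A minimal sketch of gradient descent on the binary cross entropy above (here averaged over the samples, i.e., a scaled version of the gradient formula); the synthetic data and step size are my own choices.

```python
# Gradient descent for logistic regression with sigma(z) = 1 / (1 + exp(-z)).
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.standard_normal((n, p))
theta_true = np.array([1.5, -2.0, 0.5])
y = (rng.random(n) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)   # Bernoulli labels

sigma = lambda z: 1 / (1 + np.exp(-z))

theta, alpha = np.zeros(p), 0.1
for _ in range(2000):
    t = sigma(X @ theta)                 # predicted probabilities t_i
    theta -= alpha * X.T @ (t - y) / n   # averaged BCE gradient: (1/n) sum x_i (t_i - y_i)
print(theta)                             # approximately recovers theta_true (up to sampling noise)
```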
Section 5
Stochastic Gradient Descent
Setup: model parameters θ; data samples ξ_1, ..., ξ_n; objective
min_θ (1/n) Σ_i L(θ; ξ_i)
GD uses all n samples to estimate the gradient (low variance, but expensive)
Stochastic GD picks one sample ξ_i at random (cheap, but the gradient estimate has higher variance)
Data: ξ ∈ S, the sample space
Loss function: F(x, ξ)
Risk function: E(F) = E_ξ F(x, ξ) = f(x)
Empirical risk: E_n(F) = (1/n) Σ_{i=1}^n F(x, ξ_i)
Gradient Descent:
x_{k+1} = x_k − α_k (1/n) Σ_{i=1}^n ∇_x F(x_k, ξ_i)
which costs O(np) per step and is too expensive
Problem: n can be very large (n ≈ 100000), so evaluating ∇E_n(F) at each iteration is extremely time-consuming
Stochastic Optimization
Replace the gradient with cheaper noisy estimates [Robbins and Monro, 1951]
Update with a noisy gradient: θ_{t+1} = θ_t − α_t ĝ_t
What if at each iteration i we pick x(i) randomly from {x_1, ..., x_n}?
Any ∇_θ l(θ; x_k) is an unbiased estimator of E_x(∇_θ l(θ; x))
Requires unbiased gradients: E(ĝ(θ)) = ∇f(θ)
Algorithm: θ_{t+1} = θ_t − α_t ∇_θ l(θ_t; x_k), where the index k is picked randomly
Guaranteed to converge to a local optimum [Bottou, 1996]
Significantly reduces the time for each iteration
Mini-Batch: use a batch of B > 1 training samples (a runnable sketch follows):
E_x(∇_θ l(θ; x)) ≈ (1/B) Σ_{i=1}^B ∇_θ l(θ; x_i)
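A minimal mini-batch SGD sketch; the least-squares loss, synthetic data, batch size B = 32, and constant step size are my own choices.

```python
# Mini-batch SGD for least squares: theta <- theta - alpha * (1/B) sum_i grad l(theta; x_i).
import numpy as np

rng = np.random.default_rng(0)
n, p, B = 10000, 5, 32
X = rng.standard_normal((n, p))
theta_true = rng.standard_normal(p)
y = X @ theta_true + 0.01 * rng.standard_normal(n)

theta, alpha = np.zeros(p), 0.05
for t in range(2000):
    idx = rng.integers(0, n, size=B)            # sample a mini-batch of size B
    Xb, yb = X[idx], y[idx]
    grad = 2 * Xb.T @ (Xb @ theta - yb) / B     # mini-batch gradient estimate
    theta -= alpha * grad
print(np.linalg.norm(theta - theta_true))       # small residual error
```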
Theorem
With step size α_k = 1/(kμ),
E(‖θ_k − θ*‖²) ≤ max(‖θ_0 − θ*‖², M²/μ²) / k
where 0 ≺ μI ⪯ ∇²f(x) ⪯ LI and E(‖∇F(x)‖²) ≤ M².
In practice, α_k = α_0(1 + α_0 k)^{−1} is often used.
Proof
Let a_k = (1/2)E(‖θ_k − θ*‖²), and assume M > 0 is such that E(‖∇_θ F(θ, x)‖²) ≤ M². Thus
(1/2)‖θ_{k+1} − θ*‖₂² = (1/2)‖θ_k − θ* − α_k ∇_θ F(θ_k, x_k)‖²
= (1/2)‖θ_k − θ*‖² − α_k (θ_k − θ*)ᵀ∇_θ F(θ_k, x_k) + (1/2)α_k² ‖∇_θ F(θ_k, x_k)‖²
Taking expectations over x_k,
a_{k+1} ≤ a_k − α_k E_{x_k}[(θ_k − θ*)ᵀ∇_θ F(θ_k, x_k)] + (1/2)α_k² M²
We have
E_{x_k}[(θ_k − θ*)ᵀ∇_θ F(θ_k, x_k)] = (θ_k − θ*)ᵀ E_{x_k}[∇_θ F(θ_k, x_k)] = (θ_k − θ*)ᵀ g_k
where g_k = E_{x_k}[∇_θ F(θ_k, x_k)] = ∇f(θ_k) (unbiased estimator)
Proof (continued)
By strong convexity:
(θ_k − θ*)ᵀg_k ≥ f(θ_k) − f(θ*) + (1/2)μ‖θ_k − θ*‖² ≥ (1/2)μ‖θ_k − θ*‖² + (1/2)μ‖θ_k − θ*‖² = μ‖θ_k − θ*‖²
Taking expectations, E((θ_k − θ*)ᵀg_k) ≥ 2μ a_k
We have
a_{k+1} ≤ (1 − 2μα_k) a_k + (1/2)α_k² M²
When α_k = 1/(kμ), we have a_k ≤ Q/(2k), where Q := max(‖θ_0 − θ*‖², M²/μ²)
Large-scale Optimization
Approximation-Estimation-Optimization Tradeoff
                                F (model capacity)   n (samples)   ρ (optimization tolerance)
E_app (approximation error)     decreases            —             —
E_est (estimation error)        increases            decreases     —
E_opt (optimization error)      —                    —             increases
T (computation time)            increases            increases     decreases
Example: for F(θ, x) = (1/2)(θ − x)² with α_k = 1/k,
θ_k = θ_{k−1} − α_k (θ_{k−1} − x_k)
    = θ_{k−1} − (1/k)(θ_{k−1} − x_k) = ((k − 1)/k) θ_{k−1} + (1/k) x_k
so k θ_k = (k − 1) θ_{k−1} + x_k = ··· = Σ_{i=1}^k x_i
θ_k = (1/k) Σ_{i=1}^k x_i is the mean estimator
Convergence: E(‖θ_k − θ*‖²) = O(1/k)
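A quick numerical check (my own sketch, not from the slides) that SGD with α_k = 1/k on F(θ, x) = ½(θ − x)² reproduces the running sample mean:

```python
# SGD with step size 1/k on the loss 0.5*(theta - x)^2 equals the running mean.
import numpy as np

rng = np.random.default_rng(0)
xs = rng.normal(loc=3.0, scale=1.0, size=1000)

theta = 0.0
for k, x in enumerate(xs, start=1):
    theta = theta - (1.0 / k) * (theta - x)   # theta_k = theta_{k-1} - (1/k)(theta_{k-1} - x_k)
print(theta, xs.mean())                       # identical up to rounding
```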
For linear regression, the per-sample gradient is
∇_β (y_i − x_iᵀβ)² = 2(x_iᵀβ − y_i) x_i
Complexity for LR:
                      Space    Time
SGD                   O(p)     O(p)
Gradient Descent      O(np)    O(np)
Analytical solution   O(np)    O(np²)
Table: Time and storage complexity
Scale variance: gradients can have very different scales across parameters, which motivates adaptive learning rates.
Momentum
v_t = γ v_{t−1} + (1 − γ) ∇_θ l(θ)
Δθ = −α v_t
Why Momentum?
Adagrad: adapts the learning rate to the parameters; well suited to sparse data
Adadelta: an extension of Adagrad that reduces its aggressive, monotonically decreasing learning rate
Adam: adds momentum to adaptive learning rates for each parameter
Two different mechanisms:
Momentum accelerates our search in the direction of the minima: a ball running down a slope
RMSProp impedes our search in the direction of oscillations: a heavy ball with friction
RMSProp
Proposed by Geoffrey Hinton, suggested during a Coursera class
RMSProp also tries to dampen the oscillations, but in a different way than momentum
Update rule, for each parameter j:
v_{j,t} = ρ v_{j,t−1} + (1 − ρ) g_{j,t}²,   g_{j,t} = ∂l(θ)/∂θ_j
θ_{j,t+1} = θ_{j,t} − (η / √(v_{j,t} + ε)) g_{j,t}
Uses exponential averaging of the squared gradient
Exponential averaging: more recent gradients are more important
The squared gradient accumulates along oscillating directions, which leads to a small learning rate there
Parameters that would ordinarily receive smaller or less frequent updates receive relatively larger updates
Adam
v_{j,t} = ρ_1 v_{j,t−1} + (1 − ρ_1) g_{j,t},   g_{j,t} = ∂l(θ)/∂θ_j
s_{j,t} = ρ_2 s_{j,t−1} + (1 − ρ_2) g_{j,t}²
θ_{j,t+1} = θ_{j,t} − (η / √(s_{j,t} + ε)) v_{j,t}
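A compact sketch of the simplified Adam update above (no bias-correction terms, matching the slide's form); the test problem f(θ) = ½‖θ‖² and the hyperparameter values are my own choices.

```python
# Simplified Adam update: momentum-style first moment plus RMSProp-style second moment.
import numpy as np

def adam_step(theta, g, v, s, eta=1e-3, rho1=0.9, rho2=0.999, eps=1e-8):
    """One simplified Adam step; v and s are the running first/second moment estimates."""
    v = rho1 * v + (1 - rho1) * g            # exponential average of gradients
    s = rho2 * s + (1 - rho2) * g ** 2       # exponential average of squared gradients
    theta = theta - eta * v / np.sqrt(s + eps)
    return theta, v, s

# Usage on f(theta) = 0.5 ||theta||^2, whose gradient is simply theta:
theta = np.array([5.0, -3.0])
v = s = np.zeros_like(theta)
for _ in range(10000):
    theta, v, s = adam_step(theta, theta, v, s)
print(theta)    # approaches the minimizer [0, 0]
```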
Benchmark Methods
Gradient computing:
x ← x − α∇f(x)
Example: empirical risk minimization:
arg min_x (1/N) Σ_{i=1}^N f_i(x)
Other Methods
Integer Programming
Cutting plane methods
Branch and bound methods
Stochastic Algorithms
Direct Monte-Carlo sampling-based Optimization
Heuristic Algorithms:
Simulated annealing
Tabu search
Swarm-based optimization algorithms (e.g. Particle swarm
optimization)
Evolutionary algorithms (Genetic Algorithms)
Learn to optimize (RNN-based learning)