
Foundations of Data Science, Fall 2024

Introduction to Data Science for Doctoral Students, Fall 2024

11. Gradient Descent Optimisation

Dr. Haozhe Zhang

October 28, 2024

MSc: https://lms.uzh.ch/url/RepositoryEntry/17589469505
PhD: https://lms.uzh.ch/url/RepositoryEntry/17589469506
Calculus Background

We consider differentiable multivariate functions f : R^D → R

Two fundamental observations are behind gradient-based optimisation:

1. The gradient of f is perpendicular to the contour line of f

2. The gradient of f points in the direction of the steepest increase of f

To search for a local minimum of f, we can repeatedly move in the direction of the
steepest decrease of f, so in the opposite direction of the gradient of f.

To move faster (yet at higher computational expense), we can employ optimisation
methods that rely on the second-order derivatives of f, so on its Hessian matrix.
Gradient Vectors are Orthogonal to Contour Curves

Theorem: If a function f is differentiable, the gradient of f at a point is either zero
or perpendicular to the contour line of f at that point.

Intuition for perpendicularity: Two hikers at the same location on a mountain.

1. One chooses the direction where the slope is steepest

2. The other chooses a path that keeps the same height

The theorem says that they depart in directions perpendicular to each other.
Gradient Points in the Direction of the Steepest Increase (1/2)

Each component of the gradient says how fast the function changes with respect
to the standard basis:

    ∂f/∂x_i (a) = lim_{h→0} [f(a + h·u_i) − f(a)] / h

where u_i = [0 ··· 1 ··· 0]^T is the unit vector in the direction of x_i
(the single 1 sits in position i of the D entries).

What about changing with respect to the direction of some arbitrary vector v?

Directional Derivative ∇_v: derivative in the direction of v

    ∇_v f(a) = lim_{h→0} [f(a + h·v) − f(a)] / h = ∇f(a) · v
             = v_1 ∂f/∂x_1(a) + … + v_D ∂f/∂x_D(a)

∇_v f(a) compounds the effect of moving along the direction of the first
component, then along the direction of the second component, and so on.
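
To see the identity ∇_v f(a) = ∇f(a) · v in action, here is a small numerical check;
a sketch where the function f, the point a, and the direction v are made up for
illustration:

```python
import numpy as np

# Illustrative function f(x) = x1^2 + 3*x2 with its analytic gradient
# (both made up for this check, not taken from the slides).
def f(x):
    return x[0]**2 + 3 * x[1]

def grad_f(x):
    return np.array([2 * x[0], 3.0])

a = np.array([1.0, -2.0])          # point at which we differentiate
v = np.array([3.0, 4.0])
v = v / np.linalg.norm(v)          # make v a unit direction

h = 1e-6
finite_diff = (f(a + h * v) - f(a)) / h   # the limit definition, small h
dot_formula = grad_f(a) @ v               # the formula  grad f(a) . v

print(finite_diff, dot_formula)    # both ~ 3.6; they agree up to O(h)
```
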
Gradient Points in the Direction of the Steepest Increase (2/2)

Directional Derivative in the direction of v: ∇_v f(a) = ∇f(a) · v

Which direction should we walk from a so that f’s output increases most quickly?

• That is, which unit vector v maximizes the directional derivative along v?

    v* = argmax_v ∇_v f(a)
       =(1) argmax_v ∇f(a) · v
       =(2) argmax_v ∥∇f(a)∥ ∥v∥ cos θ
       =(3) argmax_v ∥∇f(a)∥ cos θ

where:

• (1) is the definition of the directional derivative

• (2) is the definition of the dot product; θ is the angle between ∇f(a) and v

• (3) holds since ∥v∥ = 1 (v is a unit vector)

The maximal value is attained for cos θ = 1, i.e., θ = 0, so ∇f(a) and v* have the
same direction.
The Hessian Matrix

Hessian H: The matrix of all second-order partial derivatives of f

• Symmetric whenever the second derivatives are continuous (Schwarz’s theorem)

• Captures the curvature of the surface

• H has positive eigenvalues → local minimum
• H has negative eigenvalues → local maximum
• H has mixed eigenvalues → saddle point
Gradient and Hessian: Example

    z = f(w_1, w_2) = w_1²/a² + w_2²/b²

    ∇_w f = [∂f/∂w_1, ∂f/∂w_2]^T = [2w_1/a², 2w_2/b²]^T

    H = [ ∂²f/∂w_1²      ∂²f/∂w_1∂w_2 ]  =  [ 2/a²   0    ]
        [ ∂²f/∂w_2∂w_1   ∂²f/∂w_2²    ]     [ 0      2/b² ]
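
A quick symbolic verification of this example, sketched with sympy (assuming
sympy is available; a and b are kept as generic positive symbols):

```python
import sympy as sp

w1, w2, a, b = sp.symbols('w1 w2 a b', positive=True)
f = w1**2 / a**2 + w2**2 / b**2

# Gradient: [2*w1/a**2, 2*w2/b**2]
grad = sp.Matrix([sp.diff(f, w1), sp.diff(f, w2)])

# Hessian: the diagonal matrix diag(2/a**2, 2/b**2)
H = sp.hessian(f, (w1, w2))

print(grad, H)
```
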
Gradient Descent

Gradient Descent Algorithm

Gradient descent is one of the simplest, yet very general, optimisation algorithms
for finding a local minimum of a differentiable function.

• Recall our general optimisation goal for a loss function f with parameters w:

    w* = argmin_w f(w)

• Gradient descent is iterative:

    w_{t+1} = w_t − η_t g_t = w_t − η_t ∇f(w_t)

• It produces a new vector w_{t+1} at each iteration t

• At each iteration, w moves in the direction of the steepest descent

• Gradient descent may or may not reach w* = w_j after any number j of iterations

• η_t > 0 is the learning rate or step size
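
A minimal sketch of this update rule in Python; the toy objective, step size, and
iteration count are illustrative, not part of the slides:

```python
import numpy as np

def gradient_descent(grad, w0, eta=0.1, T=1000):
    """Iterate w_{t+1} = w_t - eta * grad(w_t) with a constant step size."""
    w = np.asarray(w0, dtype=float)
    for t in range(T):
        w = w - eta * grad(w)
    return w

# Toy objective f(w) = ||w - (1, -2)||^2, whose minimiser is (1, -2)
grad = lambda w: 2 * (w - np.array([1.0, -2.0]))
print(gradient_descent(grad, w0=[0.0, 0.0]))   # ~ [ 1. -2.]
```
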
Gradient Descent: Convex vs. Non-Convex Functions

Gradient Descent for Least Squares Regression

    L(w) = (Xw − y)^T (Xw − y) = Σ_{i=1}^N (x_i^T w − y_i)²

We can compute the gradient of L with respect to w:

    ∇_w L = 2 (X^T X w − X^T y)

Gradient descent vs the closed-form solution for very large (N) and wide (D) datasets:

• In both cases we need to compute A = X^T X and X^T y

• Closed-form solution: also invert (a perturbation of) A

• Gradient descent: recompute Aw at each iteration

So the number of iterations must be small to pay off.
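
The trade-off above can be sketched in a few lines of numpy; the synthetic data
and the step-size choice are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 5
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)

# In both approaches we precompute A = X^T X and X^T y once.
A, b = X.T @ X, X.T @ y

# Gradient descent: one A @ w product per iteration.
w = np.zeros(D)
eta = 0.5 / np.linalg.norm(A, 2)         # step size scaled by ||A||_2
for t in range(200):
    w = w - eta * 2 * (A @ w - b)        # grad L = 2 (X^T X w - X^T y)

# Closed-form solution: solve A w = X^T y (one linear solve).
w_closed = np.linalg.solve(A, b)
print(np.linalg.norm(w - w_closed))      # small after enough iterations
```
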
Choosing a Step Size

Choosing a good step size is key, and we may want a time-varying step size

• If the step size is too large, the algorithm may never converge

• If the step size is too small, convergence may be very slow
Choosing a Step Size

• Constant step size: η_t = c

• Decaying step size: η_t = c/t. Different rates of decay are common, e.g., c/√t

• Backtracking line search

  • Start with a large initial step size η_t

  • Check for a decrease: Is f(w_t − η_t ∇f(w_t)) < f(w_t)?

  • If the decrease condition is not met, multiply η_t by a decaying factor, e.g., 0.5

  • Repeat until the decrease condition is met
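
A sketch of backtracking line search using the simple decrease condition from
this slide (a common variant, Armijo search, additionally demands a sufficient
decrease proportional to η∥∇f∥²; all constants here are illustrative):

```python
import numpy as np

def backtracking_step(f, grad_f, w, eta0=1.0, factor=0.5, max_halvings=50):
    """One gradient step whose size is found by backtracking."""
    g = grad_f(w)
    eta = eta0                          # start with a large step size
    for _ in range(max_halvings):
        if f(w - eta * g) < f(w):       # decrease condition met?
            break
        eta *= factor                   # no: shrink the step and retry
    return w - eta * g

# Toy objective f(w) = w1^2 + 10*w2^2
f = lambda w: w[0]**2 + 10 * w[1]**2
grad_f = lambda w: np.array([2 * w[0], 20 * w[1]])

w = np.array([3.0, 1.0])
for t in range(20):
    w = backtracking_step(f, grad_f, w)
print(w)                                # approaches the minimiser (0, 0)
```
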
When to Stop? Test for Convergence

Fixed number of iterations: Terminate if t ≥ T

Small decrease: Terminate if f(w_t) − f(w_{t+1}) ≤ ϵ_1

Small change: Terminate if ∥w_{t+1} − w_t∥ ≤ ϵ_2
Stochastic Gradient Descent

Optimisation Algorithms for Machine Learning

We minimise the objective function over data points (x_1, y_1), …, (x_N, y_N):

    L(w) = (1/N) Σ_{i=1}^N ℓ(w; x_i, y_i) + λ R(w)

(the first term is the average loss per data point, the second the regularisation)

The gradient of the objective function is

    ∇_w L = (1/N) Σ_{i=1}^N ∇_w ℓ(w; x_i, y_i) + λ ∇_w R(w)

For Ridge Regression we have the square loss plus the ℓ2 regularisation w^T w:

    L_ridge(w) = (1/N) Σ_{i=1}^N (w^T x_i − y_i)² + λ w^T w

    ∇_w L_ridge = (1/N) Σ_{i=1}^N 2 (w^T x_i − y_i) x_i + 2λw
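
The ridge gradient above translates directly into vectorised numpy; a sketch on
made-up data (λ and the step size are illustrative):

```python
import numpy as np

def ridge_gradient(w, X, y, lam):
    """(1/N) * sum_i 2 (w^T x_i - y_i) x_i + 2 * lam * w, vectorised."""
    residuals = X @ w - y                     # w^T x_i - y_i for all i
    return (2.0 / len(y)) * (X.T @ residuals) + 2 * lam * w

# Illustrative run on random data
rng = np.random.default_rng(1)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)
w = np.zeros(3)
for t in range(500):
    w = w - 0.1 * ridge_gradient(w, X, y, lam=0.1)
print(w)
```
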
Stochastic Gradient Descent

As part of the learning algorithm, we calculate the following gradient:

    ∇_w L = (1/N) Σ_{i=1}^N ∇_w ℓ(w; x_i, y_i) + λ ∇_w R(w)

Suppose we pick a random data point (x_i, y_i) and evaluate g_i = ∇_w ℓ(w; x_i, y_i).

What is E[g_i]?

    E[g_i] = (1/N) Σ_{i=1}^N ∇_w ℓ(w; x_i, y_i)

In expectation, g_i points in the same direction as the entire gradient
(except for the regularisation term).

We compute the gradient at one data point instead of at all data points!

• Online learning

• Cheap to compute one gradient
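
A sketch of SGD for least squares; the data, step size, and epoch count are
illustrative, and the initial and final parameters echo the experiment on the
next slide:

```python
import numpy as np

def sgd(grad_point, w0, X, y, eta=0.01, epochs=10, seed=0):
    """SGD: each update uses the gradient at one random data point."""
    rng = np.random.default_rng(seed)
    w = np.asarray(w0, dtype=float)
    N = len(y)
    for epoch in range(epochs):
        for _ in range(N):
            i = rng.integers(N)                    # pick a random data point
            w = w - eta * grad_point(w, X[i], y[i])
    return w

# Per-point square-loss gradient: 2 (w^T x_i - y_i) x_i
grad_point = lambda w, xi, yi: 2 * (w @ xi - yi) * xi

rng = np.random.default_rng(2)
X = rng.normal(size=(1000, 2))
y = X @ np.array([1.0, 1.0])                       # noiseless toy labels
print(sgd(grad_point, w0=[-2.0, -3.0], X=X, y=y))  # ~ (1, 1)
```
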
Stochastic Gradient Descent vs (Batch) Gradient Descent

• 1000 data points for training and 1000 data points for test

• 2 features x_1 ∼ N(0, 5) and x_2 ∼ N(0, 8); centred labels

• Least-squares linear regression model f_w(x) = x_1 w_1 + x_2 w_2

• Parameters (w_1, w_2): initial (−2, −3) and final (1, 1)

In practice, mini-batch gradient descent significantly improves performance:

• it reduces the variance of the gradients and hence is more stable than SGD
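
Mini-batch gradient descent changes only how the stochastic gradient is formed:
average the per-point gradients over a small random batch. A sketch (the batch
size and with-replacement sampling are illustrative choices):

```python
import numpy as np

def minibatch_gradient(w, X, y, batch_size=32, rng=None):
    """Average of per-point square-loss gradients over a random mini-batch
    (sampled with replacement for simplicity)."""
    rng = rng or np.random.default_rng(0)
    idx = rng.integers(len(y), size=batch_size)
    Xb, yb = X[idx], y[idx]
    return (2.0 / batch_size) * (Xb.T @ (Xb @ w - yb))
```

In the SGD loop above, the per-point update is simply replaced by
w = w − η · minibatch_gradient(w, X, y).
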
Sub-Gradient Descent

Minimising the Lasso Objective

Linear model trained with least squares loss and ℓ1-regularisation:

    L_lasso(w) = Σ_{i=1}^N (w^T x_i − y_i)² + λ Σ_{i=1}^D |w_i|

• The quadratic part of the loss function can’t be framed as linear programming

• Lasso regularisation does not allow for closed-form solutions

• We typically resort to general optimisation methods

• We still have the problem that the objective function is not differentiable
  everywhere!

In these cases, we can use the sub-gradient descent approach:

• The function may have several sub-gradients at a given point

• Choose any of these sub-gradients in the gradient descent update formula
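
A sketch of sub-gradient descent for the lasso objective, picking sign(w_i) (with
0 at w_i = 0) as the sub-gradient of the non-differentiable term; the data, λ, and
the decaying step size are illustrative:

```python
import numpy as np

def lasso_subgradient(w, X, y, lam):
    """A valid sub-gradient of the lasso objective: np.sign(w) picks the
    sub-derivative 0 at w_i = 0 (any value in [-1, 1] would be valid there)."""
    return 2 * (X.T @ (X @ w - y)) + lam * np.sign(w)

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))
y = X @ np.array([2.0, -1.0, 0.0, 0.0, 0.0])    # sparse ground truth

w = np.zeros(5)
for t in range(1, 2001):
    eta = 0.001 / np.sqrt(t)                    # decaying step size c/sqrt(t)
    w = w - eta * lasso_subgradient(w, X, y, lam=1.0)
print(w.round(3))                               # ~ (2, -1, 0, 0, 0)
```
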
Sub-gradient Descent

We discuss the case when f is convex:

    f(αx + (1 − α)y) ≤ α f(x) + (1 − α) f(y)   for all x, y and for α ∈ [0, 1]

Example: f_1(x) = 0.1x² and the piecewise function

    f_2(x) = f_1(x) if x < 2, and 2x − 3.6 otherwise

[Figure: graphs of f_1 and f_2 with tangent lines at x = 2]

A convex function lies above the tangent plane at any point.

Univariate case: f(x) ≥ f(x_0) + g·(x − x_0) for every g in the set f′(x_0) of
sub-derivatives g of f at x_0

Multivariate case: f(x) ≥ f(x_0) + g^T (x − x_0) for every g in the set ∇f(x_0) of
sub-gradients g of f at x_0
Sub-gradient Descent: Example 1

Compute the sub-derivatives of f(z) = |z| at points x_0 = 1, x_1 = −3, and x_2 = 0.

[Figure: graph of f(z) = |z|]

A sub-derivative g at x_i must satisfy f(z) ≥ f(x_i) + g·(z − x_i) for all z:

    f(z) ≥ f(x_0) + g·(z − x_0) = 1 + g·(z − 1)
    f(z) ≥ f(x_1) + g·(z − x_1) = 3 + g·(z + 3)
    f(z) ≥ f(x_2) + g·(z − x_2) = g·z

f is differentiable at x_0 = 1 and x_1 = −3, so there is a single derivative of f at
each of these points. For x_0 = 1, g = 1. For x_1 = −3, g = −1.

At x_2 = 0, f admits the sub-derivatives g ∈ [−1, 1].
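
A small numerical sanity check of the inequality at x_2 = 0 (an illustrative test,
not part of the slides):

```python
import numpy as np

# Check f(z) >= f(x2) + g*(z - x2) = g*z for f(z) = |z| at x2 = 0.
z = np.linspace(-2, 2, 401)
for g in [-1.0, -0.3, 0.0, 0.7, 1.0]:     # candidate sub-derivatives in [-1, 1]
    assert np.all(np.abs(z) >= g * z - 1e-12)

# A slope outside [-1, 1] violates the inequality somewhere:
print(np.any(np.abs(z) < 1.5 * z))        # True, e.g., at z = 1
```
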
Sub-gradient Descent: Example 2

Compute a sub-derivative of f(z) = max(z, 0) at point x_0 = 0.

[Figure: graph of f(z) = max(z, 0)]

    f(z) ≥ f(x_0) + g·(z − x_0) = g·z

The sub-derivatives g ∈ [0, 1] satisfy the above inequality.
Constrained Convex Optimisation

Constrained Convex Optimisation

Gradient descent

• Minimises f(w) by moving in the negative gradient direction at each step:

    w_{t+1} = w_t − η_t ∇f(w_t)

• There is no constraint on the parameters

Projected gradient descent

• Minimises f(w) subject to the additional constraint w ∈ C:

    z_{t+1} = w_t − η_t ∇f(w_t)
    w_{t+1} = argmin_{w_C ∈ C} ∥z_{t+1} − w_C∥

• Each gradient step is followed by a projection step
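
A sketch of projected gradient descent with projection onto an ℓ2 ball, i.e., the
ridge-style constraint ∥w∥ ≤ R (the toy objective and radius are illustrative):

```python
import numpy as np

def project_l2_ball(z, R):
    """Projection onto C = {w : ||w|| <= R}, the ridge-style constraint set."""
    norm = np.linalg.norm(z)
    return z if norm <= R else (R / norm) * z

def projected_gd(grad, w0, R, eta=0.1, T=500):
    w = np.asarray(w0, dtype=float)
    for t in range(T):
        z = w - eta * grad(w)         # gradient step
        w = project_l2_ball(z, R)     # projection step
    return w

# Toy problem: minimise ||w - (3, 4)||^2 subject to ||w|| <= 1
grad = lambda w: 2 * (w - np.array([3.0, 4.0]))
print(projected_gd(grad, w0=[0.0, 0.0], R=1.0))   # ~ (0.6, 0.8)
```
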
Constrained Convex Optimisation: Examples

Minimise (Xw − y)^T (Xw − y) subject to the ridge or lasso constraint:

    ridge: w^T w < R        lasso: Σ_{i=1}^D |w_i| < R
Second Order Methods

Newton’s Method

In calculus: Finds roots of a differentiable function f, i.e., solutions to f(x) = 0

In optimisation: Finds roots of f′, i.e., solutions to f′(x) = 0

• Function f needs to be twice differentiable

• The roots of f′ are stationary points of f, i.e., minima/maxima/saddle points
Newton’s Method in One Dimension

• Construct a sequence of points x_1, …, x_n starting with an initial guess x_0

• The sequence converges towards a minimiser x* of f using a sequence of
  second-order Taylor approximations of f around the iterates:

    f(x) ≈ f(x_k) + (x − x_k) f′(x_k) + ½ (x − x_k)² f″(x_k)

• x_{k+1} = x* is defined as the minimiser of this quadratic approximation

• If f″ is positive, then the quadratic approximation is convex,
  and a minimiser is obtained by setting the derivative to zero:

    0 = d/dx [ f(x_k) + (x − x_k) f′(x_k) + ½ (x − x_k)² f″(x_k) ]
      = f′(x_k) + (x* − x_k) f″(x_k)

  Then,

    x_{k+1} = x* = x_k − f′(x_k) [f″(x_k)]⁻¹
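
A minimal sketch of the one-dimensional iteration; it reuses f(x) = x³ − 9x from
the next slide (the starting point and iteration count are illustrative):

```python
def newton_1d(fprime, fsecond, x0, T=10):
    """Newton iteration x_{k+1} = x_k - f'(x_k) / f''(x_k)."""
    x = x0
    for k in range(T):
        x = x - fprime(x) / fsecond(x)
    return x

# f(x) = x^3 - 9x, so f'(x) = 3x^2 - 9 and f''(x) = 6x; start at x0 = 1
x_star = newton_1d(lambda x: 3 * x**2 - 9, lambda x: 6 * x, x0=1.0)
print(x_star)   # ~ 1.7320... = sqrt(3), the local minimiser
```
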
Geometric Interpretation of Newton’s Method

At iteration k, we fit a paraboloid to the surface of f at x_k with the same slope
and curvature as the surface at x_k, and go for the extremum of that paraboloid.

Example: f(x) = x³ − 9x at a point x_0

• Gradient descent uses the first derivative: the local linear approximation
  around the current point

    g(x) = f(x_0) + (x − x_0) f′(x_0) = x_0³ − 9x_0 + (x − x_0)(3x_0² − 9)

• Newton’s method uses the second derivative: the degree-2 Taylor approximation

    r(x) = f(x_0) + (x − x_0) f′(x_0) + ½ (x − x_0)² f″(x_0)
         = x_0³ − 9x_0 + (x − x_0)(3x_0² − 9) + ½ (x − x_0)² · 6x_0
Newton’s Method in High Dimensions

First derivative → gradient        Second derivative → Hessian

• Approximate f around x_k using the second-order Taylor approximation

    f(x) ≈ f(x_k) + g_k^T (x − x_k) + ½ (x − x_k)^T H_k (x − x_k)

• The gradient of this approximation is given by:

    ∇_x f = g_k + H_k (x − x_k)

• Setting ∇_x f = 0, we obtain x* = x_k − H_k⁻¹ g_k

• We move directly to the (unique) stationary point x* of the approximation

• We repeat the above iteration with x_{k+1} = x*
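
A sketch of one multivariate Newton step; following the advice on the next slide,
it solves H_k d = −g_k with a linear solver instead of forming the inverse (the
quadratic test function is illustrative):

```python
import numpy as np

def newton_step(grad, hessian, x):
    """One Newton step: solve H_k d = -g_k rather than forming H_k^{-1}."""
    g, H = grad(x), hessian(x)
    d = np.linalg.solve(H, -g)          # cheaper and more stable than inversion
    return x + d

# Quadratic test function f(x) = 0.5 x^T A x - b^T x (its Hessian is A)
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, 1.0])
x = newton_step(lambda x: A @ x - b, lambda x: A, np.zeros(2))
print(np.allclose(A @ x, b))            # True: one step solves a quadratic
```
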
Newton’s Method: Computation and Convergence

Newton’s method

• Computational requirements at each Newton step:

  • D + (D choose 2) partial derivatives and the inverse of the Hessian

  • Instead: Compute x as the solution of the system of linear equations

        H_k (x − x_k) = −g_k

    using factorisations (e.g., Cholesky) of H_k

• For convex f

  • It converges to stationary points of the quadratic approximation

  • All stationary points are global minima

• For non-convex f

  • Stationary points may be neither minima nor in the decreasing direction of f

  • Not successful for training deep neural networks:
    abundance of saddle points for their non-convex objective functions
Summary

Convex Optimisation

• Convex optimisation is ‘efficient’ (i.e., polynomial time)

• Try to cast the learning problem as a convex optimisation problem

• Many, many extensions of standard gradient descent exist: Adagrad,
  momentum-based methods, BFGS, L-BFGS, Adam, etc.

• Books: Boyd and Vandenberghe; Nesterov

Non-Convex Optimisation

• Encountered frequently in deep learning

• (Stochastic) gradient descent gives local minima

• Book: Nonlinear Programming, Dimitri Bertsekas