11 Gradient Descent
MSc: https://lms.uzh.ch/url/RepositoryEntry/17589469505
PhD: https://lms.uzh.ch/url/RepositoryEntry/17589469506
Calculus Background
To search for a local minimum of $f$, we can repeatedly move in the direction of steepest decrease of $f$, that is, in the direction opposite to the gradient of $f$.
Gradient Vectors are Orthogonal to Contour Curves
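This can be checked numerically: along a contour of $f$, the tangent direction is orthogonal to the gradient. A minimal sketch, where the function $f(x, y) = x^2/4 + y^2$ and the contour parametrisation are illustrative choices, not from the slides:

```python
import numpy as np

# Illustrative function f(x, y) = x^2/4 + y^2; its contours are ellipses.
def grad_f(x, y):
    return np.array([x / 2.0, 2.0 * y])

# Parametrise the contour f = c as an ellipse: x = 2*sqrt(c)*cos(t), y = sqrt(c)*sin(t).
c, t = 1.0, 0.7
point = np.array([2 * np.sqrt(c) * np.cos(t), np.sqrt(c) * np.sin(t)])
tangent = np.array([-2 * np.sqrt(c) * np.sin(t), np.sqrt(c) * np.cos(t)])  # d(point)/dt

# The gradient at the point is orthogonal to the contour's tangent direction.
print(np.dot(grad_f(*point), tangent))  # ~0 up to floating-point error
```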
Gradient Points in the Direction of the Steepest Increase (1/2)
Each component of the gradient says how fast the function changes with respect to the standard basis:
$$\frac{\partial f}{\partial x_i}(\mathbf{a}) = \lim_{h \to 0} \frac{f(\mathbf{a} + h\,\mathbf{u}_i) - f(\mathbf{a})}{h}$$
What about the rate of change in the direction of some arbitrary vector $\mathbf{v}$?
$$\nabla_{\mathbf{v}} f(\mathbf{a}) = \lim_{h \to 0} \frac{f(\mathbf{a} + h\,\mathbf{v}) - f(\mathbf{a})}{h} = \nabla f(\mathbf{a}) \cdot \mathbf{v} = v_1 \frac{\partial f}{\partial x_1}(\mathbf{a}) + \ldots + v_D \frac{\partial f}{\partial x_D}(\mathbf{a})$$
$\nabla_{\mathbf{v}} f(\mathbf{a})$ compounds the effect of moving along the direction of the first component, then along the direction of the second component, and so on
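As a sanity check, the limit definition and the dot-product form agree numerically. A minimal sketch, with an illustrative quadratic $f$ and point $\mathbf{a}$ that are not from the slides:

```python
import numpy as np

# Illustrative function and its analytic gradient (not from the slides).
f = lambda w: w[0] ** 2 + 3 * w[0] * w[1]
grad_f = lambda w: np.array([2 * w[0] + 3 * w[1], 3 * w[0]])

a = np.array([1.0, 2.0])
v = np.array([0.6, 0.8])                 # an arbitrary unit-length direction

h = 1e-6
finite_diff = (f(a + h * v) - f(a)) / h  # limit definition, approximated
dot_product = grad_f(a) @ v              # nabla f(a) . v

print(finite_diff, dot_product)          # both approximately 7.2
```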
Gradient Points in the Direction of the Steepest Increase (2/2)
Which direction should we walk from $\mathbf{a}$ so that $f$'s output increases most quickly?
• That is, which unit vector $\mathbf{v}$ maximizes the directional derivative along $\mathbf{v}$?
$$\mathbf{v}^* = \operatorname*{argmax}_{\mathbf{v}} \nabla_{\mathbf{v}} f(\mathbf{a}) \overset{(1)}{=} \operatorname*{argmax}_{\mathbf{v}} \nabla f(\mathbf{a}) \cdot \mathbf{v} \overset{(2)}{=} \operatorname*{argmax}_{\mathbf{v}} \|\nabla f(\mathbf{a})\|\,\|\mathbf{v}\|\cos\theta \overset{(3)}{=} \operatorname*{argmax}_{\mathbf{v}} \|\nabla f(\mathbf{a})\|\cos\theta$$
where (1) uses the dot-product form of the directional derivative, (2) the geometric definition of the dot product with $\theta$ the angle between $\nabla f(\mathbf{a})$ and $\mathbf{v}$, and (3) the fact that $\|\mathbf{v}\| = 1$.
The maximal value is attained for $\cos\theta = 1$, i.e., $\theta = 0$, so $\nabla f(\mathbf{a})$ and $\mathbf{v}^*$ have the same direction
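Numerically, among unit vectors the one aligned with $\nabla f(\mathbf{a})$ gives the largest directional derivative. A minimal sketch, reusing the illustrative $f$ from the snippet above:

```python
import numpy as np

grad_f = lambda w: np.array([2 * w[0] + 3 * w[1], 3 * w[0]])  # same illustrative gradient
a = np.array([1.0, 2.0])
g = grad_f(a)

# Sweep unit vectors v(theta) = (cos theta, sin theta) and evaluate nabla f(a) . v.
thetas = np.linspace(0, 2 * np.pi, 3600)
vs = np.stack([np.cos(thetas), np.sin(thetas)], axis=1)
dir_derivs = vs @ g

best_v = vs[np.argmax(dir_derivs)]
print(best_v, g / np.linalg.norm(g))  # the maximiser matches the normalised gradient
```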
The Hessian Matrix
The Hessian of $f$ is the matrix of all second-order partial derivatives, $H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$.
Gradient and Hessian: Example
$$z = f(w_1, w_2) = \frac{w_1^2}{a^2} + \frac{w_2^2}{b^2}$$
$$\nabla_{\mathbf{w}} f = \begin{bmatrix} \frac{\partial f}{\partial w_1} \\[4pt] \frac{\partial f}{\partial w_2} \end{bmatrix} = \begin{bmatrix} \frac{2w_1}{a^2} \\[4pt] \frac{2w_2}{b^2} \end{bmatrix}$$
$$H = \begin{bmatrix} \frac{\partial^2 f}{\partial w_1^2} & \frac{\partial^2 f}{\partial w_1 \partial w_2} \\[4pt] \frac{\partial^2 f}{\partial w_2 \partial w_1} & \frac{\partial^2 f}{\partial w_2^2} \end{bmatrix} = \begin{bmatrix} \frac{2}{a^2} & 0 \\[4pt] 0 & \frac{2}{b^2} \end{bmatrix}$$
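The analytic gradient above can be checked with finite differences. A minimal sketch, with illustrative values $a = 2$, $b = 3$ and an arbitrary evaluation point:

```python
import numpy as np

a, b = 2.0, 3.0                                      # illustrative constants
f = lambda w: w[0] ** 2 / a ** 2 + w[1] ** 2 / b ** 2

def numerical_gradient(f, w, h=1e-5):
    """Central finite-difference approximation of the gradient."""
    g = np.zeros_like(w)
    for i in range(len(w)):
        e = np.zeros_like(w); e[i] = h
        g[i] = (f(w + e) - f(w - e)) / (2 * h)
    return g

w = np.array([1.0, -2.0])
print(numerical_gradient(f, w))                      # ~ [2*w1/a^2, 2*w2/b^2]
print(np.array([2 * w[0] / a**2, 2 * w[1] / b**2]))  # analytic gradient from the slide
print(np.diag([2 / a**2, 2 / b**2]))                 # analytic Hessian from the slide
```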
Gradient Descent
Gradient Descent Algorithm
Gradient descent is one of the simplest, yet very general, optimisation algorithms for finding a local minimum of a differentiable function.
• Recall our general optimisation goal for a loss function $f$ with parameters $\mathbf{w}$:
$$\mathbf{w}^* = \operatorname*{argmin}_{\mathbf{w}} f(\mathbf{w})$$
• Starting from an initial $\mathbf{w}_0$, repeatedly step against the gradient:
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \mathbf{g}_t = \mathbf{w}_t - \eta_t \nabla f(\mathbf{w}_t)$$
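A minimal sketch of this update rule as a loop; the quadratic objective, step size, and iteration count are illustrative choices, not from the slides:

```python
import numpy as np

# Illustrative objective f(w) = w1^2/4 + w2^2, with gradient [w1/2, 2*w2].
grad_f = lambda w: np.array([w[0] / 2.0, 2.0 * w[1]])

w = np.array([4.0, -3.0])     # initial point
eta = 0.1                     # constant step size (illustrative)

for t in range(200):
    w = w - eta * grad_f(w)   # w_{t+1} = w_t - eta_t * grad f(w_t)

print(w)                      # close to the minimiser (0, 0)
```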
Gradient Descent: Convex vs. Non-Convex Functions
Gradient Descent for Least Squares Regression
$$L(\mathbf{w}) = (X\mathbf{w} - \mathbf{y})^T (X\mathbf{w} - \mathbf{y}) = \sum_{i=1}^{N} (\mathbf{x}_i^T \mathbf{w} - y_i)^2$$
Gradient descent vs. the closed-form solution for very large ($N$) and wide ($D$) datasets
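A minimal sketch of gradient descent on this objective using $\nabla_{\mathbf{w}} L = 2X^T(X\mathbf{w} - \mathbf{y})$, compared with the closed-form least-squares solution; the synthetic data, step size, and iteration count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 5                            # illustrative sizes
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + 0.1 * rng.normal(size=N)

w = np.zeros(D)
eta = 1e-4                                # small enough for a stable descent here
for t in range(2000):
    grad = 2 * X.T @ (X @ w - y)          # gradient of (Xw - y)^T (Xw - y)
    w = w - eta * grad

# Compare with the closed-form least-squares solution.
w_closed = np.linalg.lstsq(X, y, rcond=None)[0]
print(np.max(np.abs(w - w_closed)))       # small
```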
Choosing a Step Size
Choosing a good step size is key, and we may want a time-varying step size
• Decaying step size: $\eta_t = c/t$. Different rates of decay are common, e.g., $\frac{1}{\sqrt{t}}$
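A minimal sketch of a decaying schedule plugged into the gradient descent update; the constant $c$ and the one-dimensional objective are illustrative:

```python
import numpy as np

grad_f = lambda w: 2.0 * w            # gradient of the illustrative objective f(w) = w^2

w, c = 5.0, 0.4
for t in range(1, 1001):
    eta_t = c / np.sqrt(t)            # decaying step size; c / t is another common choice
    w = w - eta_t * grad_f(w)

print(w)                              # approaches the minimiser 0
```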
When to Stop? Test for Convergence
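Only the title of this slide survives extraction; a minimal sketch of typical convergence tests (gradient norm below a tolerance, or a negligible change in the objective), where these specific criteria and tolerances are assumptions, not taken from the slide:

```python
import numpy as np

f = lambda w: w[0] ** 2 + 2 * w[1] ** 2              # illustrative objective
grad_f = lambda w: np.array([2 * w[0], 4 * w[1]])

w, eta = np.array([3.0, -2.0]), 0.1
f_old = f(w)
for t in range(10_000):
    w = w - eta * grad_f(w)
    f_new = f(w)
    # Stop when the gradient is (almost) zero or the objective no longer improves.
    if np.linalg.norm(grad_f(w)) < 1e-6 or abs(f_old - f_new) < 1e-12:
        break
    f_old = f_new

print(t, w)
```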
Stochastic Gradient Descent
Optimisation Algorithms for Machine Learning
$$\nabla_{\mathbf{w}} L_{\text{ridge}} = \frac{1}{N}\sum_{i=1}^{N} 2(\mathbf{w}^T\mathbf{x}_i - y_i)\,\mathbf{x}_i + 2\lambda\mathbf{w}$$
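A minimal sketch of evaluating this gradient, highlighting that the sum runs over all $N$ data points; the synthetic data and $\lambda$ are illustrative, and the $2\lambda\mathbf{w}$ term is assumed to sit outside the $1/N$ average:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 1000, 3, 0.1                    # illustrative sizes and regularisation strength
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w = rng.normal(size=D)

# Full-batch ridge gradient: (1/N) * sum_i 2*(w^T x_i - y_i) x_i + 2*lambda*w.
grad = (2.0 / N) * X.T @ (X @ w - y) + 2.0 * lam * w
print(grad)
```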
Stochastic Gradient Descent
What is $\mathbb{E}[\mathbf{g}_i]$ when data point $i$ is picked uniformly at random?
$$\mathbb{E}[\mathbf{g}_i] = \frac{1}{N}\sum_{i=1}^{N} \nabla_{\mathbf{w}} \ell(\mathbf{w}; \mathbf{x}_i, y_i)$$
• The stochastic gradient is an unbiased estimate of the full (averaged) gradient
We compute the gradient at one data point instead of at all data points!
• Online learning
• Cheap to compute one gradient
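A minimal SGD sketch that updates with the gradient at a single, randomly chosen data point; the synthetic least-squares data and step size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 2
X = rng.normal(size=(N, D))
w_true = np.array([1.0, 1.0])
y = X @ w_true

w = np.array([-2.0, -3.0])
eta = 0.01
for t in range(20_000):
    i = rng.integers(N)                      # pick one data point uniformly at random
    g_i = 2 * (X[i] @ w - y[i]) * X[i]       # gradient of (x_i^T w - y_i)^2 at that point
    w = w - eta * g_i

print(w)   # noisy updates, but close to w_true = (1, 1)
```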
Stochastic Gradient Descent vs (Batch) Gradient Descent
• 1000 data points for training and 1000 data points for test
• 2 features $x_1 \sim \mathcal{N}(0, 5)$ and $x_2 \sim \mathcal{N}(0, 8)$; centred labels
• Least-squares linear regression model $f_{\mathbf{w}}(\mathbf{x}) = x_1 w_1 + x_2 w_2$
• Parameters $(w_1, w_2)$: initial $(-2, -3)$ and final $(1, 1)$
• (Batch) gradient descent averages the gradient over all data points, which reduces the variance in the gradients and hence is more stable than SGD
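A sketch approximating the experiment above, assuming $\mathcal{N}(0, 5)$ and $\mathcal{N}(0, 8)$ denote standard deviations and noise-free centred labels so that the optimum is exactly $(1, 1)$; these details and the step sizes are assumptions, not from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 1000
X = np.column_stack([rng.normal(0, 5, N), rng.normal(0, 8, N)])
w_true = np.array([1.0, 1.0])
y = X @ w_true                               # centred, noise-free labels

def batch_gd(w, eta=1e-5, steps=200):
    for _ in range(steps):
        w = w - eta * 2 * X.T @ (X @ w - y)  # full gradient: low variance, smooth path
    return w

def sgd(w, eta=5e-4, steps=2000):
    for _ in range(steps):
        i = rng.integers(N)
        w = w - eta * 2 * (X[i] @ w - y[i]) * X[i]  # single-point gradient: noisy path
    return w

print(batch_gd(np.array([-2.0, -3.0])))      # both end up near (1, 1)
print(sgd(np.array([-2.0, -3.0])))
```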
Sub-Gradient Descent
Minimising the Lasso Objective
$$L_{\text{lasso}}(\mathbf{w}) = \sum_{i=1}^{N} (\mathbf{w}^T\mathbf{x}_i - y_i)^2 + \lambda \sum_{i=1}^{D} |w_i|$$
• We still have the problem that the objective function is not differentiable everywhere!
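Because $|w_i|$ is not differentiable at 0, descent must use a sub-gradient. A minimal sketch of one valid sub-gradient of the lasso objective, with synthetic data and choosing 0 as the sub-gradient of $|\cdot|$ at 0:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 200, 4, 1.0
X = rng.normal(size=(N, D))
y = rng.normal(size=N)
w = np.array([0.5, 0.0, -1.0, 2.0])

# A sub-gradient of lambda*|w_i| is lambda*sign(w_i) for w_i != 0, and any value in
# [-lambda, lambda] at w_i = 0 (np.sign conveniently returns 0 there).
subgrad = 2 * X.T @ (X @ w - y) + lam * np.sign(w)
print(subgrad)
```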
Sub-gradient Descent
$$f_1(x) = 0.1x^2 \qquad\qquad f_2(x) = \begin{cases} f_1(x) & \text{if } x < 2 \\ 2x - 3.6 & \text{otherwise} \end{cases}$$
[Figure: both functions with their tangent lines at $x = 2$]
Sub-gradient Descent: Example 1
A sub-gradient $g$ at $x_i$ satisfies $f(z) \geq f(x_i) + g \cdot (z - x_i)$ for all $z$:
$$f(z) \geq f(x_0) + g \cdot (z - x_0) = 1 + g \cdot (z - 1)$$
$$f(z) \geq f(x_1) + g \cdot (z - x_1) = 3 + g \cdot (z + 3)$$
$$f(z) \geq f(x_2) + g \cdot (z - x_2) = g \cdot z$$
[Figure: plot of $f(z)$ over $z \in [-1, 1]$ with these lower-bounding lines]
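The values above ($f(1) = 1$, $f(-3) = 3$, $f(0) = 0$) are consistent with $f(z) = |z|$; assuming that, a quick numerical check that every $g \in [-1, 1]$ satisfies the sub-gradient inequality at $x_2 = 0$:

```python
import numpy as np

f = lambda z: np.abs(z)          # assumed example function, consistent with the plotted values

z = np.linspace(-5, 5, 1001)
for g in [-1.0, -0.5, 0.0, 0.5, 1.0]:                        # candidate sub-gradients at x2 = 0
    assert np.all(f(z) >= f(0.0) + g * (z - 0.0) - 1e-12)    # f(z) >= f(x2) + g*(z - x2)

print("every g in [-1, 1] is a valid sub-gradient of |z| at 0")
```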
Sub-gradient Descent: Example 2
$$f(z) \geq f(x_0) + g \cdot (z - x_0) = g \cdot z$$
[Figure: plot of $f(z)$ over $z \in [-1, 1]$ with the lower-bounding line through $x_0$]
Constrained Convex Optimisation
Constrained Convex Optimisation
Gradient descent:
$$\mathbf{w}_{t+1} = \mathbf{w}_t - \eta_t \nabla f(\mathbf{w}_t)$$
Projected gradient descent: take a gradient step, then project back onto the constraint set $C$:
$$\mathbf{z}_{t+1} = \mathbf{w}_t - \eta_t \nabla f(\mathbf{w}_t)$$
$$\mathbf{w}_{t+1} = \operatorname*{argmin}_{\mathbf{w}_C \in C} \|\mathbf{z}_{t+1} - \mathbf{w}_C\|$$
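A minimal sketch of this projected update with a Euclidean projection onto the ball $\{\mathbf{w} : \mathbf{w}^T\mathbf{w} \le R\}$; the objective, radius, and step size are illustrative:

```python
import numpy as np

grad_f = lambda w: 2 * (w - np.array([3.0, 4.0]))   # illustrative f(w) = ||w - (3, 4)||^2

def project_l2_ball(z, R):
    """Closest point to z in {w : w^T w <= R}: rescale z if it lies outside the ball."""
    norm2 = z @ z
    return z if norm2 <= R else z * np.sqrt(R / norm2)

R, eta = 1.0, 0.1
w = np.zeros(2)
for t in range(500):
    z = w - eta * grad_f(w)          # unconstrained gradient step
    w = project_l2_ball(z, R)        # project back onto the constraint set C

print(w, w @ w)   # the constrained minimiser sits on the boundary, w^T w = R
```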
Constrained Convex Optimisation: Examples
Minimise $(X\mathbf{w} - \mathbf{y})^T (X\mathbf{w} - \mathbf{y})$ subject to the ridge and lasso constraints
$$\mathbf{w}^T\mathbf{w} < R \qquad\qquad \sum_{i=1}^{D} |w_i| < R$$
Second Order Methods
Newton’s Method
Newton’s Method in One Dimension
$$f(x) \approx f(x_k) + (x - x_k) f'(x_k) + \frac{1}{2}(x - x_k)^2 f''(x_k)$$
$$0 = \frac{d}{dx}\left[ f(x_k) + (x - x_k) f'(x_k) + \frac{1}{2}(x - x_k)^2 f''(x_k) \right] = f'(x_k) + (x^* - x_k) f''(x_k)$$
Then,
$$x_{k+1} = x^* = x_k - f'(x_k)\,[f''(x_k)]^{-1}$$
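A minimal sketch of the one-dimensional update $x_{k+1} = x_k - f'(x_k)/f''(x_k)$; the function and starting point are illustrative:

```python
# Illustrative function f(x) = x^4 - 3x^2 + 2, with f'(x) = 4x^3 - 6x and f''(x) = 12x^2 - 6.
f_prime = lambda x: 4 * x ** 3 - 6 * x
f_double_prime = lambda x: 12 * x ** 2 - 6

x = 2.0                                        # starting point (illustrative)
for k in range(20):
    x = x - f_prime(x) / f_double_prime(x)     # x_{k+1} = x_k - f'(x_k) [f''(x_k)]^{-1}

print(x)   # a stationary point; here the local minimum at sqrt(3/2) ~ 1.2247
```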
Geometric Interpretation of Newton’s Method
$$f(\mathbf{x}) \approx f(\mathbf{x}_k) + \mathbf{g}_k^T(\mathbf{x} - \mathbf{x}_k) + \frac{1}{2}(\mathbf{x} - \mathbf{x}_k)^T H_k (\mathbf{x} - \mathbf{x}_k)$$
$$\nabla_{\mathbf{x}} f = \mathbf{g}_k + H_k(\mathbf{x} - \mathbf{x}_k)$$
• Setting $\nabla_{\mathbf{x}} f = \mathbf{0}$, we obtain $\mathbf{x}^* = \mathbf{x}_k - H_k^{-1}\mathbf{g}_k$
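A minimal sketch of the multivariate Newton step; in practice one solves the linear system $H_k \Delta = -\mathbf{g}_k$ rather than inverting $H_k$. The quadratic objective is illustrative, and a quadratic is minimised exactly in a single step:

```python
import numpy as np

# Illustrative quadratic f(x) = 0.5 x^T A x - b^T x, so g = A x - b and H = A.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

x = np.array([5.0, 5.0])
g = A @ x - b                       # g_k
H = A                               # H_k
x = x + np.linalg.solve(H, -g)      # Newton step: x* = x_k - H_k^{-1} g_k

print(x, np.allclose(A @ x, b))     # exact minimiser of the quadratic in one step
```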
Newton’s Method: Computation and Convergence
Newton's method solves the linear system
$$H_k(\mathbf{x} - \mathbf{x}_k) = -\mathbf{g}_k$$
• For convex $f$
  • It converges to stationary points of the quadratic approximation
• For non-convex $f$
  • Stationary points may not be minima, and the Newton step may not point in a decreasing direction of $f$
[Figures: convex optimisation vs. non-convex optimisation]