
First Order Method

The document discusses first-order methods for local optimization, focusing on gradient descent algorithms that utilize a function's first derivatives to find descent directions efficiently. It explains the first-order optimality condition, which characterizes minima of differentiable functions, and highlights the challenges in solving first-order systems of equations. Additionally, it covers coordinate descent as a sequential approach to finding stationary points and provides examples of minimizing functions using these concepts.

Uploaded by

Hoàng Lê Văn

FirstOrderMethods.ipynb - Colab https://colab.research.google.com/drive/1oJub77tNg6ArYQ8pkIbz9PG-...

First-Order Methods

In local optimization methods, we repeatedly refine an initial sample input by traveling in descent directions, i.e., directions in the input space that lead to points lower and lower on the function.

We have considered two methods for finding such a descent direction at each step:

• random search: prohibitively expensive as the dimension grows.
• coordinate search: severely restricted in terms of the quality of descent direction.

By exploiting a function's first derivative(s), we can construct a local optimization method that determines high-quality descent directions at a cheaper cost.

Big picture view of the gradient descent algorithm

A local optimization method is one where we aim to find minima of a given function by beginning at some point $\mathbf{w}^0$ and taking a number of steps $\mathbf{w}^1, \mathbf{w}^2, \mathbf{w}^3, \ldots, \mathbf{w}^K$ of the generic form

$$\mathbf{w}^k = \mathbf{w}^{k-1} + \alpha \mathbf{d}^k$$

where:

• $\mathbf{d}^k$ are direction vectors (ideally descent directions that lead us to lower and lower parts of a function)
• $\alpha$ is called the steplength parameter

The gradient descent algorithm, which is a first-order local optimization method, employs a function's first derivative(s) to cheaply compute a high-quality descent direction.

image.png

• The gradient of a function helps form the best linear approximation to the function (called the first-order Taylor series approximation).

• Because the approximation closely matches the function locally, the descent direction of the tangent hyperplane (or the tangent line) is also a descent direction for the function itself.

• It is easy to compute the descent direction of a line or a hyperplane.

The First-Order Optimality Condition

The first-order optimality condition describes how any differentiable function's first derivative(s) behave at its minima.

The first-order condition

func1 = lambda w: w**2 + 3


func2 = lambda w: w[0]**2 + w[1]**2 + 3
compare_2d3d(func1 = func1,func2 = func2)

• We draw the first-order approximation (a tangent line/hyperplane) at the function's minimum.
• In both examples, the tangent line/hyperplane is perfectly flat, indicating that the first derivative(s) is exactly zero at the function's minimum.

The values of the first-order derivative(s) provide a convenient way of characterizing the minimum values of a function $g$. When $N = 1$, any point $v$ where

$$\frac{\mathrm{d}}{\mathrm{d}w} g(v) = 0$$

is a potential minimum.


Analogously, with general $N$-dimensional input, any $N$-dimensional point $\mathbf{v}$ where every partial derivative of $g$ is zero, that is

$$\begin{aligned}
\frac{\partial}{\partial w_1} g(\mathbf{v}) &= 0 \\
\frac{\partial}{\partial w_2} g(\mathbf{v}) &= 0 \\
&\;\;\vdots \\
\frac{\partial}{\partial w_N} g(\mathbf{v}) &= 0
\end{aligned}$$

is a *potential minimum*. This system of $N$ equations is naturally referred to as the *first-order system of equations*. We can write the first-order system more compactly using gradient notation as

$$\nabla g(\mathbf{v}) = \mathbf{0}_{N \times 1}.$$

The first-order optimality condition translates the problem of identifying a function's minimum points into the task of solving a system of $N$ first-order equations.

However, there are two problems with the first-order characterization of minima.

1. It is virtually impossible (with few exceptions) to solve a general function's first-order system of equations for 'closed form' solutions.
2. The first-order optimality condition characterizes not only minima, but also maxima and saddle points of a function.

Example 1: Finding points of zero derivative for single-input functions

func1 = lambda w: np.sin(2*w)


func2 = lambda w: w**3
func3 = lambda w: np.sin(3*w) + 0.1*w**2
show_stationary(func1 = func1,func2 = func2,func3 = func3)


It is not only global minima that have zero derivatives, but other points as well, e.g., local minima, local and global maxima, and saddle points.

Points having zero-valued derivative(s) are collectively referred to as stationary points or critical points.

First-order condition for optimality: Stationary points of a function $g$ (including minima, maxima, and saddle points) satisfy the first-order condition

$$\nabla g(\mathbf{v}) = \mathbf{0}_{N \times 1}$$

If a function is *convex* (e.g., a convex quadratic function), then any point at which the function satisfies the first-order condition must be a global minimum. A convex function has no maxima or saddle points.

Example 2: A simple-looking function whose global minimum is difficult to compute (algebraically)

$$g(w) = \frac{1}{50}\left(w^4 + w^2 + 10w\right)$$

w = np.linspace(-5,5,50)
g = lambda w: 1/50*(w**4 + w**2 + 10*w)
figure = plt.figure(figsize=(6,3))
plt.plot(w,g(w),linewidth=2,color='k')
plt.xlabel('$w$',fontsize=14)
plt.ylabel('$g(w)$',rotation=0,labelpad=15,fontsize=14)
plt.show()


Setting the first derivative to zero gives

$$\frac{\mathrm{d}}{\mathrm{d}w} g(w) = \frac{1}{50}\left(4w^3 + 2w + 10\right) = 0,$$

or equivalently $2w^3 + w + 5 = 0$. Of the solutions to this cubic equation, the one providing the minimum of the function $g$ is:

$$w = \frac{\sqrt[3]{\sqrt{2031} - 45}}{6^{2/3}} - \frac{1}{\sqrt[3]{6\left(\sqrt{2031} - 45\right)}}$$

which can be computed, after much toil, using centuries-old tricks developed for just such problems.
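As a sanity check on the closed-form root, the cubic $2w^3 + w + 5 = 0$ can also be solved numerically. The sketch below is only an illustration (it is not part of the original notebook); it uses NumPy's polynomial root finder and compares the single real root with the closed-form expression. The variable names are chosen here for illustration.

```python
import numpy as np

# Coefficients of 2w^3 + 0w^2 + 1w + 5, highest degree first.
roots = np.roots([2.0, 0.0, 1.0, 5.0])

# Keep only the (numerically) real roots.
real_roots = roots[np.abs(roots.imag) < 1e-9].real
w_numeric = real_roots[0]

# Closed-form expression for the real root.
s = np.sqrt(2031.0) - 45.0
w_closed = np.cbrt(s) / 6.0 ** (2.0 / 3.0) - 1.0 / np.cbrt(6.0 * s)

g = lambda w: (w**4 + w**2 + 10 * w) / 50.0
print(w_numeric, w_closed, g(w_numeric))
```

Both values should agree to many decimal places, at roughly $w \approx -1.235$.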

Example 3: Stationary points of a general multi-input quadratic function

Take the general multi-input quadratic function

$$g(\mathbf{w}) = a + \mathbf{b}^T \mathbf{w} + \mathbf{w}^T \mathbf{C} \mathbf{w}$$

where $\mathbf{C}$ is an $N \times N$ symmetric matrix, $\mathbf{b}$ is an $N \times 1$ vector, and $a$ is a scalar.

Computing the first derivative (gradient) we have

$$\nabla g(\mathbf{w}) = 2\mathbf{C}\mathbf{w} + \mathbf{b}$$

Setting this equal to zero gives a symmetric linear system of equations of the form

$$\mathbf{C}\mathbf{w} = -\frac{1}{2}\mathbf{b}$$

whose solutions are stationary points of the original function.
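For a concrete quadratic, this linear system can be handed to a standard solver. The following is a minimal sketch with made-up values for $a$, $\mathbf{b}$, and $\mathbf{C}$ (they are assumptions for illustration, not taken from the notebook); it solves $\mathbf{C}\mathbf{w} = -\frac{1}{2}\mathbf{b}$ and confirms that the gradient vanishes at the solution.

```python
import numpy as np

# Hypothetical example values, chosen only for illustration.
a = 1.0
b = np.array([1.0, -1.0])
C = np.array([[2.0, 1.0], [1.0, 2.0]])   # symmetric

g = lambda w: a + b @ w + w @ C @ w
grad_g = lambda w: 2.0 * C @ w + b

# Solve C w = -b / 2 for the stationary point.
w_star = np.linalg.solve(C, -0.5 * b)
print(w_star, grad_g(w_star))
```

Because this particular $\mathbf{C}$ is positive definite, the quadratic is convex and the stationary point found is its global minimum; for an indefinite $\mathbf{C}$ the same computation would return a saddle point instead.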

Coordinate descent and the first-order optimality condition

For a given cost function $g(\mathbf{w})$ taking in $N$-dimensional input, the stationary points (minima included) of this function are those satisfying the system of equations

$$\nabla g(\mathbf{v}) = \mathbf{0}_{N \times 1}$$

or written out one equation at a time as

$$\begin{aligned}
\frac{\partial}{\partial w_1} g(\mathbf{v}) &= 0 \\
\frac{\partial}{\partial w_2} g(\mathbf{v}) &= 0 \\
&\;\;\vdots \\
\frac{\partial}{\partial w_N} g(\mathbf{v}) &= 0.
\end{aligned}$$

Instead of solving the first-order system *simultaneously*, we can solve its equations *sequentially* in a coordinate-wise approach, provided that each individual equation

$$\frac{\partial}{\partial w_n} g(\mathbf{v}) = 0$$

can be solved in closed form.

• We first initialize at an input point $\mathbf{w}^0$, and begin by solving the first equation

$$\frac{\partial}{\partial w_1} g(\mathbf{w}^0) = 0$$

for the optimal first weight $w_1^{\star}$.
• We then update the first coordinate of the vector $\mathbf{w}^0$ with this solution, and call the updated set of weights $\mathbf{w}^1$.
• Continuing this pattern, to update the $n^{th}$ weight we solve

$$\frac{\partial}{\partial w_n} g(\mathbf{w}^{n-1}) = 0$$

for $w_n^{\star}$, and update the $n^{th}$ weight using this value, forming the updated set of weights $\mathbf{w}^n$.

After we sweep through all $N$ weights a single time we can refine our solution by sweeping through the weights again (as with any other coordinate-wise method). At the $k^{th}$ such sweep we update the $n^{th}$ weight by solving the single equation

$$\frac{\partial}{\partial w_n} g(\mathbf{w}^{N(k-1)+n-1}) = 0$$

and updating the $n^{th}$ weight of $\mathbf{w}^{N(k-1)+n-1}$, and so on.
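For the quadratic $g(\mathbf{w}) = a + \mathbf{b}^T\mathbf{w} + \mathbf{w}^T\mathbf{C}\mathbf{w}$ with symmetric $\mathbf{C}$, each coordinate-wise equation is linear in $w_n$ and has the closed-form solution $w_n = -\left(b_n + 2\sum_{m \neq n} C_{nm} w_m\right) / (2 C_{nn})$. The sketch below implements the sweeps described above under that assumption. The function name `coordinate_descent_quadratic_sketch` is hypothetical; it is not the notebook's own `coordinate_descent_for_quadratic` helper, whose implementation is not shown here.

```python
import numpy as np

def coordinate_descent_quadratic_sketch(a, b, C, w0, num_sweeps):
    """First-order coordinate descent for g(w) = a + b^T w + w^T C w.

    Assumes C is symmetric with nonzero diagonal entries.
    """
    b = np.asarray(b, dtype=float).ravel()
    C = np.asarray(C, dtype=float)
    w = np.asarray(w0, dtype=float).copy()
    g = lambda v: a + b @ v + v @ C @ v
    weight_history, cost_history = [w.copy()], [g(w)]
    for _ in range(num_sweeps):
        for n in range(len(w)):
            # dg/dw_n = b_n + 2 * sum_m C[n, m] * w_m = 0, solved for w_n
            off_diag = C[n] @ w - C[n, n] * w[n]
            w[n] = -(b[n] + 2.0 * off_diag) / (2.0 * C[n, n])
            weight_history.append(w.copy())
            cost_history.append(g(w))
    return weight_history, cost_history

# One sweep on g(w) = w_0^2 + w_1^2 + 2, starting from [3, 4].
weights, costs = coordinate_descent_quadratic_sketch(2.0, [0, 0], np.eye(2), [3, 4], 1)
print(weights[-1], costs[-1])
```

When $\mathbf{C}$ is diagonal the coordinates decouple, so a single sweep lands exactly on the minimum; with off-diagonal entries, several sweeps are needed.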

Example 4: Minimizing convex quadratic functions via first-order coordinate descent

We minimize the simple quadratic

$$g(w_0, w_1) = w_0^2 + w_1^2 + 2$$

which can be written in vector-matrix form as $g(\mathbf{w}) = a + \mathbf{b}^T \mathbf{w} + \mathbf{w}^T \mathbf{C} \mathbf{w}$, where $a = 2$, $\mathbf{b} = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, and $\mathbf{C} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$.

We initialize at $\mathbf{w} = \begin{bmatrix} 3 \\ 4 \end{bmatrix}$ and run 1 iteration of the algorithm.

plotter = Visualizer()

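$$g(w_0, w_1) = w_0^2 + w_1^2 + 2 = 2 + \begin{bmatrix} 0 & 0 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} + \begin{bmatrix} w_0 & w_1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$$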

a = 2
b = np.zeros((2,1))
C = np.eye(2)
# a quadratic function defined using the constants above
g = lambda w: (a + np.dot(b.T,w) + np.dot(np.dot(w.T,C),w))[0]
# initialization
w = np.array([3,4])
max_its = 1
weight_history,cost_history = coordinate_descent_for_quadratic(g,w,max_its,a,b,C)
plotter.two_input_contour_plot(g,weight_history,xmin = -1.5,xmax = 4.5,ymin = -1.5,ymax =


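$$g(w_0, w_1) = a + \mathbf{b}^T \mathbf{w} + \mathbf{w}^T \mathbf{C} \mathbf{w} = 20 + \begin{bmatrix} 0 & 0 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} + \begin{bmatrix} w_0 & w_1 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$$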

a = 20
b = np.zeros((2,1))
C = np.array([[2,1],[1,2]])
g = lambda w: (a + np.dot(b.T,w) + np.dot(np.dot(w.T,C),w))[0]
w = np.array([3,4])
max_its = 2
weight_history,cost_history = coordinate_descent_for_quadratic(g,w,max_its,a,b,C)
plotter.two_input_contour_plot(g,weight_history,xmin = -4.5,xmax = 4.5,ymin = -4.5,ymax =

Geometry of First-Order Taylor Series

• We discuss important characteristics of the *hyperplane*, including the directions of steepest ascent and steepest descent.
• A special hyperplane: the first-order Taylor series approximation to a function.

The anatomy of hyperplanes

A general $N$-dimensional hyperplane can be characterized as:

$$h(w_1, w_2, \ldots, w_N) = a + b_1 w_1 + b_2 w_2 + \cdots + b_N w_N$$

where $a, b_1, \ldots, b_N$ are scalar parameters. We can rewrite $h$ more compactly as:

$$h(\mathbf{w}) = a + \mathbf{b}^T \mathbf{w}$$

When $N = 1$, we have $h(w) = a + bw$, which is the formula for a one-dimensional line in a two-dimensional space whose input space (characterized by $w$) is one-dimensional.

For general $N$, $h(\mathbf{w}) = a + \mathbf{b}^T \mathbf{w}$ is an $N$-dimensional hyperplane in an $(N+1)$-dimensional space whose input space (characterized by $w_1, w_2, \ldots, w_N$) is $N$-dimensional.

Steepest ascent and descent directions

Capture.PNG

When $N = 1$, for any point $w^0$ in the input space, there are only 2 directions to move in. For a line with positive slope ($b > 0$):
• Moving to the right of $w^0$ increases the value of $h$, and hence it is an ascent direction.
• Moving to the left of $w^0$ decreases the value of $h$, and hence it is a descent direction.

When $N > 1$, there are infinitely many directions to move in: some provide ascent, some provide descent, and some preserve the value of $h$.

How can we find the direction that produces the largest ascent (or descent), commonly referred to as the direction of steepest ascent (or descent)?

To search for the direction of steepest ascent at a given point $\mathbf{w}^0$, we solve

$$\underset{\mathbf{d}}{\text{maximize}} \;\; h(\mathbf{w}^0 + \mathbf{d})$$

over all unit-length vectors $\mathbf{d}$.

Note that $h(\mathbf{w}^0 + \mathbf{d})$ can be written as:

$$a + \mathbf{b}^T(\mathbf{w}^0 + \mathbf{d}) = a + \mathbf{b}^T\mathbf{w}^0 + \mathbf{b}^T\mathbf{d}$$

where the first two terms are constant with respect to $\mathbf{d}$.


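Therefore, maximizing the value of $h(\mathbf{w}^0 + \mathbf{d})$ is equivalent to maximizing $\mathbf{b}^T\mathbf{d}$.

We have:

$$\mathbf{b}^T\mathbf{d} = \|\mathbf{b}\|_2 \|\mathbf{d}\|_2 \cos(\theta),$$

where $\|\mathbf{b}\|_2$ does not change with respect to $\mathbf{d}$, and $\|\mathbf{d}\|_2 = 1$.

Then the optimization problem becomes:

$$\underset{\theta}{\text{maximize}} \;\; \cos(\theta),$$

where $\theta$ is the angle between the vectors $\mathbf{b}$ and $\mathbf{d}$.

• Of all unit directions, $\mathbf{d} = \frac{\mathbf{b}}{\|\mathbf{b}\|_2}$ provides the steepest ascent (where $\theta = 0$ and $\cos(\theta) = 1$).

• Similarly, the unit direction $\mathbf{d} = \frac{-\mathbf{b}}{\|\mathbf{b}\|_2}$ provides the steepest descent (where $\theta = \pi$ and $\cos(\theta) = -1$).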
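The claim that $\mathbf{b}/\|\mathbf{b}\|_2$ maximizes $\mathbf{b}^T\mathbf{d}$ over unit vectors can be checked empirically. The sketch below samples many random unit-length directions for a made-up slope vector `b` (an assumption for illustration) and confirms that none of them beats the normalized direction.

```python
import numpy as np

rng = np.random.default_rng(0)
b = np.array([3.0, -1.0, 2.0])            # hypothetical hyperplane slopes

# Sample many random unit-length directions d.
d = rng.normal(size=(10000, 3))
d /= np.linalg.norm(d, axis=1, keepdims=True)

values = d @ b                             # b^T d for every sampled direction
steepest_ascent = b / np.linalg.norm(b)
best_value = b @ steepest_ascent           # equals the norm of b

print(values.max(), best_value, values.min())
```

The largest sampled value stays at or below $\|\mathbf{b}\|_2$ and the smallest stays at or above $-\|\mathbf{b}\|_2$, which are the values attained by the steepest ascent and descent directions.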

The gradient and the direction of steepest ascent/descent

• A function $g(\mathbf{w})$ can be approximated locally around a given point $\mathbf{w}^0$ by a hyperplane $h(\mathbf{w})$:

$$h(\mathbf{w}) = g(\mathbf{w}^0) + \nabla g(\mathbf{w}^0)^T(\mathbf{w} - \mathbf{w}^0)$$

which can be rewritten as $h(\mathbf{w}) = a + \mathbf{b}^T\mathbf{w}$ (as previously).

• Then, we have:

$$a = g(\mathbf{w}^0) - \nabla g(\mathbf{w}^0)^T\mathbf{w}^0 \quad \text{and} \quad \mathbf{b} = \nabla g(\mathbf{w}^0)$$

• This hyperplane is the first-order Taylor series approximation of $g$ at $\mathbf{w}^0$, and is tangent to $g$ at this point.

• Because $h$ is constructed to closely approximate $g$ near the point $\mathbf{w}^0$, its steepest ascent and descent directions also tell us the direction to travel to increase or decrease the value of the function $g$ itself near the point $\mathbf{w}^0$.
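To see the tangent hyperplane at work, the sketch below builds $a$ and $\mathbf{b}$ from the formulas above for an illustrative function (the function and the point `w0` are assumptions, not taken from the notebook) and takes one small step in the hyperplane's steepest descent direction.

```python
import numpy as np

g = lambda w: np.sin(w[0]) + w[1] ** 2                  # illustrative function
grad_g = lambda w: np.array([np.cos(w[0]), 2.0 * w[1]])

w0 = np.array([0.5, -1.0])
b = grad_g(w0)                        # hyperplane slopes
a = g(w0) - b @ w0                    # hyperplane offset
h = lambda w: a + b @ w

# A small step along the hyperplane's steepest descent direction -b / ||b||.
step = 1e-3 * (-b / np.linalg.norm(b))
change_in_g = g(w0 + step) - g(w0)
change_in_h = h(w0 + step) - h(w0)
print(change_in_g, change_in_h)
```

For a step this small the two changes are nearly identical and both negative, which is the sense in which the hyperplane's descent direction is also a descent direction for $g$.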

Gradient Descent

A local optimization method aims to find minima of a given function $g(\mathbf{w})$ by beginning at some point $\mathbf{w}^0$ and taking a number of steps $\mathbf{w}^1, \mathbf{w}^2, \ldots, \mathbf{w}^K$ of the form:

$$\mathbf{w}^k = \mathbf{w}^{k-1} + \alpha\mathbf{d}^k$$

where:

• $\mathbf{d}^k$ are descent directions
• $\alpha$ is the steplength parameter.

The negative gradient $-\nabla g(\mathbf{w})$ of a function $g(\mathbf{w})$ computed at a particular point defines a valid descent direction at that point (provided the gradient is nonzero there).

If we employ the negative gradient direction $\mathbf{d}^k = -\nabla g(\mathbf{w}^{k-1})$, the sequence of steps takes the form:

$$\mathbf{w}^k = \mathbf{w}^{k-1} - \alpha\nabla g(\mathbf{w}^{k-1})$$

The local optimization method with the above update step is called the gradient descent algorithm.

Capture.PNG

• Beginning at the initial point $w^0$, we make an approximation to $g(w)$ at the point $(w^0, g(w^0))$ with the first-order Taylor series approximation.
• Moving in the negative gradient direction provided by this approximation we arrive at $w^1 = w^0 - \alpha \frac{\mathrm{d}}{\mathrm{d}w} g(w^0)$.
• We repeat this process at $w^1$, moving in the negative gradient direction to $w^2 = w^1 - \alpha \frac{\mathrm{d}}{\mathrm{d}w} g(w^1)$.
• …

Gradient Descent

• Gradient descent generally scales better than the naive zero-order approaches (random search and coordinate search).
• The negative gradient direction provides a descent direction for the function locally.
• The descent directions provided via the gradients are easier to compute than seeking out a descent direction at random.

The gradient descent algorithm

Input: function $g$, steplength $\alpha$, maximum number of steps $K$, and initial point $\mathbf{w}^0$

for $k = 1 \ldots K$
    $\mathbf{w}^k = \mathbf{w}^{k-1} - \alpha\nabla g(\mathbf{w}^{k-1})$

Output: history of weights $\{\mathbf{w}^k\}_{k=0}^K$ and corresponding function evaluations $\{g(\mathbf{w}^k)\}_{k=0}^K$

• We can simply return the final set of weights $\mathbf{w}^K$ or the entire sequence of gradient descent steps $\{\mathbf{w}^k\}_{k=0}^K$.

• When does gradient descent stop?

• If the steplength $\alpha$ is chosen properly, the algorithm will stop near stationary points of the function, typically minima or saddle points.
• If the step $\mathbf{w}^k = \mathbf{w}^{k-1} - \alpha\nabla g(\mathbf{w}^{k-1})$ does not move from the prior point $\mathbf{w}^{k-1}$, then this can mean that the direction we are traveling in is vanishing, i.e., $-\nabla g(\mathbf{w}^{k-1}) \approx \mathbf{0}_{N\times 1}$. This is a stationary point of the function.
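The loop above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the gradient is supplied as an explicit function; the name `gradient_descent_sketch` is hypothetical, and it is not the `gradient_descent` helper called in the notebook's cells, whose signature takes only `g` and whose implementation is not shown here.

```python
import numpy as np

def gradient_descent_sketch(g, grad_g, alpha, max_its, w0):
    """Run max_its fixed-steplength gradient descent steps from w0."""
    w = np.asarray(w0, dtype=float)
    weight_history = [w.copy()]
    cost_history = [g(w)]
    for _ in range(max_its):
        w = w - alpha * grad_g(w)            # w^k = w^{k-1} - alpha * grad g(w^{k-1})
        weight_history.append(w.copy())
        cost_history.append(g(w))
    return weight_history, cost_history

# Illustrative run on g(w) = (w^4 + w^2 + 10w) / 50 with its hand-derived gradient.
g = lambda w: (w**4 + w**2 + 10 * w) / 50.0
grad_g = lambda w: (4 * w**3 + 2 * w + 10) / 50.0
weights, costs = gradient_descent_sketch(g, grad_g, alpha=1.0, max_its=25, w0=2.5)
print(weights[-1], costs[-1])
```

Returning both histories, rather than only the final point, is what makes cost function history plots possible later on.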

static_plotter = static_visualizer()
anim_plotter = anim_visualizer()

Example 1: A convex single input example

Use gradient descent to minimize the polynomial function

$$g(w) = \frac{1}{50}\left(w^4 + w^2 + 10w\right).$$

The global minimum of the function is located at:

$$w = \frac{\sqrt[3]{\sqrt{2031} - 45}}{6^{2/3}} - \frac{1}{\sqrt[3]{6\left(\sqrt{2031} - 45\right)}}$$

With gradient descent we can determine a point that is close to this one. The gradient of the function is:

$$\frac{\partial}{\partial w} g(w) = \frac{2}{25}w^3 + \frac{1}{25}w + \frac{1}{5}.$$

We initialize the gradient descent algorithm at $w^0 = 2.5$, use a constant steplength $\alpha = 1$, and run for 25 iterations.

g = lambda w: 1/float(50)*(w**4 + w**2 + 10*w) # try other functions too! Like g = lambda w: n
w = 2.5; alpha = 1; max_its = 25;
weight_history,cost_history = gradient_descent(g,alpha,max_its,w)
anim_plotter.gradient_descent(g,weight_history,savepath=video_path_1,fps=1)

# standard imports
import matplotlib.pyplot as plt
from IPython.display import Image, HTML
from base64 import b64encode
def show_video(video_path, width = 1000):
    # read the video file as raw bytes and embed it as a base64 data URL
    with open(video_path, "rb") as f:
        video_file = f.read()
    video_url = f"data:video/mp4;base64,{b64encode(video_file).decode()}"
    return HTML(f"""<video width={width} controls><source src="{video_url}"></video>""")

Example 2: A non-convex single input example

$$g(w) = \sin(3w) + 0.1w^2$$

• In order to find the global minimum of a function using gradient descent one may need to run it several times with different initializations and/or steplength schemes.
• We initialize two runs at $w^0 = 4.5$ and $w^0 = -1.5$.
• We use a fixed steplength of $\alpha = 0.05$ for all 10 iterations.

g = lambda w: np.sin(3*w) + 0.1*w**2


alpha = 0.05; w = 4.5; max_its = 10;
weight_history_1,cost_history_1 = gradient_descent(g,alpha,max_its,w)
alpha = 0.05; w = -1.5; max_its = 10;
weight_history_2,cost_history_2 = gradient_descent(g,alpha,max_its,w)
static_plotter.single_input_plot(g,[weight_history_1,weight_history_2],[cost_history_1,cost_histo

Depending on where we initialize, we may end up near a local or global minimum.

static_plotter.plot_cost_histories([cost_history_1,cost_history_2],start = 0,points = True

The cost function history plots can be used for debugging as well as for selecting proper values for the steplength $\alpha$.

Example 3: A convex multi-input example

$$g(w_1, w_2) = w_1^2 + w_2^2 + 2$$

The gradient of the function is:

$$\nabla g(\mathbf{w}) = \begin{bmatrix} 2w_1 \\ 2w_2 \end{bmatrix}$$

We run gradient descent with 10 steps using the steplength/learning rate value $\alpha = 0.2$.

g = lambda w: np.dot(w.T,w) + 2
w = np.array([1.5,2]); max_its = 10; alpha = 0.2;
weight_history,cost_history = gradient_descent(g,alpha,max_its,w)
static_plotter.two_input_surface_contour_plot(g,weight_history,num_contours = 25,view = [

• We can also plot the cost function history.

static_plotter.plot_cost_histories([cost_history],start = 0,points = True,labels = ['gradient des


• We can view the progress of the optimization run regardless of the dimension of the function we are minimizing.

Basic steplength choices for gradient descent

• We need to choose the steplength/learning rate parameter $\alpha$ carefully for gradient descent.

• The two most common choices are:

1. Using a fixed $\alpha$ value for each step: $\alpha=10^{\gamma}$, where $\gamma$ is (often) a negative integer.
2. Using a diminishing steplength, e.g., $\alpha=\frac{1}{k}$ at the $k^{th}$ step of a run.

• In general, we would like to choose the largest possible value for $\alpha$ that leads to proper convergence.

Example: Fixed steplength selection for a single input convex function

$$g(w) = w^2$$

demo_2d = grad_descent_visualizer_2d()
demo_3d = grad_descent_visualizer_3d()
video_path_2 = './video_2.mp4'

g = lambda w: w**2
w_init = -2.5
steplength_range = np.linspace(10**-5,1.5,150)
max_its = 5

demo_2d.animate_it(savepath=video_path_2,w_init = w_init, g = g, steplength_range = steplength_ra

Example: Fixed steplength selection for a multi-input non-convex function

$$g(w_1, w_2) = \sin(w_1)$$

g = lambda w: np.sin(w[0])
w_init = [1,0]; alpha_range = np.linspace(2*10**-4,5,200); max_its = 10; view = [10,120];
demo_3d.animate_it(savepath=video_path_3,g = g,w_init = w_init,alpha_range = alpha_range,max_its

Example: Comparing fixed and diminishing steplengths for a single input convex function

$$g(w) = |w|.$$

This function has a single global minimum at $w = 0$ and a derivative defined everywhere but at $w = 0$:

$$\frac{\mathrm{d}}{\mathrm{d}w} g(w) = \begin{cases} +1 & \text{if } w > 0 \\ -1 & \text{if } w < 0. \end{cases}$$

We make two runs of 20 steps of gradient descent, each initialized at the point $w^0 = 1.75$:

• The first run uses a fixed steplength rule of $\alpha = 0.5$.
• The second run uses the diminishing steplength rule $\alpha = \frac{1}{k}$.

g = lambda w: np.abs(w)
alpha_choice = 0.5; w = 1.75; max_its = 20;
weight_history_1,cost_history_1 = gradient_descent(g,alpha_choice,max_its,w)
alpha_choice = 'diminishing'; w = 1.75; max_its = 20;
weight_history_2,cost_history_2 = gradient_descent(g,alpha_choice,max_its,w)
static_plotter.single_input_plot(g,[weight_history_1,weight_history_2],[cost_history_1,cost_histo

• In this example, a diminishing steplength is necessary to reach a point close to the minimum of the function.
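The difference between the two rules on $g(w) = |w|$ can be reproduced with a few lines of plain NumPy. The sketch below is an illustration only: `run_abs_descent` is a hypothetical helper (not the notebook's `gradient_descent`), and it starts from $w^0 = 1.75$, the value used in the code cell above.

```python
import numpy as np

def run_abs_descent(w0, max_its, alpha_rule):
    """Gradient descent on g(w) = |w|; alpha_rule maps step k to a steplength."""
    w, history = float(w0), [float(w0)]
    for k in range(1, max_its + 1):
        w = w - alpha_rule(k) * np.sign(w)   # derivative of |w| away from w = 0
        history.append(float(w))
    return history

fixed = run_abs_descent(1.75, 20, lambda k: 0.5)
diminishing = run_abs_descent(1.75, 20, lambda k: 1.0 / k)
print(abs(fixed[-1]), abs(diminishing[-1]))
```

With the fixed rule the iterates reach $w = 0.25$ and then hop between $0.25$ and $-0.25$ forever, because every step has length exactly $0.5$. With the diminishing rule the hops shrink along with the steplength, so the run ends much closer to $w = 0$.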

static_plotter.plot_cost_histories([cost_history_1,cost_history_2],start = 0,points = True


Oscillation in the cost function history plot: not always a bad thing

• We often use the cost function history plot to tune the steplength parameter $\alpha$.
• It is not ultimately important that the plot be strictly decreasing (i.e., that the algorithm descends at every single step).
• It is critical to find a value of $\alpha$ that allows gradient descent to find the lowest possible function value.
• Sometimes, the best choice of $\alpha$ for a given minimization might cause gradient descent to move up and down.

Example
Minimize the function:

$$g(\mathbf{w}) = w_0^2 + w_1^2 + 2\sin(1.5(w_0 + w_1)) + 2$$

g = lambda w: w[0]**2 + w[1]**2 + 2*np.sin(1.5*(w[0] + w[1])) + 2


static_plotter.two_input_original_contour_plot(g,num_contours = 25,xmin = -4,xmax = 4, ymin =

• We make three runs that start at the same initial point $\mathbf{w}^0 = \begin{bmatrix} 3 \\ 3 \end{bmatrix}$ and take 10 steps each. Each run uses a different fixed steplength: the first run uses $\alpha = 10^{-2}$, the second run $\alpha = 10^{-1}$, and the third run $\alpha = 10^{0}$.

# first run
w = np.array([3.0,3.0]); max_its = 10;
alpha_choice = 10**(-2);
weight_history_1,cost_history_1 = gradient_descent(g,alpha_choice,max_its,w)

# second run
alpha_choice = 10**(-1);
weight_history_2,cost_history_2 = gradient_descent(g,alpha_choice,max_its,w)

# third run
alpha_choice = 10**(0);
weight_history_3,cost_history_3 = gradient_descent(g,alpha_choice,max_its,w)

histories = [weight_history_1,weight_history_2,weight_history_3]
static_plotter.two_input_contour_horiz_plots(g,histories,show_original=False,num_contours=

static_plotter.plot_cost_histories([cost_history_1,cost_history_2,cost_history_3],start =

• For run 1, $\alpha = 10^{-2}$ was too small to make meaningful progress in 10 steps.
• For run 2, $\alpha = 10^{-1}$, the algorithm descends at each step, but converges to the local minimum near $\begin{bmatrix} 1.5 \\ 1.5 \end{bmatrix}$.
• For run 3, $\alpha = 10^{0}$, the algorithm oscillates wildly, but reaches a point near the global minimum $\begin{bmatrix} -0.5 \\ -0.5 \end{bmatrix}$.

• This example was designed specifically for demonstration purposes. However, in practice, it is just fine for the cost function history of gradient descent to oscillate up and down.

Convergence behavior and steplength parameter selection

• When does gradient descent stop?

• If the steplength is chosen properly, the algorithm will halt near stationary points of a function, typically minima or saddle points.
• If the step $\mathbf{w}^k = \mathbf{w}^{k-1} - \alpha\nabla g(\mathbf{w}^{k-1})$ does not move from the prior point $\mathbf{w}^{k-1}$ significantly, then, for a given steplength, this means that the direction we are traveling in is vanishing, i.e., $-\nabla g(\mathbf{w}^{k-1}) \approx \mathbf{0}_{N\times 1}$.

• We can wait for gradient descent to get sufficiently close to a stationary point, i.e., until $\|\nabla g(\mathbf{w}^{k-1})\|_2$ is sufficiently small.
• Or we can stop when steps no longer make sufficient progress, i.e., $\frac{1}{N}\|\mathbf{w}^k - \mathbf{w}^{k-1}\|_2 < \epsilon$.
• Or when corresponding evaluations no longer differ substantially, i.e., $\frac{1}{N}\left|g(\mathbf{w}^k) - g(\mathbf{w}^{k-1})\right| < \epsilon$.

• A practical way to halt gradient descent is to simply run the algorithm for a fixed maximum number of iterations.
• This is typically set manually / heuristically depending on computing resources, domain knowledge, and the choice of the steplength parameter $\alpha$.
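Two of the stopping rules listed above can be combined with a maximum iteration count as follows. This is a sketch under stated assumptions: the function name, the tolerance `tol`, and the choice to return the number of steps taken are all illustrative and not part of the notebook.

```python
import numpy as np

def gradient_descent_with_stopping(grad_g, alpha, max_its, w0, tol=1e-6):
    """Gradient descent that halts early on a small gradient or a small step."""
    w = np.asarray(w0, dtype=float)
    for k in range(1, max_its + 1):
        grad = grad_g(w)
        if np.linalg.norm(grad) < tol:                  # gradient magnitude test
            return w, k - 1
        w_new = w - alpha * grad
        if np.linalg.norm(w_new - w) / w.size < tol:    # step-size test
            return w_new, k
        w = w_new
    return w, max_its                                   # iteration budget exhausted

grad_g = lambda w: 2.0 * w                              # gradient of w_1^2 + w_2^2 + 2
w_final, steps_taken = gradient_descent_with_stopping(grad_g, 0.2, 1000, [1.5, 2.0])
print(w_final, steps_taken)
```

On this convex quadratic the run stops well before the 1000-step budget, since each step shrinks the weights by a constant factor and the step length soon falls below the tolerance.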

Natural Weaknesses of Gradient Descent

static_plotter = static_visualizer()
demo = grad_descent_visualizer()

Natural Weaknesses of Gradient Descent

• Gradient descent is a local optimization method that employs the negative gradient direction at each step.

• The negative gradient direction is a true descent direction and is often cheap to compute.

image.png

• Like any vector, the negative gradient consists of a direction and a magnitude.
• Depending on the function being minimized, these attributes can pose different challenges when using the negative gradient as a descent direction.

The negative gradient direction

• A fundamental property of the (negative) gradient direction is that it always points perpendicular to the contours of a function.
• The gradient ascent/descent direction at an input $\mathbf{w}^0$ is always perpendicular to the contour $g(\mathbf{w}) = g(\mathbf{w}^0)$.

Example 1. Gradient descent directions on the contour plot of a quadratic function

$$g(\mathbf{w}) = w_0^2 + w_1^2 + 2$$


g = lambda w: w[0]**2 + w[1]**2 + 2


pts = np.array([[ 4.24698761, 1.39640246, -3.75877989],
[-0.49560712, 3.22926095, -3.65478083]])
illustrate_gradients(g,pts);

Example 2. Gradient descent directions on the contour plot of a wavy function

$$g(\mathbf{w}) = w_0^2 + w_1^2 + 2\sin(1.5(w_0 + w_1)) + 2.$$

g = lambda w: w[0]**2 + w[1]**2 + 2*np.sin(1.5*(w[0] + w[1])) + 2


pts = np.array([[ 4.24698761, 1.39640246, -3.75877989],
[-0.49560712, 3.22926095, -3.65478083]])
illustrate_gradients(g,pts)

Example 3. Gradient descent directions on the contour plot of a standard non-convex test function

g = lambda w: (w[0]**2 + w[1] - 11)**2 + (w[0] + w[1]**2 - 7)**2


pts = np.array([[ 2.2430266 , -1.06962305, -1.60668751],
[-0.57717812, 1.38128471, -1.61134124]])
illustrate_gradients(g,pts)

The (negative) gradient direction points perpendicular to the contours of any function

• If we suppose $g(\mathbf{w})$ is a differentiable function and $\mathbf{a}$ is some input point, then $\mathbf{a}$ lies on the contour defined by all those points where $g(\mathbf{w}) = g(\mathbf{a}) = c$ for some constant $c$.

• If we take another point $\mathbf{b}$ from this contour very close to $\mathbf{a}$, the first-order Taylor series approximation gives $g(\mathbf{b}) \approx g(\mathbf{a}) + \nabla g(\mathbf{a})^T(\mathbf{b} - \mathbf{a})$. Since $g(\mathbf{b}) = g(\mathbf{a})$, the vector $\mathbf{a} - \mathbf{b}$ is essentially perpendicular to the gradient $\nabla g(\mathbf{a})$, and

$$\nabla g(\mathbf{a})^T(\mathbf{a} - \mathbf{b}) = 0$$

essentially defines the line in the input space whose normal vector is precisely $\nabla g(\mathbf{a})$.

• So indeed both the ascent and descent directions defined by the gradient (i.e., the positive and negative gradient directions) of $g$ at $\mathbf{a}$ are perpendicular to the contour there.

• And since $\mathbf{a}$ was an arbitrary input of $g$, the same argument holds for each of its inputs.
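The perpendicularity argument can be checked numerically on the quadratic $g(\mathbf{w}) = w_0^2 + w_1^2 + 2$, whose contours are circles centered at the origin. The sketch below takes one of the points used in the examples above, moves a tiny distance along its contour by rotating it, and measures the cosine of the angle between the gradient and the displacement. The rotation angle is an arbitrary small value chosen for illustration.

```python
import numpy as np

g = lambda w: w[0] ** 2 + w[1] ** 2 + 2
grad_g = lambda w: 2.0 * w

a = np.array([4.24698761, -0.49560712])

# The contour through a is a circle, so rotating a slightly stays on it.
angle = 1e-4
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
b = rotation @ a

diff = a - b
cosine = grad_g(a) @ diff / (np.linalg.norm(grad_g(a)) * np.linalg.norm(diff))
print(g(a) - g(b), cosine)
```

The cosine is close to zero, and it shrinks further as the rotation angle is made smaller, matching the statement that $\mathbf{a} - \mathbf{b}$ becomes perpendicular to $\nabla g(\mathbf{a})$ as $\mathbf{b}$ approaches $\mathbf{a}$.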

1. The 'zig-zagging' behavior of gradient descent

• In practice the fact that the negative gradient always points perpendicular to the contours of a function can, depending on the function being minimized, make the negative gradient direction oscillate rapidly or zig-zag during a run of gradient descent.

• This in turn can cause zig-zagging behavior in the gradient descent steps themselves.

• Too much zig-zagging slows minimization progress, and many gradient descent steps are required to adequately minimize a function.

Example 4. Zig-zagging behavior of gradient descent on three simple quadratic functions

We consider three $N = 2$ dimensional quadratics that take the general form

$$g(\mathbf{w}) = a + \mathbf{b}^T\mathbf{w} + \mathbf{w}^T\mathbf{C}\mathbf{w}.$$

The constants $a$ and $\mathbf{b}$ are set to zero, and the matrix $\mathbf{C}$ is set as follows:

• the first quadratic (shown in the top panel below) has $\mathbf{C} = \begin{bmatrix} 0.5 & 0 \\ 0 & 9.75 \end{bmatrix}$
• the second quadratic (shown in the middle panel below) has $\mathbf{C} = \begin{bmatrix} 0.1 & 0 \\ 0 & 9.75 \end{bmatrix}$
• the third quadratic (shown in the bottom panel below) has $\mathbf{C} = \begin{bmatrix} 0.01 & 0 \\ 0 & 9.75 \end{bmatrix}$

The three quadratics differ only in the *top left entry* of their $\mathbf{C}$ matrix. As we decrease this single value of $\mathbf{C}$ we *elongate* the contours significantly along the horizontal axis.

We run 25 gradient descent steps with initialization at $\mathbf{w}^0 = \begin{bmatrix} 10 \\ 1 \end{bmatrix}$ and $\alpha = 10^{-1}$.

a1 = 0
b1 = 0*np.ones((2,1))
C1 = np.array([[0.5,0],[0,9.75]])
g1 = lambda w: (a1 + np.dot(b1.T,w) + np.dot(np.dot(w.T,C1),w))[0]
w = np.array([10.0,1.0]); max_its = 25; alpha_choice = 10**(-1);
weight_history_1,cost_history_1 = gradient_descent(g1,alpha_choice,max_its,w)

a2 = 0

b2 = 0*np.ones((2,1))
C2 = np.array([[0.1,0],[0,9.75]])
g2 = lambda w: (a2 + np.dot(b2.T,w) + np.dot(np.dot(w.T,C2),w))[0]
weight_history_2,cost_history_2 = gradient_descent(g2,alpha_choice,max_its,w)

a3 = 0
b3 = 0*np.ones((2,1))
C3 = np.array([[0.01,0],[0,9.75]])
g3 = lambda w: (a3 + np.dot(b3.T,w) + np.dot(np.dot(w.T,C3),w))[0]
weight_history_3,cost_history_3 = gradient_descent(g3,alpha_choice,max_its,w)

histories = [weight_history_1,weight_history_2,weight_history_3]
gs = [g1,g2,g3]

static_plotter.two_input_contour_vert_plots(gs,histories,num_contours = 20,xmin = -1,xmax =

The 'zig-zagging' behavior of gradient descent

The zig-zagging behavior of gradient descent in each of the cases above is completely due to the rapid change in negative gradient direction during each run.

static_plotter.plot_grad_directions_v2(weight_history_1)

static_plotter.plot_grad_directions_v2(weight_history_2)

static_plotter.plot_grad_directions_v2(weight_history_3)

The slow convergence caused by zig-zagging in each case can be seen in the slow decrease of
the cost function history associated with each run.

static_plotter.plot_cost_histories([cost_history_1,cost_history_2,cost_history_3],start =

• We can ameliorate this zig-zagging behavior by *reducing the steplength value* α.

• However, this does not solve the underlying problem that zig-zagging produces, which is slow convergence.
• Typically, ameliorating or even eliminating zig-zagging this way requires a very small
steplength, which leads back to the fundamental problem of slow convergence.
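
The trade-off can be checked numerically. The following is a minimal sketch, not the notebook's own code: it assumes the closed-form gradient $2C\mathbf{w}$ of the first quadratic, uses 25 steps for both runs so they are directly comparable, and counts a "zig-zag" as a sign flip in the second coordinate. The helper names `cost` and `run` are hypothetical.

```python
C_DIAG = (0.5, 9.75)  # first quadratic: g(w) = w^T C w

def cost(w):
    return sum(c * wi * wi for c, wi in zip(C_DIAG, w))

def run(alpha, steps=25, w=(10.0, 1.0)):
    """Gradient descent on g; returns the final cost and the number of
    steps at which the second coordinate changed sign."""
    flips = 0
    for _ in range(steps):
        # Closed-form gradient 2 C w (assumed, valid for symmetric C).
        new_w = tuple(wi - alpha * 2.0 * c * wi for c, wi in zip(C_DIAG, w))
        if new_w[1] * w[1] < 0:
            flips += 1
        w = new_w
    return cost(w), flips

cost_large, flips_large = run(alpha=0.1)
cost_small, flips_small = run(alpha=0.01)
print("alpha = 0.1 :", cost_large, "sign flips:", flips_large)
print("alpha = 0.01:", cost_small, "sign flips:", flips_small)
```

The smaller steplength removes the sign flips, but after the same number of steps its cost is still far higher than that of the zig-zagging run.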

a1 = 0
b1 = 0*np.ones((2,1))
C1 = np.array([[0.5,0],[0,9.75]])
g1 = lambda w: (a1 + np.dot(b1.T,w) + np.dot(np.dot(w.T,C1),w))[0]
w = np.array([10.0,1.0]); max_its = 15; alpha_choice = 10**(-2);
weight_history_1,cost_history_1 = gradient_descent(g1,alpha_choice,max_its,w)
static_plotter.two_input_contour_plot(g1,weight_history_1,show_original = False,num_contours =

2. The (negative) gradient magnitude vanishes near stationary points

• As we know from the first-order condition for optimality, the (negative) gradient vanishes
at stationary points.

• That is, if $\mathbf{w}$ is a minimum, maximum, or saddle point, then $\nabla g(\mathbf{w}) = \mathbf{0}$.

• The magnitude of the gradient therefore vanishes at stationary points, that is, $\|\nabla g(\mathbf{w})\|_2 = 0$.

• By extension, the (negative) gradient at points near a stationary point has a well-defined
direction but vanishing magnitude, i.e., $\|\nabla g(\mathbf{w})\|_2 \approx 0$.

 The slow-crawling behavior of gradient descent

• Because the magnitude of the negative gradient vanishes near stationary points,
gradient descent steps progress very slowly (or crawl) near stationary points.

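• Consider the general local optimization step:

$$\mathbf{w}^{k} = \mathbf{w}^{k-1} + \alpha \mathbf{d}^{k-1}$$

We saw that if $\mathbf{d}^{k-1}$ is a unit-length descent direction found by any zero-order search approach,
then the distance traveled with this step equals precisely the steplength value $\alpha$, since

$$\left\|\mathbf{w}^{k} - \mathbf{w}^{k-1}\right\|_2 = \left\|\left(\mathbf{w}^{k-1} + \alpha \mathbf{d}^{k-1}\right) - \mathbf{w}^{k-1}\right\|_2 = \alpha \left\|\mathbf{d}^{k-1}\right\|_2 = \alpha.$$

Again, the key assumption here was that our descent direction $\mathbf{d}^{k-1}$ had unit length.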

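• However, for gradient descent, our descent direction $\mathbf{d}^{k-1} = -\nabla g(\mathbf{w}^{k-1})$ is not
guaranteed to have unit length.

• We travel a distance proportional to the magnitude of the gradient, namely $\alpha \left\|\nabla g(\mathbf{w}^{k-1})\right\|_2$:

$$\left\|\mathbf{w}^{k} - \mathbf{w}^{k-1}\right\|_2 = \left\|\left(\mathbf{w}^{k-1} - \alpha \nabla g(\mathbf{w}^{k-1})\right) - \mathbf{w}^{k-1}\right\|_2 = \alpha \left\|\nabla g(\mathbf{w}^{k-1})\right\|_2.$$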
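
The two distance formulas can be compared numerically. This is a small sketch under stated assumptions: it reuses the quadratic $g(\mathbf{w}) = \mathbf{w}^T C \mathbf{w}$ with $C = \mathrm{diag}(0.5, 9.75)$ from the zig-zag example, takes its gradient in closed form as $2C\mathbf{w}$, and the helper names (`grad`, `norm`, `step_lengths`) and test points are illustrative only.

```python
import math

ALPHA = 0.1
C_DIAG = (0.5, 9.75)  # quadratic reused from the zig-zag example (assumption)

def grad(w):
    # Closed-form gradient 2 C w of g(w) = w^T C w.
    return tuple(2.0 * c * wi for c, wi in zip(C_DIAG, w))

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def step_lengths(w):
    g = grad(w)
    g_norm = norm(g)
    # Standard gradient descent step: d = -grad g(w).
    w_std = tuple(wi - ALPHA * gi for wi, gi in zip(w, g))
    # Unit-length direction: d = -grad g(w) / ||grad g(w)||_2.
    w_unit = tuple(wi - ALPHA * gi / g_norm for wi, gi in zip(w, g))
    moved_std = norm(tuple(a - b for a, b in zip(w_std, w)))
    moved_unit = norm(tuple(a - b for a, b in zip(w_unit, w)))
    return g_norm, moved_std, moved_unit

# Points progressively closer to the stationary point at the origin.
results = [step_lengths(w) for w in [(10.0, 1.0), (1.0, 0.1), (0.01, 0.001)]]
for g_norm, moved_std, moved_unit in results:
    print(f"||grad|| = {g_norm:.3e}  standard step = {moved_std:.3e}  unit step = {moved_unit:.3e}")
```

The unit-length step always covers a distance of $\alpha$, while the standard step shrinks together with the gradient magnitude as the points approach the origin.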

 Example 5. Slow-crawling behavior of GD near the minimum of a function


$$g(w) = w^4 + 0.1$$
• We initialize far from the minimum and set the steplength $\alpha = 10^{-1}$.

g = lambda w: w**4 + 0.1


w = -1.0; max_its = 10; alpha_choice = 10**(-1);
weight_history,cost_history = gradient_descent(g,alpha_choice,max_its,w)
static_plotter.single_input_plot(g,[weight_history],[cost_history],wmin = -1.1,wmax = 1.1

• Gradient descent crawls as it approaches the minimum because the magnitude of the
gradient vanishes there.
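
The shrinking steps can be tabulated directly. The sketch below repeats the run from this example (start at $w = -1$, $\alpha = 10^{-1}$, 10 steps) but, as an assumption, uses the hand-derived derivative $g'(w) = 4w^3$ rather than the notebook's `gradient_descent` helper.

```python
ALPHA = 0.1
MAX_ITS = 10

def g_prime(w):
    # Derivative of g(w) = w**4 + 0.1, written out by hand.
    return 4.0 * w ** 3

weights = [-1.0]
for _ in range(MAX_ITS):
    w = weights[-1]
    weights.append(w - ALPHA * g_prime(w))

# Distance covered by each step: |w^k - w^(k-1)| = alpha * |g'(w^(k-1))|.
steps = [abs(b - a) for a, b in zip(weights, weights[1:])]
for k, (w, s) in enumerate(zip(weights[1:], steps), start=1):
    print(f"step {k:2d}: w = {w:+.4f}, distance moved = {s:.4f}")
```

Each step is shorter than the one before it, and after 10 steps the iterate is still well short of the minimum at $w = 0$.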

 Example 6. Slow-crawling behavior of GD near saddle points

Consider the non-convex function

$$g(w) = \max\left(0, (3w - 2.3)^3 + 1\right)^2 + \max\left(0, (-3w + 0.7)^3 + 1\right)^2$$

which has a minimum at $w = \frac{1}{2}$ and saddle points at $w = \frac{7}{30}$ and $w = \frac{23}{30}$.

We make a run of gradient descent on this function using 50 steps with $\alpha = 10^{-2}$, initialized
such that it approaches one of these saddle points and so slows to a halt.
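
As a quick check of these claims, the sketch below evaluates a hand-derived derivative of $g$ at the three stated stationary points and then repeats the run from $w = 0$. The derivative formula is my own application of the chain rule to the expression above, not something taken from the notebook, and the name `g_prime` is hypothetical.

```python
def g_prime(w):
    # Chain rule on max(0, h(w))**2 gives 2 * max(0, h(w)) * h'(w),
    # with h'(w) = 9 * (3w - 2.3)**2 and -9 * (-3w + 0.7)**2 respectively.
    a = max(0.0, (3.0 * w - 2.3) ** 3 + 1.0)
    b = max(0.0, (-3.0 * w + 0.7) ** 3 + 1.0)
    return 18.0 * a * (3.0 * w - 2.3) ** 2 - 18.0 * b * (-3.0 * w + 0.7) ** 2

# The derivative should vanish at the minimum and at both saddle points.
stationary_points = [7.0 / 30.0, 0.5, 23.0 / 30.0]
derivs = [g_prime(w) for w in stationary_points]
print(derivs)

# Gradient descent from w = 0 with the settings quoted in the text.
alpha, max_its = 0.01, 50
w = 0.0
for _ in range(max_its):
    w = w - alpha * g_prime(w)
print("final w:", w, " saddle point at:", 7.0 / 30.0, " minimum at: 0.5")
```

The run ends just short of the saddle point at $w = \frac{7}{30}$ and never gets near the minimum at $w = \frac{1}{2}$.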

g = lambda w: np.maximum(0,(3*w - 2.3)**3 + 1)**2 + np.maximum(0, (-3*w + 0.7)**3 + 1)**2

demo.draw_2d(g=g, w_inits = [0],steplength = 0.01,max_its = 50,version = 'unnormalized',wmin =

 Example 7. Slow-crawling behavior of GD in large flat regions of a function



$$g(w_0, w_1) = \tanh(4w_0 + 4w_1) + \max\left(1, 0.4w_0^2\right) + 1$$

• We run gradient descent starting at the point $\mathbf{w}^0 = \begin{bmatrix} 2 \\ 2 \end{bmatrix}$, in a long narrow valley.

• Because the magnitude of the gradient is almost zero here, we cannot make much progress
even when employing 1000 steps of gradient descent with a steplength $\alpha = 10^{-1}$.
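
To see how small the gradient is in this region, the sketch below evaluates a hand-derived gradient at the initialization used in the notebook cell, `w = [1.0, 2.0]`, and measures how far 1000 steps move the iterate. The gradient formula is an assumption derived from the expression for $g$ (using $\frac{d}{dx}\tanh(x) = 1/\cosh^2(x)$), and the function name `grad` is hypothetical.

```python
import math

def grad(w0, w1):
    # d/dx tanh(x) = 1 / cosh(x)**2; the inner function 4*w0 + 4*w1 adds a factor 4.
    d_tanh = 4.0 / math.cosh(4.0 * w0 + 4.0 * w1) ** 2
    # max(1, 0.4*w0**2) is constant (zero slope) wherever 0.4*w0**2 < 1.
    d_max = 0.8 * w0 if 0.4 * w0 ** 2 > 1.0 else 0.0
    return (d_tanh + d_max, d_tanh)

alpha, max_its = 0.1, 1000
start = (1.0, 2.0)  # initialization taken from the notebook cell
grad_norm = math.hypot(*grad(*start))

w = start
for _ in range(max_its):
    g0, g1 = grad(*w)
    w = (w[0] - alpha * g0, w[1] - alpha * g1)

moved = math.hypot(w[0] - start[0], w[1] - start[1])
print("gradient norm at start:", grad_norm)
print("total distance moved after 1000 steps:", moved)
```

Both numbers are tiny, which is consistent with the run appearing frozen in the plot.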

g = lambda w: np.tanh(4*w[0] + 4*w[1]) + max(0.4*w[0]**2,1) + 1


w = np.array([1.0,2.0]); max_its = 1000; alpha_choice = 10**(-1);
weight_history_1,cost_history_1 = gradient_descent(g,alpha_choice,max_its,w)
static_plotter.two_input_surface_contour_plot(g,weight_history_1,view = [20,300],num_contours =
