First-Order Methods
We have already considered zero-order methods for finding a descent direction at each step of a local optimization method; first-order methods instead use the gradient to construct these directions.

Recall that a local optimization method is one where we aim to find minima of a given function by beginning at some point $\mathbf{w}^0$ and taking a number of steps $\mathbf{w}^1, \mathbf{w}^2, \mathbf{w}^3, \ldots, \mathbf{w}^K$ of the generic form
$$\mathbf{w}^k = \mathbf{w}^{k-1} + \alpha \mathbf{d}^k$$
where:
• $\mathbf{d}^k$ are direction vectors (which ideally are descent directions that lead us to lower and lower parts of a function)
• $\alpha$ is called the steplength parameter.
• Because the tangent hyperplane approximates the function locally, the descent direction of the tangent hyperplane (or the tangent line) is also a descent direction for the function itself.
The first-order optimality condition describes how any differentiable function's first derivative(s) behave at its minima.
The values of a function's first-order derivative(s) provide a convenient way of characterizing the minima of a function $g$. When $N = 1$, any point $v$ where
$$\frac{d}{dw}g(v) = 0$$
is a potential minimum.
Analogously, for general $N$-dimensional input, any $N$-dimensional point $\mathbf{v}$ where every partial derivative of $g$ is zero, that is
$$\begin{aligned}
\frac{\partial}{\partial w_1}g(\mathbf{v}) &= 0 \\
\frac{\partial}{\partial w_2}g(\mathbf{v}) &= 0 \\
&\ \ \vdots \\
\frac{\partial}{\partial w_N}g(\mathbf{v}) &= 0
\end{aligned}$$
is a *potential minimum*. This system of $N$ equations is naturally referred to as the *first-order system of equations*. We can write the first-order system more compactly using gradient notation as
$$\nabla g(\mathbf{v}) = \mathbf{0}_{N\times 1}.$$
The first-order optimality condition translates the problem of identifying a function's minimum points into the task of solving a system of $N$ first-order equations.
However, there are two problems with the first-order characterization of minima:
1. It is virtually impossible (with few exceptions) to solve a general function's first-order system of equations in 'closed form'.
2. The first-order optimality condition characterizes not only minima, but also maxima and saddle points of a function.
It is not only global minima that have zero derivatives, but other points, e.g., local minima, local
and global maxima, and saddle points, as well.
$$\nabla g(\mathbf{v}) = \mathbf{0}_{N\times 1}$$
If a function is *convex* (e.g., a quadratic function), then any point at which it satisfies the first-order condition must be a global minimum, since a convex function has no maxima or saddle points.
$$g(w) = \frac{1}{50}\left(w^4 + w^2 + 10w\right)$$
# imports (if not already loaded earlier in the notebook)
import numpy as np
import matplotlib.pyplot as plt

# plot the quartic cost function defined above
w = np.linspace(-5,5,50)
g = lambda w: 1/50*(w**4 + w**2 + 10*w)
figure = plt.figure(figsize=(6,3))
plt.plot(w,g(w),linewidth=2,color='k')
plt.xlabel('$w$',fontsize=14)
plt.ylabel('$g(w)$',rotation=0,labelpad=15,fontsize=14)
plt.show()
Setting the first derivative of $g$ to zero gives
$$\frac{d}{dw}g(w) = \frac{1}{50}\left(4w^3 + 2w + 10\right) = 0$$
or, equivalently,
$$2w^3 + w + 5 = 0,$$
which can be solved - after much toil - using centuries-old tricks developed for just such problems.
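As a quick aside (not part of the original notebook), we can corroborate this numerically by handing the cubic's coefficients to NumPy:

```python
import numpy as np

# numerically find the real root of 2w^3 + w + 5 = 0, the stationary point
# of g(w) = (w^4 + w^2 + 10w)/50
roots = np.roots([2, 0, 1, 5])
real_root = roots.real[np.abs(roots.imag) < 1e-10]
print(real_root)   # approximately -1.2347
```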
As another example, take the general multi-input quadratic function
$$g(\mathbf{w}) = a + \mathbf{b}^T\mathbf{w} + \mathbf{w}^T\mathbf{C}\mathbf{w}$$
where $\mathbf{C}$ is an $N \times N$ symmetric matrix, $\mathbf{b}$ is an $N \times 1$ vector, and $a$ is a scalar. Its gradient is
$$\nabla g(\mathbf{w}) = 2\mathbf{C}\mathbf{w} + \mathbf{b}.$$
Setting this equal to zero gives a symmetric linear system of equations of the form
$$\mathbf{C}\mathbf{w} = -\frac{1}{2}\mathbf{b}$$
whose solutions are stationary points of the original function.
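As a small illustration (with arbitrarily chosen values of $\mathbf{C}$, $\mathbf{b}$, and $a$, not taken from the notebook), the stationary point of such a quadratic can be found with a single linear solve:

```python
import numpy as np

# stationary point of g(w) = a + b^T w + w^T C w via the linear system C w = -b/2
C = np.array([[2.0, 1.0],
              [1.0, 2.0]])               # symmetric, positive definite
b = np.array([[1.0], [-1.0]])
w_star = np.linalg.solve(C, -b / 2)
print(w_star.flatten())                  # [-0.5, 0.5]
print((2 * C @ w_star + b).flatten())    # gradient at w_star: [0., 0.]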
For a given cost function g (w) taking in N dimensional input, the stationary points (minima
included) of this function are those satisfying the system of equations
$$\nabla g(\mathbf{v}) = \mathbf{0}_{N\times 1}$$
or, written out one equation at a time,
$$\frac{\partial}{\partial w_1}g(\mathbf{v}) = 0, \quad \frac{\partial}{\partial w_2}g(\mathbf{v}) = 0, \quad \ldots, \quad \frac{\partial}{\partial w_N}g(\mathbf{v}) = 0.$$
Instead of solving the first-order system *simultaneously*, we can solve its equations *sequentially* in a coordinate-wise fashion, provided each single equation
$$\frac{\partial}{\partial w_n}g(\mathbf{v}) = 0$$
can be solved in closed form.
• We first initialize at an input point $\mathbf{w}^0$, and begin by solving the first equation
$$\frac{\partial}{\partial w_1}g\left(\mathbf{w}^0\right) = 0$$
for the optimal first weight $w_1^\star$.
• We then update the first coordinate of the vector $\mathbf{w}^0$ with this solution, and call the updated set of weights $\mathbf{w}^1$.
After we sweep through all $N$ weights a single time we can refine our solution by sweeping through the weights again (as with any other coordinate-wise method). At the $k^{th}$ such sweep we update the $n^{th}$ weight by solving the single equation
$$\frac{\partial}{\partial w_n}g\left(\mathbf{w}^{k+n-1}\right) = 0$$
and update the $n^{th}$ weight of $\mathbf{w}^{k+n-1}$ with the solution, and so on.
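The examples below call a `coordinate_descent_for_quadratic` helper from the course's library, which is not shown in this notebook. A minimal sketch of what such a routine might look like for quadratic functions (an assumption about its behavior, not the library's actual code):

```python
import numpy as np

def coordinate_descent_for_quadratic(g, w, max_its, a, b, C):
    # Sketch of the helper called below (the course library's version may differ):
    # exact coordinate descent on g(w) = a + b^T w + w^T C w with symmetric C.
    # a is unused here (the constant term does not affect the minimizer).
    w = np.asarray(w, dtype=float).flatten()
    N = w.size
    weight_history, cost_history = [w.copy()], [g(w)]
    for k in range(max_its):          # one iteration = one sweep over all N coordinates
        for n in range(N):
            # first-order equation in coordinate n: b_n + 2*sum_m C[n,m]*w_m = 0,
            # solved exactly for w_n with all other coordinates held fixed
            rest = C[n, :] @ w - C[n, n] * w[n]
            w[n] = -(b[n, 0] + 2.0 * rest) / (2.0 * C[n, n])
            weight_history.append(w.copy())
            cost_history.append(g(w))
    return weight_history, cost_history
```

Because the cost is quadratic, each coordinate equation is linear and can be solved exactly, which is what makes this sequential scheme attractive here.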
Coordinate descent

We initialize at $\mathbf{w}^0 = \begin{bmatrix} 3 \\ 4 \end{bmatrix}$ and run 1 iteration of the algorithm.
plotter = Visualizer()
$$g(w_1, w_2) = w_1^2 + w_2^2 + 2 = 2 + \begin{bmatrix} 0 & 0 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} + \begin{bmatrix} w_1 & w_2 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}$$
a = 2
b = np.zeros((2,1))
C = np.eye(2)
# a quadratic function defined using the constants above
g = lambda w: (a + np.dot(b.T,w) + np.dot(np.dot(w.T,C),w))[0]
# initialization
w = np.array([3,4])
max_its = 1
weight_history,cost_history = coordinate_descent_for_quadratic(g,w,max_its,a,b,C)
plotter.two_input_contour_plot(g,weight_history,xmin = -1.5,xmax = 4.5,ymin = -1.5,ymax =
Next we run 2 iterations of the algorithm on the quadratic
$$g(w_1, w_2) = a + \mathbf{b}^T\mathbf{w} + \mathbf{w}^T\mathbf{C}\mathbf{w} = 20 + \begin{bmatrix} 0 & 0 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix} + \begin{bmatrix} w_1 & w_2 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \end{bmatrix}$$
a = 20
b = np.zeros((2,1))
C = np.array([[2,1],[1,2]])
g = lambda w: (a + np.dot(b.T,w) + np.dot(np.dot(w.T,C),w))[0]
w = np.array([3,4])
max_its = 2
weight_history,cost_history = coordinate_descent_for_quadratic(g,w,max_its,a,b,C)
plotter.two_input_contour_plot(g,weight_history,xmin = -4.5,xmax = 4.5,ymin = -4.5,ymax =
Consider a linear function of $N$ inputs
$$h(w_1, w_2, \ldots, w_N) = a + b_1 w_1 + b_2 w_2 + \ldots + b_N w_N$$
where $a, b_1, \ldots, b_N$ are scalar parameters. We can rewrite $h$ more compactly as
$$h(\mathbf{w}) = a + \mathbf{b}^T\mathbf{w}.$$
When $N = 1$ we have $h(w) = a + bw$, the formula for a line whose input space (characterized by $w$) is one-dimensional.
When $N = 1$, for any point $w^0$ in the input space there are only two directions to move in (suppose $b > 0$):
• Moving to the right of $w^0$ increases the value of $h$, and hence is an ascent direction.
• Moving to the left of $w^0$ decreases the value of $h$, and hence is a descent direction.
When $N > 1$ there are infinitely many directions to move in, some providing ascent, some descent, and some preserving the value of $h$.
How can we find the direction that produces the largest ascent (or descent), commonly referred to as the direction of steepest ascent (or descent)? We can pose this as the problem
$$\underset{\mathbf{d}}{\text{maximize}} \,\, h\left(\mathbf{w}^0 + \mathbf{d}\right)$$
over all unit-length vectors $\mathbf{d}$.
Since $h\left(\mathbf{w}^0 + \mathbf{d}\right) = a + \mathbf{b}^T\left(\mathbf{w}^0 + \mathbf{d}\right) = h\left(\mathbf{w}^0\right) + \mathbf{b}^T\mathbf{d}$, the problem amounts to maximizing (or minimizing) the inner product
$$\mathbf{b}^T\mathbf{d} = \left\Vert \mathbf{b} \right\Vert_2 \left\Vert \mathbf{d} \right\Vert_2 \cos(\theta),$$
where $\left\Vert \mathbf{b} \right\Vert_2$ does not change with respect to $\mathbf{d}$, and $\left\Vert \mathbf{d} \right\Vert_2 = 1$.
• Of all unit directions, $\mathbf{d} = \frac{\mathbf{b}}{\left\Vert \mathbf{b} \right\Vert_2}$ provides the steepest ascent direction (where $\theta = 0$ and $\cos(\theta) = 1$).
• Similarly, the unit direction $\mathbf{d} = \frac{-\mathbf{b}}{\left\Vert \mathbf{b} \right\Vert_2}$ provides the steepest descent direction (where $\theta = \pi$ and $\cos(\theta) = -1$).
• For the first-order Taylor series approximation $h(\mathbf{w}) = g\left(\mathbf{w}^0\right) + \nabla g\left(\mathbf{w}^0\right)^T\left(\mathbf{w} - \mathbf{w}^0\right)$, the role of $\mathbf{b}$ is played by the gradient $\nabla g\left(\mathbf{w}^0\right)$, so its steepest ascent and descent directions are $\pm\frac{\nabla g\left(\mathbf{w}^0\right)}{\left\Vert \nabla g\left(\mathbf{w}^0\right) \right\Vert_2}$.
• Because $h$ is constructed to closely approximate $g$ near the point $\mathbf{w}^0$, its steepest ascent and descent directions also tell us the direction to travel to increase or decrease the value of the function $g$ itself near the point $\mathbf{w}^0$.
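A quick numerical sanity check of this claim (an illustration, not part of the notebook): among many random unit-length directions $\mathbf{d}$, none attains a larger value of $\mathbf{b}^T\mathbf{d}$ than the steepest ascent direction $\mathbf{d} = \mathbf{b}/\left\Vert\mathbf{b}\right\Vert_2$.

```python
import numpy as np

# compare b^T d over random unit directions d against the steepest ascent
# direction d = b/||b||_2; no random direction should beat it
np.random.seed(0)
b = np.array([3.0, -1.0, 2.0])
d_star = b / np.linalg.norm(b)
random_d = np.random.randn(100000, 3)
random_d /= np.linalg.norm(random_d, axis=1, keepdims=True)   # normalize to unit length
print(np.dot(b, d_star))        # ||b||_2, approximately 3.7417: the largest possible value
print(np.max(random_d @ b))     # strictly smaller, approaching ||b||_2 from below
```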
Gradient Descent
A local optimization method aims to find minima of a given function $g(\mathbf{w})$ by beginning at some point $\mathbf{w}^0$ and taking a number of steps $\mathbf{w}^1, \mathbf{w}^2, \ldots, \mathbf{w}^K$ of the form
$$\mathbf{w}^k = \mathbf{w}^{k-1} + \alpha \mathbf{d}^k$$
where $\mathbf{d}^k$ is a descent direction and $\alpha$ the steplength parameter, as before.

The negative gradient $-\nabla g(\mathbf{w})$ of a function $g(\mathbf{w})$ computed at a particular point defines a valid descent direction at that point.
If we employ the negative gradient direction $\mathbf{d}^k = -\nabla g\left(\mathbf{w}^{k-1}\right)$, the sequence of steps takes the form
$$\mathbf{w}^k = \mathbf{w}^{k-1} - \alpha \nabla g\left(\mathbf{w}^{k-1}\right).$$
The local optimization method with the above update step is called the *gradient descent* algorithm.
• Beginning at the initial point $w^0$, we make an approximation to $g(w)$ at the point $\left(w^0, g\left(w^0\right)\right)$ with the first-order Taylor series approximation.
• Moving in the negative gradient direction provided by this approximation, we arrive at $w^1 = w^0 - \alpha \frac{d}{dw}g\left(w^0\right)$.
• We repeat this process at $w^1$, moving in the negative gradient direction to $w^2 = w^1 - \alpha \frac{d}{dw}g\left(w^1\right)$.
• $\ldots$
Advantages of gradient descent over naive zero-order approaches:
• The negative gradient direction provides a descent direction for the function locally.
• The descent directions provided via the gradient are far easier to compute than descent directions sought out at random.
The gradient descent algorithm: for $k = 1, \ldots, K$, set
$$\mathbf{w}^k = \mathbf{w}^{k-1} - \alpha \nabla g\left(\mathbf{w}^{k-1}\right).$$
This produces a history of weights $\left\{\mathbf{w}^k\right\}_{k=0}^{K}$ and corresponding function evaluations $\left\{g\left(\mathbf{w}^k\right)\right\}_{k=0}^{K}$.
• We can simply return the final set of weights $\mathbf{w}^K$ or the entire sequence of gradient descent steps $\left\{\mathbf{w}^k\right\}_{k=0}^{K}$.
• If the steplength $\alpha$ is chosen properly, the algorithm will stop near stationary points of the function, typically minima or saddle points.
• If the step
$$\mathbf{w}^k = \mathbf{w}^{k-1} - \alpha \nabla g\left(\mathbf{w}^{k-1}\right)$$
does not move from the prior point $\mathbf{w}^{k-1}$, this can mean that the direction we are traveling in is vanishing, i.e., $-\nabla g\left(\mathbf{w}^k\right) \approx \mathbf{0}_{N\times 1}$. This is a stationary point of the function.
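The runs below call a `gradient_descent` helper from the course's library, which is not defined in this notebook. A minimal sketch of such a routine is given here as an assumption about its behavior: the library presumably computes exact gradients (e.g., via `autograd`), while this sketch uses a finite-difference approximation and interprets the `'diminishing'` option as the rule $\alpha = 1/k$.

```python
import numpy as np

def gradient_descent(g, alpha_choice, max_its, w):
    # Sketch of the helper called below (the course library's version may differ).
    def g_val(v):                 # evaluate g and return a plain float
        return np.asarray(g(v), dtype=float).reshape(-1)[0]

    def grad(v, h=1e-6):          # central finite-difference approximation of the gradient
        out = np.zeros_like(v)
        for n in range(v.size):
            e = np.zeros_like(v); e[n] = h
            out[n] = (g_val(v + e) - g_val(v - e)) / (2.0 * h)
        return out

    w = np.atleast_1d(np.asarray(w, dtype=float)).copy()
    weight_history, cost_history = [w.copy()], [g_val(w)]
    for k in range(1, max_its + 1):
        # the string 'diminishing' is taken here to mean the rule alpha = 1/k
        alpha = 1.0 / k if alpha_choice == 'diminishing' else alpha_choice
        w = w - alpha * grad(w)
        weight_history.append(w.copy())
        cost_history.append(g_val(w))
    return weight_history, cost_history
```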
static_plotter = static_visualizer()
anim_plotter = anim_visualizer()
The function $g(w) = \frac{1}{50}\left(w^4 + w^2 + 10w\right)$ has a single stationary point, its global minimum, located at the real root of $2w^3 + w + 5 = 0$:
$$w = \sqrt[3]{\frac{\sqrt{2031} - 45}{36}} - \sqrt[3]{\frac{\sqrt{2031} + 45}{36}} \approx -1.2347.$$
With gradient descent we can determine a point that is close to this one. The gradient of the
function is:
$$\frac{\partial}{\partial w} g(w) = \frac{2}{25}w^3 + \frac{1}{25}w + \frac{1}{5}.$$
We initialize the gradient descent algorithm at w0 = 2.5, constant steplength α = 1, and run
for 25 iterations.
g = lambda w: 1/float(50)*(w**4 + w**2 + 10*w) # try other functions too! Like g = lambda w: n
w = 2.5; alpha = 1; max_its = 25;
weight_history,cost_history = gradient_descent(g,alpha,max_its,w)
anim_plotter.gradient_descent(g,weight_history,savepath=video_path_1,fps=1)
# standard imports
import matplotlib.pyplot as plt
from IPython.display import Image, HTML
from base64 import b64encode
def show_video(video_path, width = 1000):
video_file = open(video_path, "r+b").read()
video_url = f"data:video/mp4;base64,{b64encode(video_file).decode()}"
return HTML(f"""<video width={width} controls><source src="{video_url}"></video>""")
• In order to find the global minimum of a function using gradient descent one may need to run it several times with different initializations and/or steplength schemes.
The cost function history plots can be used for debugging as well as selecting proper values for the steplength α.
We run gradient descent with 10 steps using the steplength/learning rate value α = 0.2.
g = lambda w: np.dot(w.T,w) + 2
w = np.array([1.5,2]); max_its = 10; alpha = 0.2;
weight_history,cost_history = gradient_descent(g,alpha,max_its,w)
static_plotter.two_input_surface_contour_plot(g,weight_history,num_contours = 25,view = [
• We can view the progress of the optimization run regardless of the dimension of the function we are minimizing.
• In general, we would like to choose the largest possible value for α that leads to proper
convergence.
demo_2d = grad_descent_visualizer_2d()
demo_3d = grad_descent_visualizer_3d()
video_path_2 = './video_2.mp4'
g = lambda w: w**2
w_init = -2.5
steplength_range = np.linspace(10**-5,1.5,150)
max_its = 5
demo_2d.animate_it(savepath=video_path_2,w_init = w_init, g = g, steplength_range = steplength_ra
video_path_3 = './video_3.mp4'   # assumed output path, following the pattern of video_path_2 above
g = lambda w: np.sin(w[0])
w_init = [1,0]; alpha_range = np.linspace(2*10**-4,5,200); max_its = 10; view = [10,120];
demo_3d.animate_it(savepath=video_path_3,g = g,w_init = w_init,alpha_range = alpha_range,max_its
Consider the absolute value function
$$g(w) = |w|.$$
This function has a single global minimum at $w = 0$ and a derivative defined everywhere but at $w = 0$:
$$\frac{d}{dw}g(w) = \begin{cases} +1 & \text{if } w > 0 \\ -1 & \text{if } w < 0. \end{cases}$$
Below we compare a fixed steplength $\alpha = 0.5$ with a diminishing steplength rule.
g = lambda w: np.abs(w)
alpha_choice = 0.5; w = 1.75; max_its = 20;
weight_history_1,cost_history_1 = gradient_descent(g,alpha_choice,max_its,w)
alpha_choice = 'diminishing'; w = 1.75; max_its = 20;
weight_history_2,cost_history_2 = gradient_descent(g,alpha_choice,max_its,w)
static_plotter.single_input_plot(g,[weight_history_1,weight_history_2],[cost_history_1,cost_histo
• It is not ultimately important that the plot be strictly decreasing (i.e., the algorithm
descends at every single step).
• It is critical to find a value of α that allows gradient descent to find the lowest possible function value.
• Sometimes, the best choice of α for a given minimization might cause gradient descent to
move up and down.
Example
Minimize the function:
$$g(\mathbf{w}) = w_0^2 + w_1^2 + 2\sin\left(1.5\left(w_0 + w_1\right)\right)^2 + 2$$
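The runs below assume $g$ has been defined in code (in a cell not shown here); a direct translation of the formula above might be:

```python
import numpy as np

# direct translation of the cost function above (assumed to be defined
# in a hidden cell of the original notebook)
g = lambda w: w[0]**2 + w[1]**2 + 2*np.sin(1.5*(w[0] + w[1]))**2 + 2
```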
• We make three runs, all starting at the same initial point $\mathbf{w}^0 = \begin{bmatrix} 3 \\ 3 \end{bmatrix}$ and taking 10 steps, each using a different fixed steplength: the first run uses $\alpha = 10^{-2}$, the second $\alpha = 10^{-1}$, and the third $\alpha = 10^{0}$.
# first run
w = np.array([3.0,3.0]); max_its = 10;
alpha_choice = 10**(-2);
weight_history_1,cost_history_1 = gradient_descent(g,alpha_choice,max_its,w)
# second run
alpha_choice = 10**(-1);
weight_history_2,cost_history_2 = gradient_descent(g,alpha_choice,max_its,w)
# third run
alpha_choice = 10**(0);
weight_history_3,cost_history_3 = gradient_descent(g,alpha_choice,max_its,w)
histories = [weight_history_1,weight_history_2,weight_history_3]
static_plotter.two_input_contour_horiz_plots(g,histories,show_original=False,num_contours=
static_plotter.plot_cost_histories([cost_history_1,cost_history_2,cost_history_3],start =
• This example was designed specifically for demonstration purposes. In practice it is just fine for the cost function history of gradient descent to oscillate up and down.
• If the steplength is chosen properly, the algorithm will halt near stationary points of a function, typically minima or saddle points.
• If the step $\mathbf{w}^k = \mathbf{w}^{k-1} - \alpha \nabla g\left(\mathbf{w}^{k-1}\right)$ does not move significantly from the prior point $\mathbf{w}^{k-1}$, this can mean only one thing: the direction we are traveling in is vanishing, i.e., $-\nabla g\left(\mathbf{w}^k\right) \approx \mathbf{0}_{N\times 1}$.
• We can wait for gradient descent to get sufficiently close to a stationary point, i.e., until $\left\Vert \nabla g\left(\mathbf{w}^{k-1}\right) \right\Vert_2$ is sufficiently small.
• Or halt when steps no longer make sufficient progress, i.e., when $\frac{1}{N}\left\Vert \mathbf{w}^{k} - \mathbf{w}^{k-1} \right\Vert_2 < \epsilon$.
• Or halt when corresponding evaluations no longer differ substantially, i.e., when $\frac{1}{N}\left\vert g\left(\mathbf{w}^{k}\right) - g\left(\mathbf{w}^{k-1}\right) \right\vert < \epsilon$.
• A practical way to halt gradient descent is to simply run the algorithm for a pre-set maximum number of iterations.
• This maximum is typically set manually / heuristically depending on computing resources, domain knowledge, and the choice of the steplength parameter α.
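A minimal sketch combining these halting rules (illustrative names and tolerances, written here with a hand-coded gradient for $g(\mathbf{w}) = \mathbf{w}^T\mathbf{w} + 2$; not the notebook's own implementation):

```python
import numpy as np

# a small sketch of the stopping rules above, on g(w) = w^T w + 2
# (gradient 2w written by hand here); names and tolerances are illustrative
g      = lambda w: np.dot(w, w) + 2
grad_g = lambda w: 2 * w

w, alpha, eps, max_its = np.array([1.5, 2.0]), 0.2, 1e-5, 1000
N = w.size
for k in range(1, max_its + 1):
    w_new = w - alpha * grad_g(w)
    if np.linalg.norm(grad_g(w_new)) < eps:            # gradient nearly vanishes
        print(f'gradient small after {k} steps'); break
    if np.linalg.norm(w_new - w) / N < eps:            # step no longer moves much
        print(f'step small after {k} steps'); break
    if abs(g(w_new) - g(w)) / N < eps:                 # cost barely changes
        print(f'cost change small after {k} steps'); break
    w = w_new
```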
static_plotter = static_visualizer()
demo = grad_descent_visualizer()
• The negative gradient direction is a true descent direction and is often cheap to compute.
• If we suppose $g(\mathbf{w})$ is a differentiable function and $\mathbf{a}$ is some input point, then $\mathbf{a}$ lies on the contour defined by all those points where $g(\mathbf{w}) = g(\mathbf{a}) = c$ for some constant $c$.
• If we take another point $\mathbf{b}$ from this contour very close to $\mathbf{a}$, then the vector $\mathbf{a} - \mathbf{b}$ is essentially perpendicular to the gradient $\nabla g(\mathbf{a})$, since
$$\nabla g(\mathbf{a})^T (\mathbf{a} - \mathbf{b}) = 0$$
essentially defines the line in the input space whose normal vector is precisely $\nabla g(\mathbf{a})$.
• So indeed both the ascent and descent directions defined by the gradient (i.e., the positive and negative gradient directions) of $g$ at $\mathbf{a}$ are perpendicular to the contour there.
• And since $\mathbf{a}$ was an arbitrary input of $g$, the same argument holds at every input.
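A small numerical illustration of this perpendicularity (not from the notebook), using the circular-contour quadratic $g(\mathbf{w}) = \mathbf{w}^T\mathbf{w}$:

```python
import numpy as np

# g(w) = w^T w has circular contours; a = (3, 4) lies on the contour g(w) = 25
g      = lambda w: np.dot(w, w)
grad_g = lambda w: 2 * w
a = np.array([3.0, 4.0])

# a nearby point b on the same contour: rotate a by a tiny angle about the origin
eps = 1e-4
R = np.array([[np.cos(eps), -np.sin(eps)], [np.sin(eps), np.cos(eps)]])
b = R @ a
print(g(b))                                              # 25.0: b lies on the same contour
print(np.dot(grad_g(a), a - b) / np.linalg.norm(a - b))  # ~0: a - b is perpendicular to grad g(a)
```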
• In practice, the fact that the negative gradient always points perpendicular to the contours of a function can, depending on the function being minimized, make the negative gradient direction change rapidly (zig-zag) during a run of gradient descent.
The three quadratics differ only in the *top left entry* of their $\mathbf{C}$ matrix. As we change this single value of $\mathbf{C}$ we *elongate* the contours significantly along the horizontal axis.
We run 25 gradient descent steps with initialization at $\mathbf{w}^0 = \begin{bmatrix} 10 \\ 1 \end{bmatrix}$ and $\alpha = 10^{-1}$.
a1 = 0
b1 = 0*np.ones((2,1))
C1 = np.array([[0.5,0],[0,9.75]])
g1 = lambda w: (a1 + np.dot(b1.T,w) + np.dot(np.dot(w.T,C1),w))[0]
w = np.array([10.0,1.0]); max_its = 25; alpha_choice = 10**(-1);
weight_history_1,cost_history_1 = gradient_descent(g1,alpha_choice,max_its,w)
a2 = 0
b2 = 0*np.ones((2,1))
C2 = np.array([[0.1,0],[0,9.75]])
g2 = lambda w: (a2 + np.dot(b2.T,w) + np.dot(np.dot(w.T,C2),w))[0]
weight_history_2,cost_history_2 = gradient_descent(g2,alpha_choice,max_its,w)
a3 = 0
b3 = 0*np.ones((2,1))
C3 = np.array([[0.01,0],[0,9.75]])
g3 = lambda w: (a3 + np.dot(b3.T,w) + np.dot(np.dot(w.T,C3),w))[0]
weight_history_3,cost_history_3 = gradient_descent(g3,alpha_choice,max_its,w)
histories = [weight_history_1,weight_history_2,weight_history_3]
gs = [g1,g2,g3]
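As an aside (not part of the notebook), the elongation of the contours can be quantified by the condition number of each $\mathbf{C}$ matrix, the ratio of its largest to smallest eigenvalue:

```python
import numpy as np

# condition numbers of the three C matrices defined above: the more elongated
# the contours, the larger the ratio of largest to smallest eigenvalue
for name, C in [('C1', np.array([[0.5, 0], [0, 9.75]])),
                ('C2', np.array([[0.1, 0], [0, 9.75]])),
                ('C3', np.array([[0.01, 0], [0, 9.75]]))]:
    print(name, np.linalg.cond(C))   # 19.5, 97.5, 975.0
```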
The zig-zagging behavior of gradient descent in each of these cases above is completely due to
the rapid change in negative gradient direction during each run.
static_plotter.plot_grad_directions_v2(weight_history_1)
static_plotter.plot_grad_directions_v2(weight_history_2)
static_plotter.plot_grad_directions_v2(weight_history_3)
The slow convergence caused by zig-zagging in each case can be seen in the slow decrease of
the cost function history associated with each run.
static_plotter.plot_cost_histories([cost_history_1,cost_history_2,cost_history_3],start =
a1 = 0
b1 = 0*np.ones((2,1))
C1 = np.array([[0.5,0],[0,9.75]])
g1 = lambda w: (a1 + np.dot(b1.T,w) + np.dot(np.dot(w.T,C1),w))[0]
w = np.array([10.0,1.0]); max_its = 15; alpha_choice = 10**(-2);
weight_history_1,cost_history_1 = gradient_descent(g1,alpha_choice,max_its,w)
static_plotter.two_input_contour_plot(g1,weight_history_1,show_original = False,num_contours =
• As we know from the first-order condition for optimality, the (negative) gradient vanishes at stationary points.
• The magnitude of the gradient vanishes at stationary points, that is, $\left\Vert \nabla g(\mathbf{w}) \right\Vert_2 = 0$.
• By extension, the (negative) gradient at points near a stationary point has non-zero direction but vanishing magnitude, i.e., $\left\Vert \nabla g(\mathbf{w}) \right\Vert_2 \approx 0$.
• Due to this vanishing magnitude of the negative gradient near stationary points, gradient descent steps progress very slowly (or *crawl*) near stationary points.
Recall the generic local optimization update
$$\mathbf{w}^k = \mathbf{w}^{k-1} + \alpha \mathbf{d}^{k-1}.$$
We saw that if $\mathbf{d}^{k-1}$ is a unit-length descent direction found by any zero-order search approach, then the distance traveled with this step equals precisely the steplength value $\alpha$, since
$$\left\Vert \mathbf{w}^k - \mathbf{w}^{k-1} \right\Vert_2 = \left\Vert \left(\mathbf{w}^{k-1} + \alpha \mathbf{d}^{k-1}\right) - \mathbf{w}^{k-1} \right\Vert_2 = \alpha \left\Vert \mathbf{d}^{k-1} \right\Vert_2 = \alpha.$$
Again, the key assumption made here was that our descent direction $\mathbf{d}^{k-1}$ had unit length.
• However, for gradient descent our descent direction $\mathbf{d}^{k-1} = -\nabla g\left(\mathbf{w}^{k-1}\right)$ is not guaranteed to have unit length, so the distance traveled at the $k^{th}$ step is instead $\alpha \left\Vert \nabla g\left(\mathbf{w}^{k-1}\right) \right\Vert_2$, which shrinks as the gradient magnitude shrinks.
• Gradient descent crawls as it approaches the minimum because the magnitude of the
gradient vanishes here.
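To see the crawling numerically (an illustration using the `gradient_descent` sketch given earlier, not the notebook's own code), we can print the distance traveled at each step of a short run on $g(w) = w^2$:

```python
import numpy as np

# on g(w) = w^2 the step length alpha*|g'(w)| shrinks as the iterates
# approach the minimum at w = 0 (uses the gradient_descent sketch above)
g = lambda w: w**2
weight_history, cost_history = gradient_descent(g, 0.2, 10, -2.5)
steps = [abs((w1 - w0).item()) for w0, w1 in zip(weight_history[:-1], weight_history[1:])]
print(np.round(steps, 4))   # distances traveled per step, steadily shrinking
```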
We make a run of gradient descent on this function using 50 steps with $\alpha = 10^{-2}$, initialized such that it approaches one of these saddle points and so slows to a halt.
• Because the magnitude of the gradient is almost zero here, we cannot make much progress even when employing 1000 steps of gradient descent with a steplength $\alpha = 10^{-1}$.