First Order Method

The document discusses first-order methods for local optimization, focusing on gradient descent algorithms that utilize a function's first derivatives to find descent directions efficiently. It explains the first-order optimality condition, which characterizes minima of differentiable functions, and highlights the challenges in solving first-order systems of equations. Additionally, it covers coordinate descent as a sequential approach to finding stationary points and provides examples of minimizing functions using these concepts.

Uploaded by

Hoàng Lê Văn

FirstOrderMethods.ipynb - Colab https://colab.research.google.com/drive/1oJub77tNg6ArYQ8pkIbz9PG-...

First-Order Methods

In local optimization methods, we repeatedly refine an initial sample input by traveling in descent directions, i.e., directions in the input space that lead to points lower and lower on the function.

We have considered methods for finding such a descent direction at each step:

• random search: prohibitively expensive as the dimension grows.

• coordinate search: severely restricted in terms of the quality of the descent direction.

By exploiting a function's first derivative(s), we can construct a local optimization method that determines high-quality descent directions at a cheaper cost.

Big picture view of the gradient descent algorithm

A local optimization method is one where we aim to find minima of a given function by beginning at some point $w^0$ and taking a number of steps $w^1, w^2, w^3, \ldots, w^K$ of the generic form

$w^k = w^{k-1} + \alpha d^k$

where:

• $d^k$ are direction vectors (ideally descent directions that lead us to lower and lower parts of the function)
• $\alpha$ is called the steplength parameter

A gradient descent algorithm, which is a first-order local optimization method, employs a function's first derivative(s) to cheaply compute a high-quality descent direction.

• The first derivative(s) of a function help form the best linear approximation to the function (called the first-order Taylor series approximation).

• Because the approximation matches the function locally, the descent direction of the tangent hyperplane (or the tangent line) is also a descent direction for the function itself.

• It is easy to compute the descent direction of a line or a hyperplane.

The First-Order Optimality Condition

The first-order optimality condition describes how any differentiable function's first derivative(s) behave at its minima.

The first-order condition

func1 = lambda w: w**2 + 3

func2 = lambda w: w[0]**2 + w[1]**2 + 3
compare_2d3d(func1 = func1,func2 = func2)

• We draw the first-order approximation (a tangent line/hyperplane) at the function's minimum.
• In both examples, the tangent line/hyperplane is perfectly flat, indicating that the first derivative(s) are exactly zero at the function's minimum.

The values of the first-order derivative(s) provide a convenient way of characterizing the minima of a function $g$. When $N = 1$, any point $v$ where

$\frac{\mathrm{d}}{\mathrm{d}w} g(v) = 0$

is a potential minimum.

Analogously, with general $N$-dimensional input, any $N$-dimensional point $v$ where every partial derivative of $g$ is zero, that is

$\frac{\partial}{\partial w_1} g(v) = 0$
$\frac{\partial}{\partial w_2} g(v) = 0$
$\vdots$
$\frac{\partial}{\partial w_N} g(v) = 0$

is a potential minimum. This system of $N$ equations is naturally referred to as the first-order system of equations. We can write the first-order system more compactly using gradient notation as

$\nabla g(v) = 0_{N \times 1}$.

The first-order optimality condition translates the problem of identifying a function's minimum points into the task of solving a system of $N$ first-order equations.

However, there are two problems with the first-order characterization of minima.

1. It is virtually impossible (with few exceptions) to solve a general function's first-order system of equations for 'closed form' solutions.
2. The first-order optimality condition characterizes not only minima, but also maxima and saddle points of a function.

Example 1: Finding points of zero derivative for single-input functions

func1 = lambda w: np.sin(2*w)

func2 = lambda w: w**3
func3 = lambda w: np.sin(3*w) + 0.1*w**2
show_stationary(func1 = func1,func2 = func2,func3 = func3)

It is not only global minima that have zero derivatives, but other points as well, e.g., local minima, local and global maxima, and saddle points.

Points having zero-valued derivative(s) are collectively referred to as stationary points.

First-order condition for optimality: Stationary points of a function $g$ (including minima, maxima, and saddle points) satisfy the first-order condition

$\nabla g(v) = 0_{N \times 1}$

If a function is convex (e.g., an upward-facing quadratic function), then any point at which it satisfies the first-order condition must be a global minimum. A convex function has no maxima or saddle points.

Example 2: A simple-looking function whose global minimum is difficult to compute (algebraically)

$g(w) = \frac{1}{50}\left(w^4 + w^2 + 10w\right)$

# imports needed by this cell
import numpy as np
import matplotlib.pyplot as plt

w = np.linspace(-5,5,50)
g = lambda w: 1/50*(w**4 + w**2 + 10*w)
figure = plt.figure(figsize=(6,3))
plt.plot(w,g(w),linewidth=2,color='k')
plt.xlabel('$w$',fontsize=14)
plt.ylabel('$g(w)$',rotation=0,labelpad=15,fontsize=14)
plt.show()

Setting the derivative to zero gives a cubic equation:

$\frac{\mathrm{d}}{\mathrm{d}w} g(w) = \frac{1}{50}\left(4w^3 + 2w + 10\right) = 0$, i.e., $2w^3 + w + 5 = 0$.

The solution providing the minimum of the function $g$ is:

$w = \frac{\sqrt[3]{\sqrt{2031} - 45}}{6^{2/3}} - \frac{1}{\sqrt[3]{6\left(\sqrt{2031} - 45\right)}}$

which can be computed, after much toil, using centuries-old tricks developed for just such problems.
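As a quick sanity check (not part of the original notebook), the closed-form expression can be compared against a numerical root finder. The sketch below assumes NumPy is available and uses `np.roots` on the cubic $2w^3 + w + 5 = 0$; the variable names are illustrative only.

```python
import numpy as np

# Coefficients of 2w^3 + 0w^2 + 1w + 5, highest degree first.
roots = np.roots([2.0, 0.0, 1.0, 5.0])

# The cubic is strictly increasing, so it has exactly one real root.
w_numeric = roots[np.abs(roots.imag) < 1e-9].real[0]

# Closed-form expression from the text.
s = np.sqrt(2031.0) - 45.0
w_closed = np.cbrt(s) / 6.0 ** (2.0 / 3.0) - 1.0 / np.cbrt(6.0 * s)

print("numerical root:  ", w_numeric)
print("closed-form root:", w_closed)
```

Both values should agree to many digits, which illustrates the point of the example: a numerical method reaches the same stationary point with far less effort than the algebra.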

Example 3: Stationary points of a general multi-input quadratic function

Take the general multi-input quadratic function

$g(w) = a + b^T w + w^T C w$

where $C$ is an $N \times N$ symmetric matrix, $b$ is an $N \times 1$ vector, and $a$ is a scalar.

Computing the first derivative (gradient) we have

$\nabla g(w) = 2Cw + b$

Setting this equal to zero gives a symmetric linear system of equations of the form

$Cw = -\frac{1}{2} b$

whose solutions are stationary points of the original function.
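For a concrete quadratic, this linear system can be handed to a standard solver. The following is a minimal sketch, assuming NumPy and an invertible $C$; the particular values of `C` and `b` are made up for illustration and do not come from the notebook.

```python
import numpy as np

# Illustrative constants (assumed, not from the notebook).
C = np.array([[2.0, 1.0],
              [1.0, 2.0]])      # symmetric N x N matrix
b = np.array([1.0, -3.0])       # length-N vector

def gradient(w):
    # Gradient of g(w) = a + b^T w + w^T C w for symmetric C.
    return 2.0 * C @ w + b

# Solve the first-order system C w = -b / 2.
w_star = np.linalg.solve(C, -0.5 * b)

print("stationary point:", w_star)
print("gradient there:  ", gradient(w_star))
```

Because this `C` is positive definite, the quadratic is convex and the stationary point found is its global minimum.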

Coordinate descent and the first-order optimality condition

For a given cost function $g(w)$ taking in $N$-dimensional input, the stationary points (minima included) of this function are those satisfying the system of equations

$\nabla g(v) = 0_{N \times 1}$

or written out one equation at a time as

$\frac{\partial}{\partial w_1} g(v) = 0$
$\frac{\partial}{\partial w_2} g(v) = 0$
$\vdots$
$\frac{\partial}{\partial w_N} g(v) = 0.$

Instead of solving the first-order system simultaneously, we can solve its equations sequentially in a coordinate-wise approach, provided each individual equation

$\frac{\partial}{\partial w_n} g(v) = 0$

can be solved in closed form.

• We first initialize at an input point $w^0$, and begin by updating the first coordinate by solving

$\frac{\partial}{\partial w_1} g(w^0) = 0$

for the optimal first weight $w_1^\star$.
• We then update the first coordinate of the vector $w^0$ with this solution, and call the updated set of weights $w^1$.
• Continuing this pattern, to update the $n$th weight we solve

$\frac{\partial}{\partial w_n} g(w^{n-1}) = 0$

for $w_n^\star$, and update the $n$th weight using this value, forming the updated set of weights $w^n$.

After we sweep through all $N$ weights a single time, we can refine our solution by sweeping through the weights again (as with any other coordinate-wise method). At the $k$th such sweep we update the $n$th weight by solving the single equation

$\frac{\partial}{\partial w_n} g(w^{k+n-1}) = 0$

and update the $n$th weight of $w^{k+n-1}$, and so on.
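The sweep described above can be sketched for the quadratic $g(w) = a + b^T w + w^T C w$, where each single-coordinate equation has a closed-form solution. This is an illustrative sketch only: the function name `coordinate_descent_quadratic` is hypothetical (it is not the notebook's `coordinate_descent_for_quadratic` helper, whose implementation is not shown), and it assumes a symmetric `C` with positive diagonal entries.

```python
import numpy as np

def coordinate_descent_quadratic(b, C, w_init, num_sweeps):
    """Hypothetical sketch of first-order coordinate descent for
    g(w) = a + b^T w + w^T C w with symmetric C and C[n, n] > 0."""
    w = np.array(w_init, dtype=float)
    history = [w.copy()]
    for _ in range(num_sweeps):
        for n in range(w.size):
            # dg/dw_n = b_n + 2 * sum_j C[n, j] * w_j = 0, solved for w_n
            off_diagonal = C[n] @ w - C[n, n] * w[n]
            w[n] = -(b[n] + 2.0 * off_diagonal) / (2.0 * C[n, n])
            history.append(w.copy())
    return history

C = np.array([[2.0, 1.0],
              [1.0, 2.0]])
b = np.zeros(2)
history = coordinate_descent_quadratic(b, C, w_init=[3.0, 4.0], num_sweeps=20)
w_final = history[-1]
print("after 20 sweeps:", w_final)
```

Each inner step changes a single coordinate, so the weight history traces an axis-aligned staircase toward the minimum at the origin.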

Example 4: Minimizing convex quadratic functions via first-order coordinate descent

We minimize the simple quadratic

$g(w_0, w_1) = w_0^2 + w_1^2 + 2$

which can be written in vector-matrix form as $g(w) = a + b^T w + w^T C w$, where $a = 2$, $b = \begin{bmatrix} 0 \\ 0 \end{bmatrix}$, and $C = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$.

We initialize at $w = \begin{bmatrix} 3 \\ 4 \end{bmatrix}$ and run 1 iteration of the algorithm.

plotter = Visualizer()

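$g(w_0, w_1) = w_0^2 + w_1^2 + 2 = 2 + \begin{bmatrix} 0 & 0 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} + \begin{bmatrix} w_0 & w_1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$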

a = 2
b = np.zeros((2,1))
C = np.eye(2)
# a quadratic function defined using the constants above
g = lambda w: (a + np.dot(b.T,w) + np.dot(np.dot(w.T,C),w))[0]
# initialization
w = np.array([3,4])
max_its = 1
weight_history,cost_history = coordinate_descent_for_quadratic(g,w,max_its,a,b,C)
plotter.two_input_contour_plot(g,weight_history,xmin = -1.5,xmax = 4.5,ymin = -1.5,ymax =

$g(w_0, w_1) = a + b^T w + w^T C w = 20 + \begin{bmatrix} 0 & 0 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix} + \begin{bmatrix} w_0 & w_1 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \end{bmatrix}$

a = 20
b = np.zeros((2,1))
C = np.array([[2,1],[1,2]])
g = lambda w: (a + np.dot(b.T,w) + np.dot(np.dot(w.T,C),w))[0]
w = np.array([3,4])
max_its = 2
weight_history,cost_history = coordinate_descent_for_quadratic(g,w,max_its,a,b,C)
plotter.two_input_contour_plot(g,weight_history,xmin = -4.5,xmax = 4.5,ymin = -4.5,ymax =

Geometry of First-Order Taylor Series

• We discuss important characteristics of the hyperplane, including the directions of steepest ascent and steepest descent.
• A special hyperplane: the first-order Taylor series approximation to a function.

The anatomy of hyperplanes

A general $N$-dimensional hyperplane can be characterized as:

$h(w_1, w_2, \ldots, w_N) = a + b_1 w_1 + b_2 w_2 + \cdots + b_N w_N$

where $a, b_1, \ldots, b_N$ are scalar parameters. We can rewrite $h$ more compactly as:

$h(w) = a + b^T w$

When $N = 1$, we have $h(w) = a + bw$, which is the formula for a one-dimensional line in a two-dimensional space whose input space (characterized by $w$) is one-dimensional.

For general $N$, $h(w) = a + b^T w$ is an $N$-dimensional hyperplane in an $(N + 1)$-dimensional space whose input space (characterized by $w_1, w_2, \ldots, w_N$) is $N$-dimensional.

Steepest ascent and descent directions

When $N = 1$, for any point $w^0$ in the input space, there are only 2 directions to move in. For a line with positive slope ($b > 0$; the roles swap when $b < 0$):
• Moving to the right of $w^0$ increases the value of $h$, and hence it is an ascent direction.
• Moving to the left of $w^0$ decreases the value of $h$, and hence it is a descent direction.

When $N > 1$, there are infinitely many directions to move in, some providing ascent, some providing descent, and some preserving the value of $h$.

How can we find the direction that produces the largest ascent (or descent), commonly referred to as the direction of steepest ascent (or descent)?

To search for the direction of steepest ascent at a given point $w^0$, we solve

$\underset{d}{\text{maximize}} \;\; h(w^0 + d)$

over all unit-length vectors $d$.

Note that $h(w^0 + d)$ can be written as:

$a + b^T (w^0 + d) = a + b^T w^0 + b^T d$

where the first two terms are constant with respect to $d$.

Therefore, maximizing the value of $h(w^0 + d)$ is equivalent to maximizing $b^T d$.

We have:

$b^T d = \|b\|_2 \|d\|_2 \cos(\theta)$,

where $\|b\|_2$ does not change with respect to $d$, and $\|d\|_2 = 1$.

Then the optimization problem becomes:

$\underset{\theta}{\text{maximize}} \;\; \cos(\theta)$,

where $\theta$ is the angle between the vectors $b$ and $d$.

• Of all unit directions, $d = \frac{b}{\|b\|_2}$ provides the steepest ascent (where $\theta = 0$ and $\cos(\theta) = 1$).
• Similarly, the unit direction $d = \frac{-b}{\|b\|_2}$ provides the steepest descent (where $\theta = \pi$ and $\cos(\theta) = -1$).
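The argument above can be checked numerically by comparing $b^T d$ for many random unit vectors $d$ against the candidate $d = b / \|b\|_2$. This is a small illustrative sketch assuming NumPy; the vector `b` and the sample count are arbitrary choices, not values from the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)
b = np.array([3.0, -1.0, 2.0])          # arbitrary hyperplane slope vector

# Sample random unit-length directions d (one per row).
d = rng.normal(size=(10000, 3))
d /= np.linalg.norm(d, axis=1, keepdims=True)

d_ascent = b / np.linalg.norm(b)        # claimed steepest ascent direction
best_sampled = (d @ b).max()            # largest b^T d among the samples
steepest = d_ascent @ b                 # equals ||b||_2

print("best sampled b^T d:", best_sampled)
print("b^T d for d = b/||b||:", steepest)
```

No sampled direction exceeds the value attained by $b / \|b\|_2$, and flipping the sign of that direction gives the smallest possible value, $-\|b\|_2$.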

The gradient and the direction of steepest ascent/descent

• A function $g(w)$ can be approximated locally around a given point $w^0$ by a hyperplane $h(w)$:

$h(w) = g(w^0) + \nabla g(w^0)^T (w - w^0)$

which can be rewritten as $h(w) = a + b^T w$ (as previously).

• Then, we have:

$a = g(w^0) - \nabla g(w^0)^T w^0$ and $b = \nabla g(w^0)$

• This hyperplane is the first-order Taylor series approximation of $g$ at $w^0$, and is tangent to $g$ at this point.

• Because $h$ is constructed to closely approximate $g$ near the point $w^0$, its steepest ascent and descent directions also tell us the direction to travel to increase or decrease the value of the function $g$ itself near the point $w^0$.
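To make the construction concrete, the sketch below builds the tangent hyperplane $h$ for the simple quadratic $g(w) = w_0^2 + w_1^2 + 2$ used elsewhere in these notes and checks that a small step along $\pm b$ changes $g$ in the expected direction. The gradient is written by hand, and the point `w0` and step size are arbitrary illustrative choices.

```python
import numpy as np

def g(w):
    return w[0] ** 2 + w[1] ** 2 + 2.0

def grad_g(w):
    # Hand-written gradient of g.
    return np.array([2.0 * w[0], 2.0 * w[1]])

w0 = np.array([1.0, -2.0])              # arbitrary expansion point

# Coefficients of the first-order Taylor approximation h(w) = a + b^T w.
a = g(w0) - grad_g(w0) @ w0
b = grad_g(w0)

def h(w):
    return a + b @ w

# Small steps along the steepest descent / ascent directions of h.
step = 1e-3
w_down = w0 - step * b / np.linalg.norm(b)
w_up = w0 + step * b / np.linalg.norm(b)

print("g at w0, w_down, w_up:", g(w0), g(w_down), g(w_up))
print("approximation error at w_down:", abs(h(w_down) - g(w_down)))
```

The hyperplane matches $g$ exactly at `w0`, and for a small step the error between $h$ and $g$ is tiny, which is why the hyperplane's descent direction is also a descent direction for $g$ locally.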

Gradient Descent

A local optimization method aims to find minima of a given function $g(w)$ by beginning at some point $w^0$ and taking a number of steps $w^1, w^2, \ldots, w^K$ of the form:

$w^k = w^{k-1} + \alpha d^k$

where:

• $d^k$ are descent directions
• $\alpha$ is the steplength parameter.

The negative gradient $-\nabla g(w)$ of a function $g(w)$ computed at a particular point defines a valid descent direction at that point.

If we employ the negative gradient direction $d^k = -\nabla g(w^{k-1})$, the sequence of steps takes the form:

$w^k = w^{k-1} - \alpha \nabla g(w^{k-1})$

The local optimization method with the above update step is called the gradient descent algorithm.

• Beginning at the initial point $w^0$, we approximate $g(w)$ at the point $(w^0, g(w^0))$ with the first-order Taylor series approximation.
• Moving in the negative gradient direction provided by this approximation, we arrive at $w^1 = w^0 - \alpha \frac{\mathrm{d}}{\mathrm{d}w} g(w^0)$.
• We repeat this process at $w^1$, moving in the negative gradient direction to $w^2 = w^1 - \alpha \frac{\mathrm{d}}{\mathrm{d}w} g(w^1)$.
• …

• Gradient descent is better than naive zero-order approaches.
• The negative gradient direction provides a descent direction for the function locally.
• The descent directions provided via gradients are easier to compute than seeking out a descent direction at random.

The gradient descent algorithm

Input: function $g$, steplength $\alpha$, maximum number of steps $K$, and initial point $w^0$

for $k = 1 \ldots K$
    $w^k = w^{k-1} - \alpha \nabla g(w^{k-1})$

Output: history of weights $\{w^k\}_{k=0}^{K}$ and corresponding function evaluations $\{g(w^k)\}_{k=0}^{K}$

• We can simply return the final set of weights $w^K$ or the entire sequence of gradient descent steps $\{w^k\}_{k=0}^{K}$.

• When does gradient descent stop?

• If the steplength $\alpha$ is chosen properly, the algorithm will stop near stationary points of the function, typically minima or saddle points.
• If the step $w^k = w^{k-1} - \alpha \nabla g(w^{k-1})$ does not move from the prior point $w^{k-1}$, then this can mean that the direction we are traveling in is vanishing, i.e., $-\nabla g(w^k) \approx 0_{N \times 1}$. This is a stationary point of the function.
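The pseudocode above translates almost line for line into Python. The sketch below is an assumption-laden illustration, not the notebook's own `gradient_descent` helper (whose implementation is not shown): here the gradient is passed in explicitly as a hypothetical `grad_g` argument rather than computed automatically. It is run on the polynomial $g(w) = \frac{1}{50}(w^4 + w^2 + 10w)$ with $w^0 = 2.5$, $\alpha = 1$, and 25 steps.

```python
import numpy as np

def gradient_descent_sketch(g, grad_g, alpha, max_its, w_init):
    """Hypothetical sketch: gradient descent with a user-supplied gradient."""
    w = np.asarray(w_init, dtype=float)
    weight_history = [w.copy()]          # w^0, w^1, ..., w^K
    cost_history = [g(w)]                # g(w^0), ..., g(w^K)
    for _ in range(max_its):
        w = w - alpha * grad_g(w)        # w^k = w^{k-1} - alpha * grad g(w^{k-1})
        weight_history.append(w.copy())
        cost_history.append(g(w))
    return weight_history, cost_history

g = lambda w: (w ** 4 + w ** 2 + 10.0 * w) / 50.0
grad_g = lambda w: (4.0 * w ** 3 + 2.0 * w + 10.0) / 50.0   # derivative written by hand

weight_history, cost_history = gradient_descent_sketch(
    g, grad_g, alpha=1.0, max_its=25, w_init=2.5)
w_final = float(weight_history[-1])
print("final weight:", w_final, "final cost:", float(cost_history[-1]))
```

Returning both histories, as in the pseudocode, makes it easy to plot the cost at every step and judge whether the chosen steplength worked.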

static_plotter = static_visualizer()
anim_plotter = anim_visualizer()

Example 1: A convex single-input example

Use gradient descent to minimize the polynomial function

$g(w) = \frac{1}{50}\left(w^4 + w^2 + 10w\right)$.

The global minimum of the function is located at:

$w = \frac{\sqrt[3]{\sqrt{2031} - 45}}{6^{2/3}} - \frac{1}{\sqrt[3]{6\left(\sqrt{2031} - 45\right)}}$

With gradient descent we can determine a point that is close to this one. The gradient of the function is:

$\frac{\partial}{\partial w} g(w) = \frac{2}{25} w^3 + \frac{1}{25} w + \frac{1}{5}$.

We initialize the gradient descent algorithm at $w^0 = 2.5$, with a constant steplength $\alpha = 1$, and run for 25 iterations.

g = lambda w: 1/float(50)*(w**4 + w**2 + 10*w) # try other functions too! Like g = lambda w: n
w = 2.5; alpha = 1; max_its = 25;
weight_history,cost_history = gradient_descent(g,alpha,max_its,w)
anim_plotter.gradient_descent(g,weight_history,savepath=video_path_1,fps=1)

# standard imports
import matplotlib.pyplot as plt
from IPython.display import Image, HTML
from base64 import b64encode
def show_video(video_path, width = 1000):
    # read the video as bytes and embed it inline as a base64 data URL
    with open(video_path, "rb") as f:
        video_file = f.read()
    video_url = f"data:video/mp4;base64,{b64encode(video_file).decode()}"
    return HTML(f"""<video width={width} controls><source src="{video_url}"></video>""")

Example 2: A non-convex single-input example

$g(w) = \sin(3w) + 0.1 w^2$

• In order to find the global minimum of a function using gradient descent, one may need to run it several times with different initializations and/or steplength schemes.
• We initialize two runs at $w^0 = 4.5$ and $w^0 = -1.5$.
• We use a fixed steplength of $\alpha = 0.05$ for all 10 iterations.

g = lambda w: np.sin(3*w) + 0.1*w**2

alpha = 0.05; w = 4.5; max_its = 10;
weight_history_1,cost_history_1 = gradient_descent(g,alpha,max_its,w)
alpha = 0.05; w = -1.5; max_its = 10;
weight_history_2,cost_history_2 = gradient_descent(g,alpha,max_its,w)
static_plotter.single_input_plot(g,[weight_history_1,weight_history_2],[cost_history_1,cost_histo

Depending on where we initialize, we may end up near a local or global minimum.

static_plotter.plot_cost_histories([cost_history_1,cost_history_2],start = 0,points = True

The cost function history plots can be used for debugging as well as for selecting proper values for the steplength α.

Example 3: A convex multi-input example

$g(w_1, w_2) = w_1^2 + w_2^2 + 2$

The gradient of the function is:

$\nabla g(w) = \begin{bmatrix} 2w_1 \\ 2w_2 \end{bmatrix}$

We run gradient descent with 10 steps using the steplength/learning rate value α = 0.1

g = lambda w: np.dot(w.T,w) + 2
w = np.array([1.5,2]); max_its = 10; alpha = 0.2;
weight_history,cost_history = gradient_descent(g,alpha,max_its,w)
static_plotter.two_input_surface_contour_plot(g,weight_history,num_contours = 25,view = [

• We can also plot the cost function history.

static_plotter.plot_cost_histories([cost_history],start = 0,points = True,labels = ['gradient des

• We can view the progress of the optimization run regardless of the dimension of the function we are minimizing.

Basic steplength choices for gradient descent

• We need to choose the steplength/learning rate parameter α carefully for gradient descent.

• The two most common choices are:

1. Using a fixed $\alpha$ value for each step: $\alpha = 10^{\gamma}$, where $\gamma$ is (often) a negative integer.
2. Using a diminishing steplength, e.g., $\alpha = \frac{1}{k}$ at the $k$th step of a run.

• In general, we would like to choose the largest possible value for α that leads to proper convergence.
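What "proper convergence" means is easiest to see on $g(w) = w^2$, where each gradient descent step simply multiplies $w$ by $(1 - 2\alpha)$. The sketch below is a plain-Python illustration with a hand-written derivative; the helper name `run_fixed_steplength` and the particular α values tried are assumptions chosen for the demonstration.

```python
def run_fixed_steplength(alpha, w_init=-2.5, max_its=5):
    # Gradient descent on g(w) = w**2, whose derivative is 2*w.
    w = w_init
    history = [w]
    for _ in range(max_its):
        w = w - alpha * 2.0 * w
        history.append(w)
    return history

for alpha in (0.1, 0.5, 1.0, 1.1):
    final_w = run_fixed_steplength(alpha)[-1]
    print(f"alpha = {alpha}: |w| after 5 steps = {abs(final_w):.4f}")
```

For this particular function, small values of α make slow progress, α = 0.5 lands on the minimum in one step, α = 1 bounces between ±2.5 forever, and anything larger diverges.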

Example: Fixed steplength selection for a single-input convex function

$g(w) = w^2$

demo_2d = grad_descent_visualizer_2d()
demo_3d = grad_descent_visualizer_3d()
video_path_2 = './video_2.mp4'

g = lambda w: w**2
w_init = -2.5
steplength_range = np.linspace(10**-5,1.5,150)
max_its = 5

demo_2d.animate_it(savepath=video_path_2,w_init = w_init, g = g, steplength_range = steplength_ra

Example: Fixed steplength selection for a multi-input non-convex function

$g(w_1, w_2) = \sin(w_1)$

g = lambda w: np.sin(w[0])
w_init = [1,0]; alpha_range = np.linspace(2*10**-4,5,200); max_its = 10; view = [10,120];
demo_3d.animate_it(savepath=video_path_3,g = g,w_init = w_init,alpha_range = alpha_range,max_its

Example: Comparing fixed and diminishing steplengths for a single-input convex function

$g(w) = |w|$.

This function has a single global minimum at $w = 0$ and a derivative defined (everywhere but at $w = 0$) as

$\frac{\mathrm{d}}{\mathrm{d}w} g(w) = \begin{cases} +1 & \text{if } w > 0 \\ -1 & \text{if } w < 0. \end{cases}$

We make two runs of 20 steps of gradient descent, each initialized at the point $w^0 = 1.75$ (the value used in the code):

• The first run uses a fixed steplength rule of $\alpha = 0.5$.
• The second run uses a diminishing steplength rule: $\alpha = \frac{1}{k}$.
g = lambda w: np.abs(w)
alpha_choice = 0.5; w = 1.75; max_its = 20;
weight_history_1,cost_history_1 = gradient_descent(g,alpha_choice,max_its,w)
alpha_choice = 'diminishing'; w = 1.75; max_its = 20;
weight_history_2,cost_history_2 = gradient_descent(g,alpha_choice,max_its,w)
static_plotter.single_input_plot(g,[weight_history_1,weight_history_2],[cost_history_1,cost_histo

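• For this function, a diminishing steplength is necessary to reach a point close to the minimum: the derivative always has magnitude 1, so every fixed-steplength step has the same length α, and such a run typically ends up hopping back and forth across the minimum.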
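The two runs can be reproduced in a few lines without the plotting helpers. This is a minimal sketch, assuming NumPy and using `np.sign` as the derivative of $|w|$ (it returns 0 at $w = 0$, a convenient stand-in where the derivative is undefined); the function name `abs_descent` is hypothetical.

```python
import numpy as np

def abs_descent(alpha_rule, w_init=1.75, max_its=20):
    # Gradient descent on g(w) = |w|; alpha_rule is a number or "diminishing".
    w = w_init
    weights = [w]
    for k in range(1, max_its + 1):
        alpha = 1.0 / k if alpha_rule == "diminishing" else alpha_rule
        w = w - alpha * np.sign(w)       # np.sign(w) is the derivative for w != 0
        weights.append(w)
    return weights

fixed = abs_descent(0.5)
diminishing = abs_descent("diminishing")

best_fixed = min(abs(w) for w in fixed)
best_diminishing = min(abs(w) for w in diminishing)
print("closest approach, fixed alpha:      ", best_fixed)
print("closest approach, diminishing alpha:", best_diminishing)
```

With the fixed rule the iterates get stuck alternating between two points on either side of zero, while the diminishing rule keeps shrinking its hops and gets much closer.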

static_plotter.plot_cost_histories([cost_history_1,cost_history_2],start = 0,points = True

Oscillation in the cost function history plot: not always a bad thing

• We often use the cost function history plot to tune the steplength parameter α.
• It is not ultimately important that the plot be strictly decreasing (i.e., that the algorithm descends at every single step).
• It is critical to find a value of α that allows gradient descent to find the lowest possible function value.
• Sometimes, the best choice of α for a given minimization might cause gradient descent to move up and down.

Example
Minimize the function:
$g(w) = w_0^2 + w_1^2 + 2\sin\left(1.5\,(w_0 + w_1)\right)^2 + 2$

g = lambda w: w[0]**2 + w[1]**2 + 2*np.sin(1.5*(w[0] + w[1]))**2 + 2

static_plotter.two_input_original_contour_plot(g,num_contours = 25,xmin = -4,xmax = 4, ymin =

• We make three runs, all starting at the same initial point $w^0 = \begin{bmatrix} 3 \\ 3 \end{bmatrix}$ and taking 10 steps, each with a different fixed steplength: the first run uses $\alpha = 10^{-2}$, the second run $\alpha = 10^{-1}$, and the third run $\alpha = 10^{0}$.

# first run
w = np.array([3.0,3.0]); max_its = 10;
alpha_choice = 10**(-2);
weight_history_1,cost_history_1 = gradient_descent(g,alpha_choice,max_its,w)

# second run
alpha_choice = 10**(-1);
weight_history_2,cost_history_2 = gradient_descent(g,alpha_choice,max_its,w)

# third run
alpha_choice = 10**(0);
weight_history_3,cost_history_3 = gradient_descent(g,alpha_choice,max_its,w)

histories = [weight_history_1,weight_history_2,weight_history_3]
static_plotter.two_input_contour_horiz_plots(g,histories,show_original=False,num_contours=

static_plotter.plot_cost_histories([cost_history_1,cost_history_2,cost_history_3],start =

• For run 1, $\alpha = 10^{-2}$ was too small.
• For run 2, $\alpha = 10^{-1}$, the algorithm descends at each step, but converges to the local minimum near $\begin{bmatrix} 1.5 \\ 1.5 \end{bmatrix}$.
• For run 3, $\alpha = 10^{0}$, the algorithm oscillates wildly, but reaches a point near the global minimum near $\begin{bmatrix} -0.5 \\ -0.5 \end{bmatrix}$.

• This example was designed specifically for demonstration purposes. However, in practice, it is perfectly fine for the cost function history of gradient descent to oscillate up and down.

Convergence behavior and steplength parameter selection

• When does gradient descent stop?

• If the steplength is chosen properly, the algorithm will halt near stationary points of a function, typically minima or saddle points.
• If the step $w^k = w^{k-1} - \alpha \nabla g(w^{k-1})$ does not move significantly from the prior point $w^{k-1}$, then this can mean only one thing: the direction we are traveling in is vanishing, i.e., $-\nabla g(w^k) \approx 0_{N \times 1}$.

• We can wait for gradient descent to get sufficiently close to a stationary point, i.e., until $\|\nabla g(w^{k-1})\|_2$ is sufficiently small.
• Or we can stop when steps no longer make sufficient progress, i.e., $\frac{1}{N}\|w^k - w^{k-1}\|_2 < \epsilon$.
• Or when corresponding evaluations no longer differ substantially, i.e., $\frac{1}{N}\left|g(w^k) - g(w^{k-1})\right| < \epsilon$.

• A practical way to halt gradient descent is to simply run the algorithm for a fixed maximum number of iterations.
• This is typically set manually/heuristically depending on computing resources, domain knowledge, and the choice of the steplength parameter α.
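The stopping rules listed above can be combined in one loop. The following is a hedged sketch rather than the notebook's implementation: the function name `gradient_descent_with_stopping`, the tolerance `tol`, and the returned "reason" string are all illustrative assumptions, and the gradient is supplied by hand for $g(w) = w^T w + 2$.

```python
import numpy as np

def gradient_descent_with_stopping(grad_g, alpha, w_init, max_its=1000, tol=1e-6):
    """Hypothetical sketch: stop on small gradient, small step, or max iterations."""
    w = np.asarray(w_init, dtype=float)
    for k in range(1, max_its + 1):
        grad = grad_g(w)
        if np.linalg.norm(grad) < tol:                  # near a stationary point
            return w, k - 1, "small gradient"
        w_new = w - alpha * grad
        if np.linalg.norm(w_new - w) / w.size < tol:    # (1/N) * ||w^k - w^{k-1}||_2
            return w_new, k, "small step"
        w = w_new
    return w, max_its, "max iterations"

grad_g = lambda w: 2.0 * w                              # gradient of g(w) = w^T w + 2
w_final, steps_taken, reason = gradient_descent_with_stopping(
    grad_g, alpha=0.1, w_init=[1.5, 2.0])
print(f"stopped after {steps_taken} steps ({reason}) at {w_final}")
```

Keeping the maximum iteration count as a fallback guarantees termination even when the tolerance-based tests are never met, for example when α is too large.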

Natural Weaknesses of Gradient Descent

static_plotter = static_visualizer()
demo = grad_descent_visualizer()

• Gradient descent is a local optimization method that employs the negative gradient direction at each step.
• The negative gradient direction is a true descent direction and is often cheap to compute.
• Like any vector, the negative gradient consists of a direction and a magnitude.
• Depending on the function being minimized, these attributes can pose different challenges when using the negative gradient as a descent direction.

The negative gradient direction

• A fundamental property of the (negative) gradient direction is that it always points perpendicular to the contours of a function.
• The gradient ascent/descent direction at an input $w^0$ is always perpendicular to the contour $g(w) = g(w^0)$.

Example 1. Gradient descent directions on the contour plot of a quadratic function

$g(w) = w_0^2 + w_1^2 + 2$

g = lambda w: w[0]**2 + w[1]**2 + 2

pts = np.array([[ 4.24698761, 1.39640246, -3.75877989],
[-0.49560712, 3.22926095, -3.65478083]])
illustrate_gradients(g,pts);

Example 2. Gradient descent directions on the contour plot of a wavy function

$g(w) = w_0^2 + w_1^2 + 2\sin\left(1.5\,(w_0 + w_1)\right)^2 + 2$.

g = lambda w: w[0]**2 + w[1]**2 + 2*np.sin(1.5*(w[0] + w[1]))**2 + 2

pts = np.array([[ 4.24698761, 1.39640246, -3.75877989],
[-0.49560712, 3.22926095, -3.65478083]])
illustrate_gradients(g,pts)

Example 3. Gradient descent directions on the contour plot of a standard non-convex test function

g = lambda w: (w[0]**2 + w[1] - 11)**2 + (w[0] + w[1]**2 - 7)**2

pts = np.array([[ 2.2430266 , -1.06962305, -1.60668751],
[-0.57717812, 1.38128471, -1.61134124]])
illustrate_gradients(g,pts)

The (negative) gradient direction points perpendicular to the contours of any function

• If we suppose $g(w)$ is a differentiable function and $a$ is some input point, then $a$ lies on the contour defined by all those points where $g(w) = g(a) = c$ for some constant $c$.

• If we take another point $b$ from this contour very close to $a$, then the vector $a - b$ is essentially perpendicular to the gradient $\nabla g(a)$. Indeed, $g(b) = g(a)$ and, to first order, $g(b) \approx g(a) + \nabla g(a)^T (b - a)$, so

$\nabla g(a)^T (a - b) = 0$

essentially defines the line in the input space whose normal vector is precisely $\nabla g(a)$.

• So indeed both the ascent and descent directions defined by the gradient (i.e., the positive and negative gradient directions) of $g$ at $a$ are perpendicular to the contour there.

• And since $a$ was an arbitrary input of $g$, the same argument holds for each of its inputs.
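This perpendicularity can be verified numerically on the quadratic $g(w) = w_0^2 + w_1^2 + 2$, whose contours are circles centered at the origin. The sketch below assumes NumPy; the radius, angle, and offset `eps` are arbitrary illustrative values.

```python
import numpy as np

grad_g = lambda w: np.array([2.0 * w[0], 2.0 * w[1]])   # gradient of w0^2 + w1^2 + 2

radius, angle, eps = 2.0, 0.7, 1e-4
a = radius * np.array([np.cos(angle), np.sin(angle)])
# A nearby point on the same circular contour g(w) = g(a).
b = radius * np.array([np.cos(angle + eps), np.sin(angle + eps)])

diff = a - b
# Cosine of the angle between the gradient at a and the chord a - b.
cosine = grad_g(a) @ diff / (np.linalg.norm(grad_g(a)) * np.linalg.norm(diff))
print("cosine of angle between gradient and contour chord:", cosine)
```

The cosine is close to zero and shrinks further as `eps` decreases, consistent with the gradient being perpendicular to the contour in the limit.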

1. The 'zig-zagging' behavior of gradient descent

• In practice, the fact that the negative gradient always points perpendicular to the contours of a function can, depending on the function being minimized, make the negative gradient direction oscillate rapidly or zig-zag during a run of gradient descent.
• This in turn can cause zig-zagging behavior in the gradient descent steps themselves.
• Too much zig-zagging slows minimization progress, and many gradient descent steps are then required to adequately minimize a function.

Example 4. Zig-zagging behavior of gradient descent on three simple quadratic functions

We consider three $N = 2$ dimensional quadratics that take the general form

$g(w) = a + b^T w + w^T C w$.

The constants $a$ and $b$ are set to zero, and the matrix $C$ is set as follows:
• the first quadratic (shown in the top panel below) has $C = \begin{bmatrix} 0.5 & 0 \\ 0 & 12 \end{bmatrix}$
• the second quadratic (shown in the middle panel below) has $C = \begin{bmatrix} 0.1 & 0 \\ 0 & 12 \end{bmatrix}$
• the third quadratic (shown in the bottom panel below) has $C = \begin{bmatrix} 0.01 & 0 \\ 0 & 12 \end{bmatrix}$

The three quadratics differ only in the top left entry of their $C$ matrix. As we shrink this single value of $C$ we elongate the contours significantly along the horizontal axis.

We run 25 gradient descent steps with initialization at $w^0 = \begin{bmatrix} 10 \\ 1 \end{bmatrix}$ and $\alpha = 10^{-1}$.

a1 = 0
b1 = 0*np.ones((2,1))
C1 = np.array([[0.5,0],[0,9.75]])
g1 = lambda w: (a1 + np.dot(b1.T,w) + np.dot(np.dot(w.T,C1),w))[0]
w = np.array([10.0,1.0]); max_its = 25; alpha_choice = 10**(-1);
weight_history_1,cost_history_1 = gradient_descent(g1,alpha_choice,max_its,w)

a2 = 0
b2 = 0*np.ones((2,1))
C2 = np.array([[0.1,0],[0,9.75]])
g2 = lambda w: (a2 + np.dot(b2.T,w) + np.dot(np.dot(w.T,C2),w))[0]
weight_history_2,cost_history_2 = gradient_descent(g2,alpha_choice,max_its,w)

a3 = 0
b3 = 0*np.ones((2,1))
C3 = np.array([[0.01,0],[0,9.75]])
g3 = lambda w: (a3 + np.dot(b3.T,w) + np.dot(np.dot(w.T,C3),w))[0]
weight_history_3,cost_history_3 = gradient_descent(g3,alpha_choice,max_its,w)

histories = [weight_history_1,weight_history_2,weight_history_3]
gs = [g1,g2,g3]

static_plotter.two_input_contour_vert_plots(gs,histories,num_contours = 20,xmin = -1,xmax =

The 'zig-zagging' behavior of gradient descent

The zig-zagging behavior of gradient descent in each of the cases above is completely due to the rapid change in negative gradient direction during each run.
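The effect can be reproduced without the plotting helpers by tracking the weights directly. The sketch below is illustrative only: it assumes NumPy, writes the gradient $2Cw$ by hand, and uses the most elongated matrix from the code cells above (`C3`, with entries 0.01 and 9.75) together with the same initialization and steplength.

```python
import numpy as np

C = np.array([[0.01, 0.0],
              [0.0, 9.75]])
grad_g = lambda w: 2.0 * C @ w           # gradient of g(w) = w^T C w

w = np.array([10.0, 1.0])
alpha = 0.1
weights = [w.copy()]
for _ in range(25):
    w = w - alpha * grad_g(w)
    weights.append(w.copy())
weights = np.array(weights)

# Count how often the vertical coordinate flips sign between consecutive steps.
sign_flips = int(np.sum(np.sign(weights[1:, 1]) != np.sign(weights[:-1, 1])))
print("sign flips in the vertical coordinate:", sign_flips)
print("horizontal coordinate after 25 steps: ", weights[-1, 0])
```

The vertical coordinate jumps across the valley at every step, while the horizontal coordinate, which is the direction that actually leads to the minimum, barely moves.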

static_plotter.plot_grad_directions_v2(weight_history_1)

static_plotter.plot_grad_directions_v2(weight_history_2)


static_plotter.plot_grad_directions_v2(weight_history_3)

The slow convergence caused by zig-zagging in each case can be seen in the slow decrease of
the cost function history associated with each run.

static_plotter.plot_cost_histories([cost_history_1,cost_history_2,cost_history_3],start =


• We can ameliorate this zig-zagging behavior by reducing the steplength value α.
• However, this does not solve the underlying problem that zig-zagging produces, which is slow convergence.
• Typically, ameliorating or even eliminating zig-zagging this way requires a very small steplength, which leads back to the fundamental problem of slow convergence.

a1 = 0
b1 = 0*np.ones((2,1))
C1 = np.array([[0.5,0],[0,9.75]])
g1 = lambda w: (a1 + np.dot(b1.T,w) + np.dot(np.dot(w.T,C1),w))[0]
w = np.array([10.0,1.0]); max_its = 15; alpha_choice = 10**(-2);
weight_history_1,cost_history_1 = gradient_descent(g1,alpha_choice,max_its,w)
static_plotter.two_input_contour_plot(g1,weight_history_1,show_original = False,num_contours =
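
As a rough check of this trade-off, the sketch below runs plain gradient descent on the same quadratic with two steplengths. It counts how often the second gradient coordinate flips sign and reports the final cost. The hand-coded gradient $2Cw$, the helper `run_gd`, and the comparison value `alpha = 0.1` are assumptions made for illustration only.

```python
import numpy as np

C = np.array([[0.5, 0.0], [0.0, 9.75]])
cost = lambda w: float(w @ C @ w)
grad = lambda w: 2.0 * C @ w   # valid because C is symmetric

def run_gd(alpha, max_its=15, w0=(10.0, 1.0)):
    """Return (number of sign flips in the second gradient coordinate, final cost)."""
    w = np.array(w0, dtype=float)
    flips = 0
    prev_sign = np.sign(grad(w)[1])
    for _ in range(max_its):
        w = w - alpha * grad(w)
        sign = np.sign(grad(w)[1])
        flips += int(sign != prev_sign)   # a flip means the path crossed the valley floor
        prev_sign = sign
    return flips, cost(w)

flips_large, cost_large = run_gd(alpha=0.1)    # assumed larger steplength
flips_small, cost_small = run_gd(alpha=0.01)   # steplength used in the cell above
print(flips_large, cost_large)
print(flips_small, cost_small)
```

Under these assumptions the smaller steplength removes the sign flips entirely, yet it ends the 15 steps at a higher cost than the zig-zagging run.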

2. The (negative) gradient magnitude vanishes near stationary points

• As we know from the first-order condition for optimality, the (negative) gradient vanishes
at stationary points.

• That is, if $w$ is a minimum, maximum, or saddle point, then we know that $\nabla g(w) = 0$.

• The magnitude of the gradient vanishes at stationary points, that is, $\|\nabla g(w)\|_2 = 0$.

• By extension, the (negative) gradient at points near a stationary point has a nonzero
direction but vanishing magnitude, i.e., $\|\nabla g(w)\|_2 \approx 0$.

 The slow-crawling behavior of gradient descent

• Because the magnitude of the negative gradient vanishes near stationary points,
gradient descent steps progress very slowly (or crawl) there.

• Consider the general local optimization step:

$$w^k = w^{k-1} + \alpha d^{k-1}$$

We saw that if $d^{k-1}$ is a unit-length descent direction found by any zero-order search approach,
then the distance traveled with this step equals precisely the steplength value $\alpha$, since

$$\|w^k - w^{k-1}\|_2 = \|(w^{k-1} + \alpha d^{k-1}) - w^{k-1}\|_2 = \alpha \|d^{k-1}\|_2 = \alpha.$$

Again, the key assumption made here was that our descent direction $d^{k-1}$ had unit length.

• However, for gradient descent, our descent direction $d^{k-1} = -\nabla g(w^{k-1})$ is not
guaranteed to have unit length.

• We travel a distance proportional to the magnitude of the gradient, namely $\alpha \|\nabla g(w^{k-1})\|_2$:

$$\|w^k - w^{k-1}\|_2 = \|(w^{k-1} - \alpha \nabla g(w^{k-1})) - w^{k-1}\|_2 = \alpha \|\nabla g(w^{k-1})\|_2.$$
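
The two distance formulas above can be checked numerically. The sketch below is a minimal illustration that assumes the quadratic $g(w) = w^T C w$ with $C = \mathrm{diag}(0.5, 9.75)$ and its hand-computed gradient $2Cw$; the variable names are hypothetical.

```python
import numpy as np

C = np.array([[0.5, 0.0], [0.0, 9.75]])
grad = lambda w: 2.0 * C @ w   # gradient of w^T C w for symmetric C

alpha = 0.01
w_prev = np.array([10.0, 1.0])
g_prev = grad(w_prev)

# Standard gradient descent step: distance traveled is alpha * ||gradient||.
w_next = w_prev - alpha * g_prev
dist_gd = np.linalg.norm(w_next - w_prev)

# Step along the unit-length version of the same direction: distance is exactly alpha.
d_unit = -g_prev / np.linalg.norm(g_prev)
dist_unit = np.linalg.norm((w_prev + alpha * d_unit) - w_prev)

print(dist_gd, alpha * np.linalg.norm(g_prev))
print(dist_unit, alpha)
```

Each printed pair should agree up to floating-point error, so the standard step shrinks automatically whenever the gradient does.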

 Example 5. Slow-crawling behavior of GD near the minimum of a function

$$g(w) = w^4 + 0.1$$
• We initialize far from the minimum and set the steplength $\alpha = 10^{-1}$.

g = lambda w: w**4 + 0.1

w = -1.0; max_its = 10; alpha_choice = 10**(-1);
weight_history,cost_history = gradient_descent(g,alpha_choice,max_its,w)
static_plotter.single_input_plot(g,[weight_history],[cost_history],wmin = -1.1,wmax = 1.1


• Gradient descent crawls as it approaches the minimum because the magnitude of the
gradient vanishes there.
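
The crawl can be made concrete without any plotting helper. The following sketch repeats the run with the derivative written by hand, $g'(w) = 4w^3$, and records the length of every step. The names `dg`, `weights`, and `steps` are illustrative and not part of the notebook's own `gradient_descent` function.

```python
g = lambda w: w**4 + 0.1
dg = lambda w: 4.0 * w**3      # derivative of g, computed by hand

w, alpha = -1.0, 10**(-1)
weights = [w]
for _ in range(10):
    w = w - alpha * dg(w)      # standard gradient descent step
    weights.append(w)

# Distance traveled at each step, equal to alpha * |g'(w)|.
steps = [abs(b - a) for a, b in zip(weights[:-1], weights[1:])]
for k, s in enumerate(steps, start=1):
    print(k, round(s, 4))
```

The step lengths shrink monotonically, and after 10 steps the iterate is still well short of the minimum at $w = 0$.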

 Example 6. Slow-crawling behavior of GD near saddle points

Consider the non-convex function

$$g(w) = \max\left(0, (3w - 2.3)^3 + 1\right)^2 + \max\left(0, (-3w + 0.7)^3 + 1\right)^2$$

which has a minimum at $w = \frac{1}{2}$ and saddle points at $w = \frac{7}{30}$ and $w = \frac{23}{30}$.

We make a run of gradient descent on this function using 50 steps with $\alpha = 10^{-2}$, initialized
such that it approaches one of these saddle points and so slows to a halt.

g = lambda w: np.maximum(0,(3*w - 2.3)**3 + 1)**2 + np.maximum(0,(-3*w + 0.7)**3 + 1)**2

demo.draw_2d(g=g, w_inits = [0],steplength = 0.01,max_its = 50,version = 'unnormalized',wmin =
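
For readers without the plotting helpers, the same run can be sketched with a hand-derived derivative. The function `dg` below follows from the chain rule applied to each squared `max` term; it and the variable names are assumptions for illustration and do not reproduce how `demo.draw_2d` computes gradients.

```python
import numpy as np

def g(w):
    return (np.maximum(0.0, (3*w - 2.3)**3 + 1)**2
            + np.maximum(0.0, (-3*w + 0.7)**3 + 1)**2)

def dg(w):
    # Chain rule: d/dw max(0, p(w))^2 = 2 * max(0, p(w)) * p'(w).
    u, v = 3*w - 2.3, -3*w + 0.7
    return (2*np.maximum(0.0, u**3 + 1) * 9*u**2
            - 2*np.maximum(0.0, v**3 + 1) * 9*v**2)

w, alpha = 0.0, 0.01
for _ in range(50):
    w = w - alpha * dg(w)      # standard (unnormalized) gradient descent step

saddle = 7/30
print(w, saddle, abs(dg(w)))
```

After 50 steps the iterate sits just below the saddle point at $w = \frac{7}{30}$ with a very small derivative, still far from the minimum at $w = \frac{1}{2}$.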

Example 7. Slow-crawling behavior of GD in large flat regions of a function


$$g(w_0, w_1) = \tanh(4w_0 + 4w_1) + \max(1, 0.4w_0^2) + 1$$

• We start gradient descent at the point $w^0 = \begin{bmatrix} 2 \\ 2 \end{bmatrix}$, in a long narrow valley.

• Because the magnitude of the gradient is almost zero here, we cannot make much progress
even with 1000 steps of gradient descent and a steplength $\alpha = 10^{-1}$.

g = lambda w: np.tanh(4*w[0] + 4*w[1]) + max(0.4*w[0]**2,1) + 1

w = np.array([1.0,2.0]); max_its = 1000; alpha_choice = 10**(-1);
weight_history_1,cost_history_1 = gradient_descent(g,alpha_choice,max_its,w)
static_plotter.two_input_surface_contour_plot(g,weight_history_1,view = [20,300],num_contours =
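
The lack of progress can also be checked numerically. This sketch assumes a hand-derived gradient (the helper `grad` is hypothetical) and uses the initialization from the code cell above, `w = np.array([1.0, 2.0])`. It reports the gradient magnitude at the start and the total displacement after 1000 steps.

```python
import numpy as np

def grad(w):
    s = 4*w[0] + 4*w[1]
    dtanh = 4.0 / np.cosh(s)**2                    # partial derivative of tanh(4*w0 + 4*w1)
    dmax = 0.8*w[0] if 0.4*w[0]**2 > 1 else 0.0    # derivative of max(1, 0.4*w0**2)
    return np.array([dtanh + dmax, dtanh])

alpha = 10**(-1)
w_init = np.array([1.0, 2.0])
w = w_init.copy()
for _ in range(1000):
    w = w - alpha * grad(w)    # standard gradient descent step

print(np.linalg.norm(grad(w_init)))   # gradient magnitude at the starting point
print(np.linalg.norm(w - w_init))     # total distance moved after 1000 steps
```

Both printed numbers are tiny, which is why the run appears frozen in the contour plot.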
