Advanced ML
Fall 2024
Homework 2
Problem 2: Differentiation
In what follows we will use the convention that the gradient is a column vector, which is the
transpose of the Jacobian in dimension 1. In particular, any vector x ∈ Rd is treated as a column vector.
(a) Let’s consider the function g(x) = Ax, where x ∈ Rd and A ∈ Rk×d . In the special case
k = 1, show that the gradient of g is equal to A⊤ .
(b) Now, consider the case k > 1, where we might write g(x) = Ax = [g1 (x), g2 (x), ..., gk (x)]⊤ .
Recall that the Jacobian is a generalization of the gradient to multivariate functions g. That
is, the Jacobian of g is the matrix of partial derivatives whose (i, j)-th entry is ∂gi(x)/∂xj. How
does the Jacobian matrix relate to the gradients of the components gi of g? Argue from
there that the Jacobian matrix of g above is given as Jg (x) = A.
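As an optional sanity check (not part of the problem statement), the short numpy snippet below compares a finite-difference Jacobian of x ↦ Ax against A for a random A; the helper numerical_jacobian is a name introduced here only for illustration.

import numpy as np

def numerical_jacobian(g, x, eps=1e-6):
    # Approximate the Jacobian of g at x by central differences, one column per coordinate of x.
    k, d = g(x).shape[0], x.shape[0]
    J = np.zeros((k, d))
    for j in range(d):
        e = np.zeros(d)
        e[j] = eps
        J[:, j] = (g(x + e) - g(x - e)) / (2 * eps)
    return J

A = np.random.randn(3, 4)                     # k = 3, d = 4
x = np.random.randn(4)
J = numerical_jacobian(lambda v: A @ v, x)
print(np.allclose(J, A, atol=1e-5))           # True: the Jacobian of x -> Ax is A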
(c) Now consider the function g(x) = x⊤ Ax, where x ∈ Rd and A ∈ Rd×d . Show that the
gradient of g is given as ∇g(x) = Ax + A⊤ x (it then follows that when A is symmetric,
∇g(x) = 2Ax).
Hint: You can write x⊤Ax = Σ_{i,j} Ai,j · xi xj.
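Similarly, the identity in part (c) can be spot-checked numerically (again optional); the finite-difference helper numerical_gradient below is introduced only for illustration.

import numpy as np

def numerical_gradient(f, x, eps=1e-6):
    # Approximate the gradient of a scalar-valued f at x by central differences.
    g = np.zeros_like(x)
    for j in range(x.size):
        e = np.zeros_like(x)
        e[j] = eps
        g[j] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

d = 4
A = np.random.randn(d, d)                     # not symmetric in general
x = np.random.randn(d)
approx = numerical_gradient(lambda v: v @ A @ v, x)
print(np.allclose(approx, A @ x + A.T @ x, atol=1e-4))   # True: gradient is Ax + A^T x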
The problem asks you to implement the Gradient Descent and Newton methods in Python to minimize the above function F, and in each case, to plot F(xt) as a function of the iteration t ∈ [1, 2, . . . , 30] (all on the same plot for comparison).
Implementation details: For either method, and any step size, start at iteration t = 1 from the same point x1 = (0, 0), for a fair comparison.
Here we can calculate 2/β = 0.195, where β is the smoothness constant of F (obtained from its Hessian). According to theory, we should set the step size η for Gradient Descent to η < 2/β. In
the case of Newton, which uses a better approximation, we can use a larger step size.
Therefore, on the same figure, plot F(xt) against t ∈ [1, 2, . . . , 30] for the following 4 configurations, and label each curve appropriately as given below (NM1 through GD0.2):
The code for the function F(x), its gradient, and its Hessian matrix is given below.
import numpy as np
import math

Sigma = np.array([[5, 0], [0, 0.5]])
II = np.array([[1, 1]])

# x is a 2-by-1 array starting from np.array([[0.0], [0.0]])
def func0(x):
    # F(x) = x^T Sigma x + log(1 + exp(-(x1 + x2))); returns a scalar
    quad = float(np.dot(np.dot(x.transpose(), Sigma), x))
    return quad + math.log(1 + math.exp(-float(np.dot(II, x))))

def First_derivative(x):
    # Gradient of F at x, returned as a 2-by-1 array
    x1 = x[0][0]
    x2 = x[1][0]
    ex = math.exp(-(x1 + x2))
    return np.array([[10 * x1 - ex / (1 + ex)], [x2 - ex / (1 + ex)]])

def Second_derivative(x):
    # Hessian of F at x, returned as a 2-by-2 array
    x1 = x[0][0]
    x2 = x[1][0]
    ex = math.exp(-(x1 + x2))
    ex = ex / (1 + ex) ** 2
    return np.array([[10 + ex, ex], [ex, 1 + ex]])
Hint: Use Taylor's Remainder Theorem (which holds, for instance, when ∇²F is continuous).
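The four labelled step-size configurations (NM1 through GD0.2) are specified in the assignment and are not reproduced in this excerpt, so the sketch below, which reuses func0, First_derivative, and Second_derivative from the code above, runs two placeholder configurations only to illustrate the iteration and plotting loop. Its first lines also check the stated value 2/β = 0.195 by computing the largest Hessian eigenvalue at x = (0, 0), where the logistic curvature term ex / (1 + ex)² attains its maximum.

import numpy as np
import matplotlib.pyplot as plt
# assumes func0, First_derivative, Second_derivative from the code above are defined

# Largest Hessian eigenvalue at the start point gives the smoothness constant beta here.
beta = np.linalg.eigvalsh(Second_derivative(np.array([[0.0], [0.0]]))).max()
print(2 / beta)                                # approximately 0.195

def run(method, eta, T=30):
    # Run T iterations from x1 = (0, 0) and record F(x_t) at every step.
    x = np.array([[0.0], [0.0]])
    values = []
    for t in range(1, T + 1):
        values.append(float(func0(x)))
        g = First_derivative(x)
        if method == "GD":
            x = x - eta * g
        else:                                  # Newton: rescale the gradient by the inverse Hessian
            x = x - eta * np.linalg.solve(Second_derivative(x), g)
    return values

# Placeholder configurations -- replace with the four labelled settings
# (NM1 through GD0.2) from the assignment.
configs = [("Newton, eta = 1.0", "Newton", 1.0), ("GD, eta = 0.1", "GD", 0.1)]
for label, method, eta in configs:
    plt.plot(range(1, 31), run(method, eta), label=label)
plt.xlabel("iteration t")
plt.ylabel("F(x_t)")
plt.legend()
plt.show()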
Problem 5: Regular SGD vs Adagrad optimization on a Regression problem
Consider a random vector X ∼ N(0, Σ) with covariance matrix Σ = diag(σ∗1², σ∗2², . . . , σ∗d²). Here we let d = 10 and σ∗i = 2^{-i} · 10. Let the response Y be generated as
Y = w∗⊤X + N(0, σ²),
for an optimal vector w∗ = [1, 1, 1, . . . , 1]⊤ (assumed to be unknown to the learner), and σ = 1. Consider the
following Ridge objective, defined over i.i.d. data {(Xi , Yi )}ni=1 from the above distribution:
F(w) = (1/n) Σ_{i=1}^{n} (Yi − w⊤Xi)² + λ∥w∥²,
Thus F(w) can be written as (1/n) Σ_{i=1}^{n} fi(w), where the term fi(w) = (Yi − w⊤Xi)² + λ∥w∥² is just in terms of the single random data sample (Xi, Yi).
We define the stochastic gradient (evaluated at w) at time t by ∇ft(w) (that is, at index i = t).
At step t in both of the procedures below, when the current w is wt, we use the common notation
∇̃F(wt) := ∇ft(wt).
Algorithm 2: Adagrad
The steps below use a parameter β to be specified;
At time t = 1: set w1 = 0;
while t ≤ n do
Compute ∇̃F(wt) on the t-th datapoint (Xt, Yt);
∀i ∈ [d], set ηt,i = 1/(β · √(σt,i²)), where σt,i² = Σ_{s=1}^{t} (∇̃iF(ws))² (sum over the i-th coordinate of past ∇̃'s);
Update wt+1 = wt − ηt ∇̃F(wt), where now ηt = diag(ηt,1, . . . , ηt,d);
t = t+1;
end
Report the plots in (c) for the two procedures on the same figure for comparison.
Data sampler code:
## Code for generator / sampler
import numpy as np
import random
import time
from numpy import linalg as LA
import statistics

# initialization
sigma = 1
d = 10
c_square = 100
# cov = diag(sigma_{*i}^2) with sigma_{*i} = 2^{-i} * 10, so sigma_{*i}^2 = 0.25^i * 100
cov = np.diag([(0.25 ** i) * c_square for i in range(1, d + 1)])
mean = [0] * d
# coefficient vector w* (given)
w = np.array([1] * d)

# Sampler function
def sampler(n):
    # data X generator
    np.random.seed(int(time.time() * 100000) % 100000)
    X = np.random.multivariate_normal(mean, cov, n)
    # data Y generator: the noise has standard deviation sigma (variance sigma**2)
    Y = np.matmul(X, w) + np.random.normal(0, sigma, n)
    return (X, Y)
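Algorithm 1 (the plain SGD baseline) is not reproduced in this excerpt, so the sketch below assumes the standard single-sample update wt+1 = wt − η∇̃F(wt) with a fixed step size; the Adagrad branch follows Algorithm 2. The constants n, λ, η, and β, as well as the choice to plot the full ridge objective F(wt), are illustrative placeholders rather than the assignment's settings; the stochastic gradient used is ∇ft(w) = −2(Yt − w⊤Xt)Xt + 2λw, from the fi(w) written above.

import numpy as np
import matplotlib.pyplot as plt
# assumes sampler() and d from the sampler code above are defined

def stoch_grad(v, x_t, y_t, lam):
    # Gradient of f_t(v) = (y_t - v^T x_t)^2 + lam * ||v||^2
    return -2.0 * (y_t - v @ x_t) * x_t + 2.0 * lam * v

def ridge_objective(v, X, Y, lam):
    # Full ridge objective F(v) over the whole sample
    return np.mean((Y - X @ v) ** 2) + lam * np.dot(v, v)

n, lam, eta, beta = 1000, 0.01, 0.01, 1.0        # placeholder constants
X, Y = sampler(n)

w_sgd = np.zeros(d)                              # Algorithm 1 iterate (assumed fixed-step SGD)
w_ada = np.zeros(d)                              # Algorithm 2 iterate (Adagrad)
sq_sum = np.zeros(d)                             # running sum of squared gradient coordinates
obj_sgd, obj_ada = [], []
for t in range(n):
    g = stoch_grad(w_sgd, X[t], Y[t], lam)
    w_sgd = w_sgd - eta * g

    g = stoch_grad(w_ada, X[t], Y[t], lam)
    sq_sum = sq_sum + g ** 2
    w_ada = w_ada - g / (beta * np.sqrt(sq_sum) + 1e-12)   # eta_{t,i} = 1 / (beta * sigma_{t,i})

    obj_sgd.append(ridge_objective(w_sgd, X, Y, lam))
    obj_ada.append(ridge_objective(w_ada, X, Y, lam))

plt.plot(range(1, n + 1), obj_sgd, label="SGD (fixed step)")
plt.plot(range(1, n + 1), obj_ada, label="Adagrad")
plt.xlabel("iteration t")
plt.ylabel("F(w_t)")
plt.legend()
plt.show()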