
Advanced Machine Learning GR5242

Fall 2024

Homework 2

Due: Friday October 11, 7 pm, NY time.


Note: All questions carry equal weight.
Collaboration: You may collaborate on this assignment with peers from any course section. However,
you must write up your own solutions individually for submission.
Homework submission: Assignments should be submitted through Gradescope, which is accessible
through Courseworks. Please submit your homework by publishing a notebook that cleanly displays
your code, results, and plots to PDF or HTML. For the non-coding questions, you may typeset your
answers using LaTeX or Markdown, or include neatly scanned handwritten answers. Please make sure
to put everything together into a single PDF file for submission. Late submissions are not accepted.

Problem 1: Loss Function Example*


Note: this is an open-ended question. You get full credit for attempting it.
Give an example of a machine learning model class along with a loss function (defining a risk to
minimize) other than the following two examples: linear regression (with square loss) and neural
network (with logistic loss).

Problem 2: Differentiation
In what follows we will use the convention that the gradient is a column vector, i.e., the
transpose of the Jacobian when the output dimension is 1. Likewise, any vector x ∈ Rd is
treated as a column vector.

(a) Let's consider the function g(x) = Ax, where x ∈ Rd and A ∈ Rk×d. In the special case
k = 1, show that the gradient of g is equal to A⊤.
(b) Now, consider the case k > 1, where we might write g(x) = Ax = [g1(x), g2(x), . . . , gk(x)]⊤.
Recall that the Jacobian is a generalization of the gradient to multivariate functions g: the
Jacobian of g is the matrix of partial derivatives whose (i, j)-th entry is ∂gi(x)/∂xj. How
does the Jacobian matrix relate to the gradients of the components gi of g? Argue from
there that the Jacobian matrix of g above is given by Jg(x) = A.
(c) Now consider the function g(x) = x⊤Ax, where x ∈ Rd and A ∈ Rd×d. Show that the
gradient of g is given by ∇g(x) = Ax + A⊤x (it then follows that when A is symmetric,
∇g(x) = 2Ax).
Hint: You can write x⊤Ax = Σ_{i,j} Ai,j · xi xj.
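Not a substitute for the derivation in (c), but a quick numerical sanity check of the identity
∇g(x) = Ax + A⊤x can be done with finite differences; the matrix A, the point x, and the
tolerance below are arbitrary illustrative choices.

# Numerical sanity check of grad(x^T A x) = A x + A^T x (not a proof).
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))
x = rng.standard_normal(d)

g = lambda v: v @ A @ v                      # g(x) = x^T A x
claimed = A @ x + A.T @ x                    # claimed gradient

eps = 1e-6
fd = np.array([(g(x + eps * e) - g(x - eps * e)) / (2 * eps) for e in np.eye(d)])

print(np.allclose(claimed, fd, atol=1e-5))   # expect True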

Problem 3: Compare the Gradient Descent and Newton Methods


Consider the following function over x ∈ R2 (depicted in Figure 1):
   
F(x) = x⊤Σx + log(1 + exp(−1⊤x)),   where 1 = (1, 1)⊤ and Σ = diag(5, 1/2).
Figure 1: Problem 3, a plot of the given function F .

The problem asks you to implement Gradient Descent and Newton methods in Python to minimize
the above function F , and in each case, to plot F (xt ) as a function of iterations t ∈ [1, 2, . . . , 30]
(all on the same plot for comparison).

Implementation details: For either method, and for any step size, start at iteration t = 1 from the
same point x1 = (0, 0), for a fairer comparison.
Here one can calculate 2/β ≈ 0.195, where β is the smoothness constant of F (obtained via its
Hessian). According to theory, the step size η for Gradient Descent should satisfy η < 2/β. In
the case of Newton's method, which uses a better local approximation, a larger step size can be used.
Therefore, on the same figure, plot F(xt) against t ∈ [1, 2, . . . , 30] for the following 4 configurations,
labeling each curve as given below (NM1 through GD0.2):

(NM1) Newton Method with constant step size η = 1


(GD0.1) Gradient Descent Method with constant step size η = 0.1
(GD0.19) Gradient Descent Method with constant step size η = 0.19
(GD0.2) Gradient Descent Method with constant step size η = 0.2

The code for the function F(x), its gradient, and its Hessian matrix is given below.

import math
import numpy as np

Sigma = np.array([[5, 0], [0, 0.5]])
II = np.array([[1, 1]])

# x is a 2-by-1 array, starting from np.array([[0.0], [0.0]])
def func0(x):
    # F(x) = x^T Sigma x + log(1 + exp(-1^T x))
    return np.dot(np.dot(x.transpose(), Sigma), x) + math.log(1 + math.exp(-np.dot(II, x)))

def First_derivative(x):
    # gradient of F at x
    x1 = x[0][0]
    x2 = x[1][0]
    ex = math.exp(-(x1 + x2))
    return np.array([[10 * x1 - ex / (1 + ex)], [x2 - ex / (1 + ex)]])

def Second_derivative(x):
    # Hessian of F at x
    x1 = x[0][0]
    x2 = x[1][0]
    ex = math.exp(-(x1 + x2))
    ex = ex / (1 + ex) ** 2
    return np.array([[10 + ex, ex], [ex, 1 + ex]])
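As a starting point, a minimal sketch of one way to organize the experiment is given below. It
assumes the functions above have already been defined and that matplotlib is available; the helper
name run_descent is illustrative and not part of the assignment.

# Minimal sketch: run Newton's method / Gradient Descent from x1 = (0, 0)
# and record F(x_t) for t = 1, ..., 30.
import matplotlib.pyplot as plt

def run_descent(step, use_newton=False, T=30):
    x = np.array([[0.0], [0.0]])
    values = []
    for t in range(T):
        values.append(func0(x).item())                             # record F(x_t)
        g = First_derivative(x)
        if use_newton:
            direction = np.linalg.solve(Second_derivative(x), g)   # H(x)^{-1} * grad F(x)
        else:
            direction = g
        x = x - step * direction
    return values

ts = list(range(1, 31))
plt.plot(ts, run_descent(1.0, use_newton=True), label="NM1")
for eta in [0.1, 0.19, 0.2]:
    plt.plot(ts, run_descent(eta), label=f"GD{eta}")
plt.xlabel("iteration t")
plt.ylabel("F(x_t)")
plt.legend()
plt.show()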

Problem 4: Taylor’s Remainder Theorem


Consider a convex function F : Rd → R. Suppose its Hessian is positive semidefinite, i.e., ∇2F ⪰ 0,
and continuous at every x. Assume that F achieves its minimum at x∗. Show that

F (x) − F (x∗ ) ≤ (x − x∗ )⊤ ∇F (x)

Hint : Use Taylor’s Remainder Theorem (which holds for instance when ∇2 F is continuous).
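For reference, one standard statement of the theorem (in its second-order, mean-value form, valid
when ∇2F is continuous) is: for any x, y ∈ Rd there exists a point z on the segment between x and
y such that

F(y) = F(x) + ∇F(x)⊤(y − x) + ½ (y − x)⊤ ∇2F(z) (y − x).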
Problem 5: Regular SGD vs Adagrad optimization on a Regression problem
Consider a random vector X ∼ N(0, Σ) with covariance matrix Σ = diag(σ∗1², σ∗2², . . . , σ∗d²).
Here we let d = 10 and σ∗i = 2^(−i) · 10. Let the response Y be generated as

Y = w∗⊤ X + N (0, σ 2 ),

for an optimal vector w∗ = [1, 1, 1, . . . , 1]⊤ (assumed to be unknown), and σ = 1. Consider the
following Ridge objective, defined over i.i.d. data {(Xi, Yi)}, i = 1, . . . , n, from the above distribution:
F(w) = (1/n) Σ_{i=1}^n (Yi − w⊤Xi)² + λ∥w∥²,

for a setting of λ = 0.1. We want to minimize F over choices of w to estimate w∗ .


You are asked to implement the following two optimization approaches, namely SGD (Stochastic
Gradient Descent) and Adagrad, given in the two pseudocodes below. The only difference
between the two approaches is the setting of the step sizes ηt (in the case of Adagrad, ηt is
a diagonal matrix of step sizes).
For both procedures, we need the following definition of a stochastic gradient. First write
F(w) = (1/n) Σ_{i=1}^n fi(w),   that is, we let fi(w) = (Yi − w⊤Xi)² + λ∥w∥².

Thus, the term fi(w) depends only on the single random data sample (Xi, Yi).
We define the stochastic gradient (evaluated at w) at time t as ∇ft(w) (that is, at index i = t).
At step t in both of the procedures below, when the current iterate is wt, we use the common notation

∇̃F(wt) := ∇ft(wt).

Algorithm 1: Regular SGD

The steps below use a parameter β to be specified;
At time t = 1: set w1 = 0;
while t ≤ n do
    Compute ∇̃F(wt) on the t-th datapoint (Xt, Yt);
    Set ηt = 1/√(β · σt²), where σt² = Σ_{s=1}^t ∥∇̃F(ws)∥² (sum over past ∇̃'s);
    Update wt+1 = wt − ηt ∇̃F(wt);
    t = t + 1;
end

Algorithm 2: Adagrad

The steps below use a parameter β to be specified;
At time t = 1: set w1 = 0;
while t ≤ n do
    Compute ∇̃F(wt) on the t-th datapoint (Xt, Yt);
    ∀i ∈ [d], set ηt,i = 1/√(β · σt,i²), where σt,i² = Σ_{s=1}^t (∇̃iF(ws))² (sum over the i-th coordinate of past ∇̃'s);
    Update wt+1 = wt − ηt ∇̃F(wt), where now ηt = diag(ηt,1, . . . , ηt,d);
    t = t + 1;
end

(a) What is ∇̃F(wt) in terms of (Xt, Yt) and wt?
(b) Implement the above two procedures in Python, using β = 2(σ∗1² + λ) (smoothness measure).
(c) Experiment: Run each of the two procedures on n = 1000 datapoints, generated from the
above distribution, using the sampler provided below. Repeat 10 times, every time starting
at w1 = 0 as specified. Plot ∥wt − w∗ ∥ against t = 1, 2, . . . , 1000, averaged over the 10
repetitions. Show error bars (i.e., std over repetitions at each t).

Report the plots in (c) for the two procedures on the same figure for comparison.
Data sampler code:
## Code for generator / sampler
import numpy as np
import random
import time
from numpy import linalg as LA
import statistics

# initialization
sigma = 1
d = 10
c_square = 100
cov = np.diag([(0.25 ** i) * c_square for i in range(1, d + 1)])  # Sigma = diag(sigma_{*i}^2)
mean = [0] * d
# coefficient given (the unknown w_star)
w = np.array([1] * d)

# Sampler function
def sampler(n):
    # data X generator
    np.random.seed(int(time.time() * 100000) % 100000)
    X = np.random.multivariate_normal(mean, cov, n)
    # data Y generator: the noise scale passed to np.random.normal is the standard deviation sigma
    Y = np.matmul(X, w) + np.random.normal(0, sigma, n)
    return (X, Y)
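As a rough starting point for parts (b) and (c), a minimal sketch of the two update loops is given
below. It assumes the sampler code above, follows the step sizes of Algorithms 1 and 2, and omits
the averaging over 10 repetitions and the error bars; the helper names stoch_grad and run_procedure
are illustrative, not part of the assignment, and the gradient in stoch_grad should be checked
against your answer to part (a).

# Minimal sketch for parts (b)/(c): one run of each procedure on n = 1000 datapoints.
lam = 0.1
beta = 2 * (cov[0, 0] + lam)   # beta = 2 * (sigma_{*1}^2 + lambda)

def stoch_grad(w_t, x_t, y_t):
    # gradient of f_t(w) = (y_t - w^T x_t)^2 + lam * ||w||^2, evaluated at w = w_t
    return -2 * (y_t - w_t @ x_t) * x_t + 2 * lam * w_t

def run_procedure(method, X, Y):
    n = len(Y)
    w_t = np.zeros(d)
    sq_sum = 0.0 if method == "sgd" else np.zeros(d)   # sigma_t^2 (scalar or per-coordinate)
    dists = []
    for t in range(n):
        g = stoch_grad(w_t, X[t], Y[t])
        if method == "sgd":
            sq_sum += np.dot(g, g)                 # sum of squared gradient norms
        else:                                      # "adagrad"
            sq_sum += g ** 2                       # coordinate-wise sums of squares
        eta = 1.0 / np.sqrt(beta * sq_sum)         # scalar (SGD) or diagonal (Adagrad) step size
        w_t = w_t - eta * g
        dists.append(LA.norm(w_t - w))             # distance to w_star
    return np.array(dists)

X, Y = sampler(1000)
sgd_dists = run_procedure("sgd", X, Y)
ada_dists = run_procedure("adagrad", X, Y)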
