
Advanced Machine Learning GR5242

Fall 2024

Homework 2

Due: Friday October 11, 7 pm, NY time.


Note: All questions carry equal weight.
Collaboration: You may collaborate on this assignment with peers from any course section. However,
you must write up your own solutions individually for submission.
Homework submission: Assignments should be submitted through Gradescope, which is accessible
through Courseworks. Please submit your homework by publishing a notebook that cleanly displays
your code, results, and plots to PDF or HTML. For the non-coding questions, you may typeset your
answers using LaTeX or Markdown, or include neatly scanned handwritten answers. Please make sure
to put everything together into a single PDF file for submission. Late submissions are not accepted.

Problem 1: Loss Function Example*


Note: this is an open-ended question. You get full credit for attempting it.
Give an example of a machine learning model class along with a loss function (defining a risk to
minimize) other than the following two examples: linear regression (with square loss) and neural
network (with logistic loss).

Problem 2: Differentiation
In what follows we will use the convention that the gradient is a column vector, i.e., the
transpose of the Jacobian when the output dimension is 1. Likewise, any vector x ∈ Rd is
treated as a column vector.

(a) Let's consider the function g(x) = Ax, where x ∈ Rd and A ∈ Rk×d. In the special case
k = 1, show that the gradient of g is equal to A⊤.
(b) Now, consider the case k > 1, where we might write g(x) = Ax = [g1(x), g2(x), . . . , gk(x)]⊤.
Recall that the Jacobian is a generalization of the gradient to multivariate functions g: the
Jacobian of g is the matrix of partial derivatives whose (i, j)-th entry is ∂gi(x)/∂xj. How
does the Jacobian matrix relate to the gradients of the components gi of g? Argue from
there that the Jacobian matrix of g above is given by Jg(x) = A.
(c) Now consider the function g(x) = x⊤Ax, where x ∈ Rd and A ∈ Rd×d. Show that the
gradient of g is given by ∇g(x) = Ax + A⊤x (it then follows that when A is symmetric,
∇g(x) = 2Ax).
Hint: You can write x⊤Ax = Σ_{i,j} Ai,j · xi xj.
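Not a substitute for the derivation in (c), but a quick numerical sanity check of the identity
∇g(x) = Ax + A⊤x can be done with finite differences; the matrix A, the point x, and the
tolerance below are arbitrary illustrative choices.

# Numerical sanity check of grad(x^T A x) = A x + A^T x (not a proof).
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = rng.standard_normal((d, d))
x = rng.standard_normal(d)

g = lambda v: v @ A @ v                      # g(x) = x^T A x
claimed = A @ x + A.T @ x                    # claimed gradient

eps = 1e-6
fd = np.array([(g(x + eps * e) - g(x - eps * e)) / (2 * eps) for e in np.eye(d)])

print(np.allclose(claimed, fd, atol=1e-5))   # expect True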

Problem 3: Compare the Gradient Descent and Newton Methods


Consider the following function over x ∈ R2 (depicted in Figure 1):
   
F(x) = x⊤Σx + log(1 + exp(−1⊤x)),   where 1 = (1, 1)⊤ and Σ = diag(5, 1/2).
Figure 1: Problem 3, a plot of the given function F .

The problem asks you to implement Gradient Descent and Newton methods in Python to minimize
the above function F , and in each case, to plot F (xt ) as a function of iterations t ∈ [1, 2, . . . , 30]
(all on the same plot for comparison).

Implementation details: For either method, and for any step size, start at iteration t = 1 from the
same point x1 = (0, 0), for a fairer comparison.
Here one can calculate 2/β ≈ 0.195, where β is the smoothness constant of F (obtained via its
Hessian). According to theory, the step size η for Gradient Descent should satisfy η < 2/β. In
the case of Newton's method, which uses a better local approximation, a larger step size can be used.
Therefore, on the same figure, plot F(xt) against t ∈ [1, 2, . . . , 30] for the following 4 configurations,
labeling each curve as given below (NM1 through GD0.2):

(NM1) Newton Method with constant step size η = 1


(GD0.1) Gradient Descent Method with constant step size η = 0.1
(GD0.19) Gradient Descent Method with constant step size η = 0.19
(GD0.2) Gradient Descent Method with constant step size η = 0.2

The code for the function F(x), its gradient, and its Hessian matrix is given below.

import math
import numpy as np

Sigma = np.array([[5, 0], [0, 0.5]])
II = np.array([[1, 1]])

# x is a 2-by-1 array, starting from np.array([[0.0], [0.0]])
def func0(x):
    # F(x) = x^T Sigma x + log(1 + exp(-1^T x))
    return np.dot(np.dot(x.transpose(), Sigma), x) + math.log(1 + math.exp(-np.dot(II, x)))

def First_derivative(x):
    # gradient of F at x
    x1 = x[0][0]
    x2 = x[1][0]
    ex = math.exp(-(x1 + x2))
    return np.array([[10 * x1 - ex / (1 + ex)], [x2 - ex / (1 + ex)]])

def Second_derivative(x):
    # Hessian of F at x
    x1 = x[0][0]
    x2 = x[1][0]
    ex = math.exp(-(x1 + x2))
    ex = ex / (1 + ex) ** 2
    return np.array([[10 + ex, ex], [ex, 1 + ex]])
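As a starting point, a minimal sketch of one way to organize the experiment is given below. It
assumes the functions above have already been defined and that matplotlib is available; the helper
name run_descent is illustrative and not part of the assignment.

# Minimal sketch: run Newton's method / Gradient Descent from x1 = (0, 0)
# and record F(x_t) for t = 1, ..., 30.
import matplotlib.pyplot as plt

def run_descent(step, use_newton=False, T=30):
    x = np.array([[0.0], [0.0]])
    values = []
    for t in range(T):
        values.append(func0(x).item())                             # record F(x_t)
        g = First_derivative(x)
        if use_newton:
            direction = np.linalg.solve(Second_derivative(x), g)   # H(x)^{-1} * grad F(x)
        else:
            direction = g
        x = x - step * direction
    return values

ts = list(range(1, 31))
plt.plot(ts, run_descent(1.0, use_newton=True), label="NM1")
for eta in [0.1, 0.19, 0.2]:
    plt.plot(ts, run_descent(eta), label=f"GD{eta}")
plt.xlabel("iteration t")
plt.ylabel("F(x_t)")
plt.legend()
plt.show()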

Problem 4: Taylor’s Remainder Theorem


Consider a convex function F : Rd → R. Suppose its Hessian is positive semidefinite, i.e., ∇2F ⪰ 0,
and continuous at every x. Assume that F achieves its minimum at x∗. Show that

F (x) − F (x∗ ) ≤ (x − x∗ )⊤ ∇F (x)

Hint : Use Taylor’s Remainder Theorem (which holds for instance when ∇2 F is continuous).
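For reference, one standard statement of the theorem (in its second-order, mean-value form, valid
when ∇2F is continuous) is: for any x, y ∈ Rd there exists a point z on the segment between x and
y such that

F(y) = F(x) + ∇F(x)⊤(y − x) + ½ (y − x)⊤ ∇2F(z) (y − x).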
Problem 5: Regular SGD vs Adagrad optimization on a Regression problem
Consider a random vector X ∼ N(0, Σ) with covariance matrix Σ = diag(σ∗1², σ∗2², . . . , σ∗d²).
Here we let d = 10 and σ∗i = 2^(−i) · 10. Let the response Y be generated as

Y = w∗⊤ X + N (0, σ 2 ),

for an optimal vector w∗ = [1, 1, 1, . . . , 1]⊤ (assumed to be unknown), and σ = 1. Consider the
following Ridge objective, defined over i.i.d. data {(Xi, Yi)}, i = 1, . . . , n, from the above distribution:
F(w) = (1/n) Σ_{i=1}^n (Yi − w⊤Xi)² + λ∥w∥²,

for a setting of λ = 0.1. We want to minimize F over choices of w to estimate w∗ .


You are asked to implement the following two optimization approaches, namely SGD (Stochastic
Gradient Descent) and Adagrad, given in the two pseudocodes below. The only difference
between the two approaches is the setting of the step sizes ηt (in the case of Adagrad, ηt is
a diagonal matrix of step sizes).
For both procedures, we need the following definition of a stochastic gradient. First write
F(w) = (1/n) Σ_{i=1}^n fi(w),   that is, we let fi(w) = (Yi − w⊤Xi)² + λ∥w∥².

Thus, the term fi(w) depends only on the single random data sample (Xi, Yi).
We define the stochastic gradient (evaluated at w) at time t as ∇ft(w) (that is, at index i = t).
At step t in both of the procedures below, when the current iterate is wt, we use the common notation

∇̃F(wt) := ∇ft(wt).

Algorithm 1: Regular SGD

The steps below use a parameter β to be specified;
At time t = 1: set w1 = 0;
while t ≤ n do
    Compute ∇̃F(wt) on the t-th datapoint (Xt, Yt);
    Set ηt = 1/√(β · σt²), where σt² = Σ_{s=1}^t ∥∇̃F(ws)∥² (sum over past ∇̃'s);
    Update wt+1 = wt − ηt ∇̃F(wt);
    t = t + 1;
end

Algorithm 2: Adagrad

The steps below use a parameter β to be specified;
At time t = 1: set w1 = 0;
while t ≤ n do
    Compute ∇̃F(wt) on the t-th datapoint (Xt, Yt);
    ∀i ∈ [d], set ηt,i = 1/√(β · σt,i²), where σt,i² = Σ_{s=1}^t (∇̃iF(ws))² (sum over the i-th coordinate of past ∇̃'s);
    Update wt+1 = wt − ηt ∇̃F(wt), where now ηt = diag(ηt,1, . . . , ηt,d);
    t = t + 1;
end

(a) What is ∇̃F(wt) in terms of (Xt, Yt) and wt?
(b) Implement the above two procedures in Python, using β = 2(σ∗1² + λ) (smoothness measure).
(c) Experiment: Run each of the two procedures on n = 1000 datapoints, generated from the
above distribution, using the sampler provided below. Repeat 10 times, every time starting
at w1 = 0 as specified. Plot ∥wt − w∗ ∥ against t = 1, 2, . . . , 1000, averaged over the 10
repetitions. Show error bars (i.e., std over repetitions at each t).

Report the plots in (c) for the two procedures on the same figure for comparison.
Data sampler code:
## Code for generator / sampler
import numpy as np
import random
import time
from numpy import linalg as LA
import statistics

# initialization
sigma = 1
d = 10
c_square = 100
cov = np.diag([(0.25 ** i) * c_square for i in range(1, d + 1)])  # Sigma = diag(sigma_{*i}^2)
mean = [0] * d
# coefficient given (the unknown w_star)
w = np.array([1] * d)

# Sampler function
def sampler(n):
    # data X generator
    np.random.seed(int(time.time() * 100000) % 100000)
    X = np.random.multivariate_normal(mean, cov, n)
    # data Y generator: the noise scale passed to np.random.normal is the standard deviation sigma
    Y = np.matmul(X, w) + np.random.normal(0, sigma, n)
    return (X, Y)
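As a rough starting point for parts (b) and (c), a minimal sketch of the two update loops is given
below. It assumes the sampler code above, follows the step sizes of Algorithms 1 and 2, and omits
the averaging over 10 repetitions and the error bars; the helper names stoch_grad and run_procedure
are illustrative, not part of the assignment, and the gradient in stoch_grad should be checked
against your answer to part (a).

# Minimal sketch for parts (b)/(c): one run of each procedure on n = 1000 datapoints.
lam = 0.1
beta = 2 * (cov[0, 0] + lam)   # beta = 2 * (sigma_{*1}^2 + lambda)

def stoch_grad(w_t, x_t, y_t):
    # gradient of f_t(w) = (y_t - w^T x_t)^2 + lam * ||w||^2, evaluated at w = w_t
    return -2 * (y_t - w_t @ x_t) * x_t + 2 * lam * w_t

def run_procedure(method, X, Y):
    n = len(Y)
    w_t = np.zeros(d)
    sq_sum = 0.0 if method == "sgd" else np.zeros(d)   # sigma_t^2 (scalar or per-coordinate)
    dists = []
    for t in range(n):
        g = stoch_grad(w_t, X[t], Y[t])
        if method == "sgd":
            sq_sum += np.dot(g, g)                 # sum of squared gradient norms
        else:                                      # "adagrad"
            sq_sum += g ** 2                       # coordinate-wise sums of squares
        eta = 1.0 / np.sqrt(beta * sq_sum)         # scalar (SGD) or diagonal (Adagrad) step size
        w_t = w_t - eta * g
        dists.append(LA.norm(w_t - w))             # distance to w_star
    return np.array(dists)

X, Y = sampler(1000)
sgd_dists = run_procedure("sgd", X, Y)
ada_dists = run_procedure("adagrad", X, Y)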
