
Calculus - class notes

The document discusses key concepts in calculus as applied to machine learning, including loss and cost functions, linear regression, multiple linear regression, classification with perceptrons and neural networks, and Newton's method for optimization. It explains the mathematical foundations for minimizing errors using gradient descent and the importance of derivatives in determining function behavior. Additionally, it covers the Hessian matrix and its role in assessing concavity for optimization problems.

Calculus for ML

Functions
Loss function: defined on a single data point (a prediction and its label).
Cost function: defined over all data points (summed over all features and examples).
Linear Regression
The goal is to minimise the vertical distance between the dots $(x_i, y_i)$ and the line $slope \cdot x + b$.
For each dot, this distance is: $slope \cdot x_i + b - y_i$.
Therefore, the total cost associated with all $m$ distances is: $\sum_{i=1}^{m} (slope \cdot x_i + b - y_i)^2$.
In order to minimise (half) the average, find the minimum of the mean squared error function:
$$E = \frac{1}{2m} \sum_{i=1}^{m} (slope \cdot x_i + b - y_i)^2 .$$
That is
$$\frac{\partial E}{\partial slope} = \frac{1}{m} \cdot (slope \cdot X + b - y) \cdot X^T
\quad \text{and} \quad
\frac{\partial E}{\partial b} = \frac{1}{m} \cdot (slope \cdot X + b - y) \cdot \mathbf{1}$$
where slope and b are scalars;
X.shape = (1, m);
y.shape = (m,); and
1.shape = (m,).

So for gradient descent, the updates are:


$$slope^{k+1} = slope^{k} - \alpha \frac{1}{m} (slope \cdot X + b - y) \cdot X^T
\quad \text{and} \quad
b^{k+1} = b^{k} - \alpha \frac{1}{m} (slope \cdot X + b - y) \cdot \mathbf{1} .$$
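As a minimal sketch of these updates (the data, learning rate and iteration count below are hypothetical), the one-feature gradient descent loop can be run in NumPy:

```python
import numpy as np

# Hypothetical data: m = 4 points near the line y = 2x + 1
X = np.array([[0.0, 1.0, 2.0, 3.0]])   # X.shape = (1, m)
y = np.array([1.0, 3.1, 4.9, 7.2])     # y.shape = (m,)
m = y.shape[0]

slope, b = 0.0, 0.0
alpha = 0.1
for _ in range(2000):
    residual = slope * X[0] + b - y        # (slope·X + b − y), shape (m,)
    grad_slope = (residual @ X[0]) / m     # (1/m)(slope·X + b − y)·Xᵀ
    grad_b = residual.sum() / m            # (1/m)(slope·X + b − y)·1
    slope -= alpha * grad_slope
    b -= alpha * grad_b
```

With these points the least-squares solution is slope = 2.04, b = 0.99, and the loop converges to it.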

Multiple linear regression with inputs $x_1 \ldots x_n$

$$\hat{Y} = \begin{bmatrix} y^1 & y^2 & \cdots & y^m \end{bmatrix}
= \begin{bmatrix} w_1 & \cdots & w_n \end{bmatrix} \cdot
\begin{bmatrix} x_1^1 & x_1^2 & \cdots & x_1^m \\ \vdots & \vdots & \ddots & \vdots \\ x_n^1 & x_n^2 & \cdots & x_n^m \end{bmatrix} + b
= \begin{bmatrix} w_1 x_1^1 + \cdots + w_n x_n^1 + b & \cdots & w_1 x_1^m + \cdots + w_n x_n^m + b \end{bmatrix}$$

To minimise the cost function, take the two partial derivatives for gradient descent:
$$\frac{\partial cost}{\partial W} = \frac{1}{m} \cdot (\hat{Y} - Y) \cdot X^T
\quad \text{and} \quad
\frac{\partial cost}{\partial b} = \frac{1}{m} \cdot (\hat{Y} - Y) \cdot \mathbf{1}$$
where W.shape = (n,);
X.shape = (n, m);
b is a scalar;
Ŷ.shape and Y.shape = (m,); and
1.shape = (m,).
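A vectorised sketch of these gradients, using hypothetical data generated from a known linear model (the sizes, learning rate and true parameters below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 200                        # n features, m examples (hypothetical)
X = rng.normal(size=(n, m))          # X.shape = (n, m)
w_true = np.array([1.0, -2.0, 0.5])
Y = w_true @ X + 3.0                 # noise-free labels from a known model

W = np.zeros(n)                      # W.shape = (n,)
b = 0.0
alpha = 0.1
for _ in range(1000):
    Y_hat = W @ X + b                    # Ŷ.shape = (m,)
    grad_W = (Y_hat - Y) @ X.T / m       # (1/m)(Ŷ − Y)·Xᵀ, shape (n,)
    grad_b = (Y_hat - Y).sum() / m       # (1/m)(Ŷ − Y)·1
    W -= alpha * grad_W
    b -= alpha * grad_b
```

Because the labels are noise-free, gradient descent recovers the generating parameters.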
Classification with perceptron
Classificatory networks need an activation function, e.g. sigmoid:
$$\sigma(z) = \frac{1}{1 + e^{-z}},
\quad \text{whose derivative is } \frac{d\sigma}{dz} = \sigma (1 - \sigma).$$
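The derivative identity $\sigma' = \sigma(1-\sigma)$ can be checked numerically (the test point below is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.4                                              # arbitrary test point
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
analytic = sigmoid(z) * (1 - sigmoid(z))             # dσ/dz = σ(1 − σ)
```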

Thus we have a prediction function: ^y =σ (W⋅X +b)


And due to the probabilistic nature of classification, the preferred loss function is the log loss:
$$\mathcal{L}(y, \hat{y}) = -y \cdot \ln(\hat{y}) - (1 - y) \cdot \ln(1 - \hat{y}).$$
The partial derivatives of the constituent functions are:
$$\frac{\partial \mathcal{L}}{\partial \hat{y}} = \frac{-(y - \hat{y})}{\hat{y}(1 - \hat{y})},
\quad \frac{\partial \hat{y}}{\partial W} = \hat{y}(1 - \hat{y}) \cdot x
\quad \text{and} \quad
\frac{\partial \hat{y}}{\partial b} = \hat{y}(1 - \hat{y}) \cdot 1$$

By the chain rule, the gradients are


$$\frac{\partial cost}{\partial W} = \frac{1}{m} \cdot \frac{\partial \mathcal{L}}{\partial \hat{Y}} \cdot \frac{\partial \hat{Y}}{\partial W} = \frac{1}{m} (\hat{Y} - Y) \cdot X^T
\quad \text{and} \quad
\frac{\partial cost}{\partial b} = \frac{1}{m} \cdot \frac{\partial \mathcal{L}}{\partial \hat{Y}} \cdot \frac{\partial \hat{Y}}{\partial b} = \frac{1}{m} (\hat{Y} - Y) \cdot \mathbf{1}$$
where W.shape = (n,);
X.shape = (n, m);
b is a scalar;
Ŷ.shape and Y.shape = (m,); and
1.shape = (m,).

And for gradient descent, the updates are: $W^{k+1} = W^{k} + \alpha \frac{1}{m} (Y - \hat{Y}) \cdot X^T$ and $b^{k+1} = b^{k} + \alpha \frac{1}{m} (Y - \hat{Y}) \cdot \mathbf{1}$ (keeping the $\frac{1}{m}$ factor from the gradients above).
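A sketch of the resulting training loop; the synthetic, linearly separable labels and hyperparameters below are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n, m = 2, 400
X = rng.normal(size=(n, m))                # X.shape = (n, m)
Y = (X[0] + X[1] > 0).astype(float)        # hypothetical separable labels

W = np.zeros(n)                            # W.shape = (n,)
b = 0.0
alpha = 0.5
for _ in range(1000):
    Y_hat = sigmoid(W @ X + b)             # Ŷ.shape = (m,)
    W += alpha / m * (Y - Y_hat) @ X.T     # W ← W + (α/m)(Y − Ŷ)·Xᵀ
    b += alpha / m * (Y - Y_hat).sum()     # b ← b + (α/m)(Y − Ŷ)·1

accuracy = ((Y_hat > 0.5) == (Y == 1)).mean()
```

On separable data like this the perceptron reaches near-perfect training accuracy.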

Classification with neural network


Consider the activation of each perceptron $p$ in each layer $l$:

$$a_1^l = \sigma\left( w_{11}^l a_1^{l-1} + \cdots + w_{n1}^l a_n^{l-1} + b_1^l \right)
\quad \cdots \quad
a_p^l = \sigma\left( w_{1p}^l a_1^{l-1} + \cdots + w_{np}^l a_n^{l-1} + b_p^l \right)$$

and likewise for the next layer:

$$a_1^{l+1} = \sigma\left( w_{11}^{l+1} a_1^{l} + \cdots + w_{n1}^{l+1} a_n^{l} + b_1^{l+1} \right)
\quad \cdots \quad
a_p^{l+1} = \sigma\left( w_{1p}^{l+1} a_1^{l} + \cdots + w_{np}^{l+1} a_n^{l} + b_p^{l+1} \right)$$

where $a_p^l$ is the activation of the $p$th perceptron in layer $l$; $w_{np}^l$ is the weight between $a_n$/$x_n$ of the previous layer and $a_p$ of the current layer; $b_p^l$ is the bias of the $p$th perceptron in layer $l$; and $n$ is the number of $a$/$x$ nodes in the previous layer.
The chain rule from the log loss function through 2 layers to $w_{11}^1$ is:
$$\frac{\partial \mathcal{L}}{\partial w_{11}^1}
= \frac{\partial \mathcal{L}}{\partial a_1^2} \cdot \frac{\partial a_1^2}{\partial z_1^2} \cdot \frac{\partial z_1^2}{\partial a_1^1} \cdot \frac{\partial a_1^1}{\partial z_1^1} \cdot \frac{\partial z_1^1}{\partial w_{11}^1}
= \frac{-(y - a^2)}{a^2 (1 - a^2)} \cdot a^2 (1 - a^2) \cdot w_{11}^2 \cdot a^1 (1 - a^1) \cdot x_1
= -(y - a^2) \cdot w_{11}^2 \cdot a^1 (1 - a^1) \cdot x_1$$

The chain rule from the log loss function through the topmost chain to $b_1^1$ is:
$$\frac{\partial \mathcal{L}}{\partial b_1^1}
= \frac{\partial \mathcal{L}}{\partial a_1^2} \cdot \frac{\partial a_1^2}{\partial z_1^2} \cdot \frac{\partial z_1^2}{\partial a_1^1} \cdot \frac{\partial a_1^1}{\partial z_1^1} \cdot \frac{\partial z_1^1}{\partial b_1^1}
= \frac{-(y - a^2)}{a^2 (1 - a^2)} \cdot a^2 (1 - a^2) \cdot w_{11}^2 \cdot a^1 (1 - a^1) \cdot 1
= -(y - a^2) \cdot w_{11}^2 \cdot a^1 (1 - a^1) \cdot 1$$

And the gradient descent updates are:
$$w_{11}^{1\,(k+1)} = w_{11}^{1\,(k)} - \alpha \frac{\partial \mathcal{L}}{\partial w_{11}^{1\,(k)}}
\quad \text{and} \quad
b_1^{1\,(k+1)} = b_1^{1\,(k)} - \alpha \frac{\partial \mathcal{L}}{\partial b_1^{1\,(k)}}.$$
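The scalar chain rule above can be sanity-checked numerically. The weights, bias and input below are hypothetical, and the analytic gradient is compared against a central finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical scalar network: x → one hidden unit → one output unit
x1, y = 0.7, 1.0
w1, b1 = 0.3, 0.1      # layer-1 weight w¹₁₁ and bias b¹₁
w2, b2 = -0.5, 0.2     # layer-2 weight w²₁₁ and bias b²₁

def loss(w1_):
    a1 = sigmoid(w1_ * x1 + b1)
    a2 = sigmoid(w2 * a1 + b2)
    return -y * np.log(a2) - (1 - y) * np.log(1 - a2)

# Analytic gradient from the chain rule: −(y − a²)·w²₁₁·a¹(1 − a¹)·x₁
a1 = sigmoid(w1 * x1 + b1)
a2 = sigmoid(w2 * a1 + b2)
grad_analytic = -(y - a2) * w2 * a1 * (1 - a1) * x1

# Numerical check by central difference
eps = 1e-6
grad_numeric = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)
```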

Newton’s method
Just as the sign of the first derivative indicates whether the function increases or decreases,
the sign of the second derivative shows the concavity (whether the function is concave up or down):
while $f'' > 0$, $f$ is concave up; and while $f'' < 0$, $f$ is concave down.
Newton's iterative method was originally used to find zero value(s) of a function $f(x)$:
$$x^{(k+1)} = x^{(k)} - \frac{f(x^{(k)})}{f'(x^{(k)})}.$$
Each iteration finds the derivative of the function at the current $x$ position and calculates where that
tangent intersects the $x$ axis: that new $x$ is used for the next iteration.
If instead the same step is applied to the first derivative, it can be used to find the extreme(s) of $f(x)$:
$$x^{(k+1)} = x^{(k)} - \frac{f'(x^{(k)})}{f''(x^{(k)})}.$$
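Both variants can be sketched in a few lines; the test function $f(x) = x^2 - 2$ below is a hypothetical example:

```python
def newton_root(f, df, x, iters=20):
    """Original Newton's method: x ← x − f(x)/f'(x), converging to a zero of f."""
    for _ in range(iters):
        x = x - f(x) / df(x)
    return x

def newton_extremum(df, d2f, x, iters=20):
    """The same step applied to f': x ← x − f'(x)/f''(x), converging to an extremum."""
    for _ in range(iters):
        x = x - df(x) / d2f(x)
    return x

# f(x) = x² − 2 has a zero at √2; f'(x) = 2x gives a minimum at x = 0
root = newton_root(lambda x: x * x - 2, lambda x: 2 * x, x=1.0)
minimum = newton_extremum(lambda x: 2 * x, lambda x: 2.0, x=5.0)
```

For a quadratic $f$, the extremum variant lands exactly in one step, since the tangent of $f'$ is $f'$ itself.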

For a function with two variables f(x, y):

The first derivatives provide the gradient, the 2-element vector of the 2 partial derivatives of $f$:
$$\nabla f = \begin{bmatrix} \dfrac{\partial f}{\partial x} \\[6pt] \dfrac{\partial f}{\partial y} \end{bmatrix}$$

The second derivative provides the Hessian, the 2×2 matrix of the 4 second partial derivatives of $f$:
$$Hf = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x \partial y} \\[6pt] \dfrac{\partial^2 f}{\partial y \partial x} & \dfrac{\partial^2 f}{\partial y^2} \end{bmatrix}$$
If both mixed partials are differentiable, then $\dfrac{\partial^2 f}{\partial x \partial y} = \dfrac{\partial^2 f}{\partial y \partial x}$.
From the Hessian, the concavity of the function at point $(x, y)$ can be determined:
if $H(x, y)$ is positive definite, then $f$ is concave up; and if $H(x, y)$ is negative definite, then $f$ is concave down.
Whether the Hessian matrix is positive or negative definite is given by its eigenvalues.
From $\left( \dfrac{\partial^2 f}{\partial x^2} - \lambda \right) \left( \dfrac{\partial^2 f}{\partial y^2} - \lambda \right) - \left( \dfrac{\partial^2 f}{\partial x \partial y} \right) \left( \dfrac{\partial^2 f}{\partial y \partial x} \right) = 0$, $\lambda_1$ and $\lambda_2$ are readily calculable.

If $\lambda_1$ and $\lambda_2$ are both $> 0$, $f$ is concave up.
If $\lambda_1$ and $\lambda_2$ are both $< 0$, $f$ is concave down.
If $\lambda_1$ and $\lambda_2$ have different signs, the point $(x, y)$ is a saddle point.
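As an illustration with the hypothetical function $f(x, y) = x^2 - y^2$ (whose Hessian is constant), the eigenvalue test can be carried out with NumPy:

```python
import numpy as np

# Hessian of f(x, y) = x² − y², constant everywhere
H = np.array([[2.0, 0.0],      # [[∂²f/∂x²,  ∂²f/∂x∂y],
              [0.0, -2.0]])    #  [∂²f/∂y∂x, ∂²f/∂y²]]

lam = np.linalg.eigvalsh(H)    # eigenvalues of the symmetric Hessian, ascending
if np.all(lam > 0):
    kind = "concave up"
elif np.all(lam < 0):
    kind = "concave down"
else:
    kind = "saddle point"
```

The origin of $x^2 - y^2$ is the classic saddle: curving up along $x$, down along $y$.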
For Newton's method in several variables, the updates keep the same form as the original iterative updates, only using the quotient of the first and the second derivatives, i.e. $\frac{\text{gradient}}{\text{Hessian}}$.

For $n$ variables, i.e. $f(x_1, x_2, x_3, \ldots, x_n)$, the Hessian is:
$$Hf = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \frac{\partial^2 f}{\partial x_1 \partial x_3} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \frac{\partial^2 f}{\partial x_2 \partial x_3} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\frac{\partial^2 f}{\partial x_3 \partial x_1} & \frac{\partial^2 f}{\partial x_3 \partial x_2} & \frac{\partial^2 f}{\partial x_3^2} & \cdots & \frac{\partial^2 f}{\partial x_3 \partial x_n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \frac{\partial^2 f}{\partial x_n \partial x_3} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}.$$

If all $\lambda$s are $> 0$, $f$ is concave up.
If all $\lambda$s are $< 0$, $f$ is concave down.
Otherwise more information is needed.
For Newton's method, the updates are:

$$\begin{bmatrix} x_1^{(k+1)} \\ \vdots \\ x_n^{(k+1)} \end{bmatrix}
= \begin{bmatrix} x_1^{(k)} \\ \vdots \\ x_n^{(k)} \end{bmatrix}
- \frac{\nabla f(x_1^{(k)}, \ldots, x_n^{(k)})}{H(x_1^{(k)}, \ldots, x_n^{(k)})}
= \begin{bmatrix} x_1^{(k)} \\ \vdots \\ x_n^{(k)} \end{bmatrix}
- H^{-1}(x_1^{(k)}, \ldots, x_n^{(k)}) \cdot \nabla f(x_1^{(k)}, \ldots, x_n^{(k)}).$$
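A sketch of this update on a hypothetical quadratic, where solving $H \cdot d = \nabla f$ replaces the explicit inverse:

```python
import numpy as np

# Hypothetical quadratic f(x, y) = x² + x·y + 3y², minimised at (0, 0)
def grad(v):
    x, y = v
    return np.array([2 * x + y, x + 6 * y])   # ∇f

H = np.array([[2.0, 1.0],
              [1.0, 6.0]])                    # constant Hessian of f

v = np.array([4.0, -3.0])                     # hypothetical starting point
for _ in range(5):
    v = v - np.linalg.solve(H, grad(v))       # x ← x − H⁻¹·∇f
```

Since $f$ is quadratic, a single Newton step already lands on the minimum; the extra iterations are harmless.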

Newton’s method can be faster but has the following difficulties:


• calculating $f''$ can be computationally expensive or even impossible (if $f$ is not twice differentiable);
• saddle points are easier to avoid with gradient descent.
