
Calculus - class notes

The document discusses key concepts in calculus as applied to machine learning, including loss and cost functions, linear regression, multiple linear regression, classification with perceptrons and neural networks, and Newton's method for optimization. It explains the mathematical foundations for minimizing errors using gradient descent and the importance of derivatives in determining function behavior. Additionally, it covers the Hessian matrix and its role in assessing concavity for optimization problems.

Calculus for ML

Functions
Loss function: defined on a single data point (a prediction and its label).
Cost function: defined over all data points (summed over all features and examples).
Linear Regression
The goal is to minimise the vertical distance between the dots $(x_i, y_i)$ and the line $slope \cdot x + b$.
For each dot, this distance is: $slope \cdot x_i + b - y_i$.
Therefore, the total cost associated with all $m$ distances is: $\sum_{i=1}^{m} (slope \cdot x_i + b - y_i)^2$.
In order to minimise (half) the average, find the minimum of the mean squared error function:
$$E = \frac{1}{2m} \sum_{i=1}^{m} (slope \cdot x_i + b - y_i)^2 .$$
That is
$$\frac{\partial E}{\partial slope} = \frac{1}{m} \cdot (slope \cdot X + b - y) \cdot X^T
\quad \text{and} \quad
\frac{\partial E}{\partial b} = \frac{1}{m} \cdot (slope \cdot X + b - y) \cdot \mathbf{1}$$
where slope and b are scalars;
X.shape = (1, m);
y.shape = (m,); and
1.shape = (m,).

So for gradient descent, the updates are:


$$slope^{k+1} = slope^{k} - \alpha \frac{1}{m} (slope \cdot X + b - y) \cdot X^T
\quad \text{and} \quad
b^{k+1} = b^{k} - \alpha \frac{1}{m} (slope \cdot X + b - y) \cdot \mathbf{1} .$$
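As a minimal sketch of these updates (the data, learning rate and iteration count below are hypothetical), the one-feature gradient descent loop can be run in NumPy:

```python
import numpy as np

# Hypothetical data: m = 4 points near the line y = 2x + 1
X = np.array([[0.0, 1.0, 2.0, 3.0]])   # X.shape = (1, m)
y = np.array([1.0, 3.1, 4.9, 7.2])     # y.shape = (m,)
m = y.shape[0]

slope, b = 0.0, 0.0
alpha = 0.1
for _ in range(2000):
    residual = slope * X[0] + b - y        # (slope·X + b − y), shape (m,)
    grad_slope = (residual @ X[0]) / m     # (1/m)(slope·X + b − y)·Xᵀ
    grad_b = residual.sum() / m            # (1/m)(slope·X + b − y)·1
    slope -= alpha * grad_slope
    b -= alpha * grad_b
```

With these points the least-squares solution is slope = 2.04, b = 0.99, and the loop converges to it.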

Multiple linear regression with inputs $x_1 \ldots x_n$

$$\hat{Y} = \begin{bmatrix} y^1 & y^2 & \cdots & y^m \end{bmatrix}
= \begin{bmatrix} w_1 & \cdots & w_n \end{bmatrix} \cdot
\begin{bmatrix} x_1^1 & x_1^2 & \cdots & x_1^m \\ \vdots & \vdots & \ddots & \vdots \\ x_n^1 & x_n^2 & \cdots & x_n^m \end{bmatrix} + b
= \begin{bmatrix} w_1 x_1^1 + \cdots + w_n x_n^1 + b & \cdots & w_1 x_1^m + \cdots + w_n x_n^m + b \end{bmatrix}$$

To minimise the cost function, take the two partial derivatives for gradient descent:
$$\frac{\partial cost}{\partial W} = \frac{1}{m} \cdot (\hat{Y} - Y) \cdot X^T
\quad \text{and} \quad
\frac{\partial cost}{\partial b} = \frac{1}{m} \cdot (\hat{Y} - Y) \cdot \mathbf{1}$$
where W.shape = (n,);
X.shape = (n, m);
b is a scalar;
Ŷ.shape and Y.shape = (m,); and
1.shape = (m,).
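A vectorised sketch of these gradients, using hypothetical data generated from a known linear model (the sizes, learning rate and true parameters below are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 200                        # n features, m examples (hypothetical)
X = rng.normal(size=(n, m))          # X.shape = (n, m)
w_true = np.array([1.0, -2.0, 0.5])
Y = w_true @ X + 3.0                 # noise-free labels from a known model

W = np.zeros(n)                      # W.shape = (n,)
b = 0.0
alpha = 0.1
for _ in range(1000):
    Y_hat = W @ X + b                    # Ŷ.shape = (m,)
    grad_W = (Y_hat - Y) @ X.T / m       # (1/m)(Ŷ − Y)·Xᵀ, shape (n,)
    grad_b = (Y_hat - Y).sum() / m       # (1/m)(Ŷ − Y)·1
    W -= alpha * grad_W
    b -= alpha * grad_b
```

Because the labels are noise-free, gradient descent recovers the generating parameters.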
Classification with perceptron
Classificatory networks need an activation function, e.g. sigmoid:
$$\sigma(z) = \frac{1}{1 + e^{-z}},
\quad \text{whose derivative is } \frac{d\sigma}{dz} = \sigma (1 - \sigma).$$
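The derivative identity $\sigma' = \sigma(1-\sigma)$ can be checked numerically (the test point below is arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 0.4                                              # arbitrary test point
eps = 1e-6
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)  # central difference
analytic = sigmoid(z) * (1 - sigmoid(z))             # dσ/dz = σ(1 − σ)
```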

Thus we have a prediction function: ^y =σ (W⋅X +b)


And due to the probabilistic nature of classification, the preferred loss function is the log loss:
$$\mathcal{L}(y, \hat{y}) = -y \cdot \ln(\hat{y}) - (1 - y) \cdot \ln(1 - \hat{y}).$$
The partial derivatives of the constituent functions are:
$$\frac{\partial \mathcal{L}}{\partial \hat{y}} = \frac{-(y - \hat{y})}{\hat{y}(1 - \hat{y})},
\quad \frac{\partial \hat{y}}{\partial W} = \hat{y}(1 - \hat{y}) \cdot x
\quad \text{and} \quad
\frac{\partial \hat{y}}{\partial b} = \hat{y}(1 - \hat{y}) \cdot 1$$

By the chain rule, the gradients are


$$\frac{\partial cost}{\partial W} = \frac{1}{m} \cdot \frac{\partial \mathcal{L}}{\partial \hat{Y}} \cdot \frac{\partial \hat{Y}}{\partial W} = \frac{1}{m} (\hat{Y} - Y) \cdot X^T
\quad \text{and} \quad
\frac{\partial cost}{\partial b} = \frac{1}{m} \cdot \frac{\partial \mathcal{L}}{\partial \hat{Y}} \cdot \frac{\partial \hat{Y}}{\partial b} = \frac{1}{m} (\hat{Y} - Y) \cdot \mathbf{1}$$
where W.shape = (n,);
X.shape = (n, m);
b is a scalar;
Ŷ.shape and Y.shape = (m,); and
1.shape = (m,).

And for gradient descent, the updates are: $W^{k+1} = W^{k} + \alpha \frac{1}{m} (Y - \hat{Y}) \cdot X^T$ and $b^{k+1} = b^{k} + \alpha \frac{1}{m} (Y - \hat{Y}) \cdot \mathbf{1}$ (keeping the $\frac{1}{m}$ factor from the gradients above).
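A sketch of the resulting training loop; the synthetic, linearly separable labels and hyperparameters below are hypothetical:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
n, m = 2, 400
X = rng.normal(size=(n, m))                # X.shape = (n, m)
Y = (X[0] + X[1] > 0).astype(float)        # hypothetical separable labels

W = np.zeros(n)                            # W.shape = (n,)
b = 0.0
alpha = 0.5
for _ in range(1000):
    Y_hat = sigmoid(W @ X + b)             # Ŷ.shape = (m,)
    W += alpha / m * (Y - Y_hat) @ X.T     # W ← W + (α/m)(Y − Ŷ)·Xᵀ
    b += alpha / m * (Y - Y_hat).sum()     # b ← b + (α/m)(Y − Ŷ)·1

accuracy = ((Y_hat > 0.5) == (Y == 1)).mean()
```

On separable data like this the perceptron reaches near-perfect training accuracy.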

Classification with neural network


Consider the activation of each perceptron $p$ in each layer $l$:

$$a_1^l = \sigma\left( w_{11}^l a_1^{l-1} + \cdots + w_{n1}^l a_n^{l-1} + b_1^l \right)
\quad \cdots \quad
a_p^l = \sigma\left( w_{1p}^l a_1^{l-1} + \cdots + w_{np}^l a_n^{l-1} + b_p^l \right)$$

and likewise for the next layer:

$$a_1^{l+1} = \sigma\left( w_{11}^{l+1} a_1^{l} + \cdots + w_{n1}^{l+1} a_n^{l} + b_1^{l+1} \right)
\quad \cdots \quad
a_p^{l+1} = \sigma\left( w_{1p}^{l+1} a_1^{l} + \cdots + w_{np}^{l+1} a_n^{l} + b_p^{l+1} \right)$$

where $a_p^l$ is the activation of the $p$th perceptron in layer $l$; $w_{np}^l$ is the weight between $a_n$/$x_n$ of the previous layer and $a_p$ of the current layer; $b_p^l$ is the bias of the $p$th perceptron in layer $l$; and $n$ is the number of $a$/$x$ nodes in the previous layer.
The chain rule from the log loss function through 2 layers to $w_{11}^1$ is:
$$\frac{\partial \mathcal{L}}{\partial w_{11}^1}
= \frac{\partial \mathcal{L}}{\partial a_1^2} \cdot \frac{\partial a_1^2}{\partial z_1^2} \cdot \frac{\partial z_1^2}{\partial a_1^1} \cdot \frac{\partial a_1^1}{\partial z_1^1} \cdot \frac{\partial z_1^1}{\partial w_{11}^1}
= \frac{-(y - a^2)}{a^2 (1 - a^2)} \cdot a^2 (1 - a^2) \cdot w_{11}^2 \cdot a^1 (1 - a^1) \cdot x_1
= -(y - a^2) \cdot w_{11}^2 \cdot a^1 (1 - a^1) \cdot x_1$$

The chain rule from the log loss function through the topmost chain to $b_1^1$ is:
$$\frac{\partial \mathcal{L}}{\partial b_1^1}
= \frac{\partial \mathcal{L}}{\partial a_1^2} \cdot \frac{\partial a_1^2}{\partial z_1^2} \cdot \frac{\partial z_1^2}{\partial a_1^1} \cdot \frac{\partial a_1^1}{\partial z_1^1} \cdot \frac{\partial z_1^1}{\partial b_1^1}
= \frac{-(y - a^2)}{a^2 (1 - a^2)} \cdot a^2 (1 - a^2) \cdot w_{11}^2 \cdot a^1 (1 - a^1) \cdot 1
= -(y - a^2) \cdot w_{11}^2 \cdot a^1 (1 - a^1) \cdot 1$$

And the gradient descent updates are:
$$w_{11}^{1\,(k+1)} = w_{11}^{1\,(k)} - \alpha \frac{\partial \mathcal{L}}{\partial w_{11}^{1\,(k)}}
\quad \text{and} \quad
b_1^{1\,(k+1)} = b_1^{1\,(k)} - \alpha \frac{\partial \mathcal{L}}{\partial b_1^{1\,(k)}}.$$
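The scalar chain rule above can be sanity-checked numerically. The weights, bias and input below are hypothetical, and the analytic gradient is compared against a central finite difference:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical scalar network: x → one hidden unit → one output unit
x1, y = 0.7, 1.0
w1, b1 = 0.3, 0.1      # layer-1 weight w¹₁₁ and bias b¹₁
w2, b2 = -0.5, 0.2     # layer-2 weight w²₁₁ and bias b²₁

def loss(w1_):
    a1 = sigmoid(w1_ * x1 + b1)
    a2 = sigmoid(w2 * a1 + b2)
    return -y * np.log(a2) - (1 - y) * np.log(1 - a2)

# Analytic gradient from the chain rule: −(y − a²)·w²₁₁·a¹(1 − a¹)·x₁
a1 = sigmoid(w1 * x1 + b1)
a2 = sigmoid(w2 * a1 + b2)
grad_analytic = -(y - a2) * w2 * a1 * (1 - a1) * x1

# Numerical check by central difference
eps = 1e-6
grad_numeric = (loss(w1 + eps) - loss(w1 - eps)) / (2 * eps)
```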

Newton’s method
Just as the sign of the first derivative indicates whether the function increases or decreases,
the sign of the second derivative shows the concavity (whether the function is concave up or down):
while $f'' > 0$, $f$ is concave up; and while $f'' < 0$, $f$ is concave down.
Newton's iterative method was originally used to find zero value(s) of a function $f(x)$:
$$x^{(k+1)} = x^{(k)} - \frac{f(x^{(k)})}{f'(x^{(k)})}.$$
Each iteration finds the derivative of the function at the current $x$ position and calculates where that
tangent intersects the $x$ axis: that new $x$ is used for the next iteration.
If instead the same step is applied to the first derivative, it can be used to find the extreme(s) of $f(x)$:
$$x^{(k+1)} = x^{(k)} - \frac{f'(x^{(k)})}{f''(x^{(k)})}.$$
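Both variants can be sketched in a few lines; the test function $f(x) = x^2 - 2$ below is a hypothetical example:

```python
def newton_root(f, df, x, iters=20):
    """Original Newton's method: x ← x − f(x)/f'(x), converging to a zero of f."""
    for _ in range(iters):
        x = x - f(x) / df(x)
    return x

def newton_extremum(df, d2f, x, iters=20):
    """The same step applied to f': x ← x − f'(x)/f''(x), converging to an extremum."""
    for _ in range(iters):
        x = x - df(x) / d2f(x)
    return x

# f(x) = x² − 2 has a zero at √2; f'(x) = 2x gives a minimum at x = 0
root = newton_root(lambda x: x * x - 2, lambda x: 2 * x, x=1.0)
minimum = newton_extremum(lambda x: 2 * x, lambda x: 2.0, x=5.0)
```

For a quadratic $f$, the extremum variant lands exactly in one step, since the tangent of $f'$ is $f'$ itself.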

For a function with two variables f(x, y):

The first derivatives provide the gradient, the 2-element vector of the 2 partial derivatives of $f$:
$$\nabla f = \begin{bmatrix} \dfrac{\partial f}{\partial x} \\[6pt] \dfrac{\partial f}{\partial y} \end{bmatrix}$$

The second derivative provides the Hessian, the 2×2 matrix of the 4 second partial derivatives of $f$:
$$Hf = \begin{bmatrix} \dfrac{\partial^2 f}{\partial x^2} & \dfrac{\partial^2 f}{\partial x \partial y} \\[6pt] \dfrac{\partial^2 f}{\partial y \partial x} & \dfrac{\partial^2 f}{\partial y^2} \end{bmatrix}$$
If both mixed partials are differentiable, then $\dfrac{\partial^2 f}{\partial x \partial y} = \dfrac{\partial^2 f}{\partial y \partial x}$.
From the Hessian, the concavity of the function at point $(x, y)$ can be determined:
if $H(x, y)$ is positive definite, then $f$ is concave up; and if $H(x, y)$ is negative definite, then $f$ is concave down.
Whether the Hessian matrix is positive or negative definite is given by its eigenvalues.
From $\left( \dfrac{\partial^2 f}{\partial x^2} - \lambda \right) \left( \dfrac{\partial^2 f}{\partial y^2} - \lambda \right) - \left( \dfrac{\partial^2 f}{\partial x \partial y} \right) \left( \dfrac{\partial^2 f}{\partial y \partial x} \right) = 0$, $\lambda_1$ and $\lambda_2$ are readily calculable.

If $\lambda_1$ and $\lambda_2$ are both $> 0$, $f$ is concave up.
If $\lambda_1$ and $\lambda_2$ are both $< 0$, $f$ is concave down.
If $\lambda_1$ and $\lambda_2$ have different signs, the point $(x, y)$ is a saddle point.
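As an illustration with the hypothetical function $f(x, y) = x^2 - y^2$ (whose Hessian is constant), the eigenvalue test can be carried out with NumPy:

```python
import numpy as np

# Hessian of f(x, y) = x² − y², constant everywhere
H = np.array([[2.0, 0.0],      # [[∂²f/∂x²,  ∂²f/∂x∂y],
              [0.0, -2.0]])    #  [∂²f/∂y∂x, ∂²f/∂y²]]

lam = np.linalg.eigvalsh(H)    # eigenvalues of the symmetric Hessian, ascending
if np.all(lam > 0):
    kind = "concave up"
elif np.all(lam < 0):
    kind = "concave down"
else:
    kind = "saddle point"
```

The origin of $x^2 - y^2$ is the classic saddle: curving up along $x$, down along $y$.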
For Newton's method in several variables, the updates keep the same form as the original iterative updates, only using the quotient of the first and the second derivatives, i.e. $\frac{\text{gradient}}{\text{Hessian}}$.

For $n$ variables, i.e. $f(x_1, x_2, x_3, \ldots, x_n)$, the Hessian is:
$$Hf = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \frac{\partial^2 f}{\partial x_1 \partial x_3} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \frac{\partial^2 f}{\partial x_2 \partial x_3} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\frac{\partial^2 f}{\partial x_3 \partial x_1} & \frac{\partial^2 f}{\partial x_3 \partial x_2} & \frac{\partial^2 f}{\partial x_3^2} & \cdots & \frac{\partial^2 f}{\partial x_3 \partial x_n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \frac{\partial^2 f}{\partial x_n \partial x_3} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{bmatrix}.$$

If all $\lambda$s are $> 0$, $f$ is concave up.
If all $\lambda$s are $< 0$, $f$ is concave down.
Otherwise more information is needed.
For Newton's method, the updates are:

$$\begin{bmatrix} x_1^{(k+1)} \\ \vdots \\ x_n^{(k+1)} \end{bmatrix}
= \begin{bmatrix} x_1^{(k)} \\ \vdots \\ x_n^{(k)} \end{bmatrix}
- \frac{\nabla f(x_1^{(k)}, \ldots, x_n^{(k)})}{H(x_1^{(k)}, \ldots, x_n^{(k)})}
= \begin{bmatrix} x_1^{(k)} \\ \vdots \\ x_n^{(k)} \end{bmatrix}
- H^{-1}(x_1^{(k)}, \ldots, x_n^{(k)}) \cdot \nabla f(x_1^{(k)}, \ldots, x_n^{(k)}).$$
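A sketch of this update on a hypothetical quadratic, where solving $H \cdot d = \nabla f$ replaces the explicit inverse:

```python
import numpy as np

# Hypothetical quadratic f(x, y) = x² + x·y + 3y², minimised at (0, 0)
def grad(v):
    x, y = v
    return np.array([2 * x + y, x + 6 * y])   # ∇f

H = np.array([[2.0, 1.0],
              [1.0, 6.0]])                    # constant Hessian of f

v = np.array([4.0, -3.0])                     # hypothetical starting point
for _ in range(5):
    v = v - np.linalg.solve(H, grad(v))       # x ← x − H⁻¹·∇f
```

Since $f$ is quadratic, a single Newton step already lands on the minimum; the extra iterations are harmless.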

Newton’s method can be faster but has the following difficulties:


• calculating $f''$ can be computationally expensive or even impossible (if $f$ is not twice differentiable);
• saddle points are easier to avoid with gradient descent.
