Calculus - class notes
Functions
Loss function: defined on a single data point (one prediction and its label)
Cost function: defined over the whole data set (the loss summed or averaged over all examples)
Linear Regression
The goal is to minimise the vertical distance between the points $(x_i, y_i)$ and the line $\text{slope} \cdot x + b$.
For each point, this distance is: $\hat{y}_i - y_i = \text{slope} \cdot x_i + b - y_i$.
Therefore, the total cost associated with all $m$ distances is: $\sum_{i=1}^{m} \left( \text{slope} \cdot x_i + b - y_i \right)^2$.
In order to minimise (half) the average of these squared distances, find the minimum of the mean squared error function:
$$E = \frac{1}{2m} \sum_{i=1}^{m} \left( \text{slope} \cdot x_i + b - y_i \right)^2 .$$
That is
$$\frac{\partial E}{\partial \text{slope}} = \frac{1}{m} \left( \text{slope} \cdot X + b - y \right) \cdot X^T$$
and
$$\frac{\partial E}{\partial b} = \frac{1}{m} \left( \text{slope} \cdot X + b - y \right) \cdot \mathbf{1}$$
where slope and b are scalars;
X.shape = (1, m);
y.shape = (m,); and
1.shape = (m,).
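A minimal NumPy sketch of the cost $E$ and its two gradients above, assuming the shapes just listed (the function and variable names are illustrative, not from the notes):

```python
import numpy as np

def mse_cost_and_gradients(slope, b, X, y):
    """Half mean squared error and its gradients for a 1-feature linear fit.

    slope, b : scalars
    X        : shape (1, m) row of inputs
    y        : shape (m,)   labels
    """
    m = y.shape[0]
    residual = slope * X.ravel() + b - y        # slope*x_i + b - y_i, shape (m,)
    cost = (residual ** 2).sum() / (2 * m)      # E = 1/(2m) * sum of squared distances
    d_slope = residual @ X.ravel() / m          # dE/dslope = 1/m * (slope*X + b - y) . X^T
    d_b = residual.sum() / m                    # dE/db     = 1/m * (slope*X + b - y) . 1
    return cost, d_slope, d_b

# hypothetical usage: one gradient-descent step with learning rate alpha
X = np.array([[1.0, 2.0, 3.0]])
y = np.array([2.0, 4.1, 5.9])
slope, b, alpha = 0.0, 0.0, 0.1
cost, d_slope, d_b = mse_cost_and_gradients(slope, b, X, y)
slope -= alpha * d_slope
b -= alpha * d_b
```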
$$\hat{Y} = \begin{bmatrix} \hat{y}^{1} & \hat{y}^{2} & \cdots & \hat{y}^{m} \end{bmatrix} = \begin{bmatrix} w_1 & \cdots & w_n \end{bmatrix} \cdot \begin{bmatrix} x_1^{1} & x_1^{2} & \cdots & x_1^{m} \\ \vdots & \vdots & \ddots & \vdots \\ x_n^{1} & x_n^{2} & \cdots & x_n^{m} \end{bmatrix} + b = \begin{bmatrix} w_1 x_1^{1} + w_2 x_2^{1} + \cdots + w_n x_n^{1} + b & \cdots & w_1 x_1^{m} + w_2 x_2^{m} + \cdots + w_n x_n^{m} + b \end{bmatrix}$$
To minimise the cost function, take the two partial derivatives for gradient descent:
that is
$$\frac{\partial \text{cost}}{\partial W} = \frac{1}{m} \left( \hat{Y} - Y \right) \cdot X^T \qquad \text{and} \qquad \frac{\partial \text{cost}}{\partial b} = \frac{1}{m} \left( \hat{Y} - Y \right) \cdot \mathbf{1}$$
where W.shape = (n,);
X.shape = (n, m);
b is a scalar;
Ŷ.shape and Y.shape = (m,); and
1.shape = (m,).
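A sketch of the vectorised prediction and gradients with $n$ features, following the shapes listed above (the names are assumptions):

```python
import numpy as np

def linear_forward_and_gradients(W, b, X, Y):
    """Vectorised prediction and cost gradients for n features, m examples.

    W : shape (n,)   weights
    b : scalar       bias
    X : shape (n, m) one column per example
    Y : shape (m,)   labels
    """
    m = Y.shape[0]
    Y_hat = W @ X + b                 # shape (m,): w1*x1^j + ... + wn*xn^j + b
    dW = (Y_hat - Y) @ X.T / m        # dcost/dW = 1/m * (Y_hat - Y) . X^T, shape (n,)
    db = (Y_hat - Y).sum() / m        # dcost/db = 1/m * (Y_hat - Y) . 1
    return Y_hat, dW, db
```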
Classification with perceptron
Classification networks need an activation function, e.g. the sigmoid:
$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad \text{whose derivative is: } \frac{d\sigma}{dz} = \sigma \, (1 - \sigma).$$
And for gradient descent, the updates are: $W_{k+1} = W_k + \alpha \, (Y - \hat{Y}) \cdot X^T$ and $b_{k+1} = b_k + \alpha \, (Y - \hat{Y}) \cdot \mathbf{1}$.
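A small sketch of the sigmoid, its derivative, and one update step as written above (the learning rate `alpha` and the data shapes follow the earlier conventions; the function names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)              # dsigma/dz = sigma * (1 - sigma)

def perceptron_update(W, b, X, Y, alpha):
    """One update: W <- W + alpha*(Y - Y_hat).X^T, b <- b + alpha*(Y - Y_hat).1"""
    Y_hat = sigmoid(W @ X + b)        # predictions, shape (m,)
    W = W + alpha * (Y - Y_hat) @ X.T
    b = b + alpha * (Y - Y_hat).sum()
    return W, b
```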
$$a_1^{l} = \sigma\!\left( w_{11}^{l} a_1^{l-1} + \cdots + w_{n1}^{l} a_n^{l-1} + b_1^{l} \right) \qquad a_1^{l+1} = \sigma\!\left( w_{11}^{l+1} a_1^{l} + \cdots + w_{n1}^{l+1} a_n^{l} + b_1^{l+1} \right)$$
$$\vdots$$
$$a_p^{l} = \sigma\!\left( w_{1p}^{l} a_1^{l-1} + \cdots + w_{np}^{l} a_n^{l-1} + b_p^{l} \right) \qquad a_p^{l+1} = \sigma\!\left( w_{1p}^{l+1} a_1^{l} + \cdots + w_{np}^{l+1} a_n^{l} + b_p^{l+1} \right)$$
where $a_p^{l}$ is the activation of the $p$th perceptron in layer $l$;
$w_{np}^{l}$ is the weight between $a_n$/$x_n$ of the previous layer and $a_p$ of the current layer;
$b_p^{l}$ is the bias of the $p$th perceptron in layer $l$; and
$n$ is the number of $a$/$x$ nodes in the previous layer.
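As a sketch, all activations of one layer can be computed at once; here the weight matrix `W_l` is assumed to hold $w_{np}^{l}$ with one row per perceptron of layer $l$ (that layout is a choice, not something fixed by the notes):

```python
import numpy as np

def layer_forward(W_l, b_l, a_prev):
    """Activations of layer l: a^l_p = sigma(sum_n w^l_{np} * a^{l-1}_n + b^l_p).

    W_l    : shape (p, n)  weights, row p-1 holds the weights into perceptron p
    b_l    : shape (p,)    biases
    a_prev : shape (n,)    activations of the previous layer (or the inputs x)
    """
    z_l = W_l @ a_prev + b_l
    return 1.0 / (1.0 + np.exp(-z_l))   # sigmoid activation
```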
The chain rule from the log loss function through 2 layers to $w_{11}^{1}$ is:
$$\frac{\partial \mathcal{L}}{\partial w_{11}^{1}} = \frac{\partial \mathcal{L}}{\partial a^{2}} \cdot \frac{\partial a^{2}}{\partial z_1^{2}} \cdot \frac{\partial z_1^{2}}{\partial a_1^{1}} \cdot \frac{\partial a_1^{1}}{\partial z_1^{1}} \cdot \frac{\partial z_1^{1}}{\partial w_{11}^{1}} = \frac{-(y - a^{2})}{a^{2}(1 - a^{2})} \cdot a^{2}(1 - a^{2}) \cdot w_{11}^{2} \cdot a^{1}(1 - a^{1}) \cdot x_1 = -(y - a^{2}) \cdot w_{11}^{2} \cdot a^{1}(1 - a^{1}) \cdot x_1$$
The chain rule from the log loss function through the topmost chain to $b_1^{1}$ is:
$$\frac{\partial \mathcal{L}}{\partial b_1^{1}} = \frac{\partial \mathcal{L}}{\partial a^{2}} \cdot \frac{\partial a^{2}}{\partial z_1^{2}} \cdot \frac{\partial z_1^{2}}{\partial a_1^{1}} \cdot \frac{\partial a_1^{1}}{\partial z_1^{1}} \cdot \frac{\partial z_1^{1}}{\partial b_1^{1}} = \frac{-(y - a^{2})}{a^{2}(1 - a^{2})} \cdot a^{2}(1 - a^{2}) \cdot w_{11}^{2} \cdot a^{1}(1 - a^{1}) \cdot 1 = -(y - a^{2}) \cdot w_{11}^{2} \cdot a^{1}(1 - a^{1}) \cdot 1$$
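A minimal sketch of both chain-rule results for this 2-layer, single-output case; the symbols `a1`, `a2`, `w2_11`, `x1` follow the derivation above, and everything else (function name, argument order) is an assumption:

```python
def grads_w111_b11(y, a2, w2_11, a1, x1):
    """Gradients of the log loss w.r.t. w^1_11 and b^1_1, as derived above.

    dL/dw^1_11 = -(y - a2) * w^2_11 * a1 * (1 - a1) * x1
    dL/db^1_1  = -(y - a2) * w^2_11 * a1 * (1 - a1) * 1
    """
    common = -(y - a2) * w2_11 * a1 * (1.0 - a1)
    return common * x1, common * 1.0
```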
Newton’s method
Similarly to how the sign of the first derivative indicates whether the function increases or decreases, the sign of the second derivative shows the concavity (whether the function is concave up or down): where $f'' > 0$, $f$ is concave up; and where $f'' < 0$, $f$ is concave down.
Newton's iterative method was originally used to find zero value(s) of a function $f(x)$:
$$x^{(k+1)} = x^{(k)} - \frac{f(x^{(k)})}{f'(x^{(k)})} .$$
Each iteration finds the derivative of the function at the current $x$ position and calculates where that tangent intersects the $x$ axis: that new $x$ is used for the next iteration.
If instead the second derivative is taken, it can be used to find the extreme(s) of $f(x)$:
$$x^{(k+1)} = x^{(k)} - \frac{f'(x^{(k)})}{f''(x^{(k)})} .$$
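A short sketch of both variants of the iteration; the callables `f`, `df`, `d2f`, the step count, and the starting points are assumptions:

```python
def newton_root(f, df, x0, steps=20):
    """Find a zero of f: x_{k+1} = x_k - f(x_k) / f'(x_k)."""
    x = x0
    for _ in range(steps):
        x = x - f(x) / df(x)
    return x

def newton_extremum(df, d2f, x0, steps=20):
    """Find an extremum of f: x_{k+1} = x_k - f'(x_k) / f''(x_k)."""
    x = x0
    for _ in range(steps):
        x = x - df(x) / d2f(x)
    return x

# hypothetical usage: f(x) = x^2 - 2 has a root at sqrt(2) and a minimum at 0
root = newton_root(lambda x: x**2 - 2, lambda x: 2*x, x0=1.0)
minimum = newton_extremum(lambda x: 2*x, lambda x: 2.0, x0=1.0)
```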
The first derivatives provide the gradient, the 2-element vector of the 2 partial derivatives of $f$:
$$\nabla f = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix}$$
The second derivative provides the Hessian, ∂ f
2 2
∂ f
the 2x2-matrix of the 4 partial second derivatives of f. ∂x
2
∂x∂ y
2
∂ f
2
∂ f Hf = 2 2
If both are differentiable, then = . ∂ f ∂ f
∂x∂ y ∂ y∂x ∂ y∂ x ∂y
2
From the Hessian, the concavity of the function at a point (x, y) can be determined: if H(x, y) is positive definite, then f is concave up; and if H(x, y) is negative definite, then f is concave down. Whether the Hessian (a matrix) is positive or negative definite is given by its eigenvalues: all positive for positive definite, all negative for negative definite.
From $\left( \frac{\partial^2 f}{\partial x^2} - \lambda \right) \left( \frac{\partial^2 f}{\partial y^2} - \lambda \right) - \left( \frac{\partial^2 f}{\partial x \partial y} \right) \left( \frac{\partial^2 f}{\partial y \partial x} \right) = 0$, $\lambda_1$ and $\lambda_2$ are readily calculable.
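A sketch that classifies the concavity of a 2-variable function from the eigenvalues of its Hessian at a point; passing the four Hessian entries in directly is an assumption made for brevity:

```python
import numpy as np

def concavity_from_hessian(fxx, fxy, fyx, fyy):
    """Classify concavity at a point from the 2x2 Hessian's eigenvalues."""
    H = np.array([[fxx, fxy],
                  [fyx, fyy]])
    eigenvalues = np.linalg.eigvalsh(H)   # symmetric Hessian -> real eigenvalues
    if np.all(eigenvalues > 0):
        return "concave up (positive definite)"
    if np.all(eigenvalues < 0):
        return "concave down (negative definite)"
    return "neither (indefinite or semidefinite)"

# hypothetical usage: f(x, y) = x^2 + y^2 has Hessian diag(2, 2) everywhere
print(concavity_from_hessian(2.0, 0.0, 0.0, 2.0))   # concave up
```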
For $n$ variables, i.e. $f(x_1, x_2, x_3, \dots, x_n)$, the Hessian is:
$$Hf = \begin{bmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \frac{\partial^2 f}{\partial x_1 \partial x_3} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\
\frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \frac{\partial^2 f}{\partial x_2 \partial x_3} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\
\frac{\partial^2 f}{\partial x_3 \partial x_1} & \frac{\partial^2 f}{\partial x_3 \partial x_2} & \frac{\partial^2 f}{\partial x_3^2} & \cdots & \frac{\partial^2 f}{\partial x_3 \partial x_n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \frac{\partial^2 f}{\partial x_n \partial x_3} & \cdots & \frac{\partial^2 f}{\partial x_n^2}
\end{bmatrix} .$$
$$\begin{bmatrix} x_{k+1}^{1} \\ \vdots \\ x_{k+1}^{n} \end{bmatrix} = \begin{bmatrix} x_{k}^{1} \\ \vdots \\ x_{k}^{n} \end{bmatrix} - \frac{\nabla f(x_k^1, \dots, x_k^n)}{H(x_k^1, \dots, x_k^n)} = \begin{bmatrix} x_{k}^{1} \\ \vdots \\ x_{k}^{n} \end{bmatrix} - H^{-1}(x_k^1, \dots, x_k^n) \cdot \nabla f(x_k^1, \dots, x_k^n) .$$
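Finally, a sketch of one multivariate Newton step using the gradient and the Hessian; `grad_f` and `hess_f` are assumed to be callables returning the shapes shown above:

```python
import numpy as np

def newton_step(x_k, grad_f, hess_f):
    """One Newton update: x_{k+1} = x_k - H^{-1}(x_k) . grad_f(x_k).

    x_k    : shape (n,)   current point
    grad_f : callable returning the gradient, shape (n,)
    hess_f : callable returning the Hessian, shape (n, n)
    """
    # solving H . step = grad is preferred to explicitly inverting H
    step = np.linalg.solve(hess_f(x_k), grad_f(x_k))
    return x_k - step

# hypothetical usage: f(x, y) = x^2 + 3y^2 has its minimum at the origin
grad = lambda v: np.array([2*v[0], 6*v[1]])
hess = lambda v: np.array([[2.0, 0.0], [0.0, 6.0]])
x = newton_step(np.array([1.0, 1.0]), grad, hess)   # lands on [0, 0] in one step
```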