Vector Calculus
Contents
Differentiation of Univariate Functions
Partial Differentiation and Gradients
Gradients of Matrices
Backpropagation
Higher-Order Derivatives
Linearization and Multivariate Taylor Series
The Chain Rule
x ⟼ f(x) ⟼ g(f(x)) = (g∘f)(x)        (g∘f means "g after f", i.e., g applied after f)

(g∘f)'(x) = g'(f(x)) · f'(x)

Equivalently, dg/dx = (dg/df)(df/dx)
Chain rule – Ex
• Use the chain rule to compute the derivative of h(x) = (2x + 1)^4
• Composition: x ⟼ 2x ⟼ 2x + 1 ⟼ (2x + 1)^4
• h can be expressed as h(x) = (g∘f∘u)(x) with
  u(x) = 2x,   f(u) = u + 1,   g(f) = f^4
• h'(x) = (g∘f∘u)'(x) = g'(f) · f'(u) · u'(x) = 4f^3 · 1 · 2 = 8(2x + 1)^3
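As a quick sanity check, the derivative can be recomputed symbolically; a minimal sketch, assuming SymPy is available (the symbol name x is arbitrary):

```python
# Verify h'(x) = 8(2x + 1)^3 for h(x) = (2x + 1)^4 via symbolic differentiation.
import sympy as sp

x = sp.symbols('x')
h = (2*x + 1)**4
dh = sp.diff(h, x)
print(sp.simplify(dh - 8*(2*x + 1)**3))  # 0, so the chain-rule result matches
```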
Partial Derivative
Definition (Partial Derivative). For a function f : ℝ^n → ℝ, x ⟼ f(x), of n variables x_1, ..., x_n, we define the partial derivatives as

∂f/∂x_k = lim_{h→0} [f(x_1, ..., x_k + h, x_{k+1}, ..., x_n) − f(x_1, ..., x_k, ..., x_n)] / h
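The limit can also be approximated numerically by choosing a small h; a small sketch in plain Python (the function f(x1, x2) = x1^3·x2 and the point (1, 2) are arbitrary illustration choices):

```python
# Approximate the partial derivative df/dx1 at (1.0, 2.0) via the limit definition.
def f(x1, x2):
    return x1**3 * x2

h = 1e-6
x1, x2 = 1.0, 2.0
fd = (f(x1 + h, x2) - f(x1, x2)) / h   # finite-difference quotient
exact = 3 * x1**2 * x2                 # exact partial derivative for comparison
print(fd, exact)                       # ~6.000003 vs 6.0
```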
Gradients of f : ℝ^n → ℝ
• We collect all partial derivatives of f in a row vector to form the gradient of f:
  ∇_x f = [∂f/∂x_1  ∂f/∂x_2  …  ∂f/∂x_n]
• Notation: df/dx = ∇_x f = grad f
• Ex. For f : ℝ^2 → ℝ, f(x_1, x_2) = x_1^3 x_2 − x_1 x_2
• Partial derivatives: ∂f/∂x_1 = 3x_1^2 x_2 − x_2,   ∂f/∂x_2 = x_1^3 − x_1
• The gradient of f:
  ∇_x f = [3x_1^2 x_2 − x_2   x_1^3 − x_1] ∈ ℝ^{1×2}   (1 row, 2 columns)
Gradients of f : ℝ^n → ℝ
grad f = df/dx = ∇_x f ∈ ℝ^{1×n}
[Diagram: x ∈ ℝ^3 ⟼ f(x) ∈ ℝ^1, so ∇_x f ∈ ℝ^{1×3}]

Ex. For f(x, y) = (x^3 + 2y)^2, we obtain the partial derivatives
• ∂f/∂x = 2(x^3 + 2y) · ∂(x^3 + 2y)/∂x = 6x^2(x^3 + 2y)
• ∂f/∂y = 2(x^3 + 2y) · ∂(x^3 + 2y)/∂y = 4(x^3 + 2y)
The gradient of f is ∇f = [6x^2(x^3 + 2y)   4(x^3 + 2y)]
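The same gradient can be recomputed symbolically; a small sketch, assuming SymPy (symbol names x, y mirror the example):

```python
# Gradient of f(x, y) = (x^3 + 2y)^2 as a 1x2 row vector.
import sympy as sp

x, y = sp.symbols('x y')
f = (x**3 + 2*y)**2
grad = sp.Matrix([[sp.diff(f, x), sp.diff(f, y)]])
print(sp.simplify(grad))  # [6*x**2*(x**3 + 2*y), 4*(x**3 + 2*y)] (possibly shown expanded)
```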
Gradients/Jacobian of Vector-Valued Functions f : ℝ^n → ℝ^m
• For a vector-valued function f : ℝ^n → ℝ^m,
  f(x) = [f_1(x)  f_2(x)  …  f_m(x)]^T   (a vector of m components),
  where each f_i : ℝ^n → ℝ.
• Gradient (or Jacobian) of f:
  J = ∇_x f = df/dx = [∇_x f_1 ; ∇_x f_2 ; … ; ∇_x f_m]   (one row per component f_i)
  Dimension: m×n
Jacobian of f : ℝ^n → ℝ^m – size
[Diagram: for x ∈ ℝ^3 and f(x) ∈ ℝ^4, the Jacobian J ∈ ℝ^{4×3}; row i collects the partial derivatives of f_i with respect to x_1, x_2, x_3, e.g. entry (2, 3) is ∂f_2/∂x_3.]
Jacobian of f : ℝ^n → ℝ^m – Ex
Ex. Find the Jacobian of f : ℝ^3 → ℝ^2 with
  f_1(x_1, x_2, x_3) = 2x_1 + x_2 x_3,   f_2(x_1, x_2, x_3) = x_1 x_3 − x_2^2
Jacobian of f (J ∈ ℝ^{2×3}):
  J = [ 2     x_3     x_2
        x_3   −2x_2   x_1 ]
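The same Jacobian can be obtained with SymPy's jacobian method; a minimal sketch (symbol names mirror the example):

```python
# Jacobian of f(x) = (2*x1 + x2*x3, x1*x3 - x2**2) with respect to (x1, x2, x3).
import sympy as sp

x1, x2, x3 = sp.symbols('x1 x2 x3')
f = sp.Matrix([2*x1 + x2*x3, x1*x3 - x2**2])
J = f.jacobian([x1, x2, x3])
print(J)  # Matrix([[2, x3, x2], [x3, -2*x2, x1]])
```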
Gradient of f : ℝ^n → ℝ^m – Ex
• We are given f(x) = Ax, f(x) ∈ ℝ^m, A ∈ ℝ^{m×n}, x ∈ ℝ^n. Compute the gradient ∇_x f.
• ∇_x f = (∂f_i/∂x_j); its size is m×n.
• f_i(x) = Σ_{j=1}^{n} a_ij x_j,  so  ∂f_i/∂x_j = a_ij,  and therefore  ∇_x f = A.
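A numerical sketch, assuming NumPy (the 3×4 matrix and the base point are arbitrary), confirming that the Jacobian of x ⟼ Ax is A itself:

```python
# Finite-difference Jacobian of f(x) = A @ x; it should reproduce A column by column.
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
x = rng.standard_normal(4)
h = 1e-6

J = np.zeros_like(A)
for j in range(x.size):
    e = np.zeros_like(x)
    e[j] = h
    J[:, j] = (A @ (x + e) - A @ x) / h   # column j of the Jacobian

print(np.allclose(J, A, atol=1e-4))        # True
```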
Gradient of f : ℝ^n → ℝ^m – Ex2
• Given h : ℝ → ℝ, h(t) = (f∘x)(t), where
  f : ℝ^2 → ℝ,  f(x) = exp(x_1 + x_2^2),
  x : ℝ → ℝ^2,  x(t) = [x_1(t)  x_2(t)]^T = [t  sin t]^T
• Compute dh/dt, the gradient of h with respect to t.
• Use the chain rule (matrix version) for h = f∘x:
  dh/dt = d(f∘x)/dt = (df/dx)(dx/dt)
Gradient of f : ℝ^n → ℝ^m – Ex2
dh/dt = [∂f/∂x_1  ∂f/∂x_2] [∂x_1/∂t ; ∂x_2/∂t]
      = (∂f/∂x_1)(∂x_1/∂t) + (∂f/∂x_2)(∂x_2/∂t)
      = exp(x_1 + x_2^2) · 1 + 2x_2 exp(x_1 + x_2^2) · cos t,
where x_1 = t, x_2 = sin t.
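A small SymPy sketch confirming that differentiating h(t) directly agrees with the chain-rule expression:

```python
# Check dh/dt for h(t) = exp(x1 + x2**2) with x1 = t, x2 = sin(t).
import sympy as sp

t = sp.symbols('t')
x1, x2 = t, sp.sin(t)
h = sp.exp(x1 + x2**2)
lhs = sp.diff(h, t)                                               # direct derivative
rhs = sp.exp(x1 + x2**2) * 1 + 2*x2*sp.exp(x1 + x2**2)*sp.cos(t)  # chain-rule expression
print(sp.simplify(lhs - rhs))  # 0
```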
Gradient of f : ℝ^n → ℝ^m – Exercise
• y ∈ ℝ^N, θ ∈ ℝ^D, Φ ∈ ℝ^{N×D}
  e : ℝ^D → ℝ^N,  e(θ) = y − Φθ
  L : ℝ^N → ℝ,  L(e) = ‖e‖^2 = e^T e
• Find dL/de, de/dθ, and dL/dθ.
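A numerical sketch, assuming NumPy, for checking whatever closed form you derive for dL/dθ (the sizes N = 5, D = 3 and the random data are arbitrary):

```python
# Finite-difference gradient of L(theta) = ||y - Phi @ theta||^2 with respect to theta.
import numpy as np

rng = np.random.default_rng(1)
N, D = 5, 3
y, Phi, theta = rng.standard_normal(N), rng.standard_normal((N, D)), rng.standard_normal(D)

def L(th):
    e = y - Phi @ th
    return e @ e

h = 1e-6
grad = np.array([(L(theta + h * np.eye(D)[j]) - L(theta)) / h for j in range(D)])
print(grad)  # compare with your derived expression for dL/dtheta evaluated at theta
```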
Gradient of A : ℝ^m → ℝ^{p×q}
Approach 1: compute the partial derivative of A with respect to each of the m variables directly; collecting them gives a (p×q)×m object.
[Figure: the gradient visualized as a 4×2×3 tensor]

Gradient of A : ℝ^m → ℝ^{p×q}
Approach 2: re-shape the matrices into vectors, compute the gradient of the resulting vector-valued function, and re-shape the result.
[Figure: the same gradient re-assembled as a 4×2×3 tensor]
Gradients of A : ℝ^m → ℝ^{p×q} – Ex
• Ex. Consider A : ℝ^3 → ℝ^{3×2},
  A(x_1, x_2, x_3) = [ x_1 − x_2       x_1 + x_3
                       x_1^2 + x_3     2x_1
                       x_3 − x_2       x_1 + x_2 + x_3 ]
• The dimension of dA/dx is (3×2)×3.
• Approach 1: one 3×2 matrix of partial derivatives per variable, forming a (3×2)×3 tensor:
  ∂A/∂x_1 = [ 1     1        ∂A/∂x_2 = [ −1  0        ∂A/∂x_3 = [ 0  1
              2x_1  2                     0   0                    1  0
              0     1 ]                   −1  1 ]                   1  1 ]
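The three partial-derivative matrices can be generated with SymPy by differentiating the Matrix elementwise; a small sketch mirroring the example:

```python
# Partial derivatives of the matrix-valued function A(x1, x2, x3).
import sympy as sp

x1, x2, x3 = sp.symbols('x1 x2 x3')
A = sp.Matrix([[x1 - x2,    x1 + x3],
               [x1**2 + x3, 2*x1],
               [x3 - x2,    x1 + x2 + x3]])
for v in (x1, x2, x3):
    print(sp.diff(A, v))   # one 3x2 slice of the (3x2)x3 gradient tensor per variable
```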
Gradient of f : ℝ^{m×n} → ℝ^p – Ex
f : ℝ^{M×N} → ℝ^M,  f(A) = Ax with x ∈ ℝ^N fixed, so
  f_i = A_i1 x_1 + … + A_ik x_k + … + A_iN x_N
  ∂f_i/∂A_ik = x_k,   ∂f_i/∂A_jk = 0 for j ≠ i
Hence df/dA ∈ ℝ^{M×(M×N)}: the partial derivative of f_i with respect to A is itself an M×N matrix whose i-th row is [x_1 … x_N] and whose remaining rows are zero.
Gradient of Matrices with Respect to Matrices – ℝ^{m×n} → ℝ^{p×q}
For R ∈ ℝ^{M×N} and f : ℝ^{M×N} → ℝ^{N×N} with f(R) = R^T R = K ∈ ℝ^{N×N}:
Compute the gradient dK/dR.
The gradient has the dimensions dK/dR ∈ ℝ^{(N×N)×(M×N)}, and for each entry K_pq of K,
  dK_pq/dR ∈ ℝ^{1×(M×N)}.
Gradient of Matrices with Respect to Matrices – ℝ^{m×n} → ℝ^{p×q}
dK/dR ∈ ℝ^{(N×N)×(M×N)},   K = R^T R
R = [r_1 r_2 … r_N], where r_i is the i-th column of R
dK_pq/dR ∈ ℝ^{1×(M×N)},   dK_pq/dR_ij ∈ ℝ
K_pq = r_p^T r_q = Σ_{m=1}^{M} R_mp R_mq
∂K_pq/∂R_ij =
  R_iq    if j = p, p ≠ q
  R_ip    if j = q, q ≠ p
  2R_iq   if j = p = q
  0       otherwise
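A finite-difference sketch, assuming NumPy (M = 4, N = 3, the entries of R, and the test indices are all arbitrary), that spot-checks the case rule for ∂K_pq/∂R_ij:

```python
# Spot-check dK_pq/dR_ij for K = R^T R against the closed-form case rule above.
import numpy as np

rng = np.random.default_rng(2)
M, N = 4, 3
R = rng.standard_normal((M, N))
h = 1e-6

def K(Rm):
    return Rm.T @ Rm

p, q, i, j = 1, 2, 0, 1                 # arbitrary indices to test
E = np.zeros((M, N)); E[i, j] = h
numeric = (K(R + E)[p, q] - K(R)[p, q]) / h

if j == p == q:
    closed = 2 * R[i, q]
elif j == p:
    closed = R[i, q]
elif j == q:
    closed = R[i, p]
else:
    closed = 0.0
print(numeric, closed)                   # should agree to about 1e-5
```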
Backpropagation - Introduction
• Probably the single most important algorithm in all of deep learning.
• In many machine learning applications, we find good model parameters by gradient descent, which requires computing the gradient of a learning objective with respect to the parameters of the model. The number of parameters can be large: for example, an ANN with a single hidden layer of 150 nodes applied to a 128×128×3 color image already needs at least 128×128×3×150 = 7,372,800 weights.
• The backpropagation algorithm is an efficient way to compute the gradient of an error function with respect to the parameters of the model.
ML Needs Gradients
Given training data {(x_1, y_1), (x_2, y_2), …, (x_m, y_m)}
Choose decision and cost functions:
  ŷ_i = f_θ(x_i),   C(ŷ_i, y_i)
Define the goal:
  Find θ* that minimizes (1/m) Σ_i C(ŷ_i, y_i)
Train the model with (stochastic) gradient descent to update θ:
  θ^(t+1) = θ^(t) − γ · ∂C/∂θ^(t) (x_i, y_i)    (γ: learning rate)
The backpropagation algorithm is an efficient way to compute the gradient.
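A minimal sketch of this update rule on a one-parameter least-squares model (plain Python; the data, the learning rate 0.1, and the number of passes are arbitrary illustration choices):

```python
# Stochastic gradient descent on C(theta) = (theta*x - y)^2, one example at a time.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]   # (x_i, y_i) pairs
theta, lr = 0.0, 0.1
for epoch in range(100):
    for x, y in data:
        grad = 2 * (theta * x - y) * x         # dC/dtheta for this single example
        theta = theta - lr * grad              # theta(t+1) = theta(t) - lr * gradient
print(theta)                                    # close to 2.0, the slope fitting the data
```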
Epochs
• The backpropagation algorithm runs over many cycles (epochs), each consisting of two phases:
  Forward phase:  a^(0) → z^(1), a^(1) → z^(2), a^(2) → … → C
  Backward phase: ∂C/∂θ^(N), …, ∂C/∂θ^(2), ∂C/∂θ^(1)
Deep Network (ANN with hidden layers)
Activation equations (matrix version)
Layer (1) = hidden layer:
  z^(1) = W^(1) a^(0) + b^(1)
  a^(1) = σ_1(z^(1))
Layer (2) = output layer:
  z^(2) = W^(2) a^(1) + b^(2)
  a^(2) = σ_2(z^(2))
The cost for example number k:
  C_k = (1/2) Σ_i (a_i^(2) − y_i)^2 = (1/2) ‖a^(2) − y‖^2
Forward phase
For L = 1..N, with a^(0) = x:
  z^(L) = W^(L) a^(L−1) + b^(L)
  a^(L) = σ_L(z^(L))
C: cost function (e.g., C = (1/2) ‖a^(N) − y‖^2)
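A minimal NumPy sketch of this forward pass for a 2-layer network (the layer sizes 3→4→2, the sigmoid activation, and the random parameters are arbitrary illustration choices):

```python
# Forward phase: a(0) = x, then z(L) = W(L) a(L-1) + b(L), a(L) = sigma(z(L)).
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid activation

rng = np.random.default_rng(3)
sizes = [3, 4, 2]                        # input, hidden, output layer widths
W = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(2)]
b = [rng.standard_normal(sizes[l + 1]) for l in range(2)]

x = rng.standard_normal(3)
y = np.array([0.0, 1.0])
a = x                                    # a(0) = x
for l in range(2):
    z = W[l] @ a + b[l]
    a = sigma(z)
C = 0.5 * np.sum((a - y) ** 2)           # cost C = 1/2 ||a(N) - y||^2
print(a, C)
```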
Backpropagation
Layer 1, Layer 2, …, Layer N−1, Layer N

∂C/∂W^(N) = ∂C/∂a^(N) · ∂a^(N)/∂W^(N)
∂C/∂b^(N) = ∂C/∂a^(N) · ∂a^(N)/∂b^(N)

∂C/∂W^(N−1) = ∂C/∂a^(N) · ∂a^(N)/∂a^(N−1) · ∂a^(N−1)/∂W^(N−1)
∂C/∂b^(N−1) = ∂C/∂a^(N) · ∂a^(N)/∂a^(N−1) · ∂a^(N−1)/∂b^(N−1)

∂C/∂W^(N−2) = ∂C/∂a^(N) · ∂a^(N)/∂a^(N−1) · ∂a^(N−1)/∂a^(N−2) · ∂a^(N−2)/∂W^(N−2)
∂C/∂b^(N−2) = ∂C/∂a^(N) · ∂a^(N)/∂a^(N−1) · ∂a^(N−1)/∂a^(N−2) · ∂a^(N−2)/∂b^(N−2)

Benefit of backpropagation: the leading factors (boxed on the slide) are shared between layers and can be reused instead of recomputed.
Backpropagation
Activation equations:
  z^(L) = W^(L) a^(L−1) + b^(L),   a^(L) = σ_L(z^(L)),   C: cost function

∂C/∂W^(L+1) = ∂C/∂a^(N) · ∂a^(N)/∂a^(N−1) · … · ∂a^(L+2)/∂a^(L+1) · ∂a^(L+1)/∂W^(L+1)
∂C/∂W^(L)   = ∂C/∂a^(N) · ∂a^(N)/∂a^(N−1) · … · ∂a^(L+1)/∂a^(L) · ∂a^(L)/∂W^(L)
∂C/∂b^(L)   = ∂C/∂a^(N) · ∂a^(N)/∂a^(N−1) · … · ∂a^(L+1)/∂a^(L) · ∂a^(L)/∂b^(L)

At each layer (L), we need to compute
  e^(L) := ∂C/∂a^(L) = ∂C/∂a^(N) · ∂a^(N)/∂a^(N−1) · … · ∂a^(L+1)/∂a^(L) = e^(L+1) · ∂a^(L+1)/∂a^(L)
Backpropagation: compute e^(L+1) (at layer L+1) before computing e^(L) (at layer L).
Backpropagation algorithm
Activation equations:  z^(L) = W^(L) a^(L−1) + b^(L),   a^(L) = σ_L(z^(L)),   C: cost function

For each example in the training examples:
1. Feed forward
2. Backpropagation
   At the output layer (N), compute and store:
     e^(N) = ∂C/∂a^(N)
     ∂C/∂W^(N) = e^(N) · ∂a^(N)/∂W^(N),   ∂C/∂b^(N) = e^(N) · ∂a^(N)/∂b^(N)
   For layer (L) from N−1 down to 1:
     • Compute e^(L) using e^(L) = e^(L+1) · ∂a^(L+1)/∂a^(L)
     • Compute ∂C/∂W^(L) = e^(L) · ∂a^(L)/∂W^(L),   ∂C/∂b^(L) = e^(L) · ∂a^(L)/∂b^(L)
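A compact NumPy sketch of one forward pass followed by one backward pass for a 3→4→2 network with sigmoid activations and squared-error cost; all sizes and random values are illustration choices, and e plays the role of ∂C/∂a from the slides:

```python
# One forward/backward pass, following the convention e(L) = dC/da(L).
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))      # sigmoid; note sigma'(z) = a * (1 - a)

rng = np.random.default_rng(4)
sizes = [3, 4, 2]
W = [rng.standard_normal((sizes[l + 1], sizes[l])) for l in range(2)]
b = [rng.standard_normal(sizes[l + 1]) for l in range(2)]
x, y = rng.standard_normal(3), np.array([0.0, 1.0])

# 1. Feed forward, storing the activations a(0)..a(N)
a = [x]
for l in range(2):
    a.append(sigma(W[l] @ a[-1] + b[l]))

# 2. Backpropagation with C = 1/2 ||a(N) - y||^2, so e(N) = a(N) - y
dW, db = [None, None], [None, None]
e = a[-1] - y
for l in (1, 0):                         # from the output layer back to layer 1
    dz = e * a[l + 1] * (1 - a[l + 1])   # dC/dz(L) = e(L) * sigma'(z(L))
    dW[l] = np.outer(dz, a[l])           # dC/dW(L)
    db[l] = dz                           # dC/db(L)
    e = W[l].T @ dz                      # e(L-1) = dC/da(L-1), reused next iteration
print(dW[0].shape, dW[1].shape)          # (4, 3) (2, 4)
```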
Higher-order partial derivatives
Consider a function f : ℝ^2 → ℝ of two variables x, y.
Second-order partial derivatives: ∂²f/∂x², ∂²f/∂x∂y, ∂²f/∂y∂x, ∂²f/∂y²

Ex. f : ℝ^2 → ℝ, f(x, y) = x^3 y − 3xy^2 + 5y
  ∂f/∂x = 3x^2 y − 3y^2,   ∂f/∂y = x^3 − 6xy + 5
  ∂²f/∂x² = 6xy,           ∂²f/∂x∂y = 3x^2 − 6y
  ∂²f/∂y∂x = 3x^2 − 6y,    ∂²f/∂y² = −6x

∂^n f/∂x^n is the n-th partial derivative of f with respect to x.
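A small SymPy sketch reproducing these second-order partials and confirming that the two mixed partials agree:

```python
# Second-order partial derivatives of f(x, y) = x**3*y - 3*x*y**2 + 5*y.
import sympy as sp

x, y = sp.symbols('x y')
f = x**3 * y - 3*x*y**2 + 5*y
print(sp.diff(f, x, x))   # 6*x*y
print(sp.diff(f, x, y))   # 3*x**2 - 6*y
print(sp.diff(f, y, x))   # 3*x**2 - 6*y  (mixed partials agree)
print(sp.diff(f, y, y))   # -6*x
```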
The Hessian
• The Hessian is the collection of all second-order partial derivatives.
• The Hessian matrix is symmetric for twice continuously differentiable functions, that is,
  ∂²f/∂x∂y = ∂²f/∂y∂x
Gradient vs Hessian of f : ℝ^n → ℝ
Consider a function f : ℝ^n → ℝ.

Gradient:
  ∇f = [∂f/∂x_1  ∂f/∂x_2  …  ∂f/∂x_n]
  Dimension: 1×n

Hessian:
  ∇²f = [ ∂²f/∂x_1²       ∂²f/∂x_1∂x_2   …  ∂²f/∂x_1∂x_n
          ∂²f/∂x_2∂x_1    ∂²f/∂x_2²      …  ∂²f/∂x_2∂x_n
          …               …              …  …
          ∂²f/∂x_n∂x_1    ∂²f/∂x_n∂x_2   …  ∂²f/∂x_n² ]
  Dimension: n×n
Gradient vs Hessian of f : ℝ^n → ℝ^m
Consider a (vector-valued) function f : ℝ^n → ℝ^m.
[Diagram: inputs x_1, x_2, x_3 mapped to outputs f_1, f_2]
Gradient: an m×n matrix (dimension 2×3 in the diagram)
Hessian: an m×(n×n) tensor (dimension 2×(3×3) in the diagram)
Example
• Compute the Hessian of the function z = f(x, y) = x^2 + 6xy − y^3 and evaluate it at the point (x = 1, y = 2, z = 5).
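A small sketch of the computation, assuming SymPy (its hessian helper builds the matrix of second-order partials):

```python
# Hessian of f(x, y) = x**2 + 6*x*y - y**3, evaluated at (x, y) = (1, 2).
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 + 6*x*y - y**3
H = sp.hessian(f, (x, y))
print(H)                        # Matrix([[2, 6], [6, -6*y]])
print(H.subs({x: 1, y: 2}))     # Matrix([[2, 6], [6, -12]])
```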
Taylor series for f : ℝ → ℝ
• Taylor polynomials
• Approximation problems
Taylor series for f : ℝ^D → ℝ
Consider a function f : ℝ^D → ℝ that is smooth at x_0.
The multivariate Taylor series of f at x_0 is defined as
  f(x) = Σ_{k=0}^{∞} (1/k!) D_x^k f(x_0) δ^k,
where D_x^k f(x_0) is the k-th (total) derivative of f with respect to x, evaluated at x_0, and δ := x − x_0.
Example
Find the Taylor series for the function
f(x, y) = x^2 + 2xy + y^3 at x_0 = 1, y_0 = 2.
Taylor series of f(x, y) = x^2 + 2xy + y^3
The Taylor series expansion of f at (x_0, y_0) = (1, 2) is
  f(x, y) = 13 + 6(x − 1) + 14(y − 2) + (x − 1)^2 + 2(x − 1)(y − 2) + 6(y − 2)^2 + (y − 2)^3,
which is exact, since f is a polynomial of degree 3.
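A SymPy sketch that rebuilds this expansion term by term from the derivatives at (1, 2) and confirms it reproduces f exactly:

```python
# Multivariate Taylor expansion of f(x, y) = x**2 + 2*x*y + y**3 around (1, 2).
import sympy as sp

x, y = sp.symbols('x y')
f = x**2 + 2*x*y + y**3
x0, y0 = 1, 2
dx, dy = x - x0, y - y0

taylor = sp.Integer(0)
for order in range(4):                  # f is cubic, so orders 0..3 suffice
    for i in range(order + 1):          # i derivatives in x, (order - i) in y
        coeff = sp.diff(f, x, i, y, order - i).subs({x: x0, y: y0})
        taylor += coeff / (sp.factorial(i) * sp.factorial(order - i)) * dx**i * dy**(order - i)

print(sp.expand(taylor - f))   # 0: the cubic Taylor polynomial equals f
```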
Summary
Differentiation of Univariate Functions
Partial Differentiation and Gradients
Gradients of Matrices
Backpropagation
Higher-Order Derivatives
Linearization and Multivariate Taylor Series
THANKS