
Mathematics for Machine Learning 1

Floor Eijkelboom∗ , Tin Hadži Veljković∗


{eijkelboomfloor, tin.hadzi}@gmail.com

Contents

1 Introduction
2 Linear Algebra
  2.1 Basics
  2.2 Vector spaces
  2.3 Basis
  2.4 Dot product
  2.5 Linear Operators
  2.6 Change of Basis
3 Multivariate Calculus
  3.1 What are derivatives and why should we care?
  3.2 Univariate derivatives
  3.3 Multivariate derivatives
  3.4 Jacobians
4 Statistical Learning
A Derivative rules

* These authors contributed equally to this work.

1 Introduction
Hi there! Welcome to the Machine Learning 1 (ML1) course of the MSc Artificial Intelligence at the
University of Amsterdam. These notes are provided to help you familiarize yourself with the
‘prior knowledge’ needed to follow the ML1 course. In practice, however, you will probably not know
all the material here, and please do not feel discouraged by that. We advise you to read through
these notes and see which things you already knew and which things you need to brush up on.
The document is divided into three parts: linear algebra, multivariate calculus, and a general intro-
duction to machine learning. The first section aims to refresh your knowledge about vectors, ma-
trices, linear transformations, determinants, bases, orthonormal projections, eigen decompositions,
and similar topics. The second section is focused on calculus in higher dimensions, essentially gen-
eralizing the derivative from standard (real) functions to real functions between higher-dimensional
Euclidean spaces. The last section will zoom into the machine learning problem, the actual reason
you are probably reading these notes, to begin with. It is a concise sketch of the general problem
you will be facing for the next weeks.
If you see any errors in this document, please write us and we will make sure to address them as soon
as possible. Moreover, feel free to share this document with whoever might profit from it. Good
luck with your studies!
Floor & Tin

2 Linear Algebra
2.1 Basics
Linear algebra serves as a core of most machine learning algorithms that you will encounter through-
out the course, as the majority of objects are represented as vectors and matrices (matrices are called
arrays/tensors in NumPy/PyTorch). For this reason, we will systematically revise all the essential
concepts such as vectors, matrices, linear operators, and determinants. To intuitively explain certain
concepts, we will use jargon which will be denoted in italic font.

2.2 Vector spaces


In order to introduce vector spaces, the spaces in which vectors live, we will first try to motivate
the formal definition, which will follow later.

Informal definition
Firstly, let’s denote a vector by a bold letter v. The easiest way to visualize a vector is to associate
it with something familiar. For example, imagine you live on a flat Earth and you’re on a hike and
you wish to send your friends your location. You could, for example, represent your location as a
3D vector:

$$\mathbf{v} = \begin{pmatrix} x_1 \\ y_1 \\ z_1 \end{pmatrix}$$
In this notation, x1 and y1 are your initial longitude/latitude offset from the bottom of the mountain
(the amount you moved west/east and south/north), while z1 might represent your altitude. You
continue your hike, change your longitude/latitude by x2 and y2 , and climb up by z2 to reach the
peak. Then, your new coordinates v′ are:

$$\mathbf{v}' = \begin{pmatrix} x_1 + x_2 \\ y_1 + y_2 \\ z_1 + z_2 \end{pmatrix}$$

In other words, your new coordinates are simply a sum of the two offsets. Notice that the sum of
two independent offsets produced a new location v′ which also represents a valid location.
Now, imagine you’re going on the same hike, but this time the mountain grew in size by a factor of
λ, and you wish to come to the same peak as last time. Intuitively, we can deduce that you will
have to move further by a factor of λ in each direction, so the total offset w will be given by:

$$\mathbf{w} = \begin{pmatrix} \lambda x_1 + \lambda x_2 \\ \lambda y_1 + \lambda y_2 \\ \lambda z_1 + \lambda z_2 \end{pmatrix} = \lambda \begin{pmatrix} x_1 + x_2 \\ y_1 + y_2 \\ z_1 + z_2 \end{pmatrix} = \lambda \mathbf{v}'$$

This tells us that even if we multiply our offsets by a number λ, we can still represent a valid location.
This was a very specific example to aid the visualization of certain properties that define a vector
space, which we will soon define. If we think of a vector as an abstract object which doesn’t
correspond to anything visualizable, then the above-mentioned properties can be thought of as the
following. First, we want the sum of two vectors to also be a vector from the same space. Second, if
we scale a given vector, we wish that the scaled version is also a part of the same vector space.

Formal definition
We shall now introduce a formal definition of a vector space.
Definition 2.1. A vector space over a field F is a set V with two binary operations:
1. Vector addition assigns to any two vectors v and w in V a third vector in V which is denoted
by v + w.
2. Scalar multiplication assigns to any scalar λ in F and any vector v in V a new vector in V,
which is denoted by λv.
Vector spaces also have to satisfy 8 axioms, which can be found here (most of them are trivial and
intuitive).
In the definition above, a field F is simply a structure from which we take scalars that we multiply
our vectors by. In most cases, the field will simply be real numbers R.
If we come back to the hiking example, we were dealing with the vectors from R3 , as we had 3 entries
of the vector, and each entry was a real number (coordinates are real numbers).

Summary
Vectors are objects that live in a vector space. It is important to note that a vector space is a space
defined by only two operations with objects: how to add objects and how to scale them. If we know
how to do that, we call that space a vector space. In further sections, we will explore other ways to
utilize and transform vectors besides the addition of vectors and multiplication by a scalar.
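To connect this to the NumPy arrays mentioned in Section 2.1, here is a minimal sketch of the two vector space operations from the hiking example. It is not part of the original notes, and the coordinate values are made up purely for illustration.

```python
import numpy as np

# Two offsets from the hiking example (made-up coordinates).
v1 = np.array([2.0, 1.0, 300.0])   # first leg: east, north, altitude gain
v2 = np.array([1.5, 0.5, 150.0])   # second leg

# Vector addition: the total offset is again a valid 3D vector.
v_prime = v1 + v2

# Scalar multiplication: scaling the mountain by lambda scales the offset.
lam = 2.0
w = lam * v_prime

print(v_prime)  # [  3.5   1.5 450. ]
print(w)        # [  7.    3.  900. ]
```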

2.3 Basis
Similar to the previous section, we will first informally motivate the definition of a basis, and only
then formalize it.

Informal definition
The basis of a vector space provides an organized way to represent any vector in that space. As a
simple example, let’s think about possible colors produced by a pixel on the screen you are reading
this on. Every pixel consists of 3 lighting elements: red, green, and blue, and every other color
can be reproduced by varying the intensities of each of these colors. Since the lighting elements are
independent, we can represent an arbitrary color c as follows:

$$\mathbf{c} = \begin{pmatrix} r_i \\ g_i \\ b_i \end{pmatrix}$$

where ri /gi /bi denote the intensities of the red/green/blue light. By tuning these three numbers,
we can represent any color reproducible by our monitor. Now, let's rewrite this more suggestively:

$$\mathbf{c} = \begin{pmatrix} r_i \\ 0 \\ 0 \end{pmatrix} + \begin{pmatrix} 0 \\ g_i \\ 0 \end{pmatrix} + \begin{pmatrix} 0 \\ 0 \\ b_i \end{pmatrix} = r_i \cdot \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} + g_i \cdot \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix} + b_i \cdot \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}$$

We can now also give unique names to the column vectors and write the previous expression as
follows:
c = ri · r + gi · g + bi · b,
where:

$$\mathbf{r} = \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix}, \quad \mathbf{g} = \begin{pmatrix} 0 \\ 1 \\ 0 \end{pmatrix}, \quad \mathbf{b} = \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix}$$
The color vector c has been written as a weighted sum of other vectors, and this is called a linear
combination. Since we can uniquely represent any vector (color) using these three vectors, we say
that vectors r, g and b form a basis. A basis can be thought of as a set of independent vectors
whose linear combination can uniquely represent any vector. The basis of this form, where the n-th
basis vector has 1 as the n-th element and 0 otherwise, is called a canonical basis and is the
simplest form of basis.
It is worth investigating why we impose the condition that the basis vectors need to be independent,
and what independence means. For simplicity, imagine that a purple color p can be expressed as a
linear combination of red and blue:

$$\mathbf{p} = \frac{1}{\sqrt{2}} \mathbf{r} + \frac{1}{\sqrt{2}} \mathbf{b} = \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}$$

Now, let’s imagine that we add the color purple to our basis, so our basis now consists of {r, g, b, p}
(you have 4 lights in your pixel now). Your friend told you about an imaginary color durple d, and
they told you that they use the {r, g, b} basis for their pixels. They represent the color durple as
follows:

$$\mathbf{d} = \frac{1}{\sqrt{2}} \begin{pmatrix} -1 \\ 0 \\ 1 \end{pmatrix}$$
In order to reproduce this color, you start turning the 4 knobs of color intensities (one for each
different color in your pixel). First, you do not use your purple color, and you just stick with red
and blue. You find that the following combination reproduces durple:

$$\mathbf{d} = -\frac{1}{\sqrt{2}} \mathbf{r} + \frac{1}{\sqrt{2}} \mathbf{b} = -\frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} + \frac{1}{\sqrt{2}} \begin{pmatrix} 0 \\ 0 \\ 1 \end{pmatrix} = \frac{1}{\sqrt{2}} \begin{pmatrix} -1 \\ 0 \\ 1 \end{pmatrix}$$

However, you start turning the purple knob, tune red and blue a bit, and you realize that also the
following combination produces durple:

$$\mathbf{d} = -\sqrt{2} \cdot \mathbf{r} + \mathbf{p} = -\frac{2}{\sqrt{2}} \begin{pmatrix} 1 \\ 0 \\ 0 \end{pmatrix} + \frac{1}{\sqrt{2}} \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix} = \frac{1}{\sqrt{2}} \begin{pmatrix} -1 \\ 0 \\ 1 \end{pmatrix}$$

This tells us that after adding the purple color to our basis, our representation of the durple color
was no longer unique, i.e. there were multiple ways to produce it. This stems from the fact that we
have added purple to our basis, which was not independent since we were able to write it as a linear
combination of already existing colors (red and blue). An equally valid choice of basis would have
been to remove the color red from our basis set, and simply have {g, b, p} as our basis.

An intuitive way to think of a basis is as a set of vectors, of which none can be written as a linear
combination of the rest. Formally it can be shown that the number of basis vectors has to equal
the dimension of the vector space. For example, in the example above, the number of basis vectors
was 3, as we had 3-dimensional vectors. If we were to add any more vectors to our basis, we would
necessarily add vectors that are no longer independent of each other, and thus we wouldn’t have
a systematic and unique way to represent arbitrary vectors. If we were to remove any vectors (for
example, have 2 vectors in our basis), we wouldn’t be able to express an arbitrary vector as a linear
combination of the basis vectors, as we would be missing building blocks.
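As a small numerical illustration of this dependence argument (not part of the original notes), we can check linear independence with NumPy: the rank of the matrix whose rows are the candidate basis vectors equals the number of independent vectors among them.

```python
import numpy as np

r = np.array([1.0, 0.0, 0.0])
g = np.array([0.0, 1.0, 0.0])
b = np.array([0.0, 0.0, 1.0])
p = (r + b) / np.sqrt(2)          # purple is a combination of red and blue

basis = np.stack([r, g, b])        # 3 vectors in R^3
extended = np.stack([r, g, b, p])  # 4 vectors in R^3

print(np.linalg.matrix_rank(basis))     # 3 -> linearly independent, a valid basis
print(np.linalg.matrix_rank(extended))  # 3 < 4 -> the extended set is dependent
```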

Formal definition
In Linear Algebra, the basis of a vector space V is formally defined as follows.
Definition 2.2. A basis B of a vector space V over a field F is a linearly independent subset of V
that spans V. This subset, therefore, has to satisfy the following conditions:
1. Linear independence: For every finite subset {v1 , . . . , vm } of B, none of the m elements
can be represented as a linear combination of the rest.
2. Spanning property: For every vector v in V, one can choose scalars λ1 , . . . , λn from the
field F and vectors v1 , . . . , vn from B such that v = λ1 v1 + . . . + λn vn .
The first condition states what we discussed above; if we wish to have a basis, we mustn’t be able to
represent any of the basis elements by a linear combination of other basis elements. This is required
if we wish to uniquely represent every vector using basis vectors.
The second condition tells us that we must be able to represent any vector from the vector space
using a linear combination of the basis vectors. These two conditions combined lead to the fact
that for a basis of an n-dimensional vector space we must have exactly n linearly independent basis
vectors.

2.4 Dot product


So far, we have only seen two operations we can do with vectors: addition of vectors and multiplica-
tion of vectors by a scalar. The two operations combined allowed us to form a definition of a linear
combination and basis.
A vanilla vector space does not have any other operations that involve two vectors. However, a
vector space can be equipped with an inner product to form an inner product space.¹ A dot product²
between vectors v = [a1 , . . . , an ]^T ³ and w = [b1 , . . . , bn ]^T is defined as follows:

$$\mathbf{v} \cdot \mathbf{w} = a_1 b_1 + \ldots + a_n b_n = \sum_{i=1}^{n} a_i b_i$$

To see the benefit and the interpretation of the dot product, let’s take a closer look at a case when
we calculate a dot product of a vector with itself:
$$\mathbf{v} \cdot \mathbf{v} = \sum_{i=1}^{n} a_i a_i = \sum_{i=1}^{n} a_i^2$$
1 An interested reader can find more information here.
2 Inner product and a dot product are often used interchangeably, although there are subtle differences, refer here
for a brief discussion.
3 Letter T stands for the transpose operation, more information can be found here.

What we can see from this is that this corresponds to the squared norm/magnitude of the vector v.
The usual notation for the norm of a vector is ∥·∥, so we can write:
$$\|\mathbf{v}\| = \sqrt{\sum_{i=1}^{n} a_i^2} \implies \|\mathbf{v}\| = \sqrt{\mathbf{v} \cdot \mathbf{v}}$$

As a simple example, let's imagine that we have a 2D vector c = a + b, where a = [a, 0]^T and
b = [0, b]^T, as shown in the figure below:

[Figure: the vector c drawn as the diagonal of a right triangle with legs a (along the x-axis) and b (along the y-axis).]
If we calculate the dot product of the vector c with itself, we get:

$$\|\mathbf{c}\|^2 = a^2 + b^2,$$

which is exactly the Pythagorean theorem in 2D.


Besides being useful for calculating norms of vectors, dot product can be used as a measure of
similarity. If we imagine two n-dimensional vectors v and w, the angle θ between them can be
calculated using the following formula:

v · w = ∥v∥ ∥w∥ cos θ

We can divide both sides by the norms of both vectors to get the expression for the cosine of the
angle between the vectors:
$$\cos \theta = \frac{\mathbf{v} \cdot \mathbf{w}}{\|\mathbf{v}\| \, \|\mathbf{w}\|}$$
When the cosine of the angle between two vectors is equal to 1, the vectors are perfectly aligned
(interpreted as being as similar as possible), and when it is equal to 0, the vectors are perpendicular
(interpreted as being as different as possible). This can be interpreted as a measure of similarity
(often called the cosine similarity), which is often used in many areas, such as Natural Language
Processing (more information with some examples can be found here).
To sum up, we have introduced a new operation we can use to manipulate vectors, the dot product.
It is a useful tool because it allows us to easily calculate the norms of vectors, and also the cosine
similarity between them.
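The following minimal NumPy sketch (not part of the original notes; the vectors are arbitrary examples) computes the dot product, the norm via the dot product, and the cosine similarity described above.

```python
import numpy as np

v = np.array([1.0, 2.0, 3.0])
w = np.array([2.0, 0.0, 1.0])

dot = np.dot(v, w)                 # a1*b1 + ... + an*bn
norm_v = np.sqrt(np.dot(v, v))     # ||v|| = sqrt(v . v)
assert np.isclose(norm_v, np.linalg.norm(v))

# Cosine similarity: 1 means perfectly aligned, 0 means perpendicular.
cos_theta = dot / (np.linalg.norm(v) * np.linalg.norm(w))
print(dot, norm_v, cos_theta)
```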

2.5 Linear Operators


Mappings
In linear algebra, besides the operations that involve two vectors (vector addition, dot product),
there are functions (mappings) that take as an input a vector, and output a vector. Let’s denote
this mapping as f . Formally, any mapping of this sort can be written as:

f :V →W

This is a standard mathematical notation which means the following: a function f takes as an
input a vector from the vector space V and outputs a vector from a vector space W . Now, you
might wonder why there are two vector spaces involved, and this will become more clear after a few
examples.
Let's consider the following two mappings f and g:

$$f\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x \\ y \\ x + y \end{pmatrix}, \qquad g\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} x \\ y \\ xy \end{pmatrix}$$

This is a generalization of functions that we are used to; here our inputs are vectors, and so are the
outputs. If x and y are real numbers, then formally we can write this mapping as f : R2 → R3 ,
since the input to our mapping is a 2D vector, and the output is a 3D vector (therefore they live
in different vector spaces). You will work more with these types of functions in the Multivariate
Calculus section.
Now, which mappings can be called linear mappings? The conditions are intuitive, and quite similar
to the ones for vector spaces, so we shall now provide a formal definition.
Definition 2.3. Let V and W be vector spaces over the same field F. A function f : V → W is
said to be a linear map if for any two vectors v, w from V and any scalar λ from F, the following
two conditions are satisfied:
1. Additivity: f (v + w) = f (v) + f (w)
2. Homogeneity: f (λv) = λf (v)
The first condition states that the transformation of the sum of vectors has to be equal to the sum
of transformations of every vector individually.
The second condition simply states that it shouldn’t matter whether we first multiply the vector v
by a scalar λ and then transform it, or we first transform the vector v and then multiply it by λ.
Now, are the mappings f and g above linear or not? The way to check it is by testing whether they
satisfy the additivity and homogeneity conditions, which we leave as an exercise4 .

Matrix-vector multiplication
If you’ve encountered linear algebra before, then you probably associate linear mappings/transformations/operators
with matrices. Let’s first discuss how and why matrix-vector multiplication works, and then we will
connect it to the concept of linear mappings discussed in the previous subsection.
To start, let’s imagine we have a very simple canonical basis B = {b1 , b2 } in R2 , where the basis
vectors are:

$$\mathbf{b}_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad \mathbf{b}_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}$$
4 You should find out that f is indeed linear, while g isn’t.

The matrix representation of a linear transformation is defined to have the following form: n-th
column of the matrix corresponds to a vector to which the n-th canonical basis vector transforms.
For example, let's observe the following matrix A:

$$A = \begin{pmatrix} -1 & -2 \\ 1 & -1 \end{pmatrix},$$

This means that the matrix A will transform the vectors b1 and b2 into b′1 and b′2 in the following
way:

$$\mathbf{b}_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \rightarrow \mathbf{b}_1' = \begin{pmatrix} -1 \\ 1 \end{pmatrix}, \qquad \mathbf{b}_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \rightarrow \mathbf{b}_2' = \begin{pmatrix} -2 \\ -1 \end{pmatrix},$$
which is visualized in the figure below.
[Figure: the canonical basis vectors b1, b2 and their images b1', b2' under A, drawn in the xy-plane.]

Now that we know how a matrix transformation transforms our basis vectors, let’s see how this
applies to an arbitrary vector. Let’s consider a general matrix A and a vector v which have the
following form:

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \qquad \mathbf{v} = \begin{pmatrix} v_1 \\ v_2 \end{pmatrix}$$
The vector produced by the matrix-vector multiplication shall be denoted as w. Let’s try to calculate
it using the rules of vector spaces and linear operators that we have learned so far:

$$
\begin{aligned}
\mathbf{w} &= \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} v_1 \\ v_2 \end{pmatrix} \\
&\overset{(1)}{=} \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \left[ \begin{pmatrix} v_1 \\ 0 \end{pmatrix} + \begin{pmatrix} 0 \\ v_2 \end{pmatrix} \right] \\
&\overset{(2)}{=} \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \left[ v_1 \begin{pmatrix} 1 \\ 0 \end{pmatrix} + v_2 \begin{pmatrix} 0 \\ 1 \end{pmatrix} \right] \\
&\overset{(3)}{=} \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} v_1 \begin{pmatrix} 1 \\ 0 \end{pmatrix} + \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} v_2 \begin{pmatrix} 0 \\ 1 \end{pmatrix} \\
&\overset{(4)}{=} v_1 \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} + v_2 \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} 0 \\ 1 \end{pmatrix} \\
&\overset{(5)}{=} v_1 \begin{pmatrix} A_{11} \\ A_{21} \end{pmatrix} + v_2 \begin{pmatrix} A_{12} \\ A_{22} \end{pmatrix} \\
&\overset{(6)}{=} \begin{pmatrix} A_{11} v_1 + A_{12} v_2 \\ A_{21} v_1 + A_{22} v_2 \end{pmatrix}
\end{aligned}
$$

It is important to discuss all the properties used in the derivation above, as they serve as a backbone
to all calculations in linear algebra in general:
1. We have decomposed the vector v into its separate components.
2. We have pulled out the scalar from each vector in order to easily recognize the basis vectors
b1 and b2 .
3. Since we are dealing with a linear operator, we use the additivity property defined above.
4. Again, as we are dealing with a linear operator, we use homogeneity property defined above.
5. We use the definition of what matrix columns represent, i.e. we transform the canonical basis
vectors accordingly.
6. We simply sum up the two remaining vectors.
Using known rules we have derived the elements of the transformed vector. This result is general,
and if we have a matrix-vector multiplication of the type w = Av, then the i-th element of the
output vector w is given by:
$$w_i = \sum_{k} A_{ik} v_k \tag{1}$$

Note that the ik-th element of the matrix A is simply the entry of the matrix at the i-th row and k-th
column. Using this formula, we can find every element of the output vector w.

In the example above, we have assumed that the matrix A is a square matrix, which resulted in
vectors v and w having the same dimension. Let’s take a look at another matrix, A′ . We will define
A′ as:

$$A' = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}$$
Now, let’s try to interpret the meaning of this matrix. We have stated that the columns of the matrix
correspond to the vectors to which our basis vectors will transform. So, this means the following:

$$\mathbf{b}_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \rightarrow \mathbf{b}_1' = \begin{pmatrix} 1 \\ 0 \\ 1 \end{pmatrix}, \qquad \mathbf{b}_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix} \rightarrow \mathbf{b}_2' = \begin{pmatrix} 0 \\ 1 \\ 1 \end{pmatrix},$$

i.e. we have a transformation from R2 → R3 . To calculate how this matrix would transform an
arbitrary vector, we would use the same procedure as above, and would again retrieve equation 1.
As a simple exercise, let's calculate the output vector w = A′v, where v = [x, y]^T, using relation 1:

$$w_1 = \sum_{k=1}^{2} A'_{1k} v_k = A'_{11} v_1 + A'_{12} v_2 = x + 0 = x$$
$$w_2 = \sum_{k=1}^{2} A'_{2k} v_k = A'_{21} v_1 + A'_{22} v_2 = 0 + y = y$$
$$w_3 = \sum_{k=1}^{2} A'_{3k} v_k = A'_{31} v_1 + A'_{32} v_2 = x + y$$

Therefore, the output vector w is equal to:

$$\mathbf{w} = \begin{pmatrix} x \\ y \\ x + y \end{pmatrix}$$

This is exactly the mapping f defined in Section 2.5!⁵

5 This is actually a very general result: all linear mappings (in finite-dimensional vector spaces) can be written as matrix multiplication. More info can be found here.


Let’s summarize our current findings regarding matrix-vector multiplication:
• We have a general formula 1 for calculating how a matrix transforms a vector.
• The matrix-vector multiplication may or may not change the dimensionality of the input vector.
• If we have an n × k matrix (n rows, k columns), then the input vector has to be k-dimensional,
while the output will be n-dimensional.
• All linear transformations (in finite dimensions) can be written in the matrix form.
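To make the summary above concrete, here is a minimal NumPy sketch (not part of the original notes). It uses the square matrix A and the 3 × 2 matrix A′ from the examples above, an arbitrary input vector of my own choosing, and an explicit double loop implementing formula (1) next to NumPy's built-in product.

```python
import numpy as np

A = np.array([[-1.0, -2.0],
              [ 1.0, -1.0]])        # the 2x2 example from above
A_prime = np.array([[1.0, 0.0],
                    [0.0, 1.0],
                    [1.0, 1.0]])    # the 3x2 example, mapping R^2 -> R^3
v = np.array([3.0, 4.0])            # plays the role of [x, y]

# Formula (1): w_i = sum_k A_ik v_k, written as an explicit loop.
def matvec(M, x):
    n, k = M.shape
    w = np.zeros(n)
    for i in range(n):
        for j in range(k):
            w[i] += M[i, j] * x[j]
    return w

print(matvec(A, v), A @ v)               # both give [-11., -1.]
print(matvec(A_prime, v), A_prime @ v)   # both give [3., 4., 7.], i.e. [x, y, x+y]
```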

Matrix-matrix multiplication

In the previous subsection, we have discussed how matrices (linear operators) transform vectors,
and how to calculate the elements of the transformed vectors. Matrix-matrix multiplication can be
thought of as chaining two transformations one after another, and for this reason, we can calculate
the resulting matrix elements by analyzing how the two transformations act on the basis vectors.
For simplicity, let's assume that we have two 2 × 2 matrices A and B of the following form:

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \qquad B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$$

Now, we wish to calculate elements of the resulting matrix C = AB. As we stated before, columns
of the matrix represent what the canonical basis vectors transform to. Therefore, for example,
the first column of the matrix C will be given by the vector to which the vector b1 = [1, 0]^T
transforms. Let's calculate this by first acting with the matrix B and then with the matrix A on the
vector b1 :
$$
\begin{aligned}
C \mathbf{b}_1 &= (AB)\,\mathbf{b}_1 \\
&= \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix} \\
&= \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} \\ B_{21} \end{pmatrix} \\
&= \begin{pmatrix} A_{11} B_{11} + A_{12} B_{21} \\ A_{21} B_{11} + A_{22} B_{21} \end{pmatrix},
\end{aligned}
$$

where we have used identities and properties described in the matrix-vector multiplication section.
Now, if we write the matrix C in the following form:

$$C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix},$$

we can recognize that the elements of the first column are given by:

$$C_{11} = A_{11} B_{11} + A_{12} B_{21}, \qquad C_{21} = A_{21} B_{11} + A_{22} B_{21}$$

Similarly, we could calculate the elements of the second column of the matrix C by observing how
the two transformations transform the vector b2 = [0, 1]T .
In general, if we have a matrix-matrix multiplication of the type C = AB, then the ij-th element of
the matrix C is given by:

$$C_{ij} = \sum_{k} A_{ik} B_{kj} \tag{2}$$

It is important to note that we used a simple example where both matrices have the same dimensions.
A more general case would be if the matrix A ∈ R^{n×k} and B ∈ R^{k×m}. Then, the matrix B would
take as input an m-dimensional vector and transform it into a k-dimensional vector. Afterward, the
matrix A would take as input the transformed k-dimensional vector and output an n-dimensional
vector. So, the total transformation C would be an n × m matrix, i.e. C ∈ R^{n×m}. Note that the
elements of the matrix C would still be calculated using formula 2.
Next, let’s take a look at two special types of matrices:

• Identity matrix - The identity matrix is often denoted by I, and it represents a matrix that
leaves a vector unchanged, i.e. Iv = v. Such a matrix has elements 1 on the diagonal, and 0
otherwise. For example, a 3 × 3 identity matrix has the following form:

$$I = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}$$

• Inverse matrix - An inverse of a matrix A is denoted as A−1 , and is defined by the following
equation:
A−1 A = AA−1 = I
Intuitively, we can think of the inverse matrix A−1 as a matrix that counteracts the operation
done by the matrix A. Therefore, if we chain the two transformations together, it should be
the same as if we did nothing (i.e. the total transformation is equal to the identity matrix
I). A matrix that has an inverse is called an invertible matrix, and only square matrices are
invertible. More information can be found here.
Let’s briefly summarize important information regarding matrix-matrix multiplication:
• Using formula 2 we can find elements of a matrix that is the result of matrix multiplication.
• Multiplying an n × k matrix with a k × m matrix will result in an n × m matrix.
• In general, matrix multiplication is not commutative, i.e. AB ̸= BA.
• An identity matrix I leaves the vector unchanged.
• Some square matrices A have an inverse, which is denoted by A−1 .
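The summary above can be checked directly in NumPy. The sketch below is not part of the original notes, and the matrices A and B are arbitrary examples chosen so that A is invertible.

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 1.0]])

C = A @ B                      # C_ij = sum_k A_ik B_kj, formula (2)
print(C)
print(B @ A)                   # generally different: multiplication is not commutative

I = np.eye(2)                  # identity matrix
print(np.allclose(I @ A, A))   # True: I leaves matrices (and vectors) unchanged

A_inv = np.linalg.inv(A)       # inverse of A (exists since det(A) = -2 != 0)
print(np.allclose(A_inv @ A, I))  # True: A^{-1} A = I
```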

2.6 Change of Basis


In the previous section, the elements of the matrix were determined by how they transform the basis
vectors. Let's take a closer look at two different bases in R^2: a canonical basis {b1 , b2 } and an
arbitrary non-canonical basis {d1 , d2 } whose elements can be expressed in the canonical basis as:

$$\mathbf{d}_1 = \begin{pmatrix} 3/5 \\ 1/3 \end{pmatrix}, \qquad \mathbf{d}_2 = \begin{pmatrix} 1/3 \\ 1 \end{pmatrix} \tag{3}$$
The two bases are visualized in the figure below.
[Figure: the canonical basis vectors b1, b2 and the non-canonical basis vectors d1 = [3/5, 1/3]^T and d2 = [1/3, 1]^T drawn in the xy-plane.]
We can think of a basis as a language we use to explicitly write vectors and operators as matrices.
However, the way an arbitrary operator A transforms a vector v shouldn’t depend on the basis we
use. Therefore, we must adjust the entries of the matrix depending on which basis we use, because as
described before, rows of the matrix correspond to the vectors to which the basis vectors transform
to. So let’s try to motivate intuitively how we can transform a matrix A that is written in the
canonical basis {b1 , b2 } into a matrix A′ which describes the same operation, but in the new basis
{d1 , d2 }. The procedure is as follows:
1. We take a vector written using vectors from the new basis and translate⁶ it into the
language of the old basis using a transformation S.
2. We act on this translated vector with the operator A expressed in the canonical basis.
3. We convert the transformed vector back to the language of the new basis using the inverse
transformation S−1 .
So, in total, we can express the change of basis of a matrix as:

$$A' = S^{-1} A S \tag{4}$$

The next question is: what does the transformation S between the two languages correspond to? Well, if we speak
the language of the new basis, then we would express the vectors of the new basis as d1 = [1, 0]^T
and d2 = [0, 1]^T. However, if we want to express these new vectors in the old canonical basis, then
we would write them in the form of equation 3. Therefore, the transformation S for this example is
equal to:

$$S = \begin{pmatrix} 3/5 & 1/3 \\ 1/3 & 1 \end{pmatrix}$$
The inverse transformation can be found and is equal to:

$$S^{-1} = \begin{pmatrix} 45/22 & -15/22 \\ -15/22 & 27/22 \end{pmatrix},$$

which is not the nicest expression, but we can transform any operator A written in the canonical
basis {b1 , b2 } into a matrix A′ written in the {d1 , d2 } basis.

We can check whether the transformation S^{-1} makes sense by, for example, applying it to the vector
d1 written in the canonical basis:

$$S^{-1} \mathbf{d}_1 = \begin{pmatrix} 45/22 & -15/22 \\ -15/22 & 27/22 \end{pmatrix} \begin{pmatrix} 3/5 \\ 1/3 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix},$$

which is exactly the expected result, because if we speak the language of the {d1 , d2 } basis, we
would write the vector d1 as d1 = [1, 0] T .
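The change of basis can also be checked numerically. The sketch below (not part of the original notes) builds S from equation (3), verifies the check above, and applies equation (4) to the 2 × 2 matrix A from Section 2.5; that choice of A is my own, since any operator written in the canonical basis works here.

```python
import numpy as np

# Columns of S are the new basis vectors d1, d2 expressed in the canonical basis.
S = np.array([[3/5, 1/3],
              [1/3, 1.0]])
S_inv = np.linalg.inv(S)
print(S_inv)                       # approx [[45/22, -15/22], [-15/22, 27/22]]

d1 = np.array([3/5, 1/3])
print(S_inv @ d1)                  # [1., 0.]: d1 is the first basis vector of the new basis

# Any operator A written in the canonical basis can be translated to the new basis.
A = np.array([[-1.0, -2.0],
              [ 1.0, -1.0]])
A_new = S_inv @ A @ S              # equation (4)
print(A_new)
```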

6 In this context, translation is meant in the context of the language, not as a spatial translation.

3 Multivariate Calculus
3.1 What are derivatives and why should we care?
Before deep diving into derivatives, it is reasonable to ask ourselves what we mean when we talk
about the derivative of some function with respect to some variable. You may know that the
derivative describes the rate of change of the function. With ‘rate of change’ we refer to how
quickly the function value increases at some point x when we increase the value of x. A running
metaphor we will use is the following. We can imagine a variable y which is formed through applying
function f to x, i.e. y = f (x). In this case, we call x an input and call y an output. We are often
interested in studying how sensitive our outputs are to a change in the inputs, or how much our
inputs influence our outputs, as we will get more into it soon. This sensitivity is exactly what is
captured by the derivative, e.g. if the derivative of the output with respect to the input is large in
some point, we know that the output is ‘sensitive’ to a small change in the input around that point. Now,
you can picture this as a machine spitting out outputs y controlled with many knobs, where each
knob corresponds to a variable x. The derivative tells us how sensitive the value our function spits
out is to any turn of the knobs. Note that standard functions f : R → R are machines with one
knob and spit out one value, but general functions f : Rm → Rn are machines with m knobs and
spit out n different values. As we will look at later in this section, we have m × n derivatives in the
latter case, for we can look at the sensitivity of each output to any of the knobs.
More important, perhaps, is the question of why we care about derivatives at all. In the context
of machine learning, we are often very interested in a function that describes how well our model
performs given our parameters. What we mean with ‘doing well’ is reflected in section 4, but for
now, we presume that we have some measure of ‘doing well’. It is common to instead of maximizing
performance, minimize the error we make, which are equivalent views on the same thing. Let us,
for the sake of simplicity, say that our model parameter is given by x and our error rate is given by
f (x) = x2 + 4x − 2:

Figure 1: Example loss function. Horizontal axis describes the model parameter value x, the vertical
axis describes the corresponding error f (x).

If this function describes our error given our model parameters, we would be very interested in finding
the point where this error rate is at its minimum, which is exactly why we want to use the derivative. We
notice that at our minimum (which soon enough will turn out to be given by x = −2), the rate of
change of our function is 0. Please take your time to verify this, because this point is crucial.

As you might remember from a previous calculus course, the derivative of the function f (x) is given
by f ′ (x) = 2x + 4:

Figure 2: Example loss function plus its derivative. Horizontal axis describes the model parameter
value x, the vertical axis describes the corresponding error f (x) and derivative function f ′ (x).

Here we see that for all points less than −2, indeed the derivative is negative (i.e. the function
decreases) and for all points greater than x = −2, the derivative is positive (i.e. the function
increases). It is exactly at the minimum point x = −2 that the function goes from decreasing to
increasing. If we want to find this point x = −2 algebraically, we simply solve 2x + 4 = 0, which of
course yields x = −2.
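As a quick numerical check of the claims above (a small sketch, not part of the original notes), we can evaluate the example loss and its derivative on either side of the minimum.

```python
# Example loss f(x) = x^2 + 4x - 2 and its derivative f'(x) = 2x + 4.
def f(x):
    return x**2 + 4*x - 2

def f_prime(x):
    return 2*x + 4

for x in [-3.0, -2.0, -1.0]:
    print(x, f(x), f_prime(x))
# f'(-3) = -2 < 0 (decreasing), f'(-2) = 0 (the minimum), f'(-1) = 2 > 0 (increasing)
```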
The following sections will deep dive into how you can find these derivatives. We will first review
the univariate case such as the function we just covered. We will then steadily work our way up to
higher-dimensional derivatives with the aim of you being able to differentiate any ML/DL type of
function.
We do not want to spend too much time on basic differentiation techniques and rather give you a
general approach to differentiation from which things such as the sum rule, product rule, chain rule,
et cetera, will follow directly. If you need a refresher on the basic derivative rules, we included them
in Appendix A. Let us now dive into the actual derivatives!

3.2 Univariate derivatives


Let us start nice and easy with our basic functions over the reals, i.e. functions f : R → R. Though
this initially may look superfluous, we will introduce a visual way of representing these functions.
This new approach will make it easier to consider multivariate functions and is commonplace in
machine learning. Consider the function f such that f : x ↦ x², i.e. the function that squares its
input. Again, our output is given by y = f(x) = x². In our example, we can visualize this function
as follows:
x ──→ [x²] ──→ y

In these diagrams, the bare symbols (such as x and y) represent values and the bracketed expressions represent ways to determine a value.
The most important insight you should take away is that the sensitivity of y to x is given by the
sum of influences of all the paths from x to y. In this case, there is only one path, that is through
the function x2 . Using basic differentiation techniques, we hence observe that:
$$\frac{dy}{dx} = \frac{dx^2}{dx} = 2x.$$

A slightly more spicy example is the function f : R → R such that f : x ↦ exp(sin(x)).⁷ If we make
a diagram of this function as above, we can represent it as follows:

x ──→ [sin(x)] ──→ u ──→ [exp(u)] ──→ y

Please note that we had to introduce a new variable u := sin(x) that represents the intermediate
value found after applying the sine function to x. When finding the derivative of y with respect to
x, we again count all the paths from x to y. Again, there is only one path, now going through our
intermediate value u. In this case, the effect of x on y is equal to the effect of x on u times the effect
of u on y, i.e.
$$\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx}.$$
You may have encountered this separation of derivatives before as the chain rule. These derivatives
are quite simple, giving us
$$\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx} = \exp(u) \cdot \cos(x) = \exp(\sin(x)) \cos(x),$$
where we substituted u = sin(x) in the last step. So, we sum all the paths from x to y, and we
multiply the intermediate effects, e.g. if x influences u which influences y, the influence of x on y
is the influence of x on u times the influence of u on y.
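A handy habit when applying the chain rule is to check the analytic derivative against a finite-difference approximation. The sketch below is an illustration of my own, not part of the original notes; the evaluation point x = 0.7 is arbitrary.

```python
import numpy as np

def f(x):
    return np.exp(np.sin(x))

def f_prime(x):
    # chain rule: dy/dx = dy/du * du/dx with u = sin(x)
    return np.exp(np.sin(x)) * np.cos(x)

x = 0.7
h = 1e-6
numerical = (f(x + h) - f(x - h)) / (2 * h)   # central finite difference
print(numerical, f_prime(x))                  # the two should agree to many decimals
```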

3.3 Multivariate derivatives


 
Let's go one step further, and consider a function f : R^2 → R such that

$$f : \begin{pmatrix} x_1 \\ x_2 \end{pmatrix} \mapsto x_1 x_2^2.$$

We can again draw this function:

x1 ──┐
     ├──→ [x1·x2²] ──→ y
x2 ──┘

In this case, we can consider two derivatives: we can look at the effect of x1 on y and the effect of
x2 on y. When we can consider multiple derivatives for different variables, we do not write dy/dx1 but
rather ∂y/∂x1, to avoid confusion. We call such a derivative a partial derivative. Considering our
earlier metaphor, a derivative in a real function is just the effect of turning a knob of a machine with
one knob, whereas a partial derivative is an effect of turning one of the multiple knobs and keeping
the other still. Luckily for us, we can still apply our same tricks and count the paths from a variable
to y. In this case, we have that there is only one path from x1 to y, and only one path from x2 to
y, giving us:
$$\frac{\partial y}{\partial x_1} = \frac{\partial (x_1 x_2^2)}{\partial x_1} = x_2^2,$$

and

$$\frac{\partial y}{\partial x_2} = \frac{\partial (x_1 x_2^2)}{\partial x_2} = 2 x_1 x_2.$$

7 If you are not familiar with the exp(x) function, it is just another way to write e^x.
Please note that since we only consider the influence of one variable at a time, all the other
variables are constant when taking derivatives. What we sometimes do is write the ‘full’ derivative
dy/dx as the following vector:

$$\frac{dy}{d\mathbf{x}} = \begin{bmatrix} \frac{\partial f}{\partial x_1} & \frac{\partial f}{\partial x_2} \end{bmatrix} = \begin{bmatrix} x_2^2 & 2 x_1 x_2 \end{bmatrix}.$$

We call this full derivative a gradient in the case we have functions f : R^n → R, denoted as
dy/dx = ∇y(x) = grad y(x). However, in the general case of functions f : R^n → R^m we call the
resulting matrix a Jacobian, denoted as dy/dx = J_y(x). The Jacobian is just the matrix which has
on its i-th row all the partial derivatives of y_i with respect to x_j, i.e. J_ij = ∂y_i/∂x_j. Hence, since we only
have one output here, we have that the Jacobian has only one row.⁸
We can also have a function f : R → R^2 which maps

$$f : x \mapsto \begin{pmatrix} x^2 \\ \sqrt{x} \end{pmatrix}.$$

In this case, we have that y = f(x), where y is a vector (and hence is written in bold font), and thus
we can consider y1 = x² and y2 = √x. Drawing this, we find:

    ┌──→ [x²] ──→ y1
x ──┤
    └──→ [√x] ──→ y2

When again looking at the paths, we see that

$$\frac{dy_1}{dx} = \frac{dx^2}{dx} = 2x,$$

and

$$\frac{dy_2}{dx} = \frac{d\sqrt{x}}{dx} = \frac{1}{2\sqrt{x}}.$$
Here we can also group the different derivatives into one matrix:

$$\frac{d\mathbf{y}}{dx} = \begin{bmatrix} 2x \\ \frac{1}{2\sqrt{x}} \end{bmatrix}.$$

Please note that if we have a function f : Rn → Rm our Jacobian will be of the shape m × n.
Now we are finally ready to consider a function with multiple streams of influence. Consider
y = g(h(x)), where h(x) = (x², ln(x)) and g(u, v) = uv. That is, y is found by first calculating
intermediate values u = x2 and v = ln(x) and then finding y = uv. If we draw these functions, we
see the following:
8 For pedagogical reasons, we will call all such higher-order derivatives of y Jacobians and denote them with dy/dx, but in practice, most people will just use the word ‘gradient’ here anyway.
    ┌──→ [x²] ─────→ u ──┐
x ──┤                    ├──→ [u·v] ──→ y
    └──→ [ln(x)] ──→ v ──┘

It is now very clear that the effect of x on y is twofold: both through u and through v. As mentioned earlier,
we need to consider all streams of influence. Specifically, we sum the different paths/effects, i.e.:

$$\frac{dy}{dx} = \frac{\partial y}{\partial u} \frac{du}{dx} + \frac{\partial y}{\partial v} \frac{dv}{dx}.$$
Plugging everything in, we find
$$\frac{dy}{dx} = \frac{dy}{du} \frac{du}{dx} + \frac{dy}{dv} \frac{dv}{dx} = v \cdot 2x + u \cdot \frac{1}{x} = \ln(x) \cdot 2x + x^2 \cdot \frac{1}{x} = 2x\left(\ln(x) + \frac{1}{2}\right).$$
You may recognize this as the product rule, now you know where that comes from!
Finishing up, we go over one big example. Suppose f : R^3 → R^3 such that f(x1, x2, x3) =
h(g(x1, x2, x3)), where g(x1, x2, x3) = (x1² x2², √(x2 x3)) and h(u, v) = (u², uv, v²). Try it for
yourself! Find ∂y2/∂x2. Hint: draw out what happens.
When visualizing this function, we get the following:
[Diagram: x1 and x2 feed into x1² x2², giving u; x2 and x3 feed into √(x2 x3), giving v; then u feeds u² (giving y1), u and v together feed u·v (giving y2), and v feeds v² (giving y3).]

When counting the paths from x2 to y2 , we find two paths: one through u and one through v. We
hence find
$$\frac{\partial y_2}{\partial x_2} = \frac{\partial y_2}{\partial u} \frac{\partial u}{\partial x_2} + \frac{\partial y_2}{\partial v} \frac{\partial v}{\partial x_2}.$$
Plugging in our derivatives, we find

$$\frac{\partial y_2}{\partial x_2} = v \cdot 2x_1^2 x_2 + u \cdot \frac{x_3}{2\sqrt{x_2 x_3}} = 2 x_1^2 x_2 \sqrt{x_2 x_3} + \frac{x_1^2 x_2^2 x_3}{2\sqrt{x_2 x_3}}.$$

Sweet! We now know how to find derivatives in multivariate functions. As you have seen, this
approach is quite time intensive, and sometimes (especially in deep learning) it is not necessary
to write out everything by hand like this. This will be the topic of the rest of this section.

3.4 Jacobians
One of the most iconic functions in deep learning is the ‘linear layer’, which takes some input
x ∈ R^m and takes n linear combinations (with different factors) of the inputs. This linear layer can
be considered a function f : R^m → R^n such that

$$y_i = w_{i1} x_1 + \cdots + w_{im} x_m = \sum_{j=1}^{m} w_{ij} x_j,$$

where we still write y = f(x). We call the {w_ik}_{k=1}^m the weights of the function. Notice that for the entire
function f we have n such sets of weights, i.e. in total n × m weights. We can write this function
more compactly as

$$\mathbf{y} = W\mathbf{x},$$

where

$$W = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1m} \\ w_{21} & w_{22} & \cdots & w_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nm} \end{pmatrix} \in \mathbb{R}^{n \times m}.$$
When we now imagine all the streams of influence found between the y and x, we realize that each
element yi is dependent on each variable xj . If you do not see this immediately, please draw out the
respective diagram.
A consequence of this is that we have a lot of derivatives, namely for each of the n outputs yi we
have m different derivatives (for the m inputs). To make our lives a whole lot easier, we simply
determine the derivative of the ith element for the jth variable and see if what we end up with
generalizes. We hence want to find

$$\frac{\partial y_i}{\partial x_j} = \frac{d}{dx_j} \left( \sum_{k=1}^{m} w_{ik} x_k \right).$$

We know that

$$\frac{d}{dx_j} \left( \sum_{k=1}^{m} w_{ik} x_k \right) = \sum_{k=1}^{m} \frac{d}{dx_j} w_{ik} x_k.$$

Let us now zoom in on one of the terms of the summation, i.e. we only consider d/dx_j (w_ik x_k). If we
have that x_k ≠ x_j, we will always have that d/dx_j (w_ik x_k) = 0, because the entire term does not depend
on x_j. When x_j = x_k, however, we see that the derivative is given by w_ik. We can express this
‘if-else’ statement quite easily mathematically using something called the Kronecker delta. The
Kronecker delta over two variables i and j is equal to 1 if i is equal to j, and equal to 0 otherwise, or:

$$\delta_{ij} = \begin{cases} 1 & \text{if } i = j \\ 0 & \text{otherwise} \end{cases}$$

Sometimes this is written with so-called Iverson brackets as [i = j]. These brackets do the same
thing, i.e. [S] = 1 if S is true, else [S] = 0 for any statement S. The most important property (for
us) of this Kronecker delta is that

$$\sum_{j} \delta_{ij} x_j = x_i,$$

i.e. when summing over elements xj , we can filter out xi by introducing δij . Please verify this
carefully, for this will be our main workhorse throughout this section.
If we go back to our example, we see that our derivative is hence given by d/dx_j (w_ik x_k) = δ_jk w_ik for
any combination of xj and xk . That is, the derivative is equal to 0 if xj and xk are different, and
equal to wik when xj and xk are the same. Plugging this back in, we find

$$\sum_{k=1}^{m} \frac{d}{dx_j} w_{ik} x_k = \sum_{k=1}^{m} \delta_{jk} w_{ik}.$$

This we know how to evaluate using our workhorse, and hence we see that
$$\frac{df_i}{dx_j} = \sum_{k=1}^{m} \delta_{jk} w_{ik} = w_{ij}.$$

Neat! We just found a general approach to taking the derivative of the linear layer and concluded
that the effect of the jth variable on the ith output is given by the weight wij . We can write out
the entire Jacobian (where the element in ith row, jth column is the derivative of yi with respect to
xj ) again:  
$$\frac{d\mathbf{y}}{d\mathbf{x}} = \begin{pmatrix} w_{11} & w_{12} & \cdots & w_{1m} \\ w_{21} & w_{22} & \cdots & w_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ w_{n1} & w_{n2} & \cdots & w_{nm} \end{pmatrix} \in \mathbb{R}^{n \times m}.$$
But wait! This matrix we recognize from earlier, namely as our matrix W. This allows us to write
$$\frac{d\mathbf{y}}{d\mathbf{x}} = W.$$
We call this approach of finding a single entry of the derivative and then generalizing the ‘index
method’.
Please note that not only did we just derive the derivative of the linear layer, but we found that
d/dx (Wx) = W for arbitrary matrices and vectors, e.g. we also know now that

$$\frac{d}{d\mathbf{v}} (AB + C)\mathbf{v} = AB + C,$$

by simply remembering that AB + C = W′ for some matrix W′.
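The identity d(Wx)/dx = W can be verified numerically by building the Jacobian column by column with central finite differences. The sketch below is my own illustration (not part of the original notes); W and x are random, and the shapes n = 3, m = 4 are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))        # n = 3 outputs, m = 4 inputs
x = rng.normal(size=4)

def f(x):
    return W @ x

# Numerical Jacobian: J[i, j] = d y_i / d x_j, estimated with central differences.
h = 1e-6
J = np.zeros((3, 4))
for j in range(4):
    e = np.zeros(4)
    e[j] = h
    J[:, j] = (f(x + e) - f(x - e)) / (2 * h)

print(np.allclose(J, W, atol=1e-5))   # True: d(Wx)/dx = W
```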
Another very common derivative you will encounter is d/dx (a^T x). In this case, we have that a^T x is
simply a scalar, and hence our Jacobian will be of the shape (1 × m) if x is m-dimensional. Let us
again use the index method, and aim to find

$$\frac{d}{dx_j} \mathbf{a}^T \mathbf{x} = \frac{d}{dx_j} \sum_{k=1}^{m} a_k x_k = \sum_{k=1}^{m} \frac{d}{dx_j} a_k x_k.$$

As before, we see that the derivative is equal to ak when k = j, and equal to zero otherwise, and
thus
$$\sum_{k=1}^{m} \frac{d}{dx_j} a_k x_k = \sum_{k=1}^{m} \delta_{jk} a_k = a_j,$$

where we find aj by applying our workhorse again. This means that the jth element of our derivative
is given by aj , or the entire derivative is given by a.
But... a is a column vector, whereas our Jacobian should be a row vector, as we argued earlier. Sadly,
this problem cannot quite be overcome, and we just need to always check whether our answer should be
transposed or not. In this case, we see that our Jacobian matches a^T. This is slightly annoying,
but luckily our answer is always either correct or only needs to be transposed, and checking it will
become second nature soon enough! Let this inconvenience not distract us from the fact that we did
just find our new identity though, that is:
$$\frac{d}{d\mathbf{x}} \mathbf{a}^T \mathbf{x} = \mathbf{a}^T.$$

Now it is your turn: please try and verify that d/dx (y^T A x) = y^T A, where x ∈ R^m, A ∈ R^{n×m}, and
y ∈ R^n. Please do this 1) using index notation, and 2) using our identity friends we have already
found.
We know that y^T A x is a scalar (why?), and hence the Jacobian will again be of the form (1 × m)
if x is an m-dimensional vector. Using index notation, we aim to find d/dx_i (y^T A x). We observe that

$$\frac{d}{dx_i} \mathbf{y}^T A \mathbf{x} = \frac{d}{dx_i} \sum_{k=1}^{n} \sum_{j=1}^{m} y_k A_{kj} x_j = \sum_{k=1}^{n} \sum_{j=1}^{m} \frac{d}{dx_i} y_k A_{kj} x_j.$$

Again, since yk Akj are just scalars, we know that the derivative is simply found by

$$\frac{d}{dx_i} y_k A_{kj} x_j = \delta_{ij} y_k A_{kj}.$$
This gives us the following derivative:
$$\sum_{k=1}^{n} \sum_{j=1}^{m} \frac{d}{dx_i} y_k A_{kj} x_j = \sum_{k=1}^{n} \sum_{j=1}^{m} \delta_{ij} y_k A_{kj} = \sum_{k=1}^{n} y_k A_{ki}.$$

But this term we recognize as [yT A]i . This means that our full derivative is simply given by yT A,
which aligns with our desired shape so we are done.
So... That’s quite a lot of work. And actually, we could have done way less work using our previous
identities. Observe that yT A is just a row vector, i.e. it can be written as vT = yT A for some
vector v. Thus, we can write yT Ax = vT x. But this we know how to differentiate with our tricks,
that is d/dx (v^T x) = v^T, and hence we know that d/dx (y^T A x) = y^T A.
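This identity, too, can be checked with a quick finite-difference gradient. The sketch below is my own illustration (not part of the original notes); A, y, and x are random and the shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 4
A = rng.normal(size=(n, m))
y = rng.normal(size=n)
x = rng.normal(size=m)

def f(x):
    return y @ A @ x                    # the scalar y^T A x

# Numerical Jacobian (a 1 x m row vector, stored here as a length-m array).
h = 1e-6
grad = np.array([(f(x + h*e) - f(x - h*e)) / (2*h) for e in np.eye(m)])

print(np.allclose(grad, y @ A, atol=1e-5))   # True: d(y^T A x)/dx = y^T A
```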
This should cover the basics of vector calculus! During the first week of the course, we will spend
some more time on this and you will receive an excellent document written by two other
TAs. If you understand these basics, you are well on your way to doing machine learning soon
enough!

4 Statistical Learning
In this section, we will do a quick overview of the goal of machine learning in general. Though machine
learning is quite a broad topic, exploring one specific type of machine learning will cast
some light on the field in general. What we will look at is supervised learning, in which we have
example input-output pairs of some process, and we want to learn the relationship between the
inputs and outputs. A classic example is having many examples of images of dogs and cats plus a
label for each image saying either ‘cat’ or ‘dog’. We would then want to find a model that takes
in the image (input) and predicts either ‘cat’ or ‘dog’ (output). In general in machine learning, we
assume that we do not ourselves already know what this model looks like exactly. If we did, then
why bother going for this statistical approach rather than just implementing the function directly?
Let us imagine we aim to predict your grade based only on the number of hours studied. This of
course is slightly silly to do, which is exactly why it is a good example to consider! Our goal is to find
a function y that takes in hours studied x and predicts your grade based on the data: the examples
(x, t) of numbers of hours studied and the grade that person obtained. We denote D = {(xn , tn )}
as the entire dataset, and we will assume it contains N points. Ideally, our function y satisfies that

y(xn ) = tn for all (xn , tn ) ∈ D.

However, this will never be possible. Suppose you and your friend both study for the same
number of hours; that does not imply that you will necessarily obtain the same grade (which is why
this example is silly). Thus, for the same xn , we can have a whole range of different tn that are
‘correct’. However, our function y can (by definition) only return one value y(xn ) to estimate tn , so
what do we do now...
This is where we call out our (soon-to-be) best friend statistics. Instead of considering our points
as just fixed values, we consider them as random variables (or RVs). We will not bother writing
this out formally right now, but in essence, a random variable X is just an object that can take
on specific values (e.g. x1 , x2 , x3 , and x4 ) and assigns a probability describing how ‘probable’ each
assignment is.9 We write p(X = x1 ) as the probability that X takes on the value x1 .
If we have more than one RV, let’s say X and Y , which can take on values {xi } and {yj } respec-
tively, we can make a distribution describing how likely each combination of values of X and Y is. We call this the joint
distribution, denoted as p(X, Y ), which can take on values p(X = xi , Y = yj ) for all combinations
i and j.
With this joint distribution, we can define our (for now) last type of distribution. Suppose X
is someone’s age and Y someone’s height, the distribution p(X, Y ) describes how likely people of
certain ages are to have certain heights. However, what if we want to just know how the heights
are distributed for people that are 22 years old? This information is actually in the p(X, Y ), but we
need to filter out all the points for which we have that X ̸= 22. We denote this filtered distribution,
or conditional distribution, as P (Y | X) in general, or as P (Y | X = 22) for a specific age.
Back to our example. Let X denote our input variables and T denote the corresponding data
outputs. We are specifically interested in finding the distribution p(T | X) which for any datapoint
X assigns a distribution of targets T . To decide what this distribution should look like, we zoom
out and consider the data generation process for a second.
9 Actually, what ‘probabilities’ are is quite a divisive thing. The two main interpretations are the frequentist

approach and the Bayesian approach. Though you probably learned mostly frequentist approaches, it is worth deep
diving into Bayesian approaches, for they often lend themselves better to machine learning.

We know that for any value X = x, we can have many different valid predictions, and
we want to quantify the validity of each one using a distribution. First, how do we assess how ‘close’
our predictions are to the actual values? E.g., if the height of someone was 1.70m and I predicted
1.50m, is that twice as bad as predicting 1.60m? Or is it four times as bad? Actually, for this, we
will in general use the mean squared error (or MSE). As the name suggests, your ‘wrongness’ is
given by the average (y(xn ) − tn )2 over all data points, i.e.
$$\text{MSE} = \frac{1}{N} \sum_{n=1}^{N} (y(x_n) - t_n)^2.$$

What is super convenient about using MSE, is that a lot of mathematics will simplify. Please note,
though, that this is just a choice (though one often made), and choosing different error functions
can have drastic effects on what your model will consider ‘good’ predictions.
The main reason why MSE is used so often is that under a few mild assumptions, the function y
that is ‘ideal’ (i.e. the function that minimizes the MSE) is simply the function that for some point
x returns the weighted average of the target T | X = x, which we write as the expected value:

y(x) = E[T | X = x].

That is to say, just as we might expect, if we have 100 people that are 22 years old, of which there
are 50 of height 1.70m, 30 of height 1.60m, and 20 of height 1.80m, and we predict 0.5 · 1.70m + 0.3 ·
1.6m + 0.2 · 1.8m = 1.69m (and do the same for all other ages), this function will exactly minimize
the mean squared error. That’s pretty neat, right?
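The claim above can be checked numerically with the very numbers from the example. The sketch below is my own illustration (not part of the original notes): it scans candidate predictions and confirms that the conditional mean minimizes the MSE.

```python
import numpy as np

# Targets for the 100 people that are 22 years old (50 x 1.70m, 30 x 1.60m, 20 x 1.80m).
t = np.array([1.70] * 50 + [1.60] * 30 + [1.80] * 20)

def mse(prediction):
    return np.mean((prediction - t) ** 2)

candidates = np.linspace(1.5, 1.9, 401)
best = candidates[np.argmin([mse(c) for c in candidates])]

print(t.mean())   # 1.69, the (weighted) average height
print(best)       # 1.69 as well: the conditional mean minimizes the MSE
```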
This is where ML1 starts! We will be studying ways of predicting target distributions for our different
inputs, i.e. finding p(T | X). Ideally, we would always use y(x) = E[T | X = x], but in practice,
we will not have enough data for each input point x to predict the average value of T for those x.
This is where you come in, using your linear algebra and multivariate calculus tools to find a good
solution to this problem!

A Derivative rules
Here we provide the basic derivative rules. We separated them into (1) derivatives of specific func-
tions, and (2) properties of the derivatives of combined functions.

Standard derivatives
• f(x) = c ⟹ f′(x) = 0,
• f(x) = x^n ⟹ f′(x) = n x^(n−1),
• f(x) = a^x ⟹ f′(x) = a^x log a, and hence f(x) = e^x ⟹ f′(x) = e^x,
• f(x) = log_b x ⟹ f′(x) = 1/(log(b) · x), and hence f(x) = log x ⟹ f′(x) = 1/x,
• f(x) = sin x ⟹ f′(x) = cos x,
• f(x) = cos x ⟹ f′(x) = − sin x.
Moreover, it is useful to remember special cases of the second rule, e.g. f(x) = x ⟹ f′(x) = 1,
f(x) = ax ⟹ f′(x) = a, and f(x) = √x ⟹ f′(x) = 1/(2√x).

Derivative rules
• (c · f )′ (x) = c · f ′ (x),
• (f + g)′ (x) = f ′ (x) + g ′ (x),
• (f · g)′ (x) = f ′ (x) · g(x) + f (x) · g ′ (x),
• (f /g)′(x) = (f′(x) · g(x) − f(x) · g′(x)) / g(x)²,
• (f ◦ g)′ (x) = (f ′ ◦ g)(x) · g ′ (x).
