Algorithmic Mathematics in Machine Learning
Data Science Book Series
Editor-in-Chief
Ilse Ipsen
North Carolina State University
Editorial Board
Amy Braverman, Jet Propulsion Laboratory
Nicholas J. Higham, University of Manchester
Ali Rahimi, Google
Baoquan Chen, Shandong University
Michael Mahoney, UC Berkeley
David P. Woodruff, Carnegie Mellon University
Amr El-Bakry, ExxonMobil Upstream Research
Haesun Park, Georgia Institute of Technology
Hua Zhou, University of California, Los Angeles
Daniela Calvetti and Erkki Somersalo, Mathematics of Data Science: A Computational Approach to Clustering and Classification
Nicolas Gillis, Nonnegative Matrix Factorization
Bastian Bohn, Jochen Garcke, and Michael Griebel, Algorithmic Mathematics in Machine Learning
Cover illustration reproduced with permission. Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang.
Deep Learning Face Attributes in the Wild; International Conference on Computer Vision (ICCV), 2015.
Contents
Preface ix
List of Figures xi
Bibliography 197
Index 219
Preface
The story of machine learning is one of resounding success. It is frequently employed by scien-
tists and practitioners around the globe in various areas of application ranging from economics
to chemistry, from medicine to engineering, from gaming to astronomy, and from speech pro-
cessing to computer vision. While the remarkable success of machine learning methods speaks
for itself, they are often applied in an ad hoc manner without much care for their mathematical
foundation or for their algorithmic intricacies. Therefore, we decided to write this book. Our
goal is to provide the necessary background on commonly used machine learning algorithms and
to highlight important implementational and numerical details. The book is based on a well-
received practical lab course, which we established within the mathematics degree program at the
University of Bonn, Germany, in 2017. The course has been taught and successively enhanced
each year since then.
In contrast to other books on machine learning, we aim to give mathematicians, computer
scientists, and practitioners with a basic mathematical background in analysis and linear algebra
an introduction to the algorithmic mathematics of many important machine learning methods.
The book is composed of introductory parts, which present a specific machine learning method;
of information parts, which recapitulate the underlying mathematics; and of practical exercises,
which deepen the understanding of the considered method.
Acknowledgments We would like to thank all our co-workers, friends, and students who
helped us to bring this book to life with their valuable input of various kinds. We especially thank
Jannik Schürg, Arno Feiden, and Lisa Beer for their specific text contributions and remarks.
Furthermore, we would like to thank Olmo Chiara Llanos, Paolo Climaco, Sara Hahner, Jan
Hamaekers, Ivan Lecei, Sebastian Mayer, Daniel Oeltz, and Moritz Wolter for their valuable
feedback during the practical lab courses and the writing process of this book.
We would also like to acknowledge the support by the German Federal Ministry of Educa-
tion and Research, which partially funded the development phase of the practical lab within the
project P3ML (project code 01IS17064).
Chapter 1
The Basics of Machine Learning
In many branches of science, economy, and industry, the amount of available data has become immense during recent years. Most of these data do not contain any valuable information. However,
the differentiation between useful data and “data waste” is seldom straightforward. Moreover,
in applications where expensive (practical or numerical) experiments are involved in the data
acquisition process, the available data sets are rather small. To meet the different challenges on
real-world data sets, like, for example, describing useful information in a more compact format
(dimensionality reduction) or making predictions on unseen data (regression), many different
ideas and approaches have emerged. These are usually grouped together under the name ma-
chine learning (ML) methods.
Machine learning itself is a subarea of artificial intelligence (AI) and is closely related to data
mining (DM). At its core, AI is concerned with all aspects of machine intelligence, i.e., with
cases where a (real or virtual) artificial device is able to take actions to achieve a given goal by
using information, e.g., from sensors, on the surrounding environment in which it lives. The
term data mining, on the other hand, is usually used to describe methodologies with the goal of
detecting statistical trends or relations within or in between different data sets. Oftentimes, ML
and DM are used somewhat ambiguously. To better distinguish these two areas, ML is typically
used when the goal is to predict certain values or outcomes from given data, whereas DM is used
when there is no prediction sought, but rather a statistical analysis of the data.
Generally, machine learning refers to the generation of knowledge by using computational
(and statistical) methods on given data. This means that ML instances automatically learn from
experience in order to achieve a certain goal, i.e., they adapt solutions to given data by optimizing
some criterion. An important aspect to note is that the corresponding algorithms are not explicitly
programmed to produce certain solutions in certain cases, but they determine—or infer—their
solutions according to the given data.
The methods presented in this book are accompanied by real-world applications. In particular, we will get to know use-cases and data sets from the fields of handwritten digit recognition, pedestrian detection in images, and biological single-cell analysis, for instance.
Info-boxes Three different categories of info-boxes throughout the book will help as an addi-
tional source of information.
X These boxes contain the practical exercises. Here, the reader is encouraged to imple-
ment and test the introduced concepts and algorithms from the specific chapter. This
serves to deepen understanding and to get some feeling for potential pitfalls when
working with these algorithms.
Notation Since the main intention of this chapter is to create common ground for compre-
hending the contents of the further chapters, we aim to keep the notation for the remainder of
this book mostly consistent with that used in this chapter. To this end, let us provide the naming
conventions, which are valid throughout most of this book:
• Vectors are commonly written with small letters in boldface notation, e.g., x, or with an
arrow on top, e.g., ⃗x, to avoid confusion. The jth entry of an indexed vector xi is denoted
by [xi ]j .
• Matrices are commonly written with capital letters in boldface notation, e.g., W .
• n and n̄ denote the number of given training and test data points, respectively.
• D := {(xi , yi ) ∈ Ω × Γ | i = 1, . . . , n} describes the training data set in supervised learn-
ing. The yi are omitted in unsupervised learning.
• D̄ := {(x̄i , ȳi ) ∈ Ω × Γ | i = 1, . . . , n̄} describes the test data set in supervised learning.
The ȳi are omitted in unsupervised learning.
• d denotes the dimension of a data point x.
• d̃ and q are commonly taken to denote dimensions different from d.
• “k-dimensional” is abbreviated by “kD” for any k ∈ N.
• The norm ∥ · ∥ refers to the Euclidean norm ∥ · ∥2 in the corresponding ambient space, e.g.,
Rd , unless otherwise noted.
• M denotes the model space.
• L denotes the loss function.
• µ is the measure according to which training data have been drawn. In the case of super-
vised learning we also use µX as the marginal measure on Ω and µY |X as the conditional
measure of Y given X on Γ.
In the remainder of this chapter, we introduce the basic terminology and concepts of machine
learning that will be relevant throughout this book. This serves as a motivation and as an overview
on fundamental aspects for readers who are unfamiliar with the ML jargon. To this end, we
deliberately omit technical subtleties here.
We begin by introducing the necessary mathematical concepts. In particular, Section 1.2 is
dedicated to the topics of supervised and unsupervised learning. The mathematical formulation
of the underlying optimization problem is dealt with in Section 1.3. Section 1.4 presents hyper-
parameters of machine learning models. Feature engineering and feature selection are discussed
in Section 1.5. The concentration of measure phenomenon and the curse of dimensionality for
high-dimensional problems are the topics of Section 1.6. Subsequently, Section 1.7 and Sec-
tion 1.8 are dedicated to more applied aspects of machine learning. In particular, Section 1.7
deals with general process models for tackling real-world machine learning problems, e.g., the
CRISP-DM model. Section 1.8 discusses important implementational aspects, such as preferable
programming languages and libraries for machine learning. Finally, Section 1.9 provides refer-
ences to other important textbooks on mathematical and implementational aspects of machine
learning.
Note again that we deliberately only provide motivational and fundamental mathematical and
implementational concepts in this chapter. The corresponding details on the topics discussed
here as well as specific instances of ML methods and algorithms will follow in the later chapters.
Supervised learning
We are given data samples of inputs and outputs. The goal is to predict the output to an input. This can be seen as a function (or density) reconstruction task. The prediction should also generalize well to yet unseen data. Example: Assume we are given real values x1 , . . . , x5 as inputs and real values y1 , . . . , y5 as outputs. A possible predictor would be the piecewise linear interpolant of the (xi , yi ) pairs (indicated by the blue graph below).

Unsupervised learning
We are only given input samples. Common tasks are clustering/segmentation of the data or compression/dimensionality reduction, where we look for a low-dimensional representation of the high-dimensional input data. Example: Assume we are given 12 two-dimensional points x1 , . . . , x12 . A possible goal could be to assign each point to one of three groups/clusters (indicated by color and shape below) according to their distances to each other.

(Figures: the piecewise linear interpolant through five input–output pairs, and 12 two-dimensional points grouped into three clusters.)
Supervised learning Let us focus on supervised learning first. Here a function1 f is learned
from input-output samples (xi , yi ), also referred to as inputs and labels. The goal is not only
that the samples—usually called training data—be (approximately) fitted by f , but also that new
data—usually called test data or evaluation data—which stem from the same distribution as the
training data, be approximated well. Some specific examples of tasks in supervised learning are:
• Classify images according to their content (e.g., cats versus dogs) [KSH12].
• Estimate the risk of disease from patient data [KZK12].
• Mark email messages that are spam [DBC+ 19].
• Detect critical failures in industrial facilities [CGIT23].
• Predict future values of financial assets [HSK19].
• Categorize musical pieces according to their genre [COSJ17].
In supervised learning, we assume that we are given an input training data set
D := {(xi , yi ) ∈ Ω × Γ | i = 1, . . . , n} with (xi , yi ) ∼ µ i.i.d. for all i ∈ {1, . . . , n}.
Here, the n training data points are independent and identically distributed (i.i.d.) samples of an
unknown probability measure µ on Ω × Γ. Usually (and most of the time in this book) the setting
Ω ⊆ Rd and Γ ⊆ R is considered, which already covers a lot of interesting machine learning
problems. Note however that different settings, such as non-scalar Γ or categorical Ω, may be
encountered when considering certain applications, see Section 10.1.
In supervised learning we are looking for f : Ω → Γ such that

f (xi ) ≈ yi ∀ i ∈ {1, . . . , n}. (1.1)
But instead of just choosing an interpolant in xi , which does the job, we more importantly also
aim for
f (x) ≈ y ∀(x, y) ∼ µ. (1.2)
If Γ is endowed with a distance measure, this is called a regression problem. In the special case
of |Γ| < ∞ and where there is no defined order on Γ, i.e., the values of y represent categories,
we are dealing with a classification problem. To make the search for such a function f more
feasible, we usually restrict ourselves to some model space or class M and try to find the “best”
f ∈ M that fulfills our requirements of fitting the data. We will go into more detail on model
classes and what we mean by “best” f in the next sections.
Regression
Regression refers to the determination of the values of a continuous variable, given the values of other variables. We are mostly dealing with the reconstruction of a real-valued function g : Ω ⊂ Rd → Γ ⊂ R from given values (xi , yi ) for i = 1, . . . , n, where yi = g(xi ) + εi is a function evaluation of g perturbed by some noise εi .

Classification
Classification deals with the determination of the class/group which a data point belongs to. Often, we aim to reconstruct the class assignment operator g : Ω ⊂ Rd → Γ = {1, . . . , K} from data points (xi , yi ) for i = 1, . . . , n, where yi = g(xi ) is the class to which xi belongs. Here, K ∈ N is the number of possible classes.
1 More generally, we could consider learning a measure that models the connection between the input-output pairs.
In contrast to general supervised learning methods, where we assume that the training data D
is given beforehand, the class of active learning methods, which is sometimes considered as
a subclass of supervised learning, deals with a scenario where the algorithm determines the
training data points itself; see [Set12]. More specifically, in active learning we start with a large
base set of xi for i = 1, . . . , N . Then, in each step of the active learning algorithm, one of
the points which has not been chosen before is included in the training data set. In particular
xik for some ik ∈ {1, . . . , N } \ {i1 , . . . , ik−1 } is chosen in the kth step and the corresponding
label yik is determined. Then, a supervised learning algorithm is applied to the new training
data set. Moreover, a case-dependent heuristic is employed to determine if the currently selected
data points suffice or if more steps in the active learning algorithm are necessary, i.e., if more
points need to be included in the training data set. This type of algorithm is useful when we have
the possibility to infer the correct yik —or at least a good approximation thereof—on the fly,
while the process of doing so involves intensive computations or is very costly and the number
of chosen data points should thus be kept small.
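To make this loop concrete, the following is a minimal Python sketch of a pool-based active learning strategy. The callables query_label (the labeling oracle), train (any supervised learner), and uncertainty (the case-dependent heuristic) are hypothetical placeholders of our own and do not refer to any specific library.

import numpy as np

def active_learning(pool_X, query_label, train, uncertainty, budget, seed=0):
    # Pool-based active learning: repeatedly label the point the current model is least certain about.
    rng = np.random.default_rng(seed)
    labeled_idx = [int(rng.integers(len(pool_X)))]          # start from one randomly chosen point
    labels = [query_label(pool_X[labeled_idx[0]])]
    for _ in range(budget - 1):
        model = train(pool_X[labeled_idx], np.array(labels))
        remaining = [i for i in range(len(pool_X)) if i not in labeled_idx]
        i_k = max(remaining, key=lambda i: uncertainty(model, pool_X[i]))
        labeled_idx.append(i_k)                              # query the label of the chosen point
        labels.append(query_label(pool_X[i_k]))
    return train(pool_X[labeled_idx], np.array(labels))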
Dimensionality reduction
In dimensionality reduction, we usually have data xi ∈ Ω = Rd for i = 1, . . . , n
and aim to find a “good” representation η i ∈ Rq with q < d. Here, “good” can mean
that the data points retain their main characteristics, such as pairwise distances for
instance. But “good” can also refer to the fact that it should be possible to approxi-
mately reconstruct the original data points xi given just η i and using an appropriate
reconstruction algorithm. Oftentimes we assume that the original data stem from
some k-dimensional surface or manifold, which we need to detect in order to deter-
mine the η i .
Figure 1.3 shows an example for d = 2 and q = 1. The data points on the left approximately reside on the red curve, which is just one-dimensional. Thus, the data can be represented in one dimension as points on the black line, shown on the right.
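One linear instance of such a reduction is a projection onto the leading principal directions of the data (principal component analysis). The following NumPy sketch, whose function name pca_reduce is our own, computes the directions via the SVD; it is meant purely as an illustration of determining the η i and reconstructing from them, not as the only possible choice.

import numpy as np

def pca_reduce(X, q):
    # Rows of X are the data points x_i (shape n x d); return a q-dimensional representation eta_i.
    mean = X.mean(axis=0)
    X_centered = X - mean
    U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)   # rows of Vt are principal directions
    eta = X_centered @ Vt[:q].T                                  # low-dimensional representation (n x q)
    X_reconstructed = eta @ Vt[:q] + mean                        # approximate reconstruction of the x_i
    return eta, X_reconstructed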
Sometimes, instead of performing dimensionality reduction of the data, one is merely interested
in compressing the data with respect to the number n of data points. Here, resampling approaches
like bootstrapping play an important role; see Section 10.2.4 for more details. Furthermore, it
is a common goal in unsupervised learning to find yet undetected patterns in the data, which is
often pursued for already dimension-reduced data. This is for instance achieved by segmentation
or clustering of the input data into several classes.
Clustering
The goal of clustering is to group together sets of data points (so-called clusters)
according to an underlying similarity measure. In particular, we divide a data set
into k ∈ N different parts C1 , . . . , Ck . Here, similar points are assigned to the same
cluster, whereas dissimilar points are assigned to different clusters. There exists a
variety of clustering methods which differ in the choice of the underlying similarity
measure. A thorough discussion of clustering and a description of suitable algorithms
can be found in [XW08].
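As an illustration of the clustering idea (and not of any particular method prescribed here), a minimal NumPy sketch of Lloyd's iterations for k-means clustering, which uses the Euclidean distance as similarity measure, could look as follows.

import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    # Lloyd's algorithm: alternate between assigning points to centers and updating the centers.
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        assignment = dists.argmin(axis=1)                        # index of the closest center per point
        new_centers = np.array([X[assignment == j].mean(axis=0) if np.any(assignment == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):                    # stop when the centers no longer move
            break
        centers = new_centers
    return assignment, centers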
A low bias means that functions from the chosen model class are able to
represent the given data very well. A high variance, however, means that the model is sensitive
to small fluctuations in the data and therefore does not generalize well to yet unseen data, i.e., the
model has large errors on the test data. In particular, a model that fits the training data well, but
does not perform equally well on the test data, has low bias and high variance, i.e., the model
class is too large. The opposite problem of high bias and low variance is present when the model
class is too small for the given data. In general, more data typically reduce the variance of the
obtained model. Therefore, we can obtain a smaller variance for larger model classes as the
amount of data increases. In contrast, aiming to minimize the variance by using a small model
class is usually more relevant in a setting where only a few data points are available.
Model class
The model class M is the set of all possible functions considered as solutions to
the machine learning problem at hand. Famous model classes include affine linear
functions, splines, kernel spaces, and (deep) neural networks. Oftentimes functions
in the model class are determined by certain parameters p ∈ Rdp . In this case, the
model class is called parametrized.
Let us consider the simple example of affine linear functions in the setting Ω = Rd and Γ = R.
Here, the parametrized model class is given by
MLin = { g : Rd → R | g(t) = γ0 + Σ_{i=1}^{d} γi ti for some fixed γ0 , . . . , γd ∈ R }.
The dp = d + 1 parameters are p = γ := (γ0 , . . . , γd )^T . In the case of the affine linear model
class, we will often use the more compact notation g(t) = γ^T · t̂ with t̂ := (1, t1 , . . . , td )^T . In the
case of parametrized model classes, we sometimes write gp instead of g to make the dependence
on p more distinct.
Loss function
The determination of f ∈ M is done with the help of a loss function L. For supervised learning it is often defined as a function L : (Γ × Γ)^n → [0, ∞]. Note that most loss functions can be written in terms of a one-sample loss function L̃ : Γ × Γ → [0, ∞]. For instance, a so-called additive loss function L can be written with one-sample loss L̃ as

L ((z1 , z̃1 ) , . . . , (zn , z̃n )) = (1/n) Σ_{i=1}^{n} L̃ (zi , z̃i ) .
With the help of a model class and a loss function, we can now write down the actual minimiza-
tion problem underlying most (supervised) machine learning algorithms.
Ideally, we would like to determine

f := arg min_{g∈M} E_{(x,y)∼µ} [ L̃ (g(x), y) ] ,

i.e., we average the loss over all possible data points drawn i.i.d. according to µ.
The resulting quantity is also known as the estimation error or generalization error.
However, we usually do not have direct access to the underlying measure µ but only
to the training samples (xi , yi ) for i = 1, . . . , n. Therefore, we substitute the problem
of minimizing the generalization error by the problem of minimizing the training
error (or empirical loss), i.e., we aim to determine
f := arg min_{g∈M} (1/n) Σ_{i=1}^{n} L̃ (g(xi ), yi ) = arg min_{g∈M} L ((g(x1 ), y1 ) , . . . , (g(xn ), yn )) . (1.4)
In the case of a parametrized model class M, we can rewrite this as the search for

p∗ := arg min_{p∈R^{dp}} L ((gp (x1 ), y1 ) , . . . , (gp (xn ), yn )) , (1.5)

after which we set f := gp∗ .
One of the most common loss functions is the least squares loss
Lls ((z1 , z̃1 ) , . . . , (zn , z̃n )) := (1/n) Σ_{i=1}^{n} (zi − z̃i )² ,
which is an additive loss function with one-sample loss L̃(z, z̃) = (z − z̃)2 . The least squares
loss computes the normalized squared Euclidean norm of the difference of the vectors z :=
(z1 , . . . , zn )T and z̃ := (z̃1 , . . . , z̃n )T . When evaluating the loss in an ML setting with model
function g, the zi and z̃i are instantiated by the point evaluations g(xi ) and yi for i = 1, . . . , n. In
particular, by using the least squares loss function as in the general optimization problem (1.4),
we obtain the least squares ML minimization problem
f := arg min_{g∈M} (1/n) Σ_{i=1}^{n} (g(xi ) − yi )² .
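For illustration, the empirical least squares loss of a candidate model g on the training data can be evaluated directly; the following small NumPy sketch (the function name is our own) assumes that g is any Python callable mapping a data point to a real number.

import numpy as np

def least_squares_loss(g, X, y):
    # Empirical least squares loss (1/n) * sum_i (g(x_i) - y_i)^2.
    predictions = np.array([g(x) for x in X])
    return np.mean((predictions - y) ** 2)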
Besides the least squares loss many other options exist for L. Simple examples are different
vector norms. For classification problems, logistic loss functions or cross entropies are often
used; see, e.g., Section 6.4.5. In the case of more complicated data sets, e.g., when yi is a
function or a density for all i = 1, . . . , n, the choice of an appropriate loss function becomes
more tricky; see also Section 10.1.
Properties of minima Let us consider the parametrized variant (1.5) of the optimization prob-
lem. If L is continuously differentiable, we know from basic calculus that

∇p L ((gp (x1 ), y1 ) , . . . , (gp (xn ), yn )) = 0
is a necessary condition for gp to be a minimizer of the loss on the training data. Points that fulfill
this equation are called critical points. Additionally, if the Hessian ∇2p L exists and is positive
definite for gp , we have found a minimizer. While these conditions help us to check for minima,
we do not know if we encountered a local or a global minimum in general.
The reason for the popularity of convex loss functions, like the least squares loss, is that they
ensure that any critical point of L is automatically a global minimizer if a convex model class is
employed, i.e., a model class that is a convex set of functions. While this does not necessarily
mean that the minimizer is unique, each minimizer is equally good in the sense that it achieves
the same minimum value of L as the others do.
Numerical optimizer While the check for the optimality conditions above can be done by
hand for specific choices of L and M, a minimizer usually needs to be determined by numerical
optimization algorithms. Here, a suitable choice of an algorithm has to be made in accordance
with the properties of the optimization problem (1.4).
For instance, for the least squares loss and an affine linear model class, the minimizer is unique,
and its naive determination boils down to solving a system of linear equations, as we will see in
Chapter 2. An example for a supervised learning problem whose optimization problem deviates
from the form (1.4) is given in Chapter 3. Here, a constrained convex optimization problem is
tackled by an iterative solver. When we look at more complicated settings where the loss function
or the model class is no longer convex, the numerical optimization becomes more intricate. Here,
we usually encounter many local minima.
In general, a local minimum resulting from an iterative numerical optimizer depends on the
initial value and the respective optimization method, provided that it is locally convergent in the
first place. Which local minimum is attained and how much it differs from any global minimum
are mostly unclear and difficult to determine in the non-convex situation. Even in the case of
deep neural networks, for instance, the mathematical analysis of the optimization problem is
hard, and its properties are not completely understood to this day; see Chapter 6. Here, iterative
stochastic methods have heuristically proven to be the numerical optimizers of choice.
Let us look at an example for adding a penalty term to a loss function in the case of linear
least squares regression, i.e., when minimizing the least squares loss Lls over the model class
MLin . After adding a penalty (or regularization) term to the parametrized variant (1.5) of the
optimization problem, we obtain
α = arg min_{γ∈R^{d+1}} (1/n) Σ_{i=1}^{n} (γ^T · x̂i − yi )² + λ∥γ∥₂² , (1.6)
where fα (t) = αT · t̂ is the resulting affine linear function with t̂ = (1, t1 , . . . , td )T , as noted
before. Here, we added the weighted (for a fixed λ > 0) squared ℓ2 norm of the coefficient vector
as penalty. Now solutions with large coefficients are penalized more than those with small ones.
This is known as Tikhonov regularization (see [EHN96, Tik63]) or ridge regression (see, e.g.,
[HTF09]).
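A direct way to compute the minimizer of (1.6) is to set its gradient with respect to γ to zero, which leads to a regularized variant of the normal equations discussed in Chapter 2. The following NumPy sketch implements this closed-form solve; the function name ridge_regression is our own, and the data matrix is assumed to contain one data point per row.

import numpy as np

def ridge_regression(X_data, y, lam):
    # Minimize (1/n) * ||X @ gamma - y||^2 + lam * ||gamma||^2 for the augmented data matrix X.
    n = X_data.shape[0]
    X = np.hstack([np.ones((n, 1)), X_data])      # rows are the augmented vectors (1, x_i)
    d1 = X.shape[1]
    A = X.T @ X / n + lam * np.eye(d1)            # regularized normal equations
    b = X.T @ y / n
    return np.linalg.solve(A, b)                  # coefficient vector alpha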
1.4 Hyperparameters
Often there are so-called hyperparameters in ML algorithms, i.e., parameters that have to be set
by the user. They have to be carefully chosen in order to achieve good results. Having a large
number of hyperparameters is undesirable since their proper choice gets extremely complicated
and often even practically impossible. We address the problem of hyperparameter search in more
detail in Section 3.6.
Hyperparameters
Hyperparameters are any parameters of the model f ∈ M or the loss L that are
not determined by the optimization of the loss function but which have to be fixed
before the loss minimization can be tackled. An example of a hyperparameter is the
regularization parameter λ in ridge regression introduced above.
An easy but computationally intensive way to search for the optimal hyperparameters
is to compute a lot of solutions for different values of the hyperparameters on a part
of the training data and then evaluate these solutions on another part of the training
data to see which hyperparameter choice performed best.
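The search strategy sketched in the box can be written down in a few lines. In the following Python sketch, fit and loss are hypothetical callables of our own (e.g., the ridge regression solver above and a validation error); the split into a training part and a validation part of the training data is done randomly.

import numpy as np

def grid_search(candidate_values, X, y, fit, loss, val_fraction=0.2, seed=0):
    # Hold out part of the training data and keep the hyperparameter with the smallest validation loss.
    rng = np.random.default_rng(seed)
    perm = rng.permutation(len(X))
    n_val = int(val_fraction * len(X))
    val_idx, train_idx = perm[:n_val], perm[n_val:]
    best_value, best_error = None, np.inf
    for value in candidate_values:
        model = fit(X[train_idx], y[train_idx], value)
        error = loss(model, X[val_idx], y[val_idx])
        if error < best_error:
            best_value, best_error = value, error
    return best_value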
input of an ML algorithm. In this way, the feature construction and selection can also be seen as
a data preprocessing task, which is of major importance within the machine learning pipeline.
A feature map ϕ : Ω → Ω̃ assigning the input data xi to some meaningful features ϕ(xi ) in
the so-called feature space Ω̃ is a major component of several machine learning algorithms, e.g.,
of support vector machines; see Chapter 3.
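As a simple illustration of a feature map (an example of our own, not one prescribed by a particular method), the following function maps a two-dimensional point to all monomials up to degree two; a linear model in the feature space Ω̃ = R⁶ then corresponds to a quadratic model in the original space Ω = R².

import numpy as np

def phi(x):
    # Feature map phi: R^2 -> R^6 consisting of all monomials of degree <= 2.
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 * x2, x1 ** 2, x2 ** 2])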
Concentration of measure
The concentration of measure effect describes the fact that high-dimensional ran-
dom, independent vectors concentrate in Euclidean space, i.e., the probability of
them residing in certain regions becomes very large. To illustrate this, consider n
uniformly distributed points xi , i = 1, . . . , n, in the unit ball
B1 (0) := { x ∈ Rd | ∥x∥2 ≤ 1 }
as training data set D. If we look at their nearest point c to the origin, i.e.,
c := arg min_{xi , i=1,...,n} ∥xi ∥2 , then the median of the norm of c over all different
realizations of the n-element training data set D fulfills
M (d, n) := median (∥c∥2 ) = ( 1 − (1/2)^{1/n} )^{1/d} ;
see also [HTF09]. Thus, we have, for example, for d = 10 and n = 500 that
M (d, n) ≈ 0.52. This means that already in 10 dimensions and for a small number
of data points, the point c, which is closest to 0, is indeed closer to the boundary of
the unit ball than to its origin 0. Thus, uniformly distributed data in the unit ball in
Euclidean space tend to concentrate along the boundary for large d. Therefore, in
high dimensions and for the Euclidean distance setting, data tend to behave counter-
intuitively; see Figure 1.4 for an illustration.
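The closed-form expression for M (d, n) can easily be checked empirically. The following NumPy sketch samples points uniformly in the unit ball (uniform directions scaled by radii U^{1/d}) and compares the empirical median of ∥c∥2 with the formula above; the function names are our own.

import numpy as np

def median_nearest_norm(d, n, n_trials=2000, seed=0):
    # Empirical median of ||c||_2, where c is the point closest to the origin among
    # n points drawn uniformly from the d-dimensional unit ball.
    rng = np.random.default_rng(seed)
    norms = []
    for _ in range(n_trials):
        g = rng.standard_normal((n, d))
        directions = g / np.linalg.norm(g, axis=1, keepdims=True)
        radii = rng.random(n) ** (1.0 / d)
        points = directions * radii[:, None]
        norms.append(np.linalg.norm(points, axis=1).min())
    return np.median(norms)

M = lambda d, n: (1.0 - 0.5 ** (1.0 / n)) ** (1.0 / d)
print(median_nearest_norm(10, 500), M(10, 500))   # both are approximately 0.52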
The concentration of measure effect shows us that the Euclidean distance does not seem to be in-
tuitive or even meaningful anymore when relying on point to point distances in high-dimensional
spaces. This is the reason why other choices of distance measures are often considered in such
applications. Here, various ℓp norms, weighted Gaussian distances, or even divergences are rea-
sonable alternatives in certain cases [AHK01, SFSW12]. However, recall that such deviations
from the Euclidean distance often result in non-convex loss functions and/or non-unique minima,
which make the ML minimization problem more complicated.
In the ML community, the terms concentration of measure and curse of dimensionality are
often used ambiguously. From a mathematical perspective, the curse of dimensionality is more
about properties on the necessary sample size.
Curse of dimensionality
The classical curse of dimensionality effect describes the fact that we need exponen-
tially (in the domain dimension d) many points to have a sampling that is as dense as
in one dimension. Such a sampling is a necessary foundation for learning an arbitrary function equally well in d dimensions. This is a famous problem in approximation theory [NW08].

Figure 1.5 gives an illustration of the curse of dimensionality effect (blue = 1D grid; red = 2D grid). To approximate a univariate (differentiable [NW08]) function up to a fixed accuracy, an interpolant on the N = 10 blue sample nodes can be built. For a bivariate function the N² = 100 red points are needed to (asymptotically) achieve the same accuracy. In d dimensions we would need N^d = 10^d many points. Thus, the computational costs scale exponentially in d.
• Modeling: Appropriate machine learning models are built, and hyperparameters are deter-
mined.
• Evaluation: The results of the ML model on the data set are gathered and evaluated ac-
cording to the chosen success criteria.
• Deployment: Monitoring and maintenance methods for the whole data pipeline are em-
ployed to enable a continuous process. Reports are created.
While some of these topics are closely intertwined, and while there are dependencies between
earlier and later phases, the overall process follows the aforementioned order. Furthermore, the
cyclic nature of the CRISP-DM model illustrates the fact that the data mining process does not
end after the deployment phase, but new insights gained from the later phases usually lead to a
new understanding of the problem at hand and may trigger the whole process again.
While CRISP-DM is still the most often adopted process model for treating machine learning
problems, there also exist other models such as SEMMA [RM10], which is tailored towards the
usage of SAS software, or Six Sigma,2 which is a more general model for data-based business
processes. More recently, modern and agile frameworks such as the Team Data Science Process
(TDSP)³ and the VDI/VDE guideline 3714⁴ are being employed in industrial data analysis.
5 https://www.python.org/
6 https://www.numpy.org/
7 https://scikit-learn.org/
8 https://keras.io/
9 https://pytorch.org/
10 https://www.tensorflow.org/
11 https://jax.readthedocs.io
To dive directly into the basics of Python and NumPy, let us begin with the first tasks
of this book. They serve to familiarize the reader with the coding environment as well as the
programming language.
X Task 1.1. Make yourself familiar with programming in Python and its libraries
NumPy and Matplotlib¹². Furthermore, you will need the application Jupyter
Notebook¹³ to run the template codes at https://bookstore.siam.org/di03/bonus. To
this end, the “README.md” file provides a guideline on setting up the programming
environment by hand. Furthermore, we encourage you to have a look at the tutorial
notebooks “Introduction to Python.ipynb” and “Introduction to NumPy.ipynb” and to
familiarize yourself with the concept of vectorization, i.e., using operations on whole
arrays instead of using loops and operating on single array elements.
X Task 1.2. Create a J UPYTER notebook in which you construct an array z consisting
of 10000 random numbers drawn from {0, 1, 2}. Implement two versions of a func-
tion that counts the number of appearances of the subsequence (2, 0, 1) in z. The first
version should work with a loop that accesses the array z elementwise and makes
elementwise comparisons. The second version should be a vectorized one, which
operates on (almost) the whole array z. (Hint: The NumPy function logical_and
might help you.) Compare the runtime of the two versions.
12 https://matplotlib.org/
13 https://jupyter.org/
Chapter 2
Basic Supervised Learning Algorithms
In this chapter we will introduce specific instances of machine learning algorithms, and we will
see how to evaluate them on given test data. Our first and probably simplest approach, namely
linear regression, is discussed in Section 2.1. Here, the model class, over
which the learning problem is solved, consists of (affine) linear functions. This leads to an easy
numerical treatment of the underlying optimization problem. Section 2.2 is dedicated to the
introduction of different measures that serve to quantify how well a regression (or classification)
algorithm performs. Subsequently, we encounter our first programming tasks in Section 2.3. In
addition to the direct solvers, which we discuss in Section 2.1, iterative solvers can be employed
to tackle the linear regression problem. They are studied in Section 2.4. Section 2.5 illustrates
the importance of data normalization, especially for iterative solvers. Finally, we will consider
the k-nearest neighbors algorithm in Section 2.6, which provides another intuitive way to solve
a supervised learning task. More details on linear regression and the k-nearest neighbors method
can be found in [HTF09, Mur22].
2.1 Linear least squares
In linear least squares (LLS) regression, we seek the minimizer

α := arg min_{γ∈R^{d+1}} (1/n) Σ_{i=1}^{n} (γ^T · x̂i − yi )²

with parameter vector γ = (γ0 , . . . , γd )^T ∈ Rd+1 . Here, x̂i := (1, [xi ]1 , . . . , [xi ]d )^T ∈ Rd+1
is the augmented vector based on a data point xi ∈ Rd for all i = 1, . . . , n, and [·]j denotes the
jth component of a vector. The above expression for the LLS problem can be rewritten as

α := arg min_{γ∈R^{d+1}} (1/n) (Xγ − y)^T (Xγ − y) = arg min_{γ∈R^{d+1}} ( γ^T X^T Xγ − 2γ^T X^T y + y^T y ) (2.1)

with X := (x̂1 , x̂2 , . . . , x̂n )^T ∈ Rn×(d+1) and y := (y1 , . . . , yn )^T ∈ Rn .
Setting the gradient of (2.1) with respect to γ to zero yields

X^T X α = X^T y, (2.2)
i.e., a minimizer α of (2.1) fulfills this equation and vice versa. This system of linear equations
is also known as the normal equations of the least squares problem.
Furthermore, we note that the minimizer of (2.1) is unique if X T X has full rank, i.e., if it is
an invertible (d + 1) × (d + 1) matrix. In this case we have
X^T X α = X^T y ⇔ α = (X^T X)^{−1} X^T y.
Thus, to tackle the LLS regression problem, we need to solve a system of linear equations.
Moreover, if X T X happens to have a non-trivial null space and is, therefore, not invertible, we
can still look for solutions of (2.2) to find a minimizer, but a solution is no longer unique.
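As a small illustration of this direct approach (anticipating the programming tasks later in this chapter), the following NumPy sketch assembles the augmented data matrix X and solves the normal equations (2.2); the function name is our own, and the code assumes that X^T X has full rank.

import numpy as np

def lls_fit(X_data, y):
    # Solve the normal equations X^T X alpha = X^T y for the augmented data matrix X.
    n = X_data.shape[0]
    X = np.hstack([np.ones((n, 1)), X_data])   # rows are the augmented vectors x_hat_i
    alpha = np.linalg.solve(X.T @ X, X.T @ y)  # assumes X^T X is invertible
    return alpha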
If an invertible matrix A can be factorized as A = LU with a lower-triangular matrix L and an upper-triangular matrix U (computable by Gaussian elimination), a linear system Ax = b can be solved via

b = Ax = LU x ⇔ x = U^{−1} L^{−1} b.
Note that the application of the inverses of U and L can be easily computed by
forward and backward substitution, i.e., by successively determining each entry of the
resulting vector. In this way, a complicated, direct matrix inversion can be avoided.
If A is symmetric and positive definite, there exists a unique so-called Cholesky
decomposition A = LLT for a lower-triangular matrix L, which can be computed
via a modified Gaussian elimination. The computation of a Cholesky decomposition
is twice as fast as the one of the LU decomposition. Furthermore, it is stable with
respect to small input changes, which cannot be guaranteed for the LU decomposition
unless pivoting is used. For more details on computing the solution of a system of
linear equations, we refer the reader to [GVL13].
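Both factorizations are readily available in SciPy. The following snippet solves the same small symmetric positive definite system once with an LU decomposition (with partial pivoting) and once with a Cholesky decomposition; the concrete matrix is just a toy example of our own.

import numpy as np
from scipy.linalg import lu_factor, lu_solve, cho_factor, cho_solve

A = np.array([[4.0, 1.0], [1.0, 3.0]])        # symmetric and positive definite
b = np.array([1.0, 2.0])

x_lu = lu_solve(lu_factor(A), b)              # LU decomposition with partial pivoting
x_chol = cho_solve(cho_factor(A), b)          # Cholesky decomposition A = L L^T
assert np.allclose(x_lu, x_chol)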
ò Singular value decomposition
For a matrix A ∈ R^{n×d̃}, a factorization

A = U D W^T

with orthogonal matrices U ∈ R^{n×n} and W ∈ R^{d̃×d̃} and the diagonal matrix D = diag(σ1 , . . . , σ_{min(n,d̃)}) ∈ R^{n×d̃} of non-negative singular values is called singular value decomposition. Here, the singular values are non-negative and sorted according to their size, i.e., σ1 ≥ · · · ≥ σ_{min(n,d̃)} ≥ 0. The singular values of A are uniquely determined. The SVD can be computed with O(min(d̃, n)² max(d̃, n)) floating point operations. For details on the SVD, we refer the reader to [GVL13].
Besides the SVD, we will also need the pseudo-inverse of a matrix in the following.
ò Moore–Penrose pseudo-inverse
The (Moore–Penrose) pseudo-inverse of A ∈ R^{n×d̃} with SVD A = U D W^T is a
generalization of the inverse of the matrix A. It is given by

A† := W D† U^T

with a matrix D† ∈ R^{d̃×n} that fulfills

[D†]_{ij} := 1/σi if i = j and σi > 0, and [D†]_{ij} := 0 else.

The pseudo-inverse satisfies the four defining properties

A A† A = A, A† A A† = A†, (A A†)^T = A A†, (A† A)^T = A† A.
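In NumPy, the SVD and the pseudo-inverse are available directly; the following sketch also shows how A† can be assembled from the (reduced) SVD according to the definition above.

import numpy as np

A = np.random.default_rng(0).standard_normal((5, 3))
U, s, Wt = np.linalg.svd(A, full_matrices=False)     # reduced SVD: A = U @ diag(s) @ Wt

s_inv = np.where(s > 1e-12, 1.0 / s, 0.0)            # invert only the nonzero singular values
A_pinv = Wt.T @ np.diag(s_inv) @ U.T                 # A^dagger = W D^dagger U^T
assert np.allclose(A_pinv, np.linalg.pinv(A))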
An alternative way to obtain a minimizer of (2.1) is to employ the pseudo-inverse and to set

α := X† y. (2.3)

To see that this indeed yields a minimizer, note that minimizing (2.1) amounts to minimizing ∥Xγ − y∥2 .
Therefore, we are looking for a γ such that its image under X is closest to y in the Euclidean
norm. We know from basic linear algebra that this is fulfilled for a γ such that Xγ is an or-
thogonal projection of y onto the image space of X. To this end, let C ∈ R(d+1)×n be a matrix
such that

(XC)² = XC = (XC)^T ,
i.e., XC is an orthogonal projector onto the image of X. Then, γ := Cy is already a solution
to (2.1) since
Xγ = XCy
is an orthogonal projection of y onto the image of X. Finally, to prove that (2.3) is a solution to
(2.1), we have to show that X † is a valid candidate for such a matrix C. This is straightforward
since
(XX†)² = (U D W^T W D† U^T)² = (U D D† U^T)² = U D D† D D† U^T = XX†
due to the orthogonality of U and W and because DD † is idempotent. The fact that (XX † )T =
XX † is one of the four defining properties of the pseudo-inverse. Therefore, C = X † is a valid
choice, and (2.3) solves (2.1).
Condition number
The condition number of a matrix A (with respect to the Euclidean norm) is defined as the ratio

κ(A) := σmax (A) / σmin (A)

of its largest and smallest singular values. It quantifies how strongly a perturbation ∆b of the right-hand side can affect the solution, i.e., how far apart the solutions x and x + ∆x of

Ax = b and A(x + ∆x ) = b + ∆b

can be.
Because of its definition, the best condition number we could hope for when solving a system
Ax = b is κ(A) = 1. In this case, all singular values would be equal to 1. In general, we are
aiming for a condition number that is small, i.e., close to 1.
In the least squares setting the system matrix is usually determined by the drawn samples
and is thus fixed. However, while the SVD can be computed by matrix-vector multiplications
with X (see [GVL13] for details), the normal equations (2.2) involve the system matrix X^T X.
Assuming that the SVD of X is given by X = U D W^T, the system matrix is given by

X^T X = W D^T U^T U D W^T = W D^T D W^T .

Since W is an orthogonal matrix, the decomposition on the right-hand side is already the SVD
of X^T X. Therefore, its singular values are given as the diagonal values of D^T D, which are the
squared singular values of X. This leads to the fact that

κ(X^T X) = κ(X)² ,
which is the reason why it is beneficial in terms of numerical stability to use (2.3) instead of
solving (2.2).
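The squaring of the condition number is easy to observe numerically; in the following sketch, a Vandermonde-type data matrix serves as an arbitrary, mildly ill-conditioned example of our own.

import numpy as np

X = np.vander(np.linspace(0.0, 1.0, 50), 4)    # a mildly ill-conditioned 50 x 4 data matrix
print(np.linalg.cond(X))                       # kappa(X)
print(np.linalg.cond(X.T @ X))                 # approximately kappa(X)^2
print(np.linalg.cond(X) ** 2)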
Besides employing the SVD to compute the pseudo-inverse in (2.3), using a QR decomposition
of X is another way to solve (2.1) without having to deal with the normal equations (2.2); see
[GVL13] for details.
2.2 Evaluation
After we trained the LLS model, i.e., after we determined the parameters α from (2.1), we want
to study how good the model actually is.
Evaluation of ML algorithms
To assess the quality of an already trained ML model, we need a way to quantify its
performance. Usually, we do not have direct access to the measure µ from which our
training data was generated, but we assume that we can sample a test data set
i.i.d. from µ. Then, we approximately quantify how small the test error of the model
is, i.e., the error that we get when taking α—or equivalently f (t) = α0 + Σ_{i=1}^{d} αi ti
—to predict the outputs/labels of the test data points. There are many possible criteria
to measure the error on the test data but usually the employed loss function is taken
(without all additional regularization terms).
In cases where not enough test data points are available to compute a represen-
tative test error, (re)sampling techniques like bootstrapping can be employed; see
Section 10.2.4 for details. Of course, it is of major importance that the test data are
not part of the training data in order to avoid an evaluation bias due to overfitting.
Finally, note that the assumption of the test data following the same distribution as
the training data does not necessarily hold in practical applications. For instance, a
covariate shift can occur in the data set, i.e., the measure µ has changed during or
after training; see, e.g., [SK12].
Evaluation measures for regression Since we have chosen a least squares loss to train the
model, it is only natural to evaluate the mean squared error (MSE) on the test data set D̄, i.e.,
MSE (f, D̄) := (1/n̄) Σ_{i=1}^{n̄} (f (x̄i ) − ȳi )² .
Other commonly used error measures include the mean absolute error
MAE (f, D̄) := (1/n̄) Σ_{i=1}^{n̄} |f (x̄i ) − ȳi | ,
and the coefficient of determination (R² score)

R² (f, D̄) := 1 − Σ_{i=1}^{n̄} (f (x̄i ) − ȳi )² / Σ_{i=1}^{n̄} (ȳi − ȳmean )² with ȳmean := (1/n̄) Σ_{i=1}^{n̄} ȳi ,

which is close to 1 for a good model and close to 0 (or even negative) for a poor one.
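These regression measures can be computed in a few lines; the following sketch (function name of our own) assumes that f is a Python callable and that the test data are given as arrays.

import numpy as np

def regression_scores(f, X_test, y_test):
    # MSE, MAE, and R^2 score of a model f on the test data set.
    pred = np.array([f(x) for x in X_test])
    mse = np.mean((pred - y_test) ** 2)
    mae = np.mean(np.abs(pred - y_test))
    r2 = 1.0 - np.sum((pred - y_test) ** 2) / np.sum((y_test - y_test.mean()) ** 2)
    return mse, mae, r2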
Evaluation measures for classification Since |Γ| < ∞ for classification, it only makes
sense to consider functions f : Ω → Γ with finite image set. An easy way to obtain an appro-
priate classifier f : Ω → Γ from a regression function f˜ : Ω → R is to use a so-called level set
classifier. For example, let us consider a two-class problem with Γ = {a, b} ⊂ R. Then we use

fclass (x) := a if |f̃(x) − a| ≤ |f̃(x) − b| and fclass (x) := b else, (2.4)

i.e., the regressed value f̃(x) is mapped to the closer of the two class labels.
From now on, we implicitly assume that the function under consideration has image Γ when
we use the evaluation measures for classification below. However, by slightly abusing notation,
we can also use them for a real-valued function f˜ : Ω → R when we assume that the level set
classifier (2.4) is employed instead of f˜ in that case.
The most common error measure for classification is the so-called accuracy
Acc (f, D̄) := |{i ∈ {1, . . . , n̄} | f (x̄i ) = ȳi }| / n̄ , (2.5)
which measures the proportion of correct classifications among all test data. Further error mea-
sures in binary classification tasks are the sensitivity and the precision, for instance. To this end,
let us briefly introduce the concept of true/false positives and negatives for a two-class problem,
where Γ = {−1, 1}. After training a model f , the numbers of true positives (T P ) and true
negatives (T N ) on the test data set D̄ are given by

T P := |{i | f (x̄i ) = 1 and ȳi = 1}| and T N := |{i | f (x̄i ) = −1 and ȳi = −1}| .

Analogously, the numbers of false positives (F P ) and false negatives (F N ) are defined as

F P := |{i | f (x̄i ) = 1 and ȳi = −1}| and F N := |{i | f (x̄i ) = −1 and ȳi = 1}| .
With these definitions, we can define the sensitivity or recall as the rate of true positives
Sens(f, D̄) = T P / (T P + F N ).
Similarly, the specificity is defined as the rate of true negatives,

Spec(f, D̄) = T N / (T N + F P ),

and the precision as

Prec(f, D̄) = T P / (T P + F P ),
which describes the ratio between the true positives and all predicted positives. Using these
measures makes more sense than using the accuracy when it is important to minimize the number
of false negatives (or the number of false positives), as in safety-critical applications for instance.
Moreover, these measures are of particular relevance for imbalanced data, i.e., when the number
of instances per class varies strongly.
More sophisticated measures, e.g., the F1 score,
can be derived from combinations of the sensitivity, specificity, and precision values.
A meaningful measure for multi-class applications, i.e., when |Γ| > 2, is the so-called confu-
sion matrix C ∈ N|Γ|×|Γ| , which displays the true and false positives on a class-by-class basis.
In particular, it has the entries

C_{ij} := |{k ∈ {1, . . . , n̄} | ȳk = i and f (x̄k ) = j}| ,

i.e., the entry C_{ij} counts how many test points of true class i are assigned to class j by the model.
This metric is especially helpful when certain classes get misclassified as one particular other
class, for instance. The entries of the confusion matrix can further be used to compute more so-
phisticated correlation measures between two classes, e.g., the Matthews correlation coefficient
between classes i and j. A generalization of the MCC including all classes at once can also be
considered; see [JRF12] for details.
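For a binary problem with Γ = {−1, 1}, the quantities above can be computed directly from the predictions; the following sketch is a straightforward NumPy implementation whose function name is our own.

import numpy as np

def binary_classification_scores(f, X_test, y_test):
    # Accuracy, sensitivity, specificity, and precision for labels in {-1, 1}.
    pred = np.array([f(x) for x in X_test])
    tp = np.sum((pred == 1) & (y_test == 1))
    tn = np.sum((pred == -1) & (y_test == -1))
    fp = np.sum((pred == 1) & (y_test == -1))
    fn = np.sum((pred == -1) & (y_test == 1))
    accuracy = (tp + tn) / len(y_test)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    return accuracy, sensitivity, specificity, precision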
X Task 2.1. First, draw n = 100 random numbers x1 , . . . , x100 ∈ [0, 1), which are
uniformly distributed. To this end, you can use the numpy.random.rand routine.
Subsequently, compute the corresponding output values yi
for i = 1, . . . , 100. Use matplotlib to generate a scatter plot of the 100 two-
dimensional points (xi , yi ) for i = 1, . . . , 100 (i.e., plot the points in a 2D coordinate
system).
X Task 2.2. Implement a linear least squares algorithm using the SVD to com-
pute (2.3). Hint: You can use numpy.linalg.pinv to compute the pseudo-inverse
of a matrix using the SVD. Apply the algorithm to the data from Task 2.1. Plot the
scattered input data as in Task 2.1 together with the regressor f : R → R given by
f (x) := α0 + α1 x.
Compare the resulting coefficients α0 and α1 to the ones you would expect for the
data.
Next, let us consider a two-dimensional classification problem, where we use the level set func-
tion f = fclass (see (2.4)) of a linear least squares regressor f˜ as a solution. To this end, we
first create a suitable two-dimensional data set D in Task 2.3. Subsequently, we will train a
linear least squares regressor f˜ on D and visualize the hyperplane which is defined by the corre-
sponding level set in Task 2.4. In particular, for a two-class problem with Γ = {a, b} ⊂ R, the
hyperplane

H := { x ∈ R2 | |f̃(x) − a| = |f̃(x) − b| }
splits the ambient space R2 into two parts, which indicate what class f evaluates to.
In the specific use case from Task 2.3 and Task 2.4 the data has labels Γ = {0, 1}. Therefore,
if f˜(x) < 0.5, the point x is assigned to class 0. Otherwise, it is assigned to class 1, i.e.,
f (x) := 0 if f̃(x) < 1/2, and f (x) := 1 else.
X Task 2.4. Apply the LLS algorithm from Task 2.2 to the data from Task 2.3. Plot the
scattered input data as in step (b) of Task 2.3 together with the separating hyperplane
H, i.e., the contour line given by
α0 + α1 x1 + α2 x2 = 1/2 ,
where x1 and x2 denote the coordinates in R2 (not to be confused with the data xi ).
The result for the task should look (approximately) like Figure 2.1.
As mentioned above, the separating hyperplane from Task 2.4 can be used to divide or classify
the data into two parts by the level set classifier (2.4). Let us quantify how good our classifier
really is.
X Task 2.5. Build the confusion matrix C for the data and the hyperplane from
Task 2.4. In our case this is a 2 × 2 matrix with i, j ∈ {0, 1}, since |Γ| = 2. Calculate
the accuracy trace(C)/n.
X Task 2.6. Create 10 000 test points for each of the two classes in the same way as you
created the training data in step (b) of Task 2.3. Evaluate the LLS classifier, which
was built on the training data, now on the test data and compute the confusion matrix
and the accuracy of the test data. Compare your results to the ones from Task 2.5.
Next, we will try our LLS classifier on the Iris data set [DG17, Fis36]. This data set consists
of 150 points in R4 , which describe three different types of Iris plants; see Figure 2.2 for an
example. We have three classes: Iris-setosa, Iris-versicolor, and Iris-virginica.
Figure 2.1: An example for a contour line plot for a separating hyperplane of two data classes.
Note that the linear separation does not work very well due to the large overlap of the two data
classes.
The four features, i.e., the coordinates in Ω = R4 , refer to certain length and width measurements
of the plants.
We will classify one of the three plant classes against both of the remaining classes by using
our LLS algorithm. Thus, we encounter a two-class problem. To this end, we first have to
read in the data set and cast the class names to Γ = {0, 1}. Then, we can proceed as for
the previous tasks, i.e., we build an LLS regressor f˜ and construct the corresponding level set
classifier f = fclass (see (2.4)), for which we compute the confusion matrix and the accuracies.
Instead of reading in the data by hand, we employ the very useful pandas library (https://pandas.pydata.org/) in Python:
import pandas as pd
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'
irisDataFrame = pd.read_csv(url, header=None)
In PANDAS, the data are stored in an instance of DataFrame, on which many useful operations
can be run.
X Task 2.7.
(a) Read in the Iris data set and use the data labels yi = 0 for the Iris-setosa
instances and yi = 1 for the Iris-versicolor and Iris-virginica classes:
a.1. Run the LLS algorithm by using only the first two dimensions of Ω in the
input data, i.e., only look at the first two features. Plot the scattered data
and the separating hyperplane as in Task 2.4.
a.2. Now run the LLS algorithm by using all four features/dimensions of the
input data. Compute the confusion matrix and the accuracy.
(b) Finally, run the same two steps as in (a), but now try to classify Iris-versicolor
instances (label y i = 0) against both Iris-setosa and Iris-virginica (label y i =
1). What do you observe?
Gradient descent
Given a differentiable objective function J and a step size (or learning rate) ν > 0, the gradient descent method iteratively updates the current point via

x ← x − ν∇J(x).
Note that gradient descent can also be applied for nonlinear problems. In this case,
it can be seen as iteratively solving a linear approximation to the nonlinear problem;
see [BV04] for more details.
Algorithm 1: A gradient descent algorithm with step size (or learning rate) ν > 0
Input: loss function J, step size ν > 0, tolerance ε > 0.
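A straightforward realization of such a gradient descent loop might look as follows; the iteration stops once the gradient norm falls below the tolerance ε (the bound max_iter is an additional safeguard of our own).

import numpy as np

def gradient_descent(grad_J, x0, nu, eps=1e-8, max_iter=100000):
    # Iterate x <- x - nu * grad_J(x) until the gradient is small enough.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_J(x)
        if np.linalg.norm(g) < eps:
            break
        x = x - nu * g
    return x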
In the case of linear least squares, we iteratively update our solution by taking steps in the nega-
tive direction of the gradient of the loss function. To this end, let J(γ) := (1/n) ∥Xγ − y∥₂² denote
the least squares loss from (2.1), whose gradient is given by ∇J(γ) = (2/n) X^T (Xγ − y).
X Task 2.8. Implement the gradient descent method and run an LLS algorithm with
a gradient descent optimizer for the data from Task 2.7 (a.1.). Choose ν ∈
{1, 10−1 , 10−2 , . . .} as the largest value such that convergence is achieved. Create a
plot of the value of J versus the actual iteration number. What do you observe?
Data normalization
The need for normalizing the data is often overlooked when applying ML algorithms.
However, it is a crucial data preprocessing step, which can significantly alter the
performance and even the outcome of an ML algorithm. The most common method
of data normalization is data standardization. Here, the vector of feature means
is subtracted from the data and the result is divided elementwise by the feature’s
standard deviation. Additionally, other data normalization techniques like clipping at
certain boundaries are used in practical applications.
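Standardization of a data matrix whose rows are the data points amounts to two NumPy operations; the following sketch is merely an illustration of the box above, not a prescribed implementation.

import numpy as np

def standardize(X):
    # Subtract the per-feature mean and divide by the per-feature standard deviation.
    return (X - X.mean(axis=0)) / X.std(axis=0)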
X Task 2.9. Normalize the data from Task 2.7 (a.1.) by standardization. To this end,
calculate the mean mj and the standard deviation σj for each feature j (i.e., each
coordinate direction j of the data set) and set the jth component of the ith data point
to

[xi ]j := ([xi ]j − mj ) / σj .
Now run the gradient descent LLS algorithm on the normalized data. Similarly as in
Task 2.8, choose ν as the largest value such that convergence is achieved. Compare
the first 100 iteration steps by plotting the value of J versus the iteration number for
both the normalized and the unnormalized cases. What do you observe?
k-nearest neighbors
The idea of the k-nearest neighbors (k-nn) algorithm on data (xi , yi ) ∈ Rd × R is to
build the solution to the least squares problem directly from the labels yi . To this end,
the labels of the k data points closest to an evaluation point x are taken into account.
In particular, we use the piecewise constant approximation
f (x) := NearNeighk (x) := (1/k) Σ_{i : xi ∈ Nk (x)} yi , (2.6)

where Nk (x) denotes the k-neighborhood of x, i.e., the set of the k data points closest14 to x.
To justify why the k-nn algorithm actually constructs a meaningful solution, we first have a look
at the continuous least squares problem with model class M. We are looking for

f := arg min_{g∈M} Eµ [ (g(X) − Y )² ] .

If M contains all (measurable) functions, the minimizer of this problem is given by the conditional expectation f (x) = Eµ [Y |X = x].
14 Note that distance measures other than the Euclidean distance can also be used to define the k-neighborhood.
Having a closer look at the k-nn estimator, we see that it essentially takes two steps to approx-
imate the quantity Eµ [Y |X = x]:
1. Since only finitely many samples are given, the expected value is approximated by the
sample mean
Eµ [Y |X = x] ≈ (1/p) Σ_{j=1}^{p} y_{ij} ,

where the p ≤ n indices ij ∈ {1, . . . , n} are picked corresponding to data pairs (x_{ij} , y_{ij} )
with x_{ij} = x.
2. Since there usually do not exist (m)any xi in the data set with xi = x, we also take
values sufficiently close to x into account. To this end, we consider the k-neighborhood
Nk (x) of x and use the corresponding p = k indices of the data points in Nk (x) for the
approximation in step 1.
Note that if |Nk (x)| > k, we randomly choose points in Nk (x) with the largest distance to x
and remove them until |Nk (x)| = k to get the appropriate neighborhood size.
X Task 2.10. Let us now test how the algorithm performs for every possible k.
For the implementation of k-nearest neighbors you can use the scipy.spatial.
distance.cdist and numpy.argpartition functions, for example. To this end,
use the k-nearest neighbors algorithm with the level set classifier (2.4).
(a) Run the k-nearest neighbors algorithm for the data from Task 2.3 for all k =
1, . . . , 200 and store the accuracy for each k.
(b) Do the same as in step (a) but now use the data created in Task 2.6 as test data.
(c) Plot the accuracies from steps (a) and (b) versus the value of k. What do you
observe?
As we have seen in Task 2.10, we can use the k-nearest neighbors formula (2.6) together with a
level set classifier for binary classifications tasks. However, to classify a data point x in a multi-
class setting, we determine the majority class among the k-nearest neighbors of x in the training
data set. Note that this gives the same result as the level set classifier in the binary classification
setting.
Logistic regression Another famous linear model to obtain optimal classifiers is the logistic
regression model, by which the distribution of the underlying random variables is modeled; see
[HTF09, Mur22]. This approach also involves a loss function that is more suitable for classifica-
tion than the least squares loss.
Regularization Instead of simply minimizing a loss function as in the case of linear least
squares in (2.1), we could add a regularization term to the minimization problem, as we already
sketched at the end of Section 1.3. This can be interpreted as a tradeoff between minimizing the
loss on the training data and obtaining a simple or sparse model; see [HTF09, Mur22]. Examples
for such regularization terms are ℓp norms of the coefficients (Lasso, Tikhonov) or more complex
norms involving derivatives of the minimizer. Note that this approach can also recast an ill-posed
linear least squares problem (2.2) into a well-posed one when the matrix X T X does not have
full rank.
Chapter 3
Optimal Separating Hyperplanes and Support Vector Machines
We have seen in Task 2.4 how a linear model can be used to obtain a separating hyperplane
for classification. A drawback of the LLS approach is the fact that the least squares error does
not capture what we expect from an optimal separation. For illustration, let us consider an
example where the xi belonging to one of the two classes Γ = {−1, 1} are spread very widely
over the domain, but the xi belonging to the other class are very densely concentrated. In such
a case we might encounter a situation where there exists an element in MLin such that its 0-
level set function perfectly separates the two classes −1 and 1, but the solution to (2.2) does
not; see Figure 3.1. The reason is that the linear least squares algorithm minimizes the squared
distances between the true class label (either −1 or 1) and the prediction (a real number). Thus, its
objective is not to determine a separating hyperplane but rather to avoid predictions that deviate
significantly from the true label.
In this chapter, we now consider methods to obtain a so-called separating hyperplane for a
two-class classification problem. Section 3.1 deals with the concept of optimal (margin) separat-
ing hyperplanes and the corresponding optimization problem. Since this might not be solvable at
all in certain situations, we introduce a slightly weakened form in Section 3.2, which leads to the
well-known support vector machines. To solve the underlying optimization problem, we study a
special optimization algorithm for quadratic optimization problems with linear constraints called
sequential minimal optimization in Section 3.3. Subsequently, we encounter our first program-
ming tasks on separating hyperplanes in Section 3.4. While the initial model space for support
vector machines contains only affine linear functions, the so-called kernel trick, which we dis-
cuss in Section 3.5, allows us to successfully deal with nonlinear problems too. Since the sup-
port vector machine algorithm employs certain hyperparameters, we consider their optimization
in Section 3.6. Finally, we introduce the Python library scikit-learn, which we employ to
deal with an image classification task in Section 3.7.
H := { x ∈ R^d | f(x) = 0 }
Figure 3.1: Example for a scenario where the 0-level set function of the linear least squares (LLS)
solution does not separate the data classes (depicted by − and +) although there exists a linear
separating hyperplane.
(γ*)^T (x − t_0) = (γ^T x + γ_0) / ∥γ∥
Note that an OSH does not necessarily exist since the data might not be linearly
separable at all; see Section 3.5 for more details.
Having a closer look at Figure 3.1, we observe that the dotted separating hyperplane is already
the optimal separating hyperplane in the above sense.
Since we can use arbitrary positive multiples of γ and γ_0 and still obtain the same margin M in (3.1), we set ∥γ∥ = 1/M and obtain the equivalent minimization problem

min_{γ_0 ∈ R, γ ∈ R^d} (1/2) ∥γ∥²    (3.2)
s.t. y_i (γ^T x_i + γ_0) ≥ 1  ∀ i ∈ {1, . . . , n}.
By equivalence, we mean that a minimizer of (3.2) is also a maximizer of (3.1) and vice versa. As
(3.2) is a convex minimization problem with linear inequality constraints, there exists a unique
(global) minimizer, which is also a Karush–Kuhn–Tucker point.
With the definition

L(t, a, b) := F(t) + Σ_{i=1}^m a_i G_i(t) + Σ_{j=1}^l b_j H_j(t)

of the so-called Lagrangian, the KKT conditions for some t ∈ τ and Lagrange multipliers a ∈ R^m and b ∈ R^l read

∇_t L(t, a, b) = 0,
G_i(t) ≤ 0  ∀ i = 1, . . . , m,
H_j(t) = 0  ∀ j = 1, . . . , l,
a_i ≥ 0  ∀ i = 1, . . . , m,
a_i G_i(t) = 0  ∀ i = 1, . . . , m.
Instead of directly looking for a minimum of (3.2), we can equivalently look for points fulfilling
the KKT conditions. To this end, note that we have m = n inequality constraints Gi (t) and
l = 0 equality constraints Hj (t). Let
L((γ, γ_0), β) := (1/2) ∥γ∥² + Σ_{i=1}^n β_i (1 − y_i (γ^T x_i + γ_0))
be the Lagrangian function. The KKT conditions for a minimizer now read
(a) ∇_γ L = ∇_{γ_0} L = 0,
(b) y_i (γ^T x_i + γ_0) ≥ 1  ∀ i ∈ {1, . . . , n},
(c) β_i ≥ 0  ∀ i ∈ {1, . . . , n},
(d) β_i ( y_i (γ^T x_i + γ_0) − 1 ) = 0  ∀ i ∈ {1, . . . , n}.
Because of (a1) the xi with βi > 0 are actually called support vectors since they span the
hyperplane vector γ. If we now substitute (a1) into the Lagrangian and apply (a2), we obtain
L((γ, γ_0), β) = −(1/2) Σ_{i,j=1}^n β_i β_j y_i y_j x_i^T x_j − γ_0 Σ_{i=1}^n β_i y_i + Σ_{i=1}^n β_i,

where the second term vanishes since Σ_{i=1}^n β_i y_i = 0 by (a2).
If the problem is convex and there are no equality constraints Hj present, the above
formulation becomes the so-called Wolfe dual problem
If there exists some t ∈ τ such that G_i(t) < 0 for all i = 1, . . . , m, it can be shown that the minimal value of (3.3) equals the maximal value of (3.6) and vice versa, i.e., by solving one problem we have also solved the other one. More details can again be found in [BV04].
In our case, we have strong duality, i.e., the maximal value of the dual problem equals the
minimal value of the primal problem (3.2). Furthermore, because of the convex nature of the
primal problem, the maximizer of (3.4) defines a minimizer to (3.2) via condition (a1); see
[BV04]. In particular, condition (a1) lets us write the solution to (3.4) as
f(t) = γ_0 + Σ_{i=1}^d γ_i t_i = γ_0 + Σ_{i=1}^n β_i y_i x_i^T t.    (3.7)
This representation is very efficient if the dimension d is very large but most of the βi are ac-
tually 0. Note that, after solving (3.4), we still have to determine γ0 ∈ R to obtain the final
representation. This will be dealt with in the next section.
Outliers
A data point (x, y) that does not exhibit the general behavior of the rest of the data
or that is drawn according to a different distribution from the rest of the data is called
an outlier. Outlier detection is a research area on its own, which we do not address
here. A survey on outlier detection methods is given in [HA04].
To cope with these situations we have to come up with a more sophisticated (and regularized) version of the algorithm: the so-called soft-margin hyperplanes method, also known as the support vector machine.
Figure 3.2: Two situations in which the OSH method does not work well or at all. (a) An example where one outlier heavily influences the solution of the OSH algorithm. (b) An example where an OSH solution does not exist since the data are not linearly separable.
We now introduce slack variables ξi ≥ 0 for all i = 1, . . . , n. Instead of using the hard con-
straints from OSH, we allow some slack for misclassified values and vectors close to the hyper-
plane. To this end, we reformulate the OSH optimization problem (3.2) with the help of the slack
variables to obtain
min_{γ_0 ∈ R, γ ∈ R^d, ξ ∈ R^n} (1/2) ∥γ∥² + C Σ_{i=1}^n ξ_i
s.t. y_i (γ^T x_i + γ_0) ≥ 1 − ξ_i  ∀ i ∈ {1, . . . , n}
and ξ_i ≥ 0  ∀ i ∈ {1, . . . , n},
where C > 0 is the regularization parameter. We can recover OSH by setting C = ∞, i.e., by
enforcing ξi = 0 for all i = 1, . . . , n. A large C results in better fits of the training data, while a
small C allows for larger slack and leads to larger margins.
By introducing new Lagrange multipliers ri ≥ 0 for all i = 1, . . . , n, we get the Lagrangian
L((γ, γ_0), ξ, β, r) = (1/2) ∥γ∥² + C Σ_{i=1}^n ξ_i − Σ_{i=1}^n β_i ( y_i (γ^T x_i + γ_0) − 1 + ξ_i ) − Σ_{i=1}^n r_i ξ_i.
Again, we are interested in the KKT conditions. Calculating ∇γ L and ∇γ0 L gives the same
conditions (a1) and (a2) from OSH. Furthermore, a KKT point fulfills
(∂/∂ξ_i) L = C − β_i − r_i = 0  ⇔  r_i = C − β_i  ∀ i = 1, . . . , n.
Substituting this together with (a1) and (a2) into the Lagrangian, we obtain
L((γ, γ_0), ξ, β, r) = −(1/2) Σ_{i,j=1}^n β_i β_j y_i y_j x_i^T x_j + C Σ_{i=1}^n ξ_i − γ_0 Σ_{i=1}^n β_i y_i
                       + Σ_{i=1}^n β_i − Σ_{i=1}^n β_i ξ_i − Σ_{i=1}^n (C − β_i) ξ_i
                     = −(1/2) Σ_{i,j=1}^n β_i β_j y_i y_j x_i^T x_j + Σ_{i=1}^n β_i,

where the term γ_0 Σ_{i=1}^n β_i y_i again vanishes by (a2).
Note that this differs from (3.4) only in the upper bound C on each Lagrange parameter βi for
i = 1, . . . , n, which stems from the additional KKT condition for the Lagrange multipliers ri ,
i.e., 0 ≤ ri = C − βi . The remaining conditions are derived completely analogously to (3.4).
Finally, we need to fit the so-called bias γ0 ∈ R. To this end, note that computing the KKT
conditions for the Lagrangian L((γ, γ0 ), ξ, β, r) leads to ri ξi = 0 and
β_i ( y_i (γ^T x_i + γ_0) − 1 + ξ_i ) = 0
for all i ∈ {1, . . . , n} for which 0 < βi < C. Thus, to average out possible numerical errors, we
choose γ0 to be the mean
γ_0 := mean_{k : 0 < β_k < C} ( y_k − Σ_{i=1}^n β_i y_i x_i^T x_k ).

In the following, we write Q ∈ R^{n×n} for the matrix with entries Q_ij = y_i y_j x_i^T x_j for all i, j = 1, . . . , n and y := (y_1, . . . , y_n)^T ∈ R^n for the label vector.
Chunking If n is quite large, it makes sense to use a so-called chunking approach, i.e., to split
{1, . . . , n} into two disjoint sets w (working set) and κ (fixed set) such that w ∪ κ = {1, . . . , n}.
We write β w and β κ for the subvectors of β corresponding to the indices in w and κ. The idea
is now to solve the problem only with respect to the variables in w and to repeat this process
several times for different working sets. To this end, the subproblem of optimizing with respect
to w reads
min_{β_w ∈ R^{|w|}} (1/2) β_w^T Q_ww β_w − (1_w − Q_wκ β_κ)^T β_w + (1/2) β_κ^T Q_κκ β_κ − 1_κ^T β_κ   (the last two summands are constant with respect to β_w)
s.t. 0 ≤ β_i ≤ C  ∀ i ∈ w
and β_w^T y_w = −β_κ^T y_κ,
where Qww , Qwκ , and Qκκ canonically denote rows and columns of the matrix Q with respect
to the corresponding index set. Let us have a look at one of the most successful optimization
algorithms using chunking to solve (3.8). The so-called sequential minimal optimization (SMO)
algorithm [Pla98] uses chunking with the smallest possible subsets w, i.e., |w| = 2 (since nothing
can be optimized for |w| = 1 due to the equality constraints). To this end, the algorithm works
in an iterative manner: At first the values for β and the bias γ0 from the representation (3.7) are
initialized (e.g., as 0). In every iteration step we select two indices i, j ∈ {1, . . . , n} and solve
the corresponding quadratic optimization problem, i.e., we employ the chunking approach and
choose w = {i, j} as the working set.
One step of SMO Let us now have a closer look at one iteration of the SMO algorithm with
w = {i, j}. The working set optimization problem reads
min_{β_i, β_j ∈ R} (1/2) ( β_i² x_i^T x_i + β_j² x_j^T x_j + 2 β_i β_j y_i y_j x_i^T x_j ) − c_i β_i − c_j β_j    (3.9)
s.t. 0 ≤ β_i ≤ C
and 0 ≤ β_j ≤ C
and y_i β_i + y_j β_j = − Σ_{l ∉ {i,j}} y_l β_l = y_i β_i^old + y_j β_j^old,

where c_k := 1 − y_k Σ_{l ∉ {i,j}} β_l y_l x_k^T x_l for k = i, j.
Note that due to yj yj = 1 the equality constraint is equivalent to βj = ϑ − yi yj βi for ϑ :=
yi yj βiold + βjold . Therefore, we can substitute βj into the minimization problem above. This leads
to the following formulation:
(3.9) ⇔ min_{β_i ∈ R} (1/2) χ β_i² − ψ β_i   s.t.  Lo ≤ β_i ≤ Up,

where χ := x_i^T x_i + x_j^T x_j − 2 x_i^T x_j, ψ collects the terms of the objective that are linear in β_i after the substitution, and

Lo := max(0, ϑ − C) if y_i y_j = 1,  and  Lo := max(0, −ϑ) else,
Up := min(ϑ, C) if y_i y_j = 1,  and  Up := min(C − ϑ, C) else.
As this is simply a quadratic minimization problem in one variable with box constraints, the
solution is given by
β_i = min( max( ψ/χ, Lo ), Up )  if χ > 0,
β_i = Up                          if χ = 0 and ψ ≥ 0,
β_i = Lo                          if χ = 0 and ψ < 0.
Therefore, we now have an easy way to compute the true solution for the size-2 subproblem with
indices i and j; see Algorithm 2. There, ⟨xi , xj ⟩Ω = xTi xj for Ω ⊆ Rd . Note that we can
directly compute the true solution without any numerical approximation (besides floating point
accuracy issues, etc.), which can be an important advantage over chunking algorithms working
with subproblems for which only a numerically approximate solution can be computed. This is
due to the potential accumulation of approximation errors in subsequent optimization steps.
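The closed-form solution of this one-dimensional box-constrained problem translates directly into code. The following is a minimal sketch (variable and function names are ours); chi, psi, lo, and up are assumed to be precomputed as described above.

import numpy as np

def solve_box_quadratic(chi, psi, lo, up):
    # minimize 0.5 * chi * b**2 - psi * b subject to lo <= b <= up
    if chi > 0:
        # clip the unconstrained minimizer psi / chi to the box [lo, up]
        return float(np.clip(psi / chi, lo, up))
    # chi == 0: the objective is linear in b, so the optimum lies on the boundary
    return up if psi >= 0 else lo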
Choosing working sets Finding good candidates for possible subsets w in each iteration of the SMO algorithm is quite crucial. To this end, note that the KKT conditions for the soft margin problem give rise to terms KKT_i that measure, for each index i, how strongly these conditions are violated; indices with KKT_i > 0 are therefore natural candidates for the working set.
Algorithm 2: OneStep algorithm performing a single iteration of SMO to update the coefficients β_i, β_j and the bias γ_0 of f(·) = Σ_{l=1}^n β_l y_l ⟨·, x_l⟩_Ω + γ_0. Here, ⟨·, ·⟩_Ω is the inner product on Ω ⊂ R^d.
Input: indices i, j ∈ {1, . . . , n}.
Output: updates of β_i, β_j, and γ_0.
X Task 3.1. Implement the function OneStep from Algorithm 2, which takes one iter-
ative step of the SMO algorithm for two selected indices i and j.
X Task 3.2. To have a data set on which we can test our algorithm, draw 20 two-
dimensional vectors according to an exponential distribution with parameter value 4
in each of the coordinate directions, i.e., the jth coordinate of the ith vector is drawn
i.i.d. according to [xi ]j ∼ exp(4) for all i = 1, . . . , 20 and j = 1, 2. Assign the label
−1 to these xi . Then, draw 20 two-dimensional vectors according to exp(0.5) in the
same way and assign the label 1 to them.
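One possible way to generate such a data set with NumPy is sketched below. We assume here that the parameter of the exponential distribution denotes the rate λ, so that NumPy's scale argument equals 1/λ; adjust this if you interpret the parameter differently.

import numpy as np

rng = np.random.default_rng(0)

# class -1: rate 4, i.e., scale 1/4 (under the rate interpretation of the parameter)
X_neg = rng.exponential(scale=1.0 / 4.0, size=(20, 2))
# class +1: rate 0.5, i.e., scale 2
X_pos = rng.exponential(scale=2.0, size=(20, 2))

X = np.vstack([X_neg, X_pos])
y = np.concatenate([-np.ones(20), np.ones(20)])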
X Task 3.3. Implement a function SMO that initializes β = 0 and γ0 = 0, then—in each
iteration step—randomly picks i, j ∈ {1, . . . , n} such that i ̸= j, and calls OneStep
with indices i, j to perform an optimization. Use it to complete the following tasks:
(a) After the last iteration step, we need to compute a final estimate for γ0 . To this
end, calculate the mean m of f (xk ) − yk for all indices k ∈ {1, . . . , n} for
which 0 < βk < C. Then, set γ0 ← γ0 − m.
(b) Run the SMO function with 10, 000 iteration steps to compute a support vec-
tor regressor f˜ according to (3.7), which maximizes (3.8) for the n = 40 data
points from Task 3.2. Compute the results for C = 0.01, C = 1, and C = 100.
For each C, plot the scattered data and compute the hyperplane corresponding
to f˜ = 0, i.e., compute the hyperplane corresponding to the level set classi-
fier f = fclass ; see (2.4). Compare your results to the separating hyperplane
computed by the linear least squares algorithm implemented in Task 2.4.
(c) Count the number of support vectors. Mark the corresponding xk in your
scattered data plot.
(d) Furthermore, also count the number of margin defining vectors, i.e., the num-
ber of indices k ∈ {1, . . . , n} for which C > βk > 0, and mark the cor-
responding xk in the scattered data plot. An example for such a plot can be
found in Figure 3.3.
What influence does the parameter C have on the number of the support vectors and
on the position of the separating hyperplane?
Now let us check how our classifiers perform if we evaluate them on some test data.
X Task 3.4. Draw 2, 000 test data points according to the distributions from Task 3.2
(1, 000 points for class −1 and 1, 000 points for class 1). Evaluate the accuracy (per-
centage of correctly classified data points; see (2.5)) for the LLS and SVM models
calculated in Task 3.3.
The random picks of i, j in the SMO algorithm can be very ineffective for large data sets.
Therefore, we have to come up with a better approach to choose appropriate indices in each
step of the SMO algorithm. There exist many heuristics to choose suitable indices in each step;
see [CL11, Pla98, SS02]. We will employ the Karush–Kuhn–Tucker terms KKTi of the dual
minimization problem.
X Task 3.5. Repeat Task 3.3 and Task 3.4, but, instead of drawing the indices i, j for
each SMO-step randomly, write an outer loop that iterates over all i ∈ {1, . . . , n} and
check if KKTi > 0. If this is the case, randomly pick a j ̸= i for which 0 < βj < C.
If no such j exists, randomly pick a j ∈ {1, . . . , n} \ {i}. Subsequently, run the
OneStep function for the pair (i, j). If KKTi = 0 for each i or if the maximum
number of OneStep calls (10, 000) is reached, the algorithm terminates. Compare
the results achieved with this heuristic with the results achieved by randomly picking
i and j. How do their runtimes compare?
Figure 3.4: An example of a two-dimensional data set that is not linearly separable. However,
there exists a nonlinear separation curve.
Figure 3.5: Example for applying a feature map to a two-dimensional nonlinearly separable data
set (left) such that it becomes linearly separable in three dimensions (right). Here, Ω = [−1, 1]2 ,
and the feature map ϕ : R2 → R3 is given by ϕ(s, t) := (s, t, s2 + t2 )T .
Theorem 3.5.1 (Cover 1965). There are C(n, d) many homogeneously linearly separable dichotomies of n points in general position in R^d, where

C(n, d) := 2 Σ_{k=0}^{d−1} (n−1 choose k).
Let us have a closer look at the statement of Cover’s theorem. To this end, let us first explain
the terminology being used.
Note that the condition of the data points being in general position in Rd might not be true for
real-world data sets with a lot of correlations between data points. However, for randomly drawn
data sets this is valid with probability 1.
For us, the most important conclusion to draw from Cover's theorem is the fact that, for fixed n, the number of dichotomies that are linearly separable grows when the dimension d grows. Thus, it is more likely that the data are linearly separable after being mapped into a higher-dimensional ambient space. This is the motivation for using a feature map ϕ : R^d → R^{d̃} with d̃ > d. However, we cannot draw any conclusions from Cover's theorem for a specific data set at hand. Thus, the careful construction of a suitable feature map to obtain linearly separable data in R^{d̃} is still an important task.
15 Note that this has to be understood as a truly linear separator in contrast to an affine linear separator f (x) :=
γ T x + γ0 with γ0 ̸= 0.
X Task 3.6. Generate 50 uniformly distributed i.i.d. points that lie in {t ∈ R2 | ∥t∥2 <
1} (e.g., by drawing uniformly distributed points in (−1, 1)2 until 50 of them are
within the unit sphere) and label them by −1. Now generate 50 data points, which
are uniformly distributed in {t ∈ R2 | 1 < ∥t∥2 < 2}, and label them by 1.
(a) Fit a linear SVM for C = 10 to the data and plot the scattered data as well as
the separating hyperplane.
(b) Transform the data by the feature map ϕ : R² → R³ from Figure 3.5, i.e., ϕ(s, t) := (s, t, s² + t²)^T.
Fit an SVM for C = 10 to the transformed data. Depict the scattered data and
the nonlinear separation curve in a 2D plot (i.e., in the same way as in (a)).
What does the feature map do, and why does it work so well?
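A minimal sketch of part (b), using the feature map from Figure 3.5 and SCIKIT-LEARN's SVC class as a stand-in for your own SMO implementation; the arrays X and y are assumed to hold the 100 points and their labels generated above.

import numpy as np
from sklearn.svm import SVC

def phi(X):
    # feature map from Figure 3.5: (s, t) -> (s, t, s^2 + t^2)
    return np.column_stack([X[:, 0], X[:, 1], X[:, 0] ** 2 + X[:, 1] ** 2])

clf = SVC(kernel="linear", C=10.0)
clf.fit(phi(X), y)
print("training accuracy:", clf.score(phi(X), y))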
Now, instead of constructing a feature map ϕ explicitly, we can directly work with a kernel K
since it realizes the inner products needed to solve (3.8). The approach of substituting Euclidean
products of the data by kernel evaluations is usually referred to in the machine learning commu-
nity as the so-called kernel trick [SS02], which is based on the results of Mercer [Mer09]. The
kernel trick is often applied to transform a nonlinearly separable data set to a linearly separable
data set in higher dimensions. Let us now consider which kernels are suitable to describe inner
products of feature maps.
A kernel K : Ω × Ω → R is called positive semi-definite if Σ_{i,j=1}^N c_i c_j K(t_i, t_j) ≥ 0 holds for all c_1, . . . , c_N ∈ R, all t_1, . . . , t_N ∈ Ω, and arbitrary N ∈ N. If the inequality is strict, we say that the kernel is positive
definite. A positive semi-definite, continuous, and symmetric kernel which is L2 in-
tegrable with respect to the marginal data measure µX on Ω, i.e., K ∈ L2,µX (Ω × Ω),
is called a Mercer kernel; see also [Mer09, RW06, SS02].
We will observe that there is a close relationship between Mercer kernels and feature maps. This
can be exploited to implicitly build suitable feature maps for SVMs. The corresponding feature
space is known to be a so-called reproducing kernel Hilbert space.
Define the feature map ϕ(s) := (√(λ_i) Φ_i(s))_{i∈N} with the λ_i and Φ_i, i ∈ N, from Mercer's theorem. Then, the theorem tells us that
K(s, t) = ⟨ϕ(s), ϕ(t)⟩_{ℓ_2(R)}.
A Mercer kernel thus implicitly defines a feature map into “R∞ ” or, more accurately, into the
sequence space ℓ2 (R). Therefore, we can simply take a Mercer kernel K and substitute the inner
products xTi xj in (3.8) by K(xi , xj ) for all i, j = 1, . . . , n.
Furthermore, the solution of (3.8) can now be written as the finite sum
f(x) = γ_0 + Σ_{i=1}^n β_i y_i K(x_i, x).    (3.10)
This fact is known as the so-called representer theorem; see, e.g., [KW70, SS02]. It gives a clear
advantage over the primal representation
f (x) = γ0 + γ T ϕ(x), (3.11)
where we have to deal with the (potentially) infinite-dimensional vectors γ and ϕ(x).
Famous examples of Mercer kernels (with respect to the Lebesgue measure µ_X) are the polynomial kernel with degree parameter q and the Gaussian (RBF) kernel

K_σ(s, t) := exp( −∥s − t∥² / (2σ²) )  for some σ > 0.
For further examples of Mercer kernels used in machine learning and the analysis thereof, we
refer the interested reader to [SW06, Wah90, Wen95].
Note that, besides the constrained SVM optimization problem (3.8) discussed in Section 3.2,
there exists another way to compute the SVM solution using the formulation (3.10). To this
end, a so-called hinge loss function is employed to measure the error between f (xi ) and yi .
Furthermore, the RKHS norm of f corresponding to the chosen kernel is used in a weighted
regularization term; see [SS02] for details. This can be seen as a generalization of (1.6) with a
different loss function and regularization term.
X Task 3.7. Change your SMO code such that it allows you to use a kernel function
instead of the scalar product of the input data, i.e., substitute all scalar products by
the evaluation of the kernel function. Perform an SVM classification (C = 10)
with Gaussian kernel (σ = 1) for the data from Task 3.6. Depict the scattered data
and the nonlinear separation curve in a 2D plot. The result should look similar to
Figure 3.6.
Figure 3.6: Support vector classifier with Gaussian kernel for C = 10, σ = 1.
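A Gaussian kernel function that could replace the scalar products in the SMO code of Task 3.7 might look as follows (a minimal sketch; the function name is ours):

import numpy as np

def gaussian_kernel(s, t, sigma=1.0):
    # K_sigma(s, t) = exp(-||s - t||^2 / (2 * sigma^2))
    return np.exp(-np.sum((s - t) ** 2) / (2.0 * sigma ** 2))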
Important hyperparameters of the SVM are the regularization parameter C for soft-margin classifiers (see (3.8)) and the kernel parameters, e.g., q for the polynomial kernel and σ for the RBF kernel.
A common method to determine the optimal hyperparameters p ∈ P from a finite set
P ⊂ RdH of potential parameter configurations is grid search with either a single validation
set or cross-validation (CV). Here, dH ∈ N is the number of hyperparameters that need to
be chosen and |P| is the number of possible/suitable hyperparameter combinations. Oftentimes, P = ⊗_{i=1}^{d_H} P_i is chosen, where P_i contains suitable values for the ith hyperparameter, i = 1, . . . , d_H. However, if d_H is large, we encounter the curse of dimensionality in the size of
P ; see also Section 1.6. Then, it makes sense to choose a significantly smaller set of possible
configurations P , e.g., by so-called Monte Carlo or quasi Monte Carlo methods; see [BB12].
To look for the best p ∈ P , we compute values of an evaluation measure, e.g., the accuracy,
for the results of our ML algorithm for all |P | possible choices of hyperparameters (grid search).
To obtain representative values, we should not use parts of the training data for this purpose.
Instead, a validation set is usually employed. To this end, the underlying idea is to partition the
original data into two disjoint sets, one training data set and one artificial test data set/validation
set. We then train the |P | different models on the training data set and measure their performance
on the validation set. Finally, the best hyperparameters p ∈ P are those for which the evaluation
measure on the validation set was best. Commonly, the data is split by using roughly 80% for
training the model and 20% for validating the model. Note that we could also employ other
quality measures instead of the accuracies; see Section 2.2.
While using grid search with a validation set often works well, small data sets can lead to
unreliable estimates. In these cases, k-fold cross-validation is employed. Here, several validation
sets are used instead of only one. More precisely, the original data set is randomly split into k
parts, also called folds, of approximately equal size. One fold is chosen as the validation set
while the remaining k − 1 folds serve as training data for our algorithm. Subsequently, we take a
different fold as validation data and the rest as training data and repeat the process k times until
each fold has been used as validation data once. The (arithmetic) average of the k accuracies
calculated on the validation data then serves as the quality measure. Typically, k = 5 or k = 10
is recommended; see [HTF09].
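A minimal sketch of the k-fold procedure just described, assuming a user-supplied routine fit_predict(X_train, y_train, X_val) that trains a model and returns predicted labels for the validation fold (all names are our own):

import numpy as np

def k_fold_cv_accuracy(X, y, fit_predict, k=5, seed=0):
    # randomly split the indices into k folds of approximately equal size
    idx = np.random.default_rng(seed).permutation(len(y))
    folds = np.array_split(idx, k)
    accs = []
    for i in range(k):
        val = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        y_pred = fit_predict(X[train], y[train], X[val])
        accs.append(np.mean(y_pred == y[val]))
    # arithmetic mean of the k validation accuracies
    return float(np.mean(accs))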
Now, to determine the best choice of hyperparameters, we choose small candidate sets.
For example, for the two hyperparameters of SVM with a Gaussian kernel, namely the reg-
ularization parameter C from (3.8) and the Gaussian kernel bandwidth σ, we choose C ∈
{0.01, 0.1, 1, 10, 100} and σ ∈ {1, 10, 100} and run a k-fold cross-validation for all possi-
ble combinations of parameter pairs. The pair (C, σ) with the best average accuracies in the
cross-validation process is the winner. The corresponding pseudocode can be found in Algo-
rithm 3. Note that it is absolutely crucial to avoid using the test data during the model de-
velopment phase, i.e., when tweaking hyperparameters or changing model parts. Otherwise,
overfitting effects might be present and there is no chance to detect them on the test data. More
details on this approach and variants thereof (e.g., leave-one-out cross-validation) can be found
in [HTF09, JWHT21, Mur22, SS02], for example.
After determining the optimal parameters pbest ∈ P as described in Algorithm 3, we can build
a model on the whole training data set D with parameters pbest and evaluate it on the real, unseen
test data D̄.
Figure 3.7: Four example images (28 × 28 pixels) created from the MNIST data set.
You can download and extract the data set by hand, or you can use the following lines of
P YTHON code. As you can see, you might need to install the urllib library. To this end, just
run pip install urllib3 in your shell.
# Load MNIST data
import os
import gzip
import numpy as np
from urllib.request import urlretrieve

def download(filename, source="http://yann.lecun.com/exdb/mnist/"):
    # adjust source to wherever the MNIST files are hosted
    if not os.path.exists(filename):
        urlretrieve(source + filename, filename)

def load_mnist(filename, offset=8):
    # offset: 8 for label files, 16 for image files
    download(filename)
    with gzip.open(filename, "rb") as f:
        data = np.frombuffer(f.read(), np.uint8, offset=offset)
    return data
X Task 3.8. Make yourself familiar with the SVC function in S CIKIT-L EARN, which
implements a support vector classifier.
(a) Choose a random subset of size 500 from the MNIST training data and use
it as your new training data set for cross-validation. Perform a 5-fold cross-
validation SVM to determine the optimal parameters among C ∈ {1, 10, 100}
and γ = 1/(2σ²) ∈ {0.1, 0.01, 0.001}. (Hint: You can use the SCIKIT-LEARN
function GridSearchCV.)
(b) Use the determined optimal parameters to learn a support vector classifier on a
random 2, 000 point subset of the MNIST training data and evaluate the con-
fusion matrix and the accuracy on the whole MNIST test data set. (Hint: You
can use the S CIKIT-L EARN module metrics.)
(c) Predict the labels for the first 200 data points of the test data with the classifier
trained in (b). Print out the predicted label and the true label for all misclassi-
fied instances and visualize them, i.e., plot the corresponding digit images.
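A possible skeleton for part (a) is sketched below; X_train and y_train are assumed to hold the 500 randomly chosen MNIST samples (flattened to vectors and scaled to [0, 1]) and their labels.

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

param_grid = {"C": [1, 10, 100], "gamma": [0.1, 0.01, 0.001]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)
print("best parameters:", search.best_params_)
print("best cross-validation accuracy:", search.best_score_)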
Kernel choice We have not discussed how an appropriate kernel type, e.g., Gaussian, poly-
nomial, etc., for the task at hand can be chosen. If we have some a priori problem knowledge
(such as smoothness of the “true” separation function), for example, we can exploit this in order
to choose an appropriate kernel type; see also [CZ07]. This is an informed machine learning
approach; see also [vRMB+ 23]. Furthermore, the kernel type (and its optimal hyperparameters)
can also be chosen by cross-validation over a finite set of fixed kernel functions, for instance.
Moreover, so-called multiple kernel learning approaches make use of convex combinations of
different kernel functions; see [GA11].
Regression The linear least squares and the k-nearest neighbors algorithms also apply to
the regression case, where we look for a function f such that f (xi ) ≈ yi and the yi can take
arbitrary values in R instead of only discrete ones as in classification. However, for support
vector machines this is not so straightforward since our optimization problem (3.8) originated
from the optimal separating hyperplane formulation. Nevertheless, there exists a support vec-
tor machines regression algorithm based on the minimization of the so-called ε-insensitive loss
function together with a weighted regularization term, which involves the norm of the solution
in the corresponding kernel space; see [Mur22, SS02].
Gaussian processes Besides the SVM there is another famous example for using kernels in
classification/regression methods, namely Gaussian process regression, also known as kriging.
Here, the employed kernel is determined by the covariance matrix of the training data. With the
help of this kernel, a stochastic process is defined for which every point evaluation is normally
distributed with certain mean and covariance structure. Algorithmically, the minimization of a
least squares loss function together with a weighted regularization term is computed there. For
more details on this stochastically motivated regression method, we refer the reader to [Mur22,
See04, RW06].
Chapter 4
Linear Dimensionality Reduction
One encounters large and nominally high-dimensional data sets in many real-world applications.
However, the information in such data sets is often redundant in the sense that the different
coordinates of a data set, stemming, e.g., from different physical measurements, are typically
not independent but correlated. Thus, the high-dimensional representation of the data is not in
a compact form. To determine a more effective representation of the data, we need to find a
mapping into a lower-dimensional representation space that preserves the relevant information.
But relevant can be understood in different ways, e.g., as preserving pairwise distances between
data points or as allowing for almost lossless reconstruction of the original data set. The process
of determining such a low-dimensional representation of the data is known as dimensionality
reduction; see also Section 1.2.
Dimensionality reduction methods serve many purposes, e.g., to reduce the computational
costs of a task due to the resulting smaller dimension or to gain new insights into inner structures
of the data set. To this end, a good visualization is a powerful tool to develop an idea or intuition
on what is happening. But the more dimensions our data has, the more challenging it is to visu-
alize it. While finding a suitable visualization is a topic on its own, reducing the dimension can
be a helpful start. A computational reason for reducing the dimension is scaling behavior. The
runtime of many algorithms scales polynomially or even exponentially in the dimension of the
input data. Furthermore, some methods do not perform well in high dimensions due to the curse
of dimensionality and the concentration of measure effect; see Section 1.6. Therefore, dimen-
sionality reduction as a preprocessing step can provide more expressive features for a learning
algorithm; see Section 1.5. While the main object of interest is the representation/projection of
the high-dimensional data into the low-dimensional space, we are sometimes also interested in
the representation system itself. This is, for example, the case if we also want to deal with new
data, e.g., when projecting new test data.
A difference in methodology in this chapter compared to the previous chapters is that a
straightforward evaluation criterion for a given solution is usually not directly available. While
a good solution for a regression problem has to provide a good approximation to the underlying
data measure µ, there is often no single best answer in dimensionality reduction. Two represen-
tations could capture or highlight different aspects of the data and might be useful for different
applications. One could say that the criterion of “preserving information” or “reducing redun-
dancy” leaves more space for different objective functions to minimize and also depends on the
task at hand.
This chapter is structured as follows. Section 4.1 introduces the principal component analysis
(PCA), which is the most popular and most commonly used dimensionality reduction method,
even today. In Section 4.2 we discuss the connection between the PCA and the SVD, which we
already studied in Section 2.1. Subsequently, we give interpretations of the PCA as an algorithm
working on distances between data points in Section 4.3 and as a statistical method to preserve
the data variance as well as possible in Section 4.4. Finally, Section 4.5 presents tasks on a
pedestrian detection problem tackled with the PCA.
Here, the minimization is done not only with respect to the degrees of freedom of f ,
but also over the low-dimensional variables η i . One of the major benefits of PCA
is that the solution can be quite easily computed by an eigendecomposition of the
covariance matrix of the data. More detailed discussions of this algorithm can be
found in [Hot33, Jol02, LV07].
Note that there are different derivations of the PCA approach, just as there are different names
for it. As briefly mentioned, we consider an affine linear model
f ∈ M^{q,d}_{Lin} := { g : R^q → R^d | g(η) = ν + V_q η with ν ∈ R^d, V_q ∈ S_q(R^d) },    (4.1)
Figure 4.1: Highly correlated two-dimensional data. The blue arrow indicates the direction of
the one-dimensional linear subspace along which the data vary the most.
where
S_q(R^d) := { V ∈ R^{d×q} | V^T V = I }
is the so-called Stiefel manifold of d × q matrices with orthonormal columns. Note that the class M^{q,d}_{Lin} is the vector-valued analogue to the class M_Lin, which we encountered in the LLS setting; see Section 2.1. We want to fit f to the data x_1, . . . , x_n ∈ R^d, i.e., we need to determine ν ∈ R^d, V_q ∈ R^{d×q}, and η_i ∈ R^q for i = 1, . . . , n such that
f (η i ) = ν + V q η i ≈ xi ∀i = 1, . . . , n. (4.2)
As in the LLS setting, we use the least squares loss, but this time for vector-valued functions,
i.e.,
( f, (η_i)_{i=1}^n ) := arg min_{g ∈ M^{q,d}_{Lin}, η_i ∈ R^q} (1/n) Σ_{i=1}^n ∥ x_i − g(η_i) ∥²₂.    (4.3)
Moreover, in contrast to the supervised learning setting, where the η i would be known values,
we have to determine them here. This is typical for an unsupervised problem; see Section 1.2.
Note furthermore that the model class M^{q,d}_{Lin} is no longer convex. So we cannot expect to obtain
a global minimizer by looking for critical points. We will now study how we can still obtain
a solution to (4.3). To this end, we will take two steps: First, we minimize the right-hand side
of (4.3) with respect to ν and η i for given V q . When V q is fixed, the remaining optimization
problem is convex again. Therefore, we only need to compute the corresponding gradients of the
least squares function and set them to zero, i.e.,
(i) Σ_{i=1}^n (ν − x_i + V_q η_i) = 0,
(ii) V_q^T (ν − x_i + V_q η_i) = 0  ∀ i = 1, . . . , n.

Thus, the η_i = V_q^T (x_i − ν) are completely determined by (ii). A possibility to fulfill (i) is to set ν := m := (1/n) Σ_{i=1}^n x_i, i.e., ν is set to the mean m of the data points. In a second
step, we determine an appropriate matrix V q that, together with the above values for η i and ν,
solves (4.3). Plugging the values for ν and η i into the minimization problem (4.3), we are left
with
V_q = arg min_{V ∈ S_q(R^d)} (1/n) Σ_{i=1}^n ∥ x_i − m − V V^T (x_i − m) ∥²₂
    = arg min_{V ∈ S_q(R^d)} (1/n) Σ_{i=1}^n ∥ (I − V V^T)(x_i − m) ∥²₂    (PCAMin)
    = arg max_{V ∈ S_q(R^d)} (1/n) Σ_{i=1}^n ∥ V^T (x_i − m) ∥²₂.    (PCAMax)
Note that I − V q V qT is the orthogonal projector onto the orthogonal complement of span(V q ).
Thus, we are looking for the linear space spanned by the orthogonal columns of V q such that the
projection of the centered data xi − m onto this space has—on average—the largest Euclidean
norm possible. Let X ∈ Rn×d be the centered data matrix16 with rows xi − m for i = 1, . . . , n,
which can be computed by multiplying the so-called n × n centering matrix H = I − (1/n) 1_n with the non-centered data matrix with row-wise entries x_i, i = 1, . . . , n, i.e.,

X = H (x_1 · · · x_n)^T.    (4.4)
Here, 1n is the n × n matrix in which all entries are 1. Furthermore, let the columns of V q be v i
for i = 1, . . . , q. Then, we have
Σ_{i=1}^n ∥ V_q^T (x_i − m) ∥²₂ = Σ_{i=1}^n Σ_{j=1}^q ( v_j^T (x_i − m) )² = Σ_{j=1}^q v_j^T X^T X v_j = n Σ_{j=1}^q v_j^T C v_j.

Writing the eigendecomposition of C as C = Z Λ Z^T with an orthogonal matrix Z and Λ = diag(λ_1, . . . , λ_d), λ_1 ≥ · · · ≥ λ_d ≥ 0, the last sum equals n Σ_{j=1}^q ṽ_j^T Λ ṽ_j = n Σ_{i=1}^d λ_i Σ_{j=1}^q [ṽ_j]_i²,
where [ṽ_j]_i denotes the ith entry of ṽ_j and ṽ_j := Z^T v_j. Since Z and V_q are orthogonal matrices, we have Σ_{j=1}^q [ṽ_j]_i² ≤ 1 for all i = 1, . . . , d and Σ_{i=1}^d Σ_{j=1}^q [ṽ_j]_i² ≤ q. Thus, we obtain

Σ_{i=1}^d λ_i Σ_{j=1}^q [ṽ_j]_i² ≤ Σ_{i=1}^q λ_i,

which shows that ṽ_j = e_j maximizes Σ_{j=1}^q ṽ_j^T Λ ṽ_j. Therefore, the jth column of V_q is given by v_j = Z e_j, i.e., by the orthonormal unit eigenvector of C corresponding to the jth largest eigenvalue. Our results are summarized in the following theorem.
The minimization problem (4.3) is solved for ν = m = (1/n) Σ_{i=1}^n x_i and η_i = V_q^T (x_i − m), where the columns of V_q are the orthogonal unit eigenvectors of C := (1/n) X^T X corresponding to the q largest eigenvalues.
The columns of V q are called the q first principal axes of the data X, whereas the projec-
tions η i are called principal components. While the above theorem shows us how to obtain the
principal components, it can be problematic to compute the eigendecomposition of (1/n) X^T X if the dimension d of the data is quite large. Furthermore, the costs of computing X^T X are already O(d² n).
ò Numerical eigensolvers
To find possible values for λ ∈ R and x ∈ Rd in an eigenproblem of type
Ax = λx
for a square, diagonalizable matrix A ∈ Rd×d , there exist several numerical solvers,
e.g., power methods, Rayleigh quotient iterations, pivoted Cholesky decompositions,
or the Arnoldi/Lanczos algorithms. Note that, in contrast to computing the full eigen-
decomposition of A, these solvers can be employed in such a way that only the largest
q eigenvalues and the corresponding eigenvectors are obtained with fewer computa-
tional costs. Due to the fact that computing these eigenpairs usually needs at least
O(d2 q) floating point iterations, the computation gets very costly for large dimension
d. Also, many eigendecomposition algorithms can become instable. For more details
on numerical eigensolvers, their implementation, and computational costs, we refer
the reader to [GVL13, HPS12, PTVF07].
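For example, SCIPY's Lanczos-based routine scipy.sparse.linalg.eigsh can be asked for only the q largest eigenpairs of a symmetric matrix. A minimal sketch, assuming C is the symmetric matrix (1/n) X^T X and q < d is the target dimension:

import numpy as np
from scipy.sparse.linalg import eigsh

# q largest-magnitude eigenvalues/eigenvectors of the symmetric matrix C
vals, vecs = eigsh(C, k=q, which="LM")
# sort them in descending order of the eigenvalues
order = np.argsort(vals)[::-1]
vals, vecs = vals[order], vecs[:, order]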
n · C = X^T X = W D^T U^T U D W^T = W diag(σ_1², . . . , σ_d²) W^T

A := U_q D_q W_q^T

is the solution to the best-rank-q-approximation problem (4.5) and we have for the error

∥ U_q D_q W_q^T − X ∥²_F = Σ_{i=q+1}^{min(n,d)} σ_i².
where [ṽ j ]i denotes the ith entry of the (orthogonal) unit eigenvector ṽ j , which corresponds
to the jth largest eigenvalue of G. Note that this only holds for one-dimensional eigenspaces.
Otherwise we would also have to deal with possible rearrangements. Moreover, we observe that
the left singular vectors U of X are the eigenvectors of the Gramian matrix G.
High-dimensional case Sometimes we have to deal with situations where d ≫ n. For in-
stance, in gene expression analysis d ≈ 10⁹ and often n ≈ 10³. In these cases, we cannot directly compute the eigendecomposition of C = (1/n) X^T X when dealing with the eigenproblem
Cv = λv,
nor can we easily store the matrix X and compute the SVD, unless we exploit potential sparsity
properties of X. However, as already discussed, there are three ways to compute the principal
components: an SVD of the n × d matrix X, an eigenvalue decomposition of the d × d matrix C,
and an eigenvalue decomposition of the n × n matrix G. Therefore, if d ≫ n and if direct access
to X is unavailable, e.g., due to storage constraints or inaccessible data points, the latter variant
is preferred since an SVD can no longer be computed and since the costs of O(n2 d) floating
point operations for the eigenvalue decomposition of G are cheaper than the costs of O(d2 n)
floating point operations for the eigenvalue decomposition of C.
The computation of the PCA solution using the Gramian matrix is also known as (classical)
multi-dimensional scaling (MDS) or Torgerson MDS. This approach was derived for applications
where only the Gramian matrix G of pairwise similarities or the matrix D of squared pairwise
Euclidean distances ∥xi − xj ∥22 is given. The connection between D and the Gramian matrix G
can be established by
∥xi − xj ∥22 = Gii + Gjj − 2Gij . (4.7)
Therefore, straightforward calculations show that the Gramian matrix G of the centered data
x_i − m, i = 1, . . . , n, can be computed by

G = −(1/2) H D H
when given D; see also [CC00, LV07]. Here, H = I − (1/n) 1_n is the centering matrix introduced
earlier. In summary, we do not need to have access to the original data X if we have access
to either D or G since we can work only with the Gramian matrix to compute the PCA of the
data set.
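A compact sketch of this Gramian-based computation, assuming D is the n × n matrix of squared pairwise Euclidean distances and q is the target dimension (the function name is ours):

import numpy as np

def classical_mds(D, q):
    # D: squared pairwise distances, shape (n, n)
    n = D.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    G = -0.5 * H @ D @ H                       # centered Gramian matrix
    vals, vecs = np.linalg.eigh(G)             # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:q]           # indices of the q largest eigenvalues
    lam = np.clip(vals[idx], 0.0, None)        # guard against tiny negative values
    return vecs[:, idx] * np.sqrt(lam)         # rows are the representatives eta_i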
The underlying optimization problem of MDS One can show that the MDS solution and
thus the PCA solution solves the optimization problem
min_{η_i ∈ R^q, i=1,...,n} Σ_{i,j=1}^n ( ⟨x_i, x_j⟩₂ − ⟨η_i, η_j⟩₂ )²,    (4.8)

and, equivalently,

min_{η_i ∈ R^q, i=1,...,n} Σ_{i,j=1}^n ( ∥x_i − x_j∥²₂ − ∥η_i − η_j∥²₂ )²,    (4.9)
i.e., the low-dimensional representatives η i of the data xi , i = 1, . . . , n, are such that they
(on average) preserve pairwise inner products, i.e., pairwise similarities, and pairwise squared
distances best. For a proof of this property and more details on MDS, we refer the reader to
[CC00, LV07, YH38].
is defined as E[xT x], and we assume that it has rank larger than or equal to q. Now we are
asking for directions v 1 , . . . , v q ∈ Rd along which x varies the most, i.e., we first maximize the
variance
v_1 := arg max_{∥v∥₂ = 1} Var[v^T x].
One can show that the resulting solution is given by the q eigenvectors of E[xT x] belonging to
the largest q eigenvalues; see [Jol02]. In fact, the variances are exactly the eigenvalues, i.e.,
Var[v_i^T x] = λ_i  ∀ i = 1, . . . , q.
X Task 4.1. Implement a PCA routine whose inputs are {x1 , . . . , xn } and q. Either
use a N UM P Y /S CI P Y routine to compute eigenvectors directly or use the SVD. In
the case of eigenvectors, make sure your routine does actually return an orthonormal
basis (consult the documentation of your solver).
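One possible solution sketch, computing the principal axes via the SVD of the centered data matrix (the function name is ours):

import numpy as np

def pca(X, q):
    # X: (n, d) data matrix with one sample per row
    m = X.mean(axis=0)
    Xc = X - m                                       # centered data matrix
    U, S, Wt = np.linalg.svd(Xc, full_matrices=False)
    V_q = Wt[:q].T                                   # first q principal axes, shape (d, q)
    eta = Xc @ V_q                                   # principal components eta_i = V_q^T (x_i - m)
    return eta, V_q, m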
X Task 4.2. Test your PCA implementation on the provided toy data set.
(a) Plot a 3D slice of the 4D toy data. Compute the PCA representation for q = 2
and plot it as described in the notebook. Check your result: The plot should
reveal a perfectly round and familiar shape.
(b) Map the 2D representation onto a distorted ellipse in 3D; the code for the trans-
formation is provided in the accompanying notebook. Perform a PCA of this
3D data for q = 2 and plot the result. Repeat this for a few ellipses and describe
how the PCA picks the coordinate system V q .
Now let us revisit the Iris data set from Task 2.7 to check how PCA can be utilized there.
X Task 4.3. Apply the PCA to the Iris data set.
(a) Compute all (four) singular values of X using a suitable function, e.g.,
scipy.linalg.svdvals. Compute the captured variance percentage when
using only the first principal component and compute the captured variance
when using the first two principal components.
(b) Compute the PCA transformation onto the first two principal components of
the Iris data set. Plot the transformed data in a 2D scatter plot such that the
three labels are distinguishable. Do the same for 1D and use the provided
function to plot it. What do you observe?
(c) Use the insight from the visualization in (b) to build classifiers for the whole
Iris data set. To this end, run three different experiments using a 4D, 2D, and
1D PCA as starting point. Then, apply two linear SVMs to classify a data point
as one of the three Iris labels. You can copy your SVM code from Chapter 3,
but we recommend using S CIKIT-L EARN.
With a better understanding of PCA we now turn our attention to the analysis of a larger data set.
Figure 4.2: Ten pedestrian images and ten garbage (non-pedestrian) images from the TUD-
Brussels data set [WWS09].
The second step is known as a segmentation problem in computer vision, which can also be
interpreted as an unsupervised learning problem. It is often treated by thresholding methods,
clustering approaches, or more sophisticated graph-based algorithms; see [ZA15] for further de-
tails. In the following, we will focus on the last task in more detail, i.e., on deciding whether or
not a picture shows a pedestrian. Note that handling such a problem with a PCA is a rather “clas-
sical” approach in machine learning. Nowadays, these kinds of tasks are usually tackled with
deep neural networks instead; see, e.g., [RDGF16, TMB+ 16]. For further details on pedestrian
classification, we refer the reader to the reviews [BOHS15, DWSP12].
Our data set consists of labeled pictures of 100 × 50 pixels; see Figure 4.2. It is a downscaled
subset of the TUD data set [WWS09]. Half of the pictures show a pedestrian; the other half does
not. A separation into training and test data is provided. We begin by preparing the data.
X Task 4.4. Prepare the pedestrian data set.
(a) Load the training and test images into N UM P Y arrays with the help of the
routines in the template notebook. Normalize the pixel values to [0, 1].
(b) Write a routine plot_im to plot an image using M AT P LOT L IB’s imshow (to
get consistent contrast provide constant values for its arguments vmin and
vmax). Create a plot with ten randomly chosen training images showing a
pedestrian and ten randomly chosen training images not showing a pedestrian.
You can use the subplot method17 from M AT P LOT L IB.
Our training data consists of n = 2000 points (images), with dimension d = 15000 (the pixels
of an image for three colors). The dimension is quite large compared to our previous data sets.
The goal is to classify the images (pedestrian or non-pedestrian) using the color values of the
pixels.
After loading the data, we compute a PCA to reduce d to a much smaller number in the next
step. Note that the coordinate axes calculated by the PCA can be interpreted as images, and we
represent the data in terms of coefficients for the corresponding eigenvectors, which are called
eigenpedestrians; see also [SK87] for the general idea of using PCA for image analysis.
17 You should implement an optional argument ax for plot_im, so you can use it for the subplot; see plt.gca().
X Task 4.5. Take a look at the eigenpedestrians. From now on use the PCA implemen-
tation of S CIKIT-L EARN.
(a) Compute the PCA with q = 200 for the full training set (i.e., with pedestrian
and non-pedestrian samples combined).
(b) Plot the first 20 eigenpedestrians, as well as eigenpedestrians 50 to 60 and
eigenpedestrians 100 to 110. What do you observe?
X Task 4.6. Train a linear SVM (use LinearSVC from S CIKIT-L EARN) employing the
PCA representation of the full training data set for values of the dimension q between
10 and 200 in steps of five. For each q compute and store the prediction accuracy (use
the score method of LinearSVC) on the training and test data set. Plot the scores
for q. Which q seems to be the best choice? Compare the situation to Task 4.3 (c)
with respect to q.
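A possible skeleton for this experiment; X_train, y_train, X_test, and y_test are assumed to hold the flattened image data and labels prepared above.

from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

scores = {}
for q in range(10, 201, 5):
    pca = PCA(n_components=q).fit(X_train)
    Z_train, Z_test = pca.transform(X_train), pca.transform(X_test)
    clf = LinearSVC().fit(Z_train, y_train)
    # store training and test accuracy for this choice of q
    scores[q] = (clf.score(Z_train, y_train), clf.score(Z_test, y_test))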
The achieved prediction results are not bad. However, for safety-critical applications, e.g., in self-
driving cars, they are not yet acceptable. Furthermore, the blind trust in error measures like the
accuracy can be fatal in such tasks. Here, gaining insight on how the underlying ML algorithm
came to its decision for certain input data is crucial. To this end, means of interpretability can be
applied; see Section 10.5.
ò Image gradients
A very common tool in computer vision is image gradients. Starting from the differ-
ence quotient
lim_{ε→0} ( f(x + ε) − f(x) ) / ε,

we can define a derivative for images by using a discrete approximation. For a single color channel, i.e., an image matrix I ∈ R^{h×w}, we define the image gradient at position (y, x) by the partial derivatives of I using a centered difference quotient

∇_{y,x} I := ( (I_{y+1,x} − I_{y−1,x}) / s , (I_{y,x+1} − I_{y,x−1}) / s )^T    (4.11)
with s = 2. Whenever there is overlap with the boundary, we set the corresponding
pixel value to zero. Often the relative size of the derivatives is important instead of
its scale. Therefore, the scaling factor s in the denominator can be considered to be
flexible. A popular choice is to take a denominator of s = 1 instead, which we will
also use in the following.
Often, one can represent the discrete derivative calculation by a convolution—whose continuous
counterpart might be known from integration theory.
ò Convolutions
The convolution f ∗ g of two real-valued functions f, g : R^d → R is defined by

(f ∗ g)(x) := ∫_{R^d} f(τ) g(x − τ) dτ.

If f and g are only given at discrete values Z^d, the integral is substituted by a sum

(f ∗ g)(z) = Σ_{i ∈ Z^d} f(i) g(z − i).
If f, g are only defined on a finite domain, the above definitions can still be applied
by extending the functions to Rd or Zd , respectively, and assigning 0 as their function
values outside the finite domain.
with shifts18 kx and ky . The result is a new color channel whose entries are computed using
weighted sums of the surroundings of each pixel of I. An example of a popular filter is Gaussian
smoothing. Here the entries of K are computed using a Gaussian kernel.
Either the new color channel is smaller or one has to specify how the missing pixel values
outside of the border of I are extrapolated. In our case we are extending I by the constant value
0. The y- and x-derivatives can be computed using scipy.ndimage.convolve with the filters
[−1, 0, 1]T and [1, 0, −1], which correspond to (4.11) with scaling factor s = 1.
18 In S CI P Y these shifts are chosen to center the filter by default.
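The centered differences from (4.11) with scaling factor s = 1 can then be computed per color channel as sketched below; the zero extension at the boundary corresponds to mode='constant' (the function name is ours).

import numpy as np
from scipy.ndimage import convolve

def image_gradient(I):
    # I: single color channel, shape (h, w)
    ky = np.array([[-1.0], [0.0], [1.0]])   # filter for the y-derivative
    kx = np.array([[1.0, 0.0, -1.0]])       # filter for the x-derivative
    dy = convolve(I, ky, mode="constant", cval=0.0)
    dx = convolve(I, kx, mode="constant", cval=0.0)
    return dy, dx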
Binning Subsequently, each (y, x) position is assigned (binned) to a square cell of size |c| ∈
N. The cells span over |c| pixels19 in both directions and form a regular grid. Similarly, the
orientations are binned into #b ∈ N equally sized intervals which partition either the full circle
α := 2π or only the half circle α := π, in which case one calls the orientations unsigned. The
corresponding intervals are
[ i α/#b , (i + 1) α/#b )  for i = 0, . . . , #b − 1.
Computing the histogram In a next step, the gradient norms are accumulated into a his-
togram20 H. Here, the histogram entry H(cy , cx , bi ) contains information about the gradient
norms whose pixel (y, x) resides in a cell indexed by (cy , cx ) and whose gradient orientation
resides in the interval indexed by bi .
Let us describe in more detail how the histogram entries are computed. For a gradient norm
∥∇^{(max)}_{y,x}∥₂ and a gradient direction ϕ ∈ [0, 2π) at position (y, x), we first determine both
the interval index bprec whose center is the largest among all centers which are smaller than or
equal to
ϕ̂ := ϕ mod α
and the succeeding/next21 interval index bsucc . Furthermore, we determine the preceding and
succeeding cell indices cy,prec , cy,succ , cx,prec , cx,succ in the y- and x-directions in the same fashion,
i.e., with respect to their center. Here, out of bounds cells are ignored. Then, the histogram
entries H(cy , cx , bi ) for all possible combinations of the above mentioned indices are updated
by adding a fraction of the gradient norm at position (y, x). This fraction is individually defined
for each H(cy , cx , bi ) by the coefficients of a convex combination whose terms can be found in
Algorithm 4.
Block building The last step is to combine the histogram entries of the cells into larger blocks,
which are then normalized and clipped according to some clip value T > 0. If |B| ∈ N is the
block size, then a block consists of |B| consecutive cells in the y-direction and |B| consecutive
cells in the x-direction. The blocks do overlap, but only full blocks are considered. Conse-
quently, |B| has to be chosen less than or equal to the minimum number of cells in the y- and
x-direction, respectively. The final HOG feature vector then consists of the entries of all blocks
(in any order).
19 Potential overlaps with the boundary of the image are ignored.
20 While the method was known earlier, the term histogram (historical diagram) was introduced by Pearson, the inventor
of PCA [Pea01].
21 Note that we use a wrap-around when we are at the final index, i.e., the next index would be 0 again.
Figure 4.3: Three random images created from the TUD-Brussels data set [WWS09] together
with their computed HOG features (right of the respective image). In each (cy , cx )-cell of the
HOG histogram and for each orientation index bi we have drawn an arrow with direction cor-
responding to the orientation from bin bi . The brightness of each arrow is proportional to the
histogram entry H(cy , cx , bi ). To obtain a better visualization we omitted the block building and
normalizing steps at the end of Algorithm 4.
A visualization of computed HOG features can be found in Figure 4.3. We clearly see that
the HOG features are able to capture coarse shapes and, thus, can be helpful in the detection of
pedestrians. For more details on the actual computation of the HOG features, we refer the reader
to Algorithm 4 and the P YTHON code template provided for the following exercise.
X Task 4.7. Test the HOG features for the pedestrian data set. To this end, repeat the
experiment from Task 4.6 for values of q between 10 and 200 in steps of 5, but use
the HOG features as input for the PCA instead. You can find a function to compute
the HOG features in the template code.
For these variants, well-known iterative solvers exist; see, e.g., [BG05]. Examples for metric
MDS methods are Sammon’s mapping, which uses wij = 1/ dist(xi , xj ) as weights, Isomap,
which employs an approximation to the geodesic distance, and the so-called kernel PCA, where
a kernel is used in (4.8) instead of a scalar product and distances are defined by a generalized
version of (4.7); see also Chapter 5 and [BG05, LV07]. Note here that the name “kernel PCA” is
quite misleading since the approach relates more closely to classical metric MDS than to PCA.
Besides metric MDS, there also exist non-metric variants like the Kruskal–Shepard algorithm.
Here, the corresponding optimization problem (4.8) is altered in such a way that ordinal informa-
tion, i.e., proximity ranks, is used for determining the low-dimensional representation instead of
pairwise scalar products. For more details on the above-mentioned variants, we refer the reader
to [BG05, LV07].
Chapter 5
Nonlinear Dimensionality Reduction
In the last chapter we introduced PCA as the essential linear dimensionality reduction method. It
is based on the restrictive assumption that the data resides in an affine linear subspace. In contrast
to that, nonlinear dimensionality reduction methods consider general (curved) manifolds or aim
to preserve aspects of the topology instead; see, e.g., [LV07].
Let us assume that xi , i = 1, . . . , n, are given i.i.d. measurements from two random variables
A1 and A2 . Figure 5.1 shows an example, where the noisy data approximately stem from a one-
dimensional curved submanifold of R2 , i.e., it can be described by a nonlinear coordinate system.
We will now extend the idea of linear dimensionality reduction to the nonlinear case. To this end,
recall that we have briefly mentioned variants of multi-dimensional scaling algorithms at the end
of Chapter 4, where the goal is to preserve pairwise distances between points as well as possible.
In the following, we first introduce an MDS variant based on geodesic distances called Isomap
in Section 5.1. Subsequently, we introduce diffusion maps in Section 5.2, where we consider
preservation of so-called diffusion distances on the intrinsic (nonlinear) submanifold of Rd the
data are living on. Section 5.3 then deals with clustering algorithms to segment a data set into
different subsets. Combining clustering algorithms with dimensionality reduction, we obtain
a method to visualize a high-dimensional data set in Section 5.4. Finally, tasks on nonlinear
dimensionality reduction can be found in Section 5.5.
Figure 5.1: An example for a two-dimensional data set of noisy samples from an underlying
curved, nonlinear structure.
5.1 Isomap
A key idea of many nonlinear dimensionality reduction approaches is to consider data that stem
from (curved) manifolds M instead of just linear subspaces, i.e., the model space contains non-
linear functions instead of only (affine) linear ones. In particular, Riemannian manifolds M and
the corresponding inner product ⟨·, ·⟩M on the tangent space are of interest. On such a manifold
M, the length of the shortest curve connecting two points x, y ∈ M is the geodesic distance,
which we denote by distM (x, y); see [dC92] for definitions and more details on Riemannian
manifolds and geodesic distances.
Isomap
The motivation behind the Isomap approach [TdL00] is to preserve the pairwise ge-
odesic distances in the lower-dimensional representation. In Isomap the geodesic
distances are approximated by graph distances that are computed on a graph that
represents neighborhood relations of the data set. Using the graph distances, a corre-
sponding centered Gramian matrix is obtained, whose eigendecomposition then gives
the lower-dimensional representation of the data, analogously to MDS.
Approximation of geodesic distances Since we do not have access to the geodesic dis-
tances between data points directly, they have to be estimated in order to compute the Isomap
solution. To this end, we construct a neighborhood graph G on the data set X := {xi }i=1,...,n .
ò Neighborhood graph
A graph G := (V, E) with a so-called vertex set V = {v_1, . . . , v_{|V|}} and a so-called
edge set E ⊆ V × V is often represented by its adjacency matrix, i.e., a matrix A of
size |V | × |V |, where |V | is the number of nodes in the graph. The adjacency matrix
has a non-zero entry at position (i, j) if and only if the ith vertex is connected to the
jth vertex by an edge in the graph, i.e.,
A_ij = w_ij  if (i, j) ∈ E,  and  A_ij = 0  else,
where wij is a suitable non-zero edge weight, which reflects the importance of the
connection between vertex i and vertex j.
An (undirected) neighborhood graph is a graph on a given data set V = X , where
an edge between two data points xi and xj exists if and only if these points are
close to each other. Two famous examples of neighborhood graphs are the k-nearest
neighbors graph, where each data point is connected to its k-nearest neighbors, and
the r-ball neighborhood graph, where an edge between xi and xj exists if and only
if ∥xi − xj ∥2 ≤ r for all i, j = 1, . . . , n. In both cases, the edge weights are defined
as the Euclidean distance22 wij := ∥xi − xj ∥2 .
Given a neighborhood graph G on the data set X , we define the graph distance distG : X × X →
[0, ∞] between two points xi , xj ∈ X according to the shortest path between these points, i.e.,
22 Note that distance measures other than the Euclidean distance could also be employed here.
1. If there is no path between x_i and x_j in G, set dist_G(x_i, x_j) := ∞;
2. else let

I_ij := { (i_k)_{k=1}^s | s ∈ N, i_1 = i, i_s = j, and (x_{i_k}, x_{i_{k+1}}) ∈ E ∀ k = 1, . . . , s − 1 }

be the set of index sequences of all possible finite paths of any length, which we denote by s, between x_i and x_j, and let

dist_G(x_i, x_j) := inf_{(i_k)_{k=1}^s ∈ I_ij} ∑_{k=1}^{s−1} w_{i_k i_{k+1}}.
Computing the Isomap solution Given the graph distances, we can formulate the Isomap
approach for reducing the data dimensionality23 from d to q < d. It is known to work well when
the underlying manifold M is isometric to a convex domain; see [ACJP20]. We assume X ⊂
M ⊆ Rd , i.e., the data stem from the manifold M, which is a submanifold of Rd . Furthermore,
if X is sampled densely enough from M, i.e., if the Euclidean distance between local data points
is close to the geodesic distance, and if G only connects close neighbors to each other, we expect
that
distG (xi , xj ) ≈ distM (xi , xj ).
This means that the graph distance matrix D G with entries dist2G (xi , xj ) for i, j = 1, . . . , n
is a good approximation to the geodesic distance matrix D M with entries dist2M (xi , xj ) for
i, j = 1, . . . , n. For a formal proof under corresponding assumptions on the data and on the
manifold, see [ACJP20, TdL00].
Our goal is now to find low-dimensional representations {η_i}_{i=1}^n ⊂ R^q of the data points {x_i}_{i=1}^n ⊂ M ⊆ R^d such that their Euclidean distance matrix D_η with entries ∥η_i − η_j∥_2^2 for i, j = 1, . . . , n fulfills
DG ≈ Dη .
This can be done by computing the solution to the Torgerson MDS problem (4.9) with dist_G^2(x_i, x_j) instead of ∥x_i − x_j∥_2^2. To this end, we first center D_G to obtain G = −(1/2) H D_G H
using the centering matrix from (4.4). Then, we compute the eigenvalue decomposition of G to
obtain the low-dimensional vectors {η i }ni=1 , which are the dimension-reduced representatives
of the original data X . We refer the reader to Algorithm 5 for details and to Section 4.3 for the
analogy to MDS.
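The following Python sketch illustrates these steps, assuming a k-nearest neighbors graph; the function name, its parameters, and the use of scikit-learn and SciPy routines are illustrative choices and do not reproduce Algorithm 5 verbatim.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph
from scipy.sparse.csgraph import shortest_path

def isomap(X, n_neighbors=10, q=2):
    # Neighborhood graph with Euclidean edge weights w_ij = ||x_i - x_j||_2.
    A = kneighbors_graph(X, n_neighbors, mode="distance")
    # Graph distances dist_G via shortest paths (infinite if G is disconnected).
    D = shortest_path(A, method="D", directed=False)
    n = X.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix from (4.4)
    G = -0.5 * H @ (D ** 2) @ H                  # centered Gramian
    evals, evecs = np.linalg.eigh(G)             # eigenvalues in ascending order
    idx = np.argsort(evals)[::-1][:q]            # q largest eigenvalues
    return evecs[:, idx] * np.sqrt(np.maximum(evals[idx], 0.0))
```

Note that a disconnected neighborhood graph produces infinite shortest-path entries, which have to be handled (for example by increasing the number of neighbors) before the centering step.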
Diffusion maps
Diffusion maps is a nonlinear dimensionality reduction method introduced by Coif-
man and Lafon; see [CL06]. Here, the value K(xi , xj ) of a nonlinear kernel K in
two data points xi and xj is used to represent the similarity between them. The
underlying idea of diffusion maps is to construct a random walk Markov chain on
the data set by using the kernel evaluations to build a transition probability matrix
P ∈ Rn×n on the data set. Then, the low-dimensional data representations can be
constructed from an eigendecomposition of P .
for arbitrary m ∈ N and i1 , . . . , im ∈ {1, . . . , n}. This can be interpreted such that
the probability of the variable Xi being in a certain state only depends on the state of
Xi−1 and not on the complete history of the process. If the distribution is stationary,
i.e., if P[X_m = x_i | X_{m−1} = x_j] is the same regardless of m ∈ N, the Markov
chain Y is called the path of a random walk on the data. The analogy becomes clearer
when the data points are considered to be vertices of a fully connected, directed
graph, where the edge weight from xi to xj is given by P [X2 = xj | X1 = xi ].
Then, following the path given by an instance of Y , we obtain a random walk on the
graph. For more details, see [Geo12].
Defining the transition probabilities To construct the Markov chain, let each data point
represent a vertex in a fully connected graph. The edge weights are now built according to
the values of K, which serves as a similarity measure between two data points. To construct
the transition probabilities of the Markov chain, we first normalize the kernel. To this end, we
choose a parameter α ∈ [0, 1] and define
K^{(α)}(x, y) := K(x, y) / ( (∑_{z∈X} K(x, z))^α · (∑_{z∈X} K(y, z))^α ).    (5.1)
With the help of the rescaled kernel K (α) we can define the transition probability from x ∈ X
to y ∈ X by
P[y | x] := P(x, y) := K^{(α)}(x, y) / ∑_{z∈X} K^{(α)}(x, z).    (5.2)
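As a minimal sketch, the following Python function builds P from a Gaussian kernel according to (5.1) and (5.2). The kernel convention K(x, y) = exp(−∥x − y∥^2/σ^2) is an assumption here and may have to be adapted to the exact definition used in Algorithm 6.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def transition_matrix(X, sigma=1.0, alpha=1.0):
    # Gaussian kernel matrix; the bandwidth convention is an assumption.
    D2 = squareform(pdist(X, "sqeuclidean"))
    K = np.exp(-D2 / sigma**2)
    # Density normalization (5.1).
    d = K.sum(axis=1)
    K_alpha = K / np.outer(d**alpha, d**alpha)
    # Row normalization (5.2) yields the transition probabilities.
    return K_alpha / K_alpha.sum(axis=1, keepdims=True)
```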
The choice of α Depending on the choice of α, the transition probabilities will change sig-
nificantly. One can show that the value of α corresponds to a specific type of flow field on the
submanifold the data lie on; see [CL06]. In particular, α = 1 leads to a finite-sample approxima-
tion of a so-called Brownian motion random process. This resembles the dynamics of a flow field
following the Laplace–Beltrami operator, i.e., the Laplace operator on the submanifold. When
choosing α = 0, the flow field follows the so-called normalized graph Laplacian for a Gaussian
kernel K. Finally, values of α between 0 and 1 introduce a drift term in addition to the Brownian
motion. When choosing α = 12 , for example, we obtain the Fokker–Planck dynamics of the
random walk. For more details, we refer the reader to [CL06].
We will choose α = 1 later on since this choice is optimally suited to get rid of effects of the
data density on the constructed random walk and just recovers the geometry of the submanifold.
For detailed information on the convergence properties of the discrete data manifold towards the
underlying submanifold of Rd , we refer the reader to [GTGHS20].
Diffusion distances If we think about the matrix P ∈ Rn×n with entries P i,j = P (xi , xj )
as the transition matrix of the Markov process, we obtain the t-step transition matrix by comput-
ing the power P t for some t ∈ N. This allows us to define the so-called diffusion distance.
Diffusion distance
The (squared) diffusion distance at time t ∈ N is defined by
DiffDist_t^2(x_i, x_j) := ∑_{k=1}^n ( [P^t]_{i,k} − [P^t]_{j,k} )^2 / π(x_k),    (5.3)
which resembles how far two points xi and xj on the “data manifold” are away from
each other in terms of reachability within 2t steps by the random walk defined by
P ; see [CL06]. Here, π is the stationary distribution of the random walk on the data
set X . In particular, π ∈ Rn is defined by π i = π(xi ) for i = 1, . . . , n and fulfills
πP = π.
Figure 5.2: Random walks on a dumbbell-shaped data set (blue points). The diffusion distance
between A and C is large since only a few random paths (gray) of length 2t = 10 exist between
them. However, the diffusion distance between A and B is small because there exist many paths
(violet) of length 2t = 10 between them. Note that only some exemplary paths are plotted here
for reasons of clarity.
The diffusion distance provides a measure of distance on the data manifold. In Figure 5.2, for
instance, the diffusion distance between the data points A and B is significantly smaller than the
diffusion distance between A and C due to their different reachability in the data graph. Another
interpretation is that DiffDist2t (xi , xj ) is a weighted L2 distance between the corresponding two
probability distributions ( [P^t]_{i,k} )_{k=1,...,n} and ( [P^t]_{j,k} )_{k=1,...,n} over the data set. The difference
between the distributions is small if their main probability masses after time t are in similar
regions of the data set, which means that the corresponding components in (5.3) cancel each
other. This reflects that the two points xi and xj have a large probability of being connected
with a random path of length 2t. In contrast to that, there is no cancellation effect in (5.3) if the
main masses of the distributions lie in different regions of the data set.
By multiplying (5.3) out we can further see that
DiffDist_t^2(x_i, x_j) = ∑_{k=1}^n ( [P^t]_{i,k}^2 − 2 [P^t]_{i,k} [P^t]_{j,k} + [P^t]_{j,k}^2 ) / π(x_k)

= ∑_{k=1}^n [P^t]_{i,k}^2 / π(x_k) − 2 ∑_{k=1}^n [P^t]_{i,k} [P^t]_{j,k} / π(x_k) + ∑_{k=1}^n [P^t]_{j,k}^2 / π(x_k)

= ⟨x_i, x_i⟩_{p_t,π} − 2 ⟨x_i, x_j⟩_{p_t,π} + ⟨x_j, x_j⟩_{p_t,π}

with

⟨x_i, x_j⟩_{p_t,π} := ∑_{k=1}^n [P^t]_{i,k} [P^t]_{j,k} / π(x_k),
which is a weighted scalar product of a feature map that involves the probabilities of going from a
point xi (or xj , respectively) to any other node in t steps. Therefore, the scalar product involving
the feature map defines a kernel on the data set. Note that the connection between distances and
scalar products above is analogous to (4.7).
Diffusion maps To build now the so-called diffusion map, we exploit an important property of
the eigendecomposition of P : In fact, it can be shown that P admits a sequence of eigenvalues
1 = λ0 > |λ1 | ≥ · · · ≥ |λn−1 | and corresponding orthonormal eigenvectors ψi , i = 0, . . . , n−1,
i.e., it holds that
P ψi = λi ψi ∀ i = 0, . . . , n − 1.
Moreover, [ψ0 ]j = c for some c ∈ R, for all j = 1, . . . , n, i.e., ψ0 is constant over X . With the
help of these eigenvectors, we can write the diffusion distance as
DiffDist_t^2(x_i, x_j) = ∑_{k=1}^{n−1} λ_k^{2t} ( [ψ_k]_i − [ψ_k]_j )^2.
Theorem 5.2.1 (Coifman, Lafon 2006). For an arbitrary, fixed t ∈ N, let the diffusion map Ψ_t : X → R^{n−1} be defined by

Ψ_t(x_i) := ( λ_1^t [ψ_1]_i, . . . , λ_{n−1}^t [ψ_{n−1}]_i )^T.

Then the Euclidean distances of the embedded points reproduce the diffusion distances, i.e., ∥Ψ_t(x_i) − Ψ_t(x_j)∥_2 = DiffDist_t(x_i, x_j) for all i, j = 1, . . . , n.
This theorem tells us that the diffusion map Ψt embeds the data into Rn−1 in a distance pre-
serving way if we take the diffusion distance as our underlying metric. This is not surprising
since it is always possible to embed n data points of arbitrary dimension into Rn−1 in a distance
preserving way. However, since the absolute value of the eigenvalues λi is non-increasing for
growing i, we can additionally truncate the vectors Ψt (·) after q < n − 1 entries if the remain-
ing i = q + 1, . . . , n − 1 values |λi | are small. In this way, we embed the data X into Rq
while still approximately preserving the diffusion distance. The corresponding low-dimensional
representation of xi is then given by the first q coordinates of Ψt (xi ), i.e.,
x_i ↦ ( [Ψ_t(x_i)]_1, . . . , [Ψ_t(x_i)]_q )^T = ( λ_1^t [ψ_1]_i, . . . , λ_q^t [ψ_q]_i )^T.
The complete diffusion maps approach is given in Algorithm 6. Here, diag(v) for v ∈ Rn
denotes a diagonal matrix with entries [diag(v)]_ii = v_i for i = 1, . . . , n.
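A minimal sketch of the eigendecomposition and truncation step could look as follows; it assumes that a row-stochastic transition matrix P has already been computed and is not meant to reproduce Algorithm 6 line by line.

```python
import numpy as np

def diffusion_map(P, q=2, t=1):
    # Eigendecomposition of the transition matrix: P psi_i = lambda_i psi_i.
    evals, evecs = np.linalg.eig(P)
    order = np.argsort(-np.abs(evals))           # sort by |lambda|, descending
    evals, evecs = evals[order].real, evecs[:, order].real
    # Skip the constant eigenvector psi_0 (lambda_0 = 1) and keep the next q
    # eigenvectors, scaled by lambda^t, as low-dimensional coordinates.
    return evecs[:, 1:q + 1] * (evals[1:q + 1] ** t)
```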
Note that we added in lines 4 and 5 of Algorithm 6 an option to set the diagonal of K (α) to
zero. This is a common technique in biological cell development analysis, which we will deal
with in the upcoming tasks. Here, one is interested in transition probabilities between different
data points only and not in the on-point potentials imposed by local densities; see [HBT15] for
more details. In fact, for recovering the structure of the data set based on the graph random
walk and relations between different data points, it can be disadvantageous to have large nonzero
entries on the diagonal of the transition matrix. In this case, setting its diagonal to zero is known
to achieve better results.
5.3 Clustering
Besides dimensionality reduction methods like PCA and diffusion maps, we have already briefly
mentioned clustering as another unsupervised learning approach in Section 1.2. The idea behind
clustering is that we want to separate or segment the data set X into k ∈ N different subsets,
called clusters, C1 , . . . , Ck ⊂ X . The data points within each cluster should be more closely
related to each other than to data points in other clusters. Figure 5.3 illustrates the result of a clustering algorithm on a two-dimensional data set.

Figure 5.3: Application of clustering to a two-dimensional data set. The result is a partitioning of the data set into three clusters, indicated by the blue squares, the black circles, and the red triangles.
Note here that the quality of a clustering
result of any given method is usually hard to evaluate [HTF09, Mur22, XW08]. In particular,
every time clustering is performed on a data set we obtain a result, but it could be completely
meaningless, e.g., due to noise effects; see [JWHT21].
While clustering does not reduce the dimensionality of the data set, it often goes hand in hand
with dimensionality reduction algorithms when it comes to detecting and visualizing the most
important properties of the data. We will discuss this combination in Section 5.4 in more detail.
In the remainder of this section, we will briefly introduce two of the most important clustering
algorithms. First, we discuss k-means clustering, whose intrinsic similarity measure is based
solely on the metric/distance measure of the surrounding space. Subsequently, we have a closer
look at spectral clustering, which employs a data-dependent, intrinsic distance measure within
k-means.
Due to the lack of an underlying model class in clustering algorithms, we cannot easily catego-
rize a clustering method as being either linear or nonlinear. However, since k-means relies solely
on Euclidean distance computations and on the averaging of data points, it could be considered
a linear method. In contrast to this, spectral clustering is inherently based on the result of the
spectral decomposition of the transition probability matrix P of the neighborhood graph from
the previous section. Thus, it can be considered to be a nonlinear method.
k-means clustering
k-means is the most famous and straightforward way to cluster a data set. It relies on
searching k centroids c1 , . . . , ck ∈ Rd , which serve as centers for the corresponding
clusters C1 , . . . , Ck . A data point xi is then assigned to the cluster Cj for which the
Euclidean distance ∥xi − cj ∥2 is smaller than the distance to any other centroid. In
summary, the k-means algorithm aims to solve the minimization problem
min_{c_1,...,c_k ∈ R^d; C_1,...,C_k ⊂ X disjoint; ∪_{i=1}^k C_i = X}  ∑_{i=1}^k ∑_{x∈C_i} ∥x − c_i∥_2^2.
A famous approach to solve the k-means problem of minimizing the within-cluster sum of
squared errors approximately is Lloyd’s algorithm [Llo82]. It follows the heuristic of alternat-
ingly minimizing the within-cluster sum according to the centroids and according to the cluster
membership. To this end, cluster assignments are computed by determining the nearest centroid
for each data point in the first part of an iteration step of the algorithm. In the second part of an
iteration step the centroids are reassigned by computing the average of all data points within one
cluster. A detailed description can be found in Algorithm 7. Note that the employed simple ran-
dom initialization of the centroids can be improved by using certain heuristics; see, e.g., [VA06]
for the so-called k-means++ initialization. Furthermore, k-means can also be applied with non-
Euclidean distance measures such as the Procrustes distance; see [ADLS10], for example.
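A minimal NumPy sketch of Lloyd's algorithm with the simple random initialization could look like this; the function name and parameters are chosen freely for illustration.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=None):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random init
    for _ in range(n_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: centroids become the mean of their assigned points.
        new_centroids = np.array([X[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```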
While k-means is the most prominent clustering algorithm, its application can become prob-
lematic for high-dimensional data because the Euclidean distance might not be a good distance
quantity due to the concentration of measure effect; see Section 1.6. Here, spectral clustering
becomes a good alternative.
Spectral clustering
A spectral clustering algorithm is based on a graph representation of the data, e.g.,
a neighborhood graph. Subsequently, we define a similarity matrix K, e.g., via the
Gaussian kernel as in diffusion maps. From this we build either the normalized (ran-
dom walk) graph Laplacian matrix Lrw on the data graph (see Algorithm 8 for its
computation) or the transition matrix P from diffusion maps. After ordering the ei-
genvalues λ1 , . . . , λn−1 and eigenvectors ψ1 , . . . , ψn−1 of either I − Lrw or P we
determine the differences between consecutive eigenvalues and take a k < n with
relatively large24 λk−1 − λk to be the number of clusters. Then, the standard clus-
tering algorithm k-means is run on the entries of the eigenvectors ψ1 , . . . , ψk−1 ; see
Algorithm 9 for details.
1 Construct a graph G on the data set X , e.g., a k-nearest neighbor graph or a fully
connected graph.
2 Compute the weight matrix K, e.g., using the Gaussian kernel K, i.e.,
[K]_ij := { K(x_i, x_j)  if there is an edge between x_i and x_j in G,
            0            else }.
Note that, for a fully connected graph G in Algorithm 8 and for α = 0 and zeroDiag = False
in Algorithm 6, I − Lrw equals P . Thus, employing P for spectral clustering instead of Lrw ,
which is the classical variant, gives us more flexibility by means of using different values for
24 This is also known as the spectral gap.
α, which resemble different random walk dynamics as discussed before. Moreover, note that a
mathematical connection between graph theory and spectral clustering with Lrw can be drawn:
One can show that, for k = 2, the resulting segmentation is equivalent to a min-cut on the
underlying graph with weights K(xi , xj ) for all i, j = 1, . . . , n. The reason for this is that
the data are clustered according to the eigenvectors of a Markov chain transition matrix. More
details on spectral clustering and its mathematical and graph-theoretical background can be found
in [vL07].
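As a sketch of the variant based on P, one could proceed as follows once the transition matrix has been computed; the spectral-gap heuristic for choosing k is left to the caller, and the function name is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(P, k):
    # Eigendecomposition of the transition matrix P.
    evals, evecs = np.linalg.eig(P)
    order = np.argsort(-np.abs(evals))
    evecs = evecs[:, order].real
    # Row i of the eigenvectors psi_1, ..., psi_{k-1} represents data point x_i;
    # these rows are clustered with standard k-means.
    features = evecs[:, 1:k]
    return KMeans(n_clusters=k, n_init=10).fit_predict(features)
```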
Figure 5.4: Schematic overview of data visualization via dimensionality reduction and cluster-
ing. First, the four high-dimensional input vectors (upper left) are processed by dimensionality
reduction and with a clustering algorithm, respectively. Then, the resulting low-dimensional co-
ordinates (lower left) are plotted in a scatter plot (lower right), where the cluster labels (upper
right) are used for the color coding. Here, the first label is depicted by red, and the second label
is depicted by blue.
As we can see in Figure 5.5, the Isomap algorithm nicely captures the intrinsic two-dimensional
structure of the Swiss roll. However, for practical applications, in particular with noisy data,
diffusion maps often performs better than Isomap. Therefore, let us now consider the diffusion
maps algorithm.
X Task 5.2. Implement the diffusion maps method from Algorithm 6. To this end, you
can use the scipy.spatial.distance.pdist function to efficiently compute the
pairwise distances needed to evaluate the Gaussian kernel function.
Figure 5.5: The Swiss roll data set (left) and its 2D embedding after the application of Isomap
(right). The color coding serves to illustrate that Isomap nicely detects the intrinsic structure of
the data set.
To study the performance of diffusion maps, we will explore its behavior on biological single-
cell data. To this end, we start with the single-cell sequencing data set from [BNC+ 15] and
subsequently analyze the Guo data set from [GHT+ 10]. These data sets contain certain mea-
surements from genes of mouse embryonic stem cells at different developmental stages, which
we will discuss in more detail later. We will explore how data preprocessing and hyperparameter
choices influence the results of diffusion maps on this data set. Our overall goal is to detect the
transition between different developmental stages by visualizing clustering results of the data set.
For the following exercises, we will always use the zeroDiag = True option of Algorithm 6
since we are interested in the inter-cell transitions. Let us begin with the data set from [BNC+ 15].
It contains 182 data points, which are subdivided into three different groups. Each data point has
a dimension of 8989. While the data set has an underlying biological background, this is not
relevant for the following two tasks. We will use this data set just as a first example to show that
real-world data sets might contain nonlinear structures, which the PCA cannot capture well.
X Task 5.3. A routine to load the data set from [BNC+ 15] has already been imple-
mented in the template notebook we provide for this chapter. Now, reduce the di-
mension of the data to three by using diffusion maps. Use the Gaussian kernel with
parameter σ = 20 and use α = 1 in Algorithm 6. Visualize the result in a three-
dimensional scatter plot (q = 3), i.e., plot the (scaled) second, third, and fourth
eigenvectors of the transition matrix P from Algorithm 6 against each other. Do not
forget to color (or label) your resulting points in the plot according to the given labels.
X Task 5.4. Repeat Task 5.3 and employ the PCA and Isomap instead of diffusion
maps to reduce the data set dimension. You can use your own PCA implementation
from the last chapter or the one from S CIKIT-L EARN. Compare the PCA and Isomap
results with the ones achieved with diffusion maps in Task 5.3.
data are collected from different developmental time points and are then combined into a single
data set. For each cell, gene expression analysis is done by measuring the expression intensity
of several genes. However, the high number of genes measured for each cell often makes it diffi-
cult for biologists to detect cell differentiation progressions. Dimensionality reduction methods
can help to extract information by projecting the data into a lower-dimensional space. If this
so-called embedding space is two- or three-dimensional, the data can be visualized as described
in Section 5.4. Different cell groups in the data should then be recognizable as different clusters
in the embedding space.
The Guo data set In the following, we will apply diffusion maps to the Guo data [GHT+ 10].
First, let us study the structure of this data set in more detail. The measurements contained
in it are single-cell qPCR Ct-values for 48 genes of 442 mouse embryonic stem cells at seven
different developmental stages from zygote to blastocyst. The details of qPCR Ct-values will
be explained below, after the description of this data set. Starting from the one-cell stage, cells
transition smoothly either to the trophectoderm (TE) lineage or to the inner cell mass (ICM).
Subsequently, cells transition from the inner cell mass either to the primitive endoderm (PE) or
to the epiblast (EPI) lineage.
In Table 5.12, we can see an exemplary excerpt of the Guo data set. In the first row, the names
of the measured genes (ranging from Actb to Tspan8) are given. The naming annotation in the
first column refers to the embryonic stage, embryo number, and individual cell number; thus
64C 7.14 refers to the 14th cell harvested from the seventh embryo collected from the 64-cell
stage. In the following, we are only interested in the embryonic stage of the cells, which is given
by the first number (e.g., 64C).
Table 5.12: Table of the raw Guo data set. Each of the n = 442 rows contains the Ct-values
for a specific cell at a certain developmental stage. Each of the G = 48 columns contains the
Ct-values for a specific gene.
Ct-measurements To gain the actual data values presented in Table 5.12, a qPCR (real-time
quantitative polymerase chain reaction) was conducted, which consists of several cycles. At each
cycle, the amount of fluorescence is measured. A Ct-value (abbreviation for cycles-to-threshold)
is then defined as the number of cycles for which the fluorescence significantly exceeds the
background fluorescence, i.e., at which a clear fluorescence signal is first detected. Thus, a higher
Ct-value means a lower DNA or gene concentration. For the Guo data set, a total of 28 cycles
of qPCR were performed. All genes that would need more cycles to exceed the background
fluorescence were assigned the threshold value 28. For further details on the analysis of single-
cell development data, Ct-values, and qPCR data, we refer the reader to [BNC+ 15, GGF09,
GHT+ 10, Hea23].
5.5.2 Preprocessing
To ensure an accurate and meaningful analysis, data sets often require preprocessing techniques,
such as data cleaning, handling missing or uncertain values, and data normalization; see also
Section 2.5. In the following, we will learn how to preprocess the Guo data set. To this end, let
us denote the raw data set by
Xraw = {x1 , . . . , xn } ⊂ RG ,
where n = 442 is now the number of cells, i.e., the number of rows of Table 5.12, and G = 48
is the number of genes, i.e., the number of columns of Table 5.12. This means that [xi ]j is the
expression value of the jth gene of the ith cell. For the raw Guo data set, we have the following
information:
• Cells from the 1-cell stage embryos were treated differently in the experimental procedure.
• Genes that would need more than 28 cycles to exceed the background fluorescence were
assigned the threshold value 28, as mentioned above. However, there still exist a few
entries larger than 28 in the raw data set, which indicate undetectable data. Hence, these entries need to be deleted from the data set.
Cleaning the data Since cells from the 1-cell stage and cells with at least one entry larger
than the threshold value have to be excluded from the analysis, a proper cleaning of the data will
be our first step. The resulting cleaned data are given by
X = X_raw \ ( X_1C ∪ { x_i ∈ X_raw | max_{j=1,...,G} [x_i]_j > 28 } ),
where X1C denotes the set of cells from the 1-cell stage.
Normalizing the data Next, we need to normalize the data in order to obtain more accurate
results. A common strategy in biology is the normalization via reference genes. In our case,
we subtract for each cell the mean expression of the endogenous control genes Actb and Gapdh.
Moreover, we now also exclude entries which are identical to the threshold value 28, i.e.,
[x_i]_j ← [x_i]_j − (1/2) ( [x_i]_{g_Actb} + [x_i]_{g_Gapdh} )
for all i = 1, . . . , n and j = 1, . . . , G for which [xi ]j ̸= 28. Here, gActb and gGapdh denote the
indices of the genes Actb and Gapdh, respectively.
Rescaling the data Subsequently, we need to set the entries with threshold value 28 to a new
baseline. We define this baseline as the smallest integer greater than or equal to the maximum of
the normalized data set, i.e., ⌈ max_{i,j} { [x_i]_j | [x_i]_j ≠ 28 } ⌉.
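A minimal sketch of these three preprocessing steps is given below; the array layout, the stage labels, and the column indices of Actb and Gapdh are assumptions and have to be adapted to the loading routine of the template notebook.

```python
import numpy as np

def preprocess_guo(X_raw, stages, g_actb, g_gapdh, threshold=28.0):
    # Cleaning: drop 1-cell stage cells and cells with entries above the threshold.
    keep = (stages != "1C") & (X_raw.max(axis=1) <= threshold)
    X = X_raw[keep].copy()
    # Normalization: subtract the mean of the control genes Actb and Gapdh
    # for all entries that are not equal to the threshold value.
    detected = X != threshold
    ctrl = 0.5 * (X[:, g_actb] + X[:, g_gapdh])
    X = np.where(detected, X - ctrl[:, None], X)
    # Rescaling: set the former threshold entries to the new baseline.
    X[~detected] = np.ceil(X[detected].max())
    return np.round(X, 3)
```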
X Task 5.5. Preprocess the Guo data set as described above by cleaning, normalizing,
and rescaling it. Finally, round all entries to three digits.
Now we are ready to apply the diffusion maps algorithm to the resulting data set.
X Task 5.6. Perform a diffusion maps analysis of the preprocessed Guo data set for the
Gaussian kernel with σ = 10 and α = 1 in Algorithm 6 and visualize the embedding
in a two-dimensional scatter plot (q = 2). Interpret your result. Can you assign
the branches revealed in the plot to the lineages of the Guo data set described in
Section 5.5.1?
X Task 5.7. Perform a diffusion maps analysis of the Guo data set with the same pa-
rameters as in Task 5.6, but without full preprocessing (still remove the cells with
undetectable data and round all entries to three digits) and compare your result with
the plot from Task 5.6.
t-SNE
The general idea behind t-distributed stochastic neighbor embedding (t-SNE) is that
the similarity between two high-dimensional points xj and xi is measured by the sum
of Gaussian distances exp(−∥x_i − x_j∥^2/σ_i^2) + exp(−∥x_i − x_j∥^2/σ_j^2) for certain
bandwidths σi , σj for i, j = 1, . . . , n. Analogously, for the low-dimensional repre-
sentatives η i and η j , similarities are measured by a student t-distribution; see also
[Geo12]. To determine the points η i , the Kullback–Leibler divergence (see Chap-
ter 7 for a definition) between the Gaussian distribution and the student t-distribution
is minimized. More details can be found in [VdMH08].
We will use the existing t-SNE implementation from S CIKIT-L EARN in the following.
X Task 5.8. Embed the preprocessed Guo data set by using principal component analy-
sis and t-SNE. You can use the corresponding implementations from S CIKIT-L EARN.
Compare the results with the diffusion maps embedding from Task 5.6. Compare the
computation times of the dimensionality reduction methods as well.
X Task 5.9. Compare the diffusion maps embedding of the Guo data set for several
bandwidths σ.
Lafon [CL06] proposed a rule for the choice of a good value for σ as
σ = √( (1/(2n)) ∑_{i=1}^n min_{j≠i} ∥x_i − x_j∥_2 ).    (5.5)
The radicand indicates half of the average of all nearest neighbor distances in the data set.
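A direct implementation of (5.5) could look as follows; it computes all pairwise distances, which is sufficient for data sets of the size considered here.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

def bandwidth_rule(X):
    D = squareform(pdist(X))        # pairwise Euclidean distances
    np.fill_diagonal(D, np.inf)     # exclude j = i from the minimum
    return np.sqrt(D.min(axis=1).sum() / (2 * len(X)))
```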
X Task 5.10. Implement the rule (5.5) for the bandwidth σ. Plot the embedding for the
Guo data set with the bandwidth chosen by this rule.
X Task 5.11. Implement the spectral clustering method from Algorithm 9 with the tran-
sition matrix P from diffusion maps and using k-means for clustering (from S CIKIT-
L EARN) for a given number of clusters k.
X Task 5.12. Plot the largest 20 (ordered) eigenvalues of the transition matrix P for
the preprocessed Guo data set and identify k by determining the biggest spectral gap
(use the parameters from Task 5.6).
X Task 5.13. Employ the spectral clustering algorithm for the Guo data set with k from
Task 5.12 and plot the resulting points/clusters in 2D. Interpret your results. Does the
clustering detect the different cell stages?
possible equivalent fuzzy topological structure. Furthermore, density preserving extensions for
both t-SNE and UMAP were introduced in [NBC21].
Other clustering methods Apart from k-means and spectral clustering there are a variety
of other clustering methods such as hierarchical (agglomerative and divisive) clustering, which
successively builds a hierarchy of clusters, and density-based clustering algorithms like DBScan,
which detect areas of large data density and combine them into clusters. For a discussion of
different clustering approaches, we refer the reader to [XW08].
Chapter 6

Deep Neural Networks
We now focus on the model class of artificial neural networks and especially on so-called deep
neural networks. This class constitutes the present state of the art when it comes to large-scale
machine learning problems (many data points) and has proven to be very successful for various
applications, e.g., in signal processing or image recognition; see [Agg18, GBC16], for example.
This chapter is structured as follows. In Section 6.1 we first introduce the model class of
(deep) neural networks and their graphical representations. Section 6.2 is dedicated to the study
of the approximation properties of the model class. There, we investigate how well the model
class is suited to approximate functions from familiar function spaces, e.g., continuous functions
or functions with smooth derivatives. Subsequently, we have a look at the loss function and the
resulting optimization problem when employing neural network models in Section 6.3. Further-
more, we introduce an efficient algorithm to (approximately) minimize the loss there. Section 6.4
deals with a special subclass of neural networks, namely weight-reduced and convolutional neu-
ral networks. In Section 6.5 we study the algorithm from Section 6.3 in more detail and introduce
variants thereof that have proven to be successful in practical applications. Finally, Section 6.6
presents tasks on deep neural network implementations, where we dive into the P YTHON library
K ERAS.
Figure 6.1: Graph representation for a single-layer neural network with d input neurons and a
single output neuron.
Together, the weights and the bias are the parameters p = (w1 , . . . , wd , b)T in the parametrized
model class M1-layerNN of single-layer neural networks, i.e., these parameters are to be learned.
This means that they need to be determined by the minimization of a loss function with respect to
given training data; see Section 6.3. Note that the model class represented by this simple NN is
already familiar to us: It is just the affine linear model class M1-layerNN = MLin from Section 2.1.
Moreover, a nonlinear activation function ϕ is applied to the result. For the most simple choice
of the Heaviside function
ϕ(t) := { 1  if t > 0,
          0  else },
we obtain the so-called perceptron network. It computes f(z) := ϕ( ∑_{i=1}^d w_i z_i + b ); see
[Ros58]. This approach has been introduced by Rosenblatt in 1957 and was one of the first
ANNs for machine learning.
Activation functions
An activation function ϕ : R → R at a network node processes the information
propagated to that node and thus defines the output of a node. It is applied after the
summation of the weighted input information and after adding a bias, i.e.,
ϕ( ∑_{i=1}^d w_i z_i + b ).
While there is a nonlinearity present in the Heaviside function, it only casts the real-valued output
∑_{i=1}^d w_i z_i + b to 0 and 1 in the same fashion as a level set function; see also Section 2.2. Thus,
the perceptron can only represent affine linear functions.
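For illustration, a single perceptron evaluation can be written in a few lines of Python; the function name is chosen freely.

```python
import numpy as np

def perceptron(z, w, b):
    # Heaviside activation applied to the weighted sum plus bias.
    return 1.0 if np.dot(w, z) + b > 0.0 else 0.0
```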
Figure 6.2: Graph representation for a two-layer neural network with d input neurons, d2 hidden
neurons, and a single output neuron.
and

f(z) := o^{(3)} := ϕ^{(3)}( net^{(3)} )
for some activation function ϕ(3) . For regression, one usually takes ϕ(3) := id in the
output layer. For classification, it is common to choose a softmax function, which we will
introduce in more detail in Section 6.4.4.
The full model thus reads

f(z) = o^{(3)} = ϕ^{(3)}( net^{(3)} ) = ϕ^{(3)}( ∑_{j=1}^{d_2} w_j^{(2)} · ϕ^{(2)}( ∑_{i=1}^d w_{i,j}^{(1)} z_i + b_j^{(2)} ) + b^{(3)} ).
In an L-layer neural network, the weight matrices W (l) ∈ Rdl ×dl+1 and bias vectors ⃗b(l+1) ∈
Rdl+1 for l = 1, . . . , L are the parameters of the model class ML-layerNN . Let us study how a
function f ∈ ML-layerNN is defined, i.e., how the network outputs f (z) are computed for given
weights and biases. To this end, let the values ⃗o^{(l)} := ( o_1^{(l)}, . . . , o_{d_l}^{(l)} )^T of the lth layer neurons be given. Note that ⃗o^{(1)} = z is just the input of the network. We set

⃗net^{(l+1)} := ( W^{(l)} )^T · ⃗o^{(l)} + ⃗b^{(l+1)}    (6.1)

for the lth layer weight matrix W^{(l)} with entries W_{i,j}^{(l)} = w_{i,j}^{(l)} and the bias vector ⃗b^{(l+1)} = ( b_1^{(l+1)}, . . . , b_{d_{l+1}}^{(l+1)} )^T. Slightly abusing notation, we write

⃗o^{(l+1)} := ϕ^{(l+1)}( ⃗net^{(l+1)} ),
where the application of the activation function ϕ(l+1) has to be understood elementwise. This
is done for l = 1, . . . , L to obtain
⃗o^{(L+1)} = o_1^{(L+1)} =: f(z)
This computation of the network output for given weights W := ( W^{(1)}, . . . , W^{(L)} ) and bias terms ⃗b := ( ⃗b^{(2)}, . . . , ⃗b^{(L+1)} ) by layerwise iterations is called forward propagation. Note
however that we still need to optimize the weights and biases with respect to a given loss function;
see Section 6.3.
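A compact sketch of forward propagation according to (6.1) is given below; the lists of weight matrices, bias vectors, and activation functions are assumed to be given in layer order.

```python
import numpy as np

def forward(z, weights, biases, activations):
    # weights[l-1] = W^(l) of shape (d_l, d_{l+1}); biases[l-1] = b^(l+1);
    # activations[l-1] is the elementwise function phi^(l+1).
    o = z
    for W, b, phi in zip(weights, biases, activations):
        net = W.T @ o + b        # net^(l+1) = (W^(l))^T o^(l) + b^(l+1), cf. (6.1)
        o = phi(net)             # o^(l+1) = phi^(l+1)(net^(l+1))
    return o
```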
Although perceptron-based neural networks, like the ones we have discussed so far, were
introduced in the 1950s, they did not gain much interest back then since no efficient training
algorithms existed for multi-layer architectures. This changed in the 1980s when such algorithms
[Lin76, Wer82] were presented. Soon, they were employed successfully for several real-world
applications; see, e.g., [LeC85, RHW86].
∥f − g∥_∞ < ε

for

f(z) := ( ⃗w^{(2)} )^T · ϕ^{(2)}( ( W^{(1)} )^T · z + ⃗b^{(2)} ).
This theorem essentially tells us that a two-layer network with an appropriate activation func-
tion is able to represent continuous functions on a compact domain with arbitrary precision. Note
that, for polynomial activation functions, this statement is a trivial consequence of the classical
Stone–Weierstrass theorem [Sto48], which states that linear combinations of polynomials can
approximate continuous functions on an interval arbitrarily well.
Popular activation functions that fulfill the requirements of Theorem 6.2.1 are the hyperbolic
tangent function ϕ(x) := tanh(x) and the sigmoid function ϕ(x) := 1/(1 + e^{−x}). Interestingly, the
commonly used ReLU function ϕ(x) := max(0, x) does not meet the prerequisites of the above
theorem. However, there exist more recent variants of approximation theorems for NNs, which
cover the ReLU case and various other activation functions and also take the number of weights
and layers into account, e.g., [Yar17, PV18, BGKP19, GK22, GP90, Mha96, Dah22, MZ22].
For a thorough survey on approximation properties of different types of neural networks, see
[DHP21]. Let us have a look at the version from [PV18], which is valid for ReLU activations.
Theorem 6.2.2 (Petersen, Voigtländer 2018). Let β > 0. There exists a c > 0 such that, for all
ε > 0 and any piecewise C β function g on [0, 1]d , there exists a DNN with ReLU activations and
at most O( ε^{−2(d−1)/β} ) non-zero weights and c · log_2(β + 2) · (1 + β/d) layers with output f such that

∥f − g∥_{L_2} < ε.
Similar results are known for functions of higher smoothness, e.g., holomorphic functions.
There, even exponentially decaying bounds in ε on the approximation error ∥f − g∥ in certain
Lipschitz norms can be proven to hold with similar constraints on the numbers of neurons and the
numbers of layers; see [OSZ22]. To this end, the so-called rectified power unit (RePU) activation
function RePUp (x) := max(0, x)p with p ≥ 2 is used in contrast to the ReLU function.
Besides approximation theorems stating how well neural networks approximate certain func-
tion spaces, the function classes that specific neural networks describe have also been studied.
For two-layer feed-forward neural networks, for example, this class is called the Barron space.
In the case of deep residual neural networks, which we will introduce in Section 8.2, this class
is a so-called flow-induced function space. For more details, we refer the reader to [EMW22].
Furthermore, [Uns19] shows that determining the optimal activation functions of a neural net-
work together with its weights and biases results in non-uniform linear spline activations,26 i.e.,
the resulting network function is a piecewise linear spline with a priori unknown knot locations.
for supervised learning. Note that this optimization problem might feature many local minima,
even for the simple least squares loss.
To tackle such an optimization problem, iterative methods (such as gradient descent; compare
Section 2.4), have proven to work well in practice. There, the fitting of the model parameters,
i.e., the weights W (l) and biases ⃗b(l+1) for l = 1, . . . , L, to given training data is iteratively
done. Then, for a locally convergent method, a resulting local minimum depends on the initial
choice of the model parameters, the respective minimization method (e.g., involving successive
linearization steps), and the available training data set.
While there is not yet a mathematically sound theory/analysis, empirical results in the last
decade have shown that gradient descent methods (see Section 2.4) are a good approach to deal
with the optimization task of DNNs. In particular, we use stochastic gradient descent-type opti-
mizers to numerically tackle the above optimization problem.
Throughout this section, we will stick to the least squares loss function to introduce an SGD
variant for solving the corresponding optimization problem. Nevertheless, the use of other (dif-
ferentiable) loss functions works in an analogous fashion. To reflect our setting, let
C_B(f) := (1/|B|) ∑_{i∈B} C_i(f),   where C_i(f) := L̃(f(x_i), y_i)

with the one-sample least squares loss L̃(z, z̃) = (z − z̃)^2. Thus C_{1,...,n}(f) is our well-known
least squares loss on the whole training data set
D = {(xi , yi ) | i = 1, . . . , n}
of the supervised learning problem. Note that we will need to take derivatives of CB (f ) with
respect to all the weights W = ( W^{(1)}, . . . , W^{(L)} ) and biases ⃗b = ( ⃗b^{(2)}, . . . , ⃗b^{(L+1)} ) in order
to determine the next step in a gradient descent method. It should be clear that the function
f ∈ ML-layerNN , which the network realizes, implicitly depends on those variables. The SGD
method for fixed step size ν > 0, also called learning rate, a fixed subset size 0 < κ ≤ n, also
called minibatch size, and a fixed step number S ∈ N is summarized in Algorithm 10.
Note that we can either run the algorithm for a fixed number S of steps (so-called epochs), as
described here, or use a tolerance criterion for the change of the gradients as done in Section 2.4.
The initialization of the weights and biases is usually done randomly, e.g., by b_k^{(l)}, w_{i,j}^{(l)} ∼ U( −1/√d_l, 1/√d_l ). Note however that the choice of the initial weights and biases can significantly
influence the outcome and the performance of deep learning algorithms in particular cases; see,
e.g., [GB10, MM15]. Note furthermore that the gradient ∇W ,⃗b CB (f ) is an unbiased estimator
for ∇W ,⃗b C{1,...,n} (f ) in each SGD step. However, if κ ≪ n, ∇W ,⃗b CB (f ) is much cheaper to
evaluate than ∇W ,⃗b C{1,...,n} (f ). Nevertheless, to reduce the variance of this estimator, it makes
sense to choose a reasonably large minibatch size κ. Usually, κ is chosen to meet given hardware
and time constraints.
The convergence properties of stochastic gradient descent are discussed in Section 6.5.
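The structure of such an SGD loop is sketched below; grad_fn is an assumed helper that returns the gradient of the minibatch loss C_B with respect to a flattened parameter vector, for example computed by backpropagation as described next.

```python
import numpy as np

def sgd(params, grad_fn, n_samples, lr=0.01, batch_size=32, n_epochs=10, seed=None):
    # params: flattened vector of all weights and biases;
    # grad_fn(params, batch_indices): minibatch gradient (assumed helper).
    rng = np.random.default_rng(seed)
    n_batches = max(1, n_samples // batch_size)
    for _ in range(n_epochs):
        for batch in np.array_split(rng.permutation(n_samples), n_batches):
            params = params - lr * grad_fn(params, batch)
    return params
```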
As mentioned before, we focus here on calculating ∇_{W,⃗b} C(f) for C(f) := ( f(x) − y )^2, which
resembles each Ci for i = 1, . . . , n, with an abstract data point x. Nevertheless, the computations
for other one-sample loss functions work analogously.
Now, if the input to our L-layer network is z, then o^{(L+1)} := o_1^{(L+1)} = f(z). We will show here only how to compute ∂C(f)/∂w_{i,j}^{(l)}. Note that the calculation of ∂C(f)/∂b_j^{(l+1)} works in the same fashion. As a first step, we apply the chain rule to obtain
∂C(f)/∂w_{i,j}^{(l)} = ∂C(f)/∂o_j^{(l+1)} · ∂o_j^{(l+1)}/∂net_j^{(l+1)} · ∂net_j^{(l+1)}/∂w_{i,j}^{(l)} = ∂C(f)/∂o_j^{(l+1)} · ( ϕ^{(l+1)} )′( net_j^{(l+1)} ) · o_i^{(l)}.
Furthermore, we have

∂C(f)/∂o_j^{(l+1)} = { 2( f(z) − y )  if l = L,
                      ∑_{i=1}^{d_{l+2}} ∂C(f)/∂net_i^{(l+2)} · ∂net_i^{(l+2)}/∂o_j^{(l+1)}  else },

where ∂net_i^{(l+2)}/∂o_j^{(l+1)} = w_{j,i}^{(l+1)}.
Since

∂C(f)/∂net_i^{(l+2)} = ∂C(f)/∂o_i^{(l+2)} · ∂o_i^{(l+2)}/∂net_i^{(l+2)} = ∂C(f)/∂o_i^{(l+2)} · ( ϕ^{(l+2)} )′( net_i^{(l+2)} ),
we see that we can calculate ∂C(f)/∂w_{i,j}^{(l)} by iteratively working our way from layer L to L − 1 to L − 2
and so on until we reach layer l. This process is called backpropagation or just backprop. To this
end, we introduce
⃗δ^{(l)} := { 2( f(z) − y )  if l = L,
             W^{(l+1)} · ( ⃗δ^{(l+1)} ⊙ ( ϕ^{(l+2)} )′( ⃗net^{(l+2)} ) )  else },
which yields

∇_{W^{(l)}} C(f) = ⃗o^{(l)} · ( ⃗δ^{(l)} ⊙ ( ϕ^{(l+1)} )′( ⃗net^{(l+1)} ) )^T,
∇_{⃗b^{(l+1)}} C(f) = ⃗δ^{(l)} ⊙ ( ϕ^{(l+1)} )′( ⃗net^{(l+1)} )
for all l = 1, . . . , L. Here, ⊙ denotes the Hadamard product, i.e., the entrywise product between
two vectors.
Note that it is crucial for the performance of the SGD algorithm that the derivatives of CB (f )
can be computed by using basic linear algebra operations without using for-loops over i ∈ B.
Furthermore, note that the ReLU activation function is not differentiable in the classical sense.
However, the differentiation can be understood in a piecewise fashion here.
ò Automatic differentiation
The backpropagation procedure allows us to automatically infer the derivatives
∇W ,⃗b CB (f ) for different choices of activation functions and layer numbers. In
contrast to numerical differentiation, where the derivative is approximated, e.g., by
computing the difference quotient, the symbolic knowledge about the network is used
here. In the mathematical community this is just a special case of automatic differen-
tiation or autodiff . Here, the derivative of a mathematical expression is numerically
evaluated by iterative applications of the chain rule on mathematical expressions,
which are built from combining basic mathematical functions, e.g., trigonometric
functions or polynomials, together with basic mathematical operators, e.g., addition,
division, etc. In this way, derivatives can be computed almost exactly (up to machine
precision); see also [GW08, Nau11] for more details and [BPRS18] for a survey on
automatic differentiation in ML.
Finally, we want to remark that the notation of the backpropagation process can differ through-
out the literature. We presented above just one special way to denote it. Another possibility is a
more compact gradient-type notation which is based on [GBC16]. Here, instead of storing ⃗δ(l)
for l = 1, . . . , L, we store the gradients ⃗g (l) of the loss function with respect to the net sum. To
this end, let

⃗g^{(l)} := ∇_{⃗net^{(l)}} C(f) = ⃗h^{(l)} ⊙ ( ϕ^{(l)} )′( ⃗net^{(l)} )

for l = 1, . . . , L + 1 with

⃗h^{(l)} := ∇_{⃗o^{(l)}} C(f) = { 2( f(z) − y )  if l = L + 1,
                                W^{(l)} · ⃗g^{(l+1)}  if l = 1, . . . , L }.
Then, the derivatives of the loss with respect to the weights and biases can be written as

∇_{W^{(l)}} C(f) = ⃗o^{(l)} · ( ⃗g^{(l+1)} )^T,
∇_{⃗b^{(l)}} C(f) = ⃗g^{(l)}.
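This compact notation translates almost directly into code. The following sketch computes the gradients for the one-sample least squares loss; dactivations is an assumed list containing the derivatives of the activation functions, and the layer conventions match the forward propagation sketch above.

```python
import numpy as np

def backprop(z, y, weights, biases, activations, dactivations):
    # Forward pass, storing the outputs o^(l) and net sums net^(l+1).
    os, nets = [z], []
    for W, b, phi in zip(weights, biases, activations):
        nets.append(W.T @ os[-1] + b)
        os.append(phi(nets[-1]))
    L = len(weights)
    grads_W, grads_b = [None] * L, [None] * L
    h = 2.0 * (os[-1] - y)                      # h^(L+1) = 2 (f(z) - y)
    for l in reversed(range(L)):
        g = h * dactivations[l](nets[l])        # g = h ⊙ phi'(net)
        grads_W[l] = np.outer(os[l], g)         # grad w.r.t. W^(l+1): o g^T
        grads_b[l] = g                          # grad w.r.t. b^(l+2)... i.e. the bias of this layer
        h = weights[l] @ g                      # propagate h to the previous layer
    return grads_W, grads_b
```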
overfitting problems and runtime complexity issues. Apart from smoothness regularizations and
bounds on coefficient norms, as for SVM, a very successful method for regularizing NNs is the
so-called dropout approach. Here, in each step of the iterative solver (e.g., SGD), a (hidden) node
in the network is neglected with probability 0 < p < 1. After the training, the whole network
is considered again for testing. There however, the output of each node is usually multiplied by 1 − p such that the expected output E[ o_i^{(l)} ] is the same as in the training steps.
Apart from dropout, we could directly use a weight-reduced network, which has fewer param-
eters and, thus, less expressive power than a fully connected one, but which is often sufficient
to get good results. To this end, certain neuron-neuron connections are omitted, as illustrated in Figure 6.3.
Figure 6.3: Graph representation for a two-layer neural network where weight reduction is used.
Note that each input neuron is only connected to a subset of hidden neurons.
A special form of weight reduction is weight sharing. Here, while many (or all) neuron con-
nections are still active in the network graph, some of them share the same weights as visualized by Figure 6.4, where shared weights have the same name and color.
Figure 6.4: Graph representation for a layer employing weight sharing. Here, three different
weights a, b, c are shared among the connections from one layer to the next one.
where m∗ = m if we want an odd number of 2m + 1 weights w_1^{(l)}, . . . , w_{2m+1}^{(l)} and m∗ = m − 1 if we want an even number of 2m weights w_1^{(l)}, . . . , w_{2m}^{(l)}. Here, the size d_{l+1} of the (l + 1)th layer depends on how we treat indices i for which o_i^{(l)} does not exist, i.e., for which i ≤ 0 or i > d_l.
Note that the bias is often omitted in convolutional layers since it does not influence the result
much for large networks. Note also that the indexing in (6.3) differs slightly from the indexing
in the definition of convolutions in Section 4.5.3. We choose the specific definition above to be
consistent with most of the literature on CNNs.
Padding and stride Besides the definition in (6.3), two additional parameters determine the
results and layer size of a convolutional layer in practice: the padding parameter p̃ ∈ N and the
stride parameter s̃ ∈ N.
The padding determines how non-existing values o_i^{(l)} for i ≤ 0 and i > d_l in (6.3) are treated. This means that all values o_i^{(l)} for −p̃ < i ≤ 0 and d_l < i ≤ d_l + p̃ are defined to be zero. Computations involving other non-existing values are neglected.
The stride determines how many o_i^{(l)} values are skipped, i.e., how many of the lth layer neurons
are neglected when computing the net sum for the next neuron in the (l +1)th layer. In particular,
the formula (6.3) is only valid for s̃ = 1, which is the default value. For other strides s̃ the
corresponding formula becomes more complicated and reads
net_j^{(l+1)} = ∑_{k=−m}^{m∗} o_{s̃(j−1)+k+1}^{(l)} w_{m+1+k}^{(l)} + b_j^{(l+1)},    (6.4)
where m∗ = m for an odd number of weights and m∗ = m−1 for an even number of weights. A
simple example of a graph for a convolutional layer with three weights and with p̃ = s̃ = m = 1
is depicted in Figure 6.5.
Figure 6.5: Graph representation for a convolutional layer. The shared weights w1 , w2 , w3 are
used to compute (6.3) for m∗ = m = 1.
Layer size  The layer size d_{l+1} directly depends on the size of p̃ and s̃, i.e.,

d_{l+1} = ( d_l + 2(p̃ − m) − 1 + s̃ ) / s̃    (6.5)

for the case of 2m + 1 weights and

d_{l+1} = ( d_l + 2(p̃ − m) + s̃ ) / s̃    (6.6)
for the case of 2m weights. In the following we will assume s̃ = 1 unless specified otherwise.
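The following sketch evaluates a one-dimensional convolutional layer as a sliding window, which is equivalent to (6.3) and (6.4) up to the index conventions; its output length agrees with (6.5) and (6.6) whenever the division is exact.

```python
import numpy as np

def conv1d_layer(o, w, b=0.0, stride=1, padding=0):
    # o: values o^(l) of the previous layer; w: shared stencil weights.
    o_pad = np.pad(o, padding)                   # zero padding (p-tilde)
    window = len(w)
    out = []
    j = 0
    while j + window <= len(o_pad):
        # net_j^(l+1): weighted sum over a window of the previous layer plus bias.
        out.append(np.dot(o_pad[j:j + window], w) + b)
        j += stride
    return np.array(out)
```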
Parallel layers Convolutional layers are often used in a parallel fashion, i.e., P ∈ N indepen-
dent copies (so-called channels) of a convolutional layer are learned with different weights for
each copy. The concatenated output of all these parallel layers serves as the input to the subse-
quent layer in the network. Another way to look at this is to think of a layer of size dl+1 where
every neuron stores a P -dimensional vector of convolutions, e.g.,
⃗net_j^{(l+1)} = ( ∑_{k=−m}^{m} o_{j+k}^{(l)} w_{m+1+k}^{(l,p)} + b_j^{(l,p)} )_{p=1}^{P}
for s̃ = 1 and an odd number of weights. We can think of this procedure as creating P different
feature maps for the data. Then, by learning the weights and biases of the subsequent layers, the
most significant combination of those feature maps is chosen by the network.
Finally, let us consider the case of a parallel convolutional layer with Po (output) channels
whose input already consists of Pi (input) channels. In this case, for each output channel, a
convolutional operator is applied to all Pi input channels, but with different (learnable) weights
per input channel. The results are then summed up over all input channels. In this way, each of
the Po output channels contains information about each of the Pi input channels.
6.4.2 Pooling
To reduce the size of a network further and since some information gained in an overlapping-type
of convolutional layer, i.e., s̃ < 2m, is redundant, a so-called pooling layer is often employed af-
ter a convolutional layer. Here, a simple mathematical operation is used to summarize/condense
the information of several neurons into one neuron. The most common type of pooling is max-
pooling, where the maximum of the values of the incoming neurons is stored. Let us consider
the example graph for a max-pooling layer with stride s̃ = 3 in Figure 6.6. We can interpret
this as a maximum operator with three arguments (size = 3), which is applied to the first layer
successively and jumps three neurons ahead after each application (stride s̃ = 3).
Figure 6.6: Graph representation for a max-pooling layer of size 3 with a stride of 3.
particular, let o_{i,j}^{(l)} denote the (i, j)th entry of the image/matrix at layer l. Then, a 2D convolution
with a 2m1 × 2m2 weight stencil
W̃^{(l)} := [ w_{1,1}^{(l)}     w_{1,2}^{(l)}     · · ·  w_{1,2m_2}^{(l)}
             w_{2,1}^{(l)}     w_{2,2}^{(l)}     · · ·  w_{2,2m_2}^{(l)}
             ⋮                 ⋮                        ⋮
             w_{2m_1,1}^{(l)}  w_{2m_1,2}^{(l)}  · · ·  w_{2m_1,2m_2}^{(l)} ],
i.e., with an even number of weights in both directions, is given by
net_{i,j}^{(l+1)} = ∑_{k_1=−m_1}^{m_1−1} ∑_{k_2=−m_2}^{m_2−1} o_{i+k_1, j+k_2}^{(l)} w_{m_1+1+k_1, m_2+1+k_2}^{(l)} + b_{i,j}^{(l+1)},    (6.7)
where we assumed that m1 and m2 are both odd. In the case of odd stencil sizes we have
to modify the formula accordingly; see (6.3) for the 1D case. A schematic illustration of the
application of a 2D convolutional stencil can be found in Figure 6.7.
Analogously, a (max-)pooling layer can also be defined in the same 2D fashion, as is shown
in Figure 6.8.
A typical 2D-convolutional NN for image classification then consists of a sequence of alter-
nating 2D convolutional layers (possibly P parallel ones) and 2D pooling layers. At the end, a
fully connected layer is usually added. An illustration can be found in Figure 6.9.
Then, we have M output neurons, whose values represent the probability that the input vector
belongs to the corresponding class. As the softmax function depends on net-inputs that (in a strict
Figure 6.7: Schematic illustration for a 2D convolutional layer with 2 × 2 stencil W̃^{(l)} and s̃ = p̃ = 1 in each direction. The input is of size 6 × 6, and we exemplarily depict the entries within the red, blue, and green squares. Note that the padding at the boundary leads to the zeros within the green square. We illustrate the resulting values after the convolution operator is applied within the square of the same color in the next layer, e.g., 1 · 7 + 2 · 1 + 0 · 4 + 3 · 3 = 18 in the red square in the output. Note that the subsequent layer size increased by one in each direction according to (6.6).
Figure 6.8: Illustration of a 2D max-pooling layer with size 3 × 3 and stride s̃ = 3 in each
direction.
sense) do not belong to the neuron at hand, the application of the softmax function is sometimes
also modeled as an extra layer but without any weights/degrees of freedom.
Figure 6.9: Typical 2D CNN for image classification with several parallel convolutional and
pooling layers and a final fully connected layer, where the 2D data are stacked into a large 1D
vector before applying a softmax activation function. The number of channels (10 and 20) of
the convolutional layers has been chosen arbitrarily here. (Cat image licensed under Pixabay
Content License.)
In our case, p1 will be the distribution the data are drawn from, i.e., p1 = µ, and p2 is the output
of our neural network after the application of the softmax layer, i.e., p2 = f . The motivation
for employing a cross entropy loss is that its minimization is equivalent to the maximization of
the so-called likelihood of the labels given the data and all weights W and biases ⃗b under the
assumption that the data (xi , yi ) are independent for different i ∈ {1, . . . , n}, i.e.,
arg max_{W,⃗b} ∏_{i=1}^n p_2( y_i | x_i, W, ⃗b ) = arg max_{W,⃗b} log( ∏_{i=1}^n p_2( y_i | x_i, W, ⃗b ) )

= arg max_{W,⃗b} ∑_{i=1}^n log p_2( y_i | x_i, W, ⃗b )

= arg max_{W,⃗b} E_{p_1}[ log(p_2) ].
This holds since p2 is non-negative and the logarithm is a monotonically increasing function,
which does not change the arg max. Note that
arg max_{W,⃗b} E_{p_1}[ log(p_2) ] = arg min_{W,⃗b} E_{p_1}[ − log(p_2) ],
which shows the equivalence of minimizing the cross entropy between p1 and p2 and maximizing
the likelihood of p2 when the data are drawn according to p1 .
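For reference, the standard softmax output and the resulting one-sample cross entropy loss can be written as follows; the stabilizing shift and the small constant inside the logarithm are purely numerical safeguards and not part of the mathematical definition.

```python
import numpy as np

def softmax(net):
    e = np.exp(net - net.max())       # shift for numerical stability
    return e / e.sum()

def cross_entropy(y_onehot, probs):
    # One-sample cross entropy -sum_m y_m log(p_m) for a one-hot label y.
    return -np.sum(y_onehot * np.log(probs + 1e-12))
```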
problem at hand allow us to derive an error estimate. To this end, let C̃ be a random variable
depending on a data point (x, y) drawn according to the data measure µ. Then, let Ci be instances
of C̃ instantiated at the training data points (x_i, y_i) for i = 1, . . . , n. For example, C_i(f) = ( f(x_i) − y_i )^2 as in (6.2). This leads to the least squares loss C(f) = (1/n) ∑_{i=1}^n C_i(f). Now we
slightly reformulate the minimization problem such that the loss function does not depend on
f but rather on the parameters of f . To obtain a general formulation we denote the parameter
vector by p ∈ Rdp . Then, we arrive at the model problem
min_{p∈R^{d_p}} Θ(p) := min_{p∈R^{d_p}} (1/n) ∑_{i=1}^n Θ_i(p),    (6.8)
hold for some L ≥ 0 and for all p̄, p̂ ∈ Rdp and all i = 1, . . . , n (regardless of the data point
defining Ci ). Furthermore, let Θ be ξ-strongly quasi-convex, i.e.,
Θ(p∗) ≥ Θ(p) + ⟨∇_p Θ(p), p∗ − p⟩ + (ξ/2) ∥p∗ − p∥^2    ∀ p ∈ R^{d_p}
for some ξ > 0, where p∗ = arg min_{p∈R^{d_p}} Θ(p). Then, choosing κ = 1 and a learning rate 0 < ν ≤ 1/(2L) in Algorithm 10 leads to the upper bound

E[ ∥p∗ − p_k∥^2 ] ≤ (1 − νξ)^k ∥p∗ − p_0∥^2 + 2νσ^2/ξ    (6.9)
for the difference between the true solution p∗ and the parameter value pk after the kth iteration,
i.e., after the kth minibatch has been processed.27 Here, p_0 is the initial value and σ^2 := E[ ∥∇_p Θ_1(p∗)∥^2 ].
This theorem gives us an upper bound on the error after k iterations of the SGD algorithm. Note that the upper bound only converges to 0 if σ^2 = E[ ∥∇_p Θ_i(p∗)∥^2 ] = 0 would hold for
some/any i (since they are identically distributed). Nevertheless, when choosing a very small
learning rate ν, the second term on the right-hand side of (6.9) almost vanishes. However, the
first term decays then very slowly with increasing k. Therefore, a certain tradeoff is needed to
obtain a fast convergence and a small overall upper error bound.
Many versions of the above theorem for different SGD variants can also be found in [GHR20],
e.g., variants for a minibatch algorithm or a regularized version of it. A similar error bound for
SGD with slightly different prerequisites and a result for a decaying learning rate can be found
in [BCN18].
When applying the above theorem to neural networks we immediately encounter a problem:
The prerequisites are usually not fulfilled. In particular, the convexity of the Θi and the ξ-strong
quasi-convexity of Θ are only met for very special types of multi-layer networks. Therefore, a
27 Note that the number of iterations k refers to the number of times that line 6 in Algorithm 10 has been called. This
means that we already performed n iterations after one full epoch if the minibatch size is κ = 1 since we execute line 6
of Algorithm 10 once per epoch for each data point.
sound convergence theory in more general cases still needs to be established. Nonetheless, SGD
is still frequently applied there. In the next sections we will encounter variants of SGD, which
heuristically perform in some cases even better than the theory predicts. To this end, let us briefly
introduce a compact way to write the kth iteration of the SGD algorithm as

pk = pk−1 − ν · ∇p ΘB(pk−1),

where ΘB(pk−1) = CB(f(·; pk−1)) is the minibatch variant of the notation from (6.8); see also Section 6.3. Here, we explicitly expressed the network function f in dependence of both the input and the parameter vector p, which we omitted previously for readability reasons.

• Momentum update:

mk := χ · mk−1 − ν · ∇p ΘB(pk−1),
pk = pk−1 + mk.
Here, a momentum term mk is incorporated, which contains the history of already en-
countered gradient steps. It is weighted by 0 < χ < 1 in each iteration. Therefore, the
contribution of values of older gradients decays exponentially with the number of steps
taken. Note that if the most recent gradients have a common direction in which they point,
the contribution of this direction to the next step is largely increased. This can be inter-
preted as pushing a ball down a hill. Typically, χ ≈ 0.9 is chosen. An illustration of
gradient descent with and without momentum updates can be found in Figure 6.10.
• Nesterov update [Nes83]:
A problem with the momentum update can be that we still follow the direction of older
gradients while the gradient at our current iterate might point in a completely different
direction. To remedy this issue, Nesterov came up with a clever idea: We first take pk−1 +
χ · mk−1 as an estimate of where we are going in the next step and then compute the
gradient there. This approach has the chance to reach a convergence rate that is twice as
good as that of plain SGD, at least for strongly quasi-convex problems; see [AR20].
Figure 6.10: Gradient descent updates without (left) and with (right) momentum updates are
depicted with orange arrows. The sought minimum is found in the middle of the blue ellipses,
which reflect the isolines of the function under consideration.
• AdaGrad [DHS11]:
The idea of AdaGrad is to use the squared sum hk of all previously encountered gradients to change the learning rate individually for each weight and bias. Here, the learning rate is damped for weights/biases that already changed strongly in the past, and it is increased for weights/biases that only underwent minor changes. Note that the division and the square root have to be understood elementwise. Usually, a value of ε ≈ 10−4 is chosen as a threshold to ensure that the denominator of the update does not vanish.
• RMSProp [Hinton - unpublished]:
a correction of 1/(1 − β1^k) is made to mk and a correction of 1/(1 − β2^k) is made to hk. These corrections account for a potential initialization bias for large values of β1 and β2. Note that there also exists a Nesterov-version of Adam called NAdam [Doz16].
In practice, plain SGD and Adam are most frequently used, and we will encounter them in
Section 6.6 in detail.
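To make the update rules concrete, here is a minimal NumPy sketch of a single parameter update for plain SGD, the momentum variant, and an Adam-style update with the bias corrections mentioned above; the hyperparameter values are common defaults, not prescriptions from the text:

import numpy as np

def sgd_step(p, grad, nu=0.01):
    return p - nu * grad

def momentum_step(p, m, grad, nu=0.01, chi=0.9):
    m = chi * m - nu * grad            # momentum term accumulates past gradients
    return p + m, m

def adam_style_step(p, m, h, grad, k, nu=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Exponential moving averages of the gradient and its elementwise square.
    m = beta1 * m + (1 - beta1) * grad
    h = beta2 * h + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**k)         # bias corrections as described above
    h_hat = h / (1 - beta2**k)
    return p - nu * m_hat / (np.sqrt(h_hat) + eps), m, h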
X Task 6.1. Implement a class TwoLayerNN, which represents a (fully connected, feed-
forward) two-layer neural network, i.e., L = 2. The activation functions should be
ϕ(2) = ReLU and ϕ(3) = id. The weights and biases can be initialized by drawing
i.i.d. uniformly distributed random numbers in (−1, 1). The class should contain a
method feedForward to calculate the point evaluations of f for a whole minibatch
at once and a method backprop to calculate ∇W ,⃗b CB (f ). To this end, avoid using
for-loops over the minibatch and use linear algebra operations (on vectors, matrices,
or tensors) from NumPy instead.
X Task 6.3. Test your implementation by drawing 250 uniformly distributed points xi
in R2 with norm ∥xi ∥ ≤ 1 and label them by yi = −1. Now draw 250 uniformly
distributed points xi in R2 with 1 < ∥xi ∥ ≤ 2 and label them by yi = 1. Use your
two-layer neural network with d2 = 20 hidden layer neurons, κ = 20, and 50000
iterations, i.e., use S = 50000/κ = 2500 epochs, to classify the data. Try different
learning rates ν. Output the least squares error every 5000 iterations. After the result
has been computed, make a scatter plot of the data and draw the contour line of your
learned classifier. What do you observe? What happens if you increase S?
6.6.3 Regularization
To prevent overfitting many techniques are known, but most of them are not well understood
theoretically. A very popular technique is the dropout approach (compare Section 6.4), where,
during each training step, one neglects nodes in a layer with a given probability p. In this way,
random sub-nets are trained. To add dropout regularization to a layer employ
model.add(layers.Dropout(p))
after it.
Keras also has support for using regularization terms in the loss function, similar to what we
have seen in the discussion of SVMs in Chapter 3.
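For instance, an ℓ2 penalty on the weights of a dense layer can be attached via the regularizers module; the layer sizes and the penalty weight 0.01 below are arbitrary illustration values (TensorFlow backend assumed):

from tensorflow import keras
from tensorflow.keras import layers, regularizers

model = keras.Sequential([
    layers.Dense(64, activation="relu", input_shape=(784,),
                 kernel_regularizer=regularizers.l2(0.01)),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])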
X Task 6.4. Use Keras to build a classifier for the MNIST data set (see the template
Jupyter notebook).
(a) Build a model with the following layers:
X Task 6.5. For the MNIST data set, build a CNN with Keras with the following
layers:
• 16 parallel convolutional layers with kernel size 3 × 3 + ReLU,
• 32 parallel convolutional layers with kernel size 3 × 3 + ReLU,
• a 2D max pooling layer of size 2 × 2, non-overlapping + dropout (p = 0.25),
• a flattening layer, which converts its input to a vector,
• a fully connected layer with 128 outputs + ReLU + dropout (p = 0.5),
• a fully connected layer with output size 10 + softmax.
Train for 15 epochs; the other parameters should be the same as in the previous model
from Task 6.4.
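One possible Keras realization of this layer stack (assuming the TensorFlow backend and MNIST inputs of shape 28 × 28 × 1; the compile settings are placeholders and not prescribed by the task) could look as follows:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Conv2D(16, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D(pool_size=(2, 2)),     # non-overlapping 2x2 pooling
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=64, epochs=15)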
X Task 6.6. Use a CNN to learn features for pedestrian classification. Proceed as fol-
lows:
• Design and train a CNN for pedestrian classification (use the data from Sec-
tion 4.5.2). Here, you can start with the network from Task 6.5.
• Use the output after the flattening (see Keras' FAQ on how to do this) as a
feature vector for a linear SVM, maybe even together with the HOG features.
Try to use PCA in order to improve the accuracy; also tweak the HOG parameters.
Make sure not to overfit (the pedestrian data set is small by deep learning standards).
Hint: You can also install the AUGMENTOR library for Python in order to enlarge
the data set, which is a common technique in deep learning.
a minute), while on the GPU an epoch was finished in three seconds. Even higher speedups are
common.
Training a neural network involves a lot of trial-and-error and requires experience and pa-
tience. Modern networks can only be trained (in reasonable time) on a GPU or on dedicated
hardware. For example, the famous AlexNet [KSH12] had 60 million parameters and was trained
over six days on two GPUs in 2012. More recent approaches usually achieve smaller training
times but exploit a much more expensive hardware setup. The authors of [GDG+ 17], for in-
stance, trained their network within an hour on 256 GPUs in 2017. However, comparing many
recent experiments on the same data set to older results is usually not straightforward since a lot
of pre-training and model/architecture mixing is being done; see, e.g., [YWV+ 22].
Besides GPUs, so-called tensor processing units (TPUs) have been developed. These are
specific hardware/chip architectures, which are even more efficient for deep learning problems
than GPUs; see, e.g., [Jea17].
References on general neural networks and CNNs The free course at fast.ai30 is a highly
recommended reference when looking for further tutorials on neural networks from a program-
mer’s perspective. A more thorough and complete consideration of many neural network types
is given in the book [GBC16] or in the review [RW17] on CNNs. There you will also find
references on the history of neural networks.
Recurrent neural networks An important type of network, which we only briefly discuss in
Section 10.3, is the recurrent neural network; see also [GBC16]. Here, variable input sizes/di-
mensions and sequence data (such as time series) can be processed. To this end, an internal
state variable is stored and recomputed for each new time step, for instance. The most famous
recurrent neural networks are long short-term memory networks (LSTM) [HS97] or gated recur-
rent units [CvMG+ 14]. They allow us to process long and possibly nonstationary sequences by
adaptively learning which time horizon to choose, while using past data in order to predict future
values.
30 http://www.fast.ai/
Chapter 7
Variational Autoencoders
In this chapter, we will introduce neural networks specifically designed for the unsupervised
learning task of dimensionality reduction. These networks are called autoencoders.
Autoencoders
Autoencoders are special feed-forward neural networks. They consist of so-called
encoder layers, which perform the reduction step from the high-dimensional data
space to a certain low-dimensional space, and of decoder layers, where vectors from
the low-dimensional space are mapped back to the high-dimensional space. In this
fashion, two parts are trained simultaneously: A dimensionality reduction architec-
ture and a high-dimensional vector reconstruction algorithm. For details, we refer the
reader to [Bal12, GBC16, Sch15].
The autoencoder approach allows the user to traverse the low-dimensional space of hidden (or
latent) variables, in which—in the best case—the directions can be associated with some mean-
ingful key figures such as physical, biological, or sociological quantities.
It is oftentimes not a priori clear how to obtain meaningful and interpretable latent vari-
ables for the dimensionality reduction step. Furthermore, it is not clear if, for instance, the
high-dimensional reconstruction of an interpolated vector between two data points in the low-
dimensional space is meaningful at all. To this end, variational autoencoders [Doe16, KW13,
KW19] have been created. They involve a probabilistic approach that ensures an appropriate
data distribution within the latent space and a meaningful process of generating high-dimensional
vectors from low-dimensional representatives. This serves as a first step in the direction of inter-
pretable latent variables.
This chapter is structured as follows. Section 7.1 gives a general introduction to autoencoders
and their connection to the PCA. Next, we present tasks on autoencoders in Section 7.2. Sub-
sequently, we discuss the concept of latent variables and the necessary statistical background in
Section 7.3, before we come to study variational autoencoders in Section 7.4. Finally, we give
tasks on the analysis of the MNIST data set from Section 3.7 with variational autoencoders in
Section 7.5.
7.1 Autoencoders
First, we introduce the concept of autoencoders represented by neural networks. To this end, we
briefly review the idea behind these special architectures and their construction before consider-
ing variations thereof.
Before we hint at possible difficulties and remedies for tackling this minimization problem, let
us have a closer look at the remaining question of defining E and D or, equivalently, of choosing
appropriate model spaces for the encoder and decoder. For our considerations we will stick to
the case of E and D being special kinds of neural networks.
[Figure: encoder E : Rd → Rq and decoder D : Rq → Rd.]
7.1.2 Construction
In its most simple form, an autoencoder is just a fully connected two-layer (feed-forward) neural
network, where the hidden layer contains fewer neurons than the input layer. Furthermore, the
output layer has the same size as the input layer. In terms of the dimensions q and d, which we
assigned to the encoder E and decoder D, we thus have d input and output neurons and q < d
hidden neurons; see Figure 7.1.
The encoder can now simply be written as

E(x) = ϕE(WE x + bE)

and the decoder as

D(z) = ϕD(WD z + bD).
arg min_{f ∈ M^{q,d}_{Lin}, ηi ∈ R^q} (1/n) ∑_{i=1}^{n} ∥xi − f(ηi)∥² = arg min_{W ∈ R^{d×q}, b ∈ R^d, ηi ∈ R^q} (1/n) ∑_{i=1}^{n} ∥xi − (W ηi + b)∥².

Since ϕD = id, the decoder also fulfills D ∈ M^{q,d}_{Lin} and resembles the map f; see also (4.1).
Now, in the PCA formulation, we aim to minimize with respect to the low-dimensional data
points η i . In our autoencoder setting, they just resemble the encoded inputs E(xi ).
To see this resemblance more clearly, let us have a look at the autoencoder minimization
problem in this setting and omit the activation functions ϕE , ϕD :
(D∗, E∗) = arg min_{(D,E)} (1/n) ∑_{i=1}^{n} ∥xi − D ◦ E(xi)∥²
         = arg min_{WE, WD, bE, bD} (1/n) ∑_{i=1}^{n} ∥xi − WD(WE xi + bE) − bD∥².
Note that the solution tuple (D∗ , E ∗ ) is not unique since arbitrary basis changes in the latent
space do not influence D ◦ E. Now recall that we found in Section 4.1 that the PCA solution for
the ηi is

ηi = W^T(xi − b).

This led to

W = arg min_{A ∈ Sq(R^d)} (1/n) ∑_{i=1}^{n} ∥(I − AA^T)(xi − b)∥².
Thus, if we set the encoder bias to bE = −WE bD, we obtain a very similar altered autoencoder problem

(D∗, E∗) = arg min_{WE, WD, bD} (1/n) ∑_{i=1}^{n} ∥(I − WD WE)(xi + bD)∥².
Essentially, the only difference from PCA is that the encoder matrix WE is not necessarily the transpose of the decoder matrix WD for the autoencoder. Note furthermore that AA^T is an orthogonal projection onto the span of the columns of A (and I − AA^T projects onto its orthogonal complement). However, for the autoencoder, WD WE is not necessarily an orthogonal projection onto the subspace defined by the columns of WE^T.
Omitting the biases for a moment and looking at the autoencoder from a matrix factorization point of view, we consider the decomposition of the matrix X := (x1 . . . xn)^T ∈ R^{n×d} into X ≈ L WD^T with a matrix of n latent vectors ηi ∈ R^q, namely L ∈ R^{n×q}. To this end, we aim to solve the problem

arg min_{L, WD} ∥X − L WD^T∥²_F := arg min_{ηi, WD} ∑_{i=1}^{n} ∥xi − WD ηi∥².
If we assume that WD has orthonormal columns, this problem is exactly solved by the PCA. For this case, let us write WD as

WD = Vq,

where X = U S V^T is the SVD of X (see Section 4.2), and Vq contains the first q columns of V (see also Section 4.2). Now we can write the transformed data L as

L = X Vq = U S V^T Vq = U S Iq = Uq Sq.
Conversely, fixing WD, one can ask for the encoder matrix WE that minimizes the reconstruction error of

X̃ := X WE^T WD^T ∈ R^{n×d}.
From linear algebra it is known that the minimizer WE is the pseudo-inverse WD^† of WD given by

WE = WD^† := (WD^T WD)^{−1} WD^T

in the case that WD has linearly independent columns; see also Section 2.1. In the special case of the PCA this results in

WE = (Vq^T Vq)^{−1} Vq^T = Vq^T.
Therefore, the encoder is the transposed version of the decoder in this case. This can be seen as
a special kind of weight sharing between the encoder and decoder in the neural network.
Note that we assumed at the very beginning that WD has orthonormal columns to obtain
the special case WD = V q = WET . However, any decoder matrix whose columns span the
same subspace as the columns of V q also results in an optimal recovery in the Frobenius norm
sense. Therefore, there exist infinitely many autoencoder solutions with non-orthogonal columns
of WD which will result in the same low-dimensional space as the PCA does, but the low-dimensional vectors ηi will now be represented in a non-orthogonal basis.
In general, the real strength of autoencoders—compared to PCA—is of course given by the
additional nonlinearity, which enters through the activation functions ϕD and ϕE .
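The following NumPy sketch illustrates the last point: any decoder matrix whose columns span the PCA subspace, combined with its pseudo-inverse as encoder, yields the same reconstruction as the PCA itself. The toy data and the mixing matrix A are arbitrary choices made only for this illustration.

import numpy as np

rng = np.random.default_rng(0)
n, d, q = 200, 10, 3
X = rng.standard_normal((n, d)) @ rng.standard_normal((d, d))  # toy data
X = X - X.mean(axis=0)                                          # center the data

# PCA subspace via the SVD of X.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
V_q = Vt[:q].T                       # first q right singular vectors

# Decoder with non-orthogonal columns spanning the same subspace.
A = rng.standard_normal((q, q))      # arbitrary (almost surely invertible) mixing
W_D = V_q @ A
W_E = np.linalg.pinv(W_D)            # encoder = pseudo-inverse of the decoder

X_pca = X @ V_q @ V_q.T              # PCA reconstruction
X_ae = X @ W_E.T @ W_D.T             # linear autoencoder reconstruction
print(np.allclose(X_pca, X_ae))      # True: same reconstruction, different basis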
Usually, the encoding layers shrink monotonically in size until the stage of highest compres-
sion is reached, whereas the decoding layers grow in size. An example of a four-layer autoen-
coder with a two-layer encoder and a two-layer decoder is depicted in Figure 7.2.
31 The term deep autoencoder is also used to describe an architecture of several autoencoders chained together.
Besides fully connected networks, it is also possible to employ other types of networks within
the AE framework. For instance, when working with images, it is quite natural to use 2D-
convolutional layers in the encoder followed by pooling layers. The decoder is then built from
convolutional layers and upsampling layers, which enlarge the image by using bilinear interpo-
lation, for instance.
7.1.5 Regularization
As we have already seen in Section 7.1.3, the training of an AE is not a well-posed problem
since it admits more than one possible solution. Indeed, already in the most simple case, there are
infinitely many solutions for the weights even for the highly simplified setting from Section 7.1.3.
By adding nonlinearities via the activation functions this problem becomes much more severe.
For instance, there exist nonlinear space-filling curves whose parametrization can be used as a
decoder in order to perfectly reconstruct arbitrary high-dimensional vectors from real numbers;
see [Sag12]. While this might be a meaningful way to compress the data, it is often desirable
to have much more structured and smooth encoding and decoding functions. Numerically this
smoothness serves to avoid instabilities when minimizing the loss functional, but it can also be
beneficial to enable meaningful interpolations in the latent space.
We have already discussed that, with a rising number of degrees of freedom, overfitting quickly
becomes a problem for deep neural networks. As a remedy, dropout or weight sharing were
introduced in Chapter 6. Besides these methods, we want to hint at another way of regularizing
the occurring minimization problem in the following: In order to regularize the problem, we can
add a penalty term to the loss function. This enables us to steer the optimization algorithm to
a more suitable solution, e.g., a smoother or more structured solution. For instance, we could
penalize the (piecewise32 ) derivatives of the encoder and decoder to obtain smoother functions.
This leads to the minimization problem
arg min_{D,E} (1/n) ∑_{i=1}^{n} ∥xi − D ◦ E(xi)∥² + ν1 ∥∇x E(x)∥²_{L²(R^d)} + ν2 ∥∇η D(η)∥²_{L²(R^q)}
when penalizing the (Bochner33 ) L2 norms of the derivatives, for instance. Here, ν1 > 0 and
ν2 > 0 are (hyperparameter) weights, which determine how strong the penalization should be.
A famous method of regularization is by penalizing large numbers of non-zero weights. For
instance, we can sum the number of non-zero neurons in the output layer of the encoder and
use this sum as a regularization term. This is called a sparsity enforcing penalty, which leads
to so-called sparse autoencoders. Other variants compute the ℓ1 - or ℓ2 -norms of the encoder’s
last layer’s neurons. More sophisticated ways to induce sparsity, e.g., by using Kullback–Leibler
divergences of hidden unit activations, can be found in [Ng11].
ò Kullback–Leibler divergence
The Kullback–Leibler divergence is a specific measure for differences between two
probability distributions. Given two distributions of continuous random variables
with densities p1 and p2 , the Kullback–Leibler divergence between p1 and p2 is de-
fined by
KL(p1 ∥ p2) := Ep1[log(p1/p2)] = ∫_{−∞}^{∞} p1(x) log(p1(x)/p2(x)) dx.        (7.2)
For discrete distributions there exists an analogous expression with a sum over all
possible values of the probability space instead of the integral. For more details on
Kullback–Leibler divergences or the more general concept of Bregman divergences,
we refer the reader to [BMDG05].
X Task 7.1. Import the MNIST data set. You can use the code introduced in Sec-
tion 3.7. Trim the training and test data sets such that only the digits 0, 1, 2, and 3
are used. Run a PCA with latent dimension q = 2 on the training data and plot the
images of the first ten digits of the original data set as well as the corresponding PCA
reconstructions (i.e., project the images into two dimensions via the PCA and then
calculate the corresponding high-dimensional representations thereof).
Create a plot of the latent space distribution of 1000 random samples from the training
data set, i.e., plot the two-dimensional coordinates of the embedded data points and
use a color to indicate the corresponding label/digit. Furthermore, create a plot with
a 25 × 25 grid of images corresponding to the high-dimensional representatives of
the two-dimensional points at the latent space coordinates x, y ∈ {−4, −4 + s, −4 +
2s, . . . , 4 − s, 4} with s = 1/3. An example of such a plot—but computed with a deep
variational autoencoder instead of PCA—can be found in Figure 7.4 at the end of this
chapter.
X Task 7.2. Use Keras to build a simple autoencoder with one hidden layer and q = 2
(see Figure 7.1) to encode and decode the trimmed MNIST training data set from
Task 7.1. Train it by using Adam for a least squares loss with batch size 64 for 15
epochs. As for PCA in Task 7.1, plot the first ten digit reconstructions, the two-
dimensional coordinates of the first 1000 encoded samples, and the 25 × 25 grid with
the images for the corresponding latent space coordinates. Compare the results to the
ones from Task 7.1.
(xi , y i ) ∼ µ ∀ i = 1, . . . , n.
In statistics, this task can be looked at from a more general perspective. To this end, we consider
the task of learning the conditional probabilities Pµ (Y |X = x) of a random variable Y on Γ that
is distributed according to the conditional measure µY |X (x, ·) of Y . Here, the measure is condi-
tioned on the assumption that the outcome of a random variable X on Ω distributed according to
the marginal µX of µ on Rd is X = x. This task is referred to as discriminative modeling; see
also [NJ02]. Recall that for neural networks, for instance, we already learned such a probability
distribution by applying the softmax function at the output layer for our classification tasks.
In the following, we return to the unsupervised learning task of dimensionality reduc-
tion, where only data points x1 , . . . , xn ∈ Rd are present and latent space representatives
η 1 , . . . , η n ∈ Rq are sought. We use this notation to reflect the resemblance to the dimension-
ality reduction tasks from before. Nonetheless, we retain the viewpoint of learning conditional
probabilities between an ambient space variable X ∈ Rd and a latent space variable Y ∈ Rq in
the same fashion as in the supervised example above.
Generative modeling
In contrast to discriminative modeling, where the conditional measure of Y given
X is modelled, there is generative modeling [NJ02]. This term is sometimes used
ambiguously to describe either the modeling of the joint distribution Pµ (X, Y ) or
the modeling of the conditional distribution Pµ (X|Y = η). Note that the latter
distribution can be inferred from the joint one. In this book, we will stick to the case
of approximating Pµ (X|Y = η) when referring to generative modeling.
The idea of generative modeling might seem odd at first since we are aiming to infer information
on the distribution of the observables given the latent variables. However, having a closer look
at the decoder D of our autoencoder in Section 7.1, we see that it is indeed trained to reconstruct
the observables from their latent space representation. It therefore can be seen as a generative
model. As a matter of fact, it generates samples given some latent coordinates.
The latent space distribution While the generative property of D in our AE is nice to have
(e.g., for data augmentation or for the computation of reconstruction errors), we did not yet model
the decoder probabilistically and, thus, have no information on the latent space distribution.
Thus, we are able to compute an x ∈ Rd for a given η ∈ Rq , but we do not know how the
reconstructed x—or, more generally, Pµ (X|Y )—behaves if we vary η gradually. To this end,
we need to have information (or a model) on the latent space distribution of Y as well. This will
be done in the following section by the variational autoencoders.
Variational Autoencoders
Variational autoencoders (VAEs) create a generative decoder by modeling the latent
space distribution and by approximating the conditionals Pµ (X|Y ) and Pµ (Y |X).
For a more detailed consideration of VAEs, we refer the reader to [Doe16, KW13,
KW19, Mur23].
In order for this to be meaningful, the decoder density p(X|Y ) has to be powerful enough to
adequately represent the true conditional Pµ (X|Y ). To this end, it is modeled as
p(X|Y = η) ∼ N(f(η), σ²I),
arg max_f Pp(X) = arg max_f log(Pp(X)) = arg max_f log(p(X = X))
               = arg max_f log( ∏_{i=1}^{n} p(X = xi) ) = arg max_f ∑_{i=1}^{n} log(p(X = xi))
               = arg max_f ∑_{i=1}^{n} log( ∫_{Rq} p(xi|Y = η) p(η) dη ).        (7.3)
This expression maximizes the probability of drawing X under the given model. The product
over all data points appears due to the fact that the data xi are assumed to be independent real-
izations of X. Computing or even approximating these integrals in every step of the optimization procedure is a hard and computationally intensive task. For instance, we could draw m random samples ηj ∼ p(Y) and compute ∑_{i=1}^{n} log(p(xi|Y = ηj)) for each j = 1, . . . , m to then approximate the above integrals by a (1/m)-normalized sum over the m samples. But this so-called
Monte Carlo estimator is known to converge only slowly to the true integral, and we thus need a
very large number m of samples to achieve a small approximation error.
An established method from statistics to reduce the number m of necessary samples is impor-
tance sampling. Here, the η j are drawn only in areas where they contribute much to the actual
integrand, i.e., where p(xi |Y = η j ) is large. To this end, we construct a density r(Y |X = x)
on Rq , which aims to sample those η that are likely to produce x, i.e., for which p(x|Y = η)
is large. Then, we sample according to r(Y |X = x) instead of to the prior p(Y ). But we thus
compute

Er[p(x|Y = η)] := ∫_{Rq} p(x|Y = η) r(η|X = x) dη        (7.4)
instead of Ep [p(x|Y = η)], which we wanted to maximize in the first place in (7.3). Therefore,
we have to find a relation between the two quantities Er [p(x|Y = η)] and Ep [p(x|Y = η)].
ò Bayes’ theorem
Bayes’ theorem states that the posterior probability P[A | B] of event A given event
B with P[B] > 0 can be written as
P[A | B] = (P[B | A] P[A]) / P[B].
Here, P[A] is the prior probability, P[B | A] is the likelihood of B under A, and P[B]
is called evidence; see [Geo12] for more details. The application of Bayes’ theorem
to compute the posterior is called Bayesian inference.
With Bayes’ theorem we can relate the posterior to the prior via

p(η|X = x) = p(x|Y = η) p(Y = η) / p(X = x).
which is equivalent to

log(p(X = x)) − Kr,p = Er[log(p(x|Y = η))] − KL(r(η|X = x) ∥ p(Y = η)).        (7.5)
This is the essential equation for the optimization of VAEs. The right-hand side resembles the
so-called evidence lower bound (ELBo). The name stems from the fact that Kr,p ≥ 0 and
p(X = X) from (7.3) is the (model) evidence. We see that the ELBo equals the log-evidence
up to the Kullback–Leibler divergence Kr,p between r and the true posterior. Therefore, if
r(η|X = x) is a good approximation to p(η|X = x), Kr,p becomes small and we are es-
sentially maximizing (7.3) when maximizing the sum of the ELBos for each x = xi . In this
way, maximizing the ELBo achieves two things:
1. We maximize the likelihood of the given data X, and, thus, our generative model p(X|Y )
becomes better.
2. We minimize the KL divergence between r and the true posterior, and, thus, the so-called
inference model r becomes better.
where the minimization has to be understood with respect to the weights and biases of the net-
works corresponding to f , g1 , and g2 . This can easily be done by the known optimizers from
Section 6.5 if we are able to evaluate the ELBo in an efficient way. Note that we usually would
not sum over all i = 1, . . . , n data points in (7.6) in each optimization step, but rather use a
minibatch approach as in Algorithm 10 in Section 6.3.
Let us now consider the terms of the ELBo in more detail. By modeling the prior p and
the density r according to the corresponding normal distributions above, the Kullback–Leibler
divergences in (7.6) are taken between two Gaussians. In this case we have two densities p̂k ∼
N (mk , Σk ) for k = 1, 2, given by
1 1 T −1
p̂k (η) = q 1 exp − (η − m k ) Σ k (η − m k ) .
(2π) 2 det(Σk ) 2 2
With Kp̂1,p̂2 := KL(p̂1 ∥ p̂2) = Ep̂1[log p̂1 − log p̂2] this leads to

Kp̂1,p̂2 = (1/2) log(det Σ2 / det Σ1) + (1/2) Ep̂1[ (η − m2)^T Σ2^{−1} (η − m2) − (η − m1)^T Σ1^{−1} (η − m1) ]
        = (1/2) log(det Σ2 / det Σ1) + (1/2) Ep̂1[ tr( Σ2^{−1} (η − m2)(η − m2)^T ) ]
          − (1/2) Ep̂1[ tr( Σ1^{−1} (η − m1)(η − m1)^T ) ]

since y^T x = tr(xy^T) for any two vectors x, y ∈ R^q, where q is the dimension of the latent space. Because of Ep̂1[(η − m1)(η − m1)^T] = Σ1, we obtain

Kp̂1,p̂2 = (1/2) log(det Σ2 / det Σ1) + (1/2) Ep̂1[ tr( Σ2^{−1} (ηη^T − 2 m2 η^T + m2 m2^T) ) ] − (1/2) tr(Σ1^{−1} Σ1).
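As a quick numerical sanity check, the following NumPy sketch compares the closed-form Kullback–Leibler divergence between a diagonal Gaussian N(m1, diag(s1)) and the standard normal N(0, I)—the special case needed below—with a Monte Carlo estimate; the toy parameters and the sample size are arbitrary.

import numpy as np

rng = np.random.default_rng(0)
q = 4
m1 = rng.standard_normal(q)          # mean of p_hat_1
s1 = rng.uniform(0.5, 2.0, size=q)   # diagonal variances of p_hat_1; p_hat_2 = N(0, I)

# Closed form: KL(N(m1, diag(s1)) || N(0, I)) = 0.5 * sum(s1 + m1^2 - 1 - log s1).
kl_closed = 0.5 * np.sum(s1 + m1**2 - 1.0 - np.log(s1))

# Monte Carlo estimate of E_{p_hat_1}[log p_hat_1 - log p_hat_2].
z = m1 + np.sqrt(s1) * rng.standard_normal((100_000, q))
log_p1 = -0.5 * np.sum((z - m1)**2 / s1 + np.log(2 * np.pi * s1), axis=1)
log_p2 = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
kl_mc = np.mean(log_p1 - log_p2)
print(kl_closed, kl_mc)              # the two values should be close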
With p̂1 = r(·|X = x) ∼ N (g1 (x), exp(g2 (x))I) and p̂2 = p(Y = ·) ∼ N (0, I), we then
obtain
with respect to the weights and biases of g1 and g2 . Thus, we indeed can use the simple one-
shot estimator described in Section 7.4.3 for forward propagation of the corresponding neural
networks, but we cannot use it directly for backward propagation. To remedy this issue, we first
need to employ the so-called reparametrization trick: We parametrize a random variable η(x)
drawn according to r(η|X = x) ∼ N (g1 (x), exp(g2 (x))I) by
η(x) := g1(x) + √(exp(g2(x)) I) · z,        (7.7)

where z ∼ N(0, I) is a standard normally distributed random variable. Here, the square root of exp(g2) has to be understood componentwise. Then,

∇g1,g2 Er[log(p(x|Y = η))] = ∇g1,g2 Ez[ log( p(x|Y = g1(x) + √(exp(g2(x)) I) · z) ) ]
                           = Ez[ ∇g1,g2 log( p(x|Y = g1(x) + √(exp(g2(x)) I) · z) ) ],
Figure 7.3: A variational autoencoder with d = 6 and q = 2. The encoder computes the vectors
g1 (x) and g2 (x) for the mean and the log-variance of the distribution r. Then η(x) is formed
according to (7.7). Consequently, the decoder computes f (η(x)).
The encoder, when called with a high-dimensional sample x, now computes g1 (x) and g2 (x).
A corresponding latent space sample (according to r) can then be drawn by the reparametrization
trick (7.7).
In order to generate new high-dimensional samples from a random latent space element, we
simply draw η ∼ N (0, I) and then run our trained decoder to finally obtain f (η).
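A minimal NumPy illustration of these two sampling steps, with the encoder outputs g1(x), g2(x) and the decoder f replaced by placeholder values (all numbers below are hypothetical):

import numpy as np

rng = np.random.default_rng(0)
q = 2

# Placeholder encoder outputs for one input x: mean g1(x) and log-variance g2(x).
g1_x = np.array([0.3, -1.2])
g2_x = np.array([-0.5, 0.1])

# Reparametrization trick (7.7): eta(x) = g1(x) + sqrt(exp(g2(x))) * z, z ~ N(0, I).
z = rng.standard_normal(q)
eta_x = g1_x + np.sqrt(np.exp(g2_x)) * z

# Generating a new sample: draw eta ~ N(0, I) and apply the (trained) decoder f.
eta_new = rng.standard_normal(q)
# x_new = f(eta_new)   # f would be the trained decoder network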
X Task 7.3. Implement a variational autoencoder (see Figure 7.3) with q = 2. The
encoder only consists of the input layer and the layer for g1 and g2 . The decoder only
consists of the layer for η and the output layer. The variance of the decoder is set to
σ 2 = 1. Plot the analogous reconstructions, the scattered data in the latent space, and
the 25 × 25 latent space grid as in Task 7.2. What do you observe?
Hints:
• To draw samples in the latent space, you can use the keras.backend functions,
e.g., keras.backend.random_normal.
• By using the layers and Model class from Keras, you can build the encoder
by defining the output as a list of outputs, e.g.,
keras.Model(name="V_Encoder", inputs=inputs,
            outputs=[mean, log_var, sample]),
where mean, log_var, and sample are the outputs of the corresponding layers
evaluated on the inputs.
• Define a custom loss function
def VAE_loss(data, results):
in which you compute the negative ELBo as described in Section 7.4.3. The
keras.backend functions, e.g., mean, exp, sum, should be helpful again.
X Task 7.4. Add one intermediate layer to both the encoder and decoder from Task 7.3
with size half of the input layer size. Redo the experiments of Task 7.3 and observe
the difference in the resulting plots. An example for the 25 × 25 latent space grid plot
can be found in Figure 7.4.
Figure 7.4: Illustration of the two-dimensional latent space for a variational autoencoder with
three hidden layers trained on the first four digits (0, 1, 2, and 3) of the MNIST data set. We
divided the area [−4, 4]2 of the latent space into a 25 × 25 grid. Subsequently, we drew a sample
at each grid point and used the decoder to obtain the corresponding image in the high-dimensional
space. The resulting 25 × 25 images are shown at their respective coordinate positions. Created
from the MNIST data set.
against the discriminator. By training the discriminator with the help of the data produced from
the generator, it becomes more robust against such attacks.
Interpretable latent spaces Besides having a meaningful prior distribution in the latent
space, which can be used to generate high-dimensional samples via the decoder, it is also de-
sirable to be able to interpret the coordinates of the latent space. This can be of special interest if
one tries to infer causal relationships between latent space variables and high-dimensional sam-
ple behavior. To this end, special regularization terms are employed to guarantee an interpretable
latent space distribution; see, e.g., [TK21].
Dynamical variational autoencoders So far, the introduced autoencoders treated each in-
put data point independently when building the latent space. However, when there are tempo-
ral correlations in the data (e.g., for time series data), it makes sense to use latent spaces and
autoencoder models that take this into account as well. Here, so-called dynamical variational
autoencoders aim to model joint distributions of whole sequences of input and latent space data.
We refer the reader to [GLB+ 21] for a detailed survey.
Chapter 8
Deep Neural Networks and Differential Equations
In this chapter, we will explore relationships between neural network architectures and differ-
ential equations, especially ordinary differential equations (ODEs). In particular, we will have
a look at neural tangent kernels [JGH18] in Section 8.1, Hamiltonian networks [HR18] in Sec-
tion 8.2, neural ODEs [CRBD18] in Section 8.3, and generative diffusion models [HJA20] in
Section 8.4. Here, we aim to hint at the relation between those methods and differential equa-
tions without going into technical details. For a more thorough reference on these methods and
their connection to differential equations, we refer the reader to the corresponding cited articles
and to [EMW20].
In contrast to other chapters, we do not provide any tasks in this chapter since the covered
topics are quite advanced and more theoretical in nature. Furthermore, they serve to highlight
the connection between modern machine learning methods and advanced mathematical concepts
instead of dealing with the algorithmic details of these methods. However, we feel that this
chapter is still quite important when it comes to understanding the mathematical background of
neural networks.
for a vector-valued function ψ : Rd × Rdp → RdL , which depends on the input, but also on the
weights, biases, and activation functions of the previous l = 1, . . . , L − 1 layers. In this way,
we have a direct analogy to SVM or kernel methods in general, where ψ reflects the feature map
that is chosen to transform the data; see (3.10) and (3.11).
However, the difference between the SVM and the hidden-layer neural network model is that
ψ—or equivalently the corresponding kernel K—has been chosen a priori for an SVM, whereas
here, i.e., in the deep neural network model, ψ depends on the degrees of freedom (namely the
weights and biases of the hidden layers). Therefore, the kernel changes during the optimization
steps of, e.g., stochastic gradient descent. The idea behind the neural tangent kernel is to define a
kernel that—in special settings—achieves both: It resembles the network’s structure and it stays
constant during the learning process. In this way, we can use the kernel to run methods like SVM
(see Section 3.5) or kernel ridge regression (see [Mur22, SS02]), which have a well-understood
convergence behavior and guaranteed global optima, instead of running SGD-type optimizers for
highly nonconvex and nonlinear optimization problems that stem from deep neural networks.
Besides the idea of neural tangent kernels, there also exist methods that are hybrids between
kernel methods and deep neural networks called deep kernel networks; see, e.g., [BGR19].
The neural tangent kernel (NTK) is defined as

KNTK(x, z; p) := ∇p f(x; p)^T ∇p f(z; p),

with the corresponding feature map35 ψNTK(x; p) := ∇p f(x; p). The NTK serves
to describe the dynamics of the training process of a neural network with gradient
descent-type methods such as SGD or Adam. More details can be found in [JGH18,
RYH22].
In order to establish the connection between the NTK and the network’s training process, let us
first look at the kth update step of SGD. Recalling Section 6.5, it can be written as
pk+1 = pk − ν∇p ΘB (pk ),
35 The name neural tangent kernel stems from the fact that the feature map ψ
NTK is the tangent vector of f in the space
of weights and biases of the network.
with step size ν and ΘB (pk ) being the evaluation of the (batch) loss function CB of the network’s
output f for the current weight and bias parameters pk . If we rewrite this as
(pk+1 − pk)/ν = −∇p ΘB(pk),        (8.2)

we can think of the left-hand side as a discrete derivative of the trajectory of the iterates pk. In the limit ν → 0 we would just encounter the derivative of a time-dependent p, i.e., ∂p(t)/∂t. In this regard, we can reinterpret (8.2) as a time-explicit Euler discretization with step width ν for the ordinary differential equation

∂p(t)/∂t = −∇p ΘB(p(t)).        (8.3)
The trajectory described by this ordinary differential equation is often called gradient flow.
Having a closer look at (8.3) and recalling Section 6.3, we see that the right-hand side is given by

−∇p ΘB(p) = −(1/|B|) ∑_{i∈B} ∇p Ci(f(·; p)) = −(1/|B|) ∑_{i∈B} Ci′(f(·; p)) ∇p f(xi; p)        (8.4)
          = −(1/|B|) ∑_{i∈B} Ci′(f(·; p)) ψNTK(xi; p),
and the evaluation essentially boils down to computing the feature map evaluations ψNTK (xi , p)
because the derivative Ci′ (f (·, p)) of the one-sample least squares loss is directly accessible. For
a least squares loss function, for example, we have Ci′ (f (·, p)) = 2(f (xi ; p) − yi ); see also the
backpropagation equations in Section 6.3.
The connection between the kernel and the dynamics of the gradient flow becomes even more obvious when we use the chain rule to compute

∂f(x; p(t))/∂t = (∇p f(x; p(t)))^T ∂p(t)/∂t
              = −(1/|B|) ∑_{i∈B} Ci′(f(·; p(t))) (∇p f(x; p(t)))^T ∇p f(xi; p(t))        (by (8.3), (8.4))
              = −(1/|B|) ∑_{i∈B} Ci′(f(·; p(t))) KNTK(x, xi; p(t)).
Here, we directly see that the NTK determines how f evolves over the course of time, i.e., during
the iteration process of SGD.
From this equation, we can derive that ΘB (p(t)) is strictly decreasing over time if the NTK
is positive definite for all p(t); see also Section 3.5 for more details on kernel properties like
positive definiteness. Therefore, if we have a convex loss function that is bounded from below,
like, e.g., the least squares loss function, the SGD algorithm will converge to the global optimum
if the NTK is positive definite.
In this way, the positive definiteness of the NTK serves as a sufficient criterion for convergence
of the network’s optimization process. However, analyzing KNTK for all possible trajectories of
p(t) is usually not possible. Therefore, we aim for a setting where the kernel does not change
(much) during the SGD process.
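For small networks, the NTK can be inspected numerically. The following NumPy sketch approximates ψNTK(x; p) = ∇p f(x; p) by central finite differences for a toy two-layer ReLU network and forms KNTK(x, z; p) as the inner product of two such feature vectors; the network size and the inputs are arbitrary choices for this illustration.

import numpy as np

rng = np.random.default_rng(0)
d, M = 2, 50                               # input dimension and hidden width

# A tiny two-layer ReLU network f(x; p); p collects all weights and biases.
W1, b1 = rng.standard_normal((M, d)), rng.standard_normal(M)
w2 = rng.standard_normal(M)
p0 = np.concatenate([W1.ravel(), b1, w2])

def f(x, p):
    W1 = p[:M * d].reshape(M, d)
    b1 = p[M * d:M * d + M]
    w2 = p[M * d + M:]
    return w2 @ np.maximum(W1 @ x + b1, 0.0)

def grad_p(x, p, eps=1e-5):
    # psi_NTK(x; p) = gradient of f(x; p) with respect to p, via central differences.
    g = np.zeros_like(p)
    for j in range(len(p)):
        e = np.zeros_like(p)
        e[j] = eps
        g[j] = (f(x, p + e) - f(x, p - e)) / (2 * eps)
    return g

def ntk(x, z, p):
    return grad_p(x, p) @ grad_p(z, p)

x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(ntk(x, z, p0))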
f(z; p) ≈ f(z; p0) + ψNTK(z; p0)^T (p − p0) =: flin(z; p)   for p ≈ p0.        (8.5)
If p does not vary too much from the initialization p0 of the SGD process, the second order term
in ∥p − p0 ∥ can be neglected and flin becomes a decent approximation of f . Moreover, flin is
completely linear in p, which means that taking gradient descent steps for the optimization of
flin with respect to p is equivalent to following the negative gradient direction −ψNTK (z; p0 ),
which only depends on the initial parameters p0 . Furthermore, to achieve global convergence of
the SGD procedure, we would only have to ensure that KNTK (·, ·; p0 ) is a positive definite kernel
function because the NTK of flin does not change over time.
The important observation, which enables us to work with (8.5), is that p changes less and
less between iterations when we grow the size M := d2 = d3 = · · · = dL of each hidden layer,
i.e., when we increase the number of neurons in the hidden layers. More formally, for the least
squares loss function and ReLU activations, it holds that the NTK converges (in probability) to
a kernel function, which only depends on the initial weights and biases p0 and does not change
during training via SGD when M → ∞; see [JGH18]. Furthermore, if the limit kernel function
is positive definite, we have

sup_t |f(z; p(t)) − flin(z; p(t))| = O(M^{−1/2})
for all z with probability arbitrarily close to one; see [LXS+ 19] for more details.
In this way, we obtain an NTK, which is constant during the training process. Additionally, if
it is positive definite it is guaranteed that we find a global optimum with SGD.36 If we consider
only input vectors with norm one, for instance, it can indeed be proven that the corresponding
infinite-width NTK is positive definite; see [JGH18]. However, we also see that this result only
holds in the limit case of infinitely large layer sizes, i.e., infinitely many neurons in each hidden
layer, which of course does not resemble a real neural network. Nevertheless there exist attempts
to achieve similar results for finite networks; see [AZLS19]. First approximation-theoretic results
on neural tangent kernel spaces can be found in [EW21].
To conclude this section, we present a plot of several NTKs for a toy example in Figure 8.1.
To this end, we trained a three-layer fully connected network with varying sizes M of the inter-
mediate layers on 100 samples of the function f˜ : S2 → R defined on the unit circle as
Note that atan2(z2 , z1 ) defines the angle between (z1 , z2 ) and the x1 -axis. As we see in Fig-
ure 8.1, the kernels for a fixed M vary slightly depending on the weight initialization. However,
when M becomes larger, the variation of the NTK values becomes smaller. Furthermore, the
NTK after 200 epochs of training is deviating a lot from the corresponding NTK after initializa-
tion for small M , but both are very close to each other for larger M . This resembles the fact that
the NTK is constant during the training process when the number of intermediate layer neurons
goes to infinity.
SGD iterations when M → ∞. However, the collective changes influence the network function f just enough such that
training remains meaningful; see [LXS+ 19].
Figure 8.1: Top: Neural Tangent Kernel (NTK) for a three-layer network with 2 input neurons,
M neurons at both intermediate layers and 1 output neuron. We omit biases for simplicity. The
weights are each initialized randomly according to N (0, 1). We depict the resulting NTK for
different intermediate layer sizes M directly after the random initialization and after 200 epochs
of training with 100 samples of the function (8.6). For each M we ran 5 experiments with
different random initializations. We depict the value of the corresponding kernel KNTK (z, y; p)
for a fixed value of y ∈ R2 and in dependence of θ = atan2(z2 , z1 ). Bottom: The corresponding
network functions after initialization and after 200 epochs of training.
data, such as classification [HZRS16], semantic segmentation [CPK+ 18], and object detection
[RHGS15], for instance. One step of a ResNet forward propagation can be written in the form
⃗o(l+1) = ⃗o(l) + δt · ϕ(l+1)( (W(l))^T ⃗o(l) + ⃗b(l+1) ),        (8.7)
where δt > 0 is some positive scaling parameter. Note that this implies that ⃗o(l) ∈ Rd is of
the same size d for each layer 0 ≤ l ≤ L. To end up with only one output neuron, e.g., for
real-valued regression, we would still need to add a final layer reducing the output size to one. If
we assume that the activation function ϕ(l+1) = ϕ is the same in each iteration step l = 0, . . . , L,
This ODE now describes the forward propagation process of the neural network model. This is
in contrast to (8.3), which resembles the iterative optimizer used to minimize the loss function.
Note that we employed ν as a step size in (8.2) since we also used this letter to denote the learning
rate in SGD; see Algorithm 10. Now, as (8.8) is no longer related to SGD, we use δt in (8.8)
instead.
is fulfilled for all eigenvalues λi (J (l) ) of the lth layer Jacobian J (l) of the right-hand side of
(8.8).
This reinterpretation motivates the creation of new forms of neural networks that allow for
stable evaluations, e.g., networks with anti-symmetric weight matrices or Hamiltonian-based
networks. For the latter, the variable ⃗o(t) from (8.9) is split into two vectors ⃗y (t) and ⃗z(t) (e.g.,
of equal length), which are the solutions to the coupled Hamiltonian ODE system
⃗y˙(t) = ϕ( W(t)⃗z(t) + ⃗b(t) )   and   ⃗z˙(t) = −ϕ( W(t)^T ⃗y(t) + ⃗b(t) ).
Because of its structure, this system is automatically stable and, thus, also allows for stable
forward propagation if appropriate discretization schemes are chosen. An example presented
in [HR18] is the symplectic so-called Verlet integration scheme (see also [WNH93]), which
alternates between updating the y and z values, i.e.,

⃗z(l+1/2) = ⃗z(l−1/2) − δt · ϕ( (W(l))^T ⃗y(l) + ⃗b(l+1) )
and ⃗y(l+1) = ⃗y(l) + δt · ϕ( W(l) ⃗z(l+1/2) + ⃗b(l+1) )
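As a small illustration, the following NumPy sketch runs this alternating Verlet-type forward propagation for a toy network; using a single fixed weight matrix and bias for all steps and a tanh activation is a simplifying assumption made only for this example.

import numpy as np

rng = np.random.default_rng(0)
d, L, dt = 4, 20, 0.1
phi = np.tanh                              # activation function

# For simplicity we reuse one weight matrix and bias for all layers/steps.
W = rng.standard_normal((d, d)) / np.sqrt(d)
b = np.zeros(d)

y = rng.standard_normal(d)                 # y^(0)
z = np.zeros(d)                            # z^(-1/2)

for l in range(L):
    z = z - dt * phi(W.T @ y + b)          # half-step update of z
    y = y + dt * phi(W @ z + b)            # full-step update of y

print(y)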
where K denotes a convolutional operator matrix applied to ⃗o(l) . The specific structure is chosen
because of its benign stability properties; see [RH20] for more details. This leads to a direct con-
nection between convolutional ResNets and partial differential equations (PDEs). For instance,
for image data, K can be chosen as a discretized version of the two-dimensional ∇ operator; see
[PTVF07] for possible discretizations. If ϕ = id, this results in a discretized version of the heat
equation
⃗o˙(t) = −∇^T ∇⃗o(t) = −∆⃗o(t),
where the Laplace operator on the right side operates on local patches in the pixel-space. Similar
to the ODE setting, [RH20] shows that, for any non-decreasing activation function ϕ, the forward
propagation of (8.10) is stable.
To allow for more general architectures than just convolutional ResNets, [BGK22] investigates
networks motivated by integro-partial differential equations, where not only local but also global
interactions in the pixel space are taken into account.
where Ĉ : Rd → [0, ∞) is a loss function mapping the d-dimensional output to a non-negative real number. To solve this optimization problem one needs to be able to compute gradients of the
ODE solver with respect to the weights and biases. This can be done by using the so-called
adjoint sensitivity method (see [CRBD18] for details), which just consists of solving a so-called
adjoint ODE backwards in time. To this end, we define the adjoint

⃗a(t) := ∂Ĉ(⃗o(T))/∂⃗o(t)
as the derivative of the loss function Ĉ at the final result ⃗o(T ) with respect to the intermediate
outcome ⃗o(t) for some 0 ≤ t ≤ T . It can then be shown that the derivatives of the loss function
with respect to the weights can be obtained by computing the integral
∂Ĉ(⃗o(T))/∂W = ∫_0^T ⃗a(t)^T ∂F(⃗o(t), W, ⃗b)/∂W dt;        (8.12)

see [CRBD18]. The formula for the bias derivative works analogously. To calculate the adjoint itself, we can solve the ODE

⃗a˙(t) = −⃗a(t)^T ∂F(⃗o(t), W, ⃗b)/∂⃗o(t)        (8.13)
with a given final value ⃗a(T ). In this way, we obtain the adjoint at any time 0 ≤ t ≤ T . For
example, for the training data set (xi, yi) ∈ Rd × R and for a vector-valued least squares loss

Ĉ(⃗o(T)) = (1/(dn)) ∑_{i=1}^{n} ∥⃗o(T)|i − yi∥²,

where ⃗o(T)|i denotes ⃗o(T) for the initial condition ⃗o(0) = xi, the final value would be

⃗a(T) = (2/(dn)) ∑_{i=1}^{n} ∑_{j=1}^{d} [⃗o(T)|i − yi]j.
distribution, i.e., the standard normal distribution in our case, and µ is established. This can
be seen in correspondence to VAEs, where the latent space samples were assumed to follow a
Gaussian distribution and were mapped to the original data space by the decoder; see Chapter 7.
x(i) := √(1 − βi) x(i−1) + √(βi) ε        (8.14)

for each i ∈ N. Here, x(i) denotes the ith iterate, 0 < βi < 1 is called the ith noise
level, and ε ∼ N (0, I) is a standard normally distributed random variable in Rd .
The original data point x(0) is perturbed by Gaussian noise in each iteration step. In
the limit i → ∞, the iterates x(i) become completely random, normally distributed
vectors.
The main idea of generative diffusion processes is to invert the above forward dif-
fusion process step by step to obtain data points distributed according to µ when
starting at a random Gaussian vector following the distribution N (0, I). Details can
be found in [HJA20, Mur23, SDWMG15].
Before we discuss how to build a generative diffusion process, let us first consider the connection
between diffusion processes and stochastic differential equations.
dX(t)/dt = µ(X(t), t) + σ(X(t), t) dW(t)/dt.        (8.15)
For more details on SDEs and on how to comprehend the time-derivative of W in
(8.15), we refer the reader to [KP92, Pro05].
Oftentimes, e.g., in mathematical finance [Hul14, Sam65] or in particle physics
[Ein56, KSZ96], W is modeled as multi-dimensional standard Brownian motion,
i.e., as a stochastic process whose increments W (t1 ) − W (t0 ) are independent and
distributed according to N (0, (t1 − t0 )I) for t1 > t0 > 0.
37 Another way to interpret a stochastic process is as a random function. In particular, a stochastic process evaluates to
In the context of generative diffusion models, we consider the so-called forward diffusion SDE
dX(t)/dt = −(1/2) β(t) X(t) + √(β(t)) dW(t)/dt        (8.16)
with some noise variance β : [0, ∞) → R. For our purposes, we are not interested in the time-
continuous equation (8.16), but in the Euler–Maruyama discretization [KP92] thereof, which is
a stochastic variant of the time-explicit Euler scheme; see, e.g., (8.2) and (8.8). By using a time
step size δt > 0, we obtain
(X(t + δt) − X(t))/δt = −(1/2) β(t) X(t) + (√(β(t))/√(δt)) ε,        (8.17)
where ε ∼ N(0, I) is a standard normally distributed random variable. Note that this can be rewritten as

X(t + δt) = (1 − β(t)δt/2) X(t) + √(β(t)δt) ε.
Now let us consider the forward diffusion equation (8.14) with x(i) = X(i) and βi = β(i)δt, i.e.,

X(i + δt) = √(1 − β(i)δt) X(i) + √(β(i)δt) ε.        (8.18)
Apart from the drift coefficient, i.e., the term in front of X(t) or X(i), respectively, both equations are the same. Finally, to obtain the connection between these equations, note that a first order Taylor expansion of √(1 − β(i)δt) around δt = 0 yields

√(1 − β(i)δt) = 1 − β(i)δt/2 + O((δt)²).
Thus, the Euler–Maruyama discretization (8.17) of the forward diffusion SDE (8.16) resembles
the forward diffusion process (8.14) up to second order in δt for δt → 0.
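The forward diffusion process itself is easy to simulate. The following NumPy sketch repeatedly applies the forward step to a data vector; the dimension and the noise schedule β1, . . . , βT are arbitrary placeholder choices.

import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 50
betas = np.linspace(1e-3, 0.2, T)          # placeholder noise schedule beta_1, ..., beta_T

x = rng.standard_normal(d)                 # some data point x^(0)
iterates = [x]
for beta in betas:
    eps = rng.standard_normal(d)           # eps ~ N(0, I)
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * eps   # forward step (8.14)
    iterates.append(x)

# For large T, iterates[-1] is approximately N(0, I) distributed.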
one can show that the (true) distribution of x(i−1) given x(i) and x(0) can be written as

p∗(x(i−1) | x(i), x(0)) ∼ N(m̄i, σ̄i I),
where both mean m̄i and noise level σ̄i depend only on x(i) , x(0) , and βj for 1 ≤ j ≤ i.
For more details and explicit formulas for m̄i and σ̄i , we refer the reader to [HJA20, Mur23].38
Since x(0) is not known when reversing the forward diffusion process to infer x(i−1) from x(i), we approximate the distribution p∗(x(i−1) | x(i), x(0)) by

p(x(i−1) | x(i)) ∼ N(mi, σi I)        (8.22)
with degrees of freedom mi and σi not depending on x(0) . Oftentimes, the variance coefficients
are fixed a priori, e.g., σi = βi ; see also [HJA20]. This means we are left with the task of
determining mi from given data x1 , . . . , xn . To this end, we create a neural network taking x(i)
as input and computing mi as output. To train the network, we could consider minimizing the
cross entropy

−Ep∗(x(0))[log(p(x(0)))]        (8.23)
over the given training data, i.e., the expected value Ep∗ (x(0) ) would become an average over
the x1 , . . . , xn . However, since p(x(0) ) is not directly accessible, we follow the same idea as in
Section 7.4.2 and (7.5), where we discussed the ELBo loss for variational autoencoders. Thus,
instead of minimizing (8.23) to obtain m1, . . . , mT, we minimize the ELBo

Ep∗[ log(p∗(x(T))) + ∑_{i=1}^{T} log( p(x(i−1) | x(i)) / p∗(x(i) | x(i−1)) ) ],        (8.24)
where T > 0 is the number of steps of the forward diffusion process. For details on the training
procedure, see Section 8.4.3 and [HJA20, Mur23, SDWMG15].
Finally, we observe that, after learning mi from (8.22), both the forward diffusion process
x(i) = √(1 − βi) x(i−1) + √(βi) εi

and the reverse process

x(i−1) = mi + √(σi) ε̄i
with εi , ε̄i ∼ N (0, I) are computed by sampling standard normally distributed random variables.
This has to be seen in analogy to the Euler–Maruyama discretizations (8.18) and (8.20) of the
corresponding forward and reverse diffusion SDEs (8.16) and (8.19).
38 An even more detailed derivation can be found in the excellent blog article https://lilianweng.github.io/posts/2021-
07-11-diffusion-models.
39 Its derivatives can then be computed by automatic differentiation.
i.e., we compute an average over the given training data. To evaluate (8.25), we assume that T is large enough such that xj^(T) is (approximately) standard normally distributed, i.e., we assume p∗(xj^(T)) ∼ N(0, I) for all j = 1, . . . , n. Furthermore, note that p∗(xj^(i) | xj^(i−1)) from (8.21) is known, i.e., we can evaluate it. Thus, we can compute xj^(i) for each i = 1, . . . , T and for each training data point xj, j = 1, . . . , n, by drawing samples according to (8.21), and we thus can evaluate (8.25).
Finally, assuming that p∗ (x(T ) ) ∼ N (0, I), we minimize (8.25) with respect to the weights
and biases of the network with an SGD-type algorithm.
To use the already trained generative diffusion process for data generation, we simply draw a
random vector x(T ) ∼ N (0, I), compute the corresponding mT with the trained neural network,
and apply (8.22) for i = T to obtain x(T −1) . This process of computing mi and applying (8.22) is
then repeated for i = T − 1, . . . , 1 to finally obtain the new data point x(0) ; see [HJA20, Mur23].
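To make the generation loop concrete, the following NumPy sketch runs the reverse process (8.22) with the fixed variance choice σi = βi mentioned above; the function mean_model and the noise schedule are hypothetical stand-ins for the trained network and the βi used during training.

import numpy as np

rng = np.random.default_rng(0)
d, T = 8, 50
betas = np.linspace(1e-3, 0.2, T)          # placeholder noise schedule

def mean_model(x_i, i):
    # Hypothetical stand-in for the trained network computing m_i from x^(i).
    return x_i / np.sqrt(1.0 - betas[i - 1])

x = rng.standard_normal(d)                 # x^(T) ~ N(0, I)
for i in range(T, 0, -1):
    m_i = mean_model(x, i)
    sigma_i = betas[i - 1]                 # fixed variance choice sigma_i = beta_i
    x = m_i + np.sqrt(sigma_i) * rng.standard_normal(d)   # reverse step (8.22)
# x is now the generated data point x^(0)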
8.5 Further topics
Physics-informed neural networks Since ODEs and PDEs often appear in applications
from physics, special types of networks, namely physics-informed neural networks (PINNs),
have been developed to specifically deal with such equations. Here, network structures, loss
functions, and regularization terms are adapted to meet specific physical requirements, e.g.,
the loss can be chosen such that it penalizes unphysical behavior in the network output; see
[BE21, RPK19] for an overview on PINNs and [vRMB+ 23] for a survey on informed machine
learning in general.
Figure 8.2: Example pictures (deep fakes) of fictitious people created by a generative diffusion
model [HJA20] trained on the CelebA data set [LLWT15] of real celebrities. Each row contains
(from left to right) the iterates Ai for i = 16, 12, 8, 4, 0 of the reverse diffusion process (8.26)
for different random instances of AT ∼ N (0, I) with T = 50. For the corresponding code to
generate the pictures, see https://huggingface.co/google/ddpm-celebahq-256. Created from the
CelebA data set.
Deep operator networks When we are dealing with parametrized ODEs or PDEs for which
many solutions have to be computed, e.g., in uncertainty quantification, it is computationally
beneficial to directly approximate the solution operator corresponding to the parametrized equa-
tion instead of its results for a large discrete set of parameter values. Here, neural networks
that learn nonlinear, continuous operators instead of functions have been developed, like so-
called DeepONets or Fourier neural operators (FNOs). We refer the reader to Section 10.1 and
[KLL+ 23, KLM21, LMK22, LKA+ 20, LJP+ 21] for more details.
SGD variants and ODEs Besides analyzing the performance of SGD via the neural tangent
kernel, which we did in Section 8.1, there also exist approaches studying properties of Nesterov’s
SGD variant (compare Section 6.5) and other momentum-based optimizers with the help of ODE
systems and gradient flows; see [MJ21, SBC16, WWJ16].
Chapter 9
Reinforcement Learning
Besides supervised and unsupervised learning, reinforcement learning is the third main category
in modern artificial intelligence. It is influenced by ideas of behaviorism that attribute learning
to a feedback of positive and negative reinforcement.
Reinforcement learning
The goal in reinforcement learning (RL) is to make sequential decisions in a given
environment, where a decision stems from a fixed set of actions. To this end, an
agent is considered that acts within the environment, i.e., based on a policy, the agent
decides which action to take in which state of the environment. A path of an agent in
the state space, i.e., a sequence of states resulting from the taken actions, is called a
trajectory. An agent is trained by using a quantitative feedback of how good (or bad)
a decision or a sequence of decisions has been. To this end, the agent is guided in its
decisions through rewards.
An important aspect of reinforcement learning is that the agent can only learn by
applying actions and by observing rewards. Thus, no or only incomplete details about
the environment are exposed to the agent. This is a key difference from the domain of
optimal control of dynamical systems [BCD97, Ber19, GP17, Pow22, Rec19, SB18,
Vin00].
In general, an agent’s decision that may lead to a high reward later, i.e., after fol-
lowing a trajectory of states, might yield no or even a negative immediate reward at
the time of making it. Therefore, actions are not based on the immediate reward, but
on their estimated long-term value over the whole trajectory, i.e., on their aggregated
reward. To summarize, reinforcement learning is concerned with the interaction of
an agent and its environment under uncertainty, in particular due to incomplete and
stochastic information.
trading. Surveys on reinforcement learning algorithms applied to tasks in robotics can be found
in [KBP13, PN17].
This chapter is structured as follows: As a first step towards RL, we consider deterministic
optimal control problems in Section 9.1. Here, we are given complete information about the
environment, which is assumed to be deterministic. We introduce state and action spaces, de-
terministic state dynamics, and a reward function. The goal is to find an optimal policy, which
determines the best—in terms of the value, i.e., the aggregated rewards—action to take when the
system is in a particular state. To this end, we first introduce the policy evaluation algorithm,
which computes the value for a given policy. Then, to actually determine an optimal policy, we
consider the value iteration algorithm. Furthermore, we provide the policy iteration algorithm,
which computes an optimal policy by alternating the optimization of a policy and its evaluation.
Subsequently, we generalize the setting by allowing probabilistic state transitions, i.e., the state
dynamics will be modeled by a probability distribution instead of a fixed function. Section 9.2
is then dedicated to the situation where only incomplete information about the environment is
available for reinforcement learning. Here, information cannot be directly accessed but is only
observed by following trajectories of state-action pairs. In such settings, an algorithm such as
Monte Carlo policy evaluation, SARSA, or Q-Learning needs to be employed to compute values
for trajectories and approximate optimal policies. We conclude this chapter with a discussion
of deep reinforcement learning in Section 9.3, where deep neural networks are used to approx-
imately determine optimal solutions. These function approximations are employed in case of
large state spaces when the previously discussed methods can no longer be directly applied due
to their huge costs.
The set A is the action space, and As is the set of admissible actions in state s. The
evaluation T (s, a) of the state dynamics T defines the next state when taking action
a ∈ As in state s ∈ S, and R(s, a) is the reward when taking the action a ∈ As in
state s ∈ S. Note that a deterministic, discounted MDP can also be represented as a
graph; see Figure 9.1. A function
π : S → A, which assigns an admissible action π(s) ∈ A_s to every state s ∈ S, is called a policy.
Figure 9.1: An example for the graph representation of a deterministic MDP environment. Here,
every state s ∈ S is represented by a node and every action a ∈ As is represented by an edge
pointing from s to another node. The label of an edge a originating in s is the reward R(s, a) ∈ R
for taking action a in state s.
Given a deterministic, discounted MDP as the environment, an agent can in any state s ∈ S
interact with it by choosing an action a ∈ As . To be more precise, given st ∈ S and at ∈ As
for some time index t ∈ N, the environment experiences a transition from the state st to a
state st+1 ∈ S, which is determined by the state dynamics map T , i.e., st+1 = T (st , at ).
Consequently, the agent obtains a reward rt+1 ∈ R, which is defined by R, i.e., rt+1 = R(st , at ).
The trivial situation of having one fixed action space A = As for all s ∈ S represents the case
of state-independent actions. The continuing interaction of the agent with the environment is
illustrated in Figure 9.2. In summary, the agent encounters a sequence of events, represented
by a trajectory of states st ∈ S, the taken actions at ∈ Ast , the resulting next states st+1 =
T (st , at ) ∈ S, and the rewards rt+1 = R(st , at ) ∈ R for t ∈ N. The goal is now to find a policy
π ∗ : S → A that gives the best action to take in a state in terms of the aggregated rewards. The
question is then how such a policy can be determined. This will be explained in more detail in
the following.
First, to be able to compare the quality of different policies at all, we define the discounted
cumulative reward for a given policy π : S → A and a starting state s0 ∈ S as
V^π(s_0) := Σ_{t=0}^{∞} γ^t R(s_t, π(s_t)).   (9.1)
V π is also called the value function for the policy π. In (9.1), s0 , s1 , s2 , . . . denotes the trajectory
which we obtain when we follow π for the initial state s0 , i.e., st+1 = T (st , π(st )) for all t ∈ N.
Figure 9.2: The repeated interaction between agent and environment. After each action the new state and the corresponding reward are observed.
Figure 9.3: A policy for the example from Figure 9.1. Here, we visualize a specific policy
π : S → A by bold edges. For this policy, the value of V π (for γ = 1) is denoted by the numbers
inside the nodes. This means that the value in node s is V π (s). Note that the rightmost node in
the graph represents a terminal state, i.e., actions do not have any effect.
Here, the discount factor γ ∈ (0, 1] models the effect of delayed rewards. We refer the reader to
Figure 9.3 for an abstract example of such a sequential, deterministic Markov decision process
with a value function V π for a chosen policy π.
The optimization problem A policy π is called optimal if it maximizes the discounted cu-
mulative reward V π . Thus, the task of optimal control can now be stated as the following opti-
mization problem: Find an optimal policy π ∗ : S → A from the set {π : S → A} of all policies
such that
π^*(s) := arg max_π V^π(s)   ∀ s ∈ S.   (9.2)
The cumulative reward for the optimal policy π^* is called the optimal value function and is denoted40 by V^*(s) := V^{π^*}(s).
For γ = 1 we speak of an undiscounted optimal control problem; otherwise, for 0 < γ < 1,
the problem is called discounted. Note that, for an undiscounted problem, neither the existence
nor the uniqueness of a solution of (9.2) can be guaranteed. However, as we will see later on, a
solution exists for any discounted problem since the value function V π is bounded for γ < 1 and
any bounded reward function R. Note furthermore that the sum in (9.1) is sometimes truncated, i.e., the finite-horizon sum
40 Note that while the optimal value function can be shown to be unique, optimal policies (or actions) are not neces-
sarily unique. In this chapter, arg max is therefore understood as being one of the arguments that are maximizing the
expression.
V^π(s_0) := Σ_{t=0}^{T} R(s_t, π(s_t))
is used instead of (9.1). In such a case, the problem (9.2) is called a finite horizon problem.
Otherwise, it is called an infinite horizon problem.
Certain states are often considered to be terminal. This means that actions in a terminal state no
longer have any effect (e.g., when a game ends, when the agent has exhausted its capital in trading,
or when a robot has reached a destination in a navigation task), and one may think of those states
as absorbing states. To include these states in our MDP model, we denote a state s as terminal
if all possible actions a ∈ As taken in this state result in zero rewards and lead back to the same
state s, i.e.,
T (s, a) = s, R(s, a) = 0 ∀ a ∈ As . (9.3)
If an agent reaches a terminal state at time T ∈ N, the discounted cumulative reward (9.1)
is effectively finite because all summands for t > T are zero. In practice, the environment
provides the information on whether a state is terminal or not. For example, the rightmost state
in Figure 9.3 is a terminal state since (9.3) holds.
The Bellman equation From the general definition of the discounted cumulative reward (9.1)
we see that V π (s0 ) can also be formulated in a recursive way. To this end, we split the sum into
the first term and the remainder, which is itself again a representation of the value function, i.e.,
V^π(s_0) = Σ_{t=0}^{∞} γ^t R(s_t, π(s_t))
         = R(s_0, π(s_0)) + Σ_{t=1}^{∞} γ^t R(s_t, π(s_t))   (9.4)
         = R(s_0, π(s_0)) + γ V^π(s_1).
Thus, for a general state s ∈ S, the value function satisfies
V^π(s) = R(s, a) + γ V^π(T(s, a))   (9.5)
with a = π(s). Furthermore, the optimal value function V^*, i.e., the cumulative reward for the optimal policy (9.2), is given by the Bellman equation
V^*(s) = max_{a∈A_s} ( R(s, a) + γ V^*(T(s, a)) ).   (9.6)
This way, V^* can be characterized without referring to a specific policy π^*. In particular, choosing a maximizing action in each step yields an equation for the optimal policy
π^*(s) = arg max_{a∈A_s} ( R(s, a) + γ V^*(T(s, a)) )   (9.7)
in terms of the optimal value function V^*. The policy π^* now maximizes V^*(s) for every s. The idea behind the Bellman equation is Bellman's optimality principle.
The Bellman operator Let us now introduce the Bellman operator Bπ , which applies (9.5)
for a policy π, i.e.,
B^π : {v : S → R} → {v : S → R},   v ↦ ( s ↦ R(s, a) + γ v(T(s, a)) ),   (9.8)
with a = π(s). Moreover, let us define the Bellman optimality operator B, which applies (9.6),
i.e.,
B : {v : S → R} → {v : S → R},   v ↦ ( s ↦ max_{a∈A_s} ( R(s, a) + γ v(T(s, a)) ) ).   (9.9)
For finite state spaces S and finite action spaces A, it is not difficult to verify that Bπ and B
are contractions for γ < 1. It thus holds that
∥B^π v − B^π w∥_∞ ≤ γ ∥v − w∥_∞   and   ∥Bv − Bw∥_∞ ≤ γ ∥v − w∥_∞
for all v, w : S → R under natural conditions on the environment. Here, ∥ · ∥_∞ denotes the L^∞
norm of functions from S to R. Using Banach’s fixed point theorem one can now establish the
following result; see, for instance, [Ber17].
Theorem 9.1.1 (Bellman 1957). Let v : S → R and let γ < 1. The following hold:
(i) V π from (9.5) is the unique solution of the fixed point equation Bπ v = v.
(ii) V ∗ from (9.6) is the unique solution of the fixed point equation Bv = v.
(iii) For any policy π it holds that lim_{n→∞} (B^π)^n v = V^π.
(iv) It holds that lim_{n→∞} B^n v = V^*.
The above theorem states that a fixed point of (9.5) or (9.6) is a value function for every s ∈ S
when the equations are understood as recursive update equations.
The policy evaluation method computes (an approximation of) the value function V π for the
policy π; see Algorithm 11. This is done by repeatedly applying the Bellman operator Bπ to the
current estimate of V π .
Moreover, using a fixed point iteration with B instead of Bπ , we can compute an optimal value
function. This is the so-called value iteration method; see Algorithm 12. Here, one computes (an
approximation of) the optimal value function V ∗ , from which the corresponding (approximately)
optimal policy π ∗ is then derived.
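A minimal tabular sketch of value iteration in Python may help to fix ideas; the dictionary-based MDP representation, the tolerance, and all names are illustrative choices and not the notation of Algorithm 12.

def value_iteration(S, A, T, R, gamma=0.9, tol=1e-8):
    """Approximate the optimal value function V* of a deterministic MDP.
    S: iterable of states; A: dict state -> admissible actions;
    T: dict (s, a) -> successor state; R: dict (s, a) -> reward."""
    V = {s: 0.0 for s in S}
    while True:
        delta = 0.0
        for s in S:
            v_new = max(R[(s, a)] + gamma * V[T[(s, a)]] for a in A[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            break
    # derive a greedy policy with respect to V, cf. (9.7)
    pi = {s: max(A[s], key=lambda a: R[(s, a)] + gamma * V[T[(s, a)]]) for s in S}
    return V, pi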
The key difference between Algorithm 11 and Algorithm 12 is in line 5 of both algorithms: In
the case of policy evaluation the action is given by the policy π, while in the case of value iteration
the action is being optimized over all possible actions As in state s ∈ S. The value iteration
algorithm and the policy evaluation algorithm first emerged in optimal control applications, also
known as dynamic programming [FF13, Ber19, Pow22]. Variants of both will also be basic
building blocks for more sophisticated reinforcement learning algorithms later on.
X Task 9.1. First, we consider an environment which resembles a frozen lake. Here,
the state space S contains 16 different locations (tiles) on the lake, which are arranged
in a 4 × 4 square grid. The agent has to travel from a given starting location to a fixed
destination. To this end, the agent is allowed to move along one of four possible
directions (north, south, west, and east) to a spot adjacent to its current one in each
step. However, the agent has to avoid falling into one of the lake’s holes, which are
present at certain locations that are unknown to the agent at the beginning. Moreover,
the lake can be either non-slippery, i.e., the agent always reaches the lake tile which
corresponds to its chosen action, or slippery, i.e., at random times the agent's action
is ignored and a random move to an adjacent lake tile is taken instead. A visualization
of the environment can be found in Figure 9.4.
First, get familiar with gymnasium [Tea23] and the ideas of value iteration. To this
end, refer to the J UPYTER notebook ReinforcementLearning_template.ipynb
at https://bookstore.siam.org/di03/bonus.
(a) We pre-defined a class RandomAgent, which takes a random step in any of the
four directions. Implement the action function for this agent and experimen-
tally estimate the expected value of the agent’s policy on the FrozenLake-v1
environment with is_slippery first False then True. What do you observe
and why?
(b) Implement the missing piece of iterative policy evaluation to calculate the
value function for the random policy from task (a) with γ = 0.9 with
is_slippery = False. What is the value function if is_slippery is True?
(c) Implement the missing piece of value iteration to calculate an optimal policy
for the FrozenLake-v1 environment where is_slippery is False. Experi-
mentally estimate the expected return for these optimal policies.
Figure 9.4: A visualization of the FrozenLake-v1 environment in gymnasium created with the
gymnasium.env.render function; see [Tea23]. The agent (top left corner) has to travel to the
destination (bottom right corner) and has to avoid the four holes in the lake. Modified image
used with the kind permission of the Farama foundation and Francisco Coda.
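For orientation, a rollout of a purely random policy with the gymnasium API could look roughly as follows. This is only a sketch of the interface; the RandomAgent class in the provided notebook and its methods may differ.

import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1", is_slippery=False)

def estimate_return(env, n_episodes=1000, gamma=1.0):
    """Estimate the expected (discounted) return of the random policy by sampling episodes."""
    returns = []
    for _ in range(n_episodes):
        obs, info = env.reset()
        done, G, t = False, 0.0, 0
        while not done:
            action = env.action_space.sample()   # random policy
            obs, reward, terminated, truncated, info = env.step(action)
            G += gamma ** t * reward
            t += 1
            done = terminated or truncated
        returns.append(G)
    return np.mean(returns)

print(estimate_return(env))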
by greedily picking the best action a for each state s according to V π . This resembles the policy
improvement step; see Figure 9.5 for an example of two steps of the overall iterative procedure.
The policy iteration algorithm now alternates these two steps; see Algorithm 13.
1 Initialize π ′ , v ′ arbitrarily.
2 repeat
3 v ← v′ .
4 π ← π′ .
5 Solve (I − γT)v′ = r and set V^π(s) ← v′_s for all s ∈ S
  (or approximate v′_s = V^π(s) by policy evaluation with Algorithm 11).
6 forall s ∈ S do
7     π′(s) ← arg max_{a∈A_s} ( R(s, a) + γ V^π(T(s, a)) ).
8 end forall
9 until max_{s∈S} |v_s − v′_s| ≤ δ.
10 return V^{π′}, π′.
In comparison, Algorithm 12 (value iteration) updates the estimate of the values in each loop
over the state space. Subsequently, after each such loop, it (implicitly) obtains a new policy
based on the updated value function. In contrast to that, Algorithm 13 updates the estimate of
the values for the current policy to a prescribed accuracy in each policy evaluation step. Then,
given this (stable) value, a new policy is determined in the policy improvement step. In terms
of the number of policy updates, policy iteration usually converges faster than value iteration, in
part because more effort is spent between updates.
Note that in line 5 of Algorithm 13, instead of using Algorithm 11, one can solve a system
of linear equations to obtain the current policy iterate. This is potentially more efficient. To see
why both approaches are equivalent, we define the reward vector r ∈ R|S| with entries r s :=
R(s, π(s)) for all states s ∈ S. Moreover, we define the state transition matrix T ∈ R|S|×|S|
with entries T s,s′ := δT (s,π(s)),s′ for all states s, s′ ∈ S with
δ_{T(s,π(s)), s′} := 1 if T(s, π(s)) = s′, and 0 else.
[Figure 9.5 shows four graph panels: (a) Policy improvement, (b) Policy evaluation, (c) Policy improvement, (d) Policy evaluation; see the caption below.]
Figure 9.5: Subsequent steps (a)–(d) of the policy iteration algorithm for the example from
Figure 9.3. The new action π ′ (s) of a policy improvement step is indicated by a dashed edge for
each s ∈ S.
v = r + γ T v,   (9.11)
where v ∈ RS has the entries v s = v(s). This indeed provides an alternative to Algorithm 11
since (9.11) is equivalent to solving the system of linear equations
(I − γT )v = r.
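In code, this alternative to line 5 of Algorithm 13 amounts to assembling T and r for the current policy and calling a linear solver. The following numpy sketch uses illustrative dictionary inputs and names.

import numpy as np

def evaluate_policy_linear(states, policy, T_map, R_map, gamma=0.9):
    """Exact policy evaluation by solving (I - gamma*T) v = r for a
    deterministic MDP; T_map[(s, a)] is the successor state and
    R_map[(s, a)] the reward (illustrative representation)."""
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    T = np.zeros((n, n))
    r = np.zeros(n)
    for s in states:
        a = policy[s]
        T[idx[s], idx[T_map[(s, a)]]] = 1.0   # delta_{T(s,pi(s)), s'}
        r[idx[s]] = R_map[(s, a)]
    v = np.linalg.solve(np.eye(n) - gamma * T, r)
    return {s: v[idx[s]] for s in states}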
Overall, policy iteration, as shown in Algorithm 13, is an iterative method that searches in
the space of policies until it has converged to an optimal policy. The approach was developed
by Bellman [Bel57] and Howard [How60] and is also known as Howard’s algorithm. It can
be shown that policy iteration converges since the number of states and actions in the MDP is
finite; see [Ber12, Ber19, SB18, Mur23]. Under a monotonicity assumption on the matrices T
for each policy, one can indeed prove that policy iteration converges superlinearly. Moreover,
under certain additional regularity assumptions, even quadratic convergence can be shown; see
[BMZ09, SR04] for details. This superior result is related to an interpretation of policy iteration
as a semismooth Newton method for finding a root of Bv − v = 0. In particular, line 5 of
Algorithm 13, which reflects the linear problem from (9.8), can be seen as solving a linearization
of the nonlinear problem (9.9), similar to a Newton method.
As in the deterministic case, the set A is the action space and As is the set of admis-
sible actions in state s. Pa (s, s′ ) is now the probability that the system will change
to state s′ if action a has been taken in state s. The corresponding reward is given by
r(s, a, s′ ). In the stochastic setting, a family π of probability measures π(s) on As
indexed by s ∈ S is a policy, where an agent selects an action in s based on π(s).
Note that a stochastic Markov decision process can also be used to model a deterministic one.
To this end, Dirac measures are used as the probability measures Pa (s, ·). To be consistent with
the deterministic case, where π : S → A is a function, notation is oftentimes abused in the
stochastic setting and one writes π(s) instead of a := arg maxa′ ∈As π(s)(a′ ) when referring to
the action a that is most likely in state s for the policy π. Stochastic policies are needed for tasks
involving partially observable environments, e.g., for bluffing in card games with incomplete in-
formation, such as poker (see [SB18]), for disguising the state to obfuscate private information,
or for exploring the state space, as we will see in Section 9.2.5. One can formally define anal-
ogous expressions of (9.1) in the stochastic setting; see, e.g., [BS96, FS06, SB18]. Then, (9.2)
gives the optimal deterministic policy as before. For details on stochastic policies, see [SB18].
All algorithms from the deterministic case can be employed to obtain a deterministic policy
in a stochastic environment if we slightly alter the definitions: The Bellman optimality equation
(9.6) can be redefined by taking the expected value over the probabilities to obtain
V^*(s) := max_{a∈A_s} Σ_{s′∈S} P_a(s, s′) ( r(s, a, s′) + γ V^*(s′) )   (9.12)
in the case of stochastic transactions. Then, Algorithm 11 and Algorithm 12 can be reformulated
for a stochastic environment. To this end, we replace line 5 in Algorithm 11 with
V′(s) ← Σ_{a∈A_s} π(s)(a) ( Σ_{s′∈S} P_a(s, s′) ( r(s, a, s′) + γ V(s′) ) ).
Here, π(s)(a) denotes the probability that π picks action a given state s.
Besides policy evaluation and value iteration, we can also reformulate policy iteration (Algo-
rithm 13) for a stochastic environment. To this end, the entries of the state transition matrix T
become T s,s′ := Pπ(s) (s, s′ ). Furthermore, line 7 in Algorithm 13 becomes
π′(s) ← arg max_{a∈A_s} Σ_{s′∈S} P_a(s, s′) ( r(s, a, s′) + γ V^π(s′) ).   (9.14)
Note that one can further generalize the setting and also define the reward r as probabilistic.
In this case, Algorithm 11 and Algorithm 12 can be extended to cover such probabilistic rewards
as well. For reasons of simplicity, we will only consider deterministic rewards in the following.
on the sampled trajectories, approximations of the value function or the optimal policy are then
computed, for instance.
In model-based reinforcement learning, algorithms are employed that train a model of the tran-
sition map from sampled trajectories. Here, while the true transition map is not available, both
value iteration (Algorithm 12) and policy iteration (Algorithm 13) can in principle be applied as
before, but now using the learned model. In the following, we will focus on model-free reinforce-
ment learning approaches, where the sampled trajectories are directly used, e.g., to approximate
the value function, without employing a model of the transition map.
The remainder of this section is structured as follows: After introducing the general reinforce-
ment learning setting in Section 9.2.1, we discuss a Monte Carlo approach for policy evaluation
in Section 9.2.2. Here, the agent learns from experience, i.e., from sampled trajectories, in order
to estimate the value function. This is in line with the conceptual idea of mimicking learning
through positive and negative reinforcement. A more advanced technique than simple Monte
Carlo estimation is the temporal difference (TD) method, which we explore in Section 9.2.3.
The combination of the concepts of temporal differences and Q-functions leads to the SARSA al-
gorithm presented in Section 9.2.4. In Section 9.2.5 we discuss the tradeoff between exploration
and exploitation and the resulting Q-learning algorithm. Subsequently, we introduce the binning
approach to deal with infinite state spaces in Section 9.2.6. Finally, Section 9.2.7 provides tasks
on the introduced RL algorithms.
9.2.1 Setting
Reinforcement learning versus optimal control In the optimal control setting we had per-
fect knowledge about the state transitions. But this is no longer the case in reinforcement learning
and an algorithm cannot explicitly make use of T or τ and the probability measures Pa for any
a ∈ A anymore.
Thus, in the general reinforcement learning setting, information can only be obtained in an
interactive fashion, i.e., one observes the next state st+1 and the reward rt+1 without explicitly
knowing the state dynamics T or the reward function R (or τ and r in the stochastic setting,
respectively). To avoid any confusion about the nature of the environment, we will restrict
ourselves to the stochastic setting41 for the remainder of this chapter and pose the following
assumptions:
1. At any point t ∈ N in time the environment an agent is interacting with is in state st ∈ S.
We can choose at ∈ As and then the environment transitions to the next state st+1 , which
is sampled from Pat (st , st+1 ). Note however that we do not assume that Pat is explicitly
known.
2. We cannot put the system into an arbitrary state s ∈ S.
3. The system might reset (possibly also at our will) to a state from a set of initial states.42
The above assumptions model a human learner, who does not know a priori how the environment
reacts when a certain action is taken. Of course, various modifications of these assumptions exist
in the RL literature. Note that the line between classical optimal control and modern reinforce-
ment learning is further blurred by the fact that, in case the state space S is very large, the
application of RL algorithms using sampled trajectories is often computationally advantageous
in contrast to classical optimal control methods, even in the case of perfect knowledge of the
environment. Here, supervised learning techniques (e.g., regression) can be employed to learn
V π from samples [Ber12].
41 Note however that the deterministic setting is just a special case of the stochastic setting.
42 This is necessary to be able to sample multiple trajectories.
and
(Bv)(s) := max_{a∈A_s} Σ_{s′∈S} P_a(s, s′) ( r(s, a, s′) + γ v(s′) ).   (9.16)
for the states si occurring over a trajectory. We call the above quantity C(si ) the target for
V π (si ). Then we use the average of the observed targets over all trajectories as an estimate for
V π (si ).
In Algorithm 14 only the first visit of a state during an episode is considered when estimating
the value function and later visits are neglected. It is easy to see that each target in the list
Returns(si ) is an independent, identically distributed estimate of V π with finite variance. Using
the law of large numbers, one can obtain convergence of the averages of Returns(s) to their
expected value V π (s) in case of this first-visit Monte Carlo approach for all s ∈ S [SS96, SB18].
This basic Monte Carlo policy evaluation method can then be used instead of line 5 in policy
iteration (Algorithm 13), for example.
43 Note at this point that fewer summands in C(si ) are employed for an i close to l − 1 than for a smaller i.
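A compact sketch of first-visit Monte Carlo policy evaluation may be helpful; the episode format, a list of (state, observed reward) pairs, is an assumption made purely for illustration.

from collections import defaultdict

def first_visit_mc(episodes, gamma=0.9):
    """First-visit Monte Carlo policy evaluation (sketch).
    episodes: list of trajectories [(s_0, r_1), (s_1, r_2), ...],
    i.e., each state together with the reward observed after leaving it."""
    returns = defaultdict(list)
    for episode in episodes:
        G = 0.0
        targets = {}
        # traverse backwards to accumulate the discounted returns C(s_i);
        # overwriting keeps the target of the FIRST visit of each state
        for s, r in reversed(episode):
            G = r + gamma * G
            targets[s] = G
        for s, C in targets.items():
            returns[s].append(C)
    # average the observed targets per state
    return {s: sum(v) / len(v) for s, v in returns.items()}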
Temporal differences
The idea behind temporal difference (TD) learning is to use incremental updates for
the computation of V π after every state transition of the system. When transitioning
from si to si+1 by taking action ai , the basic TD update step for the estimator V of
V π is given by
V (si ) ← V (si ) + αdi (9.18)
for some appropriately chosen step size α > 0 while using the temporal difference
d_i := r(s_i, a_i, s_{i+1}) + γ V(s_{i+1}) − V(s_i);   (9.19)
see also [Sut88, SB18]. Here, an update of V can be computed immediately after
a state transition. This is in contrast to Monte Carlo approaches, where the whole
trajectory of length l has to be known before an update of V (si ) can be computed for
any si contained in the trajectory.
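In code, one TD update is essentially a one-liner on a tabular estimate; a sketch with illustrative variable names:

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9):
    """Apply one TD update (9.18) to the tabular estimate V (e.g., a dict)."""
    d = r + gamma * V[s_next] - V[s]   # temporal difference d_i, cf. (9.19)
    V[s] += alpha * d
    return V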
Relation between Monte Carlo and temporal differences To see how TD is related to an
episodic Monte Carlo policy evaluation, let us rewrite the latter. To this end, recall definition
(9.17) for C(si ), which resembles the target value from Algorithm 14. Then, defining m(si )
as the number of first visits of si over all trajectories up to the current one, we can update the
estimate V , initialized by zero, at each visit of si by
V(s_i) ← V(s_i) + (1/m(s_i)) (C(s_i) − V(s_i)) = ((m−1)/m) V(s_i) + (1/m) C(s_i),   (9.20)
where we abbreviate m := m(s_i).
This update is equivalent to the computation in line 10 in Algorithm 14. Now we rewrite the
above update step using the temporal differences di from (9.19). To this end, we express the term
C(s_i) − V(s_i) as the telescoping sum Σ_{k=i}^{ℓ−1} γ^{k−i} d_k, using that V(s_ℓ) = 0 for the terminal state s_ℓ of a trajectory of length ℓ. This turns (9.20) into
V(s_i) ← V(s_i) + (1/m(s_i)) Σ_{k=i}^{ℓ−1} γ^{k−i} d_k.   (9.21)
Replacing the factor 1/m(s_i) by a generic step size and keeping only the current temporal difference d_i, we arrive at the TD update
V(s_i) ← V(s_i) + α ( r(s_i, a_i, s_{i+1}) + γ V(s_{i+1}) − V(s_i) )   (9.22)
with step size α > 0. As we see, di contains r(si , ai , si+1 ) + γV (si+1 ) as the target value of
the update. Here, the current estimate for V (si+1 ) is used, which replaces C(si ) from the Monte
Carlo update. In this way, the temporal difference can be understood as a residual (or “error”)
between the current estimate V (si ) and the observed value r(si , ai , si+1 ) + γV (si+1 ) after the
state transition. This is the reason why di is sometimes also referred to as TD error. One can
show (under mild assumptions) that a policy evaluation algorithm using the TD update44 will
also converge to V π [Sut88, SB18].
A significant advantage of updating the value function by (9.22) compared to the update (9.20)
is the possibility to use a TD update in an online fashion.45 In contrast to that, one has to wait
until a trajectory has been fully explored before an update can be computed with Monte Carlo.
44 Note at this point that an approach of using a current estimate (in our case V(s_{i+1})) when computing an update
for another state (in our case V (si )) is often referred to as bootstrapping in the RL literature. As we will see in Sec-
tion 10.2.4, the term bootstrapping has a different meaning when we consider sampling methods.
45 This means that each new iterate can be computed by using only the last iterate and the current reward.
recursively defined by
Q^π : {(s, a) | s ∈ S, a ∈ A_s} → R,
(s, a) ↦ Σ_{s′∈S} P_a(s, s′) ( r(s, a, s′) + γ Σ_{a′∈A_{s′}} π(s′)(a′) Q^π(s′, a′) ).   (9.23)
Note that the definition of Qπ is analogous to (9.4) but it is employed now in the stochastic
setting and with an (s, a)-dependent action-value function Qπ instead of an s-dependent value
function V π . Besides algorithmic aspects, such Q-functions are often employed to simplify the
analysis of RL algorithms; see [SB18].
Note that employing a Q-function Qπ instead of a value function V π implies more than just a
notational difference. Indeed, a Q-function Qπ (s, a) provides direct access to the value for any
action a ∈ As taken in state s ∈ S. This has to be seen in contrast to V π (s), which is the value
when taking the fixed action π(s) in state s. Furthermore, Qπ (s, a) provides the value of the
action a ∈ As for a given state s ∈ S, even if we do not know which state s′ will occur after
employing a. This is of particular importance in the case of model-free reinforcement learning,
i.e., in the case where we no longer have a model for the transition probabilities.
Then, the optimal policy π^* at state s is given by taking the action with the maximal optimal Q-value among the available actions, i.e.,
π^*(s) := arg max_{a∈A_s} Q^*(s, a).   (9.24)
The crucial difference between (9.24) and (9.7) is that we no longer explicitly need the reward
r or the transition probabilities τ in the definition of the optimal policy. However, the drawback
of the above approach is that more values need to be stored and computed for Qπ than for the
corresponding value function V π : S → R.
Policy evaluation, value iteration, and policy iteration for Q-functions Because of the
recursive relationship (9.23), it is straightforward to formulate iterative optimization schemes in
the fashion of Algorithm 11, Algorithm 12, and Algorithm 13, but with Q-functions instead of
value functions. For example, we can use Q-functions in policy iteration with a Monte Carlo
policy evaluation as outlined in Section 9.2.2. Furthermore, we are now able to define a Monte
Carlo-based value iteration scheme with Q-functions, similar to Algorithm 12. For that, we
alternate between the following steps:
1. In state si we take an action ai ∼ π(si )(·) relying on a policy π defined by the greedy rule
(9.24) based on our current Q-function estimate, and we observe a reward ri and the new
state si+1 .
2. Similar to line 5 of Algorithm 12, we modify our estimate of the optimal Q-function by
Q(s_i, a_i) ← r_i + γ max_{a∈A_{s_{i+1}}} Q(s_{i+1}, a).
In this way, our estimate of Q will converge towards the optimal Q-function Q∗ under suitable
assumptions. Note that, for constructing such a Monte Carlo-based value iteration scheme with
a value function V instead of a Q-function Q, either we would have to employ a model-based
approach or we would need to sample each possible action in step 2 from above to obtain the
next state and its value, which is typically not feasible.
SARSA Instead of considering these simple Monte Carlo-based approaches in detail, we will
have a closer look at SARSA, a more advanced RL algorithm that is based on temporal differ-
ences and Q-functions. To this end, let us rewrite the TD update formula (9.22) with Q-functions
to obtain
Q(s_i, a_i) ← Q(s_i, a_i) + α ( r_i + γ Q(s_{i+1}, a_{i+1}) − Q(s_i, a_i) ).   (9.25)
Note that, when replacing V (si+1 ) with the Q-function value Q(si+1 , ai+1 ), we already use
ai+1 when updating Q(si , ai ).
Now, when evaluating a fixed policy π, we take an action ai+1 ∼ π(si+1 )(·). In particular, in
each step of (9.25), our knowledge of Q(si , ai ) improves through the reward ri = r(si , ai , si+1 )
and through our current estimate of the value Q(si+1 , ai+1 ) of the next state si+1 and the next
action ai+1 ∼ π(si+1 )(·).
SARSA
SARSA (or Sarsa) stands for state-action-reward-state-action. The name refers to the
five variables si , ai , ri , si+1 , and ai+1 at iteration i. The goal of SARSA is to ap-
proximate the optimal Q-function by sampling trajectories. For that, SARSA follows
the idea of (9.25), i.e., the Q-function is learned from experience; see Algorithm 15.
After Q has been determined, an estimate of an optimal policy π ∗ is derived by using
Q as a surrogate for the optimal Q-function Q∗ and by employing (9.24).
Note that SARSA is open-ended by nature, i.e., we investigate trajectory after trajectory. There-
fore, we need to provide a stopping condition for the algorithm to terminate. For our purpose,
we simply use a fixed number of N > 0 trajectories that we traverse.
1 Initialize Q arbitrarily such that Q(s, a) ← 0 for all terminal states s ∈ S and all actions
a ∈ As .
2 for i ∈ {0, . . . , N } do
3 Initialize a starting state s ∈ S randomly.
4 Select a according to π̃(s).
5 repeat
6 Take action a, sample next state s′ according to Pa (s, ·), and observe reward
r = r(s, a, s′ ).
7 Select a′ according to π̃(s′ ).
8 Q(s, a) ← Q(s, a) + α (r + γQ(s′ , a′ ) − Q(s, a)).
9 s ← s′ .
10 a ← a′ .
11 until s is terminal.
12 end for
13 Derive policy π from Q by the greedy rule (9.24).
14 return Q, π.
which at each step follows a policy that is greedily determined from the current Q-function; see
(9.24). This highlights that the action can be selected according to different policies in each
iteration step. It is important to note that we deliberately do not denote π̃ as a policy since π̃ is
not a function (or a family of probability measures) on S but it inherently depends on Q.
A common alternative to the greedy strategy (9.26) is the so-called ε-greedy strategy, which is
also typically used in SARSA (Algorithm 15) instead of a fixed policy π. The ε-greedy strategy
picks a random action with probability 0 < ε ≤ 1 and a value-maximizing action—like the
greedy strategy—with probability 1−ε. This means that we will obtain the rewards of suboptimal
actions throughout the learning process in the ε-greedy case, but we will observe more diverse
actions than in the greedy case. This leads to better coverage of the state space and of the action
space.
A common choice for exploring the state and action spaces is the ε-greedy action-selection strat-
egy. Note that, at the start of the learning process, this strategy is usually employed with a
large value of ε to emphasize exploration. Then, the value of ε is successively reduced over
the course of training to converge to the greedy strategy to emphasize exploitation. This is one
of the most popular approaches since it is easy to implement and yet very powerful. Another
example of a commonly used explorative strategy is the softmax exploration, also called Boltz-
mann exploration. Here, an action a is drawn randomly according to the softmax distribution;
see [Ber19, SB18] for details.
Q-learning In SARSA, one action-selection strategy has been used to both determine the next
action in the current trajectory and calculate the update step for Q using (9.25). This is known as
on-policy learning. In contrast to that, we now consider an off-policy learning approach, where
the traversal of the state space and the update step are based on different strategies.47
Q-learning
The so-called Q-learning algorithm introduced in [Wat89] is similar to SARSA.
However, while the action-selection strategy π̃ is still used to traverse the state space,
there is a significant difference for the update of the Q-function in Q-learning. Here,
the update step considers the maximum Q-function value over all possible actions
a′ ∈ As′ , i.e., the update uses a greedy strategy; see Algorithm 16.
The estimation of Q in Q-learning is analogous to the value iteration scheme from Algorithm 12.
If we compare the update step from both algorithms, we observe that both depend on the optimal
action given the current estimate of V or Q, respectively. However, different update rules are
employed: In Algorithm 12 the update formula stems from the dynamic programming principle,
while in Algorithm 16 it stems from the temporal differences.
Relation between SARSA, Q-learning, and policy iteration Note that Q-learning and
SARSA coincide if the action-selection strategy π̃ is chosen to be the greedy strategy, i.e., if
47 Note that the action-selection strategy for the next trajectory sample can even be based on a conventional controller
1 Initialize Q arbitrarily such that Q(s, a) ← 0 for all terminal states s ∈ S and all actions
a ∈ As .
2 for i ∈ {0, . . . , N } do
3 Initialize a starting state s ∈ S randomly.
4 repeat
5 Select a according to π̃(s).
6 Take action a, sample next state s′ according to Pa (s, ·), and observe reward
r = r(s, a, s′ ).
7     Q(s, a) ← Q(s, a) + α ( r + γ max_{a′∈A_{s′}} Q(s′, a′) − Q(s, a) ).
8 s ← s′ .
9 until s is terminal.
10 end for
11 Derive policy π from Q by the greedy rule (9.24).
12 return Q, π.
π̃(s) = arg maxa∈As Q(s, a). However, as outlined, one usually uses a less greedy approach
when choosing the next action for exploration, e.g., the ε-greedy strategy or the softmax explo-
ration.
Furthermore, comparing Algorithm 15 and Algorithm 16 with policy iteration from Algo-
rithm 13, we note that SARSA and Q-learning can be considered as optimistic policy iteration
methods, where only a single sample is obtained between policy updates instead of a whole
trajectory; see also [Ber19]. Note that numerical issues such as cyclic or oscillating policies
can occur in SARSA or Q-learning. For further details on such numerical problems and their
stabilization by using joint updates after several state-action observations we refer the reader to
[Ber12, Ber19].
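For a finite gymnasium environment with discrete state and action spaces, a tabular Q-learning loop in the spirit of Algorithm 16 might look as follows. This is only a sketch; the hyperparameters and the ε-greedy strategy without decay are simplifying assumptions.

import numpy as np

def q_learning(env, n_episodes=5000, alpha=0.1, gamma=0.9, eps=0.1):
    """Tabular Q-learning with an eps-greedy action-selection strategy.
    env is assumed to be a gymnasium environment with discrete
    observation and action spaces (e.g., FrozenLake-v1)."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(n_episodes):
        s, info = env.reset()
        done = False
        while not done:
            # eps-greedy exploration (see Section 9.2.5)
            a = env.action_space.sample() if np.random.rand() < eps else int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, info = env.step(a)
            # greedy target over all actions in the next state (Algorithm 16, line 7)
            target = r if terminated else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
            done = terminated or truncated
    return Q, np.argmax(Q, axis=1)   # greedy policy via (9.24)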
to obtain ten different bins. Then, after the binning process, we obtain indeed a problem with
discrete state and action spaces by defining a Markov decision process (and the corresponding
policies and Q-functions) on the set of bins instead of the continuous spaces. This allows us
to treat the resulting problem with the reinforcement learning algorithms that we have already
introduced in the previous sections. For a specific continuous state, we then simply look up the
value of the corresponding state bin, e.g., for the state 0.23 in the above example, we take the
value assigned to the third bin [0.2, 0.3). Discrete Q-functions are often called Q-tables to clearly
differentiate between the discrete and the continuous approaches.
In the same way as above, we can use binning for a two-dimensional domain. For exam-
ple, we can discretize [0, 1]2 into 10 times 10 bins Υ × Υ. When proceeding in this way for
more and more dimensions, the number of bins increases exponentially with the dimension, and
we encounter the curse of dimensionality for high-dimensional problems; see also Section 1.6.
However, in many practical applications, we encounter low-dimensional state and action spaces,
which can be treated very well by a combination of binning and RL algorithms like SARSA or
Q-learning, as we will see in the following task.
X Task 9.2. Experiment with the reinforcement learning algorithms introduced in this
section. To this end, refer to the provided J UPYTER notebook template from Task 9.1.
(a) Implement SARSA or Q-Learning to solve the FrozenLake-v1 environment
with is_slippery=True and γ = 0.9. Use an ε-greedy action-selection strat-
egy with ε = 0.1. Estimate the expected value of this policy experimentally.
(b) We now use the CartPole-v1 environment from gymnasium. Here, the envi-
ronment resembles a cart to which a pole is attached. The cart can move to the
left and to the right. The goal is to balance the pole by accelerating the cart in a
chosen direction at any given point in time. Get familiar with the environment
by implementing a policy which picks admissible random actions. Traverse a
trajectory of this policy and render/visualize it with the help of the gym.make
method with render_mode="human".
(c) Adapt the algorithm from (a) to solve the CartPole-v1 problem. Use binning to
discretize the state space. Experimentally confirm the expected return for the
learned policy and render/visualize one trajectory.
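For part (c) of the task, a simple binning helper could look as follows. The bin boundaries for the four CartPole-v1 state variables (position, velocity, angle, angular velocity) are rough, illustrative guesses, not values prescribed by the text.

import numpy as np

# Illustrative bin edges for the four CartPole-v1 state variables.
bins = [np.linspace(-2.4, 2.4, 9),
        np.linspace(-3.0, 3.0, 9),
        np.linspace(-0.21, 0.21, 9),
        np.linspace(-3.0, 3.0, 9)]

def to_bin(state):
    """Map a continuous state to a tuple of bin indices (usable as a Q-table key)."""
    return tuple(int(np.digitize(x, edges)) for x, edges in zip(state, bins))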
which is linear in θ and employs a set of d maps fk , which are usually chosen to be simple
projections, (Fourier) polynomials, or piecewise constant functions.48 If we assume that fk ,
k = 1, . . . , d, are fixed functions, we now only need to store the vector θ ∈ Rd to represent Qθ .
This has to be compared to |S| · |A| values for a standard Q-function Q. Such an approximation
Qθ then just has to be plugged in for Q in Algorithm 15, for instance. Furthermore, infinite state
spaces S could also be dealt with straightforwardly when using the approximation (9.27) with
fixed fk , k = 1, . . . , d.
arg min_θ ∥Q − Q_θ∥_*
in some appropriate norm ∥ · ∥∗ . To this end, let us consider the example of a least squares
minimization, i.e., we aim to minimize
∥Q − Q_θ∥_* := (1/2) Σ_{s∈S, a∈A_s} ( Q(s, a) − Q_θ(s, a) )².
Since we do not have access to the true Q-function Q, we instead use the updated target value
r(s, a, s′ ) + γ maxa′ Qθ (s′ , a′ ) with s′ ∼ Pa (s, ·) for a state-action pair (s, a) to obtain
∥Q − Q_θ∥_* ≈ ∥ r(·, ·, S′) + γ max_{a′} Q_θ(S′, a′) − Q_θ ∥_*,
where S′(s, a) is a random variable distributed according to P_a(s, ·). Now, we employ one step
of gradient descent to move towards a minimizer of this expression, i.e.,
θ ← θ − α Σ_{s∈S, a∈A_s} ( r(s, a, S′(s, a)) + γ max_{a′} Q_θ(S′(s, a), a′) − Q_θ(s, a) ) ∇_θ Q_θ(s, a)   (9.28)
1 Initialize θ ∈ Rd arbitrarily.
2 for i ∈ {0, . . . , N } do
3 Initialize a starting state s ∈ S randomly.
4 Select a according to π̃(s).
5 repeat
6 Take action a, sample next state s′ according to Pa (s, ·), and observe reward
r = r(s, a, s′ ).
7 if s′ is terminal then
8 θ ← θ + α (r − Qθ (s, a)) ∇Qθ (s, a).
9 end if
10 else
11 Select a′ according to π̃(s′ ).
12 θ ← θ + α (r + γQθ (s′ , a′ ) − Qθ (s, a)) ∇Qθ (s, a).
13 s ← s′ ; a ← a′ .
14 end if
15 until s′ is terminal.
16 end for
17 Derive policy π from Qθ by the greedy rule (9.24).
18 return Qθ , π.
Since it would be way too costly to evaluate (9.28) several times after each Q-function update
step, we employ the minibatch idea of stochastic gradient descent; see Algorithm 10. Here, when
we are currently in state s ∈ S and take action a ∈ As , we only use this specific state-action pair
(s, a) in (9.28), i.e., we compute
θ ← θ − α ( r(s, a, S′(s, a)) + γ max_{a′} Q_θ(S′(s, a), a′) − Q_θ(s, a) ) ∇_θ Q_θ(s, a).   (9.29)
A SARSA variant with a Q-function approximation based on such an update step is given in
Algorithm 17. Note that, to resemble an SGD algorithm with minibatch size one, we would need
that (s, a) is an i.i.d. sample of all state-action pairs. But depending on the underlying MDP and
the chosen actions, this might not be the case. However, in practice this strategy still works well.
In fact, for the Q-function approximation (9.27), convergence and error bounds can be shown
under certain assumptions; see [SB18].
1 Initialize θ ∈ Rd arbitrarily.
2 for i ∈ {0, . . . , N } do
3 Initialize a starting state s ∈ S randomly.
4 repeat
5 Select a according to π̃(s).
6 Take action a, sample next state s′ according to Pa (s, ·), and observe reward
r = r(s, a, s′ ).
7 Store the transition (s, a, r, s′ ) in the replay memory.
8 Sample a batch (s_j, a_j, r_j, s′_j), j = 1, . . . , n_batch, from the replay memory.
9 for j ∈ {1, . . . , n_batch} do
10     y_j = r_j if s′_j is terminal, and y_j = r_j + γ max_{a′} Q_θ(s′_j, a′) else.
11 end for
12 Perform a gradient descent step on (9.30).
13 s ← s′ .
14 until s′ is terminal.
15 end for
16 Derive policy π from Qθ by the greedy rule (9.24).
17 return Qθ , π.
is fixed, i.e., the layers, nodes, and activation functions are chosen a priori. Then, the function
f , which the corresponding network represents, is chosen as approximation Qθ to the optimal
Q-function Q∗ . Here, the parameter θ = (W , ⃗b) contains the weights and biases of the network.
The resulting method is summarized in Algorithm 18. To train the neural network, a database of
past steps called replay memory is used, which needs to have been collected and stored before-
hand. Let us explain this in more detail next.
Experience replay To obtain less correlated samples than those obtained by just taking
the most recent trajectory information, a technique called experience replay is used in Algo-
rithm 18. Here, the observed transitions (s, a, r, s′ ) are collected in a replay memory. Each
time a trajectory progresses, we draw n_batch uniformly distributed samples (s_j, a_j, r_j, s′_j), j = 1, . . . , n_batch, from the replay memory, where the batch size n_batch ∈ N is fixed a priori. Then, the targets y_j := r_j + γ max_{a′∈A_{s′_j}} Q_θ(s′_j, a′) are computed for each transition (s_j, a_j, r_j, s′_j) for
j = 1, . . . , nbatch . Finally, we perform a gradient descent step using the loss
(1/n_batch) Σ_{j=1}^{n_batch} ( y_j − Q_θ(s_j, a_j) )².   (9.30)
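To illustrate how the replay memory and the loss (9.30) interact, here is a small PyTorch-flavored sketch. The network q_net, the optimizer, and the transition format (states stored as tensors) are assumptions made for illustration; this is not the notebook code.

import random
from collections import deque
import torch

replay_memory = deque(maxlen=100_000)   # stores transitions (s, a, r, s_next, terminal)

def dqn_loss(q_net, batch, gamma=0.99):
    """Targets and least-squares loss (9.30) for one minibatch (sketch).
    q_net is assumed to map a batch of states to Q-values for all actions."""
    s, a, r, s_next, term = zip(*batch)
    s, s_next = torch.stack(s), torch.stack(s_next)
    a = torch.tensor(a)
    r = torch.tensor(r, dtype=torch.float32)
    term = torch.tensor(term, dtype=torch.bool)
    with torch.no_grad():
        # targets y_j as in line 10 of Algorithm 18
        y = r + gamma * q_net(s_next).max(dim=1).values
        y[term] = r[term]
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    return torch.mean((y - q) ** 2)          # the loss (9.30)

# One training step: sample a minibatch and take a gradient descent step on (9.30), e.g.,
#   batch = random.sample(replay_memory, k=32)
#   loss = dqn_loss(q_net, batch); loss.backward(); optimizer.step()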
y_i = r_i + γ max_{a′_i} Q_θ̂(s′_i, a′_i).   (9.31)
This can be seen in analogy to picking an action-selection strategy and, separately, a greedy
update step for the Q-function. Here, we employ an action-selection strategy π̃, and the Q-
function approximation is updated using a gradient descent algorithm based on the targets (9.31).
A similar approach is followed in double Q-learning [vH10]. Here, two different Q-functions are
used to reduce the expected error when determining the value of the next state.
X Task 9.3. Let us now apply the deep reinforcement learning algorithms introduced
in this section to the CartPole-v1 environment introduced in Task 9.2. To this end,
refer again to the template J UPYTER notebook from Task 9.1.
(a) Use episodic SARSA (Algorithm 17) in the CartPole-v1 environment to bal-
ance the pole. To this end, use linear Q-functions to represent the value of each
action a, i.e.,
Q_θ(s, a) := Σ_{k=1}^{d} [θ_a]_k [s]_k = θ_a^T s,
is updated accordingly [Ber12, Ber22, SB18]. This can however be infeasible in practice even
for moderately large state spaces. In general, the analysis of reinforcement learning algorithms
is said to be difficult, since “it relies on multiple interacting approximations whose effects are
hard to predict and quantify in practise” [Ber19].
Overall, there are two contrasting aspects to consider when we train an RL system. First, from
a theoretical point of view, we typically need to guarantee that every state-action pair is regu-
larly visited to show convergence. Second, from an efficiency point of view, we aim to focus the
limited learning resources on those situations that happen more often and lead to a high reward,
while we mostly ignore theoretically possible but essentially irrelevant states. In general, we aim
for an estimation of expected values under one distribution, reflecting the policy which we want
to learn, while we obtain samples from another distribution, reflecting the action-selection strat-
egy or other decisions affecting the sampled trajectory. Here, returns of the observed trajectories
can be weighted according to the relative probability of the two distributions; see [SB18]. Addi-
tionally, sampling large reward terms more often can increase the sample efficiency; see [Ber12].
n-step learning The basic TD method (9.18) can be generalized to use n-step learning. To
this end, the trajectory is observed for n steps after the current state si , and the n-step discounted
cumulative reward is employed together with an estimate of V (si+n ) to obtain
R_{i:i+n} := Σ_{k=0}^{n−1} γ^k r(s_{i+k}, a_{i+k}, s_{i+k+1}) + γ^n V(s_{i+n}).
After observing si+n , the value V (si ) is updated by the α-weighted difference between the n-
step reward Ri:i+n and the current estimate for V (si ), i.e.,
V (si ) ← V (si ) + α (Ri:i+n − V (si )) .
One can consider n-step methods as a generalization of TD and MC methods. Here, the one-step
TD approach (9.18) with n = 1 is at one end of the scale, whereas the MC update (9.20) is at the
other end with n = l being the length of a whole trajectory.
The n-step idea is also used to improve deep Q-learning. In [HMvH+ 18], for instance, several
variants of Deep Q-networks were combined and tested for an arcade game environment. There,
one major improvement over the basic DQN was attributed to n-step learning, where, instead of considering the reward of one step to compute a target y = r + γ max_{a′} Q(s′, a′), n steps are combined to compute y = Σ_{j=1}^{n} γ^{j−1} r_j + γ^n max_{a_{n+1}} Q(s_{n+1}, a_{n+1}); see also [Ber19, GP17, SB18] for more details.
TD(λ) A further generalization of TD methods involves so-called λ-returns. Here, both ex-
tremes, the basic TD update and the MC update, are combined. By using exponentially decaying
weights (1 − λ)λk−1 for k ∈ N with a parameter λ ∈ (0, 1), we define the λ-return
R_i^λ := (1 − λ) Σ_{k=1}^{∞} λ^{k−1} R_{i:i+k}.
The method using this value in the update is called TD(λ). Note that one recovers the one-step
TD-algorithm for λ = 0. Furthermore, one can show that λ → 1 represents a Monte Carlo-type
algorithm which is just slightly more general than Algorithm 14; see [SB18]. Moreover, one can
show that one can adjust the tradeoff between bias and variance of the update by varying λ. This
has a large influence on the speed of convergence of the corresponding algorithm. Finally, to
obtain a memory-efficient policy evaluation method with n-step TD or TD(λ) updates, one has
to employ a special technique called eligibility traces. For more information on this technique
and the corresponding algorithms, we refer the interested reader to [Ber12, SB18, vSMP+ 16].
Policy gradient methods Note finally that there exist reinforcement learning algorithms that
directly work with continuous action spaces, e.g., policy gradient methods [SMSM99]. The basic
idea of policy gradient methods is to approximate a policy π. One example of a policy gradient
method would be to use the softmax action-selection strategy from Section 9.2.5 and to update
the functional representation of the softmax distribution by an (approximate) gradient ascent
method according to a given scalar performance measure; see [Ber19, SB18]. Note that this
approach is only valid if the functional representation of the softmax distribution is differentiable.
Modern approaches, such as proximal policy optimization (PPO) [SWD+ 17], employ a deep
neural network πθ as an approximation to π. Then, the network is optimized with respect to a
chosen measure for the reward. Approaches that approximate the value function in addition to
the policy are often called actor-critic methods. Here, actor refers to the policy that provides the
action, and critic refers to the value function that evaluates the given action.
Further literature To obtain detailed insights into the theoretical aspects of reinforcement
learning, we refer the interested reader to [Ber12, Ber17, Ber19, Ber23, Sze10]. Furthermore, an
overview is given in [SB18]. In [Pow22], a unified framework for common sequential decision
problems is proposed.
Chapter 10
Further Developments
This chapter serves as a rough overview on some important machine learning topics that we have
not covered so far. Since they are quite relevant in data science but a thorough treatment of
them would be beyond the scope of this book, we decided to present these techniques in short
introductory overviews in this chapter without going into too much detail. We will not present
any tasks regarding the following topics, but we encourage the interested reader to check out the
given references to learn more in these directions.
In the following, in Section 10.1 we first encounter machine learning problems where the
data is non-vectorial or the labels are non-scalar. Subsequently, we introduce ensemble methods
in Section 10.2. These techniques combine several ML models to create a more sophisticated
one. Recurrent neural networks, which deal with sequence data, are discussed in Section 10.3.
Section 10.4 deals with the transformer network, which presently represents the state of the
art when it comes to sequence models in deep learning. Finally, we conclude this chapter in
Section 10.5 with a digression into the topic of interpretability.
i.e., we included the Euclidean norm to obtain a scalar quantity from the differences of the vector-
valued data labels. We dealt with this particular setting already in Section 4.1 and Section 7.1
when discussing the PCA and autoencoders. Technically, however, we did not have vector-
valued data labels in these cases since we only considered unsupervised learning problems there.
Nevertheless, the reasoning behind the chosen model classes and loss functions is essentially the
same as for vector-valued labels in supervised learning problems. Alternatively, and depending
on the situation at hand, more involved vector norms than the ∥ · ∥2 norm can be employed there.
(EEG), the corresponding time series data is represented as a matrix, which is then cast into a
covariance matrix format; see, e.g., [CKBM21, CBB17].
Besides covariance matrices, symmetric positive-definite matrix-valued data also appear in the
medical area of diffusion tensor imaging (DTI), for example; see [PFA06, JHS+ 13, PSF19] for
an ML approach to DTI. The ultimate goal of DTI is to generate contrast in magnetic resonance
images by determining the so-called diffusion tensor T ∈ R3×3 for an anisotropic diffusion
process, i.e., we learn the diffusion coefficient in Fick’s first law
J(x) = −T(x)∇c(x)
from given flow data. Here, J is the diffusion flux vector and ∇c denotes the (spatial) gradient
of the particle concentration. Employing a log-Euclidean distance
d(T_1, T_2) := ∥ log(T_1) − log(T_2) ∥_F
in the loss function has proven to be both efficient and successful for DTI; see also [PSF19].
Here, log(·) denotes the matrix logarithm and ∥ · ∥F is the Frobenius norm. Besides the log-
Euclidean distance, there also exist other, more involved, choices, e.g., affine-invariant metrics,
Cholesky distances, and Stein divergences; see [JHS+ 13] for details.
Another area of application where spd matrices are to be learned is robotics. In particular,
when we deal with a robotic arm, the configuration49 x(t) ∈ R6 at time t of the end effector,
i.e., the end of the robotic arm, can be expressed in terms of the configurations θ(t) ∈ Rm of the
joints of the robotic arm via
x(t) = f (θ(t))
for some function f : Rm → R6 . Here, m depends on the number and the type of robot joints.
To obtain the corresponding velocities, we compute the time derivative
ẋ(t) = (∂f/∂θ)(θ(t)) θ̇(t),
where J = ∂f/∂θ denotes the Jacobian of f with respect to the joint configurations. Then, the
spd matrix (J J^T)^{−1} defines the so-called manipulability ellipsoid, i.e., the end effector velocity
ellipsoid corresponding to the joint velocity sphere θ̇^T θ̇. This is a cardinal structure since it
describes the possible movement directions of the end effector at any given time t. Therefore,
learning it from a set of given robot arm trajectories is an important task, e.g., when teaching
a robot to track a reference motion; see [ADHSK21, JRCC21]. A similar machine learning
problem with manifold-valued labels is encountered when attempting to recover robot orientation
data represented by rotation matrices from the Lie manifold SO(3), i.e., the special orthogonal
group in three dimensions; see [ZHS+ 17].
Besides the above applications, manifold-valued data also appear in the context of image syn-
thesis when the corresponding image is characterized not by RGB color values but by different,
more intricate representations; see, e.g., [HWVG19]. Furthermore, when processing manifold-
valued data, e.g., matrices on Stiefel manifolds or Grassman manifolds, by neural networks
which keep their structure intact, similar challenges are encountered as with manifold-valued
labels; see [CBMV20].
49 This is usually a combination of three spatial coordinates and three orientations.
Ensemble learning
The simple but powerful idea behind ensemble learning is to combine the results
of several basic machine learning models to create a more complex and accurate
model. Results achieved in this way can be superior to those achieved by each of
the single models alone. Most ensemble learning methods rely on a combination of
bootstrapping, boosting, and stacking mechanisms. We refer the reader to [DYC+ 20,
HTF09, JWHT21, Mur22, SR18] for details.
10.2.2 Stacking
The stacking approach is slightly more advanced than the simple voting and averaging ap-
proaches. Here, a meta-model is constructed to obtain a better result than those obtained by
just considering the base models themselves.
Figure 10.1: An example for voting with five different models for a classification problem with
classes A, B, and C. After all five models have been trained, the majority vote of the five base
models decides which class is predicted for a given input data point. For the example input above
the model outcomes are B, B, C, A, B. This results in an ensemble vote of B.
Stacking
In stacking we build a meta-model M on top of m independently trained base mod-
els M1 , . . . , Mm . The meta-model’s input for a specific data point x is the outputs
Mj (x) of the base models for j = 1, . . . , m. Then, the training set50 is taken to train
the meta-model to deliver the best decision according to a given loss function. Often,
the same loss function is chosen for the meta-model as for the base models. However,
it is possible to employ a different meta-loss function as well.
As an example, let us consider the stacking algorithm from [Bre96b]. It uses a positive com-
bination of the base models and the least squares loss to obtain the coefficients
(α1 , . . . , αm ) = arg min_{βj ≥0, j=1,...,m} Σ_{i=1}^{n} ( yi − Σ_{j=1}^{m} βj Mj^(i) (xi ) )^2          (10.1)
of the meta-model M := Σ_{j=1}^{m} αj Mj . Moreover, to avoid overfitting in the minimization procedure for (10.1), the models Mj^(i) , i = 1, . . . , n, have been used there instead of the model Mj
for each j = 1, . . . , m. Here, Mj^(i) is the same as Mj , but it has been trained without using the
data point (xi , yi ) in its training data set; see [Bre96b] for details.
In this way, the most common approach to stacking is to create a meta-model by linearly com-
bining the base models. In contrast to voting or averaging, stacking allows us to learn how to
weight the different outcomes of the different base model classes. This often leads to improved
results over those of any of the base models alone; see [Wol92, Bre96b]. In Figure 10.2 we provide
an example illustration for stacking with five base models. Moreover, there also exist nonlinear
meta-model approaches; see, e.g., [MN15].
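To make (10.1) concrete, the following minimal sketch combines two illustrative, already fitted base models by a non-negative least squares fit of their predictions. For simplicity, the sketch fits the weights directly on the base models' own training predictions rather than on the leave-one-out variants Mj^(i) described above; the data and the base models are purely illustrative assumptions.

```python
import numpy as np
from scipy.optimize import nnls  # non-negative least squares solver

# Illustrative training data (x_i, y_i), i = 1, ..., n.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=50)
y = np.sin(np.pi * x) + 0.1 * rng.standard_normal(50)

# Two already fitted (hypothetical) base models: an affine and a cubic least squares fit.
coef1 = np.polyfit(x, y, 1)
coef2 = np.polyfit(x, y, 3)
base_models = [lambda t: np.polyval(coef1, t), lambda t: np.polyval(coef2, t)]

# Solve (10.1): minimize || y - sum_j beta_j M_j(x) ||_2^2 subject to beta_j >= 0.
P = np.column_stack([M(x) for M in base_models])
alpha, _ = nnls(P, y)

def meta_model(t):
    """Stacked meta-model M = sum_j alpha_j M_j."""
    return sum(a * M(t) for a, M in zip(alpha, base_models))

print("stacking weights:", alpha)
print("prediction at x = 0.3:", meta_model(0.3))
```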
50 Sometimes the training data set for the meta-model is chosen to be different from the one on which the base models have been trained.
Figure 10.2: An example for stacking with five different models. After all five models have been
trained, a meta-model is trained to decide what to output for a certain input depending on the
outputs of the five base models.
10.2.3 Boosting
While for stacking all the m different models have been trained independently from each other
and while the meta-model is then built by minimizing a meta-loss function, we now intertwine
these two steps in boosting by successively taking into account one base model after the other.
Boosting
In boosting we build the ensemble of models adaptively by starting with just one
simple base model M1 . Subsequently, model M2 is trained, where training data
points that produced a large error (or were misclassified) for M1 are given more
importance during the training process of M2 . This process is iterated until m base
models have been built. Then, the different base models are combined to come to an
ensemble decision. To this end, there are distinct specific approaches which lead to
different types of boosting methods.
The boosted ensemble model again takes the form of a linear combination M = α1 M1 + · · · + αm Mm with real-valued coefficients αi . But now, instead of directly optimizing a loss function for the
complicated model M , as we did in stacking, the models Mi and weights αi are optimized
successively by an iterative procedure. This is also known as greedy optimization. We define the
iterates of this process
M̃l := M̃l−1 + αl Ml
for l = 1, . . . , m as our new base models. Here, M̃0 is set to zero. After m iteration steps we
obtain M = M̃m . Here, in the lth iteration step, we compute αl and Ml by (approximately)
determining
(αl , Ml ) := arg min_{β∈R, M̂ ∈M} C(M̃l−1 + β M̂ )          (10.2)
for a given function C(f ) := L((f (x1 ), y1 ), . . . , (f (xn ), yn )), which defines the loss on the
training data, e.g., for least squares loss L. Here, the base model M̂ is chosen from a fixed model
class M, such as the class of affine linear functions, for instance. However, for most model
classes and loss functions, (10.2) is not straightforward to calculate and the minimum is only
approximated. For example, in so-called gradient boosting [Fri01] for regression problems, Ml
is just taken to be the least squares fit to the residuals yi − M̃l−1 (xi ) of the true labels yi and the
evaluations M̃l−1 (xi ) on the training data.
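The following sketch illustrates this residual-fitting view of gradient boosting on a one-dimensional toy problem; regression stumps are used here as one common (assumed) choice of base model class, and the coefficient of each new base model is simply fixed to one.

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-1.0, 1.0, 100)
y = np.sin(3.0 * x) + 0.05 * rng.standard_normal(x.size)

def fit_stump(x, r):
    """Least squares fit of a regression stump (one threshold, two constants) to the residuals r."""
    best = (np.inf, None, None, None)
    for t in np.linspace(x.min(), x.max(), 20)[1:-1]:
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    _, t, c_left, c_right = best
    return lambda u: np.where(u <= t, c_left, c_right)

m = 10
F = np.zeros_like(y)          # running ensemble prediction, i.e., M~_0 = 0
stumps = []
for l in range(m):
    r = y - F                 # residuals y_i - M~_{l-1}(x_i)
    M_l = fit_stump(x, r)     # base model fitted to the residuals
    stumps.append(M_l)
    F = F + M_l(x)            # M~_l = M~_{l-1} + M_l (coefficient fixed to one here)
    print(f"step {l + 1}: training MSE = {np.mean((y - F) ** 2):.4f}")

def boosted_model(t):
    """Evaluate the boosted model M = M~_m at new inputs t."""
    return sum(M(t) for M in stumps)
```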
The most famous boosting method for classification problems is AdaBoost, which stands for
adaptive boosting. Here, in the training of M̃l , the training data points are weighted in such a
way that points that were wrongly classified by M̃l−1 are assigned larger weights. In particular,
for a two-class problem with yi ∈ {−1, 1} for all i = 1, . . . , n, the exponential weights
wi^(l) := exp( −yi M̃l−1 (xi ) )
are used. Thus, each weight is either exp(1), namely if yi ̸= M̃l−1 (xi ), or exp(−1), otherwise.
Then, M̃l = M̃l−1 + αl Ml is determined by solving
(αl , Ml ) := arg min_{β∈R, M̂ ∈M} Σ_{i=1}^{n} wi^(l) exp( −yi β M̂ (xi ) ),          (10.3)
where M̂ is optimized over some fixed model class M. For more details on the weighting process
and the greedy optimization, we refer the reader to [FS97, HTF09, HRZZ09]. A schematic
illustration of the AdaBoost mechanism is given in Figure 10.3.
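For ±1-valued base classifiers, the inner minimization in (10.3) has a well-known closed-form solution for the coefficient once the base model is fixed. The sketch below uses this standard closed form together with axis-parallel decision stumps as assumed base models on a toy data set; it is meant to illustrate the reweighting mechanism only.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.where(x[:, 0] + x[:, 1] > 0.0, 1, -1)          # labels in {-1, +1}

def best_stump(x, y, w):
    """Axis-parallel stump sign(s * ([x]_c - t)) minimizing the weighted misclassification error."""
    best = (np.inf, None)
    for c in range(x.shape[1]):
        for t in np.linspace(-1.0, 1.0, 21):
            for s in (1, -1):
                pred = np.where(s * (x[:, c] - t) > 0.0, 1, -1)
                err = w[pred != y].sum() / w.sum()
                if err < best[0]:
                    best = (err, (c, t, s))
    return best

F = np.zeros(len(y))                                   # M~_0 = 0
for l in range(5):
    w = np.exp(-y * F)                                 # weights w_i^(l) = exp(-y_i M~_{l-1}(x_i))
    err, (c, t, s) = best_stump(x, y, w)
    alpha = 0.5 * np.log((1.0 - err) / err)            # closed-form minimizer of (10.3) for +-1 stumps
    F = F + alpha * np.where(s * (x[:, c] - t) > 0.0, 1, -1)
    print(f"round {l + 1}: weighted error {err:.3f}, training accuracy {np.mean(np.sign(F) == y):.3f}")
```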
Figure 10.3: An example for AdaBoost with three base models. After a base model is trained,
misclassified samples (or samples with high error) are given a larger weight before the next model
is trained by (10.3).
10.2.4 Bagging
Stacking and boosting are ensemble methods which reduce (mainly) the bias of the model, i.e.,
they increase the model complexity either by a meta-model or by adaptively growing the ensem-
ble model. This way, the ensemble model is able to generalize to a broader class of functions
than the individual base models alone. However, another important quantity is the variance of
the ensemble model, which is related to how robust the model is with respect to new input data;
see also Section 1.3. A method that mainly aims at reducing the variance of the ensemble model
is bagging.
Bagging
Bagging is an abbreviation for bootstrap aggregation. It means that the base models
M1 , . . . , Mm are trained independently from each other on reduced data sets which
were sampled by so-called bootstrapping. The ensemble model M is then built by
combining the base models appropriately, e.g., by averaging or voting on the results
of the base models. In this way, the bagging ensemble model exhibits reduced vari-
ance compared to the original base models. For more details see [Bre96a].
In contrast to stacking, where the meta-model is the key part, or boosting, where the adaptive
construction of the base models is the core idea, bagging has its focus on the choice of the input
data sets for the base models. These data sets are bootstrapped variants of the original training
data set.
Bootstrapping
The term bootstrapping describes a specific way to (re-)sample a data set. Assume
we have an original data set D of n samples. Then, a bootstrapped variant B thereof is
a data set of k ≤ n samples that have been drawn from D randomly with replacement.
Thus, there might be copies of the same data point in B. If D contains a large number
of independent and identically distributed samples from the underlying, unknown
data distribution, B can also be regarded as a set of representative and independent samples
thereof; see [MD93] for more details.
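Drawing a bootstrapped data set requires only sampling indices with replacement, as the following short sketch on a purely illustrative data set shows.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((100, 5))      # original data set D with n = 100 points in R^5
y = rng.standard_normal(100)           # corresponding labels

def bootstrap(X, y, k, rng):
    """Draw k <= n samples from (X, y) randomly with replacement."""
    idx = rng.integers(0, len(X), size=k)    # indices may repeat
    return X[idx], y[idx]

# Four bootstrapped training sets of size k = 80, one for each of four base models.
samples = [bootstrap(X, y, 80, rng) for _ in range(4)]
print([Xb.shape for Xb, _ in samples])
```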
One of the major benefits of bootstrapping is that we can draw samples from the original data
distribution without knowing it and without having to acquire new data points according to it,
which might be very costly or time-consuming in practical applications. However, the boot-
strapped samples can of course only be representative of the underlying data distribution if the
original data set is representative as well. A schematic illustration of bagging, i.e., of draw-
ing samples for the base models via bootstrapping and combining these models via voting or
averaging, can be found in Figure 10.4.
Figure 10.4: An example for bagging with four different base models. First, four bootstrapped
sets of some fixed size k ≤ n of the original data set are created. After the four base models
have been trained on the respective bootstrapped sample set, voting or averaging is performed to
acquire the ensemble result.
A decision tree51 partitions the data space Rd by recursively splitting it at its inner nodes according to conditions of the form [x]c ≤ t on a single coordinate c ∈ {1, . . . , d} of x ∈ Rd and a threshold t ∈ R.
The goal is that parts of the training data set x1 , . . . , xn which exhibit the same (or
similar) labels reside in the same partition. To this end, the training is usually done
by successively choosing partitions that minimize the entropy (or the Gini impurity)
of the two resulting parts of the data set; see [Bre84] for more details.
A random forest is an ensemble model using bagging with decision trees as base
models; see [Ho95]. Furthermore, each decision tree is only allowed to use a cer-
tain subset of the d coordinates of the ambient space during training to reduce the
complexity; see [Bre01].
51 Note that a decision tree technically does not have to be a binary tree. However, for almost all practical applications, binary trees are employed.
Figure 10.5: An example of an already trained decision tree T of maximum depth 3 for two-
dimensional data. The green leaf nodes illustrate the output values of the tree, i.e., for an input
whose corresponding decision path leads to a specific leaf, the output value of that leaf is taken
as the predicted label for the input. The red path shows how the tree is evaluated for the example
input data point x = (0.5, 0.2). The result is T (x) = 2.
After a decision tree model has been trained, we can evaluate it for a specific data point x by
traversing the corresponding path within the tree beginning at the root and ending in a leaf. As
we will explain in more detail in the following, each non-leaf node checks a condition on one of
the coordinate dimensions x1 , . . . , xd of the data point. If the condition is met, we follow the left
branch; else we follow the right one. Once we arrive at a leaf, the value stored in the leaf node is
the prediction for the data point x we started with. An example of a decision tree can be found
in Figure 10.5.
Besides the maximum tree depth, the most important choice when training a decision tree T
is the choice of the splitting criterion for each node. At each node we have to decide which
coordinate direction c ∈ {1, . . . , d} and which value t ∈ R to take for the split. Usually, some
loss function C or a measure for the homogeneity of the target variable within the subsets is used
to evaluate the quality of a specific split, and it is then minimized (heuristically) with respect to
the pair (c, t).
Let us take the least squares loss for regression problems as an example. When splitting the
root at (c, t) for our training data set (x1 , y1 ), . . . , (xn , yn ), we create two new index sets
Lc,t = { i | [xi ]c ≤ t, i = 1, . . . , n } and Rc,t = { i | [xi ]c > t, i = 1, . . . , n },
where [xi ]c denotes the cth coordinate of the vector xi . Now we can compute the least squares
error after the split by taking the average of all labels in the corresponding sets as the predicted
value, i.e.,
(1/n) Σ_{i∈Lc,t} ( yi − (1/|Lc,t |) Σ_{j∈Lc,t} yj )^2 + (1/n) Σ_{i∈Rc,t} ( yi − (1/|Rc,t |) Σ_{j∈Rc,t} yj )^2 .
This quantity is then minimized to find the optimal (c, t) for the actual splitting step. Subse-
quently, the same procedure is performed for the child nodes with the corresponding (smaller)
data sets; see also [Bre84] for details. This is again a greedy type of optimization, as in boosting,
since we optimize the criterion for all the nodes sequentially. The splitting procedure is stopped
when the least squares error in a node is smaller than some threshold or when the maximum
depth is reached.
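The sketch below evaluates this least squares splitting criterion for a single node: it searches over all coordinate directions c and a set of candidate thresholds t and returns the minimizing pair. The candidate thresholds (midpoints of sorted coordinate values) and the toy data are illustrative choices.

```python
import numpy as np

def best_split(X, y):
    """Return the pair (c, t) minimizing the least squares error after splitting at [x]_c <= t."""
    n, d = X.shape
    best = (np.inf, None, None)
    for c in range(d):
        vals = np.unique(X[:, c])
        for t in (vals[:-1] + vals[1:]) / 2.0:        # candidate thresholds: midpoints
            left = X[:, c] <= t
            right = ~left
            err = (((y[left] - y[left].mean()) ** 2).sum()
                   + ((y[right] - y[right].mean()) ** 2).sum()) / n
            if err < best[0]:
                best = (err, c, t)
    return best

rng = np.random.default_rng(4)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.where(X[:, 0] <= 0.3, 1.0, -2.0) + 0.1 * rng.standard_normal(200)

err, c, t = best_split(X, y)
print(f"best split: coordinate {c}, threshold {t:.3f}, least squares error {err:.4f}")
```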
Finally, to create an ensemble model, we build m independent decision trees as base models.
To this end, we use a random coordinate subset C̃j ⊂ {1, . . . , d} for each of the j = 1, . . . , m
models as possible splitting candidates, i.e., for the jth tree, a splitting in a node is only possible
in a direction c ∈ C̃j . Furthermore, bootstrapping is used to reduce the training data set size
for each decision tree. In this way, we arrive at the random forest model, which mitigates the
problem of high variance that single decision trees tend to have; see [Bre01].
10.3 Recurrent neural networks
The idea behind recurrent neural networks (RNNs) is that past information can be used to alter the way in which new infor-
mation is processed. Often the output of such a network depends on the most recent input and
an additional (hidden) state h ∈ Rdh with fixed hidden state dimension dh ∈ N. The hidden
state itself depends on earlier inputs and earlier hidden states of the network. This mechanism is
beneficial when processing sequential data.
In the following, we will denote by z ∈ Rd the network’s input52 sequence. Note that the
sequence length d and the hidden state dimension dh are not related in any way. For the sake
of simplicity, we assume here that each element zj for j = 1, . . . , d of the input sequence is
a real number. Note however that the z1 , . . . , zd could be vectors themselves, i.e., zt ∈ Rdi
for each t = 1, . . . , d with some application-dependent input dimension di ∈ N. Their spe-
cific form and structure depends on the type of sequence which is to be investigated. Next, let
ht ∈ Rdh be the sequence of hidden state vectors for t = 1, . . . , d. Finally, the network’s output
is denoted by f (z). Depending on the task under consideration, this could be a whole sequence
52 For input sequences with variable lengths, the length is also passed on as an input parameter and d is usually chosen
large enough to capture all sequences in the data set. The remaining coordinates of an input are then just set to 0.
f (z) = (f1 , . . . , fd ) ∈ Rd or only the value f (z) = fd ∈ R, which is referring to the final com-
putation. The first situation is typically encountered for translator networks (see [CvMG+ 14]),
while the second situation appears typically for sequence regression networks (see [QSC+ 17]).
Note here that also the f1 , . . . , fd could be vectors themselves instead of real numbers, i.e.,
ft ∈ Rdo with some application-dependent dimension do ∈ N of the output data of the respec-
tive network. For the sake of simplicity, we consider di = do = 1 in the following. The graph
representation of a simple RNN can be found in Figure 10.6.
Figure 10.6: Graph representation of a simple recurrent neural network with input z, state vector
h, and output f (z).
An RNN graph as in Figure 10.6 has to be read differently from a graph for a feed-forward
network. Here, A : R × Rdh → Rdh , B : Rdh → Rdh , and C : Rdh → R are not just simple
weights but certain transformations that depend on the specific architecture of the network, as we
will see later on. This representation is used as a compact form of the so-called unfolded graph,
where the loop is completely expanded and, thus, no longer appears; see Figure 10.7.
Here, the unfolded graph has to be understood as follows:
• The calculation begins with the first element z1 of the input sequence to which A is applied
to compute the first hidden state h1 := A(z1 , 0) ∈ Rdh with some fixed hidden state
dimension dh . The hidden state is then transformed by C to obtain the first output f1 .
• In a next step, z2 is fed into the network. Then, by taking into account the values of z2
and B(h1 ), the network computes the next hidden state h2 = A(z2 , B(h1 )) and the next
output f2 = C(h2 ).
• This procedure is continued until we reach the end of the input sequence zd , i.e., we have
ht = A(zt , B(ht−1 )) and ft = C(ht ) for all t = 1, . . . , d, where h0 := 0.
Note that the application of the activation function ϕ in the update formula below has to be understood componentwise.
Figure 10.7: Unfolded graph representation of a simple recurrent neural network with input
z = (z1 , . . . , zd ), state vectors ht for t = 1, . . . , d, and output f (z) = (f1 , . . . , fd ), where
ht = A(zt , B(ht−1 )) and ft = C(ht ).
As we see, the sequence part zt is used to compute the hidden state ht and the output ft via
ht = ϕ(azt + Bht−1 + a0 ) and ft = ψ(cT ht + c0 )
for each t = 1, . . . , d, where the variables a, a0 , B, c, c0 have to be determined by an opti-
mizer. To this end, backpropagation is applied to the unfolded graph after a loss function has
been chosen. This approach is also known as backpropagation through time. While this works
analogously to feed-forward neural networks, simple RNNs such as the one above, but with long
input sequences, tend to encounter problems with many local minima and vanishing or explod-
ing gradients during the optimization procedure with SGD-type methods; see [GBC16] for more
details. Therefore, skip connections between far away sequence parts have been introduced to
circumvent this problem to some extent, i.e., the output of neurons at part t of the sequence can
directly be accessed by neurons at part t + δt of the sequence for a δt ≫ 1. In other words,
there are connections between neurons corresponding to far away sequence parts. Besides, more
sophisticated RNNs have been proposed, which lead to a more stable optimization process.
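A forward pass through such a simple RNN can be written in a few lines; the sketch below uses tanh for ϕ, the identity for ψ, and randomly initialized (i.e., untrained) parameters purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
d, d_h = 12, 4                          # sequence length and hidden state dimension

# Untrained example parameters a, a0, B, c, c0 (in practice determined by an optimizer).
a = rng.standard_normal(d_h)
a0 = rng.standard_normal(d_h)
B = 0.5 * rng.standard_normal((d_h, d_h))
c = rng.standard_normal(d_h)
c0 = 0.0

def rnn_forward(z):
    """Hidden states and outputs for a scalar input sequence z = (z_1, ..., z_d)."""
    h = np.zeros(d_h)                   # h_0 := 0
    outputs = []
    for z_t in z:
        h = np.tanh(a * z_t + B @ h + a0)    # h_t = phi(a z_t + B h_{t-1} + a_0), phi = tanh
        outputs.append(c @ h + c0)           # f_t = psi(c^T h_t + c_0), psi = identity
    return np.array(outputs)

z = np.sin(np.linspace(0.0, 2.0 * np.pi, d))     # example input sequence
print(rnn_forward(z))
```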
In contrast to previous models, an LSTM introduces a second hidden state variable, the so-called
cell state ct ∈ Rdh . The visualization of an LSTM network is given in Figure 10.8. Here, we
assume that each element of the input and output sequences is indeed a vector. For the sake of
simplicity, we let zt ∈ Rdi be of input dimension di ∈ N at each time step t = 1, . . . , d, and we
let the corresponding output ft ∈ Rdh be of the same dimension dh as the hidden state. Although
these are vectors, we do not use the boldsymbol vector notation for them here to be consistent
with our previous notation from Section 10.3.1 and to avoid confusion.
The illustration in Figure 10.8 shows how the computation of ft ∈ Rdh and the new hidden
states ht ∈ Rdh and cell states ct ∈ Rdh works when given zt ∈ Rdi and ht−1 ∈ Rdh ,
ct−1 ∈ Rdh . Essentially, the computational graph is built by certain arithmetic operations and
activation functions (σ and tanh). The involved weights and biases are omitted here from the
Figure 10.8: Graph representation of an LSTM network with input z = (z1 , . . . , zd ), hidden
state vectors ht , and cell state vectors ct for t = 1, . . . , d and output f (z) = (f1 , . . . , fd ).
graph for the sake of clarity. The forward pass of the illustrated LSTM cell combines sigmoid gates and tanh nonlinearities to compute the output ft as well as the new hidden state ht and the new cell state ct from zt , ht−1 , and ct−1 .
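Since the cell's computations are only indicated graphically in Figure 10.8, the following sketch spells out one common, standard LSTM formulation with forget, input, and output gates. The concrete weight shapes and the identification of the output ft with the new hidden state ht are assumptions of this sketch and need not match the exact conventions of the figure.

```python
import numpy as np

def sigma(u):
    """Logistic sigmoid."""
    return 1.0 / (1.0 + np.exp(-u))

def lstm_cell(z_t, h_prev, c_prev, W, U, b):
    """One step of a standard LSTM cell: gates from z_t and h_{t-1}, then state updates."""
    f_gate = sigma(W["f"] @ z_t + U["f"] @ h_prev + b["f"])    # forget gate
    i_gate = sigma(W["i"] @ z_t + U["i"] @ h_prev + b["i"])    # input gate
    o_gate = sigma(W["o"] @ z_t + U["o"] @ h_prev + b["o"])    # output gate
    c_cand = np.tanh(W["c"] @ z_t + U["c"] @ h_prev + b["c"])  # candidate cell state
    c_t = f_gate * c_prev + i_gate * c_cand                    # new cell state
    h_t = o_gate * np.tanh(c_t)                                # new hidden state (taken as output f_t here)
    return h_t, c_t

rng = np.random.default_rng(6)
d_i, d_h = 3, 5
W = {k: 0.3 * rng.standard_normal((d_h, d_i)) for k in "fioc"}
U = {k: 0.3 * rng.standard_normal((d_h, d_h)) for k in "fioc"}
b = {k: np.zeros(d_h) for k in "fioc"}

h, c = np.zeros(d_h), np.zeros(d_h)
for t in range(4):                         # process a short random input sequence
    h, c = lstm_cell(rng.standard_normal(d_i), h, c, W, U, b)
print(h)
```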
A different strategy to mitigate the problem of vanishing and exploding gradients is to restrict the update
ht = ϕ(azt + Bht−1 + a0 )
of the hidden state vector for some a, a0 ∈ Rdh , B ∈ Rdh ×dh . Then, one can show that, for an
orthogonal (or unitary) matrix B and for ϕ = ReLU, the optimization with SGD-like methods
works well, i.e., the norms of the involved gradients stay constant; see [ASB16, HSL16] for
details. However, these networks suffer from a poorer expressivity than standard RNNs, and the
class of functions that can be represented by such a network is thus more limited. Recent research
addresses the question of a good tradeoff between the expressivity and the (almost-)orthogonality
of the RNN; see [KGPT+ 19].
Another class of RNNs which are able to circumvent the problem of vanishing and exploding
gradients is continuous-time RNNs. Here, similarly to Chapter 8, the RNN’s evolution equations
are interpreted as a time-discrete variant of an ODE. For example, an ODE of type
ḣ(t) = g(h(t), z(t))
could be discretized as
ht = ht−1 + δt g(ht−1 , zt )
with time step δt, which just resembles a time-explicit Euler discretization with respect to h and
z. This then defines the hidden state evolution equation for the corresponding RNN. For more
details on the connection between certain ODEs and RNNs, see [She20].
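As a minimal illustration of such an Euler-type update, the sketch below chooses a simple right-hand side g(h, z) = tanh(W h + V z + b) for the hidden state ODE; this particular g, the parameters, and the step size are assumptions made only for the example.

```python
import numpy as np

rng = np.random.default_rng(7)
d, d_h = 20, 4
W = 0.1 * rng.standard_normal((d_h, d_h))
V = rng.standard_normal(d_h)               # input weights (scalar input z_t)
b = np.zeros(d_h)
dt = 0.1                                   # time step delta t

def g(h, z):
    """Assumed right-hand side of the hidden state ODE."""
    return np.tanh(W @ h + V * z + b)

z = np.cos(np.linspace(0.0, 4.0 * np.pi, d))    # example input sequence
h = np.zeros(d_h)
for z_t in z:
    h = h + dt * g(h, z_t)                 # explicit Euler step h_t = h_{t-1} + dt * g(h_{t-1}, z_t)
print(h)
```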
Note that the connection to ODEs allows for a more elaborated stability analysis of the net-
work’s evolution and optimization processes. Thus, more complex ODEs and more intricate
time-discretization schemes can lead to RNNs that have beneficial behavior when it comes to
optimizing the network’s weights and biases with SGD. This way, the problem of vanishing and
exploding gradients can be circumvented; see, e.g., [CCHC19] for antisymmetric hidden state
transition matrices, [EAQ+ 21] for a learnable scale of symmetrized or antisymmetric hidden
state transition matrices, [RM21] for an RNN stemming from a second-order ODE system moti-
vated by coupled nonlinear oscillators, and [RMEM21] for an RNN stemming from a multi-scale
ODE system. These approaches manage to either reduce the problem of vanishing and exploding
gradients or circumvent it completely.
10.4 Transformer neural networks
Transformer-based neural networks have been especially successful in natural language pro-
cessing tasks such as text generation or machine translation [Bea20, DCLT19, VSP+ 17]. To deal
with sequence-to-sequence problems, e.g., in machine translation, the transformer employs two
separate modules: an encoder and a decoder. First, the encoder processes a sequence, e.g., a sen-
tence (sequence of words) from the source language, as an input. Its goal is to compute a compact
representation of all the relevant information in the input sequence, similar to the hidden state
in RNNs. Next, the representation that the encoder outputs is taken as an input for the decoder.
Moreover, the decoder is usually also fed a piece of the desired output sequence, e.g., the already
translated beginning of a sentence from the target language. Now, the decoder generates the next
element of the output sequence based on the given representation of the input sequence and the
given part of the output sequence. After adding the new element that the decoder produced to
the output sequence, this process is repeated until the whole output sequence has been produced,
e.g., until a full sentence has been output by the decoder.
One major benefit of transformer networks in contrast to recurrent neural networks is that
transformers treat the whole input sequence at once because of the employed self-attention mod-
ules. These are able to efficiently model the semantic dependencies between different words of
an input sentence. In this way, the transformer network can grasp the structure of a sentence
much better than an LSTM network. Thus, long-range dependencies, which were problem-
atic for RNNs because of the computational costs and the information loss over long sequences,
can be modeled much more easily. Moreover, even a parallelization of the computations is pos-
sible, which was not the case for RNNs because of their successive computation of the hidden
state vectors.
Before we introduce the specific network architecture of the transformer, we need to have a
look at attention modules to understand their basic mechanisms.
The basic mechanism of an attention module is to compute its output as a weighted sum
Σ_{i=1}^{d} wi v i          (10.4)
of value vectors v 1 , . . . , v d with certain weights wi . Here, the value vectors v 1 , . . . , v d are related to the representation
vectors r 1 , . . . , r d . The exact relationship depends on the type of attention module and will
become clearer in the next section.
More specifically, an attention module mimics the process of retrieving a value v ∈ RV
based on a query q ∈ RQ and a key k ∈ RK from a database53 for some fixed dimensions
V, Q, K ∈ N. To this end, let us assume that the d value vectors v 1 , . . . , v d correspond to d
key vectors k1 , . . . , kd . Now, we choose the weights wi in (10.4) in such a way that they reflect
the similarity between q and ki . The idea behind this mechanism is that values with keys which
match the query the most should get most of the attention, i.e., they will get assigned larger
weights.54 Let us now discuss how the weights wi are determined for the dot product similarity
measure, which is used in the transformer architecture.
Dot product similarity The most commonly used similarity measure in attention modules is
a scaled dot product between the query vector and the key vectors when Q = K, i.e.,
similarity(q, ki ) := (q T ki ) / √K .
53 When we introduce the specific transformer architecture in Section 10.4.2, it will become evident that the database
contains keys and values which represent feature vectors of words in either the input or the output language. These are
then matched to query feature vectors.
54 In the case of database retrieval, this means that a weighted sum of all values from the database is returned, where
the weights reflect how well the query matched the corresponding keys.
Now, in order to fulfill the side conditions of non-negative weights that sum up to one, we apply
the softmax function to the similarities, i.e., we set
wi := softmax(similarity(q, ki )) = exp( similarity(q, ki ) ) / Σ_{j=1}^{d} exp( similarity(q, kj ) ).          (10.5)
Note that the weights do not depend on the values v 1 , . . . , v d in general. However, in the specific
case of the attention modules in the transformer, there is a linear relationship between the keys
and the values, as we will see in Section 10.4.2. Thus, the weights do depend on v 1 , . . . , v d
there.
Multiple queries Finally, let us have a look at the case where we have dq different queries
q 1 , . . . , q dq of dimension Q = K. If we stack the value, query, and key vectors row-wise into
matrices V ∈ Rd×V , Q ∈ Rdq ×Q , and K ∈ Rd×K and if indeed Q = K, we can write the
output of the attention module as
attention(Q, K, V ) := softmax( QK T / √K ) V ∈ Rdq ×V .          (10.6)
Here, QK T / √K ∈ Rdq ×d has entries
[ QK T / √K ]i,j = similarity(q i , kj )
for i = 1, . . . , dq and j = 1, . . . , d, and the softmax in (10.6) is applied row-wise, i.e., separately for each query.
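Formula (10.6) translates almost literally into code. The sketch below applies a numerically stabilized row-wise softmax to the scaled query-key products; the matrix sizes are illustrative.

```python
import numpy as np

def softmax_rows(A):
    """Numerically stable softmax, applied row-wise."""
    E = np.exp(A - A.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention (10.6): softmax(Q K^T / sqrt(key dim)) V."""
    weights = softmax_rows(Q @ K.T / np.sqrt(K.shape[1]))   # one weight vector per query, summing to one
    return weights @ V

rng = np.random.default_rng(8)
d, d_q = 6, 3                      # number of key/value vectors and number of query vectors
K = rng.standard_normal((d, 4))
V = rng.standard_normal((d, 5))
Q = rng.standard_normal((d_q, 4))
print(attention(Q, K, V).shape)    # (d_q, 5): one weighted combination of values per query
```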
Input/Output embedding Each element of a given input or output sequence is embedded into
a suitable feature space. For instance, for machine translation, each word (or token/subword unit
[SHB16]) of a sentence is embedded into Rd̃ via word embedding. In particular, for the input
embedding, we first fix the size s̃ ∈ N of the possible vocabulary for the input language and
store all possible words in an input dictionary.55 Then, each word of the dictionary is assigned
a different index i ∈ {1, . . . , s̃}. Thus, each word can be encoded by the so-called one-hot
55 Note that a dictionary here only contains words from one specific language and not the corresponding translations
into another language. This is consistent with the use of the term dictionary in the machine learning literature.
Figure 10.9: A schematic overview of the transformer network for translation. The main parts
are N consecutive encoder (left) and decoder (right) blocks containing attention and feed-forward
sub-blocks. Our illustration is based on Figure 1 of [VSP+ 17].
encoding, i.e., the word belonging to index i is assigned to the ith unit vector ei ∈ Rs̃ with
entries
[ei ]j := 1 if i = j, and [ei ]j := 0 else, for j = 1, . . . , s̃. The resulting embedding vector is
obtained by multiplication of a scaled embedding matrix √d̃ · E (in) ∈ Rd̃×s̃ with the vector ei ,
i.e., the ith column of √d̃ · E (in) is the
embedding vector for the ith word from the dictionary. The entries of E (in) are determined during
the training phase of the transformer; see Section 10.4.3 for details. The same process is applied
for the output embedding with a different learnable weight matrix E (out) ∈ Rd̃×t̃ , where t̃ ∈ N is
the dictionary size for the output language. In this way, the word embeddings are learned from
scratch by the transformer.
Positional encoding Subsequently, special positional encoding vectors are added to the re-
sults. This is done to ensure that the network can identify the position in which a word occurred
in the original sentence. As the remaining elements of the network, e.g., the attention modules,
are permutation invariant, i.e., they do not work with the initial sequences but only with a set
of vectors without particular order, the algorithm could not infer a word’s position without the
positional encoding. For details on the specific way the positional encoding is calculated in the
case of the transformer, see [VSP+ 17].
Multi-head attention in the encoder After adding the positional encoding, the resulting vec-
tors are passed to a so-called multi-head attention layer in the encoder. Let us call the incoming
d vectors r 1 , . . . , r d ∈ Rd̃ and let R = (r 1 , . . . , r d )T ∈ Rd×d̃ , where d̃ is the embedding di-
mension introduced at the beginning of this section. Note however that different sequences can
have different sequence lengths d, i.e., the word count might vary between the sentences in the
data set.
Now, let us answer the open question from Section 10.4.1 of how the vectors r 1 , . . . , r d are
related to the attention module’s queries, keys, and values. To this end, note that the atten-
tion module here is a self-attention module, meaning that queries, keys, and values are built
from the same data. More specifically, they are constructed by multiplying the incoming vectors
r 1 , . . . , r d of R by learnable weight matrices, i.e., by performing a separate linear transform of R for the queries, the keys, and the values, respectively.
Masked multi-head attention in the decoder Similarly to the multi-head attention module
in the encoder, there is a masked multi-head attention module as a first building block in the
decoder. Both work in the same fashion except for the fact that there is a positional restriction
in the masked variant. More specifically, query vectors can only attend to key vectors that are
positioned in front of them in the sequence. Technically, this is done by setting the similarity
to −∞ in the case of invalid query-key pairs. This makes sure that the transformer’s decoder is
able to construct sentences word-by-word without using knowledge about parts of the sentence
that have not been constructed yet.
Cross-attention module Finally, there is a second multi-head attention module in the decoder
that mixes the information of the encoded input sequence and the output sequence. Here, dq
query vectors, which have been computed from the output sequence, attend to d key vectors
from the encoded input. Note that dq and d can be different this time. Similar to self-attention,
the queries, keys, and values are constructed via learnable weight matrices.
Skip connections Next to the multi-head attention modules and the two-layer feed-forward
modules we always have a skip connection as in Figure 10.9, which adds the input of the cor-
responding module to its output. This gives the network the possibility to balance the original
information with the result of the module.
Layer normalization The normalization blocks perform a layer normalization. Here, instead
of using standardization (see Section 2.5), each entry of an incoming vector x = (x1 , . . . , xd̃ ) is
normalized by
xj^norm := gj (xj − m)/σ     with     m := (1/d̃) Σ_{i=1}^{d̃} xi     and     σ^2 := (1/d̃) Σ_{i=1}^{d̃} (xi − m)^2 ,
where gj is a learnable weight parameter for j = 1, . . . , d̃, which is determined during the
training phase; see Section 10.4.3. For more details on layer normalization see [BKH16].
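A layer normalization step follows directly from the formula above; in the sketch below the gains gj are simply initialized to one, and a small constant is added under the square root for numerical stability, as is common in practice.

```python
import numpy as np

def layer_norm(x, g, eps=1e-6):
    """Normalize the entries of x by their mean and standard deviation, scaled by the gains g."""
    m = x.mean()
    sigma = np.sqrt(x.var() + eps)     # eps avoids division by zero
    return g * (x - m) / sigma

x = np.array([2.0, -1.0, 0.5, 3.0])
g = np.ones_like(x)                    # learnable gains g_j, here simply initialized to one
print(layer_norm(x, g))
```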
Two-layer feed-forward network Besides multi-head attention modules and layer normal-
ization, the transformer also employs fully connected two-layer feed-forward neural networks
with ReLU activation functions as modules in both the encoder and decoder. The weights and
biases of these networks are being determined in the training phase of the transformer; see also
Section 10.4.3.
Output probabilities After applying N consecutive decoder blocks, the resulting vectors are
transformed from d˜ dimensions to t̃ dimensions, where t̃ is the size of the vocabulary of the output
language. This transformation is usually fixed to be the transpose of E (out) , which we used for
˜
the output word embedding. In particular, for a vector x ∈ Rd resulting from the final decoder
T
block, we compute E (out) x. We can think of this as a vector of Euclidean scalar products
between x and the columns of E (out) . After applying a final softmax activation function, the
transformer returns the output probabilities of each word in the dictionary. This means that the
most probable word, i.e., the word for which the scalar product between its embedding vector
and x is the largest among all words from the dictionary, is the best candidate to continue the
given output sequence.
Training phase As we have seen, the transformer employs an encoder and a decoder, each
consisting of several modules, e.g., input/output embedding, multi-head attention, normalization,
and two-layer feed-forward networks. The learnable parameters, which need to be determined
during the training phase, are
• the entries of the input and output embedding matrices E (in) and E (out) ,
• the weight matrices of the (masked and unmasked) multi-head attention modules,
• the gains gj of the layer normalizations, and
• the weights and biases of the two-layer feed-forward networks.
To determine these parameters, we use a training data set of pairs of input and corresponding
output sequences, e.g., input sentences in a source language and their corresponding translation
in a target language. A cross entropy loss function (see Section 6.4.5) is employed to compare the
output probabilities after the final softmax layer of the transformer to the true output sequence
from the training data. To this end, partial sequences, which only contain the start of each
output sequence, are employed. In this way, the task becomes to predict the next word of the
output sequence; see also the translation example below. Note that the output sequences from the
training data set are also shifted by one word to the right and start with the placeholder “[start]”.
This is to make sure that the network can translate sentences from scratch without knowing the
first word of the translation in the later generation phase, where only the input sequence is known.
Finally, an SGD-type optimizer is used to minimize the cross entropy over all the training data
in order to determine the free parameters of the transformer.
As an example, let us consider an English-to-German translator. During the training phase,
we feed the network with pairs of English input sentences, e.g., “I will meet you the day after
tomorrow.”, and the corresponding German translations, e.g., “[start] Ich werde dich übermor-
gen treffen.” as output sentence. Regarding both sentences as word-by-word sequences, note
that we included the placeholder [start] in the output sequence, which introduces the aforemen-
tioned shift. During training, we use all the partial sentences “[start]”, “[start] Ich”, “[start] Ich
werde”, “[start] Ich werde dich”, “[start] Ich werde dich übermorgen”, and “[start] Ich werde dich
übermorgen treffen” of the output sequence, together with the whole input sequence, to train the
transformer to predict the next word for each partial output sentence. Note that for the last output
sentence “[start] Ich werde dich übermorgen treffen” the transformer needs to predict that the
translation is complete, i.e., it should predict a punctuation mark.56 Since the transformer gen-
erates its output on a word by word basis, the optimal prediction from the network for the first
output subsequence “[start]” would be “Ich”, i.e., the word Ich should have the highest output
probability among all words in the output language’s dictionary. For the partial output sentence
“[start] Ich” the transformer should predict “werde” and so on. By following this procedure for
the whole training data set, we obtain prediction probabilities for all pairs of input sequences and
corresponding partial output sequences. These probabilities are processed by the cross entropy
loss function, which is then minimized with respect to the free parameters of the transformer.
Generation phase In the generation phase, we are only given an input sequence, e.g., a sen-
tence in the source language, but not an output sequence, e.g., the corresponding translation
in the target language. Thus, we have to generate the output sequence word by word from its
beginning.
For our English-to-German translator from above this would mean that we are given the input
sequence “I will meet you the day after tomorrow.” and no output sequence. Therefore, we
evaluate the trained transformer for the given input sequence and the output sequence “[start]”.
Similar as in the training phase, the transformer now predicts the next word a from the dictionary,
i.e., a is the word corresponding to the highest value in the transformer’s output probabilities. In
the optimal case, a = Ich. Then, to continue the generation/translation of the output sentence,
we evaluate the trained transformer again for the given input sequence but now with the output
sequence “[start] a” to predict the next word. This process is continued until the transformer
predicts that the sentence has ended, i.e., a punctuation mark or another end-of-sentence token
has the highest probability. In this way, the transformer is employed to generate a whole output
sequence for a given input sequence.
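The generation phase is essentially a greedy loop around the trained model. The sketch below only illustrates this control flow; the function transformer_probs, which is assumed to return the output probabilities over the target dictionary for a given input sequence and partial output sequence, as well as the tiny stand-in model used to exercise the loop, are hypothetical.

```python
import numpy as np

def generate(transformer_probs, input_tokens, dictionary, end_tokens, max_len=50):
    """Greedy word-by-word generation around a (hypothetical) trained transformer."""
    output_tokens = ["[start]"]
    for _ in range(max_len):
        probs = transformer_probs(input_tokens, output_tokens)   # assumed: probabilities over dictionary
        next_word = dictionary[int(np.argmax(probs))]            # most probable next word
        output_tokens.append(next_word)
        if next_word in end_tokens:                               # punctuation / end-of-sentence token
            break
    return output_tokens[1:]                                      # drop the "[start]" placeholder

# Tiny stand-in model that simply prefers the next dictionary entry, just to exercise the loop.
dictionary = ["Ich", "werde", "dich", "übermorgen", "treffen", "."]
def fake_probs(inp, out):
    probs = np.zeros(len(dictionary))
    probs[min(len(out) - 1, len(dictionary) - 1)] = 1.0
    return probs

source = ["I", "will", "meet", "you", "the", "day", "after", "tomorrow", "."]
print(generate(fake_probs, source, dictionary, end_tokens={"."}))
```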
Another improvement in many natural language processing tasks can be made when introduc-
ing bidirectional learning. Here, instead of inferring the next word in a sentence from the pre-
vious ones, a word is inferred from the whole context, i.e., from all other words in the sentence
including the ones that appear later in the sentence. Together with pre-training on large data-
bases, this is the basic idea of BERT (Bidirectional Encoder Representations from Transformers)
[DCLT19] and RoBERTa (Robustly optimized BERT pre-training approach) [LOG+ 19]. These
approaches employ encoder-only models, i.e., the decoder part of the transformer is omitted. To
this end, to infer words from context, i.e., to predict a word at a given position, a final fully
connected layer and a softmax activation are added to the encoder.
Most recently, OpenAI’s ChatGPT59 received a lot of attention. It is an AI chat-bot based on
the GPT-3 model, i.e., an algorithm which reacts to both an input prompt from a user and the
conversation history with that user by delivering an appropriate answer. Because of its decoder-
only architecture, no source language is used for the GPT-3 model, and the conversation history is
concatenated to the user prompt to obtain one long sequence that is fed to ChatGPT as the output
sequence. Besides supervised learning on the GPT-3 model, ChatGPT’s training process involved
reinforcement learning from human feedback. Here, a so-called proximal policy optimization
algorithm was employed. Besides ChatGPT, improved transformer models like GPT-3 and BERT
have been the cornerstones for several other successful chat-bots and text-generation algorithms
like Google’s Bard,60 Meta’s Llama Chat,61 AI21 Labs’ Jurassic,62 DeepMind’s Sparrow,63 and
Jasper’s Chat.64 Overall, transformer-based architectures are the present state of the art when it
comes to natural language processing tasks.
10.5 Interpretability
Usually machine learning algorithms are trained in such a way that they achieve a small loss.
However, since metrics like the least squares error or the accuracy are not always sufficient to
assess a trained model’s quality in real-world applications, the demand for understanding how a
specific trained model operates has risen drastically. Moreover, finding the underlying reasons for
the produced decisions of a machine learning algorithm has become more and more important.
59 https://openai.com/blog/chatgpt/
60 https://blog.google/technology/ai/bard-google-ai-search-updates
61 https://ai.meta.com/llama/
62 https://www.ai21.com/blog/announcing-ai21-studio-and-jurassic-1
63 https://www.deepmind.com/blog/building-safer-dialogue-agents
64 https://www.jasper.ai/chat
192 Chapter 10. Further Developments
While some measures to achieve interpretability are already built into certain machine learning
algorithms (see, e.g., [RPK19]), most interpretability methods are post-hoc in the sense that they
are applied to the already trained model of a given machine learning algorithm.
The ultimate goals of applying interpretability methods can be quite diverse, ranging from deciding whether to trust a trained model to validating choices in the model design.
While most interpretability methods usually cannot directly achieve these goals, their application
provides a crucial first step in the direction of moving away from black-box machine learning
models that are neither interpretable nor explainable to the user. Let us however emphasize
that interpretability algorithms do not present a full transparent insight into the model’s inner
mechanisms. On the contrary, they usually just compute certain statistics (e.g., correlations)
between input coordinates, features, and the output. There are more sophisticated approaches
like causal machine learning models (see [Pea09, Sch22]), but such techniques are still quite
novel areas of active research.
While the use of interpretability methods does not remedy negative properties that the under-
lying algorithm or data may have, it can still expose those properties and guide the user in the
decision whether to trust a machine learning model or not. Furthermore, when we are able to
distinguish if a model is explainable or not, we can use this information to validate our choices
for the model design, e.g., the number of layers or neurons in deep learning.
We present in the rest of this section an introduction to easy-to-use interpretability algorithms
that can be applied to a trained machine learning model. A more versatile discussion of these
algorithms and more sophisticated variants of interpretability methods can be found in [Mol20].
Sensitivity analysis
In sensitivity analysis the influence of an input coordinate (or a feature) is measured
by quantifying how much the loss changes when changing the respective coordinate
value. More generally, sensitivity analysis always answers the question of how robust
a model is with respect to changes in the input.
We will here focus on the example of instance-based sensitivity analysis, which computes the
sensitivities with respect to each input coordinate direction for one specific input data point x.
An alternative could be average sensitivity analysis, where the average importance of an input
coordinate over all training data points is computed. Assume that the outcome of our trained ma-
chine learning model is the real-valued function f . Using notation similar to that in Section 6.3.1,
let Cx (f ) denote the loss on the data point (x, y). For instance, consider
Cx (f ) := (f (x) − y)2
for a least squares loss function. Then, the sensitivity S in direction j is computed as
S(f )j := | ∂Cx (f ) / ∂xj | .
Note that xj is here referring to the jth coordinate of the d-dimensional vector x and must not
be confused with the ith training data point xi . After computing S(f )j for each j = 1, . . . , d,
the sensitivity with respect to each input coordinate j is known. Often, the shorthand notation
S(f ) := |∇x Cx (f )|
is used to define S. Note that the absolute value | · | has to be understood componentwise here.
Sometimes, the loss Cx is even discarded and the sensitivities are directly computed on the
model’s output f (x) instead.
It now remains to compute the derivatives of Cx with respect to x. In the general situation, this
can only be done approximately, e.g., by replacing the derivative ∂Cx (f )/∂xj by some suitable
finite difference for each j = 1, . . . , d. It is noteworthy that, for the more sophisticated neural
network models, we can directly use the automatic differentiation modules of deep learning
libraries like K ERAS and T ENSOR F LOW to compute the derivatives in the formula above. Thus,
there is no need for additional implementational effort to calculate S(f ) in this case.
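As a small illustration that does not rely on a deep learning library, the sketch below approximates S(f) by central finite differences for a toy model f and a single data point; both the model and the data are purely illustrative.

```python
import numpy as np

def f(x):
    """Toy trained model: a fixed nonlinear function of x in R^3."""
    return np.tanh(1.5 * x[0] - 0.5 * x[1]) + 0.2 * x[2] ** 2

def C(x, y):
    """Least squares loss on the single data point (x, y)."""
    return (f(x) - y) ** 2

def sensitivity(x, y, h=1e-5):
    """S(f)_j = |dC_x/dx_j|, approximated by central finite differences."""
    S = np.zeros_like(x)
    for j in range(len(x)):
        e = np.zeros_like(x)
        e[j] = h
        S[j] = abs(C(x + e, y) - C(x - e, y)) / (2.0 * h)
    return S

x = np.array([0.3, -0.7, 1.2])
y = 0.5
print("sensitivities:", sensitivity(x, y))
```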
The quantity S(f ) tells us which coordinate directions are more important than others when
it comes to minimizing the loss function for the specific data point x. When we are dealing
with image data, i.e., when x is an image, we can visualize the sensitivities as a heat map,
sometimes also referred to as saliency map. For a gray-scale picture, each input coordinate
represents a single pixel in the image. Thus, we can visualize the sensitivities by plotting them
as color values (according to their magnitude) for each pixel. The same can be done for color
images after averaging the sensitivities over the color channels, for instance. For an example see
Figure 10.10.
Figure 10.10: An example of a heat map of sensitivity values of an image classification algorithm.
As a model we used the pre-trained GoogLeNet [SLJ+ 15], which correctly classified the original
image (left) as a cat. The sensitivities of the cat class output (right) have been averaged over the
three color channels of the input picture. The color represents the magnitude of the sensitivity for
each pixel. We see that the pixels which influence the loss the most are indeed the ones showing
the cat’s shape. (Cat image licensed under Pixabay Content License.)
Activation maximization
The task of finding a prototypical input for a certain output class is solved by acti-
vation maximization. Here, an output class c ∈ Γ is specified in a first step. Subse-
quently, a data point x is determined that the machine learning model f most likely
assigns to the specific class c, i.e., the probability P[f (x) = c] that x belongs to the
class c is very large (or even the largest) compared to other possible inputs. The point
x is also often called prototype for the class.
The naive way to compute a prototype of a class c is to just maximize the probability of that class
with respect to the input, i.e., to determine
arg max_{x∈Rd} P [f (x) = c] ,          (10.7)
where f : Rd → Γ is the classification model. However, without any further side conditions,
the maximum in (10.7) might not be defined. This is due to the fact that the input coordinate
values (x1 , . . . , xd ) of a data point x ∈ Rd are unbounded, i.e., we could increase the size of
the coordinates arbitrarily, which, depending on the model f , might lead to larger and larger
probabilities P [f (x) = c]. Therefore, a regularization term like the (weighted) total variation
of the input x is usually subtracted from the class probability in (10.7) to obtain a well-posed
maximization problem and a more intuitive prototype; see [NYC16] for several examples of
activation maximization methods and for more sophisticated variants that additionally take the
training data distribution into account.
To compute a prototype we essentially need to maximize the class probability of a given ma-
chine learning model with respect to its input. This can be done by gradient ascent methods,
for instance. To this end, we can again use the automatic differentiation modules of K ERAS or
T ENSOR F LOW for calculating prototypes of neural network models. In Figure 10.11 we depicted
exemplary prototypes for certain classes of the ImageNet (https://www.image-net.org) data set.
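The sketch below performs activation maximization by plain gradient ascent for a toy softmax classifier. For simplicity, a squared ℓ2 penalty takes the place of the total variation regularizer mentioned above, and the gradient of the objective is available in closed form for this toy model; all concrete choices (model, class, step size, penalty weight) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(9)
d, n_classes = 8, 3
W = rng.standard_normal((n_classes, d))       # toy classifier with P[f(x) = c] = softmax(W x)_c

def softmax(u):
    e = np.exp(u - u.max())
    return e / e.sum()

def prototype(c, lam=0.1, lr=0.1, steps=200):
    """Gradient ascent on log P[f(x) = c] - lam * ||x||^2 (an l2 penalty instead of total variation)."""
    x = np.zeros(d)
    for _ in range(steps):
        p = softmax(W @ x)
        grad = W.T @ (np.eye(n_classes)[c] - p) - 2.0 * lam * x    # closed-form gradient of the objective
        x = x + lr * grad
    return x

x_proto = prototype(c=1)
print("class probabilities at the prototype:", softmax(W @ x_proto))
```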
Figure 10.11: Three examples for prototypes of images for the three classes church (left), dumb-
bell (middle), and basketball (right) of the ImageNet data set. As model we used the pre-trained
GoogLeNet [SLJ+ 15]. The prototypes have been computed by activation maximization with
a total variation regularizer and gradient ascent. While the resulting prototype pictures do not
show realistic scenarios, we can still observe the shapes of multiple church towers (left), dumb-
bells (middle), and basketballs (right).
For neural networks there exist specially designed interpretability methods like layerwise rel-
evance propagation and deep Taylor decomposition [MBL+ 19]. Here, the output of the network
is passed backwards through it in a specially weighted way depending on each neuron’s activa-
tion. This can also be interpreted as a chain of Taylor expansions of first order in the neurons;
see [MBL+ 19] for details.
More interpretability methods and the corresponding code packages can be found in [Mol20].
Bibliography
[ACF22] M. Afsar, T. Crump, and B. Far. Reinforcement learning based recommender systems: A
survey. ACM Computing Surveys, 55(7):1–38, 2022. (Cited on p. 139)
[ACJP20] E. Arias-Castro, A. Javanmard, and B. Pelletier. Perturbation bounds for Procrustes, clas-
sical scaling, and trilateration, with applications to manifold learning. Journal of Machine
Learning Research, 21(15):1–37, 2020. (Cited on p. 71)
[ADHSK21] F. Abu-Dakka, Y. Huang, J. Silvério, and V. Kyrki. A probabilistic framework for learning
geometry-based robot manipulation skills. Robotics and Autonomous Systems, 141:103761,
2021. (Cited on p. 169)
[ADLS10] G. Amaral, L. Dore, R. Lessa, and B. Stosic. K-means algorithm in statistical shape analy-
sis. Communications in Statistics—Simulation and Computation, 39(5):1016–1026, 2010.
(Cited on pp. 7, 77)
[Aea15] M. Abadi et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
Software available from https://www.tensorflow.org. (Cited on p. 105)
[Agg18] C. Aggarwal. Neural Networks and Deep Learning. Springer, Cham, Switzerland, 2018.
(Cited on p. 87)
[AHK01] C. Aggarwal, A. Hinneburg, and D. Keim. On the surprising behavior of distance metrics in
high dimensional space. In J. Van den Bussche and V. Vianu, editors, Database Theory —
ICDT 2001, pages 420–434, Berlin, Heidelberg, Germany, 2001. Springer. (Cited on p. 13)
[And82] B. Anderson. Reverse-time diffusion equation models. Stochastic Processes and Their
Applications, 12(3):313–326, 1982. (Cited on p. 135)
[AR20] M. Assran and M. Rabbat. On the convergence of Nesterov’s accelerated gradient method
in stochastic settings. In H. Daumé III and A. Singh, editors, Proceedings of the 37th Inter-
national Conference on Machine Learning, online, volume 119 of Proceedings of Machine
Learning Research, pages 410–420. PMLR, 2020. (Cited on p. 103)
[ARB19] M. Azaouzi, D. Rhouma, and L. Ben Romdhane. Community detection in large-scale so-
cial networks: State-of-the-art and future directions. Social Network Analysis and Mining,
9(1):23, 2019. (Cited on p. 7)
[ASB16] M. Arjovsky, A. Shah, and Y. Bengio. Unitary evolution recurrent neural networks. In
M. Balcan and K. Weinberger, editors, Proceedings of the 33rd International Conference
on Machine Learning, New York, NY, USA, volume 48 of Proceedings of Machine Learning
Research, pages 1120–1128. PMLR, 2016. (Cited on p. 182)
[Au18] T. Au. Random forests, decision trees, and categorical predictors: The “absent levels”
problem. The Journal of Machine Learning Research, 19(1):1737–1766, 2018. (Cited on
p. 170)
[AZLS19] Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-
parameterization. In K. Chaudhuri and R. Salakhutdinov, editors, Proceedings of the 36th
International Conference on Machine Learning, Long Beach, CA, USA, volume 97 of Pro-
ceedings of Machine Learning Research, pages 242–252. PMLR, 2019. (Cited on p. 129)
[Bal12] P. Baldi. Autoencoders, unsupervised learning, and deep architectures. In I. Guyon, G. Dror,
V. Lemaire, G. Taylor, and D. Silver, editors, Proceedings of ICML Workshop on Unsuper-
vised and Transfer Learning, Bellevue, WA, USA, volume 27 of Proceedings of Machine
Learning Research, pages 37–49. PMLR, 2012. (Cited on p. 109)
[BB12] J. Bergstra and Y. Bengio. Random search for hyper-parameter optimization. Journal of
Machine Learning Research, 13(10):281–305, 2012. (Cited on p. 49)
[BBS21] P. Bongini, M. Bianchini, and F. Scarselli. Molecular generative graph neural networks for
drug discovery. Neurocomputing, 450:242–252, 2021. (Cited on p. 170)
[BCB15] D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to
align and translate. In Proceedings of the 3rd International Conference on Learning Repre-
sentations, San Diego, CA, USA. ICLR, 2015. (Cited on p. 184)
[BCD97] M. Bardi and I. Capuzzo-Dolcetta. Optimal Control and Viscosity Solutions of Hamilton-
Jacobi-Bellman Equations. Birkhäuser, Boston, MA, USA, 1997. (Cited on p. 139)
[BCN18] L. Bottou, F. Curtis, and J. Nocedal. Optimization methods for large-scale machine learning.
SIAM Review, 60(2):223–311, 2018. (Cited on pp. 101, 102)
[BE21] J. Blechschmidt and O. Ernst. Three ways to solve partial differential equations with neural
networks: A review. GAMM Mitteilungen, 44(2):e202100006, 2021. (Cited on p. 137)
[Bea20] T. Brown et al. Language models are few-shot learners. Advances in Neural Information
Processing Systems, 33:1877–1901, 2020. (Cited on pp. 183, 190)
[Bel57] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1957.
(Cited on pp. 144, 149)
[Bel61] R. Bellman. Adaptive Control Processes: A Guided Tour. Princeton University Press,
Princeton, NJ, USA, 1961. (Cited on p. 12)
[Ber12] D. Bertsekas. Dynamic Programming and Optimal Control: Approximate Dynamic Pro-
gramming, volume 2. Athena Scientific, Belmont, MA, USA, 4th edition, 2012. (Cited on
pp. 144, 149, 151, 159, 165, 166)
[Ber17] D. Bertsekas. Dynamic Programming and Optimal Control, volume 1. Athena Scientific,
Belmont, MA, USA, 4th edition, 2017. (Cited on pp. 144, 166)
[Ber19] D. Bertsekas. Reinforcement Learning and Optimal Control. Athena Scientific, Belmont,
MA, USA, 2019. (Cited on pp. 139, 145, 149, 158, 159, 165, 166)
[Ber22] D. Bertsekas. Abstract Dynamic Programming. Athena Scientific, Belmont, MA, USA,
2022. (Cited on pp. 164, 165)
[Ber23] D. Bertsekas. A Course in Reinforcement Learning. Athena Scientific, Belmont, MA, USA,
2023. (Cited on p. 166)
[BG05] I. Borg and P. Groenen. Modern Multidimensional Scaling: Theory and Applications.
Springer, New York, NY, USA, 2005. (Cited on p. 68)
[BGK22] B. Bohn, M. Griebel, and D. Kannan. Deep neural networks and PIDE discretizations. SIAM
Journal on Mathematics of Data Science, 4(3):1145–1170, 2022. (Cited on p. 132)
[BGKP19] H. Bölcskei, P. Grohs, G. Kutyniok, and P. Petersen. Optimal approximation with sparsely
connected deep neural networks. SIAM Journal on Mathematics of Data Science, 1(1):8–45,
2019. (Cited on p. 91)
[BGR19] B. Bohn, M. Griebel, and C. Rieger. A representer theorem for deep kernel learning. Journal
of Machine Learning Research, 20(64):1–32, 2019. (Cited on p. 126)
[BHL93] G. Berkooz, P. Holmes, and J. Lumley. The proper orthogonal decomposition in the analysis
of turbulent flows. Annual Review of Fluid Mechanics, 25(1):539–575, 1993. (Cited on
p. 54)
[Bis06] C. Bishop. Pattern Recognition and Machine Learning. Springer, New York, NY, USA,
2006. (Cited on p. 16)
[BKH16] J. Ba, J. Kiros, and G. Hinton. Layer normalization. arXiv preprint, arXiv:1607.06450,
2016. (Cited on p. 188)
[BMDG05] A. Banerjee, S. Merugu, I. Dhillon, and J. Ghosh. Clustering with Bregman divergences.
Journal of Machine Learning Research, 6(10):1705–1749, 2005. (Cited on p. 115)
[BMFB+ 22] L. Brogat-Motte, R. Flamary, C. Brouard, J. Rousu, and F. D’Alché-Buc. Learning to predict
graphs with fused Gromov-Wasserstein barycenters. In K. Chaudhuri, S. Jegelka, L. Song,
C. Szepesvari, G. Niu, and S. Sabato, editors, Proceedings of the 39th International Confer-
ence on Machine Learning, Baltimore, MD, USA, volume 162 of Proceedings of Machine
Learning Research, pages 2321–2335. PMLR, 2022. (Cited on p. 170)
[BMR+ 17] T. Brown, D. Mané, A. Roy, M. Abadi, and J. Gilmer. Adversarial patch. arXiv preprint,
arXiv:1712.09665, 2017. (Cited on p. 122)
[BMZ09] O. Bokanowski, S. Maroso, and H. Zidani. Some convergence results for Howard’s algo-
rithm. SIAM Journal on Numerical Analysis, 47(4):3001–3026, 2009. (Cited on p. 149)
[BOHS15] R. Benenson, M. Omran, J. Hosang, and B. Schiele. Ten years of pedestrian detection, what
have we learned? In L. Agapito, M. Bronstein, and C. Rother, editors, Computer Vision
- ECCV 2014 Workshops, pages 613–627, Cham, Switzerland, 2015. Springer. (Cited on
pp. 62, 63)
[Bre84] L. Breiman. Classification and Regression Trees. Routledge, New York, NY, USA, 1984.
(Cited on pp. 176, 178)
[Bre96a] L. Breiman. Bagging predictors. Machine Learning, 24:123–140, 1996. (Cited on p. 175)
[Bre96b] L. Breiman. Stacked regressions. Machine Learning, 24:49–64, 1996. (Cited on p. 172)
[Bre01] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001. (Cited on pp. 176, 178)
[BS96] D. Bertsekas and S. Shreve. Stochastic Optimal Control: The Discrete-Time Case. Athena
Scientific, Belmont, MA, USA, 1996. (Cited on p. 150)
200 Bibliography
[BSZG18] A. Bojchevski, O. Shchur, D. Zügner, and S. Günnemann. NetGAN: Generating graphs via
random walks. In J. Dy and A. Krause, editors, Proceedings of the 35th International Con-
ference on Machine Learning, Stockholm, Sweden, volume 80 of Proceedings of Machine
Learning Research, pages 610–619. PMLR, 2018. (Cited on p. 170)
[BV04] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, New
York, NY, USA, 2004. (Cited on pp. 27, 35, 36, 37)
[BYF+ 19] M. Budninskiy, G. Yin, L. Feng, Y. Tong, and M. Desbrun. Parallel transport unfolding:
A connection-based manifold learning approach. SIAM Journal on Applied Algebra and
Geometry, 3(2):266–291, 2019. (Cited on p. 85)
[CAEHP+ 22] I. Chami, S. Abu-El-Haija, B. Perozzi, C. Ré, and K. Murphy. Machine learning on graphs:
A model and comprehensive taxonomy. Journal of Machine Learning Research, 23(89):1–
64, 2022. (Cited on p. 170)
[Cal20] O. Calin. Deep Learning Architectures. Springer, Cham, Switzerland, 2020. (Cited on
pp. 16, 108)
[CBB17] M. Congedo, A. Barachant, and R. Bhatia. Riemannian geometry for EEG-based brain-
computer interfaces; a primer and a review. Brain-Computer Interfaces, 4:1–20, 2017.
(Cited on p. 169)
[CBMV20] R. Chakraborty, J. Bouza, J. Manton, and B. Vemuri. Manifoldnet: A deep neural network
for manifold-valued data with applications. IEEE Transactions on Pattern Analysis and
Machine Intelligence, 44(2):799–810, 2020. (Cited on p. 169)
[CC00] T. Cox and M. Cox. Multidimensional Scaling. Chapmann & Hall, New York, NY, USA,
2nd edition, 2000. (Cited on p. 59)
[CCHC19] B. Chang, M. Chen, E. Haber, and E. Chi. AntisymmetricRNN: A dynamical system view
on recurrent neural networks. In Proceedings of the 7th International Conference on Learn-
ing Representations, New Orleans, LA, USA. ICLR, 2019. (Cited on p. 183)
[CGIT23] P. Climaco, J. Garcke, and R. Iza-Teran. Multi-resolution dynamic mode decomposition for
damage detection in wind turbine gearboxes. Data-Centric Engineering, 4:e1, 2023. (Cited
on p. 5)
[Cho17] F. Chollet. Deep Learning with Python. Manning Publications, Shelter Island, NY, USA,
2017. (Cited on p. 16)
[CL06] R. Coifman and S. Lafon. Diffusion maps. Applied and Computational Harmonic Analysis,
21(1):5–30, 2006. (Cited on pp. 72, 73, 75, 84)
[CL11] C.-C. Chang and C.-J. Lin. LIBSVM: A library for support vector machines. ACM Trans.
Intell. Syst. Technol., 2(3), 2011. (Cited on pp. 44, 51)
[COSJ17] Y. Costa, L. Oliveira, and C. Silla Jr. An evaluation of convolutional neural networks for
music classification using spectrograms. Applied Soft Computing, 52:28–38, 2017. (Cited
on p. 5)
Bibliography 201
[Cov65] T. Cover. Geometrical and statistical properties of systems of linear inequalities with ap-
plications in pattern recognition. IEEE Transactions on Electronic Computers, 3:326–334,
1965. (Cited on p. 44)
[CPK+ 18] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille. DeepLab: Semantic
image segmentation with deep convolutional nets, atrous convolution, and fully connected
CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848,
2018. (Cited on p. 130)
[CRBD18] R. Chen, Y. Rubanova, J. Bettencourt, and D. Duvenaud. Neural ordinary differential equa-
tions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Gar-
nett, editors, Advances in Neural Information Processing Systems, volume 31, Red Hook,
NY, USA, 2018. Curran Associates, Inc. (Cited on pp. 125, 132, 133)
[CvMG+ 14] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and
Y. Bengio. Learning phrase representations using RNN encoder–decoder for statistical ma-
chine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural
Language Processing (EMNLP), Doha, Qatar, pages 1724–1734. Association for Compu-
tational Linguistics, 2014. (Cited on pp. 108, 179, 182)
[CZ07] F. Cucker and D. Zhou. Learning Theory. Cambridge University Press, Cambridge, UK,
2007. (Cited on pp. 16, 52)
[Dah22] W. Dahmen. Compositional sparsity, approximation classes, and parametric transport equa-
tions. arXiv preprint, arXiv:2207.06128, 2022. (Cited on p. 91)
[DBC+ 19] E. Dada, J. Bassi, H. Chiroma, S. Abdulhamid, A. Adetunmbi, and O. Ajibuwa. Machine
learning for email spam filtering: Review, approaches and open research problems. Heliyon,
5(6):e01802, 2019. (Cited on p. 5)
[DBK+ 16] Y. Deng, F. Bao, Y. Kong, Z. Ren, and Q. Dai. Deep direct reinforcement learning for
financial signal representation and trading. IEEE Transactions on Neural Networks and
Learning Systems, 28(3):653–664, 2016. (Cited on p. 139)
[dC92] M. do Carmo. Riemannian Geometry. Birkhäuser, Boston, MA, USA, 1992. (Cited on
p. 70)
[DCLT19] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirec-
tional transformers for language understanding. In Proceedings of the 2019 Conference
of the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Minneapolis, MN, USA, pages 4171–4186. Association for Com-
putational Linguistics, 2019. (Cited on pp. 183, 191)
[DD09] M. Deza and E. Deza. Encyclopedia of Distances. Springer, Berlin, Heidelberg, Germany,
2009. (Cited on pp. 7, 167)
[Dea22] J. Degrave et al. Magnetic control of tokamak plasmas through deep reinforcement learning.
Nature, 602(7897):414–419, 2022. (Cited on pp. 139, 166)
[DFE+ 22] T. Dao, D. Fu, S. Ermon, A. Rudra, and C. Ré. FlashAttention: Fast and memory-efficient
exact attention with IO-awareness. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave,
K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35,
pages 16344–16359, Red Hook, NY, USA, 2022. Curran Associates, Inc. (Cited on p. 190)
202 Bibliography
[DFO20] M. Deisenroth, A. Faisal, and C. Ong. Mathematics for Machine Learning. Cambridge
University Press, Cambridge, UK, 2020. (Cited on p. 16)
[DG17] D. Dua and C. Graff. UCI machine learning repository. http://archive.ics.uci.edu/ml, 2017.
(Cited on p. 25)
[DHP21] R. DeVore, B. Hanin, and G. Petrova. Neural network approximation. Acta Numerica,
30:327–444, 2021. (Cited on p. 91)
[DHS11] J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, 12(61):2121–2159, 2011.
(Cited on p. 104)
[Dij59] E. Dijkstra. A note on two problems in connexion with graphs. Numerische Mathematik,
1(1):269–271, 1959. (Cited on p. 71)
[DOL03] M. De Oliveira and H. Levkowitz. From visual data exploration to visual data mining: A
survey. IEEE Transactions on Visualization and Computer Graphics, 9(3):378–394, 2003.
(Cited on p. 79)
[Doz16] T. Dozat. Incorporating Nesterov momentum into Adam. In Proceedings of 4th Interna-
tional Conference on Learning Representations, Workshop Track, San Juan, Puerto Rico,
2016. https://openreview.net/forum?id=OM0jvwB8jIp57ZJjtNEZ. (Cited on p. 105)
[DT05] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE
Computer Society Conference on Computer Vision and Pattern Recognition, San Diego, CA,
USA, volume 1, pages 886–893. IEEE, 2005. (Cited on p. 63)
[DTRF19] X. Dong, D. Thanou, M. Rabbat, and P. Frossard. Learning graphs from data: A signal
representation perspective. IEEE Signal Processing Magazine, 36(3):44–63, 2019. (Cited
on p. 170)
[DWW22] W. Dahmen, M. Wang, and Z. Wang. Nonlinear reduced DNN models for state estimation.
Communications in Computational Physics, 32(1):1–40, 2022. (Cited on p. 137)
[DYC+ 20] X. Dong, Z. Yu, W. Cao, Y. Shi, and Q. Ma. A survey on ensemble learning. Frontiers of
Computer Science, 14(2):241–258, 2020. (Cited on p. 171)
[EAQ+ 21] B. Erichson, O. Azencot, A. Queiruga, L. Hodgkinson, and M. Mahoney. Lipschitz recurrent
neural networks. In 9th International Conference on Learning Representations, ICLR 2021,
Virtual Event. OpenReview.net, 2021. (Cited on p. 183)
[EEF+ 18] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno,
and D. Song. Robust physical-world attacks on deep learning visual classification. In Pro-
ceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition,
Salt Lake City, UT, USA, pages 1625–1634. IEEE, 2018. (Cited on pp. 122, 192)
[EHJ21] W. E, J. Han, and A. Jentzen. Algorithms for solving high dimensional PDEs: From non-
linear Monte Carlo to machine learning. Nonlinearity, 35(1):278–310, 2021. (Cited on
p. 137)
Bibliography 203
[EHN96] H. Engl, M. Hanke, and A. Neubauer. Regularization of Inverse Problems. Springer, Dor-
drecht, The Netherlands, 1996. (Cited on p. 11)
[Ein56] A. Einstein. Investigations on the Theory of the Brownian Movement. Dover Publications,
Mineola, NY, USA, 1956. (Cited on p. 134)
[Elm90] J. Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990. (Cited on
p. 179)
[EMW20] W. E, C. Ma, and L. Wu. Machine learning from a continuous viewpoint, I. Science China
Mathematics, 63(11):2233–2266, 2020. (Cited on p. 125)
[EMW22] W. E, C. Ma, and L. Wu. The Barron space and the flow-induced function spaces for neural
network models. Constructive Approximation, 55(1):369–406, 2022. (Cited on p. 92)
[EW21] W. E and S. Wojtowytsch. Kolmogorov width decay and poor approximators in machine
learning: Shallow neural networks, random feature models and neural tangent kernels. Re-
search in the Mathematical Sciences, 8(1):1–28, 2021. (Cited on p. 129)
[FCH+ 08] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin. LIBLINEAR: A library
for large linear classification. Journal of Machine Learning Research, 9:1871–1874, 2008.
(Cited on p. 51)
[Fea22] A. Fawzi et al. Discovering faster matrix multiplication algorithms with reinforcement
learning. Nature, 610:47–53, 2022. (Cited on p. 139)
[FF13] M. Falcone and R. Ferretti. Semi-Lagrangian Approximation Schemes for Linear and
Hamilton–Jacobi Equations. SIAM, Philadelphia, PA, USA, 2013. (Cited on p. 145)
[Fis36] R. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics,
7(2):179–188, 1936. (Cited on p. 25)
[Flo62] R. Floyd. Algorithm 97: Shortest path. Communications of the ACM, 5(6):345, 1962. (Cited
on p. 71)
[FS06] W. Fleming and H. Soner. Controlled Markov Processes and Viscosity Solutions. Springer,
New York, NY, USA, 2006. (Cited on p. 150)
[FZM+ 15] C. Frogner, C. Zhang, H. Mobahi, M. Araya, and T. Poggio. Learning with a Wasserstein
loss. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett, editors, Advances
in Neural Information Processing Systems, volume 28, Red Hook, NY, USA, 2015. Curran
Associates, Inc. (Cited on p. 167)
[GA11] M. Gönen and E. Alpaydın. Multiple kernel learning algorithms. Journal of Machine Learn-
ing Research, 12:2211–2268, 2011. (Cited on p. 52)
[GB10] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural
networks. In Y. Teh and M. Titterington, editors, Proceedings of the 13th International
Conference on Artificial Intelligence and Statistics, Chia Laguna Resort, Sardinia, Italy,
volume 9 of Proceedings of Machine Learning Research, pages 249–256. PMLR, 2010.
(Cited on p. 93)
204 Bibliography
[GBC16] I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. MIT Press, Cambridge, MA,
USA, 2016. http://www.deeplearningbook.org. (Cited on pp. 16, 87, 95, 96, 108, 109, 115,
178, 180, 182)
[GDG+ 17] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia,
and K. He. Accurate, large minibatch SGD: Training Imagenet in 1 hour. arXiv preprint,
arXiv:1706.02677, 2017. (Cited on p. 108)
[Geo12] H.-O. Georgii. Stochastics: Introduction to Probability and Statistics. De Gruyter, Berlin,
Germany; New York, NY, USA, 2012. (Cited on pp. 73, 84, 118)
[Gér19] A. Géron. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow.
O’Reilly, Sebastopol, CA, USA, 2019. (Cited on p. 16)
[GGF09] R. Goni, P. García, and S. Foissac. The qPCR data statistical analysis. Technical report, Inte-
gromics SL, https://www.gene-quantification.com/integromics-qpcr-statistics-white-paper.
pdf, 2009. (Cited on p. 82)
[GHR20] E. Gorbunov, F. Hanzely, and P. Richtárik. A unified theory of SGD: Variance reduction,
sampling, quantization and coordinate descent. In S. Chiappa and R. Calandra, editors,
Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics,
online, volume 108 of Proceedings of Machine Learning Research, pages 680–690. PMLR,
2020. (Cited on pp. 101, 102)
[GHT+ 10] G. Guo, M. Huss, G. Tong, C. Wang, L. Sun, N. Clarke, and P. Robson. Resolution of cell
fate decisions revealed by single-cell gene expression analysis from zygote to blastocyst.
Developmental Cell, 18(4):675–685, 2010. (Cited on pp. 81, 82)
[GK22] P. Grohs and G. Kutyniok, editors. Mathematical Aspects of Deep Learning. Cambridge
University Press, Cambridge, UK, 2022. (Cited on p. 91)
[GLB+ 21] L. Girin, S. Leglaive, X. Bie, J. Diard, T. Hueber, and X. Alameda-Pineda. Dynamical
variational autoencoders: A comprehensive review. Foundations and Trends in Machine
Learning, 15(1–2):1–175, 2021. (Cited on p. 123)
[GP90] F. Girosi and T. Poggio. Networks and the best approximation property. Biological Cyber-
netics, 63(3):169–176, 1990. (Cited on p. 91)
[GP17] L. Grüne and J. Pannek. Nonlinear Model Predictive Control. Springer, Cham, Switzerland,
2017. (Cited on pp. 139, 165)
[GR70] G. Golub and C. Reinsch. Singular value decomposition and least squares solutions. Nu-
merische Mathematik, 14:403–420, 1970. (Cited on p. 54)
[GTGHS20] N. García Trillos, M. Gerlach, M. Hein, and D. Slepčev. Error estimates for spectral con-
vergence of the graph Laplacian on random geometric graphs toward the Laplace–Beltrami
operator. Foundations of Computational Mathematics, 20:827–887, 2020. (Cited on p. 73)
[GVL13] G. Golub and C. Van Loan. Matrix Computations, 4th edition. JHU Press, Baltimore, MD,
USA, 2013. (Cited on pp. 18, 19, 20, 21, 57, 58)
[GW08] A. Griewank and A. Walther. Evaluating Derivatives: Principles and Techniques of Algo-
rithmic Differentiation, 2nd edition. SIAM, Philadelphia, PA, USA, 2008. (Cited on p. 95)
Bibliography 205
[HA04] V. Hodge and J. Austin. A survey of outlier detection methodologies. Artificial Intelligence
Review, 22(2):85–126, 2004. (Cited on p. 37)
[Hac13] W. Hackbusch. Multi-Grid Methods and Applications. Springer Science & Business Media,
Berlin, Germany, 2013. (Cited on p. 28)
[Hac16] W. Hackbusch. Iterative Solution of Large Sparse Systems of Equations, 2nd edition.
Springer, Cham, Switzerland, 2016. (Cited on p. 28)
[HBT15] L. Haghverdi, F. Buettner, and F. Theis. Diffusion maps for high-dimensional single-cell
analysis of differentiation data. Bioinformatics, 31(18):2989–2998, 2015. (Cited on p. 75)
[Hea23] L. Heumos et al. Best practices for single-cell analysis across modalities. Nature Reviews
Genetics, 24:550–572, 2023. (Cited on p. 82)
[HGG+ 20] M. Hennig, M. Grafinger, D. Gerhard, S. Dumss, and P. Rosenberger. Comparison of time
series clustering algorithms for machine state detection. In Procedia CIRP, volume 93,
pages 1352–1357, Chicago, IL, USA, 2020. Elsevier B.V. (Cited on p. 7)
[Hig02] N. Higham. Accuracy and Stability of Numerical Algorithms, 2nd edition. SIAM, Philadel-
phia, PA, USA, 2002. (Cited on pp. 19, 20, 21)
[HJA20] J. Ho, A. Jain, and P. Abbeel. Denoising diffusion probabilistic models. In H. Larochelle,
M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, editors, Advances in Neural Information
Processing Systems, volume 33, pages 6840–6851, Red Hook, NY, USA, 2020. Curran
Associates, Inc. (Cited on pp. 125, 134, 136, 137, 138)
[HJKN20] M. Hutzenthaler, A. Jentzen, T. Kruse, and T. Nguyen. A proof that rectified deep neural
networks overcome the curse of dimensionality in the numerical approximation of semilin-
ear heat equations. Partial Differential Equations and Applications, 1(2):1–34, 2020. (Cited
on p. 137)
[HK20] J. Hancock and T. Khoshgoftaar. Survey on categorical data for neural networks. Journal
of Big Data, 7(28):1–41, 2020. (Cited on p. 170)
[HMvH+ 18] M. Hessel, J. Modayil, H. van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan,
B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement
learning. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence
and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI
Symposium on Educational Advances in Artificial Intelligence, New Orleans, LA, USA,
AAAI’18/IAAI’18/EAAI’18. AAAI Press, 2018. (Cited on p. 165)
[Ho95] T. Ho. Random decision forests. In Proceedings of 3rd International Conference on Docu-
ment Analysis and Recognition, Montreal, Canada, volume 1, pages 278–282. IEEE Com-
puter Society, 1995. (Cited on p. 176)
[Hot33] H. Hotelling. Analysis of a complex of statistical variables into principal components. Jour-
nal of Educational Psychology, 24(6):417, 1933. (Cited on p. 54)
[How60] R. Howard. Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA,
USA, 1960. (Cited on p. 149)
[HPS12] H. Harbrecht, M. Peters, and R. Schneider. On the low-rank approximation by the pivoted
Cholesky decomposition. Applied Numerical Mathematics, 62(4):428–440, 2012. (Cited
on p. 57)
206 Bibliography
[HR18] E. Haber and L. Ruthotto. Stable architectures for deep neural networks. Inverse Problems,
34(1), 2018. 014004. (Cited on pp. 125, 131)
[HRZZ09] T. Hastie, S. Rosset, J. Zhu, and H. Zou. Multi-class AdaBoost. Statistics and Its Interface,
2(3):349–360, 2009. (Cited on p. 174)
[HS97] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation, 9(8):
1735–1780, 1997. (Cited on pp. 108, 181, 182)
[HSK19] B. Henrique, V. Sobreiro, and H. Kimura. Literature review: Machine learning techniques
applied to financial market prediction. Expert Systems with Applications, 124:226–251,
2019. (Cited on p. 5)
[HSL16] M. Henaff, A. Szlam, and Y. LeCun. Recurrent orthogonal networks and long-memory
tasks. In M. Balcan and K. Weinberger, editors, Proceedings of the 33rd International Con-
ference on Machine Learning, New York, NY, USA, volume 48 of Proceedings of Machine
Learning Research, pages 2034–2042. PMLR, 2016. (Cited on p. 182)
[HTF09] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer
Series in Statistics. Springer, New York, NY, USA, 2009. (Cited on pp. 11, 12, 16, 17, 30,
31, 49, 51, 77, 170, 171, 174)
[Hul14] J. Hull. Options, Futures, and Other Derivatives, 9th edition. Pearson, Boston, MA, USA,
2014. (Cited on p. 134)
[HWVG19] Z. Huang, J. Wu, and L. Van Gool. Manifold-valued image generation with Wasserstein
generative adversarial nets. In Proceedings of the 33rd AAAI Conference on Artificial Intel-
ligence, Honolulu, HI, USA, volume 33, pages 3886–3893. AAAI Press, 2019. (Cited on
pp. 167, 169)
[HZRS16] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In
Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), Las Vegas, NV, USA, pages 770–778, 2016. (Cited on p. 130)
[Jea17] N. Jouppi et al. In-datacenter performance analysis of a tensor processing unit. In Pro-
ceedings of the 44th Annual International Symposium on Computer Architecture, Toronto,
Canada, pages 1–12, New York, NY, USA, 2017. Association for Computing Machinery.
(Cited on p. 108)
[JGH18] A. Jacot, F. Gabriel, and C. Hongler. Neural tangent kernel: Convergence and generalization
in neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi,
and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31,
pages 8580–8589, Red Hook, NY, USA, 2018. Curran Associates, Inc. (Cited on pp. 125,
126, 129)
[JHS+ 13] S. Jayasumana, R. Hartley, M. Salzmann, H. Li, and M. Harandi. Kernel methods on the
Riemannian manifold of symmetric positive definite matrices. In Proceedings of the 2013
IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, pages
73–80. IEEE Computer Society, 2013. (Cited on pp. 168, 169)
[Jol02] I. Jolliffe. Principal Component Analysis. Springer, New York, NY, USA, 2002. (Cited on
pp. 54, 60)
[JRF12] G. Jurman, S. Riccadonna, and C. Furlanello. A comparison of MCC and CEN error mea-
sures in multi-class prediction. PLOS ONE, 7(8):1–8, 2012. (Cited on p. 24)
Bibliography 207
[KB14] D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint, arXiv:
1412.6980, 2014. (Cited on p. 104)
[KBP13] J. Kober, J. Bagnell, and J. Peters. Reinforcement learning in robotics: A survey. The
International Journal of Robotics Research, 32(11):1238–1274, 2013. (Cited on p. 140)
[KBPA+ 19] P. Kidger, P. Bonnier, I. Perez Arribas, C. Salvi, and T. Lyons. Deep signature transforms.
In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett,
editors, Advances in Neural Information Processing Systems, volume 32, Red Hook, NY,
USA, 2019. Curran Associates, Inc. (Cited on p. 171)
[KDP+ 16] H. Kadri, E. Duflos, P. Preux, S. Canu, A. Rakotomamonjy, and J. Audiffren. Operator-
valued kernels for learning from functional response data. Journal of Machine Learning
Research, 17(20):1–54, 2016. (Cited on p. 170)
[KGPT+ 19] G. Kerg, K. Goyette, M. Puelma Touzel, G. Gidel, E. Vorontsov, Y. Bengio, and G. Lajoie.
Non-normal recurrent neural network (NNRNN): Learning long time dependencies while
improving expressivity with transient dynamics. In H. Wallach, H. Larochelle, A. Beygelz-
imer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Pro-
cessing Systems, volume 32, Red Hook, NY, USA, 2019. Curran Associates, Inc. (Cited on
p. 182)
[Kid22] P. Kidger. On Neural Differential Equations. Doctoral thesis, Mathematical Institute, Uni-
versity of Oxford, Oxford, UK, 2022. (Cited on p. 133)
[KJM20] N. Kriege, F. Johansson, and C. Morris. A survey on graph kernels. Applied Network
Science, 5(1):6, 2020. (Cited on p. 52)
[KKL20] N. Kitaev, Ł. Kaiser, and A. Levskaya. Reformer: The efficient transformer. arXiv preprint,
arXiv:2001.04451, 2020. (Cited on p. 190)
[KLL+ 23] N. Kovachki, Z. Li, B. Liu, K. Azizzadenesheli, K. Bhattacharya, A. Stuart, and A. Anand-
kumar. Neural operator: Learning maps between function spaces with applications to PDEs.
Journal of Machine Learning Research, 24(89):1–97, 2023. (Cited on pp. 138, 170)
[KLM21] N. Kovachki, S. Lanthaler, and S. Mishra. On universal approximation and error bounds for
Fourier neural operators. The Journal of Machine Learning Research, 22(1):13237–13312,
2021. (Cited on pp. 138, 170)
[KP92] P. Kloeden and E. Platen. Numerical Solution of Stochastic Differential Equations. Springer,
Berlin, Heidelberg, Germany, 1992. (Cited on pp. 134, 135)
[KS90] M. Kirby and L. Sirovich. Application of the Karhunen-Loeve procedure for the character-
ization of human faces. IEEE Transactions on Pattern Analysis and Machine Intelligence,
12(1):103–108, 1990. (Cited on p. 54)
[KSH12] A. Krizhevsky, I. Sutskever, and G. Hinton. Imagenet classification with deep convolutional
neural networks. In F. Pereira, C. Burges, L. Bottou, and K. Weinberger, editors, Advances
in Neural Information Processing Systems, volume 25, pages 1097–1105, Red Hook, NY,
USA, 2012. Curran Associates, Inc. (Cited on pp. 5, 96, 108)
[KSZ96] J. Klafter, M. Shlesinger, and G. Zumofen. Beyond Brownian motion. Physics Today,
49(2):33–39, 1996. (Cited on p. 134)
[KW13] D. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint, arXiv:1312.
6114, 2013. (Cited on pp. 109, 117)
[KZK12] J. Kruppa, A. Ziegler, and I. König. Risk estimation and risk prediction using machine-
learning methods. Human Genetics, 131(10):1639–1654, 2012. (Cited on p. 5)
[Lan12] S. Lang. A First Course in Calculus. Springer, New York, NY, USA, 2012. (Cited on p. 128)
[LeC85] Y. LeCun. Une procedure d’apprentissage pour reseau a seuil asymetrique. In Proceedings
of Cognitiva 85, Paris, France, pages 599–604, 1985. (Cited on p. 91)
[Lin76] S. Linnainmaa. Taylor expansion of the accumulated rounding error. BIT Numerical Math-
ematics, 16(2):146–160, 1976. (Cited on p. 91)
[LJP+ 21] L. Lu, P. Jin, G. Pang, Z. Zhang, and G. Karniadakis. Learning nonlinear operators via
DeepONet based on the universal approximation theorem of operators. Nature Machine
Intelligence, 3(3):218–229, 2021. (Cited on pp. 138, 170)
[LKA+ 20] Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, and A. Anand-
kumar. Fourier neural operator for parametric partial differential equations. arXiv preprint,
arXiv:2010.08895, 2020. (Cited on pp. 138, 170)
[LL17] S. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In
I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Gar-
nett, editors, Advances in Neural Information Processing Systems 30, pages 4765–4774.
Curran Associates, Inc., Red Hook, NY, USA, 2017. (Cited on p. 194)
[LLHZ16] S. Lai, K. Liu, S. He, and J. Zhao. How to generate a good word embedding. IEEE Intelli-
gent Systems, 31(6):5–14, 2016. (Cited on p. 170)
[Llo82] S. Lloyd. Least squares quantization in PCM. IEEE Transactions on Information Theory,
28(2):129–137, 1982. (Cited on p. 77)
[LLWT15] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In Proceed-
ings of the IEEE International Conference on Computer Vision (ICCV), Santiago, Chile.
IEEE, 2015. (Cited on p. 138)
[LMK22] S. Lanthaler, S. Mishra, and G. Karniadakis. Error estimates for DeepONets: A deep learn-
ing framework in infinite dimensions. Transactions of Mathematics and Its Applications,
6(1):tnac001, 2022. (Cited on pp. 138, 170)
[LMW+ 17] S. Liu, D. Maljovec, B. Wang, P.-T. Bremer, and V. Pascucci. Visualizing high-dimensional
data: Advances in the past decade. IEEE Transactions on Visualization and Computer
Graphics, 23(3):1249–1268, 2017. (Cited on p. 79)
[LOG+ 19] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer,
and V. Stoyanov. RoBerta: A robustly optimized Bert pretraining approach. arXiv preprint,
arXiv:1907.11692, 2019. (Cited on p. 191)
[Lor56] E. Lorenz. Empirical orthogonal functions and statistical weather prediction, volume 1.
Massachusetts Institute of Technology, Department of Meteorology, Cambridge, MA, USA,
1956. (Cited on p. 54)
Bibliography 209
[LPM15] M.-T. Luong, H. Pham, and C. Manning. Effective approaches to attention-based neural ma-
chine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural
Language Processing, Lisbon, Portugal, pages 1412–1421. Association for Computational
Linguistics, 2015. (Cited on p. 183)
[LSST+ 02] H. Lodhi, C. Saunders, J. Shawe-Taylor, N. Cristianini, and C. Watkins. Text classification
using string kernels. Journal of Machine Learning Research, 2:419–444, 2002. (Cited on
p. 170)
[LV07] J. Lee and M. Verleysen. Nonlinear Dimensionality Reduction. Springer Science & Business
Media, New York, NY, USA, 2007. (Cited on pp. 16, 54, 59, 68, 69, 85)
[LXS+ 19] J. Lee, L. Xiao, S. Schoenholz, Y. Bahri, R. Novak, J. Sohl-Dickstein, and J. Pennington.
Wide neural networks of any depth evolve as linear models under gradient descent. In
H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors,
Advances in Neural Information Processing Systems, volume 32, pages 8572–8583, Red
Hook, NY, USA, 2019. Curran Associates, Inc. (Cited on p. 129)
[MBL+ 19] G. Montavon, A. Binder, S. Lapuschkin, W. Samek, and K.-R. Müller. Layer-wise relevance
propagation: An overview. In W. Samek, G. Montavon, A. Vedaldi, L. Hansen, and K.-R.
Müller, editors, Explainable AI: Interpreting, Explaining and Visualizing Deep Learning,
pages 193–209, Cham, Switzerland, 2019. Springer. (Cited on p. 195)
[Mea15] V. Mnih et al. Human-level control through deep reinforcement learning. Nature, 518(7540):
529–533, 2015. (Cited on pp. 139, 150, 160)
[Mer09] J. Mercer. XVI. Functions of positive and negative type, and their connection with the theory
of integral equations. Philosophical Transactions of the Royal Society of London. Series A,
209(441-458):415–446, 1909. (Cited on pp. 46, 47)
[Mha96] H. Mhaskar. Neural networks for optimal approximation of smooth and analytic functions.
Neural Computation, 8(1):164–177, 1996. (Cited on p. 91)
[MHSG18] L. McInnes, J. Healy, N. Saul, and L. Großberger. UMAP: Uniform manifold approximation
and projection. Journal of Open Source Software, 3(29):861, 2018. (Cited on p. 85)
[MM15] D. Mishkin and J. Matas. All you need is a good init. arXiv preprint, arXiv:1511.06422,
2015. (Cited on p. 93)
[MM18] H. Minh and V. Murino. Covariances in Computer Vision and Machine Learning. Springer,
Cham, Switzerland, 2018. (Cited on p. 168)
[MN15] J. Moudrík and R. Neruda. Evolving non-linear stacking ensembles for prediction of Go
player attributes. In 2015 IEEE Symposium Series on Computational Intelligence, Cape
Town, South Africa, pages 1673–1680. IEEE, 2015. (Cited on p. 172)
210 Bibliography
[Mur22] K. Murphy. Probabilistic Machine Learning: An Introduction. MIT Press, Cambridge, MA,
USA, 2022. (Cited on pp. 16, 17, 30, 31, 49, 51, 52, 77, 126, 171)
[Mur23] K. Murphy. Probabilistic Machine Learning: Advanced Topics. MIT Press, Cambridge,
MA, USA, 2023. (Cited on pp. 16, 117, 134, 136, 137, 149)
[MZ22] T. Mao and D.-X. Zhou. Approximation of functions from Korobov spaces by deep convo-
lutional neural networks. Advances in Computational Mathematics, 48(6):84, 2022. (Cited
on p. 91)
[NBC21] A. Narayan, B. Berger, and H. Cho. Assessing single-cell transcriptomic variability through
density-preserving data visualization. Nature Biotechnology, 39(6):765–774, 2021. (Cited
on p. 86)
[Nea20] E. Ntoutsi et al. Bias in data-driven artificial intelligence systems—an introductory sur-
vey. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 10(3):e1356,
2020. (Cited on p. 192)
[Nes83] Y. Nesterov. A method for solving the convex programming problem with convergence rate
O k12 . Soviet Mathematics Doklady, 27(2):372–376, 1983. (Cited on p. 103)
[Ng11] A. Ng. Sparse autoencoder. CS294A Lecture notes, 72:1–19, 2011. available at https:
//web.stanford.edu/class/cs294a/sparseAutoencoder.pdf, Stanford University, Stanford, CA,
USA. (Cited on p. 114)
[NYC16] A. Nguyen, J. Yosinski, and J. Clune. Multifaceted feature visualization: Uncovering the
different types of features learned by each neuron in deep neural networks. arXiv preprint,
arXiv:1602.03616, 2016. (Cited on p. 194)
[NZM+ 19] H. Nies, Z. Zakaria, M. Mohamad, W. Chan, N. Zaki, R. Sinnott, S. Napis, P. Chamoso,
S. Omatu, and J. Corchado. A review of computational methods for clustering genes with
similar biological functions. Processes, 7(9):550, 2019. (Cited on p. 7)
Bibliography 211
[OFG97] E. Osuna, R. Freund, and F. Girosi. An improved training algorithm for support vector
machines. In Neural Networks for Signal Processing VII. Proceedings of the 1997 IEEE
Signal Processing Society Workshop, Amelia Island, FL, USA, pages 276–285. IEEE, 1997.
(Cited on p. 41)
[OSZ22] J. Opschoor, C. Schwab, and J. Zech. Exponential ReLU DNN expression of holomorphic
maps in high dimension. Constructive Approximation, 55(1):537–582, 2022. (Cited on
p. 92)
[Pea01] K. Pearson. On lines and planes of closest fit to a system of points in space. The London, Ed-
inburgh, and Dublin Philosophical Magazine and Journal of Science, 6(2):559–571, 1901.
(Cited on pp. 54, 65)
[Pea09] J. Pearl. Causality. Cambridge University Press, Cambridge, UK, 2009. (Cited on p. 192)
[PFA06] X. Pennec, P. Fillard, and N. Ayache. A Riemannian framework for tensor computing.
International Journal of Computer Vision, 66:41–66, 2006. (Cited on p. 169)
[PG12] A. Paprotny and J. Garcke. On a connection between maximum variance unfolding, shortest
path problems and Isomap. In 15th International Conference on Artificial Intelligence and
Statistics (AISTATS 2012), La Palma, Canary Islands, Spain, pages 859–867, 2012. (Cited
on p. 85)
[Phi21] J. Phillips. Mathematical Foundations for Data Analysis. Springer, Cham, Switzerland,
2021. (Cited on p. 16)
[Pla98] J. Platt. Sequential minimal optimization: A fast algorithm for training support vector ma-
chines. Technical report, Microsoft Research Redmond Labs, Redmond, WA, USA, 1998.
(Cited on pp. 40, 44)
[PMG+ 17] N. Papernot, P. McDaniel, I. Goodfellow, S. Jha, B. Celik, and A. Swami. Practical black-
box attacks against machine learning. In Proceedings of the 2017 Asia Conference on Com-
puter and Communications Security, Abu Dhabi, UAE, pages 506–519, 2017. (Cited on
p. 122)
[Pow22] W. Powell. Reinforcement Learning and Stochastic Optimization. Wiley, Hoboken, NJ,
USA, 2022. (Cited on pp. 139, 145, 166)
[Pro05] P. Protter. Stochastic Integration and Differential Equations. Springer, Berlin, Heidelberg,
Germany, 2005. (Cited on p. 134)
[PSF19] X. Pennec, S. Sommer, and T. Fletcher. Riemannian Geometric Statistics in Medical Image
Analysis. Academic Press, London, UK, 2019. (Cited on p. 169)
[PT13] A. Paprotny and M. Thess. Realtime Data Mining. Applied and Numerical Harmonic
Analysis. Springer, Cham, Switzerland, 2013. (Cited on p. 139)
[PTVF07] W. Press, S. Teukolsky, W. Vetterling, and B. Flannery. Numerical Recipes: The Art of
Scientific Computing, 3rd edition. Cambridge University Press, Cambridge, UK, 2007.
(Cited on pp. 20, 57, 132)
212 Bibliography
[PV18] P. Petersen and F. Voigtlaender. Optimal approximation of piecewise smooth functions using
deep ReLU neural networks. Neural Networks, 108:296–330, 2018. (Cited on p. 91)
[QSC+ 17] Y. Qin, D. Song, H. Cheng, W. Cheng, G. Jiang, and G. Cottrell. A dual-stage attention-
based recurrent neural network for time series prediction. In Proceedings of the 26th In-
ternational Joint Conference on Artificial Intelligence, Melbourne, Australia, pages 2627–
2633. AAAI Press, 2017. (Cited on p. 179)
[RBAF+ 19] M. Rhif, A. Ben Abbes, I. Farah, B. Martínez, and Y. Sang. Wavelet transform application
for/in non-stationary time-series analysis: A review. Applied Sciences, 9(7):1345, 2019.
(Cited on p. 171)
[RBDG20] R. Roscher, B. Bohn, M. Duarte, and J. Garcke. Explainable machine learning for scientific
insights and discoveries. IEEE Access, 8(1):42200–42216, 2020. (Cited on pp. 191, 192)
[RBZ23] D. Ruiz-Balet and E. Zuazua. Neural ODE control for classification, approximation, and
transport. SIAM Review, 65(3):735–773, 2023. (Cited on p. 133)
[RDGF16] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time
object detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), Las Vegas, NV, USA, pages 779–788, 2016. (Cited on p. 62)
[Rec19] B. Recht. A tour of reinforcement learning: The view from continuous control. Annual
Review of Control, Robotics, and Autonomous Systems, 2(1):253–279, 2019. (Cited on
p. 139)
[RH20] L. Ruthotto and E. Haber. Deep neural networks motivated by partial differential equations.
Journal of Mathematical Imaging and Vision, 62(3):352–364, 2020. (Cited on p. 132)
[RHGS15] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection
with region proposal networks. In C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and
R. Garnett, editors, Advances in Neural Information Processing Systems, volume 28, pages
91–99, Red Hook, NY, USA, 2015. Curran Associates, Inc. (Cited on p. 130)
[RM10] S. Rohanizadeh and M. Moghadam. A proposed data mining methodology and its applica-
tion to industrial procedures. Journal of Optimization in Industrial Engineering, 2(4):37–50,
2010. (Cited on p. 15)
[RM21] K. Rusch and S. Mishra. Coupled oscillatory recurrent neural network (coRNN): An accu-
rate and (gradient) stable architecture for learning long time dependencies. In Proceedings
of the 9th international Conference on Learning Representations, online, 2021. (Cited on
p. 183)
[RMEM21] K. Rusch, S. Mishra, B. Erichson, and M. Mahoney. Long expressive memory for sequence
modeling. arXiv preprint, arXiv:2110.04744, 2021. (Cited on p. 183)
[Ros58] F. Rosenblatt. The perceptron: A probabilistic model for information storage and organiza-
tion in the brain. Psychological Reviews, 65:386–408, 1958. (Cited on p. 88)
Bibliography 213
[RS05] J. Ramsay and B. Silverman. Functional Data Analysis. Springer, New York, NY, USA,
2005. (Cited on p. 170)
[RS07] J. Ramsay and B. Silverman. Applied Functional Data Analysis: Methods and Case Studies.
Springer, New York, NY, USA, 2007. (Cited on p. 170)
[RSG16] M. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?" Explaining the predic-
tions of any classifier. In Proceedings of the 22nd ACM SIGKDD International Conference
on Knowledge Discovery and Data Mining, San Francisco, CA, USA, pages 1135–1144.
Association for Computing Machinery, 2016. (Cited on p. 194)
[RW06] C. Rasmussen and C. Williams. Gaussian Processes for Machine Learning. MIT Press,
Cambridge, MA, USA, 2006. (Cited on pp. 46, 47, 52)
[RW17] W. Rawat and Z. Wang. Deep convolutional neural networks for image classification: A
comprehensive review. Neural Computation, 29(9):2352–2449, 2017. (Cited on pp. 96,
108)
[RYH22] D. Roberts, S. Yaida, and B. Hanin. The Principles of Deep Learning Theory. Cambridge
University Press, Cambridge, UK, 2022. (Cited on p. 126)
[Saa03] Y. Saad. Iterative Methods for Sparse Linear Systems, 2nd edition. SIAM, Philadelphia,
PA, USA, 2003. (Cited on p. 28)
[Sag12] H. Sagan. Space-Filling Curves. Springer, New York, NY, USA, 2012. (Cited on p. 114)
[Sam65] P. Samuelson. Rational theory of warrant pricing. Industrial Management Review, 6(2):13–
32, 1965. (Cited on p. 134)
[SB18] R. Sutton and A. Barto. Reinforcement Learning: An Introduction, 2nd edition. MIT Press,
Cambridge, MA, USA, 2018. (Cited on pp. 139, 149, 150, 152, 153, 154, 155, 158, 162,
164, 165, 166)
[SBC16] W. Su, S. Boyd, and E. Candes. A differential equation for modeling Nesterov’s accelerated
gradient method: Theory and insights. Journal of Machine Learning Research, 17(153):1–
43, 2016. (Cited on p. 138)
[Sch15] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–
117, 2015. (Cited on p. 109)
[Sch22] B. Schölkopf. Causality for machine learning. In H. Geffner, R. Dechter, and J. Halpern,
editors, Probabilistic and Causal Inference: The Works of Judea Pearl, pages 765–804.
Association for Computing Machinery, New York, NY, USA, 2022. (Cited on p. 192)
[SDB16] L. Seversky, S. Davis, and M. Berger. On time-series topological data analysis: New data
and opportunities. In Proceedings of the 2016 IEEE Conference on Computer Vision and
Pattern Recognition Workshops, Las Vegas, NV, USA, pages 59–67, 2016. (Cited on p. 171)
[Sea18] D. Silver et al. A general reinforcement learning algorithm that masters chess, Shogi, and
Go through self-play. Science, 362(6419):1140–1144, 2018. (Cited on pp. 139, 160, 166)
[See04] M. Seeger. Gaussian processes for machine learning. International Journal of Neural Sys-
tems, 14(02):69–106, 2004. (Cited on p. 52)
[Set12] B. Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learn-
ing, 6(1):1–114, 2012. (Cited on p. 6)
[SFSW12] D. Schnitzer, A. Flexer, M. Schedl, and G. Widmer. Local and global scaling reduce hubs
in space. Journal of Machine Learning Research, 13(1):2871–2902, 2012. (Cited on p. 13)
[SHB16] R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with sub-
word units. In Proceedings of the 54th Annual Meeting of the Association for Computational
Linguistics (Volume 1: Long Papers), Berlin, Germany, pages 1715–1725. Association for
Computational Linguistics, 2016. (Cited on p. 185)
[She00] C. Shearer. The CRISP-DM model: The new blueprint for data mining. Journal of Data
Warehousing, 5:13–22, 2000. (Cited on p. 14)
[She20] A. Sherstinsky. Fundamentals of recurrent neural network (RNN) and long short-term mem-
ory (LSTM) network. Physica D: Nonlinear Phenomena, 404:132306, 2020. (Cited on
pp. 182, 183)
[SK87] L. Sirovich and M. Kirby. Low-dimensional procedure for the characterization of human
faces. Journal of the Optical Society of America A, 4(3):519–524, 1987. (Cited on p. 62)
[SK18] M. Simonovsky and N. Komodakis. GraphVAE: Towards generation of small graphs using
variational autoencoders. In V. Kůrková, Y. Manolopoulos, B. Hammer, L. Iliadis, and
I. Maglogiannis, editors, Artificial Neural Networks and Machine Learning – ICANN 2018,
pages 412–422, Cham, Switzerland, 2018. Springer. (Cited on p. 170)
[SLJ+ 15] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke,
and A. Rabinovich. Going deeper with convolutions. In Proceedings of the 2015 IEEE
Conference on Computer Vision and Pattern Recognition, Boston, MA, USA, pages 1–9.
IEEE, 2015. (Cited on pp. 193, 195)
[SMSM99] R. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for rein-
forcement learning with function approximation. In S. Solla, T. Leen, and K.-R. Müller,
editors, Advances in Neural Information Processing Systems, volume 12, pages 1057–1063,
Cambridge, MA, USA, 1999. MIT Press. (Cited on p. 166)
[SR04] M. Santos and J. Rust. Convergence properties of policy iteration. SIAM Journal on Control
and Optimization, 42(6):2094–2115, 2004. (Cited on p. 149)
[SR18] O. Sagi and L. Rokach. Ensemble learning: A survey. Wiley Interdisciplinary Reviews:
Data Mining and Knowledge Discovery, 8(4):e1249, 2018. (Cited on p. 171)
[SS96] S. Singh and R. Sutton. Reinforcement learning with replacing eligibility traces. Machine
Learning, 22(1-3):123–158, 1996. (Cited on p. 152)
[SS02] B. Schölkopf and A. Smola. Learning with Kernels – Support Vector Machines, Regulariza-
tion, Optimization, and Beyond. MIT Press, Cambridge, MA, USA, 2002. (Cited on pp. 38,
44, 46, 47, 48, 49, 51, 52, 126)
Bibliography 215
[SS16] S. Saitoh and Y. Sawano. Theory of Reproducing Kernels and Applications. Springer,
Singapore, 2016. (Cited on p. 47)
[Str19] G. Strang. Linear Algebra and Learning From Data. Wellesley-Cambridge Press, Wellesley,
MA, USA, 2019. (Cited on p. 16)
[Sut88] R. Sutton. Learning to predict by the methods of temporal differences. Machine Learning,
3(1):9–44, 1988. (Cited on pp. 153, 154)
[SW06] R. Schaback and H. Wendland. Kernel techniques: From machine learning to meshless
methods. Acta Numerica, 15:543–639, 2006. (Cited on p. 48)
[SWD+ 17] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimiza-
tion algorithms. arXiv preprint, arXiv:1707.06347, 2017. (Cited on p. 166)
[SX22] J. Siegel and J. Xu. High-order approximation rates for neural networks with ReLUk acti-
vation functions. Applied and Computational Harmonic Analysis, 58:1–26, 2022. (Cited on
p. 89)
[Tak81] F. Takens. Detecting strange attractors in turbulence. In D. Rand and L.-S. Young, editors,
Dynamical Systems and Turbulence, Warwick 1980, pages 366–381, Berlin, Heidelberg,
Germany, 1981. Springer. (Cited on p. 171)
[TdL00] J. Tenenbaum, V. de Silva, and J. Langford. A global geometric framework for nonlinear
dimensionality reduction. Science, 290(5500):2319–2323, 2000. (Cited on pp. 70, 71)
[Tes95] G. Tesauro. Temporal difference learning and TD-Gammon. Communications of the ACM,
38(3):58–68, 1995. (Cited on p. 139)
[Tik63] A. Tikhonov. Solution of incorrectly formulated problems and the regularization method.
Soviet Math. Dokl., 4:1035–1038, 1963. (Cited on p. 11)
[TK21] N. Takeishi and A. Kalousis. Physics-integrated variational autoencoders for robust and
interpretable generative modeling. In M. Ranzato, A. Beygelzimer, Y. Dauphin, P. Liang,
and J. Wortman Vaughan, editors, Advances in Neural Information Processing Systems,
volume 34, pages 14809–14821, Red Hook, NY, USA, 2021. Curran Associates, Inc. (Cited
on p. 123)
[TMB+ 16] D. Tomè, F. Monti, L. Baroffio, L. Bondi, M. Tagliasacchi, and S. Tubaro. Deep convolu-
tional neural networks for pedestrian detection. Signal Processing: Image Communication,
47:482–489, 2016. (Cited on p. 62)
[TZ15] N. Tishby and N. Zaslavsky. Deep learning and the information bottleneck principle. In
Proceedings of the 2015 IEEE Information Theory Workshop, Jerusalem, Israel, pages 1–5.
IEEE, 2015. (Cited on p. 108)
[Uns19] M. Unser. A representer theorem for deep neural networks. Journal of Machine Learning
Research, 20(110):1–30, 2019. (Cited on p. 92)
[VA06] S. Vassilvitskii and D. Arthur. k-means++: The advantages of careful seeding. In Proceed-
ings of the 18th Annual ACM-SIAM Symposium on Discrete Algorithms, New Orleans, LA,
USA, pages 1027–1035. SIAM, 2006. (Cited on p. 77)
216 Bibliography
[Vap99] V. Vapnik. The Nature of Statistical Learning Theory, 2nd edition. Springer, New York,
NY, USA, 1999. (Cited on pp. 16, 47)
[VdMH08] L. Van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine
Learning Research, 9(11):2579–2605, 2008. (Cited on pp. 84, 85)
[Vin00] R. Vinter. Optimal Control. Birkhäuser, Boston, MA, USA, 2000. (Cited on p. 139)
[vL07] U. von Luxburg. A tutorial on spectral clustering. Stat. Comput., 17(4):395–416, 2007.
(Cited on p. 79)
[vRMB+ 23] L. von Rueden, S. Mayer, K. Beckh, B. Georgiev, S. Giesselbach, R. Heese, B. Kirsch,
J. Pfrommer, A. Pick, R. Ramamurthy, M. Walczak, J. Garcke, C. Bauckhage, and
J. Schuecker. Informed machine learning—A taxonomy and survey of integrating knowl-
edge into learning systems. IEEE Transactions on Knowledge and Data Engineering,
35(1):614–633, 2023. (Cited on pp. 52, 137)
[vSMP+ 16] H. van Seijen, A. Mahmood, P. Pilarski, M. Machado, and R. Sutton. True online temporal-
difference learning. Journal of Machine Learning Research, 17(145):1–40, 2016. (Cited on
p. 165)
[VSP+ 17] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, Ł. Kaiser, and I. Polo-
sukhin. Attention is all you need. In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach,
R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Pro-
cessing Systems, volume 30, pages 6000–6010, Red Hook, NY, USA, 2017. Curran Asso-
ciates, Inc. (Cited on pp. 183, 185, 186, 187)
[Wah90] G. Wahba. Spline Models for Observational Data. SIAM, Philadelphia, PA, USA, 1990.
(Cited on p. 48)
[Wat89] C. Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, UK,
1989. (Cited on p. 158)
[Wen95] H. Wendland. Piecewise polynomial, positive definite and compactly supported radial func-
tions of minimal degree. Advances in Computational Mathematics, 4(1):389–396, 1995.
(Cited on p. 48)
[WH96] G. Wanner and E. Hairer. Solving Ordinary Differential Equations II: Stiff and Differential-
Algebraic Problems. Springer, Berlin, Heidelberg, Germany, 1996. (Cited on p. 127)
[WNH93] G. Wanner, S. Nørsett, and E. Hairer. Solving Ordinary Differential Equations I: Nonstiff
Problems. Springer, Berlin, Heidelberg, Germany, 1993. (Cited on pp. 127, 131)
[WR22] S. Wright and B. Recht. Optimization for Data Analysis. Cambridge University Press,
Cambridge, UK, 2022. (Cited on p. 16)
[WWS09] C. Wojek, S. Walk, and B. Schiele. Multi-cue onboard pedestrian detection. In 2009 IEEE
Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, pages 794–801.
IEEE, 2009. (Cited on pp. 62, 67)
[XTL09] H. Xie, H. Tang, and Y.-H. Liao. Time series prediction based on NARX neural networks:
An advanced approach. In International Conference on Machine Learning and Cybernetics,
Baoding, Hebei, China, volume 3, pages 1275–1279. IEEE, 2009. (Cited on p. 171)
[Yar17] D. Yarotsky. Error bounds for approximations with deep ReLU networks. Neural Networks,
94:103–114, 2017. (Cited on p. 91)
[YH38] G. Young and A. Householder. Discussion of a set of points in terms of their mutual dis-
tances. Psychometrika, 3(1):19–22, 1938. (Cited on p. 59)
[YHW+ 07] J. Yang, D. Hubball, M. Ward, E. Rundensteiner, and W. Ribarsky. Value and relation
display: Interactive visual exploration of large data sets with hundreds of dimensions. IEEE
Transactions on Visualization and Computer Graphics, 13(3):494–507, 2007. (Cited on
p. 7)
[YWV+ 22] J. Yu, Z. Wang, V. Vasudevan, L. Yeung, M. Seyedhosseini, and Y. Wu. CoCa: Contrastive
captioners are image-text foundation models. Transactions on Machine Learning Research,
2022. https://openreview.net/forum?id=Ee277P3AYC. (Cited on p. 108)
[YYR+ 18] J. You, R. Ying, X. Ren, W. Hamilton, and J. Leskovec. GraphRNN: Generating realistic
graphs with deep auto-regressive models. In Proceedings of the 35th International Con-
ference on Machine Learning, Stockholm, Sweden, volume 80 of Proceedings of Machine
Learning Research, pages 5708–5717. PMLR, 2018. (Cited on p. 170)
[ZA15] N. Zaitoun and M. Aqel. Survey on image segmentation techniques. Procedia Computer
Science, 65:797–806, 2015. (Cited on p. 62)
[ZHS+ 17] M. Zeestraten, I. Havoutis, J. Silvério, S. Calinon, and D. Caldwell. An approach for imita-
tion learning on Riemannian manifolds. IEEE Robotics and Automation Letters, 2(3):1240–
1247, 2017. (Cited on p. 169)
[ZQW20] W. Zhao, J. Queralta, and T. Westerlund. Sim-to-real transfer in deep reinforcement learning
for robotics: A survey. In 2020 IEEE Symposium Series on Computational Intelligence
(SSCI), pages 737–744, 2020. (Cited on p. 166)
[ZTS+ 21] S. Zhai, W. Talbott, N. Srivastava, C. Huang, H. Goh, R. Zhang, and J. Susskind. An
attention free transformer. arXiv preprint, arXiv:2105.14103, 2021. (Cited on p. 190)
Index
Q-value, 154
qPCR, see real-time polymerase chain reaction
QR decomposition, 21
query vector, 184, 187

r-ball neighborhood graph, 70
random forest, 170, 175–178
random walk, 72
real-time polymerase chain reaction, 82
recall, 23
rectified linear unit, 89, 91
rectified power unit, 92
recurrent neural network, 87, 108, 178–183, 190
   graph of, 179
regression, 1, 5
   deep learning, 87–99
   Gaussian process, 52
   k-nearest neighbors, 29–30
   linear least squares, 11, 17–28, 54
   logistic, 30
   random forest, 175–178
   ridge, 11
   support vector machine, 52
regularization, 10, 31
   dropout, 96
   LASSO, 31
   penalty, 114, 194
   slack variable, 38
   Tikhonov, 11, 31
   total variation, 194
   weight reduction, 96
reinforcement learning, 139, 150–166, 191
   assumptions in, 151
   deep, 139, 160
   model-based, 151
   model-free, 151
relative absolute error, 22
ReLU, see rectified linear unit
reparametrization trick, 121
replay memory, 163
representer theorem, 47
reproducing kernel Hilbert space, 47
RePU, see rectified power unit
residual neural network, 92, 129
   convolutional, 131
   forward propagation of, 130, 131
   stability of, 131, 132, 183
ResNet, see residual neural network
reward, 139, 141, 149
   discounted cumulative, 141, 143, 152, 154
   stochastic, 150
ridge regression, 11
RKHS, see reproducing kernel Hilbert space
RMSProp, 104
RNN, see recurrent neural network
RoBERTa, 191
robotics, 169
robustness, 192
Runge–Kutta method, 127

saliency map, 193
Sammon’s mapping, 68
SARSA, see state-action-reward-state-action
Schmidt–Eckart–Young–Mirsky theorem, 58
scikit-learn, 2, 15, 51
SDE, see stochastic differential equation
segmentation, see clustering
self-attention module, 183, 187
semi-gradient state-action-reward-state-action, 162
sensitivity, 23
sensitivity analysis, 192–193
separating hyperplane, 33–37
sequence-to-sequence problem, 183
sequential minimal optimization, 39–41
SGD, see stochastic gradient descent
SHAP, see Shapley additive explanation
Shapley additive explanation, 194
Shapley value, 194
shortest curve, 70
sigmoid function, 89
signature transformation, 171
similarity matrix, 78
simple recurrent network, 179
singular value decomposition, 19, 54, 57–58, 112
skip connection, 180, 188
slack variable, 38
SO(3), 169
soft margin hyperplane, 37–51
softmax, 99, 158, 166, 185
softmax exploration, 158
solution operator, 170
space-filling curve, 114
sparsity enforcing penalty, 114
special orthogonal group, 169
specificity, 23
spectral clustering, 78, 85
spectral gap, 78
stacking, 171–172
standardization, 28
state, 139
   cell, 181, 182
   hidden, 178, 182
   terminal, 143
state dynamics, 141, 149
state space, 139, 140, 149
   continuous, 159, 161
state-action-reward-state-action, 156, 158, 161
   semi-gradient, 162
stencil of a convolution, 98
Stiefel manifold, 55, 169
stochastic differential equation, 134
   forward diffusion, 135
   reverse diffusion, 135
stochastic gradient descent, 92–93, 101, 126, 161, 163
   AdaGrad, 104
   convergence of, 102
   epoch of, 93
   learning rate of, 93
   momentum-based, 103, 138
   Nesterov, 103, 138
   RMSProp, 104
stochastic process, 133
Stone–Weierstrass theorem, 91
stride, 97
subword unit, 185
supervised learning, 4–6, 8, 17–52, 87–108, 175–178
support vector, 36
support vector machine, 37–51, 63
SVD, see singular value decomposition
SVM, see support vector machine

t-distributed stochastic neighbor embedding, 84
t-SNE, see t-distributed stochastic neighbor embedding
Taken’s theorem, 171
target value, 152
   decoupled, 163
Taylor formula, 128, 195
TD, see temporal difference
temporal difference, 152, 156
   λ, 165
   n-step, 165
tensor processing unit, 108
TensorFlow, 2, 15, 105
terminal state, 143
test data, 5, 21
test error, 21
text generation, 183
Tikhonov regularization, 11, 31
time horizon, 181
time series, 108, 123, 133, 169, 171, 178, 183
token, 185
Torgerson multi-dimensional scaling, 59, 67, 71
TPU, see tensor processing unit
training data, 5, 9, 22
training error, 9
trajectory, 139, 152
transformer, 171, 183–191
   graph of, 186
transition map, 141
transition matrix, 73, 78, 147, 150
translator, 183, 189

uncertainty quantification, 138, 170
universal approximation theorem, 91
unsupervised learning, 4, 6–7, 53–86, 109–123, 133–137
update gate, 182

VAE, see variational autoencoder
validation data, 49
value function, 141, 143, 155
   optimal, 142, 155
value iteration, 145, 150, 156
value vector, 184, 187
vanishing gradients, 180
variance, 7, 60, 72, 135, 175
variational autoencoder, 109, 117–123
   dynamical, 123
   graph of, 121
Verlet integration, 131
visualization, 53, 61, 79, 193
voting, 171

Wasserstein distance, 167
wavelet feature, 171
weight reduction, 96
weight sharing, 96, 113
Wolfe duality, 36
word embedding, 184, 185
working set, 39
Algorithmic Mathematics in Machine Learning
Bastian Bohn, Jochen Garcke, Michael Griebel
Data Science Book Series

This unique book explores several well-known machine learning and data analysis algorithms from a mathematical and programming perspective. The authors
• present machine learning methods, review the underlying mathematics, and provide programming exercises to deepen the reader’s understanding;
• accompany application areas with exercises that explore the unique characteristics of real-world data sets (e.g., image data for pedestrian detection, biological cell data); and
• provide new terminology and background information on mathematical concepts, as well as exercises, in “info-boxes” throughout the text.

Algorithmic Mathematics in Machine Learning is intended for mathematicians, computer scientists, and practitioners who have a basic mathematical background in analysis and linear algebra, but little or no knowledge of machine learning and related algorithms. Researchers in the natural sciences and engineers interested in acquiring the mathematics needed to apply the most popular machine learning algorithms will also find this book useful.

This book is appropriate for a practical lab or basic lecture course on machine learning within a mathematics curriculum.

Bastian Bohn is an Akademischer Rat at the Institute for Numerical Simulation, University of Bonn, Germany, where he was previously a postdoctoral researcher. His research interests include machine learning, the mathematics of data science, numerical algorithms in high dimensions, and approximation theory.

ISBN: 978-1-61197-787-5