Deep Learning and Computational Physics

Deep Ray
Department of Mathematics
University of Maryland
College Park, MD, USA

Orazio Pinti
Pasteur Labs
Brooklyn, NY, USA

Assad A. Oberai
Department of Aerospace and Mechanical Engineering
USC Viterbi School of Engineering
University of Southern California
Los Angeles, CA, USA
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Switzerland AG 2024
Preface

These notes were compiled as lecture notes for a course developed and taught at the University of Southern California. They should be accessible to a typical engineering graduate student with a strong background in Applied Mathematics.
The main objective of these notes is to introduce a student who is familiar
with concepts in linear algebra and partial differential equations to select topics
in deep learning. These lecture notes exploit the strong connections between deep
learning algorithms and the more conventional techniques of computational physics
to achieve two goals. First, they use concepts from computational physics to develop
an understanding of deep learning algorithms. Not surprisingly, many concepts in
deep learning can be connected to similar concepts in computational physics, and
one can utilize this connection to better understand these algorithms. Second, several
novel deep learning algorithms can be used to solve challenging problems in computational physics. Thus, they offer someone who is interested in modeling a physical phenomenon a complementary set of tools.
Acknowledgements
The authors would like to acknowledge Prof. Alvaro Coutinho and Dr. Dhruv Patel for their insightful feedback on an earlier version of this book, and Drs. Agnimitra Dasgupta, Saeed Moazami, and Anirban Chandra for proofreading portions of this book.
Contents

1 Introduction
  1.1 Computational Physics
  1.2 Machine Learning
    1.2.1 Examples of ML
    1.2.2 Types of ML Algorithms Based on Tasks
  1.3 Artificial Intelligence, Machine Learning and Deep Learning
  1.4 Machine Learning and Computational Physics
  1.5 Computational Exercise
2 Introduction to Deep Neural Networks
  2.1 MLP Architecture
  2.2 Activation Functions
    2.2.1 Linear Activation
    2.2.2 Rectified Linear Unit (ReLU)
    2.2.3 Leaky ReLU
    2.2.4 Logistic Function
    2.2.5 Tanh
    2.2.6 Sine
  2.3 Expressivity of a Network
    2.3.1 Universal Approximation Results
  2.4 Training, Validation and Testing of Neural Networks
  2.5 Overfitting and How to Avoid It
    2.5.1 Regularization
  2.6 Gradient Descent
    2.6.1 Convergence
  2.7 Some Advanced Optimization Algorithms
    2.7.1 Momentum Methods
    2.7.2 Adam
    2.7.3 Stochastic Optimization
  2.8 Calculating Gradients Using Back-Propagation
  2.9 Regression Versus Classification
References
About the Authors
Orazio Pinti is a Research Scientist at Pasteur Labs, working in the field of scien-
tific machine learning and computational physics. He holds a BSc and MSc from
the Polytechnic University of Turin, and a PhD from the University of Southern
California, all in Aerospace Engineering. His interests include applied mathematics,
machine learning, and computational science, with a focus on reduced-order and
multi-fidelity modeling.
Chapter 1
Introduction

The theory and algorithms of statistical learning and data-driven methods have been around since the early 19th century. But the first hints of modern machine learning
ing can be traced back to the 1943 work of McCulloch and Pitts [63] who proposed
the first model of an artificial neuron loosely based on the functioning of a biological
neuron in vertebrates. Arthur Samuel is popularly credited with coining the term "machine learning" in 1959, when he was at IBM performing research on teaching
a computer to play checkers [93]. Machine learning has been very successful in
applications such as computer vision, speech recognition and natural language pro-
cessing. But the last few years have also witnessed the emergence of machine learning
(in particular deep learning) algorithms to solve physics-driven problems, such as
approximating solutions to partial differential equations and inverse problems.
This course deals with topics that lie at the interface of computational physics and machine learning. Before we can appreciate the need to combine these two important concepts, we need to understand what each of them means on its own.

1.1 Computational Physics

Computational physics is typically used to model a physical phenomenon through the following broad steps:
1. Observe the physical phenomenon of interest and collect measurement data.
2. Postulate a physical law that explains these observations.
3. Write down a mathematical description of the law. This could make use of ordi-
nary differential equations (ODEs), partial differential equations (PDEs), integral
equations, etc.
4. Once the mathematical model is framed, compute the solution of the system. There are two ways to obtain it:
(a) In certain situations, an exact analytical form of the solution can be obtained. For instance, one could solve ODEs/PDEs using separation of variables, Laplace transforms, Fourier transforms or integrating factors.
(b) In most scenarios, an exact expression of the solution cannot be obtained, and it must be suitably approximated using a numerical algorithm. For instance, one could use forward or backward Euler, the mid-point rule, or Runge-Kutta schemes for solving systems of ODEs [14]; or one could use finite difference/volume/element methods for solving PDEs [3].
5. Once the algorithm to evaluate the solution (exactly or approximately) is designed,
use it to validate the mathematical model, i.e., see if the predictions agree with
the data collected.
All these steps broadly describe what computational physics entails.
1.2 Machine Learning

Unlike computational physics, machine learning (ML) does not require the postulation of a physical law. The general steps involved are:
1. Collect data by observing the physical phenomenon, through real-time measurements of some observable, or by using a numerical solver approximating the phenomenon.
2. Train a suitable algorithm using the collected data, with the aim of discovering a
pattern or relation between the various samples. See Sect. 1.2.1 for some concrete
examples.
3. Once trained, use the ML algorithm to make future predictions, and validate it
with additional collected data.
1.2.1 Examples of ML

1. Regression: Suppose we are given data pairs $(x_i, y_i)$, $1 \le i \le N$, sampled from some underlying function $f$, which we wish to approximate with a parametric model $\tilde{f}(x; a, b)$, such as the linear model $\tilde{f}(x; a, b) = ax + b$. The parameters are determined by minimizing the loss

$$\Pi(a, b) = \sum_{i=1}^{N} | y_i - \tilde{f}(x_i) |^2.$$

If $(a^*, b^*) = \arg\min_{a,b} \Pi(a, b)$, then we can consider $\tilde{f}^*(x) := \tilde{f}(x; a^*, b^*)$ to be the approximation of $f(x)$ (see Fig. 1.1a, and the sketch after this list).
2. Decision trees: We are given a dataset from a sample population, containing
the features: age and income. Furthermore, the data is divided into two groups;
an individual in Group A owns a house while an individual in Group B does
not. Then, given the features of a new data point, we would like to predict the
probability of this new individual owning a house. Decision trees can be used
to solve this classification problem. The way a typical decision tree works is by
making cuts that maximize the group-based separation for the samples in the
dataset (see Fig. 1.1b). Then, based on these cuts, the algorithm determines the
probability of belonging to a particular class/group for a new point.
3. Clustering algorithms: Given a set of data with a number of features per sample, find clusters/patterns in the data (see Fig. 1.1c).
1.2.2 Types of ML Algorithms Based on Tasks

Based on the nature of the task and of the available data, ML algorithms are commonly grouped into the following types:
1. Supervised learning: These methods learn from labelled data, i.e., input-output pairs, as in the regression example above.
2. Unsupervised learning: These methods look for patterns or structure in unlabelled data, as in the clustering example above.
3. Semi-supervised learning: These methods use a combination of labelled and unlabelled data for training. For example, we are given 10,000 images that are unlabeled and only 50 images that are labeled. Can we use this dataset to develop an image classification algorithm?
4. Reinforcement learning: The methods belonging to this family learn based on rewards or penalties for the decisions they take. A suitable path/policy is thus learned to maximize the reward. This strategy can be used to train algorithms to play chess or Go.
In this course, we will primarily focus on the first two types of ML algorithms.
1.3 Artificial Intelligence, Machine Learning and Deep Learning

At times, the terms Artificial Intelligence (AI), ML and Deep Learning (DL) are used
interchangeably. In reality, these are three related but different concepts. This can be
understood by looking at the Venn diagram in Fig. 1.2.
AI refers to a system with human-like intelligence. While ML is a key component
of an AI system, there are other ingredients involved. A self-driving car is a proto-
typical example of AI. Let’s take a closer look at the design of such a system (see
Fig. 1.3). A car is mounted with a camera which takes live images/video of the road
ahead. These frames are then passed to an ML algorithm which performs a semantic
segmentation, i.e., segments out different regions of the frame and classifies the type
of object (car, tree, road, sky, etc.) in each segment. Once this segmentation is done,
it is passed to a decision system that decides what the next action of the car should
be based on this segmented image. This information then passes through a control
module that actually controls the mechanical actions of the car. This entire process
mimics what a real driver would do, and is thus artificial intelligence.
On the other hand, machine learning (ML) refers to the components of this system that are trained using data. That is, they learn from data. In the example above, the
Semantic Segmenter is one such system. There are many ML algorithms that can
perform this task using data, and we will learn about some of these in this course. The
Decision System could also be an ML component—where the appropriate decision
to be made is learnt from prior data. However, it could also be a non-ML rule-based
expert system.
DL is a subset of ML algorithms. The simplest form of a DL architecture, known as
a feed-forward network, comprises a number of layers of non-linear transformations.
This architecture is loosely motivated by how signals are transmitted by the central nervous system in living organisms. We will study the DL architecture in greater detail in
Chap. 2.
1.4 Machine Learning and Computational Physics

Now that we have a better understanding of computational physics and ML, the next obvious question would be: "why do we need to look at a combination of the two?" We list a few motivations below:
• Some physical phenomena are well understood, while others are not: for the former we may have a well-understood mathematical model, while for the latter we may have to rely on ML to develop a model.
• ML in general is very data-hungry. But knowledge of the physics can help restrict the manifold on which the inputs and solutions/predictions lie. With such constraints, we can reduce the amount of data required to train the ML algorithm.
• Tools for analyzing computational physics (functional analysis, numerical anal-
ysis, notions of convergence to exact solutions, probabilistic frameworks) carry
over to ML. Applying these tools helps us to better understand existing ML algorithms and to design better ones.
The various topics covered in this course are briefly summarized in the chapters that follow.

1.5 Computational Exercise
In order to effectively use the various ML algorithms discussed in this course, and
solve the computational exercises, a good grasp of Python programming and PyTorch
is required. Listed below are various resources and tutorials about the necessary
programming concepts.
1. Python basics, NumPy and plotting: A good resource for those who have never used Python before, or for those who are familiar with Python and want to refresh their programming knowledge, is https://www.w3schools.com/python/. This tutorial covers important Python modules such as NumPy, Pandas and SciPy, describes how to generate plots using Matplotlib, and explains how to read/write files.
2. PyTorch: The codes used in this course will be written using PyTorch, which is
a machine learning framework originally developed by Meta AI. You can install
PyTorch locally on your machines by following the instructions given here https://
pytorch.org/get-started/locally/. Various tutorials on using PyTorch can be found
here https://pytorch.org/tutorials/.
3. Google Colab: If you do not wish to install PyTorch locally, or do not have the
appropriate hardware to train deep learning models, you could also use Google
Colab https://colab.research.google.com. Colab is essentially a combination of
Jupyter notebook and Google Drive, and only requires you to have a Google
account. The attractive thing about Colab is that it comes preinstalled with many useful packages (like NumPy and PyTorch), so everyone can use it without worrying about installing the correct versions of libraries and dependencies. Furthermore, it runs entirely on the cloud and can be launched and used directly through the web browser. It also gives free access to powerful GPUs and TPUs.
Chapter 2
Introduction to Deep Neural Networks
In this chapter, we introduce the simplest deep learning architecture used in practice,
which is known as the multilayer perceptron (MLP). We will discuss its various
components, its ability to approximate functions with different regularity (universal approximation results), and the training paradigms used to learn its parameters (and hyperparameters) without overfitting the training dataset.
2.1 MLP Architecture

The input to the network is a vector $x^{(0)} \in \mathbb{R}^d$, i.e., the signal provided by the source layer. In each layer $l$, $1 \le l \le L + 1$, the $i$-th neuron performs an affine transformation on that layer's input $x^{(l-1)}$, followed by a non-linear transformation
Fig. 2.1 MLP with 2 hidden layers, with a depiction of the transformation occurring inside each
neuron
$$x_i^{(l)} = \sigma\big( \underbrace{W_{ij}^{(l)} x_j^{(l-1)}}_{\text{Einstein sum}} + b_i^{(l)} \big), \qquad 1 \le i \le H_l, \; 1 \le j \le H_{l-1}, \qquad (2.1)$$
where the activation function $\sigma$ is applied component-wise and summation over the repeated index $j$ is implied. Thus, the action of the whole network $F : \mathbb{R}^d \mapsto \mathbb{R}^D$ can be mathematically seen as a composition of alternating affine transformations and component-wise activations.
3. We use the term depth of the network to denote the number of computing layers in the MLP, i.e., the number of hidden layers plus the output layer, which would be $L + 1$ as per the notation used above.
The learnable parameters of the network are all the weights and biases, which we
represent as
$$\theta = \{ W^{(l)}, b^{(l)} \}_{l=1}^{L+1} \in \mathbb{R}^{N_\theta},$$
where $N_\theta$ denotes the total number of parameters of the network. The network $F(x; \theta)$ represents a family of parameterized functions, where $\theta$ needs to be suitably chosen such that the network approximates the target function $f(x)$ at the input $x$.
Question 2.1.1 Prove that $N_\theta = \sum_{l=1}^{L+1} (H_{l-1} + 1) H_l$.
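As a quick sanity check of this formula, one can count the parameters of a small PyTorch MLP; this sketch assumes layer widths (d, H1, H2, D) = (2, 8, 8, 1), a choice that is ours.

import torch
from torch import nn

widths = [2, 8, 8, 1]  # H_0 = d, H_1, H_2, H_3 = D
net = nn.Sequential(nn.Linear(2, 8), nn.Tanh(),
                    nn.Linear(8, 8), nn.Tanh(),
                    nn.Linear(8, 1))

n_torch = sum(p.numel() for p in net.parameters())
n_formula = sum((widths[l - 1] + 1) * widths[l] for l in range(1, 4))
print(n_torch, n_formula)  # both equal 105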
2.2 Activation Functions

The activation function is perhaps the most important component of an MLP. A large
number of activations are available in literature, each with its own advantages and
disadvantages. Let us take a look at a few of these options (also see Fig. 2.2).
2.2.1 Linear Activation

The linear activation is simply the identity map, $\sigma(\xi) = \xi$. Some features of this function:
• The function is infinitely smooth, and all derivatives of order two and higher are zero.
• The range of the function is $(-\infty, \infty)$.
• Using the linear activation function (in all layers) reduces the entire network to a single affine transformation of the input $x$. In other words, the network will be nothing more than a linear approximation of the target function $f$, which is not useful if $f$ is non-linear.
2.2.2 Rectified Linear Unit (ReLU)

The ReLU activation is defined as $\sigma(\xi) = \max\{\xi, 0\}$. This is one of the most popular activation functions used in practice. Some features of this function are:
• The function is continuous, while its derivative is piecewise constant with a jump at $\xi = 0$. The second derivative is a Dirac function concentrated at $\xi = 0$. In other words, the higher-order derivatives (of order greater than 1) are not well-defined.
• The range of the function is $[0, \infty)$.
2.2.3 Leaky ReLU

The ReLU activation leads to a null output from a neuron if the affine transformation of the neuron is negative. This can lead to the phenomenon of dying neurons [62] while training a neural network, where neurons drop out completely from the network and no longer contribute to the final prediction. To overcome this challenge, a leaky version of ReLU was designed:

$$\sigma(\xi; \alpha) = \begin{cases} \xi, & \text{if } \xi \ge 0, \\ \alpha \xi, & \text{if } \xi < 0, \end{cases} \qquad (2.5)$$

where $\alpha > 0$ is a small slope parameter.
• The derivatives of Leaky ReLU behave in the same way as those of ReLU, except that the first derivative is now non-zero on either side of $\xi = 0$.
• The range of the function is $(-\infty, \infty)$.
2.2.4 Logistic Function

The logistic (or sigmoid) function is given by

$$\sigma(\xi) = \frac{1}{1 + e^{-\xi}}. \qquad (2.6)$$
2.2.5 Tanh
The tanh function can be seen as a symmetric extension of the logistic function:

$$\sigma(\xi) = \frac{e^{\xi} - e^{-\xi}}{e^{\xi} + e^{-\xi}}. \qquad (2.7)$$
2.2.6 Sine
Recently, the sine function, i.e., $\sigma(\xi) = \sin(\xi)$, has been proposed as an efficient activation function [100]. It combines the best features of all the activation functions discussed above.
Question 2.2.1 Can you think of an MLP architecture with the sine activation func-
tion, which leads to an approximation very similar to a Fourier series expansion?
Remark 2.2.1 There are many more activation functions that are used in practice; a comprehensive list of activation functions can be found in [25].
2.3 Expressivity of a Network

Let us try to understand the effect of increasing $N_\theta$ in an MLP. In popular parlance, this would be referred to as examining the effect of increasing the expressivity of a network. To see this, let us consider a simple example using the ReLU activation function, i.e., $\sigma(\xi) = \max\{\xi, 0\}$. We set $d = D = 1$, $L = 1$ and the parameters
$$W^{(1)} = \begin{bmatrix} 2 \\ 1 \end{bmatrix}, \qquad b^{(1)} = \begin{bmatrix} -2 \\ 0 \end{bmatrix}, \qquad W^{(2)} = \begin{bmatrix} 1 & 1 \end{bmatrix}, \qquad b^{(2)} = 0.$$
Notice that while the components of the hidden-layer output $x^{(1)}$ (see Fig. 2.3b, c) each have only one corner/kink, the final output ends up having two kinks (see Fig. 2.3d).
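This can be verified numerically; the sketch below evaluates the two-neuron network with the parameters above and locates the points where the slope of the output changes:

import numpy as np

relu = lambda z: np.maximum(z, 0.0)
W1, b1 = np.array([[2.0], [1.0]]), np.array([-2.0, 0.0])
W2 = np.array([[1.0, 1.0]])

x = np.linspace(-2.0, 2.0, 401)
hidden = relu(W1 @ x[None, :] + b1[:, None])  # x^(1): kinks at x = 1 and x = 0
y = (W2 @ hidden).ravel()                     # final output (b^(2) = 0)

slope = np.diff(y) / np.diff(x)
kinks = x[1:-1][np.abs(np.diff(slope)) > 1e-6]
print(kinks)  # approximately [0, 1]: two kinks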
We generalize this formulation to a bigger network with $L$ hidden layers, each of width $H$. Then one can expect that $x_i^{(1)}$, $1 \le i \le H$, will have a single kink, with the location and angle of the kink depending on the weights and bias associated with each neuron of the hidden layer. The vector $x^{(1)}$ is passed to the next hidden layer, where each neuron will combine the single kinks and give an output with possibly $H$ kinks. Once again, the location and angles of the $H$ kinks in the output from each neuron of the second hidden layer will be different. The location of the kinks will be different because each neuron is allowed a different bias, and can therefore induce a different shift. Continuing this argument, one can expect the number of kinks to increase as $H$, $H^2$, $H^3$, ... as the signal passes through successive hidden layers of width $H$. In general, the total number of kinks can grow as $H^L$. Further, by appropriately selecting the weights and the biases in the network, one could select the location of the
kinks, and the slope in the segment between the kinks. Thus, one could approximate any continuous function by generating a piecewise linear approximation. In other words, the network has the ability to become more expressive as the depth (and width) of the network is increased.
As an illustration, in Fig. 2.4 we show the approximation of $f(x) = \sin(2\pi x)$ for $x \in [0, 1]$ using neural networks of varying depth (in terms of $L$) and a fixed hidden-layer width $H = 3$, trained on 50 equispaced (in $x$) training samples. Note that as the depth increases, the network develops more kinks, allowing it to bend enough to better approximate the target function.
Fig. 2.4 Approximating $\sin(2\pi x)$ using an MLP with $H = 3$ for each hidden layer and varying depth. The blue (−•) curves denote the true function/data, while the orange (−*) curves denote the predictions by the trained networks
2.3.1 Universal Approximation Results

This theorem states that a network with a single hidden layer can approximate any continuous function to within any specified point-wise error, if one is allowed to select an arbitrarily large value for its width.
This theorem provides an analogous result for a network with a fixed and finite width, and an arbitrarily large depth.
The theorem above specifies how the approximation error for a network can vary as the number of parameters in the network, denoted by $N_\theta$, changes.
Theoretical results like those mentioned above help demystify the "black-box" nature of neural networks, and serve as useful practical guidelines when designing network architectures.
2.4 Training, Validation and Testing of Neural Networks

Now that we have a better understanding of the architecture of MLPs, we would like
to discuss how the parameters of these networks are set to approximate some target
function. We restrict our discussions to the framework of supervised learning.
Let us assume that we are given a dataset of pairwise samples $S = \{(x_i, y_i) : 1 \le i \le N\}$ corresponding to a target function $f : x \mapsto y$. We wish to approximate this function using the neural network $F(x; \theta, \Theta)$, where $\theta$ are the network parameters defined before, while $\Theta$ corresponds to the hyper-parameters of the network. These include quantities such as the depth $L + 1$, the width $H$, the type of activation function $\sigma$, etc. The strategy to design a robust network
involves three steps:
1. Find the optimal values of $\theta$ (for a fixed $\Theta$) in the training phase.
2. Find the optimal values of $\Theta$ in the validation phase.
3. Test the performance of the network on unseen data in the testing phase.
To accomplish these three tasks, it is customary to split the dataset $S$ into three distinct parts: a training set with $N_{\text{train}}$ samples, a validation set with $N_{\text{val}}$ samples, and a test set with $N_{\text{test}}$ samples, with $N = N_{\text{train}} + N_{\text{val}} + N_{\text{test}}$. Typically, one uses around 60% of the samples for training, 20% for validation, and the remaining 20% for testing.
Splitting the dataset is necessary because neural networks are heavily over-parameterized functions. The large number of degrees of freedom available to model the data can lead to over-fitting. This happens when the error or noise present in the data drives the behavior of the network more than the underlying input-output relation itself. Thus, one part of the data is used to determine $\theta$, and another part to determine the hyper-parameters $\Theta$. The remainder of the data is kept aside for testing the performance of the trained network on unseen data, i.e., the network's ability to generalize well.
Now let us discuss how this split is used during the three phases in further detail:
Training: Training the network makes use of the training set $S_{\text{train}}$ to solve the following optimization problem: Find

$$\theta^* = \arg\min_{\theta} \Pi_{\text{train}}(\theta), \quad \text{where} \quad \Pi_{\text{train}}(\theta) = \frac{1}{N_{\text{train}}} \sum_{\substack{i=1 \\ (x_i, y_i) \in S_{\text{train}}}}^{N_{\text{train}}} \| y_i - F(x_i; \theta, \Theta) \|^2$$

for some fixed $\Theta$. The optimal $\theta^*$ is obtained using a suitable gradient-based algorithm (to be discussed later). The function $\Pi_{\text{train}}$ is referred to as the training loss function. In the example above we have used the mean-squared loss function. Later we will consider other types of loss functions.
Validation: Validation of the network involves using the validation set $S_{\text{val}}$ to solve the following optimization problem: Find

$$\Theta^* = \arg\min_{\Theta} \Pi_{\text{val}}(\Theta), \quad \text{where} \quad \Pi_{\text{val}}(\Theta) = \frac{1}{N_{\text{val}}} \sum_{\substack{i=1 \\ (x_i, y_i) \in S_{\text{val}}}}^{N_{\text{val}}} \| y_i - F(x_i; \theta^*, \Theta) \|^2.$$

The optimal $\Theta^*$ is obtained using techniques such as a (random or tensor) grid search. Note that the optimal $\theta^*$ depends on the choice of $\Theta$, i.e., $\theta^* = \theta^*(\Theta)$. For ease of notation, we have suppressed this dependency here.
Testing: Finally, the performance of the trained network is evaluated on the test set $S_{\text{test}}$ by computing

$$\Pi_{\text{test}} = \frac{1}{N_{\text{test}}} \sum_{\substack{i=1 \\ (x_i, y_i) \in S_{\text{test}}}}^{N_{\text{test}}} \| y_i - F(x_i; \theta^*, \Theta^*) \|^2.$$

This testing error is also known as the (approximate) generalization error of the network.
Let us look at an example to better understand how such a network is obtained.

Example 2.4.1 Consider an MLP where all hyper-parameters are fixed except for the following flexible choices:

$$\sigma \in \{\text{ReLU}, \tanh\}, \qquad L \in \{10, 20\}.$$
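A minimal sketch of the resulting validation grid search over these two hyper-parameters; the helper functions train_network and compute_val_loss are hypothetical stand-ins for the training and validation problems defined above.

import itertools

best_val, best_Theta = float('inf'), None
# Tensor grid over the two flexible hyper-parameters
for sigma, L in itertools.product(['relu', 'tanh'], [10, 20]):
    theta_star = train_network(sigma, L)          # optimize theta for fixed Theta
    val = compute_val_loss(theta_star, sigma, L)  # evaluate Pi_val
    if val < best_val:
        best_val, best_Theta = val, (sigma, L)
print(best_Theta, best_val)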
Remark 2.4.1 The notion of the test dataset is important to understand, as it is often misused in practice. In the "correct" approach, the test data should never be used to improve the performance of the network. Once the network (with optimized $\theta$ and $\Theta$) is evaluated on the test set, further changes to $\theta$ and $\Theta$ to improve the test performance amount to "data-snooping", with the test set becoming a glorified validation set.
2.5 Overfitting and How to Avoid It

Neural networks, especially MLPs, are almost always over-parametrized, i.e., $N_\theta \gg N_{\text{train}}$, where $N_{\text{train}}$ is the number of training samples. This can lead to a highly nonlinear network model; such a situation is depicted in Fig. 2.5a. In this figure, we show the approximation of a scalar-valued function $f$ (black curve) taking a scalar input $x$. The magenta curve denotes the network prediction, the red dots are the noisy training points, while the blue crosses are the noisy validation points. An over-parameterized MLP has the tendency to fit the noise in the training set, thus leading to poor performance on the validation (or any unseen) data. We would like to avoid such overfitting.
Fig. 2.5 The behavior of an un-regularized/overfit network (a) and a regularized network (b) approximating a scalar-valued function taking a scalar input $x$. The black curve denotes the true function $f$, the magenta curve denotes the network prediction, the red dots are the training points, while the blue crosses are the validation points
The first thing we would like to do is determine whether we are in this situation. How do we know this? A clue can be gleaned from Fig. 2.5a by noting the distance between the magenta curve (the MLP prediction) and the validation data points, indicated by the blue crosses. This distance is much larger than the distance between the magenta curve and the training data points. This translates to $\Pi_{\text{val}} \gg \Pi_{\text{train}}$. Thus, whenever one observes a large gap between the validation and training losses, one should suspect that the network is overfitting the data.
Now that we have a way of identifying overfitting, the next question is how to avoid it. There are several approaches; the most popular is referred to as regularization and is described next.
2.5.1 Regularization
From Fig. 2.5a, we observe that one significant difference between the black curve (what we desire) and the magenta curve (what we predict) is that the former is much smoother than the latter. That is, the derivative of the black curve with respect to its argument is much smaller than that of the magenta curve. This tells us that we would like to train MLPs such that $| \partial F(x) / \partial x |$ is small. Regularization is a way to achieve this goal.
Consider the output of the first hidden layer of a network whose input is a scalar $x$,

$$x_1^{(1)} = \sigma( W_{11}^{(1)} x + b_1^{(1)} ),$$

which gives

$$\frac{\partial x_1^{(1)}}{\partial x} = \sigma'( W_{11}^{(1)} x + b_1^{(1)} ) \, W_{11}^{(1)} \propto W_{11}^{(1)}.$$

Since this derivative scales with $W_{11}^{(1)}$, limiting the value of $W_{11}^{(1)}$ also limits the value of this derivative. Of course, this is just the derivative of the output of the first hidden layer of the network. We will show later that the derivative of the final output of the MLP is a product of such derivatives. Therefore, if we can control this derivative, and others like it, we can control the overall derivative. The most obvious way to do this is by penalizing large values of the weights in the network. This is precisely what is accomplished by adding a regularization term to the loss function.
The simplest method of regularization involves adding a penalty term to the loss function:

$$\Pi(\theta) \longrightarrow \Pi(\theta) + \alpha \|\theta\|, \qquad \alpha \ge 0,$$

where $\|\theta\|$ is a suitable norm of the parameter vector, so that the augmented loss is large whenever $\|\theta\|$ is also large. Now, in addition to finding the values of the parameters that best match the training data, we are also looking for parameters that are small, and will therefore lead to a smoother output from the MLP.
Let us consider some common types of regularization:
• $l_2$ regularization: Here we use the $l_2$ norm in the regularization term,

$$\|\theta\| = \|\theta\|_2 = \left( \sum_{i=1}^{N_\theta} \theta_i^2 \right)^{1/2}.$$

• $l_1$ regularization: Here we use the $l_1$ norm in the regularization term,

$$\|\theta\| = \|\theta\|_1 = \sum_{i=1}^{N_\theta} |\theta_i|,$$

which is known to promote sparsity in the network parameters.
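In PyTorch, an $l_2$ penalty of this type is most easily obtained through the weight_decay argument of the optimizers (which plays the role of $\alpha$), or by adding the penalty to the loss explicitly. A minimal sketch, assuming a model net and data tensors x, y already exist:

import torch

# l2-type penalty via the optimizer
opt = torch.optim.SGD(net.parameters(), lr=1e-3, weight_decay=1e-4)

# Equivalently, an explicit l1 penalty added to the loss
alpha = 1e-4
loss = torch.mean((net(x) - y) ** 2)
loss = loss + alpha * sum(p.abs().sum() for p in net.parameters())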
In Fig. 2.5b, we depict the predictions of a regularized network. Using the penalty term, we obtain a network with lower complexity, whose predictions are smoother. Further, we note that the mismatch between the prediction and the validation points is lowered, but at the cost of a slightly higher error on the training data. Therefore, we now have $\Pi_{\text{val}} \approx \Pi_{\text{train}}$. In the terminology of statistical learning, this is known as the bias-variance tradeoff [45], which says that as the complexity of the prediction model decreases, the bias (error introduced by approximating a complicated $f$ by a much simpler $F$) increases, while the variance (sensitivity of $F$ to changes in the training data) decreases. In practice, this is better monitored in terms of $\Pi_{\text{val}}$ and $\Pi_{\text{train}}$. As shown in Fig. 2.6, $\Pi_{\text{val}}$ is typically high
Fig. 2.6 Bias-variance tradeoff. As the model complexity increases (e.g., as $\alpha$ decreases to 0), the model bias will decrease, the model variance will increase, the training loss/error will decrease, and the validation loss/error will first decrease but then increase. The optimal $\alpha^*$, marked in the figure, is where the validation loss takes its lowest value
for a complex $F$, while $\Pi_{\text{train}}$ is low. As the model complexity decreases (say, by increasing the regularization parameter $\alpha$), $\Pi_{\text{val}}$ decreases while $\Pi_{\text{train}}$ increases. However, there is typically a sweet spot beyond which both $\Pi_{\text{val}}$ and $\Pi_{\text{train}}$ increase with further simplification of the model. This situation is known as underfitting and can occur if $\alpha$ is chosen to be too large. Thus, a careful choice of $\alpha$ (marked as $\alpha^*$ in the figure) becomes important to ensure the network performs well on the training set, while generalizing well to unseen data.
2.6 Gradient Descent

Recall that we wish to solve the minimization problem $\theta^* = \arg\min \Pi(\theta)$ in the training phase. This minimization problem can be solved using an iterative optimization algorithm called gradient descent (GD), also known as steepest descent. Assuming sufficient smoothness of the loss function with respect to $\theta$, consider the truncated Taylor expansion about $\theta_0$:

$$\Pi(\theta_0 + \Delta\theta) = \Pi(\theta_0) + \frac{\partial \Pi}{\partial \theta}(\theta_0) \cdot \Delta\theta + \frac{1}{2} \frac{\partial^2 \Pi}{\partial \theta_i \partial \theta_j}(\hat{\theta}) \, \Delta\theta_i \Delta\theta_j$$

for some $\hat{\theta}$ in a small neighbourhood of $\theta_0$. When $\|\Delta\theta\|$ is small and assuming $\frac{\partial^2 \Pi}{\partial \theta_i \partial \theta_j}$ is bounded, we can neglect the second-order term and consider the approximation

$$\Pi(\theta_0 + \Delta\theta) \approx \Pi(\theta_0) + \frac{\partial \Pi}{\partial \theta}(\theta_0) \cdot \Delta\theta. \qquad (2.8)$$
In order to lower the value of the loss function as much as possible compared to its evaluation at $\theta_0$, i.e., to minimize $\Delta\Pi = \Pi(\theta_0 + \Delta\theta) - \Pi(\theta_0)$, we choose the step $\Delta\theta$ in the direction opposite to the gradient, i.e.,

$$\Delta\theta = -\eta \frac{\partial \Pi}{\partial \theta}(\theta_0) \qquad (2.9)$$

with the step-size $\eta \ge 0$, also known as the learning rate. This is yet another hyper-parameter that we need to tune during the validation phase. Note that by using (2.9) in (2.8), we have

$$\Pi(\theta_0 + \Delta\theta) \approx \Pi(\theta_0) - \eta \left\| \frac{\partial \Pi}{\partial \theta}(\theta_0) \right\|_2^2 \le \Pi(\theta_0).$$
This leads to the GD algorithm: starting from an initial guess $\theta_0$ and $k = 0$, repeat until convergence:
(a) Evaluate $\frac{\partial \Pi}{\partial \theta}(\theta_k)$.
(b) Update $\theta_{k+1} = \theta_k - \eta \frac{\partial \Pi}{\partial \theta}(\theta_k)$.
(c) Increment $k \leftarrow k + 1$.
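A minimal sketch of these steps on a toy quadratic loss, using PyTorch's autograd for step (a); the loss and learning rate are our own illustrative choices.

import torch

# Toy loss: Pi(theta) = (theta_1 - 1)^2 + 10 (theta_2 + 2)^2
def loss(theta):
    return (theta[0] - 1.0) ** 2 + 10.0 * (theta[1] + 2.0) ** 2

theta = torch.zeros(2, requires_grad=True)
eta = 0.05  # learning rate
for k in range(200):
    Pi = loss(theta)
    Pi.backward()                  # step (a): evaluate the gradient
    with torch.no_grad():
        theta -= eta * theta.grad  # step (b): gradient-descent update
    theta.grad.zero_()             # reset the gradient for the next iteration
print(theta)  # approaches the minimizer (1, -2)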
2.6.1 Convergence
Assume that $\Pi(\theta)$ is convex and differentiable, and that its gradient is Lipschitz continuous with Lipschitz constant $K$. Then for $\eta \le 1/K$, the GD updates converge at the rate

$$\| \theta^* - \theta_k \|_2 \le \frac{C}{k}.$$
However, in most scenarios $\Pi(\theta)$ is not convex. If there is more than one minimum, which kind does GD tend to pick? To answer this, consider the loss function for a scalar $\theta$ shown in Fig. 2.7, which has two valleys. Let us assume that the profile of $\Pi(\theta)$ in each valley can be locally approximated by a parabola

$$\Pi(\theta) \approx \frac{1}{2} a (\theta - \theta^*)^2,$$

where $a > 0$ is the curvature and $\theta^*$ is the local minimum. Clearly, each valley has a different value of $a$ and $\theta^*$. Note that the curvature of the left valley is much smaller than that of the right valley. Let us pick a constant learning rate $\eta$ and a starting value $\theta_0$ in either of the valleys. Then

$$\frac{\partial \Pi}{\partial \theta}(\theta_0) = a (\theta_0 - \theta^*)$$

and the new point after a GD update will be

$$\theta_1 = \theta_0 - \eta a (\theta_0 - \theta^*), \quad \text{i.e.,} \quad \theta_1 - \theta^* = (1 - \eta a)(\theta_0 - \theta^*).$$
The iterates $\theta_k$ converge to $\theta^*$ if $|1 - \eta a| < 1$. Since $a > 0$ in the valleys, we need the following condition on the learning rate:

$$-1 < 1 - \eta a \implies \eta a < 2.$$

If we fix $\eta$, then for convergence we need the local curvature to satisfy $a < 2/\eta$. In other words, GD prefers to converge to a minimum with flat/small curvature, i.e., it will prefer the minimum in the left valley. If the starting point is in the right valley,
there is a chance that we will keep overshooting the right minimum and bounce off the opposite wall until the GD algorithm slingshots $\theta_k$ out of the valley (depicted in Fig. 2.7). After this, it will enter the left valley with the smaller curvature and gradually move towards its minimum.

While it is clear that GD prefers flat minima, it is less clear why flat minima are better. There is empirical evidence that the parameter values obtained at flat minima tend to generalize better, and are therefore to be preferred [47].
2.7 Some Advanced Optimization Algorithms

Many popular optimization algorithms used in deep learning can be written as an update of the generic form

$$[\theta_{k+1}]_i = [\theta_k]_i - [\eta_k]_i [g_k]_i, \qquad 1 \le i \le N_\theta, \qquad (2.10)$$

where we have made use of the notation that for any vector $a$, $[a]_i$ denotes its $i$-th component. In the equation above, $[\eta_k]_i$ is the component-wise learning rate and the vector-valued function $g$ depends on/approximates the gradient. Note that the learning rate is allowed to depend on the iteration number $k$.

The GD method can be represented using (2.10) by setting

$$[\eta_k]_i = \eta, \qquad g_k = \frac{\partial \Pi}{\partial \theta}(\theta_k).$$
An issue with the GD method is that convergence to the minimum can be quite slow if $\eta$ is not suitably chosen. For instance, consider the objective function landscape shown in Fig. 2.8, which has sharper gradients along the $[\theta]_2$ direction compared to the $[\theta]_1$ direction. If we start from a point such as the one shown in the figure with a red cross-mark, and $\eta$ is too large (but still within the stable bounds), the updates will keep zig-zagging their way towards the minimum. Ideally, for the particular situation shown in Fig. 2.8, we would like the steps to take longer strides along the $[\theta]_1$ direction compared to the $[\theta]_2$ direction, thus reaching the minimum faster.
Let us look at two popular methods that are able to overcome some of the issues faced by GD.

2.7.1 Momentum Methods

Momentum methods make use of the history of the gradients, instead of just the gradient at the current step. The update is given by (2.10) with

$$[\eta_k]_i = \eta, \qquad g_k = \beta_1 g_{k-1} + (1 - \beta_1) \frac{\partial \Pi}{\partial \theta}(\theta_k), \qquad g_{-1} = 0,$$

where $0 \le \beta_1 < 1$ controls how quickly the influence of past gradients decays.
2.7.2 Adam
The Adam optimization algorithm (short for "adaptive moment estimation") was introduced by Kingma and Ba [50], and makes use of the history of the gradient as well as its second moment (which is a measure of the magnitude of the gradient). For an initial learning rate $\eta$, the updates are once again given by (2.10), where

$$g_k = \beta_1 g_{k-1} + (1 - \beta_1) \frac{\partial \Pi}{\partial \theta}(\theta_k), \qquad [G_k]_i = \beta_2 [G_{k-1}]_i + (1 - \beta_2) \left( \frac{\partial \Pi}{\partial \theta_i}(\theta_k) \right)^2, \qquad [\eta_k]_i = \frac{\eta}{\sqrt{[G_k]_i} + \epsilon}. \qquad (2.11)$$
In the equations above, $g_k$ and $G_k$ are the weighted running averages of the gradients and of the squares of the gradients, respectively. The recommended values for the hyper-parameters are $\beta_1 = 0.9$, $\beta_2 = 0.999$ and $\epsilon = 10^{-8}$. Note that the learning rate for each component is different: the larger the magnitude of the gradient for a component, the smaller its learning rate. Referring back to the example in Fig. 2.8, this would mean a smaller learning rate for $[\theta]_2$ than for $[\theta]_1$, which helps alleviate the zig-zag path of the optimization algorithm.
Remark 2.7.1 The Adam algorithm also has additional correction steps for $g_k$ and $G_k$ to improve the efficiency of the algorithm. See [50] for details.
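In practice one rarely implements these updates by hand: PyTorch provides them directly. A minimal sketch, assuming a model net, a loss function loss_fn and data tensors x, y already exist:

import torch

# Adam with the recommended hyper-parameters (these are also PyTorch's defaults)
opt = torch.optim.Adam(net.parameters(), lr=1e-3,
                       betas=(0.9, 0.999), eps=1e-8)

for k in range(1000):
    opt.zero_grad()
    loss = loss_fn(net(x), y)
    loss.backward()   # compute the gradient of the loss
    opt.step()        # apply the update (2.10)-(2.11)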
2.7.3 Stochastic Optimization

Recall that the training loss is a sum over the training samples,

$$\Pi(\theta) = \frac{1}{N_{\text{train}}} \sum_{i=1}^{N_{\text{train}}} \Pi_i(\theta), \qquad \Pi_i(\theta) = \| y_i - F(x_i; \theta, \Theta) \|^2,$$

so that its gradient is the corresponding sum of per-sample gradients,

$$\frac{\partial \Pi}{\partial \theta}(\theta) = \frac{1}{N_{\text{train}}} \sum_{i=1}^{N_{\text{train}}} \frac{\partial \Pi_i}{\partial \theta}(\theta).$$
However, taking the sum of the gradients can be very expensive since $N_{\text{train}}$ is typically very large, $N_{\text{train}} \sim 10^6$. One easy way to circumvent this problem is to use the following update formula (shown here for the GD method):

$$\theta_{k+1} = \theta_k - \eta_k \frac{\partial \Pi_i}{\partial \theta}(\theta_k), \qquad (2.12)$$

where $i$ is randomly chosen at each update step $k$. This is known as stochastic gradient descent (SGD). Remarkably, this modified algorithm does converge, assuming that each $\Pi_i(\theta)$ is convex and differentiable and that $\eta_k \sim 1/\sqrt{k}$ [72].
To illustrate why $\eta_k$ needs to decay, consider the toy functions for $\theta \in \mathbb{R}^2$:

$$\Pi_1(\theta) = ([\theta]_1 - 1)^2 + ([\theta]_2 - 1)^2, \qquad \Pi_2(\theta) = ([\theta]_1 + 1)^2 + 0.5\,([\theta]_2 - 1)^2,$$
$$\Pi_3(\theta) = 0.7\,([\theta]_1 + 1)^2 + 0.5\,([\theta]_2 + 1)^2, \qquad \Pi_4(\theta) = 0.7\,([\theta]_1 - 1)^2 + \tfrac{1}{2}([\theta]_2 + 1)^2, \qquad (2.13)$$
$$\Pi(\theta) = \tfrac{1}{4}\left( \Pi_1(\theta) + \Pi_2(\theta) + \Pi_3(\theta) + \Pi_4(\theta) \right).$$
The contour plots of these functions are shown in Fig. 2.9a, where the black contours correspond to $\Pi(\theta)$. Note that $\theta^* = (0, 0)$ is the unique minimum of $\Pi(\theta)$. We solve this problem with the SGD algorithm, using both a constant learning rate $\eta_k = 0.4$ and a decaying learning rate $\eta_k = 0.4/\sqrt{k}$. Starting with $\theta_0 = (-1.0, 2.0)$ and randomly selecting $i \in \{1, 2, 3, 4\}$ at each step $k$, we run the algorithm for 10,000 iterations. The first 10 steps with each learning rate are plotted in Fig. 2.9a. We can clearly see that without any decay in the learning rate, the SGD algorithm keeps overshooting the minimum. In fact, this behaviour continues for all future iterations, as can be seen in Fig. 2.9b, where the norm of the iterates does not decay (we expect it to decay to $|\theta^*| = 0$). On the other hand, we quickly move closer to $\theta^*$ if the learning rate decays as $1/\sqrt{k}$.
The reason for reducing the step size as we approach the minimum is that far away from the minimum of $\Pi$, the gradient of $\Pi$ and the gradients of all the individual $\Pi_i$'s align quite well. However, close to the minimum of $\Pi$ this is no longer the case, and one is therefore required to take smaller steps so as not to be thrown off to a region far away from the minimum.
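A sketch reproducing this experiment for the toy losses in (2.13), assuming NumPy; it compares the constant and decaying learning rates over 10,000 iterations.

import numpy as np

# Gradients of the four component losses in (2.13)
grads = [
    lambda t: np.array([2.0 * (t[0] - 1), 2.0 * (t[1] - 1)]),
    lambda t: np.array([2.0 * (t[0] + 1), 1.0 * (t[1] - 1)]),
    lambda t: np.array([1.4 * (t[0] + 1), 1.0 * (t[1] + 1)]),
    lambda t: np.array([1.4 * (t[0] - 1), 1.0 * (t[1] + 1)]),
]

rng = np.random.default_rng(0)
for decay in [False, True]:
    theta = np.array([-1.0, 2.0])
    for k in range(1, 10001):
        eta = 0.4 / np.sqrt(k) if decay else 0.4
        i = rng.integers(4)                    # pick a random component loss
        theta = theta - eta * grads[i](theta)  # SGD update (2.12)
    print(decay, np.linalg.norm(theta))  # much smaller with the decaying rate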
In practice, the single-sample stochastic algorithm described above is not used as-is, for the following reasons:
1. Although the loss function decays with the number of iterations, it fluctuates in a chaotic manner close to the minimum and never manages to reach it.
2. While handling all samples at once can be computationally expensive, handling a single sample at a time severely under-utilizes the computational and memory resources.
Fig. 2.9 SGD algorithm with and without a decay in the learning rate
Note that taking $N_{\text{batch}} = 1$ (a single batch containing all the samples) leads back to the original optimization algorithms, while taking $N_{\text{batch}} = N_{\text{train}}$ (one sample per batch) gives the stochastic gradient descent algorithm. One typically chooses a batch size that maximizes the amount of data that can be loaded into the RAM at one time. We define an epoch as one full pass through all samples (or mini-batches) of the full training set. The following describes the mini-batch stochastic optimization algorithm:
1. For epoch = 1, …, J:
   (a) Randomly shuffle the full training set.
   (b) Create the $N_{\text{batch}}$ mini-batches.
   (c) For $i = 1, \ldots, N_{\text{batch}}$:
       i. Evaluate the batch gradient using (2.14).
       ii. Update $\theta$ using this gradient and your favorite optimization algorithm (gradient descent, momentum, or Adam).
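A sketch of this loop in PyTorch, assuming a model net and training tensors x_train, y_train already exist; DataLoader handles the shuffling and batching.

import torch
from torch.utils.data import TensorDataset, DataLoader

dataset = TensorDataset(x_train, y_train)
loader = DataLoader(dataset, batch_size=32, shuffle=True)  # reshuffled each epoch
opt = torch.optim.Adam(net.parameters(), lr=1e-3)

J = 100  # number of epochs
for epoch in range(J):
    for xb, yb in loader:                     # one mini-batch per iteration
        opt.zero_grad()
        loss = torch.mean((net(xb) - yb) ** 2)
        loss.backward()                       # mini-batch gradient, cf. (2.14)
        opt.step()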
Remark 2.7.2 There is an interesting study [113] suggesting that stochastic gradient descent might actually help in selecting minima that generalize better. In that study, the authors prove that SGD prefers minima whose curvature is more homogeneous, that is, minima where the distribution of the curvatures of the individual components of the loss function is sharp and centered about a small value. This is in contrast to minima where the overall curvature might be small, but the distribution of the curvatures of the individual components is more spread out. They then show (empirically) that the more homogeneous minima tend to generalize better than their heterogeneous counterparts.
2.8 Calculating Gradients Using Back-Propagation

The final piece of the training algorithm that we need to understand is how the gradients are actually evaluated while training the network. Recall that the output of layer $l$ is obtained through the affine transformation

$$\xi^{(l)} = W^{(l)} x^{(l-1)} + b^{(l)}, \qquad (2.15)$$

followed by the component-wise activation

$$x^{(l)} = \sigma(\xi^{(l)}). \qquad (2.16)$$

Given a training sample $(x, y)$, set $x^{(0)} = x$. The value of the loss/objective function (for this particular sample) can be evaluated using the forward pass:
1. For $l = 1, \ldots, L + 1$:
   (a) Evaluate $\xi^{(l)}$ using (2.15).
   (b) Evaluate $x^{(l)}$ using (2.16).
2. Evaluate the loss function for the given sample, $\Pi = \| y - x^{(L+1)} \|^2$.
Fig. 2.10 Computational graph for computing the loss function and its derivatives with respect to
hidden/latent vectors
We would, of course, need to repeat this step for all samples in the training set (or a mini-batch for stochastic optimization). For simplicity, we restrict the discussion to the evaluation of the loss and its gradient for a single sample.

In order to update the network parameters, we need $\frac{\partial \Pi}{\partial \theta}$, or more precisely,

$$\frac{\partial \Pi}{\partial W^{(l)}}, \qquad \frac{\partial \Pi}{\partial b^{(l)}},$$

for $1 \le l \le L + 1$. We will derive expressions for these derivatives by first deriving expressions for $\frac{\partial \Pi}{\partial \xi^{(l)}}$ and $\frac{\partial \Pi}{\partial x^{(l)}}$.
From the computational graph it is easy to see how each hidden variable in the network is transformed into the next. Recognizing this, and applying the chain rule repeatedly, yields

$$\frac{\partial \Pi}{\partial x^{(L+1)}} = -2 ( y - x^{(L+1)} ), \qquad (2.18)$$

$$\frac{\partial \xi^{(l+1)}}{\partial x^{(l)}} = W^{(l+1)}, \qquad (2.19)$$

$$\frac{\partial x^{(l)}}{\partial \xi^{(l)}} = S^{(l)} \equiv \mathrm{diag}[\sigma'(\xi_1^{(l)}), \ldots, \sigma'(\xi_{H_l}^{(l)})], \qquad (2.20)$$

where the last two relations hold for any network layer $l$, $H_l$ is the width of that particular layer, and $\sigma'$ denotes the derivative of the activation with respect to its argument. Using these relations in (2.17), we arrive at

$$\frac{\partial \Pi}{\partial \xi^{(l)}} = \frac{\partial \Pi}{\partial x^{(L+1)}} \cdot S^{(L+1)} \cdot W^{(L+1)} \cdots S^{(l+1)} \cdot W^{(l+1)} \cdot S^{(l)}. \qquad (2.21)$$

Written in terms of transposes, this is

$$\frac{\partial \Pi}{\partial \xi^{(l)}} = S^{(l)} W^{(l+1)T} S^{(l+1)} \cdots W^{(L+1)T} S^{(L+1)} \left[ -2 ( y - x^{(L+1)} ) \right]. \qquad (2.22)$$
Finally, the gradient with respect to the weights of layer $l$ follows from the chain rule:

$$\frac{\partial \Pi}{\partial W^{(l)}} = \frac{\partial \Pi}{\partial \xi^{(l)}} \cdot \frac{\partial \xi^{(l)}}{\partial W^{(l)}} = \frac{\partial \Pi}{\partial \xi^{(l)}} \otimes x^{(l-1)}, \qquad (2.23)$$

where $\otimes$ denotes the outer product. The cost of computing these derivatives can be determined by
estimating the cost of the upper branch of the computational graph, plus the cost of computing the outer product. The former is dominated by the matrix-vector product at each layer with the $W^{(l)T}$, and scales as $O(H^2 L)$. The cost of computing the outer product scales with the number of entries in each matrix times the number of matrices, and is therefore also $O(H^2 L)$. Therefore, the cost of computing the derivative of the loss function scales as $O(2 H^2 L)$, which is the same order as the cost of computing the loss function itself. The fact that these costs scale in the same way is critical, making the training of an MLP feasible with reasonable computational resources.
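The back-propagation formulas above can be checked against automatic differentiation; a minimal sketch for a one-hidden-layer tanh network with a linear output layer (so that the output-layer factor in (2.22) reduces to the identity), comparing the hand-coded gradient (2.22)-(2.23) against torch.autograd:

import torch

torch.manual_seed(0)
H, d = 4, 3
W1 = torch.randn(H, d, requires_grad=True)
b1 = torch.randn(H, requires_grad=True)
W2 = torch.randn(1, H, requires_grad=True)
b2 = torch.randn(1, requires_grad=True)
x, y = torch.randn(d), torch.randn(1)

# Forward pass, (2.15)-(2.16)
xi1 = W1 @ x + b1
x1 = torch.tanh(xi1)
out = W2 @ x1 + b2
loss = torch.sum((y - out) ** 2)
loss.backward()

# Hand-coded backward pass: S^(1) = diag(sigma'(xi1)) = diag(1 - tanh^2(xi1))
dPi_dout = -2 * (y - out)
dPi_dxi1 = (1 - torch.tanh(xi1) ** 2) * (W2.T @ dPi_dout)  # cf. (2.22)
print(torch.allclose(W1.grad, torch.outer(dPi_dxi1, x)))   # cf. (2.23): True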
Question 2.8.1 Can you derive a similar set of expressions, and the corresponding algorithm, to evaluate $\frac{\partial \Pi}{\partial b^{(l)}}$?

Question 2.8.2 Can you derive an explicit expression for $\frac{\partial x^{(L+1)}}{\partial x^{(0)}}$, that is, an expression for the derivative of the output of the network with respect to its input? This is a very useful quantity that finds use in algorithms like physics-informed neural networks (see Chap. 5) and Wasserstein generative adversarial networks (see Chap. 7).
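In PyTorch, the input derivative asked for in Question 2.8.2 is available without any hand derivation; a sketch with a toy scalar-output network of our own choosing:

import torch
from torch import nn

net = nn.Sequential(nn.Linear(3, 16), nn.Tanh(), nn.Linear(16, 1))  # toy network
x = torch.randn(3, requires_grad=True)
dout_dx, = torch.autograd.grad(net(x).squeeze(), x)
print(dout_dx)  # derivative of the network output with respect to its input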
2.9 Regression Versus Classification

The networks discussed so far solve regression problems, where the target output is continuous and a natural training loss is the mean-squared error (MSE)

$$\Pi(\theta) = \frac{1}{N_{\text{train}}} \sum_{i=1}^{N_{\text{train}}} \| y_i - F(x_i; \theta, \Theta) \|^2.$$
Alternatively, one can use the mean absolute error (MAE)

$$\Pi(\theta) = \frac{1}{N_{\text{train}}} \sum_{i=1}^{N_{\text{train}}} \| y_i - F(x_i; \theta, \Theta) \|.$$
Neural networks with the above losses can be used to solve various regression
problems where the underlying function is highly nonlinear and the inputs/outputs
are multi-dimensional.
Example 2.9.1 Given house/apartment features such as the zip code, the number of bedrooms/bathrooms, the carpet area, the age of construction, etc., predict outcomes such as the market selling price, or the number of days on the market.
Now let us consider some examples of classification problems, where the output
of the network typically lies in a discrete finite set.
Example 2.9.2 Given the symptoms and blood markers of patients with COVID-19, predict whether they will need to be admitted to the ICU. The input and output for this problem would be

$$x = \text{symptoms and blood markers}, \qquad y = [p_1, p_2],$$

where $p_1$ is the probability of being admitted to the ICU, while $p_2$ is the probability of not being admitted. Note that $0 \le p_1, p_2 \le 1$ and $p_1 + p_2 = 1$.
Example 2.9.3 Given a set of images of animals, predict whether the animal is a dog, a cat or a bird. In this case, the input and output would be

$$x = \text{the image}, \qquad y = [p_1, p_2, p_3].$$
To solve such classification problems, the regression network is modified in the following ways:
1. The final layer output is passed through a softmax function,

$$x_i^{(L+1)} = \frac{\exp(\xi_i^{(L+1)})}{\sum_{j=1}^{C} \exp(\xi_j^{(L+1)})},$$

where $C$ is the number of classes (and also the output dimension). It is easily verified that with this transformation the components of $x^{(L+1)}$ form a convex combination, i.e., $x_i^{(L+1)} \in [0, 1]$ and $\sum_{i=1}^{C} x_i^{(L+1)} = 1$.
2. The output labels for the various samples need to be one-hot encoded. In other words, for the sample $(x, y)$, the output label $y$ should have dimension $D = C$, with the component corresponding to the class of $x$ equal to 1, and all other components equal to 0. For instance, in Example 2.9.3,

$$y = \begin{cases} [1, 0, 0]^T & \text{if } x \text{ is a dog}, \\ [0, 1, 0]^T & \text{if } x \text{ is a cat}, \\ [0, 0, 1]^T & \text{if } x \text{ is a bird}. \end{cases}$$
3. Although the MSE or MAE can still be used as the loss function, it is preferable to use the cross-entropy loss function

$$\Pi(\theta) = \frac{1}{N_{\text{train}}} \sum_{i=1}^{N_{\text{train}}} \sum_{c=1}^{C} -y_c^i \log( F_c(x_i; \theta) ), \qquad (2.24)$$

where $y_c^i$ is the $c$-th component of the true label for the $i$-th sample. The loss function in (2.24) treats $y_c$ and $F_c$ as probability distributions and measures the discrepancy between the two. It can be shown to be related to the Kullback-Leibler divergence between the two distributions. Compared to the MSE, this loss function severely penalizes strongly confident incorrect predictions. This is demonstrated in Example 2.9.4.
Note that both losses penalize large values of $p$. Also, when $p = 0$, both losses are zero. However, as $p \to 1$ (which would lead to the wrong prediction), the MSE loss tends to 2, while the cross-entropy loss tends to $\infty$. That is, the cross-entropy loss strongly penalizes incorrect predictions made with high confidence.
Fig. 2.11 Comparing the MSE and cross-entropy losses as a function of the class probability $p$ for a binary classification problem
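A sketch of this classification setup in PyTorch for the three-class problem of Example 2.9.3; note that nn.CrossEntropyLoss applies the softmax internally, so the network supplies raw scores (logits) and the labels are given as class indices rather than one-hot vectors.

import torch
from torch import nn

logits = torch.randn(5, 3)              # raw network outputs for 5 samples, C = 3
labels = torch.tensor([0, 2, 1, 0, 2])  # class indices (dog = 0, cat = 1, bird = 2)

loss_fn = nn.CrossEntropyLoss()         # softmax + cross-entropy, cf. (2.24)
print(loss_fn(logits, labels))

# Equivalently, via an explicit softmax and one-hot labels
probs = torch.softmax(logits, dim=1)
one_hot = nn.functional.one_hot(labels, num_classes=3).float()
print(-(one_hot * torch.log(probs)).sum(dim=1).mean())  # same value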
2.10 Computational Exercise

The scope of this numerical exercise is to understand how the expressive power of a network varies with depth and width. We begin by loading some necessary Python libraries:
# Numpy library
import numpy as np
# Plotting library
import matplotlib.pyplot as plt
# PyTorch libraries
import torch
from torch import nn
Next, we define the structure of the network. The class called MLP below inherits from torch.nn.Module. The __init__ method takes the following arguments: the input and output dimensions (input_dim, out_dim), the hidden-layer width and the number of hidden layers (width, depth), and the type of activation function (activation).
The following PyTorch commands are used to define the various layers:
• torch.nn.Linear(N,M): Used to define a dense layer with M neurons that receives an N-dimensional input.
• torch.nn.ModuleList(): Holds PyTorch submodules in a list.
• We have shown a few ways to define activation functions.
• To initialize the weights with a uniform distribution you can use torch.nn.
init.uniform_() and the class function torch.nn.Module.apply().
The second crucial method required in the MLP class is forward, which takes as argument the network input and defines the forward pass of the network. Note that the activation function is applied only to the output of the hidden layers.
class MLP(nn.Module):
    def __init__(self, input_dim=1, out_dim=1, width=10,
                 depth=5, activation='tanh'):
        super(MLP, self).__init__()
        self.depth = depth
        # One way to select the activation function
        self.act = torch.tanh if activation == 'tanh' else torch.relu
        # Input layer followed by depth-1 hidden layers
        MLP_list = nn.ModuleList()
        MLP_list.append(nn.Linear(input_dim, width))
        for _ in range(self.depth - 1):
            MLP_list.append(nn.Linear(width, width))
        # Output layer
        MLP_list.append(nn.Linear(width, out_dim))
        self.model = MLP_list
        # Weights initialization
        def init_weights(layer):
            if isinstance(layer, nn.Linear):
                nn.init.uniform_(layer.weight, -1, 1)
                if layer.bias is not None:
                    nn.init.uniform_(layer.bias, -1, 1)
        self.model.apply(init_weights)

    def forward(self, x):
        # Activation applied only to the hidden-layer outputs
        for layer in self.model[:-1]:
            x = self.act(layer(x))
        return self.model[-1](x)
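The target function used below is not defined in the listing above; for concreteness, we assume the sinusoidal target from Sect. 2.3 (this definition is our assumption):

# Assumed target function (its original definition is not shown)
def target_func(x):
    return torch.sin(2 * np.pi * x)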
np.random.seed(1)
Ntrain = 100
Nval = 50
N = Ntrain + Nval
x = torch.linspace(0, 1, N).reshape(N,1)
ind_all = np.arange(0,N)
np.random.shuffle(ind_all)
ind_train = np.sort(ind_all[0:Ntrain])
ind_val = np.sort(ind_all[Ntrain:])
x_train = x[ind_train]
x_val = x[ind_val]
y_train = target_func(x_train)
y_val = target_func(x_val)
fig, ax = plt.subplots(figsize=(15,5))
ax.plot(x_train,y_train,’-o’, label=’Training points’)
ax.plot(x_val,y_val,’-x’, label=’Validation points’)
plt.legend()
Next, we create a custom dataset class (inheriting from torch.utils.data.Dataset) with three methods:
1. __init__, which receives paired data samples and stores them as class objects.
2. __len__, which returns the number of samples in the dataset.
3. __getitem__, which receives an index (or a list of indices) and returns the sample pairs corresponding to those indices.
We then use this dataset class to create a dataset object for the training data only, and use torch.utils.data.DataLoader to create a wrapper that allows us to iterate through random batches of the dataset. We use a batch size of 50 with shuffling. How many mini-batches would this lead to?
# Create custom dataset class
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, samples):
        """
        Initialize the CustomDataset with paired samples.
        Args:
            samples (list of tuples): A list of (x, y) pairs
                representing the dataset samples.
        """
        self.samples = torch.Tensor(samples).to(torch.float32)

    def __len__(self):
        """
        Returns the length of the dataset, i.e., the number of
        samples.
        """
        return len(self.samples)

    def __getitem__(self, idx):
        """
        Returns the samples at the specified indices.
        Args:
            idx (int or list): Index (or list of indices) to
                retrieve samples for.
        Returns:
            The (x, y) pair(s) corresponding to the specified
            indices.
        """
        selected_samples = self.samples[idx]
        return selected_samples

training_samples = np.concatenate((x_train.reshape(-1, 1),
                                   y_train.reshape(-1, 1)), axis=1)
train_dataset = CustomDataset(samples=training_samples)
train_loader = DataLoader(train_dataset, batch_size=50, shuffle=True)
Chapter 3
Residual Neural Networks

Residual networks (or ResNets) were introduced by He et al. [40] in 2015. In this chapter, we will discuss what these networks are, why they were introduced, and their relation to ODEs.
the contribution of the first $\bar{l}$ layers of the network will be negligible, as the influence of their weights on the loss function is small. Because of this depth cut-off, the benefit in terms of expressivity of deep networks is lost.
So why does this happen? Recall from Sect. 2.8 that

$$\frac{\partial \Pi}{\partial W^{(l)}} = \frac{\partial \Pi}{\partial \xi^{(l)}} \otimes x^{(l-1)}$$

and

$$\frac{\partial \Pi}{\partial \xi^{(l)}} = S^{(l)} \prod_{m=l+1}^{L+1} \left( W^{(m)T} S^{(m)} \right) \frac{\partial \Pi}{\partial x^{(L+1)}}. \qquad (3.1)$$
∂ x (L+1)
For any
|| || matrix, . A, let .τ ( A) denote the largest singular value. Then we can bound
|| ∂∏ ||
.|| || by
∂ξ (l)
|| || || ||
|| ∂∏ || ∏
L+1
|| ||
|| || ≤ τ (S(l) ) (m) (m) || ∂∏ ||
.
|| ∂ξ (l) || (τ (W )τ (S )) || ∂ x (L+1) || . (3.2)
m=l+1
where each term in the product is a scalar less than 1. As the number of terms increases, that is, when $L - l \gg 1$, this product can, and does, become very small. This typically happens when $L - l \approx 20$, in which case $\left\| \frac{\partial \Pi}{\partial \xi^{(l)}} \right\|$, and therefore $\left\| \frac{\partial \Pi}{\partial W^{(l)}} \right\|$, become very small. This issue is called the problem of vanishing gradients. It manifests itself in deep networks, where the weights in the inner layers (say $L - l > 20$) do not contribute to the network.
In [40], the authors demonstrate that using a deeper network can actually lead to an increase in training and validation error. To demonstrate this, we train MLPs of varying depth to approximate the function

$$u(x) = \sin\left( 2\pi (x + 1)^3 \right) \cos(2\pi x), \qquad x \in [0, 1]. \qquad (3.4)$$
As can be seen in the first column of Figs. 3.1 and 3.2, the training and test (MSE) loss curves shift upwards as the depth of the MLP is increased. In terms of predictions, only depth 10 leads to a good approximation of the function. With depth 20, the approximation towards the right of the domain (where high-frequency modes dominate) is poor. If we increase the depth to 40, the MLP seems to learn a constant function. Thus, beyond a certain point, increasing the depth of a network can be counterproductive. Based on our previous discussion on vanishing gradients, we know why this is the case. Given this, we would like to come up with a network architecture that addresses the problem of vanishing gradients by ensuring $\left\| \frac{\partial \Pi}{\partial x^{(L+1)}} \right\| \approx \left\| \frac{\partial \Pi}{\partial \xi^{(1)}} \right\|$. This means requiring that when the weights of the network approach small values, the network should approach the identity mapping, and not the null mapping. This is the core idea behind the ResNet architecture.
3.2 ResNets
Consider an MLP with depth 6 (as shown in Fig. 3.3) with a fixed width $H$ for each hidden layer. We add skip connections between the hidden layers in the following manner:

$$x_i^{(l)} = \sigma\big( W_{ij}^{(l)} x_j^{(l-1)} + b_i^{(l)} \big) + x_i^{(l-1)}, \qquad 2 \le l \le L. \qquad (3.5)$$
Fig. 3.1 Performance of MLP approximating (3.4) on the training set, without residual connections
(left) and with residual connections (right), as the depth is increased
1. If the weights and biases of the hidden layers vanish, the skip connections reduce the network to (approximately) an identity map between the hidden states, so that

$$\frac{\partial \Pi}{\partial x^{(1)}} = \frac{\partial \Pi}{\partial x^{(5)}},$$

i.e., we will not have the issue of vanishing gradients.
2. The computational graph for the forward and back-propagation of a ResNet is shown in Fig. 3.4. Looking at this graph, it is clear that the expression for $\frac{\partial x^{(l+1)}}{\partial x^{(l)}}$ now involves traversing two branches and adding their contributions. Therefore, we have

$$\frac{\partial \Pi}{\partial \xi^{(l)}} = S^{(l)} \prod_{m=l+1}^{L+1} \left( I + W^{(m)T} S^{(m)} \right) \frac{\partial \Pi}{\partial x^{(L+1)}}. \qquad (3.6)$$
Fig. 3.2 Performance of MLP approximating (3.4) on the test set, without residual connections
(left) and with residual connections (right), as the depth is increased
Expanding the product in (3.6), we can write

$$\frac{\partial \Pi}{\partial \xi^{(l)}} = S^{(l)} \left( I + \sum_{m=l+1}^{L+1} W^{(m)T} S^{(m)} + \text{higher order terms} \right) \frac{\partial \Pi}{\partial x^{(L+1)}}. \qquad (3.7)$$
In the expression above, even if the individual matrices have small entries, their sum need not approach the zero matrix. This implies that the gradients near the input and output layers can remain comparable in magnitude, while still requiring the weights to be small (via regularization).
We empirically demonstrate the benefit of adding residual skip connections by
training MLPs to approximate (3.4) with varying depth, but now including skip
connections between hidden layers. The results shown in the second column of
Figs. 3.1 and 3.2 clearly show an improvement in the training/test losses, as well as
the predictions, for all depths considered. The improvement is more significant for the deeper networks, with the depth = 40 MLP clearly capturing the target function u(x), as opposed to the constant function learned by the MLP in the absence of skip connections.
Remark 3.2.1 The above analysis can be extended to cases when the hidden layer width H is not fixed, but the analysis is not as clean. See [40] on how this can be done.
3.3 Connections with ODEs

Let us first consider the special case of a ResNet with d = D = H. Recall the relation (3.5), which we can rewrite as

(x^(l) − x^(l−1))/Δt = (1/Δt) σ(W^(l) x^(l−1) + b^(l)) = (1/Δt) σ(ξ^(l))    (3.8)

for some scalar Δt, where we note that ξ^(l) is a function of x^(l−1) parameterized by θ^(l) = [W^(l), b^(l)]. Thus, we can further rewrite (3.8) as

(x^(l) − x^(l−1))/Δt = V(x^(l−1); θ^(l)).    (3.9)
Compare this with a system of ODEs,

ẋ ≡ dx/dt = V(x, t),    (3.10)
where we want to find x(T) given some initial state x(0). In order to solve this numerically, we can uniformly divide the temporal domain with a time-step Δt and temporal nodes t^(l) = lΔt, 0 ≤ l ≤ L + 1, where (L + 1)Δt = T. Define the discrete solution as x^(l) = x(lΔt). Then, given x^(l−1), we can use a time-integrator to approximate the solution x^(l). We can consider a method motivated by the forward Euler integrator, where the LHS of (3.10) is approximated by

LHS ≈ (x^(l) − x^(l−1))/Δt,

and the RHS by V(x^(l−1); θ^(l)), where we are allowing the parameters to be different at each time-step. Putting these two together, we get exactly the relation of the ResNet given in (3.9). In other words, a ResNet is nothing but a discretization of a non-linear system of ODEs. We make
some comments to further strengthen this connection.
• In a fully trained ResNet, we are given x^(0) and the weights of the network, and we predict x^(L+1).
• In a system of ODEs, we are given x(0) and V(x, t), and we predict x(T).
• Training the ResNet means determining the parameters θ of the network so that x^(L+1) is as close as possible to y_i when x^(0) = x_i, for i = 1, . . . , N_train.
• When viewed from the analogous ODE point of view, training means determining the right hand side V(x, t) by requiring x(T) to be as close as possible to y_i when x(0) = x_i, for i = 1, . . . , N_train.
• In a ResNet we are looking for "one" V(x, t) that will map x_i to y_i, for all 1 ≤ i ≤ N_train.
3.4 Neural ODEs

Motivated by the connection between ResNets and ODEs, neural ODEs were proposed in [15], which received the Best Paper Award at NeurIPS 2018. Consider a system of ODEs given by

dx/dt = V(x, t).    (3.11)
Given x(0), we wish to find x(T). In [15], the RHS, i.e., V(x, t), is defined using a feed-forward neural network with parameters θ (see Fig. 3.5). The input to the network is (x, t), while the output is V(x, t) (having the same dimension as x). With this description, the system (3.11) is solved using a suitable time-marching scheme, such as forward Euler, Runge-Kutta, etc.
So how do we use this network to solve a regression problem? Assume that you are given the labelled training data S = {(x_i, y_i) : 1 ≤ i ≤ N_train}. Here both x_i and y_i are assumed to have the same dimension d − 1. The key idea is to think of the x_i as points in the (d − 1)-dimensional space that represent the initial state of the system, and to think of the y_i as points that represent the final state. Then the regression problem becomes finding the RHS of (3.11) that will map the initial points to the final points with a minimal amount of error. In other words, find the parameters θ that minimize
∏(θ) = (1/N) Σ_{i=1}^{N} |x_i(T; θ) − y_i|².
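The following is a minimal PyTorch sketch of a neural ODE trained with a forward Euler time-marcher. The network sizes, function names and the random stand-in data (x_i, y_i) are hypothetical placeholders, not the formulation of [15].

import torch
import torch.nn as nn

class Velocity(nn.Module):
    """Network approximating the RHS V(x, t) of (3.11); input (x, t), output V."""
    def __init__(self, dim=2, width=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, width), nn.Tanh(),
            nn.Linear(width, width), nn.Tanh(),
            nn.Linear(width, dim),
        )

    def forward(self, x, t):
        tt = t * torch.ones_like(x[:, :1])     # broadcast the scalar time
        return self.net(torch.cat([x, tt], dim=1))

def integrate(V, x0, T=1.0, steps=20):
    """Forward-Euler solve of dx/dt = V(x, t) from x(0) = x0 to x(T)."""
    x, dt = x0, T / steps
    for n in range(steps):
        x = x + dt * V(x, n * dt)
    return x

V = Velocity()
opt = torch.optim.Adam(V.parameters(), lr=1e-3)
x_train = torch.randn(128, 2)                  # hypothetical initial states x_i
y_train = torch.randn(128, 2)                  # hypothetical final states y_i
for epoch in range(1000):
    opt.zero_grad()
    # minimize (1/N) sum_i |x_i(T; theta) - y_i|^2, backpropagating through the solver
    loss = ((integrate(V, x_train) - y_train) ** 2).sum(dim=1).mean()
    loss.backward()
    opt.step()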
Let us list the advantages and differences when comparing Neural ODEs to
ResNets:
4 Convolutional Neural Networks

In the previous chapters, we have seen how to construct neural networks using fully-
connected layers. We will now look at a different class of layers, called convolution
layers [36, 59], which are very useful when handling inputs which are images. These
are routinely used in network architectures designed for tasks such as classification
of images into different categories [87, 98], performing semantic segmentation on
images [66, 81, 108], and transforming images from one type to another [91].
Note that the image in (4.1) typically defines a grayscale image, where the value of u at each pixel is just the intensity. If we work with color images, then it would be a three-dimensional tensor, with the third dimension corresponding to the red, blue and green channels. In other words, U ∈ R^{N₁×N₂×3}.
If we want to use a fully-connected neural network (MLP) which takes as input a colored 2D image of size 100 × 100, then the input dimension after unravelling the entire image as a single vector would be 3 × 10⁴, which is very large. This would in turn lead to very large connected layers, which is not computationally feasible. Secondly, when unravelling the image, we lose all spatial context of the initial image. Finally, one would expect local operations, such as edge detection, to be the same in any region of the image. Consider the weights for a fully connected layer. These would be represented by the matrix W_ij, where the i index represents the output of the linear transform and the j index represents the input. If the operation were the same for every output index, we would apply the same operation for every i and therefore not need the matrix. To address all these issues, we can use the convolution operator on functions. Or, more appropriately, the discrete version of the convolution operator on the discrete version of images.
The convolution operator maps functions to functions. Consider the function u(x), x ∈ R^d, and a sufficiently smooth kernel function g(x) which decays as |x| → ∞. Then the convolution operator is given by

ū(x) = ∫_{R^d} g(y − x) u(y) dy.    (4.2)

Consider a point x₀. Then g(y − x₀) shifts the kernel to the location x₀, which will sample the function u in the orange shaded region. Similarly, for another point x₁, g(y − x₁) shifts the kernel to the location x₁, which will sample the function u in the green shaded region. Thus, as the kernel moves, it samples u in different windows. Note that the same operation is applied regardless of the value of x. Let us now consider a few typical kernel functions.
4.2.1 Example 1
Consider a kernel given by a radially symmetric function,

g(ξ) = ρ(|ξ|).

4.2.2 Example 2

Let us consider another example of a kernel, one that would produce the derivative of a smooth version of u. In 2D, we want this to look like (note: in the equation below, ρ is the Gaussian kernel)

∂ū(x)/∂x₁ = ∂/∂x₁ ( ∫_{−∞}^{∞} ∫_{−∞}^{∞} ρ(|y − x|) u(y) dy₁ dy₂ )
          = ∫_{−∞}^{∞} ∫_{−∞}^{∞} (∂ρ(|y − x|)/∂x₁) u(y) dy₁ dy₂    (4.3)
          = ∫_{−∞}^{∞} ∫_{−∞}^{∞} ( −∂ρ(|y − x|)/∂y₁ ) u(y) dy₁ dy₂,

where the term in parentheses in the last line is the required kernel.
This kernel is shown in both 1D and 2D in Fig. 4.3. Note that the action of this kernel looks like a smoothed finite difference operation. That is, the region to the left of the center of the kernel is weighted by a negative value, the region to the right is weighted by a positive value, and the integral sums the contributions from these regions to compute the finite difference.
Fig. 4.3 Derivative kernel. The blue curve denotes the Gaussian kernel and the orange curve denotes
the derivative
4.3 Discrete Convolutions
In this section, we derive an expression for the discrete version of the convolution. To get started, we represent u, ū and the kernel g with their discrete counterparts U, Ū and G. That is,
As in the continuous case, we will assume that G vanishes beyond a certain distance from the origin, that is, G[m, n] = 0 for |m|, |n| > N̄.
To evaluate the discrete convolution in 2D, consider (4.2) and discretize it using a suitable quadrature rule. Then, using the expressions above, we will have

Ū[i, j] = Σ_{m=−∞}^{∞} Σ_{n=−∞}^{∞} G[m − i, n − j] U[m, n],    (4.7)

where we have absorbed the measure h² into the definition of the kernel. Let m' = m − i and n' = n − j. Then

Ū[i, j] = Σ_{m'=−∞}^{∞} Σ_{n'=−∞}^{∞} G[m', n'] U[i + m', j + n'].    (4.8)

Since G vanishes outside the limit of its width, the limits of the sum are reduced by excluding all the pixels over which the convolution will be zero,

Ū[i, j] = Σ_{m'=−N̄}^{N̄} Σ_{n'=−N̄}^{N̄} G[m', n'] U[i + m', j + n'].    (4.9)
• Kernels that lead to the derivative along the x₁-direction and the x₂-direction are given by (see Fig. 4.3b for a derivative along the x₁-direction)

⎡ 0  0  0 ⎤       ⎡ 0  1  0 ⎤
⎢ −1 0  1 ⎥  and  ⎢ 0  0  0 ⎥
⎣ 0  0  0 ⎦       ⎣ 0 −1  0 ⎦

• Similarly, the second derivatives along the x₁- and x₂-directions are given by kernels of the form

⎡ 0  0  0 ⎤       ⎡ 0 −1  0 ⎤
⎢ −1 2 −1 ⎥  and  ⎢ 0  2  0 ⎥
⎣ 0  0  0 ⎦       ⎣ 0 −1  0 ⎦

Remark 4.3.1 Equation (4.9) is precisely how a convolution is applied in deep learning. Thus, the convolution is entirely determined by the kernel values G[m', n'], which become the learnable parameters of the convolution layer, with the number of parameters being (2N̄ + 1)². Further, the kernel of the convolution has a width k = (2N̄ + 1) in each direction.

Remark 4.3.2 We can have different N̄ in different directions. That is, we can have kernels with different widths along each direction. Moreover, the widths are allowed to be even, as opposed to the odd widths considered above.
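As an illustration, a direct NumPy implementation of (4.9) might look as follows. The function name conv2d and the test image are our own, and the kernel is the first-derivative kernel displayed above.

import numpy as np

def conv2d(U, G):
    """Discrete convolution (4.9): Ubar[i,j] = sum_{m',n'} G[m',n'] U[i+m', j+n'].
    G has shape (2*Nbar+1, 2*Nbar+1); boundary pixels are skipped, so the
    output shrinks by Nbar pixels on every side."""
    Nbar = G.shape[0] // 2
    N1, N2 = U.shape
    Ubar = np.zeros((N1 - 2 * Nbar, N2 - 2 * Nbar))
    for i in range(Nbar, N1 - Nbar):
        for j in range(Nbar, N2 - Nbar):
            patch = U[i - Nbar:i + Nbar + 1, j - Nbar:j + Nbar + 1]
            Ubar[i - Nbar, j - Nbar] = np.sum(G * patch)
    return Ubar

# Derivative kernel from the bullet list above (up to the 1/(2h) factor)
G_dx1 = np.array([[0., 0., 0.],
                  [-1., 0., 1.],
                  [0., 0., 0.]])
U = np.random.rand(32, 32)
dU = conv2d(U, G_dx1)          # a (30, 30) array of central differences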
There is a very strong connection between the concept of convolution and the stencil
of a finite difference scheme. This is made clear in the discussion below.
Say some function u(x₁, x₂) is represented on a finite grid, where the grid points are indexed by (i, j) with a grid size h. Using the notation U[i, j] = u(x₁^i, x₂^j) and applying a Taylor series expansion about (i, j), we have

(U[i + 1, j] − U[i − 1, j])/(2h) = (u(x₁^{i+1}, x₂^j) − u(x₁^{i−1}, x₂^j))/(2h) = ∂u(x₁^i, x₂^j)/∂x₁ + O(h²).

Thus we may say that this convolution operation approximates a derivative along the x₁-direction.
Similarly, we can show that

(U[i + 1, j] − 2U[i, j] + U[i − 1, j])/h² = ∂²u(x₁^i, x₂^j)/∂x₁² + O(h²).
4.5 Convolution Layers

We are now ready to discuss the explicit structure and action of convolution layers.
As we proceed, we should keep a few key points in mind:
• Each convolution layer consists of several discrete convolutions, each with its own
kernel.
• The weights of the kernel, which determine its action (smoothing, first derivative,
second derivative etc.), are learnable parameters and are determined when training
the network. Thus the way to think about the learning process is that the network
learns the operations (convolutions) that are appropriate for its task. The task can
be a classification problem, for example.
Let us assume we have an N₁ × N₂ (grayscale) image as an input to a convolution layer comprising multiple convolutions. Each convolution will generate a different image, as shown in Fig. 4.4a. The trainable parameters of this layer are the weights of each convolution kernel. If we assume that the width of the kernels is k = (2N̄ + 1) in each direction, and there are P kernels, then the number of trainable weights will be P × k².
Next, let us consider the size of the output image after applying a single kernel operation. Note that we will not be able to apply the kernel on the boundary pixels, since there are no pixel-values available beyond the image boundary (see Fig. 4.4b). Thus, we will have to skip N̄ pixels at each boundary when applying the kernel,
Fig. 4.5 Max pooling applied to an image over patches of size .(2 × 2)
stride > 1, which will further shrink the size of the output image. For instance, if the stride is taken as S in each direction (with zero-padding applied), then the output image size reduces by a factor of S in each direction (see Fig. 4.4e).
Pooling operations are generally used to reduce the size of an image, allowing one to step through different scales of the image. If applied on an image of size N × N over patches of size S × S, where S is the stride of the pooling operation, the new image will have dimensions (N/S) × (N/S). This is shown in Fig. 4.5 for S = 2. Note that it is typical to select the patch of pixels over which the max or average is computed to be S × S, where S is the stride. This is true for Fig. 4.5b but not for Fig. 4.5a.
Also, we show in Fig. 4.6 how pooling allows us to move through various scales of the image, where the image gets coarser as more pooling operations are applied. Note that pooling operations do not have any trainable parameters. The pooling operation has a strong analog in similar operators that are used when designing multigrid preconditioners for solving linear systems of algebraic equations. A small example of max pooling is given below.
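For instance, in PyTorch a max pooling layer acting on a 4 × 4 image with patch size and stride S = 2 behaves as follows (the tiny test image is ours):

import torch
import torch.nn as nn

x = torch.arange(16.0).reshape(1, 1, 4, 4)    # one 4x4 single-channel image
pool = nn.MaxPool2d(kernel_size=2, stride=2)  # (S x S) patches with S = 2
y = pool(x)                                   # shape (1, 1, 2, 2): (N/S) x (N/S)
print(y)
# tensor([[[[ 5.,  7.],
#           [13., 15.]]]])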
Fig. 4.6 Max pooling applied repeatedly to an image over patches of size .(2 × 2) with stride 2
M₁ × M₂ × 1. The outputs of the convolutions are stacked together to give the final output of the convolution layer. This can be written as

Ū[i, j, p] = Σ_{m=−N̄}^{N̄} Σ_{n=−N̄}^{N̄} Σ_{q=1}^{Q} g_p[m, n, q] U[i + m, j + n, q],  1 ≤ i ≤ M₁, 1 ≤ j ≤ M₂, 1 ≤ p ≤ P,

where g_p is the kernel of the p-th convolution in the layer. Note that the total number of trainable parameters will be P × k² × Q. This is the type of convolutional layer most frequently encountered in a convolutional neural network, which is described in the following section.
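This parameter count is easy to verify in PyTorch. Here Q = 3, P = 16 and k = 3 are illustrative values of ours, and the bias is switched off so that the count matches P × k² × Q exactly.

import torch.nn as nn

# A convolution layer mapping Q = 3 input channels to P = 16 output channels
# with kernels of width k = 3
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, bias=False)
print(conv.weight.shape)                          # torch.Size([16, 3, 3, 3])
print(sum(p.numel() for p in conv.parameters()))  # 16 * 3 * 3 * 3 = 432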
4.6 Convolution Neural Network (CNN)

Now let's put all the elements together to form the full network. Consider an image classification problem. Then the functional form of a typical CNN used for this task is given by ŷ = F(x; θ), where x ∈ R^{N₁×N₂×Q} is the input image with Q channels, while ŷ ∈ R^C is the predicted probability vector whose i-th component denotes the probability that the input image belongs to the i-th class among a total of C classes. The true labels y are typically one-hot encoded.
A possible architecture of this network is shown in Fig. 4.7. This consists of a number of convolution layers followed by pooling layers, which reduce the spatial resolution of the input image while increasing the number of channels. The output of the final pooling layer is flattened to form a vector, which is then passed through a number of fully connected layers with some activation function (say ReLU).
Fig. 4.7 Example of a CNN architecture for an image classification problem, for 10 classes
The final fully connected layer reduces the size of the vector to C (which is taken to be 10 in the figure), which is then passed through a softmax function to generate the predicted probability vector ŷ. Since we are solving a classification problem, the loss function is taken to be the cross-entropy function

∏(θ) = − Σ_{i=1}^{N_train} Σ_{c=1}^{C} y_c^(i) log(ŷ_c^(i)) = − Σ_{i=1}^{N_train} Σ_{c=1}^{C} y_c^(i) log( F_c(x^(i); θ) ).

We train the CNN by trying to find θ* = arg min_θ ∏(θ), with the final network being F(x; θ*).
We make some important remarks:
1. The convolution operation is also a linear operation on the input, as is the case
for a fully connected layer. The only difference is that in a fully-connected layer,
the weights connect every pixel in the output to every pixel of the input, while in
a convolution layer the weights connect one pixel of the output only to a patch of
pixels in the input. Furthermore, the same weights are applied on each patch of
the input.
2. In the CNN shown in Fig. 4.7, the convolution layers can be interpreted as encod-
ing information about the input image, while the fully connected layers can be
interpreted as using this information to solve the classification problem. This is
why in the literature, convolution layers are said to perform feature selection.
Further, the part of the network leading up to the “flattened” vector is sometimes
referred to as the encoder.
They allow the input to be a 2D image, while drastically decreasing the number
of learnable parameters needed for the feature extraction task. In fact, kernels
introduce a limited number of parameters compared to a classic fully connected
layer. Since the same kernel is applied at different pixel locations in an image, i.e.
parameter sharing, they utilize the computational resources in an efficient and
smart manner.
We have seen how convolution and pooling layers can be used to scale down images.
We now consider layers that do the opposite, i.e., scale up images. To understand
what this operation would look like, let us look at a few examples.
1. Consider a 1D image of size 4,

Input = [u₁, u₂, u₃, u₄],

and a convolution layer with a kernel k (with entries [x, y, z], say), stride 1 and zero-padding of size 1. Then, the output of the layer acting on the input is

Output = [yu₁ + zu₂, xu₁ + yu₂ + zu₃, xu₂ + yu₃ + zu₄, xu₃ + yu₄].

The steps involved in the convolution operator are: pad, dot-product, stride. Note that using padding and stride 1 has ensured the output has the same size as the input.
2. Consider another convolution with the same kernel k, zero-padding of size 1, but stride 2. Then, the output of the layer acting on the same input as earlier is

Output = [yu₁ + zu₂, xu₂ + yu₃ + zu₄].
Note that the size of the output has reduced by a factor of 2. In other words,
increasing the stride has allowed us to downsample the input. The question we
want to ask is whether we can perform an upsampling in a similar way? This can
indeed be done by transposing every operation in a convolution layer.
If we use a stride of 2 (or, in general, s), we need to shift the entries in the i-th row of the outer-product to the right by 2(i − 1) (or, in general, s(i − 1)), and fill the empty spaces in the final matrix with zeros,

Striding = ⎡ u₁x  u₁y  u₁z   0    0  ⎤
           ⎣  0    0   u₂x  u₂y  u₂z ⎦.

After striding is performed, we need to add the entries in each column and crop the vector to get the output

Output = Crop([u₁x, u₁y, u₁z + u₂x, u₂y, u₂z]) = [u₁x, u₁y, u₁z + u₂x, u₂y],

where we have cropped out the last element (by convention) to get an output which has 2 times the size of the input.
4. We consider a transpose convolution in 2D applied on a 2D image of size (2 × 2). The kernel is taken to be of shape (3 × 3) with stride 2 and padding (cropping). The action of this transpose convolution is shown in Fig. 4.8a, where we first obtain an image of size (5 × 5), which is then cropped to give an output of size (4 × 4). Note that the output pixels get an unequal contribution from the various patches (indicated by numbers inside each cell/pixel of the output), which leads to an undesirable phenomenon called checker-boarding. Checker-boarding refers to pixel-to-pixel variations in the values of the output image. A nice discussion of this, with a useful visual toolbox, can be found in [74].
5. One way to avoid checker-boarding is to ensure that the filter size is an integer multiple of the stride. Let us repeat the previous example but with a (2 × 2) kernel. The operation is illustrated in Fig. 4.8b. In this case, we do not need to pad (crop), and each output pixel has an equal contribution.
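The 1D examples above can be reproduced with a transpose convolution layer in PyTorch. The kernel values [1, 10, 100] below are hypothetical stand-ins for [x, y, z], chosen so that the individual contributions are visible in the output.

import torch
import torch.nn as nn

x = torch.tensor([[[1.0, 2.0]]])             # 1D input [u1, u2]
# Transpose convolution with kernel [x, y, z] = [1, 10, 100] and stride 2
up = nn.ConvTranspose1d(1, 1, kernel_size=3, stride=2, bias=False)
with torch.no_grad():
    up.weight[:] = torch.tensor([[[1.0, 10.0, 100.0]]])
y = up(x)
print(y)  # tensor([[[  1.,  10., 102.,  20., 200.]]])
# i.e. [u1 x, u1 y, u1 z + u2 x, u2 y, u2 z]; crop the last entry to
# obtain an output of twice the input size, as in the example above.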
We make some remarks:
1. Transpose convolution layers are also called fractionally-strided layers because, for every one step in the input, we take more than one step in the output. This is the opposite of what happens in a convolution layer, where we take more than one step in the input image for each step in the output image.
Fig. 4.8 Example of a transpose convolution operation. The cells marked with red X’s in the output
are cropped out. The numbers denote the number of patches that contributed to the value in the
output pixel
4.8 UpSampling
In an UpSampling layer with an integer scale factor S_f, each (i, j)-th pixel in the input image generates an independent patch of S_f × S_f pixels in the output image, the values of which are determined by a suitable interpolation algorithm. The two popular UpSampling interpolation algorithms are:

• Nearest neighbour interpolation: This strategy involves copying the value of the (i, j)-th pixel in the input image to the corresponding patch of S_f × S_f pixels in the output image. An example of this is shown in Fig. 4.9a.
• Bilinear interpolation: This strategy involves using a weighted combination of the values in the (i, j)-th pixel and its neighbouring pixels in the input image to evaluate the values in the corresponding patch of S_f × S_f pixels in the output image. An example of this is shown in Fig. 4.9b.

The UpSampling layer is typically used in combination with convolution layers for up-scaling an image, in place of a transpose convolution layer.
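Both interpolation strategies are available in PyTorch; a small sketch (the test image is ours):

import torch
import torch.nn as nn

x = torch.tensor([[[[1.0, 2.0],
                    [3.0, 4.0]]]])            # a (1, 1, 2, 2) input image
near = nn.Upsample(scale_factor=2, mode='nearest')
bilin = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
print(near(x))   # each pixel copied into a 2x2 patch
print(bilin(x))  # weighted combination of neighbouring pixels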
Fig. 4.10 Example of a U-Net mapping an input of shape .256 × 256 × 3 to an output image of
shape .256 × 256 × 1. CON(. P) denotes a convolution layer with . P kernels and appropriate zero
padding to preserve the spatial resolution; POOL denotes a max (or average) pooling layer halving
the spatial resolution; UP denotes an upsampling layer that doubles the spatial resolution; SKIP
denotes a skip connection that concatenates along the channel-direction
same spatial scale in the down-scaling branch of the U-Net, and the other is from the coarser scales of the up-scaling branch of the U-Net. The U-Net architecture shown in Fig. 4.10 maps an image of shape 256 × 256 × 3 to an output image with the same spatial resolution but with a single channel. Such U-Net models are typically used for image segmentation tasks.

Remark 4.9.1 The U-Net architecture shares many common features with the V-cycle that is typically used in multigrid preconditioners.

Remark 4.9.2 We can also think of the U-Net as an encoder-decoder network with the additional feature of including skip connections.
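A minimal two-scale sketch of this architecture in PyTorch is given below. It is our own simplification of Fig. 4.10, with placeholder channel counts, and not the exact network shown there.

import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal two-scale U-Net: CON -> POOL -> CON -> UP -> SKIP -> CON."""
    def __init__(self, c_in=3, c_out=1, p=16):
        super().__init__()
        self.down = nn.Sequential(nn.Conv2d(c_in, p, 3, padding=1), nn.ReLU())
        self.pool = nn.MaxPool2d(2)                   # halve the resolution
        self.deep = nn.Sequential(nn.Conv2d(p, 2 * p, 3, padding=1), nn.ReLU())
        self.up = nn.Upsample(scale_factor=2, mode='nearest')
        self.out = nn.Conv2d(3 * p, c_out, 3, padding=1)  # p + 2p channels after SKIP

    def forward(self, x):
        fine = self.down(x)                           # finest-scale features
        coarse = self.deep(self.pool(fine))           # coarser-scale features
        merged = torch.cat([fine, self.up(coarse)], dim=1)  # SKIP: concatenate channels
        return self.out(merged)

y = TinyUNet()(torch.randn(1, 3, 256, 256))   # -> shape (1, 1, 256, 256)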
3. Recognize that while a human expert can look at the MNIST images and easily
classify them correctly, this is not the case for the Mechanical MNIST data. It is
very hard for a human expert to look at the displacement fields and infer what
the shape of the inclusion will look like. The question is whether a CNN-based
classifier can do both these tasks well.
Import required libraries:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision
import torchvision.transforms as transforms
import numpy as np
import matplotlib.pyplot as plt
from torch.utils.data import TensorDataset
from torch.utils.data import DataLoader
The sources of data for the MNIST and the Mechanical MNIST problems are dif-
ferent. Therefore, you have to spend some time carefully gathering and processing
data. This is often the case when working on a problem in deep learning. So, it is
good to get used to it.
1. For the MNIST data, use the command torchvision.datasets.MNIST(). You should use it twice: once for defining the training set with the train argument set to train=True, and another time for the testing set with train=False. This data set contains N_train = 60,000 images and labels for training, and N_test = 10,000 images and labels for testing. Each image is a 28 × 28 array of pixels, and each label takes on integer values in the interval [0, 9].
transform = transforms.Compose([transforms.ToTensor()])

# Load MNIST training dataset
train_dataset = torchvision.datasets.MNIST(root='./data', train=True,
                                           transform=transform, download=True)

# Load MNIST test dataset
test_dataset = torchvision.datasets.MNIST(root='./data', train=False,
                                          transform=transform, download=True)
2. You can access the images and the labels by using train_dataset.data and
train_dataset.targets, respectively.
3. The Mechanical MNIST dataset is stored at https://open.bu.edu/handle/2144/
39429. Download the displacement files corresponding to loading Step 5. You
will see that this dataset contains .28 × 28 arrays of vertical and horizontal dis-
placements for . Ntrain training and . Ntest test samples. It does not contain the corre-
sponding labels (that is which digit does the displacement fields correspond to);
however, the ordering for the samples is the same as that for the MNIST data.
Therefore, we will use the labels from the MNIST data. You can download and
store this data in your Google drive so that you can upload it in Colab.
4. Process both the MNIST and the Mechanical MNIST data so that you have:
• For the MNIST data, numpy arrays: (a) mnist_train of size N_train × 1 × 28 × 28 of training images, (b) mnist_test of size N_test × 1 × 28 × 28 of test images, (c) y_train of size N_train × 10 of one-hot-encoded (categorical) training labels, and (d) y_test of size N_test × 10 of one-hot-encoded (categorical) test labels.
• For the Mechanical MNIST data, numpy arrays: (a) mnist_mech_train of size N_train × 2 × 28 × 28 of training images, and (b) mnist_mech_test of size N_test × 2 × 28 × 28 of test images. The label arrays for the Mechanical MNIST are the same as the ones for the MNIST dataset.
Note that this task will require things like converting text and torch tensors to numpy arrays, and reshaping and concatenating arrays to get them to the desired shape. (A sketch of part of this processing is given after this list.)
5. For the first five training and test samples, create a figure each that contains (a) a plot of the MNIST image, (b) plots of the two Mechanical MNIST displacements, and (c) the corresponding label. On the basis of the physical experiment, explain why the displacement images look the way they do.
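As a hint, a minimal sketch of the processing in step 4 for the MNIST arrays, using the train_dataset object created above, could be:

# Convert the torch tensors to numpy arrays of the required shapes
imgs = train_dataset.data.numpy()                  # (60000, 28, 28), uint8
mnist_train = imgs.reshape(-1, 1, 28, 28).astype(np.float32) / 255.0
labels = train_dataset.targets.numpy()             # (60000,) integers in [0, 9]
y_train = np.eye(10, dtype=np.float32)[labels]     # one-hot, shape (60000, 10)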
Construct a classifier each for the MNIST and the Mechanical MNIST data with the
following architecture:
• Convolution layer with 32 filters (kernel size = 3 × 3), zero padding, stride 1 and ReLU activation.
• A 2 × 2 max pooling layer with stride 2.
• Convolution layer with 128 filters (kernel size = 3 × 3), zero padding, stride 1 and ReLU activation.
• A 2 × 2 max pooling layer with stride 2.
• Convolution layer with 256 filters (kernel size = 3 × 3), zero padding, stride 1 and ReLU activation.
• A 2 × 2 max pooling layer with stride 2.
• A flattening layer.
• A fully connected layer with 256 neurons and ReLU activation.
• A final fully connected layer with width = number of classes = 10 and no activation. Note, we don't need to specify a soft-max activation because that is built into the cross entropy loss. (A sketch of this architecture is given after this list.)
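One possible PyTorch sketch of this architecture (the helper name make_classifier is ours; the spatial sizes in the comments assume a 28 × 28 input):

def make_classifier(in_channels):
    """in_channels is 1 for MNIST and 2 for Mechanical MNIST."""
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 3, stride=1, padding=1), nn.ReLU(),
        nn.MaxPool2d(2, stride=2),                 # 28x28 -> 14x14
        nn.Conv2d(32, 128, 3, stride=1, padding=1), nn.ReLU(),
        nn.MaxPool2d(2, stride=2),                 # 14x14 -> 7x7
        nn.Conv2d(128, 256, 3, stride=1, padding=1), nn.ReLU(),
        nn.MaxPool2d(2, stride=2),                 # 7x7 -> 3x3
        nn.Flatten(),                              # 256 * 3 * 3 = 2304
        nn.Linear(256 * 3 * 3, 256), nn.ReLU(),
        nn.Linear(256, 10),    # logits; softmax lives inside the CE loss
    )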
Given that the input image is of size 28 × 28, what will be the size of the intermediate
tensors after each layer of the above CNN?
Train the MNIST and Mechanical MNIST networks with the following parameters:
1. While training the networks, save the training loss and accuracy after each mini-
batch iteration. (a) Plot the loss versus iteration number for the MNIST and
Mechanical MNIST networks on the same plot. Use a log scale for the vertical
axis. (b) Plot the accuracy on the training set versus iteration number for both
networks. Do not use the log scale in this plot. What do you observe?
2. For the trained networks, what is the accuracy on the test set? Which network does better? Why do you think this is the case?
3. For the MNIST network, find three cases when the prediction is incorrect. For
each case plot the MNIST image, the correct label and the incorrect label. What
do you observe?
4. For the Mechanical MNIST network, find three cases when the prediction is
incorrect. For each case plot the Mechanical MNIST images (the displacement
fields), the correct label and the incorrect label. What do you observe?
5 Solving PDEs with Neural Networks

As a model problem, consider the one-dimensional advection-diffusion equation

a du/dx − κ d²u/dx² = f(x),  x ∈ (0, l),
u(0) = 0,    (5.1)
u(l) = 1,
where a denotes the advective velocity, κ is the diffusion coefficient, while f(x) is the source. Such equations are used to model many physical phenomena, such as the transport of pollutants by fluids, or the flow of electrons through semiconductors. The multi-dimensional version of this problem will take the form
Note that the model problem is a linear PDE (ODE in the one-dimensional case). When the velocity or the diffusivity depends on the concentration, we are led to a nonlinear version of this equation. For example, replacing the velocity a by u leads to the viscous Burgers equation.
For f(x) = 0, the exact solution is

u(x) = (1 − exp(ax/κ)) / (1 − exp(al/κ)),

where al/κ is known as the Peclet number (Pe) and measures the ratio of the strength of advection to the strength of diffusion. We plot the solution for varying values of a and κ in Fig. 5.1. Note that for small Pe, the solution is essentially a straight line. But as Pe increases, the solution starts to bend, forming a steeper boundary layer near the right boundary. The thickness of this boundary layer is given by δ ≈ l/Pe.
We will now consider a few methods to numerically solve this toy problem.
5.1 Finite Difference Method

The finite difference method [101, 105] is one of the most popular methods for
solving PDEs numerically. The core idea is to approximate the derivatives in the
differential equation at hand with finite difference quotients. More specifically, the
key steps of a finite difference scheme are as follows:
1. Discretize the domain into a grid of points, with the goal being to find the solution
at these points.
2. Approximate the derivatives with finite difference approximations at these points.
This leads to a system of (linear or non-linear) algebraic equations.
3. Solve this system using a suitable algorithm to find the solution.
du/dx(x_i) = (u_{i+1} − u_{i−1})/(2h) + O(h²),
d²u/dx²(x_i) = (u_{i+1} − 2u_i + u_{i−1})/h² + O(h²).

Note that both approximations used above are second-order accurate. They are "central difference" approximations, as they weigh points on either side of the i-th point with the same magnitude. It is worth mentioning that in the limit of large Peclet number, a central difference approximation of the advective term is not ideal, since it leads to numerical instability. In such cases, an "upwind" approximation is preferred. Applying the approximations to the PDE at x_i, 1 ≤ i ≤ N − 1, gives

a (u_{i+1} − u_{i−1})/(2h) − κ (u_{i+1} − 2u_i + u_{i−1})/h² = f_i,

which can be written as αu_{i−1} + βu_i + γu_{i+1} = f_i, with α = −a/(2h) − κ/h², β = 2κ/h² and γ = a/(2h) − κ/h². Looking at each node where the solution is unknown (recall that u₀ = 0 and u_N = 1 are known),

βu₁ + γu₂ = −αu₀ + f₁,
αu_{i−1} + βu_i + γu_{i+1} = f_i,  ∀ 2 ≤ i ≤ N − 2,    (5.3)
αu_{N−2} + βu_{N−1} = −γu_N + f_{N−1}.

Combining all the N − 1 equations in (5.3), we get the following linear system

Ku = f,    (5.4)

where the tridiagonal matrix K and the other vectors in (5.4) are defined as
K = ⎡ β  γ         0 ⎤
    ⎢ α  ⋱  ⋱       ⎥
    ⎢    ⋱  ⋱    γ  ⎥
    ⎣ 0       α  β  ⎦  ∈ R^{(N−1)×(N−1)},

u = [u₁, u₂, · · · , u_{N−2}, u_{N−1}]ᵀ ∈ R^{N−1},

f = [−αu₀ + f₁, f₂, f₃, · · · , f_{N−2}, −γu_N + f_{N−1}]ᵀ ∈ R^{N−1}.

3. Solve u = K⁻¹ f. (A sketch of these steps is given below.)
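A minimal NumPy sketch of the three steps for (5.1), using the central-difference coefficients α, β, γ defined above (the parameter values are illustrative choices of ours):

import numpy as np

a, kappa, l, N = 1.0, 0.1, 1.0, 100
h = l / N
f = np.zeros(N + 1)                      # source term f(x) = 0 here
alpha = -a / (2 * h) - kappa / h**2      # coefficient of u_{i-1}
beta = 2 * kappa / h**2                  # coefficient of u_i
gamma = a / (2 * h) - kappa / h**2       # coefficient of u_{i+1}

# Tridiagonal matrix K of (5.4)
K = np.diag(beta * np.ones(N - 1)) \
  + np.diag(gamma * np.ones(N - 2), 1) \
  + np.diag(alpha * np.ones(N - 2), -1)
rhs = f[1:N].copy()
rhs[0] -= alpha * 0.0                    # boundary condition u_0 = 0
rhs[-1] -= gamma * 1.0                   # boundary condition u_N = 1
u = np.linalg.solve(K, rhs)              # interior values u_1, ..., u_{N-1}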
Note that:
5.2 Spectral Collocation Method

The spectral collocation method [63, 123] seeks a solution written as an expansion
in terms of a set of smooth and global basis functions. The basis functions are chosen
a priori, whereas the coefficients of the expansion are unknowns, and are computed
by requiring that the numerical solution of the PDE is exact at a set of so-called
collocation points. More specifically, this approach involves the following steps.
The first few Chebyshev polynomials are shown in Fig. 5.2. Note that this basis
satisfies all the required properties listed above. It is easy to evaluate at any
point because one can use the recurrence relation above and the values of the two
lower-order polynomials to evaluate the Chebyshev polynomial of the subsequent
order. One can also take derivatives of the recurrence relation above to obtain a
recurrence relation for derivatives of all orders. (A sketch of this recurrence appears after this list.)
2. Write the solution as a linear combination of the basis functions {φ_n(x)}_{n=0}^{N},

u(x) = Σ_{n=0}^{N} u_n φ_n(x),    (5.5)

where u_n are the basis coefficients. For our toy problem (5.1) (assuming l = 1), we will use the Chebyshev polynomials φ_n(x) = T_n(2x − 1), where the argument is transformed to use these functions on the interval (0, 1).
3. Evaluate the derivatives for the PDE, which for our toy problem will be

du/dx(x) = Σ_{n=0}^{N} u_n φ'_n(x) = Σ_{n=0}^{N} 2 u_n T'_n(2x − 1),
d²u/dx²(x) = Σ_{n=0}^{N} u_n φ''_n(x) = Σ_{n=0}^{N} 4 u_n T''_n(2x − 1).    (5.6)
4. Use the boundary conditions of the PDE. For the specific case of (5.1),
u(0) = 0 ⟹ Σ_{n=0}^{N} u_n φ_n(0) = Σ_{n=0}^{N} u_n T_n(−1) = 0,
u(1) = 1 ⟹ Σ_{n=0}^{N} u_n φ_n(1) = Σ_{n=0}^{N} u_n T_n(1) = 1.    (5.7)

Next, enforce the PDE at the collocation points x_i,

a Σ_{n=0}^{N} u_n φ'_n(x_i) − κ Σ_{n=0}^{N} u_n φ''_n(x_i) = f(x_i)
⟹ Σ_{n=0}^{N} u_n ( 2a T'_n(2x_i − 1) − 4κ T''_n(2x_i − 1) ) = f(x_i).    (5.8)

Collecting these equations, together with (5.7), leads to the linear system

Ku = f,    (5.9)

where the entries of K and f follow from (5.7) and (5.8).
5. Solve u = K⁻¹ f.
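For instance, a simple evaluation of T_n(x) by the three-term recurrence mentioned in step 1 might read (the function name is ours):

import numpy as np

def chebyshev(n, x):
    """Evaluate T_n(x) via the recurrence T_{n+1} = 2x T_n - T_{n-1},
    with T_0 = 1 and T_1 = x."""
    T_prev, T = np.ones_like(x), x
    if n == 0:
        return T_prev
    for _ in range(n - 1):
        T_prev, T = T, 2 * x * T - T_prev
    return T

x = np.linspace(-1, 1, 5)
print(chebyshev(3, x))        # T_3(x) = 4x^3 - 3x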
We need to choose the collocation/quadrature points x_i properly, so that K has desirable properties that make the linear system (5.9) easier to solve. These include invertibility, positive-definiteness, sparseness, etc.

Remark 5.2.1 The method is called a collocation method, as the PDE is enforced at the specific collocation/quadrature points x_i.

Remark 5.2.2 When working with a non-linear PDE, we will end up with a non-linear system of algebraic equations for the coefficients u₀, . . . , u_N, which is typically solved by Newton's method.
Let us look at a least-squares variant for finding the coefficients of the spectral expansion. As done earlier, we still represent the solution using (5.5) and compute its derivatives. Then, the coefficients u are found by minimizing the following loss function
This can be solved using any of the gradient-based methods we have seen in Chap. 2. This approach is especially useful when treating non-linear PDEs. In fact, in those cases it is not possible to write a linear system in terms of the coefficients, such as (5.9). A few things to note here:
• λ is a parameter used to scale the interior loss and the boundary loss differently.
• The number of interior points x_i can be chosen independently of the number of basis functions. In other words, N_train does not have to be the same as N.
• We will see in the next section how this variant of the spectral method is very similar to how deep neural networks are used to solve PDEs.
5.3 Physics-Informed Neural Networks (PINNs)

The idea of learning the solution of a PDE using a neural network constrained by the structure of the PDE operator was first considered in the early 90s by Dissanayake and Phan-Thien [23], who solved simple PDEs on fixed meshes. This was closely followed by the work of Lagaris et al. [54, 55], where the networks were constrained to strongly satisfy the boundary conditions. With the renewed interest in using machine learning tools to solve PDEs, this idea was rediscovered in 2019 by Raissi et al. [85] and given the name PINNs (physics-informed neural networks).
The basic idea of PINNs is similar to regression, except that the loss function ∏(θ) contains derivative operators arising in the PDE being considered. We outline the main steps below for a one-dimensional scalar PDE, which can easily be extended to multi-dimensional systems of PDEs. We recommend that the reader think about the similarities and differences between this method and the spectral collocation method described in the previous section.
1. Select a neural network as a function representation of the PDE solution:
∂x^(L+1)/∂x^(0) = W^(1)T S^(1) W^(2)T S^(2) · · · W^(L)T S^(L) W^(L+1)T S^(L+1).

Hence, the evaluation of du/dx requires the extension of the original graph with a backward branch used to evaluate the derivative of the activation function for each component of the vectors ξ^(l) (see Fig. 5.3). The second derivative d²u/dx² is evaluated in a similar fashion.
Fig. 5.3 Extended graph to evaluate derivatives with respect to network input
adaptive techniques to tune the value of λ_b during training have been proposed [11, 110], including self- and residual-based attention [4, 64].
Consider a generic PDE of the form

L(u(x)) = f(x),  x ∈ Ω,
B(u(x)) = g(x),  x ∈ ∂Ω,    (5.13)

where L is the differential operator, f is the known forcing term, B is the boundary operator, and g is the non-homogeneous part of the boundary condition (also prescribed).
As an example, we can consider the three-dimensional incompressible Navier-Stokes equations, solving for the velocity field v = [v₁, v₂, v₃] and the pressure p on Ω = Ω_S × [0, T]. Here Ω_S is the three-dimensional spatial domain and [0, T] is the time interval of interest. The equations are given by

∂v/∂t + v · ∇v + ∇p − νΔv = f,  ∀ (s, t) ∈ Ω,
∇ · v = 0,  ∀ (s, t) ∈ Ω,    (5.14)
v = 0,  ∀ (s, t) ∈ ∂Ω_S × [0, T],
v(s, 0) = v₀(s),  ∀ s ∈ Ω_S.

The first equation above is the balance of linear momentum. The second equation enforces the conservation of mass. The third equation is the no-slip boundary condition, which is used when the boundary is rigid and fixed. The fourth equation is the prescription of the initial velocity field. The prescribed data for (5.14) comprise the kinematic viscosity ν, the forcing function f(s, t), and the initial velocity of the fluid v₀(s). Given the prescribed data, we wish to solve for the velocity field v and the pressure field p on Ω.
To design a PINN for (5.13), the input to the network should be the independent variables x and the output should be the solution vector u. For the specific case of the Navier-Stokes system (5.14), the input to the network would be x = [s₁, s₂, s₃, t] ∈ R⁴, while the output vector would be u = [v₁, v₂, v₃, p] ∈ R⁴.
The simplest schematic of a PINN that achieves this is shown in Fig. 5.4. The key
design steps would be the following:
1. Construct the loss functions
• Select suitable N_v collocation points in the interior of the domain and N_b points on the domain boundary to evaluate the residuals. These could be chosen based on quadrature rules, such as Gaussian, Lobatto, uniform, random, etc.
• Evaluate the interior and boundary residual losses, and combine them into the total loss,

∏_int(θ) = (1/N_v) Σ_{i=1}^{N_v} |R(F(x_i; θ))|²,  ∏_b(θ) = (1/N_b) Σ_{i=1}^{N_b} |R_b(F(x_i; θ))|²,  ∏(θ) = ∏_int(θ) + λ_b ∏_b(θ).

2. Train the network: find θ* = arg min_θ ∏(θ), and set the solution as u*(x) = F(x; θ*).

We make some remarks here:
• It is implicitly assumed that a weight regularization term is also added to the loss ∏(θ).
• Is u*(x) the exact solution to the PDE? The answer is no!
• In practice, we only compute ∏(θ*). Is the solution error ‖e‖ = ‖u* − u‖ related to this loss value? And if it is, can we say that this error will be small as long as the loss is small? This is what we try to answer in the next section.
5.5 Error Analysis for PINNs

In order to evaluate the error e = u* − u, we need to know the exact solution u, which is not available in general. We consider a way of overcoming this issue by restricting our discussion to linear PDEs, i.e., L and B are assumed to be linear operators. Note
that if u is the exact solution, then

L(e) = L(u* − u) = L(u*) − L(u) = L(u*) − f = R(u*),    (5.15)

and

B(e) = B(u* − u) = B(u*) − B(u) = B(u*) − g = R_b(u*).    (5.16)

Thus, (5.15) and (5.16) lead to a PDE for e driven by the residuals of the MLP solution,

L(e) = R(u*),  in Ω,    (5.17)
B(e) = R_b(u*),  on ∂Ω.
If the residuals of u* were zero, then e = 0. Unfortunately, these residuals are not zero. The most that we can say is that they are small at the collocation points. However, from the theory of stability of well-posed PDEs, we have

‖e‖_{L²(Ω)} ≤ C₁ ( ‖R(u*)‖_{L²(Ω)} + ‖R_b(u*)‖_{L²(∂Ω)} ),    (5.18)

where C₁ is a stability constant that depends on the PDE, the domain Ω, etc. Such a condition can typically be obtained for all well-posed PDEs. It says that if the terms driving the PDE are small, then the solution of the PDE will also be small.
This equation tells us that we can control the error if we can control the residuals of the MLP solution. However, in practice we know and control ∏_int, ∏_b, and not ‖R(u*)‖²_{L²(Ω)} and ‖R_b(u*)‖²_{L²(∂Ω)}. The question then becomes: are these quantities related? This is answered in the analysis below. Firstly, we assume the N_v interior collocation points to be suitable quadrature nodes, which leads to the following quadrature approximation for any function l (with sufficient regularity):
∫_Ω l(x) dx = (m_Ω/N_v) Σ_{i=1}^{N_v} w_i l(x_i) + C₂ N_v^{−α},    (5.19)
where m_Ω is the measure of the domain Ω, {w_i} are the quadrature weights, and the constant C₂ > 0 and rate α > 0 depend on the type of quadrature used. Next, we can write

‖R(u*)‖²_{L²(Ω)} = | m_Ω ∏_int(θ*) + ‖R(u*)‖²_{L²(Ω)} − m_Ω ∏_int(θ*) |
              ≤ m_Ω ∏_int(θ*) + | ‖R(u*)‖²_{L²(Ω)} − m_Ω ∏_int(θ*) |    (5.20)
              ≤ m_Ω ∏_int(θ*) + C₂ N_v^{−α}.

In the equation above, the first line is obtained by adding and subtracting the term m_Ω ∏_int(θ*), the second line is obtained by using the triangle inequality, while the third line is obtained by using (5.19) for the integrand R(u*)² with the quadrature weights chosen as w_i = 1 for simplicity. Then, by using the relation (a + b)^{1/2} ≤ a^{1/2} + b^{1/2} for a, b ≥ 0, we get from (5.20)

‖R(u*)‖_{L²(Ω)} ≤ m_Ω^{1/2} ∏_int(θ*)^{1/2} + C₂^{1/2} N_v^{−α/2}.    (5.21)

An analogous argument for the boundary residual yields

‖R_b(u*)‖_{L²(∂Ω)} ≤ m_∂Ω^{1/2} ∏_b(θ*)^{1/2} + C₃^{1/2} N_b^{−β/2},    (5.22)
where m_∂Ω is the measure of ∂Ω, while C₃ and β > 0 depend on the type of boundary quadrature rule considered.
Combining (5.18), (5.21) and (5.22), we get

‖e‖_{L²(Ω)} ≤ C₁ ( m_Ω^{1/2} ∏_int(θ*)^{1/2} + m_∂Ω^{1/2} ∏_b(θ*)^{1/2} + C₂^{1/2} N_v^{−α/2} + C₃^{1/2} N_b^{−β/2} ),    (5.23)

where the first two terms are reduced by increasing N_θ, and the last two by increasing N_v and N_b. This equation tells us that it is possible to control the error in the PINNs solution by reducing the loss functions (by increasing N_θ) and by increasing the number of interior and boundary collocation points. For further details about this analysis, the
reader is referred to [69].
5.6 Data Assimilation Using PINNs

Suppose we are given the values of a field u at M points in the domain,

u_i = u(x_i),  x_i ∈ Ω,  1 ≤ i ≤ M.

Furthermore, we are given that u satisfies some constraint R(u) = 0 on Ω. Here R could correspond to the PDE residuals (5.17) or any other physical constraint. Then, data assimilation corresponds to using this information to find the value of u at any x ∈ Ω.
We can solve this problem using PINNs. First, we represent u using a neural network F(x; θ). Next, we define a loss function
∏(θ) = (λ_I/M) Σ_{i=1}^{M} (u_i − F(x_i; θ))² + (1/N_v) Σ_{i=M+1}^{M+N_v} |R(F(x_i; θ))|² + λ‖θ‖²,

where the three terms enforce data matching, the physical constraint, and smoothness (regularization), respectively.
Over the last few years, PINNs have been used to solve a number of PDE problems, as
well as integro-differential models [75, 76, 122] and stochastic differential equations
[115]. Several gradient pathologies and instabilities of PINNs were identified in [110,
112] along with a discussion of potential remedies [109]. The generalization error
analysis of PINNs is available for linear second-order elliptic and parabolic PDEs
[99]. In addition to the error estimates for forward PDE problems [69] discussed in
Sect. 5.5, estimates for PDE-based inverse problems have also been addressed [68].
In [22], the authors analyze how the generalization error for PINNs trained to solve
the Navier-Stokes equations can be controlled by reducing the training error.
The PINN formulations discussed in the present chapter make use of the strong form of the PDE to construct the residual losses. However, it is more meaningful to consider the variational (weak) form of the PDE, especially when the solution is expected to have lower regularity. Thus, variational PINNs were designed in [48], which make use of the Petrov-Galerkin formulation for PDEs. Another variant of PINNs, called weak PINNs, has been designed to accurately approximate the entropy solutions of conservation laws. To solve PDEs on very complicated domains, the XPINN framework was proposed in [19], which deploys multiple neural networks in parallel to approximate the local solution in smaller sub-domains. We direct interested readers to [18] for a more extensive review of currently available PINN-type approaches.
5.8 Computational Exercise: Physics Informed Neural Networks (PINNs)
In this exercise, you will use feed-forward networks to solve the nonlinear diffusion-advection-reaction equation

d²u/dx² − Pe du/dx + Da u(1 − u) = 0,  x ∈ (0, 1),  u(0) = 0,  u(1) = 1.

In this equation, the non-dimensional number Pe is the Peclet number, which measures the strength of advection relative to diffusion, and Da is the Damkohler number, which measures the strength of reaction relative to diffusion.
1. Use the MLP class developed previously to create a feed-forward network of width = 18 and depth = 6, and with input and output dimensions set to 1. This will be used to represent the function u = u(x; θ), which will be the solution to the PDE above. The weights and biases (θ) should be initialized using a standard normal distribution (zero mean and unit standard deviation). All hidden layers should make use of a tanh activation function, while no activation should be used in the output layer. Use l₂ regularization in all layers with a parameter of 1e-7.
2. Create an array of N = 100 uniformly spaced points in [0, 1]. Train a neural network with the following loss function

∏(θ) = (1/N) Σ_{i=1}^{N} (Lu(x_i; θ))² + λ_b ( u(0; θ)² + (u(1; θ) − 1)² ),
which is the sum of the interior residual and a scaled boundary residual.
def loss_fun(x_train, model, Pe, Da):
    # Evaluate the network approximation u(x; theta) at the training points
    u = model(x_train)
    # First derivative du/dx via automatic differentiation
    u_x = torch.autograd.grad(
        u, x_train,
        grad_outputs=torch.ones_like(u),
        create_graph=True)[0]
    # Second derivative d2u/dx2 by differentiating u_x once more
    u_xx = torch.autograd.grad(
        u_x, x_train,
        grad_outputs=torch.ones_like(u_x),
        create_graph=True)[0]
    # Mean-squared interior residual of the PDE
    R_int = torch.mean(torch.square(u_xx - Pe * u_x + Da * u * (1.0 - u)))
    # Boundary residual enforcing u(0) = 0 and u(1) = 1
    R_bc = torch.square(u[0]) + torch.square(u[-1] - 1.0)
    return R_int, R_bc
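A hypothetical training loop using loss_fun, with a simple nn.Sequential stand-in for the MLP class of step 1, might look like:

x_train = torch.linspace(0.0, 1.0, 100).reshape(-1, 1).requires_grad_(True)
model = torch.nn.Sequential(                  # stand-in for the MLP class
    torch.nn.Linear(1, 18), torch.nn.Tanh(),
    torch.nn.Linear(18, 18), torch.nn.Tanh(),
    torch.nn.Linear(18, 1),
)
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
lambda_b, Pe, Da = 10.0, 0.01, 0.01
for epoch in range(40000):
    opt.zero_grad()
    R_int, R_bc = loss_fun(x_train, model, Pe, Da)
    loss = R_int + lambda_b * R_bc            # weighted interior + boundary loss
    loss.backward()
    opt.step()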
3. Use a learning rate of 1e-4 and λ_b = 10 for the training. Consider four different sets of values for the non-dimensional parameters:
• Pe = 0.01, Da = 0.01 (diffusion dominates).
• Pe = 20, Da = 0.01 (advection dominates).
• Pe = 0.01, Da = 60 (reaction dominates).
• Pe = 15, Da = 40 (advection and reaction dominate).
• Train the network for 40,000 epochs without mini-batches. Save the history of the interior loss and the boundary loss.
• Generate a plot of the interior and boundary losses as a function of epoch number. One plot for each set of parameter values.
• Using the fully trained network (at the end of 40,000 epochs), evaluate the predicted solution at the test points contained in the first column of the reference solution files provided to you.
• Generate a plot of the network solution and the reference solution as a function of the spatial coordinate x. One plot for each set of parameter values.
5. Repeat the previous step for λ_b = 300 and λ_b = 10,000.
6. Based on the results obtained answer the following questions:
6 Operator Networks

Recall that a typical MLP y = F(x; θ) is a function that takes as input x ∈ R^d and gives an output y ∈ R^D, with trainable weights θ. Also, as we discussed in Chap. 5, a PINN is a network of the form u(x; θ) = F(x; θ), taking as input the independent variable x of the underlying PDE and giving the approximate solution u(x; θ) (of the PDE) as output. The network is trained by minimizing the weighted sum of the PDE and boundary residuals. However, this is just one instance of the solution of the PDE for some given boundary condition and source term. For instance, if we consider the PDE
∇ · (κ∇u) = f(x),  x ∈ Ω = [0, 1] × [0, 1],
u(x) = g(x),  x ∈ ∂Ω,    (6.1)

with the corresponding PINN loss function

∏(θ) = (1/N_v) Σ_{i=1}^{N_v} |∇ · (κ∇F(x_i; θ)) − f(x_i)|² + (λ_b/N_b) Σ_{i=1}^{N_b} |F(x_i; θ) − g(x_i)|².
Then, if θ* = arg min_θ ∏(θ), the PINN solving (6.1) will be u*(x) = F(x; θ*). How-
ever, if we change f or g in (6.1), we have no reason to believe that the same trained network would work. In fact, we would need to retrain the network (with perhaps the same architecture) for the new f and g. This can be quite cumbersome to do, and we would ideally like to avoid it. In this chapter, we will see ways by which we can overcome this issue.
Assume that the source term f in (6.1) is given as a parametric function f(x; α). Then we could train a PINN that accommodates the parametrization by considering a network that takes as input both x and α, i.e., F(x, α; θ). This is shown in Fig. 6.1. This network can be trained by minimizing the loss function
∏(θ) = (1/N_a) Σ_{j=1}^{N_a} [ (1/N_v) Σ_{i=1}^{N_v} |∇ · (κ∇F(x_i, α_j; θ)) − f(x_i, α_j)|² + (λ_b/N_b) Σ_{i=1}^{N_b} |F(x_i, α_j; θ) − g(x_i)|² ].
6.2 Operators
1. Consider the PDE (6.1), for which we assume that the conductivity κ and boundary condition g are given and fixed functions. For this PDE, Ω_X = Ω_Y = Ω, and the operator N maps the RHS f to the solution (temperature) u of the PDE. That is, u(x) = N(f)(x). The input and the output of the operator are related by the PDE.
2. Consider the PDE (6.1), but now assume that it is the conductivity field κ that might change, instead of the RHS. Then, Ω_X = Ω_Y = Ω, and the operator N maps the conductivity κ to the solution u of the PDE. That is, u = N(κ)(x). The input and the output of the operator are related by the PDE, where it is assumed that f and g are given and fixed.
3. Once again, consider the same PDE (6.1), but with the conductivity and the boundary condition being allowed to change. Then, the operator N maps the boundary condition g and the conductivity κ to the solution u of the PDE. That is, u = N(κ, g)(x). In this case the input to the operator is two functions (g, κ) and the output is a single function. Therefore Ω_X = Ω, while Ω_Y = Ω × ∂Ω. The input and the output are related through the solution of the PDE, where it is assumed that f is given and fixed.
4. Consider the equations of linear isotropic elasticity posed on a three-dimensional domain Ω ⊂ R³,

∇ · ( λ(∇ · u)I + 2μ∇^S(u) ) = f(x),  x ∈ Ω,
u(x) = g(x),  x ∈ ∂Ω,    (6.2)

where we are interested in how the solution of the PDE changes as the source function f is varied. Thus, we are interested in the operator defined by u(x) = N(f)(x), whose input function is f : Ω → R³ and whose output function is u : Ω → R³. The two are related by the equations above, where λ, μ, g are given and fixed.
5. Now, consider a different PDE. In particular, the advection-diffusion-reaction equation,

∂u/∂t + a · ∇u − κ∇²u + u(1 − u) = f,  (x, t) ∈ Ω × (0, T],
u(x, t) = g(x, t),  (x, t) ∈ ∂Ω × (0, T],    (6.3)
u(x, 0) = u₀(x),  x ∈ Ω.

We want to find the operator N that maps the initial condition u₀ to the solution u at the final time T, i.e., u(x, T) = N(u₀)(x). In this case Ω_X = Ω_Y = Ω. Further, the input and the output functions are related to each other via the solution of the PDE above, with a, κ, f, g given and fixed.
Remark 6.2.1 It is often useful to determine whether an operator is linear or non-
linear. This is because if it is linear it can be well approximated by another linear
operator. In the cases considered above the operators in examples 1 and 4 were linear
whereas those in examples 2, 3, and 5 were nonlinear.
We are interested in networks that approximate the operator N. We will see how we can do this in the following sections. These types of networks are often referred
to as Operator Networks. There are two popular versions of these networks. One is
referred to as a Deep Operator Network, or a DeepONet, and the other is referred to
as a Fourier Neural Operator. We describe the DeepONet and its extensions in the
following three sections. Thereafter, we describe the Fourier Neural Operator, the
VarMiON, which is an operator network that draws from variational principles, and
Mesh Graph Networks that are adept at working with data defined on unstructured
meshes.
Operator networks were first proposed by Chen and Chen [17], who considered only shallow networks with a single hidden layer. This idea was rediscovered and extended to deep architectures more recently in [61], and the resulting networks were called DeepONets. A standard DeepONet comprises two neural networks. We describe below its construction to approximate an operator N : A → U, where A is a set of functions of the form a : Ω_Y ⊂ R^d → R, while U consists of functions of the form u : Ω_X ⊂ R^D → R.
where the trainable parameters of the DeepONet are the combined parameters of the branch and trunk nets, i.e., θ = [θ_T, θ_M].
In the above construction, once the DeepONet is trained (we will discuss the training in the following section), it will approximate the underlying operator N, and allow us to approximate the value of N(a)(x) for any a ∈ A and any x ∈ Ω_X. Note that in the construction of the DeepONet, the M sensor points need to be pre-defined and cannot change during the training and evaluation phases.
We can make the following observations regarding the DeepONet architecture:
1. The expression in (6.4) has the form of representing the solution as the sum of a
series of coefficients and functions. The coefficients are determined by the branch
network, while the functions are determined by the trunk network. In that sense
the DeepONet construction is similar to the formulation in the spectral method or
the finite element method. There is a critical difference though. In these methods,
the basis functions are pre-determined and selected by the user. However, in the
DeepONet these functions are determined by the trunk network and their final
form depends on the data used to train the DeepONet.
2. Architecture of the branch sub-network: When the nodes for sampling the input function are chosen randomly, the appropriate architecture for the branch network comprises fully connected layers. Further, recognizing that the dimension of the input to this network can be rather large (M ≈ 10⁴), while the output is typically small (p ≈ 10²), this network can be thought of as an encoder.
When the nodes for sampling the input function are chosen as a tensorized grid, the appropriate architecture for the branch network comprises convolutional layers. In that case, this network maps an image of large dimension (M ≈ 10⁴) to a latent vector of small dimension (p ≈ 10²). Thus it is best represented by a convolutional neural network.
3. Broadly speaking, there are two ways of improving the expressivity of a DeepONet: increasing the number of network parameters in the branch and trunk sub-networks, and increasing the dimension p of the latent vectors formed by these sub-networks.
Training a DeepONet is typically supervised, and requires pairwise data. The fol-
lowing are the main steps involved:
1. Select N₁ representative functions a^(i), 1 ≤ i ≤ N₁, from the set A. Evaluate the values of these functions at the M sensor points, i.e., a_j^(i) = a^(i)(y^(j)) for 1 ≤ j ≤ M. This gives us the vectors a^(i) = [a^(i)(y^(1)), . . . , a^(i)(y^(M))]ᵀ ∈ R^M for each 1 ≤ i ≤ N₁.
2. For each a^(i), determine (numerically or analytically) the corresponding function u^(i) given by the operator N.
3. Sample the function u^(i) at N₂ points in Ω_X, i.e., u^(i)(x^(k)) for 1 ≤ k ≤ N₂.
4. Construct the training set

S = { (a^(i), x^(k), u^(i)(x^(k))) : 1 ≤ i ≤ N₁, 1 ≤ k ≤ N₂ }.

5. Train the DeepONet by minimizing the loss function

∏(θ) = (1/(N₁N₂)) Σ_{i=1}^{N₁} Σ_{k=1}^{N₂} |Ñ(x^(k), a^(i); θ) − u^(i)(x^(k))|².    (6.5)
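A minimal PyTorch sketch of this construction is given below. The layer sizes and names are placeholders: the branch net takes the M sensor values of a, the trunk net takes a query point x, and the prediction is the dot product of the two latent vectors, as in (6.4).

import torch
import torch.nn as nn

class DeepONet(nn.Module):
    """Sketch of N~(x, a) = sum_k branch_k(a) * trunk_k(x), with latent dimension p."""
    def __init__(self, M=100, d_x=1, p=64, width=128):
        super().__init__()
        self.branch = nn.Sequential(
            nn.Linear(M, width), nn.ReLU(), nn.Linear(width, p))
        self.trunk = nn.Sequential(
            nn.Linear(d_x, width), nn.ReLU(), nn.Linear(width, p))

    def forward(self, a, x):
        # a: (batch, M) sensor values of the input function
        # x: (batch, d_x) query points in Omega_X
        return (self.branch(a) * self.trunk(x)).sum(dim=-1, keepdim=True)

net = DeepONet()
a = torch.randn(32, 100)      # a^(i) sampled at M = 100 sensor points
x = torch.rand(32, 1)         # query locations x^(k)
u_pred = net(a, x)            # predicted u^(i)(x^(k)), shape (32, 1)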
Remark 6.3.1 We need not choose the same N₂ points across all i in the training set. In fact, these can be chosen randomly, leading to a more diverse dataset.

Remark 6.3.2 The DeepONet can be easily extended to the case where the input comprises multiple functions. In this case, the trunk network remains the same; however, the branch network now has multiple vectors as input. The case corresponding to two input functions, a(y) and b(y), which when sampled yield the vectors a and b, is shown in Fig. 6.3.
Remark 6.3.3 The DeepONet can be easily extended to the case where the output comprises multiple functions (say D such functions). In this case, the output of the branch and trunk network leads to D vectors, each with dimension p. The solution is then obtained by taking the dot product of each one of these vectors. The case corresponding to two output functions u₁(x) and u₂(x) is shown in Fig. 6.4.
A natural question one can ask is: how well do DeepONets approximate operators? One of the first universal approximation theorems for a shallow version of DeepONets was provided by Chen and Chen [17]:

Theorem 6.3.1 Suppose Ω_X and Ω_Y are compact sets in R^D and R^d (or, more generally, a Banach space), respectively. Let V be a compact subset of C(Ω_Y), and let N be a nonlinear, continuous operator mapping V into C(Ω_X). Then, given ε > 0, there exists a DeepONet Ñ such that

max_{x ∈ Ω_X, a ∈ V} |Ñ(x, a; θ) − N(a)(x)| < ε.
Recall that DeepONets approximate u(x) = N(a)(x) ≈ Ñ(x, a; θ). Assume that the pair a and u satisfy a PDE. For example,

∇ · (κ∇u) = f  in Ω,
u = g  on ∂Ω,    (6.6)
where κ and g are prescribed. To construct the operator Ñ that maps f to u, we need to solve the PDE externally using a traditional numerical solver to define the target labels in the loss function (6.5). However, in addition to this, we can also use a PINN-type loss function and add that to the total loss. This is the idea of Physics-Informed DeepONets, proposed in [111]. For the above model PDE, the physics-based loss would be

∏_p(θ) = (1/(N̄₁N̄₂)) Σ_{i=1}^{N̄₁} Σ_{k=1}^{N̄₂} |∇_x · ( κ∇_x Ñ(x^(k), f^(i); θ) ) − f^(i)(x^(k))|².    (6.7)
This is in addition to the standard data-driven loss function, which for this example is given by
$$\Pi_d(\theta) = \frac{1}{N_1 N_2} \sum_{i=1}^{N_1} \sum_{k=1}^{N_2} \left| \widetilde{\mathcal{N}}(x^{(k)}, f^{(i)}; \theta) - u^{(i)}(x^{(k)}) \right|^2. \qquad (6.8)$$
DeepONet has been successfully applied in various fields, spanning heat conduction [53], biomedical applications [120], fracture mechanics [35], multi-scale modeling using elastic and hyper-elastic materials [121], and the response of power grids to part failures or disturbances [70].
6.6 Fourier Neural Operator (FNO)
Fig. 6.6 Computational graph for a feed-forward Fourier Neural Operator (FNO) network
As shown in Fig. 6.6, $v^{(n)}$ and $u^{(n)}$ are the counterparts of $\xi^{(n)}$ and $x^{(n)}$, respectively. Further, since $\xi^{(L+1)}$ was a scalar, we will set $v^{(L+1)}$ to be a scalar-valued function. We are now done with extending the input, the output and the variables in the hidden layers from vectors to functions. Next, we need to extend the operators that transform vectors to vectors within an MLP to those that transform functions to functions within an FNO.
We begin with the operator $A^{(1)}$, which in an MLP is an affine map from a vector with one component to a vector with $H$ components. Its straightforward extension to functions is
$$v^{(1)}(x) = A^{(1)}(a)(x), \qquad (6.11)$$
where
$$v_i^{(1)}(x) = W_i^{(1)} a(x) + b_i^{(1)}, \quad i = 1, \ldots, H. \qquad (6.12)$$
Here $W_i^{(1)}$ and $b_i^{(1)}$ are the weights and biases associated with this layer.
Similarly, in an MLP the operator $A^{(L+1)}$ is an affine map from a vector with $H$ components to a vector with one component. Its straightforward extension to functions is
$$v^{(L+1)}(x) = A^{(L+1)}(u^{(L)})(x), \qquad (6.13)$$
where
$$v^{(L+1)}(x) = W_i^{(L+1)} u_i^{(L)}(x) + b^{(L+1)}, \quad i = 1, \ldots, H. \qquad (6.14)$$
In the equation above the summation over the index $i$ (from 1 to $H$) is implied, and $W_i^{(L+1)}$ and $b^{(L+1)}$ are the weights and the bias associated with this layer.
Next we describe the action of the activation on input functions in each layer. It is a simple extension of the activation function applied to the point-wise values of the input function. That is,
$$u^{(n)}(x) = \sigma(v^{(n)})(x), \qquad (6.15)$$
where
$$u_i^{(n)}(x) = \sigma(v_i^{(n)}(x)), \quad i = 1, \ldots, H. \qquad (6.16)$$
Next, we consider the operator that maps the hidden function $u^{(n)}$ to $v^{(n+1)}$. Component-wise, it is given by
$$v_i^{(n+1)}(x) = W_{ij}^{(n+1)} u_j^{(n)}(x) + b_i^{(n+1)} + \int_{\Omega} \kappa_{ij}^{(n+1)}(y - x) \, u_j^{(n)}(y) \, dy, \quad i = 1, \ldots, H.$$
In the equation above the summation over the dummy index $j$ (from 1 to $H$) is implied. The new term that appears in this equation is a convolution. It is motivated by the observation that a large class of linear operators can be represented as convolutions. An example is the so-called Green's operator, which maps the right hand side (also called the forcing function) of a linear PDE to its solution. The functions $\kappa_{ij}^{(n+1)}(z)$ are called the kernels of the convolution. We note that there are $H^2$ of these functions in each layer.
It is instructive to examine a specific case of a convolution. Let us consider $\Omega = [0, L_1] \times [0, L_2]$, where we denote the two coordinates by either $x_1$ and $x_2$, or $y_1$ and $y_2$. In this case we may write the convolution as
$$v_i(x_1, x_2) = \int_0^{L_1} \int_0^{L_2} \kappa_{ij}(y_1 - x_1, y_2 - x_2) \, u_j(y_1, y_2) \, dy_2 \, dy_1, \quad i = 1, \ldots, H. \qquad (6.20)$$
In the equation above, we have dropped the superscripts since they are not relevant to the discussion.
Remark 6.6.2 It is instructive to list all the trainable entities in an FNO. These are the weights and biases of the affine maps in each layer, along with the kernel functions $\kappa_{ij}^{(n)}$ of the convolutions.
The neural operator introduced in this section acts directly on functions and transforms them into functions. However, when implementing this operator on a computer the functions have to be represented discretely. This is described in the following section.
The functions that appear in the neural operator described in the previous section are:
$$a, \; v^{(1)}, \; u^{(1)}, \; \ldots, \; v^{(L)}, \; u^{(L)}, \; v^{(L+1)}, \; u. \qquad (6.23)$$
Each of these functions is defined on the domain $\Omega$. We discretize this domain with $N$ uniformly distributed points, and represent each function using its values at these points.
As an example, in two dimensions, with $\Omega = [0, L_1] \times [0, L_2]$, we represent the function $a(x_1, x_2)$ as
$$a[m, n] = a(x_{1m}, x_{2n}), \quad 1 \le m \le N_1, \; 1 \le n \le N_2, \qquad (6.24)$$
where
$$x_{1m} = (m - 1) \times \frac{L_1}{N_1 - 1}, \qquad (6.25)$$
$$x_{2n} = (n - 1) \times \frac{L_2}{N_2 - 1}. \qquad (6.26)$$
The discrete counterpart of the first affine map (6.11) is then given component-wise by
$$v_i^{(1)}[m, n] = W_i^{(1)} a[m, n] + b_i^{(1)}, \quad i = 1, \ldots, H. \qquad (6.28)$$
Next we describe the action of the activation function on discretized input functions. It is given by
$$u^{(n)}[m, n] = \sigma(v^{(n)})[m, n], \qquad (6.31)$$
where
$$u_i^{(n)}[m, n] = \sigma(v_i^{(n)}[m, n]), \quad i = 1, \ldots, H. \qquad (6.32)$$
Similarly, the discrete counterpart of the hidden-layer update is
$$v_i^{(p+1)}[m, n] = W_{ij}^{(p+1)} u_j^{(p)}[m, n] + b_i^{(p+1)} + \sum_{r=1}^{N_1} \sum_{s=1}^{N_2} \kappa_{ij}^{(p+1)}[r - m, s - n] \, u_j^{(p)}[r, s] \, h_1 h_2, \quad i = 1, \ldots, H, \qquad (6.35)$$
where $h_1 = \frac{L_1}{N_1 - 1}$ and $h_2 = \frac{L_2}{N_2 - 1}$. Note that the integral in the convolution is now replaced by a sum over all the grid points. Computing this approximation of the integral for each value of $i$ and $m, n$ involves $O(N_1 N_2 H)$ flops. Since this needs to be done for $H$ different values of $i$, $N_1$ values of $m$, and $N_2$ values of $n$, the total cost of the discretized convolution operation is $O(N_1^2 N_2^2 H^2) = O(N^2 H^2)$, where $N = N_1 \times N_2$. The factor of $N^2$ in this cost is not acceptable and makes the implementation of this algorithm impractical. In the following section we describe how the use of Fourier transforms (forward and inverse) overcomes this bottleneck and leads to a practical algorithm. This is also the reason that this algorithm is referred to as a "Fourier Neural Operator".
Consider the Fourier series representation of a function $u$ defined on $\Omega$:
$$u(x_1, x_2) \approx \sum_{m = -N_1/2}^{N_1/2} \sum_{n = -N_2/2}^{N_2/2} \hat{u}[m, n] \, e^{2\pi i \left( \frac{m x_1}{L_1} + \frac{n x_2}{L_2} \right)}. \qquad (6.36)$$
Here $N_1$ and $N_2$ are even integers, the coefficients $\hat{u}[m, n]$ are the Fourier coefficients, and $i = \sqrt{-1}$. We note that while the function $u$ is real-valued the coefficients are complex-valued. However, since $u$ is real-valued, they obey the rule $\hat{u}[-m, -n] = \hat{u}^*[m, n]$, where $(\cdot)^*$ denotes the complex conjugate of a complex number. The coefficients are determined by the Fourier transform
$$\hat{u}[m, n] = \frac{1}{L_1 L_2} \int_0^{L_1} \int_0^{L_2} u(x_1, x_2) \, e^{-2\pi i \left( \frac{m x_1}{L_1} + \frac{n x_2}{L_2} \right)} \, dx_2 \, dx_1. \qquad (6.37)$$
We now describe how Fourier transforms can be used to evaluate the convolution efficiently. To do this we consider the special case of the 2D convolution in (6.20). We begin by substituting the Fourier series representation of $u_j(y_1, y_2)$ from (6.36):
$$\begin{aligned}
v_i(x_1, x_2) &= \int_0^{L_1} \int_0^{L_2} \kappa_{ij}(y_1 - x_1, y_2 - x_2) \sum_{m, n} \hat{u}_j[m, n] \, e^{2\pi i \left( \frac{m y_1}{L_1} + \frac{n y_2}{L_2} \right)} \, dy_2 \, dy_1 \\
&= \sum_{m, n} \hat{u}_j[m, n] \int_0^{L_1} \int_0^{L_2} \kappa_{ij}(y_1 - x_1, y_2 - x_2) \, e^{2\pi i \left( \frac{m y_1}{L_1} + \frac{n y_2}{L_2} \right)} \, dy_2 \, dy_1 \\
&= \sum_{m, n} \hat{u}_j[m, n] \, e^{2\pi i \left( \frac{m x_1}{L_1} + \frac{n x_2}{L_2} \right)} \int_{-x_1}^{L_1 - x_1} \int_{-x_2}^{L_2 - x_2} \kappa_{ij}(z_1, z_2) \, e^{2\pi i \left( \frac{m z_1}{L_1} + \frac{n z_2}{L_2} \right)} \, dz_2 \, dz_1 \\
&= \sum_{m, n} \hat{u}_j[m, n] \, e^{2\pi i \left( \frac{m x_1}{L_1} + \frac{n x_2}{L_2} \right)} \int_0^{L_1} \int_0^{L_2} \kappa_{ij}(z_1, z_2) \, e^{2\pi i \left( \frac{m z_1}{L_1} + \frac{n z_2}{L_2} \right)} \, dz_2 \, dz_1 \\
&= L_1 L_2 \sum_{m, n} \hat{u}_j[m, n] \, \hat{\kappa}_{ij}[-m, -n] \, e^{2\pi i \left( \frac{m x_1}{L_1} + \frac{n x_2}{L_2} \right)}. \qquad (6.38)
\end{aligned}$$
In the development above, in going from the first to the second line we have taken the summation outside the integral and recognized that the coefficients $\hat{u}_j[m, n]$ do not depend on $y_1$ and $y_2$. In going from the second to the third line we have introduced the variables $z_1 = y_1 - x_1$ and $z_2 = y_2 - x_2$. In going from the third to the fourth line we have made use of the fact that the functions $\kappa_{ij}(z_1, z_2)$ are periodic. Finally, in going from the fourth to the fifth line we have made use of the definition of the Fourier transform (6.37). This final relation tells us that the convolution can be computed by:
1. Computing the Fourier transform of $u_j$.
2. Computing the Fourier transform of $\kappa_{ij}$.
3. Computing the product of the coefficients of these two transforms.
4. Computing the inverse Fourier transform of the product.
Next, we account for the fact that we will only work with the discrete forms of the functions $u_j$ and $\kappa_{ij}$. This means that we evaluate the inverse Fourier transform (6.36) at a finite set of grid points. Further, it means that we have to approximate the integral in the Fourier transform (6.37). This alternate form is given by
$$\hat{u}[r, s] = \frac{h_1 h_2}{L_1 L_2} \sum_{m=1}^{N_1} \sum_{n=1}^{N_2} u[m, n] \, e^{-2\pi i \left( \frac{r x_{1m}}{L_1} + \frac{s x_{2n}}{L_2} \right)}. \qquad (6.39)$$
Evaluated directly, these discrete transforms require $O(N^2)$ operations, which makes the method impractical except for when $N$ is very small. However, the use of the Fast Fourier Transform (FFT) reduces this cost to $O(N \log N)$. Thus the cost of implementing the convolution reduces to $O(N \log N H^2)$. This makes the implementation of Fourier Neural Operators practical.
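The following is a sketch of such an FFT-based convolution layer in PyTorch, under our own simplifying assumptions (real 2D fields with $H$ channels, and one trainable complex coefficient per retained Fourier mode); it illustrates the idea rather than any reference implementation.

import torch
import torch.nn as nn

class SpectralConv2d(nn.Module):
    def __init__(self, H, modes1, modes2):
        super().__init__()
        # Trainable kernel coefficients κ̂_ij[m, n] for the retained modes
        scale = 1.0 / (H * H)
        self.weights = nn.Parameter(
            scale * torch.randn(H, H, modes1, modes2, dtype=torch.cfloat))
        self.modes1, self.modes2 = modes1, modes2

    def forward(self, u):
        # u: (batch, H, N1, N2) real-valued hidden functions on the grid
        u_hat = torch.fft.rfft2(u)                # forward FFT: O(N log N)
        out_hat = torch.zeros_like(u_hat)
        m1, m2 = self.modes1, self.modes2
        # Mode-by-mode product of coefficients: v̂_i = Σ_j κ̂_ij û_j
        out_hat[:, :, :m1, :m2] = torch.einsum(
            "bjmn,ijmn->bimn", u_hat[:, :, :m1, :m2], self.weights)
        return torch.fft.irfft2(out_hat, s=u.shape[-2:])  # inverse FFT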
6.7 Variationally Mimetic Operator Network (VarMiON)
The forms of the PDEs considered so far, for instance (6.1), are written in the strong form and assume sufficient regularity of the PDE solution and data. However, it is more practical to consider the weak (variational) form when this regularity is not guaranteed (see, for example, [44]). In [78], an operator network was constructed that mimics the variational form of the PDE, where the architectures of the various sub-components of the network were motivated by the variational structure. Such operator networks are called Variationally Mimetic Operator Networks (VarMiONs). To understand the VarMiON architecture, we first provide some background on the variational form for linear elliptic PDEs and its discrete counterpart.
6.7.1 Background
Variational form Let $\Omega \subset \mathbb{R}^D$ be an open, bounded domain with boundary $\Gamma$. The boundary is further partitioned into the Dirichlet boundary $\Gamma_g$ and the Neumann boundary $\Gamma_q$. Define the space $H_g^1 = \{u \in H^1(\Omega) : u|_{\Gamma_g} = 0\}$, where the Sobolev function space $H^1(\Omega) = \{u \in L^2(\Omega) : \partial_i u \in L^2(\Omega) \; \forall \, 1 \le i \le D\}$. Consider the following model PDE as a canonical example of second-order linear elliptic PDEs:
$$\begin{aligned}
\nabla \cdot (\kappa(x) \nabla u(x)) &= f(x), && \forall \, x \in \Omega, \\
\kappa(x) \nabla u(x) \cdot n(x) &= q(x), && \forall \, x \in \Gamma_q, \\
u(x) &= 0, && \forall \, x \in \Gamma_g,
\end{aligned} \qquad (6.40)$$
Multiplying the PDE by a test function $w \in H_g^1$ and integrating over $\Omega$ gives
$$(w, \nabla \cdot (\kappa \nabla u)) = (w, f), \qquad (6.41)$$
where $(\cdot, \cdot)$ denotes the $L^2(\Omega)$ inner-product. The LHS of this relation becomes
$$\begin{aligned}
\int_\Omega w(x) \left( \nabla \cdot (\kappa(x) \nabla u(x)) \right) dx &= -\int_\Omega \nabla w(x) \cdot (\kappa(x) \nabla u(x)) \, dx + \int_{\partial\Omega} w(x) \kappa(x) \nabla u(x) \cdot n(x) \, dx \\
&= -\int_\Omega \nabla w(x) \cdot (\kappa(x) \nabla u(x)) \, dx + \int_{\Gamma_g} w(x) \kappa(x) \nabla u(x) \cdot n(x) \, dx \\
&\quad + \int_{\Gamma_q} w(x) \kappa(x) \nabla u(x) \cdot n(x) \, dx \\
&= -\int_\Omega \nabla w(x) \cdot (\kappa(x) \nabla u(x)) \, dx + \int_{\Gamma_q} w(x) q(x) \, dx, \qquad (6.42)
\end{aligned}$$
where the final expression above is obtained by using the fact that $w$ vanishes on $\Gamma_g$, and the given Neumann boundary condition in (6.40). We denote the inner-product on $L^2(\Gamma_q)$ as $(\cdot, \cdot)_{\Gamma_q}$ and define the bilinear form
$$a(w, u; \kappa) := -\int_\Omega \nabla w(x) \cdot (\kappa(x) \nabla u(x)) \, dx. \qquad (6.43)$$
Note that $a$ is linear in $w$ and $u$, but need not be linear in $\kappa$, which parametrizes this bilinear form. Combining (6.41), (6.42) and (6.43), the variational problem is defined by: Find $u \in H_g^1$ such that $\forall \, w \in H_g^1$
$$a(w, u; \kappa) + (w, q)_{\Gamma_q} = (w, f). \qquad (6.44)$$
If the solution and PDE data have sufficient regularity, we can recover the strong
form (6.40) from the weak form (6.44). We refer interested readers to [13, 28] for
additional details.
Consider the solution operator
$$\mathcal{N} : \mathcal{F} \times \mathcal{K} \times \mathcal{Q} \to V, \quad \mathcal{N}(f, \kappa, q) := u, \qquad (6.45)$$
which maps the data $(f, \kappa, q)$ to the unique solution $u$ of (6.44). We are interested in approximating $\mathcal{N}$ using a VarMiON. To do so, we need to first describe a discrete framework for (6.44).
Discrete variational form Let us consider the class of numerical solvers for (6.44) that approximate the solution function space $V$ by the space $V^h$ spanned by a finite set of continuous basis functions $\{\phi_i(x)\}_{i=1}^n$. Typical examples include the Finite Element Method (FEM) or a Proper Orthogonal Decomposition (POD) basis approximating the solutions of (6.44) [8, 44]. Then any function $v^h \in V^h$ can be expressed as a linear combination of the finite basis, i.e.,
$$v^h(x) = v_i \phi_i(x) = V^T \Phi(x), \quad V = (v_1, \ldots, v_n)^T, \quad \Phi(x) = (\phi_1(x), \ldots, \phi_n(x))^T.$$
We also define the space $V_q^h = \{v|_{\Gamma_q} : v \in V^h\}$ for the functions on the Neumann boundary.
We project the PDE data $(f, \kappa, q)$ onto these finite dimensional spaces, with the projections given by
$$f^h(x) = F^T \Phi(x) \in V^h, \quad \kappa^h(x) = K^T \Phi(x) \in V^h, \quad q^h(x) = Q^T \Phi(x)|_{\Gamma_q} \in V_q^h, \qquad (6.46)$$
where the coefficients $F, K, Q$ will depend on the choice of the basis functions. If the approximate solution is represented as $u^h(x) = U^T \Phi(x)$, then using (6.46) in (6.44) gives us the following system of equations describing the discrete variational form
$$S(\kappa^h) U = M_1 F + M_2 Q. \qquad (6.47)$$
Solving this system for $U$ defines the discrete solution operator
$$\mathcal{N}^h : V^h \times V^h \times V_q^h \to V^h, \quad \mathcal{N}^h(f^h, \kappa^h, q^h)(x) := u^h(x) = \left( B_1(f^h, \kappa^h) + B_2(q^h, \kappa^h) \right)^T \Phi(x), \qquad (6.49)$$
where $B_1(f^h, \kappa^h) = S^{-1}(\kappa^h) M_1 F$ and $B_2(q^h, \kappa^h) = S^{-1}(\kappa^h) M_2 Q$.
In order to feed $(f, \kappa, q)$ into the VarMiON network, we need to sample the functions at sensor nodes (as was also done for DeepONets). Let $\{y^{(j)}\}_{j=1}^M$ be the sensor nodes in $\Omega$ for $f$ and $\kappa$, while $\{y_b^{(j)}\}_{j=1}^{M'}$ are the sensor nodes on $\Gamma_q$ for $q$. We define the input vectors
$$\widetilde{F} = (f(y^{(1)}), \ldots, f(y^{(M)}))^T, \quad \widetilde{K} = (\kappa(y^{(1)}), \ldots, \kappa(y^{(M)}))^T, \quad \widetilde{Q} = (q(y_b^{(1)}), \ldots, q(y_b^{(M')}))^T. \qquad (6.51)$$
The schematic of the VarMiON architecture used to solve the PDE (6.40) is shown in Fig. 6.7. The network defines the map
$$\widetilde{\mathcal{N}}(\cdot; \theta) : \mathbb{R}^M \times \mathbb{R}^M \times \mathbb{R}^{M'} \to V^\tau, \quad \widetilde{\mathcal{N}}(\widetilde{F}, \widetilde{K}, \widetilde{Q}; \theta)(x) = u(x; \theta) = \left( \beta_1(\widetilde{F}, \widetilde{K}) + \beta_2(\widetilde{Q}, \widetilde{K}) \right)^T \tau(x), \qquad (6.52)$$
where $\theta$ are all the trainable parameters of the VarMiON. Note the similarity between (6.49) and (6.52). In particular, $B_1$ (respectively $B_2$) has the same structure as $\beta_1$ (respectively $\beta_2$). Further, note that the branch networks for $\widetilde{F}$ and $\widetilde{Q}$ are linear, motivated by the linearity of $\mathcal{N}^h$ with respect to $F$ and $Q$. In this sense, the VarMiON mimics the structure of the discrete solution operator; a rough sketch of the map (6.52) is given below.
The VarMiON is trained in a supervised manner similar to the DeepONet, where the training data is generated using a numerical solver (for instance, the same one used to describe the discrete variational form).
1. Sample $N_1$ representative function triplets $(f^{(i)}, \kappa^{(i)}, q^{(i)})$, $1 \le i \le N_1$, from the function space $\mathcal{F} \times \mathcal{K} \times \mathcal{Q}$.
2. Evaluate the values of these $N_1$ triplets at the sensor points (according to (6.51)) to get the branch input vectors $\widetilde{F}^{(i)}, \widetilde{K}^{(i)}, \widetilde{Q}^{(i)}$ for all $1 \le i \le N_1$.
3. For each triplet, determine (numerically) the corresponding solution function $u^{h,(i)}$ given by the operator $\mathcal{N}^h$.
4. Sample the function $u^{h,(i)}$ at $N_2$ points in $\Omega_X$, i.e., $u^{h,(i)}(x^{(j)})$ for $1 \le j \le N_2$.
5. Construct the training set
$$\mathcal{S} = \left\{ \left( (\widetilde{F}^{(i)}, \widetilde{K}^{(i)}, \widetilde{Q}^{(i)}), x^{(j)}, u^{h,(i)}(x^{(j)}) \right) : 1 \le i \le N_1, \; 1 \le j \le N_2 \right\}$$
and train the network by minimizing the loss
$$\Pi(\theta) = \frac{1}{N_1 N_2} \sum_{i=1}^{N_1} \sum_{j=1}^{N_2} \left| \widetilde{\mathcal{N}}(\widetilde{F}^{(i)}, \widetilde{K}^{(i)}, \widetilde{Q}^{(i)}; \theta)(x^{(j)}) - u^{h,(i)}(x^{(j)}) \right|^2. \qquad (6.53)$$
Once trained, given any new $(f, \kappa, q)$ sampled at the sensor points, and a new point $x \in \Omega_X$, we can evaluate the corresponding prediction $u^*(x) = \widetilde{\mathcal{N}}(\widetilde{F}, \widetilde{K}, \widetilde{Q}; \theta^*)(x)$. We define the generalization error
$$E(f, \kappa, q) := \| \mathcal{N}(f, \kappa, q) - \widetilde{\mathcal{N}}(\widetilde{F}, \widetilde{K}, \widetilde{Q}; \theta^*) \|_{L^2(\Omega)}. \qquad (6.54)$$
Let us also define the error between $\mathcal{N}$ and $\mathcal{N}^h$, i.e., the error incurred while approximating the true solution with the numerical solver (used to generate training samples),
$$E^h(f, \kappa, q) := \| \mathcal{N}(f, \kappa, q) - \mathcal{N}^h(f^h, \kappa^h, q^h) \|_{L^2(\Omega)}, \qquad (6.55)$$
and the error between $\mathcal{N}^h$ and the VarMiON,
$$\widetilde{E}(f, \kappa, q) := \| \mathcal{N}^h(f^h, \kappa^h, q^h) - \widetilde{\mathcal{N}}(\widetilde{F}, \widetilde{K}, \widetilde{Q}; \theta^*) \|_{L^2(\Omega)}. \qquad (6.56)$$
Further, for any $(f, \kappa, q)$ and $(f', \kappa', q')$, we define the corresponding change in the true solution as
$$E_{stab}[(f, \kappa, q), (f', \kappa', q')] := \| \mathcal{N}(f, \kappa, q) - \mathcal{N}(f', \kappa', q') \|_{L^2(\Omega)}, \qquad (6.57)$$
and the corresponding change in the VarMiON solution as
$$\widetilde{E}_{stab}[(f, \kappa, q), (f', \kappa', q')] := \| \widetilde{\mathcal{N}}(\widetilde{F}, \widetilde{K}, \widetilde{Q}; \theta^*) - \widetilde{\mathcal{N}}(\widetilde{F}', \widetilde{K}', \widetilde{Q}'; \theta^*) \|_{L^2(\Omega)}. \qquad (6.58)$$
Note that (6.57) and (6.58) describe the stability of the true and VarMiON solution operators, respectively. Now, for any training sample triplet $(f^{(i)}, \kappa^{(i)}, q^{(i)})$, we can use the triangle inequality repeatedly to split the generalization error into four terms (see [78] for details). Thus, a bound for the generalization error can be obtained by determining bounds for each of these four terms.
The authors in [78] obtained a bound of the following form (under certain assumptions):
$$E(f, \kappa, q) \le C \left( \epsilon_h + \epsilon_s + \sqrt{\epsilon_t} + \frac{1}{M^{\alpha/2}} + \frac{1}{(M')^{\alpha'/2}} + \frac{1}{N_2^{\gamma/2}} \right), \qquad (6.60)$$
where $\epsilon_h$ is the bound on the numerical solver error (6.55), $\epsilon_t$ is the VarMiON error on the training set (which is tracked while training), $\epsilon_s$ is a measure of the maximum distance between any $(f, \kappa, q) \in \mathcal{F} \times \mathcal{K} \times \mathcal{Q}$ and the set of $N_1$ samples in the training set, and $\alpha, \alpha', \gamma$ are associated with the discretization of the input and output of the VarMiON (similar to $\alpha$ and $\beta$ arising in the PINNs error estimate (5.23)). The constant $C$ depends on the stability constants associated with the stability of $\mathcal{N}$ and $\widetilde{\mathcal{N}}$.
We make a few comments about (6.60):
• The error estimate reveals the various sources of error. In general, we cannot expect the VarMiON error to be smaller than the error associated with generating the training samples.
• If the training samples are generated by a Galerkin FEM method, then $\epsilon_h$ would correspond to the interpolation error that depends on the element size and the order of the basis elements (see [44] for details).
• We can make $\epsilon_s$ smaller by carefully adding more training samples so that the space $\mathcal{F} \times \mathcal{K} \times \mathcal{Q}$ is better "covered" by these samples. However, we might also need to increase the size of the VarMiON in order to maintain the training error $\epsilon_t$ at the same level.
• Choosing a finer discretization for the inputs and outputs, i.e., larger $M$, $M'$ and $N_2$, would lower the quadrature error (governed by the last three terms in (6.60)).
Remark 6.7.1 It has also been shown in [78] that the inverse of the matrix $S(\kappa^h)$ arising in the discrete variational formulation (6.47) can be approximated in terms of its counterpart $\widetilde{D}(\widetilde{K})$ in the VarMiON.
Remark 6.7.2 A VarMiON-type formulation has also been proposed in [78] for the scalar non-linear advection-diffusion-reaction problem, and has been numerically tested on the regularized Eikonal equation.
6.8 Mesh Graph Networks
The type of operator networks we discuss next are called Mesh Graph Networks (MGNs) [82]. MGNs provide a deep learning framework to operate directly on mesh-based data, commonly used in computational physics. This is done by interpreting mesh-based data as graphs, and using a type of neural network architecture called Graph Neural Networks (GNNs) [6, 96, 125]. GNNs generalize the concept of CNNs to graph-like structures, and are useful and efficient tools for approximating functions that map graphs to graphs.
6.8.1 Background
A graph $G = (V, E)$ consists of a set of $n$ nodes (or vertices) $V$ and a set of edges $E$, where an edge $(i, j) \in E$ indicates that the nodes $i$ and $j$ are connected. This connection can be interpreted in many ways, depending on the problem at hand. If we decide to associate each edge with a weight, representing the strength of the connection, we call it a weighted graph.
A common way to represent a graph is through its adjacency matrix $A$. This is an $n \times n$ matrix whose row and column indices represent the vertices of the graph, and the entries $A_{ij}$ indicate whether or not two nodes are connected. In the case of a simple graph, the entries of $A$ are either 1 or 0, depending on whether two nodes are connected or not. For a weighted graph, the entries are non-negative real numbers that represent the strength of the connections.
We also distinguish between directed and undirected graphs. For directed graphs,
the edges represent one-way, non-symmetric connections. On the other hand, for
undirected graphs the connections are symmetric. We can deduce that for an undi-
rected graph the adjacency matrix is symmetric, whereas this is not necessarily the
case for a directed graph.
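As a small illustration, the adjacency matrix of a toy undirected graph can be assembled as follows (the graph itself is our own example):

import torch

# Undirected, unweighted graph with 4 nodes and edges (0,1), (1,2), (2,3), (0,2)
A = torch.zeros(4, 4)
for i, j in [(0, 1), (1, 2), (2, 3), (0, 2)]:
    A[i, j] = 1.0
    A[j, i] = 1.0  # undirected: the adjacency matrix is symmetric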
It is important to note that a graph is an abstract object that may not have a
physical interpretation. Fundamentally, it is a nonlinear data structure composed of
a collection of nodes that are related to each other in some way. When thinking of a
graph, it is common to imagine a particular embedding, presumably in a Euclidean
space, for the sake of visualizing the graph. Embedding the graph in a Euclidean space
(say the plane) simply means associating each vertex of the graph with a point in the
plane, and connecting with line segments the nodes that are connected to each other.
Every proper embedding preserves the inner topology of the graph. There is nothing
fundamental about any given proper embedding; however, there are embeddings that
are more favorable than others if we want to discover or highlight some properties
of the graph.
One way to study the properties of a graph is to compute and analyze the eigen-
values and eigenfunctions of its adjacency matrix, and other matrices that are derived
from it. This branch of mathematics is called spectral graph theory and has a rich
history of applications in machine learning that include clustering [73, 107], semi-
supervised learning [1, 10, 43], and multi-fidelity modeling [84].
As an example, if we were to use a graph to describe a social network of human beings, then each person could be a node, and their age, weight, interests, and income could be the nodal attributes. The edges could be used to represent whether two persons know and interact with each other, and the elements of the adjacency matrix could denote the strength of this connection.
Alternatively, if we were to use a graph to define the state of an elastic solid, then each material point can be a node whose features include the local loading, the stress state, and the material properties, whereas the edges can incorporate the physical distance among nodes.
Mesh-based simulation in computational physics In a typical problem in computational physics, a mathematical model is usually used to describe the behavior or the evolution of a system in terms of a set of PDEs, defined over a specific domain in space and time. Solving these differential equations analytically is often not possible. In these cases, numerical methods are employed to solve these equations. These include the finite difference method, the finite volume method and the finite element method. In all these methods, the space-time domain is discretized into smaller sub-domains that are used to solve the PDE approximately. The data structure that represents this discretization is often called the "mesh", and consists of nodes, volumetric elements, faces and edges. Once the nodes are numbered, the information regarding the edges is contained in a connectivity matrix which determines which nodes in the computational mesh are connected to each other.
Computational meshes may be structured or unstructured. In three spatial dimensions, a structured mesh is composed of cuboids that divide the physical domain in a regular, grid-like fashion. This mesh is very efficient to work with since its representation benefits from a one-to-one correspondence with a simple Euclidean grid. This means that the connections among nodes can be easily recovered from their numbers and do not require special data structures. On the other hand, unstructured meshes cannot be mapped back to a Euclidean grid and are generally composed of elements of different shapes that include tetrahedrons, wedges, prisms and hexahedrons. Unlike structured meshes, unstructured meshes require special data structures to store information about how different entities are connected to each other. However, they can easily represent complex physical domains.
Based on the descriptions of a computational mesh and a graph, it should be clear to the reader that these two concepts are closely related to each other. In fact, it is easy to see that one can represent the nodes and the edges in a computational mesh as a graph $M = (V, E)$. In this graph, the physical quantities defined at the nodes (for example, pressure and velocity for the incompressible Navier-Stokes equations) can be thought of as the attributes of the node. Furthermore, the information contained in the connectivity matrix can be used as the adjacency matrix. For reasons that will become clearer, a bidirectional graph is preferable for this purpose, implying that the connections between the nodes are not necessarily symmetric.
All vertices and edges are equipped with a set of features, denoted by $V_i \in \mathbb{R}^{d_V}$ and $E_{ij} \in \mathbb{R}^{d_E}$. The vertex features, $V_i$, include all the physical quantities that characterize the input state. The edge features, $E_{ij}$, can include geometrical properties of the nodes, and are used to communicate information about the physical domain to the model. For example, these could be the absolute or the relative coordinates of the vertices, that is $E_{ij} = (x_i, x_{ij})$, where $x_{ij} = x_i - x_j$.
The map from the input to the output graph is accomplished via the three steps
described below.
Encoding The first step involves encoding all the node and edge features into a latent space using the encoders $\epsilon^E$ and $\epsilon^V$. That is,
$$e_{ij} = \epsilon^E(E_{ij}; \theta_{\epsilon^E}), \qquad (6.61)$$
$$v_i = \epsilon^V(V_i; \theta_{\epsilon^V}). \qquad (6.62)$$
Here $\epsilon^E(\cdot; \theta_{\epsilon^E})$ and $\epsilon^V(\cdot; \theta_{\epsilon^V})$ are MLPs with trainable parameters $\theta_{\epsilon^E}$ and $\theta_{\epsilon^V}$. We note that the dimension of the latent vector is usually higher than the number of features. Thus the encoders embed the node and edge features into a higher-dimensional space.
Processing This phase is the core of the MGN algorithm. It consists of $L$ identical message passing steps, which iteratively update the node and edge features for an entity based on its neighbors' features. Each step is accomplished by a neural network with trainable parameters, and the steps are applied iteratively. For $l \in \{1, \ldots, L\}$, this operation is given by
$$e_{ij}^{(l)} = f_E^{(l)}(e_{ij}^{(l-1)}, v_i^{(l-1)}, v_j^{(l-1)}; \theta_{f_E}^{(l)}), \qquad (6.63)$$
$$e_i^{(l)} = \sum_{j \in N(i)} e_{ij}^{(l)}, \qquad (6.64)$$
$$v_i^{(l)} = f_V^{(l)}(v_i^{(l-1)}, e_i^{(l)}; \theta_{f_V}^{(l)}). \qquad (6.65)$$
Decoding Finally, the node latent vectors produced by the last message passing step are mapped back to physical output features by a decoder $\delta_V$, which is also an MLP. A sketch of one message passing step is given below.
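The following PyTorch sketch implements one message passing step (6.63)-(6.65), assuming $f_E$ and $f_V$ are MLPs and the graph is stored as an edge list; the interface and names are our own choices.

import torch

# One MGN message passing step.
# v: (num_nodes, H) node latents; e: (num_edges, H) edge latents
# edges: (num_edges, 2) integer tensor of (node i, node j) pairs
def message_passing_step(f_E, f_V, v, e, edges):
    i, j = edges[:, 0], edges[:, 1]
    e_new = f_E(torch.cat([e, v[i], v[j]], dim=-1))   # edge update (6.63)
    agg = torch.zeros_like(v)
    agg.index_add_(0, i, e_new)                       # sum over neighbors (6.64)
    v_new = f_V(torch.cat([v, agg], dim=-1))          # node update (6.65)
    return v_new, e_new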
In this section we describe how the MGN is trained. For this it is worth remembering that an MGN is a graph-to-graph model; that is, it takes a graph as input and generates another graph as output.
The neural networks that comprise an MGN are the encoders $\epsilon^E, \epsilon^V$, the $L$ pairs of update networks $f_E^{(l)}, f_V^{(l)}$, $l = 1, \ldots, L$, and the decoder $\delta_V$. Let us denote the set of all the weights and biases of these networks as $\theta = [\theta_{\epsilon^V}, \theta_{\epsilon^E}, \theta_{f_E}^{(1)}, \theta_{f_V}^{(1)}, \ldots, \theta_{f_E}^{(L)}, \theta_{f_V}^{(L)}, \theta_{\delta_V}]$. Then, given a set of observed input and output graphs, training an MGN amounts to finding the optimal set of parameters $\theta^*$ that minimize the discrepancy between the predicted and the observed output graphs. The discrepancy between two graphs can be defined by a mean measure of the difference between the attributes of their nodes. The type of observed input and output graph depends upon the problem we wish to solve. In this context, it is useful to consider steady state and time-dependent problems separately.
Steady state problems In this case, the training data comprises input functions and the corresponding solutions sampled at the nodes of the mesh,
$$\mathcal{S} = \{(a^{(k)}, u^{(k)}) : k = 1, \ldots, N\},$$
where $a^{(k)}$ and $u^{(k)}$ store the values of the input function and final solution at every grid point of the mesh, respectively.
For an MGN, a graph representation of these data is needed. To accomplish this we define graphs for the input functions $M_a^{(k)}$ and for the output solutions $M_u^{(k)}$, so the graph-pairs can be assembled as
$$\mathcal{S}_M = \{(M_a^{(k)}, M_u^{(k)}) : k = 1, \ldots, N\}.$$
We note that the attribute types of the nodes of $M_a$ are different from those of the nodes of $M_u$. For example, in the case of the heat equation, the attributes for the $i$-th node of the input graph $M_a$ are the values of the source term, $f_i$, or the conductivity, $\kappa_i$, at that node, whereas the attribute at the same node for the output graph $M_u$ is the value of the temperature attained at that node. Similarly, for the elasticity problem, the nodal attributes of the input graph can be the values of the body force or material properties at that node, and the nodal attributes of the output graph can be the values of the displacement and stress fields at that node.
The loss function for the MGN is given by
$$\Pi(\theta) = \frac{1}{N |V|} \sum_{k=1}^{N} \sum_{i \in V} |V_i^{(k)} - \hat{V}_i^{(k)}|^2. \qquad (6.67)$$
Here, given the $i$-th node and the $k$-th input-output graph pair, $V_i^{(k)}$ are the observed nodal features of the output graph, and $\hat{V}_i^{(k)}$ are the nodal features of the output graph predicted by the MGN while using the corresponding input graph as input. The loss described above is the mean-square loss, though other loss terms, like the mean absolute difference, are also possible. Some regularization of the type described in Sect. 2.5.1 is also typically added. Finally, training the MGN corresponds to finding $\theta^* = \arg\min_\theta \Pi(\theta)$. Once the training is complete, the MGN model can be used to make predictions for a new instance of an input function $a$, by using the corresponding input graph $M_a$ as input.
Time dependent problems When dealing with time dependent problems, the goal is usually to predict the state of a system $u(t, x)$, $x \in \Omega_X$, at any instant in the future $t > t_0$, given the initial condition $u(t_0, x)$. The application of an MGN to this problem relies on using the MGN auto-regressively. That is, the MGN is trained to take the solution at a given time $t$ as input to predict the solution at $t + \Delta t$. Once trained, this MGN can be applied recursively to march the solution over multiples of $\Delta t$. The training of this type of MGN is described next.
The first step involves solving for the evolution of the system of interest for $N$ time steps $t_k = k \Delta t$, $k = 1, \ldots, N$. This requires the generation of a computational mesh to discretize the domain $\Omega_X$, and the use of a numerical solver to integrate the equations in time. The result is stored in a collection of vectors,
$$\mathcal{S} = \{u^{(k)} : k = 0, \ldots, N\},$$
where $u_i^{(k)} = u(t_k, x_i)$ is the solution at time $t_k$ at the $i$-th node whose coordinates are denoted by $x_i$.
The next step involves building a graph representation for this data set. This is done by using the nodes of the mesh to define the vertices of the graph, the vectors $u^{(k)}$ to define the nodal attributes, and the connectivity of the mesh to define the edges. The result is a finite set of graphs $\{M^{(0)}, \ldots, M^{(N)}\}$, with $M^{(k)} = M(t_k) = (V^k, E^k)$. This set can also be interpreted as a sequence of snapshots at different time stamps $t_k$ of an evolving graph, where the attributes of each node are allowed to change with time.
Given this, the MGN can be thought of as the operator that maps $M^{(k-1)} \to M^{(k)}$ for any value of $k$. We note that it is the same MGN that is applied for all values of $k$. To train this MGN we build the input-output pairs of graphs
$$\mathcal{S}_M = \{(M^{(k-1)}, M^{(k)}) : k = 1, \ldots, N\}.$$
The loss function for the MGN is given by (6.67) where, once again, given the $i$-th node and the $k$-th input-output graph pair, $V_i^{(k)}$ are the observed nodal features of the output graph, and $\hat{V}_i^{(k)}$ are the nodal features of the output graph predicted by the MGN while using the corresponding input graph as input. The MGN is trained by finding the values of the network parameters that minimize this loss.
Once the MGN is trained it is applied to increment the solution from time $t$ to $t + \Delta t$. By considering the output of the MGN at $t + \Delta t$ as the input for the next application of the MGN, we can compute the solution at $t + 2\Delta t$. These recursive steps can be applied several times to advance the solution over a substantial time interval, as in the sketch below. As an example, consider the compressible Navier-Stokes equations solved in a complex but fixed domain. The solution involves determining the pressure ($p$), density ($\rho$), and the velocity field ($u$) at every node on the mesh at every time step.
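A minimal sketch of this recursive rollout follows, assuming a trained model mgn that maps the graph state at time $t$ to the state at $t + \Delta t$ (the interface is our own):

# Apply the trained MGN recursively: the output at t + Δt becomes
# the input for the next step.
def rollout(mgn, state0, n_steps):
    states = [state0]
    for _ in range(n_steps):
        states.append(mgn(states[-1]))
    return states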
Thus after $N$ such time steps we have $N + 1$ graphs $\{M^{(0)}, \ldots, M^{(N)}\}$, where the attributes of node $i$ at time step $k$ are $V_i^{(k)} = (p_i^k, \rho_i^k, u_i^k)$.
Remark 6.8.1 In the original paper [82], for the output graph, the nodal attributes are the time derivatives of the physical quantities, denoted by $\dot{V}_i^{(k)}$. These are integrated in time with an explicit Euler scheme to obtain the desired physical quantities, $V_i^{(k+1)} = V_i^{(k)} + \Delta t \, \dot{V}_i^{(k)}$.
Remark 6.8.2 It is worth noting that for time-dependent problems, a trained MGN
works only for the fixed time step size that was used to generate the training data.
Remark 6.8.3 In this chapter we described an MGN that was applied to a problem with a mesh that did not change with time. However, for some problems, like those based on a Lagrangian [9, 42], or an Arbitrary Lagrangian Eulerian (ALE) description [24, 41], the mesh itself evolves with time. This is particularly useful for modeling moving material interfaces. The MGN framework can also be applied to such problems. In fact, the original MGN reference [82] contains the appropriate approach to solve these types of problems.
Remark 6.8.4 In order to distinguish boundary nodes from interior nodes, one may
introduce a simple one-hot-encoded nodal attribute and use it in the MGN formulation
(Fig. 6.8).
6.9 Computational Exercise: Deep Operator Networks (DeepONets)
In this exercise you will implement, train and test a DeepONet. The DeepONet will map a function defined on the domain $\Omega = [0, 1]$ to another function also defined on the same domain. The input function is the initial condition to the viscous Burgers equation, and the output function is the solution at $t = 1$.
The architecture of the DeepONet is shown in Fig. 6.9. The branch network will map a vector $a \in \mathbb{R}^k$ to another vector $b \in \mathbb{R}^p$. The trunk network will map $x \in \mathbb{R}$ to another vector $\tau \in \mathbb{R}^p$. The output of the network will be the dot product $u = b \cdot \tau$.
Load them in your notebook, determine the dimensions of the arrays in these files and explain why each dimension makes sense. Using this information, determine the value of $k$ (see Fig. 6.2) for this problem. Create the training dataset for the DeepONet using the class CreateDataset below by combining all these arrays:
import torch
from torch.utils.data import Dataset

class CreateDataset(Dataset):
    def __init__(self, branch_input, trunk_input, output):
        self.branch_input = torch.Tensor(branch_input)
        self.trunk_input = torch.Tensor(trunk_input)
        self.output = torch.Tensor(output)

    def __len__(self):
        return len(self.branch_input)

    def __getitem__(self, idx):
        # Return the (branch input, trunk input, target) triplet for sample idx
        return self.branch_input[idx], self.trunk_input[idx], self.output[idx]
2. Train the DeepONet with the architecture shown in Fig. 6.2. Within the training loop, define the branch and trunk networks using the MLP class from Chap. 2. Then generate the final model output by computing the dot product of the output from these networks. Pay close attention to the following:
(a) Use $L = 6$, $H = 20$, $p = 25$ and tanh activation.
(b) Use
We have provided 5 test cases. This means that you are required to generate $5 \times 2$ figures for the test data.
Chapter 7
Generative Deep Learning
Generative AI has captured the interest of a large section of our society of late. This interest has been created by the impressive results obtained by AI tools like ChatGPT, DALL-E, and Stability.ai. While the application domains for these tools differ significantly, and the details of the algorithms within them are also different, they share a common feature in that they are generative algorithms. Although we may have some intuitive feel for what the word generative implies, it is worth defining it with some rigor in order to make progress in understanding these algorithms. This is the main focus of this chapter: to provide the reader with the background required to understand the broad principles behind generative algorithms, and also to give them some specific examples of a few (and by no means all) such algorithms.
For our purpose, we group generative algorithms into two broad classes. The first class is of "pure generative" algorithms. These are algorithms that are trained on samples of data drawn from an underlying, unknown, probability distribution, and once trained they generate new samples that are also drawn from the same distribution. For a simple example, consider a dataset like CelebA [60] which comprises RGB images of celebrities. Then one can consider developing a generative algorithm which uses all these images for training, and then during execution will produce an image which will look remarkably like a human face, but will not be the same as one of the faces used for training the algorithm. Or perhaps consider the following example which stems from physics and engineering. Consider an algorithm that is trained on many images of the micro-structure of a binary composite material, and then produces new microstructural images that are different and yet look a lot like the training images.
7.2 Introductory Concepts in Probability
We begin our discussion with the definition of a random variable. A random variable $X$ attains values on the real line and comes equipped with a cumulative distribution function. We note that formally a random variable is associated with a sample space, event class and the probability law for a random event, and is defined as a function that maps the sample space of the random events to the real line. This association equips the random variable with a cumulative distribution function, and in turn equips the cumulative distribution function with certain properties that are defined in the following section.
In order to make things precise let us define some examples of random variables (RVs):
1. Consider a switch that can either be "on" or "off". Further, the probability of "on" is $p$, and the probability of "off" is $1 - p$, with $p \in [0, 1]$. Then we may define the RV
$$X = \begin{cases} 0 & \text{if switch = off} \\ 1 & \text{if switch = on.} \end{cases} \qquad (7.1)$$
3. Now consider an unbiased spinner that stops at any angle $\psi$ in the interval $(0, 2\pi]$ with equal probability. We define an RV to be equal to the value of this angle divided by $2\pi$. That is,
$$X = \frac{\psi}{2\pi}.$$
We note that the first two examples correspond to a discrete RV whereas the third example is that of a continuous RV. Next we define the cumulative distribution function and also list its properties.
The cumulative distribution function (cdf) of an RV $X$ is defined as
$$F_X(x) = P[X \le x],$$
which defines a probability on $\mathbb{R}$ of $X$ taking values in the interval $(-\infty, x]$. Let us define the cdf for the above examples:
1. For the Bernoulli RV defined by (7.1), the cdf is $F_X(x) = (1 - p) H(x) + p \, H(x - 1)$, where $H$ is the Heaviside step function.
Let us discuss some properties of $F_X$. We note that these properties are inherited by the cdf through its association with an underlying random event.
1. $0 \le F_X(x) \le 1$.
2. $\lim_{x \to \infty} F_X(x) = 1$.
3. $\lim_{x \to -\infty} F_X(x) = 0$.
4. $F_X$ is monotonically increasing.
5. The cdf is always continuous from the right.
Note that the $F_X$ for discrete RVs (see Fig. 7.1) are discontinuous at finitely many $x$. In fact, the cdf for discrete RVs can be written as a finite sum of the form
$$F_X(x) = \sum_{k=1}^{K} p_k H(x - x_k), \quad \sum_{k=1}^{K} p_k = 1,$$
where $H$ is the Heaviside step function and $x_k$ are the values that the RV can attain.
Remark 7.2.1 Once we have the $F_X$ we can calculate the probability that $X$ will take values in "any" interval in $\mathbb{R}$, i.e., we can compute $P[a < X \le b]$. Note that $P[X \le b] = P[X \le a] + P[a < X \le b]$. Thus,
$$P[a < X \le b] = F_X(b) - F_X(a).$$
The probability density function (pdf) is obtained by differentiating the cdf,
$$f_X(x) = \frac{d F_X(x)}{dx}. \qquad (7.4)$$
The pdf enjoys the following properties that it inherits from the cdf:
1. $f_X(x) \ge 0, \; \forall x \in \mathbb{R}$, since $F_X$ is monotonically increasing.
2. $\lim_{x \to -\infty} f_X(x) = \lim_{x \to \infty} f_X(x) = 0$.
3. Integrating (7.4) from $(-\infty, x]$ gives us
$$\int_{-\infty}^{x} f_X(y) \, dy = F_X(x) - \lim_{x \to -\infty} F_X(x) = F_X(x).$$
4. Also,
$$P[a < X \le b] = F_X(b) - F_X(a) = \int_{-\infty}^{b} f_X(y) \, dy - \int_{-\infty}^{a} f_X(y) \, dy = \int_{a}^{b} f_X(y) \, dy.$$
Thus, the integral of a pdf over an interval gives the "probability mass", which is the probability that the RV lies in that interval. This is the reason why the pdf is called a "density".
5. A very important property is that a pdf always integrates to 1,
$$\int_{-\infty}^{\infty} f_X(y) \, dy = \lim_{x \to \infty} F_X(x) - \lim_{x \to -\infty} F_X(x) = 1.$$
That is, we can interpret the pdf at a point $x$ as a measure of the likelihood that the random variable will attain a value in a small neighborhood of $x$. Also, note that as $h \to 0^+$, $P[a < X \le a + h] \to 0$. That is, for a continuous RV the probability of attaining a single value is zero.
A few remarks are in order. First, for a discrete RV the pdf contains a sum of
Dirac-distributions and has to be interpreted in the sense of distributions. Second, in
the following development we will work almost exclusively with continuous RVs and
therefore we will assume the existence of a well-defined pdf. Finally, when working
with generative models, we will find it much more convenient to work with pdfs
than cdfs. Thus, it is recommended that the reader familiarizes themselves with the
concept of a pdf before proceeding further.
We now look at some important random variables and the associated cdf and pdfs
(also see Fig. 7.2). We note that an RV is completely defined by either its cdf or pdf.
Thus defining an RV entails defining the functional form of either of these functions.
1. Uniform RV: As the name suggests, for a uniform RV the pdf is uniformly distributed over a finite interval. That is, for some interval $(a, b]$, the pdf is given by
$$f_X(x) = \begin{cases} \frac{1}{b - a} & \text{if } x \in (a, b] \\ 0 & \text{otherwise.} \end{cases}$$
2. Exponential RV: For this RV the pdf is obtained by differentiating the cdf,
$$f_X(x) = \frac{d}{dx} F_X(x) = \lambda e^{-\lambda x}.$$
3. Gaussian RV: This RV is used to model the distribution of naturally occurring quantities like height, weight, etc. within a population. In fact, through the Central Limit Theorem, this is also the distribution given by an aggregate of many RVs. In this case, the pdf is given by
$$f_X(x) = \frac{1}{\sqrt{2\pi}\sigma} e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2}, \qquad (7.5)$$
which is parameterized by the parameter $\mu$, which denotes the center of this distribution, and the parameter $\sigma$, which denotes its spread. We will shortly see that $\mu$ turns out to be the mean of the distribution and $\sigma^2$ turns out to be the variance. The Gaussian pdf (7.5) is often concisely represented by the symbol $N(\mu, \sigma)$.
The mean $m$ of an RV is the value about which the first moment of the pdf vanishes, that is $E[m - X] = 0$. This implies
$$0 = \int_{-\infty}^{\infty} (m - x) f_X(x) \, dx = m \int_{-\infty}^{\infty} f_X(x) \, dx - \int_{-\infty}^{\infty} x f_X(x) \, dx \implies \int_{-\infty}^{\infty} x f_X(x) \, dx = m.$$
Using this property, we can easily say the mean for a uniform RV is $(a + b)/2$, while for a Gaussian RV it is $\mu$.
• $E[c] = c$ for a constant $c$.
• We can calculate the expected value of functions of RVs as
$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x) \, dx.$$
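As a quick numerical illustration (our own example), the expected value can be estimated by Monte Carlo sampling; for a uniform RV on $(0, 1)$ and $g(x) = x^2$, the exact answer is $1/3$.

import torch

x = torch.rand(100_000)        # samples of a uniform RV on (0, 1)
estimate = (x ** 2).mean()     # Monte Carlo estimate of E[X²] ≈ 1/3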
The variance of an RV measures its spread about the mean and is defined as $\text{VAR}[X] = E[(X - E[X])^2]$. For a uniform RV,
$$\text{VAR}[X] = \int_{a}^{b} \left( x - \frac{b + a}{2} \right)^2 \frac{1}{b - a} \, dx = \frac{(b - a)^2}{12}.$$
For a Gaussian RV, we first use the property that the pdf integrates to unity to write
$$\int_{-\infty}^{\infty} e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^2} dx = \sqrt{2\pi}\sigma.$$
Recall that we defined a random variable as an entity that attains values on the real line and comes equipped with a cdf (or alternately a pdf). Similarly, we may define a random vector as an entity that attains values in $\mathbb{R}^d$, where $d$ is a positive integer, and comes equipped with a joint cdf, or a joint pdf. We will find it very useful to work with joint pdfs rather than joint cdfs and therefore we will almost exclusively focus on defining random vectors through their joint pdfs. We also note that just like a random variable, a random vector is also associated with a sample space, event class and the probability law for a random event. It is defined as a function that maps the sample space of the random event to $\mathbb{R}^d$. In that sense, a random variable is a special case of a random vector with $d = 1$.
To consider a simple example, let's return to the spinner. We now spin the spinner twice and measure $\psi_1 \in (0, 2\pi]$, $\psi_2 \in (0, 2\pi]$. In this case, we can define a random vector $X$, where
$$X_1 = \frac{\psi_1}{2\pi}, \quad X_2 = \frac{\psi_2}{2\pi}.$$
Note that for this random vector $d = 2$.
We now describe two important classes of random vectors that find common use in generative models. These are joint uniform random variables and joint Gaussian random variables. In each case we "define" the random vector by presenting an analytical expression for the joint pdf and describing the parameters that appear in it.
1. Joint uniform RV: Consider the $d$-dimensional box formed by the tensor product of the intervals $(a_i, b_i]$, $i = 1, \ldots, d$, with $b_i = a_i + \delta_i$, $\delta_i > 0$. Then the joint pdf for a random vector with uniform distribution in this box is given by
$$f_X(x) = \begin{cases} \prod_{i=1}^{d} \frac{1}{\delta_i} & \text{if } x \in \text{the box} \\ 0 & \text{otherwise.} \end{cases} \qquad (7.6)$$
2. Joint Gaussian RV: The joint pdf for joint Gaussian RVs is given by
$$f_X(x) = \frac{1}{(2\pi)^{d/2} \sqrt{\det(\Sigma)}} \exp\left[ -\frac{1}{2} (x - \mu)^T \Sigma^{-1} (x - \mu) \right], \qquad (7.7)$$
where $\mu \in \mathbb{R}^d$ is the mean and $\Sigma \in \mathbb{R}^{d \times d}$ is the covariance matrix; the covariance between the components $X_i$ and $X_j$ is given by $\Sigma_{ij}$.
Fig. 7.3 Contours of a bivariate normal distribution for different values of correlation .ρ
Often, the joint Gaussian pdf is represented as $N(\mu, \Sigma)$, since $\mu$ and $\Sigma$ completely define the pdf.
It is useful to consider the special case of $d = 2$, where the covariance matrix can be written as
$$\Sigma = \begin{bmatrix} \sigma_1^2 & \rho \sigma_1 \sigma_2 \\ \rho \sigma_1 \sigma_2 & \sigma_2^2 \end{bmatrix}.$$
We can interpret $\sigma_1^2$ and $\sigma_2^2$ as the variances along the $X_1$ and $X_2$ coordinates, and $\rho$ as a parameter that represents the correlation between these variables. In Fig. 7.3, we have plotted contours of this pdf for different values of $\rho$ with $\sigma_1, \sigma_2 = 1$ and $\mu_1, \mu_2 = 0$. We can see that if $\rho \ne 0$, the values of one variable are correlated with those of the other.
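As an illustration (with arbitrary numbers of our choosing), samples from a joint Gaussian $N(\mu, \Sigma)$ can be generated using the Cholesky factor $L$ of $\Sigma$, since $x = \mu + L z$ with $z \sim N(0, I)$ has covariance $L L^T = \Sigma$:

import torch

mu = torch.tensor([0.0, 0.0])
Sigma = torch.tensor([[1.0, 0.5],
                      [0.5, 1.0]])   # σ1 = σ2 = 1, ρ = 0.5
L = torch.linalg.cholesky(Sigma)
z = torch.randn(1000, 2)
x = mu + z @ L.T                     # 1000 samples from N(μ, Σ)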
Consider the function $g(X)$, which can be scalar-, vector-, or tensor-valued; its expected value is given by
$$E[g(X)] = \int_{\mathbb{R}^d} g(x) f_X(x) \, dx.$$
In particular, the mean of the random vector is $\mu_X = E[X]$ and its covariance matrix is $\text{COV}[X] = E[(X - \mu_X) \otimes (X - \mu_X)]$. Here $\otimes$ represents the outer product between two vectors. The $i$-th diagonal component of the covariance matrix represents the variation in the random variable $X_i$. The off-diagonal components of the covariance matrix represent the correlation between the two corresponding components of the random vector. The sign of the off-diagonal component determines the sign of this correlation. If the off-diagonal component is zero, then the two components are said to be uncorrelated.
Using the definitions of the mean and the covariance, and the expression for the joint pdf of a Gaussian random vector (7.7), it is easily verified that for a Gaussian random vector the mean is $\mu_X = \mu$ and the covariance is $\text{COV}[X] = \Sigma$.
Given the joint pdf $f_{XY}$ of a pair of random vectors $X \in \mathbb{R}^d$ and $Y \in \mathbb{R}^D$, we can define
$$f_Y(y) = \int_{\mathbb{R}^d} f_{XY}(x, y) \, dx.$$
This is the marginal pdf for $Y$. It is easy to show that the function defined above satisfies all the properties of a pdf. For example, we note that, as required, it does integrate to unity,
$$\int_{\mathbb{R}^D} f_Y(y) \, dy = \int_{\mathbb{R}^D} \int_{\mathbb{R}^d} f_{XY}(x, y) \, dx \, dy = \int_{\mathbb{R}^{d+D}} f_{XY}(x, y) \, dx \, dy = 1,$$
where the last equality is obtained from the property of the joint pdf $f_{XY}$.
Next we consider the question: what does the distribution of $Y$ look like when we know that $X = x$, a specified value? An intuitive answer might be to consider the joint pdf as a function of $y$ at a fixed value of $x$. This is correct up to a multiplicative constant, which is needed because the joint pdf at a fixed value of $x$ does not integrate to unity. Thus the correct answer is the conditional distribution, which is denoted by $f_{Y|X}(y|X = x)$, or in a slightly simpler notation by $f_{Y|X}(y|x)$. This is given by
$$f_{Y|X}(y|X = x) = \frac{f_{XY}(x, y)}{f_X(x)}.$$
We note that in the expression above the term in the numerator is the joint pdf while the term in the denominator is the marginal distribution evaluated at $x$. Integrating this expression over $y$ on both sides, the numerator on the right hand side reduces to the marginal distribution at $x$, which is equal to the denominator, and thus the integral evaluates to unity, as it should for a pdf.
Similarly, we can define a distribution for $X$ given $Y = y$, a specified value. This is given by
$$f_{X|Y}(x|Y = y) = \frac{f_{XY}(x, y)}{f_Y(y)}.$$
Combining the two relations above, we can write a direct relation between the two conditional distributions,
$$f_{Y|X}(y|X = x) = \frac{f_{X|Y}(x|Y = y) \, f_Y(y)}{f_X(x)}.$$
This is the celebrated Bayes rule, which finds many applications in solving probabilistic inference problems.
We end this brief discussion of introductory concepts in probability by hearkening back to the regression problem we considered in Chap. 2. There, we were given many samples of the vectors $(x, y)$ and, using these, we trained a network that could then be used to determine the value of $y$ for a specified value of $x$. Now we are interested in the probabilistic version of this problem. Once again, we are given many samples of the random vector $(X, Y)$. However, instead of using these to train a network that will yield one value of $Y$ for a specific $X = x$, we want to train a network that would generate samples of $Y$ from the conditional distribution $f_{Y|X}(y|X = x)$. This is the conditional generative problem that we discussed in the introduction to this chapter. We will return to it in Sect. 7.4. In the next section, we first address the simpler pure generative problem.
7.3 Pure Generative Problem
We now consider the pure generative problem and describe an algorithm to solve it. We are working with a random vector $X$ with $d$ components and a pdf given by $f_X$. This pdf is unknown to us. Instead, we are given a dataset of realizations $\{x_i\}$ drawn from it.
7.3.1 GANs
GANs were first proposed by Goodfellow et al. [33] in 2014. Since then, many variants of GANs have been proposed, which differ based on the network architecture and the objective function used to train the GAN. We begin by describing the abstract problem setup followed by the architecture and training procedure of a GAN.
Consider the dataset $\mathcal{S} = \{x_i \in \Omega_X \subset \mathbb{R}^d : 1 \le i \le N_{train}\}$. We assume that $\mathcal{S}$ consists of realizations of some random vector $X$ with density $f_X$, i.e., $x_i \sim f_X$. We want to train a GAN to learn $f_X$ and generate new samples from it.
A GAN typically comprises two sub-networks, a generator and a discriminator (or critic). The generator is a network of the form
$$g(\cdot; \theta_g) : \Omega_Z \to \Omega_X, \quad g : z \mapsto x, \qquad (7.8)$$
with trainable parameters $\theta_g$, which maps realizations of a latent random vector $Z \in \Omega_Z$ with density $f_Z$ to samples in $\Omega_X$. The architecture of the generator will depend on the shape of $X$. The critic, on the other hand, is a network of the form
$$c(\cdot; \theta_d) : \Omega_X \to \mathbb{R}, \qquad (7.9)$$
with the trainable parameters $\theta_d$. Once again, the critic architecture will depend on the shape of $X$. If $X$ is a vector then $c$ can be an MLP with input dimension $d$ and output dimension 1. If $X$ is an image, then the critic architecture will have a few convolution layers, followed by a flattening layer and a number of fully connected layers. This is similar to the CNN architecture shown in Fig. 4.7 but with a scalar output and without an output function.
The schematic of the GAN along with the inputs and outputs of the sub-networks is shown in Fig. 7.4. The generator and critic play adversarial roles. The critic is trained to distinguish between true samples coming from $f_X$ and fake samples generated by $g$ with the density $f_X^g$. The generator, on the other hand, is trained to fool the critic by trying to generate realistic samples, i.e., samples similar to those sampled from $f_X$.
We define the objective function describing a Wasserstein GAN (WGAN) [5], which has better robustness and convergence properties compared to the original GAN. The objective function is given by
$$\Pi(\theta_g, \theta_d) = \underbrace{\frac{1}{N_{train}} \sum_{i=1}^{N_{train}} c(x_i; \theta_d)}_{\text{critic value on real samples}} - \underbrace{\frac{1}{N_{train}} \sum_{i=1}^{N_{train}} c(g(z_i; \theta_g); \theta_d)}_{\text{critic value on fake samples}}, \qquad (7.10)$$
where $x_i \in \mathcal{S}$ are samples from the true target distribution $f_X$, while $z_i \sim f_Z$ are passed through $g$ to generate the fake samples. To distinguish between true and fake samples, the critic should attain large positive values when evaluated on real samples and large negative values on fake generated samples. Thus, the critic is trained to maximize the objective function. In other words, we want to solve the problem
$$\theta_d^*(\theta_g) = \arg\max_{\theta_d} \Pi(\theta_g, \theta_d). \qquad (7.11)$$
Note that the optimal parameters of the critic will depend on $\theta_g$. Now, to fool the critic, the generator $g$ tries to minimize the objective function,
$$\theta_g^* = \arg\min_{\theta_g} \Pi(\theta_g, \theta_d^*(\theta_g)). \qquad (7.12)$$
In practice, the objective function for the critic is augmented with a gradient penalty term, which promotes critics whose gradients have unit norm:
$$\Pi_c(\theta_g, \theta_d) = \Pi(\theta_g, \theta_d) - \frac{\lambda}{\bar{N}} \sum_{i=1}^{\bar{N}} \left( \left\| \frac{\partial c}{\partial \hat{x}}(\hat{x}_i; \theta_d) \right\| - 1 \right)^2, \qquad (7.13)$$
where $\hat{x}_i$ are samples at which the gradient of the critic is evaluated, and $\lambda$ is the penalty parameter.
3. Since the dimension $N_Z$ of the latent variable is typically much smaller than the dimension $d$ of samples in $\Omega_X$, the trained generator also provides a low-dimensional representation of high-dimensional data, which can be very useful in several downstream tasks [79, 80].
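The sketch below shows one alternating update for the objective (7.10), assuming the generator g and critic c are torch.nn.Module instances with optimizers opt_g and opt_c (our own setup, not code from the text); for clarity, the gradient penalty (7.13) is omitted.

import torch

def wgan_step(g, c, x_real, opt_g, opt_c, N_Z):
    z = torch.randn(x_real.shape[0], N_Z)
    # Critic update: maximize Π(θ_g, θ_d), i.e., minimize -Π
    opt_c.zero_grad()
    loss_c = -(c(x_real).mean() - c(g(z).detach()).mean())
    loss_c.backward()
    opt_c.step()
    # Generator update: minimize Π, i.e., maximize the critic value on fakes
    opt_g.zero_grad()
    loss_g = -c(g(z)).mean()
    loss_g.backward()
    opt_g.step()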
If we step back and observe how a trained GAN works, we recognize that it generates samples from a "simple" $N_Z$-dimensional Gaussian distribution and maps them such that they are transformed to samples from the desired $d$-dimensional distribution $f_X$. We also note that for a GAN typically $N_Z \ll d$. That is, the dimension of the latent space is much smaller than that of the ambient or physical space. This observation leads us to the question whether it is possible to derive other generative algorithms based on this type of transformation, and whether relaxing the requirement $N_Z \ll d$ can lead us to better algorithms. In particular, if we set $N_Z = d$, can we somehow address the challenges associated with a GAN? Primarily, can we eliminate the need to solve a challenging, non-linear min-max problem? A class of models, commonly referred to as score-based diffusion models [102, 103], accomplishes this goal.
As before, we let $X$ be a $d$-dimensional random vector with density $f_X$. However, we now let $X(T)$ be the $d$-dimensional latent vector. We choose it such that its joint pdf is Gaussian with mean $\mu = 0$ and covariance $\Sigma = \sigma^2(T) I$. That is, the joint pdf of $X(T)$ is $N(0, \sigma^2(T) I)$. The latent vector is a specific instance of a stochastic process $X(t)$, where $t$ is a time-like variable. The evolution of the joint pdf of this random vector is described in the development below.
Recall that our goal is to generate samples of $X(T)$ and transform them to samples of $X$. We now derive an algorithm that accomplishes this. The derivation takes the following steps:
1. Derive a convolution that transforms $f_X$ to $N(0, \sigma^2(T) I)$.
2. Derive a time-dependent PDE whose solution is equal to this transformation.
3. Derive a backward-in-time version of this PDE. The solution of this PDE transforms $N(0, \sigma^2(T) I)$ to $f_X$.
4. Derive and discretize a stochastic ODE corresponding to the PDE above. The discretization part will require training a neural network.
5. Generate samples from $N(0, \sigma^2(T) I)$ and use these as initial states in the discretized ODE above. The final states will then be the desired generated samples from $f_X$.
These steps are outlined next.
We first consider the transformation from $f_X$ to $f_{X(T)}$. This transformation maps a complex and unknown joint pdf to a Gaussian probability density with zero mean and covariance given by $\sigma^2(T) I$. One way to achieve this is to convolve the original density with a Gaussian kernel as follows,
$$f(x, t) = \int_{\Omega_X} f(x|x', t) \, f_X(x') \, dx', \qquad (7.15)$$
where
$$f(x|x', t) = \frac{1}{(2\pi\sigma^2(t))^{d/2}} \exp\left[ -\frac{|x - x'|^2}{2\sigma^2(t)} \right]. \qquad (7.16)$$
Here we think of $t \in (0, T)$ as a time-like scalar variable and assume that $\sigma(t)$ is an increasing function of $t$ with $\sigma(0) = 0$. In this case, if we stipulate that $\sigma(T) \gg 1$, then it is easy to show that $f(x, T) = N(0, \sigma^2(T) I)$. Also, since $\sigma(0) = 0$, we can show that the Gaussian kernel $f(x|x', t)$ tends to the Dirac distribution as $t \to 0$. Therefore, we have $f(x, 0) = f_X$. This concludes the first step in our derivation. Before we move on to the next step, we make the observation that by running $f(x, t)$ backwards in time, that is from $t = T$ to $t = 0$, we would generate a time-dependent density that begins with the Gaussian density $N(0, \sigma^2(T) I)$ and ends at the data density $f_X$.
Readers who are familiar with the diffusion equation (or the heat conduction equation) will recognize that $f(x, t)$ given by (7.15) is the solution to the equation
$$\begin{aligned}
\frac{\partial f}{\partial t} - \frac{\gamma(t)}{2} \nabla^2 f &= 0, && (x, t) \in \Omega_X \times (0, T], \\
f(x, 0) &= f_X(x), && x \in \Omega_X,
\end{aligned} \qquad (7.17)$$
where $\gamma(t) = \frac{d\sigma^2(t)}{dt}$. The solution to this PDE transforms $f(x, 0) = f_X(x)$ to $f(x, T) = N(0, \sigma^2(T) I)$. Next, we define the time-reversed density
$$\tilde{f}(x, \tau) = f(x, T - \tau), \qquad (7.18)$$
where $t = T - \tau$. The equation for this density can be derived by substituting the relation above into (7.17). It is given by
$$\begin{aligned}
\frac{\partial \tilde{f}}{\partial \tau} + \frac{\gamma(T - \tau)}{2} \nabla^2 \tilde{f} &= 0, && (x, \tau) \in \Omega_X \times (0, T], \\
\tilde{f}(x, 0) &= N(0, \sigma^2(T) I), && x \in \Omega_X.
\end{aligned} \qquad (7.19)$$
The solution of this equation will ensure that at $\tau = T$, this density will attain the desired value, $\tilde{f}(x, T) = f_X(x)$.
Equation (7.19) has a problem in that it is the diffusion equation with the opposite sign in the spatial derivative term. It is sometimes referred to as the anti-diffusion equation. More importantly, for our application, this equation does not have a "particle" counterpart that can be used to transform samples drawn from its initial state, $N(0, \sigma^2(T) I)$, to samples from its final state, $f_X(x)$. As shown below, this difficulty can be addressed by adding and subtracting another diffusion term with the correct sign to this equation.
We begin with the differential equation in (7.19),
$$\begin{aligned}
& \frac{\partial \tilde{f}}{\partial \tau} + \frac{\gamma(T - \tau)}{2} \nabla^2 \tilde{f} = 0, \\
\Rightarrow \;& \frac{\partial \tilde{f}}{\partial \tau} - \frac{\gamma(T - \tau)}{2} \nabla^2 \tilde{f} + \gamma(T - \tau) \nabla^2 \tilde{f} = 0, \\
\Rightarrow \;& \frac{\partial \tilde{f}}{\partial \tau} - \frac{\gamma(T - \tau)}{2} \nabla^2 \tilde{f} + \nabla \cdot (\gamma(T - \tau) \nabla \tilde{f}) = 0, \\
\Rightarrow \;& \frac{\partial \tilde{f}}{\partial \tau} - \frac{\gamma(T - \tau)}{2} \nabla^2 \tilde{f} + \nabla \cdot (\gamma(T - \tau) \nabla \ln(\tilde{f}) \, \tilde{f}) = 0, \\
\Rightarrow \;& \frac{\partial \tilde{f}}{\partial \tau} - \frac{\gamma(T - \tau)}{2} \nabla^2 \tilde{f} + \nabla \cdot (\gamma(T - \tau) \nabla \ln(f(x, T - \tau)) \, \tilde{f}) = 0, \\
\Rightarrow \;& \frac{\partial \tilde{f}}{\partial \tau} - \frac{\gamma(T - \tau)}{2} \nabla^2 \tilde{f} + \nabla \cdot (\gamma(T - \tau) \, s(x, T - \tau) \, \tilde{f}) = 0. \qquad (7.20)
\end{aligned}$$
In deriving lines 2-4 in the development above, we have performed simple algebraic operations. To arrive at the second from the last line, we have used the relation (7.18), and in the last line we have introduced the definition of the score function for the density $f(x, t)$,
$$s(x, t) = \nabla \ln f(x, t). \qquad (7.21)$$
This final equation has a particle counterpart. Discretizing it in time gives the stochastic update
$$x^{(n+1)} = x^{(n)} + \gamma(T - \tau^{(n)}) \, s(x^{(n)}, T - \tau^{(n)}) \, \Delta\tau + \sqrt{\gamma(T - \tau^{(n)}) \, \Delta\tau} \; w, \qquad (7.22)$$
where $x^{(n)}$ denotes the sample value at time $\tau^{(n)}$ and $w \sim N(0, I)$.
We are now ready to describe the generative algorithm that will generate samples from the desired distribution. First, we select a schedule for $\sigma(t)$ such that $\sigma(0) = 0$, $\sigma(T) \gg 1$, and $\gamma(t) = \frac{d\sigma^2(t)}{dt} > 0$ for $t \in (0, T)$. Note that selecting a schedule for $\sigma(t)$ means that we also select a schedule for $\gamma(t)$. Next, we divide the time interval $(0, T)$ into $N_T$ intervals, which gives the time increment $\Delta\tau = T/N_T$. Then we generate $N_{\text{samples}}$ samples of $x^{(0)}$ from $N(0, \sigma^2(T) I)$. Each of these samples is then evolved through (7.22). In this equation we know every coefficient except the score function $s(x, T - \tau^{(n)})$. Therefore, once this function is learnt, we can evolve these samples, and by construction we are guaranteed that at $\tau = T$ the samples will have evolved into samples drawn from $f_X(x)$. This algorithm is described in Algorithm 2.
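One simple schedule satisfying these requirements (an illustrative choice; see [46, 102] for schedules used in practice) is $\sigma(t) = \sigma_{\max} \sqrt{t/T}$, for which $\gamma(t) = \sigma_{\max}^2 / T$ is constant. A minimal Python sketch:

```python
import numpy as np

T, sigma_max = 1.0, 25.0  # illustrative values

def sigma(t):
    # sigma(0) = 0 and sigma(T) = sigma_max >> 1
    return sigma_max * np.sqrt(t / T)

def gamma(t):
    # gamma(t) = d sigma^2(t) / dt; constant for this schedule
    return sigma_max**2 / T
```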
In the next paragraph, we describe how we can learn an approximation to the score function by using a finite number of samples generated from $f_X(x)$. These samples form the training data for the generative algorithm, and the approximation for the score function is expressed in terms of a deep neural network.
Learning the Score Function. Let $s(x, t; \theta) : \Omega_X \times (0, T) \to \Omega_X$ be a neural network that maps samples from the domain $\Omega_X$ and the time interval $(0, T)$ to samples in $\Omega_X$. Let $\theta$ represent all the trainable parameters in this network. Then from (7.21) we conclude that an appropriate loss function to determine these parameters is given by

$$\Pi(\theta) = \int_0^T \int_{\Omega_X} |s(x, t; \theta) - \nabla \ln f(x, t)|^2 \, f(x, t) \, dx \, dt, \tag{7.23}$$
which is designed to minimize the difference between the true score function and its counterpart approximated by the neural network. By utilizing (7.15) in the above equation, and recognizing that the loss function may be shifted by a term that does not depend on $\theta$ (see [103] for details), we arrive at
$$\Pi(\theta) = \int_0^T \int_{\Omega_X} \int_{\Omega_X} |s(x, t; \theta) - \nabla_x \ln f(x|x', t)|^2 \, f(x|x', t) \, f_X(x') \, dx' \, dx \, dt. \tag{7.24}$$
Thereafter, using the definition of the Gaussian kernel $f(x|x', t)$ in (7.16), we may simplify this to

$$\Pi(\theta) = \int_0^T \int_{\Omega_X} \int_{\Omega_X} \left| s(x, t; \theta) + \frac{x - x'}{\sigma^2(t)} \right|^2 f(x|x', t) \, f_X(x') \, dx' \, dx \, dt. \tag{7.25}$$
Finally, replacing the integrals above with their Monte Carlo approximations, we arrive at the loss function that is used in practice,

$$\Pi(\theta) = \frac{1}{K J N_{\text{train}}} \sum_{k=1}^{K} \sum_{j=1}^{J} \sum_{i=1}^{N_{\text{train}}} \left| s(x_i^{(j,k)}, t^{(k)}; \theta) + \frac{x_i^{(j,k)} - x_i}{\sigma^2(t^{(k)})} \right|^2, \tag{7.26}$$

where $t^{(k)}$ is sampled from a uniform distribution on the interval $(0, T)$, $x_i$ is the $i$-th sample from the training dataset $S$, and $x_i^{(j,k)} = x_i + n^{(j,k)}$, where $n^{(j,k)}$ is sampled from $N(0, \sigma^2(t^{(k)}) I)$. From the equation above it is clear that given a noisy version of the sample $x_i$, denoted by $x_i^{(j,k)}$, and the time $t^{(k)}$, the score function learns to compute the noise $n^{(j,k)}$, scaled by the variance $\sigma^2(t^{(k)})$. This justifies the name denoising score matching for this approach.
Algorithm 2: Generating samples using the learnt score function
Generate $x_i^{(0)} \sim N(0, \sigma^2(T) I)$, $i = 1, \dots, N_{\text{samples}}$; set $\tau^{(0)} = 0$
for $n = 0, \dots, N_T - 1$ do
    for $i = 1, \dots, N_{\text{samples}}$ do
        update $x_i^{(n+1)}$ using (7.22), with $w \sim N(0, I)$
    end
    $\tau^{(n+1)} = \tau^{(n)} + \Delta\tau$
end
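A minimal Python sketch of this sampling loop is given below, assuming the score function has already been learnt and is available as a callable `score(x, t)`; all names and default sizes are illustrative.

```python
import numpy as np

def generate_samples(score, gamma, T=1.0, n_steps=1000, n_samples=500,
                     d=2, sigma_T=25.0, seed=0):
    """Evolve samples by the update (7.22) (Euler-Maruyama in reverse time)."""
    rng = np.random.default_rng(seed)
    dtau = T / n_steps
    x = sigma_T * rng.standard_normal((n_samples, d))  # x^(0) ~ N(0, sigma_T^2 I)
    tau = 0.0
    for _ in range(n_steps):
        g = gamma(T - tau)
        w = rng.standard_normal(x.shape)
        x = x + g * score(x, T - tau) * dtau + np.sqrt(dtau * g) * w
        tau += dtau
    return x  # approximately distributed according to f_X
```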
Note that the score function takes as input the $d$-dimensional vector $x$ and the scalar $t$, and generates as output the $d$-dimensional score. Thus, when $x$ is a vector, a logical choice of architecture for the score function is an MLP that maps a $(d+1)$-dimensional vector to a $d$-dimensional vector. When $x$ represents an image, or a discretized version of a field variable, the architecture is different. In this case the network that approximates the score function is an image-to-image network (a U-Net, for example) and the scalar time is used for instance normalization within the network.
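For the vector-valued case, a sketch of such an MLP in PyTorch follows; the width, depth and activation are illustrative choices, not prescriptions of the text.

```python
import torch
import torch.nn as nn

class ScoreMLP(nn.Module):
    """Score network s(x, t; theta): maps (x, t) in R^(d+1) to R^d."""

    def __init__(self, d, width=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d + 1, width), nn.SiLU(),
            nn.Linear(width, width), nn.SiLU(),
            nn.Linear(width, d),
        )

    def forward(self, x, t):
        # x: (batch, d); t: (batch, 1). Time enters as an extra input feature.
        return self.net(torch.cat([x, t], dim=-1))
```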
The value of $\sigma(T)$ is a hyperparameter for the method. This value should be larger than the largest difference between the samples of $x$; that is, it should be larger than $|x_i - x_j|$ for all $i, j$ in the dataset. In practice, this value is computed approximately from the training data.

7.4 Conditional Generative Algorithms

We now consider the conditional version of the generative problem. Suppose we are given a dataset of paired samples,

$$S = \{ (x_i, y_i) : 1 \le i \le N_{\text{train}} \}, \tag{7.27}$$

and we want to find $y$ for a new $x$ not appearing in $S$. We have seen in the previous chapters how neural networks can be used to solve such a regression (or classification) problem.
Now let us consider the probabilistic version of this problem. We assume that $x$ and $y$ are modelled using the random vectors $X$ and $Y$, respectively. Further, let the paired samples in (7.27) be drawn from the unknown joint distribution $f_{XY}$. Then, given a realization $X = \hat{x}$, we wish to use $S$ to determine the conditional distribution $f_{Y|X}(y|\hat{x})$ and generate samples from it.
There are several popular approaches to solving this probabilistic problem, such as Bayesian neural networks [12], variational inference [21, 86], dropouts [26, 104], deep Boltzmann machines [93, 94], and diffusion-based models [103]. In this chapter we will focus on an extension of GANs and diffusion models to solve this problem.
The generator of a conditional GAN (here, a conditional Wasserstein GAN, or cWGAN) is a network of the form

$$g(\cdot, \cdot; \theta_g) : \Omega_Z \times \Omega_X \to \Omega_Y, \tag{7.28}$$

where $z \sim f_Z$ is the latent variable. Note that unlike a “simple” GAN, the generator in a conditional GAN also takes as input $x$. For a given value of $X = \hat{x}$, sampling $z \sim f_Z$ will generate many samples of $y$ from some induced conditional distribution $f^g_{Y|X}(y|\hat{x})$. The goal is to prescribe the parameters $\theta_g$ such that $f^g_{Y|X}(y|\hat{x})$ approximates the true conditional $f_{Y|X}(y|\hat{x})$ for (almost) every value of $\hat{x}$.
The critic is a network of the form

$$c(\cdot, \cdot; \theta_d) : \Omega_X \times \Omega_Y \to \mathbb{R}, \tag{7.29}$$

which is trained to distinguish between paired samples $(x, y)$ generated from the true joint distribution $f_{XY}$ and the fake pairs $(x, \tilde{y})$, where $\tilde{y}$ is generated by $g$ given (real) $x$.
The objective function for a cWGAN is given by

$$\Pi(\theta_g, \theta_d) = \underbrace{\frac{1}{N_{\text{train}}} \sum_{i=1}^{N_{\text{train}}} c(x_i, y_i; \theta_d)}_{\text{critic value on real pairs}} - \underbrace{\frac{1}{N_{\text{train}}} \sum_{i=1}^{N_{\text{train}}} c(x_i, g(z_i, x_i; \theta_g); \theta_d)}_{\text{critic value on fake pairs}}. \tag{7.30}$$
As earlier, the critic is trained to maximize the objective function (given by (7.11))
while the generator is trained to minimize it (given by (7.12)). Further, a stabilizing
gradient penalty term needs to be included when optimizing the critic (see [89]). This
term is given by
$$\Pi_c(\theta_g, \theta_d) = \Pi(\theta_g, \theta_d) - \frac{\lambda}{\bar{N}} \sum_{i=1}^{\bar{N}} \left( \| \nabla c(\hat{x}_i, y_i; \theta_d) \| - 1 \right)^2, \tag{7.31}$$

where $\nabla c$ denotes the gradient of $c$ with respect to both its arguments, $\hat{x}_i = \alpha x_i + (1 - \alpha) g(z_i, x_i; \theta_g)$, and $\alpha$ is sampled from a uniform pdf on $(0, 1)$. The additional term in (7.31) is known as a gradient penalty term and is used to constrain the (norm of the) gradient of the critic $c$ with respect to its input to be close to 1, thereby encouraging $c$ to be a 1-Lipschitz function.
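A PyTorch sketch of the critic loss with a gradient penalty is given below. The interpolation between real and fake samples follows the usual WGAN-GP recipe [37]; the precise form of the penalty used in [88, 89] differs in detail, so treat this as illustrative rather than as the book's exact method.

```python
import torch

def critic_loss_with_gp(critic, x, y_real, y_fake, lam=10.0):
    """Negative of (7.30) plus a gradient penalty in the spirit of (7.31)."""
    alpha = torch.rand(y_real.size(0), 1)              # alpha ~ U(0, 1)
    y_hat = (alpha * y_real + (1 - alpha) * y_fake).requires_grad_(True)
    x_in = x.detach().requires_grad_(True)
    c = critic(x_in, y_hat)
    # gradient of the critic with respect to both of its arguments
    gx, gy = torch.autograd.grad(c.sum(), (x_in, y_hat), create_graph=True)
    g_norm = torch.cat([gx, gy], dim=-1).norm(2, dim=-1)
    penalty = lam * ((g_norm - 1.0) ** 2).mean()
    wass = critic(x, y_real).mean() - critic(x, y_fake).mean()
    return -(wass - penalty)  # minimize this to maximize the penalized objective
```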
The optimal weights for the critic are determined by maximizing the objective function (7.11), and the optimal weights for the generator are determined by minimizing the objective function (7.12). Once again, this leads to a min-max problem which is solved using the alternating steepest descent algorithm described earlier. Under the assumptions of infinite capacity ($N_{\theta_g}, N_{\theta_d} \to \infty$), infinite data ($N_{\text{train}} \to \infty$) and a perfect optimizer, one can prove [88] that the generated conditional distribution $f^g_{Y|X}(y|\hat{x})$ converges in a weak sense to the target conditional distribution $f_{Y|X}(y|\hat{x})$ (on average) for a given $X = \hat{x}$.
Thus, when the training of a conditional GAN is over, the generator can be used to generate samples from the conditional distribution for a given $X = \hat{x}$ as follows:

$$y_i = g(z_i, \hat{x}; \theta_g^*), \quad z_i \sim f_Z. \tag{7.32}$$

That is, we use $\hat{x}$ as input in one of the channels of the generator and use $z_i$ sampled from $f_Z$ as input in the other channel. The output of the generator then produces samples of $Y$ drawn from the desired conditional distribution.
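In code, this amounts to fixing $\hat{x}$ and resampling $z$; a short self-contained sketch with a stand-in generator (all names and sizes are illustrative):

```python
import torch
import torch.nn as nn

dim_x, dim_y, dim_z, n_ensemble = 8, 4, 64, 1000

# Stand-in for a trained conditional generator g(z, x; theta_g*).
g = nn.Sequential(nn.Linear(dim_z + dim_x, 128), nn.ReLU(), nn.Linear(128, dim_y))

x_hat = torch.randn(1, dim_x).expand(n_ensemble, dim_x)  # the same x_hat, repeated
z = torch.randn(n_ensemble, dim_z)                       # z_i ~ f_Z = N(0, I)
y = g(torch.cat([z, x_hat], dim=-1))                     # samples of Y given x_hat
print(y.mean(dim=0), y.std(dim=0))                       # ensemble statistics
```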
The key differences between the pure generative problem and the conditional generative problem are:
1. In the former we are given samples from the distribution $f_X$, whereas in the latter we are given data from the joint distribution $f_{XY}$.
2. In the former the goal is to generate more samples from $f_X$, whereas in the latter it is to generate samples from the conditional distribution $f_{Y|X}(y|\hat{x})$.
In order to devise a diffusion model that accomplishes this goal, our strategy is to repeat the development for the pure generative problem, while replacing $f_X$ with $f_{Y|X}(y|\hat{x})$. This derivation is based on the development in [7, 20, 103].
The convolution defined in (7.15) is replaced with

$$f(y, t|\hat{x}) = \int_{\Omega_Y} f(y|y', t) \, f_{Y|X}(y'|\hat{x}) \, dy', \tag{7.33}$$

where the definition of the Gaussian kernel $f(y|y', t)$ remains the same and is given by (7.16).
Following the steps outlined in Sect. 7.3.2, we can show that the backward-in-time version of this probability density, defined as $\tilde{f}(y, \tau|\hat{x}) = f(y, t|\hat{x})$ with $t = T - \tau$, satisfies the partial differential equation

$$\frac{\partial \tilde{f}}{\partial \tau} - \frac{\gamma(T - \tau)}{2} \nabla^2 \tilde{f} + \nabla \cdot \big( \gamma(T - \tau) \, s(y, \hat{x}, T - \tau) \, \tilde{f} \big) = 0, \tag{7.34}$$

where the “spatial” derivatives are now along the $y$ coordinates and the appropriate score function is given by

$$s(y, \hat{x}, t) = \nabla_y \ln f(y, t|\hat{x}). \tag{7.35}$$

When $\tilde{f}$ is initialized to be $N(0, \sigma^2(T) I)$ and evolved by solving the above PDE, we are guaranteed that the final state at $\tau = T$ is the desired conditional density, $f_{Y|X}(y|\hat{x})$.
This means that if we select samples such that each $y^{(0)}$ is sampled from $N(0, \sigma^2(T) I)$, and then evolve the samples according to

$$y^{(n+1)} = y^{(n)} + \gamma(T - \tau^{(n)}) \, s(y^{(n)}, \hat{x}, T - \tau^{(n)}) \, \Delta\tau + \sqrt{\Delta\tau \, \gamma(T - \tau^{(n)})} \, w, \tag{7.36}$$

we are guaranteed that each $y^{(N_T)}$ will be a sample from the desired conditional density $f_{Y|X}(y|\hat{x})$.
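The corresponding sampling loop mirrors the unconditional sketch given earlier, with the conditioning input $\hat{x}$ passed to the (assumed trained) score network at every step; names below are illustrative.

```python
import numpy as np

def generate_conditional_samples(score, gamma, x_hat, T=1.0, n_steps=1000,
                                 n_samples=500, d=2, sigma_T=25.0, seed=0):
    """Evolve samples by the update (7.36); `score(y, x_hat, t)` is trained."""
    rng = np.random.default_rng(seed)
    dtau = T / n_steps
    y = sigma_T * rng.standard_normal((n_samples, d))  # y^(0) ~ N(0, sigma_T^2 I)
    tau = 0.0
    for _ in range(n_steps):
        g = gamma(T - tau)
        w = rng.standard_normal(y.shape)
        y = y + g * score(y, x_hat, T - tau) * dtau + np.sqrt(dtau * g) * w
        tau += dtau
    return y  # approximately distributed according to f_{Y|X}(y | x_hat)
```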
Now that we have obtained the iterations that will map samples from the standard normal density to the desired conditional density, all that remains is to provide an expression for the score function. Once again we use a neural network to approximate this score function, and train the neural network by defining the loss function to be

$$\Pi(\theta) = \int_0^T \int_{\Omega_Y} \int_{\Omega_X} |s(y, x, t; \theta) - \nabla \ln f(y, t|x)|^2 \, f(y, t|x) \, f_X(x) \, dx \, dy \, dt. \tag{7.37}$$
Following the steps outlined in Sect. 7.3.2, this expression reduces to

$$\Pi(\theta) = \int_0^T \int_{\Omega_Y} \int_{\Omega_Y} \int_{\Omega_X} \left| s(y, x, t; \theta) + \frac{y - y'}{\sigma^2(t)} \right|^2 f(y|y', t) \, f_{Y|X}(y'|x) \, f_X(x) \, dx \, dy \, dy' \, dt. \tag{7.38}$$
Recognizing that $f_{Y|X}(y'|x) f_X(x) = f_{XY}(x, y')$, and replacing the integrals above with their Monte Carlo approximations, we arrive at the final expression for the loss function that is used to train the score function,

$$\Pi(\theta) = \frac{1}{K J N_{\text{train}}} \sum_{k=1}^{K} \sum_{j=1}^{J} \sum_{i=1}^{N_{\text{train}}} \left| s(y_i^{(j,k)}, x_i, t^{(k)}; \theta) + \frac{y_i^{(j,k)} - y_i}{\sigma^2(t^{(k)})} \right|^2, \tag{7.39}$$
where $t^{(k)}$ is sampled from a uniform distribution on the interval $(0, T)$, $(x_i, y_i)$ is the $i$-th sample from the training dataset $S$, and $y_i^{(j,k)} = y_i + n^{(j,k)}$, where $n^{(j,k)}$ is sampled from $N(0, \sigma^2(t^{(k)}) I)$. The minimization of this loss function produces an approximate score function which can be used in the iterations given by Eq. (7.36) to generate samples from the desired conditional density for a given input $\hat{x}$.
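A sketch of this loss in PyTorch, drawing one random time level per training sample (a common simplification of the triple sum in (7.39)); `score_net` and `sigma` are assumed to exist with the signatures shown:

```python
import torch

def score_matching_loss(score_net, x, y, sigma, T=1.0):
    """Monte Carlo estimate of (7.39) with one (j, k) draw per sample i."""
    n = y.size(0)
    t = T * torch.rand(n, 1)           # t^(k) ~ U(0, T)
    s_t = sigma(t)                     # sigma(t^(k)), shape (n, 1)
    noise = s_t * torch.randn_like(y)  # n^(j,k) ~ N(0, sigma^2(t^(k)) I)
    y_noisy = y + noise                # y_i^(j,k) = y_i + n^(j,k)
    target = -noise / s_t**2           # score of the Gaussian kernel (7.16)
    residual = score_net(y_noisy, x, t) - target
    return (residual ** 2).sum(dim=-1).mean()
```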
References
17. T. Chen, H. Chen, Universal approximation to nonlinear operators by neural networks with
arbitrary activation functions and its application to dynamical systems. IEEE Trans. Neural
Netw. 6, 911–917 (1995)
18. S. Cuomo, V.S. Di Cola, F. Giampaolo, G. Rozza, M. Raissi, F. Piccialli, Scientific machine
learning through physics-informed neural networks: where we are and what’s next. J. Sci.
Comput. 92, 88 (2022)
19. A.D. Jagtap, G.E. Karniadakis, Extended physics-informed neural networks (xpinns): a gener-
alized space-time domain decomposition based deep learning framework for nonlinear partial
differential equations. Commun. Comput. Phys. 28, 2002–2041 (2020)
20. A. Dasgupta, J. Murgoitio-Esandi, D. Ray, A. Oberai, Conditional score-based generative
models for solving physics-based inverse problems, in NeurIPS 2023 Workshop on Deep
Learning and Inverse Problems (2023)
21. D.M. Blei, A. Kucukelbir, J.D. McAuliffe, Variational inference: a review for statisticians. J. Am. Stat. Assoc. 112, 859–877 (2017)
22. T. De Ryck, A.D. Jagtap, S. Mishra, Error estimates for physics-informed neural networks
approximating the Navier-Stokes equations. IMA J. Numer. Anal. 44, 83–119 (2023)
23. M.W.M.G. Dissanayake, N. Phan-Thien, Neural-network-based approximations for solving
partial differential equations. Commun. Numer. Methods Eng. 10, 195–201 (1994)
24. J. Donea, S. Giuliani, J.-P. Halleux, An arbitrary lagrangian-eulerian finite element method
for transient dynamic fluid-structure interactions. Comput. Methods Appl. Mech. Eng. 33,
689–723 (1982)
25. S.R. Dubey, S.K. Singh, B.B. Chaudhuri, Activation functions in deep learning: a compre-
hensive survey and benchmark. Neurocomputing 503, 92–108 (2022)
26. Y. Gal, Z. Ghahramani, Dropout as a Bayesian approximation: representing model uncertainty in deep learning, in International Conference on Machine Learning, PMLR, pp. 1050–1059 (2016)
27. S. Garg, S. Chakraborty, Vb-deeponet: a bayesian operator learning framework for uncertainty
quantification. Eng. Appl. Artif. Intell. 118, 105685 (2023)
28. D. Gilbarg, N. Trudinger, Elliptic Partial Differential Equations of Second Order, Classics in Mathematics (Springer, Berlin, Heidelberg, 2001)
29. J. Gilmer, S.S. Schoenholz, P.F. Riley, O. Vinyals, G.E. Dahl, Neural message passing for
quantum chemistry, in International Conference on Machine Learning, PMLR, pp. 1263–
1272 (2017)
30. R.J. Gladstone, H. Rahmani, V. Suryakumar, H. Meidani, M. D’Elia, A. Zareei, Mesh-based GNN surrogates for time-independent PDEs. Sci. Rep. 14, 3394 (2024)
31. B.V. Gnedenko, Theory of Probability (Routledge, 2018)
32. H. Goh, S. Sheriffdeen, J. Wittmer, T. Bui-Thanh, Solving bayesian inverse problems via
variational autoencoders, in Proceedings of the 2nd Mathematical and Scientific Machine
Learning Conference, ed. by J. Bruna, J. Hesthaven, L. Zdeborova, vol. 145 of Proceedings
of Machine Learning Research, PMLR, 16–19 Aug. 2022, pp. 386–425
33. I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,
Y. Bengio, Generative adversarial nets, in Advances in Neural Information Processing Sys-
tems, pp. 2672–2680 (2014)
34. S. Goswami, K. Kontolati, M.D. Shields, G.E. Karniadakis, Deep Transfer Learning
for Partial Differential Equations Under Conditional Shift with Deeponet, p. 55 (2022).
arXiv:2204.09810
35. S. Goswami, M. Yin, Y. Yu, G.E. Karniadakis, A physics-informed variational deeponet for
predicting crack path in quasi-brittle materials. Comput. Methods Appl. Mech. Eng. 391,
114587 (2022)
36. J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai, T. Liu, X. Wang, G. Wang, J.
Cai et al., Recent advances in convolutional neural networks. Pattern Recognit. 77, 354–377
(2018)
37. I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, A.C. Courville, Improved training of
wasserstein gans, in Advances in Neural Information Processing Systems, pp. 5767–5777
(2017)
38. J. He, S. Koric, S. Kushwaha, J. Park, D. Abueidda, I. Jasiuk, Novel deeponet architecture
to predict stresses in elastoplastic structures with variable complex geometries and loads.
Comput. Methods Appl. Mech. Eng. 415, 116277 (2023)
39. J. He, S. Kushwaha, J. Park, S. Koric, D. Abueidda, I. Jasiuk, Sequential deep operator
networks (s-deeponet) for predicting full-field solutions under time-dependent loads. Eng.
Appl. Artif. Intell. 127, 107258 (2024)
40. K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition. IEEE Confer-
ence on Computer Vision and Pattern Recognition (CVPR) 2016, 770–778 (2016)
41. C.W. Hirt, A.A. Amsden, J. Cook, An arbitrary lagrangian-eulerian computing method for all
flow speeds. J. Comput. Phys. 14, 227–253 (1974)
42. C.W. Hirt, J. Cook, T.D. Butler, A lagrangian method for calculating the dynamics of an
incompressible fluid with free surface. J. Comput. Phys. 5, 103–124 (1970)
43. F. Hoffmann, B. Hosseini, Z. Ren, A.M. Stuart, Consistency of semi-supervised learning
algorithms on graphs: probit and one-hot methods. J. Mach. Learn. Res. 21, 7549–7603
(2020)
44. T.J. Hughes, The Finite Element Method: Linear Static and Dynamic Finite Element Analysis (Courier Corporation, 2012)
45. G. James, D. Witten, T. Hastie, R. Tibshirani, An Introduction to Statistical Learning: With
Applications in R (Springer Publishing Company, Incorporated, 2014)
46. T. Karras, M. Aittala, T. Aila, S. Laine, Elucidating the design space of diffusion-based
generative models. Adv. Neural Inf. Process. Syst. 35, 26565–26577 (2022)
47. N.S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, P.T.P. Tang, On large-batch training
for deep learning: Generalization gap and sharp minima, in International Conference on
Learning Representations (2017). https://openreview.net/forum?id=H1oyRlYgg
48. E. Kharazmi, Z. Zhang, G. Em Karniadakis, Variational Physics-informed Neural Networks
for Solving Partial Differential Equations (2019)
49. P. Kidger, T. Lyons, Universal approximation with deep narrow networks, in Proceedings of
Thirty Third Conference on Learning Theory, ed. by J. Abernethy and S. Agarwal, vol. 125
of Proceedings of Machine Learning Research, PMLR, 09–12 July 2020, pp. 2306–2327
50. D.P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization (2017). https://arxiv.org/
abs/1412.6980v9
51. D.P. Kingma, M. Welling, An introduction to variational autoencoders. Found. Trends ®
Mach. Learn. 12, 307–392 (2019)
52. A. Kopaničáková, G.E. Karniadakis, Deeponet Based Preconditioning Strategies for Solving
Parametric Linear Systems of Equations (2024). arXiv:2401.02016
53. S. Koric, D.W. Abueidda, Data-driven and physics-informed deep learning operators for solu-
tion of heat conduction equation with parametric heat source. Int. J. Heat Mass Transf. 203,
123809 (2023)
54. I. Lagaris, A. Likas, D. Fotiadis, Artificial neural networks for solving ordinary and partial
differential equations. IEEE Trans. Neural Netw. 9, 987–1000 (1998)
55. I. Lagaris, A. Likas, D. Papageorgiou, Neural-network methods for boundary value problems
with irregular boundaries. IEEE Trans. Neural Netw. 11, 1041–1049 (2000)
56. S. Lanthaler, S. Mishra, G.E. Karniadakis, Error estimates for DeepONets: a deep learning
framework in infinite dimensions. Trans. Math. Appl. 6 (2022)
57. E. Lejeune, Mechanical MNIST: a benchmark dataset for mechanical metamodels. Extrem. Mech. Lett. 36, 100659 (2020)
58. Z. Li, N. Kovachki, K. Azizzadenesheli, B. Liu, K. Bhattacharya, A. Stuart, A. Anandkumar,
Fourier Neural Operator for Parametric Partial Differential Equations (2020). https://arxiv.
org/abs/2010.08895
59. Z. Li, F. Liu, W. Yang, S. Peng, J. Zhou, A survey of convolutional neural networks: analysis, applications, and prospects. IEEE Trans. Neural Netw. Learn. Syst. (2021)
60. Z. Liu, P. Luo, X. Wang, X. Tang, Deep learning face attributes in the wild, in Proceedings
of International Conference on Computer Vision (ICCV), Dec. 2015
61. L. Lu, P. Jin, G. Pang, Z. Zhang, G.E. Karniadakis, Learning nonlinear operators via deeponet
based on the universal approximation theorem of operators. Nat. Mach. Intell. 3, 218–229
(2021)
62. A.L. Maas, A.Y. Hannun, A.Y. Ng, et al., Rectifier nonlinearities improve neural network
acoustic models, in Proceedings of the ICML, vol. 30 (2013)
63. M.R. Malik, T.A. Zang, M.Y. Hussaini, A spectral collocation method for the Navier-Stokes
equations. J. Comput. Phys. 61, 64–88 (1985)
64. L. McClenny, U. Braga-Neto, Self-adaptive Physics-informed Neural Networks Using a Soft
Attention Mechanism (2020). https://arxiv.org/abs/2009.04544
65. W. McCulloch, W. Pitts, A logical calculus of the ideas immanent in nervous activity. Bull.
Math. Biophys. 5, 115–133 (1943)
66. F. Milletari, N. Navab, S.-A. Ahmadi, V-net: Fully convolutional neural networks for volu-
metric medical image segmentation, in Fourth International Conference on 3D Vision (3DV),
vol. 2016 (IEEE, 2016), pp. 565–571
67. M. Mirza, S. Osindero, Conditional Generative Adversarial Nets (2014). https://arxiv.org/
abs/1411.1784
68. S. Mishra, R. Molinaro, Estimates on the generalization error of physics-informed neural
networks for approximating a class of inverse problems for PDEs. IMA J. Numer. Anal. 42,
981–1022 (2021)
69. S. Mishra, R. Molinaro, Estimates on the generalization error of physics-informed neural networks for approximating PDEs. IMA J. Numer. Anal. (2022)
70. C. Moya, S. Zhang, G. Lin, M. Yue, Deeponet-grid-uq: a trustworthy deep operator framework
for predicting the power grid’s post-fault trajectories. Neurocomputing 535, 166–182 (2023)
71. K.P. Murphy, Machine Learning: a Probabilistic Perspective (MIT Press, 2012)
72. A. Nemirovski, A. Juditsky, G. Lan, A. Shapiro, Robust stochastic approximation approach
to stochastic programming. SIAM J. Optim. 19, 1574–1609 (2009)
73. A. Ng, M. Jordan, Y. Weiss, On spectral clustering: analysis and an algorithm. Adv. Neural
Inf. Process. Syst. 14 (2001)
74. A. Odena, V. Dumoulin, C. Olah, Deconvolution and Checkerboard Artifacts, Distill (2016)
75. G. Pang, M. D’Elia, M. Parks, G. Karniadakis, nPINNs: nonlocal physics-informed neural net-
works for a parametrized nonlocal universal laplacian operator. Algorithms and applications.
J. Comput. Phys. 422, 109760 (2020)
76. G. Pang, L. Lu, G.E. Karniadakis, fPINNs: fractional physics-informed neural networks.
SIAM J. Sci. Comput. 41, A2603–A2626 (2019)
77. G. Papamakarios, E. Nalisnick, D.J. Rezende, S. Mohamed, B. Lakshminarayanan, Normal-
izing flows for probabilistic modeling and inference. J. Mach. Learn. Res. 22, 2617–2680
(2021)
78. D. Patel, D. Ray, M.R. Abdelmalik, T.J. Hughes, A.A. Oberai, Variationally mimetic operator
networks. Comput. Methods Appl. Mech. Eng. 419, 116536 (2024)
79. D.V. Patel, A.A. Oberai, Gan-based priors for quantifying uncertainty in supervised learning.
SIAM/ASA J. Uncertain. Quantif. 9, 1314–1343 (2021)
80. D.V. Patel, D. Ray, A.A. Oberai, Solution of physics-based bayesian inverse problems with
deep generative priors. Comput. Methods Appl. Mech. Eng. 400, 115428 (2022)
81. S. Pereira, A. Pinto, V. Alves, C.A. Silva, Brain tumor segmentation using convolutional
neural networks in MRI images. IEEE Trans. Med. Imaging 35, 1240–1251 (2016)
82. T. Pfaff, M. Fortunato, A. Sanchez-Gonzalez, P.W. Battaglia, Learning Mesh-based Simulation
with Graph Networks (2020). arXiv:2010.03409
83. A. Pinkus, Approximation theory of the MLP model in neural networks. Acta Numerica 8,
143–195 (1999)
84. O. Pinti, A.A. Oberai, Graph Laplacian-based spectral multi-fidelity modeling. Sci. Rep. 13, 16618 (2023)
85. M. Raissi, P. Perdikaris, G. Karniadakis, Physics-informed neural networks: a deep learning
framework for solving forward and inverse problems involving nonlinear partial differential
equations. J. Comput. Phys. 378, 686–707 (2019)
86. R. Ranganath, S. Gerrish, D. Blei, Black box variational inference, in Proceedings of the
Seventeenth International Conference on Artificial Intelligence and Statistics, ed. by S. Kaski,
J. Corander, vol. 33 of Proceedings of Machine Learning Research, Reykjavik, Iceland, 22–25
Apr. 2014, PMLR, pp. 814–822
87. W. Rawat, Z. Wang, Deep convolutional neural networks for image classification: a compre-
hensive review. Neural Comput. 29, 2352–2449 (2017)
88. D. Ray, J. Murgoitio-Esandi, A. Dasgupta, A.A. Oberai, Solution of Physics-Based Inverse
Problems using Conditional Generative Adversarial Networks with full Gradient Penalty
(2023). arXiv:2306.04895
89. D. Ray, H. Ramaswamy, D.V. Patel, A.A. Oberai, The Efficacy and Generalizability of Condi-
tional GANs for Posterior Inference in Physics-Based Inverse Problems (2022). https://arxiv.
org/abs/2202.07773
90. D. Rezende, S. Mohamed, Variational inference with normalizing flows, in International
Conference on Machine Learning, PMLR, pp. 1530–1538 (2015)
91. O. Ronneberger, P. Fischer, T. Brox, U-net: convolutional networks for biomedical image
segmentation, in Medical Image Computing and Computer-Assisted Intervention–MICCAI
2015, ed. by N. Navab, J. Hornegger, W.M. Wells, A.F. Frangi, Cham (Springer International
Publishing, 2015) pp. 234–241
92. S.M. Ross, Introduction to Probability Models (Academic Press, 2014)
93. R. Salakhutdinov, G. Hinton, Deep Boltzmann machines, in Proceedings of the Twelfth International Conference on Artificial Intelligence and Statistics, ed. by D. van Dyk, M. Welling, vol. 5 of Proceedings of Machine Learning Research, Hilton Clearwater Beach Resort, Clearwater Beach, Florida USA, 16–18 Apr. 2009, PMLR, pp. 448–455
94. R. Salakhutdinov, H. Larochelle, Efficient learning of deep Boltzmann machines, in Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, ed. by Y.W. Teh, M. Titterington, vol. 9 of Proceedings of Machine Learning Research, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010, PMLR, pp. 693–700
95. A. Samuel, Some studies in machine learning using the game of checkers. IBM J. Res. Dev.
3, 210–229 (1959)
96. F. Scarselli, M. Gori, A.C. Tsoi, M. Hagenbuchner, G. Monfardini, The graph neural network
model. IEEE Trans. Neural Netw. 20, 61–80 (2008)
97. J.K. Seo, K.C. Kim, A. Jargal, K. Lee, B. Harrach, A learning-based method for solving
ill-posed nonlinear inverse problems: a simulation study of Lung Eit. SIAM J. Imaging Sci.
12, 1275–1295 (2019)
98. N. Sharma, V. Jain, A. Mishra, An analysis of convolutional neural networks for image
classification. Procedia Comput. Sci. 132, 377–384 (2018)
99. Y. Shin, J. Darbon, G. Em Karniadakis, On the convergence of physics informed neural networks for linear second-order elliptic and parabolic type PDEs. Commun. Comput. Phys. 28, 2042–2074 (2020)
100. V. Sitzmann, J.N.P. Martel, A.W. Bergman, D.B. Lindell, G. Wetzstein, Implicit Neural Rep-
resentations with Periodic Activation Functions (2020). https://arxiv.org/abs/2006.09661
101. G.D. Smith, Numerical Solution of Partial Differential Equations: Finite Difference Methods
(Oxford University Press, 1985)
102. Y. Song, S. Ermon, Improved techniques for training score-based generative models. Adv.
Neural Inf. Process. Syst. 33, 12438–12448 (2020)
103. Y. Song, J. Sohl-Dickstein, D.P. Kingma, A. Kumar, S. Ermon, B. Poole, Score-based Gen-
erative Modeling Through Stochastic Differential Equations (2020). arXiv:2011.13456
104. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple
way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014)
105. J.W. Thomas, Numerical Partial Differential Equations: Finite Difference Methods, vol. 22
(Springer Science & Business Media, 2013)
106. C. Villani, Optimal Transport: Old and New, Grundlehren der mathematischen Wissenschaften (Springer, Berlin, Heidelberg, 2008)
107. U. Von Luxburg, A tutorial on spectral clustering. Stat. Comput. 17, 395–416 (2007)
108. P. Wang, P. Chen, Y. Yuan, D. Liu, Z. Huang, X. Hou, G. Cottrell, Understanding convolution
for semantic segmentation, in IEEE Winter Conference on Applications of Computer Vision
(WACV), vol. 2018. (IEEE, 2018), pp. 1451–1460
109. S. Wang, S. Sankaran, H. Wang, P. Perdikaris, An expert’s guide to training physics-informed
neural networks (2023). http://arxiv.org/abs/2308.08468
110. S. Wang, Y. Teng, P. Perdikaris, Understanding and mitigating gradient flow pathologies in
physics-informed neural networks. SIAM J. Sci. Comput. 43, A3055–A3081 (2021)
111. S. Wang, H. Wang, P. Perdikaris, Learning the solution operator of parametric partial differ-
ential equations with physics-informed deeponets. Sci. Adv. 7 (2021)
112. S. Wang, X. Yu, P. Perdikaris, When and why PINNs fail to train: a neural tangent kernel
perspective. J. Comput. Phys. 449, 110768 (2022)
113. L. Wu, C. Ma, W. E, How SGD selects the global minima in over-parameterized learning: a
dynamical stability perspective, in Advances in Neural Information Processing Systems, ed.
by S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett, vol. 31
(Curran Associates, Inc., 2018)
114. W. Xu, Y. Lu, L. Wang, Transfer learning enhanced deeponet for long-time prediction of evo-
lution equations. Proceedings of the AAAI Conference on Artificial Intelligence 37, 10629–
10636 (2023)
115. L. Yang, D. Zhang, G.E. Karniadakis, Physics-informed generative adversarial networks for
stochastic differential equations. SIAM J. Sci. Comput. 42, A292–A317 (2020)
116. L. Yang, D. Zhang, G.E. Karniadakis, Physics-informed generative adversarial networks for
stochastic differential equations. SIAM J. Sci. Comput. 42, A292–A317 (2020)
117. Y. Yang, G. Kissas, P. Perdikaris, Scalable uncertainty quantification for deep operator net-
works using randomized priors. Comput. Methods Appl. Mech. Eng. 399, 115399 (2022)
118. Y. Yang, P. Perdikaris, Adversarial uncertainty quantification in physics-informed neural net-
works. J. Comput. Phys. 394, 136–152 (2019)
119. D. Yarotsky, A. Zhevnerchuk, The Phase Diagram of Approximation Rates for Deep Neural
Networks (2019). https://arxiv.org/abs/1906.09477
120. M. Yin, E. Ban, B.V. Rego, E. Zhang, C. Cavinato, J.D. Humphrey, G. Em Karniadakis,
Simulating progressive intramural damage leading to aortic dissection using deeponet: an
operator-regression neural network. J. R. Soc. Interface 19, 20210670 (2022)
121. M. Yin, E. Zhang, Y. Yu, G.E. Karniadakis, Interfacing finite elements with deep neural
operators for fast multiscale modeling of mechanics problems. Comput. Methods Appl. Mech.
Eng. 402, 115027 (2022)
122. L. Yuan, Y.-Q. Ni, X.-Y. Deng, S. Hao, A-PINN: auxiliary physics informed neural networks
for forward and inverse problems of nonlinear integro-differential equations. J. Comput. Phys.
462, 111260 (2022)
123. M. Zayernouri, G.E. Karniadakis, Fractional spectral collocation method. SIAM J. Sci. Com-
put. 36, A40–A62 (2014)
124. J. Zhang, S. Zhang, G. Lin, Multiauto-deeponet: a Multi-resolution Autoencoder Deeponet
for Nonlinear Dimension Reduction, Uncertainty Quantification and Operator Learning of
Forward and Inverse Stochastic Problems (2022). arXiv:2204.03193
125. J. Zhou, G. Cui, S. Hu, Z. Zhang, C. Yang, Z. Liu, L. Wang, C. Li, M. Sun, Graph neural
networks: a review of methods and applications. AI Open 1, 57–81 (2020)