Machine Learning in Modeling and Simulation
Methods and Applications

Timon Rabczuk · Klaus-Jürgen Bathe (Editors)
Computational Methods in Engineering &
the Sciences
Series Editor
Klaus-Jürgen Bathe, Department of Mechanical Engineering, Massachusetts
Institute of Technology, Cambridge, MA, USA
This Series publishes books on all aspects of computational methods used in engineering and the sciences. With emphasis on simulation through mathematical modelling, the Series accepts high-quality books across different domains of engineering, materials, and other applied sciences. The Series publishes monographs, contributed volumes, professional books, and handbooks, spanning cutting-edge research as well as the basics of professional practice. The topics of interest include the development and applications of computational simulations in the broad fields of Solid & Structural Mechanics, Fluid Dynamics, Heat Transfer, Electromagnetics, Multiphysics, Optimization, and Stochastics, with simulations in and for Structural Health Monitoring, Energy Systems, Aerospace Systems, Machines and Turbines, Climate Prediction, Effects of Earthquakes, Geotechnical Systems, Chemical and Biomolecular Systems, Molecular Biology, Nano- and Microfluidics, Materials Science, Nanotechnology, Manufacturing and 3D Printing, Artificial Intelligence, and the Internet-of-Things.
Editors
Timon Rabczuk, Bauhaus University, Weimar, Germany
Klaus-Jürgen Bathe, Massachusetts Institute of Technology (MIT), Cambridge, MA, USA
© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Switzerland AG 2023
Chapter “Machine Learning in Computer Aided Engineering” is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). For further
details see license information in the chapter.
This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.
This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Chapter 1
Machine Learning in Computer Aided Engineering

F. J. Montáns (B)
E.T.S. de Ingeniería Aeronáutica y del Espacio, Universidad Politécnica de Madrid, Pza. Cardenal Cisneros 3, 28040 Madrid, Spain
e-mail: [email protected]
Department of Mechanical and Aerospace Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, FL 32611, USA

E. Cueto
Aragon Institute of Engineering Research, Universidad de Zaragoza, Maria de Luna s/n, 50018 Zaragoza, Spain
e-mail: [email protected]

K. J. Bathe
Mechanical Engineering Department, Massachusetts Institute of Technology, 77 Mass. Ave., Cambridge, MA 02139, USA
e-mail: [email protected]

1.1 Introduction
it had less presence. The success of ML in many extremely useful areas such as speech and face recognition has contributed to this interest (Marr 2019). Today, ML may help you (through web services) to find a job, obtain a loan, find a partner, or obtain insurance, and it also assists, among other fields, in medical and legal services (Duarte 2018). Of course, ML raises many ethical issues, some of which are described, for example, in Stahl (2021). However, the discovered power and success of ML in many areas have had a very important impact on our society and, remarkably, on how many problems are addressed. It is no wonder that the number of ML papers published in almost all fields has increased sharply in the last 10 years, at a rate approximately following Moore's law (Frank et al. 2020).
Machine Learning is considered a part of Artificial Intelligence (AI) (Michalski et al. 2013). In essence, ML algorithms are general procedures and codes that, with the information from datasets, can give predictions for a wide range of problems (see Fig. 1.1). The main difference from classical programs is that classical programs are developed for specific applications, such as, in Computer Aided Engineering (the topic of this chapter), solving specific differential equations in integral form; an example is how finite elements have been developed. In contrast, ML procedures are intended for much more general applications, being used almost unchanged in apparently unconnected problems such as predicting the evolution of stocks, spam filtering, face recognition, typing prediction, pharmacologic design, or materials selection. ML methods also differ from Expert Systems, because the latter are based on fixed rules or fixed probability structures. ML methods excel when useful information needs to be obtained from massive amounts of data.
Of course, generality usually comes with a trade-off regarding efficiency for a specific problem solution (Fig. 1.1), so the use of ML for the solution of simple problems, or of problems which can be solved by other more specific procedures, is typically inappropriate. Furthermore, ML is used when predictions are needed for problems which have not been, or cannot be, accurately formulated; that is, when the variables and mathematical equations governing the problem are not fully determined, although physics-informed approaches with ML are now also receiving much attention (Raissi et al. 2019). Nonetheless, ML codes and procedures are still mostly used as general "black boxes", typically employing standard implementations available in free and open-source software repositories. A number of input variables are then employed and some specific output is desired, which together comprise the input-to-output process learned and adjusted from known cases or from the structure of the input data. Some of these free codes are Scikit-learn (Pedregosa et al. 2011) (one of the best known), Microsoft Cognitive Toolkit (Xiong et al. 2018), TensorFlow (Dillon et al. 2017) (which is optimal for CUDA-enabled Graphics Processing Unit (GPU) parallel ML), Keras (Gulli and Pal 2017), OpenNN (Build powerful models 2022), and SystemML (Ghoting et al. 2011), just to name a few. Other proprietary software offerings, used by big companies, are AWS Machine Learning Services from Amazon (Hashemipour and Ali 2020), Cloud Machine Learning Engine from Google (Bisong 2019a), Matlab (Paluszek and Thomas 2016; Kim 2017), Mathematica (Brodie et al. 2020; Rodríguez and Kramer 2019), etc. Moreover, many software offerings have libraries for ML and are often used in ML projects, like Python (NumPy, Bisong
Fig. 1.1 Overall Machine Learning (ML) process and contrast between efficiency and generality of the method. Hyperparameters are user-defined parameters which account for the type of problem, whereas parameters are optimized for best prediction. ML may be used for prediction and for classification. It is also often used as a tool for dimensionality reduction
2019b, Scikit-learn, Pedregosa et al. 2011, and Tensorly, Kossaifi et al. 2016; see
review in Stančin and Jović 2019), C++, e.g., Kaehler and Bradski (2016), Julia (a
recent Just-In-Time (JIT) compiling language created with science and ML in mind,
Gao et al. 2020; Innes 2018; Innes et al. 2019), and the R programming environ-
ment (Lantz 2019; Bischl et al. 2016; Molnar et al. 2018); see also Raschka and
Mirjalili (2019), King (2009), Gao et al. (2020), Bischl et al. (2016). These soft-
ware offerings also use many earlier published methods for standard computational
tasks such as mathematical libraries (like for curve fitting, the solution of linear
and nonlinear equations, the determination of eigenvalues and eigenvectors or Sin-
gular Value Decompositions), and computational procedures for optimization (e.g.,
the steepest descent algorithms). The offerings also use earlier established statistical
and regression algorithms, interpolation, clustering, domain slicing (e.g., tessellation
algorithms), and function approximations.
problems in general, there are many other ML techniques that are being used. We
review below the fundamental aspects of these techniques.
Many procedures use, or are derived from, statistics, and in particular probability
theory (Murphy 2012; Bzdok et al. 2018). In a similar manner, ML employs mostly
optimization procedures (Le et al. 2011). The main conceptual difference between
these theories and ML is the purpose of the developments. In the case of statistics,
the purpose is to obtain inference or characteristics of the population such as the
distribution and the mean (which of course could be used thereafter for predictions);
see Fig. 1.2. In the case of ML, the purpose is to predict new outcomes, often without
the need for statistically characterizing populations, and incorporate these outcomes
in further predictions (Bzdok et al. 2018). ML approaches often support predictions
on models. ML optimizes parameters for obtaining the best predictions as quantified
by a cost function, and the values of these parameters are optimized also to account for
the uncertainty in the data and in the predictions. ML approaches may use statistical
distributions, but those are not an objective and their evaluation is often numerical
(ML is interested in predictions). Also, while ML uses optimization procedures to
obtain values of parameters, the objective is not to obtain the “optimum” solution to fit
data, but a parsimonious model giving reliable predictions (e.g., to avoid overfitting).
number, a matrix, or other. The purpose of the ML approach in this case is (typically) to create a model that relates those known outputs $y_i$ to the dataset samples through some combination of the $j = 1, \dots, N$ features $x_j^{(i)} \equiv x_{ji}$ of each sample $i$. The $x_j^{(i)}$ are also referred to as data, variables, measurements, or characteristics, depending on the context or field of application. An example of a ML procedure could be to relate the seismic vulnerability of a building (label) as a function of features like construction type, age, size, location, building materials, maintenance, etc. (Rosti et al. 2022; Zhang et al. 2019; Ruggieri et al. 2021). The ML purpose
is here to be able to learn the vulnerability of buildings from known vulnerabilities
of other buildings. The labeling could have been obtained from experts or from past
earthquakes. Supervised learning is based on sufficient known data, and we want to
determine predictions in the nearby domain. In essence, we can say that “supervised
learning is a high-dimensional interpolation problem” (Mallat 2016; Gin et al. 2021).
We note that supervised learning may be improved with further data when available,
since it is a dynamic learning procedure, mimicking the human brain.
In unsupervised learning the samples si are unlabeled (si ≡ {xi }), so the purpose
is to label the samples from learning similitudes and common characteristics in the
features of the samples; it is usually an instance-based learning. Typical unsupervised
ML approaches are employed in clustering (e.g., classifying the structures by type in
our previous example), dimensionality reduction (detecting which features are less
relevant to the output label, for example because all or most samples have it, like
doors in buildings), and outlier detection (e.g., in detecting abnormal traffic in the
Internet, Salman et al. 2020, 2022; Salloum et al. 2020) for the case when very few
samples have that feature. These approaches are similar to data mining.
Semi-supervised learning is conceptually a combination of the previous
approaches but with specific ML procedures. In essence it is a supervised learn-
ing approach in which there are few labeled samples (output known) but many more
unlabeled samples (output unknown), even sometimes with incomplete features, with
some missing characteristics, which may be filled in by imputation techniques (Lak-
shminarayan et al. 1996; Ramoni and Sebastiani 2001; Liu et al. 2012; Rabin and
Fishelov 2017). The point here is that by having many more samples, even without assigned labels, we can determine better the statistical distributions of the data and the possible significance of the features in the result, which is an improvement over using only labeled data, for which the features have been used to determine the label. For
example, in our seismic vulnerability example, imagine that one feature is that the
building has windows. Since almost all buildings have windows, it is unlikely that
this feature is relevant in determining the vulnerability (it will give little Information
Gain; see below). On the contrary, if 20% of the buildings have a steel structure, and
if the correlation is positive regarding the (lack of) vulnerability, it is likely that the
feature is important in determining the vulnerability.
There is also another type of ML, seldom used in CAE, which is reinforcement learning (or reward-based learning). In this case, the computer develops and changes actions to learn a policy depending on the feedback, i.e. rewards which themselves modify the subsequent actions by maximizing the expected reward. It shares some concepts with supervised learning, but the purpose is an action, instead of a prediction.
Data is the key to ML procedures, so datasets are usually large and obtained in dif-
ferent ways. The importance of data requires that the data is presented to the ML
method (and maintained if applicable) in optimal format. To reach that goal requires
many processes which often also involve ML techniques. For example, in a dataset
there may be data which are not in a logical range, or with missing entries, hence they
need to be cleaned. ML techniques may be used to determine outliers in datasets, or
assign values (data imputation) according to the other features and labels present in
other samples in the dataset. Different dataset formats such as qualitative entries like
“good”, “fair”, or “bad”, and quantitative entries like “1–9”, may need to be con-
verted (encoded) to standardized formats, using also ML algorithms (e.g., assigning
“fair” to a numerical value according to samples in the dataset). This is called data
ingestion. ML procedures may also need the data distributions to be determined, that is, the data evaluated to learn whether a feature follows a normal distribution or whether there is a consistent bias; the data may also need to be standardized according to min–max values or to the same normal distribution, for example to avoid numerical issues and to give proper weight
proper maintenance of the data so it remains useful, using many operations such as
data cleaning, organization, and labeling. This is called data curation.
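As an illustrative sketch of the encoding and standardization steps just described (using scikit-learn; the feature names and values below are hypothetical placeholders):

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder, MinMaxScaler, StandardScaler

# Hypothetical dataset: a qualitative "maintenance" feature and a quantitative "age" feature
maintenance = np.array([["good"], ["fair"], ["bad"], ["fair"]])
age_years = np.array([[5.0], [40.0], [75.0], [20.0]])

# Encode qualitative entries to numbers (data ingestion)
encoder = OrdinalEncoder(categories=[["bad", "fair", "good"]])
maintenance_encoded = encoder.fit_transform(maintenance)  # bad=0, fair=1, good=2

# Standardize quantitative entries, either to [0, 1] (min-max) or to zero mean / unit variance
age_minmax = MinMaxScaler().fit_transform(age_years)
age_standard = StandardScaler().fit_transform(age_years)
print(maintenance_encoded.ravel(), age_minmax.ravel(), age_standard.ravel())
```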
Another aspect of data treatment is the creation of a training set, a validation set,
and a test set from a database (although often test data refers to both the validation
and the test set, in particular when only one model is considered). The purpose of the
training set is to train the ML algorithm: to create the “model”. The purpose of the
validation set is to evaluate the models in an independent way from the training set,
for example to see which hyperparameters are best suited, or even which ML method
is best suited. Examples may be the number of neurons in a neural network or the
smoothing hyperparameter in splines fitting; different smoothing parameters yield
different models for the same training set, and the validation set helps to select the best
values, obtaining the best predictions but avoiding overfitting. Recall that ML is not interested in the minimum error for the training set, but in a reliable predictive model.
The test set is used to evaluate the performance of the final selected model from the
overall learning process. An accurate prediction of the training set with a poor pre-
diction of the test set is an indicator of overfitting: we have reached an unreliable
model. A model with similar accuracy in the training and test sets is a good model.
The training set should not be used for assessing the accuracy of the model because
the parameters and their values have been selected based on these data and hence
overfitting may not be detected. However, if more data is needed for training, there are
techniques for data augmentation, typically performing variations, transformations,
Overfitting and model complexity are important aspects in ML; see Fig. 1.3. Given that the data have errors and often some stochastic nature, a model which gives zero error on the training data is not necessarily a good model; indeed, this is usually a hint of the opposite: a manifestation of overfitting (Fig. 1.3a). The best models are the less complex (parsimonious) models that follow Occam's razor: they are as simple as possible but still have great predictive power. Hence, the fewer the parameters, the better. However, it is often difficult to simplify ML models to have few "smart" parameters, so model reduction and regularization techniques are often used as a "no-brainer" remedy for overfitting. Typical regularization ("smoothing") techniques are the Least Absolute Shrinkage and Selection Operator (LASSO), i.e. sparse or L1 regularization (Xu et al. 2008), and L2 regularization, called Ridge (Tikhonov 1963) or noise (Bishop 1995) regularization, or regression. The LASSO scheme "shrinks" the less important features
Fig. 1.3 Using B-splines to fit hyperelastic stress–strain data. Regression may be performed in nominal stress–stretch (P − λ) axes, or in true stress–strain (σ − E) axes; note that the result is different. While the usual test representation in hyperelasticity is in the (P − λ) axes, regression is preferred in σ − E because of the symmetry of tension and compression in logarithmic strains. B-spline fit of experimental data with a) overfitting and b) a proper fit using regularization based on stability conditions. Modified from Latorre and Montáns (2020)
(hence is also used for feature selection), whereas the L2 scheme gives a more even
weight to them. The combination of both is known as elastic net regularization (Zou
and Hastie 2005).
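A minimal sketch of these regularized regressions using scikit-learn (the synthetic data and the regularization weights are placeholders chosen only for illustration):

```python
import numpy as np
from sklearn.linear_model import Ridge, Lasso, ElasticNet

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))          # 50 samples, 10 features (most of them irrelevant)
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=50)

# L2 (Ridge) shrinks all weights evenly; L1 (LASSO) drives irrelevant weights to zero;
# the elastic net combines both penalties.
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)
print(ridge.coef_.round(2))
print(lasso.coef_.round(2))   # most coefficients are exactly zero (feature selection)
print(enet.coef_.round(2))
```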
Model selection taking into account model fitness, and including a penalization for model complexity, is often performed by employing the Akaike Information Criterion (AIC). Given a collection of models arising from the available data, the AIC allows one to compare these models so as to help select the best-fitting one. In essence, the AIC not only estimates the relative amount of information lost by each model but also takes into account its parsimony. In other words, it deals with the trade-off between overfitting and underfitting by computing

$$\mathrm{AIC} = 2p - 2\ln L \qquad (1.1)$$

where $p$ is the number of model parameters and $L$ is the maximized likelihood of the model.
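As a sketch of how the AIC may be evaluated in practice, assuming a least-squares fit with Gaussian errors so that the maximized log-likelihood can be expressed through the residual sum of squares (the data and models below are illustrative placeholders):

```python
import numpy as np

def aic_gaussian(y, y_pred, n_params):
    """AIC = 2p - 2 ln L for a least-squares fit, assuming Gaussian errors."""
    n = len(y)
    rss = np.sum((np.asarray(y) - np.asarray(y_pred)) ** 2)
    # Maximized log-likelihood of the Gaussian model, up to an additive constant
    log_likelihood = -0.5 * n * np.log(rss / n)
    return 2.0 * n_params - 2.0 * log_likelihood

# Example: compare a 2-parameter and a 5-parameter polynomial fit of the same data
x = np.linspace(0.0, 1.0, 30)
y = 1.0 + 2.0 * x + 0.05 * np.random.default_rng(1).normal(size=30)
for degree in (1, 4):
    coeffs = np.polyfit(x, y, degree)
    y_pred = np.polyval(coeffs, x)
    print(degree + 1, "parameters, AIC =", round(aic_gaussian(y, y_pred, degree + 1), 2))
```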
Fig. 1.4 Training and test sets: k-fold generation of training and validation sets from data. Number of data: 9; data for training and model selection: 6; data for the final validation test (test set): 3; number of folds for model selection: 3; data in each fold: 2; number of models: 3 (k = 3). Sometimes the validation set is also considered as a test set. The 10-fold cross-validation is a common choice
In Leave-One-Out Cross-Validation (LOOCV), the number of folds is the same as the number of samples, so the test set has only one element. While LOOCV is expensive in general (Meijer and Goeman 2013), for the linear case it is very efficient because all the errors are obtained simultaneously with a single fit through the so-called hat matrix (Angelov and Stoimenova 2017).
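A minimal sketch of the hold-out plus k-fold procedure of Fig. 1.4, written with scikit-learn (the dataset, the model, and the candidate hyperparameter values are placeholders):

```python
import numpy as np
from sklearn.model_selection import train_test_split, KFold
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(90, 2))
y = np.sin(4 * X[:, 0]) + 0.1 * rng.normal(size=90)

# Final test set kept apart for the final model assessment
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

# k-fold selection of a hyperparameter (here, the number of neighbors)
for k_neighbors in (2, 5, 10):
    errors = []
    for train_idx, val_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X_trainval):
        model = KNeighborsRegressor(n_neighbors=k_neighbors)
        model.fit(X_trainval[train_idx], y_trainval[train_idx])
        pred = model.predict(X_trainval[val_idx])
        errors.append(np.mean((pred - y_trainval[val_idx]) ** 2))
    print(k_neighbors, "neighbors, mean validation MSE:", round(float(np.mean(errors)), 4))
```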
The schemes we present in this section are basic ingredients of many ML algorithms.
The simplest ML approach is much older than the ML discipline: linear and nonlinear
regression. In the former case, the purpose is to compute the weights w and the offset
b of the linear model $\tilde{y} \equiv f(\mathbf{x}) = \mathbf{w}^T\mathbf{x} + b$, where $\mathbf{x}$ is the vector of features. The parameters $\mathbf{w}, b$ are obtained through the minimization of the cost function (MSE: Mean Squared Error)

$$C(\mathbf{x}_i;\{\mathbf{w},b\}) = \frac{1}{n}\sum_{i=1}^{n} L_i := \frac{1}{n}\sum_{i=1}^{n}\big[f(\mathbf{x}_i;\{\mathbf{w},b\}) - y_i\big]^2 \qquad (1.2)$$
with respect to them, which in this case is the average of the loss function Li =
( ỹi − yi )2 , where the yi are the known values, ỹi = f (xi ; {w, b}) are the predictions,
and the subindex i refers to sample i, so xi is the vector of features of that sample.
Of course, in linear regression, the parameters are obtained simply by solving the
linear system of equations resulting from the quadratic optimization problem. Other
regression algorithms are similar, for example spline, B-spline, or P-spline (penalized B-spline) regressions, used in nonlinear mechanics (Crespo et al. 2017; Latorre and Montáns 2017) or used to efficiently invert functions which do not have an analytical inverse (Benítez and Montáns 2018; Eubank 1999; Eilers and Marx 1996). In all these cases, smoothing techniques are fundamental to avoid overfitting.
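For the linear model above, the minimization of the MSE of Eq. (1.2) reduces to a linear least-squares problem; a minimal NumPy sketch with synthetic placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, N = 200, 3                              # n samples, N features
X = rng.normal(size=(n, N))
w_true, b_true = np.array([1.5, -2.0, 0.7]), 0.3
y = X @ w_true + b_true + 0.05 * rng.normal(size=n)

# Augment the features with a column of ones so the offset b becomes one more weight
Xa = np.hstack([X, np.ones((n, 1))])
# Minimizing the MSE of Eq. (1.2) leads to the linear (normal) equations solved here
params, *_ = np.linalg.lstsq(Xa, y, rcond=None)
w_fit, b_fit = params[:-1], params[-1]
print(w_fit.round(3), round(float(b_fit), 3))
```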
While it is natural to state the regression problem as a minimization of the cost function, it may also be formulated in terms of the likelihood function $L$. Given some training data $(y_i, \mathbf{x}_i)$ (with labels $y_i$ for data $\mathbf{x}_i$), we seek the parameters $\mathbf{w}$ (for simplicity we now include $b$ in the set $\mathbf{w}$) that minimize the cost function (e.g., the MSE); or, equivalently, we seek the set $\mathbf{w}$ which maximizes the likelihood $L(\mathbf{w}|(\mathbf{y},\mathbf{x})) = p(\mathbf{y}|\mathbf{x};\mathbf{w})$ of those parameters $\mathbf{w}$ giving the probability representation of the training data, which is the same as the probability of finding the data $(\mathbf{y},\mathbf{x})$ given the distribution characterized by $\mathbf{w}$. The likelihood is the "probability" by which a distribution (characterized by $\mathbf{w}$) represents all given data, whereas the probability is that of finding the data if the distribution is known. Assuming the data to be identically distributed and independent such
that p(y1 , y2 , . . . , yn |x1 , x2 , . . . , xn ; w) = p(y1 |x1 ; w) p(y2 |x2 ; w) . . . p(yn |xn ; w),
the likelihood is
$$L(\mathbf{w}|(y_i,\mathbf{x}_i),\ i=1,\dots,n) = \prod_{i=1}^{n} p(y_i|\mathbf{x}_i;\mathbf{w}) \qquad (1.3)$$

or

$$\log L(\mathbf{w}|(y_i,\mathbf{x}_i),\ i=1,\dots,n) = \sum_{i=1}^{n} \log p(y_i|\mathbf{x}_i;\mathbf{w}) \qquad (1.4)$$
describes this case for a given xi , as it is immediate to check. The linear regression
is assigned to the logit function to convert the (−∞, ∞) range into the desired
probabilistic [0, 1] range
$$\operatorname{logit}(p(\mathbf{x})) = \log\frac{p(\mathbf{x})}{1-p(\mathbf{x})} = \mathbf{w}^T\mathbf{x} \qquad (1.7)$$

The logit function is the logarithm of the ratio between the odds of $y = 1$ (which are $p$) and of $y = 0$ (which are $1 - p$). The probability $p(\mathbf{x})$ may be solved for from Eq. (1.7) as

$$p(\mathbf{x}) = \frac{1}{1 + \exp(-\mathbf{w}^T\mathbf{x})} \qquad (1.8)$$
which is known as the sigmoid function. Neural Networks frequently use logis-
tic regression with the sigmoid model function where the parameters are obtained
through the minimization of the proper cost function, or through the maximization
of the likelihood. In this latter case, the likelihood of the probability distribution in
Eq. (1.6) is
$$L(p(\mathbf{x})|(y_i,\mathbf{x}_i),\ i=1,\dots,n) = \prod_{i=1}^{n} p(\mathbf{x}_i)^{y_i}\,[1-p(\mathbf{x}_i)]^{(1-y_i)} \qquad (1.9)$$
where yi are the labels (with value 1 or 0) and p(xi ) are their (sigmoid-based)
probabilistic predicted values given by Eq. (1.8) for the training data, which are a
function of the parameters w. The maximization of the log-likelihood of Eq. (1.9) for
the model parameters gives the same solution as the minimization of the cross-entropy
$$\arg\min_{\mathbf{w}} H(\mathbf{w}) = -\sum_{i=1}^{n}\Big[\,y_i \log p(\mathbf{x}_i;\mathbf{w}) + (1-y_i)\log\big(1-p(\mathbf{x}_i;\mathbf{w})\big)\Big] \qquad (1.10)$$
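A minimal sketch of logistic regression trained by gradient descent on the cross-entropy of Eq. (1.10); the learning rate, iteration count, and synthetic data are illustrative placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))          # Eq. (1.8)

rng = np.random.default_rng(0)
n = 200
X = np.hstack([rng.normal(size=(n, 2)), np.ones((n, 1))])     # last column: bias term
w_true = np.array([2.0, -1.0, 0.5])
y = (sigmoid(X @ w_true) > rng.uniform(size=n)).astype(float)  # 0/1 labels

w = np.zeros(3)
for _ in range(5000):
    p = sigmoid(X @ w)
    grad = X.T @ (p - y) / n                  # gradient of the cross-entropy of Eq. (1.10)
    w -= 0.5 * grad                           # steepest-descent update
print(w.round(2))                             # approaches w_true for enough data
```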
For example, a typical choice is the Gaussian kernel

$$K_i(\mathbf{x}) = \exp\left(-\frac{|\mathbf{x}-\mathbf{x}_i|^2}{2\sigma^2}\right) \qquad (1.11)$$

where $\sigma$ is the bandwidth or smoothing parameter (deviation), and the weight for sample $i$ is $w_i(\mathbf{x}) = K_i(\mathbf{x})/\sum_{j=1}^{n} K_j(\mathbf{x})$. The predictor, using the weights from the kernel, is $f(\mathbf{x}) = \sum_{i=1}^{n} w_i(\mathbf{x})\,y_i$ (although kernels may also be used for the labels). The cost function to determine $\sigma^2$ or other kernel parameters may be

$$\int f^2(\mathbf{x})\,d\mathbf{x} - \frac{2}{n}\sum_{i=1}^{n} f^{(i)}(\mathbf{x}_i) \longrightarrow \min \qquad (1.12)$$

where the last summation term is the LOOCV, in which $f^{(i)}$ excludes sample $i$ from the set of predictions (recall that there are $n$ different $f^{(i)}$ functions). Equation (1.12)
focuses in essence on the minimum squared error for the solution. As explained below,
kernels are also employed in Support Vector Machines to deal with nonlinearity and
in dimensionality reduction of nonlinear problems to reduce the space.
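A minimal sketch of the kernel-weighted predictor described above, with a Gaussian kernel and a hand-picked bandwidth (in practice σ would be selected with a criterion such as Eq. (1.12)); the data are synthetic placeholders:

```python
import numpy as np

def kernel_predict(x_query, x_train, y_train, sigma):
    """Kernel-weighted predictor f(x) = sum_i w_i(x) y_i with a Gaussian kernel."""
    K = np.exp(-(x_query[:, None] - x_train[None, :]) ** 2 / (2.0 * sigma ** 2))
    weights = K / K.sum(axis=1, keepdims=True)    # w_i(x) = K_i(x) / sum_j K_j(x)
    return weights @ y_train

rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0.0, 2.0 * np.pi, 60))
y_train = np.sin(x_train) + 0.1 * rng.normal(size=60)
x_query = np.linspace(0.0, 2.0 * np.pi, 5)
print(kernel_predict(x_query, x_train, y_train, sigma=0.3).round(2))
```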
Naïve Bayes (NB) schemes are frequently used for classification (spam e-mail fil-
tering, seismic vulnerability, etc.), and may be Multinomial NB or Gaussian NB. In
both cases the probability theory is employed.
NB procedures operate as follows. From the training data, the prior probabil-
ities for the different classes are computed, e.g., vulnerable or safe, p(V ) and
p(S), respectively, in our seismic vulnerability example. Then, for each feature,
the probabilities are computed within each class, e.g., the probability that a vul-
nerable (or safe) structure is made of steel, p(steel|V ) (or p(steel|S)). Finally,
given a sample outside the training set, the classification is obtained from the largest probability considering the class and the features present in the sample, e.g., $p(V)\,p(\text{steel}|V)\,p(\dots|V)\dots$ versus $p(S)\,p(\text{steel}|S)\,p(\dots|S)\dots$, and so on. Gaussian NB is applied when the features have continuous (Gaussian) distributions, as for example the height of a building in the seismic vulnerability example. In this case the feature-conditioned probabilities $p(\cdot|V)$ are obtained from the respective normal distributions. Logarithms of the probabilities are frequently used to avoid underflows.
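A minimal sketch of Gaussian Naïve Bayes classification on a toy version of the seismic vulnerability example (features, labels, and values are invented for illustration only):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical features: [building height (m), age (years)]; label: 1 = vulnerable, 0 = safe
X = np.array([[25.0, 60.0], [8.0, 10.0], [30.0, 70.0], [6.0, 5.0],
              [22.0, 55.0], [10.0, 15.0]])
y = np.array([1, 0, 1, 0, 1, 0])

# Gaussian NB: per-class priors p(V), p(S) and per-feature normal distributions p(feature|class)
model = GaussianNB().fit(X, y)
print(model.predict([[28.0, 65.0], [7.0, 8.0]]))        # predicted classes
print(model.predict_proba([[28.0, 65.0]]).round(3))     # class probabilities
```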
Decision trees are nonparametric. The simplest and well-known decision tree gener-
ator is the Iterative Dichotomiser 3 (ID3) algorithm, a greedy strategy. The scheme
is used in the “guess who?” game and is also in essence the idea behind the root-
finding bisection method or the Cuthill–McKee renumbering algorithm. The objec-
tive is, starting from one root node, to select at each step the feature and condi-
tion from the data that maximizes the Information Gain G (maximizes the benefit
of the split), resulting in two (or more) subsequent leaf nodes. For example, in a
group of people, the first optimal condition (maximizing the benefit of the split)
is typically if the person is male or female, resulting in the male leaf and in the
female leaf, each with 50% of the population, and so on. In seismic vulnerability,
it could be if the structure is made of masonry, steel, wood, or Reinforced Con-
crete (RC). The gain G is the difference between the information entropies before
(H (S), parent entropy) and after the split given by the feature at hand j. Let us
denote by x j (i) the feature j of sample i, by xi the array of features of sample
i, and by x j the different features (we omit the sample index if no confusion is
possible). If H (S|x j ) are the children entropies after the split by feature x j , the
Gain is
$$G(S, x_j) = H(S) - H(S|x_j) = H(S) - \sum_{S_j} p_j\, H(S_j) \qquad (1.13)$$
where the $S_j$ are the subsets of $S$ resulting from the split using feature (or attribute) $x_j$, $p_j$ is the subset probability (the number of samples $s_i$ in subset $S_j$ divided by the number of samples in the complete set $S$), and

$$H(S) = -\sum_{j=1}^{l} p(y_j)\log p(y_j) \qquad (1.14)$$
where H (S) is the information entropy of set S for the possible labels y j , j =
1, . . . , l, so Eq. (1.13) results in
$$G(S, x_j) = -\sum_{i=1}^{l} p(y_i)\log p(y_i) + \sum_{S_j} p_j \sum_{i=1}^{l} p(y_i|x_j)\log p(y_i|x_j) \qquad (1.15)$$
The gain G is computed for each feature x j (e.g., windows, structure type, building
age, and soil type). The feature that maximizes the Gain is the one selected to generate
the next level of leaves. The decision tree building process ends when the entropy
reaches zero (the samples are perfectly classified). Figure 1.5 shows a simple example
with four samples si in the dataset, each with three features x j of two possible values
(0 and 1), and one label y of two possible values (A and B). The best of the three
features is selected as that one which provides the most information gain. It is seen
that feature 1 produces some information gain because after the split using this
feature, the samples are better classified according to the label. Feature 2 gives no
Fig. 1.5 Example of determination of the feature with most information gain. If we choose feature $x_1$ for sorting, we find two subsets, $S_1 = \{s_i \text{ such that } x_1 = 1\} = \{s_2, s_3, s_4\}$ and $S_2 = \{s_i \text{ s.t. } x_1 = 0\} = \{s_1\}$. In $S_1$ there are two elements ($s_3, s_4$) with label A, and one element ($s_2$) with label B, so the probabilities are 2/3 for label A and 1/3 for label B. In $S_2$ the only element ($s_1$) has label B, so the probabilities are 0/1 for label A and 1/1 for label B. The entropy of subset $S_1$ is $H(S_1) = -\tfrac{2}{3}\log_2\tfrac{2}{3} - \tfrac{1}{3}\log_2\tfrac{1}{3} = 0.92$. The entropy of subset $S_2$ is $H(S_2) = -0 - \tfrac{1}{1}\log_2\tfrac{1}{1} = 0$. In a similar form, since there are a total of 4 samples, two with label A and two with label B, the parent entropy is $H(S) = -\tfrac{2}{4}\log_2\tfrac{2}{4} - \tfrac{2}{4}\log_2\tfrac{2}{4} = 1$. Then, the information gain is $G = H(S) - p(S_1)H(S_1) - p(S_2)H(S_2) = 1 - \tfrac{3}{4}\,0.92 - \tfrac{1}{4}\,0 = 0.31$, where $p(S_i)$ is the probability of a sample being in subset $S_i$, i.e. 3/4 for $S_1$ and 1/4 for $S_2$. Repeating the computations for the other two features, it is seen that feature $x_3$ is the one that has the best information gain. Indeed the information gain is $G = 1$ because it fully separates the samples according to the labels
gain because it is useless for distinguishing the samples according to the label (each split contains 50% of each label), and feature 3 is the best one because it fully classifies the samples according to the label (A is equivalent to $x_3 = 1$, and B is equivalent to $x_3 = 0$). As with the Cuthill–McKee renumbering algorithm, there is no proof that the optimum is reached.
While DT are typically used for classification, there are also regression trees, in which the output is a real number. Other decision tree algorithms are the C4.5 (using continuous attributes), Classification And Regression Tree (CART), and Multivariate Adaptive Regression Spline (MARS) schemes.
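A minimal sketch reproducing the information-gain computation of Eqs. (1.13)–(1.15) for the four samples of Fig. 1.5 (the NumPy implementation below is only one possible way to code it):

```python
import numpy as np

def entropy(labels):
    """Information entropy H(S) = -sum_j p(y_j) log2 p(y_j), Eq. (1.14)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature, labels):
    """G(S, x_j) = H(S) - sum over subsets of p_j H(S_j), Eq. (1.13)."""
    gain = entropy(labels)
    for value in np.unique(feature):
        subset = labels[feature == value]
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

# Four samples, three binary features, labels A/B as in Fig. 1.5
X = np.array([[0, 1, 0],
              [1, 0, 0],
              [1, 1, 1],
              [1, 0, 1]])
y = np.array(["B", "B", "A", "A"])
for j in range(3):
    print("feature", j + 1, "gain =", round(information_gain(X[:, j], y), 2))
# Feature 1 gives G = 0.31, feature 2 gives G = 0, feature 3 gives G = 1
```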
The Support Vector Machine (SVM) is a technique which tries to find the optimal
hyperplane separating groups of samples for clustering (unsupervised) or classifi-
cation (supervised). Consider the function z(x) = w T x + b for classification. The
model is
y ≡ f (x) = sign(z(x)) = sign(w T x + b) (1.16)
where the parameters $\{\mathbf{w}, b\}$ are obtained by minimizing $\tfrac{1}{2}|\mathbf{w}|^2$ (or equivalently $|\mathbf{w}|^2$ or $|\mathbf{w}|$) subject to $y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1\ \forall i$, such that the decision boundary $f(\mathbf{x}) = 0$ given by the hyperplane has maximum distance to the groups of the samples; see Fig. 1.6. The minimization problem (in primal form) using Lagrange multipliers $\alpha_i$ is

$$\text{find } \arg\min_{\mathbf{w},b}\left[\tfrac{1}{2}|\mathbf{w}|^2 + \sum_{i=1}^{n}\alpha_i\big(1 - y_i(\mathbf{w}^T\mathbf{x}_i + b)\big)\right] \text{ with } \alpha_i \ge 0 \qquad (1.17)$$

or in penalty form

$$\text{find } \arg\min_{\mathbf{w},b}\left[\tfrac{1}{2}|\mathbf{w}|^2 + C\sum_{i=1}^{n}\max\big(0,\ 1 - y_i(\mathbf{w}^T\mathbf{x}_i + b)\big)\right] \qquad (1.18)$$
A measure of certainty for sample i is based on its proximity to the boundary; i.e.
(w T xi + b)/|w| (the larger the distance to the boundary, the more certain the classifi-
cation of the sample). Of course, SVMs may be used for multiclass classification, e.g.,
using the One-to-Rest approach (employing $k$ SVMs to classify $k$ classes) or the One-to-One approach (employing $\tfrac{1}{2}k(k-1)$ SVMs to classify $k$ classes); see Fig. 1.6.
Taking the derivative of the Lagrangian in square brackets in Eq. (1.17) with respect to $\mathbf{w}$ and $b$, we get that at the minimum

$$\mathbf{w} = \sum_{i=1}^{n}\alpha_i y_i \mathbf{x}_i \quad\text{and}\quad \sum_{i=1}^{n}\alpha_i y_i = 0 \qquad (1.19)$$
Fig. 1.6 Two-class SVM decision boundary and one-to-one and one-to-rest SVM multiclass classification
and substituting it in the primal form given in Eq. (1.17), the minimization problem
may be written in its dual form
$$\text{find } \arg\max_{\alpha_i \ge 0}\left[\sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{j,k=1}^{n}\alpha_j\alpha_k y_j y_k\,(\mathbf{x}_j^T\mathbf{x}_k)\right] \text{ with } \sum_{i=1}^{n}\alpha_i y_i = 0 \qquad (1.20)$$

and with $b = y_j - \mathbf{w}^T\mathbf{x}_j$, $\mathbf{x}_j$ being any active (support) vector, with $\alpha_j > 0$. Then, $z = \mathbf{w}^T\mathbf{x} + b$ is $z = \sum_i \alpha_i y_i\,\mathbf{x}_i^T\mathbf{x} + b$. Instead of searching for the weights $w_i$, $i = 1, \dots, N$ ($N$ is the number of features of each sample), we search for the coefficients $\alpha_i$, $i = 1, \dots, n$ ($n$ is the number of samples).
Linearly non-separable cases may be addressed through different techniques, such as using positive slack variables $\xi_i \ge 0$ or kernels. When using slack variables (Soft Margin SVM), for each sample $i$ we write $y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i$ and we apply an L1 (LASSO) regularization by minimizing $\tfrac{1}{2}|\mathbf{w}|^2 + C\sum_i \xi_i$ subject to the constraints $y_i(\mathbf{w}^T\mathbf{x}_i + b) \ge 1 - \xi_i$ and $\xi_i \ge 0$, where $C$ is the penalization parameter. In this case, the only change in the dual formulation is the constraint for the Lagrange multipliers: $C \ge \alpha_i \ge 0$, as can easily be verified.
When using kernels, the kernel trick is typically employed. The idea behind the
use of kernels is that if data is linearly non-separable in the features space, it may be
Fig. 1.7 Use of higher dimensions to obtain linearly separable data. a Data is linearly separable in 1D. b Data is not linearly separable in 1D. c Using two dimensions with mapping $\phi = [x, x^2]^T$, data becomes linearly separable in the augmented space
separable in a larger space; see, for example, Fig. 1.7. This technique uses the dual
form of the SVM optimization problem. Using the dual form
$$|\mathbf{w}|^2 = \mathbf{w}^T\mathbf{w} = \sum_{i,j=1}^{n}\alpha_i\alpha_j y_i y_j\,(\mathbf{x}_i^T\mathbf{x}_j) \quad\text{and}\quad \mathbf{w}^T\mathbf{x} = \sum_{i=1}^{n}\alpha_i y_i\,(\mathbf{x}_i^T\mathbf{x}) \qquad (1.21)$$
the equations only involve inner products of feature vectors of the type $(\mathbf{x}_i^T\mathbf{x}_j)$, ideal for using a kernel trick. For example, the case shown in Fig. 1.8 is not linearly separable in the original features space, but using the mapping $\phi(\mathbf{x}) := \big[x_1^2,\ x_2^2,\ \sqrt{2}\,x_1x_2\big]^T$ to an augmented space, we find that the samples are linearly separable in this space. Then, for performing the linear separation in the transformed space, we have to compute $z$ in that transformed space (Representer Theorem, Schölkopf et al. 2001)

$$z = \sum_{i=1}^{n}\alpha_i y_i\,[\phi(\mathbf{x}_i)^T\phi(\mathbf{x})] + b \qquad (1.22)$$
Fig. 1.8 Linearly non-separable samples (left). Linear separation in a transformed higher dimen-
sional space (right)
to substitute the inner products in the original space by inner products in the trans-
formed space. These operations (transformations plus inner products in the high-
dimensional space) can be expensive (in complex cases we need to add many dimen-
sions). However, in our example we note that
$$K(\mathbf{a},\mathbf{b}) := \phi(\mathbf{a})^T\phi(\mathbf{b}) = \begin{bmatrix} a_1^2 & a_2^2 & \sqrt{2}\,a_1a_2 \end{bmatrix}\begin{bmatrix} b_1^2 \\ b_2^2 \\ \sqrt{2}\,b_1b_2 \end{bmatrix} = \left(\begin{bmatrix} a_1 & a_2 \end{bmatrix}\begin{bmatrix} b_1 \\ b_2 \end{bmatrix}\right)^2 = (\mathbf{a}^T\mathbf{b})^2 \qquad (1.23)$$
so it is not necessary to use the transformed space because the inner product can be equally calculated in both spaces. Indeed note that, remarkably, we do not even need to know $\phi(\mathbf{x})$ explicitly, because the kernel $K(\mathbf{a},\mathbf{b}) = (\mathbf{a}^T\mathbf{b})^2$ is fully written in the original space and we never need $\phi(\mathbf{x})$. Then we just solve

$$\text{find } \arg\max_{\alpha_i \ge 0}\left[\sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{j,k=1}^{n}\alpha_j\alpha_k y_j y_k\,K(\mathbf{x}_j,\mathbf{x}_k)\right] \text{ with } \sum_{i=1}^{n}\alpha_i y_i = 0 \qquad (1.24)$$
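A minimal sketch of a kernelized SVM on data resembling Fig. 1.8, using scikit-learn's SVC with a degree-2 polynomial kernel, which is equivalent to the kernel of Eq. (1.23) up to scaling; the data are synthetic placeholders:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (np.sum(X ** 2, axis=1) > 1.0).astype(int)   # inner disk vs. outer ring: not linearly separable

# Polynomial kernel of degree 2 with coef0 = 0: K(a, b) proportional to (a^T b)^2, cf. Eq. (1.23)
model = SVC(kernel="poly", degree=2, coef0=0.0, C=10.0).fit(X, y)
print("training accuracy:", round(model.score(X, y), 3))
print("support vectors per class:", model.n_support_)
```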
When problems have too many features (or data, measurements), dimensionality
reduction techniques are employed to reduce the number of attributes and gain
insight into the most meaningful ones. These are typically employed not only in
pattern recognition and image processing (e.g., identification or compression) but
also to determine which features, data, or variables are most relevant for the learning
purpose. In essence, the algorithms are similar in nature to determining the principal
modes in a dynamic response, because with that information, relevant mechanical
properties (mass distribution, stiffness, damping), and the overall response may be
obtained. In ML, classical approaches are given by Principal Component Analysis
(PCA) based on Pearson’s Correlation Matrix (Abdi and Williams 2010; Bro and
Smilde 2014), Singular Value Decomposition (SVD), Proper Orthogonal Decompo-
sition (POD) (Berkooz et al. 1993), Linear (Fisher’s) Discriminant Analysis (LDA)
(Balakrishnama and Ganapathiraju 1998; Fisher 1936), Kernel (Nonlinear) Prin-
cipal Component Analysis (kPCA), Hofmann et al. (2008), Alvarez et al. (2012),
Local Linear Embedding (LLE) (Roweis and Saul 2000; Hou et al. 2009), Man-
ifold Learning (used also in constitutive modeling) (Cayton 2005; Bengio et al.
2013; Turaga et al. 2020), Uniform Manifold Approximation and Projection (UMAP)
(McInnes et al. 2018), and autoencoders (Bank et al. 2020; Zhuang et al. 2021; Xu and
Duraisamy 2020; Bukka et al. 2020; Simpson et al. 2021). Often, these approaches
are also used in clustering.
LLE is one of the simplest nonlinear dimension reduction processes. The idea is
to identify a global space with smaller dimension that reproduces the proximity of
data in the higher dimensional space; it is a k-NN approach. First, we determine the
weights $w_{ij}$, such that $\sum_j w_{ij} = 1$, which minimize the error

$$\mathrm{Error}(\mathbf{w}) = \sum_{i=1}^{n}\left(\mathbf{x}_i - \sum_{j=1}^{k} w_{ij}\,\mathbf{x}_j\right)^2 \qquad (1.26)$$
in the representation of a point from the local space given by the k-nearest points
(k is a user-prescribed hyperparameter), so
w = arg min [Error(w)] (1.27)
Then, we search for the images yi of xi in the lower dimensional space, simply
by considering that the computed wi j reflect the geometric properties of the local
manifold and are invariant to translations and rotations. Given wi j , we now look for
the lower dimensional coordinates yi that minimize the cost function
$$C(\mathbf{y}_i,\ i=1,\dots,n) = \sum_{i=1}^{n}\left(\mathbf{y}_i - \sum_{j=1}^{k} w_{ij}\,\mathbf{y}_j\right)^2 \qquad (1.28)$$
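A minimal sketch of LLE with scikit-learn on a synthetic curved dataset (the S-curve used here is just a convenient placeholder):

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import LocallyLinearEmbedding

# 3D points lying on a curved 2D manifold
X, _ = make_s_curve(n_samples=1000, random_state=0)

# k nearest neighbors is the user-prescribed hyperparameter of Eq. (1.26)
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
Y = lle.fit_transform(X)        # lower dimensional coordinates y_i of Eq. (1.28)
print(X.shape, "->", Y.shape)
print("reconstruction error:", lle.reconstruction_error_)
```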
Isometric Mapping (ISOMAP) techniques are similar, but use geodesic k-node-
to-k-node distances (computed by Dijkstra’s 1959 or the Floyd–Warshall 1962 algo-
rithms to find the shortest path between nodes) and look for preserving them in the
reduced space. Another similar technique is the Laplacian eigenmaps scheme (Belkin
and Niyogi 2003), based on the lowest non-singular eigenvectors of the Graph Laplacian $\mathbf{L} = \mathbf{d} - \mathbf{w}$, where $d_{ii} = \sum_j w_{ij}$ gives the diagonal degree matrix and $w_{ij}$ are the edge weights, computed for example using the Gaussian kernel $w_{ij} = K(\mathbf{x}_i,\mathbf{x}_j) = \exp\big(-|\mathbf{x}_i - \mathbf{x}_j|^2/(2\sigma^2)\big)$. Within the same k-neighbors family, yet more complex and
advanced, are Topological Data Analysis (TDA) techniques. A valuable overview
may be found in Chazal and Michel (2021); see also the references therein.
$$S_{jk} = \frac{1}{n}\sum_{i=1}^{n}\big(x_j^{(i)} - \bar{x}_j\big)\big(x_k^{(i)} - \bar{x}_k\big) \qquad (1.29)$$
where the overbar denotes the mean value of the feature, and x j (i) is feature j of
sample i. The eigenvectors and eigenvalues of the covariance matrix are the principal
components (directions/values of maximum significance/relevance), and the number
of them selected as sufficient is determined by the variance ratios; see Fig. 1.9(a). PCA
is a linear unsupervised technique. The typical implementation uses mean-corrected samples, as in kPCA, so in such a case $S_{jk} = \frac{1}{n}\sum_{i=1}^{n} x_j^{(i)} x_k^{(i)}$, or in matrix notation $\mathbf{S} = \frac{1}{n}\mathbf{X}\mathbf{X}^T$. kPCA (Schölkopf et al. 1997) is PCA using kernels (such as polynomials,
the hyperbolic tangent, or the Radial Basis Function (RBF)) to address the nonlin-
earity by expanding the space. For example, using the RBF, we construct the kernel
matrix K i j , for which the components are obtained from the samples i, j as K i j =
exp(−γ |xi − x j |2 ) . The RBF is then centered in the transformed space by (note that
being centered in the original features space does not mean that the features are also
automatically centered in the transformed space, hence the need for this operation)
$$\bar{\mathbf{K}} = \mathbf{K} - \frac{1}{n}\mathbf{1}\mathbf{K} - \frac{1}{n}\mathbf{K}\mathbf{1} + \frac{1}{n^2}\mathbf{1}\mathbf{K}\mathbf{1} \qquad (1.30)$$

where $\mathbf{1}$ is an $n \times n$ matrix of unit entries "1". Then $\bar{K}(\mathbf{x}_i,\mathbf{x}_j) = \bar\phi(\mathbf{x}_i)^T\bar\phi(\mathbf{x}_j)$ with the centered $\bar\phi(\mathbf{x}_i) = \phi(\mathbf{x}_i) - \frac{1}{n}\sum_{r=1}^{n}\phi(\mathbf{x}_r)$. The larger eigenvalues are the
Fig. 1.9 a Principal Component Analysis. The principal components are those with the largest variations (largest eigenvalues of the covariance matrix). b Linear Discriminant Analysis to separate clusters. It is seen that feature $x_1$ is not a good choice to determine if a sample belongs to a given cluster, but there is a combination of features (a line) which gives the best discrimination between clusters. That combination maximizes the distance between the means of the clusters while minimizing the dispersion of the samples within the clusters
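A minimal NumPy sketch of the PCA of Fig. 1.9a, computed directly from the covariance matrix of Eq. (1.29); the correlated synthetic data are placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
# Correlated 2D data: most of the variance along one rotated direction, as in Fig. 1.9a
rot = np.array([[np.cos(0.5), -np.sin(0.5)], [np.sin(0.5), np.cos(0.5)]])
X = rng.normal(size=(500, 2)) @ np.diag([3.0, 0.3]) @ rot

Xc = X - X.mean(axis=0)                   # mean-corrected samples
S = Xc.T @ Xc / len(Xc)                   # covariance matrix, Eq. (1.29)
eigval, eigvec = np.linalg.eigh(S)        # principal values / directions
order = np.argsort(eigval)[::-1]
print("explained variance ratios:", (eigval[order] / eigval.sum()).round(3))
print("first principal direction:", eigvec[:, order[0]].round(3))
```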
$$\mathbf{S}_w = \sum_{i=1}^{n\text{-classes}}\ \sum_{\text{within-class-}i}(\mathbf{x} - \mathbf{m}_i)(\mathbf{x} - \mathbf{m}_i)^T \qquad (1.31)$$

$$\mathbf{S}_b = \sum_{i=1}^{n\text{-classes}} n_i\,(\mathbf{m}_i - \bar{\mathbf{x}})(\mathbf{m}_i - \bar{\mathbf{x}})^T \qquad (1.32)$$
and x̄ is the overall mean vector of the features x and mi is the mean vector of
those within-class i. If x̄ = mi the class is not separable from the selected fea-
tures. Frequently used nonlinear extensions of LDA are the Quadratic Discriminant
Analysis (QDA) (Tharwat 2016; Ghosh et al. 2021), Flexible Discriminant Analysis
(FDA) (Hastie et al. 1994), and Regularized Discriminant Analysis (RDA) (Friedman
1989).
Proper Orthogonal Decompositions (POD) are frequently motivated in PCA and
are often used in turbulence and in reducing dynamical systems. It is a technique
also similar to classical modal decomposition. The idea is to decompose the time-
dependent solution as
$$\mathbf{u}(\mathbf{x}, t) = \sum_{p=1}^{P} a_p(t)\,\boldsymbol{\varphi}_p(\mathbf{x}) \qquad (1.33)$$
and compute the Proper Orthogonal Modes (POMs) ϕ p (x) that maximize the energy
representation (L2-norm). In essence, we are looking for the set of “discrete func-
tions” ϕ p (x) that best represent u(x, t) with the lowest number of terms P. Since
these are computed as discretized functions, several snapshots u(x, ti ), i = 1, . . . , n
are grabbed in the discretized domain, i.e.
$$\mathbf{U} = \big[\,\mathbf{u}(\mathbf{x},t_1)\ \ \mathbf{u}(\mathbf{x},t_2)\ \dots\ \mathbf{u}(\mathbf{x},t_n)\,\big] = \begin{bmatrix} u_{11} & \dots & u_{1n} \\ \vdots & \ddots & \vdots \\ u_{m1} & \dots & u_{mn} \end{bmatrix} \qquad (1.34)$$
Then, the POD vectors are the eigenvectors of the sample covariance matrix. If the
snapshots are corrected to have zero mean value, the covariance matrix is
$$\mathbf{S} = \frac{1}{n}\mathbf{U}\mathbf{U}^T \qquad (1.35)$$
The POMs may also be computed using the SVD of U (the left singular vectors are the
eigenvectors of UUT ) or auto-associative NNs (Autoencoder Neural Networks that
replicate the input in the output but using a hidden layer of smaller dimension). To
overcome the curse of dimensionality when using too many features (e.g., for para-
metric analyses), the POD idea is generalized in Proper Generalized Decomposition
(PGD), by assuming approximations of the form
$$\mathbf{u}(x_1, x_2, \dots, x_d) = \sum_{i=1}^{N} \boldsymbol{\phi}_1^i(x_1) \circ \boldsymbol{\phi}_2^i(x_2) \circ \dots \circ \boldsymbol{\phi}_d^i(x_d) \qquad (1.36)$$
where the $\boldsymbol{\phi}_j^i(x_j)$ are the unknown vector functions (usually also discretized and computed iteratively, for example using a greedy algorithm), and "$\circ$" stands for the Hadamard or entry-wise product of vectors. Note that, in general, we cannot assume the separability $u(x, y) = \phi(x)\psi(y)$, but PGDs look for the best choices $\phi_i(x)\psi_i(y)$ for the given problem such that we can write $u(x, y) \simeq \sum_i \phi_i(x)\psi_i(y)$ with a sufficiently small number of addends (hence with a reduced complexity). The power of the idea is that for a large number $n$ of features, determining functions of the type $u(x_1, x_2, \dots, x_n)$ is virtually impossible, but determining products and additions of scalar functions is feasible.
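A minimal sketch of POD computed from a snapshot matrix such as that of Eq. (1.34) via the SVD (the synthetic space–time field below is a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)             # spatial discretization (m points)
t = np.linspace(0.0, 1.0, 50)              # n snapshots in time
# Synthetic field: two separable space-time contributions plus small noise
U = (np.outer(np.sin(2 * np.pi * x), np.cos(2 * np.pi * t))
     + 0.3 * np.outer(np.sin(4 * np.pi * x), np.sin(6 * np.pi * t))
     + 0.01 * rng.normal(size=(200, 50)))

U0 = U - U.mean(axis=1, keepdims=True)      # snapshots corrected to zero mean
phi, sing_vals, _ = np.linalg.svd(U0, full_matrices=False)
energy = sing_vals ** 2 / np.sum(sing_vals ** 2)
print("energy captured by first 3 POMs:", energy[:3].round(4))
# The columns of phi are the Proper Orthogonal Modes (left singular vectors of U)
```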
The UMAP and t-SNE schemes are based on the concept of a generalized metric
or distance between samples. A symmetric and normalized (between 0 and 1) metric
is defined as
$$d_{ij}(\mathbf{x}_i,\mathbf{x}_j) = d_i^j(\mathbf{x}_i,\mathbf{x}_j) + d_j^i(\mathbf{x}_i,\mathbf{x}_j) - d_i^j(\mathbf{x}_i,\mathbf{x}_j)\,d_j^i(\mathbf{x}_i,\mathbf{x}_j) \qquad (1.37)$$

where the unidirectional distance function is defined as

$$d_i^j(\mathbf{x}_i,\mathbf{x}_j) = \exp\left(-\frac{\rho_{ij} - \rho_{i1}}{\rho_{ik}}\right) \qquad (1.38)$$

where $\rho_{ij} = |\mathbf{x}_i - \mathbf{x}_j|$ and $\rho_{ik} = |\mathbf{x}_i - \mathbf{x}_k|$, with $k$ referring to the k-nearest neighbor ($\rho_{i1}$ refers to the nearest neighbor to $i$). Here $k$ is an important hyperparameter. Note that $d_i^j = 1$ if $i, j$ are nearest neighbors, and $d_i^j \to 0$ for far away neighbors.
are looking for a new set of lower dimensional features z to replace x. The same
generalized distance di j (zi , z j ) may be applied to the new features. To this end,
through optimization techniques, like the steepest descent, we minimize the fuzzy
set dissimilarity cross-entropy (or entropy difference) like the Kullback–Leibler (KL)
divergence (Hershey and Olsen 2007; Van Erven and Harremos 2014), which mea-
sures the difference between the probability distributions di j (xi , x j ) and di j (zi , z j ),
and their complementary values [1 − di j (xi , x j )] and [1 − di j (zi , z j )] (recall that
d ∈ (0, 1], so it is seen as a probability distribution)
$$KL(d(\mathbf{x}), d(\mathbf{z})) = \sum_{i,j=1}^{n}\left[ d_{ij}(\mathbf{x}_i,\mathbf{x}_j)\ln\frac{d_{ij}(\mathbf{x}_i,\mathbf{x}_j)}{d_{ij}(\mathbf{z}_i,\mathbf{z}_j)} + \big[1 - d_{ij}(\mathbf{x}_i,\mathbf{x}_j)\big]\ln\frac{1 - d_{ij}(\mathbf{x}_i,\mathbf{x}_j)}{1 - d_{ij}(\mathbf{z}_i,\mathbf{z}_j)} \right] \qquad (1.39)$$
Note that the KL scheme is not symmetric with respect to the distributions. If the distances in both spaces are equal for all the samples, $KL = 0$. In general, a lower dimensional space gives $KL \neq 0$, but with the dimension of $\mathbf{z}$ fixed, the features (or combinations of features) that give a minimum $KL$ considering all $n$ samples represent the optimal selection.
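A minimal NumPy sketch of the dissimilarity measure of Eqs. (1.37)–(1.39), comparing a dataset with a candidate low-dimensional representation (here simply the first coordinates, as a stand-in for an optimized embedding):

```python
import numpy as np

def fuzzy_distances(X, k=5):
    """Symmetric, normalized distances d_ij of Eq. (1.37), built from Eq. (1.38)."""
    n = len(X)
    rho = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d_dir = np.zeros((n, n))
    for i in range(n):
        order = np.argsort(rho[i])
        rho_1, rho_k = rho[i, order[1]], rho[i, order[k]]   # nearest and k-th neighbor
        d_dir[i] = np.exp(-(rho[i] - rho_1) / rho_k)        # Eq. (1.38)
    d = d_dir + d_dir.T - d_dir * d_dir.T                   # Eq. (1.37)
    np.fill_diagonal(d, 0.0)
    return np.clip(d, 1e-9, 1.0 - 1e-9)

def kl_cross_entropy(d_x, d_z):
    """Fuzzy-set dissimilarity cross-entropy of Eq. (1.39)."""
    return float(np.sum(d_x * np.log(d_x / d_z)
                        + (1.0 - d_x) * np.log((1.0 - d_x) / (1.0 - d_z))))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
Z = X[:, :2]                               # a (naive) 2D candidate embedding
print(round(kl_cross_entropy(fuzzy_distances(X), fuzzy_distances(Z)), 3))
```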
Autoencoders are a type of neural network, discussed below, and can be interpreted
as a nonlinear generalization of PCA. Indeed, an autoencoder with linear activation
functions is equivalent to a SVD.
$$\tanh(z) = \frac{\exp(z) - \exp(-z)}{\exp(z) + \exp(-z)}, \quad\text{with } z = \mathbf{w}^T\mathbf{x} + w_0 \qquad (1.40)$$
NNs are made from many such artificial neurons, typically arranged in several layers,
with each layer l = 1, . . . , L containing many neurons. The output from the network
Fig. 1.10 Rosenblatt's perceptron and Adaline (Adaptive Linear Neuron) model
where the $\mathbf{f}^l$ are the neuron functions of the layer (often also denoted by $\boldsymbol{\sigma}^l$ in the sigmoid case), typically arranged by groups in the form $\mathbf{f}^l(\mathbf{W}^l\mathbf{x}^l + \mathbf{b}^l)$, where $\mathbf{W}^l$ is the matrix of weights, $\mathbf{z}^l := \mathbf{W}^l\mathbf{x}^l + \mathbf{b}^l$, $\mathbf{x}^l = \mathbf{y}^{l-1} = \mathbf{f}^{l-1}(\mathbf{z}^{l-1})$ are the neuron inputs and the output of the previous layer (the features for the first function; $\mathbf{y}^0 \equiv \mathbf{x}$), and $\mathbf{b}^l$ is the layer bias vector, which is often incorporated as a weight on a unit bias by writing $\mathbf{z}^l = \mathbf{W}^l\mathbf{x}^l$, so $\mathbf{x}^l$ also has the index 0, with $x_0^l = 1$; see Fig. 1.11. The output
may be also a vector y ≡ y L . The purpose of the learning process is to learn the
optimum values of Wl , bl . The power of the NNs is that a simple architecture, with
simple functions, may be capable of reproducing more complex functions. Indeed,
Rosenblatt’s scheme discussed below may give any linear or nonlinear function. Of
course, complex problems will require many neurons, layers, and data, but the overall
structure of the method is almost unchanged.
The Feed Forward Neural Network (FFNN) with many layers, as shown in
Fig. 1.11, is trained by optimization algorithms (typically modifications of the steep-
est descent) using the backpropagation algorithm, which consists in computing the
Fig. 1.11 Neural Network with $L-1$ hidden layers and one (the $L$-th) output layer. The notation for the weights is $W_{oi}^l$, where $i$ is the input cell (zero refers to the bias unit), $o$ is the output cell (the order is often reversed in the literature), and $l = 1, \dots, L$ are the layers
sensitivities using the chain rule from the output layer to the input layer, so for each
layer, the information of the derivatives of the subsequent layers is known. For example, in Fig. 1.11, assume that the error is computed as $E = \tfrac{1}{2}(\mathbf{y} - \mathbf{y}_{\exp})^T(\mathbf{y} - \mathbf{y}_{\exp})$ (logistic errors are more common, but we consider herein this simpler case). Then, if $\alpha$ is the learning rate (a hyperparameter), the increment between epochs of the parameters is

$$\Delta W_{oi}^l = -\alpha\,(\mathbf{y} - \mathbf{y}_{\exp})^T \frac{\partial \mathbf{y}}{\partial W_{oi}^l} \qquad (1.42)$$
where ∂y/∂ Woil is computed through the chain rule. Figure 1.12 shows a simple
example with two layers and two neurons per layer; superindices refer to layer and
subindices to neuron. For example, following the green path, we can compute
$$\frac{\partial y_2}{\partial W_{21}^2} = \frac{\partial y_2}{\partial z_2^2}\,\frac{\partial z_2^2}{\partial W_{21}^2} \qquad (1.43)$$
where $\partial y_2/\partial z_2^2$ is the derivative of the selected activation function evaluated at the iterative value $z_2^2$, and $\partial z_2^2/\partial W_{21}^2 = x_1^2$ is also a known iterative value. As an example of a deeper layer, consider the red line in Fig. 1.12

$$\frac{\partial y_1}{\partial W_{21}^1} = \frac{\partial y_1}{\partial z_1^2}\,\frac{\partial z_1^2}{\partial x_2^2}\,\frac{\partial x_2^2}{\partial z_2^1}\,\frac{\partial z_2^1}{\partial W_{21}^1} \qquad (1.44)$$
Fig. 1.12 Computation of the gradient through backpropagation. $z_o^l$ is defined as $z_o^l = W_{oi}^l x_i^l$ (which includes the bias) and $f_o^l(z_o^l)$ is the activation function
Fig. 1.13 Neural networks are capable of generating functions to fit data regardless of the dimension of the space and the nonlinearity of the problem. In this example, three neurons of the simplest Rosenblatt's perceptron consisting of a step function are used to generate a local linear function. This function is obtained by simply changing the weights of the bias and adding the results of the three neurons with equal weights. Other more complex functions may be obtained with different weights. Furthermore, the firing step function may be changed to generally better choices such as the ReLU or the sigmoid functions
where we note that the first square bracket corresponds to the last layer, the sec-
ond to the previous one, and so on, until the term in curly brackets addressing the
specific network variable. The procedures had issues with exploding or vanishing
gradients (especially with sigmoid and hyperbolic tangent activations), but several
improvements in algorithms (gradient clipping, regularization, skipping of connec-
tions, etc.) have resulted in efficient algorithms for many hidden layers. The complexity of these improvement techniques, with an important number of "tweaks" needed to make them work in practical problems, is one of the reasons why "canned" libraries are employed and recommended (Fig. 1.13).
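A minimal NumPy sketch of a feed-forward network with one hidden layer, trained by backpropagation with the quadratic error and the steepest-descent update of Eq. (1.42); the architecture, learning rate, and data are illustrative placeholders:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
y = np.sin(np.pi * X[:, 0]) * X[:, 1]                  # target function to be learned

W1, b1 = 0.5 * rng.normal(size=(8, 2)), np.zeros(8)    # hidden layer (8 sigmoid neurons)
W2, b2 = 0.5 * rng.normal(size=(1, 8)), np.zeros(1)    # linear output layer
alpha = 0.05                                            # learning rate

for epoch in range(5000):
    # Forward pass
    z1 = X @ W1.T + b1
    x1 = sigmoid(z1)                                    # hidden activations
    y_pred = (x1 @ W2.T + b2).ravel()
    err = y_pred - y                                    # dE/dy for E = 1/2 (y - y_exp)^2
    # Backward pass (chain rule, as in Eqs. (1.43)-(1.44))
    dW2 = err[:, None].T @ x1 / len(X)
    db2 = np.array([err.mean()])
    delta1 = (err[:, None] @ W2) * x1 * (1.0 - x1)      # propagate through sigmoid derivative
    dW1 = delta1.T @ X / len(X)
    db1 = delta1.mean(axis=0)
    # Steepest-descent update, Eq. (1.42)
    W2, b2 = W2 - alpha * dW2, b2 - alpha * db2
    W1, b1 = W1 - alpha * dW1, b1 - alpha * db1

print("final mean squared error:", round(float(np.mean(err ** 2)), 4))
```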
A Bayesian Neural Network (BNN) is a NN that uses probability and the Bayes
theorem relating conditional probabilities
$$p(\mathbf{z}|\mathbf{x}) = \frac{p(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})}{p(\mathbf{x})} \qquad (1.45)$$
where p(x|z) = p(x ∩ z)/ p(z). A typical example is to consider a probabilistic dis-
tribution of the weights (so we take z = w) for a given model, or a probabilistic
distribution of the output (so we take z = y) not conditioned to a specific model.
These choices can be applied in uncertainty quantifications (Olivier et al. 2021),
with metal fatigue a typical application case (Fernández et al. 2022a; Bezazi et al.
2007). Given the complexity in passing analytical distributions through the NN, sam-
pling is often performed through Monte Carlo approaches. The purpose is to learn
the mean and standard deviations of the distributions of the weights, assuming they
follow a normal distribution wi ≈ N(μi , σi2 ). For the case of predicting an output
y, considering one weight, the training objective is to maximize the probability of
the training data for the best prediction, or minimize the likelihood of a bad predic-
tion as
$$\mu^*, \sigma^* = \arg\min_{\mu,\sigma} \sum_{\forall \mathbf{x}_i, y_i} L\big(f(\mathbf{x}_i; N(\mu,\sigma)),\, y_i\big) + \mathrm{KL}\big[\,p(N(\mu,\sigma)),\ p(N(0,1))\,\big] \qquad (1.46)$$
The prediction is then obtained by averaging $K$ samples drawn from the learned distributions

$$y = \frac{1}{K}\sum_{k=1}^{K} f\big(\mathbf{x};\ N_k(\mu^*, \sigma^*)\big) \qquad (1.47)$$

where the $N_k(\mu^*, \sigma^*)$ are the numerical evaluations of the normal distributions for the obtained parameters.
Fig. 1.14 Typical structure of a CNN, including one convolution layer, one pooling layer, a flattened
layer of features, and a FFNN
Fig. 1.15 Convolutional network layer with depth 1, stride length 2 (the filter patch moves 2
positions at once) and edge padding 1 (the boundary is filled with one row and column of zeroes).
Pooling is similar, but usually selects the maximum or average of a moving pad to avoid correlation
of features with location
patch are the typical operations. In the convolutional layers, input data has usually
several dimensions, and they are filtered with a moving patch array (also named
kernel, with a specific stride length and edge padding; see Fig. 1.15) to reduce
the dimension and/or to extract the main characteristics of, or information from,
the image (like looking at a lower resolution version or comparing patterns with a
reference). Each padding using a patch over the same record is called a channel, and
successive or chained paddings are called layers, Fig. 1.15. The same padding, with
lower dimension, may be applied over different sample dimensions (a volume). In
essence, the idea is similar to the convolution of functions in signal processing to
extract information from the signal. Indeed this is also an application of CNN. The
structure of CNNs has obvious and interesting applications in multiscale modeling
in materials science, and in constitutive modeling (Yang et al. 2020; Frankel et al.
2022), and thus also in determining material properties (Xie and Grossman 2018;
Zheng et al. 2020), behavior prediction (Yang et al. 2020), and obviously in extracting
microstructure information from images (Jha et al. 2018).
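A minimal sketch of the CNN structure of Fig. 1.14 (one convolution layer, one pooling layer, a flattened layer, and a small FFNN), written here with the Keras API of TensorFlow; the input size and the number of classes are placeholders:

```python
import tensorflow as tf
from tensorflow.keras import layers

# Hypothetical input: 64x64 single-channel images (e.g., microstructure patches), 4 output classes
model = tf.keras.Sequential([
    layers.Conv2D(8, kernel_size=3, strides=2, padding="same", activation="relu",
                  input_shape=(64, 64, 1)),          # convolution layer (stride 2, edge padding)
    layers.MaxPooling2D(pool_size=2),                 # pooling layer
    layers.Flatten(),                                 # flattened layer of features
    layers.Dense(32, activation="relu"),              # FFNN part
    layers.Dense(4, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.summary()
# Training would then proceed with model.fit(images, labels, epochs=...)
```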
RNNs are used for sequences of events, so they are extensively used in language
processing (e.g., in “Siri” or translators from Google), and they are effective in
unveiling and predicting sequences of events (e.g., manufacturing) or when history
is important (path-dependent events as in plasticity Mozaffar et al. 2019; Li et al.
2019; du Bos et al. 2020). In Fig. 1.16, a simple RNN is shown with $^t\mathbf{h}$ representing the history variables, such that the equations of the RNN are

$$^{t+1}\mathbf{h} = \mathbf{f}^{lh}\big(\mathbf{W}_h\,{}^t\mathbf{h} + \mathbf{W}_x\,{}^t\mathbf{x} + \mathbf{b}\big) \qquad (1.48)$$

$$^{t}\mathbf{o} = \mathbf{f}^{lo}\big(\mathbf{W}_h\,{}^t\mathbf{h} + \mathbf{W}_x\,{}^t\mathbf{x} + \mathbf{b}\big) \qquad (1.49)$$

$$^{t}\mathbf{y} = \mathbf{f}^{lh}\big(\mathbf{W}_h\,{}^t\mathbf{h} + \mathbf{b}\big) \qquad (1.50)$$
The unfolding of a RNN allows for a better understanding of the process; see Fig. 1.16.
Following our previous seismic example, they can be used to study the prediction of
Fig. 1.16 Recurrent Neural Network. a Folded representation, b unfolded representation considering three events, and c classification according to the input–output instances considered
Fig. 1.17 An LSTM RNN, including long and short memory, and forget, input, and output gates. $\sigma$ is the sigmoid function, colored boxes are typical NN layers, tanh is the hyperbolic tangent, and $\otimes$, $\oplus$, and tanh are componentwise operations
new earthquakes from previous history; see, for example, Panakkat and Adeli (2009),
Wang et al. (2017). A RNN is similar in nature to a FFNN, and is frequently mixed
with FF layers, but recycling some output at a given time or event for the next time(s)
or event(s). RNNs may be classified according to the number of related input–output
instances as one-to-one, one-to-many (one input instance to many output instances),
many-to-one (e.g., classifying a voice or determining the location of an earthquake),
and many-to-many (translation into a foreign language); see Fig. 1.16. A frequent
ingredient in RNN are “gates” (e.g., in Long Short-Term Memory (LSTM), see
Fig. 1.17) to decide which data is introduced, output, or forgotten.
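A single LSTM unit, with the forget, input, and output gates and the long (c) and short (h) memories of Fig. 1.17, may be sketched as follows; this is a minimal NumPy version with illustrative sizes and random weights.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM unit update. h: short memory, c: long memory (cf. Fig. 1.17)."""
    z = W @ np.concatenate([h, x]) + b          # all four blocks computed at once
    n = h.size
    f = sigmoid(z[0:n])                         # forget gate: what to drop from c
    i = sigmoid(z[n:2*n])                       # input gate: what to write to c
    g = np.tanh(z[2*n:3*n])                     # candidate new content
    o = sigmoid(z[3*n:4*n])                     # output gate: what to expose
    c_next = f * c + i * g                      # update long memory
    h_next = o * np.tanh(c_next)                # update short memory / output
    return h_next, c_next

# Illustrative sizes, weights, and input sequence
rng = np.random.default_rng(0)
n_x, n_h = 4, 6
W = rng.normal(scale=0.1, size=(4 * n_h, n_h + n_x))
b = np.zeros(4 * n_h)
h, c = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(3, n_x)):
    h, c = lstm_step(x, h, c, W, b)
```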
velocity fields (by comparing images) (Deng et al. 2019). GANs are also used in the
generation of compliant designs, for example in the aeronautical industry (Shu et al.
2020), and also to solve differential equations (Yang et al. 2020; Randle et al. 2020).
A recent overview of GANs may be found in Aggarwal et al. (2021).
While NNs may bring accurate predictions through extensive training, obtaining
such predictions may not be computationally efficient. Ensemble learning consists
of employing many low-accuracy but efficient methods to obtain a better prediction
through a sort of averaging (or voting). Following our seismic vulnerability example,
it would be like asking several experts to give a fast opinion (for example just showing
them a photograph) about the vulnerability of a structure or a site, instead of asking
one of them to perform a detailed study of the structure (Giacinto et al. 1997; Tang
et al. 2022). The methods used may be, for example, shallow NNs and decision
trees.
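A minimal sketch of this voting idea using scikit-learn is given below; the data are a synthetic stand-in for, e.g., per-structure features and a vulnerability class, and the individual "experts" are deliberately weak learners.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for per-structure features and a vulnerability class
X, y = make_classification(n_samples=500, n_features=10, n_informative=5, random_state=0)

# Several fast, low-accuracy "experts": shallow trees and a small NN
experts = [
    ("tree1", DecisionTreeClassifier(max_depth=3, random_state=1)),
    ("tree2", DecisionTreeClassifier(max_depth=4, random_state=2)),
    ("mlp", MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=3)),
]
ensemble = VotingClassifier(estimators=experts, voting="hard")   # majority vote

for name, model in experts + [("ensemble", ensemble)]:
    print(name, cross_val_score(model, X, y, cv=5).mean())
```

The majority vote often improves on the individual weak learners, while each of them remains cheap to train and evaluate.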
ML usually gives no insight into the physics of the problem. The classical proce-
dures are considered “black boxes”, with inherent positive (McCoy et al. 2022) and
negative (Gabel et al. 2014) attributes. While these black boxes are useful in applications
to solve classical fuzzy problems where they have been extensively applied (economics,
image or speech recognition, pattern recognition, etc.), they have several inherent
drawbacks regarding their use in mechanical engineering and the applied sciences.
The first drawback is the large amount of data they require to yield relevant predic-
tions. The second one is the lack of fulfillment of basic physics principles (e.g., the
laws of thermodynamics). The third one is the lack of guarantees in the optimal-
ity or uniqueness of the prediction, or even guarantees in the reasonableness of the
predicted response. The fourth one is the computational cost when training is included,
compared with classical methods, although once trained, the use may be
much faster than many classical methods. Probably, the most important drawback is
the lack of physical insight into the problem, because human learning is complex and
needs a detailed understanding of the problem to seek creative solutions to unsolved
problems. Indeed, in contrast to “unexplainable” AI, now also eXplainable Artificial
Intelligence (XAI) is being advocated (Arrieta et al. 2020).
ML may be a good avenue to obtain engineering solutions, but to yield valu-
able (and reliable), scientific answers, physics principles need to be incorporated in
the overall procedure. To this end, the predictions and learning of the previously
overviewed methods, or other more elaborated ones, should be restricted to solution
subsets that do fulfill all the basic principles. That is, conservation of energy, of linear
momentum, etc. should be fulfilled. When doing so, we use data-driven physics-
based machine learning (or modeling) (Ströfer et al. 2018), or “gray-box” modeling
(Liu et al. 2021; Asgari et al. 2021; Regazzoni et al. 2020; Rogers et al. 2017). The
simplest and probably most used method to impose such principles (an imposition
called “whitening” or “bleaching” Yáñez-Márquez 2020) is the use of penalties and
Lagrange multipliers in the cost function (Dener et al. 2020; Borkowski et al. 2022;
Rao et al. 2021; Soize and Ghanem 2020), but there are many options and procedures
to incorporate physics either in the data or in the learning (Karpatne et al. 2017). The
resulting methods and disciplines which mix data science and physical equations are
often referred to as Physics Based Data Science (PBDS), Physics-Informed Data Sci-
ence (PIDS), Physics-Informed Machine Learning (PIML) (Karniadakis et al. 2021;
Kashinath et al. 2021), Physics Guided Machine Learning (PGML) (Pawar et al.
2021; Rai and Sahu 2020), or Data-Based Physics-Informed Engineering (DBPIE).
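As a minimal sketch of the penalty approach mentioned above, the following PyTorch fragment augments a standard data-misfit loss with a weighted residual of a physical constraint; the constraint used here (predicted phase fractions summing to one), the network, the data, and the penalty weight are hypothetical stand-ins.

```python
import torch

torch.manual_seed(0)
net = torch.nn.Sequential(torch.nn.Linear(4, 16), torch.nn.Tanh(), torch.nn.Linear(16, 3))

x = torch.rand(64, 4)                          # hypothetical inputs
y = torch.softmax(torch.rand(64, 3), dim=1)    # hypothetical targets (fractions summing to 1)

opt = torch.optim.Adam(net.parameters(), lr=1e-2)
penalty_weight = 10.0                          # penalty factor on the physics residual

for _ in range(500):
    opt.zero_grad()
    pred = net(x)
    data_loss = torch.mean((pred - y) ** 2)
    # Physics constraint (stand-in): predicted fractions must sum to one
    physics_residual = torch.mean((pred.sum(dim=1) - 1.0) ** 2)
    loss = data_loss + penalty_weight * physics_residual
    loss.backward()
    opt.step()
```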
In a nutshell, data-based physically informed ML allows for the use of data science
methods without most of the shortcomings of physics-uninformed methods. Namely,
we do not need much data (Karniadakis et al. 2021), solutions are often meaningful,
the results are more interpretable, the methods are much more efficient, and the number
of meaningless spurious solutions is substantially smaller. The methods are no longer
a sophisticated interpolation but can give predictions outside the domain given by the
training data. In essence, we incorporate the knowledge acquired in the last centuries.
In PBDS, meaningful internal variables play a key role. In classical engineer-
ing modeling, as in constitutive modeling, variables are either external (position,
velocity, and temperature) or internal (plastic or viscous deformations, damage, and
deformation history). The external variables are observable (common to all methods),
whereas the internal variables, being non-observable, are usually based on assump-
tions to describe some internal state. Here, a usual difference with ML methods is that
a physical meaning is typically assigned to internal variables in classical methods, but
for example when using NNs, internal variables (e.g., those in hidden layers) have
typically no physical interpretation. However, the sought solution of the problem
relates external variables both through physical principles or laws and through state
equations. To link both physical principles and state equations, an inherent physical
meaning is therefore best given (or sought) for the internal ML variables (Carleo et al.
2019; Vassallo et al. 2021). Physical principles are theoretical, of general validity, and
unquestioned for the problem at hand (e.g., mass, energy, momentum conservation,
and Clausius-Duhem inequality), whereas state equations are a result of assumptions
and observations at the considered scales, leading to equations tied to some condi-
tions, assumptions, and simplifications of sometimes questionable generality and of
more phenomenological nature.
In essence, the possible ML solutions obtained from state equations must be
restricted to those that fulfill the basic physical principles, constituting the physically
viable solution manifold, and that is often facilitated by the proper selection of
the structure of the ML method and the involved internal variables. These physical
constraints may be incorporated in ML procedures in different ways, depending
on the analysis and the ML method used, as we briefly discuss below (see also an
example in Ayensa Jiménez 2022).
In some of these approaches, the problem is described through a "physical manifold" and a "constitutive manifold", and we seek the intersection of both for some
given actions or boundary and initial conditions (Ibañez et al. 2018; He et al. 2021;
Ibañez et al. 2017; Nguyen and Keip 2018; Leygue et al. 2018). Autoencoders are a
good tool to reduce complexity and filter noise (He et al. 2021). Other methods are
devoted to inferring the boundary conditions or material constitutive inhomogeneities
(e.g., local damage) assuming that the general form of the constitutive relations is
known (this is a ML approach to the classical inverse problem of damage/defect
detection).
Regarding the determination of the constitutive equations, the procedure may be
purely data-driven (without the explicit representation of a constitutive manifold or
constitutive relations, i.e. “model-free” Kirchdoerfer and Ortiz 2016, 2017; Eggers-
mann et al. 2021a; Conti et al. 2020; Eggersmann et al. 2021b) or manifold-based, in
which case a constitutive manifold is established as a data-based constitutive equa-
tion. In the model-free family, we assume that a large amount of data is known, so
a material data “point” is always close to the physical manifold (see Fig. 1.18 left).
Then, while these techniques may be considered within the ML family, they are more
data-driven deterministic techniques (raw data is employed directly, no constitutive
equation is “learned”). In the manifold-based family (Fig. 1.18, center and right), the
manifold may be explicit (e.g., spline-based, Sussman and Bathe 2009; Crespo and
Montáns 2019; Latorre and Montáns 2017; Crespo et al. 2017; Coelho et al. 2017)
or implicit (discrete or local, e.g., Lopez et al. 2018; Ibañez et al. 2020; Meng et al.
2018; Ibañez et al. 2017). This is a family of methods for which the objective is to
learn the state equations from known (experimental or analytical) data points, prob-
ably subject to some physics requirements (such as integrability). Within this approach,
once the manifold is established, the computation of the prediction follows a scheme
very similar to the use of classical methods (Crespo et al. 2017).
Remarkably, in some Manifold Learning approaches, physical requirements
(which may include, or not, physical internal variables, Amores et al. 2020) may
result in a substantial reduction of the experimental data needed (Latorre et al. 2017;
Amores et al. 2021) and of the overall computational effort, resulting also in an
Fig. 1.18 Data-based constitutive modeling. Left: purely data-driven technique, where no consti-
tutive manifold is directly employed. Instead, the closest known data point is located and is used to
compute the solution. Center, right: constitutive (e.g., stress–strain) data points are used to compute
a constitutive manifold (which may include uncertainty quantification), which is then employed to
compute the solution in a classical manner
The use of ML models, as when using any model (including classical analytical mod-
els; see, for example, Bathe 2006), may result in a significant error in the prediction of
the actual physical response. This error may be produced either by insufficient data (or
insufficient quality of the data because of noise or lack of completeness), or by inac-
curacy of the model (e.g., due to too few layers in a NN or erroneous or oversimplify-
ing assumptions) (Haik et al. 2021). The problems are then how to incorporate new data
(labeled or unlabeled) into the model (Buizza et al. 2022), how to enrich the model
to improve the predictions (Singh et al. 2017), and how to augment physical models
with machine-learned bias (Volpiani et al. 2021) (hybrid models). These problems are
typically encountered in dynamics (Muthali et al. 2021), and the solutions are often
similar to those employed in control theory (Rubio et al. 2021), as the use of Kalman
methods (Zhang et al. 2020). Machine learning techniques may be used for self-
learning complex physical phenomena as the sloshing of fluids (Moya et al. 2020).
In essence, the proposal here is to assume that there is a model-predicted response
y_model and a true (say "experimental") response y_exp (Moya et al. 2022). The difference
is the error to be corrected, namely y_corr = y_exp − y_model. This error is corrected
in further predictions by assuming that there is an indeterminacy either in the input
data (statistical error) or in the model (some unknown variables that are not being
considered). Note that the statistical error case is conceptually similar to the quantifi-
cation of uncertainty. In case the model needs corrections, some formalism may be
employed to introduce physics corrections to learned models. For example, correcting
dissipative behavior in assumed (hyper)elastic behavior (or vice versa). In case there
are some indeterminacies in the model, we can assume that the model is of the form
y = f (x; w, ω) (1.51)
where the w are the parameters determined previously (e.g., during the usual model
learning process and now fixed) and ω are the parameters correcting the model by
minimizing the error. This model correction process using new data is called data
assimilation. In Dynamic Data-Driven Application Systems (DDDAS), the concepts
of Digital Twins and Hybrid Twins are employed. A Digital Twin (Glaessgen and
Stargel 2012) is basically a virtual (sometimes comprehensive) model which is used
as a replication of a system in real life. For example, a Formula-1 simulator (Mayani
et al. 2018) or a spacecraft simulator (Ye et al. 2020; Wang 2020) may be considered
a Digital Twin (Luo et al. 2020). A Digital Twin serves as a platform to try new
solutions when it is difficult or expensive to try them in the actual physical system.
Digital Twins are increasingly used in industry in many fields (Bhatti et al. 2021;
Garg and Panigrahi 2021; Burov and Burova 2020). This virtual platform may con-
tain classical analytical models, data-driven models, or a combination of both (which
is currently very usual in complex systems). The concept of Hybrid Twin (Chinesta
et al. 2020) (or self-learning digital twin, Moya et al. 2020) is a step forward, which
mixes the virtual/digital twin model with model order reductions and parametrized
solutions. The purpose is to have a twin in real time, which may be used to predict
the behavior of a system in advance and correct the system (Moya et al. 2022) or
take any other measure; that is, in essence to control a complex physical system. The
dynamic equation of the Hybrid Twin is

Ẋ(t) = A(X, t; μ) + B(X, t) + C(t) + R(t)   (1.52)

where the μ are the model parameters, A(X, t; μ) is the (possibly analytical) model
contribution given those parameters (a linear model would be A(μ)X) (Sancarlos
et al. 2021), B(X, t) is a data-based correction to the model (a continuous update
from measurements), C(t) are the external actions, and R(t) is the (unbiased and
unpredictable) noise. We use the word “hybrid” (Champaney et al. 2022) because
analytical and data-based approaches are employed. Hybrid Twins have been applied
in various fields, for example in simulating acoustic resonators (Martín et al. 2020).
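A minimal sketch of the correction idea of Eq. (1.51) is given below: a previously identified model with fixed parameters w is kept, and correction parameters ω are fitted to the discrepancy between newly assimilated "experimental" data and the model prediction. The model, the correction basis, and the data are illustrative stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 50)

def f_model(x, w):
    """Previously identified model with fixed parameters w (illustrative)."""
    return w[0] * x + w[1]

w = np.array([2.0, 0.5])                       # fixed after the original training
y_exp = f_model(x, w) + 0.3 * np.sin(6 * x)    # "experimental" data with unmodeled physics

# Correction parameters omega: fit the discrepancy y_corr = y_exp - y_model
y_corr = y_exp - f_model(x, w)
basis = np.column_stack([np.sin(6 * x), np.cos(6 * x)])   # assumed correction basis
omega, *_ = np.linalg.lstsq(basis, y_corr, rcond=None)

def f_corrected(x_new):
    """Hybrid prediction: fixed analytical part plus the data-based correction."""
    return f_model(x_new, w) + np.column_stack([np.sin(6 * x_new), np.cos(6 * x_new)]) @ omega
```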
is re-written as
Ẋ = (X) (1.54)
Fig. 1.20 Sparse Identification of Nonlinear Dynamics. Case of Lorenz System. Reproduced from
Brunton et al. (2016)
representation allowing for explicit derivatives and uses alternating direction opti-
mization with adaptive Sequential Threshold Ridge regression (STRidge) (Rudy et al.
2017) for promoting sparsity, and also more classical genetic and symbolic regression
procedures (Searson 2009; Schmidt and Lipson 2009, 2010). An overview of these
techniques and others may be found in Brunton and Kutz (2022); see also Zhang and
Liu (2021) for a progressive approach for considering uncertainties.
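The basic SINDy regression may be sketched in a few lines: time derivatives are estimated from snapshots, a library of candidate functions is evaluated on the data, and a sequentially thresholded least-squares fit promotes sparsity. The damped-oscillator data and the polynomial library below are illustrative.

```python
import numpy as np

# Toy data: samples of a damped oscillator  x' = y,  y' = -0.1*y - 2*x
dt = 0.01
t = np.arange(0.0, 10.0, dt)
X = np.zeros((t.size, 2))
X[0] = [1.0, 0.0]
for k in range(t.size - 1):
    x, y = X[k]
    X[k + 1] = [x + dt * y, y + dt * (-0.1 * y - 2.0 * x)]

# Time derivatives by finite differences (in practice, smoothed/denoised)
Xdot = np.gradient(X, dt, axis=0)

# Candidate library Theta(X): [1, x, y, x^2, x*y, y^2]
x, y = X[:, 0], X[:, 1]
Theta = np.column_stack([np.ones_like(x), x, y, x**2, x * y, y**2])
names = ["1", "x", "y", "x^2", "x*y", "y^2"]

def stls(Theta, Xdot, threshold=0.05, iters=10):
    """Sequentially thresholded least squares: the basic sparsity-promoting regression."""
    Xi = np.linalg.lstsq(Theta, Xdot, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for j in range(Xdot.shape[1]):
            big = ~small[:, j]
            if big.any():
                Xi[big, j] = np.linalg.lstsq(Theta[:, big], Xdot[:, j], rcond=None)[0]
    return Xi

Xi = stls(Theta, Xdot)
for j, var in enumerate(["x'", "y'"]):
    terms = [f"{Xi[i, j]:+.3f}*{names[i]}" for i in range(len(names)) if Xi[i, j] != 0.0]
    print(var, "=", " ".join(terms))
```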
These approaches, as the SINDy type, can trivially address the correction given
by an imperfect modeling (i.e. the Hybrid Twin). It simply suffices to consider a
correction in Eq. (1.53)
Ẋ − A(X) = B(X) (1.55)
where B(X) is the measured discrepancy to be corrected between the results obtained
from the inexact model and the experimental results. As performed in mathematics
and physics, the key for simplification and possible linearization of a complex prob-
lem consists of finding a proper (possibly reduced) space of (possibly transformed)
input variables to re-write the problem. As mentioned, NNs, in particular autoen-
coders, can be used to find the space, to which, thereafter, a SINDy approach may
be applied to create a Digital or Hybrid Twin (Champion et al. 2019). These mixed
NN approaches have also been employed in multiscale physics transferring learning
through scales by increasingly deep and wide NNs (Liu et al. 2020), also employing
CNNs (Liu et al. 2022). Of course, Dynamic Mode Decomposition (DMD) (Schmid
2010; Tu 2013; Schmid 2011; Jovanović et al. 2014; Demo et al. 2018), a procedure to
determine coupled spatio-temporal modes for nonlinear problems based on Koopman
(composite operator) theory (Williams et al. 2015), is also used for incorporating data
into physical systems, or determining the physical system equations themselves. The
idea is to obtain two sets ("snapshots") of spatial measurements separated by a given
time increment Δt, namely X(t) and X(t + Δt). Then, the eigenvectors of A = X(t + Δt) X(t)⁺,
where X(t)⁺ is the pseudoinverse, are the best regressors of the linear model, that is, the
least-squares best fit of the nonlinear model, compatible with the snapshots. In practice,
the A matrix is usually not computed because working with the SVD of X is more
efficient (Proctor et al. 2016).
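A minimal sketch of this procedure (exact DMD) is given below; as noted, the operator is not formed explicitly but projected onto the leading SVD modes of the first snapshot matrix. The snapshot data and the truncation rank are illustrative.

```python
import numpy as np

def dmd(X0, X1, r):
    """Exact DMD: eigenvalues/modes of the best-fit linear operator with X1 ~ A X0."""
    U, s, Vh = np.linalg.svd(X0, full_matrices=False)
    U, s, Vh = U[:, :r], s[:r], Vh[:r, :]
    # Project A onto the leading SVD (POD) modes instead of forming A explicitly
    A_tilde = U.conj().T @ X1 @ Vh.conj().T @ np.diag(1.0 / s)
    eigvals, W = np.linalg.eig(A_tilde)
    modes = X1 @ Vh.conj().T @ np.diag(1.0 / s) @ W   # exact DMD modes
    return eigvals, modes

# Usage with synthetic snapshot matrices (columns are states at successive times)
rng = np.random.default_rng(0)
data = rng.normal(size=(100, 61))
eigvals, modes = dmd(data[:, :-1], data[:, 1:], r=10)
```

The eigenvalues give the temporal growth/decay and frequency of each mode, and the modes give the corresponding spatial structures.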
Other techniques to discover physical relations (or nonlinear differential equa-
tions), as well as simultaneously obtain physical parameters and fields, are physics-
informed NN (PINN) (Raissi and Karniadakis 2018; Raissi et al. 2019; Pang et al.
2019; Yang et al. 2021). For example, using neural networks, the viscosity, the den-
sity, and the pressure, with the velocity field in time may be obtained assuming the
Navier–Stokes equations as background and employing a NN as the learning engine
to match snapshots. Moreover, these methods may be combined with time integrators
for obtaining the nonlinear parameters of any differential equation, including higher
derivatives, just from discretized experimental snapshots (Meng et al. 2020; Zhang
et al. 2020). Other applications include inverse problems in discretized conservative
settings (Jagtap et al. 2020).
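The following PyTorch sketch illustrates the inverse-identification idea on a deliberately simple problem: an unknown decay coefficient λ of the equation u' + λu = 0 is learned jointly with a network for u(t) from noisy snapshots, by combining a data-misfit term with the residual of the differential equation at collocation points. The equation, data, and network sizes are illustrative stand-ins for settings such as the Navier–Stokes example above.

```python
import torch

torch.manual_seed(0)
lam_true = 1.7
t_data = torch.linspace(0.0, 2.0, 40).reshape(-1, 1)
u_data = torch.exp(-lam_true * t_data) + 0.01 * torch.randn_like(t_data)  # noisy snapshots

u_net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
log_lam = torch.zeros(1, requires_grad=True)              # unknown physical parameter
opt = torch.optim.Adam(list(u_net.parameters()) + [log_lam], lr=1e-2)

t_col = torch.linspace(0.0, 2.0, 100).reshape(-1, 1).requires_grad_(True)  # collocation points

for _ in range(3000):
    opt.zero_grad()
    # Data misfit on the measured snapshots
    loss_data = torch.mean((u_net(t_data) - u_data) ** 2)
    # Physics residual of u' + lambda*u = 0 at the collocation points
    u = u_net(t_col)
    du_dt = torch.autograd.grad(u, t_col, torch.ones_like(u), create_graph=True)[0]
    loss_phys = torch.mean((du_dt + torch.exp(log_lam) * u) ** 2)
    (loss_data + loss_phys).backward()
    opt.step()

print("identified lambda:", torch.exp(log_lam).item())
```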
While it is very well known that the so-called universal approximation theorem
guarantees that a neural network can approximate any continuous function, it is also
possible to approximate continuous operators by means of neural networks (Chen
and Chen 1995). Based on this fact, Lu and coworkers have proposed the Deep
Operator Networks (DeepONets) (Lu et al. 2021).
A DeepONet typically consists of two different networks working together: one
to encode the input function at a number of measurement locations (the so-called
branch net) and a second one (the trunk net) to encode the locations for the output
functions. Assume that we look forward to characterize an operator F : X → Y ,
with X, Y two topological spaces. For any function x ∈ X , this operator produces
G = F(x), the output function. For any point y in the domain of F(x), G(y) is a real
number. A DeepONet thus learns from pairs (x, y) to produce the operator. However,
for an efficient training, the input function x is sampled at discrete spatial locations.
In some examples, DeepONets showed very small generalization errors and even
exponential error convergence with respect to the training dataset size. This is how-
ever not yet fully understood. DeepONets have been applied, for example, to predict
crack paths in brittle materials (Goswami et al. 2022), instabilities in boundary layers
(Di Leoni et al. 2021), and the response of dynamical systems subjected to stochastic
loadings (Garg et al. 2022).
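The branch–trunk structure may be sketched as follows in PyTorch; the number of sensors, the layer widths, and the simple dot-product combination of the two embeddings are illustrative of the basic architecture only.

```python
import torch

class DeepONet(torch.nn.Module):
    """Minimal DeepONet: the operator output G(y) is a dot product of two embeddings."""
    def __init__(self, n_sensors, p=32):
        super().__init__()
        # Branch net: encodes the input function sampled at fixed sensor locations
        self.branch = torch.nn.Sequential(torch.nn.Linear(n_sensors, 64), torch.nn.Tanh(),
                                          torch.nn.Linear(64, p))
        # Trunk net: encodes the query location y of the output function
        self.trunk = torch.nn.Sequential(torch.nn.Linear(1, 64), torch.nn.Tanh(),
                                         torch.nn.Linear(64, p))

    def forward(self, x_sampled, y):
        # x_sampled: (batch, n_sensors) values of the input function at the sensors
        # y:         (batch, 1) query points of the output function
        b = self.branch(x_sampled)
        t = self.trunk(y)
        return (b * t).sum(dim=1, keepdim=True)   # one scalar G(y) per (function, point) pair

model = DeepONet(n_sensors=50)
out = model(torch.rand(8, 50), torch.rand(8, 1))
```

Training then regresses the model output against known values of the output function G(y) for many pairs of input functions and query points.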
Recently, DeepONets have been generalized by parameterizing the integral kernel
in Fourier space, giving rise to the so-called Fourier Neural Operators (Li et al. 2020).
These networks have also gained a high popularity, and have been applied to weather
forecasting, for instance (Pathak et al. 2022).
Within the realm of PIML approaches, a new family of methods has recently been
proposed. The distinctive characteristic is that these new techniques see the super-
vised learning process as a dynamical system of the form

ż = f(z)   (1.56)

with z being the set of variables governing the problem. The supervised learning
problem will thus be to establish f in such a way as to reach an accurate description of
the evolution of the variables. By formulating the problem in this way, the analyst can
use the knowledge already available, and established over centuries, on dynamical
systems. For instance, adopting a Hamiltonian perspective on the dynamics and
enforcing f to be of the form
ż = L∇ H (1.57)
where L is the classical (skew-symmetric) symplectic matrix, which ensures that the
learnt dynamics will conserve energy, because it is derived from the Hamiltonian
H . Many recent references have exploited this approach, either in Hamiltonian or
Lagrangian frameworks (Greydanus et al. 2019; Mattheakis et al. 2022; Cranmer
et al. 2020). If the system of interest is dissipative—which is, by far, most frequently
the case—a second potential must be added to the formulation as
ż = L∇ H + M∇ S (1.58)
where S represents the so-called Massieu potential (an entropy-like potential). To ensure the fulfillment of the
first and second principles of thermodynamics, an additional restriction (the so-called
degeneracy conditions) must be imposed, i.e.
L∇ S = M∇ H = 0 (1.59)
These equations essentially state that entropy has nothing to do with energy conserva-
tion and, in turn, energy potentials have nothing to do with dissipation. The resulting
NN formulations produce predictions that comply with the laws of thermodynamics
(Hernández et al. 2021, 2022).
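A minimal sketch of the conservative case of Eq. (1.57) is given below: a scalar network parameterizes H, the dynamics are obtained by automatic differentiation through the symplectic matrix L, and the network is trained by regressing observed time derivatives. The sampled data (a linear oscillator) are illustrative.

```python
import torch

torch.manual_seed(0)
# Scalar network for the Hamiltonian H(z), with z = (q, p)
H_net = torch.nn.Sequential(torch.nn.Linear(2, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1))
L = torch.tensor([[0.0, 1.0], [-1.0, 0.0]])       # canonical (skew-symmetric) symplectic matrix

def z_dot(z):
    """Structure-preserving dynamics z' = L grad H(z), Eq. (1.57)."""
    z = z.clone().requires_grad_(True)
    H = H_net(z).sum()
    grad_H = torch.autograd.grad(H, z, create_graph=True)[0]
    return grad_H @ L.T

# Supervised learning: regress observed time derivatives against the structured form
z_obs = torch.randn(128, 2)                                   # hypothetical state samples
zdot_obs = z_obs @ torch.tensor([[0.0, -1.0], [1.0, 0.0]])    # e.g., a harmonic oscillator
opt = torch.optim.Adam(H_net.parameters(), lr=1e-3)
for _ in range(1000):
    opt.zero_grad()
    loss = torch.mean((z_dot(z_obs) - zdot_obs) ** 2)
    loss.backward()
    opt.step()
```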
In this section we describe some applications of machine learning in CAE. The main
purpose is to briefly focus on a variety of topics and ML approaches employed in
several fields, but not to give a comprehensive review. Hence, given the vast literature
already available, developed in the last few years, many important works have likely
been omitted. However, even though the field of applications is very broad, the main
ideas fundamental to the techniques are given in the previous sections.
One of the simplest modeling problems and, hence, one of the most explored ones is
the case of elasticity. The linear elastic problem, addressed from a model-free data-
driven method is analyzed in Kirchdoerfer and Ortiz (2016), Conti et al. (2018), and
even earlier in Wang et al. (2011) for cloths in the animation and design industries.
Data-driven nonlinear elasticity is also analyzed in several works (Conti et al. 2020;
Stainier et al. 2019; Nguyen and Keip 2018), and applied to soft tissues (González
et al. 2020) and foams (Frankel et al. 2022).
In particular, data-driven specific solvers are needed if model-free methods are
employed, and some effort is directed to developing such solvers and data structur-
ing methods for the task (Eggersmann et al. 2021a, b; Platzer et al. 2021). Kernel
regression is also employed (Kanno 2018).
Another common methodology is the use of data-driven constitutive manifolds
(Ibañez et al. 2017), where identification and reduction of the constitutive manifolds
allow for a much more efficient approach. NNs are as well used in finite deformation
elasticity (Nguyen-Thanh et al. 2020; Wang et al. 2022).
Remarkably, nonlinear elasticity is one of the cases where physics-informed meth-
ods are important, because true elasticity means integrable path-independent consti-
tutive behavior, i.e. hyperelasticity. Classical ML methods are not integrable (hence
not truly elastic). To fulfill such requirement, specific methods are needed (González
et al. 2019b; Chinesta et al. 2020; Hernandez et al. 2021). One of the possibilities is
to posit the state variables and a reduced expression of the hyperelastic stored energy
(which may be termed as “interpretable” ML models Flaschel et al. 2021). Then,
this energy may be modeled, for example, by splines or B-splines. This approach,
based on the Valanis–Landel assumption, was pioneered by Sussman and Bathe for
isotropic polymers (Sussman and Bathe 2009) and extended later for anisotropic
materials (Latorre and Montáns 2013) like soft biological tissues (fascia, Latorre
et al. 2017, skin Romero et al. 2017, heart Latorre and Montáns 2017, muscle Latorre
et al. 2018, Moreno et al. 2020), compressible materials (Crespo et al. 2017), auxetic
foams (Crespo and Montans 2018; Crespo et al. 2020), and composites (Amores
et al. 2021). Polynomials in terms of invariants are also employed, with the coeffi-
cients determined by sparse regression (Flaschel et al. 2021). Another approach is
to select models from a database, and possibly correct them (González et al. 2019a;
Erchiqui and Kandil 2006), or select specific function models for the hyperelastic
stored energy using machine learning methods (e.g., NNs) (Flaschel et al. 2021;
Vlassis et al. 2020; Nguyen-Thanh et al. 2020). In particular, polyconvexity (to guar-
antee stability and global minimizers for the elastic boundary-value problem) may
also be imposed in NN models (Klein et al. 2022). Anisotropy in hyperelasticity may
be learned from data with NNs (Fuhg et al. 2022a).
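As an illustration of how the integrability requirement can be built in, the following PyTorch sketch parameterizes a one-dimensional stored energy W(λ) with a small network and obtains the stress by differentiation, so that the learned response is hyperelastic by construction; the uniaxial neo-Hookean-like data are a hypothetical stand-in for experimental points.

```python
import torch

torch.manual_seed(0)
# Stored-energy network W(lambda); the stress is obtained by differentiation,
# so the learned response is path-independent (hyperelastic) by construction.
W_net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Softplus(), torch.nn.Linear(32, 1))

def nominal_stress(stretch):
    stretch = stretch.clone().requires_grad_(True)
    W = W_net(stretch).sum()
    return torch.autograd.grad(W, stretch, create_graph=True)[0]   # P = dW/d(lambda)

# Hypothetical uniaxial data: incompressible neo-Hookean-like response
lam = torch.linspace(1.0, 2.0, 30).reshape(-1, 1)
P_exp = 2.0 * (lam - lam ** -2)

opt = torch.optim.Adam(W_net.parameters(), lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = torch.mean((nominal_stress(lam) - P_exp) ** 2)
    # Physics anchor: zero stress at the undeformed state
    loss = loss + nominal_stress(torch.ones(1, 1)).pow(2).mean()
    loss.backward()
    opt.step()
```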
In material datasets, noise and outliers may be a relevant issue, both regard-
ing accuracy and their promotion of overfitting. Clustering has been employed in
model-free methods to assign a different relevance depending on the distance to
the solution and using an estimation based on maximum entropy (Kirchdoerfer and
Ortiz 2017). For spline-based constitutive modeling, experimental data reduction
using stability-based penalizations allows for the use of noisy datasets and outliers
avoiding overfitting (Latorre and Montáns 2020).
as typically pursued for hereditary models) (Eggersmann et al. 2019; Ciftci and Hackl
2022). FFNNs with PODs have been employed to fit several plasticity stress–strain
behaviors. NNs are also used to replace the stress integration approaches in FE anal-
ysis of elastoplastic models (Jang et al. 2021). In general, RNNs (Mozaffar et al.
2019; Borkowski et al. 2022) and CNNs (Abueidda et al. 2021) are a good resort for
predicting plastic paths, and sophisticated LSTM and Gated Recurrent Unit (GRU)
schemes have been reported to give excellent predictions even for complex paths
(Wang et al. 2020).
In materials science, ML is employed to predict the cyclic stress–strain behavior
depending on the microstructure of the material obtained from electron backscat-
ter diffraction (EBSD) analysis. The shape of the yield function can also be deter-
mined by employing sparse regression from a strain map and the cell load in a
non-homogeneous test (like considering a plate with holes) (Flaschel et al. 2022).
A mixture of analytical formulas and FFNN machine learning has been employed
to replace the temperature- and rate-dependent term of the Johnson–Cook model
(Li et al. 2019). In plasticity, physics-based modeling is incorporated by assuming
the existence of a stored energy, a plasticity yield function, and a plastic flow rule.
These may be obtained by NNs learned from numerical experiments on polycrystal
databases, resulting in a more robust ML approach than using the classical black-box
ML scheme (Vlassis and Sun 2021). Support Vector Regression (SVR), Gaussian
Process Regression (GPR), and NNs have been used to determine data-driven yield
functions with the convexity constraints required by the theory (Fuhg et al. 2022b).
Automatic hyperparameter (self-)learning has been addressed for NN modeling of
elastoplasticity in Fuchs et al. (2021).
1.4.1.3 Fracture
Fracture phenomena may also be modeled using NNs (Theocaris and Panagiotopou-
los 1993; Seibi and Al-Alawi 1997) and data-driven model-free techniques (Carrara
et al. 2020). Data-driven model extraction from experimental data and knowledge
transfer (Goswami et al. 2020) have been applied to obtain predictions in 3D mod-
els from 2D cases (Liu et al. 2021). Data-driven approaches are used to enhance
fracture paths in simulations of random composites and in model reduction to avoid
high fidelity phase-field computations (Guilleminot and Dolbow 2020). SVMs and
variants have been used for predicting fracture properties, e.g., Yuvaraj et al. (2013),
Kulkrni et al. (2011), and so have been other methods like BNN, Genetic Algorithm
(GA), and hybrid systems; see, for example, Nasiri et al. (2017), Hoshyar et al.
(2020).
The modeling of complex materials is one of the fields where machine learning
may bring about significant advances in CAE (Peng et al. 2021), in particular when
nonlinear behavior is modeled (Jackson et al. 2019). This is particularly the case when
the macroscopic behavior or the physical properties depend in a complex manner on
a specific microstructure (Fish et al. 2021) or on physics equations and phenomena
only seen at a micro- or smaller scale, as atomistic (Caccin et al. 2015; Kontolati
et al. 2021; Wood et al. 2019), molecular (Xiao et al. 2020), or cellular (Verkhivker
et al. 2020).
ML allows for the simpler implementation of first-principles in multiscale simu-
lations (Hong et al. 2021), describing physical macroscopic properties, like also in
chaotic dynamical systems for which the highly nonlinear behavior depends on com-
plex interactions at smaller scales (e.g., weather and climate predictions) (Chattopad-
hyay et al. 2020). Generating surrogate models to reproduce the observed macro-
scopic effects due to complex phenomena at the microscale (Wirtz et al. 2015) is often
only possible through ML and Model Order Reduction (MOR) (Wang et al. 2020;
Yvonnet and He 2007). Even in the simplest cases, ML may substantially speed up
the expensive computational costs of classical nonlinear FE2 homogenization tech-
niques (Feng et al. 2022; Wu et al. 2020), allowing for real-time simulations (Rocha
et al. 2021). The nonlinear multiscale case is complex because an infinite number of
simulations would be needed for a complete general database. However, a reduced
dataset may be used to develop a numerical constitutive manifold with sufficient
accuracy, e.g., using Numerically EXplicit Potentials (NEXP) (Yvonnet et al. 2013).
Material designs are often obtained from inverse analyses facilitated by parametric
ML surrogate models (Jackson et al. 2019; Haghighat et al. 2021). In particular, ML
may be employed to determine the phase distributions in heterogeneous materials
(Valdés-Alonzo et al. 2022).
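A minimal sketch of such a parametric surrogate is given below: a Gaussian process regressor maps a few hypothetical microstructure descriptors to a homogenized property obtained from a handful of expensive RVE-type analyses (here replaced by a synthetic formula), returning an uncertainty estimate with each prediction.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(0)
# Hypothetical microstructure descriptors: fiber volume fraction and aspect ratio
X = rng.uniform([0.1, 1.0], [0.6, 20.0], size=(40, 2))
# Stand-in "homogenized" stiffness from a few expensive RVE computations
y = 3.0 + 40.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=40)

gp = GaussianProcessRegressor(kernel=ConstantKernel() * RBF(length_scale=[0.1, 5.0]),
                              normalize_y=True)
gp.fit(X, y)
mean, std = gp.predict(np.array([[0.35, 10.0]]), return_std=True)  # surrogate query + uncertainty
```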
The modeling of classical fiber-based and complex composite heterogeneous
materials often requires multiscale approaches (Pathan et al. 2019; Hadden et al.
2015; Kanouté et al. 2009) because modeling of interactions at the continuum level
requires inaccurate assumptions. CNNs are ideal for dealing with the relation of
an unstructured RVE with continuum equivalent properties. In particular, ML may
be used for dealing with stochastic distributions of constituents (Liu et al. 2022).
Modeling of complex properties such as composite phase changes for thermal man-
agement in Li-ion batteries may be performed with CNNs (Kolodziejczyk et al.
2021). Indeed, CNNs can also be used for performing an inverse analysis (Sorini
et al. 2021). In general, many complex properties and effects observed macroscop-
ically, but through effects mainly attributed to the microscale, are often addressed
with different ML techniques, including CNNs, e.g., Field et al. (2021), Nayak et al.
(2022), and Koumoulos et al. (2019).
Metamaterials are architected materials with inner custom-made structure. With the
current development of 3D printing, metamaterial modeling and design is becoming
an important field (Kadic et al. 2019; Bertoldi et al. 2017; Zadpoor 2016; Barchiesi
et al. 2019) because a material with unique salient properties may be designed ad
libitum allowing for a wide range of applications (Surjadi et al. 2019). Their design
has evolved from the classical optimization-based approach (Sigmund 2009). ML
methods for the design of metamaterials are often used with two objectives. The first
objective is to generate simple surrogate models to accelerate simulations avoiding
FE modeling to the very fine scale describing the structure, especially when non-
linearities are important. The second objective is to perform analyses using a meta-
material topology parametrization which allows for an effective metamaterial design
from macroscopic desired properties. Examples of ML approaches for metamaterials
pursuing these two objectives can be found in, e.g., Wu et al. (2020), Fernández et al.
(2022b), Zheng et al. (2020), and Wilt et al. (2020).
Fluid phenomena and related modeling approaches are very rich, spanning from the
breakup of liquid droplets under different conditions (Krzeczkowski 1980; Roisman
et al. 2018; Liu et al. 2018) to smoke from fires in tunnels (Gannouni and Maad
2016; Wu et al. 2021), emissions from engines (Khurana et al. 2021; Baklacioglu
et al. 2019), flow and wake effects in wind turbines (Clifton et al. 2013; Ti et al.
2020), and free surface flow dynamics (Becker and Teschner 2007; Scardovelli and
Zaleski 1999). The difficulty in obtaining accurate and efficient solutions, especially
when effects at multiple scales are important, has fostered the introduction of ML
techniques. We briefly review some representative works.
∇ū · ū = (1/ρ) ∇(−pI + 2μ∇ˢū − ρ ũ ⊗ ũ)   (1.60)
for which a turbulence closure model is assumed, e.g., eddy viscosity model or
the more involved k − ε (Gerolymos and Vallet 1996) or Spalart–Allmaras models
(Spalart and Allmaras 1992). In Eq. (1.60), ∇ˢū is the average deviatoric strain-
rate tensor. The framework in Eq. (1.60) gives the two commonly used models: the
Reynolds-Averaged Navier–Stokes (RANS) model, best for steady flows (Speziale
1998; Kalitzin et al. 2005), and the Large Eddy Simulations (LES) model, using a
subgrid-scale model, thus much more expensive computationally, but best used to
predict flow separation and fine turbulence details. RANS closure models have been
explored using ML. For example, the work reported (Zhao et al. 2020) trains a tur-
bulence model for wake mixing using a CFD-driven Gene Expression Programming
(an evolutionary algorithm). Physics-informed ML may also be used for augmenting
turbulence models, in particular to overcome the difficulties of ill-conditioning of the
RANS equations with typical Reynolds stress closures, focusing on improving mean
flow predictions (Wu et al. 2018). Results of using ML to improve accuracy of clo-
sure models are, for example, given in Wackers et al. (2020), Wang et al. (2017). One
of the important problems in modeling turbulence and accelerating full field simula-
tions is to upscale the finer details, e.g., vorticity from the small to the larger scale,
using a lower resolution (grid) analysis. These upscaling procedures may be per-
formed by inserting NN corrections which learn the scale evolution relations, greatly
accelerating the computations by allowing lower resolution (Kochkov et al. 2021).
More accurate and faster shock-capturing by NN has been pursued in Stevens and
Colonius (2020), where ML has been applied to improve finite volume methods to
address discontinuous solutions of PDEs. In particular, Weighted Essentially Non-
Oscillatory Neural Network (WENO-NN) approaches establish the smoothness of
the solution to avoid spurious oscillations, still capturing accurately the shock, where
the ML procedure facilitates the computation of the optimal nonlinear coefficients
of each cell average.
ML has been used for some time already in structural mechanics, with probably the
most applications in Structural Health Monitoring (SHM) (Farrar and Worden 2012).
ML is applied for the primal identification of structural systems (SSI) (Sirca and
Adeli 2012; Amezquita-Sancheza et al. 2020), in particular of complex or historical
structures, to assess their general and seismic vulnerability (Ruggieri et al. 2021;
Xie et al. 2020) and facilitate ulterior health monitoring (Mishra 2021). Feature
extraction and model reduction is fundamental in these approaches (Rosafalco et al.
2021). Other areas where ML is employed is in the control of structures (e.g., active
Tuned Mass Dampers, Yucel et al. 2019; Colherinhas et al. 2019; Etedali and Mollayi
2018) under wind, seismic or crowd actions, or in structural design (Herrada et al.
2017; Sun et al. 2021; Hong et al. 2020; Yuan et al. 2020). We also comment in this
section on the development of novel ML approaches based on ideas used in structural
and finite element analyses.
truss bridges (Mehrjoo et al. 2008), and arch bridges (Jayasundara et al. 2020). Dif-
ferent types of NN are used (e.g., Bayesian: Arangio and Beck 2012; Li et al. 2020;
Ni et al. 2001; Convolutional: Nguyen et al. 2020; Quqa et al. 2022; Recurrent: Miao
et al. 2023; Miao and Yokota 2022), and the use of other techniques such as SVM is
also frequent; see, for example, Alamdari et al. (2017), Yu et al. (2021).
Apart from bridges and multi-story buildings (González and Zapico 2008; Wang
et al. 2020), there are many other types of structures for which SSI and SHM are
performed employing ML. Important structures are dams, where a deterioration and
failure may cause massive destruction, hence visual inspection and monitoring of
displacement cycles are typical actions in SHM of dams. The observations feed
ML algorithms to assess the health of the structure. The estimation of the structural
response from collected data is studied for example in Li et al. (2021b), where
a CNN is used to extract features and a bidirectional gated RNN is employed to
perform transfer learning from long-term dependencies. Similar works addressing
SHM of dams are given (Yuan et al. 2022; Sevieri and De Falco 2020). A review
may be found in Salazar et al. (2017).
Of course, different outputs may be pursued and the appropriate ML technique is
related to both available data and desired output. For example, NNs have been used
in Kao and Loh (2013), Ranković et al. (2012), Chen et al. (2018), and He et al.
(2022) to monitor radial and lateral displacements in arch dams. Several ML tech-
niques such as Random Forest (RF), Boosted Regression Trees (BRT), NN, SVM,
and MARS are compared in Salazar et al. (2015) in the prediction of dam displace-
ments and of dam leakage. The researchers found that BRT outperforms the most
common data-driven technique employed when considering this problem, namely the
Hydrostatic-Seasonal-Time method (HST), which accounts for the irreversible evo-
lution of the dam response due to the reversible hydrostatic and thermal loads; see also
Salazar et al. (2016). Gravity dams are a different type of structure from arch dams.
Their reliability under flooding, earthquakes, and aging has also been addressed using
ML methods in Hariri-Ardebili and Pourkamali-Anaraki (2018), where kNN, SVM,
and NB have been used in the binary classification of structural results, and a failure
surface is computed as a function of the dimensions of the dam. Related to dam infras-
tructure planning, flooding susceptibility predictions due to rainfall using NB and
Naïve Bayes Trees (NBT) are compared in Khosravi et al. (2019) with three classical
methods (see review of Multicriteria Decision Making (MCDM) in de Brito and Evers
2016) in the field. For tunnel designs and monitoring, we have that the soil is also an
integral part of the structure and is difficult to characterize. The understanding of its
behavior often depends on qualitative observations; it is therefore another field where
machine learning techniques will have an important impact in the future (Jafari 2020).
Important types of structures considered in SHM are also aerogenerators or Wind
Turbines (WT); see review in Ciang et al. (2008). Here, two main components are
typically analyzed: the blades and the gearbox (Wang et al. 2016). SVM is a frequent
ML technique used and acoustic noise is a source of relevant data for blade moni-
toring (Regan et al. 2016). Deep NNs are also frequently employed when multiple
sources of data are available, in particular CNNs are used to deal with images from
drones (Shihavuddin et al. 2019; Guo et al. 2021). Images are valuable not only in
the detection of overall damage (e.g., determining a damage index value), but also in
determining the location of the damage. This gives an alternative to the placement of
networks of strain sensors (Laflamme et al. 2016). Other WT functional problems,
such as dirt and mud detection in blades to improve maintenance, can be determined
employing different ML methods; e.g., in Jiménez et al. (2020) k-Nearest Neigh-
bors (k-NN), SVM, LDA, PCA, DT, and an ensemble subspace discriminant method
are employed. Other factors like the presence of ice in cold climates are also impor-
tant. In Jiménez et al. (2019), a ML approach is applied to pattern recognition on
guided ultrasonic waves to detect and classify ice thickness. In this work, different
ML techniques are employed for feature extraction (data reduction into meaningful
features), both linear (autoregressive ML models and PCA) and nonlinear (nonlinear-
AR eXogenous and nonlinear PCA), and then feature selection is performed to avoid
overfitting. A wide range of supervised classifiers of different families (DT, LDA,
QDA, several types of SVM, kNN, and ensembles) were employed and compared,
both in terms of accuracy and efficiency.
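A minimal sketch of this type of pipeline (feature reduction followed by a bank of supervised classifiers compared by cross-validation) is given below using scikit-learn; the synthetic data stand in for features extracted from guided-wave signals, and the classifier settings are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for features extracted from guided-wave signals,
# with classes corresponding to, e.g., ice-thickness ranges
X, y = make_classification(n_samples=400, n_features=60, n_informative=8,
                           n_classes=3, n_clusters_per_class=1, random_state=0)

classifiers = {
    "DT": DecisionTreeClassifier(random_state=0),
    "LDA": LinearDiscriminantAnalysis(),
    "SVM": SVC(kernel="rbf"),
    "kNN": KNeighborsClassifier(n_neighbors=5),
}
for name, clf in classifiers.items():
    pipe = make_pipeline(StandardScaler(), PCA(n_components=8), clf)  # feature reduction + classifier
    print(name, cross_val_score(pipe, X, y, cv=5).mean())
```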
Applications of ML can be found also in data preparation, including imputation
techniques to fill missing sensor data (Li et al. 2021a, b). Systems, damage, and
structural responses are assessed employing different variables. Typical variables
are the displacements (building drift), which allow for the determination of mate-
rial and structural geometric properties, for example in reinforced concrete (RC)
columns. This can be achieved through locally weighted LS-SVM (Luo and Paal
2019). Bearing capacities and failure modes of structural components (columns,
beams, shear walls) can also be predicted using ML techniques, in particular when
the classical methods are complex and lack accuracy. For example, in Mangalathu
et al. (2020) several ML methods such as Naïve Bayes, kNN, decision trees, and ran-
dom forests combined with several weighted Boost techniques (similar to ensemble
learning under the assumption that many weak learners make a strong learner) such
as AdaBoost (Adaptative Boost, meaning that new weak learners adapt from mis-
classifications of previous ones) are compared to predict the failure modes (flexural,
diagonal tension or compression, sliding shear) of RC shear walls in seismic events.
Identification of smart structures with nonlinearities, like buildings with magne-
torheological dampers, has been performed through a combination of NN, PCA, and
fuzzy logic (Mohammadzadeh et al. 2015).
In SHM, the integration of data from different types or families of sensors (data
fusion) is an important topic. Data fusion (Hall and Llinas 2001) brings not only
challenges in SHM but also the possibility of more accurate, integral prediction
of the health of the structure (Wu and Jahanshahi 2020). For example, in Vitola
et al. (2017) a data fusion system based on kNN classification was used in SHM.
SHM is most frequently performed through the analysis of the dynamic response of
the structure and comparing vibrational modes using the Modal Assurance Criterion
(MAC) (Ho et al. 2021). However, in the more challenging SSI, many other additional
features are employed as typology, age, and images. In SHM, damage detection is also
pursued through the analysis of images. Visual inspection is a long used method for
crack detection in concrete or steel structures, or to determine unusual displacements
and deformations of the overall structure from global structural images. Automatic
processing and damage detection from images obtained from stationary cameras or an
Unmanned Aereal Vehicle (UAV) (Sankarasrinivasan et al. 2015; Reagan et al. 2018)
is currently being performed using ML techniques. A recent review of these image-
based techniques can be found in Dong and Catbas (2021). Another recent review
of ML applications of SHM of civil structures can be found in Flah et al. (2021).
One of the lessons learnt considering the available results is that to improve predic-
tions and robustness, some progress is needed in physics-based ML approaches for
SHM. For instance, an improvement may be using concrete damage models with envi-
ronmental data, typology, images, etc. to detect damage which may have little impact
in sensors (Kralovec and Schagerl 2020), but which may result in significant losses.
This issue is also of special relevancy in the aircraft industry (Ahmed et al. 2021).
While ML has contributed to CAE and structural design, new ML approaches have
also been developed based on concepts that are traditional in structural analysis and
finite element solutions. For example, one of the ideas is the concept of substructur-
ing, employed in static condensation, Guyan reduction, and Craig–Bampton schemes
(Bathe 2006). In Jokar and Semperlotti (2021) a Finite Element Network Analysis
(FENA) is proposed. The method substitutes the classical finite elements by a library
of “elements” consisting of a Bidirectional Recurrent Neural Network (BRNN). The
BRNN of the elements are trained individually and the training can be computation-
ally costly. Then these trained BRNN are concatenated, and the composite system
needs no further training. The solution is fast, not considering the training, since
in contrast to FE solutions, no system of equations is solved. The method has only
been applied to the analysis of an elastic bar, so the generalization of the idea to the
solution of more complex problems is still an open research task.
The partition of unity used in finite element and meshless methods has been
employed to develop a Finite Element Machine (FEMa) for fast supervised learning
(Pereira et al. 2020). The idea is that each training sample is the center of a Shepard
function, and the training set is treated as a probabilistic manifold. The advantage
is that, as in the case of spline-based approaches, the technique has no parameters.
Compared to several methods (BPNN, Naïve Bayes, SVM using both RBF and sigmoid
kernels, RF, DT, etc.), the FEMa method was competitive in the eighteen benchmark
datasets typically employed in the literature when analyzing supervised methods.
Another interesting approach is the substitution of some procedures of finite
element methods with machine learning approaches. Candidates are material and
general element libraries, creating surrogate material models (discussed above) or
surrogate elements, or patches of elements. This approach follows the substructuring
or multiscale computational homogenization (FE2) idea, but in this case using ML
procedures instead of a RVE finite element mesh. In Capuano and Rimoli (2019),
several possibilities are addressed and applied to nonlinear truss structures and a
(nonlinear) hyperelastic perforated plane strain structure. A similar approach is used
in Yan et al. (2022) for composite shells employing physics-based NNs. In Jung et al.
(2020), finite element matrices passing the patch test are generated from data using a
neural network accounting for some physical constraints, as vanishing strain energy
under rigid body motions.
Despite the already mentioned advance in scientific machine learning in several fields,
much less has been achieved considering multiphysics problems. This is undoubtedly
due to the youth of the discipline, but there are a number of efforts that deserve
mentioning. For instance, in Alexiadis (2019) a system is developed with the aim
of replicating human physiology. In Alizadeh et al. (2021), a similar approach is
developed for nanofluid flow, while (Ren et al. 2020) studies hydrogen production.
In the field of multiphysics problems, there exists a particularly appealing
approach to machine learning, namely that of port-Hamiltonian formalisms (Van
Der Schaft et al. 2014). Port-Hamiltonian systems are essentially open systems
that obey a Hamiltonian description of their physics (and thus, are conservative, or
reversible). Their interaction with the environment is made through a forcing term.
If we call z the set of variables governing the problem (z = (p, q), e.g., position and
momentum, for a canonical Hamiltonian system), its evolution in time will be given
by
ż = J∇ H + F (1.61)
ML techniques have been applied to classical manufacturing since their early con-
ception, and are now important in Additive Manufacturing (AM). Furthermore, ML
is currently being applied to the complete product chain, from conceptual design
to the manufacturing process. Below, we review ML applications in classical and
additive manufacturing, and in automated design.
In one survey, Natural Language Processing (NLP) was applied to documentation from 2005 to 2017 in the field of smart
manufacturing. The survey analyzes aspects ranging from decision support (prior to
the moment, a piece was manufactured), plant and operations health management
(for the manufacturing process itself), data management, as a consequence of the
vast amount of information produced by Internet of Things (IoT) devices installed in
modern plants, or lifecycle management, for instance. The survey concludes that ML-
based techniques are present in the literature (at the moment of publication, 2018) for
product life cycle management. While many of these ML techniques are inherently
designed to perform prognosis (i.e., to predict several aspects related to manufactur-
ing), in Ademujimi et al. (2017) a review is given of literature that employs ML to
perform diagnosis of manufacturing processes.
Due to its inherent technological complexity and our still limited comprehension of
many of the physical processes taking place, additive manufacturing (AM) has been
an active field of research in machine learning. The interested reader can consult
different reviews of the state of the art (Razvi et al. 2019; Meng et al. 2020; Jin et al.
2020; Wang et al. 2020). One of the fields where ML will be very important, and
that is tied to topology optimization, is 3D printing. AM, in particular 3D printing,
represents a revolution in component design and manufacturing because it allows
for infinite possibilities and largely reduced manufacturing difficulties. Moreover,
these technologies are reaching resolutions at the microscale, so a component may
be designed and manufactured with differently designed structures at the mesoscale
(establishing metamaterials), obtaining unprecedented material properties at the con-
tinuum scale thus widening the design space (Barchiesi et al. 2019; Zadpoor 2016).
There are many different AM procedures, like Fused Deposition Modeling (FDM),
Selective Laser Melting (SLM), Direct Energy Deposition (DED), Electron Beam
Melting (EBM), Binder Jetting, etc. While additive manufacturing offers huge pos-
sibilities, it also results into associated new challenges in multiple aspects, from the
detection of porosity (important in the characterization of the printed material) to the
recognition of defects (melting, microstructural, and geometrical), to the character-
ization of the complex anisotropic behavior, which depends on multiple parameters
of the manufacturing process (e.g., laser power in Selected Laser Melting, direction
of printing, powder and printing conditions). Both the design using AM and the error
correction or compensation (Omairi and Ismail 2021) are typical objectives in the
application of ML to AM. Different ML techniques are employed, with SVM one
of the most used schemes. For example, SVM is employed for identifying defective
parts from images in FDM (Delli and Chang 2018), for detecting geometrical defects
in SLM-made components (zur Jacobsmühlen et al. 2015; Gobert et al. 2018), for
building process maps relating variables to desired properties (e.g., as low porosity)
(Aoyagi et al. 2019), and for predicting surface roughness in terms of process features
(Wu et al. 2018). NNs are often used for optimizing the AM process by predicting
properties as a function of printing variables. For example, NNs have been used for
predicting and optimizing melt pool geometry in DED (Caiazzo and Caggiano 2020),
to build process maps and optimize efficiency and surface roughness in SLM (Zhang
et al. 2017), to minimize support wasted material (optimize supports in a piece) in
FDM (Jiang et al. 2019), to predict and optimize resulting mechanical properties of
the printed material (Lewandowski and Seifi 2016) like strength (e.g., using CNN
from thermal histories in Xie et al. 2021 or FFNN in Bayraktar et al. 2017), bending
stiffness in AM composites (Nawafleh and AL-Oqla 2022), and stress–strain curves
of binary composites using a combination of CNN and PCA (Yang et al. 2020).
NNs have also been used to create surrogate models with the purpose of mimicking
the acoustic properties of AM replicas of a Stradivarius violin (Tian et al. 2021).
Reviews of techniques and different applications of machine learning in additive
manufacturing may be found in Wang et al. (2020), DebRoy et al. (2021), Meng
et al. (2020), Qin et al. (2022), Xames et al. (2023), and Hashemi et al. (2022). The
review in Guo et al. (2022) addresses in some detail physics-based proposals.
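A minimal sketch of an SVM-based process map of the kind mentioned above is given below; the process parameters, the rule used to generate acceptability labels, and all numbers are hypothetical stand-ins for experimentally characterized prints.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical SLM process parameters: laser power [W] and scan speed [mm/s]
power = rng.uniform(100.0, 400.0, 300)
speed = rng.uniform(200.0, 1600.0, 300)
X = np.column_stack([power, speed])
# Stand-in label: 1 = acceptable (low-porosity) window, 0 = defective
energy = power / speed
y = ((energy > 0.25) & (energy < 0.9)).astype(int)

svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=10.0))
print("cross-validated accuracy:", cross_val_score(svm, X, y, cv=5).mean())
# The fitted classifier delineates a "process map" in the (power, speed) plane
```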
An example using deep NNs for generative design is given in Yoo et al. (2021), proposing designs of
a wheel, in which variations that comply with mechanical requirements (strength,
eigenfrequencies, etc. evaluated through surrogate models as a function of geometric
parameters) given with shapes are obtained by variations and simplifications using
autoencoders. Based on this work, an interesting discussion between aesthetics and
performance (aspects to include in ML models) is given in Shin et al. (2021). The
combination of topology optimization and generative design can be found in many
endeavors (Oh et al. 2019; Barbieri and Muzzupappa 2022).
Moreover, in the design process, there are many aspects that can be automated.
A typical aspect is the search for components with similar layout such that detailed
drawings, solid models (Chu and Hsu 2006), and manufacturing processes (Li et al.
2016) of new designs may be inferred from previous similar designs (Zehtaban et al.
2016). Indeed, many works focus on procedures to reuse parts of CAD schemes for
electronic circuits (Boning et al. 2019) or to develop microfluidic devices (Lore et al.
2015; Tsur 2020).
1.5 Conclusions
With the current access to large amounts of data and the ubiquitous presence of real-
time sensors in our life, as those present in cell phones, and also with the increased
computational power, Machine Learning (ML) has resulted in a change of paradigm
on how many problems are addressed. When using ML, the approach to many engi-
neering problems is no longer a matter of understanding the governing equations,
not even a matter of fully understanding the problem being addressed, but of having
sufficient data so relations between features and desired outputs can be established;
and not even in a deterministic way, but in an implicit probabilistic way.
ML has been succeeding for more than a decade in solving complex problems such as
face recognition or the prediction of stock market evolution, for which there was no successful deterministic
method, and not even a sound understanding of the actual significance of the main
variables affecting the result. Computer Aided Engineering (CAE), with the Finite
Element Method standing out, had also an extraordinary success in accurately solv-
ing complex engineering problems, but a detailed understanding of the governing
equations and their discretization is needed. This success delayed the introduction of
ML techniques in classical CAE dominated fields, but during the last years increasing
emphasis has been placed on ML methods. In particular, ML is used to solve some of
the issues still remaining when addressing the problem through classical techniques.
Examples of these issues are the still limited generality of classical CAE methods
(although the success of the FEM is due to its good generalization possibilities), the
search for practical solutions when there is not a complete, full understanding of the
problem, and computational efficiency in high-dimensional problems like multiscale
and nonlinear inverse problems. While we are still seeing the start of a new era, already
a large variety of problems in CAE has been addressed using different ML techniques.
Lessons have also been learned in the last few years. One important lesson is
that robustness and reliability of the solution are essential in engineering
(Bathe 2006), and data alone may not be sufficient to guarantee that robustness. Hence,
ML methods that incorporate physical laws and exploit the vast analytical knowledge
acquired over the last centuries may result not only in more robust but also
in more efficient schemes. In this chapter, we briefly reviewed ML techniques in
CAE and some representative applications. We focused on conveying some of the
excitement now developing around the research and use of ML techniques through short
descriptions of the methods and many references to their applications.
Acknowledgements This is part of the training activities of the project funded by the Euro-
pean Union's Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie
Grant Agreement No. 101007815; FJM gratefully acknowledges this funding.
The support of the Spanish Ministry of Science and Innovation, AEI/10.13039/501100011033,
through Grant PID2020-113463RB-C31, of the Regional Government of Aragon through Grant
T24-20R, and of the European Social Fund is also gratefully acknowledged by EC.
References
Alamdari MM, Rakotoarivelo T, Khoa NLD (2017) A spectral-based clustering for structural health
monitoring of the sydney harbour bridge. Mech Syst Signal Process 87:384–400
Alber M, Buganza Tepole A, Cannon WR, De S, Dura-Bernal S, Garikipati K, Karniadakis G, Lytton
WW, Perdikaris P, Petzold L et al (2019) Integrating machine learning and multiscale modeling-
perspectives, challenges, and opportunities in the biological, biomedical, and behavioral sciences.
NPJ Digit Med 2(1):1–11
Aletti M, Bortolossi A, Perotto S, Veneziani A (2015) One-dimensional surrogate models
for advection-diffusion problems. In: Numerical mathematics and advanced applications-
ENUMATH 2013. Springer, Berlin, pp 447–455
Alexiadis A (2019) Deep multiphysics: Coupling discrete multiphysics with machine learning to
attain self-learning in-silico models replicating human physiology. Artif Intell Med 98:27–34
Alizadeh R, Allen JK, Mistree F (2020) Managing computational complexity using surrogate mod-
els: a critical review. Res Eng Des 31(3):275–298
Alizadeh R, Abad JMN, Ameri A, Mohebbi MR, Mehdizadeh A, Zhao D, Karimi N (2021) A
machine learning approach to the prediction of transport and thermodynamic processes in mul-
tiphysics systems-heat transfer in a hybrid nanofluid flow in porous media. J Taiwan Inst Chem
Eng 124:290–306
Alvarez MA, Rosasco L, Lawrence ND et al (2012) Kernels for vector-valued functions: A review.
Found Trends® Mach Learn 4(3):195–266
Amezquita-Sanchez JP, Valtierra-Rodriguez M, Adeli H (2020) Machine learning in structural
engineering. Scientia Iranica 27(6):2645–2656
Amores VJ, Benítez JM, Montáns FJ (2020) Data-driven, structure-based hyperelastic manifolds:
A macro-micro-macro approach to reverse-engineer the chain behavior and perform efficient
simulations of polymers. Comput Struct 231:106209
Amores VJ, Nguyen K, Montáns FJ (2021) On the network orientational affinity assumption in poly-
mers and the micro-macro connection through the chain stretch. J Mech Phys Solids 148:104279
Amores VJ, San Millan FJ, Ben-Yelun I, Montans FJ (2021) A finite strain non-parametric hyperelas-
tic extension of the classical phenomenological theory for orthotropic compressible composites.
Compos Part B: Eng 212:108591
Angelov S, Stoimenova E (2017) Cross-validated sequentially constructed multiple regression. In:
Annual meeting of the bulgarian section of SIAM. Springer, Berlin, pp 13–22
Aoyagi K, Wang H, Sudo H, Chiba A (2019) Simple method to construct process maps for additive
manufacturing using a support vector machine. Addit Manuf 27:353–362
Arangio S, Beck J (2012) Bayesian neural networks for bridge integrity assessment. Struct Control
Health Monit 19(1):3–21
Arangio S, Bontempi F (2015) Structural health monitoring of a cable-stayed bridge with bayesian
neural networks. Struct Infrastruct Eng 11(4):575–587
Arbabi H, Bunder JE, Samaey G, Roberts AJ, Kevrekidis IG (2020) Linking machine learning with
multiscale numerics: Data-driven discovery of homogenized equations. JOM 72(12):4444–4457
Arrieta AB, Díaz-Rodríguez N, Del Ser J, Bennetot A, Tabik S, Barbado A, García S, Gil-López
S, Molina D, Benjamins R et al (2020) Explainable artificial intelligence (XAI): Concepts, tax-
onomies, opportunities and challenges toward responsible ai. Inf Fusion 58:82–115
Asgari S, MirhoseiniNejad S, Moazamigoodarzi H, Gupta R, Zheng R, Puri IK (2021) A gray-box
model for real-time transient temperature predictions in data centers. Appl Therm Eng 185:116319
Ashtiani MN, Raahemi B (2021) Intelligent fraud detection in financial statements using machine
learning and data mining: a systematic literature review. IEEE Access 10:72504–72525
Aubry N (1991) On the hidden beauty of the proper orthogonal decomposition. Theor Comput Fluid
Dyn 2(5):339–352
Audouze C, De Vuyst F, Nair P (2009) Reduced-order modeling of parameterized PDEs using
time-space-parameter principal component analysis. Int J Numer Methods Eng 80(8):1025–1057
Bisong E (2019b) Numpy. In: Building machine learning and deep learning models on google cloud
platform. Springer, Berlin, pp 91–113
Bock FE, Aydin RC, Cyron CJ, Huber N, Kalidindi SR, Klusemann B (2019) A review of the
application of machine learning and data mining approaches in continuum materials mechanics.
Frontiers Mater 6:110
Bonet J, Wood RD (1997) Nonlinear continuum mechanics for finite element analysis. Cambridge
University Press
Boning DS, Elfadel IAM, Li X (2019) A preliminary taxonomy for machine learning in vlsi cad.
In: Machine learning in VLSI computer-aided design. Springer, Berlin, pp 1–16
Borkowski L, Sorini C, Chattopadhyay A (2022) Recurrent neural network-based multiaxial plas-
ticity model with regularization for physics-informed constraints. Comput Struct 258:106678
Braconnier T, Ferrier M, Jouhaud JC, Montagnac M, Sagaut P (2011) Towards an adaptive pod/svd
surrogate model for aeronautic design. Comput Fluids 40(1):195–209
Bro R, Smilde AK (2014) Principal component analysis. Anal Methods 6(9):2812–2831
Brodie CR, Constantin A, Deen R, Lukas A (2020) Machine learning line bundle cohomology.
Fortschritte der Physik 68(1):1900087
Brunton SL, Kutz JN (2022) Data-driven science and engineering: Machine learning, dynamical
systems, and control. Cambridge University Press
Brunton SL, Kutz JN (2019) Methods for data-driven multiscale model discovery for materials. J
Phys: Mater 2(4):044002
Brunton SL, Proctor JL, Kutz JN (2016) Discovering governing equations from data by sparse
identification of nonlinear dynamical systems. Proc Natl Acad Sci 113(15):3932–3937
OpenNN: Build powerful models (2022). https://www.opennn.org/
Buizza C, Casas CQ, Nadler P, Mack J, Marrone S, Titus Z, Le Cornec C, Heylen E, Dur T, Ruiz
LB et al (2022) Data learning: integrating data assimilation and machine learning. J Comput Sci
58:101525
Bukka SR, Magee AR, Jaiman RK (2020) Deep convolutional recurrent autoencoders for flow field
prediction. In: International conference on offshore mechanics and arctic engineering, vol 84409.
American Society of Mechanical Engineers, p V008T08A005
Burkov A (2020) Machine learning engineering, vol 1. True Positive Incorporated
Burkov A (2019) The hundred-page machine learning book, vol 1. Andriy Burkov Quebec City,
QC, Canada
Burov A, Burova O (2020) Development of digital twin for composite pressure vessel. J Phys: Conf
Ser 1441:012133. IOP Publishing
Buşoniu L, de Bruin T, Tolić D, Kober J, Palunko I (2018) Reinforcement learning for control:
Performance, stability, and deep approximators. Ann Rev Control 46:8–28
Bzdok D, Altman N, Krzywinski M (2018) Statistics versus machine learning. Nat Methods 15:233–
234
Caccin M, Li Z, Kermode JR, De Vita A (2015) A framework for machine-learning-augmented
multiscale atomistic simulations on parallel supercomputers. Int J Quantum Chem 115(16):1129–
1139
Caiazzo F, Caggiano A (2020) Laser direct metal deposition of 2024 Al alloy: trace geometry
prediction via machine learning. Materials 11(3):444
Camburn B, He Y, Raviselvam S, Luo J, Wood K (2020) Machine learning-based design concept
evaluation. J Mech Des 142(3):031113
Capuano G, Rimoli JJ (2019) Smart finite elements: A novel machine learning application. Comput
Methods Appl Mech Eng 345:363–381
Carleo G, Cirac I, Cranmer K, Daudet L, Schuld M, Tishby N, Vogt-Maranto L, Zdeborová L (2019)
Machine learning and the physical sciences. Rev Modern Phys 91(4):045002
Carrara P, De Lorenzis L, Stainier L, Ortiz M (2020) Data-driven fracture mechanics. Comput
Methods Appl Mech Eng 372:113390
Cayton L (2005) Algorithms for manifold learning. Univ Calif San Diego Tech Rep 12(1–17):1
Conti S, Müller S, Ortiz M (2018) Data-driven problems in elasticity. Arch Rat Mech Anal
229(1):79–123
Conti S, Müller S, Ortiz M (2020) Data-driven finite elasticity. Arch Rat Mech Anal 237(1):1–33
Cranmer M, Greydanus S, Hoyer S, Battaglia P, Spergel D, Ho S (2020) Lagrangian neural networks.
arXiv:2003.04630
Crawford M, Khoshgoftaar TM, Prusa JD, Richter AN, Al Najada H (2015) Survey of review spam
detection using machine learning techniques. J Big Data 2(1):1–24
Crespo J, Montans FJ (2018) A continuum approach for the large strain finite element analysis of
auxetic materials. Int J Mech Sci 135:441–457
Crespo J, Montáns FJ (2019) General solution procedures to compute the stored energy density of
conservative solids directly from experimental data. Int J Eng Sci 141:16–34
Crespo J, Latorre M, Montáns FJ (2017) WYPIWYG hyperelasticity for isotropic, compressible
materials. Comput Mech 59(1):73–92
Crespo J, Duncan O, Alderson A, Montáns FJ (2020) Auxetic orthotropic materials: Numerical
determination of a phenomenological spline-based stored density energy and its implementation
for finite element analysis. Comput Methods Appl Mech Eng 371:113300
de Brito MM, Evers M (2016) Multi-criteria decision-making for flood risk management: a survey
of the current state of the art. Nat Hazards Earth Syst Sci 16(4):1019–1033
De Jong K (1988) Learning with genetic algorithms: An overview. Mach Learn 3(2):121–138
DebRoy T, Mukherjee T, Wei H, Elmer J, Milewski J (2021) Metallurgy, mechanistic models and
machine learning in metal printing. Nat Rev Mater 6(1):48–68
Delli U, Chang S (2018) Automated process monitoring in 3d printing using supervised machine
learning. Procedia Manuf 26:865–870
Demo N, Tezzele M, Rozza G (2018) Pydmd: Python dynamic mode decomposition. J Open Source
Softw 3(22):530
Dener A, Miller MA, Churchill RM, Munson T, Chang CS (2020) Training neural networks under
physical constraints using a stochastic augmented Lagrangian approach. arXiv:2009.07330
Deng Z, He C, Liu Y, Kim KC (2019) Super-resolution reconstruction of turbulent velocity
fields using a generative adversarial network-based artificial intelligence framework. Phys Fluids
31(12):125111
Denkena B, Bergmann B, Witt M (2019) Material identification based on machine-learning algo-
rithms for hybrid workpieces during cylindrical operations. J Intell Manuf 30(6):2449–2456
Desai SA, Mattheakis M, Sondak D, Protopapas P, Roberts SJ (2021) Port-hamiltonian neural
networks for learning explicit time-dependent dynamical systems. Phys Rev E 104(3):034312
Dhanalaxmi B (2020) Machine learning and its emergence in the modern world and its contribution
to artificial intelligence. In: 2020 International conference for emerging technology (INCET).
IEEE, pp 1–4
Di Leoni PC, Lu L, Meneveau C, Karniadakis G, Zaki TA (2021) DeepONet prediction of linear
instability waves in high-speed boundary layers. arXiv:2105.08697
Dijkstra EW et al (1959) A note on two problems in connexion with graphs. Numerische Mathematik
1(1):269–271
Dillon JV, Langmore I, Tran D, Brevdo E, Vasudevan S, Moore D, Patton B, Alemi A, Hoffman M,
Saurous RA (2017) Tensorflow distributions. arXiv:1711.10604
Domaneschi M, Noori AZ, Pietropinto MV, Cimellaro GP (2021) Seismic vulnerability assessment
of existing school buildings. Comput Struct 248:106522
Dong CZ, Catbas FN (2021) A review of computer vision-based structural health monitoring at
local and global levels. Struct Health Monit 20(2):692–743
du Bos ML, Balabdaoui F, Heidenreich JN (2020) Modeling stress-strain curves with neural net-
works: a scalable alternative to the return mapping algorithm. Comput Mater Sci 178:109629
Duarte AC, Roldan F, Tubau M, Escur J, Pascual S, Salvador A, Mohedano E, McGuinness K,
Torres J, Giro-i Nieto X (2019) WAV2PIX: Speech-conditioned face generation using generative
adversarial networks. In: ICASSP, pp 8633–8637
Duarte F (2018) 5 algoritmos que ya están tomando decisiones sobre tu vida y que quizás tú no
sabías [in Spanish; translation: 5 algorithms that are already making decisions about your life,
and perhaps you did not know]. https://www.bbc.com/mundo/noticias-42916502
Duffy AH (1997) The “what” and “how” of learning in design. IEEE Expert 12(3):71–76
Dumon A, Allery C, Ammar A (2011) Proper general decomposition (PGD) for the resolution of
Navier-Stokes equations. J Comput Phys 230(4):1387–1407
Eggersmann R, Kirchdoerfer T, Reese S, Stainier L, Ortiz M (2019) Model-free data-driven inelas-
ticity. Comput Methods Appl Mech Eng 350:81–99
Eggersmann R, Stainier L, Ortiz M, Reese S (2021) Efficient data structures for model-free data-
driven computational mechanics. Comput Methods Appl Mech Eng 382:113855
Eggersmann R, Stainier L, Ortiz M, Reese S (2021) Model-free data-driven computational mechan-
ics enhanced by tensor voting. Comput Methods Appl Mech Eng 373:113499
Eidnes S, Stasik AJ, Sterud C, Bøhn E, Riemer-Sørensen S (2022) Port-hamiltonian neural networks
with state dependent ports. arXiv:2206.02660
Eilers PH, Marx BD (1996) Flexible smoothing with b-splines and penalties. Stat Sci 11(2):89–121
El Kadi H (2006) Modeling the mechanical behavior of fiber-reinforced polymeric composite mate-
rials using artificial neural networks-a review. Compos Struct 73(1):1–23
El Said B, Hallett SR (2018) Multiscale surrogate modelling of the elastic response of thick com-
posite structures with embedded defects and features. Compos Struct 200:781–798
Erchiqui F, Kandil N (2006) Neuronal networks approach for characterization of softened polymers.
J Reinf Plast Compos 25(5):463–473
Erichson NB, Muehlebach M, Mahoney MW (2019) Physics-informed autoencoders for lyapunov-
stable fluid flow prediction. arXiv:1905.10866
Etedali S, Mollayi N (2018) Cuckoo search-based least squares support vector machine models for
optimum tuning of tuned mass dampers. Int J Struct Stab Dyn 18(02):1850028
Eubank RL (1999) Nonparametric regression and spline smoothing. CRC Press
Farrar CR, Worden K (2012) Structural health monitoring: a machine learning perspective. Wiley,
New York
Feng N, Zhang G, Khandelwal K (2022) Finite strain FE2 analysis with data-driven homogenization
using deep neural networks. Comput Struct 263:106742
Fernández J, Chiachío M, Chiachío J, Muñoz R, Herrera F (2022) Uncertainty quantification in
neural networks by approximate Bayesian computation: Application to fatigue in composite
materials. Eng Appl Artif Intell 107:104511
Fernández M, Fritzen F, Weeger O (2022) Material modeling for parametric, anisotropic finite strain
hyperelasticity based on machine learning with application in optimization of metamaterials. Int
J Numer Methods Eng 123(2):577–609. https://onlinelibrary.wiley.com/doi/full/10.1002/nme.
6869
Field D, Ammouche Y, Peña JM, Jérusalem A (2021) Machine learning based multiscale calibration
of mesoscopic constitutive models for composite materials: application to brain white matter.
Comput Mech 67(6):1629–1643
Fischer CC, Tibbetts KJ, Morgan D, Ceder G (2006) Predicting crystal structure by merging data
mining with quantum mechanics. Nat Mater 5(8):641–646
Fish J, Wagner GJ, Keten S (2021) Mesoscopic and multiscale modelling in materials. Nat Mater
20(6):774–786
Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugenics
7(2):179–188
Flah M, Nunez I, Ben Chaabene W, Nehdi ML (2021) Machine learning algorithms in civil structural
health monitoring: a systematic review. Arch Comput Methods Eng 28(4):2621–2643
Flaschel M, Kumar S, De Lorenzis L (2021) Unsupervised discovery of interpretable hyperelastic
constitutive laws. Comput Methods Appl Mech Eng 381:113852
Flaschel M, Kumar S, De Lorenzis L (2022) Discovering plasticity models without stress data. NPJ
Comput Mater 8(1):1–10
Floyd RW (1962) Algorithm 97: shortest path. Commun ACM 5(6):345
Frank M, Drikakis D, Charissis V (2020) Machine-learning methods for computational science and
engineering. Computation 8(1):15
Frankel AL, Safta C, Alleman C, Jones R (2022) Mesh-based graph convolutional neural networks
for modeling materials with microstructure. J Mach Learn Model Comput 3(1)
Frankel A, Hamel CM, Bolintineanu D, Long K, Kramer S (2022) Machine learning constitutive
models of elastomeric foams. Comput Methods Appl Mech Eng 391:114492
Freischlad M, Schnellenbach-Held M (2005) A machine learning approach for the support of
preliminary structural design. Adv Eng Inf 19(4):281–287
Friedman JH (1989) Regularized discriminant analysis. J Am Stat Assoc 84(405):165–175
Fu K, Li J, Zhang Y, Shen H, Tian Y (2020) Model-guided multi-path knowledge aggregation for
aerial saliency prediction. IEEE Trans Image Process 29:7117–7127
Fuchs A, Heider Y, Wang K, Sun W, Kaliske M (2021) DNN2: A hyper-parameter reinforcement
learning game for self-design of neural network based elasto-plastic constitutive descriptions.
Comput Struct 249:106505
Fuhg JN, Bouklas N, Jones RE (2022) Learning hyperelastic anisotropy from data via a tensor basis
neural network. arXiv:2204.04529
Fuhg JN, Fau A, Bouklas N, Marino M (2022) Elasto-plasticity with convex model-data-driven
yield functions. Hal-03619186v1. https://hal.science/hal-03619186/
Fuhg JN, Böhm C, Bouklas N, Fau A, Wriggers P, Marino M (2021) Model-data-driven constitutive
responses: application to a multiscale computational framework. Int J Eng Sci 167:103522
Gabel J, Desaphy J, Rognan D (2014) Beware of machine learning-based scoring functions. on the
danger of developing black boxes. J Chem Inf Model 54(10):2807–2815
Ganin Y, Bartunov S, Li Y, Keller E, Saliceti S (2021) Computer-aided design as language. Adv
Neural Inf Process Syst 34:5885–5897
Gannouni S, Maad RB (2016) Numerical analysis of smoke dispersion against the wind in a tunnel
fire. J Wind Eng Ind Aerodyn 158:61–68
Gao K, Mei G, Piccialli F, Cuomo S, Tu J, Huo Z (2020) Julia language in machine learning:
Algorithms, applications, and open issues. Comput Sci Rev 37:100254
Garg A, Panigrahi BK (2021) Multi-dimensional digital twin of energy storage system for electric
vehicles: A brief review. Energy Storage 3(6):e242
Garg S, Gupta H, Chakraborty S (2022) Assessment of deeponet for time dependent reliability
analysis of dynamical systems subjected to stochastic loading. Eng Struct 270:114811
Gaurav D, Tiwari SM, Goyal A, Gandhi N, Abraham A (2020) Machine intelligence-based algo-
rithms for spam filtering on document labeling. Soft Comput 24(13):9625–9638
Gero JS (1996) Creativity, emergence and evolution in design. Knowl-Based Syst 9(7):435–448
Gerolymos G, Vallet I (1996) Implicit computation of three-dimensional compressible Navier-
Stokes equations using k-epsilon closure. AIAA J 34(7):1321–1330
Ghosh A, SahaRay R, Chakrabarty S, Bhadra S (2021) Robust generalised quadratic discriminant
analysis. Pattern Recognit 117:107981
Ghoting A, Krishnamurthy R, Pednault E, Reinwald B, Sindhwani V, Tatikonda S, Tian Y,
Vaithyanathan S (2011) SystemML: Declarative machine learning on mapreduce. In: 2011 IEEE
27th international conference on data engineering. IEEE, pp 231–242
Giacinto G, Paolucci R, Roli F (1997) Application of neural networks and statistical pattern recog-
nition algorithms to earthquake risk evaluation. Pattern Recognit Lett 18(11–13):1353–1362
Gin CR, Shea DE, Brunton SL, Kutz JN (2021) DeepGreen: deep learning of Green’s functions for
nonlinear boundary value problems. Sci Rep 11(1):1–14
Glaessgen E, Stargel D (2012) The digital twin paradigm for future NASA and US Air force
vehicles. In: 53rd AIAA/ASME/ASCE/AHS/ASC structures, structural dynamics and materials
conference. 20th AIAA/ASME/AHS adaptive structures conference. 14th AIAA, p 1818
Gobert C, Reutzel EW, Petrich J, Nassar AR, Phoha S (2018) Application of supervised machine
learning for defect detection during metallic powder bed fusion additive manufacturing using
high resolution imaging. Addit Manuf 21:517–528
Gonzalez FJ, Balajewicz M (2018) Deep convolutional recurrent autoencoders for learning low-
dimensional feature dynamics of fluid systems. arXiv:1808.01346
González MP, Zapico JL (2008) Seismic damage identification in buildings using neural networks
and modal data. Comput Struct 86(3–5):416–426
González D, Chinesta F, Cueto E (2019) Learning corrections for hyperelastic models from data.
Front Mater 6:14
González D, Chinesta F, Cueto E (2019) Thermodynamically consistent data-driven computational
mechanics. Contin Mech Thermodyn 31(1):239–253
González D, García-González A, Chinesta F, Cueto E (2020) A data-driven learning method for
constitutive modeling: application to vascular hyperelastic soft tissues. Materials 13(10):2319
González D, Chinesta F, Cueto E (2021) Learning non-markovian physics from data. J Comput
Phys 428:109982
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y
(2020) Generative adversarial networks. Commun ACM 63(11):139–144
Google Cloud: AI and machine learning products. Innovative machine learning products and services
on a trusted platform. https://cloud.google.com/products/ai
Goswami S, Anitescu C, Chakraborty S, Rabczuk T (2020) Transfer learning enhanced physics
informed neural network for phase-field modeling of fracture. Theor Appl Fract Mech 106:102447
Goswami S, Yin M, Yu Y, Karniadakis GE (2022) A physics-informed variational DeepONet for
predicting crack path in quasi-brittle materials. Comput Methods Appl Mech Eng 391:114587
Grefenstette JJ (1993) Genetic algorithms and machine learning. In: Proceedings of the sixth annual
conference on computational learning theory, pp 3–4
Greydanus S, Dzamba M, Yosinski J (2019) Hamiltonian neural networks. Adv Neural Inf Process
Syst 32
Gui G, Pan H, Lin Z, Li Y, Yuan Z (2017) Data-driven support vector machine with optimization
techniques for structural health monitoring and damage detection. KSCE J Civ Eng 21(2):523–
534
Guilleminot J, Dolbow JE (2020) Data-driven enhancement of fracture paths in random composites.
Mech Res Commun 103:103443
Gulli A, Pal S (2017) Deep learning with Keras. Packt Publishing Ltd
Guo J, Liu C, Cao J, Jiang D (2021) Damage identification of wind turbine blades with deep
convolutional neural networks. Renew Energy 174:122–133
Guo S, Agarwal M, Cooper C, Tian Q, Gao RX, Grace WG, Guo Y (2022) Machine learning for
metal additive manufacturing: Towards a physics-informed data-driven paradigm. J Manuf Syst
62:145–163
Hadden CM, Klimek-McDonald DR, Pineda EJ, King JA, Reichanadter AM, Miskioglu I, Gowtham
S, Odegard GM (2015) Mechanical properties of graphene nanoplatelet/carbon fiber/epoxy hybrid
composites: Multiscale modeling and experiments. Carbon 95:100–112
Haghighat E, Juanes R (2021) Sciann: A keras/tensorflow wrapper for scientific computations and
physics-informed deep learning using artificial neural networks. Comput Methods Appl Mech
Eng 373:113552
Haghighat E, Raissi M, Moure A, Gomez H, Juanes R (2021) A physics-informed deep learning
framework for inversion and surrogate modeling in solid mechanics. Comput Methods Appl Mech
Eng 379:113741
Haik W, Maday Y, Chamoin L (2021) A real-time variational data assimilation method with model
bias identification and correction. In: RAMSES: reduced order models; approximation theory;
machine learning; surrogates; emulators and simulators
Hall D, Llinas J (2001) Multisensor data fusion. CRC Press
Hanifa RM, Isa K, Mohamad S (2021) A review on speaker recognition: technology and challenges.
Comput Electr Eng 90:107005
Hariri-Ardebili MA, Pourkamali-Anaraki F (2018) Simplified reliability analysis of multi hazard
risk in gravity dams via machine learning techniques. Arch Civ Mech Eng 18(2):592–610
Hasançebi O, Dumlupınar T (2013) Linear and nonlinear model updating of reinforced concrete
t-beam bridges using artificial neural networks. Comput Struct 119:1–11
Hashemi SM, Parvizi S, Baghbanijavid H, Tan AT, Nematollahi M, Ramazani A, Fang NX, Elahinia
M (2022) Computational modelling of process-structure-property-performance relationships in
metal additive manufacturing: A review. Int Mater Rev 67(1):1–46
Hashemipour S, Ali M (2020) Amazon web services (AWS)–an overview of the on-demand
cloud computing platform. In: International conference for emerging technologies in comput-
ing. Springer, Berlin, pp 40–47
Hassan RJ, Abdulazeez AM et al (2021) Deep learning convolutional neural network for face
recognition: A review. Int J Sci Bus 5(2):114–127
Hastie T, Tibshirani R, Buja A (1994) Flexible discriminant analysis by optimal scoring. J Am Stat
Assoc 89(428):1255–1270
He Q, Laurence DW, Lee CH, Chen JS (2021) Manifold learning based data-driven modeling for
soft biological tissues. J Biomech 117:110124
He S, Shin HS, Tsourdos A (2021) Computational missile guidance: a deep reinforcement learning
approach. J Aerosp Inf Syst 18(8):571–582
He X, He Q, Chen JS (2021) Deep autoencoders for physics-constrained data-driven nonlinear
materials modeling. Comput Methods Appl Mech Eng 385:114034
He Q, Gu C, Valente S, Zhao E, Liu X, Yuan D (2022) Multi-arch dam safety evaluation based on
statistical analysis and numerical simulation. Sci Rep 12(1):1–19
Hebb DO (2005) The organization of behavior: a neuropsychological theory. Psychology Press
Hemati MS, Williams MO, Rowley CW (2014) Dynamic mode decomposition for large and stream-
ing datasets. Phys Fluids 26(11):111701
Hernandez Q, Badias A, Gonzalez D, Chinesta F, Cueto E (2021) Deep learning of thermodynamics-
aware reduced-order models from data. Comput Methods Appl Mech Eng 379:113763
Hernández Q, Badías A, González D, Chinesta F, Cueto E (2021) Structure-preserving neural
networks. J Comput Phys 426:109950
Hernández Q, Badías A, Chinesta F, Cueto E (2022) Thermodynamics-informed graph neural
networks. arXiv:2203.01874
Herrada F, García-Martínez J, Fraile A, Hermanns L, Montáns F (2017) A method for perform-
ing efficient parametric dynamic analyses in large finite element models undergoing structural
modifications. Eng Struct 131:625–638
Hershey JR, Olsen PA (2007) Approximating the Kullback Leibler divergence between Gaus-
sian mixture models. In: 2007 IEEE international conference on acoustics, speech and signal
processing-ICASSP’07, vol 4. IEEE, pp IV–317
Bhadeshia HKDH (1999) Neural networks in materials science. ISIJ Int 39(10):966–979
Ho LV, Nguyen DH, Mousavi M, De Roeck G, Bui-Tien T, Gandomi AH, Wahab MA (2021) A
hybrid computational intelligence approach for structural damage detection using marine predator
algorithm and feedforward neural networks. Comput Struct 252:106568
Hofmann T, Schölkopf B, Smola AJ (2008) Kernel methods in machine learning. Ann Stat
36(3):1171–1220
Hong T, Wang Z, Luo X, Zhang W (2020) State-of-the-art on research and applications of machine
learning in the building life cycle. Energy Build 212:109831
Hong SJ, Chun H, Lee J, Kim BH, Seo MH, Kang J, Han B (2021) First-principles-based machine-
learning molecular dynamics for crystalline polymers with van der waals interactions. J Phys
Chem Lett 12(25):6000–6006
Hoshyar AN, Samali B, Liyanapathirana R, Houshyar AN, Yu Y (2020) Structural damage detection
and localization using a hybrid method and artificial intelligence techniques. Struct Health Monit
19(5):1507–1523
Hosmer Jr DW, Lemeshow S, Sturdivant RX (2013) Applied logistic regression, vol 398. Wiley,
New York
Hou C, Wang J, Wu Y, Yi D (2009) Local linear transformation embedding. Neurocomputing
72(10–12):2368–2378
Huang D, Fuhg JN, Weißenfels C, Wriggers P (2020) A machine learning based plasticity model
using proper orthogonal decomposition. Comput Methods Appl Mech Eng 365:113008
Ibañez R, Borzacchiello D, Aguado JV, Abisset-Chavanne E, Cueto E, Ladeveze P, Chinesta F (2017)
Data-driven non-linear elasticity: constitutive manifold construction and problem discretization.
Comput Mech 60(5):813–826
Ibañez R, Abisset-Chavanne E, Aguado JV, Gonzalez D, Cueto E, Chinesta F (2018) A manifold
learning approach to data-driven computational elasticity and inelasticity. Arch Comput Methods
Eng 25(1):47–57
Ibañez R, Gilormini P, Cueto E, Chinesta F (2020) Numerical experiments on unsupervised manifold
learning applied to mechanical modeling of materials and structures. Comptes Rendus. Mécanique
348(10–11):937–958
Ibragimova O, Brahme A, Muhammad W, Lévesque J, Inal K (2021) A new ANN based crystal
plasticity model for fcc materials and its application to non-monotonic strain paths. Int J Plast
144:103059
Innes M (2018) Flux: Elegant machine learning with Julia. J Open Source Softw 3(25):602
Innes M, Edelman A, Fischer K, Rackauckas C, Saba E, Shah VB, Tebbutt W (2019) A differentiable
programming system to bridge machine learning and scientific computing. arXiv:1907.07587
Inoue H (2018) Data augmentation by pairing samples for images classification. arXiv:1801.02929
Jackson NE, Webb MA, de Pablo JJ (2019) Recent advances in machine learning towards multiscale
soft materials design. Curr Opin Chem Eng 23:106–114
Jafari M (2020) System identification of a soil tunnel based on a hybrid artificial neural network-
numerical model approach. Iran J Sci Technol, Trans Civ Eng 44(3):889–899
Jagtap AD, Kharazmi E, Karniadakis GE (2020) Conservative physics-informed neural networks on
discrete domains for conservation laws: Applications to forward and inverse problems. Comput
Methods Appl Mech Eng 365:113028
Jang DP, Fazily P, Yoon JW (2021) Machine learning-based constitutive model for J2-plasticity. Int
J Plast 138:102919
Jansson T, Nilsson L, Redhe M (2003) Using surrogate models and response surfaces in struc-
tural optimization-with application to crashworthiness design and sheet metal forming. Struct
Multidiscip Optim 25(2):129–140
Jayasundara N, Thambiratnam D, Chan T, Nguyen A (2020) Damage detection and quantification
in deck type arch bridges using vibration based methods and artificial neural networks. Eng Fail
Anal 109:104265
Jha D, Singh S, Al-Bahrani R, Liao WK, Choudhary A, De Graef M, Agrawal A (2018) Extracting
grain orientations from EBSD patterns of polycrystalline materials using convolutional neural
networks. Microsc Microanal 24(5):497–502
Jiang J, Hu G, Li X, Xu X, Zheng P, Stringer J (2019) Analysis and prediction of printable bridge
length in fused deposition modelling based on back propagation neural network. Virtual Phys
Prototyp 14(3):253–266
Jiménez AA, Márquez FPG, Moraleda VB, Muñoz CQG (2019) Linear and nonlinear features and
machine learning for wind turbine blade ice detection and diagnosis. Renew Energy 132:1034–
1048
Jiménez AA, Zhang L, Muñoz CQG, Márquez FPG (2020) Maintenance management based on
machine learning and nonlinear features in wind turbines. Renew Energy 146:316–328
Jin Z, Zhang Z, Demir K, Gu GX (2020) Machine learning for advanced additive manufacturing.
Matter 3(5):1541–1556
Jokar M, Semperlotti F (2021) Finite element network analysis: A machine learning based compu-
tational framework for the simulation of physical systems. Comput Struct 247:106484
Jovanović MR, Schmid PJ, Nichols JW (2014) Sparsity-promoting dynamic mode decomposition.
Phys Fluids 26(2):024103
Jung J, Yoon JI, Park HK, Jo H, Kim HS (2020) Microstructure design using machine learning
generated low dimensional and continuous design space. Materialia 11:100690
Jung J, Yoon K, Lee PS (2020) Deep learned finite elements. Comput Methods Appl Mech Eng
372:113401
Kadic M, Milton GW, van Hecke M, Wegener M (2019) 3d metamaterials. Nat Rev Phys 1(3):198–
210
Kaehler A, Bradski G (2016) Learning OpenCV 3: computer vision in C++ with the OpenCV
library. O’Reilly Media, Inc
Kalitzin G, Medic G, Iaccarino G, Durbin P (2005) Near-wall behavior of RANS turbulence models
and implications for wall functions. J Comput Phys 204(1):265–291
Kamath C (2001) On mining scientific datasets. In: Data mining for scientific and engineering
applications. Springer, Berlin, pp 1–21
Kanno Y (2018) Data-driven computing in elasticity via kernel regression. Theor Appl Mech Lett
8(6):361–365
Kanouté P, Boso D, Chaboche JL, Schrefler B (2009) Multiscale methods for composites: a review.
Arch Comput Methods Eng 16(1):31–75
Karthikeyan J, Hie TS, Jin NY (eds) (2021) Learning outcomes of classroom research. L’Ordine
Novo Publication, Tamil Nadu, India. ISBN: 978-93-92995-15-6
Kao CY, Loh CH (2013) Monitoring of long-term static deformation data of fei-tsui arch dam using
artificial neural network-based approaches. Struct Control Health Monit 20(3):282–303
Karniadakis GE, Kevrekidis IG, Lu L, Perdikaris P, Wang S, Yang L (2021) Physics-informed
machine learning. Nat Rev Phys 3(6):422–440
Karpatne A, Atluri G, Faghmous JH, Steinbach M, Banerjee A, Ganguly A, Shekhar S, Samatova
N, Kumar V (2017) Theory-guided data science: A new paradigm for scientific discovery from
data. IEEE Trans Knowl Data Eng 29(10):2318–2331
Kashinath K, Mustafa M, Albert A, Wu J, Jiang C, Esmaeilzadeh S, Azizzadenesheli K, Wang
R, Chattopadhyay A, Singh A et al (2021) Physics-informed machine learning: case studies for
weather and climate modelling. Philos Trans R Soc A 379(2194):20200093
Khan S, Awan MJ (2018) A generative design technique for exploring shape variations. Adv Eng
Inform 38:712–724
Khosravi K, Shahabi H, Pham BT, Adamowski J, Shirzadi A, Pradhan B, Dou J, Ly HB, Gróf G, Ho
HL et al (2019) A comparative assessment of flood susceptibility modeling using multi-criteria
decision-making analysis and machine learning methods. J Hydrol 573:311–323
Khurana S, Saxena S, Jain S, Dixit A (2021) Predictive modeling of engine emissions using machine
learning: A review. Mater Today: Proc 38:280–284
Kim P (2017) Matlab deep learning: with machine learning, neural networks and artificial intelli-
gence. Apress
Kim JW, Lee BH, Shaw MJ, Chang HL, Nelson M (2001) Application of decision-tree induction
techniques to personalized advertisements on internet storefronts. Int J Electron Commer 5(3):45–
62
Kim C, Batra R, Chen L, Tran H, Ramprasad R (2021) Polymer design using genetic algorithm and
machine learning. Comput Mater Sci 186:110067
Kim Y, Park HK, Jung J, Asghari-Rad P, Lee S, Kim JY, Jung HG, Kim HS (2021) Exploration
of optimal microstructure and mechanical properties in continuous microstructure space using a
variational autoencoder. Mater Des 202:109544
King DE (2009) Dlib-ml: a machine learning toolkit. J Mach Learn Res 10:1755–1758
Kirchdoerfer T, Ortiz M (2016) Data-driven computational mechanics. Comput Methods Appl Mech
Eng 304:81–101
Kirchdoerfer T, Ortiz M (2017) Data driven computing with noisy material data sets. Comput
Methods Appl Mech Eng 326:622–641
Klein DK, Fernández M, Martin RJ, Neff P, Weeger O (2022) Polyconvex anisotropic hyperelasticity
with neural networks. J Mech Phys Solids 159:104703
Kleinbaum DG, Dietz K, Gail M, Klein M, Klein M (2002) Logistic regression. Springer, Berlin
Ko J, Ni YQ (2005) Technology developments in structural health monitoring of large-scale bridges.
Eng Struct 27(12):1715–1725
Kochkov D, Smith JA, Alieva A, Wang Q, Brenner MP, Hoyer S (2021) Machine learning-
accelerated computational fluid dynamics. Proc Natl Acad Sci 118(21):e2101784118
Kolodziejczyk F, Mortazavi B, Rabczuk T, Zhuang X (2021) Machine learning assisted multiscale
modeling of composite phase change materials for li-ion batteries’ thermal management. Int J
Heat Mass Transf 172:121199
Kontolati K, Alix-Williams D, Boffi NM, Falk ML, Rycroft CH, Shields MD (2021) Manifold learn-
ing for coarse-graining atomistic simulations: Application to amorphous solids. Acta Materialia
215:117008
Kossaifi J, Panagakis Y, Anandkumar A, Pantic M (2016) Tensorly: Tensor learning in python.
arXiv:1610.09555
Koumoulos E, Konstantopoulos G, Charitidis C (2019) Applying machine learning to nanoinden-
tation data of (nano-) enhanced composites. Fibers 8(1):3
Kralovec C, Schagerl M (2020) Review of structural health monitoring methods regarding a multi-
sensor approach for damage assessment of metal and composite structures. Sensors 20(3):826
Kramer MA (1991) Nonlinear principal component analysis using autoassociative neural networks.
AIChE J 37(2):233–243
Krzeczkowski SA (1980) Measurement of liquid droplet disintegration mechanisms. Int J Multiph
Flow 6(3):227–239
Kulkrni KS, Kim DK, Sekar S, Samui P (2011) Model of least square support vector machine
(lssvm) for prediction of fracture parameters of concrete. Int J Concr Struct Mater 5(1):29–33
Ladevèze P, Néron D, Gerbaud PW (2019) Data-driven computation for history-dependent materials.
Comptes Rendus Mécanique 347(11):831–844
Laflamme S, Cao L, Chatzi E, Ubertini F (2016) Damage detection and localization from dense
network of strain sensors. Shock Vib 2016:2562946
Lakshminarayan K, Harp SA, Goldman RP, Samad T et al (1996) Imputation of missing data using
machine learning techniques. In: KDD, vol 96. https://cdn.aaai.org/KDD/1996/KDD96-023.pdf
Langley P et al (2011) The changing science of machine learning. Mach Learn 82(3):275–279
Lantz B (2019) Machine learning with R: expert techniques for predictive modeling. Packt Pub-
lishing Ltd
Latorre M, Montáns FJ (2013) Extension of the Sussman-Bathe spline-based hyperelastic model to
incompressible transversely isotropic materials. Comput Struct 122:13–26
Latorre M, Montáns FJ (2017) WYPiWYG hyperelasticity without inversion formula: Application
to passive ventricular myocardium. Comput Struct 185:47–58
Latorre M, Montáns FJ (2020) Experimental data reduction for hyperelasticity. Comput Struct
232:105919
Latorre M, De Rosa E, Montáns FJ (2017) Understanding the need of the compression branch to
characterize hyperelastic materials. Int J Non-Linear Mech 89:14–24
Latorre M, Peña E, Montáns FJ (2017) Determination and finite element validation of the WYPI-
WYG strain energy of superficial fascia from experimental data. Ann Biomed Eng 45(3):799–810
Latorre M, Mohammadkhah M, Simms CK, Montáns FJ (2018) A continuum model for tension-
compression asymmetry in skeletal muscle. J Mech Behav Biomed Mater 77:455–460
Le QV, Ngiam J, Coates A, Lahiri A, Prochnow B, Ng AY (2011) On optimization methods for deep
learning. In: ICML’11: Proceedings of the 28th international conference on machine learning, pp
265–272
Lee J, Kim J, Yun CB, Yi J, Shim J (2002) Health-monitoring method for bridges under ordinary
traffic loadings. J Sound Vib 257(2):247–264
Lee DW, Hong SH, Cho SS, Joo WS (2005) A study on fatigue damage modeling using neural
networks. J Mech Sci Technol 19(7):1393–1404
Lewandowski JJ, Seifi M (2016) Metal additive manufacturing: a review of mechanical properties.
Annu Rev Mater Res 46:151–186
Lewis FL, Liu D (2013) Reinforcement learning and approximate dynamic programming for feed-
back control. Wiley, New York
Miao P, Yokota H (2022) Comparison of markov chain and recurrent neural network in predicting
bridge deterioration considering various factors. Struct Infrastruct Eng 1–13. https://doi.org/10.
1080/15732479.2022.2087691
Miao P, Yokota H, Zhang Y (2023) Deterioration prediction of existing concrete bridges using a
LSTM recurrent neural network. Struct Infrastruct Eng 19(4):475–489
Michalski RS, Carbonell JG, Mitchell TM (2013) Machine learning: An artificial intelligence
approach. Springer Science & Business Media
Miñano M, Montáns FJ (2015) A new approach to modeling isotropic damage for Mullins effect in
hyperelastic materials. Int J Solids Struct 67:272–282
Miñano M, Montáns FJ (2018) WYPiWYG damage mechanics for soft materials: A data-driven
approach. Arch Comput Methods Eng 25(1):165–193
Mishra M (2021) Machine learning techniques for structural health monitoring of heritage buildings:
A state-of-the-art review and case studies. J Cult Herit 47:227–245
Mitchell M (1998) An introduction to genetic algorithms. MIT Press
Miyazawa Y, Briffod F, Shiraiwa T, Enoki M (2019) Prediction of cyclic stress-strain property of
steels by crystal plasticity simulations and machine learning. Materials 12(22):3668
Mohammadzadeh S, Kim Y, Ahn J (2015) Pca-based neuro-fuzzy model for system identification
of smart structures. J Smart Struct Syst 15(5):1139–1158
Molnar C, Casalicchio G, Bischl B (2018) iml: An R package for interpretable machine learning. J
Open Source Softw 3(26):786
Monostori L, Márkus A, Van Brussel H, Westkämpfer E (1996) Machine learning approaches to
manufacturing. CIRP Ann 45(2):675–712
Morandin R, Nicodemus J, Unger B (2022) Port-Hamiltonian dynamic mode decomposition.
arXiv:2204.13474
Moreno S, Amores VJ, Benítez JM, Montáns FJ (2020) Reverse-engineering and modeling the 3d
passive and active responses of skeletal muscle using a data-driven, non-parametric, spline-based
procedure. J Mech Behav Biomed Mater 110:103877
Moroni D, Pascali MA (2021) Learning topology: bridging computational topology and machine
learning. Pattern Recognit Image Anal 31(3):443–453
Moya B, Alfaro I, Gonzalez D, Chinesta F, Cueto E (2020) Physically sound, self-learning digital
twins for sloshing fluids. Plos One 15(6):e0234569
Moya B, Badías A, Alfaro I, Chinesta F, Cueto E (2022) Digital twins that learn and correct
themselves. Int J Numer Methods Eng 123(13):3034–3044
Mozaffar M, Bostanabad R, Chen W, Ehmann K, Cao J, Bessa M (2019) Deep learning predicts
path-dependent plasticity. Proc Natl Acad Sci 116(52):26414–26420
Mukherjee S, Lu D, Raghavan B, Breitkopf P, Dutta S, Xiao M, Zhang W (2021) Accelerating
large-scale topology optimization: state-of-the-art and challenges. Arch Comput Methods Eng
28(7):4549–4571
Muñoz D, Nadal E, Albelda J, Chinesta F, Ródenas J (2022) Allying topology and shape optimization
through machine learning algorithms. Finite Elem Anal Des 204:103719
Murata T, Fukami K, Fukagata K (2020) Nonlinear mode decomposition with convolutional neural
networks for fluid dynamics. J Fluid Mech 882:A13
Murphy KP (2012) Machine Learning: a probabilistic perspective. MIT Press. ISBN 978-
0262018029
Muthali A, Laine F, Tomlin C (2021) Incorporating data uncertainty in object tracking algorithms.
arXiv:2109.10521
Nascimento RG, Viana FA (2020) Cumulative damage modeling with recurrent neural networks.
AIAA J 58(12):5459–5471
Nasiri S, Khosravani MR, Weinberg K (2017) Fracture mechanics and mechanical fault detection
by artificial intelligence methods: A review. Eng Fail Anal 81:270–293
Nassif AB, Talib MA, Nassir Q, Albadani H, Albab FD (2021) Machine learning for cloud security:
a systematic review. IEEE Access 9:20717–20735
Nawafleh N, AL-Oqla FM (2022) Artificial neural network for predicting the mechanical perfor-
mance of additive manufacturing thermoset carbon fiber composite materials. J Mech Behav
Mater 31(1):501–513
Nayak HD, Anvitha L, Shetty A, D’Souza DJ, Abraham MP et al (2021) Fraud detection in online
transactions using machine learning approaches—a review. Adv Artif Intell Data Engg 589–599
Nayak S, Lyngdoh GA, Shukla A, Das S (2022) Predicting the near field underwater explosion
response of coated composite cylinders using multiscale simulations, experiments, and machine
learning. Compos Struct 283:115157
Nguyen LTK, Keip MA (2018) A data-driven approach to nonlinear elasticity. Comput Struct
194:97–115
Nguyen DH, Nguyen QB, Bui-Tien T, De Roeck G, Wahab MA (2020) Damage detection in girder
bridges using modal curvatures gapped smoothing method and convolutional neural network:
Application to bo nghi bridge. Theor Appl Fract Mech 109:102728
Nguyen-Thanh VM, Zhuang X, Rabczuk T (2020) A deep energy method for finite deformation
hyperelasticity. Eur J Mech-A/Solids 80:103874
Ni YQ, Jiang S, Ko JM (2001) Application of adaptive probabilistic neural network to damage detec-
tion of tsing ma suspension bridge. In: Health monitoring and management of civil infrastructure
systems, vol 4337. SPIE, pp 347–356
Ni F, Zhang J, Noori MN (2020) Deep learning for data anomaly detection and data compression
of a long-span suspension bridge. Comput-Aided Civ Infrastruct Eng 35(7):685–700
Nick H, Aziminejad A, Hosseini MH, Laknejadi K (2021) Damage identification in steel girder
bridges using modal strain energy-based damage index method and artificial neural network. Eng
Fail Anal 119:105010
Oh S, Jung Y, Kim S, Lee I, Kang N (2019) Deep generative design: Integration of topology
optimization and generative models. J Mech Des 141(11):111405
Olivier A, Shields MD, Graham-Brady L (2021) Bayesian neural networks for uncertainty quan-
tification in data-driven materials modeling. Comput Methods Appl Mech Eng 386:114079
Omairi A, Ismail ZH (2021) Towards machine learning for error compensation in additive manu-
facturing. Appl Sci 11(5):2375
Ongsulee P (2017) Artificial intelligence, machine learning and deep learning. In: 2017 15th inter-
national conference on ICT and knowledge engineering (ICT&KE). IEEE, pp 1–6
Paluszek M, Thomas S (2016) MATLAB machine learning. Apress
Panagiotopoulos P, Waszczyszyn Z (1999) The neural network approach in plasticity and fracture
mechanics. In: Neural networks in the analysis and design of structures. Springer, Berlin, pp
161–195
Panakkat A, Adeli H (2009) Recurrent neural network for approximate earthquake time and location
prediction using multiple seismicity indicators. Comput-Aided Civ Infrastruct Eng 24(4):280–
292
Pang G, Lu L, Karniadakis GE (2019) fPINNs: Fractional physics-informed neural networks. SIAM
J Sci Comput 41(4):A2603–A2626
Paszkowicz W (2009) Genetic algorithms, a nature-inspired tool: survey of applications in materials
science and related fields. Mater Manuf Process 24(2):174–197
Pathak J, Subramanian S, Harrington P, Raja S, Chattopadhyay A, Mardani M, Kurth T, Hall D,
Li Z, Azizzadenesheli K et al (2022) Fourcastnet: A global data-driven high-resolution weather
model using adaptive fourier neural operators. arXiv:2202.11214
Pathan M, Ponnusami S, Pathan J, Pitisongsawat R, Erice B, Petrinic N, Tagarielli V (2019) Pre-
dictions of the mechanical properties of unidirectional fibre composites by supervised machine
learning. Sci Rep 9(1):1–10
Ding S, Lin L, Wang G, Chao H (2015) Deep feature learning with relative distance comparison
for person re-identification. Pattern Recognit 48(10):2993–3003
Pawar S, San O, Nair A, Rasheed A, Kvamsdal T (2021) Model fusion with physics-guided machine
learning: Projection-based reduced-order modeling. Phys Fluids 33(6):067123
Regan T, Canturk R, Slavkovsky E, Niezrecki C, Inalpolat M (2016) Wind turbine blade dam-
age detection using various machine learning algorithms. In: International design engineering
technical conferences and computers and information in engineering conference, vol 50206, p
V008T10A040. American Society of Mechanical Engineers
Regazzoni F, Dedè L, Quarteroni A (2020) Machine learning of multiscale active force generation
models for the efficient simulation of cardiac electromechanics. Comput Methods Appl Mech
Eng 370:113268
Ren T, Wang L, Chang C, Li X (2020) Machine learning-assisted multiphysics coupling performance
optimization in a photocatalytic hydrogen production system. Energy Convers Manag 216:112935
Rice L, Wong E, Kolter Z (2020) Overfitting in adversarially robust deep learning. In: International
conference on machine learning, pp 8093–8104. PMLR
Rocha I, Kerfriden P, van der Meer F (2020) Micromechanics-based surrogate models for the
response of composites: a critical comparison between a classical mesoscale constitutive model,
hyper-reduction and neural networks. Eur J Mech-A/Solids 82:103995
Rocha I, Kerfriden P, van der Meer F (2021) On-the-fly construction of surrogate constitutive models
for concurrent multiscale mechanical analysis through probabilistic machine learning. J Comput
Phys: X 9:100083
Rodríguez M, Kramer T (2019) Machine learning of two-dimensional spectroscopic data. Chem
Phys 520:52–60
Rogers T, Holmes G, Cross E, Worden K (2017) On a grey box modelling framework for nonlinear
system identification. In: Special topics in structural dynamics, vol 6, pp 167–178. Springer,
Berlin
Roisman I, Breitenbach J, Tropea C (2018) Thermal atomisation of a liquid drop after impact onto
a hot substrate. J Fluid Mech 842:87–101
Romero X, Latorre M, Montáns FJ (2017) Determination of the WYPiWYG strain energy density
of skin through finite element analysis of the experiments on circular specimens. Finite Elem
Anal Des 134:1–15
Rosafalco L, Torzoni M, Manzoni A, Mariani S, Corigliano A (2021) Online structural health
monitoring by model order reduction and deep learning algorithms. Comput Struct 255:106604
Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and organization
in the brain. Psychol Rev 65(6):386
Rosti A, Rota M, Penna A (2022) An empirical seismic vulnerability model. Bull Earthq Eng
20:4147–4173
Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Sci-
ence 290(5500):2323–2326
Rowley CW (2005) Model reduction for fluids, using balanced proper orthogonal decomposition.
Int J Bifurc Chaos 15(03):997–1013
Rubio PB, Chamoin L, Louf F (2021) Real-time data assimilation and control on mechanical systems
under uncertainties. Adv Model Simul Eng Sci 8(1):1–25
Rudy SH, Brunton SL, Proctor JL, Kutz JN (2017) Data-driven discovery of partial differential
equations. Sci Adv 3(4):e1602614
Ruggieri S, Cardellicchio A, Leggieri V, Uva G (2021) Machine-learning based vulnerability anal-
ysis of existing buildings. Autom Constr 132:103936
Salazar F, Toledo M, Oñate E, Morán R (2015) An empirical comparison of machine learning
techniques for dam behaviour modelling. Struct Saf 56:9–17
Salazar F, Toledo MÁ, Oñate E, Suárez B (2016) Interpretation of dam deformation and leakage
with boosted regression trees. Eng Struct 119:230–251
Salazar F, Morán R, Toledo MÁ, Oñate E (2017) Data-based models for the prediction of dam
behaviour: a review and some methodological considerations. Arch Comput Methods Eng
24(1):1–21
Salloum SA, Alshurideh M, Elnagar A, Shaalan K (2020) Machine learning and deep learning
techniques for cybersecurity: a review. In: The international conference on artificial intelligence
and computer vision. Springer, Berlin, pp 50–57
Salman O, Elhajj IH, Kayssi A, Chehab A (2020) A review on machine learning-based approaches
for internet traffic classification. Ann Telecommun 75(11):673–710
Salman O, Elhajj IH, Chehab A, Kayssi A (2022) A machine learning based framework for
IoT device identification and abnormal traffic detection. Trans Emerg Telecommun Technol
33(3):e3743
Sancarlos A, Cameron M, Le Peuvedic JM, Groulier J, Duval JL, Cueto E, Chinesta F (2021)
Learning stable reduced-order models for hybrid twins. Data-Centric Eng 2:e10
Sankarasrinivasan S, Balasubramanian E, Karthik K, Chandrasekar U, Gupta R (2015) Health
monitoring of civil structures with integrated uav and image processing system. Procedia Comput
Sci 54:508–515
Sarmadi H, Karamodin A (2020) A novel anomaly detection method based on adaptive mahalanobis-
squared distance and one-class knn rule for structural health monitoring under environmental
effects. Mech Syst Signal Process 140:106495
Scardovelli R, Zaleski S (1999) Direct numerical simulation of free-surface and interfacial flow.
Annu Rev Fluid Mech 31(1):567–603
Schmid PJ (2010) Dynamic mode decomposition of numerical and experimental data. J Fluid Mech
656:5–28
Schmid PJ (2011) Application of the dynamic mode decomposition to experimental data. Exp Fluids
50(4):1123–1130
Schmid PJ, Li L, Juniper MP, Pust O (2011) Applications of the dynamic mode decomposition.
Theor Comput Fluid Dyn 25(1):249–259
Schmidt M, Lipson H (2009) Distilling free-form natural laws from experimental data. Science
324(5923):81–85
Schmidt M, Lipson H (2010) Symbolic regression of implicit equations. In: Genetic programming
theory and practice VII. Springer, Berlin, pp 73–85
Schölkopf B, Herbrich R, Smola AJ (2001) A generalized representer theorem. In: International
conference on computational learning theory. Springer, Berlin, pp 416–426
Schölkopf B, Smola A, Müller KR (1997) Kernel principal component analysis. In: International
conference on artificial neural networks. Springer, Berlin, pp 583–588
Searson D (2009) GPTIPS: Genetic programming and symbolic regression for matlab. https://sites.
google.com/site/gptips4matlab/?pli=1
Seff A, Zhou W, Richardson N, Adams RP (2021) Vitruvion: A generative model of parametric cad
sketches. arXiv:2109.14124
Seibi A, Al-Alawi S (1997) Prediction of fracture toughness using artificial neural networks (anns).
Eng Fracture Mech 56(3):311–319
Sevieri G, De Falco A (2020) Dynamic structural health monitoring for concrete gravity dams based
on the bayesian inference. J Civ Struct Health Monit 10(2):235–250
Sharma S, Bhatt M, Sharma P (2020) Face recognition system using machine learning algorithm.
In: 2020 5th international conference on communication and electronics systems (ICCES). IEEE,
pp 1162–1168
Sharp M, Ak R, Hedberg T Jr (2018) A survey of the advancing use and development of machine
learning in smart manufacturing. J Manuf Syst 48:170–179
Shihavuddin A, Chen X, Fedorov V, Nymark Christensen A, Andre Brogaard Riis N, Branner K,
Bjorholm Dahl A, Reinhold Paulsen R (2019) Wind turbine surface damage detection by deep
learning aided drone inspection analysis. Energies 12(4):676
Shin D, Yoo S, Lee S, Kim M, Hwang KH, Park JH, Kang N (2021) How to trade off aesthetics
and performance in generative design? In: The 2021 world congress on advances in structural
engineering and mechanics (ASEM21). IASEM, KAIST, KTA, SNU DAAE
Shorten C, Khoshgoftaar TM (2019) A survey on image data augmentation for deep learning. J Big
Data 6(1):1–48
Shu D, Cunningham J, Stump G, Miller SW, Yukish MA, Simpson TW, Tucker CS (2020) 3D design
using generative adversarial networks and physics-based validation. J Mech Des 142(7):071701
Sigmund O (2009) Systematic design of metamaterials by topology optimization. In: IUTAM sym-
posium on modelling nanomaterials and nanosystems. Springer, Berlin, pp 151–159
Silva M, Santos A, Figueiredo E, Santos R, Sales C, Costa JC (2016) A novel unsupervised approach
based on a genetic algorithm for structural damage detection in bridges. Eng Appl Artif Intell
52:168–180
Simpson T, Dervilis N, Chatzi E (2021) Machine learning approach to model order reduction of
nonlinear systems via autoencoder and LSTM networks. J Eng Mech 147(10):04021061
Singh AP, Medida S, Duraisamy K (2017) Machine-learning-augmented predictive modeling of
turbulent separated flows over airfoils. AIAA J 55(7):2215–2227
Singh H, Gupta M, Mahajan P (2017) Reduced order multiscale modeling of fiber reinforced
polymer composites including plasticity and damage. Mech Mater 111:35–56
Sirca G Jr, Adeli H (2012) System identification in structural engineering. Scientia Iranica
19(6):1355–1364
Soize C, Ghanem R (2020) Physics-constrained non-Gaussian probabilistic learning on manifolds.
Int J Numer Methods Eng 121(1):110–145
Sordoni A, Bengio Y, Vahabi H, Lioma C, Grue Simonsen J, Nie JY (2015) A hierarchical recurrent
encoder-decoder for generative context-aware query suggestion. In: Proceedings of the 24th ACM
international conference on information and knowledge management, pp 553–562
Sorini A, Pineda EJ, Stuckner J, Gustafson PA (2021) A convolutional neural network for multiscale
modeling of composite materials. In: AIAA Scitech 2021 Forum, p 0310
Spalart P, Allmaras S (1992) A one-equation turbulence model for aerodynamic flows. In: 30th
aerospace sciences meeting and exhibit, p 439
Speziale CG (1998) Turbulence modeling for time-dependent RANS and VLES: a review. AIAA J
36(2):173–184
Sprangers O, Babuška R, Nageshrao SP, Lopes GA (2014) Reinforcement learning for port-
Hamiltonian systems. IEEE Trans Cybern 45(5):1017–1027
Stahl BC (2021) Artificial intelligence for a better future: an ecosystem perspective on the ethics
of ai and emerging digital technologies. Springer Nature
Stainier L, Leygue A, Ortiz M (2019) Model-free data-driven methods in mechanics: material data
identification and solvers. Comput Mech 64(2):381–393
Stančin, I., Jović, A.: An overview and comparison of free python libraries for data mining and
big data analysis. In: 2019 42nd International Convention on Information and Communication
Technology, Electronics and Microelectronics (MIPRO), pp. 977–982. IEEE (2019)
Stevens B, Colonius T (2020) Enhancement of shock-capturing methods via machine learning.
Theor Comput Fluid Dyn 34(4):483–496
Stoll A, Benner P (2021) Machine learning for material characterization with an application for
predicting mechanical properties. GAMM-Mitteilungen 44(1):e202100003
Straus J, Skogestad S (2017) Variable reduction for surrogate modelling. In: Proceedings of the
foundations of computer-aided process operations. Tucson, AZ, USA, pp 8–12
Ströfer CM, Wu J, Xiao H, Paterson E (2018) Data-driven, physics-based feature extraction from
fluid flow fields using convolutional neural networks. Commun Comput Phys 25(3):625–650
Sun H, Burton HV, Huang H (2021) Machine learning applications for building structural design
and performance assessment: State-of-the-art review. J Build Eng 33:101816
Sun F, Liu Y, Sun H (2021) Physics-informed spline learning for nonlinear dynamics discovery.
arXiv:2105.02368
Surjadi JU, Gao L, Du H, Li X, Xiong X, Fang NX, Lu Y (2019) Mechanical metamaterials and
their engineering applications. Adv Eng Mater 21(3):1800864
78 F. J. Montáns et al.
Williams MO, Kevrekidis IG, Rowley CW (2015) A data-driven approximation of the koopman
operator: Extending dynamic mode decomposition. J Nonlinear Sci 25(6):1307–1346
Wilt JK, Yang C, Gu GX (2020) Accelerating auxetic metamaterial design with deep learning. Adv
Eng Mater 22(5):1901266
Wirtz D, Karajan N, Haasdonk B (2015) Surrogate modeling of multiscale models using kernel
methods. Int J Numer Methods Eng 101(1):1–28
Wood MA, Cusentino MA, Wirth BD, Thompson AP (2019) Data-driven material models for
atomistic simulation. Phys Rev B 99(18):184305
Wu RT, Jahanshahi MR (2020) Data fusion approaches for structural health monitoring and system
identification: past, present, and future. Struct Health Monit 19(2):552–586
Wu Y, Sui Y, Wang G (2017) Vision-based real-time aerial object localization and tracking for uav
sensing system. IEEE Access 5:23969–23978
Wu JL, Xiao H, Paterson E (2018) Physics-informed machine learning approach for augmenting
turbulence models: A comprehensive framework. Phys Rev Fluids 3(7):074602
Wu L, Liu L, Wang Y, Zhai Z, Zhuang H, Krishnaraju D, Wang Q, Jiang H (2020) A machine
learning-based method to design modular metamaterials. Extreme Mech Lett 36:100657
Wu L, Zulueta K, Major Z, Arriaga A, Noels L (2020) Bayesian inference of non-linear multiscale
model parameters accelerated by a deep neural network. Comput Methods Appl Mech Eng
360:112693
Wu X, Park Y, Li A, Huang X, Xiao F, Usmani A (2021) Smart detection of fire source in tunnel
based on the numerical database and artificial intelligence. Fire Technol 57(2):657–682
Wu D, Wei Y, Terpenny J (2018) Surface roughness prediction in additive manufacturing using
machine learning. In: International manufacturing science and engineering conference, vol 51371,
p V003T02A018. American Society of Mechanical Engineers
Xames MD, Torsha FK, Sarwar F (2023) A systematic literature review on recent trends of machine
learning applications in additive manufacturing. J Intell Manuf 34:2529–2555
Xiao S, Hu R, Li Z, Attarian S, Björk KM, Lendasse A (2020) A machine-learning-enhanced
hierarchical multiscale method for bridging from molecular dynamics to continua. Neural Comput
Appl 32(18):14359–14373
Xie T, Grossman JC (2018) Crystal graph convolutional neural networks for an accurate and inter-
pretable prediction of material properties. Phys Rev Lett 120(14):145301
Xie Y, Ebad Sichani M, Padgett JE, DesRoches R (2020) The promise of implementing machine
learning in earthquake engineering: A state-of-the-art review. Earthq Spectra 36(4):1769–1801
Xie X, Bennett J, Saha S, Lu Y, Cao J, Liu WK, Gan Z (2021) Mechanistic data-driven prediction
of as-built mechanical properties in metal additive manufacturing. NPJ Comput Mater 7(1):1–12
Xiong W, Wu L, Alleva F, Droppo J, Huang X, Stolcke A (2018) The Microsoft 2017 conversational
speech recognition system. In: 2018 IEEE international conference on acoustics, speech and signal
processing (ICASSP). IEEE, pp 5934–5938
Xu J, Duraisamy K (2020) Multi-level convolutional autoencoder networks for parametric prediction
of spatio-temporal dynamics. Comput Methods Appl Mech Eng 372:113379
Xu C, Cao BT, Yuan Y, Meschke G (2022) Transfer learning based physics-informed neural networks
for solving inverse problems in tunneling. arXiv:2205.07731
Xu H, Caramanis C, Mannor S (2008) Robust regression and Lasso. Adv Neural Inf Process Syst
21 (NIPS2008)
Yadav D, Salmani S (2019) Deepfake: A survey on facial forgery technique using generative adver-
sarial network. In: 2019 international conference on intelligent computing and control systems
(ICCS). IEEE, pp 852–857
Yagawa G, Okuda H (1996) Neural networks in computational mechanics. Arch Comput Methods
Eng 3(4):435–512
Yamaguchi T, Okuda H (2021) Zooming method for fea using a neural network. Comput Struct
247:106480
1 Machine Learning in Computer Aided Engineering 81
Yan S, Zou X, Ilkhani M, Jones A (2020) An efficient multiscale surrogate modelling framework for
composite materials considering progressive damage based on artificial neural networks. Compos
Part B: Eng 194:108014
Yan C, Vescovini R, Dozio L (2022) A framework based on physics-informed neural networks and
extreme learning for the analysis of composite structures. Comput Struct 265:106761
Yáñez-Márquez C (2020) Toward the bleaching of the black boxes: Minimalist machine learning.
IT Prof 22(4):51–56
Yang C, Kim Y, Ryu S, Gu GX (2020) Prediction of composite microstructure stress-strain curves
using convolutional neural networks. Mater Des 189:108509
Yang L, Zhang D, Karniadakis GE (2020) Physics-informed generative adversarial networks for
stochastic differential equations. SIAM J Sci Comput 42(1):A292–A317
Yang L, Meng X, Karniadakis GE (2021) B-PINNs: Bayesian physics-informed neural networks
for forward and inverse PDE problems with noisy data. J Comput Phys 425:109913
Ye Y, Yang Q, Yang F, Huo Y, Meng S (2020) Digital twin for the structural health management of
reusable spacecraft: A case study. Eng Fract Mech 234:107076
Ye W, Hohl J, Mushongera LT (2022) Prediction of cyclic damage in metallic alloys with crystal
plasticity modeling enhanced by machine learning. Materialia 22:101388
Yoo S, Lee S, Kim S, Hwang KH, Park JH, Kang N (2021) Integrating deep learning into cad/cae
system: generative design and evaluation of 3D conceptual wheel. Struct Multidiscip Optim
64:2725–2747
Yu Y, Hur T, Jung J, Jang IG (2019) Deep learning for determining a near-optimal topological
design without any iteration. Struct Multidiscip Optim 59(3):787–799
Yu Y, Rashidi M, Samali B, Yousefi AM, Wang W (2021) Multi-image-feature-based hierarchical
concrete crack identification framework using optimized svm multi-classifiers and d-s fusion
algorithm for bridge structures. Remote Sens 13(2):240
Yuan FG, Zargar SA, Chen Q, Wang S (2020) Machine learning for structural health monitoring:
challenges and opportunities. Sens Smart Struct Technol Civ, Mech, Aerosp Syst 11379:1137903
Yuan D, Gu C, Wei B, Qin X, Xu W (2022) A high-performance displacement prediction model
of concrete dams integrating signal processing and multiple machine learning techniques. Appl
Math Model 112:436–451
Yucel M, Bekdaş G, Nigdeli SM, Sevgen S (2019) Estimation of optimum tuned mass damper
parameters via machine learning. J Build Eng 26:100847
Yu S, Tack J, Mo S, Kim H, Kim J, Ha JW, Shin J (2022) Generating videos with dynamics-aware
implicit generative adversarial networks. arXiv:2202.10571
Yuvaraj P, Murthy AR, Iyer NR, Sekar S, Samui P (2013) Support vector regression based models
to predict fracture characteristics of high strength and ultra high strength concrete beams. Eng
Fract Mech 98:29–43
Yvonnet J, He QC (2007) The reduced model multiscale method (R3M) for the non-linear homog-
enization of hyperelastic media at finite strains. J Comput Phys 223(1):341–368
Yvonnet J, Monteiro E, He QC (2013) Computational homogenization method and reduced database
model for hyperelastic heterogeneous structures. Int J Multiscale Comput Eng 11(3):201–225
Zadpoor AA (2016) Mechanical meta-materials. Mater Horiz 3(5):371–381
Zehtaban L, Elazhary O, Roller D (2016) A framework for similarity recognition of CAD models.
J Comput Des Eng 3(3):274–285
Zhan Z, Li H (2021) A novel approach based on the elastoplastic fatigue damage and machine
learning models for life prediction of aerospace alloy parts fabricated by additive manufacturing.
Int J Fatigue 145:106089
Zhang Z, Friedrich K (2003) Artificial neural networks applied to polymer composites: a review.
Compos Sci Technol 63(14):2029–2044
82 F. J. Montáns et al.
Zhang J, Sato T, Iai S (2007) Novel support vector regression for structural system identification.
Struct Control Health Monit: Off J Int Assoc Struct Control Monit Eur Assoc Control Struct
14(4):609–626
Zhang Z, Hsu TY, Wei HH, Chen JH (2019) Development of a data-mining technique for regional-
scale evaluation of building seismic vulnerability. Appl Sci 9(7):1502
Zhang D, Guo L, Karniadakis GE (2020) Learning in modal space: Solving time-dependent stochas-
tic PDEs using physics-informed neural networks. SIAM J Sci Comput 42(2):A639–A665
Zhang XL, Michelén-Ströfer C, Xiao H (2020) Regularized ensemble Kalman methods for inverse
problems. J Comput Phys 416:109517
Zhang P, Yin ZY, Jin YF (2021) State-of-the-art review of machine learning applications in consti-
tutive modeling of soils. Arch Comput Methods Eng 28(5):3661–3686
Zhang Z, Liu Y (2021) Robust data-driven discovery of partial differential equations under uncer-
tainties. arXiv:2102.06504
Zhang W, Mehta A, Desai PS, Higgs III CF (2017) Machine learning enabled powder spreading pro-
cess map for metal additive manufacturing (am). In: 2017 international solid freeform fabrication
symposium. University of Texas at Austin
Zhao Y, Akolekar HD, Weatheritt J, Michelassi V, Sandberg RD (2020) RANS turbulence model
development using CFD-driven machine learning. J Comput Phys 411:109413
Zhao P, Liao W, Xue H, Lu X (2022) Intelligent design method for beam and slab of shear wall
structure based on deep learning. J Build Eng 57:104838
Zheng H, Moosavi V, Akbarzadeh M (2020) Machine learning assisted evaluations in structural
design and construction. Autom Constr 119:103346
Zheng X, Zheng P, Zheng L, Zhang Y, Zhang RZ (2020) Multi-channel convolutional neural net-
works for materials properties prediction. Comput Mater Sci 173:109436
Zheng B, Yang J, Liang B, Cheng JC (2020) Inverse design of acoustic metamaterials based on
machine learning using a gauss–bayesian model. J Appl Phys 128(13):134902
Zhu Y, Zabaras N, Koutsourelakis PS, Perdikaris P (2019) Physics-constrained deep learning for
high-dimensional surrogate modeling and uncertainty quantification without labeled data. J Com-
put Phys 394:56–81
Zhuang X, Guo H, Alajlan N, Zhu H, Rabczuk T (2021) Deep autoencoder based energy method
for the bending, vibration, and buckling analysis of Kirchhoff plates with transfer learning. Eur
J Mech-A/Solids 87:104225
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc: Ser
B (Statistical Methodology) 67(2):301–320
zur Jacobsmühlen J, Kleszczynski S, Witt G, Merhof D (2015) Detection of elevated regions in
surface images from laser beam melting processes. In: IECON 2015-41st annual conference of
the IEEE industrial electronics society. IEEE, pp 001270–001275
1 Machine Learning in Computer Aided Engineering 83
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 2
Artificial Neural Networks
2.1 Introduction
To many people, the history of machine learning itself is synonymous with the history
and development of artificial neural networks. There is a good reason for this; the ear-
liest conception of machine learning—or more generally of artificial intelligence—
was inextricably linked to the idea of reproducing the capabilities of the human brain
in some mathematical or computational framework. There were two main motiva-
tions for the programme. In the first place, it was hoped that a mathematical model
of the brain would shed light on its biological functionality, thus leading to a compu-
tational basis for neuroscience. The second motivation was more pragmatic; it was
known from the earliest days of electronic computing that the brain is effortlessly
superior to electronic (sequential/serial von Neumann) computers at certain tasks;
for example, image recognition, speech recognition, coordination, etc. The hope was
that a computer modeled on the architecture of the brain could extend capability
on the sort of tasks indicated. As this volume is very much focused on the solution
of problems in engineering, the discussion of neural networks here will concentrate
on the second motivation; the curious reader is directed elsewhere to find out about
computational neuroscience (Churchland and Sejnowski 2017; Miller 2018).
This chapter is not intended as a comprehensive history or literature survey in any
shape or form; the idea will be to simply convey how Artificial Neural Networks
(ANNs) have developed in the context of engineering applications. It will prove
useful to split that development into three main periods. The layout of the chapter
will be as follows: the next section will outline the “pre-history” of the subject,
and will discuss developments up to the point where a general ANN architecture
emerged, which was versatile and powerful enough to address a range of engineering problems.

2.2 Biological Neural Networks
The foundations for the study of ANNs begin with pioneering work on neurons as
structural constituents of the brain in the 1910s (Ramón y Cajal 1911). This early
work established that the basic processing unit of the brain was the nerve cell or
neuron; the structure and operation of such neurons will be outlined in this section.
In brief, the neuron acts by summing stimuli from connected neurons. If the total
stimulus or activation exceeds a certain threshold, the neuron “fires”, i.e. it generates
a stimulus which is passed on into the network. The essential components of the
neuron are shown in the schematic Fig. 2.1.
The cell body, which contains the cell nucleus, carries out those biochemical
reactions which are necessary for sustained functioning of the neuron. Two main types
of neuron are found in the cortex (the part of the brain associated with the higher
reasoning capabilities); they are distinguished by the shape of the cell body. The
predominant type has a pyramid-shaped body; such cells are referred to as pyramidal
neurons. Most of the remaining nerve cells have star-shaped bodies and are referred
to as stellate neurons. The cell bodies are typically a few microns in diameter. The fine
tendrils surrounding the cell body are the dendrites; they typically branch profusely
in the neighborhood of the cell and extend for a few hundred microns. The nerve fiber
or axon is usually much longer than the dendrites, sometimes extending for up to a
meter. These fibers connect the neurons with distant parts of the nervous system via
the spinal cord; they are not connections within the brain. The axon only branches
at its extremity, where it makes connections with other cells.
The dendrites and axon serve to conduct signals to and from the cell body. In
general, input signals to the cell are conducted along the dendrites, while the cell
output is directed along the axon. Signals propagate along the fibers as electrical
impulses. Connections between neurons, called synapses, are usually made between
axons and dendrites, although they can occur between dendrites, between axons and
between an axon and a cell body.
Synapses operate as follows: the arrival of an electrical nerve impulse at the
end of an axon, say, causes the release of a chemical—a neurotransmitter—into
the synaptic gap (the region of the synapse, typically extending 0.01 microns). The
neurotransmitter then binds itself to specific sites—neuroreceptors—usually in the
dendrites of the target neuron. There are distinct types of neurotransmitters: excitatory
transmitters which trigger the generation of a new electrical impulse at the receptor
site and inhibitory transmitters which act to prevent the generation of new impulses.
Table 2.1 (reproduced from Abeles 1991) gives the statistics for, and typical
properties of, neurons within the cerebral cortex (the term remote sources in the
table refers to sources outside the cortex). One of the first things one sees from the
table is that the network is far from fully connected.
The operation of the neuron in reality is not at all simple; the dynamics are those
of a complex electro-chemical dynamical system; however, in broad terms, the cell
body carries out a sort of summation of all the incoming electrical impulses directed
inward along the dendrites. The effectiveness of the connection between two neurons
is determined by the chemical environment in the synapse, so the elements of the
summation over neuronal connections are individually weighted by the strength of
the connection or synapse. If sufficient energy is directed into a neuron within a
certain interval of time from its neighbors, it will itself discharge an electrical pulse
along the axon. This process can be put into terms which will make the design of an
artificial neuron seem clear. If the value of the summation over incoming signals—the
activation of the neuron—exceeds a certain threshold, the neuron fires and directs
an electrical impulse outward via its axon. Via synapses with the axon, the signal
is communicated to other neurons. If the activation is less than the threshold, the
neuron remains dormant.
A mathematical model of the neuron, exhibiting the essential features of this
restricted view of the biological neuron, was developed as early as 1943 in McCulloch
and Pitts (1943). This model forms the subject of a later discussion; the remainder of
this section is concerned with those properties of the brain which emerge as a result
of its massively parallel nature.
2.2.1 Memory
Information is actually stored in the brain in the network connectivity and the
strengths of the connections or synapses between neurons. In this case, knowledge is
stored as a distributed quantity throughout the entire network. As one might imagine,
the act of retrieving information from such a memory is quite different from that for
an electronic computer. In order to access data on a PC, say, the processor is informed
of the relevant address in memory, and it retrieves the data from that location. In a
neural network, a stimulus is presented (i.e. a number of selected neurons receive
an external input), and the required data are encoded in the subsequent pattern of
neuronal activations. Potentially, recovery of the pattern is dependent on the entire
distribution of connection weights or synaptic strengths.
One advantage of this type of memory retrieval system is that it has a much
greater resistance to damage. If the surface of a PC hard disk is damaged, all data
at the affected locations may be irreversibly corrupted. In a neural network, because the stored knowledge is distributed over the whole pattern of connection strengths, localized damage degrades recall only gradually, rather than destroying the stored information outright.

2.2.2 Learning

The basic mechanism by which such networks are thought to learn is the strengthening of connections between neurons which repeatedly fire together; this idea is usually attributed to Hebb, and can be stated as follows:
When a cell A excites cell B by its axon and when in a repetitive and persistent manner it
participates in the firing of B, a process of growth or of changing metabolism takes place
in one or both cells such that the effectiveness of A in stimulating and impulsing cell B is
increased with respect to all other cells which can have this effect.
It was considered that, if some similar mechanism could be established for compu-
tational models of neural networks, there would be the attractive possibility of “pro-
gramming” these systems simply by presenting them with a sequence of stimulus-
response pairs so that the network could learn the appropriate relationship by rein-
forcing some of its internal connections.
As observed earlier, the massively parallel nature of the brain as a computing facility
led researchers to believe that it could motivate a new and powerful paradigm for
artificial computing, based on artificial neural networks (ANNs). It was assumed that
the most important elements of this paradigm would include (Haykin 1994):
• Nonlinearity
Neurons are highly nonlinear (thresholding) devices and therefore neural networks
will also be nonlinear. Moreover, the nonlinearity can be distributed throughout
the network according to its structure. ANNs are usually designed to be nonlinear.
• Adaptivity and Learning
ANNs can modify their behavior in response to the environment. Once a set of
inputs with desired outputs is presented to the network, the connections between
the neurons self-adjust to produce consistent responses. This training is repeated
for many examples until the network reaches a stable state.
• Evidential Response
A neural network can be designed to provide information not only about the
response but also about the confidence in the response; examples include pattern
recognition analysis where patterns can be classified with a returned confidence
level.
• Fault Tolerance
If a neuron or its connections are damaged, processing of information is impaired.
However, a neural network can exhibit a graceful degradation in performance,
rather than catastrophic failure.
• VLSI Implementation
The parallel nature of neural networks makes them ideally suited for implemen-
tation using microchip technology.
In broad terms, the tasks addressed by ANNs fall into four classes:
1. Autoassociation
A signal is reconstructed from noisy or incomplete data.
2. Regression/Heteroassociation
Input–output mapping, i.e. for given input data, produces a required output char-
acteristic.
3. Classification
Assign input data to given classes.
4. Anomaly Detection
Detect statistical abnormalities in the input data.
The first two tasks are often associated with modeling applications using neural net-
works. The last category includes the problem of novelty detection (Markou and Singh 2003a, b; Tarassenko et al. 1995; Worden 1997). Regression and classification are considered to be two of the main prob-
lems addressable with machine learning, so it is clear that neural networks offer a
general capability. Anomaly or novelty detection establishes a description of nor-
mality using features representing initial or normal conditions for some system or
process, and then tests for abnormality or novelty. This capability led to the first suc-
cessful applications of machine learning in the engineering discipline of structural
health monitoring (SHM), which seeks to detect and locate damage in engineering
structures using measured (sensor) data. The parallel discipline of condition monitor-
ing (detection of damage in machines) has also benefited significantly from novelty
detection. There also exist a number of general data fusion and signal processing
applications within each group of these tasks. In general, ANNs can be used for (Luo
and Unbehauen, 1998): filtering, detection, reconstruction, array signal processing,
system identification, signal compression, and adaptive feature extraction.
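To make the novelty-detection idea above concrete, the following is a minimal numpy sketch (not taken from any of the cited studies): the normal condition is summarized by the mean and covariance of training features, a discordancy measure (the squared Mahalanobis distance) is computed for new feature vectors, and a threshold set from the training data flags abnormality. The data, the 99% threshold, and the feature dimension are all illustrative assumptions.

    import numpy as np

    def fit_normal_condition(X_normal):
        """Summarize the normal-condition feature data by its mean and inverse covariance."""
        mu = X_normal.mean(axis=0)
        cov = np.cov(X_normal, rowvar=False)
        return mu, np.linalg.inv(cov)

    def novelty_index(X, mu, cov_inv):
        """Squared Mahalanobis distance of each feature vector from the normal condition."""
        diff = X - mu
        return np.einsum('ij,jk,ik->i', diff, cov_inv, diff)

    # Illustrative data: 500 normal-condition feature vectors, 5 features each
    rng = np.random.default_rng(0)
    X_normal = rng.normal(size=(500, 5))
    mu, cov_inv = fit_normal_condition(X_normal)

    # Threshold set from the normal data (99th percentile, purely for illustration)
    threshold = np.quantile(novelty_index(X_normal, mu, cov_inv), 0.99)

    X_test = rng.normal(loc=2.0, size=(10, 5))      # shifted data, potentially abnormal
    is_novel = novelty_index(X_test, mu, cov_inv) > threshold
    print(is_novel)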
Before discussing one of the most commonly used variants of ANNs, it is useful to
spend a little time discussing how a single neuron might be modeled in a mathematical
way.
Despite the diversity of ANN paradigms, nearly all of them consist of very similar
building blocks—the artificial neurons. The structure of these neurons has actually
changed very little since the first study in McCulloch and Pitts (1943). The model
neurons receive a set of inputs or stimuli (regarded as emerging from neighboring
neurons) and produce a single output or response. In the foundational McCulloch–
Pitts (MCP) model, both the inputs and outputs are considered to be binary (reflecting
the fact that biological neurons either fire or don’t fire). The MCP neuron structure
is considered to consist of two blocks concerned with summation and activation, as
shown in Fig. 2.2.
The input values $x_i \in \{0, 1\}$ are weighted by factors $w_i$ before they are passed to the
body of the neuron. These weights mimic the effect of different strengths of synaptic
connection in the biological neuron; they can be positive (excitatory) or negative
(inhibitory). The weighted inputs are then summed to produce an activation signal
$z$,

$$z = \sum_{i=1}^{n} w_i x_i \qquad (2.1)$$
This signal is then passed through a nonlinear activation or transfer function; this
corresponds to the threshold in the biological neuron. However, in the mathematical
model, any one of a number of functions could be used for processing. For example,
if the output signal $y$ is described as,

$$y = kz \qquad (2.2)$$

the neuron acts as a simple linear device. In the original MCP model, a hard threshold was adopted instead, so that

$$y = 1, \quad z > \beta \qquad (2.3)$$

$$y = 0, \quad z \leq \beta \qquad (2.4)$$

where $\beta$ is the threshold of the neuron.
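As an illustration, the MCP neuron of Eqs. (2.1)–(2.4) can be written in a few lines of Python; the weights and threshold below are chosen by hand to realize a logical AND of two binary inputs, purely for illustration (nothing here is taken from the original MCP paper).

    import numpy as np

    def mcp_neuron(x, w, beta):
        """McCulloch-Pitts neuron: weighted sum followed by a hard threshold (Eqs. 2.1-2.4)."""
        z = np.dot(w, x)              # activation, Eq. (2.1)
        return 1 if z > beta else 0   # hard threshold, Eqs. (2.3)-(2.4)

    # Hand-chosen weights and threshold realizing a logical AND of two binary inputs
    w, beta = np.array([1.0, 1.0]), 1.5
    for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print(x, mcp_neuron(np.array(x), w, beta))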
Initial studies of the MCP neuron indicated that its computational capabilities
were extremely limited if it were used alone. This was no surprise; a brain with a
single neuron could hardly be expected to accomplish much. The classic example of MCP
limitation is the fact that a single neuron is incapable of learning the XOR function
(two binary inputs and a single binary output). The obvious solution, with reference
to the brain, was to move to networks of neurons; the first serious learning machine
designed on this basis made its appearance in the late 1950s in the form of the Perceptron.
The first serious study of artificial neural networks (as opposed to single neurons) was
carried out by Rosenblatt (1962), who proposed a three-layered network structure—
the Perceptron. This network was not homogeneous; the first layer was an input
layer which simply distributed signals to the processing layers. The first processing
layer was the associative layer (also referred to as the hidden layer because it had
no external connections); the second, which outputs signals to the outside world was
termed the decision layer. In the classic form of the perceptron, only the connections
between the decision and associative nodes were adjustable in strength; those between
the input and associative nodes had to be preset before training took place. One of
the main motivations of the perceptron was that it might take binary (black and white) inputs from an image in order to recognize patterns within that image; for this reason, it
was sometimes considered to be an “artificial retina” as in Fig. 2.3. It was possible to
prove a number of nice theorems for the perceptron, including a proof that if a given
pattern were learnable, a learning rule existed which would converge in finite time.
The problems in adopting the perceptron as a general learning machine proved to be
related to discovering which patterns were actually learnable.
Although perceptrons were initially received with enthusiasm, they were soon
associated with many problems; a completely rigorous investigation into the capabil-
ity of perceptrons was given in Minsky and Papert (1988). In representing a function
with N arguments, the generic perceptron was shown to need $2^N$ elements in the
associative layer; i.e. the networks appeared to grow exponentially in complexity
with the dimension of the problem. It was initially hoped that real problems could
be solved which did not require the maximal structure; however, Minsky and Papert
(1988) showed that many fundamental problems required the full complexity. For
example, it was shown that a perceptron could not determine if a given pattern was
connected (not falling into disjoint components) without the full number of neurons.
One way out of the dilemma appeared to be the possibility of adding and training
other layers of weights in the perceptron; however, it was not possible to establish an
algorithm which would allow this. Only the layer of weights between the outermost
hidden layer and the output layer could be trained using the perceptron learning rule
(which was based on Hebbian learning as described above). With the perceptron
structure as it stood, further progress proved impossible. The effect of Minsky and
Papert’s book was to discourage research in ANNs, and the field lay dormant for
many years.
The period of inactivity ended with the work of Hopfield (1984, 1985) in the
1980s. He considered networks from the point of view of dynamical systems the-
ory; the outputs of the constituent neurons in Hopfield’s networks were regarded as
dynamical states which could evolve in time. The Hopfield network proved capable
of solving a number of practical problems (many of them optimization problems)
and reinvigorated ANN research. An immediate result of the resurgence in activity
was the solution by various groups of the problems associated with Rosenblatt-type
perceptrons. The problem of finding a learning rule for the multi-layer structures
turned out to be the result of using the hard threshold as an activation function in the
individual neurons. The solution proved to hinge on a matter as simple as replacing
the hard threshold with a continuous function, such as the sigmoidal function,
$$y = \frac{1}{1 + e^{-z}} \qquad (2.5)$$

or the hyperbolic tangent,

$$y = \tanh(z) \qquad (2.6)$$
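Both continuous activation functions are trivial to evaluate and, importantly for what follows, they have simple derivatives, which is what a chain-rule (backpropagation) calculation requires. The following lines are a small illustrative sketch, not part of any particular ANN library.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))       # Eq. (2.5)

    def sigmoid_prime(z):
        s = sigmoid(z)
        return s * (1.0 - s)                   # dy/dz for the sigmoid

    def tanh_prime(z):
        return 1.0 - np.tanh(z) ** 2           # dy/dz for y = tanh(z), Eq. (2.6)

    z = np.linspace(-4.0, 4.0, 9)
    print(sigmoid(z), np.tanh(z))              # smooth interpolation between the limits
    print(sigmoid_prime(z), tanh_prime(z))     # derivatives used by backpropagation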
Once the activation function became continuous, the solution to the whole problem
turned out to be the chain rule of partial differentiation (Bishop 2013). The backprop-
agation learning rule was discovered simultaneously by a number of research groups
and was reported in Rumelhart et al. (1986), LeCun (1986). In fact, the learning rule
had been discovered as early as 1974 in the PhD work of Werbos (1974); however,
this work lay undiscovered by the machine learning community, partly because of
the lack of activity in the field as a result of the disappointment in perceptrons. The
backpropagation rule is essentially the gradient-descent algorithm for optimization
and in this sense appears to have had antecedents in the control engineering literature
as early as the 1960s (Necessary conditions for extremal solutions 1963). The exis-
tence of the backpropagation algorithm allowed the development of the Multi-Layer
Perceptron (MLP) algorithm, which has proved to be one of the most commonly
used and influential machine learning paradigms so far discovered. Interest in ANNs
flowered and a number of large programs of research were initiated; this will be
considered here as the start of the first age of ANNs.
Attention can now turn to the MLP structure itself. Signals propagate forward through the network layer by layer; the output $x_i^{(k)}$ of the $i$th neuron in the $k$th layer is formed by weighting and summing the outputs of the previous layer and passing the result through the activation function $f$,

$$x_i^{(k)} = f\big(z_i^{(k)}\big) = f\Big(\sum_j w_{ij}^{(k)} x_j^{(k-1)}\Big) \qquad (2.7)$$
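A direct implementation of the forward pass of Eq. (2.7) for a network with one hidden layer might look as follows; the bias node is handled by appending a constant input of unity to each layer (as described below), the hidden units use tanh activations, the output layer is linear, and the layer sizes and weights are arbitrary illustrative values.

    import numpy as np

    def forward(x, W1, W2):
        """Forward pass of a one-hidden-layer MLP with tanh hidden units (Eq. 2.7)."""
        x = np.append(x, 1.0)                 # bias node on the input layer
        h = np.tanh(W1 @ x)                   # hidden-layer outputs
        h = np.append(h, 1.0)                 # bias node on the hidden layer
        return W2 @ h                         # linear output layer (regression)

    rng = np.random.default_rng(1)
    n_in, n_hidden, n_out = 3, 5, 1
    W1 = rng.normal(scale=0.5, size=(n_hidden, n_in + 1))
    W2 = rng.normal(scale=0.5, size=(n_out, n_hidden + 1))
    print(forward(np.array([0.2, -0.7, 1.0]), W1, W2))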
As discussed above, various choices for the function f are possible (as long as
they are continuous and satisfy some other mild conditions); the hyperbolic tangent
function f (x) = tanh(x) is a good choice. A novel feature of this network is that
the neuron outputs can take any value in the interval [–1, 1]. There are no explicit
threshold values associated with the neurons. One node of the network, the bias node,
is special in that it is connected to all other nodes in the hidden and output layers;
the output of the bias node is held fixed at a value of unity throughout, in order to
allow constant offsets in the excitations z i of each node.
The first stage of using a network is to establish the appropriate values for the
connection weights $w_{ij}$, i.e. the training phase. The type of training usually used is
a form of supervised learning and makes use of a set of network inputs for which
the desired network outputs (often called targets) are known. At each training step, a
set of inputs is passed forward through the network yielding trial outputs which can
be compared with the desired outputs. If the comparison error is considered small
enough, the weights are not adjusted. If however a significant error is obtained, the
error is passed backward through the net and the training algorithm uses the error
to adjust the connection weights so that the error is reduced. The learning algorithm
used is usually referred to as the backpropagation algorithm, as discussed earlier,
and can be summarized as follows. For each presentation of a training set, a measure
J of the network error is evaluated, where the most common choice is the squared
error,

$$J(t) = \frac{1}{2} \sum_{i=1}^{n^{(l)}} \big(y_i(t) - \hat{y}_i(t)\big)^2 \qquad (2.8)$$
and $n^{(l)}$ is the number of output layer nodes. $J$ is implicitly a function of the network parameters, $J = J(w)$, where the $w$ are all the connection weights, ordered into a
vector in some appropriate manner (Nabney 2001). The integer t labels the presen-
tation order of the training sets. After presentation of a training set, the standard
steepest-descent algorithm requires an adjustment of the parameters according to
$$\Delta w_i = -\eta \frac{\partial J}{\partial w_i} = -\eta \nabla_i J \qquad (2.9)$$
where ∇i is the gradient operator in the parameter space. The parameter η determines
how large a step is made in the direction of steepest descent and therefore how quickly
the optimum parameters are obtained. For this reason η is called the learning rate or
learning coefficient. Detailed analysis (Bishop 2013) gives the update rule after the
presentation of a training set,

$$\Delta w_{ij}^{(m)}(t) = \eta\, \delta_i^{(m)}(t)\, x_j^{(m-1)}(t) \qquad (2.10)$$

where $\delta_i^{(m)}$ is the error in the output of the $i$th node in layer $m$. This error is not known a priori but must be constructed from the known errors $\delta_i^{(l)} = y_i - \hat{y}_i$ at the output layer $l$. This is the source of the name backpropagation: the weights must be
adjusted layer by layer, moving backward from the output layer.
There is little guidance in the literature as to what the learning rate η should be; if it
is taken too small, convergence to the correct parameters may take an extremely long
time. However, if η is made large, learning is much more rapid but the parameters
may diverge or oscillate. One way around this problem is to introduce a momentum
term into the update rule so that previous updates persist for a while, i.e.

$$\Delta w_{ij}^{(m)}(t) = \eta\, \delta_i^{(m)}(t)\, x_j^{(m-1)}(t) + \alpha\, \Delta w_{ij}^{(m)}(t-1) \qquad (2.11)$$

where $\alpha$ is termed the momentum coefficient. The effect of this additional term is to damp out high-frequency variations in the backpropagated error signal, and it often allows a larger learning rate to be used without instability.
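The following sketch pulls together Eqs. (2.7)–(2.11): a one-hidden-layer network is trained by batch gradient descent with the momentum-modified update of Eq. (2.11) on the XOR problem mentioned earlier. The learning rate, momentum coefficient, number of hidden units, and number of epochs are illustrative choices only, and convergence is not guaranteed for every random initialization.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def add_bias(A):
        return np.hstack([A, np.ones((len(A), 1))])   # append the constant bias input

    rng = np.random.default_rng(0)
    X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
    y = np.array([[0.], [1.], [1.], [0.]])             # XOR targets

    n_hidden, eta, alpha = 3, 0.5, 0.9                  # illustrative hyperparameters
    W1 = rng.normal(scale=0.5, size=(3, n_hidden))      # 2 inputs + bias row
    W2 = rng.normal(scale=0.5, size=(n_hidden + 1, 1))
    dW1, dW2 = np.zeros_like(W1), np.zeros_like(W2)     # previous updates, for momentum

    for epoch in range(5000):
        Xb = add_bias(X)                                # forward pass (cf. Eq. 2.7)
        H = np.tanh(Xb @ W1)
        Hb = add_bias(H)
        Y = sigmoid(Hb @ W2)

        d_out = (Y - y) * Y * (1.0 - Y)                 # output-layer deltas
        d_hid = (d_out @ W2[:-1].T) * (1.0 - H ** 2)    # backpropagated hidden deltas

        dW2 = -eta * Hb.T @ d_out + alpha * dW2         # momentum update, Eq. (2.11)
        dW1 = -eta * Xb.T @ d_hid + alpha * dW1
        W2 += dW2
        W1 += dW1

    print(Y.ravel().round(2))                           # typically approaches [0, 1, 1, 0]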
Before advocating the use of neural networks in representing functions and processes,
it is important to establish what they are capable of. As described above, ANNs were
all but abandoned as a subject of study following Minsky and Papert’s book (Minsky
and Papert 1988), which showed that perceptrons were incapable of modeling very
simple logical functions. In fact, recent years have seen a number of rigorous results
(Cybenko 1989 is a good example), which show that an MLP network is capable
of approximating any given function with arbitrary accuracy, even if possessed of
only a single hidden layer. Unfortunately, the proofs are not constructive and offer
no guidelines as to the complexity of network required for a given function. A single
hidden layer may be sufficient, but might require many more neurons than if two
hidden layers were used. This observation motivated the development of neural
networks with many more hidden layers—deep networks—and led to the second
age of neural networks.
A further practical issue with gradient-descent training is the problem of local minima. The error function for an MLP network is
an extremely complex object. Given a converged MLP network, there is no way
of establishing if it has arrived at the global minimum. Some attempts to avoid
the problem are centered around the association of a temperature with the learning
schedule. Roughly speaking, at each training cycle the network may randomly be
given enough “energy” to escape from a local minimum. The probable energy is
calculated from a network temperature function which decreases with time. (Recall that molecules of a solid at high temperature can escape the local energy minima which fix their positions in the lattice; as the temperature is lowered slowly, the material settles into a low-energy configuration. This is the idea underlying simulated annealing.)
One of the main problems encountered in practice with ANNs (and in machine
learning in general) is that of generalization or avoidance of overfitting. This is
essentially the problem of rote learning the training data rather than learning the
underlying function of interest. It occurs when there are too many parameters in
the model compared to the number of training points or patterns. Consider a simple
one-dimensional regression problem with 10 points of training data. Suppose that
the true relationship between x and y is linear, but the presence of noise means that
the plot of $y$ against $x$ is far from a straight line. If one were to fit a ninth-order
least-squares polynomial to the data, there would be 10 tunable parameters which
could be set so that the function passes through the data with no error. The problem
here is that the polynomial fit would very probably deviate badly from the true linear
form away from the training points. If one now applied the model to different data,
generated by the same physical process as the training data, the predictions could be
very bad indeed; the model has failed to generalize. At the heart of this problem is the
availability of too many parameters. In the context of a neural network, one is likely
to overfit the data if there are too many weights compared to training data points. The
simplest solution to the problem is to always have enough data; the rule-of-thumb
espoused in Tarassenko (1998) is that one should have 10 training patterns for each
network weight (this rule is not arbitrary and some theoretical motivation is given
in Bishop 2013). In engineering problems, data are often the result of expensive
experimentation and will be in short supply; in this case, the only way to ensure
generalization is to restrict the number of weights in the network. In any case, if one
has fitted a neural network to training data, one should always evaluate performance
on an independent test set in order to assess generalization.
One way of controlling the number of adjustable weights in the network is to
control the number of hidden units; this is accomplished by the use of cross validation
on an independent dataset. In fact, the number of hidden units is another example of a
hyperparameter. One divides the available data into three sets for training, validation,
and testing. For all numbers of hidden units between 1 and some maximum, one trains
a network on the training data and then also computes the error on the validation
data. When the number of hidden units has reached the point where overtraining is
beginning, the error on the validation set will begin to rise, even though the error on
the training set continues to decrease. One then fixes the number of hidden units at
the point where the minimum error on the validation set occurred. Now, the network
has been tuned to both the training and validation sets and the independent testing
set is brought in to assess generalization. The independence of the various datasets
used in training and assessment is critical; for example, if the data in the testing
set are too strongly correlated with the data in the training set, it is likely that a
misleading impression of the generalization capacity of the network will be obtained.
The validation set can be used to determine values for any number of hyperparameters
by cross validation. If data are very scarce, it may be unattractive to use much of
the data for a validation set; in this situation, alternatives like leave-one-out cross
validation can be explored (Bishop 2013).
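As an illustration of the hidden-unit selection procedure, the following sketch uses scikit-learn's MLPRegressor as a convenient stand-in for an MLP trained by backpropagation; the data generator, the range of hidden-unit numbers, and the equal three-way split are all illustrative assumptions.

    import numpy as np
    from sklearn.neural_network import MLPRegressor

    rng = np.random.default_rng(0)
    x = rng.uniform(-1.0, 1.0, size=(300, 1))
    y = np.sin(3.0 * x[:, 0]) + 0.1 * rng.normal(size=300)   # illustrative noisy target

    # Split into training, validation, and testing sets
    x_tr, x_va, x_te = x[:100], x[100:200], x[200:]
    y_tr, y_va, y_te = y[:100], y[100:200], y[200:]

    val_errors = {}
    for n_hidden in range(1, 11):
        net = MLPRegressor(hidden_layer_sizes=(n_hidden,), max_iter=5000, random_state=0)
        net.fit(x_tr, y_tr)
        val_errors[n_hidden] = np.mean((net.predict(x_va) - y_va) ** 2)

    best = min(val_errors, key=val_errors.get)                # minimum validation error
    final = MLPRegressor(hidden_layer_sizes=(best,), max_iter=5000, random_state=0)
    final.fit(x_tr, y_tr)
    test_mse = np.mean((final.predict(x_te) - y_te) ** 2)     # generalization estimate
    print(best, test_mse)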
Another way of thinking about overfitting is in terms of the magnitude of the
weights. In situations where there are as many weights as data points, one often
finds that the weights have very high values and the accurate predictions on the
training set are the result of cancellations between large positive and large negative
weights. This observation suggests that better generalization might be achieved by
controlling the size of the weights. Alternatively, one can regard the issue as being
one of smoothness of the fitted function. A high-order polynomial model as discussed
before can dance rapidly between noisy data points, whereas the true underlying linear system response is much smoother. It can be shown that smaller weights generate a smoother response. The science of controlling the smoothness of the network is
generally called regularization. Having established that smaller weights are desirable,
one simple way to achieve this is to add a term to the neural-network objective/error
function which penalizes large weights, the simplest such term being
$$\alpha \sum_{i=1}^{W} w_i^2 = \alpha \|w\|^2 \qquad (2.12)$$
where the constant α weights the relative importance of the squared error and the
weight norm—it is yet another example of a hyperparameter.
This prescription is commonly called weight-decay regularization. Two other
methods of regularization are early stopping, where one stops training before the
algorithm has tuned the weights to the point of overfitting, and adding noise to the data
during training in order to stop the algorithm learning exact training data points. It has
been shown that these three methods of regularization are theoretically
equivalent (Bishop 1994). One of the most advanced theoretical frameworks for
assessing generalization capacity of models is that of statistical learning theory
(Vapnik 1995).
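In terms of the gradient-descent update of Eq. (2.9), the weight-decay penalty of Eq. (2.12) simply adds a term proportional to each weight to the gradient, so that every step shrinks the weights slightly toward zero. A minimal sketch (with illustrative values of the learning rate and of alpha) is:

    import numpy as np

    def weight_decay_step(w, grad_J, eta, alpha):
        """One gradient-descent step on J(w) + alpha * ||w||^2 (cf. Eqs. 2.9 and 2.12)."""
        return w - eta * (grad_J + 2.0 * alpha * w)

    # Illustration: with grad_J = 0, the penalty alone shrinks the weights geometrically
    w = np.array([5.0, -3.0, 0.5])
    for _ in range(3):
        w = weight_decay_step(w, grad_J=np.zeros_like(w), eta=0.1, alpha=0.5)
        print(w)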
A further, straightforward but important, matter concerns the nonlinear activation
functions of the neurons in the MLP. In order to have a universal approximator, it is
necessary that the hidden units of the MLP adopt a nonlinear activation function. As
discussed above, the hyperbolic tangent and sigmoid functions are the most often used
as activation functions. One also needs to specify a form for the activation functions of
the output layer and it turns out that best practice demands that this is problem specific
(Bishop 2013). For regression problems, one should use linear activation functions.
For classification problems, one should use a nonlinear activation. Furthermore, if
the classifier is trained using the 1 of M rule (Tarassenko 1998), a softmax activation
should be used. In that case, there will be M output neurons, each with an associated
activation $z_i$, $i = 1, \ldots, M$. The output of the $i$th neuron is defined as

$$y_i = \frac{\exp(z_i)}{\sum_{j=1}^{M} \exp(z_j)} \qquad (2.13)$$
This rule forces all the network outputs to sum to unity, a necessary condition
for interpreting the outputs as posterior probabilities of class membership (Bishop
2013).
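A direct evaluation of Eq. (2.13), together with the 1 of M target encoding, is shown below; the subtraction of the maximum activation before exponentiation is an implementation detail to avoid numerical overflow and does not change the result.

    import numpy as np

    def softmax(z):
        """Softmax outputs of Eq. (2.13); the outputs are positive and sum to one."""
        e = np.exp(z - np.max(z))        # shift for numerical stability (result unchanged)
        return e / e.sum()

    def one_of_m(label, M):
        """1 of M target encoding: a one in the position of the true class, zeros elsewhere."""
        t = np.zeros(M)
        t[label] = 1.0
        return t

    z = np.array([2.0, 1.0, -1.0])
    print(softmax(z), softmax(z).sum())   # class "probabilities", summing to one
    print(one_of_m(0, 3))                 # target vector for class 0 of 3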
Although it might appear controversial to a computer scientist, it is arguably rea-
sonable that engineers would regard the MLP as ushering in the “first age” of neural
computing, as it provided an architecture with the power and versatility to address
many engineering problems. The main developments in the MLP following the intro-
duction of backpropagation were not significant changes to the basic structure. In
terms of the learning algorithm, the main development was a transition from gradi-
ent descent to second-order methods for the backpropagation algorithm. One very
significant development, which did not affect the basic architecture, was the adop-
tion of a Bayesian probabilistic viewpoint on MLPs. The Bayesian view allowed a
principled interpretation of the MLP and provided practical benefits like confidence
intervals on predictions. It is probably fair to say that the Bayesian approach was
largely introduced and led by Neal (1996) and Mackay (2003); the latter of these
references must be regarded as a classic of the machine learning literature in general.
Despite its versatility, it would be unfair to say that the MLP was the only archi-
tecture that proved useful for engineering. Another popular feedforward architecture
was the radial-basis function (RBF) neural network (Broomhead and Lowe 1988).
Another popular idea of the time was the merging of neural and fuzzy concepts (Brown
and Harris 1994); one well-known neuro-fuzzy architecture was the ANFIS network;
however, this actually boiled down to an RBF network on closer inspection (Worden
et al. 2011). Another influential paradigm was the self-organizing map (SOM) of
Kohonen (1982); this was an unsupervised algorithm, which proved very powerful
for tasks like clustering and data visualization. A more “principled” version of the
SOM also appeared in the form of the generative topographic map (GTM) (Bishop
et al. 1998). Some other popular ANN architectures are discussed in the previously
mentioned (Worden et al. 2011).
The historical development of the subject will now be briefly interrupted with an
engineering illustration of the application of the MLP.
The case study presented here is a perfect example of the use of MLPs in an engineer-
ing problem. All of the ideas discussed in the previous section were exposed in the
course of the study. The study was made possible by Qinetiq Ltd., who allowed use
of a Gnat trainer aircraft for structural health monitoring (SHM), which was studied
over a period of 3 years. A detailed discussion of the whole program can be found in
the sequence of papers (Worden et al. 2003; Manson et al. 2003a, b). The case study
here relates mainly to Manson et al. (2003b), in which the aim of the exercise was
to locate damage within the starboard wing of the aircraft (Fig. 2.5).
A condition of the experiments was that the wing should not suffer actual damage.
It was therefore decided to simulate damage by sequentially removing a number of
inspection panels on the wing. Unlike real damage, this approach had the distinct
advantage that each damage scenario was reversible and it would be possible to
monitor the repeatability of the measurements. Of the various panels available, nine
were chosen, mainly for their ease of removal and also to cover a range of sizes.
These panels were distributed as shown in Fig. 2.6 (a schematic of the starboard wing of the Gnat aircraft, showing the positions of the inspection panels, not to scale). The areas of the panels ranged from 0.00825 m² to 0.08 m², so their removal actually constituted quite large damage.
Panels P3 and P6 were always likely to give the most difficulty for a damage-detection
procedure because they were by far the smallest. It is important to note that the
fixing conditions of the panels also mattered. Each panel was fixed to the wing by
a number of screws, the numbers varying between 8 and 26. On some of the panels,
screws were missing as a result of damaged threads in the holes. In fact, during the
repeated removal of the plates, further holes were damaged. This damage meant that
there was some variation throughout the test, even for normal condition (all panels
attached). An attempt was made to control this variation by using a torque-controlled
screwdriver.
The fact that there are a discrete number of damage locations meant that the
problem was one of classification, i.e. the neural network should assign the label
of the damaged panel, from 1 to 9. Thus, the classifier was trained using the 1
of M strategy, as discussed in Sect. 2.3.4. Another critical choice for the exercise
was that of the neural network inputs; these would need to be features sensitive to
the nature of the problem, they would need to be carrying information about the
location of damage. The features would need to be derived from sensors, placed on
the wing. It was decided to use accelerometers to record the data, in accordance with
common practice in vibration-based SHM, and in structural dynamics generally. An
electrodynamic shaker placed under the wing was used to excite vibrations, using a
white-noise source. In order to use sensors effectively, the panels were split into three
groups A, B, and C. Each group was allocated a centrally placed reference sensor,
together with three other sensors, each associated with a specific plate. The sensor
layout was as shown in Fig. 2.7.
The accelerometers produced many thousands of samples of data; far too many to
use as direct inputs to the MLP. For reasons discussed in the last section, the number of
weights in the ANN determines how much training data are needed; a very large input
layer means a very large number of weights, and thus an infeasible amount of training
data. Previous experience in vibration-based SHM had shown that transforming the
data into the frequency domain gives a reduction in the dimension of the features
and can concentrate damage-related information. For this reason, transmissibilities
were computed for each panel. The spectra—Fourier transforms of the time data—
were computed; the transmissibility for each panel was then obtained by taking
the ratio of the panel spectrum and the spectrum of the reference sensor of the
group. The transmissibilities are characterized by sharp peaks at certain frequencies,
which move when damage occurs. Following an exhaustive visual inspection, small
ranges of frequencies around the most sensitive peaks were chosen. These ranges still
contained too many spectral lines to give a sensible input dimension, so the ranges
for each panel were converted to scalar values using ideas from outlier analysis, as
discussed in detail in Manson et al. (2003b). This feature extraction and selection
process finally led to nine input features for the neural network. As the output layer
of the network was fixed at nine neurons by the 1 of M approach, only the number
of hidden units remained to be determined.
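A sketch of the transmissibility-based feature extraction described above is given below; it uses a crude single-record spectral estimate (in practice, averaged spectral estimates would be preferred), and the sampling rate, signals, and frequency band are placeholders rather than values from the Gnat tests.

    import numpy as np

    def transmissibility(panel_signal, reference_signal, fs):
        """Magnitude ratio of panel and reference spectra (a crude transmissibility estimate)."""
        freqs = np.fft.rfftfreq(len(panel_signal), d=1.0 / fs)
        panel_spec = np.abs(np.fft.rfft(panel_signal))
        ref_spec = np.abs(np.fft.rfft(reference_signal))
        return freqs, panel_spec / ref_spec

    # Placeholder signals standing in for measured accelerations (white-noise excitation)
    fs = 2048.0
    rng = np.random.default_rng(0)
    panel = rng.normal(size=4096)
    reference = rng.normal(size=4096)

    freqs, T = transmissibility(panel, reference, fs)

    # Keep only a narrow band around a damage-sensitive peak (band edges are placeholders)
    band = (freqs > 200.0) & (freqs < 220.0)
    feature_lines = T[band]
    print(feature_lines.shape)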
A sequence of tests was carried out in which each panel was removed in turn. In
between each group of three panel tests, a “normal” condition (no panels removed)
was taken. The sequence was then repeated, so that it was possible to monitor consis-
tency of results. This program of tests took 3 days and still did not produce copious
amounts of data; 200 patterns were obtained for each damage state and 700 patterns
were obtained for the normal condition. After dividing the data into training, vali-
dation, and testing sets, 66 patterns were available for each panel; the datasets thus
had 594 patterns.
In the first analysis carried out on the data, as discussed in Manson et al. (2003b),
the multi-layer perceptron (MLP) neural network was trained on the feature data from
the visual selection. The results were quite good: a classification error probability of 0.135 was obtained on the test data, with the confusion matrix C as presented
in Table 2.2.
The confusion matrix is a common (and very effective) means of displaying the
results of a classification exercise; it is quite simple in construction. One begins
with an empty matrix; as each testing pattern is presented, the entry at the (i, j)
location is incremented if the pattern is from true class i, but is assigned to class j. A
perfect confusion matrix would thus be diagonal; any off-diagonal entries represent
confusion between classes.
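The construction just described amounts to only a few lines of code; the labels here are assumed to be integers from 0 to the number of classes minus one, and the small example data are invented for illustration.

    import numpy as np

    def confusion_matrix(true_labels, predicted_labels, n_classes):
        """Entry (i, j) counts test patterns of true class i assigned to class j."""
        C = np.zeros((n_classes, n_classes), dtype=int)
        for t, p in zip(true_labels, predicted_labels):
            C[t, p] += 1
        return C

    true = [0, 0, 1, 1, 2, 2, 2]
    pred = [0, 1, 1, 1, 2, 2, 0]
    C = confusion_matrix(true, pred, 3)
    print(C)
    print(np.trace(C) / C.sum())    # overall classification accuracy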
Table 2.2 Confusion matrix for neural network classifier for aircraft wing damage location—testing
set
Prediction 1 2 3 4 5 6 7 8 9
True class 1 62 1 0 0 2 0 0 1 0
True class 2 0 61 0 0 5 0 0 0 0
True class 3 0 1 52 0 7 4 0 2 0
True class 4 1 0 3 60 0 1 0 1 0
True class 5 2 1 0 0 60 3 0 0 0
True class 6 2 0 6 0 8 52 0 0 0
True class 7 1 0 4 0 1 1 58 1 0
True class 8 0 0 0 0 1 1 0 62 2
True class 9 2 1 1 0 0 0 0 15 47
Although the neural network has done a good job here—it is correct 86.5% of
the time—the results still leave a little to be desired. The main aim of SHM is to
advise owners and operators of structures, so that they can make decisions; significant
consequences in terms of safety or cost may occur if those decisions are incorrect.
In many cases, the power of the algorithm used for classification is not the main
issue; in many engineering problems, the really critical choice is of the features. If
very effective features are chosen, even a simple classifier might produce excellent
or perfect results.
In the current case study, the first set of features was chosen on the basis of
(somewhat-involved) engineering judgement. This choice left open the question of
what performance might be obtained if the features were formally optimized. This
exercise was carried out in Worden et al. (2003), with a genetic algorithm (GA) used
for optimization. Even the optimization approach began with a subjective element. As
part of the original study, the authors of Manson et al. (2003b) sorted the features for
classification into groups termed strong, fair, and weak; the criteria are described in
detail in the original paper. With some subtleties to deal with (not discussed here), the
selection process here made an initial restriction to only strong or fair features, which
resulted in 44 candidates. The final selection stage used an integer-coded GA to select
subsets of features which minimized the probability of misclassification. The results
from the GA were excellent: the optimal MLP had a probability of misclassification
of 2.69% on the independent test set; the confusion matrix is given in Table 2.3.
There are a total of 16 misclassifications—eight of them for Panel 6, one of the two
(equal-)smallest panels.
Table 2.3 Confusion matrix for neural network classifier for aircraft wing nine-panel damage-
location problem—testing set
Prediction 1 2 3 4 5 6 7 8 9
True class 1 65 0 0 0 0 0 0 0 1
True class 2 0 64 0 2 0 0 0 0 0
True class 3 0 0 64 1 0 1 0 0 0
True class 4 1 0 0 65 0 0 0 0 0
True class 5 0 0 0 0 66 0 0 0 0
True class 6 1 4 0 1 0 58 0 1 1
True class 7 0 0 0 0 0 0 66 0 0
True class 8 0 0 0 0 1 0 0 65 0
True class 9 0 0 1 0 0 0 0 0 65
Despite the power and versatility of the MLP algorithm and its position as the “go to”
neural network in many engineering applications, the machine learning community
was still struggling with some of the problems mentioned earlier, like image and
speech recognition; problems for which the brain had proved superior. In many ways,
the MLP had distanced neural networks from their biological origins by emphasizing
its properties as a nonparametric function approximator. Most applications used a
single hidden layer, as it had been established that this was sufficient to make the MLP
a universal approximator; the resulting architecture is very shallow. The problem is
that shallow learning is not how the brain works; the brain is densely interconnected,
massively parallel, and not shallow. The solution to the problem seemed obvious;
why not add more hidden layers and create a deep network?
Although the move to deeper networks appeared to be a simple strategy, it proved
to have serious technical problems in terms of training. If the MLP architecture is
simply made deeper, a serious problem soon arises in terms of the backpropagation
algorithm. Backpropagation has problems in deep networks, because the computed
errors, and thus gradients, become unreliable as they are passed further and further
backward; gradients can vanish or diverge (explode).
Because of the backpropagation problem, the first deep neural networks—
sometimes called deep belief networks—used different architectures. One of the first
successful structures was that of Hinton et al. (2006), which consisted of a series of binary
stochastic nodes arranged in layers. Hinton’s solution to training a deep network was
to train it one layer at a time—a “greedy” strategy. Each pair of layers was known
as a Restricted Boltzmann Machine (RBM), and a training procedure already existed
for this; Hinton’s innovation was to train the layers sequentially. Another notable
characteristic of deep belief nets was that learning was made “biological” again; in
fact, Hinton even wrote of “looking into the mind of the network”.
To see how this might work in practice, consider the problem of finding a specific small object within a large, cluttered image; suppose that the object of interest happens to be the small “Robin” figure, visible on top of the coffee machine, to the left of the
image (Fig. 2.9). This is quite a small object; it is also comparable in size to many
of the other objects in the image.
Detecting the object is quite simple; most human observers will find Robin almost immediately. The interesting question is how. What is almost certain is that people
searching for objects do not project the entire image onto the retina and then parse the
full image. This is because people are aware that the pattern of interest is local. The
usual search strategy would be to scan the eyes carefully across the image until Robin
comes into the receptive field, then parse that subimage to see if “Robin present” fires.
This is how a modern architecture like a convolutional neural network (CNN) works.
It will prove useful now to “abstract” the problem, perhaps simplifying in the process.
First, suppose that the image is represented by a 2000 × 2000 binary (B/W) array;
further suppose that the retina is a rectangular array of 100 × 100 photoreactive cells.
It is now possible to formalize the search strategy. It will be assumed that the image
can be indexed in terms of indices (i, j), which specify the position of a pixel in the
image “array”. The search can start by placing the retina “window” at the top-left
corner of the image at the point (1, 1), and activating the cells of the retina, with
inputs from the window. One then moves to (1, 2) on the image and looks again,
and so on until the retina is at (1, 1901). The scan then moves to the second row of
pixels and starts again at (2, 1). This process is repeated until the retina is positioned
at (1901, 1901), at which point, the image has been covered. At each “look”, the
subimages are passed to a recognition “layer” to see if Robin is present.
In terms of the mathematical abstraction of the problem, one can associate a
weight wi j with each cell of the retina. One can then either: (a) try and train these
weights to produce an output 1 or 0 (Robin present or not present) or (b) pass on the
activations from each scan to another recognition layer. In any case, each position
(top-left pixel is (k, l)) of the retina will produce an activation,
z_{kl} = \sum_{i,j} w_{ij} \, x_{k+i-1,\, l+j-1}     (2.14)
that one could downsample by a factor of 100; then the total number of activations
passed on for the recognition layer is 19 × 19 = 361; this represents another huge
reduction. Note that downsampling is not the same as setting a higher stride; one can
use a combination of both, in practice.
Consider the action of the first layer again,
z_{kl} = \sum_{i,j} w_{ij} \, x_{k+i-1,\, l+j-1}     (2.15)
One can compare and contrast this with how a filter works on a time series (where
x_i denotes the input series and y_i the output),

y_i = \sum_j h_j \, x_{i-j}     (2.16)
i.e. the filter coefficients are convolved with the input. It should now be clear that the action of the moving retina in generating the next-layer inputs is a two-dimensional convolution; the structure is thus termed a convolution layer, and the
weights on the retina are revealed to be filter coefficients. Combinations of such
layers (with downsampling) are the basic form of what are called convolutional
neural networks (CNNs).
In terms of terminology: the convolution filter is sometimes called a convolution
kernel. As defined, the action of the kernel on points at the edge of the image is different from its action on central points. If this is considered an issue, or one simply
wants to produce an “image” of the same size after convolution, the input image to a
layer can be padded with zeros, so that each image point is at the center of the kernel.
A simple consequence of this condition is that the kernel window sizes must be odd.
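To make the scanning-retina picture concrete, the following is a minimal NumPy sketch of the two-dimensional convolution of Eqs. (2.14) and (2.15), with optional zero-padding and stride; the function name and default arguments are illustrative only and do not correspond to any particular library's API.

import numpy as np

def conv2d(image, kernel, stride=1, pad=False):
    """Sliding-window 2-D convolution in the sense of Eqs. (2.14)-(2.15).

    image  : 2-D array (e.g. the 2000 x 2000 binary scene)
    kernel : 2-D array of weights w_ij (the "retina"); odd sizes allow centred padding
    stride : how far the window moves between "looks"
    pad    : if True, zero-pad so the output has the same size as the input
    """
    kh, kw = kernel.shape
    if pad:
        image = np.pad(image, ((kh // 2, kh // 2), (kw // 2, kw // 2)))
    ih, iw = image.shape
    oh = (ih - kh) // stride + 1
    ow = (iw - kw) // stride + 1
    out = np.zeros((oh, ow))
    for k in range(oh):                      # top-left row index of the window
        for l in range(ow):                  # top-left column index of the window
            window = image[k * stride:k * stride + kh, l * stride:l * stride + kw]
            out[k, l] = np.sum(kernel * window)   # z_kl = sum_ij w_ij x_{k+i-1, l+j-1}
    return out

# A 100 x 100 "retina" scanned over a 2000 x 2000 image with stride 1 produces the
# 1901 x 1901 array of activations described in the text.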
One might now reasonably ask the question—why multiple layers? Clearly, the
approach depends on the retina/filter/convolution kernel being the right size to detect
Robin, and for the relevant information to survive downsampling. The important
point is that the architecture must learn; on some training set, the weights are tuned,
so that a final recognition layer detects Robin or not. Consider a two-convolution-
layer system; this might learn in such a way that the first layer detects features
that suggest parts of Robin; then the second assembles the information so that the
figure is detected. If this were the case, it means that the two-layer system is doing
implicit feature extraction and selection, with no a priori insight into what the features
might be. More layers allow more possibilities for intrinsic feature extraction. In any
object detection problem, it is well known that the algorithm must be insensitive
to the scale and orientation of the object of interest; CNNs thus offer the attractive
possibility that effective features can be derived without engineering, or domain-
specific, insight. This viewpoint—like many starting from a position of ignorance—
can be quite dangerous; at the very least, it can lead to poor research. At the beginning
of the first age of neural networks, many people hoped that the blind use of MLPs,
using software downloaded from the Internet, would allow the solution of all their
problems. In fact, blind usage of MLP software led to such disappointing results that
the initial neural network “bandwagon” within the engineering community led quite
quickly to a “backlash”. The main problems with the earlier papers were often associated with sparsity of training data and sub-optimal (or just plain wrong) determination of hyperparameters. The second age of neural networks—based on
deep architectures—led very quickly to a “bandwagon-backlash” transition in the
engineering community. As before, the way around the problem was to learn how
the deep architectures worked and to tune their parameters carefully. In order to see
what sort of “parameters” are involved, it is useful to look at a little more history.
vious best MLP error of 0.7%, Simard et al. (2003). These differences may appear
to be small, but the problem is difficult and reductions in the error tend to reflect
the introduction of new ideas or technology. The advance in the BP-CNN architec-
ture was not just a matter of the weight-sharing algorithm, but also exploited GPU
cards to speed up processing and used ideas of max pooling. The point of the latter
advance is that a downsampling “layer” in a CNN doesn’t have to decimate; it can
pass on summary statistics over a window. This is the idea of max pooling; rather
than decimating on subregions, the downsampling “layer” passes on the maximum
activation in the subregion. In 2009, a CNN-MP architecture helped win the first
official competition (although for recurrent neural networks). In 2010, a standard BP
network won back the MNIST record by using GPUs, and set the lowest error at
0.35%; this was really only because the GPUs speeded up computation by a factor
of 50 compared to CPUs.
A significant advance occurred in 2012 when an architecture based on CNN-
MPs, using GPUs and ensemble methods, achieved 0.2% on MNIST. This is human-competitive and represented the first major drop in around 10 years. It is observed
by Schmidhuber (2015) that these advances were not the result of significant new
technology, but relied on clever use of the CNN-MP architecture and availability of
very fast computing options. Another significant tweak which occurred around this
time was a change in the network transfer function; in 2011, the Rectified Linear
Unit (ReLU) was introduced,
f(x) = \begin{cases} x, & x > 0 \\ 0, & x \le 0 \end{cases}     (2.17)
which enables very fast evaluation, and further speeded up computation (Nair and
Hinton 2010). Perhaps the final ingredient in the CNN palette was introduced in 2012, with the idea of dropout, which randomly removes units during training, acts as a regulariser, and improves generalization. An idea of the synthesis of these ingredients
is given in Fig. 2.12. Setting aside the MNIST data, it was observed in Schmidhuber
(2015) that at the time (2014), most feedforward competition-winning deep NNs
were (ensembles of) GPU-MPCNNs.
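As a rough illustration of these ingredients, the short NumPy sketches below show the ReLU of Eq. (2.17), max pooling over non-overlapping windows, and (inverted) dropout; these are simplified stand-ins for the corresponding layers in a deep-learning library, not the implementation of any specific package, and the window size and dropout rate are illustrative defaults.

import numpy as np

def relu(x):
    """Rectified Linear Unit, Eq. (2.17): passes positive activations, zeros the rest."""
    return np.maximum(x, 0.0)

def max_pool2d(activations, window=2):
    """Max pooling: rather than decimating, pass on the maximum activation in each
    (window x window) subregion; trailing rows/columns that do not fill a window are dropped."""
    h, w = activations.shape
    a = activations[:h - h % window, :w - w % window]
    a = a.reshape(h // window, window, w // window, window)
    return a.max(axis=(1, 3))

def dropout(activations, rate=0.5, rng=None):
    """Dropout (training time): randomly zero a fraction `rate` of the units and rescale
    the survivors, which acts as a regulariser and improves generalization."""
    rng = rng or np.random.default_rng()
    mask = rng.random(activations.shape) >= rate
    return activations * mask / (1.0 - rate)

# Example of chaining the pieces on a small random activation map.
pooled = max_pool2d(relu(np.random.default_rng(0).normal(size=(8, 8))))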
As mentioned above, one of the first successful attempts at designing deep net-
works was Hinton’s construction in terms of layers of restricted Boltzmann machines
(RBMs) with a greedy layerwise training algorithm (Hinton et al. 2006). It would
be unfair not to point out that other similar strategies at the time also worked. One
approach built an architecture from a stack of autoencoder networks; this was inter-
esting from the point of view that the autoencoders could be standard MLPs. However,
in terms of recent developments, the rest of this section will be concerned with novel
architectures which can be built from CNNs (among other components); the reader interested in other deep ANN architectures can consult Schmidhuber (2015).
of the cat/dog network might be very similar to the earlier layers of a sheep/goats
network, if appropriate training data were available. This observation leads to an
immediate strategy for transfer learning; one proposes a similar CNN structure in
the target problem as the source problem, but then simply copies across and freezes
the weights for the lower layers. If enough sheep/goat data are available, the later
layers in the target network are then trained for the specific sheep/goat problem. This strategy is referred to as fine-tuning; it has proved so successful that it is readily implemented in modern machine learning packages like TensorFlow.
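As a sketch of how such fine-tuning might look in practice, the snippet below freezes a pre-trained source network and trains only newly added layers; it assumes the Keras API within TensorFlow, and the sheep/goat target task, layer sizes, and input resolution are purely illustrative.

import tensorflow as tf

# Hypothetical target problem: a binary "sheep vs. goat" classifier built on a source
# network pre-trained on ImageNet (here ResNet-50, as used in Chen et al. 2020).
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      pooling="avg", input_shape=(224, 224, 3))
base.trainable = False          # freeze the copied lower layers

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(64, activation="relu"),    # new task-specific layers
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
# model.fit(target_images, target_labels, epochs=...)  # train only the new layers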
Although the fine-tuning approach to transfer has been applied in image and language processing, it does not yet appear to have found a great many successful applications in engineering. One reason for this lack of interest may be the lack of appropriate source problems in given contexts. This situation produces studies like the one in Chen et al. (2020); in this case, the “source” network is ResNet-50. This
network is a 50-layer CNN which has been trained to carry out image recognition on
the ImageNet database, which contains over a million images. The authors of Chen
et al. (2020) chose a “target” problem of classifying fault states of a bearing; in order
to do this, they first had to transform the raw time data into an “image” format, using
the continuous wavelet transform. The fine-tuning approach was then used to train
the last few layers on target data from the bearing database. Although the results
showed a good classification accuracy, there is something a little unsettling about
the disconnect between the source and target problems. A final judgement on fine-tuning as a means of transfer in engineering problems should probably wait for more evidence. Alternative means of transfer learning have been applied in engineering in
a more intuitive manner; one of the most prevalent appears to be domain adaptation,
which has been applied to problems in SHM with success; one example is given in
Gardner et al. (2020).
Fig. 2.13 Layout of a basic GAN; the generator transforms noise z into generated samples G(z) and the discriminator attempts to distinguish between real samples x and generated samples
other hand, the generator network G takes as input a noise vector z from some pre-
defined probability distribution pz (z) and creates a sample G(z) in the feature space
of the dataset. Thereafter, the sample is passed through the discriminator in order to
classify whether it is real or generated. The probability of a generated sample being
real is given by D(G(z)). Forcing the generator to create samples that “fool” the
discriminator into classifying them as real (i.e. minimization of log(1 − D(G(z))))
results in creating samples/images that look real. The objective function V (D, G)
for the training of both networks is given by
\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]     (2.18)
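A toy numerical sketch of the objective in Eq. (2.18) is given below; the "networks" are deliberately trivial (a logistic-regression discriminator and a linear generator) so that the roles of the two players are visible, and all shapes, seeds, and parameter values are illustrative rather than taken from any study discussed here.

import numpy as np

def discriminator(x, theta_d):
    """Toy discriminator D: logistic regression giving P(sample is real).
    A real GAN would use a (deep, often convolutional) network here."""
    p = 1.0 / (1.0 + np.exp(-(x @ theta_d)))
    return np.clip(p, 1e-12, 1.0 - 1e-12)     # avoid log(0) in the value function

def generator(z, theta_g):
    """Toy generator G: a linear map from noise space to the feature space."""
    return z @ theta_g

def value_function(x_real, z, theta_d, theta_g):
    """Monte-Carlo estimate of V(D, G) in Eq. (2.18); in training, the discriminator
    is updated to ascend this quantity and the generator to descend it, alternately."""
    d_real = discriminator(x_real, theta_d)
    d_fake = discriminator(generator(z, theta_g), theta_d)
    return np.mean(np.log(d_real)) + np.mean(np.log(1.0 - d_fake))

# Illustrative shapes only: 2-D "real" samples and 2-D noise vectors.
rng = np.random.default_rng(0)
x_real = rng.normal(loc=3.0, size=(128, 2))
z = rng.normal(size=(128, 2))
theta_d = rng.normal(size=2)
theta_g = rng.normal(size=(2, 2))
print(value_function(x_real, z, theta_d, theta_g))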
Fig. 2.14 Power spectral densities (PSDs: Y1 (ω), Y2 (ω), Y3 (ω)) from three-floor experimental
structure; these correspond to the physical coordinates (y1 (t), y2 (t), y3 (t))
Fig. 2.15 PSDs of three-floor structure (denoted by U in the transformed domains); POD (linear,
black) analysis and GAN (red) analysis
are shown for data from a highly nonlinear three-story model shear building. Figure
2.14 shows the spectra of the physical (directly measured) variables; the system is
very clearly multi-modal. Figure 2.15 (black) shows the result of the optimal linear transform (the proper orthogonal decomposition, POD); the variables are still highly
coupled. In contrast, Fig. 2.15 (red) shows the GAN-transformed variables, which
(according to the specific criterion) appear to be perfectly decoupled.
2.6 Conclusions
Given the nature of this article, lengthy conclusions are not warranted. This paper
has discussed the history of artificial neural networks in the context of their uptake in
engineering problems. In this respect, that history appears to fall naturally into three “ages”: pre-history, the first age, and the second age. The period of pre-history deals
with developments up to the mid-1980s; over this period, ANNs were developed
from their very simplest forms—a single neuron—into a number of versatile archi-
tectures, which could be used to solve engineering problems. Foremost among these
architectures was the multi-layer perceptron (MLP), which proved to be of such great general use that it could be argued to have initiated the “first age”. This stable period
covered a time when the MLP proved to be the “go-to” algorithm for the engineering
community, and—apart from developments in the training algorithm—it required no
serious modification. However, although the “shallow” MLP was excellent in many
engineering problems, it proved deficient in some of the bigger machine learning
problems associated with image and speech processing and natural-language pro-
cessing. The response from the computer science community was to move toward
deeper neural networks; the development of the required technology led to the sec-
ond ANN age—of deep learning. As before, the engineering community appeared to
gather around another general-purpose architecture—the convolutional neural net-
work (CNN)—which also soon stabilized in terms of its structure and learning rules.
The three ages of ANNs are discussed here, along with a case study illustrating an
engineering application of MLPs. Finally, the paper discusses two new paradigms
for learning which can incorporate CNNs, but open up new pathways for engineering
applications.
Acknowledgements The authors would like to acknowledge the support of the UK EPSRC via
the Programme Grants EP/R006768/1 and EP/R004900/1. For the purpose of open access, the
authors have applied a Creative Commons Attribution (CC-BY-ND) licence to any Author Accepted
Manuscript version arising.
References
Abeles M (1991) Corticonics–neural circuits of the cerebral cortex. Cambridge University Press,
Cambridge
Bishop CM (1994) Training with noise is equivalent to Tikhonov regularization. Neural Comput
7:108–116
Bishop CM (2013) Pattern recognition and machine learning. Springer, Berlin
Bishop CM, Svensén M, Williams CKI (1998) Developments of the generative topographic mapping. Neurocomputing 21:203–224
Broomhead DS, Lowe D (1988) Multivariable functional interpolation and adaptive networks.
Complex Syst 2:321–355
Brown M, Harris CJ (1994) Neuro fuzzy adaptive modelling and control. Prentice Hall
Bryson A, Denham W, Dreyfus S (1963) Optimal programming problems with inequality constraints. I: Necessary conditions for extremal solutions. AIAA J 1:25–44
Chen Z, Cen J, Xiong J (2020) Rolling bearing fault diagnosis using time-frequency analysis and
deep transfer convolutional neural network. IEEE Access 8:150248–150261
Churchland PS, Sejnowski TJ (2017) The computational brain. MIT Press
Cybenko G (1989) Approximation by superpositions of sigmoidal functions. Math Control, Signals
Syst 2:303–314
Fukushima K (1979) Neural network model for a mechanism of pattern recognition unaffected by
shift in position—Neocognitron. Trans IECE J62-A:658–665
Gardner PA, Liu X, Worden K (2020) On the application of domain adaptation in structural health
monitoring. Mech Syst Signal Process 138:106550
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y
(2014) Generative adversarial nets. In: Advances in neural information processing systems, pp
2672–2680
Haykin S (1994) Neural networks: a comprehensive foundation. Macmillan College Publishing Company
Hebb DO (1949) The Organisation of Behaviour. Wiley, New York
Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural
Comput 18:1527–1554
Hopfield JJ (1984) Neural networks and physical systems with emergent collective computational abilities. Proc Natl Acad Sci USA 52:2554–2558
Hopfield JJ, Tank DW (1985) Neural computation of decisions in optimization problems. Biol
Cybern 52:141–152
Kohonen T (1982) Self-organized formation of topologically correct feature maps. Biol Cybern
43:59–69
LeCun Y (1986) Learning processes in an asymmetric threshold network. Disordered systems and
biological organisations. Les Houches, France, Springer, pp 233–240
LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD (1989) Backpropagation applied to handwritten zip code recognition. Neural Comput 1:541–551
LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD (1990) Handwritten digit recognition with a back-propagation network. Proceedings of advances in neural information processing systems 2:396–404
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document recognition. Proc IEEE 86:2278–2324
Luo FL, Unbehauen R (1998) Applied neural networks for signal processing. Cambridge University
Press
Mackay DJC (2003) Information theory, inference and learning algorithms. Cambridge University
Press, Cambridge
Manson G, Worden K, Allman DJ (2003) Experimental validation of structural health monitoring
methodology II: novelty detection on an aircraft wing. J Sound Vib 259:363–435
Manson G, Worden K, Allman DJ (2003) Experimental validation of structural health monitoring
methodology III: Damage location on an aircraft wing. J Sound Vib 259:365–385
Markou M, Singh S (2003) Novelty detection: a review. Part I: statistical approaches. Signal Process 83:2481–2497
Markou M, Singh S (2003) Novelty detection: a review. Part II: neural network based approaches. Signal Process 83:2499–2521
McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull
Math Biophys 5:115–133
Miller P (2018) An introductory course in computational neuroscience. MIT Press
Minsky ML, Papert SA (1988) Perceptrons. MIT Press
Møller MF (1993) A scaled conjugate gradient algorithm for fast supervised learning. Neural Netw 6:525–533
Nabney IT (2001) Netlab: algorithms for pattern recognition. Springer, Berlin
Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In: Pro-
ceedings of international conference on machine learning (ICML)
Neal RM (1996) Bayesian learning in neural networks. Springer, Berlin
Press WH, Teukolsky SA, Vetterling WT, Flannery BP (1992) Numerical recipes in C. Cambridge
University Press
Ramón y Cajal S (1911) Histologie du Système Nerveux de l'Homme et des Vertébrés. Maloine, Paris
Rosenblatt F (1962) Principles of neurodynamics. Spartan
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323:533–536
Schmidhuber J (2015) Deep learning in neural networks: An overview. Neural Netw 61:85–117
Simard P, Steinkraus D, Platt J (2003) Best practices for convolutional neural networks applied to
visual document analysis. In: Proceedings of 7th international conference on document analysis
and recognition, pp 958–963
Tarassenko L (1998) A guide to neural computing applications. Arnold
Tarassenko L, Hayton P, Cerneaz Z, Brady M (1995) Novelty detection for the identification of
masses in mammograms. In: Proceedings of the 4th international conference on artificial neural
networks. Cambridge, pp 442–447
Tsialiamanis G, Champneys MD, Wagg DJ, Dervilis N, Worden K (2022) On the application of gen-
erative adversarial networks for nonlinear modal analysis. Mech Syst Signal Process 166:108473
Vapnik V (1995) The nature of statistical learning theory. Springer, Berlin
Werbos PJ (1974) Beyond Regression: New Tools for Prediction and Analysis in the Behavioural
Sciences. Ph.D. thesis, Applied Mathematics. Harvard University
Widrow B, Hoff ME (1960) Adaptive switching circuits. IRE Wescon Conv Rec Part 4:96–104
Worden K (1997) Structural fault detection using a novelty measure. J Sound Vib 201:85–101
Worden K, Hensman JJ, Staszewski WJ (2011) Natural computing for mechanical systems research:
A tutorial overview. Mech Syst Signal Process 25:4–111
Worden K, Manson G, Allman DJ (2003) Experimental validation of structural health monitoring methodology I: novelty detection on a laboratory structure. J Sound Vib 259:323–343
Worden K, Manson G, Hilson G (2003) Genetic optimisation of a neural damage locator. J Sound
Vib 309:529–544
Yang Q, Zhang Y, Dai W, Pan S-J (2020) Transfer learning. Cambridge University Press
Chapter 3
Gaussian Processes
3.1 Introduction
Essentially every machine learning task can be reduced to the learning of a functional map:

f : X → Y
Figure 3.1 shows the progression of the GP in “learning” a function as more data
are introduced into the process. In Fig. 3.1a, the prior model of the GP is shown.
Under the prior, very little is known about the shape of the function which is to be modeled: it has a mean value of zero across the whole input domain (dark orange line) and an equal Gaussian uncertainty on either side of that zero mean (shaded orange area). It can be seen that samples from the GP (blue lines in Fig. 3.1a) are nonlinear functions filling the space of the uncertain region. An important point to note: despite having Gaussian confidence intervals around the mean, the uncertainty in the GP is not uncorrelated white noise; instead, the uncertainty is correlated for the family of functions being represented—hence, smooth function draws from the GP are seen. This is worth remembering when interpreting confidence intervals of GPs, as it can confuse interpretation of results. It is also worth considering
Fig. 3.1 Visual demonstration of the Gaussian process refining its estimate of a function as more
data are added. Mean prediction shown in solid orange with 4σ confidence shown as shaded orange.
For each GP, ten samples are shown in blue. Added data are shown by red dots
that, although a finite number of samples from the function are shown in the figure,
the shaded uncertain region actually represents the infinite set of possible functions
which could be drawn from that GP.
The beauty of the manner in which the GP “learns” the function of interest is seen
in frames Fig. 3.1b–f. Rather than needing to determine a large number of parameters
which govern the shape of the function, its form is discovered nonparametrically.
Remembering the basic assumption in the GP (that outputs will be similar when the corresponding inputs are close), near to where data have been observed (red dots), the GP is confident that the function value will be close to that previously seen. Visually, this manifests as a “pinching” of the uncertain region close to previously observed points; in other words, all the possible functions are forced through the already observed data point. As more data are included in the
process, the model’s confidence increases in regions with higher density of observed
points, e.g. one third inwards from the right-hand edge of the domain in Fig. 3.1. It
is seen in Fig. 3.1b that the functions are forced through the single observed point on the right; by Fig. 3.1f, the region with five close datapoints is very well represented with
low uncertainty. One other important point is revealed in this visual exploration of the GP: were one to continue adding data, the confidence of the model would continue to increase as more and more of the domain is covered
by already observed data. It can be shown that the GP will fit the function to an
arbitrary precision if more data keep being introduced, i.e. the GP has a universal
approximation property (subject to the chosen kernel being a universal kernel, see
Micchelli et al. 2006).
Attention can now turn to the mathematical tool which enables the behavior of the
GP which has been seen already. Formally, the GP can be considered to be an infinite-
dimensional collection of jointly Gaussian distributed variables, any finite subset of
which are also jointly Gaussian distributed.1 For now, restricting the problem to the
case of multiple-input–single-output regression, in other words, solving problems of
the form:
y = f (x) + ε (3.1)
1This is implied by the infinite set being jointly Gaussian but will be useful to understand how the
GP can be implemented.
model. This is commonly referred to as the training dataset; for solving the regression
problem, described in Eq. (3.1), this training set consists of N pairs of input vec-
tors x_i and observed targets y_i, D_tr = {x_i, y_i}_{i=1}^{N}. Collecting all of these observed targets into a vector y = {y_1, . . . , y_N} will give an N-dimensional jointly Gaussian distributed random variable which is an N-dimensional subset of the infinite-
dimensional GP, i.e.
y \sim \mathcal{N}\left( \begin{bmatrix} m(x_1) \\ \vdots \\ m(x_N) \end{bmatrix},\; \begin{bmatrix} k(x_1, x_1) + \sigma_n^2 & \cdots & k(x_1, x_N) \\ \vdots & \ddots & \vdots \\ k(x_N, x_1) & \cdots & k(x_N, x_N) + \sigma_n^2 \end{bmatrix} \right)     (3.2)
where m(x_i) is some parametric mean function and k(x, x′) is the covariance function,
to be discussed shortly.
Being Gaussian distributed, the properties of y are fully defined by its first two
statistical moments, its mean and covariance. The first moment, which is the mean,
can be any parametric function of the input x, for example, linear or polynomial, and
encodes the prior belief of the modeler regarding the gross behavior of the process.
In certain modeling scenarios, this may be a very strong belief, i.e. the average
behavior with respect to the input is well understood, or—more commonly in the
machine learning setting—no prior restriction is imposed and the mean can simply be set to zero (or some other constant value).
Moving to the second moment, the covariance of the data, while it would be
possible to compute this sample-wise, that would not yield a useful methodology.
Instead, the covariance between two observed targets yi and y j is also defined as
a function of the inputs xi and x j associated with those targets. This covariance is
computed pairwise for all combinations within the training dataset Dtr . As part of
the specification of the GP, the modeler must define a function which encodes how
the covariance between two outputs is expressed as a function of the pair of inputs
xi and x j . This function is referred to as the covariance function or synonymously
covariance kernel or simply kernel. There are certain requirements on this function
in order for it to be a valid measure of the covariance; for details, the reader is directed
toward Chap. 4 of Williams and Rasmussen (2006).
It is this covariance function which governs the family of functions which comprise
the prior of the GP. In other words, the choice of the covariance function impacts
which functions live in the infinite set from which samples are drawn. Therefore, the
choice of this function is critical to the performance of the GP. For example, if one
were to choose a linear kernel,
k(x_i, x_j) = \sigma_f^2 \, x_i x_j^{T}     (3.3)
sions passing through zero—not a very expressive model.2 Much more common
is to choose a more flexible and richer, general nonlinear kernel, for example, the
popular squared exponential (sometimes also exponentiated quadratic or, somewhat
confusingly, Gaussian) kernel,
k(x_i, x_j) = \sigma_f^2 \exp\left( -\frac{\| x_i - x_j \|^2}{2\ell^2} \right)     (3.4)
from which continuous, smooth functions are drawn (the squared exponential also
fulfills the requirements to be a universal kernel, as mentioned previously, see Mic-
chelli et al. 2006). The squared exponential is governed by two hyperparameters: a
signal variance σ_f^2 which controls the overall scaling of the function outputs, i.e. the magnitude of f(·); and a length scale ℓ which controls (colloquially) the wiggliness
of the function, i.e. as this value reduces more inflections will be seen in samples
from the GP.
However, the experience of the authors aligns with the suggestion of Stein (1999)
that the Matérn family of kernels is a good general choice for modeling collected data
from physical processes. Unlike the squared exponential, which is infinitely differ-
entiable, the Matérn family is a finite number of times differentiable which appears
to more reasonably match the data encountered in engineering problems. Matérn
kernels are additionally governed by a roughness parameter ν which is usually cho-
sen such that ν = 1/2 + p, where p is zero or a positive integer. The corresponding
functions drawn from the GP for different choices of ν will be p-times differentiable.
Popular choices of Matérn kernels are when p = 0, 1, 2, given by

p = 0:
k_{m1/2}(x_i, x_j) = \sigma_f^2 \exp\left( -\frac{\| x_i - x_j \|}{\ell} \right)     (3.5)

p = 1:
k_{m3/2}(x_i, x_j) = \sigma_f^2 \left( 1 + \frac{\sqrt{3}\, \| x_i - x_j \|}{\ell} \right) \exp\left( -\frac{\sqrt{3}\, \| x_i - x_j \|}{\ell} \right)     (3.6)

p = 2:
k_{m5/2}(x_i, x_j) = \sigma_f^2 \left( 1 + \frac{\sqrt{5}\, \| x_i - x_j \|}{\ell} + \frac{5 \| x_i - x_j \|^2}{3\ell^2} \right) \exp\left( -\frac{\sqrt{5}\, \| x_i - x_j \|}{\ell} \right)     (3.7)
All of the Matérn kernels are also governed by the same two hyperparameters σ_f^2 and ℓ as the squared exponential, where they play an identical role. It can also be shown
2However, this choice of kernel is interesting in the fact that it exactly recovers the solution to
Bayesian linear regression.
that, as ν → ∞, the Matérn kernels converge onto the squared exponential; this is intuitive when one knows that the squared exponential is infinitely differentiable.
The squared exponential and Matérn type kernels are also examples of stationary
kernels, i.e. they are only a function of the distance between the two inputs xi and
x j . This stationarity additionally means that the kernels make the calculation of the
covariance between two data points invariant to translation and rotation in the input
space.
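For concreteness, the squared-exponential and Matérn 3/2 kernels of Eqs. (3.4) and (3.6) can be written in a few lines of NumPy; the function names and default hyperparameter values below are illustrative only.

import numpy as np

def squared_exponential(xi, xj, sigma_f=1.0, ell=1.0):
    """Squared-exponential kernel, Eq. (3.4)."""
    r = np.linalg.norm(xi - xj)
    return sigma_f**2 * np.exp(-r**2 / (2.0 * ell**2))

def matern32(xi, xj, sigma_f=1.0, ell=1.0):
    """Matern kernel with nu = 3/2 (p = 1), Eq. (3.6)."""
    r = np.linalg.norm(xi - xj)
    return sigma_f**2 * (1.0 + np.sqrt(3.0) * r / ell) * np.exp(-np.sqrt(3.0) * r / ell)

def gram_matrix(X1, X2, kernel, **hyp):
    """Pairwise covariance matrix K_{X1,X2}; X1 and X2 are (N, D) and (M, D) arrays."""
    return np.array([[kernel(x1, x2, **hyp) for x2 in X2] for x1 in X1])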
It should be stressed, however, that determining an “optimal” choice of kernel
is a very difficult task3 (if possible at all) and certainly not one for which general
statements can be made—appropriate choice of covariance function should be the
key concern of the practitioner, before embarking on any modeling with a GP.
Thus far, only the relationship between the data within the training set Dtr has been
discussed. On its own, this description is not particularly helpful, as one will hope to
use the GP to make predictions for an input x_* with, as yet, unobserved output y_*, or a set of inputs X_* with outputs y_*. At this point, it will become clear why the definition
of the GP as an infinite-dimensional collection of jointly Gaussian distributed random
variables will be useful. Consider the graphical model of the GP which is shown in
Fig. 3.2. There is a collection of latent function values f_n for n = 1, . . . , ∞ (although
only a finite number N are assessed at any given time, the distribution itself is of
infinite dimension) which are all jointly Gaussian distributed as represented by the
fully connected undirected graph in the center of the image. The covariance of this
high-dimensional Gaussian is then computed by means of the covariance function
being evaluated pairwise for all inputs. For convenience, all the individual inputs xi
3Although some have attempted to perform automatic selection for specific problems, e.g.
Abdessalem et al. (2017).
in D_tr, which are a collection of N row vectors, can be stacked vertically to form a training input matrix X ∈ R^{N×D} for N, D-dimensional inputs. Similarly, the N scalar target observations y_i can be collected into a vector y ∈ R^N. Each observation in y is conditionally independent, depending only on the corresponding f_i; this arises from the fact that p(y_i | f_i) = N(f_i, σ_n^2), i.e. y_i is an observation of f_i corrupted by i.i.d. white Gaussian noise. Given this notational simplification, Eq. (3.2) can be written as

y \sim \mathcal{N}\left( m(X),\; K_{X,X} + \sigma_n^2 I \right)     (3.8)

where K_{a,b} is the pairwise covariance matrix between the functions at inputs a and b, e.g. K_{X,X} is the auto-covariance between all the inputs in X, and I is the identity matrix.
To make predictions, the vector of targets being considered can be expanded to include some, as yet, unobserved targets y_*, which increases the dimension of the distribution by N_*—the number of prediction points—while leaving the full joint distribution Gaussian.

\begin{bmatrix} y \\ y_* \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m(X) \\ m(X_*) \end{bmatrix},\; \begin{bmatrix} K_{X,X} + \sigma_n^2 I & K_{X,X_*} \\ K_{X_*,X} & K_{X_*,X_*} + \sigma_n^2 I \end{bmatrix} \right)     (3.9)
The predictive distribution over the unobserved targets then follows from the standard Gaussian conditioning identities,

y_* | X_*, D_tr \sim \mathcal{N}\left( \mathbb{E}[y_*],\; \mathbb{V}[y_*] \right)     (3.10a)

\mathbb{E}[y_*] = m(X_*) + K_{X_*,X} \left( K_{X,X} + \sigma_n^2 I \right)^{-1} \left( y - m(X) \right)     (3.10b)

\mathbb{V}[y_*] = K_{X_*,X_*} + \sigma_n^2 I - K_{X_*,X} \left( K_{X,X} + \sigma_n^2 I \right)^{-1} K_{X,X_*}     (3.10c)

Similarly, it is possible to evaluate the distribution over only the latent function values f_* so as to investigate the uncertainty in the learnt function itself, not including the measurement noise on the data.
f_* | X_*, D_tr \sim \mathcal{N}\left( \mathbb{E}[f_*],\; \mathbb{V}[f_*] \right)     (3.11a)

\mathbb{E}[f_*] = m(X_*) + K_{X_*,X} \left( K_{X,X} + \sigma_n^2 I \right)^{-1} \left( y - m(X) \right)     (3.11b)

\mathbb{V}[f_*] = K_{X_*,X_*} - K_{X_*,X} \left( K_{X,X} + \sigma_n^2 I \right)^{-1} K_{X,X_*}     (3.11c)
It can be seen that the expectation remains unchanged (as may be expected, given that the measurement noise has zero mean) and the only change in the variance is to
no longer include the term related to the measurement noise σ_n^2 I. The combination of the covariance function, as defined earlier, and the predictive equations in either Eq. (3.10) or Eq. (3.11) is the sum total of the mathematical machinery which enables the GP to act as a regression tool and which allows uncertainty quantification
over those predictions. Before continuing, it is worth pausing for a moment to con-
sider quite how remarkable it is that such a powerful approach can be written down
in four lines of mathematics.
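Indeed, a minimal (and deliberately unoptimized) NumPy sketch of that machinery, the zero-mean posterior of Eqs. (3.11a)-(3.11c) with a squared-exponential kernel, fits comfortably in a few lines; all names, hyperparameter values, and the toy data below are illustrative.

import numpy as np

def sq_exp(A, B, sigma_f=1.0, ell=0.5):
    """Squared-exponential covariance between two sets of 1-D inputs, Eq. (3.4)."""
    d = A[:, None] - B[None, :]
    return sigma_f**2 * np.exp(-d**2 / (2.0 * ell**2))

def gp_predict(X, y, X_star, sigma_n=0.1, **hyp):
    """Zero-mean GP posterior over the latent function, Eqs. (3.11a)-(3.11c)."""
    K = sq_exp(X, X, **hyp) + sigma_n**2 * np.eye(len(X))
    K_s = sq_exp(X_star, X, **hyp)
    K_ss = sq_exp(X_star, X_star, **hyp)
    alpha = np.linalg.solve(K, y)                  # (K + sigma_n^2 I)^{-1} (y - m(X)), with m = 0
    mean = K_s @ alpha                             # E[f_*]
    cov = K_ss - K_s @ np.linalg.solve(K, K_s.T)   # V[f_*]
    return mean, cov

# Toy usage: noisy observations of a sine function.
rng = np.random.default_rng(1)
X = rng.uniform(0, 5, size=10)
y = np.sin(X) + 0.1 * rng.normal(size=10)
X_star = np.linspace(0, 5, 100)
mean, cov = gp_predict(X, y, X_star)
std = np.sqrt(np.diag(cov))                        # pointwise predictive uncertainty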
While it has been shown that the core machinery of the GP is surprisingly simple,
there are a couple of points which should be expanded upon further to understand
how it may be used practically. Thus far, the GP has been described as nonparametric,
i.e. it does not have a large set of parameters to tune as part of the learning process
like, for example, a neural network. Instead, the training data directly inform the
shape of the prediction through their inclusion in the predictive equations, Eq. (3.10).
However, one may notice that these predictive equations require evaluation of the
covariance kernel and, for the majority of covariance kernels, there exist a number of
hyperparameters which control the gross behavior of the family of functions which
that kernel embeds. These hyperparameters were alluded to when the kernels were first introduced above.
Although, in theory, the choice of these hyperparameters is far less important to the
quality of fit of the GP than, for example, learning the weights in a neural network, in
practice tuning of the hyperparameters can significantly affect the performance of the
method. If one were taking a “fully-Bayesian” view of the problem, it would be pos-
sible to place prior distributions over these parameters, perform posterior inference
on the basis of Dtr , and marginalize out the effect of the hyperparameters. However,
this approach (in almost all situations) proves to be computationally infeasible or
inadvisable given the level of insight it provides. Despite this challenge, it is worth
noting that some works in the literature do attempt the task of marginalizing the GP
hyperparameters including (Garnett et al. 2013; Svensson et al. 2015; Gardner et al.
2021; Simpson et al. 2021). It should also be noted that marginalization of the GP
hyperparameters is more commonly seen in settings with small training datasets, i.e.
N is small, such as in Bayesian optimization4 problems, e.g. Hernández-Lobato et al.
(2014).
4 Unfortunately, in the name of brevity it is not possible to cover Bayesian optimization with the rigor
it deserves in this short chapter. In short, for an optimization problem seeking a global minimum
(or maximum) to a cost function, which can be evaluated pointwise, a GP is fit to samples from
that cost function to approximate its shape. The GP fit to the cost surface is then used, usually in
combination with the inherent measure of uncertainty, to select a new point at which to probe the
cost function. In this way, the GP can be used to estimate the location of the current global minimum
and also to guide future evaluations of the cost function. Bayesian optimization excels in settings
where the cost surface can be modeled well by a GP and where evaluations of the cost function
are computationally expensive, prohibiting classical metaheuristic optimization approaches. The concept of Bayesian optimization has arguably been conceptually present since the 1970s but a good introductory text can be found in Snoek et al. (2012).
In common usage of the GP, rather than attempting to marginalize the hyperpa-
rameters, these quantities are instead optimized. The form of the optimization most
commonly employed is a Type-II optimization problem where the cost function is
related to the marginal likelihood of the process. Conceptually, the hyperparameters
are optimized to ensure the GP fits the training data the best it can, taking into account
all the possible values of f (the latent function). Formally, the optimization problem
being solved is
\hat{\phi} = \arg\max_{\phi} \int p(y \mid f, X, \phi)\, p(f \mid X, \phi)\, df
           = \arg\min_{\phi} \left\{ -\log p(y \mid X, \phi) \right\}     (3.12)
\frac{dJ}{d\log\phi} = \frac{dJ}{d\phi} \cdot \frac{d\phi}{d\log\phi} = \frac{dJ}{d\phi} \cdot \left( \frac{d\log\phi}{d\phi} \right)^{-1} = \frac{dJ}{d\phi} \cdot \phi     (3.14)
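As a sketch of this Type-II scheme under stated assumptions (a zero mean, a squared-exponential kernel, and the standard Gaussian form of the negative log marginal likelihood, the quantity minimized in Eq. (3.12)), the hyperparameters can be optimized on a log scale, in the sense of Eq. (3.14), with an off-the-shelf optimizer; the snippet below uses SciPy, and the starting values and toy data are purely illustrative.

import numpy as np
from scipy.optimize import minimize

def sq_exp(A, B, sigma_f, ell):
    d = A[:, None] - B[None, :]
    return sigma_f**2 * np.exp(-d**2 / (2.0 * ell**2))

def neg_log_marginal_likelihood(log_phi, X, y):
    """Negative log marginal likelihood -log p(y | X, phi) for a zero-mean GP; the
    hyperparameters phi = (sigma_f, ell, sigma_n) are passed on a log scale so that the
    optimizer works in an unconstrained space, as discussed around Eq. (3.14)."""
    sigma_f, ell, sigma_n = np.exp(log_phi)
    K = sq_exp(X, X, sigma_f, ell) + sigma_n**2 * np.eye(len(X))
    L = np.linalg.cholesky(K)                        # stable factorization of K
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha                          # data-fit term
            + np.sum(np.log(np.diag(L)))             # 0.5 * log|K|
            + 0.5 * len(X) * np.log(2.0 * np.pi))    # normalizing constant

# Toy usage with data similar to the earlier prediction sketch.
rng = np.random.default_rng(1)
X = rng.uniform(0, 5, size=30)
y = np.sin(X) + 0.1 * rng.normal(size=30)
res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]), args=(X, y))
sigma_f_hat, ell_hat, sigma_n_hat = np.exp(res.x)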
Fig. 3.3 Comparison of GP posteriors with length scales altered relative to the original sample
(black solid line). As before, posterior mean shown in thick orange, confidence intervals in shaded
orange, observed data as red dots, and samples from the posterior in blue
the possibility of mistaking noise on the measured values for genuine features of the
function and, as can be seen, it will not be useful for making predictions even a small
distance from the observed training set.
The alternative case, shown in Fig. 3.3b, is where the length scale is selected to
be deliberately too long. Here, the GP is seen to estimate a set of functions which
smooth through all of the data, ignoring local variations in the target values. This
effect is also clearly undesirable. It is seen that the target function, in black, now lies
well outside the confidence intervals of the posterior. In practice, these overconfident
predictions mean that the GP will return tight predictive posteriors even far from the
observed training data (when the length scale is very long), however, these confident
predictions may well be confidently wrong if the length scale is overestimated.
It is easy to imagine that there is some “Goldilocks” scenario which balances the ability of short-length-scale solutions to model small variations in the target function against the long-length-scale solution’s ability to make useful predictions away from the training data. It is this optimal value which hopefully lies at the minimum of the negative log marginal likelihood. When working with the GP, however, it is
worth considering that the cost function is itself conditioned only on the available
training data. This conditioning means that it can be possible to fall into the “wrong”
minimum which chooses a length scale appropriate for the observed data, but not for
the physical process one seeks to model, or the cost of the short- versus long-length
scale solution can be very similar, making it difficult to make an informed choice. If
there is not a clear localized minimum in the cost surface of the negative log marginal
likelihood, it would be the recommendation of the authors to return to marginalizing
the hyperparameters in a Bayesian manner.
Throughout this section, the approach of the GP has been compared, in an informal
manner, to the more popular neural network methods seen in the literature. It has been
argued that the strengths of the GP are its low number of (hyper) parameters which
need optimization and its ability to automatically quantify uncertainty. It would be
remiss to at this point not highlight that it is certainly possible to perform uncertainty
quantification within the neural network family of models, the most direct comparison
of course being that of Bayesian neural networks. Significant work on developing neural networks in which posterior distributions are placed over the weights was carried out as early as the 1990s, for example, MacKay (1992) and the thesis of Neal (1995). Neal (1995) showed how MCMC approaches could be applied to the weights of a neural network and give very strong results, albeit at significantly increased computational load compared to the deterministic counterparts. It is also in that work where an equivalency between Bayesian neural networks, under certain assumptions, and the Gaussian process itself was established (Neal 1996). This is a trend which has continued; for example, Williams (1996) presents a covariance function for a GP equivalent to a neural network with
sigmoid activations and Gaussian weight distributions. More recently, these classical
results for shallow neural networks have been extended for deep and convolutional
architectures, e.g. Matthews et al. (2018), Garriga-Alonso et al. (2018). One criticism
leveled at the Bayesian approach to neural networks is the significant increase in computational load; however (as will be seen with GPs in the following section), methodologies to reduce this load are available, e.g. Blundell et al. (2015). It is
fair to say that interest continues in Bayesian perspectives on neural networks and
their relationship to GP models, evidence of this being the prominence of work
on Bayesian deep learning as seen by workshops dedicated to this topic at many
of the most prominent machine learning conferences. In the context of the work
presented here, it is worth considering that this equivalency may well mean that the
choice between the GP and a (Bayesian) neural network is not so much a choice as
a difference of perspective.
Thus far, the GP has been presented as a compact and elegant tool which can power-
fully solve regression problems with little user intervention. This section will consider
some cases where the simple version of the GP seen already may not be sufficient
and, where possible, highlight solutions from the literature.
One of the major criticisms of the GP is that it can scale poorly with increasing numbers of training data. During the optimization of the hyperparameters, it is necessary to invert the matrix K_{X,X} + σ_n^2 I multiple times, at a cost of O(N^3) per iteration. Each prediction can then be made at a cost of O(N) for the expectation and O(N^2)
for the predictive covariance. There has, therefore, been significant attention given
to producing approximations to the full GP which reduce this computational over-
head, often referred to as sparse GPs. The aim of any sparse GP method is to retain
as much of the expressive power as possible from the full GP while reducing the
computational load to the minimum possible amount.
The, perhaps obvious, first port of call would be simply to use fewer data: to attempt to choose a set of M points from D_tr which effectively summarize the function to be learnt while ignoring redundant data. In other words, a regular GP is learnt using only M of the N available training points. This approach, sometimes referred to as the subset of data, can clearly be seen as the simplest but least useful approximation that could be made. While taking only a subset of data will reduce the complexity of the process from O(N^3) to O(M^3), the information contained in the discarded data is completely lost. In engineering, where often these data have been
collected at great expense, this method is a very wasteful approach. Furthermore, if
the function to be modeled is sufficiently intricate that it requires a large number of
data in Dtr to be adequately modeled—which can happen especially when the input
is higher dimensional—then the quality of the fit will be drastically reduced through
the subset of data approximation. A final challenge in this most basic method is how
to choose the appropriate subset of data to include in the selected M, which can be
a difficult task. One should seek to include the data which most effectively capture
same spectral content. Rahimi and Recht (2008) show how a randomly selected
Fourier basis serves as a good approach for approximating Gaussian process solu-
tions, although this method is arguably superseded by Hensman et al. (2017) which
combines ideas of Fourier features with the variational interpretation of sparse GPs
from VFE to develop the Variational Fourier Features (VFF) method. It is shown in
Hensman et al. (2017) how the VFF performs very well on large datasets including
the benchmark US flight-delay prediction task (Hensman et al. 2013). Moving away from the Fourier basis, it should also be noted that the work of Solin and Särkkä (2020) has shown that an eigendecomposition of the Laplace operator associated with the
kernel also provides an effective basis for approximating GPs with a more efficient
parametric model.
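A sketch of the random-Fourier-feature idea, in the spirit of Rahimi and Recht (2008), is given below for the squared-exponential kernel; the feature count and data are illustrative, and this is a simplification rather than a reproduction of their algorithm.

import numpy as np

def random_fourier_features(X, n_features=100, sigma_f=1.0, ell=1.0, rng=None):
    """Random Fourier feature map phi(x) such that phi(X) @ phi(X).T approximates the
    squared-exponential Gram matrix; X is an (N, D) array of inputs."""
    rng = rng or np.random.default_rng()
    N, D = X.shape
    W = rng.normal(scale=1.0 / ell, size=(D, n_features))    # spectral frequencies
    b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)        # random phases
    return sigma_f * np.sqrt(2.0 / n_features) * np.cos(X @ W + b)

# The O(N^3) GP solve can then be replaced by Bayesian linear regression on phi(X),
# which costs O(N * n_features^2) instead.
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(200, 1))
Phi = random_fourier_features(X, n_features=300, rng=rng)
K_approx = Phi @ Phi.T      # approximates sigma_f^2 * exp(-|x - x'|^2 / (2 ell^2))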
Finally, it should be noted that it is also possible to combine subsets of the training
data in various ways, see Cao and Fleet (2014), Deisenroth and Ng (2015), Nguyen-
Tuong et al. (2009), Gramacy and Apley (2015) for a non-exhaustive list of examples.
The final point to note is the approach for distributed parallel training (hyperparameter
learning) of GP models achieved through stochastic variational inference (Hoffman
et al. 2013) and shown in the context of GPs in Hensman et al. (2013). This work
shows how the concepts of variational sparse GPs can give further wall-time speed
up through mini-batched parallel evaluation and updates to the hyperparameters.
Remembering the initial setup of the GP as solving a regression problem in the form
y_i = f(x_i) + ε,    ε \sim \mathcal{N}(0, \sigma_n^2),
it is seen that the regression targets are corrupted with i.i.d. Gaussian white noise.
This ensures that the joint distribution assessed in Eq. (3.10), for prediction, and the
marginal in Eq. (3.13), for hyperparameter learning, are available in closed form as
Gaussian densities. The advantage of this approach is that the computation of the
conditional which enables prediction and the marginal for learning of the hyper-
parameters can be done exactly. However, in some ways, this model also imposes
strong assumptions about the data-generating process which one wishes to model.
In this short subsection, it will be discussed how the assumption regarding the noise
process in the GP can in some ways be relaxed by considering certain approximate
approaches. It should be noted that in almost all cases, it will no longer be possible
to compute the marginal likelihood or the predictive posterior in closed form when
the assumption of white Gaussian i.i.d. noise is relaxed.
The first case in which the i.i.d. Gaussian noise assumption may not be sufficient
is that of heteroscedastic Gaussian noise. This is still an additive Gaussian noise
process but one where the variance of the measurement noise is now a function of
the inputs to the model, i.e. ε(x_i) ∼ N(0, g(x_i)), where g(x_i) is some function of the inputs x_i. A practical example of this noise process may be a situation where a
sensor is working across different temperature ranges and the signal-to-noise ratio
varies as a function of the temperature. In this case, it would be insufficient to assume
a constant noise variance σ_n^2 across the entire range of measurements; instead, the
variance of that noise process will also be a function of temperature, alongside the
function of interest f (·).
How then could the GP machinery be extended to manage this scenario? In the
simplest possible case, perhaps the function g(·) is known a priori to the modeler,
and it is known that the noise is independent across the input space. In this setting,
it is relatively simple to include the effect of the known changing noise in the usual
equations of the GP; the required modification is to replace the σ_n^2 I term in Eq. (3.8) with another noise matrix which is also diagonal and whose (diagonal) elements are defined by g(x_i). In terms of prediction, to recover the latent function values distribution p(f_* | X_*, D_tr), no further modification is needed in Eq. (3.11) than replacing the σ_n^2 I term in the training data. However, if the predictive distribution over y_* is needed as in Eq. (3.10), then one must also modify the final term in Eq. (3.10c) such that it is a diagonal matrix with the diagonal elements equalling g(X_*).
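A minimal sketch of this modification, assuming the noise function g(·) is known, is given below; it simply swaps the constant σ_n^2 I for diag(g(X)) in the earlier zero-mean GP prediction sketch, and the example noise model is purely illustrative.

import numpy as np

def sq_exp(A, B, sigma_f=1.0, ell=0.5):
    d = A[:, None] - B[None, :]
    return sigma_f**2 * np.exp(-d**2 / (2.0 * ell**2))

def hetero_gp_predict(X, y, X_star, g, **hyp):
    """Zero-mean GP posterior over the latent function with a *known* heteroscedastic
    noise function g(.): the constant sigma_n^2 I of Eq. (3.8) becomes diag(g(X))."""
    K = sq_exp(X, X, **hyp) + np.diag(g(X))        # input-dependent noise on the training data
    K_s = sq_exp(X_star, X, **hyp)
    K_ss = sq_exp(X_star, X_star, **hyp)
    mean = K_s @ np.linalg.solve(K, y)
    cov_f = K_ss - K_s @ np.linalg.solve(K, K_s.T)
    cov_y = cov_f + np.diag(g(X_star))             # add diag(g(X_*)) to predict noisy targets y_*
    return mean, cov_f, cov_y

# Illustrative noise model: variance grows with the input (e.g. with temperature).
g = lambda x: 0.01 + 0.05 * x**2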
The far more common scenario, however, is one where the function which governs
the noise process g(·) is unknown, as well as the underlying latent function to be
learnt f (·). Clearly, this setting will pose more of a challenge when attempting to
model the data. Given that this chapter has so far espoused the benefits of the GP
as a tool for learning arbitrary functions, it may come as no surprise that one can
suggest placing a GP prior over g(·) as well as f (·) and learning the two functions
simultaneously. Assuming that the noise on each observation is independent, i.e. the
covariance matrix of ε is diagonal, with the diagonal elements being drawn from
some function g(·), these elements must be positive (one cannot have a negative or
zero variance). If one were to place a GP prior over g(·) in the same manner as has
been shown for f(·), this requirement will not necessarily be enforced; in fact, if a zero mean is chosen, a priori many function values will be implausible. To rectify
this difficulty, the common solution is to impose a link function ĝ(·) = ψ(g(·)), where ψ(·) maps elements from the whole real line onto the set of positive real numbers. For instance, one could choose ψ(g(·)) = g(·)^2, although by far the most common choice is ψ(g(·)) = exp{g(·)}; in other words, the GP prior is placed over the log values of the noise variance. It is also commonplace to choose a GP prior which has a constant mean function m_g(X) = log σ_n^2, where log σ_n^2 is the log of the average noise variance across the whole function and may be learnt alongside
the hyperparameters. A heteroscedastic GP model which learns both the unknown
function and the unknown noise variance can then be constructed as
y = f(X) + ε
f(X) \sim \mathcal{GP}\left( m(X),\; k_f(x, x') \right)
ε \sim \mathcal{N}\left( 0,\; \exp\{g(X)\} \right)     (3.15)
g(X) \sim \mathcal{GP}\left( C,\; k_g(x, x') \right)
While theoretically the above is a logical and powerful construction, it has one
major drawback. It is not possible to compute, in closed form, an exact solution
which would allow either learning of the hyperparameters or predictions at new
inputs. As such, it may seem that the model is doomed; however, it is possible to construct very good approximate solutions which are then computable. A variational approximation to the posterior of the process is the usual solution to this model; this was the solution proposed in Lázaro-Gredilla and Titsias (2011). The variational
approximation allows for computation of a lower bound on the unavailable marginal
likelihood which can be used to optimize the hyperparameters of the process and a
similar variational approximation allows for computation of the mean and variance
of the predictive posterior, again in closed form. The mean of the process being
learnt is found to be identical to the predictive mean in the standard GP (under the
variational approximation), then the variance is approximated by means of the law
of total variance. This predictive variance is given by the variance of the prediction
of f(X_*) plus the exponential of the predictive mean for g(X_*) summed with half the predictive variance of g(X_*) (Sect. 4, Lázaro-Gredilla and Titsias 2011). In doing
this, a Gaussian approximation of the predictive posterior is recovered but it should
be noted that the true predictive posterior of the model in Eq. (3.15) will not be
Gaussian.
The heteroscedastic model, however, is not the only non-Gaussian likelihood
model which may be of interest. The likelihood of the model can reflect other known
beliefs about the form of the generating process. For example, if modeling count data, one may choose the generative model to be a Poisson process; for measurements bounded at zero, the likelihood may be gamma, log-normal, or exponential; for measurements bounded above and below, a beta likelihood may be a sensible choice. For
all of these cases, the GP in its standard form is an inappropriate model. For these
intractable models, the “gold standard” approximate method (as in most intractable
Bayesian analysis) is the Markov Chain Monte Carlo (MCMC) approach (Gelman
et al. 2013), more recently the Hamiltonian Monte Carlo (HMC) (Neal et al. 2011;
Betancourt 2017) has been preferred as a more efficient choice. Options to use these
sampling-based approaches are available in major GP libraries including GPy (GPy
2012) and GPflow (Matthews et al. 2017). While sampling-based approximations to intractable GP likelihoods are appealing in a number of ways, for example, they have some guarantees that they will converge to the true posterior distributions, they have one major drawback: they impose a significant computational burden. To alleviate this load, the
chained GP approach of Saul et al. (2016) gives a flexible framework for variational
inference over models composed of multiple GPs and non-Gaussian likelihoods. In
many practical settings, the variational approximation will sufficiently capture the information required to determine a good model.
Unlike the framework available for neural networks, where additional nodes can
simply be added to the output layer to enable multiple targets in regression, the formulation of the GP shown in Eq. (3.1) does not obviously lend itself to a multiple-target scenario. There are scenarios where it could be useful, however, to consider
multiple target data sequences within a combined regression. For example, if it is
known that two processes are correlated but one part of the input space can only be
observed for one process, then it may be desirable to use the correlation between the
processes to effectively expand the training dataset for both functions. To simplify the
presentation, a setting with two functions to be learnt will be shown, but it should be
clear how this would extend to multiple processes. Consider the following scenario,
y_1 = f_1(X_1) + ε_1,    y_2 = f_2(X_2) + ε_2     (3.16)
where both ε_1 and ε_2 are i.i.d. zero-mean white Gaussian noise with variances σ_{n,1}^2 and σ_{n,2}^2, respectively; y_1 and y_2 are the two vectors of target variables from the functions f_1(·) and f_2(·), with sets of inputs X_1 and X_2. The core assumption that
will enable multiple output predictions from the GP will be that the two functions
are also both drawn from an infinite-dimensional Gaussian distribution which can
model the correlation between them, as well as the correlation in the input space. It
is possible then to consider the prior joint distribution of the two sets of targets,
\begin{bmatrix} y_1 \\ y_2 \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} m_1(X_1) \\ m_2(X_2) \end{bmatrix},\; \begin{bmatrix} k_{1,1}(X_1, X_1) + \sigma_{n,1}^2 I & k_{1,2}(X_1, X_2) \\ k_{2,1}(X_2, X_1) & k_{2,2}(X_2, X_2) + \sigma_{n,2}^2 I \end{bmatrix} \right)     (3.17)
For each function, there is an associated mean function m_j(X_j) which can be defined as before as any parametric mean. Considering each function in isolation, e.g. for the first function,

y_1 \sim \mathcal{N}\left( m_1(X_1),\; k_{1,1}(X_1, X_1) + \sigma_{n,1}^2 I \right)     (3.18)
when working with this model has been that it can produce very good results and it
was possible to replicate the work shown in Boyle and Frean (2004). However, the
learning procedure for the hyperparameters, despite following the same optimization
of the marginal likelihood as previously shown, could become very unstable. There
are combinations of the hyperparameters to be learnt which cause the model to be
numerically unstable and for which the gradients are particularly difficult to recover.
This difficulty is coupled with an inherent non-identifiability problem in the model: given that each signal is a summation of an independent and a linked GP, vastly different qualities of prediction can result in testing, depending on the relative hyperparameters of the different kernels.
Perhaps for this reason, the more popular approaches for multiple output GPs
are those which have been classically inherited from the kriging community, where
the multiple output problem has been referred to as co-kriging. Several popular
approaches are developed from the assumption that the observed processes can be
described as the linear combination of a number of latent Gaussian processes.
A helpful review which covers the major approaches to multiple-output GPs can be
found in Alvarez et al. (2012). The interested reader is directed toward that reference
for further details on the approaches mentioned here.
The case study considered here concerns the modeling of wind turbine power curve functions from data collected in a wind farm. In both cases, the data have been
normalized to remove information which may identify the exact turbines from which
the data have been collected.
An example set of power curve data is shown in Fig. 3.4. It follows a very char-
acteristic shape with three key sections. Close to zero wind speed (–1 normalized
wind speed), no power is produced. This trend continues until the speed is sufficient
to begin producing power, at which point the curve enters a roughly cubic segment
where the power output increases with wind speed. This segment can be found to
be roughly cubic if one considers the theoretical maximum work done by the fluid
flowing over the wind turbine; in practice, inefficiencies mean that this is a
highly simplified assumption. At some point, the wind speed increases to the level
where the turbine produces its rated power, i.e. maximum output. In Fig. 3.4, this
point occurs close to zero normalized wind speed. For the remaining increase in wind
speed, no increase in power output is observed as it is limited by the control system of
the turbine. This limiting of the power output creates a saturation-type nonlinearity
at higher wind speed as the output of the turbine is limited to not exceed the rated
power (normalized power output of 1).
The key references for this section are now briefly reviewed. Papatheou et al.
(2017) presented the GP as a useful damage-sensitive feature for performance moni-
toring purposes. In that work, the GP is used to allow extreme function theory analysis
of the power curve, which is shown to highlight abnormal turbines. The approach
used in this initial work was based upon the most basic form of the GP as first intro-
duced in this chapter; in Rogers et al. (2020) attention was paid to the noise process
of the power-curve function. The data were modeled with a heteroscedastic form of
the GP which modeled the changing variance over the power curve; additionally, a
committee machine was employed to handle the three distinct regions seen in the
power curve. Finally, in Mclean et al. (2022), further work has been completed on
the modeling of the power curve; this time, a GP model with a beta likelihood is
employed to capture the bounded nature of the function space. This arises from the
physical constraints that the turbine cannot produce negative power nor can it exceed
its maximum rated power.
The key results of these works will now be briefly reviewed in the context of how
the GP was applied in increasingly sophisticated ways to improve understanding of
the power curve. Since the dataset used in these studies contains a relatively large
number of data points—on the order of 15,000 observations—sparse formulations of
the GP are used in all cases. Initially, taking the usual approach, assuming Gaussian
likelihood and homoscedastic noise (i.e. additive, isotropic, independent Gaussian
noise) and the VFE methodology for computing a sparse GP, the results shown in
Fig. 3.5 can be recovered.
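As a concrete illustration of this kind of fit, the following is a minimal sketch of a VFE sparse GP (Titsias 2009) fitted to power-curve-like data using GPflow, one of the libraries mentioned earlier in this chapter. The synthetic data, kernel, and number of inducing points are illustrative stand-ins rather than the settings used in the cited studies (which also employed a hyperbolic tangent mean function, omitted here for brevity), and the API usage assumes GPflow 2.

```python
import numpy as np
import gpflow

# Stand-in for the normalized SCADA data: x is wind speed, y is power
x = np.random.uniform(-1.0, 1.5, size=(15000, 1))
y = 0.5 * np.tanh(2.0 * x) + 0.5 + 0.05 * np.random.randn(15000, 1)

# Inducing points spread over the input space
Z = np.linspace(x.min(), x.max(), 50).reshape(-1, 1)

# SGPR implements the Titsias (2009) variational free energy (VFE) bound
model = gpflow.models.SGPR(
    data=(x, y),
    kernel=gpflow.kernels.SquaredExponential(),
    inducing_variable=Z,
)
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)

x_test = np.linspace(-1.2, 1.7, 200).reshape(-1, 1)
mean, var = model.predict_y(x_test)    # predictive mean and variance (including noise)
```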
Beginning with the quality of the mean prediction of the model, it can be seen
that the expected output of the learnt functions is very good. Visually, it matches
the data closely and, using a normalized mean squared error (NMSE) metric, a
score of 0.81% is achieved, which indicates an excellent quality of fit. The GP used here
employs a squared exponential kernel and a hyperbolic tangent mean to promote the
general behavior of the function which can be seen a priori. For reference, this score
Fig. 3.5 Prediction of the power curve using a VFE approach (Rogers et al. 2020)
is a notable improvement over the 3.94% for a piecewise linear fit or 1.50% for a
hyperbolic tangent, which are competing parametric models commonly employed. At
this stage, the user has a consideration to make: if only concerned with the pointwise
prediction of the model, it may be reasonable to stop, satisfied with this good-quality
mean fit. However, returning to Fig. 3.5, it can be seen that the variance in the end
segments is greatly overestimated, i.e. the GP is too uncertain, and in the central
segment, which may be of most interest, the variance is underestimated. This poses
a problem since one key advantage of the GP is its quantification of uncertainty;
given that the form of the uncertainty is not well captured in the proposed model, it
is reasonable to try to improve upon this result.
To improve the quality of this fit, the immediate option is to make a change in the
choice of noise model. Since the variance in the data is seen to vary over the input
space, a heteroscedastic approach is taken. The form of this model was to place a
GP prior with a constant mean over the log of the noise variance; in this way, the
variation in the noise could be learnt across the input space in the same manner as
the function is learnt. In Fig. 3.6, the predictions from this updated model are shown.
Immediately, it is clear that the variance toward the ends of the power curve is much
better captured. No longer is there a high degree of uncertainty in these regions
(below cut-in and above rated power) where it is expected that the power output
should be very consistent, i.e. low noise. The model has also managed to capture
the increasing variance with wind speed in the middle segment of the curve and
the data predominantly lie within the 3σ confidence bounds as would be expected.
Fig. 3.6 Prediction of the power curve using the heteroscedastic approach (Rogers et al. 2020)
The mean performance of the model is also maintained. This updated approach,
therefore, is of value when the user is more interested in capturing the variability in
the process with greater accuracy. This information regarding the uncertainty will be
particularly valuable if this model were to be incorporated into financial forecasting
which wished to consider future risk, or if the model is used within a probabilistic
monitoring and/or decision framework.
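The variational treatment used in the cited work (Lázaro-Gredilla and Titsias 2011) is more involved than can be shown here; purely to illustrate the modeling idea, the sketch below conditions a standard GP but with an input-dependent noise variance exp(g(x)), where a hand-specified g(x) stands in for the posterior mean of a second GP placed over the log noise variance. All functions and numbers are illustrative assumptions.

```python
import numpy as np

def rbf(XA, XB, l=0.3, s2=1.0):
    return s2 * np.exp(-0.5 * (XA[:, None] - XB[None, :]) ** 2 / l ** 2)

# Hand-specified stand-in for the posterior mean of a GP over the log noise variance
def log_noise_var(x):
    return -5.0 + 3.0 * np.clip(x, 0.0, 0.6)      # noise grows through the middle segment

X = np.sort(np.random.uniform(-1, 1, 300))
y = np.tanh(3 * X) + np.exp(0.5 * log_noise_var(X)) * np.random.randn(300)

# Heteroscedastic GP: the usual sigma_n^2 I is replaced by diag(r(X)) with r(x) = exp(g(x))
R = np.diag(np.exp(log_noise_var(X)))
K = rbf(X, X) + R
Xs = np.linspace(-1, 1, 200)
Ks = rbf(Xs, X)
mean = Ks @ np.linalg.solve(K, y)
var_f = np.diag(rbf(Xs, Xs)) - np.einsum('ij,ji->i', Ks, np.linalg.solve(K, Ks.T))
var_y = var_f + np.exp(log_noise_var(Xs))          # input-dependent noise re-added for predictive bounds
```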
A final point on the model shown in Fig. 3.6: since the power curve was character-
ized by the three distinct segments of different regimes, a committee machine model
was used to blend together different GPs for each of these segments. In Fig. 3.7, the
three separate heteroscedastic models and the combined model are all shown, noting
that these are shown with the mean function removed. The mechanism of the com-
mittee machine is to combine the predictions of the models weighted automatically
by their confidence. This approach is very attractive as it allows various GP models
to be combined without requiring manual tuning. The automatic quantification of
the uncertainty allows for the combination to happen in quite a natural way, as with
humans one may pay attention to the expert on the committee who is most confident
in their knowledge. A word of warning applies to this modeling approach: as is probably
prudent when dealing with humans, the user should be wary of overconfidence, which
may skew this combination; but again, the combination of the heteroscedastic process
and Bayesian Occam’s Razor provides some degree of automatic protection against
this issue.
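One simple way to realize such a confidence-weighted blend is precision weighting, in the spirit of product-of-experts style combinations (Cao and Fleet 2014; Deisenroth and Ng 2015); the exact weighting used in the cited work may differ, so the sketch below is only indicative.

```python
import numpy as np

def combine_experts(means, variances):
    """Blend GP expert predictions by their confidence (precision weighting).

    means, variances: arrays of shape (n_experts, n_test) from the individual GPs.
    Returns the combined mean and variance at each test point.
    """
    precisions = 1.0 / variances
    var_comb = 1.0 / precisions.sum(axis=0)
    mean_comb = var_comb * (precisions * means).sum(axis=0)
    return mean_comb, var_comb

# e.g. for three regime-specific GPs evaluated on a common test grid:
# mean, var = combine_experts(np.stack([m1, m2, m3]), np.stack([v1, v2, v3]))
```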
Fig. 3.7 Individual heteroscedastic GPs inside the committee machine prediction (Rogers et al.
2020)
In the figures above, it has been shown how a heteroscedastic noise model can
better capture the variation in the noise variance across the input space. However,
inspecting Fig. 3.6 closer it can be seen that around the transition into rated power,
near zero normalized wind speed, there is a significant probability mass predicting
power outputs greater than the rated power. This mismatch poses a problem as it
may mean samples from the GP of the power curve, even in the heteroscedastic case,
over-predict the potential power output of the turbine beyond what is physically
possible. This limitation on the model motivates the modeler to consider if further
modification to the process is needed. In the setting of the power curve, it is natural to
consider that the likelihood of the data, which is the generative process, is bounded,
i.e. samples from the distribution which encodes prior belief should not exceed the
rated power.
It may seem that the issue of the bounded space (no prediction beyond rated power)
is limited to the quantification of the uncertainty from the GP. However, considering
the results of the standard GP,5 as shown in Fig. 3.8, it is clearly seen that despite
the mean prediction giving a very low NMSE score, it can still violate the physical
bounds of the process. Clearly, this is a concern when attempting to establish trust
in a data-driven approach.
Consider then how the GP model can be constructed in such a way that the physical
bounds of the process are obeyed. The modeler chooses a likelihood function which
restricts the generating process to some other domain, for example, a gamma, log-
normal, or exponential distribution to enforce positive-only target values or, in the
5In the referenced Fig. 3.8, a zero mean function is used as opposed to the hyperbolic tangent in
Figs. 3.5 and 3.6.
case of the power curve, the beta distribution provides an obvious choice of likelihood
as it is bounded at zero and one. This choice of likelihood imposes restrictions on the
family of functions which can be generated by the GP, in fact, choosing a different
likelihood in some ways stops the model being a GP at all, as the likelihood is
no longer Gaussian. One should be wary that this change does not inadvertently
over-restrict the possible functions learnt by the process. For example, the choice
of beta likelihood is still imposing that the distribution conditioned on a particular
input is unimodal. However, considering the data in Fig. 3.4, it can be seen that this
assumption is reasonable.
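As a minimal sketch of this choice, the following fits a GP with a beta observation model by variational inference in GPflow. It is a simplified, single-latent stand-in: the model developed in the cited work is richer (as discussed next), and the data, kernel, and API usage (GPflow 2 with its Beta likelihood class) are assumptions for illustration only.

```python
import numpy as np
import gpflow

# Subsampled stand-in data; targets must lie strictly inside (0, 1) for a beta likelihood
x = np.random.uniform(-1.0, 1.5, size=(500, 1))
y = np.clip(0.5 * np.tanh(2.0 * x) + 0.5 + 0.03 * np.random.randn(500, 1), 1e-3, 1 - 1e-3)

model = gpflow.models.VGP(
    data=(x, y),
    kernel=gpflow.kernels.SquaredExponential(),
    likelihood=gpflow.likelihoods.Beta(),      # bounded observation model on (0, 1)
)
gpflow.optimizers.Scipy().minimize(model.training_loss, model.trainable_variables)

x_test = np.linspace(-1.2, 1.7, 100).reshape(-1, 1)
mean, var = model.predict_y(x_test)            # predictive moments respect the physical bounds
```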
To develop the power curve model further, Mclean et al. (2022) adopt a model
where a multiple-output GP jointly models the α and β parameters of the generat-
ing beta distribution. The ability for variation between the GP outputs which feed
the beta distribution naturally also allows heteroscedastic behavior in the posterior.
Considering the effect of this approach, in Fig. 3.9, it can be seen that the mean pre-
diction under the beta likelihood is comparable to the regular GP but samples from
the posterior never leave the physically plausible space. Zooming in on the transition
to rated power in Fig. 3.10, the effect is especially pronounced. In this region, where
the regular GP and heteroscedastic model have previously struggled, the beta GP
captures the transition well without violating the upper bound on the target variable.
In Fig. 3.11, a plot showing the posterior likelihood of the process over the space
is presented. The likelihood toward the two tails concentrates close to zero and one
in normalized power at low and high wind speeds, respectively. Since the likelihoods
become very high, the colormap has been truncated for readability. It is also noted that
the concept of confidence intervals which are usually used to represent the predictive
uncertainty of the GP is not necessarily helpful in this instance as the distributions are
highly skewed, hence the use of a darkening colormap with increasing likelihood.
What is seen in these results is the expected high confidence in the tails of the
process and more diffuse posterior in the transitional segment. It is argued that, for
the additional modeling effort as compared to the GP model shown in Fig. 3.5, the
reward is improved quantification of the posterior predictive uncertainty in the power
curve which will aid any future processes relying on these uncertainty predictions.
3.4 Conclusions
The GP has been presented as a powerful and conceptually simple tool for performing
regression tasks. In its most basic form, the required calculations to produce mean
and variance predictions span only a few lines of maths or code. However, various
extensions which increase the expressive power of the model have been presented. In
the case study, it has been shown how consideration and careful application of these
extensions can greatly increase the value of the GP as a tool for modeling engineering
functions. As is the case in most applications of machine learning within engineering,
it is important to consider if the model can be set up in such a manner that it conforms
to the expected/desired behavior that is known a priori. In particular, it has been
highlighted that, while valuable, quantification of uncertainty must receive careful
attention to ensure its utility in further applications.
Fig. 3.11 Posterior of the beta GP for the power curve, noting that the colormap is clipped at the
likelihood equal to 30 (Mclean et al. 2022)
References
Abdessalem AB, Dervilis N, Wagg DJ, Worden K (2017) Automatic kernel selection for Gaussian
process regression with approximate Bayesian computation and sequential Monte Carlo.
Frontiers Built Environ 3:52
Alvarez MA, Rosasco L, Lawrence ND et al (2012) Kernels for vector-valued functions: a review.
Found Trends® Mach Learn 4(3):195–266
Andrianakis I, Challenor PG (2012) The effect of the nugget on Gaussian process emulators of
computer models. Comput Stat Data Anal 56(12):4215–4228
Betancourt M (2017) A conceptual introduction to Hamiltonian Monte Carlo. arXiv:1701.02434
Blundell C, Cornebise J, Kavukcuoglu K, Wierstra D (2015) Weight uncertainty in neural network.
In: International conference on machine learning. PMLR, pp 1613–1622
Boyle P, Frean M (2004) Dependent Gaussian processes. In: Saul L, Weiss Y, Bottou L (eds)
Advances in neural information processing systems, vol 17. MIT Press
Bui TD, Yan J, Turner RE (2017) A unifying framework for Gaussian process pseudo-point approx-
imations using power expectation propagation. J Mach Learn Res 18(1):3649–3720
Cao Y, Fleet DJ (2014) Generalized product of experts for automatic and principled fusion of
Gaussian process predictions. arXiv:1410.7827
Deisenroth M, Ng JW (2015) Distributed Gaussian processes. In: International conference on
machine learning. PMLR, pp 1481–1490
Gardner P, Rogers T, Lord C, Barthorpe R (2021) Learning model discrepancy: a Gaussian process
and sampling-based approach. Mech Syst Signal Process 152:107381
Garnett R, Osborne MA, Hennig P (2013) Active learning of linear embeddings for Gaussian
processes. arXiv:1310.6740
Garriga-Alonso A, Rasmussen CE, Aitchison L (2018) Deep convolutional networks as shallow
Gaussian processes. arXiv:1808.05587
Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB (2013) Bayesian data analysis
GPy (2012) GPy: A Gaussian process framework in python. http://github.com/SheffieldML/GPy
Gramacy RB, Apley DW (2015) Local Gaussian process approximation for large computer exper-
iments. J Comput Graph Stat 24(2):561–578
Hensman J, Durrande N, Solin A (2017) Variational Fourier features for Gaussian processes. J Mach
Learn Res 18(1):5537–5588
Hensman J, Fusi N, Lawrence ND (2013) Gaussian processes for big data. In: Proceedings of the
twenty-ninth conference on uncertainty in artificial intelligence, UAI’13, Arlington, Virginia,
USA, AUAI Press, pp 282–290
Hernández-Lobato JM, Hoffman MW, Ghahramani Z (2014) Predictive entropy search for efficient
global optimization of black-box functions. Adv Neural Inf Process Syst 27
Hoffman MD, Blei DM, Wang C, Paisley J (2013) Stochastic variational inference. J Mach Learn
Res
Kingma DP, Ba J (2015) Adam: A method for stochastic optimization. In: Bengio Y, LeCun Y (eds)
3rd international conference on learning representations, ICLR 2015, San Diego, CA, USA,
Conference Track Proceedings
Lázaro-Gredilla M, Titsias MK (2011) Variational heteroscedastic Gaussian process regression. In:
ICML
MacKay DJ (1992) A practical Bayesian framework for backpropagation networks. Neural Comput
4(3):448–472
Matthews AGDG, Hron J, Rowland M, Turner RE, Ghahramani Z (2018) Gaussian process
behaviour in wide deep neural networks. In: International conference on learning representa-
tions
Matthews AGDG, van der Wilk M, Nickson T, Fujii K, Boukouvalas A, León-Villagrá P, Ghahramani
Z, Hensman J (2017) GPflow: A Gaussian process library using TensorFlow. J Mach Learn Res
18(40):1–6
Mclean J, Jones M, O’Connell B, Maguire A, Rogers T (2022) Physically meaningful uncertainty
quantification in probabilistic wind turbine power curve models as a damage sensitive feature.
arXiv:2209.15579
Micchelli CA, Xu Y, Zhang H (2006) Universal kernels. J Mach Learn Res 7(12)
Minka T (2004) Power expectation propagation. Technical report, Microsoft
Research, Cambridge
Minka TP (2001) A family of algorithms for approximate Bayesian inference. Ph.D. thesis, Mas-
sachusetts Institute of Technology
Neal RM (1995) Bayesian Learning for Neural Networks. Ph.D. thesis, University of Toronto
Neal RM (1996) Priors for infinite networks. In: Bayesian learning for neural networks. Springer,
Berlin, pp 29–53
Neal RM et al (2011) MCMC using Hamiltonian dynamics. Handb Markov Chain Monte Carlo
2(11):2
Nguyen-Tuong D, Seeger M, Peters J (2009) Model learning with local Gaussian process regression.
Adv Robot 23(15):2015–2034
Papatheou E, Dervilis N, Maguire AE, Campos C, Antoniadou I, Worden K (2017) Performance
monitoring of a wind turbine using extreme function theory. Renew Energy 113:1490–1502
Press WH, Teukolsky SA, Vetterling WT, Flannery BP et al (1992) Numerical recipes in C
Quinonero-Candela J, Rasmussen CE (2005) A unifying view of sparse approximate Gaussian
process regression. J Mach Learn Res 6:1939–1959
Rahimi A, Recht B (2008) Weighted sums of random kitchen sinks: Replacing minimization with
randomization in learning. Adv Neural Inf Process Syst 21
Rogers T, Gardner P, Dervilis N, Worden K, Maguire A, Papatheou E, Cross E (2020) Probabilistic
modelling of wind turbine power curves with application of heteroscedastic Gaussian process
regression. Renew Energy 148:1124–1136
Saul AD, Hensman J, Vehtari A, Lawrence ND (2016) Chained Gaussian processes. In: Gretton
A, Robert CC (eds) Proceedings of the 19th international conference on artificial intelligence
and statistics, volume 51 of Proceedings of machine learning research. Cadiz, Spain, PMLR, pp
1431–1440
Simpson F, Lalchand V, Rasmussen CE (2021) Marginalised Gaussian processes with nested sam-
pling. Adv Neural Inf Process Syst 34:13613–13625
Snoek J, Larochelle H, Adams RP (2012) Practical Bayesian optimization of machine learning
algorithms. Adv Neural Inf Process Syst 25
Solin A, Särkkä S (2020) Hilbert space methods for reduced-rank Gaussian process regression. Stat
Comput 30(2):419–446
Stein ML (1999) Interpolation of spatial data: some theory for kriging. Springer Science & Business
Media
Svensson A, Dahlin J, Schön TB (2015) Marginalizing Gaussian process hyperparameters using
sequential monte carlo. In: 2015 IEEE 6th international workshop on computational advances in
multi-sensor adaptive processing (CAMSAP). IEEE, pp 477–480
Titsias M (2009) Variational learning of inducing variables in sparse Gaussian processes. In: Arti-
ficial intelligence and statistics. PMLR, pp 567–574
Williams C (1996) Computing with infinite networks. Adv Neural Inf Process Syst 9
Williams CK, Rasmussen CE (2006) Gaussian processes for machine learning. MIT Press
Chapter 4
Machine Learning Methods
for Constructing Dynamic Models
From Data
J. Nathan Kutz
4.1 Introduction
2013; Brunton et al. 2020). In what follows, we consider some of the leading machine
learning strategies that help represent physics-based systems by learning dynamics
and embeddings jointly.
The data-driven discovery process begins with data acquisition. Thus, sensors are
critical for empowering the physics learning process. Often what is measured by the
sensors y is not the correct variable u for building a parsimonious, or dynamic, model
representation. Thus, it is important to first learn how to map the measurements y to a
state space representation u where a model is constructed, i.e. a measurement model
must be constructed. Once achieved, a parsimonious model for u can be constructed in
a variety of ways as detailed in subsequent sections of the manuscript. Importantly,
many modern applications of data-driven modeling require that the measurement
model and parsimonious model be constructed simultaneously. There are, of course,
many nuances to this process, but we will primarily focus at a first approximation on
learning spatio-temporal systems governed by partial differential equations which
are specified by a nonlinear operator N(·).
Thus, we seek to construct a dynamic model based on data y ∈ Rm measured from
a high-dimensional state u ∈ R^n, where n ≫ 1 and m ≪ n. Specifically,
As sufficient data is acquired from sensors in time and space, the data-discovery
pipeline then produces the flow:
with two functions to discover, h and N. Alternatively, one can also find a new
coordinate system z in which to build the dynamic model so that
To frame the application of machine learning algorithms for building dynamic models
from data, we will consider physics-based systems modeled by partial differential
equations (PDEs). PDEs model a diversity of spatio-temporal systems, including
those found in the classical physics fields of fluid dynamics, electrodynamics, heat
conduction, and quantum mechanics. Indeed, PDEs provide a canonical description
of how a system evolves in space and time due the presence of partial derivatives
which model the relationships between rates of change in time and space. Governing
equations of physics-based systems simply provide a constraint, or restriction, on
how these evolving quantities are related. We consider PDEs of the form (Courant and Hilbert 2008)
$$\dot{\mathbf{u}} = \mathbf{N}(\mathbf{u}, x, t; \boldsymbol{\beta}) \qquad (4.4)$$
where u(x, t) is the variable of interest, or the state space, which we are trying to
model. The solution u(x, t) of the PDE depends upon the spatial variable x, which
can be in 1D, 2D, or 3D, and time t. Importantly, solutions can often depend on the
parameters β, thus requiring a solution ultimately that can model parametric depen-
dencies, i.e. u(x, t; β). In what follows, to illustrate various mathematical methods,
the parameters β are assumed to be fixed. Solutions of (4.4) are typically achieved
through numerical simulation, unless N (·) is linear and constant-coefficient so that
analytic solutions are available. Asymptotic and perturbation methods can also offer
analytically tractable solution methods (Kutz 2020). In many modern applications,
discretization of the evolution for u(x, t) results in a high-dimensional system for
which computations are prohibitively expensive. The goal of data-driven modeling is
to generate a proxy model of (4.4) that renders tractable, inexpensive computations
that approximate the full simulations of (4.4).
In what follows, we will typically assume that we do not know (4.4), but have only
data collected at different time points. However, if the governing equations are indeed
known, then recourse to model reduction allows for the construction of efficient proxy
models. Traditional reduced order models (ROMs) produce a dimensionally reduced
version of (4.4) by projecting the governing PDE into a new, low-rank coordinate
system. Coordinate transformations are one of the fundamental paradigms for pro-
ducing solutions to PDEs (Keener 2018). ROMs produce coordinate transformations
from the simulation data itself. Thus, if the governing PDE (4.4) is discretized so
that u(x, t) → uk = u(tk ) ∈ Rn , then snapshots of the solution can be collected into
the data matrix
$$\mathbf{X} = \begin{bmatrix} | & | & \cdots & | \\ \mathbf{u}_1 & \mathbf{u}_2 & \cdots & \mathbf{u}_m \\ | & | & \cdots & | \end{bmatrix} \qquad (4.5)$$
where X ∈ Cn×m . An optimal coordinate system for ROMs is extracted from the data
matrix X with a singular value decomposition (SVD) (Kutz 2013)
$$\mathbf{X} = \boldsymbol{\Psi}\boldsymbol{\Sigma}\mathbf{V}^* \qquad (4.6)$$
The first $r$ columns of $\boldsymbol{\Psi}$ provide a low-rank orthonormal basis, giving the solution ansatz $\mathbf{u}(x, t) \approx \boldsymbol{\Psi}\mathbf{a}(t)$ (4.7), where the vector a = a(t) is determined by using this solution form and Galerkin projecting to (4.4) (Benner et al. 2015). Thus, projection-based ROM simply represents the governing PDE evolution (4.4) in the r-rank subspace spanned by Ψ.
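A minimal numpy sketch of this step follows: forming the snapshot matrix of Eq. (4.5), extracting the POD basis by SVD as in Eq. (4.6), and projecting to the low-rank coordinates a(t). The data here are a random placeholder for actual simulation snapshots, and the rank is an illustrative choice.

```python
import numpy as np

# Snapshot matrix X with columns u(t_k), as in Eq. (4.5); random placeholder for simulation data
n, m, r = 256, 200, 10
X = np.random.randn(n, m)

# POD basis via the SVD of Eq. (4.6), truncated to rank r
Psi, S, Vh = np.linalg.svd(X, full_matrices=False)
Psi_r = Psi[:, :r]

# Low-rank coordinates a(t_k) = Psi_r^T u(t_k); the state is recovered as u ~ Psi_r a, cf. Eq. (4.7)
A = Psi_r.T @ X
rel_err = np.linalg.norm(X - Psi_r @ A) / np.linalg.norm(X)
```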
The projection-based ROM construction often produces a low-rank model for the
dynamics of a(t) that can be unstable (Carlberg et al. 2017), i.e. the models produced
generate solutions that rapidly go to infinity in time. Machine learning techniques
offer a diversity of alternative methods for computing the time dynamics in the low-
rank subspace. The simplest architecture aims to train a deep neural network that
maps the solution forward in time
$$\mathbf{a}_{k+1} = \mathbf{f}_\theta(\mathbf{a}_k) \qquad (4.8)$$
where $\mathbf{a}_k = \mathbf{a}(t_k)$ and $\mathbf{f}_\theta$ represents a deep neural network whose weights and biases
are determined by θ . A diversity of neural networks can be used to advance solu-
tions or learn the flow map from time t to t + Δt (Qin et al. 2019; Liu et al. 2020).
Indeed, deep learning algorithms provide a flexible framework for constructing a
mapping between successive time steps. The typical ROM architecture constrains
the dynamics to a subspace spanned by the POD modes; thus, in the new coordi-
nate system which is generated by projection to the low-rank subspace, the snapshot
matrix is now constructed from the ak . These snapshots can be used to construct a
time-stepping model using neural networks. Recently, Parish et al. (2020) and Regaz-
zoni et al. (2019) developed a suite of neural network-based methods for learning
time-stepping models for (4.8). Moreover, Parish and Carlberg provide extensive
comparisons between different neural network architectures along with traditional
techniques for time-series modeling. In such models, the neural networks (or time-
series analysis methods) simply map an input ak to an output ak+1 .
Machine learning algorithms offer options beyond a Galerkin-POD or deep learn-
ing of time-stepping in the time variable a(t). Thus, instead of inserting (4.7) back
into (4.4) or learning a flow map fθ for (4.8), we can instead think about directly build-
ing a model for the dynamics of a(t). Thus, we would like to construct a dynamical
system
$$\frac{d\mathbf{a}}{dt} = \mathbf{f}(\mathbf{a}, t) \qquad (4.9)$$
where f(·) now prescribes the dynamics. Two highly successful options have emerged
for producing a model for the dynamics: (i) the dynamic mode decomposition
(DMD) (Kutz et al. 2016) and (ii) the sparse identification of nonlinear dynam-
ics (SINDy) (Brunton et al. 2016). The DMD model for f(·) is assumed to be linear
so that
$$\frac{d\mathbf{a}}{dt} \approx \mathbf{L}\mathbf{a} \qquad (4.10)$$
where L is a linear operator found by regression. Solutions are then trivial as all
that one requires is to find the eigenvalues and eigenvectors of L in order to provide
a general solution by linear superposition. The SINDy method makes a different
assumption, that the dynamics f(·) can be represented by only a few terms. In this
case, the regression is to the form
$$\frac{d\mathbf{a}}{dt} \approx \boldsymbol{\Theta}(\mathbf{a})\boldsymbol{\xi} \qquad (4.11)$$
where the columns of the matrix Θ(a) are candidate terms from which to construct
a dynamical system and ξ contains the loading (or coefficient or weight) of each
library term. SINDy assumes that ξ is a sparse vector so that most of the library
terms do not contribute to the dynamics. The regression is a simple Ax = b solve for the unknown coefficients ξ.
where z is the new variable representing the state space u and fθ is the neural network
coordinate transformation that defines the mapping. The goal is then to enforce a
DMD or SINDy model in the new coordinate system
$$\frac{d\mathbf{z}}{dt} = \mathbf{L}\mathbf{z} \qquad (4.13a)$$
$$\frac{d\mathbf{z}}{dt} = \boldsymbol{\Theta}(\mathbf{z})\boldsymbol{\xi} . \qquad (4.13b)$$
This allows us to find a coordinate system beyond the standard linear, low-rank
subspace which can be advantageous for ROM construction. In what follows,
we highlight the basic mathematical formulations allowing for a diversity of ROM
approximations, especially those that leverage DMD and SINDy in constructing
advantageous ROMs.
The previous section highlights a number of methods that can be leveraged to build
a proxy model of the dynamics. The different paradigms can each be quite effective
depending on the goal of the practitioner. To demonstrate the construction of the
various representations, we consider the canonical nonlinear PDE: Burgers’ equation
with diffusive regularization. The evolution, as illustrated in Fig. 4.1a, is governed
by diffusion with a nonlinear advection (Burgers 1948),
$$u_t + u u_x = \epsilon u_{xx} . \qquad (4.14)$$
When ε = 0, the evolution can lead to shock formation in finite time. The presence
of the diffusion term regularizes the PDE, ensuring continuous solutions for all time.
The governing equation (4.14) can be learned directly from data using the SINDy
architecture. Thus a parsimonious nonlinear evolution is learned. However, it is clear
in what follows that other representations can also be learned.
Fig. 4.1 a Evolution dynamics of Burgers’ equation with initial condition u(x, 0) = exp(−(x + 2)²). b Fifteen-mode DMD approximation of the Burgers’ evolution. The simulation of (4.14) was performed over t ∈ [0, 30] where the sampling was taken at every Δt = 1. The domain was discretized with n = 256 points on a domain x ∈ [−15, 15]
Burgers’ equation is one of the few nonlinear PDEs whose analytic solution form
can be derived. In independent, seminal contributions, Hopf (1950) and Cole (1951)
derived a transformation that linearizes the PDE. The Cole–Hopf transformation is
defined as follows:
$$u = -\frac{2\epsilon v_x}{v} . \qquad (4.15)$$
The transformation to the new variable v(x, t) replaces the nonlinear PDE (4.14)
with the linear, diffusion equation
$$v_t = \epsilon v_{xx} \qquad (4.16)$$
Fourier transforming (4.16) in the spatial variable gives
$$\hat{v}_t = -\epsilon k^2 \hat{v} \qquad (4.17)$$
where v̂ = v̂(k, t) denotes the Fourier transform of v(x, t) and k is the wavenumber. The solution in the Fourier domain is easily found to be
$$\hat{v}(k, t) = \hat{v}_0 \exp(-\epsilon k^2 t) \qquad (4.18)$$
where v̂0 = v̂(k, 0) is the Fourier transform of the initial condition v(x, 0).
The linearization process gives a direct construction of the Koopman operator. To
construct the Koopman operator, we can then combine the transform to the variable
v(x, t) from (4.15)
$$v(x, t) = \exp\left( -\frac{\int_{-\infty}^{x} u(\xi, t)\, d\xi}{2\epsilon} \right) \qquad (4.19)$$
with the Fourier transform, so that the observable is
$$g(u) = \hat{v} . \qquad (4.20)$$
The Koopman operator advancing this observable in time is then
$$\mathcal{K} = \exp(-\epsilon k^2 t) . \qquad (4.21)$$
This is one of the rare instances where an explicit expression for the Koopman opera-
tor and the observables can be constructed analytically. The inverse scattering trans-
form (Ablowitz and Segur 1981) for other canonical PDEs, Korteweg-deVries (KdV)
and nonlinear Schrödinger (NLS) equations, also can lead to an explicit expression
for the Koopman operator, but the scattering transform and its inversion are much
more difficult to construct in practice.
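The Cole–Hopf route of Eqs. (4.15)–(4.18) can be sketched numerically as follows; the grid, the diffusion coefficient, and the periodic FFT treatment (a crude approximation that is only reasonable away from the domain edges) are illustrative choices rather than the settings used for Fig. 4.1.

```python
import numpy as np

eps = 0.1
n, L = 256, 30.0
x = np.linspace(-L / 2, L / 2, n, endpoint=False)
k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)            # wavenumbers

u0 = np.exp(-(x + 2) ** 2)                             # initial condition as in Fig. 4.1

# Cole-Hopf transform of the initial condition, Eq. (4.19)
cumint = np.cumsum(u0) * (L / n)                       # approximates the integral from -infinity to x
v0 = np.exp(-cumint / (2 * eps))

# Exact Fourier-domain solution of the heat equation, Eqs. (4.17)-(4.18)
t = 5.0
v = np.real(np.fft.ifft(np.fft.fft(v0) * np.exp(-eps * k ** 2 * t)))

# Map back to a Burgers' solution via u = -2 eps v_x / v, Eq. (4.15); v_x computed spectrally
v_x = np.real(np.fft.ifft(1j * k * np.fft.fft(v)))
u = -2 * eps * v_x / v                                 # accurate away from the periodic domain edges
```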
If instead of an extended domain, one considers a bounded domain where x ∈
[0, L] and v(0) = v(L) = 0, then the diffusion equation yields the solution
$$v(x, t) = \sum_{n=1}^{\infty} b_n \exp(-\lambda_n t) \sin(n\pi x / L) . \qquad (4.22)$$
In practice, one would only use a finite sum to represent the solution so that
$$v(x, t) \approx \sum_{n=1}^{r} b_n \exp(-\lambda_n t) \sin(n\pi x / L) \qquad (4.23)$$
where r is the number of modes used to approximate the solution. Thus, r is like the
rank of a reduced order approximation.
The purpose in explicitly working out the Burgers’ equation is to highlight the
diversity of what can be learned. Specifically, three representations can be constructed
from data: the SINDy model (4.14), a linear Koopman model (4.16), or the eigenso-
lution itself (4.23). Note that our typical analytical methods actually start with the
nonlinear governing equation (4.14) and transform it first to a linear model (4.16)
before producing the solution (4.23). From a data-driven approach, all three are sim-
ply representations in different coordinate systems and variables. Thus, when using
machine learning methods to jointly learn coordinates and dynamics, it is critical to
specify exactly which representation is desired. There are indeed infinite represen-
tations of physics models (Chen et al. 2022). One can impose one’s desires on the
model by constructing appropriate loss functions when training a neural network for
a coordinate system.
The regression to (4.24) shows the immediate value of DMD for forecasting. Specif-
ically, any time t ∗ can be evaluated to produce an approximation to the state of the
system X(t ∗ ). However, despite its introduction more than a decade ago, DMD is
rarely used for forecasting and/or reconstruction of time-series data except in cases
with high-quality (noise-free or nearly noise-free) data. Indeed, practitioners who
work with DMD and noisy data know that the algorithm fails not only to produce a
reasonable forecast but also often fails in the task of reconstructing the time series it
was originally regressed to. Thus, in the past decade, the value of DMD has largely
been an important diagnostic tool as the DMD modes and frequencies are highly
interpretable. Indeed, from Schmid’s (Schmid and Sesterhenn 2008; Schmid 2010)
original work until now, DMD papers are primarily diagnostic in nature with the key
figures of any given paper being the DMD modes and eigenvalues. In cases where
DMD is used on noise-free data, such as for producing reduced order models from
high-fidelity numerical simulation data (Bagheri 2014; Alla and Kutz 2017), then
the DMD solution (4.24) can be used for reconstructing and forecasting accurate
representations of the solution.
The algorithmic construction of the DMD method can be best understood from the
so-called exact DMD (Tu et al. 2014). Indeed, this exact DMD is simply a least-square
fitting procedure. Specifically, the DMD algorithm seeks a best-fit linear operator A
that approximately advances the state of a system, x ∈ Rn , forward in time according
to the linear dynamical system
xk+1 = Axk , (4.25)
where xk = x(kΔt), and Δt denotes a fixed time step that is small enough to resolve
the highest frequencies in the dynamics. Thus, the operator A is an approximation
of the Koopman operator K restricted to a measurement subspace spanned by direct
measurements of the state x (Rowley et al. 2009). In the original DMD formula-
tion (Schmid 2010), uniform sampling in time was required so that tk = kt. The
exact DMD algorithm (Tu et al. 2014) does not require uniform sampling. Rather,
for each snapshot u(tk), there is a corresponding snapshot u(t′k) one time step Δt in
the future. These snapshots are arranged into two matrices, X, which is given in (4.5),
and X′, which is the matrix (4.5) with all snapshots advanced by Δt. In terms of these
matrices, the DMD regression (4.25) is
$$\mathbf{X}' \approx \mathbf{A}\mathbf{X} . \qquad (4.26)$$
The exact DMD is the best fit, in a least-squares sense, operator A that approximately
advances snapshot measurements forward in time. Specifically, it can be formulated
as an optimization problem
$$\mathbf{A} = \underset{\mathbf{A}}{\operatorname{argmin}} \left\| \mathbf{X}' - \mathbf{A}\mathbf{X} \right\|_F = \mathbf{X}'\mathbf{X}^{\dagger} \qquad (4.27)$$
where ‖·‖_F is the Frobenius norm and † denotes the pseudo-inverse. The pseudo-inverse may be computed using the SVD of X = UΣV* as X† = VΣ⁻¹U*. In practice, A is projected onto the leading r POD modes to give a reduced operator Ã, whose eigendecomposition is then computed,
$$\tilde{\mathbf{A}}\mathbf{W} = \mathbf{W}\boldsymbol{\Lambda} . \qquad (4.29)$$
The diagonal matrix Λ contains the DMD eigenvalues, which correspond to eigenval-
ues of the high-dimensional matrix A. The columns of W are eigenvectors of Ã, and
provide a coordinate transformation that diagonalizes the matrix. These columns may
be thought of as linear combinations of POD mode amplitudes that behave linearly
with a single temporal pattern given by the corresponding eigenvalue λ.
The eigenvectors of A are the DMD modes Φ, and they are reconstructed using
the eigenvectors W of the reduced system and the time-shifted data matrix X′,
$$\boldsymbol{\Phi} = \mathbf{X}' \tilde{\mathbf{V}} \tilde{\boldsymbol{\Sigma}}^{-1} \mathbf{W} . \qquad (4.30)$$
Tu et al. (2014) proved that these DMD modes are eigenvectors of the full A matrix
under certain conditions. As already shown in the introduction, the DMD decompo-
sition allows for a reconstruction of the solution as (4.24). The amplitudes of each
mode b can be computed from
$$\mathbf{b} = \boldsymbol{\Phi}^{\dagger} \mathbf{x}_1 , \qquad (4.31)$$
however, alternative and often better approaches are available to compute b (Chen
et al. 2012; Jovanović et al. 2014; Askham and Kutz 2018). Thus, the data matrix X
may be reconstructed as
$$\mathbf{X} \approx \boldsymbol{\Phi}\,\operatorname{diag}(\mathbf{b})\,\mathbf{T}(\boldsymbol{\omega}) = \begin{bmatrix} | & & | \\ \boldsymbol{\phi}_1 & \cdots & \boldsymbol{\phi}_r \\ | & & | \end{bmatrix} \begin{bmatrix} b_1 & & \\ & \ddots & \\ & & b_r \end{bmatrix} \begin{bmatrix} e^{\omega_1 t_1} & \cdots & e^{\omega_1 t_m} \\ \vdots & \ddots & \vdots \\ e^{\omega_r t_1} & \cdots & e^{\omega_r t_m} \end{bmatrix} . \qquad (4.32)$$
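The exact DMD steps described above can be collected into a minimal numpy sketch: SVD of the snapshots, the reduced operator and its eigendecomposition (4.29), the modes (4.30), and the amplitudes (4.31). The rank and time step are user choices, and a simple least-squares solve stands in for the better amplitude estimates cited above.

```python
import numpy as np

def exact_dmd(X, Xprime, r, dt):
    """Exact DMD (Tu et al. 2014): modes Phi, continuous-time eigenvalues omega, amplitudes b."""
    U, S, Vh = np.linalg.svd(X, full_matrices=False)
    Ur, Sr, Vr = U[:, :r], S[:r], Vh[:r, :].conj().T
    Atilde = Ur.conj().T @ Xprime @ Vr / Sr               # reduced operator on the leading POD modes
    evals, W = np.linalg.eig(Atilde)                      # eigendecomposition, Eq. (4.29)
    Phi = Xprime @ Vr / Sr @ W                            # DMD modes, Eq. (4.30)
    omega = np.log(evals.astype(complex)) / dt            # continuous-time frequencies
    b = np.linalg.lstsq(Phi, X[:, 0].astype(complex), rcond=None)[0]   # amplitudes, cf. Eq. (4.31)
    return Phi, omega, b

# Reconstruction/forecast at times t, cf. Eq. (4.32):
# X(t) ~ Phi @ (b[:, None] * np.exp(np.outer(omega, t)))
```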
Bagheri (2014) first highlighted that DMD is particularly sensitive to the effects of
noisy data, with systematic biases introduced to the eigenvalue distribution (Duke
et al. 2012; Bagheri 2013; Dawson et al. 2016; Hemati et al. 2017). As a result, a
number of methods have been introduced to stabilize performance, including total
least-squares DMD (Hemati et al. 2017), forward-backward DMD (Dawson et al.
2016), variational DMD (Azencot et al. 2019), subspace DMD (Takeishi et al. 2017),
time-delay embedded DMD (Brunton et al. 2017; Arbabi and Mezić 2017; Kamb et al.
2020; Hirsh et al. 2021), and robust DMD methods (Askham and Kutz 2018; Scherl
et al. 2020). However, the optimized DMD algorithm of Askham and Kutz (2018),
which uses a variable projection method for nonlinear least squares to compute the
DMD for unevenly timed samples, provides the best and optimal performance of
any algorithm currently available. This is not surprising given that it actually is
constructed to optimally satisfy the DMD problem formulation. Specifically, the
optimized DMD algorithm solves the exponential fitting problem directly,
$$\underset{\boldsymbol{\omega}, \boldsymbol{\Phi}_b}{\operatorname{argmin}} \left\| \mathbf{X} - \boldsymbol{\Phi}_b \mathbf{T}(\boldsymbol{\omega}) \right\|_F \qquad (4.33)$$
where Φ_b = Φ diag(b). This has been shown to provide a superior decomposition due
to its ability to optimally suppress bias and handle snapshots collected at arbitrary
times. Moreover, it can be used with statistical bagging methods to produce the BOP-
DMD (bagging, optimized DMD) (Sashidhar and Kutz 2021), which is perhaps
the most stable variant of DMD. BOP-DMD also provides spatial and temporal
uncertainty quantification. The disadvantage of optimized DMD is that one must
solve a nonlinear optimization problem, which can often fail to converge.
The construction of a traditional ROM that is accurate and efficient is centered
on the reduction (4.4). Thus, once a low-rank subspace is computed from the SVD,
the POD modes are used for projecting the dynamics. DMD allows us to use a
data-driven, non-intrusive method in order to regress to a model for the temporal
dynamics. Consider modification of (4.4) to the evolution dynamics
ut = Lu + N (u, ux , ux x , . . . , x, t; β) (4.34)
where the linear and nonlinear parts of the evolution, denoted by L and N(·), respec-
tively, have been explicitly separated. The solution ansatz u = Ψa yields the ROM
$$\frac{d\mathbf{a}}{dt} = \boldsymbol{\Psi}^T \mathbf{L} \boldsymbol{\Psi} \mathbf{a} + \boldsymbol{\Psi}^T \mathbf{N}(\boldsymbol{\Psi}\mathbf{a}, \boldsymbol{\beta}) . \qquad (4.35)$$
Note that the linear operator in the reduced space, ΨᵀLΨ, is an r × r matrix which
is easily computed. The nonlinear portion of the operator, ΨᵀN(Ψa, β), is more
complicated since it involves repeated computation of the operator as the solution a,
and consequently the high-dimensional state u, is updated in time.
One method for overcoming the difficulties introduced in evaluating the nonlinear
term on the right-hand side is to introduce the DMD algorithm. DMD approximates
a set of snapshots by a best-fit linear model. Thus, the nonlinearity can be evaluated
over snapshots and a linear model constructed to approximate the dynamics. Thus,
two matrices can be constructed
$$\mathbf{N}_- = \begin{bmatrix} | & | & & | \\ \mathbf{N}_1 & \mathbf{N}_2 & \cdots & \mathbf{N}_{m-1} \\ | & | & & | \end{bmatrix} \quad \text{and} \quad \mathbf{N}_+ = \begin{bmatrix} | & | & & | \\ \mathbf{N}_2 & \mathbf{N}_3 & \cdots & \mathbf{N}_m \\ | & | & & | \end{bmatrix} \qquad (4.36)$$
where N_k denotes the nonlinearity evaluated on the kth snapshot. A DMD regression to these matrices yields a best-fit linear operator A_N such that
$$\mathbf{N}_+ = \mathbf{A}_N \mathbf{N}_- . \qquad (4.37)$$
The evolution (4.34) can then be approximated by the linear model
$$\mathbf{u}_t \approx \mathbf{L}\mathbf{u} + \mathbf{A}_N \mathbf{u} = (\mathbf{A} + \mathbf{A}_N)\mathbf{u} \qquad (4.38)$$
where the operator L has been replaced by A. The dynamics is now completely linear
and solutions can be easily constructed from the eigenvalues and eigenvectors of the
linear operator A + AN .
In practice, the DMD algorithm exploits low-dimensional structure in building
a ROM model. Thus, instead of the approximate linear model (4.34), we build a
low-dimensional version. From snapshots (4.36) of the nonlinearity, the DMD algo-
rithm can be used to approximate the dominant rank-r nonlinear contribution to the
dynamics as
$$\mathbf{N}(\mathbf{u}, \mathbf{u}_x, \mathbf{u}_{xx}, \ldots, x, t; \boldsymbol{\beta}) \approx \sum_{j=1}^{r} \boldsymbol{\phi}_j \exp(\omega_j t) b_j = \boldsymbol{\Phi} \exp(\boldsymbol{\Omega} t) \mathbf{b} \qquad (4.39)$$
where b_j determines the weighting of each mode. Here, φ_j is the DMD mode and
ω_j is the DMD eigenvalue. This approximation can be used in (4.35) to produce the
POD-DMD approximation
$$\frac{d\mathbf{a}}{dt} = \boldsymbol{\Psi}^T \mathbf{L} \boldsymbol{\Psi} \mathbf{a} + \boldsymbol{\Psi}^T \boldsymbol{\Phi} \exp(\boldsymbol{\Omega} t) \mathbf{b} . \qquad (4.40)$$
In this formulation, there are a number of advantageous features: (i) the nonlin-
earity is only evaluated once with the DMD algorithm (4.39) and (ii) the products
ΨᵀLΨ and ΨᵀΦ are also only evaluated once and both produce matrices that are
low rank, i.e. they are independent of the original high-rank system. Thus, with a
one-time, up-front evaluation of two snapshot matrices to produce Ψ and Φ, the
construction yields a computationally efficient ROM that requires no recourse to the
original high-dimensional system.
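Putting these pieces together, the following is a minimal sketch of assembling the POD–DMD ROM of Eq. (4.40) for a toy semi-linear problem (1D diffusion with a cubic reaction term). The operators, data, and ranks are placeholders, and exact_dmd refers to the earlier DMD sketch.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy semi-linear problem u_t = L u + N(u): 1D diffusion plus a cubic reaction term
n, m, r, dt = 128, 200, 8, 0.05
x = np.linspace(0, 1, n)
dx = x[1] - x[0]
L = 1e-3 * (np.diag(-2 * np.ones(n)) + np.diag(np.ones(n - 1), 1) + np.diag(np.ones(n - 1), -1)) / dx ** 2
N = lambda u: -u ** 3

# Full-order snapshots of the state and of the nonlinearity, cf. Eq. (4.36)
u0 = np.exp(-100 * (x - 0.5) ** 2)
full = solve_ivp(lambda t, u: L @ u + N(u), (0, m * dt), u0, t_eval=np.arange(m) * dt)
X = full.y
Nsnap = N(X)

# POD basis from the state snapshots; DMD of the nonlinearity (exact_dmd from the earlier sketch)
Psi = np.linalg.svd(X, full_matrices=False)[0][:, :r]
Phi, omega, b = exact_dmd(Nsnap[:, :-1], Nsnap[:, 1:], r, dt)

# POD-DMD ROM of Eq. (4.40): all reduced operators are assembled once, up front
L_r = Psi.T @ L @ Psi
PsiPhi = Psi.T @ Phi
rhs = lambda t, a: L_r @ a + np.real(PsiPhi @ (np.exp(omega * t) * b))
rom = solve_ivp(rhs, (0, m * dt), Psi.T @ u0, t_eval=np.arange(m) * dt)
u_rom = Psi @ rom.y                                    # lifted ROM approximation of X
```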
Alla and Kutz (2017) integrated the DMD algorithm into the traditional ROM
formalism to produce the POD-DMD model (4.40). The comparison of this com-
putationally efficient ROM with traditional model reduction is shown in Fig. 4.2.
Specifically, both the computational time and error are evaluated using this tech-
nique. Once the DMD algorithm is used to produce an approximation of the nonlin-
Fig. 4.2 Computation time and accuracy on a semi-linear parabolic equation (Modified from Alla
and Kutz 2017). Four methods are compared, the high-fidelity simulation of the governing equations
(FULL), a Galerkin-POD reduction as given in (4.40) (POD), a Galerkin-POD reduction with the
discrete empirical interpolation (DEIM) algorithm for evaluating the nonlinearity (POD-DEIM),
and the POD-DMD approximation (4.40). The left panel shows the computation time, which is
an order-of-magnitude faster than traditional POD-DEIM algorithms. The right panel shows the
accuracy of the different methods for reproducing the high-fidelity simulations. POD-DMD loses
some accuracy in comparison to Galerkin-POD methods due to the fact that DMD modes are not
orthogonal and thus the error does not decrease as quickly as the POD-based methods
ear term, it can be used for producing future state predictions and a computationally
efficient ROM. Indeed, its computational acceleration is quite remarkable in com-
parison to traditional methods. Moreover, the method is non-intrusive and does not
require additional evaluation of the nonlinear term. The entire method can be used
with randomized algorithms to speed up the low-rank evaluations even further (Alla
and Kutz 2019). Note that the computational performance boost comes at the expense
of accuracy as shown in Fig. 4.2. This is primarily due to the fact that additional POD
modes used for standard ROMs, which are orthogonal by construction and guaran-
teed to be a best fit in ℓ₂, are now replaced by DMD modes which are no longer
orthogonal (Alla and Kutz 2017).
In addition to a DMD model for modeling the low-rank dynamics, the SINDy regres-
sion framework also allows one to build a model for the evolution of the temporal
dynamics in the low-rank subspace. Discovery of governing equations plays a fun-
damental role in the development of physical theories, and in this case, we wish to
discover the evolution dynamics of a(t) for constructing our ROM. With increas-
ing computing power and data availability in recent years, there have been sub-
stantial efforts to identify the governing equations directly from data (Bongard and
Lipson 2007; Schmidt and Lipson 2009; Yang et al. 2020). There has been partic-
ular emphasis on parsimonious representations because they have the benefits of
promoting interpretability and generalizing well to unknown data (Bai et al. 2015;
Brunton et al. 2014, 2016; Mackey et al. 2014; Ozoliņš et al. 2013; Proctor et al.
2014; Tran and Ward 2017; Wang et al. 2011). The SINDy method was proposed
in Brunton et al. (2016), which leverages dictionary learning and sparse regression
to model dynamical systems. This approach has been successful in modeling a diver-
sity of applications, including in chemistry (Hoffmann et al. 2019), optics (Sorokina
et al. 2016), engineered systems (Li et al. 2019), epidemiology (Horrocks and Bauch
2020), and plasma physics (Dam et al. 2017). Furthermore, there has been a variety
of modifications, including improved robustness to noise (Champion et al. 2019,
2020; Kaheman et al. 2020), generalizations to partial differential equations (Raissi
and Karniadakis 2018; Rudy et al. 2017, 2019), multi-scale physics (Champion et al.
2019), and libraries of rational functions (Mangan et al. 2016; Kaheman et al. 2020).
Just like the BOP-DMD algorithm (Sashidhar and Kutz 2021), recent Bayesian and
ensemble methods make SINDy much more robust for model discovery for noisy
systems and with little data (Hirsh et al. 2021; Fasel et al. 2021).
In the context of ROMs modeling, the goal is now to discover a dynamic, parsi-
monious model of the evolution dynamics of a high-fidelity model embedded in a
low-rank subspace. Recall that u(t) ≈ Ψa(t) for building a ROM. Although Ψ can
be easily computed with the SVD, it is the evolution of a(t) that ultimately deter-
mines the temporal behavior of the system. Thus far, the temporal evolution has been
computed via Galerkin projection and DMD. SINDy gives yet another alternative
$$\frac{d}{dt}\mathbf{a} = \mathbf{f}(\mathbf{a}) \qquad (4.41)$$
where the right-hand side function prescribing the evolution dynamics f(·) is
unknown. SINDy provides a sparse regression framework to determine this dynam-
ics. As in DMD, the snapshots of a(t) are collected into the matrix
$$\mathbf{A} = \begin{bmatrix} | & | & & | \\ \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_m \\ | & | & & | \end{bmatrix} . \qquad (4.42)$$
A corresponding matrix of time derivatives, Ȧ, is also formed (computed numerically if not directly available), and the SINDy regression takes the form
$$\dot{\mathbf{A}} = \boldsymbol{\Theta}(\mathbf{A})\boldsymbol{\Xi} . \qquad (4.43)$$
Fig. 4.3 Application of SINDy algorithm for ROM construction. High-dimensional data is used
with the sparse identification of nonlinear dynamics (SINDy) (Brunton et al. 2016) in order to
produce a model for a(t). This procedure is modular so that different techniques can be used for the
feature extraction and regression steps. In this example of flow past a cylinder, SINDy discovers
the model of Noack et al. (2003). Modified from Brunton et al. (2016)
The sparse coefficients are found column-by-column via
$$\boldsymbol{\xi}_k = \underset{\boldsymbol{\xi}'_k}{\operatorname{argmin}} \left\| \dot{\mathbf{a}}_k - \boldsymbol{\Theta}(\mathbf{A})\boldsymbol{\xi}'_k \right\|_2 + \lambda \left\| \boldsymbol{\xi}'_k \right\|_1 \qquad (4.44)$$
Note that ȧk is the kth column of Ȧ, and λ is a sparsity-promoting regularization. There
are many variants for sparsity promotion that can be used (Tibshirani 1996; Donoho
2006; Candès 2006; Candès et al. 2023, 2006; Candès and Tao 2006; Baraniuk
2007; Tropp and Gilbert 2007), including the advocated sequential least-squares
thresholding to select active terms (Brunton et al. 2016).
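A minimal numpy sketch of the SINDy regression (4.43)–(4.44) with sequentially thresholded least squares follows. The polynomial library, the threshold, and the convention of rows as time samples are illustrative choices (the transposes of the matrices in Eqs. (4.42)–(4.43) would be passed in); packages such as PySINDy automate these steps.

```python
import numpy as np

def sindy_stls(A, A_dot, threshold=0.1, n_iter=10):
    """Sparse regression A_dot ~ Theta(A) Xi by sequentially thresholded least squares.

    A, A_dot: arrays of shape (n_samples, r) with rows a(t_k)^T and their time derivatives.
    """
    # Library Theta(A): constant, linear, and quadratic terms of the r state variables
    r = A.shape[1]
    cols = [np.ones((A.shape[0], 1)), A]
    cols += [(A[:, i] * A[:, j])[:, None] for i in range(r) for j in range(i, r)]
    Theta = np.hstack(cols)

    Xi = np.linalg.lstsq(Theta, A_dot, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for k in range(A_dot.shape[1]):                 # refit each equation on its active terms only
            active = ~small[:, k]
            if active.any():
                Xi[active, k] = np.linalg.lstsq(Theta[:, active], A_dot[:, k], rcond=None)[0]
    return Theta, Xi
```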
The SINDy-POD method provides a simple regression framework for discovering
a parsimonious, and generally nonlinear, model for the evolution dynamics of the
high-dimensional model in a low-dimensional subspace. As an example, the canonical
problem of flow past a circular cylinder is considered. This is modeled by the 2D,
incompressible Navier–Stokes equations (Bagheri 2014)
$$\nabla \cdot \mathbf{u} = 0, \qquad \partial_t \mathbf{u} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\nabla p + \frac{1}{Re} \Delta \mathbf{u} \qquad (4.45)$$
where u is the two-component flow velocity field in 2D and p is the pressure term. For
Reynolds number Re = Rec ≈ 47, the fluid flow past a cylinder undergoes a super-
critical Hopf bifurcation, where the steady flow for Re < Rec transitions to unsteady
vortex shedding (Bearman 1969). The unfolding gives the celebrated Stuart–Landau
ODE, which is essentially the Hopf normal form in complex coordinates. This has
resulted in accurate and efficient reduced order models for this system (Noack et al.
2003, 2011).
In Fig. 4.3, simulations at Re = 100 were considered. The SVD of the data matrix
at this Reynolds number shows that three modes dominate the dynamics. As such
the first three columns of the matrix V are extracted and the SINDy regression (4.43)
is performed. The discovered dynamical model is given by
which is the same as that found by Noack et al. (2003) through a detailed asymptotic reduc-
tion of the flow dynamics. Thus, the ROM evolution dynamics (4.46c) represents
a significantly different model than what is achieved via Galerkin POD projection.
This model is stable and it also captures the correct supercritical Hopf bifurcation
dynamics as a function of the Reynolds number. Thus, the SINDy-POD provides an
improved ROM description of the dynamics.
In the projection-based ROMs of previous sections, the amplitude dynamics a(t) are
constructed by Galerkin projection on the governing equations. With neural networks,
the dynamics a(t) is approximated from the discrete time-series data encoded in V.
Specifically, this gives
Fig. 4.4 Illustration of neural network integration with POD subspaces. The autoencoder structure
projects the original high-dimensional state space data into a low-dimensional space via u(t) ≈
Ψa(t). As shown in the bottom left, the snapshots uk are generated by high-fidelity numerical
solutions of the governing equations ut = N (u, ux , ux x , . . . , x, t; β). In traditional ROMs, the
snapshots ak are constructed from Galerkin projection as shown in the bottom right. Neural networks
instead learn a mapping ak+1 = fθ (ak ) from the original, low-dimensional snapshot data. It should
be noted that time-stepping Runge–Kutta schemes, for instance, are a form of feed-forward neural
networks, which are used to produce the original high-fidelity data snapshots uk (Gonzalez-Garcia
et al. 1998)
$$\mathbf{a}(t) \rightarrow \tilde{\boldsymbol{\Sigma}} \tilde{\mathbf{V}}^* = \begin{bmatrix} | & | & & | \\ \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_m \\ | & | & & | \end{bmatrix} \qquad (4.48)$$
over the m time snapshots of the original data matrix on which the ROM is to be
constructed.
Deep learning algorithms provide a flexible framework for constructing a mapping
between successive time steps. As shown in Fig. 4.4, the typical ROM architecture
constrains the dynamics to a subspace spanned by the POD modes Ψ. Thus, in the
original coordinate system, the high-fidelity simulations of the governing equations
for u are solved with a given numerical discretization scheme to produce a snapshot
matrix X containing uk. In the new coordinate system which is generated by projection to the subspace Ψ, the snapshot matrix is now constructed from ak as shown in
(4.48). In traditional ROMs, the snapshot matrix (4.48) is not used. Instead snapshots
of ak are achieved by solving the Galerkin projected model. However, the snapshot
matrix (4.48) can be used to construct a time-stepping model using neural networks.
Neural networks allow one to use the high-fidelity simulation data to train a mapping
$$\mathbf{a}_{k+1} = \mathbf{f}_\theta(\mathbf{a}_k) . \qquad (4.49)$$
The snapshots can be arranged into the input-output pairs
$$\mathbf{A}_- = \begin{bmatrix} \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_{m-1} \end{bmatrix}, \qquad \mathbf{A}_+ = \begin{bmatrix} \mathbf{a}_2 & \mathbf{a}_3 & \cdots & \mathbf{a}_m \end{bmatrix} \qquad (4.50)$$
where the ± denotes the input (−) and output (+), respectively. This gives the training data necessary for learning (optimizing) a neural network map
$$\mathbf{A}_+ = \mathbf{f}_\theta(\mathbf{A}_-) . \qquad (4.51)$$
There are numerous neural network architectures that can learn the mapping fθ . A
simple feed-forward network was already shown to be quite accurate in learning such
a model. Further sophistication can improve accuracy and reduce data requirements
for training. Regardless of the architecture, the error accumulation in DNNs when
the solution is obtained recursively is an open research question which is important
to understand for its usage in ROM architectures and for long time solutions.
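A minimal sketch of learning the map in Eq. (4.51) with a small feed-forward network is given below. PyTorch is used here purely for convenience (the cited works explore a range of architectures), and the data, network size, and training settings are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

# Placeholder POD coefficients a(t_k); columns of A_minus map to columns of A_plus, cf. Eq. (4.50)
r, m = 10, 500
A = np.random.randn(r, m).astype(np.float32)
A_minus = torch.tensor(A[:, :-1].T)     # inputs  a_k
A_plus = torch.tensor(A[:, 1:].T)       # targets a_{k+1}

f_theta = nn.Sequential(nn.Linear(r, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, r))
opt = torch.optim.Adam(f_theta.parameters(), lr=1e-3)

for epoch in range(2000):
    opt.zero_grad()
    loss = F.mse_loss(f_theta(A_minus), A_plus)
    loss.backward()
    opt.step()

# Recursive forecast by repeatedly applying the learned map to the last training snapshot
with torch.no_grad():
    a = torch.tensor(A[:, -1]).unsqueeze(0)
    forecast = []
    for _ in range(50):
        a = f_theta(a)
        forecast.append(a.squeeze(0).numpy())
```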
Regazzoni et al. (2019) formulated the optimization of (4.51) in terms of
maximum-likelihood. Specifically, they considered the most suitable representation
of the high-fidelity model in terms of simpler neural network models. They show
that such neural network models can approximate the solution to within any accuracy
required (limited by the accuracy of the training data, of course) simply by construct-
ing them from the input-output pairs given by (4.51). Parish et al. (2020) provide an
in-depth study of different neural network architectures that can be used for learning
the time-steppers. They are especially focused on recurrent neural network (RNN)
architectures that have proven to be so effective in temporal sequences associated
with language (Goodfellow et al. 2016). Their extensive comparisons show that long
short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) neural networks
outperform other methods and provide substantial improvements over traditional
time-series approaches such as autoregressive models. In addition to a baseline Gaus-
sian process (GP) regression, they specifically compare time-stepping models that
include the following: k-nearest neighbors (kNN), artificial neural networks (ANN),
autoregressive with exogenous inputs (ARX), Integrated ANN (ANN-I), latent ARX
(LARX), RNN, LSTM, and standard GP. Some models include recursive training
(RT) and others do not (NRT). Their comparisons on a diversity of PDE models,
which will not be detailed here, are evaluated on the fraction of variance unex-
plained (FVU). Figure 4.5 gives a representation of the extensive comparisons made
on these methods for an advection–diffusion PDE model.
For neural networks, the flow map is approximated by the learned model (4.49) so
that F = fθ . Qin et al. (2019) and Liu et al. (2020) have explored the construction
of flow maps from neural networks as yet another modeling paradigm for advanc-
ing the solution in time without recourse to high-fidelity simulations. Such methods
offer a broader framework for fast time-stepping algorithms as no initial dimen-
sionality reduction needs to be computed. In Qin et al. (2019), the neural network
model fθ is constructed with a residual network (ResNet) as the basic architecture
for approximation. In addition to a one-step method, which is shown to be exact in
temporal integration, a recurrent ResNet and a recursive ResNet are also constructed
for multiple time steps. Their formulation is also in the weak form where no deriva-
tive information is required in order to produce the time-stepping approximations.
Several numerical examples are presented to demonstrate the performance of the
methods. Like Parish et al. (2020) and Regazzoni et al. (2019), the method is shown
to be exceptionally accurate even in comparison with direct numerical integration,
highlighting the qualities of the universal approximation properties of fθ .
Liu et al. (2020) leveraged the flow map approximation scheme to learn a mul-
tiscale time-stepping scheme. Specifically, one can learn flow maps for different
characteristic timescales. Thus, a given model
can learn a flow map over a prescribed timescale τ . If there exist distinct timescales in
the data, for instance, denoted by t1, t2, and t3 with t1 ≫ t2 ≫ t3 (slow, medium, and
fast times), then three models can be learned, fθ 1 , fθ 2 , and fθ 3 for the slow, medium,
and fast times, respectively. Figure 4.6 shows the hierarchical time-stepping (HiTS)
scheme with three distinct timescales. The training data of a high-fidelity simulation,
or collection of experimental data, allow for the construction of flow maps which
can then be used to efficiently forecast long times in the future. Specifically, one can
use the flow map constructed on the slowest scale fθ 1 to march far into the future
while the medium and fast scales are then used to advance to the specific point in
time. Thus, a minimal number of steps is taken on the fast scale, and the work of
forecasting long into the future is done by the slow and medium scales. The method
is highly efficient and accurate.
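A minimal sketch of how such learned flow maps can be composed across timescales, in the spirit of the hierarchical scheme described above, is given below; the strides and the flow-map functions themselves are placeholders for maps trained as in the earlier sketches.

```python
def hierarchical_step(u0, n_steps, models):
    """Advance n_steps fine steps by greedily applying the coarsest learned flow map first.

    models: list of (stride, flow_map) pairs sorted from coarsest to finest, where
            flow_map(u) advances the state by `stride` fine time steps.
    """
    u, remaining = u0, n_steps
    for stride, flow_map in models:
        while remaining >= stride:
            u = flow_map(u)
            remaining -= stride
    return u

# e.g. models = [(64, f_slow), (8, f_medium), (1, f_fast)]: 203 steps are covered by
# 3 slow, 1 medium, and 3 fast applications instead of 203 fine-scale applications.
```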
Figure 4.7 compares the HiTS scheme across a number of example problems,
some of which are videos and music frames. Thus, HiTS does not require governing
equations, simply time-series data arranged into input–output pairs. The performance
of such flow maps is remarkably robust, stable, and accurate, even when compared to
leading time-series neural networks such as LSTMs, echo state networks (ESN) and
clockwork recurrent neural networks (CW-RNNs). This is especially true for long
Fig. 4.6 Multiscale hierarchical time-stepping scheme (modified from Liu et al. 2020). Neural
network representations of the time-steppers are constructed over three distinct time scales. The
red model takes large steps (slow scale fθ 1 ), leaving the finer time-stepping to the yellow (medium
time scales fθ 2 ) and blue (fast time scales fθ 3 ) models. The dark path shows the sequence of maps
from u1 to um
Fig. 4.7 Evaluation of different neural network architectures (column) on each training sequence
(row) (From Liu et al. 2020). Key diagnostics are visualized from a diversity of examples, including
music files and videos. The last frame of the reconstruction is visualized for the first, third, and
fourth examples while the entire music score is visualized in the second example. Note the superior
performance of the hierarchical time-stepping scheme in comparison with other modern neural net-
work models such as LSTMs, echo state networks (ESN), and clockwork recurrent neural networks
(CW-RNNs)
forecasts in contrast to the small time-steps evaluated in the work of Parish et al.
(2020).
Overall, the works of Parish et al. (2020), Regazzoni et al. (2019), Qin et al.
(2019), and Liu et al. (2020) exploit very simple training paradigms related to input–output pairs of data.
Fig. 4.8 Schematic of the SINDy autoencoder method for simultaneous discovery of coordinates
and parsimonious dynamics (From Champion et al. 2019). a An autoencoder architecture is used
to discover intrinsic coordinates z from high-dimensional input data x. The network consists of
two components, an encoder ϕ(x), which maps the input data to the intrinsic coordinates z, and a
decoder ψ(z), which reconstructs x from the intrinsic coordinates. b A SINDy model captures the
dynamics of the intrinsic coordinates. The active terms in the dynamics are identified by the nonzero
elements in Ξ, which are learned as part of the NN training. The time derivatives of z are calculated
using the derivatives of x and the gradient of the encoder ϕ. The inset shows the pointwise loss
function used to train the network. The loss function encourages the network to minimize both the
autoencoder reconstruction error and the SINDy loss in z and x. An ℓ1 regularization on Ξ is also
included to encourage parsimonious dynamics
dz(t)/dt = g(z(t)).   (4.54)
Specifically, a parsimonious description of the dynamics is sought where g contains
only a few active terms from a SINDy library. Thus in addition to a dynamical
model, the method learns coordinate transforms ϕ, ψ that map the measurements
to intrinsic coordinates via z = ϕ(x) (encoder) and back via x ≈ ψ(z) (decoder).
The autoencoder is a flexible, feedforward neural network that allows one to
discover underlying low-dimensional coordinates in which to represent the data.
Thus, the layers of the autoencoder learn a latent representation of a new variable
in which to express the data, in this case, the dynamic evolution. The
network is trained to output an approximate reconstruction of its input, and the
restrictions placed on the network architecture (e.g. the type, number, and size of the
hidden layers) characterize the intrinsic coordinates (Goodfellow et al. 2016). The
autoencoder gives a nonlinear generalization of a PCA analysis (Baldi and Hornik
1989); thus, it goes beyond the standard linear POD subspace description.
Autoencoders can learn a low-dimensional representation in isolation without
need to specify any other constraints. Without further specifications, the intrin-
sic coordinates learned have no particular meaning or interpretation. However,
if in the latent space, additional constraints are imposed, then additional struc-
ture and meaning can be imposed on the model. For the SINDy AE model, the
network is required to learn coordinates associated with parsimonious dynamics.
Thus, it integrates the sparse regression framework of SINDy in the latent space,
or intrinsic coordinates z. This constraint in the autoencoder provides a regular-
ization framework whereby model discovery is achieved by constructing a library
Θ(z) = [θ1(z), θ2(z), . . . , θp(z)] of candidate basis functions, e.g. polynomials, and
learning a sparse set of coefficients Ξ = [ξ1, . . . , ξd] that defines the dynamical
system
dz(t)/dt = g(z(t)) = Θ(z(t)) Ξ.
Typical of SINDy, the library is specified before training occurs, where library load-
ings (coefficients) are learned along with the autoencoder weights during training
(optimization). Importantly, the derivatives ẋ(t) of the original states are computed
in order to pass these along to the encoder variables as ż(t) = ∇x ϕ(x(t))ẋ(t). This
helps enforce accurate prediction of the dynamics by incorporating the loss function
Ldz/dt = ‖∇x ϕ(x) ẋ − Θ(ϕ(x)ᵀ) Ξ‖₂² .   (4.55)
This term uses both the typical SINDy regression along with the gradient of the
encoder to promote learning of a sparse dynamical model which accurately predicts
the time derivatives of the encoder variables. Additional loss terms require that the
SINDy predictions accurately reconstruct the time derivatives of the original data
Ldx/dt = ‖ẋ − (∇z ψ(ϕ(x))) (Θ(ϕ(x)ᵀ) Ξ)‖₂² .   (4.56)
These loss terms (4.55) and (4.56) are added to the standard autoencoder loss

Lrecon = ‖x − ψ(ϕ(x))‖₂² ,
which ensures that the autoencoder can accurately reconstruct the original input data.
To help promote sparsity in the SINDy architecture, an ℓ1 regularization penalty is
included on the SINDy coefficients Ξ. This promotes a parsimonious model for the
dynamics by selecting only a small number of terms. The combination of the above
four terms gives the following overall loss function:

Ltotal = Lrecon + λ1 Ldx/dt + λ2 Ldz/dt + λ3 ‖Ξ‖₁ ,

where the hyperparameters λ1, λ2, and λ3 weight the relative importance of each term.
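As an illustration of how these four terms can be assembled in practice, the following sketch implements the combined loss for a small autoencoder with a polynomial SINDy library. It is not the reference implementation of Champion et al. (2019); the network sizes, the library, the loss weights, and the use of TensorFlow's forward-mode accumulator to form ∇x ϕ(x) ẋ are illustrative assumptions.

import tensorflow as tf

latent_dim, input_dim = 2, 64
encoder = tf.keras.Sequential([tf.keras.layers.Dense(32, activation="tanh"),
                               tf.keras.layers.Dense(latent_dim)])
decoder = tf.keras.Sequential([tf.keras.layers.Dense(32, activation="tanh"),
                               tf.keras.layers.Dense(input_dim)])

def theta(z):
    # Candidate library: constant, linear, and quadratic monomials of z
    return tf.concat([tf.ones_like(z[:, :1]), z, z**2], axis=1)

n_lib = 1 + 2 * latent_dim
Xi = tf.Variable(0.1 * tf.random.normal((n_lib, latent_dim)))   # SINDy coefficients

def sindy_ae_loss(x, x_dot, lam1=1e-4, lam2=1e-4, lam3=1e-5):
    # z_dot by the chain rule, z_dot = grad(phi)(x) x_dot, via a forward-mode JVP
    with tf.autodiff.ForwardAccumulator(primals=x, tangents=x_dot) as acc:
        z = encoder(x)
        z_dot = acc.jvp(z)
    z_dot_sindy = theta(z) @ Xi                                  # SINDy model for dz/dt
    # x_dot reconstructed through the decoder, x_dot = grad(psi)(z) z_dot_sindy
    with tf.autodiff.ForwardAccumulator(primals=z, tangents=z_dot_sindy) as acc:
        x_rec = decoder(z)
        x_dot_rec = acc.jvp(x_rec)
    L_recon = tf.reduce_mean((x - x_rec) ** 2)                   # autoencoder loss
    L_dz = tf.reduce_mean((z_dot - z_dot_sindy) ** 2)            # cf. Eq. (4.55)
    L_dx = tf.reduce_mean((x_dot - x_dot_rec) ** 2)              # cf. Eq. (4.56)
    L_reg = tf.reduce_sum(tf.abs(Xi))                            # l1 sparsity penalty
    return L_recon + lam1 * L_dx + lam2 * L_dz + lam3 * L_reg

# Illustrative call with random data in place of measured snapshots and derivatives
x = tf.random.normal((16, input_dim))
x_dot = tf.random.normal((16, input_dim))
print(float(sindy_ae_loss(x, x_dot)))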
4.6 Conclusions
Data-driven methods have emerged as an invaluable tool for aiding in the construction
of dynamic models and their representation. Demonstrated here are three emerging
paradigms for data-driven models, (i) dynamic mode decomposition, (ii) sparse iden-
tification of nonlinear dynamics, and (iii) neural networks. In each case, the goal is to
use these methods to construct a model for the dynamics of a(t). This is a data-driven
construction as opposed to a projection-based construction typical of ROMs (Benner
et al. 2015) when governing equations are already known.
Each of the data-driven constructions has advantages that can be leveraged by
practitioners. The DMD method is perhaps the simplest as it provides a regression to
a best-fit linear model. The linear model, which models a Koopman operator (Brun-
ton et al. 2021), is advantageous since solutions can be easily represented as a linear
combination of eigenvalues and eigenfunctions of the constructed linear operator.
The data requirements are minimal for the DMD approximation and there exists
open-source code, pyDMD (Demo et al. 2018), for producing DMD models. SINDy
requires more data, but it allows for a nonlinear representation of the dynamic evolu-
tion. SINDy is advantageous since it produces a parsimonious evolution dynamics for
a(t) that is typically interpretable and amenable to analysis with tools from dynamical
systems theory. It also has open-source software available called pySINDy (de Silva
et al. 2020; Kaptanoglu et al. 2021). If sufficient data is available, a diversity of deep
learning algorithms are available for producing neural networks for modeling the
time evolution of a(t). Such algorithms have been shown to be successful in a diver-
sity of application areas. Moreover, deep learning can be structured, for instance, to
learn multiscale physics.
Overall, data-driven methods are providing significantly improved capabilities
for traditional ROMs. As these methods are developed further over the next decade,
it is anticipated that ROMs will be substantially improved in terms of computational
performance and accuracy. This has the potential to revolutionize digital twin tech-
nologies as these methods can use computational or measurement data to construct
proxy models that are accurate and efficient to simulate.
Acknowledgements The work of JNK was supported in part by the US National Science Founda-
tion (NSF) AI Institute for Dynamical Systems (dynamicsai.org), grant 2112085.
References
Ablowitz MJ, Segur H (1981) Solitons and the inverse scattering transform, vol 4. Siam
Alla A, Kutz JN (2017) Nonlinear model order reduction via dynamic mode decomposition. SIAM
J Sci Comput 39(5):B778–B796
Alla A, Kutz JN (2019) Randomized model order reduction. Adv Comput Math 45(3):1251–1271
Antoulas AC (2005) Approximation of large-scale dynamical systems. SIAM
Arbabi H, Mezić I (2017) Ergodic theory, dynamic mode decomposition, and computation of spectral
properties of the koopman operator. SIAM J Appl Dyn Syst 16(4):2096–2126
Askham T, Kutz JN (2018) Variable projection methods for an optimized dynamic mode decom-
position. SIAM J Appl Dyn Syst 17(1):380–416
Azencot O, Yin W, Bertozzi A (2019) Consistent dynamic mode decomposition. SIAM J Appl Dyn
Syst 18(3):1565–1585
Bagheri S (2013) Koopman-mode decomposition of the cylinder wake. J Fluid Mech 726:596–623
Bagheri S (2014) Effects of weak noise on oscillating flows: Linking quality factor, Floquet modes,
and Koopman spectrum. Phys Fluids 26(9):094104
Bai Z, Wimalajeewa T, Berger Z, Wang G, Glauser M, Varshney PK (2015) Low-dimensional
approach for reconstruction of airfoil data via compressive sensing. AIAA J 53(4):920–933
Baldi P, Hornik K (1989) Neural networks and principal component analysis: Learning from exam-
ples without local minima. Neural Netw 2(1):53–58
Baraniuk RG (2007) Compressive sensing. IEEE Signal Process Mag 24(4):118–120
Bearman PW (1969) On vortex shedding from a circular cylinder in the critical reynolds number
regime. J Fluid Mech 37(3):577–585
Benner P, Gugercin S, Willcox K (2015) A survey of projection-based model reduction methods
for parametric dynamical systems. SIAM Rev 57(4):483–531
Bongard J, Lipson H (2007) Automated reverse engineering of nonlinear dynamical systems. Proc
Natl Acad Sci 104(24):9943–9948
Brunton SL, Brunton BW, Proctor JL, Kaiser E, Kutz JN (2017) Chaos as an intermittently forced
linear system. Nat Commun 8(19):1–9
Brunton SL, Budišić M, Kaiser E, Kutz JN (2021) Modern Koopman theory for dynamical systems.
arXiv:2102.12086
Brunton SL, Kutz JN (2019) Data-driven science and engineering: machine learning, dynamical
systems, and control. Cambridge University Press
Brunton SL, Kutz JN, Manohar K, Aravkin AY, Morgansen K, Klemisch J, Goebel N, Buttrick
J, Poskin J, Blom-Schieber A et al (2020) Data-driven aerospace engineering: Reframing the
industry with machine learning. arXiv:2008.10740
Brunton SL, Proctor JL, Tu JH, Kutz JN (2015) Compressive sampling and dynamic mode decom-
position. To appear in the J Comput Dyn. Available: arXiv:1312.5186
Brunton SL, Proctor JL, Kutz JN (2016) Discovering governing equations from data by sparse
identification of nonlinear dynamical systems. Proc Natl Acad Sci 113(15):3932–3937
Brunton SL, Tu JH, Bright I, Kutz JN (2014) Compressive sensing and low-rank libraries for
classification of bifurcation regimes in nonlinear dynamical systems. SIAM J Appl Dyn Syst
13(4):1716–1732
Brunton BW, Johnson LA, Ojemann JG, Kutz JN (2016) Extracting spatial-temporal coherent
patterns in large-scale neural recordings using dynamic mode decomposition. J Neurosci Methods
258:1–15
Burgers JM (1948) A mathematical model illustrating the theory of turbulence. Adv Appl Mech
1:171–199
Candès EJ (2006) Compressive sensing. In: Proceedings of the international congress of mathemat-
ics
Candès EJ, Romberg J, Tao T (2006) Stable signal recovery from incomplete and inaccurate measurements.
Commun Pure Appl Math 59(8):1207–1223
Candès EJ, Tao T (2006) Near optimal signal recovery from random projections: Universal encoding
strategies? IEEE Trans Inf Theory 52(12):5406–5425
Candès EJ, Romberg J, Tao T (2006) Robust uncertainty principles: exact signal reconstruction
from highly incomplete frequency information. IEEE Trans Inf Theory 52(2):489–509
Carlberg K, Barone M, Anti H (2017) Galerkin v. least-squares petrov-galerkin projection in non-
linear model reduction. J Comput Phys 330:693–734
Champion K, Lusch B, Kutz JN, Brunton SL (2019) Data-driven discovery of coordinates and
governing equations. Proc Natl Acad Sci 116(45):22445–22451
Champion K, Zheng P, Aravkin AY, Brunton SL, Kutz JN (2020) A unified sparse optimization
framework to learn parsimonious physics-informed models from data. IEEE Access 8:169259–
169271
Chen KK, Tu JH, Rowley CW (2012) Variants of dynamic mode decomposition: Boundary condi-
tion, Koopman, and Fourier analyses. J Nonlinear Sci 22(6):887–915
Chen B, Huang K, Raghupathi S, Chandratreya I, Du Q, Lipson H (2022) Automated discovery of
fundamental variables hidden in experimental data. Nat Comput Sci 2(7):433–442
Champion KP, Brunton SL, Kutz JN (2019) Discovery of nonlinear multiscale systems: Sampling
strategies and embeddings. SIAM J Appl Dyn Syst 18(1):312–333
Cole JD (1951) On a quasi-linear parabolic equation occurring in aerodynamics. Quart Appl Math
9:225–236
Courant R, Hilbert D (2008) Methods of mathematical physics: partial differential equations. Wiley,
New York
Dam M, Brøns M, Rasmussen JJ, Naulin V, Hesthaven JS (2017) Sparse identification of a predator-
prey system from simulation data of a convection model. Phys Plasmas 24(2):022310
Dawson STM, Hemati MS, Williams MO, Rowley CW (2016) Characterizing and correcting for
the effect of sensor noise in the dynamic mode decomposition. Exp Fluids 57(3):1–19
Demo N, Tezzele M, Rozza G (2018) Pydmd: Python dynamic mode decomposition. J Open Source
Softw 3(22):530
de Silva BM, Champion K, Quade M, Loiseau J-C, Kutz JN, Brunton SL (2020) Pysindy: a python
package for the sparse identification of nonlinear dynamics from data. arXiv:2004.08424
Donoho DL (2006) Compressed sensing. IEEE Trans Inf Theory 52(4):1289–1306
Duke D, Soria J, Honnery D (2012) An error analysis of the dynamic mode decomposition. Exp
Fluids 52(2):529–542
Erichson NB, Brunton SL, Kutz JN (2016) Compressed dynamic mode decomposition for real-time
object detection. J Real-Time Image Process
Erichson NB, Voronin S, Brunton SL, Kutz JN (2019) Randomized matrix decompositions using
R. J Stat Softw 89(11):1–48
Fasel U, Kutz JN, Brunton BW, Brunton SL (2021) Ensemble-sindy: Robust sparse model discovery
in the low-data, high-noise limit, with active learning and control. arXiv:2111.10992
Gin CR, Shea DE, Brunton SL, Kutz JN (2020) DeepGreen: Deep learning of Green’s functions
for nonlinear boundary value problems. arXiv:2101.07206
Gin C, Lusch B, Brunton SL, Kutz JN (2021) Deep learning models for global coordinate transfor-
mations that linearise pdes. Eur J Appl Math 32(3):515–539
Gonzalez-Garcia R, Rico-Martinez R, Kevrekidis IG (1998) Identification of distributed parameter
systems: A neural net based approach. Comput Chem Eng 22:S965–S968
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press
Grosek J, Kutz JN (2014) Dynamic mode decomposition for real-time background/foreground
separation in video. arXiv:1404.7592
Haberman R (1983) Elementary applied partial differential equations, vol 987. Prentice Hall Engle-
wood Cliffs, NJ
Hemati MS, Rowley CW, Deem EA, Cattafesta LN (2017) De-biasing the dynamic mode decom-
position for applied Koopman spectral analysis. Theor Comput Fluid Dyn 31(4):349–368
Hesthaven JS, Rozza G, Stamm B et al (2016) Certified reduced basis methods for parametrized
partial differential equations, vol 590. Springer, Berlin
Hirsh SM, Barajas-Solano DA, Kutz JN (2021) Sparsifying priors for bayesian uncertainty quan-
tification in model discovery. arXiv:2107.02107
Hirsh SM, Ichinaga SM, Brunton SL, Kutz JN, Brunton BW (2021) Structured time-delay models
for dynamical systems with connections to frenet-serret frame. arXiv:2101.08344
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Hoffmann M, Fröhner C, Noé F (2019) Reactive sindy: Discovering governing reactions from
concentration data. J Cheml Phys 150(2):025101
Holmes P, Lumley JL, Berkooz G, Rowley CW (2012) Turbulence, coherent structures, dynamical
systems and symmetry. Cambridge university Press
Hopf E (1950) The partial differential equation uₜ + uuₓ = μuₓₓ. Commun Pure Appl Math 3:201–230
Horrocks J, Bauch CT (2020) Algorithmic discovery of dynamic models from infectious disease
data. Sci Rep 10(1):1–18
Jovanović MR, Schmid PJ, Nichols JW (2014) Sparsity-promoting dynamic mode decomposition.
Phys Fluids 26(2):024103
Kaheman K, Brunton SL, Kutz JN (2020) Automatic differentiation to simultaneously identify
nonlinear dynamics and extract noise probability distributions from data. arXiv:2009.08810
Kaheman K, Kutz JN, Brunton SL (2020) Sindy-pi: A robust algorithm for parallel implicit sparse
identification of nonlinear dynamics. arXiv:2004.02322
Kamb M, Kaiser E, Brunton SL, Kutz JN (2020) Time-delay observables for Koopman: Theory
and applications. SIAM J Appl Dyn Syst 19(2):886–917
Kaptanoglu AA, de Silva BM, Fasel U, Kaheman K, Callaham JL, Delahunt CB, Champion K,
Loiseau J-C, Kutz JN, Brunton SL (2021) Pysindy: A comprehensive python package for robust
sparse system identification. arXiv:2111.08481
Kaptanoglu AA, Morgan KD, Hansen CJ, Brunton SL (2020) Characterizing magnetized plasmas
with dynamic mode decomposition. Phys Plasmas 27:032108
Keener JP (2018) Principles of applied mathematics: transformation and approximation. CRC Press
Kutz JN (2013) Data-driven modeling and scientific computation: methods for complex systems
and big data. Oxford University Press
Kutz JN (2020) Advanced differential equations: Asymptotics and perturbations. arXiv:2012.14591
Kutz JN, Brunton SL, Brunton BW, Proctor JL (2016) Dynamic mode decomposition: data-driven
modeling of complex systems. SIAM
Kutz JN, Fu X, Brunton SL (2016) Multiresolution dynamic mode decomposition. SIAM J Appl
Dyn Syst 15(2):713–735
Kontolati K, Goswami S, Shields MD, Em Karniadakis G (2022) On the influence of over-
parameterization in manifold based surrogates and deep neural operators. arXiv:2203.05071
Lange H, Brunton SL, Kutz N (2020) From Fourier to Koopman: Spectral methods for long-term
time series prediction. arXiv:2004.00574
Li S, Kaiser E, Laima S, Li H, Brunton SL, Kutz JN (2019) Discovering time-varying aerodynam-
ics of a prototype bridge by sparse identification of nonlinear dynamical systems. Phys Rev E
100(2):022220
Liu Y, Kutz JN, Brunton SL (2020) Hierarchical deep learning of multiscale differential equation
time-steppers. arXiv:2008.09768
Lusch B, Kutz JN, Brunton SL (2018) Deep learning for universal linear embeddings of nonlinear
dynamics. Nat Commun 9(1):4950
Mackey A, Schaeffer H, Osher S (2014) On the compressive spectral method. Multiscale Model
Simul 12(4):1800–1827
Mamakoukas G, Castano M, Tan X, Murphey T(2019) Local Koopman operators for data-driven
control of robotic systems. In: Proceedings of “Robotics: Science and Systems 2019”, Freiburg
im Breisgau. IEEE
Mamakoukas G, Castano M, Tan X, Murphey T (2020) Derivative-based Koopman operators for
real-time control of robotic systems. arXiv:2010.05778
Mangan NM, Brunton SL, Proctor JL, Kutz JN (2016) Inferring biological networks by sparse
identification of nonlinear dynamics. IEEE Trans Mol, Biol Multi-Scale Commun 2(1):52–63
Mann J, Kutz JN (2016) Dynamic mode decomposition for financial trading strategies. In: Quanti-
tative finance, pp 1–13
Noack BR, Afanasiev K, Morzynski M, Tadmor G, Thiele F (2003) A hierarchy of low-dimensional
models for the transient and post-transient cylinder wake. J Fluid Mech 497:335–363
Noack BR, Morzynski M, Tadmor G (2011) Reduced-order modelling for flow control, vol 528.
Springer Science & Business Media
Ozoliņš V, Lai R, Caflisch R, Osher S (2013) Compressed modes for variational problems in
mathematics and physics. Proc Natl Acad Sci 110(46):18368–18373
Parish EJ, Carlberg KT (2020) Time-series machine-learning error models for approximate solutions
to parameterized dynamical systems. Comput Methods Appl Mech Eng 365:112990
Proctor JL, Brunton SL, Brunton BW, Kutz JN (2014) Exploiting sparsity and equation-free archi-
tectures in complex systems. Eur Phys J Spec Top 223(13):2665–2684
Proctor JL, Brunton SL, Kutz JN (2016) Dynamic mode decomposition with control. SIAM J Appl
Dyn Syst 15(1):142–161
Proctor JL, Eckhoff PA (2015) Discovering dynamic patterns from infectious disease data using
dynamic mode decomposition. Int Health 7(2):139–145
Qin T, Wu K, Xiu D (2019) Data driven governing equations approximation using deep neural
networks. J Comput Phys 395:620–635
Quarteroni A, Manzoni A, Negri F (2015) Reduced basis methods for partial differential equations:
an introduction, vol 92. Springer, Berlin
Raissi M, Em Karniadakis G (2018) Hidden physics models: Machine learning of nonlinear partial
differential equations. J Comput Phys 357:125–141
Regazzoni F, Dede L, Quarteroni A (2019) Machine learning for fast and reliable solution of time-
dependent differential equations. J Comput Phys 397:108852
Rowley CW, Mezić I, Bagheri S, Schlatter P, Henningson DS (2009) Spectral analysis of nonlinear
flows. J Fluid Mech 645:115–127
Rudy SH, Brunton SL, Proctor JL, Kutz JN (2017) Data-driven discovery of partial differential
equations. Sci Adv 3(4):e1602614
Rudy S, Alla A, Brunton SL, Kutz JN (2019) Data-driven identification of parametric partial dif-
ferential equations. SIAM J Appl Dyn Syst 18(2):643–660
Sashidhar D, Kutz JN (2021) Bagging, optimized dynamic mode decomposition (bop-dmd) for
robust, stable forecasting with spatial and temporal uncertainty-quantification. arXiv:2107.10878
Scherl I, Strom B, Shang JK, Williams O, Polagye BL, Brunton SL (2020) Robust principal com-
ponent analysis for particle image velocimetry. Phys Rev Fluids 5(054401)
Schmid PJ (2010) Dynamic mode decomposition of numerical and experimental data. J Fluid Mech
656:5–28
Schmid PJ, Sesterhenn J (2008) Dynamic mode decomposition of numerical and experimental data.
In: 61st annual meeting of the APS division of fluid dynamics. American Physical Society
Schmidt M, Lipson H (2009) Distilling free-form natural laws from experimental data. Science
324(5923):81–85
Sorokina M, Sygletos S, Turitsyn S (2016) Sparse identification for nonlinear optical communication
systems: Sino method. Opt Express 24(26):30433–30443
Susuki Y, Mezić I, Hikihara T (2009) Coherent dynamics and instability of power grids.
repository.kulib.kyoto-u.ac.jp
Susuki Y, Mezic I (2011) Nonlinear Koopman modes and coherency identification of coupled swing
dynamics. IEEE Trans Power Syst 26(4):1894–1904
Takeishi N, Kawahara Y, Yairi T (2017) Subspace dynamic mode decomposition for stochastic
Koopman analysis. Phys Rev E 96(3):033310
Taylor R, Kutz JN, Morgan K, Nelson BA (2018) Dynamic mode decomposition for plasma diag-
nostics and validation. Rev Sci Instrum 89(5):053501
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc B, pp 267–288
Tran G, Ward R (2017) Exact recovery of chaotic systems from highly corrupted data. Multiscale
Model Simul 15(3):1108–1129
Tropp JA, Gilbert AC (2007) Signal recovery from random measurements via orthogonal matching
pursuit. IEEE Trans Inf Theory 53(12):4655–4666
Tu JH, Rowley CW, Luchtenburg DM, Brunton SL, Kutz JN (2014) On dynamic mode decompo-
sition: theory and applications. J Comput Dyn 1(2):391–421
Wang W, Yang R, Lai YC, Kovanis V, Grebogi C (2011) Predicting catastrophes in nonlinear
dynamical systems by compressive sensing. Phys Rev Lett 106(15):154101
Wiggins S (2003) Introduction to applied nonlinear dynamical systems and chaos, vol 2. Springer
Science & Business Media
Yang Y, Bhouri MA, Perdikaris P (2020) Bayesian differential programming for robust systems
identification under uncertainty. arXiv:2004.06843
Chapter 5
Physics-Informed Neural Networks:
Theory and Applications
5.1 Introduction
Machine learning (ML) methods based on artificial neural networks (ANNs) have
become increasingly used, particularly in data-rich fields such as text, image, and
audio processing, where they have achieved remarkable results, greatly surpassing
the previous state-of-the-art algorithms. Typically, ML methods are most efficient
in applications where the patterns are difficult to describe by clear-cut rules, such
as handwriting recognition. In these cases, it may be more efficient to generate the
rules by a kind of high-dimensional regression between a sufficiently large number
of input–output pairs. However, other techniques based on ANNs have also been suc-
cessful in domains where the rules are relatively easy to describe, such as AlphaZero
(Silver et al. 2017) for game playing and AlphaFold (Jumper et al. 2021) for protein
folding. Many of these advancements have been driven by an increase in compu-
tational capabilities, in particular with regard to Graphics Processing Units (GPUs)
and Tensor Processing Units (TPUs) (Jouppi et al. 2017), but also by theoretical
advances related to the initialization and architecture of the ANNs. In the scientific
community, there has also been increased interest in applying the new developments
in ANNs and ML to solve partial differential equations (PDEs) and other engineering
problems of interest.
An artificial neural network (ANN) is loosely modeled after the structure of the
brain, which is made up of a large number of cells (neurons) which communicate
with their neighbors through electrical signals. Mathematically, an ANN can be
seen as a function u N N : Rn → Rm , which maps n inputs into m outputs. An ANN
is a universal function approximator (Hornik et al. 1989). Therefore, u N N can be
used to interpolate some unknown function from the data given at certain points or to
approximate the solution of a partial differential equation. The function u N N depends
on a collection of parameters (called trainable parameters) which are obtained by
an optimization procedure with the goal of minimizing some user-defined objective
or loss function.
In an ANN, the neurons, or computational units, are organized in layers which are
connected by composition with an activation function as detailed below. Different
types of layers (and activation functions) can be assembled together, according to the
application and the information known about the function to be approximated. There
are several types of ANNs, which include fully connected feed-forward networks,
convolutional neural networks (CNNs), recurrent neural networks (RNNs), residual
neural networks (ResNets), transformers, and others. In the following, we will focus
mostly on the feed-forward neural networks which are among the simplest and can
also be used as building blocks of more complicated architectures.
In this type of network, also called multi-layer perceptron (MLP), the output is
obtained by successive compositions of a linear transformation and a nonlinear acti-
vation function. The network consists of an input layer, an output layer, and any
number of intermediate hidden layers. The function u N N for a network with an n-
dimensional input, and m-dimensional output and k hidden layers can be written
as
u N N = L k ◦ L k−1 ◦ . . . ◦ L 0 (5.1)
with
L i (xi ) = σi (Wi xi + bi ) = xi+1 for i = 0, . . . , k. (5.2)
Fig. 5.1 A fully connected feed-forward neural network with the input, hidden, and output layers
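The composition (5.1)–(5.2) is straightforward to write down explicitly. The following minimal NumPy sketch (the layer sizes and random weights are illustrative, not part of the text) evaluates such a feed-forward network for given weight matrices Wi and bias vectors bi.

import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    # Weight matrix W_i and bias vector b_i of one layer (random for illustration)
    return rng.standard_normal((n_out, n_in)), np.zeros(n_out)

# A network with n = 2 inputs, two hidden layers of 8 neurons, and m = 1 output
layers = [init_layer(2, 8), init_layer(8, 8), init_layer(8, 1)]
activations = [np.tanh, np.tanh, lambda x: x]   # linear activation in the last layer

def u_nn(x, layers, activations):
    # Evaluate u_NN = L_k o ... o L_0 with L_i(x) = sigma_i(W_i x + b_i)
    for (W, b), sigma in zip(layers, activations):
        x = sigma(W @ x + b)
    return x

print(u_nn(np.array([0.5, -0.3]), layers, activations))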
The simplest activation function is the linear activation, which means that σ is simply
the identity function:
σ (x) = x. (5.3)
On a network with no hidden layers, a linear activation function between the input
and output layers can be used to perform a linear regression between the input and
output data. For networks with one or more hidden layers, stacking linear layers is not
useful since a composition of linear activations is still linear. However, linear layers
can be combined with other nonlinear activation functions. For example, linear layers
can be used as the last layer to scale the output to arbitrary values. A non-trainable
linear transformation is often used to normalize the input of a network to speed up
the training (optimization) process, as will be detailed in Sect. 5.2.3.3.
One of the simplest nonlinear activation functions is the piece-wise linear rectified
linear unit (ReLU) function, defined as

ReLU(x) = max(0, x).   (5.4)
It can easily be seen that a single hidden layer with ReLU activation, followed by
a linear activation layer, can approximate exactly piecewise linear functions in one
dimension (Samaniego et al. 2020). Indeed, on a grid with nodes x0 < x1 < . . . < xn ,
the finite element linear hat function Ni (x) can be written as
Ni(x) = (1/hi) ReLU(x − xi−1) − (1/hi + 1/hi+1) ReLU(x − xi) + (1/hi+1) ReLU(x − xi+1),   (5.5)
where h i = xi − xi−1 . This observation can be extended to higher dimensions, where
two hidden layers are enough to approximate piecewise linear simplex elements in
two and more dimensions (He et al. 2020). Further, error bounds for the approxima-
tion of ReLU networks in Sobolev norms are given, e.g. in Petersen and Voigtlaender
(2018), Gühring et al. (2020).
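The identity (5.5) is easy to check numerically. In the sketch below, the grid nodes and the evaluation points are arbitrary illustrative choices.

import numpy as np

relu = lambda x: np.maximum(x, 0.0)

def hat(x, xm1, xi, xp1):
    # Eq. (5.5): linear finite element hat function built from three ReLU units
    h_i, h_ip1 = xi - xm1, xp1 - xi
    return (relu(x - xm1) / h_i
            - (1.0 / h_i + 1.0 / h_ip1) * relu(x - xi)
            + relu(x - xp1) / h_ip1)

x = np.linspace(0.0, 1.0, 11)
print(hat(x, 0.2, 0.5, 0.9))   # zero outside [0.2, 0.9] and equal to 1 at x = 0.5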
5.2.2.3 Sigmoid
The sigmoid activation function is defined as

σ(x) = 1 / (1 + exp(−x)).   (5.6)
This function has an S-shaped form, as shown in Fig. 5.2b. The range of this function
is the interval (0, 1); therefore, it is often used in the output layer of neural networks
used for binary classification tasks, where the output is a probability that the input
belongs to a given class. The function is also differentiable infinitely many times,
resulting in a smooth approximation which is desirable for many applications.
5.2.2.4 Hyperbolic Tangent

The hyperbolic tangent activation function is defined as

tanh(x) = (exp(x) − exp(−x)) / (exp(x) + exp(−x)).   (5.7)
This function looks similar to the sigmoid activation, maintaining the overall S-shape
and smoothness. An important difference is that the range of the outputs is (−1, 1)
which is centered at 0. This makes the tanh activation more suitable for deep networks
without creating a bias toward positive outputs.
5.2.2.5 Swish
The swish activation function is defined as

swish(x) = x σ(x),   (5.8)

where σ(x) is the sigmoid activation. The plot of this function is shown in Fig. 5.2d.
The swish function looks similar to the ReLU activation. However, like sigmoid and
tanh, it is infinitely differentiable.
We note that there are several other activations that have been proposed which are
similar to ReLU and swish, such as Leaky ReLU (Maas et al. 2013), exponential linear
units (ELUs) (Clevert et al. 2015), Gaussian error linear units (GELUs) (Hendrycks
and Gimpel 2016), Mish (Misra 2019), and others. These have been shown to remedy
some of the drawbacks of the previously considered activation functions and provide
a modest improvement on some machine learning tasks, particularly related to image-
based classification and segmentation tasks (Li et al. 2021). However, from the point
of view of function approximation where partial derivatives are involved, tanh or
swish are also well-suited due to their smoothness properties.
In addition to the standard activation functions, which are fixed at each layer, the
so-called adaptive activations have been proposed which depend on some model-
dependent or trainable parameters. In particular, for a given activation function σ (x),
we can define the adaptive version by

σa(x) = σ(a x),   (5.9)

where a is a trainable scaling parameter.
The idea of using trainable parameters in the activation function was proposed in
Agostinelli et al. (2014), and further developed in the context of function and PDE
solution approximation in Jagtap et al. (2020b, a, 2022), Shukla et al. (2020) among
others. Some adaptive or trainable activation functions have a different form, for
example, the original Swish activation proposed in Ramachandran et al. (2017) is of
the form:

σβ(x) = x / (1 + exp(−βx)),   (5.10)

where β is a trainable parameter.
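A trainable activation parameter of this kind is simple to realize in practice. The following sketch (a minimal illustration, not the implementation of the cited works; the layer name and the choice of tanh are assumptions) defines a dense layer whose activation slope a is learned together with the weights.

import tensorflow as tf

class AdaptiveTanhDense(tf.keras.layers.Layer):
    # Dense layer with a tanh activation whose slope a is trainable: sigma_a(x) = tanh(a * x)
    def __init__(self, units):
        super().__init__()
        self.dense = tf.keras.layers.Dense(units)
        self.a = self.add_weight(name="a", shape=(), initializer="ones", trainable=True)

    def call(self, x):
        return tf.tanh(self.a * self.dense(x))

model = tf.keras.Sequential([AdaptiveTanhDense(32), AdaptiveTanhDense(32),
                             tf.keras.layers.Dense(1)])
print(model(tf.random.normal((4, 1))).shape)   # (4, 1)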
5.2.3 Training
As mentioned earlier, the training process involves optimizing the network param-
eters (weights and biases) such that an objective function is minimized. Suppose
the loss function is denoted by L(u N N (x; θ )), where u N N is the neural network and
θ represents the trainable parameters, e.g. the matrices Wi and vectors bi in (5.2).
In the case of regression, a commonly used loss function is the mean square error,
defined as
LMSE(uNN(x; θ)) = (1/N) Σ_{j=1}^{N} |uNN(xj) − yj|² ,   (5.11)
where x j , j = 1, . . . , N are input points at which the ground truth output values y j
are known. For the case of PDE approximations, more complicated loss functions
which contain the partial derivatives of u N N with respect to the inputs can be devised.
Additional terms can be used to incorporate the governing equations and boundary
conditions, as will be detailed in Sect. 5.3. Then the process of training a neural
network can be described as the optimization problem

θ* = arg min_θ L(uNN(x; θ)).   (5.12)
We note that L(uNN(x; θ)) is usually based on the evaluation of uNN or its
derivatives at a finite number of points (called training points); therefore, θ* will
depend in general on the number and location of these points. A careful choice of
L and a proper weighting between its terms is therefore key to ensuring that the
training is successful and the output generalizes well to new inputs.
Finding the optimal weights is usually done by gradient-based methods, such as gra-
dient descent. Parallelization and automatic differentiation methods are key ingredi-
ents in efficient implementations. The optimization method requires the gradients of
a possibly large number of trainable parameters, with many networks containing tens
of millions of parameters. Some are even larger, for example, the GPT-3 language
model uses 175 billion parameters (Floridi and Chiriatti 2020). Therefore, reverse-
mode differentiation, also known as back-propagation (Rumelhart et al. 1986), is
commonly used to compute the gradients with respect to the trainable parameters.
The differentiation process involves a forward pass, during which the neural net-
work output and the loss function are evaluated from a given input, and the operations
involved are recorded in a graph. Then the derivatives are computed in reverse order
of the evaluation, with the intermediate results obtained from the chain rule stored at
the graph nodes (see e.g. Sect. 6.5 in Goodfellow et al. 2016 for details). The remark-
able outcome of this procedure is that the partial derivatives of the loss function with
respect to all the parameters can be evaluated at a cost that is proportional to the
number of floating points operations involved in the forward evaluation.
Using forward mode differentiation, where the partial derivatives are computed in
the order of evaluation, would result in a much higher cost that is also proportional to
the number of parameters, although the memory requirements may be lower (López
et al. 2021). In general, evaluating the partial derivatives (Jacobian) of a function
f : Rn → Rm requires O(n) operations in forward mode, and O(m) operations in
reverse mode. In the context of PDEs, forward mode differentiation may be more
efficient when computing the partial derivatives of the outputs with respect to the
input coordinates, particularly for multiphysics models or other coupled problems
where several solution fields are considered.
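Both differentiation modes are directly available in common frameworks. The sketch below (illustrative network and data, not tied to a specific application in this chapter) obtains the parameter gradients by reverse mode with tf.GradientTape and the derivative of the network output with respect to its input coordinate by forward mode with tf.autodiff.ForwardAccumulator.

import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(16, activation="tanh"),
                             tf.keras.layers.Dense(1)])
x = tf.random.uniform((32, 1))
y = tf.sin(3.0 * x)   # illustrative target data

# Reverse mode (back-propagation): one pass gives dL/dtheta for all parameters
with tf.GradientTape() as tape:
    loss = tf.reduce_mean((model(x) - y) ** 2)
grads = tape.gradient(loss, model.trainable_variables)

# Forward mode: derivative of the scalar output with respect to the input coordinate,
# useful when assembling PDE residuals
with tf.autodiff.ForwardAccumulator(primals=x, tangents=tf.ones_like(x)) as acc:
    u = model(x)
    du_dx = acc.jvp(u)
print(len(grads), du_dx.shape)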
When initializing the training process, particular care is needed for the selection of
the initial value. For example, if all the weights and biases are set to zero, then the
gradients with respect to the weights within a layer will have the same value. In a
gradient descent update with a fixed step size, all the parameters will be updated by
the same amount, resulting in the equivalent of a network with a single neuron per
layer. Part of the recent success of deep neural networks in applications is owed to
better techniques for initializing the values of the network parameters, such as Glorot
(Xavier) (Glorot and Bengio 2010) and He et al. (2015) initialization.
While the initialization method can be seen as a hyperparameter which can be
tuned according to the problem at hand, a commonly used one is Glorot uniform,
where the weights are chosen from a uniform distribution U [−l, l], where
l = √(6 / (nin + nout)),   (5.13)
with n in and n out being the number of input and output neurons for a given layer.
The biases are initialized to zero. This is also the default initialization used in the
TensorFlow deep learning framework.
It can be observed that the nonlinear region of most activation functions σ (x), such
as the ones in Fig. 5.2, is centered in a small interval around x = 0. Therefore, if the
input data is in a region far away from the origin, then the activation will be mostly
constant or linear, which will hinder the performance of gradient descent methods
(see also Sect. 5.2.4.2). To remedy this issue, it is essential to perform a normalization
on the input data, which is just a linear transformation into the interval [−1, 1]. In
particular, for each input neuron, the transformation is given by the formula:
Tnorm(x) = 2 (x − xmin) / (xmax − xmin) − 1,   (5.14)
where xmax and xmin are the maximum and minimum input values, respectively. In
the case where the input values are points in the computational domain, then xmin
and xmax represent the bounding box of the domain. These values must be fixed for
training and testing; otherwise, incorrect results will be obtained.
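In code, the transformation (5.14) amounts to storing the bounding box of the training inputs and reusing it unchanged for any later evaluation, as in the following sketch (the sample data are illustrative).

import numpy as np

def make_normalizer(x_train):
    # Bounding box of the training inputs; must be reused unchanged at test time
    x_min, x_max = x_train.min(axis=0), x_train.max(axis=0)
    return lambda x: 2.0 * (x - x_min) / (x_max - x_min) - 1.0   # Eq. (5.14)

x_train = np.random.uniform(low=[0.0, -5.0], high=[8.0, 2.0], size=(100, 2))
normalize = make_normalizer(x_train)
print(normalize(x_train).min(axis=0), normalize(x_train).max(axis=0))   # about [-1, -1] and [1, 1]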
Among the problems that can occur when training neural networks, underfitting and overfitting are the most common. Underfitting can occur when the neural network does not have enough
approximation capability to satisfactorily fit the data or solve the problem at hand. It
can also occur when the optimization has not converged, for example, because too
few iterations have been performed, or because the learning rate is too low or too
high. Underfitting can be typically identified when both the training and validation
losses are higher than acceptable values.
Overfitting, on the other hand, can appear when the network capacity is larger
than required. In this scenario, the training data is well approximated but other data
points may be far off from the actual values, or in machine learning parlance, the
model “does not generalize” well. A similar case, where fitting exactly a small dataset
does not guarantee that the target function is well approximated, occurs in interpolation
by high-order polynomials, where the interpolant can oscillate wildly between the
interpolation points. In this case, the training loss value decreases to a low value
(even zero), while the validation loss can be much higher.
A good strategy to avoid overfitting or underfitting is to monitor both the training
and validation losses and to stop the training when the testing loss begins to increase.
To illustrate, the results for regression of the function u(x) = sin(π x) for x ∈ [−1, 1]
are shown in Fig. 5.3. A random uniform noise with magnitude in the interval (0, 0.1)
was added to the training and validation data, which consists of 201 and 50 points,
respectively. A neural network with two hidden layers consisting of 64 neurons has
been used, together with the tanh activation function for the first two layers and linear
Fig. 5.3 Fitting a noisy function and the loss convergence history
activation in the last layer. The ADAM optimizer with the default parameters and
learning rate of 0.001 is used to minimize the mean square error of the difference
between the predicted and training values.
We observe from Fig. 5.3a that after 300 iterations, the neural network can start to
approximate the sinusoidal function, but it is still quite far away from the actual shape
(underfitting). The training loss value at this stage is 0.1129, while the validation loss
is 0.1393. After 10000 iterations, the approximation is already quite good, with only
a small error between the prediction and the actual function (without noise) as shown
in Fig. 5.3b. Here, the training loss is 0.0033 and the validation loss is quite close
at 0.0035. Next, if we continue training, we start to observe that after many more
iterations, the training and validation loss start to diverge (see Fig. 5.3d). After 100000
iterations, we notice that the predicted function has some oscillations and spikes as it
tries to capture the noise in the data as shown in Fig. 5.3c. At this stage, the training
loss is 0.0026 and the test loss is 0.0037.
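An experiment of this kind is easy to reproduce. The following sketch sets up the regression described above (201 training and 50 validation points, noise in (0, 0.1), two hidden layers of 64 tanh neurons, ADAM with learning rate 0.001); the random seed, the number of epochs, and the use of model.fit are illustrative choices, and the exact loss values will differ from run to run.

import numpy as np
import tensorflow as tf

rng = np.random.default_rng(0)
x_train = np.linspace(-1.0, 1.0, 201)[:, None]
x_val = rng.uniform(-1.0, 1.0, (50, 1))
noise = lambda n: rng.uniform(0.0, 0.1, (n, 1))
y_train = np.sin(np.pi * x_train) + noise(201)
y_val = np.sin(np.pi * x_val) + noise(50)

model = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="tanh"),
                             tf.keras.layers.Dense(64, activation="tanh"),
                             tf.keras.layers.Dense(1)])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")
history = model.fit(x_train, y_train, validation_data=(x_val, y_val),
                    epochs=2000, batch_size=201, verbose=0)
# Comparing history.history["loss"] with ["val_loss"] shows when overfitting sets in
print(history.history["loss"][-1], history.history["val_loss"][-1])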
Two other types of problems encountered in training of artificial neural networks are
those related to the magnitude of the gradients. The vanishing gradients phenomenon
occurs when the derivative of the loss function with respect to the training variables
is very small. This can mean that the objective function is very close to a stationary
point, which can also be a saddle point or some other point far from the global
minimum. The end result is very slow or no convergence of the loss function. A
common remedy for this problem is to perform a normalization of the input data (see
also Sect. 5.2.3.3). Otherwise, changing the network architecture or the activation
function (for example using rectified activations like ReLU or Swish) may also be
helpful, since S-shaped activations like sigmoid or hyperbolic tangent are particularly
susceptible to vanishing gradients.
Exploding gradients, on the contrary, refer to the occurrence of too large deriva-
tives of the loss function with respect to the trainable parameters. In extreme cases,
the gradients can overflow, resulting in not-a-number (NaN) values for the loss.
Another possible effect is unstable training, where the loss value oscillates without
converging to the optimal value. Possible remedies for this problem include using a
smaller learning rate, and adding residual (or skip) connections to the neural network
(Philipp et al. 2018).
The ReLU activation function may suffer from a related problem known as “dying
ReLU”, which occurs when some neurons become inactivated, in the sense that they
always output zero for all the inputs. This can happen when a large negative bias
value is learned for a particular neuron. Because the derivative of the constant zero
function is also zero, it is not possible to recover a “dead” ReLU neuron, resulting
in a diminished approximation capability.
5.2.5 Optimizers
We will now briefly describe the optimization algorithms commonly used to train
(i.e. minimize the loss function) a neural network. First, we mention that two types of
optimization strategies can be employed: full-batch training and mini-batch training.
In the former, the entire data set is used during a forward pass through the network
and the gradients with respect to all the data points are computed in one step. In
mini-batch training, on the other hand, the training data is split into several sub-
sets of (approximately) the same size called mini-batches. Then an optimization
sub-step is taken with respect to each mini-batch. When the entire dataset is seen
by the optimization algorithm once, then a training epoch is completed. In general,
first-order optimization methods, like gradient descent, are commonly used with
mini-batch training, while algorithms that make use of (approximations of) second
derivative information use full-batch training. A detailed survey of optimization
methods used in machine learning has been presented in Sun et al. (2019).
The gradient descent method is the simplest gradient-based optimizer. The idea is to
minimize the function in the direction of the gradient evaluated at the current guess
by a fixed step size (also called the learning rate). If the objective function is L(w),
then an optimization step can be written as

w(t+1) = w(t) − η ∇w L(w(t)),   (5.15)
where η is the learning rate. In the case of mini-batch training, since the mini-batches
are typically randomly selected, the method is called stochastic gradient descent
(SGD). Using mini-batches has been shown to improve the robustness, allowing the
optimizer to find the global optima (or better local optima) even for non-convex
problems (De Sa et al. 2015; Mertikopoulos et al. 2020).
This optimization method, proposed in Kingma and Ba (2014), replaces the fixed
learning rate of the conventional SGD with a variable step-size based on the momen-
tum, which can be seen as a linear combination of the gradients of the current and
previous time steps.
An update of the ADAM optimizer from step t to step t + 1 has the form:

m(t+1) = β1 m(t) + (1 − β1) ∇w L(w(t)),   (5.16)
v(t+1) = β2 v(t) + (1 − β2) (∇w L(w(t)))²,   (5.17)
m̂ = m(t+1) / (1 − β1^t),   (5.18)
v̂ = v(t+1) / (1 − β2^t),   (5.19)
w(t+1) = w(t) − η m̂ / (√v̂ + ε).   (5.20)
Here, m and v are the moment vectors which are initialized to zeros; β1, β2, and ε are
constants which are usually initialized to β1 = 0.9, β2 = 0.999, and ε = 10⁻⁸; and
η is the learning rate. β1^t and β2^t denote β1 and β2 to the power t, and (∇w L(w(t)))²
denotes the element-wise squaring of the gradient vector. Because the momentum
vectors are initialized to zeros, a bias-correction is introduced in (5.18) and (5.19).
This technique can smooth out the oscillations in the gradients and usually improves
the convergence compared to the standard SGD optimizer.
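In a typical implementation, the update rules above are not coded by hand; one computes the gradients with a tape and delegates the update to an optimizer object. The sketch below first performs the plain gradient descent update (5.15) explicitly on a toy quadratic loss and then repeats the loop with ADAM; the loss, the learning rates, and the iteration counts are illustrative assumptions.

import tensorflow as tf

w = tf.Variable([2.0, -1.0])
L = lambda w: tf.reduce_sum((w - tf.constant([0.5, 1.5])) ** 2)   # toy convex loss

# Plain gradient descent, Eq. (5.15): w <- w - eta * grad L(w)
eta = 0.1
for _ in range(50):
    with tf.GradientTape() as tape:
        loss = L(w)
    w.assign_sub(eta * tape.gradient(loss, w))
print(w.numpy())   # approaches the minimizer [0.5, 1.5]

# The same loop with ADAM delegates the update rules (5.16)-(5.20) to the optimizer
w.assign([2.0, -1.0])
adam = tf.keras.optimizers.Adam(learning_rate=0.05)
for _ in range(200):
    with tf.GradientTape() as tape:
        loss = L(w)
    adam.apply_gradients([(tape.gradient(loss, w), w)])
print(w.numpy())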
The gradient descent-based methods approximate the loss at each step by a linear
function without taking into account the curvature information. Faster convergence
can be obtained by using Newton algorithms, which involve computing the second
derivatives. Nevertheless, for a large number of parameters, the cost of Newton’s
method in terms of memory storage and floating point operations can be prohibitive,
since the Hessian matrix has size n × n, where n is the number of parameters. A
more feasible alternative is the family of quasi-Newton methods, like the Broyden–
Fletcher–Goldfarb–Shanno (BFGS) algorithm (Broyden 1970; Fletcher 1970; Gold-
farb 1970; Shanno 1970) or the limited memory version L-BFGS (Liu and Nocedal
1989), which are already implemented in machine learning frameworks like PyTorch
or TensorFlow Probability (Dillon et al. 2017). Another algorithm that can be used
for problems with a small number of parameters is the Levenberg–Marquardt algo-
rithm (Levenberg 1944; Marquardt 1963), which can be seen as a combination of
the Gauss–Newton method and gradient descent.
For approximating PDE solutions with PINNs, the same basic network architecture and training procedure described above
is used; however, some important differences can be noted in the form of the objective
function, in particular regarding whether the strong or weak form of the PDE is used.
The classical PINNs are collocation-based, meaning that the neural network aims to
approximate the strong form of the governing equation at a set of collocation points.
Because the collocation points can be randomly distributed inside the domain and no
mesh is needed, this method belongs to the category of mesh-free methods. Moreover,
once the “building blocks” for constructing the neural network and evaluating the
partial derivatives with respect to the inputs are obtained, the implementation is
relatively simple.
In particular, suppose that the governing PDE is of the form:
F(u(x), ∂u(x)/∂x1, . . .) = 0 for x ∈ Ω,   (5.21)
G(u(x), ∂u(x)/∂n, . . .) = 0 for x ∈ Γ,   (5.22)
where F represents a differential operator for the domain interior, G is a differential
operator for the boundary conditions, u is the unknown function, Ω and Γ are the
computational domain and its boundary, and n is the outer normal vector to the
boundary. The interior differential operator may contain any order of derivatives with
respect to the inputs, while the boundary operator may contain any order of derivative
with respect to the outer normal vector for Neumann-type boundary conditions.
The loss function for a neural network u N N (x; θ ) with trainable parameters θ
(which include the weights and biases for each layer) can be constructed based
on the “mean square error” (MSE) evaluated at a set of Nint interior collocation
points {x_i^int}, i = 1, . . . , Nint, and a set of Nbnd boundary collocation points
{x_j^bnd}, j = 1, . . . , Nbnd, as

Lcoll(θ) = (λ1/Nint) Σ_{i=1}^{Nint} F(uNN(x_i^int; θ), ∂uNN(x_i^int; θ)/∂x1, . . .)²
         + (λ2/Nbnd) Σ_{j=1}^{Nbnd} G(uNN(x_j^bnd; θ), ∂uNN(x_j^bnd; θ)/∂n, . . .)² .   (5.23)
Here λ1 and λ2 are weight terms; usually, choosing λ2 >> λ1 helps to speed up
convergence by ensuring that the boundary conditions are satisfied. Adaptive methods
for choosing the weights have also been proposed in Wang et al. (2022). In the case
of time-dependent problems, the classical PINNs use a space–time discretization,
where the time is considered as an additional dimension.
with Γ denoting the portion of the boundary over which the boundary term is evaluated.
Then we can define the loss function of the form:

Lenergy(θ) = ∫_Ω Hint(uNN) dΩ + ∫_Γ Hbnd(uNN) dΓ.   (5.25)

The integrals are approximated by numerical quadrature with Qint interior quadrature points q_i^int (weights w_i^int) and Qbnd boundary quadrature points q_j^bnd (weights w_j^bnd), which gives

Lenergy(θ) ≈ Σ_{i=1}^{Qint} Hint(uNN(q_i^int)) w_i^int + Σ_{j=1}^{Qbnd} Hbnd(uNN(q_j^bnd)) w_j^bnd.   (5.26)
When additional constraints are needed, such as Dirichlet boundary conditions, addi-
tional terms can be added to the loss function, similarly to (5.23). Alternatively, one
can impose the Dirichlet boundary conditions strongly (i.e. exactly) by modifying the
output of the neural network to match the prescribed boundary data. In particular, we
consider the computed solution to be ũNN, satisfying ũNN(x) = uD(x) for x ∈ ΓD,
where uD is the Dirichlet boundary condition specified on the boundary ΓD, to be of
the form:

ũNN(x) = g(x) + d(x) uNN(x),   (5.27)
where g(x) matches the prescribed boundary values on ΓD and d(x) is a function that vanishes on ΓD. A drawback of the energy-based formulation is that it requires an integration mesh, and it is more difficult to verify that the solution is
correct within a certain tolerance, since the objective function should converge to
some non-zero minimum which is not known in advance. A possible approach to
overcome this problem is to compute the residual loss for validation, which can then
also be used to adaptively adjust the number of integration points, as proposed in
Goswami et al. (2020).
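To make the energy-based loss (5.25)–(5.26) concrete, the following sketch applies it to the 1D heat conduction problem considered later in Sect. 5.4.1 (−κ T'' = q with T(0) = T(1) = 0), whose potential energy density is Hint = ½ κ (T')² − q T. The midpoint quadrature rule, the strong imposition of the boundary conditions via (5.27) with g = 0 and d(x) = x(1 − x), and the network and optimizer settings are illustrative assumptions, not the authors' implementation.

import numpy as np
import tensorflow as tf

kappa = 0.5
q = lambda x: 15.0 * x - 2.0
net = tf.keras.Sequential([tf.keras.layers.Dense(32, activation="tanh"),
                           tf.keras.layers.Dense(32, activation="tanh"),
                           tf.keras.layers.Dense(1)])

# Midpoint quadrature on [0, 1]: points q_i and weights w_i as in Eq. (5.26)
n_q = 200
xq = tf.constant(((np.arange(n_q) + 0.5) / n_q)[:, None], dtype=tf.float32)
wq = 1.0 / n_q

def T_tilde(x):
    # Strong imposition of T(0) = T(1) = 0 via Eq. (5.27) with g = 0, d(x) = x (1 - x)
    return x * (1.0 - x) * net(x)

def energy_loss():
    with tf.GradientTape() as tape:
        tape.watch(xq)
        T = T_tilde(xq)
    dT_dx = tape.gradient(T, xq)
    # Hint = 1/2 kappa (T')^2 - q T  (potential energy density of -kappa T'' = q)
    H = 0.5 * kappa * dT_dx ** 2 - q(xq) * T
    return tf.reduce_sum(H) * wq

opt = tf.keras.optimizers.Adam(1e-3)
for _ in range(2000):
    with tf.GradientTape() as tape:
        loss = energy_loss()
    opt.apply_gradients(zip(tape.gradient(loss, net.trainable_variables),
                            net.trainable_variables))
print(float(loss))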
By using a small set of training or input data (e.g. initial and boundary conditions
and/or measured data) as well as governing physical laws, PINNs attempt to approx-
imate the solution of the problem. Complex nonlinear systems and phenomena in
physics and engineering are described by differential equations.
PINNs have shown their capabilities to solve both forward and inverse problems in
science and engineering. A forward problem can be defined as a problem of finding a
particular effect of a given cause utilizing a physical or mathematical model, whereas
an inverse problem refers to finding causes from the given effects (Vauhkonen et al.
2016). We can investigate the 1D steady-state heat equation with the source term to
give more concrete examples of forward and inverse problems.
Let us consider a rod with unit length along the x-axis and the heat flowing through
this rod with a heat source as our model. We can represent the temperature at location
x on the rod as T (x). Under certain assumptions, such as the rod being perfectly
insulated, with the source term q(x) being known, then the governing equation can
be written as
κ d²T/dx² + q(x) = 0,   (5.28)
where κ > 0 is the thermal diffusivity constant. Finding temperature at any location
on the rod is a forward problem. On the other hand, finding the constant κ, which
is a rod feature, from observed temperature data is a good example of an inverse
problem. These examples will be detailed in Sects. 5.4.1 and 5.4.2.
To summarize, the procedures explained in the previous sections for solving
differential equations with PINNs will become tangible with the numerical
applications in this section. The solution estimation of PINNs for both forward and
inverse problems will be discussed by providing simple and complex examples.
In the introductory part of this section, the definition of a forward problem is given as
finding the particular effect of a given cause using a physical or mathematical model.
Let us remember the 1D steady-state heat conduction problem with a heat source.
As we discussed before, the governing equation for this example is given in (5.28).
Let the thermal diffusivity constant be κ = 0.5 and x denote the location on the rod.
Here, the source term is given as q(x) = 15x − 2. We assume that the temperatures
at both ends are 0. Then we can re-write (5.28) as
d²T/dx² + q(x)/κ = 0,   x ∈ [0, 1],
q(x) = 15x − 2,
κ = 0.5,   (5.29)
T(0) = T(1) = 0.
The first step to solve this problem is to discretize the domain with uniform or
randomly sampled collocation points. Then the neural network will process these
collocation points through its linearly connected layers consisting of neurons with
nonlinear activation functions. Of course, the outcome of the first forward propaga-
tion will not be compatible with the true solution. Therefore, at this point, the physics
and boundary knowledge will guide the neural network to approximate the ground
truth by updating the weights and biases of the neural network. Let us elaborate on
this step by step and reinforce these steps with code snippets. Note that these codes
are written with TensorFlow version 2.x with the Keras API.
We first generate 100 equidistant points in our domain. Here, the choice of the
number of points is up to the user. However, it should be noted that the number of
points also has some influence on the number of iterations or network size required
to have results with similar accuracy. The ADAM optimizer with a learning rate of
0.005 is used for this example. An input layer, three hidden layers with 32 neurons
equipped with tanh activation function, and an output layer form the neural network
(see Fig. 5.4). The input and output layers have one neuron each since the input for
the network is only one spatial dimension, and the output is the temperature at these
points. By setting the number of iterations to 1000 and introducing the boundary
condition data in TensorFlow tensors, we complete the initial settings of our model
(see Listing 1).
Fig. 5.4 The architecture of the feed-forward neural network for 1D steady-state heat conduction
problem. The network consists of one input layer, one output layer, and three hidden layers with
32 neurons each. a is the activation function. Superscripted numbers denote the layer number, and
subscripted ones denote the neuron number in the relevant layer
def build_model():   # start of Listing 1 not shown; the function name is illustrative
    # Three hidden layers with 32 neurons and tanh activation
    model = tf.keras.Sequential(
        [tf.keras.layers.Dense(32, activation="tanh") for _ in range(3)])
    model.add(tf.keras.layers.Dense(1))   # Output is one-dimensional
    return model
Then we define our loss function, which is composed of two parts, the boundary
loss and the physics loss, as formulated in (5.30). Here, the loss term tells us how far
away our model is from "reality". To measure these loss terms, we will use the
mean square error formulation introduced in Sect. 5.2.3, Eq. (5.11).
Constructing the boundary loss is easier than constructing the physics loss. Our model's
predictions should be compatible with the prescribed boundary conditions, which
are T (0) = 0 and T (1) = 0 for our case. Thus, our goal should be to minimize the
mean square error between our model’s temperature prediction at both ends of the
rod and the real temperature values at these points, which must be 0. The boundary
condition loss is given by (5.31)
LBCs = (λ1/NB) Σ_{j=1}^{NB} |TNN(xj) − yj|² ,   (5.31)
where N B = 2 since we have boundary condition data for two points which are
T (0) = T (1) = 0. The regularization term λ1 is taken as 1.
We also need to provide information about the interior points to get reasonable
results. Although we do not know the temperature data for intermediate points on
the rod, we know those points have to satisfy some physical laws that we derived in
(5.28). Or in other words, our temperature prediction needs to satisfy (5.28). When
we take the derivative of the temperature prediction of the network with respect to x
two times and sum this result with the source term q(x) divided by κ, this summation
must yield 0. Thus, the physics loss for our example becomes
LPhysics = (λ2/NP) Σ_{j=1}^{NP} | (d²TNN/dx²)|_{x=xj} + q(xj)/κ |² ,   (5.32)

where NP = 100 is the number of interior collocation points.
Again, the regularization term λ2 is taken as 1. Now we can combine the boundary
conditions loss and physics loss functions to form our model’s loss function (see
(5.30)), which will guide the model to make better predictions in each iteration.
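The listing that assembles these loss terms and the optimization step is not reproduced here in full; the following is a minimal sketch of one possible implementation of (5.31), (5.32), and the combined loss. The names model, build_model, x_interior, x_boundary, loss_fn, and train_step, as well as the use of tf.function and nested gradient tapes for the second derivative, are illustrative assumptions rather than the original listing.

import numpy as np
import tensorflow as tf

model = build_model()                        # network from Listing 1 (name illustrative)
optimizer = tf.keras.optimizers.Adam(learning_rate=0.005)
N = 1000                                     # number of training iterations

kappa = 0.5
q = lambda x: 15.0 * x - 2.0
x_interior = tf.constant(np.linspace(0.0, 1.0, 100)[:, None], dtype=tf.float32)
x_boundary = tf.constant([[0.0], [1.0]])     # boundary points, where T = 0
T_boundary = tf.zeros((2, 1))

def loss_fn():
    # Boundary loss, Eq. (5.31)
    loss_bc = tf.reduce_mean((model(x_boundary) - T_boundary) ** 2)
    # Physics loss, Eq. (5.32): the second derivative via nested gradient tapes
    with tf.GradientTape() as t2:
        t2.watch(x_interior)
        with tf.GradientTape() as t1:
            t1.watch(x_interior)
            T = model(x_interior)
        dT_dx = t1.gradient(T, x_interior)
    d2T_dx2 = t2.gradient(dT_dx, x_interior)
    loss_phys = tf.reduce_mean((d2T_dx2 + q(x_interior) / kappa) ** 2)
    return loss_bc + loss_phys               # combined loss, cf. (5.30)

@tf.function
def train_step():
    with tf.GradientTape() as tape:
        loss = loss_fn()
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss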
# Training loop
for i in range(N + 1):
    loss = train_step()
    # print the loss every 100 epochs
    if i % 100 == 0:
        print("Epoch {:05d}: loss = {:10.8e}".format(i, float(loss)))
Once the training process is completed with the desired loss value, we can validate
the output by performing one forward pass with a test dataset which is typically
formed in the same domain as the training dataset. In our example, the training data
was 100 equidistant points between 0 and 1. We can determine our test dataset as 200
equidistant points in the same domain. Figure 5.5 depicts that the model’s prediction
captures the analytical result.
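As a sketch, the validation pass might look as follows (assuming NumPy and Matplotlib are available):

import numpy as np
import matplotlib.pyplot as plt

x_test = np.linspace(0.0, 1.0, 200)[:, None].astype(np.float32)
T_pred = model(x_test).numpy()

plt.plot(x_test, T_pred, label='PINN prediction')
plt.xlabel('x')
plt.ylabel('T')
plt.legend()
plt.show()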
This simple example illustrates the use of physics-informed neural networks in scientific
problems. Now, let us proceed with a more complex application.
To this end, consider the cantilever beam model (Wang et al. 2006; Otero and
Ponta 2010; Wang and Feng 2009), a classical example in linear elasticity theory.
This problem is governed by the well-known equilibrium equation
\nabla \cdot \boldsymbol{\sigma} + \mathbf{f} = \mathbf{0} ,   (5.33)
the strain–displacement relation
\boldsymbol{\varepsilon} = \frac{1}{2} \left( \nabla \mathbf{u} + \nabla \mathbf{u}^{T} \right) ,   (5.34)
and Hooke's law for a linear isotropic elastic solid,
\boldsymbol{\sigma} = \lambda \, \mathrm{tr}(\boldsymbol{\varepsilon}) \, \mathbf{I} + 2 \mu \, \boldsymbol{\varepsilon} ,   (5.35)
where μ and λ are the Lamé constants, and I is the identity tensor. The Dirichlet
boundary conditions are u(x) = û for x ∈ Γ_D and the Neumann boundary conditions
are σn = t̂ for x ∈ Γ_N, where n is the normal vector.
For this example (see Fig. 5.6), the domain Ω is a rectangle with corners at (0, 0) and (8, 2).
Letting x = (x, y) and u = (u, v), the Dirichlet boundary conditions at x = 0 are
u(x, y) = \frac{P y}{6 E I} (2 + \nu) \left( y^2 - \frac{W^2}{4} \right) ,
v(x, y) = - \frac{P}{6 E I} \left( 3 \nu y^2 L \right) .   (5.36)
Commonly, a parabolic traction is applied at x = 8:
p(x, y) = P \, \frac{y^2 - y W}{2 I} .   (5.37)
(Beam geometry figure: length L = 8 along x, width W = 2 along y, thickness b = 1 along z, maximum traction Pmax = 2.)
Fig. 5.8 The architecture of the feed-forward neural network for the Timoshenko beam problem.
The network consists of one input layer, one output layer, and three hidden layers. There are 20
neurons per hidden layer. Two neurons in the input layer take x and y coordinates, and the output
neurons give displacements in u and v directions. a is the activation function that is swish in this
example. Superscripted numbers denote the layer number, and subscripted ones denote the neuron
number in the relevant layer
5.4.1.3 3D Hyperelasticity
The governing equations of the boundary value problem are
\nabla \cdot \mathbf{P} + \mathbf{f}_b = \mathbf{0} ,
Dirichlet boundary: \mathbf{u} = \bar{\mathbf{u}} \;\text{on}\; \partial\Omega_D ,   (5.38)
Neumann boundary: \mathbf{P} \cdot \mathbf{n} = \bar{\mathbf{t}} \;\text{on}\; \partial\Omega_N ,
where ū is the prescribed displacement given on the Dirichlet boundary and t̄ is the
prescribed traction at the Neumann boundary; n denotes the outward unit normal
vector, P is the first Piola–Kirchhoff stress tensor, and fb is the body force. The potential
energy functional of this problem is given by Samaniego et al. (2020)
Fig. 5.9 Predicted and exact values for displacements on a Timoshenko cantilever beam in x and
y directions
(a) Estimation error for displacements in x-direction; (b) estimation error for displacements in y-direction
Fig. 5.10 The difference between the exact solution and the predicted solution for displacements
on the beam in x and y directions
\varepsilon(\boldsymbol{\varphi}) = \int_{\Omega} \Psi \, dV - \int_{\Omega} \mathbf{f}_b \cdot \boldsymbol{\varphi} \, dV - \int_{\partial\Omega_N} \bar{\mathbf{t}} \cdot \boldsymbol{\varphi} \, dA ,   (5.39)
where Ψ is the strain energy density and ϕ indicates the mapping of points on the
body from the initial/undeformed to the deformed state.
In order to obtain optimal parameters of the neural network, the potential energy
(5.39) is parameterized by the neural network’s prediction for the displacements.
L(p) \approx \frac{V}{N} \sum_{i=1}^{N} \Psi\big((\boldsymbol{\varphi}_p)_i\big) - \frac{V}{N} \sum_{i=1}^{N} (\mathbf{f}_b)_i \cdot (\boldsymbol{\varphi}_p)_i - \frac{A_{\partial\Omega_N}}{N_{\partial\Omega_N}} \sum_{i=1}^{N_{\partial\Omega_N}} \bar{\mathbf{t}}_i \cdot (\boldsymbol{\varphi}_p)_i ,   (5.41)
in which V is the volume and N is the number of data points within the solid;
N_{\partial\Omega_N} and A_{\partial\Omega_N} denote the number of points on the surface subjected to the force
and the surface area, respectively.
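As an illustration of Eq. (5.41), the Monte Carlo style estimate of the potential energy can be coded compactly. The sketch below assumes hypothetical tensors psi (strain energy density values at the interior points), fb and phi_int (body force and predicted deformed positions at the interior points), and t_bar and phi_surf (tractions and predicted deformed positions at the loaded-surface points); in a full implementation these would come from the network output and the constitutive model.

import tensorflow as tf

def energy_loss(psi, fb, phi_int, t_bar, phi_surf, V, A_surf):
    # Eq. (5.41): internal strain energy minus external work, both
    # approximated by averages over the sampled points
    N  = tf.cast(tf.shape(psi)[0], tf.float32)        # interior points
    Ns = tf.cast(tf.shape(phi_surf)[0], tf.float32)   # loaded-surface points
    internal  = V / N * tf.reduce_sum(psi)
    work_body = V / N * tf.reduce_sum(tf.reduce_sum(fb * phi_int, axis=1))
    work_trac = A_surf / Ns * tf.reduce_sum(tf.reduce_sum(t_bar * phi_surf, axis=1))
    return internal - work_body - work_trac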
Let us now consider a 3D cuboid of length L = 1.25, width W = 1.0, and depth
H = 1.0. It is fixed at the left surface and twisted 60° counter-clockwise by the boundary
condition u|_{\Gamma_1} at the right-end surface. Also, a body force
f_b = [0, −0.5, 0]^T and traction forces t̄ = [1, 0, 0]^T at the lateral surfaces are applied (see Fig. 5.11).
The Dirichlet boundary conditions for this particular problem are
u|_{\Gamma_0} = [0, 0, 0]^T ,
u|_{\Gamma_1} = \begin{bmatrix} 0 \\ 0.5\,[0.5 + (X_2 - 0.5)\cos(\pi/3) - (X_3 - 0.5)\sin(\pi/3) - X_2] \\ 0.5\,[0.5 + (X_2 - 0.5)\sin(\pi/3) + (X_3 - 0.5)\cos(\pi/3) - X_3] \end{bmatrix}   (5.42)
The Neo-Hookean model is assumed in this problem. The material properties are
shown in Table 5.1.
We now proceed with determining the network parameters. In each direction, 40
equally spaced points, 64000 points in total, are placed over the whole domain (see
Fig. 5.12a). The neural network consists of three hidden layers, and each hidden
(Cuboid geometry figure: L = 1.25, W = 1, fixed support at the left face.)
Table 5.1 Material properties
Description                  Value
E — Young's modulus          10^6
ν — Poisson ratio            0.3
μ — Lamé parameter           E / (2(1 + ν))
λ — Lamé parameter           Eν / ((1 + ν)(1 − 2ν))
Fig. 5.12 Training points on the cuboid and its predicted deformed shape after training
layer has 30 neurons with a tanh activation function. The input and output layers
have three neurons corresponding to coordinates of the initial configuration of the
designated points and their deformed coordinates after loading, respectively. The
network is trained with 50 iterations and the parameters are optimized by the L-
BFGS optimizer.
The predicted deformed shape of the cuboid is given in Fig. 5.12b. A line passing
through two points on the cube A(0.625, 1, 0.5) and B(0.625, 0, 0.5) is drawn to
compare the displacement predictions and the real displacements on the line. We
showed in Samaniego et al. (2020), that a neural network with the same setup but
trained with 25 steps has an error in the L 2 norm of 0.13210, whereas the finite
element model has an error of 0.13275 for estimating the displacements on the line
AB.
# We set seeds initially. This starts the model with the same random
# variables (e.g. the initial weights of the network).
# By doing so, we get the same results whenever the code is run
tf.random.set_seed(123)
# Number of iterations
N = 6000
#The exact solution of the problem. It will be used to produce measured data
#and test data
solution = lambda x: -5 * x**3 + 2 * x**2 + 3 * x
# Output is one-dimensional
model.add(tf.keras.layers.Dense(1))
return model
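The listings that follow use a trainable variable kappa and a helper second_deriv that are not shown in this excerpt. A plausible sketch of the missing variable is given below (the initial value 1.0 is an assumption for illustration); second_deriv(x) can be implemented with nested gradient tapes exactly as in the forward example.

import tensorflow as tf

# Unknown thermal diffusivity treated as a trainable variable;
# the initial value 1.0 is an illustrative assumption
kappa = tf.Variable(1.0, dtype=tf.float32)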
After defining the model settings, we can proceed with constructing the loss
function. The loss function (5.43) consists of three parts, namely, boundary loss,
physics loss, and data loss.
L = L_{BCs} + L_{Physics} + L_{Data} ,   (5.43)
with
L_{BCs} = \frac{\lambda_1}{N_B} \sum_{i=1}^{N_B} |T_{NN}(x_i) - y_i|^2 ,
L_{Physics} = \frac{\lambda_2}{N_P} \sum_{j=1}^{N_P} \Big| \frac{d^2 T_{NN}}{dx^2}\Big|_{x=x_j} + \frac{q(x_j)}{\kappa} \Big|^2 ,   (5.44)
L_{Data} = \frac{\lambda_3}{N_D} \sum_{j=1}^{N_D} |T_{NN}(x_j) - y_j|^2
Here N B , N P , and N D correspond to the number of data points for boundary loss,
physics loss, and measured data loss, respectively. Regularization terms λ1 , λ2 , λ3
are taken as 1 in this example; (5.43) and (5.44) are defined in the code as follows:
# Source term
def source_func(x): return (15 * x - 2)
@tf.function
def physics_loss(x):
    x = x[1:-1]
    predicted_Txx = second_deriv(x)
    mse_phys = tf.reduce_mean(
        tf.square(predicted_Txx * kappa + source_func(x)))
    return mse_phys
@tf.function
def data_loss(x):
    x = x[1:-1]
    ob_T = solution(x)[:, None]
    data_loss = tf.reduce_mean(tf.square(ob_T - model(x)))
    return data_loss
@tf.function
def loss_func(x):
    bcs_loss = boundary_loss(bcs_x_tensor, bcs_T_tensor)
    phys_loss = physics_loss(x)
    ob_loss = data_loss(x)
    loss = phys_loss + ob_loss + bcs_loss
    return loss
The training and testing procedures are the same as for the forward problem.
Again, the gradients of the loss function with respect to κ and the trainable variables,
which are weights and biases of the network, are determined with backpropagation.
Then, the trainable variables and the κ value are updated by the ADAM optimizer
using the previously obtained gradients. This iterative procedure is repeated for the
prescribed number of epochs and is expected to eventually approach the minimum of
the loss.
Listing 6: Training
# taking gradients of the loss function w.r.t. trainable variables
# and kappa
@tf.function
def get_grad():
    with tf.GradientTape(persistent=True) as tape:
        # This tape is for derivatives with
        # respect to trainable variables
        tape.watch(model.trainable_variables)
        tape.watch(kappa)
        Loss = loss_func(x)
    g = tape.gradient(Loss, model.trainable_variables)
    g_kappa = tape.gradient(Loss, kappa)
    return Loss, g, g_kappa
# optimizing and updating the weights and biases of the model and
# kappa by using the gradients
@tf.function
def train_step():
    # Compute current loss and gradients w.r.t. parameters
    loss, grad_theta, grad_kappa = get_grad()
    # apply the gradients (assuming an ADAM optimizer instance named `optim`)
    optim.apply_gradients(zip(grad_theta, model.trainable_variables))
    optim.apply_gradients([(grad_kappa, kappa)])
    return loss
The network parameters obtained at the last epoch form our model. We can test
the model with a new data set in the same domain and plot the results to compare
it with the ground truth (see Fig. 5.13a). The value for κ estimated by the neural
Fig. 5.13 The temperature and thermal diffusivity constant prediction of the neural network and
true values
network is equal to 0.5000, and the real value of κ is 0.5. Figure 5.13b illustrates
that as the network is being trained, the value of κ converges to the true value. The
relative L 2 error norm is 7.575 × 10−6 .
The second and final example is the Helmholtz equation, which is a time-independent
version of the wave equation. It is used for describing problems in
electromagnetic radiation, acoustics, and seismology. The homogeneous form of the
Helmholtz equation is written as
\nabla^2 u(x, y) + k^2 u(x, y) = 0 ,   (5.45)
where ∇ 2 is the Laplace operator and k is the wave number. The solution of the
problem is u(x, y) for (x, y) ∈ Ω. An inverse acoustic duct problem, adopted from
Anitescu et al. (2019), whose governing equation is a complex-valued Helmholtz
equation such that k is unknown and u(x, y) is known at some points in the domain,
will be investigated.
We can write (5.45) with domain information and boundary conditions as
\nabla^2 u + k^2 u = 0 \quad \text{in } \Omega = (0, 2) \times (0, 1) ,   (5.46)
\frac{\partial u}{\partial n} = \cos(m \pi y) \quad \text{at } x = 0 ,
\frac{\partial u}{\partial n} = -i k u \quad \text{at } x = 2 ,   (5.47)
\frac{\partial u}{\partial n} = 0 \quad \text{for } y = 0 \text{ and } y = 1 ,
m being the number of modes which is taken as 1. The wave number k is unknown.
The initial guess for k is 1, and the true value is chosen as k = 6. The exact solution
for u(x, y) is available in closed form (see Anitescu et al. 2019).
Similar to the previous inverse problem in which we obtained the thermal diffusivity
constant of a rod, the overall loss function is composed of boundary loss, physics loss
and data loss. The boundary loss is constructed by Neumann and Robin boundary
conditions specified in (5.47), and the physics loss is equal to the left-hand-side of
(5.46). In addition, the data loss, in other words, the mean square error between
observed u(x, y) values and the prediction of the neural network is the last term in
our overall loss function. These loss functions can be described as follows:
Fig. 5.14 Collocation points for 2D Helmholtz equation. Black points depict the boundary points
where Neumann boundary conditions are valid whereas the red points show the Robin boundary
points. In addition, blue points represent the inner collocation points where physics loss and data
loss are computed
L_{BCs} = \frac{\lambda_1}{N_B} \sum_{i=1}^{N_B} \Big| \frac{\partial u_{NN}}{\partial n}(x_i^b, y_i^b) - \frac{\partial u}{\partial n}(x_i^b, y_i^b) \Big|^2 ,
L_{Physics} = \frac{\lambda_2}{N_P} \sum_{j=1}^{N_P} \big| \nabla^2 u_{NN}(x_j^*, y_j^*) - k^2 u(x_j^*, y_j^*) \big|^2 ,   (5.51)
L_{Data} = \frac{\lambda_3}{N_D} \sum_{j=1}^{N_D} \big| u_{NN}(x_j^*, y_j^*) - u(x_j^*, y_j^*) \big|^2
Here L BCs , L Physics , L Data refer to the loss obtained from boundary conditions,
governing equation, and measured data, respectively. The regularization term λ1 is
100 whereas λ2 and λ3 are 1; N B indicates the number of boundary points, N P , N D
are the number of interior collocation points where physics loss is computed and the
number of points where the observed data is available, respectively. In this problem,
784 equidistant points (28 × 28) such that N P =N D = 676 and N B =108 are created
(see Fig. 5.14).
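A sketch of how such a 28 × 28 point set can be generated and split into boundary and interior points, assuming the rectangular domain [0, 2] × [0, 1] implied by the boundary conditions, is:

import numpy as np

nx, ny = 28, 28
x = np.linspace(0.0, 2.0, nx)
y = np.linspace(0.0, 1.0, ny)
X, Y = np.meshgrid(x, y)
pts = np.stack([X.ravel(), Y.ravel()], axis=1)            # 784 points in total

on_boundary = (np.isclose(pts[:, 0], 0.0) | np.isclose(pts[:, 0], 2.0) |
               np.isclose(pts[:, 1], 0.0) | np.isclose(pts[:, 1], 1.0))
boundary_pts = pts[on_boundary]                            # N_B = 108 points
interior_pts = pts[~on_boundary]                           # N_P = N_D = 676 points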
The neural network consists of 5 hidden layers with the tanh activation function,
and there are 30 neurons in each layer. The data is normalized to the interval [−1, 1]
before being processed. First, ADAM optimizer and, subsequently, the quasi-Newton
method (L-BFGS) are employed to minimize the loss function. Five thousand itera-
tions for ADAM and 6200 iterations with L-BFGS are applied. The estimated solution
for u(x, y) and the exact solution are shown in Fig. 5.15.
The initial guess for k was one, and the neural network’s estimation for k after
the training is 5.999. The relative L 2 error norm for the real part of the solution is
0.0015. A comparison between the predicted solution and the exact solution can be
found in Fig. 5.16.
(a) Predicted solution for the real part of the Helmholtz equation; (b) exact solution for the real part; (c) predicted solution for the imaginary part; (d) exact solution for the imaginary part
Fig. 5.15 Predicted and exact values for real and imaginary parts of the Helmholtz equation
(a) Error distribution between predicted and exact solution for the real part; (b) error distribution between predicted and exact solution for the imaginary part
Fig. 5.16 Error distribution for real and imaginary parts of the Helmholtz equation
5.5 Conclusions
In this chapter, we have introduced some of the main building blocks for PINNs.
The main idea is to cast the process of solving a PDE as an optimization problem,
where either the residual or some energy functional related to the governing equations
is minimized. We showed the implementation of PINNs for both simple and more
advanced inverse problems. First, a 1D steady-state heat conduction problem with
a source term was solved for the unknown thermal diffusivity constant κ. Later,
a complex-valued Helmholtz equation for an inverse acoustic duct problem was
investigated. The wave number k is unknown in the beginning, and it is approximated
by the PINN model. Unlike the forward problems, we have an additional term in the
loss function, which is formed as the mean square error between the measured data
and the model’s prediction.
By taking advantage of modern machine learning libraries, it is possible to write
fairly succinct programs that approximate the solution or some quantity of inter-
est, while at the same time taking advantage of the built-in parallelization offered by
multi-processor and GPU architectures. Nevertheless, solving PDEs by the optimiza-
tion of parameters in a “standard” fully connected neural network is less efficient
than current methods such as finite elements. More advances seem possible by com-
bining machine learning algorithms with classical methods for solving PDEs which
make use of the available knowledge for approximating the solutions or quantities
of interest.
References
Abadi M, Agarwal A, Barham P, Brevdo E et al (2015) TensorFlow: large scale machine learning
on heterogeneous systems. Software available from tensorflow.org. https://www.tensorflow.org/
Agostinelli F, Hoffman M, Sadowski P, Baldi P (2014) Learning activation functions to improve deep neural networks. arXiv:1412.6830
Anitescu C, Atroshchenko E, Alajlan N, Rabczuk T (2019) Artificial neural network methods for
the solution of second order boundary value problems. Comput Mater Continua 59(1):345–359
Apicella A, Donnarumma F, Isgrò F, Prevete R (2021) A survey on modern trainable activation
functions. Neural Netw 138:14–32
Bin Waheed U, Haghighat E, Alkhalifah T, Song C et al (2021) PINNeik: Eikonal solution using
physics-informed neural networks. Comput Geosci 155:104833
Bradbury J, Frostig R, Hawkins P, Johnson MJ et al (2018) JAX: composable transformations of
Python+NumPy programs. Version 0.2.5. http://github.com/google/jax
Broyden CG (1970) The convergence of a class of double-rank minimization algorithms: 2. The
new algorithm. IMA J Appl Math 6(3):222–231
Chen Y, Lu L, Karniadakis GE, Dal Negro L (2020) Physics-informed neural networks for inverse problems in nano-optics and metamaterials. Opt Express 28(8):11618–11633
Clevert D-A, Unterthiner T, Hochreiter S (2015) Fast and accurate deep network learning by expo-
nential linear units (ELUs). arXiv:1511.07289
De Sa C, Re C, Olukotun K (2015) Global convergence of stochastic gradient descent for some
non-convex matrix problems. International conference on machine learning. PMLR, pp 2332–
2341
Depina I, Jain S, Mar Valsson S, Gotovac H (2022) Application of physics-informed neural networks
to inverse problems in unsaturated groundwater flow. Georisk: Assess Manag Risk Eng Syst
Geohazards 16(1):21–36
Dillon JV, Langmore I, Tran D, Brevdo E et al (2017) TensorFlow distributions. arXiv:1711.10604
Fletcher R (1970) A new approach to variable metric algorithms. Comput J 13(3):317–322
Floridi L, Chiriatti M (2020) GPT-3: Its nature, scope, limits, and consequences. Minds Mach
30(4):681–694
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural net-
works. In: Proceedings of the thirteenth international conference on artificial intelligence and
statistics. JMLR workshop and conference proceedings, pp 249–256
Goldfarb D (1970) A family of variable-metric methods derived by variational means. Math Comput
24(109):23–26
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press
Goswami S, Anitescu C, Rabczuk T (2020) Adaptive fourth-order phase field analysis for brittle
fracture. Comput Methods Appl Mech Eng 361:112808
Gühring I, Kutyniok G, Petersen P (2020) Error bounds for approximations with deep ReLU neural networks in W^{s,p} norms. Anal Appl 18(05):803–859
Haghighat E, Amini D, Juanes R (2022) Physics-informed neural network simulation of multiphase poroelasticity using stress-split sequential training. Comput Methods Appl Mech Eng 397:115141
He J, Li L, Xu J, Zheng C (2020) Relu deep neural networks and linear finite elements. J Comput
Math 38(3):502–527
Hendrycks D, Gimpel K (2016) Gaussian error linear units (GELUs). arXiv:1606.08415
He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: Surpassing human-level per-
formance on imagenet classification. In: Proceedings of the IEEE international conference on
computer vision, pp 1026–1034
Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approx-
imators. Neural Netw 2(5):359–366
Jagtap AD, Kawaguchi K, Em Karniadakis G (2020a) Locally adaptive activation functions with slope recovery for deep and physics-informed neural networks. Proc R Soc A 476(2239):20200334
Jagtap AD, Kawaguchi K, Karniadakis GE (2020b) Adaptive activation functions accelerate
convergence in deep and physics-informed neural networks. J Comput Phys 404:109136
Jagtap AD, Shin Y, Kawaguchi K, Karniadakis GE (2022) Deep Kronecker neural networks: A gen-
eral framework for neural networks with adaptive activation functions. Neurocomputing 468:165–
180
Jouppi NP, Young C, Patil N, Patterson D et al (2017) In-datacenter performance analysis of a
tensor processing unit. In: Proceedings of the 44th annual international symposium on computer
architecture, pp 1–12
Jumper J, Evans R, Pritzel A, Green T et al (2021) Highly accurate protein structure prediction with
AlphaFold. Nature 596(7873):583–589
Kharazmi E, Zhang Z, Karniadakis GE (2019) Variational physics-informed neural networks for
solving partial differential equations. arXiv:1912.00873
Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv:1412.6980
Kissas G, Yang Y, Hwuang E, Witschey WR et al (2020) Machine learning in cardiovascular flows modeling: Predicting arterial blood pressure from non-invasive 4D flow MRI data using physics-informed neural networks. Comput Methods Appl Mech Eng 358:112623
Lagaris IE, Likas AC, Papageorgiou DG (2000) Neural-network methods for boundary value prob-
lems with irregular boundaries. IEEE Trans Neural Netw 11(5):1041–1049
Lagaris IE, Likas A, Fotiadis DI (1997) Artificial neural network methods in quantum mechanics.
Comput Phys Commun 104(1–3):1–14, 40
Lagaris IE, Likas A, Fotiadis DI (1998) Artificial neural networks for solving ordinary and partial
differential equations. IEEE Trans Neural Netw 9(5):987–1000
Levenberg K (1944) A method for the solution of certain non-linear problems in least squares. Q
Appl Math 2(2):164–168
Li A, Chen R, Farimani AB, Zhang YJ (2020a) Reaction diffusion system prediction based on
convolutional neural network. Sci Rep 10(1):1-9
Li Z, Kovachki N, Azizzadenesheli K, Liu B et al (2020b) Fourier neural operator for parametric
partial differential equations. arXiv:2010.08895
Li Z, Liu F, Yang W, Peng S et al (2021) A survey of convolutional neural networks: analysis,
applications, and prospects. IEEE Trans Neural Netw Learn Syst
Liu DC, Nocedal J (1989) On the limited memory BFGS method for large scale optimization. Math
Program 45(1):503–528
López J, Anitescu C, Rabczuk T (2021) Isogeometric structural shape optimization using automatic
sensitivity analysis. Appl Math Model 89:1004–1024
Lu L, Jin P, Pang G, Zhang Z et al (2021) Learning nonlinear operators via DeepONet based on
the universal approximation theorem of operators. Nat Mach Intell 3(3):218–229
Maas AL, Hannun AY, Ng AY et al (2013) Rectifier nonlinearities improve neural network acoustic
models. Proc icml 30(1):3. Citeseer
Marquardt DW (1963) An algorithm for least-squares estimation of nonlinear parameters. J Soc
Indus Appl Math 11(2):431–441
Mertikopoulos P, Hallak N, Kavis A, Cevher V (2020) On the almost sure convergence of stochastic
gradient descent in non-convex problems. Adv Neural Inf Process Syst 33:1117–1128
Misra D (2019) Mish: A self-regularized non-monotonic activation function. arXiv:1908.08681
Nguyen-Thanh VM, Zhuang X, Rabczuk T (2020) A deep energy method for finite deformation
hyperelasticity. Eur J Mech-A/Solids 80:103874
Otero AD, Ponta FL (2010) Structural analysis of wind-turbine blades by a generalized Timoshenko
beam model
Paszke A, Gross S, Massa F, Lerer A et al (2019) PyTorch: an imperative style, high-performance
deep learning library. In: Advances in Neural Information Processing Systems 32. Curran Asso-
ciates, Inc., 2019, pp 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-
style-high-performance-deep-learning-library.pdf
Petersen P, Voigtlaender F (2018) Optimal approximation of piecewise smooth functions using deep
ReLU neural networks. Neural Netw 108:296–330
Pfau D, Spencer JS, Matthews AGDG, Foulkes WMC (2020) Ab initio solution of the many-electron
Schrödinger equation with deep neural networks. Phys Rev Res 2:033429
Philipp G, Song D, Carbonell JG (2018) Gradients explode—Deep Networks are shallow—ResNet
explained
Raissi M, Perdikaris P, Karniadakis GE (2019) Physics-informed neural networks: A deep learn-
ing framework for solving forward and inverse problems involving nonlinear partial differential
equations. J Comput Phys 378:686–707
Ramachandran P, Zoph B, Le QV (2017) Searching for activation functions. arXiv:1710.05941
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating
errors. Nature 323(6088):533–536
Samaniego E, Anitescu C, Goswami S, Nguyen-Thanh VM et al (2020) An energy approach to
the solution of partial differential equations in computational mechanics via machine learning:
Concepts, implementation and applications. Comput Methods Appl Mech Eng 362:112790
Shanno DF (1970) Conditioning of quasi-Newton methods for function minimization. Math Comput
24(111):647–656
Shukla K, Di Leoni PC, Blackshire J, Sparkman D et al (2020) Physics-informed neural network for
ultrasound nondestructive quantification of surface breaking cracks. J Nondestruct Eval 39(3):1–
20
Shukla K, Jagtap AD, Karniadakis GE (2021) Parallel physics-informed neural networks via domain
decomposition. J Comput Phys 447:110683
Silver D, Hubert T, Schrittwieser J, Antonoglou I et al (2017) Mastering chess and shogi by self-play
with a general reinforcement learning algorithm. arXiv:1712.01815
Sirignano J, Spiliopoulos K (2018) DGM: A deep learning algorithm for solving partial differential
equations. J Comput Phys 375:1339–1364
Sukumar N, Srivastava A (2022) Exact imposition of boundary conditions with distance functions in physics-informed deep neural networks. Comput Methods Appl Mech Eng 389:114333
Sun S, Cao Z, Zhu H, Zhao J (2019) A survey of optimization methods from a machine learning
perspective. IEEE Trans Cybern 50(8):3668–3681
Vauhkonen M, Tarvainen T, Lähivaara T (2016) Inverse problems. In: Pohjolainen S (ed) Mathe-
matical modelling. Springer International Publishing
Wang G-F, Feng X-Q (2009) Timoshenko beam model for buckling and vibration of nanowires
with surface effects. J Phys D: Appl Phys 42(15):155411
Wang C, Tan V, Zhang Y (2006) Timoshenko beam model for vibration analysis of multi-walled
carbon nanotubes. J Sound Vib 294(4–5):1060–1072
Wang S, Yu X, Perdikaris P (2022) When and why PINNs fail to train: A neural tangent kernel
perspective. J Comput Phys 449:110768
Wight CL, Zhao J (2020) Solving Allen-Cahn and Cahn-Hilliard equations using the adaptive physics-informed neural networks. arXiv:2007.04542
Yu B et al (2018) The deep Ritz method: a deep learning-based numerical algorithm for solving
variational problems. Commun Math Stat 6(1):1–12
Yu J, Lu L, Meng X, Karniadakis GE (2022) Gradient-enhanced physics-informed neural networks
for forward and inverse PDE problems. Comput Methods Appl Mech Eng 393:114823
Zhuang X, Guo H, Alajlan N, Zhu H et al (2021) Deep autoencoder based energy method for
the bending, vibration, and buckling analysis of Kirchhoff plates with transfer learning. Eur J
Mech-A/Solids 87:104225
Chapter 6
Physics-Informed Deep Neural Operator
Networks
6.1 Introduction
Physics-informed neural networks (PINNs) have transformed the way we model the
behavior of physical systems for which we have available some measurements and
at least a parameterized partial differential equation (PDE) to provide additional
information in a semi-supervised type of learning (Raissi et al. 2019; Karniadakis
et al. 2021; Samaniego et al. 2020). They can solve ill-posed problems that may lack
boundary conditions, e.g. thermal boundary conditions in heat transfer problems
(Cai et al. 2021) or discover voids and defects in materials based only on a handful
of displacements (Zhang et al. 2022), or obtain the failure pattern (Goswami et al.
2020b). Despite their effectiveness, PINNs are trained for specific boundary and ini-
tial conditions, as well as loading or source terms, and require expensive training
during inference. Therefore, they are not particularly effective for other operating
conditions and real-time inference, although transfer learning can somewhat alle-
viate this limitation (Goswami et al. 2020c, 2022e). What we need in engineering
disciplines such as design, control, uncertainty quantification, robotics, etc. is a gen-
eralized version of PINNs that can infer the system’s response in real time for many
different boundary/initial conditions and loadings, without further training or perhaps
with very light training. This will lead to speed up factors of thousands compared to
conventional numerical solvers, e.g. CFD or solid mechanics simulators.
The neural operators, introduced in 2019 in the form of DeepONet (Lu et al.
2021), fulfill this promise. Training is performed offline in a predefined input space
and hence inference is very fast since no further training is required as long as the
new conditions are inside the input space. For arbitrary inputs, which are out of the
distribution (OOD), further training is required but this may be relatively light if
the input space is sampled sufficiently. We note here that a neural operator is very
different from a reduced order model (ROM) that is restricted to a very small subset of
conditions and lacks generalization properties due to the under-parameterization in
such methods (Kontolati et al. 2022; Geelen et al. 2022). Another important property
of DeepONet is that it is based on a universal approximation theorem for operators
(Chen and Chen 1995; Lu et al. 2021), and more recent theoretical work (Lanthaler
et al. 2022) has shown that DeepONet can break the curse of dimensionality in the
input space, unlike approaches based on ROM for parameterized PDEs (Riffaud et al.
2021).
Figure 6.1 represents the schematic of a polymorphic DeepONet and its resem-
blance to a human neuron. An operator network is made up of two deep neural
networks (DNNs): branch and trunk. The branch and trunk networks are analogous
to synchronized dendritic branches and axonal spiking. The result is a nonlinear oper-
ator that can be used to approximate any function defined as an input in the branch
Fig. 6.1 A schematic representation of the deep operator network (DeepONet). It consists of two
neural networks (the branch network and the trunk network) with flexible architectures (DNNs: deep
neural networks; GNNs: graph neural networks; SNN: spiking neural networks). The networks take
functions as input and output functions, which are represented by the dot product of the outputs of
the branch (basis coefficients) and the trunk networks (basis functions). The color coding indicates
the resemblance of computational operations with a human neuron
network and evaluated at the locations specified in the trunk network. In 2020, work
on graph kernel network (GKN) for PDE led to another type of operator regression
(Li et al. 2020c). Subsequently, the same group proposed a different architecture in
which they formulated the operator regression by parameterizing the integral kernel
directly in Fourier space and named it Fourier Neural Operator (FNO) (Li et al.
2020b). All these versions are different realizations of DeepONet if appropriate
changes in the trunk and branch are imposed, see Fig. 6.1 and also (Lu et al. 2022a).
Consider an operator G that maps from the input function v to the output function
u, i.e. G : v → u. DeepONet tries to learn the operator G by approximating the basis
function for expressing the output functional space. In this chapter, we will discuss
in detail the three aforementioned neural operators, and their extensions, and put
forward numerical examples to illustrate the usage and limitations of each of these
approaches. The physics of the problem is enforced using the labeled input–output
dataset pairs for the conventional architecture of the proposed operators. In related
sections, variants of the operators that use the physics of the problem (and require
little to no labeled data) to train the network are also covered.
This chapter is organized as follows. In Sect. 6.2, we discuss the deep neural
operator (DeepONet), its extensions, and variants that have been developed and used
for different problem setups. In Sect. 6.3, we describe the Fourier neural operator
(FNO) architecture and present the extensions of the operator to deal with com-
plex geometries, problems defined in different input and output spaces, and also
the physics-informed version of the operator. In Sect. 6.4, we introduce the graph
neural operator (GNO), its components, and non-local kernel networks. In Sect. 6.6,
we compare the performance of the studied models for several examples from the
literature. Finally, we summarize our observations and provide concluding remarks
in Sect. 6.7.
ξ = {xi , yi , ti } and evaluates the solution operator to compute the loss function. The
solution operator for an input realization, v1 , can be expressed as follows:
\mathcal{G}_{\theta}(v_1)(\xi) = \sum_{i=1}^{p} br_i \cdot tr_i = \sum_{i=1}^{p} br_i\big(v_1(\eta_1), v_1(\eta_2), \ldots, v_1(\eta_m)\big) \cdot tr_i(\xi) ,   (6.1)
where br1 , br2 , . . . , br p are outputs of the branch net and tr1 , tr2 , . . . , tr p are outputs
of the trunk net.
Furthermore, the functions brk (outputs of the branch network) and trk (outputs of
the trunk network) can be chosen as diverse classes of neural networks, satisfying the
classical universal approximation theorem of functions, e.g. fully connected neural
networks (FNN), residual neural networks, and convolution neural networks (CNN).
This theorem was proved in Chen and Chen (1995) with two-layer neural net-
works. Also, the theorem holds when the Banach space C(K 1 ) is replaced by L q (K 1 )
and C(K 2 ) replaced by L r (K 2 ), q, r ≥ 1. We notice that in Eq. (6.1), the last layer of
each brk branch network is bias-free. Although bias is not a requirement in Theorem
6.1, adding bias may improve performance by lowering the generalization error (Lu
et al. 2021). The trainable parameters of a data-driven DeepONet, represented by
θ in Eq. (6.1), are obtained by minimizing a loss function, which is expressed as
follows:
\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} w_i \, |u_i(\xi) - \mathcal{G}_{\theta}(v_i)(\xi)|^2 ,   (6.2)
where u i (ξ ) is the ground truth and N denotes the total number of functions in the
branch network. The weights associated with each sample in Eq. (6.2) are denoted
by wi , which are assumed to be unity in the simplest case. Examples described in
Sects. 6.6.1.2 and 6.6.1.3 are solved with the architecture of conventional data-driven
DeepONet.
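As a minimal illustration of Eq. (6.1), the branch–trunk dot product can be sketched as follows; the layer sizes, the number of sensors m, and the number of basis functions p are placeholders, not values used in the examples of this chapter.

import tensorflow as tf

m, p = 100, 40     # number of input-function sensors and of basis functions (placeholders)

branch = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='tanh'),
    tf.keras.layers.Dense(p, use_bias=False)])   # bias-free last layer, as in Eq. (6.1)
trunk = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='tanh'),
    tf.keras.layers.Dense(p, activation='tanh')])

def deeponet(v_sensors, xi):
    # v_sensors: (batch, m) samples of the input function at the fixed sensor locations
    # xi:        (batch, d) query coordinates, e.g. (x, y, t)
    br = branch(v_sensors)                                  # basis coefficients
    tr = trunk(xi)                                          # basis functions at xi
    return tf.reduce_sum(br * tr, axis=1, keepdims=True)    # Eq. (6.1)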
During the optimization process, some query points must be penalized more than
others in order to satisfy constraints (initial condition, boundary condition). In such
cases, properly designed non-uniform training point weights can improve accuracy.
These penalizing parameters can be modulated manually, although this is often a tedious
procedure, or they can be decided adaptively during the training of the DeepONet
(McClenny and Braga-Neto 2020; Kontolati et al. 2022). These parameters in the
loss function can be updated by gradient descent side by side with the network
parameters. The modified loss function is defined as follows:
\mathcal{L}(\theta, \lambda) = \frac{1}{N} \sum_{i=1}^{N} g(\lambda_i) \, |u_i(\xi) - \mathcal{G}_{\theta}(v_i)(\xi)|^2 ,   (6.3)
The self-adaptive weights are updated along with the network parameters using
gradient-based optimization. Therefore, if g(λ_i) > 0, ∇_{λ_i} L would be zero only if the term (u_i(ξ) − G_θ(v_i)(ξ)) is
zero. Implementing self-adaptive weights in Kontolati et al. (2022) has considerably
improved the accuracy prediction of discontinuities or non-smooth features in the
solution. Following the inception of DeepONet in 2019 (Lu et al. 2021), a num-
ber of enhancements to traditional architecture were developed to offer inductive
bias and speed up training. In subsequent sections, we present several extensions of
DeepONet.
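A sketch of how the self-adaptive weights of Eq. (6.3) might be updated alongside the network parameters is shown below; it reuses the deeponet sketch from above, and the mask g, the learning rates, and the sign of the λ-update are illustrative assumptions.

import tensorflow as tf

N_samples = 32                                    # placeholder batch size
lam = tf.Variable(tf.ones([N_samples]))           # one self-adaptive weight per sample
g = tf.nn.softplus                                # positive, increasing mask g(lambda)
opt_theta = tf.keras.optimizers.Adam(1e-3)
opt_lam = tf.keras.optimizers.Adam(1e-3)

def train_step(v, xi, u_true):
    with tf.GradientTape(persistent=True) as tape:
        residual = tf.squeeze(u_true - deeponet(v, xi))
        loss = tf.reduce_mean(g(lam) * tf.square(residual))
    params = branch.trainable_variables + trunk.trainable_variables
    opt_theta.apply_gradients(zip(tape.gradient(loss, params), params))
    # ascent step on lambda, so that poorly fitted samples receive larger weights
    opt_lam.apply_gradients([(-tape.gradient(loss, lam), lam)])
    del tape
    return loss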
In this section, we lay emphasis on learning the operator from paired data using any
available prior knowledge of the underlying system. This knowledge can then be
\mathcal{G}_{\theta}(v, w) = \sum_{i=1}^{p} br_i^{v} \cdot br_i^{w} \cdot tr_i ,   (6.8)
where briv and briw denote the i-th output embedding of the branch networks corre-
sponding to the input functions denoted by v and w, respectively. A schematic repre-
sentation of this architecture is shown in Fig. 6.2. In Goswami et al. (2022f), a Deep-
ONet framework based on multiple input functions is proposed that encompasses
two different DNNs (CNN and FNN) as branch networks. The architecture uses
grayscale images of systolic and diastolic geometry (in CNNs) along with patient-
specific information (such as hypertension in FNN) to predict the initial distribution
and extent of the mechanobiological insult in a patient with thoracic aortic aneurysm.
Fig. 6.2 A schematic representation of a multiple input data-driven DeepONet with self-adaptive
weights. The operator takes as inputs the functional field in branch network 1 (employing a CNN),
the boundary conditions in branch network 2 (employing an FNN to take as input the values at
three sensor locations marked with a red cross mark), and computes the solution for the coordinates
which are inputs of the trunk network (employing an FNN). The loss function, L, is the sum of
three components: data loss, loss at the initial condition, and boundary loss. Each of these losses
is penalized with self-adaptive penalty parameters, λ1 , λ2 , λ3 , respectively. These parameters are
updated adaptively along with the weights and biases, θ = (W, b) of the networks
\mathcal{L} = \mathcal{L}_{data} + \mathcal{L}_{physics} , \quad \text{where } \mathcal{L}_{physics} = \mathcal{L}_{init} + \mathcal{L}_{bound} + \mathcal{L}_{pde} ,
\mathcal{L}_{init} = \frac{1}{N_{init}} \sum_{i=1}^{N_{init}} \big[ \mathcal{G}_{\theta}(v)(x, y, t_0) - u(x, y, t_0) \big]^2 ,
\mathcal{L}_{bound} = \frac{1}{N_{bound}} \sum_{i=1}^{N_{bound}} \big[ \mathcal{G}_{\theta}(v)(x_b, y_b, t) - u(x_b, y_b, t) \big]^2 ,   (6.9)
\mathcal{L}_{pde} = \frac{1}{N_f} \sum_{i=1}^{N_f} \big[ f(\mathcal{G}_{\theta}) \big]^2
where Ninit denotes the number of initial data points, Nbound denotes the number
of boundary points, and N f denotes the number of collocation points or integration
points on which the PDE is evaluated. Additionally, f (Gθ ) denotes the residual form
(Wang et al. 2021) or the variational form (Goswami et al. 2022d) of the governing
equation. L pde acts as an appropriate regularization mechanism for biasing the target
output functions to satisfy the underlying PDE constraints when very few labeled data
are available. Examples discussed in Sects. 6.6.2.1 and 6.6.2.2 are solved using the
additional physics-informed loss term to obtain the optimized network parameters.
The Fourier neural operator (FNO) is based on replacing the kernel integral operator
with a convolution operator defined in Fourier space. The operator takes input func-
tions defined on a well-defined, equally spaced lattice grid and outputs the field of
interest on the same grid points. The network parameters are defined and learned in
the Fourier space rather than in the physical space, i.e. the coefficients of the Fourier
series of the output function are learned from the data. A schematic representation of
the FNO is shown in Fig. 6.3, which can be viewed as a DeepONet with a convolu-
tion neural network in the branch net to approximate the input functions and Fourier
basis functions in the trunk net. In particular, the network has three components:
first, the input function v(x) is lifted to a higher dimensional representation h(x, 0),
through a lifting layer, P, which is often parameterized by a linear transformation or
a shallow neural network. Then, the neural network architecture is formulated in an
iterative manner: h(x, 0) → h(x, 1) → h(x, 2) → · · · → h(x, L), where h(x, j),
j = 0, . . . , L, is a sequence of functions representing the values of the architecture
at each layer. Each layer is defined as a nonlinear operator via the action of the sum
of Fourier transformations and a bias function:
h(x, j + 1) = \sigma\big( W_j \, h(x, j) + \mathcal{F}^{-1}\big[ R_j \cdot \mathcal{F}[h(\cdot, j)] \big](x) + c_j \big) .
Here, we use σ to denote the activation function. W j , c j , and R j are trainable param-
eters for the jth layer, such that each layer has different parameters (i.e. different
kernels, weights, and biases). Lastly, the output u(x) is obtained by projecting h(x, L)
through a local transformation operator layer, Q. In Sect. 6.6.1.1, we demonstrate
the performance of conventional data-driven FNO on learning the solution operator
for PDEs.
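A one-dimensional sketch of a single Fourier layer is given below; the channel width, the number of retained modes, and the class name are illustrative assumptions, and the spectral weights are applied per channel (a diagonal simplification of the full channel-mixing R used in FNO).

import tensorflow as tf

class FourierLayer1D(tf.keras.layers.Layer):
    """One Fourier layer: h -> activation(W h + F^{-1}(R . F h))."""
    def __init__(self, width, modes):
        super().__init__()
        self.width, self.modes = width, modes
        # per-mode, per-channel complex spectral weights (diagonal version of R)
        self.R_re = self.add_weight(shape=(width, modes), initializer='glorot_normal')
        self.R_im = self.add_weight(shape=(width, modes), initializer='glorot_normal')
        self.W = tf.keras.layers.Dense(width)     # pointwise linear (residual) path

    def call(self, h):
        # h has shape (batch, n_grid, width); the FFT acts along the grid axis
        h_hat = tf.signal.rfft(tf.transpose(h, [0, 2, 1]))      # (batch, width, n_freq)
        R = tf.complex(self.R_re, self.R_im)
        low = h_hat[:, :, :self.modes] * R                      # filter the low modes
        out_hat = tf.concat([low, tf.zeros_like(h_hat[:, :, self.modes:])], axis=-1)
        conv = tf.signal.irfft(out_hat, fft_length=[tf.shape(h)[1]])
        conv = tf.transpose(conv, [0, 2, 1])                    # back to (batch, n_grid, width)
        return tf.nn.gelu(self.W(h) + conv)

layer = FourierLayer1D(width=32, modes=12)
h_out = layer(tf.random.normal([4, 64, 32]))   # e.g. a batch of 4 lifted fields on a 64-point grid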
One important requirement of FNO is that the input function is defined on a lattice
grid, which makes FNO often difficult to apply for problems where the input function
is defined on a few points (such as the boundary condition and initial condition) to be
mapped to the solution over the whole domain. Additionally, if the problem domain
is defined on an unstructured mesh like in complex geometries, implementing the
conventional FNO architecture is a challenge. To address these issues, several feature
expansions were proposed in Lu et al. (2022a).
dFNO+: This feature is implemented for problem setup where the domain of the
input function is different from the domain of the output function. For example,
if we want to map the initial condition for a problem to the spatial and temporal
evolution of the solution, we define the mapping as \mathcal{G} : v(x, 0) \to v(x, t),
where v(x, 0) defines the initial condition and v(x, t) is the evolved dynamics. In
such cases, the input space is defined on the spatial domain and so it is difficult to map
Fig. 6.3 Schematic representation of the Fourier neural operator. The input function, v(x), is
transformed to a higher dimensional representation using a shallow neural network, P , and then
operated by a series of L Fourier layers. Within each Fourier layer, a Fourier transform, F , of the
input is obtained to filter out the higher modes using a linear transform, R, and then converted back to
physical space using inverse Fourier transform, F −1 . Along with the Fourier space transformation,
a residual connection with a weight matrix, W, is applied to the input function and added and is later
acted upon by a nonlinear activation function. At the end of the Fourier layers, a local transformation
Q is applied by employing a shallow neural network to convert the output space to a dimension of
the input grid points
the output to the spatial and the temporal domain simultaneously. To this end, we
propose two approaches. In the first approach, we define a new input function, ṽ(x, t),
which has an additional temporal component and is defined as ṽ(x, t) = v(x). As our
second approach, we propose to define the output space employing a recurrent neural
network such that the solution operator is decomposed into a series of operators and
the solution of each time step is obtained iteratively using a time marching scheme
such that G : v(x, t) → v(x, t + Δt). Alternatively, the input space could be defined
as a subset of the output space, while attempting to map the boundary condition to
the solution defined over the entire domain. In such cases, the input function can be
padded with zeros for the domain’s interior points and then considered as input to
the neural operator.
gFNO+: FNO employs discrete Fast Fourier Transform (FFT), which necessitates
the definition of the input and output functions defined on a Cartesian domain with
a lattice grid mesh. However, for problems defined on complex real-life geometry,
an unstructured mesh is typically used, requiring us to deal with two issues: (1)
non-Cartesian domain and (2) non-lattice mesh. To handle issues associated with the
input and output functions defined on a non-Cartesian domain, we define a bounding
box and project the input and the output space by “nearest neighbor” to maintain
continuity at the boundaries. Alternatively, for issues associated with the non-lattice
mesh, we perform interpolation between the unstructured mesh and a lattice grid
mesh. The examples described in Sects. 6.6.1.2 and 6.6.1.3 are solved with the
combination of dFNO+ and gFNO+.
Wavelet Neural Operators: In Tripura and Chakraborty (2022), the Wavelet Neural
Operator (WNO) was proposed, which learns the network parameters in the wavelet
space that are both frequency and spatial localized, thereby can learn the patterns
in the images and/or signals more effectively. Specifically, the Fourier integral of
FNO was replaced by wavelet integrals for capturing the spatial behavior of a signal
or for studying the system under complex boundary conditions. It was shown that
WNO can handle domains with both smooth and complex geometries, and it was
applied in learning solution operators for a highly nonlinear family of PDEs with
discontinuities and abrupt changes in the solution domain and the boundary.
In the vanilla FNO, to guarantee the universal approximation property, different train-
able parameters are employed for each Fourier layer (Kovachki et al. 2021a). Hence,
the number of trainable parameters increases as the network gets deeper, which
makes the training process of the FNO more challenging and potentially prone to
overfitting. In You et al. (2022a) (see Fig. 6.6 of Sect. 6.6.1.1), it was found that when
the network gets deeper, the training error decreases in the FNO while the test error
becomes much larger than the training error, indicating that the network is overfitting
the training data. Furthermore, if one further increases the number of hidden layers L,
training the FNOs becomes challenging due to the vanishing gradient phenomenon.
To improve the neural network’s stability performance in the deep layer limit,
in You et al. (2022f), You et al. proposed to model the PDE solutions of unknown
governing laws as the implicit mappings between given loading/boundary conditions
and the resultant solution, with the neural network serving as a surrogate for the
solution operator. Based on this idea, the implicit FNO (IFNO) architecture was
developed, which can be interpreted as a data-driven surrogate of the fixed point
procedure, in the sense that the increment of fixed point iterations is modeled as an
autonomous operator between layers. As illustrated in Fig. 6.4, in IFNOs the same
parameter set is employed for each iterative layer, with the layer update written as
h(x, (l + 1)\Delta t) = h(x, l\Delta t) + \Delta t \, \sigma\big( W h(x, l\Delta t) + \mathcal{F}^{-1}\big[ R \cdot \mathcal{F}[h(\cdot, l\Delta t)] \big](x) + c \big) .
Fig. 6.4 A schematic representation for the implicit Fourier Neural Operator (IFNO), which
enhances the vanilla FNOs architecture with reduced memory cost and improved stability in the
deep layer limit. This architecture also employs the lifting layer, P , and the projection layer, Q, as in
the original FNOs, and proposes a modified model for the Fourier layer. Within each Fourier layer,
the number of layers is identified with the number of time steps in a time-discretization scheme,
and the increments between layers are parameterized via the action of the sum of Fourier space
transformation and the local linear transformation, in a manner that all Fourier layers share the
same set of trainable parameters. As such, the iterative layers can be interpreted as a discretized
autonomous integral differential equation
In IFNO, a forward pass through a very deep network is analogous to obtaining the
PDE solution as an implicit problem, and the universal approximation capability is
guaranteed as far as there exists a convergent fixed point equation. Since the proposed
architecture is built as a modification of the FNO, it also parameterizes the integral
kernel directly in the Fourier space and utilizes the fast Fourier transformation (FFT)
to efficiently evaluate the integral operator. Hence, IFNO inherits the advantages
of FNO on resolution independence and efficiency, while demonstrating not only
enhanced stability but also improved accuracy in the deep network limit. In Sect.
6.6.2.3, we demonstrate the performance of IFNO on biological tissue modeling
based on digital image correlation (DIC) measurements.
on a grid that employs FFT. Hence, the gradients of the output function with respect
to the input space cannot be computed using the automatic differentiation library
commonly employed in machine learning algorithms. To this end, the following
approaches can be used to explicitly define the gradients:
1. Using conventional numerical gradients such as the finite difference or Fourier
gradient. These approaches require either a fine discretization (for finite differences),
otherwise the numerical error is magnified, or a smooth and uniform grid (for spectral methods).
2. Applying automatic differentiation of the sum of the Fourier coefficient at every
spatial location (without doing the inverse FFT) and the value of a query func-
tion defined as an interpolation or a low-rank integral operator (Kovachki et al.
2021b).
3. Explicitly defining the gradients on the Fourier space and applying the chain
rule to compute the required quantities.
The authors of Li et al. (2021) have computed the exact gradient by defining the
gradient on the Fourier space. Additionally, the linear mapping that arises due to
the residual connection with the weight matrix, W shown in Fig. 6.3, is interpolated
using the Fourier method. To optimize the network parameters, the loss function
defined in Eq. (6.9) is minimized.
Besides the scenarios where full physics constraints, such as the known governing
equations, are provided, in many real-world modeling tasks, only partial physics laws
are available. To improve the learning efficacy on such problems, in You et al. (2022e)
the authors proposed to impose partial physics knowledge via a soft penalty constraint
to the neural operator. In particular, an IFNO was built to model the heterogeneous
material responses from the DIC displacement tracking measurements of multiple
biaxial stretching protocols on a porcine tricuspid valve anterior leaflet. Both the
constitutive model and the material microstructure are unknown, and hence there is no
known governing law that can be imposed. The authors then proposed to infuse the no-
permanent-set assumption to guide the training and prediction of the neural operators.
In other words, when the specimen is at rest, one should observe a zero displacement
field in the specimen. Specifically, a hybrid loss function L = Ldata + λL physics is
employed, with the same data-driven loss as defined in Eq. (6.9) and the physics-
informed loss defined as a penalization term:
\mathcal{L}_{physics} = \frac{1}{N} \sum_{i=1}^{N} \big| \mathcal{G}_{\theta}\big(\mathbf{0}(x)\big)(\xi_i) \big|^2 .
Here, 0(x) denotes the input function valued zero everywhere, and λ > 0 is a tunable
hyperparameter to balance the data-driven loss and the physics-informed loss. This
method was shown to improve the extrapolative performance of IFNO in the small
deformation regime.
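A sketch of the resulting hybrid loss is given below; the operator is written as a generic callable named operator (hypothetical), which could be an IFNO or the DeepONet sketch of the previous section, and the weight lam_phys plays the role of λ.

import tensorflow as tf

def hybrid_loss(operator, v, xi, u_true, lam_phys=0.1):
    # data-driven loss on the measured displacements
    data_loss = tf.reduce_mean(tf.square(u_true - operator(v, xi)))
    # no-permanent-set penalty: a zero input function should produce zero displacement
    physics_loss = tf.reduce_mean(tf.square(operator(tf.zeros_like(v), xi)))
    return data_loss + lam_phys * physics_loss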
h_i^{(k+1)} = \text{update}\big( h_i^{(k)}, \; \text{aggregate}(\{ h_j^{(k)} : j \in \mathcal{N}(i) \}) \big) ,   (6.13)
where h_i^{(k)} is the current state at step k for node i, and \mathcal{N}(i) represents the neighborhood of
node i which consists of the nodes that have a direct edge connection with i. Here,
update and aggregate can be defined as different kinds of functions based on specific
learning tasks, such as mean, max, normalized sum, a Multi-Layer Perceptron (MLP),
and Recurrent Neural Network (RNN), just to name a few. All these functions should
be designed to preserve the permutation invariance as desired. Figure 6.5 shows a
schematic representation of the kth MPL on a graph with five nodes. Usually, in the
input layer, a node embedding is generated to represent the feature of each node. The
goal of node embedding is to encode nodes so that similarity in the embedding space
(e.g. dot product) approximates similarity in the original network. Up to date, the
two major GNNs that are widely used are Graph Convolution Network (GCN) (Kipf
and Welling 2016) and Graph Attention Network (GATs) (Veličković et al. 2017).
In GCN, the equivalent of Eq. (6.13) is given by
h_i^{(k+1)} = \sigma\big( \hat{D}^{-1/2} \hat{A} \hat{D}^{-1/2} h_i^{(k)} W^{(k)} \big)   (6.14)
Fig. 6.5 A schematic representation of the message-passing layer (MPL) in graph neural networks (GNNs). Herein, h_i^{j} represents the embedded feature of node i in the jth layer. In the message and aggregate steps, information is gathered from each node, its edges, and its neighboring nodes. Then, the node feature is updated to h_i^{j+1}, the feature of node i in the (j + 1)th layer
where \hat{A} is the adjacency matrix with added self-connections, \hat{D} is its diagonal degree
matrix, and W^{(k)} is a trainable weight matrix for projecting output features into a lower
dimensional subspace; σ is the ReLU activation function. Overall, the GCN methodology produces
the normalized sum of the neighbor's node features.
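In matrix form, the GCN update of Eq. (6.14) can be sketched on a small dense graph as follows (the graph, the feature sizes, and the helper name are illustrative):

import tensorflow as tf

def gcn_layer(H, A, W):
    # H: (n_nodes, in_feat) node features, A: (n_nodes, n_nodes) adjacency,
    # W: (in_feat, out_feat) trainable projection
    A_hat = A + tf.eye(tf.shape(A)[0])                  # add self-connections
    d = tf.reduce_sum(A_hat, axis=1)
    D_inv_sqrt = tf.linalg.diag(1.0 / tf.sqrt(d))
    # Eq. (6.14): sigma(D^{-1/2} A_hat D^{-1/2} H W) with sigma = ReLU
    return tf.nn.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ H @ W)

# toy example: a five-node path graph, 3 input features, 4 output features
A = tf.constant([[0., 1., 0., 0., 0.],
                 [1., 0., 1., 0., 0.],
                 [0., 1., 0., 1., 0.],
                 [0., 0., 1., 0., 1.],
                 [0., 0., 0., 1., 0.]])
H = tf.random.normal([5, 3])
W = tf.random.normal([3, 4])
H_next = gcn_layer(H, A, W)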
GAT can be seen as a modified version of GCN, improving its generalizability.
GAT uses the attention mechanism as a substitute for the statically normalized convo-
lution operation so that more important nodes receive higher weight during neighbor-
hood aggregation. GAT introduces three more steps prior to using the same normal-
ized aggregation of GCN: (i) a linear transformation z_i^{(k)} = w^{(k)} h_i^{(k)} to transform the
input features h_i^{(k)} into higher level features, (ii) computing pairwise un-normalized
attention coefficients between two neighbors, e_{ij}^{(k)} = \text{LeakyReLU}\big( a^{(k)T} ( z_i^{(k)} \,\|\, z_j^{(k)} ) \big),
and (iii) using a softmax activation function, \alpha_{ij}^{(k)} = \exp(e_{ij}^{(k)}) \big/ \sum_{l \in \mathcal{N}(i)} \exp(e_{il}^{(k)}).
Here, we introduce the graph kernel network (GKN) approach and its variations,
which can be interpreted as a continuous version of a GNN as well as an integral
neural operator. The GKN is the first integral neural operator, which was introduced
in Li et al. (2020c) and has the foundation in the representation of the solution of
a PDE by Green’s function. As a motivating example, let us consider a PDE of the
form:
(\mathcal{L}_a u)(x) = v(x), \quad x \in \Omega ,
u(x) = 0, \quad x \in \partial\Omega ,
where a(x) is the parameter field, v(x) is the loading term which acts as the input
function in the solution operator, and u(x) is the PDE solution which can be seen as
the output function. We can define Green’s function G : Ω × Ω → R as the unique
solution to the problem under relatively general constraints on La , such that
\mathcal{L}_a G(x, \cdot) = \delta_x ,
so that the solution admits the representation u(x) = \int_{\Omega} G(x, y) \, v(y) \, dy. Since Green's
function depends on the parameter field a, this representation can be approximated by a learnable
kernel integral operator,
u(x) \approx \int_{\Omega} \kappa\big(x, y, a(x), a(y); \theta\big) \, v(y) \, dy ,   (6.15)
with κ being a shallow neural network taking x, y, a(x), and a(y) as its inputs. Based
on this idea, two graph-based neural operators, namely, the graph kernel network (Li
et al. 2020c) and the non-local kernel network (You et al. 2022a) were constructed
and will be entailed below. Here, both graph-based neural operators were constructed
with the data-driven loss only, and hence the physics is introduced through data. We
also point out that the ideas of imposing full or partial physics in Sects. 6.2.3 and
6.3.3 can be easily applied to these graph-based neural operators as well.
Graph Kernel Networks (GKNs): The idea of GKNs comes from parameterizing
Green’s function in an iterative architecture (Li et al. 2020c). As an integral neural
network similar to the FNOs, GKNs are also composed of a lifting layer P, iterative
kernel integration layers, and a projection layer Q. While the lifting layer and the
projection layer share the same architecture as the FNO, it is assumed that the iterative
kernel integration part is invariant across layers, with the update of each layer network
given via the action of the summation of a non-local integral operator Eq. (6.15) and
a linear operator:
Similar to FNOs, in GKNs the nodes within each layer are treated as a continuum,
so each layer representation can be seen by a function of the continuum set of nodes
D ⊂ R^d. κ ∈ R^{s×s} is a tensor kernel function that takes the form of a (usually shallow)
NN whose parameters θ are to be learned through training. Different from the setting
in FNOs, in GKNs the parameters W , c, and θ are layer independent. As such, the
GKN resembles the original ResNet block (He et al. 2016), where the usual discrete
affine transformation is substituted by a continuous integral operator. In practice, the
interaction range of kernel κ(x, y, a(x), a(y); θ ) is often chosen based on the known
information about the true kernel of the application or based on the computational
efficiency needs. When taking the interaction range as Ω, i.e. every point in the
whole domain has an impact on x, then the model will be more expressive but
computationally expensive. On the other hand, one can restrict the interaction range
as the ball with radius r centered at x, i.e. B_r(x), for efficiency purposes, keeping in
mind that this choice might compromise the accuracy.
In Li et al. (2020c), the integral in Eq. (6.16) is realized through a message-passing
graph neural network architecture. In particular, the physical domain Ω is assumed
to be discretized as a set of points χ := {x1 , . . . , x J } ⊂ Ω. Then, these points are
treated as the nodes of a weighted, directed graph, such that an edge {i, j} presents
when the representation of point x j has an impact on the representation of point xi ,
i.e. κ(xi , x j , a(xi ), a(x j ))) = 0. Denoting N (x) as the neighborhood of each point
x ∈ χ according to the graph, with the message-passing algorithm of Gilmer et al.
(2017) the integral operator in Eq. (6.16) is implemented as an averaging aggregation
of messages:
h(x, j + 1) = \sigma\Big( W h(x, j) + \frac{1}{|\mathcal{N}(x)|} \sum_{y \in \mathcal{N}(x)} \kappa\big(x, y, a(x), a(y); \theta\big) \, h(y, j) + c \Big)   (6.17)
where |N (x)| represents the total number of points in N (x). Comparing with the
FNOs described in Sect. 6.3, GKN has a more general kernel formulation since FNO
assumes κ(x, y, a(x), a(y); θ ) := κ(x − y) so as to allow the fast Fourier transform.
Hence, GKNs are theoretically more expressive than FNOs. Moreover, the general
kernel formulation also provides flexibility in model designs. As such, partial physics
knowledge such as the material isotropy (You et al. 2022c), homogeneity (You et al.
2021d, b), and rotational equivalence properties (Liu et al. 2022) can be explicitly
imposed by designing proper kernel models. However, when taking the interaction
range as the whole domain, i.e. N (xi ) = χ for all xi , the corresponding graph of Eq.
(6.17) will be fully connected, and the number of edges scales like O(J^2), which
makes the GKNs generally much more expensive than some other neural operators,
say, FNOs. To accelerate the computation of GKNs, several techniques were imple-
mented. In Li et al. (2020c), the Nyström approximation method is considered, which
samples uniformly at random the points of N (x), to reduce the complexity of compu-
tation when calculating the integral. In Li et al. (2020d), the multipole graph neural operator was proposed, which composes kernels on a hierarchy of nested graphs, inspired by the fast multipole method, so that the cost scales linearly in the number of nodes.
Fig. 6.6 2D Darcy flow in a square domain. Demonstration of the stability performance of three
integral neural operators. Comparison of relative mean squared errors from GKNs, FNOs, and
NKNs when using the training set with grid size Δx = 1/15 (You et al. 2022a). Error bars represent
standard errors over five simulations. Left: errors on the training dataset. Right: errors on the test
dataset with different resolutions: Δx = 1/15, Δx = 1/30, and Δx = 1/60. Detailed experimental
settings and further numerical results are provided in Sect. 6.6.1.1
that the integral operator on the right-hand side of Eq. (6.18) can be interpreted as a
non-local Laplacian (diffusion) operator:

$$\mathcal{L}^{\mathrm{diff}}_{\kappa}[h] := \int_{\Omega} \kappa\big(x, y, a(x), a(y)\big)\,\big(h(y, t) - h(x, t)\big)\, dy,$$

so that the layer update corresponds to the non-local diffusion–reaction equation

$$\frac{\partial h}{\partial t}(x, t) - \mathcal{L}^{\mathrm{diff}}_{\kappa}[h](x, t) + \beta(x)\, h(x, t) = c.$$
The stability of NKNs was analyzed via non-local vector calculus, showing that
when the kernel function κ is square-integrable and non-negative, and the reaction
parameter function β is positive and bounded, the learned non-local operator is
positive definite and the network is stable in the limit of deep layers. In You et al.
(2022a), when applied to the Poisson equation solution learning task it was found that
the NKNs’ amplification matrix is positive definite, while the GKNs’ matrix exhibits
negative eigenvalues, indicating that instabilities might occur. In Sect. 6.6.1.1, we
demonstrate the performance of NKNs, GKNs, and the vanilla FNOs, on learning
the solution operator for the 2D Darcy’s equation.
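As a concrete illustration of the discretized kernel integration in Eq. (6.17), the following minimal NumPy sketch evaluates one GKN-style layer on a point cloud. The two-layer kernel MLP, the interaction radius, and all variable names are illustrative assumptions rather than the implementation of Li et al. (2020c).

```python
import numpy as np

rng = np.random.default_rng(0)

def kernel_mlp(features, theta):
    """Shallow MLP kappa(x, y, a(x), a(y); theta) -> flattened (s, s) matrix."""
    W1, b1, W2, b2 = theta
    z = np.tanh(features @ W1 + b1)
    return z @ W2 + b2                       # shape (s * s,)

def gkn_layer(h, x, a, theta, W, c, r=0.25):
    """One kernel-integration layer, Eq. (6.17), with interaction range B_r(x)."""
    J, s = h.shape
    h_new = np.zeros_like(h)
    for i in range(J):
        # neighbors of x_i within radius r (always includes x_i itself)
        nbrs = np.where(np.linalg.norm(x - x[i], axis=1) <= r)[0]
        agg = np.zeros(s)
        for j in nbrs:
            feat = np.concatenate([x[i], x[j], a[i], a[j]])
            K_ij = kernel_mlp(feat, theta).reshape(s, s)
            agg += K_ij @ h[j]
        h_new[i] = np.tanh(W @ h[i] + agg / len(nbrs) + c)   # sigma = tanh
    return h_new

# toy sizes: J points in 2D, layer width s, scalar input field a
J, d, s = 64, 2, 8
x = rng.random((J, d))
a = rng.random((J, 1))
h = rng.standard_normal((J, s))
f_in = 2 * d + 2 * a.shape[1]                 # kernel input dimension
theta = (rng.standard_normal((f_in, 16)), np.zeros(16),
         rng.standard_normal((16, s * s)), np.zeros(s * s))
W, c = rng.standard_normal((s, s)) / s, np.zeros(s)
h = gkn_layer(h, x, a, theta, W, c)
print(h.shape)                                # (64, 8)
```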
Remark 6.1 Here, we would like to point out that a similar idea of considering the
correspondence between the stable architecture and the stable PDEs was also recog-
nized recently in Graph Neural Diffusion (GRAND) (Chamberlain et al. 2021). In
GRAND, the authors interpreted GNN architectures from a mathematical framework
by different choices of the form of the diffusion equation and discretization schemes.
It was shown that more advanced and stable numerical schemes such as Runge–Kutta
and implicit schemes would help to improve the performance and amount to larger
multi-hop diffusion operators in the design of deep GNN architectures.
Beyond the pioneering work of Chen and Chen (1995) on the universal approximation
theory of operators for a single layer, other works have appeared only recently for
deep neural networks. The first theoretical work that extended the Chen and Chen
theorem to deep neural networks was in Lu et al. (2021). The paper by Deng et al.
(2022) considers the advection–diffusion equation, including nonlinear cases. The
authors have shown that DeepONet has exponential approximation rates in the linear
case. Moreover, they demonstrated that by emulating numerical methods, DeepONet
has algebraic convergence with respect to the network size. In the paper by Lanthaler
et al. (2022), the authors have extended the universal approximation theorem in Chen
and Chen (1995) and have removed the continuity and compactness assumptions.
Also, they have provided an upper bound for the DeepONet error by decomposing it
into three parts: encoding error, approximation error, and reconstruction error. They
have also proven lower bounds on the reconstruction error by utilizing optimal errors
for projections on finite-dimensional affine subspaces of separable Hilbert spaces.
They have used this to prove the two-sided bounds on the DeepONet error. This
construction also allows them to infer the size of the trunk net needed to approximate
the eigenfunctions of the covariance operator in order to obtain optimal reconstruction
errors. In Marcati and Schwab (2021), the authors have shown that for linear second-
order PDEs with non-homogeneous coefficients and source terms, DeepONet has
exponential expression rates for the coefficient-to-solution operators in the H 1 norm.
Additionally, their results also show that neural networks can emulate accurately the
(discrete) solution map of Galerkin methods for the elliptic PDEs mentioned above
with numerical integration. They have also proven that the DeepONet architecture has
size $\mathcal{O}(|\log \epsilon|^{\kappa})$ for any $\kappa > 0$ depending on the physical space dimension. In paper
(Yu et al. 2021), the authors have shown that for non-polynomial activation functions,
an operator with a neural network of width five is arbitrarily close to any given
continuous nonlinear operator. They have also shown the theoretical advantages of
depth by constructing operator ReLU neural networks of depth $2k^3 + 8$ with constant
width, which they compare with other operators’ ReLU neural networks of depth k.
In paper (Kovachki et al. 2021a), the authors have shown that FNOs are uni-
versal, i.e. they can approximate any continuous operator to the desired accuracy.
However, they have also shown that in the worst case, the size of FNO can grow
super-exponentially in terms of the desired error for approximating a general Lips-
chitz continuous operator. They have proved rigorously that there exists a ψ-FNO,
which can approximate the underlying nonlinear operators efficiently for Darcy flow
and the incompressible Navier–Stokes equations. In paper (You et al. 2022f), You et
al. have shown that the IFNO is a universal solution-finding operator, in the sense
that it can approximate a fixed point method to the desired accuracy.
Apart from the data-driven approaches discussed above, the paper (De Ryck and
Mishra 2022) has presented error bounds for physics-informed operator learning for
both DeepONets and FNOs. Finally, in the paper (García et al. 2020), the authors have
proposed a general framework to analyze the rates of spectral convergence for a large
family of graph Laplacians. They established a convergence rate of $\mathcal{O}\big((\log n / n)^{1/2m}\big)$.
Also, in Li et al. (2020c), Li et al. have shown that the Graph Kernel Network approach
has competitive approximation accuracy to classical and deep learning methods.
6.6 Applications
This section is divided into two parts that present several numerical examples that
were solved using either data-driven training of the neural operators (Sect. 6.6.1)
or physics-informed training (Sect. 6.6.2) to evaluate the performance of the neural
operators discussed above. To assess the efficacy of neural operators, we compute
the relative $L^2$ error of the predictions. Each example includes information about the
data generation, network architecture, activation function, and optimizer used.
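For reference, a minimal sketch of this error metric (assuming the predicted and reference fields are sampled at the same set of grid points) is:

```python
import numpy as np

def relative_l2_error(u_pred, u_data):
    """Relative L2 error ||u_pred - u_data|| / ||u_data|| over the sampled grid."""
    return np.linalg.norm(u_pred - u_data) / np.linalg.norm(u_data)
```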
Fig. 6.7 2D Darcy flow in a square domain. A visualization of 16-layer FNO, GKN, and NKN performance on an instance of the conductivity parameter field K(x), when using the (normalized) “coarse” training dataset (Δx = 1/15) and testing on the “finer” dataset (Δx = 1/60). Here, the absolute pointwise error $|u^{\text{data}}(x) - u^{\text{pred}}(x)|$ is plotted

The “coarse” dataset (Δx = 1/15) was employed for the purpose of training, and the performance of the resultant neural operators was tested on datasets with all resolutions.
Results: We report the training/test errors of FNO, GKN, and NKN in Fig. 6.6 and
the plot of solutions in one representative test sample in the “finer” dataset in Fig.
6.7, where all neural operators use L = 16 layers. One can observe that all
solutions obtained with NKN are visually consistent with the ground-truth solutions,
while GKN loses accuracy near the material interfaces. FNO results are off in even
larger regions. These results provide a further qualitative demonstration of the loss
of accuracy in FNOs and GKNs from the previous sections. In this case, the averaged
relative test errors from GKN, FNO, and NKN are 4.71, 9.29, and 3.28%, respectively.
The results for this problem have been adapted from You et al. (2022a).
We now further consider Darcy flows in a triangular domain with a vertical notch
to map the boundary condition to the hydraulic head, u(x), with K (x, y) = 0.1,
and f = −1. In this illustration, we assess the outcomes of data-driven DeepONet,
dgFNO+, and POD-DeepONet.
$$\nabla \cdot \mathbf{u} = 0, \qquad (6.21)$$
$$\partial_t \mathbf{u} + \mathbf{u} \cdot \nabla \mathbf{u} = -\nabla P + \nu\, \nabla^2 \mathbf{u} \qquad (6.22)$$
Fig. 6.8 Darcy flow in a triangular domain with a notch. For a representative boundary condition,
the hydraulic head is obtained using DeepONet, dgFNO+, and POD-DeepONet. The prediction
errors for the three operator networks are shown against the respective plots. The ground truth is
simulated using the PDE Toolbox in MATLAB. The predicted solutions and the ground truth share
the same color bar, while the errors corresponding to each of the neural operators are plotted on the
same color bar. The results are adopted from Lu et al. (2022a). Here, the absolute pointwise error $|u^{\text{data}}(x) - u^{\text{pred}}(x)|$ is plotted
where u = (u, v), with u the velocity in the x-direction and v the velocity in the
y-direction, P denotes the pressure, and ν is the kinematic viscosity of the
fluid. We consider a case with different boundary conditions for the upper wall. In
particular, the boundary conditions are expressed as follows:
$$u = U\left(1 - \frac{\cosh\!\big(r\,(x - \tfrac{L}{2})\big)}{\cosh\!\big(\tfrac{rL}{2}\big)}\right), \qquad v = 0 \qquad (6.23)$$
where U, r, and L are constants. In this setup, r = 10, and L = 1 is the length of the
cavity. In addition, the remaining walls are assumed to be stationary. The aforemen-
tioned equations are then solved using the lattice Boltzmann method (Meng and Guo
2015) to generate the training data. Interested readers could find more details on data
generation in Lu et al. (2022a). To generate the labeled dataset, we simulate 100
velocity flow fields at Reynolds numbers Re ∈ [100, 2080], sampled in steps of 20;
90 of these are used for training the neural operator, and 10 are held out as unseen
cases to test the accuracy of the trained network. The operator network maps the upper wall
boundary condition to the converged velocity field. The DeepONet architecture uses
a convolutional neural network in the branch net and an FNN with 3 hidden layers of
128 neurons in the trunk net. The hyperbolic tangent activation function tanh is used
for this problem. Next, we use POD-DeepONet with the same branch net architecture
and 6 modes to approximate the flow fields.
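To make the branch–trunk construction concrete, the following minimal NumPy sketch shows how a DeepONet combines branch coefficients and trunk basis functions, and how POD-DeepONet replaces the trunk by precomputed POD modes. A dense branch network stands in for the convolutional branch used here, and all sizes, initializations, and placeholder modes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
tanh = np.tanh

def fnn(x, weights):
    """Fully connected network with tanh activations (linear last layer)."""
    for W, b in weights[:-1]:
        x = tanh(x @ W + b)
    W, b = weights[-1]
    return x @ W + b

def init_fnn(sizes):
    return [(rng.standard_normal((m, n)) * np.sqrt(1.0 / m), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

# DeepONet: G(u)(y) = sum_k branch_k(u) * trunk_k(y)
m, p, n_pts = 100, 64, 400                 # sensors, basis size, query points
branch = init_fnn([m, 128, 128, p])        # dense stand-in for the CNN branch
trunk  = init_fnn([2, 128, 128, 128, p])   # 3 hidden layers of 128 neurons

u_sensors = rng.random(m)                  # boundary condition sampled at m sensors
y_query   = rng.random((n_pts, 2))         # (x, y) query locations in the cavity

b_coef  = fnn(u_sensors[None, :], branch)      # (1, p) branch coefficients
t_basis = fnn(y_query, trunk)                  # (n_pts, p) trunk basis
G_deeponet = (t_basis * b_coef).sum(axis=1)    # predicted field at query points

# POD-DeepONet: trunk replaced by 6 precomputed POD modes phi_k(y)
pod_modes  = rng.standard_normal((n_pts, 6))   # placeholder for data-driven modes
branch_pod = init_fnn([m, 128, 128, 6])
G_pod = pod_modes @ fnn(u_sensors[None, :], branch_pod).ravel()
print(G_deeponet.shape, G_pod.shape)           # (400,) (400,)
```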
Fig. 6.9 Steady cavity flow described by the Navier–Stokes equations. A representative case from the test dataset depicting the flow fields, u and v in the x- and y-directions, respectively, and the associated errors obtained using POD-DeepONet. The predicted solutions and the ground truth share the same color bar, while the errors corresponding to each of the neural operators are plotted on the same color bar. Here, the magnitude of the velocity error $|\mathbf{u}^{\text{data}}(x) - \mathbf{u}^{\text{pred}}(x)|$ is plotted
Results: The relative $L^2$ error is reported as 1.20, 0.33, and 0.63% for predicting
flow fields for unseen boundary conditions using DeepONet, POD-DeepONet, and
dFNO+. A representative case is shown in Fig. 6.9, where the predicted results and
errors corresponding to POD-DeepONet are shown for a specific boundary. The
results for this problem have been adapted from Lu et al. (2022a).
In this example, we aim to model the final crack path for any initial location of the
crack on a unit square plate that is fixed at the left and bottom edges and subjected
to shear loading on the top edge. This example illustrates the benefits of using a
physics-driven loss by comparing the results obtained using PI-DeepONet and data-driven DeepONet.
We use the phase-field approach (Goswami et al. 2019) to model the crack in the
domain. The material properties considered are λ = 121.15 kN/mm² and μ = 80.77
kN/mm², as Lamé's first and second parameters, respectively, and the critical energy
release rate is G_c = 2.7 × 10⁻³ kN/mm. The boundary conditions of the setup are
denoted as follows:
where u and v are the solutions of the elastic field in x- and y-axis, respectively, and
Δu is the incrementally applied shear displacement on the top edge of the plate. To
train the PI-DeepONet using the variational form, we minimize the total energy of
the system, which is defined as follows:
$$E = \Psi_e + \Psi_c,$$
$$\text{where } \Psi_e = \int_{\Omega} f_e(\boldsymbol{x})\, d\boldsymbol{x}, \quad f_e(\boldsymbol{x}) = (1 - \phi)^2\, \psi_e^{+}(\boldsymbol{\epsilon}) + \psi_e^{-}(\boldsymbol{\epsilon}),$$
$$\Psi_c = \int_{\Omega} f_c(\boldsymbol{x})\, d\boldsymbol{x}, \quad f_c(\boldsymbol{x}) = \frac{G_c}{2 l_0}\big(\phi^2 + l_0^2\, |\nabla \phi|^2\big) - (1 - \phi)^2\, H(\boldsymbol{x}, t) \qquad (6.25)$$
where Ψe is the stored elastic strain energy, Ψc is the fracture energy, l0 is the length
scale parameter that controls the diffusion of the crack, ψe+ and ψe− are the tensile and
the compressive components of the strain energy densities obtained by the spectral
decomposition of the strain tensor, and H (x, t) is the strain-history functional. In
this example, we have used a hybrid loss function to train the network parameters.
The training samples are obtained (n = 11) for different initial crack lengths, lc ∈
[0.2, 0.7], in steps of 0.05. For the network architecture, the branch net and the trunk
net are each four-layer fully connected neural networks with [100, 50, 50, 50] neurons.
respectively. Once the solution is evaluated at the sampled points, the outputs for the
elastic field are modified to exactly satisfy the Dirichlet boundary conditions:
where Ĝθu and Ĝθv are obtained from the DeepONet. The conventional data-driven
DeepONet is trained with the same 11 samples, keeping the network architecture
of the branch net and the trunk net exactly the same. The synthetic data to train
the network is generated using the codes developed in Goswami et al. (2020a). To
improve the accuracy, the training samples are increased to 43.
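For illustration, the energy-based part of such a loss can be assembled by quadrature (here, simple Monte Carlo averaging) over collocation points, mirroring Eq. (6.25) exactly as written above; the function below is a hypothetical sketch in which the phase field, the split strain-energy densities, and the history field are assumed to be available as arrays evaluated at those points, and the length-scale value is illustrative.

```python
import numpy as np

def energy_loss(phi, psi_e_plus, psi_e_minus, grad_phi, hist, area,
                Gc=2.7e-3, l0=0.0125):
    """Discretized total energy E = Psi_e + Psi_c of Eq. (6.25), using a simple
    Monte Carlo estimate of the domain integrals.

    phi          : (N,) phase-field values at collocation points
    psi_e_plus   : (N,) tensile part of the elastic strain-energy density
    psi_e_minus  : (N,) compressive part of the elastic strain-energy density
    grad_phi     : (N, 2) spatial gradient of the phase field
    hist         : (N,) strain-history functional H(x, t)
    area         : domain area, so that area * mean(f) approximates the integral
    """
    f_e = (1.0 - phi) ** 2 * psi_e_plus + psi_e_minus
    f_c = Gc / (2.0 * l0) * (phi ** 2 + l0 ** 2 * (grad_phi ** 2).sum(axis=1)) \
        - (1.0 - phi) ** 2 * hist          # sign convention follows Eq. (6.25)
    return area * np.mean(f_e + f_c)
```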
Results: A prediction error of 2.16% on φ is reported when PI-DeepONet was
employed. Additionally, predictions of data-driven DeepONet have an error of 26.2
and 3.12% for φ when trained with 11 samples and 43 samples, respectively. Figure
6.10 presents the predicted solution for lc = 0.375 mm,
Fig. 6.10 Shear failure: the PI-DeepONet is trained with 11 crack lengths to predict the final
damage path for any crack length when the height of the crack is fixed at the center of the left edge.
The plot is for lc = 0.375 mm, where Δu = 0.220 mm. The predicted displacement in x-direction
is plotted for two locations along the x-axis and is compared with ground truth to show the accuracy
of the prediction. The results are adopted from Goswami et al. (2022d). Here, for the phase-field parameter the absolute pointwise error $|\phi^{\text{data}}(x) - \phi^{\text{pred}}(x)|$ is plotted, and for the displacement field the magnitude of the pointwise error $|\mathbf{u}^{\text{data}}(x) - \mathbf{u}^{\text{pred}}(x)|$ is plotted
which is obtained using PI-DeepONet. The results for the data-driven DeepONet
suggest that it is unable to capture the crack diffusion phenomenon and cannot
generalize to complex fracture phenomena with limited datasets.
where K (x) is spatially varying hydraulic conductivity and h(x) is the hydraulic
head. The setup is of a unit square plate with a discontinuity of 5 × 10−3 mm.
For generating multiple conductivity fields for training the neural operator, we
describe the conductivity field, K (x), as a stochastic process. In particular, we take
K (x) = exp(F(x)), with F(x) denoting a truncated Karhunen–Loève (KL) expan-
sion for a certain Gaussian process, which is a finite-dimensional random variable.
The DeepONet is trained using the variational formulation of the governing equa-
tion and without any labeled input–output datasets. The optimization problem can
be defined as follows:
$$\text{Minimize: } E = \Psi_h, \qquad \text{subject to: } h(\boldsymbol{x}) = 0 \ \text{on } \partial\Omega_D \qquad (6.28)$$
The branch and the trunk networks are two separate FNNs, each with 6 hidden layers
of 32 neurons per hidden layer.
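A minimal sketch of the truncated KL construction of the conductivity field is given below; the squared-exponential covariance, correlation length, variance, and number of retained modes are illustrative assumptions, not the settings used in the chapter.

```python
import numpy as np

rng = np.random.default_rng(2)

def kl_conductivity(x, n_modes=16, corr_len=0.25, sigma2=1.0):
    """Sample K(x) = exp(F(x)) with F a truncated Karhunen-Loeve expansion of a
    zero-mean Gaussian process with squared-exponential covariance."""
    d2 = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)     # pairwise squared distances
    C = sigma2 * np.exp(-0.5 * d2 / corr_len ** 2)           # covariance matrix
    eigval, eigvec = np.linalg.eigh(C)                       # ascending eigenvalues
    eigval, eigvec = eigval[::-1][:n_modes], eigvec[:, ::-1][:, :n_modes]
    xi = rng.standard_normal(n_modes)                        # iid N(0, 1) coefficients
    F = eigvec @ (np.sqrt(np.maximum(eigval, 0.0)) * xi)
    return np.exp(F)

# one conductivity sample on a 32 x 32 grid over the unit square
g = np.linspace(0.0, 1.0, 32)
pts = np.stack(np.meshgrid(g, g), axis=-1).reshape(-1, 2)
K = kl_conductivity(pts)
print(K.shape)   # (1024,)
```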
Results: The trained PI-DeepONet yields a predictive error of 3.12%. The loss tra-
jectory is shown in Fig. 6.11b. The prediction of h(x) for a representative sample of
K (x) using PI-DeepONet is shown in Fig. 6.11a. It is interesting to note that we have
tried to solve the problem by minimizing the residual (Wang et al. 2021). However,
the residual-based DeepONet is not able to approximate the solution of h(x) for a
given K (x).
Fig. 6.11 Flow in heterogeneous porous media: a The predicted h(x) for a given conductivity field,
K (x) (plotted on a log scale), is shown for a representative sample. True h(x) represents the ground
truth and is the simulated solution using the MATLAB PDE toolbox. The difference between the
predicted h(x) and the ground truth is shown in the error plot. b The plots show the loss trajectory
of the two components of the loss function. The plot on the left shows the decrease in energy of
the domain with respect to the number of epochs, while the plot on the right shows the boundary
loss term. The results are adopted from Goswami et al. (2022d). Here, the absolute pointwise error $|h^{\text{data}}(x) - h^{\text{pred}}(x)|$ is plotted
Fig. 6.12 Problem and experimental setups for the biological tissue modeling example. a An
image of the speckle-patterned porcine tricuspid valve anterior leaflet (TVAL) specimen subject to
biaxial stretching, with the DIC tracking grid shown in green. b Schematic of a specimen subject to
Dirichlet-type boundary conditions, so the goal of neural operator learning is to provide a surrogate
mapping from the boundary displacement u D (x), x ∈ ∂Ω, to the displacement field u(x), x ∈ Ω.
c Illustration of the seven protocols of mechanical testing on a representative TVAL specimen. Here,
P11 and P22 denote the first Piola–Kirchhoff stresses in the x- and y-directions, respectively
Fig. 6.13 Biological tissue modeling from DIC displacement data. Upper: in-distribution pre-
diction with training set on 83% of randomly selected samples. a Sample-wise error comparison
between a conventional constitutive model (the Fung-type model) and the implicit FNO on all
biaxial testing protocol sets. b Visualization of the Fung-type model fitting and implicit FNO
performances on a representative test sample. Bottom: out-of-distribution prediction on the small
deformation regime, by training each model on protocol sets 1, 2, and 4 then testing on the rest of the
sets. c Sample-wise error comparison between the Fung-type model, the original implicit FNO, and
the physics-guided implicit FNO on testing protocol sets. d Visualization of the Fung-type model
fitting and physics-guided implicit FNO performances on a representative test sample. On (a) and
(c), the relative mean squared errors were plotted for each sample. On (b) and (d), the magnitudes
of displacement errors on each material point were plotted
FNO methods for predicting the out-of-distribution material responses in the small
deformation regime.
Results: From in-distribution tests (see Fig. 6.13a, b), we found that the proposed data-
driven approach presents good prediction capability to unseen loading conditions
with the same type of biaxial loading ratios and outperforms the phenomenological
model. Specifically, the implicit FNO model achieved only 1.64% relative error on
the test dataset, while the Fung-type model has a 10.83% error. To provide further
insights into this comparison, in Fig. 6.13b, both the x- and y-displacement solutions
and the prediction errors are visualized on a representative test sample. The Fung-type
model, which considered the homogenized stress–strain at one material point (i.e. the
center of the specimen) due to limited information about the spatial variation in the
stress measurement, failed to capture the material heterogeneity and hence exhibited
large prediction errors in the interior region of the TVAL specimen domain. This
observation confirms the importance of capturing the material heterogeneity and
verifies the capability of the neural operators in heterogeneous material modeling for
in-distribution learning tasks.
On the other hand, when tested on out-of-distribution loading ratios, the neural
operator learning approach becomes less effective and has comparable performance
as the constitutive modeling (see Fig. 6.13c, d). In particular, when the models are
trained on protocols in the large deformation region and then tested on protocols in
the small deformation region, 16.78% and 16.80% prediction errors were observed
from the implicit FNO and Fung-type model, respectively. To improve the gener-
alizability of the neural operators, partial physics knowledge was infused using the
no-permanent-set assumption discussed in Sect. 6.3.3, and this method is shown
to improve the model’s extrapolative performance in the small deformation regime
by around 1.5%. This study demonstrates that with sufficient data coverage and/or
regularization terms from partial physics constraints, the data-driven neural operator
learning approach can be a more effective method for modeling complex biological
materials than the traditional constitutive modeling approaches. The results for this
problem have been adapted from You et al. (2022e).
In this chapter, we have reviewed the basics of three neural operators, DeepONet,
FNO, and the graph neural operator, as well as their extensions, and have presented
representative application examples. While the list of possible applications of neural
operators will continue to expand in the near future, here we provide a partial list of
their role in applications so far.
• For real-time forecasting: designing efficient control systems, fault detectors for
car engines, solving complex multi-physics problems in less than a second (Cai
et al. 2021).
• Proper hybridization of physics- and data-based models can achieve the goal of
generating an efficient, accurate, and generalizable model that can be used to
greatly accelerate the modeling of time-dependent multi-scale systems (Yin et al.
2022).
• To reduce the dependency on large paired datasets (when accurate information
about the governing law of the physical system is not available), Deep Transfer
Operator Learning (DeepONet) (Goswami et al. 2022e) can be used for accurate
prediction of quantities of interest in related domains, with only a handful of new
measurements.
• Develop faster ways to train neural operators by incorporating multi-modality and
multi-fidelity data (Lu et al. 2022b; Howard et al. 2022; De et al. 2022).
• The application of neural operators in life sciences is endless. For example,
approaches developed in Yin et al. (2022), Goswami et al. (2022f) show the appli-
cation of DeepONet for accurate prediction of aortic dissection and aneurysm,
which is patient specific and hence could provide clinicians sufficient time for
planning surgery.
• DeepONet can be used for accelerating climate modeling by adding learned high-order corrections to the low-resolution (e.g. 100 km) climate simulations.
Next, we discuss possible new developments required in the future to further
advance physics-informed deep learning, and in particular neural operators.
In the last two decades, we have seen rapid advances in GPU computing that
together with the simultaneous advances in deep learning algorithms have enabled
the development of new hybrid models based on both physics-driven and data-driven
methods. Research teams are now working on developing high-fidelity digital twins
of the human organs and the Earth’s atmosphere, which will require long and expen-
sive training, even more than the expensive transformer language models developed
by big software companies involving hundreds of billions of parameters. To deal
with the increasing cost and cope with the urgent demand for a real-time inference
that requires only very little new training for problems at scale in computational
mechanics, higher levels of abstraction of these algorithms are required. To that end,
continual learning at the operator regression level is a promising avenue in establish-
ing the mathematical foundations of digital twins in computational mechanics and
beyond. Some authors and even industry researchers use concepts from principal
component analysis and reduced order modeling to build digital twins but the lack of
over-parametrization of such methods, even of their nonlinear extensions, is a lim-
iting factor for their effectiveness in realistic scenarios of diverse and unanticipated
operating conditions.
Adding physics into the training of neural operators in addition to any available
data enhances their accuracy and generalization capacity for tasks even outside the
distribution of the input space. Scalable physics-informed neural networks (Shukla
et al. 2022) can be employed to solve high-dimensional problems not possible with
traditional finite element solvers, e.g. up to approximately 10 dimensions if not
more. Similarly, scalable physics-informed neural operators can also solve high-
dimensional problems even in real time and can be used for designing very complex
systems. For example, in De et al. (2022), the authors solved an industry-based
problem of computing the power generated in a utility-scale wind plant with 64
turbines, considering the uncertainty of the wind speed, inflow direction, and yaw
angle (a schematic representation shown in Fig. 6.14). The layout of 64 wind turbines
is on a two-dimensional mesh which was used to generate the training data as shown in
Fig. 6.14b. The annual energy output of a wind farm is often calculated by estimating
the predicted power of the joint distribution of wind speed and direction; however, the
quantity of function evaluations necessary frequently prevents the use of high-fidelity
models in industry (King et al. 2020).
To that end, training a DeepONet or any other neural operator with only high-
fidelity data would be computationally expensive. One promising way to solve such
realistic problems is using multi-fidelity approaches proposed in De et al. (2022),
Lu et al. (2022b), Howard et al. (2022). Such realistic problems can leverage the
generalized structure of DeepONet, which can be flexibly designed for any problem at
hand. Scaling neural operators to industry-level problems with parallel multi-GPU
training could be a very impactful research direction. Another interesting direction is
Fig. 6.14 a Wind farm layout of six wind turbines. b Arrangements of 8 × 8 wind turbines in a
farm. Adopted from De et al. (2022)
developing graph neural operators for modeling digital twins and complex systems-
of-systems, in particular, with the ability to apply causal inference for discovering
intrinsic pathways not easily discovered by other methods. Finally, to address the
excessive cost of training in deep learning and make edge computing a reality, devel-
oping the energetically favorable spiking neural networks on neuromorphic comput-
ers could lead to (at least) three orders of magnitude in energy savings while we may
come closer to more biologically plausible neural operator architectures; see also
Fig. 6.1.
References
Cai S, Wang Z, Lu L, Zaki TA, Karniadakis GE (2021) Deepm&mnet: Inferring the electrocon-
vection multiphysics fields based on operator approximation by neural networks. J Comput Phys
436:110296
Cai S, Wang Z, Wang S, Perdikaris P, Karniadakis GE (2021) Physics-informed neural networks
for heat transfer problems. J Heat Transf 143(6)
Chamberlain B, Rowbottom J, Gorinova MI, Bronstein M, Webb S, Rossi E (2021) Grand: Graph
neural diffusion. In: International conference on machine learning. PMLR, pp 1407–1418
Chen T, Chen H (1995) Universal approximation to nonlinear operators by neural networks with
arbitrary activation functions and its application to dynamical systems. IEEE Trans Neural Netw
6(4):911–917
De Hoop M, Huang DZ, Qian E, Stuart AM (2022) The cost-accuracy trade-off in operator learning
with neural networks. arXiv:2203.13181
De Ryck T, Mishra S (2022) Generic bounds on the approximation error for physics-informed (and)
operator learning. arXiv:2205.11393
You H, Yu Y, D’Elia M, Gao T, Silling S (2022a) Nonlocal kernel network (NKN): a stable and
resolution-independent deep neural network. arXiv:2201.02217
You H, Yu Y, Silling S, D’Elia M (2021b) Data-driven learning of nonlocal models: from high-
fidelity simulations to constitutive laws. In: Accepted in AAAI spring symposium. MLPS
You H, Yu Y, Silling S, D’Elia M (2022c) A data-driven peridynamic continuum model for upscaling
molecular dynamics. Comput Methods Appl Mech Eng 389:114400
You H, Yu Y, Trask N, Gulian M, D’Elia M (2021d) Data-driven learning of robust nonlocal physics
from high-fidelity synthetic data. Comput Methods Appl Mech Eng 374:113553
You H, Zhang Q, Ross CJ, Lee C-H, Hsu M-C, Yu Y (2022e) A physics-guided neural opera-
tor learning approach to model biological tissues from digital image correlation measurements.
arXiv:2204.00205
You H, Zhang Q, Ross CJ, Lee C-H, Yu Y (2022f) Learning deep implicit Fourier neural operators
(IFNOs) with applications to heterogeneous material modeling. To appear in Comput Methods
Appl Mech Eng
Yu A, Becquey C, Halikias D, Mallory ME, Townsend A (2021) Arbitrary-depth universal approx-
imation theorems for operator neural networks. arXiv:2109.11354
Zhang E, Dao M, Karniadakis GE, Suresh S (2022) Analyses of internal structures and defects in
materials using physics-informed neural networks. Sci Adv 8(7):eabk0644
Zhu Y, Zabaras N, Koutsourelakis P-S, Perdikaris P (2019) Physics-constrained deep learning
for high-dimensional surrogate modeling and uncertainty quantification without labeled data. J
Comput Phys 394:56–81
Chapter 7
Digital Twin for Dynamical Systems
7.1 Introduction
A digital twin is a virtual replica of a physical system that exists either in the computer
or in the cloud (Vinicius et al. 2019; Coronado et al. 2018). Compared to conventional
computational models that emulate the physical system’s behavior in a temporally
static sense, a digital twin attempts temporal synchronization of the physical and
digital twins (Arup 2019; Worden et al. 2020); this naturally necessitates real-time
updating of the digital twin based on real data collected by using sensors. Note
that a digital twin might dictate changes to the physical system via actuators. This
is particularly true when a digital twin is used for active vibration control. While
the concept, in theory, has been there since 2002, the first practical definition was
provided in Eric et al. (2011). Rapid developments in Machine Learning and Internet
of Things (IoT) are two of the critical factors that enabled the emergence of this
technology. Additionally, with continuous growth in the usage of connected devices,
network speed, and increasing popularity of cloud platforms, digital twin technology
is likely to observe rapid growth in the upcoming years.
By definition itself, the term “digital twin” is extremely vast and can be applied
to any engineering field (Wagg et al. 2020a, b). A few of the many applications of
digital twin include prognostics and health monitoring (Wihan et al. 2020; Jinjiang
et al. 2019; Millwater et al. 2019; Zhou et al. 2019), manufacturing (Tarasankar et al.
2017; Haag and Anderl 2018; Yuqian et al. 2020; Kyu et al. 2020; Bin and Kai-Jian
2019), automotive and aerospace engineering (Li et al. 2017; Michael et al. 2020;
Eric et al. 2011; Sergey and Andrey 2020), to mention a few. In this chapter, we
mainly focus on digital twins for dynamical systems. The development of digital
twin technology is a non-trivial undertaking because of the presence of multiple
timescales (Chakraborty et al. 2021). For example, a typical time period of a wind
turbine is generally in seconds, whereas the operational period is in years (Adhikari
and Bhattacharya 2012). Therefore, a digital twin should be able to handle multiple
timescales seamlessly.
The development of digital twins can either be pursued from the perspective of
physics-based modeling or from the standpoint of data-based modeling. For exam-
ple, in Ganguli and Adhikari (2020), a physics-based digital twin was proposed for
dynamical systems. A physics-based digital twin is robust to changing environments;
however, these twins rely on (a) exact knowledge of the governing physics and (b)
clean (noise-free) data. Unfortunately, neither can be achieved in a realistic scenario.
The alternative, the data-based digital twin (Wihan et al. 2020), eliminates these issues.
However, purely data-based twins generally fail to generalize to unseen environ-
ments. Accordingly, the performance often deteriorates, and then the digital twin
goes out of sync with its physical counterpart. A third alternative does exist where
the idea is to fuse data with physics so as to exploit the strength of the two approaches.
For instance, if the physics of the problem is partially known, data-driven modeling
techniques can be used to compensate for the missing physics (Garg et al. 2022).
Such data-physics hybridization is often referred to as the “gray-box” modeling and
will potentially drive the development of future digital twin technologies.
The development of hybrid digital twin technology is often based on the assump-
tion that the overall physics of the system stays the same, and it is the parameter
that evolves. With the updated parameters the hybrid digital twin aims to predict the
behavior of the physical twin under unseen environments. For example, to numer-
ically model damage, we generally reduce the stiffness of a system/component.
However, this is an approximation and in a realistic setting, the physics of the sys-
tem often changes. For instance, the physics of a system can vary because of the
initiation of the crack. As a result, the predictions will diverge away from the actual
trajectory and over the course of time, the digital twin will get desynchronized from
the physical twin. Therefore, a digital twin must be able to track the change in
physics. Unfortunately, the digital twin literature addressing this aspect is almost
absent. One possible approach is to continuously identify the model-form errors in
the nominal model and perform a simultaneous parameter-model update. A more
appropriate solution is to identify the model updates using explainable functions
and estimate-associated epistemic uncertainties. These added features will further
enable the framework to correctly define the degree of generalizability, and accurate
estimation of long-term prediction and remaining useful life. Toward the end of this
chapter, a Bayesian framework in conjunction with stochastic structural dynamics is
demonstrated which addresses these issues.
Lastly, all physical systems have inherently associated randomness; hence, a dig-
ital twin should also account for the uncertainty in the system and data. Overall,
in a physical system, uncertainty might be present in material properties, boundary conditions, and the applied loading.
By definition, the idea of the digital twin is quite broad. However, on a higher level,
a digital twin generally consists of four key modules, (a) visualization module, (b)
update module, (c) prediction module, and (d) decision module. While the visual-
ization module is the front end of the overall digital twin framework, the other three
are the engines that drive the overall digital twin framework. A brief description and
functionality of each of the modules are provided below. A schematic representation
of the same is shown in Fig. 7.1.
Remark 7.1 While all the modules are equally important, this chapter primarily
focuses on the update module and the prediction module. Note that a digital twin
with only the first three components is often called the predictive digital twin.
In this section, we discuss the physics-based digital twin. However, before going into
the details of a physics-based digital twin, we briefly discuss the notion of a nominal
model.
Fig. 7.1 Different modules within a digital twin, along with its functionality
The journey of a digital twin starts from a nominal model. By definition, a nominal
model is the initial model and represents the system during its installation. In engi-
neering, the nominal model can be considered a validated, verified, and calibrated
physics-based model. For instance, it can be a finite element model of a bridge or an
aircraft. We here explain the overall concept of the nominal model by using a simple
single-degree-of-freedom (SDOF) system.
We consider a physical system that can be represented as an SDOF system. We
also consider that the sensors sample at discrete time points ts . Assuming that the
evolution of the system parameters is only dependent on the slow time ts , the equation
of motion can be written as
$$m(t_s)\,\frac{\partial^2 u(t, t_s)}{\partial t^2} + c(t_s)\,\frac{\partial u(t, t_s)}{\partial t} + k(t_s)\,u(t, t_s) = f(t, t_s) \qquad (7.1)$$
Here t and ts are the “system time” and “slow time”, respectively. The terms m(ts ),
c(ts ), and k(ts ) are, respectively, the mass, damping, and stiffness of the system at ts .
The response u(t, ts ) and force f (t, ts ) are now a function of both t and ts ; hence, the
equation is represented using partial derivatives. We consider the slow time or the
service time ts as a time variable that is much slower than t. For example, ts could
represent the number of cycles, and the change in system parameters represents the
degradation of the system during its lifetime.
Equation (7.1) is considered as a digital twin of an SDOF dynamic system. When
ts = 0, that is, at the beginning of the service life of the system, the digital twin in
Eq. (7.1) reduces to the nominal system.
$$m_0\,\frac{d^2 u_0(t)}{dt^2} + c_0\,\frac{d u_0(t)}{dt} + k_0\,u_0(t) = f_0(t) \qquad (7.2)$$
where m_0, c_0, k_0, and f_0 are, respectively, the mass, damping, stiffness, and force at
t_s = 0.
For Eq. (7.1) to be the digital twin for a SDOF system, the system parameters m(ts ),
c(ts ), and k(ts ) need to be continuously updated based on the data collected from
the physical counterpart. This essentially means that updating the digital twin is
equivalent to updating these parameters.
We consider the sensors to be installed on the physical system to take measurements
at instants of time defined by t_s. With this setup, the objective of the update
module is to estimate the parameters based on measurements at each ts . Herein, we
assume that the variation in c(ts ) is negligible and limit ourselves to the variation in
m(ts ) and k(ts ) only. Without any loss of generality, the following functional forms
are considered:
$$k(t_s) = k_0\,\big(1 + \Delta_k(t_s)\big) \quad \text{and} \quad m(t_s) = m_0\,\big(1 + \Delta_m(t_s)\big) \qquad (7.3)$$
where Δ_k(t_s) and Δ_m(t_s) are the changes in the stiffness and mass parameters. Note that
k(t_s) is generally expected to be a decaying function over a long time to represent
a loss in the stiffness of the system, whereas m(t_s) can be an increasing or a
decreasing function. The following functions have been chosen for
generating synthetic data:
$$\Delta_k(t_s) = e^{-\alpha_k t_s}\,\frac{1 + \epsilon_k \cos(\beta_k t_s)}{1 + \epsilon_k} - 1 \qquad (7.4)$$
$$\text{and} \quad \Delta_m(t_s) = \epsilon_m\,\mathrm{SawTooth}\big(\beta_m (t_s - \pi/\beta_m)\big) \qquad (7.5)$$
Here α_k, β_k, ε_k, ε_m, and β_m are the constants deciding the rate of change of the stiffness
and mass parameters. A schematic representation of the same is shown in Fig. 7.2.
The choice of these functions is motivated by the fact that the stiffness degrades over
time in a periodic manner representing a possible fatigue crack growth in an aircraft
over repeated pressurizations. On the other hand, the mass increases and decreases
over the nominal value due to re-fueling and fuel burn over a flight period. The key
consideration is that a digital twin of the dynamical system should track these types
of changes by exploiting sensor data measured on the system.
Fig. 7.2 Examples of model functions representing long-term variability in the mass and stiffness properties of a digital twin system
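A short sketch for generating these synthetic degradation histories is given below; since the constants α_k, β_k, ε_k, ε_m, and β_m are not specified here, the values used are purely illustrative.

```python
import numpy as np

def delta_k(ts, alpha_k=5e-4, beta_k=2e-2, eps_k=0.1):
    """Stiffness change of Eq. (7.4): periodic decay about an exponential trend."""
    return np.exp(-alpha_k * ts) * (1.0 + eps_k * np.cos(beta_k * ts)) / (1.0 + eps_k) - 1.0

def sawtooth(x):
    """Periodic sawtooth rising from -1 to 1 over each period of 2*pi."""
    return 2.0 * ((x / (2.0 * np.pi)) % 1.0) - 1.0

def delta_m(ts, eps_m=0.15, beta_m=5e-2):
    """Mass change of Eq. (7.5): repeated re-fueling / fuel-burn cycles."""
    return eps_m * sawtooth(beta_m * (ts - np.pi / beta_m))

ts = np.linspace(0.0, 1000.0, 2001)        # slow time (e.g. normalized by T0)
k_ts = 1.0 * (1.0 + delta_k(ts))           # k(ts) = k0 (1 + Delta_k(ts)), Eq. (7.3)
m_ts = 1.0 * (1.0 + delta_m(ts))           # m(ts) = m0 (1 + Delta_m(ts)), Eq. (7.3)
```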
As stated earlier, we consider that only the mass and the stiffness change with time.
Accordingly, the governing differential equation for this case is represented as
$$m_s(t_s)\,\frac{d^2 u(t)}{dt^2} + c_0\,\frac{d u(t)}{dt} + k_s(t_s)\,u(t) = f(t) \qquad (7.6)$$
where ks (ts ) and m s (ts ) are represented by Eq. (7.3). Substituting Eq. (7.3) into Eq.
(7.6) and solving, we obtain the evolving complex natural frequency

$$\lambda_s(t_s) = -\zeta_s(t_s)\,\omega_s(t_s) \pm i\,\omega_{ds}(t_s) \qquad (7.7)$$
where ωs (ts ), ωds (ts ), and ζs (ts ) denote the evolution in natural frequency, damped
natural frequency, and damping ratio, respectively. The evolution is defined as follows:

$$\omega_s(t_s) = \omega_0\,\frac{\sqrt{1 + \Delta_k(t_s)}}{\sqrt{1 + \Delta_m(t_s)}} \qquad (7.8a)$$
$$\zeta_s(t_s) = \frac{\zeta_0}{\sqrt{1 + \Delta_m(t_s)}\,\sqrt{1 + \Delta_k(t_s)}} \qquad (7.8b)$$
$$\omega_{ds}(t_s) = \omega_s(t_s)\,\sqrt{1 - \zeta_s^2(t_s)} \qquad (7.8c)$$
where ω_0 and ζ_0 are the natural frequency and damping ratio of the system at t_s = 0.
The objective here is to exploit Eq. (7.7) to estimate Δ_k and Δ_m. However, this
is non-trivial because of the coupled nature of the two. Following Ganguli and
Adhikari (2020), this can be addressed by considering the real and the imaginary
parts separately. By introducing a distance metric d (·, ·) and applying it to the real
and imaginary parts, we have
$$\tilde{d}_R(t_s) = \frac{d_R(t_s)}{\omega_0} = \frac{\zeta_0}{1 + \Delta_m(t_s)} - \zeta_0 \qquad (7.11)$$
$$\text{and} \quad \tilde{d}_I(t_s) = \frac{d_I(t_s)}{\omega_0} = \sqrt{1 - \zeta_0^2} \;-\; \frac{\sqrt{\big(1 + \Delta_k(t_s)\big)\big(1 + \Delta_m(t_s)\big) - \zeta_0^2}}{1 + \Delta_m(t_s)} \qquad (7.12)$$
As already stated, the digital twin is completely described by Δ_m(t_s) and Δ_k(t_s), and
both of these can be computed by solving Eqs. (7.11) and (7.12) simultaneously,

$$\Delta_m(t_s) = -\frac{\tilde{d}_R(t_s)}{\zeta_0 + \tilde{d}_R(t_s)}, \qquad (7.13a)$$
$$\Delta_k(t_s) = \frac{\zeta_0\,\tilde{d}_R^2(t_s) - \big(1 - 2\zeta_0^2\big)\,\tilde{d}_R(t_s) - 2\zeta_0\sqrt{1 - \zeta_0^2}\;\tilde{d}_I(t_s) + \zeta_0\,\tilde{d}_I^2(t_s)}{\zeta_0 + \tilde{d}_R(t_s)} \qquad (7.13b)$$
Remark 7.2 Although the above expression provides a closed-form solution for
updating the digital twin, it is incapable of handling noisy data. This will be illustrated
in the numerical example section.
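As a quick consistency check of the update formulas, the following sketch forms the distance measures from assumed "true" changes via Eqs. (7.11)–(7.12) and then recovers them with Eq. (7.13); ζ0 = 0.05 as in the numerical study below, and the assumed Δ values are arbitrary.

```python
import numpy as np

def dt_update(dR_tilde, dI_tilde, zeta0=0.05):
    """Closed-form digital-twin update, Eq. (7.13): recover the normalized mass
    and stiffness changes from the frequency-based distance measures."""
    delta_m = -dR_tilde / (zeta0 + dR_tilde)
    num = (zeta0 * dR_tilde**2
           - (1.0 - 2.0 * zeta0**2) * dR_tilde
           - 2.0 * zeta0 * np.sqrt(1.0 - zeta0**2) * dI_tilde
           + zeta0 * dI_tilde**2)
    return delta_m, num / (zeta0 + dR_tilde)

# round-trip check: assume the "true" changes, form the distance measures with
# Eqs. (7.11)-(7.12), and verify that Eq. (7.13) recovers them
zeta0, dm_true, dk_true = 0.05, 0.2, -0.3
dR = zeta0 / (1.0 + dm_true) - zeta0
dI = (np.sqrt(1.0 - zeta0**2)
      - np.sqrt((1.0 + dk_true) * (1.0 + dm_true) - zeta0**2) / (1.0 + dm_true))
print(dt_update(dR, dI, zeta0))   # approximately (0.2, -0.3)
```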
The digital twin described using Eqs. (7.1) and (7.13) is examined in this section
through a simple numerical example. We consider two cases, one where the data is
clean and the second where the data is noisy. The damping ratio ζ0 is fixed at 0.05.
The results obtained for the noise-free case are shown in Fig. 7.3. It is observed that
the digital twin defined using Eqs. (7.1) and (7.13) exactly captures the evolution of
the stiffness in slow time. However, the performance of the digital twin in capturing
the temporal evolution of mass is slightly off. This is because of the fact that the
digital twin was updated using only a few observations. Overall, as long as the data
collected is noise free, the physics-based digital twin performs exceedingly well.
Next, we consider a more realistic case where the sensor data is corrupted by noise.
The observations are corrupted by white Gaussian noise with a standard deviation
equal to 0.5% of the standard deviation of actual data. The results corresponding
to the noise-corrupted case are shown in Fig. 7.4. We observe with the noise, the
digital twin starts deteriorating and performs extremely poorly. This is expected as
the formula in Eq. 7.13 is based on the assumption that the available data is noise
free. Overall, it is safe to conclude that the physics-based digital twin presented here
only works when the available data is noise free. Another challenge associated with
this setup is the lack of predictive capability. Since we do not learn the evolution of
the parameters Δ_m(t_s) and Δ_k(t_s), it is not possible to predict the state variables at a
future time point.
Fig. 7.3 Performance of the digital twin in capturing the evolution of Δ_m and Δ_k with slow time
(normalized). The data is noise free
a machine learning model. In this section, we illustrate this possibility by coupling
Gaussian process (GP) regression with the physics-based digital twin discussed before.
Fig. 7.4 Performance of the digital twin in capturing the evolution of Δ_m and Δ_k with slow time
(normalized). The data is noisy
$$\hat{\Delta}_m(t_s) = -\frac{\tilde{d}_R(t_s)}{\zeta_0 + \tilde{d}_R(t_s)}, \qquad (7.15a)$$
$$\hat{\Delta}_k(t_s) = \frac{\zeta_0\,\tilde{d}_R^2(t_s) - \big(1 - 2\zeta_0^2\big)\,\tilde{d}_R(t_s) - 2\zeta_0\sqrt{1 - \zeta_0^2}\;\tilde{d}_I(t_s) + \zeta_0\,\tilde{d}_I^2(t_s)}{\zeta_0 + \tilde{d}_R(t_s)} \qquad (7.15b)$$
where d̃R (ts ) and d̃I (ts ), as before, are distance measures. Such a setup allows us
to train the GP model and estimate the parameters θ in a seamless manner. A brief
discussion on GP is given in the following section.
Remark 7.3 From a practical point of view, one needs to assume a functional form
of the mean function and a covariance kernel for implementing GP in practice. One
can also opt to select the mean and covariance kernel in an adaptive manner. In this
work, we have used Bayesian information criteria for selecting the optimal mean
function and covariance kernel.
Gaussian process (GP) is a popular machine learning technique that aims to infer
a distribution over functions and then utilize the distribution to make predictions at
some unknown points (Murphy 2012). Although different modern variants of GP
exist in the literature, the vanilla GP is utilized in this work. For an understanding
of the vanilla GP, consider $\boldsymbol{\xi} \in \mathbb{R}^d$ to be the input variable and $\boldsymbol{y} \in \mathbb{R}^d$ to be a set of
noisy measurements of the response variable. Then the regression equation can be
written as

$$\boldsymbol{y} = g(\boldsymbol{\xi}) + \boldsymbol{v}, \qquad (7.16)$$

where $\boldsymbol{v}$ represents the noise. With this setup, the objective is to estimate the latent
(unobserved) function $g(\boldsymbol{\xi})$ that will enable the prediction of the response variable
$\hat{\boldsymbol{y}}$ at new values of $\boldsymbol{\xi}$. In a GP-based regression setup, a GP is defined over $g(\boldsymbol{\xi})$ with
the mean function $\mu(\boldsymbol{\xi})$ and covariance function $\kappa(\boldsymbol{\xi}, \boldsymbol{\xi}'; \boldsymbol{\theta})$ as

$$g(\boldsymbol{\xi}) \sim \mathcal{GP}\big(\mu(\boldsymbol{\xi}),\, \kappa(\boldsymbol{\xi}, \boldsymbol{\xi}'; \boldsymbol{\theta})\big) \qquad (7.17)$$
In the above equation, the mean and covariance functions are defined as

$$\mu(\boldsymbol{\xi}) = \mathbb{E}[g(\boldsymbol{\xi})] \qquad (7.18a)$$
$$\kappa(\boldsymbol{\xi}, \boldsymbol{\xi}'; \boldsymbol{\theta}) = \mathbb{E}\big[\big(g(\boldsymbol{\xi}) - \mu(\boldsymbol{\xi})\big)\big(g(\boldsymbol{\xi}') - \mu(\boldsymbol{\xi}')\big)\big] \qquad (7.18b)$$

where $\boldsymbol{\theta}$ denotes the hyperparameters of the covariance function $\kappa(\cdot, \cdot)$, such as
the length and scale parameters for a squared exponential kernel. The choice of
the covariance function $\kappa$ allows encoding of any prior knowledge about $g(\boldsymbol{\xi})$ (e.g.
periodicity, linearity, smoothness) and can accommodate approximation of arbitrarily
complex functions (Rasmussen and Williams 2006). In Eq. (7.17), it can be seen that
any finite collection of function values has a joint multivariate Gaussian distribution,
i.e. $\big(g(\boldsymbol{\xi}_1), g(\boldsymbol{\xi}_2), \ldots, g(\boldsymbol{\xi}_N)\big) \sim \mathcal{N}(\boldsymbol{\mu}, \mathbf{K})$, where $\boldsymbol{\mu} = [\mu(\boldsymbol{\xi}_1), \ldots, \mu(\boldsymbol{\xi}_N)]^T$ is
the mean vector and $\mathbf{K}$ is the covariance matrix with $\mathbf{K}(i, j) = \kappa(\boldsymbol{\xi}_i, \boldsymbol{\xi}_j)$ for $i, j = 1, 2, \ldots, N$.
The mean function is generally taken as a zero vector, i.e. $\mu(\boldsymbol{\xi}) = \boldsymbol{0}$, when no prior
information is available about the mean function. On the other hand, any function
$\kappa(\boldsymbol{\xi}, \boldsymbol{\xi}')$ that generates a positive semi-definite covariance matrix $\mathbf{K}$ is considered
to be valid for the covariance function. Once Eq. (7.16) is ready, the objective of GP is
to estimate the hyperparameters $\boldsymbol{\theta}$ based on the observed input–output pairs
$\{\boldsymbol{\xi}_j, y_j\}_{j=1}^{N_t}$, where $N_t$ is the number of training samples. Once $\boldsymbol{\theta}$ have been
computed, the predictive distribution of $g(\boldsymbol{\xi}^*)$ given the dataset $\{\boldsymbol{y}, \boldsymbol{\Xi}\}$, the
hyperparameters $\boldsymbol{\theta}$, and the new input $\boldsymbol{\xi}^*$ is represented as

$$p\big(g(\boldsymbol{\xi}^*) \mid \boldsymbol{y}, \boldsymbol{\Xi}, \boldsymbol{\theta}, \boldsymbol{\xi}^*\big) = \mathcal{N}\big(g(\boldsymbol{\xi}^*) \mid \mu_{GP}(\boldsymbol{\xi}^*),\, \sigma^2_{GP}(\boldsymbol{\xi}^*)\big) \qquad (7.19)$$

where the mean and covariance functions of the predictive distribution are given as

$$\mu_{GP}(\boldsymbol{\xi}^*) = \boldsymbol{k}^T(\boldsymbol{\xi}^*, \boldsymbol{\Xi}; \boldsymbol{\theta})\,\big[\mathbf{K}(\boldsymbol{\Xi}, \boldsymbol{\Xi}; \boldsymbol{\theta}) + \sigma_n^2 \mathbf{I}\big]^{-1} \boldsymbol{y} \qquad (7.20a)$$
$$\sigma^2_{GP}(\boldsymbol{\xi}^*) = \kappa(\boldsymbol{\xi}^*, \boldsymbol{\xi}^*; \boldsymbol{\theta}) - \boldsymbol{k}^T(\boldsymbol{\xi}^*, \boldsymbol{\Xi}; \boldsymbol{\theta})\,\big[\mathbf{K}(\boldsymbol{\Xi}, \boldsymbol{\Xi}; \boldsymbol{\theta}) + \sigma_n^2 \mathbf{I}\big]^{-1} \boldsymbol{k}(\boldsymbol{\xi}^*, \boldsymbol{\Xi}; \boldsymbol{\theta}) \qquad (7.20b)$$

Here, $\boldsymbol{\Xi} = \{\boldsymbol{\xi}_1, \ldots, \boldsymbol{\xi}_{N_t}\}$ collects the training inputs and $\sigma_n^2$ is the noise variance.
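A minimal NumPy sketch of the predictive equations (7.20), with a squared-exponential kernel, a zero mean function, and fixed (rather than optimized) hyperparameters, is shown below; all values are illustrative.

```python
import numpy as np

def rbf_kernel(X1, X2, length=1.0, scale=1.0):
    """Squared-exponential covariance k(x, x') = scale^2 exp(-|x - x'|^2 / (2 l^2))."""
    d2 = ((X1[:, None, :] - X2[None, :, :]) ** 2).sum(-1)
    return scale**2 * np.exp(-0.5 * d2 / length**2)

def gp_predict(X_train, y_train, X_star, length=1.0, scale=1.0, sigma_n=1e-2):
    """GP posterior mean and variance of Eq. (7.20) under a zero mean function."""
    K = rbf_kernel(X_train, X_train, length, scale) + sigma_n**2 * np.eye(len(X_train))
    k_star = rbf_kernel(X_train, X_star, length, scale)            # K(X, X*)
    k_ss = rbf_kernel(X_star, X_star, length, scale)
    alpha = np.linalg.solve(K, y_train)
    mean = k_star.T @ alpha                                         # Eq. (7.20a)
    var = np.diag(k_ss - k_star.T @ np.linalg.solve(K, k_star))     # Eq. (7.20b)
    return mean, var

# toy usage: infer a noisy 1D function and predict at new slow-time points
rng = np.random.default_rng(3)
t_train = np.linspace(0, 10, 25)[:, None]
y_train = np.sin(t_train).ravel() + 0.05 * rng.standard_normal(25)
t_star = np.linspace(0, 12, 100)[:, None]
mu, var = gp_predict(t_train, y_train, t_star, length=1.5)
```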
We revisit the same experiment carried out in Sect. 7.3 with the same setup (see Fig.
7.2). Figure 7.5 depicts the evolution of real and imaginary parts of the measured
Fig. 7.5 Variation (normalized) in the real and imaginary parts of the natural frequency with slow time: a changes in the real part, b changes in the imaginary part. Curves show the actual change and the cases with 37, 150, and 200 samples available for the digital twin
natural frequency of the system over time. Figure 7.6 shows the results obtained
for clean data. The proposed and updated digital twin is able to capture the time
evolution of mass and stiffness. However, this setup is unrealistic as, even with the
most advanced sensors, the data collected will always be noisy (Zhang et al. 2017).
As a natural progression, we consider a more realistic case with noisy data. We
consider three noise levels. Figure 7.7 shows the mass and stiffness evolution of the
digital twin trained with 37 noisy observations. We observe that the digital twin is
able to capture the time evolution of stiffness with high accuracy; however, it fails
to capture the evolution of mass adequately. Additionally, uncertainty due to limited
and noisy data is perfectly captured and can be used in further decision-making.
Fig. 7.7 GP-based digital twin obtained from 37 noisy data with σθ = 0.005. The shaded plot
depicts the 95% confidence interval
Figures 7.8 and 7.9 show the performance of the digital twin trained with 150
noisy observations. We observe a dramatic improvement in the digital twin with the
increased data points. The results obtained are highly accurate. Lastly, evolution of
mass with 200 noisy observations and σθ = 0.025 is shown in Fig. 7.10. As expected,
this yields the best results.
Fig. 7.8 Mass evolution for digital twin obtained from 150 noisy data. Noise levels of σθ = 0.005,
σθ = 0.015, and σθ = 0.025 are considered
The digital twins discussed till now are deterministic in nature. However, real systems
are always associated with uncertainty in one form or the other. In this section, we
focus on the development of digital twins for the multi-degree-of-freedom (MDOF)
Fig. 7.9 Stiffness evolution for digital twin trained with 150 noisy data with σθ = 0.025. The
shaded plot depicts the 95% confidence interval
Fig. 7.10 Mass evolution as a function of the normalized slow time ts /T0 for GP-based digital twin
(simultaneous mass and stiffness evolution) trained with 200 noisy data with σθ = 0.025. Bayesian
information criteria yield a “Linear” basis and an “ARD Matern 5/2” covariance kernel. The shaded
plot depicts the 95% confidence interval
stochastic system. Note that, similar to previous sections, we assume that the temporal
evolution of a system accounts for changes in the system parameters only. No change
in the governing physics is considered in this section.
$$\mathbf{M}(t_s)\,\frac{\partial^2 \boldsymbol{X}(t, t_s)}{\partial t^2} + \mathbf{C}(t_s)\,\frac{\partial \boldsymbol{X}(t, t_s)}{\partial t} + \mathbf{K}(t_s)\,\boldsymbol{X}(t, t_s) + \boldsymbol{G}\big(\boldsymbol{X}(t, t_s), \boldsymbol{\alpha}\big) = \boldsymbol{F}(t, t_s) + \dot{\boldsymbol{W}} \qquad (7.22)$$
The DT for the MDOF nonlinear system defined in Eq. (7.22) is incomplete without a
proper update mechanism for the system parameters M(t_s), C(t_s), and K(t_s). Similar
to the previous section, we assume that the temporal variation in M(ts ), C(ts ), and
K(ts ) is slow, and hence the dynamics are decoupled. We assume that the sensors
collect data at discrete time instants t_s, and at each time instant, time history measurements of the acceleration response in t_s ± Δt are available. In this section, we only
consider variation in the stiffness, and the objective is to develop a DT for a nonlinear
MDOF system. Other requirements like continuous updates and future predictions
remain, as discussed in the previous section.
Note that for brevity, the hyperparameters in Eq. (7.23) are omitted. The GPR is
trained by following the procedure discussed in Sect. 7.4.1. For algorithmic details,
more information can be found in Garg et al. (2021). Once trained, the GPR can
predict the system parameters at future timesteps. Note that GPR being a Bayesian
machine learning model also provides predictive uncertainty, which can be used to
judge the accuracy of the model. For the ease of readers, the overall DT framework
proposed is shown in Algorithm 1.
Algorithm 1: Proposed DT
1. Select the nominal model (Sect. 7.5.1).
2. Use the data (acceleration measurements) D_s collected at time t_s to compute the parameters K(t_s) using the UKF (Garg et al. 2021, 2022).
3. Train a GP using D = {t_n, v_n}_{n=1}^{t_s} as training data, where v_n represents the system parameter. Predict K(t̃) at a future time t̃. Substitute K(t̃) into the governing equation (high-fidelity model) and solve it to obtain the responses at time t̃.
4. Take decisions related to maintenance, remaining useful life, and health of the system.
5. Repeat steps 2–4 as more data become available.
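A runnable outline of this loop is sketched below. The three helper functions are crude stand-ins, not the chapter's UKF, GP, or high-fidelity solver; they only indicate where the real components plug in.

```python
import numpy as np

rng = np.random.default_rng(0)

def acquire_acceleration(ts):                     # placeholder sensing interface
    return rng.standard_normal(500)               # would be measured time histories

def ukf_estimate_stiffness(data, ts):             # placeholder for the UKF (step 2)
    return 2000.0 * np.exp(-1e-4 * ts) + 5.0 * rng.standard_normal()

def gp_fit_predict(t_obs, k_obs, t_future):       # placeholder for the GP (step 3)
    coeffs = np.polyfit(t_obs, k_obs, deg=min(2, len(t_obs) - 1))
    return np.polyval(coeffs, t_future)

inspection_times = np.arange(0.0, 3000.0, 250.0)  # slow-time instants t_s
t_future = np.array([4000.0, 5000.0])
t_hist, k_hist = [], []
for ts in inspection_times:                        # step 5: repeat as data arrive
    data = acquire_acceleration(ts)
    t_hist.append(ts)
    k_hist.append(ukf_estimate_stiffness(data, ts))
    k_pred = gp_fit_predict(np.array(t_hist), np.array(k_hist), t_future)
    # step 3 (cont.): k_pred would be substituted into the governing equation and
    # solved for the response; step 4: maintenance / remaining-useful-life decisions
print(k_pred)
```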
We consider a 7-DOF system as shown in Fig. 7.12. The 7-DOF system is mod-
eled with a Duffing–van der Pol (DVP) oscillator at the fourth DOF. The governing
equations of motion for the 7-DOF system are as follows:
M Ẍ + C Ẋ + KX + G (X, α) = F + Ẇ (7.24)
that nonlinear stiffness is generally used for vibration control (Das et al. 2021) and
energy harvesting (Cao et al. 2019), and hence kept constant. Further details on the
parameters are shown in Table 7.1.
For generating synthetic data, data simulation is carried out using the Taylor-1.5-
Strong scheme, and a filtering model is formed using the Euler–Maruyama (EM)
equation in Garg et al. (2021). Note that although the value k_4 is a priori known, we
have still considered it in the state vector. It was observed that such a setup helps in
regularizing the UKF estimates. The measurement model for the UKF remains the
same and is written as
$$h(\boldsymbol{y}) = \begin{bmatrix}
-\frac{1}{m_1}\big(y_1(k_1+k_2) - c_2 y_4 - k_2 y_3 + y_2(c_1+c_2)\big) \\[4pt]
\frac{1}{m_2}\big(c_2 y_2 - y_3(k_2+k_3) + c_3 y_6 + k_2 y_1 + k_3 y_5 - y_4(c_2+c_3)\big) \\[4pt]
-\frac{1}{m_3}\big(k_4 y_7 - c_4 y_8 - k_3 y_3 - c_3 y_4 + y_5(k_3-k_4) + \alpha_{\mathrm{DVP}}(y_5-y_7)^3 + y_6(c_3+c_4)\big) \\[4pt]
\frac{1}{m_4}\big(c_4 y_6 + c_5 y_{10} - k_4 y_5 + k_5 y_9 + y_7(k_4-k_5) + \alpha_{\mathrm{DVP}}(y_5-y_7)^3 - y_8(c_4+c_5)\big) \\[4pt]
\frac{1}{m_5}\big(c_5 y_8 - y_9(k_5+k_6) + c_6 y_{12} + k_5 y_7 + k_6 y_{11} - y_{10}(c_5+c_6)\big) \\[4pt]
\frac{1}{m_6}\big(c_6 y_{10} - y_{11}(k_6+k_7) + c_7 y_{14} + k_6 y_9 + k_7 y_{13} - y_{12}(c_6+c_7)\big) \\[4pt]
\frac{1}{m_7}\big(c_7 y_{12} - c_7 y_{14} + k_7 y_{11} - k_7 y_{13}\big)
\end{bmatrix} \qquad (7.25)$$
The process noise covariance matrix Q is expressed as (Garg et al. 2021)

$$\boldsymbol{q}_c = \Delta t^{-1}\,\mathrm{diag}\Big[0,\ \tfrac{\sigma_1}{m_1},\ 0,\ \tfrac{\sigma_2}{m_2},\ 0,\ \tfrac{\sigma_3}{m_3},\ 0,\ \tfrac{\tilde{\sigma}_4}{m_4},\ 0,\ \tfrac{\sigma_5}{m_5},\ 0,\ \tfrac{\sigma_6}{m_6},\ 0,\ \tfrac{\sigma_7}{m_7},\ 0,\ 0,\ 0,\ 0,\ 0,\ 0,\ 0\Big], \qquad \mathbf{Q} = \boldsymbol{q}_c\, \boldsymbol{q}_c^T \qquad (7.26)$$
where $\tilde{\sigma}_4 = \boldsymbol{m}^{-}_{k}(7)\,\sigma_4$ and $\boldsymbol{m}^{-}_{k}(7)$ is the seventh element of the UKF's predicted mean.
Simulated acceleration and input are corrupted with a Gaussian noise having SNR
values of 50 and 20, respectively. The final data used in filtering is presented in Fig.
7.13.
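For readers unfamiliar with the filtering model, a minimal Euler–Maruyama integration of a single-DOF oscillator with a cubic (Duffing-type) stiffness term of the kind appearing in Eq. (7.24) is sketched below; all parameter values are illustrative, and this is not the 7-DOF simulator used to generate the data.

```python
import numpy as np

def em_cubic_oscillator(m=1.0, c=2.0, k=1000.0, alpha=100.0, sigma=0.1,
                        f=lambda t: 10.0 * np.cos(5.0 * t),
                        dt=1e-3, T=5.0, seed=0):
    """Euler-Maruyama integration of
        m x'' + c x' + k x + alpha x^3 = f(t) + sigma dW/dt
    with additive white noise entering the velocity equation."""
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    x = np.zeros(n + 1)       # displacement
    v = np.zeros(n + 1)       # velocity
    for i in range(n):
        t = i * dt
        dW = np.sqrt(dt) * rng.standard_normal()
        a = (f(t) - c * v[i] - k * x[i] - alpha * x[i] ** 3) / m
        x[i + 1] = x[i] + v[i] * dt
        v[i + 1] = v[i] + a * dt + (sigma / m) * dW
    return x, v

x, v = em_cubic_oscillator()
print(x.shape, v.shape)   # (5001,) (5001,)
```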
The functionality of the digital twin is dependent on the performance of UKF
and GP. To that end, we first illustrate the accuracy of UKF. The acceleration vec-
tors (noisy) shown in Fig. 7.14 are considered as the measurements. The state and
parameter estimation results obtained using the UKF algorithm are shown in Fig.
7.15. The digital twin provides a highly accurate estimate of the state vectors. Param-
eters k2 , k3 , and k5 also converge exactly toward the ground truth. As for k1 , k6 , and
Fig. 7.13 Acceleration and the deterministic component of the force for the 7-DOF problem
Fig. 7.14 Force (deterministic part) and acceleration vector corresponding to DOF 1, 4, and 7 used
in UKF
Fig. 7.15 a State estimation and b parameter (stiffness) estimation results obtained using the UKF algorithm
Fig. 7.16 Stiffness (k1 and k2 ) obtained using the UKF algorithm
Fig. 7.17 Estimated stiffness (k3 , k5 , k6 and k7 ) in slow timescale using the UKF algorithm
the digital twin predicted results are found to diverge from the true solution. How-
ever, the divergence only happens after 3.5 years from the last observation, which,
for all practical purposes, is reasonable enough. For stiffness k6 also, the predic-
tion improves as more data are made available to the digital twin. This establishes
that the digital twin has the capacity for self-correction, which in turn helps it better represent the physical system.
The construction of digital twins in the previous three sections is based on the assumption that the evolution of a system can be perfectly captured by changes in its system parameters. However, such an assumption limits the capability of a digital twin. In this section, we discuss a predictive digital twin that is capable of tracking changes in the physics.
Let us consider the following D-dimensional second-order partial differential
equation:
Fig. 7.18 Results representing the performance of the proposed digital twin (GP + UKF) for the
7-DOF system
where M ∈ R D×D , C ∈ R D×D , and K ∈ R D×D represent the mass, damping, and
linear stiffness matrices of the system, respectively. The functions H( Ẋ, X, t, ts ) :
R D → R D and Q( Ẋ, X, t, ts ) : R D → R D denote the linear and nonlinear pertur-
bations in the system, respectively. The term Ḃ(t, ts) ∈ R^D on the right-hand side represents white noise (the generalized derivative of Brownian motion) with a noise intensity matrix in R^{D×D}. In the above equation, two timescales t and ts are used,
which represent the intrinsic time and the service time, respectively. The service
time refers to the periods over which the underlying structure or a component is
expected to be inspected. The timescale ts is much slower than t, and since X(t, ts) is a function of both timescales, Eq. (7.27) is written in terms of partial derivatives. The evolution in M(ts), C(ts), and K(ts) occurs very slowly with respect to the timescale ts. The forcing term Ḃ(t, ts), however, can change with respect to both timescales t and ts. We call Eq. (7.27)
the model of the proposed digital twin. Since it is already mentioned that the system
evolves with respect to the slower timescale, we rephrase Eq. (7.27) when ts = 0 as
The above equation denotes the beginning of the service life of the underlying system
and is often called the nominal model in DT (this is almost similar to the DT defined in
Eq. (7.21)). Here, M0 , C0 , and K0 are the parameters of the nominal model. Further
Fig. 7.19 Schematic architecture of the predictive digital twin framework with simultaneous
parameter-model updating feature. The network primarily consists of three components, namely,
the physical model, the mirror model, and the linking mechanism. The linking mechanism performs
three simultaneous operations that are (i) data assimilation and processing, (ii) updating of the
nominal mirror model, and (iii) making predictions in the presence of unseen environmental agents
using the updated digital twin. For updating the virtual mirror model using explainable functions,
the data assimilation and processing unit utilizes two modules. In the backend, these modules use the
sparse Bayesian linear regression. Based on whether both input–output or output-only measurements
are available, module-1 or module-2 is activated, respectively. The input refers to the source and
the output refers to the state measurements. The details on the modules are provided in Figs. 7.20
and 7.21
we assume that, as the timescale ts shifts from the initial condition, the nominal system gets perturbed by new terms, expressed using the functions H(Ẋ, X, t, ts) and Q(Ẋ, X, t, ts) as
parameters of the functions H(·) and Q(·). For these, sparse Bayesian inference is employed. The resulting framework is thus interpretable, and since the functions are
the physics of the underlying perturbations is learned using actual mathematical
functions. Therefore, it is conjectured that the proposed DT will be able to track
the evolution of the physical twin accurately. A schematic representation of the
predictive digital twin is shown in Fig. 7.19. The network architecture has three
primary components—(a) the physical model, (b) the digital twin, and (c) the linking
mechanism. The linking mechanism further consists of three independent modules—
(i) the data assimilation and processing module, (ii) the model updating module, and
(iii) the prediction module. The data processing is performed by using a physics-
based nominal model, and in the updating module, the sparse Bayesian regression
is employed. Since the perturbations in the physical model are obtained in terms of
interpretable functions, we consider the digital twin to be white-box in nature. Although the proposed framework should theoretically work for higher order dynamical systems, in this work we consider only second-order dynamical systems. Furthermore, we consider that the second-order dynamical systems can be completely represented
using displacement and velocity measurements.
Due to advances in sensor technology, it is now possible to measure the displacement and velocity time histories of a dynamical system. However, the measurement of input forces is often not feasible. Toward this, we propose two frameworks: (i) when both the input–output measurements are available and (ii) when only the state measurements are available. In framework-1, we simply remove the information of the nominal model from the measured states and then perform sparse regression to identify the perturbation terms. The schematic illustration of the framework is presented in Fig. 7.20. In framework-2, similar to framework-1, we first remove the information of the nominal model and then employ sparse Bayesian regression in the purview of the Kramers–Moyal expansion (Risken 1996) to identify the perturbations in terms of stochastic differential equations (Kloeden and Platen 1992; Oksendal 2013). The schematic representation of framework-2 is provided in Fig. 7.21.
Fig. 7.20 Schematic architecture of the Module-1 in Fig. 7.19 for Bayesian model updat-
ing of dynamical systems in the presence of both input–output observations. Module-1 takes
the state and force measurements as input and forms a library of candidate basis functions from
these measurements. To update the nominal model, a sparse Bayesian regression between the state
measurements and their derivatives is constructed using the library
Fig. 7.21 Schematic architecture of the Module-2 in Fig. 7.19 for Bayesian model updating of
dynamical systems with output-only measurements. In module-2, only the state measurements
are provided as input. The library of candidate basis functions is created from state-only measure-
ments. Similar to module-1, sparse Bayesian regression is performed to update the nominal model.
However, the target vectors in module-2 are obtained using the Kramers–Moyal formula
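As an illustration of how such a library of candidate basis functions might be assembled from state measurements, a minimal sketch is given below; the specific candidates (constant, linear, quadratic, and cubic terms) are illustrative choices and not necessarily those used in the chapter.

```python
import numpy as np

def build_library(Z):
    """Assemble an N x K library of candidate basis functions from N x m state measurements."""
    N, m = Z.shape
    cols = [np.ones(N)]                                   # constant term
    cols += [Z[:, i] for i in range(m)]                   # linear terms
    cols += [Z[:, i] * Z[:, j]                            # quadratic terms
             for i in range(m) for j in range(i, m)]
    cols += [Z[:, i] ** 3 for i in range(m)]              # cubic terms
    return np.column_stack(cols)
```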
In the above equation, the index i represents the ith state of the m-dimensional statespace and θ_{ik} denotes the weight associated with the kth basis function of the ith state. In the regression format, the above equation is expressed as
$$Y_i = L\,\theta_i + \epsilon_i \qquad (7.34)$$
In the previous section, we demonstrated the model updating framework using both
the input–output information. However, often the accurate measurement of inputs is
not feasible. In such cases, the library of candidate functions becomes ill-conditioned,
and this leads to the selection of the wrong basis functions. Since, in this case, the
input information is assumed to be unavailable, we try to represent the underlying
governing physics in terms of stochastic differential equations (SDEs) (Kloeden and Platen 1992; Oksendal 2013). To represent the systems in terms of SDEs, we treat the
output measurements as a stochastic process and perform sparse Bayesian learning
in the purview of stochastic calculus. We again use the statespace to represent the
higher order dynamical systems in terms of the SDEs. Let the statespace be realized
by a map T : Rd → Rm that maps the d-dimensional system to the m-dimensional
SDEs with d < m. Then using T , any perturbed higher order system can be reduced
to the following SDEs:
$$\dot{X} = \underbrace{f(X_t,t)}_{\text{Nominal model}} + \underbrace{h(X_t,t)}_{\text{Perturbation}} + \underbrace{g(X_t,t)}_{\text{Diffusion}}\,\xi(t) \qquad (7.35)$$
$$dZ_t = \big(f(X_t,t) + h(X_t,t)\big)\,dt + g(X_t,t)\,dB(t) - f(X_t,t)\,dt = h(Z_t,t)\,dt + g(Z_t,t)\,dB(t) \qquad (7.37)$$
The above SDE is now a function of the nominal-model-removed state measurements and contains the information on (i) the perturbation in the drift and (ii) the diffusion. At this point, it is straightforward to understand that the discovery of the governing physics in terms of Eq. (7.36) requires independent identification of the drift and diffusion components. In contrast to the diffusion term, the deterministic drift functions behave as smooth functions, i.e. they are assumed to be at least twice differentiable; thus the drift has finite variation. On the contrary, Brownian motion is not differentiable with respect to the process Z(t). Due to this non-differentiability, Brownian motion possesses only a quadratic variation; as a consequence, its convergence is defined in the mean square sense only.
Mathematically, let us consider the interval s ∈ [0, t] partitioned into n parts. If Z_t denotes an arbitrary random process, then according to Itô calculus, as n → ∞ the finite variation $V_n(Z,t) := \sum_{i}^{n} |Z(s_i) - Z(s_{i-1})| \to V(Z,t)$ and the quadratic variation $Q_n(Z,t) := \sum_{i}^{n} |Z(s_i) - Z(s_{i-1})|^2 \to Q(Z,t)$ (Hassler et al. 2016). This suggests that if the sampling rate is sufficiently small, then the drift and diffusion components of an SDE in Eq. (7.36) can be learned using only the state measurements in terms of their linear and quadratic variations, respectively (Risken 1996). However, it should be noted that the diffusion components (i) have zero finite variation and (ii) are bounded by the quadratic variation. Thus, the diffusion components are recoverable only through their covariation terms. Therefore, we express the drift and diffusion components of the SDE in Eq. (7.36) in terms of the state measurements as follows:
$$h_i(Z_t,t) = \lim_{\Delta t \to 0}\frac{1}{\Delta t}\,E\big[Z_i(t+\Delta t) - Z_i(t)\big] \quad \forall\, k = 1,2,\ldots,N, \qquad (7.38a)$$
$$\big(gg^T\big)_{ij}(Z_t,t) = \frac{1}{2}\lim_{\Delta t \to 0}\frac{1}{\Delta t}\,E\big[\,|Z_i(t+\Delta t)-Z_i(t)|\,|Z_j(t+\Delta t)-Z_j(t)|\,\big] \quad \forall\, k = 1,2,\ldots,N \qquad (7.38b)$$
where $h_i(Z_t,t)$ is the ith drift component and $\big(gg^T\big)_{ij}(Z_t,t)$ is the (ij)th component of the diffusion covariance matrix $(gg^T)(Z_t,t) \in R^{n\times n}$. In order to derive the analytical form of the drift and diffusion components from state measurements, we represent the drift and diffusion as a linear superposition of candidate basis functions.
Let $\{\ell_k(Z_t);\ k=1,\ldots,K\}$ be the set of candidate library functions, where $\ell_k(Z_t)$ represents the various linear and nonlinear mathematical functions defined with respect to the system states. We first construct the libraries $L_f \in R^{N\times K}$ and $L_g \in R^{N\times K}$ from the subsets $\{\ell_k^f(Z_t)\} \subseteq \{\ell_k(Z_t)\}$ and $\{\ell_k^g(Z_t)\} \subseteq \{\ell_k(Z_t)\}$ for drift and diffusion, respectively. Then, we express the ith drift component and the (ij)th term of the diffusion covariance matrix as a linear superposition of the library functions as
$$h_i(Z_t, t) = \theta^{f}_{i1}\,\ell^{f}_{1}(Z_t) + \cdots + \theta^{f}_{ik}\,\ell^{f}_{k}(Z_t) + \cdots + \theta^{f}_{iK}\,\ell^{f}_{K}(Z_t) \qquad (7.39a)$$
$$\big(gg^T\big)_{ij}(Z_t, t) = \theta^{g}_{ij1}\,\ell^{g}_{1}(Z_t) + \cdots + \theta^{g}_{ijk}\,\ell^{g}_{k}(Z_t) + \cdots + \theta^{g}_{ijK}\,\ell^{g}_{K}(Z_t) \qquad (7.39b)$$
where $\theta^{f}_{ik}$ and $\theta^{g}_{ijk}$ are the weights associated with the kth basis function of the ith drift and the (ij)th diffusion covariance component, respectively. In a compact form, the above equations can be represented as
$$Y_i = L_f\,\theta^{f}_{i} + \epsilon_i \qquad (7.40a)$$
$$Y_{ij} = L_g\,\theta^{g}_{ij} + \eta_{ij} \qquad (7.40b)$$
In the above equations, $\theta^{f}_{i} = \big[\theta_{i1}, \theta_{i2},\ldots,\theta_{iK}\big]^T$ and $\theta^{g}_{ij} = \big[\theta_{ij1},\theta_{ij2},\ldots,\theta_{ijK}\big]^T$, which correspond to the ith drift component and the (ij)th element of the diffusion covariance matrix, respectively. Similarly, the target vectors $Y_i$ and $Y_{ij}$ correspond to the ith drift component and the (ij)th component of the diffusion covariance matrix, respectively. The terms $\epsilon_i$ and $\eta_{ij}$ represent the corresponding measurement error vectors. For the
discovery, the target vectors are constructed using Eq. (7.38) as
$$Y_i = \frac{1}{\Delta t}\big[\,Z_{i,1} - \xi_{i,1},\;\ldots,\; Z_{i,N} - \xi_{i,N}\,\big]^T \qquad (7.41a)$$
$$Y_{ij} = \frac{1}{\Delta t}\big[\,\{(Z_{i,1} - \xi_{i,1})(Z_{j,1} - \xi_{j,1})\},\;\ldots,\;\{(Z_{i,N} - \xi_{i,N})(Z_{j,N} - \xi_{j,N})\}\,\big]^T \qquad (7.41b)$$
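A minimal sketch of how the drift and diffusion targets might be formed from sampled states is given below; it evaluates Eq. (7.38) directly with a finite Δt and omits the ξ terms appearing in Eq. (7.41), so it is only an illustrative approximation.

```python
import numpy as np

def drift_diffusion_targets(Z, dt):
    """Z: N x m nominal-model-removed state samples at spacing dt."""
    dZ = np.diff(Z, axis=0)                                    # increments Z(t + dt) - Z(t)
    Y_drift = dZ / dt                                          # linear-variation target, Eq. (7.38a)
    Y_diff = np.einsum('ni,nj->nij', np.abs(dZ), np.abs(dZ)) / (2.0 * dt)   # Eq. (7.38b)
    return Y_drift, Y_diff
```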
Given the data D = [X, Y], the aim is to find the posterior distribution of the weight vector θ. For
estimating the posterior distribution of the weight vector θ , the Bayes formula is
utilized as follows:
$$P(\theta \mid Y) = \frac{P(\theta)\,P(Y \mid \theta)}{P(Y)} \qquad (7.43)$$
where I N ×N is the identity matrix. In the sparse Bayesian regression, the aim is
to obtain a sparse representation of the weight vector θ, which further renders the
resulting model interpretable. In the purview of the Bayesian regression, the sparsity
in the resulting model is introduced by assigning certain sparsity-promoting priors.
For a detailed review of sparsity-promoting priors, the readers are referred to the literature (George and McCulloch 1997; O'Hara and Sillanpää 2009). In this section, the
sparse Bayesian regression is demonstrated using the spike and slab prior. The spike
and slab prior is a mixture of two distributions, where the spike helps to shrink the
small values of weights to zero and the slab allows only the large values to escape
the shrinkage. For various models of the spike and slab prior, the readers can refer to the literature (O'Hara and Sillanpää 2009; Nayek et al. 2021). In this section, the slab is modeled as a zero-mean Gaussian distribution with large variance and the spike using the Dirac-delta function. Due to the presence of the Dirac-delta function, the prior can be regarded as a discontinuous spike and slab (DSS) prior. Since DSS is a
mixture of two distributions, for regression, the weights θ k ∈ θ need to be classified
into the spike and slab components. This is done by introducing a latent indicator
variable vector Z = [Z_1, . . . , Z_K] for each of the components θ_k ∈ θ. The latent
indicator variable Z k takes a value of 1 or 0 depending on whether the corresponding
weight belongs to slab or spike, respectively. In the DSS-prior model, the weight
components that belong to the spike do not contribute to the selection of basis functions from the library L(·, ·); therefore, a reduced weight vector θ_r ∈ R^r, with r ≪ K, is composed from the elements of the weight vector θ for which Z_k = 1. Using this reduced weight vector θ_r, the DSS prior is defined as
$$p(\theta \mid Z) = p_{\mathrm{slab}}(\theta_r)\prod_{k,\,Z_k=0} p_{\mathrm{spike}}(\theta_k) \qquad (7.45)$$
where $p_{\mathrm{spike}}(\theta_k) = \delta_0$ and $p_{\mathrm{slab}}(\theta_r) = \mathcal{N}\big(0,\ \sigma^2\vartheta_s R_{0,r}\big)$. Here, $\delta_0$ is the Dirac-delta function, $\vartheta_s$ is the slab variance, and the matrix $R_{0,r}$ is the scaling matrix. If the correlation between the basis functions is ignored, the scaling matrix is taken as $R_{0,r} = I_{r\times r}$; otherwise, the Fisher information matrix is used as $R_{0,r} = N(D^T D)^{-1}$.
To improve accuracy and speed up convergence, the Bayes formula in Eq. (7.43) is further expanded to a hierarchical model by assuming a prior on the noise variance σ², the slab variance ϑ_s, and the latent vector Z. The priors are selected based on conjugacy as
where “IG” denotes the inverse-Gamma distribution, “Bern” denotes the Bernoulli
distribution, and “Beta” denotes the Beta distribution. The Bayesian hierarchical
model is shown in Fig. 7.22, where the hyperparameters α_ϑ, β_ϑ, α_p, β_p, α_σ, and β_σ are provided as deterministic constants. The joint posterior distribution is then
obtained from Fig. 7.22 as
The steps for sampling from the conditional distributions using Gibbs sampling are provided in Algorithm
2 in Appendix A.
Once the stationary distribution is reached, the marginal posterior inclusion prob-
ability (PIP) is estimated on the samples from Gibbs sampling. The PIP is denoted
as PIP:= p (Z k = 1|Y ), which measures the probability of participation of the corre-
sponding basis function in the updated model. A PIP = 1 will mean the corresponding
basis function is present in all the visited models, whereas a PIP = 0.5 will mean that the corresponding basis function occurs in half of the visited models. Let $N_{MC}$ denote the length of the Markov Chain Monte Carlo (MCMC) chain required to achieve the stationary distribution after discarding the burn-in samples. Then the PIP is approximated for each of the K basis functions by estimating the mean over the $N_{MC}$ Gibbs samples of the kth latent variable $Z_k \in Z$ (Nayek et al. 2021) as
$$p(Z_k = 1 \mid Y) \approx \frac{1}{N_s}\sum_{j=1}^{N_s} Z_k^{\,j}\,; \qquad k = 1,\ldots,K \qquad (7.51)$$
By selecting a desired threshold for the PIP values, the final updated model can be
derived from any pair of [X, Y ]. A higher PIP threshold will result in a highly sparse
model, whereas a lower PIP threshold will result in a model with a large number
of functions. Also, a higher PIP suggests that, in the case of unseen scenarios, the corresponding basis function is highly likely to be needed to represent the data in the target vector. Being probabilistic in nature, the Bayesian algorithm also helps in
capturing the mean and covariance of the weights θk ∈ θ . In an unseen scenario, the
covariance information can be used to construct confidence intervals and to judge
the uncertainties associated with the updated model.
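A minimal sketch of the PIP computation and basis selection from stored Gibbs samples is given below; the 0.5 threshold is only an illustrative default.

```python
import numpy as np

def select_basis(latent_samples, pip_threshold=0.5):
    """latent_samples: n_mc x K array of sampled indicator variables Z_k (0 or 1)."""
    pip = latent_samples.mean(axis=0)          # marginal posterior inclusion probability, Eq. (7.51)
    selected = np.where(pip > pip_threshold)[0]
    return pip, selected
```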
Table 7.2 Posterior mean and standard deviations of the selected basis functions
Systems Basis function a Deterministic † Stochastic
Table 7.2 shows its efficacy in identifying the parametric values associated with the
library terms. Results obtained using both frameworks are shown. We observe that
for both cases, the proposed approach yields highly accurate results, as indicated by
the excellent match between the estimated mean and actual parametric values. The
posterior distribution of the parameters is shown in Fig. 7.24.
The predictive capability of the interpretable digital twin is shown in Fig. 7.25.
Results obtained using both frameworks are shown. In both cases, we observe that
the digital twin predicts highly accurate results, with the mean prediction matching
almost exactly with the ground truth.
Remark 7.4 As an alternative to the interpretable digital twin frameworks, one can opt for a machine learning model to learn the unknown physics; for example, a framework similar to the one proposed in Garg et al. (2022) can be used. However, it was observed that the amount of data required in such a hybrid framework is generally larger than for the interpretable version presented here. Also, since we learn the exact physics here, perpetual generalization is obtained; with such hybridization, generalization is compromised to some extent.
7.7 Conclusions
In this chapter, we discussed the concept of digital twins for dynamical systems. Out
of the four modules in digital twins ((a) visualization, (b) update, (c) prediction, and
(d) decision), this chapter particularly focused on the update and prediction module.
In particular, we discussed how purely physics-driven and gray-box models could be
used for updating a digital twin. While a purely physics-driven digital twin is prone
to noise in the data, a gray-box model-based digital twin is somewhat immune to
the noise in the data. Additionally, integrating Bayesian statistics and probabilistic
machine learning algorithms makes a digital twin robust and aids in the technology's
journey toward autonomy.
All physical systems have inherently associated randomness, and hence it is essen-
tial for a digital twin to account for the possible uncertainties in a system. We illus-
trated how to develop a digital twin for a stochastic dynamical system by using Itô
calculus, machine learning, and Kalman filtering approaches. Following the same
spirit as before, we recommend using probabilistic machine learning approaches as
the predictive uncertainty obtained in probabilistic machine learning has a huge role
to play in the development of digital twins. Case studies involving a seven-story
nonlinear stochastic dynamical system were presented to illustrate the applicability
and performance of the digital twin.
A digital twin is supposed to be a virtual replica of a physical system and is sup-
posed to emulate the evolution of the system throughout its lifetime. In most cases,
the evolution of system dynamics is represented as the evolution of the system param-
eters. Unfortunately, this is an approximate scenario as it is not only the parameters
but the governing equation itself that can change. Therefore, an important aspect of
the development of digital twins is the identification of model-form errors. One part
of this chapter is devoted toward developing digital twins for systems with misspecified physics.
(a) Framework-1: displacement time series (b) Framework-2: displacement time series
(c) Framework-1: velocity time series (d) Framework-2: velocity time series
(e) Framework-1: crack path evolution (f) Framework-2: crack path evolution
Fig. 7.25 Predictive performance of the proposed predictive digital twin for Example 3. a and
b Results for the DT using framework-1, where both the input–output observations are available. c
and d Results of the DT when only output measurements are feasible. The DT perfectly identifies the
perturbation terms along with their associated parameters. As a result, the prediction results match
almost perfectly with the actual system responses. However, when the models are updated using
only the output observations, the uncertainty in the predictions increases by some amount. This
ability to learn the uncertainties in the identified system parameters helps us to perform reliability
analysis on the systems designed using the proposed DT
We illustrate how sparse Bayesian learning can be used for identifying the missing terms and correcting/updating the digital twin. The applicability of the framework to both deterministic and stochastic systems is shown.
Appendix
for i = 1, . . . , MCMC do
6   Estimate $u_k = \dfrac{p_0}{p_0 + \lambda(1 - p_0)}$ and $\lambda = \dfrac{p\big(Y \mid \psi_k = 0,\ \psi_{-k}^{(i)},\ \vartheta_s^{(i)}\big)}{p\big(Y \mid \psi_k = 1,\ \psi_{-k}^{(i)},\ \vartheta_s^{(i)}\big)}$.
7   Then update the latent variable vector $\psi^{(i+1)}$ from the Bernoulli distribution.
8   Update the noise variance $\sigma^{2(i+1)}$ from the inverse-Gamma distribution.
9   Update the slab variance $\vartheta_s^{(i+1)}$ from the inverse-Gamma distribution as
    $p\big(\vartheta_s^{(i+1)} \mid \theta^{(i)}, \psi^{(i+1)}, \sigma^{2(i+1)}\big) = IG\big(\alpha_\vartheta + 0.5\,h_z,\ \beta_\vartheta + \tfrac{1}{2\sigma^2}\,\theta_r^{(i)T} R_{0,r}^{-1}\theta_r^{(i)}\big)$.
10  Estimate $h_z = \sum_{k=1}^{K}\psi_k^{(i+1)}$ and update the success rate $p_0^{(i+1)}$ from the Beta distribution as
    $p\big(p_0^{(i+1)} \mid \psi^{(i+1)}\big) = \mathrm{Beta}\big(\alpha_p + h_z,\ \beta_p + K - h_z\big)$.
11  Update the weight vector $\theta_r^{(i+1)}$. Repeat steps 7–11.
12 end
13 Discard the burn-in MCMC samples and calculate the marginal PIP values:
    $p\big(\psi_k = 1 \mid Y\big) \approx \dfrac{1}{n_{MC}}\sum_{j=1}^{n_{MC}}\psi_k^{\,j}\,; \quad k = 1, \ldots, K$.
14 Select the basis functions in the final model with the desired PIP values.
15 Output: the posterior mean $\mu_\theta$ and the covariance of $\theta$.
References
Arup (2019) Digital twin: Towards a meaningful framework. Technical report, Arup, London,
England
Bilionis I, Zabaras N, Konomi BA, Lin G (2013) Multi-output separable Gaussian process: towards an efficient, fully Bayesian paradigm for uncertainty quantification. J Comput Phys 241:212–239
Booyse W, Wilke DN, Heyns S (2020) Deep digital twins for detection, diagnostics and prognostics.
Mech Syst Signal Process 140:106612
Casella G, George EI (1992) Explaining the Gibbs sampler. Am Stat 46(3):167–174
Chenzhao L, Sankaran M, You L, Sergio C, Liping W (2017) Dynamic Bayesian network for aircraft wing health monitoring digital twin. AIAA J 55(3):930–941
Debroy T, Zhang W, Turner J, Suresh Babu S (2017) Building digital twins of 3d printing machines.
Scripta Materialia 135:119–124
Dongxing C, Xiangying G, Wenhua H (2019) A novel low-frequency broadband piezoelectric
energy harvester combined with a negative stiffness vibration isolator. J Intell Mater Syst Struct
30(7):1105–1114
Ganguli R, Adhikari S (2020) The digital twin of discrete dynamic systems: Initial approaches and
future challenges. Appl Math Model 77:1110–1128
George EI, McCulloch RE (1997) Approaches for Bayesian variable selection. Statistica Sinica
339–373
Harry M, Juan O, Nathan C (2019) Probabilistic methods for risk assessment of airframe digital
twin structures. Eng Fract Mech 221:106674
Hassler U et al (2016) Stochastic processes and calculus. Springer Texts in Business and Economics
He B, Bai K-J (2019) Digital twin-based sustainable intelligent manufacturing: a review. Adv Manuf
1–21
Hoodorozhkov S, Krasilnikov A (2020) Digital twin of wheel tractor with automatic gearbox. In:
E3S web of conferences, vol 164. EDP Sciences, pp 03032
Kapteyn MG, Knezevic DJ, Willcox K (2020) Toward predictive digital twins via component-based
reduced-order models and interpretable machine learning. In: AIAA scitech 2020 forum, pp 0418
Klebaner FC (2005) Introduction to stochastic calculus with applications. World Scientific Publish-
ing Company
Kloeden PE, Platen E (1992) Higher-order implicit strong numerical schemes for stochastic differ-
ential equations. J Stat Phys 66(1–2):283–314
Lu Y, Liu C, Kevin I, Wang K, Huang H, Xu X (2020) Digital twin-driven smart manufacturing:
Connotation, reference model, applications and research issues. Robot Comput-Integr Manuf
61:101837
Mike Z, Jianfeng Y, Donghao F (2019) Digital twin framework and its application to power grid
online analysis. CSEE J Power Energy Syst 5(3):391–398
Murphy KP (2012) Machine learning: a probabilistic perspective. MIT Press
Nayek R, Fuentes R, Worden K, Cross EJ (2021) On spike-and-slab priors for Bayesian equation discovery of nonlinear dynamical systems via sparse linear regression. Mech Syst Signal Process 161:107986
O'Hara RB, Sillanpää MJ (2009) A review of Bayesian variable selection methods: what, how and which. Bayesian Anal 4(1):85–117
Oksendal B (2013) Stochastic differential equations: an introduction with applications. Springer
Science & Business Media
Park KT, Lee D, Do Noh S (2020) Operation procedures of a work-center-level digital twin for
sustainable and smart manufacturing. Int J Precis Eng Manuf-Green Technol 7(3):791–814
Rajdip N, Souvik C, Sriram N (2019) A Gaussian process latent force model for joint input-state
estimation in linear structural systems. Mech Syst Signal Process 128:497–530
Rasmussen CE, Williams CK (2006) Gaussian processes for machine learning, vol 1
Risken H (1996) Fokker–Planck equation. In: The Fokker–Planck equation. Springer, Berlin, pp 63–95
List of Acronyms
AE AutoEncoder
AMR Adaptive Mesh-Refinement
ANN Artificial Neural Network
BiLSTM Bidirectional Long Short-Term Memory
DEIM Discrete Empirical Interpolation Method
DMD Dynamic Mode Decomposition
FE Finite Element
FNN Feedforward Neural Network
FOM Full Order Model
LSTM Long Short-Term Memory
ML Machine Learning
NIROM Non-Intrusive Reduced Order Model
NN Neural Network
PDE Partial Differential Equation
PG Petrov–Galerkin
POD Proper Orthogonal Decomposition
POD-G Proper Orthogonal Decomposition based Galerkin projection
PROM Parametric Reduced Order Model
Z. Dar
International Center for Numerical Methods in Engineering, Barcelona, Spain
e-mail: [email protected]
J. Baiges · R. Codina (B)
Universitat Politècnica de Catalunya, Barcelona, Spain
e-mail: [email protected]
J. Baiges
e-mail: [email protected]
8.1 Introduction
online step of the ROM can be performed several times to solve an optimization
problem or a real-time control problem, thus providing large computational savings
and real-time solutions.
The implementation strategy of the online step of ROMs is used to classify them
into intrusive and non-intrusive ROMs. The intrusive ROMs are physics based in the
sense that they require the governing equations, and possible access to the computa-
tional code, to project the FOM onto a reduced order space to solve for the unknowns.
Non-intrusive ROMs, on the other hand, are purely data-driven and do not require
access to governing equations or computational code.
In the current data-centric era, Machine Learning (ML) has emerged as a viable
tool for reduced order modeling. ML has revolutionized a wide range of fields over
the past few decades. Scientific computing, in particular reduced order modeling, is
no exception. Although the interest in exploring ML techniques for reduced order
modeling is relatively new, it has already shown great potential by replacing in part,
or entirely, the offline and online steps.
A variety of conventional (see Remark 8.2) and ML-based techniques have been
developed to date for the offline and online phases and applied successfully in a variety
of contexts, e.g. solid mechanics (Daniel et al. 2020; Yvonnet and He 2007), material
science (Hunter et al. 2019), fluid mechanics (Baiges et al. 2013b, c; Burkardt et al.
2006; Galletti et al. 2004; Glaz et al. 2010; Kalashnikova and Barone 2011; Lucia
and Beran 2003; Pawar et al. 2019; Srinivasan et al. 2019; Wang et al. 2012), shape
optimization (Akhtar et al. 2010; Bergmann et al. 2007; Li et al. 2022; LeGresley and
Alonso 2000; Rozza et al. 2011), and flow control (Arian et al. 2000; Graham et al.
1999; Mohan and Gaitonde 2018; Noack et al. 2011) problems. The proper orthog-
onal decomposition-based Galerkin projection (POD-G) method can be considered
to be the most well-established and commonly used method for reduced order mod-
eling. This method uses proper orthogonal decomposition (POD) to find the basis,
called POD modes, of the low-dimensional space. The FOM is then projected in an
intrusive manner onto these POD modes using mostly Galerkin projection to solve
for the unknowns.
The organization of the chapter is as follows. Section 8.2 describes POD along
with its essential ingredient, the singular value decomposition (SVD). Section 8.3
describes the Galerkin projection, hyperreduction, and stabilization of POD-ROMs.
Section 8.4 describes the non-intrusive ROMs with a brief explanation of dynamic
mode decomposition (DMD). Section 8.5 deals with the description of parametric
ROMs. Finally, Sect. 8.6 describes the ML techniques used for the online and offline
phases of reduced order modeling.
Remark 8.1 In literature, the term reduced basis is also sometimes used for
projection-based methods. However, more commonly reduced basis methods are
meant to refer to a particular class of projection-based methods based on greedy
algorithms.1 This latter usage is also applicable in the context of this chapter.
1 Greedy algorithms are a class of algorithms that are based on choosing the option which produces
the largest immediate reward with the expectation that the successive application of greedy sampling
will lead to a global optimum. Greedy algorithms may use an error estimator to guide the sampling.
The bases commonly used, e.g. piecewise-linear basis in the finite element (FE)
method, Fourier modes, etc. can solve a large number of dynamical systems, but these
bases are generic and do not correspond to the intrinsic modes of the systems which
they solve. Hence, a large number of such basis functions need to be used to capture
the solution. The intrinsic modes which form the basis of the solution space can
be found using proper generalized decomposition, reduced basis method or proper
orthogonal decomposition (POD), among others. Proper generalized decomposition
and reduced basis methods are commonly based on a greedy approach and a com-
prehensive review on them can be found in Chinesta et al. (2010, 2011), Hesthaven
et al. (2016), respectively. POD (Chatterjee 2000) is perhaps the most commonly
used method to find the basis, called POD modes in the context of POD.
For clarity, let us first introduce the concept of function and vector based descrip-
tion of a variable in the context of numerical methods. Suppose that the analysis
domain is spatially discretized using an interpolation-based method like finite ele-
ments, finite volumes, spectral elements, etc. The variable of interest can then either
be represented in the vectorial form as the collection of coefficients that multiply
the basis functions, or in a corresponding functional form which relies on the inter-
polation to define the variable over the entire domain. Throughout this chapter, the
functional representation is denoted using the lowercase letters, like q, and the vecto-
rial representation using the uppercase letters, like Q. In the case of Greek alphabets,
where the case of alphabets is not obvious, functional representation is denoted by
showing the dependence on the spatial coordinates x explicitly, like ζ(x). Also, a variable with an underbar, like $\underline{a}$, represents the variable in a general form, which could be either functional or vectorial based on the context. In the case of the FE method, the
vectorial and functional representations of a variable u are related as
$$u(x, t) = \sum_{k=1}^{nn} \chi^{k}(x)\, U^{k}(t)$$
where u(x, t) is the functional representation which depends on the spatial coordi-
nates x in addition to time t, U k the k−th element of the vector U, χ k (x) the FE
interpolation function for the k−th node and nn the total number of nodes.
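A minimal one-dimensional sketch of this functional/vectorial correspondence is given below; the mesh and nodal values are hypothetical, and piecewise-linear hat functions are assumed so that the interpolation reduces to np.interp.

```python
import numpy as np

x_nodes = np.linspace(0.0, 1.0, 11)             # hypothetical 1D mesh nodes
U_nodes = np.sin(np.pi * x_nodes)               # hypothetical nodal values U_k at a fixed time t
x_eval = np.linspace(0.0, 1.0, 101)
u_func = np.interp(x_eval, x_nodes, U_nodes)    # u(x, t) = sum_k chi_k(x) U_k(t) for hat functions
```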
where $\Phi_k$ is the kth POD mode, $U_r^k$ the kth ROM coefficient, nb the number of basis vectors and μ a parameter that characterizes the behavior of the system. $u$ and $\Phi$ could be the functional or vectorial representation, as required, but the same representation must be used for both variables.
POD relies on a singular value decomposition (SVD) to find the basis. Let us
describe how the basis is determined using the SVD of the data generated using
PDEs.
Let us consider a general unsteady nonlinear PDE describing the behavior of a real-
valued function u of n components and dependent on a parameter μ. The evolution
of u in the spatial domain Ω ⊂ R^d, d denoting the dimensions of the problem, and time interval ]0, tf[ is given by
$$\partial_t u + N(u) = f, \qquad (8.2)$$
where N is a nonlinear operator, f the forcing term and ∂t the time derivative.
Equation (8.2) is further provided with suitable boundary and initial conditions so
that the problem is well-posed. For simplicity, μ is considered to be fixed for now,
and hence u will not be stated explicitly as a function of the parameter μ from now
on till Sect. 8.5, when parametric ROMs are discussed.
After the advent of the computational era, the most commonly used technique to
solve (8.2) is to discretize it in space using a discretization technique, e.g. the FE
method. This discretization leads to a system of ordinary differential equations
(ODEs) which reads: find U :]0, tf [→ Rnp such that
A(U)U = R, (8.4)
where A(U) ∈ Rnp×np is the nonlinear system matrix and R ∈ Rnp is the right-hand-
side which takes into account the contributions of the previous values of U as well.
Equation (8.4) can then be solved for nt time steps to get nt solution vectors.
To find the POD basis, not all solution vectors are generally required. Rather, it is desired to select a minimum, but sufficient, number of solution vectors that contain all
the important dynamic features of the system. A simple approach for a uniform step
time integration scheme is to use solution vectors after every i-th time-step, where
i is a natural number (Baiges et al. 2015). Another approach could be to capture
each cycle of a periodic phenomenon using a certain number of solution vectors
(Mou et al. 2021). Suppose, that we are able to gather a set of ns solution vectors,
called snapshots, {U 1 , U 2 , ..., U ns } carrying all the important features of the system
dynamics. To simplify the exposition, we have assumed that the snapshots correspond
to the first ns consecutive solution vectors, but this can be easily generalized. Note
that the term snapshots will be interchangeably used for the solution vectors, as well
as their mean-subtracted form discussed in Sect. 8.2.2. Once the solution set has been
gathered, the SVD is used to find the basis or POD modes.
SVD, also known as principal component analysis, is one of the most important
matrix factorization techniques used across many fields. The SVD of a matrix is
guaranteed to exist and can be considered unique for the bases generation purposes.
To perform the SVD, it is customary to first subtract the mean value $\bar{U} \in R^{np}$ from the solution vectors $U_j$ to obtain $S_j = U_j - \bar{U}$, for j = 1, 2, ..., ns. The mean-subtracted snapshots are then arranged into a matrix $S \in R^{np\times ns}$ as follows:
$$S = \big[\,S_1\;\; S_2\;\; \ldots\;\; S_{ns}\,\big],$$
a tall skinny matrix as, in general, ns ≪ np. It is now desired to find the basis of the space B ⊂ B_f to which these snapshots belong using SVD. The SVD of S gives
$$S = \Phi\,\Sigma\, V^T, \qquad (8.5)$$
where $\Phi \in R^{np\times np}$ contains the left singular vectors, $\Sigma \in R^{np\times ns}$ contains the singular values and $V \in R^{ns\times ns}$ contains the right singular vectors. Some important
properties of the SVD are listed below:
• Matrices Φ and V are orthogonal, i.e.
$$\Phi^T\Phi = \Phi\Phi^T = I_{np} \quad \text{and} \quad V^T V = V V^T = I_{ns}, \qquad (8.6)$$
where the matrix of left singular vectors now belongs to $R^{np\times ns}$. The full and reduced versions of the SVD are shown in Fig. 8.1. Note that (8.8) represents the exact decomposition of S. The set of columns $\{\Phi_j\}_{j=1}^{ns}$ of this matrix represents the basis vectors of B.
So, the dimension of the solution space has been reduced from np to ns, with ns ≪ np. However, ns still could be of the order of hundreds or even thousands and
could still be considered computationally demanding. So, to unlock the full potential
of ROMs in terms of computational savings, truncation is performed to yield a smaller
number of basis vectors than the one provided by the economy SVD.
Fig. 8.1 Matrix representation of full and economy SVD. The lightening of the color of the circles represents the ordered decrease of the diagonal values of the singular value matrices
As discussed above, in practice, the basis is truncated to get r < ns basis vectors. This leads to a reduced space $B_r \subset B \subset B_f$, where $\dim B_r = r$, $\dim B = ns$ and $\dim B_f = np$, with $r < ns \ll np$. The truncation is motivated by the ordered decreasing singular values in Σ, Property (8.7). A singular value $\Sigma_{ii}$ represents the amount of energy or information with which the corresponding basis vector $\Phi_i$ contributes toward the solution. This contribution can be quantified using the relative information content (RIC) given by
$$\mathrm{RIC} = \frac{\sum_{k=1}^{r}\Sigma_{kk}}{\sum_{k=1}^{ns}\Sigma_{kk}}.$$
The singular and RIC values for the classical problem of the flow over a cylinder
approximated using the FE method are shown in Fig. 8.2. It can be seen that only a
few initial values contain most of the energy. So, instead of using 150 basis vectors,
it is possible to use just a few of them to describe the flow dynamics around the
cylinder. Based on the reasoning discussed above, RIC is widely used as a truncation
criterion. The number of POD modes can then be decided such that RIC is equal
to a desired value, e.g. 0.9, meaning that the POD modes will retain 90% of the
information. This truncation can be represented as a truncated SVD as
$$S \approx \hat{S} = \hat{\Phi}\,\hat{\Sigma}\,\hat{V}^T, \qquad (8.9)$$
where $\hat\Phi \in R^{np\times r}$ contains the first r columns of Φ, $\hat\Sigma \in R^{r\times r}$ contains the top-left r × r block of Σ and $\hat V \in R^{ns\times r}$ contains the first r columns of V. Note that the truncated SVD only approximates the matrix S, as shown in (8.9). However, the truncated SVD gives the optimal approximation of S in the low-dimensional space, as guaranteed by the Eckart–Young theorem (Eckart and Young 1936).
Theorem 8.1 The rank-r truncated SVD $\hat S$ provides the best rank-r approximation to S in the $L_2$ sense, i.e.
$$\underset{\hat{S}\ \mathrm{of\ rank}\ r}{\arg\min}\ \big\| S - \hat{S} \big\|_2 = \hat{\Phi}\,\hat{\Sigma}\,\hat{V}^T.$$
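A minimal NumPy sketch of this offline step, an economy SVD of the snapshot matrix followed by RIC-based truncation, is given below; the 0.9 threshold is just the example value mentioned above, and the snapshots are assumed to be already mean-subtracted.

```python
import numpy as np

def pod_basis(S, ric_target=0.9):
    """S: np x ns matrix of mean-subtracted snapshots."""
    Phi, sig, _ = np.linalg.svd(S, full_matrices=False)   # economy SVD
    ric = np.cumsum(sig) / np.sum(sig)                     # relative information content
    r = int(np.searchsorted(ric, ric_target)) + 1          # smallest r with RIC >= target
    return Phi[:, :r], sig[:r]
```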
where $s^i(x)$ is the functional form of the vectors $S_i$, i = 1, ..., ns, and $\hat\phi_j(x)$, j = 1, ..., r, are the basis functions, which are $L^2(\Omega)$-orthogonal. Note that (8.11) minimizes the difference over the entire domain Ω. For the sake of clarity, we have considered that the unknown of the problem is a scalar function, and so are the snapshots and the basis, but the extension to the vector case is straightforward. Also, note that the difference between $\hat\Phi$ and $\hat\phi(x)$ is not only one of vectorial versus functional representation; rather, $\hat\Phi$ and $\hat\phi(x)$ are two different bases having different orthogonality properties. The functional SVD (8.11) can be shown to be the same as minimizing
$$J\big(\hat\Phi_1,\ldots,\hat\Phi_r\big) = \sum_{i=1}^{ns}\Big\| M^{1/2}S_i - \sum_{j=1}^{r}\big[(M^{1/2}S_i)^T M^{1/2}\hat\Phi_j\big]\, M^{1/2}\hat\Phi_j \Big\|^2_{R^{np}}, \quad \text{subject to } \hat\Phi^T M \hat\Phi = I_r, \qquad (8.12)$$
where M is the mass matrix as in (8.3) and $I_r \in R^{r\times r}$ is the identity matrix. Note that the $L^2(\Omega)$-orthogonality of the basis functions $\hat\phi(x)$ translates to orthogonality of the corresponding basis vectors $\hat\Phi$ with respect to the mass matrix M as $\hat\Phi^T M \hat\Phi = I_r$. To find $\hat\Phi$, we perform the SVD of $\tilde{S} = M^{1/2}S$:
$$\tilde{S} = \tilde{\Phi}\,\hat{\Sigma}\,\hat{V}^T, \qquad \hat{\Phi} = M^{-1/2}\tilde{\Phi}.$$
Using the functional SVD produced more accurate results in Dar et al. (2023).
Note, however, that throughout this chapter the SVD term will refer to the one that
solves problem (8.10), unless stated otherwise, e.g. in Sect. 8.3.3.3.
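A minimal sketch of this M-weighted (functional) POD is given below; it assumes a lumped, i.e. diagonal, mass matrix so that $M^{1/2}$ and $M^{-1/2}$ reduce to element-wise square roots, which is a simplification of the general case.

```python
import numpy as np

def functional_pod(S, m_lumped, r):
    """S: np x ns mean-subtracted snapshots; m_lumped: diagonal of a lumped mass matrix."""
    Mh = np.sqrt(m_lumped)[:, None]                         # M^{1/2} for a diagonal M
    U, sig, _ = np.linalg.svd(Mh * S, full_matrices=False)  # SVD of M^{1/2} S
    Phi = U[:, :r] / Mh                                     # Phi = M^{-1/2} U, so Phi^T M Phi = I_r
    return Phi, sig[:r]
```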
As discussed in Sect. 8.1, reduced order modeling consists of offline and online
steps. The offline step of finding the reduced order basis is discussed in Sect. 8.2.
Now we describe the most commonly used method for the online step, the Galerkin
projection, to find the ROM coefficients in (8.1).
As $\hat\Phi$ forms the basis of the reduced solution space $B_r$, and was calculated from mean-subtracted snapshots, decomposition (8.1) can be written as
$$U \approx \hat{\Phi}\, U_r + \bar{U}. \qquad (8.13)$$
Substituting this into (8.4) gives $AU \approx A(\hat\Phi U_r + \bar U) = R$, i.e.
$$A\hat{\Phi}\, U_r = R - A\bar{U}. \qquad (8.14)$$
Premultiplying (8.14) by $\hat\Phi^T$ yields
$$\hat{\Phi}^T A\hat{\Phi}\, U_r = \hat{\Phi}^T\big(R - A\bar{U}\big), \qquad (8.15)$$
which is the Galerkin projection of the full order system (8.4) onto the reduced space. Let us write (8.15) compactly as
$$A_r U_r = R_r \qquad (8.16)$$
where
$$A_r := \hat{\Phi}^T A\hat{\Phi} \in R^{r\times r}, \qquad R_r := \hat{\Phi}^T\big(R - A\bar{U}\big) \in R^{r}.$$
Applicable for general matrices A, the so-called Petrov–Galerkin (PG) projection is found to provide more stable results, as compared to the Galerkin projection, in the case of A not being an SPD matrix (Carlberg et al. 2011). Using $\hat\Phi^T A^T$ as a suitable PG projector, the PG reduced order form of (8.4) is given by
$$A_{rA}\, U_r = R_{rA}, \qquad (8.17)$$
where now
$$A_{rA} := \hat{\Phi}^T A^T A\hat{\Phi} \in R^{r\times r}, \qquad R_{rA} := \hat{\Phi}^T A^T\big(R - A\bar{U}\big) \in R^{r}.$$
This corresponds to a least squares strategy for solving (8.14) with respect to the
standard Euclidean norm in Rnp . Irrespective of the type of projection used, both
the final reduced order systems (8.16) and (8.17) are r × r systems as opposed to
the full order system (8.4) of size np × np, with r ≪ np. Thus, the reduced order
system can be solved at a fraction of the cost of the full order system. All the concepts
described later apply to both the Galerkin and PG ROMs; however, for simplicity,
we will describe them using the Galerkin-ROM (8.15).
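A minimal sketch of one Galerkin-ROM solve, Eqs. (8.13)–(8.16), is given below; for a nonlinear problem, A depends on U and this step would sit inside the nonlinear and time-stepping loops.

```python
import numpy as np

def solve_galerkin_rom(A, R, Phi, U_bar):
    """Project the full system A U = R onto the POD basis Phi and recover the full-order state."""
    Ar = Phi.T @ A @ Phi                 # reduced system matrix, r x r
    Rr = Phi.T @ (R - A @ U_bar)         # reduced right-hand side
    Ur = np.linalg.solve(Ar, Rr)         # ROM coefficients
    return Phi @ Ur + U_bar              # U approx Phi Ur + U_bar, Eq. (8.13)
```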
8.3.2 Hyperreduction
The ROM discussed above can be solved at a reduced computational expense. How-
ever, assembling the system matrices has a cost of the same order as that of the FOM.
For linear problems, the assembling of matrices needs to be done once, and hence, is
not considered a bottleneck to achieving reduced computation times. However, for
nonlinear problems, the system matrices need to be assembled for every nonlinear
iteration, i.e. multiple times for every time-step in general, and will lead to a signif-
icant cost. Thus, it is important to use some techniques to determine the nonlinear
terms at a reduced cost. This is achieved using hyperreduction techniques and the
resulting models are called hyper-ROMs. There are many methods used for hyperre-
duction including, but not limited to, empirical interpolation method (Barrault et al.
2004) or its discrete version discrete empirical interpolation method (DEIM) (Chat-
urantabut and Sorensen 2010), Gauss–Newton with approximate tensors (Carlberg
et al. 2011), missing point estimator approach (Astrid et al. 2008), cubature-based
approximation method (An et al. 2008), energy conserving sampling and weighting
method (Farhat et al. 2015), and adaptive mesh refinement (AMR)-based hyperre-
duction (Reyes and Codina 2020). Here we briefly describe DEIM- and AMR-based
hyperreduction.
Remark 8.3 In the case of a polynomial nonlinearity in general, and quadratic non-
linearity in particular, the reduced nonlinear operator can be written as a tensor which
is not a function of U r , and hence needs to be computed just once. However, hyper-
reduction techniques, like DEIM and AMR, described in this chapter are applicable
in a broader context.
DEIM is a greedy algorithm and its origin can be traced back to the gappy POD
method (Everson and Sirovich 1995) which was originally designed for image recon-
struction. Just as ROM approximates the solution space by a subspace, DEIM does
the same but for nonlinear terms only. However, DEIM uses interpolation indices to
find the temporal coefficients instead of solving the reduced system.
Let us denote the vector of nonlinear terms as N(θ ) ∈ Rnp , depending on θ . θ can
represent time t or any other parameter in the case of parametric ROMs. However,
here we explain DEIM in the context of non-parametric nonlinear ROMs with θ = t.
For DEIM applied to parametric ROMs, see Antil et al. (2014). DEIM proposes
approximating the space to which the N belongs by a subspace of lower dimension
s, i.e. s ≪ np and not necessarily equal to the dimension r of the ROM space. Let
this subspace be spanned by the basis $B = [B_1, ..., B_s] \in R^{np\times s}$. Thus, we can write
$$N \approx B\, d(t), \qquad (8.18)$$
where d(t) is the vector of coefficients. For simplicity, from now on, the dependence
on t will be omitted from the notation.
An efficient way to determine d is to sample s spatial points and use them to
determine d. This can be achieved using a sampling matrix H defined as
$$H = [H_{s_1}, \ldots, H_{s_s}] \in R^{np\times s},$$
where H s j = [0, ..., 0, 1, 0, ...0]T ∈ Rnp , for j = 1, ..., s, is the s j th column of the
identity matrix I np ∈ Rnp×np . Using the sampling matrix H, we can write
H T N = H T Bd.
d = (H T B)−1 H T N. (8.19)
Substituting (8.19) into (8.18) gives
$$N \approx B\big(H^T B\big)^{-1} H^T N. \qquad (8.20)$$
Now we need to define the basis B and the sampling points $s_j$, j = 1, ..., s, called
interpolation indices in DEIM, to approximate N using (8.20) at a reduced cost. The
basis B is found using POD for the nonlinear vector N. During the simulations, the
nonlinear vectors at different time steps are gathered to form a snapshot matrix S N
of nonlinear terms as ⎡ ⎤
S N = ⎣ N 1 N 2 . . . N ns ⎦ .
Ŝ N = B N V TN (8.21)
to obtain the basis B of the desired order. The interpolation indices are then selected
iteratively using the basis B. This approach is shown in Algorithm 1, where [|ρ| y]
= max{|X|} means that y is the index of the maximum value of the components of
vector X = [X 1 , ..., X np ], i.e. X y ≥ X z , for z = 1, ..., np. The smallest y is taken if
more than one component corresponds to the maximum value.
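A minimal sketch of the greedy selection of interpolation indices, along the lines of Algorithm 1, is given below; it is a generic DEIM index selection and not a transcription of the algorithm as printed.

```python
import numpy as np

def deim_indices(B):
    """B: np x s matrix of POD basis vectors of the nonlinear term; returns s interpolation indices."""
    npts, s = B.shape
    idx = [int(np.argmax(np.abs(B[:, 0])))]
    for j in range(1, s):
        # coefficients that reproduce column j at the already selected indices
        c = np.linalg.solve(B[np.ix_(idx, range(j))], B[idx, j])
        r = B[:, j] - B[:, :j] @ c                    # residual of the interpolation
        idx.append(int(np.argmax(np.abs(r))))         # next index where the residual is largest
    return np.array(idx)
```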
regions of higher physics-based activity and coarsening it everywhere else such that the overall number of degrees of freedom is reduced. AMR uses an a posteriori error estimator to decide these areas of higher physics-based activity. In Reyes and Codina (2020), the mesh was coarsened such that the total error, in a certain norm, remained approximately the same before and after hyperreduction. An a posteriori residual-based error estimator was used and a coarse mesh containing 80% fewer degrees of freedom was achieved, giving results with a negligible error. Numerical analysis of the error estimator was
also performed in Codina et al. (2021) and it was shown that the estimator provides
an upper bound for the true error and has the correct numerical behavior.
Instabilities can arise when PDEs are solved using numerical methods, usually in
singular perturbation problems or when the approximation spaces of the different
unknowns need to satisfy compatibility conditions. This issue is further exacerbated
when POD-G is used to develop a ROM. This has to do with the fact that the ROM
does not account for the impact of the FOM scales that are not captured by the
low-order space. This problem is well-known in other computational mechanics set-
tings, such as finite elements, where stabilized formulations have been developed to
address the instability of the Galerkin projection. The Variational Multiscale (VMS)
framework, originally proposed in Hughes et al. (1998), is a popular framework used
to develop stabilized formulations taking into account the effect of the discarded
scales in a multiscale problem. A comprehensive review of VMS-based stabilization
methods developed for fluid problems is provided in Codina et al. (2018). Inspired
by this, VMS-based stabilization methods have been developed for projection-based
ROMs (Reyes and Codina 2020) and successfully applied in the context of flow
problems (Reyes et al. 2018), fluid-structure interaction (Tello et al. 2020; Tello and
Codina 2021), and adaptive mesh-based hyperreduction (Reyes and Codina 2020).
A comprehensive description of it is provided in Codina et al. (2021), Reyes and
Codina (2020). However, a summary of the method, which uses the same VMS for-
mulation to stabilize both FOM and ROM, is presented here for completeness.
Let us consider again problem (8.2) and write it in a slightly modified form, along with the boundary and initial conditions. Let the boundary Γ of the domain Ω be split into non-overlapping Dirichlet, $\Gamma_D$, and Neumann, $\Gamma_N$, parts. Given the initial condition for the unknown $u^0$, the problem aims at finding u of n components that satisfies
$$\partial_t u + N(u; u) = f \quad \text{in } \Omega,\ t \in\, ]0, t_f[,$$
$$D u = D u^0 \quad \text{on } \Gamma_D,\ t \in\, ]0, t_f[,$$
$$F(u; u) = f_N \quad \text{on } \Gamma_N,\ t \in\, ]0, t_f[,$$
$$u = u^0 \quad \text{in } \Omega,\ t = 0,$$
where K i j , Ac,i , A f,i , and S are matrices in Rn×n and are a function of u, ∂i
denotes differentiation with respect to the i-th Cartesian coordinate xi and indexes
i, j = 1, ..., d. Let us also define the flux operator F using Einstein’s notation as
where n is the external unit normal to the boundary with n i being its i-th component.
To write the weak form of the problem, let the integral of the product of two functions, f and g, over the domain ω be denoted by $\langle f, g\rangle_\omega$. For simplicity, the subscript ω is omitted in the case ω = Ω. Let us also introduce the form $B_\omega$ and the linear form $L_\omega$ as
$$B_\omega(u;\, y, v) := \langle\partial_i v,\ K_{ij}(u)\partial_j y\rangle_\omega + \langle v,\ A_{c,i}(u)\partial_i y\rangle_\omega + \langle\partial_i(A_{f,i}^T(u)\, v),\ y\rangle_\omega + \langle v,\ S(u)\, y\rangle_\omega,$$
$$L_\omega(v) := \langle v, f\rangle_\omega + \langle v, f_N\rangle_{\Gamma_N}.$$
Let u(., t) and v belong to the space Bc , the solution space of the continuous problem.
The weak form of the problem (in space) consists of finding u :]0, tf [→ Bc such that
$$\langle v, \partial_t u\rangle + B(u;\, u, v) = L(v), \qquad (8.22)$$
for all v ∈ Bc,0 , where Bc,0 is the space of time independent test functions that satisfy
D v = 0 on $\Gamma_D$. For simplicity, we assume in what follows homogeneous Dirichlet
conditions so that v ∈ Bc = Bc,0 .
The VMS method can be applied to other discretization techniques, but in what follows we shall concentrate on the FE method. Thus, let us discretize the domain using FEs. Let $P_h = \{K\}$ be an FE partition of the domain Ω, assumed quasi-uniform for simplicity, with elements of size h. From this, a conforming FE space $B_h \subset B_c$ may
be constructed using a standard approach. Note that now Bh = B f , i.e. the FE space
is a particular realization of the FOM space introduced earlier.
Any time integration scheme may be used for time discretization. For conciseness, we shall assume that a backward difference scheme is employed with a uniform time step Δt and the time discretization is represented by replacing $\partial_t$ with $\delta_t$, where $\delta_t$ involves $u_h^n, u_h^{n-1}, \ldots$, depending on the order of the scheme used. Using a superscript
n for the time step counter, and a subscript h for FE quantities, the fully discretized
Galerkin approximation of problem (8.22) is to find {unh } ∈ Bh , for n = 1, ..., nt, nt
being the number of time steps, that satisfy
where we have omitted the initial conditions and the time step superscript for sim-
plicity. This problem may suffer from instabilities, and hence requires the use of
stabilization methods like those based on VMS method.
The core idea of the VMS approach lies in the splitting $B_c = B_h \oplus B'$, where $B'$ is any space that completes $B_h$ in $B_c$. We call $B'$ the space of subgrid scales or subscales (SGS), and the functions in the SGS spaces will be identified with the superscript $'$. Using the splitting $u = u_h + u'$ and similarly for the test function $v = v_h + v'$, the continuous problem (8.22) splits into
Important Considerations/Assumptions
• Choosing the subscale space $B'$. In fact, the approximation to $B'$ will be a consequence of the approximation to $u'$. The choice of SGS space leads to algebraic subgrid scales (Codina 2000a) or orthogonal subgrid scales, among other possibilities (Codina 2000b).
• While expanding (8.25), we come across the application of the operator N to the subscales as $N(u; u')$. Because the subscale problem is infinite dimensional, the following key approximation is used
$$N(u; u')\big|_K \approx \tau_K^{-1}(u)\, u'\big|_K,$$
where $B'(u^*; u_h, v_h)$ and $L'(u^*; v_h)$ are defined based on the choices made regarding the considerations discussed above. $B'(u^*; u_h, v_h)$ and $L'(u^*; v_h)$ for different combinations of choices can be found in Reyes and Codina (2020).
A ROM for the FOM discussed above can be developed by constructing a ROM space $B_r \subset B_h \subset B_c$. Using the POD relying on the SVD for functions, described in Sect. 8.2.2.2, we may obtain a ROM space of dimension r,
$$B_r = \mathrm{span}\{\hat\phi_1, \hat\phi_2, \ldots, \hat\phi_r\},$$
$$B_c = B_h \oplus B' = B_r \oplus B'_r, \qquad B_h = \mathrm{span}\{\hat\phi_1, \hat\phi_2, \ldots, \hat\phi_{np}\}.$$
Then, since the basis vectors obtained from the POD are $L^2(\Omega)$-orthogonal, choosing orthogonal subgrid scales allows us to write
$$B'_r = \mathrm{span}\{\hat\phi_{r+1}, \hat\phi_{r+2}, \ldots, \hat\phi_{np}\} \oplus B',$$
i.e. we have an explicit representation of the ROM space of SGSs. So, when the VMS-ROM is used to approximate the ROM SGSs, it accounts for the FOM subscales, present in the subspace $B'$, as well as the SGSs arising as a result of the ROM truncation, present in the subspace spanned by $\{\hat\phi_{r+1}, \hat\phi_{r+2}, \ldots, \hat\phi_{np}\}$.
Having in mind the previous discussion, the final reduced order problem can be written as finding $u_r^n \in B_r$, for n = 1, ..., nt, that satisfy
It can be seen that Eqs. (8.29)–(8.31) look exactly the same as (8.26)–(8.28). Furthermore, the expressions for $B'$ and $L'$ are also the same for the FOM and the ROM if the same choices are made for both, regarding the considerations discussed in Sect. 8.3.3.2. The only difference between the FOM and the ROM formulations is that, in the case of the ROM, functions are approximated in $B_r$ instead of $B_h$. $B'(u^*; u_r, v_r)$ and $L'(u^*; v_r)$ for different combinations of choices can be found in Reyes and Codina (2020).
reduced order models (NIROMs), which provide solutions based only on the data and without using the governing equations. NIROMs allow the FOM and ROM implementations to be completely decoupled and are particularly useful in cases where the code used for the FOM is not open source. NIROMs can be obtained using conventional or ML-based techniques, and the recent large-scale adoption of NIROMs can be attributed to the increasing popularity of ML in scientific computing. ML-based NIROMs are discussed later in Sect. 8.6.2. For now, we describe dynamic mode decomposition (DMD), which can be considered a conventional non-intrusive extension of the POD.
S∗ = T S (8.32)
T = S∗ S+ . (8.33)
As $\hat{\Phi}$ and $\hat{V}$ are orthogonal and satisfy $\hat{\Phi}^T \hat{\Phi} = I_{np}$ and $\hat{V}^T \hat{V} = I_{ns}$, using (8.34) allows us to write (8.33) as
$T \approx S^* \hat{V} \hat{\Sigma}^{-1} \hat{\Phi}^T$,
where $\hat{\Sigma}$ is a diagonal matrix and can be easily inverted. We are only interested in the first r eigenvalues and eigenvectors of matrix $T$. The r-rank approximation of $T$, denoted by $T_r \in \mathbb{R}^{r \times r}$, is achieved by projecting $T$ onto the reduced space using the basis $\hat{\Phi}$ as
$T_r = \hat{\Phi}^T T \hat{\Phi} = \hat{\Phi}^T S^* \hat{V} \hat{\Sigma}^{-1} \hat{\Phi}^T \hat{\Phi} = \hat{\Phi}^T S^* \hat{V} \hat{\Sigma}^{-1}$,
where the entries of the diagonal matrix $\Upsilon$ are the eigenvalues of the low-dimensional $T_r$, as well as of the high-dimensional $T$. $E$ contains the eigenvectors of $T_r$ and allows us to obtain the eigenvectors of $T$, denoted as $\varphi$, as
$\varphi = S^* \hat{V} \hat{\Sigma}^{-1} E$,
where the columns of $\varphi \in \mathbb{R}^{np \times r}$, called DMD modes, are the eigenvectors of $T$.
Once DMD eigenvalues and eigenvectors have been determined, the state of the
system at the k-th time-step, U k , is given by
Fig. 8.3 Illustration of DMD applied to a flow over a cylinder. Three DMD modes and the temporal evolution of their coefficients are shown
$U_k = \varphi\, \Upsilon^{k-1} D$,
where $D \in \mathbb{R}^r$ is the vector of mode amplitudes that can be computed using the initial conditions. The DMD of a flow over a cylinder is illustrated in Fig. 8.3.
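A minimal NumPy sketch of these DMD steps is given below. It assumes the snapshot matrices S and S* have already been assembled column-wise from consecutive states; the function name and the truncation rank are illustrative choices, not part of any particular reference implementation.

```python
import numpy as np

def dmd(S, S_star, r):
    """Minimal sketch of truncated DMD.

    S      : array of shape (np_, ns), snapshots U^1 ... U^ns
    S_star : array of shape (np_, ns), snapshots U^2 ... U^{ns+1}
    r      : number of retained modes
    """
    # Truncated SVD of the snapshot matrix, S ~= Phi_hat @ diag(sigma) @ Vh
    Phi_hat, sigma, Vh = np.linalg.svd(S, full_matrices=False)
    Phi_hat, sigma, V = Phi_hat[:, :r], sigma[:r], Vh[:r, :].T

    # Reduced operator T_r = Phi_hat^T S* V Sigma^{-1}
    T_r = Phi_hat.T @ S_star @ V / sigma

    # Eigen-decomposition of the reduced operator: T_r E = E Upsilon
    eigvals, E = np.linalg.eig(T_r)

    # DMD modes phi = S* V Sigma^{-1} E
    phi = (S_star @ V / sigma) @ E

    # Mode amplitudes from the first snapshot, phi @ D ~= U^1
    D, *_ = np.linalg.lstsq(phi, S[:, 0].astype(complex), rcond=None)
    return phi, eigvals, D

# Reconstruction of the k-th state: U_k ~= phi @ (eigvals**(k - 1) * D)
```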
In the previous sections, we have discussed how to build a ROM during an offline
stage and how to use it for getting results quickly during an online stage. So far, we
have assumed that the unknown U(t, μ) was a function of t only and the parameter
μ ∈ D ⊂ R, was kept constant. So, in essence, the ROM was used to solve exactly
the same problem whose solution was used to generate the snapshots to be used for
the ROM basis generation. The aim of reduced order modeling is to perform the
computationally expensive offline stage once (or a few times) and then use the gener-
ated ROM to perform many simulations in the cheap online phase for the new values
of the parameter μ. This situation arises routinely in optimization and control prob-
lems governed by parametric PDEs. The parameter can represent anything including
boundary conditions, geometry, viscosity, the Reynolds number, etc. For simplicity, we assume that the parameter is a scalar and that its different values, $\mu_1, ..., \mu_{ps}$, represent different configurations; however, the subsequent discussion is equally valid when μ represents more than one parameter. The difficulty with parametric
reduced order models (PROMs) lies in the fact that the basis $\Phi_{\mu_1}$ obtained for $\mu_1$ is unlikely to perform well for $\mu_2$, as the behavior captured by the basis $\Phi_{\mu_1}$ might be different from the behavior exhibited by the system for $\mu_2$, i.e.
$U(\mu_1) \approx \Phi_{\mu_1} U_r(\mu_1)$,
but
$U(\mu_2) \not\approx \Phi_{\mu_1} U_r(\mu_2)$.
Several techniques have been developed to obtain a suitable basis for PROMs.
Hyperreduction techniques, like the DEIM described in Sect. 8.3.2.1, can also be used for PROMs (Antil et al. 2014) with θ = μ. Here, we describe two popular techniques
to obtain a basis for PROMs, the global basis method and the local basis with the
interpolation method. These techniques commonly use a greedy approach to sample
suitable parameter values to obtain the snapshots. Thus, they are commonly referred
to as POD-greedy approaches.
Probably the most obvious approach is to sample different parameter values, obtain snapshots corresponding to them, and perform the SVD on all the snapshots to obtain a single global basis $\hat{\Phi}$ such that
$U(\mu) \approx \hat{\Phi}\, U_r(\mu), \quad \forall\, \mu \in \mathcal{D}$. (8.35)
A greedy approach can be used to sample ps parameter values to obtain the snapshots. The global basis approach can provide a compact r-dimensional basis satisfying (8.35) if the solution is not very sensitive to the parameter μ, i.e. the solution manifold has a rapidly decaying Kolmogorov n-width. If the solution manifold has a slowly decaying Kolmogorov n-width, snapshots may be required at many sampled parameter values, which can lead to a prohibitively expensive offline phase. Even if the computational expense of the offline phase is completely ignored, achieving reasonable accuracy in the online phase will require many POD modes. Hence, truncating the global basis to a rank r that ensures real-time execution of the online phase with reasonable accuracy will not be possible.
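A minimal sketch of assembling such a global basis is given below; the snapshot generator solve_fom is a placeholder for the full order model, and the sampling of the parameter values (e.g. by a greedy strategy) is assumed to have been done beforehand.

```python
import numpy as np

def global_pod_basis(mus, solve_fom, r):
    """Assemble a single global POD basis from snapshots at sampled parameters.

    mus       : iterable of sampled parameter values mu_1, ..., mu_ps
    solve_fom : callable returning a snapshot matrix (np_, nt) for a given mu
                (placeholder for the full order model)
    r         : number of retained modes
    """
    # Stack the snapshots of all sampled parameter values column-wise
    S = np.hstack([solve_fom(mu) for mu in mus])

    # Truncated SVD: the leading left singular vectors form the global basis
    Phi, sigma, _ = np.linalg.svd(S, full_matrices=False)
    return Phi[:, :r], sigma

# A slow decay of `sigma` is an indication of a slowly decaying Kolmogorov
# n-width, i.e. many modes are needed for a uniformly accurate global basis.
```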
In the case that the global basis approach is not feasible, local bases can be developed and used with interpolation. Similar to the global basis approach, ps parameter values are sampled and the snapshots are obtained for them. However, instead of performing an SVD of the matrix containing all the snapshots, a separate SVD is performed on the snapshot matrix of every sampled parameter value $\mu_i$ to obtain a corresponding local basis $\Phi_{\mu_i}$, for $i = 1, ..., ps$. Now, the basis $\Phi_{\mu^*}$ can be obtained at a requested, but unsampled, parameter value $\mu^*$ using interpolation.
If conventional interpolation techniques are used, the interpolated basis $\Phi_{\mu^*}$ is likely to lose key properties, e.g. orthogonality, after interpolation. Hence, interpolation using property-preserving matrix manifolds is recommended to preserve the
key properties. Let $\mathcal{G}$ be such a manifold of orthogonal matrices. Also, let $\{\mu_i\}_{i=1}^{ps}$ be the set of sampled parameter values and $\{\Phi_{\mu_i}\}_{i=1}^{ps}$ be the set of corresponding bases. The basis $\Phi_{\mu^*}$ for an unsampled point $\mu^* \notin \{\mu_i\}_{i=1}^{ps}$ can be obtained as follows. First, a tangent space $\mathcal{T}(\Phi_{\tilde{\mu}})$ is defined such that it is tangent to $\mathcal{G}$ at a reference point $\Phi_{\tilde{\mu}} \in \{\Phi_{\mu_i}\}_{i=1}^{ps}$. Now, the $\Phi_{\mu_i}$, $i = 1, ..., ps$, except the reference point $\Phi_{\tilde{\mu}}$, are projected onto the tangent space using a logarithmic map defined as
$\Gamma_{\mu_i} = R_{\mu_i} \tan^{-1}(\Sigma_{\mu_i})\, W_{\mu_i}^T$,
where $R_{\mu_i}$, $\Sigma_{\mu_i}$ and $W_{\mu_i}^T$ are obtained from an SVD involving $\Phi_{\mu_i}$ and the reference basis $\Phi_{\tilde{\mu}}$. The projected matrices are then interpolated entry-wise in the tangent space, and the SVD of the interpolated matrix,
$\Gamma_{\mu^*} = R_{\mu^*} \Sigma_{\mu^*} W_{\mu^*}^T$,
is used in an exponential map that brings $\Gamma_{\mu^*}$ back to the manifold $\mathcal{G}$, yielding $\Phi_{\mu^*}$.
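A compact Python sketch of this tangent-space interpolation is given below. It follows the spirit of Amsallem and Farhat (2011), but the explicit logarithmic/exponential map formulas, the entry-wise linear interpolation, and the helper name interpolate_basis are illustrative assumptions rather than the exact procedure of the reference.

```python
import numpy as np

def interpolate_basis(mus, Phis, mu_star, ref=0):
    """Sketch of interpolation of orthogonal bases via a tangent space.

    mus     : increasing list of sampled parameter values
    Phis    : list of orthogonal bases Phi_{mu_i}, each of shape (np_, r)
    mu_star : unsampled parameter value
    ref     : index of the reference basis Phi_{mu_tilde}
    """
    Phi_ref = Phis[ref]

    # Logarithmic map: project each basis onto the tangent space at Phi_ref
    # (the reference basis itself maps to the zero matrix).
    Gammas = []
    for Phi in Phis:
        M = (Phi - Phi_ref @ (Phi_ref.T @ Phi)) @ np.linalg.inv(Phi_ref.T @ Phi)
        R, s, Wt = np.linalg.svd(M, full_matrices=False)
        Gammas.append(R @ np.diag(np.arctan(s)) @ Wt)
    Gammas = np.stack(Gammas)                     # (ps, np_, r)

    # Entry-wise interpolation in the tangent space (illustrative choice)
    n_rows, n_cols = Gammas.shape[1], Gammas.shape[2]
    Gamma_star = np.array([
        np.interp(mu_star, mus, Gammas[:, i, j])
        for i in range(n_rows) for j in range(n_cols)
    ]).reshape(n_rows, n_cols)

    # Exponential map: bring the interpolated matrix back onto the manifold
    R, s, Wt = np.linalg.svd(Gamma_star, full_matrices=False)
    Phi_star = Phi_ref @ Wt.T @ np.diag(np.cos(s)) + R @ np.diag(np.sin(s))
    return Phi_star
```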
Fig. 8.4 Interpolation of a set of matrices $\{\Phi_{\mu_i}\}_{i=1}^{ps}$ using the matrix manifold $\mathcal{G}$ and the tangent plane $\mathcal{T}(\Phi_{\tilde{\mu}})$
Remark 8.5 The above described manifold-based interpolation has been shown to be applicable to the direct interpolation of reduced order system matrices/vectors as well for linear systems (Amsallem and Farhat 2011). Consider a spatially discretized reduced order parametric linear system
If $\{M_r(\mu_i)\}_{i=1}^{ps}$, $\{L_r(\mu_i)\}_{i=1}^{ps}$ and $\{F_r(\mu_i)\}_{i=1}^{ps}$ are obtained offline, $M_r(\mu^*)$, $L_r(\mu^*)$ and $F_r(\mu^*)$ can be obtained during the online phase using the manifold-based interpolation. This ensures that the key properties, e.g. symmetric positive definiteness (SPD), are preserved after the interpolation. This direct interpolation is more efficient than first finding the interpolated basis $\Phi_{\mu^*}$ and then finding the reduced matrix $X_r(\mu^*)$ using $\Phi_{\mu^*}^T X \Phi_{\mu^*}$. However, this direct interpolation has been shown to work only for linear problems so far. The appropriate logarithmic and exponential maps to be used to preserve different matrix properties can be found in Amsallem and Farhat (2011).
ML has had a profound impact on scientific computing. In this section, the applications of ML in the context of projection-based ROMs are explored. However, it is pertinent to mention the natural suitability of ML techniques for developing extremely computationally inexpensive models, even beyond the context of projection-based ROMs. Much of the success of ML techniques can be attributed to their ability to
find and learn nonlinear mappings that govern a physical system. For any system,
a few key inputs can be selected and a ML technique can be applied to learn the
(non)linear mapping that exists between its outputs and the selected inputs. Since
the online (testing) phase of ML algorithms is very fast, any such application would
result in a computationally inexpensive model, i.e. a ROM.
In the context of projection-based ROMs, ML techniques have been applied to achieve higher accuracy, higher speed, or both. ML tech-
niques have been applied to obtain nonlinear reduced spaces for ROMs which offer
a more compact representation than linear POD spaces. ML techniques have also
been used to obtain NIROMs by directly learning the nonlinear evolution of reduced
coordinates, previously referred to as ROM coefficients in the context of POD. The
term reduced coordinates is more popular in the literature in the context of ML-based
ROMs, and hence it will be used from here on. ML can also improve the accuracy of
the intrusive Galerkin ROMs by providing closure models or corrections based on
finer solutions. Purely data-driven ML techniques can be very data hungry. In order
to reduce the reliance on data, and improve the generalization of the ML models,
physics has been embedded in ML techniques and then applied to reduced order
modeling. ML has even been used for system identification to discover simple equa-
tions for the evolution of reduced coordinates. Let us describe the state of the art in the
above-mentioned application domains.
POD (or DMD) provides modes that span a linear subspace. However, the evolution of many dynamical systems lies in nonlinear spaces. The linear approximation can lead to two issues. First, the POD modes might be unable to capture highly nonlinear phenomena. Second, in the case that the linear representation can capture the dynamics with reasonable accuracy, using POD may require more reduced coordinates than a nonlinear representation. So, instead of the linear mapping provided by the POD (8.13), a more robust mapping can be obtained using a nonlinear function $\vartheta_D : \mathbb{R}^r \to \mathbb{R}^{np}$ given by
U ≈ U D = ϑ D (U r ), (8.36)
where U D ∈ Rnp is the mapped value and U is the FOM solution. This nonlinear
mapping can be achieved using autoencoders (AEs) (Ballard 1987).
AEs are artificial neural networks (ANNs) used widely for dimension reduction.
The simplest AE, called undercomplete AE, consists of input and output layers of
the same size as the size of the FOM, np in this case. Furthermore, it has a bottleneck
layer in the middle of the desired size r , the same as the size of the reduced space.
The architecture of an undercomplete AE is shown in Fig. 8.5. In general, AEs are
quite deep, i.e. they consist of many layers, and use nonlinear activation functions.
Based on the task performed, an AE can be subdivided into two sub-parts: an
encoder and a decoder. The encoder compresses the high-dimensional data succes-
sively through its many layers to produce the low-dimensional representation. The decoder then expands this low-dimensional representation back to the original high-dimensional size through its layers.
Fig. 8.5 Architecture of an undercomplete autoencoder with encoder–decoder parts. The number
of layers and the size of the bottleneck is set to three and two, respectively, for illustration purposes
U r = ϑ E (U).
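A minimal PyTorch sketch of such an undercomplete autoencoder is shown below; the layer widths and activation functions are illustrative choices and not tied to any specific reference.

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Sketch of an undercomplete autoencoder for dimension reduction."""

    def __init__(self, n_full, n_reduced):
        super().__init__()
        # Encoder: compresses the FOM state of size n_full to n_reduced
        self.encoder = nn.Sequential(
            nn.Linear(n_full, 128), nn.ELU(),
            nn.Linear(128, 32), nn.ELU(),
            nn.Linear(32, n_reduced),
        )
        # Decoder: expands the reduced coordinates back to size n_full
        self.decoder = nn.Sequential(
            nn.Linear(n_reduced, 32), nn.ELU(),
            nn.Linear(32, 128), nn.ELU(),
            nn.Linear(128, n_full),
        )

    def forward(self, U):
        U_r = self.encoder(U)        # reduced coordinates, U_r = theta_E(U)
        return self.decoder(U_r)     # reconstruction, U_D = theta_D(U_r)

# Training minimizes the reconstruction error ||U - U_D||^2 over the snapshots,
# e.g. with torch.optim.Adam and nn.MSELoss on batches of FOM states.
```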
Fig. 8.6 Non-intrusive ROM obtained by replacing the Galerkin projection (a) with a ML approach (b) to model the dynamics of the reduced coordinates $U_r$
reduced coordinates from $U_r^n$ to $U_r^{n+1}$, where n is the time-step counter. Inputs additional to $U_r^n$ can also be provided to $\vartheta_{ML}$ to better learn the mapping $U_r^n \to U_r^{n+1}$.
ML-based NIROMs have been successfully used for a variety of applications.
Deep feedforward neural networks (FNNs), combined with POD for dimensionality
reduction, were applied to get accurate results for the differentially heated cavity
flow at various Rayleigh numbers in Pawar et al. (2019). The least squares support
vector machine (Suykens et al. 2002) was used in Chen et al. (2021) to relate reduced
coordinates with the applied excitations for predicting hypersonic aerodynamic per-
formance. FNN was used to develop a NIROM for the industrial thermo-mechanical
phenomena arising in blast furnace hearth walls in Shah et al. (2022). The multi-
output support vector machine (Xu et al. 2013) was used to model the dynamics of
POD coefficients to predict the stress and displacement for geological and geotech-
nical processes in Zhao (2021).
Long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) and its
variant bidirectional long short-term memory (BiLSTM) (Graves and Schmidhuber
2005) neural networks (NNs) have a memory effect and can capture sequences like
the time evolution of a process with higher accuracy than a FNN. LSTM/BiLSTMs accept a history of q time steps, $\{U_r^n, U_r^{n-1}, ..., U_r^{n-q+1}\}$, to predict the future value $U_r^{n+1}$. Thus
$U_r^{n+1} = \vartheta_{LSTM}(U_r^n, U_r^{n-1}, ..., U_r^{n-q+1})$.
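A minimal PyTorch sketch of an LSTM-based NIROM of this form is given below; the hidden size, the learning rate, and the way the training pairs are built from the reduced-coordinate snapshots are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LSTMNirom(nn.Module):
    """Sketch of an LSTM-based NIROM for the reduced coordinates."""

    def __init__(self, r, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=r, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, r)

    def forward(self, history):
        # history: (batch, q, r), the last q reduced-coordinate vectors
        out, _ = self.lstm(history)
        return self.head(out[:, -1, :])    # prediction of U_r^{n+1}

model = LSTMNirom(r=10)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(history, target):
    """One training step on a batch of (history, target) pairs from snapshots."""
    optimizer.zero_grad()
    loss = loss_fn(model(history), target)
    loss.backward()
    optimizer.step()
    return loss.item()
```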
LSTM and BiLSTM NNs have been widely used to predict the temporal evolution
of systems based on past values. LSTM and BiLSTM NNs were used to model
isotropic and magnetohydrodynamic turbulence in Mohan and Gaitonde (2018),
where the Hurst Exponent (Hurst 1951) was employed to study and quantify the
memory effects captured by the LSTM/BiLSTM NNs. LSTM NNs were used in
Vlachas et al. (2018) to predict high-dimensional chaotic systems and were shown to
outperform Gaussian processes. The improved performance of LSTM NNs was also
shown for reduced order modeling of near-wall turbulence as compared to FNNs in
Srinivasan et al. (2019). Finally, LSTM NNs were also used to model a synthetic jet
and three-dimensional flow in the wake of a cylinder in Abadía-Heredia et al. (2022).
Training ML algorithms in general, and LSTM/BiLSTM NNs in particular, can be
very computationally demanding. Transfer learning can be used to speed up the train-
ing phase. Instead of initializing the weights of a network randomly, transfer learning
relies on using weights of a network previously trained for a closely related prob-
lem for initialization. Providing better initial weights allows the training to converge
faster to the optimal weights. Transfer learning was used to speed up the training of
LSTM and BiLSTM NNs modeling the three-dimensional turbulent wake of a finite
wall-mounted square cylinder in Yousif and Lim (2022). The flow was analyzed on
2D planes at different heights, each modeled using a separate LSTM/BiLSTM NN.
After the first LSTM/BiLSTM NN was trained, the others were initialized using its
weights, as the flow in different planes is correlated.
Gaussian process regression (Rasmussen and Williams 2005) can be used to build NIROMs while also providing uncertainty quantification. Gaussian process regression has been used to develop NIROMs for the shallow water equations (Maulik et al.
2021), chaotic systems like climate forecasting models (Wan and Sapsis 2017), and
nonlinear structural problems (Guo and Hesthaven 2018). In the domain of unsu-
pervised learning, cluster reduced order modeling has been applied to mixing layers
(Kaiser et al. 2014). Cluster-based reduced order modeling groups the ensemble of data (snapshots) into clusters, and the transitions between the states are dynamically modeled using a Markov process.
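As an illustration, the following scikit-learn sketch regresses the reduced coordinates on time with a Gaussian process, returning both a predictive mean and a standard deviation for uncertainty quantification; the kernel and the synthetic training data are illustrative only.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

# Training inputs: the snapshot times (parameters could be appended as extra
# columns); training outputs: the corresponding reduced coordinates U_r.
t_train = np.linspace(0.0, 1.0, 50).reshape(-1, 1)          # illustrative data
U_r_train = np.hstack([np.sin(2 * np.pi * t_train),
                       np.cos(2 * np.pi * t_train),
                       t_train ** 2])

kernel = RBF(length_scale=0.1) + WhiteKernel(noise_level=1e-4)
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gpr.fit(t_train, U_r_train)

# Predictive mean and standard deviation at new times (uncertainty estimate)
t_new = np.linspace(0.0, 1.2, 100).reshape(-1, 1)
U_r_mean, U_r_std = gpr.predict(t_new, return_std=True)
```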
$U = \hat{\Phi}\, U_r + \tilde{U}$, (8.37)
Let us assume that (8.3) represents the semi-discretized (in space) form of the governing equations describing the behavior of U. A Galerkin projection of (8.3) onto the resolved and unresolved spaces, using $\hat{\Phi}$ and $\tilde{\Phi}$, respectively, along with using (8.37), results in the following system:
$\begin{pmatrix} \partial_t U_r \\ \partial_t \tilde{U} \end{pmatrix} = \begin{pmatrix} G(U_r, \tilde{U}) \\ \tilde{G}(U_r, \tilde{U}) \end{pmatrix}$,
of which the resolved part reads
$\partial_t U_r = G(U_r, \tilde{U})$.
Using the truncated basis to build the ROM implicitly implies, abusing the notation,
∂t U r = G(U r , 0) = G(U r ),
which is not true in the nonlinear cases, as the behavior of the resolved scales is
governed by their interaction with the unresolved ones as well. So, it is desired to
model this interaction as a term C(U r ), which is a function of the resolved scales
U r , so that
∂t U r = G(U r ) + C(U r ). (8.38)
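As an illustration of how such a closure term can be learned from data, the following is a minimal PyTorch sketch that fits a small FNN for C(U_r). It assumes that snapshots of the resolved coordinates and of their projected FOM time derivatives are available; the class and function names, architecture, and hyperparameters are illustrative assumptions and not a method prescribed by the references below.

```python
import torch
import torch.nn as nn

class ClosureNet(nn.Module):
    """Sketch of an FNN closure model C(U_r) for Eq. (8.38)."""

    def __init__(self, r, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(r, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, r),
        )

    def forward(self, U_r):
        return self.net(U_r)

def fit_closure(U_r_snaps, dUr_dt_snaps, G, epochs=2000, lr=1e-3):
    """U_r_snaps, dUr_dt_snaps: tensors (ns, r) of projected FOM data;
    G: callable giving the resolved ROM dynamics G(U_r)."""
    model = ClosureNet(U_r_snaps.shape[1])
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    target = dUr_dt_snaps - G(U_r_snaps)   # closure = what G alone misses
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(U_r_snaps), target)
        loss.backward()
        opt.step()
    return model
```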
Fig. 8.7 An illustration of the ROM closure problem. The projection error is due to the use of the
space Br ⊂ B f for approximating the unknowns. The closure error is due to neglecting the effect
of nonlinear interaction on the evolution of the resolved coordinates
sure models for ROMs. In San and Maulik (2018b), a single layer FNN, trained
using Bayesian regularization and extreme learning machine approaches, was used
to model ROM closure terms for flow problems governed by viscous Burgers equa-
tions. The ROM closure terms were modeled as a function of the Reynolds number
and resolved reduced coordinates. An extreme learning approach was also used in San
and Maulik (2018a) to determine eddy viscosity for a LES-inspired closure model.
An uplifting ROM with closure was proposed in Ahmed et al. (2021). LSTMs were used to provide the closure term, as well as to determine the reduced coordinates of the unresolved space. Since the basis was already known via POD, (8.37) was used to approximate U. A similar approach was used in Ahmed et al. (2021) to develop a closure model for pressure modes to accurately predict the hydrodynamic forces. A residual neural network, an FNN that can be hundreds of layers deep, was used to develop a closure model for the 1D Burgers equation in Xie et al. (2020).
ROM closure based on Mori–Zwanzig formalism (Mori 1965; Zwanzig 1960)
is also popular. Such approaches, in general, use closure models consisting of two
terms
where the memory integral is a non-Markovian2 term and takes into account the
contribution of the past values of the resolved coordinates U r to model the unresolved
ones U. This memory integral is very computationally expensive to compute. To
evaluate it efficiently, neural closure models using neural delay differential equations
were proposed in Zhu et al. (2021). The number of past values required to accurately
determine the closure term was also determined. A conditioned LSTM NN was
used to model the memory term in Wang et al. (2020). To ensure computational
efficiency, the authors further used explicit time integration of the memory term,
while using implicit integration for the remaining terms of the discrete system. The
Mori–Zwanzig formalism and LSTM NNs were shown to have a natural analogy
between them in Ma and Wang (2019). This analogy was also used to develop a
systematic approach for constructing memory-based ROMs. Recently, reinforcement
learning techniques have also been applied to build unsupervised reward-oriented
closure models for ROMs (Benosman et al. 2020; San et al. 2022).
2 A non-Markovian term implies that the future state depends not only on the current values, but also on past values, i.e. such processes have memory effects.
more accurate than the ROM solution. This is applicable to projection-based ROMs, as well as to cases in which the ROM represents a model with a coarser spatial discretization or a larger time step.
Let U c be the solution of the coarse ROM system given by
AU c = R. (8.40)
Let the fine solution U f be also available for the given problem. Let us assume that
the projection of the fine solution onto the coarse solution space, denoted by U c f , is
the best possible coarse solution. In this case, a correction vector C can be added to
modify system (8.40) to obtain a new system
AU c f + C = R, (8.41)
with U c f as its solution. When the fine solution is not known, C needs to be modeled
as a function of the coarse solution, i.e. C = C(U c ).
A correction term using a least squares (LS) approach was proposed for POD-
ROM in Baiges et al. (2015). Special considerations regarding gathering training data
and using the appropriate initial conditions were also addressed with least squares
providing a linear model of the correction term. A nonlinear correction term was determined using an ANN in Baiges et al. (2020). The correction term was determined for a coarse mesh-based ROM using the solution of an AMR-based FOM, and applied to fluid, structural, and FSI problems. A comparison of linear least squares and nonlinear ANN-based corrections was carried out for the wave equation in Fabra
et al. (2022). A nonlinear ANN-based correction for POD-ROM was used in Dar et al.
(2023). Different combinations of features to be provided as the inputs to the ANNs
were evaluated to develop an accurate model while minimizing the complexity. The
implicit and explicit treatment of the ANN-based correction was also evaluated. It
was shown that the ROM was able to produce good results for parametric interpola-
tion, as well as temporal extrapolation. All of the above-mentioned works relied on
significantly less training data as compared to NIROMs. A training-free correction
was further proposed in Baiges et al. (2021) to account for the loss of information
due to the adaptive coarsening of the coarse mesh-based ROM. The correction was
based solely on the data generated within the same simulation and did not require
any external data.
Remark 8.6 The correction based on fine solutions discussed in Sect. 8.6.4 and the
closure modeling discussed in Sect. 8.6.3 are similar to some extent. However, there
is a difference in their motivation, as well as the accepted definition in the literature.
Closure modeling for ROMs is understood to account for the error generated due to
the Galerkin projection, i.e. spatial discretization as given by (8.38). On the other
hand, the correction based on fine solutions works by introducing a correction at the
fully discrete level given by (8.41). Hence, the two approaches have been discussed
separately.
ML tools, in general, require a large amount of data to provide accurate results. More-
over, developing a model for generalized cases further increases the data requirement.
Generating and storing such an amount of data is not always possible. An efficient
way of reducing the reliance on data, without affecting the accuracy or generaliz-
ability, is to embed physics in the ML tools. Embedding physics in ML is being
increasingly used in the broader field of scientific computing; however, limited work
has been done so far in the domain of reduced order modeling.
One way of employing physics is to use ANNs to solve PDEs directly by mini-
mizing the residual of the governing equations without using any FOM data. Such
ANNs are called physics informed neural networks (Raissi et al. 2019) and can be
used to directly solve the reduced order system without relying on training–testing
phases. Physics reinforced neural networks, a variation of physics informed neural
networks, were proposed in Chen et al. (2019) in the context of ROMs. In general,
incorporating physics in ML models involves a loss function consisting of two terms:
a data-driven loss function J D and a physics-based loss function J P . Let A be a
general operator that describes the desired physics such that
A(U r ) = 0, (8.42)
where A may account for temporal derivatives, nonlinearity, etc. The physics-based
loss function J P is given by
$J_P = \| A(U_{ML}) \|^2$, (8.43)
$J_D = \| U_{ML} - R(U) \|^2$. (8.44)
In general, the training phase involves minimizing the mean of (8.43) and (8.44) for
multiple time steps and/or parameter values. The total loss function is given by
J = J D + εJ P , (8.45)
et al. (2019). The physics was incorporated by requiring some terms of the closure
model to be energy dissipating, while others to be energy conserving.
Introducing physics using (8.45), so that the training phase becomes a constrained
optimization problem, can be considered as applying weak constraints (note that
J P = 0 if U M L is replaced by U r in Eq. (8.43)). The violation of the physics leads
to a large loss function, and hence in an effort to minimize the loss function, the
ANN tries to adhere to the physics as well. The physics embedded in such a way is
prone to be violated in the testing phase when the ANN is exposed to cases beyond
the training phase. This is because, architecturally, the ANN is still unaware of the
physics. Furthermore, ε acts as an additional hyperparameter that needs to be tuned.
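A minimal sketch of how the combined loss (8.45) can be assembled, e.g. in PyTorch, is given below; the callable physics_residual standing for the operator A of Eq. (8.42), the network output U_ml, and the weight eps are placeholders.

```python
import torch

def total_loss(U_ml, U_r_data, physics_residual, eps=0.1):
    """Weak physics constraint: J = J_D + eps * J_P, cf. Eqs. (8.43)-(8.45).

    U_ml             : network prediction of the reduced coordinates
    U_r_data         : reference (projected FOM) reduced coordinates
    physics_residual : callable implementing the operator A(.) of Eq. (8.42)
    eps              : weighting hyperparameter of the physics term
    """
    J_D = torch.mean((U_ml - U_r_data) ** 2)        # data loss, Eq. (8.44)
    J_P = torch.mean(physics_residual(U_ml) ** 2)   # physics loss, Eq. (8.43)
    return J_D + eps * J_P
```

In a training loop this loss would simply replace the purely data-driven one, so that gradients of the physics residual also flow back into the network weights.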
An alternative strategy is to amend the ANN structurally so that it enforces the
physical laws strongly. Such an ANN is hoped to be more robust in the testing phase
as it is not blind to the physics. Physics was embedded in a coarse-grained ROM for 3D turbulence via hard constraints in Mohan et al. (2020). The divergence
free condition was enforced using curl-computing layers which formed a part of the
backpropagation process as well. Backpropagation through the curl operator ensured
that the network is not blind and has intimate knowledge of the constraints through
which it must make predictions. Another way of embedding physics in the layers
was proposed in Pawar et al. (2021). A physics-guided machine learning framework
was introduced which injected the desired physics in one of the layers of the LSTM
NN. By incorporating physics, an ANN applicable to more generalized cases was
achieved as compared to the purely data-driven approach.
ML can also be used to find the equations representing the dynamics of a system
using data. An equation comprising different terms with adjustable coefficients is
assumed to model the behavior of a system. Data is then used to find the value of
these adjustable coefficients. This technique is known as system identification and
is of particular interest in two scenarios. First, if the equations are not known, as
in the case of modeling climate, epidemiology, neuroscience, etc. Second, when
the equations describing the behavior of the reduced coordinates are required. The
resulting equation-based representation of a system provides generalizability and
interpretability, not achievable by simply constructing a regression model based on
data. System identification is a broad field and many techniques have been applied
in this context, see Ljung (1998) and Juang (1994) for details.
Reduced system identification aims to obtain sparse equations (consisting of a few
simple terms) to describe the evolution of reduced coordinates of a projection-based
ROM. The sparse identification of nonlinear dynamics (SINDy) (Brunton et al. 2016)
algorithm can be used to obtain a simple dynamic model for the reduced coordinates.
A library of simple nonlinear terms, like polynomials or trigonometric functions,
is provided. SINDy then tries to find a mapping for the provided input–output data
using the minimum number of terms of the library, thus providing a minimalistic
interpretable model offering a balance of efficiency and accuracy.
SINDy has been applied to recover models for a variety of flow behaviors including
shear layers, laminar and turbulent wakes, and convection (Callaham et al. 2022;
Deng et al. 2020; Loiseau 2020; Loiseau and Brunton 2018). The vortex shedding
behind a cylinder, for example, can be captured using three modes only (Noack et al.
2003), the first two POD modes and a shift mode, as
$\partial_t U_{r1} = \mu U_{r1} - U_{r2} - U_{r1} U_{r3}$,
$\partial_t U_{r2} = U_{r1} + \mu U_{r2} - U_{r2} U_{r3}$, (8.46)
$\partial_t U_{r3} = U_{r1}^2 + U_{r2}^2 - U_{r3}$,
where $U_{r1}$, $U_{r2}$ and $U_{r3}$ are the reduced coordinates related to the first two POD modes and the shift mode. SINDy was able to recover this minimalistic model using data,
identifying the dominant terms and the associated coefficients correctly (Brun-
ton et al. 2016). SINDy was also combined with an autoencoder to find the low-
dimensional nonlinear representation, as well as to model the dynamics of the corre-
sponding reduced coordinates (Champion et al. 2019). To improve the performance
of SINDy, physics was also embedded in it in the form of symmetry in Guan et al.
(2021) and of conservation laws in Loiseau and Brunton (2018).
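The following is a schematic NumPy re-implementation of the SINDy idea with a polynomial library and sequentially thresholded least squares; dedicated libraries such as pysindy provide mature implementations, and the library terms, threshold, and iteration count below are illustrative choices only.

```python
import numpy as np

def poly_library(U):
    """Library of candidate terms: constant, linear and quadratic monomials."""
    r = U.shape[1]
    cols = [np.ones(len(U))]
    cols += [U[:, i] for i in range(r)]
    cols += [U[:, i] * U[:, j] for i in range(r) for j in range(i, r)]
    return np.column_stack(cols)

def sindy(U, dU_dt, threshold=0.05, iters=10):
    """Sequentially thresholded least squares for sparse dynamics."""
    Theta = poly_library(U)                              # (ns, n_terms)
    Xi = np.linalg.lstsq(Theta, dU_dt, rcond=None)[0]
    for _ in range(iters):
        Xi[np.abs(Xi) < threshold] = 0.0                 # enforce sparsity
        for k in range(dU_dt.shape[1]):                  # refit active terms
            big = np.abs(Xi[:, k]) >= threshold
            if big.any():
                Xi[big, k] = np.linalg.lstsq(Theta[:, big], dU_dt[:, k],
                                             rcond=None)[0]
    return Xi   # sparse coefficient matrix: dU_dt ~= Theta(U) @ Xi
```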
been developed based entirely on ML. Thus, a range of reduced order modeling techniques is at our disposal, from purely conventional techniques at one end of the spectrum to purely ML-based techniques at the other, with hybrid techniques lying in between.
References
Abadía-Heredia R et al (2022) A predictive hybrid reduced order model based on proper orthogonal
decomposition combined with deep learning architectures. Expert Syst Appl 187:115910
Ahmed HF et al (2021) Machine learning-based reduced-order modeling of hydrodynamic forces
using pressure mode decomposition. Proc Inst Mech Eng, Part G: J Aerosp Eng 235(16):2517–
2528
Ahmed SE et al (2020) A long short-term memory embedding for hybrid uplifted reduced order
models. Phys D: Nonlinear Phenom 409:132471
Ahmed SE et al (2021) On closures for reduced order models-a spectrum of first-principle to
machine-learned avenues. Phys Fluids 33(9):091301
Akhtar I, Borggaard J, Hay A (2010) Shape sensitivity analysis in flow models using a finite-
difference approach. Math Probl Eng
Alla A, Kutz JN (2017) Nonlinear model order reduction via dynamic mode decomposition. SIAM
J Sci Comput 39(5):B778–B796
Amsallem D, Farhat C (2011) An online method for interpolating linear parametric reduced-order
models. SIAM J Sci Comput 33(5):2169–2198
Amsallem D, Farhat C (2012) Stabilization of projection-based reduced-order models. Int J Numer
Methods Eng 91(4):358–377
An SS, Kim T, James DL (2008) Optimizing cubature for efficient integration of subspace defor-
mations. ACM Trans Graph 27(5):65:1–165:10
Antil H, Heinkenschloss M, Sorensen DC (2014) Application of the discrete empirical interpolation
method to reduced order modeling of nonlinear and parametric systems. In: Quarteroni A, Rozza
G (eds) Reduced order methods for modeling and computational reduction. MS&A—Modeling,
Simulation and Applications. Springer International Publishing, Cham, pp 101–136
Arian E, Fahl M, Sachs EW (2000) Trust-Region Proper Orthogonal Decomposition for Flow Con-
trol. Technical report. Institute for Computer Applications in Science and Engineering, Hampton
VA
Astrid P et al (2008) Missing point estimation in models described by proper orthogonal decompo-
sition. IEEE Trans Autom Control 53(10):2237–2251
Azaïez M, Chacon Rebollo T, Rubino S (2021) A cure for instabilities due to advection-dominance
in POD solution to advection-diffusion-reaction equations. J Comput Phys 425:109916
Baiges J et al (2020) A finite element reduced-order model based on adaptive mesh refinement and
artificial neural networks. Int J Numer Methods Eng 121(4):588–601
Baiges J et al (2021) An adaptive finite element strategy for the numerical simulation of additive
manufacturing processes. Addit Manuf 37:101650
Baiges J, Codina R, Idelsohn S (2015) Reduced-order subscales for POD models. Comput Methods
Appl Mech Eng 291:173–196
Baiges J, Codina R (2013a) A variational multiscale method with subscales on the element bound-
aries for the helmholtz equation. Int J Numer Methods Eng 93(6):664–684
Baiges J, Codina R, Idelsohn S (2013b) A domain decomposition strategy for reduced order models.
Application to the incompressible Navier–Stokes equations. Comput Methods Appl Mech Eng
267:23–42
Baiges J, Codina R, Idelsohn S (2013c) Explicit reduced-order models for the stabilized finite
element approximation of the incompressible Navier–Stokes equations. Int J Numer Methods
Fluids 72(12):1219–1243
Baldi P, Hornik K (1989) Neural networks and principal component analysis: learning from exam-
ples without local minima. Neural Netw 2(1):53–58
Ballard DH (1987) Modular learning in neural networks. In: Proceedings of the sixth national
conference on artificial intelligence, vol 1. AAAI’87. AAAI Press, Seattle, Washington, pp 279–
284
Ballarin F et al (2015) Supremizer stabilization of POD-galerkin approximation of parametrized
steady incompressible Navier–Stokes equations. Int J Numer Methods Eng 102(5):1136–1161
Barrault M et al (2004) An empirical interpolation method: application to efficient reduced-basis
discretization of partial differential equations. Comptes Rendus Mathematique 339(9):667–672
Benosman M, Chakrabarty A, Borggaard J (2020) Reinforcement learning-based model reduction
for partial differential equations. IFAC-PapersOnLine. 21st IFAC World Congress 53(2):7704–
7709
Bergmann M, Cordier L, Brancher J-P (2007) Drag minimization of the cylinder wake by trust-
region proper orthogonal decomposition. In: Active flow control. Springer, Berlin, pp 309–324
Bertsimas D, Dunn J (2017) Optimal classification trees. Mach Learn 106(7):1039–1082
Brooks AN, Hughes TJR (1982) Streamline upwind/petrov-galerkin formulations for convection
dominated flows with particular emphasis on the incompressible Navier–Stokes equations. Com-
put Methods Appl Mech Eng 32(1):199–259
Brunton SL et al (2017) Chaos as an intermittently forced linear system. Nat Commun 8(1):19
Brunton SL, Proctor JL, Kutz JN (2016) Discovering governing equations from data by sparse
identification of nonlinear dynamical systems. Proc Natl Acad Sci 113(15):3932–3937
Bui-Thanh T, Willcox K, Ghattas O (2008) Model reduction for large-scale systems with high-
dimensional parametric input space. SIAM J Sci Comput 30(6):3270–3288
Buoso S et al (2022) Stabilized reduced-order models for unsteady incompressible flows in three-
dimensional parametrized domains. Comput Fluids 246:105604
Burkardt J, Gunzburger M, Lee H-C (2006) POD and CVT-based reduced-order modeling of Navier–
Stokes flows. Comput Methods Appl Mech Eng 196(1–3):337–355
Callaham JL et al (2022) An empirical mean-field model of symmetry-breaking in a turbulent wake.
Sci Adv 8(19):eabm4786
Carlberg K, Barone M, Antil H (2017) Galerkin v. Least-Squares Petrov-Galerkin projection in
nonlinear model reduction. J Comput Phys 330:693–734
Carlberg K, Bou-Mosleh C, Farhat C (2011) Efficient non-linear model reduction via a least-squares
petrov-galerkin projection and compressive tensor approximations. Int J Numer Methods Eng
86(2):155–181
Champion K et al (2019) Data-driven discovery of coordinates and governing equations. Proc Natl
Acad Sci 116(45):22445–22451
Chatterjee A (2000) An introduction to the proper orthogonal decomposition. Curr Sci 78(7):808–
817
Chaturantabut S, Sorensen DC (2010) Nonlinear model reduction via discrete empirical interpola-
tion. SIAM J Sci Comput 32(5):2737–2764
Chen KK, Tu JH, Rowley CW (2012) Variants of dynamic mode decomposition: boundary condition,
koopman, and fourier analyses. J Nonlinear Sci 22(6):887–915
Chen W et al (2021) Physics-informed machine learning for reduced-order modeling of nonlinear
problems. J Comput Phys 446:110666
Chen Z, Zhao Y, Huang R (2019) Parametric reduced-order modeling of unsteady aerodynamics
for hypersonic vehicles. Aerosp Sci Technol 87:1–14
Chinesta F, Ammar A, Cueto E (2010) Recent advances and new challenges in the use of the proper
generalized decomposition for solving multidimensional models. Arch Comput Methods Eng
17(4):327–350
Chinesta F, Ladeveze P, Cueto E (2011) A short review on model order reduction based on proper
generalized decomposition. Arch Comput Methods Eng 18(4):395
Codina R (2000a) On stabilized finite element methods for linear systems of convection-diffusion-
reaction equations. Comput Methods Appl Mech Eng 188(1):61–82
Codina R (2000b) Stabilization of incompressibility and convection through orthogonal sub-scales
in finite element methods. Comput Methods Appl Mech Eng 190(13–14):1579–1599
Codina R (2002) Stabilized finite element approximation of transient incompressible flows using
orthogonal subscales. Comput Methods Appl Mech Eng 191(39–40):4295–4321
Codina R et al (2007) Time dependent subscales in the stabilized finite element approximation of
incompressible flow problems. Comput Methods Appl Mech Eng 196(21–24):2413–2430
Codina R et al (2018) Variational multiscale methods in computational fluid dynamics. Encycl
Comput Mech 1–28
Codina R, Baiges J (2011) Finite element approximation of transmission conditions in fluids and
solids introducing boundary subgrid scales. Int J Numer Methods Eng 87(1–5):386–411
Codina R, Principe J, Baiges J (2009) Subscales on the element boundaries in the variational two-
scale finite element method. Comput Methods Appl Mech Eng 198(5–8):838–852
Codina R, Reyes R, Baiges J (2021) A posteriori error estimates in a finite element vms-based
reduced order model for the incompressible Navier–Stokes equations. Mech Res Commun. Spe-
cial Issue Honoring G.I. Taylor Medalist Prof. Arif Masud 112:103599
Dal Santo N et al (2019) An algebraic least squares reduced basis method for the solution of
nonaffinely parametrized stokes equations. Comput Methods Appl Mech Eng 344:186–208
Daniel T et al (2020) Model order reduction assisted by deep neural networks (ROM-net). Adv
Model Simul Eng Sci 7(1):16
Dar Z, Baiges J, Codina R (2023) Artificial neural network based correction models for reduced
order models in computational fluid mechanics. Comput Methods Appl Mech Eng 415:116232
Deng N et al (2020) Low-order model for successive bifurcations of the fluidic pinball. J Fluid
Mech 884:A37
Dupuis R, Jouhaud J-C, Sagaut P (2018) Surrogate modeling of aerodynamic simulations for mul-
tiple operating conditions using machine learning. AIAA J 56(9):3622–3635
Eckart C, Young G (1936) The approximation of one matrix by another of lower rank. Psychometrika
1(3):211–218
Eivazi H et al (2022) Towards extraction of orthogonal and parsimonious non-linear modes from
turbulent flows. Expert Syst Appl 202:117038
Everson R, Sirovich L (1995) Karhunen–Loève procedure for gappy data. JOSA A 12(8):1657–1664
Fabra A, Baiges J, Codina R (2022) Finite element approximation of wave problems with correcting
terms based on training artificial neural networks with fine solutions. Comput Methods Appl Mech
Eng 399:115280
Farhat C, Chapman T, Avery P (2015) Structure-preserving, stability, and accuracy properties of
the energy conserving sampling and weighting method for the hyper reduction of nonlinear finite
element dynamic models. Int J Numer Methods Eng 102(5):1077–1110
Fresca S, Dede’ L, Manzoni A (2021) A comprehensive deep learning-based approach to reduced
order modeling of nonlinear time-dependent parametrized PDEs. J Sci Comput 87(2):61
Fresca S, Manzoni A (2022) POD-DL-ROM: enhancing deep learning-based reduced order models
for nonlinear parametrized PDEs by proper orthogonal decomposition. Comput Methods Appl
Mech Eng 388:114181
Galletti B et al (2004) Low-order modelling of laminar flow regimes past a confined square cylinder.
J Fluid Mech 503:161–170
García-Archilla B, Novo J, Rubino S (2022) Error analysis of proper orthogonal decomposition
data assimilation schemes with grad-div stabilization for the Navier–Stokes equations. J Comput
Appl Math 411:114246
Giere S et al (2015) SUPG reduced order models for convection-dominated convection-diffusion-
reaction equations. Comput Methods Appl Mech Eng 289:454–474
Giere S, John V (2017) Towards physically admissible reduced-order solutions for convection-
diffusion problems. Appl Math Lett 73:78–83
Glaz B, Liu L, Friedmann PP (2010) Reduced-order nonlinear unsteady aerodynamic modeling
using a surrogate-based recurrence framework. AIAA J 48(10):2418–2429
Gonzalez FJ, Balajewicz M (2018) Deep convolutional recurrent autoencoders for learning low-
dimensional feature dynamics of fluid systems. arXiv:1808.01346 [physics]
Graham WR, Peraire J, Tang KY (1999) Optimal control of vortex shedding using low-order models.
Part I-open-loop model development. Int J Numer Methods Eng 44(7):945–972
Graves A, Schmidhuber J (2005) Framewise phoneme classification with bidirectional LSTM and
other neural network architectures. Neural Netw. IJCNN 2005 18(5):602–610
Guan Y, Brunton SL, Novosselov I (2021) Sparse nonlinear models of chaotic electroconvection.
R Soc Open Sci 8(8):202367
Guo M, Hesthaven JS (2019) Data-driven reduced order modeling for time-dependent problems.
Comput Methods Appl Mech Eng 345:75–99
Guo M, Hesthaven JS (2018) Reduced order modeling for nonlinear structural analysis using gaus-
sian process regression. Comput Methods Appl Mech Eng 341:807–826
Hesthaven JS, Rozza G, Stamm B (2016) Certified reduced basis methods for parametrized partial
differential equations. SpringerBriefs in Mathematics. Springer International Publishing, Cham
Hesthaven JS, Ubbiali S (2018) Non-intrusive reduced order modeling of nonlinear problems using
neural networks. J Comput Phys 363:55–78
Higgins I et al (2022) Beta-VAE: learning basic visual concepts with a constrained variational
framework. In: International conference on learning representations
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Hughes TJR et al (1998) The variational multiscale method-a paradigm for computational mechan-
ics. Comput Methods Appl Mech Engineering Adv Stab Methods Comput Mech 166(1):3–24
Hunter A et al (2019) Reduced-order modeling through machine learning and graph-theoretic
approaches for brittle fracture applications. Comput Mater Sci 157:87–98
Hurst HE (1951) Long-term storage capacity of reservoirs. Trans Am Soc Civ Eng 116(1):770–799
Lumley JL (1967) The structure of inhomogeneous turbulent flows. In: Atmospheric turbulence and radio wave propagation, pp 166–178
John V, Moreau B, Novo J (2022) Error analysis of a SUPG-stabilized POD-ROM method for
convection diffusion-reaction equations. Comput Math Appl 122:48–60
Juang J-N (1994) Applied system identification. Prentice Hall
Kaheman K, Brunton SL, Kutz JN (2022) Automatic differentiation to simultaneously identify
nonlinear dynamics and extract noise probability distributions from data. Mach Learn: Sci Technol
3(1):015031
Kaiser E et al (2014) Cluster-based reduced-order modelling of a mixing layer. J Fluid Mech
754:365–414
Kalashnikova I, Barone M (2011) Stable and efficient galerkin reduced order models for non-linear
fluid flow. In: 6th AIAA theoretical fluid mechanics conference, p 3110
Kapteyn MG, Knezevic DJ, Willcox K (2020) Toward predictive digital twins via component-
based reduced-order models and interpretable machine learning. In: AIAA scitech 2020 forum.
American Institute of Aeronautics and Astronautics
Kast M, Guo M, Hesthaven JS (2020) A non-intrusive multifidelity method for the reduced order
modeling of nonlinear problems. Comput Methods Appl Mech Eng 364:112947
Kingma DP, Welling M (2013) Auto-encoding variational bayes. In: International conference on
learning representations
Lee K, Carlberg KT (2020a) Deep conservation: a latent-dynamics model for exact satisfaction of
physical conservation laws. arXiv:1909.09754 [physics]
Lee K, Carlberg KT (2020b) Model reduction of dynamical systems on nonlinear manifolds using
deep convolutional autoencoders. J Comput Phys 404:108973
LeGresley P, Alonso J (2000) Airfoil design optimization using reduced order models based on
proper orthogonal decomposition. In: Fluids 2000 conference and exhibit. American Institute of
Aeronautics and Astronautics
Li J, Du X, Martins JRRA (2022) Machine learning in aerodynamic shape optimization. Prog
Aerosp Sci 134:100849
Ljung L (1998) System identification: theory for the user, 2nd edn. Pearson, Upper Saddle River,
NJ
Loiseau J-C (2020) Data-driven modeling of the chaotic thermal convection in an annular ther-
mosyphon. Theor Comput Fluid Dyn 34:339–365
Loiseau J-C, Brunton SL (2018) Constrained sparse Galerkin regression. J Fluid Mech 838:42–67
Lucia DJ, Beran PS (2003) Projection methods for reduced order models of compressible flows. J
Comput Phys 188(1):252–280
Lusch B, Kutz JN, Brunton SL (2018). Deep learning for universal linear embeddings of nonlinear
dynamics. Nat Commun 9(1):4950
Maulik R et al (2021) Latent-space time evolution of non-intrusive reduced-order models using
gaussian process emulation. Phys D: Nonlinear Phenom 416:132797
Ma C, Wang J (2019) Model reduction with memory and the machine learning of dynamical systems.
Commun Comput Phys 25(4)
Milano M, Koumoutsakos P (2002) Neural network modeling for near wall turbulent flow. J Comput
Phys 182(1):1–26
Mohan AT, Gaitonde DV (2018) A deep learning based approach to reduced order modeling for
turbulent flow control using LSTM neural networks. arXiv:1804.09269 [physics]
Mohan AT et al (2020) Embedding hard physical constraints in neural network coarse-graining of
3D turbulence. arXiv:2002.00021 [physics]
Mohebujjaman M, Rebholz L, Iliescu T (2019) Physically constrained data-driven correction for
reduced order modeling of fluid flows. Int J Numer Methods Fluids 89(3):103–122
Mori H (1965) Transport, collective motion, and Brownian motion. Prog Theor Phys 33(3):423–
455
Mou C et al (2021) Data-driven variational multiscale reduced order models. Comput Methods
Appl Mech Eng 373:113470
Murata T, Fukami K, Fukagata K (2020) Nonlinear mode decomposition with convolutional neural
networks for fluid dynamics. J Fluid Mech 882:A13
Noack BR et al (2003) A hierarchy of low-dimensional models for the transient and post-transient
cylinder wake. J Fluid Mech 497:335–363
Noack BR et al (2016) Recursive dynamic mode decomposition of transient and post-transient wake
flows. J Fluid Mech 809:843–872
Noack BR et al (eds) (2011) Reduced-order modelling for flow control, vol 528. Springer, CISM
International Centre for Mechanical Sciences. Vienna
Otto SE, Rowley CW (2019) Linearly recurrent autoencoder networks for learning dynamics. SIAM
J Appl Dyn Syst 18(1):558–593
Pacciarini P, Rozza G (2014) Stabilized reduced basis method for parametrized advection-diffusion
PDEs. Comput Methods Appl Mech Eng 274:1–18
Pawar S et al (2019) A deep learning enabler for nonintrusive reduced order modeling of fluid flows.
Phys Fluids 31(8):085101
Pawar S et al (2021) Model fusion with physics-guided machine learning: projection-based reduced-
order modeling. Phys Fluids 33(6):067123
Raissi M, Perdikaris P, Karniadakis GE (2019) Physics-informed neural networks: a deep learning
framework for solving forward and inverse problems involving nonlinear partial differential
equations. J Comput Phys 378:686–707
Rasmussen CE, Williams CKI (2005) Gaussian processes for machine learning
Reyes R et al (2018) Reduced order models for thermally coupled low mach flows. Adv Model
Simul Eng Sci 5(1):1–20
Reyes R, Codina R (2020) Projection-based reduced order models for flow problems: a variational
multiscale approach. Comput Methods Appl Mech Eng 363:112844
Rozza G, Huynh DBP, Manzoni A (2013) Reduced basis approximation and a posteriori error
estimation for stokes flows in parametrized geometries: roles of the inf-sup stability constants.
Rozza G, Lassila T, Manzoni A (2011) Reduced basis approximation for shape optimization in
thermal flows with a parametrized polynomial geometric map. In: Spectral and high order methods
for partial differential equations. Springer, Berlin, pp 307–315
Sahba S et al (2022) Dynamic mode decomposition for aero-optic wavefront characterization. Opt
Eng 61(1):013105
San O, Iliescu T (2015) A stabilized proper orthogonal decomposition reduced-order model for
large scale quasigeostrophic ocean circulation. Adv Comput Math 41(5):1289–1319
San O, Maulik R (2018a) Extreme learning machine for reduced order modeling of turbulent
geophysical flows. Phys Rev E 97(4):42322
San O, Maulik R (2018b) Neural network closures for nonlinear model order reduction. Adv Comput
Math 44(6):1717–1750
San O, Pawar S, Rasheed A (2022) Variational multiscale reinforcement learning for discovering
reduced order closure models of nonlinear spatiotemporal transport systems. arXiv:2207.12854
[physics]
Schmid PJ (2010) Dynamic mode decomposition of numerical and experimental data. J Fluid Mech
656:5–28
Schmid PJ, Violato D, Scarano F (2012) Decomposition of time-resolved tomographic PIV. Exp
Fluids 52(6):1567–1579
Shah NV et al (2022) Finite element based model order reduction for parametrized one-way coupled
steady state linear thermo-mechanical problems. Finite Elem Anal Des 212:103837
Srinivasan PA et al (2019) Predictions of turbulent shear flows using deep neural networks. Phys
Rev Fluids 4(5):054603
Suykens JAK et al (2002) Least squares support vector machines. World Scientific
Takeishi N, Kawahara Y, Yairi T (2017) Learning koopman invariant subspaces for dynamic mode
decomposition. In: Proceedings of the 31st international conference on neural information processing
systems. NIPS’17. Curran Associates Inc., Red Hook, NY, USA, pp 1130–1140
Tello A, Codina R (2021) Field-to-field coupled fluid structure interaction: a reduced order model
study. Int J Numer Methods Eng 122(1):53–81
Tello A, Codina R, Baiges J (2020) Fluid structure interaction by means of variational multiscale
reduced order models. Int J Numer Methods Eng 121(12):2601–2625
Tissot G et al (2014) Model reduction using dynamic mode decomposition. Comptes Rendus
Mécanique. Flow Separation Control 342(6):410–416
Tu JH et al (2014) On dynamic mode decomposition: theory and applications. J Comput Dyn
1(2):391–421
Vlachas PR et al (2018) Data-driven forecasting of high-dimensional chaotic systems with long
short-term memory networks. Proc R Soc A: Math, Phys Eng Sci 474(2213):20170844
Wan ZY, Sapsis TP (2017) Reduced-space gaussian process regression for data-driven probabilistic
forecast of chaotic dynamical systems. Phys D: Nonlinear Phenom 345:40–55
Wang Z et al (2012) Proper orthogonal decomposition closure models for turbulent flows: a numer-
ical comparison. Comput Methods Appl Mech Eng 237:10–26
Wang Q, Hesthaven JS, Ray D (2019) Non-intrusive reduced order modeling of unsteady flows using
artificial neural networks with application to a combustion problem. J Comput Phys 384:289–307
Wang Q, Ripamonti N, Hesthaven JS (2020) Recurrent neural network closure of parametric
POD-Galerkin reduced-order models based on the mori-zwanzig formalism. J Comput Phys
410:109402
Wehmeyer C, Noe F (2018) Time-lagged autoencoders: deep learning of slow collective variables
for molecular kinetics. J Chem Phys 148(24):241703
Williams MO, Kevrekidis IG, Rowley CW (2015) A data-driven approximation of the koopman
operator: extending dynamic mode decomposition. J Nonlinear Sci 25(6):1307–1346
Xie X, Webster C, Iliescu T (2020) Closure learning for nonlinear model reduction using deep
residual neural network. Fluids 5(1):39
Xu S et al (2013) Multi-output least-squares support vector regression machines. Pattern Recognit
Lett 34(9):1078–1084
Yousif MZ, Lim H-C (2022) Reduced-order modeling for turbulent wake of a finite wall-mounted
square cylinder based on artificial neural network. Phys Fluids 34(1):015116
Yvonnet J, He Q-C (2007) The reduced model multiscale method (R3M) for the non-linear homog-
enization of hyperelastic media at finite strains. J Comput Phys 223(1):341–368
Zhao H (2021) A reduced order model based on machine learning for numerical analysis: an
application to geomechanics. Eng Appl Artif Intell 100:104194
Zhu Q, Guo Y, Lin W (2021) Neural delay differential equations. In: The international conference
on learning representations, p 20
Zwanzig R (1960) Ensemble method in the theory of irreversibility. J Chem Phys 33(5):1338–1341
Chapter 9
Regression Models for Machine Learning
9.1 Introduction
P. Wei (B)
School of Power and Energy, Northwestern Polytechnical University, West Youyi Road 127,
Xi’an 710072, China
e-mail: [email protected]
M. Beer
Institute for Risk and Reliability (IRZ), Leibniz Universität Hannover, Callinstraße 34, Hannover
30167, Germany
e-mail: [email protected]
are mostly based on regression models for fitting some expensive-to-estimate and
implicit functions (Hennig et al. 2022). This motivates us to present this chapter for
introducing the regression techniques in a concise way. Specifically, some classical
regression models will be introduced from either a Bayesian or non-Bayesian per-
spective with a focus on understanding the philosophy and rationale behind these
models.
As for real-world practices, there are two phases with slight differences in gen-
erating the training data D. For the first phase, the data can be generated from
observations or measurements (e.g. the response of a building against seismic exci-
tation), for which case the data set X may or may not be designed, depending on
whether the placement of, e.g. the sensors for measurement, can be designed. For the
second phase, the purpose of regression is to generate a cheap-to-estimate surrogate
for approximating expensive-to-estimate simulator, revealing that the training data X
can be arbitrarily designed and Y is generated by calling the simulator. For the later
phase, there may exist alternative simulators with different levels of fidelity, and the
resultant models are termed as multi-fidelity surrogates (Perdikaris et al. 2017). In
this chapter, it is assumed that the data X can be designed, bringing the motivation
of active learning.
The alternative models for regression mainly differ in the model forms and treat-
ments of model parameter estimations. The models and methods for regression to
be treated in this chapter include two groups, i.e. the non-Bayesian regression mod-
els based on minimizing the empirical loss function and the Bayesian regression
models, where the former class includes the Least Square Regression (LSR), the
Ridge Regression (RR) and the Support Vector Regression (SVR), equipped or not
equipped with the kernel trick, and the latter class to be examined includes the Bayesian
parametric regression and Gaussian Process Regression (GPR). The active learning
procedures for scientific computation based on Bayesian regression models are also
presented.
As has been stated above, different types of regression models can be grouped based
on the model forms and the measures of loss function. The LSR, as the name suggests,
uses the mean square error as loss function. The model parameters are then estimated
by minimizing this loss function. The models utilized in this group are commonly of
parametric form, and it is said to be linear LSR if the models show a linear relationship
with the model parameters (instead of the predictor variables). We take the linear
LSR as an example for introduction as the estimators of the model parameters are of
closed form.
Without loss of generality, we assume that there exists a set of basis functions, termed $\phi(x) = \left(1, \phi_1(x), \phi_2(x), \ldots, \phi_p(x)\right)^T$. These basis functions are usually called features, and the linear space spanned by them is termed the feature space. Transforming the data X into the feature space can usually facilitate the regression as the response may show linear behavior in the feature space. With this in mind, the linear regression model is then formulated as:
$y = \beta^T \phi(x) + \epsilon$, (9.2)
where $\beta = \left(\beta_0, \beta_1, \ldots, \beta_p\right)^T$ refers to the vector of model parameters to be learned from the training data D, and $\epsilon$ denotes the noise (mostly assumed to be Gaussian white noise with zero mean and constant variance).
With this model assumption, the prediction function is generated as $\hat{y}(x) = \beta^T \phi(x)$. Given the training data D = (X, Y), the mean squared error (MSE), rooted in the Euclidean distance (or $L_2$ distance), in matrix notation, is defined by:
$\mathrm{MSE}_D(\beta) = \frac{1}{n}\left(Y - \hat{y}(X)\right)^T \left(Y - \hat{y}(X)\right)$. (9.3)
In functional form, it is defined as
$\mathrm{MSE}(\beta) = \int_{\mathbb{R}^d} \left(y(x) - \hat{y}(x)\right)^2 f(x)\, dx$, (9.4)
which, when estimated with validation data, presents useful information for validating the regression model. In Eq. (9.4), f(x) refers to the probability density of the predictor variables x, which serves as a weight of the integral defining the MSE. The
error defined by Eq. (9.4) is usually termed as generalization error or more specif-
ically, generalization MSE; while the one defined by Eq. (9.3) over a labeled sample
set is called empirical error, or more specifically, empirical MSE, as it is indeed the
average error on this sample set. The generalization error measures the true error for
regression in the input space with the consideration of the probability distribution of
the predictor variables. The empirical error, under specific assumptions, is an approx-
imation of the generalization (true) error. Indeed, if the samples used for computing
the empirical error are independently and identically distributed (i.i.d.) according to
f (x), the expectation of the empirical error is exactly the generalization error (see,
e.g. Mohri et al. 2018, Chap. 2).
The LSR expects to learn the model parameters by minimizing the generaliza-
tion MSE as the generalization capability of the assumed model can be maximized,
which, unfortunately, is commonly infeasible due to the limited training data. A
compromised and practical way is to minimize the empirical MSE defined over the
training data set, which leads to a point estimator of the model parameter formulated
by:
$\hat{\beta}_{\mathrm{LSR}} = \arg\min_{\beta} \mathrm{MSE}_D(\beta)$. (9.5)
Making the gradients of the empirical MSE with respect to the model parameters equal to zero, the closed-form estimator can be easily derived as:

β̂_LSR = (A^T A)^{-1} A^T Y,   (9.6)

and the resulting prediction model is denoted as ŷ_D(x) = β̂_LSR^T φ(x), where the subscript "D" indicates that the LSR prediction model is trained with the labeled data D.
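To make the closed-form estimator of Eq. (9.6) concrete, the following is a minimal NumPy sketch, assuming a one-dimensional predictor and the quadratic feature basis {1, x, x²} used later in Fig. 9.1; the toy data and all names are illustrative rather than taken from the chapter.

```python
import numpy as np

# Minimal sketch of the closed-form linear LSR estimator of Eq. (9.6).
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=15)                       # 15 training inputs (toy data)
Y = X**2 * np.sin(4 * X) + 0.1 * rng.normal(size=15)      # noisy responses (toy data)

def features(x):
    """Feature map phi(x) = [1, x, x^2]."""
    x = np.atleast_1d(x)
    return np.column_stack([np.ones_like(x), x, x**2])

A = features(X)                                           # n x (p+1) design matrix
beta_lsr = np.linalg.solve(A.T @ A, A.T @ Y)              # beta_hat = (A^T A)^{-1} A^T Y

def y_hat(x):
    """LSR prediction y_hat(x) = beta^T phi(x)."""
    return features(x) @ beta_lsr

print(beta_lsr, y_hat(0.3))
```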
Given the trained LSR model, one of the most important concerns is the quality of the model. This can be evaluated from two aspects. The first aspect concerns the quality of fit to the training data, while the second concerns the generalization performance, i.e. the performance for prediction at unobserved points. For the former task, the "Coefficient of Determination" is of special interest, and it is introduced in the following subsection. For the latter task, Ridge Regression can be an excellent alternative.
As defined by Eq. (9.3), the empirical MSE is the average squared distance between the true values and the predicted values of the LSR model, and it is intuitively concluded that small values of the empirical MSE indicate a good fit of the LSR model to the data. This is only partly true. To clarify this, the coefficient of determination, also called "R squared" and denoted as R², is helpful. Given the training data D = (X, Y) with Y = (y_1, y_2, ..., y_n)^T, the total Sum of Squares of the response values is defined as:

SS_tot = Σ_{j=1}^n (y_j − ȳ)²,   (9.8)

with ȳ = (1/n) Σ_{j=1}^n y_j being the mean of the observed response values. The residual Sum of Squares, defined as the accumulated squared errors between predicted and observed response values, is given as:
SS_res = Σ_{j=1}^n (y_j − ŷ_D(x_j))².   (9.9)

The residual Sum of Squares can be interpreted as the variance of the observed response values left unexplained by ŷ_D(x); subtracting it from the total sum of squares gives the regression Sum of Squares, which measures the explained variance:

SS_reg = SS_tot − SS_res = Σ_{j=1}^n (ŷ_D(x_j) − ȳ)².   (9.10)

The coefficient of determination is then defined as:

R² = SS_reg / SS_tot = 1 − SS_res / SS_tot,   (9.11)

which can be interpreted as the percentage of the total variance of the observed response data Y that is explained by the linear LSR model ŷ_D(x).
R² takes values between 0 and 1, with R² = 1 indicating that the model predictions at the training points match the observed response values exactly, and R² = 0 implying that, at any training point, the model predicts the response value as the mean ȳ of the observed response values, which corresponds to the worst fit. Usually, a higher value of R² indicates a better fit of the model to the data. As an example, a training data set consisting of 15 points is created, and the linear LSR is implemented with the function bases φ(x) = {1, x} and φ(x) = {1, x, x²}. The results are compared in Fig. 9.1. As seen, with the linear functional basis, the resultant R² value is 0.3026, indicating a poor fit to the training data. However, when the quadratic basis function is added, the R² value rises to 0.9404, implying a much better fit to the data.
For regression with multiple predictor variables, it is found that R² spuriously increases with the dimension of the predictor variables, which may mislead the analysts when drawing conclusions on the fitting quality; in particular, it is unsuitable for comparing the performance of regression models with different numbers of predictor variables. To eliminate this dimension effect, the adjusted coefficient of determination, which accounts for the degrees of freedom of both SS_res and SS_tot, is adapted from R² as:

R̄² = 1 − (1 − R²) (n − 1)/(n − p − 1).   (9.12)

The adjusted R² values for the LSR models in Fig. 9.1 are R̄² = 0.2490 and R̄² = 0.9305 respectively, which are both smaller than the corresponding R² values.
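As a small illustration of Eqs. (9.8)–(9.12), the sketch below computes R² and its adjusted version from observed and predicted responses; it assumes the LSR sketch given earlier and is not the code used to produce Fig. 9.1.

```python
import numpy as np

def r_squared(y_obs, y_pred, n_features):
    """Coefficient of determination and its adjusted version, Eqs. (9.8)-(9.12)."""
    y_obs, y_pred = np.asarray(y_obs), np.asarray(y_pred)
    n = y_obs.size
    ss_tot = np.sum((y_obs - y_obs.mean())**2)        # Eq. (9.8)
    ss_res = np.sum((y_obs - y_pred)**2)              # Eq. (9.9)
    r2 = 1.0 - ss_res / ss_tot                        # Eq. (9.11)
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1)   # Eq. (9.12)
    return r2, r2_adj

# Usage with the LSR sketch above (p = 2 for the basis {1, x, x^2}):
# r2, r2_adj = r_squared(Y, y_hat(X), n_features=2)
```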
Fig. 9.1 Schematic illustration of the linear LSR model fitted with basis functions φ(x) = {1, x} (left) and φ(x) = {1, x, x²} (right)
The R² and R̄² values provide estimates of the extent to which the linear LSR model fits the training data, but they do not provide a formal hypothesis test for the overall significance of the model. For the latter purpose, one can refer to the overall F-test for regression, which is closely related to the (adjusted) coefficient of determination. Due to limited space, this will not be introduced here; interested readers can refer to Chap. 7 of Ref. Johnson and Wichern (2007).
The linear LSR is simple and easy to implement, benefiting from the closed-form estimators of the model parameters β. However, the generalization performance can be low, as it aims at minimizing the empirical error instead of the generalization error. This may result in over-fitting of the regression model, i.e. a good fit to the training data but low predictive performance at unobserved points. As prediction is usually the main objective of performing regression, developing techniques for avoiding or alleviating over-fitting is of special concern in many machine learning algorithms.
One way to improve the generalization performance is to use the ridge regression,
where the target function for minimization is modified as:
T(β) = MSE_D(β) + λ Σ_{i=1}^p β_i²,   (9.13)

where the L_2 regularization term Σ_{i=1}^p β_i² controls the model complexity, and λ is a (user-defined) positive parameter used for balancing
the model complexity against accuracy (quantified by the empirical MSE). Indeed, under specific assumptions, the target function given by Eq. (9.13) is an upper bound of the generalization error MSE(β) defined by Eq. (9.4). Therefore, minimizing the target function T(β) amounts to minimizing an upper bound of the generalization error. This is the intrinsic reason why the generalization performance can be improved and over-fitting can be alleviated by introducing the regularization term. The above improved version of linear LSR is called Ridge Regression (RR), which is closely related to Gaussian process regression, to be introduced in Sect. 9.2.3. If the features φ(x) are set to be the eigenfunctions of a kernel, it is called Kernel Ridge Regression (KRR).
By minimizing the target function in Eq. (9.13), the estimator of the model parameters for RR can be derived as:

β̂_RR = (A^T A + λI)^{-1} A^T Y,   (9.14)
where I refers to the identity matrix. For more details of the RR, one can refer to
Chap. 10 of Ref. Mohri et al. (2018) or Chaps. 3 and 5 of Ref. Theodoridis (2015).
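A minimal sketch of the RR estimator of Eq. (9.14) is given below, under the same assumptions (and with the same illustrative design matrix) as the LSR sketch above; note that, as written in Eq. (9.14), the intercept is penalized together with the other coefficients.

```python
import numpy as np

def ridge_fit(A, Y, lam):
    """Ridge estimator of Eq. (9.14): beta = (A^T A + lam * I)^{-1} A^T Y."""
    p = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(p), A.T @ Y)

# Usage with A and Y from the LSR sketch above (lambda is a user-defined choice):
# beta_rr = ridge_fit(A, Y, lam=1.0)
```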
The model form assumed for linear SVR is the same as that for the linear LSR; the difference lies mainly in the loss function, and thus also in the estimation of the model parameters. Without loss of generality, the model form for SVR is assumed to be:

ŷ(x) = β^T φ(x) + β_0,   (9.15)

where φ(x) = (φ_1(x), φ_2(x), ..., φ_p(x))^T is the functional basis of the feature space, and β = (β_1, β_2, ..., β_p)^T and β_0 are the model parameters to be estimated by minimizing a loss function.
A typical difference between SVR and LSR is that, in SVR, the loss of each training point contained in a tube with radius ε and center ŷ(x) = β^T φ(x) + β_0 is assumed to be zero. This feature makes SVR more tolerant of the noise contained in the training data. For the jth training point (x_j, y_j), the loss function is defined as:

L_j(β, β_0) = max(0, |ŷ(x_j) − y_j| − ε).   (9.16)

The overall ε-insensitive loss function of the training data D is then defined as:

L_D(β, β_0) = (1/2) ||β||² + C Σ_{j=1}^n max(0, |ŷ(x_j) − y_j| − ε),   (9.17)
where the first term (1/2)||β||² is inversely proportional to the gap between the two margins ŷ(x) − ε and ŷ(x) + ε, the second term measures the accumulated distances of all the training points to the region bounded by the two margins, and the parameter C is introduced for balancing the above two factors, as explained by Fig. 9.2. By minimizing the loss function L_D(β, β_0), it is expected to produce a hyperplane ŷ(x) = β̂^T φ(x) + β̂_0 in the feature space that, on the one hand, maximizes the gap between the two margins given fixed ε and, on the other hand, ensures that all the training points are as close as possible to the region bounded by the two margins.
The remaining task for SVR learning is then to estimate the model parameters β and β_0 by minimizing the ε-insensitive loss function defined by Eq. (9.17), which is not as trivial as for LSR learning because the target function is not smooth. Define two slack variables ζ_j and ζ_j* as follows:

ζ_j = max(0, ŷ(x_j) − y_j − ε),
ζ_j* = max(0, y_j − ŷ(x_j) − ε).   (9.18)
In words, ζ_j is a non-negative variable taking the minimal value that satisfies ŷ(x_j) − y_j − ε ≤ ζ_j, and ζ_j* is another non-negative variable taking the minimal value constrained by y_j − ŷ(x_j) − ε ≤ ζ_j*. With these slack variables, the optimization problem for model parameter estimation can be reformulated as:

min_{β, β_0, ζ, ζ*}  L(β, β_0, ζ, ζ*) = (1/2)||β||² + C Σ_{j=1}^n (ζ_j + ζ_j*)
subject to  ŷ(x_j) − y_j ≤ ε + ζ_j,
            y_j − ŷ(x_j) ≤ ε + ζ_j*,
            ζ_j ≥ 0, ζ_j* ≥ 0, j = 1, ..., n.   (9.19)

Introducing the Lagrangian of the above constrained problem gives:
L(β, β_0, ζ, ζ*, μ, μ*, α, α*) = (1/2)||β||² + C Σ_{j=1}^n (ζ_j + ζ_j*)
    − Σ_{j=1}^n μ_j ζ_j − Σ_{j=1}^n μ_j* ζ_j*
    + Σ_{j=1}^n α_j (ŷ(x_j) − y_j − ε − ζ_j)
    + Σ_{j=1}^n α_j* (y_j − ŷ(x_j) − ε − ζ_j*),   (9.20)
with μ_j, μ_j*, α_j and α_j* being the Lagrange multipliers. Setting the gradient of L(β, β_0, ζ, ζ*, μ, μ*, α, α*) with respect to each element of β, β_0, ζ and ζ* equal to zero yields:

β = Σ_{j=1}^n (α_j − α_j*) φ(x_j)   (9.21a)
0 = Σ_{j=1}^n (α_j − α_j*)   (9.21b)
C = α_j + μ_j   (9.21c)
C = α_j* + μ_j*,   (9.21d)
substituting all of which into Eq. (9.20), the dual optimization problem of Eq. (9.20)
is formulated as:
min_{α, α*}  L(α, α*) = Σ_{j=1}^n y_j (α_j* − α_j) + ε Σ_{j=1}^n (α_j* + α_j)
    + (1/2) Σ_{j=1}^n Σ_{k=1}^n (α_j* − α_j) φ(x_j)^T φ(x_k) (α_k* − α_k)
subject to  Σ_{j=1}^n (α_j* − α_j) = 0,  0 ≤ α_j, α_j* ≤ C,  j = 1, ..., n,   (9.22)
which is a typical quadratic programming problem and can be numerically solved by the Sequential Minimal Optimization (SMO) algorithm. With α and α* estimated, the estimate of the model parameters β, denoted as β̂_SVR, can be easily computed with Eq. (9.21a). The estimate of the model parameter β_0 can be computed by β̂_{0,SVR} = y_k + ε − Σ_{j=1}^n (α_j* − α_j) φ(x_k)^T φ(x_j), with x_k being any of the training points.
With the complementary slackness conditions, it is known that, for any j ∈ {1, 2, ..., n}, it holds that:

α_j (ŷ(x_j) − y_j − ε − ζ_j) = 0   (9.23a)
α_j* (ŷ(x_j) − y_j + ε + ζ_j*) = 0.   (9.23b)
Equation (9.23) reveals that, for any x_j being a support vector with either α_j ≠ 0 or α_j* ≠ 0, it holds that ŷ(x_j) − y_j − ε = ζ_j ≥ 0 or y_j − ŷ(x_j) − ε = ζ_j* ≥ 0 correspondingly, indicating that all the support vectors lie outside the tube of radius ε. For the points inside the tube, it holds that α_j = α_j* = 0; thus the non-support vectors make no contribution to determining the model parameter values, as revealed by Eq. (9.21a). For each support vector x_j, either α_j ≠ 0 or α_j* ≠ 0 holds; thus the SVR model is uniquely determined by the support vectors. With the above conclusions in mind, it is known that the value of the tolerance parameter ε makes a trade-off between sparsity and accuracy, i.e. a larger value of ε results in a smaller number of support vectors, but also leads to a higher risk of ignoring important points which have significant effects on the model accuracy (Mohri et al. 2018). Thus, a proper pre-selection of the values of ε and C is important for successfully training the SVR model.
A common feature of LSR and SVR is that a large number of inner products of
vectors in a feature space needs to be computed, as can be found in Eqs. (9.14) and
(9.22). These numerical procedures are usually computationally cumbersome as the
feature space spanned by φ(x) can be of extremely high or even infinite dimension.
This can be considerably alleviated by using the kernel trick. Before introducing the kernel trick for LSR/SVR, some important concepts related to kernels are needed. A kernel κ(x, x') is a symmetric function defined via the inner product of feature maps, κ(x, x') = ⟨φ(x), φ(x')⟩; it measures the similarity between two points. Despite the definition of a kernel via the inner product, it is not required to compute the inner product; on the contrary, the inner product can be evaluated by calling the kernel function with much lower computational cost. To show the connection between a kernel and the inner product of a feature map, it is useful to first introduce Mercer's theorem.
Provided that the Mercer condition

∫∫ κ(x, x') a(x) a(x') dx dx' ≥ 0

holds for any square integrable function a(x) defined on X, the kernel function κ admits a uniformly convergent expansion as

κ(x, x') = Σ_{i=1}^∞ λ_i ψ_i(x) ψ_i(x'),   (9.26)

with non-negative coefficients λ_i and functions ψ_i(x) for which the orthogonality ∫ ψ_i(x) ψ_j(x) dx = 0 holds for any i ≠ j. Defining the feature map as φ(x) = (√λ_1 ψ_1(x), √λ_2 ψ_2(x), ...)^T, φ(x) thus forms a set of orthogonal basis functions for a feature space. Three commonly used kernels are summarized as follows:
• Polynomial kernel: κ(x, x') = (⟨x, x'⟩ + C)^d.
• Gaussian radial basis kernel: κ(x, x') = exp(−γ ||x − x'||²).
• Sigmoid kernel: κ(x, x') = tanh(γ ⟨x, x'⟩ + C).
Take the polynomial kernel with C = 1 and d = 2 as an example. For any x, x' ∈ R², it holds that:

κ(x, x') = (⟨x, x'⟩ + 1)² = ⟨φ(x), φ(x')⟩,   (9.28)

where φ(x) = (1, √2 x_1, √2 x_2, x_1², x_2², √2 x_1 x_2)^T. Obviously, estimating the inner product ⟨φ(x), φ(x')⟩ by calling the kernel shows much lower computational complexity than calculating it by definition.
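The identity in Eq. (9.28) can also be checked numerically; the short sketch below assumes the explicit degree-2 feature map written above and verifies that the kernel value equals the inner product in the feature space.

```python
import numpy as np

def poly_kernel(x, z, C=1.0, d=2):
    """Polynomial kernel kappa(x, z) = (<x, z> + C)^d."""
    return (x @ z + C) ** d

def phi(x):
    """Explicit feature map for C = 1, d = 2 and x in R^2, Eq. (9.28)."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1**2, x2**2, np.sqrt(2) * x1 * x2])

x, z = np.array([0.3, -1.2]), np.array([0.7, 0.5])
assert np.isclose(poly_kernel(x, z), phi(x) @ phi(z))   # kernel equals inner product
```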
One more tool that is useful for understanding many of the regression methods is the Reproducing Kernel Hilbert Space (RKHS). Given a PSD kernel κ, a Hilbert space H of functions can be constructed from the kernel, with the inner product defined by integrating x out; H is then called the RKHS associated with κ. For more details on kernels and the RKHS, one can refer to Chap. 11 of Ref. Theodoridis (2015) and Chap. 5 of Ref. Mohri et al. (2018).
As revealed by Eq. (9.14) for RR and Eq. (9.22) for SVR, the parameter estimation process involves the computation of many inner products, typically φ(x_j)^T φ(x_k) for j, k ∈ {1, 2, ..., n}, each of which can be computationally expensive due to the high or even infinite dimension of the feature space. Actually, with the RKHS H associated with a PSD kernel κ at hand, there is no need to know what exactly the feature mapping is. Given a PSD kernel κ, we know that there must exist a feature mapping φ(x) and a resultant feature space H such that

⟨φ(x_j), φ(x_k)⟩ = κ(x_j, x_k)   (9.31)
holds for any j, k ∈ {1, 2, ..., n}. Equation (9.31) is called the Kernel Trick, which not only facilitates the estimation of the inner products, but also avoids building the feature map explicitly. We take the SVR as an example to illustrate the details, but the trick also applies to LSR and KRR.

Given a kernel κ(x, x'), based on the kernel trick, the dual optimization problem given in Eq. (9.22) can be reformulated as:
min_{α, α*}  L(α, α*) = Σ_{j=1}^n y_j (α_j* − α_j) + ε Σ_{j=1}^n (α_j* + α_j)
    + (1/2) Σ_{j=1}^n Σ_{k=1}^n (α_j* − α_j) κ(x_j, x_k) (α_k* − α_k)   (9.32)
subject to  Σ_{j=1}^n (α_j* − α_j) = 0,  0 ≤ α_j, α_j* ≤ C,  j = 1, ..., n,
where the inner product φ(x_j)^T φ(x_k) is replaced by κ(x_j, x_k). Solving the optimization problem of Eq. (9.32), the model parameters can then be computed by β̂_SVR = Σ_{j=1}^n (α_j − α_j*) φ(x_j) and β̂_{0,SVR} = y_k + ε − Σ_{j=1}^n (α_j* − α_j) κ(x_k, x_j). The SVR model for prediction is then formulated as:

ŷ_D(x) = β̂_SVR^T φ(x) + β̂_{0,SVR}
       = Σ_{j=1}^n (α_j − α_j*) φ(x_j)^T φ(x) + β̂_{0,SVR}   (9.33)
       = Σ_{j=1}^n (α_j − α_j*) κ(x_j, x) + β̂_{0,SVR}.
It is now clear that, for both parameter estimation and prediction, there is no need to know the explicit expression of the feature mapping φ(x); instead, only a kernel function satisfying Mercer's condition is required.
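In practice, kernelized SVR is rarely implemented from scratch; the sketch below uses the SVR class of scikit-learn (a tooling assumption, not part of the chapter) to reproduce the spirit of the comparison in Fig. 9.3 on illustrative data.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(30, 1))                       # illustrative training inputs
y = np.sin(2 * X[:, 0]) + 0.05 * rng.normal(size=30)       # illustrative noisy responses

# epsilon-insensitive SVR with three kernels; C, epsilon and gamma are illustrative values
models = {
    "linear": SVR(kernel="linear", C=10.0, epsilon=0.1),
    "poly2":  SVR(kernel="poly", degree=2, coef0=1.0, C=10.0, epsilon=0.1),
    "rbf":    SVR(kernel="rbf", gamma=1.0, C=10.0, epsilon=0.1),
}
for name, model in models.items():
    model.fit(X, y)
    print(name, "number of support vectors:", model.support_.size)
```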
An example of implementing the SVR with a linear kernel, a polynomial kernel of order 2, and a Gaussian kernel is shown in Fig. 9.3 for illustration and comparison. The value of ε is set to 0.6 for all three implementations. As can be seen, for all three types of kernels, four points out of seven are identified as support vectors, but the support vectors differ from kernel to kernel. With the polynomial and Gaussian kernels, the SVR shows a better fit to the training data than that trained with the linear kernel. For a strict comparison, specific techniques such as cross-validation are required, but these will not be discussed in detail here due to space limitations.
Fig. 9.3 An example of SVR implemented with kernel function being linear, polynomial, and
Gaussian forms, respectively
Both the LSR and SVR methods estimate the model parameters by minimizing empirical errors, and predict the response value at any unobserved point x as a deterministic value. Instead, a Bayesian regression method produces a stochastic process, and predicts the response value at x as a random variable, the variation of which summarizes the prediction error. In this section, we introduce Bayesian regression from both a parametric space perspective and a functional space perspective. These two views of Bayesian regression can be found in, e.g., Chap. 2 of Ref. Rasmussen and Williams (2006).
For a better understanding of the difference between Bayesian and non-Bayesian regression methods, and between parametric and non-parametric regression methods, it is helpful to introduce the concept of the Hypothesis Set H. In general terms, it is defined as a set of functions mapping the vectors in the feature space H to the response space. For example, given the model form assumption y = β^T φ(x) + ε for linear LSR, the hypothesis set is defined as H := {ŷ = β^T φ(x) : β ∈ R^p}, and the LSR problem can be formulated as finding an element of H that minimizes the empirical MSE.
We assume that the regression model still takes the form of Eq. (9.2), i.e. y = β^T φ(x) + ε, indicating the hypothesis set H := {ŷ = β^T φ(x) : β ∈ R^p}. Instead of finding an element of H by minimizing the empirical error, Bayesian parametric regression aims at attributing a probability distribution to the model parameters β, and thus to the hypothesis set H, with the ability of summarizing the prediction error under proper assumptions. This is realized by a Bayesian inference procedure.

The prior assumption imposed on the problem includes two aspects. First, the prior distribution of the model parameters β is assumed to be Gaussian with zero mean and covariance Σ_pr, i.e. β ∼ N(0, Σ_pr), and the prior density is denoted by f(β). Second, the noise ε is assumed to be Gaussian with zero mean and variance σ_n², and also independent from point to point. With the second assumption, the likelihood function of the training data D can be formulated as:
f(D | β) = Π_{i=1}^n (1/(√(2π) σ_n)) exp(−(y_i − β^T φ(x_i))² / (2σ_n²))
         = (2π σ_n²)^{−n/2} exp(−(1/(2σ_n²)) ||Y − Aβ||²).   (9.34)
With the Bayes formula, the posterior density of β can then be formulated as:

f(β | D) = f(D | β) f(β) / f(D),   (9.35)

where f(D) = ∫ f(D | β) f(β) dβ is a constant called the evidence, used for normalizing the posterior density.
As both f(β) and f(D | β) in Eq. (9.35) are Gaussian, the posterior density f(β | D) must also be of Gaussian form. The posterior mean β̄_post and posterior covariance Σ_post can be easily derived as:

β̄_post = σ_n^{−2} Σ_post A^T Y,   (9.36)

and

Σ_post = (σ_n^{−2} A^T A + Σ_pr^{−1})^{−1},   (9.37)

respectively. Comparing the above results with the estimator given by Eq. (9.14), it can be concluded that, with the assumption that the noise has unit variance, i.e. σ_n² = 1, and the prior precision matrix is diagonal with equal elements, i.e. Σ_pr^{−1} = λI, the posterior mean β̄_post is exactly the same as the estimator β̂_RR of
the RR parameters. From this point of view, the prior assumption can be viewed as a penalty term which aims at improving the generalization performance of the model. However, it should be kept in mind that, instead of estimating the model parameters as deterministic values, the Bayesian Linear Regression (BLR) infers the parameters as a (subjective) probability distribution, with the posterior covariance Σ_post summarizing the epistemic uncertainty about the model parameters. With an increase in the training sample size, the readers can verify that the posterior variance of each parameter tends to decrease, indicating a reduction of the epistemic uncertainty.
Given the above setting and results, the posterior prediction ŷ(x) = φ(x)^T β admits a Gaussian process, as it is a linear combination of a set of Gaussian variables, and the posterior mean and variance are formulated as:

μ_y(x) = φ(x)^T β̄_post   (9.38)

and

σ_y²(x) = φ(x)^T Σ_post φ(x),   (9.39)

respectively.
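The BLR posterior and the posterior predictive distribution of Eqs. (9.36)–(9.39) can be sketched in a few lines of NumPy; the function and variable names are illustrative, and the sketch reuses the design matrix A of the LSR example.

```python
import numpy as np

def blr_posterior(A, Y, sigma_n2, Sigma_pr):
    """Posterior mean and covariance of beta, Eqs. (9.36)-(9.37)."""
    Sigma_post = np.linalg.inv(A.T @ A / sigma_n2 + np.linalg.inv(Sigma_pr))
    beta_post = Sigma_post @ A.T @ Y / sigma_n2
    return beta_post, Sigma_post

def blr_predict(phi_x, beta_post, Sigma_post):
    """Posterior predictive mean and variance, Eqs. (9.38)-(9.39)."""
    mu = phi_x @ beta_post
    var = phi_x @ Sigma_post @ phi_x
    return mu, var

# Example usage with the design matrix A and labels Y from the LSR sketch:
# beta_post, Sigma_post = blr_posterior(A, Y, sigma_n2=1.0, Sigma_pr=np.eye(A.shape[1]))
# mu, var = blr_predict(features(0.3)[0], beta_post, Sigma_post)
```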
Fig. 9.4 Example of BLR, with the three panels showing the contours of the prior density, the likelihood function, and the posterior density of the model parameters (β_0, β_1), respectively
Fig. 9.5 Comparison of the prediction models generated by LSR, RR, and BLR, respectively
As shown in Fig. 9.4, the prior assumption pulls the posterior density closer to the origin than the likelihood function. This is consistent with the conclusion drawn following Eq. (9.37) that the prior assumption indeed acts as a penalty term.
The prediction models generated by LSR, RR, and BLR are then compared in Fig.
9.5. For RR, the penalty coefficient λ is set to be one, and in this case, the induced RR
prediction model is the same as the posterior mean prediction model induced by BLR,
as shown in Fig. 9.5. Instead of predicting the response value at x as a deterministic
value, the BLR estimates it as a Gaussian distribution, and the filled region in Fig. 9.5
indicates the 95.45% posterior confidence intervals. This feature makes it possible
for the analysts to make a trade-off between risk and prediction accuracy especially
when specific decisions need to be made based on the predictions.
The above procedure applies to any parametric form of model assumption, but for model forms showing a nonlinear relationship between the response and the model parameters, the posterior distribution of the model parameters usually does not admit a closed form, and conditional sampling techniques such as Markov Chain Monte Carlo (MCMC) sampling need to be introduced for sampling from the posterior density.
It is now clear that, for Bayesian parametric regression with hypothesis set H := {ŷ = β^T φ(x) : β ∈ R^p}, the vector β of model parameters is assumed to be of finite dimension, and the problem is simplified to assigning a proper (subjective) probability distribution to β. Commonly, we expect to extract as many features as possible for the model assumption, as this makes the regression more flexible for capturing the behavior between response and predictor variables. As introduced in Sect. 9.2.3, this can be easily realized with an RKHS equipped with a kernel function. For example, given a Gaussian kernel, the corresponding RKHS can realize almost any smooth function; with a Matérn kernel, a family of functions with a specific order of derivatives can be realized (Rasmussen and Williams 2006).
In the case of an infinite-dimensional feature space, the hypothesis set can still be indexed by the model parameters β, and the problem can still be formulated in the parametric space; however, it is intractable to assign a proper probability distribution to the infinite-dimensional vector β. In contrast to the parametric space perspective, a functional space perspective on Bayesian regression can be used in this case to extend the above inference technique from finite-dimensional feature spaces to infinite-dimensional cases. This leads to the general GPR model, which is introduced in the following subsection.
The GPR assigns a Gaussian process prior to the model function, i.e.

y(x) ∼ GP(m(x), κ(x, x')),   (9.40)

where m(x) denotes the prior mean function, which can be assumed to be zero, constant, or polynomial, and κ(x, x') is a kernel function indicating the prior covariance of y(x) and y(x'). Without loss of generality, we assume that the prior mean takes a constant value, denoted as β, and the prior covariance function takes the Gaussian kernel (also called the squared exponential kernel) with a distinct length-scale parameter in each dimension, that is:

κ(x, x') = σ_0² exp(−(1/2) (x − x')^T Λ^{−1} (x − x')),   (9.41)

where σ_0² is the prior variance and Λ is a diagonal matrix collecting the squared length-scale parameters. Letting e denote an n-dimensional column vector of ones and K(Λ) the correlation matrix with (j, k)th element exp(−(1/2)(x_j − x_k)^T Λ^{−1}(x_j − x_k)), the likelihood function of the training data D reads:
f(D | β, σ_0², Λ) = (1/√((2π)^n |σ_0² K(Λ)|)) exp(−(1/(2σ_0²)) (Y − βe)^T K^{−1}(Λ) (Y − βe)),   (9.42)

the negative logarithm of which is then formulated as:

−log f(D | β, σ_0², Λ) ∝ (1/(2σ_0²)) (Y − βe)^T K^{−1}(Λ) (Y − βe) + (1/2) log |σ_0² K(Λ)| ≜ L(β, σ_0², Λ).   (9.43)
The model hyper-parameters can then be evaluated with two alternative schemes. First, a full Bayesian inference can be applied to infer a posterior density formulated by multiplying the likelihood function of Eq. (9.42) with a proper prior density f(β, σ_0², Λ). This commonly requires the implementation of a conditional sampling technique (e.g. MCMC) to generate posterior samples of (β, σ_0², Λ), which is commonly computationally cumbersome. Another strategy is to evaluate the hyper-parameters as deterministic values by maximizing the likelihood function, which is equivalent to minimizing L(β, σ_0², Λ). In most of the literature, the second scheme is applied, and thus its details are introduced in what follows.
Setting the first-order partial derivatives of L(β, σ_0², Λ) with respect to β and σ_0² to zero, the estimators of the hyper-parameters β and σ_0² can be formulated as:

β̂(Λ) = (e^T K^{−1}(Λ) e)^{−1} e^T K^{−1}(Λ) Y   (9.44a)
σ̂_0²(Λ) = (1/n) (Y − β̂e)^T K^{−1}(Λ) (Y − β̂e).   (9.44b)

Substituting Eq. (9.44) into Eq. (9.43), the optimization for hyper-parameter estimation can then be simplified as:

Λ̂ = arg min_Λ L(β̂(Λ), σ̂_0²(Λ), Λ).   (9.45)
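A minimal sketch of this second (maximum likelihood) scheme is given below: Eq. (9.44) is substituted into Eq. (9.43) and the resulting profile negative log-likelihood is minimized over the log length-scales with SciPy; the symbols X, Y and the optimizer settings are illustrative assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def corr_matrix(X, length_scales):
    """Correlation matrix K(Lambda) of the squared exponential kernel, Eq. (9.41)."""
    diff = (X[:, None, :] - X[None, :, :]) / length_scales
    return np.exp(-0.5 * np.sum(diff**2, axis=-1))

def profile_neg_log_likelihood(log_ls, X, Y):
    """L(beta_hat, sigma0_hat^2, Lambda), Eqs. (9.43)-(9.44), up to an additive constant."""
    n = Y.size
    K = corr_matrix(X, np.exp(log_ls)) + 1e-10 * np.eye(n)   # jitter for stability
    K_inv = np.linalg.inv(K)
    e = np.ones(n)
    beta = (e @ K_inv @ Y) / (e @ K_inv @ e)                  # Eq. (9.44a)
    r = Y - beta * e
    sigma0_2 = (r @ K_inv @ r) / n                            # Eq. (9.44b)
    _, logdet = np.linalg.slogdet(K)
    return 0.5 * n * np.log(sigma0_2) + 0.5 * logdet

# X: (n, d) array of training inputs, Y: (n,) array of responses from a simulator
# res = minimize(profile_neg_log_likelihood, x0=np.zeros(X.shape[1]), args=(X, Y))
# length_scales = np.exp(res.x)
```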
With the estimated hyper-parameters, the posterior of y(x) conditional on the training data D is still a Gaussian process, with posterior mean and covariance functions given by:

μ_y(x) = m(x) + κ(X, x)^T K^{−1} (Y − m(X))   (9.46a)
c_y(x, x') = κ(x, x') − κ(X, x)^T K^{−1} κ(X, x'),   (9.46b)

where κ(X, x) denotes the column vector with ith element κ(x_i, x), and K = κ(X, X) is the covariance matrix of the training points. The posterior mean given by Eq. (9.46a) presents a mean prediction for the response at x. Based on the RKHS, it is known that the prior covariance term κ(x_i, x) can be expressed as an inner product, i.e. κ(x_i, x) = ⟨φ(x_i), φ(x)⟩, where φ(x) = {√λ_i ψ_i(x)}_{i=1}^∞ with (λ_i, ψ_i(x)) being an eigenvalue–eigenfunction pair. The posterior mean is then a linear combination of the feature mapping φ(x), thus being an element of the RKHS associated with the kernel κ(x, x'). Further, each realization of the Gaussian process governed by Eq. (9.46) is also an element of the corresponding RKHS. The above facts reveal that the functional behavior that can be captured by the GPR model is mainly governed by the properties of the kernel function. Given the squared exponential kernel, the resultant GPR model can realize almost any smooth function. In real-world applications, the selection of the best kernel can be implemented based on the users' understanding of the physical process. One can refer to Chap. 4 of Ref. Rasmussen and Williams (2006) for the properties of alternative kernels.
The posterior variance σ_y²(x) = c_y(x, x) defined by Eq. (9.46b) summarizes the prediction error at x. It can be found that the posterior variance is always smaller than the prior variance; this is reasonable, as the epistemic uncertainty about the response value at x tends to decrease after information on the model function has been learned from the training data. As will be shown in Sect. 9.4 of this chapter, the posterior covariance given by Eq. (9.46b) provides a basis for active learning.
One more interesting point is that there are specific connections between the GPR model and the non-Bayesian regression models when the feature mappings are defined with a kernel (see Chap. 6 of Ref. Rasmussen and Williams 2006). Take the LSR as an example. Suppose the feature mappings φ(x) are defined by a kernel κ(x, x'); the LSR prediction model can then be reformulated as:

ŷ_LSR(x) = β̂_LSR^T φ(x)
         = ((A^T A)^{−1} A^T Y)^T φ(x)   (9.47)
         = Y^T K^{−1} A φ(x)
         = Y^T K^{−1} κ(X, x),

which is exactly the same as the posterior mean prediction given by Eq. (9.46a) if the prior mean m(x) is assumed to be zero.
The GPR model introduced above is a noise-free version, i.e. the noise is assumed to be zero. The mean prediction at a training point x_i is exactly equal to its label value y_i, and the corresponding posterior variance equals zero. This makes it behave like an interpolation method. To take the noise into consideration, one more hyper-parameter σ_n² can be embedded into the prior covariance as κ(x, x') + σ_n², and the covariance matrix is then reformulated as K + σ_n² I_n, with I_n being the n-dimensional identity matrix. The remaining inference procedures stay almost the same as in the noise-free version and are not detailed here.
Fig. 9.6 An example for illustration of the GPR model, where the left panel shows the posterior
features (including the posterior mean prediction and the 95% posterior confidence interval), and
the right panel displays ten random realizations of the posterior Gaussian process
An example of the noise-free GPR model trained with five points is schematically
shown in Fig. 9.6. As can be seen from the left panel, all the training points are exactly
located on the posterior mean prediction curve, and the corresponding posterior
variances of these points are exactly zero, indicating that it is an interpolation method.
Ten realizations of the posterior Gaussian process model shown in the right panel
are generated at many discretization points based on their posterior mean vector and
posterior covariance matrix. As shown, all the training points are located exactly on
each of the realizations.
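The following sketch assembles the noise-free posterior of Eq. (9.46) and draws posterior realizations in the spirit of Fig. 9.6; the five training points, the test function, and the hyper-parameter values are illustrative and not those used for the figure.

```python
import numpy as np

def se_kernel(A, B, sigma0_2, length_scales):
    """Squared exponential kernel of Eq. (9.41), evaluated for all pairs of rows of A and B."""
    diff = (A[:, None, :] - B[None, :, :]) / length_scales
    return sigma0_2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))

def gpr_posterior(X, Y, Xs, beta, sigma0_2, length_scales):
    """Posterior mean and covariance of the noise-free GPR, Eq. (9.46)."""
    K = se_kernel(X, X, sigma0_2, length_scales) + 1e-10 * np.eye(len(X))
    k_s = se_kernel(X, Xs, sigma0_2, length_scales)          # kappa(X, x*) for all x*
    K_ss = se_kernel(Xs, Xs, sigma0_2, length_scales)
    alpha = np.linalg.solve(K, Y - beta)
    mean = beta + k_s.T @ alpha                               # Eq. (9.46a)
    cov = K_ss - k_s.T @ np.linalg.solve(K, k_s)              # Eq. (9.46b)
    return mean, cov

# Five illustrative training points and an illustrative test function:
X = np.array([[-2.0], [-1.0], [0.0], [1.5], [2.5]])
Y = (X[:, 0] ** 2) * np.sin(4 * X[:, 0]) + 1.0
Xs = np.linspace(-3, 3, 200)[:, None]
mean, cov = gpr_posterior(X, Y, Xs, beta=np.mean(Y),
                          sigma0_2=np.var(Y), length_scales=np.array([0.8]))
samples = np.random.default_rng(0).multivariate_normal(
    mean, cov + 1e-8 * np.eye(len(Xs)), size=10)              # ten posterior realizations
```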
An active learning procedure adaptively identifies the "optimal" one of the grid points, by adding which to the training data set the largest reduction of numerical error can be achieved. The development of an Acquisition Function (also called a learning function) plays a key role in all these active learning procedures, as it informs the choice of the "optimal point". In this section, we take two Bayesian numerical tasks, i.e. Bayesian cubature and Bayesian reliability assessment, as examples to introduce the Bayesian numerical analysis framework and the corresponding active learning procedures.
Consider the integral

I = ∫_{R^d} y(x) π(x) dx,   (9.48)

where π(x) is the weight density, hereinafter assumed to be independent standard normal, and the integrand y(x) is assumed to be governed by a computationally expensive simulator. Bayesian cubature aims at inferring a posterior distribution of I by attributing a probabilistic assumption to the integrand, and a GPR-based scheme is introduced as an example in the following subsection.
Let D = (X, Y) denote a set of integration points, where the ith row x_i of X indicates the ith point of the input variable x, and Y is a column vector with the ith element computed as y_i = y(x_i), e.g. by calling one or more expensive-to-evaluate simulators. Based on the GPR procedure introduced in Sect. 9.3.2, a GPR model can be trained with D for approximating the integrand y(x) as ŷ(x) ∼ GP(μ_y(x), c_y(x, x')), where the posterior mean μ_y(x) and posterior covariance c_y(x, x') are formulated in Eq. (9.46). Then a random variable Î defined as follows can be used for approximating I:

Î = E[ŷ(x)] = ∫ ŷ(x) π(x) dx,   (9.49)
where E[·] denotes the integral (expectation) operator over π(x). The integral defined by Eq. (9.49) is a linear projection of the Gaussian process ŷ(x); thus Î follows a Gaussian distribution (Rasmussen and Ghahramani 2003). In detail, viewing ŷ(x) as an infinite-dimensional Gaussian vector and π(x) as an infinite-dimensional deterministic vector, both indexed by x, the integral of Eq. (9.49) is indeed a linear combination of infinitely many Gaussian variables.
With the above conclusion in mind, the posterior mean of Î can be derived as (Rasmussen and Ghahramani 2003; Briol et al. 2019):

μ_I = ∫ (∫ ŷ(x) π(x) dx) f(ŷ) dŷ
    = ∫ (∫ ŷ(x) f(ŷ) dŷ) π(x) dx
    = E[μ_y(x)]   (9.50)
    = E[m(x)] + E[κ(X, x)]^T K^{−1} (Y − m(X)),

based on the exchangeability of the two integral operators over the density f(ŷ) of the GPR model ŷ(x) and the density π(x) of x. Similarly, the posterior variance of Î can be formulated as:

σ_I² = ∫ (∫ ŷ(x) π(x) dx − μ_I)² f(ŷ) dŷ
     = ∫ (∫ (ŷ(x) − μ_y(x)) π(x) dx) (∫ (ŷ(x') − μ_y(x')) π(x') dx') f(ŷ) dŷ
     = ∫∫ (∫ (ŷ(x) − μ_y(x)) (ŷ(x') − μ_y(x')) f(ŷ) dŷ) π(x) π(x') dx dx'
     = E E'[c_y(x, x')]   (9.51)
     = E E'[κ(x, x')] − E[κ(X, x)]^T K^{−1} E'[κ(X, x')],

where E'[·] indicates the integral operator over π(x'), and E E'[·] refers to the double integral operator over π(x) and π(x').
From Eqs. (9.50) and (9.51), to derive closed-form expressions for μ_I and σ_I², one needs to express E[m(x)], E[κ(X, x)], and E E'[κ(x, x')] in closed form. The analytical expression of E[m(x)] is commonly easy to derive, as m(x) is commonly assumed to be of zero/constant/linear form. The availability of closed-form expressions for E[κ(X, x)] and E E'[κ(x, x')] is determined by the form of the kernel κ and the density π. One can refer to Ref. Briol et al. (2019) for a summary of the pairs (κ, π) that lead to closed-form expressions for these two integrals. For the squared exponential kernel given in Eq. (9.41) and normal density π(x), closed-form expressions for both integrals are available. For the independent standard normal density, they are formulated as (Wei et al. 2020):

E[κ(X, x)] = σ_0² |Λ^{−1} + I|^{−1/2} exp(−(1/2) diag[X (Λ + I)^{−1} X^T])   (9.52a)
E E'[κ(x, x')] = σ_0² |2Λ^{−1} + I|^{−1/2},   (9.52b)

where diag[·] refers to the operator creating a column vector from the diagonal elements of its argument.
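Putting Eqs. (9.50)–(9.52) together, the sketch below computes the posterior mean and variance of the integral for the squared exponential kernel and an independent standard normal weight density; Lam denotes the vector of squared length-scales (the diagonal of Λ), and all names and hyper-parameter values are illustrative.

```python
import numpy as np

def bq_posterior(X, Y, beta, sigma0_2, Lam):
    """Posterior mean/variance of the integral I, Eqs. (9.50)-(9.52),
    for a constant prior mean beta, SE kernel, and standard normal weight density."""
    n, d = X.shape
    Lam_inv = np.diag(1.0 / Lam)
    diff = X[:, None, :] - X[None, :, :]
    K = sigma0_2 * np.exp(-0.5 * np.einsum('ijk,k->ij', diff**2, 1.0 / Lam)) \
        + 1e-10 * np.eye(n)                                   # covariance matrix K
    # E[kappa(X, x)], Eq. (9.52a)
    A_inv = np.linalg.inv(np.diag(Lam) + np.eye(d))
    z = sigma0_2 * np.linalg.det(Lam_inv + np.eye(d))**-0.5 \
        * np.exp(-0.5 * np.einsum('ij,jk,ik->i', X, A_inv, X))
    # E E'[kappa(x, x')], Eq. (9.52b)
    zz = sigma0_2 * np.linalg.det(2 * Lam_inv + np.eye(d))**-0.5
    K_inv_y = np.linalg.solve(K, Y - beta)
    mu_I = beta + z @ K_inv_y                                 # Eq. (9.50)
    var_I = zz - z @ np.linalg.solve(K, z)                    # Eq. (9.51)
    return mu_I, var_I

# Example usage with an illustrative 1D design:
# X = np.linspace(-2.5, 2.5, 8)[:, None]; Y = X[:, 0]**2 * np.sin(4 * X[:, 0]) + 1.0
# mu_I, var_I = bq_posterior(X, Y, beta=np.mean(Y), sigma0_2=np.var(Y), Lam=np.array([0.5]))
```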
As has been stated above, the posterior variance σ_I² summarizes the numerical error of the mean estimate μ_I, under the assumption that the integrand y(x) can be described as a realization in the RKHS associated with the kernel κ. Leaving this assumption aside, the demand of accelerating the cubature convergence is equivalent to designing the integration points such that the posterior variance is reduced as quickly as possible as points are added.

Motivated by the above idea, an acquisition function, called the Posterior Variance Contribution (PVC) function, has been developed as follows (Wei et al. 2020):

L_PVC(x) = π(x) E'[c_y(x, x')].   (9.53)
By definition, the PVC function value at x integrates the posterior correlation information of ŷ(x) at x with that at all the other points across the integration domain. Comparing Eq. (9.53) with Eq. (9.51), it can be found that σ_I² = ∫ L_PVC(x) dx. The above two observations reveal that the PVC function value at x measures the contribution of the prediction error of y(x) at x to σ_I², with consideration of its correlation across the integration space. The point with the highest PVC value can then be identified and added to the training data set D to achieve the highest reduction of the posterior variance of Î.
For the squared exponential kernel κ and Gaussian density π, the PVC function admits the closed form:

L_PVC(x) = π(x) · (E'[κ(x, x')] − κ(X, x)^T K^{−1} E'[κ(X, x')]),   (9.54)

where the analytical expression of E'[κ(X, x')] is given by Eq. (9.52a), and that of E'[κ(x, x')] is formulated as:

E'[κ(x, x')] = σ_0² |Λ^{−1} + I|^{−1/2} exp(−(1/2) x^T (Λ + I)^{−1} x).   (9.55)
The PVC function usually shows multi-modal behavior, meaning that multiple peaks exist. It is then necessary to find the global maximum point. As the PVC function admits a closed form and is computationally very cheap, evolutionary optimization algorithms such as particle swarm optimization (PSO) are recommended.
We use a one-dimensional integral with integrand y(x) = x² sin(4x) + 1 and standard Gaussian weight density as an example to illustrate the method. The active learning process is initialized with four training integration points randomly drawn between −2.5 and 2.5, and then the PVC function is utilized for the adaptive design of the integration points. The active learning process stops when the posterior coefficient of variation, defined as σ_I/μ_I, is less than 1%. With the above setting, the algorithm adaptively produces twelve more points. The results, including the posterior 95% confidence intervals, the integration points, and the PVC functions at sample sizes n = 4, 7, 12, and 16, are shown in Fig. 9.7. The first column shows the results generated with the four initial integration points, together with the global maximum point identified by the PSO algorithm. This point is then added to the training data set for the step-by-step active learning. The multi-modal behavior of the PVC function is also clearly shown in the second row of Fig. 9.7. The evolution of the posterior mean and the posterior 95% confidence intervals of the integral Î is also schematically shown in Fig. 9.8. As can be seen, equipped with the PVC function, the
Fig. 9.7 Evolution of the GPR model and PVC function against the training data size for the
one-dimensional integral example
Fig. 9.8 Evolution of the posterior confidence interval estimates against the training data size
posterior distribution of Î shrinks to the true value with a high rate, demonstrating
the effectiveness of the active learning scheme.
The probability of failure is defined as

p_f = ∫_{R^d} I_F(x) π(x) dx,   (9.56)

where p_f denotes the probability of failure, π(x) is the joint probability density of the input variables, and I_F(x) is the indicator function of the failure domain, defined as:

I_F(x) = 1 if y(x) < 0, and I_F(x) = 0 otherwise.   (9.57)

In Eq. (9.57), y(x) is called the limit state function, with y(x) < 0 implying the failure state of the structure, and y(x) ≥ 0 indicating the safe state. The surface defined by y(x) = 0 is called the failure surface or limit state surface, which separates the input domain R^d into the failure domain F = {x : y(x) < 0} and the safe domain S = {x : y(x) ≥ 0}. In practical applications, the limit state function y(x) is governed by one or more expensive-to-evaluate simulators (e.g. a finite element model), and the problem of reliability assessment is then formulated as estimating p_f with a given accuracy and the fewest possible calls of y(x). This is usually challenging, especially when the probability of failure is extremely small (typically less than 10^{-6}).
Given a GPR model ŷ(x) trained with D for approximating the limit state function, p̂_f denotes the random estimate of p_f obtained by replacing y(x) with ŷ(x) in Eqs. (9.56) and (9.57). Clearly, p̂_f does not follow a Gaussian distribution, but its posterior mean and variance can be derived as the mean prediction and as a measure of the numerical error, respectively. From the definition of the indicator function given in Eq. (9.57), the posterior mean μ_{I_F}(x) is formulated as:
μ_{I_F}(x) = Pr(ŷ(x) < 0) = Φ(−μ_y(x)/σ_y(x)),   (9.59)

where Φ(·) indicates the cumulative distribution function (CDF) of the standard Gaussian distribution. The posterior covariance c_{I_F}(x, x') of Î_F(x) can also be formulated exactly, but its computation can be cumbersome. Instead, an upper bound is derived based on the Cauchy–Schwarz inequality as (Dang et al. 2021):

c_{I_F}(x, x') ≤ σ_{I_F}(x) σ_{I_F}(x'),   (9.60)

where σ²_{I_F}(x) refers to the posterior variance of Î_F(x) and is expressed as:

σ²_{I_F}(x) = Φ(−μ_y(x)/σ_y(x)) (1 − Φ(−μ_y(x)/σ_y(x))).   (9.61)
The posterior mean of p̂_f does not admit a closed form, but it can be estimated with the sample matrix S (whose rows are random samples drawn from the density π(x)) as:

μ_{p_f} = E[Φ(−μ_y(x)/σ_y(x))] ≈ (1/n) Σ_{j=1}^n Φ(−μ_y(x_j)/σ_y(x_j)) ≜ μ̂_{p_f},   (9.62)

where μ_y(x_j) and σ_y²(x_j) refer to the posterior mean and variance at the jth row x_j of S. Similarly, an MCS estimator can also be derived for the exact posterior variance based on the GPR predictions for S, but it is computationally cumbersome. Based on Eq. (9.60), an upper bound as well as the corresponding MCS estimator for the posterior variance of p̂_f are given by:

σ²_{p_f} = E E'[c_{I_F}(x, x')] ≤ (E[σ_{I_F}(x)])²
        ≈ ((1/n) Σ_{j=1}^n √(Φ(−μ_y(x_j)/σ_y(x_j)) (1 − Φ(−μ_y(x_j)/σ_y(x_j)))))² ≜ σ̂²_{p_f}.   (9.63)
The key to active learning for BRA is to propose an effective acquisition function, with which the design points can be adaptively selected from S. This acquisition function should be designed to meet the target of reducing the prediction error of p̂_f at a high rate with respect to the training data size n_t. Since the proposal of the
AK-MCS method, quite a lot of acquisition functions for this purpose have been
developed, but here only the U-function will be introduced.
Given a GPR model ŷ(x), the U-function for an arbitrary point x is defined as:

U(x) = |μ_y(x)| / σ_y(x).   (9.64)

The U-function can be explained as follows. Given μ_y(x) < 0, the probability of ŷ(x) being positive is:

Pr(ŷ(x) > 0 | μ_y(x) < 0) = Φ(μ_y(x)/σ_y(x)) = Φ(−U(x)).   (9.65)

On the contrary, given μ_y(x) > 0, the probability of ŷ(x) being negative is:

Pr(ŷ(x) < 0 | μ_y(x) > 0) = Φ(−μ_y(x)/σ_y(x)) = Φ(−U(x)).   (9.66)
Therefore, in either case, Φ(−U(x)) measures the probability that the sign of y(x) is misdiagnosed by the posterior mean μ_y(x). Then, by adding the point in S with the highest value of Φ(−U(x)), or equivalently the lowest value of the U-function, to the training data set D, the prediction accuracy for p_f is expected to improve the most. When the upper bound σ̂_{p_f}/μ̂_{p_f} of the posterior coefficient of variation of p̂_f is less than a threshold, e.g. 5%, the active learning process can be stopped; the estimator in Eq. (9.62) can then be used for predicting p_f with high accuracy, and the estimator in Eq. (9.63) can be used for summarizing an upper bound of the prediction error. One notes that the above conclusions apply only when the sample size n is large enough. The sample matrix S can also be adaptively enlarged during the active learning process based on the coefficient of variation of the MCS estimator given in Eq. (9.62).
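A compact sketch of the resulting U-function-based active learning loop is given below; it relies on the GaussianProcessRegressor of scikit-learn as the surrogate (a tooling assumption), uses a made-up cheap limit state function as a stand-in for an expensive simulator, and omits the safeguards of a production implementation.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

def limit_state(x):
    """Illustrative cheap limit state; a real application calls an expensive simulator."""
    return 0.8 + x[:, 0]**3 + x[:, 1] + 2.0

rng = np.random.default_rng(0)
S = rng.standard_normal((10_000, 2))                   # MCS sample matrix S
idx = rng.choice(len(S), size=12, replace=False)       # n_t = 12 initial training points
X_t, Y_t = S[idx], limit_state(S[idx])

for _ in range(200):
    gp = GaussianProcessRegressor(ConstantKernel() * RBF([1.0, 1.0]), normalize_y=True)
    gp.fit(X_t, Y_t)
    mu, sd = gp.predict(S, return_std=True)
    sd = np.maximum(sd, 1e-12)
    phi = norm.cdf(-mu / sd)                           # Eq. (9.59) at every sample of S
    mu_pf = phi.mean()                                 # Eq. (9.62)
    sd_pf = np.mean(np.sqrt(phi * (1.0 - phi)))        # upper bound, Eq. (9.63)
    if sd_pf / mu_pf < 0.05:                           # stopping criterion (5%)
        break
    best = np.argmin(np.abs(mu) / sd)                  # lowest U-value, Eq. (9.64)
    X_t = np.vstack([X_t, S[best]])
    Y_t = np.append(Y_t, limit_state(S[best:best + 1]))

print("estimated p_f:", mu_pf, "with", len(Y_t), "limit state calls")
```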
The above active learning algorithm for BRA is illustrated with an academic example. The limit state function is expressed as

y(x_1, x_2) = Σ_{i=1}^4 c_i exp(−α_{i1} (x_1 − β_{i1})² − α_{i2} (x_2 − β_{i2})²) + 0.8,

with α = [2 3 1 4; 3 2 4 1], β = [−0.5 0.5 −0.5 0.5; −0.5 −0.5 0.5 0.5], c = (1, −1.5, −1.5, 2), where the ith columns of α and β collect (α_{i1}, α_{i2}) and (β_{i1}, β_{i2}), and both x_1 and x_2 follow the standard Gaussian distribution. The active learning procedure is initialized with n = 10^4 samples S and n_t = 12 initial training samples D, and the training process is shown in Fig. 9.9. The algorithm adaptively produces 61 more points before reaching the convergence condition σ̂_{p_f}/μ̂_{p_f} < 5%. Thus, the total number of limit state function calls is 73. The posterior mean estimate μ̂_{p_f} is 0.0709, with the upper bound σ̂_{p_f}/μ̂_{p_f} being 4.89%. Compared with the reference solution estimated by MCS with 10^6 samples, which is 0.0723 with a coefficient of variation of 0.36%, the result generated by active learning is accurate and robust.
One notes that the above MCS-based active learning procedure only applies to relatively large failure probabilities (e.g. p_f > 10^{-3}), as for smaller probabilities of failure a larger sample size n is required, making the selection of each training point computationally inefficient. For tackling this challenge, active learning procedures combined with advanced MCS schemes have been developed. One can refer to Ref. Wei et al. (2019) for an example of the combination of active learning with subset simulation, and Ref. Song et al. (2021) for a combination of active learning with line sampling, both of which apply to extremely small probabilities of failure (e.g. 10^{-9}).
9.5 Conclusions
This chapter has investigated and compared a set of commonly used regression
models and methods for machine learning in engineering computations. Three non-
Bayesian models, i.e. the LSR, the RR for improving generalization performance of
LSR, and the SVR, have been first examined. These three models are all based on
defining the hypothesis set in a parametric form, and then searching the parameter
values by minimizing specific empirical loss functions defined on the training data.
The difference among these methods presents mainly on the definition of the loss
functions. An advantage of these three parametric methods is that they admit a
natural extension with the kernel trick. This is of special importance as it allows the
regression model to capture more or even infinite features of the training data with a
computational complexity linearly dependent on the training data size.
In contrast to the non-Bayesian models searching for an optimal element from the
hypothesis set, the Bayesian regression models are trained by assigning a (subjective)
probability distribution model on the hypothesis set based on the beliefs learned from
the training data. This enables the prediction at any unobserved point to be modeled
by a subjective probability distribution with its variation summarizing the prediction
error. It is also shown that, under a specific setting, the posterior mean prediction of
a Bayesian regression model may coincide with a non-Bayesian regression model.
The probabilistic feature of the Bayesian regression models provides a possibility
for devising Bayesian numerical methods for addressing the numerical analysis tasks,
and the Bayesian cubature and Bayesian reliability assessment have been examined
as examples. The common feature is that, except for the mean predictions for the
quantities of interests, the associated numerical errors are treated as epistemic uncer-
tainty and summarized by the posterior variances. The active learning procedures are
then introduced for both Bayesian cubature and Bayesian reliability assessment, and
shown to be effective in reducing the number of required simulator calls to achieve
a specific accuracy level. Other Bayesian numerical algorithms, such as Bayesian
optimization (Huang et al. 2006) and Bayesian ODE/PDE solver (Wang et al. 2021;
Chen et al. 2021), are not investigated due to limited space, but the readers can refer
to corresponding references for these cutting-edge topics.
References
10.1 Introduction
The past two decades saw tremendous developments in artificial intelligence (AI).
Advancements in software, algorithms, and hardware led to the development of
significantly more accurate and versatile artificial intelligence models. This rendered
artificial intelligence a powerful tool that is used in diverse scientific areas, e.g.
medicine and drug design, economics, and self-driving cars, among many others.
These methods, having been successfully implemented in the simulation and
modeling of structures (Lu et al. 2022; Solorzano and Plevris 2022), found their
way to topology optimization problems, where artificial intelligence appears to have
great potential for successful implementation.
In conventional topology optimization, the optimal design of a specific domain
must be calculated subject to specific constraints and the objective is to minimize
the total compliance of the structure and use a specific amount of material. This
is typically an iterative process that involves large matrices and can be very time-
consuming. By means of artificial intelligence models, referred to also as surrogate
models (or surrogates), the computing time can be reduced significantly. The surrogate model is trained offline a priori. Subsequently, during the optimization process, the trained surrogate is evaluated on the input data, which is much faster because only a limited number of matrix multiplications has to be performed. The usual process involves either an artificial intelligence surrogate that complements the conventional procedure to reduce computational costs, or a standalone surrogate which calculates the whole optimized
structures by itself. The AI surrogates that are used belong to two main categories, i.e. surrogates that operate on densities and surrogates that operate on images.

The surrogates that use density have inputs similar to the conventional method, since the optimization process uses the density of the structure, which is updated in each iteration of the AI model. The surrogates that perform optimization on images are somewhat different, because they use techniques like image segmentation and filtering to output the optimized image (structure), which is then mapped into a density field. Most surrogates can be used for 2D and 3D structures and they are transferable, meaning that once trained they can be used in another topology optimization problem (thermodynamics or a different material).
The Background section contains an introduction to artificial intelligence, the
surrogate models that will be used and an introduction to conventional topology
optimization. The Literature Survey section provides a review of recent advancements
of topology optimization using artificial intelligence models. This section is divided
into two parts, the first describing the models that use density and the second the
models that use image-based approaches.
10.2 Background
Conventional topology optimization is commonly based on the SIMP (Solid Isotropic Material with Penalization) interpolation, in which the Young's modulus of element e is expressed in terms of its density x_e as

E_e(x_e) = E_min + x_e^p (E_0 − E_min),   (10.1)

where E_0 is the original value of Young's modulus, E_min is a very small positive value used to avoid the stiffness matrix becoming singular, and p is a penalization factor that penalizes intermediate densities, which are constrained to lie in [0, 1]. The optimization problem is formulated as follows:
min_x : c(x) = U^T K U = Σ_{e=1}^N E_e(x_e) u_e^T k_0 u_e
subject to :
  V(x)/V_0 = f
  K U = F
  0 ≤ x ≤ 1,   (10.2)
where c is the compliance that needs to be minimized, U and F are the global
displacement and force vectors, respectively, K is the global stiffness matrix, u e is
the element displacement vector, k0 is the element stiffness matrix for an element
with unit Young’s modulus, x is the vector of design variables, N is the number of
variables, V (x) and V0 are the material volume and design domain volume, respec-
tively, and f is the volume fraction that is chosen beforehand. The updates of the
new densities are based on the optimality criteria:
x_e^new = max(0, x_e − m)   if x_e B_e^η ≤ max(0, x_e − m)
        = min(1, x_e + m)   if x_e B_e^η ≥ min(1, x_e + m)
        = x_e B_e^η          otherwise,   (10.3)
where m is a positive move limit, η is a numerical damping coefficient, and B_e is obtained from the optimality condition:

B_e = (−∂c/∂x_e) / (λ ∂V/∂x_e),   (10.4)
where the Lagrange multiplier λ is chosen so that the volume constraint is satisfied. The sensitivities of c and V with respect to x_e are

∂c/∂x_e = −p x_e^{p−1} (E_0 − E_min) u_e^T k_0 u_e,
∂V/∂x_e = 1.   (10.5)
The last step of the process is to make sure that the densities do not produce checkerboard-like patterns; for this reason, a filter is applied.
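For illustration, a minimal sketch of the optimality-criteria update of Eqs. (10.3)–(10.5) is given below, in the spirit of standard educational SIMP codes; equal element volumes are assumed so that the volume fraction equals the mean density, and all names are illustrative.

```python
import numpy as np

def oc_update(x, dc, dv, volfrac, move=0.2, eta=0.5):
    """One optimality-criteria update, Eqs. (10.3)-(10.4); the Lagrange multiplier
    is found by bisection so that the volume constraint V(x)/V0 = f is satisfied."""
    l1, l2 = 1e-9, 1e9
    while (l2 - l1) / (l1 + l2) > 1e-3:
        lmid = 0.5 * (l1 + l2)
        Be = -dc / (lmid * dv)                     # Eq. (10.4)
        x_new = np.clip(x * Be**eta,               # Eq. (10.3)
                        np.maximum(0.0, x - move),
                        np.minimum(1.0, x + move))
        if x_new.mean() > volfrac:                 # assumes equal element volumes
            l1 = lmid
        else:
            l2 = lmid
    return x_new

# dc = -p * x**(p - 1) * (E0 - Emin) * ue_k0_ue   # compliance sensitivity, Eq. (10.5)
# dv = np.ones_like(x)                            # volume sensitivity, Eq. (10.5)
```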
Artificial intelligence is a large area that involves many scientific fields, such as mathematics and computer science, and encompasses many algorithms that can "learn" by example in order to solve a problem. By examining many different data points, these algorithms can approximate the underlying function that describes the problem. Thus, after the learning process has been completed, when a new example
is presented, they can predict a result based on the previous training. There are three
main categories that describe those algorithms:
• Supervised learning: In this kind of learning the samples that the model uses for
training have labels, meaning that after the model predicts an output based on the
input it received, it has the labels of the “ground truth” value of the sample and then
it compares the “ground truth” value with the one that it predicted and corrects
itself accordingly. Some of the most popular supervised learning models used are
neural networks, support vector machines, and k-nearest neighbor. Especially in
the case of neural networks, the last decades have seen great advancements by
using deep neural networks which utilize many hidden layers and a very large
number of nodes.
• Unsupervised learning: This can be considered as the opposite of supervised
learning. Here the data are unlabeled, and the model draws conclusions based
on the statistical properties of the data, e.g. the relevant clusters and distances.
The most popular models for unsupervised learning are K-means, self-organizing
maps, and principal component analysis. The main difference between unsuper-
vised and supervised learning is that in unsupervised learning there are no labels
(“ground truth” values) used during the training of the model.
• Reinforcement learning: This is a different method of learning, which does not
have many applications in topology optimization. Central to this method is the
notion of the agent (robot) that learns an optimal behavior by interacting with its
environment and changing its behavior accordingly.
In the topology optimization literature, the process is most often carried out using deep neural networks. Neural networks are universal function approximators (Hornik et al. 1989). Hence, by properly training an artificial neural network, it can approximate the function that underlies the topology optimization problem. Each node of the neural network performs an operation between its input and the weights and biases of the model. After a forward pass through all the layers, the network calculates the error and, by backpropagating it, corrects its weights to minimize the prediction error. This
process is iterative, where each iteration in which the model sees the entire dataset is
termed an epoch. The number of epochs that are chosen depends on the problem, the
amount of data, and the type of the model. The process of training a neural network
consists of three steps. The first step is to split the data into the training set, the
validation set, and the testing set. This split must be uniformly distributed, and the
class representation must be the same in all three sets. The second step is to train the
neural network using the training set and use the validation set periodically during the
training to test that the model doesn’t overfit. Overfitting is an undesirable outcome
of the training, where the neural network keeps reducing the error associated with
the training data, while the error associated with other data (validation data, testing
data, or others) increases, which means that the network overfocuses on the specific
training data and has lost its generalization capabilities. The last step is to use the
testing set, which the model has not seen before, to measure how well the network
performs when confronted with new data. Another issue that must be solved during
the training is defining the architecture of the neural network, which has to do with the
number of hidden layers and the number of nodes in each hidden layer. If the number of layers/nodes is too small, the model will not have enough learning capacity to approximate the function properly, while if the number is too large the model will overfit the dataset and will not perform well on the test set (a high-variance, low-bias model).
Another variation of deep neural networks that is often used in topology optimization is the convolutional neural network (CNN). These models take as their input an image (either 2D or 3D) instead of a simple 1D vector. Then, in each node, instead of the usual multiplication between weights and input, they perform a convolution with a small square kernel:

(I ∗ K)(i, j) = Σ_{m=−∞}^{∞} Σ_{n=−∞}^{∞} I(m, n) K(i − m, j − n).   (10.6)

The advantage of these models is that they can unravel localized relations in the image because they use information from an area instead of a single number.
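A minimal sketch of the discrete 2D convolution of Eq. (10.6) (valid padding, single channel) is given below; the kernel flip distinguishes a true convolution from the cross-correlation that most deep learning libraries actually implement, and the edge-detection kernel in the usage note is only an illustrative choice.

```python
import numpy as np

def conv2d(image, kernel):
    """Discrete 2D convolution of Eq. (10.6) with valid padding, single channel."""
    kh, kw = kernel.shape
    H, W = image.shape
    flipped = kernel[::-1, ::-1]                  # flip kernel: convolution, not correlation
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * flipped)
    return out

# Example: a 3x3 edge-detection kernel applied to a 2D density field (illustrative):
# edges = conv2d(density, np.array([[-1, -1, -1], [-1, 8, -1], [-1, -1, -1]]))
```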
Patel and Choi (2012) harness the power of Probabilistic Neural Networks (PNN) and develop an optimization methodology that treats probabilistic constraints under uncertainty. Probabilistic Neural Networks rely on Bayesian inference (Clarke 1974) to make decisions, and on the Parzen nonparametric estimator (Parzen 1962) for the estimation of the probability density functions. The three main benefits of using a probabilistic approach are the following: (i) easy interpretation of the results, (ii) efficiency in treating nonlinear structures or disjoint failure domains, and (iii) usefulness in treating uncertainty. The described network is both easy to implement and to interpret. Its training strategy relies on reducing the expected risk of each class (failure or not). For example, suppose that θ belongs either to class θ_A or θ_B, and the data vector is p-dimensional, X^T = [X_1, X_2, ..., X_p]; the Bayes decision rule is

d(x) = θ_A  if h_A l_A f_A(X) > h_B l_B f_B(X)
     = θ_B  if h_A l_A f_A(X) < h_B l_B f_B(X),   (10.7)

where f_A(X) and f_B(X) are the probability density functions (PDFs) of the two classes A and B, and l_A is the loss function associated with the decision d(x) = θ_B when θ = θ_A. Also, h_A is the a priori probability of occurrence of class A and h_B = 1 − h_A. The loss function used during the training is
l = (h A − θ A )2 + (h B + θ B )2 (10.8)
The architecture of the probabilistic neural network consists of four layers: (i)
the input layer, (ii) the pattern layer, (iii) the summation layer, and (iv) the output
layer. The output of the pattern layer using the Parzen window nonlinear function
forms the PDF. A reliability analysis can be incorporated into the deterministic
topology optimization method. This process is called Reliability-Based Topology
Optimization (RBTO) and employs a probabilistic constraint such that
max / min_b :  f(b)
subject to:
P_j[ g_j(b, x) < 0 ] ≤ R_j
Σ_{i=1}^{N} A_i L_i − V* ≤ 0        (10.9)
K · u = F
b_l ≤ b ≤ b_u
where f(·) represents the objective function, g_j(·) the limit-state function, b is the
vector of deterministic design variables, and x is a random vector. P_j denotes the
probability of the event, and the probability of failure can be expressed as P_j[g_j(·) <
0]. A_i is the cross-sectional area of element i and L_i its length, V* denotes the volume
of material that can be used in the final design, b_l and b_u are the lower and upper
bounds of the design variables (the cross-sectional areas of the elements),
K is the global stiffness matrix, u is the global displacement vector, and F is the
nodal load vector. The PNN consists of two blocks, where the first block performs
the topology optimization and the second block the reliability analysis. The force,
boundary conditions, and the final volume V ∗ are used as an input. The PNN is used
to reduce the instances where the Finite Element Analysis routine is invoked, thus
reducing the computational cost of the entire process.
In Liu et al. (2015), a nonlinear multi-material topology optimization is developed
using unsupervised machine learning algorithms. Unsupervised algorithms construct
clusters of data based on their similarity. The model takes as an input the normal-
ized material parameter xe , where 0 ≤ xe ≤ 1. The K-means algorithm provides
the initial design of the structure. The final optimization design is obtained with
a metamodel-based multi-objective optimization strategy. This strategy consists of
five steps: Sampling, Simulation, Metamodel fit, Optimization, and Point selection for
the metamodel update. Sampling and simulation consist of choosing various design
experiments and functions evaluated on those designs required to fit the metamodel.
The fit of the metamodel pertains to fitting all known functions to approximate the
design responses that have not yet been evaluated. The model that is used in this stage
is the Kriging metamodel with a linear regression kernel and spherical correlation.
The last stage of the optimization step uses a multi-objective genetic algorithm to
find the Pareto front. The Pareto front consists of solutions whose objectives are not
dominated by other solutions. The algorithm continues until the difference between
the solution of the model and the “ground truth” value is acceptable. The solution of
the model was compared with solutions from the SIMP optimization algorithm. The
proposed method achieves designs highly similar to those of SIMP, at
a reduced computational cost.
Lei et al. (2018) develop a real-time topology optimization procedure using
machine learning. The method proposed is based on the Moving Morphable Compo-
nent (MMC), where a set of morphable components are used as the basic building
blocks. The optimization process consists of morphing, merging, and overlap-
ping operations on those elements to achieve the final structure. The machine
learning problem is formulated as follows: Suppose there is a set of parameters
p = ( p1 , ..., pnp )T denoting the location of the concentrated load and D opt the
vector of optimal designs, approximated as a linear combination of eigenvectors
v_1 ∈ R^{7n}, ..., v_M ∈ R^{7n}, where n is the number of components. There are seven
design variables for each component. Hence, D_opt can be expressed as

D_opt(p) = Σ_{i=1}^{M} w_i(p) v_i        (10.10)
where v_i and b_i are the state and bias of the ith visible unit, h_j and b_j are the state and
bias of the jth hidden unit, and w_ij is the weight coefficient of the connection between
those units. The state of the network with the lowest energy is the one with the highest
probability,

p(v, h) = (1/Z) e^{−E(v,h)}        (10.13)

where Z is

Z = Σ_{i=1}^{i_max} Σ_{j=1}^{j_max} e^{−E(v,h)}        (10.14)
The difference between restricted and normal Boltzmann Machines is that in the
restricted ones there are no connections between the hidden units. A Deep Belief
Network (DBN) is eventually created by combining multiple RBM. The hidden
layer of one RBM is the visible layer (input layer) of the next RBM. The training
of the whole model involves two steps: first each RBM is trained individually using
unsupervised learning and then the whole model is trained using supervised learning.
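The energy function E(v, h) defining Eqs. (10.13) and (10.14) is not reproduced in this excerpt, so the sketch below assumes the conventional RBM energy E(v, h) = −a·v − b·h − v^T W h and evaluates p(v, h) by brute-force enumeration of the partition function, which is feasible only for a toy-sized network:

```python
import numpy as np
from itertools import product

def energy(v, h, a, b, W):
    """Standard RBM energy: E(v, h) = -a.v - b.h - v.W.h (assumed form)."""
    return -(a @ v) - (b @ h) - (v @ W @ h)

rng = np.random.default_rng(1)
n_v, n_h = 3, 2
a, b = rng.normal(size=n_v), rng.normal(size=n_h)
W = rng.normal(size=(n_v, n_h))

# Partition function Z (Eq. 10.14) by brute force over all binary states
states_v = list(product([0, 1], repeat=n_v))
states_h = list(product([0, 1], repeat=n_h))
Z = sum(np.exp(-energy(np.array(v), np.array(h), a, b, W))
        for v in states_v for h in states_h)

# Joint probability of one configuration (Eq. 10.13)
v0, h0 = np.array([1, 0, 1]), np.array([1, 0])
p = np.exp(-energy(v0, h0, a, b, W)) / Z
print(f"p(v0, h0) = {p:.4f}")
```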
The proposed method outputs a density value for each point and is also integrated
with SIMP to accelerate the optimization process. At the beginning, some initial
iterations are performed using SIMP and then the DBN predicts the densities. The
training, validation, and test datasets are constructed by solving the optimization
problem with SIMP. The dataset, built from cantilever examples, contains 480,000
samples. The proposed methodology achieves a reduction in SIMP iterations as high
as 90% with a loss similar to the one achieved by SIMP. It also scales to large numbers
of finite elements and can be applied to both 2D and 3D structures.
In Kallioras and Lagaros (2021b), deep belief neural networks are used to accel-
erate the topology optimization process by skipping SIMP iterations while using
the AI models to predict the desired density. SIMP is run for the initial iterations
and then the models finalize the design. The proposed method improves the method
introduced above by harnessing the power of DBNs for quick calculations on higher
orders (fine mesh) and using SIMP on a coarse mesh to assist the models. This method
accelerates the whole process by at least one order of magnitude. In Kallioras and
Lagaros (2020), a sequential collection of DBNs is introduced, which takes as input
the initial iterations of SIMP and tries to find hidden patterns and correlations between
the initial densities of the finite elements and the final densities. An improvement of
the previous models is introduced in Kallioras et al. (2021), where the models are
reduced in their order and, by using deep learning, the results are extrapolated to a
fully refined model, thus achieving great accuracy and speed-ups as high as 80%.
Another interesting method is introduced in Kallioras and Lagaros
(2021a): a tool that generates equivalent shapes given a number of elements, forces,
and supports as input. The produced shapes are not optimized
but are a collection of different shapes that act as a design inspiration. The output
of the process is compatible with 3D printers. The tool is powerful for prototyping
designs. It uses SIMP and Long Short-Term Memory Neural Networks (LSTM NNs)
and image processing methods to generate the shapes.
The work by White et al. (2019) focuses on large macroscale structures with
spatially varying metamaterials. To calculate the density of each element, a neural
network is used with a Gaussian activation function, i.e.
φ(x) = e^{−x²}        (10.15)
With the addition of the Gaussian function, the neural network emulates radial
basis function interpolation. The weights and biases of the neural network consist of
the scaling and offset of the Gaussian function and are calculated from the training
process. The neural network uses both the actual densities as an input and their
derivatives, for better accuracy. Apart from the use of the Gaussian function, the
architecture is a classical one-hidden-layer neural network. Experiments were run
with different numbers of neurons in the hidden layer. The results show that when the
dataset is small, using derivative data is largely beneficial (leading to 9 times smaller
error when using derivatives). However, the use of derivatives becomes irrelevant
when the dataset is large and the neural network has enough capacity.
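A minimal sketch of a one-hidden-layer network with the Gaussian activation of Eq. (10.15), which behaves like radial basis function interpolation. For simplicity only the output weights are fitted by least squares on a toy 1D function; in the actual method of White et al. (2019) the scaling and offset of the Gaussians are trained as well, and derivative data can be included:

```python
import numpy as np

def rbf_net(x, centers, scales, weights, bias):
    """One-hidden-layer network with Gaussian activation phi(z) = exp(-z^2), Eq. (10.15)."""
    z = (x[:, None] - centers[None, :]) * scales[None, :]  # scaled distance to each neuron's centre
    phi = np.exp(-z ** 2)
    return phi @ weights + bias

# Fit only the output weights by least squares on a toy 1D target function
x_train = np.linspace(0, 1, 20)
y_train = np.sin(2 * np.pi * x_train)
centers = np.linspace(0, 1, 8)
scales = np.full(8, 4.0)
Phi = np.exp(-(((x_train[:, None] - centers[None, :]) * scales[None, :]) ** 2))
weights, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)
print(rbf_net(np.array([0.25, 0.5]), centers, scales, weights, bias=0.0))
```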
In Chandrasekhar and Suresh (2021), the density function is represented by the
weights of the Neural Network. The difference in this approach is that it does not
try to accelerate the classic SIMP process by skipping some steps using NNs (image
segmentation methods, etc.). Rather, the NN is used to directly perform Topology
Optimization using its weights. Instead of representing the density field by a finite
element mesh, it is represented by the activation functions of the NN. The NN
outputs a density value for each point of the domain thus converting the optimization
setup from a constrained into an unconstrained problem with penalization. A fully
connected NN is used that can treat 2D and 3D structures and outputs a value ρ,
the density value at any point. The loss function needed to train the neural network
is given by the following expression:

L(w) = (u^T K u)/J_0 + α ( Σ_e ρ_e v_e / V* − 1 )²        (10.16)

Loss = L_1 [x_13 − (x_13)_0]² + L_2 Σ_{i=1}^{3} ( ∂x_13/∂x_11 − (∂x_13/∂x_11)_0 )²        (10.17)
the elements. The size of the input depends on the discretization of the mesh. The
density is considered as a dependent variable and the NN weights are the indepen-
dent variables. The second part requires converting the density outputted by the NN
to physical density corresponding to the physical constraints set in the formulation
of the problem. Filtering is used to achieve this. The third part is to perform Finite
Element Analysis on each iteration after obtaining the structure topology. The final
step of the process is to calculate the loss function to be minimized. In traditional
SIMP or BESO, the derivatives of the objective function need to be calculated to
perform sensitivity analysis. However, in this case, these are directly calculated via
automatic differentiation during the backpropagation step.
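A minimal sketch of evaluating the loss of Eq. (10.16) for given displacements, densities, and element volumes. The numbers are placeholders; in the actual method the densities come from the neural network, the displacements from a finite element solve, and the gradient of this scalar is obtained by automatic differentiation:

```python
import numpy as np

def tounn_loss(u, K, rho, v_e, J0, V_star, alpha):
    """Loss of Eq. (10.16): scaled compliance u^T K u / J0 plus a quadratic penalty
    on the violation of the volume constraint sum(rho_e * v_e) / V* = 1."""
    compliance = u @ K @ u
    volume_violation = np.sum(rho * v_e) / V_star - 1.0
    return compliance / J0 + alpha * volume_violation ** 2

# Toy numbers: 4 elements of unit volume, 3 free degrees of freedom
K = np.diag([2.0, 3.0, 4.0])          # stand-in stiffness matrix
u = np.array([0.1, 0.2, 0.05])        # stand-in displacement vector
rho = np.array([0.9, 0.4, 0.6, 0.2])  # densities predicted by the network
v_e = np.ones(4)
print(tounn_loss(u, K, rho, v_e, J0=1.0, V_star=2.0, alpha=10.0))
```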
The neural network architecture is a decoder network, starting with a fully connected
layer that linearly transforms features from one space to another. This is then followed
by sequential upsampling and convolutional layers. Furthermore, a direct copy of the
input to the output is implemented, the same as the ones U-nets
use. An important step after the calculation of the density from the neural network is
to convert it to physical density according to the definition of the problem. Conven-
tional methods are not suffering from this problem because the design constraints
are considered when calculating the density. This conversion happens in two steps:
(1) Make x satisfy the [0, 1] constraint by applying the sigmoid transformation to
the output layer:
x̄_i = 1 / ( 1 + e^{x_i − b(x,V)} )        (10.18)

Σ_{i=1}^{N} x_phy,i v_i = Σ_{i=1}^{N} x̄_i v_i        (10.20)
which equates the volume before and after the projection. Solving Eq. (10.20)
results in the value of η, which is used in the projection equation. Higher values
of β correspond to thinner branches being eliminated from the final structure,
which also makes the manufacturing process easier. Compared with the
SIMP method used in Andreassen et al. (2011), similar results are achieved in
terms of the shape of the final structure. One notable difference is that the structure
produced by the neural network has larger and more rigid branches on the inside
due to the projection step. The proposed method can also be used
MSE = (1/N) Σ_{i=1}^{N} ‖Net(I, W) − s_i‖²        (10.21)
Another metric used is the Dice Similarity Coefficient (DSC), which measures
the similarity of the output image obtained from the neural network with the ground
truth image. The DSC assumes a value of 1 if the two images are identical:
DSC = 2|y ∩ ŷ| / ( |y| + |ŷ| )        (10.22)
After training the model, a simple threshold of 0.5 is applied to discretize the
densities into {0, 1}. Both the linear and nonlinear models achieve robust DSC = 0.958
and DSC = 0.964 on the test and the train set, respectively. Since the method does
not rely on any external FEA solver, it is very fast. By transferring the inference of
the neural network to lower-level hardware, the method can perform instantaneous
optimization of structures for linear and nonlinear materials.
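A small sketch of the thresholding and the Dice Similarity Coefficient of Eq. (10.22) applied to a pair of density images; the random 40 × 40 arrays stand in for the predicted and ground truth designs:

```python
import numpy as np

def dice_similarity(y_pred, y_true, threshold=0.5):
    """Dice Similarity Coefficient of Eq. (10.22) between two binarized density images."""
    a = (y_pred >= threshold)
    b = (y_true >= threshold)
    intersection = np.logical_and(a, b).sum()
    return 2.0 * intersection / (a.sum() + b.sum())

rng = np.random.default_rng(3)
truth = (rng.random((40, 40)) > 0.5).astype(float)
pred = truth.copy()
pred[:2, :] = 1 - pred[:2, :]   # corrupt two rows of pixels to mimic prediction error
print(f"DSC = {dice_similarity(pred, truth):.3f}")
```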
In Deng and To (2021), a new parametric method using deep learning is intro-
duced, where the level set function is described by a deep neural network. The
proposed method utilizes the ability of the deep neural networks to approximate
any function, thus it can approximate the level set function too. A critical aspect
for the convergence of the objective function during training is the initialization of
the weights with random zero-mean values. The level set theory uses a zero contour
(2D) or isosurface (3D) to represent the boundaries of geometry of the structure. The
interface of the structure is described by the zero-level set functions:
Φ(x, t) > 0,  (x ∈ Ω)
Φ(x, t) = 0,  (x ∈ ∂Ω)        (10.23)
Φ(x, t) < 0,  (x ∈ D\Ω)
where D is the design domain, Ω the admissible design domain, ∂Ω the
boundary of the shape, and t the pseudo-time. By differentiating the zero-level set,
the Hamilton–Jacobi partial differential equation (PDE) can be obtained:

∂Φ/∂t − V_n |∇Φ| = 0        (10.24)
And the objective function is
The deep neural network converts the PDE to an ordinary differential equation
(ODE). Hence, instead of solving Hamilton–Jacobi equations to update (x) and
finding the optimal design, (x) is represented by the parameters of the neural
network. The neural network is trained using the values resulting from solving
the Hamilton–Jacobi equation; these values are used as the ground truth. The resulting
designs have similar structural performance with the traditional methods, while with
different neural networks different conceptual designs can be produced.
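To illustrate the parameterization, the sketch below represents Φ(x) with a tiny fully connected network initialized with random zero-mean weights and thresholds it into a material layout following Eq. (10.23); it is only a forward pass, whereas in Deng and To (2021) the network parameters are trained on Hamilton–Jacobi solutions:

```python
import numpy as np

def mlp_level_set(x, params):
    """Tiny fully connected network representing the level set function phi(x):
    material where phi > 0, boundary at phi = 0, void where phi < 0 (Eq. 10.23)."""
    W1, b1, W2, b2 = params
    h = np.tanh(x @ W1 + b1)
    return (h @ W2 + b2).squeeze(-1)

rng = np.random.default_rng(4)
params = (rng.normal(scale=0.5, size=(2, 16)), np.zeros(16),   # random zero-mean initialization
          rng.normal(scale=0.5, size=(16, 1)), np.zeros(1))

# Evaluate phi on a grid over the design domain and threshold into a material layout
xs, ys = np.meshgrid(np.linspace(0, 1, 30), np.linspace(0, 1, 30))
grid = np.stack([xs.ravel(), ys.ravel()], axis=1)
phi = mlp_level_set(grid, params)
material = (phi > 0).reshape(30, 30)
print(f"material fraction = {material.mean():.2f}")
```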
In Patel et al. (2022), a method is introduced to overcome challenges that traditional
topology optimization struggles with, such as geometric frustration, non-smooth edges,
and dangling structures at boundaries. In addition, the method accelerates the entire
process. This method uses two deep neural networks, one that predicts
the optimized microstructures and one that improves connectivity between them.
The method has three stages:
• A macroscale topology optimization solver (SIMP) which predicts optimized
macroscale topology optimizations. It takes as an input the finite elements in each
direction, boundary conditions, Poisson ratio, Young’s modulus, and optimization
parameters.
• The second stage contains a deep learning neural network which predicts
microstructures. This model takes as an input the density and nodal deflections of
every macroscale unit of the previous step and outputs optimized microstructures.
• The third stage contains another deep neural network that improves the connec-
tivity of the whole structure and outputs the final optimized structure.
The first neural network is trained using only corner displacement nodes rather
than the whole domain, which makes the calculation of the microscale structures
faster. Stage 1 features a modified density-based SIMP approach, with the objective
of minimizing the compliance using conventional methods. The second stage uses a
model that predicts the optimized microstructures that fit well into the macrostructure. That
model has 3 sub-models, one deep neural network that maps the design variable vector
to a density distribution image, another convolutional neural network that predicts the
optimized structure, and a third post-processing solver that ensures volume fractions
constraints and optimal solutions. The third stage contains two neural networks that
improve the connectivity of the predicted microstructures. The first neural network is
a UNet that predicts an improvement in connectivity between 4 neighboring elements,
and the second neural network uses the pre-optimized corners to predict the output
to reduce the number of iterations. The third stage improves the connectivity of the
optimized structure by 17% and yields an overall 14.6% improvement in compliance.
There is also a large speed-up in the calculation of the optimized structure, roughly
10 times faster than the conventional method. The proposed method works for both
2D and 3D structures.
Three types of inputs are given to the neural network, which are (i) 3D density
distribution of voxels at iteration m (m < T ), (ii) Gradient of voxel densities between
iterations m and n (m < n), and (iii) Forces and Boundary Conditions along x, y, and
z directions, where T is the total number of iterations and m and n are intermediate
numbers of iterations. Also at the output of the network a Density Filter Function
was applied to smooth the output based on the neighbors of each voxel:

Density Filter Function:  x̄_i = ( Σ_{j∈N_i} h_ij v_j x_j ) / ( Σ_{j∈N_i} h_ij v_j )        (10.27)
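A minimal sketch of the neighborhood filter in Eq. (10.27) for a 1D row of voxels. The weight h_ij is assumed here to decay linearly with distance inside a radius r_min, which is a common choice in density filtering but is not specified in the excerpt:

```python
import numpy as np

def density_filter(x, centers, volumes, r_min):
    """Neighborhood density filter of Eq. (10.27): each voxel density becomes a
    volume- and distance-weighted average of its neighbors within radius r_min."""
    x_bar = np.empty_like(x)
    for i, ci in enumerate(centers):
        dist = np.linalg.norm(centers - ci, axis=1)
        h = np.maximum(0.0, r_min - dist)   # assumed linearly decaying weight
        w = h * volumes
        x_bar[i] = np.sum(w * x) / np.sum(w)
    return x_bar

# 1D row of 10 unit-volume voxels with a noisy 0/1 density field
centers = np.linspace(0, 9, 10)[:, None]
volumes = np.ones(10)
x = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 1], dtype=float)
print(np.round(density_filter(x, centers, volumes, r_min=1.5), 2))
```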
The architecture used is a convolutional neural network without any dense layers,
as it uses only 3D convolutional layers. Specifically, it follows an encoder–decoder
architecture and the output of the neural network is the same as the input. The
results are compared with standard linear elasticity solvers both in terms of accuracy
and speed. Another hyperparameter that is fine-tuned is the number of iterations at
which TopOpt is stopped, which needs to balance accuracy against the number of
iterations performed. The best of the neural network experiments tried achieved a
40% reduction in computational time at 96% accuracy. The results show that the
compliances of the structures calculated by the proposed and the conventional
method differ only slightly; 4.82% of the samples have large compliance errors due
to the emergence of structural disconnections. The compliance error compared with
the conventional method is 4.16% and the volume fraction error is 0.13%.
In Sosnovik and Oseledets (2019), topology optimization followed by a neural
network is used to calculate the final structure. An initial number N_0 of conventional
topology optimization iterations is performed using SIMP, and the output of SIMP is
turned into an image I used as input to the neural network. Image I is a
blurred/distorted representation of the final structure. If the full topology optimization
were performed, the final structure would contain only material and void with no
intermediate values; this structure is represented by I*. So, after performing N_0 steps,
image I is not the same as image I*. Thus, neural networks are used to perform image
segmentation that converges image I to image I*, resulting in binary densities {0, 1}.
The neural network architecture used is a fully convolutional neural network that takes
as input two grayscale images: the first image is the densities X_n as output by the last
step of topology optimization, and the second image is the difference of the densities
between two consecutive updates, δX = X_n − X_{n−1}. The output of the network is a
grayscale image of the same resolution that contains the final structure. The neural
network follows the encoder–decoder architecture with 6 convolutional layers in the
encoder and another 6 in the decoder, which are the same shape as the encoder network
but reversed. Between the convolutional layers, Max Pooling operations are used to
introduce invariance to the next layer.
The dataset used is synthetic, based on a SIMP solver for 2D structures. For the
generation of the dataset, 100 iterations of SIMP are performed for each problem;
each individual problem is defined by its constraints and load. To generate the dataset,
the following rules are used:
• The number of nodes with fixed x and y translations and the number of loads are
sampled from Poisson distributions, N_x ∼ P(λ = 2), N_y ∼ P(λ = 1)
• The load values are −1 and the probability of choosing a boundary node is 100
times higher than that of an inner node
• The volume fraction is sampled from the normal distribution f_0 ∼ N(μ = 0.5, σ = 0.1)
The total size of the dataset is 10,000 samples. Each sample consists of a tensor
with shape 100 × 40 × 40, where 100 is the number of iterations and 40 × 40 is the
grid size. During training, data augmentation is applied to the data to increase the
size of the dataset and its variation. The objective function used combines two terms:
L_conf, a binary cross-entropy term, and L_vol, the MSE of the prediction and the target.
The results are compared against the SIMP solver in terms of accuracy and time consump-
tion. The metrics used are Binary Accuracy and Intersection over Union (IoU), where
Binary Accuracy measures the pixels classified correctly over the total number of
pixels of the structure, and IoU measures the area of overlap over the area of union
of the correctly classified pixels. Four different policies were tested using different
stopping iterations for the SIMP algorithm: the iteration at which SIMP stops
is sampled either from the uniform distribution U[1, 100] or from Poisson distributions
with λ = 5, 10, 30. The output structure is similar to the one produced by SIMP and its
calculation is 20 times faster. Higher Accuracy and IoU are achieved with more SIMP
iterations; the highest values achieved are Accuracy = 99.6% and IoU = 99.2%.
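For reference, the two evaluation metrics can be computed as in the sketch below; the noisy 40 × 40 arrays are placeholders for a network output and the corresponding SIMP ground truth:

```python
import numpy as np

def binary_accuracy(pred, truth, threshold=0.5):
    """Fraction of pixels classified correctly after thresholding."""
    return np.mean((pred >= threshold) == (truth >= threshold))

def iou(pred, truth, threshold=0.5):
    """Intersection over Union of the solid (material) pixels."""
    a = pred >= threshold
    b = truth >= threshold
    return np.logical_and(a, b).sum() / np.logical_or(a, b).sum()

rng = np.random.default_rng(5)
truth = (rng.random((40, 40)) > 0.5).astype(float)
pred = np.clip(truth + rng.normal(0, 0.3, truth.shape), 0, 1)   # noisy network output
print(f"accuracy = {binary_accuracy(pred, truth):.3f}, IoU = {iou(pred, truth):.3f}")
```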
Another image-based approach is the study by Wang et al. (2022), where a convo-
lutional neural network with strong generalization capabilities is used. The dataset
used consists of 80,000 samples generated using the SIMP code of Andreassen et al.
(2011). The volume fraction, the number of forces, and the direction of each force are
sampled from uniform distributions. The input of the neural network is a 40 × 80 × 6
tensor containing images of the volume fraction, the nodal displacements in the X and
Y directions, the nodal normal strains ε_x, ε_y, and the shear strain γ_xy. The ground
truth used for training is the optimized output of SIMP. The architecture of the neural
network is an encoder–decoder, where the encoder part gradually reduces the size of
the input by up to 8 times and the decoder part restores it to its original size. Because
the output for each element lies in (0, 1) and denotes the probability of existence of
that element, a suitable loss function is the Kullback–Leibler divergence, which
minimizes the distance between the known distribution and the output distribution:

D_KL(p‖q) = Σ_x p(x) log( p(x)/q(x) ) + (λ/2) Σ_i θ_i²        (10.29)
where the first term is the loss function, with p(x) the ground truth distribution and
q(x) the neural network output. The second term is the L_2 regularization term used
to reduce overfitting, where λ is the weight of the regularization term and θ denotes
the network parameters. The designs are very close to those of SIMP, as only 4.12%
of the samples showed large compliance errors. The neural network also provides a
huge speed-up, calculating the optimal design structure about 99% faster.
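A minimal sketch of the loss in Eq. (10.29), combining the element-wise KL term with the L_2 weight penalty; the density maps and parameter vector are random placeholders:

```python
import numpy as np

def kl_loss(p, q, theta, lam, eps=1e-8):
    """Kullback-Leibler divergence between ground-truth density map p and network
    output q, plus an L2 weight regularization term, following Eq. (10.29)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    kl = np.sum(p * np.log(p / q))
    reg = 0.5 * lam * np.sum(theta ** 2)
    return kl + reg

rng = np.random.default_rng(6)
p = rng.random((40, 80))                               # ground-truth element densities from SIMP
q = np.clip(p + rng.normal(0, 0.05, p.shape), 0, 1)    # network prediction
theta = rng.normal(size=1000)                          # stand-in for the network parameters
print(f"loss = {kl_loss(p, q, theta, lam=1e-3):.3f}")
```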
In Kollmann et al. (2020), deep learning is used to optimize 2D structures of
metamaterials. The proposed method uses a convolutional neural network (CNN)
and non-iteratively optimizes metamaterials for either maximizing the bulk modulus,
maximizing the shear modulus, or minimizing the Poisson's ratio (including negative
values). The data used for the training of the neural network are created by randomly
sampling the optimization parameters. These optimization parameters are the filter
radius, a design constraint (volume fraction), and a design objective (maximum bulk
modulus, maximum shear modulus, or minimum Poisson's ratio), and they are sampled
from uniform distributions; the optimized design is then calculated using SIMP. The
neural network follows the encoder–decoder architecture and takes as input 3 images,
one for each optimization parameter described before, and outputs an image
representing the optimized structure. More specifically, it utilizes the ResUnet
architecture, where the Unet part is used for the semantic segmentation of the image
and the skip connections of ResNet help train deep neural networks more efficiently
without the vanishing gradient issue (too small values of the gradient in very deep
networks). The loss function used during the training of the model is the MSE between
the ground truth optimized image and the output of the model; the Dice Similarity
Coefficient, which denotes the similarity of two images, is also used. To train the
model, the dataset created is split into training, validation, and test sets. The validation
set is used during training to ensure that the model does not overfit the training data.
Then the model is evaluated with the test set, which the model has never "seen" before.
The model achieves MSE = 0.007 and DSC = 0.97; the similarity coefficient in
particular shows that the produced optimized design image is very close to the ground
truth. The final step of the process is to apply a threshold of 0.5 to binarize the predicted
densities, because the model produces densities ranging over [0, 1] while the desired
final optimized design must have values in {0, 1}.
In Chi et al. (2021), a large-scale solution is proposed without a loss in accuracy.
The proposed method has three distinct features: a novel component that is trained
from previous iterations, a two-scale topology optimization method using a localized
strategy, and a component that generates new data from actual physical simulations,
which constantly improves the machine learning models. In contrast with other deep
learning-based methods, where the training of the neural networks happens before the
optimization process, in the proposed method the training happens online in two
stages: one initial online training session and several online updates during the
process. There are four key parameters that control the process: N_I and N_F, which
control the initial online training step and the frequency of the online updates, and the
two window parameters W_I and W_U, which control how far back in the data the
neural network can look for training in the initial step and the update stage,
respectively. The online initial training of the neural network starts by solving the
traditional equations for N_I + N_W − 1 iterations, during which the network can "see"
only W_I steps back. Using the trained model, the most computationally expensive
steps (solution of the state equations and sensitivity analysis) can be avoided. To
ensure that the model stays accurate during the whole process, the weights of the
model are updated at regular intervals by switching back to the standard solution
procedure and retraining the model. To make the proposed framework efficient and
scalable, a two-scale topology optimization setup is used, where a fine mesh and a
coarse mesh are used separately in different stages. The fine mesh is used to solve the
state equations to collect new data, and the design variable updates are also performed
there. On the coarse mesh no design variable updates happen, but the state equation is
solved at every step of the optimization on the stiffness distribution mapped from the
fine-scale mesh. The architecture of the neural network used is a fully connected deep
neural network (DNN) with 4 hidden layers and 1000 neurons in each hidden layer. A
notable addition to the architecture of the DNN is the Parametric Rectified Linear
Unit (PReLU), a generalization of ReLU with a learnable slope parameter α for
negative inputs, i.e. PReLU(x) = x for x > 0 and αx otherwise. PReLU has shown
great performance in image recognition tasks (He et al. 2015, 2016).
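A one-line sketch of the PReLU activation just described; in the actual network α is a learnable parameter optimized jointly with the weights:

```python
import numpy as np

def prelu(x, alpha):
    """Parametric ReLU: identity for positive inputs, a learnable slope alpha for negative ones."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(prelu(x, alpha=0.25))
```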
Deep neural networks are often called universal approximators because, with
the right architecture, they can approximate practically any function. In topology
optimization the design variables number in the millions, so a single model cannot
have the capacity to scale and produce an accurate mapping from the input to the
output as the number of design variables increases. That is why a two-scale localized
setup is used to ensure the scalability of the model; the proposed setup also requires
less memory for the calculations. Training examples are produced from the coarse-mesh discretizations,
calculations. Training examples are produced from the coarse-mesh discretizations,
which are also enclosed in the fine-mesh ones. Also, the whole global design of the
fine-mesh is not treated as an individual example but each element of the mesh is
used individually as a training example. The results show that the localized training
strategy is more efficient in terms of memory usage and scalability. To measure
the accuracy of the models, the angle of deviation from the original sensitivity is
used:
θ_error = arccos( G^T Ĝ / ( ‖G‖ ‖Ĝ‖ ) )        (10.31)
With sufficient training steps of the proposed method, θ_error approaches zero,
showing that the method is also accurate. It is also suggested that the strain vector
instead of the nodal displacement vector from the coarse mesh should be used as an
input for the deep neural network.
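A small sketch of the deviation angle in Eq. (10.31) between a reference sensitivity vector and its prediction; the vectors below are illustrative:

```python
import numpy as np

def sensitivity_angle_error(g_true, g_pred):
    """Angle (radians) between the reference sensitivity vector and the predicted one, Eq. (10.31)."""
    cos = (g_true @ g_pred) / (np.linalg.norm(g_true) * np.linalg.norm(g_pred))
    return np.arccos(np.clip(cos, -1.0, 1.0))

g_true = np.array([-1.0, -2.0, -0.5, -3.0])
g_pred = g_true + np.array([0.05, -0.1, 0.02, 0.1])   # small prediction error
print(f"theta_error = {np.degrees(sensitivity_angle_error(g_true, g_pred)):.2f} deg")
```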
10.4 Conclusions
This chapter has presented a review of methods used to perform topology opti-
mization using artificial intelligence-related methodologies. Artificial intelligence
is a useful tool employed in many scientific areas during the last decades. It is no
surprise that such tools have found their way to topology optimization problems
either by completely replacing the conventional methods or by assisting the conven-
tional methods to reduce the required computational cost. The main advantage of
using artificial intelligence to perform topology optimization is that these models
have a large enough learning capacity to map the input to an output, even in
complex engineering problems. As a result, by properly training the AI model, it can
be used to map an input to an output during the topology optimization process.
There are many models and algorithms that have been developed during the most
recent years for AI-assisted topology optimization problems. The two most popular
families of methods are density-based and image-based ones. Density-based methods
use the mechanical properties of the model as an input and output a density. The
second family, image-based methods, use an image as an input, either in 2D or 3D.
Most methods use a deep neural network or a convolutional neural network to calcu-
late the output of the optimized structure. Methods using density-based approaches
usually exhibit better performance both in terms of accuracy and computational cost
reduction. Advancements both in software and hardware can improve even further
the performance of these methods, as AI models rely on GPU computations.
Acknowledgements The research was supported by the National Funding for European Competi-
tive Research Projects, for the financial year 2019, with the beneficiary being the National Technical
University of Athens (NTUA), project: 67120100, GSRT AWARD 2019 “Data Driven Surrogate
Models and Applications”.
The fifth author acknowledges the support of the European Union’s Horizon research and innova-
tion programme under the Marie Sklodowska-Curie Individual Fellowship grant “AI2AM: Artificial
Intelligence driven topology optimisation of Additively Manufactured Composite Components”,
No. 101021629.
References
Abueidda DW, Koric S, Sobh NA (2020) Topology optimization of 2D structures with nonlineari-
ties using deep learning. Comput Struct 237:106283. https://doi.org/10.1016/j.compstruc.2020.
106283
Andreassen E, Clausen A, Schevenels M, Lazarov BS, Sigmund O (2011) Efficient topology opti-
mization in MATLAB using 88 lines of code. Struct Multidisc Optim 43(1):1–16. https://doi.
org/10.1007/s00158-010-0594-7
Banga S, Gehani H, Bhilare S, Patel S, Kara L (2018) 3D topology optimization using convolutional
neural networks. arXiv:1808.07440v1. https://doi.org/10.48550/arXiv.1808.07440
Bendsøe MP (1989) Optimal shape design as a material distribution problem. Struct Optim 1(4):193–
202. https://doi.org/10.1007/BF01650949
Bendsøe MP, Sigmund O (2004) Topology optimization: theory, methods, and applications.
Springer, Heidelberg. https://doi.org/10.1007/978-3-662-05086-6
Chandrasekhar A, Suresh K (2021) TOuNN: topology optimization using neural networks. Struct
Multidisc Optim 63(3):1135–1149. https://doi.org/10.1007/s00158-020-02748-4
Chi H, Zhang Y, Tang TLE, Mirabella L, Dalloro L, Song L, Paulino GH (2021) Universal machine
learning for topology optimization. Comput Methods Appl Mech Eng 375:112739. https://doi.
org/10.1016/j.cma.2019.112739
Clarke MRB (1974) Pattern classification and scene analysis. J R Stat Soc: Ser A (general)
137(3):442–443. https://doi.org/10.2307/2344977
Deng H, To AC (2021) A parametric level set method for topology optimization based on deep
neural network. J Mech Des 143(9). https://doi.org/10.1115/1.4050105
He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: surpassing human-level perfor-
mance on imagenet classification. In: 2015 IEEE international conference on computer vision
(ICCV), pp 1026–1034. https://doi.org/10.1109/ICCV.2015.123
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE
conference on computer vision and pattern recognition (CVPR), pp 770–778. https://doi.org/
10.1109/CVPR.2016.90
Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal
approximators. Neural Netw 2(5):359–366. https://doi.org/10.1016/0893-6080(89)90020-8
Kallioras NA, Kazakis G, Lagaros ND (2020) Accelerated topology optimization by means of
deep learning. Struct Multidisc Optim 62(3):1185–1212. https://doi.org/10.1007/s00158-020-
02545-z
Kallioras NA, Lagaros ND (2020) DL-scale: deep learning for model upgrading in topology
optimization. Procedia Manuf 44:433–440. https://doi.org/10.1016/j.promfg.2020.02.273
Kallioras NA, Lagaros ND (2021a) DL-SCALE: a novel deep learning-based model order upscaling
scheme for solving topology optimization problems. Neural Comput Appl 33(12):7125–7144.
https://doi.org/10.1007/s00521-020-05480-8
Kallioras NA, Lagaros ND (2021b) MLGen: generative design framework based on machine
learning and topology optimization. Appl Sci 11(24):12044
Kallioras NA, Nordas AN, Lagaros ND (2021) Deep learning-based accuracy upgrade of reduced
order models in topology optimization. Appl Sci 11(24):12005
Kollmann HT, Abueidda DW, Koric S, Guleryuz E, Sobh NA (2020) Deep learning for topology
optimization of 2D metamaterials. Mater Des 196:109098. https://doi.org/10.1016/j.matdes.
2020.109098
Lei X, Liu C, Du Z, Zhang W, Guo X (2018) Machine learning-driven real-time topology optimiza-
tion under moving morphable component-based framework. J Appl Mech 86(1). https://doi.org/
10.1115/1.4041319
Li J, Ye H, Yuan B, Wei N (2022) Cross-resolution topology optimization for geometrical non-
linearity by using deep learning. Struct Multidisc Optim 65(4):133. https://doi.org/10.1007/s00
158-022-03231-y
Liu K, Tovar A, Nutwell E, Detwiler D (2015) Towards nonlinear multimaterial topology opti-
mization using unsupervised machine learning and metamodel-based optimization. In: ASME
2015 international design engineering technical conferences and computers and information in
engineering conference. https://doi.org/10.1115/detc2015-46534
Lu X, Plevris V, Tsiatas G, De Domenico D (2022) Editorial: artificial intelligence-powered method-
ologies and applications in earthquake and structural engineering. Frontiers Built Environ 8.
https://doi.org/10.3389/fbuil.2022.876077
Mlejnek HP (1992) Some aspects of the genesis of structures. Struct Optim 5(1):64–69. https://doi.
org/10.1007/BF01744697
Parzen E (1962) On Estimation of a probability density function and mode. Ann Math Stat
33(3):1065–1076
Patel D, Bielecki D, Rai R, Dargush G (2022) Improving connectivity and accelerating multiscale
topology optimization using deep neural network techniques. Struct Multidisc Optim 65(4):126.
https://doi.org/10.1007/s00158-022-03223-y
Patel J, Choi S-K (2012) Classification approach for reliability-based topology optimization using
probabilistic neural networks. Struct Multidisc Optim 45(4):529–543. https://doi.org/10.1007/
s00158-011-0711-2
Qian C, Ye W (2021) Accelerating gradient-based topology optimization design with dual-model
artificial neural networks. Struct Multidisc Optim 63(4):1687–1707. https://doi.org/10.1007/s00
158-020-02770-6
Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image
segmentation. Cham, pp 234–241. https://doi.org/10.1007/978-3-319-24574-4_28
Solorzano G, Plevris V (2022) Computational intelligence methods in simulation and modeling of
structures: a state-of-the-art review using bibliometric maps. Frontiers Built Environ 8. https://
doi.org/10.3389/fbuil.2022.1049616
Sosnovik I, Oseledets I (2019) Neural networks for topology optimization. Russ J Numer Anal
Math Model 34(4):215–223. https://doi.org/10.1515/rnam-2019-0018
Wang D, Xiang C, Pan Y, Chen A, Zhou X, Zhang Y (2022) A deep convolutional neural network
for topology optimization with perceptible generalization ability. Eng Optim 54(6):973–988.
https://doi.org/10.1080/0305215X.2021.1902998
White DA, Arrighi WJ, Kudo J, Watts SE (2019) Multiscale topology optimization using neural
network surrogate models. Comput Methods Appl Mech Eng 346:1118–1135. https://doi.org/
10.1016/j.cma.2018.09.007
Zhang Z, Li Y, Zhou W, Chen X, Yao W, Zhao Y (2021) TONR: An exploration for a novel
way combining neural network with topology optimization. Comput Methods Appl Mech Eng
386:114083. https://doi.org/10.1016/j.cma.2021.114083
Zhou M, Rozvany GIN (1991) The COC algorithm, Part II: Topological, geometrical and generalized
shape optimization. Comput Methods Appl Mech Eng 89(1):309–336. https://doi.org/10.1016/
0045-7825(91)90046-9
Chapter 11
Mixed-Variable Concurrent Material,
Geometry, and Process Design
in Integrated Computational Materials
Engineering
Tianyu Huang, Marisa Bisram, Yang Li, Hongyi Xu, Danielle Zeng,
Xuming Su, Jian Cao, and Wei Chen
11.1 Introduction
et al. 2018), and stochastic modeling and uncertainty quantification methods (Bostan-
abad et al. 2018b; Huang et al. 2021), achieving a weight saving of 30% and cost
saving of $4.01 per pound of weight saved compared to an all-steel baseline, while
preserving the required structural durability (Su and Wagner 2019).
Machine learning (ML) is often used in ICME to enable metamodeling and
metamodel-based design optimization. Metamodeling (Simpson et al. 2001), or
surrogate modeling, is a technique to replace the expensive object models with a
simpler data-driven model to speed up the design evaluation process. The resulting
metamodel is trained using data from high-fidelity physics-based models, which are
often very computationally intensive, to allow more emulated model executions for
design space exploration. It leads to metamodel-based design optimization (MBDO),
wherein the optimization search is based on a metamodel instead of the original,
expensive one. MBDO has several steps, including (1) design of experiments (DoE)
for data collection, (2) metamodeling, (3) model validation, and (4) optimization.
The DoE methods, e.g. Latin hypercube sampling and low discrepancy sequences
(Sobol 2001), aim at developing a training dataset that explores the design space as
much as possible given a limited computing budget. For engineering applications, Jin
et al. (2001) compared multiple metamodels and concluded that Gaussian process
(GP) modeling is the most suitable for modeling nonlinear behavior from deter-
ministic computer simulations. The recent surge in using artificial neural networks
(ANN) and deep networks (see, e.g. Daddona and Antonelli 2018; Rao and Liu
2020) is motivated by metamodeling of simulations with high-dimensional inputs.
The metamodel is often validated by methods such as k-fold cross-validation or
leave-one-out cross-validation. Finally, either a gradient-based optimization search
algorithm or a non-gradient-based method such as Bayesian optimization (Shahriari
et al. n.d) is integrated with the metamodel for design optimization. MBDO has been
widely used in engineering design practice, such as design of fuel cell systems (Miao
et al. 2011), improving vehicle crashworthiness (Fang et al. 2005), and optimizing
the aerodynamics performance of car body designs (Song et al. 2017).
Major challenges exist in leveraging ML and correspondingly MBDO methods
in ICME-based design optimization such as the following.
• Mixed design variables: while many existing modeling and optimization
approaches assume continuous input variables, the design variables in ICME are
often of mixed continuous and non-continuous types. More specifically, it can
The iteration steps of LVGP-CBO are illustrated in Fig. 11.2. The framework is predi-
cated on the end-to-end ICME simulation models, which evaluate process, structural
and material designs consisting of mixed-type variables for multiple performance
metrics and are assumed to be costly to execute due to their nature of being multi-
step and multiscale. We start from a few initial designs in the ICME database and
build an LVGP surrogate model to predict the performance of any design in the input
space with quantifiable uncertainty. Then, an improved version of BO is employed
to search in the design space for optimal solution under constraints. The predicted
optimal designs are consequently evaluated by the ICME models, and the data points
are added to the database for the next iteration. The process continues until a preset
limit on iterations is reached or a convergence condition is met.
The methods are presented in this section. We start with the basics of GP modeling
and BO, followed by the introduction of LVGP modeling, then generalizing BO to
constrained optimization problems.
Throughout the section, we assume the dataset to be modeled contains N pairs of
predictors and observed responses, {(x^(1), y_1), (x^(2), y_2), ..., (x^(N), y_N)}, where
x^(i) (i = 1, ..., N) is d-dimensional. X is the design matrix of size N × d, where each
row contains an observation, y is the response vector of size N, and Y_i (i = 1, ..., N)
and Y are the Gaussian random variables and vectors used in GP modeling.
A Gaussian process prior is placed on the response,

Y(x) ∼ GP( m(x), K(x, x′) )        (11.1)

where m(·) is a mean function and K(·, ·) is a covariance function, both converting
the predictors to the responses' distribution parameters. A popular choice of the
covariance function is

K(x^(i), x^(j)) = σ² r(x^(i), x^(j)) = σ² exp( − Σ_{k=1}^{d} λ_k ( x_k^(i) − x_k^(j) )² )        (11.2)
where r (·, ·), the squared exponential function, models the correlation between Yi
and Y j through the Euclidean distance between xi and x j , assuming the process
is stationary and isotropic. K (·, ·) has d + 1 hyperparameters (λk and σ 2 ). By
assuming the mean is a constant, i.e. m(·) = c, the training of the GP model
through maximum likelihood estimation (MLE) is turned into a (d + 2)-dimensional
optimization problem
( ĉ, σ̂², λ̂_k ) = argmax_{c, σ², λ_k} L( c, σ², λ_k | X, y )        (11.3)
where the likelihood function L(·) is derived from the multivariate Gaussian density
(with covariance matrix Σ_ij = K(x^(i), x^(j)))

L( c, σ², λ_k | X, y ) = f_Y( y; X, c, σ², λ_k )
 = (2π)^{−N/2} |Σ|^{−1/2} exp( −(1/2) (y − c)^T Σ^{−1} (y − c) )        (11.4)
To solve the optimization problem, Eq. (11.4) is often turned into its negative
logarithmic form for numerical stability, and the maximization problem (or mini-
mization when in negative logarithm) is solved by numerical optimization algorithms
(Bostanabad et al. 2018a).
Given the optimal hyperparameters, the distribution of Eq. (11.1) is determined
and we can construct the joint distribution of Y and Y (n) , a new point at x (n) we are
interested in. Note that the distribution of Y (n) conditioned on our observation Y = y
is also Gaussian with conditional mean and variance
E[ Y^(n) | Y = y ] = c + r( x^(n), X ) R^{−1} ( y − c )
var[ Y^(n) | Y = y ] = σ² [ 1 − r( x^(n), X ) R^{−1} r( x^(n), X )^T ]        (11.5)

where R = [r_ij] is the correlation matrix. We use E[Y^(n) | y] as our point estimate
of y^(n), with variance var[Y^(n) | y], and the epistemic prediction uncertainty from lack
of data can be quantified using the variance itself or a confidence interval
based on it.
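A compact sketch of Eqs. (11.2)–(11.5): building the squared-exponential correlation, evaluating the negative log-likelihood, and predicting the conditional mean and variance at a new point. The toy data and the hyperparameters (c, σ², λ) are set by hand here, whereas in practice they come from the MLE of Eq. (11.3):

```python
import numpy as np

def corr(X1, X2, lam):
    """Squared-exponential correlation r(x, x') of Eq. (11.2)."""
    d2 = (((X1[:, None, :] - X2[None, :, :]) ** 2) * lam).sum(-1)
    return np.exp(-d2)

def neg_log_likelihood(c, sigma2, lam, X, y):
    """Negative logarithm of the Gaussian likelihood in Eq. (11.4)."""
    S = sigma2 * corr(X, X, lam) + 1e-10 * np.eye(len(X))  # small jitter for stability
    r = y - c
    _, logdet = np.linalg.slogdet(S)
    return 0.5 * (len(X) * np.log(2 * np.pi) + logdet + r @ np.linalg.solve(S, r))

def gp_predict(x_new, X, y, c, sigma2, lam):
    """Conditional mean and variance of Eq. (11.5)."""
    R = corr(X, X, lam) + 1e-10 * np.eye(len(X))
    r_new = corr(x_new[None, :], X, lam)[0]
    mean = c + r_new @ np.linalg.solve(R, y - c)
    var = sigma2 * (1.0 - r_new @ np.linalg.solve(R, r_new))
    return mean, var

rng = np.random.default_rng(7)
X = rng.random((12, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2                      # toy deterministic simulation output
c, sigma2, lam = y.mean(), y.var(), np.array([5.0, 5.0])    # hand-picked, not MLE, values
print("NLL =", round(neg_log_likelihood(c, sigma2, lam, X, y), 3))
print("prediction at [0.5, 0.5]:", gp_predict(np.array([0.5, 0.5]), X, y, c, sigma2, lam))
```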
Details of GP modeling can be found in references such as Rasmussen (2003) and
Bostanabad et al. (2018a). In addition to the basic version, which predicts a single
noise-free variable from inputs, alternative versions have been developed for less
ideal situations such as multi-response outputs (Conti et al. 2009), stochastic noises
(Ankenman et al. 2008), and additive noises (Zhang et al. 2016a).
Conventional numerical optimization algorithms need access to the objective
function’s gradient information to determine search directions and optimality condi-
tions while convergence is often guaranteed for local solutions (Nocedal and Wright
2000). Bayesian optimization (BO) (Shahriari et al. n.d) is an alternative efficient
global optimization (EGO) approach that guides its solution search through the
probability distributions of outputs predicted by a metamodel constructed from the
search history data. The core idea is to design an acquisition function, which quanti-
tatively measures the potential for improvement at an unexplored site by comparing
and integrating its prediction uncertainty and predicted improvement relative to the
current optimum. If the uncertain prediction y at a location x is characterized by a
random variable Y with mean ŷ and variance σ̂², while the current best solution (for
a maximization problem) is y*, the expected improvement is

EI(x) = ( ŷ − y* ) Φ_N( ( ŷ − y* ) / σ̂ ) + σ̂ φ_N( ( ŷ − y* ) / σ̂ )        (11.6)
where φ_N and Φ_N denote the probability density and cumulative distribution func-
tion (PDF and CDF) of a standard normal distribution when the predictions are
provided by a GP model. The unexplored point with the highest EI is chosen for
objective function evaluation in each optimization iteration (Fig. 11.3). EI provides a
balance between exploring highly uncertain sites and exploiting predicted promising
sites. Other options include the probability of improvement (PI), PI(x) = P[Y ≥ y*] =
Φ_N( ( ŷ − y* ) / σ̂ ), and the upper confidence bound (UCB), UCB(x) = ŷ + βσ̂, where β is a
hyperparameter of the user's choice (Shahriari et al. n.d). The acquisition function
may also be hybrid, choosing from a set of candidates based on their performance in
the optimization process. See, for example, the Thompson sampling (Shahriari et al.
n.d) approach for reference.
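A minimal sketch of the acquisition functions just described (EI of Eq. (11.6), PI, and UCB), assuming the surrogate returns a mean and standard deviation at each candidate site; the candidate values below are illustrative:

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    """Expected improvement of Eq. (11.6) for a maximization problem."""
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

def probability_of_improvement(mu, sigma, y_best):
    return norm.cdf((mu - y_best) / np.maximum(sigma, 1e-12))

def upper_confidence_bound(mu, sigma, beta=2.0):
    return mu + beta * sigma

# Candidate sites with predicted mean/std from a surrogate; pick the one with highest EI
mu = np.array([0.8, 1.1, 0.95])
sigma = np.array([0.05, 0.02, 0.30])
ei = expected_improvement(mu, sigma, y_best=1.0)
print("EI =", np.round(ei, 4), " -> next design:", int(np.argmax(ei)))
```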
In LVGP modeling, each categorical variable x_{t,k} is mapped to a low-dimensional
(typically 2D) latent numerical vector z^(k)(·), so that distances between categorical
levels can be learned from data. For inputs with d_1 continuous and d_2 categorical
variables, the covariance function becomes

K(x^(i), x^(j)) = σ² exp( − Σ_{k=1}^{d1} τ_k ( x_{c,k}^(i) − x_{c,k}^(j) )² − Σ_{k=1}^{d2} ‖ z^(k)(x_{t,k}^(i)) − z^(k)(x_{t,k}^(j)) ‖² )        (11.7)
For an LVGP with N data points {X, y} and the N × N covariance matrix Σ(τ, z(·)) defined by
Eq. (11.7), its negative log-likelihood is

l( c, σ, τ, z(·) ) = (N/2) ln( 2πσ² ) + (1/2) ln| Σ(τ, z(·)) | + ( 1/(2σ²) ) ( y − c )^T Σ(τ, z(·))^{−1} ( y − c )        (11.8)
a tensile strength model but closer to CFRP in the latent space of a product weight
model.
Note that GP models do not generally scale well for big data applications due to
the inherent matrix inversion operations (on a matrix of the size of the data) during
model fitting and inference. However, they are well-suited for BO in the context
of ICME since ICME often encounters small-data scenarios and BO aims at finding
fewer (more efficient) samples to build the GP toward the optima under the assumption
that the objective function (i.e. the ICME simulations) is computationally expensive.
Consequently, the resulting dataset is generally small enough to be modeled by GP-
based approaches due to resource constraints. On the other hand, for extremely high-
dimensional design problems that do require a large number of samples to model,
GP might be impractical and dimension reduction and feature selection techniques
(Li et al. 2017a) could be applied to reduce the dimensionality of the problem before
a surrogate model is trained for BO.
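A small sketch of the LVGP covariance of Eq. (11.7) for a hypothetical design set with one continuous variable (thickness) and one categorical variable (material) mapped to 2D latent vectors; the latent positions are fixed by hand for illustration, while in LVGP they are estimated together with τ and σ² by MLE:

```python
import numpy as np

def lvgp_covariance(Xc, Xt, sigma2, tau, latent):
    """Covariance of Eq. (11.7): squared-exponential in the continuous inputs Xc plus squared
    Euclidean distances between latent vectors assigned to each categorical level in Xt."""
    n = len(Xc)
    K = np.empty((n, n))
    for i in range(n):
        for j in range(n):
            d_cont = np.sum(tau * (Xc[i] - Xc[j]) ** 2)
            d_cat = sum(np.sum((latent[k][Xt[i, k]] - latent[k][Xt[j, k]]) ** 2)
                        for k in range(Xt.shape[1]))
            K[i, j] = sigma2 * np.exp(-d_cont - d_cat)
    return K

# 4 hypothetical designs: one continuous variable (thickness) and one categorical (material 0/1/2)
Xc = np.array([[1.2], [2.4], [2.4], [3.6]])
Xt = np.array([[0], [0], [2], [1]])
latent = [{0: np.array([0.0, 0.0]),    # latent positions z(level), illustrative values only
           1: np.array([1.0, 0.2]),
           2: np.array([0.4, 1.1])}]
print(np.round(lvgp_covariance(Xc, Xt, sigma2=1.0, tau=np.array([0.5]), latent=latent), 3))
```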
Multiple objectives often coexist in ICME decision making, e.g. performance opti-
mization, cost minimization, and compliance of prespecified design requirements.
For multi-objective optimization problems without constraints, it is not uncommon
that the most valued objective is kept while the remainders are converted to design
constraints in engineering practices (see, for example, the ε-constraint method in
Miettinen 1998). Note that Bayesian optimization, in its original formulation, is
an unconstrained optimization algorithm. To take into account the constraints in
ICME design, a constrained BO algorithm is needed, preferably based on the LVGP
metamodel for its power to consume mixed-variable inputs.
We develop our approach based on the following optimization problem
max_x  y = f(x)
s.t.  g(x) ≤ 0        (11.9)
It is assumed that the objective and constraint functions can only be evaluated
through expensive functions (i.e. high-fidelity computer simulations), hence all models
are emulated by LVGPs. Let the current best design be y* and a new query point
with its predicted performance be {x, y}. Under the assumptions of LVGP, y is a
realization of a Gaussian process. Letting the marginal distribution at x be Y, the
predicted improvement (for maximization problems) is defined as

I(x) = max( 0, Y − y* )        (11.10)
Taking the expectation on both sides of Eq. (11.10) leads to the popular expected
improvement (EI(x)) acquisition function in BO. For constrained optimization, first
let the constraint and its prediction be g(x) = y_c and ŷ_c. Similarly, y_c is modeled by
a random variable Y_c, marginalized from the presumed GP. Following the method
proposed by Gardner et al. (2014), we define a feasibility indicator function Δ(x),

Δ(x) = 1,  if Y_c ≤ 0
Δ(x) = 0,  if Y_c > 0        (11.11)

It is not difficult to see that Eq. (11.11) represents a Bernoulli random variable
that takes the value 1 when the constraint is satisfied. The constrained improvement
is defined as

IC(x) = Δ(x) I(x) = Δ(x) max( 0, Y − y* )        (11.12)

which is nonnegative, and only nonzero when the constraint is satisfied. Note that
Y and Y_c are independently modeled. By taking the expectation of the improve-
ment function, the EIC (expected improvement with constraints) acquisition function
becomes the product of the expected improvement and the probability of feasibility
(PF), i.e. the probability that the constraint is not violated,

EIC(x) = E[ Δ(x) max( 0, Y − y* ) ] = PF(x) EI(x)        (11.13)

For multiple constraints, if we assume that they are mutually independent, then the
PF part of the EIC criterion becomes the product of the individual probabilities of feasibility.
The constrained Bayesian optimization (CBO) is constructed by adopting
Eq. (11.13) as the acquisition function in BO. For mixed-variable problems, we
extend x to x = (xc , xt ) (continuous and categorical variables) and opt for LVGP
to estimate the distributions of Y and YC . We use LVGP-CBO to denote this mixed-
variable and constraint-bounded optimization approach. The PF term in Eq. (11.13)
lowers the expected improvement of points that are more likely to be infeasible. As
a result, they are less likely to be chosen for evaluation.
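A minimal sketch of the EIC acquisition of Eq. (11.13) under the convention used above (PF as the probability that the constraint prediction satisfies g(x) ≤ 0), assuming Gaussian predictions for both the objective and the constraint:

```python
import numpy as np
from scipy.stats import norm

def eic(mu, sigma, y_best, mu_c, sigma_c):
    """Constrained expected improvement of Eq. (11.13): EI on the objective times the
    probability that the constraint prediction Yc satisfies g(x) <= 0."""
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - y_best) / sigma
    ei = (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)
    p_feasible = norm.cdf((0.0 - mu_c) / np.maximum(sigma_c, 1e-12))   # P[Yc <= 0]
    return ei * p_feasible

# Two candidates: the second has a better objective but is probably infeasible
mu, sigma = np.array([1.05, 1.30]), np.array([0.10, 0.10])
mu_c, sigma_c = np.array([-0.5, 0.4]), np.array([0.2, 0.2])
print(np.round(eic(mu, sigma, y_best=1.0, mu_c=mu_c, sigma_c=sigma_c), 4))
```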
We demonstrate the use of LVGP-CBO for concurrent material and structure opti-
mization through the design of a thin-walled hat section example for performance
improvement and weight reduction. A hat section is a popular structural model to
demonstrate the mechanical performance of materials in automotive engineering
(Schneider and Jones 2005; Debski et al. 2020; Xu et al. 2020). As shown in Fig. 11.5,
a closed hat is formed by joining two components together, a top hat and a backplate,
which may have separate geometry, materials, and processing specifications.
The integrated models (Fig. 11.6) allow a comprehensive design of the hat section
from many aspects. The potential input variables include
1. Material selection (Fig. 11.7a). For each of the two components (hat and back-
plate), one of the following candidate materials is chosen: steel, aluminum, unidi-
rectional (UD) CFRP (two possible fabric layups, [0◦ /90◦ ] and [0◦ /60◦ / − 60◦ ]),
woven CFRP ([0◦ /90◦ ]), and chopped fiber CFRP (also known as SMC, sheet
molding compounds).
2. Fiber angle. A rotation of the fabrics, in multiples of ±15°, may be applied to
UD and woven CFRP to create angle selections other than the prespecified ones.
For example, we may have UD [15°/105°] as a result of UD [0°/90°] with a 15°
rotation.
3. Thickness (Fig. 11.7c). For each component, a thickness between 1.2 and 4.2 mm
may be chosen. The variable is continuous for steel, aluminum, and chopped fiber
component, and discrete for UD and woven CFRP as it must be multiples of the
fabric thickness, which is 0.4 mm for UD and woven [0◦ /90◦ ], and 0.6 mm for
UD [0◦ /60◦ / − 60◦ ].
4. Hat height (Fig. 11.7c). The height of the hat can be altered within ±1 mm
compared to a reference design, i.e. it falls within the range [−1, 1].
5. Charge design (Fig. 11.7b). The shape of the initial charge for the sheet molded
chopped fiber CFRP. There are two alternatives.
Fig. 11.7 Schematic of the design variables, including a material selections, b process conditions
for SMC, and c part geometry
Fig. 11.8 Schematic of the performance simulations, including a stiffness, b strength, c fatigue
life, d crashworthiness, and e criteria of crashworthiness
The list of design variables, along with the number of levels, is summarized in
Table 11.1. To reduce the complexity of the problem, we assume the process condi-
tions are the same for both components (hat and backplate), which leads to 10 design
variables in total.
Table 11.1 ICME design variables for the thin-walled hat section component
Group Name Type Choices Hat Backplate
Material Material selection Categorical 6 1 1
Material Fiber angle Discrete 12 1 1
Geometry Thickness Continuous n/a 1 1
Geometry Hat height Continuous n/a 1 0
Process Charge design Categorical 2 1 1
Process Compression speed Discrete 2 1 1
Process Compression force Discrete 3 1 1
Note that if a categorical variable has only two levels, say 0 and 1, modeling it as categorical or
continuous makes no difference since the distance between the levels (to be estimated in LVGP) can
also be accounted for via the roughness parameter τ when it is modeled as a continuous variable
in regular GP. Therefore, we treat charge design as a continuous variable, and it leaves us with two
categorical variables, material selection for the two components. The ordinal (discrete) variables
are treated as continuous during LVGP modeling and validation.
Among the 160 DoE points, 128 (80%) of them are used for training the LVGP
model and the remaining 32 (20%) are reserved for model validation. All inputs and
outputs are normalized to [0, 1] to improve the convergence of model training. Both
mean squared error (MSE) and mean absolute relative error (MARE) are computed
for validation. For an array of predictions ŷ_1, ŷ_2, ..., ŷ_N and the true values y_1, y_2, ..., y_N,

MSE = (1/N) Σ_{i=1}^{N} ( ŷ_i − y_i )²,    MARE = (1/N) Σ_{i=1}^{N} | ŷ_i − y_i | / | y_i |        (11.14)
The values are listed in Table 11.3. Low MSEs are observed for almost all outputs, showing good average prediction performance. Note that the absolute relative errors are more sensitive to outliers. Since outputs 5 and 6 (strength) are developed for metal and composite materials, respectively, low accuracy is expected when the model evaluates the wrong criterion (e.g. the Tsai–Wu criterion, developed for composites, applied to an all-metal design), which explains the high MARE for the strength predictions. Due to the high computational cost of the crashworthiness simulations, the calculation sometimes exits prematurely: an internal algorithm interrupts the simulation once it determines that the hat section is about to break. In this case, the distance is recorded as −1, which corrupts the crashworthiness data and accounts for the high MARE of outputs 7 and 8.
A few selected latent spaces are plotted in Fig. 11.9. The first row shows the latent relation between the material selections for the cost model. The linear alignment of five of the materials indicates that the LVGP model is able to find the latent representation of the levels of the categorical variable, which is in fact a 1D variable: the unit price. Although in theory the unit price is the only factor affecting the cost through material selection (the volume of the hat section is fully specified by the height and thickness variables), in our model the realized volume is also material-dependent, because the U1 material (UD CFRP with fabric [0/60/−60]) has one more layer than the others and a different unit thickness (0.6 mm as opposed to 0.4 mm for U2 and W). Choosing this material therefore affects the cost model in two dimensions: unit price and realized thickness. For example, if we assume the unit prices of woven, U1, and U2 are p, k1·p, and k2·p, then for thickness values that require no rounding for any of them (i.e. common multiples of 0.4 and 0.6, such as 2.4 mm), the cost ratio among the three materials (all other design variables being equal) is exactly 1 : k1 : k2; for other values, the final thickness after rounding changes, and a different cost ratio between U1 and the rest is observed. An input thickness of 2.0 mm, for instance, yields 2.0 mm U2/woven and 1.8 mm U1 components, giving a ratio of 1 : 0.9k1 : k2. The data should therefore suggest that U1 influences the cost through both unit price and thickness. This phenomenon is captured by LVGP and reflected in Fig. 11.9: U1 does not fall on the line formed by the other materials in the latent space. Similarly, the 2D scattering of the points in the latent spaces in Fig. 11.9c–f suggests that the influence of material selection on these mechanical performances is also multi-dimensional; there are potentially many aspects of the materials (e.g. stiffness, strengths) that ultimately influence the performance, i.e. the categorical variables cannot be reduced to a single variable when predicting mechanical performances.
Another perspective for examining the LVGP model is through τ, its roughness parameters for the continuous variables in Eq. (11.8). These parameters essentially place a weight on each predictor when the correlation between points is determined. As an example, the parameters of the cost model are shown in Table 11.4. Note that in the MLE process we set 10⁻³ as the lower bound of the parameters, which means that entries of −3 in the table have attained the lowest possible value and are most likely irrelevant for predictions. It is evident that, although the model is trained purely from data, it has captured the knowledge that the cost of the product is a function of the geometry design (the thicknesses, i.e. the volume) among all continuous variables, and that it is more sensitive to the hat's thickness, since the hat has a larger cross-sectional area (and thus a larger volume change per unit thickness change).
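To make the role of the roughness parameters concrete, the sketch below evaluates a Gaussian correlation function of the form commonly used in GP metamodels, R(x, x') = exp(−Σ_k τ_k (x_k − x'_k)²); the kernel form and the parameter values here are assumptions for illustration, not the exact settings of the cost model.

```python
import numpy as np

def gaussian_correlation(x1, x2, tau):
    """R(x, x') = exp(-sum_k tau_k * (x_k - x'_k)^2).
    A large tau_k makes the response vary quickly along predictor k (an
    important predictor); tau_k stuck at the lower bound (e.g. 1e-3) marks
    predictor k as essentially irrelevant for the correlation."""
    d2 = (np.asarray(x1, float) - np.asarray(x2, float)) ** 2
    return float(np.exp(-np.sum(np.asarray(tau, float) * d2)))

# Illustrative values: a thickness-like predictor matters, a second one does not
tau = np.array([10 ** 0.5, 10 ** -3.0])
r = gaussian_correlation([0.2, 0.0], [0.6, 1.0], tau)
```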
Fig. 11.9 The latent spaces of the cost, fatigue life, and composite strength models (top to bottom) for the top hat and backplate material selections (left to right). (F = steel, A = aluminum, S = SMC, U1 = UD [0/60/−60], U2 = UD [0/90], W = woven [0/90])
We start with 24 DoE points to construct the initial LVGP models for optimization
and EIC estimation, and run 300 iterations of LVGP-CBO. Three design trials are
performed, as reported in Table 11.5.
The trials are intended to test the single- and combined-objective optimization capabilities of the proposed approach. The weights w1, w2 in the objective function of Trial 3 are chosen such that the two individual objectives are normalized to a similar scale. Ideally, one should run the physical simulations to collect additional data during the Bayesian optimization; for the purpose of demonstration, however, we query a higher-fidelity metamodel, an LVGP built with all 160 data points, for design evaluation.
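The acquisition function driving LVGP-CBO is the expected improvement under constraints (EIC). A minimal sketch of the feasibility-weighted form (in the spirit of Gardner et al. 2014) is given below, assuming Gaussian predictive distributions from the GP models; the function and variable names are ours, and the sketch simplifies the actual implementation.

```python
import numpy as np
from scipy.stats import norm

def eic(mu, sigma, f_best, mu_g, sigma_g):
    """Constrained expected improvement for a minimization problem.
    mu, sigma     : GP predictive mean / std of the objective at a candidate
    f_best        : best feasible objective value observed so far
    mu_g, sigma_g : arrays of predictive means / stds of constraints (g <= 0 feasible)."""
    z = (f_best - mu) / sigma
    ei = (f_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)            # standard EI
    p_feas = np.prod(norm.cdf(-np.asarray(mu_g) / np.asarray(sigma_g)))  # P(all g <= 0)
    return ei * p_feas                                                # EI weighted by feasibility
```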
The optimization history and results are reported in Fig. 11.10 and Tables 11.6, 11.7, and 11.8; some design variables are not shown. For the low-cost design, it can be observed that the algorithm repeatedly switches between aluminum, steel, UD, and woven composite materials, with a strong preference for metals, in searching for the optimal design. Note that steel is the cheapest material in the model, so the baseline design with minimal thickness should already attain the theoretical lower bound of the cost, i.e. it cannot be improved upon; the optimization history nevertheless shows a 5.4% improvement over the initial randomly sampled designs (from 115.54 to 109.32). The last three designs have costs within 1.1% of the lower bound (Table 11.5), demonstrating LVGP-CBO's capability to find close-to-global optima.
If cost is not an issue, the optimal design for weight reduction (Trial 2) will most likely be an all-composites design favoring UD and woven CFRP for their low density.
Table 11.5 ICME optimization experiments for the thin-walled hat section component
Trial Objective Baseline performance
1 Cost ($) 109.17
2 Weight (kg) 4.1
3 w1·Cost + w2·Weight (w1 = 1/110, w2 = 1/8) 6.12
Fig. 11.10 Optimization history for hat section designs a low-cost design b low weight design
c combined low cost and weight design
The injection-molded SFRP model is built on process simulations via Moldflow and
performance simulations via LS-DYNA. The flow of information and computational
models is given in Fig. 11.11g. We simulate the injection molding process of an
ASTM D638 tensile coupon under different conditions and assess its mechanical
Fig. 11.11 a–d Schematics of the design variables of the SFRP model e–f Schematics of the
process and performance simulations g The workflow for design evaluation
performance under tensile loading. The fibers’ aspect ratio is fixed at 5. The model
inputs are
1. Fiber choice (Fig. 11.11a). The fibers in the SFRP may be either glass or carbon.
2. Fiber volume fraction (Fig. 11.11b). The volume fraction of the fibers is a
continuous value ranging in [0.05, 0.2].
3. Mold material (Fig. 11.11c). The material of the mold, which affects the cooling
rate, can be chosen from aluminum, steel, copper, and brass.
4. Maximum injection rate (Fig. 11.11d). Three discrete choices are available, low,
medium, and high, which correspond to 100, 2000, and 5000 cm3 /s respectively.
5. Maximum injection pressure (Fig. 11.11d). Similar to the injection rate with the
low, medium, and high levels corresponding to 50, 100, and 180 MPa respectively.
6. Mold surface temperature (Fig. 11.11d). A continuous value in the range [20, 80] °C.
7. Melt temperature (Fig. 11.11d). The temperature at which the melted SFRP material is set to flow into the mold, a continuous value in the range [200, 280] °C.
The process simulation model predicts the local orientation state of the short fibers and outputs a field of second-order orientation tensors (Fig. 11.11e), showing that the microstructures are heterogeneous. These are converted to local material properties via a micromechanics model in Moldflow and passed to LS-DYNA for a tensile test simulation (Fig. 11.11f), with one end fixed and a 0.5 m/s velocity applied to the other end, resulting in a field of local strains. We examine the Green strain fields for performance evaluation.
We summarize the design variables in Table 11.9. The designs are varied by switching
the mold and the fiber materials and choosing from a collection of continuous
and discrete parameters for material design (fiber volume fraction) and processing
conditions.
The two categorical variables create a total of 8 possible combinations. For
simplicity, we treated the two discrete variables as continuous, and designed a total
of 96 simulation experiments based on a sliced Latin hypercube (Ba et al. 2015) with
8 slices and 12 points per slice.
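For readers who wish to reproduce a similar space-filling design, the sketch below builds one ordinary Latin hypercube per categorical combination with SciPy. A true sliced Latin hypercube (Ba et al. 2015) additionally coordinates the slices, which is not reproduced here, and the two discrete variables are simply carried as normalized placeholder columns; the bounds are taken from the text.

```python
import numpy as np
from scipy.stats import qmc

n_slices, pts_per_slice, n_cont = 8, 12, 5       # 8 categorical combinations, 96 runs in total
lower = [0.05, 20.0, 200.0, 0.0, 0.0]            # vol. fraction, mold T, melt T, 2 normalized discretes
upper = [0.20, 80.0, 280.0, 1.0, 1.0]

design = []
for s in range(n_slices):
    sampler = qmc.LatinHypercube(d=n_cont, seed=s)
    design.append(qmc.scale(sampler.random(n=pts_per_slice), lower, upper))
points = np.vstack(design)                        # 96 x 5 matrix of continuous settings
```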
The design variables are transferred to Autodesk Moldflow for process simu-
lations, and their outputs, the fields of local SFRP orientation states (Fig. 11.11),
are translated to local material property fields via a built-in material model for LS-
DYNA tensile test simulations. The final output is a six-component strain field, with each node carrying Green strains in the x, y, z, xy, yz, and xz directions. We summarize this large amount of data by extracting the following statistics of the strain fields:
1. Mean, the average value of the strain fields in all 6 directions. It characterizes
the general trend of the deformation,
2. Max, the maximum of the absolute value of the strain fields in all 6 directions,
which indicates the most extreme deformation across the field,
3. Standard deviation, denoted in the following tables and figures by sd, of the
strain fields in all 6 directions. It is an indicator of the range of the values in terms
of the local deformation, and
4. Short-range spatial correlation, computed by finding the closest 1% of node pairs in the FEA mesh and calculating their empirical correlation; this metric can be viewed as a quantitative measure of the smoothness of the fields (Fig. 11.12). A sketch of computing these statistics is given after this list.
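The sketch below extracts the four statistics from one strain component field; the node-pair search is written with a brute-force distance matrix for clarity (a KD-tree would be preferable for large meshes), and all names are ours.

```python
import numpy as np

def field_statistics(strain, coords, frac=0.01):
    """Mean, max |value|, standard deviation, and short-range spatial correlation
    of one Green strain component defined on the FEA nodes.
    strain : (n_nodes,) nodal values;  coords : (n_nodes, 3) nodal coordinates."""
    strain = np.asarray(strain, float)
    coords = np.asarray(coords, float)
    mean, vmax, sd = strain.mean(), np.abs(strain).max(), strain.std()

    # Closest `frac` fraction of node pairs (brute force, fine for small meshes)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    iu = np.triu_indices(len(strain), k=1)
    order = np.argsort(dist[iu])
    keep = order[: max(1, int(frac * order.size))]
    a, b = iu[0][keep], iu[1][keep]
    corr = np.corrcoef(strain[a], strain[b])[0, 1]   # empirical correlation = smoothness
    return mean, vmax, sd, corr
```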
The resulting dataset is partly visualized in Fig. 11.13 via scatter plots (discrete
variables not shown). It is evident that among the continuous variables, the fiber
Table 11.9 Design variables for the injection-molded SFRP tensile coupon
Group Name Type Choices
Material Fiber material Categorical 2: {glass, carbon}
Material Mold material Categorical 4: {aluminum, steel, copper, brass}
Material Fiber volume fraction Continuous [0.05, 0.2]
Process Mold surface temperature Continuous [20, 80]◦ C
Process Melt temperature Continuous [200, 280]◦ C
Process Max injection rate Discrete 3: {100, 2000, 5000} cm3 /s
Process Max injection pressure Discrete 3: {50, 100, 180} MPa
Fig. 11.12 Examples of the strain fields from the tensile simulations of injection-molded SFRP.
a a high smoothness example, b a low smoothness example
volume fraction is strongly correlated with the standard deviation and the spatial correlation of the strain fields, while the melt temperature visibly influences all performance criteria.
For the design example, we hypothesized an engineering use case in which the optimization objective maximizes the field smoothness while the constraint bounds the maximal Green strains. Here ri denotes the short-range spatial correlation (smoothness) in the i-th direction, and max(εiG) denotes the maximal value in the Green strain field εiG (i = x, y, xy). All
outputs are normalized to [0, 1] using the observed ranges among the 96 simulations.
The design objective ensures the fields are as smooth as possible to avoid strain
concentration, while the design constraint is set so that the maximal values in the
resulting x-, y-, and xy-direction Green strain fields are bounded. In other words, weak
material designs will be rejected.
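Schematically, this use case can be written as the following evaluation routine; the normalization and the strain limit below are hypothetical placeholders, since the chapter's exact threshold is not reproduced here.

```python
def smoothness_objective(r_x, r_y, r_xy):
    """Maximize field smoothness: minimize the negative sum of the normalized
    short-range spatial correlations in the x, y, and xy directions."""
    return -(r_x + r_y + r_xy)

def strain_constraint(max_ex, max_ey, max_exy, limit=0.5):
    """Bound the normalized maximal Green strains; `limit` is a hypothetical
    threshold. A positive return value means the (weak) design is rejected."""
    return max(max_ex, max_ey, max_exy) - limit
```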
Fig. 11.13 The scatter plots of continuous design variables and performances, with categorical
variables marked by colors and shapes. Responses regarding the z direction are not shown here
84 of the 96 data points, half from glass fiber designs and half from carbon, are randomly selected to build the LVGP model. The remaining 12 are reserved for validating the model via the MSE and root MSE (RMSE) of its predictions. The RMSE has the same unit as the output and can therefore be compared directly with the response values as a measure of accuracy. Only the validation results for the x and y directions are listed in Table 11.10, since z is the thickness direction of the tensile coupon and the z-related strain values are typically negligible compared to those in the x and y directions. It can be seen that the MSEs are small and the RMSEs are typically smaller than the response data by at least one order of magnitude, indicating excellent predictive capability of the trained LVGP.
A latent space for the mold materials from the y-smoothness model is plotted
in Fig. 11.14. Although there is no direct physical interpretation of the locations of
these candidate materials, their nonlinear alignment suggests that different materials
influence the results in multiple aspects.
The response surfaces for y-smoothness from different mold materials, condi-
tioned on glass fiber designs with fixed discrete variables and mold surface temper-
atures, are shown in Fig. 11.15. While Fig. 11.15a presents the complex shape of
the surfaces, the top view image in Fig. 11.15b shows that there is not a domi-
nating material that behaves uniformly better than the rest. Therefore, if we want to
perform design optimization for a metric such as y-smoothness, the searching will
Fig. 11.15 a The response surfaces of different mold materials from the glass fiber y-smoothness model with respect to fiber volume fraction and melt temperature, and b the top view of (a). The other design variables are kept constant (mold surface temperature fixed at its normalized value of 0.5)
Table 11.11 The roughness parameters (log10 (τk )) of LVGP in modeling y-direction statistics
k Name Mean Max SD Smoothness
1 Volume fraction 0.48 0.04 0.53 1.7
2 Mold surface temperature –1.3 2.7 –0.94 –0.74
3 Melt temperature 0.95 0.45 1.1 0.27
4 Max injection rate –1.8 –3 –1.3 –0.5
5 Max injection pressure –3 –3 –3 –3
involve switching among the candidate materials while tuning the combination of
the continuous variables concurrently, which can be realized with
LVGP.
As stated before, the roughness parameters, ranging within [10⁻³, 10³], indicate the importance of the non-categorical predictors. Those from the y-direction statistics models are summarized in Table 11.11. Their magnitudes generally agree with the trend in Fig. 11.13, as positive values are observed for predictors with at least moderate correlation with the field statistics. It can be concluded that for the y-direction results, the volume fraction and melt temperature are the two most important variables in determining the predicted values, while the maximum injection pressure is noninfluential.
We start the LVGP-CBO with 32 initial points randomly sampled from the 96-point DoE and then run 300 iterations of optimization. When the initial designs do not include a feasible one, the algorithm selects a random point from them as the incumbent best design.
Fig. 11.17 a Optimization history and b optimal designs in the performance space for injection-
molded SFRP designs
LVGP-CBO efficiently explores the mixed-variable design space and is therefore very promising for the concurrent design of CFRP materials and processes in the context of ICME.
11.5 Conclusions
The LVGP model provides not only accurate predictions over mixed-variable inputs but also insights into the physical phenomenon, such as the underlying similarity between levels of categorical variables and the importance ranking of design variables.
The LVGP model inherits all advantages of GP modeling, one of which is to offer
uncertainty prediction for sequential sampling. Combined with Bayesian Optimiza-
tion considering constraints, LVGP-CBO adds constrained efficient global optimiza-
tion on top of the predictive ML model. The proposed approach needs only a small
number of data points to start with, and continuously searches for designs that are
most likely to be optimal as predicted by LVGP in the mixed-variable design space.
The subsequent data collection (i.e. simulations) is guided by the CBO algorithm to
balance the exploration of uncertain regions and the exploitation of high-potential
sites for the optimal design. In contrast, the commonly used gradient-based or evolutionary optimization methods only look for improved (as opposed to optimal) designs in each iteration and are therefore less efficient. As a result, the framework allows multi-objective, constrained, and mixed-variable optimization with fewer required simulations, which makes it suitable for the ICME design of processes, materials,
and structures in a concurrent fashion.
To demonstrate the approach, we established two ICME workflows with fully
bridged part-scale manufacturing processes and structural performance simulations,
integrated with material-scale microstructure and material property models so that the
processing-induced local microstructure evolution is taken into account for design
evaluation. Both workflows require mixed-variable inputs and generate multiple
performance criteria as design objectives.
We first apply LVGP-CBO to concurrent material and part geometry design
of thin-walled hat sections with a focus on adopting CFRP composites for
lightweighting. Specifically, the goal is to reduce the weight and cost of the compo-
nent given the flexibility to choose among a variety of metal and CFRP materials, the
component’s geometry, and processing conditions, without sacrificing the mechan-
ical performance benchmarked by the all-steel design. Optimization results show that
LVGP-CBO is an efficient design framework that simultaneously searches for better combinations of materials, microstructures, processing, and product geometry. A weight reduction of 81.5% can be achieved at the cost of a 3.7% budget increase by replacing metal parts with composite ones and optimizing their geometries and
material designs.
The framework is also tested by concurrent process and material design of
injection-molded SFRP parts. The goal is to lower the strain concentration under
tensile loads while the designer has the freedom to choose both fiber and mold
materials and the SFRP’s processing parameters. Toward the goal of reducing strain
concentration constrained by the maximal local strain, LVGP-CBO is shown to realize automatic fiber and mold material selection, along with injection molding parameter optimization. With the integrated process–structure–property–performance simulations for design evaluation, the presented LVGP-CBO algorithm is applicable to other concurrent designs for manufacturability and high performance in ICME.
Acknowledgements The financial support from Ford Motor Company and the US Department of
Energy (Award Number: DE-EE0006867) is acknowledged.
References
Ankenman B, Nelson BL, Staum J (2008) Stochastic kriging for simulation metamodeling. In:
Proceedings—winter simulation conference, pp 362–370. https://doi.org/10.1109/WSC.2008.
4736089
Ba S, Myers WR, Brenneman WA (2015) Optimal sliced Latin hypercube designs. Technometrics 57:479–487. https://doi.org/10.1080/00401706.2014.957867
Bostanabad R, Kearney T, Tao S et al (2018a) Leveraging the nugget parameter for efficient Gaussian
process modeling. Int J Numer Methods Eng 114:501–516
Bostanabad R, Liang B, Gao J et al (2018b) Uncertainty quantification in multiscale simulation
of woven fiber composites. Comput Methods Appl Mech Eng 338:506–532. https://doi.org/10.
1016/J.CMA.2018.04.024
Chen Z, Huang T, Shao Y et al (2018) Multiscale finite element modeling of sheet molding compound
(SMC) composite structure based on stochastic mesostructure reconstruction. Compos Struct
188:25–38. https://doi.org/10.1016/j.compstruct.2017.12.039
Conti S, Gosling JP, Oakley JE, O’Hagan A (2009) Gaussian process emulation of dynamic computer
codes. Biometrika 96:663–676
Daddona DM, Antonelli D (2018) Neural network multiobjective optimization of hot forging.
Procedia CIRP 67:498–503. https://doi.org/10.1016/J.PROCIR.2017.12.251
Debski H, Rozylo P, Teter A (2020) Buckling and limit states of thin-walled composite columns
under eccentric load. Thin-Walled Struct 149:106627. https://doi.org/10.1016/J.TWS.2020.
106627
Deng X, Lin CD, Liu K-WK-W, Rowe RKK (2017) Additive gaussian process for computer models
with qualitative and quantitative factors. Technometrics 59:283–292. https://doi.org/10.1080/
00401706.2016.1211554
Fang H, Rais-Rohani M, Liu Z, Horstemeyer MF (2005) A comparative study of metamodeling
methods for multiobjective crashworthiness optimization. Comput Struct 83:2121–2136. https://
doi.org/10.1016/j.compstruc.2005.02.025
Gardner JR, Kusner MJ, Xu ZE et al (2014) Bayesian optimization with inequality constraints. 32
Huang T, Gao J, Sun Q et al (2021) Stochastic nonlinear analysis of unidirectional fiber compos-
ites using image-based microstructural uncertainty quantification. Compos Struct 260:113470.
https://doi.org/10.1016/J.COMPSTRUCT.2020.113470
Iyer A, Zhang Y, Prasad A et al (2019) Data-centric mixed-variable bayesian optimization for
materials design. In: Proceedings of the ASME design engineering technical conference 2A-
2019. https://doi.org/10.1115/DETC2019-98222
Jin R, Chen W, Simpson TW (2001) Comparative studies of metamodelling techniques under
multiple modelling criteria. Struct Multidiscip Optim 23:1–13. https://doi.org/10.1007/s00158-
001-0160-4
Jin R, Chen W, Sudjianto A (2005) An efficient algorithm for constructing optimal design of
computer experiments. J Stat Plan Inference 134:268–287
Li Y, Chen Z, Xu H et al (2017b) Modeling and simulation of compression molding process for
sheet molding compound (SMC) of chopped carbon fiber composites. SAE Int J Mater Manuf
10:130–137
Li J, Cheng K, Wang S et al (2017a) Feature selection: a data perspective. ACM Comput Surv 50.
https://doi.org/10.1145/3136625
Lin S-P, Chen Y, Zeng D, Su X (2017) Meso-modeling of carbon fiber composite for crash safety
analysis. In: WCX17: SAE world congress experience
Miao J-M, Cheng S-J, Wu S-J (2011) Metamodel based design optimization approach in promoting
the performance of proton exchange membrane fuel cells. Int J Hydrogen Energy 36:15283–
15294. https://doi.org/10.1016/j.ijhydene.2011.08.070
Miettinen K (1998) Nonlinear multiobjective optimization 12. https://doi.org/10.1007/978-1-4615-
5563-6
Nocedal J, Wright S (2000) Numerical optimization
Zhang Y, Apley DW, Chen W (2020) Bayesian optimization for materials design with mixed quantitative and qualitative variables. Sci Rep 10:1–13. https://doi.org/10.1038/s41598-020-60652-9
Zhang Y, Tao S, Chen W, Apley DW (2019) A latent variable approach to Gaussian process modeling with qualitative and quantitative factors. Technometrics 62:291–302. https://doi.org/10.1080/00401706.2019.1638834
Zhou Q, Qian PZG, Zhou S (2011) A simple approach to emulation for computer models with
qualitative and quantitative factors. Technometrics 53:266–273
Chapter 12
Machine Learning Interatomic
Potentials: Keys to First-Principles
Multiscale Modeling
Bohayra Mortazavi
12.1 Introduction
B. Mortazavi (B)
Institute of Photonics, Department of Mathematics and Physics, Leibniz Universität Hannover,
Appelstraße 11, 30167 Hannover, Germany
e-mail: [email protected]
Cluster of Excellence PhoenixD (Photonics, Optics, and Engineering–Innovation Across
Disciplines), Leibniz Universität Hannover, Hannover, Germany
Quantum mechanics models solve the electronic structure of a system and hence evaluate the interaction of electrons and nuclei on the basis of electronic structure information. These models can also estimate the interatomic energy by calculating the electronic interatomic bonds. Within the popular Born–Oppenheimer approximation, the atomic nuclei, the so-called “atoms,” are treated as classical particles, while the electrons are treated as quantum mechanical particles. The DFT method is currently the most extensively employed quantum mechanics solution, showing exceptional accuracy and computational efficiency, particularly for crystalline and highly symmetric structures. The complicated theoretical background of DFT will not be discussed here, but it is worth noting that the main drawback of this approach is that the computational cost increases steeply (roughly with the cube of the system size in conventional implementations) with the number of atoms. Moreover, vacuum regions in plane-wave DFT add further computational burden. Currently, DFT calculations are limited to studying very small systems, consisting of a few hundred atoms with conventional processors or a few thousand atoms, achievable only with advanced supercomputers. For the majority of problems in mechanics
and thermodynamics, however, one only deals with forces, energies, and stresses, so the data related to the electronic structure are of no direct use. Based on this elementary concept, empirical or machine learning interatomic potentials are able to provide a substantially reduced computational cost, with the goal of only a marginal sacrifice in accuracy. Apart from the computational cost, in Sect. 12.4 we will discuss a few other bottlenecks of the DFT method in the analysis of thermal and mechanical properties.
Atomic forces can be derived from an empirical interatomic potential function, which is one of the most computationally efficient approaches. Mechanical forces in an atomistic system can be divided into conservative and nonconservative ones. Conservative forces depend only on the positions of the particles, irrespective of the instantaneous velocities and of the trajectories between different positions. Dissipative or gyroscopic forces, such as mechanical friction, are nonconservative. For an atomistic system with only conservative forces, one can define a specific function U, called the potential energy, which depends solely on the coordinates of the particles,
U = U(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N)     (12.1)

which is commonly expanded into a sum of m-body contributions,

U = \sum_i U_1(\mathbf{r}_i) + \sum_{i<j} U_2(\mathbf{r}_i, \mathbf{r}_j) + \sum_{i<j<k} U_3(\mathbf{r}_i, \mathbf{r}_j, \mathbf{r}_k) + \cdots     (12.2)

where \mathbf{r}_i are the position vectors of the particles and the function U_m is the m-body potential. U_1 naturally represents the energy due to an external force field, such as gravity; the second term gives the potential energy of pair-wise interactions of the particles; the third gives the three-body contributions, and so on. On this basis, U_1 is the external potential, U_2 is the pair-wise or two-body term, and U_m with m > 2 is a multibody term. Among the most well-known two-body empirical interatomic potentials, one can refer to the Lennard–Jones, Morse, or Buckingham potentials. The Lennard–Jones potential can take various forms, of which the so-called “12–6” type is expressed as follows:
U_{LJ}(r_{ij}) = 4\varepsilon \left[ \left( \frac{\sigma}{r_{ij}} \right)^{12} - \left( \frac{\sigma}{r_{ij}} \right)^{6} \right]     (12.3)
where ε is the depth of the potential well and σ is the distance at which the pair energy crosses zero (the energy minimum lies at r = 2^{1/6}σ). Two-body empirical interatomic potentials are mostly used to describe non-bonding or long-range interactions; a minimal sketch of evaluating this pair interaction is given below.
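This is only an illustration of Eq. (12.3); the parameter values are arbitrary argon-like numbers, not tied to any system discussed in this chapter.

```python
import numpy as np

def lj_energy(r, eps, sigma):
    """12-6 Lennard-Jones pair energy of Eq. (12.3) at separation r."""
    sr6 = (sigma / r) ** 6
    return 4.0 * eps * (sr6 ** 2 - sr6)

def lj_force(r, eps, sigma):
    """Magnitude of the pair force, F = -dU/dr (positive = repulsive)."""
    sr6 = (sigma / r) ** 6
    return 24.0 * eps * (2.0 * sr6 ** 2 - sr6) / r

eps, sigma = 0.0103, 3.4                              # illustrative: eV, Angstrom
u_min = lj_energy(2 ** (1 / 6) * sigma, eps, sigma)   # equals -eps at the minimum
```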
The higher-order multibody terms of the potential function (m > 2) are generally required for modeling more complex interactions in solids, as they allow chemical bonds, topology, and the spatial arrangement of atoms to be accounted for more accurately. In three-body potentials, for example, the force between two bonded atoms depends not only on their positions but also on the positions of third atoms within a defined cutoff distance. The Tersoff potential (Tersoff 1988), one of the most popular three-body potentials, has been extensively used to study covalent systems such as graphene and diamond and takes the following form:
U_T(r_{ij}) = f_c(r_{ij}) \left[ A_{ij}\, e^{-\lambda_{ij} r_{ij}} - B_{ij}\, e^{-\mu_{ij} r_{ij}} \right]     (12.4)

\zeta_{ij} = \sum_{k \neq i,j} f_c(r_{ik})\, \omega_{ij}\, g(\theta_{ijk})     (12.6)

g(\theta_{ijk}) = 1 + \frac{c^2}{d^2} - \frac{c^2}{d^2 + (h - \cos\theta_{ijk})^2}     (12.7)

f_c(r_{ij}) = \begin{cases} 1, & r_{ij} < R_{ij} \\ \frac{1}{2} + \frac{1}{2}\cos\left(\pi \, \frac{r_{ij} - R_{ij}}{S_{ij} - R_{ij}}\right), & R_{ij} < r_{ij} < S_{ij} \\ 0, & r_{ij} > S_{ij} \end{cases}     (12.8)

where f_c(r_{ij}) is a smooth cutoff function, A_{ij}, B_{ij}, \lambda_{ij}, and \mu_{ij} are pair-dependent parameters, the bond-order contribution depends on the effective coordination \zeta_{ij}, and g(\theta_{ijk}) is an angular function with fitting parameters c, d, and h.
Another widely used family of many-body potentials, particularly for metallic systems, is the embedded atom method (EAM), whose total energy is written as

U_{EAM} = \sum_i L_i(\rho_{h,i}) + \frac{1}{2} \sum_i \sum_{j \neq i} \phi_{ij}(r_{ij})     (12.9)
where ρ h,i is the host electron density at atom i due to all other background atoms
in the system, L i [ρ] is the energy to embed the atom i into the background electron
density ρ, and φ ij is a pair-wise component between atoms i and j. The host electron
density can be expressed:
\rho_{h,i} = \sum_{j \neq i} \rho^{*}(r_{ij})     (12.10)
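The structure of Eqs. (12.9)–(12.10) can be made concrete with the following sketch, in which the embedding function L, the pair term φ, and the single-atom density ρ* are supplied by the user; this illustrates the functional form only and is not a parameterized EAM potential.

```python
import numpy as np

def eam_energy(positions, embed, pair, rho_single, r_cut):
    """Total EAM energy of Eqs. (12.9)-(12.10) for one configuration.
    positions : (N, 3) coordinates; embed, pair, rho_single : user-supplied callables."""
    n = len(positions)
    energy = 0.0
    for i in range(n):
        host_rho = 0.0
        for j in range(n):
            if i == j:
                continue
            r = np.linalg.norm(positions[i] - positions[j])
            if r < r_cut:
                host_rho += rho_single(r)        # Eq. (12.10): host electron density
                energy += 0.5 * pair(r)          # pair-wise term of Eq. (12.9)
        energy += embed(host_rho)                # embedding term of Eq. (12.9)
    return energy
```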
During the last decade, machine learning methods have been extensively employed
to accelerate the evaluation of various physical properties of materials (Ouyang et al.
2021; Novikov et al. 2021; Hu et al. 2020; Chakraborty et al. 2020). Among various
fields, one of the successful applications has been machine learning-based interatomic potentials, which can be employed in conventional molecular dynamics simulations or used directly to evaluate interatomic forces and calculate desired physical properties, like phonon dispersion relations. MLIPs are nonparametric interatomic potentials designed with the goal of providing quantum-mechanics-level accuracy at a computational cost on the order of that of empirical interatomic potentials. In the terminology of regression analysis, interatomic potentials are either parametric or nonparametric (Shapeev 2016). The main difference between these two approaches is the capability of nonparametric interatomic potentials to reach a higher level of accuracy by systematically increasing the number of their parameters, which consequently imposes higher computational costs. To achieve high accuracy in rather large systems, nonparametric potentials are therefore the more suitable choice. A parametric interatomic potential, like Tersoff, on the other hand, uses a fixed number of parameters for all studied systems, which as expected limits the accuracy, though its computational efficiency is robust and unmatched. An MLIP consists of two basic elements: the “descriptors” and the “regression model,” which is itself a function of the descriptors. The descriptors capture the atomic environment, irrespective of the type of the studied system, with a cutoff function for computational efficiency. Among various possibilities, Behler–Parrinello symmetry functions (Behler and Parrinello 2007) and bispectrum coefficients (Thompson et al. 2015) are currently among the most popular descriptors for developing MLIPs (Yanxon et al. 2020). It is worth noting that the type of descriptor may substantially affect the performance of an MLIP (Thompson et al. 2015). The regression model is the second basic element of MLIPs. For the development of MLIPs, there are various regression methods, such as linear/polynomial regression,
Kernel methods, and artificial neural networks (Behler and Parrinello 2007; Behler
2015). MLIPs are developed on the basis of locality in the interatomic interactions,
which means that the total energy of an atomistic configuration x, approximated
with a function with θ parameters, E(x; θ), can be partitioned into contributions of
individual atoms as a function of the environment of this atom:
E(\mathbf{x}; \theta) = \sum_i V(\mathbf{r}_i; \theta)     (12.11)

where \mathbf{r}_i is the collection of vectors connecting the ith atom, x_i, with the other atoms in its environment within a predefined cutoff distance R_cut. It can be seen that
the function V is the site energy and is equivalent to the interatomic potential. As
mentioned earlier, linear/polynomial regression, Kernel methods, and artificial neural
networks can be used for the representation of the site energy V. In MLIPs, a large set of descriptors is used to describe the atomic environments, in order to ensure a reliable reconstruction of any reasonable environment.
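The locality ansatz of Eq. (12.11) can be sketched as follows; the learned regression model behind site_energy (a neural network, kernel, or moment tensor expansion) is left abstract, and the neighbor search is written naively for clarity.

```python
import numpy as np

def total_energy(positions, site_energy, r_cut):
    """E(x; theta) = sum_i V(r_i; theta), with r_i the vectors from atom i to
    all neighbors within the cutoff R_cut (the local atomic environment)."""
    energy = 0.0
    n = len(positions)
    for i in range(n):
        vectors = [positions[j] - positions[i]
                   for j in range(n)
                   if j != i and np.linalg.norm(positions[j] - positions[i]) < r_cut]
        energy += site_energy(np.array(vectors))   # contribution of atom i
    return energy
```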
The machine learning methodology for constructing nonparametric interatomic potentials, like its other counterparts, relies mostly on data. MLIPs thus follow the same concept as other machine learning methods: with the aid of sufficiently large data, the importance of prior knowledge concerning the underlying physics becomes less critical. As such, after defining a potential function, the corresponding parameters are fitted to the quantum mechanics data. As in a routine machine learning methodology, the performance and accuracy can subsequently be tested and, if needed, either the potential parameters or the training data can be modified. In comparison with empirical interatomic potentials, the main challenge of MLIPs originates from their strong dependency on the quantum mechanics data. This means that the configurations or atomic environments that evolve during a simulation should be compatible with the quantum mechanics dataset seen during the training process. For example, while a Tersoff potential parameterized for pristine graphene (Lindsay 2010) could be employed to study the thermomechanical properties of highly defective, so-called amorphous graphene (Mortazavi et al. 2016), an MLIP trained for pristine graphene cannot be reliably used for defective systems. On this basis, MLIPs show a common transferability issue, which can be resolved by enriching the training data or by adopting an active learning methodology, in which new configurations with unexplored atomic environments are gradually included in the training data and new MLIPs are subsequently re-trained.
Having established a basic understanding of the concept of MLIPs, in this section we briefly discuss the standard procedure for developing an MLIP and highlight the common challenges.
For practical use by a non-expert, various MLIP frameworks are already available that can be trained using quantum mechanics datasets. The following are the mainstream methods:
Neural network potentials. Behler and Parrinello (2007) pioneered the field of MLIPs in 2007 by proposing the concept of neural network potentials (NNPs), with rotation- and permutation-invariant descriptors and a neural network regression approach. To date, NNPs have been the most extensively used MLIPs (Behler 2014, 2016). For the practical development of NNPs, several platforms have been created, among them RuNNer (Artrith et al. 2011; Behler n.d.), RubNNet4MD (Brieuc et al. n.d.), ænet (Artrith and Urban 2016), n2p2 (Singraber et al. 2019), SIMPLE-NN (Lee et al. 2019), and PyXtal-FF (Yanxon et al. 2021). A more recent approach is the deep tensor neural network (Schütt et al. 2017), which has been extended to conduct molecular dynamics simulations employing the DeePMD-kit (Wang et al. 2018a).
Gaussian approximation potentials. Gaussian approximation potentials (GAPs) (Bartók et al. 2010) are among the earliest types of highly accurate MLIPs. GAPs use smooth-overlap-of-atomic-positions kernel descriptors together with Gaussian process regression. GAPs are well known for their outstanding accuracy and their ability to build transferable MLIPs (Rowe et al. 2020), but their exceedingly high computational cost is their main drawback. QUIP (Bartók and Csányi 2015) is a platform for the practical development of GAPs.
Moment tensor potentials. Proposed by Shapeev (2016), moment tensor potentials (MTPs) are among the most accurate and computationally efficient MLIPs. The descriptors of MTPs are based on moment tensors, and the overall approach shares similarities with NNPs. The accuracy of MTPs for the examination of thermal transport (Mortazavi et al. 2021a, 2020a; Liu et al. 2021; Arabha and Rajabpour 2021; Arabha et al. 2021) and mechanical (Rowe et al. 2020; Mortazavi et al. 2021b, 2022a) properties has been confirmed by several studies. The MLIP package (Novikov et al. 2021) can be used for the efficient training of MTPs.
Spectral neighbor analysis potentials. Another family of MLIPs are the spectral neighbor analysis potentials (SNAPs) (Thompson et al. 2015). In SNAPs, the basis functions are constructed as cubic polynomials of the spherical harmonic expansion coefficients of the (nonsmooth) atomic density, similar to those used in GAPs. Like MTPs, regression is employed to find the parameters of SNAPs. The PyXtal-FF package (Yanxon et al. 2021) can be used to develop SNAPs, alongside NNPs.
Neuroevolution machine learning potentials. Although they share similar basics with NNPs, neuroevolution potentials (NEPs) (Fan et al. 2021) are among the latest MLIPs; they use descriptors based on Chebyshev and Legendre polynomials and a feedforward neural network regression model. The GPUMD package (Fan et al. 2022) is available for the training of NEPs. In a recent study (Ying et al. 2023), NEPs could accurately reproduce the complex mechanical responses of single-layer fullerene networks as compared with DFT results.
For a user with limited knowledge of MLIPs, the choice of an efficient method for a particular problem is not a straightforward decision. Although MLIPs can be conveniently developed for a given system, their computational costs are rather high. Moreover, if the accuracy has to be tested against expensive quantum mechanics simulations, the overall computational efficiency drops even further. Therefore, to ease this dilemma for a new user, comparative studies are highly beneficial. In one of the first comparative studies, published in 2020 (Zuo et al. 2020), GAP, MTP, NNP, and SNAP potentials were developed for the Li, Mo, Cu, and Ni metals and the Si and Ge covalent systems. Comparison of the MLIP results with those of empirical interatomic potentials confirms that all considered MLIPs yield considerably higher accuracy, which reveals their superiority in predicting energies and forces. The results shown in Fig. 12.1 suggest that MTPs offer the best tradeoff between accuracy and computational cost. GAPs can yield the highest accuracy, but at almost two orders of magnitude higher computational cost than MTPs. Nonetheless, the aforementioned findings are based on a particular case study; while the predicted differences in the computational cost of the various methods should follow similar patterns, the accuracy ranking may change for another problem or system. In addition, more elaborate comparisons should be conducted for multicomponent systems.
After deciding on the functional form of an MLIP for a particular problem, one next has to design and implement a training process. In this regard, the next choice is how to create the training dataset and subsequently fit the corresponding parameters. Belonging to the family of machine learning algorithms, MLIPs are not designed to extrapolate outside the training environment. As such, depending on the problem of interest, the prepared dataset should be rich enough to include all relevant configurations. A few examples are instructive here. Ab-initio molecular dynamics (AIMD) simulations are currently the most popular approach for creating datasets for MLIPs. For phononic properties, like the phonon dispersion and phonon group velocities, the methods for extracting these physical properties are mostly based on small-displacement approaches; for such problems, an MLIP therefore has to be highly accurate only for small displacements around the ground state. As shown in a recent study (Mortazavi et al. 2020b), MLIPs trained over short, 1000-step AIMD simulations conducted at 50 K could reproduce the phononic properties in close agreement with density functional perturbation theory (DFPT) results. When one considers the examination of the lattice thermal conductivity using the Boltzmann transport equation (BTE), as discussed in another recent study (Mortazavi et al. 2021a), the incorporation of AIMD results at high temperature becomes important in order to enhance the accuracy and better capture the anharmonic effects on the thermal transport. For the analysis of complex mechanical properties, the AIMD results ought to also include stretched samples in which rupture gradually occurs as the temperature is increased (Mortazavi et al. 2021b, 2022b). Clearly, depending on the type of problem, an efficient strategy for preparing the dataset has to be adopted. In the three aforementioned examples, the candidate structures that define the atomic environments during the simulations were predictable, and one could produce sufficiently large structures representative of those conditions. Without enough knowledge about the candidate structures, however, as in the energy minimization of novel compositions, one has no other way but to use active learning to assemble the training set during the simulations.
As the general basis, quantum mechanics-based data are required for the training of MLIPs. As mentioned earlier, AIMD simulations are currently the most popular approach for this purpose. Unlike ground state simulations, the temperature can be adjusted in AIMD simulations; higher temperatures are usually equivalent to larger deformations, and vice versa. With AIMD simulations, one therefore has control over the extensiveness of the sampled atomic environments. Moreover, one major issue with quantum mechanics-based data is the self-consistent loop required for the convergence of the electronic structure calculations. In an AIMD simulation, the converged self-consistent solution of the previous step can secure faster convergence in the next step, which can substantially accelerate the calculations. It should nonetheless be noted that at high temperatures, since the difference in the atomic positions between two consecutive steps may become considerable, the self-consistent loop may converge more slowly. Another critical issue is the size of the structures considered in the AIMD simulations. Larger structures can be beneficial for describing more complex atomic environments; nonetheless, the computational cost of AIMD simulations increases steeply with the number of atoms, which may affect the computational efficiency. Generally, for periodic systems, the size of the simulation cell should be more than twice the cutoff distance of the potential. A further issue is that the atomic configurations obtained from AIMD simulations are correlated over nearby time steps and thus do not describe new useful atomic environments; the incorporation of the complete AIMD data may therefore lead to overfitting. To avoid this problem, normally only a portion of the AIMD results is utilized for training MLIPs.
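A common way to realize this subsampling is to discard an initial equilibration period and then keep only every few frames of the AIMD trajectory; the stride and burn-in values below are illustrative choices, not a recommendation from the chapter.

```python
import numpy as np

def subsample_trajectory(n_frames, stride=10, burn_in=100):
    """Indices of AIMD frames kept for MLIP training: drop the equilibration
    period, then keep every `stride`-th frame so that the retained
    configurations are only weakly correlated."""
    return np.arange(burn_in, n_frames, stride)

keep = subsample_trajectory(n_frames=1000)     # e.g. 90 of 1000 frames retained
```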
After creating the training dataset with sufficiently large structures to capture the atomic environments, the interatomic potential's parameters are fitted with the goal of minimizing the difference between the energies, forces, and stresses predicted by the MLIP and those from the quantum mechanics calculations. Since this concept is fundamental, we present here the fitting objective for the MTP (Novikov et al. 2021; Shapeev 2016):
et al. 2021; Shapeev 2016):
K
N
2
we E kAI M D − E kM T P + w f f AI M D − f M T P 2
k,i i k,i
k=1 (12.12)
3 AI M D
+ws σ − σ M T P 2
→ min ,
i, j=1 k,i j k,i j
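In code, the fitting objective of Eq. (12.12) amounts to a weighted sum of squared residuals over the training configurations; the sketch below assumes NumPy arrays of AIMD reference data and MLIP predictions, with purely illustrative weights.

```python
import numpy as np

def fitting_loss(E_ref, E_mlip, F_ref, F_mlip, S_ref, S_mlip,
                 w_e=1.0, w_f=0.01, w_s=0.001):
    """Weighted least-squares objective of Eq. (12.12), summed over the K
    training configurations. Shapes: E_* (K,), F_* (K, N, 3), S_* (K, 3, 3)."""
    return (w_e * np.sum((E_ref - E_mlip) ** 2)
            + w_f * np.sum((F_ref - F_mlip) ** 2)
            + w_s * np.sum((S_ref - S_mlip) ** 2))
```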
For active learning, not only is the practical implementation more complex, but the computational cost is also substantially higher than that of passive fitting.
An efficient possible solution is to combine the concepts of active and passive fitting, in order to minimize the need for frequent quantum mechanics calculations and subsequent MLIP re-training. Consider, for example, the problem of nanoindentation: around the indenter tip one expects the formation of amorphous configurations, whereas in the far regions the crystal structure remains intact. One can therefore include crystalline and various amorphous lattices in the original training datasets and conduct AIMD simulations at variable temperatures and under different initial strains, to artificially simulate structural transitions and failures. The complete AIMD trajectories can then be subsampled to avoid overfitting and to train preliminary MLIPs efficiently. The accuracy of these first fitted MLIPs can next be examined over the complete AIMD dataset, and the configurations with the worst extrapolation grades (Podryabinkin and Shapeev 2017) can be identified and incorporated into the original subsampled dataset. The final MLIPs, with enhanced accuracy and stability, can then be re-fitted using the improved training dataset. This two-step passive fitting approach has recently been applied successfully in several studies on the basis of MTPs (Mortazavi et al. 2021b, 2022a, b, c; Mortazavi 2021), with confirmed accuracy as compared with DFT results.
Potential choice. As mentioned earlier, the accuracy levels of the popular MLIPs are all very close to quantum mechanics simulations; nonetheless, it is not well established which type of MLIP, and with which combination of hyperparameters, cutoff distance, and training strategy, is best suited for a particular problem. GAPs are
well known for their accuracy, but also for their expensive computational cost (Behler 2014, 2016). MTPs can be two orders of magnitude faster than GAPs with a comparable level of accuracy (Zuo et al. 2020). NNPs and SNAPs are other MLIPs that are more computationally efficient than GAPs, with good accuracy for complex systems. Among all available MLIPs, the MTP method is, to the best of our knowledge, the only approach with confirmed accuracy for the simulation of the lattice thermal conductivity with either the Boltzmann transport equation (Mortazavi et al. 2021a; Liu et al. 2021) or molecular dynamics calculations (Mortazavi et al. 2020a; Korotaev et al. 2019) and for the evaluation of complex mechanical properties (Mortazavi et al. 2021b, 2022a, b, c; Mortazavi 2021). Other MLIPs may nevertheless yield similar performance or may fail; therefore, direct comparisons with first-principles calculations are highly desirable to ensure their reliability. In the end, two different MTPs or GAPs developed with different hyperparameters, or trained over dissimilar training datasets, are expected to yield close, but not identical, predictions.
Transferability issue. As discussed earlier, MLIPs are expected to work accurately only within the atomic environments fed to them during training. For example, an MLIP trained for a pristine structure without any defect is unstable or unreliable for defective samples. Consider DFT calculations for geometry optimization: for initially inaccurate positions, the commonly used DFT schemes are expected to conveniently find the global, or at least a local, minimum energy configuration, whereas with MLIPs this process requires a more complex fitting. As such, unless the number of required geometry optimization simulations is enormous or the structures are very large, common DFT is more reliable and computationally efficient than MLIPs. Looking at this challenge from another angle, a more transferable MLIP most probably yields a lower accuracy for a given lattice, and moreover its training process becomes more complex and computationally demanding. For studying a specific system and problem, one can thus develop a dedicated MLIP, because the ideal goal of MLIPs is to obtain accurate results for a given problem. For example, the mechanical and thermal properties of various graphene-like BC2N lattices have recently been studied using MTPs (Mortazavi et al. 2022a), in which a separate MTP was trained for every lattice to maximize the accuracy. Transferable MLIPs are nonetheless the only viable route for the accelerated discovery of new materials and structures (Podryabinkin et al. 2019).
Long-range interactions. Long-range interactions, such as van der Waals and
electrostatic interactions, exist in many important systems and their incorporation
in standard MLIPs can be highly beneficial for broadening the application range.
Because of the high computational cost of MLIPs, increasing the cutoff distance is not only exceedingly expensive but may also degrade the accuracy of describing the critical short-range interactions. To address this issue, one promising possibility is to consider the short-range interactions via a standard MLIP and simultaneously capture the long-range counterparts with empirical interatomic potentials, like Lennard–Jones, as has most recently been accomplished successfully for 2D vdW heterostructures (Novikov et al. 2022). Nonetheless, because of the critical importance of this aspect, novel computationally efficient approaches need to be devised, implemented, and tested.
Although quantum mechanics methods are well known for their superior accuracy, when they are combined with other methods to evaluate a desired property, like the lattice thermal conductivity, one may face several technical challenges, which are discussed in the following. On the other side, empirical interatomic potentials are more convenient to use for a wide range of problems, but as discussed earlier they suffer from accuracy and flexibility issues when studying novel compositions. Graphene, the 2D form of carbon, with its highly symmetric atomic lattice, lack of magnetism, and short-range covalent bonding, is among the simplest materials to investigate theoretically. In this section, we consider graphene and other graphene-like covalent systems to discuss the bottlenecks of quantum mechanics and empirical interatomic potentials in the examination of thermal transport and mechanical properties. Taking into account the simplicity of the considered graphene-like structures, one may better appreciate the technical difficulties encountered for more complex structures.
The phononic (lattice) thermal conductivity is commonly evaluated by solving the Boltzmann transport equation (BTE) for the phonon distribution function n(q), where q is the phonon wave vector and v(q) is the group velocity. The term ∇n(q) · v(q) corresponds to the rate of change of the distribution function caused by phonon motion, and [∂n(q)/∂t]_c refers to the same variation rate due to collisions; balancing the two leads to a steady-state condition (Ward 2009). Different packages nowadays offer the ability to estimate the thermal conductivity using the BTE
method. For instance, the ShengBTE package (Li et al. 2014) provides the ability
to determine the lattice thermal conductivity by solving the BTE utilizing a set
of interatomic force constants. A BTE solution requires both harmonic and anhar-
monic force constants to evaluate lattice thermal conductivity. To obtain anharmonic
force constants, depending on the symmetry and cutoff distance, a few hundred to
several thousand force calculations over relatively large supercell structures have to
be performed, which can be computationally exceedingly expensive with conven-
tional DFT methods. Nonetheless, when using first-principles DFT-BTE methods, the estimate of the thermal conductivity may change depending on the choice of exchange–correlation functional and on the computational details of the harmonic and anharmonic force constant calculations, such as the plane-wave cutoff energy, the k-point mesh size, the supercell size, the q-grid of the BTE solution, or the cutoff distance in the evaluation of the anharmonic force constants. Clearly, a convergence test to ensure the independence of the BTE estimate from these parameters can be very burdensome in practice. To better understand the challenge of the DFT-BTE method in
the estimation of the phononic thermal conductivity, we consider the two cases of graphene and C3N monolayers. According to several DFT-BTE-based studies (Qin et al. 2018; Taheri et al. 2018), the type of exchange–correlation functional may have significant effects on the estimated lattice thermal conductivity of graphene (Fig. 12.3a). In contrast with these DFT-based predictions (Qin et al. 2018; Taheri et al. 2018), the estimations by the MTP-based BTE solution revealed negligible effects of the exchange–correlation functional on the lattice thermal conductivity of graphene (see the MTP results for PBE, PBEsol, and revPBE in Fig. 12.3b) (Mortazavi et al. 2021a). Interestingly, a later DFT-BTE-based study (Taheri et al. 2021) confirmed the earlier predictions by the MTP-based BTE solution (Mortazavi et al. 2021a) and found close values for the thermal conductivity of graphene with different exchange–correlation functionals (Fig. 12.3c); more importantly, their estimated values (Taheri et al. 2021) are consistent with the earlier MLIP-based estimations (Mortazavi et al. 2021a). The authors of Taheri et al. (2021) argued that the scatter in the earlier full-DFT predictions can be associated with an inaccurate description of the quadratic dispersion of the out-of-plane acoustic mode of graphene. The second example is graphene-like C3N, for which, on
Fig. 12.3 The thermal conductivity of graphene for different exchange–correlation functionals, by
(a) Qin et al. (Reprinted from Qin et al. 2018, Copyright 2018, Elsevier), Mortazavi et al. (Reprinted
from Mortazavi et al. 2021a, Copyright 2021, Elsevier) and Taheri et al. (2021) (Reprinted from
Taheri et al. 2021, Copyright 2021, American Physical Society)
the basis of the PBE functional using the VASP package (Kresse and Furthmüller 1996) and the BTE solution with the ShengBTE package (Li et al. 2014), the room temperature thermal conductivity was reported to be 380 (Wang et al. 2019), 128 (Kumar et al. 2017), 80 (Wang et al. 2018b), 380 (Gao et al. 2018), and 482 (Peng et al. 2018) W/mK, which unexpectedly shows a six-fold scatter. In another recent study (Liu et al. 2021), the authors extended the MTP-based BTE solution (Mortazavi et al. 2021a) and considered four-phonon scattering in the evaluation of the lattice thermal conductivity of bulk BAs using the MTP method, and they found excellent agreement with full-DFT estimations. It can be concluded that MLIPs not only can yield accurate estimations and significantly accelerate the calculations, but, since they are less dependent on the details of the AIMD simulations of the training dataset and since larger supercell sizes and cutoff distances impose almost negligible additional costs, they can also substantially facilitate the examination of the thermal conductivity.
On the other side, using the original (Tersoff 1989), AIREBO (Stuart et al. 2000),
REBO (Brenner et al. 2002) and optimized Tersoff (Lindsay and Broido 2010) empir-
ical interatomic potentials, the room temperature thermal conductivity of graphene
were estimated to be 870 (Wei et al. 2011), 709 (Hong et al. 2018a), 350 (Thomas et al.
2010) and ~3000 W/m K (Mortazavi and Rabczuk 2015; Fan et al. 2017), respec-
tively. Based on the aforementioned results, only optimized Tersoff (Lindsay and
Broido 2010) potential yield a reasonable value. Interestingly, for the graphene-like
C3 N monolayer, different studies on the basis of Tersoff (Lindsay and Broido 2010;
KInacI et al. 2012) potentials, the thermal conductivity at 300 K was predicted to be
805–520 (Mortazavi 2017), 810–826 (Hong et al. 2018b), 775 (Han et al. 2019), 806
(Dong et al. 2018), 800 (Song et al. 2019), and 780 (Hatam-Lee et al. 2020) W/m K. It
is clear that while these values by more than two-fold overestimate DFT-based results,
but they are all very close, highlighting the good reproducibility of empirical inter-
atomic potentials estimations. In our recent study for three different graphene-like
BC2 N monolayers (Mortazavi et al. 2022a), MTP-based results confirm that empir-
ical interatomic potentials estimations for thermal conductivity can be nonphysical
and misguiding as well (Mortazavi et al. 2022a). With the presented literature in
12 Machine Learning Interatomic Potentials: Keys to First-Principles … 443
this section, the advantages of MLIPs for the examination of thermal transport, in terms of
accuracy and simplicity compared with conventional methods, are highlighted; however,
their computational cost remains the major bottleneck when they are
employed within molecular dynamics simulation platforms.
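For the molecular dynamics route mentioned above, a common post-processing step is the Green-Kubo relation, which converts the equilibrium heat-flux autocorrelation function into a thermal conductivity. The sketch below shows that step only; the heat-flux series, time step, volume and temperature in the usage example are synthetic placeholders standing in for data dumped from an MD run with whichever potential (empirical or MLIP) is employed.

```python
import numpy as np

def green_kubo_kappa(J, dt, volume, temperature, kB=1.380649e-23):
    """Running Green-Kubo thermal conductivity (W/m K), averaged over x, y, z.

    J           : (n_steps, 3) heat flux in W/m^2
    dt          : sampling interval in s
    volume      : cell volume in m^3
    temperature : mean temperature in K
    """
    n = len(J)
    nmax = n // 2                              # correlate up to half the run
    acf = np.zeros((nmax, 3))
    for lag in range(nmax):                    # heat-flux autocorrelation
        acf[lag] = np.mean(J[:n - lag] * J[lag:], axis=0)
    prefactor = volume / (kB * temperature ** 2)
    kappa_running = prefactor * np.cumsum(acf, axis=0) * dt
    return kappa_running.mean(axis=1)

# Usage with synthetic data (placeholder for a real MD heat-flux dump)
rng = np.random.default_rng(0)
J = rng.normal(scale=1.0e9, size=(5000, 3))            # W/m^2, placeholder
kappa = green_kubo_kappa(J, dt=1.0e-15, volume=1.0e-26, temperature=300.0)
print(kappa[-1])   # in practice one reads off the converged plateau value
```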
Fig. 12.4 Uniaxial stress–strain curve and failure mechanism of the C5N monolayer predicted by
DFT (at 0 K) and MTP-based (at 1 K) models, with and without inclusion of vdW dispersion
correction in the AIMD dataset preparation. (Reprinted from Mortazavi et al. 2022c, Copyright
2022, Royal Society of Chemistry)
strain hardening can be removed by trial-and-error modification of the potential's
cutoff distance function (Mortazavi et al. 2016; He et al. 2014) and by tuning it to reproduce
the experimental tensile strength value. As shown in Fig. 12.5, a passively fitted
MTP model, with simulations conducted at 1 K, could very closely reproduce the direction-dependent
stress–strain curves of graphene when compared with DFT results. This example clearly
reveals the superiority of MLIPs in the analysis of mechanical properties: a rapidly
trained MTP can substantially outperform widely employed empirical
interatomic potentials with respect to accuracy.
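As a small, hedged illustration of how the quantities compared in Fig. 12.5 are typically extracted, the snippet below fits the 2D Young's modulus from the low-strain part of a uniaxial stress–strain curve and reads off the ultimate tensile strength. The numerical values are placeholders, not data from the cited studies, and the linear-regime cutoff is an assumption.

```python
import numpy as np

# Placeholder uniaxial stress-strain samples in 2D units (N/m), such as those
# produced by DFT- or MLIP-based loading simulations of a monolayer.
strain = np.array([0.00, 0.01, 0.02, 0.03, 0.05, 0.08, 0.12, 0.16, 0.20])
stress = np.array([0.0,  3.4,  6.7,  9.9, 15.9, 24.1, 32.6, 38.0, 39.5])

# 2D Young's modulus from the assumed small-strain linear regime
linear = strain <= 0.03
E2d = np.polyfit(strain[linear], stress[linear], 1)[0]   # slope in N/m

# Ultimate tensile strength and the strain at which it occurs
i_max = np.argmax(stress)
print(f"2D Young's modulus ~ {E2d:.1f} N/m")
print(f"Ultimate strength  ~ {stress[i_max]:.1f} N/m at strain {strain[i_max]:.2f}")
```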
At this stage, the strengths and challenges of MLIPs, and the various drawbacks of empirical
interatomic potentials and conventional DFT for the examination of mechanical
and thermal properties, have been discussed. MLIPs can be fitted over computationally
affordable AIMD trajectories, they have been extensively shown to reach an accuracy
close to that of their native datasets, and, more importantly, they
inherit the flexibility of the DFT method for investigating diverse and novel
compositions. In addition, the majority of commonly used MLIP methods are
currently available within the most popular molecular dynamics package, the Large-
scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) (Plimpton 1995),
which considerably facilitates their practical utilization. Combined with the extensive
Fig. 12.5 Comparison of mechanical properties of graphene predicted by (a) original Tersoff
(Tersoff 1989) (Reprinted from Ni et al. 2010, Copyright 2010, Elsevier), AIREBO (Stuart et al.
2000) (Reprinted from He et al. 2014, Copyright 2014, Elsevier), ReaxFF (Srinivasan et al. 2015)
(Reprinted from Jensen et al. 2015, Copyright 2015, American Chemical Society), and optimized
Tersoff (Lindsay and Broido 2010) (Reprinted from Mortazavi et al. 2016, Copyright 2016, Else-
vier) empirical interatomic potentials, with those by (e and f) DFT (at 0 K) and MTP-based (at 1 K)
models (Reprinted from Mortazavi et al. 2021b, Copyright 2021, John Wiley and Sons)
libraries of LAMMPS, they enable studying large systems and exploring various physical
properties. In our recent works, we have practically confirmed another novel
opportunity offered by MLIPs: we have demonstrated that MLIPs fitted to fixed AIMD
datasets can enable first-principles multiscale modeling of the mechanical (Mortazavi
et al. 2021b) and heat transport (Mortazavi et al. 2020a) properties of complex nanostructures,
in which ab initio accuracy is hierarchically bridged to explore
the properties of macroscopic systems. It is worth noting that bond rupture and the
subsequent failure process can substantially alter the atomic environment, which can
cause technical difficulties in developing stable and accurate MLIPs. On this basis,
the evaluation of mechanical properties, with their dynamic atomic environments, is
fundamentally more complex than that of their thermal counterparts, which are mostly explored
close to equilibrium conditions. We therefore discuss the concept of first-principles
multiscale modeling of mechanical properties (Mortazavi et al. 2021b), which, as discussed,
is naturally among the most complicated problems for MLIPs. To practically
demonstrate this exceptional possibility enabled by MLIPs, the mechanical properties of
coplanar graphene/borophene heterostructures (Liu et al. 2019) have been explored.
It is worth noting that, apart from accuracy concerns, empirical interatomic potentials
are unable to keep these structures stable at finite temperatures because of the complex
bonding configurations between graphene and borophene. The first-principles
multiscale modeling strategy employed for this purpose is outlined in Fig. 12.6.
Fig. 12.6 First-principles multiscale modeling strategy to simulate the mechanical properties of
graphene/borophene coplanar heterostructures (Reprinted from Mortazavi et al. 2021b, Copyright
2021, John Wiley and Sons)
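The hierarchical bridging illustrated in Fig. 12.6 can be caricatured by a single homogenization step in which phase-level stiffness values obtained from MLIP or DFT calculations are combined into an effective continuum property. The moduli and area fraction below are assumed placeholders, and the actual study (Mortazavi et al. 2021b) employs full finite element models of the heterostructure lattices rather than the closed-form Voigt/Reuss bounds used here.

```python
# Minimal continuum homogenization step: atomistically computed (here assumed)
# 2D Young's moduli of the two phases are combined into bounds for the
# effective in-plane stiffness of a coplanar heterostructure.
E_graphene = 340.0     # N/m, placeholder for an MLIP/DFT-derived value
E_borophene = 190.0    # N/m, placeholder for an MLIP/DFT-derived value
f_graphene = 0.6       # area fraction of the graphene phase (assumed)
f_borophene = 1.0 - f_graphene

E_voigt = f_graphene * E_graphene + f_borophene * E_borophene           # loading parallel to the phases
E_reuss = 1.0 / (f_graphene / E_graphene + f_borophene / E_borophene)   # loading across the phases

print(f"Voigt (upper) bound: {E_voigt:.1f} N/m")
print(f"Reuss (lower) bound: {E_reuss:.1f} N/m")
```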
Since the introduction of the MLIP concept by Behler and Parrinello (2007),
interest in the use of MLIPs has been growing steadily, for example,
to accelerate calculations and/or to conduct more accurate molecular dynamics
simulations. There is no doubt that the significance of MLIPs will only keep expanding
and that novel advanced methods will be developed to accelerate calculations or offer
solutions for more complex problems. MLIPs moreover offer extraordinary capabilities
for marrying first-principles accuracy with multiscale modeling, enabling the
modeling of complex nanostructures at the continuum level, without any prior physical
knowledge, with DFT-level accuracy and, more importantly, without excessive
computational costs. MLIPs therefore offer a very bright prospect to develop
References
Dong Y, Meng M, Groves MM, Zhang C, Lin J (2018) Thermal conductivities of two-dimensional
graphitic carbon nitrides by molecule dynamics simulation. Int J Heat Mass Transf 123:738–746.
https://doi.org/10.1016/j.ijheatmasstransfer.2018.03.017
Fan Z, Wang Y, Ying P, Song K, Wang J, Wang Y, Zeng Z, Xu K, Lindgren E, Rahm JM et al (2022)
GPUMD: a package for constructing accurate machine-learned potentials and performing highly
efficient atomistic simulations. J Chem Phys 157:114801. https://doi.org/10.1063/5.0106617
Fan Z, Zeng Z, Zhang C, Wang Y, Song K, Dong H, Chen Y, Ala-Nissila T (2021) Neuroevolution
machine learning potentials: combining high accuracy and low cost in atomistic simulations and
application to heat transport. Phys Rev B 104:104309
Fan Z, Pereira LFC, Hirvonen P, Ervasti MM, Elder KR, Donadio D, Ala-Nissila T, Harju A (2017)
Thermal conductivity decomposition in two-dimensional materials: application to graphene.
Phys Rev B 95:144309. https://doi.org/10.1103/PhysRevB.95.144309
Gao Y, Wang H, Sun M, Ding Y, Zhang L, Li Q (2018) First-principles study of intrinsic phononic
thermal transport in monolayer C3N. Phys E Low-Dimensional Syst Nanostruct 99:194–201.
https://doi.org/10.1016/j.physe.2018.02.012
Grimme S, Antony J, Ehrlich S, Krieg H (2010) A consistent and accurate ab initio parametrization
of density functional dispersion correction (DFT-D) for the 94 elements H-Pu. J Chem Phys
132:154104. https://doi.org/10.1063/1.3382344
Han D, Wang X, Ding W, Chen Y, Zhang J, Xin G, Cheng L (2019) Phonon thermal conduction
in a graphene–C3N heterobilayer using molecular dynamics simulations. Nanotechnology
30:075403. https://doi.org/10.1088/1361-6528/aaf481
Hatam-Lee SM, Rajabpour A, Volz S (2020) Thermal conductivity of graphene polymorphs and
compounds: from C3N to graphdiyne lattices. Carbon N Y 161:816–826. https://doi.org/10.
1016/j.carbon.2020.02.007
He L, Guo S, Lei J, Sha Z, Liu Z (2014) The effect of Stone–Thrower–Wales defects on mechanical
properties of graphene sheets—a molecular dynamics study. Carbon N Y 75:124–132. https://
doi.org/10.1016/j.carbon.2014.03.044
Hong Y, Ju MG, Zhang J, Zeng XC (2018a) Phonon thermal transport in a graphene/MoSe2 van
der Waals heterobilayer. Phys Chem Chem Phys 20:2637–2645. https://doi.org/10.1039/C7C
P06874C
Hong Y, Zhang J, Zeng XC (2018b) Monolayer and bilayer polyaniline C3N: two-dimensional
semiconductors with high thermal conductivity. Nanoscale 10:4301–4310. https://doi.org/10.
1039/C7NR08458G
Hu R, Iwamoto S, Feng L, Ju S, Hu S, Ohnishi M, Nagai N, Hirakawa K, Shiomi J (2020) Machine-
learning-optimized aperiodic superlattice minimizes coherent phonon heat conduction. Phys
Rev X 10:21050
Jensen BD, Wise KE, Odegard GM (2015) Simulation of the elastic and ultimate tensile proper-
ties of diamond, graphene, carbon nanotubes, and amorphous carbon using a revised ReaxFF
parametrization. J Phys Chem A 119:9710–9721. https://doi.org/10.1021/acs.jpca.5b05889
Kınacı A, Haskins JB, Sevik C, Çağın T (2012) Thermal conductivity of BN-C nanostructures. Phys
Rev B-Condens Matter Mater Phys 86:115410. https://doi.org/10.1103/PhysRevB.86.115410
Korotaev P, Novoselov I, Yanilkin A, Shapeev A (2019) Accessing thermal conductivity of complex
compounds by machine learning interatomic potentials. Phys Rev B 100:144308. https://doi.
org/10.1103/PhysRevB.100.144308
Kresse G, Furthmüller J (1996) Efficient iterative schemes for ab initio total-energy calculations
using a plane-wave basis set. Phys Rev B 54:11169–11186. https://doi.org/10.1103/PhysRevB.
54.11169
Kumar S, Sharma S, Babar V, Schwingenschlögl U (2017) Ultralow lattice thermal conductivity in
monolayer C3N as compared to graphene. J Mater Chem A 5:20407–20411. https://doi.org/10.
1039/C7TA05872A
Lee K, Yoo D, Jeong W, Han S (2019) SIMPLE-NN: an efficient package for training and executing
neural-network interatomic potentials. Comput Phys Commun 242:95–103. https://doi.org/10.
1016/j.cpc.2019.04.014
Li W, Carrete J, Katcho NA, Mingo N (2014) ShengBTE: a solver of the Boltzmann transport
equation for phonons. Comput Phys Commun 185:1747–1758. https://doi.org/10.1016/j.cpc.
2014.02.015
Lindsay L, Broido DA (2010) Optimized Tersoff and Brenner empirical potential parameters for
lattice dynamics and phonon thermal transport in carbon nanotubes and graphene. Phys Rev
B-Condens Matter Mater Phys 81:205441. https://doi.org/10.1103/PhysRevB.81.205441
Liu Z, Yang X, Zhang B, Li W (2021) High thermal conductivity of wurtzite boron arsenide
predicted by including four-phonon scattering with machine learning potential. ACS Appl Mater
Interfaces. https://doi.org/10.1021/acsami.1c11595
Liu X, Hersam MC (2019) Borophene-graphene heterostructures. Sci Adv 5:eaax6444. https://doi.
org/10.1126/sciadv.aax6444
Mortazavi B (2017) Ultra high stiffness and thermal conductivity of graphene like C3N.
Carbon N Y 118:25–34. https://doi.org/10.1016/j.carbon.2017.03.029
Mortazavi B (2021) Ultrahigh thermal conductivity and strength in direct-gap semiconducting
graphene-like BC6N: a first-principles and classical investigation. Carbon N Y 182:373–383.
https://doi.org/10.1016/j.carbon.2021.06.038
Mortazavi B, Fan Z, Pereira LFC, Harju A, Rabczuk T (2016) Amorphized graphene: a stiff material
with low thermal conductivity. Carbon N Y 103:318–326. https://doi.org/10.1016/j.carbon.
2016.03.007
Mortazavi B, Novikov IS, Podryabinkin EV, Roche S, Rabczuk T, Shapeev AV, Zhuang X (2020b)
Exploring phononic properties of two-dimensional materials using machine learning interatomic
potentials. Appl Mater Today 20:100685. https://doi.org/10.1016/j.apmt.2020.100685
Mortazavi B, Novikov IS, Shapeev AV (2022a) A machine-learning-based investigation on the
mechanical/failure response and thermal conductivity of semiconducting BC2N monolayers.
Carbon N Y 188:431–441. https://doi.org/10.1016/j.carbon.2021.12.039
Mortazavi B, Podryabinkin EV, Novikov IS, Rabczuk T, Zhuang X, Shapeev AV (2021a) Acceler-
ating first-principles estimation of thermal conductivity by machine-learning interatomic poten-
tials: a MTP/ShengBTE solution. Comput Phys Commun 258:107583. https://doi.org/10.1016/
j.cpc.2020.107583
Mortazavi B, Podryabinkin EV, Roche S, Rabczuk T, Zhuang X, Shapeev AV (2020a) Machine-
learning interatomic potentials enable first-principles multiscale modeling of lattice thermal
conductivity in graphene/borophene heterostructures. Mater Horizons 7:2359–2367. https://
doi.org/10.1039/D0MH00787K
Mortazavi B, Rabczuk T (2015) Multiscale modeling of heat conduction in graphene laminates.
Carbon N Y 85:1–7. https://doi.org/10.1016/j.carbon.2014.12.046
Mortazavi B, Rémond Y, Ahzi S, Toniazzo V (2012) Thickness and chirality effects on tensile
behavior of few-layer graphene by molecular dynamics simulations. Comput Mater Sci 53:298–
302. https://doi.org/10.1016/j.commatsci.2011.08.018
Mortazavi B, Shahrokhi M, Shojaei F, Rabczuk T, Zhuang X, Shapeev AV (2022c) A first-
principles and machine-learning investigation on the electronic, photocatalytic, mechanical and
heat conduction properties of nanoporous C5N monolayers. Nanoscale 14:4324–4333. https://
doi.org/10.1039/D1NR06449E
Mortazavi B, Shojaei F, Shapeev AV, Zhuang X (2022b) A combined first-principles and machine-
learning investigation on the stability, electronic, optical, and mechanical properties of novel
C6N7-based nanoporous carbon nitrides. Carbon N Y 194:230–239. https://doi.org/10.1016/j.
carbon.2022.03.068
Mortazavi B, Silani M, Podryabinkin EV, Rabczuk T, Zhuang X, Shapeev AV (2021b) First-
principles multiscale modeling of mechanical properties in graphene/borophene heterostructures
empowered by machine-learning interatomic potentials. Adv Mater 33:2102807. https://doi.org/
10.1002/adma.202102807