
Computational Methods in Engineering & the Sciences

Timon Rabczuk
Klaus-Jürgen Bathe Editors

Machine
Learning in
Modeling and
Simulation
Methods and Applications
Computational Methods in Engineering &
the Sciences

Series Editor
Klaus-Jürgen Bathe, Department of Mechanical Engineering, Massachusetts
Institute of Technology, Cambridge, MA, USA
This Series publishes books on all aspects of computational methods used in
engineering and the sciences. With emphasis on simulation through mathematical
modelling, the Series accepts high quality content books across different domains
of engineering, materials, and other applied sciences. The Series publishes mono-
graphs, contributed volumes, professional books, and handbooks, spanning across
cutting edge research as well as basics of professional practice. The topics of
interest include the development and applications of computational simulations in
the broad fields of Solid & Structural Mechanics, Fluid Dynamics, Heat Transfer,
Electromagnetics, Multiphysics, Optimization, Stochastics with simulations in and
for Structural Health Monitoring, Energy Systems, Aerospace Systems, Machines
and Turbines, Climate Prediction, Effects of Earthquakes, Geotechnical Systems,
Chemical and Biomolecular Systems, Molecular Biology, Nano and Microflu-
idics, Materials Science, Nanotechnology, Manufacturing and 3D printing, Artificial
Intelligence, Internet-of-Things.
Timon Rabczuk · Klaus-Jürgen Bathe
Editors

Machine Learning
in Modeling and Simulation
Methods and Applications
Editors
Timon Rabczuk
Bauhaus University
Weimar, Germany

Klaus-Jürgen Bathe
Massachusetts Institute of Technology (MIT)
Cambridge, MA, USA

ISSN 2662-4869 ISSN 2662-4877 (electronic)


Computational Methods in Engineering & the Sciences
ISBN 978-3-031-36643-7 ISBN 978-3-031-36644-4 (eBook)
https://doi.org/10.1007/978-3-031-36644-4

© The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature
Switzerland AG 2023
Chapter “Machine Learning in Computer Aided Engineering” is licensed under the terms of the Creative
Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/). For further
details see license information in the chapter.

This work is subject to copyright. All rights are solely and exclusively licensed by the Publisher, whether
the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse
of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and
transmission or information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This Springer imprint is published by the registered company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland

Paper in this product is recyclable.


Preface

Machine learning (ML) techniques have recently become remarkably successful.


Although some of the basic principles were established decades ago, the advent of
GPU-enabled numerical analysis and parallel computing, coupled with some recent
theoretical advances, has led to significant ML progress in diverse fields, such as
image and speech recognition, natural language processing, bioinformatics, and game
theory. Many of the developed applications require large amounts of data to be used
as training sets. However, tremendous advances have also been made in having a
computer algorithm learn the problem solution from a reasonable amount of data
and a simple set of rules.
While ML approaches have been applied in a number of areas such as economics,
e-commerce, image recognition, language interpretation, and medicine, their full
potential in “Modeling and Simulation” still needs to be explored. Indeed, ML
approaches have the potential to drastically reduce the computational cost—from
design to analysis back to design—and allow challenging problems in engineering
and the computational sciences to be solved which have never been solved before,
in fact not even been tackled before.
Machine learning algorithms can be separated into two main classes: supervised
and unsupervised learning. In both classes, the algorithms have access to a set of
observations for the role of training data. However, the nature of the training data, and
hence what can be extracted from them, differs. In supervised learning, the training
data consists of a given set of input values and a given corresponding set of output
values. Using the training data, the ML algorithm tries to find the best parameters of
a prediction function that can accurately emulate the output of interest as a function of
the input. On the other hand, in unsupervised learning the output of interest is
not known a priori and an effective use of the data is much more challenging. A
third, intermediate class exists, semi-supervised learning; in this case some, but not
all, of the given input values have given corresponding output values.
Our objective in this book is to present ML techniques for computer-aided engi-
neering with a focus on the fundamental theoretical ingredients and the exciting use
of ML in the next generation of computational methods for modeling and simulation.

An impressive example of ML in engineering is the emergence and great potential
of using digital twins for design and monitoring of structures, with the word “struc-
tures” interpreted to include not only traditional structures, like buildings, dams, and
bridges, but also biological, offshore, electromagnetic, turbines, nuclear, and many
other structures. We foresee here major and widely spread applications.
The book consists of 12 chapters focusing on Machine Learning in Modeling
and Simulation, starting with an extensive overview of concepts and applica-
tions in Chap. 1. The following two chapters provide a historical and theoretical
overview—including implementation details—of two very popular machine learning
approaches: Artificial Neural Networks and Gaussian Processes. The remainder
of the book focuses on the use of ML techniques to solve specific problems in
engineering, physics, or materials science, starting with data-driven model discovery,
followed by physics-informed neural networks that may become a powerful alterna-
tive to classical discretization methods such as finite elements. Namely, the networks
allow the solution of partial differential equations while also incorporating experi-
mental data. Physics-informed Deep Neural Operators can be seen as an improvement
and have the potential of solving not only one specific problem but any problem in
a large class of problems. The next chapters focus on important topics regarding
digital twins, reduced order modeling, regression models and the valuable use of
ML in topology optimization methodologies. The last two chapters of the book
focus on the design of new materials. The first approach is data-driven while the
second approach takes advantage of interatomic potentials in the context of hierar-
chical multiscale modeling. These ML approaches have the potential to significantly
widen the possibilities and accelerate the design of new materials.
The work on this book has been very exciting, and we greatly thank all authors
and co-authors of the book chapters. Their valuable contributions, dedicated work,
and great cooperation made it possible to complete this book in a timely manner.

Weimar, Germany    Timon Rabczuk
Cambridge, USA    Klaus-Jürgen Bathe
December 2022
Contents

1 Machine Learning in Computer Aided Engineering . . . . . . . . . . . . . . 1


Francisco J. Montáns, Elías Cueto, and Klaus-Jürgen Bathe
2 Artificial Neural Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
K. Worden, G. Tsialiamanis, E. J. Cross, and T. J. Rogers
3 Gaussian Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
T. J. Rogers, J. Mclean, E. J. Cross, and K. Worden
4 Machine Learning Methods for Constructing Dynamic
Models From Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
J. Nathan Kutz
5 Physics-Informed Neural Networks: Theory and Applications . . . . . 179
Cosmin Anitescu, Burak İsmail Ateş, and Timon Rabczuk
6 Physics-Informed Deep Neural Operator Networks . . . . . . . . . . . . . . . 219
Somdatta Goswami, Aniruddha Bora, Yue Yu,
and George Em Karniadakis
7 Digital Twin for Dynamical Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . 255
Tapas Tripura, Shailesh Garg, and Souvik Chakraborty
8 Reduced Order Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
Zulkeefal Dar, Joan Baiges, and Ramon Codina
9 Regression Models for Machine Learning . . . . . . . . . . . . . . . . . . . . . . . 341
Pengfei Wei and Michael Beer
10 Overview on Machine Learning Assisted Topology
Optimization Methodologies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 373
Ilias Chamatidis, Manos Stoumpos, George Kazakis,
Nikos Ath. Kallioras, Savvas Triantafyllou, Vagelis Plevris,
and Nikos D. Lagaros


11 Mixed-Variable Concurrent Material, Geometry, and Process


Design in Integrated Computational Materials Engineering . . . . . . . 395
Tianyu Huang, Marisa Bisram, Yang Li, Hongyi Xu,
Danielle Zeng, Xuming Su, Jian Cao, and Wei Chen
12 Machine Learning Interatomic Potentials: Keys
to First-Principles Multiscale Modeling . . . . . . . . . . . . . . . . . . . . . . . . . 427
Bohayra Mortazavi
About the Editors

Timon Rabczuk is Professor of Computational Mechanics at Bauhaus University


Weimar. He has published more than 650 articles and 2 books. Timon is Editor in
Chief, Associate Editor, and Executive Editor of several journals of repute. He is
a member of the European Academy of Sciences and Arts. His research interests
include computational science, integrated computational materials engineering and
the numerical solution of partial differential equations.

Klaus-Jürgen Bathe is Professor of Mechanical Engineering at the Massachusetts


Institute of Technology. Professor Bathe is also the Founder of the company ADINA
R & D, recently acquired by Bentley Systems. He has published numerous articles,
six textbooks, and two books on his life experiences. He has been honored by the ASME,
ASCE, U.S. National Academy of Engineering, through three awards at M.I.T., and
through many honorary doctorates for his teaching, pioneering and widely used
fundamental contributions in computational mechanics, and for bridging the world
between academia and industry.

Chapter 1
Machine Learning in Computer Aided
Engineering

Francisco J. Montáns, Elías Cueto, and Klaus-Jürgen Bathe

1.1 Introduction

The purpose of Machine Learning algorithms is to learn automatically from data


employing general procedures. Machine Learning (ML) is today ubiquitous due to
its success in many current daily applications such as face recognition (Hassan and
Abdulazeez 2021), speech (Malik et al. 2021) and speaker recognition (Hanifa et al.
2021), credit card fraud detection (Ashtiani and Raahemi 2021; Nayak et al. 2021),
spam detection (Akinyelu 2021), and cloud security (Nassif et al. 2021). ML governs
our specific Google searches and the advertisements we receive (Kim et al. 2001)
based on our past actions, along with many other interactions (Google cloud 2023). It
even anticipates what we will type or what we will do. And, of course, ML schemes
also rank us, scientists (Beel and Gipp 2009).
The explosion of applications of ML came with the increased computer power and
also the ubiquitous presence of computers, cell phones, and other “smart” devices.
These gave ML the spotlight to foster its widespread use in many other areas in which

F. J. Montáns (B)
E.T.S. de Ingeniería Aeronáutica y del Espacio, Universidad Politécnica de Madrid, Pza. Cardenal
Cisneros 3, 28040 Madrid, Spain
e-mail: [email protected]
Department of Mechanical and Aerospace Engineering, Herbert Wertheim College
of Engineering, University of Florida, Gainesville, FL 32611, USA
E. Cueto
Aragon Institute of Engineering Research, Universidad de Zaragoza, Maria de Luna s/n,
50018 Zaragoza, Spain
e-mail: [email protected]
K. J. Bathe
Mechanical Engineering Department, Massachusetts Institute of Technology, 77 Mass. Ave.,
Cambridge, MA 02139, USA
e-mail: [email protected]

© The Author(s) 2023


T. Rabczuk and K. J. Bathe (eds.), Machine Learning in Modeling and Simulation,
Computational Methods in Engineering & the Sciences,
https://doi.org/10.1007/978-3-031-36644-4_1

it had less presence. The success in many extremely useful areas such as speech and
face recognition has contributed to this interest (Marr 2019). Today, ML may help
you (through web services) to find a job, obtain a loan, find a partner, obtain an
insurance, and, among others, also helps in the medical and legal services (Duarte
2018). Of course, ML raises many ethical issues, some of which are described, for
example in Stahl (2021). However, the discovered power and success of ML in many
areas have made a very important impact on our society and, remarkably, on how
many problems are addressed. No wonder, the number of ML papers published in
almost all fields has sharply increased in the last 10 years, with a rate following
approximately Moore’s law (Frank et al. 2020).
Machine Learning is considered a part of Artificial Intelligence (AI) (Michalski
et al. 2013). In essence, ML algorithms are general procedures and codes that, with
the information from datasets, can give predictions for a wide range of problems (see
Fig. 1.1). The main difference to classical programs is that the classical programs
are developed for specific applications, like in Computer Aided Engineering, which
is the topic of this chapter, to solve specific differential equations in integral forms.
An example is how finite elements have been developed. In contrast, ML procedures
are for much more general applications, being used almost unchanged in apparently
unconnected problems such as predicting the evolution of stocks, spam filtering, face
recognition, typing prediction, pharmacologic design, or materials selection. ML
methods are different also from Expert Systems because these are based on fixed
rules or fixed probability structures. ML methods excel when useful information
needs to be obtained from massive amounts of data.
Of course, generality comes usually with a trade-off regarding efficiency for a
specific problem solution (Fig. 1.1), so the use of ML for the solution of simple
problems, or for problems which can be solved by other more specific procedures,
is typically inappropriate. Furthermore, ML is used when predictions are needed for
problems which have not or cannot be accurately formulated; that is, when the vari-
ables and mathematical equations governing the problem are not fully determined—
but physics-informed approaches with ML are now also much focused on, Raissi
et al. (2019). Nonetheless, ML codes and procedures are still mostly used as general
“black boxes”, typically employing standard implementations available in free and
open-source software repositories. A number of input variables are then employed
and some specific output is desired, which together comprises the input-to-output
process learned and adjusted from known cases or from the structure of the input
data. Some of these free codes are Scikit-learn (Pedregosa et al. 2011) (one of the
best-known), Microsoft Cognitive Toolkit (Xiong et al. 2018), TensorFlow (Dillon
et al. 2017) (which is optimal for Cuda-enabled Graphic Processing Unit (GPU)
parallel ML), Keras (Gulli and Pal 2017), OpenNN (Build powerful models 2022),
and SystemML (Ghoting et al. 2011), just to name a few. Other proprietary soft-
ware, used by big companies, are AWS Machine Learning Services from Amazon
(Hashemipour and Ali 2020), Cloud Machine Learning Engine from Google (Bisong
2019a), Matlab (Paluszek and Thomas 2016; Kim 2017), Mathematica (Brodie et al.
2020; Rodríguez and Kramer 2019), etc. Moreover, many software offerings have
libraries for ML, and are often used in ML projects like Python (NumPy, Bisong

Fig. 1.1 Overall Machine Learning (ML) process and contrast between efficiency and generality of the method. Hyperparameters are user-defined parameters which account for the type of problem, whereas parameters are optimized for best prediction. ML may be used for prediction and for classification. It is also often used as a tool for dimensionality reduction

2019b, Scikit-learn, Pedregosa et al. 2011, and Tensorly, Kossaifi et al. 2016; see
review in Stančin and Jović 2019), C++, e.g., Kaehler and Bradski (2016), Julia (a
recent Just-In-Time (JIT) compiling language created with science and ML in mind,
Gao et al. 2020; Innes 2018; Innes et al. 2019), and the R programming environ-
ment (Lantz 2019; Bischl et al. 2016; Molnar et al. 2018); see also Raschka and
Mirjalili (2019), King (2009), Gao et al. (2020), Bischl et al. (2016). These soft-
ware offerings also use many earlier published methods for standard computational
tasks such as mathematical libraries (like for curve fitting, the solution of linear
and nonlinear equations, the determination of eigenvalues and eigenvectors or Sin-
gular Value Decompositions), and computational procedures for optimization (e.g.,
the steepest descent algorithms). The offerings also use earlier established statistical
and regression algorithms, interpolation, clustering, domain slicing (e.g., tessellation
algorithms), and function approximations.
ML derives from the conceptually fuzzy (uncertain, non-deterministic) learning
approach of AI. AI is devoted to mimicking the way the human learning process
works—namely, the human brain, through the establishment of neurological connec-
tions based on observations, can perform predictions, albeit mostly only qualitative,
of new events. And then, the more experience (data) has been gathered, the better are
the predictions through experience reinforcements and variability of observations. In
addition, classification is another task typically performed by the human brain. We
classify photos, people, experiences, and so on, according to some common features:
we search continuously for features that allow us to group and separate out things
so that we can establish relations of outcomes to such groups. Abundant data, data
structuring, and data selection and simplification are crucial pieces of this type of
“fuzzy” learning and, hence, of ML procedures.
Based on these observations, neural network concepts were rather early devel-
oped by McCulloch and Pitts in 1943 and Hebb in 1949 (Hebb 2005), who wrote
the well-known sentence “Cells that fire together, wire together”, meaning that the
firing of one cell determines the actions of subsequent cells. While Hebb’s forward
firing rule is unstable through successive epochs, it was the foundation for Artificial
Neural Network (NN) theories. Probably due to the difficulties in implementations
and computational cost in using NN, the widespread use of NN was delayed until the
1990s. The introduction of improvements in the procedures for backpropagation and
optimization, as well as improvements in data acquisition, information retrieval, and
data mining, made possible the application of NNs to real problems. Today, NNs are
very flexible and are the basis of many ML techniques and applications. However, this
delay also facilitated the appearance and use of other ML-related methods such as expert
systems and decision trees, and a myriad of pattern recognition and decision-making
approaches.
Today, whenever a complex problem is found, especially if there is no sound
theory or reliable formulation to solve it, ML is a valuable tool to try. In many cases,
the result is successful and indeed even a good understanding of the behavior of the
problem and the variables involved may be obtained. While the introduction of ML
procedures into Computer Aided Engineering (CAE) took a longer time than in other
areas, probably because for many problems the governing equations and effective
computational procedures were known, ML is finally also focused on addressing
complex and computationally intensive CAE solutions. In this chapter, we overview
some of the procedures and applications of Machine Learning employed in CAE.

1.2 Machine Learning Procedures Employed in CAE

As mentioned, ML is often considered to be a subset of AI (Michalski et al. 2013;


Dhanalaxmi 2020; Karthikeyan et al. 2021), although often ML is also recognized
as a separate field itself which only has some intersection with AI (Manavalan 2020;
Langley 2011; Ongsulee 2017). Deep Learning (DL) is a subset of ML. Although
the use of NNs is the most common approach to address CAE problems and ML

problems in general, there are many other ML techniques that are being used. We
review below the fundamental aspects of these techniques.

1.2.1 Machine Learning Aspects and Classification of Procedures

Our objective in this section is to focus on various fundamental procedures commonly


used in ML schemes.

1.2.1.1 Classification, Identification, and Prediction

ML procedures are mainly employed for three tasks: classification, identification


(both may broadly be considered as classification), and prediction. An example of
classification is the labeling of e-mails as spam or not spam (Gaurav et al. 2020;
Crawford et al. 2015). Examples of identification are the identification of a type of
behavior or material from some stress–strain history or from force signals in machin-
ing (Denkena et al. 2019; Penumuru et al. 2020; Bock et al. 2019), the identification
of a nanostructure from optical microscopy (Lin et al. 2018), the identification of a
person from a set of images (Ahmed et al. 2015; Ding et al. 2015; Sharma et al. 2020),
and the identification of a sentence from some fuzzy input. Examples of prediction
are the prediction of behavior of a material under some deformation pattern (Ye et al.
2022; Ibragimova et al. 2021; Huang et al. 2020), the prediction of a sentence from
some initial words (Bickel et al. 2005; Sordoni et al. 2015), and the prediction of the
trajectory of salient flying objects (Wu et al. 2017; Fu et al. 2020). Of course, there
are some AI procedures which may not belong to only one of these categories, such as the
identification or prediction of governing equations in physics (Rai and Sahu 2020;
Raissi and Karniadakis 2018). Clustering ML procedures are typically used for clas-
sification, whereas regression ML procedures are customarily used for prediction.

1.2.1.2 Expected and Unexpected Data Relations

Another relevant distinction is between ML approaches and Data Mining (DM).


ML focuses on using known properties of data in classification or in prediction,
whereas DM focuses on the discovery of new unknown properties or relations of
data. However, ML, along with information systems, is often considered part of DM
(Adriaans and Zantinge 1997). The overlap of DM and ML is seen in cases like the
discovery of unknown relations or in finding optimum state variables which may be,
for example, given in physical equations. Note that ML typically assumes that we
know beforehand the existence of relations (e.g., which are the relevant variables
and what is the type of output we expect), but the purpose of DM is to research the
existence of perhaps unexpected relations from raw data.

1.2.1.3 Statistical and Optimization Approaches within ML

Many procedures use, or are derived from, statistics, and in particular probability
theory (Murphy 2012; Bzdok et al. 2018). In a similar manner, ML employs mostly
optimization procedures (Le et al. 2011). The main conceptual difference between
these theories and ML is the purpose of the developments. In the case of statistics,
the purpose is to obtain inference or characteristics of the population such as the
distribution and the mean (which of course could be used thereafter for predictions);
see Fig. 1.2. In the case of ML, the purpose is to predict new outcomes, often without
the need for statistically characterizing populations, and incorporate these outcomes
in further predictions (Bzdok et al. 2018). ML approaches often support predictions
on models. ML optimizes parameters for obtaining the best predictions as quantified
by a cost function, and the values of these parameters are optimized also to account for
the uncertainty in the data and in the predictions. ML approaches may use statistical
distributions, but those are not an objective and their evaluation is often numerical
(ML is interested in predictions). Also, while ML uses optimization procedures to
obtain values of parameters, the objective is not to obtain the “optimum” solution to fit
data, but a parsimonious model giving reliable predictions (e.g., to avoid overfitting).

1.2.1.4 Supervised, Unsupervised, and Reinforced Learning

It is typical to classify the ML procedures in supervised, unsupervised, semi-


supervised, and reinforced learning (Raschka 2015; Burkov 2019, 2020).

Fig. 1.2 Comparison of classical statistics with machine learning approaches

In supervised learning, samples {s_i ≡ {x_i, y_i}}_{i=1,...,n} ∈ S with vectors of features x_i are labeled with a known result or label y_i. The label may be a class, a
number, a matrix, or other. The purpose of the ML approach in this case is (typi-
cally) to create a model that relates those known outputs yi to the dataset samples
through some combination of the j = 1, . . . , N features x_j(i) ≡ x_ji of each sam-
ple i. The x_j(i) are also referred to as data, variables, measurements, or characteristics,
depending on the context or field of application. An example of a ML procedure
could be to relate the seismic vulnerability of a building (label) as a function of
features like construction type, age, size, location, building materials, maintenance,
etc. Rosti et al. (2022), Zhang et al. (2019), Ruggieri et al. (2021). The ML purpose
is here to be able to learn the vulnerability of buildings from known vulnerabilities
of other buildings. The labeling could have been obtained from experts or from past
earthquakes. Supervised learning is based on sufficient known data, and we want to
determine predictions in the nearby domain. In essence, we can say that “supervised
learning is a high-dimensional interpolation problem” (Mallat 2016; Gin et al. 2021).
We note that supervised learning may be improved with further data when available,
since it is a dynamic learning procedure, mimicking the human brain.
In unsupervised learning the samples si are unlabeled (si ≡ {xi }), so the purpose
is to label the samples from learning similitudes and common characteristics in the
features of the samples; it is usually an instance-based learning. Typical unsupervised
ML approaches are employed in clustering (e.g., classifying the structures by type in
our previous example), dimensionality reduction (detecting which features are less
relevant to the output label, for example because all or most samples have it, like
doors in buildings), and outlier detection (e.g., in detecting abnormal traffic in the
Internet, Salman et al. 2020, 2022; Salloum et al. 2020) for the case when very few
samples have that feature. These approaches are similar to data mining.
Semi-supervised learning is conceptually a combination of the previous
approaches but with specific ML procedures. In essence it is a supervised learn-
ing approach in which there are few labeled samples (output known) but many more
unlabeled samples (output unknown), even sometimes with incomplete features, with
some missing characteristics, which may be filled in by imputation techniques (Lak-
shminarayan et al. 1996; Ramoni and Sebastiani 2001; Liu et al. 2012; Rabin and
Fishelov 2017). The point here is that by having many more samples with unassigned
features, we can determine better the statistical distributions of the data and the pos-
sible significance of the features in the result, resulting in an improvement over using
only labeled data for which the features have been used to determine the label. For
example, in our seismic vulnerability example, imagine that one feature is that the
building has windows. Since almost all buildings have windows, it is unlikely that
this feature is relevant in determining the vulnerability (it will give little Information
Gain; see below). On the contrary, if 20% of the buildings have a steel structure, and
if the correlation is positive regarding the (lack of) vulnerability, it is likely that the
feature is important in determining the vulnerability.
There is also another type of ML seldom used in CAE, which is reinforced learn-
ing (or reward-based learning). In this case, the computer develops and changes
actions to learn a policy depending on the feedback, i.e. rewards which themselves
modify the subsequent actions by maximizing the expected reward. It shares some
concepts with supervised learning, but the purpose is an action, instead of a
prediction. Hence, it is a typical ML approach in control dynamics (Buşoniu et al.
2018; Lewis and Liu 2013) with applications, for example, in the aeronautical indus-
try (Choi and Cha 2019; Swischuk and Allaire 2019; He et al. 2021).

1.2.1.5 Data Cleaning, Ingestion, Augmentation, Curation, Data Evaluation, and Data Standardization

Data is the key to ML procedures, so datasets are usually large and obtained in dif-
ferent ways. The importance of data requires that the data is presented to the ML
method (and maintained if applicable) in optimal format. To reach that goal requires
many processes which often also involve ML techniques. For example, in a dataset
there may be data which are not in a logical range, or with missing entries, hence they
need to be cleaned. ML techniques may be used to determine outliers in datasets, or
assign values (data imputation) according to the other features and labels present in
other samples in the dataset. Different dataset formats such as qualitative entries like
“good”, “fair”, or “bad”, and quantitative entries like “1–9”, may need to be con-
verted (encoded) to standardized formats, using also ML algorithms (e.g., assigning
“fair” to a numerical value according to samples in the dataset). This is called data
ingestion. ML procedures may also need to have data distributions determined, that
is, data evaluated to learn if a feature follows a normal distribution or if there is a
consistent bias, and also standardize data according to min–max values or the same
normal distribution, for example to avoid numerical issues and give proper weight
to different features. In large dynamic databases, much effort is expended for the
proper maintenance of the data so it remains useful, using many operations such as
data cleaning, organization, and labeling. This is called data curation.
Another aspect of data treatment is the creation of a training set, a validation set,
and a test set from a database (although often test data refers to both the validation
and the test set, in particular when only one model is considered). The purpose of the
training set is to train the ML algorithm: to create the “model”. The purpose of the
validation set is to evaluate the models in an independent way from the training set,
for example to see which hyperparameters are best suited, or even which ML method
is best suited. Examples may be the number of neurons in a neural network or the
smoothing hyperparameter in splines fitting; different smoothing parameters yield
different models for the same training set, and the validation set helps to select the best
values, obtaining the best predictions but avoiding overfitting. Recall that ML is not
interested in the minimum error for the training set, but in a predictive reliable model.
The test set is used to evaluate the performance of the final selected model from the
overall learning process. An accurate prediction of the training set with a poor pre-
diction of the test set is an indicator of overfitting: we have reached an unreliable
model. A model with similar accuracy in the training and test sets is a good model.
The training set should not be used for assessing the accuracy of the model because
the parameters and their values have been selected based on these data and hence
overfitting may not be detected. However, if more data is needed for training, there are
techniques for data augmentation, typically performing variations, transformations,
or combinations of other data (Shorten and Khoshgoftaar 2019). A typical example
is to perform transformations of images (rotations, translations, changes in light, etc.,
Inoue 2018). Data augmentation should be used with care, because there is a risk
that the algorithms correlate unexpected features with outputs: samples obtained by
augmentation may have repetitive features because in the end they are correlated sam-
ples. These repetitive features may mislead the algorithms so they identify the feature
as a key aspect to correlate to the output (Rice et al. 2020). An example is a random
spot in an image that is being used for data augmentation. If the spot is present in the
many generated samples, it may be correlated to the output as an important feature.
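
As a minimal illustration of this splitting (the sketch below is not part of the original text; the 60/20/20 proportions and the NumPy implementation are assumptions for demonstration only), a dataset can be partitioned into training, validation, and test subsets as follows:

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Randomly partition (X, y) into training, validation, and test sets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))            # shuffle the sample indices
    n_test = int(test_frac * len(X))
    n_val = int(val_frac * len(X))
    test_idx = idx[:n_test]                  # held out for the final model assessment
    val_idx = idx[n_test:n_test + n_val]     # used for hyperparameter selection
    train_idx = idx[n_test + n_val:]         # used to fit the model
    return (X[train_idx], y[train_idx], X[val_idx], y[val_idx],
            X[test_idx], y[test_idx])

# Example usage with synthetic data (100 samples, 3 features)
X = np.random.rand(100, 3)
y = np.random.rand(100)
X_tr, y_tr, X_va, y_va, X_te, y_te = train_val_test_split(X, y)
```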

1.2.1.6 Overfitting, Regularization, and Cross-Validation

Overfitting and model complexity are important aspects in ML; see Fig. 1.3. Given
that the data has errors and often some stochastic nature, a model which gives zero
error in the training data does not mean that it is a good model; indeed it is usually a
hint that it is the opposite: a presentation of overfitting (Fig. 1.3a). Best models are
those less complex (parsimonious) models that follow Occam’s razor. They are as
simple as possible but still have a great predictive power. Hence, the less parameters,
the better. However, it is often difficult to simplify ML models to have few “smart”
parameters, so model reduction and regularization techniques are often used as a “no-
brainer” remedy for overfitting. Typical regularization (“smoothing”) techniques are
Least Absolute Shrinkage Selection Operator, sparse or L1 regularization (LASSO)
(Xu et al. 2008) and L2, called Ridge (Tikhonov 1963), or noise (Bishop 1995) reg-
ularization, or regression. The LASSO scheme “shrinks” the less important features

Fig. 1.3 Using B-splines to fit hyperelastic stress–strain data. Regression may be performed in nominal stress–stretch (P − λ) axes, or in true stress–strain (σ − E) axes; note that the result is different. While the usual test representation in hyperelasticity is in the (P − λ) axes, regression is preferred in (σ − E) because of the symmetry of tension and compression in logarithmic strains. B-spline fit of experimental data with (a) overfitting and (b) a proper fit using regularization based on stability conditions. Modified from Latorre and Montáns (2020)

(hence is also used for feature selection), whereas the L2 scheme gives a more even
weight to them. The combination of both is known as elastic net regularization (Zou
and Hastie 2005).
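
As a hedged illustration of L2 (Ridge) regularization (not code from the chapter; the closed-form estimator (XᵀX + λI)⁻¹Xᵀy used below is the standard Ridge solution, and the data are synthetic), the shrinking effect of the penalty on the fitted weights can be seen with a few lines of NumPy:

```python
import numpy as np

def ridge_fit(X, y, lam=1.0):
    """Closed-form Ridge regression: minimize |X w - y|^2 + lam |w|^2."""
    A = X.T @ X + lam * np.eye(X.shape[1])   # regularized normal matrix
    return np.linalg.solve(A, X.T @ y)       # weights w

# Noisy linear data: compare unregularized and Ridge-regularized weights
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
w_true = np.array([1.0, 0.0, 0.0, 2.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=30)

print(ridge_fit(X, y, lam=0.0))   # ordinary least squares
print(ridge_fit(X, y, lam=5.0))   # coefficients shrunk toward zero
```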
Model selection taking into account model fitness and including a penalization for
model complexity is often performed by employing the Akaike Information Criterion
(AIC). Given a collection of models arising from the available data, the AIC allows
these models to be compared with one another, so as to help select the best fitted model. In
essence, the AIC not only estimates the relative amount of information lost by each
model but also takes into account its parsimony. In other words, it deals with the
trade-off between overfitting and underfitting by computing

AIC = 2 p − 2 ln L (1.1)

where p is the number of parameters of the model (complexity penalty) and L is


the maximum of the likelihood function of the model, the joint probability of the
observed data as a function of the p parameters of the model (see next section).
Therefore, the chosen model should be the one with the minimum AIC. In essence,
the AIC penalizes the number of parameters to choose the best model—and that is the
model not only with as few parameters as possible but also with a large probability
of reproducing the data using these parameters.
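
As a small numerical illustration (not part of the original text), the AIC of Eq. (1.1) can be evaluated for regression models with Gaussian errors, for which the maximized log-likelihood reduces to an expression in the residual sum of squares; the polynomial models compared below on synthetic data are assumptions for demonstration only:

```python
import numpy as np

def aic_gaussian(y, y_pred, n_params):
    """AIC = 2p - 2 ln L for a regression model with Gaussian errors.

    With the MLE of the error variance, ln L = -n/2 (ln(2 pi) + ln(RSS/n) + 1).
    """
    n = len(y)
    rss = np.sum((y - y_pred) ** 2)                                  # residual sum of squares
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1.0)
    return 2 * n_params - 2 * log_lik

# Compare polynomial models of increasing complexity on (truly linear) data
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 40)
y = 1.0 + 2.0 * x + 0.1 * rng.normal(size=40)

for degree in (1, 3, 7):
    coeffs = np.polyfit(x, y, degree)
    print(degree, aic_gaussian(y, np.polyval(coeffs, x), n_params=degree + 1))
# The linear model (degree 1) should give the smallest AIC
```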
Dividing the data into two sets, one for training and one for validation, very
often produces overfitting, especially for small datasets. To avoid the overfitting,
the method of k-fold cross-validation is frequently used. In this process, the data is
divided into k datasets. k − 1 of them are used to train the model and the remaining
one is used for validation. This process is repeated k times, by employing each of the
possible datasets for validation. The final result is given by the arithmetic mean of
the k results (Fig. 1.4). Leave-one-out Cross-Validation (LOOCV) is the special case

where the number of folds is the same as the number of samples, so the test set has
only one element. While LOOCV is expensive in general (Meijer and Goeman 2013),
for the linear case it is very efficient because all the errors are obtained simultaneously
with a single fit through the so-called hat matrix (Angelov and Stoimenova 2017).

Fig. 1.4 Training and test sets: k-fold generation of training and validation sets from data. Number of data: 9, data for training and model selection: 6, data for final validation test (test set): 3, number of folds for model selection: 3, data in each fold: 2, number of models: 3 (k = 3). Sometimes, the validation test is also considered as test set. The 10-fold cross-validation is a common choice
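
A minimal sketch of the k-fold loop described above, assuming a simple linear least-squares model and k = 3 (illustrative only, not code from the chapter):

```python
import numpy as np

def k_fold_cv(X, y, k=3, seed=0):
    """Average validation MSE of a linear least-squares model over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), k)
    errors = []
    for i in range(k):
        val_idx = folds[i]                                               # fold used for validation
        tr_idx = np.concatenate([folds[j] for j in range(k) if j != i])  # remaining k-1 folds
        w, *_ = np.linalg.lstsq(X[tr_idx], y[tr_idx], rcond=None)        # fit on the training folds
        errors.append(np.mean((X[val_idx] @ w - y[val_idx]) ** 2))       # error on the held-out fold
    return np.mean(errors)                                               # arithmetic mean of the k results

# Example usage: intercept column plus one feature
X = np.column_stack([np.ones(60), np.random.rand(60)])
y = 2.0 + 3.0 * X[:, 1] + 0.05 * np.random.randn(60)
print(k_fold_cv(X, y, k=3))
```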

1.2.2 Overview of Classical Machine Learning Procedures Used in CAE

The schemes we present in this section are basic ingredients of many ML algorithms.

1.2.2.1 Simple Regression Algorithms

The simplest ML approach is much older than the ML discipline: linear and nonlinear
regression. In the former case, the purpose is to compute the weights w and the offset
b of the linear model ỹ ≡ f (x) = w T x + b, where x is the vector of features. The
parameters w, b are obtained through the minimization of the cost function (MSE:
Mean Squared Error)

\[ C(x_i; \{w, b\}) = \frac{1}{n} \sum_{i=1}^{n} L_i := \frac{1}{n} \sum_{i=1}^{n} \left[ f(x_i; \{w, b\}) - y_i \right]^2 \tag{1.2} \]

with respect to them, which in this case is the average of the loss function L_i = (ỹ_i − y_i)², where the y_i are the known values, ỹ_i = f(x_i; {w, b}) are the predictions, and the subindex i refers to sample i, so x_i is the vector of features of that sample.
Of course, in linear regression, the parameters are obtained simply by solving the
linear system of equations resulting from the quadratic optimization problem. Other
regression algorithms are similar, as for example spline, B-spline regressions, or
P-spline (penalized B-splines) regressions, used in nonlinear mechanics (Crespo
et al. 2017; Latorre and Montáns 2017), or used to efficiently invert functions
which do not have an analytical solution (Benítez and Montáns 2018;
Eubank 1999; Eilers and Marx 1996). In all these cases, smoothing techniques are
fundamental to avoid overfitting.
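
To make the minimization of the cost function in Eq. (1.2) concrete, the following NumPy sketch (an illustration with synthetic data, not code from the original text) obtains w and b of the linear model by solving the associated least-squares problem:

```python
import numpy as np

def fit_linear(X, y):
    """Fit y ~ w^T x + b by minimizing the mean squared error of Eq. (1.2)."""
    Xa = np.column_stack([X, np.ones(len(X))])      # append a column of ones for the offset b
    theta, *_ = np.linalg.lstsq(Xa, y, rcond=None)  # least-squares solution of the quadratic problem
    return theta[:-1], theta[-1]                    # weights w and offset b

# Example usage with two synthetic features
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([1.5, -0.7]) + 0.3 + 0.05 * rng.normal(size=50)
w, b = fit_linear(X, y)
print(w, b, np.mean((X @ w + b - y) ** 2))  # fitted parameters and training MSE
```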
While it is natural to state the regression problem as a minimization of the cost
function, it may be also formulated in terms of the likelihood function L. Given some
training data (yi , xi ) (with labels yi for data xi ), we seek the parameters w (for sim-
plicity we now include b in the set w) that minimize the cost function (e.g., MSE);
or equivalently we seek the set w which maximizes the likelihood L(w|(y, x)) =
p(y|x; w) for those parameters w to give the probability representation for the train-
ing data, which is the same as the probability of finding the data (y, x) given the distri-
bution by w. The likelihood is the “probability” by which a distribution (characterized
by w) represents all given data, whereas the probability is that of finding data if the dis-
tribution is known. Assuming data to be identically distributed and independent such

that p(y1 , y2 , . . . , yn |x1 , x2 , . . . , xn ; w) = p(y1 |x1 ; w) p(y2 |x2 ; w) . . . p(yn |xn ; w),
the likelihood is


\[ L(w \mid (y_i, x_i),\ i = 1, \dots, n) = \prod_{i=1}^{n} p(y_i \mid x_i; w) \tag{1.3} \]

or

\[ \log L(w \mid (y_i, x_i),\ i = 1, \dots, n) = \sum_{i=1}^{n} \log p(y_i \mid x_i; w) \tag{1.4} \]

Choosing the linear regression ỹ = w T x (including b and 1 respectively in w and x),


and a normal distribution of the prediction, obtained by assuming a zero-centered
normal distribution of the error
 
\[ p(y \mid x; w, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left( -\frac{(w^T x - y)^2}{2\sigma^2} \right) \tag{1.5} \]

it is immediate to verify that the maximization of the log-likelihood in Eq. (1.4)


is equivalent to minimizing the MSE in Eq. (1.2) (regardless of the value of the
variance σ 2 ).
A very typical regression used in ML is logistic regression (Kleinbaum et al.
2002; Hosmer Jr et al. 2013), for example to obtain pass/fail (1/0) predictions. In this
case, a smooth predictor output y ∈ [0, 1] can be interpreted as a probability p(y =
1|x) := p(x). The Bernoulli distribution (which gives p for yi = 1 and (1 − p) for
yi = 0), or equivalently

\[ P(y = y_i) \equiv p^{y_i} (1 - p)^{(1 - y_i)} \tag{1.6} \]

describes this case for a given xi , as it is immediate to check. The linear regression
is assigned to the logit function to convert the (−∞, ∞) range into the desired
probabilistic [0, 1] range
 
\[ \mathrm{logit}(p(x)) = \log\left( \frac{p(x)}{1 - p(x)} \right) = w^T x \tag{1.7} \]

The logit function is the logarithm of the ratio between the odds of y = 1 (which are
p) and y = 0 (which are (1 − p)). The probability p(x) may be factored out from
Eq. (1.7) as
\[ p(x) = \frac{1}{1 + \exp(-w^T x)} \tag{1.8} \]

which is known as the sigmoid function. Neural Networks frequently use logis-
tic regression with the sigmoid model function where the parameters are obtained
through the minimization of the proper cost function, or through the maximization

of the likelihood. In this latter case, the likelihood of the probability distribution in
Eq. (1.6) is


\[ L(p(x) \mid (y_i, x_i),\ i = 1, \dots, n) = \prod_{i=1}^{n} p(x_i)^{y_i} \, [1 - p(x_i)]^{(1 - y_i)} \tag{1.9} \]

where yi are the labels (with value 1 or 0) and p(xi ) are their (sigmoid-based)
probabilistic predicted values given by Eq. (1.8) for the training data, which are a
function of the parameters w. The maximization of the log-likelihood of Eq. (1.9) for
the model parameters gives the same solution as the minimization of the cross-entropy
\[ \arg\min_{w} H(w) = -\sum_{i=1}^{n} \left[ \frac{y_i}{n} \log p(x_i; w) + \frac{(1 - y_i)}{n} \log\left(1 - p(x_i; w)\right) \right] \tag{1.10} \]
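
A minimal gradient-descent sketch of logistic regression that minimizes the cross-entropy of Eq. (1.10) is given below (illustrative only, not code from the chapter; the learning rate, iteration count, and synthetic data are assumptions):

```python
import numpy as np

def sigmoid(z):
    """Sigmoid function of Eq. (1.8)."""
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, y, lr=0.1, n_iter=2000):
    """Logistic regression by gradient descent on the mean cross-entropy."""
    Xa = np.column_stack([X, np.ones(len(X))])   # absorb the offset b into w
    w = np.zeros(Xa.shape[1])
    for _ in range(n_iter):
        p = sigmoid(Xa @ w)                      # predicted probabilities p(x_i)
        w -= lr * Xa.T @ (p - y) / len(y)        # gradient of the cross-entropy
    return w

# Example: two synthetic clusters labeled 0 and 1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 2)), rng.normal(1.0, 1.0, (50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])
w = fit_logistic(X, y)
p = sigmoid(np.column_stack([X, np.ones(100)]) @ w)
print(np.mean((p > 0.5) == y))   # training accuracy
```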

Another type of regression often used in ML is Kernel Regression. A Kernel is


a positive-definite, typically non-local, symmetric weighting function K (xi , x) =
K (x, xi ), centered in the attribute, with unit integral. The idea is similar to the use of
shape functions in finite element formulations. For example, the Gaussian Kernel is
\[ K(x_i, x) \equiv K_i(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[ -\frac{1}{2} \left( \frac{|x - x_i|}{\sigma} \right)^2 \right] \tag{1.11} \]

where σ is the bandwidth or smoothing parameter (deviation), and the weight for sample i is w_i(x) = K_i(x) / Σ_{j=1}^{n} K_j(x). The predictor, using the weights from the kernel, is f(x) = Σ_{i=1}^{n} w_i(x) y_i (although kernels may be used also for the labels). The cost function to determine σ² or other kernel parameters may be

\[ \int f^2(x)\,dx - \frac{2}{n} \sum_{i=1}^{n} f^{(i)}(x_i) \longrightarrow \min \tag{1.12} \]

where the last summation term is the LOOCV, which excludes sample i from the
set of predictions (recall that there are n different f^{(i)} functions). Equation (1.12)
focuses in essence on the minimum squared error for the solution. As explained below,
kernels are also employed in Support Vector Machines to deal with nonlinearity and
in dimensionality reduction of nonlinear problems to reduce the space.
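
A minimal Nadaraya–Watson sketch of the Gaussian-kernel predictor described above (illustrative only; here the bandwidth σ is simply fixed by hand rather than obtained from the cost function in Eq. (1.12)):

```python
import numpy as np

def kernel_regression(x_query, x_train, y_train, sigma=0.3):
    """Gaussian-kernel prediction f(x) = sum_i w_i(x) y_i (one-dimensional features)."""
    d2 = (x_query[:, None] - x_train[None, :]) ** 2        # squared distances to all samples
    K = np.exp(-0.5 * d2 / sigma ** 2)                     # kernel values K_i(x)
    weights = K / K.sum(axis=1, keepdims=True)             # w_i(x) = K_i(x) / sum_j K_j(x)
    return weights @ y_train

# Example: smooth a noisy sine curve
rng = np.random.default_rng(0)
x_train = np.sort(rng.uniform(0.0, 2.0 * np.pi, 80))
y_train = np.sin(x_train) + 0.1 * rng.normal(size=80)
x_query = np.linspace(0.0, 2.0 * np.pi, 200)
y_hat = kernel_regression(x_query, x_train, y_train)
```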

1.2.2.2 Naïve Bayes

Naïve Bayes (NB) schemes are frequently used for classification (spam e-mail fil-
tering, seismic vulnerability, etc.), and may be Multinomial NB or Gaussian NB. In
both cases the probability theory is employed.

NB procedures operate as follows. From the training data, the prior probabil-
ities for the different classes are computed, e.g., vulnerable or safe, p(V ) and
p(S), respectively, in our seismic vulnerability example. Then, for each feature,
the probabilities are computed within each class, e.g., the probability that a vul-
nerable (or safe) structure is made of steel, p(steel|V ) (or p(steel|S)). Finally,
given a sample outside the training set, the classification is obtained from the
largest probability considering the class and the features present in the sample, e.g.,
p(V ) p(steel|V ) p(. . . |V ) . . . or p(S) p(steel|S) p(. . . |V ) . . ., and so on. Gaussian
NB are applied when the features have continuous (Gaussian) distributions, as for
example the height of a building in the seismic vulnerability example. In this case
the feature conditioned probabilities p(·|V ) are obtained from the respective normal
distributions. Logarithms of the probabilities are frequently used to avoid under-
flows.
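
A compact Gaussian NB classifier along these lines may be sketched as follows (illustrative only, not code from the chapter; class priors and per-feature normal distributions are estimated from synthetic training data, and log-probabilities are used to avoid underflows):

```python
import numpy as np

def gaussian_nb_fit(X, y):
    """Estimate the class priors and per-feature Gaussian parameters of each class."""
    params = {}
    for c in np.unique(y):
        Xc = X[y == c]
        params[c] = (len(Xc) / len(X),         # prior probability of the class
                     Xc.mean(axis=0),          # per-feature means within the class
                     Xc.var(axis=0) + 1e-9)    # per-feature variances (with a small jitter)
    return params

def gaussian_nb_predict(X, params):
    """Assign the class with the largest log p(c) + sum_j log p(x_j | c)."""
    classes = list(params.keys())
    scores = []
    for c in classes:
        prior, mu, var = params[c]
        log_lik = -0.5 * (np.log(2 * np.pi * var) + (X - mu) ** 2 / var).sum(axis=1)
        scores.append(np.log(prior) + log_lik)
    return np.array(classes)[np.argmax(np.array(scores), axis=0)]

# Example usage with two synthetic classes
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (40, 2)), rng.normal(3.0, 1.0, (40, 2))])
y = np.array([0] * 40 + [1] * 40)
print(gaussian_nb_predict(X, gaussian_nb_fit(X, y)))
```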

1.2.2.3 Decision Trees (DT)

Decision trees are nonparametric. The simplest and well-known decision tree gener-
ator is the Iterative Dichotomiser 3 (ID3) algorithm, a greedy strategy. The scheme
is used in the “guess who?” game and is also in essence the idea behind the root-
finding bisection method or the Cuthill–McKee renumbering algorithm. The objec-
tive is, starting from one root node, to select at each step the feature and condi-
tion from the data that maximizes the Information Gain G (maximizes the benefit
of the split), resulting in two (or more) subsequent leaf nodes. For example, in a
group of people, the first optimal condition (maximizing the benefit of the split)
is typically if the person is male or female, resulting in the male leaf and in the
female leaf, each with 50% of the population, and so on. In seismic vulnerability,
it could be if the structure is made of masonry, steel, wood, or Reinforced Con-
crete (RC). The gain G is the difference between the information entropies before
(H (S), parent entropy) and after the split given by the feature at hand j. Let us
denote by x j (i) the feature j of sample i, by xi the array of features of sample
i, and by x j the different features (we omit the sample index if no confusion is
possible). If H (S|x j ) are the children entropies after the split by feature x j , the
Gain is

\[ G(S, x_j) = H(S) - H(S|x_j) = H(S) - \sum_{s_i = \{x_i, y_i\} \in S_j} p_j \, H(S_j) \tag{1.13} \]

where the S j are the subsets of S as a consequence of the split using feature (or
attribute) x j , p j is the subset probability (number of samples si in subset S j divided
by the number of samples in the compete set S), and


\[ H(S) = -\sum_{j=1}^{l} p(y_j) \log p(y_j) \tag{1.14} \]

where H (S) is the information entropy of set S for the possible labels y j , j =
1, . . . , l, so Eq. (1.13) results in


\[ G(S, x_j) = -\sum_{i=1}^{l} p(y_i) \log p(y_i) + \sum_{s_i \in S_j} p_j \sum_{i=1}^{l} p(y_i | x_j) \log p(y_i | x_j) \tag{1.15} \]

The gain G is computed for each feature x j (e.g., windows, structure type, building
age, and soil type). The feature that maximizes the Gain is the one selected to generate
the next level of leaves. The decision tree building process ends when the entropy
reaches zero (the samples are perfectly classified). Figure 1.5 shows a simple example
with four samples si in the dataset, each with three features x j of two possible values
(0 and 1), and one label y of two possible values (A and B). The best of the three
features is selected as that one which provides the most information gain. It is seen
that feature 1 produces some information gain because after the split using this
feature, the samples are better classified according to the label. Feature 2 gives no

Fig. 1.5 Example of determination of the feature with most information gain. If we choose feature x1 for sorting, we find two subsets, S1 = {si such that x1 = 1} = {s2, s3, s4} and S2 = {si s.t. x1 = 0} = {s1}. In S1 there are two elements (s3, s4) with label A, and one element (s2) with label B, so the probabilities are 2/3 for label A and 1/3 for label B. In S2 the only element (s1) has label B, so the probabilities are 0/1 for label A and 1/1 for label B. The entropy of subset S1 is H(S1) = −(2/3) log2(2/3) − (1/3) log2(1/3) = 0.92. The entropy of subset S2 is H(S2) = −0 − (1/1) log2(1/1) = 0. In a similar form, since there are a total of 4 samples, two with label A and two with label B, the parent entropy is H(S) = −(2/4) log2(2/4) − (2/4) log2(2/4) = 1. Then, the information gain is G = H(S) − p(S1)H(S1) − p(S2)H(S2) = 1 − (3/4)·0.92 − (1/4)·0 = 0.31, where p(Si) is the probability of a sample being in subset Si, i.e. 3/4 for S1 and 1/4 for S2. Repeating the computations for the other two features, it is seen that feature x3 is the one that has the best information gain. Indeed the information gain is G = 1 because it fully separates the samples according to the labels

gain because it is useless to distinguish the samples according to the label (it is
in 50% each), and feature 3 is the best one because it fully classifies the samples
according to the label (A is equivalent to x3 = 1, and B is equivalent to x3 = 0).
As for the Cuthill–McKee renumbering algorithm, there is no proof of reaching the
optimum.
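
The entropy and gain computations of Eqs. (1.13)–(1.15) applied to the four-sample example of Fig. 1.5 can be reproduced with a short script (illustrative only, not code from the original text; the values of feature x2 are an assumption consistent with the caption, which only states that this feature produces no gain):

```python
import numpy as np

def entropy(labels):
    """Information entropy H(S) = -sum_j p(y_j) log2 p(y_j)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(feature, labels):
    """Gain of splitting the sample set by one discrete feature, Eq. (1.13)."""
    gain = entropy(labels)
    for v in np.unique(feature):
        mask = feature == v
        gain -= mask.mean() * entropy(labels[mask])   # subtract p_j * H(S_j)
    return gain

# The four samples of Fig. 1.5: three binary features and labels A/B
X = np.array([[0, 0, 0],    # s1 -> B
              [1, 1, 0],    # s2 -> B
              [1, 0, 1],    # s3 -> A
              [1, 1, 1]])   # s4 -> A
y = np.array(["B", "B", "A", "A"])

for j in range(3):
    print(f"feature x{j + 1}: gain = {information_gain(X[:, j], y):.2f}")
# Expected: x1 -> 0.31, x2 -> 0.00, x3 -> 1.00
```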
While DT are typically used for classification, there are regression trees in which
the output is a real number. Other decision tree algorithms are the C4.5 (using con-
tinuous attributes), Classification And Regression Tree (CART), and Multivariate
Adaptive Regression Spline (MARS) schemes.

1.2.2.4 Support Vector Machines (SVM), k-Means, and k-Nearest Neighbors (kNN)

The Support Vector Machine (SVM) is a technique which tries to find the optimal
hyperplane separating groups of samples for clustering (unsupervised) or classifi-
cation (supervised). Consider the function z(x) = w T x + b for classification. The
model is
y ≡ f (x) = sign(z(x)) = sign(w T x + b) (1.16)

where the parameters {w, b} are obtained by minimizing ½|w|² (or equivalently |w|²
or |w|) subject to y_i(w^T x_i + b) ≥ 1 ∀i such that the decision boundary f(x) = 0
given by the hyperplane has maximum distance to the groups of the samples; see
Fig. 1.6. The minimization problem (in primal form) using Lagrange multipliers αi is


\[ \text{find} \quad \arg\min_{w,b} \left[ \tfrac{1}{2}|w|^2 + \sum_{i=1}^{n} \alpha_i \left( 1 - y_i (w^T x_i + b) \right) \right] \quad \text{with } \alpha_i \ge 0 \tag{1.17} \]

or in penalty form


\[ \text{find} \quad \arg\min_{w,b} \left[ \tfrac{1}{2}|w|^2 + C \sum_{i=1}^{n} \max\left( 0,\ 1 - y_i (w^T x_i + b) \right) \right] \tag{1.18} \]

A measure of certainty for sample i is based on its proximity to the boundary; i.e.
(w T xi + b)/|w| (the larger the distance to the boundary, the more certain the classifi-
cation of the sample). Of course, SVMs may be used for multiclass classification, e.g.,
using the One-to-Rest approach (employing k SVMs to classify k classes) or the One-
to-One approach (employing k(k − 1)/2 SVMs to classify k classes); see Fig. 1.6.
Taking the derivative of the Lagrangian in squared brackets in Eq. (1.17) with
respect to w and b, we get that at the minimum


\[ w = \sum_{i=1}^{n} \alpha_i y_i x_i \quad \text{and} \quad \sum_{i=1}^{n} \alpha_i y_i = 0 \tag{1.19} \]

Fig. 1.6 Two-class SVM decision boundary and one-to-one and one-to-rest SVM multiclass classification

and substituting it in the primal form given in Eq. (1.17), the minimization problem
may be written in its dual form
\[ \text{find} \quad \arg\max_{\alpha_i \ge 0} \left[ \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{j,k=1}^{n} \alpha_j \alpha_k y_j y_k (x_j^T x_k) \right] \quad \text{with } \sum_{i=1}^{n} \alpha_i y_i = 0 \tag{1.20} \]

and with b = y_j − w^T x_j, x_j being any active (support) vector, with α_j > 0. Then,
z = w^T x + b is z = Σ_i α_i y_i x_i^T x + b. Instead of searching for the weights w_i, i =
1, . . . , N (N is the number of features of each sample), we search for the coefficients
αi , i = 1, . . . , n (n is the number of samples).
Nonlinearly separable cases may be addressed through different techniques, such as using
positive slack variables ξi ≥ 0 or kernels. When using slack variables (Soft Margin
SVM), for each sample i we write y_i(w^T x_i + b) ≥ 1 − ξ_i and we apply an L1 (LASSO)
regularization by minimizing ½|w|² + C Σ_i ξ_i subject to the constraints y_i(w^T x_i +
b) ≥ 1 − ξi and ξi ≥ 0, where C is the penalization parameter. In this case, the
only change in the dual formulation is the constraint for the Lagrange multipliers:
C ≥ αi ≥ 0, as it can be easily verified.
When using kernels, the kernel trick is typically employed. The idea behind the
use of kernels is that if data is linearly non-separable in the features space, it may be

Fig. 1.7 Use of higher dimensions to obtain linearly separable data. a Data is linearly separable in 1D. b Data is not linearly separable in 1D. c Using two dimensions with mapping φ = [x, x²]^T, data becomes linearly separable in the augmented space

separable in a larger space; see, for example, Fig. 1.7. This technique uses the dual
form of the SVM optimization problem. Using the dual form


\[ |w|^2 = w^T w = \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j (x_i^T x_j) \quad \text{and} \quad w^T x = \sum_{i=1}^{n} \alpha_i y_i (x_i^T x) \tag{1.21} \]

the equations only involve inner products of feature vectors of the type (xiT x j ), ideal
for using a kernel trick. For example, the case shown in Fig. 1.8 is not linearly separa-
ble in the original features space, but using the mapping φ(x) := [x_1², x_2², √2 x_1 x_2]^T
to an augmented space, we find that the samples are linearly separable in this space.
Then, for performing the linear separation in the transformed space, we have to
compute z in that transformed space (Representer Theorem, Schölkopf et al. 2001)


\[ z = \sum_{i=1}^{n} \alpha_i y_i \left[ \phi(x_i)^T \phi(x) \right] + b \tag{1.22} \]

Fig. 1.8 Linearly non-separable samples (left). Linear separation in a transformed higher dimen-
sional space (right)

to substitute the inner products in the original space by inner products in the trans-
formed space. These operations (transformations plus inner products in the high-
dimensional space) can be expensive (in complex cases we need to add many dimen-
sions). However, in our example we note that
\[ K(a, b) := \phi(a)^T \phi(b) = \begin{bmatrix} a_1^2 & a_2^2 & \sqrt{2}\,a_1 a_2 \end{bmatrix} \begin{bmatrix} b_1^2 \\ b_2^2 \\ \sqrt{2}\,b_1 b_2 \end{bmatrix} = \left( \begin{bmatrix} a_1 & a_2 \end{bmatrix} \begin{bmatrix} b_1 \\ b_2 \end{bmatrix} \right)^2 = (a^T b)^2 \tag{1.23} \]
so it is not necessary to use the transformed space because the inner product can be equally calculated in both spaces. Indeed, note that, remarkably, we do not even need to know φ(x) explicitly, because the kernel K(a, b) = (a^T b)² is fully written in the original space and we never need φ(x). Then we just solve
find arg max_{α_i ≥ 0} [ Σ_{i=1}^{n} α_i − (1/2) Σ_{j,k=1}^{n} α_j α_k y_j y_k K(x_j, x_k) ]   with   Σ_{i=1}^{n} α_i y_i = 0        (1.24)

Examples of kernel functions are the polynomial kernel K(a, b) = (a^T b + 1)^d, where d is the degree, and the Gaussian radial basis function (which may be interpreted as a polynomial form with infinite terms)

K(a, b) = exp(−γ |a − b|²)        (1.25)

where γ = 1/(2σ²) > 0.
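As a brief illustration of the above, the following sketch trains a soft-margin SVM with the Gaussian RBF kernel of Eq. (1.25); the use of the scikit-learn library and the synthetic dataset are assumptions made here only for illustration, and the penalization parameter C and the kernel parameter γ are the hyperparameters discussed above.

```python
# Minimal sketch: soft-margin SVM with an RBF kernel (hypothetical synthetic data)
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Two classes that are not linearly separable in the original feature space
X_inner = rng.normal(scale=0.5, size=(100, 2))                 # class 0: inner blob
angles = rng.uniform(0.0, 2.0 * np.pi, size=100)
X_outer = np.c_[3.0 * np.cos(angles), 3.0 * np.sin(angles)]    # class 1: surrounding ring
X = np.vstack([X_inner, X_outer])
y = np.r_[np.zeros(100), np.ones(100)]

# C is the slack penalization; gamma plays the role of 1/(2 sigma^2) in Eq. (1.25)
clf = SVC(kernel="rbf", C=10.0, gamma=0.5)
clf.fit(X, y)
print("number of support vectors:", clf.support_vectors_.shape[0])
print("prediction at the origin:", clf.predict([[0.0, 0.0]]))
```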


Considering clustering, another algorithm similar in nature to the SVM is the k-means algorithm (an unsupervised technique creating k clusters). The idea here is to employ a distance measurement in order to determine the optimal centers of the clusters and the optimal decision boundaries between clusters. The case k = 1 is essentially the Voronoi diagram. Another simple approach is the k-Nearest Neighbors (k-NN) scheme, a supervised technique also employed in classification. This technique uses the labels of the k nearest neighbors to predict the label of the target points (e.g., by some weighting method).
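A brief sketch of both schemes (assuming scikit-learn and a hypothetical three-blob dataset, neither of which is prescribed by the text) could read:

```python
# Minimal sketch: k-means clustering and k-NN classification (hypothetical data)
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ([0, 0], [2, 2], [0, 2])])

# Unsupervised: find k = 3 cluster centers; boundaries follow from the distance measure
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster centers:\n", kmeans.cluster_centers_)

# Supervised: predict a label from the labels of the k nearest neighbors (distance weighting)
y = np.repeat([0, 1, 2], 50)                     # known labels of the samples
knn = KNeighborsClassifier(n_neighbors=5, weights="distance").fit(X, y)
print("predicted label at (1, 1):", knn.predict([[1.0, 1.0]]))
```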

1.2.2.5 Dimensionality Reduction Algorithms

When problems have too many features (or data, measurements), dimensionality
reduction techniques are employed to reduce the number of attributes and gain
insight into the most meaningful ones. These are typically employed not only in
pattern recognition and image processing (e.g., identification or compression) but
also to determine which features, data, or variables are most relevant for the learning
purpose. In essence, the algorithms are similar in nature to determining the principal
modes in a dynamic response, because with that information, relevant mechanical

properties (mass distribution, stiffness, damping), and the overall response may be
obtained. In ML, classical approaches are given by Principal Component Analysis
(PCA) based on Pearson’s Correlation Matrix (Abdi and Williams 2010; Bro and
Smilde 2014), Singular Value Decomposition (SVD), Proper Orthogonal Decompo-
sition (POD) (Berkooz et al. 1993), Linear (Fisher’s) Discriminant Analysis (LDA)
(Balakrishnama and Ganapathiraju 1998; Fisher 1936), Kernel (Nonlinear) Principal Component Analysis (kPCA) (Hofmann et al. 2008; Alvarez et al. 2012),
Local Linear Embedding (LLE) (Roweis and Saul 2000; Hou et al. 2009), Man-
ifold Learning (used also in constitutive modeling) (Cayton 2005; Bengio et al.
2013; Turaga et al. 2020), Uniform Manifold Approximation and Projection (UMAP)
(McInnes et al. 2018), and autoencoders (Bank et al. 2020; Zhuang et al. 2021; Xu and
Duraisamy 2020; Bukka et al. 2020; Simpson et al. 2021). Often, these approaches
are also used in clustering.
LLE is one of the simplest nonlinear dimension reduction processes. The idea is
to identify a global space with smaller dimension that reproduces the proximity of
data in the higher dimensional space; it is a k-NN approach. First, we determine the
weights wi j , such that wi j = 1, which minimize the error
⎛ ⎞2
n k
Error(w) = ⎝x i − wi j x j ⎠ (1.26)
i=1 j=1

in the representation of a point from the local space given by the k-nearest points
(k is a user-prescribed hyperparameter), so
w = arg min [Error(w)] (1.27)

Then, we search for the images yi of xi in the lower dimensional space, simply
by considering that the computed wi j reflect the geometric properties of the local
manifold and are invariant to translations and rotations. Given wi j , we now look for
the lower dimensional coordinates yi that minimize the cost function
⎛ ⎞2

n k
C(yi , i = 1, . . . , n) = ⎝yi − wi j y j ⎠ (1.28)
i=1 j=1
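A compact way to experiment with LLE is through scikit-learn, which implements the weight and embedding steps above; the Swiss-roll dataset below is a hypothetical example chosen only for illustration:

```python
# Minimal sketch: Locally Linear Embedding of a synthetic 3D manifold into 2D
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)  # 3D data on a 2D manifold

# n_neighbors is the hyperparameter k of Eqs. (1.26)-(1.28)
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=2)
Y = lle.fit_transform(X)                   # lower dimensional coordinates y_i
print(Y.shape, "reconstruction error:", lle.reconstruction_error_)
```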

Isometric Mapping (ISOMAP) techniques are similar, but use geodesic k-node-
to-k-node distances (computed by Dijkstra’s 1959 or the Floyd–Warshall 1962 algo-
rithms to find the shortest path between nodes) and look for preserving them in the
reduced space. Another similar technique is the Laplacian eigenmaps scheme (Belkin and Niyogi 2003), based on the non-singular lowest eigenvectors of the Graph Laplacian L = d − w, where d_ii = Σ_j w_ij gives the diagonal degree matrix and w_ij are the edge weights, computed for example using the Gaussian kernel w_ij = K(x_i, x_j) = exp(−|x_i − x_j|²/(2σ²)). Within the same k-neighbors family, yet more complex and
advanced, are Topological Data Analysis (TDA) techniques. A valuable overview
may be found in Chazal and Michel (2021); see also the references therein.

For the case of PCA, it is typical to use the covariance matrix

S_jk = (1/n) Σ_{i=1}^{n} (x_j^{(i)} − x̄_j)(x_k^{(i)} − x̄_k)        (1.29)

where the overbar denotes the mean value of the feature, and x_j^{(i)} is feature j of sample i. The eigenvectors and eigenvalues of the covariance matrix are the principal components (directions/values of maximum significance/relevance), and the number of them selected as sufficient is determined by the variance ratios; see Fig. 1.9a. PCA is a linear unsupervised technique. The typical implementation uses mean-corrected samples, as in kPCA, so in such a case S_jk = (1/n) Σ_{i=1}^{n} x_j^{(i)} x_k^{(i)}, or in matrix notation S = (1/n) XX^T. kPCA (Schölkopf et al. 1997) is PCA using kernels (such as polynomials,
the hyperbolic tangent, or the Radial Basis Function (RBF)) to address the nonlin-
earity by expanding the space. For example, using the RBF, we construct the kernel
matrix K i j , for which the components are obtained from the samples i, j as K i j =
exp(−γ |xi − x j |2 ) . The RBF is then centered in the transformed space by (note that
being centered in the original features space does not mean that the features are also
automatically centered in the transformed space, hence the need for this operation)
K̄ = K − (1/n) 1K − (1/n) K1 + (1/n²) 1K1        (1.30)

where 1 is an n × n matrix of the unit entry "1". Then K̄(x_i, x_j) = φ̄(x_i)^T φ̄(x_j) with the centered φ̄(x_i) = φ(x_i) − (1/n) Σ_{r=1}^{n} φ(x_r). The larger eigenvalues are the

Fig. 1.9 a Principal Component Analysis. The principal components are those with the largest variations (largest eigenvalues of the variances matrix). b Linear Discriminant Analysis to separate clusters. It is seen that feature x1 is not a good choice to determine if a sample belongs to a given cluster, but there is a combination of features (a line) which gives the best discrimination between clusters. That combination maximizes the distance between the means of the clusters while minimizing the dispersion of samples within the clusters

principal components in the transformed space, and the corresponding eigenvectors are the samples already projected onto the principal axes.
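The kernel matrix construction and the centering of Eq. (1.30) can be written compactly; the sketch below (plain NumPy, with a hypothetical data matrix X of n samples by N features and an assumed value of γ) extracts the leading kernel principal components:

```python
# Minimal sketch: kernel PCA with an RBF kernel and the centering of Eq. (1.30)
import numpy as np

def kpca_rbf(X, gamma=0.5, n_components=2):
    n = X.shape[0]
    # Kernel matrix K_ij = exp(-gamma * |x_i - x_j|^2)
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    K = np.exp(-gamma * sq_dists)
    # Centering in the transformed space, Eq. (1.30): the "1" matrix has all entries equal to 1
    one_n = np.ones((n, n)) / n
    K_bar = K - one_n @ K - K @ one_n + one_n @ K @ one_n
    # Largest eigenvalues/eigenvectors give the samples projected onto the principal axes
    eigvals, eigvecs = np.linalg.eigh(K_bar)
    idx = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, idx] * np.sqrt(np.abs(eigvals[idx]))   # projected coordinates

X = np.random.default_rng(2).normal(size=(100, 5))   # hypothetical dataset
print(kpca_rbf(X).shape)                              # (100, 2)
```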
In contrast to PCA, the purpose of LDA is typically to improve separability of
known classes (a supervised technique), and hence maximize information in this
sense: maximizing the distance between the mean values of the classes and, within
each class, minimizing the variation. It does so through the eigenvalues of the normalized between-classes scatter matrix S_w^{−1} S_b (the between-variances by the within-variances) where

S_w = Σ_{i=1}^{n−classes} Σ_{within-class-i}^{n_i} (x − m_i)(x − m_i)^T        (1.31)

S_b = Σ_{Class i=1}^{n−classes} n_i (m_i − x̄)(m_i − x̄)^T        (1.32)

and x̄ is the overall mean vector of the features x and mi is the mean vector of
those within-class i. If x̄ = mi the class is not separable from the selected fea-
tures. Frequently used nonlinear extensions of LDA are the Quadratic Discriminant
Analysis (QDA) (Tharwat 2016; Ghosh et al. 2021), Flexible Discriminant Analysis
(FDA) (Hastie et al. 1994), and Regularized Discriminant Analysis (RDA) (Friedman
1989).
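A direct NumPy transcription of Eqs. (1.31) and (1.32) (with a hypothetical labeled dataset; the discriminant directions are the leading eigenvectors of S_w^{−1} S_b) could look like:

```python
# Minimal sketch: Linear Discriminant Analysis via the scatter matrices of Eqs. (1.31)-(1.32)
import numpy as np

def lda_directions(X, y, n_components=1):
    classes = np.unique(y)
    x_bar = X.mean(axis=0)
    n_feat = X.shape[1]
    Sw = np.zeros((n_feat, n_feat))
    Sb = np.zeros((n_feat, n_feat))
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)                # within-class scatter, Eq. (1.31)
        diff = (mc - x_bar).reshape(-1, 1)
        Sb += Xc.shape[0] * (diff @ diff.T)          # between-class scatter, Eq. (1.32)
    eigvals, eigvecs = np.linalg.eig(np.linalg.inv(Sw) @ Sb)
    idx = np.argsort(eigvals.real)[::-1][:n_components]
    return eigvecs[:, idx].real                      # discriminant direction(s)

rng = np.random.default_rng(3)
X = np.vstack([rng.normal([0, 0], 0.5, (40, 2)), rng.normal([2, 1], 0.5, (40, 2))])
y = np.repeat([0, 1], 40)
print(lda_directions(X, y).ravel())
```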
Proper Orthogonal Decompositions (POD) are frequently motivated in PCA and
are often used in turbulence and in reducing dynamical systems. It is a technique
also similar to classical modal decomposition. The idea is to decompose the time-
dependent solution as
u(x, t) = Σ_{p=1}^{P} a_p(t) φ_p(x)        (1.33)

and compute the Proper Orthogonal Modes (POMs) φ_p(x) that maximize the energy representation (L2-norm). In essence, we are looking for the set of "discrete functions" φ_p(x) that best represent u(x, t) with the lowest number of terms P. Since these are computed as discretized functions, several snapshots u(x, t_i), i = 1, . . . , n are grabbed in the discretized domain, i.e.

U = [u(x, t_1)  u(x, t_2)  . . .  u(x, t_n)] = [ u_11 . . . u_1n ; . . . ; u_m1 . . . u_mn ]        (1.34)

Then, the POD vectors are the eigenvectors of the sample covariance matrix. If the snapshots are corrected to have zero mean value, the covariance matrix is

S = (1/n) U U^T        (1.35)

The POMs may also be computed using the SVD of U (the left singular vectors are the
eigenvectors of UUT ) or auto-associative NNs (Autoencoder Neural Networks that
replicate the input in the output but using a hidden layer of smaller dimension). To
overcome the curse of dimensionality when using too many features (e.g., for para-
metric analyses), the POD idea is generalized in Proper Generalized Decomposition
(PGD), by assuming approximations of the form

u(x_1, x_2, . . . , x_d) = Σ_{i=1}^{N} φ_1^i(x_1) ◦ φ_2^i(x_2) ◦ . . . ◦ φ_d^i(x_d)        (1.36)

where the φ_j^i(x_j) are the unknown vector functions (usually also discretized and computed iteratively, for example using a greedy algorithm), and "◦" stands for the Hadamard or entry-wise product of vectors. Note that, in general, we cannot assume the separability u(x, y) = φ(x)ψ(y), but PGDs look for the best choices φ_i(x)ψ_i(y) for the given problem such that we can write u(x, y) ≈ Σ_i φ_i(x)ψ_i(y) with a sufficiently small number of addends (hence with a reduced complexity). The power of the idea is that for a large number n of features, determining functions of the type u(x_1, x_2, . . . , x_n) is virtually impossible, but determining products and additions of scalar functions is feasible.
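Since the POMs are the left singular vectors of the snapshot matrix U, a few lines of NumPy suffice to extract them; the snapshot matrix below is a hypothetical one (m spatial points, n time instants) used only to illustrate the procedure:

```python
# Minimal sketch: POD of a snapshot matrix via the SVD (cf. Eqs. (1.33)-(1.35))
import numpy as np

rng = np.random.default_rng(4)
m, n = 200, 50                               # m spatial points, n snapshots in time
x = np.linspace(0.0, 1.0, m)
t = np.linspace(0.0, 1.0, n)
# Hypothetical field: two spatial modes with time-dependent amplitudes plus noise
U = (np.outer(np.sin(np.pi * x), np.cos(2 * np.pi * t))
     + 0.3 * np.outer(np.sin(3 * np.pi * x), np.sin(4 * np.pi * t))
     + 0.01 * rng.normal(size=(m, n)))

U = U - U.mean(axis=1, keepdims=True)        # zero-mean snapshots
Phi, sigma, _ = np.linalg.svd(U, full_matrices=False)   # columns of Phi are the POMs
energy = np.cumsum(sigma**2) / np.sum(sigma**2)
P = int(np.searchsorted(energy, 0.999) + 1)  # modes needed for 99.9% of the energy
print("number of retained POMs:", P)         # expected: about 2 for this field
```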
The UMAP and t-SNE schemes are based on the concept of a generalized metric
or distance between samples. A symmetric and normalized (between 0 and 1) metric
is defined as

d_ij(x_i, x_j) = d_i^j(x_i, x_j) + d_j^i(x_i, x_j) − d_i^j(x_i, x_j) d_j^i(x_i, x_j)        (1.37)

where the unidirectional distance function is defined as

d_i^j(x_i, x_j) = exp( −(ρ_ij − ρ_i1)/ρ_ik )        (1.38)

where ρ_ij = |x_i − x_j| and ρ_ik = |x_i − x_k|, with k referring to the k-nearest neighbor (ρ_i1 refers to the nearest neighbor to i). Here k is an important hyperparameter. Note that d_i^j = 1 if i, j are nearest neighbors, and d_i^j → 0 for far away neighbors. We are looking for a new set of lower dimensional features z to replace x. The same generalized distance d_ij(z_i, z_j) may be applied to the new features. To this end, through optimization techniques, like the steepest descent, we minimize the fuzzy set dissimilarity cross-entropy (or entropy difference) like the Kullback–Leibler (KL) divergence (Hershey and Olsen 2007; Van Erven and Harremos 2014), which measures the difference between the probability distributions d_ij(x_i, x_j) and d_ij(z_i, z_j), and their complementary values [1 − d_ij(x_i, x_j)] and [1 − d_ij(z_i, z_j)] (recall that d ∈ (0, 1], so it is seen as a probability distribution)

KL(d(x), d(z)) = Σ_{i,j=1}^{n} { d_ij(x_i, x_j) ln[ d_ij(x_i, x_j) / d_ij(z_i, z_j) ] + [1 − d_ij(x_i, x_j)] ln[ (1 − d_ij(x_i, x_j)) / (1 − d_ij(z_i, z_j)) ] }        (1.39)

Note that the KL scheme is not symmetric with respect to the distributions. If distances in both spaces are equal for all the samples, KL = 0. In general, a lower dimensional space gives KL ≠ 0, but with the dimension of z fixed, the features (or combinations of features) that give a minimum KL considering all n samples represent the optimal selection.
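The construction of Eqs. (1.37)–(1.39) is easy to reproduce numerically; the sketch below (plain NumPy, with hypothetical data and the neighbor index k fixed by the user) evaluates the symmetric fuzzy distances and the cross-entropy between the original and a candidate reduced representation:

```python
# Minimal sketch: fuzzy distances of Eqs. (1.37)-(1.38) and the KL cost of Eq. (1.39)
import numpy as np

def fuzzy_distance_matrix(X, k=5):
    n = X.shape[0]
    rho = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)   # rho_ij = |x_i - x_j|
    order = np.sort(rho + np.eye(n) * 1e12, axis=1)                # exclude the diagonal
    rho1, rhok = order[:, 0], order[:, k - 1]                      # nearest and k-th neighbor
    d_dir = np.exp(-(rho - rho1[:, None]) / rhok[:, None])         # Eq. (1.38), row-wise
    d = d_dir + d_dir.T - d_dir * d_dir.T                          # Eq. (1.37), symmetrized
    np.fill_diagonal(d, 1.0)
    return np.clip(d, 1e-12, 1.0 - 1e-12)

def kl_cost(dx, dz):                                               # Eq. (1.39)
    return np.sum(dx * np.log(dx / dz) + (1.0 - dx) * np.log((1.0 - dx) / (1.0 - dz)))

rng = np.random.default_rng(5)
X = rng.normal(size=(60, 10))       # original features
Z = X[:, :2]                        # a (naive) candidate low-dimensional representation
print("KL cost of the candidate embedding:",
      kl_cost(fuzzy_distance_matrix(X), fuzzy_distance_matrix(Z)))
```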
Autoencoders are a type of neural network, discussed below, and can be interpreted
as a nonlinear generalization of PCA. Indeed, an autoencoder with linear activation
functions is equivalent to a SVD.

1.2.2.6 Genetic Algorithms

Genetic Algorithms (Mitchell 1998) in ML (or more generally evolutionary algorithms) are in essence very similar to those employed in optimization (Grefenstette
1993; De Jong 1988). They are metaheuristic algorithms which include the steps
in natural evolution: (1) Initial population, (2) a fitness function, (3) a (nature-like)
selection according to fitness, (4) crossover (the gene combination), (5) mutation
(random alteration). After running many generations, convergence is expected to the
superspecies. Feature selection and database reduction is a typical application (Vafaie
and De Jong 1992). The variety of implementations is large and the implementations
depend on the specific problem addressed (e.g., polymer design, Kim et al. 2021, and
materials modeling, Paszkowicz 2009), but the essence and ingredients are similar.
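The five steps listed above translate almost directly into code; the following sketch (a deliberately simple, hypothetical binary feature-selection example in plain NumPy, not any specific published implementation) shows one possible realization:

```python
# Minimal sketch: a basic genetic algorithm for binary feature selection
import numpy as np

rng = np.random.default_rng(6)
n_features, pop_size, n_gen = 20, 30, 50
target = rng.integers(0, 2, n_features)              # hypothetical "best" feature subset

def fitness(individual):                             # (2) fitness function
    return -np.sum(individual != target)             # higher is better

pop = rng.integers(0, 2, (pop_size, n_features))     # (1) initial population
for _ in range(n_gen):
    scores = np.array([fitness(ind) for ind in pop])
    parents = pop[np.argsort(scores)[-pop_size // 2:]]      # (3) selection by fitness
    children = []
    while len(children) < pop_size:
        a, b = parents[rng.integers(len(parents), size=2)]
        cut = rng.integers(1, n_features)                    # (4) crossover
        child = np.r_[a[:cut], b[cut:]]
        flip = rng.random(n_features) < 0.02                 # (5) mutation
        children.append(np.where(flip, 1 - child, child))
    pop = np.array(children)

best = pop[np.argmax([fitness(ind) for ind in pop])]
print("best individual matches target in", np.sum(best == target), "of", n_features, "genes")
```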

1.2.2.7 Rosenblatt’s Perceptron, Adaline (Adaptive Linear Neuron)


Model, and Feed Forward Neural Networks (FFNN)

Currently, the majority of ML algorithms employed in practice are some type or variation of Neural Networks. Deep Learning (DL) refers to NNs with many layers.
While the NN theory was proposed some decades ago, efficient implementations
facilitating the solution of real-world problems have been established only in the
late 1980s and early 1990s. NNs are based on the ideas from McCulloch and Pitts
(1943) describing a simple model for the work of neurons, and on Rosenblatt’s
perceptron (Rosenblatt 1958); see Fig. 1.10. The Adaline model (Widrow and Hoff
1962) (Fig. 1.10) introduces the activation function to drive the learning process
from the different samples, instead of the dichotomic outputs from the samples.
This activation function is today one of the keystones of NNs. The logistic sigmoid
function is frequently used. There are other alternatives such as the ReLU (Rectified
Linear Unit; the Macaulay ramp function) or the hyperbolic tangent

tanh(z) = [exp(z) − exp(−z)] / [exp(z) + exp(−z)],   with z = w^T x + w_0        (1.40)

NNs are made from many such artificial neurons, typically arranged in several layers,
with each layer l = 1, . . . , L containing many neurons. The output from the network
Fig. 1.10 Rosenblatt's perceptron and Adaline (Adaptive Linear Neuron) model

is defined as a composition of functions

y = f L (f L−1 (f L−2 (. . . f 2 (f 1 (z1 ))))) (1.41)

where the f l are the neuron functions of the layer (often also denoted by σ l in the
sigmoid case), typically arranged by groups in the form f l (Wl xl + bl ), where Wl is
the matrix of weights, zl := Wl xl + bl , xl = yl−1 = f l−1 (zl−1 ) are the neuron inputs
and output of the previous layer (the features for the first function; y0 ≡ x), and bl
is the layer bias vector, which is often incorporated as a weight on a unit bias by
writing zl = Wl xl , so xl has also the index 0, and x0l = 1; see Fig. 1.11. The output
may be also a vector y ≡ y L . The purpose of the learning process is to learn the
optimum values of Wl , bl . The power of the NNs is that a simple architecture, with
simple functions, may be capable of reproducing more complex functions. Indeed,
Rosenblatt’s scheme discussed below may give any linear or nonlinear function. Of
course, complex problems will require many neurons, layers, and data, but the overall
structure of the method is almost unchanged.
The Feed Forward Neural Network (FFNN) with many layers, as shown in
Fig. 1.11, is trained by optimization algorithms (typically modifications of the steep-
est descent) using the backpropagation algorithm, which consists in computing the

Fig. 1.11 Neural Network with L − 1 hidden layers and one (the L-th) output layer. Notation for weights is W_{oi}^l, where i is the input cell (zero refers to the bias unit), o is the output cell (the order is often reversed in the literature), and l = 1, . . . , L are the layers

sensitivities using the chain rule from the output layer to the input layer, so for each
layer, the information of the derivatives of the subsequent layers are known. For exam-
ple, in Fig. 1.11, assume that the error is computed as E = 21 (y − yexp )T (y − yexp )
(logistic errors are more common, but we consider herein this simpler case). Then,
if α is the learning rate (a hyperparameter), the increment between epochs of the
parameters is

ΔW_{oi}^l = −α (y − y_exp)^T ∂y/∂W_{oi}^l        (1.42)

where ∂y/∂W_{oi}^l is computed through the chain rule. Figure 1.12 shows a simple example with two layers and two neurons per layer; superindices refer to layer and subindices to neuron. For example, following the green path, we can compute

∂y_2/∂W_{21}^2 = [∂y_2/∂z_2^2] [∂z_2^2/∂W_{21}^2]        (1.43)

where ∂y_2/∂z_2^2 is the derivative of the selected activation function evaluated at the iterative value z_2^2, and ∂z_2^2/∂W_{21}^2 = x_1^2 is also a known iterative value. As an example of a deeper layer, consider the red line in Fig. 1.12

∂y_1/∂W_{21}^1 = [ (∂y_1/∂z_1^2)(∂z_1^2/∂x_2^2) ] [ ∂x_2^2/∂z_2^1 ] { ∂z_2^1/∂W_{21}^1 }        (1.44)

Fig. 1.12 Computation of the gradient through backpropagation. z_o^l is defined as z_o^l = Σ_i W_{oi}^l x_i^l (which includes the bias) and f_o^l(z_o^l) is the activation function

Fig. 1.13 Neural networks are capable of generating functions to fit data regardless of the dimension of the space and the nonlinearity of the problem. In this example, three neurons of the simplest Rosenblatt's perceptron consisting of a step function are used to generate a local linear function. This function is obtained by simply changing the weights of the bias and adding the results of the three neurons with equal weights. Other more complex functions may be obtained with different weights. Furthermore, the firing step function may be replaced by generally better choices such as the ReLU or the sigmoid functions

where we note that the first square bracket corresponds to the last layer, the second to the previous one, and so on, until the term in curly brackets addressing the specific network variable. The procedures had issues with exploding or vanishing gradients (especially with sigmoid and hyperbolic tangent activations), but several improvements in the algorithms (gradient clipping, regularization, skipping of connections, etc.) have resulted in efficient algorithms for many hidden layers. The complex improvement techniques, with an important number of "tweaks" to make them work in practical problems, are one of the reasons why "canned" libraries are employed and recommended (Fig. 1.13).
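To make the chain-rule bookkeeping of Eqs. (1.41)–(1.44) concrete, the following sketch (plain NumPy, a hypothetical two-layer network with sigmoid activations and a toy regression target, not the implementation of any of the cited works) performs one gradient-descent update via backpropagation:

```python
# Minimal sketch: forward pass and backpropagation for a two-layer FFNN
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(7)
x = rng.normal(size=(3, 1))                          # input features
y_exp = np.array([[0.3], [0.7]])                     # "experimental" target
W1, b1 = rng.normal(size=(4, 3)), np.zeros((4, 1))   # layer 1 weights and bias
W2, b2 = rng.normal(size=(2, 4)), np.zeros((2, 1))   # layer 2 (output) weights and bias
alpha = 0.1                                          # learning rate

# Forward pass: y = f2(W2 f1(W1 x + b1) + b2), cf. Eq. (1.41)
z1 = W1 @ x + b1; a1 = sigmoid(z1)
z2 = W2 @ a1 + b2; y = sigmoid(z2)

# Backward pass: chain rule from the output layer to the input layer
dE_dy = y - y_exp                                    # from E = 1/2 |y - y_exp|^2
delta2 = dE_dy * y * (1.0 - y)                       # dE/dz2 (sigmoid derivative)
delta1 = (W2.T @ delta2) * a1 * (1.0 - a1)           # dE/dz1, cf. Eq. (1.44)

# Gradient-descent update, cf. Eq. (1.42)
W2 -= alpha * delta2 @ a1.T; b2 -= alpha * delta2
W1 -= alpha * delta1 @ x.T;  b1 -= alpha * delta1
err = 0.5 * np.sum((sigmoid(W2 @ sigmoid(W1 @ x + b1) + b2) - y_exp) ** 2)
print("error after one update:", float(err))
```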

1.2.2.8 Bayesian Neural Networks (BNN)

A Bayesian Neural Network (BNN) is a NN that uses probability and the Bayes
theorem relating conditional probabilities

p(z|x) = p(x|z) p(z) / p(x)        (1.45)

where p(x|z) = p(x ∩ z)/ p(z). A typical example is to consider a probabilistic dis-
tribution of the weights (so we take z = w) for a given model, or a probabilistic
distribution of the output (so we take z = y) not conditioned to a specific model.
These choices can be applied in uncertainty quantifications (Olivier et al. 2021),
with metal fatigue a typical application case (Fernández et al. 2022a; Bezazi et al.
2007). Given the complexity in passing analytical distributions through the NN, sam-
pling is often performed through Monte Carlo approaches. The purpose is to learn
the mean and standard deviations of the distributions of the weights, assuming they follow a normal distribution w_i ∼ N(μ_i, σ_i²). For the case of predicting an output y, considering one weight, the training objective is to maximize the probability of the training data for the best prediction, or minimize the likelihood of a bad prediction as

μ*, σ* = arg min_{μ,σ} Σ_{∀ x_i, y_i} L( f(x_i; N(μ, σ)), y_i ) + KL[ p(N(μ, σ)), p(N(0, 1)) ]        (1.46)

where KL(p_1, p_2) is the Kullback–Leibler divergence regularization for the probabilities p_1 and p_2 explained before, L is the loss function, and f(x_i; N(μ, σ)) is the function prediction for y from data x_i, assuming a distribution N(μ, σ). With the learned optimal parameters μ*, σ*, the prediction for new data x is

y = (1/K) Σ_{k=1}^{K} f(x; N_k(μ*, σ*))        (1.47)

where the N_k(μ*, σ*) are the numerical evaluations (samples) of the normal distributions for the obtained parameters.
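The Monte Carlo averaging of Eq. (1.47) is straightforward to emulate; the sketch below (NumPy, with a hypothetical scalar model f and already-learned values μ*, σ* chosen purely for illustration) draws K weight samples and averages the resulting predictions:

```python
# Minimal sketch: Monte Carlo prediction with a learned weight distribution, cf. Eq. (1.47)
import numpy as np

rng = np.random.default_rng(8)
mu_star, sigma_star = 1.2, 0.15          # hypothetical learned mean and std of a single weight
x, K = 2.0, 1000

def f(x, w):                             # hypothetical model: one weight, sigmoid output
    return 1.0 / (1.0 + np.exp(-w * x))

w_samples = rng.normal(mu_star, sigma_star, size=K)     # samples N_k(mu*, sigma*)
predictions = f(x, w_samples)
print("mean prediction:", predictions.mean(), "+/-", predictions.std())
```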

1.2.2.9 Convolutional Neural Networks (CNNs)

Although a Convolutional Neural Network (CNN) is a type of FFNN, CNNs were formulated with the purpose of classifying images. CNNs combine one or several convolution layers with pooling layers (for feature extraction from images) and with normal final FFNN layers for classification (Fig. 1.14). Pooling is also named subsampling, since averaging or extracting the maximum of a patch are the typical operations.

Fig. 1.14 Typical structure of a CNN, including one convolution layer, one pooling layer, a flattened layer of features, and a FFNN

Fig. 1.15 Convolutional network layer with depth 1, stride length 2 (the filter patch moves 2 positions at once) and edge padding 1 (the boundary is filled with one row and column of zeroes). Pooling is similar, but usually selects the maximum or average of a moving patch to avoid correlation of features with location

In the convolutional layers, input data usually has
several dimensions, and they are filtered with a moving patch array (also named
kernel, with a specific stride length and edge padding; see Fig. 1.15) to reduce
the dimension and/or to extract the main characteristics of, or information from,
the image (like looking at a lower resolution version or comparing patterns with a
reference). Each padding using a patch over the same record is called a channel, and
successive or chained paddings are called layers, Fig. 1.15. The same padding, with
lower dimension, may be applied over different sample dimensions (a volume). In
essence, the idea is similar to the convolution of functions in signal processing to
extract information from the signal; indeed, this is also an application of CNNs. The structure of CNNs has obvious and interesting applications in multiscale modeling in materials science and in constitutive modeling (Yang et al. 2020; Frankel et al. 2022), and thus also in determining material properties (Xie and Grossman 2018; Zheng et al. 2020), behavior prediction (Yang et al. 2020), and obviously in extracting microstructure information from images (Jha et al. 2018).
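A minimal CNN of the kind sketched in Figs. 1.14 and 1.15 (one convolution layer, one pooling layer, a flattened layer, and a final FFNN classifier) could be written, for instance with PyTorch (an assumption made here; any deep learning library offers equivalent building blocks), as:

```python
# Minimal sketch: convolution + pooling + flattening + FFNN classifier (PyTorch)
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, n_classes=10):
        super().__init__()
        self.conv = nn.Conv2d(in_channels=1, out_channels=8,
                              kernel_size=3, stride=2, padding=1)    # convolution layer
        self.pool = nn.MaxPool2d(kernel_size=2)                      # pooling (subsampling) layer
        self.fc = nn.Sequential(nn.Flatten(),                        # flattened feature layer
                                nn.Linear(8 * 8 * 8, 32), nn.ReLU(),
                                nn.Linear(32, n_classes))            # final FFNN classifier

    def forward(self, x):
        return self.fc(self.pool(torch.relu(self.conv(x))))

model = TinyCNN()
image_batch = torch.randn(4, 1, 32, 32)     # hypothetical batch of 4 one-channel 32x32 images
print(model(image_batch).shape)             # torch.Size([4, 10])
```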

1.2.2.10 Recurrent Neural Networks (RNN)

RNNs are used for sequences of events, so they are extensively used in language
processing (e.g., in “Siri” or translators from Google), and they are effective in
unveiling and predicting sequences of events (e.g., manufacturing) or when history
is important (path-dependent events as in plasticity Mozaffar et al. 2019; Li et al.
2019; du Bos et al. 2020). In Fig. 1.16, a simple RNN is shown with t h representing
the history variables, such that the equations of the RNN are
t+1
h = f lh (Wh t h + Wx t x + b) (1.48)
t
o= f lo (Wh t h + W x t x + b) (1.49)
t
y= f lh (Wh t h + b) (1.50)

The unfolding of a RNN allows for a better understanding of the process; see Fig. 1.16.
Following our previous seismic example, they can be used to study the prediction of

Fig. 1.16 Recurrent Neural Network. a Folded representation, b unfolded representation considering three events, and c classification according to the input–output instances considered

Fig. 1.17 A LSTM RNN, including long and short memory, and forget, input, and output gates. σ is the sigmoid function, colored boxes are typical NN layers, tanh is the hyperbolic tangent, and ⊗, ⊕, and tanh are componentwise operations

new earthquakes from previous history; see, for example, Panakkat and Adeli (2009),
Wang et al. (2017). A RNN is similar in nature to a FFNN, and is frequently mixed with FF layers, but it recycles some output at a given time or event for the next time(s) or event(s). RNNs may be classified according to the number of related input–output instances as one-to-one, one-to-many (one input instance to many output instances), many-to-one (e.g., classifying a voice or determining the location of an earthquake), and many-to-many (translation into a foreign language); see Fig. 1.16. A frequent ingredient in RNNs are "gates" (e.g., in Long Short-Term Memory (LSTM) networks, see Fig. 1.17) to decide which data is introduced, output, or forgotten.
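A literal transcription of Eqs. (1.48)–(1.50) into code (plain NumPy; the weight matrices, biases, and input sequence are hypothetical and randomly initialized, since the text does not specify them) helps to see how the history variable is recycled from one event to the next:

```python
# Minimal sketch: a simple recurrent cell following Eqs. (1.48)-(1.50)
import numpy as np

rng = np.random.default_rng(9)
n_h, n_x = 4, 3                                   # sizes of history and input vectors
W_h = rng.normal(scale=0.5, size=(n_h, n_h))      # recurrent weights
W_x = rng.normal(scale=0.5, size=(n_h, n_x))      # input weights
b = np.zeros(n_h)

h = np.zeros(n_h)                                 # initial history variables
sequence = rng.normal(size=(5, n_x))              # hypothetical sequence of 5 input events
for t, x in enumerate(sequence):
    o = np.tanh(W_h @ h + W_x @ x + b)            # output at event t, Eq. (1.49)
    h = np.tanh(W_h @ h + W_x @ x + b)            # updated history for event t+1, Eq. (1.48)
    y = np.tanh(W_h @ h + b)                      # prediction from the history, Eq. (1.50)
    print(f"event {t}: y = {np.round(y, 3)}")
```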

1.2.2.11 Generative Adversarial Networks (GAN)

A Generative Adversarial Network (GAN) (Goodfellow et al. 2020) is a type of ML based on game theory (a zero-sum game where one agent's benefit is the other agent's loss) with the purpose of learning the probability distribution of the set of training samples (i.e. to solve the generative modeling problem). Although different algorithms have been presented within the GAN paradigm, most are based on NN agents, consisting of a generative NN and a discriminative NN. These NNs have opposite targets.
The generative NN tries to fool the discriminative NN, whereas the discriminative
NN tries to distinguish original (true) data from generated data presented by the
generative NN. With successive events, both NNs learn—the generative NN learns
how to fool the other NN, and the discriminative NN how not to be fooled. The type
of NN depends on the problem at hand. For example when distinguishing images, a
CNN is typically used. In this case, for example, in the falsification of photographs
(deepfake Yadav and Salmani 2019), several images of a person are presented and the
discriminator has to distinguish if they are actual pictures or manufactured photos.
This technology is used to generate fake videos, and to detect them (Duarte et al.
2019; Yu et al. 2022) and is used in CAE tasks like the reconstruction of turbulent

velocity fields (by comparing images) (Deng et al. 2019). GANs are also used in the
generation of compliant designs, for example in the aeronautical industry (Shu et al.
2020), and also to solve differential equations (Yang et al. 2020; Randle et al. 2020).
A recent overview of GANs may be found in Aggarwal et al. (2021).

1.2.2.12 Ensemble Learning

While NNs may bring accurate predictions through extensive training, obtaining
such predictions may not be computationally efficient. Ensemble learning consists
of employing many low-accuracy but efficient methods to obtain a better prediction
through a sort of averaging (or voting). Following our seismic vulnerability example,
it would be like asking several experts to give a fast opinion (for example just showing
them a photograph) about the vulnerability of a structure or a site, instead of asking
one of them to perform a detailed study of the structure (Giacinto et al. 1997; Tang
et al. 2022). The methods used may be, for example, shallow NNs and decision trees.
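A compact illustration of this voting idea (using scikit-learn's voting ensemble over a few deliberately simple learners; the dataset is again hypothetical) is:

```python
# Minimal sketch: ensemble of weak learners via majority voting
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(10)
X = rng.normal(size=(300, 4))
y = (X[:, 0] + 0.5 * X[:, 1] ** 2 > 0.5).astype(int)     # hypothetical labels

ensemble = VotingClassifier(estimators=[
    ("tree", DecisionTreeClassifier(max_depth=3)),
    ("mlp", MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000)),
    ("logreg", LogisticRegression()),
], voting="hard")
ensemble.fit(X, y)
print("training accuracy of the ensemble:", ensemble.score(X, y))
```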

1.3 Constraining to, and Incorporating Physics in, Data-Driven Methods

ML usually gives no insight into the physics of the problem. The classical proce-
dures are considered “black boxes”, with inherent positive (McCoy et al. 2022) and
negative (Gabel et al. 2014) attributes. While these black boxes are useful in applica-
tions to solve classical fuzzy problems where they have been extensively applied in
economy, image or speech recognition, pattern recognition, etc. they have inherently
several drawbacks regarding use in mechanical engineering and applied sciences.
The first drawback is the large amount of data they require to yield relevant predic-
tions. The second one is the lack of fulfillment of basic physics principles (e.g., the
laws of thermodynamics). The third one is the lack of guarantees in the optimal-
ity or uniqueness of the prediction, or even guarantees in the reasonableness of the
predicted response. The fourth one is the computational cost, if including training,
when compared using classical methods. Although once trained, the use may be
much faster than many classical methods. Probably, the most important drawback is
the lack of physical insight into the problem, because human learning is complex and
needs a detailed understanding of the problem to seek creative solutions to unsolved
problems. Indeed, in contrast to “unexplainable” AI, now also eXplainable Artificial
Intelligence (XAI) is being advocated (Arrieta et al. 2020).
ML may be a good avenue to obtain engineering solutions, but to yield valuable (and reliable) scientific answers, physics principles need to be incorporated in
the overall procedure. To this end, the predictions and learning of the previously
overviewed methods, or other more elaborated ones, should be restricted to solution
subsets that do fulfill all the basic principles. That is, conservation of energy, of linear

momentum, etc. should be fulfilled. When doing so, we use data-driven physics-
based machine learning (or modeling) (Ströfer et al. 2018), or “gray-box” modeling
(Liu et al. 2021; Asgari et al. 2021; Regazzoni et al. 2020; Rogers et al. 2017). The
simplest and probably most used method to impose such principles (an imposition
called “whitening” or “bleaching” Yáñez-Márquez 2020) is the use of penalties and
Lagrange multipliers in the cost function (Dener et al. 2020; Borkowski et al. 2022;
Rao et al. 2021; Soize and Ghanem 2020), but there are many options and procedures
to incorporate physics either in the data or in the learning (Karpatne et al. 2017). The
resulting methods and disciplines which mix data science and physical equations are
often referred to as Physics Based Data Science (PBDS), Physics-Informed Data Sci-
ence (PIDS), Physics-Informed Machine Learning (PIML) (Karniadakis et al. 2021;
Kashinath et al. 2021), Physics Guided Machine Learning (PGML) (Pawar et al.
2021; Rai and Sahu 2020), or Data-Based Physics-Informed Engineering (DBPIE).
In a nutshell, data-based physically informed ML allows for the use of data science
methods without most of the shortcomings of physics-uninformed methods. Namely,
we do not need much data (Karniadakis et al. 2021), solutions are often meaningful,
the results are more interpretable, the methods much more efficient, and the number
of meaningless spurious solutions is substantially smaller. The methods are no longer
a sophisticated interpolation but can give predictions outside the domain given by the
training data. In essence, we incorporate the knowledge acquired in the last centuries.
In PBDS, meaningful internal variables play a key role. In classical engineer-
ing modeling, as in constitutive modeling, variables are either external (position,
velocity, and temperature) or internal (plastic or viscous deformations, damage, and
deformation history). The external variables are observable (common to all methods),
whereas the internal variables, being non-observable, are usually based on assump-
tions to describe some internal state. Here, a usual difference with ML methods is that
a physical meaning is typically assigned to internal variables in classical methods, but
for example when using NNs, internal variables (e.g., those in hidden layers) have
typically no physical interpretation. However, the sought solution of the problem
relates external variables both through physical principles or laws and through state
equations. To link both physical principles and state equations, an inherent physical
meaning is therefore best given (or sought) for the internal ML variables (Carleo et al.
2019; Vassallo et al. 2021). Physical principles are theoretical, of general validity, and
unquestioned for the problem at hand (e.g., mass, energy, momentum conservation,
and Clausius-Duhem inequality), whereas state equations are a result of assumptions
and observations at the considered scales, leading to equations tied to some condi-
tions, assumptions, and simplifications of sometimes questionable generality and of
more phenomenological nature.
In essence, the possible ML solutions obtained from state equations must be
restricted to those that fulfill the basic physical principles, constituting the physically
viable solution manifold, and that is often facilitated by the proper selection of
the structure of the ML method and the involved internal variables. These physical
constraints may be incorporated in ML procedures in different ways, depending
on the analysis and the ML method used, as we briefly discuss below (see also an
example in Ayensa Jiménez 2022).

1.3.1 Incorporating Physics in, and Learning Physics From, the Dataset

An objective may be to discover a hidden physical structure in data or physical relations in data (Chinesta et al. 2020). One purpose may be to reduce the dimension of the problem by discovering relations in data that lead to the reduction of complexity
of the problem by discovering relations in data that lead to the reduction of complexity
(Alizadeh et al. 2020; Aletti et al. 2015). This is similar to calculating dynamical
modes of displacements (Bathe and Wilson 1973; Bathe 2006) or to discover the
invariants when relating strain components in hyperelasticity (Weiss et al. 1996;
Bonet and Wood 1997). Another objective may be to generate surrogate models
(Bird et al. 2021; Straus and Skogestad 2017; Jansson et al. 2003; Liu et al. 2021)
to discover which variables have little relevance to the physical phenomenon, or to quantify uncertainty in data (Chan and Elsheikh 2018; Trinchero et al. 2018;
Abdar et al. 2021; Zhu et al. 2019). Learning physics from data is in essence a data
mining approach (Bock et al. 2019; Kamath 2001; Fischer et al. 2006). Of course,
this approach is always followed in classical analysis when establishing analytical
models, for example when neglecting time effects for quasi-stationary problems,
or when reducing the dimension of 3D problems to plane stress or plane strain
conditions. However, ML seeks an unbiased automatic approach to the solution of a
problem.

1.3.2 Incorporating Physics in the Design of a ML Method

A natural possibility to incorporate physics in the design of the ML method is to impose some equations, in some general form, onto the method, and the purpose is
to learn some of the freedom allowed by the equations (Tartakovsky et al. 2018). That
is the case when learning material parameters (typical in Materials Science informat-
ics, Agrawal and Choudhary 2016; Vivanco-Benavides et al. 2022; Stoll and Benner
2021), selecting specific functions from possibilities (e.g., selecting hardening mod-
els or hyperelastic models from a library of functions, Flaschel et al. 2021, 2022),
or learning corrections of models (e.g., deviations of the “model” from reality).
Physics in the design of the ML procedure may also be incorporated by imposing
some specific meaning to the hidden variables (introducing physically meaningful
internal variables as the backstress in plasticity) or the structure (as the specific
existence and dependencies of variables in the yield function) (Ling et al. 2016).
Doing so, the resulting learned relations may be better interpreted and will be in
compliance with our previous knowledge (Abueidda et al. 2021; Miyazawa et al.
2019; Zhan and Li 2021).
A large amount of ML methods in CAE are devoted to learning constitutive (or
state) equations (Leygue et al. 2018), with known conservation principles and kine-
matic relations (equilibrium and compatibility), as well as the boundary conditions
(González et al. 2019b; He et al. 2021). In essence, we can think of a “physical man-

ifold” and a “constitutive manifold”, and we seek the intersection of both for some
given actions or boundary and initial conditions (Ibañez et al. 2018; He et al. 2021;
Ibañez et al. 2017; Nguyen and Keip 2018; Leygue et al. 2018). Autoencoders are a
good tool to reduce complexity and filter noise (He et al. 2021). Other methods are
devoted to inferring the boundary conditions or material constitutive inhomogeneities
(e.g., local damage) assuming that the general form of the constitutive relations is
known (this is a ML approach to the classical inverse problem of damage/defect
detection).
Regarding the determination of the constitutive equations, the procedure may be
purely data-driven (without the explicit representation of a constitutive manifold or
constitutive relations, i.e. “model-free” Kirchdoerfer and Ortiz 2016, 2017; Eggers-
mann et al. 2021a; Conti et al. 2020; Eggersmann et al. 2021b) or manifold-based, in
which case a constitutive manifold is established as a data-based constitutive equa-
tion. In the model-free family, we assume that a large amount of data is known, so
a material data “point” is always close to the physical manifold (see Fig. 1.18 left).
Then, while these techniques may be considered within the ML family, they are more
data-driven deterministic techniques (raw data is employed directly, no constitutive
equation is “learned”). In the manifold-based family (Fig. 1.18, center and right), the
manifold may be explicit (e.g., spline-based, Sussman and Bathe 2009; Crespo and
Montáns 2019; Latorre and Montáns 2017; Crespo et al. 2017; Coelho et al. 2017)
or implicit (discrete or local, e.g., Lopez et al. 2018; Ibañez et al. 2020; Meng et al.
2018; Ibañez et al. 2017). This is a family of methods for which the objective is to
learn the state equations from known (experimental or analytical) data points, prob-
ably subject to some physics requirements (as integrability). Within this approach,
once the manifold is established, the computation of the prediction follows a scheme
very similar to the use of classical methods (Crespo et al. 2017).
Remarkably, in some Manifold Learning approaches, physical requirements
(which may include, or not, physical internal variables, Amores et al. 2020) may
result in a substantial reduction of the experimental data needed (Latorre et al. 2017;
Amores et al. 2021) and of the overall computational effort, resulting also in an

Fig. 1.18 Data-based constitutive modeling. Left: purely data-driven technique, where no constitutive manifold is directly employed. Instead, the closest known data point is located and is used to compute the solution. Center, right: constitutive (e.g., stress–strain) data points are used to compute a constitutive manifold (which may include uncertainty quantification), which is then employed to compute the solution in a classical manner

increased interpretability of the solution. An important class of problems where ML in general, and Manifold Learning approaches in particular, are often applied with
important success, is the generation of surrogate models for multiscale problems
(Peng et al. 2021; Yan et al. 2020; White et al. 2019; El Said and Hallett 2018; Alber
et al. 2019; Brunton and Kutz 2019). The solutions of nonlinear multiscale problems,
in particular those which use Finite Element based computational homogenization
(FE squared) (FE2) techniques, are still very expensive, because at each integration
point, a FE problem representing the Representative Volume Element (RVE) must
be considered (Fish et al. 2021; Arbabi et al. 2020; Fuhg et al. 2021). Then, sur-
rogate models which represent the equivalent behavior at the continuum level are
extremely useful (Fig. 1.19). These surrogate models may be obtained using dif-
ferent techniques. The use of Neural Networks is one option (Wang and Sun 2018;
Wang et al. 2020). Then the dataset for the training is obtained from repeated off-line
simulations with different loading and boundary conditions at different deformation
levels and with different loading histories (Logarzo et al. 2021). Another option is to
use surrogate models based on the equivalence of physical quantities as stored and
dissipated energies (Crespo et al. 2020; Miñano and Montáns 2018). Reduced Order
Methods are also important, especially in nonlinear path-dependent procedures to
determine the main internal variables or simplest representation giving sufficient
accuracy (Singh et al. 2017; Rocha et al. 2020). An important aspect in surrogate
modeling is the possibility of inversion of the map (Haghighat et al. 2021; Raissi
et al. 2019; Haghighat and Juanes 2021), which is crucial when prediction is not the
main purpose of the machine learning procedure but the main objective is to learn
about the material or its spatial distribution. The use of autoencoders can be effective
if decompression is fundamental in the process (Kim et al. 2021; Bastek et al. 2022;
Xu et al. 2022; Jung et al. 2020).

Fig. 1.19 Surrogate modeling representing the micromechanics. Examples are numerical manifolds or Neural Networks

1.3.3 Data Assimilation and Correction Methods

The use of ML models, as when using any model (including classical analytical mod-
els; see, for example, Bathe 2006), may result in a significant error in the prediction of
the actual physical response. This error may be produced either by insufficient data (or
insufficient quality of the data because of noise or lack of completeness), or by inac-
curacy of the model (e.g., due to too few layers in a NN or erroneous or oversimplify-
ing assumptions) (Haik et al. 2021). Then problems are how to incorporate new data
(labeled or unlabeled) into the model (Buizza et al. 2022), how to enrich the model
to improve the predictions (Singh et al. 2017), and how to augment physical models
with machine-learned bias (Volpiani et al. 2021) (hybrid models). These problems are
typically encountered in dynamics (Muthali et al. 2021), and the solutions are often
similar to those employed in control theory (Rubio et al. 2021), as the use of Kalman
methods (Zhang et al. 2020). Machine learning techniques may be used for self-
learning complex physical phenomena as the sloshing of fluids (Moya et al. 2020).
In essence, the proposal here is to assume that there is a model-predicted response
ymodel and a true (say “experimental”) response yexp (Moya et al. 2022). The differ-
ence is the error to be corrected, namely ycorr = yexp − ymodel . This error is corrected
in further predictions by assuming that there is an indeterminacy either in the input
data (statistical error) or in the model (some unknown variables that are not being
considered). Note that the statistical error case is conceptually similar to the quantifi-
cation of uncertainty. In case the model needs corrections, some formalism may be
employed to introduce physics corrections to learned models. For example, correcting
dissipative behavior in assumed (hyper)elastic behavior (or vice versa). In case there
are some indeterminacies in the model, we can assume that the model is of the form

y = f (x; w, ω) (1.51)

where the w are the parameters determined previously (e.g., during the usual model
learning process and now fixed) and ω are the parameters correcting the model by
minimizing the error. This model correction process using new data is called data
assimilation. In Dynamic Data-Driven Application Systems (DDDAS), the concepts
of Digital Twins and Hybrid Twins are employed. A Digital Twin (Glaessgen and
Stargel 2012) is basically a virtual (sometimes comprehensive) model which is used
as a replication of a system in real life. For example, a Formula-1 simulator (Mayani et al. 2018) or a spacecraft simulator (Ye et al. 2020; Wang 2020) may be considered a Digital Twin (Luo et al. 2020). A Digital Twin serves as a platform to try new
solutions when it is difficult or expensive to try them in the actual physical system.
Digital Twins are increasingly used in industry in many fields (Bhatti et al. 2021;
Garg and Panigrahi 2021; Burov and Burova 2020). This virtual platform may con-
tain classical analytical models, data-driven models, or a combination of both (which
is currently very usual in complex systems). The concept of Hybrid Twin (Chinesta
et al. 2020) (or self-learning digital twin, Moya et al. 2020) is a step forward, which
mixes the virtual/digital twin model with model order reductions and parametrized
solutions. The purpose is to have a twin in real time, which may be used to predict

the behavior of a system in advance and correct the system (Moya et al. 2022) or
take any other measure; that is, in essence to control a complex physical system. The
dynamic equation of the Hybrid Twin is

Ẋ(t; μ) = A(X, t; μ) + B(X, t) + C(t) + R(t) (1.52)

where the μ are the model parameters, A(X, t; μ) is the (possibly analytical) model
contribution given those parameters (a linear model would be A(μ)X) (Sancarlos
et al. 2021), B(X, t) is a data-based correction to the model (a continuous update
from measurements), C(t) are the external actions, and R(t) is the (unbiased and
unpredictable) noise. We use the word “hybrid” (Champaney et al. 2022) because
analytical and data-based approaches are employed. Hybrid Twins have been applied
in various fields, for example in simulating acoustic resonators (Martín et al. 2020).

1.3.4 ML Methods Designed to Learn Physics

A different objective from incorporating physics in the ML method is to use a ML method to learn physics. One example would be to learn constitutive equations with-
out prior (or with minimal) assumptions—a case that is similar to those discussed
above but, for example, without neglecting a priori the influence of some terms or
without assuming the nature of the constitutive equation (for example, not assuming
elasticity, plasticity, or other). Another example is to learn new physical or fundamen-
tal evolution equations in nature. A successful (and quite simple) case is the Sparse
Identification of Physical Systems, in particular the Sparse Identification of Nonlin-
ear Dynamics (SINDy) (Brunton et al. 2016; Rudy et al. 2017). In this approach, the
nonlinear problem
Ẋ = A(X) (1.53)

is re-written as

Ẋ = Ξ Θ(X)        (1.54)

where Ξ is a sparse matrix of dynamical coefficients and Θ(X) contains a library of functions evaluated at X. In the Lorenz System shown in Fig. 1.20 (Brunton et al. 2016), Θ(X) involves a set of nonlinear polynomial combinations of the components
of X. The purpose here is to obtain the possibly simplest yet accurate description
(the parsimonious model) in terms of the expansion functions, and this is performed
by the technique of sparse regression, which promotes sparsity in underdetermined
least squares regression by replacing the norm-2 Tikhonov regularization by a
norm-1 penalization (Tibshirani 1996), although in Brunton et al. (2016) the authors
used a slightly different technique. The optimal penalty may be obtained by minimiz-
ing a cross-validation error (i.e. the solution which is accurate but avoids overfitting).
The method has been applied to a variety of physics problems to determine their dif-
ferential equations (Rudy et al. 2017). Similar approaches are Physics-Informed
Spline Learning (PiSL) (Sun et al. 2021), which represents an improvement for data

Fig. 1.20 Sparse Identification of Nonlinear Dynamics. Case of Lorenz System. Reproduced from
Brunton et al. (2016)

representation allowing for explicit derivatives and uses alternating direction opti-
mization with adaptive Sequential Threshold Ridge regression (STRidge) (Rudy et al.
2017) for promoting sparsity, and also more classical genetic and symbolic regression
procedures (Searson 2009; Schmidt and Lipson 2009, 2010). An overview of these
techniques and others may be found in Brunton and Kutz (2022); see also Zhang and
Liu (2021) for a progressive approach for considering uncertainties.
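The core of the SINDy idea fits in a few lines; the sketch below (NumPy, using a simple sequential-threshold least-squares loop in the spirit of Brunton et al. (2016), with hypothetical data generated from a known one-degree-of-freedom system purely for illustration) recovers a sparse coefficient vector over a polynomial library:

```python
# Minimal sketch: sparse identification of dynamics (SINDy-like) with threshold least squares
import numpy as np

rng = np.random.default_rng(11)
# Hypothetical data: x_dot = -0.5*x + 2.0*x^3, sampled with a little noise
x = np.linspace(-2.0, 2.0, 400)
x_dot = -0.5 * x + 2.0 * x**3 + 0.01 * rng.normal(size=x.size)

# Library Theta(X) of candidate functions (cf. Eq. (1.54))
Theta = np.column_stack([np.ones_like(x), x, x**2, x**3, x**4])
names = ["1", "x", "x^2", "x^3", "x^4"]

# Sequential threshold least squares: regress, zero-out small coefficients, repeat
Xi = np.linalg.lstsq(Theta, x_dot, rcond=None)[0]
threshold = 0.1
for _ in range(10):
    small = np.abs(Xi) < threshold
    Xi[small] = 0.0
    big = ~small
    if big.any():
        Xi[big] = np.linalg.lstsq(Theta[:, big], x_dot, rcond=None)[0]

# Expect approximately {'x': -0.5, 'x^3': 2.0}
print({n: round(float(c), 3) for n, c in zip(names, Xi) if c != 0.0})
```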
These approaches, as the SINDy type, can trivially address the correction given
by an imperfect modeling (i.e. the Hybrid Twin). It simply suffices to consider a
correction in Eq. (1.53)
Ẋ − A(X) = B(X) (1.55)

where B(X) is the measured discrepancy to be corrected between the results obtained
from the inexact model and the experimental results. As performed in mathematics
and physics, the key for simplification and possible linearization of a complex prob-
lem consists of finding a proper (possibly reduced) space of (possibly transformed)
input variables to re-write the problem. As mentioned, NNs, in particular autoen-
coders, can be used to find the space, to which, thereafter, a SINDy approach may
be applied to create a Digital or Hybrid Twin (Champion et al. 2019). These mixed
NN approaches have also been employed in multiscale physics transferring learning
through scales by increasingly deep and wide NNs (Liu et al. 2020), also employing
CNNs (Liu et al. 2022). Of course, Dynamic Mode Decomposition (DMD) (Schmid
2010; Tu 2013; Schmid 2011; Jovanović et al. 2014; Demo et al. 2018), a procedure to
determine coupled spatio-temporal modes for nonlinear problems based on Koopman
(composition operator) theory (Williams et al. 2015), is also used for incorporating data

into physical systems, or determining the physical system equations themselves. The idea is to obtain two sets ("snapshots") of spatial measurements separated by a given Δt, namely ^tX and ^{t+Δt}X. Then, the eigenvectors of A = ^{t+Δt}X (^tX)^+, where (^tX)^+ is the pseudoinverse, are the best regressors to the linear model, that is, the minimum-squares best fit of the nonlinear model, compatible with the snapshots. In practice, the A matrix is usually not computed because working with the SVD of X is more efficient (Proctor et al. 2016).
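A bare-bones DMD computation (NumPy, with a hypothetical snapshot matrix built from a damped travelling wave; the SVD-based formulation below follows the spirit of the exact DMD described in the references) can be sketched as:

```python
# Minimal sketch: Dynamic Mode Decomposition of a snapshot sequence via the SVD
import numpy as np

x = np.linspace(0.0, 10.0, 200)
t = np.linspace(0.0, 4.0 * np.pi, 101)
# Hypothetical data: a damped travelling wave sampled on a space-time grid
D = np.array([np.exp(-0.05 * tk) * np.sin(x - tk) for tk in t]).T   # (space, time)

X1, X2 = D[:, :-1], D[:, 1:]                  # snapshots at t and t + dt
U, s, Vh = np.linalg.svd(X1, full_matrices=False)
r = 10                                        # truncation rank
Ur, Sr, Vr = U[:, :r], np.diag(s[:r]), Vh[:r, :].conj().T

A_tilde = Ur.conj().T @ X2 @ Vr @ np.linalg.inv(Sr)   # reduced linear operator
eigvals, W = np.linalg.eig(A_tilde)
modes = X2 @ Vr @ np.linalg.inv(Sr) @ W               # DMD modes
print("leading DMD eigenvalues (near the unit circle for lightly damped dynamics):",
      np.round(eigvals[:3], 3))
```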
Other techniques to discover physical relations (or nonlinear differential equa-
tions), as well as simultaneously obtain physical parameters and fields, are physics-
informed NN (PINN) (Raissi and Karniadakis 2018; Raissi et al. 2019; Pang et al.
2019; Yang et al. 2021). For example, using neural networks, the viscosity, the den-
sity, and the pressure, with the velocity field in time may be obtained assuming the
Navier–Stokes equations as background and employing a NN as the learning engine
to match snapshots. Moreover, these methods may be combined with time integrators
for obtaining the nonlinear parameters of any differential equation, including higher
derivatives, just from discretized experimental snapshots (Meng et al. 2020; Zhang
et al. 2020). Other applications include inverse problems in discretized conservative
settings (Jagtap et al. 2020).
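As a very small illustration of the PINN idea (here with PyTorch and a deliberately simple first-order ODE, du/dx = −u with u(0) = 1, chosen only to keep the sketch short; real applications use PDEs such as Navier–Stokes as described above), the residual of the differential equation is added to the loss through automatic differentiation:

```python
# Minimal sketch: a physics-informed neural network for du/dx = -u, u(0) = 1
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(1, 20), nn.Tanh(),
                    nn.Linear(20, 20), nn.Tanh(),
                    nn.Linear(20, 1))
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

for epoch in range(3000):
    x = torch.rand(64, 1, requires_grad=True)                  # collocation points in [0, 1]
    u = net(x)
    du_dx = torch.autograd.grad(u, x, grad_outputs=torch.ones_like(u),
                                create_graph=True)[0]
    residual = du_dx + u                                       # ODE residual du/dx + u = 0
    x0 = torch.zeros(1, 1)
    loss = (residual ** 2).mean() + (net(x0) - 1.0).pow(2).mean()   # physics + boundary terms
    optimizer.zero_grad(); loss.backward(); optimizer.step()

x_test = torch.tensor([[1.0]])
print("PINN u(1) =", float(net(x_test)),
      " exact exp(-1) =", float(torch.exp(torch.tensor(-1.0))))
```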

1.3.4.1 Deep Operator Networks

While it is very well known that the so-called universal approximation theorem
guarantees that a neural network can approximate any continuous function, it is also
possible to approximate continuous operators by means of neural networks (Chen
and Chen 1995). Based on this fact, Lu and coworkers have proposed the Deep
Operator Networks (DeepONets) (Lu et al. 2021).
A DeepONet typically consists of two different networks working together: one
to encode the input function at a number of measurement locations (the so-called
branch net) and a second one (the trunk net) to encode the locations for the output
functions. Assume that we wish to characterize an operator F : X → Y,
with X, Y two topological spaces. For any function x ∈ X , this operator produces
G = F(x), the output function. For any point y in the domain of F(x), G(y) is a real
number. A DeepONet thus learns from pairs (x, y) to produce the operator. However,
for an efficient training, the input function x is sampled at discrete spatial locations.
In some examples, DeepONets showed very small generalization errors and even
exponential error convergence with respect to the training dataset size. This is how-
ever not yet fully understood. DeepONets have been applied, for example, to predict
crack paths in brittle materials (Goswami et al. 2022), instabilities in boundary layers
(Di Leoni et al. 2021), and the response of dynamical systems subjected to stochastic
loadings (Garg et al. 2022).
Recently, DeepONets have been generalized by parameterizing the integral kernel
in Fourier space, giving rise to the so-called Fourier Neural Operators (Li et al. 2020).
These networks have also gained a high popularity, and have been applied to weather
forecasting, for instance (Pathak et al. 2022).

1.3.4.2 Neural Networks Preserving the Physical Structure of the Problem

Within the realm of PIML approaches, a new family of methods has recently been
proposed. The distinctive characteristic is that these new techniques see the super-
vised learning process as a dynamical system as

ż = f(z, t), with z(0) = z0 (1.56)

with z being the set of variables governing the problem. The supervised learning
problem will thus be to establish f in such a way as to reach an accurate description of
the evolution of the variables. By formulating the problem in this way, the analyst can
use the knowledge already available, and established over centuries, on dynamical
systems. For instance, adopting a Hamiltonian perspective on the dynamics and
enforcing f to be of the form
ż = L∇ H (1.57)

where L is the classical (skew-symmetric) symplectic matrix, which ensures that the
learnt dynamics will conserve energy, because it is derived from the Hamiltonian
H . Many recent references have exploited this approach, either in Hamiltonian or
Lagrangian frameworks (Greydanus et al. 2019; Mattheakis et al. 2022; Cranmer
et al. 2020). If the system of interest is dissipative—which is, by far, most frequently
the case—a second potential must be added to the formulation as

ż = L∇ H + M∇ S (1.58)

where S represents the so-called Mathieu potential. To ensure the fulfillment of the
first and second principles of thermodynamics, an additional restriction (the so-called
degeneracy conditions) must be imposed, i.e.

L∇ S = M∇ H = 0 (1.59)

These equations essentially state that entropy has nothing to do with energy conserva-
tion and, in turn, energy potentials have nothing to do with dissipation. The resulting
NN formulations produce predictions that comply with the laws of thermodynamics
(Hernández et al. 2021, 2022).

1.4 Applications of Machine Learning in Computer Aided Engineering

In this section we describe some applications of machine learning in CAE. The main
purpose is to briefly focus on a variety of topics and ML approaches employed in
several fields, but not to give a comprehensive review. Hence, given the vast literature
already available, developed in the last few years, many important works have likely

been omitted. However, even though the field of applications is very broad, the main
ideas fundamental to the techniques are given in the previous sections.

1.4.1 Constitutive Modeling and Multiscale Applications

The main field of application of machine learning techniques in CAE is ML constitutive modeling, both at the continuum scale and for easing multiscale computations.
As previously mentioned, applicable procedures are model-free approaches, data-
driven manifold learning, data-driven model selection and calibration, and surrogate
modeling. Another interesting application of ML, in particular NNs, is to improve
results from coarse FE models without resorting to expensive fine computations,
e.g., “zooming” (Yamaguchi and Okuda 2021). There are several reviews of applica-
tions of ML in constitutive modeling (especially using NNs), in continuum mechanics
(Bock et al. 2019), for soils (Zhang et al. 2021), composites (Zhang and Friedrich
2003; Liu et al. 2021; El Kadi 2006), and material science (Hkdh 1999). An earlier
review of NN applications in computational mechanics in general can also be found
in Yagawa and Okuda (1996). Below we briefly review some applications.

1.4.1.1 Linear and Nonlinear Elasticity

One of the simplest modeling problems and, hence, one of the most explored ones is
the case of elasticity. The linear elastic problem, addressed with model-free data-
driven methods, is analyzed in Kirchdoerfer and Ortiz (2016), Conti et al. (2018), and
even earlier in Wang et al. (2011) for cloths in the animation and design industries.
Data-driven nonlinear elasticity is also analyzed in several works (Conti et al. 2020;
Stainier et al. 2019; Nguyen and Keip 2018), and applied to soft tissues (González
et al. 2020) and foams (Frankel et al. 2022).
In particular, specific data-driven solvers are needed if model-free methods are
employed, and some effort is directed to developing such solvers and data structur-
ing methods for the task (Eggersmann et al. 2021a, b; Platzer et al. 2021). Kernel
regression is also employed (Kanno 2018).
Another common methodology is the use of data-driven constitutive manifolds
(Ibañez et al. 2017), where identification and reduction of the constitutive manifolds
allow for a much more efficient approach. NNs are also used in finite deformation
elasticity (Nguyen-Thanh et al. 2020; Wang et al. 2022).
Remarkably, nonlinear elasticity is one of the cases where physics-informed meth-
ods are important, because true elasticity means integrable path-independent consti-
tutive behavior, i.e. hyperelasticity. Classical ML methods do not enforce integrability
(and hence are not truly elastic). To fulfill this requirement, specific methods are needed
(González et al. 2019b; Chinesta et al. 2020; Hernandez et al. 2021). One possibility is
to posit the state variables and a reduced expression of the hyperelastic stored energy
(which may be termed “interpretable” ML models; Flaschel et al. 2021). Then,
this energy may be modeled, for example, by splines or B-splines. This approach,
based on the Valanis–Landel assumption, was pioneered by Sussman and Bathe for
isotropic polymers (Sussman and Bathe 2009) and extended later for anisotropic
materials (Latorre and Montáns 2013) like soft biological tissues (fascia, Latorre
et al. 2017, skin Romero et al. 2017, heart Latorre and Montáns 2017, muscle Latorre
et al. 2018, Moreno et al. 2020), compressible materials (Crespo et al. 2017), auxetic
foams (Crespo and Montans 2018; Crespo et al. 2020), and composites (Amores
et al. 2021). Polynomials in terms of invariants are also employed, with the coeffi-
cients determined by sparse regression (Flaschel et al. 2021). Another approach is
to select models from a database, and possibly correct them (González et al. 2019a;
Erchiqui and Kandil 2006), or select specific function models for the hyperelastic
stored energy using machine learning methods (e.g., NNs) (Flaschel et al. 2021;
Vlassis et al. 2020; Nguyen-Thanh et al. 2020). In particular, polyconvexity (to guar-
antee stability and global minimizers for the elastic boundary-value problem) may
also be imposed in NN models (Klein et al. 2022). Anisotropy in hyperelasticity may
be learned from data with NNs (Fuhg et al. 2022a).
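As a minimal sketch of this idea, the example below represents an isotropic stored energy in terms of invariants with a small NN and obtains the stresses by automatic differentiation of that energy, so the learnt response is integrable (truly elastic) by construction. The architecture and data are illustrative assumptions, not a specific published model, and further ingredients (e.g., input-convex architectures or normalization of the energy and stress at the reference configuration) would be needed in practice.

```python
import torch
import torch.nn as nn

class NeuralEnergy(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(3, hidden), nn.Softplus(),
                                 nn.Linear(hidden, hidden), nn.Softplus(),
                                 nn.Linear(hidden, 1))

    def forward(self, F):
        # F: batch of deformation gradients, shape (N, 3, 3)
        F = F.requires_grad_(True)
        C = F.transpose(1, 2) @ F                                   # right Cauchy-Green tensor
        I1 = C.diagonal(dim1=1, dim2=2).sum(-1)
        I2 = 0.5 * (I1 ** 2 - (C @ C).diagonal(dim1=1, dim2=2).sum(-1))
        J = torch.linalg.det(F)
        W = self.net(torch.stack([I1, I2, J], dim=1)).squeeze(-1)   # stored energy
        # first Piola-Kirchhoff stress P = dW/dF by automatic differentiation
        P = torch.autograd.grad(W.sum(), F, create_graph=True)[0]
        return W, P

# usage: fit (F, P) pairs from experiments or RVE simulations (hypothetical data)
model = NeuralEnergy()
F = torch.eye(3).repeat(8, 1, 1) + 0.05 * torch.randn(8, 3, 3)
W, P = model(F)
loss = (P ** 2).mean()        # replace with the misfit to measured stresses
loss.backward()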
In material datasets, noise and outliers may be a relevant issue, both regard-
ing accuracy and their promotion of overfitting. Clustering has been employed in
model-free methods to assign a different relevance depending on the distance to
the solution and using an estimation based on maximum entropy (Kirchdoerfer and
Ortiz 2017). For spline-based constitutive modeling, experimental data reduction
using stability-based penalizations allows for the use of noisy datasets and outliers
avoiding overfitting (Latorre and Montáns 2020).

1.4.1.2 Plasticity, Viscoelasticity, and Damage

ML modeling of nonconservative effects is still at an incipient stage because
path-dependency requires the modeling of latent internal variables and the knowledge
of the previous deformation path (González et al. 2021). However, some early works
using NNs are available (Panagiotopoulos and Waszczyszyn 1999). The amount
of needed data is much larger because the possible deformation paths are infinite,
but there are already a relevant number of works dealing with inelasticity. In the
case of damage, spline-based What-You-Prescribe is What-You-Get (WYPiWYG)
large-strain modeling is available both for isotropic (Miñano and Montáns 2015)
and anisotropic materials (Miñano and Montáns 2018). Crack growth in the aircraft
industry has also been determined with RNNs (Nascimento and Viana 2020). Of
course, ML has been for a long time applied to model fatigue (Lee et al. 2005).
Plasticity is probably the most studied case of the nonconservative behaviors
(Waszczyszyn and Ziemiański 2001). For the case of data-driven (model-free)
“extended experimental constitutive manifolds” including internal variables, the
LArge Time INcrement (LATIN) method (solving by separating the constitutive and
compatibility/equilibrium sets and looking for the intersection) has been successfully
used (Ladevèze et al. 2019); see also Ibañez et al. (2018).
Data-driven model-free techniques in plasticity and viscoelasticity have been
developed using more general history variables (like the history of stresses or strains
as typically pursued for hereditary models) (Eggersmann et al. 2019; Ciftci and Hackl
2022). FFNNs with PODs have been employed to fit several plasticity stress–strain
behaviors. NNs are also used to replace the stress integration approaches in FE anal-
ysis of elastoplastic models (Jang et al. 2021). In general, RNNs (Mozaffar et al.
2019; Borkowski et al. 2022) and CNNs (Abueidda et al. 2021) are a good resort for
predicting plastic paths, and sophisticated LSTM and Gated Recurrent Unit (GRU)
schemes have been reported to give excellent predictions even for complex paths
(Wang et al. 2020).
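As a minimal illustration of this family of surrogates, the sketch below maps strain histories to stress histories with a GRU, the hidden state playing the role of latent internal variables that encode the loading path. The network sizes and the synthetic random loading paths are assumptions; in practice the training data would come from experiments or lower-scale (e.g., FE2 or crystal plasticity) simulations.

```python
import torch
import torch.nn as nn

class StressPathGRU(nn.Module):
    def __init__(self, n_strain=6, n_stress=6, hidden=64):
        super().__init__()
        self.gru = nn.GRU(input_size=n_strain, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_stress)

    def forward(self, strain_path):
        # strain_path: (batch, time steps, 6 strain components)
        h, _ = self.gru(strain_path)
        return self.head(h)            # stress at every step of the loading path

# usage with hypothetical data: 32 random loading paths of 100 increments each
model = StressPathGRU()
eps = torch.cumsum(0.001 * torch.randn(32, 100, 6), dim=1)    # random-walk strain paths
sig_true = torch.randn(32, 100, 6)    # would come from FE2 / crystal plasticity runs
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(100):
    loss = nn.functional.mse_loss(model(eps), sig_true)
    opt.zero_grad()
    loss.backward()
    opt.step()
```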
In materials science, ML is employed to predict the cyclic stress–strain behavior
depending on the microstructure of the material obtained from electron backscat-
ter diffraction (EBSD) analysis. The shape of the yield function can also be deter-
mined by employing sparse regression from a strain map and the cell load in a
non-homogeneous test (such as a plate with holes) (Flaschel et al. 2022).
A mixture of analytical formulas and FFNN machine learning has been employed
to replace the temperature- and rate-dependent term of the Johnson–Cook model
(Li et al. 2019). In plasticity, physics-based modeling is incorporated by assuming
the existence of a stored energy, a plasticity yield function, and a plastic flow rule.
These may be obtained by NNs learned from numerical experiments on polycrystal
databases, resulting in a more robust ML approach than using the classical black-box
ML scheme (Vlassis and Sun 2021). Support Vector Regression (SVR), Gaussian
Process Regression (GPR), and NNs have been used to determine data-driven yield
functions with the convexity constraints required by the theory (Fuhg et al. 2022b).
Automatic hyperparameter (self-)learning has been addressed for NN modeling of
elastoplasticity in Fuchs et al. (2021).

1.4.1.3 Fracture

Fracture phenomena may also be modeled using NNs (Theocaris and Panagiotopou-
los 1993; Seibi and Al-Alawi 1997) and data-driven model-free techniques (Carrara
et al. 2020). Data-driven model extraction from experimental data and knowledge
transfer (Goswami et al. 2020) have been applied to obtain predictions in 3D mod-
els from 2D cases (Liu et al. 2021). Data-driven approaches are used to enhance
fracture paths in simulations of random composites and in model reduction to avoid
high fidelity phase-field computations (Guilleminot and Dolbow 2020). SVMs and
variants have been used for predicting fracture properties, e.g., Yuvaraj et al. (2013),
Kulkrni et al. (2011), and so have been other methods like BNN, Genetic Algorithm
(GA), and hybrid systems; see, for example, Nasiri et al. (2017), Hoshyar et al.
(2020).

1.4.1.4 Multiscale and Composites Modeling

The modeling of complex materials is one of the fields where machine learning
may bring about significant advances in CAE (Peng et al. 2021), in particular when
nonlinear behavior is modeled (Jackson et al. 2019). This is particularly the case when
the macroscopic behavior or the physical properties depend in a complex manner on
a specific microstructure (Fish et al. 2021) or on physics equations and phenomena
only seen at a micro- or smaller scale, as atomistic (Caccin et al. 2015; Kontolati
et al. 2021; Wood et al. 2019), molecular (Xiao et al. 2020), or cellular (Verkhivker
et al. 2020).
ML allows for a simpler implementation of first principles in multiscale simu-
lations (Hong et al. 2021), describing physical macroscopic properties, as also in
chaotic dynamical systems for which the highly nonlinear behavior depends on com-
plex interactions at smaller scales (e.g., weather and climate predictions) (Chattopad-
hyay et al. 2020). Generating surrogate models to reproduce the observed macro-
scopic effects due to complex phenomena at the microscale (Wirtz et al. 2015) is often
only possible through ML and Model Order Reduction (MOR) (Wang et al. 2020;
Yvonnet and He 2007). Even in the simplest cases, ML may substantially reduce
the high computational cost of classical nonlinear FE2 homogenization tech-
niques (Feng et al. 2022; Wu et al. 2020), allowing for real-time simulations (Rocha
et al. 2021). The nonlinear multiscale case is complex because an infinite number of
simulations would be needed for a complete general database. However, a reduced
dataset may be used to develop a numerical constitutive manifold with sufficient
accuracy, e.g., using Numerically EXplicit Potentials (NEXP) (Yvonnet et al. 2013).
Material designs are often obtained from inverse analyses facilitated by parametric
ML surrogate models (Jackson et al. 2019; Haghighat et al. 2021). In particular, ML
may be employed to determine the phase distributions in heterogeneous materials
(Valdés-Alonzo et al. 2022).
The modeling of classical fiber-based and complex composite heterogeneous
materials often requires multiscale approaches (Pathan et al. 2019; Hadden et al.
2015; Kanouté et al. 2009) because modeling of interactions at the continuum level
requires inaccurate assumptions. CNNs are ideal for dealing with the relation of
an unstructured RVE with continuum equivalent properties. In particular, ML may
be used for dealing with stochastic distributions of constituents (Liu et al. 2022).
Modeling of complex properties such as composite phase changes for thermal man-
agement in Li-ion batteries may be performed with CNNs (Kolodziejczyk et al.
2021). Indeed, CNNs can also be used for performing an inverse analysis (Sorini
et al. 2021). In general, many complex properties and effects observed macroscop-
ically, but through effects mainly attributed to the microscale, are often addressed
with different ML techniques, including CNNs, e.g., Field et al. (2021), Nayak et al.
(2022), and Koumoulos et al. (2019).

1.4.1.5 Metamaterials Modeling

Metamaterials are architected materials with inner custom-made structure. With the
current development of 3D printing, metamaterial modeling and design is becoming
an important field (Kadic et al. 2019; Bertoldi et al. 2017; Zadpoor 2016; Barchiesi
et al. 2019) because a material with unique salient properties may be designed ad
libitum allowing for a wide range of applications (Surjadi et al. 2019). Their design
has evolved from the classical optimization-based approach (Sigmund 2009). ML
methods for the design of metamaterials are often used with two objectives. The first
objective is to generate simple surrogate models to accelerate simulations, avoiding
FE modeling down to the very fine scale describing the structure, especially when non-
linearities are important. The second objective is to perform analyses using a meta-
material topology parametrization which allows for an effective metamaterial design
from macroscopic desired properties. Examples of ML approaches for metamaterials
pursuing these two objectives can be found in, e.g., Wu et al. (2020), Fernández et al.
(2022b), Zheng et al. (2020), and Wilt et al. (2020).

1.4.2 Fluid Mechanics Applications

Fluid phenomena and related modeling approaches are very rich, spanning from the
breakup of liquid droplets under different conditions (Krzeczkowski 1980; Roisman
et al. 2018; Liu et al. 2018) to smoke from fires in tunnels (Gannouni and Maad
2016; Wu et al. 2021), emissions from engines (Khurana et al. 2021; Baklacioglu
et al. 2019), flow and wake effects in wind turbines (Clifton et al. 2013; Ti et al.
2020), and free surface flow dynamics (Becker and Teschner 2007; Scardovelli and
Zaleski 1999). The difficulty in obtaining accurate and efficient solutions, especially
when effects at multiple scales are important, has fostered the introduction of ML
techniques. We briefly review some representative works.

1.4.2.1 Turbulence Flow Modeling

The modeling of turbulence is an important aspect in the solution of the Navier–Stokes
equations of fluid flows. Here ML techniques can be of value.
The ML procedures in turbulence often build on the Reynolds Averaging decom-
position, u(x, t) = ū(x) + ũ(x, t), which splits the flow u(x, t) into a time-independent
average component ū(x) and a fluctuating component ũ(x, t) with zero time aver-
age, obtaining the incompressibility conditions ∇ · ū = 0 and ∇ · ũ = 0. Then, the
Navier–Stokes equations are written in terms of the Reynolds stresses ρ ũ ⊗ ũ

∇ū · ū = (1/ρ) ∇ · (−pI + 2μ∇ˢū − ρ ũ ⊗ ũ) (1.60)
for which a turbulence closure model is assumed, e.g., eddy viscosity model or
the more involved k − ε (Gerolymos and Vallet 1996) or Spalart–Allmaras models
(Spalart and Allmaras 1992). In Eq. (1.60), ∇ˢū is the average deviatoric strain-
rate tensor. The framework in Eq. (1.60) gives the two commonly used models: the
Reynolds-Averaged Navier–Stokes (RANS) model, best for steady flows (Speziale
1998; Kalitzin et al. 2005), and the Large Eddy Simulations (LES) model, using a
subgrid-scale model, thus much more expensive computationally, but best used to
predict flow separation and fine turbulence details. RANS closure models have been
explored using ML. For example, the work reported in Zhao et al. (2020) trains a tur-
bulence model for wake mixing using a CFD-driven Gene Expression Programming
(an evolutionary algorithm). Physics-informed ML may also be used for augmenting
turbulence models, in particular to overcome the difficulties of ill-conditioning of the
RANS equations with typical Reynolds stress closures, focusing on improving mean
flow predictions (Wu et al. 2018). Results of using ML to improve accuracy of clo-
sure models are, for example, given in Wackers et al. (2020), Wang et al. (2017). One
of the important problems in modeling turbulence and accelerating full field simula-
tions is to upscale the finer details, e.g., vorticity from the small to the larger scale,
using a lower resolution (grid) analysis. These upscaling procedures may be per-
formed by inserting NN corrections which learn the scale evolution relations, greatly
accelerating the computations by allowing lower resolution (Kochkov et al. 2021).
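As a simple illustration of the quantities involved, the sketch below performs the Reynolds decomposition of a set of velocity snapshots and computes the Reynolds stress tensor appearing in Eq. (1.60), which is the term that a closure model, classical or ML-based, has to supply; the snapshot data are synthetic placeholders.

```python
import numpy as np

rho = 1.0
u = np.random.rand(500, 1000, 3)        # 500 time snapshots, 1000 points, 3 components

u_bar = u.mean(axis=0)                  # time-averaged field ū(x)
u_fluc = u - u_bar                      # fluctuating part ũ(x, t), zero time average

# Reynolds stresses: time average of ρ ũ ⊗ ũ at every point -> shape (1000, 3, 3)
reynolds_stress = rho * np.einsum('tpi,tpj->pij', u_fluc, u_fluc) / u.shape[0]

# a closure (eddy viscosity, k-epsilon, or an ML surrogate) would map mean-flow
# features, e.g. the mean strain rate, to this tensor instead of resolving ũ directly
```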

1.4.2.2 Shock Dynamics

More accurate and faster shock-capturing by NN has been pursued in Stevens and
Colonius (2020), where ML has been applied to improve finite volume methods to
address discontinuous solutions of PDEs. In particular, Weighted Essentially Non-
Oscillatory Neural Network (WENO-NN) approaches estimate the smoothness of
the solution to avoid spurious oscillations while still capturing the shock accurately,
with the ML procedure facilitating the computation of the optimal nonlinear coefficients
of each cell average.

1.4.2.3 Reduced Models for Accelerating Simulations

An important application of ML in fluid dynamics and aerodynamics is the development
of reduced order models. In essence, these models capture the main dominant
coarse flow structures, with fine structures included, and provide a faster, simpler
model for analysis, i.e. a surrogate model similar in concept to those used in multiscale
analysis. As mentioned previously, there are many techniques used for this task, such
as DMD (Schmid et al. 2011; Hemati et al. 2014) or more general POD (Berkooz et al.
1993; Aubry 1991; Rowley 2005), PGD (Dumon et al. 2011; Chinesta et al. 2011),
PCA (Audouze et al. 2009), and SVD (Lorente et al. 2008; Braconnier et al. 2011).
Autoencoders employing different NN types (Kramer 1991; Murata et al. 2020; Xu
and Duraisamy 2020; Maulik et al. 2021) and other nonlinear extensions of the pre-
vious techniques are a widely used approach for dealing with nonlinear cases typical
in fluid dynamics (Gonzalez and Balajewicz 2018). These techniques also frequently
include physics information to guarantee consistency (Erichson et al. 2019).
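For reference, the linear backbone of many of these reduced models, POD obtained from an SVD of a snapshot matrix, can be sketched in a few lines; the snapshot matrix below is a synthetic placeholder, and autoencoder-based approaches replace the linear basis by a nonlinear encoder-decoder pair.

```python
import numpy as np

X = np.random.rand(10000, 200)           # snapshot matrix: 10000 dofs x 200 snapshots
X_mean = X.mean(axis=1, keepdims=True)
U, s, Vt = np.linalg.svd(X - X_mean, full_matrices=False)

k = 10                                   # retain the k most energetic modes
Phi = U[:, :k]                           # POD basis
a = Phi.T @ (X - X_mean)                 # reduced coordinates of every snapshot
X_rom = X_mean + Phi @ a                 # rank-k reconstruction of the snapshots

energy = np.cumsum(s**2) / np.sum(s**2)  # fraction of variance captured vs. k
```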

1.4.3 Structural Mechanics Applications

ML has been used for some time already in structural mechanics, with probably the
most applications in Structural Health Monitoring (SHM) (Farrar and Worden 2012).
ML is applied for the primal identification of structural systems (SSI) (Sirca and
Adeli 2012; Amezquita-Sancheza et al. 2020), in particular of complex or historical
structures, to assess their general and seismic vulnerability (Ruggieri et al. 2021;
Xie et al. 2020) and facilitate subsequent health monitoring (Mishra 2021). Feature
extraction and model reduction are fundamental in these approaches (Rosafalco et al.
2021). Other areas where ML is employed include the control of structures (e.g., active
Tuned Mass Dampers, Yucel et al. 2019; Colherinhas et al. 2019; Etedali and Mollayi
2018) under wind, seismic or crowd actions, or in structural design (Herrada et al.
2017; Sun et al. 2021; Hong et al. 2020; Yuan et al. 2020). We also comment in this
section on the development of novel ML approaches based on ideas used in structural
and finite element analyses.

1.4.3.1 Structural System Identification and Health Monitoring

Structural System Identification (SSI) is key in analyzing the vulnerability of historical
structures in seismic zones (e.g., Italy and Spain) (Domaneschi et al. 2021). It is
also a problem in the assessment of modern structures, since modeling assumptions
may not have been sufficiently accurate (Torky and Ohno 2021). Many classical
approaches based on optimization methods are frequently ill-conditioned or admit
many possible solutions, some of which should be automatically discarded.
Hence, ML is an excellent approach to address SSI, and different algo-
rithms have been employed. For example, SVM (Gui et al. 2017), and in particular
Weighted Least Squares Support Vector Machines (LS-SVM) have been employed
to determine the structural parameters and then identify degradation due to damage
through dynamic response (Tang et al. 2006; Zhang et al. 2007). K-Means and KNNs
are also frequently used in SHM. For example, in Sarmadi and Karamodin (2020)
anomaly detection is performed using the squared Mahalanobis distance
(x − x̄)ᵀ Sₖ⁻¹ (x − x̄) to find the k nearest neighbors in a multivariate one-class k-NN
approach (a minimal sketch is given after this paragraph). The authors applied the
approach to wood and steel bridges, obtaining the smallest misclassification rate among
the ML techniques they compared. Bridge structures
have also been studied using Genetic Algorithms in an unsupervised approach
to detect damage (Silva et al. 2016). Health monitoring of bridges is the focus in
rather early research (e.g., the simple case analyzed in Liu and Sun 1997 through
NNs). Traffic loads (Lee et al. 2002) and ambient vibrations (Avci et al. 2021) are
actions that often require studying the evolution of the mechanical properties. NNs
are typically applied to detect changes in the properties and to suggest possible
explanations of the origin of those changes (Ko and Ni 2005). Essentially all types of
bridges have been studied using ML techniques, namely steel girder bridges (Nick
et al. 2021), reinforced concrete T-bridges (Hasançebi and Dumlupınar 2013), cable
stayed (Arangio and Bontempi 2015) and long suspension bridges (Ni et al. 2020),
truss bridges (Mehrjoo et al. 2008), and arch bridges (Jayasundara et al. 2020). Dif-
ferent types of NN are used (e.g., Bayesian, Arangio and Beck 2012; Li et al. 2020;
Ni et al. 2001, Convolutional, Nguyen et al. 2020; Quqa et al. 2022, Recurrent, Miao
et al. 2023; Miao and Yokota 2022), and the use of other techniques, such as SVM, is
also frequent; see, for example, Alamdari et al. (2017) and Yu et al. (2021).
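A minimal sketch of the one-class k-NN novelty detection scheme based on the squared Mahalanobis distance mentioned above is given below. The number of features, the set sizes, and the 95% threshold are illustrative assumptions; in practice the features would be damage-sensitive quantities (e.g., modal properties) extracted from measured responses.

```python
import numpy as np

def knn_scores(X, ref, k=5):
    """Mean squared Mahalanobis distance of each row of X to its k nearest
    neighbours in the reference (healthy) set ref."""
    S_inv = np.linalg.inv(np.cov(ref, rowvar=False))
    diff = X[:, None, :] - ref[None, :, :]               # (n_X, n_ref, n_features)
    d2 = np.einsum('abi,ij,abj->ab', diff, S_inv, diff)  # pairwise squared distances
    return np.sort(d2, axis=1)[:, :k].mean(axis=1)

healthy = np.random.randn(350, 8)                 # features of the undamaged structure
ref, val = healthy[:300], healthy[300:]           # reference set and threshold set
test = np.vstack([np.random.randn(50, 8),         # healthy test cases
                  np.random.randn(10, 8) + 3.0])  # shifted, possibly damaged cases

threshold = np.quantile(knn_scores(val, ref), 0.95)
flags = knn_scores(test, ref) > threshold         # True where an anomaly is flagged
```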
Apart from bridges and multi-story buildings (González and Zapico 2008; Wang
et al. 2020), there are many other types of structures for which SSI and SHM are
performed employing ML. Dams are important structures whose deterioration and
failure may cause massive destruction; hence visual inspection and monitoring of
displacement cycles are typical actions in SHM of dams. The observations feed
ML algorithms to assess the health of the structure. The estimation of the structural
response from collected data is studied for example in Li et al. (2021b), where
a CNN is used to extract features and a bidirectional gated RNN is employed to
perform transfer learning from long-term dependencies. Similar works addressing
SHM of dams are given in Yuan et al. (2022) and Sevieri and De Falco (2020). A review
may be found in Salazar et al. (2017).
Of course, different outputs may be pursued and the appropriate ML technique is
related to both available data and desired output. For example, NNs have been used
in Kao and Loh (2013), Ranković et al. (2012), Chen et al. (2018), and He et al.
(2022) to monitor radial and lateral displacements in arch dams. Several ML tech-
niques such as Random Forest (RF), Boosted Regression Trees (BRT), NN, SVM,
and MARS are compared in Salazar et al. (2015) in the prediction of dam displace-
ments and of dam leakage. The researchers found that BRT outperforms the most
common data-driven technique employed when considering this problem, namely the
Hydrostatic-Seasonal-Time method (HST), which accounts for the irreversible evo-
lution of the dam response due to the reversible hydrostatic and thermal loads; see also
Salazar et al. (2016). Gravity dams are a different type of structure from arch dams.
Their reliability under flooding, earthquakes, and aging has also been addressed using
ML methods in Hariri-Ardebili and Pourkamali-Anaraki (2018), where kNN, SVM,
and NB have been used in the binary classification of structural results, and a failure
surface is computed as a function of the dimensions of the dam. Related to dam infras-
tructure planning, flooding susceptibility predictions due to rainfall using NB and
Naïve Bayes Trees (NBT) are compared in Khosravi et al. (2019) with three classical
methods (see review of Multicriteria Decision Making (MCDM) in de Brito and Evers
2016) in the field. For tunnel designs and monitoring, we have that the soil is also an
integral part of the structure and is difficult to characterize. The understanding of its
behavior often depends on qualitative observations; it is therefore another field where
machine learning techniques will have an important impact in the future (Jafari 2020).
Wind Turbines (WT) are another important type of structure considered in SHM;
see the review in Ciang et al. (2008). Here, two main components are
typically analyzed: the blades and the gearbox (Wang et al. 2016). SVM is a frequently
used ML technique, and acoustic noise is a source of relevant data for blade moni-
toring (Regan et al. 2016). Deep NNs are also frequently employed when multiple
sources of data are available, in particular CNNs are used to deal with images from
drones (Shihavuddin et al. 2019; Guo et al. 2021). Images are valuable not only in
the detection of overall damage (e.g., determining a damage index value), but also in
determining the location of the damage. This gives an alternative to the placement of
networks of strain sensors (Laflamme et al. 2016). Other WT functional problems,
such as dirt and mud detection in blades to improve maintenance, can be addressed
employing different ML methods; e.g., in Jiménez et al. (2020) k-Nearest Neigh-
bors (k-NN), SVM, LDA, PCA, DT, and an ensemble subspace discriminant method
are employed. Other factors like the presence of ice in cold climates are also impor-
tant. In Jiménez et al. (2019), a ML approach is applied to pattern recognition on
guided ultrasonic waves to detect and classify ice thickness. In this work, different
ML techniques are employed for feature extraction (data reduction into meaningful
features), both linear (autoregressive ML models and PCA) and nonlinear (nonlinear-
AR eXogenous and nonlinear PCA), and then feature selection is performed to avoid
overfitting. A wide range of supervised classifiers of different families (DT, LDA,
QDA, several types of SVM, kNN, and ensembles) were employed and compared,
both in terms of accuracy and efficiency.
Applications of ML can be found also in data preparation, including imputation
techniques to fill missing sensor data (Li et al. 2021a, b). Systems, damage, and
structural responses are assessed employing different variables. Typical variables
are the displacements (building drift), which allow for the determination of mate-
rial and structural geometric properties, for example in reinforced concrete (RC)
columns. This can be achieved through locally weighted LS-SVM (Luo and Paal
2019). Bearing capacities and failure modes of structural components (columns,
beams, shear walls) can also be predicted using ML techniques, in particular when
the classical methods are complex and lack accuracy. For example, in Mangalathu
et al. (2020) several ML methods such as Naïve Bayes, kNN, decision trees, and ran-
dom forests combined with several weighted Boost techniques (similar to ensemble
learning under the assumption that many weak learners make a strong learner) such
as AdaBoost (Adaptive Boost, meaning that new weak learners adapt from mis-
classifications of previous ones) are compared to predict the failure modes (flexural,
diagonal tension or compression, sliding shear) of RC shear walls in seismic events.
Identification of smart structures with nonlinearities, like buildings with magne-
torheological dampers, has been performed through a combination of NN, PCA, and
fuzzy logic (Mohammadzadeh et al. 2015).
In SHM, the integration of data from different types or families of sensors (data
fusion) is an important topic. Data fusion (Hall and Llinas 2001) brings not only
challenges in SHM but also the possibility of more accurate, integral prediction
of the health of the structure (Wu and Jahanshahi 2020). For example, in Vitola
et al. (2017) a data fusion system based on kNN classification was used in SHM.
SHM is most frequently performed through the analysis of the dynamic response of
the structure and comparing vibrational modes using the Modal Assurance Criterion
(MAC) (Ho et al. 2021). However, in the more challenging SSI, many other additional
features are employed, such as typology, age, and images. In SHM, damage detection is also
pursued through the analysis of images. Visual inspection is a long used method for
crack detection in concrete or steel structures, or to determine unusual displacements
and deformations of the overall structure from global structural images. Automatic
processing and damage detection from images obtained from stationary cameras or an
Unmanned Aerial Vehicle (UAV) (Sankarasrinivasan et al. 2015; Reagan et al. 2018)
is currently being performed using ML techniques. A recent review of these image-
based techniques can be found in Dong and Catbas (2021). Another recent review
of ML applications of SHM of civil structures can be found in Flah et al. (2021).
One of the lessons learnt from the available results is that, to improve predic-
tions and robustness, progress is needed in physics-based ML approaches for
SHM. For instance, an improvement may be to combine concrete damage models with envi-
ronmental data, typology, images, etc. to detect damage which may have little impact
on sensor readings (Kralovec and Schagerl 2020) but may result in significant losses.
This issue is also of special relevance in the aircraft industry (Ahmed et al. 2021).

1.4.3.2 Structural Design and Topology Optimization

The design of components and structures is based on creativity and experience
(Málaga-Chuquitaype 2022), so it is also an optimal field for the use of ML proce-
dures, e.g., Adeli and Yeh (1990). ML in the general design of industrial components
is briefly addressed below.
Given the creative nature of structural design, evolutionary algorithms are good
choices. For example in Freischlad and Schnellenbach-Held (2005), linguistic mod-
eling is applied to conceptual design, investigating evolutionary design and opti-
mization of high-rise concrete buildings for lateral load bearing. The process of the
design of a structure using ML from concept to actual structural detailing is dis-
cussed in Chang and Cheng (2020). Different structural systems are conceptually
designed with the aid of ML techniques, including shear walls to sustain earthquakes
(e.g., using GAN, in Lu et al. 2022; Zhao et al. 2022), shell structures (Tam et al.
2020; Zheng et al. 2020), and even the architectural volume (Chang et al. 2021). A
study and proposal of different ML techniques in architectural design can be found
in Tamke et al. (2018).
Of course, one of the main disciplines in structural design is Topology Optimiza-
tion (TO), and ML approaches (a combination coined “learning topology” in Moroni
and Pascali 2021) can be used to develop more robust schemes (Chi et al. 2021;
Muñoz et al. 2022) through tuning numerical parameters (Lynch et al. 2019). For
example in Muñoz et al. (2022), manifold learning approaches such as local linear
embedding (LLE) techniques are employed to extract geometrical modes defined
by the material distribution given by the TO algorithm, facilitating the creation
of new geometries. TO of nonlinear structures is also performed using deep NN
(Abueidda et al. 2020). In order to obtain optimum thermal structures, GANs have
been used to develop non-iterative structural TO (Li et al. 2019). Using ML to develop
a non-iterative TO approach has also been addressed in Yu et al. (2019). A recent
review of ML techniques in TO can be found in Mukherjee et al. (2021).

1.4.4 Machine Learning Approaches Motivated in Structural Mechanics and by Finite Element Concepts

While ML has contributed to CAE and structural design, new ML approaches have
also been developed based on concepts that are traditional in structural analysis and
finite element solutions. For example, one of the ideas is the concept of substructur-
ing, employed in static condensation, Guyan reduction, and Craig–Bampton schemes
(Bathe 2006). In Jokar and Semperlotti (2021) a Finite Element Network Analysis
(FENA) is proposed. The method substitutes the classical finite elements by a library
of “elements” consisting of a Bidirectional Recurrent Neural Network (BRNN). The
BRNN of the elements are trained individually and the training can be computation-
ally costly. Then these trained BRNN are concatenated, and the composite system
needs no further training. The solution is fast, not considering the training, since
in contrast to FE solutions, no system of equations is solved. The method has only
been applied to the analysis of an elastic bar, so the generalization of the idea to the
solution of more complex problems is still an open research task.
The partition of unity used in finite element and meshless methods has been
employed to develop a Finite Element Machine (FEMa) for fast supervised learning
(Pereira et al. 2020). The idea is that each training sample is the center of a Shepard
function, and the training set is treated as a probabilistic manifold. The advantage
is that, as in the case of spline-based approaches, the technique has no parameters.
Compared to several methods, the BPNN, Naïve Bayes, SVM (using both RBF and
sigmoids), RF, DT, etc. the FEMa method was competitive in the eighteen benchmark
datasets typically employed in the literature when analyzing supervised methods.
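A minimal sketch in the spirit of this idea is given below: every training sample is the center of a Shepard (inverse distance weighting) function, and a query point is classified by the weighted vote of all samples. The exponent and the synthetic data are illustrative assumptions, not the exact FEMa formulation.

```python
import numpy as np

def shepard_predict(X_train, y_train, X_query, p=4, eps=1e-12):
    # pairwise distances between query points and training samples
    d = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=-1)
    w = 1.0 / (d ** p + eps)                      # Shepard weights, one per sample
    classes = np.unique(y_train)
    # class votes: sum of the weights of the samples belonging to each class
    votes = np.stack([w[:, y_train == c].sum(axis=1) for c in classes], axis=1)
    return classes[np.argmax(votes, axis=1)]

# toy usage with two synthetic clusters
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 3.0])
y = np.array([0] * 50 + [1] * 50)
print(shepard_predict(X, y, np.array([[0.0, 0.0], [3.0, 3.0]])))
```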
Another interesting approach is the substitution of some procedures of finite
element methods with machine learning approaches. Candidates are material and
general element libraries, creating surrogate material models (discussed above) or
surrogate elements, or patches of elements. This approach follows the substructuring
or multiscale computational homogenization (FE2) idea, but in this case using ML
procedures instead of a RVE finite element mesh. In Capuano and Rimoli (2019),
several possibilities are addressed and applied to nonlinear truss structures and a
(nonlinear) hyperelastic perforated plane strain structure. A similar approach is used
in Yan et al. (2022) for composite shells employing physics-based NNs. In Jung et al.
(2020), finite element matrices passing the patch test are generated from data using a
neural network accounting for some physical constraints, such as vanishing strain energy
under rigid body motions.

1.4.5 Multiphysics Problems

Despite the already mentioned advances in scientific machine learning in several fields,
much less has been achieved for multiphysics problems. This is undoubtedly
due to the youth of the discipline, but there are a number of efforts that deserve
mentioning. For instance, in Alexiadis (2019) a system is developed with the aim
of replicating human physiology. In Alizadeh et al. (2021), a similar approach is
developed for nanofluid flow, while Ren et al. (2020) study hydrogen production.
In the field of multiphysics problems, there exists a particularly appealing
approach to machine learning, namely that of port-Hamiltonian formalisms (Van
Der Schaft et al. 2014). Port-Hamiltonian systems are essentially open systems
that obey a Hamiltonian description of their physics (and thus, are conservative, or
reversible). Their interaction with the environment is made through a forcing term.
If we call z the set of variables governing the problem (z = (q, p), e.g., position and
momentum, for a canonical Hamiltonian system), its evolution in time will be given
by
ż = J∇ H + F (1.61)

where J is the classical (skew-symmetric) symplectic matrix, H is the Hamiltonian
(total energy of the system), and F is the forcing term, which links the port-
Hamiltonian system to other subsystems. This paves the way for an efficient cou-
pling of different systems, possibly governed by different physics. Enforcing this
port-Hamiltonian structure during the learning process, as an inductive bias, ensures
the fulfillment of the conservation of energy in the total system, while allowing
for a proper introduction of dissipative terms in the formulation. This is indeed the
approach followed in Desai et al. (2021); see also Massaroli et al. (2019), Eidnes et al.
(2022), Mattheakis et al. (2022), Sprangers et al. (2014), and Morandin et al. (2022).
A recent review on the progress of these techniques can be found in Cherifi (2020).
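As a simple numerical illustration of the structure of Eq. (1.61), the sketch below couples two canonical subsystems only through their forcing (port) terms. The linear coupling law and the explicit Euler integrator are illustrative assumptions; in practice, structure-preserving integrators and learnt (NN-based) Hamiltonians would typically be employed.

```python
import numpy as np

J = np.array([[0.0, 1.0], [-1.0, 0.0]])            # canonical symplectic matrix

def gradH(z, k=1.0, m=1.0):
    # gradient of H = k q^2 / 2 + p^2 / (2 m) with respect to z = (q, p)
    q, p = z
    return np.array([k * q, p / m])

z1, z2 = np.array([1.0, 0.0]), np.array([0.0, 0.0])
k_c, dt = 0.5, 1e-3
for _ in range(10000):
    F1 = np.array([0.0, -k_c * (z1[0] - z2[0])])    # port force from subsystem 2 on 1
    F2 = np.array([0.0, -k_c * (z2[0] - z1[0])])    # reciprocal port force on 2
    z1 = z1 + dt * (J @ gradH(z1) + F1)             # each subsystem follows Eq. (1.61)
    z2 = z2 + dt * (J @ gradH(z2) + F2)
```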

1.4.6 Machine Learning in Manufacturing and Design

ML techniques have been applied to classical manufacturing since their early con-
ception, and are now important in Additive Manufacturing (AM). Furthermore, ML
is currently being applied to the complete product chain, from conceptual design
to the manufacturing process. Below, we review ML applications in classical and
additive manufacturing, and in automated design.

1.4.6.1 Classical Manufacturing

In manufacturing, plasticity plays a fundamental role. Machine learning approaches
to solve inelastic problems have already been addressed in Sect. 1.4.1.2 above.
Research activities prior to 1965 are compiled in an interesting review by Monostori
et al. (1996). More recently, another review compiled the works in different research
endeavors within the field of manufacturing (Pham and Afify 2005).
Of course, ML is a natural ally of the Industry 4.0 paradigm (the fourth industrial
revolution), in which sensors are ubiquitous and data streams provide the systems
with valuable information. This synergistic alliance is explored in Raj et al. (2021).
In Sharp et al. (2018), valuable research is reported in which Natural Language Pro-
cessing (NLP) was applied to documentation from 2005 to 2017 in the field of smart
manufacturing. The survey analyzes aspects ranging from decision support (prior to
the moment, a piece was manufactured), plant and operations health management
(for the manufacturing process itself), data management, as a consequence of the
vast amount of information produced by Internet of Things (IoT) devices installed in
modern plants, or lifecycle management, for instance. The survey concludes that ML-
based techniques are present in the literature (at the moment of publication, 2018) for
product life cycle management. While many of these ML techniques are inherently
designed to perform prognosis (i.e., to predict several aspects related to manufactur-
ing), in Ademujimi et al. (2017) a review is given of literature that employs ML to
perform diagnosis of manufacturing processes.

1.4.6.2 Additive Manufacturing

Due to its inherent technological complexity and our still limited comprehension of
many of the physical processes taking place, additive manufacturing (AM) has been
an active field of research in machine learning. The interested reader can consult
different reviews of the state of the art (Razvi et al. 2019; Meng et al. 2020; Jin et al.
2020; Wang et al. 2020). One of the fields where ML will be very important, and
that is tied to topology optimization, is 3D printing. AM, in particular 3D printing,
represents a revolution in component design and manufacturing because it allows
for infinite possibilities and largely reduced manufacturing difficulties. Moreover,
these technologies are reaching resolutions at the microscale, so a component may
be designed and manufactured with differently designed structures at the mesoscale
(establishing metamaterials), obtaining unprecedented material properties at the con-
tinuum scale thus widening the design space (Barchiesi et al. 2019; Zadpoor 2016).
There are many different AM procedures, like Fused Deposition Modeling (FDM),
Selective Laser Melting (SLM), Direct Energy Deposition (DED), Electron Beam
Melting (EBM), Binder Jetting, etc. While additive manufacturing offers huge pos-
sibilities, it also results in new associated challenges in multiple aspects, from the
detection of porosity (important in the characterization of the printed material) to the
recognition of defects (melting, microstructural, and geometrical), to the character-
ization of the complex anisotropic behavior, which depends on multiple parameters
of the manufacturing process (e.g., laser power in Selected Laser Melting, direction
of printing, powder and printing conditions). Both the design using AM and the error
correction or compensation (Omairi and Ismail 2021) are typical objectives in the
application of ML to AM. Different ML techniques are employed, with SVM being one
of the most used schemes. For example, SVM is employed for identifying defective
parts from images in FDM (Delli and Chang 2018), for detecting geometrical defects
in SLM-made components (zur Jacobsmühlen et al. 2015; Gobert et al. 2018), for
building process maps relating variables to desired properties (e.g., as low porosity)
(Aoyagi et al. 2019), and for predicting surface roughness in terms of process features
(Wu et al. 2018). NNs are often used for optimizing the AM process by predicting
properties as a function of printing variables. For example, NNs have been used for
predicting and optimizing melt pool geometry in DED (Caiazzo and Caggiano 2020),
to build process maps and optimize efficiency and surface roughness in SLM (Zhang
et al. 2017), to minimize support wasted material (optimize supports in a piece) in
FDM (Jiang et al. 2019), to predict and optimize resulting mechanical properties of
the printed material (Lewandowski and Seifi 2016) like strength (e.g., using CNN
from thermal histories in Xie et al. 2021 or FFNN in Bayraktar et al. 2017), bending
stiffness in AM composites (Nawafleh and AL-Oqla 2022), and stress–strain curves
of binary composites using a combination of CNN and PCA (Yang et al. 2020).
NNs have also been used to create surrogate models with the purpose of mimicking
the acoustic properties of AM replicas of a Stradivarius violin (Tian et al. 2021).
Reviews of techniques and different applications of machine learning in additive
manufacturing may be found in Wang et al. (2020), DebRoy et al. (2021), Meng
et al. (2020), Qin et al. (2022), Xames et al. (2023), and Hashemi et al. (2022). The
review in Guo et al. (2022) addresses in some detail physics-based proposals.

1.4.6.3 Automated CAD and Generative Design

A fundamental step in the design of an industrial component or an architected structure
is the conceptual development of the novel component or structure (the most
creative part), and more often, the customization of a component from a given family
to meet the specific requirements of the component in the system to which it will be
added. The novel product is in essence a variation or evolution of previous concepts
(first case) or previous components (second case). ML may help in both cases. The
challenge of understanding the “rules” of creativity to foster it has paved the way for
interesting contributions of ML in this field (Ganin et al. 2021).
In the first case, ML helps in the generative design of a novel component or
structure by creating variations supported by attributes, based in essence on the
combination and evolution of previous conceptual designs (Gero 1996; Khan and
Awan 2018); see the review of ML contributions in Duffy (1997); see also Tzonis
and White (2012) especially for conceptual design. An example would be to create a
new design of a car. Some conditions are given by the segment to which it will belong,
but some other possibilities are open and can be generated from possible variations
that may please or attract consumers. For example, Generative Adversarial Networks
(GAN) (Goodfellow et al. 2020) are used to explore aerodynamic shapes (Chen et al.
(2019). ML is also used for the association of concepts and combinatorial creativity,
with the aim of reusing creative concepts to generate new designs (Chen 2020).
Further, ML is employed in the evaluation of design concepts from many candidates
based on human preferences expressed in previous concepts (Camburn et al. 2020).
There are also ML works that aid in the development of detailed and consistent CAD
drawings from hand sketches (Seff et al. 2021), i.e. interpreting and detailing a CAD
drawing from a hand sketch.
Considering the second case, the customization of designs is natural to ML
approaches. The idea here is to perform automatic variations of previous conceptual
designs, or of designs obtained from mathematical optimization. A good example
of this approach using deep NNs is given in Yoo et al. (2021) to propose wheel designs,
in which shape variations that comply with mechanical requirements (strength,
eigenfrequencies, etc., evaluated through surrogate models as a function of geometric
parameters) are obtained by variations and simplifications using
autoencoders. Based on this work, an interesting discussion on the balance between aesthetics and
performance (aspects to include in ML models) is given in Shin et al. (2021). The
combination of topology optimization and generative design can be found in many
endeavors (Oh et al. 2019; Barbieri and Muzzupappa 2022).
Moreover, in the design process, there are many aspects that can be automated.
A typical aspect is the search for components with similar layout such that detailed
drawings, solid models (Chu and Hsu 2006), and manufacturing processes (Li et al.
2016) of new designs may be inferred from previous similar designs (Zehtaban et al.
2016). Indeed, many works focus on procedures to reuse parts of CAD schemes for
electronic circuits (Boning et al. 2019) or to develop microfluidic devices (Lore et al.
2015; Tsur 2020).

1.5 Conclusions

With the current access to large amounts of data and the ubiquitous presence of real-
time sensors in our life, such as those present in cell phones, and also with the increased
computational power, Machine Learning (ML) has resulted in a change of paradigm
in how many problems are addressed. When using ML, the approach to many engi-
neering problems is no longer a matter of understanding the governing equations,
not even a matter of fully understanding the problem being addressed, but of having
sufficient data so relations between features and desired outputs can be established;
and not even in a deterministic way, but in an implicit probabilistic way.
ML has been succeeding for more than a decade in solving complex problems such as
face recognition or stock price evolution, for which there was no successful deterministic
method, and not even a sound understanding of the actual significance of the main
variables affecting the result. Computer Aided Engineering (CAE), with the Finite
Element Method standing out, has also had extraordinary success in accurately solv-
ing complex engineering problems, but a detailed understanding of the governing
equations and their discretization is needed. This success delayed the introduction of
ML techniques in classical CAE-dominated fields, but in recent years increasing
emphasis has been placed on ML methods. In particular, ML is used to solve some of
the issues still remaining when addressing the problem through classical techniques.
Examples of these issues are the still limited generality of classical CAE methods
(although the success of the FEM is due to its good generalization possibilities), the
search for practical solutions when there is not a complete, full understanding of the
problem, and computational efficiency in high-dimensional problems like multiscale
and nonlinear inverse problems. While we are still seeing the start of a new era, already
a large variety of problems in CAE has been addressed using different ML techniques.

Lessons have also been learned in the last few years. One important lesson is
that in engineering solutions, robustness and reliability of the solution are important
(Bathe 2006), and data may not be sufficient to guarantee that robustness. Then,
ML methods that incorporate physical laws and use the vast analytical knowledge
acquired in the last centuries may result not only in more robust methods but also
in more efficient schemes. In this chapter, we briefly reviewed ML techniques in
CAE and some representative applications. We focused on conveying some of the
excitement that is now developing in the research and use of ML techniques by short
descriptions of methods and many references to applications of those techniques.

Acknowledgements This is part of the training activities of the project funded by Euro-
pean Union’s Horizon 2020 research and innovation programme under the Marie Sklodowska-Curie
Grant Agreement No. 101007815; FJM gratefully acknowledges this funding.
The support of the Spanish Ministry of Science and Innovation, AEI/10.13039/501100011033,
through Grant number PID2020-113463RB-C31 and by the Regional Government of Aragon, grant
T24-20R, and the European Social Fund is also gratefully acknowledged by EC.

References

Abdar M, Pourpanah F, Hussain S, Rezazadegan D, Liu L, Ghavamzadeh M, Fieguth P, Cao X,
Khosravi A, Acharya UR et al (2021) A review of uncertainty quantification in deep learning:
Techniques, applications and challenges. Inf Fusion 76:243–297
Abdi H, Williams LJ (2010) Principal component analysis. Wiley Interdiscip Rev: Comput Stat
2(4):433–459
Abueidda DW, Koric S, Sobh NA (2020) Topology optimization of 2d structures with nonlinearities
using deep learning. Comput Struct 237:106283
Abueidda DW, Koric S, Sobh NA, Sehitoglu H (2021) Deep learning for plasticity and thermo-
viscoplasticity. Int J Plast 136:102852
Adeli H, Yeh C (1990) Explanation-based machine learning in engineering design. Eng Appl Artif
Intell 3(2):127–137
Ademujimi TT, Brundage MP, Prabhu VV (2017) A review of current machine learning techniques
used in manufacturing diagnosis. In: IFIP international conference on advances in production
management systems. Springer, Berlin, pp 407–415
Adriaans P, Zantinge D (1997) Data mining. Addison-Wesley Longman Publishing Co., Inc
Aggarwal A, Mittal M, Battineni G (2021) Generative adversarial network: an overview of theory
and applications. Int J Inf Manag Data Insights 1(1):100004
Agrawal A, Choudhary A (2016) Perspective: materials informatics and big data: realization of the
“fourth paradigm” of science in materials science. Apl Mater 4(5):053208
Ahmed O, Wang X, Tran MV, Ismadi MZ (2021) Advancements in fiber-reinforced polymer com-
posite materials damage detection methods: towards achieving energy-efficient shm systems.
Compos Part B: Eng 223:109136
Ahmed E, Jones M, Marks TK (2015) An improved deep learning architecture for person re-
identification. In: Proceedings of the IEEE conference on computer vision and pattern recognition,
pp 3908–3916
Akinyelu AA (2021) Advances in spam detection for email spam, web spam, social network spam,
and review spam: Ml-based and nature-inspired-based techniques. J Comput Secur 29(5):473–529
Alamdari MM, Rakotoarivelo T, Khoa NLD (2017) A spectral-based clustering for structural health
monitoring of the sydney harbour bridge. Mech Syst Signal Process 87:384–400
Alber M, Buganza Tepole A, Cannon WR, De S, Dura-Bernal S, Garikipati K, Karniadakis G, Lytton
WW, Perdikaris P, Petzold L et al (2019) Integrating machine learning and multiscale modeling-
perspectives, challenges, and opportunities in the biological, biomedical, and behavioral sciences.
NPJ Digit Med 2(1):1–11
Aletti M, Bortolossi A, Perotto S, Veneziani A (2015) One-dimensional surrogate models
for advection-diffusion problems. In: Numerical mathematics and advanced applications-
ENUMATH 2013. Springer, Berlin, pp 447–455
Alexiadis A (2019) Deep multiphysics: Coupling discrete multiphysics with machine learning to
attain self-learning in-silico models replicating human physiology. Artif Intell Med 98:27–34
Alizadeh R, Allen JK, Mistree F (2020) Managing computational complexity using surrogate mod-
els: a critical review. Res Eng Des 31(3):275–298
Alizadeh R, Abad JMN, Ameri A, Mohebbi MR, Mehdizadeh A, Zhao D, Karimi N (2021) A
machine learning approach to the prediction of transport and thermodynamic processes in mul-
tiphysics systems-heat transfer in a hybrid nanofluid flow in porous media. J Taiwan Inst Chem
Eng 124:290–306
Alvarez MA, Rosasco L, Lawrence ND et al (2012) Kernels for vector-valued functions: A review.
Found Trends® Mach Learn 4(3):195–266
Amezquita-Sancheza J, Valtierra-Rodriguez M, Adeli H (2020) Machine learning in structural
engineering. Scientia Iranica 27(6):2645–2656
Amores VJ, Benítez JM, Montáns FJ (2020) Data-driven, structure-based hyperelastic manifolds:
A macro-micro-macro approach to reverse-engineer the chain behavior and perform efficient
simulations of polymers. Comput Struct 231:106209
Amores VJ, Nguyen K, Montáns FJ (2021) On the network orientational affinity assumption in poly-
mers and the micro-macro connection through the chain stretch. J Mech Phys Solids 148:104279
Amores VJ, San Millan FJ, Ben-Yelun I, Montans FJ (2021) A finite strain non-parametric hyperelas-
tic extension of the classical phenomenological theory for orthotropic compressible composites.
Compos Part B: Eng 212:108591
Angelov S, Stoimenova E (2017) Cross-validated sequentially constructed multiple regression. In:
Annual meeting of the bulgarian section of SIAM. Springer, Berlin, pp 13–22
Aoyagi K, Wang H, Sudo H, Chiba A (2019) Simple method to construct process maps for additive
manufacturing using a support vector machine. Addit Manuf 27:353–362
Arangio S, Beck J (2012) Bayesian neural networks for bridge integrity assessment. Struct Control
Health Monit 19(1):3–21
Arangio S, Bontempi F (2015) Structural health monitoring of a cable-stayed bridge with bayesian
neural networks. Struct Infrastruct Eng 11(4):575–587
Arbabi H, Bunder JE, Samaey G, Roberts AJ, Kevrekidis IG (2020) Linking machine learning with
multiscale numerics: Data-driven discovery of homogenized equations. JOM 72(12):4444–4457
Arrieta AB, Díaz-Rodríguez N, Del Ser J, Bennetot A, Tabik S, Barbado A, García S, Gil-López
S, Molina D, Benjamins R et al (2020) Explainable artificial intelligence (XAI): Concepts, tax-
onomies, opportunities and challenges toward responsible ai. Inf Fusion 58:82–115
Asgari S, MirhoseiniNejad S, Moazamigoodarzi H, Gupta R, Zheng R, Puri IK (2021) A gray-box
model for real-time transient temperature predictions in data centers. Appl Therm Eng 185:116319
Ashtiani MN, Raahemi B (2021) Intelligent fraud detection in financial statements using machine
learning and data mining: a systematic literature review. IEEE Access 10:72504–72525
Aubry N (1991) On the hidden beauty of the proper orthogonal decomposition. Theor Comput Fluid
Dyn 2(5):339–352
Audouze C, De Vuyst F, Nair P (2009) Reduced-order modeling of parameterized PDEs using
time-space-parameter principal component analysis. Int J Numer Methods Eng 80(8):1025–1057
Avci O, Abdeljaber O, Kiranyaz S, Hussein M, Gabbouj M, Inman DJ (2021) A review of vibration-
based damage detection in civil structures: From traditional methods to machine learning and
deep learning applications. Mech Syst Signal Process 147:107077
Ayensa Jiménez J (2022) Study of the effect of the tumour microenvironment on cell response
using a combined simulation and machine learning approach. application to the evolution of
Glioblastoma. Ph.D. thesis, School of Engineering and Architecture. Universidad de Zaragoza
Baklacioglu T, Turan O, Aydin H (2019) Metaheuristics optimized machine learning modelling of
environmental exergo-emissions for an aero-engine. Int J Turbo Jet-Engines 39(3):411–426
Balakrishnama S, Ganapathiraju A (1998) Linear discriminant analysis-a brief tutorial. Inst Signal
Inf Process 18:1–8
Bank D, Koenigstein N, Giryes R (2020) Autoencoders. arXiv:2003.05991
Barbieri L, Muzzupappa M (2022) Performance-driven engineering design approaches based on
generative design and topology optimization tools: A comparative study. Appl Sci 12(4):2106
Barchiesi E, Spagnuolo M, Placidi L (2019) Mechanical metamaterials: a state of the art. Math
Mech Solids 24(1):212–234
Bastek JH, Kumar S, Telgen B, Glaesener RN, Kochmann DM (2022) Inverting the structure–
property map of truss metamaterials by deep learning. Proc Natl Acad Sci 119(1)
Bathe KJ (2006) Finite element procedures, 2nd edn 2014, KJ Bathe, Watertown, MA; also published
by Higher Education Press China 2016
Bathe KJ, Wilson EL (1973) Solution methods for eigenvalue problems in structural mechanics.
Int J Numer Methods Eng 6(2):213–226
Bayraktar Ö, Uzun G, Çakiroğlu R, Guldas A (2017) Experimental study on the 3d-printed plas-
tic parts and predicting the mechanical properties using artificial neural networks. Polym Adv
Technol 28(8):1044–1051
Becker M, Teschner M (2007) Weakly compressible SPH for free surface flows. In: Proceedings of
the 2007 ACM SIGGRAPH/Eurographics symposium on computer animation, pp. 209–217
Beel J, Gipp B (2009) Google scholar’s ranking algorithm: an introductory overview. In: Proceedings
of the 12th international conference on scientometrics and informetrics (ISSI’09), vol 1. Rio de
Janeiro (Brazil), pp 230–241
Belkin M, Niyogi P (2003) Laplacian eigenmaps for dimensionality reduction and data representa-
tion. Neural Comput 15(6):1373–1396
Bengio Y, Courville A, Vincent P (2013) Representation learning: A review and new perspectives.
IEEE Trans Pattern Anal Mach Intell 35(8):1798–1828
Benítez JM, Montáns FJ (2018) A simple and efficient numerical procedure to compute the inverse
Langevin function with high accuracy. J Non-Newton Fluid Mech 261:153–163
Berkooz G, Holmes P, Lumley JL (1993) The proper orthogonal decomposition in the analysis of
turbulent flows. Annu Rev Fluid Mech 25(1):539–575
Bertoldi K, Vitelli V, Christensen J, Van Hecke M (2017) Flexible mechanical metamaterials. Nat
Rev Mater 2(11):1–11
Bezazi A, Pierce SG, Worden K et al (2007) Fatigue life prediction of sandwich composite materials
under flexural tests using a Bayesian trained artificial neural network. Int J Fatigue 29(4):738–747
Bhatti G, Mohan H, Singh RR (2021) Towards the future of smart electric vehicles: Digital twin
technology. Renew Sustain Energy Rev 141:110801
Bickel S, Haider P, Scheffer T (2005) Learning to complete sentences. In: European conference on
machine learning. Springer, Berlin, pp 497–504
Bird GD, Gorrell SE, Salmon JL (2021) Dimensionality-reduction-based surrogate models for real-
time design space exploration of a jet engine compressor blade. Aerosp Sci Technol 118:107077
Bischl B, Lang M, Kotthoff L, Schiffner J, Richter J, Studerus E, Casalicchio G, Jones ZM (2016)
mlr: Machine learning in R. J Mach Learn Res 17(1):5938–5942
Bishop CM (1995) Training with noise is equivalent to Tikhonov regularization. Neural Comput
7(1):108–116
Bisong E (2019a) Google Cloud Machine Learning Engine (Cloud MLE). In: Building machine
learning and deep learning models on Google Cloud Platform. Springer, Berlin, pp 545–579
Bisong E (2019b) NumPy. In: Building machine learning and deep learning models on Google Cloud
Platform. Springer, Berlin, pp 91–113
Bock FE, Aydin RC, Cyron CJ, Huber N, Kalidindi SR, Klusemann B (2019) A review of the
application of machine learning and data mining approaches in continuum materials mechanics.
Frontiers Mater 6:110
Bonet J, Wood RD (1997) Nonlinear continuum mechanics for finite element analysis. Cambridge
University Press
Boning DS, Elfadel IAM, Li X (2019) A preliminary taxonomy for machine learning in VLSI CAD.
In: Machine learning in VLSI computer-aided design. Springer, Berlin, pp 1–16
Borkowski L, Sorini C, Chattopadhyay A (2022) Recurrent neural network-based multiaxial plas-
ticity model with regularization for physics-informed constraints. Comput Struct 258:106678
Braconnier T, Ferrier M, Jouhaud JC, Montagnac M, Sagaut P (2011) Towards an adaptive pod/svd
surrogate model for aeronautic design. Comput Fluids 40(1):195–209
Bro R, Smilde AK (2014) Principal component analysis. Anal Methods 6(9):2812–2831
Brodie CR, Constantin A, Deen R, Lukas A (2020) Machine learning line bundle cohomology.
Fortschritte der Physik 68(1):1900087
Brunton SL, Kutz JN (2022) Data-driven science and engineering: Machine learning, dynamical
systems, and control. Cambridge University Press
Brunton SL, Kutz JN (2019) Methods for data-driven multiscale model discovery for materials. J
Phys: Mater 2(4):044002
Brunton SL, Proctor JL, Kutz JN (2016) Discovering governing equations from data by sparse
identification of nonlinear dynamical systems. Proc Natl Acad Sci 113(15):3932–3937
Build powerful models (2022). https://www.opennn.org/
Buizza C, Casas CQ, Nadler P, Mack J, Marrone S, Titus Z, Le Cornec C, Heylen E, Dur T, Ruiz
LB et al (2022) Data learning: integrating data assimilation and machine learning. J Comput Sci
58:101525
Bukka SR, Magee AR, Jaiman RK (2020) Deep convolutional recurrent autoencoders for flow field
prediction. In: International conference on offshore mechanics and arctic engineering, vol 84409.
American Society of Mechanical Engineers, p V008T08A005
Burkov A (2020) Machine learning engineering, vol 1. True Positive Incorporated
Burkov A (2019) The hundred-page machine learning book, vol 1. Andriy Burkov Quebec City,
QC, Canada
Burov A, Burova O (2020) Development of digital twin for composite pressure vessel. J Phys: Conf
Ser 1441:012133. IOP Publishing
Buşoniu L, de Bruin T, Tolić D, Kober J, Palunko I (2018) Reinforcement learning for control:
Performance, stability, and deep approximators. Ann Rev Control 46:8–28
Bzdok D, Altman N, Krzywinski M (2018) Statistics versus machine learning. Nat Methods 15:233–
234
Caccin M, Li Z, Kermode JR, De Vita A (2015) A framework for machine-learning-augmented
multiscale atomistic simulations on parallel supercomputers. Int J Quantum Chem 115(16):1129–
1139
Caiazzo F, Caggiano A (2018) Laser direct metal deposition of 2024 Al alloy: trace geometry
prediction via machine learning. Materials 11(3):444
Camburn B, He Y, Raviselvam S, Luo J, Wood K (2020) Machine learning-based design concept
evaluation. J Mech Des 142(3):031113
Capuano G, Rimoli JJ (2019) Smart finite elements: A novel machine learning application. Comput
Methods Appl Mech Eng 345:363–381
Carleo G, Cirac I, Cranmer K, Daudet L, Schuld M, Tishby N, Vogt-Maranto L, Zdeborová L (2019)
Machine learning and the physical sciences. Rev Modern Phys 91(4):045002
Carrara P, De Lorenzis L, Stainier L, Ortiz M (2020) Data-driven fracture mechanics. Comput
Methods Appl Mech Eng 372:113390
Cayton L (2005) Algorithms for manifold learning. Univ Calif San Diego Tech Rep 12(1–17):1
Champaney V, Chinesta F, Cueto E (2022) Engineering empowered by physics-based and data-
driven hybrid models: A methodological overview. Int J Mater Form 15(3):1–14
Champion K, Lusch B, Kutz JN, Brunton SL (2019) Data-driven discovery of coordinates and
governing equations. Proc Natl Acad Sci 116(45):22445–22451
Chan S, Elsheikh AH (2018) A machine learning approach for efficient uncertainty quantification
using multiscale methods. J Comput Phys 354:493–511
Chang KH, Cheng CY, Luo J, Murata S, Nourbakhsh M, Tsuji Y (2021) Building-GAN: Graph-
conditioned architectural volumetric design generation. In: Proceedings of the IEEE/CVF inter-
national conference on computer vision, pp 11956–11965
Chang KH, Cheng CY (2020) Learning to simulate and design for structural engineering. In:
International conference on machine learning. PMLR, pp 1426–1436
Chattopadhyay A, Hassanzadeh P, Subramanian D (2020) Data-driven predictions of a multiscale
Lorenz 96 chaotic system using machine-learning methods: reservoir computing, artificial neural
network, and long short-term memory network. Nonlinear Process Geophys 27(3):373–389
Chazal F, Michel B (2021) An introduction to topological data analysis: fundamental and practical
aspects for data scientists. Frontiers Artif Intell 4:667363
Chen L (2020) Data-driven and machine learning based design creativity. Ph.D. thesis, Imperial
College London
Chen T, Chen H (1995) Universal approximation to nonlinear operators by neural networks with
arbitrary activation functions and its application to dynamical systems. IEEE Trans Neural Netw
6(4):911–917
Chen W, Chiu K, Fuge M (2019) Aerodynamic design optimization and shape exploration using
generative adversarial networks. In: AIAA Scitech 2019 forum, p 2351
Chen S, Gu C, Lin C, Zhao E, Song J (2018) Safety monitoring model of a super-high concrete
dam by using RBF neural network coupled with kernel principal component analysis. Math Probl
Eng 1712653
Cherifi K (2020) An overview on recent machine learning techniques for port-Hamiltonian systems.
Physica D: Nonlinear Phenomena 411:132620
Chi H, Zhang Y, Tang TLE, Mirabella L, Dalloro L, Song L, Paulino GH (2021) Universal machine
learning for topology optimization. Comput Methods Appl Mech Eng 375:112739
Chinesta F, Ladeveze P, Cueto E (2011) A short review on model order reduction based on proper
generalized decomposition. Arch Comput Methods Eng 18(4):395–404
Chinesta F, Cueto E, Abisset-Chavanne E, Duval JL, Khaldi FE (2020) Virtual, digital and hybrid
twins: a new paradigm in data-based engineering and engineered data. Arch Comput Methods
Eng 27(1):105–134
Chinesta F, Cueto E, Grmela M, Moya B, Pavelka M, Šípka M (2020) Learning physics from data:
a thermodynamic interpretation. In: Workshop on joint structures and common foundations of
statistical physics, information geometry and inference for learning. Springer, Berlin, pp 276–297
Choi SY, Cha D (2019) Unmanned aerial vehicles using machine learning for autonomous flight;
state-of-the-art. Adv Robot 33(6):265–277
Chu CH, Hsu YC (2006) Similarity assessment of 3d mechanical components for design reuse.
Robot Comput-Integr Manuf 22(4):332–341
Ciang CC, Lee JR, Bang HJ (2008) Structural health monitoring for a wind turbine system: a review
of damage detection methods. Meas Sci Technol 19(12):122001
Ciftci K, Hackl K (2022) Model-free data-driven simulation of inelastic materials using structured
data sets, tangent space information and transition rules. Comput Mech 70:425–435
Clifton A, Kilcher L, Lundquist J, Fleming P (2013) Using machine learning to predict wind turbine
power output. Environ Res Lett 8(2):024009
Coelho M, Roehl D, Bletzinger KU (2017) Material model based on NURBS response surfaces.
Appl Math Model 51:574–586
Colherinhas GB, de Morais MV, Shzu MA, Avila SM (2019) Optimal pendulum tuned mass damper
design applied to high towers using genetic algorithms: Two-dof modeling. Int J Struct Stab Dyn
19(10):1950125
Conti S, Müller S, Ortiz M (2018) Data-driven problems in elasticity. Arch Rat Mech Anal
229(1):79–123
Conti S, Müller S, Ortiz M (2020) Data-driven finite elasticity. Arch Rat Mech Anal 237(1):1–33
Cranmer M, Greydanus S, Hoyer S, Battaglia P, Spergel D, Ho S (2020) Lagrangian neural networks.
arXiv:2003.04630
Crawford M, Khoshgoftaar TM, Prusa JD, Richter AN, Al Najada H (2015) Survey of review spam
detection using machine learning techniques. J Big Data 2(1):1–24
Crespo J, Montans FJ (2018) A continuum approach for the large strain finite element analysis of
auxetic materials. Int J Mech Sci 135:441–457
Crespo J, Montáns FJ (2019) General solution procedures to compute the stored energy density of
conservative solids directly from experimental data. Int J Eng Sci 141:16–34
Crespo J, Latorre M, Montáns FJ (2017) WYPIWYG hyperelasticity for isotropic, compressible
materials. Comput Mech 59(1):73–92
Crespo J, Duncan O, Alderson A, Montáns FJ (2020) Auxetic orthotropic materials: Numerical
determination of a phenomenological spline-based stored density energy and its implementation
for finite element analysis. Comput Methods Appl Mech Eng 371:113300
de Brito MM, Evers M (2016) Multi-criteria decision-making for flood risk management: a survey
of the current state of the art. Nat Hazards Earth Syst Sci 16(4):1019–1033
De Jong K (1988) Learning with genetic algorithms: An overview. Mach Learn 3(2):121–138
DebRoy T, Mukherjee T, Wei H, Elmer J, Milewski J (2021) Metallurgy, mechanistic models and
machine learning in metal printing. Nat Rev Mater 6(1):48–68
Delli U, Chang S (2018) Automated process monitoring in 3d printing using supervised machine
learning. Procedia Manuf 26:865–870
Demo N, Tezzele M, Rozza G (2018) PyDMD: Python dynamic mode decomposition. J Open Source
Softw 3(22):530
Dener A, Miller MA, Churchill RM, Munson T, Chang CS (2020) Training neural networks under
physical constraints using a stochastic augmented Lagrangian approach. arXiv:2009.07330
Deng Z, He C, Liu Y, Kim KC (2019) Super-resolution reconstruction of turbulent velocity
fields using a generative adversarial network-based artificial intelligence framework. Phys Fluids
31(12):125111
Denkena B, Bergmann B, Witt M (2019) Material identification based on machine-learning algo-
rithms for hybrid workpieces during cylindrical operations. J Intell Manuf 30(6):2449–2456
Desai SA, Mattheakis M, Sondak D, Protopapas P, Roberts SJ (2021) Port-Hamiltonian neural
networks for learning explicit time-dependent dynamical systems. Phys Rev E 104(3):034312
Dhanalaxmi B (2020) Machine learning and its emergence in the modern world and its contribution
to artificial intelligence. In: 2020 International conference for emerging technology (INCET).
IEEE, pp 1–4
Di Leoni PC, Lu L, Meneveau C, Karniadakis G, Zaki TA (2021) DeepONet prediction of linear
instability waves in high-speed boundary layers. arXiv:2105.08697
Dijkstra EW (1959) A note on two problems in connexion with graphs. Numerische Mathematik
1(1):269–271
Dillon JV, Langmore I, Tran D, Brevdo E, Vasudevan S, Moore D, Patton B, Alemi A, Hoffman M,
Saurous RA (2017) TensorFlow distributions. arXiv:1711.10604
Domaneschi M, Noori AZ, Pietropinto MV, Cimellaro GP (2021) Seismic vulnerability assessment
of existing school buildings. Comput Struct 248:106522
Dong CZ, Catbas FN (2021) A review of computer vision-based structural health monitoring at
local and global levels. Struct Health Monit 20(2):692–743
du Bos ML, Balabdaoui F, Heidenreich JN (2020) Modeling stress-strain curves with neural net-
works: a scalable alternative to the return mapping algorithm. Comput Mater Sci 178:109629
Duarte AC, Roldan F, Tubau M, Escur J, Pascual S, Salvador A, Mohedano E, McGuinness K,
Torres J, Giro-i Nieto X (2019) WAV2PIX: Speech-conditioned face generation using generative
adversarial networks. In: ICASSP, pp 8633–8637
Duarte, F (2018) 5 algoritmos que ya están tomando decisiones sobre tu vida y que quizás tu no
sabías [in spanish, translation: 5 algorithms that are already making decisions about your life,
and perhaps you did not know]. https://www.bbc.com/mundo/noticias-42916502
Duffy AH (1997) The “what” and “how” of learning in design. IEEE Expert 12(3):71–76
Dumon A, Allery C, Ammar A (2011) Proper general decomposition (PGD) for the resolution of
Navier-Stokes equations. J Comput Phys 230(4):1387–1407
Eggersmann R, Kirchdoerfer T, Reese S, Stainier L, Ortiz M (2019) Model-free data-driven inelas-
ticity. Comput Methods Appl Mech Eng 350:81–99
Eggersmann R, Stainier L, Ortiz M, Reese S (2021) Efficient data structures for model-free data-
driven computational mechanics. Comput Methods Appl Mech Eng 382:113855
Eggersmann R, Stainier L, Ortiz M, Reese S (2021) Model-free data-driven computational mechan-
ics enhanced by tensor voting. Comput Methods Appl Mech Eng 373:113499
Eidnes S, Stasik AJ, Sterud C, Bøhn E, Riemer-Sørensen S (2022) Port-Hamiltonian neural networks
with state dependent ports. arXiv:2206.02660
Eilers PH, Marx BD (1996) Flexible smoothing with B-splines and penalties. Stat Sci 11(2):89–121
El Kadi H (2006) Modeling the mechanical behavior of fiber-reinforced polymeric composite mate-
rials using artificial neural networks-a review. Compos Struct 73(1):1–23
El Said B, Hallett SR (2018) Multiscale surrogate modelling of the elastic response of thick com-
posite structures with embedded defects and features. Compos Struct 200:781–798
Erchiqui F, Kandil N (2006) Neuronal networks approach for characterization of softened polymers.
J Reinf Plast Compos 25(5):463–473
Erichson NB, Muehlebach M, Mahoney MW (2019) Physics-informed autoencoders for Lyapunov-
stable fluid flow prediction. arXiv:1905.10866
Etedali S, Mollayi N (2018) Cuckoo search-based least squares support vector machine models for
optimum tuning of tuned mass dampers. Int J Struct Stab Dyn 18(02):1850028
Eubank RL (1999) Nonparametric regression and spline smoothing. CRC Press
Farrar CR, Worden K (2012) Structural health monitoring: a machine learning perspective. Wiley,
New York
Feng N, Zhang G, Khandelwal K (2022) Finite strain FE2 analysis with data-driven homogenization
using deep neural networks. Comput Struct 263:106742
Fernández J, Chiachío M, Chiachío J, Muñoz R, Herrera F (2022) Uncertainty quantification in
neural networks by approximate Bayesian computation: Application to fatigue in composite
materials. Eng Appl Artif Intell 107:104511
Fernández M, Fritzen F, Weeger O (2022) Material modeling for parametric, anisotropic finite strain
hyperelasticity based on machine learning with application in optimization of metamaterials. Int
J Numer Methods Eng 123(2):577–609. https://onlinelibrary.wiley.com/doi/full/10.1002/nme.6869
Field D, Ammouche Y, Peña JM, Jérusalem A (2021) Machine learning based multiscale calibration
of mesoscopic constitutive models for composite materials: application to brain white matter.
Comput Mech 67(6):1629–1643
Fischer CC, Tibbetts KJ, Morgan D, Ceder G (2006) Predicting crystal structure by merging data
mining with quantum mechanics. Nat Mater 5(8):641–646
Fish J, Wagner GJ, Keten S (2021) Mesoscopic and multiscale modelling in materials. Nat Mater
20(6):774–786
Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugenics
7(2):179–188
Flah M, Nunez I, Ben Chaabene W, Nehdi ML (2021) Machine learning algorithms in civil structural
health monitoring: a systematic review. Arch Comput Methods Eng 28(4):2621–2643
Flaschel M, Kumar S, De Lorenzis L (2021) Unsupervised discovery of interpretable hyperelastic
constitutive laws. Comput Methods Appl Mech Eng 381:113852
Flaschel M, Kumar S, De Lorenzis L (2022) Discovering plasticity models without stress data. NPJ
Comput Mater 8(1):1–10
Floyd RW (1962) Algorithm 97: shortest path. Commun ACM 5(6):345
Frank M, Drikakis D, Charissis V (2020) Machine-learning methods for computational science and
engineering. Computation 8(1):15
Frankel AL, Safta C, Alleman C, Jones R (2022) Mesh-based graph convolutional neural networks
for modeling materials with microstructure. J Mach Learn Model Comput 3(1)
Frankel A, Hamel CM, Bolintineanu D, Long K, Kramer S (2022) Machine learning constitutive
models of elastomeric foams. Comput Methods Appl Mech Eng 391:114492
Freischlad M, Schnellenbach-Held M (2005) A machine learning approach for the support of
preliminary structural design. Adv Eng Inf 19(4):281–287
Friedman JH (1989) Regularized discriminant analysis. J Am Stat Assoc 84(405):165–175
Fu K, Li J, Zhang Y, Shen H, Tian Y (2020) Model-guided multi-path knowledge aggregation for
aerial saliency prediction. IEEE Trans Image Process 29:7117–7127
Fuchs A, Heider Y, Wang K, Sun W, Kaliske M (2021) DNN2: A hyper-parameter reinforcement
learning game for self-design of neural network based elasto-plastic constitutive descriptions.
Comput Struct 249:106505
Fuhg JN, Bouklas N, Jones RE (2022) Learning hyperelastic anisotropy from data via a tensor basis
neural network. arXiv:2204.04529
Fuhg JN, Fau A, Bouklas N, Marino M (2022) Elasto-plasticity with convex model-data-driven
yield functions. hal-03619186v1. https://hal.science/hal-03619186/
Fuhg JN, Böhm C, Bouklas N, Fau A, Wriggers P, Marino M (2021) Model-data-driven constitutive
responses: application to a multiscale computational framework. Int J Eng Sci 167:103522
Gabel J, Desaphy J, Rognan D (2014) Beware of machine learning-based scoring functions. on the
danger of developing black boxes. J Chem Inf Model 54(10):2807–2815
Ganin Y, Bartunov S, Li Y, Keller E, Saliceti S (2021) Computer-aided design as language. Adv
Neural Inf Process Syst 34:5885–5897
Gannouni S, Maad RB (2016) Numerical analysis of smoke dispersion against the wind in a tunnel
fire. J Wind Eng Ind Aerodyn 158:61–68
Gao K, Mei G, Piccialli F, Cuomo S, Tu J, Huo Z (2020) Julia language in machine learning:
Algorithms, applications, and open issues. Comput Sci Rev 37:100254
Garg A, Panigrahi BK (2021) Multi-dimensional digital twin of energy storage system for electric
vehicles: A brief review. Energy Storage 3(6):e242
Garg S, Gupta H, Chakraborty S (2022) Assessment of deeponet for time dependent reliability
analysis of dynamical systems subjected to stochastic loading. Eng Struct 270:114811
Gaurav D, Tiwari SM, Goyal A, Gandhi N, Abraham A (2020) Machine intelligence-based algo-
rithms for spam filtering on document labeling. Soft Comput 24(13):9625–9638
Gero JS (1996) Creativity, emergence and evolution in design. Knowl-Based Syst 9(7):435–448
Gerolymos G, Vallet I (1996) Implicit computation of three-dimensional compressible Navier-
Stokes equations using k-epsilon closure. AIAA J 34(7):1321–1330
Ghosh A, SahaRay R, Chakrabarty S, Bhadra S (2021) Robust generalised quadratic discriminant
analysis. Pattern Recognit 117:107981
Ghoting A, Krishnamurthy R, Pednault E, Reinwald B, Sindhwani V, Tatikonda S, Tian Y,
Vaithyanathan S (2011) SystemML: Declarative machine learning on MapReduce. In: 2011 IEEE
27th international conference on data engineering. IEEE, pp 231–242
Giacinto G, Paolucci R, Roli F (1997) Application of neural networks and statistical pattern recog-
nition algorithms to earthquake risk evaluation. Pattern Recognit Lett 18(11–13):1353–1362
Gin CR, Shea DE, Brunton SL, Kutz JN (2021) DeepGreen: deep learning of Green’s functions for
nonlinear boundary value problems. Sci Rep 11(1):1–14
Glaessgen E, Stargel D (2012) The digital twin paradigm for future NASA and US Air force
vehicles. In: 53rd AIAA/ASME/ASCE/AHS/ASC structures, structural dynamics and materials
conference. 20th AIAA/ASME/AHS adaptive structures conference. 14th AIAA, p 1818
Gobert C, Reutzel EW, Petrich J, Nassar AR, Phoha S (2018) Application of supervised machine
learning for defect detection during metallic powder bed fusion additive manufacturing using
high resolution imaging. Addit Manuf 21:517–528
Gonzalez FJ, Balajewicz M (2018) Deep convolutional recurrent autoencoders for learning low-
dimensional feature dynamics of fluid systems. arXiv:1808.01346
González MP, Zapico JL (2008) Seismic damage identification in buildings using neural networks
and modal data. Comput Struct 86(3–5):416–426
González D, Chinesta F, Cueto E (2019) Learning corrections for hyperelastic models from data.
Front Mater 6:14
González D, Chinesta F, Cueto E (2019) Thermodynamically consistent data-driven computational
mechanics. Contin Mech Thermodyn 31(1):239–253
González D, García-González A, Chinesta F, Cueto E (2020) A data-driven learning method for
constitutive modeling: application to vascular hyperelastic soft tissues. Materials 13(10):2319
González D, Chinesta F, Cueto E (2021) Learning non-Markovian physics from data. J Comput
Phys 428:109982
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y
(2020) Generative adversarial networks. Commun ACM 63(11):139–144
Google Cloud: AI and machine learning products. Innovative machine learning products and services
on a trusted platform. https://cloud.google.com/products/ai
Goswami S, Anitescu C, Chakraborty S, Rabczuk T (2020) Transfer learning enhanced physics
informed neural network for phase-field modeling of fracture. Theor Appl Fract Mech 106:102447
Goswami S, Yin M, Yu Y, Karniadakis GE (2022) A physics-informed variational DeepONet for
predicting crack path in quasi-brittle materials. Comput Methods Appl Mech Eng 391:114587
Grefenstette JJ (1993) Genetic algorithms and machine learning. In: Proceedings of the sixth annual
conference on computational learning theory, pp 3–4
Greydanus S, Dzamba M, Yosinski J (2019) Hamiltonian neural networks. Adv Neural Inf Process
Syst 32
Gui G, Pan H, Lin Z, Li Y, Yuan Z (2017) Data-driven support vector machine with optimization
techniques for structural health monitoring and damage detection. KSCE J Civ Eng 21(2):523–
534
Guilleminot J, Dolbow JE (2020) Data-driven enhancement of fracture paths in random composites.
Mech Res Commun 103:103443
Gulli A, Pal S (2017) Deep learning with Keras. Packt Publishing Ltd
Guo J, Liu C, Cao J, Jiang D (2021) Damage identification of wind turbine blades with deep
convolutional neural networks. Renew Energy 174:122–133
Guo S, Agarwal M, Cooper C, Tian Q, Gao RX, Grace WG, Guo Y (2022) Machine learning for
metal additive manufacturing: Towards a physics-informed data-driven paradigm. J Manuf Syst
62:145–163
Hadden CM, Klimek-McDonald DR, Pineda EJ, King JA, Reichanadter AM, Miskioglu I, Gowtham
S, Odegard GM (2015) Mechanical properties of graphene nanoplatelet/carbon fiber/epoxy hybrid
composites: Multiscale modeling and experiments. Carbon 95:100–112
Haghighat E, Juanes R (2021) SciANN: A Keras/TensorFlow wrapper for scientific computations and
physics-informed deep learning using artificial neural networks. Comput Methods Appl Mech
Eng 373:113552
Haghighat E, Raissi M, Moure A, Gomez H, Juanes R (2021) A physics-informed deep learning
framework for inversion and surrogate modeling in solid mechanics. Comput Methods Appl Mech
Eng 379:113741
Haik W, Maday Y, Chamoin L (2021) A real-time variational data assimilation method with model
bias identification and correction. In: RAMSES: reduced order models; approximation theory;
machine learning; surrogates; emulators and simulators
Hall D, Llinas J (2001) Multisensor data fusion. CRC Press
Hanifa RM, Isa K, Mohamad S (2021) A review on speaker recognition: technology and challenges.
Comput Electr Eng 90:107005
Hariri-Ardebili MA, Pourkamali-Anaraki F (2018) Simplified reliability analysis of multi hazard
risk in gravity dams via machine learning techniques. Arch Civ Mech Eng 18(2):592–610
Hasançebi O, Dumlupınar T (2013) Linear and nonlinear model updating of reinforced concrete
t-beam bridges using artificial neural networks. Comput Struct 119:1–11
Hashemi SM, Parvizi S, Baghbanijavid H, Tan AT, Nematollahi M, Ramazani A, Fang NX, Elahinia
M (2022) Computational modelling of process-structure-property-performance relationships in
metal additive manufacturing: A review. Int Mater Rev 67(1):1–46
Hashemipour S, Ali M (2020) Amazon web services (AWS)–an overview of the on-demand
cloud computing platform. In: International conference for emerging technologies in comput-
ing. Springer, Berlin, pp 40–47
Hassan RJ, Abdulazeez AM et al (2021) Deep learning convolutional neural network for face
recognition: A review. Int J Sci Bus 5(2):114–127
Hastie T, Tibshirani R, Buja A (1994) Flexible discriminant analysis by optimal scoring. J Am Stat
Assoc 89(428):1255–1270
He Q, Laurence DW, Lee CH, Chen JS (2021) Manifold learning based data-driven modeling for
soft biological tissues. J Biomech 117:110124
He S, Shin HS, Tsourdos A (2021) Computational missile guidance: a deep reinforcement learning
approach. J Aerosp Inf Syst 18(8):571–582
He X, He Q, Chen JS (2021) Deep autoencoders for physics-constrained data-driven nonlinear
materials modeling. Comput Methods Appl Mech Eng 385:114034
He Q, Gu C, Valente S, Zhao E, Liu X, Yuan D (2022) Multi-arch dam safety evaluation based on
statistical analysis and numerical simulation. Sci Rep 12(1):1–19
Hebb DO (2005) The organization of behavior: a neuropsychological theory. Psychology Press
Hemati MS, Williams MO, Rowley CW (2014) Dynamic mode decomposition for large and stream-
ing datasets. Phys Fluids 26(11):111701
Hernandez Q, Badias A, Gonzalez D, Chinesta F, Cueto E (2021) Deep learning of thermodynamics-
aware reduced-order models from data. Comput Methods Appl Mech Eng 379:113763
Hernández Q, Badías A, González D, Chinesta F, Cueto E (2021) Structure-preserving neural
networks. J Comput Phys 426:109950
Hernández Q, Badías A, Chinesta F, Cueto E (2022) Thermodynamics-informed graph neural
networks. arXiv:2203.01874
Herrada F, García-Martínez J, Fraile A, Hermanns L, Montáns F (2017) A method for perform-
ing efficient parametric dynamic analyses in large finite element models undergoing structural
modifications. Eng Struct 131:625–638
Hershey JR, Olsen PA (2007) Approximating the Kullback Leibler divergence between Gaus-
sian mixture models. In: 2007 IEEE international conference on acoustics, speech and signal
processing-ICASSP’07, vol 4. IEEE, pp IV–317
Bhadeshia HKDH (1999) Neural networks in materials science. ISIJ Int 39(10):966–979
Ho LV, Nguyen DH, Mousavi M, De Roeck G, Bui-Tien T, Gandomi AH, Wahab MA (2021) A
hybrid computational intelligence approach for structural damage detection using marine predator
algorithm and feedforward neural networks. Comput Struct 252:106568
Hofmann T, Schölkopf B, Smola AJ (2008) Kernel methods in machine learning. Ann Stat
36(3):1171–1220
Hong T, Wang Z, Luo X, Zhang W (2020) State-of-the-art on research and applications of machine
learning in the building life cycle. Energy Build 212:109831
Hong SJ, Chun H, Lee J, Kim BH, Seo MH, Kang J, Han B (2021) First-principles-based machine-
learning molecular dynamics for crystalline polymers with van der Waals interactions. J Phys
Chem Lett 12(25):6000–6006
Hoshyar AN, Samali B, Liyanapathirana R, Houshyar AN, Yu Y (2020) Structural damage detection
and localization using a hybrid method and artificial intelligence techniques. Struct Health Monit
19(5):1507–1523
Hosmer Jr DW, Lemeshow S, Sturdivant RX (2013) Applied logistic regression, vol 398. Wiley,
New York
Hou C, Wang J, Wu Y, Yi D (2009) Local linear transformation embedding. Neurocomputing
72(10–12):2368–2378
Huang D, Fuhg JN, Weißenfels C, Wriggers P (2020) A machine learning based plasticity model
using proper orthogonal decomposition. Comput Methods Appl Mech Eng 365:113008
Ibañez R, Borzacchiello D, Aguado JV, Abisset-Chavanne E, Cueto E, Ladeveze P, Chinesta F (2017)
Data-driven non-linear elasticity: constitutive manifold construction and problem discretization.
Comput Mech 60(5):813–826
Ibañez R, Abisset-Chavanne E, Aguado JV, Gonzalez D, Cueto E, Chinesta F (2018) A manifold
learning approach to data-driven computational elasticity and inelasticity. Arch Comput Methods
Eng 25(1):47–57
Ibañez R, Gilormini P, Cueto E, Chinesta F (2020) Numerical experiments on unsupervised manifold
learning applied to mechanical modeling of materials and structures. Comptes Rendus. Mécanique
348(10–11):937–958
Ibragimova O, Brahme A, Muhammad W, Lévesque J, Inal K (2021) A new ANN based crystal
plasticity model for fcc materials and its application to non-monotonic strain paths. Int J Plast
144:103059
Innes M (2018) Flux: Elegant machine learning with Julia. J Open Source Softw 3(25):602
Innes M, Edelman A, Fischer K, Rackauckas C, Saba E, Shah VB, Tebbutt W (2019) A differentiable
programming system to bridge machine learning and scientific computing. arXiv:1907.07587
Inoue H (2018) Data augmentation by pairing samples for images classification. arXiv:1801.02929
Jackson NE, Webb MA, de Pablo JJ (2019) Recent advances in machine learning towards multiscale
soft materials design. Curr Opin Chem Eng 23:106–114
Jafari M (2020) System identification of a soil tunnel based on a hybrid artificial neural network-
numerical model approach. Iran J Sci Technol, Trans Civ Eng 44(3):889–899
Jagtap AD, Kharazmi E, Karniadakis GE (2020) Conservative physics-informed neural networks on
discrete domains for conservation laws: Applications to forward and inverse problems. Comput
Methods Appl Mech Eng 365:113028
Jang DP, Fazily P, Yoon JW (2021) Machine learning-based constitutive model for J2-plasticity. Int
J Plast 138:102919
Jansson T, Nilsson L, Redhe M (2003) Using surrogate models and response surfaces in struc-
tural optimization-with application to crashworthiness design and sheet metal forming. Struct
Multidiscip Optim 25(2):129–140
Jayasundara N, Thambiratnam D, Chan T, Nguyen A (2020) Damage detection and quantification
in deck type arch bridges using vibration based methods and artificial neural networks. Eng Fail
Anal 109:104265
Jha D, Singh S, Al-Bahrani R, Liao WK, Choudhary A, De Graef M, Agrawal A (2018) Extracting
grain orientations from EBSD patterns of polycrystalline materials using convolutional neural
networks. Microsc Microanal 24(5):497–502
Jiang J, Hu G, Li X, Xu X, Zheng P, Stringer J (2019) Analysis and prediction of printable bridge
length in fused deposition modelling based on back propagation neural network. Virtual Phys
Prototyp 14(3):253–266
Jiménez AA, Márquez FPG, Moraleda VB, Muñoz CQG (2019) Linear and nonlinear features and
machine learning for wind turbine blade ice detection and diagnosis. Renew Energy 132:1034–
1048
Jiménez AA, Zhang L, Muñoz CQG, Márquez FPG (2020) Maintenance management based on
machine learning and nonlinear features in wind turbines. Renew Energy 146:316–328
Jin Z, Zhang Z, Demir K, Gu GX (2020) Machine learning for advanced additive manufacturing.
Matter 3(5):1541–1556
Jokar M, Semperlotti F (2021) Finite element network analysis: A machine learning based compu-
tational framework for the simulation of physical systems. Comput Struct 247:106484
Jovanović MR, Schmid PJ, Nichols JW (2014) Sparsity-promoting dynamic mode decomposition.
Phys Fluids 26(2):024103
Jung J, Yoon JI, Park HK, Jo H, Kim HS (2020) Microstructure design using machine learning
generated low dimensional and continuous design space. Materialia 11:100690
Jung J, Yoon K, Lee PS (2020) Deep learned finite elements. Comput Methods Appl Mech Eng
372:113401
Kadic M, Milton GW, van Hecke M, Wegener M (2019) 3d metamaterials. Nat Rev Phys 1(3):198–
210
Kaehler A, Bradski G (2016) Learning OpenCV 3: computer vision in C++ with the OpenCV
library. O’Reilly Media, Inc
Kalitzin G, Medic G, Iaccarino G, Durbin P (2005) Near-wall behavior of RANS turbulence models
and implications for wall functions. J Comput Phys 204(1):265–291
Kamath C (2001) On mining scientific datasets. In: Data mining for scientific and engineering
applications. Springer, Berlin, pp 1–21
Kanno Y (2018) Data-driven computing in elasticity via kernel regression. Theor Appl Mech Lett
8(6):361–365
Kanouté P, Boso D, Chaboche JL, Schrefler B (2009) Multiscale methods for composites: a review.
Arch Comput Methods Eng 16(1):31–75
Karthikeyan J, Hie TS, Jin NY (eds) (2021) Learning outcomes of classroom research. L’Ordine
Novo Publication, Tamil Nadu, India. ISBN: 978-93-92995-15-6
Kao CY, Loh CH (2013) Monitoring of long-term static deformation data of Fei-Tsui arch dam using
artificial neural network-based approaches. Struct Control Health Monit 20(3):282–303
Karniadakis GE, Kevrekidis IG, Lu L, Perdikaris P, Wang S, Yang L (2021) Physics-informed
machine learning. Nat Rev Phys 3(6):422–440
Karpatne A, Atluri G, Faghmous JH, Steinbach M, Banerjee A, Ganguly A, Shekhar S, Samatova
N, Kumar V (2017) Theory-guided data science: A new paradigm for scientific discovery from
data. IEEE Trans Knowl Data Eng 29(10):2318–2331
Kashinath K, Mustafa M, Albert A, Wu J, Jiang C, Esmaeilzadeh S, Azizzadenesheli K, Wang
R, Chattopadhyay A, Singh A et al (2021) Physics-informed machine learning: case studies for
weather and climate modelling. Philos Trans R Soc A 379(2194):20200093
Khan S, Awan MJ (2018) A generative design technique for exploring shape variations. Adv Eng
Inform 38:712–724
Khosravi K, Shahabi H, Pham BT, Adamowski J, Shirzadi A, Pradhan B, Dou J, Ly HB, Gróf G, Ho
HL et al (2019) A comparative assessment of flood susceptibility modeling using multi-criteria
decision-making analysis and machine learning methods. J Hydrol 573:311–323
Khurana S, Saxena S, Jain S, Dixit A (2021) Predictive modeling of engine emissions using machine
learning: A review. Mater Today: Proc 38:280–284
Kim P (2017) MATLAB deep learning: with machine learning, neural networks and artificial intelli-
gence. Apress
Kim JW, Lee BH, Shaw MJ, Chang HL, Nelson M (2001) Application of decision-tree induction
techniques to personalized advertisements on internet storefronts. Int J Electron Commer 5(3):45–
62
Kim C, Batra R, Chen L, Tran H, Ramprasad R (2021) Polymer design using genetic algorithm and
machine learning. Comput Mater Sci 186:110067
Kim Y, Park HK, Jung J, Asghari-Rad P, Lee S, Kim JY, Jung HG, Kim HS (2021) Exploration
of optimal microstructure and mechanical properties in continuous microstructure space using a
variational autoencoder. Mater Des 202:109544
King DE (2009) Dlib-ml: a machine learning toolkit. J Mach Learn Res 10:1755–1758
Kirchdoerfer T, Ortiz M (2016) Data-driven computational mechanics. Comput Methods Appl Mech
Eng 304:81–101
Kirchdoerfer T, Ortiz M (2017) Data driven computing with noisy material data sets. Comput
Methods Appl Mech Eng 326:622–641
Klein DK, Fernández M, Martin RJ, Neff P, Weeger O (2022) Polyconvex anisotropic hyperelasticity
with neural networks. J Mech Phys Solids 159:104703
Kleinbaum DG, Dietz K, Gail M, Klein M, Klein M (2002) Logistic regression. Springer, Berlin
Ko J, Ni YQ (2005) Technology developments in structural health monitoring of large-scale bridges.
Eng Struct 27(12):1715–1725
Kochkov D, Smith JA, Alieva A, Wang Q, Brenner MP, Hoyer S (2021) Machine learning-
accelerated computational fluid dynamics. Proc Natl Acad Sci 118(21):e2101784118
Kolodziejczyk F, Mortazavi B, Rabczuk T, Zhuang X (2021) Machine learning assisted multiscale
modeling of composite phase change materials for li-ion batteries’ thermal management. Int J
Heat Mass Transf 172:121199
Kontolati K, Alix-Williams D, Boffi NM, Falk ML, Rycroft CH, Shields MD (2021) Manifold learn-
ing for coarse-graining atomistic simulations: Application to amorphous solids. Acta Materialia
215:117008
Kossaifi J, Panagakis Y, Anandkumar A, Pantic M (2016) TensorLy: Tensor learning in Python.
arXiv:1610.09555
Koumoulos E, Konstantopoulos G, Charitidis C (2019) Applying machine learning to nanoinden-
tation data of (nano-) enhanced composites. Fibers 8(1):3
Kralovec C, Schagerl M (2020) Review of structural health monitoring methods regarding a multi-
sensor approach for damage assessment of metal and composite structures. Sensors 20(3):826
Kramer MA (1991) Nonlinear principal component analysis using autoassociative neural networks.
AIChE J 37(2):233–243
Krzeczkowski SA (1980) Measurement of liquid droplet disintegration mechanisms. Int J Multiph
Flow 6(3):227–239
Kulkrni KS, Kim DK, Sekar S, Samui P (2011) Model of least square support vector machine
(LSSVM) for prediction of fracture parameters of concrete. Int J Concr Struct Mater 5(1):29–33
Ladevèze P, Néron D, Gerbaud PW (2019) Data-driven computation for history-dependent materials.
Comptes Rendus Mécanique 347(11):831–844
Laflamme S, Cao L, Chatzi E, Ubertini F (2016) Damage detection and localization from dense
network of strain sensors. Shock Vib 2016:2562946
Lakshminarayan K, Harp SA, Goldman RP, Samad T et al (1996) Imputation of missing data using
machine learning techniques. In: KDD, vol 96. https://cdn.aaai.org/KDD/1996/KDD96-023.pdf
Langley P et al (2011) The changing science of machine learning. Mach Learn 82(3):275–279
Lantz B (2019) Machine learning with R: expert techniques for predictive modeling. Packt Pub-
lishing Ltd
Latorre M, Montáns FJ (2013) Extension of the Sussman-Bathe spline-based hyperelastic model to
incompressible transversely isotropic materials. Comput Struct 122:13–26
Latorre M, Montáns FJ (2017) WYPiWYG hyperelasticity without inversion formula: Application
to passive ventricular myocardium. Comput Struct 185:47–58
Latorre M, Montáns FJ (2020) Experimental data reduction for hyperelasticity. Comput Struct
232:105919
Latorre M, De Rosa E, Montáns FJ (2017) Understanding the need of the compression branch to
characterize hyperelastic materials. Int J Non-Linear Mech 89:14–24
Latorre M, Peña E, Montáns FJ (2017) Determination and finite element validation of the WYPI-
WYG strain energy of superficial fascia from experimental data. Ann Biomed Eng 45(3):799–810
Latorre M, Mohammadkhah M, Simms CK, Montáns FJ (2018) A continuum model for tension-
compression asymmetry in skeletal muscle. J Mech Behav Biomed Mater 77:455–460
Le QV, Ngiam J, Coates A, Lahiri A, Prochnow B, Ng AY (2011) On optimization methods for deep
learning. In: ICML’11: Proceedings of the 28th international conference on machine learning, pp
265–272
Lee J, Kim J, Yun CB, Yi J, Shim J (2002) Health-monitoring method for bridges under ordinary
traffic loadings. J Sound Vib 257(2):247–264
Lee DW, Hong SH, Cho SS, Joo WS (2005) A study on fatigue damage modeling using neural
networks. J Mech Sci Technol 19(7):1393–1404
Lewandowski JJ, Seifi M (2016) Metal additive manufacturing: a review of mechanical properties.
Annu Rev Mater Res 46:151–186
Lewis FL, Liu D (2013) Reinforcement learning and approximate dynamic programming for feed-
back control. Wiley, New York
Leygue A, Coret M, Réthoré J, Stainier L, Verron E (2018) Data-based derivation of material
response. Comput Methods Appl Mech Eng 331:184–196
Li Z, Zhou X, Liu W, Niu Q, Kong C (2016) A similarity-based reuse system for injection mold
design in automotive interior industry. Int J Adv Manuf Technol 87(5):1783–1795
Li B, Huang C, Li X, Zheng S, Hong J (2019) Non-iterative structural topology optimization using
deep learning. Comput-Aided Des 115:172–180
Li X, Roth CC, Mohr D (2019) Machine-learning based temperature- and rate-dependent plasticity
model: application to analysis of fracture experiments on DP steel. Int J Plast 118:320–344
Li G, Liu Q, Zhao S, Qiao W, Ren X (2020) Automatic crack recognition for concrete bridges using
a fully convolutional neural network and Naive Bayes data fusion based on a visual detection
system. Meas Sci Technol 31(7):075403
Li Y, Bao T, Chen H, Zhang K, Shu X, Chen Z, Hu Y (2021) A large-scale sensor missing data
imputation framework for dams using deep learning and transfer learning strategy. Measurement
178:109377
Li Y, Bao T, Chen Z, Gao Z, Shu X, Zhang K (2021) A missing sensor measurement data recon-
struction framework powered by multi-task Gaussian process regression for dam structural health
monitoring systems. Measurement 186:110085
Li Z, Kovachki N, Azizzadenesheli K, Liu B, Bhattacharya K, Stuart A, Anandkumar A (2020)
Fourier neural operator for parametric partial differential equations. arXiv:2010.08895
Lin X, Si Z, Fu W, Yang J, Guo S, Cao Y, Zhang J, Wang X, Liu P, Jiang K et al (2018) Intelligent
identification of two-dimensional nanostructures by machine-learning optical microscopy. Nano
Res 11(12):6316–6324
Ling J, Jones R, Templeton J (2016) Machine learning strategies for systems with invariance prop-
erties. J Comput Phys 318:22–35
Liu WK, Gan Z, Fleming M (2021) Knowledge-driven dimension reduction and reduced order
surrogate models. In: Mechanistic data science for stem education and applications. Springer,
Berlin, pp 131–170
Liu J, Musialski P, Wonka P, Ye J (2012) Tensor completion for estimating missing values in visual
data. IEEE Trans Pattern Anal Mach Intell 35(1):208–220
Liu N, Wang Z, Sun M, Wang H, Wang B (2018) Numerical simulation of liquid droplet breakup
in supersonic flows. Acta Astronaut 145:116–130
Liu HH, Zhang J, Liang F, Temizel C, Basri MA, Mesdour R (2021) Incorporation of physics into
machine learning for production prediction from unconventional reservoirs: A brief review of the
gray-box approach. SPE Reserv Eval Eng 24(04):847–858
Liu X, Tian S, Tao F, Yu W (2021) A review of artificial neural networks in the constitutive modeling
of composite materials. Compos Part B: Eng 224:109152
Liu B, Vu-Bac N, Zhuang X, Fu X, Rabczuk T (2022) Stochastic full-range multiscale modeling of
thermal conductivity of polymeric carbon nanotubes composites: A machine learning approach.
Compos Struct 289:115393
Liu Y, Kutz JN, Brunton SL (2020) Hierarchical deep learning of multiscale differential equation
time-steppers. arXiv:2008.09768
Liu Y, Ponce C, Brunton SL, Kutz JN (2022) Multiresolution convolutional autoencoders. J Comput
Phys 111801
Liu P, Sun S (1997) The application of artificial neural networks on the health monitoring of bridges.
Structural Health Monitoring, Current Status and Perspectives, pp 103–110
Logarzo HJ, Capuano G, Rimoli JJ (2021) Smart constitutive laws: Inelastic homogenization
through machine learning. Comput Methods Appl Mech Eng 373:113482
Lopez E, Gonzalez D, Aguado J, Abisset-Chavanne E, Cueto E, Binetruy C, Chinesta F (2018) A
manifold learning approach for integrated computational materials engineering. Arch Comput
Methods Eng 25(1):59–68
Lore KG, Stoecklein D, Davies M, Ganapathysubramanian B, Sarkar S (2015) Hierarchical fea-
ture extraction for efficient design of microfluidic flow patterns. In: Feature extraction: modern
questions and challenges. PMLR, pp 213–225
Lorente L, Vega J, Velazquez A (2008) Generation of aerodynamics databases using high-order
singular value decomposition. J Aircr 45(5):1779–1788
Lu L, Jin P, Pang G, Zhang Z, Karniadakis GE (2021) Learning nonlinear operators via DeepONet
based on the universal approximation theorem of operators. Nat Mach Intell 3(3):218–229
Lu X, Liao W, Zhang Y, Huang Y (2022) Intelligent structural design of shear wall residence using
physics-enhanced generative adversarial networks. Earthq Eng Struct Dyn 51(7):1657–1676
Luo H, Paal SG (2019) A locally weighted machine learning model for generalized prediction of drift
capacity in seismic vulnerability assessments. Comput-Aided Civ Infrastruct Eng 34(11):935–
950
Luo W, Hu T, Ye Y, Zhang C, Wei Y (2020) A hybrid predictive maintenance approach for CNC
machine tool driven by digital twin. Robot Comput-Integr Manuf 65:101974
Lynch ME, Sarkar S, Maute K (2019) Machine learning to aid tuning of numerical parameters in
topology optimization. J Mech Des 141(11)
Málaga-Chuquitaype C (2022) Machine learning in structural design: an opinionated review. Fron-
tiers Built Environ 8:815717
Malik M, Malik MK, Mehmood K, Makhdoom I (2021) Automatic speech recognition: a survey.
Multimed Tools Appl 80(6):9411–9457
Mallat S (2016) Understanding deep convolutional networks. Philos Trans R Soc A: Math, Phys
Eng Sci 374(2065):20150203
Manavalan M (2020) Intersection of artificial intelligence, machine learning, and internet of things-
an economic overview. Glob Discl Econ Bus 9(2):119–128
Mangalathu S, Jang H, Hwang SH, Jeon JS (2020) Data-driven machine-learning-based seismic
failure mode identification of reinforced concrete shear walls. Eng Struct 208:110331
Marr B (2019) Artificial intelligence in practice: how 50 successful companies used AI and machine
learning to solve problems. Wiley, New York
Martín CA, Méndez AC, Sainges O, Petiot E, Barasinski A, Piana M, Ratier L, Chinesta F (2020)
Empowering design based on hybrid twin(TM): Application to acoustic resonators. Designs
4(4):44
Massaroli S, Poli M, Califano F, Faragasso A, Park J, Yamashita A, Asama H (2019) Port–
Hamiltonian approach to neural network training. In: 2019 IEEE 58th conference on decision
and control (CDC). IEEE, pp 6799–6806
Mattheakis M, Sondak D, Dogra AS, Protopapas P (2022) Hamiltonian neural networks for solving
equations of motion. Phys Rev E 105(6):065305
Maulik R, Lusch B, Balaprakash P (2021) Reduced-order modeling of advection-dominated systems
with recurrent neural networks and convolutional autoencoders. Phys Fluids 33(3):037106
Mayani MG, Svendsen M, Oedegaard S (2018) Drilling digital twin success stories the last 10 years.
In: SPE Norway one day seminar. OnePetro
McCoy LG, Brenna CT, Chen SS, Vold K, Das S (2022) Believing in black boxes: Machine learning
for healthcare does not need explainability to be evidence-based. J Clin Epidemiol 142:252–257
McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull
Math Biophys 5(4):115–133
McInnes L, Healy J, Melville J (2018) UMAP: Uniform manifold approximation and projection
for dimension reduction. arXiv:1802.03426
Mehrjoo M, Khaji N, Moharrami H, Bahreininejad A (2008) Damage detection of truss bridge
joints using artificial neural networks. Expert Syst Appl 35(3):1122–1131
Meijer RJ, Goeman JJ (2013) Efficient approximate k-fold and leave-one-out cross-validation for
ridge regression. Biom J 55(2):141–155
Meng L, Breitkopf P, Quilliec GL, Raghavan B, Villon P (2018) Nonlinear shape-manifold learning
approach: concepts, tools and applications. Arch Comput Methods Eng 25(1):1–21
Meng L, McWilliams B, Jarosinski W, Park HY, Jung YG, Lee J, Zhang J (2020) Machine learning
in additive manufacturing: a review. JOM 72(6):2363–2377
Meng X, Li Z, Zhang D, Karniadakis GE (2020) PPINN: Parareal physics-informed neural network
for time-dependent PDEs. Comput Methods Appl Mech Eng 370:113250
Miao P, Yokota H (2022) Comparison of Markov chain and recurrent neural network in predicting
bridge deterioration considering various factors. Struct Infrastruct Eng 1–13.
https://doi.org/10.1080/15732479.2022.2087691
Miao P, Yokota H, Zhang Y (2023) Deterioration prediction of existing concrete bridges using a
LSTM recurrent neural network. Struct Infrastruct Eng 19(4):475–489
Michalski RS, Carbonell JG, Mitchell TM (2013) Machine learning: An artificial intelligence
approach. Springer Science & Business Media
Miñano M, Montáns FJ (2015) A new approach to modeling isotropic damage for Mullins effect in
hyperelastic materials. Int J Solids Struct 67:272–282
Miñano M, Montáns FJ (2018) WYPiWYG damage mechanics for soft materials: A data-driven
approach. Arch Comput Methods Eng 25(1):165–193
Mishra M (2021) Machine learning techniques for structural health monitoring of heritage buildings:
A state-of-the-art review and case studies. J Cult Herit 47:227–245
Mitchell M (1998) An introduction to genetic algorithms. MIT Press
Miyazawa Y, Briffod F, Shiraiwa T, Enoki M (2019) Prediction of cyclic stress-strain property of
steels by crystal plasticity simulations and machine learning. Materials 12(22):3668
Mohammadzadeh S, Kim Y, Ahn J (2015) PCA-based neuro-fuzzy model for system identification
of smart structures. J Smart Struct Syst 15(5):1139–1158
Molnar C, Casalicchio G, Bischl B (2018) iml: An R package for interpretable machine learning. J
Open Source Softw 3(26):786
Monostori L, Márkus A, Van Brussel H, Westkämpfer E (1996) Machine learning approaches to
manufacturing. CIRP Ann 45(2):675–712
Morandin R, Nicodemus J, Unger B (2022) Port-Hamiltonian dynamic mode decomposition.
arXiv:2204.13474
Moreno S, Amores VJ, Benítez JM, Montáns FJ (2020) Reverse-engineering and modeling the 3d
passive and active responses of skeletal muscle using a data-driven, non-parametric, spline-based
procedure. J Mech Behav Biomed Mater 110:103877
Moroni D, Pascali MA (2021) Learning topology: bridging computational topology and machine
learning. Pattern Recognit Image Anal 31(3):443–453
Moya B, Alfaro I, Gonzalez D, Chinesta F, Cueto E (2020) Physically sound, self-learning digital
twins for sloshing fluids. Plos One 15(6):e0234569
Moya B, Badías A, Alfaro I, Chinesta F, Cueto E (2022) Digital twins that learn and correct
themselves. Int J Numer Methods Eng 123(13):3034–3044
Mozaffar M, Bostanabad R, Chen W, Ehmann K, Cao J, Bessa M (2019) Deep learning predicts
path-dependent plasticity. Proc Natl Acad Sci 116(52):26414–26420
Mukherjee S, Lu D, Raghavan B, Breitkopf P, Dutta S, Xiao M, Zhang W (2021) Accelerating
large-scale topology optimization: state-of-the-art and challenges. Arch Comput Methods Eng
28(7):4549–4571
Muñoz D, Nadal E, Albelda J, Chinesta F, Ródenas J (2022) Allying topology and shape optimization
through machine learning algorithms. Finite Elem Anal Des 204:103719
Murata T, Fukami K, Fukagata K (2020) Nonlinear mode decomposition with convolutional neural
networks for fluid dynamics. J Fluid Mech 882:A13
Murphy KP (2012) Machine Learning: a probabilistic perspective. MIT Press. ISBN 978-
0262018029
Muthali A, Laine F, Tomlin C (2021) Incorporating data uncertainty in object tracking algorithms.
arXiv:2109.10521
Nascimento RG, Viana FA (2020) Cumulative damage modeling with recurrent neural networks.
AIAA J 58(12):5459–5471
Nasiri S, Khosravani MR, Weinberg K (2017) Fracture mechanics and mechanical fault detection
by artificial intelligence methods: A review. Eng Fail Anal 81:270–293
Nassif AB, Talib MA, Nassir Q, Albadani H, Albab FD (2021) Machine learning for cloud security:
a systematic review. IEEE Access 9:20717–20735
Nawafleh N, AL-Oqla FM (2022) Artificial neural network for predicting the mechanical perfor-
mance of additive manufacturing thermoset carbon fiber composite materials. J Mech Behav
Mater 31(1):501–513
Nayak HD, Anvitha L, Shetty A, D’Souza DJ, Abraham MP et al (2021) Fraud detection in online
transactions using machine learning approaches—a review. Adv Artif Intell Data Engg 589–599
Nayak S, Lyngdoh GA, Shukla A, Das S (2022) Predicting the near field underwater explosion
response of coated composite cylinders using multiscale simulations, experiments, and machine
learning. Compos Struct 283:115157
Nguyen LTK, Keip MA (2018) A data-driven approach to nonlinear elasticity. Comput Struct
194:97–115
Nguyen DH, Nguyen QB, Bui-Tien T, De Roeck G, Wahab MA (2020) Damage detection in girder
bridges using modal curvatures gapped smoothing method and convolutional neural network:
Application to Bo Nghi bridge. Theor Appl Fract Mech 109:102728
Nguyen-Thanh VM, Zhuang X, Rabczuk T (2020) A deep energy method for finite deformation
hyperelasticity. Eur J Mech-A/Solids 80:103874
Ni YQ, Jiang S, Ko JM (2001) Application of adaptive probabilistic neural network to damage detec-
tion of Tsing Ma suspension bridge. In: Health monitoring and management of civil infrastructure
systems, vol 4337. SPIE, pp 347–356
Ni F, Zhang J, Noori MN (2020) Deep learning for data anomaly detection and data compression
of a long-span suspension bridge. Comput-Aided Civ Infrastruct Eng 35(7):685–700
Nick H, Aziminejad A, Hosseini MH, Laknejadi K (2021) Damage identification in steel girder
bridges using modal strain energy-based damage index method and artificial neural network. Eng
Fail Anal 119:105010
Oh S, Jung Y, Kim S, Lee I, Kang N (2019) Deep generative design: Integration of topology
optimization and generative models. J Mech Des 141(11):111405
Olivier A, Shields MD, Graham-Brady L (2021) Bayesian neural networks for uncertainty quan-
tification in data-driven materials modeling. Comput Methods Appl Mech Eng 386:114079
Omairi A, Ismail ZH (2021) Towards machine learning for error compensation in additive manu-
facturing. Appl Sci 11(5):2375
Ongsulee P (2017) Artificial intelligence, machine learning and deep learning. In: 2017 15th inter-
national conference on ICT and knowledge engineering (ICT&KE). IEEE, pp 1–6
Paluszek M, Thomas S (2016) MATLAB machine learning. Apress
Panagiotopoulos P, Waszczyszyn Z (1999) The neural network approach in plasticity and fracture
mechanics. In: Neural networks in the analysis and design of structures. Springer, Berlin, pp
161–195
Panakkat A, Adeli H (2009) Recurrent neural network for approximate earthquake time and location
prediction using multiple seismicity indicators. Comput-Aided Civ Infrastruct Eng 24(4):280–
292
Pang G, Lu L, Karniadakis GE (2019) fPINNs: Fractional physics-informed neural networks. SIAM
J Sci Comput 41(4):A2603–A2626
Paszkowicz W (2009) Genetic algorithms, a nature-inspired tool: survey of applications in materials
science and related fields. Mater Manuf Process 24(2):174–197
Pathak J, Subramanian S, Harrington P, Raja S, Chattopadhyay A, Mardani M, Kurth T, Hall D,
Li Z, Azizzadenesheli K et al (2022) Fourcastnet: A global data-driven high-resolution weather
model using adaptive fourier neural operators. arXiv:2202.11214
Pathan M, Ponnusami S, Pathan J, Pitisongsawat R, Erice B, Petrinic N, Tagarielli V (2019) Pre-
dictions of the mechanical properties of unidirectional fibre composites by supervised machine
learning. Sci Rep 9(1):1–10
Ding S, Lin L, Wang G, Chao H (2015) Deep feature learning with relative distance comparison
for person re-identification. Pattern Recognit 48(10):2993–3003
Pawar S, San O, Nair A, Rasheed A, Kvamsdal T (2021) Model fusion with physics-guided machine
learning: Projection-based reduced-order modeling. Phys Fluids 33(6):067123
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer
P, Weiss R, Dubourg V et al (2011) Scikit-learn: Machine learning in Python. J Mach Learn Res
12:2825–2830
Peng GC, Alber M, Buganza Tepole A, Cannon WR, De S, Dura-Bernal S, Garikipati K, Karniadakis
G, Lytton WW, Perdikaris P et al (2021) Multiscale modeling meets machine learning: What can
we learn? Arch Comput Methods Eng 28(3):1017–1037
Penumuru DP, Muthuswamy S, Karumbu P (2020) Identification and classification of materials using
machine vision and machine learning in the context of industry 4.0. J Intell Manuf 31(5):1229–
1241
Pereira DR, Piteri MA, Souza AN, Papa JP, Adeli H (2020) Fema: A finite element machine for
fast learning. Neural Comput Appl 32(10):6393–6404
Pham DT, Afify AA (2005) Machine-learning techniques and their applications in manufacturing.
Proc Inst Mech Eng, Part B: J Eng Manuf 219(5):395–412
Platzer A, Leygue A, Stainier L, Ortiz M (2021) Finite element solver for data-driven finite strain
elasticity. Comput Methods Appl Mech Eng 379:113756
Proctor JL, Brunton SL, Kutz JN (2016) Dynamic mode decomposition with control. SIAM J Appl
Dyn Syst 15(1):142–161
Qin J, Hu F, Liu Y, Witherell P, Wang CC, Rosen DW, Simpson T, Lu Y, Tang Q (2022) Research
and application of machine learning for additive manufacturing. Addit Manuf 102691
Quqa S, Martakis P, Movsessian A, Pai S, Reuland Y, Chatzi E (2022) Two-step approach for fatigue
crack detection in steel bridges using convolutional neural networks. J Civ Struct Health Monit
12(1):127–140
Rabin N, Fishelov D (2017) Missing data completion using diffusion maps and Laplacian pyramids.
In: International conference on computational science and its applications. Springer, Berlin, pp
284–297
Rai R, Sahu CK (2020) Driven by data or derived through physics? a review of hybrid physics guided
machine learning techniques with cyber-physical system (CPS) focus. IEEE Access 8:71050–
71073
Raissi M, Karniadakis GE (2018) Hidden physics models: Machine learning of nonlinear partial
differential equations. J Comput Phys 357:125–141
Raissi M, Perdikaris P, Karniadakis GE (2019) Physics-informed neural networks: A deep learn-
ing framework for solving forward and inverse problems involving nonlinear partial differential
equations. J Comput Phys 378:686–707
Raj R, Tiwari MK, Ivanov D, Dolgui A (2021) Machine learning and Industry 4.0 applications. Int
J Prod Res 59(16):4773–4778
Ramoni M, Sebastiani P (2001) Robust learning with missing data. Mach Learn 45(2):147–170
Randle D, Protopapas P, Sondak D (2020) Unsupervised learning of solutions to differential equa-
tions with generative adversarial networks. arXiv:2007.11133
Ranković V, Grujović N, Divac D, Milivojević N, Novaković A (2012) Modelling of dam behaviour
based on neuro-fuzzy identification. Eng Struct 35:107–113
Rao C, Sun H, Liu Y (2021) Physics-informed deep learning for computational elastodynamics
without labeled data. J Eng Mech 147(8):04021043
Raschka S (2015) Python machine learning. Packt Publishing Ltd
Raschka S, Mirjalili V (2019) Python machine learning: Machine learning and deep learning with
Python, scikit-learn, and TensorFlow 2. Packt Publishing Ltd
Razvi SS, Feng S, Narayanan A, Lee YTT, Witherell P (2019) A review of machine learning
applications in additive manufacturing. In: International design engineering technical conferences
and computers and information in engineering conference, vol 59179, p V001T02A040. American
Society of Mechanical Engineers
Reagan D, Sabato A, Niezrecki C (2018) Feasibility of using digital image correlation for unmanned
aerial vehicle structural health monitoring of bridges. Struct Health Monit 17(5):1056–1072
Regan T, Canturk R, Slavkovsky E, Niezrecki C, Inalpolat M (2016) Wind turbine blade dam-
age detection using various machine learning algorithms. In: International design engineering
technical conferences and computers and information in engineering conference, vol 50206, p
V008T10A040. American Society of Mechanical Engineers
Regazzoni F, Dedè L, Quarteroni A (2020) Machine learning of multiscale active force generation
models for the efficient simulation of cardiac electromechanics. Comput Methods Appl Mech
Eng 370:113268
Ren T, Wang L, Chang C, Li X (2020) Machine learning-assisted multiphysics coupling performance
optimization in a photocatalytic hydrogen production system. Energy Convers Manag 216:112935
Rice L, Wong E, Kolter Z (2020) Overfitting in adversarially robust deep learning. In: International
conference on machine learning, pp 8093–8104. PMLR
Rocha I, Kerfriden P, van der Meer F (2020) Micromechanics-based surrogate models for the
response of composites: a critical comparison between a classical mesoscale constitutive model,
hyper-reduction and neural networks. Eur J Mech-A/Solids 82:103995
Rocha I, Kerfriden P, van der Meer F (2021) On-the-fly construction of surrogate constitutive models
for concurrent multiscale mechanical analysis through probabilistic machine learning. J Comput
Phys: X 9:100083
Rodríguez M, Kramer T (2019) Machine learning of two-dimensional spectroscopic data. Chem
Phys 520:52–60
Rogers T, Holmes G, Cross E, Worden K (2017) On a grey box modelling framework for nonlinear
system identification. In: Special topics in structural dynamics, vol 6, pp 167–178. Springer,
Berlin
Roisman I, Breitenbach J, Tropea C (2018) Thermal atomisation of a liquid drop after impact onto
a hot substrate. J Fluid Mech 842:87–101
Romero X, Latorre M, Montáns FJ (2017) Determination of the WYPiWYG strain energy density
of skin through finite element analysis of the experiments on circular specimens. Finite Elem
Anal Des 134:1–15
Rosafalco L, Torzoni M, Manzoni A, Mariani S, Corigliano A (2021) Online structural health
monitoring by model order reduction and deep learning algorithms. Comput Struct 255:106604
Rosenblatt F (1958) The perceptron: a probabilistic model for information storage and organization
in the brain. Psychol Rev 65(6):386
Rosti A, Rota M, Penna A (2022) An empirical seismic vulnerability model. Bull Earthq Eng
20:4147–4173
Roweis ST, Saul LK (2000) Nonlinear dimensionality reduction by locally linear embedding. Sci-
ence 290(5500):2323–2326
Rowley CW (2005) Model reduction for fluids, using balanced proper orthogonal decomposition.
Int J Bifurc Chaos 15(03):997–1013
Rubio PB, Chamoin L, Louf F (2021) Real-time data assimilation and control on mechanical systems
under uncertainties. Adv Model Simul Eng Sci 8(1):1–25
Rudy SH, Brunton SL, Proctor JL, Kutz JN (2017) Data-driven discovery of partial differential
equations. Sci Adv 3(4):e1602614
Ruggieri S, Cardellicchio A, Leggieri V, Uva G (2021) Machine-learning based vulnerability anal-
ysis of existing buildings. Autom Constr 132:103936
Salazar F, Toledo M, Oñate E, Morán R (2015) An empirical comparison of machine learning
techniques for dam behaviour modelling. Struct Saf 56:9–17
Salazar F, Toledo MÁ, Oñate E, Suárez B (2016) Interpretation of dam deformation and leakage
with boosted regression trees. Eng Struct 119:230–251
Salazar F, Morán R, Toledo MÁ, Oñate E (2017) Data-based models for the prediction of dam
behaviour: a review and some methodological considerations. Arch Comput Methods Eng
24(1):1–21
Salloum SA, Alshurideh M, Elnagar A, Shaalan K (2020) Machine learning and deep learning
techniques for cybersecurity: a review. In: The international conference on artificial intelligence
and computer vision. Springer, Berlin, pp 50–57
Salman O, Elhajj IH, Kayssi A, Chehab A (2020) A review on machine learning-based approaches
for internet traffic classification. Ann Telecommun 75(11):673–710
Salman O, Elhajj IH, Chehab A, Kayssi A (2022) A machine learning based framework for
IoT device identification and abnormal traffic detection. Trans Emerg Telecommun Technol
33(3):e3743
Sancarlos A, Cameron M, Le Peuvedic JM, Groulier J, Duval JL, Cueto E, Chinesta F (2021)
Learning stable reduced-order models for hybrid twins. Data-Centric Eng 2:e10
Sankarasrinivasan S, Balasubramanian E, Karthik K, Chandrasekar U, Gupta R (2015) Health
monitoring of civil structures with integrated uav and image processing system. Procedia Comput
Sci 54:508–515
Sarmadi H, Karamodin A (2020) A novel anomaly detection method based on adaptive mahalanobis-
squared distance and one-class knn rule for structural health monitoring under environmental
effects. Mech Syst Signal Process 140:106495
Scardovelli R, Zaleski S (1999) Direct numerical simulation of free-surface and interfacial flow.
Annu Rev Fluid Mech 31(1):567–603
Schmid PJ (2010) Dynamic mode decomposition of numerical and experimental data. J Fluid Mech
656:5–28
Schmid PJ (2011) Application of the dynamic mode decomposition to experimental data. Exp Fluids
50(4):1123–1130
Schmid PJ, Li L, Juniper MP, Pust O (2011) Applications of the dynamic mode decomposition.
Theor Comput Fluid Dyn 25(1):249–259
Schmidt M, Lipson H (2009) Distilling free-form natural laws from experimental data. Science
324(5923):81–85
Schmidt M, Lipson H (2010) Symbolic regression of implicit equations. In: Genetic programming
theory and practice VII. Springer, Berlin, pp 73–85
Schölkopf B, Herbrich R, Smola AJ (2001) A generalized representer theorem. In: International
conference on computational learning theory. Springer, Berlin, pp 416–426
Schölkopf B, Smola A, Müller KR (1997) Kernel principal component analysis. In: International
conference on artificial neural networks. Springer, Berlin, pp 583–588
Searson D (2009) GPTIPS: Genetic programming and symbolic regression for matlab. https://sites.
google.com/site/gptips4matlab/?pli=1
Seff A, Zhou W, Richardson N, Adams RP (2021) Vitruvion: A generative model of parametric cad
sketches. arXiv:2109.14124
Seibi A, Al-Alawi S (1997) Prediction of fracture toughness using artificial neural networks (anns).
Eng Fracture Mech 56(3):311–319
Sevieri G, De Falco A (2020) Dynamic structural health monitoring for concrete gravity dams based
on the bayesian inference. J Civ Struct Health Monit 10(2):235–250
Sharma S, Bhatt M, Sharma P (2020) Face recognition system using machine learning algorithm.
In: 2020 5th international conference on communication and electronics systems (ICCES). IEEE,
pp 1162–1168
Sharp M, Ak R, Hedberg T Jr (2018) A survey of the advancing use and development of machine
learning in smart manufacturing. J Manuf Syst 48:170–179
Shihavuddin A, Chen X, Fedorov V, Nymark Christensen A, Andre Brogaard Riis N, Branner K,
Bjorholm Dahl A, Reinhold Paulsen R (2019) Wind turbine surface damage detection by deep
learning aided drone inspection analysis. Energies 12(4):676
Shin D, Yoo S, Lee S, Kim M, Hwang KH, Park JH, Kang N (2021) How to trade off aesthetics
and performance in generative design? In: The 2021 world congress on advances in structural
engineering and mechanics (ASEM21). IASEM, KAIST, KTA, SNU DAAE
Shorten C, Khoshgoftaar TM (2019) A survey on image data augmentation for deep learning. J Big
Data 6(1):1–48
Shu D, Cunningham J, Stump G, Miller SW, Yukish MA, Simpson TW, Tucker CS (2020) 3D design
using generative adversarial networks and physics-based validation. J Mech Des 142(7):071701
Sigmund O (2009) Systematic design of metamaterials by topology optimization. In: IUTAM sym-
posium on modelling nanomaterials and nanosystems. Springer, Berlin, pp 151–159
Silva M, Santos A, Figueiredo E, Santos R, Sales C, Costa JC (2016) A novel unsupervised approach
based on a genetic algorithm for structural damage detection in bridges. Eng Appl Artif Intell
52:168–180
Simpson T, Dervilis N, Chatzi E (2021) Machine learning approach to model order reduction of
nonlinear systems via autoencoder and LSTM networks. J Eng Mech 147(10):04021061
Singh AP, Medida S, Duraisamy K (2017) Machine-learning-augmented predictive modeling of
turbulent separated flows over airfoils. AIAA J 55(7):2215–2227
Singh H, Gupta M, Mahajan P (2017) Reduced order multiscale modeling of fiber reinforced
polymer composites including plasticity and damage. Mech Mater 111:35–56
Sirca G Jr, Adeli H (2012) System identification in structural engineering. Scientia Iranica
19(6):1355–1364
Soize C, Ghanem R (2020) Physics-constrained non-Gaussian probabilistic learning on manifolds.
Int J Numer Methods Eng 121(1):110–145
Sordoni A, Bengio Y, Vahabi H, Lioma C, Grue Simonsen J, Nie JY (2015) A hierarchical recurrent
encoder-decoder for generative context-aware query suggestion. In: Proceedings of the 24th ACM
international conference on information and knowledge management, pp 553–562
Sorini A, Pineda EJ, Stuckner J, Gustafson PA (2021) A convolutional neural network for multiscale
modeling of composite materials. In: AIAA Scitech 2021 Forum, p 0310
Spalart P, Allmaras S (1992) A one-equation turbulence model for aerodynamic flows. In: 30th
aerospace sciences meeting and exhibit, p 439
Speziale CG (1998) Turbulence modeling for time-dependent RANS and VLES: a review. AIAA J
36(2):173–184
Sprangers O, Babuška R, Nageshrao SP, Lopes GA (2014) Reinforcement learning for port-
Hamiltonian systems. IEEE Trans Cybern 45(5):1017–1027
Stahl BC (2021) Artificial intelligence for a better future: an ecosystem perspective on the ethics
of ai and emerging digital technologies. Springer Nature
Stainier L, Leygue A, Ortiz M (2019) Model-free data-driven methods in mechanics: material data
identification and solvers. Comput Mech 64(2):381–393
Stančin I, Jović A (2019) An overview and comparison of free Python libraries for data mining
and big data analysis. In: 2019 42nd international convention on information and communication
technology, electronics and microelectronics (MIPRO). IEEE, pp 977–982
Stevens B, Colonius T (2020) Enhancement of shock-capturing methods via machine learning.
Theor Comput Fluid Dyn 34(4):483–496
Stoll A, Benner P (2021) Machine learning for material characterization with an application for
predicting mechanical properties. GAMM-Mitteilungen 44(1):e202100003
Straus J, Skogestad S (2017) Variable reduction for surrogate modelling. In: Proceedings of the
foundations of computer-aided process operations. Tucson, AZ, USA, pp 8–12
Ströfer CM, Wu J, Xiao H, Paterson E (2018) Data-driven, physics-based feature extraction from
fluid flow fields using convolutional neural networks. Commun Comput Phys 25(3):625–650
Sun H, Burton HV, Huang H (2021) Machine learning applications for building structural design
and performance assessment: State-of-the-art review. J Build Eng 33:101816
Sun F, Liu Y, Sun H (2021) Physics-informed spline learning for nonlinear dynamics discovery.
arXiv:2105.02368
Surjadi JU, Gao L, Du H, Li X, Xiong X, Fang NX, Lu Y (2019) Mechanical metamaterials and
their engineering applications. Adv Eng Mater 21(3):1800864
Sussman T, Bathe KJ (2009) A model of incompressible isotropic hyperelastic material behav-
ior using spline interpolations of tension-compression test data. Commun Numer Methods Eng
25(1):53–63
Swischuk R, Allaire D (2019) A machine learning approach to aircraft sensor error detection and
correction. J Comput Inf Sci Eng 19(4):041009
Tam KMM, Moosavi V, Van Mele T, Block P (2020) Towards trans-topological design exploration of
reticulated equilibrium shell structures with graph convolution networks. In: Proceedings of IASS
annual symposia, vol 2020, pp 1–13. International Association for Shell and Spatial Structures
(IASS)
Tamke M, Nicholas P, Zwierzycki M (2018) Machine learning for architectural design: Practices
and infrastructure. Int J Arch Comput 16(2):123–143
Tang HS, Xue ST, Chen R, Sato T (2006) Online weighted LS-SVM for hysteretic structural system
identification. Eng Struct 28(12):1728–1735
Tang Q, Dang J, Cui Y, Wang X, Jia J (2022) Machine learning-based fast seismic risk assessment
of building structures. J Earthq Eng 26(15):8041–8062
Tartakovsky AM, Marrero CO, Perdikaris P, Tartakovsky GD, Barajas-Solano D (2018) Learn-
ing parameters and constitutive relationships with physics informed deep neural networks.
arXiv:1808.03398
Tharwat A (2016) Linear versus quadratic discriminant analysis classifier: a tutorial. Int J Appl
Pattern Recognit 3(2):145–180
Theocaris P, Panagiotopoulos P (1993) Neural networks for computing in fracture mechanics.
methods and prospects of applications. Comput Methods Appl Mech Eng 106(1–2):213–228
Ti Z, Deng XW, Yang H (2020) Wake modeling of wind turbines using machine learning. Appl
Energy 257:114025
Tian C, Li T, Bustillos J, Bhattacharya S, Turnham T, Yeo J, Moridi A (2021) Data-driven approaches
toward smarter additive manufacturing. Adv Intell Syst 3(12):2100014
Tibshirani R (1996) Regression shrinkage and selection via the LASSO. J R Stat Soc: Ser B (Method-
ological) 58(1):267–288
Tikhonov AN (1963) On the solution of ill-posed problems and the method of regularization (in
Russian). In: Doklady Akademii Nauk, vol 151, pp 501–504. Russian Academy of Sciences
Torky AA, Ohno S (2021) Deep learning techniques for predicting nonlinear multi-component
seismic responses of structural buildings. Comput Struct 252:106570
Trinchero R, Larbi M, Torun HM, Canavero FG, Swaminathan M (2018) Machine learning and
uncertainty quantification for surrogate models of integrated devices with a large number of
parameters. IEEE Access 7:4056–4066
Tsur EE (2020) Computer-aided design of microfluidic circuits. Annu Rev Biomed Eng 22:285–307
Tu JH (2013) Dynamic mode decomposition: Theory and applications. Ph.D. thesis, Princeton
University
Turaga P, Anirudh R, Chellappa R (2020) Manifold learning. In: Computer vision: a reference guide.
Springer, Cham. https://doi.org/10.1007/978-3-030-03243-2_824-1. https://link.springer.com/
referenceworkentry/10.1007/978-3-030-03243-2_824-1
Tzonis A, White I (2012) Automation based creative design-research and perspectives. Newnes
Vafaie H, De Jong KA (1992) Genetic algorithms as a tool for feature selection in machine learning.
In: ICTAI, pp 200–203
Valdés-Alonzo G, Binetruy C, Eck B, García-González A, Leygue A (2022) Phase distribution and
properties identification of heterogeneous materials: A data-driven approach. Comput Methods
Appl Mech Eng 390:114354
Van Der Schaft A, Jeltsema D, et al (2014) Port-Hamiltonian systems theory: An introductory
overview. Found Trends® Syst Control 1(2–3):173–378
Van Erven T, Harremos P (2014) Rényi divergence and Kullback-Leibler divergence. IEEE Trans
Inf Theory 60(7):3797–3820
Vassallo D, Krishnamurthy R, Fernando HJ (2021) Utilizing physics-based input features within a
machine learning model to predict wind speed forecasting error. Wind Energy Sci 6(1):295–309
Verkhivker GM, Agajanian S, Hu G, Tao P (2020) Allosteric regulation at the crossroads of new
technologies: multiscale modeling, networks, and machine learning. Front Mol Biosci 7:136
Vitola J, Pozo F, Tibaduiza DA, Anaya M (2017) A sensor data fusion system based on k-nearest
neighbor pattern classification for structural health monitoring applications. Sensors 17(2):417
Vivanco-Benavides LE, Martínez-González CL, Mercado-Zúñiga C, Torres-Torres C (2022)
Machine learning and materials informatics approaches in the analysis of physical properties
of carbon nanotubes: a review. Comput Mater Sci 201:110939
Vlassis NN, Sun W (2021) Sobolev training of thermodynamic-informed neural networks for inter-
pretable elasto-plasticity models with level set hardening. Comput Methods Appl Mech Eng
377:113695
Vlassis NN, Ma R, Sun W (2020) Geometric deep learning for computational mechanics part i:
Anisotropic hyperelasticity. Comput Methods Appl Mech Eng 371:113299
Volpiani PS, Meyer M, Franceschini L, Dandois J, Renac F, Martin E, Marquet O, Sipp D (2021)
Machine learning-augmented turbulence modeling for RANS simulations of massively separated
flows. Phys Rev Fluids 6(6):064607
Wackers J, Visonneau M, Serani A, Pellegrini R, Broglia R, Diez M (2020) Multi-fidelity machine
learning from adaptive-and multi-grid rans simulations. In: 33rd symposium on naval hydrody-
namics
Wang JX, Wu J, Ling J, Iaccarino G, Xiao H (2017) A comprehensive physics-informed machine
learning framework for predictive turbulence modeling. arXiv:1701.07102
Wang L et al (2020) Application and development prospect of digital twin technology in aerospace.
IFAC-PapersOnLine 53(5):732–737
Wang K, Sun W (2018) A multiscale multi-permeability poroplasticity model linked by recursive
homogenizations and deep learning. Comput Methods Appl Mech Eng 334:337–380
Wang H, O’Brien JF, Ramamoorthi R (2011) Data-driven elastic models for cloth: modeling and
measurement. ACM Trans Graph (TOG) 30(4):1–12
Wang L, Zhang Z, Long H, Xu J, Liu R (2016) Wind turbine gearbox failure identification with
deep neural networks. IEEE Trans Indus Inf 13(3):1360–1368
Wang Q, Guo Y, Yu L, Li P (2017) Earthquake prediction based on spatio-temporal data mining:
an lstm network approach. IEEE Trans Emerg Top Comput 8(1):148–158
Wang C, Tan X, Tor S, Lim C (2020) Machine learning in additive manufacturing: State-of-the-art
and perspectives. Addit. Manuf. 36:101538
Wang C, Xiao J, Zhang C, Xiao X (2020) Structural health monitoring and performance analysis
of a 12-story recycled aggregate concrete structure. Eng Struct 205:110102
Wang Y, Cheung SW, Chung ET, Efendiev Y, Wang M (2020) Deep multiscale model learning. J
Comput Phys 406:109071
Wang Y, Ghaboussi J, Hoerig C, Insana MF (2022) A data-driven approach to characterizing non-
linear elastic behavior of soft materials. J Mech Behav Biomed Mater 130:105178
Wang C, Xu LY, Fan JS (2020) A general deep learning framework for history-dependent response
prediction based on UA-Seq2Seq model. Comput Methods Appl Mech Eng 372:113357
Waszczyszyn Z, Ziemiański L (2001) Neural networks in mechanics of structures and materials-new
results and prospects of applications. Comput Struct 79(22–25):2261–2276
Weiss JA, Maker BN, Govindjee S (1996) Finite element implementation of incompressible, trans-
versely isotropic hyperelasticity. Comput Methods Appl Mech Eng 135(1–2):107–128
White DA, Arrighi WJ, Kudo J, Watts SE (2019) Multiscale topology optimization using neural
network surrogate models. Comput Methods Appl Mech Eng 346:1118–1135
Widrow B, Hoff ME (1962) Associative storage and retrieval of digital information in networks of
adaptive “neurons”. In: Biological prototypes and synthetic systems. Springer, Berlin, pp 160–160
Williams MO, Kevrekidis IG, Rowley CW (2015) A data-driven approximation of the koopman
operator: Extending dynamic mode decomposition. J Nonlinear Sci 25(6):1307–1346
Wilt JK, Yang C, Gu GX (2020) Accelerating auxetic metamaterial design with deep learning. Adv
Eng Mater 22(5):1901266
Wirtz D, Karajan N, Haasdonk B (2015) Surrogate modeling of multiscale models using kernel
methods. Int J Numer Methods Eng 101(1):1–28
Wood MA, Cusentino MA, Wirth BD, Thompson AP (2019) Data-driven material models for
atomistic simulation. Phys Rev B 99(18):184305
Wu RT, Jahanshahi MR (2020) Data fusion approaches for structural health monitoring and system
identification: past, present, and future. Struct Health Monit 19(2):552–586
Wu Y, Sui Y, Wang G (2017) Vision-based real-time aerial object localization and tracking for uav
sensing system. IEEE Access 5:23969–23978
Wu JL, Xiao H, Paterson E (2018) Physics-informed machine learning approach for augmenting
turbulence models: A comprehensive framework. Phys Rev Fluids 3(7):074602
Wu L, Liu L, Wang Y, Zhai Z, Zhuang H, Krishnaraju D, Wang Q, Jiang H (2020) A machine
learning-based method to design modular metamaterials. Extreme Mech Lett 36:100657
Wu L, Zulueta K, Major Z, Arriaga A, Noels L (2020) Bayesian inference of non-linear multiscale
model parameters accelerated by a deep neural network. Comput Methods Appl Mech Eng
360:112693
Wu X, Park Y, Li A, Huang X, Xiao F, Usmani A (2021) Smart detection of fire source in tunnel
based on the numerical database and artificial intelligence. Fire Technol 57(2):657–682
Wu D, Wei Y, Terpenny J (2018) Surface roughness prediction in additive manufacturing using
machine learning. In: International manufacturing science and engineering conference, vol 51371,
p V003T02A018. American Society of Mechanical Engineers
Xames MD, Torsha FK, Sarwar F (2023) A systematic literature review on recent trends of machine
learning applications in additive manufacturing. J Intell Manuf 34:2529–2555
Xiao S, Hu R, Li Z, Attarian S, Björk KM, Lendasse A (2020) A machine-learning-enhanced
hierarchical multiscale method for bridging from molecular dynamics to continua. Neural Comput
Appl 32(18):14359–14373
Xie T, Grossman JC (2018) Crystal graph convolutional neural networks for an accurate and inter-
pretable prediction of material properties. Phys Rev Lett 120(14):145301
Xie Y, Ebad Sichani M, Padgett JE, DesRoches R (2020) The promise of implementing machine
learning in earthquake engineering: A state-of-the-art review. Earthq Spectra 36(4):1769–1801
Xie X, Bennett J, Saha S, Lu Y, Cao J, Liu WK, Gan Z (2021) Mechanistic data-driven prediction
of as-built mechanical properties in metal additive manufacturing. NPJ Comput Mater 7(1):1–12
Xiong W, Wu L, Alleva F, Droppo J, Huang X, Stolcke A (2018) The Microsoft 2017 conversational
speech recognition system. In: 2018 IEEE international conference on acoustics, speech and signal
processing (ICASSP). IEEE, pp 5934–5938
Xu J, Duraisamy K (2020) Multi-level convolutional autoencoder networks for parametric prediction
of spatio-temporal dynamics. Comput Methods Appl Mech Eng 372:113379
Xu C, Cao BT, Yuan Y, Meschke G (2022) Transfer learning based physics-informed neural networks
for solving inverse problems in tunneling. arXiv:2205.07731
Xu H, Caramanis C, Mannor S (2008) Robust regression and Lasso. Adv Neural Inf Process Syst
21 (NIPS2008)
Yadav D, Salmani S (2019) Deepfake: A survey on facial forgery technique using generative adver-
sarial network. In: 2019 international conference on intelligent computing and control systems
(ICCS). IEEE, pp 852–857
Yagawa G, Okuda H (1996) Neural networks in computational mechanics. Arch Comput Methods
Eng 3(4):435–512
Yamaguchi T, Okuda H (2021) Zooming method for fea using a neural network. Comput Struct
247:106480
Yan S, Zou X, Ilkhani M, Jones A (2020) An efficient multiscale surrogate modelling framework for
composite materials considering progressive damage based on artificial neural networks. Compos
Part B: Eng 194:108014
Yan C, Vescovini R, Dozio L (2022) A framework based on physics-informed neural networks and
extreme learning for the analysis of composite structures. Comput Struct 265:106761
Yáñez-Márquez C (2020) Toward the bleaching of the black boxes: Minimalist machine learning.
IT Prof 22(4):51–56
Yang C, Kim Y, Ryu S, Gu GX (2020) Prediction of composite microstructure stress-strain curves
using convolutional neural networks. Mater Des 189:108509
Yang L, Zhang D, Karniadakis GE (2020) Physics-informed generative adversarial networks for
stochastic differential equations. SIAM J Sci Comput 42(1):A292–A317
Yang L, Meng X, Karniadakis GE (2021) B-PINNs: Bayesian physics-informed neural networks
for forward and inverse PDE problems with noisy data. J Comput Phys 425:109913
Ye Y, Yang Q, Yang F, Huo Y, Meng S (2020) Digital twin for the structural health management of
reusable spacecraft: A case study. Eng Fract Mech 234:107076
Ye W, Hohl J, Mushongera LT (2022) Prediction of cyclic damage in metallic alloys with crystal
plasticity modeling enhanced by machine learning. Materialia 22:101388
Yoo S, Lee S, Kim S, Hwang KH, Park JH, Kang N (2021) Integrating deep learning into cad/cae
system: generative design and evaluation of 3D conceptual wheel. Struct Multidiscip Optim
64:2725–2747
Yu Y, Hur T, Jung J, Jang IG (2019) Deep learning for determining a near-optimal topological
design without any iteration. Struct Multidiscip Optim 59(3):787–799
Yu Y, Rashidi M, Samali B, Yousefi AM, Wang W (2021) Multi-image-feature-based hierarchical
concrete crack identification framework using optimized svm multi-classifiers and d-s fusion
algorithm for bridge structures. Remote Sens 13(2):240
Yuan FG, Zargar SA, Chen Q, Wang S (2020) Machine learning for structural health monitoring:
challenges and opportunities. Sens Smart Struct Technol Civ, Mech, Aerosp Syst 11379:1137903
Yuan D, Gu C, Wei B, Qin X, Xu W (2022) A high-performance displacement prediction model
of concrete dams integrating signal processing and multiple machine learning techniques. Appl
Math Model 112:436–451
Yucel M, Bekdaş G, Nigdeli SM, Sevgen S (2019) Estimation of optimum tuned mass damper
parameters via machine learning. J Build Eng 26:100847
Yu S, Tack J, Mo S, Kim H, Kim J, Ha JW, Shin J (2022) Generating videos with dynamics-aware
implicit generative adversarial networks. arXiv:2202.10571
Yuvaraj P, Murthy AR, Iyer NR, Sekar S, Samui P (2013) Support vector regression based models
to predict fracture characteristics of high strength and ultra high strength concrete beams. Eng
Fract Mech 98:29–43
Yvonnet J, He QC (2007) The reduced model multiscale method (R3M) for the non-linear homog-
enization of hyperelastic media at finite strains. J Comput Phys 223(1):341–368
Yvonnet J, Monteiro E, He QC (2013) Computational homogenization method and reduced database
model for hyperelastic heterogeneous structures. Int J Multiscale Comput Eng 11(3):201–225
Zadpoor AA (2016) Mechanical meta-materials. Mater Horiz 3(5):371–381
Zehtaban L, Elazhary O, Roller D (2016) A framework for similarity recognition of CAD models.
J Comput Des Eng 3(3):274–285
Zhan Z, Li H (2021) A novel approach based on the elastoplastic fatigue damage and machine
learning models for life prediction of aerospace alloy parts fabricated by additive manufacturing.
Int J Fatigue 145:106089
Zhang Z, Friedrich K (2003) Artificial neural networks applied to polymer composites: a review.
Compos Sci Technol 63(14):2029–2044
Zhang J, Sato T, Iai S (2007) Novel support vector regression for structural system identification.
Struct Control Health Monit: Off J Int Assoc Struct Control Monit Eur Assoc Control Struct
14(4):609–626
Zhang Z, Hsu TY, Wei HH, Chen JH (2019) Development of a data-mining technique for regional-
scale evaluation of building seismic vulnerability. Appl Sci 9(7):1502
Zhang D, Guo L, Karniadakis GE (2020) Learning in modal space: Solving time-dependent stochas-
tic PDEs using physics-informed neural networks. SIAM J Sci Comput 42(2):A639–A665
Zhang XL, Michelén-Ströfer C, Xiao H (2020) Regularized ensemble Kalman methods for inverse
problems. J Comput Phys 416:109517
Zhang P, Yin ZY, Jin YF (2021) State-of-the-art review of machine learning applications in consti-
tutive modeling of soils. Arch Comput Methods Eng 28(5):3661–3686
Zhang Z, Liu Y (2021) Robust data-driven discovery of partial differential equations under uncer-
tainties. arXiv:2102.06504
Zhang W, Mehta A, Desai PS, Higgs III CF (2017) Machine learning enabled powder spreading pro-
cess map for metal additive manufacturing (am). In: 2017 international solid freeform fabrication
symposium. University of Texas at Austin
Zhao Y, Akolekar HD, Weatheritt J, Michelassi V, Sandberg RD (2020) RANS turbulence model
development using CFD-driven machine learning. J Comput Phys 411:109413
Zhao P, Liao W, Xue H, Lu X (2022) Intelligent design method for beam and slab of shear wall
structure based on deep learning. J Build Eng 57:104838
Zheng H, Moosavi V, Akbarzadeh M (2020) Machine learning assisted evaluations in structural
design and construction. Autom Constr 119:103346
Zheng X, Zheng P, Zheng L, Zhang Y, Zhang RZ (2020) Multi-channel convolutional neural net-
works for materials properties prediction. Comput Mater Sci 173:109436
Zheng B, Yang J, Liang B, Cheng JC (2020) Inverse design of acoustic metamaterials based on
machine learning using a gauss–bayesian model. J Appl Phys 128(13):134902
Zhu Y, Zabaras N, Koutsourelakis PS, Perdikaris P (2019) Physics-constrained deep learning for
high-dimensional surrogate modeling and uncertainty quantification without labeled data. J Com-
put Phys 394:56–81
Zhuang X, Guo H, Alajlan N, Zhu H, Rabczuk T (2021) Deep autoencoder based energy method
for the bending, vibration, and buckling analysis of Kirchhoff plates with transfer learning. Eur
J Mech-A/Solids 87:104225
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc: Ser
B (Statistical Methodology) 67(2):301–320
zur Jacobsmühlen J, Kleszczynski S, Witt G, Merhof D (2015) Detection of elevated regions in
surface images from laser beam melting processes. In: IECON 2015-41st annual conference of
the IEEE industrial electronics society. IEEE, pp 001270–001275
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0
International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing,
adaptation, distribution and reproduction in any medium or format, as long as you give appropriate
credit to the original author(s) and the source, provide a link to the Creative Commons license and
indicate if changes were made.
The images or other third party material in this chapter are included in the chapter’s Creative
Commons license, unless indicated otherwise in a credit line to the material. If material is not
included in the chapter’s Creative Commons license and your intended use is not permitted by
statutory regulation or exceeds the permitted use, you will need to obtain permission directly from
the copyright holder.
Chapter 2
Artificial Neural Networks

K. Worden, G. Tsialiamanis, E. J. Cross, and T. J. Rogers

2.1 Introduction

To many people, the history of machine learning itself is synonymous with the history
and development of artificial neural networks. There is a good reason for this; the ear-
liest conception of machine learning—or more generally of artificial intelligence—
was inextricably linked to the idea of reproducing the capabilities of the human brain
in some mathematical or computational framework. There were two main motiva-
tions for the programme. In the first place, it was hoped that a mathematical model
of the brain would shed light on its biological functionality, thus leading to a compu-
tational basis for neuroscience. The second motivation was more pragmatic; it was
known from the earliest days of electronic computing that the brain is effortlessly
superior to electronic (sequential/serial von Neumann) computers at certain tasks;
for example, image recognition, speech recognition, coordination, etc. The hope was
that a computer modeled on the architecture of the brain could extend capability
on the sort of tasks indicated. As this volume is very much focused on the solution
of problems in engineering, the discussion of neural networks here will concentrate
on the second motivation; the curious reader is directed elsewhere to find out about
computational neuroscience (Churchland and Sejnowski 2017; Miller 2018).
This chapter is not intended as a comprehensive history or literature survey in any
shape or form; the idea will be to simply convey how Artificial Neural Networks
(ANNs) have developed in the context of engineering applications. It will prove
useful to split that development into three main periods. The layout of the chapter
will be as follows: the next section will outline the “pre-history” of the subject,
and will discuss developments up to the point where a general ANN architecture
emerged, which was versatile and powerful enough to address a range of engineering
problems—the multi-layer perceptron (MLP). It will be argued that this algorithm


ushered in the “first age” of ANNs for engineering and it is the subject of detailed
discussion in Sect. 2.3. Section 2.4 provides a fairly detailed study on how an MLP
was used to provide a solution to a problem in structural health monitoring. When
limitations of the MLP began to show in other disciplines, a “second age” of ANNs
began, in which strategies and architectures for deep learning were developed; this
period is the subject of Sect. 2.5. As in the first age, the engineering community
adopted one particular architecture for many problems—the convolutional neural
network (CNN); this structure is discussed in some detail. After a brief closing look
at some new ANN paradigms which show great promise, the chapter ends with some
conclusions.

2.2 Biological Motivation and Pre-history

The foundations for the study of ANNs begin with pioneering work on neurons as
structural constituents of the brain in the 1910s (Ramón y Cajal 1911). This early
work established that the basic processing unit of the brain was the nerve cell or
neuron; the structure and operation of such neurons will be outlined in this section.
In brief, the neuron acts by summing stimuli from connected neurons. If the total
stimulus or activation exceeds a certain threshold, the neuron “fires”, i.e. it generates
a stimulus which is passed on into the network. The essential components of the
neuron are shown in the schematic Fig. 2.1.
The cell body, which contains the cell nucleus, carries out those biochemical
reactions which are necessary for sustained functioning of the neuron. Two main types
of neuron are found in the cortex (the part of the brain associated with the higher
reasoning capabilities); they are distinguished by the shape of the cell body. The
predominant type has a pyramid-shaped body; neurons of this type are usually referred to as
pyramidal neurons. Most of the remaining nerve cells have star-shaped bodies and are referred
to as stellate neurons. The cell bodies are typically a few microns in diameter. The fine
tendrils surrounding the cell body are the dendrites; they typically branch profusely
in the neighborhood of the cell and extend for a few hundred microns. The nerve fiber
or axon is usually much longer than the dendrites, sometimes extending for up to a
meter. These fibers connect the neurons with distant parts of the nervous system via
the spinal cord; they are not connections within the brain. The axon only branches
at its extremity, where it makes connections with other cells.
The dendrites and axon serve to conduct signals to and from the cell body. In
general, input signals to the cell are conducted along the dendrites, while the cell
output is directed along the axon. Signals propagate along the fibers as electrical
impulses. Connections between neurons, called synapses, are usually made between
axons and dendrites, although they can occur between dendrites, between axons and
between an axon and a cell body.

Fig. 2.1 The biological (stellate) neuron

Synapses operate as follows: the arrival of an electrical nerve impulse at the
end of an axon, say, causes the release of a chemical—a neurotransmitter—into
the synaptic gap (the region of the synapse, typically extending 0.01 microns). The
neurotransmitter then binds itself to specific sites—neuroreceptors—usually in the
dendrites of the target neuron. There are distinct types of neurotransmitters: excitatory
transmitters which trigger the generation of a new electrical impulse at the receptor
site and inhibitory transmitters which act to prevent the generation of new impulses.
Table 2.1 (reproduced from Abeles 1991) gives the statistics for, and typical
properties of, neurons within the cerebral cortex (the term remote sources in the
table refers to sources outside the cortex). One of the first things one sees from the
table is that the network is far from fully connected.
The operation of the neuron in reality is not at all simple; the dynamics are those
of a complex electro-chemical dynamical system; however, in broad terms, the cell
body carries out a sort of summation of all the incoming electrical impulses directed
inward along the dendrites. The effectiveness of the connection between two neurons
is determined by the chemical environment in the synapse, so the elements of the
summation over neuronal connections are individually weighted by the strength of
the connection or synapse. If sufficient energy is directed into a neuron within a
certain interval of time from its neighbors, it will itself discharge an electrical pulse
along the axon. This process can be put into terms which will make the design of an
artificial neuron seem clear. If the value of the summation over incoming signals—the
activation of the neuron—exceeds a certain threshold, the neuron fires and directs
an electrical impulse outward via its axon. Via synapses with the axon, the signal
is communicated to other neurons. If the activation is less than the threshold, the
neuron remains dormant.

Table 2.1 Properties of the cortical neural network

Variable                                              Value
Neuronal density                                      40000/mm³
Neuronal composition
  Pyramidal                                           75%
  Stellate                                            25%
Synaptic density                                      8 × 10⁸/mm³
Axonal length density                                 3200 m/mm³
Dendritic length density                              400 m/mm³
Synapses per neuron                                   20000
Inhibitory synapses per neuron                        2000
Excitatory synapses from remote sources per neuron    9000
Excitatory synapses from local sources per neuron     9000
Dendritic length per neuron                           10 mm
A mathematical model of the neuron, exhibiting the essential features of this
restricted view of the biological neuron, was developed as early as 1943 in McCulloch
and Pitts (1943). This model forms the subject of a later discussion; the remainder of
this section is concerned with those properties of the brain which emerge as a result
of its massively parallel nature.

2.2.1 Memory

Information is actually stored in the brain in the network connectivity and the
strengths of the connections or synapses between neurons. In this case, knowledge is
stored as a distributed quantity throughout the entire network. As one might imagine,
the act of retrieving information from such a memory is quite different from that for
an electronic computer. In order to access data on a PC, say, the processor is informed
of the relevant address in memory, and it retrieves the data from that location. In a
neural network, a stimulus is presented (i.e. a number of selected neurons receive
an external input), and the required data are encoded in the subsequent pattern of
neuronal activations. Potentially, recovery of the pattern is dependent on the entire
distribution of connection weights or synaptic strengths.
One advantage of this type of memory retrieval system is that it has a much
greater resistance to damage. If the surface of a PC hard disk is damaged, all data
at the affected locations may be irreversibly corrupted. In a neural network, because
the knowledge is encoded in a distributed fashion, local damage to a portion of the
network may have little effect on the retrieval of a pattern when a stimulus is applied.

2.2.2 Learning

According to the argument above, knowledge is encoded in the connection strengths
between the neurons in the brain. The question arises of how a given distributed
representation of data is obtained; one way is that the initial state of the brain at
birth is gradually modified as a result of its interaction with the environment. This
development is thought to occur as an evolution in the connection strengths between
neurons as different patterns of stimuli and appropriate responses are activated in the
brain as a result of signals from the sense organs.
The first explanation of such learning in terms of the evolution of synaptic
connections—Hebbian learning—was given by Hebb (1949):

When a cell A excites cell B by its axon and when in a repetitive and persistent manner it
participates in the firing of B, a process of growth or of changing metabolism takes place
in one or both cells such that the effectiveness of A in stimulating and impulsing cell B is
increased with respect to all other cells which can have this effect.

It was considered that, if some similar mechanism could be established for compu-
tational models of neural networks, there would be the attractive possibility of “pro-
gramming” these systems simply by presenting them with a sequence of stimulus-
response pairs so that the network could learn the appropriate relationship by rein-
forcing some of its internal connections.

2.2.3 Parallel Distributed Processing Paradigm

As observed earlier, the massively parallel nature of the brain as a computing facility
led researchers to believe that it could motivate a new and powerful paradigm for
artificial computing, based on artificial neural networks (ANNs). It was assumed that
the most important elements of this paradigm would include (Haykin 1994).
• Nonlinearity
Neurons are highly nonlinear (thresholding) devices and therefore neural networks
will also be nonlinear. Moreover, the nonlinearity can be distributed throughout
the network according to its structure. ANNs are usually designed to be nonlinear.
• Adaptivity and Learning
ANNs can modify their behavior in response to the environment. Once a set of
inputs with desired outputs is presented to the network, the connections between
the neurons self-adjust to produce consistent responses. This training is repeated
for many examples until the network reaches a stable state.
• Evidential Response
A neural network can be designed to provide information not only about the
response but also about the confidence in the response; examples include pattern
recognition analysis where patterns can be classified with a returned confidence
level.
• Fault Tolerance
If a neuron or its connections are damaged, processing of information is impaired.
However, a neural network can exhibit a graceful degradation in performance,
rather than catastrophic failure.
• VLSI Implementation
The parallel nature of neural networks makes them ideally suited for implemen-
tation using microchip technology.

The power of ANNs in terms of mechanical-systems research is in their suitability for
sensor data processing problems that require parallelism and optimization resulting
from: high dimensionality of the problem space and complex interactions between
the analyzed variables. Broadly speaking, ANNs can offer solutions to four different
types of problem.

1. Autoassociation
A signal is reconstructed from noisy or incomplete data.
2. Regression/Heteroassociation
Input–output mapping, i.e. for given input data, produces a required output char-
acteristic.
3. Classification
Assign input data to given classes.
4. Anomaly Detection
Detect statistical abnormalities in the input data.

The first two tasks are often associated with modeling applications using neural net-
works. The last category includes the problem of novelty detection (Markou and
Singh 2003a; neural network based approaches 2003b; Tarassenko et al. 1995; Wor-
den 1997). Regression and classification are considered to be two of the main prob-
lems addressable with machine learning, so it is clear that neural networks offer a
general capability. Anomaly or novelty detection establishes a description of nor-
mality using features representing initial or normal conditions for some system or
process, and then tests for abnormality or novelty. This capability led to the first suc-
cessful applications of machine learning in the engineering discipline of structural
health monitoring (SHM), which seeks to detect and locate damage in engineering
structures using measured (sensor) data. The parallel discipline of condition monitor-
ing (detection of damage in machines) has also benefited significantly from novelty
detection. There also exist a number of general data fusion and signal processing
applications within each group of these tasks. In general, ANNs can be used for (Luo
and Unbehauen, 1998): filtering, detection, reconstruction, array signal processing,
system identification, signal compression, and adaptive feature extraction.
Fig. 2.2 A schematic of the McCulloch–Pitts neuron

Before discussing one of the most commonly used variants of ANNs, it is useful to
spend a little time discussing how a single neuron might be modeled in a mathematical
way.

2.2.4 The Artificial Neuron

Despite the diversity of ANN paradigms, nearly all of them consist of very similar
building blocks—the artificial neurons. The structure of these neurons has actually
changed very little since the first study in McCulloch and Pitts (1943). The model
neurons receive a set of inputs or stimuli (regarded as emerging from neighboring
neurons) and produce a single output or response. In the foundational McCulloch–
Pitts (MCP) model, both the inputs and outputs are considered to be binary (reflecting
the fact that biological neurons either fire or don’t fire). The MCP neuron structure
is considered to consist of two blocks concerned with summation and activation, as
shown in Fig. 2.2.
The input values x_i ∈ {0, 1} are weighted by a factor w_i before they are passed to the
body of the neuron. These weights mimic the effect of different strengths of synaptic
connection in the biological neuron; they can be positive (excitatory) or negative
(inhibitory). The weighted inputs are then summed to produce an activation signal
z,

z = \sum_{i=1}^{n} w_i x_i        (2.1)

This signal is then passed through a nonlinear activation or transfer function; this
corresponds to the threshold in the biological neuron. However, in the mathematical
model, any one of a number of functions could be used for processing. For example,
if the output signal y is described as,
y = kz (2.2)

where k is a constant coefficient, the neuron is linear and consequently networks
made up from these types of neurons would be linear networks. (Such neurons have
considerable limitations; however, in the literature, classes of linear neurons and
networks have found some use and they are called Adaline and Madaline (Widrow
and Hoff, 1960), respectively.)
To mimic the biological neuron, the activation function used in the MCP model
is a hard threshold function. The MCP neuron fires if the weighted sum z exceeds
some predefined threshold β, i.e. if,

z>β (2.3)

and does not fire if,

z≤β (2.4)
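As a concrete illustration of Eqs. (2.1)–(2.4), the following is a minimal sketch of an MCP neuron in Python (assuming NumPy is available); the AND-gate weights and threshold are hypothetical values chosen by hand for the example, not taken from the text.

```python
import numpy as np

def mcp_neuron(x, w, beta):
    """McCulloch-Pitts neuron: weighted sum of binary inputs (Eq. 2.1), hard threshold (Eqs. 2.3-2.4)."""
    z = float(np.dot(w, x))        # activation z = sum_i w_i x_i
    return 1 if z > beta else 0    # fires only if the activation exceeds the threshold

# Hypothetical example: an AND gate realised with hand-chosen weights and threshold
w = np.array([1.0, 1.0])
beta = 1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, mcp_neuron(np.array(x), w, beta))   # fires only for [1, 1]
```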

Initial studies of the MCP neuron indicated that its computational capabilities
were extremely limited if it were used alone. This was no surprise; a brain with a
single neuron would not be expected to work in any way. The classic example of MCP
limitation is the fact that a single neuron is incapable of learning the XOR function
(two binary inputs and a single binary output). The obvious solution, with reference
to the brain, was to move to networks of neurons; the first serious learning machine
designed on this basis made its appearance in the late 1950s in the form of the Perceptron.

2.2.5 The Perceptron

The first serious study of artificial neural networks (as opposed to single neurons) was
carried out by Rosenblatt (1962), who proposed a three-layered network structure—
the Perceptron. This network was not homogeneous; the first layer was an input
layer which simply distributed signals to the processing layers. The first processing
layer was the associative layer (also referred to as the hidden layer because it had
no external connections); the second, which outputs signals to the outside world was
termed the decision layer. In the classic form of the perceptron, only the connections
between the decision and associative nodes were adjustable in strength; those between
the input and associative nodes had to be preset before training took place. One of
the main motivations of the perceptron was that it might take inputs, black and white,
from an image in order to recognize patterns within that image; for this reason, it
was sometimes considered to be an “artificial retina” as in Fig. 2.3. It was possible to
prove a number of nice theorems for the perceptron, including a proof that if a given
pattern were learnable, a learning rule existed which would converge in finite time.
The problems in adopting the perceptron as a general learning machine proved to be
related to discovering which patterns were actually learnable.
Fig. 2.3 The perceptron viewed as an “artificial retina”

Although perceptrons were initially received with enthusiasm, they were soon
associated with many problems; a completely rigorous investigation into the capabil-
ity of perceptrons was given in Minsky and Papert (1988). In representing a function
with N arguments, the generic perceptron was shown to need 2^N elements in the
associative layer; i.e. the networks appeared to grow exponentially in complexity
with the dimension of the problem. It was initially hoped that real problems could
be solved which did not require the maximal structure; however, Minsky and Papert
(1988) showed that many fundamental problems required the full complexity. For
example, it was shown that a perceptron could not determine if a given pattern was
connected (not falling into disjoint components) without the full number of neurons.
One way out of the dilemma appeared to be the possibility of adding and training
other layers of weights in the perceptron; however, it was not possible to establish an
algorithm which would allow this. Only the layer of weights between the outermost
hidden layer and the output layer could be trained using the perceptron learning rule
(which was based on Hebbian learning as described above). With the perceptron
structure as it stood, further progress proved impossible. The effect of Minsky and
Papert’s book was to discourage research in ANNs, and the field lay dormant for
many years.
The period of inactivity ended with the work of Hopfield (1984, 1985) in the
1980s. He considered networks from the point of view of dynamical systems the-
ory; the outputs of the constituent neurons in Hopfield’s networks were regarded as
dynamical states which could evolve in time. The Hopfield network proved capable
of solving a number of practical problems (many of them optimization problems)
and reinvigorated ANN research. An immediate result of the resurgence in activity
was the solution by various groups of the problems associated with Rosenblatt-type
perceptrons. The problem of finding a learning rule for the multi-layer structures
turned out to be the result of using the hard threshold as an activation function in the
individual neurons. The solution proved to hinge on a matter as simple as replacing
the hard threshold with a continuous function, such as the sigmoidal function,
y = \frac{1}{1 + e^{-z}}        (2.5)

or hyperbolic tangent function,

y = tanh(z) (2.6)
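A small sketch in Python (assuming NumPy) of these two activation functions and their derivatives; in contrast to the hard threshold, both are smooth, which is precisely what the learning rule described next exploits.

```python
import numpy as np

def sigmoid(z):
    """Sigmoidal activation, Eq. (2.5): smooth, saturating, output in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    """Derivative of the sigmoid; it exists everywhere, unlike that of the hard threshold."""
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    """Derivative of the hyperbolic tangent activation, Eq. (2.6)."""
    return 1.0 - np.tanh(z) ** 2

z = np.linspace(-3.0, 3.0, 7)
print(sigmoid(z))
print(np.tanh(z))
print(d_sigmoid(z), d_tanh(z))
```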

Once the activation function became continuous, the solution to the whole problem
turned out to be the chain rule of partial differentiation (Bishop 2013). The backprop-
agation learning rule was discovered simultaneously by a number of research groups
and was reported in Rumelhart et al. (1986), LeCun (1986). In fact, the learning rule
had been discovered as early as 1974 in the PhD work of Werbos (1974); however,
this work lay undiscovered by the machine learning community, partly because of
the lack of activity in the field as a result of the disappointment in perceptrons. The
backpropagation rule is essentially the gradient-descent algorithm for optimization
and in this sense appears to have had antecedents in the control engineering literature
as early as the 1960s (Necessary conditions for extremal solutions 1963). The exis-
tence of the backpropagation algorithm allowed the development of the Multi-Layer
Perceptron (MLP) algorithm, which has proved to be one of the most commonly
used and influential machine learning paradigms so far discovered. Interest in ANNs
flowered and a number of large programs of research were initiated; this will be
considered here as the start of the first age of ANNs.

2.3 The First Age—The Multi-layer Perceptron

The MLP network is a natural generalization of the perceptron network described in
Sect. 2.2.5. A detailed analysis of the network structure and learning algorithm can be
found in Bishop (2013); however, for completeness, and because of the importance
of the paradigm, a brief discussion is given here.
The MLP is a feedforward network with the neurons arranged in layers (Fig. 2.4).
Signal values pass into the input layer nodes, progress forward through the network
hidden layers, and the result finally emerges from the output layer. Each node i is
connected to each node j in the preceding and following layers via connections of
weight w_{ij}. Signals pass through the nodes as follows: in layer k a weighted sum is
performed at each node i of all the signals x_j^{(k-1)} from the preceding layer k − 1,
giving the excitation z_i^{(k)} of the node; this is then passed through a nonlinear activation
function f to emerge as the output of the node x_i^{(k)} to the next layer, i.e.

Fig. 2.4 Multi-Layer Perceptron (MLP)

x_i^{(k)} = f(z_i^{(k)}) = f\!\left( \sum_{j} w_{ij}^{(k)} \, x_j^{(k-1)} \right)    (2.7)

As discussed above, various choices for the function f are possible (as long as
they are continuous and satisfy some other mild conditions); the hyperbolic tangent
function f (x) = tanh(x) is a good choice. A novel feature of this network is that
the neuron outputs can take any value in the interval [–1, 1]. There are no explicit
threshold values associated with the neurons. One node of the network, the bias node,
is special in that it is connected to all other nodes in the hidden and output layers;
the output of the bias node is held fixed at a value of unity throughout, in order to
allow constant offsets in the excitations z_i of each node.
The first stage of using a network is to establish the appropriate values for the
connection weights w_{ij}, i.e. the training phase. The type of training usually used is
a form of supervised learning and makes use of a set of network inputs for which
the desired network outputs (often called targets) are known. At each training step, a
set of inputs is passed forward through the network yielding trial outputs which can
be compared with the desired outputs. If the comparison error is considered small
enough, the weights are not adjusted. If however a significant error is obtained, the
error is passed backward through the net and the training algorithm uses the error
to adjust the connection weights so that the error is reduced. The learning algorithm
used is usually referred to as the backpropagation algorithm, as discussed earlier,
and can be summarized as follows. For each presentation of a training set, a measure
J of the network error is evaluated, where the most common choice is the squared
error,
J(t) = \frac{1}{2} \sum_{i=1}^{n^{(l)}} \left( y_i(t) - \hat{y}_i(t) \right)^2    (2.8)

and n^{(l)} is the number of output layer nodes. J is implicitly a function of the network
parameters J = J (w) where the w are all the connection weights, ordered into a
vector in some appropriate manner (Nabney 2001). The integer t labels the presen-
tation order of the training sets. After presentation of a training set, the standard
steepest-descent algorithm requires an adjustment of the parameters according to
\Delta w_i = -\eta \frac{\partial J}{\partial w_i} = -\eta \nabla_i J    (2.9)

where ∇i is the gradient operator in the parameter space. The parameter η determines
how large a step is made in the direction of steepest descent and therefore how quickly
the optimum parameters are obtained. For this reason η is called the learning rate or
learning coefficient. Detailed analysis (Bishop 2013) gives the update rule after the
presentation of a training set,

w_{ij}^{(m)}(t) = w_{ij}^{(m)}(t-1) + \eta \, \delta_i^{(m)}(t) \, x_j^{(m-1)}(t)    (2.10)

where δ_i^{(m)} is the error in the output of the ith node in layer m. This error is not
known a priori but must be constructed from the known errors δ_i^{(l)} = y_i − ŷ_i at the
output layer l. This is the source of the name backpropagation, the weights must be
adjusted layer by layer, moving backward from the output layer.
There is little guidance in the literature as to what the learning rate η should be; if it
is taken too small, convergence to the correct parameters may take an extremely long
time. However, if η is made large, learning is much more rapid but the parameters
may diverge or oscillate. One way around this problem is to introduce a momentum
term into the update rule so that previous updates persist for a while, i.e.

\Delta w_{ij}^{(m)}(t) = \eta \, \delta_i^{(m)}(t) \, x_j^{(m-1)}(t) + \alpha \, \Delta w_{ij}^{(m)}(t-1)    (2.11)

where α is termed the momentum coefficient. The effect of this additional term
is to damp out high-frequency variations in the backpropagated error signal. The
learning rate and momentum coefficients are examples of hyperparameters; these
are parameters of a learning algorithm which must be specified before any training
takes place. Hyperparameters cannot be optimized at the same time as the network
weights—a little more machinery is needed. This learning algorithm is termed first
order because the first derivative of the objective function is used. The current state
of the art for training MLPs uses second-order algorithms which require evaluation
of the second derivatives of the objective function—the Hessian. These algorithms
are computationally more expensive on an iteration-by-iteration basis, but converge
(much) faster and have better properties with respect to trapping in local minima. The
two most often-used variants of the second-order approach are the scaled conjugate
gradient approach (Möller 1993) and the Levenberg–Marquardt approach (Press
et al. 1992).
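
To make the update rules concrete, a minimal sketch of first-order gradient descent with a momentum term, in the spirit of Eqs. (2.9)–(2.11), is given below in Python/numpy. The toy single-layer regression problem, the learning rate and the momentum coefficient are all assumptions made purely for illustration; in a full MLP the same update would be applied layer by layer, with the delta terms backpropagated from the output layer.

import numpy as np

rng = np.random.default_rng(0)

# Toy linear regression problem used only to illustrate the update rule;
# the data, learning rate and momentum coefficient are assumed values.
x = rng.normal(size=(100, 3))
y = x @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

eta, alpha = 0.001, 0.9       # learning rate and momentum coefficient
w = np.zeros(3)               # connection weights
dw = np.zeros_like(w)         # previous update (the momentum term)

for t in range(2000):
    y_hat = x @ w                          # forward pass
    J = 0.5 * np.sum((y - y_hat) ** 2)     # squared error, cf. Eq. (2.8)
    grad = -(y - y_hat) @ x                # dJ/dw, cf. Eq. (2.9)
    dw = -eta * grad + alpha * dw          # gradient step plus momentum, cf. Eq. (2.11)
    w = w + dw                             # apply the update, cf. Eq. (2.10)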

2.3.1 Existence of Solutions

Before advocating the use of neural networks in representing functions and processes,
it is important to establish what they are capable of. As described above, ANNs were
all but abandoned as a subject of study following Minsky and Papert’s book (Minsky
and Papert 1988), which showed that perceptrons were incapable of modeling very
simple logical functions. In fact, recent years have seen a number of rigorous results
(Cybenko 1989 is a good example), which show that an MLP network is capable
of approximating any given function with arbitrary accuracy, even if possessed of
only a single hidden layer. Unfortunately, the proofs are not constructive and offer
no guidelines as to the complexity of network required for a given function. A single
hidden layer may be sufficient, but might require many more neurons than if two
hidden layers were used. This observation motivated the development of neural
networks with many more hidden layers—deep networks—and led to the second
age of neural networks.

2.3.2 Uniqueness of Solutions

This is the problem of local minima. The error function for an MLP network is
an extremely complex object. Given a converged MLP network, there is no way
of establishing if it has arrived at the global minimum. Some attempts to avoid
the problem are centered around the association of a temperature with the learning
schedule. Roughly speaking, at each training cycle the network may randomly be
given enough “energy” to escape from a local minimum. The probable energy is
calculated from a network temperature function which decreases with time. (Recall
that molecules of a solid at high temperature escape the energy minimum which
specifies their position in the lattice.) An alternative approach is to seek network
paradigms with less severe problems, e.g. radial-basis function networks (Bishop
2013).

2.3.3 Generalization and Regularization

One of the main problems encountered in practice with ANNs (and in machine
learning in general) is that of generalization or avoidance of overfitting. This is
essentially the problem of rote learning the training data rather than learning the
underlying function of interest. It occurs when there are too many parameters in
the model compared to the number of training points or patterns. Consider a simple
one-dimensional regression problem with 10 points of training data. Suppose that
the true relationship between x and y is linear, but the presence of noise means that
the plot of y against x is far from a straight line. If one were to fit a ninth-order
least-squares polynomial to the data, there would be 10 tunable parameters which
could be set so that the function passes through the data with no error. The problem
here is that the polynomial fit would very probably deviate badly from the true linear
form away from the training points. If one now applied the model to different data,
generated by the same physical process as the training data, the predictions could be
very bad indeed, the model will fail to generalize. At the heart of this problem is the
availability of too many parameters. In the context of a neural network, one is likely
to overfit the data if there are too many weights compared to training data points. The
simplest solution to the problem is to always have enough data; the rule-of-thumb
espoused in Tarassenko (1998) is that one should have 10 training patterns for each
network weight (this rule is not arbitrary and some theoretical motivation is given
in Bishop 2013). In engineering problems, data are often the result of expensive
experimentation and will be in short supply; in this case, the only way to ensure
generalization is to restrict the number of weights in the network. In any case, if one
has fitted a neural network to training data, one should always evaluate performance
on an independent test set in order to assess generalization.
One way of controlling the number of adjustable weights in the network is to
control the number of hidden units; this is accomplished by the use of cross validation
on an independent dataset. In fact, the number of hidden units is another example of a
hyperparameter. One divides the available data into three sets for training, validation,
and testing. For all numbers of hidden units between 1 and some maximum, one trains
a network on the training data and then also computes the error on the validation
data. When the number of hidden units has reached the point where overtraining is
beginning, the error on the validation set will begin to rise, even though the error on
the training set continues to decrease. One then fixes the number of hidden units at
the point where the minimum error on the validation set occurred. Now, the network
has been tuned to both the training and validation sets and the independent testing
set is brought in to assess generalization. The independence of the various datasets
used in training and assessment is critical; for example, if the data in the testing
set are too strongly correlated with the data in the training set, it is likely that a
misleading impression of the generalization capacity of the network will be obtained.
The validation set can be used to determine values for any number of hyperparameters
by cross validation. If data are very scarce, it may be unattractive to use much of
the data for a validation set; in this situation, alternatives like leave-one-out cross
validation can be explored (Bishop 2013).
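
By way of illustration, the sketch below (using scikit-learn's MLPRegressor) selects the number of hidden units by monitoring the error on a validation set, and only then assesses generalization on an independent testing set. The synthetic one-dimensional data, the split sizes and the candidate range of hidden units are all assumptions made for the purpose of the example.

import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(1)

# Hypothetical one-dimensional regression problem.
x = rng.uniform(-3, 3, size=(300, 1))
y = np.sin(x).ravel() + 0.1 * rng.normal(size=300)

# Divide the available data into training, validation and testing sets.
x_tr, y_tr = x[:150], y[:150]
x_va, y_va = x[150:225], y[150:225]
x_te, y_te = x[225:], y[225:]

val_errors = {}
for n_hidden in range(1, 21):
    net = MLPRegressor(hidden_layer_sizes=(n_hidden,), max_iter=2000, random_state=0)
    net.fit(x_tr, y_tr)                                      # train on the training set only
    val_errors[n_hidden] = mean_squared_error(y_va, net.predict(x_va))

best = min(val_errors, key=val_errors.get)                   # minimum validation error
final = MLPRegressor(hidden_layer_sizes=(best,), max_iter=2000, random_state=0)
final.fit(x_tr, y_tr)
test_error = mean_squared_error(y_te, final.predict(x_te))   # estimate of generalization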
Another way of thinking about overfitting is in terms of the magnitude of the
weights. In situations where there are as many weights as data points, one often
finds that the weights have very high values and the accurate predictions on the
training set are the result of cancellations between large positive and large negative
weights. This observation suggests that better generalization might be achieved by
controlling the size of the weights. Alternatively, one can regard the issue as being
one of smoothness of the fitted function. A high-order polynomial model as discussed
before can dance rapidly between noisy data points, where the true underlying linear
system response is much smoother. It can be shown that smaller weights generate
smoother response. The science of controlling the smoothness of the network is
generally called regularization. Having established that smaller weights are desirable,
one simple way to achieve this is to add a term to the neural-network objective/error
function which penalizes large weights, the simplest such term being


\alpha \sum_{i=1}^{W} w_i^2 = \alpha \, \|w\|^2    (2.12)

where the constant α weights the relative importance of the squared error and the
weight norm—it is yet another example of a hyperparameter.
This prescription is commonly called weight-decay regularization. Two other
methods of regularization are early stopping, where one stops training before the
algorithm has tuned the weights to the point of overfitting, and adding noise to the data
during training in order to stop the algorithm learning exact training data points. It has
been shown that these three mentioned methods of regularization are theoretically
equivalent (Bishop 1994). One of the most advanced theoretical frameworks for
assessing generalization capacity of models is that of statistical learning theory
(Vapnik 1995).
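
A minimal sketch of the weight-decay idea is given below; the penalty of Eq. (2.12) is simply added to the error function, and its gradient contributes a term which shrinks the weights at every update. The value of α is an assumed hyperparameter which would, in practice, be set by cross validation.

import numpy as np

alpha = 0.01   # regularization hyperparameter (assumed value)

def penalised_error(squared_error, w):
    # Weight-decay penalty, Eq. (2.12), added to the usual squared error.
    return squared_error + alpha * np.sum(w ** 2)

def penalised_gradient(error_gradient, w):
    # The corresponding gradient acquires a 2*alpha*w term, which "decays"
    # the weights toward zero at each training step.
    return error_gradient + 2.0 * alpha * w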

2.3.4 Choice of Output Activation Functions

This is a straightforward but important matter concerned with the nonlinear activation
functions of the neurons in the MLP. In order to have a universal approximator, it is
necessary that the hidden units of the MLP adopt a nonlinear activation function. As
discussed above, the hyperbolic tangent and sigmoid functions are the most often used
as activation functions. One also needs to specify a form for the activation functions of
the output layer and it turns out that best practice demands that this is problem specific
(Bishop 2013). For regression problems, one should use linear activation functions.
For classification problems, one should use a nonlinear activation. Furthermore, if
the classifier is trained using the 1 of M rule (Tarassenko 1998), a softmax activation
should be used. In that case, there will be M output neurons, each with an associated
activation z_i, i = 1, ..., M. The output of the ith neuron is defined as
y_i = \frac{\exp(z_i)}{\sum_{j=1}^{M} \exp(z_j)}    (2.13)

This rule forces all the network outputs to sum to unity, a necessary condition
for interpreting the outputs as posterior probabilities of class membership (Bishop
2013).
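
The softmax rule of Eq. (2.13) is easily stated in code; the sketch below subtracts the maximum activation before exponentiating, a standard numerical-stability device which leaves the result unchanged.

import numpy as np

def softmax(z):
    # Softmax output activation, Eq. (2.13); the outputs sum to unity and can
    # be interpreted as posterior probabilities of class membership.
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

print(softmax(np.array([2.0, 1.0, -1.0])))   # approximately [0.705, 0.259, 0.035]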
Although it might appear controversial to a computer scientist, it is arguably rea-
sonable that engineers would regard the MLP as ushering in the “first age” of neural
computing, as it provided an architecture with the power and versatility to address
many engineering problems. The main developments in the MLP following the intro-
duction of backpropagation were not significant changes to the basic structure. In
terms of the learning algorithm, the main development was a transition from gradi-
ent descent to second-order methods for the backpropagation algorithm. One very
significant development, which did not affect the basic architecture, was the adop-
tion of a Bayesian probabilistic viewpoint on MLPs. The Bayesian view allowed a
principled interpretation of the MLP and provided practical benefits like confidence
intervals on predictions. It is probably fair to say that the Bayesian approach was
largely introduced and led by Neal (1996) and Mackay (2003); the latter of these
references must be regarded as a classic of the machine learning literature in general.
Despite its versatility, it would be unfair to say that the MLP was the only archi-
tecture that proved useful for engineering. Another popular feedforward architecture
was the radial-basis function (RBF) neural network (Broomhead and Lowe 1988).
Another popular idea of the time was the merging of neural and fuzzy concepts (Brown
and Harris 1994); one well-known neuro-fuzzy architecture was the ANFIS network;
however, this actually boiled down to an RBF network on closer inspection (Worden
et al. 2011). Another influential paradigm was the self-organizing map (SOM) of
Kohonen (1982); this was an unsupervised algorithm, which proved very powerful
for tasks like clustering and data visualization. A more “principled” version of the
SOM also appeared in the form of the generative topographic map (GTM) (Bishop
et al. 1998). Some other popular ANN architectures are discussed in the previously
mentioned (Worden et al. 2011).
The historical development of the subject will now be briefly interrupted with an
engineering illustration of the application of the MLP.

2.4 A First-Age Case Study: Structural Monitoring of an Aircraft Wing

The case study presented here is a perfect example of the use of MLPs in an engineer-
ing problem. All of the ideas discussed in the previous section were exposed in the
course of the study. The study was made possible by Qinetiq Ltd., who allowed use
of a Gnat trainer aircraft for structural health monitoring (SHM), which was studied
over a period of 3 years. A detailed discussion of the whole program can be found in
the sequence of papers (Worden et al. 2003; Manson et al. 2003a, b). The case study
here relates mainly to Manson et al. (2003b), in which the aim of the exercise was
to locate damage within the starboard wing of the aircraft (Fig. 2.5).
A condition of the experiments was that the wing should not suffer actual damage.
It was therefore decided to simulate damage by sequentially removing a number of
inspection panels on the wing. Unlike real damage, this approach had the distinct
advantage that each damage scenario was reversible and it would be possible to
monitor the repeatability of the measurements. Of the various panels available, nine
were chosen, mainly for their ease of removal and also to cover a range of sizes.
These panels were distributed as shown in Fig. 2.6. The areas of the panels ranged

Fig. 2.5 Starboard wing of the Gnat aircraft used in the SHM case study

Fig. 2.6 Schematic of the starboard wing of the Gnat aircraft, showing the positions of the inspection
panels (not to scale)

from 0.00825 m² to 0.08 m², so their removal actually constituted quite large damage.
Panels P3 and P6 were always likely to give the most difficulty for a damage-detection
procedure because they were by far the smallest. It is important to note that the
fixing conditions of the panels were important. Each panel was fixed to the wing by
a number of screws, the numbers varying between 8 and 26. On some of the panels,
screws were missing as a result of damaged threads in the holes. In fact, during the
repeated removal of the plates, further holes were damaged. This damage meant that
there was some variation throughout the test, even for normal condition (all panels
attached). An attempt was made to control this variation by using a torque-controlled
screwdriver.

The fact that there are a discrete number of damage locations meant that the
problem was one of classification, i.e. the neural network should assign the label
of the damaged panel, from 1 to 9. Thus, the classifier was trained using the 1
of M strategy, as discussed in Sect. 2.3.4. Another critical choice for the exercise
was that of the neural network inputs; these would need to be features sensitive to
the nature of the problem, they would need to be carrying information about the
location of damage. The features would need to be derived from sensors, placed on
the wing. It was decided to use accelerometers to record the data, in accordance with
common practice in vibration-based SHM, and in structural dynamics generally. An
electrodynamic shaker placed under the wing was used to excite vibrations, using a
white-noise source. In order to use sensors effectively, the panels were split into three
groups A, B, and C. Each group was allocated a centrally placed reference sensor,
together with three other sensors, each associated with a specific plate. The sensor
layout was as shown in Fig. 2.7.
The accelerometers produced many thousands of samples of data; far too many to
use as direct inputs to the MLP. For reasons discussed in the last section, the number of
weights in the ANN determines how much training data are needed; a very large input
layer means a very large number of weights, and thus an infeasible amount of training
data. Previous experience in vibration-based SHM had shown that transforming the
data into the frequency domain gives a reduction in the dimension of the features
and can concentrate damage-related information. For this reason, transmissibilities
were computed for each panel. The spectra—Fourier transforms of the time data—
were computed; the transmissibility for each panel was then obtained by taking
the ratio of the panel spectrum and the spectrum of the reference sensor of the
group. The transmissibilities are characterized by sharp peaks at certain frequencies,
which move when damage occurs. Following an exhaustive visual inspection, small
ranges of frequencies around the most sensitive peaks were chosen. These ranges still
contained too many spectral lines to give a sensible input dimension, so the ranges
for each panel were converted to scalar values using ideas from outlier analysis, as
discussed in detail in Manson et al. (2003b). This feature extraction and selection
process finally led to nine input features for the neural network. As the output layer
of the network was fixed at nine neurons by the 1 of M approach, only the number
of hidden units remained to be determined.
A sequence of tests was carried out in which each panel was removed in turn. In
between each group of three panel tests, a “normal” condition (no panels removed)
was taken. The sequence was then repeated, so that it was possible to monitor consis-
tency of results. This program of tests took 3 days and still did not produce copious
amounts of data; 200 patterns were obtained for each damage state and 700 patterns
were obtained for the normal condition. After dividing the data into training, vali-
dation, and testing sets, 66 patterns were available for each panel; the datasets thus
had 594 patterns.
In the first analysis carried out on the data, as discussed in Manson et al. (2003b),
the multi-layer perceptron (MLP) neural network was trained on the feature data from
the visual selection. The results were quite good, and a classification probability of

Fig. 2.7 Schematic of the starboard wing of the Gnat aircraft, showing the positions of the sensor groups

error of 0.135 was obtained on the test data, with the confusion matrix C, as presented
in Table 2.2.
The confusion matrix is a common (and very effective) means of displaying the
results of a classification exercise; it is quite simple in construction. One begins
with an empty matrix; as each testing pattern is presented, the entry at the (i, j)
location is incremented if the pattern is from true class i, but is assigned to class j. A
perfect confusion matrix would thus be diagonal; any off-diagonal entries represent
confusion between classes.
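
The construction is simple enough to state in a few lines of code; the sketch below assumes integer class labels starting from zero, which is purely a convention of the example.

import numpy as np

def confusion_matrix(true_labels, predicted_labels, n_classes):
    # Entry (i, j) counts test patterns of true class i assigned to class j;
    # a perfect classifier therefore gives a purely diagonal matrix.
    C = np.zeros((n_classes, n_classes), dtype=int)
    for i, j in zip(true_labels, predicted_labels):
        C[i, j] += 1
    return C

# e.g. for the nine-panel problem (classes labelled 0 to 8):
# C = confusion_matrix(y_true, y_pred, 9)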

Table 2.2 Confusion matrix for neural network classifier for aircraft wing damage location—testing
set
Prediction 1 2 3 4 5 6 7 8 9
True class 1 62 1 0 0 2 0 0 1 0
True class 2 0 61 0 0 5 0 0 0 0
True class 3 0 1 52 0 7 4 0 2 0
True class 4 1 0 3 60 0 1 0 1 0
True class 5 2 1 0 0 60 3 0 0 0
True class 6 2 0 6 0 8 52 0 0 0
True class 7 1 0 4 0 1 1 58 1 0
True class 8 0 0 0 0 1 1 0 62 2
True class 9 2 1 1 0 0 0 0 15 47

Although the neural network has done a good job here—it is correct 86.5% of
the time—the results still leave a little to be desired. The main aim of SHM is to
advise owners and operators of structures, so that they can make decisions; significant
consequences in terms of safety or cost may occur if those decisions are incorrect.
In many cases, the power of the algorithm used for classification is not the main
issue; in many engineering problems, the really critical choice is of the features. If
very effective features are chosen, even a simple classifier might produce excellent
or perfect results.
In the current case study, the first set of features was chosen on the basis of
(somewhat-involved) engineering judgement. This choice left open the question of
what performance might be obtained if the features were formally optimized. This
exercise was carried out in Worden et al. (2003), with a genetic algorithm (GA) used
for optimization. Even the optimization approach began with a subjective element. As
part of the original study, the authors of Manson et al. (2003b) sorted the features for
classification into groups termed strong, fair, and weak; the criteria are described in
detail in the original paper. With some subtleties to deal with (not discussed here), the
selection process here made an initial restriction to only strong or fair features, which
resulted in 44 candidates. The final selection stage used an integer-coded GA to select
subsets of features which minimized the probability of misclassification. The results
from the GA were excellent, the optimal MLP had a probability of misclassification
of 2.69% on the independent test set; the confusion matrix is given in Table 2.3.
There are a total of 16 misclassifications—eight of them for Panel 6, one of the two
(equal-)smallest panels.

Table 2.3 Confusion matrix for neural network classifier for aircraft wing nine-panel damage-
location problem—testing set
Prediction 1 2 3 4 5 6 7 8 9
True class 1 65 0 0 0 0 0 0 0 1
True class 2 0 64 0 2 0 0 0 0 0
True class 3 0 0 64 1 0 1 0 0 0
True class 4 1 0 0 65 0 0 0 0 0
True class 5 0 0 0 0 66 0 0 0 0
True class 6 1 4 0 1 0 58 0 1 1
True class 7 0 0 0 0 0 0 66 0 0
True class 8 0 0 0 0 1 0 0 65 0
True class 9 0 0 1 0 0 0 0 0 65

2.5 The Second Age—Deep Learning

Despite the power and versatility of the MLP algorithm and its position as the “go to”
neural network in many engineering applications, the machine learning community
was still struggling with some of the problems mentioned earlier, like image and
speech recognition; problems for which the brain had proved superior. In many ways,
the MLP had distanced neural networks from their biological origins by emphasizing
its properties as a nonparametric function approximator. Most applications used a
single hidden layer, as it had been established that this was sufficient to make the MLP
a universal approximator; the resulting architecture is very shallow. The problem is
that shallow learning is not how the brain works; the brain is densely interconnected,
massively parallel, and not shallow. The solution to the problem seemed obvious;
why not add more hidden layers and create a deep network?
Although the move to deeper networks appeared to be a simple strategy, it proved
to have serious technical problems in terms of training. If the MLP architecture is
simply made deeper, a serious problem soon arises in terms of the backpropagation
algorithm. Backpropagation has problems in deep networks, because the computed
errors, and thus gradients, become unreliable as they are passed further and further
backward; gradients can vanish or diverge (explode).
Because of the backpropagation problem, the first deep neural networks—
sometimes called deep belief networks—used different architectures. One of the first
successful structures was that of Hinton et al. (2006), which consisted of a series of binary
stochastic nodes arranged in layers. Hinton’s solution to training a deep network was
to train it one layer at a time—a “greedy” strategy. Each pair of layers was known
as a Restricted Boltzmann Machine (RBM), and a training procedure already existed
for this; Hinton’s innovation was to train the layers sequentially. Another notable
characteristic of deep belief nets was that learning was made “biological” again; in
fact, Hinton even wrote of “looking into the mind of the network”.

Fig. 2.8 Untidy kitchen

Other candidate architectures emerged, mostly based on greedy training of single
layers of weights, an excellent historical review can be found in Schmidhuber (2015);
another good reference is Goodfellow et al. (2016). For the remainder of this section,
the discussion will be focused on one specific architecture—the Convolutional Neural
Network (CNN)—which appears to have emerged as a general-purpose algorithm,
and can be thought of as a generalization of the MLP.
The problems with deep networks are not restricted to the issues with gradient
estimation, another serious problem arises in terms of generalization. More layers
mean more weights; more weights mean more training data. Consider a problem
in image recognition: suppose one wished to “interpret” images from a four mega-
pixel camera. Assuming black and white images for simplicity, an MLP would need
4 × 10^6 neurons in the input layer. Conservatively assuming 1000 hidden units, the
first layer alone would need 4 × 10^9 weights. The previously mentioned rule of
thumb says that one would need 4 × 10^10 pieces of training data! Even if such data
were available, the network would take an infeasible time to train. The solution to
this problem is to look to biology/physiology.
The image used for illustration here will be the “untidy kitchen” image shown in
Fig. 2.8.
The discussion will focus on a specific image interpretation problem—that of
detecting a specific object within the image. In this case, the object of interest will

Fig. 2.9 Object of interest in the untidy kitchen—“Robin”

be the small “Robin” figure, visible on top of the coffee machine, to the left of the
image (Fig. 2.9). This is quite a small object; it is also comparable in size to many
of the other objects in the image.
Detecting the object is quite simple, most human observers will find Robin almost
immediately; the interesting question is how. What is almost certain is that people
searching for objects do not project the entire image onto the retina and then parse the
full image. This is because people are aware that the pattern of interest is local. The
usual search strategy would be to scan the eyes carefully across the image until Robin
comes into the receptive field, then parse that subimage to see if “Robin present” fires.
This is how a modern architecture like a convolutional neural network (CNN) works.

2.5.1 Convolutional Neural Networks (CNNs)

It will prove useful now to “abstract” the problem, perhaps simplifying in the process.
First, suppose that the image is represented by a 2000 × 2000 binary (B/W) array;
further suppose that the retina is a rectangular array of 100 × 100 photoreactive cells.
It is now possible to formalize the search strategy. It will be assumed that the image
can be indexed in terms of indices (i, j), which specify the position of a pixel in the

Fig. 2.10 Retinal window/subimage containing “Robin”

image “array”. The search can start by placing the retina “window” at the top-left
corner of the image at the point (1, 1), and activating the cells of the retina, with
inputs from the window. One then moves to (1, 2) on the image and looks again,
and so on until the retina is at (1, 1901). The scan then moves to the second row of
pixels and starts again at (2, 1). This process is repeated until the retina is positioned
at (1901, 1901), at which point, the image has been covered. At each “look”, the
subimages are passed to a recognition “layer” to see if Robin is present.
In terms of the mathematical abstraction of the problem, one can associate a
weight w_{ij} with each cell of the retina. One can then either: (a) try and train these
weights to produce an output 1 or 0 (Robin present or not present) or (b) pass on the
activations from each scan to another recognition layer. In any case, each position
(top-left pixel is (k, l)) of the retina will produce an activation,

z_{kl} = \sum_{i} \sum_{j} w_{ij} \, x_{k+i-1,\, l+j-1}    (2.14)

where x_{ij} is the array referencing the original image.


For option (a), one would need to train the weights to give z_{kl} ∈ [0, 1]. This option
would be unusual, but if it were possible, one would only have to train 10^4 weights
now; a massive improvement on 10^9. This gain is basically from weight sharing.
If one were to go with option (b), there are now (2000–100) × (2000–100) inputs
for the next layer, so the recognition layer would have almost the same problem as
before. However, a simple way to reduce the number of activations is to step the
retina over n ≥ 1 pixels between each look. This parameter n is called the stride in
the CNN literature. Another way to reduce the number of activations would be to
subsample them.
Suppose that the receptive field (or retina) was seeing the window shown in Fig.
2.10.
The first observation one might make is that a stride n > 1 will certainly catch
Robin. Secondly, in this case, the receptive field is much bigger than the Robin figure;
so many of the z_{ij} in this neighborhood will carry “Robin present” information;
however, one only needs one confident detection. This observation suggests that one
can get away with downsampling the activations passed on to the next layer. Suppose
that one could downsample by a factor of 100; then the total number of activations
passed on for the recognition layer is 19 × 19 = 361; this represents another huge
reduction. Note that downsampling is not the same as setting a higher stride; one can
use a combination of both, in practice.
Consider the action of the first layer again,

z_{kl} = \sum_{i} \sum_{j} w_{ij} \, x_{k+i-1,\, l+j-1}    (2.15)

One can compare and contrast this with how a filter works on a time series (where
x_i denotes the input series and y_i the output),

y_i = \sum_{j} h_j \, x_{i-j}    (2.16)

i.e. the filter coefficients are convolved with the input. It should now be clear that
the action of the moving retina in generating the next layer inputs is thus a two-
dimensional convolution; the structure is thus termed a convolution layer, and the
weights on the retina are revealed to be filter coefficients. Combinations of such
layers (with downsampling) are the basic form of what are called convolutional
neural networks (CNNs).
In terms of terminology: the convolution filter is sometimes called a convolution
kernel. As defined, the action of the kernel is different on points at the edge of the
image, to the action on central points. If this is considered an issue, or one simply
wants to produce an “image” of the same size after convolution, the input image to a
layer can be padded with zeros, so that each image point is at the center of the kernel.
A simple consequence of this condition is that the kernel window sizes must be odd.
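The scanning operation can be written down directly; the following is a minimal numpy sketch of a single convolution-layer activation in the sense of Eq. (2.15), with a stride and optional zero padding. The kernel weights would of course be learnt in practice, and the sizes used in the usage line are assumed purely for illustration.

import numpy as np

def conv2d(image, kernel, stride=1, pad=0):
    # Slide the kernel (the "retina") over the image; each position produces
    # a weighted sum of the pixels under the window, as in Eq. (2.15).
    if pad > 0:
        image = np.pad(image, pad)          # zero padding around the image
    m, n = kernel.shape
    H = (image.shape[0] - m) // stride + 1
    W = (image.shape[1] - n) // stride + 1
    z = np.zeros((H, W))
    for k in range(H):
        for l in range(W):
            window = image[k * stride:k * stride + m, l * stride:l * stride + n]
            z[k, l] = np.sum(kernel * window)
    return z

# e.g. a 200 x 200 image scanned by a 5 x 5 kernel with stride 2 and padding 2
z = conv2d(np.zeros((200, 200)), np.ones((5, 5)), stride=2, pad=2)
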
One might now reasonably ask the question—why multiple layers? Clearly, the
approach depends on the retina/filter/convolution kernel being the right size to detect
Robin, and for the relevant information to survive downsampling. The important
point is that the architecture must learn; on some training set, the weights are tuned,
so that a final recognition layer detects Robin or not. Consider a two-convolution-
layer system; this might learn in such a way that the first layer detects features
that suggest parts of Robin; then the second assembles the information so that the
figure is detected. If this were the case, it means that the two-layer system is doing
implicit feature extraction and selection, with no a priori insight into what the features
might be. More layers allow more possibilities for intrinsic feature extraction. In any
object detection problem, it is well known that the algorithm must be insensitive
to the scale and orientation of the object of interest; CNNs thus offer the attractive
possibility that effective features can be derived without engineering, or domain-
specific, insight. This viewpoint—like many starting from a position of ignorance—
can be quite dangerous; at the very least, it can lead to poor research. At the beginning
of the first age of neural networks, many people hoped that the blind use of MLPs,
using software downloaded from the Internet, would allow the solution of all their
problems. In fact, blind usage of MLP software led to such disappointing results that

Fig. 2.11 Samples of the hand-written digits in the MNIST database

the initial neural network “bandwagon” within the engineering community led quite
quickly to a “backlash”. The main problems associated with the earlier papers were
often associated with sparsity of training data and sub-optimal (or just plain wrong)
determination of hyperparameters. The second age of neural networks—based on
deep architectures—led very quickly to a “bandwagon-backlash” transition in the
engineering community. As before, the way around the problem was to learn how
the deep architectures worked and to tune their parameters carefully. In order to see
what sort of “parameters” are involved, it is useful to look at a little more history.

2.5.2 A Little More History

In reality, the CNN architecture was basically introduced in 1979 by Fukushima,
in the form of the Neocognitron (Fukushima 1979). The modern CNN is largely
based on the same concept; however, it has evolved into its current form with some
smallish but important developments. The first major development was in training;
in 1989 Le Cun et al. developed a backpropagation algorithm for weight-sharing
convolutional layers (LeCun et al. 1989, 1990, 1998). This work also introduced
the MNIST data set; one of the most famous ML benchmarks. The MNIST dataset
comprises a number of images of hand-written digits, the recognition problem is to
associate the correct digit label with the given image. A set of samples of MNIST
patterns is given in Fig. 2.11.
A number of competitions emerged in the ANN community, based on certain
standard datasets; the MNIST dataset is one of the most enduring. In fact, the weight-
sharing backpropagation-training CNN strategy of LeCun et al. set a new MNIST
record with an error of 0.4% (Simard et al. 2003); this can be compared to the pre-
vious best MLP error of 0.7%, Simard et al. (2003). These differences may appear
to be small, but the problem is difficult and reductions in the error tend to reflect
the introduction of new ideas or technology. The advance in the BP-CNN architec-
ture was not just a matter of the weight-sharing algorithm, but also exploited GPU
cards to speed up processing and used ideas of max pooling. The point of the latter
advance is that a downsampling “layer” in a CNN doesn’t have to decimate; it can
pass on summary statistics over a window. This is the idea of max pooling; rather
than decimating on subregions, the downsampling “layer” passes on the maximum
activation in the subregion. In 2009, a CNN-MP architecture helped win the first
official competition (although for recurrent neural networks). In 2010, a standard BP
network won back the MNIST record by using GPUs, and set the lowest error at
0.35%; this was really only because the GPUs speeded up computation by a factor
of 50 compared to CPUs.
A significant advance occurred in 2012 when an architecture based on CNN-
MPs, using GPUs and ensemble methods, achieved 0.2% on MNIST. This is human
competitive and represented the first major drop in around 10 years. It is observed
by Schmidhuber (2015) that these advances were not the result of significant new
technology, but relied on clever use of the CNN-MP architecture and availability of
very fast computing options. Another significant tweak which occurred around this
time was a change in the network transfer function; in 2011, the Rectified Linear
Unit (ReLU) was introduced,

f(x) = \begin{cases} x, & x > 0 \\ 0, & x \le 0 \end{cases}    (2.17)

which enables very fast evaluation, and further speeded up computation (Nair and
Hinton 2010). Perhaps the final ingredient in the CNN palette was introduced in
2012, with the idea of dropout, which simply removes units during training, acts as a
regulariser, and improves generalization. An idea of the synthesis of these ingredients
is given in Fig. 2.12. Setting aside the MNIST data, it was observed in Schmidhuber
(2015) that at the time (2014), most feedforward competition-winning deep NNs
were (ensembles of) GPU-MPCNNs.
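
For completeness, minimal numpy sketches of the ReLU activation of Eq. (2.17) and of a max-pooling operation are given below; the pooling window size is an assumed parameter, and the implementation ignores edge effects when the activation map does not divide evenly.

import numpy as np

def relu(x):
    # Rectified Linear Unit, Eq. (2.17): passes positive activations unchanged.
    return np.maximum(x, 0.0)

def max_pool(z, size=2):
    # Rather than decimating, pass on the maximum activation within each
    # non-overlapping (size x size) subregion of the activation map z.
    H, W = z.shape[0] // size, z.shape[1] // size
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.max(z[i * size:(i + 1) * size,
                                 j * size:(j + 1) * size])
    return out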

2.5.3 Other Recent Developments

As mentioned above, one of the first successful attempts at designing deep net-
works was Hinton’s construction in terms of layers of restricted Boltzmann machines
(RBMs) with a greedy layerwise training algorithm (Hinton et al. 2006). It would
be unfair not to point out that other similar strategies at the time also worked. One
approach built an architecture from a stack of autoencoder networks; this was inter-
esting from the point of view that the autoencoders could be standard MLPs. However,
in terms of recent developments, the rest of this section will be concerned with novel

Fig. 2.12 Synthesis of ingredients in a modern convolutional neural network architecture.
From: Krut Patel: https://towardsdatascience.com/mnist-handwritten-digits-classification-using-a-convolutional-neural-network-cnn-af5fafbc35e9

architectures which can be built from CNNs (among other components); the reader
interested in other deep ANN architectures can consult (Schmidhuber 2015).

2.5.3.1 Transfer Learning

Transfer Learning is emerging as one of the most significant technologies from
machine learning of the last couple of decades (Yang et al. 2020). The basic idea
is this; suppose that one is trying to solve a problem using machine learning, but is
hindered (or even stalled), by the scarcity of training data. Suppose further that data
are available in large quantities for a similar problem. Transfer learning is concerned
with improving performance on the new problem, by leveraging the data on the old
problem. It will be useful to give a concrete example here, as it will also explain how
deep ANNs can be used to solve the problem.
Suppose that one has a source problem of training an image classifier which is to
distinguish between cats and dogs. Furthermore, suppose that copious amounts of
training data are available. In this case, one could in principle train a deep CNN, like
the one in Fig. 2.12, to solve the problem. Because the data are so abundant, there are
no problems of overfitting, etc., so one can imagine arriving at network with 100%
classification accuracy. This is all well and good; however, suppose that one is pre-
sented with a new target problem—distinguishing sheep and goats—for which very
little training data are available. How might one exploit the initial (highly successful)
cat/dog network, to help solve the sheep/goats problem? The answer to this question
lies in a hypothesis stated earlier; it was proposed that, in a deep recognition network,
the earlier layers might carry out basic feature-extraction tasks, common to image-
processing problems, while the later layers would adjust for more context-specific
features of the problem. For example, the earlier layers might transform the images
into a representation, where the main variations resulting from object scale, posi-
tion, and orientation have been projected out. If this were the case, the earlier layers
of the cat/dog network might be very similar to the earlier layers of a sheep/goats
network, if appropriate training data were available. This observation leads to an
immediate strategy for transfer learning; one proposes a similar CNN structure in
the target problem as the source problem, but then simply copies across and freezes
the weights for the lower layers. If enough sheep/goat data are available, the later
layers in the target network are then trained for the specific sheep/goat problem. This
strategy is referred to as fine-tuning; it has proved to be such a successful strategy
that it is easily implemented in modern machine learning packages like Tensorflow.
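As an indication of how little code the strategy requires, a hedged sketch of fine-tuning in tf.keras is given below. The ImageNet-pretrained ResNet-50 stands in for the "source" network, and the two-class head and the sheep/goat arrays (target_images, target_labels) are hypothetical names introduced only for the example.

import tensorflow as tf

# "Source" network: ResNet-50 trained on ImageNet, used as a frozen feature extractor.
base = tf.keras.applications.ResNet50(weights="imagenet", include_top=False,
                                      pooling="avg", input_shape=(224, 224, 3))
base.trainable = False   # copy across and freeze the earlier, generic layers

# New "target" head, trained on the scarce sheep/goat data.
model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(2, activation="softmax"),   # 1 of M output, M = 2
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(target_images, target_labels, epochs=10, validation_split=0.2)
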
Although the fine-tuning approach to transfer has been applied in image and
language processing, it does not yet appear to have found great many successful
applications in engineering. One reason for this lack of interest may be the lack of
appropriate source problems in given contexts. This situation produces studies like
the one in Chen et al. (2020); in this case, the “source” network is Res-Net-50. This
network is a 50-layer CNN which has been trained to carry out image recognition on
the ImageNet database, which contains over a million images. The authors of Chen
et al. (2020) chose a “target” problem of classifying fault states of a bearing; in order
to do this, they first had to transform the raw time data into an “image” format, using
the continuous wavelet transform. The fine-tuning approach was then used to train
the last few layers on target data from the bearing database. Although the results
showed a good classification accuracy, there is something a little unsettling about
the disconnect between the source and target problems. A final judgement on fine-
tuning as a means of transferring engineering problems should probably wait for more
evidence. Alternative means of transfer learning have been applied in engineering in
a more intuitive manner; one of the most prevalent appears to be domain adaptation,
which has been applied to problems in SHM with success; one example is given in
Gardner et al. (2020).

2.5.3.2 Generative Adversarial Networks

A very recent approach to machine learning is the Generative Adversarial Network
(GAN) (Goodfellow et al. 2014). This entirely novel algorithm was initially created
to generate synthetic images that look real, i.e. the model learns how to embed figures
into some latent space and simultaneously how to generate data according to a proper
distribution. Apart from the main goal of the algorithm, a new way of training neural
networks was introduced. Adversarial training is framed as a competition between
two neural networks (which are often MLPs or CNNs). In the basic GAN, the first
network is the generator, which tries to generate samples that look real, and the
second is the discriminator, which tries to identify whether a sample comes from the
real dataset, or is artificial.
Training is orchestrated as a competition between the two networks. The discrim-
inator D is a network with an output representing the probability of the input sample
x being real, i.e. P(x ∼ p_{data}) = D(x). Throughout training, real and fake samples are
introduced to the discriminator and, using backpropagation, it becomes better at dis-
tinguishing samples from the real dataset from generated/artificial samples. On the

Fig. 2.13 Layout of a basic GAN; the generator transforms noise into generated samples and the
discriminator attempts to distinguish between real and generated samples

other hand, the generator network G takes as input a noise vector z from some pre-
defined probability distribution p_z(z) and creates a sample G(z) in the feature space
of the dataset. Thereafter, the sample is passed through the discriminator in order to
classify whether it is real or generated. The probability of a generated sample being
real is given by D(G(z)). Forcing the generator to create samples that “fool” the
discriminator into classifying them as real (i.e. minimization of log(1 − D(G(z))))
results in creating samples/images that look real. The objective function V (D, G)
for the training of both networks is given by

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]    (2.18)

where E[ ] is the mean value of the quantity within the brackets.


The layout of a basic GAN is shown in Fig. 2.13. In practice, training is performed
in two steps. During the first step, a batch of samples is randomly sampled from the
dataset along with an equally large batch sampled from the generator. Training of
the discriminator only is then performed using as target labels 1 (real) for the dataset
samples and 0 (fake) for the generated samples. During this step, the discriminator is
trained to better distinguish real from fake samples. In the second step, the generator
is trained, while the parameters of the discriminator are held constant. For this step,
a batch of noise vectors are sampled and the two networks are connected together as
shown in Fig. 2.13. The second term of Eq. (2.18) is used alone for training and the
target labels for the output of the discriminator are 1s, meaning that the generator
should transform the noise vectors into samples that the discriminator accepts as real.
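
The two training steps translate fairly directly into code; the sketch below uses tf.keras, with simple fully connected networks standing in for the generator and discriminator. The batch size, latent dimension and data dimension are assumed values chosen only for illustration.

import tensorflow as tf

batch_size, latent_dim, data_dim = 64, 32, 64   # assumed sizes

# Hypothetical generator and discriminator; MLPs stand in for whatever
# architectures (MLPs or CNNs) would be used in practice.
generator = tf.keras.Sequential([
    tf.keras.Input(shape=(latent_dim,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(data_dim, activation="tanh"),
])
discriminator = tf.keras.Sequential([
    tf.keras.Input(shape=(data_dim,)),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # D(x): probability of "real"
])

bce = tf.keras.losses.BinaryCrossentropy()
d_opt = tf.keras.optimizers.Adam(1e-4)
g_opt = tf.keras.optimizers.Adam(1e-4)

def train_step(real_batch):
    # Step 1: update the discriminator on real (label 1) and generated (label 0) samples.
    fake_batch = generator(tf.random.normal((batch_size, latent_dim)))
    with tf.GradientTape() as tape:
        d_loss = (bce(tf.ones((batch_size, 1)), discriminator(real_batch)) +
                  bce(tf.zeros((batch_size, 1)), discriminator(fake_batch)))
    grads = tape.gradient(d_loss, discriminator.trainable_variables)
    d_opt.apply_gradients(zip(grads, discriminator.trainable_variables))

    # Step 2: update the generator with target label 1, the discriminator held fixed,
    # so that G is pushed to produce samples for which D(G(z)) approaches 1.
    z = tf.random.normal((batch_size, latent_dim))
    with tf.GradientTape() as tape:
        g_loss = bce(tf.ones((batch_size, 1)), discriminator(generator(z)))
    grads = tape.gradient(g_loss, generator.trainable_variables)
    g_opt.apply_gradients(zip(grads, generator.trainable_variables))
    return d_loss, g_loss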
GANs can be used in engineering to generate artificial data in cases where acquir-
ing data from real structures or systems is expensive; this aspect of GANs is useful
and reduces the cost of acquiring data. GANs may also be used in other ways, to
learn mappings from one space to another. The paper (Tsialiamanis et al. 2022)
uses a variant of the GAN—the cycle GAN—to carry out nonlinear modal analy-
sis by transforming response data from a structure into independent variables. The
details of the study cannot be covered here, but some of the results can be shown.
The problem was to extend linear modal analysis to nonlinear structures and allow
multi-degree-of-freedom nonlinear systems to be transformed into decoupled single-
degree-of-freedom systems. The measure of success adopted was to check that the
spectra of the transformed variables had single peaks (resonances). The results here

Fig. 2.14 Power spectral densities (PSDs: Y_1(ω), Y_2(ω), Y_3(ω)) from three-floor experimental
structure; these correspond to the physical coordinates (y_1(t), y_2(t), y_3(t))

Fig. 2.15 PSDs of three-floor structure (denoted by U in the transformed domains); POD (linear,
black) analysis and GAN (red) analysis

are shown for data from a highly nonlinear three-story model shear building. Figure
2.14 shows the spectra of the physical (directly measured) variables; the system is
very clearly multi-modal. Figure 2.15 (black) shows the result of the optimal lin-
ear transform (the principal orthogonal decomposition); the variables are still highly
coupled. In contrast, Fig. 2.15 (red) shows the GAN-transformed variables, which
(according to the specific criterion) appear to be perfectly decoupled.

2.6 Conclusions

Given the nature of this article, lengthy conclusions are not warranted. This paper
has discussed the history of artificial neural networks in the context of their uptake in
engineering problems. In this respect, that history appears to naturally fall into three
“ages”; pre-history, the first age and the second age. The period of pre-history deals
with developments up to the mid-1980s; over this period, ANNs were developed
from their very simplest forms—a single neuron—into a number of versatile archi-
tectures, which could be used to solve engineering problems. Foremost among these
architectures was the multi-layer perceptron (MLP), which proved to be of such great
general use, that it could be argued to have initiated the “first age”. This stable period
covered a time when the MLP proved to be the “go-to” algorithm for the engineering
community, and—apart from developments in the training algorithm—it required no
serious modification. However, although the “shallow” MLP was excellent in many
engineering problems, it proved deficient in some of the bigger machine learning
problems associated with image and speech processing and natural-language pro-
cessing. The response from the computer science community was to move toward
deeper neural networks; the development of the required technology led to the sec-
ond ANN age—of deep learning. As before, the engineering community appeared to
gather around another general-purpose architecture—the convolutional neural net-
work (CNN)—which also soon stabilized in terms of its structure and learning rules.
The three ages of ANNs are discussed here, along with a case study illustrating an
engineering application of MLPs. Finally, the paper discusses two new paradigms
for learning which can incorporate CNNs, but open up new pathways for engineering
applications.

Acknowledgements The authors would like to acknowledge the support of the UK EPSRC via
the Programme Grants EP/R006768/1 and EP/R004900/1. For the purpose of open access, the
authors have applied a Creative Commons Attribution (CC-BY-ND) licence to any Author Accepted
Manuscript version arising.

References

Abeles M (1991) Corticonics–neural circuits of the cerebral cortex. Cambridge University Press,
Cambridge
Bishop CM (1994) Training with noise is equivalent to Tikhonov regularization. Neural Comput
7:108–116
Bishop CM (2013) Pattern Recognition and machine learning. Springer, Berlin
Bishop CM, Svensém M, Williams CKI (1998) Developments of the generative topographic map-
ping. Neurocomputing 21:203–224
Broomhead DS, Lowe D (1988) Multivariable functional interpolation and adaptive networks.
Complex Syst 2:321–355
Brown M, Harris CJ (1994) Neuro fuzzy adaptive modelling and control. Prentice Hall
Bryson A, Denham W, Dreyfuss S (1963) Optimal programming problem with inequality con-
straints. I: Necessary conditions for extremal solutions. AIAA J 1:25–44
Chen Z, Cen J, Xiong J (2020) Rolling bearing fault diagnosis using time-frequency analysis and
deep transfer convolutional neural network. IEEE Access 8:150248–150261
Churchland PS, Seknowski TJ (2017) The computational brain. MIT Press
Cybenko G (1989) Approximation by superpositions of sigmoidal functions. Math Control, Signals
Syst 2:303–314
Fukushima K (1979) Neural network model for a mechanism of pattern recognition unaffected by
shift in position—Neocognitron. Trans IECE J62-A:658–665
Gardner PA, Liu X, Worden K (2020) On the application of domain adaptation in structural health
monitoring. Mech Syst Signal Process 138:106550
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press
Goodfellow I, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y
(2014) Generative adversarial nets. In: Advances in neural information processing systems, pp
2672–2680
Haykin S (1994) Neural networks: a comprehensive foundation. Macmillan College Publishing Company
Hebb DO (1949) The Organisation of Behaviour. Wiley, New York
Hinton GE, Osindero S, Teh YW (2006) A fast learning algorithm for deep belief nets. Neural
Comput 18:1527–1554
Hopfield JJ (1984) Neural networks and physical systems with emergent collective computational abil-
ities. Proc Natl Acad Sci, USA 52:2554–2558
Hopfield JJ, Tank DW (1985) Neural computation of decisions in optimization problems. Biol
Cybern 52:141–152
Kohonen T (1982) Self-organized formation of topologically correct feature maps. Biol Cybern
43:59–69
LeCun Y (1986) Learning processes in an asymmetric threshold network. Disordered systems and
biological organisations. Les Houches, France, Springer, pp 233–240
LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD (1989) Back-
propagation applied to handwritten zip code recognition. Neural Comput 1:541–551
LeCun Y, Boser B, Denker JS, Henderson D, Howard RE, Hubbard W, Jackel LD (1990) Hand-
written digit recognition with a back-propagation network. Proceedings of advances in neural
information processing systems 2:396–404
LeCun Y, Bottou L, Bengio Y, Haffner P (1998) Gradient-based learning applied to document
recognition. Proc IEEE 86:2278–2324
Luo FL, Unbehauen R (1998) Applied neural networks for signal processing. Cambridge University
Press
Mackay DJC (2003) Information theory, inference and learning algorithms. Cambridge University
Press, Cambridge
Manson G, Worden K, Allman DJ (2003) Experimental validation of structural health monitoring
methodology II: novelty detection on an aircraft wing. J Sound Vib 259:363–435
Manson G, Worden K, Allman DJ (2003) Experimental validation of structural health monitoring
methodology III: Damage location on an aircraft wing. J Sound Vib 259:365–385
Markou S, Singh S (2003) Novelty detection a review. Part I: statistical approaches. Signal Process
83:2481–2497
Markou S, Singh S (2003) Novelty detection a review. Part II: neural network based approaches.
Signal Process 83:2499–2521
McCulloch WS, Pitts W (1943) A logical calculus of the ideas immanent in nervous activity. Bull
Math Biophys 5:115–133
Miller P (2018) An introductory course in computational neuroscience. MIT Press
Minsky ML, Papert SA (1988) Perceptrons. MIT Press
Möller MF (1993) A scaled conjugate gradient algorithm for fast supervised learning. Neural Netw
6:525–533
Nabney IT (2001) Netlab: algorithms for pattern recognition. Springer, Berlin
Nair V, Hinton GE (2010) Rectified linear units improve restricted Boltzmann machines. In: Pro-
ceedings of international conference on machine learning (ICML)
Neal RM (1996) Bayesian learning in neural networks. Springer, Berlin
Press WH, Teukolsky SA, Vetterling WT, Flannery BP (1992) Numerical recipes in C. Cambridge
University Press
Ramón y Cajal S (1911) Histologie du Système Nerveux de l'Homme et des Vertébrés. Maloine,
Paris
Rosenblatt F (1962) Principles of neurodynamics. Spartan
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back propagating
errors. Nature 323:533–536
Schmidhuber J (2015) Deep learning in neural networks: An overview. Neural Netw 61:85–117
Simard P, Steinkraus D, Platt J (2003) Best practices for convolutional neural networks applied to
visual document analysis. In: Proceedings of 7th international conference on document analysis
and recognition, pp 958–963
Tarassenko L (1998) A guide to neural computing applications. Arnold
Tarassenko L, Hayton P, Cerneaz Z, Brady M (1995) Novelty detection for the identification of
masses in mammograms. In: Proceedings of the 4th international conference on artificial neural
networks. Cambridge, pp 442–447
Tsialiamanis G, Champneys MD, Wagg DJ, Dervilis N, Worden K (2022) On the application of gen-
erative adversarial networks for nonlinear modal analysis. Mech Syst Signal Process 166:108473
Vapnik V (1995) The nature of statistical learning theory. Springer, Berlin
Werbos PJ (1974) Beyond Regression: New Tools for Prediction and Analysis in the Behavioural
Sciences. Ph.D. thesis, Applied Mathematics. Harvard University
Widrow B, Hoff ME (1960) Adaptive switching circuits. IRE Wescon Conv Rec Part 4:96–104
Worden K (1997) Structural fault detection using a novelty measure. J Sound Vib 201:85–101
Worden K, Hensman JJ, Staszewski WJ (2011) Natural computing for mechanical systems research:
A tutorial overview. Mech Syst Signal Process 25:4–111
Worden K, Manson G, Allman DJ (2003) Experimental validation of structural health monitoring
methodology I: novelty detection on a laboratory structure. J Sound Vib 259:232–343
Worden K, Manson G, Hilson G (2003) Genetic optimisation of a neural damage locator. J Sound
Vib 309:529–544
Yang Q, Zhang Y, Dai W, Pan S-J (2020) Transfer learning. Cambridge University Press
Chapter 3
Gaussian Processes

T. J. Rogers, J. Mclean, E. J. Cross, and K. Worden

3.1 Introduction

When performing machine learning tasks, essentially every task can be reduced to
the learning of a functional map:

f :X→Y

When performing regression, Y is a continuous target (or set of targets); in classification or clustering, the targets are a discrete set of labels. The vast majority of machine learning literature, therefore, is concerned with how to construct the functional mapping, f, learnt from data, which holds for as yet unseen inputs.
A conceptual introduction to the Gaussian process (GP) begins by considering
some set of functions which contains an f able to model the mapping from X to
Y appropriately. In fact, for the GP, an infinite set of functions is considered which
behaves as a probability distribution over functions. In other words, the GP is con-
structed as a probability distribution, from which any sample is a continuous function
over the whole D-dimensional input space. The “learning” process in the GP is then
to determine which functions, from within the initial infinite set, are the most plau-
sible explanations for an observed set of data. The general premise of the GP is that
for new inputs, similar to those seen before, the targets will also be similar. Before
presenting the mathematical machinery which enables this learning, it can be useful
to explore the concept visually.

T. J. Rogers · J. Mclean · E. J. Cross · K. Worden (B)


Dynamics Research Group, Department of Mechanical Engineering, University of Sheffield,
Mappin Street, Sheffield S1 3JD, UK
e-mail: [email protected]


3.1.1 A Visual Introduction to Gaussian Processes

Figure 3.1 shows the progression of the GP in “learning” a function as more data
are introduced into the process. In Fig. 3.1a, the prior model of the GP is shown.
Under the prior, very little is known about the shape of the function which is to be modeled: it has a mean value of zero across the whole input domain (dark orange line) and an equal Gaussian uncertainty on either side of that zero mean (shaded orange area). It can be seen that samples from the GP (blue lines in Fig. 3.1a) are nonlinear functions filling the space of the uncertain region. An important point to note: despite having Gaussian confidence intervals around the mean, the uncertainty in the GP is not uncorrelated white noise; instead, the uncertainty is correlated for the family of functions being represented—hence, smooth function draws from the GP are seen. This is worth bearing in mind when interpreting the confidence intervals of GPs, as it can otherwise confuse the interpretation of results. It is also worth considering

that, although a finite number of samples from the function are shown in the figure, the shaded uncertain region actually represents the infinite set of possible functions which could be drawn from that GP.

(a) GP prior; (b) two observed datapoints; (c) four observed datapoints; (d) six observed datapoints; (e) eight observed datapoints; (f) ten observed datapoints

Fig. 3.1 Visual demonstration of the Gaussian process refining its estimate of a function as more data are added. Mean prediction shown in solid orange with 4σ confidence shown as shaded orange. For each GP, ten samples are shown in blue. Added data are shown by red dots
The beauty of the manner in which the GP “learns” the function of interest is seen
in frames Fig. 3.1b–f. Rather than needing to determine a large number of parameters
which govern the shape of the function, its form is discovered nonparametrically.
Remembering the basic assumption of the GP, that the output will be close to previously seen outputs at locations close to the corresponding inputs, near to where data have been observed (red dots) the GP is confident that the function value will be close to that previously seen. Visually, this manifests as a "pinching" of the uncertain region close to previously observed points; in other words, all the possible functions are forced through the already observed data points. As more data are included in the process, the model's confidence increases in regions with a higher density of observed points, e.g. one third inwards from the right-hand edge of the domain in Fig. 3.1. It is seen in Fig. 3.1b that the functions are forced through the single observed point on the right; by Fig. 3.1f, the region with five close datapoints is very well represented with low uncertainty. One other important point is revealed in this visual exploration of the GP: were one to continue adding data, the confidence of the model would continue to increase as more and more of the domain is covered
by already observed data. It can be shown that the GP will fit the function to an
arbitrary precision if more data keep being introduced, i.e. the GP has a universal
approximation property (subject to the chosen kernel being a universal kernel, see
Micchelli et al. 2006).
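The behavior illustrated in Fig. 3.1a can be reproduced in a few lines of code. The minimal sketch below (Python with NumPy) draws ten smooth functions from a zero-mean GP prior by taking a Cholesky factor of a kernel matrix evaluated on a dense input grid; it assumes the squared-exponential kernel introduced formally in the next section, and the grid, jitter, and hyperparameter values are purely illustrative rather than those used to produce the figures.

```python
import numpy as np

def squared_exponential(xa, xb, sigma_f=1.0, ell=0.3):
    """Squared-exponential kernel evaluated pairwise between two sets of 1-D inputs."""
    d2 = (xa[:, None] - xb[None, :]) ** 2
    return sigma_f**2 * np.exp(-0.5 * d2 / ell**2)

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)                      # dense grid over the input domain
K = squared_exponential(x, x)                       # prior covariance of the latent function
L = np.linalg.cholesky(K + 1e-6 * np.eye(len(x)))   # small jitter for numerical stability

# Each column of `samples` is one smooth function drawn from the zero-mean GP prior
samples = L @ rng.standard_normal((len(x), 10))
```

Conditioning on observed data, as in Fig. 3.1b–f, then amounts to replacing the prior mean and covariance with the posterior quantities given later in Eqs. (3.10) and (3.11).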

3.1.2 Gaussian Process Regression

Attention can now turn to the mathematical tool which enables the behavior of the
GP which has been seen already. Formally, the GP can be considered to be an infinite-
dimensional collection of jointly Gaussian distributed variables, any finite subset of
which are also jointly Gaussian distributed.1 For now, restricting the problem to the
case of multiple-input–single-output regression, in other words, solving problems of
the form:
y = f (x) + ε (3.1)

where y is a continuous scalar target which is an observation of the output of the


function f (·) with the D-dimensional input vector x. The observation y is corrupted
by some additive noise $\varepsilon$ which in the basic form of the GP is assumed to be i.i.d. Gaussian white noise, i.e. $\varepsilon \sim \mathcal{N}\left(0, \sigma_n^2\right)$.
As is usual in a machine learning context, it is assumed that some set of data,
observed from the process of interest, is available which may be used to develop the

1This is implied by the infinite set being jointly Gaussian but will be useful to understand how the
GP can be implemented.

model. This is commonly referred to as the training dataset; for solving the regression
problem, described in Eq. (3.1), this training set consists of N pairs of input vectors $\mathbf{x}_i$ and observed targets $y_i$, $\mathcal{D}_{tr} = \{\mathbf{x}_i, y_i\}_{i=1}^{N}$. Collecting all of these observed targets into a vector $\mathbf{y} = \{y_1, \ldots, y_N\}$ will give an N-dimensional jointly Gaussian distributed random variable which is an N-dimensional subset of the infinite-dimensional GP, i.e.

$$
\mathbf{y} \sim \mathcal{N}\left(
\begin{bmatrix} m(\mathbf{x}_1) \\ \vdots \\ m(\mathbf{x}_N) \end{bmatrix},
\begin{bmatrix}
k(\mathbf{x}_1,\mathbf{x}_1) + \sigma_n^2 & \cdots & k(\mathbf{x}_1,\mathbf{x}_N) \\
\vdots & \ddots & \vdots \\
k(\mathbf{x}_N,\mathbf{x}_1) & \cdots & k(\mathbf{x}_N,\mathbf{x}_N) + \sigma_n^2
\end{bmatrix}
\right) \tag{3.2}
$$

where $m(\mathbf{x}_i)$ is some parametric mean function and $k(\mathbf{x}, \mathbf{x}')$ is the covariance function, to be discussed shortly.
Being Gaussian distributed, the properties of y are fully defined by its first two
statistical moments, its mean and covariance. The first moment, which is the mean,
can be any parametric function of the input x, for example, linear or polynomial, and
encodes the prior belief of the modeler regarding the gross behavior of the process.
In certain modeling scenarios, this may be a very strong belief, i.e. the average
behavior with respect to the input is well understood, or—more commonly in the
machine learning setting—no prior restriction is imposed and the mean can simply be
set to zero (or some other constant value).
Moving to the second moment, the covariance of the data, while it would be
possible to compute this sample-wise, that would not yield a useful methodology.
Instead, the covariance between two observed targets yi and y j is also defined as
a function of the inputs xi and x j associated with those targets. This covariance is
computed pairwise for all combinations within the training dataset Dtr . As part of
the specification of the GP, the modeler must define a function which encodes how
the covariance between two outputs is expressed as a function of the pair of inputs
xi and x j . This function is referred to as the covariance function or synonymously
covariance kernel or simply kernel. There are certain requirements on this function
in order for it to be a valid measure of the covariance; for details, the reader is directed
toward Chap. 4 of Williams and Rasmussen (2006).
It is this covariance function which governs the family of functions which comprise
the prior of the GP. In other words, the choice of the covariance function impacts
which functions live in the infinite set from which samples are drawn. Therefore, the
choice of this function is critical to the performance of the GP. For example, if one
were to choose a linear kernel,

$$k\left(\mathbf{x}_i, \mathbf{x}_j\right) = \sigma_f^2\, \mathbf{x}_i \mathbf{x}_j^{\mathsf{T}} \tag{3.3}$$

which is the inner product of $\mathbf{x}_i$ and $\mathbf{x}_j$ scaled by some hyperparameter $\sigma_f^2$, then the set of functions which could be learnt would only contain linear regressions passing through zero—not a very expressive model.² Much more common
is to choose a more flexible and richer, general nonlinear kernel, for example, the
popular squared exponential (sometimes also exponentiated quadratic or, somewhat
confusingly, Gaussian) kernel,

$$k\left(\mathbf{x}_i, \mathbf{x}_j\right) = \sigma_f^2 \exp\left(-\frac{\left\|\mathbf{x}_i - \mathbf{x}_j\right\|^2}{2\ell^2}\right) \tag{3.4}$$

from which continuous, smooth functions are drawn (the squared exponential also
fulfills the requirements to be a universal kernel, as mentioned previously, see Mic-
chelli et al. 2006). The squared exponential is governed by two hyperparameters: a
signal variance $\sigma_f^2$ which controls the overall scaling of the function outputs, i.e. the magnitude of $f(\cdot)$; and a length scale $\ell$ which controls (colloquially) the wiggliness of the function, i.e. as this value reduces, more inflections will be seen in samples
from the GP.
However, the experience of the authors aligns with the suggestion of Stein (1999)
that the Matérn family of kernels is a good general choice for modeling collected data
from physical processes. Unlike the squared exponential, which is infinitely differ-
entiable, the Matérn family is a finite number of times differentiable which appears
to more reasonably match the data encountered in engineering problems. Matérn
kernels are additionally governed by a roughness parameter ν which is usually cho-
sen such that ν = 1/2 + p, where p is zero or a positive integer. The corresponding
functions drawn from the GP for different choices of ν will be p-times differentiable.
Popular choices of Matérn kernels are when p = 0, 1, 2 given by
p = 0:
$$k_{m1/2}\left(\mathbf{x}_i, \mathbf{x}_j\right) = \sigma_f^2 \exp\left(-\frac{\left\|\mathbf{x}_i - \mathbf{x}_j\right\|}{\ell}\right) \tag{3.5}$$

p = 1:
$$k_{m3/2}\left(\mathbf{x}_i, \mathbf{x}_j\right) = \sigma_f^2 \left(1 + \frac{\sqrt{3}\left\|\mathbf{x}_i - \mathbf{x}_j\right\|}{\ell}\right) \exp\left(-\frac{\sqrt{3}\left\|\mathbf{x}_i - \mathbf{x}_j\right\|}{\ell}\right) \tag{3.6}$$

p = 2:
$$k_{m5/2}\left(\mathbf{x}_i, \mathbf{x}_j\right) = \sigma_f^2 \left(1 + \frac{\sqrt{5}\left\|\mathbf{x}_i - \mathbf{x}_j\right\|}{\ell} + \frac{5\left\|\mathbf{x}_i - \mathbf{x}_j\right\|^2}{3\ell^2}\right) \exp\left(-\frac{\sqrt{5}\left\|\mathbf{x}_i - \mathbf{x}_j\right\|}{\ell}\right) \tag{3.7}$$

All of the Matérn kernels are also governed by the same two hyperparameters $\sigma_f^2$ and $\ell$ as the squared exponential, where they play an identical role. It can also be shown

that, as ν → ∞, the Matérn kernels converge onto the squared exponential; this is intuitive when one knows that the squared exponential is infinitely differentiable.

² However, this choice of kernel is interesting in the fact that it exactly recovers the solution to Bayesian linear regression.

Fig. 3.2 Graphical model of the Gaussian process
The squared exponential and Matérn type kernels are also examples of stationary
kernels, i.e. they are only a function of the distance between the two inputs xi and
x j . This stationarity additionally means that the kernels make the calculation of the
covariance between two data points invariant to translation and rotation in the input
space.
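For concreteness, the kernels of Eqs. (3.4)–(3.7) can be transcribed directly into code. The sketch below assumes isotropic inputs with a single shared length scale ℓ, exactly as the equations are written; the function names are illustrative choices rather than established library calls.

```python
import numpy as np
from scipy.spatial.distance import cdist

def k_se(Xa, Xb, sigma_f, ell):
    """Squared-exponential kernel, Eq. (3.4), for rows of the (N x D) arrays Xa and Xb."""
    r = cdist(Xa, Xb)                      # pairwise Euclidean distances
    return sigma_f**2 * np.exp(-0.5 * (r / ell) ** 2)

def k_matern32(Xa, Xb, sigma_f, ell):
    """Matern kernel with p = 1 (nu = 3/2), Eq. (3.6)."""
    a = np.sqrt(3.0) * cdist(Xa, Xb) / ell
    return sigma_f**2 * (1.0 + a) * np.exp(-a)

def k_matern52(Xa, Xb, sigma_f, ell):
    """Matern kernel with p = 2 (nu = 5/2), Eq. (3.7)."""
    a = np.sqrt(5.0) * cdist(Xa, Xb) / ell
    return sigma_f**2 * (1.0 + a + a**2 / 3.0) * np.exp(-a)
```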
It should be stressed, however, that determining an “optimal” choice of kernel
is a very difficult task3 (if possible at all) and certainly not one for which general
statements can be made—appropriate choice of covariance function should be the
key concern of the practitioner, before embarking on any modeling with a GP.
Thus far, only the relationship between the data within the training set $\mathcal{D}_{tr}$ has been discussed. On its own, this description is not particularly helpful, as one will hope to use the GP to make predictions for an input $\mathbf{x}_*$ with, as yet, unobserved output $y_*$, or a set of inputs $X_*$ with outputs $\mathbf{y}_*$. At this point, it will become clear why the definition
of the GP as an infinite-dimensional collection of jointly Gaussian distributed random
variables will be useful. Consider the graphical model of the GP which is shown in
Fig. 3.2. There is a collection of latent function values f n for n = 1, . . . ∞ (although
only a finite number N are assessed at any given time, the distribution itself is of
infinite dimension) which are all jointly Gaussian distributed as represented by the
fully connected undirected graph in the center of the image. The covariance of this
high-dimensional Gaussian is then computed by means of the covariance function
being evaluated pairwise for all inputs. For convenience, all the individual inputs xi

3Although some have attempted to perform automatic selection for specific problems, e.g.
Abdessalem et al. (2017).

in $\mathcal{D}_{tr}$, which are a collection of N row vectors, can be stacked vertically to form a training input matrix $X \in \mathbb{R}^{N \times D}$ for N, D-dimensional inputs. Similarly, the N scalar target observations $y_i$ can be collected into a vector $\mathbf{y} \in \mathbb{R}^{N}$. Each observation in $\mathbf{y}$ is conditionally independent, depending only on the corresponding $f_i$; this arises from the fact that $p(y_i \mid f_i) = \mathcal{N}\left(f_i, \sigma_n^2\right)$, i.e. $y_i$ is an observation of $f_i$ corrupted by i.i.d. white Gaussian noise. Given this notational simplification, Eq. (3.2) can be written as

$$\mathbf{y} \sim \mathcal{N}\left(m(X),\; K_{X,X} + \sigma_n^2 I\right) \tag{3.8}$$

where $K_{a,b}$ is the pairwise covariance matrix between the functions at inputs $a$ and $b$, e.g. $K_{X,X}$ is the auto-covariance between all the inputs in $X$, and $I$ is the identity matrix.
To make predictions, the vector of targets being considered can be expanded to include some, as yet, unobserved targets $\mathbf{y}_*$ which increases the dimension of the distribution by $N_*$—the number of prediction points—while leaving the full joint distribution Gaussian.

$$
\begin{bmatrix} \mathbf{y} \\ \mathbf{y}_* \end{bmatrix} \sim \mathcal{N}\left(
\begin{bmatrix} m(X) \\ m(X_*) \end{bmatrix},
\begin{bmatrix}
K_{X,X} + \sigma_n^2 I & K_{X,X_*} \\
K_{X_*,X} & K_{X_*,X_*} + \sigma_n^2 I
\end{bmatrix}
\right) \tag{3.9}
$$

Remembering that the variables $\mathbf{y}_*$ are unobserved, it is still possible to evaluate the mean function $m(X_*)$ and the relevant covariances $K_{X,X_*}$, $K_{X_*,X}$ and $K_{X_*,X_*}$ using only the inputs for the new prediction locations $X_*$. In order to make a prediction, the distribution of the prediction targets $\mathbf{y}_*$ conditioned on the training data $\mathcal{D}_{tr}$ and prediction inputs $X_*$ will be assessed, i.e. $p\left(\mathbf{y}_* \mid X_*, \mathcal{D}_{tr}\right)$.
Since the full joint distribution is Gaussian, the conditional of a subset of variables
can be assessed simply and in closed form. Using the standard rules of conditioning
multivariate Gaussians,

$$\mathbf{y}_* \mid X_*, \mathcal{D}_{tr} \sim \mathcal{N}\left(\mathbb{E}\left[\mathbf{y}_*\right], \mathbb{V}\left[\mathbf{y}_*\right]\right) \tag{3.10a}$$
$$\mathbb{E}\left[\mathbf{y}_*\right] = m(X_*) + K_{X_*,X}\left(K_{X,X} + \sigma_n^2 I\right)^{-1}\left(\mathbf{y} - m(X)\right) \tag{3.10b}$$
$$\mathbb{V}\left[\mathbf{y}_*\right] = K_{X_*,X_*} - K_{X_*,X}\left(K_{X,X} + \sigma_n^2 I\right)^{-1} K_{X,X_*} + \sigma_n^2 I \tag{3.10c}$$

Similarly, it is possible to evaluate the distribution over only the latent function values $\mathbf{f}_*$ so as to only investigate the uncertainty in the learnt function itself, not including the measurement noise on the data.

$$\mathbf{f}_* \mid X_*, \mathcal{D}_{tr} \sim \mathcal{N}\left(\mathbb{E}\left[\mathbf{f}_*\right], \mathbb{V}\left[\mathbf{f}_*\right]\right) \tag{3.11a}$$
$$\mathbb{E}\left[\mathbf{f}_*\right] = m(X_*) + K_{X_*,X}\left(K_{X,X} + \sigma_n^2 I\right)^{-1}\left(\mathbf{y} - m(X)\right) \tag{3.11b}$$
$$\mathbb{V}\left[\mathbf{f}_*\right] = K_{X_*,X_*} - K_{X_*,X}\left(K_{X,X} + \sigma_n^2 I\right)^{-1} K_{X,X_*} \tag{3.11c}$$

It can be seen that the expectation remains unchanged (as may be expected, given that the measurement noise has zero mean) and the only change in the variance is to

no longer include the term related to the measurement noise $\sigma_n^2 I$. The combination
of the covariance function, as defined earlier, and the predictive equations in either
Eq. (3.10) or Eq. (3.11) are the sum total of the mathematical machinery which
enables the GP to act as a regression tool and which allow uncertainty quantification
over those predictions. Before continuing, it is worth pausing for a moment to con-
sider quite how remarkable it is that such a powerful approach can be written down
in four lines of mathematics.
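As an illustration of quite how compact that machinery is, the following minimal sketch implements the predictive equations, Eqs. (3.10) and (3.11), assuming a zero mean function and the squared-exponential kernel of Eq. (3.4). The toy data, hyperparameter values, and function names are illustrative assumptions only, and a Cholesky factorization is used in place of a direct matrix inverse for numerical stability.

```python
import numpy as np
from scipy.spatial.distance import cdist

def k_se(Xa, Xb, sigma_f, ell):
    """Squared-exponential kernel, Eq. (3.4)."""
    r = cdist(Xa, Xb)
    return sigma_f**2 * np.exp(-0.5 * (r / ell) ** 2)

def gp_predict(X, y, Xs, sigma_f, ell, sigma_n, include_noise=False):
    """GP posterior at test inputs Xs, following Eqs. (3.10)/(3.11) with a zero mean function."""
    Kxx = k_se(X, X, sigma_f, ell) + sigma_n**2 * np.eye(len(X))
    Kxs = k_se(X, Xs, sigma_f, ell)
    Kss = k_se(Xs, Xs, sigma_f, ell)

    L = np.linalg.cholesky(Kxx)                           # Kxx = L L^T
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # equivalent to Kxx^{-1} y
    v = np.linalg.solve(L, Kxs)

    mean = Kxs.T @ alpha                                  # Eq. (3.10b)/(3.11b)
    cov = Kss - v.T @ v                                   # Eq. (3.11c)
    if include_noise:                                     # Eq. (3.10c): add observation noise
        cov = cov + sigma_n**2 * np.eye(len(Xs))
    return mean, cov

# Toy usage: noisy observations of a sine function
rng = np.random.default_rng(1)
X = rng.uniform(0.0, 6.0, size=(15, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(15)
Xs = np.linspace(0.0, 6.0, 100)[:, None]
mu, cov = gp_predict(X, y, Xs, sigma_f=1.0, ell=1.0, sigma_n=0.1)
```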

3.1.3 Implementation and Learning of the GP

While it has been shown that the core machinery of the GP is surprisingly simple,
there are a couple of points which should be expanded upon further to understand
how it may be used practically. Thus far, the GP has been described as nonparametric,
i.e. it does not have a large set of parameters to tune as part of the learning process
like, for example, a neural network. Instead, the training data directly inform the
shape of the prediction through their inclusion in the predictive equations, Eq. (3.10).
However, one may notice that these predictive equations require evaluation of the
covariance kernel and, for the majority of covariance kernels, there exist a number of
hyperparameters which control the gross behavior of the family of functions which
that kernel embeds. These hyperparameters were alluded to when the kernels were first introduced above.
Although, in theory, the choice of these hyperparameters is far less important to the
quality of fit of the GP than, for example, learning the weights in a neural network, in
practice tuning of the hyperparameters can significantly affect the performance of the
method. If one were taking a “fully-Bayesian” view of the problem, it would be pos-
sible to place prior distributions over these parameters, perform posterior inference
on the basis of Dtr , and marginalize out the effect of the hyperparameters. However,
this approach (in almost all situations) proves to be computationally infeasible or
inadvisable given the level of insight it provides. Despite this challenge, it is worth
noting that some works in the literature do attempt the task of marginalizing the GP
hyperparameters including (Garnett et al. 2013; Svensson et al. 2015; Gardner et al.
2021; Simpson et al. 2021). It should also be noted that marginalization of the GP
hyperparameters is more commonly seen in settings with small training datasets, i.e.
N is small, such as in Bayesian optimization4 problems, e.g. Hernández-Lobato et al.
(2014).

4 Unfortunately, in the name of brevity it is not possible to cover Bayesian optimization with the rigor
it deserves in this short chapter. In short, for an optimization problem seeking a global minimum
(or maximum) to a cost function, which can be evaluated pointwise, a GP is fit to samples from
that cost function to approximate its shape. The GP fit to the cost surface is then used, usually in
combination with the inherent measure of uncertainty, to select a new point at which to probe the
cost function. In this way, the GP can be used to estimate the location of the current global minimum
and also to guide future evaluations of the cost function. Bayesian optimization excels in settings
where the cost surface can be modeled well by a GP and where evaluations of the cost function
are computationally expensive, prohibiting classical metaheuristic optimization approaches. The concept of Bayesian optimization has arguably been conceptually present since the 1970s, but a good introductory text can be found in Snoek et al. (2012).

In common usage of the GP, rather than attempting to marginalize the hyperpa-
rameters, these quantities are instead optimized. The form of the optimization most
commonly employed is a Type-II optimization problem where the cost function is
related to the marginal likelihood of the process. Conceptually, the hyperparameters
are optimized to ensure the GP fits the training data the best it can, taking into account
all the possible values of f (the latent function). Formally, the optimization problem
being solved is

$$
\hat{\boldsymbol{\phi}} = \underset{\boldsymbol{\phi}}{\arg\max} \int p\left(\mathbf{y} \mid \mathbf{f}, X, \boldsymbol{\phi}\right) p\left(\mathbf{f} \mid X, \boldsymbol{\phi}\right) \mathrm{d}\mathbf{f}
= \underset{\boldsymbol{\phi}}{\arg\min} \left\{ -\log p\left(\mathbf{y} \mid X, \boldsymbol{\phi}\right) \right\} \tag{3.12}
$$

where $\boldsymbol{\phi}$ is the set of hyperparameters to be learnt, which is often extended to include $\sigma_n^2$ alongside the kernel hyperparameters, e.g. $\sigma_f^2$ and $\ell$. The optimization is most
commonly solved as per the second line of Eq. (3.12), that is, a minimization over the
negative log marginal likelihood of the GP. It is almost always a good idea to work
with log values of the probability density function for numerical reasons (avoiding
over-/under-flow problems with floating point representations of the value). Casting
the problem as a minimization is a matter of convention and allows easier integration
with gradient descent methods. Therefore, the quantity being minimized is given by
$$
\begin{aligned}
J &= -\log p\left(\mathbf{y} \mid X, \boldsymbol{\phi}\right) \\
&= \frac{1}{2}\left(\mathbf{y} - m(X)\right)^{\mathsf{T}}\left(K_{X,X} + \sigma_n^2 I\right)^{-1}\left(\mathbf{y} - m(X)\right)
+ \frac{1}{2}\log\left|K_{X,X} + \sigma_n^2 I\right| + \frac{N}{2}\log(2\pi)
\end{aligned} \tag{3.13}
$$
To perform the optimization, it is the user’s choice as to the preferred approach.
It is possible to perform gradient-based optimization on the quantity in Eq. (3.13).
Taking derivatives with respect to φ and applying the chain rule reveals that deriva-
tives of the covariance function are required. Classically, algorithms such as steepest
descent or conjugate gradient descent (Press et al. 1992) may have been used; while
still effective, more fashionable choices, such as Adam (Kingma and Ba 2015),
are more commonly encountered in modern implementations. If implementing a
gradient-based optimization approach in practice, even if using modern adaptive
step size methods, it is strongly recommended to use derivatives with respect to
log φ. Similarly to the issues encountered when not working with the log form of
the cost function, working with derivatives dJ/dφ directly can cause numerical issues.
Luckily, the change to using log φ is a simple one; since it is rarely documented, it
is included here for completeness. By application of the chain rule,

$$
\frac{\mathrm{d}J}{\mathrm{d}\log\boldsymbol{\phi}}
= \frac{\mathrm{d}J}{\mathrm{d}\boldsymbol{\phi}} \cdot \frac{\mathrm{d}\boldsymbol{\phi}}{\mathrm{d}\log\boldsymbol{\phi}}
= \frac{\mathrm{d}J}{\mathrm{d}\boldsymbol{\phi}} \cdot \left(\frac{\mathrm{d}\log\boldsymbol{\phi}}{\mathrm{d}\boldsymbol{\phi}}\right)^{-1}
= \frac{\mathrm{d}J}{\mathrm{d}\boldsymbol{\phi}} \cdot \boldsymbol{\phi} \tag{3.14}
$$
With the widespread availability of automatic differentiation functionality in mod-


ern numerical coding libraries, it is often a good idea to utilize this gradient-based
optimization approach, albeit with multiple random restarts to encourage finding the
global minimum. However, it may still be preferential in some circumstances to make
use of gradient-free optimizers, a review of which sits well beyond the confines and
remit of this chapter!
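To make the preceding discussion concrete, the sketch below evaluates the negative log marginal likelihood of Eq. (3.13), for a zero mean function and squared-exponential kernel, and minimizes it with respect to the log hyperparameters as recommended above. It is an illustrative sketch only: it relies on finite-difference gradients for brevity, whereas analytic or automatically differentiated gradients (with multiple random restarts) would be preferred in practice.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.spatial.distance import cdist

def k_se(Xa, Xb, sigma_f, ell):
    r = cdist(Xa, Xb)
    return sigma_f**2 * np.exp(-0.5 * (r / ell) ** 2)

def neg_log_marginal_likelihood(log_phi, X, y):
    """Eq. (3.13) with a zero mean function; log_phi = [log sigma_f, log ell, log sigma_n]."""
    sigma_f, ell, sigma_n = np.exp(log_phi)
    K = k_se(X, X, sigma_f, ell) + (sigma_n**2 + 1e-8) * np.eye(len(X))  # jitter for robustness
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return (0.5 * y @ alpha                        # data-fit term
            + np.sum(np.log(np.diag(L)))           # 0.5 * log|K|, via the Cholesky factor
            + 0.5 * len(y) * np.log(2.0 * np.pi))  # normalizing constant

# Toy data; in practice multiple random restarts are advisable
rng = np.random.default_rng(2)
X = rng.uniform(0.0, 6.0, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(30)
res = minimize(neg_log_marginal_likelihood, x0=np.log([1.0, 1.0, 0.1]),
               args=(X, y), method="L-BFGS-B")
sigma_f_hat, ell_hat, sigma_n_hat = np.exp(res.x)
```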
Regardless of the optimization technique used, it is worth considering that (for
most well-behaved covariance functions) there exist two main competing minima on
the cost surface (Andrianakis and Challenor 2012). The first of these minima contains
the short-length scale solutions where the GP will pass closely through almost every
data point and return quickly to the prior as it moves a small distance from the
observed training point. The second contains long-length scale solutions where the
GP smooths through multiple data points, learning a function which captures short-
scale variation as measurement noise. Under these solutions, the region of influence
of an individual training data point can be very large.
To demonstrate this phenomenon, by exaggerating the two possibilities, the pos-
terior of the GP for a fixed sample (the same as Fig. 3.1) is shown in Fig. 3.3 for
scenarios where the length scale of the kernel is deliberately altered from that used
to generate the “true sample”. The sample of the GP which is used as training data
is shown using the solid black line with red dots to indicate which observations are
included in Dtr . In Fig. 3.3a, a length scale far smaller than the generating function
is chosen; here it can be seen that the learnt GP models variation over a far shorter
scale than the target function. This type of solution, where a very short length scale
is chosen, will ensure that all variation in the data set is captured; however, it has the possibility of mistaking noise on the measured values for genuine features of the function and, as can be seen, it will not be useful for making predictions even a small distance from the observed training set.

Fig. 3.3 Comparison of GP posteriors with length scales altered relative to the original sample (black solid line). As before, posterior mean shown in thick orange, confidence intervals in shaded orange, observed data as red dots, and samples from the posterior in blue
The alternative case, shown in Fig. 3.3b, is where the length scale is selected to
be deliberately too long. Here, the GP is seen to estimate a set of functions which
smooth through all of the data, ignoring local variations in the target values. This
effect is also clearly undesirable. It is seen that the target function, in black, now lies
well outside the confidence intervals of the posterior. In practice, these overconfident
predictions mean that the GP will return tight predictive posteriors even far from the
observed training data (when the length scale is very long), however, these confident
predictions may well be confidently wrong if the length scale is overestimated.
It is easy to imagine that there is some “goldilocks” scenario which balances
the ability of short-length scale solutions to model small variations in the target
function and the long-length scale solution’s ability to make useful predictions away
from the training data. It is this optimal value which hopefully lies at the minimum
of the negative log marginal likelihood. When working with the GP, however, it is
worth considering that the cost function is itself conditioned only on the available
training data. This conditioning means that it can be possible to fall into the “wrong”
minimum which chooses a length scale appropriate for the observed data, but not for
the physical process one seeks to model, or the cost of the short- versus long-length
scale solution can be very similar, making it difficult to make an informed choice. If
there is not a clear localized minimum in the cost surface of the negative log marginal
likelihood, it would be the recommendation of the authors to return to marginalizing
the hyperparameters in a Bayesian manner.
Throughout this section, the approach of the GP has been compared, in an informal
manner, to the more popular neural network methods seen in the literature. It has been
argued that the strengths of the GP are its low number of (hyper) parameters which
need optimization and its ability to automatically quantify uncertainty. It would be
remiss at this point not to highlight that it is certainly possible to perform uncertainty quantification within the neural network family of models, the most direct comparison of course being that of Bayesian neural networks. Significant work on developing neural networks in which posterior distributions over the weights are inferred was carried out as early as the 1990s, for example, MacKay (1992) and the thesis of Neal (1995). Neal (1995) showed how MCMC approaches could be applied to the weights of a neural network and give very strong results, albeit at significantly increased computational load compared to the deterministic counterparts. It is also in that work that an equivalency was established between Bayesian neural networks, under certain assumptions, and the Gaussian process itself (Neal 1996). This is a trend which has continued; for example, Williams
(1996) presents a covariance function for a GP equivalent to a neural network with
sigmoid activations and Gaussian weight distributions. More recently, these classical
results for shallow neural networks have been extended for deep and convolutional
architectures, e.g. Matthews et al. (2018), Garriga-Alonso et al. (2018). One criticism
leveled at the Bayesian approach to neural networks is the significant increase in
computational load, however, (as will be seen with GPs in the following section)
methodologies to reduce this load are available, e.g. Blundell et al. (2015). It is

fair to say that interest continues in Bayesian perspectives on neural networks and
their relationship to GP models, evidence of this being the prominence of work
on Bayesian deep learning as seen by workshops dedicated to this topic at many
of the most prominent machine learning conferences. In the context of the work
presented here, it is worth considering that this equivalency may well mean that the
choice between the GP and a (Bayesian) neural network is not so much a choice as
a difference of perspective.

3.2 Beyond the Gaussian Process

Thus far, the GP has been presented as a compact and elegant tool which can power-
fully solve regression problems with little user intervention. This section will consider
some cases where the simple version of the GP seen already may not be sufficient
and, where possible, highlight solutions from the literature.

3.2.1 Large Training Data

One of the major criticisms of the GP is that it can scale poorly with increasing numbers of training data. During the optimization of the hyperparameters, it is necessary to invert the matrix $K_{X,X} + \sigma_n^2 I$ multiple times at a cost of $\mathcal{O}(N^3)$ per iteration. Each prediction can then be made at a cost of $\mathcal{O}(N)$ for the expectation and $\mathcal{O}(N^2)$
for the predictive covariance. There has, therefore, been significant attention given
to producing approximations to the full GP which reduce this computational over-
head, often referred to as sparse GPs. The aim of any sparse GP method is to retain
as much of the expressive power as possible from the full GP while reducing the
computational load to the minimum possible amount.
The, perhaps obvious, first port of call would be to simply use less data: to attempt to choose a set of M points from $\mathcal{D}_{tr}$ which effectively summarize the function to be learnt while ignoring redundant data. In other words, a regular GP is learnt using only M of the N available training points. This approach, sometimes referred to as the subset of data, can clearly be seen as the simplest but least useful approximation that could be made. While taking only a subset of data will reduce the complexity of the process from $\mathcal{O}(N^3)$ to $\mathcal{O}(M^3)$, the information contained in the
discarded data is completely lost. In engineering, where often these data have been
collected at great expense, this method is a very wasteful approach. Furthermore, if
the function to be modeled is sufficiently intricate that it requires a large number of
data in Dtr to be adequately modeled—which can happen especially when the input
is higher dimensional—then the quality of the fit will be drastically reduced through
the subset of data approximation. A final challenge in this most basic method is how
to choose the appropriate subset of data to include in the selected M, which can be
a difficult task. One should seek to include the data which most effectively capture

the shape of the function to be learnt without introducing redundant information. In


the absence of a better approach, the selection would usually be to choose points
which give “uniform coverage” of the input space, i.e. the maximal space filling set,
alternatively (and perhaps more commonly) the user may simply decimate the data
with little thought to the nature of the exact subset being chosen.
The subset of data approach is far from a satisfactory solution to large train-
ing datasets in the GP. Because of its limitations, research into various other more
sophisticated methods has continued in the literature. One major group of meth-
ods is referred to as inducing-point-based approximations. In essence, all inducing
point approximations reduce the computational load of inverting the full covari-
ance matrix by projecting it through a lower rank covariance matrix which is the
covariance between M selected inducing points $\mathbf{u}$. This projection reduces the computational load of the GP from $\mathcal{O}(N^3)$ to $\mathcal{O}(NM^2)$, which is seen to be highly beneficial when a user chooses $M \ll N$. Two important papers have highlighted
a number of the inducing point methods which are available from the literature.
First, work by Quinonero-Candela and Rasmussen (2005) linked many of the earli-
est works with inducing points under a unified framework, including one of the most
popular approaches, the Fully Independent Training Conditional (FITC) model. One
issue with the FITC approach to sparse GP regression is that it can tend to “overfit”,
underestimating the variance of the process as compared to the full GP solution. An
alternative, which for many is preferable to FITC because of avoiding this issue,
is the Variational Free Energy (VFE) approximation presented by Titsias (2009).
Where the FITC model will approximate the prior distribution of the GP, the VFE
instead targets a variational approximation of the posterior of the GP. This distinc-
tion is the means by which the VFE is more resistant to overfitting than the FITC
approach. However, unlike the FITC model, the inducing points are now variational
parameters and, if the set of inducing points is chosen to be equal to the full set of
training inputs, the full GP marginal likelihood will not be recovered. In practice, the
VFE approach to sparse GPs will give rise to higher predictive variances than FITC.
One might then reasonably ask; is it possible to trade off the benefits and drawbacks
of both VFE and FITC? This question motivated the more recent work of Bui et al.
(2017) who show that there also exists a unifying framework, linking FITC type
and VFE approximations of the GP using inducing points through the lens of power
expectation propagation (Minka 2004), an alternative approximate inference tech-
nique arising from expectation propagation approach of Minka (2001). It is shown
how a single scalar parameter α ∈ [0, 1] controls the trade-off between FITC and
VFE, where α → 0 recovers VFE and α → 1 recovers FITC.
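Although FITC and VFE differ in their details, the algebraic ingredient the inducing-point family shares is a low-rank (Nyström-type) projection of the full covariance through the M × M covariance of the inducing points, $K_{X,\mathbf{u}} K_{\mathbf{u},\mathbf{u}}^{-1} K_{\mathbf{u},X}$. The sketch below only illustrates that low-rank structure and the associated cost saving; it is not an implementation of FITC or VFE, and the kernel, inducing-point locations, and variable names are assumptions for illustration.

```python
import numpy as np
from scipy.spatial.distance import cdist

def k_se(Xa, Xb, sigma_f=1.0, ell=0.5):
    r = cdist(Xa, Xb)
    return sigma_f**2 * np.exp(-0.5 * (r / ell) ** 2)

rng = np.random.default_rng(3)
N, M = 2000, 50
X = rng.uniform(0.0, 10.0, size=(N, 1))       # large set of training inputs
Z = np.linspace(0.0, 10.0, M)[:, None]        # M inducing inputs (here simply a grid)

Kuu = k_se(Z, Z) + 1e-8 * np.eye(M)           # M x M matrix, cheap to factorize
Kxu = k_se(X, Z)                              # N x M cross-covariance

# Low-rank (Nystrom-type) approximation of the N x N training covariance:
#   K_XX approx= Kxu Kuu^{-1} Kxu^T
# Downstream algebra can be arranged so that only M x M systems are solved,
# giving the O(N M^2) scaling quoted in the text.
Luu = np.linalg.cholesky(Kuu)
A = np.linalg.solve(Luu, Kxu.T)               # M x N
K_approx_diag = np.sum(A * A, axis=0)         # diagonal of the low-rank approximation
```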
An alternative to the inducing-point family of approaches, which have been dis-
cussed so far, is to make a basis function expansion of the nonparametric kernel
in the GP. This approach gives rise to several other sparse GP approximation tech-
niques which can be very powerful alternatives to the inducing point methods above.
One example of a possible basis decomposition for the kernel is the Fourier basis.
Since the spectrum of many popular kernels is available in the closed form—for
example, this is possible for Matérn kernels with ν = 1/2 + p for p = 0 or any pos-
itive integer—it is possible to draw on a set of frequencies which approximate the

same spectral content. Rahimi and Recht (2008) show how a randomly selected
Fourier basis serves as a good approach for approximating Gaussian process solu-
tions, although this method is arguably superseded by Hensman et al. (2017) which
combines ideas of Fourier features with the variational interpretation of sparse GPs
from VFE to develop the Variational Fourier Features (VFF) method. It is shown in
Hensman et al. (2017) how the VFF performs very well on large datasets including
the benchmark US flight-delay prediction task (Hensman et al. 2013). Moving away
from the Fourier basis it should also be noted that work of Solin and Särkkä (2020)
has shown that an eigendecomposition of the Laplace operator associated with the
kernel also provides an effective basis for approximating GPs with a more efficient
parametric model.
Finally, it should be noted that it is also possible to combine subsets of the training
data in various ways, see Cao and Fleet (2014), Deisenroth and Ng (2015), Nguyen-
Tuong et al. (2009), Gramacy and Apley (2015) for a non-exhaustive list of examples.
The final point to note is the approach for distributed parallel training (hyperparameter
learning) of GP models achieved through stochastic variational inference (Hoffman
et al. 2013) and shown in the context of GPs in Hensman et al. (2013). This work
shows how the concepts of variational sparse GPs can give further wall-time speed
up through mini-batched parallel evaluation and updates to the hyperparameters.

3.2.2 Non-Gaussian Likelihoods

Remembering the initial setup of the GP as solving a regression problem in the form

$$y_i = f(\mathbf{x}_i) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}\left(0, \sigma_n^2\right),$$

it is seen that the regression targets are corrupted with i.i.d. Gaussian white noise.
This ensures that the joint distribution assessed in Eq. (3.10), for prediction, and the
marginal in Eq. (3.13), for hyperparameter learning, are available in closed form as
Gaussian densities. The advantage of this approach is that the computation of the
conditional which enables prediction and the marginal for learning of the hyper-
parameters can be done exactly. However, in some ways, this model also imposes
strong assumptions about the data-generating process which one wishes to model.
In this short subsection, it will be discussed how the assumption regarding the noise
process in the GP can in some ways be relaxed by considering certain approximate
approaches. It should be noted that in almost all cases, it will no longer be possible
to compute the marginal likelihood or the predictive posterior in closed form when
the assumption of white Gaussian i.i.d. noise is relaxed.
The first case in which the i.i.d. Gaussian noise assumption may not be sufficient
is that of heteroscedastic Gaussian noise. This is still an additive Gaussian noise
process but one where the variance of the measurement noise is now a function of
the inputs to the model, i.e. ε(xi ) ∼ N (0 , g(xi )) where g(xi ) is some function of
the inputs xi . A practical example of this noise process may be a situation where a

sensor is working across different temperature ranges and the signal-to-noise ratio
varies as a function of the temperature. In this case, it would be insufficient to assume
a constant noise variance σn2 across the entire range of measurements; instead, the
variance of that noise process will also be a function of temperature, alongside the
function of interest f (·).
How then could the GP machinery be extended to manage this scenario? In the
simplest possible case, perhaps the function g(·) is known a priori to the modeler,
and it is known that the noise is independent across the input space. In this setting,
it is relatively simple to include the effect of the known changing noise in the usual
equations of the GP; the required modification is to replace the $\sigma_n^2 I$ term in Eq. (3.8) with another noise matrix which is also diagonal and whose (diagonal) elements are defined by $g(\mathbf{x}_i)$. In terms of prediction, to recover the latent function values distribution $p\left(\mathbf{f}_* \mid X_*, \mathcal{D}_{tr}\right)$, no further modification is needed in Eq. (3.11) beyond replacing the $\sigma_n^2 I$ term for the training data. However, if the predictive distribution over $\mathbf{y}_*$ is needed as in Eq. (3.10), then one must also modify the final term in Eq. (3.10c) such that it is a diagonal matrix with the diagonal elements equalling $g(X_*)$.
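A minimal sketch of this modification is given below, assuming the variance function g(·) is known and returns a per-point noise variance, and using the squared-exponential kernel with a zero mean function; all names and values are illustrative.

```python
import numpy as np
from scipy.spatial.distance import cdist

def k_se(Xa, Xb, sigma_f, ell):
    r = cdist(Xa, Xb)
    return sigma_f**2 * np.exp(-0.5 * (r / ell) ** 2)

def gp_predict_known_noise(X, y, Xs, g, sigma_f, ell, predict_targets=False):
    """GP prediction with a known noise-variance function g(.), replacing sigma_n^2 I by diag(g(X))."""
    Kxx = k_se(X, X, sigma_f, ell) + np.diag(g(X))   # input-dependent noise on the training data
    Kxs = k_se(X, Xs, sigma_f, ell)
    Kss = k_se(Xs, Xs, sigma_f, ell)

    L = np.linalg.cholesky(Kxx)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    v = np.linalg.solve(L, Kxs)

    mean = Kxs.T @ alpha
    cov = Kss - v.T @ v                              # latent-function uncertainty, cf. Eq. (3.11)
    if predict_targets:                              # cf. Eq. (3.10): add predicted observation noise
        cov = cov + np.diag(g(Xs))
    return mean, cov

# Example: noise standard deviation growing linearly with the (1-D) input
g = lambda X: (0.05 + 0.2 * X[:, 0]) ** 2
```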
The far more common scenario however, is one where the function which governs
the noise process g(·) is unknown, as well as the underlying latent function to be
learnt f (·). Clearly, this setting will pose more of a challenge when attempting to
model the data. Given that this chapter has so far espoused the benefits of the GP
as a tool for learning arbitrary functions, it may come as no surprise that one can
suggest placing a GP prior over g(·) as well as f (·) and learning the two functions
simultaneously. Assuming that the noise on each observation is independent, i.e. the
covariance matrix of ε is diagonal, with the diagonal elements being drawn from
some function g(·), these elements must be positive (one cannot have a negative or
zero variance). If one were to place a GP prior over g(·) in the same manner as has
been shown for f (·) this requirement will not necessarily be enforced, in fact, if a
zero mean is chosen, a priori many function values will be implausible. To rectify
this difficulty, the common solution is to impose a link function $\hat{g}(\cdot) = \Lambda(g(\cdot))$, where $\Lambda(\cdot)$ maps elements from the whole real line onto the set of positive real numbers. For instance, one could choose $\Lambda(g(\cdot)) = g(\cdot)^2$, although by far the most common choice is $\Lambda(g(\cdot)) = \exp\{g(\cdot)\}$; in other words, the GP prior is placed over the log values of the noise variance. It is also commonplace to choose a GP prior which has a constant mean function $m_g(X) = \log \sigma_n^2$, where $\log \sigma_n^2$ is the log of the average noise variance across the whole function and may be learnt alongside
the hyperparameters. A heteroscedastic GP model which learns both the unknown
function and the unknown noise variance can then be constructed as

$$
\begin{aligned}
\mathbf{y} &= f(X) + \boldsymbol{\varepsilon} \\
f(X) &\sim \mathcal{GP}\left(m(X), k_f(\mathbf{x}, \mathbf{x}')\right) \\
\boldsymbol{\varepsilon} &\sim \mathcal{N}\left(\mathbf{0}, \exp\{g(X)\}\right) \\
g(X) &\sim \mathcal{GP}\left(C, k_g(\mathbf{x}, \mathbf{x}')\right)
\end{aligned} \tag{3.15}
$$

While theoretically the above is a logical and powerful construction, it has one
major drawback. It is not possible to compute, in closed form, an exact solution
which would allow either learning of the hyperparameters or predictions at new
inputs. As such, it may seem that the model is doomed, however, it is possible to
construct very good approximate solutions which are then computable. A variational
approximation to the posterior of the process is the usual solution to this model; this was the solution proposed in Lázaro-Gredilla and Titsias (2011). The variational
approximation allows for computation of a lower bound on the unavailable marginal
likelihood which can be used to optimize the hyperparameters of the process and a
similar variational approximation allows for computation of the mean and variance
of the predictive posterior, again in closed form. The mean of the process being
learnt is found to be identical to the predictive mean in the standard GP (under the
variational approximation), then the variance is approximated by means of the law
of total variance. This predictive variance is given by the variance of the prediction
of f (X ) plus the exponential of the predictive mean for g(X ) summed with half
the predictive variance of g(X ) (Sect. 4 Lázaro-Gredilla and Titsias 2011). In doing
this, a Gaussian approximation of the predictive posterior is recovered but it should
be noted that the true predictive posterior of the model in Eq. (3.15) will not be
Gaussian.
The heteroscedastic model, however, is not the only non-Gaussian likelihood
model which may be of interest. The likelihood of the model can reflect other known
beliefs about the form of the generating process. For example, if modeling count
data one may choose a generative model to be a Poisson process, for measurements
bounded at zero the likelihood may be gamma, log-normal, or exponential, for mea-
surements bounded above and below a beta likelihood may be a sensible choice. For
all of these cases, the GP in its standard form is an inappropriate model. For these
intractable models, the “gold standard” approximate method (as in most intractable
Bayesian analysis) is the Markov Chain Monte Carlo (MCMC) approach (Gelman
et al. 2013); more recently, the Hamiltonian Monte Carlo (HMC) (Neal et al. 2011;
Betancourt 2017) has been preferred as a more efficient choice. Options to use these
sampling-based approaches are available in major GP libraries including GPy (GPy
2012) and GPFlow (Matthews et al. 2017). While sampling-based approximations to
intractable GP likelihoods are appealing in a number of ways, not least their guarantees of convergence to the true posterior distributions, they have one major drawback: they impose a significant computational burden. To alleviate this load, the
chained GP approach of Saul et al. (2016) gives a flexible framework for variational
inference over models composed of multiple GPs and non-Gaussian likelihoods. In
many practical settings, the variational approximation will capture sufficiently the
information required to determine a good model.

3.2.3 Multiple-Output GPs

Unlike the framework available for neural networks, where additional nodes can
simply be added to the output layer to enable multiple targets in regression, the
formation of the GP shown in Eq. (3.1) does not obviously lend itself to a multiple
target scenario. There are scenarios where it could be useful, however, to consider
multiple target data sequences within a combined regression. For example, if it is
known that two processes are correlated but one part of the input space can only be
observed for one process, then it may be desirable to use the correlation between the
processes to effectively expand the training dataset for both functions. To simplify the
presentation, a setting with two functions to be learnt will be shown, but it should be
clear how this would extend to multiple processes. Consider the following scenario,

$$
\begin{aligned}
\mathbf{y}_1 &= f_1(X_1) + \boldsymbol{\varepsilon}_1 \\
\mathbf{y}_2 &= f_2(X_2) + \boldsymbol{\varepsilon}_2
\end{aligned} \tag{3.16}
$$

where both $\boldsymbol{\varepsilon}_1$ and $\boldsymbol{\varepsilon}_2$ are i.i.d. zero-mean white Gaussian noise with variances $\sigma_{n,1}^2$ and $\sigma_{n,2}^2$, respectively. $\mathbf{y}_1$ and $\mathbf{y}_2$ are the two vectors of target variables from the functions $f_1(\cdot)$ and $f_2(\cdot)$, with sets of inputs $X_1$ and $X_2$. The core assumption that
will enable multiple output predictions from the GP will be that the two functions
are also both drawn from an infinite-dimensional Gaussian distribution which can
model the correlation between them, as well as the correlation in the input space. It
is possible then to consider the prior joint distribution of the two sets of targets,

$$
\begin{bmatrix} \mathbf{y}_1 \\ \mathbf{y}_2 \end{bmatrix} \sim \mathcal{N}\left(
\begin{bmatrix} m_1(X_1) \\ m_2(X_2) \end{bmatrix},
\begin{bmatrix}
k_{1,1}(X_1, X_1) + \sigma_{n,1}^2 I & k_{1,2}(X_1, X_2) \\
k_{2,1}(X_2, X_1) & k_{2,2}(X_2, X_2) + \sigma_{n,2}^2 I
\end{bmatrix}
\right) \tag{3.17}
$$

For each function, there is an associated mean function $m_j(X_j)$ which can be defined as before as any parametric mean. Considering each function in isolation, e.g. for the first function,

$$\mathbf{y}_1 \sim \mathcal{N}\left(m_1(X_1),\; k_{1,1}(X_1, X_1) + \sigma_{n,1}^2 I\right) \tag{3.18}$$

the standard Gaussian process prior is recovered, where the kernel $k_{j,j}(X_j, X_j)$ can be computed in the standard manner as for the single-output case. The central challenge of designing a multiple-output GP is how one can appropriately define kernels which capture the cross-covariance between the different sets of targets, i.e. how one can develop an appropriate $k_{j,k}(X_j, X_k)$ for $j \neq k$.
One early work in this area was the contribution of Boyle and Frean (2004). Using
the viewpoint of the GP as a linear filter on a continuous time white noise signal,
it was shown how additional filters (kernels) targeting a shared white noise source
could allow coupling between two (or more) correlated outputs. While the approach
is developed generally in that work, it is shown in detail only for the case of a two-
output process where the kernels are all Gaussian (equivalent to squared exponential),
although it should be straightforward to extend. The experience of the current authors

when working with this model has been that it can produce very good results and it
was possible to replicate the work shown in Boyle and Frean (2004). However, the
learning procedure for the hyperparameters, despite following the same optimization
of the marginal likelihood as previously shown, could become very unstable. There
are combinations of the hyperparameters to be learnt which cause the model to be
numerically unstable and for which the gradients are particularly difficult to recover.
This difficulty is coupled with an inherent non-identifiability problem in the model
where, given that each signal is a summation of an independent and linked GP, it can
lead to vastly different qualities of prediction in testing depending on the relative
hyperparameters of the different kernels.
Perhaps for this reason, the more popular approaches for multiple output GPs
are those which have been classically inherited from the kriging community, where
the multiple output problem has been referred to as co-kriging. Several popular
approaches are developed from the assumption that the observed processes can be
described as the linear combination of a number of latent Gaussian processes.
A helpful review which covers the major approaches to multiple-output GPs can be
found in Alvarez et al. (2012). The interested reader is directed toward that reference
for further details on the approaches mentioned here.
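As a concrete, and deliberately simple, example of that construction, the sketch below assembles the joint covariance of Eq. (3.17) under an intrinsic coregionalization assumption: a single latent kernel shared by both outputs, with the cross-covariances scaled by a positive semi-definite coregionalization matrix B. The choice of kernel, the values in B, and the noise variances are illustrative assumptions, not those of any of the cited methods.

```python
import numpy as np
from scipy.spatial.distance import cdist

def k_se(Xa, Xb, sigma_f=1.0, ell=1.0):
    r = cdist(Xa, Xb)
    return sigma_f**2 * np.exp(-0.5 * (r / ell) ** 2)

# Coregionalization matrix B (must be positive semi-definite); its entries scale the
# blocks k_{j,k} of Eq. (3.17). Values here are purely illustrative.
B = np.array([[1.0, 0.8],
              [0.8, 1.0]])
noise = np.array([0.05, 0.10])                # sigma_{n,1}^2 and sigma_{n,2}^2

def joint_covariance(X1, X2):
    """Block covariance of [y1; y2] under an intrinsic coregionalization assumption."""
    K11 = B[0, 0] * k_se(X1, X1) + noise[0] * np.eye(len(X1))
    K22 = B[1, 1] * k_se(X2, X2) + noise[1] * np.eye(len(X2))
    K12 = B[0, 1] * k_se(X1, X2)
    return np.block([[K11, K12],
                     [K12.T, K22]])
```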

3.3 A Case Study with Wind Turbine Power Curves

To demonstrate an example where GPs can be used for modeling of an engineering


system, a short case study will explore the use of some of the discussed techniques
on data from wind-turbine power curves. The power curve of a wind turbine is the
critical relationship between the wind speed and the power output of that turbine.
It characterizes the performance and behavior of that turbine and has a variety of
important uses, from financial planning to fault detection. In this section, a short
review will be given of recent work applying the GP as a modeling tool to these

functions from data collected in a wind farm. In both cases, the data have been normalized to remove information which may identify the exact turbines from which the data have been collected.

Fig. 3.4 Typical data from a SCADA system wind turbine power curve, as used in Rogers et al. (2020)
An example set of power curve data is shown in Fig. 3.4. It follows a very char-
acteristic shape with three key sections. Close to zero wind speed (around –1 in normalized wind speed), no power is produced. This trend continues until the speed is sufficient
to begin producing power, at which point the curve enters a roughly cubic segment
where the power output increases with wind speed. This segment can be found to
be roughly cubic if one considers the theoretical maximum work done by the fluid
flowing over the wind turbine; in practice, inefficiencies mean that this would be a
highly simplified assumption. At some point, the wind speed increases to the level
where the turbine produces its rated power, i.e. maximum output. In Fig. 3.4, this
point occurs close to zero normalized wind speed. For the remaining increase in wind
speed, no increase in power output is observed as it is limited by the control system of
the turbine. This limiting of the power output creates a saturation-type nonlinearity
at higher wind speed as the output of the turbine is limited to not exceed the rated
power (normalized power output of 1).
The key references for this section are now briefly reviewed. Papatheou et al.
(2017) presented the GP as a useful damage-sensitive feature for performance moni-
toring purposes. In that work, the GP is used to allow extreme function theory analysis
of the power curve, which is shown to highlight abnormal turbines. The approach
used in this initial work was based upon the most basic form of the GP as first intro-
duced in this chapter; in Rogers et al. (2020) attention was paid to the noise process
of the power-curve function. The data were modeled with a heteroscedastic form of
the GP which modeled the changing variance over the power curve, additionally, a
committee machine was employed to handle the three distinct regions seen in the
power curve. Finally, in Mclean et al. (2022), further work has been completed on
the modeling of the power curve, this time a GP model with a beta likelihood is
employed to capture the bounded nature of the function space. This arises from the
physical constraints that the turbine cannot produce negative power nor can it exceed
its maximum rated power.
The key results of these works will now be briefly reviewed in the context of how
the GP was applied in increasingly sophisticated ways to improve understanding of
the power curve. Since the dataset used in these studies contains a relatively large
number of data points—in the order of 15,000 observations—sparse formulations of
the GP are used in all cases. Initially, taking the usual approach, assuming Gaussian
likelihood and homoscedastic noise (i.e. additive, isotropic, independent Gaussian
noise) and the VFE methodology for computing a sparse GP, the results shown in
Fig. 3.5 can be recovered.
Beginning with the quality of the mean prediction of the model, it can be seen
that the expected output of the learnt functions is very good. Visually it matches
closely to the data and using a normalized mean squared error (NMSE) metric a
score of 0.81% is achieved which indicates excellent quality of fit. The GP used here
employs a squared exponential kernel and a hyperbolic tangent mean to promote the
general behavior of the function which can be seen a priori. For reference, this score

is a notable improvement over the 3.94% for a piecewise linear fit or 1.50% for a hyperbolic tangent, which are competing parametric models commonly employed. At this stage, the user has a consideration to make: if only concerned with the pointwise prediction of the model, it may be reasonable to stop, satisfied with this good quality mean fit. However, returning to Fig. 3.5, it can be seen that the variance in the end segments is greatly overestimated, i.e. the GP is too uncertain, and in the central segment, which may be of most interest, the variance is underestimated. This poses a problem since one key advantage of the GP is its quantification of uncertainty; given that in the proposed model the form of the uncertainty is not well captured, it is reasonable to try to improve upon this result.

Fig. 3.5 Prediction of the power curve using a VFE approach (Rogers et al. 2020)
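For reference, the NMSE scores quoted in this section can be computed as in the short sketch below, assuming the definition commonly used in this body of work, NMSE = 100 Σᵢ(yᵢ − ŷᵢ)² / (N σ_y²); the cited papers should be consulted for the exact normalization they employ. Under this convention, scores below roughly 5% are usually described as a good fit and below 1% as excellent.

```python
import numpy as np

def nmse(y_true, y_pred):
    """Normalized mean-squared error (in percent): 100 * sum((y - y_hat)^2) / (N * var(y))."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return 100.0 * np.sum((y_true - y_pred) ** 2) / (len(y_true) * np.var(y_true))
```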
To improve the quality of this fit, the immediate option is to make a change in the
choice of noise model. Since the variance in the data is seen to vary over the input
space, a heteroscedastic approach is taken. The form of this model was to place a
GP prior with a constant mean over the log of the noise variance, in this way the
variation in the noise could be learnt across the input space in the same manner as
the function is learnt. In Fig. 3.6, the predictions from this updated model are shown.
Immediately, it is clear that the variance toward the ends of the power curve is much
better captured. No longer is there a high degree of uncertainty in these regions
(below cut-in and above-rated power) where it is expected that the power output
should be very consistent, i.e. low noise. The model has also managed to capture
the increasing variance with wind speed in the middle segment of the curve and
the data predominantly lie within the 3σ confidence bounds as would be expected.

Fig. 3.6 Prediction of the power curve using the heteroscedastic approach (Rogers et al. 2020)

The mean performance of the model is also maintained. This updated approach,
therefore, is of value when the user is more interested in capturing the variability in
the process with greater accuracy. This information regarding the uncertainty will be
particularly valuable if this model were to be incorporated into financial forecasting
which wished to consider future risk, or if the model is used within a probabilistic
monitoring and/or decision framework.
A final point on the model shown in Fig. 3.6: since the power curve was character-
ized by the three distinct segments of different regimes, a committee machine model
was used to blend together different GPs for each of these segments. In Fig. 3.7, the
three separate heteroscedastic models and the combined model are all shown, noting
that these are shown with the mean function removed. The mechanism of the com-
mittee machine is to combine the predictions of the models weighted automatically
by their confidence. This approach is very attractive as it allows various GP models
to be combined without requiring manual tuning. The automatic quantification of
the uncertainty allows for the combination to happen in quite a natural way, as with
humans one may pay attention to the expert on the committee who is most confident
in their knowledge. A word of warning in this modeling approach (as is probably
prudent when dealing with humans): the user should be wary of overconfidence,
which may skew the combination; however, the combination of the heteroscedastic
process and the Bayesian Occam's razor provides some degree of automatic protection
against this issue.
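A minimal sketch of such a confidence-weighted blend is given below; it uses a simple precision (inverse-variance) weighting in the spirit of a product of experts, which captures the mechanism described above, though the exact combination rule used by Rogers et al. (2020) may differ in its details.

```python
# Precision-weighted combination of several GP predictions on a shared test grid.
import numpy as np

def committee_combine(means, variances):
    """Blend expert predictions, weighting each by its confidence (1 / variance)."""
    means = np.asarray(means)                      # shape (n_experts, n_test)
    precisions = 1.0 / np.asarray(variances)
    weights = precisions / precisions.sum(axis=0)  # confidence-based weights
    combined_mean = (weights * means).sum(axis=0)
    combined_var = 1.0 / precisions.sum(axis=0)    # product-of-experts style variance
    return combined_mean, combined_var

# Example: three experts; the most confident expert dominates at each test point.
m = [np.array([0.10, 0.50]), np.array([0.20, 0.40]), np.array([0.15, 0.45])]
v = [np.array([0.01, 1.00]), np.array([0.50, 0.02]), np.array([0.20, 0.20])]
mean, var = committee_combine(m, v)
```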

Fig. 3.7 Individual heteroscedastic GPs inside the committee machine prediction (Rogers et al.
2020)

In the figures above, it has been shown how a heteroscedastic noise model can
better capture the variation in the noise variance across the input space. However,
inspecting Fig. 3.6 more closely, it can be seen that around the transition into rated power,
near zero normalized wind speed, there is a significant probability mass predicting
power outputs greater than the rated power. This mismatch poses a problem as it
may mean samples from the GP of the power curve, even in the heteroscedastic case,
over-predict the potential power output of the turbine beyond what is physically
possible. This limitation on the model motivates the modeler to consider if further
modification to the process is needed. In the setting of the power curve, it is natural to
consider that the likelihood of the data, which is the generative process, is bounded,
i.e. samples from the distribution which encodes prior belief should not exceed the
rated power.
It may seem that the issue of the bounded space (no prediction beyond rated power)
is limited to the quantification of the uncertainty from the GP. However, considering
the results of the standard GP,5 as shown in Fig. 3.8, it is clearly seen that despite
the mean prediction giving a very low NMSE score, it can still violate the physical
bounds of the process. Clearly, this is a concern when attempting to establish trust
in a data-driven approach.
Consider then how the GP model can be constructed in such a way that the physical
bounds of the process are obeyed. The modeler chooses a likelihood function which
restricts the generating process to some other domain: for example, a gamma, log-
normal, or exponential distribution to enforce positive-only target values; or, in the
case of the power curve, the beta distribution provides an obvious choice of likelihood
as it is bounded at zero and one. This choice of likelihood imposes restrictions on the
family of functions which can be generated by the GP; in fact, choosing a different
likelihood in some ways stops the model being a GP at all, as the likelihood is
no longer Gaussian. One should be wary that this change does not inadvertently
over-restrict the possible functions learnt by the process. For example, the choice
of a beta likelihood still imposes that the distribution conditioned on a particular
input is unimodal. However, considering the data in Fig. 3.4, it can be seen that this
assumption is reasonable.

5 In the referenced Fig. 3.8, a zero mean function is used as opposed to the hyperbolic
tangent in Figs. 3.5 and 3.6.
To develop the power curve model further, Mclean et al. (2022) adopt a model
in which a multiple-output GP jointly models the α and β parameters of the generat-
ing beta distribution. The possibility of variation between the GP outputs which feed
the beta distribution naturally also allows heteroscedastic behavior in the posterior.
Considering the effect of this approach, in Fig. 3.9, it can be seen that the mean pre-
diction under the beta likelihood is comparable to the regular GP but samples from
the posterior never leave the physically plausible space. Zooming in on the transition
to rated power in Fig. 3.10, the effect is especially pronounced. In this region, where
the regular GP and heteroscedastic model have previously struggled, the beta GP
captures the transition well without violating the upper bound on the target variable.
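The mechanics of the bounded likelihood are easily illustrated: two latent function values, standing in for the outputs of the multi-output GP (whose inference is not reproduced here), are mapped through a positivity transform to the α and β parameters, and every draw from the resulting beta distribution lies between zero and one. The softplus link and the numbers below are illustrative assumptions.

```python
# Bounded observation model: latent values -> (alpha, beta) -> beta-distributed samples.
import numpy as np
from scipy.stats import beta as beta_dist

def beta_observation_model(f1, f2, n_samples=100, seed=0):
    """Map two hypothetical latent GP outputs to beta parameters via a softplus link
    and draw samples; the samples are automatically confined to (0, 1)."""
    softplus = lambda f: np.log1p(np.exp(f))          # ensures alpha, beta > 0
    a, b = softplus(np.asarray(f1)), softplus(np.asarray(f2))
    rng = np.random.default_rng(seed)
    samples = beta_dist.rvs(a, b, size=(n_samples,) + a.shape, random_state=rng)
    return a, b, samples

# Hypothetical latent values over a grid of normalized wind speeds:
f1 = np.linspace(-2.0, 4.0, 50)       # drives alpha (pushes mass toward one)
f2 = np.linspace(4.0, -2.0, 50)       # drives beta (pushes mass toward zero)
a, b, s = beta_observation_model(f1, f2)
mean_power = a / (a + b)              # mean of the beta distribution at each input
```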
In Fig. 3.11, a plot showing the posterior likelihood of the process over the space
is presented. The likelihood toward the two tails concentrates close to zero and one
in normalized power at low and high wind speeds, respectively. Since the likelihoods
become very high, the colormap has been truncated for readability. It is also noted that
the concept of confidence intervals which are usually used to represent the predictive
uncertainty of the GP is not necessarily helpful in this instance as the distributions are
highly skewed, hence the use of a darkening colormap with increasing likelihood.
What is seen in these results is the expected high confidence in the tails of the
process and a more diffuse posterior in the transitional segment. It is argued that, for
the additional modeling effort as compared to the GP model shown in Fig. 3.5, the
reward is improved quantification of the posterior predictive uncertainty in the power
curve, which will aid any future processes relying on these uncertainty predictions.

Fig. 3.8 Behavior of the vanilla GP at rated power (Mclean et al. 2022)

3.4 Conclusions

The GP has been presented as a powerful and conceptually simple tool for performing
regression tasks. In its most basic form, the required calculations to produce mean
and variance predictions span only a few lines of maths or code. However, various
extensions which increase the expressive power of the model have been presented. In
the case study, it has been shown how consideration and careful application of these
extensions can greatly increase the value of the GP as a tool for modeling engineering
functions. As is the case in most applications of machine learning within engineering,
it is important to consider if the model can be set up in such a manner that it conforms
to the expected/desired behavior that is known a priori. In particular, it has been
highlighted that, while valuable, quantification of uncertainty must receive careful
attention to ensure its utility in further applications.

Fig. 3.9 Samples from the


posterior of the beta GP
compared to independent test
data (Mclean et al. 2022)
3 Gaussian Processes 145

Fig. 3.10 Samples from the


posterior of the beta GP
compared to independent test
data close to the transition to
rated power (Mclean et al.
2022)

Fig. 3.11 Posterior of the beta GP for the power curve, noting that the colormap is clipped at the
likelihood equal to 30 (Mclean et al. 2022)

References

Abdessalem AB, Dervilis N, Wagg DJ, Worden K (2017) Automatic kernel selection for gaus-
sian processes regression with approximate Bayesian computation and Sequential Monte Carlo.
Frontiers Built Environ 3:52

Alvarez MA, Rosasco L, Lawrence ND et al (2012) Kernels for vector-valued functions: a review.
Found Trends® Mach Learn 4(3):195–266
Andrianakis I, Challenor PG (2012) The effect of the nugget on Gaussian process emulators of
computer models. Comput Stat Data Anal 56(12):4215–4228
Betancourt M (2017) A conceptual introduction to Hamiltonian Monte Carlo. arXiv:1701.02434
Blundell C, Cornebise J, Kavukcuoglu K, Wierstra D (2015) Weight uncertainty in neural network.
In: International conference on machine learning. PMLR, pp 1613–1622
Boyle P, Frean M (2004) Dependent Gaussian processes. In: Saul L, Weiss Y, Bottou L (eds)
Advances in neural information processing systems, vol 17. MIT Press
Bui TD, Yan J, Turner RE (2017) A unifying framework for Gaussian process pseudo-point approx-
imations using power expectation propagation. J Mach Learn Res 18(1):3649–3720
Cao Y, Fleet DJ (2014) Generalized product of experts for automatic and principled fusion of
Gaussian process predictions. arXiv:1410.7827
Deisenroth M, Ng JW (2015) Distributed Gaussian processes. In: International conference on
machine learning. PMLR, pp 1481–1490
Gardner P, Rogers T, Lord C, Barthorpe R (2021) Learning model discrepancy: a Gaussian process
and sampling-based approach. Mech Syst Signal Process 152:107381
Garnett R, Osborne MA, Hennig P (2013) Active learning of linear embeddings for Gaussian
processes. arXiv:1310.6740
Garriga-Alonso A, Rasmussen CE, Aitchison L (2018) Deep convolutional networks as shallow
gaussian processes. arXiv:1808.05587
Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB (2013) Bayesian data analysis
GPy (2012) GPy: A Gaussian process framework in python. http://github.com/SheffieldML/GPy
Gramacy RB, Apley DW (2015) Local Gaussian process approximation for large computer exper-
iments. J Comput Graph Stat 24(2):561–578
Hensman J, Durrande N, Solin A (2017) Variational Fourier features for Gaussian processes. J Mach
Learn Res 18(1):5537–5588
Hensman J, Fusi N, Lawrence ND (2013) Gaussian processes for big data. In: Proceedings of the
twenty-ninth conference on uncertainty in artificial intelligence, UAI’13, Arlington, Virginia,
USA, AUAI Press, pp 282–290
Hernández-Lobato JM, Hoffman MW, Ghahramani Z (2014) Predictive entropy search for efficient
global optimization of black-box functions. Adv Neural Inf Process Syst 27
Hoffman MD, Blei DM, Wang C, Paisley J (2013) Stochastic variational inference. J Mach Learn
Res
Kingma DP, Ba J (2015) Adam: A method for stochastic optimization. In: Bengio Y, LeCun Y (eds)
3rd international conference on learning representations, ICLR 2015, San Diego, CA, USA,
Conference Track Proceedings
Lázaro-Gredilla M, Titsias MK (2011) Variational heteroscedastic Gaussian process regression. In:
ICML
MacKay DJ (1992) A practical bayesian framework for backpropagation networks. Neural Comput
4(3):448–472
Matthews AGDG, Hron J, Rowland M, Turner RE, Ghahramani Z (2018) Gaussian process
behaviour in wide deep neural networks. In: International conference on learning representa-
tions
Matthews AGDG, van der Wilk M, Nickson T, Fujii K, Boukouvalas A, León-Villagrá P, Ghahramani
Z, Hensman J (2017) GPflow: A gaussian process library using tensor flow. J Mach Learn Res
18(40):1–6
Mclean J, Jones M, O’Connell B, Maguire A, Rogers T (2022) Physically meaningful uncertainty
quantification in probabilistic wind turbine power curve models as a damage sensitive feature.
arXiv:2209.15579
Micchelli CA, Xu Y, Zhang H (2006) Universal kernels. J Mach Learn Res 7(12)
Minka T (2004) Power expectation propagation. Technical report, Microsoft Research, Cambridge

Minka TP (2001) A family of algorithms for approximate Bayesian inference. Ph.D. thesis, Mas-
sachusetts Institute of Technology
Neal RM (1995) Bayesian Learning for Neural Networks. Ph.D. thesis, University of Toronto
Neal RM (1996) Priors for infinite networks. In: Bayesian learning for neural networks. Springer,
Berlin, pp 29–53
Neal RM et al (2011) MCMC using Hamiltonian dynamics. Handb Markov Chain Monte Carlo
2(11):2
Nguyen-Tuong D, Seeger M, Peters J (2009) Model learning with local Gaussian process regression.
Adv Robot 23(15):2015–2034
Papatheou E, Dervilis N, Maguire AE, Campos C, Antoniadou I, Worden K (2017) Performance
monitoring of a wind turbine using extreme function theory. Renew Energy 113:1490–1502
Press WH, Teukolsky SA, Vetterling WT, Flannery BP et al (1992) Numerical recipes in C
Quinonero-Candela J, Rasmussen CE (2005) A unifying view of sparse approximate Gaussian
process regression. J Mach Learn Res 6:1939–1959
Rahimi A, Recht B (2008) Weighted sums of random kitchen sinks: Replacing minimization with
randomization in learning. Adv Neural Inf Process Syst 21
Rogers T, Gardner P, Dervilis N, Worden K, Maguire A, Papatheou E, Cross E (2020) Probabilistic
modelling of wind turbine power curves with application of heteroscedastic Gaussian process
regression. Renew Energy 148:1124–1136
Saul AD, Hensman J, Vehtari A, Lawrence ND (2016) Chained Gaussian processes. In: Gretton
A, Robert CC (eds) Proceedings of the 19th international conference on artificial intelligence
and statistics, volume 51 of Proceedings of machine learning research. Cadiz, Spain, PMLR, pp
1431–1440
Simpson F, Lalchand V, Rasmussen CE (2021) Marginalised Gaussian processes with nested sam-
pling. Adv Neural Inf Process Syst 34:13613–13625
Snoek J, Larochelle H, Adams RP (2012) Practical Bayesian optimization of machine learning
algorithms. Adv Neural Inf Process Syst 25
Solin A, Särkkä S (2020) Hilbert space methods for reduced-rank Gaussian process regression. Stat
Comput 30(2):419–446
Stein ML (1999) Interpolation of spatial data: some theory for kriging. Springer Science & Business
Media
Svensson A, Dahlin J, Schön TB (2015) Marginalizing Gaussian process hyperparameters using
sequential monte carlo. In: 2015 IEEE 6th international workshop on computational advances in
multi-sensor adaptive processing (CAMSAP). IEEE, pp 477–480
Titsias M (2009) Variational learning of inducing variables in sparse Gaussian processes. In: Arti-
ficial intelligence and statistics. PMLR, pp 567–574
Williams C (1996) Computing with infinite networks. Adv Neural Inf Process Syst 9
Williams CK, Rasmussen CE (2006) Gaussian processes for machine learning. MIT Press
Chapter 4
Machine Learning Methods for Constructing Dynamic Models From Data

J. Nathan Kutz

4.1 Introduction

Traditional modeling of physics-based systems in science and engineering is typi-
cally achieved by deriving governing equations for the system and simulating the
system for various parameter regimes. Thus, computational methods and scientific
computing are now an integral part of every field of application. Indeed, a well-
developed and diverse number of numerical methods can now be applied for solv-
ing ordinary differential and partial differential equation systems (Kutz 2013). This
allows for the characterization of many modern high-dimensional, complex dynam-
ical systems, thus enabling engineering design and control in a diversity of applica-
tion areas. The ubiquity of computing has led to reduced order models (ROMs) that
provide a mathematical architecture for reducing the computational complexity of
mathematical models in numerical simulations (Benner et al. 2015; Antoulas 2005;
Quarteroni et al. 2015; Hesthaven et al. 2016). More recently, data-driven methods
have emerged as an alternative paradigm, where building models directly from data
is advocated. Indeed, in many emerging modern applications, including turbulence
closure modeling, weather forecasting, power grid networks, climate modeling, and
neuroscience, for instance, the construction of dynamic models from data is neces-
sary as first principles models are unknown, inadequate, or beyond first principles
derivations. In what follows, a number of techniques are highlighted which can
build dynamic models directly from data. Critical to the architectures advocated are
the joint and simultaneous learning of coordinates and models, allowing for robust
characterization of the underlying dynamics. Indeed, with the emergence of machine
learning techniques, there exists a diverse number of data-driven methods for learning
both low-dimensional embeddings and evolution equations in such coordinates (Kutz

J. Nathan Kutz
Department of Applied Mathematics and Electrical and Computer Engineering,
University of Washington, Seattle, WA 98195, USA
e-mail: [email protected]

2013; Brunton et al. 2020). In what follows, we consider some of the leading machine
learning strategies that help represent physics-based systems by learning dynamics
and embeddings jointly.
The data-driven discovery process begins with data acquisition. Thus, sensors are
critical for empowering the physics learning process. Often what is measured by the
sensors y is not the correct variable u for building a parsimonious, or dynamic, model
representation. Thus, it is important to first learn how to map the measurements y to a
state space representation u where a model is constructed, i.e. a measurement model
must be constructed. Once achieved, a parsimonious model for u can be constructed in
a variety of ways as detailed in subsequent sections of the manuscript. Importantly,
many modern applications of data-driven modeling require that the measurement
model and parsimonious model be constructed simultaneously. There are, of course,
many nuances to this process, but we will primarily focus at a first approximation on
learning spatio-temporal systems governed by partial differential equations which
are specified by a nonlinear operator N(·).
Thus, we seek to construct a dynamic model based on data y ∈ R^m measured from
a high-dimensional state u ∈ R^n, where n ≫ 1 and m ≪ n. Specifically,

u̇ = N(u, x, t, β),                        (4.1a)
y_k = h(t_k, x, u(t_k)) + σ                (4.1b)

where the spatio-temporal dynamics are prescribed by N : X1 → R^n, the observation
operator is h : X1 → R^m, and the observations y_k are measured at the times t_k at
spatial locations prescribed by x. Observations are compromised by measurement
noise σ, which is typically described by a probability distribution (e.g. a normal
distribution σ ∼ N(μ, σ)). The dynamics is prescribed by a set of parameters β.
The goal is that, given p measurements y_k arranged in the matrix Y =
[y_1  y_2  . . .  y_p] ∈ R^{m×p}, infer the dynamics N(·) (or unknown portion of dynam-
ics) with parametrization β, the measurement operator h(·), or a proxy model of
the true system so that tasks such as control, characterization, and forecasting can
be accomplished. If h(·) is not the identity and/or σ is not zero, then we are in the
case of imperfect data. This problem can be one of online learning where the update
must occur in real time and with no possibility of repeating the experiment. Thus,
model discovery is typically an ill-posed problem whose solution must be accom-
plished through judiciously chosen regularization. Solving this ill-posed problem
is a fundamental scientific and mathematical challenge since there are simply not
enough constraints on the measurement model, dynamics, and parametrization to
achieve a unique solution. To date, it has only been accomplished in highly special-
ized settings, typically with full-state measurements and high-quality (low-noise)
data. Significant mathematical innovations have to be developed in order to make
this a general and robust architecture, especially as regularization is required to make
the problem well-posed.

As sufficient data is acquired from sensors in time and space, the data-discovery
pipeline then produces the flow:

y (measurements) → u (state space) → N(·) (dynamics model) (4.2)

with two functions to discover, h and N. Alternatively, one can also find a new
coordinate system z in which to build the dynamic model so that

y (measurements) → u (state space) → z (new state space) → N(·) (dynamic model)   (4.3)
In addition to the discovery of a measurement model and the underlying dynam-
ics, the dimension of the dynamics r must also be discovered and/or estimated in
any data-driven architecture. Often the rank of an underlying subspace on which
the physics can be projected can be estimated from the singular value decomposi-
tion (Brunton and Kutz 2019). However, hyperparameter tuning is critical in refining
the initial estimates of the rank r . Such hyperparameter tuning, which is aimed at jus-
tifying the choice of rank through cross-validation, is critical to almost every aspect
of training a successful machine learning model. Importantly, we wish to impose
physics-based regularization principles in order to make (4.1a) well-posed. In what
follows, construction of a ROM will be the driving principle for regularization in
model discovery.

4.2 Modeling Viewpoints

To frame the application of machine learning algorithms for building dynamic models
from data, we will consider physics-based systems modeled by partial differential
equations (PDEs). PDEs model a diversity of spatio-temporal systems, including
those found in the classical physics fields of fluid dynamics, electrodynamics, heat
conduction, and quantum mechanics. Indeed, PDEs provide a canonical description
of how a system evolves in space and time due to the presence of partial derivatives
which model the relationships between rates of change in time and space. Governing
equations of physics-based systems simply provide a constraint, or restriction, on
how these evolving quantities are related. We consider PDEs of the form (Courant
and Hilbert 2008)

u̇ = N(u, x, t; β)                          (4.4)

where u(x, t) is the variable of interest, or the state space, which we are trying to
model. The solution u(x, t) of the PDE depends upon the spatial variable x, which
can be in 1D, 2D, or 3D, and time t. Importantly, solutions can often depend on the
parameters β, thus requiring a solution ultimately that can model parametric depen-
dencies, i.e. u(x, t; β). In what follows, to illustrate various mathematical methods,
the parameters β are assumed to be fixed. Solutions of (4.4) are typically achieved
through numerical simulation, unless N (·) is linear and constant-coefficient so that

analytic solutions are available. Asymptotic and perturbation methods can also offer
analytically tractable solution methods (Kutz 2020). In many modern applications,
discretization of the evolution for u(x, t) results in a high-dimensional system for
which computations are prohibitively expensive. The goal of data-driven modeling is
to generate a proxy model of (4.4) that renders tractable, inexpensive computations
that approximate the full simulations of (4.4).
In what follows, we will typically assume that we do not know (4.4), but have only
data collected at different time points. However, if the governing equations are indeed
known, then recourse to model reduction allows for the construction of efficient proxy
models. Traditional reduced order models (ROMs) produce a dimensionally reduced
version of (4.4) by projecting the governing PDE into a new, low-rank coordinate
system. Coordinate transformations are one of the fundamental paradigms for pro-
ducing solutions to PDEs (Keener 2018). ROMs produce coordinate transformations
from the simulation data itself. Thus, if the governing PDE (4.4) is discretized so
that u(x, t) → uk = u(tk ) ∈ Rn , then snapshots of the solution can be collected into
the data matrix

X = [u_1  u_2  · · ·  u_m]                  (4.5)

where X ∈ C^{n×m}. An optimal coordinate system for ROMs is extracted from the data
matrix X with a singular value decomposition (SVD) (Kutz 2013)

X = Ψ Σ V*                                  (4.6)

where Ψ ∈ C^{n×r}, Σ ∈ R^{r×r}, and V ∈ C^{m×r}, and r is the number of modes extracted


to represent the data. The SVD, which is more commonly known in the ROMs
community as the proper orthogonal decomposition (POD) (Holmes et al. 2012;
Benner et al. 2015), computes the dominant correlated activity of the data as a set of
orthogonal modes. It is guaranteed to provide the best ℓ_2-norm reconstruction of the
data matrix X for a given number of modes r. Importantly, the r modes Ψ extracted
from the data matrix are used to produce a separation of variables solution to the
PDE (Haberman 1983)

u = Ψ(x) a(t)                               (4.7)

where the vector a = a(t) is determined by using this solution form and Galerkin pro-
jecting to (4.4) (Benner et al. 2015). Thus, projection-based ROM simply represents
the governing PDE evolution (4.4) in the r-rank subspace spanned by Ψ.
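In code, extracting the POD basis and the reduced coordinates amounts to a truncated SVD; a minimal numpy sketch follows, assuming the snapshot matrix X stores one state u(t_k) per column as in (4.5) and that the rank r has already been chosen.

```python
# POD via the SVD: modes Psi and reduced coordinates a such that X ~ Psi @ a.
import numpy as np

def pod_basis(X, r):
    """Return the leading r POD modes and the projection of the snapshots onto them."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)   # X = U S V*
    Psi = U[:, :r]                                     # POD modes, one per column
    a = Psi.T @ X                                      # reduced coordinates a(t_k), shape (r, m)
    return Psi, a, S

# Example with placeholder snapshot data: n = 256 states, m = 100 snapshots.
X = np.random.randn(256, 100)
Psi, a, S = pod_basis(X, r=10)
X_approx = Psi @ a                                     # rank-r reconstruction of the snapshots
```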
The projection-based ROM construction often produces a low-rank model for the
dynamics of a(t) that can be unstable (Carlberg et al. 2017), i.e. the models produced
generate solutions that rapidly go to infinity in time. Machine learning techniques
offer a diversity of alternative methods for computing the time dynamics in the low-
rank subspace. The simplest architecture aims to train a deep neural network that
maps the solution forward in time

ak+1 = fθ (ak ) (4.8)

where ak = a(tk ) and fθ represents a deep neural network whose weights and biases
are determined by θ . A diversity of neural networks can be used to advance solu-
tions or learn the flow map from time t to t + t (Qin et al. 2019; Liu et al. 2020).
Indeed, deep learning algorithms provide a flexible framework for constructing a
mapping between successive time steps. The typical ROM architecture constrains
the dynamics to a subspace spanned by the POD modes; thus, in the new coordi-
nate system which is generated by projection to the low-rank subspace, the snapshot
matrix is now constructed from the ak . These snapshots can be used to construct a
time-stepping model using neural networks. Recently, Parish et al. (2020) and Regaz-
zoni et al. (2019) developed a suite of neural network-based methods for learning
time-stepping models for (4.8). Moreover, Parish and Carlberg provide extensive
comparisons between different neural network architectures along with traditional
techniques for time-series modeling. In such models, the neural networks (or time-
series analysis methods) simply map an input ak to an output ak+1 .
Machine learning algorithms offer options beyond a Galerkin-POD or deep learn-
ing of time-stepping in the time variable a(t). Thus, instead of inserting (4.7) back
into (4.4) or learning a flow map fθ for (4.8), we can instead think about directly build-
ing a model for the dynamics of a(t). Thus, we would like to construct a dynamical
system

da/dt = f(a, t)                             (4.9)

where f(·) now prescribes the dynamics. Two highly successful options have emerged
for producing a model for the dynamics, (i) The dynamic mode decomposition
(DMD) (Kutz et al. 2016) and (ii) the sparse identification of nonlinear dynam-
ics (SINDy) (Brunton et al. 2016). The DMD model for f(·) is assumed to be linear
so that
da/dt ≈ L a                                 (4.10)
where L is a linear operator found by regression. Solutions are then trivial as all
that one requires is to find the eigenvalues and eigenvectors of L in order to provide
a general solution by linear superposition. The SINDy method makes a different
assumption, that the dynamics f(·) can be represented by only a few terms. In this
case, the regression is to the form

da/dt ≈ Θ(a) ξ                              (4.11)

where the columns of the matrix Θ are candidate terms from which to construct
a dynamical system and ξ contains the loading (or coefficient or weight) of each
library term. SINDy assumes that ξ is a sparse vector so that most of the library
terms do not contribute to the dynamics. The regression is a simple Ax = b solve for

an overdetermined system that is regularized by sparsity, or the sparsity-promoting
ℓ_0 or ℓ_1 norms.
In addition to the diversity of methods for building a model for the time dynamics
a(t), there also exists the possibility of using coordinates other than those defined by
Ψ. Moving beyond a linear subspace can provide improved coordinate systems for
building dynamic models for a(t). Importantly, there exists the possibility of learning
a coordinate system where, for instance, a linear DMD model or a parsimonious
SINDy model can be more efficiently imposed. Thus, we wish to learn a coordinate
transformation
z = fθ (u) (4.12)

where z is the new variable representing the state space u and fθ is the neural network
coordinate transformation that defines the mapping. The goal is then to enforce a
DMD or SINDy model in the new coordinate system

dz/dt = L z                                 (4.13a)
dz/dt = Θ(z) ξ .                            (4.13b)
This allows us to find a coordinate system beyond the standard linear, low-rank
subspace Ψ, which can be advantageous for ROM construction. In what follows,
we highlight the basic mathematical formulations allowing for a diversity of ROM
approximations, especially those that leverage DMD and SINDy in constructing
advantageous ROMs.

4.3 Learning Paradigms: Burgers’ Equation

The previous section highlights a number of methods that can be leveraged to build
a proxy model of the dynamics. The different paradigms can each be quite effective
depending on the goal of the practitioner. To demonstrate the construction of the
various representations, we consider the canonical nonlinear PDE: Burgers’ equation
with diffusive regularization. The evolution, as illustrated in Fig. 4.1a, is governed
by diffusion with a nonlinear advection (Burgers 1948).

u_t + u u_x − ε u_xx = 0,    ε > 0,  x ∈ [−∞, ∞].       (4.14)

When ε = 0, the evolution can lead to shock formation in finite time. The presence
of the diffusion term regularizes the PDE, ensuring continuous solutions for all time.
The governing equation (4.14) can be learned directly from data using the SINDy
architecture. Thus a parsimonious nonlinear evolution is learned. However, it is clear
in what follows that other representations can also be learned.

Fig. 4.1 a Evolution dynamics of Burgers' equation with initial condition u(x, 0) = exp(−(x +
2)^2). b Fifteen-mode DMD approximation of the Burgers' evolution. The simulation of (4.14)
was performed over t ∈ [0, 30] where the sampling was taken at every Δt = 1. The domain was
discretized with n = 256 points on a domain x ∈ [−15, 15]

Burgers’ equation is one of the few nonlinear PDEs whose analytic solution form
can be derived. In independent, seminal contributions, Hopf (1950) and Cole (1951)
derived a transformation that linearizes the PDE. The Cole–Hopf transformation is
defined as follows:
u = −2ε v_x / v .                            (4.15)

The transformation to the new variable v(x, t) replaces the nonlinear PDE (4.14)
with the linear, diffusion equation

v_t = ε v_xx                                 (4.16)

where it is noted that ε > 0 in (4.14) in order to produce a well-posed PDE.
Thus, through the Cole–Hopf transformation, the PDE can be exactly represented
in a linear fashion. This provides an alternative to (4.14). Indeed, this form of the
model can be solved using Fourier transforms. Fourier transforming in x gives the
ODE system

v̂_t = −ε k^2 v̂                              (4.17)

where v̂ = v̂(k, t) denotes the Fourier transform of v(x, t) and k is the wavenumber.
The solution in the Fourier domain is easily found to be

v̂ = v̂0 exp(−k 2 t) (4.18)

where v̂0 = v̂(k, 0) is the Fourier transform of the initial condition v(x, 0).
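This analytic route can be exercised numerically in a few lines: the sketch below evolves the heat equation for v exactly in Fourier space via (4.18) and maps back to u through the Cole–Hopf relation (4.15) using a spectral derivative. The grid, domain, and viscosity are illustrative choices, and because the FFT assumes periodicity while v is not periodic, the result is only approximate near the domain edges.

```python
# Burgers' equation through the Cole-Hopf transform and an exact Fourier heat solve.
import numpy as np

n, L, eps, t = 256, 30.0, 0.1, 5.0
x = np.linspace(-L / 2, L / 2, n, endpoint=False)
k = 2 * np.pi * np.fft.fftfreq(n, d=L / n)              # wavenumbers

u0 = np.exp(-(x + 2) ** 2)                               # initial condition for u
v0 = np.exp(-np.cumsum(u0) * (L / n) / (2 * eps))        # Cole-Hopf transform of u0, cf. (4.19)

v_hat = np.fft.fft(v0) * np.exp(-eps * k ** 2 * t)       # exact evolution in Fourier space (4.18)
v = np.real(np.fft.ifft(v_hat))
v_x = np.real(np.fft.ifft(1j * k * np.fft.fft(v)))       # spectral derivative of v
u = -2 * eps * v_x / v                                    # back to Burgers' variable via (4.15)
```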
The linearization process gives a direct construction of the Koopman operator. To
construct the Koopman operator, we can then combine the transform to the variable
v(x, t) from (4.15)
v(x, t) = exp( − (1/(2ε)) ∫_{−∞}^{x} u(ξ, t) dξ )        (4.19)

with the Fourier transform to define the observables

g(u) = v̂ . (4.20)

The Koopman operator is then constructed from (4.18) so that

K = exp(−ε k^2 t) .                          (4.21)

This is one of the rare instances where an explicit expression for the Koopman opera-
tor and the observables can be constructed analytically. The inverse scattering trans-
form (Ablowitz and Segur 1981) for other canonical PDEs, Korteweg-deVries (KdV)
and nonlinear Schrödinger (NLS) equations, also can lead to an explicit expression
for the Koopman operator, but the scattering transform and its inversion are much
more difficult to construct in practice.
If instead of an extended domain, one considers a bounded domain where x ∈
[0, L] and v(0) = v(L) = 0, then the diffusion equation yields the solution

v(x, t) = Σ_{n=1}^{∞} b_n exp(−λ_n t) sin(nπx/L).        (4.22)

In practice, one would only use a finite sum to represent the solution so that

v(x, t) ≈ Σ_{n=1}^{r} b_n exp(−λ_n t) sin(nπx/L).        (4.23)

where r is the number of modes used to approximate the solution. Thus, r is like the
rank of a reduced order approximation.
The purpose in explicitly working out the Burgers’ equation is to highlight the
diversity of what can be learned. Specifically, three representations can be constructed
from data: the SINDy model (4.14), a linear Koopman model (4.16), or the eigenso-
lution itself (4.23). Note that our typical analytical methods actually start with the
nonlinear governing equation (4.14) and transform it first to a linear model (4.16)
before producing the solution (4.23). From a data-driven approach, all three are sim-
ply representations in different coordinate systems and variables. Thus, when using
machine learning methods to jointly learn coordinates and dynamics, it is critical to
specify exactly which representation is desired. There are indeed infinite represen-
tations of physics models (Chen et al. 2022). One can impose one’s desires on the
model by constructing appropriate loss functions when training a neural network for
a coordinate system.

4.4 Dynamic Models From Data

Three mathematical architectures will be used in building dynamic models from


data, including DMD, SINDy, and neural networks. Each will be mathematically
considered in the context of the mathematical construct (4.1a).

4.4.1 Dynamic Mode Decomposition

Dynamic mode decomposition originated in the fluid dynamics community. Intro-
duced as an algorithm by Schmid and Sesterhenn (2008, 2010), it has rapidly become
a commonly used data-driven analysis tool and the standard algorithm to approxi-
mate the Koopman operator from data (Rowley et al. 2009). In the context of fluid
dynamics, DMD was used to identify spatio-temporal coherent fluid structures from
high-dimensional time-series data. The DMD analysis offered an alternative to stan-
dard dimensionality reduction methods such as the proper orthogonal decomposition
(POD), which highlighted low-rank features in fluid flows using the computationally
efficient singular value decomposition (SVD) (Kutz 2013). The advantage of using
DMD over SVD is that the DMD modes are linear combinations of the SVD modes
that have a common linear (exponential) behavior in time, given by oscillations at a
fixed frequency with growth or decay. Specifically, DMD is a regression to solutions
of the form
r
X(t) =  j eω j t b j =  exp( t)b, (4.24)
j=1

where X(t) is an r-rank approximation to a collection of state space measurements
x_k = x(t_k) (k = 1, 2, . . . , n). The algorithm regresses to values of the DMD eigen-
values ω_j, DMD modes φ_j, and their loadings b_j. The ω_j determines the temporal
behavior of the system associated with a modal structure φ_j, thus giving a highly
interpretable representation of the dynamics. Such a regression can also be learned
from time-series data (Lange et al. 2020). DMD may be thought of as a combination
of SVD/POD in space with the Fourier transform in time, combining the strengths
of each approach (Chen et al. 2012; Kutz et al. 2016). DMD is modular due to its
simple formulation in terms of linear algebra, resulting in innovations related to con-
trol (Proctor et al. 2016), compression (Brunton et al. 2015; Erichson et al. 2019),
reduced order modeling (Alla and Kutz 2017), and multi-resolution analysis (Kutz
et al. 2016; Champion et al. 2019), among others.
Because of its simplicity and interpretability, DMD has been applied to a wide
range of diverse applications beyond fluid mechanics, including neuroscience (Brun-
ton et al. 2016), disease modeling (Proctor and Eckhoff 2015), robotics (Mamakoukas
et al. 2019, 2020), video processing (Grosek and Kutz 2014; Erichson et al. 2016),
power grids (Susuki et al. 2009; Susuki and Mezic 2011), financial markets (Mann
and Kutz 2016), and plasma physics (Taylor et al. 2018; Kaptanoglu et al. 2020).

The regression to (4.24) shows the immediate value of DMD for forecasting. Specif-
ically, any time t ∗ can be evaluated to produce an approximation to the state of the
system X(t ∗ ). However, despite its introduction more than a decade ago, DMD is
rarely used for forecasting and/or reconstruction of time-series data except in cases
with high-quality (noise-free or nearly noise-free) data. Indeed, practitioners who
work with DMD and noisy data know that the algorithm fails not only to produce a
reasonable forecast but also often fails in the task of reconstructing the time series it
was originally regressed to. Thus, in the past decade, the value of DMD has largely
been an important diagnostic tool as the DMD modes and frequencies are highly
interpretable. Indeed, from Schmid’s (Schmid and Sesterhenn 2008; Schmid 2010)
original work until now, DMD papers are primarily diagnostic in nature with the key
figures of any given paper being the DMD modes and eigenvalues. In cases, where
DMD is used on noise-free data, such as for producing reduced order models from
high-fidelity numerical simulation data (Bagheri 2014; Alla and Kutz 2017), then
the DMD solution (4.24) can be used for reconstructing and forecasting accurate
representations of the solution.
The algorithmic construction of the DMD method can be best understood from the
so-called exact DMD (Tu et al. 2014). Indeed, this exact DMD is simply a least-square
fitting procedure. Specifically, the DMD algorithm seeks a best-fit linear operator A
that approximately advances the state of a system, x ∈ Rn , forward in time according
to the linear dynamical system
xk+1 = Axk , (4.25)

where x_k = x(kΔt), and Δt denotes a fixed time step that is small enough to resolve
the highest frequencies in the dynamics. Thus, the operator A is an approximation
of the Koopman operator K restricted to a measurement subspace spanned by direct
measurements of the state x (Rowley et al. 2009). In the original DMD formula-
tion (Schmid 2010), uniform sampling in time was required so that t_k = kΔt. The
exact DMD algorithm (Tu et al. 2014) does not require uniform sampling. Rather,
for each snapshot u(t_k), there is a corresponding snapshot u'(t_k) one time step Δt in
the future. These snapshots are arranged into two matrices, X which is given in (4.5),
and X' which is the matrix (4.5) with all snapshots advanced Δt. In terms of these
matrices, the DMD regression (4.25) is

X' ≈ AX.                                     (4.26)

The exact DMD is the best fit, in a least-squares sense, operator A that approximately
advances snapshot measurements forward in time. Specifically, it can be formulated
as an optimization problem

A = argmin_A ||X' − AX||_F = X' X†           (4.27)

where ||·||_F is the Frobenius norm and † denotes the pseudo-inverse. The pseudo-
inverse may be computed using the SVD of X = U Σ V* as X† = V Σ^{-1} U*. The

matrices U ∈ C^{n×n} and V ∈ C^{m×m} are unitary so that U*U = I and V*V = I, where *


denotes complex-conjugate transpose. The columns of U are known as POD modes.
Often for high-dimensional data, the DMD leverages low-rank structure by first
projecting A onto the first r POD modes in Ur and approximating the pseudo-inverse
using the rank-r SVD approximation X ≈ U_r Σ_r V_r*

Ã = U_r* A U_r = U_r* X' V_r Σ_r^{-1} .      (4.28)

The leading spectral decomposition of A may be approximated from the spectral


decomposition of the much smaller Ã

Ã W = W Λ .                                  (4.29)

The diagonal matrix Λ contains the DMD eigenvalues, which correspond to eigenval-
ues of the high-dimensional matrix A. The columns of W are eigenvectors of Ã, and
provide a coordinate transformation that diagonalizes the matrix. These columns may
be thought of as linear combinations of POD mode amplitudes that behave linearly
with a single temporal pattern given by the corresponding eigenvalue λ.
The eigenvectors of A are the DMD modes Φ, and they are reconstructed using
the eigenvectors W of the reduced system and the time-shifted data matrix X'

Φ = X' Ṽ Σ̃^{-1} W.                           (4.30)

Tu et al. (2014) proved that these DMD modes are eigenvectors of the full A matrix
under certain conditions. As already shown in the introduction, the DMD decompo-
sition allows for a reconstruction of the solution as (4.24). The amplitudes of each
mode b can be computed from
b = Φ† x_1 ,                                 (4.31)

however, alternative and often better approaches are available to compute b (Chen
et al. 2012; Jovanović et al. 2014; Askham and Kutz 2018). Thus, the data matrix X
may be reconstructed as

X ≈ Φ diag(b) T(ω) ,                         (4.32)

where diag(b) places the amplitudes b_1, . . . , b_r on the diagonal and the matrix T(ω)
has entries T_jk = e^{ω_j t_k}.
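The exact DMD steps above translate nearly line-for-line into numpy; the sketch below assumes the snapshot pairs X and X' (here Xp) have already been formed and that the truncation rank r and the sampling interval dt are supplied by the user.

```python
# Exact DMD (Tu et al. 2014): best-fit linear operator advancing X to Xp.
import numpy as np

def exact_dmd(X, Xp, r, dt):
    """Return DMD modes Phi, continuous-time eigenvalues omega, and amplitudes b."""
    U, S, Vh = np.linalg.svd(X, full_matrices=False)
    Ur, Sr, Vr = U[:, :r], S[:r], Vh.conj().T[:, :r]
    Atilde = Ur.conj().T @ Xp @ Vr / Sr                 # rank-r projection of A, cf. (4.28)
    evals, W = np.linalg.eig(Atilde)                    # cf. (4.29)
    Phi = Xp @ Vr / Sr @ W                              # exact DMD modes, cf. (4.30)
    omega = np.log(evals) / dt                          # continuous-time eigenvalues
    b = np.linalg.lstsq(Phi, X[:, 0], rcond=None)[0]    # amplitudes, cf. (4.31)
    return Phi, omega, b

def dmd_reconstruct(Phi, omega, b, times):
    """Evaluate the DMD expansion (4.24)/(4.32) at the requested times."""
    dynamics = np.exp(np.outer(omega, times)) * b[:, None]
    return Phi @ dynamics
```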

Bagheri (2014) first highlighted that DMD is particularly sensitive to the effects of
noisy data, with systematic biases introduced to the eigenvalue distribution (Duke
et al. 2012; Bagheri 2013; Dawson et al. 2016; Hemati et al. 2017). As a result, a
number of methods have been introduced to stabilize performance, including total
least-squares DMD (Hemati et al. 2017), forward-backward DMD (Dawson et al.
2016), variational DMD (Azencot et al. 2019), subspace DMD (Takeishi et al. 2017),

time-delay embedded DMD (Brunton et al. 2017; Arbabi and Mezić 2017; Kamb et al.
2020; Hirsh et al. 2021), and robust DMD methods (Askham and Kutz 2018; Scherl
et al. 2020). However, the optimized DMD algorithm of Askham and Kutz (2018),
which uses a variable projection method for nonlinear least squares to compute the
DMD for unevenly timed samples, provides the best and optimal performance of
any algorithm currently available. This is not surprising given that it actually is
constructed to optimally satisfy the DMD problem formulation. Specifically, the
optimized DMD algorithm solves the exponential fitting problem directly

argmin X − b T(ω) F, (4.33)


ω,b

where b = diag(b). This has been shown to provide a superior decomposition due
to its ability to optimally suppress bias and handle snapshots collected at arbitrary
times. Moreover, it can be used with statistical bagging methods to produce the BOP-
DMD (bagging, optimized DMD) (Sashidhar and Kutz 2021), which is perhaps
the most stable variant of DMD. BOP-DMD also provides spatial and temporal
uncertainty quantification. The disadvantage of optimized DMD is that one must
solve a nonlinear optimization problem, often which can fail to converge.
The construction of a traditional ROM that is accurate and efficient is centered
on the reduction (4.4). Thus, once a low-rank subspace is computed from the SVD,
the POD modes Ψ are used for projecting the dynamics. DMD allows us to use a
data-driven, non-intrusive method in order to regress to a model for the temporal
dynamics. Consider modification of (4.4) to the evolution dynamics

ut = Lu + N (u, ux , ux x , . . . , x, t; β) (4.34)

where the linear and nonlinear parts of the evolution, denoted by L and N(·), respec-
tively, have been explicitly separated. The solution ansatz u = Ψa yields the ROM

da/dt = Ψ^T L Ψ a + Ψ^T N(Ψa, β).            (4.35)

Note that the linear operator in the reduced space, Ψ^T L Ψ, is an r × r matrix which
is easily computed. The nonlinear portion of the operator, Ψ^T N(Ψa, β), is more
complicated since it involves repeated computation of the operator as the solution a,
and consequently the high-dimensional state u, is updated in time.
One method for overcoming the difficulties introduced in evaluating the nonlinear
term on the right-hand side is to introduce the DMD algorithm. DMD approximates
a set of snapshots by a best-fit linear model. Thus, the nonlinearity can be evaluated
over snapshots and a linear model constructed to approximate the dynamics. Thus,
two matrices can be constructed
N_− = [N_1  N_2  · · ·  N_{m−1}]   and   N_+ = [N_2  N_3  · · ·  N_m]     (4.36)

where Nk is the evaluation of the nonlinear term N (u, ux , ux x , . . . , x, t; β) at t = tk .


The ± denotes the input (−) and output (+), respectively. This gives the training
data necessary for regressing to a DMD model

N+ = AN N− . (4.37)

The governing equation (4.34) can then be approximated by

u_t ≈ Lu + A_N u = (A + A_N) u               (4.38)

where the operator L has been replaced by A. The dynamics is now completely linear
and solutions can be easily constructed from the eigenvalues and eigenvectors of the
linear operator A + AN .
In practice, the DMD algorithm exploits low-dimensional structure in building
a ROM model. Thus, instead of the approximate linear model (4.34), we build a
low-dimensional version. From snapshots (4.36) of the nonlinearity, the DMD algo-
rithm can be used to approximate the dominant rank-r nonlinear contribution to the
dynamics as
N(u, u_x, u_xx, . . . , x, t; β) ≈ Σ_{j=1}^{r} φ_j exp(ω_j t) b_j = Φ exp(Ωt) b     (4.39)

where b_j determines the weighting of each mode. Here, φ_j is the DMD mode and
ω j is the DMD eigenvalue. This approximation can be used in (4.35) to produce the
POD-DMD approximation

da/dt = Ψ^T L Ψ a + Ψ^T Φ exp(Ωt) b.         (4.40)
In this formulation, there are a number of advantageous features: (i) the nonlin-
earity is only evaluated once with the DMD algorithm (4.39) and (ii) the products
Ψ^T L Ψ and Ψ^T Φ are also only evaluated once and both produce matrices that are
low rank, i.e. they are independent of the original high-rank system. Thus, with a
one-time, up-front evaluation of two snapshot matrices to produce Ψ and Φ, the
ROM produces a computationally efficient ROM that requires no recourse to the
original high-dimensional system.
Alla and Kutz (2017) integrated the DMD algorithm into the traditional ROM
formalism to produce the POD-DMD model (4.40). The comparison of this com-
putationally efficient ROM with traditional model reduction is shown in Fig. 4.2.
Specifically, both the computational time and error are evaluated using this tech-
nique. Once the DMD algorithm is used to produce an approximation of the nonlin-

Fig. 4.2 Computation time and accuracy on a semi-linear parabolic equation (Modified from Alla
and Kutz 2017). Four methods are compared, the high-fidelity simulation of the governing equations
(FULL), a Galerkin-POD reduction as given in (4.40) (POD), a Galerkin-POD reduction with the
discrete empirical interpolation (DEIM) algorithm for evaluating the nonlinearity (POD-DEIM),
and the POD-DMD approximation (4.40). The left panel shows the computation time, which is
an order-of-magnitude faster than traditional POD-DEIM algorithms. The right panel shows the
accuracy of the different methods for reproducing the high-fidelity simulations. POD-DMD loses
some accuracy in comparison to Galerkin-POD methods due to the fact that DMD modes are not
orthogonal and thus the error does not decrease as quickly as the POD-based methods

ear term, it can be used for producing future state predictions and a computationally
efficient ROM. Indeed, its computational acceleration is quite remarkable in com-
parison to traditional methods. Moreover, the method is non-intrusive and does not
require additional evaluation of the nonlinear term. The entire method can be used
with randomized algorithms to speed up the low-rank evaluations even further (Alla
and Kutz 2019). Note that the computational performance boost comes at the expense
of accuracy as shown in Fig. 4.2. This is primarily due to the fact that additional POD
modes used for standard ROMs, which are orthogonal by construction and guaran-
teed to be a best fit in ℓ_2, are now replaced by DMD modes which are no longer
orthogonal (Alla and Kutz 2017).

4.4.2 Sparse Identification of Nonlinear Dynamics

In addition to a DMD model for modeling the low-rank dynamics, the SINDy regres-
sion framework also allows one to build a model for the evolution of the temporal
dynamics in the low-rank subspace. Discovery of governing equations plays a fun-
damental role in the development of physical theories, and in this case, we wish to
discover the evolution dynamics of a(t) for constructing our ROM. With increas-
ing computing power and data availability in recent years, there have been sub-
stantial efforts to identify the governing equations directly from data (Bongard and
Lipson 2007; Schmidt and Lipson 2009; Yang et al. 2020). There has been partic-
ular emphasis on parsimonious representations because they have the benefits of

promoting interpretability and generalizing well to unknown data (Bai et al. 2015;
Brunton et al. 2014, 2016; Mackey et al. 2014; Ozoliņš et al. 2013; Proctor et al.
2014; Tran and Ward 2017; Wang et al. 2011). The SINDy method was proposed
in Brunton et al. (2016), which leverages dictionary learning and sparse regression
to model dynamical systems. This approach has been successful in modeling a diver-
sity of applications, including in chemistry (Hoffmann et al. 2019), optics (Sorokina
et al. 2016), engineered systems (Li et al. 2019), epidemiology (Horrocks and Bauch
2020), and plasma physics (Dam et al. 2017). Furthermore, there has been a variety
of modifications, including improved robustness to noise (Champion et al. 2019,
2020; Kaheman et al. 2020), generalizations to partial differential equations (Raissi
and Karniadakis 2018; Rudy et al. 2017, 2019), multi-scale physics (Champion et al.
2019), and libraries of rational functions (Mangan et al. 2016; Kaheman et al. 2020).
Just like the BOP-DMD algorithm (Sashidhar and Kutz 2021), recent Bayesian and
ensemble methods make SINDy much more robust for model discovery for noisy
systems and with little data (Hirsh et al. 2021; Fasel et al. 2021).
In the context of ROMs modeling, the goal is now to discover a dynamic, parsi-
monious model of the evolution dynamics of a high-fidelity model embedded in a
low-rank subspace. Recall that u(t) ≈ a(t) for building a ROM. Although  can
be easily computed with the SVD, it is the evolution of a(t) that ultimately deter-
mines the temporal behavior of the system. Thus far, the temporal evolution has been
computed via Galerkin projection and DMD. SINDy gives yet another alternative

da/dt = f(a)                                 (4.41)

where the right-hand side function prescribing the evolution dynamics f(·) is
unknown. SINDy provides a sparse regression framework to determine this dynam-
ics. As in DMD, the snapshots of a(t) are collected into the matrix
A = [a_1  a_2  · · ·  a_m] .                 (4.42)

The SINDy regression framework is then formulated as

Ȧ = Θ(A) Ξ .                                 (4.43)

where each column ξ_k in Ξ is a vector of coefficients determining the active terms in
the kth row in (4.41). Leveraging parsimony provides a dynamical model using as few
terms as possible in Ξ. Such a model may be identified using a convex ℓ_1-regularized
sparse regression

ξ_k = argmin_{ξ_k} ||ȧ_k − Θ(A) ξ_k||_2 + λ ||ξ_k||_1 .      (4.44)

Fig. 4.3 Application of SINDy algorithm for ROM construction. High-dimensional data is used
with the sparse identification of nonlinear dynamics (SINDy) (Brunton et al. 2016) in order to
produce a model for a(t). This procedure is modular so that different techniques can be used for the
feature extraction and regression steps. In this example of flow past a cylinder, SINDy discovers
the model of Noack et al. (2003). Modified from Brunton et al. (2016)

Note that ȧ_k is the kth column of Ȧ, and λ is a sparsity-promoting regularization parameter. There
are many variants for sparsity promotion that can be used (Tibshirani 1996; Donoho
2006; Candès 2006; Candès et al. 2023, 2006; Candès and Tao 2006; Baraniuk
2007; Tropp and Gilbert 2007), including the advocated sequential least-squares
thresholding to select active terms (Brunton et al. 2016).
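A compact numpy sketch of the sequentially thresholded least-squares step used to solve (4.44) is given below; the hand-built polynomial library, threshold, and iteration count are illustrative choices rather than the settings of any particular study.

```python
# Sequentially thresholded least squares for the SINDy regression (4.43)-(4.44).
import numpy as np

def poly_library(A):
    """Candidate library Theta(A) with constant, linear, and quadratic terms;
    A holds one reduced-state snapshot a(t_k) per row."""
    cols = [np.ones((A.shape[0], 1)), A]
    cols += [A[:, [i]] * A[:, [j]] for i in range(A.shape[1]) for j in range(i, A.shape[1])]
    return np.hstack(cols)

def stlsq(Theta, dAdt, threshold=0.1, n_iter=10):
    """Sparse coefficients Xi: iteratively zero small entries and refit the active terms."""
    Xi = np.linalg.lstsq(Theta, dAdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(Xi) < threshold
        Xi[small] = 0.0
        for k in range(dAdt.shape[1]):
            active = ~small[:, k]
            if active.any():
                Xi[active, k] = np.linalg.lstsq(Theta[:, active], dAdt[:, k], rcond=None)[0]
    return Xi
```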
The SINDy-POD method provides a simple regression framework for discovering
a parsimonious, and generally nonlinear, model for the evolution dynamics of the
high-dimensional model in a low-dimensional subspace. For example, the canonical
example of flow past a circular cylinder is considered. This is modeled by the 2D,
incompressible Navier–Stokes equations (Bagheri 2014)

∇ · u = 0,    ∂_t u + (u · ∇)u = −∇p + (1/Re) Δu             (4.45)
where u is the two-component flow velocity field in 2D and p is the pressure term. For
Reynolds number Re = Rec ≈ 47, the fluid flow past a cylinder undergoes a super-
critical Hopf bifurcation, where the steady flow for Re < Rec transitions to unsteady
vortex shedding (Bearman 1969). The unfolding gives the celebrated Stuart–Landau
ODE, which is essentially the Hopf normal form in complex coordinates. This has
resulted in accurate and efficient reduced order models for this system (Noack et al.
2003, 2011).
In Fig. 4.3, simulations at Re = 100 were considered. The SVD of the data matrix
at this Reynolds number shows that three modes dominate the dynamics. As such
the first three columns of the matrix V are extracted and the SINDy regression (4.43)
is performed. The discovered dynamical model is given by

ȧ_1 = μ a_1 − ω a_2 + A a_1 a_3              (4.46a)
ȧ_2 = ω a_1 + μ a_2 + A a_2 a_3              (4.46b)
ȧ_3 = −λ (a_3 − a_1^2 − a_2^2)               (4.46c)

which is the same as that found by Noack et al. (2003) through a detailed asymptotic reduc-
tion of the flow dynamics. Thus, the ROM evolution dynamics (4.46c) represents
a significantly different model than what is achieved via Galerkin POD projection.
This model is stable and it also captures the correct supercritical Hopf bifurcation
dynamics as a function of the Reynolds number. Thus, the SINDy-POD provides an
improved ROM description of the dynamics.

4.4.3 Neural Networks

The emergence of machine learning is expanding the mathematical possibilities for


the construction of accurate ROMs. As shown in the previous sections, the focus of
traditional projection-based ROMs is on computing the low-dimensional subspace
 on which to project the governing equations. Recall that in constructing the low-
dimensional subspace, the SVD is used on snapshots of high-fidelity simulation (or
experimental) data X ≈  ˜ Ṽ∗ . The POD reduction technique uses only the single
matrix  in the reduction process. The temporal evolution in the reduced space
 is quantified by ˜ Ṽ∗ . This gives explicitly the evolution of each mode over the
snapshots of X, information which is not used in projection-based ROMs. Neural
networks can then be used directly on the time-series data encoded in V to build a
time-stepping algorithm for marching the solution forward in time.
The motivation for using deep learning algorithms for time-stepping is the recog-
nition that projection-based model reduction often can produce unstable iteration
schemes (Carlberg et al. 2017). A second important fact is that valuable temporal
information in the low-dimensional space is summarily dismissed by the projection
schemes, i.e. only the POD modes are retained for ROM construction. Neural net-
works aim to leverage the temporal information and in the process build efficient and
stable time-stepping proxies. Recall that model reduction proceeds by projecting into
the low-dimensional subspace spanned by Ψ so that

u(t) ≈ Ψ a(t).                               (4.47)

In the projection-based ROMs of previous sections, the amplitude dynamics a(t) are
constructed by Galerkin projection on the governing equations. With neural networks,
the dynamics a(t) is approximated from the discrete time-series data encoded in V.
Specifically, this gives

Fig. 4.4 Illustration of neural network integration with POD subspaces. The autoencoder structure
projects the original high-dimensional state space data into a low-dimensional space via u(t) ≈
Ψ a(t). As shown in the bottom left, the snapshots u_k are generated by high-fidelity numerical
solutions of the governing equations ut = N (u, ux , ux x , . . . , x, t; β). In traditional ROMs, the
snapshots ak are constructed from Galerkin projection as shown in the bottom right. Neural networks
instead learn a mapping ak+1 = fθ (ak ) from the original, low-dimensional snapshot data. It should
be noted that time-stepping Runge–Kutta schemes, for instance, are a form of feed-forward neural
networks, which are used to produce the original high-fidelity data snapshots uk (Gonzalez-Garcia
et al. 1998)

a(t) → Σ̃ Ṽ* = [a_1  a_2  · · ·  a_m]         (4.48)

over the m time snapshots of the original data matrix on which the ROM is to be
constructed.
Deep learning algorithms provide a flexible framework for constructing a mapping
between successive time steps. As shown in Fig. 4.4, the typical ROM architecture
constrains the dynamics to a subspace spanned by the POD modes Ψ. Thus, in the
original coordinate system, the high-fidelity simulations of the governing equations
for u are solved with a given numerical discretization scheme to produce a snapshot
matrix X containing uk . In the new coordinate systems which is generated by projec-
tion to the subspace Ψ, the snapshot matrix is now constructed from ak as shown in
(4.48). In traditional ROMs, the snapshot matrix (4.48) is not used. Instead snapshots
of ak are achieved by solving the Galerkin projected model. However, the snapshot
matrix (4.48) can be used to construct a time-stepping model using neural networks.
Neural networks allow one to use the high-fidelity simulation data to train a mapping

ak+1 = fθ (ak ) (4.49)

where fθ is a generic representation of a neural network which is characterized by its


structure, weights, and biases. Neural networks can be costly to train. Indeed, they
typically require a significant amount of data and long training periods in order to
perform up to their potential. When comparing DMD, SINDy, and neural networks,
neural networks take the longest time to train, while DMD is rapid and data efficient.
These trade-offs must often be considered in making a choice between the three
different model reduction paradigms. It is also important to consider the number of
trainable parameters. Deep neural networks have been recently shown to work best
for over-parametrized setups (Kontolati et al. 2022).
As previously mentioned, Parish et al. (2020) and Regazzoni et al. (2019) devel-
oped neural network models to learn time-stepping of (4.49). In such models, the
neural networks (or time-series analysis methods) simply map an input (ak ) to an
output (ak+1 ). In its simplest form, the neural network training requires input–output
pairs that can be generated from snapshots ak . Thus, two matrices can be constructed
⎡ ⎤ ⎡ ⎤
| | | | | |
A− = ⎣a1 a2 · · · am−1 ⎦ and A+ = ⎣a2 a3 · · · am ⎦ (4.50)
| | | | | |

where the ± denotes the input (−) and output (+), respectively. This gives the training
data necessary for learning (optimizing) a neural network map

A+ = fθ (A− ). (4.51)

There are numerous neural network architectures that can learn the mapping fθ . A
simple feed-forward network was already shown to be quite accurate in learning such
a model. Further sophistication can improve accuracy and reduce data requirements
for training. Regardless of the architecture, the error accumulation in DNNs when
the solution is obtained recursively is an open research question which is important
to understand for its usage in ROM architectures and for long time solutions.
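
As an illustration of the training setup in (4.50)–(4.51), the following hedged PyTorch sketch builds the input–output pairs from reduced snapshots and fits a small feed-forward time-stepper; the architecture, optimizer settings, and the placeholder data are assumptions for demonstration only, not the specific models used in the cited works.

import torch
import torch.nn as nn

r, m = 10, 200
A = torch.randn(m, r)                    # rows are reduced snapshots a_k (placeholder data)

A_minus, A_plus = A[:-1], A[1:]          # input-output pairs per (4.50)

# A generic feed-forward representation of f_theta in (4.49)
f_theta = nn.Sequential(nn.Linear(r, 64), nn.Tanh(),
                        nn.Linear(64, 64), nn.Tanh(),
                        nn.Linear(64, r))

opt = torch.optim.Adam(f_theta.parameters(), lr=1e-3)
for epoch in range(2000):
    opt.zero_grad()
    loss = ((f_theta(A_minus) - A_plus) ** 2).mean()   # || f_theta(A_-) - A_+ ||^2
    loss.backward()
    opt.step()

# Recursive forecasting: march the reduced state forward in time
with torch.no_grad():
    a = A[:1]
    forecast = [a]
    for _ in range(50):
        a = f_theta(a)
        forecast.append(a)
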
Regazzoni et al. (2019) formulated the optimization of (4.51) in terms of
maximum-likelihood. Specifically, they considered the most suitable representation
of the high-fidelity model in terms of simpler neural network models. They show
that such neural network models can approximate the solution to within any accuracy
required (limited by the accuracy of the training data, of course) simply by construct-
ing them from the input-output pairs given by (4.51). Parish et al. (2020) provide an
in-depth study of different neural network architectures that can be used for learning

Fig. 4.5 From Parish et al. (2020), a comparison of a diversity of error metrics and methods for
constructing the mapping (4.49) for the advection–diffusion equations. In all models considered
in the paper, the LSTM and RNN structures proved to be the more accurate models for
time-stepping. The reader is encouraged to consult the original paper for the details of the
underlying models, the error metrics displayed, and the training data used. Python code is
available in the appendix of Parish et al. (2020)

the time-steppers. They are especially focused on recurrent neural network (RNN)
architectures that have proven to be so effective in temporal sequences associated
with language (Goodfellow et al. 2016). Their extensive comparisons show that long
short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) neural networks
outperform other methods and provide substantial improvements over traditional
time-series approaches such as autoregressive models. In addition to a baseline Gaus-
sian process (GP) regression, they specifically compare time-stepping models that
include the following, k-nearest neighbors (kNN), artificial neural networks (ANN),
autoregressive with exogenous inputs (ARX), Integrated ANN (ANN-I), latent ARX
(LARX), RNN, LSTM, and standard GP. Some models include recursive training
(RT) and others do not (NRT). Their comparisons on a diversity of PDE models,
which will not be detailed here, are evaluated on the fraction of variance unex-
plained (FVU). Figure 4.5 gives a representation of the extensive comparisons made
on these methods for an advection–diffusion PDE model.

The success of neural networks for learning time-stepping representations fits


more broadly under the aegis of flow maps (Wiggins 2003)

uk+1 = F(uk ). (4.52)

For neural networks, the flow map is approximated by the learned model (4.49) so
that F = fθ . Qin et al. (2019) and Liu et al. (2020) have explored the construction
of flow maps from neural networks as yet another modeling paradigm for advanc-
ing the solution in time without recourse to high-fidelity simulations. Such methods
offer a broader framework for fast time-stepping algorithms as no initial dimen-
sionality reduction needs to be computed. In Qin et al. (2019), the neural network
model fθ is constructed with a residual network (ResNet) as the basic architecture
for approximation. In addition to a one-step method, which is shown to be exact in
temporal integration, a recurrent ResNet and a recursive ResNet are also constructed
for multiple time steps. Their formulation is also in the weak form where no deriva-
tive information is required in order to produce the time-stepping approximations.
Several numerical examples are presented to demonstrate the performance of the
methods. Like Parish et al. (2020) and Regazzoni et al. (2019), the method is shown
to be exceptionally accurate even in comparison with direct numerical integration,
highlighting the qualities of the universal approximation properties of fθ .
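
A residual (ResNet-style) one-step flow map in the spirit of (4.52) can be sketched as follows; this is only a schematic of the idea, not the specific networks used by Qin et al. (2019), and the layer sizes are arbitrary.

import torch
import torch.nn as nn

class ResNetStepper(nn.Module):
    # One-step flow map u_{k+1} = u_k + g_theta(u_k); the skip connection
    # mimics the structure of an explicit time-integration step
    def __init__(self, dim, width=64):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(dim, width), nn.Tanh(),
                               nn.Linear(width, width), nn.Tanh(),
                               nn.Linear(width, dim))

    def forward(self, u):
        return u + self.g(u)
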
Liu et al. (2020) leveraged the flow map approximation scheme to learn a mul-
tiscale time-stepping scheme. Specifically, one can learn flow maps for different
characteristic timescales. Thus, a given model

ak+τ = fθ τ (ak ) (4.53)

can learn a flow map over a prescribed timescale τ . If there exist distinct timescales in
the data, for instance, denoted by t1 , t2 , and t3 with t1 ≫ t2 ≫ t3 (slow, medium, and
fast times), then three models can be learned, fθ 1 , fθ 2 , and fθ 3 for the slow, medium,
and fast times, respectively. Figure 4.6 shows the hierarchical time-stepping (HiTS)
scheme with three distinct timescales. The training data of a high-fidelity simulation,
or collection of experimental data, allow for the construction of flow maps which
can then be used to efficiently forecast long times in the future. Specifically, one can
use the flow map constructed on the slowest scale fθ 1 to march far into the future
while the medium and fast scales are then used to advance to the specific point in
time. Thus, a minimal number of steps is taken on the fast scale, and the work of
forecasting long into the future is done by the slow and medium scales. The method
is highly efficient and accurate.
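
The composition of flow maps over several timescales can be sketched as below; the three stepper models and step counts are placeholders, and the function is only meant to illustrate how the coarse map carries most of the forecasting work.

import torch

def hierarchical_forecast(a0, f_slow, f_med, f_fast, n_slow, n_med, n_fast):
    # f_slow, f_med, f_fast: learned flow maps trained with strides tau_1 > tau_2 > tau_3
    a = a0
    with torch.no_grad():
        for _ in range(n_slow):   # large jumps far into the future
            a = f_slow(a)
        for _ in range(n_med):    # medium-scale refinement
            a = f_med(a)
        for _ in range(n_fast):   # only a few fine steps are needed at the end
            a = f_fast(a)
    return a
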
Figure 4.7 compares the HiTS scheme across a number of example problems,
some of which are videos and music frames. Thus, HiTS does not require governing
equations, simply time-series data arranged into input–output pairs. The performance
of such flow maps is remarkably robust, stable, and accurate, even when compared to
leading time-series neural networks such as LSTMs, echo state networks (ESN) and
clockwork recurrent neural networks (CW-RNNs). This is especially true for long

Fig. 4.6 Multiscale hierarchical time-stepping scheme (modified from Liu et al. 2020). Neural
network representations of the time-steppers are constructed over three distinct time scales. The
red model takes large steps (slow scale fθ 1 ), leaving the finer time-stepping to the yellow (medium
time scales fθ 2 ) and blue (fast time scales fθ 3 ) models. The dark path shows the sequence of maps
from u1 to um

Fig. 4.7 Evaluation of different neural network architectures (column) on each training sequence
(row) (From Liu et al. 2020). Key diagnostics are visualized from a diversity of examples, including
music files and videos. The last frame of the reconstruction is visualized for the first, third, and
fourth examples while the entire music score is visualized in the second example. Note the superior
performance of the hierarchical time-stepping scheme in comparison with other modern neural net-
work models such as LSTMs, echo state networks (ESN), and clockwork recurrent neural networks
(CW-RNNs)

forecasts in contrast to the small time-steps evaluated in the work of Parish et al.
(2020).
Overall, the works of Parish et al. (2020), Regazzoni et al. (2019), Qin et al.
(2019), and Liu et al. (2020) exploit very simple training paradigms related to input–

Fig. 4.8 Schematic of the SINDy autoencoder method for simultaneous discovery of coordinates
and parsimonious dynamics (From Champion et al. 2019). a An autoencoder architecture is used
to discover intrinsic coordinates z from high-dimensional input data x. The network consists of
two components, an encoder ϕ(x), which maps the input data to the intrinsic coordinates z, and a
decoder ψ(z), which reconstructs x from the intrinsic coordinates. b A SINDy model captures the
dynamics of the intrinsic coordinates. The active terms in the dynamics are identified by the nonzero
elements in Ξ, which are learned as part of the NN training. The time derivatives of z are calculated
using the derivatives of x and the gradient of the encoder ϕ. The inset shows the pointwise loss
function used to train the network. The loss function encourages the network to minimize both the
autoencoder reconstruction error and the SINDy loss in z and x. L1 regularization on Ξ is also
included to encourage parsimonious dynamics

output pairings of temporal snapshot data as structured in (4.50). This provides a


significant potential improvement for learning time-stepping proxies to Galerkin
projected models.

4.5 Joint Discovery of Coordinates and Models

As a final example of ROM construction, we consider an architecture capable of


jointly and simultaneously learning coordinates and parsimonious dynamics. Specif-
ically, Champion et al. (2019) present a method (SINDy AE) for the simultaneous
discovery of sparse dynamical models (SINDy) and coordinates (autoencoders) that
enable these simple representations. The aim of the architecture is to leverage the
parsimony and interpretability of SINDy with the universal approximation capabil-
ities of deep neural networks to discover an appropriate coordinate system in which
to embed the dynamics. This can produce interpretable and generalizable models
capable of extrapolation and forecasting since the dynamical model is minimally
parameterized. The architecture is shown in Fig. 4.8 where an autoencoder is used

to embed the original data x into a new coordinate z amenable to a parsimonious


representation.
While in the original coordinate system, a dynamical model may be dense in
terms of functions of the original measurement coordinates x, this method determines
through an autoencoder a reduced coordinate system z(t) = ϕ(x(t)) ∈ Rd (d ≪ n)
where the following dynamical model holds:

dz(t)/dt = g(z(t)). (4.54)
Specifically, a parsimonious description of the dynamics is sought where g contains
only a few active terms from a SINDy library. Thus in addition to a dynamical
model, the method learns coordinate transforms ϕ, ψ that map the measurements
to intrinsic coordinates via z = ϕ(x) (encoder) and back via x ≈ ψ(z) (decoder).
The autoencoder is a flexible, feedforward neural network that allows one to
discover underlying low-dimensional coordinates in which to represent the data.
Thus, the layers of the autoencoder learn a latent representation of a new variable
in which to express the data, in this case, the dynamical evolution. The
network is trained to output an approximate reconstruction of its input, and the
restrictions placed on the network architecture (e.g. the type, number, and size of the
hidden layers) characterize the intrinsic coordinates (Goodfellow et al. 2016). The
autoencoder gives a nonlinear generalization of a PCA analysis (Baldi and Hornik
1989); thus, it goes beyond the standard linear POD subspace description.
Autoencoders can learn a low-dimensional representation in isolation without
need to specify any other constraints. Without further specifications, the intrin-
sic coordinates learned have no particular meaning or interpretation. However,
if additional constraints are imposed in the latent space, then additional struc-
ture and meaning can be built into the model. For the SINDy AE model, the
network is required to learn coordinates associated with parsimonious dynamics.
Thus, it integrates the sparse regression framework of SINDy in the latent space,
or intrinsic coordinates z. This constraint in the autoencoder provides a regular-
ization framework whereby model discovery is achieved by constructing a library
Θ(z) = [θ1 (z), θ2 (z), . . . , θ p (z)] of candidate basis functions, e.g. polynomials, and
learning a sparse set of coefficients Ξ = [ξ1 , . . . , ξd ] that defines the dynamical
system
dz(t)/dt = g(z(t)) = Θ(z(t))Ξ.
Typical of SINDy, the library is specified before training occurs, where library load-
ings (coefficients) Ξ are learned along with the autoencoder weights during training
(optimization). Importantly, the derivatives ẋ(t) of the original states are computed
in order to pass these along to the encoder variables as ż(t) = ∇x ϕ(x(t))ẋ(t). This
helps enforce accurate prediction of the dynamics by incorporating the loss function

Ldz/dt = ‖∇x ϕ(x)ẋ − Θ(ϕ(x)T )Ξ‖₂² . (4.55)

This term uses the typical SINDy regression along with the gradient of the
encoder to promote learning of a sparse dynamical model which accurately predicts
the time derivatives of the encoder variables. Additional loss terms require that the
SINDy predictions accurately reconstruct the time derivatives of the original data

Ldx/dt = ‖ẋ − (∇z ψ(ϕ(x))) (Θ(ϕ(x)T )Ξ)‖₂² . (4.56)

These loss terms (4.55) and (4.56) are added to the standard autoencoder loss

Lrecon = ‖x − ψ(ϕ(x))‖₂² ,

which ensures that the autoencoder can accurately reconstruct the original input data.
To help promote sparsity in the SINDy architecture, an ℓ1 regularization penalty is
included on the SINDy coefficients Ξ. This promotes a parsimonious model for the
dynamics by selecting only a small number of terms. The combination of the above
four terms gives the following overall loss function:

Lrecon + λ1 Ldx/dt + λ2 Ldz/dt + λ3 Lreg ,

where the hyperparameters λ1 , λ2 , λ3 determine the relative weighting of the three


terms in the loss function.
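
A hedged PyTorch sketch of assembling these four loss terms is given below; the encoder, decoder, library function theta, and coefficient matrix Xi are assumed to be defined elsewhere, and the Jacobian-vector products are only one possible way of evaluating the chain-rule terms in (4.55)-(4.56) (in a full implementation these derivatives are made differentiable with respect to the network weights, as in Champion et al. 2019).

import torch
from torch.autograd.functional import jvp

def sindy_ae_loss(x, x_dot, encoder, decoder, theta, Xi, lam1, lam2, lam3):
    # x, x_dot: measurements and their time derivatives (batch x n)
    z = encoder(x)
    x_rec = decoder(z)

    # z_dot = grad_x phi(x) x_dot via a Jacobian-vector product, cf. (4.55)
    _, z_dot = jvp(encoder, (x,), (x_dot,), create_graph=True)
    z_dot_sindy = theta(z) @ Xi                     # Theta(z) Xi

    # x_dot predicted by pushing the SINDy dynamics through the decoder, cf. (4.56)
    _, x_dot_sindy = jvp(decoder, (z,), (z_dot_sindy,), create_graph=True)

    L_recon = ((x - x_rec) ** 2).mean()             # autoencoder reconstruction
    L_dzdt = ((z_dot - z_dot_sindy) ** 2).mean()    # Eq. (4.55)
    L_dxdt = ((x_dot - x_dot_sindy) ** 2).mean()    # Eq. (4.56)
    L_reg = Xi.abs().sum()                          # l1 penalty on the SINDy coefficients

    return L_recon + lam1 * L_dxdt + lam2 * L_dzdt + lam3 * L_reg
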
The SINDy AE is attractive since it does not force the ROM to operate in the
subspace Ψ. Rather, the AE allows the discovery of a coordinate system whereby a
SINDy model can be expressed. This architecture can be used to also learn a linear
Koopman embedding of the data (Lusch et al. 2018; Gin et al. 2021). Moreover, the
same method can be lifted to evaluate boundary value problems (Gin et al. 2020).

4.6 Conclusions

Data-driven methods have emerged as an invaluable tool for aiding in the construction
of dynamic models and their representation. Demonstrated here are three emerging
paradigms for data-driven models, (i) dynamic mode decomposition, (ii) sparse iden-
tification of nonlinear dynamics, and (iii) neural networks. In each case, the goal is to
use these methods to construct a model for the dynamics of a(t). This is a data-driven
construction as opposed to a projection-based construction typical of ROMs (Benner
et al. 2015) when governing equations are already known.
Each of the data-driven constructions has advantages that can be leveraged by
practitioners. The DMD method is perhaps the simplest as it provides a regression to
a best-fit linear model. The linear model, which models a Koopman operator (Brun-
ton et al. 2021), is advantageous since solutions can be easily represented as a linear
combination of eigenvalues and eigenfunctions of the constructed linear operator.
The data requirements are minimal for the DMD approximation and there exists

open-source code, pyDMD (Demo et al. 2018), for producing DMD models. SINDy
requires more data, but it allows for a nonlinear representation of the dynamic evolu-
tion. SINDy is advantageous since it produces a parsimonious model of the evolution dynamics for
a(t) that is typically interpretable and amenable to analysis with tools from dynamical
systems theory. It also has open-source software available called pySINDy (de Silva
et al. 2020; Kaptanoglu et al. 2021). If sufficient data is available, a diversity of deep
learning algorithms are available for producing neural networks for modeling the
time evolution of a(t). Such algorithms have been shown to be successful in a diver-
sity of application areas. Moreover, deep learning can be structured, for instance, to
learn multiscale physics.
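
For orientation, typical usage of these open-source packages looks roughly like the following; the calls reflect the basic pyDMD and pySINDy interfaces and the data arrays are placeholders, so the package documentation should be consulted for the details and current APIs.

import numpy as np
from pydmd import DMD            # pip install pydmd
import pysindy as ps             # pip install pysindy

# DMD: best-fit linear model from a snapshot matrix (states x snapshots)
X = np.random.randn(100, 50)
dmd = DMD(svd_rank=10)
dmd.fit(X)
modes, eigs = dmd.modes, dmd.eigs

# SINDy: sparse nonlinear model dx/dt = f(x) from a time series
t = np.linspace(0, 10, 500)
x = np.column_stack([np.sin(t), np.cos(t)])
model = ps.SINDy()
model.fit(x, t=t)
model.print()
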
Overall, data-driven methods are providing significant improvement capabilities
for traditional ROMs. As these methods are developed further over the next decade,
it is anticipated that ROMs will be substantially improved in terms of computational
performance and accuracy. This has the potential to revolutionize digital twin tech-
nologies as these methods can use computational or measurement data to construct
proxy models that are accurate and efficient to simulate.

Acknowledgements The work of JNK was supported in part by the US National Science Founda-
tion (NSF) AI Institute for Dynamical Systems (dynamicsai.org), grant 2112085.

References

Ablowitz MJ, Segur H (1981) Solitons and the inverse scattering transform, vol 4. Siam
Alla A, Kutz JN (2017) Nonlinear model order reduction via dynamic mode decomposition. SIAM
J Sci Comput 39(5):B778–B796
Alla A, Kutz JN (2019) Randomized model order reduction. Adv Comput Math 45(3):1251–1271
Antoulas AC (2005) Approximation of large-scale dynamical systems. SIAM
Arbabi H, Mezić I (2017) Ergodic theory, dynamic mode decomposition, and computation of spectral
properties of the koopman operator. SIAM J Appl Dyn Syst 16(4):2096–2126
Askham T, Kutz JN (2018) Variable projection methods for an optimized dynamic mode decom-
position. SIAM J Appl Dyn Syst 17(1):380–416
Azencot O, Yin W, Bertozzi A (2019) Consistent dynamic mode decomposition. SIAM J Appl Dyn
Syst 18(3):1565–1585
Bagheri S (2013) Koopman-mode decomposition of the cylinder wake. J Fluid Mech 726:596–623
Bagheri S (2014) Effects of weak noise on oscillating flows: Linking quality factor, Floquet modes,
and Koopman spectrum. Phys Fluids 26(9):094104
Bai Z, Wimalajeewa T, Berger Z, Wang G, Glauser M, Varshney PK (2015) Low-dimensional
approach for reconstruction of airfoil data via compressive sensing. AIAA J 53(4):920–933
Baldi P, Hornik K (1989) Neural networks and principal component analysis: Learning from exam-
ples without local minima. Neural Netw 2(1):53–58
Baraniuk RG (2007) Compressive sensing. IEEE Signal Process Mag 24(4):118–120
Bearman PW (1969) On vortex shedding from a circular cylinder in the critical reynolds number
regime. J Fluid Mech 37(3):577–585
Benner P, Gugercin S, Willcox K (2015) A survey of projection-based model reduction methods
for parametric dynamical systems. SIAM Rev 57(4):483–531
Bongard J, Lipson H (2007) Automated reverse engineering of nonlinear dynamical systems. Proc
Natl Acad Sci 104(24):9943–9948

Brunton SL, Brunton BW, Proctor JL, Kaiser E, Kutz JN (2017) Chaos as an intermittently forced
linear system. Nat Commun 8(19):1–9
Brunton SL, Budišić M, Kaiser E, Kutz JN (2021) Modern Koopman theory for dynamical systems.
arXiv:2102.12086
Brunton SL, Kutz JN (2019) Data-driven science and engineering: machine learning, dynamical
systems, and control. Cambridge University Press
Brunton SL, Kutz JN, Manohar K, Aravkin AY, Morgansen K, Klemisch J, Goebel N, Buttrick
J, Poskin J, Blom-Schieber A et al (2020) Data-driven aerospace engineering: Reframing the
industry with machine learning. arXiv:2008.10740
Brunton SL, Proctor JL, Tu JH, Kutz JN (2015) Compressive sampling and dynamic mode decom-
position. To appear in the J Comput Dyn. Available: arXiv:1312.5186
Brunton SL, Proctor JL, Kutz JN (2016) Discovering governing equations from data by sparse
identification of nonlinear dynamical systems. Proc Natl Acad Sci 113(15):3932–3937
Brunton SL, Tu JH, Bright I, Kutz JN (2014) Compressive sensing and low-rank libraries for
classification of bifurcation regimes in nonlinear dynamical systems. SIAM J Appl Dyn Syst
13(4):1716–1732
Brunton BW, Johnson LA, Ojemann JG, Kutz JN (2016) Extracting spatial-temporal coherent
patterns in large-scale neural recordings using dynamic mode decomposition. J Neurosci Methods
258:1–15
Burgers JM (1948) A mathematical model illustrating the theory of turbulence. Adv Appl Mech
1:171–199
Candès EJ (2006) Compressive sensing. In: Proceedings of the international congress of mathemat-
ics
Candès EJ, Romberg J, Tao T (2006) Stable signal recovery from incomplete and inaccurate measurements. Commun Pure Appl Math 59(8):1207–1223
Candès EJ, Tao T (2006) Near optimal signal recovery from random projections: Universal encoding
strategies? IEEE Trans Inf Theory 52(12):5406–5425
Candès EJ, Romberg J, Tao T (2006) Robust uncertainty principles: exact signal reconstruction
from highly incomplete frequency information. IEEE Trans Inf Theory 52(2):489–509
Carlberg K, Barone M, Anti H (2017) Galerkin v. least-squares petrov-galerkin projection in non-
linear model reduction. J Comput Phys 330:693–734
Champion K, Lusch B, Kutz JN, Brunton SL (2019) Data-driven discovery of coordinates and
governing equations. Proc Natl Acad Sci 116(45):22445–22451
Champion K, Zheng P, Aravkin AY, Brunton SL, Kutz JN (2020) A unified sparse optimization
framework to learn parsimonious physics-informed models from data. IEEE Access 8:169259–
169271
Chen KK, Tu JH, Rowley CW (2012) Variants of dynamic mode decomposition: Boundary condi-
tion, Koopman, and Fourier analyses. J Nonlinear Sci 22(6):887–915
Chen B, Huang K, Raghupathi S, Chandratreya I, Du Q, Lipson H (2022) Automated discovery of
fundamental variables hidden in experimental data. Nat Comput Sci 2(7):433–442
Champion KP, Brunton SL, Kutz JN (2019) Discovery of nonlinear multiscale systems: Sampling
strategies and embeddings. SIAM J Appl Dyn Syst 18(1):312–333
Cole JD (1951) On a quasi-linear parabolic equation occurring in aerodynamics. Quart Appl Math
9:225–236
Courant R, Hilbert D (2008) Methods of mathematical physics: partial differential equations. Wiley,
New York
Dam M, Brøns M, Rasmussen JJ, Naulin V, Hesthaven JS (2017) Sparse identification of a predator-
prey system from simulation data of a convection model. Phys Plasmas 24(2):022310
Dawson STM, Hemati MS, Williams MO, Rowley CW (2016) Characterizing and correcting for
the effect of sensor noise in the dynamic mode decomposition. Exp Fluids 57(3):1–19
Demo N, Tezzele M, Rozza G (2018) Pydmd: Python dynamic mode decomposition. J Open Source
Softw 3(22):530

de Silva BM, Champion K, Quade M, Loiseau J-C, Kutz JN, Brunton SL (2020) Pysindy: a python
package for the sparse identification of nonlinear dynamics from data. arXiv:2004.08424
Donoho DL (2006) Compressed sensing. IEEE Trans Inf Theory 52(4):1289–1306
Duke D, Soria J, Honnery D (2012) An error analysis of the dynamic mode decomposition. Exp
Fluids 52(2):529–542
Erichson NB, Brunton SL, Kutz JN (2016) Compressed dynamic mode decomposition for real-time
object detection. J Real-Time Image Process
Erichson NB, Voronin S, Brunton SL, Kutz JN (2019) Randomized matrix decompositions using
R. J Stat Softw 89(11):1–48
Fasel U, Kutz JN, Brunton BW, Brunton SL (2021) Ensemble-sindy: Robust sparse model discovery
in the low-data, high-noise limit, with active learning and control. arXiv:2111.10992
Gin CR, Shea DE, Brunton SL, Kutz JN (2020) DeepGreen: Deep learning of Green’s functions
for nonlinear boundary value problems. arXiv:2101.07206
Gin C, Lusch B, Brunton SL, Kutz JN (2021) Deep learning models for global coordinate transfor-
mations that linearise pdes. Eur J Appl Math 32(3):515–539
Gonzalez-Garcia R, Rico-Martinez R, Kevrekidis IG (1998) Identification of distributed parameter
systems: A neural net based approach. Comput Chem Eng 22:S965–S968
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press
Grosek J, Kutz JN (2014) Dynamic mode decomposition for real-time background/foreground
separation in video. arXiv:1404.7592
Haberman R (1983) Elementary applied partial differential equations, vol 987. Prentice Hall Engle-
wood Cliffs, NJ
Hemati MS, Rowley CW, Deem EA, Cattafesta LN (2017) De-biasing the dynamic mode decom-
position for applied Koopman spectral analysis. Theor Comput Fluid Dyn 31(4):349–368
Hesthaven JS, Rozza G, Stamm B et al (2016) Certified reduced basis methods for parametrized
partial differential equations, vol 590. Springer, Berlin
Hirsh SM, Barajas-Solano DA, Kutz JN (2021) Sparsifying priors for bayesian uncertainty quan-
tification in model discovery. arXiv:2107.02107
Hirsh SM, Ichinaga SM, Brunton SL, Kutz JN, Brunton BW (2021) Structured time-delay models
for dynamical systems with connections to frenet-serret frame. arXiv:2101.08344
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Hoffmann M, Fröhner C, Noé F (2019) Reactive sindy: Discovering governing reactions from
concentration data. J Cheml Phys 150(2):025101
Holmes P, Lumley JL, Berkooz G, Rowley CW (2012) Turbulence, coherent structures, dynamical
systems and symmetry. Cambridge university Press
Hopf E (1950) The partial differential equation u t + uu x = μu x x . Commun Pure App Math 3:201–
230
Horrocks J, Bauch CT (2020) Algorithmic discovery of dynamic models from infectious disease
data. Sci Rep 10(1):1–18
Jovanović MR, Schmid PJ, Nichols JW (2014) Sparsity-promoting dynamic mode decomposition.
Phys Fluids 26(2):024103
Kaheman K, Brunton SL, Kutz JN (2020) Automatic differentiation to simultaneously identify
nonlinear dynamics and extract noise probability distributions from data. arXiv:2009.08810
Kaheman K, Kutz JN, Brunton SL (2020) Sindy-pi: A robust algorithm for parallel implicit sparse
identification of nonlinear dynamics. arXiv:2004.02322
Kamb M, Kaiser E, Brunton SL, Kutz JN (2020) Time-delay observables for Koopman: Theory
and applications. SIAM J Appl Dyn Syst 19(2):886–917
Kaptanoglu AA, de Silva BM, Fasel U, Kaheman K, Callaham JL, Delahunt CB, Champion K,
Loiseau J-C, Kutz JN, Brunton SL (2021) Pysindy: A comprehensive python package for robust
sparse system identification. arXiv:2111.08481
Kaptanoglu AA, Morgan KD, Hansen CJ, Brunton SL (2020) Characterizing magnetized plasmas
with dynamic mode decomposition. Phys Plasmas 27:032108
Keener JP (2018) Principles of applied mathematics: transformation and approximation. CRC Press

Kutz JN (2013) Data-driven modeling and scientific computation: methods for complex systems
and big data. Oxford University Press
Kutz JN (2020) Advanced differential equations: Asymptotics and perturbations. arXiv:2012.14591
Kutz JN, Brunton SL, Brunton BW, Proctor JL (2016) Dynamic mode decomposition: data-driven
modeling of complex systems. SIAM
Kutz JN, Fu X, Brunton SL (2016) Multiresolution dynamic mode decomposition. SIAM J Appl
Dyn Syst 15(2):713–735
Kontolati K, Goswami S, Shields MD, Em Karniadakis G (2022) On the influence of over-
parameterization in manifold based surrogates and deep neural operators. arXiv:2203.05071
Lange H, Brunton SL, Kutz N (2020) From Fourier to Koopman: Spectral methods for long-term
time series prediction. arXiv:2004.00574
Li S, Kaiser E, Laima S, Li H, Brunton SL, Kutz JN (2019) Discovering time-varying aerodynam-
ics of a prototype bridge by sparse identification of nonlinear dynamical systems. Phys Rev E
100(2):022220
Liu Y, Kutz JN, Brunton SL (2020) Hierarchical deep learning of multiscale differential equation
time-steppers. arXiv:2008.09768
Lusch B, Kutz JN, Brunton SL (2018) Deep learning for universal linear embeddings of nonlinear
dynamics. Nat Commun 9(1):4950
Mackey A, Schaeffer H, Osher S (2014) On the compressive spectral method. Multiscale Model
Simul 12(4):1800–1827
Mamakoukas G, Castano M, Tan X, Murphey T (2019) Local Koopman operators for data-driven
control of robotic systems. In: Proceedings of “Robotics: Science and Systems 2019”, Freiburg
im Breisgau. IEEE
Mamakoukas G, Castano M, Tan X, Murphey T (2020) Derivative-based Koopman operators for
real-time control of robotic systems. arXiv:2010.05778
Mangan NM, Brunton SL, Proctor JL, Kutz JN (2016) Inferring biological networks by sparse
identification of nonlinear dynamics. IEEE Trans Mol, Biol Multi-Scale Commun 2(1):52–63
Mann J, Kutz JN (2016) Dynamic mode decomposition for financial trading strategies. In: Quanti-
tative finance, pp 1–13
Noack BR, Afanasiev K, Morzynski M, Tadmor G, Thiele F (2003) A hierarchy of low-dimensional
models for the transient and post-transient cylinder wake. J Fluid Mech 497:335–363
Noack BR, Morzynski M, Tadmor G (2011) Reduced-order modelling for flow control, vol 528.
Springer Science & Business Media
Ozoliņš V, Lai R, Caflisch R, Osher S (2013) Compressed modes for variational problems in
mathematics and physics. Proc Natl Acad Sci 110(46):18368–18373
Parish EJ, Carlberg KT (2020) Time-series machine-learning error models for approximate solutions
to parameterized dynamical systems. Comput Methods Appl Mech Eng 365:112990
Proctor JL, Brunton SL, Brunton BW, Kutz JN (2014) Exploiting sparsity and equation-free archi-
tectures in complex systems. Eur Phys J Spec Top 223(13):2665–2684
Proctor JL, Brunton SL, Kutz JN (2016) Dynamic mode decomposition with control. SIAM J Appl
Dyn Syst 15(1):142–161
Proctor JL, Eckhoff PA (2015) Discovering dynamic patterns from infectious disease data using
dynamic mode decomposition. Int Health 7(2):139–145
Qin T, Wu K, Xiu D (2019) Data driven governing equations approximation using deep neural
networks. J Comput Phys 395:620–635
Quarteroni A, Manzoni A, Negri F (2015) Reduced basis methods for partial differential equations:
an introduction, vol 92. Springer, Berlin
Raissi M, Em Karniadakis G (2018) Hidden physics models: Machine learning of nonlinear partial
differential equations. J Comput Phys 357:125–141
Regazzoni F, Dede L, Quarteroni A (2019) Machine learning for fast and reliable solution of time-
dependent differential equations. J Comput Phys 397:108852
Rowley CW, Mezić I, Bagheri S, Schlatter P, Henningson DS (2009) Spectral analysis of nonlinear
flows. J Fluid Mech 645:115–127

Rudy SH, Brunton SL, Proctor JL, Kutz JN (2017) Data-driven discovery of partial differential
equations. Sci Adv 3(4):e1602614
Rudy S, Alla A, Brunton SL, Kutz JN (2019) Data-driven identification of parametric partial dif-
ferential equations. SIAM J Appl Dyn Syst 18(2):643–660
Sashidhar D, Kutz JN (2021) Bagging, optimized dynamic mode decomposition (bop-dmd) for
robust, stable forecasting with spatial and temporal uncertainty-quantification. arXiv:2107.10878
Scherl I, Strom B, Shang JK, Williams O, Polagye BL, Brunton SL (2020) Robust principal com-
ponent analysis for particle image velocimetry. Phys Rev Fluids 5(054401)
Schmid PJ (2010) Dynamic mode decomposition of numerical and experimental data. J Fluid Mech
656:5–28
Schmid PJ, Sesterhenn J (2008) Dynamic mode decomposition of numerical and experimental data.
In: 61st annual meeting of the APS division of fluid dynamics. American Physical Society
Schmidt M, Lipson H (2009) Distilling free-form natural laws from experimental data. Science
324(5923):81–85
Sorokina M, Sygletos S, Turitsyn S (2016) Sparse identification for nonlinear optical communication
systems: Sino method. Opt Express 24(26):30433–30443
Susuki Y, Mezić I, Hikihara T (2009) Coherent dynamics and instability of power grids.
repository.kulib.kyoto-u.ac.jp
Susuki Y, Mezic I (2011) Nonlinear Koopman modes and coherency identification of coupled swing
dynamics. IEEE Trans Power Syst 26(4):1894–1904
Takeishi N, Kawahara Y, Yairi T (2017) Subspace dynamic mode decomposition for stochastic
Koopman analysis. Phys Rev E 96(3):033310
Taylor R, Kutz JN, Morgan K, Nelson BA (2018) Dynamic mode decomposition for plasma diag-
nostics and validation. Rev Sci Instrum 89(5):053501
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc B, pp 267–288
Tran G, Ward R (2017) Exact recovery of chaotic systems from highly corrupted data. Multiscale
Model Simul 15(3):1108–1129
Tropp JA, Gilbert AC (2007) Signal recovery from random measurements via orthogonal matching
pursuit. IEEE Trans Inf Theory 53(12):4655–4666
Tu JH, Rowley CW, Luchtenburg DM, Brunton SL, Kutz JN (2014) On dynamic mode decompo-
sition: theory and applications. J Comput Dyn 1(2):391–421
Wang W, Yang R, Lai YC, Kovanis V, Grebogi C (2011) Predicting catastrophes in nonlinear
dynamical systems by compressive sensing. Phys Rev Lett 106(15):154101
Wiggins S (2003) Introduction to applied nonlinear dynamical systems and chaos, vol 2. Springer
Science & Business Media
Yang Y, Bhouri MA, Perdikaris P (2020) Bayesian differential programming for robust systems
identification under uncertainty. arXiv:2004.06843
Chapter 5
Physics-Informed Neural Networks:
Theory and Applications

Cosmin Anitescu, Burak İsmail Ateş, and Timon Rabczuk

5.1 Introduction

Machine learning (ML) methods based on artificial neural networks (ANNs) have
become increasingly used, particularly in data-rich fields such as text, image, and
audio processing, where they have achieved remarkable results, greatly surpassing
the previous state-of-the-art algorithms. Typically, ML methods are most efficient
in applications where the patterns are difficult to describe by clear-cut rules, such
as handwriting recognition. In these cases, it may be more efficient to generate the
rules by a kind of high-dimensional regression between a sufficiently large number
of input–output pairs. However, other techniques based on ANNs have also been suc-
cessful in domains where the rules are relatively easy to describe, such as AlphaZero
(Silver et al. 2017) for game playing and AlphaFold (Jumper et al. 2021) for protein
folding. Many of these advancements have been driven by an increase in compu-
tational capabilities, in particular with regard to Graphics Processing Units (GPUs)
and Tensor Processing Units (TPUs) (Jouppi et al. 2017), but also by theoretical
advances related to the initialization and architecture of the ANNs. In the scientific
community, there has also been increased interest in applying the new developments
in ANNs and ML to solve partial differential equations (PDEs) and other engineering
problems of interest.

C. Anitescu · T. Rabczuk (B)


Institut für Strukturmechanik, Bauhaus-Universität Weimar, Weimar, Germany
e-mail: [email protected]
C. Anitescu
e-mail: [email protected]
B. İsmail Ateş
School of Engineering and Design, Technical University of Munich, Munich, Germany
e-mail: [email protected]


One can distinguish between supervised, unsupervised, and reinforcement learn-


ing. In the first, the aim is to find the mapping between a set of inputs and outputs,
such as images of hand-written digits and the actual digit they represent, so that when
a new input is presented, the correct output can be predicted by the ML algorithm.
A prerequisite for the application of these methods is the availability of labeled
data. In engineering applications, such approaches can be used e.g. for predicting
the solution from the boundary conditions for a given PDE based on a large set of
inputs/solutions pairs of similar problems; see also operator-approximation meth-
ods (Li et al. 2020a, b; Lu et al. 2021). However, a drawback is the requirement for
possibly large amounts of labeled data (i.e. solved examples) drawn from the same
distribution as the problems that we like to solve in the first place. On the contrary, in
unsupervised learning the algorithm aims to find patterns in the input data to produce
useful output based on some hard-coded rules or objectives. In classical ML, such
tasks include image segmentation, dimensionality reduction (such as principal com-
ponent analysis or PCA), or different types of clustering (grouping unlabeled data
based on similarities or differences). Furthermore, there is a middle ground category
of semi-supervised learning, where a mixture of labeled and unlabeled data is used in
an attempt to overcome some of the shortcomings of the first two categories. Related
to this is the concept of reinforcement learning, where an agent-based system seeks
to learn the actions that maximize a reward function.
Physics-informed neural networks (PINNs) are more closely related to the unsu-
pervised or semi-supervised learning, whereby satisfying the governing equations,
including the boundary conditions, at a given set of collocation points defines the
objective function. This idea was originally proposed during the 1990s in Lagaris et al.
(1998), Lagaris et al. (1997) and further extended for domains with irregular bound-
aries in Lagaris et al. (2000). As the cost of training neural networks became cheaper,
further developments have been first reported in Raissi et al. (2019), Sirignano and
Spiliopoulos (2018) among others, including the extension to time-dependent prob-
lems and model parameter inference (i.e. inverse problems). In Raissi et al. (2019),
the term PINN is first used, along with the concept of combining (possibly noisy)
experimental data with the governing equation in a small data or semi-supervised
setting. Since then, several improvements have been suggested, such as adaptively
choosing the collocation points (Anitescu et al. 2019; Wight and Zhao 2020), varia-
tional formulations (Yu et al. 2018; Samaniego et al. 2020; Kharazmi et al. 2019), and
domain decomposition approaches (Shukla et al. 2021). Moreover, PINNs have been
applied to a wide variety of problems, such as hyperelasticity (Nguyen-Thanh et al.
2020), multiphase poroelasticity (Haghighat et al. 2022), Kirchhoff plates (Zhuang
et al. 2021), eikonal equation (bin Waheed et al. 2021), biophysics (Kissas et al.
2020), quantum chemistry (Pfau et al. 2020), materials science (Shukla et al. 2020)
and others.
In this chapter, we give a concise overview of the main ideas of PINNs, focusing
on the implementation and potential applications to forward and inverse PDEs. In
Sect. 5.2, we introduce the building blocks required to create and train a neural net-
work model, while, in Sect. 5.3, we present the collocation and energy minimization
approaches, along with a discussion of enforcing the boundary conditions. In Sect.

5.4, we present some numerical examples, focusing on some pedagogical examples


of standard PINNs for problems that are feasible to compute on regular desktops or
even mobile computers, followed by some concluding remarks in Sect. 5.5.

5.2 Basics of Artificial Neural Networks

An artificial neural network (ANN) is loosely modeled after the structure of the
brain, which is made up of a large number of cells (neurons) which communicate
with their neighbors through electrical signals. Mathematically, an ANN can be
seen as a function u N N : Rn → Rm , which maps n inputs into m outputs. An ANN
is a universal function approximator (Hornik et al. 1989). Therefore, u N N can be
used to interpolate some unknown function from the data given at certain points or to
approximate the solution of a partial differential equation. The function u N N depends
on a collection of parameters (called trainable parameters) which are obtained by
an optimization procedure with the goal of minimizing some user-defined objective
or loss function.
In an ANN, the neurons, or computational units, are organized in layers which are
connected by composition with an activation function as detailed below. Different
types of layers (and activation functions) can be assembled together, according to the
application and the information known about the function to be approximated. There
are several types of ANNs, which include fully connected feed-forward networks,
convolutional neural networks (CNNs), recurrent neural networks (RNNs), residual
neural networks (ResNets), transformers, and others. In the following, we will focus
mostly on the feed-forward neural networks which are among the simplest and can
also be used as building blocks of more complicated architectures.

5.2.1 Feed-Forward Neural Networks

In this type of network, also called multi-layer perceptron (MLP), the output is
obtained by successive compositions of a linear transformation and a nonlinear acti-
vation function. The network consists of an input layer, an output layer, and any
number of intermediate hidden layers. The function u N N for a network with an n-
dimensional input, and m-dimensional output and k hidden layers can be written
as
u N N = L k ◦ L k−1 ◦ . . . ◦ L 0 (5.1)

with
L i (xi ) = σi (Wi xi + bi ) = xi+1 for i = 0, . . . , k. (5.2)

Here, Wi are matrices of size m i × n i , with n 0 = n, n i+1 = m i , and m k = m, xi+1


and bi are column vectors of size m i , and the activation functions σi are applied

Fig. 5.1 A fully connected feed-forward neural network with the input, hidden, and output layers

element-wise to the vectors Wi xi + bi . The entries of the matrices Wi are called


weights and those of the vectors bi are called biases, and together they represent
the trainable parameters of the neural network. For k > 0, the values m 0 , …, m k−1
can be chosen freely and represent the number of neurons in each hidden layer. If
the number of hidden layers k > 1, we say that u N N is a deep neural network. A
schematic of a feed-forward network with 3 neurons in the input layer, two hidden
layers with 4 and 5 neurons, respectively, and an output layer consisting of 2 neurons
is shown in Fig. 5.1.
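
A minimal NumPy sketch of the forward evaluation in (5.1)-(5.2), using the layer sizes of Fig. 5.1 as an example, is given below; the random initialization and the variable names are illustrative only.

import numpy as np

def dense_layer(n_in, n_out, rng):
    # weights W_i (m_i x n_i) and biases b_i (m_i), cf. Eq. (5.2)
    return rng.standard_normal((n_out, n_in)), np.zeros((n_out, 1))

def u_nn(x, params, activations):
    # successive compositions L_k o ... o L_0, cf. Eq. (5.1)
    for (W, b), sigma in zip(params, activations):
        x = sigma(W @ x + b)
    return x

rng = np.random.default_rng(0)
params = [dense_layer(3, 4, rng), dense_layer(4, 5, rng), dense_layer(5, 2, rng)]
activations = [np.tanh, np.tanh, lambda x: x]   # linear activation in the output layer
y = u_nn(np.ones((3, 1)), params, activations)  # maps an input in R^3 to an output in R^2
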
In a typical application, many inputs are collected in a batch and evaluated
together. Evaluating the output of the neural network involves mainly linear algebra
operations (such as matrix and vector products) which can be easily parallelized.
In machine learning frameworks, such as TensorFlow (Abadi et al. 2015), PyTorch
(Paszke et al. 2019), or JAX (Bradbury et al. 2018), a computational graph is built
to record the different operations. This allows for efficient evaluation and also for
computing the gradients by automatic differentiation methods as will be detailed in
Sect. 5.2.3.

5.2.2 Activation Functions

Several types of activation functions can be considered depending on the task at


hand. We will briefly describe a few popular ones in the following subsections.

5.2.2.1 Linear Activation

The simplest activation function is the linear activation, which means that σ is simply
the identity function:

σ (x) = x. (5.3)

On a network with no hidden layers, a linear activation function between the input
and output layers can be used to perform a linear regression between the input and
output data. For networks with one or more hidden layers, stacking linear layers is not
useful since a composition of linear activations is still linear. However, linear layers
can be combined with other nonlinear activation functions. For example, linear layers
can be used as the last layer to scale the output to arbitrary values. A non-trainable
linear transformation is often used to normalize the input of a network to speed up
the training (optimization) process, as will be detailed in Sect. 5.2.3.3.

5.2.2.2 Rectified Linear Units

One of the simplest nonlinear activation functions is the piece-wise linear rectified
linear unit (ReLU) function, defined as

σ (x) = max(0, x). (5.4)

It can easily be seen that a single hidden layer with ReLU activation, followed by
a linear activation layer, can approximate exactly piecewise linear functions in one
dimension (Samaniego et al. 2020). Indeed, on a grid with nodes x0 < x1 < . . . < xn ,
the finite element linear hat function Ni (x) can be written as

Ni (x) = (1/h i ) ReLU(x − xi−1 ) − (1/h i + 1/h i+1 ) ReLU(x − xi ) + (1/h i+1 ) ReLU(x − xi+1 ) (5.5)
where h i = xi − xi−1 . This observation can be extended to higher dimensions, where
two hidden layers are enough to approximate piecewise linear simplex elements in
two and more dimensions (He et al. 2020). Further, error bounds for the approxima-
tion of ReLU networks in Sobolev norms are given, e.g. in Petersen and Voigtlaender
(2018), Gühring et al. (2020).
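
The identity (5.5) is easy to check numerically; a short sketch (the grid values are chosen arbitrarily) reads:

import numpy as np

relu = lambda x: np.maximum(0.0, x)

def hat(x, x_im1, x_i, x_ip1):
    # linear finite-element hat function N_i built from three ReLUs, Eq. (5.5)
    h_i, h_ip1 = x_i - x_im1, x_ip1 - x_i
    return (relu(x - x_im1) / h_i
            - (1.0 / h_i + 1.0 / h_ip1) * relu(x - x_i)
            + relu(x - x_ip1) / h_ip1)

x = np.linspace(0.0, 1.0, 101)
N = hat(x, 0.3, 0.5, 0.8)   # equals 1 at x = 0.5 and vanishes outside (0.3, 0.8)
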

5.2.2.3 Sigmoid

The sigmoid activation, also known as the logistic function, is defined as

σ (x) = 1/(1 + exp(−x)). (5.6)

This function has a S-shaped form, as shown in Fig. 5.2b. The range of this function
is the interval (0, 1), therefore it is often used in the output layer of neural networks
used for binary classification tasks, where the output is a probability that the input

Fig. 5.2 Commonly used non-linear activation functions: a ReLU, b Sigmoid, c Tanh, d Swish

belongs to a given class. The function is also differentiable infinitely many times,
resulting in a smooth approximation which is desirable for many applications.

5.2.2.4 Hyperbolic Tangent

The hyperbolic tangent activation function is defined as

tanh(x) = (exp(x) − exp(−x))/(exp(x) + exp(−x)). (5.7)

This function looks similar to the sigmoid activation, maintaining the overall S-shape
and smoothness. An important difference is that the range of the outputs is (−1, 1)
which is centered at 0. This makes the tanh activation more suitable for deep networks
without creating a bias toward positive outputs.

5.2.2.5 Swish

The swish activation function is defined as

swish(x) = x/(1 + exp(−x)) = x · σ (x), (5.8)

where σ (x) is the sigmoid activation. The plot of this function is shown in Fig. 5.2d.
The swish function looks similar to the ReLU activation. However, like sigmoid and
tanh, it is infinitely differentiable.
We note that there are several other activations that have been proposed which are
similar to ReLU and swish, such as Leaky ReLU (Maas et al. 2013), exponential linear
units (ELUs) (Clevert et al. 2015), Gaussian error linear units (GELUs) (Hendrycks
and Gimpel 2016), Mish (Misra 2019), and others. These have been shown to remedy
some of the drawbacks of the previously considered activation functions and provide
a modest improvement on some machine learning tasks, particularly related to image-
based classification and segmentation tasks (Li et al. 2021). However, from the point
of view of function approximation where partial derivatives are involved, tanh or
swish are also well-suited due to their smoothness properties.

5.2.2.6 Adaptive Activation Functions

In addition to the standard activation functions, which are fixed at each layer, the
so-called adaptive activations have been proposed which depend on some model-
dependent or trainable parameters. In particular, for a given activation function σ (x),
we can define the adaptive version by

σa (x) = σ (ax). (5.9)

The idea of using trainable parameters in the activation function was proposed in
Agostinelli et al. (2014), and further developed in the context of function and PDE
solution approximation in Jagtap et al. (2020b, a, 2022), Shukla et al. (2020) among
others. Some adaptive or trainable activation functions have a different form, for
example, the original Swish activation proposed in Ramachandran et al. (2017) is of
the form:

σβ (x) = x/(1 + exp(−βx)), (5.10)

where β is either a trainable or user-defined parameter. In some cases, using an adap-


tive activation can improve the results on classification tasks by a modest amount,
usually an increase of 0.5–2% in the accuracy (Apicella et al. 2021). A similar
improvement can be seen for function approximation, although the overall complex-
ity of the architecture is increased.
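
A sketch of an adaptive activation of the form (5.10) with a trainable parameter, written as a PyTorch module, could read as follows; the initial value and the surrounding network are assumptions for illustration.

import torch
import torch.nn as nn

class AdaptiveSwish(nn.Module):
    # Swish activation x * sigmoid(beta * x) with a trainable beta, cf. Eq. (5.10)
    def __init__(self, beta_init=1.0):
        super().__init__()
        self.beta = nn.Parameter(torch.tensor(float(beta_init)))

    def forward(self, x):
        return x * torch.sigmoid(self.beta * x)

# beta is optimized together with the weights and biases during training
net = nn.Sequential(nn.Linear(2, 16), AdaptiveSwish(), nn.Linear(16, 1))
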

5.2.3 Training

As mentioned earlier, the training process involves optimizing the network param-
eters (weights and biases) such that an objective function is minimized. Suppose
the loss function is denoted by L(u N N (x; θ )), where u N N is the neural network and
θ represents the trainable parameters, e.g. the matrices Wi and vectors bi in (5.2).
In the case of regression, a commonly used loss function is the mean square error,
defined as

L M S E (u N N (x j ; θ )) = (1/N ) Σ j=1,...,N |u N N (x j ) − y j |2 , (5.11)

where x j , j = 1, . . . , N are input points at which the ground truth output values y j
are known. For the case of PDE approximations, more complicated loss functions
which contain the partial derivatives of u N N with respect to the inputs can be devised.
Additional terms can be used to incorporate the governing equations and boundary
conditions, as will be detailed in Sect. 5.3. Then the process of training a neural
network can be described as

Find θ ∗ = arg minθ L(u N N (x; θ )). (5.12)

We note that since L(u N N (x; θ )) is usually based on the evaluation of u N N or its
derivatives at a finite number of points (called training points); therefore, θ ∗ will
depend in general on number and location of these points. A careful choice of select-
ing L and a proper weighting between its terms is therefore key to ensuring that the
training is successful and the output generalizes well to new inputs.

5.2.3.1 Forward and Back Propagation

Finding the optimal weights is usually done by gradient-based methods, such as gra-
dient descent. Parallelization and automatic differentiation methods are key ingredi-
ents in efficient implementations. The optimization method requires the gradients of
a possibly large number of trainable parameters, with many networks containing tens
of millions of parameters. Some are even larger, for example, the GPT-3 language
model uses 175 billion parameters (Floridi and Chiriatti 2020). Therefore, reverse-
mode differentiation, also known as back-propagation (Rumelhart et al. 1986), is
commonly used to compute the gradients with respect to the trainable parameters.
The differentiation process involves a forward pass, during which the neural net-
work output and the loss function are evaluated from a given input, and the operations
involved are recorded in a graph. Then the derivatives are computed in reverse order
of the evaluation, with the intermediate results obtained from the chain rule stored at
the graph nodes (see e.g. Sect. 6.5 in Goodfellow et al. 2016 for details). The remark-
able outcome of this procedure is that the partial derivatives of the loss function with

respect to all the parameters can be evaluated at a cost that is proportional to the
number of floating points operations involved in the forward evaluation.
Using forward mode differentiation, where the partial derivatives are computed in
the order of evaluation, would result in a much higher cost that is also proportional to
the number of parameters, although the memory requirements may be lower (López
et al. 2021). In general, evaluating the partial derivatives (Jacobian) of a function
f : Rn → Rm requires O(n) operations in forward mode, and O(m) operations in
reverse mode. In the context of PDEs, forward mode differentiation may be more
efficient when computing the partial derivatives of the outputs with respect to the
input coordinates, particularly for multiphysics models or other coupled problems
where several solution fields are considered.
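
The cost asymmetry between the two differentiation modes can be illustrated with JAX, which exposes both through jacrev and jacfwd; the function and the problem sizes below are illustrative only.

import jax
import jax.numpy as jnp

# A scalar loss of many parameters: f maps R^n -> R (n large, m = 1)
def f(theta):
    return jnp.sum(jnp.tanh(theta) ** 2)

theta = jnp.ones(1000)

grad_rev = jax.jacrev(f)(theta)   # reverse mode: one backward pass gives the full gradient
grad_fwd = jax.jacfwd(f)(theta)   # forward mode: one pass per input direction, costly here

# Forward mode becomes attractive for u: R^d -> R^m with few inputs d (e.g. spatial
# coordinates) and several output fields, as in coupled multiphysics problems.
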

5.2.3.2 Network Initialization

When initializing the training process, particular care is needed for the selection of
the initial value. For example, if all the weights and biases are set to zero, then the
gradients with respect to the weights within a layer will have the same value. In a
gradient descent update with a fixed step size, all the parameters will be updated by
the same amount, resulting in the equivalent of a network with a single neuron per
layer. Part of the recent success of deep neural networks in applications is owed to
better techniques for initializing the values of the network parameters, such as Glorot
(Xavier) (Glorot and Bengio 2010) and He et al. (2015) initialization.
While the initialization method can be seen as a hyperparameter which can be
tuned according to the problem at hand, a commonly used one is Glorot uniform,
where the weights are chosen from a uniform distribution U [−l, l], where

l = √(6/(n in + n out )), (5.13)

with n in and n out being the number of input and output neurons for a given layer.
The biases are initialized to zero. This is also the default initialization used in the
TensorFlow deep learning framework.
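
A small NumPy sketch of this initialization (the layer sizes are chosen arbitrarily) is:

import numpy as np

def glorot_uniform(n_in, n_out, rng=None):
    # Glorot (Xavier) uniform initialization with limit l from Eq. (5.13); biases start at zero
    rng = rng or np.random.default_rng()
    limit = np.sqrt(6.0 / (n_in + n_out))
    W = rng.uniform(-limit, limit, size=(n_out, n_in))
    b = np.zeros(n_out)
    return W, b

W, b = glorot_uniform(64, 64)
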

5.2.3.3 Data Normalization

It can be observed that the nonlinear region of most activation functions σ (x), such
as the ones in Fig. 5.2, is centered in a small interval around x = 0. Therefore, if the
input data is in a region far away from the origin, then the activation will be mostly
constant or linear, which will hinder the performance of gradient descent methods
(see also Sect. 5.2.4.2). To remedy this issue, it is essential to perform a normalization
on the input data, which is just a linear transformation into the interval [−1, 1]. In
particular, for each input neuron, the transformation is given by the formula:

Tnor m (x) = 2 · (x − xmin )/(xmax − xmin ) − 1, (5.14)

where xmax and xmin are the maximum and minimum input values, respectively. In
the case where the input values are points in the computational domain, then xmin
and xmax represent the bounding box of the domain. These values must be fixed for
training and testing; otherwise, incorrect results will be obtained.
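
In code, the transformation (5.14) is applied componentwise; the bounding-box values in the example below are illustrative.

import numpy as np

def normalize(x, x_min, x_max):
    # map each input component to [-1, 1], cf. Eq. (5.14)
    return 2.0 * (x - x_min) / (x_max - x_min) - 1.0

# Example: points in the domain [0, 2] x [0, 5]
pts = np.random.rand(100, 2) * np.array([2.0, 5.0])
pts_normalized = normalize(pts, np.array([0.0, 0.0]), np.array([2.0, 5.0]))
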

5.2.4 Testing and Validation

After a neural network is trained, it is expected to output useful results. However, in


most cases, it is not feasible to train the network indefinitely or until the loss function
stops decreasing (up to machine precision). Moreover, the number of training points
and the number of layers and neurons must be correlated in the sense that, for optimal
results, a larger number of parameters require a larger number of training points to
avoid overfitting. The performance of the network is then measured by testing and
validating the output.
In standard machine learning tasks, it is common to partition the available data into
training/testing/validation subsets. The training data is used in the optimization pro-
cedure (5.12) for finding the optimal trainable parameters (weights and biases). The
validation data is used to monitor the performance of the network by just evaluating
the loss function. Tuning the network hyperparameters, such as the type of activation
function, network size, and optimization algorithm may require some trial and error.
Although the validation data is not used directly in the optimization process, it may
indirectly create a bias in the process of hyperparameter tuning. Therefore, when
the performance of the validation data is satisfactory, the network may be further
validated using the test set. A typical split is to use 80% of the data for training
and ca. 20% for testing and validation, although these ratios may vary depending on
the problem at hand. For example, in the case of physics-informed neural networks,
where training and testing data are just points in the domain, it may be useful to test
the network by generating many more points from a higher resolution sample. We
note that in most machine learning models, optimization is the most computationally
intensive part. Therefore, the amount of training data is most closely related to the
amount of random access memory (RAM) and numerical (floating point) operations
required while testing (evaluating) the model is comparatively much cheaper.
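As a simple illustration of such a partition, the following sketch randomly splits a dataset into 80%/10%/10% training/validation/test subsets (the dataset and the exact ratios are placeholders):

import numpy as np

rng = np.random.default_rng(42)
data = np.linspace(-1.0, 1.0, 251)        # placeholder dataset
idx = rng.permutation(len(data))

n_train = int(0.8 * len(data))
n_val = int(0.1 * len(data))
train_set = data[idx[:n_train]]
val_set = data[idx[n_train:n_train + n_val]]
test_set = data[idx[n_train + n_val:]]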
In this section, we describe some of the pitfalls involved in training and testing a
network, and the countermeasures that can be implemented.

5.2.4.1 Underfitting and Overfitting

Several types of approximation pathologies can be encountered in the process of training a neural network, among which underfitting and overfitting are some of the most common. Underfitting can occur when the neural network does not have enough
approximation capability to satisfactorily fit the data or solve the problem at hand. It
can also occur when the optimization has not converged, for example, because too
few iterations have been performed, or because the learning rate is too low or too
high. Underfitting can be typically identified when both the training and validation
losses are higher than acceptable values.
Overfitting, on the other hand, can appear when the network capacity is larger
than required. In this scenario, the training data is well approximated but other data
points may be far off from the actual values, or in machine learning parlance, the
model “does not generalize” well. A similar case, where fitting a small dataset exactly
does not guarantee that the target function is well approximated, occurs in interpolation
by high-order polynomials, where the interpolant can oscillate wildly between the
interpolation points. In this case, the training loss value decreases to a low value
(even zero), while the validation loss can be much higher.
A good strategy to avoid overfitting or underfitting is to monitor both the training
and validation losses and to stop the training when the validation loss begins to increase.
To illustrate, the results for regression of the function u(x) = sin(π x) for x ∈ [−1, 1]
are shown in Fig. 5.3. A random uniform noise with magnitude in the interval (0, 0.1)
was added to the training and validation data, which consists of 201 and 50 points,
respectively. A neural network with two hidden layers consisting of 64 neurons has
been used, together with the tanh activation function for the first two layers and linear

Fig. 5.3 Fitting a noisy function and the loss convergence history: (a) underfit, 300 iterations; (b) proper fit, 10000 iterations; (c) overfit, 100000 iterations; (d) loss convergence

activation in the last layer. The ADAM optimizer with the default parameters and
learning rate of 0.001 is used to minimize the mean square error of the difference
between the predicted and training values.
We observe from Fig. 5.3a that after 300 iterations, the neural network can start to
approximate the sinusoidal function, but it is still quite far away from the actual shape
(underfitting). The training loss value at this stage is 0.1129, while the validation loss
is 0.1393. After 10000 iterations, the approximation is already quite good, with only
a small error between the prediction and the actual function (without noise) as shown
in Fig. 5.3b. Here, the training loss is 0.0033 and the validation loss is quite close
at 0.0035. Next, if we continue the training, we start to observe that after many more
iterations, the training and validation loss start to diverge (see Fig. 5.3d). After 100000
iterations, we notice that the predicted function has some oscillations and spikes as it
tries to capture the noise in the data as shown in Fig. 5.3c. At this stage, the training
loss is 0.0026 and the test loss is 0.0037.

5.2.4.2 Vanishing and Exploding Gradients

Two other types of problems encountered in training of artificial neural networks are
those related to the magnitude of the gradients. The vanishing gradients phenomenon
occurs when the derivative of the loss function with respect to the training variables
is very small. This can mean that the objective function is very close to a stationary
point, which can also be a saddle point or some other point far from the global
minimum. The end result is very slow or no convergence of the loss function. A
common remedy for this problem is to perform a normalization of the input data (see
also Sect. 5.2.3.3). Otherwise, changing the network architecture or the activation
function (for example using rectified activations like ReLU or Swish) may also be
helpful, since S-shaped activations like sigmoid or hyperbolic tangent are particularly
susceptible to vanishing gradients.
Exploding gradients, on the contrary, refer to the occurrence of too large deriva-
tives of the loss function with respect to the trainable parameters. In extreme cases,
the gradients can overflow, resulting in not-a-number (NaN) values for the loss.
Another possible effect is unstable training, where the loss value oscillates without
converging to the optimal value. Possible remedies for this problem include using a
smaller learning rate, and adding residual (or skip) connections to the neural network
(Philipp et al. 2018).
The ReLU activation function may suffer from a related problem known as “dying
ReLU”, which occurs when some neurons become inactivated, in the sense that they
always output zero for all the inputs. This can happen when a large negative bias
value is learned for a particular neuron. Because the derivative of the constant zero
function is also zero, it is not possible to recover a “dead” ReLU neuron, resulting
in a diminished approximation capability.

5.2.5 Optimizers

We will now briefly describe the optimization algorithms commonly used to train
a neural network (i.e. to minimize its loss function). First, we mention that two types of
optimization strategies can be employed: full-batch training and mini-batch training.
In the former, the entire data set is used during a forward pass through the network
and the gradients with respect to all the data points are computed in one step. In
mini-batch training, on the other hand, the training data is split into several sub-
sets of (approximately) the same size called mini-batches. Then an optimization
sub-step is taken with respect to each mini-batch. When the entire dataset is seen
by the optimization algorithm once, then a training epoch is completed. In general,
first-order optimization methods, like gradient descent, are commonly used with
mini-batch training, while algorithms that make use of (approximations of) second
derivative information use full-batch training. A detailed survey of optimization
methods used in machine learning has been presented in Sun et al. (2019).

5.2.5.1 Stochastic Gradient Descent

The gradient descent method is the simplest gradient-based optimizer. The idea is to
minimize the function in the direction of the gradient evaluated at the current guess
by a fixed step size (also called the learning rate). If the objective function is L(w),
then an optimization step can be written as

w(t+1) := w(t) − η∇w L(w(t) ), (5.15)

where η is the learning rate. In the case of mini-batch training, since the mini-batches
are typically randomly selected, the method is called stochastic gradient descent
(SGD). Using mini-batches has been shown to improve the robustness, allowing the
optimizer to find the global optima (or better local optima) even for non-convex
problems (De Sa et al. 2015; Mertikopoulos et al. 2020).
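A minimal NumPy sketch of one epoch of mini-batch SGD following (5.15); the function grad_fn, which returns the gradient of the loss over a given mini-batch, is assumed to be provided:

import numpy as np

def sgd_epoch(w, data, grad_fn, lr=0.01, batch_size=32, rng=np.random.default_rng(0)):
    # one pass over the shuffled dataset, with one update (5.15) per mini-batch
    idx = rng.permutation(len(data))
    for start in range(0, len(data), batch_size):
        batch = data[idx[start:start + batch_size]]
        w = w - lr * grad_fn(w, batch)
    return w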

5.2.5.2 Adaptive Momentum (ADAM)

This optimization method, proposed in Kingma and Ba (2014), replaces the fixed
learning rate of the conventional SGD with a variable step-size based on the momen-
tum, which can be seen as a linear combination of the gradients of the current and
previous time steps.
An update of the ADAM optimizer from step t to step t + 1 has the form:

m^{(t+1)} := \beta_1 m^{(t)} + (1 - \beta_1)\,\nabla_w L(w^{(t)})    (biased first moment)    (5.16)
v^{(t+1)} := \beta_2 v^{(t)} + (1 - \beta_2)\,(\nabla_w L(w^{(t)}))^2    (biased second moment)    (5.17)
\hat{m} := \frac{m^{(t+1)}}{1 - \beta_1^{t+1}}    (unbiased first moment)    (5.18)
\hat{v} := \frac{v^{(t+1)}}{1 - \beta_2^{t+1}}    (unbiased second moment)    (5.19)
w^{(t+1)} := w^{(t)} - \eta\,\frac{\hat{m}}{\sqrt{\hat{v}} + \epsilon}    (weights update)    (5.20)

Here, m and v are the moment vectors which are initialized to zeros, β1 , β2 , and  are
constants which are usually initialized to β1 = 0.9, β2 = 0.999, and  = 10−8 , and
η is the learning rate. β1t and β2t denote β1 and β2 to the power t, and (∇w L(w(t) ))2
denotes the element-wise squaring of the gradient vector. Because the momentum
vectors are initialized to zeros, a bias-correction is introduced in (5.18) and (5.19).
This technique can smooth out the oscillations in the gradients and usually improves
the convergence compared to the standard SGD optimizer.
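For concreteness, a NumPy sketch of a single ADAM update (5.16)-(5.20); w, grad, m, and v are arrays of the same shape, and t is the 1-based step counter:

import numpy as np

def adam_update(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1.0 - beta1) * grad         # biased first moment (5.16)
    v = beta2 * v + (1.0 - beta2) * grad**2      # biased second moment (5.17)
    m_hat = m / (1.0 - beta1**t)                 # bias correction (5.18)
    v_hat = v / (1.0 - beta2**t)                 # bias correction (5.19)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # weights update (5.20)
    return w, m, v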

5.2.5.3 Quasi-Newton Methods

The gradient descent-based methods approximate the loss at each step by a linear
function without taking into account the curvature information. Faster convergence
can be obtained by using Newton algorithms, which involve computing the second
derivatives. Nevertheless, for a large number of parameters, the cost of Newton’s
method in terms of memory storage and floating point operations can be prohibitive,
since the Hessian matrix has size n × n, where n is the number of parameters. A
more feasible alternative is the family of quasi-Newton methods, like the Broyden–
Fletcher–Goldfarb–Shanno (BFGS) algorithm (Broyden 1970; Fletcher 1970; Gold-
farb 1970; Shanno 1970) or the limited memory version L-BFGS (Liu and Nocedal
1989), which are already implemented in machine learning frameworks like PyTorch
or TensorFlow Probability (Dillon et al. 2017). Another algorithm that can be used
for problems with a small number of parameters is the Levenberg–Marquardt algo-
rithm (Levenberg 1944; Marquardt 1963), which can be seen as a combination of
the Gauss–Newton method and gradient descent.
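As an illustration of how such optimizers are typically called, the sketch below minimizes a toy quadratic with SciPy's L-BFGS implementation; for a neural network, the flattened trainable parameters and a function returning the loss together with its gradient would be passed instead:

import numpy as np
from scipy.optimize import minimize

def loss_and_grad(w):
    # toy objective 0.5 * ||w - 3||^2 and its gradient
    return 0.5 * np.sum((w - 3.0)**2), w - 3.0

result = minimize(loss_and_grad, x0=np.zeros(5), jac=True, method="L-BFGS-B")
print(result.x)   # converges to [3. 3. 3. 3. 3.]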

5.3 Physics-Informed Neural Networks

In the following, we focus on the physics-informed neural network, by which is meant an artificial neural network that incorporates the residuals of the PDE to be solved
into the loss function. In most cases, a simple, fully connected feed-forward network

is used; however, some important differences can be noted in the form of the objective
function, in particular regarding whether the strong or weak form of the PDE is
used.

5.3.1 Collocation Method

The classical PINNs are collocation-based, meaning that the neural network aims to
approximate the strong form of the governing equation at a set of collocation points.
Because the collocation points can be randomly distributed inside the domain and no
mesh is needed, this method belongs to the category of mesh-free methods. Moreover,
once the “building blocks” for constructing the neural network and evaluating the
partial derivatives with respect to the inputs are obtained, the implementation is
relatively simple.
In particular, suppose that the governing PDE is of the form:

F\left(u(x), \frac{\partial u(x)}{\partial x_1}, \ldots\right) = 0 \quad \text{for } x \in \Omega,    (5.21)

G\left(u(x), \frac{\partial u(x)}{\partial n}, \ldots\right) = 0 \quad \text{for } x \in \Gamma,    (5.22)

where F represents a differential operator for the domain interior, G is a differential operator for the boundary conditions, u is the unknown function, Ω and Γ are the computational domain and its boundary, and n is the outer normal vector to the
boundary. The interior differential operator may contain any order of derivatives with
respect to the inputs, while the boundary operator may contain any order of derivative
with respect to the outer normal vector for Neumann-type boundary conditions.
The loss function for a neural network u N N (x; θ ) with trainable parameters θ
(which include the weights and biases for each layer) can be constructed based
on the “mean square error” (MSE) evaluated at a set of N_{int} interior collocation points \{x_i^{int}\}, i = 1, \ldots, N_{int}, and a set of N_{bnd} boundary collocation points \{x_j^{bnd}\}, j = 1, \ldots, N_{bnd}, as

L_{coll}(\theta) = \frac{\lambda_1}{N_{int}} \sum_{i=1}^{N_{int}} F\left(u_{NN}(x_i^{int};\theta), \frac{\partial u_{NN}(x_i^{int};\theta)}{\partial x_1}, \ldots\right)^2 + \frac{\lambda_2}{N_{bnd}} \sum_{j=1}^{N_{bnd}} G\left(u_{NN}(x_j^{bnd};\theta), \frac{\partial u_{NN}(x_j^{bnd};\theta)}{\partial n}, \ldots\right)^2.    (5.23)

Here λ1 and λ2 are weight terms; usually, choosing λ2 >> λ1 helps to speed up
convergence by ensuring that the boundary conditions are satisfied. Adaptive methods
for choosing the weights have also been proposed in Wang et al. (2022). In the case
of time-dependent problems, the classical PINNs use a space–time discretization,
where the time is considered as an additional dimension.
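A minimal TensorFlow sketch of the collocation loss (5.23) for the model problem -u''(x) = f(x) on (0, 1) with u(0) = u(1) = 0; the network model, the source term f and the weights lam1, lam2 are placeholders:

import tensorflow as tf

def collocation_loss(model, x_int, x_bnd, f, lam1=1.0, lam2=100.0):
    # interior residual of -u'' = f, computed with two nested gradient tapes
    with tf.GradientTape() as t2:
        t2.watch(x_int)
        with tf.GradientTape() as t1:
            t1.watch(x_int)
            u = model(x_int)
        u_x = t1.gradient(u, x_int)
    u_xx = t2.gradient(u_x, x_int)
    interior = tf.reduce_mean(tf.square(-u_xx - f(x_int)))
    # boundary residual; the prescribed boundary values are zero here
    boundary = tf.reduce_mean(tf.square(model(x_bnd)))
    return lam1 * interior + lam2 * boundary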

5.3.2 Energy Minimization Method

In the energy minimization method, we seek to minimize an energy functional, which is usually based on the weak (variational) form of the PDE. In many scientific mod-
eling tasks, an energy functional appears naturally from the physical laws involved
(for example, the principle of minimum potential energy in structural mechanics).
Suppose the functional to minimize is denoted by J(u), which can be decomposed into an interior term and a boundary term:

J(u) = \int_\Omega H_{int}(u)\, d\Omega + \int_{\Gamma} H_{bnd}(u)\, d\Gamma,    (5.24)

with Γ denoting the portion of the boundary over which the boundary term is evaluated. Then we can define the loss function of the form:

L_{energy}(\theta) = \int_\Omega H_{int}(u_{NN})\, d\Omega + \int_{\Gamma} H_{bnd}(u_{NN})\, d\Gamma.    (5.25)
The integrals in (5.25) are usually approximated by numerical integration, using a finite set of quadrature points \{q_i^{int}\} and weights \{w_i^{int}\}, i = 1, \ldots, Q_{int}, for the interior integral, and quadrature points \{q_j^{bnd}\} and weights \{w_j^{bnd}\}, j = 1, \ldots, Q_{bnd}, for the boundary integral, i.e.

L_{energy}(\theta) \approx \sum_{i=1}^{Q_{int}} H_{int}(u_{NN}(q_i^{int}))\, w_i^{int} + \sum_{j=1}^{Q_{bnd}} H_{bnd}(u_{NN}(q_j^{bnd}))\, w_j^{bnd}.    (5.26)

When additional constraints are needed, such as Dirichlet boundary conditions, addi-
tional terms can be added to the loss function, similarly to (5.23). Alternatively, one
can impose the Dirichlet boundary conditions strongly (i.e. exactly) by modifying the
output of the neural network to match the prescribed boundary data. In particular, we
consider the computed solution to be ũ_{NN}, satisfying ũ_{NN}(x) = u_D(x) for x ∈ Γ_D, where u_D is the Dirichlet boundary condition specified on the boundary Γ_D, to be of the form:

ũ_{NN}(x) = g(x) + d(x)\, u_{NN}(x),    (5.27)

where g is a smooth extension of u_D such that g(x) = u_D(x) for x ∈ Γ_D, and d is a distance function such that d(x) = 0 for x ∈ Γ_D and d(x) > 0 otherwise. When u_{NN} is vector-valued, then we multiply each of its components by d(x). This ensures that the output ũ_{NN} satisfies exactly the boundary conditions (i.e. ũ_{NN}(x) = u_D(x) for x ∈ Γ_D), although the choice of g(x) and d(x) requires some care (Sukumar and
Srivastava 2022).
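As a simple sketch of (5.27) for a one-dimensional problem on [0, 1] with assumed boundary data u_D(0) = 0 and u_D(1) = 1, one could take g(x) = x and d(x) = x(1 - x):

def u_tilde(model, x):
    g = x                  # smooth extension of the (assumed) boundary data
    d = x * (1.0 - x)      # distance-like function vanishing on the Dirichlet boundary
    return g + d * model(x)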
In general, the energy minimization method tends to be less computationally
demanding than the collocation-based PINNs, due to the fact that, e.g. only the first-
order derivatives need to be computed for solving a second-order problem. However,

it requires an integration mesh and it is more difficult to verify that the solution is
correct within a certain tolerance, since the objective function should converge to
some non-zero minimum which is not known in advance. A possible approach to
overcome this problem is to compute the residual loss for validation, which can then
also be used to adaptively adjust the number of integration points, as proposed in
Goswami et al. (2020).

5.4 Numerical Applications

By using a small set of training or input data (e.g. initial and boundary conditions
and/or measured data) as well as governing physical laws, PINNs attempt to approx-
imate the solution of the problem. Complex nonlinear systems and phenomena in
physics and engineering are described by differential equations.
PINNs have shown their capabilities to solve both forward and inverse problems in
science and engineering. A forward problem can be defined as a problem of finding a
particular effect of a given cause utilizing a physical or mathematical model, whereas
an inverse problem refers to finding causes from the given effects (Vauhkonen et al.
2016). We can investigate the 1D steady-state heat equation with the source term to
give more concrete examples of forward and inverse problems.
Let us consider a rod with unit length along the x-axis and the heat flowing through
this rod with a heat source as our model. We can represent the temperature at location
x on the rod as T (x). Under certain assumptions, such as the rod being perfectly
insulated, with the source term q(x) being known, then the governing equation can
be written as
\kappa \frac{d^2 T}{dx^2} + q(x) = 0    (5.28)
where κ > 0 is the thermal diffusivity constant. Finding the temperature at any location
on the rod is a forward problem. On the other hand, finding the constant κ, which
is a rod feature, from observed temperature data is a good example of an inverse
problem. These examples will be detailed in Sects. 5.4.1 and 5.4.2.
To summarize, the aforementioned procedures explained in the previous sections
to solve differential equations with PINNs will become tangible with numerical
applications in this section. The solution estimation of PINNs for both forward and
inverse problems will be discussed by providing simple and complex examples.

5.4.1 Forward Problems

In the introductory part of this section, the definition of a forward problem is given as
finding the particular effect of a given cause using a physical or mathematical model.

5.4.1.1 1D Steady-State Heat Equation

Let us remember the 1D steady-state heat conduction problem with a heat source.
As we discussed before, the governing equation for this example is given in (5.28).
Let the thermal diffusivity constant be κ = 0.5 and x denote the location on the rod.
Here, the source term is given as q(x) = 15x − 2. We assume that the temperatures
at both ends are 0. Then we can re-write (5.28) as

\frac{d^2 T}{dx^2} + \frac{q(x)}{\kappa} = 0, \quad x \in [0, 1],
q(x) = 15x - 2,    (5.29)
\kappa = 0.5,
T(0) = T(1) = 0.

The first step to solve this problem is to discretize the domain with uniform or
randomly sampled collocation points. Then the neural network will process these
collocation points through its linearly connected layers consisting of neurons with
nonlinear activation functions. Of course, the outcome of the first forward propaga-
tion will not be compatible with the true solution. Therefore, at this point, the physics
and boundary knowledge will guide the neural network to approximate the ground
truth by updating the weights and biases of the neural network. Let us elaborate on
this step by step and reinforce these steps with code snippets. Note that these codes
are written with TensorFlow version 2.x with the Keras API.
We first generate 100 equidistant points in our domain. Here, the choice of the
number of points is up to the user. However, it should be noted that the number of
points also has some influence on the number of iterations or network size required
to have results with similar accuracy. The ADAM optimizer with a learning rate of
0.005 is used for this example. An input layer, three hidden layers with 32 neurons
equipped with tanh activation function, and an output layer form the neural network
(see Fig. 5.4). The input and output layers have one neuron each since the input for
the network is only one spatial dimension, and the output is the temperature at these
points. By setting the number of iterations to 1000 and introducing the boundary
condition data in TensorFlow tensors, we complete the initial settings of our model
(see Listing 1).

Listing 1: Initial settings for the heat equation


import tensorflow as tf

# We set seeds initially. This feature controls the randomization of
# variables (e.g. initial weights of the network).
# By doing it so, we can reproduce the same results.
tf.random.set_seed(123)

# 100 equidistant points in the domain are created (as a column vector,
# so that they can be fed directly to the Dense layers)
x = tf.linspace(0.0, 1.0, 100)[:, None]

# boundary conditions T(0)=T(1)=0 and \kappa are introduced.
bcs_x = [0.0, 1.0]
bcs_T = [0.0, 0.0]
Fig. 5.4 The architecture of the feed-forward neural network for the 1D steady-state heat conduction problem. The network consists of one input layer, one output layer, and three hidden layers with 32 neurons each. a is the activation function; superscripts denote the layer number and subscripts the neuron number in the relevant layer

bcs_x_tensor = tf.convert_to_tensor(bcs_x)[:, None]


bcs_T_tensor = tf.convert_to_tensor(bcs_T)[:, None]
kappa = 0.5
# Number of iterations
N = 1000
# ADAM optimizer with learning rate of 0.005
optim = tf.keras.optimizers.Adam(learning_rate=0.005)

# Function for creating the model


def buildModel(num_hidden_layers, num_neurons_per_layer):
    tf.keras.backend.set_floatx("float32")
    # Initialize a feedforward neural network
    model = tf.keras.Sequential()

    # Input is one dimensional (one spatial dimension)
    model.add(tf.keras.Input(1))

    # Append hidden layers
    for _ in range(num_hidden_layers):
        model.add(
            tf.keras.layers.Dense(
                num_neurons_per_layer,
                activation=tf.keras.activations.get("tanh"),
                kernel_initializer="glorot_normal",
            )
        )

    # Output is one-dimensional
    model.add(tf.keras.layers.Dense(1))

    return model

# determine the model size (3 hidden layers with 32 neurons each)
model = buildModel(3, 32)

Then we define our loss function, which is composed of two parts, the boundary
loss, and physics loss, as formulated in (5.30). Here, the loss term tells us how far
away our model is from 'reality'. For the measure of these loss terms, we will use the
mean square error formulation, which is mentioned in Sect. 5.2.3 and in (5.11).

L Loss = L BCs + L Physics (5.30)

Constructing the boundary loss is easier compared to the physics loss. Our model’s
assumptions should be compatible with the prescribed boundary conditions, which
are T (0) = 0 and T (1) = 0 for our case. Thus, our goal should be to minimize the
mean square error between our model’s temperature prediction at both ends of the
rod and the real temperature values at these points, which must be 0. The boundary
condition loss is given by (5.31)

L_{BCs} = \frac{\lambda_1}{N_B} \sum_{j=1}^{N_B=2} |T_{NN}(x_j) - y_j|^2,    (5.31)

where N B = 2 since we have boundary condition data for two points which are
T (0) = T (1) = 0. The regularization term λ1 is taken as 1.
We also need to provide information about the interior points to get reasonable
results. Although we do not know the temperature data for intermediate points on
the rod, we know those points have to satisfy some physical laws that we derived in
(5.28). Or in other words, our temperature prediction needs to satisfy (5.28). When
we take the derivative of the temperature prediction of the network with respect to x
two times and sum this result with the source term q(x) divided by κ, this summation
must yield 0. Thus, the physics loss for our example becomes
L_{Physics} = \frac{\lambda_2}{N_P} \sum_{j=1}^{N_P=100} \left| \left.\frac{d^2 T_{NN}}{dx^2}\right|_{x=x_j} + \frac{q(x_j)}{\kappa} \right|^2,    (5.32)

Again, the regularization term λ2 is taken as 1. Now we can combine the boundary
conditions loss and physics loss functions to form our model’s loss function (see
(5.30)), which will guide the model to make better predictions in each iteration.

Listing 2: Loss function definition for the heat equation


# Boundary loss function
def boundary_loss(bcs_x_tensor, bcs_T_tensor):
    predicted_bcs = model(bcs_x_tensor)
    mse_bcs = tf.reduce_mean(tf.square(predicted_bcs - bcs_T_tensor))
    return mse_bcs

# the first derivative of the prediction
def get_first_deriv(x):
    with tf.GradientTape() as tape:
        tape.watch(x)
        T = model(x)
    T_x = tape.gradient(T, x)
    return T_x

# the second derivative of the prediction
def second_deriv(x):
    with tf.GradientTape() as tape:
        tape.watch(x)
        T_x = get_first_deriv(x)
    T_xx = tape.gradient(T_x, x)
    return T_xx

# Source term divided by \kappa
source_func = lambda x: (15 * x - 2) / kappa

# Function for physics loss
def physics_loss(x):
    predicted_Txx = second_deriv(x)
    mse_phys = tf.reduce_mean(tf.square(predicted_Txx + source_func(x)))
    return mse_phys

# Overall loss function
def loss_func(x, bcs_x_tensor, bcs_T_tensor):
    bcs_loss = boundary_loss(bcs_x_tensor, bcs_T_tensor)
    phys_loss = physics_loss(x)
    loss = bcs_loss + phys_loss
    return loss

TensorFlow records the operations on the trainable variables when computing the loss function and calculates the gradients by backpropagation. Then, the ADAM
optimizer with a fixed learning rate of 0.005 minimizes the loss function. By per-
forming one forward and one back propagation over the whole data set, one epoch is
completed. This procedure is repeated the specified number of times, which is 1000
for our example.

Listing 3: Training of the heat equation model


# taking gradients of the loss function
def get_grad():
    with tf.GradientTape() as tape:
        # This tape is for derivatives with
        # respect to trainable variables
        tape.watch(model.trainable_variables)
        Loss = loss_func(x, bcs_x_tensor, bcs_T_tensor)
    g = tape.gradient(Loss, model.trainable_variables)
    return Loss, g

# optimizing and updating the weights of the model by using gradients
def train_step():
    # Compute current loss and gradient w.r.t. parameters
    loss, grad_theta = get_grad()
    # Perform gradient descent step
    optim.apply_gradients(zip(grad_theta, model.trainable_variables))
    return loss

# Training loop
for i in range(N + 1):
    loss = train_step()
    # printing the loss value every 100 epochs
    if i % 100 == 0:
        print("Epoch {:05d}: loss = {:10.8e}".format(i, loss))

Once the training process is completed with the desired loss value, we can validate
the output by performing one forward pass with a test dataset which is typically
formed in the same domain as the training dataset. In our example, the training data
was 100 equidistant points between 0 and 1. We can determine our test dataset as 200
equidistant points in the same domain. Figure 5.5 depicts that the model’s prediction
captures the analytical result.
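A possible sketch of this validation step, using the exact solution T(x) = -5x^3 + 2x^2 + 3x from Fig. 5.5 (the plotting details are only illustrative):

import tensorflow as tf
import matplotlib.pyplot as plt

x_test = tf.linspace(0.0, 1.0, 200)[:, None]   # 200 test points
T_pred = model(x_test)                         # one forward pass
T_exact = -5.0 * x_test**3 + 2.0 * x_test**2 + 3.0 * x_test

plt.plot(x_test, T_exact, label="Exact")
plt.plot(x_test, T_pred, "--", label="PINN prediction")
plt.xlabel("x")
plt.ylabel("T(x)")
plt.legend()
plt.show()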

5.4.1.2 2D Linear Elasticity Example

So far, we have seen the most straightforward application of PINNs to solve a 1D forward problem. The problem has been described with a linear second-order non-
homogeneous ordinary differential equation with Dirichlet boundary conditions.
Then the equation was solved with a neural network with three hidden layers. This
pedagogical example is supposed to help readers to understand the concept of using

neural networks in scientific problems. Now, let us proceed with a more complex application.

Fig. 5.5 Exact solution and the prediction of the model. The predicted solution coincides with the ground truth, which is T(x) = -5x^3 + 2x^2 + 3x
Therefore, consider the cantilever beam model (Wang et al. 2006; Otero and
Ponta 2010; Wang and Feng 2009), a classical example in linear elasticity theory.
This problem is governed by the well-known equilibrium equation (5.33) given as

-\nabla \cdot \sigma(x) = f(x) \quad \text{for } x \in \Omega    (5.33)

with the strain-displacement relation given by:

\epsilon = \frac{1}{2}(\nabla u + \nabla u^T)    (5.34)

and Hooke's law for a linear isotropic elastic solid:

\sigma = 2\mu\epsilon + \lambda(\nabla \cdot u)\, I,    (5.35)

where μ and λ are the Lamé constants, and I is the identity tensor. The Dirichlet boundary conditions are u(x) = û for x ∈ Γ_D and the Neumann boundary conditions are σn = t̂ for x ∈ Γ_N, where n is the normal vector.
For this example (see Fig. 5.6), Ω is a rectangle with corners at (0, 0) and (8, 2). Letting x = (x, y) and u = (u, v), the Dirichlet boundary conditions at x = 0 are

u(x, y) = \frac{P y}{6 E I}(2 + \nu)\left(y^2 - \frac{W^2}{4}\right),
v(x, y) = -\frac{P}{6 E I}\,(3 \nu y^2 L).    (5.36)

A parabolic traction at x = 8, given by

p(x, y) = P\, \frac{y^2 - y W}{2 I},    (5.37)

is applied, where P = 2 is the maximum traction, E = 10^3 is Young's modulus, ν = 0.25 is the Poisson ratio, and I = \frac{b W^3}{12} is the second moment of area of the cross section. The dimensions of the beam in the x, y and z directions are L = 8, W = 2 and b = 1, respectively (Fig. 5.6).
The objective of this problem is to find the displacements on the beam in x and
y directions. In order to solve the problem, firstly, uniform collocation points in the
domain are created. The numbers of collocation points are 80 and 40 in x and y
directions, respectively, as shown in Fig. 5.7. The prediction of the neural network
shall satisfy the equilibrium equation (5.33) and the constitutive law (5.35) as well
as the boundary conditions (5.36).
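A short NumPy sketch of generating such a grid is given after the next paragraph, once the treatment of the Dirichlet edge has been explained.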
The strong imposition of the Dirichlet boundary conditions is explained in detail
in Sect. 5.3.2. In this example, we strongly imposed Dirichlet boundary conditions
Fig. 5.6 The illustration of the 2D elasticity problem: a beam of length L = 8, width W = 2 and thickness b = 1, subjected to a parabolic traction with P_max = 2

at x = 0. Therefore, there is no need to insert collocation points along the y-axis where x = 0, as is illustrated in Fig. 5.7.
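A short NumPy sketch of generating such a grid (the point counts follow Fig. 5.7; the exact spacing is an assumption, and the Dirichlet edge x = 0 is left out since those conditions are imposed strongly):

import numpy as np

x = np.linspace(0.0, 8.0, 81)[1:]       # 80 points in x, excluding x = 0
y = np.linspace(0.0, 2.0, 40)           # 40 points in y
X, Y = np.meshgrid(x, y)
collocation_pts = np.stack([X.ravel(), Y.ravel()], axis=1)   # shape (3200, 2)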
In this example, a fully connected feed-forward neural network with three hidden
layers and with 20 neurons per hidden layer is used with the swish activation function.
The neural network constructed for this problem is depicted in Fig. 5.8. ADAM and
L-BFGS optimizers are used together. ADAM optimizer was used for the first 15000
iterations, and then optimization of the parameters continued with L-BFGS optimizer
for the successive 500 iterations.
After completing the training procedure, the model is tested with a new set of data.
For the test data, the number of uniformly spaced collocation points was doubled in
the same domain. One forward pass is performed to see the results for the test data.
The solution obtained with the model and the exact solution for the displacements
of the beam in x and y directions are plotted in Fig. 5.9.
The approximation obtained by the neural network is very close to the analyti-
cal solution. The relative L 2 error in the approximation is 5.899 × 10−5 . The error
distribution can be seen in Fig. 5.10. The error for the displacements in x and y
directions are of the order of 10−6 and 10−5 , respectively. The model makes less
accurate predictions for the displacements around the beam’s free end than the fixed
end.

Fig. 5.7 Collocation points on the Timoshenko cantilever beam: 80 points in the x direction and 40 points in the y direction. Red points stand for boundary points, whereas the blue points represent interior collocation points. Since the Dirichlet boundary conditions are strongly imposed, the collocation points along the y-axis where x = 0 are not needed

Fig. 5.8 The architecture of the feed-forward neural network for the Timoshenko beam problem. The network consists of one input layer, one output layer, and three hidden layers with 20 neurons per hidden layer. The two input neurons take the x and y coordinates, and the two output neurons give the displacements u and v. a is the activation function, which is swish in this example; superscripts denote the layer number and subscripts the neuron number in the relevant layer

5.4.1.3 3D Hyperelasticity

As the last example of this section, a hyperelasticity problem presented in Samaniego et al. (2020) will be discussed. We will solve this particular problem with the deep energy method shown in Sect. 5.3.2. The objective of the problem is to obtain the displacements for a 3D hyperelastic cuboid made of an isotropic, homogeneous material
subjected to prescribed twisting, body forces, and traction forces. In order to obtain
optimal parameters of the neural network, the potential energy formulation for the
body will be used as the loss function. The governing equations and boundary con-
ditions for this problem are written as

∇ · P + f_b = 0,
Dirichlet boundary: u = ū on ∂Ω_D,    (5.38)
Neumann boundary: P · n = t̄ on ∂Ω_N,

where ū is the prescribed displacement given on the Dirichlet boundary and t̄ is the
prescribed traction at the Neumann boundary; n denotes the outward unit normal
vector, P is the first Piola-Kirchhoff stress tensor, and f_b is the body force. The potential
energy functional of this problem is given by Samaniego et al. (2020)

Fig. 5.9 Predicted and exact values for the displacements of the Timoshenko cantilever beam in the x and y directions: (a) predicted displacements in x-direction; (b) exact displacements in x-direction; (c) predicted displacements in y-direction; (d) exact displacements in y-direction

Fig. 5.10 The difference between the exact solution and the predicted solution for the displacements of the beam in the x and y directions: (a) estimation error for displacements in x-direction; (b) estimation error for displacements in y-direction

\varepsilon(\varphi) = \int_\Omega \Psi\, dV - \int_\Omega f_b \cdot \varphi\, dV - \int_{\partial\Omega_N} \bar{t} \cdot \varphi\, dA,    (5.39)

where Ψ is the strain energy density and ϕ indicates the mapping of points on the
body from the initial/undeformed to the deformed state.
In order to obtain optimal parameters of the neural network, the potential energy
(5.39) is parameterized by the neural network’s prediction for the displacements.

Thus, the loss function reads

L(p) = \int_\Omega \Psi(\varphi(X; p))\, dV - \int_\Omega f_b \cdot \varphi(X; p)\, dV - \int_{\partial\Omega_N} \bar{t} \cdot \varphi(X; p)\, dA.    (5.40)

If we rewrite (5.40) in a discrete form, it becomes

L(p) \approx \frac{V}{N} \sum_{i=1}^{N} \Psi((\varphi_p)_i) - \frac{V}{N} \sum_{i=1}^{N} (f_b)_i \cdot (\varphi_p)_i - \frac{A_{\partial\Omega_N}}{N_{\partial\Omega_N}} \sum_{i=1}^{N_{\partial\Omega_N}} \bar{t}_i \cdot (\varphi_p)_i,    (5.41)

in which V is the volume and N is the number of data points within the solid; N_{\partial\Omega_N} and A_{\partial\Omega_N} denote the number of points on the surface subjected to the force and the surface area, respectively.
Let us now consider a 3D cuboid of length L = 1.25, width W = 1.0, and depth H = 1.0. It is fixed at the left surface and twisted 60° counter-clockwise by the boundary conditions u|_{\Gamma_1} prescribed at the right-end surface. Also, at the lateral surfaces, a body force f_b = [0, -0.5, 0]^T and traction forces t̄ = [1, 0, 0]^T are applied (see Fig. 5.11).
The Dirichlet boundary conditions for this particular problem are u = [0, 0, 0]^T on the fixed surface and, on the twisted surface,

u|_{\Gamma_1} = \begin{bmatrix} 0 \\ 0.5\,[\,0.5 + (X_2 - 0.5)\cos(\pi/3) - (X_3 - 0.5)\sin(\pi/3) - X_2\,] \\ 0.5\,[\,0.5 + (X_2 - 0.5)\sin(\pi/3) + (X_3 - 0.5)\cos(\pi/3) - X_3\,] \end{bmatrix}    (5.42)

The Neo-Hookean model is assumed in this problem. The material properties are
shown in Table 5.1.
We now proceed with determining the network parameters. In each direction, 40
equally spaced points, 64000 points in total, are placed over the whole domain (see
Fig. 5.12a). The neural network consists of three hidden layers, and each hidden

Fig. 5.11 The 3D


hyperelastic cuboid is fixed
at the left-hand side and
twisted 60◦
counter-clockwise
H=1

L=1
.25
1
Fixed Support W=

Table 5.1 Material properties and parameters for the hyperelastic cuboid

Description                  Value
E (Young's modulus)          10^6
ν (Poisson ratio)            0.3
μ (Lamé parameter)           E / (2(1 + ν))
λ (Lamé parameter)           Eν / ((1 + ν)(1 − 2ν))

Fig. 5.12 Training points on the cuboid and its predicted deformed shape after training: (a) 64000 equidistant points over the domain; (b) deformed shape of the 3D hyperelastic cuboid

layer has 30 neurons with a tanh activation function. The input and output layers
have three neurons corresponding to coordinates of the initial configuration of the
designated points and their deformed coordinates after loading, respectively. The
network is trained with 50 iterations and the parameters are optimized by the L-
BFGS optimizer.
The predicted deformed shape of the cuboid is given in Fig. 5.12b. A line passing
through two points on the cube A(0.625, 1, 0.5) and B(0.625, 0, 0.5) is drawn to
compare the displacement predictions and the real displacements on the line. We
showed in Samaniego et al. (2020) that a neural network with the same setup but
trained with 25 steps has an error in the L 2 norm of 0.13210, whereas the finite
element model has an error of 0.13275 for estimating the displacements on the line
AB.

5.4.2 Inverse Problems

An inverse problem can be considered as inferring features of a model from the observed data. Finding the elasticity modulus of a beam from its displacement measurements under certain constraints, or inferring the space-dependent reaction rate of a diffusion-reaction system (Yu et al. 2022), can be given as examples of
inverse problems. PINNs have already been used to tackle several inverse problems
existing in unsaturated groundwater flow (Depina et al. 2022), nano-optics, and meta-
materials (Chen et al. 2020). We will discuss two inverse problems in this section to
demonstrate the application of PINNs to solve these problems.

5.4.2.1 Inverse Heat Equation

In Sect. 5.4.1, we examined a 1D steady-state heat equation with a source term. The goal was to find the temperature values along the rod with boundary data and
governing physical laws. Now, let us reconsider that problem with a slightly different
setup that converts the problem into an inverse one.
Assume we measured temperature at 100 equidistant points on the rod. Addition-
ally, we have the same source term and the same boundary conditions and aim to
find the thermal diffusivity constant κ in (5.28). Again, the neural network seeks to
predict the temperature values at 100 equidistant points on the rod. However, the
thermal diffusivity constant κ is unknown this time. Therefore, the model will be
trained to obtain the optimum values of the diffusivity constant as well as the network
parameters. Mean square error between measured temperatures and model predic-
tion will be used as the loss function. Additionally, a physics loss term that guides
the prediction of the network according to the governing equations will be added to
the loss function.
At first, initial settings are applied (see Listing 4) similar to the forward heat
conduction problem defined in Sect. 5.4.1. However, the constant κ is not known
in advance for this problem. We have an initial guess of κ = 0.1 for the thermal
diffusivity constant. The neural network has three hidden layers with 32 neurons
each, and the tanh function is used as the activation function. The ADAM optimizer
optimizes the network parameters with a fixed learning rate of 0.001. The number
of epochs is designated as 6000.

Listing 4: Initial settings for the inverse heat equation


# importing necessary libraries
import numpy as np
import tensorflow as tf
import matplotlib.pyplot as plt

# We set seeds initially. This feature starts the model with the same random
# variables (e.g. initial weights of the network).
# By doing it so, we have the same results whenever the code is run
tf.random.set_seed(123)

# 100 equidistant points in the domain are created (as a column vector,
# so that they can be fed directly to the Dense layers)
x = tf.linspace(0.0, 1.0, 100)[:, None]

# boundary conditions which are T(0)=T(1)=0 are introduced.
bcs_x = [0.0, 1.0]
bcs_T = [0.0, 0.0]
bcs_x_tensor = tf.convert_to_tensor(bcs_x)[:, None]
bcs_T_tensor = tf.convert_to_tensor(bcs_T)[:, None]
kappa = tf.Variable([0.1], trainable=True)

# Number of iterations
N = 6000

# ADAM optimizer with learning rate of 0.001
optim = tf.keras.optimizers.Adam(learning_rate=1e-3)

# The exact solution of the problem. It will be used to produce measured data
# and test data
solution = lambda x: -5 * x**3 + 2 * x**2 + 3 * x

def buildModel(num_hidden_layers, num_neurons_per_layer):
    tf.keras.backend.set_floatx("float32")
    # Initialize a feedforward neural network
    model = tf.keras.Sequential()

    # Input is one dimensional (one spatial dimension)
    model.add(tf.keras.Input(1))

    # Append hidden layers
    for _ in range(num_hidden_layers):
        model.add(
            tf.keras.layers.Dense(
                num_neurons_per_layer,
                activation=tf.keras.activations.get("tanh"),
                kernel_initializer="glorot_normal",
            )
        )

    # Output is one-dimensional
    model.add(tf.keras.layers.Dense(1))

    return model

# determine the model size (3 hidden layers with 32 neurons each)
model = buildModel(3, 32)

After defining the model settings, we can proceed with constructing the loss
function. The loss function (5.43) consists of three parts, namely, boundary loss,
physics loss, and data loss.

L Loss = L BCs + L Physics + L Data (5.43)

with
L_{BCs} = \frac{\lambda_1}{N_B} \sum_{i=1}^{N_B} |T_{NN}(x_i) - y_i|^2,
L_{Physics} = \frac{\lambda_2}{N_P} \sum_{j=1}^{N_P} \left| \left.\frac{d^2 T_{NN}}{dx^2}\right|_{x=x_j} + \frac{q(x_j)}{\kappa} \right|^2,    (5.44)
L_{Data} = \frac{\lambda_3}{N_D} \sum_{j=1}^{N_D} |T_{NN}(x_j) - y_j|^2

Here N B , N P , and N D correspond to the number of data points for boundary loss,
physics loss, and measured data loss, respectively. Regularization terms λ1 , λ2 , λ3
are taken as 1 in this example; (5.43) and (5.44) are defined in the code as follows:

Listing 5: Loss function for the inverse heat equation


@tf.function
def boundary_loss(bcs_x_tensor, bcs_T_tensor):
    predicted_bcs = model(bcs_x_tensor)
    mse_bcs = tf.reduce_mean(tf.square(predicted_bcs - bcs_T_tensor))
    return mse_bcs

# the first derivative of the prediction
@tf.function
def get_first_deriv(x):
    with tf.GradientTape() as tape:
        tape.watch(x)
        T = model(x)
    T_x = tape.gradient(T, x)
    return T_x

# the second derivative of the prediction
@tf.function
def second_deriv(x):
    with tf.GradientTape() as tape:
        tape.watch(x)
        T_x = get_first_deriv(x)
    T_xx = tape.gradient(T_x, x)
    return T_xx

# Source term
def source_func(x): return (15 * x - 2)

@tf.function
def physics_loss(x):
    x = x[1:-1]
    predicted_Txx = second_deriv(x)
    mse_phys = tf.reduce_mean(
        tf.square(predicted_Txx * kappa + source_func(x)))
    return mse_phys

@tf.function
def data_loss(x):
    x = x[1:-1]
    # x is already a column vector, so the "measured" data has the
    # same shape as the model output
    ob_T = solution(x)
    data_loss = tf.reduce_mean(tf.square(ob_T - model(x)))
    return data_loss

@tf.function
def loss_func(x):
    bcs_loss = boundary_loss(bcs_x_tensor, bcs_T_tensor)
    phys_loss = physics_loss(x)
    ob_loss = data_loss(x)
    loss = phys_loss + ob_loss + bcs_loss
    return loss

The training and testing procedures are the same as for the forward problem.
Again, the gradients of the loss function with respect to κ and the trainable variables,
which are weights and biases of the network, are determined with backpropagation.
Then, the trainable variables and the κ value are updated by the ADAM optimizer
using previously obtained gradients. This iterative procedure is repeated a number
of epochs times, and eventually, it is expected to reach the possible minimum loss
value.
Listing 6: Training
# taking gradients of the loss function w.r.t. trainable variables
# and kappa
@tf.function
def get_grad():
    with tf.GradientTape(persistent=True) as tape:
        # This tape is for derivatives with
        # respect to trainable variables
        tape.watch(model.trainable_variables)
        tape.watch(kappa)
        Loss = loss_func(x)
    g = tape.gradient(Loss, model.trainable_variables)
    g_kappa = tape.gradient(Loss, kappa)
    return Loss, g, g_kappa

# optimizing and updating the weights and biases of the model and
# kappa by using the gradients
@tf.function
def train_step():
    # Compute current loss and gradient w.r.t. parameters
    loss, grad_theta, grad_kappa = get_grad()

    # Perform gradient descent step
    optim.apply_gradients(zip(grad_theta, model.trainable_variables))
    optim.apply_gradients([(grad_kappa, kappa)])
    return loss
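The training loop itself is the same as in Listing 3; a minimal sketch, which additionally reports the current estimate of κ (the printing interval is arbitrary), could read:

# Training loop
for i in range(N + 1):
    loss = train_step()
    if i % 1000 == 0:
        print("Epoch {:05d}: loss = {:10.4e}, kappa = {:.4f}".format(
            i, float(loss), float(kappa.numpy()[0])))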

The network parameters obtained at the last epoch form our model. We can test
the model with a new data set in the same domain and plot the results to compare
it with the ground truth (see Fig. 5.13a). The value for κ estimated by the neural
network is equal to 0.5000, and the real value of κ is 0.5. Figure 5.13b illustrates that, as the network is being trained, the value of κ converges to the true value. The relative L_2 error norm is 7.575 × 10^{-6}.

Fig. 5.13 The temperature and thermal diffusivity constant prediction of the neural network and the true values

5.4.2.2 Inverse Helmholtz Equation

The second and also last example is the Helmholtz equation, which is a time-
independent version of the wave equation. It is used for describing problems in
electromagnetic radiation, acoustics, and seismology. The homogeneous form of the
Helmholtz equation is written as:

∇^2 u + k^2 u = 0    (5.45)



where ∇ 2 is the Laplace operator and k is the wave number. The solution of the
problem is u(x, y) for (x, y) ∈ Ω. An inverse acoustic duct problem, adopted from
Anitescu et al. (2019), whose governing equation is a complex-valued Helmholtz
equation such that k is unknown and u(x, y) is known at some points in the domain,
will be investigated.
We can write (5.45) with domain information and boundary conditions as:

∇^2 u(x, y) + k^2 u(x, y) = 0, where (x, y) ∈ Ω and Ω := (0, 2) × (0, 1),    (5.46)

with the Neumann and Robin boundary conditions

\frac{\partial u}{\partial n} = \cos(m\pi y) \quad \text{at } x = 0,
\frac{\partial u}{\partial n} = -iku \quad \text{at } x = 2,    (5.47)
\frac{\partial u}{\partial n} = 0 \quad \text{for } y = 0 \text{ and } y = 1,

m being the number of modes, which is taken as 1. The wave number k is unknown. The initial guess for k is 1, and the true value is chosen as k = 6. The exact solution for u(x, y) can be written as:

u(x, y) = \cos(m\pi y)\left(A_1 e^{-i k_x x} + A_2 e^{i k_x x}\right), \quad \text{where } k_x = \sqrt{k^2 - (\pi m)^2},    (5.48)

and A_1 and A_2 are obtained by solving a 2 × 2 linear system:

\begin{pmatrix} i k_x & -i k_x \\ (k - k_x)\, e^{-2 i k_x} & (k + k_x)\, e^{2 i k_x} \end{pmatrix} \begin{pmatrix} A_1 \\ A_2 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}    (5.49)

Similar to the previous inverse problem in which we obtained the thermal diffusivity
constant of a rod, the overall loss function is composed of boundary loss, physics loss
and data loss. The boundary loss is constructed by Neumann and Robin boundary
conditions specified in (5.47), and the physics loss is equal to the left-hand-side of
(5.46). In addition, the data loss, in other words, the mean square error between
observed u(x, y) values and the prediction of the neural network is the last term in
our overall loss function. These loss functions can be described as follows:

Lloss = L BCs + L Physics + L Data (5.50)

where L BCs , L Physics , L Data are:



Fig. 5.14 Collocation points for 2D Helmholtz equation. Black points depict the boundary points
where Neumann boundary conditions are valid whereas the red points show the Robin boundary
points. In addition, blue points represent the inner collocation points where physics loss and data
loss are computed

L_{BCs} = \frac{\lambda_1}{N_B} \sum_{i=1}^{N_B} \left| \frac{\partial u_{NN}}{\partial n}(x_i^b, y_i^b) - \frac{\partial u}{\partial n}(x_i^b, y_i^b) \right|^2,
L_{Physics} = \frac{\lambda_2}{N_P} \sum_{j=1}^{N_P} \left| \nabla^2 u_{NN}(x_j^*, y_j^*) + k^2 u_{NN}(x_j^*, y_j^*) \right|^2,    (5.51)
L_{Data} = \frac{\lambda_3}{N_D} \sum_{j=1}^{N_D} \left| u_{NN}(x_j^*, y_j^*) - u(x_j^*, y_j^*) \right|^2

Here L BCs , L Physics , L Data refer to the loss obtained from boundary conditions,
governing equation, and measured data, respectively. The regularization term λ1 is
100 whereas λ2 and λ3 are 1; N B indicates the number of boundary points, N P , N D
are the number of interior collocation points where physics loss is computed and the
number of points where the observed data is available, respectively. In this problem,
784 equidistant points (28 × 28) such that N P =N D = 676 and N B =108 are created
(see Fig. 5.14).
The neural network consists of 5 hidden layers with the tanh activation function,
and there are 30 neurons in each layer. The data is normalized to the interval [−1, 1]
before being processed. First, ADAM optimizer and, subsequently, the quasi-Newton
method (L-BFGS) are employed to minimize the loss function. Five thousand itera-
tions for ADAM and 6200 iterations with L-BFGS are applied. The estimated solution
for u(x, y) and the exact solution are shown in Fig. 5.15.
The initial guess for k was one, and the neural network’s estimation for k after
the training is 5.999. The relative L 2 error norm for the real part of the solution is
0.0015. A comparison between the predicted solution and the exact solution can be
found in Fig. 5.16.

Fig. 5.15 Predicted and exact values for the real and imaginary parts of the Helmholtz equation: (a) predicted solution, real part; (b) exact solution, real part; (c) predicted solution, imaginary part; (d) exact solution, imaginary part

Fig. 5.16 Error distribution for the real and imaginary parts of the Helmholtz equation: (a) error between predicted and exact solution, real part; (b) error between predicted and exact solution, imaginary part

5.5 Conclusions

In this chapter, we have introduced some of the main building blocks for PINNs.
The main idea is to cast the process of solving a PDE as an optimization problem,
where either the residual or some energy functional related to the governing equations
is minimized. We showed the implementation of PINNs for both simple and more
advanced inverse problems. First, a 1D steady-state heat conduction problem with
a source term was solved for the unknown thermal diffusivity constant κ. Later,
a complex-valued Helmholtz equation for an inverse acoustic duct problem was
investigated. The wave number k is unknown in the beginning, and it is approximated
by the PINN model. Unlike the forward problems, we have an additional term in the
loss function, which is formed as the mean square error between the measured data
and the model’s prediction.
By taking advantage of modern machine learning libraries, it is possible to write
fairly succinct programs that approximate the solution or some quantity of inter-
est, while at the same time taking advantage of the built-in parallelization offered by
multi-processor and GPU architectures. Nevertheless, solving PDEs by the optimiza-
tion of parameters in a “standard” fully connected neural network is less efficient
than current methods such as finite elements. More advances seem possible by com-
bining machine learning algorithms with classical methods for solving PDEs which
make use of the available knowledge for approximating the solutions or quantities
of interest.

References

Abadi M, Agarwal A, Barham P, Brevdo E et al (2015) TensorFlow: large scale machine learning
on heterogeneous systems. Software available from tensorflow.org. https://www.tensorflow.org/
Agostinelli F, Hoffman M, Sadowski P, Baldi P (2014) Learning activation functions to improve
deep neural networks. arXiv:1412.6830
Anitescu C, Atroshchenko E, Alajlan N, Rabczuk T (2019) Artificial neural network methods for
the solution of second order boundary value problems. Comput Mater Continua 59(1):345–359
Apicella A, Donnarumma F, Isgrò F, Prevete R (2021) A survey on modern trainable activation
functions. Neural Netw 138:14–32
Bin Waheed U, Haghighat E, Alkhalifah T, Song C et al (2021) PINNeik: Eikonal solution using
physics-informed neural networks. Comput Geosci 155:104833
Bradbury J, Frostig R, Hawkins P, Johnson MJ et al (2018) JAX: composable transformations of
Python+NumPy programs. Version 0.2.5. http://github.com/google/jax
Broyden CG (1970) The convergence of a class of double-rank minimization algorithms: 2. The
new algorithm. IMA J Appl Math 6(3):222–231
Chen Y, Lu L, Karniadakis GE, Dal Negro L (2020) Physics-informed neural networks for inverse
problems in nano-optics and metamaterials. Opt Express 28(8):11618–11633
Clevert D-A, Unterthiner T, Hochreiter S (2015) Fast and accurate deep network learning by expo-
nential linear units (ELUs). arXiv:1511.07289
De Sa C, Re C, Olukotun K (2015) Global convergence of stochastic gradient descent for some
non-convex matrix problems. International conference on machine learning. PMLR, pp 2332–
2341

Depina I, Jain S, Mar Valsson S, Gotovac H (2022) Application of physics-informed neural networks
to inverse problems in unsaturated groundwater flow. Georisk: Assess Manag Risk Eng Syst
Geohazards 16(1):21–36
Dillon JV, Langmore I, Tran D, Brevdo E et al (2017) TensorFlow distributions. arXiv:1711.10604
Fletcher R (1970) A new approach to variable metric algorithms. Comput J 13(3):317–322
Floridi L, Chiriatti M (2020) GPT-3: Its nature, scope, limits, and consequences. Minds Mach
30(4):681–694
Glorot X, Bengio Y (2010) Understanding the difficulty of training deep feedforward neural net-
works. In: Proceedings of the thirteenth international conference on artificial intelligence and
statistics. JMLR Workshop and conference proceedings, pp 249–256
Goldfarb D (1970) A family of variable-metric methods derived by variational means. Math Comput
24(109):23–26
Goodfellow I, Bengio Y, Courville A (2016) Deep learning. MIT Press
Goswami S, Anitescu C, Rabczuk T (2020) Adaptive fourth-order phase field analysis for brittle
fracture. Comput Methods Appl Mech Eng 361:112808
Gühring I, Kutyniok G, Petersen P (2020) Error bounds for approximations with deep ReLU neural
networks in W^{s,p} norms. Anal Appl 18(05):803–859
Haghighat E, Amini D, Juanes R (2022) Physics-informed neural network simulation of multiphase
poroelasticity using stress-split sequential training. Comput Methods Appl Mech Eng 397:115141
He J, Li L, Xu J, Zheng C (2020) ReLU deep neural networks and linear finite elements. J Comput
Math 38(3):502–527
Hendrycks D, Gimpel K (2016) Gaussian error linear units (GELUs). arXiv:1606.08415
He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: Surpassing human-level per-
formance on imagenet classification. In: Proceedings of the IEEE international conference on
computer vision, pp 1026–1034
Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approx-
imators. Neural Netw 2(5):359–366
Jagtap AD, Kawaguchi K, Em Karniadakis G (2020a) Locally adaptive activation functions with
slope recovery for deep and physics informed neural networks. Proc R Soc A 476(2239):20200334
Jagtap AD, Kawaguchi K, Karniadakis GE (2020b) Adaptive activation functions accelerate
convergence in deep and physics-informed neural networks. J Comput Phys 404:109136
Jagtap AD, Shin Y, Kawaguchi K, Karniadakis GE (2022) Deep Kronecker neural networks: A gen-
eral framework for neural networks with adaptive activation functions. Neurocomputing 468:165–
180
Jouppi NP, Young C, Patil N, Patterson D et al (2017) In-datacenter performance analysis of a
tensor processing unit. In: Proceedings of the 44th annual international symposium on computer
architecture, pp 1–12
Jumper J, Evans R, Pritzel A, Green T et al (2021) Highly accurate protein structure prediction with
AlphaFold. Nature 596(7873):583–589
Kharazmi E, Zhang Z, Karniadakis GE (2019) Variational physics-informed neural networks for
solving partial differential equations. arXiv:1912.00873
Kingma DP, Ba J (2014) Adam: A method for stochastic optimization. arXiv:1412.6980
Kissas G, Yang Y, Hwuang E, Witschey WR et al (2020) Machine learning in cardiovascular
flows modeling: Predicting arterial blood pressure from non-invasive 4D flow MRI data using
physics-informed neural networks. Comput Methods Appl Mech Eng 358:112623
Lagaris IE, Likas AC, Papageorgiou DG (2000) Neural-network methods for boundary value prob-
lems with irregular boundaries. IEEE Trans Neural Netw 11(5):1041–1049
Lagaris IE, Likas A, Fotiadis DI (1997) Artificial neural network methods in quantum mechanics.
Comput Phys Commun 104(1–3):1–14, 40
Lagaris IE, Likas A, Fotiadis DI (1998) Artificial neural networks for solving ordinary and partial
differential equations. IEEE Trans Neural Netw 9(5):987–1000
Levenberg K (1944) A method for the solution of certain non-linear problems in least squares. Q
Appl Math 2(2):164–168

Li A, Chen R, Farimani AB, Zhang YJ (2020a) Reaction diffusion system prediction based on
convolutional neural network. Sci Rep 10(1):1-9
Li Z, Kovachki N, Azizzadenesheli K, Liu B et al (2020b) Fourier neural operator for parametric
partial differential equations. arXiv:2010.08895
Li Z, Liu F, Yang W, Peng S et al (2021) A survey of convolutional neural networks: analysis,
applications, and prospects. IEEE Trans Neural Netw Learn Syst
Liu DC, Nocedal J (1989) On the limited memory BFGS method for large scale optimization. Math
Program 45(1):503–528
López J, Anitescu C, Rabczuk T (2021) Isogeometric structural shape optimization using automatic
sensitivity analysis. Appl Math Model 89:1004–1024
Lu L, Jin P, Pang G, Zhang Z et al (2021) Learning nonlinear operators via DeepONet based on
the universal approximation theorem of operators. Nat Mach Intell 3(3):218–229
Maas AL, Hannun AY, Ng AY et al (2013) Rectifier nonlinearities improve neural network acoustic models. Proc ICML 30(1):3
Marquardt DW (1963) An algorithm for least-squares estimation of nonlinear parameters. J Soc Indus Appl Math 11(2):431–441
Mertikopoulos P, Hallak N, Kavis A, Cevher V (2020) On the almost sure convergence of stochastic gradient descent in non-convex problems. Adv Neural Inf Process Syst 33:1117–1128
Misra D (2019) Mish: A self regularized non-monotonic activation function. arXiv:1908.08681
Nguyen-Thanh VM, Zhuang X, Rabczuk T (2020) A deep energy method for finite deformation
hyperelasticity. Eur J Mech-A/Solids 80:103874
Otero AD, Ponta FL (2010) Structural analysis of wind-turbine blades by a generalized Timoshenko
beam model
Paszke A, Gross S, Massa F, Lerer A et al (2019) PyTorch: an imperative style, high-performance deep learning library. In: Advances in neural information processing systems 32. Curran Associates, Inc., pp 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
Petersen P, Voigtlaender F (2018) Optimal approximation of piecewise smooth functions using deep
ReLU neural networks. Neural Netw 108:296–330
Pfau D, Spencer JS, Matthews AGDG, Foulkes WMC (2020) Ab initio solution of the many-electron
Schrödinger equation with deep neural networks. Phys Rev Res 2:033429
Philipp G, Song D, Carbonell JG (2018) Gradients explode—Deep Networks are shallow—ResNet
explained
Raissi M, Perdikaris P, Karniadakis GE (2019) Physics-informed neural networks: A deep learning framework for solving forward and inverse problems involving nonlinear partial differential equations. J Comput Phys 378:686–707
Ramachandran P, Zoph B, Le QV (2017) Searching for activation functions. arXiv:1710.05941
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536
Samaniego E, Anitescu C, Goswami S, Nguyen-Thanh VM et al (2020) An energy approach to
the solution of partial differential equations in computational mechanics via machine learning:
Concepts, implementation and applications. Comput Methods Appl Mech Eng 362:112790
Shanno DF (1970) Conditioning of quasi-Newton methods for function minimization. Math Comput
24(111):647–656
Shukla K, Di Leoni PC, Blackshire J, Sparkman D et al (2020) Physics informed neural network for
ultrasound nondestructive quantification of surface breaking cracks. J Nondestruct Eval 39(3):1–
20
Shukla K, Jagtap AD, Karniadakis GE (2021) Parallel physics informed neural networks via domain
decomposition. J Comput Phys 447:110683
Silver D, Hubert T, Schrittwieser J, Antonoglou I et al (2017) Mastering chess and shogi by self-play
with a general reinforcement learning algorithm. arXiv:1712.01815
Sirignano J, Spiliopoulos K (2018) DGM: A deep learning algorithm for solving partial differential
equations. J Comput Phys 375:1339–1364
Sukumar N, Srivastava A (2022) Exact imposition of boundary conditions with distance functions in physics-informed deep neural networks. Comput Methods Appl Mech Eng 389:114333
Sun S, Cao Z, Zhu H, Zhao J (2019) A survey of optimization methods from a machine learning perspective. IEEE Trans Cybern 50(8):3668–3681
Vauhkonen M, Tarvainen T, Lähivaara T (2016) Inverse problems. In: Pohjolainen S (ed) Mathe-
matical modelling. Springer International Publishing
Wang G-F, Feng X-Q (2009) Timoshenko beam model for buckling and vibration of nanowires
with surface effects. J Phys D: Appl Phys 42(15):155411
Wang C, Tan V, Zhang Y (2006) Timoshenko beam model for vibration analysis of multi-walled carbon nanotubes. J Sound Vib 294(4–5):1060–1072
Wang S, Yu X, Perdikaris P (2022) When and why PINNs fail to train: A neural tangent kernel
perspective. J Comput Phys 449:110768
Wight CL, Zhao J (2020) Solving Allen-Cahn and Cahn-Hilliard equations using the adaptive physics informed neural networks. arXiv:2007.04542
Yu B et al (2018) The deep Ritz method: a deep learning-based numerical algorithm for solving
variational problems. Commun Math Stat 6(1):1–12
Yu J, Lu L, Meng X, Karniadakis GE (2022) Gradient-enhanced physics-informed neural networks
for forward and inverse PDE problems. Comput Methods Appl Mech Eng 393:114823
Zhuang X, Guo H, Alajlan N, Zhu H et al (2021) Deep autoencoder based energy method for the bending, vibration, and buckling analysis of Kirchhoff plates with transfer learning. Eur J Mech-A/Solids 87:104225
Chapter 6
Physics-Informed Deep Neural Operator Networks

Somdatta Goswami, Aniruddha Bora, Yue Yu, and George Em Karniadakis

6.1 Introduction

Physics-informed neural networks (PINNs) have transformed the way we model the
behavior of physical systems for which we have available some measurements and
at least a parameterized partial differential equation (PDE) to provide additional
information in a semi-supervised type of learning (Raissi et al. 2019; Karniadakis
et al. 2021; Samaniego et al. 2020). They can solve ill-posed problems that may lack
boundary conditions, e.g. thermal boundary conditions in heat transfer problems
(Cai et al. 2021) or discover voids and defects in materials based only on a handful
of displacements (Zhang et al. 2022), or obtain the failure pattern (Goswami et al.
2020b). Despite their effectiveness, PINNs are trained for specific boundary and ini-
tial conditions, as well as loading or source terms, and require expensive training
during inference. Therefore, they are not particularly effective for other operating
conditions and real-time inference, although transfer learning can somewhat alle-
viate this limitation (Goswami et al. 2020c, 2022e). What we need in engineering
disciplines such as design, control, uncertainty quantification, robotics, etc. is a gen-

S. Goswami · A. Bora · G. E. Karniadakis (B)


Division of Applied Mathematics, Brown University, Providence, RI, USA
e-mail: [email protected]
S. Goswami
e-mail: [email protected]
A. Bora
e-mail: [email protected]
Y. Yu
Department of Mathematics, Lehigh University, Bethlehem, PA, USA
e-mail: [email protected]
G. E. Karniadakis
School of Engineering, Brown University, Providence, RI, USA

generalized version of PINNs that can infer the system’s response in real time for many different boundary/initial conditions and loadings, without further training or perhaps with very light training. This will lead to speed-up factors of thousands compared to conventional numerical solvers, e.g. CFD or solid mechanics simulators.
The neural operators, introduced in 2019 in the form of DeepONet (Lu et al.
2021), fulfill this promise. Training is performed offline in a predefined input space
and hence inference is very fast since no further training is required as long as the
new conditions are inside the input space. For arbitrary inputs, which are out of the
distribution (OOD), further training is required but this may be relatively light if
the input space is sampled sufficiently. We note here that a neural operator is very
different from a reduced order model (ROM) that is restricted to a very small subset of
conditions and lacks generalization properties due to the under-parameterization in
such methods (Kontolati et al. 2022; Geelen et al. 2022). Another important property
of DeepONet is that it is based on a universal approximation theorem for operators
(Chen and Chen 1995; Lu et al. 2021), and more recent theoretical work (Lanthaler
et al. 2022) has shown that DeepONet can break the curse of dimensionality in the
input space, unlike approaches based on ROM for parameterized PDEs (Riffaud et al.
2021).
Figure 6.1 represents the schematic of a polymorphic DeepONet and its resem-
blance to a human neuron. An operator network is made up of two deep neural
networks (DNNs): branch and trunk. The branch and trunk networks are analogous
to synchronized dendritic branches and axonal spiking. The result is a nonlinear oper-
ator that can be used to approximate any function defined as an input in the branch

Fig. 6.1 A schematic representation of the deep operator network (DeepONet). It consists of two
neural networks (the branch network and the trunk network) with flexible architectures (DNNs: deep
neural networks; GNNs: graph neural networks; SNN: spiking neural networks). The networks take
functions as input and output functions, which are represented by the dot product of the outputs of
the branch (basis coefficients) and the trunk networks (basis functions). The color coding indicates
the resemblance of computational operations with a human neuron
network and evaluated at the locations specified in the trunk network. In 2020, work
on graph kernel network (GKN) for PDE led to another type of operator regression
(Li et al. 2020c). Subsequently, the same group proposed a different architecture in
which they formulated the operator regression by parameterizing the integral kernel
directly in Fourier space and named it Fourier Neural Operator (FNO) (Li et al.
2020b). All these versions are different realizations of DeepONet if appropriate
changes in the trunk and branch are imposed, see Fig. 6.1 and also (Lu et al. 2022a).
Consider an operator G that maps from the input function v to the output function
u, i.e. G : v → u. DeepONet tries to learn the operator G by approximating the basis
function for expressing the output functional space. In this chapter, we will discuss
in detail the three aforementioned neural operators, and their extensions, and put
forward numerical examples to illustrate the usage and limitations of each of these
approaches. The physics of the problem is enforced using the labeled input–output
dataset pairs for the conventional architecture of the proposed operators. In related
sections, variants of the operators that use the physics of the problem (and require
little to no labeled data) to train the network are also covered.
This chapter is organized as follows. In Sect. 6.2, we discuss the deep neural
operator (DeepONet), its extensions, and variants that have been developed and used
for different problem setups. In Sect. 6.3, we describe the Fourier neural operator
(FNO) architecture and present the extensions of the operator to deal with com-
plex geometries, problems defined in different input and output spaces, and also
the physics-informed version of the operator. In Sect. 6.4, we introduce the graph
neural operator (GNO), its components, and non-local kernel networks. In Sect. 6.5, we summarize theoretical results on the approximation properties of neural operators. In Sect. 6.6, we compare the performance of the studied models for several examples from the
literature. Finally, we summarize our observations and provide concluding remarks
in Sect. 6.7.

6.2 DeepONet and Its Extensions

DeepONet is based on the universal approximation theorem for operators, which is defined in Theorem 6.1. The conventional DeepONet architecture consists of two
DNNs: one encodes the input function, v(η), at fixed sensor points (branch net) to
provide the coefficients, and another encodes the information related to the spatio-
temporal coordinates of the output function (trunk net) to provide the basis functions.
The goal of DeepONet is to learn the operator G(v) that can be evaluated at continuous
spatio-temporal coordinates ξ (input to the trunk net). The output of the DeepONet
for a specified input vector, vi , is a scalar-valued function of ξ expressed as Gθ (vi )(ξ ),
where θ = (W, b) includes the trainable parameters (weights W and biases b) of the
networks. To work with the input function of the branch net, it must be discretized
in a finite-dimensional space using a finite number of points, termed sensors. We
specifically evaluate vi at m fixed sensor locations {η1 , η2 , . . . , ηm } to obtain the
pointwise evaluations, vi = {vi (η1 ), vi (η2 ), . . . , vi (ηm )}, which are used as input
to the branch net. The trunk net receives the spatial and temporal coordinates, e.g.
ξ = {xi , yi , ti } and evaluates the solution operator to compute the loss function. The
solution operator for an input realization, v1 , can be expressed as follows:


$$
\mathcal{G}_{\theta}(v_1)(\xi) = \sum_{i=1}^{p} br_i \cdot tr_i
= \sum_{i=1}^{p} br_i\big(v_1(\eta_1), v_1(\eta_2), \ldots, v_1(\eta_m)\big) \cdot tr_i(\xi)
\qquad (6.1)
$$

where br1 , br2 , . . . , br p are outputs of the branch net and tr1 , tr2 , . . . , tr p are outputs
of the trunk net.
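To make the branch–trunk construction of Eq. (6.1) concrete, a minimal sketch of the forward pass is given below. This sketch is ours (it is not the authors' reference implementation); the layer widths, the number of basis functions p, and the class names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    """Minimal DeepONet sketch: G_theta(v)(xi) = sum_i br_i(v) * tr_i(xi), cf. Eq. (6.1)."""
    def __init__(self, m: int, dim_xi: int, p: int, width: int = 64):
        super().__init__()
        # Branch net: input function v sampled at m fixed sensors -> p coefficients.
        self.branch = nn.Sequential(nn.Linear(m, width), nn.Tanh(), nn.Linear(width, p))
        # Trunk net: query coordinate xi -> p basis functions.
        self.trunk = nn.Sequential(nn.Linear(dim_xi, width), nn.Tanh(), nn.Linear(width, p))

    def forward(self, v_sensors: torch.Tensor, xi: torch.Tensor) -> torch.Tensor:
        br = self.branch(v_sensors)                 # (batch, p) basis coefficients
        tr = self.trunk(xi)                         # (n_points, p) basis functions
        return torch.einsum("bp,np->bn", br, tr)    # (batch, n_points) = G_theta(v)(xi)

# Example usage with random data
model = DeepONet(m=100, dim_xi=2, p=32)
v = torch.randn(8, 100)    # 8 input functions evaluated at 100 sensor locations
xi = torch.rand(50, 2)     # 50 spatial query points
u_pred = model(v, xi)      # predicted outputs, shape (8, 50)
```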

Theorem 6.1 (Generalized universal approximation theorem for operators, Lu et al. 2021) Suppose that $X$ is a Banach space, $K_1 \subset X$ and $K_2 \subset \mathbb{R}^{D_{in}}$ are two compact sets in $X$ and $\mathbb{R}^{D_{in}}$, respectively, and $V$ is a compact set in $C(K_1)$. Assume that $\mathcal{G} : V \to C(K_2)$ is a nonlinear continuous operator. Then for any $\epsilon > 0$, there exist positive integers $m$, $p$, branch nets $br_k$, trunk nets $tr_k$, and sensor locations $\eta_1, \eta_2, \ldots, \eta_m \in K_1$, such that
$$
\sup_{v \in V} \sup_{\xi \in K_2} \left| \mathcal{G}(v)(\xi) - \sum_{k=1}^{p} \underbrace{br_k\big(v(\eta_1), v(\eta_2), \ldots, v(\eta_m)\big)}_{branch} \, \underbrace{tr_k(\xi)}_{trunk} \right| < \epsilon
$$

Furthermore, the functions brk (outputs of the branch network) and trk (outputs of
the trunk network) can be chosen as diverse classes of neural networks, satisfying the
classical universal approximation theorem of functions, e.g. fully connected neural
networks (FNN), residual neural networks, and convolution neural networks (CNN).

This theorem was proved in Chen and Chen (1995) with two-layer neural net-
works. Also, the theorem holds when the Banach space C(K 1 ) is replaced by L q (K 1 )
and C(K 2 ) replaced by L r (K 2 ), q, r ≥ 1. We notice that in Eq. (6.1), the last layer of
each brk branch network is bias-free. Although bias is not a requirement in Theorem
6.1, adding bias may improve performance by lowering the generalization error (Lu
et al. 2021). The trainable parameters of a data-driven DeepONet, represented by
θ in Eq. (6.1), are obtained by minimizing a loss function, which is expressed as
follows:
$$
\mathcal{L}(\theta) = \frac{1}{N} \sum_{i=1}^{N} w_i \left| u_i(\xi) - \mathcal{G}_{\theta}(v_i)(\xi) \right|^2
\qquad (6.2)
$$

where u i (ξ ) is the ground truth and N denotes the total number of functions in the
branch network. The weights associated with each sample in Eq. (6.2) are denoted
by wi , which are assumed to be unity in the simplest case. Examples described in
Sects. 6.6.1.2 and 6.6.1.3 are solved with the architecture of conventional data-driven
DeepONet.
During the optimization process, some query points must be penalized more than
others in order to satisfy constraints (initial condition, boundary condition). In such
cases, properly designed non-uniform training point weights can improve accuracy.
These penalizing parameters can be modulated manually, which is often a tedious procedure, or they can be decided adaptively during the training of the DeepONet
(McClenny and Braga-Neto 2020; Kontolati et al. 2022). These parameters in the
loss function can be updated by gradient descent side by side with the network
parameters. The modified loss function is defined as follows:

$$
\mathcal{L}(\theta, \lambda) = \frac{1}{N} \sum_{i=1}^{N} g(\lambda) \left| u_i(\xi) - \mathcal{G}_{\theta}(v_i)(\xi) \right|^2
\qquad (6.3)
$$

where $g(\lambda)$ is a non-negative, strictly increasing self-adaptive mask function, and $\lambda = \{\lambda_1, \lambda_2, \ldots, \lambda_j\}$ are $j$ self-adaptive parameters, each associated with an evaluation
point, ξ j . These parameters are constrained to increase monotonically and are always
positive. Typically, in a neural network, we minimize the loss function with respect
to the network parameters, θ . However, in this approach, we additionally maximize
the loss function with respect to the trainable hyperparameters using a gradient-
descent/ascent procedure. The modified objective function is defined as follows:

$$
\min_{\theta} \max_{\lambda} \mathcal{L}(\theta, \lambda)
\qquad (6.4)
$$

The self-adaptive weights are updated using the gradient-descent method, such that

$$
\lambda^{k+1} = \lambda^{k} + \eta_{\lambda} \nabla_{\lambda} \mathcal{L}(\theta, \lambda)
\qquad (6.5)
$$

where ηλ is the learning rate of the self-adaptive weights and


$$
\nabla_{\lambda_i} \mathcal{L} = g'(\lambda_i) \left( u_i(\xi) - \mathcal{G}_{\theta}(v_i)(\xi) \right)^2
\qquad (6.6)
$$

Therefore, if $g'(\lambda_i) > 0$, $\nabla_{\lambda_i} \mathcal{L}$ would be zero only if the term $(u_i(\xi) - \mathcal{G}_{\theta}(v_i)(\xi))$ is zero. Implementing self-adaptive weights in Kontolati et al. (2022) has considerably improved the prediction accuracy for discontinuities or non-smooth features in the
solution. Following the inception of DeepONet in 2019 (Lu et al. 2021), a num-
ber of enhancements to traditional architecture were developed to offer inductive
bias and speed up training. In subsequent sections, we present several extensions of
DeepONet.
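Before turning to these extensions, the gradient-descent/ascent scheme of Eqs. (6.4)–(6.5) can be sketched in a few lines. The snippet below is our own illustration, assuming the DeepONet sketch shown earlier, a softplus mask g(λ), and randomly generated placeholder data; it is not the implementation used in the cited works.

```python
import torch
import torch.nn.functional as F

# Assumed setup: DeepONet is the illustrative class sketched above; data are placeholders.
model = DeepONet(m=100, dim_xi=2, p=32)
v, xi, u_true = torch.randn(8, 100), torch.rand(50, 2), torch.randn(8, 50)

lam = torch.zeros(xi.shape[0], requires_grad=True)          # one lambda per query point
opt_theta = torch.optim.Adam(model.parameters(), lr=1e-3)   # descent on theta
# maximize=True (recent PyTorch versions) performs gradient ascent on lambda, cf. Eq. (6.5)
opt_lam = torch.optim.Adam([lam], lr=1e-2, maximize=True)

for step in range(1000):
    g = F.softplus(lam)                       # non-negative, increasing mask g(lambda)
    residual = (model(v, xi) - u_true) ** 2   # pointwise squared error
    loss = (g * residual).mean()              # self-adaptive loss, cf. Eq. (6.3)

    opt_theta.zero_grad(); opt_lam.zero_grad()
    loss.backward()
    opt_theta.step()   # minimize over the network parameters theta
    opt_lam.step()     # maximize over the self-adaptive weights lambda
```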

6.2.1 Feature Expansion in DeepONet

In this section, we lay emphasis on learning the operator from paired data using any
available prior knowledge of the underlying system. This knowledge can then be
directly encapsulated into DeepONet by changing the architecture of the trunk or branch network, which is problem specific.
Feature expansion in the trunk network: Additional information on the output func-
tion is included as features in the trunk network. For example, in Kontolati et al.
(2022), we incorporate the knowledge of the dynamics of the decaying evolution
of the Brusselator reaction–diffusion system by employing a trigonometric feature
expansion of the temporal input of the trunk network. Additionally, historical infor-
mation about dynamical systems could also be used as features in the trunk.
Feature expansion in the branch net: To incorporate information that has both spatial
and temporal dependencies, we use an additional input function in the branch net-
work. In the regularized cavity flow problem presented in Lu et al. (2022a), an addi-
tional trigonometric input function was included in the branch network to incorporate
periodic boundary conditions. Similarly, an additional input function is included in
Kissas et al. (2022), which is created by averaging a feature embedded in the inputs
of the branch network over probability distributions that depend on the correspond-
ing query locations of the output function, employing the kernel-coupled attention
mechanism, which allows the operator to accurately model correlations between the
query locations of the output functions.
POD modes in the trunk: The standard DeepONet employs the trunk net to learn
the basis of the output function from the data. In this approach, the basis functions
are pre-computed by performing proper orthogonal decomposition (POD) on the
labeled output of the training data (after the mean has been excluded). The labeled
outputs are denoted by ui (ξ ), where ξ denotes the coordinates at which the outputs
are computed. The POD basis is used in the trunk net. A DNN is employed in the
branch net to learn the POD basis coefficients such that the output can be written as
follows:
$$
\mathcal{G}(v_1)(\xi) = \sum_{i=1}^{p} br_i(\eta)\, \phi_i(\xi) + \phi_0(\xi)
\qquad (6.7)
$$

where $\phi_0(\xi)$ is the mean function of all $u_i(\xi)$, $i = 1, \ldots, N$, computed from the training dataset, and $\{\phi_1, \phi_2, \ldots, \phi_p\}$ are the $p$ pre-computed POD modes of $u_i(\xi)$.
Examples described in Sects. 6.6.1.2 and 6.6.1.3 are solved with the architecture of
POD-DeepONet.
Additionally, in Oommen et al. (2022), we have used a similar feature expansion
in the branch net. The branch takes as input the latent dimension of a convolutional
autoencoder to learn the dynamic evolution of a two-phase mixture. In De Hoop et al.
(2022), the authors have used POD modes in the branch net to learn the operator.
This approach is often challenging for discontinuous functions because it is similar
to global spectral methods trying to approximate non-smooth functions.
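As an illustration of Eq. (6.7), the POD modes can be pre-computed from the training snapshots with an SVD and combined with a branch network; the following sketch is ours and makes illustrative assumptions about the data shapes and network sizes.

```python
import torch
import torch.nn as nn

# U_train: labeled outputs u_i(xi) sampled on a fixed set of points (placeholder data).
U_train = torch.randn(100, 500)                       # (N snapshots, n_points)
phi0 = U_train.mean(dim=0)                            # mean function phi_0(xi)
p = 16
_, _, Vh = torch.linalg.svd(U_train - phi0, full_matrices=False)
pod_modes = Vh[:p]                                    # (p, n_points), rows are phi_i(xi)

branch = nn.Sequential(nn.Linear(100, 64), nn.Tanh(), nn.Linear(64, p))

def pod_deeponet(v_sensors: torch.Tensor) -> torch.Tensor:
    """Eq. (6.7): output = sum_i br_i(v) * phi_i(xi) + phi_0(xi)."""
    coeffs = branch(v_sensors)                        # (batch, p) POD coefficients
    return coeffs @ pod_modes + phi0                  # (batch, n_points)

u_pred = pod_deeponet(torch.randn(8, 100))            # predictions for 8 input functions
```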
6.2.2 Multiple Input DeepONet

The DeepONet, which is based on the universal approximation theorem of operators, is defined for input functions on a single Banach space. To employ DeepONet for
realistic setups (multiple input functions) and diverse applications, the multiple input
operators theorem was theoretically formulated and proposed in Jin et al. (2022).
Such networks are useful for approximating solutions for multiple initial conditions
and boundary conditions at the same time. The solution operator for multiple input
functions is defined as the tensor product of Banach spaces such that:


$$
\mathcal{G}_{\theta}(v, w) = \sum_{i=1}^{p} br_i^{v} \cdot br_i^{w} \cdot tr_i
\qquad (6.8)
$$

where briv and briw denote the i-th output embedding of the branch networks corre-
sponding to the input functions denoted by v and w, respectively. A schematic repre-
sentation of this architecture is shown in Fig. 6.2. In Goswami et al. (2022f), a Deep-
ONet framework based on multiple input functions is proposed that encompasses
two different DNNs (CNN and FNN) as branch networks. The architecture uses
grayscale images of systolic and diastolic geometry (in CNNs) along with patient-
specific information (such as hypertension in FNN) to predict the initial distribution
and extent of the mechanobiological insult in a patient with thoracic aortic aneurysm.
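In code, the only change in Eq. (6.8) relative to the single-branch DeepONet is that the embeddings of the branch networks are multiplied elementwise before the dot product with the trunk output. A minimal sketch of a two-input operator (ours, with assumed fully connected branches and illustrative sizes) follows.

```python
import torch
import torch.nn as nn

class MIONet(nn.Module):
    """Sketch of a two-input DeepONet: G(v, w)(xi) = sum_i br_i^v * br_i^w * tr_i(xi)."""
    def __init__(self, m_v: int, m_w: int, dim_xi: int, p: int = 32, width: int = 64):
        super().__init__()
        self.branch_v = nn.Sequential(nn.Linear(m_v, width), nn.Tanh(), nn.Linear(width, p))
        self.branch_w = nn.Sequential(nn.Linear(m_w, width), nn.Tanh(), nn.Linear(width, p))
        self.trunk = nn.Sequential(nn.Linear(dim_xi, width), nn.Tanh(), nn.Linear(width, p))

    def forward(self, v, w, xi):
        bv = self.branch_v(v)      # (batch, p), embedding of the first input function
        bw = self.branch_w(w)      # (batch, p), embedding of the second input function
        tr = self.trunk(xi)        # (n_points, p), basis functions
        return torch.einsum("bp,bp,np->bn", bv, bw, tr)

model = MIONet(m_v=100, m_w=3, dim_xi=3)
out = model(torch.randn(4, 100), torch.randn(4, 3), torch.rand(200, 3))  # (4, 200)
```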
Fig. 6.2 A schematic representation of a multiple input data-driven DeepONet with self-adaptive
weights. The operator takes as inputs the functional field in branch network 1 (employing a CNN),
the boundary conditions in branch network 2 (employing an FNN to take as input the values at
three sensor locations marked with a red cross mark), and computes the solution for the coordinates
which are inputs of the trunk network (employing an FNN). The loss function, L, is the sum of
three components: data loss, loss at the initial condition, and boundary loss. Each of these losses
is penalized with self-adaptive penalty parameters, λ1 , λ2 , λ3 , respectively. These parameters are
updated adaptively along with the weights and biases, θ = (W, b) of the networks
6.2.3 Physics-Informed DeepONet

The limitation of purely data-driven approaches is that generalizing the solution requires a large corpus of paired datasets. In many engineering applications, data
acquisition is prohibitively expensive, and the amount of available data is typically
minimal. As a result, in this “sparse data” environment, we draw motivation from
PINNs to train the DeepONet by directly incorporating known differential equations
into the loss function, along with some labeled datasets. The outputs of the DeepONet
are differentiable with respect to their input coordinates, thereby allowing the use
of automatic differentiation to develop an appropriate regularization mechanism for
biasing the target output functions to satisfy the underlying PDE constraints. Keeping
in mind the computational cost, for higher dimensional PDEs one could use a finite difference approach, or parametrize the solution and use a Sobel filter (Zhu et al. 2019) to compute the derivatives.
The hybrid loss function for this framework is defined as follows:

$$
\begin{aligned}
\mathcal{L} &= \mathcal{L}_{data} + \mathcal{L}_{physics}, \quad \text{where } \mathcal{L}_{physics} = \mathcal{L}_{init} + \mathcal{L}_{bound} + \mathcal{L}_{pde}, \\
\mathcal{L}_{init} &= \frac{1}{N_{init}} \sum_{i=1}^{N_{init}} \left[ \mathcal{G}_{\theta}(v)(x, y, t_0) - u(x, y, t_0) \right]^2, \\
\mathcal{L}_{bound} &= \frac{1}{N_{bound}} \sum_{i=1}^{N_{bound}} \left[ \mathcal{G}_{\theta}(v)(x_b, y_b, t) - u(x_b, y_b, t) \right]^2, \\
\mathcal{L}_{pde} &= \frac{1}{N_f} \sum_{i=1}^{N_f} \left[ f(\mathcal{G}_{\theta}) \right]^2
\end{aligned}
\qquad (6.9)
$$

where Ninit denotes the number of initial data points, Nbound denotes the number
of boundary points, and N f denotes the number of collocation points or integration
points on which the PDE is evaluated. Additionally, f (Gθ ) denotes the residual form
(Wang et al. 2021) or the variational form (Goswami et al. 2022d) of the governing
equation. L pde acts as an appropriate regularization mechanism for biasing the target
output functions to satisfy the underlying PDE constraints when very few labeled data
are available. Examples discussed in Sects. 6.6.2.1 and 6.6.2.2 are solved using the
additional physics-informed loss term to obtain the optimized network parameters.
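The structure of the hybrid loss in Eq. (6.9) can be sketched as follows. The residual term uses automatic differentiation through the trunk coordinates; for concreteness, the sketch (ours, not the authors' code) assumes a one-dimensional diffusion equation u_t = D u_xx as the governing PDE, a single input function in the branch (batch size of one), and the illustrative DeepONet interface used earlier.

```python
import torch

def diffusion_residual(model, v, xi, D: float = 0.01):
    """PDE residual u_t - D*u_xx for an assumed 1D diffusion equation.

    `xi` holds (x, t) collocation points; `v` is a single input function (batch of 1).
    """
    xi = xi.clone().requires_grad_(True)
    u = model(v, xi)                                              # shape (1, n_points)
    du = torch.autograd.grad(u.sum(), xi, create_graph=True)[0]   # d u / d (x, t)
    u_x, u_t = du[:, 0], du[:, 1]
    u_xx = torch.autograd.grad(u_x.sum(), xi, create_graph=True)[0][:, 0]
    return u_t - D * u_xx

def hybrid_loss(model, v, xi_data, u_data, xi_init, u_init, xi_bound, u_bound, xi_colloc):
    """Eq. (6.9): labeled-data, initial-condition, boundary, and PDE-residual terms."""
    l_data = ((model(v, xi_data) - u_data) ** 2).mean()
    l_init = ((model(v, xi_init) - u_init) ** 2).mean()
    l_bound = ((model(v, xi_bound) - u_bound) ** 2).mean()
    l_pde = (diffusion_residual(model, v, xi_colloc) ** 2).mean()
    return l_data + l_init + l_bound + l_pde
```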

6.3 FNO and Its Extensions

The Fourier neural operator (FNO) is based on replacing the kernel integral operator
with a convolution operator defined in Fourier space. The operator takes input func-
tions defined on a well-defined, equally spaced lattice grid and outputs the field of
interest on the same grid points. The network parameters are defined and learned in
the Fourier space rather than in the physical space, i.e. the coefficients of the Fourier
series of the output function are learned from the data. A schematic representation of
the FNO is shown in Fig. 6.3, which can be viewed as a DeepONet with a convolu-
tion neural network in the branch net to approximate the input functions and Fourier
basis functions in the trunk net. In particular, the network has three components:
first, the input function v(x) is lifted to a higher dimensional representation h(x, 0),
through a lifting layer, P, which is often parameterized by a linear transformation or
a shallow neural network. Then, the neural network architecture is formulated in an
iterative manner: h(x, 0) → h(x, 1) → h(x, 2) → · · · → h(x, L), where h(x, j),
j = 0, . . . , L, is a sequence of functions representing the values of the architecture
at each layer. Each layer is defined as a nonlinear operator via the action of the sum
of Fourier transformations and a bias function:

$$
h(x, j+1) = \mathcal{L}^{FNO}_{j}[h(x, j)]
:= \sigma\left( W_j \, h(x, j) + \mathcal{F}^{-1}\big[ R_j \cdot \mathcal{F}[h(\cdot, j)] \big](x) + c_j \right)
\qquad (6.10)
$$

Here, we use σ to denote the activation function. W j , c j , and R j are trainable param-
eters for the jth layer, such that each layer has different parameters (i.e. different
kernels, weights, and biases). Lastly, the output u(x) is obtained by projecting h(x, L)
through a local transformation operator layer, Q. In Sect. 6.6.1.1, we demonstrate
the performance of conventional data-driven FNO on learning the solution operator
for PDEs.
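A single Fourier layer of Eq. (6.10) can be written compactly with the real FFT. The 2D sketch below is ours and, for brevity, keeps only the lowest Fourier modes in one corner of the spectrum (a full implementation also treats the negative-frequency modes); the channel counts and mode numbers are illustrative.

```python
import torch
import torch.nn as nn

class FourierLayer2d(nn.Module):
    """One Fourier layer: sigma(W h + F^{-1}[R . F[h]] + c), cf. Eq. (6.10)."""
    def __init__(self, channels: int, modes: int):
        super().__init__()
        self.modes = modes                                        # retained Fourier modes
        scale = 1.0 / (channels * channels)
        self.R = nn.Parameter(scale * torch.randn(channels, channels, modes, modes,
                                                  dtype=torch.cfloat))
        self.W = nn.Conv2d(channels, channels, kernel_size=1)     # pointwise linear map W

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, channels, nx, ny)
        h_ft = torch.fft.rfft2(h)                                 # FFT over the spatial dims
        out_ft = torch.zeros_like(h_ft)
        m = self.modes
        # Multiply the retained low modes by the learnable spectral weights R.
        out_ft[:, :, :m, :m] = torch.einsum("bixy,ioxy->boxy", h_ft[:, :, :m, :m], self.R)
        h_spec = torch.fft.irfft2(out_ft, s=h.shape[-2:])         # back to physical space
        return torch.relu(self.W(h) + h_spec)

layer = FourierLayer2d(channels=32, modes=12)
h_next = layer(torch.randn(4, 32, 64, 64))                        # (4, 32, 64, 64)
```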

6.3.1 Feature Expansion in FNO

One important requirement of FNO is that the input function is defined on a lattice
grid, which makes FNO often difficult to apply for problems where the input function
is defined on a few points (such as the boundary condition and initial condition) to be
mapped to the solution over the whole domain. Additionally, if the problem domain
is defined on an unstructured mesh like in complex geometries, implementing the
conventional FNO architecture is a challenge. To address these issues, several feature
expansions were proposed in Lu et al. (2022a).
dFNO+: This feature is implemented for problem setup where the domain of the
input function is different from the domain of the output function. For example,
if we want to map the initial condition for a problem to the spatial and temporal
evolution of the solution, we define the mapping as

$$
\mathcal{G} : v(x, 0) \rightarrow v(x, t)
\qquad (6.11)
$$

where v(x, 0) defines the initial condition and v(x, t) is the evolved dynamics. In
such cases, the input space is defined on the spatial domain and so it is difficult to map
Fig. 6.3 Schematic representation of the Fourier neural operator. The input function, v(x), is
transformed to a higher dimensional representation using a shallow neural network, P , and then
operated by a series of L Fourier layers. Within each Fourier layer, a Fourier transform, F , of the
input is obtained to filter out the higher modes using a linear transform, R, and then converted back to
physical space using inverse Fourier transform, F −1 . Along with the Fourier space transformation,
a residual connection with a weight matrix, W, is applied to the input function and added and is later
acted upon by a nonlinear activation function. At the end of the Fourier layers, a local transformation
Q is applied by employing a shallow neural network to convert the output space to a dimension of
the input grid points

the output to the spatial and the temporal domain simultaneously. To this end, we
propose two approaches. In the first approach, we define a new input function, ṽ(x, t),
which has an additional temporal component and is defined as ṽ(x, t) = v(x). As our
second approach, we propose to define the output space employing a recurrent neural
network such that the solution operator is decomposed into a series of operators and
the solution of each time step is obtained iteratively using a time marching scheme
such that G : v(x, t) → v(x, t + Δt). Alternatively, the input space could be defined
as a subset of the output space, while attempting to map the boundary condition to
the solution defined over the entire domain. In such cases, the input function can be
padded with zeros for the domain’s interior points and then considered as input to
the neural operator.
gFNO+: FNO employs discrete Fast Fourier Transform (FFT), which necessitates
the definition of the input and output functions defined on a Cartesian domain with
a lattice grid mesh. However, for problems defined on complex real-life geometry,
an unstructured mesh is typically used, requiring us to deal with two issues: (1)
non-Cartesian domain and (2) non-lattice mesh. To handle issues associated with the
input and output functions defined on a non-Cartesian domain, we define a bounding
box and project the input and the output space by “nearest neighbor” to maintain
continuity at the boundaries. Alternatively, for issues associated with the non-lattice
mesh, we perform interpolation between the unstructured mesh and a lattice grid
mesh. The examples described in Sects. 6.6.1.2 and 6.6.1.3 are solved with the
combination of dFNO+ and gFNO+.
Wavelet Neural Operators: In Tripura and Chakraborty (2022), the Wavelet Neural
Operator (WNO) was proposed, which learns the network parameters in the wavelet
space that are both frequency and spatial localized, thereby can learn the patterns
in the images and/or signals more effectively. Specifically, the Fourier integral of
FNO was replaced by wavelet integrals for capturing the spatial behavior of a signal
or for studying the system under complex boundary conditions. It was shown that
WNO can handle domains with both smooth and complex geometries, and it was
applied in learning solution operators for a highly nonlinear family of PDEs with
discontinuities and abrupt changes in the solution domain and the boundary.

6.3.2 Implicit FNO

In the vanilla FNO, to guarantee the universal approximation property different train-
able parameters are employed for each Fourier layer (Kovachki et al. 2021a). Hence,
the number of trainable parameters increases as the network gets deeper, which
makes the training process of the FNO more challenging and potentially prone to
overfitting. In You et al. (2022a) (see Fig. 6.6 of Sect. 6.6.1.1), it was found that when
the network gets deeper, the training error decreases in the FNO while the test error
becomes much larger than the training error, indicating that the network is overfitting
the training data. Furthermore, if one further increases the number of hidden layers L,
training the FNOs becomes challenging due to the vanishing gradient phenomenon.
To improve the neural network’s stability performance in the deep layer limit,
in You et al. (2022f), You et al. proposed to model the PDE solutions of unknown
governing laws as the implicit mappings between given loading/boundary conditions
and the resultant solution, with the neural network serving as a surrogate for the
solution operator. Based on this idea, the implicit FNO (IFNO) architecture was
developed, which can be interpreted as a data-driven surrogate of the fixed point
procedure, in the sense that the increment of fixed point iterations is modeled as an
autonomous operator between layers. As illustrated in Fig. 6.4, in IFNOs the same
parameter set is employed for each iterative layer, with the layer update written as

$$
h(x, t + \Delta t) = \mathcal{L}^{IFNO}[h(x, t)]
:= h(x, t) + \Delta t \, \sigma\left( W h(x, t) + \mathcal{F}^{-1}\big[ R \cdot \mathcal{F}[h(\cdot, t)] \big](x) + c \right)
\qquad (6.12)
$$

Here, the trainable parameters W, R, and c are taken to be layer independent, so the number of trainable parameters does not increase with the number of layers,
alleviating the major bottleneck of the overfitting issue encountered by the original
FNO in Eq. (6.10). Moreover, this feature also enables the straightforward application
of the shallow-to-deep initialization technique (Ruthotto and Haber 2019). In fact,
the index of integral layers is identified with the number of time steps in a time-
discretization scheme. By dividing both sides of Eq. (6.12) by Δt, the term (h(·, t +
Δt) − h(·, t))/Δt corresponds to the discretization of a first-order derivative, and
Eq. (6.12) can be interpreted as a nonlinear differential equation in the limit of deep
layers, i.e. as Δt → 0. Thus, the optimal parameters (W , R, and c) of a shallow
network can be interpolated and reused in a deeper one as initial guesses.
Fig. 6.4 A schematic representation for the implicit Fourier Neural Operator (IFNO), which
enhances the vanilla FNOs architecture with reduced memory cost and improved stability in the
deep layer limit. This architecture also employs the lifting layer, P , and the projection layer, Q, as in
the original FNOs, and proposes a modified model for the Fourier layer. Within each Fourier layer,
the number of layers is identified with the number of time steps in a time-discretization scheme,
and the increments between layers are parameterized via the action of the sum of Fourier space
transformation and the local linear transformation, in a manner that all Fourier layers share the
same set of trainable parameters. As such, the iterative layers can be interpreted as a discretized
autonomous integral differential equation

In IFNO, a forward pass through a very deep network is analogous to obtaining the
PDE solution as an implicit problem, and the universal approximation capability is
guaranteed as far as there exists a convergent fixed point equation. Since the proposed
architecture is built as a modification of the FNO, it also parameterizes the integral
kernel directly in the Fourier space and utilizes the fast Fourier transformation (FFT)
to efficiently evaluate the integral operator. Hence, IFNO inherits the advantages
of FNO on resolution independence and efficiency, while demonstrating not only
enhanced stability but also improved accuracy in the deep network limit. In Sect.
6.6.2.3, we demonstrate the performance of IFNO on biological tissue modeling
based on digital image correlation (DIC) measurements.
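Reusing the illustrative FourierLayer2d sketched in Sect. 6.3, the shared-parameter iteration of Eq. (6.12) amounts to applying one layer L times with a residual update and step size Δt = 1/L. The sketch below is our own and is not the reference IFNO implementation.

```python
import torch
import torch.nn as nn

class IFNO2d(nn.Module):
    """Sketch of an implicit FNO: a single Fourier layer iterated L times, cf. Eq. (6.12)."""
    def __init__(self, in_channels: int, width: int, modes: int, num_layers: int):
        super().__init__()
        self.P = nn.Conv2d(in_channels, width, kernel_size=1)     # lifting layer P
        self.layer = FourierLayer2d(width, modes)                  # one shared Fourier layer
        self.Q = nn.Conv2d(width, 1, kernel_size=1)                # projection layer Q
        self.L = num_layers

    def forward(self, v: torch.Tensor) -> torch.Tensor:
        h = self.P(v)
        dt = 1.0 / self.L
        for _ in range(self.L):
            # h <- h + dt * sigma(W h + F^{-1}[R F[h]] + c), same parameters at every step
            h = h + dt * self.layer(h)
        return self.Q(h)

model = IFNO2d(in_channels=1, width=32, modes=12, num_layers=16)
u = model(torch.randn(4, 1, 64, 64))                               # (4, 1, 64, 64)
```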

6.3.3 Physics-Informed FNO

Motivated by physics-informed DeepONet (PI-DeepONet) proposed in Goswami et al. (2022d) and Wang et al. (2021), the physics-informed FNO (PINO) was pro-
posed in Li et al. (2021), Konuk and Shragge (2021) as an integration of operator
learning and physics-informed settings. PINO reduces the labeled dataset require-
ment for training the neural operator and helps in faster convergence of the solution.
In this setting, it is important to note that, unlike DeepONet, FNO outputs the solution
on a grid that employs FFT. Hence, the gradients of the output function with respect
to the input space cannot be computed using the automatic differentiation library
commonly employed in machine learning algorithms. To this end, the following
approaches can be used to explicitly define the gradients:
1. Using conventional numerical gradients such as finite difference and Fourier gradients. These approaches require either a fine discretization (for finite differences, otherwise the numerical error would be magnified) or a smooth and uniform grid (for spectral methods).
2. Applying automatic differentiation of the sum of the Fourier coefficient at every
spatial location (without doing the inverse FFT) and the value of a query func-
tion defined as an interpolation or a low-rank integral operator (Kovachki et al.
2021b).
3. Explicitly defining the gradients on the Fourier space and applying the chain
rule to compute the required quantities.
The authors of Li et al. (2021) have computed the exact gradient by defining the
gradient on the Fourier space. Additionally, the linear mapping that arises due to
the residual connection with the weight matrix, W shown in Fig. 6.3, is interpolated
using the Fourier method. To optimize the network parameters, the loss function
defined in Eq. (6.9) is minimized.
Besides the scenarios where full physics constraints, such as the known governing
equations, are provided, in many real-world modeling tasks, only partial physics laws
are available. To improve the learning efficacy on such problems, in You et al. (2022e)
the authors proposed to impose partial physics knowledge via a soft penalty constraint
to the neural operator. In particular, an IFNO was built to model the heterogeneous
material responses from the DIC displacement tracking measurements of multiple
biaxial stretching protocols on a porcine tricuspid valve anterior leaflet. Both the
constitutive model and the material microstructure are unknown, and hence there is no
known governing law that can be imposed. The authors then proposed to infuse the no-
permanent-set assumption to guide the training and prediction of the neural operators.
In other words, when the specimen is at rest, one should observe a zero displacement
field in the specimen. Specifically, a hybrid loss function L = Ldata + λL physics is
employed, with the same data-driven loss as defined in Eq. (6.9) and the physics-
informed loss defined as a penalization term:

$$
\mathcal{L}_{physics} := \left[ \mathcal{G}_{\theta}(0)(x) \right]^2.
$$

Here, 0(x) denotes the input function valued zero everywhere, and λ > 0 is a tunable
hyperparameter to balance the data-driven loss and the physics-informed loss. This
method was shown to improve the extrapolative performance of IFNO in the small
deformation regime.
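The no-permanent-set constraint above reduces to a one-line penalty; a sketch (ours), assuming a neural operator `model` that maps a discretized input function to the displacement field on the same grid:

```python
import torch

def no_permanent_set_penalty(model, grid_shape, lam: float = 1.0):
    """Penalize a non-zero predicted displacement for the zero input function 0(x)."""
    zero_input = torch.zeros(1, 1, *grid_shape)     # input function valued zero everywhere
    return lam * (model(zero_input) ** 2).mean()

# total_loss = data_loss + no_permanent_set_penalty(model, (64, 64))
```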
6.4 Graph Neural Operators

Machine learning using graphs is becoming increasingly popular mainly because of its ability to capture the graph (non-local) structure of the data. The concept of every
node of a graph consisting of information from every other node is quite enriching.
Machine learning tasks using a graph neural network can be of different types. For
example, it can be a node-level prediction (where one wants to determine what is
the attribute or properties of a particular node), edge-level prediction (whether or not
an edge exists between given nodes), or it may be a full graph prediction (where one
wants to classify the whole graph).

6.4.1 Graph Neural Networks

A graph neural network (GNN) is a transformation, which can be optimized based on all attributes of the graph (nodes, edges, global context). This transformation has
to preserve the graph symmetry (permutation invariance). In GNNs, the message-
passing layer (MPL) plays a key role and forms the core component. The set of
operations that happens in message-passing layers can be summarized in three major
steps: (1) the Message step where information is gathered for each of the nodes and
its edges; (2) the Aggregate step where the information from all the neighboring
nodes is gathered for each node; and (3) the Update step where the information from
the previous steps are combined to update the state of each node. Mathematically,
the three steps can be expressed with the equation

$$
h_i^{(k+1)} = \text{update}^{(k)}\Big( h_i^{(k)},\ \text{aggregate}^{(k)}\big( h_j^{(k)},\ \forall j \in N(i) \big) \Big)
\qquad (6.13)
$$

where $h_i^{(k)}$ is the state of node i at the kth step, and N(i) represents the neighborhood of
node i which consists of the nodes that have a direct edge connection with i. Here,
update and aggregate can be defined as different kinds of functions based on specific
learning tasks, such as mean, max, normalized sum, a Multi-Layer Perceptron (MLP),
and Recurrent Neural Network (RNN), just to name a few. All these functions should
be designed to preserve the permutation invariance as desired. Figure 6.5 shows a
schematic representation of the kth MPL on a graph with five nodes. Usually, in the
input layer, a node embedding is generated to represent the feature of each node. The
goal of node embedding is to encode nodes so that similarity in the embedding space
(e.g. dot product) approximates similarity in the original network. To date, the
two major GNNs that are widely used are Graph Convolution Network (GCN) (Kipf
and Welling 2016) and Graph Attention Network (GATs) (Veličković et al. 2017).
In GCN, the equivalent of Eq. (6.13) is given by

$$
h_i^{(k+1)} = \sigma\left( \hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}} h_i^{(k)} W^{(k)} \right)
\qquad (6.14)
$$
Fig. 6.5 A schematic representation of the message-passing layer (MPL) in graph neural networks (GNNs). Herein, $h_i^{j}$ represents the embedded feature of node i in the jth layer. In the message and aggregate steps, information is gathered from each node, its edges, and its neighboring nodes. Then, the node feature is updated to $h_i^{j+1}$, the feature of node i in the (j + 1)th layer

where $\hat{A} = A + I$ represents the adjacency matrix $A$ together with self-loops, $\hat{D}$ is the degree matrix, and the term $\hat{D}^{-\frac{1}{2}} \hat{A} \hat{D}^{-\frac{1}{2}}$ normalizes $\hat{A}$. $W$ is a shared weight matrix for projecting output features into a lower dimensional subspace and $\sigma$ is the
ReLU activation function. Overall, the GCN methodology produces the normalized
sum of the neighbor’s node features.
GAT can be seen as a modified version of GCN, improving its generalizability. GAT uses the attention mechanism as a substitute for the statically normalized convolution operation, so that more important nodes receive higher weight during neighborhood aggregation. GAT introduces three more steps prior to using the same normalized aggregation of GCN: (i) a linear transformation $z_i^{(k)} = w^{(k)} h_i^{(k)}$ to transform the input features $h_i^{(k)}$ into higher level features, (ii) computing pairwise un-normalized attention coefficients between two neighbors, $e_{ij}^{(k)} = \text{LeakyReLU}\big( a^{(k)\top} \big( z_i^{(k)} \,\|\, z_j^{(k)} \big) \big)$, where $\|$ denotes concatenation, and (iii) using a softmax activation function, $\alpha_{ij}^{(k)} = \exp\big(e_{ij}^{(k)}\big) / \sum_{l \in N(i)} \exp\big(e_{il}^{(k)}\big)$, that makes the coefficients easily comparable across different nodes.
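For a small graph whose adjacency matrix fits in memory, the GCN propagation rule of Eq. (6.14) can be implemented densely in a few lines; the sketch below is our own illustration.

```python
import torch
import torch.nn as nn

class GCNLayer(nn.Module):
    """Dense GCN layer: H' = ReLU(D^{-1/2} (A + I) D^{-1/2} H W), cf. Eq. (6.14)."""
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.W = nn.Linear(in_features, out_features, bias=False)

    def forward(self, H: torch.Tensor, A: torch.Tensor) -> torch.Tensor:
        # H: (n_nodes, in_features) node features, A: (n_nodes, n_nodes) adjacency matrix
        A_hat = A + torch.eye(A.shape[0])                         # add self-loops
        d_inv_sqrt = A_hat.sum(dim=1).pow(-0.5)                   # diagonal of D^{-1/2}
        A_norm = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
        return torch.relu(A_norm @ self.W(H))                     # normalized neighbor sum

layer = GCNLayer(16, 32)
A = (torch.rand(10, 10) > 0.7).float()
A = ((A + A.T) > 0).float()                                       # random symmetric graph
out = layer(torch.randn(10, 16), A)                               # (10, 32)
```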

6.4.2 Integral Neural Operators Through Graph Kernel Learning

Here, we introduce the graph kernel network (GKN) approach and its variations,
which can be interpreted as a continuous version of a GNN as well as an integral
neural operator. The GKN is the first integral neural operator, which was introduced
in Li et al. (2020c) and has the foundation in the representation of the solution of
a PDE by Green’s function. As a motivating example, let us consider a PDE of the
form:
$$
(\mathcal{L}_a u)(x) = v(x), \quad x \in \Omega, \qquad u(x) = 0, \quad x \in \partial\Omega
$$
where a(x) is the parameter field, v(x) is the loading term which acts as the input
function in the solution operator, and u(x) is the PDE solution which can be seen as
the output function. We can define Green’s function G : Ω × Ω → R as the unique
solution to the problem under relatively general constraints on La , such that

La G(x, ·) = δx ,

where $\delta_x$ is the delta measure on $\mathbb{R}^d$ centered at x. Because G(x, y) is dependent on the coefficient a, we will refer to it as $G_a$ from here on. Then the ground-truth
solution operator, G, can be expressed as an integral operator of Green’s function:

$$
u(x) = \mathcal{G}(v)(x) := \int_{\Omega} G_a(x, y)\, v(y)\, dy
$$

When $\mathcal{L}_a$ is uniformly elliptic, for example, Green's function is generally continuous at positions $x \neq y$, and hence using a neural network κ to model the kernel becomes
intuitive. In particular, the model writes

$$
u(x) = \int_{\Omega} \kappa(x, y, a(x), a(y))\, v(y)\, dy
\qquad (6.15)
$$

with κ being a shallow neural network taking x, y, a(x), and a(y) as its inputs. Based
on this idea, two graph-based neural operators, namely, the graph kernel network (Li
et al. 2020c) and the non-local kernel network (You et al. 2022a) were constructed
and will be detailed below. Here, both graph-based neural operators were constructed
with the data-driven loss only, and hence the physics is introduced through data. We
also point out that the ideas of imposing full or partial physics in Sects. 6.2.3 and
6.3.3 can be easily applied to these graph-based neural operators as well.
Graph Kernel Networks (GKNs): The idea of GKNs comes from parameterizing
Green’s function in an iterative architecture (Li et al. 2020c). As an integral neural
network similar to the FNOs, GKNs are also composed of a lifting layer P, iterative
kernel integration layers, and a projection layer Q. While the lifting layer and the
projection layer share the same architecture as the FNO, it is assumed that the iterative
kernel integration part is invariant across layers, with the update of each layer network
given via the action of the summation of a non-local integral operator Eq. (6.15) and
a linear operator:

$$
h(x, j+1) = \mathcal{L}^{GKN}[h(x, j)]
:= \sigma\left( W h(x, j) + \int_{\Omega} \kappa(x, y, a(x), a(y); \theta)\, h(y, j)\, dy + c \right)
\qquad (6.16)
$$
Similar to FNOs, in GKNs the nodes within each layer are treated as a continuum,
so each layer representation can be seen as a function of the continuum set of nodes $D \subset \mathbb{R}^d$. $\kappa \in \mathbb{R}^{s \times s}$ is a tensor kernel function that takes the form of a (usually shallow)
NN whose parameters θ are to be learned through training. Different from the setting
in FNOs, in GKNs the parameters W , c, and θ are layer independent. As such, the
GKN resembles the original ResNet block (He et al. 2016), where the usual discrete
affine transformation is substituted by a continuous integral operator. In practice, the
interaction range of kernel κ(x, y, a(x), a(y); θ ) is often chosen based on the known
information about the true kernel of the application or based on the computational
efficiency needs. When taking the interaction range as Ω, i.e. every point in the
whole domain has an impact on x, then the model will be more expressive but
computationally expensive. On the other hand, one can restrict the interaction range
as the ball with radius r centered at x, i.e. $B_r(x)$, for efficiency purposes, keeping in mind that this choice might compromise the accuracy.
In Li et al. (2020c), the integral in Eq. (6.16) is realized through a message-passing
graph neural network architecture. In particular, the physical domain Ω is assumed
to be discretized as a set of points χ := {x1 , . . . , x J } ⊂ Ω. Then, these points are
treated as the nodes of a weighted, directed graph, such that an edge {i, j} is present when the representation of point $x_j$ has an impact on the representation of point $x_i$, i.e. $\kappa(x_i, x_j, a(x_i), a(x_j)) \neq 0$. Denoting N(x) as the neighborhood of each point
x ∈ χ according to the graph, with the message-passing algorithm of Gilmer et al.
(2017) the integral operator in Eq. (6.16) is implemented as an averaging aggregation
of messages:
$$
h(x, j+1) = \sigma\left( W h(x, j) + \frac{1}{|N(x)|} \sum_{y \in N(x)} \kappa(x, y, a(x), a(y); \theta)\, h(y, j) + c \right)
\qquad (6.17)
$$

where |N (x)| represents the total number of points in N (x). Comparing with the
FNOs described in Sect. 6.3, GKN has a more general kernel formulation since FNO
assumes κ(x, y, a(x), a(y); θ ) := κ(x − y) so as to allow the fast Fourier transform.
Hence, GKNs are theoretically more expressive than FNOs. Moreover, the general
kernel formulation also provides flexibility in model designs. As such, partial physics
knowledge such as the material isotropy (You et al. 2022c), homogeneity (You et al.
2021d, b), and rotational equivalence properties (Liu et al. 2022) can be explicitly
imposed by designing proper kernel models. However, when taking the interaction
range as the whole domain, i.e. N (xi ) = χ for all xi , the corresponding graph of Eq.
(6.17) will be fully connected, and the number of edges scales like O(J 2 ), which
makes the GKNs generally much more expensive than some other neural operators,
say, FNOs. To accelerate the computation of GKNs, several techniques were imple-
mented. In Li et al. (2020c), the Nyström approximation method is considered, which
samples uniformly at random the points of N (x), to reduce the complexity of compu-
tation when calculating the integral. In Li et al. (2020d), the multipole graph neural
Fig. 6.6 2D Darcy flow in a square domain. Demonstration of the stability performance of three
integral neural operators. Comparison of relative mean squared errors from GKNs, FNOs, and
NKNs when using the training set with grid size Δx = 1/15 (You et al. 2022a). Error bars represent
standard errors over five simulations. Left: errors on the training dataset. Right: errors on the test
dataset with different resolutions: Δx = 1/15, Δx = 1/30, and Δx = 1/60. Detailed experimental
settings and further numerical results are provided in Sect. 6.6.1.1

operator (MGNO) is proposed. By unifying a multi-resolution matrix factorization of the kernel with GNNs, MGNOs capture the global properties of the PDE solution
operator with a linear time complexity. As another approach to improve the learning
efficiency, in Gupta et al. (2021) the authors proposed to approximate the kernel of
the integral operator through a better basis representation by using the multiwavelet
transform.
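In code, the message-passing form of Eq. (6.17) evaluates a shallow kernel network on edge features and averages the resulting messages over the neighborhood. The dense all-to-all sketch below is ours (it omits the Nyström sub-sampling and multipole accelerations discussed above) and is only meant to expose the structure of one GKN layer.

```python
import torch
import torch.nn as nn

class GKNLayer(nn.Module):
    """Sketch of one graph kernel layer, Eq. (6.17), with all-to-all interactions."""
    def __init__(self, width: int, coord_dim: int = 2):
        super().__init__()
        edge_in = 2 * coord_dim + 2                 # edge features (x, y, a(x), a(y))
        self.kappa = nn.Sequential(nn.Linear(edge_in, 128), nn.ReLU(),
                                   nn.Linear(128, width * width))
        self.W = nn.Linear(width, width)
        self.width = width

    def forward(self, h, x, a):
        # h: (J, width) node states, x: (J, coord_dim) coordinates, a: (J,) parameter field
        J = x.shape[0]
        xi, xj = x[:, None, :].expand(J, J, -1), x[None, :, :].expand(J, J, -1)
        ai, aj = a[:, None, None].expand(J, J, 1), a[None, :, None].expand(J, J, 1)
        edge = torch.cat([xi, xj, ai, aj], dim=-1)                # (J, J, edge_in)
        K = self.kappa(edge).view(J, J, self.width, self.width)   # kernel matrices per edge
        msg = torch.einsum("ijpq,jq->ip", K, h) / J               # averaged aggregation
        return torch.relu(self.W(h) + msg)

layer = GKNLayer(width=32)
h_next = layer(torch.randn(50, 32), torch.rand(50, 2), torch.rand(50))
```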
Non-local Kernel Networks (NKNs). Although GKNs are resolution independent, in
You et al. (2022a) it was found that GKNs might become unstable when increasing
the number of iterative kernel integration layers, L. As shown in Fig. 6.6, when
L increases, then either there is no gain in accuracy or increasing values of the
loss function occur in GKNs. To obtain a reliable deep kernel network, which can
handle more complicated and general learning tasks, in You et al. (2022a) the non-
local kernel network (NKN) was proposed. NKNs share the same architecture of
the lifting and projection layers as the GKNs, FNOs, and IFNOs, but propose to
use an alternative form of the iterative integration layer. Similar to the IFNOs, in
NKNs the index of integral layers is also identified with the number of time steps
in a time-discretization scheme, by letting t = jΔt and given the ( j + 1)th network
layer presentation by

$$
h(x, t + \Delta t) = \mathcal{L}^{NKN}[h(x, t)]
:= h(x, t) + \Delta t \left( \int_{\Omega} \kappa(x, y, a(x), a(y); \theta) \big( h(y, t) - h(x, t) \big)\, dy - \beta(x; \zeta)\, h(x, t) + c \right)
\qquad (6.18)
$$

Here, the kernel tensor function κ is modeled by a NN parameterized by θ, and a reaction term, the tensor function β, is modeled by another NN parameterized by ζ.
Both κ and β are usually shallow NNs, such as the multi-layer perceptron (MLP),
with the parameters θ and ζ to be learned together with the biases, c. One can see
that the integral operator on the right-hand side of Eq. (6.18) can be interpreted as a
non-local Laplacian operator:

$$
\mathcal{L}^{diff}_{\kappa}[h] := \int_{\Omega} \kappa(x, y, a(x), a(y)) \big( h(y, t) - h(x, t) \big)\, dy.
$$

Therefore, the NKN architecture can indeed be interpreted as a non-local diffusion–reaction equation in the limit of deep layers, i.e. as Δt → 0:

$$
\frac{\partial h}{\partial t}(x, t) - \mathcal{L}^{diff}_{\kappa}[h](x, t) + \beta(x)\, h(x, t) = c.
$$
The stability of NKNs was analyzed via non-local vector calculus, showing that
when the kernel function κ is square-integrable and non-negative, and the reaction
parameter function β is positive and bounded, the learned non-local operator is
positive definite and the network is stable in the limit of deep layers. In You et al.
(2022a), when applied to the Poisson equation solution learning task it was found that
the NKNs’ amplification matrix is positive definite, while the GKNs’ matrix exhibits
negative eigenvalues, indicating that instabilities might occur. In Sect. 6.6.1.1, we
demonstrate the performance of NKNs, GKNs, and the vanilla FNOs, on learning
the solution operator for the 2D Darcy’s equation.
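Relative to the GKN layer sketched above, the NKN update of Eq. (6.18) differs in three places: the difference h(y) − h(x) inside the integral, the reaction term β(x)h(x), and the residual update with step Δt. A sketch of the modification (ours, with β simplified to act componentwise) is:

```python
import torch
import torch.nn as nn

class NKNLayer(nn.Module):
    """Sketch of one non-local kernel network update, cf. Eq. (6.18)."""
    def __init__(self, width: int, coord_dim: int = 2, dt: float = 0.1):
        super().__init__()
        edge_in = 2 * coord_dim + 2
        self.kappa = nn.Sequential(nn.Linear(edge_in, 128), nn.ReLU(),
                                   nn.Linear(128, width * width))
        self.beta = nn.Sequential(nn.Linear(coord_dim, 128), nn.ReLU(), nn.Linear(128, width))
        self.c = nn.Parameter(torch.zeros(width))
        self.width, self.dt = width, dt

    def forward(self, h, x, a):
        J = x.shape[0]
        xi, xj = x[:, None, :].expand(J, J, -1), x[None, :, :].expand(J, J, -1)
        ai, aj = a[:, None, None].expand(J, J, 1), a[None, :, None].expand(J, J, 1)
        K = self.kappa(torch.cat([xi, xj, ai, aj], dim=-1)).view(J, J, self.width, self.width)
        diff = h[None, :, :] - h[:, None, :]                       # h(y) - h(x), (J, J, width)
        nonlocal_term = torch.einsum("ijpq,ijq->ip", K, diff) / J  # non-local diffusion term
        reaction = self.beta(x) * h                                # beta(x) h(x), componentwise
        return h + self.dt * (nonlocal_term - reaction + self.c)   # no activation, cf. Eq. (6.18)

layer = NKNLayer(width=32)
h_next = layer(torch.randn(50, 32), torch.rand(50, 2), torch.rand(50))
```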

Remark 6.1 Here, we would like to point out that a similar idea of considering the
correspondence between the stable architecture and the stable PDEs was also recog-
nized recently in Graph Neural Diffusion (GRAND) (Chamberlain et al. 2021). In
GRAND, the authors interpreted GNN architectures from a mathematical framework
by different choices of the form of the diffusion equation and discretization schemes.
It was shown that more advanced and stable numerical schemes such as Runge–Kutta
and implicit schemes would help to improve the performance and amount to larger
multi-hop diffusion operators in the design of deep GNN architectures.

6.5 Neural Operator Theory

Beyond the pioneering work of Chen and Chen (1995) on the universal approximation
theory of operators for a single layer, other works have appeared only recently for
deep neural networks. The first theoretical work that extended the Chen and Chen
theorem to deep neural networks was in Lu et al. (2021). The paper by Deng et al.
(2022) considers the advection–diffusion equation, including nonlinear cases. The
authors have shown that DeepONet has exponential approximation rates in the linear
case. Moreover, they demonstrated that by emulating numerical methods, DeepONet
has algebraic convergence with respect to the network size. In the paper by Lanthaler
et al. (2022), the authors have extended the universal approximation theorem in Chen
and Chen (1995) and have removed the continuity and compactness assumptions.
Also, they have provided an upper bound for the DeepONet error by decomposing it
into three parts: encoding error, approximation error, and reconstruction error. They
have also proven lower bounds on the reconstruction error by utilizing optimal errors
for projections on finite-dimensional affine subspaces of separable Hilbert spaces.
They have used this to prove the two-sided bounds on the DeepONet error. This
construction also allows them to infer the size of the trunk net needed to approximate
the eigenfunctions of the covariance operator in order to obtain optimal reconstruction
errors. In Marcati and Schwab (2021), the authors have shown that for linear second-
order PDEs with non-homogeneous coefficients and source terms, DeepONet has
exponential expression rates for the coefficient-to-solution operators in the H 1 norm.
Additionally, their results also show that neural networks can emulate accurately the
(discrete) solution map of Galerkin methods for the elliptic PDEs mentioned above
with numerical integration. They have also proven that the DeepONet architecture has
size $O(|\log(\epsilon)|^{\kappa})$ for any $\kappa > 0$ depending on the physical space dimension. In paper
(Yu et al. 2021), the authors have shown that for non-polynomial activation functions,
an operator with a neural network of width five is arbitrarily close to any given
continuous nonlinear operator. They have also shown the theoretical advantages of
depth by constructing operator ReLU neural networks of depth $2k^3 + 8$ with constant
width, which they compare with other operators’ ReLU neural networks of depth k.
In paper (Kovachki et al. 2021a), the authors have shown that FNOs are uni-
versal, i.e. they can approximate any continuous operator to the desired accuracy.
However, they have also shown that in the worst case, the size of FNO can grow
super-exponentially in terms of the desired error for approximating a general Lips-
chitz continuous operator. They have proved rigorously that there exists a ψ-FNO,
which can approximate the underlying nonlinear operators efficiently for Darcy flow
and the incompressible Navier–Stokes equations. In paper (You et al. 2022f), You et
al. have shown that the IFNO is a universal solution-finding operator, in the sense
that it can approximate a fixed point method to the desired accuracy.
Apart from the data-driven approaches discussed above, the paper (De Ryck and
Mishra 2022) has presented error bounds for physics-informed operator learning for
both DeepONets and FNOs. Finally, in the paper (García et al. 2020), the authors have
proposed a general framework to analyze the rates of spectral convergence for a large family of graph Laplacians. They established a convergence rate of $O\!\left(\left(\tfrac{\log n}{n}\right)^{\frac{1}{2m}}\right)$.
Also, in Li et al. (2020c), Li et al. have shown that the Graph Kernel Network approach
has competitive approximation accuracy to classical and deep learning methods.

6.6 Applications

This section is divided into two parts that present several numerical examples that
were solved using either data-driven training of the neural operators (Sect. 6.6.1)
or physics-informed training (Sect. 6.6.2) to evaluate the performance of the neural
operators discussed above. To assess the efficacy of neural operators, we compute
the relative L 2 error of the predictions. Each example includes information about the
data generation, network architecture, activation function, and optimizer used.

6.6.1 Data-Driven Neural Operators

In this section, we present three problems to show the implementation of data-driven neural operators.

6.6.1.1 Darcy Flow in a Square Domain

As the first benchmark example, we consider the modeling problem of two-dimensional sub-surface flows through a porous medium with a heterogeneous
permeability field. In this example, we compare the results obtained using data-driven
FNOs, GKNs, and NKNs. The high-fidelity synthetic simulation data for this exam-
ple are described by the Darcy flow, which was first proposed in Li et al. (2020c),
and later considered in a series of neural operator studies (Li et al. 2020b, c, d; Lu
et al. 2022a; You et al. 2022a, f). The governing differential equation is defined as
follows:
−∇ · (K(x) ∇u(x)) = f,   x = (x, y),    (6.19)
subjected to u_D(x) = 0, ∀ x ∈ ∂Ω

where K is the conductivity parameter field which characterizes the material


microstructure, u is the hydraulic head, and f is a source term. In this context, the
goal is to learn the solution operator of Darcy’s equation and compute the solution
field u(x).
In this example, we consider a square physical domain Ω = [0, 1]2 , a fixed source
field f (x) = 1, and Dirichlet boundary condition u D (x) = 0. The aim is to obtain
the solution field u(x) for each realization of conductivity field K (x). That means
the neural operators are employed to learn the mapping from K (x) to u(x). As
standard simulations of sub-surface flow, the permeability K (x) is modeled as a two-
valued piecewise constant function with random geometry such that the two values
have a ratio of 4. Specifically, 140 samples of K (x) were generated according to
K ∼ ψ# N (0, (−Δ + 9I )−2 ), where ψ takes a value of 12 on the positive part of the
real line and a value of 3 on the negative. 100 samples are employed for the purpose
of training, while the rest forms the test dataset. Different resolutions of datasets are
down-sampled from a 241 × 241 grid solution generated by using a second-order
finite difference scheme, as provided in Li et al. (2020c). The corresponding data
can be found at Li (2020a). With the purpose of testing generalization properties
with respect to resolution, three datasets with different resolutions are considered
here: a “coarse” dataset with grid size Δx = 1/15, a “fine” dataset with grid size
Δx = 1/30, and a “finer” dataset with grid size Δx = 1/60. Here, the coarse set is

Fig. 6.7 2D Darcy flow in a square domain. A visualization of 16-layer FNO, GKN, and NKN performance on an instance of the conductivity parameter field K(x), when using the (normalized) "coarse" training dataset (Δx = 1/15) and testing on the "finer" dataset (Δx = 1/60). Here, the absolute pointwise error |u_data(x) − u_pred(x)| is plotted

employed for the purpose of training, and the performance of the resultant neural
operators was tested on datasets with all resolutions.
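As an illustration of how the three resolutions can be obtained, the reference 241 × 241 solution (assumed to lie on a uniform grid over [0, 1]²) can simply be strided; the file name below is a hypothetical placeholder.

```python
import numpy as np

u_full = np.load("darcy_solution_241x241.npy")   # hypothetical file name, shape (241, 241)
u_coarse = u_full[::16, ::16]                    # 16 x 16 grid,  dx = 1/15
u_fine   = u_full[::8,  ::8]                     # 31 x 31 grid,  dx = 1/30
u_finer  = u_full[::4,  ::4]                     # 61 x 61 grid,  dx = 1/60
```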
Results: We report the training/test errors of FNO, GKN, and NKN in Fig. 6.6 and
the plot of solutions in one representative test sample in the “finer” dataset in Fig.
6.7, where all neural operators have L = 16 layers. One can observe that all
solutions obtained with NKN are visually consistent with the ground-truth solutions,
while GKN loses accuracy near the material interfaces. FNO results are off in even
larger regions. These results provide a further qualitative demonstration of the loss
of accuracy in FNOs and GKNs from the previous sections. In this case, the averaged
relative test errors from GKN, FNO, and NKN are 4.71, 9.29, and 3.28%, respectively.
The results for this problem have been adapted from You et al. (2022a).

6.6.1.2 Darcy Flow in a Complex Domain

We now further consider Darcy flows in a triangular domain with a vertical notch
to map the boundary condition to the hydraulic head, u(x), with K (x, y) = 0.1,
and f = −1. In this illustration, we assess the outcomes of data-driven DeepONet,
dgFNO+, and POD-DeepONet.

The following Gaussian process is employed to generate samples of boundary


conditions for each boundary in the triangular domain:

u_D(x) ∼ GP(0, K((x, y), (x′, y′))),
K(x, x′) = exp[−(x − x′)²/(2l²)],  l = 0.2, and x, x′ ∈ [0, 1]    (6.20)
Specifically, 1084 nodes are employed in the numerical solver to discretize the geom-
etry, and 2000 different boundary conditions are generated of which 1900 realizations
are used for training, and the remaining is used to test the accuracy of the neural oper-
ators for unseen cases. We generate the paired dataset by solving Eq. (6.19) using
the MATLAB Partial Differential Equation Toolbox.
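A minimal sketch of this boundary-condition sampling (Eq. (6.20)) is given below; the number of discretization points per edge is an assumption, and a small jitter is added only for numerical stability of the Cholesky factorization.

```python
import numpy as np

def sample_boundary_conditions(n_samples, n_points=101, l=0.2, seed=0):
    """Draw u_D samples from a zero-mean GP with a squared-exponential kernel."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)
    K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2.0 * l ** 2))
    L = np.linalg.cholesky(K + 1e-10 * np.eye(n_points))   # jitter for stability
    return x, (L @ rng.standard_normal((n_points, n_samples))).T

x, u_D = sample_boundary_conditions(2000)   # 1900 samples for training, 100 for testing
```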
For DeepONet training, we have employed FNNs as both the branch and trunk networks, with 3 hidden layers of 128 neurons each in the trunk network and 2 hidden layers of 128 neurons in the branch network. The net-
work parameters are optimized using the Adam optimizer and the ReLU activation
function is used. Additionally, we now employ POD-DeepONet to approximate the
hydraulic head. To this end, we employ an FNN architecture with 3 hidden layers
of 128 neurons in the branch network, and the trunk consists of 32 dominant POD
modes. To train the Fourier neural operator, we have mapped the triangular domain
with a notch to a bounded square domain. To map the boundary conditions to the
solution of the hydraulic head, we implement a combination of feature expansions
(dgFNO+) discussed in Sect. 6.3.1. The network architecture for FNO employs a
shallow neural network, P, with 32 neurons, followed by 4 Fourier layers that retain
the 8 lowest Fourier modes. The last transformation, Q, is defined using an FNN with 2 layers
of 128 neurons each and employs the activation function ReLU.
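A hedged sketch of the corresponding DeepONet forward pass (in PyTorch) is given below; the number of boundary sensors is an assumption, and training details (Adam optimizer, loss over the 1084 mesh nodes) are omitted.

```python
import torch
import torch.nn as nn

class DeepONet(nn.Module):
    def __init__(self, n_sensors=101, p=128):
        super().__init__()
        # Branch net: 2 hidden layers of 128 neurons acting on the boundary-condition samples
        self.branch = nn.Sequential(
            nn.Linear(n_sensors, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, p))
        # Trunk net: 3 hidden layers of 128 neurons acting on (x, y) query coordinates
        self.trunk = nn.Sequential(
            nn.Linear(2, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, p))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, u_bc, xy):
        b = self.branch(u_bc)        # (n_samples, p)
        t = self.trunk(xy)           # (n_nodes, p)
        return b @ t.T + self.bias   # (n_samples, n_nodes) predicted hydraulic head

model = DeepONet()
u_pred = model(torch.randn(16, 101), torch.rand(1084, 2))  # 16 boundary samples, 1084 nodes
```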
Results: Relative L² errors of 1.0%, 2.02%, and 7.12% are obtained using POD-DeepONet, DeepONet, and dgFNO+, respectively. A representative case is shown in Fig. 6.8,
where the predicted results and the errors corresponding to DeepONet, dgFNO+,
and POD-DeepONet are shown for a specific boundary.

6.6.1.3 Flow in a Cavity

In this example, we consider a two-dimensional lid-driven flow in a square cavity


(i.e. Ω = [0, 1]²), to compare the ability of the networks to accurately predict the
velocity of the flow field for a given boundary condition using data-driven DeepONet,
dFNO+, and POD-DeepONet. The flow field can be described by the incompressible
Navier–Stokes equations:

∇ · u = 0,    (6.21)
∂t u + u · ∇u = −∇P + ν∇²u    (6.22)

Fig. 6.8 Darcy flow in a triangular domain with a notch. For a representative boundary condition, the hydraulic head is obtained using DeepONet, dgFNO+, and POD-DeepONet. The prediction errors for the three operator networks are shown against the respective plots. The ground truth is simulated using the PDE Toolbox in MATLAB. The predicted solutions and the ground truth share the same color bar, while the errors corresponding to each of the neural operators are plotted on the same color bar. The results are adopted from Lu et al. (2022a). Here, the absolute pointwise error |u_data(x) − u_pred(x)| is plotted

where u = (u, v), with u the velocity in the x-direction and v the velocity in the y-direction, P denotes the pressure, and ν is the kinematic viscosity of the
fluid. We consider a case with different boundary conditions for the upper wall. In
particular, the boundary conditions are expressed as follows:
 
u = U [1 − cosh(r(x − L/2))/cosh(rL/2)],   v = 0    (6.23)

where U, r, and L are constants. In this setup, r = 10 and L = 1 is the length of the
cavity. In addition, the remaining walls are assumed to be stationary. The aforemen-
tioned equations are then solved using the lattice Boltzmann method (Meng and Guo
2015) to generate the training data. Interested readers could find more details on data
generation in Lu et al. (2022a). To generate the labeled dataset, we generate 100
velocity flow fields at various Reynolds numbers, Re = [100, 2080], with a step
size of 20. A total of 100 labeled datasets of the flow field were simulated, 90 of
which were used for training the neural operator, and 10 were used as unseen cases to
test the accuracy of the trained network. The operator network maps the upper wall
boundary condition to the converged velocity field. The DeepONet architecture uses
a convolution neural network in the branch net and an FNN with 3 hidden layers of
128 neurons in the trunk net. The hyperbolic tangent activation function tanh is used
for this problem. Next, we use POD-DeepONet with the same branch net architecture
and 6 modes to approximate the flow fields.
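For reference, the lid velocity profile of Eq. (6.23) can be evaluated as in the short sketch below; U would be set according to the target Reynolds number, and the grid resolution is only illustrative.

```python
import numpy as np

def lid_velocity(x, U=1.0, r=10.0, L=1.0):
    """Top-wall boundary condition of Eq. (6.23)."""
    u = U * (1.0 - np.cosh(r * (x - L / 2.0)) / np.cosh(r * L / 2.0))
    v = np.zeros_like(x)
    return u, v

x = np.linspace(0.0, 1.0, 101)
u_top, v_top = lid_velocity(x)
```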

Fig. 6.9 Steady cavity flow described by the Navier–Stokes equations. A representative case from the test dataset depicting the flow fields, u and v in the x- and y-directions, respectively, and the associated errors obtained using POD-DeepONet. The predicted solutions and the ground truth share the same color bar, while the errors corresponding to each of the neural operators are plotted on the same color bar. Here, the magnitude of the velocity error |u_data(x) − u_pred(x)| is plotted

Results: The relative L² errors are reported as 1.20%, 0.33%, and 0.63% for predicting flow fields for unseen boundary conditions using DeepONet, POD-DeepONet, and
dFNO+. A representative case is shown in Fig. 6.9, where the predicted results and
errors corresponding to POD-DeepONet are shown for a specific boundary. The
results for this problem have been adapted from Lu et al. (2022a).

6.6.2 Physics-Informed Neural Operators

In this section, we consider an additional loss term, L physics , as a regularization term


in the loss function, to optimize the hyperparameters of the neural operators along
with a few labeled datasets (see Sects. 6.6.2.1 and 6.6.2.2) and/or improve the neural
operators’ generalizability (see Sect. 6.6.2.3).

6.6.2.1 Brittle Fracture in a Plate Loaded in Shear

In this example, we aim to model the final crack path, given any initial location of the crack on a unit square plate that is fixed at the left and bottom edges and subjected to shear loading on the top edge. This example illustrates the benefits of using a physics-driven loss by comparing the results obtained using PI-DeepONet and data-driven DeepONet.

We use the phase-field approach (Goswami et al. 2019) to model the crack in the
domain. The material properties considered are λ = 121.15 kN/mm² and μ = 80.77 kN/mm², as Lamé's first and second parameters, respectively, and the critical energy release rate, Gc = 2.7 × 10⁻³ kN/mm. The boundary conditions of the setup are
denoted as follows:

u(x, 0) = v(x, 0) = 0, u(x, 1) = Δu, (6.24)

where u and v are the solutions of the elastic field in x- and y-axis, respectively, and
Δu is the incrementally applied shear displacement on the top edge of the plate. To
train the PI-DeepONet using the variational form, we minimize the total energy of
the system, which is defined as follows:

E = Ψe + Ψc,
where Ψe = ∫_Ω f_e(x) dx,  f_e(x) = (1 − φ)² ψe⁺(ε) + ψe⁻(ε),
Ψc = ∫_Ω f_c(x) dx,  f_c(x) = (Gc/(2l0)) (φ² + l0² |∇φ|²) − (1 − φ)² H(x, t)    (6.25)
where Ψe is the stored elastic strain energy, Ψc is the fracture energy, l0 is the length
scale parameter that controls the diffusion of the crack, ψe+ and ψe− are the tensile and
the compressive components of the strain energy densities obtained by the spectral
decomposition of the strain tensor, and H (x, t) is the strain-history functional. In
this example, we have used a hybrid loss function to train the network parameters.
The training samples (n = 11) are obtained for different initial crack lengths, lc ∈ [0.2, 0.7], in steps of 0.05. For the network architecture, the branch net and the trunk net are each four-layer fully connected neural networks with [100, 50, 50, 50] neurons. Once the solution is evaluated at the sampled points, the outputs for the
elastic field are modified to exactly satisfy the Dirichlet boundary conditions:

Gθ^u = [y(1 − y)] Ĝθ^u + y Δu,
Gθ^v = [y(y − 1)] × [x(x − 1)] Ĝθ^v    (6.26)

where Ĝθu and Ĝθv are obtained from the DeepONet. The conventional data-driven
DeepONet is trained with the same 11 samples, keeping the network architecture
of the branch net and the trunk net exactly the same. The synthetic data to train
the network is generated using the codes developed in Goswami et al. (2020a). To
improve the accuracy, the training samples are increased to 43.
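A short sketch of the output transform of Eq. (6.26), which makes the raw DeepONet outputs satisfy the Dirichlet boundary conditions exactly, is given below; the array names are illustrative and the transform is applied elementwise at the sampled points.

```python
import numpy as np

def apply_dirichlet_bc(G_u_hat, G_v_hat, x, y, du):
    """Output transform of Eq. (6.26): raw network outputs -> BC-satisfying displacement fields."""
    G_u = y * (1.0 - y) * G_u_hat + y * du
    G_v = y * (y - 1.0) * x * (x - 1.0) * G_v_hat
    return G_u, G_v

# elementwise check at the boundaries: u = v = 0 at y = 0 and u = du at y = 1
x = y = np.array([0.0, 0.5, 1.0])
u, v = apply_dirichlet_bc(np.ones(3), np.ones(3), x, y, du=0.22)
```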
Results: A prediction error of 2.16% on φ is reported when PI-DeepONet was
employed. Additionally, predictions of data-driven DeepONet have an error of 26.2
and 3.12% for φ when trained with 11 samples and 43 samples, respectively. Figure 6.10 presents the plots of the predicted solutions for lc = 0.375 mm,

Fig. 6.10 Shear failure: the PI-DeepONet is trained with 11 crack lengths to predict the final
damage path for any crack length when the height of the crack is fixed at the center of the left edge.
The plot is for lc = 0.375 mm, where Δu = 0.220 mm. The predicted displacement in x-direction
is plotted for two locations along the x-axis and is compared with ground truth to show the accuracy
of the prediction. The results are adopted from Goswami et al. (2022d). Here, for the phase-field parameter the absolute pointwise error |φ_data(x) − φ_pred(x)| is plotted, and for the displacement field the magnitude of the pointwise error |u_data(x) − u_pred(x)| is plotted

which are obtained using PI-DeepONet. The results for the data-driven DeepONet
suggest that it is unable to capture the crack diffusion phenomenon and also cannot
generalize to complex fracture phenomena with limited datasets.

6.6.2.2 Flow in Heterogeneous Porous Media

We consider a two-dimensional flow through heterogeneous porous media, governed by Eq. (6.19), and solve it using PI-DeepONet, which requires no labeled datasets and is trained using the governing physics of the problem. In this example, we aim to learn the operator such that:
Gθ : K (x) → h(x), (6.27)

where K (x) is spatially varying hydraulic conductivity and h(x) is the hydraulic
head. The setup is a unit square plate with a discontinuity of 5 × 10⁻³ mm.
For generating multiple conductivity fields for training the neural operator, we
describe the conductivity field, K (x), as a stochastic process. In particular, we take
K (x) = exp(F(x)), with F(x) denoting a truncated Karhunen–Loève (KL) expan-
sion for a certain Gaussian process, which is a finite-dimensional random variable.
The DeepONet is trained using the variational formulation of the governing equa-

tion and without any labeled input–output datasets. The optimization problem can
be defined as follows:
Minimize: E = Ψh ,
(6.28)
subject to: h(x) = 0 on ∂Ω D

where ∂Ω D represents the boundary of the domain and


 
Ψh = (1/2) ∫_Ω K(x) |∇h(x)|² dx − ∫_Ω h(x) dx    (6.29)

The branch and the trunk networks are two separate FNNs, each with 6 hidden layers and 32 neurons per hidden layer.
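A hedged sketch of the corresponding variational loss, Eqs. (6.28)-(6.29), is given below, with the integrals approximated by Monte Carlo averages over collocation points on the unit square and the gradient of the predicted head obtained by automatic differentiation; the model signature is an assumption and the boundary-loss term is omitted.

```python
import torch

def energy_loss(model, K, xy):
    """Monte Carlo estimate of Psi_h = 0.5*int K|grad h|^2 dx - int h dx (Eq. (6.29)).

    model: hypothetical callable mapping (conductivity sample, coordinates) -> h
    K:     conductivity values at the collocation points xy, shape (n_points,)
    xy:    collocation points in the unit square, shape (n_points, 2)
    """
    xy = xy.clone().requires_grad_(True)
    h = model(K, xy).squeeze(-1)                                # predicted head, (n_points,)
    grad_h = torch.autograd.grad(h.sum(), xy, create_graph=True)[0]
    integrand = 0.5 * K * (grad_h ** 2).sum(dim=-1) - h
    return integrand.mean()                                     # domain area is 1, so the mean suffices
```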
Results: The trained PI-DeepONet yields a predictive error of 3.12%. The loss tra-
jectory is shown in Fig. 6.11b. The prediction of h(x) for a representative sample of
K (x) using PI-DeepONet is shown in Fig. 6.11a. It is interesting to note that we have
tried to solve the problem by minimizing the residual (Wang et al. 2021). However,
the residual-based DeepONet is not able to approximate the solution of h(x) for a
given K (x).

Fig. 6.11 Flow in heterogeneous porous media: a The predicted h(x) for a given conductivity field,
K (x) (plotted on a log scale), is shown for a representative sample. True h(x) represents the ground
truth and is the simulated solution using the MATLAB PDE toolbox. The difference between the
predicted h(x) and the ground truth is shown in the error plot. b The plots show the loss trajectory
of the two components of the loss function. The plot on the left shows the decrease in energy of
the domain with respect to the number of epochs, while the plot on the right shows the boundary
loss term. The results are adopted from Goswami et al. (2022d). Here, the absolute pointwise error
|h_data(x) − h_pred(x)| is plotted

6.6.2.3 Biological Tissue Modeling From Experimental Measurements

To demonstrate the applicability and generalizability of neural operators in modeling


complex systems from noisy real-world data, in You et al. (2022e) the implicit FNO was employed to predict the material displacement field of biological tissue from digital image correlation (DIC) displacement field measurements, without postulating a specific constitutive model form or possessing knowledge of the material microstructure. As shown in Fig. 6.12, a material database is constructed from the
DIC displacement tracking measurements of seven biaxial stretching protocols with
different biaxial tension ratios on a porcine tricuspid valve anterior leaflet specimen.
Then, the material response is modeled as a solution operator from the loading to the
resultant displacement field, using the implicit FNO (You et al. 2022f) architecture,
to predict the response of this soft tissue specimen with arbitrary loading conditions.
To demonstrate the predictivity, we use various combinations of loading protocols
and compare their effectiveness with a finite element analysis approach based on
the phenomenological Fung-type model. To study the model performance for in-
distribution tests, we randomly selected 83% of the samples of all protocols to form
the training set and built the vanilla IFNO model and the Fung-type model based
on this common training set. Then, both models are validated and tested on the rest
of the samples. Moreover, to further investigate the generalization capability of the
models, we also perform the out-of-distribution prediction study, employing part of
the protocols for training and the rest of the protocols for testing. In particular, the
models are trained on protocols with biaxial tensions P11 : P22 = 1 : 1, P11 : P22 =
0.66 : 1, and P11 : P22 = 1 : 0.66, then tested on protocols with P11 : P22 = 1 : 0.33,
P11 : P22 = 0.33 : 1, P11 : P22 = 0.1 : 1, and P11 : P22 = 1 : 0.05. We notice that the
testing protocols are not covered in any of the training sets, and they have smaller
maximum tensions compared with the training sets. Hence, with this study, we aimed
to investigate the performance of the implicit FNO and the physics-guided implicit

Fig. 6.12 Problem and experimental setups for the biological tissue modeling example. a An
image of the speckle-patterned porcine tricuspid valve anterior leaflet (TVAL) specimen subject to
biaxial stretching, with the DIC tracking grid shown in green. b Schematic of a specimen subject to
Dirichlet-type boundary conditions, so the goal of neural operator learning is to provide a surrogate
mapping from the boundary displacement u D (x), x ∈ ∂Ω, to the displacement field u(x), x ∈ Ω.
c Illustration of the seven protocols of mechanical testing on a representative TVAL specimen. Here,
P11 and P22 denote the first Piola–Kirchhoff stresses in the x- and y-directions, respectively

Fig. 6.13 Biological tissue modeling from DIC displacement data. Upper: in-distribution pre-
diction with training set on 83% of randomly selected samples. a Sample-wise error comparison
between a conventional constitutive model (the Fung-type model) and the implicit FNO on all
biaxial testing protocol sets. b Visualization of the Fung-type model fitting and implicit FNO
performances on a representative test sample. Bottom: out-of-distribution prediction on the small
deformation regime, by training each model on protocol sets 1, 2, and 4 then testing on the rest of the
sets. c Sample-wise error comparison between the Fung-type model, the original implicit FNO, and
the physics-guided implicit FNO on testing protocol sets. d Visualization of the Fung-type model
fitting and physics-guided implicit FNO performances on a representative test sample. On (a) and
(c), the relative mean squared errors were plotted for each sample. On (b) and (d), the magnitudes
of displacement errors on each material point were plotted

FNO methods for predicting the out-of-distribution material responses in the small
deformation regime.
Results: From in-distribution tests (see Fig. 6.13a, b), we found that the proposed data-
driven approach presents good prediction capability to unseen loading conditions
with the same type of biaxial loading ratios and outperforms the phenomenological
model. Specifically, the implicit FNO model achieved only 1.64% relative error on
the test dataset, while the Fung-type model has a 10.83% error. To provide further
insights into this comparison, in Fig. 6.13b, both the x- and y-displacement solutions
and the prediction errors are visualized on a representative test sample. The Fung-type
model, which considered the homogenized stress–strain at one material point (i.e. the
center of the specimen) due to limited information about the spatial variation in the
stress measurement, failed to capture the material heterogeneity and hence exhibited
large prediction errors in the interior region of the TVAL specimen domain. This
observation confirms the importance of capturing the material heterogeneity and
verifies the capability of the neural operators in heterogeneous material modeling for
in-distribution learning tasks.

On the other hand, when tested on out-of-distribution loading ratios, the neural
operator learning approach becomes less effective and has comparable performance
as the constitutive modeling (see Fig. 6.13c, d). In particular, when the models are
trained on protocols in the large deformation region and then tested on protocols in
the small deformation region, 16.78% and 16.80% prediction errors were observed
from the implicit FNO and Fung-type model, respectively. To improve the gener-
alizability of the neural operators, partial physics knowledge was infused using the
no-permanent-set assumption discussed in Sect. 6.3.3, and this method is shown
to improve the model’s extrapolative performance in the small deformation regime
by around 1.5%. This study demonstrates that with sufficient data coverage and/or
regularization terms from partial physics constraints, the data-driven neural operator
learning approach can be a more effective method for modeling complex biological
materials than the traditional constitutive modeling approaches. The results for this
problem have been adapted from You et al. (2022e).

6.7 Summary and Outlook

In this chapter, we have reviewed the basics of three neural operators, DeepONet,
FNO, and the graph neural operator, as well as their extensions, and have presented
representative application examples. While the list of possible applications of neural
operators will continue to expand in the near future, here we provide a partial list of
their role in applications so far.
• For real-time forecasting: designing efficient control systems, fault detectors for
car engines, solving complex multi-physics problems in less than a second (Cai
et al. 2021).
• Proper hybridization of physics- and data-based models can achieve the goal of
generating an efficient, accurate, and generalizable model that can be used to
greatly accelerate the modeling of time-dependent multi-scale systems (Yin et al.
2022).
• To reduce the dependency of large paired datasets (when accurate information
about the governing law of the physical system is not available), Deep Transfer
Operator Learning (DeepONet) (Goswami et al. 2022e) can be used for accurate
prediction of quantities of interest in related domains, with only a handful of new
measurements.
• Develop faster ways to train neural operators by incorporating multi-modality and
multi-fidelity data (Lu et al. 2022b; Howard et al. 2022; De et al. 2022).
• The application of neural operators in life sciences is endless. For example,
approaches developed in Yin et al. (2022), Goswami et al. (2022f) show the appli-
cation of DeepONet for accurate prediction of aortic dissection and aneurysm,
which is patient specific and hence could provide clinicians sufficient time for
planning surgery.

• DeepONet can be used for accelerating climate modeling by adding learned high-
order corrections to the low-resolution (e.g. 100 km) climate simulations.
Next, we discuss possible new developments required in the future to further
advance physics-informed deep learning, and in particular neural operators.
In the last two decades, we have seen rapid advances in GPU computing that
together with the simultaneous advances in deep learning algorithms have enabled
the development of new hybrid models based on both physics-driven and data-driven
methods. Research teams are now working on developing high-fidelity digital twins
of the human organs and the Earth’s atmosphere, which will require long and expen-
sive training, even more than the expensive transformer language models developed
by big software companies involving hundreds of billions of parameters. To deal
with the increasing cost and cope with the urgent demand for a real-time inference
that requires only very little new training for problems at scale in computational
mechanics, higher levels of abstraction of these algorithms are required. To that end,
continual learning at the operator regression level is a promising avenue in establish-
ing the mathematical foundations of digital twins in computational mechanics and
beyond. Some authors and even industry researchers use concepts from principal
component analysis and reduced order modeling to build digital twins but the lack of
over-parametrization of such methods, even of their nonlinear extensions, is a lim-
iting factor for their effectiveness in realistic scenarios of diverse and unanticipated
operating conditions.
Adding physics into the training of neural operators in addition to any available
data enhances their accuracy and generalization capacity for tasks even outside the
distribution of the input space. Scalable physics-informed neural networks (Shukla
et al. 2022) can be employed to solve high-dimensional problems not possible with
traditional finite element solvers, e.g. up to approximately 10 dimensions if not
more. Similarly, scalable physics-informed neural operators can also solve high-
dimensional problems even in real time and can be used for designing very complex
systems. For example, in De et al. (2022), the authors solved an industry-based
problem of computing the power generated in a utility-scale wind plant with 64
turbines, considering the uncertainty of the wind speed, inflow direction, and yaw
angle (a schematic representation shown in Fig. 6.14). The layout of 64 wind turbines
is on a two-dimensional mesh which was used to generate the training data as shown in
Fig. 6.14b. The annual energy output of a wind farm is often calculated by estimating
the predicted power of the joint distribution of wind speed and direction; however, the
quantity of function evaluations necessary frequently prevents the use of high-fidelity
models in industry (King et al. 2020).
To that end, training a DeepONet or any other neural operator with only high-
fidelity data would be computationally expensive. One promising way to solve such
realistic problems is using multi-fidelity approaches proposed in De et al. (2022),
Lu et al. (2022b), Howard et al. (2022). These realistic problems can leverage the generalized flavor of DeepONet, which can be flexibly designed for any problem at
hand. Scaling neural operators to industry-level problems with parallel multi-GPU
training could be a very impactful research direction. Another interesting direction is


Fig. 6.14 a Wind farm layout of six wind turbines. b Arrangements of 8 × 8 wind turbines in a
farm. Adopted from De et al. (2022)

developing graph neural operators for modeling digital twins and complex systems-
of-systems, in particular, with the ability to apply causal inference for discovering
intrinsic pathways not easily discovered by other methods. Finally, to address the
excessive cost of training in deep learning and make edge computing a reality, devel-
oping the energetically favorable spiking neural networks on neuromorphic comput-
ers could lead to (at least) three orders of magnitude in energy savings while we may
come closer to more biologically plausible neural operator architectures; see also
Fig. 6.1.

References

Cai S, Wang Z, Lu L, Zaki TA, Karniadakis GE (2021) Deepm&mnet: Inferring the electrocon-
vection multiphysics fields based on operator approximation by neural networks. J Comput Phys
436:110296
Cai S, Wang Z, Wang S, Perdikaris P, Karniadakis GE (2021) Physics-informed neural networks
for heat transfer problems. J Heat Transf 143(6)
Chamberlain B, Rowbottom J, Gorinova MI, Bronstein M, Webb S, Rossi E (2021) Grand: Graph
neural diffusion. In: International conference on machine learning. PMLR, pp 1407–1418
Chen T, Chen H (1995) Universal approximation to nonlinear operators by neural networks with
arbitrary activation functions and its application to dynamical systems. IEEE Trans Neural Netw
6(4):911–917
De Hoop M, Huang DZ, Qian E, Stuart AM (2022) The cost-accuracy trade-off in operator learning
with neural networks. arXiv:2203.13181
De Ryck T, Mishra S (2022) Generic bounds on the approximation error for physics-informed (and)
operator learning. arXiv:2205.11393

De S, Hassanaly M, Reynolds M, King RN, Doostan A (2022) Bi-fidelity modeling of uncertain


and partially unknown systems using deeponets. arXiv:2204.00997
Deng B, Shin Y, Lu L, Zhang Z, Karniadakis GE (2022) Approximation rates of deeponets for
learning operators arising from advection-diffusion equations. Neural Netw
García Trillos N, Gerlach M, Hein M, Slepčev D (2020) Error estimates for spectral convergence
of the graph laplacian on random geometric graphs toward the laplace–beltrami operator. Found
Comput Math 20(4):827–887
Geelen R, Wright S, Willcox K (2022) Operator inference for non-intrusive model reduction with
nonlinear manifolds. arXiv:2205.02304
Gilmer J, Schoenholz SS, Riley PF, Vinyals O, Dahl GE (2017) Neural message passing for quantum
chemistry. In: International conference on machine learning. PMLR, pp 1263–1272
Goswami S, Anitescu C, Rabczuk T (2019) Adaptive phase field analysis with dual hierarchical
meshes for brittle fracture. Eng Fract Mech 218:106608
Goswami S, Anitescu C, Rabczuk T (2020a) Adaptive fourth-order phase field analysis for brittle
fracture. Comput Methods Appl Mech Eng 361:112808
Goswami S, Anitescu C, Rabczuk T (2020b) Adaptive fourth-order phase field analysis using deep
energy minimization. Theor Appl Fract Mech 107:102527
Goswami S, Anitescu C, Chakraborty S, Rabczuk T (2020c) Transfer learning enhanced physics
informed neural network for phase-field modeling of fracture. Theor Appl Fract Mech 106:102447
Goswami S, Yin M, Yu Y, Karniadakis GE (2022d) A physics-informed variational deeponet for
predicting crack path in quasi-brittle materials. Comput Methods Appl Mech Eng 391:114587
Goswami S, Kontolati K, Shields MD, Karniadakis GE (2022e) Deep transfer learning for partial differential
equations under conditional shift with deeponet. arXiv:2204.09810
Goswami S, Li DS, Rego BV, Latorre M, Humphrey JD, Karniadakis GE (2022f) Neural oper-
ator learning of heterogeneous mechanobiological insults contributing to aortic aneurysms.
arXiv:2205.03780
Gupta G, Xiao X, Bogdan P (2021) Multiwavelet-based operator learning for differential equations.
Adv Neural Inf Process Syst 34:24048–24062
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings
of the IEEE conference on computer vision and pattern recognition, pp 770–778
Howard AA, Perego M, Karniadakis GE, Stinis P (2022) Multifidelity deep operator networks.
arXiv:2204.09157
Jin P, Meng S, Lu L (2022) Mionet: Learning multiple-input operators via tensor product.
arXiv:2202.06137
Karniadakis GE, Kevrekidis IG, Lu L, Perdikaris P, Wang S, Yang L (2021) Physics-informed
machine learning. Nature Reviews. Physics 3(6):422–440
King R, Glaws A, Geraci G, Eldred MS (2020) A probabilistic approach to estimating wind farm
annual energy production with bayesian quadrature. In: AIAA scitech 2020 forum, p 1951
Kipf TN, Welling M (2016) Semi-supervised classification with graph convolutional networks.
arXiv:1609.02907
Kissas G, Seidman J, Guilhoto LF, Preciado VM, Pappas GJ, Perdikaris P (2022) Learning operators
with coupled attention. arXiv:2201.01032
Kontolati K, Goswami S, Shields MD, Karniadakis GE (2022) On the influence of over-
parameterization in manifold based surrogates and deep neural operators. arXiv:2203.05071
Konuk T, Shragge J (2021) Physics-guided deep learning using fourier neural operators for solving
the acoustic vti wave equation. In: 82nd EAGE annual conference and exhibition, European
association of geoscientists and engineers, pp 1–5
Kovachki N, Lanthaler S, Mishra S (2021a) On universal approximation and error bounds for fourier
neural operators. J Mach Learn Res 22:Art–No
Kovachki N, Li Z, Liu B, Azizzadenesheli K, Bhattacharya K, Stuart A, Anandkumar A (2021b)
Neural operator: Learning maps between function spaces. arXiv:2108.08481
Lanthaler S, Mishra S, Karniadakis GE (2022) Error estimates for deeponets: A deep learning
framework in infinite dimensions. Trans Math Its Appl 6(1):tnac001

Li Z (2020a) Graph based neural operators. https://github.com/zongyi-li/graph-pde


Li Z, Kovachki N, Azizzadenesheli K, Liu B, Bhattacharya K, Stuart A, Anandkumar A (2020b)
Fourier neural operator for parametric partial differential equations. arXiv:2010.08895
Li Z, Kovachki N, Azizzadenesheli K, Liu B, Bhattacharya K, Stuart A, Anandkumar A (2020c)
Neural operator: Graph kernel network for partial differential equations. arXiv:2003.03485
Li Z, Kovachki N, Azizzadenesheli K, Liu B, Stuart A, Bhattacharya K, Anandkumar A (2020d)
Multipole graph neural operator for parametric partial differential equations. Adv Neural Inf
Process Syst 33:6755–6766
Liu N, Yu Y, You H, Tatikola N (2022) Ino: Invariant neural operators for learning complex physical
systems with momentum conservation. Under Review
Li Z, Zheng H, Kovachki N, Jin D, Chen H, Liu B, Azizzadenesheli K, Anandkumar A (2021)
Physics-informed neural operator for learning partial differential equations. arXiv:2111.03794
Lu L, Jin P, Pang G, Zhang Z, Karniadakis GE (2021) Learning nonlinear operators via DeepONet
based on the universal approximation theorem of operators. Nat Mach Intell 3(3):218–229
Lu L, Meng X, Cai S, Mao Z, Goswami S, Zhang Z, Karniadakis GE (2022a) A comprehensive and
fair comparison of two neural operators (with practical extensions) based on fair data. Comput
Methods Appl Mech Eng 393:114778
Lu L, Pestourie R, Johnson SG, Romano G (2022b) Multifidelity deep neural operators for efficient
learning of partial differential equations with application to fast inverse design of nanoscale heat
transport. arXiv:2204.06684
Marcati C, Schwab C (2021) Exponential convergence of deep operator networks for elliptic partial
differential equations. arXiv:2112.08125
McClenny L, Braga-Neto U (2020) Self-adaptive physics-informed neural networks using a soft
attention mechanism. arXiv:2009.04544
Meng X, Guo Z (2015) Multiple-relaxation-time lattice Boltzmann model for incompressible mis-
cible flow with large viscosity ratio and high Péclet number. Phys Rev E 92(4):043305
Oommen V, Shukla K, Goswami S, Dingreville R, Karniadakis GE (2022) Learning two-phase
microstructure evolution using neural operators and autoencoder architectures. arXiv:2204.07230
Raissi M, Perdikaris P, Karniadakis GE (2019) Physics-informed neural networks: A deep learn-
ing framework for solving forward and inverse problems involving nonlinear partial differential
equations. J Comput Phys 378:686–707
Riffaud S, Bergmann M, Farhat C, Grimberg S, Iollo A (2021) The dgdd method for reduced-order
modeling of conservation laws. J Comput Phys 437:110336
Ruthotto L, Haber E (2019) Deep neural networks motivated by partial differential equations. J
Math Imaging Vis 1–13
Samaniego E, Anitescu C, Goswami S, Nguyen-Thanh VM, Guo H, Hamdia K, Zhuang X, Rabczuk
T (2020) An energy approach to the solution of partial differential equations in computational
mechanics via machine learning: Concepts, implementation and applications. Comput Methods
Appl Mech Eng 362:112790
Shukla K, Xu M, Trask N, Karniadakis GE (2022) Scalable algorithms for physics-informed neural
and graph networks. Data-Centric Eng 3
Tripura T, Chakraborty S (2022) Wavelet neural operator: a neural operator for parametric partial
differential equations. arXiv:2205.02191
Veličković P, Cucurull G, Casanova A, Romero A, Lio P, Bengio Y (2017) Graph attention networks.
arXiv:1710.10903
Wang S, Wang H, Perdikaris P (2021) Learning the solution operator of parametric partial differential
equations with physics-informed deeponets. Sci Adv 7(40):eabi8605
Yin M, Ban E, Rego BV, Zhang E, Cavinato C, Humphrey JD, Em Karniadakis G (2020) Simulating
progressive intramural damage leading to aortic dissection using deeponet: an operator–regression
neural network. J R Soc Interface 19(187):20210670
Yin M, Zhang E, Yu Y, Karniadakis GE (2022) Interfacing finite elements with deep neural operators
for fast multiscale modeling of mechanics problems. Comput Methods Appl Mech Eng 115027

You H, Yu Y, D’Elia M, Gao T, Silling S (2022a) Nonlocal kernel network (NKN): a stable and
resolution-independent deep neural network. arXiv:2201.02217
You H, Yu Y, Silling S, D’Elia M (2021b) Data-driven learning of nonlocal models: from high-
fidelity simulations to constitutive laws. In: Accepted in AAAI spring symposium. MLPS
You H, Yu Y, Silling S, D’Elia M (2022c) A data-driven peridynamic continuum model for upscaling
molecular dynamics. Comput Methods Appl Mech Eng 389:114400
You H, Yu Y, Trask N, Gulian M, D’Elia M (2021d) Data-driven learning of robust nonlocal physics
from high-fidelity synthetic data. Comput Methods Appl Mech Eng 374:113553
You H, Zhang Q, Ross CJ, Lee C-H, Hsu M-C, Yu Y (2022e) A physics-guided neural opera-
tor learning approach to model biological tissues from digital image correlation measurements.
arXiv:2204.00205
You H, Zhang Q, Ross CJ, Lee C-H, Yu Y (2022f) Learning deep implicit fourier neural operators
(IFNOs) with applications to heterogeneous material modeling, To appear on Comput Methods
Appl Mech Eng
Yu A, Becquey C, Halikias D, Mallory ME, Townsend A (2021) Arbitrary-depth universal approx-
imation theorems for operator neural networks. arXiv:2109.11354
Zhang E, Dao M, Karniadakis GE, Suresh S (2022) Analyses of internal structures and defects in
materials using physics-informed neural networks. Sci Adv 8(7):eabk0644
Zhu Y, Zabaras N, Koutsourelakis P-S, Perdikaris P (2019) Physics-constrained deep learning
for high-dimensional surrogate modeling and uncertainty quantification without labeled data. J
Comput Phys 394:56–81
Chapter 7
Digital Twin for Dynamical Systems

Tapas Tripura, Shailesh Garg, and Souvik Chakraborty

7.1 Introduction

A digital twin is a virtual replica of a physical system that exists either in the computer
or in the cloud (Vinicius et al. 2019; Coronado et al. 2018). Compared to conventional
computational models that emulate the physical system’s behavior in a temporally
static sense, a digital twin attempts temporal synchronization of the physical and
digital twins (Arup 2019; Worden et al. 2020); this naturally necessitates real-time
updating of the digital twin based on real data collected by using sensors. Note
that a digital twin might dictate changes to the physical system via actuators. This
is particularly true when a digital twin is used for active vibration control. While
the concept, in theory, has been there since 2002, the first practical definition was
provided in Eric et al. (2011). Rapid developments in Machine Learning and Internet
of Things (IoT) are two of the critical factors that enabled the emergence of this
technology. Additionally, with continuous growth in the usage of connected devices,
network speed, and increasing popularity of cloud platforms, digital twin technology
is likely to observe rapid growth in the upcoming years.
By definition itself, the term “digital twin” is extremely vast and can be applied
to any engineering field (Wagg et al. 2020a, b). Few of the many applications of
digital twin include prognostics and health monitoring (Wihan et al. 2020; Jinjiang
et al. 2019; Millwater et al. 2019; Zhou et al. 2019), manufacturing (Tarasankar et al.
2017; Haag and Anderl 2018; Yuqian et al. 2020; Kyu et al. 2020; Bin and Kai-Jian


2019), automotive and aerospace engineering (Li et al. 2017; Michael et al. 2020;
Eric et al. 2011; Sergey and Andrey 2020), to mention a few. In this chapter, we
mainly focus on digital twins for dynamical systems. The development of digital
twin technology is a non-trivial undertaking because of the presence of multiple
timescales (Chakraborty et al. 2021). For example, a typical time period of a wind
turbine is generally in seconds, whereas the operational period is in years (Adhikari
and Bhattacharya 2012). Therefore, a digital twin should be able to handle multiple
timescales seamlessly.
The development of digital twins can either be pursued from the perspective of
physics-based modeling or from the standpoint of data-based modeling. For exam-
ple, in Ganguli and Adhikari (2020), a physics-based digital twin was proposed for
dynamical systems. A physics-based digital twin is robust to changing environments;
however, these twins rely on (a) exact knowledge of the governing physics and (b)
clean (noise-free) data. Unfortunately, neither can be achieved in a realistic scenario.
The alternative, data-based digital twin Wihan et al. (2020) eliminates these issues.
However, purely data-based twins generally fail to generalize to unseen environ-
ments. Accordingly, the performance often deteriorates, and then the digital twin
goes out of sync with its physical counterpart. A third alternative does exist where
the idea is to fuse data with physics so as to exploit the strength of the two approaches.
For instance, if the physics of the problem is partially known, data-driven modeling
techniques can be used to compensate for the missing physics (Garg et al. 2022).
Such data-physics hybridization is often referred to as the “gray-box” modeling and
will potentially drive the development of future digital twin technologies.
The development of hybrid digital twin technology is often based on the assump-
tion that the overall physics of the system stays the same, and it is the parameter
that evolves. With the updated parameters the hybrid digital twin aims to predict the
behavior of the physical twin under unseen environments. For example, to numer-
ically model damage, we generally reduce the stiffness of a system/component.
However, this is an approximation and in a realistic setting, the physics of the sys-
tem often changes. For instance, the physics of a system can vary because of the
initiation of the crack. As a result, the predictions will diverge away from the actual
trajectory and over the course of time, the digital twin will get desynchronized from
the physical twin. Therefore, a digital twin must be able to track the change in
physics. Unfortunately, the digital twin literature addressing this aspect is almost
absent. One possible approach is to continuously identify the model-form errors in
the nominal model and perform a simultaneous parameter-model update. A more
appropriate solution is to identify the model updates using explainable functions
and estimate-associated epistemic uncertainties. These added features will further
enable the framework to correctly define the degree of generalizability, and accurate
estimation of long-term prediction and remaining useful life. Toward the end of this
chapter, a Bayesian framework in conjunction with stochastic structural dynamics is
demonstrated which addresses these issues.
Lastly, all physical systems have inherently associated randomness; hence, a dig-
ital twin should also account for the uncertainty in the system and data. Overall,
in a physical system, uncertainty might be present in material properties, bound-

ary/initial conditions, and forcing functions. In the context of dynamical systems,


one can argue that the uncertainty in the forcing function has a more prominent effect
than uncertainty in material properties or boundary/initial conditions. Additionally,
as previously mentioned, estimating epistemic uncertainty due to noisy and limited
data is also essential. In this book chapter, a separate section is devoted to treating
nonlinear stochastic dynamic systems.

7.2 Building Blocks and Nominal Model in Digital Twin

By definition, the idea of the digital twin is quite broad. However, on a higher level,
a digital twin generally consists of four key modules, (a) visualization module, (b)
update module, (c) prediction module, and (d) decision module. While the visual-
ization module is the front end of the overall digital twin framework, the other three
are the engines that drive the overall digital twin framework. A brief description and
functionality of each of the modules are provided below. A schematic representation
of the same is shown in Fig. 7.1.

Remark 7.1 While all the modules are equally important, this chapter primarily
focuses on the update module and the prediction module. Note that a digital twin
with only the first three components is often called the predictive digital twin.

7.3 Physics-Based Digital Twin for SDOF System

In this section, we discuss the physics-based digital twin. However, before going into
the details of a physics-based digital twin, we briefly discuss the notion of a nominal
model.

Fig. 7.1 Different modules within a digital twin, along with its functionality

7.3.1 Nominal Model

The journey of a digital twin starts from a nominal model. By definition, a nominal
model is the initial model and represents the system during its installation. In engi-
neering, the nominal model can be considered a validated, verified, and calibrated
physics-based model. For instance, it can be a finite element model of a bridge or an
aircraft. We here explain the overall concept of the nominal model by using a simple
single-degree-of-freedom (SDOF) system.
We consider a physical system that can be represented as an SDOF system. We
also consider that the sensors sample at discrete time points ts . Assuming that the
evolution of the system parameters is only dependent on the slow time ts , the equation
of motion can be written as

m(ts) ∂²u(t, ts)/∂t² + c(ts) ∂u(t, ts)/∂t + k(ts) u(t, ts) = f(t, ts)    (7.1)

Here t and ts are the “system time” and “slow time”, respectively. The terms m(ts ),
c(ts ), and k(ts ) are, respectively, the mass, damping, and stiffness of the system at ts .
The response u(t, ts ) and force f (t, ts ) are now a function of both t and ts ; hence, the
equation is represented using partial derivatives. We consider the slow time or the
service time ts as a time variable that is much slower than t. For example, ts could
represent the number of cycles, and the change in system parameters represents the
degradation of the system during its lifetime.
Equation (7.1) is considered as a digital twin of an SDOF dynamic system. When
ts = 0, that is, at the beginning of the service life of the system, the digital twin in
Eq. (7.1) reduces to the nominal system.

m0 d²u0(t)/dt² + c0 du0(t)/dt + k0 u0(t) = f0(t)    (7.2)
where m 0 , c0 , k0 , and f 0 are, respectively, the mass, damping, stiffness, and force at
t = 0.

7.3.2 The Digital Twin Framework

For Eq. (7.1) to be the digital twin for a SDOF system, the system parameters m(ts ),
c(ts ), and k(ts ) need to be continuously updated based on the data collected from
the physical counterpart. This essentially means that updating the digital twin is
equivalent to updating these parameters.
We consider the sensors to be installed on the physical system to take measure-
ments at locations of time defined by ts . With this setup, the objective of the update
module is to estimate the parameters based on measurements at each ts . Herein, we
assume that the variation in c(ts ) is negligible and limit ourselves to the variation in

m(ts ) and k(ts ) only. Without any loss of generality, the following functional forms
are considered:
k(ts) = k0 (1 + Δk(ts)) and m(ts) = m0 (1 + Δm(ts))    (7.3)

where Δk(ts) and Δm(ts) are the changes in the stiffness and mass parameters. Note that
k(ts ) is generally expected to be a decaying function over a long time to represent
a loss in the stiffness of the system. m(ts ), on the contrary, is expected to be an
increasing or a decreasing function. The following functions have been chosen for
generating synthetic data:

Δk(ts) = e^(−αk ts) (1 + εk cos(βk ts))/(1 + εk) − 1    (7.4)
and Δm(ts) = εm SawTooth(βm (ts − π/βm))    (7.5)

Here αk, βk, εk, εm, and βm are the constants deciding the rate of change of the stiffness
and mass parameters. A schematic representation of the same is shown in Fig. 7.2.
The choice of these functions is motivated by the fact that the stiffness degrades over
time in a periodic manner representing a possible fatigue crack growth in an aircraft
over repeated pressurizations. On the other hand, the mass increases and decreases
over the nominal value due to re-fueling and fuel burn over a flight period. The key
consideration is that a digital twin of the dynamical system should track these types
of changes by exploiting sensor data measured on the system.
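The synthetic variation functions of Eqs. (7.4)-(7.5) can be generated as in the sketch below; the constants are illustrative placeholders rather than the values used to produce Fig. 7.2.

```python
import numpy as np
from scipy.signal import sawtooth

def delta_k(ts, alpha_k=1e-3, beta_k=0.05, eps_k=0.1):
    """Stiffness degradation of Eq. (7.4): decaying, periodically modulated."""
    return np.exp(-alpha_k * ts) * (1.0 + eps_k * np.cos(beta_k * ts)) / (1.0 + eps_k) - 1.0

def delta_m(ts, beta_m=0.15, eps_m=0.25):
    """Mass variation of Eq. (7.5): sawtooth-like re-fueling/fuel-burn cycles."""
    return eps_m * sawtooth(beta_m * (ts - np.pi / beta_m))

ts = np.linspace(0.0, 1000.0, 2001)   # slow time (e.g. normalized by T0)
k_ratio = 1.0 + delta_k(ts)           # k(ts)/k0, Eq. (7.3)
m_ratio = 1.0 + delta_m(ts)           # m(ts)/m0, Eq. (7.3)
```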

Fig. 7.2 Examples of model functions representing long-term variability in the mass and stiffness properties of a digital twin system: the normalized stiffness k(ts)/k0 and mass m(ts)/m0 are plotted against the normalized time ts/T0

7.3.3 Formulating the Digital Twin

As stated earlier, we consider that only the mass and the stiffness change with time.
Accordingly, the governing differential equation for this case is represented as

ms(ts) d²u(t)/dt² + c0 du(t)/dt + ks(ts) u(t) = f(t)    (7.6)
where ks (ts ) and m s (ts ) are represented by Eq. (7.3). Substituting Eq. (7.3) into Eq.
(7.6) and solving, we obtain

λs1,2 = −ωs (ts ) ζs (ts ) ± iωds (ts ) (7.7)

where ωs (ts ), ωds (ts ), and ζs (ts ) denote the evolution in natural frequency, damped
natural frequency, and damping ratio, respectively. The evolution is defined as fol-
lows:
ωs(ts) = ω0 √[(1 + Δk(ts))/(1 + Δm(ts))]    (7.8a)
ζs(ts) = ζ0 / [√(1 + Δm(ts)) √(1 + Δk(ts))]    (7.8b)
ωds(ts) = ωs(ts) √(1 − ζs²)    (7.8c)

where ω0 and ζ0 are the natural frequency and damping ratio of the system at t = 0.
The objective here is to exploit Eq. (7.7) to estimate Δk(ts) and Δm(ts). However, this
is non-trivial because of the coupled nature of the two. Following Ganguli and
Adhikari (2020), this can be addressed by considering the real and the imaginary
parts separately. By introducing a distance metric d (·, ·) and applying it to the real
and imaginary parts, we have

dR(ts) = d(R(λ0), R(λs(ts)))    (7.9)
and dI(ts) = d(I(λ0), I(λs(ts)))    (7.10)

Dividing Eqs. (7.9) and (7.10) by ω0 , we obtain

d̃R(ts) = dR(ts)/ω0 = ζ0/(1 + Δm(ts)) − ζ0    (7.11)
and d̃I(ts) = dI(ts)/ω0 = √(1 − ζ0²) − √[(1 + Δk(ts))(1 + Δm(ts)) − ζ0²]/(1 + Δm(ts)) − ζ0    (7.12)

As already stated, the digital twin is completely described by Δm(ts) and Δk(ts), and
both of these can be computed by solving Eqs. (7.11) and (7.12) simultaneously,

Δm(ts) = −d̃R(ts)/(ζ0 + d̃R(ts)),    (7.13a)
Δk(ts) = ζ0 [d̃R²(ts) − 1 − 2ζ0² d̃I(ts) + ζ0² d̃I²(ts)]/(ζ0 + d̃R(ts))    (7.13b)

Remark 7.2 Although the above expression provides a closed-form solution for
updating the digital twin, it is incapable of handling noisy data. This will be illustrated
in the numerical example section.
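As a minimal illustration of the closed-form update, the sketch below recovers Δm(ts) from a measured eigenvalue via Eqs. (7.11) and (7.13a), assuming the distance metric is a simple difference; Δk(ts) would follow analogously from Eq. (7.13b), and the forward model of Eqs. (7.7)-(7.8) is used here only as a consistency check with illustrative parameter values.

```python
import numpy as np

def update_delta_m(lam_s, omega_0, zeta_0):
    """Recover the mass change from the real part of a measured eigenvalue lambda_s."""
    d_R = (-omega_0 * zeta_0 - np.real(lam_s)) / omega_0   # Eq. (7.11), assuming d(a, b) = a - b
    return -d_R / (zeta_0 + d_R)                           # Eq. (7.13a)

# Consistency check against the forward model of Eqs. (7.7)-(7.8)
omega_0, zeta_0, dm_true, dk_true = 2.0 * np.pi, 0.05, 0.08, -0.12
omega_s = omega_0 * np.sqrt((1.0 + dk_true) / (1.0 + dm_true))
zeta_s = zeta_0 / (np.sqrt(1.0 + dm_true) * np.sqrt(1.0 + dk_true))
lam_s = -omega_s * zeta_s + 1j * omega_s * np.sqrt(1.0 - zeta_s**2)
print(update_delta_m(lam_s, omega_0, zeta_0))              # ~ 0.08, i.e. dm_true
```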

7.3.4 Numerical Experiment

The digital twin described using Eqs. (7.1) and (7.13) is examined in this section
through a simple numerical example. We consider two cases, one where the data is
clean and the second where the data is noisy. The damping ratio ζ0 is fixed at 0.05.
The results obtained for the noise-free case are shown in Fig. 7.3. It is observed that
the digital twin defined using Eqs. (7.1) and (7.13) exactly captures the evolution of
the stiffness in slow time. However, the performance of the digital twin in capturing
the temporal evolution of mass is slightly off. This is because of the fact that the
digital twin was updated using only a few observations. Overall, as long as the data
collected is noise free, the physics-based digital twin performs exceedingly well.
Next, we consider a more realistic case where the sensor data is corrupted by noise.
The observations are corrupted by white Gaussian noise with a standard deviation
equal to 0.5% of the standard deviation of actual data. The results corresponding
to the noise-corrupted case are shown in Fig. 7.4. We observe with the noise, the
digital twin starts deteriorating and performs extremely poorly. This is expected as
the formula in Eq. 7.13 is based on the assumption that the available data is noise
free. Overall, it is safe to conclude that the physics-based digital twin presented here
only works when the available data is noise free. Another challenge associated with
this setup is the lack of predictive capability. Since we don’t learn the evolution of
the parameters Δm(ts) and Δk(ts), it is not possible to predict the state variables at a
future time point.

7.4 Physics ML Fusion: Towards a Predictive Digital Twin

A fundamental challenge with a purely physics-based digital twin resides in handling noisy data. One way to address this is to combine the physics-based model with

Fig. 7.3 Performance of the digital twin in capturing the evolution of m and k with slow time (normalized); the data is noise free. a Mass evolution. b Stiffness evolution

a machine learning model. In this section, we illustrate this possibility by coupling Gaussian process (GP) regression with the physics-based digital twin discussed before.

Fig. 7.4 Performance of the digital twin in capturing the evolution of m and k with slow time (normalized); the data is noisy. a Mass evolution. b Stiffness evolution

For building a predictive digital twin, we start by approximating the unknown


parameters, Δm(ts) and Δk(ts), using a GP as follows:

[Δm(ts), Δk(ts)] ≈ [Δ̂m(ts), Δ̂k(ts)] ∼ GP(μts, κ(ts, ts′; θ))    (7.14)
 
where μts = μ(ts; θm) and κ(ts, ts′; θk) are, respectively, the mean function and covariance kernel of the GP, and θ = [θm, θk] are the hyperparameters of the GP model that need to be trained. However, it is not possible to compute these hyperparameters directly, owing to the fact that we don't have any measurements of either Δm(ts) or Δk(ts).
To alleviate this issue, we substitute Eq. (7.14) into Eq. (7.7) and proceed as before.
This yields that the GP at a given time ts should match the target value obtained using
the physics-based approach; in other words,

Δ̂m(ts) = −d̃R(ts)/(ζ0 + d̃R(ts)),    (7.15a)
Δ̂k(ts) = ζ0 [d̃R²(ts) − 1 − 2ζ0² d̃I(ts) + ζ0² d̃I²(ts)]/(ζ0 + d̃R(ts))    (7.15b)

where d̃R (ts ) and d̃I (ts ), as before, are distance measures. Such a setup allows us
to train the GP model and estimate the parameters θ in a seamless manner. A brief
discussion on GP is given in the following section.

Remark 7.3 From a practical point of view, one needs to assume a functional form
of the mean function and a covariance kernel for implementing GP in practice. One
can also opt to select the mean and covariance kernel in an adaptive manner. In this
work, we have used Bayesian information criteria for selecting the optimal mean
function and covariance kernel.

7.4.1 Gaussian Process

Gaussian process (GP) is a popular machine learning technique that aims to infer
a distribution over functions and then utilize the distribution to make predictions at
some unknown points (Murphy 2012). Although different modern variants of GP
exist in the literature, the vanilla GP is utilized in this work. For an understanding
of the vanilla GP, consider x ∈ R^d to be the input variable and y to be a set of noisy measurements of the response variable. Then the regression equation can be written as

y = g(x) + v,   (7.16)

where v represents the noise. With this setup, the objective is to estimate the latent (unobserved) function g(x) that will enable the prediction of the response variable, ŷ, at new values of x. In the GP-based regression setup, a GP is defined over g(x) with the mean function μ(x) and covariance function κ(x, x′; θ) as

g(x) ∼ GP(μ(x), κ(x, x′; θ))   (7.17)

In the above equation, the mean and covariance functions are defined as

μ(x) = E[g(x)]   (7.18a)

κ(x, x′; θ) = E[(g(x) − μ(x))(g(x′) − μ(x′))]   (7.18b)

where θ denotes the hyperparameters of the covariance function κ(·, ·), such as
the length and scale parameters for a squared exponential kernel. The choice of
the covariance function κ allows encoding of any prior knowledge about g(x) (e.g. periodicity, linearity, smoothness) and can accommodate approximation of arbitrarily complex functions (Rasmussen and Williams 2006). In Eq. (7.17), it can be seen that any finite collection of function values has a joint multivariate Gaussian distribution, i.e. (g(x_1), g(x_2), ..., g(x_N)) ∼ N(μ, K), where μ = [μ(x_1), ..., μ(x_N)]^T is the mean vector and K is the covariance matrix with K(i, j) = κ(x_i, x_j) for i, j = 1, 2, ..., N. The mean function is generally taken as a zero vector, i.e. μ(x) = 0, when no prior information is available about the mean function. On the other hand, any function κ(x, x′) that generates a positive semi-definite covariance matrix K is considered to be valid for the covariance function. Once Eq. (7.16) is ready, the objective of GP is to estimate the hyperparameters, θ, based on the observed input–output pairs, {(x_j, y_j)}_{j=1}^{N_t}, where N_t is the number of training samples. Once θ have
been computed, the predictive distribution of g(x*) given the dataset {X, y}, the hyperparameters θ, and the new input x* is represented as

p(g(x*) | y, X, θ, x*) = N(g(x*) | μ_GP(x*), σ²_GP(x*))   (7.19)

where the mean and covariance functions of the predictive distribution are given as

μ_GP(x*) = k^T(x*, X; θ) [K(X, X; θ) + σ_n² I]^{-1} y   (7.20a)

σ²_GP(x*) = κ(x*, x*; θ) − k^T(x*, X; θ) [K(X, X; θ) + σ_n² I]^{-1} k(x*, X; θ)   (7.20b)

In general, the hyperparameters θ can be obtained by maximizing the likelihood


of the data. Alternatively, the Bayesian approaches can also be used to compute
the posterior distribution on the hyperparameters, θ (Ilias et al. 2013; Chakraborty
and Chowdhury 2019). A discussion on Bayesian regression is provided in Sect.
7.6.3. More information on GP can be found in the literature (Nayek et al. 2019;
Chakraborty and Chowdhury 2019).
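To make the predictive equations concrete, a minimal numpy sketch of Eqs. (7.19)–(7.20) is given below. The squared-exponential kernel, the fixed hyperparameter values, and the toy data are illustrative assumptions, and no hyperparameter optimization is performed here.

```python
# A minimal numpy sketch of the vanilla GP predictive equations (7.20),
# assuming a squared-exponential kernel with fixed (untuned) hyperparameters.
import numpy as np

def sq_exp_kernel(A, B, length=1.0, scale=1.0):
    """Squared-exponential covariance k(a, b) = scale^2 * exp(-|a-b|^2 / (2 l^2))."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return scale ** 2 * np.exp(-0.5 * d2 / length ** 2)

def gp_predict(X, y, X_star, length=1.0, scale=1.0, sigma_n=1e-2):
    """Predictive mean and variance at X_star for a zero-mean GP prior."""
    K = sq_exp_kernel(X, X, length, scale) + sigma_n ** 2 * np.eye(len(X))
    k_star = sq_exp_kernel(X, X_star, length, scale)           # N x N*
    k_ss = sq_exp_kernel(X_star, X_star, length, scale)
    mean = k_star.T @ np.linalg.solve(K, y)                    # Eq. (7.20a)
    cov = k_ss - k_star.T @ np.linalg.solve(K, k_star)         # Eq. (7.20b)
    return mean, np.diag(cov)

# toy usage: noisy samples of a smooth function
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(20, 1))
y = np.sin(X[:, 0]) + 0.05 * rng.normal(size=20)
X_star = np.linspace(0, 10, 100)[:, None]
mu, var = gp_predict(X, y, X_star, length=1.5, scale=1.0)
```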

7.4.2 Numerical Experiment

We revisit the same experiment carried out in Sect. 7.3 with the same setup (see Fig. 7.2). Figure 7.5 depicts the evolution of the real and imaginary parts of the measured natural frequency of the system over time.

(a) Changes in real part of natural frequency

(b) Changes in imaginary part of natural frequency

Fig. 7.5 Variation (normalized) in the real and imaginary parts of the natural frequency (legend: actual change; 37, 150, and 200 samples available for the digital twin)

(a) Mass function

(b) Stiffness function

Fig. 7.6 GP-based digital twin obtained from exact data

Figure 7.6 shows the results obtained
for clean data. The proposed and updated digital twin is able to capture the time
evolution of mass and stiffness. However, this setup is unrealistic as, even with the
most advanced sensors, the data collected will always be noisy (Zhang et al. 2017).
As a natural progression, we consider a more realistic case with noisy data. We
consider three noise levels. Figure 7.7 shows the mass and stiffness evolution of the
digital twin trained with 37 noisy observations. We observe that the digital twin is
able to capture the time evolution of stiffness with high accuracy; however, it fails
to capture the evolution of mass adequately. Additionally, uncertainty due to limited
and noisy data is perfectly captured and can be used in further decision-making.

(a) Mass function

(b) Stiffness function

Fig. 7.7 GP-based digital twin obtained from 37 noisy data with σθ = 0.005. The shaded plot
depicts the 95% confidence interval

Figures 7.8 and 7.9 show the performance of the digital twin trained with 150
noisy observations. We observe a dramatic improvement in the digital twin with the
increased data points. The results obtained are highly accurate. Lastly, evolution of
mass with 200 noisy observations and σθ = 0.025 is shown in Fig. 7.10. As expected,
this yields the best results.

Fig. 7.8 Mass evolution for digital twin obtained from 150 noisy data. Noise levels of σθ = 0.005,
σθ = 0.015, and σθ = 0.025 are considered

7.5 Digital Twin for Nonlinear Stochastic Dynamical Systems

The digital twins discussed till now are deterministic in nature. However, real systems
are always associated with uncertainty in one form or the other. In this section, we
focus on the development of digital twins for the multi-degree-of-freedom (MDOF)

Fig. 7.9 Stiffness evolution for digital twin trained with 150 noisy data with σθ = 0.025. The
shaded plot depicts the 95% confidence interval

Fig. 7.10 Mass evolution as a function of the normalized slow time ts /T0 for GP-based digital twin
(simultaneous mass and stiffness evolution) trained with 200 noisy data with σθ = 0.025. Bayesian
information criteria yield a “Linear” basis and an “ARD Matern 5/2” covariance kernel. The shaded
plot depicts the 95% confidence interval

stochastic system. Note that, similar to the previous sections, we assume that the temporal evolution of the system is captured by changes in the system parameters only. No change in the governing physics is considered in this section.

7.5.1 Stochastic Nonlinear MDOF System: The Nominal Model

Consider an N-DOF stochastic nonlinear system having governing equations as follows:

M0 Ẍ + C0 Ẋ + K0 X + G(X, α) = F + Σ Ẇ   (7.21)

where M0 ∈ R^{N×N}, C0 ∈ R^{N×N}, and K0 ∈ R^{N×N}, respectively, represent the mass, damping, and (linear) stiffness matrices of the system. G(·, ·) ∈ R^N, on the other hand, represents the nonlinearity present in the system. F in Eq. (7.21) represents the deterministic force and Ẇ (Wiener derivative) is the stochastic load vector with noise intensity matrix Σ. The parameter α in Eq. (7.21) represents the nonlinear stiffness model. Note that M0, C0, and K0 are the nominal parameters and represent the pristine system.
The DT for the N-DOF nonlinear system discussed above can be represented as

M(ts) ∂²X(t, ts)/∂t² + C(ts) ∂X(t, ts)/∂t + K(ts) X(t, ts) + G(X(t, ts), α) = F(t, ts) + Σ Ẇ   (7.22)

The notations are mostly similar to those discussed before.

7.5.2 Problem Statement

The DT for the MDOF nonlinear system defined in Eq. (7.22) is incomplete without a proper update mechanism for the system parameters M(ts), C(ts), and K(ts). Similar to the previous section, we assume that the temporal variation in M(ts), C(ts), and K(ts) is slow, and hence the dynamics are decoupled. We assume that the sensors collect data at discrete time instants ts, and at each time instant, time history measurements of the acceleration response in ts ± Δt are available. In this section, we only consider variation in the stiffness, and the objective is to develop a DT for a nonlinear MDOF system. Other requirements like continuous updates and future predictions remain, as discussed in the previous section.

7.5.3 The Digital Twin Framework

A schematic representation of the proposed DT is shown in Fig. 7.11. It has four


primary components, namely, (a) selection of nominal model, (b) data collection,
(c) parameter estimation at a given time instant, and (d) estimation of the temporal
variation in parameters. The selection of the nominal model has already been detailed
in Sect. 7.5.1 and hence, here, the discussion is limited to data collection, parameter
estimation, and estimation of temporal variation of the parameters only.

Fig. 7.11 Schematic representation of the proposed digital twin framework

One primary concern in DT is its connectivity with its physical counterpart. To


ensure connectivity, sensors are placed on the physical system (physical twin) for
data collection. The data is communicated to the DT by using cloud technology.
With the substantial advancements in IoT, access to different types of sensors is
straightforward for collecting different types of data. Although different types of
sensors can be used, we have assumed acceleration data in this section. We note that
the proposed approach is not limited to acceleration measurements; it can also be used with displacement and velocity measurements, as well as with vision-based sensors. Having said that, accelerometers are still
the most popular choice, and hence acceleration measurement is considered in this
study.
Once the data is collected, the next objective is to estimate the system param-
eters (stiffness matrix to be specific), assuming that at time instant ts , acceleration
measurements are available in [ts − Δt, ts], where Δt is the time interval over which the acceleration measurement is available at ts. It is to be noted that ts is a timestep in the slow timescale whereas Δt is the time interval in the fast timescale. With this setup, the parameter estimation objective is to estimate K(ts). In this work, we estimate K(ts) by using the unscented Kalman filter (UKF). For further details on UKF in the context of the proposed digital twin, interested readers may refer to Garg et al. (2021).
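As an illustration of this step, a minimal sketch of joint state–parameter estimation with a UKF is given below for a single-DOF oscillator with an augmented state [x, v, k]. It uses the filterpy package, and all parameter values, noise levels, and tuning constants are placeholders; it is a sketch in the spirit of the approach, not the implementation of Garg et al. (2021).

```python
# A minimal sketch (not the implementation of Garg et al. 2021) of UKF-based
# joint state-parameter estimation for a single-DOF oscillator; the augmented
# state is [x, v, k], so the stiffness k is estimated from noisy acceleration
# measurements. All numerical values are placeholders.
import numpy as np
from filterpy.kalman import UnscentedKalmanFilter, MerweScaledSigmaPoints

m, c, k_true, dt = 10.0, 20.0, 1000.0, 1e-3
force = lambda t: 10.0 * np.sin(10.0 * t)

def fx(s, dt, t=0.0):
    """Process model: Euler step of the oscillator; the stiffness follows a random walk."""
    x, v, k = s
    a = (force(t) - c * v - k * x) / m
    return np.array([x + v * dt, v + a * dt, k])

def hx(s, t=0.0):
    """Measurement model: acceleration implied by the current state."""
    x, v, k = s
    return np.array([(force(t) - c * v - k * x) / m])

points = MerweScaledSigmaPoints(n=3, alpha=1e-3, beta=2.0, kappa=0.0)
ukf = UnscentedKalmanFilter(dim_x=3, dim_z=1, dt=dt, fx=fx, hx=hx, points=points)
ukf.x = np.array([0.0, 0.0, 800.0])           # stiffness guess deliberately off
ukf.P = np.diag([1e-6, 1e-6, 1e4])
ukf.Q = np.diag([1e-10, 1e-8, 1e-1])          # small process noise on states, larger on k
ukf.R = np.array([[0.05]])

rng = np.random.default_rng(0)
x, v = 0.01, 0.0
for i in range(5000):                          # synthetic measurement window
    t = i * dt
    a = (force(t) - c * v - k_true * x) / m
    z = a + rng.normal(0.0, 0.05)              # noisy acceleration measurement
    ukf.predict(t=t)
    ukf.update(np.array([z]), t=t)
    x, v = x + v * dt, v + a * dt              # advance the "true" system

print("estimated stiffness:", ukf.x[2])
```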
The last step within the proposed DT framework is to estimate the temporal
evolution of the parameters. This is extremely important as it enables the DT to
predict the future behavior of the physical system. In this work, we propose to use a
combination of Gaussian process regression (GPR) and UKF to learn the temporal evolution of the system parameters. To be specific, consider t = [t1, t2, ..., tN] to be the time instants in the slow timescale. Also, assume that using UKF, the estimated system

parameters are available at different time instants as v = [v 1 , v 2 , . . . , v N ], where v i


includes the elements of stiffness matrix. The proposed work trains a GPR model
between t and v,
v ∼ GP(μ, κ) (7.23)

Note that for brevity, the hyperparameters in Eq. (7.23) are omitted. The GPR is
trained by following the procedure discussed in Sect. 7.4.1. For algorithmic details,
more information can be found in Garg et al. (2021). Once trained, the GPR can
predict the system parameters at future timesteps. Note that GPR being a Bayesian
machine learning model also provides predictive uncertainty, which can be used to
judge the accuracy of the model. For the ease of readers, the overall DT framework
proposed is shown in Algorithm 1.

Algorithm 1: Proposed DT
1 Select the nominal model (Sect. 7.5.1).
2 Use the data (acceleration measurements) Ds collected at time ts to compute the parameters K(ts) using the UKF (Garg et al. 2021, 2022).
3 Train a GP using D = [tn, vn]_{n=1}^{ts} as training data, where vn represents the system parameters. Predict K(t̃) at a future time t̃. Substitute K(t̃) into the governing equation (high-fidelity model) and solve it to obtain the responses at time t̃.
4 Take decisions related to maintenance, remaining useful life, and health of the system.
5 Repeat steps 2–4 as more data become available.
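As a minimal sketch of step 3, the snippet below fits a GP to (placeholder) UKF stiffness estimates over slow time with scikit-learn and extrapolates to future instants; the Matern 5/2 kernel mirrors the covariance family mentioned earlier, but the data, kernel settings, and noise level are assumptions for illustration only.

```python
# A minimal sketch of step 3 of Algorithm 1: GP regression over slow time on
# UKF-estimated stiffness values (placeholder data, not the chapter's results).
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import ConstantKernel, Matern

t_slow = np.linspace(0, 5000, 30)[:, None]                     # slow-time instants (days)
k_ukf = 2000.0 * (0.9 + 0.1 * np.exp(-t_slow[:, 0] / 2000.0))  # placeholder UKF estimates

kernel = ConstantKernel(1.0) * Matern(length_scale=1000.0, nu=2.5)
gpr = GaussianProcessRegressor(kernel=kernel, alpha=1e-2, normalize_y=True)
gpr.fit(t_slow, k_ukf)

t_future = np.linspace(0, 10000, 200)[:, None]                 # extrapolate beyond the data
k_mean, k_std = gpr.predict(t_future, return_std=True)
# k_mean feeds the high-fidelity model at future times (step 3), and k_std
# quantifies the predictive uncertainty used for the decisions in step 4.
```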

7.5.4 Numerical Examples

We consider a 7-DOF system as shown in Fig. 7.12. The 7-DOF system is mod-
eled with a Duffing–van der Pol (DVP) oscillator at the fourth DOF. The governing
equations of motion for the 7-DOF system are as follows:

M Ẍ + C Ẋ + K X + G(X, α) = F + Σ Ẇ   (7.24)

where M ∈ R^{7×7} represents the mass matrix, X ∈ R^7 is the response vector, Σ ∈ R^{7×7} is the diffusion matrix, Ẇ ∈ R^7 represents the Gaussian white noise, F ∈ R^7
represents the forcing function, and G ∈ R7 represents the unknown physics. C and
K in Eq. (7.24) represent damping and linear stiffness, respectively. The stiffness
of all but the fourth DOF varies with the slow timescale ts . The rationale here is

Fig. 7.12 7-DOF System with Duffing–van der Pol oscillator



Table 7.1 System parameters of the 7-DOF system used for data simulation

Index i        | Mass (kg) | Stiffness constant (N/m) | Damping constant (Ns/m) | Force (N), Fi = λi sin(ωi t) | Stochastic noise parameters
i = 1, 2       | mi = 20   | ki = 2000                | ci = 20                 | λi = 10, ωi = 10             | si = 0.1
i = 3, 4, 5, 6 | mi = 10   | ki = 1000                |                         |                              |
i = 7          | mi = 5    | ki = 500                 |                         |                              |
DVP oscillator constant: αDVP = 100

that nonlinear stiffness is generally used for vibration control (Das et al. 2021) and
energy harvesting (Cao et al. 2019), and hence kept constant. Further details on the
parameters are shown in Table 7.1.
For generating synthetic data, data simulation is carried out using the Taylor-1.5-
Strong scheme, and a filtering model is formed using the Euler–Maruyama (EM)
equation in Garg et al. (2021). Note that although the value of k4 is a priori known, we have still included it in the state vector. It was observed that such a setup helps in regularizing the UKF estimates. The measurement model for the UKF remains the same and is written as
h(y) = [h1(y), h2(y), ..., h7(y)]^T   (7.25)

where

h1(y) = −(1/m1)( y1(k1 + k2) − c2 y4 − k2 y3 + y2(c1 + c2) )
h2(y) = (1/m2)( c2 y2 − y3(k2 + k3) + c3 y6 + k2 y1 + k3 y5 − y4(c2 + c3) )
h3(y) = −(1/m3)( −k4 y7 − c4 y8 − k3 y3 − c3 y4 + y5(k3 + k4) + αDVP(y5 − y7)³ + y6(c3 + c4) )
h4(y) = (1/m4)( c4 y6 + c5 y10 + k4 y5 + k5 y9 − y7(k4 + k5) + αDVP(y5 − y7)³ − y8(c4 + c5) )
h5(y) = (1/m5)( c5 y8 − y9(k5 + k6) + c6 y12 + k5 y7 + k6 y11 − y10(c5 + c6) )
h6(y) = (1/m6)( c6 y10 − y11(k6 + k7) + c7 y14 + k6 y9 + k7 y13 − y12(c6 + c7) )
h7(y) = (1/m7)( c7 y12 − c7 y14 + k7 y11 − k7 y13 )
The process noise covariance matrix Q is expressed as (Garg et al. 2021)

qc = Δt⁻¹ diag(0, σ1/m1, 0, σ2/m2, 0, σ3/m3, 0, σ̃4/m4, 0, σ5/m5, 0, σ6/m6, 0, σ7/m7, 0, 0, 0, 0, 0, 0, 0)
Q = qc qc^T   (7.26)

where σ̃4 = m⁻ₖ(7) σ4 and m⁻ₖ(7) is the seventh element of the UKF's predicted mean.
Simulated acceleration and input are corrupted with a Gaussian noise having SNR
values of 50 and 20, respectively. The final data used in filtering is presented in Fig.
7.13.
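For readers who want a concrete starting point, a minimal Euler–Maruyama sketch for a single Duffing-type oscillator under combined harmonic and stochastic forcing is given below; it is not the Taylor-1.5-strong scheme used for the data generation, and the parameter values are placeholders loosely in the spirit of Table 7.1.

```python
# A minimal Euler-Maruyama sketch (illustrative only, not the chapter's
# Taylor-1.5-strong implementation) for generating a stochastic response of a
# single Duffing-type oscillator with placeholder parameters.
import numpy as np

m, c, k, alpha_nl, sigma = 10.0, 20.0, 1000.0, 100.0, 0.1
lam, omega = 10.0, 10.0
dt, n_steps = 1e-3, 5000
rng = np.random.default_rng(0)

x = np.zeros(n_steps)
v = np.zeros(n_steps)
for i in range(n_steps - 1):
    t = i * dt
    f = lam * np.sin(omega * t)                             # deterministic forcing
    dW = rng.normal(0.0, np.sqrt(dt))                       # Brownian increment
    a = (f - c * v[i] - k * x[i] - alpha_nl * x[i] ** 3) / m
    x[i + 1] = x[i] + v[i] * dt
    v[i + 1] = v[i] + a * dt + (sigma / m) * dW             # diffusion enters the velocity
acc = np.gradient(v, dt)                                    # synthetic acceleration "measurements"
```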
The functionality of the digital twin is dependent on the performance of UKF
and GP. To that end, we first illustrate the accuracy of UKF. The acceleration vec-
tors (noisy) shown in Fig. 7.14 are considered as the measurements. The state and
parameter estimation results obtained using the UKF algorithm are shown in Fig.
7.15. The digital twin provides a highly accurate estimate of the state vectors. Param-
eters k2, k3, and k5 also converge exactly toward the ground truth. As for k1, k6, and k7, UKF yields an accuracy of around 95%.

Fig. 7.13 Acceleration and the deterministic component of the force for the 7-DOF problem

Fig. 7.14 Force (deterministic part) and acceleration vector corresponding to DOF 1, 4, and 7 used in UKF

A summary of the estimated parameters
in the slow timescales is shown in Figs. 7.16 and 7.17. We observe that the estimates
improve with an improved initial guess of system parameters (which, in our case,
are the final parameters obtained from previous data points). Similar to Fig. 7.15, we
observe that the estimates for stiffness k2 , k3 , and k5 are more accurate than those
obtained for k1 , k6 , and k7 . These data are used for training the GP model.
Fig. 7.15 Combined state and parameter estimation results for the 7-DOF van der Pol system: a state (displacement and velocity) estimation; b parameter (stiffness) estimation

Fig. 7.16 Stiffness (k1 and k2) obtained using the UKF algorithm

Fig. 7.17 Estimated stiffness (k3, k5, k6, and k7) in slow timescale using the UKF algorithm

Figure 7.18 shows the results obtained using the GP. The vertical line in Fig. 7.18 indicates the end point of the measurement window for GP. For k1, k2, k3, and k5, the results obtained using the digital twin match exactly with the true solution. For k7,

the digital twin predicted results are found to diverge from the true solution. How-
ever, the divergence only happens after 3.5 years from the last observation, which,
for all practical purposes, is reasonable enough. For stiffness k6 also, the predic-
tion improves as more data are made available to the digital twin. This establishes
that the digital twin has the capacity for self-correction, which in turn helps it represent the physical system more faithfully.

7.6 Digital Twin for Systems with Misspecified Physics

The construction of digital twins in the previous three sections is based on the assump-
tion that the evolution of a system can be perfectly captured by changes in its system
parameters. However, such an assumption limits the capability of a digital twin. In this section, we discuss a predictive digital twin that is capable of tracking changes in the physics.
Fig. 7.18 Results representing the performance of the proposed digital twin (GP + UKF) for the 7-DOF system

Let us consider the following D-dimensional second-order partial differential equation:

M(ts) ∂²X(t, ts)/∂t² + C(ts) ∂X(t, ts)/∂t + K(ts) X(t, ts) + H(Ẋ, X, t, ts) + Q(Ẋ, X, t, ts) = Σ Ḃ(t, ts)   (7.27)

where M ∈ R D×D , C ∈ R D×D , and K ∈ R D×D represent the mass, damping, and
linear stiffness matrices of the system, respectively. The functions H( Ẋ, X, t, ts ) :
R D → R D and Q( Ẋ, X, t, ts ) : R D → R D denote the linear and nonlinear pertur-
bations in the system, respectively. The term Ḃ(t, ts) on the right-hand side represents white noise (the generalized derivative of the Brownian motion B(t, ts) ∈ R^D) with noise intensity matrix Σ ∈ R^{D×D}. In the above equation, two timescales t and ts are used,
which represent the intrinsic time and the service time, respectively. The service
time refers to the periods over which the underlying structure or a component is
expected to be inspected. The timescale ts is comparatively much slower than t and
since X(t, ts ) is a function of both the timescales Eq. (7.27) is written in terms of
the partial derivatives. It can be understood that the evolution in M(ts ), C(ts ), and
K(ts) occurs very slowly with respect to the timescale ts. The forcing term Σ Ḃ(t, ts), however, can change with respect to both the timescales t and ts. We call Eq. (7.27) the model of the proposed digital twin. Since it is already mentioned that the system evolves with respect to the slower timescale, we rephrase Eq. (7.27) at ts = 0 as

M0 Ẍ(t) + C0 Ẋ(t) + K0 X(t) = Σ Ḃ(t)   (7.28)

The above equation denotes the beginning of the service life of the underlying system
and is often called the nominal model in DT (this is almost similar to the DT defined in
Eq. (7.21)). Here, M0 , C0 , and K0 are the parameters of the nominal model. Further


Fig. 7.19 Schematic architecture of the predictive digital twin framework with simultaneous
parameter-model updating feature. The network primarily consists of three components, namely,
the physical model, the mirror model, and the linking mechanism. The linking mechanism performs
three simultaneous operations that are (i) data assimilation and processing, (ii) updating of the
nominal mirror model, and (iii) making predictions in the presence of unseen environmental agents
using the updated digital twin. For updating the virtual mirror model using explainable functions,
the data assimilation and processing unit utilizes two modules. In the backend, these modules use the
sparse Bayesian linear regression. Based on whether both input–output or output-only measurements are available, module-1 or module-2 is activated, respectively. The input refers to the source and
the output refers to the state measurements. The details on the modules are provided in Figs. 7.20
and 7.21

we assume that, as the timescale ts shifts from the initial condition, the nominal system gets perturbed by new terms, expressed using the functions H(Ẋ, X, ts) and Q(Ẋ, X, ts) as

M0 Ẍ(t, ts) + C0 Ẋ(t, ts) + K0 X(t, ts) + H(Ẋ, X, ts) + Q(Ẋ, X, ts) = Σ Ḃ(t)   (7.29)
It is straightforward to note that any changes in the physical model can be incorporated
into the DT using the linear and nonlinear functions H(·) and Q(·). Therefore, in
order to use the DT in practice, we need to characterize the functions H(·) and
Q(·). In this work, it is assumed that a linking mechanism between the twins is
established by using sensors and actuators. The sensors provide measurements of
the system states and the force, whenever available, at the time instant ts . At each
time instant ts , the measurements are obtained for ts + 1s, which means that we have
access to only one second of noisy data. We aim to discover the functions H(·) and
Q(·) from these limited and noisy measurements. Once discovered, they are used
to update the nominal model in Eq. (7.21). In the discovery of H(·) and Q(·), we
aim to learn them in their interpretable forms. In order to assess the performance
in an unseen scenario, we also aim to learn the uncertainties associated with the

parameters of the functions H(·) and Q(·). For these, the sparse Bayesian inference
is employed. The resulting framework thus is interpretable, and since the functions are
learned in a probabilistic framework, the chances of overfitting are very low. Further,
the physics of the underlying perturbations is learned using actual mathematical
functions. Therefore, it is conjectured that the proposed DT will be able to track
the evolution of the physical twin accurately. A schematic representation of the
predictive digital twin is shown in Fig. 7.19. The network architecture has three
primary components—(a) the physical model, (b) the digital twin, and (c) the linking
mechanism. The linking mechanism further consists of three independent modules—
(i) the data assimilation and processing module, (ii) the model updating module, and
(iii) the prediction module. The data processing is performed by using a physics-
based nominal model, and in the updating module, the sparse Bayesian regression
is employed. Since the perturbations in the physical model are obtained in terms of
interpretable functions, we consider the digital twin as white in nature. Although the
proposed framework should theoretically work for higher order dynamical systems,
in this work, we assume only the second-order dynamical systems. Furthermore, we
consider that the second-order dynamical systems can be completely represented
using displacement and velocity measurements.
Due to the advances in sensor technologies, it is now possible to measure the displacement and velocity time histories of a dynamical system. However, the measurement of input forces is often not feasible. Toward this, we propose
two frameworks—(i) when both the input–output measurements are available and
(ii) when only the state measurements are measurable. In framework-1, we simply
remove the information of the nominal model using the measured state measurements
and then perform sparse regression to identify the perturbation terms. The schematic
illustration of the framework is presented in Fig. 7.20. In framework-2, similar to the
previous, we first remove the information of the nominal model and then employ the
sparse Bayesian regression in the purview of the Kramers–Moyal expansion (Hannes
1996) to identify the perturbations in terms of stochastic differential equations (Peter
and Eckhard 1992; Bernt 2013). The schematic representation of framework-2 is
provided in Fig. 7.21.

Fig. 7.20 Schematic architecture of the Module-1 in Fig. 7.19 for Bayesian model updat-
ing of dynamical systems in the presence of both input–output observations. Module-1 takes
the state and force measurements as input and forms a library of candidate basis functions from
these measurements. To update the nominal model, a sparse Bayesian regression between the state
measurements and their derivatives is constructed using the library

Fig. 7.21 Schematic architecture of the Module-2 in Fig. 7.19 for Bayesian model updating of
dynamical systems with output-only measurements. In module-2, only the state measurements
are provided as input. The library of candidate basis functions is created from state-only measure-
ments. Similar to module-1, sparse Bayesian regression is performed to update the nominal model.
However, the target vectors in module-2 are obtained using the Kramers–Moyal formula

7.6.1 Model Updating Using Input–Output Measurement

The premise of identification of the perturbations from noisy input–output measure-


ments and updating the original model is that the measurements can be expressed as
a linear superposition of certain basis functions. The basis functions can be identity,
polynomial, trigonometric, exponential, logarithmic, signum, modulus, or combina-
tions between them. These bases are evaluated on the input–output measurements,
and a library, often called a design matrix, is formed. However, due to the inclusion
of a large number of candidate functions in the library, most of the candidate model
components are likely to be incorrect. Further, there will always be some level of
confusion as few of the basis functions will have high correlations. As a whole, the
“true” model components will not be identified, leading to model discrepancy and
bias in the identified system parameters. In order to allow the model components that
do not provide a significant contribution in representing the data to be removed, the
sparse Bayesian regression algorithm in Appendix A is used. Further, the Bayesian
nature of the model updating framework helps in removing the terms inside the
library in a probabilistic manner, thereby requiring less human intervention.
Before moving into the mathematical description, we note that the higher order
dynamical systems are commonly described using a projected space; for example, the
second-order systems are often expressed in terms of their displacement and velocity
states. The benefit of the projected space is that all the states in this space are directly
observable. Toward this, let us represent the projection by a map T : Rd → Rm where
d and m are the dimensions of the original and projected space. In the projected space,
let us assume that there exists a dynamical system of the following form:

Ẋt = f(X t , t); X(t = t0 ) = X 0 ; t ∈ [0, T ] (7.30)

where X t ∈ Rm denotes the system states and f(X t , t) : Rm → Rm represents the


dynamics of the underlying system. Since we assume that, due to operational and environmental conditions, the underlying nominal model gets perturbed, we rephrase the above equation as

Ẋt = f(Xt, t) [nominal model] + h(Xt, t) [perturbation];  X(t = t0) = X0;  t ∈ [0, T],   (7.31)

where h(Xt, t) : R^m → R^m represents the perturbation terms. In the proposed digital twin framework, the nominal model f(Xt, t) is known to us a priori, and we aim to correct the nominal model by learning the perturbation term h(Xt, t) from freshly obtained input–output measurements. In order to discover only the perturbation terms, we first remove the information about the nominal model from the measured output as

Żt = Ẋt − f(Xt, t) = h(Zt, t)   (7.32)

Let {ℓk(Zt); k = 1, ..., K} be the set of candidate library functions, θ = [θ1, θ2, ..., θK]^T be the associated parameters, and K the dimension of the library. In order to discover h(Zt, t) in terms of analytical functions, we express it as a linear combination of the basis functions as

hi(Zt, t) = θi1 ℓ1(Zt) + ··· + θik ℓk(Zt) + ··· + θiK ℓK(Zt)   (7.33)

In the above equation, i represents the ith state of the m-dimensional statespace and θik denotes the weight of the kth basis function for the ith state. In the regression format, the above equation is expressed as

Yi = Lθi + εi   (7.34)

where Yi = Żi and L ∈ R^{N×K} := [ℓ1(Zt), ..., ℓK(Zt)] is the library. For constructing the target and library in the above equation, both the states Zt and Żt can be measured using the sensors. However, in case of restrictions, one can choose to measure only Zt and obtain Żt from Zt using higher order numerical differentiation schemes. Once constructed, it can be understood that the above equation can be solved using the sparse Bayesian regression of Algorithm 2 in the Appendix. A schematic illustration is provided in Fig. 7.20.
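A minimal sketch of this construction is given below (illustrative names and toy data, not the chapter's implementation): the nominal physics is subtracted from the measured state derivatives as in Eq. (7.32), and a small candidate library is evaluated on the states to form the regression problem of Eq. (7.34).

```python
# A minimal sketch of the framework-1 target and library construction,
# Eqs. (7.32)-(7.34), with a hypothetical linear nominal model and toy data.
import numpy as np

def nominal_rhs(X):
    """Known nominal model f(X): a linear oscillator in statespace form (assumed)."""
    x, v = X[:, 0], X[:, 1]
    m0, c0, k0 = 10.0, 20.0, 1000.0
    return np.column_stack([v, -(c0 * v + k0 * x) / m0])

def build_library(X):
    """Candidate basis functions evaluated on the states (polynomials plus sin)."""
    x, v = X[:, 0], X[:, 1]
    return np.column_stack([np.ones_like(x), x, v, x**2, x*v, v**2, x**3, np.sin(x)])

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))                                # measured states [x, v]
true_perturbation = np.column_stack([np.zeros(500), -5.0 * X[:, 0]**3])
Xdot = nominal_rhs(X) + true_perturbation                    # "measured" state derivatives
Y = Xdot - nominal_rhs(X)                                    # Eq. (7.32): remove nominal physics
L = build_library(X)                                         # library L of Eq. (7.34)
# Each column Y[:, i] is regressed on L with the sparse Bayesian sampler of
# Sect. 7.6.3, which should recover the cubic term in the velocity equation.
```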

7.6.2 Model Updating Using Output-Only Measurements

In the previous section, we demonstrated the model updating framework using both
the input–output information. However, often the accurate measurement of inputs is
not feasible. In such cases, the library of candidate functions becomes ill-conditioned,

and this leads to the selection of the wrong basis functions. Since, in this case, the
input information is assumed to be unavailable, we try to represent the underlying
governing physics in terms of the stochastic differential equations (SDEs) (Peter and
Eckhard 1992; Bernt 2013). To represent the systems in terms of SDEs we treat the
output measurements as a stochastic process and perform sparse Bayesian learning
in the purview of stochastic calculus. We again use the statespace to represent the
higher order dynamical systems in terms of the SDEs. Let the statespace be realized
by a map T : Rd → Rm that maps the d-dimensional system to the m-dimensional
SDEs with d < m. Then using T , any perturbed higher order system can be reduced
to the following SDEs:

Ẋ = f(Xt, t) [nominal model] + h(Xt, t) [perturbation] + g(Xt, t) ξ(t) [diffusion]   (7.35)

where f (X t , t) : Rm → Rm represents the dynamics of the nominal model,


h(X t , t) : Rm → Rm represents the nature of the perturbation, g (X t , t) : Rm →
Rm×n represents the volatility associated with the dynamics, and ξ (t) represents the
stochastic input often represented as white noise (Bernt 2013). To apply mathemat-
ical operations over the noise ξ (t), it is often appropriate to represent the above
equation through Itô SDEs, which arise naturally in nonlinear dynamical systems subjected to stochastic excitation such as earthquake, wind, and wave forces (Bernt 2013; Fima 2005). Let (Ω, F, P) be the complete probability space with {Ft; 0 ≤ t ≤ T} being the natural filtration constructed from sub-σ-algebras of the filtration F. Further note that white noises are generalized derivatives of Brownian motions, i.e. ξ(t) = Ḃ(t). Therefore, under (Ω, F, P), an m-dimensional n-factor SDE driven by the n-dimensional Brownian motion {Bj(t); j = 1, ..., n} can be written as
dXt = (f(Xt, t) + h(Xt, t)) dt + g(Xt, t) dB(t);  X(t = t0) = X0;  t ∈ [0, T]   (7.36)

Here X t ∈ Rm denotes the Ft -measurable state measurements. In the Itô calculus,


the first term within the bracket on the right-hand side is called the drift vector, the
second term is termed as diffusion matrix, and B j (t) ∈ Rn is known as Brownian
motion. In addition to classical methods (Peter and Eckhard 1992), many modern stochastic integration schemes for obtaining the solution of Eq. (7.36) have previously been proposed by the authors (Tapas et al. 2020, 2021a, b).
the nominal model f (X t , t) is known to us a priori with the help of which we aim to
learn the perturbation term h (X t , t) using the freshly obtained measurements. For
this, we first remove the information about the nominal model from the measured
signal using the following operations:

dZt = (f(Xt, t) + h(Xt, t)) dt + g(Xt, t) dB(t) − f(Xt, t) dt
    = h(Zt, t) dt + g(Zt, t) dB(t)   (7.37)

The above SDE is now a function of the nominal-model-removed state measurements and contains the information on (i) the perturbation in the drift and (ii) the diffusion. At this point, it is straightforward to understand that the discovery of the governing physics in terms of Eq. (7.36) requires the independent identification of the drift and diffusion components. In contrast to the diffusion term, the deterministic drift functions behave as smooth functions, i.e. they are assumed to be at least twice differentiable. Thus the drift components have finite variation. On the contrary, the stochastic Brownian motions are not differentiable everywhere with respect to the process Z(t). Due to this non-differentiability, the Brownian motions possess only quadratic variation; as a consequence, their convergence is defined in the mean square sense only.
Mathematically, let us consider the interval s ∈ [0, t] partitioned into n parts. If Zt denotes an arbitrary random process, then according to the Itô calculus, as n → ∞ the finite variation Vn(Z, t) := Σ_{i}^{n} |Z(s_i) − Z(s_{i−1})| → V(Z, t) and the quadratic variation Qn(Z, t) := Σ_{i}^{n} |Z(s_i) − Z(s_{i−1})|² → Q(Z, t) (Uwe Hassler et al. 2016). This suggests that if the sampling interval is sufficiently small, then the
drift and diffusion components of an SDE in Eq. (7.36) can be learned using only
the state measurements in terms of their linear and quadratic variations, respectively
(Hannes 1996). However, it should be noted that the diffusion components—(i) have
zero finite variations and (ii) are bounded by the quadratic variations. Thus, the dif-
fusion components are recoverable only through their covariation terms. Therefore,
we express the drift and diffusion components of the SDE in Eq. (7.36) in terms of
the state measurements as follows:
hi(Zt, t) = lim_{Δt→0} (1/Δt) E[Zi(t + Δt) − Zi(t)],  ∀ k = 1, 2, ..., N   (7.38a)

Γij(Zt, t) = (1/2) lim_{Δt→0} (1/Δt) E[|Zi(t + Δt) − Zi(t)| |Zj(t + Δt) − Zj(t)|],  ∀ k = 1, 2, ..., N   (7.38b)

where hi(Zt, t) is the ith drift component and Γij is the (ij)th component of the diffusion covariance matrix Γ ∈ R^{n×n} := (g g^T)(Zt, t). In order to derive the analytical
form of the drift and diffusion components from state measurements, we represent
the drift and diffusion as a linear superposition of candidate basis functions.
Let {ℓk(Zt); k = 1, ..., K} be the set of candidate library functions, where ℓk(Zt) represents the various linear and nonlinear mathematical functions defined with respect to the system states. We first construct the libraries L^f ∈ R^{N×K} and L^g ∈ R^{N×K} from the subsets {ℓ^f_k(Zt)} ⊆ {ℓk(Zt)} and {ℓ^g_k(Zt)} ⊆ {ℓk(Zt)} for the drift and diffusion, respectively. Then, we express the ith drift component and the (ij)th term of the diffusion covariance matrix as a linear superposition of the library functions as

hi(Zt, t) = θ^f_{i1} ℓ^f_1(Zt) + ··· + θ^f_{ik} ℓ^f_k(Zt) + ··· + θ^f_{iK} ℓ^f_K(Zt)   (7.39a)

Γij(Zt, t) = θ^g_{ij1} ℓ^g_1(Zt) + ··· + θ^g_{ijk} ℓ^g_k(Zt) + ··· + θ^g_{ijK} ℓ^g_K(Zt)   (7.39b)

where θ^f_{ik} and θ^g_{ijk} are the weights associated with the kth basis function of the ith drift and (ij)th diffusion covariance components, respectively. In a compact form, the above equations can be represented as

Y_i = L^f θ^f_i + ε_i   (7.40a)

Y_ij = L^g θ^g_ij + η_ij   (7.40b)

In the above equations, θ^f_i = [θ_{i1}, θ_{i2}, ..., θ_{iK}]^T and θ^g_{ij} = [θ_{ij1}, θ_{ij2}, ..., θ_{ijK}]^T correspond to the ith drift component and the (ij)th element of the diffusion covariance matrix, respectively. Similarly, the target vectors Y_i and Y_ij correspond to the ith drift component and the (ij)th component of the diffusion covariance matrix, respectively. The terms ε_i and η_ij represent the corresponding measurement error vectors. For the discovery, the target vectors are constructed using Eq. (7.38) as

Y_i = (1/Δt) [(Z_{i,1} − ξ_{i,1}), ..., (Z_{i,N} − ξ_{i,N})]^T   (7.41a)

Y_ij = (1/Δt) [{(Z_{i,1} − ξ_{i,1})(Z_{j,1} − ξ_{j,1})}, ..., {(Z_{i,N} − ξ_{i,N})(Z_{j,N} − ξ_{j,N})}]^T   (7.41b)

The straightforward application of the sparse Bayesian regression to the above equations directly yields (i) the perturbation terms in the drift and (ii) the diffusion, along with their parameters θ^f_i and θ^g_ij. The schematic representation for the model updating
using output-only measurements is provided in Fig. 7.21. The sparse Bayesian regres-
sion is briefly discussed in the following section, and for ease of understanding, an
algorithm is provided in Appendix A.
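For concreteness, a minimal numpy sketch of the target construction is given below; Z is assumed to be an array of the nominal-model-removed state samples at a uniform sampling interval dt, and the expressions follow the reconstruction of Eq. (7.41) above.

```python
# A minimal sketch of the Kramers-Moyal-type target construction of Eq. (7.41):
# linear and quadratic variations of the processed states over a small sampling
# interval give the drift and diffusion-covariance targets for the sparse
# regression. Z is an (N+1) x m array of state samples (illustrative).
import numpy as np

def km_targets(Z, dt):
    """Drift targets Y_i and diffusion-covariance targets Y_ij from state increments."""
    dZ = np.diff(Z, axis=0)                              # Z(t + dt) - Z(t)
    Y_drift = dZ / dt                                     # Eq. (7.41a), one column per state
    m = Z.shape[1]
    Y_diff = np.empty((dZ.shape[0], m, m))
    for i in range(m):
        for j in range(m):
            Y_diff[:, i, j] = dZ[:, i] * dZ[:, j] / dt    # Eq. (7.41b)
    return Y_drift, Y_diff
```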

7.6.3 Sparse Bayesian Regression

In sparse Bayesian regression, we consider a regression problem of the following


form:
Y = L(X, Ẋ) θ + ε,   (7.42)

where Y ∈ R^N denotes the N-dimensional target vector, L(·, ·) ∈ R^{N×K} denotes the library of candidate functions, X = [X, Ẋ] ∈ R^{N×m} is the matrix of state observations, θ ∈ R^K is the weight vector, and ε ∈ R^N is the residual error vector representing the
model mismatch error. It can be noted that the regression problems in Eqs. (7.40a)
and (7.40b) are identical to the problem in Eq. (7.42). Continuing, given the pairs

D = [X, Y ] the aim is to find the posterior distribution of the weight vector θ . For
estimating the posterior distribution of the weight vector θ , the Bayes formula is
utilized as follows:
P(θ | Y) = P(θ) P(Y | θ) / P(Y)   (7.43)

where P(θ | Y) is the posterior distribution of θ, P(θ) is the prior on θ, P(Y | θ) is the likelihood function, and P(Y) is the marginal likelihood of the data. The mismatch error ε is modeled as a vector of independent and identically distributed (i.i.d.) Gaussian random variables with zero mean and variance σ². With this information, the likelihood function is expressed as

P(Y | θ, σ²) = N(Lθ, σ² I_{N×N})   (7.44)

where I N ×N is the identity matrix. In the sparse Bayesian regression, the aim is
to obtain a sparse representation of the weight vector θ, which further renders the
resulting model interpretable. In the purview of the Bayesian regression, the sparsity
in the resulting model is introduced by assigning certain sparsity-promoting priors.
For a detailed review of the sparsity-promoting priors, the readers are referred to the
literature (Edward and Robert 1997; Robert and Mikko 2009). In this section, the
sparse Bayesian regression is demonstrated using the spike and slab prior. The spike
and slab prior is a mixture of two distributions, where the spike helps to shrink the
small values of weights to zero and the slab allows only the large values to escape
the shrinkage. For various models of spike and slab prior, the readers can refer to
the literature (Robert and Mikko 2009; Rajdip et al. 2021). In this section, the slab
is modeled as zero-mean Gaussian distribution with large variance and the spike
using the Dirac-delta function. Due to the presence of the Dirac-delta function, the
prior can be regarded as discontinuous spike and slab (DSS) prior. Since DSS is a
mixture of two distributions, for regression, the weights θ k ∈ θ need to be classified
into the spike and slab components. This is done by introducing a latent indicator
variable vector Z = [Z 1 , . . . , Z K ] for each of the component θk ∈ θ . The latent
indicator variable Z k takes a value of 1 or 0 depending on whether the corresponding
weight belongs to the slab or the spike, respectively. In the DSS-prior model, the weight components that belong to the spike do not contribute to the selection of basis functions from the library L(·, ·); therefore, a reduced weight vector θ_r ∈ R^r : {r ≪ K} is composed from the elements of the weight vector θ for which Z_k = 1. Using this reduced weight vector θ_r, the DSS-prior is defined as

p(θ | Z) = p_slab(θ_r) ∏_{k, Z_k = 0} p_spike(θ_k)   (7.45)

 
where p_spike(θ_k) = δ_0 and p_slab(θ_r) = N(0, σ² ϑ_s R_{0,r}). Here, δ_0 is the Dirac-delta function, ϑ_s is the slab variance, and R_{0,r} is the scaling matrix. If the correlation between the basis functions is ignored, the scaling matrix is taken as R_{0,r} = I_{r×r}; otherwise, the Fisher information matrix is used as R_{0,r} = N(D^T D)^{-1}.

In order to increase the accuracy and achieve faster convergence, the Bayes formula in Eq. (7.43) is further expanded to a hierarchical model by assuming priors on the noise variance σ², the slab variance ϑ_s, and the latent vector Z. The priors are selected based on the information from the conjugate priors as

p(ϑ_s) = IG(α_ϑ, β_ϑ)   (7.46)

p(Z_k | p_0) = Bern(p_0);  k = 1, ..., K   (7.47)

p(p_0) = Beta(α_p, β_p)   (7.48)

p(σ²) = IG(α_σ, β_σ)   (7.49)

where “IG” denotes the inverse-Gamma distribution, “Bern” denotes the Bernoulli
distribution, and “Beta” denotes the Beta distribution. The Bayesian hierarchical
model is shown in Fig. 7.22, where the hyperparameters α_ϑ, β_ϑ, α_p, β_p, α_σ, and β_σ are provided as deterministic constants. The joint posterior distribution is then
obtained from Fig. 7.22 as

p(θ, Z, ϑ_s, σ², p_0 | Y) = [p(Y | θ, σ²) p(θ | Z, ϑ_s, σ²) p(Z | p_0) p(ϑ_s) p(σ²) p(p_0)] / p(Y)   (7.50)

where p(θ, Z, ϑ_s, σ², p_0 | Y) denotes the joint posterior distribution, p(Y | θ, σ²) denotes the likelihood function, p(θ | Z, ϑ_s, σ²) is the prior distribution for the weight vector θ, p(Z | p_0) is the prior distribution for the latent vector Z, p(ϑ_s) is the prior distribution for the slab variance ϑ_s, p(σ²) is the prior distribution for the noise variance, p(p_0) is the prior distribution for the success probability p_0, and p(Y) is the marginal likelihood. Due to the presence of the Dirac-delta function in
p (Y ) is the marginal likelihood. Due to the presence of the Dirac-delta function in
the DSS-prior direct sampling from the joint distribution function is intractable in
this case. Thus, the Gibbs sampling technique is used to draw the random samples
from the joint distribution. For deriving the conditional distributions of the random
variables, the readers can refer (George and Edward 1992). The steps for sampling

Fig. 7.22 Hierarchical Bayesian network of the discontinuous spike and slab model for sparse linear regression. The variables in the green square boxes indicate the deterministic parameters and the variables in the white circles represent random variables

from the conditional distributions using Gibbs sampling are provided in Algorithm
2 in Appendix A.
Once the stationary distribution is reached, the marginal posterior inclusion prob-
ability (PIP) is estimated on the samples from Gibbs sampling. The PIP is denoted
as PIP:= p (Z k = 1|Y ), which measures the probability of participation of the corre-
sponding basis function in the updated model. A PIP = 1 will mean the corresponding
basis function is present in all the visited models, whereas a PIP = 0.5 will mean that the corresponding basis function occurs in at least half of the visited models. Let N_MC denote the length of the Markov Chain Monte Carlo (MCMC) chain required to achieve the stationary distribution after discarding the burn-in samples. Then the PIP is approximated for each of the K basis functions by estimating the mean over the N_MC Gibbs samples of the kth latent variable Z_k ∈ Z (Rajdip et al. 2021) as

p(Z_k = 1 | Y) ≈ (1/N_MC) Σ_{j=1}^{N_MC} Z_k^{(j)};  k = 1, ..., K   (7.51)

By selecting a desired threshold for the PIP values, the final updated model can be
derived from any pair of [X, Y ]. A higher PIP threshold will result in a highly sparse
model, whereas a lower PIP threshold will result in a model with a large number
of functions. Also, a higher PIP suggests that, in the case of unseen scenarios, the corresponding basis functions are highly likely to participate in representing the target vector. Being probabilistic in nature, the Bayesian algorithm also helps in
capturing the mean and covariance of the weights θk ∈ θ . In an unseen scenario, the
covariance information can be used to construct confidence intervals and to judge
the uncertainties associated with the updated model.
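As a small post-processing sketch (hypothetical basis names and synthetic indicator samples), the PIP of Eq. (7.51) is simply the column-wise mean of the sampled latent indicators, and a threshold then selects the basis functions retained in the updated model.

```python
# A minimal sketch of post-processing the Gibbs samples: the marginal posterior
# inclusion probability (PIP) of Eq. (7.51) and threshold-based basis selection.
import numpy as np

def select_basis(Z_samples, names, threshold=0.5):
    """Z_samples: (N_MC x K) array of 0/1 latent indicators from the Gibbs sampler."""
    pip = Z_samples.mean(axis=0)                           # Eq. (7.51)
    kept = [n for n, p in zip(names, pip) if p >= threshold]
    return pip, kept

# toy usage with hypothetical basis names and synthetic indicator draws
rng = np.random.default_rng(0)
Z_samples = (rng.random((2000, 4)) < np.array([0.98, 0.05, 0.92, 0.10])).astype(int)
pip, kept = select_basis(Z_samples, ["1", "x", "x^3", "sin(x)"])
```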

7.6.4 Numerical Examples

To illustrate the performance of the proposed approach, we considered a more sophis-


ticated and near-realistic problem that involves the discovery of crack degradation in
a linear dynamical system. The degradation of stiffness due to fatigue accumulation
during the vibration process is a real phenomenon and has great importance in engi-
neering practice. For simulating the degradation, we particularly adopted the model
proposed in Ref. Sobczyk (2006). With this, the governing motion equations of the
underlying problem are given as

m Ẍ1(t) + c Ẋ1(t) + kλ X1(t) = f(t)   (7.52a)

λ = α1 + α2 exp(−α3 q(t)^{α4})   (7.52b)

q̇(t) = γ (X1²(t) + Ẋ1²(t))^{β/2}   (7.52c)

Here m ∈ R, c ∈ R, and k ∈ R are the mass, damping, and stiffness parameters of the actual linear system. The scalar λ ∈ R characterizes the dependency of the stiffness on the degradation, and α1 ∈ R+, α2 ∈ R+, α3 ∈ R+, α4 ∈ R+ define the extent and rate of degradation. While Eq. (7.52a) represents the actual system, Eq. (7.52c) denotes the evolution of the degradation measure q(t). For more details, the readers are referred to Sobczyk (2006). In a similar fashion to the previous examples, here we aim to simultaneously discover Eq. (7.52c), the nonlinear degradation term kλ = k(α1 + α2 exp(−α3 q(t)^{α4})) in Eq. (7.52a), and the diffusion σ/m in Eq. (7.52a). Thus it can be noticed that although the initial system was linear, the identified system is highly nonlinear in nature. As explained in the previous examples, the forcing f(t) is simulated as zero-mean Gaussian white noise for framework-1 and as Brownian motion for framework-2.
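For reference, a minimal Euler–Maruyama sketch of the degradation model of Eq. (7.52) is given below; all parameter values are placeholders rather than those used to generate the results reported next, and the forcing is applied as white noise through the diffusion term.

```python
# A minimal Euler-Maruyama sketch (illustrative parameter values) of the
# crack-degradation model of Eq. (7.52): the stiffness is scaled by lambda,
# which depends on the accumulated degradation measure q(t).
import numpy as np

m, c, k, sigma = 1.0, 2.0, 1000.0, 1.0
a1, a2, a3, a4 = 0.5, 0.5, 1e-3, 1.0          # extent/rate of degradation (placeholders)
gamma, beta = 1e-2, 1.0
dt, n = 1e-3, 200_000
rng = np.random.default_rng(0)

x, v, q = 0.01, 0.0, 0.0
hist = np.empty((n, 3))
for i in range(n):
    lam = a1 + a2 * np.exp(-a3 * q ** a4)                      # Eq. (7.52b)
    dW = rng.normal(0.0, np.sqrt(dt))
    a = (-c * v - k * lam * x) / m
    x, v = x + v * dt, v + a * dt + (sigma / m) * dW           # Eq. (7.52a), f(t) as white noise
    q = q + gamma * (x ** 2 + v ** 2) ** (beta / 2) * dt       # Eq. (7.52c)
    hist[i] = (x, v, q)
```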
The performance of the digital twin is evaluated by its capability to (a) identify
the governing physics and (b) predict the response at a future time. Figure 7.23 shows the effectiveness of the proposed approach in identifying the correct library terms, and

(a) framework-1

(b) framework-2

Fig. 7.23 Basis function selection for the example considered



Table 7.2 Posterior mean and standard deviations of the selected basis functions

System | Basis function | Deterministic^a mean | Deterministic^a std. | Stochastic^† mean | Stochastic^† std.
Crack degradation: crack path | X² | 0.0099 | 5.29 × 10⁻⁵ | 0.0099 | 8.00 × 10⁻⁴
 | Ẋ² | 0.0100 | 4.02 × 10⁻⁸ | 0.0099 | 4.52 × 10⁻⁷
Crack degradation: system | α1 X | −1000 | 3.51 × 10⁻⁵ | −995.74 | 22.62
 | α2 e^(−α3 ψ^α4) X | −1000 | 1.00 × 10⁻⁴ | −1003.93 | 22.74
 | u | 1.00 | 1.39 × 10⁻⁵ | – | –
 | σ(Xt, t)^‡ | – | – | 1.00 | 0.0549

^a "Deterministic" refers to the case when both the input–output information are available; ^† "stochastic" refers to the case when only the output response is available. ^‡ Note that the diffusion terms are discovered in terms of their covariation, i.e. to discover the diffusion terms, one needs to perform the square root operation on the covariance matrix Γ.

Table 7.2 shows its efficacy in identifying the parametric values associated with the
library terms. Results obtained using both frameworks are shown. We observe that
for both cases, the proposed approach yields highly accurate results, as indicated by
the excellent match between the estimated mean and actual parametric values. The
posterior distribution of the parameters is shown in Fig. 7.24.
The predictive capability of the interpretable digital twin is shown in Fig. 7.25.
Results obtained using both frameworks are shown. In both cases, we observe that
the digital twin predicts highly accurate results, with the mean prediction matching
almost exactly with the ground truth.

Remark 7.4 As an alternative to the interpretable digital twin frameworks, one can opt for using a machine learning model to learn the unknown physics; for example, a framework similar to the one proposed in Garg et al. (2022) can be used. However, it was observed that the amount of data required in such a hybrid framework is generally more than for the interpretable version presented here. Also, as we learn the exact physics here, perpetual generalization is obtained; with such hybridization, generalization is compromised to some extent.

7.7 Conclusions

In this chapter, we discussed the concept of digital twins for dynamical systems. Out
of the four modules in digital twins ((a) visualization, (b) update, (c) prediction, and
(d) decision), this chapter particularly focused on the update and prediction module.
In particular, we discussed how purely physics-driven and gray-box models could be
used for updating a digital twin. While a purely physics-driven digital twin is prone
to noise in the data, a gray-box model-based digital twin is somewhat immune to
the noise in the data. Additionally, integrating Bayesian statistics and probabilistic machine learning algorithms makes a digital twin robust and aids in the technology's journey toward autonomy.

(a) framework-1

(b) framework-2

Fig. 7.24 Posterior distribution of the identified parameters
All physical systems have inherently associated randomness, and hence it is essen-
tial for a digital twin to account for the possible uncertainties in a system. We illus-
trated how to develop a digital twin for a stochastic dynamical system by using Itô
calculus, machine learning, and Kalman filtering approaches. Following the same
spirit as before, we recommend using probabilistic machine learning approaches as
the predictive uncertainty obtained in probabilistic machine learning has a huge role
to play in the development of digital twins. Case studies involving a seven-story
nonlinear stochastic dynamical system were presented to illustrate the applicability
and performance of the digital twin.
A digital twin is supposed to be a virtual replica of a physical system and is sup-
posed to emulate the evolution of the system throughout its lifetime. In most cases,
the evolution of system dynamics is represented as the evolution of the system param-
eters. Unfortunately, this is an approximate scenario as it is not only the parameters
but the governing equation itself that can change. Therefore, an important aspect of
the development of digital twins is the identification of model-form errors. One part
of this chapter is devoted toward developing digital twins for systems with misspec-

(a) Framework-1: displacement time series (b) Framework-2: displacement time series

(c) Framework-1: velocity time series (d) Framework-2: velocity time series

(e) Framework-1: crack path evolution (f) Framework-2: crack path evolution

Fig. 7.25 Predictive performance of the proposed predictive digital twin for example—3. a and
b Results for the DT using framework-1, where both the input–output observations are available. c
and d Results of the DT when only output measurements are feasible. The DT perfectly identifies the
perturbation terms along with their associated parameters. As a result, the prediction results match
almost perfectly with the actual system responses. However, when the models are updated using
only the output observations, the uncertainty in the predictions increases by some amount. This
ability to learn the uncertainties in the identified system parameters helps us to perform reliability
analysis on the systems designed using the proposed DT

ified physics. We illustrate how sparse Bayesian learning can be used for identifying
the missing terms and correcting/updating the digital twin. The applicability of the framework to both deterministic and stochastic systems is shown.

Appendix

A. Sparse Bayesian Regression

The pseudocode for sparse Bayesian regression is provided below.


Algorithm 2: Pseudocode of the sparse Bayesian regression
1 Input: State measurements X(t) ∈ R^{N×m} and hyperparameters α_p, β_p, α_σ, β_σ, α_ϑ, β_ϑ, p_0^{(0)}, ϑ_s^{(0)}.
2 Obtain the library L using the candidate basis functions.
3 Estimate the initial variance of the noise: σ^{2,(0)} = Var(Lθ − Y).
4 Estimate the initial latent vector Ψ^{(0)} = [ψ_1^{(0)}, ψ_2^{(0)}, ..., ψ_K^{(0)}] subject to arg min_θ MSE(Lθ − Y).
5 Estimate μ_θ^{(i)} = Σ_θ^{(i)} L_r^{(i)T} Y and Σ_θ^{(i)} = σ^{2(i)} (L_r^{(i)T} L_r^{(i)} + ϑ_s^{(i)−1} R_{0,r}^{(i)−1})^{−1}, and draw the initial weight vector θ_r^{(0)} from the Gaussian distribution with mean μ_θ^{(i)} and covariance Σ_θ^{(i)}, i.e. p(θ_r^{(i)} | Y, ϑ_s^{(i)}, σ^{2(i)}) = N(μ_θ^{(i)}, Σ_θ^{(i)}).
for i = 1, ..., MCMC do
6   Estimate u_k = p_0 / (p_0 + λ(1 − p_0)) with λ = p(Y | ψ_k^{(i)} = 0, Ψ_{−k}^{(i)}, ϑ_s^{(i)}) / p(Y | ψ_k^{(i)} = 1, Ψ_{−k}^{(i)}, ϑ_s^{(i)}).
7   Update the latent variable vector Ψ^{(i+1)} from the Bernoulli distribution: p(ψ_k^{(i+1)} | Y, ϑ_s^{(i)}, p_0^{(i)}) = Bern(u_k).
8   Update the noise variance σ^{2(i+1)} from the inverse-Gamma distribution: p(σ^{2(i+1)} | Y, Ψ^{(i+1)}, ϑ_s^{(i)}) = IG(α_σ + 0.5N, β_σ + 0.5(Y^T Y − μ_θ^{(i)T} Σ_θ^{(i)−1} μ_θ^{(i)})).
9   Update the slab variance ϑ_s^{(i+1)} from the inverse-Gamma distribution: p(ϑ_s^{(i+1)} | θ^{(i)}, Ψ^{(i+1)}, σ^{2(i+1)}) = IG(α_ϑ + 0.5 h_z, β_ϑ + (1/2) θ_r^{(i)T} R_{0,r}^{(i)−1} θ_r^{(i)}).
10  Estimate h_z = Σ_{k=1}^{K} ψ_k^{(i+1)} and update the success rate p_0^{(i)} from the Beta distribution: p(p_0^{(i+1)} | Ψ^{(i+1)}) = Beta(α_p + h_z, β_p + K − h_z).
11  Update the weight vector θ_r^{(i+1)} by sampling from the Gaussian distribution in step 5. Repeat steps 6–11.
12 end
13 Discard the burn-in MCMC samples and calculate the marginal PIP values: p(ψ_k = 1 | Y) ≈ (1/n_MC) Σ_{j=1}^{n_MC} ψ_k^{(j)}; k = 1, ..., K.
14 Select the basis functions in the final model with the desired PIP values.
15 Output: The mean μ_θ and covariance Σ_θ.

Chapter 8
Reduced Order Modeling

Zulkeefal Dar, Joan Baiges, and Ramon Codina

List of Acronyms

AE AutoEncoder
AMR Adaptive Mesh-Refinement
ANN Artificial Neural Network
BiLSTM Bidirectional Long Short-Term Memory
DEIM Discrete Empirical Interpolation Method
DMD Dynamic Mode Decomposition
FE Finite Element
FNN Feedforward Neural Network
FOM Full Order Model
LSTM Long Short-Term Memory
ML Machine Learning
NIROM Non-Intrusive Reduced Order Model
NN Neural Network
PDE Partial Differential Equation
PG Petrov–Galerkin
POD Proper Orthogonal Decomposition
POD-G Proper Orthogonal Decomposition based Galerkin projection
PROM Parametric Reduced Order Model

Z. Dar
International Center for Numerical Methods in Engineering, Barcelona, Spain
e-mail: [email protected]
J. Baiges · R. Codina (B)
Universitat Politècnica de Catalunya, Barcelona, Spain
e-mail: [email protected]
J. Baiges
e-mail: [email protected]


RIC Relative Information Content


ROM Reduced Order Model
SGS Subgrid Scales
SINDy Sparse Identification of Nonlinear Dynamics
SUPG Streamline-Upwind Petrov-Galerkin
SVD Singular Value Decomposition
VMS Variational MultiScale

8.1 Introduction

Partial Differential Equations (PDEs) provide a mathematical model for many processes
of significant importance occurring in nature. Hence, solving these PDEs is of prime
interest to researchers and engineers. Many high-fidelity solution techniques have been
developed which can solve the PDEs with the desired accuracy. However, these methods
are in general computationally expensive. The computational expense can become
particularly prohibitive in the case of optimization problems, as they require a large
number of simulations to be performed. Moreover, a control problem might require
solutions to PDEs in real time, which is seldom achievable given the high computational
cost associated with solving PDEs. Reduced order models (ROMs) offer an alternative
to high-fidelity models to obtain solutions with reasonable accuracy and at a reduced
computational cost. ROMs reduce the computational cost by approximating the
large-scale systems by much smaller ones.
Broadly speaking, any model which reduces the computational expense can be
regarded as a ROM. In the context of finite difference, finite element, or finite volume
discretizations, a ROM could mean using a coarser discretizing mesh (Baiges et al.
2020; Fabra et al. 2022). Similarly, using a larger time step for a time integration
scheme could also imply a ROM (Fabra et al. 2022). Furthermore, using simpli-
fied physics could be considered a ROM. However, in the scientific community,
ROMs are understood to represent a particular class of model order reduction, called
projection-based methods (see Remark 8.1). These methods involve finding a latent
low-dimensional space to represent the actual full order model (FOM) dynamics. The
motivation of such methods relies on the observation that even nonlinear dynamical
systems can exhibit patterns that can be used to characterize their behavior. Unsur-
prisingly, the origin of these methods can be traced back to identifying and studying
coherent structures in turbulent flows in fluid mechanics (Leask 1967). These meth-
ods are perhaps still the most commonly used in flow problems, and hence most of
the explanations and literature in this chapter will refer to flow problems.
Projection-based ROMs consist of two steps:
1. Offline step: Finding the low-dimensional representation of the FOM
2. Online step: Solving for the unknowns in the reduced ordered space.
The offline step is very computationally intensive but needs to be performed only once
or a few times. Once the offline step has been performed, the low-cost online step of
the ROM can be performed several times to solve an optimization
problem or a real-time control problem, thus providing large computational savings
and real-time solutions.
The implementation strategy of the online step of ROMs is used to classify them
into intrusive and non-intrusive ROMs. The intrusive ROMs are physics based in the
sense that they require the governing equations, and possible access to the computa-
tional code, to project the FOM onto a reduced order space to solve for the unknowns.
Non-intrusive ROMs, on the other hand, are purely data-driven and do not require
access to governing equations or computational code.
In the current data-centric era, Machine Learning (ML) has emerged as a viable
tool for reduced order modeling. ML has revolutionized a wide range of fields over
the past few decades. Scientific computing, in particular reduced order modeling, is
no exception. Although the interest in exploring ML techniques for reduced order
modeling is relatively new, it has already shown great potential by replacing in part,
or entirely, the offline and online steps.
A variety of conventional (see Remark 8.2) and ML-based techniques have been
developed to date for the offline and online phases and applied successfully in a variety
of contexts, e.g. solid mechanics (Daniel et al. 2020; Yvonnet and He 2007), material
science (Hunter et al. 2019), fluid mechanics (Baiges et al. 2013b, c; Burkardt et al.
2006; Galletti et al. 2004; Glaz et al. 2010; Kalashnikova and Barone 2011; Lucia
and Beran 2003; Pawar et al. 2019; Srinivasan et al. 2019; Wang et al. 2012), shape
optimization (Akhtar et al. 2010; Bergmann et al. 2007; Li et al. 2022; LeGresley and
Alonso 2000; Rozza et al. 2011), and flow control (Arian et al. 2000; Graham et al.
1999; Mohan and Gaitonde 2018; Noack et al. 2011) problems. The proper orthog-
onal decomposition-based Galerkin projection (POD-G) method can be considered
to be the most well-established and commonly used method for reduced order mod-
eling. This method uses proper orthogonal decomposition (POD) to find the basis,
called POD modes, of the low-dimensional space. The FOM is then projected in an
intrusive manner onto these POD modes using mostly Galerkin projection to solve
for the unknowns.
The organization of the chapter is as follows. Section 8.2 describes POD along
with its essential ingredient, the singular value decomposition (SVD). Section 8.3
describes the Galerkin projection, hyperreduction, and stabilization of POD-ROMs.
Section 8.4 describes the non-intrusive ROMs with a brief explanation of dynamic
mode decomposition (DMD). Section 8.5 deals with the description of parametric
ROMs. Finally, Sect. 8.6 describes the ML techniques used for the online and offline
phases of reduced order modeling.
Remark 8.1 In literature, the term reduced basis is also sometimes used for
projection-based methods. However, more commonly reduced basis methods are
meant to refer to a particular class of projection-based methods based on greedy
algorithms.1 This latter usage is also applicable in the context of this chapter.

1 Greedy algorithms are a class of algorithms that are based on choosing the option which produces
the largest immediate reward with the expectation that the successive application of greedy sampling
will lead to a global optimum. Greedy algorithms may use an error estimator to guide the sampling.

Remark 8.2 All data-driven dimension reduction techniques can be classified as ML


techniques in the broader sense. For example, POD and DMD can be classified as
unsupervised ML techniques. However, these techniques were originally developed
for dynamical systems based on mathematical arguments. Therefore, such techniques
are referred to as conventional and we do not group them under the umbrella of ML
techniques.

8.2 Proper Orthogonal Decomposition

The bases commonly used, e.g. piecewise-linear basis in the finite element (FE)
method, Fourier modes, etc. can solve a large number of dynamical systems, but these
bases are generic and do not correspond to the intrinsic modes of the systems which
they solve. Hence, a large number of such basis functions need to be used to capture
the solution. The intrinsic modes which form the basis of the solution space can
be found using proper generalized decomposition, reduced basis method or proper
orthogonal decomposition (POD), among others. Proper generalized decomposition
and reduced basis methods are commonly based on a greedy approach and a com-
prehensive review on them can be found in Chinesta et al. (2010, 2011), Hesthaven
et al. (2016), respectively. POD (Chatterjee 2000) is perhaps the most commonly
used method to find the basis, called POD modes in the context of POD.
For clarity, let us first introduce the concept of function and vector based descrip-
tion of a variable in the context of numerical methods. Suppose that the analysis
domain is spatially discretized using an interpolation-based method like finite ele-
ments, finite volumes, spectral elements, etc. The variable of interest can then either
be represented in the vectorial form as the collection of coefficients that multiply
the basis functions, or in a corresponding functional form which relies on the inter-
polation to define the variable over the entire domain. Throughout this chapter, the
functional representation is denoted using the lowercase letters, like q, and the vecto-
rial representation using the uppercase letters, like Q. In the case of Greek alphabets,
where the case of alphabets is not obvious, functional representation is denoted by
showing the dependence on the spatial coordinates x explicitly, like ζ(x). Also, a
variable with an underbar represents the variable in a general form, which could be
either functional or vectorial depending on the context. In the case of the FE method, the
vectorial and functional representations of a variable u are related as


$$u(x, t) = \sum_{k=1}^{nn} \chi_k(x)\, U_k(t)$$

where u(x, t) is the functional representation which depends on the spatial coordi-
nates x in addition to time t, U k the k−th element of the vector U, χ k (x) the FE
interpolation function for the k−th node and nn the total number of nodes.

Now, POD consists of representing a variable $\underline{u}$ as a linear combination of the POD modes as
$$\underline{u}(t, \mu) = \sum_{k=1}^{nb} \underline{\Phi}_k(\mu)\, U_r^k(t, \mu), \qquad (8.1)$$
where $\underline{\Phi}_k$ is the $k$th POD mode, $U_r^k$ the $k$th ROM coefficient, $nb$ the number of
basis vectors, and $\mu$ a parameter that characterizes the behavior of the system.
$\underline{u}$ and $\underline{\Phi}$ could be the functional or the vectorial representation, as required, but the
same representation must be used for both variables.
POD relies on a singular value decomposition (SVD) to find the basis. Let us
describe how the basis is determined using the SVD of the data generated using
PDEs.

8.2.1 Proper Orthogonal Decomposition Applied to Partial


Differential Equations

Let us consider a general unsteady nonlinear PDE describing the behavior of a real-valued
function u of n components and dependent on a parameter μ. The evolution of u in the
spatial domain $\Omega \subset \mathbb{R}^d$, d denoting the number of space dimensions of the problem, and
time interval $]0, t_f[$ is given by

$$\partial_t u(x, t, \mu) + \mathcal{N}(u(x, t, \mu)) = f(x, t, \mu) \quad \text{in } \Omega, \; t \in\, ]0, t_f[, \qquad (8.2)$$

where N is a nonlinear operator, f the forcing term and ∂t the time derivative.
Equation (8.2) is further provided with suitable boundary and initial conditions so
that the problem is well-posed. For simplicity, μ is considered to be fixed for now,
and hence u will not be stated explicitly as a function of the parameter μ from now
on till Sect. 8.5, when parametric ROMs are discussed.
After the advent of the computational era, the most commonly used technique to
solve (8.2) is to discretize it in space using a discretization technique, e.g. the FE
method. This discretization leads to a system of ordinary differential equations
(ODEs) which reads: find U :]0, tf [→ Rnp such that

M∂t U + N(U)U = F, (8.3)

where U is the vector of unknowns, M ∈ Rnp×np the mass-matrix, N(U) ∈ Rnp×np


the matrix representation of the nonlinear differential operator N, F ∈ Rnp the result-
ing forcing vector, and np the number of degrees of freedom. We shall denote as
B f = Rnp the FOM space.
Finally, a time integration technique can be applied to (8.3) to obtain a fully
discrete system, which for a given time-step reads

A(U)U = R, (8.4)

where A(U) ∈ Rnp×np is the nonlinear system matrix and R ∈ Rnp is the right-hand-
side which takes into account the contributions of the previous values of U as well.
Equation (8.4) can then be solved for nt time steps to get nt solution vectors.
Not all solution vectors are generally required to find the POD basis. Rather, it is
desired to select a minimal, but sufficient, number of solution vectors that contain all
the important dynamic features of the system. A simple approach for a uniform step
time integration scheme is to use solution vectors after every i-th time-step, where
i is a natural number (Baiges et al. 2015). Another approach could be to capture
each cycle of a periodic phenomenon using a certain number of solution vectors
(Mou et al. 2021). Suppose that we are able to gather a set of ns solution vectors,
called snapshots, {U 1 , U 2 , ..., U ns } carrying all the important features of the system
dynamics. To simplify the exposition, we have assumed that the snapshots correspond
to the first ns consecutive solution vectors, but this can be easily generalized. Note
that the term snapshots will be interchangeably used for the solution vectors, as well
as their mean-subtracted form discussed in Sect. 8.2.2. Once the solution set has been
gathered, the SVD is used to find the basis or POD modes.

8.2.2 Singular Value Decomposition

SVD, closely related to principal component analysis, is one of the most important
matrix factorization techniques used across many fields. The SVD of a matrix is
guaranteed to exist and can be considered unique for the purposes of basis generation.
To perform the SVD, it is customary to first subtract the mean value $\overline{U} \in \mathbb{R}^{np}$ from
the solution vectors $U_j$ to obtain $S_j = U_j - \overline{U}$, for $j = 1, 2, ..., ns$. The mean-
subtracted snapshots are then arranged into a matrix S ∈ Rnp×ns as follows:
$$S = \begin{bmatrix} S_1 & S_2 & \cdots & S_{ns} \end{bmatrix},$$

a tall skinny matrix since, in general, $ns \ll np$. It is now desired to find the basis of the
space $\mathcal{B} \subset \mathcal{B}_f$ to which these snapshots belong using the SVD. The SVD of $S$ gives

$$S = \Phi' \Sigma' V^T, \qquad (8.5)$$

where $\Phi' \in \mathbb{R}^{np \times np}$ contains the left singular vectors, $\Sigma' \in \mathbb{R}^{np \times ns}$ contains the singular
values and $V \in \mathbb{R}^{ns \times ns}$ contains the right singular vectors. Some important
properties of the SVD are listed below:
• Matrices $\Phi'$ and $V$ are orthogonal, i.e.

$$\Phi'^T \Phi' = \Phi' \Phi'^T = I_{np} \quad \text{and} \quad V^T V = V V^T = I_{ns}, \qquad (8.6)$$

where $I_k$ is the identity matrix in $\mathbb{R}^{k \times k}$.

• Matrix $\Sigma'$ contains non-negative values along the diagonal, arranged in decreasing
order, and zeros elsewhere, i.e.

$$\Sigma'_{ii} \ge \Sigma'_{jj} \ge 0, \;\; \forall\, i < j \quad \text{and} \quad \Sigma'_{ij} = 0, \;\; \forall\, i \ne j. \qquad (8.7)$$

Now, $\Sigma'$ has at most $ns$ non-zero values. Assuming such a case, it is possible to write
$\Sigma' = \begin{bmatrix} \Sigma \\ 0 \end{bmatrix}$, where $\Sigma \in \mathbb{R}^{ns \times ns}$. Using this, the full SVD (8.5) can be converted
to its economy or reduced SVD form as

$$S = \Phi' \begin{bmatrix} \Sigma \\ 0 \end{bmatrix} V^T = \Phi \Sigma V^T, \qquad (8.8)$$

where $\Phi \in \mathbb{R}^{np \times ns}$. The full and reduced versions of the SVD are shown in Fig. 8.1.
Note that (8.8) represents the exact decomposition of $S$. The set of columns $\{\Phi_j\}_{j=1}^{ns}$
of the matrix $\Phi$ represents the basis vectors of $\mathcal{B}$. So

$$\mathcal{B} = \mathrm{span}\{\Phi_1, \Phi_2, ..., \Phi_{ns}\}, \quad \text{where } \Phi_j \in \mathbb{R}^{np}, \text{ for } j = 1, ..., ns.$$

So, the dimension of the solution space has been reduced from $np$ to $ns$, with
$ns \ll np$. However, $ns$ could still be of the order of hundreds or even thousands and
could still be considered computationally demanding. So, to unlock the full potential
of ROMs in terms of computational savings, truncation is performed to yield a smaller
number of basis vectors than the one provided by the economy SVD.

Fig. 8.1 Matrix representation of the full and economy SVD. The lightening of the color of the circles
represents the ordered decrease of the diagonal values of $\Sigma'$ and $\Sigma$
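To make the preceding steps concrete, the following is a minimal NumPy sketch of assembling the mean-subtracted snapshot matrix and computing its economy SVD; the array shapes and the synthetic data are illustrative assumptions and not part of the original formulation.

import numpy as np

def pod_basis(U_snapshots):
    # U_snapshots: (np, ns) array whose columns are the solution vectors U^j
    U_bar = U_snapshots.mean(axis=1)                        # temporal mean of the snapshots
    S = U_snapshots - U_bar[:, None]                        # mean-subtracted snapshot matrix S
    Phi, sigma, VT = np.linalg.svd(S, full_matrices=False)  # economy SVD: S = Phi diag(sigma) V^T
    return U_bar, Phi, sigma

# illustrative synthetic data: 1000 degrees of freedom, 50 snapshots of low numerical rank
rng = np.random.default_rng(0)
snapshots = rng.standard_normal((1000, 5)) @ rng.standard_normal((5, 50))
U_bar, Phi, sigma = pod_basis(snapshots)
print(Phi.shape, sigma[:6])   # only a few singular values are significantly non-zero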

8.2.2.1 Truncated SVD

As discussed above, in practice, the basis is truncated to get $r < ns$ basis
vectors. This leads to a reduced space $\mathcal{B}_r \subset \mathcal{B} \subset \mathcal{B}_f$, where $\dim \mathcal{B}_r = r$, $\dim \mathcal{B} = ns$,
$\dim \mathcal{B}_f = np$, and $r < ns \ll np$. The truncation is motivated by the ordered
decreasing singular values in $\Sigma$, Property (8.7). A singular value $\Sigma_{ii}$ represents
the amount of energy or information with which the corresponding basis vector $\Phi_i$
contributes toward the solution. This contribution can be quantified using the relative
information content (RIC), given by

$$\mathrm{RIC} = \frac{\sum_{k=1}^{r} \Sigma_{kk}}{\sum_{k=1}^{ns} \Sigma_{kk}}.$$

The singular and RIC values for the classical problem of the flow over a cylinder
approximated using the FE method are shown in Fig. 8.2. It can be seen that only a
few initial values contain most of the energy. So, instead of using 150 basis vectors,
it is possible to use just a few of them to describe the flow dynamics around the
cylinder. Based on the reasoning discussed above, RIC is widely used as a truncation
criterion. The number of POD modes can then be decided such that RIC is equal
to a desired value, e.g. 0.9, meaning that the POD modes will retain 90% of the
information. This truncation can be represented as a truncated SVD as

$$S \approx \hat{S} = \hat{\Phi} \hat{\Sigma} \hat{V}^T, \qquad (8.9)$$

where $\hat{\Phi} \in \mathbb{R}^{np \times r}$ contains the first $r$ columns of $\Phi$, $\hat{\Sigma} \in \mathbb{R}^{r \times r}$ contains the top left
$r \times r$ block of $\Sigma$, and $\hat{V} \in \mathbb{R}^{ns \times r}$ contains the first $r$ columns of $V$. Note that the truncated
SVD only approximates the matrix $S$, as shown in (8.9). However, the truncated SVD is
guaranteed to give the optimal approximation of $S$ in the low-dimensional space, as
guaranteed by the Eckart–Young theorem (Eckart and Young 1936).

Fig. 8.2 Singular and RIC values for the flow over a cylinder. The singular values are represented
on a log scale for better visualization
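As a complement, a short sketch of the RIC-based truncation criterion is given below; it reuses the names sigma and Phi from the previous sketch (assumed variable names) and follows the RIC definition above, i.e. a sum of singular values.

import numpy as np

def truncate_by_ric(sigma, ric_target=0.9):
    # RIC(r) = sum_{k<=r} sigma_k / sum_k sigma_k
    ric = np.cumsum(sigma) / np.sum(sigma)
    r = int(np.searchsorted(ric, ric_target)) + 1   # smallest r with RIC >= ric_target
    return r, ric

# keep the modes retaining 90% of the information
r, ric = truncate_by_ric(sigma, ric_target=0.90)
Phi_hat = Phi[:, :r]                                # truncated POD basis, np x r
print(r, ric[r - 1])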

Theorem 8.1 The rank-$r$ truncated SVD $\hat{S}$ provides the best rank-$r$ approximation
to $S$ in the $L_2$ sense, i.e.

$$\underset{\hat{S} \text{ of rank } r}{\arg\min} \; \big\| S - \hat{S} \big\|_2 = \hat{\Phi} \hat{\Sigma} \hat{V}^T.$$

Thus, the presence of patterns in the high-dimensional data, shown by rapidly


decreasing singular values, and the optimality of SVD guaranteed by the Eckart–
Young theorem have resulted in the wide usage of POD to find the basis of the
reduced spaces.

8.2.2.2 SVD for Functions

The SVD (8.9) corresponds to solving the problem of minimizing

$$J(\hat{\Phi}_1, ..., \hat{\Phi}_r) = \sum_{i=1}^{ns} \Big\| S^i - \sum_{j=1}^{r} \big( S^{iT} \hat{\Phi}_j \big) \hat{\Phi}_j \Big\|^2_{\mathbb{R}^{np}}, \quad \text{subject to } \hat{\Phi}_i^T \hat{\Phi}_j = \delta_{ij}. \qquad (8.10)$$
Many times we are dealing with functions, e.g. when using FE methods, defined
over the entire domain Ω, and not vectors corresponding to the degrees of freedom
of the approximation. In such cases, it could be desired for the properties to hold in
the functional (continuous) sense rather than in the algebraic (discrete) sense. So,
let us suppose that instead of solving the minimization problem (8.10), it is desired
to minimize its functional counterpart
$$j(\hat{\phi}_1(x), ..., \hat{\phi}_r(x)) = \sum_{i=1}^{ns} \Big\| s^i(x) - \sum_{j=1}^{r} \Big( \int_\Omega s^i(x)\, \hat{\phi}_j(x) \Big) \hat{\phi}_j(x) \Big\|^2_{L^2(\Omega)},$$

$$\text{subject to } \int_\Omega \hat{\phi}_i(x)\, \hat{\phi}_j(x) = \delta_{ij}, \qquad (8.11)$$




where $s^i(x)$ is the functional form of the vector $S_i$, $i = 1, ..., ns$, and $\hat{\phi}_j(x)$,
$j = 1, ..., r$, are the basis functions, which are $L^2(\Omega)$-orthogonal. Note that (8.11)
minimizes the difference over the entire domain $\Omega$. For the sake of clarity, we have
considered that the unknown of the problem is a scalar function, and so are the snapshots
and the basis, but the extension to the vector case is straightforward. Also, note
that the difference between $\hat{\Phi}$ and $\hat{\phi}(x)$ is not only one of vectorial versus functional
representation; rather, $\hat{\Phi}$ and $\hat{\phi}(x)$ are two different bases having different orthogonality
properties. The functional SVD (8.11) can be shown to be the same as minimizing

$$J(\hat{\Phi}_1, ..., \hat{\Phi}_r) = \sum_{i=1}^{ns} \Big\| M^{1/2} S^i - \sum_{j=1}^{r} \big( (M^{1/2} S^i)^T M^{1/2} \hat{\Phi}_j \big)\, M^{1/2} \hat{\Phi}_j \Big\|^2_{\mathbb{R}^{np}},$$

$$\text{subject to } \hat{\Phi}^T M \hat{\Phi} = I_r, \qquad (8.12)$$

where $M$ is the mass matrix as in (8.3) and $I_r \in \mathbb{R}^{r \times r}$ is the identity matrix. Note that
the $L^2(\Omega)$-orthogonality of the basis functions $\hat{\phi}(x)$ translates into orthogonality of the
corresponding basis vectors $\hat{\Phi}$ with respect to the mass matrix $M$, i.e. $\hat{\Phi}^T M \hat{\Phi} = I_r$.
To find $\hat{\Phi}$, we perform the SVD of $\tilde{S} = M^{1/2} S$:

$$\tilde{S} = \tilde{\Phi} \hat{\Sigma} \hat{V}^T,$$

and the desired basis can be recovered as

$$\hat{\Phi} = M^{-1/2} \tilde{\Phi}.$$

Using the functional SVD produced more accurate results in Dar et al. (2023).
Note, however, that throughout this chapter the SVD term will refer to the one that
solves problem (8.10), unless stated otherwise, e.g. in Sect. 8.3.3.3.
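The following minimal sketch illustrates the mass-weighted SVD just described, assuming, purely for illustration, a lumped (diagonal) mass matrix so that the action of M^(1/2) and M^(-1/2) is trivial; all names are assumptions, not part of the original text.

import numpy as np

def weighted_pod_basis(S, M_diag):
    # POD basis orthogonal with respect to a diagonal mass matrix: Phi_hat^T M Phi_hat = I
    # S: (np, ns) mean-subtracted snapshots; M_diag: diagonal of the lumped mass matrix
    Mh = np.sqrt(M_diag)
    S_tilde = Mh[:, None] * S                                  # M^(1/2) S
    Phi_tilde, sigma, VT = np.linalg.svd(S_tilde, full_matrices=False)
    Phi_hat = Phi_tilde / Mh[:, None]                          # Phi_hat = M^(-1/2) Phi_tilde
    return Phi_hat, sigma

# quick check of M-orthogonality on random data
rng = np.random.default_rng(1)
S = rng.standard_normal((200, 12))
M_diag = rng.uniform(0.5, 2.0, size=200)
Phi_hat, _ = weighted_pod_basis(S, M_diag)
print(np.allclose(Phi_hat.T @ (M_diag[:, None] * Phi_hat), np.eye(12)))   # True up to round-off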

8.3 Reduced Order Modeling Using Proper Orthogonal


Decomposition

As discussed in Sect. 8.1, reduced order modeling consists of offline and online
steps. The offline step of finding the reduced order basis is discussed in Sect. 8.2.
Now we describe the most commonly used method for the online step, the Galerkin
projection, to find the ROM coefficients in (8.1).

8.3.1 Galerkin Projection

As $\hat{\Phi}$ forms the basis of the reduced solution space $\mathcal{B}_r$, and was calculated from
mean-subtracted snapshots, decomposition (8.1) can be written as

$$U \approx \hat{\Phi} U_r + \overline{U}, \qquad (8.13)$$

where the ROM coefficients $U_r \in \mathbb{R}^r$ are the components of $U$ in $\mathcal{B}_r$ expressed in
the reference system defined by $\hat{\Phi}$. Given $U_r$, $U$ can be found using (8.13). Let us
insert (8.13) in the original matrix system (8.4). Omitting the explicit dependence of
$A$ on $U$ from the notation, we can write (8.4) as
$$A U \approx A \big( \hat{\Phi} U_r + \overline{U} \big) = R.$$

Taking the known terms to the right-hand side, we get

$$A \hat{\Phi} U_r = R - A \overline{U}. \qquad (8.14)$$

This leads to an $np \times r$ over-determined system. Let us assume $A$ to be symmetric
and positive definite (SPD). A least-squares strategy for approximating (8.14) with
respect to the norm induced by $A^{-1}$, as described in Bui-Thanh et al. (2008), Carlberg
et al. (2011), leads to

$$\hat{\Phi}^T A \hat{\Phi}\, U_r = \hat{\Phi}^T \big( R - A \overline{U} \big), \qquad (8.15)$$

which is the Galerkin projection of the full order system (8.4) onto the reduced space.
Let us write (8.15) compactly as

$$A_r U_r = R_r, \qquad (8.16)$$

where

$$A_r := \hat{\Phi}^T A \hat{\Phi} \in \mathbb{R}^{r \times r}, \qquad R_r := \hat{\Phi}^T \big( R - A \overline{U} \big) \in \mathbb{R}^{r}.$$

Applicable for general matrices $A$, the so-called Petrov–Galerkin (PG) projection is
found to provide more stable results than the Galerkin projection in the case of $A$ not
being an SPD matrix (Carlberg et al. 2011). Using $\hat{\Phi}^T A^T$ as a suitable PG projector,
the PG reduced order form of (8.4) is given by

$$A_{rA} U_r = R_{rA}, \qquad (8.17)$$

where now

$$A_{rA} := \hat{\Phi}^T A^T A \hat{\Phi} \in \mathbb{R}^{r \times r}, \qquad R_{rA} := \hat{\Phi}^T A^T \big( R - A \overline{U} \big) \in \mathbb{R}^{r}.$$

This corresponds to a least squares strategy for solving (8.14) with respect to the
standard Euclidean norm in Rnp . Irrespective of the type of projection used, both
the final reduced order systems (8.16) and (8.17) are r × r systems as opposed to
the full order system (8.4) of size np × np, with r ≪ np. Thus, the reduced order
system can be solved at a fraction of the cost of the full order system. All the concepts
described later apply to both the Galerkin and PG ROMs; however, for simplicity,
we will describe them using the Galerkin-ROM (8.15).
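A minimal dense-matrix sketch of the Galerkin and Petrov–Galerkin reduced solves (8.15)–(8.17) is given below; it assumes the full order matrix A, right-hand side R, snapshot mean U_bar and truncated basis Phi_hat are available as NumPy arrays at a given nonlinear iteration (all names are illustrative assumptions).

import numpy as np

def solve_galerkin_rom(A, R, Phi_hat, U_bar):
    # Galerkin ROM (8.15)-(8.16): (Phi^T A Phi) U_r = Phi^T (R - A U_bar)
    A_r = Phi_hat.T @ A @ Phi_hat                 # r x r reduced matrix
    R_r = Phi_hat.T @ (R - A @ U_bar)             # reduced right-hand side
    U_r = np.linalg.solve(A_r, R_r)
    return Phi_hat @ U_r + U_bar                  # reconstructed full-order approximation

def solve_pg_rom(A, R, Phi_hat, U_bar):
    # Petrov-Galerkin (least-squares) counterpart (8.17)
    APhi = A @ Phi_hat
    U_r = np.linalg.solve(APhi.T @ APhi, APhi.T @ (R - A @ U_bar))
    return Phi_hat @ U_r + U_bar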

8.3.2 Hyperreduction

The ROM discussed above can be solved at a reduced computational expense. How-
ever, assembling the system matrices has a cost of the same order as that of the FOM.
For linear problems, the assembling of matrices needs to be done once, and hence, is
not considered a bottleneck to achieving reduced computation times. However, for
nonlinear problems, the system matrices need to be assembled for every nonlinear
iteration, i.e. multiple times for every time-step in general, and will lead to a signif-
icant cost. Thus, it is important to use some techniques to determine the nonlinear
terms at a reduced cost. This is achieved using hyperreduction techniques and the
resulting models are called hyper-ROMs. There are many methods used for hyperre-
duction including, but not limited to, empirical interpolation method (Barrault et al.
2004) or its discrete version discrete empirical interpolation method (DEIM) (Chat-
urantabut and Sorensen 2010), Gauss–Newton with approximate tensors (Carlberg
et al. 2011), missing point estimator approach (Astrid et al. 2008), cubature-based
approximation method (An et al. 2008), energy conserving sampling and weighting
method (Farhat et al. 2015), and adaptive mesh refinement (AMR)-based hyperre-
duction (Reyes and Codina 2020). Here we briefly describe DEIM- and AMR-based
hyperreduction.

Remark 8.3 In the case of a polynomial nonlinearity in general, and quadratic non-
linearity in particular, the reduced nonlinear operator can be written as a tensor which
is not a function of U r , and hence needs to be computed just once. However, hyper-
reduction techniques, like DEIM and AMR, described in this chapter are applicable
in a broader context.

8.3.2.1 Discrete Empirical Interpolation Method

DEIM is a greedy algorithm and its origin can be traced back to the gappy POD
method (Everson and Sirovich 1995) which was originally designed for image recon-
struction. Just as ROM approximates the solution space by a subspace, DEIM does
the same but for nonlinear terms only. However, DEIM uses interpolation indices to
find the temporal coefficients instead of solving the reduced system.
Let us denote the vector of nonlinear terms as N(θ ) ∈ Rnp , depending on θ . θ can
represent time t or any other parameter in the case of parametric ROMs. However,
here we explain DEIM in the context of non-parametric nonlinear ROMs with θ = t.
For DEIM applied to parametric ROMs, see Antil et al. (2014). DEIM proposes
approximating the space to which the N belongs by a subspace of lower dimension
s, i.e. s ≪ np, and not necessarily equal to the dimension r of the ROM space. Let
this subspace be spanned by the basis B = [B 1 , ..., B s ] ∈ Rnp×s . Thus, we can write

N(t) ≈ Bd(t) (8.18)



where d(t) is the vector of coefficients. For simplicity, from now on, the dependence
on t will be omitted from the notation.
An efficient way to determine d is to sample s spatial points and use them to
determine d. This can be achieved using a sampling matrix H defined as

$$H = [H_{s_1}, \dots, H_{s_s}] \in \mathbb{R}^{np \times s},$$

where H s j = [0, ..., 0, 1, 0, ...0]T ∈ Rnp , for j = 1, ..., s, is the s j th column of the
identity matrix I np ∈ Rnp×np . Using the sampling matrix H, we can write

H T N = H T Bd.

Suppose H T B is non-singular, thus leading to a unique solution of d as

d = (H T B)−1 H T N. (8.19)

Using (8.19), we can write (8.18) as

N ≈ B(H T B)−1 H T N. (8.20)

Now we need to define the basis B and the sampling points s j , j = 1, ..., s, called
interpolation indices in DEIM, to approximate N using (8.20) at a reduced cost. The
basis B is found using POD for the nonlinear vector N. During the simulations, the
nonlinear vectors at different time steps are gathered to form a snapshot matrix S N
of nonlinear terms as

$$S_N = \begin{bmatrix} N^1 & N^2 & \cdots & N^{ns} \end{bmatrix}.$$

The truncated SVD of rank $s$ is then performed as

$$\hat{S}_N = B\, \Sigma_N V_N^T \qquad (8.21)$$

to obtain the basis B of the desired order. The interpolation indices are then selected
iteratively using the basis B. This approach is shown in Algorithm 1, where [|ρ| y]
= max{|X|} means that y is the index of the maximum value of the components of
vector X = [X 1 , ..., X np ], i.e. X y ≥ X z , for z = 1, ..., np. The smallest y is taken if
more than one component corresponds to the maximum value.

8.3.2.2 Adaptive Mesh-Refinement-Based Hyperreduction

AMR-based hyperreduction proposes calculating the nonlinear terms on a mesh


coarser than the one used for the FOM. The points of the coarser mesh are located
using AMR. AMR-based hyperreduction aims at concentrating the mesh in the

Algorithm 1 DEIM algorithm for the selection of interpolation indices


INPUT : Basis vectors {B 1 , ..., B s }
OUTPUT : Interpolation indices s = s1 , ..., ss
1: [|ρ| s1 ] = max{|B 1 |} find the first index
2: B = [B1 ], H = [H s1 ] initialize matrices based on the first value
3: for k = 2; k ≤ s; k + 1 do loop to find successively the remaining indices
4: H T Bd = H T B k solve for d
5: Rk = B k − Bd calculate residual Rk
6: [|ρ| sk ] = max{|Rk |} find the index sk where the residual has the maximum value
7: B ← [B B k ], H ← [H H sk ] update the matrices for the next iteration
8: end for

regions of more intense physical activity and coarsening it everywhere else such that the
overall number of degrees of freedom is reduced. AMR uses an a posteriori error estimator
to identify these regions of higher activity. In Reyes and Codina (2020), the mesh
was coarsened such that the total error, in a certain norm, remained approximately the
same before and after hyperreduction. An a posteriori residual-based error estimator
was used, and a coarse mesh containing 80% fewer degrees of freedom was achieved,
giving results with a negligible error. Numerical analysis of the error estimator was
also performed in Codina et al. (2021) and it was shown that the estimator provides
an upper bound for the true error and has the correct numerical behavior.

8.3.3 Stabilization Using Variational Multiscale Methods

Instabilities can arise when PDEs are solved using numerical methods, usually in
singular perturbation problems or when the approximation spaces of the different
unknowns need to satisfy compatibility conditions. This issue is further exacerbated
when POD-G is used to develop a ROM. This has to do with the fact that the ROM
does not account for the impact of the FOM scales that are not captured by the
low-order space. This problem is well-known in other computational mechanics set-
tings, such as finite elements, where stabilized formulations have been developed to
address the instability of the Galerkin projection. The Variational Multiscale (VMS)
framework, originally proposed in Hughes et al. (1998), is a popular framework used
to develop stabilized formulations taking into account the effect of the discarded
scales in a multiscale problem. A comprehensive review of VMS-based stabilization
methods developed for fluid problems is provided in Codina et al. (2018). Inspired
by this, VMS-based stabilization methods have been developed for projection-based
ROMs (Reyes and Codina 2020) and successfully applied in the context of flow
problems (Reyes et al. 2018), fluid-structure interaction (Tello et al. 2020; Tello and
Codina 2021), and adaptive mesh-based hyperreduction (Reyes and Codina 2020).
A comprehensive description of it is provided in Codina et al. (2021), Reyes and
Codina (2020). However, a summary of the method, which uses the same VMS for-
mulation to stabilize both FOM and ROM, is presented here for completeness. Let

us describe the formulation using a general unsteady nonlinear convection–diffusion–reaction
transport equation.

8.3.3.1 Variational Problem

Let us consider again problem (8.2) and write it in a slightly modified form, along
with the boundary and initial conditions. Let the boundary of the domain $\Omega$ be
split into non-overlapping Dirichlet, $\Gamma_D$, and Neumann, $\Gamma_N$, parts. Given the initial
condition for the unknown, $u^0$, the problem aims at finding $u$ of $n$ components that
satisfies

$$\partial_t u + \mathcal{N}(u; u) = f \quad \text{in } \Omega, \; t \in\, ]0, t_f[,$$
$$\mathcal{D} u = \mathcal{D} u_0 \quad \text{on } \Gamma_D, \; t \in\, ]0, t_f[,$$
$$\mathcal{F}(u; u) = f_N \quad \text{on } \Gamma_N, \; t \in\, ]0, t_f[,$$
$$u = u^0 \quad \text{in } \Omega, \; t = 0,$$

where u0 is the prescribed Dirichlet boundary condition, D the Dirichlet operator,


f N the prescribed Neumann boundary condition and F the flux operator. Let us
define N as a general nonlinear operator of second order using Einstein’s notation

N(u; y) := −∂i (K i j (u)∂ j y) + A f,i (u)∂i y + Ac,i (u)∂i y + S(u) y,

where K i j , Ac,i , A f,i , and S are matrices in Rn×n and are a function of u, ∂i
denotes differentiation with respect to the i-th Cartesian coordinate xi and indexes
i, j = 1, ..., d. Let us also define the flux operator F using Einstein’s notation as

F (u; y) := n i K i j (u)∂ j y − n i A f,i (u) y,

where n is the external unit normal to the boundary with n i being its i-th component.
To write the weak form of the problem, let the integral of the product of two
functions $f$ and $g$ over a domain $\omega$ be denoted by $\langle f, g \rangle_\omega$. For simplicity, the
subscript $\omega$ is omitted in the case $\omega = \Omega$. Let us also introduce the form $B_\omega$ and the
linear form $L_\omega$ as

$$B_\omega(u; y, v) := \langle \partial_i v, K_{ij}(u) \partial_j y \rangle_\omega + \langle v, A_{c,i}(u) \partial_i y \rangle_\omega + \langle \partial_i (A_{f,i}^T(u) v), y \rangle_\omega + \langle v, S(u) y \rangle_\omega,$$
$$L_\omega(v) := \langle v, f \rangle_\omega + \langle v, f_N \rangle_{\Gamma_N}.$$

Let u(., t) and v belong to the space Bc , the solution space of the continuous problem.
The weak form of the problem (in space) consists of finding u :]0, tf [→ Bc such that

$$\langle \partial_t u, v \rangle + B(u; u, v) = L(v), \qquad (8.22)$$
$$\langle u, v \rangle = \langle u^0, v \rangle, \quad \text{at } t = 0, \qquad (8.23)$$

for all $v \in \mathcal{B}_{c,0}$, where $\mathcal{B}_{c,0}$ is the space of time-independent test functions that satisfy
$\mathcal{D} v = 0$ on $\Gamma_D$. For simplicity, we assume in what follows homogeneous Dirichlet
conditions, so that $v \in \mathcal{B}_c = \mathcal{B}_{c,0}$.

8.3.3.2 Variational Multiscale Full Order Model Approximation

VMS method can be applied to other discretization techniques, but in what follows
we shall concentrate on the FE method. Thus, let us discretize the domain using
FEs. Let Ph = {K } be a FE partition of the domain , assumed quasi-uniform for
simplicity, with elements of size h. From this, a conforming FE space Bh ⊂ Bc may
be constructed using a standard approach. Note that now Bh = B f , i.e. the FE space
is a particular realization of the FOM space introduced earlier.
Any time integration scheme may be used for time discretization. For conciseness,
we shall assume that a backward difference scheme is employed with a uniform time
step $\Delta t$ and that the time discretization is represented by replacing $\partial_t$ with $\delta_t$, where $\delta_t$
involves $u_h^n, u_h^{n-1}, ...$, depending on the order of the scheme used. Using a superscript
n for the time step counter, and a subscript h for FE quantities, the fully discretized
Galerkin approximation of problem (8.22) is to find {unh } ∈ Bh , for n = 1, ..., nt, nt
being the number of time steps, that satisfy

$$\langle \delta_t u_h, v_h \rangle + B(u_h; u_h, v_h) = L(v_h), \quad \forall\, v_h \in \mathcal{B}_h,$$

where we have omitted the initial conditions and the time step superscript for sim-
plicity. This problem may suffer from instabilities, and hence requires the use of
stabilization methods like those based on VMS method.
The core idea of the VMS approach lies in the splitting $\mathcal{B}_c = \mathcal{B}_h \oplus \mathcal{B}'$, where $\mathcal{B}'$ is
any space that completes $\mathcal{B}_h$ in $\mathcal{B}_c$. We call $\mathcal{B}'$ the space of subgrid scales or subscales
(SGS), and the functions in the SGS spaces will be identified with a prime.
Using the splitting $u = u_h + u'$ and similarly for the test function $v = v_h + v'$, the
continuous problem (8.22) splits into

$$\langle \delta_t (u_h + u'), v_h \rangle + B(u_h + u'; u_h + u', v_h) = L(v_h), \quad \forall\, v_h \in \mathcal{B}_h, \qquad (8.24)$$
$$\langle \delta_t (u_h + u'), v' \rangle + B(u_h + u'; u_h + u', v') = L(v'), \quad \forall\, v' \in \mathcal{B}', \qquad (8.25)$$

which is exactly equivalent to (8.22) as no assumptions have been made so far. We


want to find the SGSs u′ using (8.25) and plug them into (8.24) to account for
their effect on uh . To achieve this, several assumptions are made, which are briefly
discussed below (see Codina et al. 2018).

Important Considerations/Assumptions
• Choosing the subscale space $\mathcal{B}'$. In fact, the approximation to $\mathcal{B}'$ will be a consequence
of the approximation to $u'$. The choice of SGS space leads to algebraic
subgrid scales (Codina 2000a) or orthogonal subgrid scales, among other possibilities
(Codina 2000b).
• While expanding (8.25), we come across the application of the operator $\mathcal{N}$ to the
subscales, as $\mathcal{N}(u; u')$. Because the subscale problem is infinite dimensional, the
following key approximation is used:

$$\mathcal{N}(u; u')\big|_K \approx \tau_K^{-1}(u)\, u'\big|_K,$$

where $\tau_K$ is a matrix of stabilization parameters that approximates the inverse
of the differential operator on each element $K$. Different approximations for $\tau_K$
yield different VMS methods.
• Taking $\delta_t u'$ into account or not. Taking it into consideration yields dynamic SGSs
(Codina et al. 2007), whereas assuming $\delta_t u' = 0$ results in quasi-static SGSs.
• For $B(u; y, v)$, $u$ can be approximated as $u \approx u_h$ instead of $u = u_h + u'$ (this can
be relaxed, see Codina 2002). The first approach is known as linear SGSs and the
latter as nonlinear SGSs, in accordance with the inclusion of SGSs in only the linear
or in both the linear and nonlinear terms, respectively. The $u$ representing the nonlinearity
will be replaced by $u^*$ to represent any of the above possibilities in a general way.
• Approximate the SGSs in the interior of the elements only or consider their contributions
on the interelement boundaries as well (Baiges and Codina 2013a; Codina
and Baiges 2011). The use of interelement boundary SGSs becomes crucial when
discontinuous interpolations are used for some components of the unknown (Codina
et al. 2009) or when SGSs are used as an a posteriori error estimator (Codina et al.
2021).
The final problem can be written as finding $u_h^n \in \mathcal{B}_h$, for $n = 1, ..., nt$, that satisfy

$$\langle \delta_t u_h, v_h \rangle + B_h(u^*; u_h, v_h) = L_h(u^*; v_h), \quad \forall\, v_h \in \mathcal{B}_h, \qquad (8.26)$$

with the forms $B_h$ and $L_h$ given as

$$B_h(u^*; u_h, v_h) = B(u^*; u_h, v_h) + B'(u^*; u_h, v_h), \qquad (8.27)$$
$$L_h(u^*; v_h) = L(v_h) + L'(u^*; v_h), \qquad (8.28)$$

where $B'(u^*; u_h, v_h)$ and $L'(u^*; v_h)$ are defined based on the choices made regarding
the considerations discussed above. $B'(u^*; u_h, v_h)$ and $L'(u^*; v_h)$ for different
combinations of choices can be found in Reyes and Codina (2020).

8.3.3.3 Variational Multiscale Reduced Order Model Approximation

A ROM for the FOM discussed above can be developed by constructing a ROM
space $\mathcal{B}_r \subset \mathcal{B}_h \subset \mathcal{B}_c$. Using the POD relying on the SVD for functions, described in
Sect. 8.2.2.2, we may obtain a ROM space of dimension $r$,

$$\mathcal{B}_r = \mathrm{span}\{\hat{\phi}_1, \hat{\phi}_2, ..., \hat{\phi}_r\},$$

such that $r = \dim \mathcal{B}_r \ll \dim \mathcal{B}_h$.


The proposed VMS formulation for ROM is exactly the same as the one used
for the FOM with one key difference, i.e. the functions are now approximated in Br
instead of Bh . The rationale behind this is the fact that Br ⊂ Bh and, specifically, it
is possible to write the ROM basis functions as a linear combination of the basis of
the FE space Bh , and therefore, the ROM basis functions are piecewise polynomials
defined on the FE partition Ph . The use of the same stabilization parameters for the
FOM and the ROM is also justified based on the above reasoning. Applying the VMS
concept to the ROM, we will have the decomposition

$$\mathcal{B} = \mathcal{B}_h \oplus \mathcal{B}' = \mathcal{B}_r \oplus \mathcal{B}'',$$

where $\mathcal{B}''$ is any space that completes $\mathcal{B}_r$ in $\mathcal{B}$.

Remark 8.4 An interesting observation can be made when $L^2(\Omega)$-orthogonal SGSs
are used in conjunction with the $L^2(\Omega)$-orthogonal basis $\hat{\phi}$ obtained by solving (8.12).
Suppose that we were able to construct a POD basis of $\mathcal{B}_h$, with $np = \dim \mathcal{B}_h$, i.e.

$$\mathcal{B}_h = \mathrm{span}\{\hat{\phi}_1, \hat{\phi}_2, ..., \hat{\phi}_{np}\}.$$

Then, since the basis vectors obtained from the POD are $L^2(\Omega)$-orthogonal, choosing
orthogonal subgrid scales allows us to write

$$\mathcal{B}'' = \mathrm{span}\{\hat{\phi}_{r+1}, \hat{\phi}_{r+2}, ..., \hat{\phi}_{np}\} \oplus \mathcal{B}',$$

i.e. we have an explicit representation of the ROM space of SGSs. So, when the VMS-ROM
is used to approximate the ROM SGSs, it accounts for the FOM subscales, present
in the subspace $\mathcal{B}'$, as well as the SGSs arising as a result of the ROM truncation, present
in the subspace spanned by $\{\hat{\phi}_{r+1}, \hat{\phi}_{r+2}, ..., \hat{\phi}_{np}\}$.

Having in mind the previous discussion, the final reduced order problem can be
written as finding $u_r^n \in \mathcal{B}_r$, for $n = 1, ..., nt$, that satisfy

$$\langle \delta_t u_r, v_r \rangle + B_r(u^*; u_r, v_r) = L_r(u^*; v_r), \quad \forall\, v_r \in \mathcal{B}_r, \qquad (8.29)$$

with the forms $B_r$ and $L_r$ given as

$$B_r(u^*; u_r, v_r) = B(u^*; u_r, v_r) + B'(u^*; u_r, v_r), \qquad (8.30)$$
$$L_r(u^*; v_r) = L(v_r) + L'(u^*; v_r). \qquad (8.31)$$

It can be seen that Equations (8.29)–(8.31) look exactly the same as (8.26)–(8.28).
Furthermore, the expressions for $B'$ and $L'$ are also the same for the FOM and the
ROM if the same choices are made for both, regarding the considerations discussed in
Sect. 8.3.3.2. The only difference between the FOM and the ROM formulations is that,
in the case of the ROM, functions are approximated in $\mathcal{B}_r$ instead of $\mathcal{B}_h$. $B'(u^*; u_r, v_r)$
and $L'(u^*; v_r)$ for different combinations of choices can be found in Reyes and Codina
(2020).

8.3.3.4 Other Stabilized Reduced Order Models

Streamline-Upwind Petrov-Galerkin (SUPG), a popular scheme for stabilized FE


methods introduced in Brooks and Hughes (1982), has been used to deal with the
instabilities in the context of projection-based ROMs as well (Buoso et al. 2022;
Giere et al. 2015; Giere and John 2017; Moreau and Novo 2022; Pacciarini and Rozza
2014). In the case of existence of compatibility conditions between the approximation
spaces for difference unknowns, enrichment of approximation spaces using the so-
called supremizers has been used to provide the required stability to ROMs (Ballarin
et al. 2015; Rozza et al. 2013). A grad-div stabilization was used for the POD-
G method and an error analysis was carried out for the resulting formulation in
García-Archilla et al. (2022). A streamline derivative stabilization term, as well
as an a posteriori stabilization method, were used in Azaïez et al. (2021). Least
square Petrov–Galerkin has also been used to stabilize the ROMs at the fully discrete
system level (Carlberg et al. 2011; Dal Santo et al. 2019) and compared with the
Galerkin projection in Carlberg et al. (2017). A non-intrusive stabilization method
for projection-based ROMs was proposed in Amsallem and Farhat (2012) which
can be applied as a black-box post processing step. A POD mode-dependent eddy
viscosity stabilization scheme was developed and applied to quasigeostrophic ocean
circulation in San and Iliescu (2015). In the domain of hyperreduction, algorithms like
DEIM have been found to exhibit numerical instabilities for second-order dynamical
systems (Farhat et al. 2015). In such cases, energy conserving sampling and weighting
method can be used, which provides the required stability by conserving the energy.

8.4 Non-intrusive Reduced Order Models

8.4.1 The General Concept

The Galerkin projection discussed in Sect. 8.3.1 is an intrusive approach, i.e. it


requires knowledge of the governing equations and/or access to the code used for
the FOM. There is another class of purely data-driven ROMs called non-intrusive

reduced order models (NIROMs) which provide solutions based only on the data
and without using the governing equations. NIROMs allow the decoupling of the
FOM and the ROM implementations completely and are particularly useful in cases
where the code used for the FOM is not open-source. NIROMs can be obtained using
conventional or ML-based techniques and the recent large-scale adoption of NIROMs
can be attributed to the increasing popularity of ML in scientific computing. The
ML-based NIROMs are later discussed in Sect. 8.6.2. For now, we describe dynamic
mode decomposition (DMD), which can be considered a conventional non-intrusive
extension of the POD.

8.4.2 Dynamic Mode Decomposition

Dynamic mode decomposition (DMD), originally introduced in Schmid (2010) in


the context of fluid dynamics, aims at identifying spatiotemporal coherent structures
in high-dimensional data. DMD computes a set of spatial modes, as well as their tem-
poral evolution. It characterizes temporal evolution simply as an oscillation of fixed
frequency with a growth or decay rate. So, while POD-G relies on solving a Galerkin
projected system to find the temporal evolution of the modes, DMD only requires the
data to define the temporal behavior of modes. An important consideration regard-
ing DMD modes is that they are not orthogonal by construction, and hence may
require more modes than POD-G to capture the same phenomena. Recursive DMD
(Noack et al. 2016), a variant of the original DMD, has been developed to produce
orthogonal modes in a recursive manner. Furthermore, there is a lack of consensus
regarding the best criteria to be used for selecting the dominant DMD modes (Noack
et al. 2011; Tissot et al. 2014). A number of other specialized DMD algorithms have
been developed, including, but not limited to, exact DMD (Tu et al. 2014), optimized
DMD (Chen et al. 2012), extended DMD (Williams et al. 2015), and time-delayed
DMD (Brunton et al. 2017). DMD has been successfully applied across a wide range
of applications in fluid mechanics (Alla and Kutz 2017; Kaheman et al. 2022; Sahba
et al. 2022; Schmid et al. 2012).
Let us describe how a basic DMD algorithm can be used to obtain DMD modes
and DMD eigenvalues (which describe the temporal evolution) from data. The same
symbols and terminologies are used, where applicable, as was used to describe SVD
in Sect. 8.2.2. First, two different sets of snapshots are gathered; let us denote them
by {Si }i=1
ns
and {S∗i }i=1
ns
. The snapshot pairs are such that S∗i is obtained by evolving
the system state by time-step t using Si as the initial conditions, for i = 1, ..., ns,
and with t small enough to resolve the temporal dynamics to the smallest desired
scale. The gathered snapshots are then arranged in matrices, S and S∗ , given by
⎡ ⎤ ⎡ ⎤

S = ⎣ S1 S2 . . . Sns ⎦ and S∗ = ⎣ S∗1 S∗2 . . . S∗ns ⎦ .



Let T ∈ Rnp×np be a best-fit linear operator which relates S and S∗ as

S∗ = T S (8.32)

i.e. T acts as a time integrator.


Now, it is desired to find the eigenvalues and eigenvectors of matrix T . This matrix
T is a np × np matrix, and hence, it is impractical to perform its eigendecomposition.
DMD provides an efficient way of finding the r leading eigenvalues and eigenvectors
of matrix T . Using S+ as the pseudo-inverse of S, (8.32) can be written as

T = S∗ S+ . (8.33)

The SVD can be used to approximate the pseudo-inverse $S^+$. Using a truncated SVD, it
can be written

$$S \approx \hat{\Phi} \hat{\Sigma} \hat{V}^T. \qquad (8.34)$$

As $\hat{\Phi}$ and $\hat{V}$ have orthonormal columns and satisfy $\hat{\Phi}^T \hat{\Phi} = I$ and $\hat{V}^T \hat{V} = I$, using (8.34)
allows us to write (8.33) as

$$T \approx S^* \hat{V} \hat{\Sigma}^{-1} \hat{\Phi}^T,$$

where $\hat{\Sigma}$ is a diagonal matrix and can be easily inverted. We are only interested in
the first $r$ eigenvalues and eigenvectors of the matrix $T$. The rank-$r$ approximation of
$T$, denoted by $T_r \in \mathbb{R}^{r \times r}$, is obtained by projecting $T$ onto the reduced space using
the basis $\hat{\Phi}$ as

$$T_r = \hat{\Phi}^T T \hat{\Phi} = \hat{\Phi}^T S^* \hat{V} \hat{\Sigma}^{-1} \hat{\Phi}^T \hat{\Phi} = \hat{\Phi}^T S^* \hat{V} \hat{\Sigma}^{-1}.$$

Now, the DMD eigenvalues can be found by performing the eigendecomposition of
the reduced operator $T_r$ as

$$T_r E = E \Upsilon,$$

where the entries of the diagonal matrix $\Upsilon$ are the eigenvalues of the low-dimensional
$T_r$, as well as of the high-dimensional $T$. $E$ contains the eigenvectors of $T_r$ and allows
us to obtain the eigenvectors of $T$, denoted as $\varphi$, as

$$\varphi = S^* \hat{V} \hat{\Sigma}^{-1} E,$$

where the columns of $\varphi \in \mathbb{R}^{np \times r}$, called DMD modes, are the eigenvectors of $T$.
Once DMD eigenvalues and eigenvectors have been determined, the state of the
system at the $k$-th time-step, $U_k$, is given by

$$U_k = \varphi\, \Upsilon^{k-1} D,$$

where $D \in \mathbb{R}^r$ is the vector of mode amplitudes that can be computed using the initial
conditions. The DMD of a flow over a cylinder is illustrated in Fig. 8.3.

Fig. 8.3 Illustration of DMD applied to a flow over a cylinder. Three DMD modes and the temporal
evolution of their coefficients are shown
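A minimal NumPy sketch of the basic DMD procedure described above is given below; the snapshot matrices, the rank r, and the function names are illustrative assumptions rather than the chapter's own implementation.

import numpy as np

def dmd(S, S_star, r):
    # Basic DMD of rank r from snapshot pairs: S_star[:, i] is the state one step after S[:, i]
    Phi_hat, sigma, VT = np.linalg.svd(S, full_matrices=False)
    Phi_hat, sigma, V = Phi_hat[:, :r], sigma[:r], VT[:r, :].T
    T_r = Phi_hat.T @ S_star @ V / sigma                 # reduced operator Phi^T S* V Sigma^(-1)
    evals, E = np.linalg.eig(T_r)                        # DMD eigenvalues (diagonal of Upsilon)
    modes = (S_star @ V / sigma) @ E                     # DMD modes (columns of varphi)
    D = np.linalg.lstsq(modes, S[:, 0], rcond=None)[0]   # amplitudes from the initial state
    return modes, evals, D

def dmd_predict(modes, evals, D, k):
    # state at the k-th time-step: U_k = varphi Upsilon^(k-1) D
    return modes @ (evals ** (k - 1) * D)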

8.5 Parametric Reduced Order Models

In the previous sections, we have discussed how to build a ROM during an offline
stage and how to use it for getting results quickly during an online stage. So far, we
have assumed that the unknown U(t, μ) was a function of t only and the parameter
μ ∈ D ⊂ R, was kept constant. So, in essence, the ROM was used to solve exactly
the same problem whose solution was used to generate the snapshots to be used for
the ROM basis generation. The aim of reduced order modeling is to perform the
computationally expensive offline stage once (or a few times) and then use the gener-
ated ROM to perform many simulations in the cheap online phase for the new values
of the parameter μ. This situation arises routinely in optimization and control prob-
lems governed by parametric PDEs. The parameter can represent anything including
boundary conditions, geometry, viscosity, Reynold’s number, etc. For simplicity, we
assume that the parameter represents a scalar and its different values, μ1 , ..., μ ps ,
represent different configurations, however, the subsequent discussion is equally
valid where μ represents more than one parameters. The difficulty with parametric
reduced order models (PROMs) lies in the fact that the basis  μ1 obtained for μ1 is
unlikely to perform well for μ2 as the behavior captured by the basis  μ1 might be
different from the behavior exhibited by the system for μ2 , i.e.

U(μ1 ) ≈  μ1 U r (μ1 ),

but
8 Reduced Order Modeling 319

U(μ2 ) μ1 U r (μ2 ).

Several techniques have been developed to obtain a suitable basis for PROMs.
Hyperreduction techniques, like DEIM described in Sect. 8.3.2.1, can also be used
for PROMS (Antil et al. 2014) with θ = μ. Here, we describe two popular techniques
to obtain a basis for PROMs, the global basis method and the local basis with the
interpolation method. These techniques commonly use a greedy approach to sample
suitable parameter values to obtain the snapshots. Thus, they are commonly referred
to as POD-greedy approaches.

8.5.1 Global Basis

Probably the most obvious approach is to sample different parameter values, obtain
snapshots corresponding to them, and perform the SVD on all the snapshots to obtain
a single global basis $\hat{\Phi}$ such that

$$U(\mu) \approx \hat{\Phi}\, U_r(\mu), \quad \forall\, \mu \in \mathcal{D}. \qquad (8.35)$$

A greedy approach can be used to sample ps parameter values to obtain the snapshots. The global basis approach can provide a compact r-dimensional basis satisfying (8.35) if the solution is not very sensitive to the parameter μ, i.e. if the solution manifold has a rapidly decaying Kolmogorov n-width. If the solution manifold has a slowly decaying Kolmogorov n-width, obtaining snapshots at many sampled parameter values might be required, which can lead to a prohibitively expensive offline phase. Even if the computational expense of the offline phase is completely ignored, achieving reasonable accuracy in the online phase will require many POD modes. Hence, truncating the global basis to a rank r that ensures a real-time execution of the online phase with reasonable accuracy will not be possible.
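For reference, the assembly of such a global basis is itself straightforward. The following minimal NumPy sketch, assuming the snapshots of each sampled parameter value are already available as matrices, stacks them and truncates the SVD; all names are illustrative.

```python
import numpy as np

def global_basis(snapshot_matrices, r):
    """snapshot_matrices: list of (np x ns_i) arrays, one per sampled parameter value."""
    S_global = np.hstack(snapshot_matrices)          # stack all snapshots column-wise
    Phi, sigma, _ = np.linalg.svd(S_global, full_matrices=False)
    energy = np.cumsum(sigma**2) / np.sum(sigma**2)  # fraction of captured information
    return Phi[:, :r], energy[r - 1]                 # truncated global basis and its energy
```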

8.5.2 Local Basis with Interpolation

In the case that the global basis approach is not feasible, a local basis can be developed and used with interpolation. Similar to the global basis approach, ps parameter values are sampled and the snapshots are obtained for them. However, instead of performing an SVD of the matrix containing all the snapshots, a separate SVD is performed on the snapshot matrix of every sampled parameter value μ_i to obtain a corresponding local basis Φ_{μ_i}, for i = 1, ..., ps. Now, the basis Φ_{μ*} can be obtained at a requested, but unsampled, parameter value μ* using interpolation.
If conventional interpolation techniques are used, the interpolated basis Φ_{μ*} is likely to lose key properties, e.g. orthogonality, after interpolation. Hence, interpolation using property-preserving matrix manifolds is recommended to preserve the

key properties. Let G be such a manifold of orthogonal matrices. Also, let {μ_i}_{i=1}^{ps} be the set of sampled parameter values and {Φ_{μ_i}}_{i=1}^{ps} be the set of corresponding bases. The basis Φ_{μ*} for the unsampled point μ* ∉ {μ_i}_{i=1}^{ps} can be obtained as follows. First, a tangent space T(Φ_{μ̃}) is defined such that it is tangent to G at a reference point Φ_{μ̃} ∈ {Φ_{μ_i}}_{i=1}^{ps}. Now, Φ_{μ_i}, i = 1, ..., ps, except the reference point Φ_{μ̃}, are projected to the tangent space using a logarithmic map defined as

T(Φ_{μ_i}) = R_{μ_i} tan^{-1}(Σ_{μ_i}) W_{μ_i}^T,

with R_{μ_i}, Σ_{μ_i} and W_{μ_i}^T obtained from the following SVD:

(Φ_{μ_i} − Φ_{μ̃} Φ_{μ̃}^T Φ_{μ_i})(Φ_{μ̃}^T Φ_{μ_i})^{-1} = R_{μ_i} Σ_{μ_i} W_{μ_i}^T,

where T(Φ_{μ_i}) is the projection of Φ_{μ_i} on the tangent space T(Φ_{μ̃}), so that after projecting {Φ_{μ_i}}_{i=1}^{ps} we obtain {T(Φ_{μ_i})}_{i=1}^{ps}. At this point, the standard interpolation (for example using Lagrange interpolation) is performed using {T(Φ_{μ_i})}_{i=1}^{ps}, the projections on the tangent space, to obtain T(Φ_{μ*}). Now, T(Φ_{μ*}) needs to be projected back to the manifold G. This can be done using an exponential map defined as

Φ_{μ*} = [Φ_{μ̃} W_{μ*} cos(Σ_{μ*}) + R_{μ*} sin(Σ_{μ*})] W_{μ*}^T,

where R_{μ*}, Σ_{μ*} and W_{μ*}^T are obtained from the following SVD:

T(Φ_{μ*}) = R_{μ*} Σ_{μ*} W_{μ*}^T.

An illustrative implementation of matrix manifold interpolation is shown in Fig. 8.4.

Fig. 8.4 Interpolation of a set of matrices {Φ_{μ_i}}_{i=1}^{ps} using the matrix manifold G and the tangent plane T(Φ_{μ̃})

Remark 8.5 The above described manifold-based interpolation has been shown to be applicable to the direct interpolation of reduced order system matrices/vectors as well for linear systems (Amsallem and Farhat 2011). Consider a spatially discretized reduced order parametric linear system

M r (μ∗ )∂t U r (μ∗ ) + L r (μ∗ )U r (μ∗ ) = F r (μ∗ ).

If {M_r(μ_i)}_{i=1}^{ps}, {L_r(μ_i)}_{i=1}^{ps} and {F_r(μ_i)}_{i=1}^{ps} are obtained offline, M_r(μ*), L_r(μ*) and F_r(μ*) can be obtained during the online phase using the manifold-based interpolation. This ensures that the key properties, e.g. symmetric positive definiteness (SPD), are preserved after the interpolation. This direct interpolation is more efficient than first finding the interpolated basis Φ_{μ*} and then finding the reduced matrix X_r(μ*) using Φ_{μ*}^T X Φ_{μ*}. However, this direct interpolation has been shown to work only for linear problems so far. The appropriate logarithmic and exponential maps to be used to preserve different matrix properties can be found in Amsallem and Farhat (2011).
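A compact NumPy sketch of the logarithmic map, the tangent-space interpolation, and the exponential map described above is given below. The function and argument names are illustrative assumptions; Lagrange interpolation over a scalar parameter is assumed, and the reference basis simply maps to the zero matrix of the tangent space.

```python
import numpy as np

def interpolate_basis(Phi_list, mu_list, mu_star, i_ref=0):
    Phi_ref = Phi_list[i_ref]

    # Logarithmic map: project each sampled basis onto the tangent space at Phi_ref
    Gammas = []
    for Phi in Phi_list:
        M = (Phi - Phi_ref @ (Phi_ref.T @ Phi)) @ np.linalg.inv(Phi_ref.T @ Phi)
        R, s, Wt = np.linalg.svd(M, full_matrices=False)
        Gammas.append(R @ np.diag(np.arctan(s)) @ Wt)

    # Lagrange interpolation of the tangent-space matrices at mu_star
    Gamma_star = np.zeros_like(Gammas[0])
    for i, mu_i in enumerate(mu_list):
        w = 1.0
        for j, mu_j in enumerate(mu_list):
            if j != i:
                w *= (mu_star - mu_j) / (mu_i - mu_j)
        Gamma_star += w * Gammas[i]

    # Exponential map: bring the interpolated matrix back to the manifold G
    R, s, Wt = np.linalg.svd(Gamma_star, full_matrices=False)
    return Phi_ref @ Wt.T @ np.diag(np.cos(s)) @ Wt + R @ np.diag(np.sin(s)) @ Wt
```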

8.6 Machine Learning-Based Reduced Order Models

The impact of ML has been profound on scientific computing. In this section, the
applications of ML in the context of projection-based ROMs are explored. However,
it is pertinent to mention the natural suitability of ML techniques to develop extremely
computationally inexpensive models, even beyond the context of projection-based
ROMs. A lot of the success of ML techniques can be attributed to their ability to
find and learn nonlinear mappings that govern a physical system. For any system,
a few key inputs can be selected and a ML technique can be applied to learn the
(non)linear mapping that exists between its outputs and the selected inputs. Since
the online (testing) phase of ML algorithms is very fast, any such application would
result in a computationally inexpensive model, i.e. a ROM.
In the context of projection-based ROMs, ML techniques have been applied to
achieve higher accuracy, improvement in speeds or a combination of both. ML tech-
niques have been applied to obtain nonlinear reduced spaces for ROMs which offer
a more compact representation than linear POD spaces. ML techniques have also
been used to obtain NIROMs by directly learning the nonlinear evolution of reduced
coordinates, previously referred to as ROM coefficients in the context of POD. The
term reduced coordinates is more popular in the literature in the context of ML-based
ROMs, and hence it will be used from here on. ML can also improve the accuracy of
the intrusive Galerkin ROMs by providing closure models or corrections based on
finer solutions. Purely data-driven ML techniques can be very data hungry. In order
to reduce the reliance on data, and improve the generalization of the ML models,
physics has been embedded in ML techniques and then applied to reduced order
modeling. ML has even been used for system identification to discover simple equa-
tions for the evolution of reduced coordinates. Let us describe the state of the art in the
above-mentioned application domains.

8.6.1 Nonlinear Dimension Reduction

POD (or DMD) provides modes that approximate a linear subspace. However, the
evolution of many dynamical systems lies in nonlinear spaces. The linear approxima-
tion can lead to two issues. First, the POD modes might be unable to capture highly
nonlinear phenomena. Second, in the case that the linear representation can capture
the dynamics with reasonable accuracy, using POD may require more reduced coor-
dinates than the nonlinear representation. So, instead of the linear mapping provided
by the POD (8.13), a more robust mapping would be using a nonlinear function
ϑ D : Rr → Rnp given by
U ≈ U D = ϑ D (U r ), (8.36)

where U D ∈ Rnp is the mapped value and U is the FOM solution. This nonlinear
mapping can be achieved using autoencoders (AEs) (Ballard 1987).
AEs are artificial neural networks (ANNs) used widely for dimension reduction.
The simplest AE, called undercomplete AE, consists of input and output layers of
the same size as the size of the FOM, np in this case. Furthermore, it has a bottleneck
layer in the middle of the desired size r , the same as the size of the reduced space.
The architecture of an undercomplete AE is shown in Fig. 8.5. In general, AEs are
quite deep, i.e. they consist of many layers, and use nonlinear activation functions.
Based on the task performed, an AE can be subdivided into two sub-parts: an
encoder and a decoder. The encoder compresses the high-dimensional data succes-
sively through its many layers to produce the low-dimensional representation. The encoder can be represented by a function ϑ_E : R^{np} → R^r as

U_r = ϑ_E(U).

Fig. 8.5 Architecture of an undercomplete autoencoder with encoder–decoder parts. The number of layers and the size of the bottleneck are set to three and two, respectively, for illustration purposes

The decoder reproduces the high-dimensional representation from the low-dimensional one as per the mapping (8.36). To train an AE, U is given both as the input and the desired output, and the loss function ‖U_D − U‖_2^2 is minimized, i.e. the AE as a whole is expected to behave like an identity map. Interestingly, the optimal
solution using the encoder and decoder of single layers with linear activation func-
tions is shown to be closely related to POD (Baldi and Hornik 1989), i.e. POD can be
considered to be a linear AE. AEs have been used with simple feedforward (Milano
and Koumoutsakos 2002), as well as convolutional (Gonzalez and Balajewicz 2018;
Lee and Carlberg 2020b; Murata et al. 2020) layers and have shown performance
enhancement as compared to the linear dimension reduction techniques. Further-
more, time-lagged AEs have been shown to capture the slowly evolving dynamics
of chemical processes with higher precision (Wehmeyer and Noe 2018).
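As an illustration of the architecture discussed above, a minimal undercomplete autoencoder could be sketched in PyTorch as follows; the layer sizes, activation functions, and names are assumptions made for the example only.

```python
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, np_dim, r, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(np_dim, hidden), nn.ELU(),
            nn.Linear(hidden, r))                    # bottleneck: reduced coordinates U_r
        self.decoder = nn.Sequential(
            nn.Linear(r, hidden), nn.ELU(),
            nn.Linear(hidden, np_dim))               # reconstruction U_D

    def forward(self, U):
        return self.decoder(self.encoder(U))

# Training minimizes ||U_D - U||^2 so that the AE approximates the identity map:
# model = Autoencoder(np_dim=10000, r=5)
# loss = nn.MSELoss()(model(U_batch), U_batch); loss.backward(); optimizer.step()
```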
An important issue of AEs is that they do not provide a systematic way of determin-
ing the suitable dimension of the reduced space as they do not provide hierarchical
reduced coordinates, as POD does based on the RIC index. The number of reduced
coordinates needs to be provided a priori to an AE as the size of the bottleneck.
A smaller number of reduced coordinates cannot be selected as the coordinates are
not distributed hierarchically and each coordinate may correspond to roughly the
same RIC. A trial-and-error approach can be used to find the optimal dimension of
the reduced space, but this is not an efficient approach. Variational autoencoders
(Kingma and Welling 2013) can resolve this issue and provide a parsimonious set of
the reduced coordinates. In Eivazi et al. (2022), β-Variational autoencoders (Higgins
et al. 2022) use β as a hyperparameter to promote the sparsity of reduced coordinates
by deactivating the unimportant nodes of the bottleneck. β-Variational autoencoders
were shown to represent ∼ 90% of the energy for the problem of a flow over an
urban environment, using five modes only, in contrast to just ∼ 30% captured by the
first five POD modes. AEs have also been applied to discover nonlinear coordinate
spaces for DMD (Lusch et al. 2018; Otto and Rowley 2019; Takeishi et al. 2017).

8.6.2 Machine Learning Based Non-intrusive Reduced Order Models

As discussed in Sect. 8.4.1, the non-intrusive approach to reduced order modeling


relies on modeling the dynamics of reduced coordinates using data only, i.e. with-
out accessing the governing equations. ML techniques, like ANNs, can be used to
learn the nonlinear dynamics of the reduced coordinates to get a NIROM. A generic
schematic of such an approach is shown in Fig. 8.6. Once the reduced representation
is obtained, using POD or AEs, a trained ML model ϑ M L can be used to evolve the

Fig. 8.6 Non-intrusive ROM obtained by replacing the Galerkin projection (a) with a ML approach (b) to model the dynamics of the reduced coordinates U_r

reduced coordinates from U_r^n to U_r^{n+1}, where n is the time-step counter. Inputs additional to U_r^n can also be provided to ϑ_ML to better learn the mapping U_r^n → U_r^{n+1}.
ML-based NIROMs have been successfully used for a variety of applications.
Deep feedforward neural networks (FNNs), combined with POD for dimensionality
reduction, were applied to get accurate results for the differentially heated cavity
flow at various Rayleigh numbers in Pawar et al. (2019). The least squares support
vector machine (Suykens et al. 2002) was used in Chen et al. (2021) to relate reduced
coordinates with the applied excitations for predicting hypersonic aerodynamic per-
formance. FNN was used to develop a NIROM for the industrial thermo-mechanical
phenomena arising in blast furnace hearth walls in Shah et al. (2022). The multi-
output support vector machine (Xu et al. 2013) was used to model the dynamics of
POD coefficients to predict the stress and displacement for geological and geotech-
nical processes in Zhao (2021).
Long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) and its
variant bidirectional long short-term memory (BiLSTM) (Graves and Schmidhuber
2005) neural networks (NNs) have a memory effect and can capture sequences like
the time evolution of a process with higher accuracy than a FNN. LSTM/BiLSTMs accept a history of q time steps, {U_r^n, U_r^{n−1}, ..., U_r^{n−q+1}}, to predict the future value U_r^{n+1}. Thus

U_r^{n+1} = ϑ_LSTM(U_r^n, U_r^{n−1}, ..., U_r^{n−q+1}).
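A minimal PyTorch sketch of such an LSTM-based non-intrusive model, mapping a history window of reduced coordinates to the next value, is shown below; the network sizes and names are illustrative assumptions.

```python
import torch.nn as nn

class LSTMNirom(nn.Module):
    def __init__(self, r, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=r, hidden_size=hidden, batch_first=True)
        self.head = nn.Linear(hidden, r)

    def forward(self, Ur_history):
        # Ur_history: (batch, q, r) containing U_r^{n-q+1}, ..., U_r^n
        out, _ = self.lstm(Ur_history)
        return self.head(out[:, -1, :])      # prediction of U_r^{n+1}

# Online phase: roll the model forward by appending each prediction to the history
# window, replacing the projection step of an intrusive ROM.
```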

LSTM and BiLSTM NNs have been widely used to predict the temporal evolution
of systems based on past values. LSTM and BiLSTM NNs were used to model
isotropic and magnetohydrodynamic turbulence in Mohan and Gaitonde (2018),
where the Hurst Exponent (Hurst 1951) was employed to study and quantify the
memory effects captured by the LSTM/BiLSTM NNs. LSTM NNs were used in
Vlachas et al. (2018) to predict high-dimensional chaotic systems and were shown to
outperform Gaussian processes. The improved performance of LSTM NNs was also
shown for reduced order modeling of near-wall turbulence as compared to FNNs in
Srinivasan et al. (2019). Finally, LSTM NNs were also used to model a synthetic jet
and three-dimensional flow in the wake of a cylinder in Abadía-Heredia et al. (2022).
Training ML algorithms in general, and LSTM/BiLSTM NNs in particular, can be
very computationally demanding. Transfer learning can be used to speed up the train-
ing phase. Instead of initializing the weights of a network randomly, transfer learning

relies on using weights of a network previously trained for a closely related prob-
lem for initialization. Providing better initial weights allows the training to converge
faster to the optimal weights. Transfer learning was used to speed up the training of
LSTM and BiLSTM NNs modeling the three-dimensional turbulent wake of a finite
wall-mounted square cylinder in Yousif and Lim (2022). The flow was analyzed on
2D planes at different heights, each modeled using a separate LSTM/BiLSTM NN.
After the first LSTM/BiLSTM NN was trained, the others were initialized using its
weights, as the flow in different planes is correlated.
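A schematic of this initialization step, under the assumption that a trained model for one plane is available as a PyTorch module, could look as follows; the function name and learning rate are illustrative.

```python
import copy
import torch

def init_from_pretrained(pretrained_model, lr=1e-4):
    """Clone a trained network and set up fine-tuning on a closely related problem."""
    new_model = copy.deepcopy(pretrained_model)          # start from the trained weights
    optimizer = torch.optim.Adam(new_model.parameters(), lr=lr)
    return new_model, optimizer
```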
Gaussian process regression (Rasmussen and Williams 2005) can be used to build
NIROMs alongside providing uncertainty quantification. Gaussian process regres-
sion has been used to develop NIROM for shallow water equations (Maulik et al.
2021), chaotic systems like climate forecasting models (Wan and Sapsis 2017), and
nonlinear structural problems (Guo and Hesthaven 2018). In the domain of unsu-
pervised learning, cluster reduced order modeling has been applied to mixing layers
(Kaiser et al. 2014). The cluster reduced order modeling groups the ensemble of
data (snapshots) into clusters and the transitions between the states are dynamically
modeled using a Markov process.

8.6.3 Closure Modeling

Another common application of ML techniques is to provide closure modeling in the


context of intrusive ROMs. Galerkin projection is the most commonly used intrusive
technique to solve for the reduced coordinates. However, performing the Galerkin
projection can lead to inaccuracy and instability issues. These issues were highlighted
in Sect. 8.3.3, where the primary concern was to improve the stability of ROMs. Here,
the closure modeling of ROMs is presented with the primary objective of improving
their accuracy. Note that the stabilization and closure modeling are related, yet they
are distinct issues (Azaïez et al. 2021).
Let us present the closure problem in the context of truncated ROMs. Let us suppose that numerical simulations are performed to gather snapshots U ∈ R^{np} for ns time steps, grouped together to form a snapshot matrix S ∈ R^{np×ns}. Let the snapshots belong to the space B with Φ ∈ R^{np×ns} being the basis of B. To obtain a computationally efficient ROM, the basis Φ is truncated to r modes. This results in a truncated basis Φ̂ ∈ R^{np×r} for the resolved reduced space and a basis Φ̃ ∈ R^{np×(ns−r)} for the unresolved reduced space. If we assume for a moment that ns is large enough (ns ≥ np), we may consider that B = B_f, and thus U can be written as

U = Φ̂ U_r + Φ̃ Ũ,    (8.37)

where U_r ∈ R^r is the vector of reduced coordinates captured by the reduced space and Ũ ∈ R^{(np−r)} contains the unresolved coordinates.

Let us assume that (8.3) represents the semi-discretized (in space) form of the governing equations describing the behavior of U. A Galerkin projection of (8.3) onto the resolved and unresolved spaces, using Φ̂ and Φ̃, respectively, along with using (8.37), results in the following system:

∂_t U_r = G(U_r, Ũ),
∂_t Ũ = G̃(U_r, Ũ),

where both G ∈ R^r and G̃ ∈ R^{(np−r)} are functions of U_r and Ũ. The objective is to solve for the dynamics of the resolved part U_r only, i.e.

∂_t U_r = G(U_r, Ũ).

Using the truncated basis to build the ROM implicitly implies, abusing the notation,

∂t U r = G(U r , 0) = G(U r ),

which is not true in the nonlinear cases, as the behavior of the resolved scales is
governed by their interaction with the unresolved ones as well. So, it is desired to
model this interaction as a term C(U r ), which is a function of the resolved scales
U r , so that
∂t U r = G(U r ) + C(U r ). (8.38)
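One possible data-driven way to fit C(U_r), sketched below under several assumptions (projected FOM snapshots and their time derivatives are available, and the Galerkin right-hand side G is callable), is to regress a small network on the residual ∂_t U_r − G(U_r); this is only an illustration, not one of the specific closure models reviewed next.

```python
import torch
import torch.nn as nn

def train_closure(Ur_data, dUrdt_data, galerkin_rhs, r, epochs=2000):
    # Small feedforward network modeling the closure term C(U_r)
    closure = nn.Sequential(nn.Linear(r, 64), nn.Tanh(), nn.Linear(64, r))
    # Regression target: the part of the dynamics not explained by G(U_r)
    target = (dUrdt_data - galerkin_rhs(Ur_data)).detach()
    opt = torch.optim.Adam(closure.parameters(), lr=1e-3)
    for _ in range(epochs):
        opt.zero_grad()
        loss = nn.MSELoss()(closure(Ur_data), target)
        loss.backward()
        opt.step()
    return closure   # online phase: d/dt U_r = G(U_r) + closure(U_r), cf. (8.38)
```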

Figure 8.7 represents the closure problem graphically. A variety of conventional closure models have been developed for ROMs, see Ahmed et al. (2020) for a review and Wang et al. (2012) for a comparison of results obtained using popular conventional techniques.

Fig. 8.7 An illustration of the ROM closure problem. The projection error is due to the use of the space B_r ⊂ B_f for approximating the unknowns. The closure error is due to neglecting the effect of nonlinear interaction on the evolution of the resolved coordinates

Lately, ML techniques have been widely used to obtain closure models for ROMs. In San and Maulik (2018b), a single layer FNN, trained
using Bayesian regularization and extreme learning machine approaches, was used
to model ROM closure terms for flow problems governed by viscous Burgers equa-
tions. The ROM closure terms were modeled as a function of the Reynolds number
and resolved reduced coordinates. An extreme learning approach was also used in San
and Maulik (2018a) to determine eddy viscosity for a LES-inspired closure model.
An uplifting ROM with closure was proposed in Ahmed et al. (2021). LSTMs were
used to provide the closure term, as well as to determine the reduced coordinates Ũ of the unresolved space. Since the basis Φ̃ was already known via POD, (8.37) was used
to approximate U. A similar approach was used in Ahmed et al. (2021) to develop
a closure model for pressure modes to accurately predict the hydrodynamic forces.
A residual neural network, an FNN that can be hundreds of layers deep, was used to develop a closure model for the 1D Burgers equation in Xie et al. (2020).
ROM closure based on Mori–Zwanzig formalism (Mori 1965; Zwanzig 1960)
is also popular. Such approaches, in general, use closure models consisting of two
terms

closure term = memory integral + contribution of initial conditions, (8.39)

where the memory integral is a non-Markovian2 term and takes into account the
contribution of the past values of the resolved coordinates U r to model the unresolved
ones U. This memory integral is very computationally expensive to compute. To
evaluate it efficiently, neural closure models using neural delay differential equations
were proposed in Zhu et al. (2021). The number of past values required to accurately
determine the closure term was also determined. A conditioned LSTM NN was
used to model the memory term in Wang et al. (2020). To ensure computational
efficiency, the authors further used explicit time integration of the memory term,
while using implicit integration for the remaining terms of the discrete system. The
Mori–Zwanzig formalism and LSTM NNs were shown to have a natural analogy
between them in Ma and Wang (2019). This analogy was also used to develop a
systematic approach for constructing memory-based ROMs. Recently, reinforcement
learning techniques have also been applied to build unsupervised reward-oriented
closure models for ROMs (Benosman et al. 2020; San et al. 2022).

8.6.4 Correction Based on Fine Solutions

Reduced order modeling errors can be improved by introducing a correction term


in the fully discrete system, i.e. discretized in both space and time, and a general
concept to correct fully discrete reduced order problems based on the knowledge of
fine solutions can be developed. A fine solution is any solution that is considered

2 A non-Markovian term implies that the future state depends not only on the current values, but also on past values, i.e. such processes have memory effects.

more accurate than the ROM solution. This is applicable to projection-based ROMs,
as well as to the cases in which ROM represents a model with a coarser spatial
discretization or a larger time step.
Let U c be the solution of the coarse ROM system given by

AU c = R. (8.40)

Let the fine solution U f be also available for the given problem. Let us assume that
the projection of the fine solution onto the coarse solution space, denoted by U c f , is
the best possible coarse solution. In this case, a correction vector C can be added to
modify system (8.40) to obtain a new system

AU c f + C = R, (8.41)

with U c f as its solution. When the fine solution is not known, C needs to be modeled
as a function of the coarse solution, i.e. C = C(U c ).
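As an illustration of how such a model can be built when fine solutions are available for a set of training cases, the sketch below assembles the target corrections C = R − A U_cf from (8.41) and fits a linear map C(U_c) ≈ W U_c by least squares, in the spirit of the LS approach discussed next; all variable names and the linear model form are assumptions for illustration.

```python
import numpy as np

def fit_linear_correction(U_c_samples, U_cf_samples, R_samples, A):
    # Target correction for each training case, from Eq. (8.41): C = R - A U_cf
    C_targets = np.stack([R - A @ U_cf for U_cf, R in zip(U_cf_samples, R_samples)])
    X = np.stack(U_c_samples)                     # coarse solutions as regression inputs
    # Least-squares fit of X @ W.T ~= C_targets, i.e. C(U_c) ~= W @ U_c
    Wt, *_ = np.linalg.lstsq(X, C_targets, rcond=None)
    return Wt.T

# One possible online use: solve A U_c = R, evaluate C = W @ U_c, then re-solve A U = R - C.
```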
A correction term using a least squares (LS) approach was proposed for POD-
ROM in Baiges et al. (2015). Special considerations regarding gathering training data
and using the appropriate initial conditions were also addressed with least squares
providing a linear model of the correction term. A nonlinear correction term was
determined using ANN in Baiges et al. (2020). The correction term was determined
for a coarse mesh-based ROM using the solution of an AMR-based FOM, and applied
to the fluid, structure, and FSI problems. A comparison of linear least squares and
nonlinear ANNs-based corrections was carried out for the wave equation in Fabra
et al. (2022). A nonlinear ANN-based correction for POD-ROM was used in Dar et al.
(2023). Different combinations of features to be provided as the inputs to the ANNs
were evaluated to develop an accurate model while minimizing the complexity. The
implicit and explicit treatment of the ANN-based correction was also evaluated. It
was shown that the ROM was able to produce good results for parametric interpola-
tion, as well as temporal extrapolation. All of the above-mentioned works relied on
significantly less training data as compared to NIROMs. A training-free correction
was further proposed in Baiges et al. (2021) to account for the loss of information
due to the adaptive coarsening of the coarse mesh-based ROM. The correction was
based solely on the data generated within the same simulation and did not require
any external data.

Remark 8.6 The correction based on fine solutions discussed in Sect. 8.6.4 and the
closure modeling discussed in Sect. 8.6.3 are similar to some extent. However, there
is a difference in their motivation, as well as the accepted definition in the literature.
Closure modeling for ROMs is understood to account for the error generated due to
the Galerkin projection, i.e. spatial discretization as given by (8.38). On the other
hand, the correction based on fine solutions works by introducing a correction at the
fully discrete level given by (8.41). Hence, the two approaches have been discussed
separately.

8.6.5 Machine Learning Applied to Parametric Reduced Order Models

Just as ML can make non-parametric ROMs robust by offering a non-intrusive deter-


mination of reduced coordinates and/or obtaining a nonlinear approximation to the
reduced solution space, it offers the same for the parametric ROMs. One way of
applying ML to facilitate parametric reduced order modeling is to develop a non-
intrusive model of the reduced coordinates as a function of the parameter μ, possibly
including time, to get
U r (t, μ) = ϑ M L (t, μ),

where ϑ M L is the desired mapping capturing the parametric dependency as well.


Here again, it is assumed that μ ∈ D ⊂ R without loss of generality, as explained in
Sect. 8.5. The dynamics of the reduced coordinates can be modeled using Gaussian
process regression (Guo and Hesthaven 2018, 2019; Kast et al. 2020) or ANNs
(Baiges et al. 2015; Dar et al. 2023; Fabra et al. 2022; Hesthaven and Ubbiali 2018;
Wang et al. 2019) trained using the parameter value as an additional input. ANNs
have also been used to predict results for times beyond the training time interval (Dar
et al. 2023; Mou et al. 2021).
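A minimal PyTorch sketch of such a parametric, non-intrusive mapping ϑ_ML(t, μ) realized as a small feedforward network is given below; the architecture and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ParametricNirom(nn.Module):
    def __init__(self, r, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.Tanh(),       # inputs: (t, mu)
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, r))                  # outputs: reduced coordinates U_r

    def forward(self, t, mu):
        return self.net(torch.stack([t, mu], dim=-1))

# Training data: (t_j, mu_j) paired with projections of the snapshots onto the basis.
# The full-order field is then recovered online as U ≈ Phi_hat @ U_r(t, mu).
```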
The set of parameter values for which the given system’s behavior is similar, such
that it can be captured by the same basis, is not always obvious. ML-based clustering
algorithms can help in grouping this similar behavior and relating it with appropriate
parameter values. A shock sensor and clustering algorithm was used in Dupuis et al.
(2018) to decompose the domain, used for aerodynamic simulations, into regions
with and without shocks. Accordingly, the snapshots were also decomposed into the
two groups and two separate ROMs were built for shock and shock-free subdomains,
leading to improved results. Optimal trees (Bertsimas and Dunn 2017) were used
to build an interpretable classifier for unmanned air vehicle operation conditions
in Kapteyn et al. (2020). During the operation, sensors provided data related to the
operating conditions. This data was then used by the classifier to suggest the relevant
ROM to be used from the library of ROMs prepared in the offline stage.
ML has also been used to obtain nonlinear solution manifolds for parametric
ROMs (Lee and Carlberg 2020b). Deep convolutional autoencoders were used in
Gonzalez and Balajewicz (2018) to generate a reduced parametric trial manifold.
It was then combined with a LSTM NN to model the dynamics of the reduced
coordinates. Nonlinear reduced parametric manifolds were also learned using AEs
in Fresca et al. (2021) and a variation of it was developed in Fresca and Manzoni
(2022) to speed up the offline training process.

8.6.6 Physics Informed Machine Learning for Reduced Order Models

ML tools, in general, require a large amount of data to provide accurate results. More-
over, developing a model for generalized cases further increases the data requirement.
Generating and storing such an amount of data is not always possible. An efficient
way of reducing the reliance on data, without affecting the accuracy or generaliz-
ability, is to embed physics in the ML tools. Embedding physics in ML is being
increasingly used in the broader field of scientific computing; however, limited work
is done so far in the domain of reduced order modeling.
One way of employing physics is to use ANNs to solve PDEs directly by mini-
mizing the residual of the governing equations without using any FOM data. Such
ANNs are called physics informed neural networks (Raissi et al. 2019) and can be
used to directly solve the reduced order system without relying on training–testing
phases. Physics reinforced neural networks, a variation of physics informed neural
networks, were proposed in Chen et al. (2019) in the context of ROMs. In general,
incorporating physics in ML models involves a loss function consisting of two terms:
a data-driven loss function J D and a physics-based loss function J P . Let A be a
general operator that describes the desired physics such that

A(U r ) = 0, (8.42)

where A may account for temporal derivatives, nonlinearity, etc. The physics-based
loss function J P is given by

J_P = ‖A(U_ML)‖^2,    (8.43)

where U M L is the output of the ML model. Furthermore, if the snapshots projected


onto the reduced space, denoted as R(U), are known, J D is given by

J_D = ‖U_ML − R(U)‖^2.    (8.44)

In general, the training phase involves minimizing the mean of (8.43) and (8.44) for
multiple time steps and/or parameter values. The total loss function is given by

J = J D + εJ P , (8.45)

where ε is a hyperparameter that decides the weight to be given to adherence to the physics. The physics-reinforced neural network used the residual of the entire governing
equation to embed physics. Physics knowledge was embedded in Lee and Carlberg
(2020a) by incorporating the residual, arising from violating the conservation laws
using finite volume discretization, in the loss function. Embedding physics in a data-
driven ROM closure model reduced the data requirement by 50% in Mohebujjaman

et al. (2019). The physics was incorporated by requiring some terms of the closure
model to be energy dissipating, while others to be energy conserving.
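A schematic of the weakly constrained loss (8.45) is sketched below; the residual operator A and the projected snapshots R(U) are problem-specific and assumed to be provided as a callable and a tensor, and the names are illustrative.

```python
import torch

def total_loss(model, inputs, R_U, residual_A, eps=0.1):
    U_ml = model(inputs)
    J_D = torch.mean((U_ml - R_U) ** 2)            # data-driven loss, cf. (8.44)
    J_P = torch.mean(residual_A(U_ml) ** 2)        # physics-based loss, cf. (8.43)
    return J_D + eps * J_P                          # total loss, cf. (8.45)
```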
Introducing physics using (8.45), so that the training phase becomes a constrained
optimization problem, can be considered as applying weak constraints (note that
J P = 0 if U M L is replaced by U r in Eq. (8.43)). The violation of the physics leads
to a large loss function, and hence in an effort to minimize the loss function, the
ANN tries to adhere to the physics as well. The physics embedded in such a way is
prone to be violated in the testing phase when the ANN is exposed to cases beyond
the training phase. This is because, architecturally, the ANN is still unaware of the
physics. Furthermore, ε acts as an additional hyperparameter that needs to be tuned.
An alternative strategy is to amend the ANN structurally so that it enforces the
physical laws strongly. Such an ANN is hoped to be more robust in the testing phase
as it is not blind to the physics. Embedded physics in a coarse grained ROM for 3D
turbulence via hard constraints was achieved in Mohan et al. (2020). The divergence
free condition was enforced using curl-computing layers which formed a part of the
backpropagation process as well. Backpropagation through the curl operator ensured
that the network is not blind and has intimate knowledge of the constraints through
which it must make predictions. Another way of embedding physics in the layers
was proposed in Pawar et al. (2021). A physics-guided machine learning framework
was introduced which injected the desired physics in one of the layers of the LSTM
NN. By incorporating physics, an ANN applicable to more generalized cases was
achieved as compared to the purely data-driven approach.

8.6.7 Reduced System Identification

ML can also be used to find the equations representing the dynamics of a system
using data. An equation comprising different terms with adjustable coefficients is
assumed to model the behavior of a system. Data is then used to find the value of
these adjustable coefficients. This technique is known as system identification and
is of particular interest in two scenarios. First, if the equations are not known, as
in the case of modeling climate, epidemiology, neuroscience, etc. Second, when
the equations describing the behavior of the reduced coordinates are required. The
resulting equation-based representation of a system provides generalizability and
interpretability, not achievable by simply constructing a regression model based on
data. System identification is a broad field and many techniques have been applied
in this context, see Ljung (1998) and Juang (1994) for details.
Reduced system identification aims to obtain sparse equations (consisting of a few
simple terms) to describe the evolution of reduced coordinates of a projection-based
ROM. The sparse identification of nonlinear dynamics (SINDy) (Brunton et al. 2016)
algorithm can be used to get a simplistic dynamic model for the reduced coordinates.
A library of simple nonlinear terms, like polynomials or trigonometric functions,
is provided. SINDy then tries to find a mapping for the provided input–output data

using the minimum number of terms of the library, thus providing a minimalistic
interpretable model offering a balance of efficiency and accuracy.
SINDy has been applied to recover models for a variety of flow behaviors including
shear layers, laminar and turbulent wakes, and convection (Callaham et al. 2022;
Deng et al. 2020; Loiseau 2020; Loiseau and Brunton 2018). The vortex shedding
behind a cylinder, for example, can be captured using three modes only (Noack et al.
2003), first two POD modes and a shift mode as

∂t Ur 1 = μUr 1 − Ur 2 − Ur 1 Ur 3 ,
∂t Ur 2 = Ur 1 + μUr 2 − Ur 2 Ur 3 , (8.46)
∂t Ur 3 = Ur21 + Ur22 − Ur 3 ,

where Ur 1 , Ur 2 and Ur 3 are the reduced coordinates related to first two POD modes
and the shift mode. SINDy was able to recover this minimalistic model using data,
identifying the dominant terms and the associated coefficients correctly (Brun-
ton et al. 2016). SINDy was also combined with an autoencoder to find the low-
dimensional nonlinear representation, as well as to model the dynamics of the corre-
sponding reduced coordinates (Champion et al. 2019). To improve the performance
of SINDy, physics was also embedded in it in the form of symmetry in Guan et al.
(2021) and of conservation laws in Loiseau and Brunton (2018).
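At its core, SINDy performs a sparse regression of the time derivatives of the reduced coordinates onto a library of candidate terms. The sketch below implements a sequentially thresholded least-squares variant with a polynomial library; the library contents, threshold, and names are illustrative assumptions rather than the exact algorithm of the cited works.

```python
import numpy as np
from itertools import combinations_with_replacement

def library(Ur):
    # Candidate terms: constant, linear and quadratic functions of the reduced coordinates
    cols = [np.ones((Ur.shape[0], 1)), Ur]
    cols += [Ur[:, [i]] * Ur[:, [j]]
             for i, j in combinations_with_replacement(range(Ur.shape[1]), 2)]
    return np.hstack(cols)

def stlsq(Theta, dUr, threshold=0.05, iters=10):
    # Sequentially thresholded least squares: fit, zero out small coefficients, refit
    Xi, *_ = np.linalg.lstsq(Theta, dUr, rcond=None)
    for _ in range(iters):
        Xi[np.abs(Xi) < threshold] = 0.0
        for k in range(dUr.shape[1]):
            big = np.abs(Xi[:, k]) >= threshold
            if big.any():
                Xi[big, k], *_ = np.linalg.lstsq(Theta[:, big], dUr[:, k], rcond=None)
    return Xi   # sparse coefficients such that dUr/dt ≈ library(Ur) @ Xi
```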

8.7 Concluding Remarks

Reduced order modeling can significantly accelerate numerical simulations. Thus,


ROMs permit the solution of optimization and control problems, requiring many
simulation runs, at a reasonable computational expense. Traditionally, POD-G has
been the most popular approach used to obtain ROMs. Over time, additional ingredi-
ents have been incorporated to improve the performance of these ROMs. For exam-
ple, hyperreduction techniques have been developed to efficiently deal with nonlin-
ear terms, stabilization techniques have been incorporated to deal with instabilities,
DMD has been introduced as a non-intrusive variant of POD-G, and suitable basis
generation methods have been developed for parametric ROMs.
As in many other fields, the availability of data and open access to ML models has led to the increased popularity of ML techniques for reduced order modeling. ML has been
used to obtain ROMs with lower computational expense and enhanced accuracy as
compared to the conventional techniques. Not only this, but ML has also opened new
avenues of reduced order modeling which the conventional techniques were unable
to pursue; nonlinear dimension reduction being one of them. The reduced order
modeling community has been using ML in all the possible ways. On one hand, ML
has been used to improve primarily physics-based ROMs by providing ML-based
closure models and correction terms. On the other hand, physics has been embedded
in primarily ML-based ROMs to improve their performance. Finally, NIROMs have

been developed based entirely on ML. Thus, a range of reduced order modeling techniques is at one's disposal, with purely conventional techniques at one end of the spectrum, purely ML techniques at the other end, and hybrid techniques lying in between.

References

Abadía-Heredia R et al (2022) A predictive hybrid reduced order model based on proper orthogonal
decomposition combined with deep learning architectures. Expert Syst Appl 187:115910
Ahmed HF et al (2021) Machine learning-based reduced-order modeling of hydrodynamic forces
using pressure mode decomposition. Proc Inst Mech Eng, Part G: J Aerosp Eng 235(16):2517–
2528
Ahmed SE et al (2020) A long short-term memory embedding for hybrid uplifted reduced order
models. Phys D: Nonlinear Phenom 409:132471
Ahmed SE et al (2021) On closures for reduced order models-a spectrum of first-principle to
machine-learned avenues. Phys Fluids 33(9):091301
Akhtar I, Borggaard J, Hay A (2010) Shape sensitivity analysis in flow models using a finite-
difference approach. Math Probl Eng
Alla A, Kutz JN (2017) Nonlinear model order reduction via dynamic mode decomposition. SIAM
J Sci Comput 39(5):B778–B796
Amsallem D, Farhat C (2011) An online method for interpolating linear parametric reduced-order
models. SIAM J Sci Comput 33(5):2169–2198
Amsallem D, Farhat C (2012) Stabilization of projection-based reduced-order models. Int J Numer
Methods Eng 91(4):358–377
An SS, Kim T, James DL (2008) Optimizing cubature for efficient integration of subspace defor-
mations. ACM Trans Graph 27(5):65:1–165:10
Antil H, Heinkenschloss M, Sorensen DC (2014) Application of the discrete empirical interpolation
method to reduced order modeling of nonlinear and parametric systems. In: Quarteroni A, Rozza
G (eds) Reduced order methods for modeling and computational reduction. MS&A—Modeling,
Simulation and Applications. Springer International Publishing, Cham, pp 101–136
Arian E, Fahl M, Sachs EW (2000) Trust-Region Proper Orthogonal Decomposition for Flow Con-
trol. Technical report. Institute for Computer Applications in Science and Engineering, Hampton
VA
Astrid P et al (2008) Missing point estimation in models described by proper orthogonal decompo-
sition. IEEE Trans Autom Control 53(10):2237–2251
Azaïez M, Chacon Rebollo T, Rubino S (2021) A cure for instabilities due to advection-dominance
in POD solution to advection-diffusion-reaction equations. J Comput Phys 425:109916
Baiges J et al (2020) A finite element reduced-order model based on adaptive mesh refinement and
artificial neural networks. Int J Numer Methods Eng 121(4):588–601
Baiges J et al (2021) An adaptive finite element strategy for the numerical simulation of additive
manufacturing processes. Addit Manuf 37:101650
Baiges J, Codina R, Idelsohn S (2015) Reduced-order subscales for POD models. Comput Methods
Appl Mech Eng 291:173–196
Baiges J, Codina R (2013a) A variational multiscale method with subscales on the element bound-
aries for the helmholtz equation. Int J Numer Methods Eng 93(6):664–684
Baiges J, Codina R, Idelsohn S (2013b) A domain decomposition strategy for reduced order models.
Application to the incompressible Navier–Stokes equations. Comput Methods Appl Mech Eng
267:23–42

Baiges J, Codina R, Idelsohn S (2013c) Explicit reduced-order models for the stabilized finite
element approximation of the incompressible Navier–Stokes equations. Int J Numer Methods
Fluids 72(12):1219–1243
Baldi P, Hornik K (1989) Neural networks and principal component analysis: learning from exam-
ples without local minima. Neural Netw 2(1):53–58
Ballard DH (1987) Modular learning in neural networks. In: Proceedings of the sixth national
conference on artificial intelligence, vol 1. AAAI’87. AAAI Press, Seattle, Washington, pp 279–
284
Ballarin F et al (2015) Supremizer stabilization of POD-galerkin approximation of parametrized
steady incompressible Navier–Stokes equations. Int J Numer Methods Eng 102(5):1136–1161
Barrault M et al (2004) An empirical interpolation method: application to efficient reduced-basis
discretization of partial differential equations. Comptes Rendus Mathematique 339(9):667–672
Benosman M, Chakrabarty A, Borggaard J (2020) Reinforcement learning-based model reduction
for partial differential equations. IFAC-PapersOnLine. 21st IFAC World Congress 53(2):7704–
7709
Bergmann M, Cordier L, Brancher J-P (2007) Drag minimization of the cylinder wake by trust-
region proper orthogonal decomposition. In: Active flow control. Springer, Berlin, pp 309–324
Bertsimas D, Dunn J (2017) Optimal classification trees. Mach Learn 106(7):1039–1082
Brooks AN, Hughes TJR (1982) Streamline upwind/petrov-galerkin formulations for convection
dominated flows with particular emphasis on the incompressible Navier–Stokes equations. Com-
put Methods Appl Mech Eng 32(1):199–259
Brunton SL et al (2017) Chaos as an intermittently forced linear system. Nat Commun 8(1):19
Brunton SL, Proctor JL, Kutz JN (2016) Discovering governing equations from data by sparse
identification of nonlinear dynamical systems. Proc Natl Acad Sci 113(15):3932–3937
Bui-Thanh T, Willcox K, Ghattas O (2008) Model reduction for large-scale systems with high-
dimensional parametric input space. SIAM J Sci Comput 30(6):3270–3288
Buoso S et al (2022) Stabilized reduced-order models for unsteady incompressible flows in three-
dimensional parametrized domains. Comput Fluids 246:105604
Burkardt J, Gunzburger M, Lee H-C (2006) POD and CVT-based reduced-order modeling of Navier–
Stokes flows. Comput Methods Appl Mech Eng 196(1–3):337–355
Callaham JL et al (2022) An empirical mean-field model of symmetry-breaking in a turbulent wake.
Sci Adv 8(19):eabm4786
Carlberg K, Barone M, Antil H (2017) Galerkin v. Least-Squares Petrov-Galerkin projection in
nonlinear model reduction. J Comput Phys 330:693–734
Carlberg K, Bou-Mosleh C, Farhat C (2011) Efficient non-linear model reduction via a least-squares
petrov-galerkin projection and compressive tensor approximations. Int J Numer Methods Eng
86(2):155–181
Champion K et al (2019) Data-driven discovery of coordinates and governing equations. Proc Natl
Acad Sci 116(45):22445–22451
Chatterjee A (2000) An introduction to the proper orthogonal decomposition. Curr Sci 78(7):808–
817
Chaturantabut S, Sorensen DC (2010) Nonlinear model reduction via discrete empirical interpola-
tion. SIAM J Sci Comput 32(5):2737–2764
Chen KK, Tu JH, Rowley CW (2012) Variants of dynamic mode decomposition: boundary condition,
koopman, and fourier analyses. J Nonlinear Sci 22(6):887–915
Chen W et al (2021) Physics-informed machine learning for reduced-order modeling of nonlinear
problems. J Comput Phys 446:110666
Chen Z, Zhao Y, Huang R (2019) Parametric reduced-order modeling of unsteady aerodynamics
for hypersonic vehicles. Aerosp Sci Technol 87:1–14
Chinesta F, Ammar, A Cueto E (2010) Recent advances and new challenges in the use of the proper
generalized decomposition for solving multidimensional models. Arch Comput Methods Eng
17(4):327–350

Chinesta F, Ladeveze P, Cueto E (2011) A short review on model order reduction based on proper
generalized decomposition. Arch Comput Methods Eng 18(4):395
Codina R (2000a) On stabilized finite element methods for linear systems of convection-diffusion-
reaction equations. Comput Methods Appl Mech Eng 188(1):61–82
Codina R (2000b) Stabilization of incompressibility and convection through orthogonal sub-scales
in finite element methods. Comput Methods Appl Mech Eng 190(13–14):1579–1599
Codina R (2002) Stabilized finite element approximation of transient incompressible flows using
orthogonal subscales. Comput Methods Appl Mech Eng 191(39–40):4295–4321
Codina R et al (2007) Time dependent subscales in the stabilized finite element approximation of
incompressible flow problems. Comput Methods Appl Mech Eng 196(21–24):2413–2430
Codina R et al (2018) Variational multiscale methods in computational fluid dynamics. Encycl
Comput Mech 1–28
Codina R, Baiges J (2011) Finite element approximation of transmission conditions in fluids and
solids introducing boundary subgrid scales. Int J Numer Methods Eng 87(1–5):386–411
Codina R, Principe J, Baiges J (2009) Subscales on the element boundaries in the variational two-
scale finite element method. Comput Methods Appl Mech Eng 198(5–8):838–852
Codina R, Reyes R, Baiges J (2021) A posteriori error estimates in a finite element vms-based
reduced order model for the incompressible Navier–Stokes equations. Mech Res Commun. Spe-
cial Issue Honoring G.I. Taylor Medalist Prof. Arif Masud 112:103599
Dal Santo N et al (2019) An algebraic least squares reduced basis method for the solution of
nonaffinely parametrized stokes equations. Comput Methods Appl Mech Eng 344:186–208
Daniel T et al (2020) Model order reduction assisted by deep neural networks (ROM-net). Adv
Model Simul Eng Sci 7(1):16
Dar Z, Baiges J, Codina R (2023) Artificial neural network based correction models for reduced
order models in computational fluid mechanics. Comput Methods Appl Mech Eng 415:116232
Deng N et al (2020) Low-order model for successive bifurcations of the fluidic pinball. J Fluid
Mech 884:A37
Dupuis R, Jouhaud J-C, Sagaut P (2018) Surrogate modeling of aerodynamic simulations for mul-
tiple operating conditions using machine learning. AIAA J 56(9):3622–3635
Eckart C, Young G (1936) The approximation of one matrix by another of lower rank. Psychometrika
1(3):211–218
Eivazi H et al (2022) Towards extraction of orthogonal and parsimonious non-linear modes from
turbulent flows. Expert Syst Appl 202:117038
Everson R, Sirovich L (1995) Karhunen-Loeve procedure for gappy data. JOSA A 12(8):1657–1664
Fabra A, Baiges J, Codina R (2022) Finite element approximation of wave problems with correcting
terms based on training artificial neural networks with fine solutions. Comput Methods Appl Mech
Eng 399:115280
Farhat C, Chapman T, Avery P (2015) Structure-preserving, stability, and accuracy properties of
the energy conserving sampling and weighting method for the hyper reduction of nonlinear finite
element dynamic models. Int J Numer Methods Eng 102(5):1077–1110
Fresca S, Dede’ L, Manzoni A (2021) A comprehensive deep learning-based approach to reduced
order modeling of nonlinear time-dependent parametrized PDEs. J Sci Comput 87(2):61
Fresca S, Manzoni A (2022) POD-DL-ROM: enhancing deep learning-based reduced order models
for nonlinear parametrized PDEs by proper orthogonal decomposition. Comput Methods Appl
Mech Eng 388:114181
Galletti B et al (2004) Low-order modelling of laminar flow regimes past a confined square cylinder.
J Fluid Mech 503:161–170
García-Archilla B, Novo J, Rubino S (2022) Error analysis of proper orthogonal decomposition
data assimilation schemes with grad-div stabilization for the Navier–Stokes equations. J Comput
Appl Math 411:114246
Giere S et al (2015) SUPG reduced order models for convection-dominated convection-diffusion-
reaction equations. Comput Methods Appl Mech Eng 289:454–474

Giere S, John V (2017) Towards physically admissible reduced-order solutions for convection-
diffusion problems. Appl Math Lett 73:78–83
Glaz B, Liu L, Friedmann PP (2010) Reduced-order nonlinear unsteady aerodynamic modeling
using a surrogate-based recurrence framework. AIAA J 48(10):2418–2429
Gonzalez FJ, Balajewicz M (2018) Deep convolutional recurrent autoencoders for learning low-
dimensional feature dynamics of fluid systems. arXiv:1808.01346 [physics]
Graham WR, Peraire J, Tang KY (1999) Optimal control of vortex shedding using low-order models.
Part I-open-loop model development. Int J Numer Methods Eng 44(7):945–972
Graves A, Schmidhuber J (2005) Framewise phoneme classification with bidirectional LSTM and
other neural network architectures. Neural Netw. IJCNN 2005 18(5):602–610
Guan Y, Brunton SL, Novosselov I (2021) Sparse nonlinear models of chaotic electroconvection.
R Soc Open Sci 8(8):202367
Guo M, Hesthaven JS (2019) Data-driven reduced order modeling for time-dependent problems.
Comput Methods Appl Mech Eng 345:75–99
Guo M, Hesthaven JS (2018) Reduced order modeling for nonlinear structural analysis using gaus-
sian process regression. Comput Methods Appl Mech Eng 341:807–826
Hesthaven JS, Rozza G, Stamm B (2016) Certified reduced basis methods for parametrized partial
differential equations. SpringerBriefs in Mathematics. Springer International Publishing, Cham
Hesthaven JS, Ubbiali S (2018) Non-intrusive reduced order modeling of nonlinear problems using
neural networks. J Comput Phys 363:55–78
Higgins I et al (2022) Beta-VAE: learning basic visual concepts with a constrained variational
framework. In: International conference on learning representations
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Hughes TJR et al (1998) The variational multiscale method-a paradigm for computational mechan-
ics. Comput Methods Appl Mech Engineering Adv Stab Methods Comput Mech 166(1):3–24
Hunter A et al (2019) Reduced-order modeling through machine learning and graph-theoretic
approaches for brittle fracture applications. Comput Mater Sci 157:87–98
Hurst HE (1951) Long-term storage capacity of reservoirs. Trans Am Soc Civ Eng 116(1):770–799
Lumley JL (1967) The structure of inhomogeneous turbulent flows. In: Atmospheric turbulence and radio wave propagation, pp 166–178
John V, Moreau B, Novo J (2022) Error analysis of a SUPG-stabilized POD-ROM method for
convection diffusion-reaction equations. Comput Math Appl 122:48–60
Juang J-N (1994) Applied system identification. Prentice Hall
Kaheman K, Brunton SL, Kutz JN (2022) Automatic differentiation to simultaneously identify
nonlinear dynamics and extract noise probability distributions from data. Mach Learn: Sci Technol
3(1):015031
Kaiser E et al (2014) Cluster-based reduced-order modelling of a mixing layer. J Fluid Mech
754:365–414
Kalashnikova I, Barone M (2011) Stable and efficient galerkin reduced order models for non-linear
fluid flow. In: 6th AIAA theoretical fluid mechanics conference, p 3110
Kapteyn MG, Knezevic DJ, Willcox K (2020) Toward predictive digital twins via component-
based reduced-order models and interpretable machine learning. In: AIAA scitech 2020 forum.
American Institute of Aeronautics and Astronautics
Kast M, Guo M, Hesthaven JS (2020) A non-intrusive multifidelity method for the reduced order
modeling of nonlinear problems. Comput Methods Appl Mech Eng 364:112947
Kingma DP, Welling M (2013) Auto-encoding variational bayes. In: International conference on
learning representations
Lee K, Carlberg KT (2020a) Deep conservation: a latent-dynamics model for exact satisfaction of
physical conservation laws. arXiv:1909.09754 [physics]
Lee K, Carlberg KT (2020b) Model reduction of dynamical systems on nonlinear manifolds using
deep convolutional autoencoders. J Comput Phys 404:108973

LeGresley P, Alonso J (2000) Airfoil design optimization using reduced order models based on
proper orthogonal decomposition. In: Fluids 2000 conference and exhibit. American Institute of
Aeronautics and Astronautics
Li J, Du X, Martins JRRA (2022) Machine learning in aerodynamic shape optimization. Prog
Aerosp Sci 134:100849
Ljung L (1998) System identification: theory for the user, 2nd edn. Pearson, Upper Saddle River,
NJ
Loiseau J-C (2020) Data-driven modeling of the chaotic thermal convection in an annular ther-
mosyphon. Theor Comput Fluid Dyn 34:339–365
Loiseau J-C, Brunton SL (2018) Constrained sparse Galerkin regression. J Fluid Mech 838:42–67
Lucia DJ, Beran PS (2003) Projection methods for reduced order models of compressible flows. J
Comput Phys 188(1):252–280
Lusch B, Kutz JN, Brunton SL (2018). Deep learning for universal linear embeddings of nonlinear
dynamics. Nat Commun 9(1):4950
Maulik R et al (2021) Latent-space time evolution of non-intrusive reduced-order models using
gaussian process emulation. Phys D: Nonlinear Phenom 416:132797
Ma C, Wang J (2019) Model reduction with memory and the machine learning of dynamical systems.
Commun Comput Phys 25(4)
Milano M, Koumoutsakos P (2002) Neural network modeling for near wall turbulent flow. J Comput
Phys 182(1):1–26
Mohan AT, Gaitonde DV (2018) A deep learning based approach to reduced order modeling for
turbulent flow control using LSTM neural networks. arXiv:1804.09269 [physics]
Mohan AT et al (2020) Embedding hard physical constraints in neural network coarse-graining of
3D turbulence. arXiv:2002.00021 [physics]
Mohebujjaman M, Rebholz L, Iliescu T (2019) Physically constrained data-driven correction for
reduced order modeling of fluid flows. Int J Numer Methods Fluids 89(3):103–122
Mori H (1965) Transport, collective motion, and brownian motion*). Prog Theor Phys 33(3):423–
455
Mou C et al (2021) Data-driven variational multiscale reduced order models. Comput Methods
Appl Mech Eng 373:113470
Murata T, Fukami K, Fukagata K (2020) Nonlinear mode decomposition with convolutional neural
networks for fluid dynamics. J Fluid Mech 882:A13
Noack BR et al (2003) A hierarchy of low-dimensional models for the transient and post-transient
cylinder wake. J Fluid Mech 497:335–363
Noack BR et al (2016) Recursive dynamic mode decomposition of transient and post-transient wake
flows. J Fluid Mech 809:843–872
Noack BR et al (eds) (2011) Reduced-order modelling for flow control, vol 528. Springer, CISM
International Centre for Mechanical Sciences. Vienna
Otto SE, Rowley CW (2019) Linearly recurrent autoencoder networks for learning dynamics. SIAM
J Appl Dyn Syst 18(1):558–593
Pacciarini P, Rozza G (2014) Stabilized reduced basis method for parametrized advection-diffusion
PDEs. Comput Methods Appl Mech Eng 274:1–18
Pawar S et al (2019) A deep learning enabler for nonintrusive reduced order modeling of fluid flows.
Phys Fluids 31(8):085101
Pawar S et al (2021) Model fusion with physics-guided machine learning: projection-based reduced-
order modeling. Phys Fluids 33(6):067123
Raissi M, Perdikaris P, Karniadakis GE (2019) Physics-informed neural networks: a deep learning
framework for solving forward and inverse problems involving nonlinear partial differential
equations. J Comput Phys 378:686–707
Rasmussen CE, Williams CKI (2005) Gaussian processes for machine learning
Reyes R et al (2018) Reduced order models for thermally coupled low mach flows. Adv Model
Simul Eng Sci 5(1):1–20

Reyes R, Codina R (2020) Projection-based reduced order models for flow problems: a variational
multiscale approach. Comput Methods Appl Mech Eng 363:112844
Rozza G, Huynh DBP, Manzoni A (2013) Reduced basis approximation and a posteriori error
estimation for stokes flows in parametrized geometries: roles of the inf-sup stability constants.
Rozza G, Lassila T, Manzoni A (2011) Reduced basis approximation for shape optimization in
thermal flows with a parametrized polynomial geometric map. In: Spectral and high order methods
for partial differential equations. Springer, Berlin, pp 307–315
Sahba S et al (2022) Dynamic mode decomposition for aero-optic wavefront characterization. Opt
Eng 61(1):013105
San O, Iliescu T (2015) A stabilized proper orthogonal decomposition reduced-order model for
large scale quasigeostrophic ocean circulation. Adv Comput Math 41(5):1289–1319
San O, Maulik R (2018a) Extreme learning machine for reduced order modeling of turbulent
geophysical flows. Phys Rev E 97(4):42322
San O, Maulik R (2018b) Neural network closures for nonlinear model order reduction. Adv Comput
Math 44(6):1717–1750
San O, Pawar S, Rasheed A (2022) Variational multiscale reinforcement learning for discovering
reduced order closure models of nonlinear spatiotemporal transport systems. arXiv:2207.12854
[physics]
Schmid PJ (2010) Dynamic mode decomposition of numerical and experimental data. J Fluid Mech
656:5–28
Schmid PJ, Violato D, Scarano F (2012) Decomposition of time-resolved tomographic PIV. Exp
Fluids 52(6):1567–1579
Shah NV et al (2022) Finite element based model order reduction for parametrized one-way coupled
steady state linear thermo-mechanical problems. Finite Elem Anal Des 212:103837
Srinivasan PA et al (2019) Predictions of turbulent shear flows using deep neural networks. Phys
Rev Fluids 4(5):054603
Suykens JAK et al (2002) Least squares support vector machines. World Scientific
Takeishi N, Kawahara Y, Yairi T (2017) Learning koopman invariant subspaces for dynamic mode
decomposition. Proceedings of the 31st international conference on neural information processing
systems. NIPS’17. Curran Associates Inc., Red Hook, NY, USA, pp 1130–1140
Tello A, Codina R (2021) Field-to-field coupled fluid structure interaction: a reduced order model
study. Int J Numer Methods Eng 122(1):53–81
Tello A, Codina R, Baiges J (2020) Fluid structure interaction by means of variational multiscale
reduced order models. Int J Numer Methods Eng 121(12):2601–2625
Tissot G et al (2014) Model reduction using dynamic mode decomposition. Comptes Rendus
Mécanique. Flow Separation Control 342(6):410–416
Tu JH et al (2014) On dynamic mode decomposition: theory and applications. J Comput Dyn
1(2):391–421
Vlachas PR et al (2018) Data-driven forecasting of high-dimensional chaotic systems with long
short-term memory networks. Proc R Soc A: Math, Phys Eng Sci 474(2213):20170844
Wan ZY, Sapsis TP (2017) Reduced-space gaussian process regression for data-driven probabilistic
forecast of chaotic dynamical systems. Phys D: Nonlinear Phenom 345:40–55
Wang Z et al (2012) Proper orthogonal decomposition closure models for turbulent flows: a numer-
ical comparison. Comput Methods Appl Mech Eng 237:10–26
Wang Q, Hesthaven JS, Ray D (2019) Non-intrusive reduced order modeling of unsteady flows using
artificial neural networks with application to a combustion problem. J Comput Phys 384:289–307
Wang Q, Ripamonti N, Hesthaven JS (2020) Recurrent neural network closure of parametric
POD-Galerkin reduced-order models based on the mori-zwanzig formalism. J Comput Phys
410:109402
Wehmeyer C, Noe F (2018) Time-lagged autoencoders: deep learning of slow collective variables
for molecular kinetics. J Chem Phys 148(24):241703
8 Reduced Order Modeling 339

Williams MO, Kevrekidis IG, Rowley CW (2015) A data-driven approximation of the koopman
operator: extending dynamic mode decomposition. J Nonlinear Sci 25(6):1307–1346
Xie X, Webster C, Iliescu T (2020) Closure learning for nonlinear model reduction using deep
residual neural network. Fluids 5(1):39
Xu S et al (2013) Multi-output least-squares support vector regression machines. Pattern Recognit
Lett 34(9):1078–1084
Yousif MZ, Lim H-C (2022) Reduced-order modeling for turbulent wake of a finite wall-mounted
square cylinder based on artificial neural network. Phys Fluids 34(1):015116
Yvonnet J, He Q-C (2007) The reduced model multiscale method (R3M) for the non-linear homog-
enization of hyperelastic media at finite strains. J Comput Phys 223(1):341–368
Zhao H (2021) A reduced order model based on machine learning for numerical analysis: an
application to geomechanics. Eng Appl Artif Intell 100:104194
Zhu Q, Guo Y, Lin W (2021)Neural delay differential equations. In: The international conference
on learning representations, p 20
Zwanzig R (1960)Ensemble method in the theory of irreversibility. J Chem Phys 33(5):1338–1341
Chapter 9
Regression Models for Machine Learning

Pengfei Wei and Michael Beer

9.1 Introduction

Regression is a typical supervised learning task that aims at learning a (parametric
or non-parametric) function fitting the quantitative relationship between a response
variable y and one or more predictor variables x = (x1 , x2 , . . . , xd ) from the labeled
data D = (X, Y), where the response variables are assumed to be continuous or at
least have many candidate values. We assume that the training data of the predictor
variables is organized in a matrix X of dimension (n × d) with the jth row x j being
the jth point of x, and the corresponding response values organized in a column vector
Y = (y₁, y₂, . . . , yₙ)ᵀ of dimension n, with the jth element potentially satisfying

y_j = g(x_j) + ε_j,    (9.1)

where ε_j refers to the noise of the jth training point, and g(x_j) is the potential implicit
function to be inferred from the training data D based on proper model assumptions
and measures of risk. The acquisition of regression methods with different features
appears in many engineering computation tasks, for example, in structural reliability
assessment, structural health monitoring, and multidisciplinary optimization design.
Specifically, as one of the most appealing streams in engineering computation, the
probabilistic Bayesian numerical methods for solving, e.g. numerical optimization,
multi-dimensional integration, and ODE/PDE, in a statistical inference perspective,

are mostly based on regression models for fitting some expensive-to-estimate and
implicit functions (Hennig et al. 2022). This motivates us to present this chapter for
introducing the regression techniques in a concise way. Specifically, some classical
regression models will be introduced from either a Bayesian or non-Bayesian per-
spective with a focus on understanding the philosophy and rationale behind these
models.
As for real-world practices, there are two phases with slight differences in gen-
erating the training data D. For the first phase, the data can be generated from
observations or measurements (e.g. the response of a building against seismic exci-
tation), for which case the data set X may or may not be designed, depending on
whether the placement of, e.g. the sensors for measurement, can be designed. For the
second phase, the purpose of regression is to generate a cheap-to-estimate surrogate
for approximating expensive-to-estimate simulator, revealing that the training data X
can be arbitrarily designed and Y is generated by calling the simulator. For the latter
phase, there may exist alternative simulators with different levels of fidelity, and the
resultant models are termed as multi-fidelity surrogates (Perdikaris et al. 2017). In
this chapter, it is assumed that the data X can be designed, bringing the motivation
of active learning.
The alternative models for regression mainly differ in the model forms and treat-
ments of model parameter estimations. The models and methods for regression to
be treated in this chapter include two groups, i.e. the non-Bayesian regression mod-
els based on minimizing the empirical loss function and the Bayesian regression
models, where the former class includes the Least Square Regression (LSR), the
Ridge Regression (RR) and the Support Vector Regression (SVR), equipped or not
equipped with the kernel trick, and the latter class to be examined includes the Bayesian
parametric regression and Gaussian Process Regression (GPR). The active learning
procedures for scientific computation based on Bayesian regression models are also
presented.

9.2 Parametric Regression: A Non-Bayesian Perspective

9.2.1 Least Square Regression

As has been stated above, different types of regression models can be grouped based
on the model forms and the measures of loss function. The LSR, as the name suggests,
uses the mean square error as loss function. The model parameters are then estimated
by minimizing this loss function. The models utilized in this group are commonly of
parametric form, and it is said to be linear LSR if the models show a linear relationship
with the model parameters (instead of the predictor variables). We take the linear
LSR as an example for introduction as the estimators of the model parameters are of
closed form.
9.2.1.1 Linear LSR Model and Parameter Estimation

Without loss of generality, we assume that there exists a set of basis functions,
termed as φ(x) = (1, φ₁(x), φ₂(x), . . . , φ_p(x))ᵀ. These basis functions are usually
called features, and the linear space spanned by them is termed as the feature space.
Transforming the data X into the feature space can usually facilitate the regression,
as the response may show linear behavior in the feature space. With this in mind, the
linear regression model is then formulated as:

y = βᵀφ(x) + ε,    (9.2)

where β = (β₀, β₁, . . . , β_p)ᵀ refers to the vector of model parameters to be learned
from the training data D, and ε denotes the noise (mostly assumed to be Gaussian
white noise with zero mean and constant variance).
With this model assumption, the prediction function is generated as ŷ(x) = βᵀφ(x).
Given the training data D = (X, Y), the mean squared error (MSE),
rooted in the Euclidean distance (or L₂ distance), is defined in matrix notation by:

MSE_D(β) = (1/n) (Y − ŷ(X))ᵀ (Y − ŷ(X)).    (9.3)

In functional form, it is defined as

MSE(β) = ∫_{R^d} (y(x) − ŷ(x))² f(x) dx,    (9.4)

which, when estimated with validation data, presents useful information for validat-
ing the regression model. In Eq. (9.4), f(x) refers to the probability density of the
predictor variables x, which serves as a weight of integral for defining the MSE. The
error defined by Eq. (9.4) is usually termed as generalization error or more specif-
ically, generalization MSE; while the one defined by Eq. (9.3) over a labeled sample
set is called empirical error, or more specifically, empirical MSE, as it is indeed the
average error on this sample set. The generalization error measures the true error for
regression in the input space with the consideration of the probability distribution of
the predictor variables. The empirical error, under specific assumptions, is an approx-
imation of the generalization (true) error. Indeed, if the samples used for computing
the empirical error are independently and identically distributed (i.i.d.) according to
f (x), the expectation of the empirical error is exactly the generalization error (see,
e.g. Ref. Mohri et al. 2018, Chap. 2 ).
The LSR expects to learn the model parameters by minimizing the generaliza-
tion MSE as the generalization capability of the assumed model can be maximized,
which, unfortunately, is commonly infeasible due to the limited training data. A
compromised and practical way is to minimize the empirical MSE defined over the
training data set, which leads to a point estimator of the model parameter formulated
by:
β̂_LSR = arg min_β MSE_D(β).    (9.5)

Setting the gradient of the empirical MSE with respect to the model parameters
equal to zero, the closed-form estimator can be easily derived as:

β̂_LSR = (AᵀA)⁻¹ AᵀY,    (9.6)

where A is the (n × (p + 1))-dimensional matrix generated by mapping the predictor
variable data to the feature space, which means its jth row is A_j = φ(x_j)ᵀ =
(1, φ₁(x_j), φ₂(x_j), . . . , φ_p(x_j)).
Given the estimate of the model parameters, the linear LSR model used for
prediction at any site x is:

ŷ_D = ŷ_D(x) = β̂ᵀ_LSR φ(x),    (9.7)

where the subscript “D” indicates that the LSR prediction model is trained with the
labeled data “D”.
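As a minimal illustration of Eqs. (9.6) and (9.7), the following Python sketch fits a linear LSR model with a user-chosen basis; the basis functions and data below are hypothetical and serve only to make the closed-form estimator concrete.

import numpy as np

def design_matrix(x, basis):
    """Map 1D inputs to the feature space: rows are phi(x_j) = (1, phi_1(x_j), ...)."""
    return np.column_stack([np.ones_like(x)] + [f(x) for f in basis])

# hypothetical training data and quadratic basis {1, x, x^2}
rng = np.random.default_rng(0)
x_train = rng.uniform(-1.0, 1.0, size=15)
y_train = 1.0 + 2.0 * x_train - 3.0 * x_train**2 + 0.1 * rng.standard_normal(15)
basis = [lambda x: x, lambda x: x**2]

A = design_matrix(x_train, basis)                    # (n x (p+1)) feature matrix
beta_lsr = np.linalg.solve(A.T @ A, A.T @ y_train)   # Eq. (9.6), via a linear solve

x_new = np.linspace(-1.0, 1.0, 5)
y_pred = design_matrix(x_new, basis) @ beta_lsr      # Eq. (9.7)
print(beta_lsr, y_pred)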
Given the trained LSR model, one of the most important concerns is the quality
of the model. This can be evaluated from two aspects. The first aspect concerns the
quality of fitness to the training data, while the second concerns its generalization
performance, i.e. the performance for prediction at unobserved points. For the
former task, the "Coefficient of Determination" is of special interest and will be
introduced in the following subsection. For the latter task, Ridge Regression
can be an excellent alternative.

9.2.1.2 Coefficient of Determination

As defined by Eq. (9.3), the empirical MSE is the average squared distance
between the true values and the predicted values of the LSR model, and it is intuitively
concluded that small values of the empirical MSE indicate good fitness of the LSR
model to the data. This is only partly true. To clarify this, the coefficient of
determination, also called "R squared" and denoted as R², can be helpful. Given the
training data D = (X, Y) with Y = (y₁, y₂, . . . , yₙ)ᵀ, the total Sum of Squares of
the response values is defined as:

SS_tot = Σ_{j=1}^{n} (y_j − ȳ)²,    (9.8)

with ȳ = (1/n) Σ_{j=1}^{n} y_j being the mean of the observed response values. The residual
Sum of Squares, defined as the accumulated squared errors between predicted and
observed response values, is given as:
SS_res = Σ_{j=1}^{n} (y_j − ŷ_D(x_j))².    (9.9)

The residual Sum of Squares can be interpreted as the variance of the observed
response values left unexplained by ŷ_D(x); subtracting it from the total Sum of Squares,
the regression Sum of Squares, measuring the explained variance, is defined as:

SS_reg = SS_tot − SS_res = Σ_{j=1}^{n} (ŷ_D(x_j) − ȳ)².    (9.10)

By normalizing SS_reg with SS_tot, the coefficient of determination is then defined as
(Johnson and Wichern 2007):

R² = SS_reg / SS_tot = Σ_{j=1}^{n} (ŷ_D(x_j) − ȳ)² / Σ_{j=1}^{n} (y_j − ȳ)²,    (9.11)

which can be interpreted as the percentage of the total variance of the observed
response data Y being explained by the linear LSR model ŷD (x).
R 2 takes values between 0 and 1, with R 2 = 1 indicating the model predictions
at any point of training data match precisely with the observed response values, and
R 2 = 0 implying that, at any training point, the model predicts the response value
as the mean ȳ of the observed response values, which corresponds to the case of
worst fitting. Usually, a higher value of R 2 indicates better fitness of the model to
the data. As an example, a training data set consisting of 15 points is created, and
the linear LSR is implemented with the functional basis being φ(x) = {1, x} and
φ(x) = {1, x, x²}, respectively. The results are compared in Fig. 9.1. As seen, with a linear
functional basis, the resultant R 2 value is 0.3026, indicating a bad fitness to the
training data. However, as the quadratic functional basis is added, the R 2 value goes
up to 0.9404, implying a much better fitness to the data.
For regression with multiple predictor variables, it is found that R² spuriously
increases with the dimension of the predictor variables, which may mislead
the analysts into drawing incorrect conclusions on the fitting quality; in particular,
it is unsuitable for comparing the performance of regression models with
different numbers of predictor variables. To eliminate this dimension effect, the
adjusted coefficient of determination, taking into account the degrees of freedom
of both SS_res and SS_tot, is adapted from R² as:

R̄² = 1 − (1 − R²) (n − 1)/(n − p − 1).    (9.12)

The adjusted R² values for the LSR models in Fig. 9.1 are R̄² = 0.2490 and R̄² =
0.9305, respectively, which are both smaller than the corresponding R² values.
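A small Python sketch of Eqs. (9.8)-(9.12) is given below; it reuses the hypothetical LSR example above, so the numerical values are illustrative only.

import numpy as np

def r_squared(y, y_hat, p):
    """Coefficient of determination (Eq. 9.11) and its adjusted version (Eq. 9.12)."""
    y, y_hat = np.asarray(y), np.asarray(y_hat)
    n = y.size
    ss_tot = np.sum((y - y.mean())**2)          # total sum of squares, Eq. (9.8)
    ss_res = np.sum((y - y_hat)**2)             # residual sum of squares, Eq. (9.9)
    r2 = 1.0 - ss_res / ss_tot
    r2_adj = 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)
    return r2, r2_adj

# e.g. with the predictions of the quadratic LSR model (p = 2 non-constant features):
# r2, r2_adj = r_squared(y_train, A @ beta_lsr, p=2)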
Fig. 9.1 Schematic illustration of the linear LSR model fitted with basis functions φ(x) = {1, x} (left) and φ(x) = {1, x, x²} (right)

The R² and R̄² values provide estimates of the extent of fitness of the linear LSR
model to the training data, but they do not provide a formal hypothesis test for the
overall significance of the model. For the latter purpose, one can refer to the overall F-test
for regression, which is closely related to the (adjusted) coefficients of determination.
Due to limited space, this will not be introduced here; interested readers
can refer to Chap. 7 of Ref. Johnson and Wichern (2007).

9.2.1.3 Ridge Regression for Alleviating Over-Fitting

The linear LSR is simple and easy to implement, benefiting from the closed-form
estimators of the model parameters β. However, the generalization performance can
be low as it aims at minimizing the empirical error , instead of the generalization
error. This may result in over-fitting of the regression model, which means good
fitness to the training data, but low predictive performance at unobserved points. As
prediction is usually the main objective of performing regression, developing tricks
for avoiding or alleviating over-fitting is then of special concern in many machine
learning algorithms.
One way to improve the generalization performance is to use the ridge regression,
where the target function for minimization is modified as:
T(β) = MSE_D(β) + λ Σ_{i=1}^{p} β_i²,    (9.13)

where the regularization term Σ_{i=1}^{p} β_i² (the squared L₂ norm of β) controls the
model complexity, and λ is a (user-defined) positive parameter used for balancing
the model complexity against accuracy (quantified by the empirical MSE). Indeed,
under specific assumptions, the target function given by Eq. (9.13) is an upper bound
of the generalization error MSE (β) defined by Eq. (9.4). Therefore, minimizing the
target function T (β) is equivalent to minimizing the generalization error. This is the
intrinsic reason why the generalization performance can be improved and the over-
fitting can be alleviated by introducing the regularization term. The above improved
version of linear LSR is called Ridge Regression (RR), which is closely related to
Gaussian process regression, to be introduced in Sect. 9.3. If the features φ(x)
are set to be the eigenfunctions of a kernel, it is called Kernel Ridge Regression
(KRR).
By minimizing the target function in Eq. (9.13), the estimator of the model parameters for RR can be derived as:

β̂_RR = (AᵀA + λI)⁻¹ AᵀY,    (9.14)

where I refers to the identity matrix. For more details of the RR, one can refer to
Chap. 10 of Ref. Mohri et al. (2018) or Chaps. 3 and 5 of Ref. Theodoridis (2015).
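A one-line change to the LSR sketch above gives the ridge estimator of Eq. (9.14); the value of λ below is a hypothetical choice used only for illustration.

import numpy as np

def ridge_fit(A, y, lam):
    """Closed-form ridge estimator, Eq. (9.14): (A^T A + lam I)^{-1} A^T y."""
    p1 = A.shape[1]
    return np.linalg.solve(A.T @ A + lam * np.eye(p1), A.T @ y)

# e.g. beta_rr = ridge_fit(A, y_train, lam=1.0)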

9.2.2 Support Vector Regression

The model form assumed for linear SVR is the same as that for the linear LSR;
the difference appears mainly in the loss function, and thus also in the estimation of the
model parameters. Without loss of generality, the model form for SVR is assumed
to be:

ŷ(x) = βᵀφ(x) + β₀,    (9.15)

where φ(x) = (φ₁(x), φ₂(x), . . . , φ_p(x))ᵀ is the functional basis of the feature
space, and β = (β₁, β₂, . . . , β_p)ᵀ and β₀ are the model parameters to be estimated by
minimizing a loss function.

9.2.2.1 ε-Insensitive Loss Function

A typical difference between SVR and LSR is that, in SVR, the loss of each training
point contained in a tube with radius ε and center ŷ(x) = βᵀφ(x) + β₀ is assumed
to be zero. This feature makes SVR more tolerant of the noise contained in the
training data. For the jth training point (x_j, y_j), the loss function is defined as:

L_j(β, β₀) = max(0, |ŷ(x_j) − y_j| − ε).    (9.16)

Obviously, if the absolute difference |ŷ(x_j) − y_j| is less than ε, the loss is zero; when it is
larger than ε, the loss equals the amount by which the difference exceeds ε.
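The ε-insensitive loss of Eq. (9.16) is easy to state in code; the following short sketch is purely illustrative.

import numpy as np

def eps_insensitive_loss(y_true, y_pred, eps):
    """Eq. (9.16): zero inside the eps-tube, linear outside it."""
    return np.maximum(0.0, np.abs(y_pred - y_true) - eps)

# e.g. eps_insensitive_loss(np.array([1.0, 2.0]), np.array([1.3, 0.5]), eps=0.5) -> [0., 1.]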
Fig. 9.2 Schematic explanation of the ε-insensitive loss function

The overall ε-insensitive loss function over the training data D is then defined as:

L_D(β, β₀) = (1/2)‖β‖² + C Σ_{j=1}^{n} max(0, |ŷ(x_j) − y_j| − ε),    (9.17)

where the first term (1/2)‖β‖² is inversely proportional to the gap between the two margins
ŷ(x) − ε and ŷ(x) + ε, the second term measures the accumulated distances
of all the training points to the region bounded by the two margins, and the parameter C is introduced for balancing the above two factors, as explained by Fig. 9.2.
By minimizing the loss function L_D(β, β₀), one expects to produce a hyperplane
ŷ(x) = β̂ᵀφ(x) + β̂₀ in the feature space that, on the one hand, maximizes the gap
between the two margins for fixed ε and, on the other hand, ensures that all the
training points get as close as possible to the region bounded by the two margins.

9.2.2.2 Parameter Estimation

The remaining task for SVR learning is to estimate the model parameters β and
β₀ by minimizing the ε-insensitive loss function defined by Eq. (9.17), which is not
as trivial as for LSR learning since the target function is not smooth. Define two
slack variables ζ_j and ζ_j* as follows:

ζ_j  = max(0, ŷ(x_j) − y_j − ε),
ζ_j* = max(0, y_j − ŷ(x_j) − ε).    (9.18)
In words, ζ_j is a non-negative variable taking the minimal value that satisfies
ŷ(x_j) − y_j − ε ≤ ζ_j, and ζ_j* is another non-negative variable taking the minimal
value constrained by y_j − ŷ(x_j) − ε ≤ ζ_j*. With these slack variables, the optimization problem for model parameter estimation can be reformulated as:

min_{β,β₀,ζ,ζ*}  L(β, β₀, ζ, ζ*) = (1/2)‖β‖² + C Σ_{j=1}^{n} (ζ_j + ζ_j*)
subject to  ŷ(x_j) − y_j ≤ ε + ζ_j,
            y_j − ŷ(x_j) ≤ ε + ζ_j*,    (9.19)
            ζ_j ≥ 0, ζ_j* ≥ 0, j = 1, . . . , n.

Applying the Lagrangian multiplier method, the above constrained optimization
problem can be equivalently transformed into an unconstrained minimization problem with the following target function:

L(β, β₀, ζ, ζ*, μ, μ*, α, α*) = (1/2)‖β‖² + C Σ_{j=1}^{n} (ζ_j + ζ_j*)
    − Σ_{j=1}^{n} μ_j ζ_j − Σ_{j=1}^{n} μ_j* ζ_j*
    + Σ_{j=1}^{n} α_j (ŷ(x_j) − y_j − ε − ζ_j)
    + Σ_{j=1}^{n} α_j* (y_j − ŷ(x_j) − ε − ζ_j*),    (9.20)

with μ_j, μ_j*, α_j and α_j* being the Lagrangian multipliers. Setting the gradient of
L(β, β₀, ζ, ζ*, μ, μ*, α, α*) with respect to each element of β, β₀, ζ and ζ* equal to
zero yields:

β = Σ_{j=1}^{n} (α_j − α_j*) φ(x_j)    (9.21a)
0 = Σ_{j=1}^{n} (α_j − α_j*)    (9.21b)
C = α_j + μ_j    (9.21c)
C = α_j* + μ_j*,    (9.21d)
C = α ∗j + μ∗j , (9.21d)

substituting all of which into Eq. (9.20), the dual optimization problem of Eq. (9.20)
is formulated as:
min_{α,α*}  L(α, α*) = Σ_{j=1}^{n} [ y_j (α_j* − α_j) − ε (α_j* + α_j) ]
    − (1/2) Σ_{j=1}^{n} Σ_{k=1}^{n} (α_j* − α_j) φ(x_j)ᵀφ(x_k) (α_k* − α_k)
subject to  Σ_{j=1}^{n} (α_j* − α_j) = 0,  0 ≤ α_j, α_j* ≤ C,  j = 1, . . . , n,
    (9.22)

which is a typical quadratic programming problem and can be numerically solved
by the Sequential Minimal Optimization (SMO) algorithm. With α and α* being
estimated, the estimate of the model parameters β, denoted as β̂_SVR, can be easily
computed with Eq. (9.21a). The estimate of the model parameter β₀ can be computed
by β̂_{0,SVR} = y_k + ε − Σ_{j=1}^{n} (α_j* − α_j) φ(x_k)ᵀφ(x_j), with x_k being any of
the training points of the input variables.
With the complementary conditions, it is known that, for any j ∈ {1, 2, . . . , n},
it holds that:

α_j × (ŷ(x_j) − y_j − ε − ζ_j) = 0    (9.23a)
α_j* × (ŷ(x_j) − y_j + ε + ζ_j*) = 0.    (9.23b)

Equation (9.23) reveals that, for any x_j being a support vector, with either α_j ≠ 0 or
α_j* ≠ 0 being true, it holds that ŷ(x_j) − y_j − ε = ζ_j ≥ 0 or y_j − ŷ(x_j) − ε = ζ_j* ≥ 0,
correspondingly, indicating that all the support vectors lie outside the tube
with radius ε. For the points inside the tube, it holds that α_j = α_j* = 0; thus the non-support
vectors make no contribution to determining the model parameter values,
as revealed by Eq. (9.21a). For each support vector x_j, either α_j = 0 or α_j* = 0
holds, thus the SVR model is uniquely determined by the support vectors. With the
above conclusions in mind, it is known that the value of the tolerance parameter ε
makes a trade-off between sparsity and accuracy, i.e. a larger value of ε results in a
smaller number of support vectors, but also leads to a higher risk of ignoring
important points which have significant effects on the model accuracy
(Mohri et al. 2018). Thus, a proper pre-selection of the values of ε and C is important
for successfully training the SVR model.

9.2.3 Kernel Trick

A common feature of LSR and SVR is that a large number of inner products of
vectors in a feature space needs to be computed, as can be found in Eqs. (9.14) and
(9.22). These numerical procedures are usually computationally cumbersome as the
feature space spanned by φ(x) can be of extremely high or even infinite dimension.
This can be greatly alleviated by using the kernel trick. Before presenting the kernel
trick for LSR/SVR, some important concepts related to kernels need to be introduced.

9.2.3.1 Kernel and Reproducing Kernel Hilbert Space

A kernel function, denoted as κ(x, x′), is defined as a function with two arguments
mapping two points x and x′ to the real space R. Given a feature space spanned by
φ(x), a kernel function can be defined via the inner product operator ⟨·, ·⟩ as:

κ(x, x′) = ⟨φ(x), φ(x′)⟩ = φ(x)ᵀφ(x′).    (9.24)

It measures the similarity between two points. Despite the definition of a kernel via
the inner product, it is not required to compute the inner product, on the contrary,
the inner product can be evaluated by calling the kernel function with much lower
computational cost. To show the connection between a kernel and the inner product
of a feature map, it is useful to first introduce Mercer’s theorem.

Theorem 1 (Mercer’s theorem) Deonote X ⊂ Rd as a compact set, and κ : X ×


X → R a continuous and symmetric function. If the condition
 
   
a (x) κ x, x a x dxdx  0 (9.25)
X X

holds for any square integrable function a(x) defined on X , the kernel function κ
admits a uniformly convergent expansion as

   
κ x, x = λi ψi (x) ψi x . (9.26)
i=1

A kernel satisfying Mercer’s condition in Eq. (9.25) is said to be Positive Def-



inite Symmetric (PDS). In Eq. (9.26), {λi }i=1 are a set of infinite eigenvalues

of the kernel function κ, and {ψi (x)}i=1 are the corresponding infinite eigen-
functions. For a PDS kernel function, the eigenvalues and eigenfunctions can be
uniquely√
determined.Comparing Eq. (9.24) with Eq. (9.26), it can be concluded that

φ (x) = λi ψi (x) i=1 , indicating that the feature map can be explicitly derived
from the eigen decomposition of the kernel function, and it is also known that ele-
ments are orthogonal to each other, i.e.


φi (x) , φ j (x) = λi λ j ψi (x)ψ j (x) dx = 0 (9.27)
X

holds for any i = j. Thus, φ(x) forms a set of orthogonal basis for a feature space.
Three commonly used kernels are summarized as follows:
• Polynomial kernel: κ(x, x′) = (⟨x, x′⟩ + C)^d.
• Gaussian radial basis kernel: κ(x, x′) = exp(−γ ‖x − x′‖²).
• Sigmoid kernel: κ(x, x′) = tanh(γ ⟨x, x′⟩ + C).

Take the polynomial kernel with C = 1 and d = 2 as an example: for any x, x′ ∈ R²,
it holds that:

κ(x, x′) = (⟨x, x′⟩ + 1)² = ⟨φ(x), φ(x′)⟩,    (9.28)

where φ(x) = (1, √2 x₁, √2 x₂, x₁², x₂², √2 x₁x₂)ᵀ. Obviously, estimating the inner
product ⟨φ(x), φ(x′)⟩ by calling the kernel has much lower computational complexity than calculating it by definition.
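The identity in Eq. (9.28) can be checked numerically; the short sketch below compares the kernel evaluation with the explicit feature-map inner product (purely illustrative).

import numpy as np

def poly_kernel(x, z, C=1.0, d=2):
    """Polynomial kernel (<x, z> + C)^d."""
    return (x @ z + C) ** d

def phi(x):
    """Explicit feature map of the polynomial kernel with C = 1, d = 2 in R^2."""
    x1, x2 = x
    s = np.sqrt(2.0)
    return np.array([1.0, s * x1, s * x2, x1**2, x2**2, s * x1 * x2])

x = np.array([0.3, -1.2])
z = np.array([2.0, 0.5])
print(poly_kernel(x, z), phi(x) @ phi(z))   # the two values agree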
One more tool useful for understanding many of the regression methods is
the Reproducing Kernel Hilbert Space (RKHS), which is given by the following
theorem.

Theorem 2 (Reproducing Kernel Hilbert Space, RKHS) Given a PDS kernel κ:
X × X → R, there exists a Hilbert space (or feature space) H and a feature map
φ(x) from X to H such that the following holds:

∀x, x′ ∈ X,  κ(x, x′) = ⟨φ(x), φ(x′)⟩.    (9.29)

Besides, the Hilbert space H satisfies the reproducing property, i.e.

∀x ∈ X and ∀h ∈ H,  h(x) = ⟨h(x′), κ(x, x′)⟩,    (9.30)

where the inner product is defined by integrating x′ out. H is then called the RKHS
associated with κ.

For more details on Kernel and RKHS, one can refer to Chap. 11 of Ref. Theodoridis
(2015) and Chap. 5 of Ref. Mohri et al. (2018).

9.2.3.2 Kernel Trick for Regression

As revealed by Eq. (9.14) for RR and Eq. (9.22) for SVR, the parameter estimation
process involves the computation of many inner products, typically ⟨φ(x_j), φ(x_k)⟩
for j, k ∈ {1, 2, . . . , n}, each of which can be computationally expensive due to the
high or even infinite dimension of the feature space. Actually, with the RKHS H associated with a PDS kernel κ at hand, there is no need to know what exactly the
feature mapping is. Given a PDS kernel κ, we know that there must exist a feature
mapping φ(x) and a resultant feature space H making

⟨φ(x_j), φ(x_k)⟩ = κ(x_j, x_k)    (9.31)
φ x j , φ (x k ) = κ x j , x k (9.31)
9 Regression Models for Machine Learning 353

hold for any j, k ∈ {1, 2, . . . n}. Equation (9.31) is called Kernel Trick, which not
only facilitates the estimation of the inner products, but also realizes the avoidance
of building the feature map. We take the SVR as an example to illustrate the details,
but it also applies to both LSR and KRR.
Given a kernel κ(x, x′), based on the kernel trick, the dual optimization problem
given in Eq. (9.22) can be reformulated as:

min_{α,α*}  L(α, α*) = Σ_{j=1}^{n} [ y_j (α_j* − α_j) − ε (α_j* + α_j) ]
    − (1/2) Σ_{j=1}^{n} Σ_{k=1}^{n} (α_j* − α_j) κ(x_j, x_k) (α_k* − α_k)    (9.32)
subject to  Σ_{j=1}^{n} (α_j* − α_j) = 0,  0 ≤ α_j, α_j* ≤ C,  j = 1, . . . , n,

where the inner product φ(x_j)ᵀφ(x_k) is replaced by κ(x_j, x_k). Solving the
optimization problem of Eq. (9.32), the model parameters can
then be computed by β̂_SVR = Σ_{j=1}^{n} (α_j − α_j*) φ(x_j) and β̂_{0,SVR} = y_k + ε −
Σ_{j=1}^{n} (α_j* − α_j) φ(x_k)ᵀφ(x_j). The SVR model for prediction is then formulated
as:

ŷ_D(x) = β̂ᵀ_SVR φ(x) + β̂_{0,SVR}
       = Σ_{j=1}^{n} (α_j − α_j*) φ(x_j)ᵀφ(x) + β̂_{0,SVR}    (9.33)
       = Σ_{j=1}^{n} (α_j − α_j*) κ(x_j, x) + β̂_{0,SVR}.

It is now clear that, for both parameter estimation and prediction, there is no need
to know the explicit expression of the feature mapping φ(x), instead, only a kernel
function satisfying Mercer’s condition is required.
An example of implementing the SVR with linear kernel, polynomial kernel of
order 2, and Gaussian kernel is shown in Fig. 9.3 for illustration and comparison.
The value of ε is set to 0.6 for all three implementations. As can be seen, for
all three types of kernels, four points out of seven are identified as support vectors,
but the support vectors differ from kernel to kernel. With the polynomial and Gaussian
kernels, the SVR shows better fitness to the training data than that trained with the linear
kernel. For a strict comparison, some specific techniques such as cross-validation
are required, but will not be discussed in detail here due to the space limitation.
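In practice one rarely solves Eq. (9.32) by hand; libraries such as scikit-learn implement kernelized SVR directly. The sketch below is a hypothetical configuration that only mirrors the setting of Fig. 9.3 (ε = 0.6, Gaussian kernel); it is not the exact data used there.

import numpy as np
from sklearn.svm import SVR

# hypothetical 1D training data
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 7).reshape(-1, 1)
y = np.sin(x).ravel() + 0.1 * rng.standard_normal(7)

# epsilon-insensitive SVR with a Gaussian (RBF) kernel, cf. Eqs. (9.32)-(9.33)
model = SVR(kernel="rbf", C=10.0, epsilon=0.6, gamma=0.5).fit(x, y)

print("number of support vectors:", model.support_.size)
print(model.predict(np.array([[0.0], [1.5]])))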
Fig. 9.3 An example of SVR implemented with kernel function being linear, polynomial, and
Gaussian forms, respectively

9.3 Regression: A Bayesian Perspective

Both the LSR and SVR methods estimate the model parameters by minimizing the
empirical errors, and predict the response value at any unobserved point x as a deter-
ministic value. Instead, a Bayesian regression method produces a stochastic process,
and predicts the response value at x as a random variable, whose variation
summarizes the prediction error. In this section, we introduce the Bayesian regres-
sion from both a parametric space perspective and a functional space perspective
respectively. These views of understanding the Bayesian regression can be found in,
e.g. Chap. 2 of Ref. Rasmussen and Williams (2006).
For a better understanding of the difference between Bayesian and non-Bayesian
regression methods, and that between the parametric and non-parametric regression
methods, it is helpful to introduce the concept of Hypothesis Set H. In general terms,
it is defined as a set of functions mapping the vectors in the feature space H to the
response space. For example, given the model form assumption y = βᵀφ(x) + ε for
linear LSR, the hypothesis set is defined as H := {ŷ = βᵀφ(x) : β ∈ R^{p+1}}, and the
LSR regression problem can be formulated as finding an element from H minimizing
the empirical MSE.
9.3.1 Gaussian Process Regression: A Parametric Space Perspective

We assume that the regression model still takes the form of Eq. (9.2), i.e.
y = βᵀφ(x) + ε, indicating the hypothesis set is H := {ŷ = βᵀφ(x) : β ∈ R^{p+1}}.
Instead of finding an element from H by minimization of the empirical errors,
Bayesian parametric regression aims at attributing a probability distribution to the
model parameters β, and thus to the hypothesis set H, with the ability of summarizing
the prediction error under proper assumptions. This is realized by a Bayesian
inference procedure.
The prior assumption imposed on the problem includes two aspects. First, the
prior distribution of the model parameters β is assumed to be Gaussian with zero
mean and covariance Σ_pr, i.e. β ∼ N(0, Σ_pr), and the prior density is denoted by
f(β). Second, the noise is assumed to be Gaussian with zero mean and variance σ_n²,
and also independent from point to point. With the second assumption, the likelihood
function of the training data D can be formulated as:

f(D|β) = Π_{i=1}^{n} (1/(√(2π) σ_n)) exp( −(y_i − βᵀφ(x_i))² / (2σ_n²) )
       = (2πσ_n²)^{−n/2} exp( −‖Y − Aβ‖² / (2σ_n²) ).    (9.34)

With Bayes formula, the posterior density of β can then be formulated as:

f(β|D) = f(D|β) f(β) / f(D),    (9.35)

where f(D) = ∫ f(D|β) f(β) dβ is a constant called the evidence, used for normalizing
the posterior density.
As both f(β) and f(D|β) in Eq. (9.35) are Gaussian, the posterior density
f(β|D) must be of Gaussian form. The posterior mean β̄_post and posterior covariance
Σ_post can be easily derived as:

β̄_post = σ_n^{−2} Σ_post Aᵀ Y,    (9.36)

and

Σ_post = (σ_n^{−2} AᵀA + Σ_pr^{−1})^{−1},    (9.37)

respectively. Comparing the above results with the estimator given by Eq. (9.14), it
can be concluded that, with the assumption that the noise has unit variance, i.e.
σ_n² = 1, and the prior covariance is a diagonal matrix with equal elements such that
Σ_pr^{−1} = λI, the posterior mean β̄_post is exactly the same as the estimator β̂_RR of
the RR parameters. From this point of view, the prior assumption can be viewed as a
penalty term which aims at improving the generalization performance of the model.
However, it should be kept in mind that, instead of estimating the model parameters
as deterministic values, the Bayesian Linear Regression (BLR) infers the parameter
as a (subjective) probability distribution, with the posterior covariance Σ_post sum-
marizing the epistemic uncertainty on the model parameters. With the increase in
the training sample size, the readers can verify that the posterior variance of each
parameter tends to decrease, indicating the reduction of epistemic uncertainty.
Given the above setting and results, the posterior prediction ŷ(x) = φ(x)ᵀβ
admits a Gaussian process, as it is a linear combination of a set of Gaussian variables,
and the posterior mean and variance are formulated as:

μ_y(x) = σ_n^{−2} φ(x)ᵀ Σ_post Aᵀ Y,    (9.38)

and

σ_y²(x) = φ(x)ᵀ Σ_post φ(x),    (9.39)

respectively. The posterior mean provides an expected prediction at x, while the


posterior variance summarizes the corresponding prediction error, and reflects the
degree of uncertainty on this prediction.
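A compact numpy sketch of Eqs. (9.36)-(9.39) is given below; the prior covariance, noise variance, and data layout are hypothetical choices made only for illustration.

import numpy as np

def blr_posterior(A, y, sigma_n2=1.0, Sigma_pr=None):
    """Posterior of Bayesian linear regression, Eqs. (9.36)-(9.37)."""
    p1 = A.shape[1]
    Sigma_pr = np.eye(p1) if Sigma_pr is None else Sigma_pr
    Sigma_post = np.linalg.inv(A.T @ A / sigma_n2 + np.linalg.inv(Sigma_pr))
    beta_post = Sigma_post @ A.T @ y / sigma_n2
    return beta_post, Sigma_post

def blr_predict(phi_x, beta_post, Sigma_post):
    """Posterior mean and variance of the prediction, Eqs. (9.38)-(9.39)."""
    mean = phi_x @ beta_post
    var = phi_x @ Sigma_post @ phi_x
    return mean, var

# e.g. with A built from the features (1, x) of five training points:
# beta_post, Sigma_post = blr_posterior(A, y)
# m, v = blr_predict(np.array([1.0, 0.3]), beta_post, Sigma_post)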
We use a training set consisting of five points to illustrate the BLR method. The
model form is assumed to be y = β0 + β1 x. The BLR is implemented by setting the
noise variance as one and the prior distribution of (β0 , β1 ) as independent standard
Gaussian. The three key factors for Bayesian inference, i.e. the prior density, the
likelihood function, and the posterior density, of (β0 , β1 ), are compared in Fig. 9.4.
It is seen that the posterior density shares more similarity with the likelihood function
than the prior density, but the standard Gaussian prior assumption makes the posterior

Fig. 9.4 Example of BLR, with the three panels showing the contours of the prior density, the likelihood function, and the posterior density of the model parameters (β₀, β₁), respectively
Fig. 9.5 Comparison of the prediction models generated by LSR, RR, and BLR (posterior mean with 95.45% confidence intervals), respectively

density closer to the origin than the likelihood function. This is consistent with the
conclusion drawn following Eq. (9.37) that the prior assumption is indeed a penalty
term.
The prediction models generated by LSR, RR, and BLR are then compared in Fig.
9.5. For RR, the penalty coefficient λ is set to be one, and in this case, the induced RR
prediction model is the same as the posterior mean prediction model induced by BLR,
as shown in Fig. 9.5. Instead of predicting the response value at x as a deterministic
value, the BLR estimates it as a Gaussian distribution, and the filled region in Fig. 9.5
indicates the 95.45% posterior confidence intervals. This feature makes it possible
for the analysts to make a trade-off between risk and prediction accuracy especially
when specific decisions need to be made based on the predictions.
The above procedure applies to any parametric forms of model assumptions, but
for model forms showing a nonlinear relationship between the response and the model
parameters, the posterior distribution of the model parameters usually does not admit a
closed form, and a conditional sampling technique such as Markov Chain Monte Carlo
(MCMC) sampling needs to be introduced for sampling from the posterior density.
It is now clear that, for Bayesian parametric regression with hypothesis set
H := {ŷ = βᵀφ(x) : β ∈ R^{p+1}}, the vector β of model parameters is assumed to be of
finite dimension, and the problem is simplified to assigning a proper (subjective)
probability distribution for β. Commonly, we expect to extract as many features as
possible for model assumption as this makes the regression more flexible for captur-
ing the behaviors between response and predictor variables. As introduced in Sect.
9.2.3, this can be easily realized with a RKHS equipped with a kernel function. For
example, given a Gaussian kernel, the corresponding RKHS can realize almost any
smooth functions; with a Matérn kernel, a family of functions with a specific order
of derivatives can be realized (Rasmussen and Williams 2006).
In the case with infinite-dimensional feature space, the hypothesis set can still be
indexed by the model parameter β, and the problem can still be formulated in the
parametric space; however, it is intractable to assign a proper probability distribution
to the infinite-dimensional vector β. In contrast to the parametric space perspective,
a functional space perspective for viewing the Bayesian regression problems can be
used in this case to extend the above inference technique from finite-dimensional
feature space to infinite-dimensional cases. This leads to the general GPR model,
which will be introduced in the following subsection.

9.3.2 Gaussian Process Regression: A Functional Space Perspective

Instead of imposing a probability distribution assumption on the model parameters
β, we can assign a probability distribution directly to the model function y(x). With
a Gaussian prior assumption, the model function y(x) admits a Gaussian process as:

y(x) ∼ GP(m(x), κ(x, x′)),    (9.40)

where m(x) denotes the prior mean function, which can be assumed to be zero, constant,
or polynomial, and κ(x, x′) is a kernel function indicating the prior covariance
of y(x) and y(x′). Without loss of generality, we assume that the prior mean takes
a constant value, denoted as β, and the prior covariance function takes the Gaussian
kernel (also called squared exponential kernel) with a distinct length-scale parameter in
each dimension, that is:

κ(x, x′) = σ₀² exp( −(1/2) (x − x′)ᵀ Λ⁻¹ (x − x′) ),    (9.41)

where σ₀² is the prior variance of y(x), assumed to be a positive constant across the
space of x, and Λ = diag(σ₁², σ₂², . . . , σ_d²) collects the length-scale parameters,
with σ_i denoting the length-scale parameter defining the strength of correlation of
y(x) in the ith dimension. One notes that the parameters β, σ₀² and Λ can no longer
be called "model parameters"; they are commonly termed "hyper-parameters".
Actually, the model parameters, in this case, are of infinite dimension due to the
infinite dimension of the feature space spanned by the eigenfunctions of the kernel;
the model parameter in this case is indeed the function y(x) itself. Next, for further
inference, the values of the hyper-parameters β, σ₀² and Λ need to be estimated based
on the training data D.
With the above prior assumption, the labeled data Y is an n-dimensional Gaussian
vector with mean vector m(X|β) = βe and covariance matrix σ₀²K(Λ), where
e denotes the n-dimensional column vector with each element equal to one, and
the (i, j)th element of σ₀²K(Λ) is κ(x_i, x_j | σ₀², Λ). The likelihood function of the
training data D conditional on the hyper-parameters β, σ₀² and Λ is then formulated as:
f(D|β, σ₀², Λ) = (1/√((2π)ⁿ |σ₀²K(Λ)|)) exp( −(1/(2σ₀²)) (Y − βe)ᵀ K⁻¹(Λ) (Y − βe) ),    (9.42)

the negative logarithm of which is then formulated as:

−log f(D|β, σ₀², Λ) ∝ (1/(2σ₀²)) (Y − βe)ᵀ K⁻¹(Λ) (Y − βe) + (1/2) log|σ₀²K(Λ)| ≜ L(β, σ₀², Λ).    (9.43)
The model hyper-parameters can then be evaluated with two alternative schemes.
First, a full Bayesian inference can be applied to infer a posterior density formulated
by multiplying the likelihood function of Eq. (9.42) with a proper prior density
f(β, σ₀², Λ). This commonly requires the implementation of a conditional sampling
technique (e.g. MCMC) to generate posterior samples of (β, σ₀², Λ), which is
commonly computationally cumbersome. Another strategy is to evaluate the hyper-parameters
as deterministic values by maximizing the likelihood function, which is
equivalent to minimization of L(β, σ₀², Λ). In most of the literature, the second scheme is
applied, and thus its details are introduced in what follows.
 
Setting the first-order partial derivatives of L(β, σ₀², Λ) with respect to β and σ₀² to zero,
the estimators of the hyper-parameters β and σ₀² can be formulated as:

β̂(Λ) = (eᵀ K⁻¹(Λ) e)⁻¹ eᵀ K⁻¹(Λ) Y    (9.44a)
σ̂₀²(Λ) = (1/n) (Y − β̂e)ᵀ K⁻¹(Λ) (Y − β̂e).    (9.44b)

Substituting Eq. (9.44) into Eq. (9.43), the optimization for hyper-parameter estimation can then be simplified as:

Λ̂ = arg min_Λ L(β̂(Λ), σ̂₀²(Λ), Λ).    (9.45)

The above optimization problem can be solved by using the limited-memory


Broyden-Fletcher-Goldfarb-Shanno (LBFGS) quasi-Newton approximation algo-
rithm, and one can refer to Ref. Nocedal and Wright (2006) for details.
After the hyper-parameters are estimated, the posterior distribution of ŷ (x) for
prediction at x can be obtained. Based on the prior assumption, it is known that ŷ (x)
and Y are jointly Gaussian. The probability distribution of ŷ (x) conditional on Y is
also Gaussian, and the posterior mean and covariance can be derived as:

μ_y(x) = m(x) + κ(X, x)ᵀ K⁻¹ (Y − m(X))    (9.46a)
c_y(x, x′) = κ(x, x′) − κ(X, x)ᵀ K⁻¹ κ(X, x′),    (9.46b)

where κ(X, x) is an n-dimensional column vector indicating the prior covariance
between each training point and x, and its ith element is κ(x_i, x).
The posterior mean given by Eq. (9.46a) presents a mean prediction for the
response at x. Based on the RKHS, it is known that the prior covariance term
κ(x_i, x) can be expressed as an inner product, i.e. κ(x_i, x) = ⟨φ(x_i), φ(x)⟩, where
φ(x) = {√λ_i ψ_i(x)}_{i=1}^{∞}, with (λ_i, ψ_i(x)) being an eigenvalue-eigenfunction pair.
The posterior mean is then a linear combination of the feature mapping φ (x), thus
being an element of the RKHS associated with the kernel κ(x, x  ). Further, each
realization of the Gaussian process governed by Eq. (9.46) is also an element of the
corresponding RKHS. The above facts reveal that the functional behavior that can
be captured by the GPR model is mainly governed by the property of the kernel
function. Given the squared exponential kernel, the resultant GPR model can almost
realize any smooth function. In real-world applications, the selection of the best ker-
nel can be implemented based on the users’ understanding of the physical process.
One can refer to Chap. 4 of Ref. Rasmussen and Williams (2006) for the properties
of alternative kernels.
The posterior variance σ y2 (x) = c y (x, x) defined by Eq. (9.46b) summarizes the
prediction error at x. It can be found that the posterior variance is always smaller than
the prior variance; this is reasonable, since the epistemic uncertainty about the response value
at x tends to decrease after specific information on the model function has been learned
from the training data. As will be shown in Sect. 9.4 of this chapter, the posterior
covariance given by Eq. (9.46b) provides a basis for active learning.
One more interesting point is that there are specific connections between the
GPR model and the non-Bayesian regression models when the feature mappings are
defined with a kernel (see Chap. 6 of Ref. Rasmussen and Williams 2006). Take
LSR as an example. Suppose the feature mappings φ(x) are defined by a kernel
κ(x, x′); then the LSR prediction model can be reformulated as:

ŷ_LSR = β̂ᵀ_LSR φ(x)
      = ((AᵀA)⁻¹ Aᵀ Y)ᵀ φ(x)
      = Yᵀ K⁻¹ A φ(x)    (9.47)
      = Yᵀ K⁻¹ κ(X, x),

which is exactly the same as the posterior mean prediction given by Eq. (9.46a) if
the prior mean m(x) is assumed to be zero.
The GPR model introduced above is a noise-free version, indicating that the noise
is assumed to be zero. The mean prediction at a training point x_i is exactly equal to its
label value y_i, and the corresponding posterior variance equals zero. This makes it
essentially an interpolation method. To take the noise into consideration, one more
hyper-parameter σ_n² can be embedded into the prior covariance as κ(x, x′) + σ_n²,
and the covariance matrix is then reformulated as K + σ_n² I_n, with I_n being the
n-dimensional identity matrix. The remaining inference procedures remain almost
the same as in the noise-free version, and we do not give more details.
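The following numpy sketch implements the noise-free GPR predictor of Eq. (9.46) with a squared exponential kernel and fixed, hand-picked hyper-parameters; maximum-likelihood tuning of (β, σ₀², Λ) via Eqs. (9.44)-(9.45) is omitted, so all parameter values here are illustrative assumptions.

import numpy as np

def se_kernel(X1, X2, sigma0_sq=1.0, length=1.0):
    """Squared exponential kernel, Eq. (9.41), with a single length-scale."""
    d2 = (X1[:, None, :] - X2[None, :, :]) ** 2
    return sigma0_sq * np.exp(-0.5 * d2.sum(axis=-1) / length**2)

def gpr_predict(X, y, X_new, beta=0.0, jitter=1e-10):
    """Posterior mean and variance of a noise-free GPR model, Eq. (9.46)."""
    K = se_kernel(X, X) + jitter * np.eye(len(X))        # jitter for numerical stability
    k_new = se_kernel(X, X_new)                           # prior covariances kappa(X, x)
    alpha = np.linalg.solve(K, y - beta)
    mean = beta + k_new.T @ alpha                         # Eq. (9.46a)
    v = np.linalg.solve(K, k_new)
    var = np.diag(se_kernel(X_new, X_new) - k_new.T @ v)  # diagonal of Eq. (9.46b)
    return mean, var

# hypothetical 1D data
X = np.array([[-2.0], [-1.0], [0.5], [1.5], [2.5]])
y = np.sin(X).ravel()
m, v = gpr_predict(X, y, np.linspace(-3, 3, 7).reshape(-1, 1))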
Fig. 9.6 An example for illustration of the GPR model, where the left panel shows the posterior
features (including the posterior mean prediction and the 95% posterior confidence interval), and
the right panel displays ten random realizations of the posterior Gaussian process

An example of the noise-free GPR model trained with five points is schematically
shown in Fig. 9.6. As can be seen from the left panel, all the training points are exactly
located on the posterior mean prediction curve, and the corresponding posterior
variances of these points are exactly zero, indicating that it is an interpolation method.
Ten realizations of the posterior Gaussian process model shown in the right panel
are generated at many discretization points based on their posterior mean vector and
posterior covariance matrix. As shown, all the training points are located exactly on
each of the realizations.

9.4 Active Learning

Bayesian regression provides an alternative to the traditional grid-based
methods for solving various types of numerical analysis challenges, such as
multi-dimensional cubature, optimization, PDE/ODE solution, and structural reli-
ability assessment. This framework is termed as Bayesian Numerical Analysis
(Cockayne et al. 2019), and has attracted extensive attention in both machine learn-
ing and numerical computation communities. One of the most appealing features of
this framework is that the discretization error is viewed as epistemic uncertainty, and
summarized by the posterior probability distribution of the quantity of interest. The
posterior distribution then provides instructive information for the optimal design

of the grid points, by adding which to the training data set, the most reduction of
numerical error can be achieved. The development of an Acquisition Function (or
called learning function) plays a key role for all those active learning procedures as it
informs the “optimal point”. In this section, we take two Bayesian numerical tasks,
i.e. Bayesian cubature and Bayesian reliability assessment as examples to introduce
the Bayesian numerical analysis framework and the corresponding active learning
procedures.

9.4.1 Active Learning for Bayesian Cubature

The problem of concern is to estimate the following integral:



I = 𝔼[y(x)] = ∫ y(x) π(x) dx,    (9.48)

where π(x) is the weight density, hereinafter assumed to be independent standard
normal, and the integrand y(x) is assumed to be governed by a computationally
expensive simulator. The Bayesian cubature aims at inferring a posterior distribution
by attributing a probabilistic assumption on the integrand, and a GPR model-based
scheme is introduced as an example in the following subsection.

9.4.1.1 Bayesian Cubature

Let D = (X, Y) denote a set of integration points, where the ith row x_i of X indicates
the ith point of the input variable x, and Y is a column vector with the ith
element computed as y_i = y(x_i), e.g. by calling one or more expensive-to-estimate
simulators. Then, based on the GPR procedure introduced in Sect. 9.3.2,
a GPR model can be trained with D for approximating the integrand y(x) as
ŷ(x) ∼ GP(μ_y(x), c_y(x, x′)), where the posterior mean μ_y(x) and posterior
covariance c_y(x, x′) are formulated in Eq. (9.46). Then a random variable Î defined
as follows can be used for approximating I:

Î = 𝔼[ŷ(x)] = ∫ ŷ(x) π(x) dx,    (9.49)

where 𝔼[·] denotes the integral operator over π(x). The integral defined by Eq.
(9.49) is a linear projection of the Gaussian process ŷ(x), thus Î follows a Gaussian
distribution (Rasmussen and Ghahramani 2003). As for details, viewing ŷ (x) as an
infinite-dimensional Gaussian vector and π (x) as an infinite-dimensional determin-
istic vector, both indexed by x, the integral of Eq. (9.49) is indeed a linear combination
of infinite Gaussian variables.
With the above conclusion in mind, the posterior mean of Î can be derived as (Rasmussen and Ghahramani 2003; Briol et al. 2019):

μ_I = ∫ ( ∫ ŷ(x) π(x) dx ) f(ŷ) dŷ
    = ∫ ( ∫ ŷ(x) f(ŷ) dŷ ) π(x) dx
    = 𝔼[μ_y(x)]    (9.50)
    = 𝔼[m(x)] + 𝔼[κ(X, x)]ᵀ K⁻¹ (Y − m(X)),

based on the exchangeability of the two integral operators over the density f(ŷ) of
the GPR model ŷ(x) and the density π(x) of x. Similarly, the posterior variance of
Î can be formulated as:

σ_I² = ∫ ( ∫ ŷ(x) π(x) dx − μ_I )² f(ŷ) dŷ
     = ∫ ( ∫ (ŷ(x) − μ_y(x)) π(x) dx ) ( ∫ (ŷ(x′) − μ_y(x′)) π(x′) dx′ ) f(ŷ) dŷ
     = ∫∫ ( ∫ (ŷ(x) − μ_y(x)) (ŷ(x′) − μ_y(x′)) f(ŷ) dŷ ) π(x′) π(x) dx dx′
     = 𝔼𝔼′[c_y(x, x′)]    (9.51)
     = 𝔼𝔼′[κ(x, x′)] − 𝔼[κ(X, x)]ᵀ K⁻¹ 𝔼′[κ(X, x′)],

where 𝔼′[·] indicates the integral operator over π(x′), and 𝔼𝔼′[·] refers to the
double integral operator over π(x) and π(x′).
From Eqs. (9.50) and (9.51), to derive the closed-form expressions for μ_I and
σ_I², one needs to express 𝔼[m(x)], 𝔼[κ(X, x)], and 𝔼𝔼′[κ(x, x′)] in closed
forms. The analytical expression of 𝔼[m(x)] is commonly easy to derive, as m(x)
is commonly assumed to be of zero/constant/linear form. The availability of closed-form
expressions for 𝔼[κ(X, x)] and 𝔼𝔼′[κ(x, x′)] is determined by the form of
the kernel κ and the density π. One can refer to Ref. Briol et al. (2019) for a summary
of the pairs (κ, π) that lead to closed-form expressions for these two integrals. For
the squared exponential kernel given in Eq. (9.41) and normal density π(x), the
closed-form expressions for both integrals are applicable. For the independent standard
normal density, they are formulated as (Wei et al. 2020):

𝔼[κ(X, x)] = σ₀² |Λ⁻¹ + I|^(−1/2) exp( −(1/2) diag[ X (Λ + I)⁻¹ Xᵀ ] )    (9.52a)
𝔼𝔼′[κ(x, x′)] = σ₀² |2Λ⁻¹ + I|^(−1/2),    (9.52b)

where diag[·] refers to the operator of creating a column vector from the diagonal
elements of the argument.
With the above procedure, the integral I is then approximated by a Gaussian
variable Î ∼ N(μ_I, σ_I²), where the posterior mean μ_I provides a mean prediction
for I, and the posterior variance σ_I² summarizes the corresponding discretization
error due to the limited number of integration points. Unlike traditional grid-based
cubature methods such as Gauss-Hermite cubature and sparse grid cubature, the
Bayesian cubature formulas do not depend on the specific design of grids, and the
integration points can be designed in a much more flexible way, providing possibili-
ties for accelerating the convergence. In the following, an active learning procedure
will be introduced.
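Under the standard normal weight and the squared exponential kernel, Eqs. (9.50)-(9.52) can be coded in a few lines. The sketch below assumes a zero prior mean and fixed hyper-parameters (σ₀², Λ), all of which are illustrative choices rather than tuned values.

import numpy as np

def bq_posterior(X, y, sigma0_sq, lam):
    """Posterior mean and variance of the integral, Eqs. (9.50)-(9.52),
    for a zero prior mean, standard normal weight, and SE kernel with
    length-scale matrix Lambda = diag(lam)."""
    n, d = X.shape
    Lam = np.diag(lam)
    # kernel matrix K and kernel mean E[kappa(X, x)], Eq. (9.52a)
    diff = X[:, None, :] - X[None, :, :]
    K = sigma0_sq * np.exp(-0.5 * np.einsum('ijk,k,ijk->ij', diff, 1.0 / lam, diff))
    z = sigma0_sq * np.linalg.det(np.linalg.inv(Lam) + np.eye(d)) ** -0.5 \
        * np.exp(-0.5 * np.diag(X @ np.linalg.inv(Lam + np.eye(d)) @ X.T))
    # double kernel expectation, Eq. (9.52b)
    zz = sigma0_sq * np.linalg.det(2.0 * np.linalg.inv(Lam) + np.eye(d)) ** -0.5
    mu_I = z @ np.linalg.solve(K, y)                   # Eq. (9.50) with m(x) = 0
    var_I = zz - z @ np.linalg.solve(K, z)             # Eq. (9.51)
    return mu_I, var_I

# e.g. with X of shape (n, d) and y = y(X):
# mu, var = bq_posterior(X, y, sigma0_sq=1.0, lam=np.ones(d))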

9.4.1.2 Active Learning with Posterior Variance Contribution Function

As has been stated above, the posterior variance σ I2 summarizes the numerical error
of the mean estimate μ I , with the assumption that the integrand y(x) can be described
as a realization in the RKHS associated with the kernel κ. Without considering the
above assumption, the demand of accelerating the cubature convergence is equivalent
to designing the integration points such that the posterior variance can be reduced
quickly with the increment of points.
Motivated by the above idea, an acquisition function, called the Posterior Variance
Contribution (PVC) function, has been developed as follows (Wei et al. 2020):

L_PVC(x) = π(x) 𝔼′[c_y(x, x′)].    (9.53)

By definition, the PVC function value at x integrates the posterior correlation information
of ŷ(x) at x with those at all the other points across the integral domain.
Comparing Eq. (9.53) with Eq. (9.51), it can be found that σ_I² = ∫ L_PVC(x) dx. The
above two observations reveal that the PVC function value at x measures the con-
tribution of the prediction error of y(x) at x with the consideration of its correlation
across the integral space. The point with the highest PVC value can be identified,
and added to the training data set D to achieve the highest reduction of the posterior
variance of Î.
For the squared exponential kernel κ and Gaussian density π , the PVC function
admits a closed form as:
      
L_PVC(x) = π(x) · ( 𝔼′[κ(x, x′)] − κ(X, x)ᵀ K⁻¹ 𝔼′[κ(X, x′)] ),    (9.54)

where the analytical expression of 𝔼′[κ(X, x′)] is given by Eq. (9.52a), and that of
𝔼′[κ(x, x′)] is formulated as:

𝔼′[κ(x, x′)] = σ₀² |Λ⁻¹ + I|^(−1/2) exp( −(1/2) xᵀ (Λ + I)⁻¹ x ).    (9.55)
The PVC function usually shows multi-modal behavior, which means that multi-
ple peaks exist. It is then necessary to find the global maximum point. As the PVC
function admits a closed form and is computationally very cheap, an evolutionary
optimization algorithm such as particle swarm optimization (PSO) is recommended.
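A possible active learning loop built on the PVC function is sketched below; it reuses the hypothetical bq_posterior helper above, selects the next point by maximizing L_PVC(x) over a dense candidate grid instead of PSO, and uses an illustrative stopping rule.

import numpy as np
from scipy.stats import norm

def pvc_learning(y_func, X0, n_max=20, cov_tol=0.01, sigma0_sq=1.0, lam=None):
    """Adaptive 1D Bayesian cubature driven by the PVC function, Eq. (9.53)."""
    X = np.atleast_2d(X0).reshape(-1, 1)
    lam = np.ones(1) if lam is None else lam
    grid = np.linspace(-4, 4, 400).reshape(-1, 1)       # candidate pool for the argmax of PVC
    while True:
        y = y_func(X.ravel())
        mu_I, var_I = bq_posterior(X, y, sigma0_sq, lam)
        if np.sqrt(var_I) / abs(mu_I) < cov_tol or len(X) >= n_max:
            return mu_I, var_I, X
        # closed forms of Eqs. (9.54)-(9.55) specialized to d = 1
        K = sigma0_sq * np.exp(-0.5 * (X - X.T) ** 2 / lam[0])
        kXx = sigma0_sq * np.exp(-0.5 * (X - grid.T) ** 2 / lam[0])
        z = sigma0_sq * (1 / lam[0] + 1) ** -0.5 * np.exp(-0.5 * X.ravel() ** 2 / (lam[0] + 1))
        zx = sigma0_sq * (1 / lam[0] + 1) ** -0.5 * np.exp(-0.5 * grid.ravel() ** 2 / (lam[0] + 1))
        pvc = norm.pdf(grid.ravel()) * (zx - kXx.T @ np.linalg.solve(K, z))
        X = np.vstack([X, grid[np.argmax(pvc)]])        # add the point with the largest PVC value

# e.g. pvc_learning(lambda x: x**2 * np.sin(4 * x) + 1.0, X0=[-2.0, -0.5, 0.5, 2.0])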
We use a one-dimensional integral with integrand y(x) = x² sin(4x) + 1 and
standard Gaussian density as an example to illustrate the method. The active learning
process is initialized with four training integration points randomly drawn between
–2.5 and 2.5, and then the PVC function is utilized for the adaptive design of the
integration points. The active learning process stops if the posterior coefficient of
variation, defined as σ I /μ I , is less than 1%. With the above setting, the algorithm
adaptively produces twelve more points. The results including the posterior 95%
confidence intervals, the integration points, and the PVC functions, at sample size
n = 4, 7, 12, and 16, are shown in Fig. 9.7. The first column shows the results
generated with the four initial integration points, together with the global maximum
point identified by the PSO algorithm. This point is then added to the training data
set for step-by-step active learning. The multi-modal behavior of the PVC function
is also clearly shown in the second row of Fig. 9.7. The evolution details of the
posterior mean and posterior 95% confidence intervals of the integration Î are also
schematically shown in Fig. 9.8. As can be seen, equipped with the PVC function, the

Fig. 9.7 Evolution of the GPR model and PVC function against the training data size for the
one-dimensional integral example
Fig. 9.8 Evolution of the posterior confidence interval estimates against the training data size

posterior distribution of Î shrinks to the true value with a high rate, demonstrating
the effectiveness of the active learning scheme.

9.4.2 Active Learning for Bayesian Reliability Assessment

The structural reliability assessment problem is commonly formulated as:



p_f = ∫ I_F(x) π(x) dx,    (9.56)

where p_f denotes the probability of failure, and I_F(x) is the indicator function of the
failure domain, defined as:

I_F(x) = 1 if y(x) < 0, and I_F(x) = 0 otherwise.    (9.57)

In Eq. (9.57), y(x) is called the limit state function, with y(x) < 0 implying the failure
state of the structure, and y(x) ≥ 0 indicating the safe state. The surface defined
by y(x) = 0 is then called the failure surface or limit state surface, which separates the
input domain R^d into the failure domain F = {x : y(x) < 0} and the safe domain
S = {x : y(x) ≥ 0}. In practical applications, the limit state function y(x) is governed
by one or more expensive-to-estimate simulators (e.g. finite element model), and the
problem of reliability assessment is then formulated as estimating p f with given
accuracy and the least calls of y(x). This is usually challenging especially when the
probability of failure is extremely small (typically less than 10−6 ).
9.4.2.1 Bayesian Reliability Assessment

Although p_f is formulated as a multi-dimensional integral, it is not practical to
solve the problem with the Bayesian cubature method introduced in Sect. 9.4.1, as
imposing a Gaussian process on the discontinuous two-valued integrand I_F(x) is
irrational. During the past decades, quite a lot of numerical methods, such as Monte
Carlo simulation (MCS), subset simulation (Au and Beck 2001), and line sampling
(Schueller et al. 2004), have been developed for treating this challenge, but pursuing
methods with a better trade-off between accuracy and computational cost is still on
the way. Here we focus on the active learning procedure based on Bayesian regres-
sion models for efficient reliability assessment. The original developments of active
learning for estimating p f can be referred to Ref. Bichon et al. (2008) for a combi-
nation of the GPR model with the so-called Expected Feasibility Function (EFF)
(served as an acquisition function), and Ref. Echard et al. (2011) for a combina-
tion of the GPR model, MCS and the so-called U-function (served as an acquisition
function). The latter development is termed the AK-MCS method, where "A" refers to
“active learning” and “K” indicates “Kriging model” (i.e. GPR model). This scheme
has achieved extensive attention during the past decade since its development due to
its simplicity and effectiveness. The AK-MCS method can also be interpreted from
the perspective of Bayesian cubature, providing an insightful understanding, and also
a practical way, for evaluating the numerical error resulting from the prediction error
for the GPR model (Dang et al. 2021). We term this scheme as Bayesian Reliabil-
ity Assessment (BRA), and introduce the active learning procedure for it with the
U-function as an example.
The BRA starts with an (n × d)-dimensional sample matrix S of x, each row x_j^S of which is randomly drawn following π(x). The value of n is pre-specified based on the magnitude of p_f to ensure that the coefficient of variation of the resulting MCS estimator is small enough. For example, for a target coefficient of variation of less than 5%, n should be no smaller than 100/p_f. The initial training data set D = (X, Y) of size n_t (e.g. n_t = 12) is then obtained with X being randomly drawn from S without replacement, and the corresponding response values Y being computed by calling the limit state function, i.e. Y = y(X). Based on D, an initial GPR model ŷ(x) ∼ GP(μ_y(x), c_y(x, x')) can be trained as a surrogate of y(x). The induced surrogate model Î_F(x) for the indicator function I_F(x) is then a two-point stochastic process, and the induced random variable p̂_f is formulated as:

\hat{p}_f = \int_{\mathbb{R}^d} \hat{I}_F(x)\, \pi(x)\, \mathrm{d}x. \qquad (9.58)

Clearly, p̂_f does not follow a Gaussian distribution, but its posterior mean and variance can be derived as the mean prediction and a measure of the numerical error, respectively.
From the definition of the indicator function given in Eq. (9.57), the posterior mean μ_{I_F}(x) is formulated as:
 
\mu_{I_F}(x) = \Pr\left[\hat{y}(x) < 0\right] = \Phi\!\left(-\frac{\mu_y(x)}{\sigma_y(x)}\right), \qquad (9.59)

where Φ(·) indicates the cumulative distribution function (CDF) of the standard Gaussian distribution. The posterior covariance c_{I_F}(x, x') of Î_F can also be formulated exactly, but its computation can be cumbersome. Instead, an upper bound is derived based on the Cauchy-Schwarz inequality as (Dang et al. 2021):
   
c_{I_F}(x, x') \le \sigma_{I_F}(x)\, \sigma_{I_F}(x'), \qquad (9.60)

where σ²_{I_F}(x) refers to the posterior variance of Î_F(x) and is expressed as:

\sigma^2_{I_F}(x) = \Phi\!\left(-\frac{\mu_y(x)}{\sigma_y(x)}\right)\left[1 - \Phi\!\left(-\frac{\mu_y(x)}{\sigma_y(x)}\right)\right]. \qquad (9.61)

The posterior mean of p̂_f does not admit a closed form, but can be estimated with the sample matrix S as:

\mu_{p_f} = \mathbb{E}\!\left[\Phi\!\left(-\frac{\mu_y(x)}{\sigma_y(x)}\right)\right] \approx \frac{1}{n}\sum_{j=1}^{n} \Phi\!\left(-\frac{\mu_y(x_j)}{\sigma_y(x_j)}\right) \triangleq \hat{\mu}_{p_f}, \qquad (9.62)
   
where μ_y(x_j) and σ²_y(x_j) refer to the posterior mean and variance at the jth row x_j of S. Similarly, an MCS estimator of the exact posterior variance can also be derived based on the GPR predictions for S, but it is computationally cumbersome. Based on Eq. (9.60), an upper bound as well as the corresponding MCS estimator for the posterior variance of p̂_f are given by:

\sigma^2_{p_f} = \mathbb{E}\!\left[c_{I_F}(x, x')\right] \le \left(\mathbb{E}\!\left[\sigma_{I_F}(x)\right]\right)^2 \approx \left(\frac{1}{n}\sum_{j=1}^{n} \sqrt{\Phi\!\left(-\frac{\mu_y(x_j)}{\sigma_y(x_j)}\right)\left[1 - \Phi\!\left(-\frac{\mu_y(x_j)}{\sigma_y(x_j)}\right)\right]}\right)^{2} \triangleq \hat{\sigma}^2_{p_f}. \qquad (9.63)

Neglecting the statistical error due to the limited number of MC samples, μ̂_{p_f} computed by Eq. (9.62) provides a mean estimate of p_f, and σ̂²_{p_f} summarizes the corresponding prediction error.
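To make Eqs. (9.62) and (9.63) concrete, the following is a minimal sketch (not taken from the cited references) of how the MCS estimate of the posterior mean and the upper bound on the posterior standard deviation could be computed from GPR predictions; the GPR call shown in the usage comment is an assumed scikit-learn interface.

```python
import numpy as np
from scipy.stats import norm

def bra_posterior_estimates(mu_y, sigma_y):
    """MCS estimate of the posterior mean (Eq. 9.62) and of the upper bound on the
    posterior standard deviation (Eq. 9.63) of the failure probability, given GPR
    posterior means mu_y and standard deviations sigma_y at the n rows of S."""
    prob_fail = norm.cdf(-mu_y / sigma_y)                          # Phi(-mu_y/sigma_y)
    mu_pf = prob_fail.mean()                                       # Eq. (9.62)
    sigma_pf_ub = np.sqrt(prob_fail * (1.0 - prob_fail)).mean()    # mean of sigma_IF(x)
    return mu_pf, sigma_pf_ub                                      # sigma_pf_ub**2 bounds the variance (Eq. 9.63)

# Hypothetical usage with a trained GPR surrogate (e.g. scikit-learn):
# mu_y, sigma_y = gpr.predict(S, return_std=True)
# mu_pf, sigma_pf_ub = bra_posterior_estimates(mu_y, sigma_y)
```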

9.4.2.2 Active Learning with U-Function

The key to active learning for BRA is to propose an effective acquisition function,
with which the design points can be adaptively selected from S. This acquisition
function should be designed to reduce the prediction error of p̂_f at a high rate with respect to the training data size n_t. Since the introduction of the AK-MCS method, many acquisition functions have been developed for this purpose, but here only the U-function will be introduced.
Given a GPR model ŷ(x), the U-function for an arbitrary point x is defined as:

U(x) = \frac{\left|\mu_y(x)\right|}{\sigma_y(x)}. \qquad (9.64)

The U-function can be explained as follows. Given μ y (x) < 0, the probability of
ŷ(x) being positive is:
 
 
\Pr\left[\hat{y}(x) > 0 \mid \mu_y(x) < 0\right] = \Phi\!\left(\frac{\mu_y(x)}{\sigma_y(x)}\right)\bigg|_{\mu_y(x) < 0} = \Phi\left(-U(x)\right). \qquad (9.65)

On the contrary, given μ y (x) > 0, the probability of ŷ(x) being negative is:
 
\Pr\left[\hat{y}(x) < 0 \mid \mu_y(x) > 0\right] = \Phi\!\left(-\frac{\mu_y(x)}{\sigma_y(x)}\right)\bigg|_{\mu_y(x) > 0} = \Phi\left(-U(x)\right). \qquad (9.66)

Therefore, in either case, Φ(−U(x)) measures the probability that the sign of y(x) is misjudged by the posterior mean μ_y(x). Then, by adding the point in S with the highest value of Φ(−U(x)), or equivalently the lowest value of the U-function, to the training data set D, the prediction accuracy for p_f is expected to improve the most.
When the upper bound σ̂_{p_f}/μ̂_{p_f} of the posterior coefficient of variation of p_f is less than a threshold, e.g. 5%, the active learning process can be stopped; the estimator in Eq. (9.62) can then be used for predicting p_f with high accuracy, and the estimator in Eq. (9.63) can be used to summarize an upper bound of the prediction error. Note that the above conclusions apply only when the sample size n is large enough. The sample matrix S can also be adaptively enlarged during the active learning process based on the coefficient of variation of the MCS estimator given in Eq. (9.62).
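As an illustration, the following is a minimal sketch of the U-function-based active learning loop described above. It assumes a scikit-learn GaussianProcessRegressor as the GPR model, a user-supplied limit state function y_func, and the helper bra_posterior_estimates from the previous sketch; it is not the authors' implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def ak_mcs(y_func, S, X0, cov_target=0.05, max_iter=200):
    """U-function based active learning for failure probability estimation.
    S  : (n, d) Monte Carlo sample matrix drawn from pi(x)
    X0 : (n_t, d) initial training inputs drawn from S without replacement."""
    X, Y = X0.copy(), np.array([y_func(x) for x in X0])
    for _ in range(max_iter):
        gpr = GaussianProcessRegressor(normalize_y=True).fit(X, Y)
        mu, sigma = gpr.predict(S, return_std=True)
        sigma = np.maximum(sigma, 1e-12)                 # avoid division by zero
        mu_pf, sigma_pf_ub = bra_posterior_estimates(mu, sigma)
        if sigma_pf_ub / max(mu_pf, 1e-12) < cov_target:
            break                                         # stopping criterion: cov < 5%
        j = np.argmin(np.abs(mu) / sigma)                 # lowest U-value, Eq. (9.64)
        X = np.vstack([X, S[j]])
        Y = np.append(Y, y_func(S[j]))
    return mu_pf, sigma_pf_ub, X, Y
```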
The above active learning algorithm for BRA is illustrated with an academic
example. The limit state function is expressed as
y(x_1, x_2) = \sum_{i=1}^{4} c_i \exp\left[-\alpha_{i1}(x_1 - \beta_{i1})^2 - \alpha_{i2}(x_2 - \beta_{i2})^2\right] + 0.8

with \alpha = \begin{pmatrix} 2 & 3 & 1 & 4 \\ 3 & 2 & 4 & 1 \end{pmatrix}, \beta = \begin{pmatrix} -0.5 & 0.5 & -0.5 & 0.5 \\ -0.5 & -0.5 & 0.5 & 0.5 \end{pmatrix}, c = \begin{pmatrix} 1 & -1.5 & -1.5 & 2 \end{pmatrix}, and
both x_1 and x_2 follow the standard Gaussian distribution. The active learning procedure is initialized with n = 10^4 samples in S and n_t = 12 initial training samples in D, and the training process is shown in Fig. 9.9. The algorithm adaptively adds 61 more points before reaching the convergence criterion σ̂_{p_f}/μ̂_{p_f} < 5%. Thus, the total number of limit state function calls is 73. The posterior mean estimate μ̂_{p_f} is 0.0709, with the upper bound σ̂_{p_f}/μ̂_{p_f} being 4.89%. Compared with the reference

Fig. 9.9 An example of the active learning procedure for BRA

solution estimated by MCS with 10^6 samples, which is 0.0723 with a coefficient of variation of 0.36%, the result generated by active learning is accurate and robust.
Note that the above MCS-based active learning procedure only applies to relatively large failure probabilities (e.g. p_f > 10^{-3}), since for a smaller probability of failure a larger sample size n is required, making the selection of each training point computationally inefficient. To tackle this challenge, active learning procedures combined with advanced MCS schemes have been developed. One can refer to Wei et al. (2019) for a combination of active learning with subset simulation, and to Song et al. (2021) for a combination of active learning with line sampling, both of which apply to extremely small probabilities of failure (e.g. 10^{-9}).

9.5 Conclusions

This chapter has investigated and compared a set of commonly used regression
models and methods for machine learning in engineering computations. Three non-
Bayesian models, i.e. the LSR, the RR for improving the generalization performance of the LSR, and the SVR, were examined first. These three models are all based on defining the hypothesis set in a parametric form, and then searching for the parameter values by minimizing specific empirical loss functions defined on the training data. The difference among these methods lies mainly in the definition of the loss functions. An advantage of these three parametric methods is that they admit a natural extension via the kernel trick. This is of special importance as it allows the regression model to capture more, or even infinitely many, features of the training data with a computational complexity linearly dependent on the training data size.

In contrast to the non-Bayesian models, which search for an optimal element of the hypothesis set, the Bayesian regression models are trained by assigning a (subjective) probability distribution over the hypothesis set based on the beliefs learned from the training data. This enables the prediction at any unobserved point to be modeled by a subjective probability distribution, with its variance summarizing the prediction error. It is also shown that, under a specific setting, the posterior mean prediction of a Bayesian regression model may coincide with a non-Bayesian regression model.
The probabilistic feature of the Bayesian regression models makes it possible to devise Bayesian numerical methods for addressing numerical analysis tasks, and Bayesian cubature and Bayesian reliability assessment have been examined as examples. The common feature is that, in addition to the mean predictions of the quantities of interest, the associated numerical errors are treated as epistemic uncertainty and summarized by the posterior variances. Active learning procedures have then been introduced for both Bayesian cubature and Bayesian reliability assessment, and shown to be effective in reducing the number of simulator calls required to achieve a specific accuracy level. Other Bayesian numerical algorithms, such as Bayesian optimization (Huang et al. 2006) and Bayesian ODE/PDE solvers (Wang et al. 2021; Chen et al. 2021), are not investigated due to limited space, but the reader can refer to the corresponding references for these cutting-edge topics.

References

Au SK, Beck JL (2001) Probabilistic Eng Mech 16(4):263


Bichon BJ, Eldred MS, Swiler LP, Mahadevan S, McFarland JM (2008) AIAA J 46(10):2459
Briol FX, Oates CJ, Girolami M, Osborne MA, Sejdinovic D et al (2019) Stat Sci 34(1):1
Chen Y, Hosseini B, Owhadi H, Stuart AM (2021) J Comput Phys 447:110668
Cockayne J, Oates CJ, Sullivan TJ, Girolami M (2019) SIAM Rev 61(4):756
Dang C, Wei P, Song J, Beer M (2021) ASCE-ASME J Risk Uncertain Eng Systems Part A: Civ
Eng 7(4):4021054
Echard B, Gayton N, Lemaire M (2011) Struct Saf 33(2):145
Hennig P, Osborne MA, Kersting HP (2022) Probabilistic numerics. Cambridge University Press
Huang D, Allen TT, Notz WI, Zeng N (2006) J Glob Optim 34(3):441
Johnson RA, Wichern DW (2007) Applied multivariate statistical analysis, 6th edn
Mohri M, Rostamizadeh A, Talwalkar A (2018) Foundations of machine learning, 2nd edn. MIT Press
Nocedal J, Wright SJ (2006) Numerical optimization, 2nd edn. Springer
Perdikaris P, Raissi M, Damianou A, Lawrence ND, Karniadakis GE (2017) Proc R Soc A: Math
Phys Eng Sci 473(2198):20160751
Rasmussen CE, Ghahramani Z (2003) Adv Neural Inf Process Syst 15:505
Rasmussen CE, Williams CKI (2006) Gaussian processes for machine learning. MIT Press
Schueller GI, Pradlwarter HJ, Koutsourelakis PS (2004) Probabilistic Eng Mech 19(4):463
Song J, Wei P, Valdebenito M, Beer M (2021) Mech Syst Signal Process 147:107113
Theodoridis S (2015) Machine learning: a Bayesian and optimization perspective. Academic Press
Wang J, Cockayne J, Chkrebtii OA, Sullivan TJ, Oates CJ (2021) Stat Comput 31(5):1
Wei P, Tang C, Yang Y (2019) Proc Inst Mech Engineers Part O: J Risk Reliab 233(6):943
Wei P, Zhang X, Beer M (2020) Comput Methods Appl Mech Eng 365:113035
Chapter 10
Overview on Machine Learning Assisted
Topology Optimization Methodologies

Ilias Chamatidis, Manos Stoumpos, George Kazakis, Nikos Ath. Kallioras,


Savvas Triantafyllou, Vagelis Plevris, and Nikos D. Lagaros

10.1 Introduction

The past two decades saw tremendous developments in artificial intelligence (AI).
Advancements in software, algorithms, and hardware led to the development of
significantly more accurate and versatile artificial intelligence models. This rendered
artificial intelligence a powerful tool that is used in diverse scientific areas, e.g.
medicine and drug design, economics, and self-driving cars, among many others.
These methods, having been successfully implemented in the simulation and
modeling of structures (Lu et al. 2022; Solorzano and Plevris 2022), found their
way to topology optimization problems, where artificial intelligence appears to have
great potential for successful implementation.
In conventional topology optimization, the optimal design of a specific domain
must be calculated subject to specific constraints and the objective is to minimize
the total compliance of the structure and use a specific amount of material. This
is typically an iterative process that involves large matrices and can be very time-
consuming. By means of artificial intelligence models, referred to also as surrogate
models (or surrogates), the computing time can be reduced significantly. The surro-
gate model is apriori trained offline. Following, during the optimization process
the model is inferred based on input data, which is a lot faster due to limited matrix
multiplications that the surrogate performs. The usual process involves either an arti-
ficial intelligence surrogate that complements the conventional procedure to reduce
computational costs or a standalone surrogate which calculates the whole optimized

I. Chamatidis · M. Stoumpos · G. Kazakis · N. Ath. Kallioras · S. Triantafyllou ·


N. D. Lagaros (B)
National Technical University of Athens, Athens, Greece
e-mail: [email protected]
V. Plevris
Qatar University, Doha, Qatar

© The Author(s), under exclusive license to Springer Nature Switzerland AG 2023
T. Rabczuk and K. J. Bathe (eds.), Machine Learning in Modeling and Simulation,
Computational Methods in Engineering & the Sciences,
https://doi.org/10.1007/978-3-031-36644-4_10

structures by itself. The AI surrogates that are used belong to two main categories, i.e. surrogates that use density and surrogates that use images.
The surrogates that use density have inputs similar to those of the conventional method, since the optimization process uses the density of the structure, which is updated in each iteration of the AI model. The surrogates that perform optimization on images are somewhat different, because they use techniques like image segmentation and filtering to output the optimized image (structure), which is then mapped to densities. Most surrogates can be used for 2D and 3D structures and they are transferable, meaning that once trained they can be used in another topology optimization problem (e.g. thermodynamics or a different material).
The Background section contains an introduction to artificial intelligence, the
surrogate models that will be used and an introduction to conventional topology
optimization. The Literature Survey section provides a review of recent advancements
of topology optimization using artificial intelligence models. This section is divided
into two parts, the first describing the models that use density and the second the
models that use image-based approaches.

10.2 Background

10.2.1 Topology Optimization

Topology optimization is the process of finding the optimal design of a structure.


The term optimal refers to a structure which has the same structural properties as
the initial one, serves the same purposes, but uses less material. That process is very
important because a structure with reduced material demand is easier to implement
and less expensive to construct. However, the process of topologically optimizing a
structure is an arduous task; it is an iterative process that requires many calculations,
and it is very time-consuming. The mathematical formulation of the problem is to
find the optimal material distribution while keeping the compliance minimum. The
most popular method used in topology optimization formulation is the so-called
power-law approach introduced by Bendsøe (1989) and later suggested by Zhou
and Rozvany (1991) and Mlejnek (1992). This approach is based on SIMP (Solid
Isotropic Material with Penalization) or the modified SIMP (Bendsøe and Sigmund
2004) where each element e is assigned a density xe which determines its Young’s
modulus E e according to the following expression:

E_e(x_e) = E_{\min} + x_e^{p}\,(E_0 - E_{\min}), \quad x_e \in [0, 1] \qquad (10.1)

where E_0 is the original value of Young's modulus, E_min is a very small positive value used to avoid the stiffness matrix becoming singular, and p is a penalization factor that penalizes intermediate densities, steering the design toward a solid-void (0-1) layout. The optimization problem is formulated as follows:

\begin{aligned}
\min_{x} :\; & c(x) = U^T K U = \sum_{e=1}^{N} E_e(x_e)\, u_e^T k_0 u_e \\
\text{subject to} :\; & V(x)/V_0 = f \\
& K U = F \\
& 0 \le x \le 1
\end{aligned} \qquad (10.2)

where c is the compliance that needs to be minimized, U and F are the global
displacement and force vectors, respectively, K is the global stiffness matrix, u e is
the element displacement vector, k0 is the element stiffness matrix for an element
with unit Young’s modulus, x is the vector of design variables, N is the number of
variables, V (x) and V0 are the material volume and design domain volume, respec-
tively, and f is the volume fraction that is chosen beforehand. The updates of the
new densities are based on the optimality criteria:

x_e^{\text{new}} = \begin{cases} \max(0, x_e - m) & \text{if } x_e B_e^{\eta} \le \max(0, x_e - m) \\ \min(1, x_e + m) & \text{if } x_e B_e^{\eta} \ge \min(1, x_e + m) \\ x_e B_e^{\eta} & \text{otherwise} \end{cases} \qquad (10.3)

where m is a positive move limit and B_e is obtained from the optimality condition:

B_e = \frac{-\,\partial c / \partial x_e}{\lambda\, \partial V / \partial x_e}, \qquad (10.4)

where the Lagrangian multiplier λ is chosen so that the volume constraints are
satisfied. The sensitivities of the parameters c and V in terms of xe are

\frac{\partial c}{\partial x_e} = -p\, x_e^{p-1}(E_0 - E_{\min})\, u_e^T k_0 u_e, \qquad \frac{\partial V}{\partial x_e} = 1 \qquad (10.5)

The last step of the process is to make sure that the densities do not produce undesirable artifacts such as checkerboard patterns. For this reason, a filter is applied.
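For illustration, the following is a minimal sketch of the optimality-criteria density update of Eqs. (10.3)-(10.5), in the spirit of the well-known educational SIMP codes; the sensitivities dc are assumed to come from a finite element analysis, and the bisection bounds and tolerances are illustrative assumptions rather than the exact values used in the literature.

```python
import numpy as np

def oc_update(x, dc, dv, volfrac, move=0.2, eta=0.5):
    """Optimality-criteria update of element densities (Eqs. 10.3-10.5).
    x       : current element densities in [0, 1]
    dc, dv  : sensitivities of compliance and volume w.r.t. x (dv = 1 for SIMP)
    volfrac : prescribed volume fraction f (uniform element volumes assumed)."""
    l1, l2 = 1e-9, 1e9
    while (l2 - l1) / (l1 + l2) > 1e-3:            # bisection on the Lagrange multiplier
        lam = 0.5 * (l1 + l2)
        Be = -dc / (lam * dv)                      # Eq. (10.4)
        x_new = np.clip(x * Be**eta,               # Eq. (10.3) with move limit m
                        np.maximum(0.0, x - move),
                        np.minimum(1.0, x + move))
        if x_new.mean() > volfrac:                 # enforce V(x)/V0 = f
            l1 = lam
        else:
            l2 = lam
    return x_new
```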

10.2.2 Artificial Intelligence and Neural Networks

Artificial intelligence is a broad area that involves many scientific fields, such as mathematics and computer science, and encompasses a large number of algorithms that can "learn" from examples in order to solve a problem. By examining many different data points, these algorithms can approximate the underlying function that describes the problem. Thus, after the learning process has been completed, when a new example is presented, they can predict a result based on the previous training. There are three main categories that describe these algorithms:
• Supervised learning: In this kind of learning the samples that the model uses for
training have labels, meaning that after the model predicts an output based on the
input it received, it has the labels of the “ground truth” value of the sample and then
it compares the “ground truth” value with the one that it predicted and corrects
itself accordingly. Some of the most popular supervised learning models used are
neural networks, support vector machines, and k-nearest neighbor. Especially in
the case of neural networks, the last decades have seen great advancements by
using deep neural networks which utilize many hidden layers and a very large
number of nodes.
• Unsupervised learning: This can be considered as the opposite of supervised
learning. Here the data are unlabeled, and the model draws conclusions based
on the statistical properties of the data, e.g. the relevant clusters and distances.
The most popular models for unsupervised learning are K-means, self-organizing
maps, and principal component analysis. The main difference between unsuper-
vised and supervised learning is that in unsupervised learning there are no labels
(“ground truth” values) used during the training of the model.
• Reinforcement learning: This is a different method of learning, which does not
have many applications in topology optimization. Central to this method is the
notion of the agent (robot) that learns an optimal behavior by interacting with its
environment and changing its behavior accordingly.
In the topology optimization literature, the process is carried out, most of the time, using deep neural networks. Neural networks are universal function approximators (Hornik et al. 1989). Hence, a properly trained artificial neural network can approximate the function that underlies the topology optimization problem.
Each node of the neural network performs an operation between the input and the weights and biases of the model. After the input has passed through all the layers, the error is calculated, and by backpropagation the weights are corrected to minimize the prediction error. This
process is iterative, where each iteration in which the model sees the entire dataset is
termed an epoch. The number of epochs that are chosen depends on the problem, the
amount of data, and the type of the model. The process of training a neural network
consists of three steps. The first step is to split the data into the training set, the
validation set, and the testing set. This split must be uniformly distributed, and the
class representation must be the same in all three sets. The second step is to train the
neural network using the training set and use the validation set periodically during the
training to test that the model doesn’t overfit. Overfitting is an undesirable outcome
of the training, where the neural network keeps reducing the error associated with
the training data, while the error associated with other data (validation data, testing
data, or others) increases, which means that the network overfocuses on the specific
training data and has lost its generalization capabilities. The last step is to use the
testing set, which the model has not seen before, to measure how well the network
performs when confronted with new data. Another issue that must be solved during
the training is defining the architecture of the neural network, which has to do with the

number of hidden layers and the number of node in each hidden layer. If the number
of layers/nodes is too small, the model will not have enough learning capacity to
approximate the function properly, while if the number is too large the model will
overfit the dataset and it will not perform well on the test set (low variance-high bias
model).
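As a simple illustration of the three-step procedure described above (split, train with validation, test), the sketch below uses scikit-learn; the synthetic regression data and the two-layer architecture are arbitrary placeholders, not recommendations.

```python
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor

# Placeholder dataset standing in for a topology optimization training set
X, y = make_regression(n_samples=1000, n_features=8, noise=0.1, random_state=0)

# Step 1: split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Step 2: train with a held-out validation fraction; early stopping halts training
# when the validation error stops improving, which guards against overfitting
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000,
                     early_stopping=True, validation_fraction=0.1)
model.fit(X_train, y_train)

# Step 3: evaluate generalization on data the model has never "seen"
print("test R^2:", model.score(X_test, y_test))
```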
Another variant of deep neural networks that is often used in topology optimization is the convolutional neural network (CNN). These models take as their input an image (either 2D or 3D) instead of a simple 1D vector. Then, in each node, instead of the usual multiplication between the weights and the input, a convolution is performed using a small square kernel:

(f * g)(t) = \int_{-\infty}^{\infty} f(\tau)\, g(t - \tau)\, \mathrm{d}\tau \qquad (10.6)

The advantage of these models is that they can capture localized relations in the image, because they use information from a neighborhood instead of a single value.
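To make the discrete 2D analogue of Eq. (10.6) used inside convolutional layers concrete, the following minimal sketch applies a small square kernel to a single-channel image (strictly a cross-correlation, as implemented in most deep learning libraries); the 40 × 40 field and the Laplacian-like kernel are arbitrary illustrative choices.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation of a single-channel image with a square kernel,
    the discrete analogue of Eq. (10.6) performed in convolutional layers."""
    kh, kw = kernel.shape
    H, W = image.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.random.rand(40, 40)                                        # e.g. a 40 x 40 density field
kernel = np.array([[0., -1., 0.], [-1., 4., -1.], [0., -1., 0.]])     # Laplacian-like filter
feature_map = conv2d(image, kernel)                                   # localized features of the image
```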

10.3 Literature Survey

10.3.1 Density-Based Methods

Patel and Choi (2012) harness the power of Probabilistic Neural Networks (PNN)
and develop an optimization methodology that treats probabilistic constraints under
uncertainty. Probabilistic Neural Networks (PNN) rely on Bayesian inference (Clarke
1974) to make decisions, and the Parzen nonparametric estimator (Parzen 1962) for
the estimation of the probability density functions. The three main benefits of using
a probabilistic approach are the following: (i) Easy interpretation of the results,
(ii) Efficiency in treating nonlinear structures or disjoint failures, and (iii) Useful in
treating uncertainty. The described network is both easy to implement and to interpret.
Its training strategy relies on reducing the expected risk of each class (failure or not).
For example, suppose that ϑ belongs either to class ϑ_A or ϑ_B, and the data vector is p-dimensional, X^T = [X_1, X_2, ..., X_p]; then the Bayes decision rule is

d(x) = \begin{cases} \vartheta_A & \text{if } h_A l_A f_A(X) > h_B l_B f_B(X) \\ \vartheta_B & \text{if } h_A l_A f_A(X) < h_B l_B f_B(X) \end{cases} \qquad (10.7)

where f_A(X) and f_B(X) are the probability density functions (PDF) for the two classes A and B, and l_A is the loss associated with the decision d(x) = ϑ_B when ϑ = ϑ_A. Also, h_A is the a priori probability of occurrence of class A and h_B = 1 − h_A. The loss function used during the training is

l = (h A − θ A )2 + (h B + θ B )2 (10.8)

The architecture of the probabilistic neural network consists of four layers: (i)
the input layer, (ii) the pattern layer, (iii) the summation layer, and (iv) the output
layer. The output of the pattern layer using the Parzen window nonlinear function
forms the PDF. A reliability analysis can be incorporated into the deterministic
topology optimization method. This process is called Reliability-Based Topology
Optimization (RBTO) and employs a probabilistic constraint such that

\begin{aligned}
\max / \min_{b} :\; & f(b) \\
\text{subject to} :\; & P_j\left[g_j(b, x) < 0\right] \le R_{R_j} \\
& \sum_{i=1}^{N} A_i L_i - V^* \le 0 \\
& K \cdot u = F \\
& b^l \le b \le b^u
\end{aligned} \qquad (10.9)

where f (.) represents the objective function, g j (.) the limit-state function, b is the
vector of deterministic design variables, and x is a random vector. P j denotes the
probability of the event, and the probability of failure can be expressed as P j [g j (.) <
0], Ai is the cross-sectional area of the elements and L i is the length of the particular
element, V ∗ denotes the volume of the material that can be used in the final design,
A^l and A^u are the lower and upper bounds of the cross-sectional area of the elements,
K is the global stiffness matrix, u is the global displacement vector, and F is the
nodal load vector. The PNN consists of two blocks, where the first block performs
the topology optimization and the second block the reliability analysis. The force,
boundary conditions, and the final volume V ∗ are used as an input. The PNN is used
to reduce the instances where the Finite Element Analysis routine is invoked, thus
reducing the computational cost of the entire process.
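The Bayes decision rule of Eq. (10.7) with Parzen-window density estimates can be sketched as follows. This is a hedged illustration using scikit-learn's KernelDensity as the Parzen estimator rather than the authors' PNN implementation; the prior h_A, the losses, and the bandwidth are illustrative assumptions.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def pnn_decide(X_train_A, X_train_B, X_query, h_A=0.5, l_A=1.0, l_B=1.0, bandwidth=0.3):
    """Bayes decision rule of Eq. (10.7): choose class A (e.g. failure) if
    h_A * l_A * f_A(x) > h_B * l_B * f_B(x), with Parzen-window PDF estimates."""
    h_B = 1.0 - h_A
    f_A = KernelDensity(bandwidth=bandwidth).fit(X_train_A)   # Parzen estimate of f_A
    f_B = KernelDensity(bandwidth=bandwidth).fit(X_train_B)   # Parzen estimate of f_B
    score_A = h_A * l_A * np.exp(f_A.score_samples(X_query))  # score_samples returns log-density
    score_B = h_B * l_B * np.exp(f_B.score_samples(X_query))
    return np.where(score_A > score_B, "A", "B")
```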
In Liu et al. (2015), a nonlinear multi-material topology optimization is developed
using unsupervised machine learning algorithms. Unsupervised algorithms construct
clusters of data based on their similarity. The model takes as an input the normal-
ized material parameter xe , where 0 ≤ xe ≤ 1. The K-means algorithm provides
the initial design of the structure. The final optimization design is obtained with
a metamodel-based multi-objective optimization strategy. This strategy consists of
five steps: Sampling, Simulation, Metamodel fit, Optimization, and Point selection for
the metamodel update. Sampling and simulation consist of choosing various design
experiments and functions evaluated on those designs required to fit the metamodel.
The fit of the metamodel pertains to fitting all known functions to approximate the
design responses that have not yet been evaluated. The model that is used in this stage
is the Kriging metamodel with a linear regression kernel and spherical correlation.
The last stage of the optimization step uses a multi-objective genetic algorithm to

find the Pareto front. The Pareto front consists of solutions whose objectives are not
dominated by other solutions. The algorithm continues until the difference between
the solution of the model and the “ground truth” value is acceptable. The solution of
the model was compared with solutions from the SIMP optimization algorithm. The
proposed metamodel-based method achieves designs highly similar to those obtained with SIMP, at a reduced computational cost.
Lei et al. (2018) develop a real-time topology optimization procedure using
machine learning. The method proposed is based on the Moving Morphable Compo-
nent (MMC), where a set of morphable components are used as the basic building
blocks. The optimization process consists of morphing, merging, and overlap-
ping operations on those elements to achieve the final structure. The machine
learning problem is formulated as follows: Suppose there is a set of parameters
p = (p_1, ..., p_{np})^T denoting the location of the concentrated load, and let D^{opt} be the vector of optimal designs, approximated as a linear combination of eigenvectors v_1 ∈ ℝ^{7n}, ..., v_M ∈ ℝ^{7n}, where n is the number of components. There are seven design variables for each component. Hence, D^{opt} can be expressed as

D^{opt}(p) = \sum_{i=1}^{M} w_i(p)\, v_i \qquad (10.10)

where w_i(p), i = 1, ..., M, is a set of weights depending on p, and M is the number of eigenvectors that represent D^{opt}, with M << 7n to achieve a substantial dimensionality reduction. Assume that K vectors of optimal designs D_1^{opt}, ..., D_K^{opt}, corresponding to K vectors of parameters p^1 = (p_1^1, ..., p_{np}^1)^T, ..., p^K = (p_1^K, ..., p_{np}^K)^T, are obtained from direct optimization. By resampling the above set, a larger set is constructed and (p^1, ..., p^K) is expanded to (p^1, ..., p^L) with L >> K. Furthermore, the matrix Y^T Y of size 7n × 7n is formed, where Y^T = (D_1^{opt}, ..., D_L^{opt}) ∈ ℝ^{7n×L} and D_i^{opt} denotes the optimal design variables for p^i in the expanded set (p^1, ..., p^L). Then, the eigenvectors v_1 ∈ ℝ^{7n}, ..., v_M ∈ ℝ^{7n} can be obtained by solving the eigenvalue problem (Y^T Y)v = λv, i.e. a standard principal component analysis (PCA). From these, the first M eigenvectors are selected. Subsequently, Y^T = V W^T, where V = (v_1, ..., v_M) ∈ ℝ^{7n×M} and W^T = (w_1, ..., w_L) ∈ ℝ^{M×L} with w_i = (w_i^1, ..., w_i^M)^T ∈ ℝ^M, i = 1, ..., L. The following relationship is defined:

p^1 → w_1 = (w_1^1, ..., w_1^M)^T, \; ..., \; p^L → w_L = (w_L^1, ..., w_L^M)^T \qquad (10.11)

This mapping can be approximated by any nonlinear regressor, i.e. a regressor is trained to approximate the mapping between p ∈ ℝ^{np} and D^{opt} ∈ ℝ^{7n}. The machine learning algorithms used are the Support Vector Regressor (SVR) and KNN (k-Nearest Neighbors). This dimensionality reduction, and the fact that training is performed offline, allow the optimization procedure to be performed in real time. Furthermore, the reason that the MMC method was selected has to do with the number of components n required to describe a material distribution, which is of the order O(10^2) both in 2D and 3D; conversely, the SIMP method would require a number of pixels m of the order O(10^6 - 10^7) to achieve a similarly high-resolution structural layout. The results show that SVR performs better than KNN. Both algorithms output different structures than the conventionally optimized structure, especially in complex areas.
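A hedged sketch of the offline part of this workflow (PCA on collected optimal designs followed by a regressor from load parameters to PCA weights) is given below; the arrays P and D_opt are random placeholders standing in for the L parameter vectors and corresponding optimal design vectors, and MultiOutputRegressor with SVR is one possible implementation choice rather than the authors' exact setup.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.svm import SVR
from sklearn.multioutput import MultiOutputRegressor

# P     : (L, np) matrix of load parameters p^i
# D_opt : (L, 7n) matrix of optimal MMC design vectors D_i^opt
L_samples, n_params = 500, 2
P = np.random.rand(L_samples, n_params)               # placeholder data
D_opt = np.random.rand(L_samples, 7 * 100)            # placeholder data (n = 100 components)

pca = PCA(n_components=20).fit(D_opt)                 # eigenvectors v_1..v_M of Eq. (10.10)
W = pca.transform(D_opt)                              # weights w_i for each sample, Eq. (10.11)

reg = MultiOutputRegressor(SVR()).fit(P, W)           # nonlinear map p -> w (offline training)

p_new = np.array([[0.3, 0.7]])                        # online stage: new load location
D_pred = pca.inverse_transform(reg.predict(p_new))    # near real-time prediction of the design
```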
In Kallioras et al. (2020), deep belief networks are used to produce a higher-level representation and to learn a mapping between the input and the output of
the structure. Deep Belief Networks (DBN) use stochastic variables to find high-level
correlations in the training data called feature detectors. These feature detectors and
input data used with Restricted Boltzmann Machines (RBM) form a Deep Belief
network. The architecture of the RBM consists of two layers, the input layer and the
hidden layer. The hidden layer is the so-called feature detector layer where each node
of that layer is only connected with the input layer. Similarly to the cost functions
that classical neural networks use, RBM use an energy function:

E(v, h) = -\sum_{i=1}^{i_{\max}} \alpha_i v_i - \sum_{j=1}^{j_{\max}} b_j h_j - \sum_{i=1}^{i_{\max}} \sum_{j=1}^{j_{\max}} v_i h_j w_{ij} \qquad (10.12)

where v_i and α_i are the state and bias of the ith visible unit, h_j and b_j are the state and bias of the jth hidden unit, and w_ij is the weight coefficient of the connection between those units. The state of the network with the lowest energy is the one with the highest probability:

p(v, h) = \frac{1}{Z} e^{-E(v, h)} \qquad (10.13)

where Z is the partition function, obtained by summing over all configurations of the visible and hidden units:

Z = \sum_{v} \sum_{h} e^{-E(v, h)} \qquad (10.14)

The difference between restricted and normal Boltzmann Machines is that in the
restricted ones there are no connections between the hidden units. A Deep Belief
Network (DBN) is eventually created by combining multiple RBM. The hidden
layer of one RBM is the visible layer (input layer) of the next RBM. The training
of the whole model involves two steps: first each RBM is trained individually using
unsupervised learning and then the whole model is trained using supervised learning.
The proposed method outputs a density value for each point and is also integrated with SIMP to accelerate the optimization process. At the beginning, some initial iterations are performed using SIMP, and then the DBN predicts the final densities. The training, validation, and test datasets are constructed by using SIMP to solve the optimization problem. The dataset built from the cantilever examples contains 480,000 samples. The proposed methodology achieves a reduction in SIMP iterations as high as 90%, with a loss similar to the one achieved by SIMP. Also, it is scalable to many finite elements and can be applied to both 2D and 3D structures.
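For completeness, a minimal sketch of the RBM energy and unnormalized probability of Eqs. (10.12)-(10.13) is given below; the weights and biases are random placeholders, and the exact partition function Z of Eq. (10.14) is computed by brute force, which is only feasible for the very small numbers of units used here.

```python
import numpy as np
from itertools import product

def rbm_energy(v, h, a, b, W):
    """Energy of a visible/hidden configuration, Eq. (10.12)."""
    return -a @ v - b @ h - v @ W @ h

rng = np.random.default_rng(0)
n_vis, n_hid = 4, 3
a, b = rng.normal(size=n_vis), rng.normal(size=n_hid)
W = rng.normal(scale=0.1, size=(n_vis, n_hid))

# Exact partition function Z (Eq. 10.14) by enumerating all binary states
states_v = list(product([0, 1], repeat=n_vis))
states_h = list(product([0, 1], repeat=n_hid))
Z = sum(np.exp(-rbm_energy(np.array(v), np.array(h), a, b, W))
        for v in states_v for h in states_h)

v, h = np.array([1, 0, 1, 1]), np.array([0, 1, 0])
p = np.exp(-rbm_energy(v, h, a, b, W)) / Z            # probability of the state, Eq. (10.13)
```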
In Kallioras and Lagaros (2021b), deep belief neural networks are used to accel-
erate the topology optimization process by skipping SIMP iterations while using
the AI models to predict the desired density. SIMP is run for the initial iterations
and then the models finalize the design. The proposed method improves the method
introduced above by harnessing the power of DBNs for quick calculations on the fine mesh and using SIMP on a coarse mesh to assist the models. This method accelerates the whole process by at least one order of magnitude. In Kallioras and Lagaros (2020), a sequential collection of DBNs is introduced that takes as input the initial iterations of SIMP and tries to find hidden patterns and correlations between the initial densities of the finite elements and the final densities. An improvement of the previous models is introduced in Kallioras et al. (2021), where the models are reduced in order and, by using deep learning, the results are extrapolated to a fully refined model, achieving high accuracy and speed-ups as high as 80%. Another interesting method is introduced in Kallioras and Lagaros
(2021a) which is about a tool to generate equivalent shapes given an input with a
number of elements, forces, and supports. The produced shapes are not optimized
but are a collection of different shapes that act as a design inspiration. The output
of the process is compatible with 3D printers. The tool is powerful for prototyping
designs. It uses SIMP and Long Short-Term Memory Neural Networks (LSTM NNs)
and image processing methods to generate the shapes.
The work by White et al. (2019) focuses on large macroscale structures with
spatially varying metamaterials. To calculate the density of each element, a neural
network is used with a Gaussian activation function, i.e.

\phi(x) = e^{-x^{2}} \qquad (10.15)

With the addition of the Gaussian function, the neural network emulates radial
basis function interpolation. The weights and biases of the neural network consist of
the scaling and offset of the Gaussian function and are calculated from the training
process. The neural network uses both the actual densities as an input and their
derivatives, for better accuracy. Apart from the use of the Gaussian function, the
architecture is a classical neural network with one hidden layer. Experiments were conducted with different numbers of neurons in the hidden layer. The results show that when the
dataset is small, using derivative data is largely beneficial (leading to 9 times smaller
error when using derivatives). However, the use of derivatives becomes irrelevant
when the dataset is large and the neural network has enough capacity.
In Chandrasekhar and Suresh (2021), the density function is represented by the
weights of the Neural Network. The difference in this approach is that it does not
try to accelerate the classic SIMP process by skipping some steps using NNs (image
segmentation methods, etc.). Rather, the NN is used to directly perform Topology
Optimization using its weights. Instead of representing the density field by a finite
element mesh, it is represented by the activation functions of the NN. The NN

outputs a density value for each point of the domain thus converting the optimization
setup from a constrained into an unconstrained problem with penalization. A fully
connected NN is used that may treat 2D and 3D structures and outputs a value p
which is the density value at any point. The loss function needed to train the neural
network is given by the following expression:

L(w) = \frac{u^T K u}{J_0} + \alpha \left(\frac{\sum_e \rho_e v_e}{V^*} - 1\right)^{2} \qquad (10.16)

where α is a penalty parameter and J 0 is the initial compliance of the system.


The penalty parameter α is progressively increased with each iteration. Compared with SIMP, the output is similar in terms of compliance, but the method requires about half the computational cost of the conventional SIMP algorithm (Andreassen et al. 2011).
In Qian and Ye (2021), a dual neural network is used to solve the problem. The
first network is used for forward calculation and the second one to perform the sensi-
tivity analysis. The two networks are integrated with SIMP to replace the conventional
Finite Element Analysis. The results of the proposed method are tested in two bench-
mark tests: minimum compliance design and metamaterial design. The architecture
of both neural networks is fully connected, where the neural network used for the
sensitivity analysis has the inverse architecture from the one used for the forward
calculation. The loss function for both models consists of two parts:

Loss = L_1\left[x_{13} - (x_{13})_0\right]^2 + L_2 \sum_{i=1}^{3}\left[\frac{\partial x_{13}}{\partial x_{1i}} - \left(\frac{\partial x_{13}}{\partial x_{1i}}\right)_0\right]^{2} \qquad (10.17)

To further improve the forward model, a convolutional neural network is used


that contains the structure and the loading conditions. The architecture of the CNN is
compact as it contains only 2 convolutional layers followed by 7 dense layers. This
network takes as an input a 2-channel image, where one channel contains the density
distribution of the structure and the second one contains the force distribution and
outputs the compliance of the structure. The results of the system described above
with images of size 64 × 64 are 137 times faster for the forward calculations and 74
times faster in sensitivity analysis. The authors show that a small dataset containing
only 2000 training data suffices to achieve a 95% accuracy.
In Zhang et al. (2021) a physics informed neural network is used to perform the
optimization, where the density of the elements is calculated by the reparameter-
ization of the weights of the neural network. The main idea of this work is that a
well-trained neural network can reconstruct an image based on a portion of it, where
the image that is reconstructed is the final structure of the design variables. To achieve
this, the mechanical properties and physics are introduced into the loss function of
the neural network. The method is divided into four parts: Neural reparameteriza-
tion, Design constraints, Applied physics model, and Calculation of loss function.
In neural reparameterization, the neural network takes as an input the density of

the elements. The size of the input depends on the discretization of the mesh. The
density is considered as a dependent variable and the NN weights are the indepen-
dent variables. The second part requires converting the density outputted by the NN
to physical density corresponding to the physical constraints set in the formulation
of the problem. Filtering is used to achieve this. The third part is to perform Finite
Element Analysis on each iteration after obtaining the structure topology. The final
step of the process is to calculate the loss function to be minimized. In traditional
SIMP or BESO, the derivatives of the objective function need to be calculated to
perform sensitivity analysis. However, in this case, these are directly calculated via
automatic differentiation during the backpropagation step.
The architecture used by the neural network is a decoder network, starting with a fully connected layer that linearly transforms features from one space to another. This is then followed by sequential upsampling and convolutional layers. Furthermore, a direct copy of the input to the output is implemented, similar to the skip connections used in U-Nets. An important step after the calculation of the density by the neural network is to convert it to a physical density according to the definition of the problem. Conventional methods do not suffer from this issue because the design constraints are considered when calculating the density. This conversion happens in two steps:
(1) Make x satisfy the [0, 1] constraint by applying the sigmoid transformation to
the output layer:

x_i = \frac{1}{1 + e^{\,x_i - b(x, V)}} \qquad (10.18)

(2) Eliminate intermediate density values by using “Projection” method:

x_{\text{phy}} = \frac{\tanh(\beta\eta) + \tanh\left(\beta(x - \eta)\right)}{\tanh(\beta\eta) + \tanh\left(\beta(1 - \eta)\right)} \qquad (10.19)

where η is a threshold and β controls the sharpness of the projection. To ensure


that the volume constraints are preserved when applying projection, a volume
preserving Heaviside is added, i.e.:

\sum_{i=1}^{N} x_{\text{phy},i}\, v_i = \sum_{i=1}^{N} x_i\, v_i \qquad (10.20)

which equates the volume before and after the projection. Solving Eq. (10.20) yields the value of η, which is used in the projection equation (a sketch of this volume-preserving projection is given after this paragraph). Higher values of β correspond to thinner branches being eliminated from the final structure, which also makes the manufacturing process easier. Compared with the SIMP method used in Andreassen et al. (2011), similar results are achieved in
terms of the shape of the final structure. One big difference is that the struc-
ture of the neural network has larger and more rigid branches on the inside of
the structure due to the projection step. The proposed method can also be used

in stress-constrained problems, structural natural frequency optimization prob-


lems, compliant mechanism design problems, heat conduction system design
problems, and the optimization problem of hyperelastic structures. The big
advantage of this method is that it does not need to construct a dataset a priori and that the final structure does not suffer from structural disconnection.
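A minimal sketch of the density-to-physical-density conversion of Eqs. (10.18)-(10.20) is given below; the logistic squashing, its sign convention, and the bisection on the threshold η used to satisfy the volume-preserving condition are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def to_physical_density(x_raw, volumes, beta=8.0, offset=0.0):
    """Convert raw network outputs to physical densities: logistic squashing into (0, 1)
    (in the spirit of Eq. 10.18), then a volume-preserving projection (Eqs. 10.19-10.20)."""
    x = 1.0 / (1.0 + np.exp(-(x_raw - offset)))        # step (1); sign convention is an assumption

    def project(x, eta):                               # step (2): Heaviside projection, Eq. (10.19)
        return (np.tanh(beta * eta) + np.tanh(beta * (x - eta))) / \
               (np.tanh(beta * eta) + np.tanh(beta * (1.0 - eta)))

    target = np.sum(x * volumes)                       # volume before projection, Eq. (10.20)
    lo, hi = 0.0, 1.0
    for _ in range(50):                                # bisection for eta preserving the volume
        eta = 0.5 * (lo + hi)
        if np.sum(project(x, eta) * volumes) > target:
            lo = eta                                   # projected volume too large -> raise eta
        else:
            hi = eta
    return project(x, 0.5 * (lo + hi))
```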
In Abueidda et al. (2020), a convolutional neural network is used to calculate
the optimized designs without the use of conventional methods like SIMP, BESO,
etc., thus achieving substantial speed up in the process. The proposed method works
for linear and nonlinear materials; each material has its own CNN. A synthetic
dataset is constructed, with each pair of optimized designs and their corresponding
boundary conditions, loads and volume constraints. The architecture of the neural
network is based on ResUnet which combines the benefits of semantic segmentation
of Unet (Ronneberger et al. 2015) and residual learning of ResNet (Kollmann et al.
2020) to further improve the performance of the Unet. The Unet combines low-
level extracted features with high-level information to further improve the accuracy
of the segmentation. The deeper a neural network becomes, the harder it gets to
avoid the vanishing gradient problem, where the gradient becomes too minuscule to
calculate. ResNet tackled this problem by using residual blocks. The entire network
consists of three components: (i) Encoder: compresses the input image into a more
compressed representation, (ii) Decoder: reconstructs the original image, and (iii)
Direct transfer of the input to the output layer. Also, the convolutional layers contain
skip connections (residual blocks) that apart from the vanishing gradient problem
that they solve, also reduce the number of parameters that the network must use. The
loss used during the training of the neural network is Mean Square Error (MSE):

\text{MSE} = \frac{1}{N}\sum_{i=1}^{N} \left\| Net(I, W) - s_i \right\|^{2} \qquad (10.21)

Another metric used is the Dice Similarity Coefficient (DSC), which measures
the similarity of the output image obtained from the neural network with the ground
truth image. The DSC assumes a value of 1 if the two images are identical:
 
 
\text{DSC} = \frac{2\left|y \cap \hat{y}\right|}{|y| + |\hat{y}|} \qquad (10.22)

After training the model, a simple threshold of 0.5 is applied to discretize the
densities in {0, 1}. Both the linear and nonlinear models achieve robust DSC = 0.958
and DSC = 0.964 on the test and the train set, respectively. Since the method does
not rely on any external FEA solver, it is very fast. By transferring the inference of
the neural network to a lower-level hardware, the method can perform instantaneous
optimization of structures for linear and nonlinear materials.
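As a small illustration of the Dice Similarity Coefficient of Eq. (10.22) used to evaluate the predicted designs, the following hedged sketch computes the DSC for two density images after the 0.5 threshold mentioned above has been applied.

```python
import numpy as np

def dice_coefficient(pred, truth, threshold=0.5):
    """Dice Similarity Coefficient (Eq. 10.22) between a predicted density image
    and the ground-truth design; both are binarized with a 0.5 threshold."""
    p = (np.asarray(pred) >= threshold).astype(int)
    t = (np.asarray(truth) >= threshold).astype(int)
    intersection = np.sum(p * t)
    return 2.0 * intersection / (p.sum() + t.sum())
```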

In Deng and To (2021), a new parametric method using deep learning is intro-
duced, where the level set function is described by a deep neural network. The
proposed method utilizes the ability of the deep neural networks to approximate
any function, thus it can approximate the level set function too. A critical aspect
for the convergence of the objective function during training is the initialization of
the weights with random zero-mean values. The level set theory uses a zero contour
(2D) or isosurface (3D) to represent the boundaries of geometry of the structure. The
interface of the structure is described by the level set function Φ(x, t):

\begin{cases} \Phi(x, t) > 0, & x \in \Omega \\ \Phi(x, t) = 0, & x \in \partial\Omega \\ \Phi(x, t) < 0, & x \in D \setminus \Omega \end{cases} \qquad (10.23)

where D is the design domain, Ω is the domain occupied by the admissible design (the shape), ∂Ω is the boundary of the shape, and t is the pseudo-time. By differentiating the zero-level set, the Hamilton-Jacobi partial differential equation (PDE) can be obtained:

\frac{\partial \Phi}{\partial t} - V_n \left|\nabla \Phi\right| = 0 \qquad (10.24)
and the objective function is

\min : J(\Phi) = \int_{D} \left(\varepsilon(u) : C : \varepsilon(u)\right) H(\Phi)\, \mathrm{d}\Omega \qquad (10.25)

The deep neural network converts the PDE to an ordinary differential equation (ODE). Hence, instead of solving the Hamilton-Jacobi equation to update Φ(x) and find the optimal design, Φ(x) is represented by the parameters of the neural network. The neural network is trained using the values resulting from solving the Hamilton-Jacobi equation; these values are used as the ground truth. The resulting designs have structural performance similar to that of the traditional methods, while different neural networks can produce different conceptual designs.
In Patel et al. (2022), a method is introduced to overcome challenges that traditional topology optimization struggles with, such as geometric frustration, non-smooth edges, and dangling structures at boundaries. In addition, the method acceler-
ates the entire process. This method uses two deep neural networks, one that predicts
the optimized microstructures and one that improves connectivity between them.
The method has three stages:
• A macroscale topology optimization solver (SIMP) which predicts optimized
macroscale topology optimizations. It takes as an input the finite elements in each
direction, boundary conditions, Poisson ratio, Young’s modulus, and optimization
parameters.

• The second stage contains a deep learning neural network which predicts
microstructures. This model takes as an input the density and nodal deflections of
every macroscale unit of the previous step and outputs optimized microstructures.
• The third stage contains another deep neural network that improves the connec-
tivity of the whole structure and outputs the final optimized structure.
The first neural network is trained using only corner displacement nodes rather
than the whole domain, which makes the calculation of the microscale structures
faster. Stage 1 features a modified density-based SIMP approach, having the objective
to minimize the compliance using conventional methods. The second stage of the model predicts the optimized microstructures that fit well into the macrostructure. That
model has 3 sub-models, one deep neural network that maps the design variable vector
to a density distribution image, another convolutional neural network that predicts the
optimized structure, and a third post-processing solver that ensures volume fractions
constraints and optimal solutions. The third stage contains two neural networks that
improve the connectivity of the predicted microstructures. The first neural network is
a UNet that predicts an improvement in connectivity between 4 neighboring elements,
and the second neural network uses the pre-optimized corners to predict the output
to reduce the number of iterations. The connectivity of the optimized structure is improved by 17% by the third stage, and an overall improvement of 14.6% in compliance is achieved. In addition, the calculation of the optimized structure is about 10 times faster than with the conventional method, and the proposed method works for both 2D and 3D structures.

10.3.2 Image-Based Methods

In Banga et al. (2018), a 3D approach is explored where a convolutional neural


network is used to calculate the final output of the structure. The idea is to train
a neural network with enough degrees of freedom to directly map an input to its
optimized structure output. The dataset used as ground truth consists of 3D images obtained from TopOpt, based on the SIMP methodology. The dataset is sampled with
different Volume fraction V0 , Number of Nodes N L , Load direction vectors VL ,
Load positions PL , and Displacement Boundary Constraint BC . The total number
of samples generated with TopOpt is 6,000. Each sample took 70 to 100 iterations
to converge in the conventional topology optimization process.
\text{Loss} = -\frac{1}{n}\sum_{i=1}^{n}\left[X_{\text{true}}^{i}\log\left(X_{\text{pred}}^{i}\right) + \left(1 - X_{\text{true}}^{i}\right)\log\left(1 - X_{\text{pred}}^{i}\right)\right] + \beta\left[\frac{1}{n}\sum_{i=1}^{n}\left(X_{\text{pred}}^{i} - X_{\text{true}}^{i}\right)^{2}\right] \qquad (10.26)

Three types of inputs are given to the neural network, which are (i) 3D density
distribution of voxels at iteration m (m < T ), (ii) Gradient of voxel densities between
iterations m and n (m < n), and (iii) Forces and Boundary Conditions along x, y, and
z directions, where T is the total number of iterations and m and n are intermediate
numbers of iterations. Also at the output of the network a Density Filter Function
was applied to smooth the output based on the neighbors of each voxel:

\text{Density Filter Function}: \quad \bar{x}_i = \frac{\sum_{j \in N_i} h_{ij}\, v_j\, x_j}{\sum_{j \in N_i} h_{ij}\, v_j} \qquad (10.27)

The architecture used is a convolutional neural network without any dense layers,
as it uses only 3D convolutional layers. Specifically, it follows an encoder–decoder
architecture, and the output of the neural network has the same size as the input. The
results are compared with standard linear elasticity solvers both in terms of accuracy
and speed. Another hyperparameter that is finetuned is the number of iterations at
which TopOpt is stopped, and it needs to be balanced in accuracy and number of
iterations performed. The best neural network experiment from the ones that have
been tried achieved 40% reduction in computational time and achieved 96% accuracy.
The results show that the compliances of the structures calculated by the proposed method and by the conventional method differ only slightly. However, 4.82% of the samples have large compliance errors due to the emergence of structural disconnection. The
compliance error compared with the conventional method is 4.16% and volume
fraction error is 0.13%.
In Sosnovik and Oseledets (2019), a number of topology optimization iterations followed by a neural network are used to calculate the final structure. An initial number N_0 of conventional topology optimization iterations is performed using SIMP; the output of SIMP is turned into an image I and used as an input to the neural network. Image I is a blurred/distorted representation of the final structure. If the full topology optimization were performed, the final structure would contain only material and void with no intermediate values; this structure is represented by I*. So, after performing N_0 steps, image I is not the same as image I*. Thus, neural networks are used to perform image segmentation that converges image I to image I*, resulting in binary densities {0, 1}. The neural network architecture used is a fully convolutional neural network that takes as input 2 grayscale images: the first image is the densities X_n as output by the last performed step of topology optimization, and the second image is the difference of the densities between 2 consecutive updates, δX = X_n − X_{n−1}. The output of the network is a grayscale image of the same resolution that contains the final structure. The neural
network follows the encoder-decoder architecture with 6 convolutional layers in the encoder part and another 6 in the decoder part, which has the same shape as the encoder network but reversed. Also, max pooling operations are used between the convolutional layers to downsample the feature maps passed to the next layer.
The dataset used is synthetic based on SIMP solver for 2D structures. For the
generation of the dataset 100 iterations of SIMP are performed for each problem,
each individual problem is defined by its constraints and loads. To generate the dataset
the following constraints are used:
• The number of nodes with fixed x and y translations and the number of loads are sampled from Poisson distributions: N_x ∼ P(λ = 2), N_y ∼ P(λ = 1)
• The load values are −1 and the probability of choosing a boundary node is 100 times higher than that of an inner node
• The volume fraction is sampled from the normal distribution f_0 ∼ N(μ = 0.5, σ = 0.1)
The total size of the dataset is 10,000 samples. Each sample consists of a tensor
with shape 100 × 40 × 40, where 100 is the number of iterations and 40 × 40 is the
grid size. During the training data augmentation is applied to the data to increase the
size of the dataset and the variation. The objective function used is:

L = L conf (X true , X pred ) + β L vol (X true , X pred ) (10.28)

where L conf is binary cross-entropy and L vol is MSE of the prediction and the target.
The results are compared against the SIMP solver in terms of accuracy and time consumption. The metrics used are Binary Accuracy and Intersection over Union (IoU), where Binary Accuracy measures the number of pixels classified correctly over the total number of pixels of the structure, and IoU measures the area of overlap over the area of the union of the predicted and true material regions. Four different policies were tested using different stopping iterations for the SIMP algorithm. The number of iterations at which SIMP stops is sampled from the uniform distribution U[1, 100] or from Poisson distributions with λ = 5, 10, 30. The output structure is similar to the one produced by SIMP and its calculation is 20 times faster. Higher accuracy and IoU are achieved with more SIMP iterations; the highest values achieved are Accuracy = 99.6% and IoU = 99.2%.
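For reference, a hedged sketch of the two evaluation metrics used in this comparison, computed on binarized (0/1) density images, is given below.

```python
import numpy as np

def binary_accuracy(pred, truth):
    """Fraction of pixels whose predicted (0/1) density matches the ground truth."""
    return np.mean(pred == truth)

def intersection_over_union(pred, truth):
    """IoU of the predicted and ground-truth material regions (pixels equal to 1)."""
    intersection = np.logical_and(pred == 1, truth == 1).sum()
    union = np.logical_or(pred == 1, truth == 1).sum()
    return intersection / union
```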
Another image-based approach is the study by Wang et al. (2022), where a convo-
lutional neural network with strong generalization capabilities is used. The dataset
used consists of 80,000 samples generated with the code of Andreassen et al. (2011), which uses SIMP. The volume fraction, the number of forces, and the direction of each force are sampled from uniform distributions. The input of the neural network is a 40 × 80 × 6 tensor, whose channels contain images of the volume fraction, the nodal displacements in the X and Y directions, the nodal normal strains ε_x, ε_y, and the shear strain γ_xy. The ground truth that
is used for training is the optimized output of SIMP. The architecture of the neural
network is an encoder–decoder where the encoder part reduces the size of the input
gradually up to 8 times and the decoder part restores it and outputs it to its orig-
inal size. Because the output for each element lies in (0, 1) and denotes the probability of existence of the element, a suitable loss function is the Kullback-Leibler divergence, which tries to minimize the distance between the known distribution and the output distribution:

D_{KL}(p \| q) = \sum_{x} p(x)\log\frac{p(x)}{q(x)} + \frac{\lambda}{2}\sum_{i} \theta_i^{2} \qquad (10.29)

where the first term is the loss function with p(x) the ground truth distribution and
q(x) the neural network output. The second term of the loss function is the L 2
regularization term to reduce overfitting, where λ is the weight of the regularization
term and θ denotes the network parameters. The results are similar to those of SIMP, as only 4.12% of the samples showed large compliance errors. Also, the neural network provides a huge speed-up, calculating the optimal design structure about 99% faster.
In Kollmann et al. (2020), deep learning is used to optimize 2D structures of
metamaterials. The proposed method uses a convolutional neural network (CNN)
and non-iteratively optimizes metamaterials for either maximizing the bulk modulus,
maximizing the shear modulus, or minimizing Poisson’s ratio that also include nega-
tive values. The data used for the training of the neural network are created by
randomly sampling optimization parameters. These optimization parameters are a
filter radius, a design constraint (volume fraction) and a design objective (maximum
bulk modulus, maximum shear modulus, or minimum Poisson’s ratio), and are
sampled from uniform distributions. The optimized design is then calculated using
SIMP. The neural network follows the architecture of the encoder–decoder network
and takes as an input 3 images, one for each optimization parameter described before
and outputs an image which represents the optimized structure. More specifically it
utilizes the ResUnet architecture, where the Unet part is utilized for the semantic
segmentation of the image and the skip connections of the ResNet which help train
more efficiently deep neural networks without the issue of the vanishing gradient (too
small values of the gradient in very deep neural networks). The loss function used
during the training of the model is MSE between the ground truth optimized image
and the output of the model, also the Dice Similarity Coefficient is used, which
denotes the similarity of 2 images. To train the model the dataset created is split
into training, validation, and test set. The validation set is used during the training
to ensure that the model is not overfit to the training. Then the model is evaluated
with the test set, which the model has never “seen” before. The model achieves:
M S E = 0.007 and DSC = 0.97, especially the similarity coefficient shows that the
produced optimized design image is almost similar to the ground truth. Final step of
the process is to apply a threshold of 0.5 to binarize the predicted densities, because
the model produces densities ranging from [0, 1] but the desired final optimized
design must have values in {0,1}.
In Chi et al. (2021), a large-scale solution is proposed without a loss in accuracy.
The proposed method has three distinct features: a novel component that is trained
from previous iterations, a two-scale topology optimization method using a localized
strategy, and a component that generates new data from actual physical simulations
and constantly improves the machine learning models. In contrast with other deep
learning methods, where the neural networks are trained before the optimization
process, in the proposed method training happens online in 2 stages: one initial
online training session and several online updates during the process. There are
4 key parameters that control this process: N_I and N_F, which control the initial
online training step and the frequency of the online updates, and the other 2
parameters, W_I and W_U, which are windows that control how far
back in the data the neural network can look when training in the initial step and the
update stage, respectively. The online initial training of the neural network starts by
solving the traditional equations for N_I + N_W − 1 steps, where during the training the
network can "see" only W_I steps back. Using the trained model, the most computationally
expensive steps (the solution of the state equations and the sensitivity analysis) can be
avoided. To ensure that the model stays accurate during the whole process, the weights of
the model are updated at a regular frequency by switching back to the standard solution
procedure and retraining the model. To make the proposed framework efficient and scalable,
a two-scale topology optimization setup is used, where a fine mesh and a coarse mesh are
used separately in different stages. The fine mesh is used to solve the state equations
to collect new data, and the design variable updates are also performed there. On the
coarse mesh no design variable updates happen, but the state equation is solved at
every step of the optimization on the stiffness distribution mapped from the
fine-scale mesh. The architecture of the neural network is a fully connected
deep neural network (DNN) with 4 hidden layers and 1000 neurons in each
hidden layer. A notable addition to the architecture of the DNN is the Parametric
Rectified Linear Unit (PReLU), a generalization of ReLU that contains
a learnable parameter α. PReLU has shown great performance in
image recognition tasks (He et al. 2015, 2016):

σ (x) = max(0, x) + α min(0, x) (10.30)
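For concreteness, a minimal Python/NumPy sketch of Eq. (10.30) follows; the learnable per-channel α is simplified to a fixed scalar for illustration.

import numpy as np

def prelu(x, alpha=0.25):
    """Parametric ReLU: identity for positive inputs, slope alpha for negative ones."""
    return np.maximum(0.0, x) + alpha * np.minimum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(prelu(x))          # [-0.5   -0.125  0.     1.5  ]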

Deep neural networks are often called universal approximators because, with
the right architecture, they can approximate practically any function. In topology
optimization the design variables can number in the millions, so a single model does not
have the capacity to scale and produce an accurate mapping from the input to the output
as the number of design variables increases. That is why a two-scale localized setup is
used to ensure the scalability of the model; the proposed setup also does not require as
much memory for the calculations. Training examples are produced from the coarse-mesh
discretizations, which are also enclosed in the fine-mesh ones. Moreover, the whole global
design of the fine mesh is not treated as a single example; instead, each element of the
mesh is used individually as a training example. The results show that the localized
training strategy is more efficient in terms of memory and scalability. To measure
the accuracy of the models, the angle of deviation from the original sensitivity is
used:
 
\theta_{error} = \arccos\left( \frac{G^{T} \hat{G}}{\|G\|\,\|\hat{G}\|} \right)     (10.31)

With sufficient training steps, the proposed method's θ_error approaches zero,
showing that the method is also accurate. It is also suggested that the strain vector,
rather than the nodal displacement vector from the coarse mesh, should be used as the
input of the deep neural network.
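A small Python/NumPy sketch of this accuracy measure follows, assuming G is the reference sensitivity vector and g_pred its prediction; the names and example values are illustrative assumptions.

import numpy as np

def sensitivity_angle_error(g_ref, g_pred):
    """Angle (radians) between the reference and predicted sensitivity vectors."""
    cos_theta = g_ref @ g_pred / (np.linalg.norm(g_ref) * np.linalg.norm(g_pred))
    return np.arccos(np.clip(cos_theta, -1.0, 1.0))

g_ref = np.array([1.0, 0.2, -0.5])
g_pred = np.array([0.9, 0.25, -0.45])
print(sensitivity_angle_error(g_ref, g_pred))   # close to zero for accurate predictions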

In Li et al. (2022), a cross-resolution method that maps intermediate designs to
final high-resolution designs is introduced; the method also handles geometrical
non-linearities, which occur when the relationship between displacement and strain
becomes nonlinear. The dataset is constructed by calculating both the intermediate
designs and the final high-resolution designs of a cantilever beam and a short
cantilever beam, with forces chosen randomly from a uniform distribution. The model
consists of two neural networks and is based on the generator–discriminator architecture:
a cross-resolution network is used as the generator and a Markovian discriminator as the
discriminator. The generator takes as input the low-resolution intermediate image and
creates the high-resolution image of the output design, while the discriminator is used
during training to distinguish between real and fake configurations. The generator
consists of 3 components: a U-Net to make the abstract connections from the
low-resolution image to the higher-resolution one, a cross-resolution layer that expands
the dimensions while keeping the number of channels the same, and a ResNet added to reach
deeper networks, whose skip connections avoid the vanishing-gradient phenomenon. The
Markovian discriminator is more suitable for image prediction problems than conventional
CNN discriminators: it uses a sliding kernel that produces a score across the image, and
the total mean score is used to judge whether the image is real or fake; the sliding kernel
also preserves continuity and can reveal more detail. The similarity that the generator
achieves is ~95% compared with the original image, and it reduces the computational cost
from ~300 s to ~17 s.

10.4 Conclusions

This chapter presented a review of methods that perform topology opti-
mization using artificial intelligence-related methodologies. Artificial intelligence
has been a useful tool in many scientific areas during the last decades, so it is no
surprise that such tools have found their way into topology optimization problems,
either by completely replacing the conventional methods or by assisting them in order
to reduce the required computational cost. The main advantage of using artificial
intelligence for topology optimization is that these models have a large enough learning
capacity to map the input to an output even in complex engineering problems. As a result,
a properly trained AI model can be used to map an input to an output during the topology
optimization process.
Many models and algorithms have been developed in recent years for AI-assisted
topology optimization problems. The two most popular families of methods are
density-based and image-based ones. Density-based methods use the mechanical
properties of the model as input and output a density. The second family, image-based
methods, uses an image as input, either in 2D or 3D. Most methods use a deep neural
network or a convolutional neural network to calculate the optimized structure.
Methods using density-based approaches
usually exhibit better performance in terms of both accuracy and computational cost
reduction. Advancements in both software and hardware can further improve the
performance of these methods, as AI models rely heavily on GPU computations.

Acknowledgements The research was supported by the National Funding for European Competi-
tive Research Projects, for the financial year 2019, with the beneficiary being the National Technical
University of Athens (NTUA), project: 67120100, GSRT AWARD 2019 “Data Driven Surrogate
Models and Applications”.
The fifth author acknowledges the support of the European Union’s Horizon research and innova-
tion programme under the Marie Sklodowska-Curie Individual Fellowship grant “AI2AM: Artificial
Intelligence driven topology optimisation of Additively Manufactured Composite Components”,
No. 101021629.

References

Abueidda DW, Koric S, Sobh NA (2020) Topology optimization of 2D structures with nonlineari-
ties using deep learning. Comput Struct 237:106283. https://doi.org/10.1016/j.compstruc.2020.
106283
Andreassen E, Clausen A, Schevenels M, Lazarov BS, Sigmund O (2011) Efficient topology opti-
mization in MATLAB using 88 lines of code. Struct Multidisc Optim 43(1):1–16. https://doi.
org/10.1007/s00158-010-0594-7
Banga S, Gehani H, Bhilare S, Patel S, Kara L (2018) 3D topology optimization using convolutional
neural networks. arXiv:1808.07440v1. https://doi.org/10.48550/arXiv.1808.07440
Bendsøe MP (1989) Optimal shape design as a material distribution problem. Struct Optim 1(4):193–
202. https://doi.org/10.1007/BF01650949
Bendsøe MP, Sigmund O (2004) Topology optimization: theory, methods, and applications.
Springer, Heidelberg. https://doi.org/10.1007/978-3-662-05086-6
Chandrasekhar A, Suresh K (2021) TOuNN: topology optimization using neural networks. Struct
Multidisc Optim 63(3):1135–1149. https://doi.org/10.1007/s00158-020-02748-4
Chi H, Zhang Y, Tang TLE, Mirabella L, Dalloro L, Song L, Paulino GH (2021) Universal machine
learning for topology optimization. Comput Methods Appl Mech Eng 375:112739. https://doi.
org/10.1016/j.cma.2019.112739
Clarke MRB (1974) Pattern classification and scene analysis. J R Stat Soc: Ser A (general)
137(3):442–443. https://doi.org/10.2307/2344977
Deng H, To AC (2021) A parametric level set method for topology optimization based on deep
neural network. J Mech Des 143(9). https://doi.org/10.1115/1.4050105
He K, Zhang X, Ren S, Sun J (2015) Delving deep into rectifiers: surpassing human-level perfor-
mance on imagenet classification. In: 2015 IEEE international conference on computer vision
(ICCV), pp 1026–1034. https://doi.org/10.1109/ICCV.2015.123
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: 2016 IEEE
conference on computer vision and pattern recognition (CVPR), pp 770–778. https://doi.org/
10.1109/CVPR.2016.90
Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal
approximators. Neural Netw 2(5):359–366. https://doi.org/10.1016/0893-6080(89)90020-8
Kallioras NA, Kazakis G, Lagaros ND (2020) Accelerated topology optimization by means of
deep learning. Struct Multidisc Optim 62(3):1185–1212. https://doi.org/10.1007/s00158-020-
02545-z
Kallioras NA, Lagaros ND (2020) DL-scale: deep learning for model upgrading in topology
optimization. Procedia Manuf 44:433–440. https://doi.org/10.1016/j.promfg.2020.02.273
Kallioras NA, Lagaros ND (2021a) DL-SCALE: a novel deep learning-based model order upscaling
scheme for solving topology optimization problems. Neural Comput Appl 33(12):7125–7144.
https://doi.org/10.1007/s00521-020-05480-8
Kallioras NA, Lagaros ND (2021b) MLGen: generative design framework based on machine
learning and topology optimization. Appl Sci 11(24):12044
Kallioras NA, Nordas AN, Lagaros ND (2021) Deep learning-based accuracy upgrade of reduced
order models in topology optimization. Appl Sci 11(24):12005
Kollmann HT, Abueidda DW, Koric S, Guleryuz E, Sobh NA (2020) Deep learning for topology
optimization of 2D metamaterials. Mater Des 196:109098. https://doi.org/10.1016/j.matdes.
2020.109098
Lei X, Liu C, Du Z, Zhang W, Guo X (2018) Machine learning-driven real-time topology optimiza-
tion under moving morphable component-based framework. J Appl Mech 86(1). https://doi.org/
10.1115/1.4041319
Li J, Ye H, Yuan B, Wei N (2022) Cross-resolution topology optimization for geometrical non-
linearity by using deep learning. Struct Multidisc Optim 65(4):133. https://doi.org/10.1007/s00
158-022-03231-y
Liu K, Tovar A, Nutwell E, Detwiler D (2015) Towards nonlinear multimaterial topology opti-
mization using unsupervised machine learning and metamodel-based optimization. In: ASME
2015 international design engineering technical conferences and computers and information in
engineering conference. https://doi.org/10.1115/detc2015-46534
Lu X, Plevris V, Tsiatas G, De Domenico D (2022) Editorial: artificial intelligence-powered method-
ologies and applications in earthquake and structural engineering. Frontiers Built Environ 8.
https://doi.org/10.3389/fbuil.2022.876077
Mlejnek HP (1992) Some aspects of the genesis of structures. Struct Optim 5(1):64–69. https://doi.
org/10.1007/BF01744697
Parzen E (1962) On Estimation of a probability density function and mode. Ann Math Stat
33(3):1065–1076
Patel D, Bielecki D, Rai R, Dargush G (2022) Improving connectivity and accelerating multiscale
topology optimization using deep neural network techniques. Struct Multidisc Optim 65(4):126.
https://doi.org/10.1007/s00158-022-03223-y
Patel J, Choi S-K (2012) Classification approach for reliability-based topology optimization using
probabilistic neural networks. Struct Multidisc Optim 45(4):529–543. https://doi.org/10.1007/
s00158-011-0711-2
Qian C, Ye W (2021) Accelerating gradient-based topology optimization design with dual-model
artificial neural networks. Struct Multidisc Optim 63(4):1687–1707. https://doi.org/10.1007/s00
158-020-02770-6
Ronneberger O, Fischer P, Brox T (2015) U-net: convolutional networks for biomedical image
segmentation. Cham, pp 234–241. https://doi.org/10.1007/978-3-319-24574-4_28
Solorzano G, Plevris V (2022) Computational intelligence methods in simulation and modeling of
structures: a state-of-the-art review using bibliometric maps. Frontiers Built Environ 8. https://
doi.org/10.3389/fbuil.2022.1049616
Sosnovik I, Oseledets I (2019) Neural networks for topology optimization. Russ J Numer Anal
Math Model 34(4):215–223. https://doi.org/10.1515/rnam-2019-0018
Wang D, Xiang C, Pan Y, Chen A, Zhou X, Zhang Y (2022) A deep convolutional neural network
for topology optimization with perceptible generalization ability. Eng Optim 54(6):973–988.
https://doi.org/10.1080/0305215X.2021.1902998
White DA, Arrighi WJ, Kudo J, Watts SE (2019) Multiscale topology optimization using neural
network surrogate models. Comput Methods Appl Mech Eng 346:1118–1135. https://doi.org/
10.1016/j.cma.2018.09.007
Zhang Z, Li Y, Zhou W, Chen X, Yao W, Zhao Y (2021) TONR: An exploration for a novel
way combining neural network with topology optimization. Comput Methods Appl Mech Eng
386:114083. https://doi.org/10.1016/j.cma.2021.114083
Zhou M, Rozvany GIN (1991) The COC algorithm, Part II: Topological, geometrical and generalized
shape optimization. Comput Methods Appl Mech Eng 89(1):309–336. https://doi.org/10.1016/
0045-7825(91)90046-9
Chapter 11
Mixed-Variable Concurrent Material,
Geometry, and Process Design
in Integrated Computational Materials
Engineering

Tianyu Huang, Marisa Bisram, Yang Li, Hongyi Xu, Danielle Zeng,
Xuming Su, Jian Cao, and Wei Chen

11.1 Introduction

Integrated Computational Materials Engineering (ICME) is shifting the paradigm


of material design from exploration-driven and target-blind to a product-oriented
approach in that new materials and components can be developed specifically
for the end product’s use cases (Olson 1997). Rooted in decades of research in
mechanics, manufacturing, and materials science, ICME workflows integrate compu-
tational models from multiple domains and time/length scales so that the study of
process-structure–property-performance relationship for given materials and prod-
ucts can inform their optimal design (Fig. 11.1). For example, with ICME, a class
of anisotropic materials such as carbon fiber reinforced polymer (CFRP) composites
can be designed at microstructure level, i.e. the material’s volume fraction and fiber
orientations can be optimized based on the structure’s loading cases. For vehicle
lightweighting purposes (Su and Wagner 2017; Xu et al. 2017), ICME has been
accomplished by integrating CFRP modeling and design tools created by multiple
disciplines, such as advanced manufacturing process simulations (Zhang et al. 2016b;
Li et al. 2017b), multiscale materials models (Li et al. 2017b; Chen et al. 2018; Sun



Fig. 11.1 Illustration of ICME workflow

et al. 2018), and stochastic modeling and uncertainty quantification methods (Bostan-
abad et al. 2018b; Huang et al. 2021), achieving a weight saving of 30% and cost
saving of $4.01 per pound of weight saved compared to an all-steel baseline, while
preserving the required structural durability (Su and Wagner 2019).
Machine learning (ML) is often used in ICME to enable metamodeling and
metamodel-based design optimization. Metamodeling (Simpson et al. 2001), or
surrogate modeling, is a technique to replace the expensive object models with a
simpler data-driven model to speed up the design evaluation process. The resulting
metamodel is trained using data from high-fidelity physics-based models, which are
often very computationally intensive, to allow more emulated model executions for
design space exploration. It leads to metamodel-based design optimization (MBDO),
wherein the optimization search is based on a metamodel instead of the original,
expensive one. MBDO has several steps, including (1) design of experiments (DoE)
for data collection, (2) metamodeling, (3) model validation, and (4) optimization.
The DoE methods, e.g. Latin hypercube sampling and low discrepancy sequences
(Sobol 2001), aim at developing a training dataset that explores the design space as
much as possible given a limited computing budget. For engineering applications, Jin
et al. (2001) compared multiple metamodels and concluded that Gaussian process
(GP) modeling is the most suitable for modeling nonlinear behavior from deter-
ministic computer simulations. The recent surge in using artificial neural networks
(ANN) and deep networks (see, e.g. Daddona and Antonelli 2018; Rao and Liu
2020) is motivated by metamodeling of simulations with high-dimensional inputs.
The metamodel is often validated by methods such as k-fold cross-validation or
leave-one-out cross-validation. Finally, either gradient-based optimization search
algorithms or non-gradient-based method such as Bayesian optimization (Shahriari
et al. n.d) is integrated with metamodel for design optimization. MBDO has been
widely used in engineering design practice, such as design of fuel cell systems (Miao
et al. 2011), improving vehicle crashworthiness (Fang et al. 2005), and optimizing
the aerodynamics performance of car body designs (Song et al. 2017).
Major challenges exist in leveraging ML, and correspondingly MBDO methods,
in ICME-based design optimization, including the following.
• Mixed design variables: while many existing modeling and optimization
approaches assume continuous input variables, the design variables in ICME are
often of mixed continuous and non-continuous types. More specifically, it can
be any arbitrary combination of continuous, ordinal, and categorical variables.


For example, a CFRP component design may consist of continuous shape param-
eters (e.g. length and thickness), ordinal processing conditions (with discrete
but ordered levels like low, medium, high), and categorical variables such as
choices of materials (with no clear ordering of the categories). The mixed-variable
nature in ICME design problems results in a disjoint design space with highly
nonlinear responses and reinforces the need for novel ML models and optimization
algorithms accepting inputs of mixed types.
• Computation cost: the cost of executing the ICME models for design performance
analyses is nontrivial as the complete workflow will involve many physics-based
and therefore expensive simulations. It limits the number of designs we can
evaluate in the design cycles and makes data-hungry ML algorithms such as
deep neural networks unsuitable for ICME. In addition, it is not uncommon that
multiple objectives coexist in real-world product design problems (e.g. lowering
a load-bearing component’s weight but improving its durability), which further
increases the computational cost for design performance evaluation and requires
multi-objective optimization.
We propose to address the above challenges using constrained Bayesian optimiza-
tion (CBO) with latent-variable Gaussian processes (LVGP), denoted as LVGP-CBO
throughout this chapter. LVGP, originally developed by Zhang et al. (2019), is a statis-
tical model which can map all non-continuous input variables to a continuous latent
space for GP regression with mixed variables. It has been shown to have superior
properties in terms of accuracy and efficiency compared to other correlation structures
designed for qualitative predictors (Simpson et al. 2001; Qian et al. 2008; Rebonato
and Jäckel 2011; Zhou et al. 2011; Deng et al. 2017). We integrate LVGP with the
Bayesian optimization (BO) algorithm to search for the optimal designs. Given the
predicted distributions of design performance from the metamodel, BO iteratively
chooses the design with the highest potential to improve (over the current best) to
evaluate and update the LVGP. Combination of GP and BO has been shown to be
effective and efficient because more data points are sampled where optima are likely
to reside (Shahriari et al. n.d). LVGP and BO have been preliminarily explored in
recent works with material design applications (Iyer et al. 2019; Zhang et al. 2020).
For constrained ICME design problems with mixed variables and multiple objec-
tives, in this work, we first formulate it as a constrained optimization problem, then
extend BO to CBO with LVGP to realize concurrent material, process, and product
design.
This chapter is organized as follows: the LVGP-CBO method is introduced in
Sect. 11.2. Its application to design a thin-walled hat section with multiple material
choices, variable material parameters, and customizable hat geometry is demon-
strated in Sect. 11.3. A second demonstration, the design of short fiber compos-
ites with flexibility to choose between different fibers, their volume fractions, and
processing conditions, is given in Sect. 11.4. We conclude this chapter in Sect. 11.5.

11.2 Mixed-Variable and Constrained Bayesian Optimization

The iteration steps of LVGP-CBO are illustrated in Fig. 11.2. The framework is predi-
cated on the end-to-end ICME simulation models, which evaluate process, structural
and material designs consisting of mixed-type variables for multiple performance
metrics and are assumed to be costly to execute due to their nature of being multi-
step and multiscale. We start from a few initial designs in the ICME database and
build an LVGP surrogate model to predict the performance of any design in the input
space with quantifiable uncertainty. Then, an improved version of BO is employed
to search in the design space for optimal solution under constraints. The predicted
optimal designs are consequently evaluated by the ICME models, and the data points
are added to the database for the next iteration. The process continues until a preset
limit on iterations is reached or a convergence condition is met.
The methods are presented in this section. We start with the basics of GP modeling
and BO, followed by the introduction of LVGP modeling, then generalizing BO to
constrained optimization problems.
Throughout the section, we assume the dataset to model contains N pairs of predictors
and observed responses {(x^(1), y_1), (x^(2), y_2), ..., (x^(N), y_N)}, where
x^(i) (i = 1, ..., N) is d-dimensional. X is the design matrix of size N × d, where each
row contains an observation, y is the response vector of size N, and Y_i (i = 1, ..., N)
and Y are the Gaussian random variables and vectors used for GP modeling.

11.2.1 Gaussian Processes and Bayesian Optimization

To construct a statistical model of the data, we assume the output is a realization


of a Gaussian process, which is a set of time- or location-indexed random variables
following a multivariate Gaussian distribution, with a mean and a covariance function
to define the distributional characteristics (Fig. 11.3). Let [y_1, y_2, ..., y_N]^T be the

Fig. 11.2 LVGP-CBO Workflow for ICME



Fig. 11.3 Gaussian process modeling and Bayesian optimization

realization of the GP [Y1 , Y2 , ..., Y N ]T with


\begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_N \end{bmatrix}
\sim \mathcal{N}\left(
\begin{bmatrix} m(x^{(1)}) \\ m(x^{(2)}) \\ \vdots \\ m(x^{(N)}) \end{bmatrix},
\begin{bmatrix}
K(x^{(1)}, x^{(1)}) & K(x^{(1)}, x^{(2)}) & \cdots & K(x^{(1)}, x^{(N)}) \\
K(x^{(2)}, x^{(1)}) & K(x^{(2)}, x^{(2)}) & \cdots & K(x^{(2)}, x^{(N)}) \\
\vdots & \vdots & \ddots & \vdots \\
K(x^{(N)}, x^{(1)}) & K(x^{(N)}, x^{(2)}) & \cdots & K(x^{(N)}, x^{(N)})
\end{bmatrix}
\right)     (11.1)

where m(·) is a mean function and K (·, ·) is a covariance function, both converting
the predictors to the responses’ distribution parameters. A popular choice of the
covariance function is
K\left(x^{(i)}, x^{(j)}\right) = \sigma^2 \, r\left(x^{(i)}, x^{(j)}\right)
= \sigma^2 \exp\left( -\sum_{k=1}^{d} \lambda_k \left( x_k^{(i)} - x_k^{(j)} \right)^2 \right)     (11.2)

where r (·, ·), the squared exponential function, models the correlation between Yi
and Y j through the Euclidean distance between xi and x j , assuming the process
is stationary and isotropic. K (·, ·) has d + 1 hyperparameters (λk and σ 2 ). By
assuming the mean is a constant, i.e. m(·) = c, the training of the GP model
through maximum likelihood estimation (MLE) is turned into a (d + 2)-dimensional
optimization problem
 
\left( \hat{c}, \hat{\sigma}^2, \hat{\lambda}_k \right) = \underset{c,\, \sigma^2,\, \lambda_k}{\operatorname{argmax}} \; L\left( c, \sigma^2, \lambda_k \mid X, y \right)     (11.3)

where the likelihood function L(·) is derived from the multivariate Gaussian density
(with covariance matrix Σ_ij = K(x^(i), x^(j)))

L\left( c, \sigma^2, \lambda_k \mid X, y \right) = f_Y\left( y; X, c, \sigma^2, \lambda_k \right)
= \frac{1}{(2\pi)^{N/2} \, |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (y - c)^T \Sigma^{-1} (y - c) \right)     (11.4)

To solve the optimization problem, Eq. (11.4) is often turned into its negative
logarithmic form for numerical stability, and the maximization problem (or mini-
mization when in negative logarithm) is solved by numerical optimization algorithms
(Bostanabad et al. 2018a).
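To make this estimation step concrete, the following Python/NumPy sketch evaluates the negative log-likelihood of Eqs. (11.2)-(11.4) for a constant-mean GP; the function name, the jitter term, and the toy data are illustrative assumptions rather than the authors' implementation.

import numpy as np

def neg_log_likelihood(params, X, y):
    """Negative log-likelihood of a constant-mean GP (cf. Eqs. 11.2-11.4).
    params = [c, log(sigma^2), log(lambda_1), ..., log(lambda_d)]."""
    n, d = X.shape
    c, sigma2 = params[0], np.exp(params[1])
    lam = np.exp(params[2:])
    # Covariance matrix from the squared exponential kernel
    diff2 = (X[:, None, :] - X[None, :, :]) ** 2
    K = sigma2 * np.exp(-np.tensordot(diff2, lam, axes=([2], [0])))
    K += 1e-8 * np.eye(n)                      # jitter for numerical stability
    L = np.linalg.cholesky(K)
    r = y - c
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, r))
    return 0.5 * r @ alpha + np.sum(np.log(np.diag(L))) + 0.5 * n * np.log(2 * np.pi)

# Tiny illustrative dataset; in practice the parameters are found by a numerical optimizer
rng = np.random.default_rng(2)
X = rng.random((10, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1]
print(neg_log_likelihood(np.array([0.0, 0.0, 0.0, 0.0]), X, y))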
Given the optimal hyperparameters, the distribution of Eq. (11.1) is determined
and we can construct the joint distribution of Y and Y (n) , a new point at x (n) we are
interested in. Note that the distribution of Y (n) conditioned on our observation Y = y
is also Gaussian with conditional mean and variance
   
E\left[ Y^{(n)} \mid Y = y \right] = c + r\left( x^{(n)}, X \right) R^{-1} (y - c)

\operatorname{var}\left[ Y^{(n)} \mid Y = y \right] = \sigma^2 \left( 1 - r\left( x^{(n)}, X \right) R^{-1} r\left( x^{(n)}, X \right)^T \right)     (11.5)

   
where R = [r_ij] is the correlation matrix. We use E[Y^(n) | y] as our point estimate
of y^(n), with variance var[Y^(n) | y], and the epistemic prediction uncertainty from lack
of data can be quantified hereafter using the variance itself or a confidence interval
based on it.
Details of GP modeling can be found in references such as Rasmussen (2003) and
Bostanabad et al. (2018a). In addition to the basic version, which predicts a single
noise-free variable from inputs, alternative versions have been developed for less
ideal situations such as multi-response outputs (Conti et al. 2009), stochastic noises
(Ankenman et al. 2008), and additive noises (Zhang et al. 2016a).
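As a compact illustration of the kriging predictor in Eq. (11.5), the Python/NumPy sketch below computes the conditional mean and variance at a new point; the kernel form, hyperparameter values, and toy data are assumed for illustration only.

import numpy as np

def gp_predict(x_new, X, y, c, sigma2, lam):
    """Conditional mean and variance of Eq. (11.5) for a single new point x_new."""
    def corr(a, b):
        return np.exp(-np.sum(lam * (a - b) ** 2, axis=-1))
    R = corr(X[:, None, :], X[None, :, :]) + 1e-8 * np.eye(len(X))
    r_vec = corr(x_new[None, :], X)           # correlations with the training points
    R_inv_res = np.linalg.solve(R, y - c)
    mean = c + r_vec @ R_inv_res
    var = sigma2 * (1.0 - r_vec @ np.linalg.solve(R, r_vec))
    return mean, max(var, 0.0)

rng = np.random.default_rng(3)
X = rng.random((10, 2))
y = np.sin(3 * X[:, 0]) + X[:, 1]
mean, var = gp_predict(np.array([0.5, 0.5]), X, y, c=y.mean(), sigma2=1.0, lam=np.array([5.0, 5.0]))
print(mean, var)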
Conventional numerical optimization algorithms need access to the objective
function’s gradient information to determine search directions and optimality condi-
tions while convergence is often guaranteed for local solutions (Nocedal and Wright
2000). Bayesian optimization (BO) (Shahriari et al. n.d) is an alternative efficient
global optimization (EGO) approach that guides its solution search through the
probability distributions of outputs predicted by a metamodel constructed from the
search history data. The core idea is to design an acquisition function, which quanti-
tatively measures the potential for improvement at an unexplored site by comparing
and integrating its prediction uncertainty and predicted improvement relative to the
current optimum. If the uncertain prediction y at a location x is characterized by a
random variable Y with mean ŷ and variance σ̂², while the current best solution (for
a maximization problem) is y*, a widely used acquisition function is the expected
improvement (EI)
   
\mathrm{EI}(x) = E\left[ \left( \hat{y} - y^{*} \right) \mathbb{1}\left( \hat{y} > y^{*} \right) \right]
= \left( \hat{y} - y^{*} \right) \Phi_{N}\!\left( \frac{\hat{y} - y^{*}}{\hat{\sigma}} \right) + \hat{\sigma}\, \phi_{N}\!\left( \frac{\hat{y} - y^{*}}{\hat{\sigma}} \right)     (11.6)

where φ_N and Φ_N denote the probability density and cumulative distribution func-
tion (PDF and CDF) of a standard normal distribution when the predictions are
provided by a GP model. The unexplored point with the highest EI is chosen for
objective function evaluation in each optimization iteration (Fig. 11.3). EI provides a
balance between exploring highly uncertain sites and exploiting predicted promising
sites. Other options include the probability of improvement (PI), PI(x) = P[Y ≥ y*] =
Φ_N((ŷ − y*)/σ̂), and the upper confidence bound (UCB), UCB(x) = ŷ + βσ̂, where β is a
hyperparameter of the user’s choice (Shahriari et al. n.d). The acquisition function
may also be hybrid, choosing from a set of candidates based on their performance in
the optimization process. See, for example, the Thompson sampling (Shahriari et al.
n.d) approach for reference.
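As a small illustration, the EI, PI, and UCB acquisition functions above can be evaluated with a few lines of Python/SciPy; the numerical values passed in are arbitrary examples, and β is a user-chosen constant.

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    """EI for a maximization problem given the GP mean/std at a candidate point."""
    sigma = np.maximum(sigma, 1e-12)
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

def probability_of_improvement(mu, sigma, y_best):
    return norm.cdf((mu - y_best) / np.maximum(sigma, 1e-12))

def upper_confidence_bound(mu, sigma, beta=2.0):
    return mu + beta * sigma

print(expected_improvement(1.2, 0.3, 1.0),
      probability_of_improvement(1.2, 0.3, 1.0),
      upper_confidence_bound(1.2, 0.3))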

11.2.2 Latent Variable Gaussian Process (LVGP) Modeling

As mentioned in Sect. 11.1, mixed-variable modeling and design problems often


occur in ICME. Latent-variable Gaussian process (LVGP) (Zhang et al. 2019) is an
emerging GP-based model that handles mixed-variable inputs by mapping categor-
ical variables to a continuous latent space. Like any GP variant, it can serve as a
surrogate model to physics-based simulations and subsequently guide the Bayesian
optimization.
Given a mixed-variable dataset, assuming the input is d-dimensional with d_1
continuous or discrete variables x_c ∈ R^{d_1} and d_2 = d − d_1 categorical variables
x_t ∈ R^{d_2}, we define the covariance between any two designs x^(i) = (x_c^(i), x_t^(i)) and
x^(j) = (x_c^(j), x_t^(j)) by

K\left( x^{(i)}, x^{(j)} \right) = \sigma^2 \exp\left( -\sum_{k=1}^{d_1} \tau_k \left( x_{c,k}^{(i)} - x_{c,k}^{(j)} \right)^2 - \sum_{k=1}^{d_2} \left\| z\left( x_{t,k}^{(i)} \right) - z\left( x_{t,k}^{(j)} \right) \right\|^2 \right)     (11.7)

where z : Z → R2 maps each level of a category to two continuous variables,


which is assumed to be in a latent space, characterizing the latent distance between
non-ordinal categories (Fig. 11.4). Note that by transforming categorical variables
to continuous ones, the original problem becomes a d1 + 2d2 -dimensional regular
GP fitting one. Hence the usual GP modeling routines, e.g. MLE and conditional
predictions, can still be applied. To train an LVGP, the values of z(·) are estimated
simultaneously with the other hyperparameters. With a constant mean (m(·) = c)

Fig. 11.4 Illustration of the latent space in LVGP

LVGP with N data points {X, y} and N × N covariance matrix Σ(τ, z(·)) defined by
Eq. (11.7), its negative log-likelihood is

l(c, \sigma, \tau, z(\cdot)) = \frac{N}{2} \ln\left( 2\pi\sigma^2 \right) + \frac{1}{2} \ln\left| \Sigma(\tau, z(\cdot)) \right| + \frac{1}{2\sigma^2} (y - c)^T \Sigma(\tau, z(\cdot))^{-1} (y - c)     (11.8)

where z(·) is a discrete function mapping categories to 2D continuous values. Equa-


tion (11.8) is to be minimized for the maximum likelihood estimation of the hyper-
parameters. Note that only relative distances between categories are used in LVGP,
therefore without loss of generality, its first level is always set to be at the origin
point (0, 0), and the second one is on the positive x axis. Rotational and translational
symmetries are removed henceforth.
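A hedged Python/NumPy sketch of the mixed-variable covariance in Eq. (11.7) is shown below; each categorical level is looked up in a small table of 2D latent coordinates. The latent values and material names are made-up illustrative assumptions, whereas in LVGP they are estimated by MLE together with the other hyperparameters.

import numpy as np

def lvgp_kernel(xc1, xt1, xc2, xt2, sigma2, tau, latent):
    """Covariance of Eq. (11.7): quantitative part uses roughness parameters tau,
    qualitative part uses squared distances between 2D latent coordinates."""
    quant = np.sum(tau * (xc1 - xc2) ** 2)
    qual = sum(np.sum((latent[k][a] - latent[k][b]) ** 2)
               for k, (a, b) in enumerate(zip(xt1, xt2)))
    return sigma2 * np.exp(-quant - qual)

# Illustrative latent coordinates for one categorical variable with 3 levels
latent = [{"steel": np.array([0.0, 0.0]),
           "aluminum": np.array([0.8, 0.0]),
           "CFRP": np.array([1.5, 0.9])}]

k = lvgp_kernel(np.array([0.3, 1.2]), ["steel"],
                np.array([0.4, 1.0]), ["CFRP"],
                sigma2=1.0, tau=np.array([2.0, 0.5]), latent=latent)
print(k)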
We have shown that the LVGP method is more efficient and accurate compared to
the other GPs for qualitative variables (Qian et al. 2008; Rebonato and Jäckel 2011;
Zhou et al. 2011; Zhang and Notz 2015; Deng et al. 2017). Since it is ultimately a
GP, it can produce the point estimation of any given mixed-variable inputs as well as
its uncertainty in the form of a Gaussian distribution required by BO. Moreover, it
encodes each category, which may be represented by a variety of characteristics, with
only two latent values. For example, hypothetically a material may be parameterized
by its physical properties, which can be a high-dimensional list with lots of unrelated
variables depending on the design objectives. LVGP automatically compresses these
representations during the training process, and the resulting latent space can be
viewed as the projection of this high-dimensional representation space onto a plane
such that similar categories (i.e. those with smaller Euclidean distance in the latent
space) yield similar predictions. Note that this category-to-numerical mapping is
unique to the LVGP’s output. For example, Aluminum might be similar to steel in

a tensile strength model but closer to CFRP in the latent space of a product weight
model.
Note that GP models do not generally scale well for big data applications due to
the inherent matrix inversion operations (on a matrix of the size of the data) during
model fitting and inference. However, they are well-suited for BO in the context
of ICME since ICME often encounters small-data scenarios and BO aims at finding
less (more efficient) samples to build the GP toward the optima under the assumption
that the objective function (i.e. the ICME simulations) is computationally expensive.
Consequently, the resulting dataset is generally small enough to be modeled by GP-
based approaches due to resource constraints. On the other hand, for extremely high-
dimensional design problems that do require a large number of samples to model,
GP might be impractical and dimension reduction and feature selection techniques
(Li et al. 2017a) could be applied to reduce the dimensionality of the problem before
a surrogate model is trained for BO.

11.2.3 Constrained Bayesian Optimization

Multiple objectives often coexist in ICME decision making, e.g. performance opti-
mization, cost minimization, and compliance of prespecified design requirements.
For multi-objective optimization problems without constraints, it is not uncommon
that the most valued objective is kept while the remainders are converted to design
constraints in engineering practices (see, for example, the ε-constraint method in
Miettinen 1998). Note that Bayesian optimization, in its original formulation, is
an unconstrained optimization algorithm. To take into account the constraints in
ICME design, a constrained BO algorithm is needed, preferably based on the LVGP
metamodel for its power to consume mixed-variable inputs.
We develop our approach based on the following optimization problem

\max_{x} \; y = f(x), \quad \text{s.t.} \; g(x) \leq 0     (11.9)

It is assumed that the objective and constraint functions can only be evaluated
through expensive functions (i.e. high-fidelity computer simulations), hence all models
are emulated by LVGPs. Denote the performance of the current best design by y* and a new
query point and its predicted performance by {x, y}. Under the assumptions of LVGP, y is a
realization of a Gaussian process. Letting the marginal distribution at x be Y, the predicted
improvement (for maximization problems) is defined as
 
I(x) = \max\left(0,\; Y - y^{*}\right)     (11.10)

Taking the expectation on both sides of Eq. (11.10) leads to the popular expected
improvement (EI(x)) acquisition function in BO. For constrained optimization, first


let the constraint and its prediction be g(x) = y_c and ŷ_c. Similarly, y_c is modeled by
a random variable Y_c, marginalized from the presumed GP. Following the method
proposed by Gardner et al. (2014), we define a feasibility indicator function Δ(x)

\Delta(x) = \begin{cases} 1, & \text{if } Y_c \leq 0 \\ 0, & \text{if } Y_c > 0 \end{cases}     (11.11)

It is not difficult to see that Eq. (11.11) represents a Bernoulli random variable
that takes the value 1 when the constraint is satisfied. The constrained improvement
is defined as

\mathrm{IC}(x) = \Delta(x) I(x) = \Delta(x) \max\left(0,\; Y - y^{*}\right)     (11.12)

which is nonnegative and only nonzero when the constraint is satisfied. Note that
Y and Y_c are modeled independently. By taking the expectation of the improve-
ment function, the EIC (expected improvement with constraints) acquisition function
becomes the product of the expected improvement and the probability of feasibility (PF),
i.e. the probability that the constraint is satisfied,

\mathrm{EIC}(x) = E\left[ \Delta(x) \max\left(0,\; Y - y^{*}\right) \right] = \mathrm{PF}(x)\,\mathrm{EI}(x)     (11.13)

For multiple constraints, if we assume that they are mutually independent, then the
PF part of the EIC criterion becomes the product of the individual probabilities of feasibility.
The constrained Bayesian optimization (CBO) is constructed by adopting
Eq. (11.13) as the acquisition function in BO. For mixed-variable problems, we
extend x to x = (xc , xt ) (continuous and categorical variables) and opt for LVGP
to estimate the distributions of Y and YC . We use LVGP-CBO to denote this mixed-
variable and constraint-bounded optimization approach. The PF term in Eq. (11.13)
lowers the expected improvement of points that are more likely to be infeasible. As
a result, they are less likely to be chosen for evaluation.
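A minimal Python/SciPy sketch of the EIC acquisition in Eq. (11.13) is given below, assuming Gaussian predictions for the objective and for each independent constraint; all numerical inputs are illustrative and the probability of feasibility is computed as the probability that each predicted constraint value stays at or below zero.

import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, y_best):
    sigma = max(sigma, 1e-12)
    z = (mu - y_best) / sigma
    return (mu - y_best) * norm.cdf(z) + sigma * norm.pdf(z)

def eic(mu, sigma, y_best, con_mu, con_sigma):
    """Expected improvement with constraints: EI times the probability that every
    predicted constraint value Yc stays at or below zero (feasibility)."""
    p_feasible = np.prod([norm.cdf((0.0 - m) / max(s, 1e-12))
                          for m, s in zip(con_mu, con_sigma)])
    return p_feasible * expected_improvement(mu, sigma, y_best)

# One objective prediction and two independent constraint predictions
print(eic(mu=1.3, sigma=0.2, y_best=1.0, con_mu=[-0.5, -0.1], con_sigma=[0.3, 0.2]))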

11.3 Application to Concurrent Structure and Material Design

We demonstrate the use of LVGP-CBO for concurrent material and structure opti-
mization through the design of a thin-walled hat section example for performance
improvement and weight reduction. A hat section is a popular structural model to
demonstrate the mechanical performance of materials in automotive engineering
(Schneider and Jones 2005; Debski et al. 2020; Xu et al. 2020). As shown in Fig. 11.5,
a closed hat is formed by joining two components together, a top hat and a backplate,
which may have separate geometry, materials, and processing specifications.

Fig. 11.5 Schematic of the hat section

We simulate the performance of a hat section design through an integrated


sequence of physics-based simulations (Fig. 11.6, including those having been
published in Zhang et al. 2016b; Lin et al. 2017; Xu et al. 2017; Chen et al. 2018;
Sun et al. 2018; Bostanabad et al. 2018b). The integrated simulation program first
generates the CAD model of the hat section based on the input geometry variables
and sends it, along with the selected materials and process conditions, to the process
simulations corresponding to the choice of materials (compression molding and curing
for short and long fiber composites, respectively). The resulting local microstruc-
ture parameters (e.g. fiber orientation states and volume fractions) are assigned to
the CAD model’s mesh grids and multiscale structural simulations are executed to
evaluate the design’s performance considering the processing-induced microstruc-
ture heterogeneity. The detailed description of input design variables and output
performance criteria are given in the following subsections. The simulation models
are integrated using ESTECO modeFRONTIER, a commercial software platform
for design automation and optimization. This section focuses on the hat structure
example: Sect. 11.3.1 discusses the integrated material-structure model, Sect. 11.3.2
presents the choice of design variables, constraints, and objective function, Sect. 11.3.3
covers LVGP modeling and validation, and Sect. 11.3.4 applies CBO to obtain the
optimized results.

Fig. 11.6 The integrated simulation models for design evaluation



11.3.1 The Integrated Material-Structure Model

The integrated models (Fig. 11.6) allow a comprehensive design of the hat section
from many aspects. The potential input variables include
1. Material selection (Fig. 11.7a). For each of the two components (hat and back-
plate), one of the following candidate materials is chosen: steel, aluminum, unidi-
rectional (UD) CFRP (two possible fabric layups, [0◦ /90◦ ] and [0◦ /60◦ / − 60◦ ]),
woven CFRP ([0◦ /90◦ ]), and chopped fiber CFRP (also known as SMC, sheet
molding compounds).
2. Fiber angle. A rotation of the fabrics, in multiples of ±15◦ , may be applied to
UD and woven fabrics to create more angle selections other than the prespecified ones.
For example, we may have UD [15◦/105◦] as a result of UD [0◦/90◦] with a 15◦
rotation.
3. Thickness (Fig. 11.7c). For each component, a thickness between 1.2 and 4.2 mm
may be chosen. The variable is continuous for steel, aluminum, and chopped fiber
component, and discrete for UD and woven CFRP as it must be multiples of the
fabric thickness, which is 0.4 mm for UD and woven [0◦ /90◦ ], and 0.6 mm for
UD [0◦ /60◦ / − 60◦ ].
4. Hat height (Fig. 11.7c). The height of the hat can be altered within ±1 mm
compared to a reference design, i.e. it falls within the range [−1, 1].
5. Charge design (Fig. 11.7b). The shape of the initial charge for the sheet molded
chopped fiber CFRP. There are two alternatives.

Fig. 11.7 Schematic of the design variables, including a material selections, b process conditions
for SMC, and c part geometry

Fig. 11.8 Schematic of the performance simulations, including a stiffness, b strength, c fatigue
life, d crashworthiness, and e criteria of crashworthiness

6. Compression molding parameters (Fig. 11.7b). Two possible compression


speeds and three possible levels of compression forces are available for every
component made by chopped fiber composites.
This integrated model has variable inputs, the size of which depends on the choice
of materials for each component. Additionally, it requires a mixture of continuous,
discrete, and categorical variables as its inputs. For the outputs, the options are
1. Weight and cost. They are directly calculated from the part geometry, material
selection, and a cost model including both materials and labor.
2. Stiffness (Fig. 11.8a). It is defined as the displacement of the center node of the
backplate model after a unit load is applied in an Abaqus simulation.
3. Strength (Fig. 11.8b). For metal materials, it is the equivalent plastic strain at
the center of the backplate after a unit load is applied; for CFRP, the Tsai-Wu
failure criterion (Tsai and Wu 2016) is used. It is computed in Abaqus.
4. Crashworthiness (Fig. 11.8d, e). Two quantities are computed after a simulated
impact test using LS-DYNA, the peak load, and displacement from contact to
peak load.
5. Fatigue life (Fig. 11.8c). The number of loading cycles before fatigue failure is
predicted using an nCode computational model.

11.3.2 Design Variables, Constraints, and Objectives

The list of design variables, along with the number of levels, is summarized in
Table 11.1. To reduce the complexity of the problem, we assume the process condi-
tions are the same for both components (hat and backplate), which leads to 10 design

variables in total (2 materials, 5 geometric parameters including 2 angles, 2 thickness


selections, and 1 height, and 3 SMC processing parameters). For LVGP modeling,
we first design experiments (simulation samples) and run integrated simulations to
obtain the performances, then model each of them via separate LVGPs.
We design the experiments for categorical and non-categorical variables indepen-
dently. The categorical ones are created by random permutation, while the remaining ones
are generated by optimal Latin hypercube sampling (Jin et al. 2005). Then the two input
tables are assembled to form the final DoE. Note that the variable type of thickness
depends on the choice of material (discrete for long fiber CFRP and continuous for
the rest). For simplicity, this variable is regarded as continuous in the DoE table
and only rounded to the nearest level for UD and woven composites during model
evaluation (while the original values are used during LVGP training). A total of 160
points is created.
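A simple Python sketch of how such a mixed design table could be assembled is given below (Latin hypercube sampling for the continuous columns via SciPy and random permutation for the categorical ones); the column names, bounds, and level labels are illustrative assumptions and not the exact settings of this study.

import numpy as np
from scipy.stats import qmc

rng = np.random.default_rng(4)
n = 160

# Continuous/discrete columns from a Latin hypercube in [0, 1], then rescaled
sampler = qmc.LatinHypercube(d=3, seed=4)
cont = qmc.scale(sampler.random(n), [1.2, 1.2, -1.0], [4.2, 4.2, 1.0])  # thicknesses, height

# Categorical columns by random permutation of the candidate levels
materials = ["steel", "aluminum", "SMC", "UD_0_90", "UD_0_60", "woven"]
hat_mat = rng.permutation(np.resize(materials, n))
plate_mat = rng.permutation(np.resize(materials, n))

doe = [dict(hat_thk=c[0], plate_thk=c[1], height=c[2], hat=h, plate=p)
       for c, h, p in zip(cont, hat_mat, plate_mat)]
print(doe[0])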
The integrated model generates multiple performance metrics for each design
in the DoE. We aim to find lightweight composite designs to replace the metal
version. Therefore, the mechanical performances are set as constraints while the
weight and cost are chosen as the optimization objectives. The design objectives and
constraints are summarized in Table 11.2. Note that cost-weight, cost-durability (i.e.
mechanical performances), and weight-durability are typically conflicting goals, as
(1) lightweight materials like CFRP are more expensive, and (2) lowering costs and
reducing weights by reducing the materials used generally decreases durability. As
a result, the optimization problem is a challenging one.
The baseline design is the minimal thickness (1.2 mm) all-steel model. Since in
general steel is the cheapest material among all candidates, the baseline model has
the lowest cost.

Table 11.1 ICME design variables for the thin-walled hat section component
Group Name Type Choices Hat Backplate
Material Material selection Categorical 6 1 1
Material Fiber angle Discrete 12 1 1
Geometry Thickness Continuous n/a 1 1
Geometry Hat height Continuous n/a 1 0
Process Charge design Categorical 2 1 1
Process Compression speed Discrete 2 1 1
Process Compression force Discrete 3 1 1
Note that if a categorical variable has only two levels, say 0 and 1, modeling it as categorical or
continuous makes no difference since the distance between the levels (to be estimated in LVGP) can
also be accounted for via the roughness parameter τ when it is modeled as a continuous variable
in regular GP. Therefore, we treat charge design as a continuous variable, and it leaves us with two
categorical variables, material selection for the two components. The ordinal (discrete) variables
are treated as continuous during LVGP modeling and validation.

Table 11.2 ICME design targets for the thin-walled hat section component
Name Type Goal
Cost Objective Minimize
Weight Objective Minimize
Stiffness Constraint No worse than baseline
Strength (metal) Constraint No worse than baseline
Strength (composites) Constraint ≤1
Crashworthiness Constraint No worse than baseline
Fatigue life Constraint No worse than baseline

11.3.3 LVGP Modeling and Validation

Among the 160 DoE points, 128 (80%) of them are used for training the LVGP
model and the remaining 32 (20%) are reserved for model validation. All inputs and
outputs are normalized to [0, 1] to improve the convergence of model training. Both
mean squared error (MSE) and mean absolute relative error (MARE) are computed
  

for validation. For an array of predictions (ŷ_1, ŷ_2, ..., ŷ_N) and the truth (y_1, y_2, ..., y_N),

\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \left( \hat{y}_i - y_i \right)^2, \qquad \mathrm{MARE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{\hat{y}_i - y_i}{y_i} \right|     (11.14)
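These two error measures amount to a couple of Python/NumPy lines; the arrays below are purely illustrative.

import numpy as np

def mse(y_pred, y_true):
    """Mean squared error of Eq. (11.14)."""
    return np.mean((y_pred - y_true) ** 2)

def mare(y_pred, y_true):
    """Mean absolute relative error of Eq. (11.14)."""
    return np.mean(np.abs((y_pred - y_true) / y_true))

y_true = np.array([1.0, 2.0, 4.0])
y_pred = np.array([1.1, 1.9, 4.2])
print(mse(y_pred, y_true), mare(y_pred, y_true))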

The values are listed in Table 11.3. Low MSEs are observed for almost all outputs,
showing good average prediction performance. Note that absolute errors are more
sensitive to outliers. Since outputs 5 and 6 (strength) are developed for metal and
composite materials respectively, low accuracy is expected when the model tries to
predict the wrong criterion (e.g. the Tsai-Wu criterion, developed for composites,
when the design is all metal), which explains the high MARE for strength predictions.
Due to the high computational cost of the crashworthiness simulations, the calculation
sometimes exits prematurely, when an internal algorithm interrupts the simulation after
deciding that the hat section will break shortly. In this case, the distance is marked

Table 11.3 LVGP thin-walled hat section model validation results
Name MSE MARE
1 Weight 5.6 × 10−6 0.0063
2 Cost 0.0029 0.0255
3 Stiffness 0.0023 0.0699
4 Fatigue life 0.0257 0.0616
5 Strength (Metal) 0.0016 0.415
6 Strength (Composites) 0.0031 0.2206
7 Crashworthiness (Peak force) 0.0084 0.492
8 Crashworthiness (Distance) 0.02 2.102

as −1, which corrupts the crashworthiness data and accounts for the high MARE in
output 7 and 8.
A few selected latent spaces are plotted in Fig. 11.9. The first row shows the latent
relation between material selections for the cost model. The linear alignment of five
of the materials indicates that the LVGP model is able to find the latent representation
of the levels of the categorical variable, which is in fact a 1D variable, the unit price.
Note that although theoretically unit price is the only factor affecting the cost through
material selection because the volume of the hat section is fully specified by the height
and thickness variables, in our model, the realized volume is also material-dependent
because the U1 material (UD CFRP with fabric [0/60/−60]) has one more layer than
the others and has a different unit thickness (0.6 mm as opposed to 0.4 mm for U2
and W). It means choosing this material will affect the cost model in two dimensions:
unit price and the realized thickness. For example, if we assume the unit price of
woven, U1, and U2 are p, k1 p, and k2 p, for thickness values that do not require
round-up for all them (i.e. common multiples of 0.4 and 0.6 such as 2.4 mm), the
cost ratio among the three materials (given all other design variables are equal) will
be precisely 1 : k1 : k2 ; however, for other values, the final thickness (after the
round-up) will change, and we will observe a different cost ratio between U1 and the
rest. An input thickness of 2.0 mm will end up with 2.0 mm U2/woven and 1.8 mm
U1 components, yielding a ratio of 1 : 0.9k1 : k2. Therefore, the data should suggest that U1
will influence the cost through both unit price and thickness. This phenomenon is
captured by LVGP and reflected in Fig. 11.9 in that it does not fall into the line of the
other materials formed in the latent space. Similarly, the 2D scattering of the points
in the latent spaces in Fig. 11.9c–f suggests that the influences of material selection
on these mechanical performances are also multi-dimensional, which indicates that
there are potentially many aspects (e.g. stiffness, strengths, etc.) of the materials that
may ultimately influence the performance, i.e. the categorical variables cannot be
modeled as a single variable in predicting mechanical performances.
Another perspective to examine the LVGP model is through τ , its roughness
parameters for continuous variables in Eq. (11.8). These parameters essentially place
a weight on each predictor when the correlation between points is determined. An
example, the parameters for the cost model, is shown in Table 11.4. Note that in the
MLE process, we set 10−3 as the lower bound of the parameters, which means
those at −3 in the table have attained the lowest possible value and are
most likely irrelevant for predictions. It is evident that although the model is trained
purely on data, it has captured the knowledge that, among all continuous variables, the
cost of the product is a function of the geometry design (thicknesses, i.e. the volume),
and that it is more sensitive to the hat's thickness, as the hat has a larger cross-sectional
area (thus a larger volume change per unit thickness change).

Fig. 11.9 The latent spaces of the cost, fatigue life, composites strength model (top to bottom) for
top hat material and backplate selection (left to right). (F = steel, A = Aluminum, S = SMC, U1
= UD [0/60/−60], U2 = UD [0/90], W = Woven [0/90])

Table 11.4 The roughness parameters of LVGP in the cost model
k Name log10(τk)
1 Height −3
2 Hat thickness −0.641
3 Backplate thickness −0.9752
4 Hat angle −3
5 Backplate angle −3
6 Compression speed −3
7 Compression force −3
8 Charge design −3

11.3.4 LVGP-CBO Setup and Design Results

We start with 24 DoE points to construct the initial LVGP models for optimization
and EIC estimation, and run 300 iterations of LVGP-CBO. Three design trials are
performed, as reported in Table 11.5.
The trials are intended to test the single- and combined-objective optimization
capabilities of the proposed approach. The weights w1, w2 in the objective function
in Trial 3 are chosen such that the two individual objectives are normalized to a similar
scale. Ideally, one should run the physical simulations for additional data during the
Bayesian optimization; however, for the purpose of demonstration, we query a
higher-fidelity metamodel, LVGP, built with 160 data points, for design evaluation.
The optimization history and results are reported in Fig. 11.10 and Tables 11.6,
11.7, and 11.8. Some design variables are not shown here. For the low-cost design,
it can be observed that the algorithm repeatedly switches between aluminum, steel,
UD, and woven composite materials, with a strong preference for metals, to search
for the optimal design. Note that steel is the cheapest material in the model. The
baseline design with minimal thickness should have attained the theoretical lower
bound of the cost, i.e. it is impossible to be improved; however, the optimization
history still shows a 5.4% improvement over the initial randomly sampled designs
(from 115.54 to 109.32). The last three designs have very similar (less than 1.1%)
costs compared to the lower bound (Table 11.5), showing LVGP-CBO's capability to
find close-to-global optima.
If the cost is not an issue, the optimal design for weight reduction (Trial 2) will
be most likely an all-composites design favoring UD and Woven CFRP for their

Table 11.5 ICME optimization experiments for the thin-walled hat section component
Trial Objective Baseline performance
1 Cost ($) 109.17
2 Weight (kg) 4.1
3 w1·Cost + w2·Weight, w1 = 1/110, w2 = 1/8 6.12

Fig. 11.10 Optimization history for hat section designs a low-cost design b low weight design
c combined low cost and weight design

Table 11.6 Low-cost design history


ID Hat (thickness) Backplate (thickness) Height Cost Weight
1 UD [0/90] (1.6) Al (2.14) 0.91 115.54 1.5
2 Al (2.58) Al (2.88) 0.06 113.67 2.8
3 Steel (1.29) Steel (2.37) −0.22 112.95 5.1
4 Al (1.99) Steel (1.37) 0.17 110.39 3.0
5 Woven (1.2) Al (1.66) 0.57 109.71 1.1
6 Al (2.12) Al (1.23) 0.8 109.32 1.8

Table 11.7 Low weight design history


ID Hat (thickness) Backplate (thickness) Height Cost Weight
1 UD [0/90] (1.2) SMC (3.38) −0.84 121.28 1.2
2 UD [0/90] (2.0) UD [0/90] (1.6) −0.01 121.61 1.1
3 UD [0/90] (1.2) Woven (1.2) 0.56 113.18 0.76

Table 11.8 Combined cost and weight minimization history


ID Hat (thickness) Backplate (thickness) Height Cost Weight
1 SMC (2.36) UD [0/90] (2.8) −0.55 125.91 1.4
2 UD [0/90] (1.2) Al (2.23) −0.65 113.17 1.4
3 UD [0/90] (1.6) SMC (2.67) 0.02 120.80 1.1
4 UD [0/90] (2.0) SMC (1.29) 0.05 117.70 0.93
5 UD [0/90] (1.2) SMC (1.71) −0.82 115.81 0.86

If cost is not an issue, the optimal design for weight reduction (Trial 2) will most
likely be an all-composites design favoring UD and woven CFRP for their superior
specific strength. Since the loading in the constraint evaluation is unidirectional,
UD composites are, by intuition, a better choice for the load-bearing component (the top
hat), which is reflected in the designs in Tables 11.7 and 11.8. Compared to the all-
steel baseline, an 81.5% weight reduction is achieved with this low-weight design, at
the cost of a 3.7% budget increase. The final design weighs 36.7% less, with a 6.7%
cost increment, compared to the initial random search result (design 1).
Among all CFRP choices, short fiber composites such as SMC cost less than
long fiber composites such as UD and woven ones. The final backplate choice in
Table 11.8, when cost and weight are considered together, reflects the cost saving
offered by SMC and the high (directional) strength of UD composites. It can also be
observed that the algorithm switches from an infeasible design to a feasible one.
Although the final design in Table 11.8 does not offer an improvement over the one in
Table 11.7, the search history toward it shows the algorithm's capability to choose
materials (ID 1–3) and optimize part geometry and material design (ID 4–6) within
one optimization framework. It beats the baseline with a 79% weight reduction at a
6.1% increase in cost.

11.4 Application to Concurrent Material and Process Design

Concurrent material and process design through LVGP-CBO is demonstrated using
an injection-molded short fiber reinforced polymer (SFRP) composites example.
The design space is formed by several mixed-type variables, including fiber selec-
tion (categorical), volume fraction (continuous), and multiple processing conditions
(categorical, continuous, and discrete). After integrated molding and tensile test
simulations, we save the resulting strain fields of the test samples and aim to reduce
strain concentration and limit the maximum strain through better material and process
design. This section covers the optimization of an injection-molded tensile coupon:
Sect. 11.4.1 discusses the integrated process–structure–property model, and
Sect. 11.4.2 covers the design variables, constraints, and objective for SFRP design.
LVGP modeling and validation are presented in Sect. 11.4.3, and CBO is applied in
Sect. 11.4.4 to obtain the optimized results.

11.4.1 The Integrated Process–Structure–Property Model

The injection-molded SFRP model is built on process simulations via Moldflow and
performance simulations via LS-DYNA. The flow of information and computational
models is given in Fig. 11.11g. We simulate the injection molding process of an
ASTM D638 tensile coupon under different conditions and assess its mechanical

Fig. 11.11 a–d Schematics of the design variables of the SFRP model e–f Schematics of the
process and performance simulations g The workflow for design evaluation

performance under tensile loading. The fibers’ aspect ratio is fixed at 5. The model
inputs are
1. Fiber choice (Fig. 11.11a). The fibers in the SFRP may be either glass or carbon.
2. Fiber volume fraction (Fig. 11.11b). The volume fraction of the fibers is a
continuous value ranging in [0.05, 0.2].
3. Mold material (Fig. 11.11c). The material of the mold, which affects the cooling
rate, can be chosen from aluminum, steel, copper, and brass.
4. Maximum injection rate (Fig. 11.11d). Three discrete choices are available: low,
medium, and high, corresponding to 100, 2000, and 5000 cm³/s, respectively.
5. Maximum injection pressure (Fig. 11.11d). Similar to the injection rate, with the
low, medium, and high levels corresponding to 50, 100, and 180 MPa, respectively.
6. Mold surface temperature (Fig. 11.11d). A continuous value between [20, 80] °C.
7. Melt temperature (Fig. 11.11d). The temperature we set for the melted SFRP
material to flow into the mold, a continuous value ranging between [200, 280] °C.
The process simulation model predicts the local orientation state of the short fibers
and outputs a field of the second-order orientation tensors (Fig. 11.11e). It can be
seen that the microstructures are heterogeneous. They are converted to local material
properties via a micromechanics model in Moldflow and sent to LS-DYNA for a
tensile test simulation (Fig. 11.11f), with one end fixed and a 0.5 m/s velocity applied
to the other end, resulting in a field of local strains. We examine the Green strain
fields for performance evaluation.

11.4.2 Design Variables, Constraints, and Objectives for SFRP Design

We summarize the design variables in Table 11.9. The designs are varied by switching
the mold and the fiber materials and choosing from a collection of continuous
and discrete parameters for material design (fiber volume fraction) and processing
conditions.
The two categorical variables create a total of 8 possible combinations. For
simplicity, we treated the two discrete variables as continuous, and designed a total
of 96 simulation experiments based on a sliced Latin hypercube (Ba et al. 2015) with
8 slices and 12 points per slice.
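The layout of this experiment can be emulated with standard tools, as sketched below, where one Latin hypercube of 12 points is drawn per categorical combination. This is only a simplified stand-in for the optimal sliced Latin hypercube design of Ba et al. (2015), which additionally optimizes the space-filling property of the combined design; the variable names and seeding are illustrative.

import itertools
import numpy as np
from scipy.stats import qmc

fibers = ["glass", "carbon"]
molds = ["aluminum", "steel", "copper", "brass"]
n_per_slice, n_continuous = 12, 5   # VF, mold T, melt T, rate, pressure

design = []
for s_idx, (fiber, mold) in enumerate(itertools.product(fibers, molds)):  # 8 slices
    lhs = qmc.LatinHypercube(d=n_continuous, seed=s_idx)
    unit = lhs.random(n=n_per_slice)                   # points in [0, 1]^5
    # Scale to the physical ranges of Table 11.9 (discrete variables
    # treated as continuous, as in the text)
    lower = [0.05, 20.0, 200.0, 100.0, 50.0]
    upper = [0.20, 80.0, 280.0, 5000.0, 180.0]
    scaled = qmc.scale(unit, lower, upper)
    for row in scaled:
        design.append({"fiber": fiber, "mold": mold, "x": row})
# len(design) == 96 simulation experiments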
The design variables are transferred to Autodesk Moldflow for process simu-
lations, and their outputs, the fields of local SFRP orientation states (Fig. 11.11),
are translated to local material property fields via a built-in material model for LS-
DYNA tensile test simulations. The final output is a 6D strain field, with each local
node containing Green strains in 6 directions (x, y, z, xy, yz, xz). We summarize
these massive data by extracting the following statistics of the strain fields:
1. Mean, the average value of the strain fields in all 6 directions. It characterizes
the general trend of the deformation,
2. Max, the maximum of the absolute value of the strain fields in all 6 directions,
which indicates the most extreme deformation across the field,
3. Standard deviation, denoted in the following tables and figures by sd, of the
strain fields in all 6 directions. It is an indicator of the range of the values in terms
of the local deformation, and
4. Short-range spatial correlation, computed by searching for the closest 1% of node
pairs in the FEA mesh and calculating their empirical correlation. This metric can
be viewed as a quantitative measure of the smoothness of the fields (Fig. 11.12); a
minimal computational sketch of these four statistics is given after this list.
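The sketch below assumes a NumPy array strain holding one Green strain component per node and an array coords of nodal coordinates; the function name and the brute-force pair search are illustrative choices, not the actual post-processing scripts used in this study.

import numpy as np
from scipy.spatial.distance import pdist, squareform

def field_statistics(strain, coords, frac=0.01):
    """Summary statistics of one nodal strain component."""
    stats = {
        "mean": strain.mean(),
        "max": np.abs(strain).max(),
        "sd": strain.std(),
    }
    # Short-range spatial correlation: empirical correlation of the strain
    # values over the closest `frac` fraction of node pairs. The all-pairs
    # distance matrix is brute force and only suitable for small meshes;
    # a neighbor search (e.g., a KD-tree) would be used for large models.
    dists = squareform(pdist(coords))
    iu = np.triu_indices_from(dists, k=1)
    pair_d = dists[iu]
    n_keep = max(1, int(frac * pair_d.size))
    closest = np.argsort(pair_d)[:n_keep]
    a = strain[iu[0][closest]]
    b = strain[iu[1][closest]]
    stats["smoothness"] = np.corrcoef(a, b)[0, 1]
    return stats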
The resulting dataset is partly visualized in Fig. 11.13 via scatter plots (discrete
variables not shown).

Table 11.9 Design variables for the injection-molded SFRP tensile coupon
Group Name Type Choices
Material Fiber material Categorical 2: {glass, carbon}
Material Mold material Categorical 4: {aluminum, steel, copper, brass}
Material Fiber volume fraction Continuous [0.05, 0.2]
Process Mold surface temperature Continuous [20, 80] °C
Process Melt temperature Continuous [200, 280] °C
Process Max injection rate Discrete 3: {100, 2000, 5000} cm³/s
Process Max injection pressure Discrete 3: {50, 100, 180} MPa

Fig. 11.12 Examples of the strain fields from the tensile simulations of injection-molded SFRP.
a a high smoothness example, b a low smoothness example

It is evident that among the continuous variables, the fiber volume fraction is
strongly correlated with the standard deviation and the spatial correlation of the
strain fields, while the melt temperature visibly has a strong influence on all
performance criteria.
For the design example, we hypothesized an engineering use case, with the
optimization objective and constraint defined by

\max \; r_x \, r_y \, r_{xy} \qquad (11.15)

\text{s.t.} \quad \max\!\left(\varepsilon_x^G\right) + \max\!\left(\varepsilon_y^G\right) + \max\!\left(\varepsilon_{xy}^G\right) \le 0.3

where ri is the short-range spatial correlation (smoothness) of the i-th direction, and
max(εiG ) denotes the maximal value in the Green strain field εiG (i = x, y, x y). All
outputs are normalized to [0, 1] using the observed ranges among the 96 simulations.
The design objective ensures the fields are as smooth as possible to avoid strain
concentration, while the design constraint is set so that the maximal values in the
resulting x-, y-, and x y- Green strain fields are bounded. In other words, weak
material designs will be rejected.
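For concreteness, the normalized objective and constraint of Eq. (11.15) could be assembled from such field statistics as sketched below; the dictionary keys and the normalize helper are assumptions made for illustration, not the project code.

def normalize(value, lo, hi):
    """Scale a raw response to [0, 1] using its observed range (lo, hi)."""
    return (value - lo) / (hi - lo)

def objective_and_constraint(stats, ranges):
    """Eq. (11.15): maximize the product of the three smoothness values,
    subject to the sum of normalized maximal Green strains staying below 0.3."""
    rx = normalize(stats["x-smoothness"], *ranges["x-smoothness"])
    ry = normalize(stats["y-smoothness"], *ranges["y-smoothness"])
    rxy = normalize(stats["xy-smoothness"], *ranges["xy-smoothness"])
    objective = rx * ry * rxy

    g = sum(normalize(stats[k], *ranges[k]) for k in ("x-max", "y-max", "xy-max"))
    feasible = g <= 0.3
    return objective, g, feasible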

Fig. 11.13 The scatter plots of continuous design variables and performances, with categorical
variables marked by colors and shapes. Responses regarding the z direction are not shown here

11.4.3 LVGP Modeling and Validation

84 of the 96 data points, half from glass fiber designs and half from carbon, are randomly
selected to build the LVGP. The remaining 12 are reserved for validating the model
via the MSE and root MSE (RMSE) of the model predictions. The RMSE has the
same unit as the output; therefore, it can be directly compared with the response values
as a measure of accuracy. The validation results regarding the x and y directions are listed
in Table 11.10; z is the thickness direction of the tensile coupon, and compared
to the strain fields in the x and y directions, the values in the z-related ones are typically
negligible. It can be seen that the MSEs are small, and the RMSEs are typically smaller
than the response data by at least one order of magnitude, indicating great predictive
capability of the trained LVGP.
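The validation procedure amounts to a simple holdout comparison, as sketched below with a hypothetical model.predict interface.

import numpy as np

def holdout_validation(model, x_test, y_test):
    """MSE and RMSE of surrogate predictions on reserved validation points.

    The RMSE is in the units of the response, so it can be compared directly
    with the observed range of the output (as in Table 11.10).
    """
    y_pred = model.predict(x_test)
    mse = np.mean((y_pred - y_test) ** 2)
    return {"MSE": mse, "RMSE": np.sqrt(mse),
            "min": y_test.min(), "max": y_test.max()}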
A latent space for the mold materials from the y-smoothness model is plotted
in Fig. 11.14. Although there is no direct physical interpretation of the locations of
these candidate materials, their nonlinear alignment suggests that different materials
influence the results in multiple aspects.

Table 11.10 LVGP SFRP model validation results


Name MSE RMSE Min Max
1 x-mean 1.3 × 10−8 1.2 × 10−4 −6.4 × 10−3 −4.6 × 10−3
2 x-max 7.6 × 10−9 8.7 × 10−5 8 × 10−3 1 × 10−2
3 x-sd 1.4 × 10−9 3.7 × 10−5 1.2 × 10−3 2 × 10−3
4 x-smoothness 4.6 × 10−5 7 × 10−3 0.847 0.9
5 y-mean 4.8 × 10−10 2.2 × 10−5 −4 × 10−4 −2 × 10−4
6 y-max 3.8 × 10−7 6.2 × 10−4 3.9 × 10−3 6.5 × 10−3
7 y-sd 1.6 × 10−9 3.9 × 10−5 1.3 × 10−3 1.7 × 10−3
8 y-smoothness 7.4 × 10−5 8.6 × 10−3 0.949 0.976
9 x y-mean 3 × 10−10 1.7 × 10−5 −3 × 10−5 7.2 × 10−5
10 x y-max 1.5 × 10−7 3.8 × 10−4 8.9 × 10−3 1.8 × 10−2
11 x y-sd 4.9 × 10−9 7 × 10−5 1.1 × 10−3 2 × 10−3
12 x y-smoothness 8.9 × 10−5 9.4 × 10−3 0.808 0.871

Fig. 11.14 The latent space for the mold materials in the y-smoothness model

The response surfaces for y-smoothness from different mold materials, condi-
tioned on glass fiber designs with fixed discrete variables and mold surface temper-
atures, are shown in Fig. 11.15. While Fig. 11.15a presents the complex shape of
the surfaces, the top view image in Fig. 11.15b shows that there is not a domi-
nating material that behaves uniformly better than the rest.

Fig. 11.15 a The response surfaces of different mold materials from the glass fiber y-smoothness
model regarding fiber volume fraction and melt temperature, and b is the top view of (a). The other
design variables are kept constant (0.5 for the mold surface temperature)

Table 11.11 The roughness parameters (log10 (τk )) of LVGP in modeling y-direction statistics
k Name Mean Max SD Smoothness
1 Volume fraction 0.48 0.04 0.53 1.7
2 Mold surface temperature –1.3 2.7 –0.94 –0.74
3 Melt temperature 0.95 0.45 1.1 0.27
4 Max injection rate –1.8 –3 –1.3 –0.5
5 Max injection pressure –3 –3 –3 –3

Therefore, if we want to perform design optimization for a metric such as y-smoothness,
the search will involve switching among the candidate materials while concurrently
tuning the combination of the continuous variables, which can be realized with LVGP.
As stated before, the roughness parameters, ranging in [10−3 , 103 ], indicate
the importance of non-categorical predictors. Those from the y-direction statis-
tics models are summarized in Table 11.11. Their magnitudes generally agree
with the trend in Fig. 11.13, as positive values are observed for predictors with
at least moderate correlation with the field statistics. It can be concluded that for
the y-direction results, the volume fraction and melt temperature are two important
variables in determining the predicted values, while the max injection pressure is
noninfluential.

11.4.4 LVGP-CBO Setup and Design Results

We start the LVGP-CBO with 32 initial points randomly sampled from the 96-point
DoE, then continue for 300 iterations of optimization. When the initial designs do
not include a feasible one, the algorithm selects a random point from them as the

pseudo-optimum, and proceeds to finding feasible solutions. The optimization
experiment is repeated 10 times with different random starting points. Similar to the hat
section example, for demonstration purposes, we configure the algorithm to query
a high-fidelity LVGP surrogate model (built with 96 data points) for design perfor-
mance evaluation, while ideally high-fidelity simulations or experiments should be
performed to collect the performance data.
A typical optimization history (from the 10 repeated experiments) is shown in
Figs. 11.16 and 11.17a. They show how the designs are gradually improved and
move from the infeasible to feasible space. The design details in Table 11.12 indicate
that a good design consists of low fiber volume fraction and high melt temperature,
which help decrease the max strains and increase the smoothness simultaneously. The
rest of the design variables are not close to their respective upper or lower bounds,
showing that the optimal performance cannot be attained by choosing the extremes
within the design space.
The 10 optimal designs from the repeated experiments are highlighted in
Fig. 11.17b. Note that the 32 initial points are randomly selected from the 96-point
dataset and only 3 out of 96 have constraint function values less than 0.3. It can
be seen that although the majority of them are not feasible, the optimal designs
found by LVGP-CBO are not only feasible but have higher smoothness than those in
the original dataset. It shows that CBO can explore uncertain regions in the design
space (when most of the known regions are infeasible) to search for better designs
satisfying the constraint.

Fig. 11.16 Optimization history for injection-molded SFRP designs. The fiber and mold material
are given under the design ID

Fig. 11.17 a Optimization history and b optimal designs in the performance space for injection-
molded SFRP designs

Table 11.12 Smoothness maximization history (continuous variables)


ID Fiber VF Mold surface T Melt T Max injection rate Max injection pressure
1 0.083 48.4 222.9 100 180
2 0.11 30.2 271.6 4072.6 98.9
3 0.058 46.4 266.8 3450.4 92.0

It can be concluded that LVGP-CBO has the capability of searching for feasible, optimal
designs in a mixed-variable and constrained design space, and it is therefore very
promising for the concurrent design of CFRP materials and processes in the context of ICME.

11.5 Conclusions

In this chapter, LVGP-CBO, a constrained Bayesian optimization approach using
latent-variable Gaussian processes for multi-objective mixed-variable design tasks,
is presented and applied to concurrent process, materials, and geometry design in
ICME. The method showcases one possible use of machine learning and optimiza-
tion to conquer the challenges in the integrated design for ICME, where extremely
costly process-structure-performance simulations and mixed-type design variables
are prevalent. LVGP, the ML model adopted in this work, learns from high-fidelity but
scarce training data, and predicts the probability distribution of design performance
given materials, processing, and geometry design represented by mixed categorical
and continuous variables. Validation results show that sufficient accuracy can be
achieved. Its hyperparameters, including the latent space trained from data, offer

insights into the physical phenomenon, such as the underlying similarity between
levels of categorical variables, and the importance ranking of design variables.
The LVGP model inherits all advantages of GP modeling, one of which is to offer
uncertainty prediction for sequential sampling. Combined with Bayesian Optimiza-
tion considering constraints, LVGP-CBO adds constrained efficient global optimiza-
tion on top of the predictive ML model. The proposed approach needs only a small
number of data points to start with, and continuously searches for designs that are
most likely to be optimal as predicted by LVGP in the mixed-variable design space.
The subsequent data collection (i.e. simulations) is guided by the CBO algorithm to
balance the exploration of uncertain regions and the exploitation of high-potential
sites for the optimal design. In contrast, the commonly used gradient-based or
evolutionary optimization methods only look for improved designs in each iteration
(as opposed to optimal ones) and are therefore less efficient. As a result, the framework
allows multi-objective, constrained, and mixed-variable optimization with fewer
required simulations, which makes it suitable for ICME design of process, materials,
and structures in a concurrent fashion.
To demonstrate the approach, we established two ICME workflows with fully
bridged part-scale manufacturing processes and structural performance simulations,
integrated with material-scale microstructure and material property models so that the
processing-induced local microstructure evolution is taken into account for design
evaluation. Both workflows require mixed-variable inputs and generate multiple
performance criteria as design objectives.
We first apply LVGP-CBO to concurrent material and part geometry design
of thin-walled hat sections with a focus on adopting CFRP composites for
lightweighting. Specifically, the goal is to reduce the weight and cost of the compo-
nent given the flexibility to choose among a variety of metal and CFRP materials, the
component’s geometry, and processing conditions, without sacrificing the mechan-
ical performance benchmarked by the all-steel design. Optimization results show that
LVGP-CBO is an efficient design framework that simultaneously searches for better
combinations of materials, microstructures, processing, and product geometry.
A weight reduction of 81.5% can be achieved at the cost of a 3.7% budget increase
by replacing metal parts with composite ones and optimizing their geometries and
material designs.
The framework is also tested by concurrent process and material design of
injection-molded SFRP parts. The goal is to lower the strain concentration under
tensile loads while the designer has the freedom to choose both fiber and mold
materials and the SFRP's processing parameters. Toward the goal of reducing strain
concentration constrained by the maximal local strain, LVGP-CBO is shown to realize
automatic fiber and mold material selection, along with the injection molding parameter
optimization. With the integrated process–structure–property–performance simulations
for design evaluation, the presented LVGP-CBO algorithm is applicable to other
concurrent design problems targeting manufacturability and high performance in ICME.

Acknowledgements The financial support from Ford Motor Company and the US Department of
Energy (Award Number: DE-EE0006867) is acknowledged.

References

Ankenman B, Nelson BL, Staum J (2008) Stochastic kriging for simulation metamodeling. In:
Proceedings—winter simulation conference, pp 362–370. https://doi.org/10.1109/WSC.2008.
4736089
Ba S, Myers WR, Brenneman WA (2015) Optimal sliced latin hypercube designs. Technometrics
57:479–487. https://doi.org/10.1080/00401706.2014.957867
Bostanabad R, Kearney T, Tao S et al (2018a) Leveraging the nugget parameter for efficient Gaussian
process modeling. Int J Numer Methods Eng 114:501–516
Bostanabad R, Liang B, Gao J et al (2018b) Uncertainty quantification in multiscale simulation
of woven fiber composites. Comput Methods Appl Mech Eng 338:506–532. https://doi.org/10.
1016/J.CMA.2018.04.024
Chen Z, Huang T, Shao Y et al (2018) Multiscale finite element modeling of sheet molding compound
(SMC) composite structure based on stochastic mesostructure reconstruction. Compos Struct
188:25–38. https://doi.org/10.1016/j.compstruct.2017.12.039
Conti S, Gosling JP, Oakley JE, O’Hagan A (2009) Gaussian process emulation of dynamic computer
codes. Biometrika 96:663–676
Daddona DM, Antonelli D (2018) Neural network multiobjective optimization of hot forging.
Procedia CIRP 67:498–503. https://doi.org/10.1016/J.PROCIR.2017.12.251
Debski H, Rozylo P, Teter A (2020) Buckling and limit states of thin-walled composite columns
under eccentric load. Thin-Walled Struct 149:106627. https://doi.org/10.1016/J.TWS.2020.
106627
Deng X, Lin CD, Liu K-WK-W, Rowe RKK (2017) Additive gaussian process for computer models
with qualitative and quantitative factors. Technometrics 59:283–292. https://doi.org/10.1080/
00401706.2016.1211554
Fang H, Rais-Rohani M, Liu Z, Horstemeyer MF (2005) A comparative study of metamodeling
methods for multiobjective crashworthiness optimization. Comput Struct 83:2121–2136. https://
doi.org/10.1016/j.compstruc.2005.02.025
Gardner JR, Kusner MJ, Xu ZE et al (2014) Bayesian optimization with inequality constraints. 32
Huang T, Gao J, Sun Q et al (2021) Stochastic nonlinear analysis of unidirectional fiber compos-
ites using image-based microstructural uncertainty quantification. Compos Struct 260:113470.
https://doi.org/10.1016/J.COMPSTRUCT.2020.113470
Iyer A, Zhang Y, Prasad A et al (2019) Data-centric mixed-variable bayesian optimization for
materials design. In: Proceedings of the ASME design engineering technical conference 2A-
2019. https://doi.org/10.1115/DETC2019-98222
Jin R, Chen W, Simpson TW (2001) Comparative studies of metamodelling techniques under
multiple modelling criteria. Struct Multidiscip Optim 23:1–13. https://doi.org/10.1007/s00158-
001-0160-4
Jin R, Chen W, Sudjianto A (2005) An efficient algorithm for constructing optimal design of
computer experiments. J Stat Plan Inference 134:268–287
Li Y, Chen Z, Xu H et al (2017b) Modeling and simulation of compression molding process for
sheet molding compound (SMC) of chopped carbon fiber composites. SAE Int J Mater Manuf
10:130–137
Li J, Cheng K, Wang S et al (2017a) Feature selection: a data perspective. ACM Comput Surv 50.
https://doi.org/10.1145/3136625
Lin S-P, Chen Y, Zeng D, Su X (2017) Meso-modeling of carbon fiber composite for crash safety
analysis. In: WCX17: SAE world congress experience
Miao J-M, Cheng S-J, Wu S-J (2011) Metamodel based design optimization approach in promoting
the performance of proton exchange membrane fuel cells. Int J Hydrogen Energy 36:15283–
15294. https://doi.org/10.1016/j.ijhydene.2011.08.070
Miettinen K (1998) Nonlinear multiobjective optimization 12. https://doi.org/10.1007/978-1-4615-
5563-6
Nocedal J, Wright S (2000) Numerical optimization

Olson GB (1997) Computational design of hierarchically structured materials. Science 277:1237–
1242. https://doi.org/10.1126/science.277.5330.1237
Qian PZG, Wu H, Wu CFJ (2008) Gaussian process models for computer experiments with
qualitative and quantitative factors. Technometrics 50:383–396
Rao C, Liu Y (2020) Three-dimensional convolutional neural network (3D-CNN) for heterogeneous
material homogenization. Comput Mater Sci 184:109850. https://doi.org/10.1016/J.COMMAT
SCI.2020.109850
Rasmussen CE (2003) Gaussian processes in machine learning. Lecture notes in computer science
(including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
3176:63–71. https://doi.org/10.1007/978-3-540-28650-9_4
Rebonato R, Jäckel P (2011) The most general methodology to create a valid correlation matrix for
risk management and option pricing purposes
Schneider F, Jones N (2005) Impact of thin-walled high-strength steel structural sections. 218:131–
158. https://doi.org/10.1243/095440704772913927
Shahriari B, Swersky K, Wang Z et al, Taking the human out of the loop: a review of bayesian
optimization. 1–24
Simpson W, Peplinski J, Koch PN, Allen JK (2001) Metamodels for computer-based engineering
design: survey and recommendations. Eng Comput 17:129–150
Sobol IM (2001) Global sensitivity indices for nonlinear mathematical models and their Monte
Carlo estimates. Math Comput Simul 55:271–280. https://doi.org/10.1016/S0378-4754(00)002
70-6
Song K, Choo KH, Kim J-H, Mavris DN (2017) Multi-objective decision making of a simplified car
body shape towards optimum aerodynamic performance. In: ASME 2017 international design
engineering technical conferences and computers and information in engineering conference.
American Society of Mechanical Engineers
Su X, Wagner D (2017) IV. 6 Integrated computational materials engineering development of carbon
fiber composites for lightweight vehicles. Ford Motor Company. Lightweight Materials 2016
Annual Report 177
Su X, Wagner D (2019) Integrated computational materials engineering development of carbon
fiber composites for lightweight vehicles. United States
Sun Q, Meng Z, Zhou G et al (2018) Multi-scale computational analysis of unidirectional carbon
fiber reinforced polymer composites under various loading conditions. Compos Struct 196:30–
43
Tsai SW, Wu EM (2016) A general theory of strength for anisotropic materials. 5:58–80. https://
doi.org/10.1177/002199837100500106
Xu H, Li Y, Zeng D (2017) Process integration and optimization of ICME carbon fiber composites
for vehicle lightweighting: a preliminary development. SAE Int J Mater Manuf 10:274–281
Xu H, Zhou Q, Yang L, et al (2020) Reconstruction of full-field complex deformed shapes of
thin-walled special-section beam structures based on in situ strain measurement 23:3335–3350.
https://doi.org/10.1177/1369433220937156
Zhang Y, Notz WI (2015) Computer experiments with qualitative and quantitative variables: a
review and reexamination. Qual Eng 27:2–13
Zhang L, Wang K, Chen N (2016a) Monitoring wafers’ geometric quality using an additive Gaussian
process model. IIE Trans 48:1–15
Zhang W, Ren H, Wang Z et al (2016b) An integrated computational materials engineering method
for woven carbon fiber composites preforming process. In: AIP conference proceedings, p
170036

Zhang Y, Apley DW, Chen W (2020) Bayesian optimization for materials design with mixed quanti-
tative and qualitative variables. Sci Rep 10:1 10:1–13. https://doi.org/10.1038/s41598-020-606
52-9
Zhang Y, Tao S, Chen W, Apley DW (2019) A latent variable approach to gaussian process modeling
with qualitative and quantitative factors. 62:291–302. https://doi.org/10.1080/00401706.2019.
1638834
Zhou Q, Qian PZG, Zhou S (2011) A simple approach to emulation for computer models with
qualitative and quantitative factors. Technometrics 53:266–273
Chapter 12
Machine Learning Interatomic Potentials: Keys to First-Principles Multiscale Modeling

Bohayra Mortazavi

12.1 Introduction

In conventional continuum mechanics simulations, prior to starting the calculations,
material and interaction properties are required to be provided as inputs
to the models. As such, the accuracy and reliability of simulations for a given problem
directly correlate with the material and interaction input parameters. Moreover, in
realistic practical applications, temperature, loading rates, and existing defects can
also affect the material properties. On this basis, in order to employ a more advanced
constitutive law in a continuum mechanics model, more elaborate input data are
required. For conventional bulk materials, various experimental tests have been
standardized and classified to obtain the required material properties for various
constitutive laws. For example, with the aid of the conventional uniaxial tension test,
the ultimate tensile strength, maximum elongation, and reduction of area can be
investigated under various loading rates and temperatures. In comparison with
conventional bulk materials, an elaborate and accurate understanding of the heat
transport or mechanical properties of nanomaterials and nanostructures is, however,
drastically more complex, time-consuming, and expensive. In these cases, diverse and
in some cases unknown sources of uncertainty can affect the accuracy, reliability, and
reproducibility of experimental measurements. Therefore, the development of accurate
and robust theoretical approaches to elaborately probe the heat transport and mechanical
properties of nanomaterials is highly advantageous in order to enhance the design
process and minimize the

B. Mortazavi (B)
Institute of Photonics, Department of Mathematics and Physics, Leibniz Universität Hannover,
Appelstraße 11, 30167 Hannover, Germany
e-mail: [email protected]
Cluster of Excellence PhoenixD (Photonics, Optics, and Engineering–Innovation Across
Disciplines), Leibniz Universität Hannover, Hannover, Germany


necessity of complex experimental tests. Quantum mechanics-based calculations,
such as density functional theory (DFT) simulations, offer excellent accuracy
as well as exceptional reproducibility of results, but with excessive computational
costs for large systems. Classical empirical interatomic potentials can be afforded
for systems with millions of atoms, but they nonetheless suffer from untrustworthy
accuracy. It is thus intuitively clear that an approach combining the fundamentals of
quantum mechanics and classical empirical interatomic potentials may yield optimal
solutions. Machine learning interatomic potentials (MLIPs) have been suggested to
be the most promising approach toward this goal, with the inherent accuracy and
flexibility of quantum mechanics methods and a computational efficiency comparable
to that of empirical interatomic potentials. In particular, MLIPs offer the possibility of
first-principles multiscale modeling, in which the quantum mechanics accuracy and
flexibility can be bridged to the continuum scale to study thermal, mechanical, and
vibrational properties, with an ultimate goal of developing fully automated platforms
for structural and materials design. In this chapter, various aspects of this novel
possibility will be discussed. We first briefly discuss conventional methods and MLIPs
for extracting atomic interactions, and next we describe the basic concepts of MLIPs.
We then highlight the bottlenecks of quantum mechanics and empirical interatomic
potentials in the evaluation of materials and structural properties. To finish, we
elucidate the novel concept of first-principles multiscale modeling and discuss the
practical prospects for autonomous materials and structural design.

12.2 Methods for Exploring Interatomic Forces

12.2.1 Quantum Mechanics

The quantum mechanics models solve the electronic structure of a system and hence
evaluate the interaction of electrons and nuclei on the basis of the electronic structure
information. These models may also estimate the interatomic energy by calculating the
electronic interatomic bonds. Within the popular Born–Oppenheimer approximation,
the atomic nuclei, or so-called "atoms," are treated as classical particles, while the
electrons are treated as quantum mechanical particles. The DFT method is currently
the most extensively employed quantum mechanics solution, which shows exceptional
accuracy and computational efficiency, particularly for crystalline and highly
symmetrical structures. Herein the complicated theoretical background of DFT will
not be discussed, but it is worth noting that the main drawback of this approach
is that the computational cost increases exponentially with the number of atoms.
Moreover, vacuum in plane-wave DFT also adds computational burden. Currently,
DFT calculations are limited to studying very small systems, consisting of a few
hundred atoms with conventional processors or a few thousand atoms, achievable
only with advanced supercomputers. For the majority of problems in mechanics

and thermodynamics, however, one only deals with forces, energies, and stresses, and
consequently the data related to the electronic structure are of no use.
Based on this elementary concept, empirical or machine learning interatomic potentials
are able to provide a substantially reduced computational cost with only a marginal
sacrifice of accuracy. Apart from the computational cost, in Sect. 12.4 we will discuss
a few other bottlenecks of the DFT method in the analysis of thermal and mechanical
properties.

12.2.2 Empirical Interatomic Potentials

Atomic forces can be derived from an empirical interatomic potential function, which
is one of the most computationally efficient approaches. Mechanical forces in an
atomistic system can be divided into conservative and nonconservative ones. Conservative
forces only depend on the positions of the particles, irrespective of instantaneous
velocities and of the trajectories between different positions. Dissipative or gyroscopic
forces, such as mechanical friction, are nonconservative. For an atomistic system
with only conservative forces, one can define a specific function, U, which is called
the potential energy and depends solely on the coordinates of the particles,

U = U(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N) \qquad (12.1)

This function has the following general structure:


U(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N) = \sum_i U_1(\mathbf{r}_i) + \sum_{i,\, j>i} U_2(\mathbf{r}_i, \mathbf{r}_j) + \sum_{i,\, j>i,\, k>j} U_3(\mathbf{r}_i, \mathbf{r}_j, \mathbf{r}_k) + \cdots \qquad (12.2)

where r_i are the position vectors of the particles, and the function U_m is the m-body
potential. U_1 naturally represents the energy functional due to an external force
field, such as gravity. The second term gives the potential energy of the pair-wise
interactions of the particles, the third gives the three-body components, and so on. On
this basis, the function U_1 is the external potential, U_2 is the pair-wise or two-body
term, and U_m, where m > 2, is a multibody term. Among the most well-known two-body
empirical interatomic potentials, one can refer to the Lennard–Jones, Morse, or
Buckingham potentials. The Lennard–Jones potential can take various forms, of which
the so-called "12–6" type is expressed as follows:
U_{LJ}(r_{ij}) = 4\varepsilon \left[ \left( \frac{\sigma}{r_{ij}} \right)^{12} - \left( \frac{\sigma}{r_{ij}} \right)^{6} \right] \qquad (12.3)

where ε is the depth of the potential well and σ is the equilibrium distance. Two-body
empirical interatomic potentials are mostly used to describe non-bonding or
long-range interactions. The higher-order multibody terms of the potential function
(m > 2) are generally required for modeling more complex interactions in solids,
since they allow chemical bonds, topology, and the spatial arrangement of atoms to be
accounted for more accurately. For example, in three-body potentials, the force
between two connected atoms depends not only on their positions, but also on
the positions of third atoms within a defined cutoff distance. The Tersoff potential
(Tersoff 1988), one of the most popular three-body potentials, which has been
extensively used to study covalent systems such as graphene and diamond, takes
the following form:
U_T(r_{ij}) = f_c(r_{ij}) \left[ A_{ij}\, e^{-\lambda_{ij} r_{ij}} - B_{ij}\, e^{-\mu_{ij} r_{ij}} \right] \qquad (12.4)

where A_{ij} is a constant, and


B_{ij} = \left( 1 + \beta^n \zeta^n \right)^{-\frac{1}{2n}} \qquad (12.5)

\zeta = \sum_{k \ne i,j} f_c(r_{ik})\, \omega_{ij}\, g\!\left(\theta_{ijk}\right) \qquad (12.6)

g\!\left(\theta_{ijk}\right) = 1 + \frac{c^2}{d^2} - \frac{c^2}{d^2 + \left( h - \cos\theta_{ijk} \right)^2} \qquad (12.7)

f_c(r_{ij}) =
\begin{cases}
1, & r_{ij} < R_{ij} \\
\frac{1}{2} + \frac{1}{2} \cos\!\left( \pi \, \dfrac{r_{ij} - R_{ij}}{S_{ij} - R_{ij}} \right), & R_{ij} < r_{ij} < S_{ij} \\
0, & r_{ij} > S_{ij}
\end{cases} \qquad (12.8)

As is clearly observable, in comparison with the Lennard–Jones potential with only two
unknowns, the Tersoff potential for covalent systems requires more constants to better
describe the bonds and angles between atoms. A specific form of multibody potential
is given by the embedded atom method (EAM) potentials, which are usually employed
for studying metallic systems, in which the atoms are embedded in an electron gas host.
The EAM potentials take the following functional form:

U_{EAM}(r_{ij}) = \sum_i L_i\!\left( \rho_{h,i} \right) + \frac{1}{2} \sum_i \sum_{j \ne i} \phi_{ij}(r_{ij}) \qquad (12.9)

where ρ_{h,i} is the host electron density at atom i due to all other background atoms
in the system, L_i[ρ] is the energy required to embed atom i into the background electron
density ρ, and φ_{ij} is a pair-wise component between atoms i and j. The host electron
density can be expressed as:
\rho_{h,i} = \sum_{j \ne i} \rho^{*}(r_{ij}) \qquad (12.10)

where ρ* is the electron density of atom j as a function of the interatomic separation
distance r_{ij}. Although the empirical interatomic potential functions follow clear
and understandable physical concepts and are computationally very efficient and
stable, their accuracy is questionable. In Sect. 12.4 we will discuss how extensively
used interatomic potentials fail to reproduce the mechanical and thermal properties
of simple systems like graphene. Moreover, developing a parameter set for an
empirical interatomic potential requires a complex fitting procedure that is only
feasible for experienced groups, which, beyond the common accuracy concern, makes
these potentials inefficient for the study of novel materials and structures by the
majority of researchers.
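To make the contrast with the potentials discussed later more concrete, the simplest of the above forms, the 12-6 Lennard-Jones pair potential of Eq. (12.3), can be evaluated in a few lines; the analytical force follows from the derivative with respect to the pair distance, and the numerical parameter values below are illustrative placeholders only.

import numpy as np

def lennard_jones(r, epsilon=1.0, sigma=1.0):
    """Energy and force magnitude of the 12-6 Lennard-Jones pair potential,
    Eq. (12.3): U = 4*eps*[(sigma/r)**12 - (sigma/r)**6]."""
    sr6 = (sigma / r) ** 6
    energy = 4.0 * epsilon * (sr6 ** 2 - sr6)
    # F = -dU/dr = (24*eps/r) * (2*(sigma/r)**12 - (sigma/r)**6)
    force = 24.0 * epsilon * (2.0 * sr6 ** 2 - sr6) / r
    return energy, force

# Example usage over a range of pair separations (argon-like parameters,
# roughly eps = 0.0103 eV and sigma = 3.4 Angstrom, given for illustration)
r = np.linspace(3.0, 8.0, 50)
U, F = lennard_jones(r, epsilon=0.0103, sigma=3.4)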

12.2.3 Machine Learning Interatomic Potentials

During the last decade, machine learning methods have been extensively employed
to accelerate the evaluation of various physical properties of materials (Ouyang et al.
2021; Novikov et al. 2021; Hu et al. 2020; Chakraborty et al. 2020). Among the various
fields, one of the successful applications has been machine learning-based interatomic
potentials, which can be employed in conventional molecular dynamics simulations or
directly utilized to evaluate interatomic forces and calculate desired physical properties,
such as the phonon dispersion relation. MLIPs belong to the class of nonparametric
interatomic potentials, with the goal of providing quantum mechanics-level accuracy
at a computational cost of the order of empirical interatomic potentials. According to
the terminology of regression analysis, interatomic potentials are either parametric or
nonparametric (Shapeev 2016). The main difference between these two approaches is
the capability of nonparametric interatomic potentials to reach a higher level of accuracy
by systematically increasing the number of their parameters, which consequently imposes
higher computational costs. In order to achieve high accuracy in rather large systems,
nonparametric potentials are the more suitable choice. A parametric interatomic potential
like Tersoff, on the other hand, takes a fixed number of parameters for all studied systems,
which as expected limits the accuracy, though the computational efficiency is robust and
unchallenged. A MLIP consists of two basic elements: the "descriptors" and the
"regression model," which itself is a function of the descriptors. The descriptors capture
the atomic environment, irrespective of the type of the studied system, with a cutoff
function for computational efficiency. Among various possibilities, Behler–Parrinello
descriptors (2007) and bispectrum coefficients (Thompson et al. 2015) are currently
among the most popular descriptors for developing MLIPs (Yanxon et al. 2020). It is
worth noting that the type of descriptors may substantially affect the performance of
a MLIP (Thompson et al. 2015). The regression model is the second basic element of
MLIPs. For the development of MLIPs, there are various regression methods, such as
linear/polynomial regression,

Kernel methods, and artificial neural networks (Behler and Parrinello 2007; Behler
2015). MLIPs are developed on the basis of locality in the interatomic interactions,
which means that the total energy of an atomistic configuration x, approximated by a
function E(x; θ) with parameters θ, can be partitioned into contributions of individual
atoms, each a function of the environment of that atom:

E(x; \theta) = \sum_i V(\mathbf{r}_i; \theta) \qquad (12.11)

where r_i is the collection of vectors connecting the ith atom, x_i, with the other atoms
in its environment within a predefined cutoff distance R_cut. It can be seen that
the function V is the site energy and is equivalent to the interatomic potential. As
mentioned earlier, linear/polynomial regression, Kernel methods, and artificial neural
networks can be used for the representation of the site energy V. In MLIPs, a large
set of descriptors is used to describe the atomic environments, in order to ensure a
reliable reconstruction of any reasonable environment.
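The locality assumption of Eq. (12.11) can be illustrated with a deliberately simplified sketch: each atom is characterized by a few radial descriptor values of its neighborhood within the cutoff, the site energy is a linear function of these descriptors, and the total energy is the sum over atoms. Actual MLIPs use far richer, symmetry-invariant descriptors and more powerful regressors; the basis choice and function names here are purely illustrative.

import numpy as np

def radial_descriptors(positions, i, r_cut=5.0,
                       centers=(1.5, 2.5, 3.5, 4.5), width=0.5):
    """Toy descriptor of atom i: Gaussian radial histogram of its neighbors,
    damped by a smooth cutoff so the descriptor vanishes at r_cut."""
    rij = np.linalg.norm(positions - positions[i], axis=1)
    rij = rij[(rij > 1e-8) & (rij < r_cut)]          # exclude atom i itself
    fc = 0.5 * (np.cos(np.pi * rij / r_cut) + 1.0)   # smooth cutoff function
    return np.array([np.sum(np.exp(-((rij - c) / width) ** 2) * fc)
                     for c in centers])

def total_energy(positions, theta):
    """Eq. (12.11): total energy as the sum of linear site energies V(r_i; theta)."""
    D = np.array([radial_descriptors(positions, i) for i in range(len(positions))])
    site_energies = theta[0] + D @ theta[1:]         # one energy per atom
    return site_energies.sum()

# theta would be obtained by regressing reference (e.g., DFT) energies and
# forces on these descriptors; here it is just a placeholder parameter vector.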
The machine learning methodology for constructing nonparametric interatomic
potentials, like its other counterparts, relies mostly on the data. MLIPs thus follow
the same concept as other machine learning methods, meaning that with the aid of
sufficiently large data, the importance of prior knowledge concerning the underlying
physics becomes less critical. As such, after defining a potential function, the
corresponding parameters are fitted to the quantum mechanics data. As in a routine
machine learning methodology, the performance/accuracy can subsequently be tested,
and if needed either the potential parameters or the training data can be modified. In
comparison with empirical interatomic potentials, the main challenge of MLIPs
originates from their strong dependency on the quantum mechanics data. This means
that the configurations or atomic environments that evolve during a simulation should
be compatible with the quantum mechanics dataset encountered during the training
process. For example, while a Tersoff potential parameterized for pristine graphene
(Lindsay 2010) could be employed to study the thermomechanical properties of highly
defective, so-called amorphous graphene (Mortazavi et al. 2016), a MLIP trained for
pristine graphene cannot be reliably used for defective systems. On this basis, MLIPs
show a common transferability issue, which can be resolved by enhancing the training
data or adopting an active learning methodology, in which new configurations with
unexplored atomic environments are gradually included in the training data, and
subsequently new MLIPs are re-trained.

12.3 Developing a Machine Learning Interatomic Potential

After gaining a basic understanding of the concept of MLIPs, in this section
we briefly discuss the standard procedure for developing a MLIP and highlight
the common challenges.

12.3.1 Popular Machine Learning Interatomic Potentials

For practical usage by an inexpert user, various MLIPs are already available that
can be trained using quantum mechanics datasets. The following are the
mainstream methods:
Neural network potentials. Behler and Parrinello (2007) pioneered the advances
in the field of MLIPs in 2007 by proposing the concept of neural network
potentials (NNPs), with rotation- and permutation-invariant descriptors and a neural
network regression approach. To date, NNPs have been the most extensively used
MLIPs (Behler 2014, 2016). For the practical development of NNPs, several platforms
have been developed, among which one can refer to RuNNer (Artrith et al. 2011;
Behler n.d.), RubNNet4MD (Brieuc et al. n.d), ænet (Artrith and Urban 2016), n2p2
(Singraber et al. 2019), SIMPLE-NN (Lee et al. 2019), and PyXtal-FF (Yanxon et al.
2021). A novel approach is the deep tensor neural network (Schütt et al. 2017), which
has recently been extended to conduct molecular dynamics simulations employing
the DeePMD-kit (Wang et al. 2018a).
Gaussian approximation potentials. Gaussian approximation potentials (GAPs)
(Bartók et al. 2010) are among the early types of highly accurate MLIPs. GAPs use
smooth overlap of atomic positions (SOAP) kernel descriptors and the Gaussian process
regression approach. GAPs are well known for their outstanding accuracy and ability
to build transferable MLIPs (Rowe et al. 2020), but exceedingly high computational
costs are their main drawback. QUIP (Bartók and Csányi 2015) is a platform for the
practical development of GAPs.
Moment tensor potentials. Proposed by Shapeev (2016), moment tensor potentials
(MTPs) are among the most accurate and computationally efficient MLIPs. The
descriptors of MTPs are based on moment tensors, which share similarities
with those of NNPs. The accuracy of MTPs for the examination of thermal transport
(Mortazavi et al. 2021a, 2020a; Liu et al. 2021; Arabha and Rajabpour 2021; Arabha
et al. 2021) and mechanical (Rowe et al. 2020; Mortazavi et al. 2021b, 2022a) properties
has been confirmed by several studies. The MLIP package (Novikov et al. 2021) can
be used for the efficient training of MTPs.
Spectral neighbor analysis potentials. Another family of MLIPs are the
spectral neighbor analysis potentials (SNAPs) (Thompson et al. 2015). In SNAPs,
basis functions are constructed as cubic polynomials of the spherical harmonic
expansion coefficients of the (nonsmooth) atomic density, similar to that used in GAPs.
Like MTPs, regression is employed to find the parameters of SNAPs. The PyXtal-FF
package (Yanxon et al. 2021) can be used to develop SNAPs, alongside NNPs.
Neuroevolution machine learning potentials. Although they share similar basics
with NNPs, neuroevolution potentials (NEPs) (Fan et al. 2021) are among the latest
MLIPs; they use descriptors based on Chebyshev and Legendre polynomials
and a feedforward neural network regression model. The GPUMD package (Fan et al.
2022) is available for the training of NEPs. In a recent study (Ying et al. 2023),
NEPs could accurately reproduce the complex mechanical responses of single-layer
fullerene networks as compared with DFT results.

Fig. 12.1 Test error versus computational cost for the Mo system (Reprinted from Zuo et al. 2020,
Copyright 2020, American Chemical Society)

For a user with limited knowledge of MLIPs, the choice of an efficient method for
a particular problem is not a straightforward decision. Although MLIPs can be
conveniently developed for a given system, their computational costs are rather high.
Moreover, if the accuracy is to be tested against expensive quantum mechanics
simulations, the computational efficiency suffers even more. Therefore, to ease this
dilemma for a new user, comparative studies are highly beneficial. In one of the first
comparative studies, published in 2020 (Zuo et al. 2020), GAP, MTP, NNP, and SNAP
potentials were developed for the Li, Mo, Cu, and Ni metals and the Si and Ge covalent
systems. Comparison of the MLIP results with those by empirical interatomic
potentials confirms that all considered MLIPs yield considerably higher accuracy,
revealing their superiority in predicting energies and forces. The results shown in
Fig. 12.1 suggest that MTPs offer the best tradeoff between accuracy and computational
cost. It is clear that GAPs can yield the highest accuracy, but with almost two orders
of magnitude higher computational costs than MTPs. Nonetheless, the aforementioned
findings are based on a particular case study, and therefore while the predicted
difference in the computational cost of the various methods should follow similar
patterns, the accuracy may change for another problem or system. In addition, more
elaborate comparisons should be conducted for multicomponent systems.

12.3.2 Training of Machine Learning Interatomic Potentials

After deciding on the functional form of a MLIP for a particular problem, one
next has to design and implement a training process. In this regard, the next choice is
how to create the training dataset and subsequently fit the corresponding parameters.
Belonging to the family of machine learning algorithms, MLIPs are not developed to
extrapolate outside the training environment. As such, depending on the problem of
interest, the prepared dataset should be rich enough to include all relevant configurations.
Here it is instructive to consider a few examples. Ab-initio molecular dynamics (AIMD)
simulations are currently the most popular approach for creating

datasets for MLIPs. For the problem of phononic properties, like the phonon dispersion
and phonon group velocities, the methods for extracting these physical properties are
mostly based on small-displacement approaches. As such, for this problem a
MLIP should be highly accurate in capturing only small displacements around the
ground state. As shown in a recent study (Mortazavi et al. 2020b), MLIPs trained
over short, 1000-step AIMD simulations conducted at 50 K could reproduce the
phononic properties in close agreement with density functional perturbation theory
(DFPT) results. When one considers the examination of the lattice thermal conductivity
using the Boltzmann transport equation (BTE), as discussed in another recent study
(Mortazavi et al. 2021a), the incorporation of AIMD results at high temperature becomes
important in order to enhance the accuracy and better capture the anharmonic effects on
the thermal transport. For the analysis of complex mechanical properties, the AIMD
results ought to also include stretched samples in which rupture gradually occurs with
increasing temperature (Mortazavi et al. 2021b, 2022b). As is clear, depending on the
type of problem, an efficient strategy for preparing the dataset has to be adopted. In the
aforementioned three examples, the candidate structures that define the atomic
environment during the simulations were predictable, and one could produce sufficiently
large structures corresponding to those conditions. Nonetheless, without enough
knowledge about the candidate structures, as in the energy minimization of novel
compositions, one has no other way but to use active learning to assemble the training
set during the simulations.
As the general basis, quantum mechanics-based data are required for the training
of MLIPs. As mentioned earlier, AIMD simulations are currently the most popular
approach for this purpose. Unlike ground state simulations, it is possible to adjust the
temperature in AIMD simulations. High temperatures are usually equivalent to larger
deformations, and vice versa. Therefore, with AIMD simulations one has control over
the extensiveness of the atomic environment. Moreover, one major issue with quantum
mechanics-based data is the self-consistent loop for the convergence of the electronic
structure calculations. In an AIMD simulation, the converged self-consistent result of
the previous step can secure faster convergence in the next step, which can substantially
accelerate the calculations. It should nonetheless be noted that at high temperatures,
since the difference in the atomic positions of two consecutive steps may become
considerable, the self-consistent loop may converge more slowly. Another critical issue
is the size of the structures considered in the AIMD simulations. Larger structures can
be beneficial to describe more complex atomic environments; nonetheless, the
computational costs of AIMD simulations increase exponentially with the number of
atoms, which may affect the computational efficiency. Generally, for periodic systems,
the size of the system should be more than twice the cutoff distance of the potential.
Another issue is that the atomic configurations obtained by AIMD simulations are
correlated for close time steps and do not describe new useful atomic environments;
as such, the incorporation of the complete AIMD data may lead to an overfitting issue.
To avoid this problem, normally only a portion of the AIMD results is utilized for the
training of MLIPs.
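In practice, decorrelating the trajectory can be as simple as keeping every n-th AIMD frame after an equilibration period, as sketched below; the stride and offset are illustrative choices rather than universal recommendations.

def subsample_trajectory(frames, stride=10, skip_initial=100):
    """Keep every `stride`-th AIMD frame after discarding the first
    `skip_initial` equilibration steps, to reduce the correlation between
    the configurations used for MLIP training."""
    return frames[skip_initial::stride]

# Example: from a 2000-step AIMD run, this keeps 190 weakly correlated frames.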

After creating the training dataset with sufficiently large structures to capture the
atomic environment, the interatomic potential's parameters are fitted with the goal of
minimizing the difference between the energies, forces, and stresses predicted by
the MLIP and those from the quantum mechanics calculations. Since this concept is
very fundamental, herein we present the optimization function for the MTP (Novikov
et al. 2021; Shapeev 2016):

\sum_{k=1}^{K} \left[ w_e \left( E_k^{AIMD} - E_k^{MTP} \right)^2 + w_f \sum_{i=1}^{N} \left| \mathbf{f}_{k,i}^{AIMD} - \mathbf{f}_{k,i}^{MTP} \right|^2 + w_s \sum_{i,j=1}^{3} \left( \sigma_{k,ij}^{AIMD} - \sigma_{k,ij}^{MTP} \right)^2 \right] \rightarrow \min \qquad (12.12)

Here E_k^{AIMD}, f_{k,i}^{AIMD}, and σ_{k,ij}^{AIMD} represent the energy, atomic forces, and stresses,
respectively, and E_k^{MTP}, f_{k,i}^{MTP}, and σ_{k,ij}^{MTP} are the corresponding values calculated
by the fitted MTP. Moreover, K is the number of configurations in the prepared
training dataset, N corresponds to the number of atoms in every configuration, and
w_e, w_f, and w_s refer to the positive weights which express the significance of
energies, forces, and stresses in the optimization, respectively. It should be noted
that the size and diversity of the configurations in the training dataset may affect the
training weight factors. For example, for configurations with a small number of atoms,
or those in which stresses do not show critical effects, the w_s coefficient can be
selected to be close to zero. Moreover, when the dataset includes configurations with
different sizes or numbers of atoms, the number of atoms can be used as a scaling
factor in the definition of the weighting coefficients (Novikov et al. 2021). On this
basis, the definition of the weight coefficients may affect the accuracy of the developed
MLIPs, which requires tests to find optimal values. For the case of MTP in the
exploration of phononic properties (Mortazavi et al. 2020b), since the stresses of the
configurations do not play a critical role, w_e, w_f, and w_s were set to 1, 0.1, and
0.001, respectively. Another factor is the definition of the cutoff for a MLIP, which can
substantially affect the computational efficiency, especially for the employment in
molecular dynamics calculations. In this case, the chemistry and type of bonding should
be carefully examined. For example, while covalent systems like graphene and diamond
can be simulated accurately with rather short cutoff distances, ionic counterparts
generally ask for larger values. As is clear, the details of the quantum mechanics
calculations, the MLIP parameters, and the fitting process can affect the accuracy and
transferability of a MLIP, and as such, depending on the type of problem, optimal
conditions should be adjusted. This process is undoubtedly drastically different from
that of the empirical interatomic potentials, where a function with a limited number of
parameters is expected to work accurately for a wide range of problems.
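The weighted objective of Eq. (12.12) translates almost directly into code. The sketch below assumes simple Python containers of reference (AIMD) and predicted (MTP) energies, forces, and stresses, and is meant only to make the role of the weights w_e, w_f, and w_s explicit; it is not the optimizer used in the MLIP package.

import numpy as np

def training_loss(ref, pred, w_e=1.0, w_f=0.1, w_s=0.001):
    """Weighted sum-of-squares loss of Eq. (12.12) over K configurations.

    ref/pred: lists of dicts with keys
      'E' -> scalar energy,
      'F' -> (N, 3) array of atomic forces,
      'S' -> (3, 3) array of stresses.
    """
    loss = 0.0
    for r, p in zip(ref, pred):
        loss += w_e * (r["E"] - p["E"]) ** 2
        loss += w_f * np.sum((r["F"] - p["F"]) ** 2)
        loss += w_s * np.sum((r["S"] - p["S"]) ** 2)
    return loss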

12.3.3 Passive or Active Fitting

As discussed, preparing an optimal training dataset is a critical step in the development of a MLIP for a given problem. The main challenge is that the training dataset has to be sufficiently large to capture the atomic environments occurring during the simulations. If the atomic environments can be predicted and explored completely prior to the training, it becomes possible to develop a MLIP passively and conveniently employ it for the desired calculations. Since the training of a MLIP also carries a computational cost, such passive fitting minimizes the need for re-training, which not only enhances the computational efficiency but also substantially simplifies the development process. The passive fitting has been successfully
used in the evaluation of phononic properties on the basis of small displacement
approaches (Mortazavi et al. 2021a, b). On the other hand, in some cases, such as the prediction of novel materials, the complete set of atomic environments cannot be produced by AIMD calculations in advance. The concept of learning on the fly, or active learning, can be used to address this issue; it combines the training of a MLIP and the search for new structures in a single process. In this approach, molecular dynamics simulations are conducted using a fitted MLIP to explore new atomic environments. Next, the configurations with the highest extrapolation grades are identified during the molecular dynamics simulations and subsequently passed to the quantum mechanics tool to extract energies, forces, and stresses. These new quantum mechanics data are then added to the previous dataset, a new MLIP is re-trained, and this process is repeated until a MLIP with robust accuracy and stability for the considered problem is obtained. For a detailed description of this process, the interested reader is referred to previous studies (Podryabinkin and Shapeev 2017; Podryabinkin et al. 2019). The active learning approach has been
recently successfully employed for the accurate simulation of complex nanoindenta-
tion of bulk covalent materials (Podryabinkin et al. 2022), which is also schematically
shown in Fig. 12.2. Although this approach offers a robust route to developing stable and accurate MLIPs, several rounds of quantum mechanics calculations and re-fitting of the potential's parameters may be required, so the computational efficiency is considerably affected. Another technical issue is that the extrapolated atomic environments mostly occur within only a small portion of a large molecular dynamics model (in the proximity of the indenter tip, as shown in Fig. 12.2), and those regions have to be identified and used in the quantum mechanics calculations. This is a critical bottleneck, not only because of the lack of new useful information in the rest of the model, but more importantly because of the unaffordable cost of quantum mechanics calculations for large systems. Therefore, only the extrapolated regions should be used for the new quantum mechanics calculations, for which periodic boundary conditions are most probably difficult to apply. We recall that periodic boundary conditions are required for the conventionally employed plane-wave-based DFT and AIMD calculations. Clearly, although the active learning approach can address the transferability issue,

Fig. 12.2 Active learning on atomic environments for the simulation of nanoindentation of covalent materials (Reprinted from Podryabinkin et al. 2022, Copyright 2022, American Chemical Society)

not only is its practical implementation complex, but its computational cost is also substantially higher than that of passive fitting.
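The loop below is a heavily simplified Python sketch of such an active learning cycle; every callable and attribute name is a generic placeholder rather than the interface of any particular MLIP package:

    def active_learning_cycle(mlip, train_set, run_md, select_extrapolative,
                              run_dft, refit, max_iter=10):
        # Schematic learning-on-the-fly loop; the callables stand in for the MD
        # driver, the extrapolation-grade filter, the quantum mechanics code,
        # and the MLIP trainer, respectively.
        for _ in range(max_iter):
            trajectory = run_md(mlip)
            new_cfgs = select_extrapolative(trajectory, mlip)
            if not new_cfgs:  # no extrapolative environments left: MLIP is stable
                break
            for cfg in new_cfgs:
                cfg.reference = run_dft(cfg)  # energies, forces, stresses
            train_set = train_set + new_cfgs
            mlip = refit(train_set)           # re-train on the enlarged dataset
        return mlip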
An efficient possible solution is to combine the concepts of active and passive fitting, in order to minimize the need for frequent quantum mechanics calculations and subsequent MLIP re-training. Consider again the nanoindentation problem: around the indenter tip one expects the formation of amorphous configurations, whereas in the far regions the crystal structure remains intact. One can therefore include crystalline and various amorphous lattices in the original training datasets and conduct AIMD simulations at variable temperatures and under different initial strains, to artificially simulate structural transitions and failures. The complete AIMD trajectories can then be subsampled to avoid overfitting and to efficiently train preliminary MLIPs. The accuracy of the first fitted MLIPs can then be examined over the complete AIMD dataset, and the configurations with the worst extrapolation grades (Podryabinkin and Shapeev 2017) can be identified and added to the original subsampled dataset. The final MLIPs, with enhanced accuracy and stability, are then re-fitted using the improved training dataset, as sketched below. This two-step passive fitting approach has recently been applied successfully in several studies on the basis of MTPs (Mortazavi et al. 2021b, 2022a, b, c; Mortazavi 2021), with confirmed accuracy as compared with DFT results.
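A compact Python sketch of this two-step procedure is given below; `fit` and `extrapolation_grade` stand in for the MLIP trainer and the extrapolation-grade measure of Podryabinkin and Shapeev (2017), and the stride and the number of added configurations are illustrative values only:

    def two_step_passive_fit(aimd_frames, fit, extrapolation_grade,
                             stride=20, n_worst=200):
        # Step 1: train a preliminary MLIP on a subsampled AIMD set.
        subset = aimd_frames[::stride]
        mlip = fit(subset)
        # Step 2: rank the full AIMD set by extrapolation grade and re-fit after
        # adding the configurations that the preliminary MLIP describes worst.
        ranked = sorted(aimd_frames,
                        key=lambda cfg: extrapolation_grade(mlip, cfg),
                        reverse=True)
        return fit(subset + ranked[:n_worst])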

12.3.4 Current Challenges of MLIPs

Potential choice. As mentioned earlier, the accuracy of the popular MLIPs is in all cases very close to that of quantum mechanics simulations; nonetheless, it is not well established which type of MLIP, and which combination of hyperparameters, cutoff distance, and training strategy, is best suited for a particular problem. GAPs are
well known for their accuracy, but also for their expensive computational cost (Behler 2014, 2016). MTPs can be up to two orders of magnitude faster than GAPs with a comparable level of accuracy (Zuo et al. 2020). NNPs and SNAPs are other MLIPs that are more computationally efficient than GAPs, with good accuracy for complex systems. Among all available MLIPs, the MTP method is, to the best of our knowledge, the only approach with confirmed accuracy for the simulation of lattice thermal conductivity with either the Boltzmann transport equation (Mortazavi et al. 2021a; Liu et al. 2021) or molecular dynamics calculations (Mortazavi et al. 2020a; Korotaev et al. 2019) and for the evaluation of complex mechanical properties (Mortazavi et al. 2021b, 2022a, b, c; Mortazavi 2021). Other MLIPs may nevertheless yield similar performance, or may fail; therefore, direct comparisons with first-principles calculations are strongly recommended to ensure their reliability. In the end, two MTPs or GAPs developed with different hyperparameters, or trained on dissimilar datasets, are expected to yield close, but not identical, predictions.
Transferability issue. As discussed earlier, MLIPs are expected to work accurately only within the atomic environments fed into them during training. For example, a MLIP trained for a pristine structure without any defects is unstable or unreliable for defective samples. Consider DFT calculations for geometry optimization: starting from inaccurate atomic positions, the commonly used DFT schemes are expected to conveniently find the global, or at least a local, minimum-energy configuration, whereas with MLIPs this process requires a more elaborate fitting. Therefore, unless the number of required geometry optimization simulations is enormous or the structures are very large, conventional DFT is more reliable and computationally efficient than MLIPs. Looking at this challenge from another angle, a more transferable MLIP most probably yields a lower accuracy for a given lattice, and its training process becomes more complex and computationally demanding. For studying a specific system and problem one can therefore develop a dedicated MLIP, because the goal of MLIPs is to obtain accurate results for a given problem. For example, the mechanical and thermal properties of various graphene-like BC2 N lattices have recently been studied using MTPs (Mortazavi et al. 2022a), in which a separate MTP was trained for every lattice to maximize the accuracy. Transferable MLIPs are nonetheless the only viable route for the accelerated discovery of new materials and structures (Podryabinkin et al. 2019).
Long-range interactions. Long-range interactions, such as van der Waals and electrostatic interactions, exist in many important systems, and their incorporation in standard MLIPs can be highly beneficial for broadening the application range. Because of the expensive computational cost of MLIPs, increasing the cutoff distance is not only exceedingly costly but may also degrade the accuracy with which critical short-range interactions are described. To address this issue, one promising possibility is to describe short-range interactions via a standard MLIP and capture the long-range counterparts with empirical interatomic potentials, like Lennard–Jones, as has recently been accomplished for 2D vdW heterostructures (Novikov et al. 2022). Nonetheless, because of the critical importance of this aspect, novel computationally efficient approaches need to be devised, implemented, and tested.
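As a rough illustration of such a hybrid scheme (and not of the actual implementation in Novikov et al. 2022), the following Python sketch adds an empirical Lennard–Jones contribution over selected long-range pairs to the short-range MLIP energy; `mlip_energy` and `long_range_pairs` are hypothetical placeholders:

    def lennard_jones(r, epsilon, sigma):
        # Standard 12-6 Lennard-Jones pair energy, used here for the long-range part.
        x = (sigma / r) ** 6
        return 4.0 * epsilon * (x * x - x)

    def hybrid_energy(cfg, mlip_energy, long_range_pairs, epsilon, sigma):
        # Short-range interactions from the MLIP, long-range (e.g., interlayer vdW)
        # interactions from the empirical term; both callables are placeholders.
        e_short = mlip_energy(cfg)
        e_long = sum(lennard_jones(r, epsilon, sigma) for r in long_range_pairs(cfg))
        return e_short + e_long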

Magnetic systems. Despite recent advances in the simulation of magnetic systems with MTP (Novikov et al. 2022), the robustness of MLIPs for systems with complex magnetic behaviors, such as switchable magnetism or antiferromagnetism, has rarely been investigated. From basic physics, for magnetic MLIPs the spin of every atom should be included as an additional feature.
Novel functional forms. It has been shown that the electronic temperature or the value of the band gap affects the locality and smoothness of interatomic potentials (Chen and Ortner 2016, 2017). Therefore, a mathematical understanding of the properties of Born–Oppenheimer potential energy surfaces can be useful in developing more accurate functional forms for MLIPs (Shapeev n.d.).

12.4 Quantum Mechanics and Empirical Interatomic Potentials Challenges

Although quantum mechanics methods are well known for their superior accuracy, when they are combined with other methods to evaluate a desired property, like the lattice thermal conductivity, one may face several technical challenges, which are discussed in the following. On the other hand, empirical interatomic potentials are more convenient to use for a wide range of problems, but as discussed earlier they suffer from accuracy and flexibility issues when studying novel compositions. Graphene, the 2D form of carbon, with its highly symmetrical atomic lattice, lack of magnetism, and short-range covalent bonding, is among the simplest materials to investigate theoretically. In this section, we consider graphene and other graphene-like covalent systems to discuss the bottlenecks of quantum mechanics and empirical interatomic potentials in the examination of thermal transport and mechanical properties. Given the simplicity of the considered graphene-like structures, one may better appreciate the technical difficulties arising for more complex structures.

12.4.1 Thermal Transport

We first discuss the common quantum mechanics-based approach to investigate lattice thermal transport. In a material, heat is transported by two types of carriers: electrons and crystal lattice vibrations. For semiconductors and insulators, the heat transported by electrons is negligible compared with that carried by the lattice vibrations, which are quantized into phonons. According to the Boltzmann transport equation (BTE), the change of the phonon occupation under a temperature gradient stems from two main contributions: phonon drift and collisions. Collisions include phonon scattering by other phonons, structure boundaries, defects, impurities, or isotopes. In a small volume of a crystal

and under steady-state conditions no change in the number of phonons is expected, and therefore the sum of the drift and collision terms must be zero:

$$\left[\frac{\partial n(\mathbf{q})}{\partial t}\right]_{c} + \mathbf{v}(\mathbf{q})\cdot\nabla T\,\frac{\partial n(\mathbf{q})}{\partial T} = 0 \qquad (12.13)$$
where $n$ is the phonon distribution function, $\mathbf{q}$ is the phonon wave vector, and $\mathbf{v}$ is the group velocity. The term $\mathbf{v}(\mathbf{q})\cdot\nabla n(\mathbf{q})$ corresponds to the rate of change of the distribution function caused by phonon motion, and $[\partial n(\mathbf{q})/\partial t]_{c}$ refers to the same variation rate
due to collisions, leading to a steady-state condition (Ward 2009). Different pack-
ages nowadays offer the ability to estimate the thermal conductivity using the BTE method. For instance, the ShengBTE package (Li et al. 2014) determines the lattice thermal conductivity by solving the BTE using a set of interatomic force constants. A BTE solution requires both harmonic and anharmonic force constants to evaluate the lattice thermal conductivity. To obtain the anharmonic force constants, depending on the symmetry and cutoff distance, a few hundred to several thousand force calculations over relatively large supercell structures have to be performed, which can be computationally exceedingly expensive with conventional DFT methods. Moreover, when using first-principles DFT-BTE methods, the estimate of the thermal conductivity may change depending on the choice of exchange–correlation functional and on the computational details of the harmonic and anharmonic force constant calculations, such as the plane-wave cutoff energy, the K-point mesh size, the supercell size, the Q-grid of the BTE solution, or the cutoff distance used in evaluating the anharmonic force constants. Clearly, convergence tests to ensure the independence of the BTE estimate from the aforementioned parameters can be practically very burdensome.
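For orientation, once the force constants are available, the BTE solution, in its simplest single-mode relaxation time approximation (which iterative solvers such as ShengBTE subsequently refine), yields the lattice thermal conductivity tensor as the standard textbook relation

$$\kappa_{\alpha\beta} = \frac{1}{V}\sum_{\lambda} C_{\lambda}\, v_{\lambda,\alpha}\, v_{\lambda,\beta}\, \tau_{\lambda},$$

where the sum runs over the phonon modes $\lambda$ (branch and wave vector), $C_{\lambda}$ is the modal heat capacity, $v_{\lambda}$ the group velocity, $\tau_{\lambda}$ the phonon lifetime obtained from the scattering rates, and $V$ the crystal volume.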
To better understand the challenges of the DFT-BTE method in the estimation of the phononic thermal conductivity, we consider the two cases of graphene and C3 N monolayers. Several DFT-BTE-based studies (Qin et al. 2018; Taheri et al. 2018) have shown that the type of exchange–correlation functional may have significant effects on the estimated lattice thermal conductivity of graphene (see Fig. 12.3a). In contrast with these DFT-based predictions (Qin et al. 2018; Taheri et al. 2018), the estimates from the MTP-based BTE solution revealed negligible effects of the exchange–correlation functional on the lattice thermal conductivity of graphene (see the MTP results for PBE, PBEsol, and revPBE shown in Fig. 12.3b) (Mortazavi et al. 2021a). Interestingly, a more recent DFT-BTE-based study (Taheri et al. 2021) confirmed the earlier predictions of the MTP-based BTE solution (Mortazavi et al. 2021a) and found close values for the thermal conductivity of graphene with different exchange–correlation functionals (see Fig. 12.3c); more importantly, the estimated values (Taheri et al. 2021) are consistent with the earlier MLIP-based estimates (Mortazavi et al. 2021a). The authors of Taheri et al. (2021) argued that the scatter in the earlier full-DFT predictions can be associated with an inaccurate description of the quadratic dispersion of the out-of-plane acoustic mode of graphene. The second example is graphene-like C3 N, for which on

Fig. 12.3 The thermal conductivity of graphene for different exchange–correlation functionals, by
(a) Qin et al. (Reprinted from Qin et al. 2018, Copyright 2018, Elsevier), Mortazavi et al. (Reprinted
from Mortazavi et al. 2021a, Copyright 2021, Elsevier) and Taheri et al. (2021) (Reprinted from
Taheri et al. 2021, Copyright 2021, American Physical Society)

the basis of the PBE functional, using the VASP package (Kresse and Furthmüller 1996) and the BTE solution with the ShengBTE package (Li et al. 2014), the room temperature thermal conductivity was reported to be 380 (Wang et al. 2019), 128 (Kumar et al. 2017), 80 (Wang et al. 2018b), 380 (Gao et al. 2018), and 482 (Peng et al. 2018) W/m K, which unexpectedly corresponds to a six-fold scatter. In another recent study (Liu et al. 2021), the authors extended the MTP-based BTE solution (Mortazavi et al. 2021a) and included four-phonon scattering in the evaluation of the lattice thermal conductivity of bulk BAs using the MTP method, finding excellent agreement with full-DFT estimations. It can be concluded that MLIPs not only yield accurate estimates and significantly accelerate the calculations, but, since they are less dependent on the details of the AIMD simulations of the training dataset and since larger supercell sizes and cutoff distances impose almost negligible extra cost, they can also substantially facilitate the examination of thermal conductivity.
On the other hand, using the original Tersoff (Tersoff 1989), AIREBO (Stuart et al. 2000), REBO (Brenner et al. 2002), and optimized Tersoff (Lindsay and Broido 2010) empirical interatomic potentials, the room temperature thermal conductivity of graphene was estimated to be 870 (Wei et al. 2011), 709 (Hong et al. 2018a), 350 (Thomas et al. 2010), and ~3000 W/m K (Mortazavi and Rabczuk 2015; Fan et al. 2017), respectively. Among these results, only the optimized Tersoff potential (Lindsay and Broido 2010) yields a reasonable value. Interestingly, for the graphene-like C3 N monolayer, different studies based on Tersoff-type potentials (Lindsay and Broido 2010; Kınacı et al. 2012) predicted the thermal conductivity at 300 K to be 805–520 (Mortazavi 2017), 810–826 (Hong et al. 2018b), 775 (Han et al. 2019), 806 (Dong et al. 2018), 800 (Song et al. 2019), and 780 (Hatam-Lee et al. 2020) W/m K. While these values overestimate the DFT-based results by more than two-fold, they are all very close to one another, highlighting the good reproducibility of empirical interatomic potential estimates. In our recent study of three different graphene-like BC2 N monolayers (Mortazavi et al. 2022a), MTP-based results confirm that empirical interatomic potential estimates of the thermal conductivity can also be nonphysical and misleading. With the literature presented in

this section, the advantages of MLIPs for the examination of thermal transport, in terms of accuracy and simplicity compared with conventional methods, are highlighted; their computational cost, however, remains their major bottleneck when they are employed within molecular dynamics simulation platforms.

12.4.2 Mechanical Properties

Quantum mechanics-based DFT simulations are nowadays extensively utilized to study the mechanical properties of various compositions, with an outstanding level of accuracy and reproducibility. The computational cost of the DFT method, however, limits its effectiveness for studying the mechanical properties of large systems, such as nanoporous conductive frameworks. The other major shortcoming is that DFT calculations of mechanical properties are generally conducted at the ground state, neglecting temperature effects. Atomic vibrations due to thermal effects can substantially affect the stress–strain curve and the failure mechanism. With increasing temperature the symmetry of the structures is suppressed, and therefore larger representative volume elements must be considered in order to predict meaningful behaviors. It is thus clear that the DFT method faces a severe computational cost issue for the assessment of mechanical behavior at elevated temperatures. MLIPs can address both of these challenges of the DFT method. In Fig. 12.4, the uniaxial
address both aforementioned challenges of DFT method. In Fig. 12.4, the uniaxial
stress–strain responses of the C5 N monolayer conductive framework are compared
using the MTP and DFT models, as reported in Ref. Mortazavi et al. (2022c). In this
case two different MTPs were developed, using AIMD trajectories with and without
DFT-D3 (Grimme et al. 2010) van der Waals (vdW) dispersion correction. Results
shown in Fig. 12.4 confirm that both MTPs could accurately reproduce the tensile
strengths of the C5 N monolayer. It is observable that without taking the vdW disper-
sion correction into account, the general agreement for stress–strain curve is affected.
Results shown in Fig. 12.4 also confirm that MTP method could precisely predict the
failure mechanism, as compared with DFT. The developed MTP was then success-
fully used to study the mechanical and failure properties at room temperature, which
are almost unsolvable by DFT method. In another similar study on the basis of MTP
method (Mortazavi et al. 2022b), mechanical properties of five different C6 N7 -based
nanoporous nanosheets were successfully simulated, with confirmed accuracy with
DFT results.
On the other hand, for the simple case of graphene, molecular dynamics simulations based on the original parameter sets of the original Tersoff (Tersoff 1989), AIREBO (Stuart et al. 2000), ReaxFF (Srinivasan et al. 2015), and optimized Tersoff (Lindsay and Broido 2010) potentials predict the tensile strength of graphene to be 200 (Mortazavi et al. 2012; Ni et al. 2010), 150–250 (Yang et al. 2018; He et al. 2014), 125–138 (Jensen et al. 2015), and 158 GPa (Mortazavi et al. 2016), respectively. As compared in Fig. 12.5, the aforementioned interatomic potentials yield nonphysical strain hardening at high strain levels. Except for the ReaxFF-based results, which yield very irregular stress–strain patterns (Jensen et al. 2015), the nonphysical

Fig. 12.4 Uniaxial stress–strain curve and failure mechanism of the C5 N monolayer predicted by
DFT (at 0 K) and MTP-based (at 1 K) models, with and without inclusion of vdW dispersion
correction in the AIMD dataset preparation. (Reprinted from Mortazavi et al. 2022c, Copyright
2022, Royal Society of Chemistry)

strain hardening can be removed by trial-and-error modification of the potential's cutoff distance function (Mortazavi et al. 2016; He et al. 2014), tuned to reproduce the experimental tensile strength value. As shown in Fig. 12.5, a passively fitted MTP model, with simulations conducted at 1 K, could very closely reproduce the direction-dependent stress–strain curves of graphene as compared with DFT results. This example clearly reveals the superiority of MLIPs in the analysis of mechanical properties, in which a rapidly trained MTP can substantially outperform widely employed empirical interatomic potentials with respect to accuracy.
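Independently of which potential produced the data, the quantities compared above can be extracted from a simulated stress–strain curve with a few lines of post-processing; the following Python sketch, with an illustrative linear-fit range, is one such possibility:

    import numpy as np

    def modulus_and_strength(strain, stress, linear_limit=0.01):
        # Post-process a simulated uniaxial stress-strain curve: the elastic modulus
        # is the slope of a linear fit within `linear_limit`, the tensile strength
        # is the maximum stress, and the failure strain is the strain at that maximum.
        strain = np.asarray(strain)
        stress = np.asarray(stress)
        mask = strain <= linear_limit
        modulus = np.polyfit(strain[mask], stress[mask], 1)[0]
        strength = stress.max()
        failure_strain = strain[stress.argmax()]
        return modulus, strength, failure_strain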

12.5 First-Principles Multiscale Modeling

At this stage, the strengths and challenges of MLIPs, as well as the various drawbacks of empirical interatomic potentials and conventional DFT for the examination of mechanical and thermal properties, have been discussed. MLIPs can be fitted over computationally affordable AIMD trajectories, they have been extensively shown to exhibit an accuracy close to that of their native datasets, and, more importantly, they inherit the flexibility of the DFT method to investigate diverse and novel compositions. In addition, the majority of the commonly used MLIP methods are currently available within the most popular molecular dynamics package, the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS) (Plimpton 1995), which considerably facilitates their practical utilization. The extensive

Fig. 12.5 Comparison of mechanical properties of graphene predicted by (a) original Tersoff
(Tersoff 1989) (Reprinted from Ni et al. 2010, Copyright 2010, Elsevier), AIREBO (Stuart et al.
2000) (Reprinted from He et al. 2014, Copyright 2014, Elsevier), ReaxFF (Srinivasan et al. 2015)
(Reprinted from Jensen et al. 2015, Copyright 2015, American Chemical Society), and optimized
Tersoff (Lindsay and Broido 2010) (Reprinted from Mortazavi et al. 2016, Copyright 2016, Else-
vier) empirical interatomic potentials, with those by (e and f) DFT (at 0 K) and MTP-based (at 1 K)
models (Reprinted from Mortazavi et al. 2021b, Copyright 2021, John Wiley and Sons)

libraries of LAMMPS can enable studying large systems and exploring various phys-
ical properties. In our recent works, we have practically confirmed another novel
opportunity of MLIPs. We have demonstrated that MLIPs fitted to fixed AIMD
datasets can enable first-principles multiscale modeling of mechanical (Mortazavi
et al. 2021b) and heat transport (Mortazavi et al. 2020a) properties of complex nanos-
tructures, in which ab-initio level of accuracy can be hierarchically bridged to explore
the properties of macroscopic systems. It is worth noting that bond rupture and the subsequent failure process can substantially change the atomic environments, which can cause technical difficulties for developing stable and accurate MLIPs. On this basis, the evaluation of mechanical properties, with their dynamic atomic environments, is inherently more complex than that of the thermal counterparts, which are mostly explored close to equilibrium conditions. We therefore discuss the concept of first-principles multiscale modeling of mechanical properties (Mortazavi et al. 2021b), which, as discussed, is naturally among the most complicated problems for MLIPs. In order to practically demonstrate this exceptional possibility enabled by MLIPs, the mechanical properties of coplanar graphene/borophene heterostructures (Liu et al. 2019) have been explored. It is worth noting that, apart from the accuracy concerns of empirical interatomic potentials, because of the complex bonding configurations between graphene and borophene, they are unable to keep these structures stable at finite temperatures. The first-principles

Fig. 12.6 First-principles multiscale modeling strategy to simulate the mechanical properties of
graphene/borophene coplanar heterostructures (Reprinted from Mortazavi et al. 2021b, Copyright
2021, John Wiley and Sons)

multiscale modeling of mechanical properties (Mortazavi et al. 2021b), similar to the thermal transport problem (Mortazavi et al. 2020a), comprises four major steps, which are schematically shown in Fig. 12.6. In the first step, AIMD simulations are conducted over stress-free and strained structures to prepare the required training datasets, which are used to develop a MLIP for the molecular dynamics simulations of the second step. Next, MLIP-based molecular dynamics simulations are conducted to evaluate the mechanical properties of the pristine and heterostructure phases at room temperature. In the final step, on the basis of the data provided by the MLIP-based molecular dynamics simulations, the mechanical/failure responses of macroscopic heterostructures can be examined using the continuum finite element method.
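For orientation, the workflow can be summarized by the following Python-style outline, in which every callable argument is a placeholder for the corresponding external tool rather than an existing API:

    def first_principles_multiscale(structure, run_aimd, train_mlip,
                                    run_md_tension, run_finite_element):
        # High-level outline of the four steps of Fig. 12.6; the callables stand in
        # for the AIMD code, the MLIP trainer, the molecular dynamics package, and
        # the continuum finite element solver.
        dataset = run_aimd(structure)      # step 1: AIMD over stress-free and strained cells
        mlip = train_mlip(dataset)         # step 2: fit the MLIP to the AIMD dataset
        md_results = run_md_tension(mlip, structure, temperature=300)  # step 3: MLIP-based MD
        return run_finite_element(md_results)   # step 4: continuum-level simulation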

12.6 Concluding Remark

Since the introduction of the MLIP concept by Behler and Parrinello (2007), interest in the use of MLIPs has grown steadily, for example, to accelerate calculations and/or to conduct more accurate molecular dynamics simulations. There is no doubt that the significance of MLIPs will keep expanding and that novel advanced methods will be developed to accelerate calculations or to offer solutions for more complex problems. MLIPs moreover offer extraordinary capabilities to marry first-principles accuracy with multiscale modeling, enabling the modeling of complex nanostructures at the continuum level, without any prior physical knowledge, with DFT-level accuracy and, more importantly, without excessive computational costs. MLIPs therefore offer a bright prospect for the development of

fully automated platforms to design, optimize, and explore the complex responses of materials and structures at the continuum level, capturing atomistic effects and bringing new insights to address various types of real-world engineering challenges.

Acknowledgements The author appreciates the funding by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germany's Excellence Strategy within the Cluster of Excellence PhoenixD (EXC 2122, Project ID 390833453).

References

Arabha S, Aghbolagh ZS, Ghorbani K, Hatam-Lee M, Rajabpour A (2021) Recent advances in


lattice thermal conductivity calculation using machine-learning interatomic potentials. J Appl
Phys 130:210903. https://doi.org/10.1063/5.0069443
Arabha S, Rajabpour A (2021) Thermo-mechanical properties of nitrogenated holey graphene
(C2N): a comparison of machine-learning-based and classical interatomic potentials. Int J Heat
Mass Transf 178:121589
Artrith N, Morawietz T, Behler J (2011) High-dimensional neural-network potentials for multicom-
ponent systems: applications to zinc oxide. Phys Rev B 83:153101. https://doi.org/10.1103/Phy
sRevB.83.153101
Artrith N, Urban A (2016) An implementation of artificial neural-network potentials for atomistic
materials simulations: performance for TiO2 . Comput Mater Sci 114:135–150. https://doi.org/
10.1016/j.commatsci.2015.11.047
Bartók AP, Csányi G (2015) Gaussian approximation potentials: a brief tutorial introduction. Int J
Quantum Chem 115:1051–1057. https://doi.org/10.1002/qua.24927
Bartók AP, Payne MC, Kondor R, Csányi G (2010) Gaussian approximation potentials: the accuracy
of quantum mechanics, without the electrons. Phys Rev Lett 104:136403
Behler J (2014) Representing potential energy surfaces by high-dimensional neural network
potentials. J Phys Condens Matter 26:183001
Behler J (2015) Constructing high-dimensional neural network potentials: a tutorial review. Int J
Quantum Chem 115:1032–1050. https://doi.org/10.1002/qua.24890
Behler J (2016) Perspective: machine learning potentials for atomistic simulations. J Chem Phys
145:170901. https://doi.org/10.1063/1.4966192
Behler J, Parrinello M (2007) Generalized neural-network representation of high-dimensional
potential-energy surfaces. Phys Rev Lett 98:146401. https://doi.org/10.1103/PhysRevLett.98.
146401
Behler J, RuNNer: A program for constructing high-dimensional neural network potentials
Brenner DW, Shenderova OA, Harrison JA, Stuart SJ, Ni B, Sinnott SB (2002) A second-generation
Reactive Empirical Bond Order (REBO) potential energy expression for hydrocarbons. J Phys
Condens Matter. https://doi.org/10.1088/0953-8984/14/4/312
Brieuc F, Schran C, Forbert H, Marx D, RubNNet4MD: Ruhr-Universität Bochum neural networks
for molecular dynamics simulations
Chakraborty P, Liu Y, Ma T, Guo X, Cao L, Hu R, Wang Y (2020) Quenching thermal transport
in aperiodic superlattices: a molecular dynamics and machine learning study. ACS Appl Mater
Interfaces 12:8795–8804
Chen H, Ortner C (2017) QM/MM methods for crystalline defects. Part 2: consistent energy and force-mixing. Multiscale Model Simul 15:184–214. https://doi.org/10.1137/15M1041250
Chen H, Ortner C (2016) QM/MM methods for crystalline defects. Part 1: locality of the tight
binding model. Multiscale Model Simul 14:232–264. https://doi.org/10.1137/15M1022628

Dong Y, Meng M, Groves MM, Zhang C, Lin J (2018) Thermal conductivities of two-dimensional
graphitic carbon nitrides by molecule dynamics simulation. Int J Heat Mass Transf 123:738–746.
https://doi.org/10.1016/j.ijheatmasstransfer.2018.03.017
Fan Z, Wang Y, Ying P, Song K, Wang J, Wang Y, Zeng Z, Xu K, Lindgren E, Rahm JM et al (2022)
GPUMD: a package for constructing accurate machine-learned potentials and performing highly
efficient atomistic simulations. J Chem Phys 157:114801. https://doi.org/10.1063/5.0106617
Fan Z, Zeng Z, Zhang C, Wang Y, Song K, Dong H, Chen Y, Ala-Nissila T (2021) Neuroevolution
machine learning potentials: combining high accuracy and low cost in atomistic simulations and
application to heat transport. Phys Rev B 104:104309
Fan Z, Pereira LFC, Hirvonen P, Ervasti MM, Elder KR, Donadio D, Ala-Nissila T, Harju A (2017)
Thermal conductivity decomposition in two-dimensional materials: application to graphene.
Phys Rev B 95. https://doi.org/10.1103/PhysRevB.95.144309
Gao Y, Wang H, Sun M, Ding Y, Zhang L, Li Q (2018) First-principles study of intrinsic phononic
thermal transport in monolayer C3N. Phys E Low-Dimensional Syst Nanostruct 99:194–201.
https://doi.org/10.1016/j.physe.2018.02.012
Grimme S, Antony J, Ehrlich S, Krieg H (2010) A consistent and accurate ab initio parametrization
of density functional dispersion correction (DFT-D) for the 94 elements H-Pu. J Chem Phys
132:154104. https://doi.org/10.1063/1.3382344
Han D, Wang X, Ding W, Chen Y, Zhang J, Xin G, Cheng L (2019) Phonon thermal conduction
in a graphene–C 3 N heterobilayer using molecular dynamics simulations. Nanotechnology
30:075403. https://doi.org/10.1088/1361-6528/aaf481
Hatam-Lee SM, Rajabpour A, Volz S (2020) Thermal conductivity of graphene polymorphs and
compounds: from C3N to graphdiyne lattices. Carbon N Y 161:816–826. https://doi.org/10.
1016/j.carbon.2020.02.007
He L, Guo S, Lei J, Sha Z, Liu Z (2014) The effect of stone–thrower–wales defects on mechanical
properties of graphene sheets—a molecular dynamics study. Carbon N Y 75:124–132. https://
doi.org/10.1016/j.carbon.2014.03.044
Hong Y, Ju MG, Zhang J, Zeng XC (2018a) Phonon thermal transport in a graphene/MoSe2 van
Der Waals heterobilayer. Phys Chem Chem Phys 20:2637–2645. https://doi.org/10.1039/C7C
P06874C
Hong Y, Zhang J, Zeng XC (2018b) Monolayer and bilayer polyaniline C3N: two-dimensional
semiconductors with high thermal conductivity. Nanoscale 10:4301–4310. https://doi.org/10.
1039/C7NR08458G
Hu R, Iwamoto S, Feng L, Ju S, Hu S, Ohnishi M, Nagai N, Hirakawa K, Shiomi J (2020) Machine-
learning-optimized aperiodic superlattice minimizes coherent phonon heat conduction. Phys
Rev X 10:21050
Jensen BD, Wise KE, Odegard GM (2015) Simulation of the elastic and ultimate tensile proper-
ties of diamond, graphene, carbon nanotubes, and amorphous carbon using a revised ReaxFF
parametrization. J Phys Chem A 119:9710–9721. https://doi.org/10.1021/acs.jpca.5b05889
Kınacı A, Haskins JB, Sevik C, Çağın T (2012) Thermal conductivity of BN-C nanostructures. Phys Rev B-Condens Matter Mater Phys 86:115410. https://doi.org/10.1103/PhysRevB.86.115410
Korotaev P, Novoselov I, Yanilkin A, Shapeev A (2019) Accessing thermal conductivity of complex
compounds by machine learning interatomic potentials. Phys Rev B 100:144308. https://doi.
org/10.1103/PhysRevB.100.144308
Kresse G, Furthmüller J (1996) Efficient iterative schemes for Ab initio total-energy calculations
using a plane-wave basis set. Phys Rev B 54:11169–11186. https://doi.org/10.1103/PhysRevB.
54.11169
Kumar S, Sharma S, Babar V, Schwingenschlögl U (2017) Ultralow lattice thermal conductivity in
monolayer C3 N as compared to graphene. J Mater Chem A 5:20407–20411. https://doi.org/10.
1039/C7TA05872A
Lee K, Yoo D, Jeong W, Han S (2019) SIMPLE-NN: an efficient package for training and executing
neural-network interatomic potentials. Comput Phys Commun 242:95–103. https://doi.org/10.
1016/j.cpc.2019.04.014

Li W, Carrete J, Katcho NA, Mingo N (2014) ShengBTE: a solver of the boltzmann transport
equation for phonons. Comput Phys Commun 185:1747–1758. https://doi.org/10.1016/j.cpc.
2014.02.015
Lindsay B (2010) Optimized tersoff and brenner empirical potential parameters for lattice dynamics
and phonon thermal transport in carbon nanotubes and graphene. Phys Rev B-Condens Matter
Mater Phys 82:205441
Lindsay L, Broido DA (2010) Optimized tersoff and brenner empirical potential parameters for
lattice dynamics and phonon thermal transport in carbon nanotubes and graphene. Phys Rev
B-Condens Matter Mater Phys 81:205441. https://doi.org/10.1103/PhysRevB.81.205441
Liu Z, Yang X, Zhang B, Li W (2021) High thermal conductivity of wurtzite boron arsenide
predicted by including four-phonon scattering with machine learning potential. ACS Appl Mater
Interfaces. https://doi.org/10.1021/acsami.1c11595
Liu X, Hersam MC (2019) Borophene-graphene heterostructures. Sci Adv 5:eaax6444. https://doi.
org/10.1126/sciadv.aax6444
Mortazavi B (2017) Ultra high stiffness and thermal conductivity of graphene like C3N. Carbon N Y 118:25–34. https://doi.org/10.1016/j.carbon.2017.03.029
Mortazavi B (2021) Ultrahigh thermal conductivity and strength in direct-gap semiconducting
graphene-like BC6N: a first-principles and classical investigation. Carbon N Y 182:373–383.
https://doi.org/10.1016/j.carbon.2021.06.038
Mortazavi B, Fan Z, Pereira LFC, Harju A, Rabczuk T (2016) Amorphized graphene: a stiff material
with low thermal conductivity. Carbon N. Y. 103:318–326. https://doi.org/10.1016/j.carbon.
2016.03.007
Mortazavi B, Novikov IS, Podryabinkin EV, Roche S, Rabczuk T, Shapeev AV, Zhuang X (2020b)
exploring phononic properties of two-dimensional materials using machine learning interatomic
potentials. Appl Mater Today 20:100685. https://doi.org/10.1016/j.apmt.2020.100685
Mortazavi B, Novikov IS, Shapeev AV (2022a) A machine-learning-based investigation on the mechanical/failure response and thermal conductivity of semiconducting BC2N monolayers. Carbon N Y 188:431–441. https://doi.org/10.1016/j.carbon.2021.12.039
Mortazavi B, Podryabinkin EV, Novikov IS, Rabczuk T, Zhuang X, Shapeev AV (2021a) Accelerating first-principles estimation of thermal conductivity by machine-learning interatomic potentials: a MTP/ShengBTE solution. Comput Phys Commun 258:107583. https://doi.org/10.1016/j.cpc.2020.107583
Mortazavi B, Podryabinkin EV, Roche S, Rabczuk T, Zhuang X, Shapeev AV (2020a) Machine-learning interatomic potentials enable first-principles multiscale modeling of lattice thermal conductivity in graphene/borophene heterostructures. Mater Horizons 7:2359–2367. https://doi.org/10.1039/D0MH00787K
Mortazavi B, Rabczuk T (2015) Multiscale modeling of heat conduction in graphene laminates.
Carbon N Y 85:1–7. https://doi.org/10.1016/j.carbon.2014.12.046
Mortazavi B, Rémond Y, Ahzi S, Toniazzo V (2012) Thickness and chirality effects on tensile
behavior of few-layer graphene by molecular dynamics simulations. Comput Mater Sci 53:298–
302. https://doi.org/10.1016/j.commatsci.2011.08.018
Mortazavi B, Shahrokhi M, Shojaei F, Rabczuk T, Zhuang X, Shapeev AV (2022c) A first-principles and machine-learning investigation on the electronic, photocatalytic, mechanical and heat conduction properties of nanoporous C5N monolayers. Nanoscale 14:4324–4333. https://doi.org/10.1039/D1NR06449E
Mortazavi B, Shojaei F, Shapeev AV, Zhuang X (2022b) A combined first-principles and machine-
learning investigation on the stability, electronic, optical, and mechanical properties of novel
C6N7-based nanoporous carbon nitrides. Carbon N Y 194:230–239. https://doi.org/10.1016/j.
carbon.2022.03.068
Mortazavi B, Silani M, Podryabinkin EV, Rabczuk T, Zhuang X, Shapeev AV (2021b) First-principles multiscale modeling of mechanical properties in graphene/borophene heterostructures empowered by machine-learning interatomic potentials. Adv Mater 33:2102807. https://doi.org/10.1002/adma.202102807

Ni Z, Bu H, Zou M, Yi H, Bi K, Chen Y (2010) Anisotropic mechanical properties of graphene


sheets from molecular dynamics. Phys B Condens Matter 405:1301–1306. https://doi.org/10.
1016/j.physb.2009.11.071
Novikov IS, Gubaev K, Podryabinkin EV, Shapeev AV (2021) The MLIP package: moment tensor potentials with MPI and active learning. Mach Learn Sci Technol 2:025002
Novikov I, Grabowski B, Körmann F, Shapeev A (2022) Magnetic moment tensor potentials for
collinear spin-polarized materials reproduce different magnetic states of Bcc Fe. NPJ Comput
Mater 8:13. https://doi.org/10.1038/s41524-022-00696-9
Ouyang Y, Yu C, Yan G, Chen J (2021) Machine learning approach for the prediction and
optimization of thermal transport properties. Front Phys 16:1–16
Peng B, Mortazavi B, Zhang H, Shao H, Xu K, Li J, Ni G, Rabczuk T, Zhu H (2018) Tuning thermal
transport in C3 N monolayers by adding and removing carbon atoms. Phys Rev Appl 10:34046.
https://doi.org/10.1103/PhysRevApplied.10.034046
Plimpton S (1995) Fast Parallel algorithms for short-range molecular dynamics. J Comput Phys
117:1–19. https://doi.org/10.1006/jcph.1995.1039
Podryabinkin EV, Kvashnin AG, Asgarpour M, Maslenikov II, Ovsyannikov DA, Sorokin PB, Popov MY, Shapeev AV (2022) Nanohardness from first principles with active learning on atomic environments. J Chem Theory Comput 18:1109–1121. https://doi.org/10.1021/acs.jctc.1c00783
Podryabinkin EV, Shapeev AV (2017) Active learning of linearly parametrized interatomic potentials. Comput Mater Sci 140:171–180. https://doi.org/10.1016/j.commatsci.2017.08.031
Podryabinkin EV, Tikhonov EV, Shapeev AV, Oganov AR (2019) Accelerating crystal struc-
ture prediction by machine-learning interatomic potentials with active learning. Phys Rev B
99:064114. https://doi.org/10.1103/PhysRevB.99.064114
Qin G, Qin Z, Wang H, Hu M (2018) On the diversity in the thermal transport properties of graphene:
a first-principles-benchmark study testing different exchange-correlation functionals. Comput
Mater Sci 151:153–159. https://doi.org/10.1016/j.commatsci.2018.05.007
Rowe P, Deringer VL, Gasparotto P, Csányi G, Michaelides A (2020) An accurate and transferable
machine learning potential for carbon. J Chem Phys 153:34702. https://doi.org/10.1063/5.000
5084
Schütt KT, Arbabzadah F, Chmiela S, Müller KR, Tkatchenko A (2017) Quantum-chemical insights
from deep tensor neural networks. Nat Commun 8:13890. https://doi.org/10.1038/ncomms
13890
Shapeev AV (2016) Moment tensor potentials: a class of systematically improvable interatomic
potentials. Multiscale Model Simul 14:1153–1173. https://doi.org/10.1137/15M1054183
Shapeev AV (2019) Chapter 3: Applications of machine learning for representing interatomic interactions. In: Computational materials discovery. The Royal Society of Chemistry, pp 66–86. ISBN 978-1-78262-961-0
Singraber A, Morawietz T, Behler J, Dellago C (2019) Parallel multistream training of high-
dimensional neural network potentials. J Chem Theory Comput 15:3075–3092. https://doi.org/
10.1021/acs.jctc.8b01092
Song J, Xu Z, He X, Bai Y, Miao L, Cai C, Wang R (2019) Thermal conductivity of two-dimensional
BC 3: A comparative study with two-dimensional C 3 N. Phys Chem Chem Phys 21:12977–
12985. https://doi.org/10.1039/C9CP01943J
Srinivasan SG, van Duin ACT, Ganesh P (2015) Development of a ReaxFF potential for carbon
condensed phases and its application to the thermal fragmentation of a large fullerene. J Phys
Chem A 119:571–580. https://doi.org/10.1021/jp510274e
Stuart SJ, Tutein AB, Harrison JA (2000) A reactive potential for hydrocarbons with intermolecular
interactions. J Chem Phys 112:6472–6486. https://doi.org/10.1063/1.481208
Taheri A, Pisana S, Singh CV (2021) Importance of quadratic dispersion in acoustic flexural phonons
for thermal transport of two-dimensional materials. Phys Rev B 103:235426. https://doi.org/10.
1103/PhysRevB.103.235426

Taheri A, Da Silva C, Amon CH (2018) First-principles phonon thermal transport in graphene:


effects of exchange-correlation and type of pseudopotential. J Appl Phys 123:215105. https://
doi.org/10.1063/1.5027619
Tersoff J (1988) New empirical approach for the structure and energy of covalent systems. Phys
Rev B 37:6991–7000. https://doi.org/10.1103/PhysRevB.37.6991
Tersoff J (1989) Modeling solid-state chemistry: interatomic potentials for multicomponent systems.
Phys Rev B 39:5566–5568. https://doi.org/10.1103/PhysRevB.39.5566
Thomas JA, Iutzi RM, McGaughey AJH (2010) Thermal conductivity and phonon transport in
empty and water-filled carbon nanotubes. Phys Rev B-Condens Matter Mater Phys 81:045413.
https://doi.org/10.1103/PhysRevB.81.045413
Thompson AP, Swiler LP, Trott CR, Foiles SM, Tucker GJ (2015) Spectral neighbor analysis
method for automated generation of quantum-accurate interatomic potentials. J Comput Phys
285:316–330
Wang H, Li Q, Pan H, Gao Y, Sun M (2019) Comparative investigation of the mechanical, electrical
and thermal transport properties in graphene-like C3 B and C3 N. J Appl Phys 126:234302. https://
doi.org/10.1063/1.5122678
Wang H, Qin G, Qin Z, Li G, Wang Q, Hu M (2018b) Lone-pair electrons do not necessarily lead
to low lattice thermal conductivity: an exception of two-dimensional penta-CN2 . J Phys Chem
Lett 9:2474–2483. https://doi.org/10.1021/acs.jpclett.8b00820
Wang H, Zhang L, Han J, Weinan E (2018a) DeePMD-Kit: a deep learning package for many-body
potential energy representation and molecular dynamics. Comput Phys Commun 228:178–184
Ward A (2009) First principles theory of the lattice thermal conductivity of semiconductors. PhD thesis
Wei Z, Ni Z, Bi K, Chen M, Chen Y (2011) In-plane lattice thermal conductivities of multilayer
graphene films. Carbon N Y 49:2653–2658. https://doi.org/10.1016/j.carbon.2011.02.051
Yang X, Wu S, Xu J, Cao B, To AC (2018) Spurious heat conduction behavior of finite-size
graphene nanoribbon under extreme uniaxial strain caused by the AIREBO potential. Phys
E Low-Dimensional Syst Nanostruct 96:46–53. https://doi.org/10.1016/j.physe.2017.10.006
Yanxon H, Zagaceta D, Wood BC, Zhu Q (2020) Neural network potential from bispectrum compo-
nents: a case study on crystalline silicon. J Chem Phys 153:54118. https://doi.org/10.1063/5.
0014677
Yanxon H, Zagaceta D, Tang B, Matteson DS, Zhu Q (2021) PyXtal_FF: a python library for automated force field generation. Mach Learn Sci Technol 2:027001. https://doi.org/10.1088/2632-2153/abc940
Ying P, Dong H, Liang T, Fan Z, Zhong Z, Zhang J (2023) Atomistic Insights into the mechanical
anisotropy and fragility of monolayer fullerene networks using quantum mechanical calculations
and machine-learning molecular dynamics simulations. Extrem Mech Lett. https://doi.org/10.
1016/j.eml.2022.101929
Zuo Y, Chen C, Li X, Deng Z, Chen Y, Behler J, Csányi G, Shapeev AV, Thompson AP, Wood
MA et al (2020) Performance and cost assessment of machine learning interatomic potentials.
J Phys Chem A 124:731–745. https://doi.org/10.1021/acs.jpca.9b08723
