Deep Learning Applied To Computational Mechanics A Comprehensive Review, State of The Art, and The Classics - Vu-Quoc & Humer 2022
REVIEW
“We must welcome the future, for it soon will be the past,
and we must respect the past, for it was once all that was humanly possible.”
George Santayana
ABSTRACT
Three recent breakthroughs due to AI in arts and science serve as motivation: an award-winning digital image, protein folding, and fast matrix multiplication. Many recent developments in artificial neural networks, particularly deep learning (DL), applied and relevant to computational mechanics (solids, fluids, finite-element technology) are reviewed in detail. Both hybrid and pure machine learning (ML) methods are discussed. Hybrid methods combine traditional PDE discretizations with ML methods either (1) to help model complex nonlinear constitutive relations, (2) to nonlinearly reduce the model order for efficient simulation (turbulence), or (3) to accelerate the simulation by predicting certain components in the traditional integration methods. Here, methods (1) and (2) relied on the Long Short-Term Memory (LSTM) architecture, with method (3) relying on convolutional neural networks. Pure ML methods to solve (nonlinear) PDEs are represented by Physics-Informed Neural Network (PINN) methods, which could be combined with attention mechanisms to address discontinuous solutions. Both LSTM and attention architectures are extensively reviewed, together with modern optimizers and classic optimizers generalized to include stochasticity for DL networks. Kernel machines, including Gaussian processes, are covered in sufficient depth for more advanced works such as shallow networks with infinite width. The review addresses not only experts: readers are assumed to be familiar with computational mechanics, but not with DL, whose concepts and applications are built up from the basics, aiming at bringing first-time learners quickly to the forefront of research. The history and limitations of AI are recounted and discussed, with particular attention to pointing out misstatements or misconceptions about the classics, even in well-known references. Positioning and pointing control of a large-deformable beam is given as an example.
KEYWORDS
Deep learning, breakthroughs, network architectures, backpropagation, stochastic optimization methods from classic to modern, recurrent neural networks, long short-term memory, gated recurrent unit, attention, transformer, kernel machines, Gaussian processes, libraries, Physics-Informed Neural Networks, state-of-the-art, history, limitations, challenges; Applications to computational mechanics; Finite-element matrix integration, improved Gauss quadrature; Multiscale geomechanics, fluid-filled porous media; Fluid mechanics, turbulence, proper orthogonal decomposition; Nonlinear-manifold model-order reduction, autoencoder, hyper-reduction using gappy data; control of large deformable beam.
SUMMARY
A goal of this in-depth review is not only to provide the state of the art for computational-mechanics readers who have some familiarity with deep-learning networks, but also to serve first-time learners by developing the relevant fundamental concepts from the basics. Moreover, for the convenience of the readers, detailed references are provided, e.g., page numbers in thick books, and links to online references and open reviews where available.
TABLE OF CONTENTS
1 Opening remarks and organization
2 Deep Learning, resurgence of Artificial Intelligence
  2.1 Handwritten equation to LaTeX code, image recognition
  2.2 Artificial intelligence, machine learning, deep learning
  2.3 Motivation, applications to mechanics
    2.3.1 Enhanced numerical quadrature for finite elements
    2.3.2 Solid mechanics, multiscale modeling
    2.3.3 Fluid mechanics, reduced-order model for turbulence
3 Computational mechanics, neuroscience, deep learning
4 Statics, feedforward networks
  4.1 Two concept presentations
    4.1.1 Top-down approach
    4.1.2 Bottom-up approach
  4.2 Matrix notation
  4.3 Big picture, composition of concepts
    4.3.1 Graphical representation, block diagrams
  4.4 Network layer, detailed construct
    4.4.1 Linear combination of inputs and biases
    4.4.2 Activation functions
    4.4.3 Graphical representation, block diagrams
    4.4.4 Artificial neuron
  4.5 Representing XOR function with two-layer network
    4.5.1 One-layer network
    4.5.2 Two-layer network
  4.6 What is “deep” in “deep networks”? Size, architecture
    4.6.1 Depth, size
    4.6.2 Architecture
5 Backpropagation
  5.1 Cost (loss, error) function
    5.1.1 Mean squared error
    5.1.2 Maximum likelihood (probability cost)
    5.1.3 Classification loss function
  5.2 Gradient of cost function by backpropagation
  5.3 Vanishing and exploding gradients
    5.3.1 Logistic sigmoid and hyperbolic tangent
    5.3.2 Rectified linear function (ReLU)
    5.3.3 Parametric rectified linear unit (PReLU)
6 Network training, optimization methods
  6.1 Training set, validation set, test set, stopping criteria
  6.2 Deterministic optimization, full batch
    6.2.1 Exact line search
Figure 1: AI-generated image won a contest in the category of Digital Arts, Emerging Artists, on 2022.08.29 (Section 1). “Théâtre D’opéra Spatial” (Space Opera Theater) by “Jason M. Allen via Midjourney”, which is “an artificial intelligence program that turns lines of text into hyper-realistic graphics” [4]. Colorado State Fair, 2022 Fine Arts First, Second & Third. (Permission of Jason M. Allen, CEO, Incarnate Games)
Figure 2: Breakthroughs in AI (Section 2). Left: The journal Science 2021 Breakthrough of the Year. Protein folded 3-D shape produced by the AI software AlphaFold compared to experiment with high accuracy [5]. The AlphaFold Protein Structure Database contains more than 200 million protein structure predictions, a holy grail sought after for the last 50 years. Right: The AI software AlphaGo, a runner-up for the journal Science 2016 Breakthrough of the Year, beat the European Go champion Fan Hui five games to zero in 2015 [6], and then went on to defeat the world Go grandmaster Lee Sedol in 2016 [7]. (Permission by Nature.)
Since the preprint of this paper was posted on arXiv in Dec 2022 [9], there has been considerable excitement and concern about ChatGPT—a large language-model chatbot that can interact with humans in a conversational way—which would be incorporated into Microsoft Bing to make web “search interesting again, after years of stagnation and stasis” [10], whose author wrote “I’m going to do something I thought I’d never do: I’m switching my desktop computer’s default search engine to Bing. And Google, my default source of information for my entire adult life, is going to have to fight to get me back.” Google would release its own answer to ChatGPT, called “Bard” [11]. The race is on.
Audience. This review paper is written by mechanics practitioners for mechanics practitioners, who may or may not be familiar with neural networks and deep learning. We thus assume that the readers are familiar with continuum mechanics and numerical methods such as the finite element method. Hence, unlike typical computer-science papers on deep learning, the notation and conventions of tensor analysis familiar to practitioners of mechanics are used here whenever possible.3
For readers not familiar with deep learning, unlike many other review papers, this review paper is not just a summary of papers in the literature for people who already have some familiarity with this topic,4 particularly papers on deep-learning neural networks, but also contains a tutorial on this topic aimed at bringing first-time learners (including students) quickly up to date with modern issues and applications of deep learning, especially to computational mechanics.5 As a result, this review paper is a convenient “one-stop shopping” that provides the necessary fundamental information, with clarification of potentially confusing points, for first-time learners to quickly acquire a general understanding of the field that would facilitate deeper study and application to computational mechanics.
Deep-learning software libraries. Just as there is a large amount of software available in different subfields of computational mechanics, there are many excellent deep-learning libraries ready for use in applications; see Section 9, in which some examples of the use of these libraries in engineering applications are provided, with the associated computer code. Similar to learning finite-element formulations versus learning how to run finite-element codes, our focus here is to discuss various algorithmic aspects of deep learning and their applications in computational mechanics, rather than how to use deep-learning libraries in applications. We agree with the view that “a solid understanding of the core principles of neural networks and deep learning” would provide “insights that will still be relevant years from now” [21], and that such understanding would not be obtained from just learning to run some hot libraries.
Readers already familiar with neural networks may find the presentation refreshing,6 and may even find new information on neural networks, depending on how they used deep learning, or on when they stopped working in this area due to the waning wave of connectionism and the new wave of deep learning.7 If not, readers
can skip these sections to go directly to the sections on applications of deep learning to computational mechanics.

3 Tensors are not matrices; other concepts are the summation convention on repeated indices, the chain rule, and the matrix index convention for natural conversion from component form to matrix (and then tensor) form. See Section 4.2 on Matrix notation.

4 See the review papers on deep learning, e.g., [12] [13] [14] [15] [16] [17] [18], many of which did not provide extensive discussion of applications, particularly to computational mechanics, as the present review paper does.

5 An example of a confusing point for first-time learners with knowledge of electrical circuits, hydraulics, or (biological) computational neuroscience [19] would be the interpretation of the arrows in an artificial neural network such as those in Figure 7 and Figure 8: Would these arrows represent real physical flows (electron flow, fluid flow, etc.)? No, they represent function mapping (or information passing); see Section 4.3.1 on Graphical representation. Even a tutorial such as [20] would follow the same format as many other papers, and while alluding to the human brain in their Figure 2 (which is the equivalent of Figure 8 below), did not explain the meaning of the arrows.

6 Particularly the top-down approach for both the feedforward network (Section 4) and backpropagation (Section 5).

7 It took five years from the publication of Rumelhart et al. 1986 [22] to the paper by Ghaboussi et al. 1991 [23], in which backpropagation (Section 5) was applied. It took more than twenty years from the publication of Long Short-Term Memory (LSTM) units in [24] to the two recent papers [25] and [26], which are reviewed in detail here, and where recurrent neural networks (RNNs, Section 7) with LSTM units (Section 7.2) were applied, even though there were some early works on the application of RNNs (without LSTM units) in civil / mechanical engineering, such as [27] [28] [29] [30]. But already, the “fully attentional Transformer” has been proposed to render the “intricately constructed LSTM” unnecessary [31]. Most modern networks use the default rectified linear function (ReLU)—introduced in computational neuroscience at least as early as [32] and [19], and then adopted in computer science beginning with [33] and [34]—instead of the traditional sigmoid function dating from the mid 1970s with [35]; yet many newer activation functions continue to appear regularly, aiming at improving accuracy and efficiency over previous activation functions, e.g., [36], [37]. In computational mechanics, by the beginning of 2019, there had not yet been widespread use of the ReLU activation function; even though ReLU was mentioned in [38], the sigmoid function was actually employed there to obtain the results (Section 10). See also Section 13 on Historical perspective.
Applications of deep learning in computational mechanics. We select some recent papers on the application of deep learning to computational mechanics to review in detail, in a way that readers would also understand the computational-mechanics contents well enough without having to read through the original papers:
• A recurrent neural network (RNN) with Long Short-Term Memory (LSTM) units9 was applied to multiple-scale, multi-physics problems in solid mechanics [25];
• RNNs with LSTM units were employed to obtain a reduced-order model for turbulence in fluids based on the proper orthogonal decomposition (POD), a classic linear projection method also known as principal component analysis (PCA) [26]. More recent nonlinear-manifold model-order reduction methods, incorporating an encoder / decoder and hyper-reduction of dimensionality using gappy (incomplete) data, were introduced in, e.g., [47] [48].
Organization of contents. Our review of each of the above papers is divided into two parts. The first part summarizes the main results and identifies the concepts of deep learning used in the paper, expected to be new for first-time learners, for subsequent elaboration. The second part explains in detail how these deep-learning concepts were used to produce the results.
The results of deep-learning numerical integration [38] are presented in Section 2.3.1, where the deep-
learning concepts employed are identified and listed, whereas the details of the formulation in [38] are
discussed in Section 10. Similarly, the results and additional deep-learning concepts used in a multi-scale,
multi-physics problem of geomechanics [25] are presented in Section 2.3.2, whereas the details of this
formulation are discussed in Section 11. Finally, the results and additional deep-learning concepts used
in turbulent fluid simulation with proper orthogonal decomposition [26] are presented in Section 2.3.3,
whereas the details of this formulation, together with the nonlinear-manifold model-order reduction [47] [48], are discussed in Section 12.

8 It would be interesting to investigate how the adjusted integration weights obtained with the method in [38] would affect the stability of an element stiffness matrix with reduced integration (even in the absence of locking) and the superconvergence of the strains / stresses at the Barlow sampling points. See, e.g., [39], p. 499. The optimal locations of these strain / stress sampling points do not depend on the integration weights, but only on the degree of the interpolation polynomials; see [40] [41]. “The Gauss points corresponding to reduced integration are the Barlow points (Barlow, 1976) at which the strains are most accurately predicted if the elements are well-shaped” [42].

9 It is only a coincidence that (1) Hochreiter (1997), the first author of [24], which was the original paper on the widely used and highly successful Long Short-Term Memory (LSTM) unit, is on the faculty at Johannes Kepler University (home institution of this paper’s author A.H.), and that (2) Ghaboussi (1991), the first author of [23], who was among the first researchers to apply fully-connected feedforward neural networks to constitutive behavior in solid mechanics, was on the faculty at the University of Illinois at Urbana-Champaign (home institution of author L.V.Q.). See also [43], and for early applications of neural networks in other areas of mechanics, see, e.g., [44], [45], [46].
All of the deep-learning concepts identified from the above selected papers for in-depth review are subsequently explained in detail in Sections 3 to 7, and then further in Section 13 on “Historical perspective”.
The parallelism between computational mechanics, neuroscience, and deep learning is summarized in
Section 3, which would put computational-mechanics first-time learners at ease, before delving into the
details of deep-learning concepts.
Both time-independent (static) and time-dependent (dynamic) problems are discussed. The architecture
of (static, time-independent) feedforward multilayer neural networks in Section 4 is expounded in detail,
with first-time learners in mind, without assuming prior knowledge, and where experts may find a refreshing
presentation and even new information.
Backpropagation, explained in Section 5, is an important method to compute the gradient of the cost function with respect to the network parameters; the negative of this gradient provides a descent direction to decrease the cost function for network training.
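As a concrete illustration of this gradient computation (a minimal sketch of ours, not code from the reviewed papers; the network size and data below are hypothetical), reverse-mode automatic differentiation in PyTorch implements backpropagation:

```python
import torch

# Hypothetical two-layer feedforward network: 3 inputs -> 5 hidden neurons (ReLU) -> 1 output.
model = torch.nn.Sequential(
    torch.nn.Linear(3, 5),
    torch.nn.ReLU(),
    torch.nn.Linear(5, 1),
)
x = torch.randn(10, 3)         # minibatch of 10 hypothetical training examples
y_target = torch.randn(10, 1)  # hypothetical target outputs

loss = torch.nn.functional.mse_loss(model(x), y_target)  # cost function (Section 5.1)
loss.backward()  # backpropagation: fills p.grad with dLoss/dp for every parameter p

for name, p in model.named_parameters():
    # The negative gradient -p.grad is a descent direction for network training (Section 6).
    print(name, p.grad.shape)
```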
For training networks—i.e., finding optimal parameters that yield low training error and lowest valida-
tion error—both classic deterministic optimization methods (using full batch) and stochastic optimization
methods (using minibatches) are reviewed in detail, and at times even derived, in Section 6, which would be
useful for both first-time learners and experts alike.
The examples used in training a network form the training set, which is complemented by the validation
set (to determine when to stop the optimization iterations) and the test set (to see whether the resulting
network could work on examples never seen before); see Section 6.1.
Deterministic gradient descent with classical line-search methods, such as Armijo’s rule (Section 6.2), was generalized to add stochasticity, and detailed pseudocodes for these methods are provided. The classic stochastic gradient descent (SGD) by Robbins & Monro (1951) [49] (Section 6.3, Section 6.3.1), with add-on tricks such as momentum by Polyak (1964) [3] and the fast (accelerated) gradient by Nesterov (1983 [50], 2018 [51]) (Section 6.3.2), step-length decay (Section 6.3.4), cyclic annealing (Section 6.3.4), minibatch-size increase (Section 6.3.5), and weight decay (Section 6.3.6), is presented, often with detailed derivations.
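As a schematic of the update rules just listed (our sketch, not the pseudocode of Section 6; the hyperparameter values and the decay schedule are illustrative assumptions), SGD with Polyak momentum and step-length decay can be written as:

```python
import numpy as np

def sgd_momentum(grad, theta0, minibatches, lr0=0.1, beta=0.9, decay=1e-3, n_epochs=10):
    """Schematic SGD with momentum ("small heavy sphere", Remark 6.6) and step-length decay.
    grad(theta, batch) must return a stochastic gradient estimate on one minibatch."""
    theta = np.asarray(theta0, dtype=float)
    velocity = np.zeros_like(theta)
    k = 0  # global iteration counter
    for _ in range(n_epochs):
        for batch in minibatches:
            lr = lr0 / (1.0 + decay * k)                          # step-length decay
            velocity = beta * velocity - lr * grad(theta, batch)  # momentum accumulation
            theta = theta + velocity                              # parameter update
            k += 1
    return theta

# Illustrative use: minimize f(theta) = ||theta||^2 / 2, whose gradient is theta.
theta_opt = sgd_momentum(lambda th, _: th, theta0=[2.0, -1.0], minibatches=range(100))
```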
Step-length decay is shown to be equivalent to simulated annealing, using a stochastic differential equation equivalent to the discrete parameter update. A consequence is to increase the minibatch size, instead of decaying the step length (Section 6.3.5). In particular, we obtain a new result for minibatch-size increase.
Highly popular adaptive step-length (learning-rate) methods are discussed in Section 6.5, first in a unified manner (Section 6.5.1), followed by the first adaptive method, AdaGrad [52] (Section 6.5.2).
Overlooked in (or unknown to) other review papers and even well-known books on deep learning, exponential smoothing of time series—the key technique of adaptive methods, originating in the field of forecasting and dating back to the 1950s—is carefully explained in Section 6.5.3.
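For concreteness (our sketch; the smoothing constant is an assumption), exponential smoothing of a series g_1, g_2, ... is the recursion s_t = beta * s_{t-1} + (1 - beta) * g_t, which underlies RMSProp, AdaDelta, and Adam:

```python
def exponential_smoothing(series, beta=0.9):
    """Exponentially weighted moving average of a time series, e.g., of squared
    gradients in adaptive step-length methods (Section 6.5.3). The most recent
    value receives weight (1 - beta); older values decay geometrically."""
    s, smoothed = 0.0, []
    for g in series:
        s = beta * s + (1.0 - beta) * g
        smoothed.append(s)
    return smoothed
```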
The first adaptive methods that employed exponential smoothing were RMSProp [53] (Section 6.5.4)
and AdaDelta [54] (Section 6.5.5), both introduced at about the same time, followed by the “immensely
successful” Adam (Section 6.5.6) and its variants (Sections 6.5.7 and 6.5.8).
Particular attention is then given to a recent criticism of adaptive methods in [55], revealing their
marginal value for generalization, compared to the good old SGD with effective initial step-length tun-
ing and step-length decay (Section 6.5.9). The results were confirmed in three recent independent papers,
among which is the recent AdamW adaptive method in [56] (Section 6.5.10).
Dynamics, sequential data, and sequence modeling are the subjects of Section 7. Discrete time-
dependent problems, as a sequence of data, can be modeled with recurrent neural networks discussed in
Section 7.1, using classic architectures such as the 1997 Long Short-Term Memory (LSTM) in Section 7.2,
but also the recent 2017-18 architectures, such as the transformer introduced in [31] (Section 7.4.3), based on the concept of attention [57]. Continuous recurrent neural networks, originally developed in neuroscience to model the brain, and their connection to their discrete counterparts in deep learning are also discussed in detail; see [19] and Section 13.2.2 on “Dynamic, time dependence, Volterra series”.
The features of several popular, open-source deep-learning frameworks and libraries—such as Tensor-
Flow, Keras, PyTorch, etc.—are summarized in Section 9.
As mentioned above, detailed formulations of deep learning applied to computational mechanics in [38]
[25] [26] [47] [48] are reviewed in Sections 10, 11, 12.
History of AI, limitations, danger, and the classics. Finally, a broader historical perspective of deep
learning, machine learning, and artificial intelligence is discussed in Section 13, ending with comments on
the geopolitics, limitations, and (identified-and-proven, not just speculated) danger of artificial intelligence
in Section 14.
A rare feature is a detailed review of some important classics, to connect them to the relevant concepts in the modern literature, sometimes revealing misunderstandings in recent works, likely due to a lack of verification of the assertions made against the corresponding classics. For example, the first artificial neural network, conceived by Rosenblatt (1957) [1], (1962) [2], had 1000 neurons, but was reported as having a single neuron (Figure 42). Going beyond probabilistic analysis, Rosenblatt even built the Mark I computer to implement his 1000-neuron network (Figure 133, Sections 13.2 and 13.2.1). Another example is the “heavy ball” method, for which everyone referred to Polyak (1964) [3], who more precisely called it the “small heavy sphere” method (Remark 6.6). Others were quick to dismiss the classical deterministic line-search methods that have been generalized to add stochasticity for network training (Remark 6.4). Unintended misrepresentation of the classics would mislead first-time learners, and unfortunately even seasoned researchers who used second-hand information from others without checking the original classics themselves.
The use of the Volterra series to model the nonlinear behavior of a neuron in terms of input and output firing rates, leading to continuous recurrent neural networks, is examined in detail. The linear term of the Volterra series is a convolution integral that provides a theoretical foundation for the use of a linear combination of inputs to a neuron, with weights and biases [19]; see Section 13.2.2.
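In standard form (our notation; the series itself is only referenced, not reproduced, in this excerpt), the Volterra expansion of the output firing rate y in terms of the input firing rate x begins with a constant term, analogous to the bias, and a convolution term, analogous to the weighted sum of inputs:

```latex
y(t) = h_0 + \int_{0}^{\infty} h_1(\tau)\, x(t-\tau)\, \mathrm{d}\tau
       + \text{higher-order (nonlinear) terms} ,
```

where $h_0$ plays the role of the bias and the kernel $h_1$ that of the weights.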
The experiments in the 1950s by Furshpan et al. [58] [59], which revealed rectified linear behavior in the neuronal axon, modeled as a circuit with a diode, together with the use of the rectified linear activation function in neural networks in neuroscience years before its adoption in deep-learning networks, are reviewed in Section 13.3.2.
Reference hypertext links and Internet archive. For the convenience of the readers, whenever we refer to an online article, we provide both the link to the original website and, where possible, the link to its archived version in the Internet Archive. For example, we included in the bibliography entry of Ref. [60] the links to both the original website and the Internet archive.10
Figure 3: ImageNet competitions (Section 2). Top (smallest) classification error rate versus competition year. A sharp decrease in the error rate in 2012 sparked a resurgence in AI interest and research [13]. By 2015, the top classification error rate had dropped below the human classification error rate of 5.1%, with the Parametric Rectified Linear Unit [61]; see Section 5.3.3 and also [62]. Figure from [63]. (Figure reproduced with permission of the authors.)
The 3-D shape of a protein, obtained by folding a linear chain of amino acids, determines how this protein
would interact with other molecules, and thus establishes its biological functions [64]. There are some 200
million proteins, the building blocks of life, in all living creatures, and 400,000 in the human body [64]. The
AlphaFold Protein Structure Database already contained “over 200 million protein structure predictions.”11
For comparison, there were only about 190 thousand protein structures obtained through experiments as
of 2022.07.28 [65]. “Some of AlphaFold’s predictions were on par with very good experimental models
[Figure 2, left], and potentially precise enough to detail atomic features useful for drug design, such as the
active site of an enzyme” [66]. The influence of this software and its developers “would be epochal.”
On the 2019 new-year day, The Guardian [67] reported the most recent breakthrough in AI, published
less than a month before on 2018 Dec 07 in the journal Science in [68] on the development of the software
AlphaZero, based on deep reinforcement learning (a combination of deep learning and reinforcement learn-
ing), that can teach itself through self-play, and then “convincingly defeated a world champion program in
the games of chess, shogi (Japanese chess), as well as Go”; see Figure 2, right.
Go is the most complex game that mankind ever created, with more combinations of possible moves than chess, and indeed more than the number of atoms in the observable universe.12 It is “the most challenging of classic
games for artificial intelligence [AI] owing to its enormous search space and the difficulty of evaluating
board positions and moves” [6].
11 See also the AlphaFold Protein Structure Database, Internet archived as of 2022.09.02.

12 The number of atoms in the observable universe is estimated at $10^{80}$. For a board game such as chess or Go, the number of possible sequences of moves is $m = b^d$, with $b$ being the game breadth (or “branching factor”, which is the “number of legal moves per position” or average number of moves at each turn), and $d$ the game depth (or length, also known as the number of “plies”). For chess, $b \approx 35$, $d \approx 80$, and $m = 35^{80} \approx 10^{123}$, whereas for Go, $b \approx 250$, $d \approx 150$, and $m = 250^{150} \approx 10^{360}$. See, e.g., “Go and mathematics”, Wikipedia, version 03:40, 13 June 2018; “Game-tree complexity”, Wikipedia, version 07:04, 9 October 2018; [6].
This breakthrough is the crowning achievement in a string of astounding successes of deep learning (and reinforcement learning) in taking on this difficult challenge for AI.13 The success of this recent breakthrough prompted an AI expert to declare closed the multidecade-long, arduous chapter of AI research to conquer immensely complex games such as chess, shogi, and Go, and to suggest that AI researchers consider a new generation of games to provide the next set of challenges [73].
In its long history, AI research went through several cycles of ups and downs, in and out of fashion, as
described in [74], ‘Why artificial intelligence is enjoying a renaissance’ (see also Section 13 on historical
perspective):
“THE TERM “artificial intelligence” has been associated with hubris and disappointment since
its earliest days. It was coined in a research proposal from 1956, which imagined that signif-
icant progress could be made in getting machines to “solve kinds of problems now reserved
for humans if a carefully selected group of scientists work on it together for a summer”. That
proved to be rather optimistic, to say the least, and despite occasional bursts of progress and
enthusiasm in the decades that followed, AI research became notorious for promising much
more than it could deliver. Researchers mostly ended up avoiding the term altogether, prefer-
ring to talk instead about “expert systems” or “neural networks”. But in the past couple of
years there has been a dramatic turnaround. Suddenly AI systems are achieving impressive
results in a range of tasks, and people are once again using the term without embarrassment.”
The recent resurgence of enthusiasm for AI research and applications dates only since 2012, with a spectacular success of almost halving the error rate in image classification in the ImageNet competition,14 going from 26% down to 16%; Figure 3 [63]. In 2015, the deep-learning error rate of 3.6% was smaller than the human-level error rate of 5.1%,15 and it then decreased by more than half, to 2.3%, by 2017.
The 2012 success16 of a deep-learning application, which brought renewed interest in AI research out
of its recurrent doldrums known as “AI winters”,17 is due to the following reasons:
• Availability of much larger datasets for training deep neural networks (i.e., for finding optimized parameters). It is possible to say that without ImageNet, there would have been no spectacular success in 2012, and thus no resurgence of AI. Once the importance of having large datasets for developing versatile, working deep networks was realized, many more large datasets were developed. See, e.g., [60].
• Emergence of more powerful computers than in the 1990s, e.g., the graphical processing unit (GPU), “which packs thousands of relatively simple processing cores on a single chip” for use in processing and displaying complex imagery, and in providing fast actions in today’s video games [77].
13 See [69] [6] [70] [71]. See also the film AlphaGo (2017), “an excellent and surprisingly touching documentary about one of the great recent triumphs of artificial intelligence, Google DeepMind’s victory over the champion Go player Lee Sedol” [72], and “AlphaGo versus Lee Sedol,” Wikipedia version 14:59, 3 September 2022.

14 “ImageNet is an online database of millions of images, all labelled by hand. For any given word, such as “balloon” or “strawberry”, ImageNet contains several hundred images. The annual ImageNet contest encourages those in the field to compete and measure their progress in getting computers to recognise and label images automatically” [75]. See also [62] and [60] for a history of the development of ImageNet, which played a critical role in the resurgence of interest and research in AI by paving the way for the above-mentioned 2012 spectacular success in reducing the error rate in image recognition.

15 For a report on the human image-classification error rate of 5.1%, see [76] and [62], Table 10.

16 Actually, the first success of deep learning occurred three years earlier, in 2009, in speech recognition; see Section 2 regarding the historical perspective on the resurgence of AI.

17 See [74].
• Advanced software infrastructure (libraries) that facilitates faster development of deep-learning ap-
plications, e.g., TensorFlow, PyTorch, Keras, MXNet, etc. [78], p. 25. See Section 9 on some reviews
and rankings of deep-learning libraries.
• Larger neural networks and better training techniques (i.e., for optimizing network parameters) that were not available in the 1980s. Today’s much larger networks, which can solve once intractable / difficult problems, are “one of the most important trends in the history of deep learning”, but are still much smaller than the nervous system of a frog [78], p. 21; see also Section 4.6. A 2006 breakthrough, ushering in the dawn of a new wave of AI research and interest, has allowed for the efficient training of deeper neural networks [78], p. 18.18 The training of large-scale deep neural networks, which frequently involves highly nonlinear and non-convex optimization problems with many local minima, owes its success to the use of the stochastic gradient-descent method, first introduced in the 1950s [80].
• Successful applications to difficult, complex problems that help people in their every-day lives, e.g.,
image recognition, speech translation, etc.
⋆ In medicine, AI “is beginning to meet (and sometimes exceed) assessments by doctors in various
clinical situations. A.I. can now diagnose skin cancer like dermatologists, seizures like neurologists,
and diabetic retinopathy like ophthalmologists. Algorithms are being developed to predict which
patients will get diarrhea or end up in the ICU,19 and the FDA20 recently approved the first machine
learning algorithm to measure how much blood flows through the heart—a tedious, time-consuming
calculation traditionally done by cardiologists.” Doctors lamented that they spent “a decade in medi-
cal training learning the art of diagnosis and treatment,” and were now easily surpassed by computers
[81]. “The use of artificial intelligence is proliferating in American health care—outpacing the de-
velopment of government regulation. From diagnosing patients to policing drug theft in hospitals, AI
has crept into nearly every facet of the health-care system, eclipsing the use of machine intelligence
in other industries” [82].
⋆ In micro-lending, AI has helped the Chinese company SmartFinance reduce the default rate on more than 2 million loans per month to low single digits, a track record that makes traditional brick-and-mortar banks extremely jealous [83].
⋆ In the popular TED talk “How AI can save humanity” [84], the speaker alluded to the above-
mentioned 2006 breakthrough ([78], p. 18) that marked the beginning of the “deep learning” wave of
AI research when he said:21 “About 10 years ago, the grand AI discovery was made by three North
American scientists,22 and it’s known as deep learning”.
Section 13 provides a historical perspective on the development of AI, with additional details on current and future applications.
18 The authors of [13] cited this 2006 breakthrough paper by Hinton, Osindero & Teh in their reference no. 32, with the mention “This paper introduced a novel and effective way of training very deep neural networks by pre-training one hidden layer at a time using the unsupervised learning procedure for restricted Boltzmann machines (RBMs).” A few years later, it was found that RBMs were not necessary to train deep networks, as it was sufficient to use rectified linear units (ReLUs) as activation functions ([79], interview with Y. Bengio); see also Section 4.4.2 on activation functions. For this reason, we are not reviewing RBMs here.

19 Intensive Care Unit.

20 Food and Drug Administration.

21 At video time 1:51. In less than a year, this 2018 April TED talk had more than two million views as of 2019 March.

22 See Footnote 18 for the names of these three scientists.
It was, however, disappointing that, despite the above-mentioned exciting outcomes of AI, during the Covid-19 pandemic beginning in 2020,23 none of the hundreds of AI systems developed for Covid-19 diagnosis was usable for clinical applications; see Section 13.5.1. As of June 2022, the Tesla electric-vehicle autopilot system was under increased scrutiny by the National Highway Traffic Safety Administration, as there had been “16 crashes into emergency vehicles and trucks with warning signs, causing 15 injuries and one death.”24 In addition, there are many limitations and dangers in the current state of the art of AI; see Section 14.
p × q = m ⇒ p = m / q    (1)
Another example is the hand-written multiplication work below by the same pupil:
23 “The World Health Organization declares COVID-19 a pandemic” on 2020 Mar 11; CDC Museum COVID-19 Timeline, Internet archive 2022.06.02.

24 Krisher T., Teslas with Autopilot a step closer to recall after wrecks, Associated Press, 2022.06.10.

25 We thank Kerem Uguz for informing the senior author LVQ about Mathpix.
Questions would immediately arise in the mind of first-time learners: Are ML and AI two different
fields, or the same fields with different names? If one field is a subset of the other, then would it be more
general to just refer to the larger set? On the other hand, would it be more specific to just refer to the subset?
In fact, Deep Learning is a subset of methods inside a larger set of methods known as Machine Learning,
which in itself is a subset of methods generally known as Artificial Intelligence. In other words, Deep
Learning is Machine Learning, which is Artificial Intelligence; [78], p. 9.28 On the other hand, Artificial
Intelligence is not necessarily Machine Learning, which in itself is not necessarily Deep Learning.
The review in [85] was restricted to Neural Networks (which could be deep or shallow)29 and Support
Vector Machine (which is Machine Learning, but not Deep Learning); see Figure 6. Deep Learning can be
thought of as multiple levels of composition, going from simpler (less abstract) concepts (or representations)
to more complex (abstract) concepts (or representations).30
Based on the above relationship between AI, ML, and DL, it would be much clearer if the phrase “machine learning (ML) and artificial intelligence (AI)” in both the title of [85] and the original sentence quoted above were replaced by the phrase “machine learning (ML)”, to be more specific, since the authors mainly reviewed Multi-Layer Neural (MLN) networks (deep learning, and thus machine learning) and Support Vector Machine (machine learning).31 The MultiLayer Neural (MLN) network is also known as the MultiLayer Perceptron
26 Mathpix Snip “misunderstood” that the top horizontal line was part of a fraction, and upon correction of this “misunderstanding” and font-size adjustment yielded the equation image shown in Eq. (2).

27 We are only concerned with NNs, not SVMs, in the present paper.

28 References to books are accompanied by page numbers for the specific information cited here, so readers don’t waste time wading through an 800-page book to look for such information.

29 Network depth and size are discussed in Section 4.6.1. An example of a shallow network with one hidden layer can be found in Section 12.4 on nonlinear-manifold model-order reduction applied to fluid mechanics.

30 See, e.g., [78], p. 5, p. 8, p. 14.

31 For more on Support Vector Machines (SVMs), see [78], p. 137. In the early 1990s, SVM displaced neural networks with backpropagation as the better method for the machine-learning community ([79], interview with G. Hinton). The resurgence of AI due to advances in deep learning started with the seminal paper [86], in which the authors demonstrated via numerical experiments that the MLN network was better than SVM in terms of error in the handwriting-recognition benchmark test using the MNIST handwritten-digit database, which contains “a training set of 60,000 examples, and a test set of 10,000 examples.” But kernel methods studied for the development of SVM have now been used in connection with networks of infinite width to understand how deep learning works; see Section 8 on “Kernel machines” and Section 14.2 on “Lack of understanding.”
Figure 6: Artificial intelligence and subfields (Section 2.2). Three classes of methods—Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL)—and their relationship, with an example of a method in each class. A knowledge-based method is an AI method, but is neither an ML method nor a DL method. Support Vector Machine and spiking computing are ML methods, and thus AI methods, but not DL methods. The Multi-Layer Neural (MLN) network is a DL method, and is thus both an ML method and an AI method. See also Figure 158 in Appendix 4 on Cybernetics, which encompassed all of the above three classes.
(MLP).32 Both MLN networks and SVMs are considered artificial intelligence, which in itself is too broad and thus not specific enough.
Another reason for simplifying the title in [85] is that the authors did not consider using any other AI methods beyond these two specific ML methods, even though they discussed AI in the general historical context.
The engine of neuromorphic computing, also known as spiking computing, is a hardware network built into the IBM TrueNorth chip, which contains “1 million programmable spiking neurons and 256 million configurable synapses”,33 and consumes “extremely low power” [87]. Despite the apparent difference from the software approach of deep learning, a neuromorphic chip could implement deep-learning networks, and thus the difference is not fundamental [88]. There is thus an overlap between neuromorphic computing and deep learning, as shown in Figure 6, instead of two disconnected subfields of machine learning, as reported in [20].34
32 See [78], p. 5.

33 The neurons are the computing units, and the synapses the memory. Instead of grouping the computing units into a central processing unit (CPU) separated from the memory, and connecting the CPU and the memory via a bus—which creates a communication bottleneck—each neuron in the TrueNorth chip, like in the brain, has its own synapses (local memory).

34 In [20], there was only a reference to [87], but not to [88]. It is likely that the authors of [20] were not aware of [88], and thus of the intersection between neuromorphic computing and deep learning.
Figure 7: Feedforward neural network (Section 2.3.1). A feedforward neural network in [38],
rotated clockwise by 90 degrees to compare to its equivalent in Figure 23 and Figure 35 further
below. All terminologies and fundamental concepts will be explained in detail in subsequent
sections as listed. See Section 4.1.1 for a top-down explanation and Section 4.1.2 for bottom-
up explanation. This figure of a network could be confusing to first-time learners, as already
indicated in Footnote 5. (Figure reproduced with permission of the authors.)
italics) employed in these papers, together with the corresponding sections in the present paper where these concepts are explained in detail. First-time learners of deep learning will likely find these fundamental concepts described by obscure technical jargon, whose meaning will be explained in detail in the identified subsequent sections. Experts of deep learning would understand how deep learning is applied to computational mechanics.
(1) Application 1.1: For each element (particularly distorted elements), find the number of integration
points that provides accurate integration within a given error tolerance. Section 10.2 contains the
details.
(2) Application 1.2: Uniformly use 2 × 2 × 2 integration points for all elements, distorted or not, and
find the appropriate quadrature weights36 (different from the traditional quadrature weights of the
Gauss-Legendre method) at these integration points. Section 10.3 contains the details.
To train37 the networks—i.e., to optimize the network parameters (weights and biases, Figure 8) to minimize some loss (cost, error) function (Sections 5.1, 6)—up to 20,000 randomly distorted hexahedrals were generated by displacing nodes from a regularly shaped element [38]; see Figure 9. For each distorted shape, the following were determined: (1) the minimum number of integration points required to reach a prescribed accuracy, and (2) corrections to the quadrature weights, obtained by trying one million randomly generated sets of correction factors, among which the best one was retained.

35 MLN is also called MultiLayer Perceptron (MLP); see Footnote 32.

36 The quadrature weights at integration points are not to be confused with the network weights in an MLN network.

37 See Section 6 on “Network training, optimization methods”.
Figure 8: Artificial neuron (Section 2.3.1). A neuron with its multiple inputs $O_i^{p-1}$ (which are outputs from the previous layer $(p-1)$, hence the variable name “O”), processing operations (multiply the inputs with the network weights $w_{ji}^{p-1}$, sum the weighted inputs, add the bias $\theta_j^p$, apply the activation function $f$), and single output $O_j^p$ [38]. See the equivalent Figure 36, Section 4.4.3 further below. All terminologies and fundamental concepts will be explained in detail in subsequent sections as listed. See Section 4 on feedforward networks, Section 4.1.1 on top-down explanation and Section 4.1.2 on bottom-up explanation. This figure of a neuron could be confusing to first-time learners, as already indicated in Footnote 5. (Figure reproduced with permission of the authors.)
Figure 9: Cube and distorted cube elements (Section 2.3.1). Regular and distorted linear hexahedral elements [38]. (Figure reproduced with permission of the authors.)
While Application 1.1 used one fully-connected (Section 4.6.1) feedforward neural network (Section 4), Application 1.2 relied on two neural networks: The first neural network was a classifier that took the element shape (18 normalized nodal coordinates) as input and estimated whether or not the numerical integration (quadrature) could be improved by adjusting the quadrature weights for the given element (one output), i.e., the network classifier only produced two outcomes, yes or no. If an error reduction was possible, a second neural network performed regression to predict the corrected quadrature weights (eight outputs for 2 × 2 × 2 quadrature) from the input element shape (usually distorted).
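As an illustration of this two-network setup (a minimal sketch of ours, not the code of [38]; the hidden-layer sizes and ReLU activation below are assumptions, and [38] actually employed the sigmoid activation to obtain its results, Section 10), in Keras:

```python
import tensorflow as tf

# Network 1, classifier: can quadrature be improved for this element shape?
# Input: 18 normalized nodal coordinates; output: one probability (yes / no).
classifier = tf.keras.Sequential([
    tf.keras.Input(shape=(18,)),
    tf.keras.layers.Dense(32, activation="relu"),   # hidden size is an assumption
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
classifier.compile(optimizer="adam", loss="binary_crossentropy")

# Network 2, regression: predict the 8 corrected quadrature weights
# for the 2 x 2 x 2 integration points of a (usually distorted) element.
regressor = tf.keras.Sequential([
    tf.keras.Input(shape=(18,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(8),                       # linear output layer
])
regressor.compile(optimizer="adam", loss="mse")     # mean squared error (Section 5.1.1)
```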
To train the classifier network, 10,000 element shapes were selected from the prepared dataset of 20,000
hexahedrals, which were divided into a training set and a validation set (Section 6.1) of 5000 elements
each.38
38 For the definitions of training set and test set, see Section 6.1. Briefly, the training set is used to optimize the network parameters, whereas the test set is used to check whether the resulting network works on examples never seen before.
To train the second regression network, 10,000 element shapes were selected for which quadrature
could be improved by adjusting the quadrature weights [38].
Again, the training set and the test set comprised 5000 elements each. The parameters of the neural
networks (weights, biases, Figure 8, Section 4.4) were optimized (trained) using a gradient descent method
(Section 6) that minimizes a loss function (Section 5.1), whose gradients with respect to the parameters are
computed using backpropagation (Section 5).
If the error-reduction ratio is less than 1, the integration using the predicted quadrature weights is more accurate than that using the standard quadrature weights. To compute the two quadrature errors mentioned above (one for the predicted quadrature weights and one for the standard quadrature weights, both for the same 2 × 2 × 2 integration points), the reference values considered as most accurate were obtained using 30 × 30 × 30 integration points with the standard quadrature weights; see Eq. (401) in Section 10.2 with qmax = 30.
For most element shapes of both the training set (a) and the test set (b), each of which comprised 5000
elements, the blue bars in Figure 10 indicate an error ratio below one, i.e., the quadrature weight correction
effectively improved the accuracy of numerical quadrature.
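The error-reduction ratio can be illustrated as follows (our sketch under simplifying assumptions: a smooth separable integrand on the reference cube and a uniform toy adjustment of the weights, whereas [38] predicts an individual correction for each weight):

```python
import numpy as np

def gauss_integrate(f, n, adjust=None):
    """Integrate f over the reference cube [-1,1]^3 with n x n x n Gauss-Legendre
    points; 'adjust' optionally modifies the standard 1-D quadrature weights."""
    x, w = np.polynomial.legendre.leggauss(n)
    if adjust is not None:
        w = adjust(w)                              # adjusted quadrature weights
    X, Y, Z = np.meshgrid(x, x, x, indexing="ij")
    W = np.einsum("i,j,k->ijk", w, w, w)           # tensor-product 3-D weights
    return np.sum(W * f(X, Y, Z))

f = lambda x, y, z: np.exp(x + y + z)              # hypothetical smooth integrand
ref = gauss_integrate(f, 30)                       # 30x30x30 reference, standard weights

err_std = abs(gauss_integrate(f, 2) - ref)                       # 2x2x2, standard weights
err_adj = abs(gauss_integrate(f, 2, lambda w: 1.003 * w) - ref)  # toy enlarged weights
print(err_adj / err_std)  # error-reduction ratio; a value < 1 means improvement
```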
Readers familiar with Deep Learning and neural networks can go directly to Section 10, where the details of the formulations in [38] are presented. Other sections are also of interest, such as the classic and state-of-the-art optimization methods in Section 6, attention and the transformer unit in Section 7, the historical perspective in Section 13, and the limitations and dangers of AI in Section 14.
Readers not familiar with Deep Learning and neural networks will find below a list of the concepts that
will be explained in subsequent sections. To facilitate the reading, we also provide the section number (and
the link to jump to) for each concept.
(1) Feedforward neural network (Figure 7): Figure 23 and Figure 35, Section 4
(2) Neuron (Figure 8): Figure 36 in Section 4.4.4 (artificial neuron), and Figure 131 in Section 13.1
(biological neuron)
(7) What is “deep” in “deep networks” ? Size, architecture, Section 4.6.1, Section 4.6.2
(11) Training error, validation error, test (or generalization) error: Section 6.1
This list is continued further below in Section 2.3.2. Details of the formulation in [38] are discussed in
Section 10.
Figure 11: Dual-porosity single-permeability medium (Section 2.3.2). Left: Actual reservoir.
Dual (or double) porosity indicates the presence of two types of porosity in naturally-fractured
reservoirs (e.g., of oil): (1) Primary porosity in the matrix (e.g., voids in sands) with low per-
meability, within which fluid does not flow, (2) Secondary porosity due to fractures and vugs
(cavities in rocks) with high (anisotropic) permeability, within which fluid flows. Fluid ex-
change is permitted between the matrix and the fractures, but not between the matrix blocks
(sugar cubes), of which the permeability is much smaller than in the fractures. Right: Model
reservoir, idealization. The primary porosity is an array of cubes of homogeneous, isotropic
material. The secondary porosity is an “orthogonal system of continuous, uniform fractures”,
oriented along the principal axes of anisotropic permeability [89]. (Figure reproduced with per-
mission of the publisher SPE.)
40 Porosity is the ratio of void volume over total volume. Permeability is a scaling factor, which, when multiplied by the negative of the pressure gradient and divided by the fluid dynamic viscosity, gives the fluid velocity in Darcy’s law, Eq. (409). The expression “dual-porosity dual-permeability poromechanics problem” used in [25], p. 340, could confuse first-time readers—especially those familiar with traditional reservoir simulation, e.g., in [92]—since dual porosity (also called “double porosity” in [89]) and dual permeability are two different models of naturally-fractured porous media; these two models were studied in [93] for radionuclide transport around a nuclear waste repository. Further adding to the confusion, the dual-porosity model is more precisely called the dual-porosity single-permeability model, whereas the dual-permeability model is called the dual-porosity dual-permeability model [94], which has a different meaning than the one used in [25].
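Eq. (409) is not reproduced in this excerpt; in the standard form matching the footnote’s description, Darcy’s law reads:

```latex
\mathbf{v} = -\frac{k}{\mu}\,\nabla p ,
```

with $\mathbf{v}$ the fluid velocity, $k$ the permeability, $\mu$ the fluid dynamic viscosity, and $\nabla p$ the pressure gradient.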
Figure 12: Pore structure of Majella limestone, dual porosity (Section 2.3.2), a carbonate rock with high total porosity at 30%. Backscattered SEM images of Majella limestone: (a)-(c) a sequence of successively zoomed-in views; (d) a zoomed-out view. (a) The larger macropores (dark areas) have dimensions comparable to the grains (allochems), having an average diameter of 54 µm, with macroporosity at 11.4%. (b) Micropores embedded in the grains and cemented regions, with microporosity at 19.6%, which is equal to the total porosity at 30% minus the macroporosity. (c) Numerous micropores in the periphery of a macropore. (d) Map drawn manually under an optical microscope showing the partitioning of grains, matrix (mostly cement), and porosity [90]. See Section 11 and Remark 11.8. (Figure reproduced with permission of the authors.)
oil-water mixture, one-phase oil-solvent mixture). Fluid exchange is permitted between the rock matrix and
the fracture system, but not between the matrix blocks. In the DPSP model, the fracture system and the
rock matrix, each has its own porosity, with values not differing from each other by a large factor. On the
contrary, the permeability of the fracture system is much larger than that in the rock matrix, and thus the
system is considered as having only a single permeability. When the permeability of the fracture system
and that of the rock matrix do not differ by a large factor, then both permeabilities are included in the more
general dual-porosity dual-permeability (DPDP) model [94].
Since 60% of the world’s oil reserves and 40% of the world’s gas reserves are held in carbonate rocks, there has been a clear interest in developing an understanding of the mechanical behavior of carbonate rocks such as limestones, which range from low porosity (Solenhofen at 3%) to high porosity (e.g., Majella at 30%). Chalk (Lixhe) is the carbonate rock with the highest porosity, at 42.8%. Carbonate rock reservoirs are also considered for storing carbon dioxide and nuclear waste [95] [93].
In oil-reservoir simulations in which the primary interest is the flow of oil, water, and solvent, the
porosity (and pore size) within each domain (rock matrix or fracture system) is treated as constant and
homogeneous [94] [96].41 On the other hand, under mechanical stress, the pore size would change, cracks
and other defects would close, leading to a change in the porosity in carbonate rocks. Indeed, “at small
stresses, experimental mechanical deformation of carbonate rock is usually characterized by a non-linear
stress-strain relationship, interpreted to be related to the closure of cracks, pores, and other defects. The
41 See, e.g., [94], p. 295, Chap. 9 on “Advanced Topics: Fluid Flow in Fractured Reservoirs and Compositional Simulation”.
Figure 13: Majella limestone, nonlinear stress-strain relations (Section 2.3.2). Differential
stress (i.e., the difference between the largest principal stress and the smallest one) vs axial
strain (left) and vs volumetric strain (right) [90]. See Remark 11.7, Section 11.3.4, and Re-
mark 11.10, Section 11.3.5. (Figure reproduced with permission of the authors.)
non-linear stress-strain relationship can be related to the amount of cracks and various type of pores” [95],
p. 202. Once the pores and cracks are closed, the stress-strain relation becomes linear, at different stress
stages, depending on the initial porosity and the geometry of the pore space [95].
Moreover, pores have different sizes, and can be classified into different pore sub-systems. For the Ma-
jella limestone in Figure 12 with total porosity at 30%, its pore space can be partitioned into two subsystems
(and thus dual porosity), the macropores with macroporosity at 11.4% and the micropores with microp-
orosity at 19.6%. Thus the meaning of dual-porosity as used in [25] is different from that in oil-reservoir
simulation. Also characteristic of porous rocks such as the Majella limestone is the non-linear stress-strain
relation observed in experiments, Figure 13, due to the changing size, and collapse, of the pores.
Likewise, the meaning of “dual permeability” is different in [25] in the sense that “one does not seek
to obtain a single effective permeability for the entire pore space”. Even though it was not explicitly spelled
out,42 it appears that each of the two pore sub-systems would have its own permeability, and that fluid is
allowed to exchange between the two pore sub-systems, similar to the fluid exchange between the rock
matrix and the fracture system in the DPSP and DPDP models in oil-reservoir simulation [94].
In the problem investigated in [25], the presence of localized discontinuities demands three scales—
microscale (µ), mesoscale (cm), macroscale (km)—to be considered in the modeling, see Figure 14. Classi-
cal approaches to consistently integrate microstructural properties into macroscopic constitutive laws relied
on hierarchical simulation models and homogenization methods (e.g., discrete element method (DEM)–
FEM coupling, FEM^2). If more than two scales were to be considered, the computational complexity would
become prohibitively, if not intractably, large.
Instead of coupling multiple simulation models online, two (adjacent) scales were linked by a neural
network that was trained offline using data generated by simulations on the smaller scale [25]. The trained
42 At least at the beginning of Section 2 in [25].
43 The LSTM variant with peephole connections is not the original LSTM cell (Section 7.2); see, e.g., [97]. The equations describing the LSTM unit in [25], whose authors never mentioned the word "peephole", correspond to the original LSTM without peepholes. It was likely a mistake to use this figure in [25].
Figure 14 (reproducing Figure 6 of [25]): Comparison between off-line pre-trained multiscale ANN-FEM simulations and online hierarchical multiscale simulations (Section 2.3.2).
network subsequently served as a surrogate model in online simulations on the larger scale. With three
scales being considered, two recurrent neural networks (RNNs) with Long Short-Term Memory (LSTM)
units were employed consecutively:
(1) Mesoscale RNN with LSTM units: On the microscopic scale, a representative volume element (RVE) was an assembly of discrete-element particles, subjected to a large variety of representative loading paths to generate training data for the supervised learning of the mesoscale RNN with LSTM units, a neural network that was referred to as "Mesoscale data-driven constitutive model" [25] (Figure 14). Homogenizing the results of the DEM-flow model provided constitutive equations for the traction-separation law and the evolution of anisotropic permeabilities in damaged regions.
(2) Macroscale RNN with LSTM units: The mesoscale RVE (middle row in Figure 14), in turn, was a
finite-element model of a porous material with embedded strong discontinuities equivalent to the frac-
ture system in oil-reservoir simulation in Figure 11. The host matrix of the RVE was represented by an
isotropic linearly elastic solid. In localized fracture zones within, the traction-separation law and the
hydraulic response were provided by the mesoscale RNN with LSTM units developed above. Train-
ing data for the macroscale RNN with LSTM units—a network referred to as “Macroscale data-driven
Figure 15: LSTM variant with “peephole” connections, block diagram (Sections 2.3.2, 7.2).43
Unlike the original LSTM unit (see Section 7.2), both the input gate and the forget gate in
an LSTM unit with peephole connections receive the cell state as input. The above figure
from Wikipedia, version 22:56, 4 October 2015, is identical to Figure 10 in [25], whose authors erroneously used this figure (without mentioning its source), since what they actually used was the original LSTM unit without "peepholes"; see the detailed block diagram in Figure 81, Section 7.2. See also Figure 82 and Figure 117 for the original LSTM unit applied to fluid mechanics. (CC-BY-SA 4.0)
constitutive model" [25]—was generated by computing the (homogenized) response of the mesoscale
RVE to various loadings. In macroscopic simulations, the mesoscale RNN with LSTM units provided
the constitutive response at a sealing fault that represented a strong discontinuity.
Path-dependence is a common characteristic feature of the constitutive models that are often realized as neural networks; see, e.g., [23]. For this reason, it was decided to employ an RNN with LSTM units, which could mimic internal variables and corresponding evolution equations that are intrinsic to path-dependent material behavior [25]. These authors chose a neural network with a depth of two hidden layers and 80 LSTM units per layer, which had proved to be a good compromise between performance and training effort. After each hidden layer, a dropout layer with a dropout rate of 0.2 was introduced to reduce overfitting on noisy data, but it yielded only minor effects, as reported in [25]. The output layer was a fully-connected layer with a logistic sigmoid as activation function.
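As an illustration, a minimal PyTorch sketch of a network with the stated architecture (two stacked hidden layers of 80 LSTM units, dropout rate 0.2 after each hidden layer, fully-connected sigmoid output) could read as follows; the input and output widths n_in and n_out are hypothetical placeholders, not values from [25]:

import torch
import torch.nn as nn

class LSTMSurrogate(nn.Module):
    """Sketch of a data-driven constitutive network in the spirit of [25]:
    two hidden layers of 80 LSTM units, dropout rate 0.2 after each hidden
    layer, and a fully-connected output layer with logistic sigmoid."""
    def __init__(self, n_in=6, n_out=3, n_hidden=80):  # widths are hypothetical
        super().__init__()
        # dropout=0.2 acts on the outputs of the first stacked LSTM layer
        self.lstm = nn.LSTM(input_size=n_in, hidden_size=n_hidden,
                            num_layers=2, dropout=0.2, batch_first=True)
        self.dropout = nn.Dropout(0.2)   # dropout after the second LSTM layer
        self.fc = nn.Linear(n_hidden, n_out)

    def forward(self, x):                # x: (batch, time, n_in) loading paths
        h, _ = self.lstm(x)              # h: (batch, time, n_hidden)
        return torch.sigmoid(self.fc(self.dropout(h)))

model = LSTMSurrogate()
y = model(torch.randn(16, 100, 6))       # 16 sample paths, 100 time steps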
An important observation is that including micro-structural data—the porosity ϕ, the coordination num-
ber CN (number of contact points, Figure 16), the fabric tensor (defined based on the normals at the
contact points, Eq. (404) in Section 11; Figure 16 provides a visualization)—as network inputs signifi-
cantly improved the prediction capability of the neural network. Such improvement is not surprising since
soil fabric—described by scalars (porosity, coordination number, particle size) and vectors (fabric tensors,
particle orientation, branch vectors)—exerts great influence on soil behavior [99]. Coordination number44
has been used to predict soil particle breakage [100], morphology and crushability [99], and in a study of
internally-unstable soil involving a mixture of coarse and fine particles [101]. Fabric tensors, with theoreti-
cal foundation developed in [102], provide a means to represent directional data such as normals at contact
points, even though other types of directional data have been proposed to develop fabric tensors [103].
To model anisotropic behavior of granular materials, contact-normal fabric tensor was incorporated in an
isotropic constitutive law.
44 The coordination number (Wikipedia version 20:43, 28 July 2020) is a concept originating from chemistry, signifying the number of bonds from the surrounding atoms to a central atom. In Figure 16 (a), the uranium borohydride U(BH4)4 complex (Wikipedia version 08:38, 12 March 2019) has 12 hydrogen atoms bonded to the central uranium atom.
Figure 16: Coordination number CN (Section 2.3.2, 11.3.2). (a) Chemistry. Number of bonds
to the central atom. Uranium borohydride U(BH4 )4 has CN = 12 hydrogen bonds to uranium.
(b, c) Photoelastic discs showing number of contact points (coordination number) on a particle.
(b) Random packing and force chains, different force directions along principal chains and
in secondary particles. (c) Arches around large pores, precarious stability around pores. The
coordination number for the large disc (particle) in red square is 5, but only 4 of those had non-
zero contact forces based on the bright areas showing stress action. Figures (b, c) also provide
a visualization of the “flow” of the contact normals, and thus the fabric tensor [98]. See also
Figure 17. (Figure reproduced with permission of the author.)
Figure 17 illustrates the importance of incorporating microstructure data, particularly the fabric tensor,
in network training to improve prediction accuracy.
Deep-learning concepts to explain and explore: (continued from above in Section 2.3.1)
(15) Dropout layer and dropout rate,45 which had minor effects in the particular work reported in [25],
and thus will not be covered here. See [78], p. 251, Section 7.12.
Figure 17: Network with LSTM and microstructure data (porosity ϕ, coordination number
CN = Nc , Figure 16, fabric tensor F = AF · AF , Eq. (404)) (Section 2.3.2, 11.3.2). Simple
shear test using Discrete Element Method to provide network training data under loading-
unloading conditions. ANN = Artificial Neural Network with no LSTM units. While the network with LSTM units and (ϕ, CN, AF) improved the predicted traction compared to networks with LSTM units and only (ϕ) or (ϕ, CN), the latter two networks produced predicted tractions that were worse than those of the network with LSTM alone, indicating the important role of the fabric tensor, which contained directional data absent in scalar quantities like (ϕ, CN) [25].
(Figure reproduced with permission of the author.)
basis {ϕ1 (x), ϕ2 (x), . . . ϕ∞ (x)} from high-resolution data obtained from high-fidelity models or measure-
ments, in which x is a point in a 3-D fluid domain B; see Section 12.1. A flow dynamic quantity u(x, t),
such as a component of the flow velocity field, can be projected on the POD basis by separation of variables
as (Figure 18)
u(x, t) = \sum_{i=1}^{i=\infty} \phi_i(x) \, \alpha_i(t) \approx \sum_{i=1}^{i=k} \phi_i(x) \, \alpha_i(t) , with k < \infty , (3)
where k is a finite number, which could be large, and αi (t) a time-dependent coefficient for ϕi (x). The
computation would be more efficient if a much smaller subset with, say, m ≪ k, POD basis functions,
u(x, t) \approx \sum_{j=1}^{j=m} \phi_{i_j}(x) \, \alpha_{i_j}(t) , with m \ll k and i_j \in \{1, \ldots, k\} , (4)
where {i_j, j = 1, . . . , m} is a subset of indices in the set {1, . . . , k}, chosen such that the approximation in Eq. (4) has minimum error compared to the approximation in Eq. (3)_2 using the much larger set of k POD basis functions.
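Numerically, the POD basis in Eqs. (3)-(4) is typically computed via the singular value decomposition (SVD) of the snapshot matrix; see Section 12.1. The following is a minimal numpy sketch, with random data standing in for high-fidelity snapshots, and with hypothetical sizes N and k:

import numpy as np

N, k = 2000, 500                     # grid points, snapshots (hypothetical)
U = np.random.randn(N, k)            # stand-in for snapshots u(x, t_j)

# POD basis via the thin SVD: columns of Phi are the modes phi_i(x);
# the singular values rank the modes by energy content.
Phi, sigma, VT = np.linalg.svd(U, full_matrices=False)

m = 5                                # retain the m most dominant modes, Eq. (4)
Phi_m = Phi[:, :m]                   # reduced-order basis
alpha = Phi_m.T @ U                  # coefficients alpha_i(t_j) by projection
U_rom = Phi_m @ alpha                # reduced-order reconstruction of u(x, t)

err = np.linalg.norm(U - U_rom) / np.linalg.norm(U)   # relative error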
In a Galerkin-Projection (GP) approach to reduced-order modeling, a small subset of dominant modes forms a basis onto which high-dimensional differential equations are projected to obtain a set of lower-dimensional differential equations for cost-efficient computational analysis.
Instead of using GP, RNNs (Recurrent Neural Networks) were used in [26] to predict the evolution of
fluid flows, specifically the coefficients of the dominant POD modes, rather than solving differential equa-
tions. For this purpose, their LSTM-ROM (Long Short-Term Memory - Reduced Order Model) approach
combined concepts of ROM based on POD with deep-learning neural networks using either the original
Figure 18: Reduced-order POD basis (Sections 2.3.3, 12.1). For each dataset (also Fig-
ure 116), which contained k snapshots, the full POD reconstruction of the flow-field dynamical
quantity u(x, t), where x is a point in the 3-D flow field, consists of all k basis functions ϕi (x),
with i = 1, . . . , k, using Eq. (3); see also Eq. (439). Typically, k is large; a reduced-order POD
basis consists of selecting m ≪ k basis functions for the reconstruction, with the smallest error
possible. See Figure 19 for the use of deep-learning networks to predict αi (t + t′ ), with t′ > 0,
given αi (t) [26]. (Figure reproduced with permission of the author.)
LSTM units, Figure 117 (left) [24], or the bidirectional LSTM (BiLSTM), Figure 117 (right) [104], the
internal states of which were well-suited for the modeling of dynamical systems.
To obtain the training/testing data crucial to train/test neural networks, the data from transient 3-D direct numerical simulations (DNS) of two physical problems, as provided by the Johns Hopkins turbulence database [105], were used [26]: (1) the Forced Isotropic Turbulence (ISO) and (2) the Magnetohydrodynamic Turbulence (MHD).
To generate training data for LSTM/BiLSTM networks, the 3-D turbulent fluid flow domain of each
physical problem was decomposed into five equidistant 2-D planes (slices), with one additional equidistant 2-D plane serving to generate testing data (Section 12, Figure 116, Remark 12.1). For the same subregion
in each of those 2-D planes, POD was applied on the k snapshots of the velocity field (k = 5, 023 for ISO,
k = 1, 024 for MHD, Section 12.1), and out of k POD modes {ϕi (x), i = 1, . . . , k}, the five (m = 5 ≪ k)
most dominant POD modes {ϕi (x), i = 1, . . . , m} representative of the flow dynamics (Figure 18) were
retained to form a reduced-order basis onto which the velocity field was projected. The coefficient αi (t) of
the POD mode ϕi (x) represented the evolution of the participation of that mode in the velocity field, and
was decomposed into thousands of small samples using a moving window. The first half of each sample was
used as input signal to an LSTM network, whereas the second half of the sample was used as output signal
for supervised training of the network. Two different methods were proposed [26]:
(1) Multiple-network method: Use an RNN for each coefficient of the dominant POD modes
(2) Single-network method: Use a single RNN for all coefficients of the dominant POD modes
For both methods, variants with the original LSTM units or the BiLSTM units were implemented. Each of the employed RNNs had a single hidden layer.
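The moving-window decomposition described above can be sketched as follows (numpy); the window width and the stand-in coefficient time series are hypothetical choices for illustration, with the first half of each window serving as network input and the second half as target output:

import numpy as np

def window_samples(alpha_i, width):
    # Moving window with stride 1: first half of each window is the input
    # signal, second half is the output signal for supervised training.
    half = width // 2
    X, Y = [], []
    for s in range(len(alpha_i) - width + 1):
        X.append(alpha_i[s : s + half])
        Y.append(alpha_i[s + half : s + width])
    return np.array(X), np.array(Y)

alpha_1 = np.sin(np.linspace(0.0, 50.0, 5023))   # stand-in for alpha_1(t), ISO
X, Y = window_samples(alpha_1, width=40)         # thousands of small samples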
Demonstrative results for the prediction capabilities of both the original LSTM and the BiLSTM net-
works are illustrated in Figure 20. Contrary to the authors’ expectation, networks with the original LSTM
units performed better than those using BiLSTM units in both physical problems of isotropic turbulence
(ISO) (Figure 20a) and magnetohydrodynamics (MHD) (Figure 20b) [26].
Details of the formulation in [26] are discussed in Section 12.
r(t) = r_0 + \int_{\tau = -\infty}^{\tau = t} K(t - \tau) \, s(\tau) \, d\tau , (5)

where r_0 is the background firing rate at zero stimulus, K(·) is the synaptic kernel, and s(·) the stimulus; see, e.g., [19], p. 46, Eq. (2.1).47 The stimulus s(·) in Eq. (5) is a train (sequence in time) of spikes described by

s(\tau) = \sum_{i} \delta(\tau - t_i) , (6)

where δ(·) is the Dirac delta. Eq. (5) then describes the firing rate r(t) at time t as the collective memory effect of all spikes, going from the current time τ = t back far in the past with τ = −∞, with the weight for the spike s(τ) at time τ provided by the value of the synaptic kernel K(t − τ) at the same time τ.

46 From here on, if Eq. (5) is found a bit abstract at first reading, first-time learners could skip the remainder of this short Section 3 to begin reading Section 4, and come back later after reading through subsequent sections, particularly Section 13.2, to have an overview of the connections among seemingly separate topics.

47 Eq. (5) is a reformulated version of Eq. (2.1) in [19], p. 46, and is similar to Eqs. (7.1)-(7.2) in [19], p. 233, Chapter "7 Network Models".

Figure 20: Mean Absolute Scaled Error (MASE) for LSTM and BiLSTM predictions on all test samples in the ISO dataset (5023 realizations) and the MHD dataset (Section 2.3.3); screenshots of figures from [26]. The MASE was generally low, except at samples where a sudden increase was observed; the average MASE of BiLSTM was higher than that of LSTM across all 5 POD modes.

Figure 21: Biological neuron response to stimulus, experimental result (Section 3). An oscillating current was injected into the neuron (top), and the neuron spiking response was recorded (bottom) [106]. A spike, also called an action potential, is an electrical potential pulse of about 100 millivolts across the cell membrane that lasts about 1 millisecond. "Neurons represent and transmit information by firing sequences of spikes in various temporal patterns" [19], pp. 3-4. (Figure reproduced with permission of the authors and Frontiers Media SA.)
It will be seen in Section 13.2.2 on "Dynamics, time dependence, Volterra series" that the convolution
integral in Eq. (5) corresponds to the linear part of the Volterra series of nonlinear response of a biological
neuron in terms of the stimulus, Eq. (497), which in turn provides the theoretical foundation for taking the
linear combination of inputs, weights, and biases for an artificial neuron in a multilayer neural networks, as
represented by Eq. (26).
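A discretized version of Eq. (5) can be sketched as follows; the exponential synaptic kernel and all constants below are assumed purely for illustration, and the Dirac deltas of Eq. (6) are approximated by unit-area pulses:

import numpy as np

dt = 0.001                                 # time step [s]
t = np.arange(0.0, 1.0, dt)
s = np.zeros_like(t)                       # spike train s(tau), Eq. (6):
s[np.random.rand(t.size) < 0.02] = 1.0/dt  # unit-area pulses approximate deltas

tau_K = 0.05                               # kernel time constant (assumed)
K = np.exp(-t / tau_K) / tau_K             # exponential synaptic kernel (assumed)

r0 = 5.0                                   # background firing rate [Hz]
# Eq. (5) discretized: r[n] = r0 + sum_{m <= n} K[n-m] s[m] dt
r = r0 + np.convolve(s, K)[: t.size] * dt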
The Integrate-and-Fire model for biological neuron provides a motivation for the use of the rectified
linear units (ReLU) as activation function in multilayer neural networks (or perceptrons); see Figure 28.
Eq. (5) is also related to the exponential smoothing technique used in forecasting and applied to stochas-
tic optimization methods to train multilayer neural networks; see Section 6.5.3 on “Forecasting time series,
exponential smoothing”.
48 There is no physical flow here, only function mappings.
49 Figure 2 in [20] is essentially the same as Figure 8.
50 See, e.g., [107], [108].
Table 1: Top-down (rough) comparison of modeling steps in three fields (Section 3): computational mechanics, neuroscience, and deep learning

Outputs | Displacements (solids), velocities (fluids) | Firing rate as response | Image classified (car, frog, human)
one (1). For this reason, we use neither the name "vector" for a column matrix, nor the name "tensor" for an array with more than two indices.51 All arrays are matrices.
The matrix notation used here can follow either (1) the Matlab / Octave code syntax, or (2) the more
compact component convention for tensors in mechanics.
Using Matlab / Octave code syntax, the inputs to a network (to be defined soon) are gathered in an n × 1
column matrix x of real numbers, and the (target or labeled) outputs52 in an m × 1 column matrix y
x = [x_1, \cdots, x_n]^T = [x_1; \cdots; x_n] = \begin{bmatrix} x_1 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^{n \times 1} , \quad y = [y_1, \cdots, y_m]^T \in \mathbb{R}^{m \times 1} (7)
where the commas are separators for matrix elements in a row, and the semicolons are separators for rows.
For matrix transpose, we stick to the standard notation using the superscript "T" for written documents, instead of the prime "′" as used in Matlab / Octave code. In addition, the prime "′" is more customarily used to denote derivatives in handwritten and typeset equations.
Using the component convention for tensors in mechanics,53 the coefficients of an n × m matrix shown below

[A_{ij}] = [A^i{}_j] \in \mathbb{R}^{n \times m} (8)

are arranged according to the following convention for the free indices i and j, which are automatically expanded to their respective full ranges, i.e., i = 1, \ldots, n and j = 1, \ldots, m, when the variables A_{ij} = A^i{}_j are enclosed in square brackets:
(1) In case both indices are subscripts, then the left subscript (index i of A_{ij} in Eq. (8)) denotes the row index, whereas the right subscript (index j of A_{ij} in Eq. (8)) denotes the column index.
(2) In case one index is a superscript and the other index is a subscript, then the superscript (upper index i of A^i{}_j in Eq. (8)) is the row index, and the subscript (lower index j of A^i{}_j in Eq. (8)) is the column index.54
With this convention (lower index designates column index, while upper index designates row index),
the coefficients of array x in Eq. (7) can be presented either in row form (with lower index) or in column
form (with upper index) as follows:
[x_i] = [x_1, x_2, \cdots, x_n] \in \mathbb{R}^{1 \times n} , \quad [x^i] = \begin{bmatrix} x^1 \\ x^2 \\ \vdots \\ x^n \end{bmatrix} \in \mathbb{R}^{n \times 1} , with x_i = x^i , \forall i = 1, \cdots, n (9)
Instead of automatically associating any matrix variable such as x to the column matrix of its components,
the matrix dimensions are clearly indicated as in Eq. (7) and Eq. (9), i.e., by specifying the values m (number
of rows) and n (number of columns) of its containing space Rm×n .
51 See, e.g., [78], p. 31, where a "vector" is a column matrix, and a "tensor" is an array with coefficients (elements) having more than two indices, e.g., A_{ijk}. It is important to know the terminologies used in computer-science literature.
52 The inputs x and the target (or labeled) outputs y are the data used to train the network, which produces the predicted (or approximated) outputs denoted by \tilde{y}, with an overhead tilde reminiscent of the approximation symbol ≈. See also Footnote 87.
53 See, e.g., [107] [109] [110].
54 See, e.g., [111], Footnote 11. For example, A^3{}_2 = A_{32} is the coefficient in row 3 and column 2.
where the summation convention on the repeated indices r and s is implied. Then the Jacobian matrix \partial z / \partial \theta can be obtained directly as a product of Jacobian matrices from the chain rule, just by putting square brackets around each factor:

\frac{\partial z(y(x(\theta)))}{\partial \theta} = \left[ \frac{\partial z_i}{\partial \theta_j} \right] = \left[ \frac{\partial z_i}{\partial y_r} \right] \left[ \frac{\partial y_r}{\partial x_s} \right] \left[ \frac{\partial x_s}{\partial \theta_j} \right] (14)
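Eq. (14) can be verified numerically with automatic differentiation; a minimal sketch using JAX, with toy maps x(θ), y(x), z(y) chosen arbitrarily for illustration:

import jax.numpy as jnp
from jax import jacobian

x_of = lambda th: jnp.sin(th)                                # x(theta): R^2 -> R^2
y_of = lambda x: jnp.array([x[0]*x[1], x[0]+x[1], x[1]**2])  # y(x): R^2 -> R^3
z_of = lambda y: jnp.array([y[0]+y[2], y[1]*y[2]])           # z(y): R^3 -> R^2

theta = jnp.array([0.3, 0.7])

J_lhs = jacobian(lambda th: z_of(y_of(x_of(th))))(theta)  # d z / d theta
J_rhs = (jacobian(z_of)(y_of(x_of(theta)))                # [dz_i/dy_r]
         @ jacobian(y_of)(x_of(theta))                    # [dy_r/dx_s]
         @ jacobian(x_of)(theta))                         # [dx_s/dtheta_j]

assert jnp.allclose(J_lhs, J_rhs)                         # Eq. (14) checks out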
Consider the scalar function E : \mathbb{R}^m \to \mathbb{R} that maps the column matrix y \in \mathbb{R}^{m \times 1} into a scalar; then the components of the gradient of E with respect to y are arranged in a row matrix defined as follows:

\left[ \frac{\partial E}{\partial y_j} \right]_{1 \times m} = \left[ \frac{\partial E}{\partial y_1}, \cdots, \frac{\partial E}{\partial y_m} \right] = \nabla_y^T E \in \mathbb{R}^{1 \times m} (15)

with \nabla_y^T E being the transpose of the m × 1 column matrix \nabla_y E containing these same components.56
Now consider this particular scalar function below:57

z = w \, y , with w \in \mathbb{R}^{1 \times n} and y \in \mathbb{R}^{n \times 1} (16)

Then the gradients of z are58

\frac{\partial z}{\partial w_j} = y_j \implies \left[ \frac{\partial z}{\partial w_j} \right] = [y_j] = y^T \in \mathbb{R}^{1 \times n} , and \left[ \frac{\partial z}{\partial y_j} \right] = [w_j] = w \in \mathbb{R}^{1 \times n} (17)
55 For example, the coefficient A^3{}_2 = \partial y^3 / \partial x_2 is in row 3 and column 2. The Jacobian matrix in this convention is the transpose of that used in [39], p. 175.
56 In [78], the column matrix (which is called "vector") \nabla_y E is referred to as the gradient of E. Later on in the present paper, E will be called the error or "loss" function, y the outputs of a neural network, and the gradient of E with respect to y is the first step in the "backpropagation" algorithm in Section 5 to find the gradient of E with respect to the network parameters collected in the matrix θ for an optimization descent direction to minimize E.
57 Soon, it will be seen in Eq. (26) that the function z is a linear combination of the network inputs y, which are the outputs coming from the previous network layer, with w being the weights. An advantage of defining w as a row matrix, instead of a column matrix like y, is to de-clutter the equations by dispensing with (1) the superscript T designating the transpose, as in w^T y, or (2) the dot-product symbol, as in w•y, leaving space for other indices, such as in Eq. (26).
58 The gradients of z will be used in the backpropagation algorithm in Section 5 to obtain the gradient of the error (or loss) function E to find the optimal weights that minimize E.
y^{(L)} = f^{(L)}(x^{(L)}) = \tilde{y} (predicted outputs) (19)
Remark 4.1. The notation y (ℓ) , for ℓ = 0, · · · , L in Eq. (19) will be useful to develop a concise formulation
for the computation of the gradient of the cost function relative to the parameters by backpropagation for
use in training (finding optimal parameters); see Eqs. (91)-(92) in Section 5 on Backpropagation. ■
The quantities associated with layer (ℓ) in a network are indicated with the superscript (ℓ), so that the inputs to layer (ℓ), gathered in the m^{(ℓ-1)} × 1 column matrix

x^{(\ell)} = [x_1^{(\ell)}, \cdots, x_{m^{(\ell-1)}}^{(\ell)}]^T = y^{(\ell-1)} \in \mathbb{R}^{m^{(\ell-1)} \times 1} (20)

are the predicted outputs from the previous layer (ℓ − 1), gathered in the matrix y^{(ℓ−1)}, with m^{(ℓ−1)} being the width of layer (ℓ − 1). Similarly, the outputs of layer (ℓ), gathered in the m^{(ℓ)} × 1 column matrix

y^{(\ell)} = [y_1^{(\ell)}, \cdots, y_{m^{(\ell)}}^{(\ell)}]^T = x^{(\ell+1)} \in \mathbb{R}^{m^{(\ell)} \times 1} (21)

are the inputs to the subsequent layer (ℓ + 1), gathered in the matrix x^{(ℓ+1)}, with m^{(ℓ)} being the width of layer (ℓ).
Remark 4.2. The output for layer (ℓ), denoted by y (ℓ) , can also be written as h(ℓ) , where “h” is mnemonic
for “hidden”, since the inner layers between the input layer (1) and the output layer (L) are considered as
being “hidden”. Both notations y (ℓ) and h(ℓ) are equivalent
y (ℓ) ≡ h(ℓ) (22)
and can be used interchangeably. In the current Section 4 on “Static, feedforward networks”, the notation
y (ℓ) is used, whereas in Section 7 on “Dynamics, sequential data, sequence modeling”, the notation h[n] is
used to designate the output of the “hidden cell” at state [n] in a recurrent neural network, keeping in mind
the equivalence in Eq. (276) in Remark 7.1. Whenever necessary, readers are reminded of the equivalence
in Eq. (22) to avoid possible confusion when reading deep-learning literature. ■
The above chain in Eq. (18)—see also Eq. (23) and Figure 23—is referred to as “multiple levels of
composition” that characterizes modern deep learning, which no longer attempts to mimic the working
of the brain from the neuroscientific perspective.60 Besides, a complete understanding of how the brain
functions is still far remote.61
59 To alleviate the notation, the predicted output y^{(ℓ)} from layer (ℓ) is indicated by the superscript (ℓ), without the tilde. The output y^{(L)} from the last layer (L) is the network predicted output \tilde{y}.
60 See [78], p. 14, p. 163.
61 In the review paper [12], addressed to computer-science experts and dense with acronyms and jargon "foreign" to first-time learners, the authors mentioned "It is ironic that artificial NNs [neural networks] (ANNs) can help to better understand biological NNs (BNNs)", and cited a 2012 paper that won an "image segmentation" contest in helping to construct a 3-D model of the "brain's neurons and dendrites" from "electron microscopy images of stacks of thin slices of animal brains".
Figure 22: Function mapping, graphical representation (Section 4.3.1): n inputs in x ∈ Rn×1
(n × 1 column matrix of real numbers) are fed into function f to produce m outputs in y ∈
Rm×1 .
revealing the structure of the feedforward network as a multilevel composition of functions (or chain-based network), in which the output y^{(ℓ−1)} of the previous layer (ℓ − 1) serves as the input for the current layer (ℓ), to be processed by the function f^{(ℓ)} to produce the output y^{(ℓ)}. The input x = y^{(0)} to the input layer (1) is the input for the entire network. The output y^{(L)} = \tilde{y} of the (last) layer (L) is the predicted output for the entire network.
Figure 23: Feedforward network (Sections 4.3.1, 4.4.4): Multilevel composition in feedfor-
ward network with L layers represented as a sequential application of functions f (ℓ) , with
ℓ = 1, · · · , L, to n inputs gathered in x = y (0) ∈ Rn×1 (n × 1 column matrix of real num-
bers) to produce m outputs gathered in y (L) = ye ∈ Rm×1 . This figure is a higher-level block
diagram that corresponds to the lower-level neural network in Figure 7 or in Figure 35.
Remark 4.3. Layer definitions, action layers, state layers. In Eq. (23) and in Figure 23, an action layer is defined by the action, i.e., the function f^{(ℓ)}, on the inputs y^{(ℓ−1)} to produce the outputs y^{(ℓ)}. There are L action layers. A state layer is a collection of inputs or outputs, i.e., y^{(ℓ)}, ℓ = 0, . . . , L, each of which describes a state of the system; hence the number of state layers is L + 1, and the number of hidden (state) layers (excluding the input layer y^{(0)} and the output layer y^{(L)}) is (L − 1). For an illustration of state layers, see [78], p. 6, Figure 1.2. See also Remark 11.3. From here on, "hidden layer" means "hidden state layer", agreeing with the terminology in [78]. See also Remark 4.5 on depth definitions in Section 4.6.1 on "Depth, size". ■
\theta_i^{(\ell)} = \left[ w_i^{(\ell)} \,\middle|\, b_i^{(\ell)} \right] \in \mathbb{R}^{1 \times [m^{(\ell-1)} + 1]} (29)

\Theta^{(\ell)} = \left[ \theta_{ij}^{(\ell)} \right] = \begin{bmatrix} \theta_1^{(\ell)} \\ \vdots \\ \theta_{m^{(\ell)}}^{(\ell)} \end{bmatrix} = \left[ W^{(\ell)} \,\middle|\, b^{(\ell)} \right] \in \mathbb{R}^{m^{(\ell)} \times [m^{(\ell-1)} + 1]} (30)
For simplicity and convenience, the set of all parameters in the network is denoted by θ, and the set of all
parameters in layer (ℓ) by θ (ℓ) :65
θ = \{θ^{(1)}, \cdots, θ^{(\ell)}, \cdots, θ^{(L)}\} = \{Θ^{(1)}, \cdots, Θ^{(\ell)}, \cdots, Θ^{(L)}\} , such that θ^{(\ell)} \equiv Θ^{(\ell)} (31)
Note that the set θ in Eq. (31) is not a matrix, but a set of matrices, since the number of rows m(ℓ) for a layer
(ℓ) may vary for different values of ℓ, even though in practice, the widths of the layers in a fully connected
feed-forward network may generally be chosen to be the same.
62 See Eq. (497) for the continuous temporal summation, counterpart of the discrete spatial summation in Eq. (26).
63 It should be noted that the use of both W and W^T in [78] in equations equivalent to Eq. (26) is confusing. For example, on p. 205, in Section 6.5.4 on backpropagation for a fully-connected feedforward network, Algorithm 6.3, an equation that uses W in the same manner as Eq. (26) is a^{(k)} = b^{(k)} + W^{(k)} h^{(k)}, whereas on p. 191, in Section 6.4, Architecture Design, Eq. (6.40) uses W^T and reads as h^{(1)} = G^{(1)}(W^{(1)T} x + b^{(1)}), which is similar to Eq. (6.36) on p. 187. On the other hand, both W and W^T appear on the same p. 190 in the expressions cos(W x + b) and h = g(W^T x + b). Here, we stick to a single definition of W^{(ℓ)} as defined in Eq. (27) and used in Eq. (26).
64 Eq. (26) is a linear (additive) combination of inputs with possibly non-zero biases. An additive combination of inputs with zero bias, and a "multiplicative" combination of inputs of the form z_i^{(\ell)} = \prod_{k=1}^{k=m^{(\ell-1)}} w_{ik}^{(\ell)} y_k^{(\ell-1)} with zero bias, were mentioned in [12]. In [112], the author went even further to propose the general case in which y^{(\ell)} = F^{(\ell)}(y^{(k)}, with k < \ell), where F_i^{(\ell)} is any differentiable function. But it is not clear whether any of these more complex functions of the inputs were used in practice, as we have not seen any such use, e.g., in [21] [78], and many other articles, including review articles such as [13] [20]. On the other hand, the additive combination has a clear theoretical foundation as the linear-order approximation to the Volterra series Eq. (496); see Eq. (497) and also [19].
65 For the convenience in further reading, wherever possible, we use the same notation as in [78], p. xix.
Similar to the definition of the parameter matrix θ^{(ℓ)} in Eq. (30), which includes the biases b^{(ℓ)}, it is convenient for use later in elucidating the backpropagation method in Section 5 (and Section 5.2 in particular) to expand the matrix y^{(ℓ−1)} in Eq. (26) into the matrix \bar{y}^{(ℓ−1)} (with an overbar) as follows:

z^{(\ell)} = W^{(\ell)} y^{(\ell-1)} + b^{(\ell)} \equiv \left[ W^{(\ell)} \,\middle|\, b^{(\ell)} \right] \begin{bmatrix} y^{(\ell-1)} \\ 1 \end{bmatrix} =: \theta^{(\ell)} \bar{y}^{(\ell-1)} , (32)

with

\theta^{(\ell)} := \left[ W^{(\ell)} \,\middle|\, b^{(\ell)} \right] \in \mathbb{R}^{m^{(\ell)} \times [m^{(\ell-1)} + 1]} , and \bar{y}^{(\ell-1)} := \begin{bmatrix} y^{(\ell-1)} \\ 1 \end{bmatrix} \in \mathbb{R}^{[m^{(\ell-1)} + 1] \times 1} . (33)

The total number of parameters of a fully-connected feedforward network is then

P_T := \sum_{\ell=1}^{\ell=L} m^{(\ell)} \times [m^{(\ell-1)} + 1] . (34)
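Eq. (34) translates directly into code; a minimal sketch, with a hypothetical layer-width list:

def total_params(widths):
    # widths = [m(0), m(1), ..., m(L)], with m(0) = n the input width;
    # layer (l) contributes m(l) x [m(l-1) + 1] weights plus biases, Eq. (34)
    return sum(widths[l] * (widths[l-1] + 1) for l in range(1, len(widths)))

# Hypothetical network: 4 inputs, two hidden layers of width 10, 2 outputs
assert total_params([4, 10, 10, 2]) == 182   # 10*5 + 10*11 + 2*11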
But why use a linear (additive) combination (or superposition) of inputs with weights, plus biases, as expressed in Eq. (26)? See Section 13.2.
y^{(\ell)} = a(z^{(\ell)}) , such that y_i^{(\ell)} = a(z_i^{(\ell)}) (35)
Without the activation function, the neural network is simply a linear regression, and cannot learn and
perform complex tasks, such as image classification, language translation, guiding a driver-less car, etc. See
Figure 32 for the block diagram of a one-layer network.
An example is a linear one-layer network, without activation function, that is unable to represent the seemingly simple XOR (exclusive-or) function; this failure brought down the first wave of AI (cybernetics), and is described in Section 4.5.
Rectified linear units (ReLU). Nowadays, for the choice of activation function a(·), most modern large deep-learning networks use the default,66 well-proven rectified linear function (more often known as the "positive part" function) defined as67

a(z) = z^+ = [z]_+ = \max(0, z) = \begin{cases} 0 & \text{for } z \le 0 \\ z & \text{for } 0 < z \end{cases} (36)
and depicted in Figure 24, for which the processing unit is called the rectified linear unit (ReLU),68 which
was demonstrated to be superior to other activation functions in many problems.69 Therefore, in this section,
we discuss in detail the rectified linear function, with careful explanation and motivation. It is important to
66 "In modern neural networks, the default recommendation is to use the rectified linear unit, or ReLU," [78], p. 168.
67 The notation z^+ for the positive-part function is used in the mathematics literature, e.g., "Positive and negative parts", Wikipedia, version 12:11, 13 March 2018, and less frequently in the computer-science literature, e.g., [33]. The notation [z]_+ is found in the neuroscience literature, e.g., [32] [19], p. 63. The notation max(0, z) is more widely used in the computer-science literature, e.g., [34] [113], [78].
68 A similar relation can be applied to define the Leaky ReLU in Eq. (40).
69 In [78], p. 15, the authors cited the original papers [33] and [34], where ReLU was introduced in the context of image / object recognition, and [113], where the superiority of ReLU over hyperbolic-tangent units and sigmoidal units was demonstrated.
note that ReLU is superior for large network size, and may have about the same, or less, accuracy than the
older logistic sigmoid function for “very small” networks, while requiring less computational efforts.70
Figure 24: Activation function (Section 4.4.2): Rectified linear function and its derivatives.
See also Section 5.3.3 and Figure 54 for Parametric ReLU that helped surpass human level per-
formance in ImageNet competition for the first time in 2015, Figure 3 [61]. See also Figure 26
for a halfwave rectifier.
To transform an alternating current into a direct current, the first step is to rectify the alternating current by eliminating its negative parts; hence the meaning of the adjective "rectified" in rectified linear unit (ReLU). Figure 25 shows the current-voltage relation for an ideal diode, for a resistance in series with the diode, and for the resulting ReLU function that rectifies an alternating current as input into the halfwave rectifier circuit in Figure 26, resulting in a halfwave current as output.
Figure 25: Current I versus voltage V (Section 4.4.2): Ideal diode, resistance, scaled rectified
linear function as activation (transfer) function for the ideal diode and resistance in series.
(Figure plotted with R = 2.) See also Figure 26 for a halfwave rectifier.
Mathematically, a periodic function remains periodic after passing through a (nonlinear) rectifier (activation function):

z(x + T) = z(x) \implies y(x + T) = a(z(x + T)) = a(z(x)) = y(x) (37)

where T in Eq. (37) is the period of the input current z.
Biological neurons encode and transmit information over long distance by generating (firing) electrical
pulses called action potentials or spikes with a wide range of frequencies [19], p. 1; see Figure 27. “To
reliably encode a wide range of signals, neurons need to achieve a broad range of firing frequencies and to
move smoothly between low and high firing rates” [114]. From the neuroscientific standpoint, the rectified
70 See, e.g., [78], p. 219, and Section 13.3 on the history of activation functions. See also Section 4.6 for a discussion of network size. The lesser computational effort with ReLU is due to (1) it being an identity map for positive arguments, (2) zero for negative arguments, and (3) its first derivative being the Step (Heaviside) function, as shown in Figure 24 and explained in Section 5 on Backpropagation.
Figure 26: Halfwave rectifier circuit (Section 4.4.2), with a primary alternating current z going in as input (left), passing through a transformer to lower the voltage amplitude, with the secondary alternating current out of the transformer being put through a closed circuit with an ideal diode D and a resistor R in series, resulting in a halfwave output current, which can be grossly approximated by the scaled rectified linear function y ≈ a(z) = max(0, z/R) (right), as shown in Figure 25, with scaling factor 1/R. The rectified linear unit in Figure 24 corresponds to the case with R = 1. For the more accurate Shockley diode model, the relation between current I and voltage V for this circuit is given in Figure 29. Figure based on source in Wikipedia, version 01:49, 7 January 2015.
linear function could be motivated as an idealization of the “Type I” relation between the firing rate (F) of a
biological neuron and the input current (I), called the FI curve. Figure 27 describes three types of FI curves,
with Type I in the middle subfigure, where there is a continuous increase in the firing rate with increase in
input current.
The Shockley equation for a current I going through a diode D, in terms of the voltage V_D across the diode, is given in mathematical form as:

I = p \left( e^{q V_D} - 1 \right) \implies V_D = \frac{1}{q} \log \left( \frac{I}{p} + 1 \right) . (38)

With the voltage across the resistance being V_R = R I, the voltage across the diode and the resistance in series is then

V = V_D + V_R = \frac{1}{q} \log \left( \frac{I}{p} + 1 \right) + R I , (39)
which is plotted in Figure 29. The rectified linear function could be seen from Figure 29 as a very rough
approximation of the current-voltage relation in a halfwave rectifier circuit in Figure 26, in which a diode
and a resistance are in series. In the Shockley model, the diode is leaky in the sense that there is a small
amount of current flow when the polarity is reversed, unlike the case of an ideal diode or ReLU (Figure 24),
and is better modeled by the Leaky ReLU activation function, in which there is a small positive (instead of
just flat zero) slope for negative z:
a(z) = \max(0.01 z, z) = \begin{cases} 0.01 z & \text{for } z \le 0 \\ z & \text{for } 0 < z \end{cases} (40)
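Both Eq. (36) and Eq. (40) are one-liners in numpy; a minimal sketch (the gradient convention at z = 0 anticipates the software practice quoted from [78] further below):

import numpy as np

def relu(z):                        # Eq. (36): positive part max(0, z)
    return np.maximum(0.0, z)

def leaky_relu(z, slope=0.01):      # Eq. (40): small slope 0.01 for z <= 0
    return np.maximum(slope * z, z)

def relu_grad(z):                   # Step (Heaviside) function; at z = 0, a
    return (z > 0).astype(float)    # one-sided derivative (here 0) is returned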
ReLU had long been widely used in neuroscience as an activation function71 before being introduced into deep learning in 2011. Prior to 2011, the state of the art for deep-learning activation functions was the hyperbolic tangent (Figure 31), which performed better than the widely used, and much older, sigmoid function72 (Figure 30); see [113], in which it was reported that
71 See, e.g., [19], p. 14, where ReLU was called the "half-wave rectification operation", the meaning of which is explained above in Figure 26. The logistic sigmoid function (Figure 30) has also been used in neuroscience since the 1950s.
72 See Section 13.3.1 for a history of the sigmoid function, which dates back at least to 1974 in neuroscience.
Figure 27: FI curves (Sections 4.4.2, 13.2.2). Firing-rate frequency (F) versus applied depolarizing current (I), thus FI curves; three types of FI curves. The time histories of voltage V_m provide a visualization of the spikes, current threshold, and spike firing rates. The applied (input) current I_app is increased gradually until it passes a current threshold, at which point the neuron begins to fire. Two input current levels (two black dots on the FI curves at the bottom) near the current threshold are shown, one just below the threshold (black-line time history for I_app) and one just above the threshold (blue line). Two corresponding histories of voltage V_m (flat black line, and blue line with spikes) are also shown. Type I displays a continuous increase in firing frequency from zero to higher values as the current increases past the current threshold. Type II displays a discontinuity in firing frequency, with a sudden jump from zero to a finite frequency, when the current passes the threshold. At low concentration ḡ_A of potassium, the neuron exhibits a Type-II FI curve, then transitions to a Type-I FI curve as ḡ_A is increased, and returns to Type-II⋆ for higher concentration ḡ_A; see [114]. The scaled rectified linear unit (scaled ReLU, Figure 25 and Figure 26) can be viewed as approximating the Type-I FI curve; see also Figure 28 and Eq. (505), where the FI curve is used in biological neuron firing-rate models. (Figure reproduced with permission of NAS.)
“While logistic sigmoid neurons are more biologically plausible than hyperbolic tangent neu-
rons, the latter work better for training multilayer neural networks. Rectifying neurons are an
even better model of biological neurons and yield equal or better performance than hyperbolic
tangent networks in spite of the hard non-linearity and non-differentiability at zero.”
The hard non-linearity of ReLU is localized at zero, but otherwise ReLU is a very simple function—
identity map for positive argument, zero for negative argument—making it highly efficient for computation.
Also, due to errors in numerical computation, it is rare to hit exactly zero, where there is a hard non-
linearity in ReLU:
"In the case of g(z) = max(0, z), the left derivative at z = 0 is 0, and the right derivative is 1. Software implementations of neural network training usually return one of the one-sided derivatives rather than reporting that the derivative is undefined or raising an error. This may be heuristically justified by observing that gradient-based optimization on a digital computer is subject to numerical error anyway. When a function is asked to evaluate g(0), it is very unlikely that the underlying value truly was 0. Instead, it was likely to be some small value ϵ that was rounded to 0."
(c) Modeling. FI curve using the Integrate-and-Fire model; see [19], p. 164, Eq. (5.11). The figure is from [113]. (Figure reproduced with permission of the authors.)
(d) Hardware. Electronic-neuron firing rate (F) vs. input voltage (V). Very high firing rates, up to 20 MHz, roughly 10^5 to 10^6 times faster than the firing rates of biological neurons [117]; CC-BY-4.0 license.
Figure 28: FI or FV curves (Sections 3, 4.4.2, 13.2.2). Neuron firing rate (F) versus input
current (I) (FI curves, a,b,c) or voltage (V). The Integrate-and-Fire model in SubFigure (c) can
be used to replace the sigmoid function to fit the experimental data points in SubFigure (a). The
ReLU function in Figure 24 can be used to approximate the region of the FI curve just beyond
the current or voltage thresholds, as indicated in the red rectangle in SubFigure (c). Despite the
advanced mathematics employed to produce the Type-II FI curve of a large number of Type-I
neurons, as shown in SubFigure (b), it is not clear whether a similar result would be obtained
if the single neuron displays a behavior as in Figure 27, with transition from Type II to Type
I to Type II⋆ in a single neuron. See Section 13.2.2 on “Dynamic, time dependence, Volterra
series" for more discussion on Wilson's equations, Eqs. (508)-(509) [118]. On the other hand, for deep-learning networks, the above results are more than sufficient to motivate the use of ReLU, which has deep roots in neuroscience.
Thus, in addition to the ability to train deep networks, another advantage of using ReLU is the high ef-
ficiency in computing both the layer outputs and the gradients for use in optimizing the parameters (weights
and biases) to lower cost or loss, i.e., training; see Section 6 on Training, and in particular Section 6.3 on
Stochastic Gradient Descent.
The activation function ReLU approximates how biological neurons work more closely than other activation functions (e.g., logistic sigmoid, tanh, etc.), as was established through experiments some sixty years ago, and it had been used in neuroscience long (at least ten years) before being adopted in deep learning in 2011. Its use in deep learning is a clear influence from neuroscience; see Section 13.3 on the history of activation functions, and Section 13.3.2 on the history of the rectified linear function.
Deep-learning networks using ReLU mimic biological neural networks in the brain through a trade-off
between two competing properties [113]:
(1) Sparsity. Only 1% to 4% of brain neurons are active at any one point in time. Sparsity saves brain
energy. In deep networks, “rectifying non-linearity gives rise to real zeros of activations and thus truly
sparse representations.” Sparsity provides representation robustness in that the non-zero features73
would have small changes for small changes of the data.
73 See the definition of image "predicate" or image "feature" in Section 13.2.1, and in particular Footnote 302.
Figure 30: Logistic sigmoid function (Sections 4.4.2, 5.1.3, 5.3.1, 13.3.3): s(z) = [1 + \exp(-z)]^{-1} = [\tanh(z/2) + 1]/2 (red), with the tangent at the origin z = 0 (blue). See also Remark 5.3 and Figure 46 on the softmax function.
Figure 31: Hyperbolic tangent function (Section 4.4.2): g(z) = \tanh(z) = 2 s(2z) - 1 (red) and its tangent g(z) = z at the coordinate origin (blue), showing that this activation function is the identity for small signals.
(2) Distributivity. Each feature of the data is represented distributively by many inputs, and each input is involved in distributively representing many features. Distributed representation is a key concept dating back to the revival of connectionism with [119] [120] and others; see Section 13.2.1.
Figure 32: One-layer network (Section 4.4.3) representing the relation between the predicted
output ye and the input x, i.e., ye = f (x) = a(W x + b) = a(z), with the weighted sum
z := W x + b; see Eq. (26) and Eq. (35) with ℓ = 1. For a lower-level details of this one layer,
see Figure 33.
For a multilayer neural network with L layers, with input-output relation shown in Figure 34, the
detailed components are given in Figure 35, which generalizes Figure 33 to layer (ℓ).
Figure 33: One-layer network (Section 4.4.3) in Figure 32: Lower-level details, with m processing units (rows or neurons), inputs x = [x_1, x_2, \ldots, x_n]^T and predicted outputs \tilde{y} = [\tilde{y}_1, \tilde{y}_2, \ldots, \tilde{y}_m]^T.
Figure 34: Input-to-output mapping (Sections 4.4.3, 4.4.4): Layer (ℓ) in network with L layers
in Figure 23, input-to-output mapping f (ℓ) for layer (ℓ).
Figure 35: Low-level details of layer (ℓ) (Sections 4.4.3, 4.4.4) of the multilayer neural network in Figure 23, with m^{(ℓ)} as the number of processing units (rows or neurons), and thus the width of this layer, representing the layer processing (input-to-output) function f^{(ℓ)} in Figure 34.
Figure 36: Artificial neuron (Sections 2.3.1, 4.4.4, 13.1), row i in layer (ℓ) in Figure 35, representing the multiple-inputs-to-single-output relation \tilde{y}_i = a(w_i x + b_i) = a(\sum_{j=1}^{n} w_{ij} x_j + b_i), with x = [x_1, x_2, \cdots, x_n]^T and w_i = [w_{i1}, w_{i2}, \cdots, w_{in}]. This block diagram is the exact equivalent of Figure 8, Section 2.3.1, and in [38]. See Figure 131 for the corresponding biological neuron in Section 13.1 on "Early inspiration from biological neurons".
Table 2: Exclusive-or (XOR) function (Section 4.5), which produces the True value only if its two arguments differ. The symbol ⊕ ("Oh-plus") denotes the XOR operator. A concrete example of the XOR function: one and only one of two poker players would be the winner, with no tie possible, i.e., both players cannot win, and both cannot lose.

j | x_j = [x_{1j}, x_{2j}]^T | y_j = f(x_j) = x_{1j} ⊕ x_{2j} (XOR)
1 | [0, 0]^T | 0
2 | [1, 0]^T | 1
3 | [0, 1]^T | 1
4 | [1, 1]^T | 0
The dataset or design matrix74 X is the collection of the coordinates of all four points in Table 2:

X = [x_1, \ldots, x_4] = \begin{bmatrix} 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix} \in \mathbb{R}^{2 \times 4} , with x_i \in \mathbb{R}^{2 \times 1} (41)
An approximation (or prediction) for the XOR function y = f(x), with parameters θ, is denoted by \tilde{y} = \tilde{f}(x, \theta), with the mean squared error (MSE) being:

J(\theta) = \frac{1}{4} \sum_{i=1}^{4} \left[ \tilde{f}(x_i, \theta) - f(x_i) \right]^2 = \frac{1}{4} \sum_{i=1}^{4} \left[ \tilde{y}_i - y_i \right]^2 (42)
We begin with a one-layer network to show that it cannot represent the XOR function,75 then move on to a
two-layer network, which can.
"Models based on the f(x, w) = \sum_i w_i x_i used by the perceptron and ADALINE are called linear models. Linear models have many limitations... Most famously, they cannot learn the XOR function... Critics who observed these flaws in linear models caused a backlash against biologically inspired learning in general (Minsky and Papert, 1969). This was the first major dip in the popularity of neural networks."76
74 See [78], p. 103.
75 This one-layer network is not the Rosenblatt perceptron in Figure 132, due to the absence of the Heaviside function as activation function, and thus Section 4.5.1 is not a proof that the Rosenblatt perceptron cannot represent the XOR function. For such a proof, see [121].
76 See [78], p. 167.
First-time learners, who have not seen the definition of Rosenblatt's (1958) perceptron [119], could confuse Eq. (43) with the perceptron, which was not a linear model; more importantly, the Rosenblatt perceptron was a network with many neurons,77 whereas Eq. (43) is only a linear unit (a single neuron) and does not have a (nonlinear) activation function. A neuron in the Rosenblatt perceptron is Eq. (489) in Section 13.2, with the Heaviside (nonlinear step) function as activation function; see Figure 132.
Figure 37: Representing the XOR function (Sections 4.5, 13.2). This one-layer network (which is not the Rosenblatt perceptron in Figure 132) cannot perform this task. For each input matrix x_i in the design matrix X = [x_1, \ldots, x_4] \in \mathbb{R}^{2 \times 4}, with i = 1, \ldots, 4 (see Table 2), the linear unit (neuron) z^{(1)}(x_i) = w x_i + b \in \mathbb{R} in Eq. (43) predicts a value \tilde{y}_i = z^{(1)}(x_i) as output, which is collected in the output matrix \tilde{y} = [\tilde{y}_1, \ldots, \tilde{y}_4] \in \mathbb{R}^{1 \times 4}. The MSE cost function J(θ) in Eq. (42) is used in a gradient descent to find the parameters θ = [w, b]. The result is a constant function, \tilde{y}_i = 1/2, for i = 1, \ldots, 4, which cannot represent the XOR function.
Setting the gradient of the cost function in Eq. (46) to zero and solving the resulting equations, we obtain the weights and the bias:

\nabla_\theta J(\theta) = \left[ \frac{\partial J}{\partial \theta_1}, \frac{\partial J}{\partial \theta_2}, \frac{\partial J}{\partial \theta_3} \right]^T = \left[ \frac{\partial J}{\partial w_1}, \frac{\partial J}{\partial w_2}, \frac{\partial J}{\partial b} \right]^T (47)

\frac{\partial J}{\partial w_1} = 0 \implies 2 w_1 + w_2 + 2b = 1
\frac{\partial J}{\partial w_2} = 0 \implies w_1 + 2 w_2 + 2b = 1
\frac{\partial J}{\partial b} = 0 \implies w_1 + w_2 + 2b = 1
\implies w_1 = w_2 = 0 , \; b = \frac{1}{2} , (48)
from which the predicted output \tilde{y} in Eq. (43) is a constant for any point in the dataset (or design matrix) X = [x_1, \ldots, x_4]:

\tilde{y}_i = \tilde{f}(x_i, \theta) = \frac{1}{2} , for i = 1, \ldots, 4 (49)

and thus this one-layer network cannot represent the XOR function. Eqs. (48) are called the "normal" equations.78
77 See Section 13.2 on the history of the linear combination (weighted sum) of inputs with biases.
78 In least-squares linear regression, the normal equations are often presented in matrix form, starting from the errors (or residuals) at the data points, gathered in the matrix e = y − Xθ. To minimize the square of the errors represented by ∥e∥^2, consider a perturbation θ_ϵ = θ + ϵγ and e_ϵ = y − Xθ_ϵ, then set the directional derivative of ∥e_ϵ∥^2 to zero, i.e., \frac{d}{d\epsilon} ∥e_ϵ∥^2 |_{\epsilon=0} = 0, which yields X^T (y − Xθ) = 0, the "normal equations" in matrix form, since the error matrix e is required to be "normal" (orthogonal) to the span of X. For the above XOR function with four data points, the relevant matrices are (using the Matlab / Octave notation) e = [e_1, e_2, e_3, e_4]^T, y = [0, 1, 1, 0]^T, and X = [[0, 0, 1]; [1, 0, 1]; [0, 1, 1]; [1, 1, 1]], which also lead to Eq. (48). See, e.g., [122] [123] and [78], p. 106.
“It has, in fact, been widely conceded by psychologists that there is little point in trying to
‘disprove’ any of the major learning theories in use today, since by extension, or a change
in parameters, they have all proved capable of adapting to any specific empirical data. In
considering this approach, one is reminded of a remark attributed to Kistiakowsky, that ‘given
seven parameters, I could fit an elephant.’ ”
So we now add a second layer, and thus more parameters, in the hope of being able to represent the XOR function, as shown in Figure 38.79
Figure 38: Representing the XOR function (Section 4.5). This two-layer network can perform this task. The four points in the design matrix X = [x_1, \ldots, x_4] \in \mathbb{R}^{2 \times 4} (see Table 2) are converted into three points that are linearly separable by the two nonlinear units (neurons or rows) of Layer (1), i.e., Y^{(1)} = [y_1^{(1)}, \ldots, y_4^{(1)}] = f^{(1)}(X^{(1)}) = a(Z^{(1)}) = X^{(2)} \in \mathbb{R}^{2 \times 4}, with Z^{(1)} = w^{(1)} X^{(1)} + B^{(1)} \in \mathbb{R}^{2 \times 4}, as in Eq. (58), and a(·) a nonlinear activation function. Layer (2) consists of a single linear unit (neuron or row) with three parameters, i.e., \tilde{y} = [\tilde{y}_1, \ldots, \tilde{y}_4] = f^{(2)}(X^{(2)}) = w_1^{(2)} X^{(2)} + b_1^{(2)} \in \mathbb{R}^{1 \times 4}. The three non-aligned points in X^{(2)} offer three equations to solve for the three parameters θ^{(2)} = [w_1^{(2)}, b_1^{(2)}] \in \mathbb{R}^{1 \times 3}; see Eq. (61).
Layer (1): six parameters (4 weights, 2 biases), plus a (nonlinear) activation function. The purpose is to change coordinates so as to move the four input points of the XOR function into three points, such that the two points with XOR value equal to 1 coalesce into a single point, and such that these three points are aligned on a straight line. Since these three aligned points are still not linearly separable, the activation function then moves them out of alignment, thus making them linearly separable.
z_i^{(1)} = w_i^{(1)} x_i + b_i^{(1)} = w^{(1)} x_i + b^{(1)} , for i = 1, \ldots, 4 (50)

w_i^{(1)} = w^{(1)} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} , \quad b_i^{(1)} = b^{(1)} = \begin{bmatrix} 0 \\ -1 \end{bmatrix} , for i = 1, \ldots, 4 (51)

Z^{(1)} = \left[ z_1^{(1)}, \ldots, z_4^{(1)} \right] \in \mathbb{R}^{2 \times 4} , \quad X^{(1)} = \left[ x_1^{(1)}, \ldots, x_4^{(1)} \right] = \begin{bmatrix} 0 & 1 & 0 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix} \in \mathbb{R}^{2 \times 4} (52)
79 Our presentation is more detailed and more general than in [78], pp. 167-171, where there was no intuitive explanation of how the numbers were obtained, and where only the activation function ReLU was used.
B^{(1)} = \left[ b_1^{(1)}, \ldots, b_4^{(1)} \right] = \begin{bmatrix} 0 & 0 & 0 & 0 \\ -1 & -1 & -1 & -1 \end{bmatrix} \in \mathbb{R}^{2 \times 4} (53)
Figure 39: Two-layer network for XOR representation (Section 4.5). Left: XOR function, with A = x_1^{(1)} = [0, 0]^T, B = x_2^{(1)} = [0, 1]^T, C = x_3^{(1)} = [1, 0]^T, D = x_4^{(1)} = [1, 1]^T; see Eq. (52). The XOR value for the solid red dots is 1, and for the open blue dots 0. Right: Images of points A, B, C, D in the z-plane due only to the first term of Eq. (54), i.e., w^{(1)} X^{(1)}, which is shown in Eq. (55). See also Figure 40.
For activation functions such as ReLU or Heaviside80 to have any effect, the above three points are next translated in the negative z_2 direction using the biases in Eq. (53), so that Eq. (54) yields:

Z^{(1)} = \begin{bmatrix} 0 & 1 & 1 & 2 \\ -1 & 0 & 0 & 1 \end{bmatrix} \in \mathbb{R}^{2 \times 4} , (56)
and thus

$$Y^{(1)} = \left[ y_1^{(1)}, \ldots, y_4^{(1)} \right] = a(Z^{(1)}) = \begin{bmatrix} 0 & 1 & 1 & 2 \\ 0 & 0 & 0 & 1 \end{bmatrix} = X^{(2)} = \left[ x_1^{(2)}, \ldots, x_4^{(2)} \right] \in \mathbb{R}^{2\times4} . \qquad (57)$$

For a general activation function $a(\cdot)$, the outputs of Layer (1) are:

$$Y^{(1)} = X^{(2)} = a(Z^{(1)}) = \begin{bmatrix} a(0) & a(1) & a(1) & a(2) \\ a(-1) & a(0) & a(0) & a(1) \end{bmatrix} \in \mathbb{R}^{2\times4} . \qquad (58)$$
Footnote 80: In general, the Heaviside function is not used as an activation function, since its gradient is zero, and thus would not work for gradient descent. But for this XOR problem without using gradient descent, the Heaviside function offers a workable solution, as does the rectified linear function.
Layer (2): three parameters (2 weights, 1 bias), no activation function. Eq. (59) for this layer is identical to Eq. (43) for the one-layer network above, with the output $y^{(1)}$ of Layer (1) as input $x^{(2)} = y^{(1)}$, as shown in Eq. (57):

$$\widetilde{y}_j = f(x_j^{(2)}, \theta^{(2)}) = f^{(2)}(x_j^{(2)}) = w_1^{(2)} x_j^{(2)} + b_1^{(2)} = w x_j^{(2)} + b = w_1 x_{1j}^{(2)} + w_2 x_{2j}^{(2)} + b , \qquad (59)$$

with three distinct points in Eq. (57), because $x_2^{(2)} = x_3^{(2)} = [1, 0]^T$, to solve for these three parameters:

$$\theta^{(2)} = [w_1^{(2)}, b_1^{(2)}] = [w_1, w_2, b] \qquad (60)$$
We have three equations:

$$\begin{bmatrix} \widetilde{y}_1 \\ \widetilde{y}_2 \\ \widetilde{y}_4 \end{bmatrix} = \begin{bmatrix} a(0) & a(-1) & 1 \\ a(1) & a(0) & 1 \\ a(2) & a(1) & 1 \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ b \end{bmatrix} = \begin{bmatrix} y_1 \\ y_2 \\ y_4 \end{bmatrix} = \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix} , \qquad (61)$$
for which the exact analytical solution for the parameters θ (2) is easy to obtain, but the expressions are rather
lengthy. Hence, here we only give the numerical solution for θ (2) in the case of the logistic sigmoid function
in Table 3.
Table 3: Two-layer network for XOR representation (Section 4.5). Values of parameters $\theta^{(2)}$ in Eq. (60), obtained by solving Eq. (61). The results are exact for ReLU and Heaviside, but rounded for sigmoid due to the irrational Euler's number e. See Figure 40.

Activation function    | Parameters $\theta^{(2)} = [w_1, w_2, b]$
ReLU                   | [1, -2, 0]
Heaviside              | [1, -1, 0]
Logistic sigmoid       | [24.59, -20.27, -6.85]
Figure 40: Two-layer network for XOR representation (Section 4.5). Left: Images of points A, B, C, D of $Z^{(1)}$ in Eq. (56), obtained after a translation by adding the bias $b^{(1)} = [0, -1]^T$ in Eq. (51) to the same points A, B, C, D in the right subfigure of Figure 39. The XOR value for the solid red dots is 1, and for the open blue dots 0. Right: Images of points A, B, C, D after applying the ReLU activation function, which moves point A to the origin; see Eq. (57). The points A, B (= C), D are no longer aligned, and thus linearly separable by the green dotted line, whose normal vector has the components $[1, -2]^T$, which are the weights shown in Table 3.
We conjecture that any (nonlinear) function a(·) in the zoo of activation functions listed, e.g., in “Acti-
vation function”, Wikipedia, version 21:00, 18 May 2019 or in [36] (see Figure 139), would move the three
points in Z (1) in Eq. (56) out of alignment, and thus provide the corresponding unique solution θ (2) for
Eq. (61).
Remark 4.4. Number of parameters. In 1953, physicist Freeman Dyson (Institute for Advanced Study, Princeton) once consulted with Nobel Laureate Enrico Fermi about a new mathematical model for a difficult physics problem that Dyson and his students had just developed. Fermi asked Dyson how many parameters they had. "Four", Dyson replied. Fermi then gave his now famous comment: "I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk" [124].
But it was only more than sixty years later that physicists were able to plot an elephant in 2-D using a
model with four complex numbers as parameters [125].
With nine parameters, the elephant can be made to walk (representing the XOR function), and with
a billion parameters, it may even perform some acrobatic maneuver in 3-D; see Section 4.6 on depth of
multilayer networks. ■
"There is no single correct value for the depth of an architecture (Footnote 82), just as there is no single correct value for the length of a computer program. Nor is there a consensus about how much depth a model requires to qualify as 'deep.' "
For example, keeping the number of layers the same, the "depth" of a sparsely-connected feedforward network (in which not all outputs of a layer are connected to a neuron in the following layer) should be smaller than the "depth" of a fully-connected feedforward network.
The lack of consensus on the boundary between “shallow” and “deep” networks is echoed in [12]:
“At which problem depth does Shallow Learning end, and Deep Learning begin? Discussions
with DL experts have not yet yielded a conclusive response to this question. Instead of commit-
ting myself to a precise answer, let me just define for the purposes of this overview: problems
of depth > 10 require Very Deep Learning.”
Footnote 81: There are two viewpoints on the definition of depth: one based on the computational graph, and one based on the conceptual graph. From the computational-graph viewpoint, depth is the number of sequential instructions that must be executed in an architecture. From the conceptual-graph viewpoint, depth is the number of concept levels, going from simple concepts to more complex concepts. See also [78], p. 163, for the depth of fully-connected feedforward networks as the "length of the chain" in Eq. (18) and Eq. (23), which is the number of layers.
Footnote 82: There are several different network architectures. Convolutional neural networks (CNN) use sparse connections, have achieved great success in image recognition, and contributed to the burst of interest in deep learning since winning the ImageNet competition in 2012 by almost halving the image-classification error rate; see [13, 12, 75]. Recurrent neural networks (RNN) are used to process a sequence of inputs to a system with changing states as in a dynamical system, to be discussed in Section 7. There are other networks with skip connections, in which information flows from layer (ℓ) to layer (ℓ + 2), skipping layer (ℓ + 1); see [78], p. 196.
Remark 4.5. Action depth, state depth. In view of Remark 4.3, which type of layer (action or state) were
they talking about in the above quotation? We define here action depth as the number of action layers, and
state depth as the number of state layers. The abstract network in Figure 23 has action depth L and state
depth (L + 1), with (L − 1) as the number of hidden (state) layers. ■
In [38], the review paper [13] was credited with stating that "training neural networks with more than three hidden layers is called deep learning", implying that a network is considered "deep" if its number of hidden (state) layers (L − 1) > 3. In the work reported in [38], the authors used networks with the number of hidden (state) layers (L − 1) varying from one to five, and with a constant hidden (state) layer width of m(ℓ) = 50, for all hidden (state) layers ℓ = 1, . . . , 5; see Table 1 in [38], reproduced in Figure 99 in Section 10.2.2.
An example of recognizing multidigit numbers in photographs of addresses, in which the test accuracy
increased (or test error decreased) with increasing depth, is provided in [78], p. 196; see Figure 41.
Figure 41: Test accuracy versus network depth (Section 4.6.1), showing that test accuracy for
this example increases monotonically with the network depth (number of layers). [78], p. 196.
(Figure reproduced with permission of the authors.)
But it is not clear where in [13] it was actually said that a network is "deep" if the number of hidden (state) layers is greater than three. An example in image recognition having more than three layers was, however, given in [13] (emphases are ours):
“An image, for example, comes in the form of an array of pixel values, and the learned fea-
tures in the first layer of representation typically represent the presence or absence of edges at
particular orientations and locations in the image. The second layer typically detects motifs
by spotting particular arrangements of edges, regardless of small variations in the edge posi-
tions. The third layer may assemble motifs into larger combinations that correspond to parts of
familiar objects, and subsequent layers would detect objects as combinations of these parts.”
But the above was not a criterion for a network to be considered "deep". It was further noted in [13] on the number of model parameters (weights and biases) and the size of the training dataset for a "typical deep-learning system" (emphases are ours). See Remark 7.2 on recurrent neural networks (RNNs) as equivalent to "very deep feedforward networks". Another example was also provided in [13]. A neural network with 160 billion parameters was perhaps the largest in 2015 [126].
As mentioned above, for general network architectures (other than feedforward networks), not only is there no consensus on the definition of depth, there is also no consensus on how much depth a network must have to qualify as being "deep"; see [78], p. 8, which offered the following intentionally vague definition:
“Deep learning can be safely regarded as the study of models that involve a greater amount of
composition of either learned functions or learned concepts than traditional machine learning
does.”
Figure 42 depicts the increase in the number of neurons in neural networks over time, from 1958 (Net-
work 1 by Rosenblatt (1958) [119] in Figure 42 with one neuron, which was an error in [78], as discussed
in Section 13.2) to 2014 (Network 20 GoogleNet with more than one million neurons), which was still far
below the more than ten million biological neurons in a frog.
4.6.2 Architecture
The architecture of a network comprises the number of layers (depth), the layer width (number of neurons per layer), and the connections among the neurons (Footnote 85). We have seen the architecture of fully-connected feedforward neural networks above; see Figure 23 and Figure 35.
One example of an architecture different from that of fully-connected feedforward networks is the convolutional neural network, which is based on the convolution integral (see Eq. (497) in Section 13.2.2 on "Dynamic, time dependence, Volterra series"), and which had proven successful long before deep-learning networks:
“Convolutional networks were also some of the first neural networks to solve important com-
mercial applications and remain at the forefront of commercial applications of deep learning
today. By the end of the 1990s, this system deployed by NEC was reading over 10 percent of
all the checks in the United States. Later, several OCR and handwriting recognition systems
based on convolutional nets were deployed by Microsoft.” [78], p. 360.
Footnote 83: A special type of deep network that went out of favor, and is now back in favor, among the computer-vision and machine-learning communities after the spectacular success that ConvNet garnered at the 2012 ImageNet competition; see [13] [75] [74]. Since we are reviewing in detail some specific applications of deep networks to computational mechanics, we will not review ConvNet here, but focus on MultiLayer Neural (MLN)—also known as MultiLayer Perceptron (MLP)—networks.
Footnote 84: A network processing "unit" is also called a "neuron".
Footnote 85: See [78], p. 166.
Figure 42: Increasing network size over time (Sections 4.6.1, 13.2). All networks before 2015 had their number of neurons smaller than that of a frog at 1.6 × 10^7, and still far below that in a human brain at 8.6 × 10^10; see "List of animals by number of neurons", Wikipedia, version 02:46, 9 May 2019. In [78], p. 23, it was estimated that neural-network size would double every 2.4 years (a clear parallel to Moore's law, which stated that the number of transistors on integrated circuits doubled every 2 years). It was mentioned in [78], p. 23, that Network 1 by Rosenblatt (1958 [119], 1962 [2]) had one neuron (see figure above), which was incorrect, since Rosenblatt (1957) [1] conceived a network with 1000 neurons, and even built the Mark I computer to run this network; see Section 13.2 and Figure 133. (Figure reproduced with permission of the authors.)
“Fully-connected networks were believed not to work well. It may be that the primary barri-
ers to the success of neural networks were psychological (practitioners did not expect neural
networks to work, so they did not make a serious effort to use neural networks). Whatever the
case, it is fortunate that convolutional networks performed well decades ago. In many ways,
they carried the torch for the rest of deep learning and paved the way to the acceptance of
neural networks in general.” [78], p. 361.
Here, we present a more recent and successful network architecture different from the fully-connected feedforward network. The residual network was introduced in [127] to address the vanishing-gradient problem that plagued "very deep" networks with as few as 16 layers during training (see Section 5 on Backpropagation), and the problem of increased training error and test error with increased network depth, as shown in Figure 43.
Remark 4.6. Training error, test (generalization) error. Using a set of data, called training data, to find the parameters that minimize the loss function (i.e., doing the training) provides the training error, which is the least-square error between the predicted outputs and the training data. Then running the optimally trained model on a different set of data, called test data, which has not been used for training, provides the test error, also known as the generalization error. More details can be found in Section 6, and in [78], p. 107. ■
The basic building block of residual network is shown in Figure 44, and a full residual network in
Figure 45. The rationale for residual networks was that, if the identity map were optimal, it would be easier
Figure 43: Training/test error vs. iterations, depth (Sections 4.6.2, 6). The training error
and test error of deep fully-connected networks increased when the number of layers (depth)
increased [127]. (Figure reproduced with permission of the authors.)
Figure 44: Residual network (Sections 4.6.2, 6), basic building block having two layers with
the rectified linear activation function (ReLU), for which the input is x, the output is H(x) =
F(x) + x, where the internal mapping function F(x) = H(x) − x is called the residual.
Chaining this building block one after another forms a deep residual network; see Figure 45
[127].
(Figure reproduced with permission of the authors.)
for the optimization (training) process to drive the residual F(x) down to zero than to fit the identity map with a bunch of nonlinear layers; see [127], where it was mentioned that deep residual networks won first place in several image-recognition competitions.
Remark 4.7. The identity map that jumps over a number of layers in the residual network building block
in Figure 44 and in the full residual network in Figure 45 is based on a concept close to that for the path
of the cell state c[k] in the Long Short-Term Memory (LSTM) unit for recurrent neural networks (RNN), as
described in Figure 81 in Section 7.2. ■
A deep residual network with more than 1,200 layers was proposed in [128]. A wide residual-network
architecture that outperformed deep and thin networks was proposed in [129]: “For instance, [their] wide
16-layer deep network has the same accuracy as a 1000-layer thin deep network and a comparable number
of parameters, although being several times faster to train.”
It is still not clear why some architectures work well, while others do not:
“The design of hidden units is an extremely active area of research and does not yet have many
definitive guiding theoretical principles.” [78], p. 186.
Figure 45: Full residual network (Sections 4.6.2, 6) with 34 layers, made up from 16 building
blocks with two layers each (Figure 44), together with an input and an output layer. This resid-
ual network has a total of 3.6 billion floating-point operations (FLOPs with fused multiply-add
operations), which could be considered as the network “computational depth” [127]. (Figure
reproduced with permission of the authors.)
5 Backpropagation
Backpropagation, sometimes abbreviated as “backprop”, was a child of whom many could claim to be
the father, and is used to compute the gradient of the cost function with respect to the parameters (weights
and biases); see Section 13.4.1 for a history of backpropagation. This gradient is then subsequently used
in an optimization process, usually the Stochastic Gradient Descent method, to find the parameters that
minimize the cost or loss function.
The factor $\frac{1}{2}$ is for the convenience of avoiding carrying the factor 2 when taking the gradient of the cost (or loss) function J (Footnote 87). While the components $y_k$ of the output matrix y cannot be independent and identically distributed (i.i.d.), since y must represent a recognizable pattern (e.g., an image), in the case of training with 𝗆 examples as inputs (Footnote 88):

$$\mathbb{X} = \{ x^{|1|}, \cdots, x^{|\mathsf{m}|} \} , \quad \mathbb{Y} = \{ y^{|1|}, \cdots, y^{|\mathsf{m}|} \} , \qquad (63)$$
Footnote 86: For other types of loss function, see, e.g., (1) Section "Loss functions" in "torch.nn — PyTorch Master Documentation" (Original website, Internet archive), and (2) Jah 2019, A Brief Overview of Loss Functions in Pytorch (Original website, Internet archive).
Footnote 87: There is an inconsistent use of notation in [78] that could cause confusion, e.g., in [78], Chap. 5, p. 104, Eq. (5.4), the notation ŷ (with the hat) was defined as the network outputs, i.e., predicted values, with y (without the hat) as target values, whereas later in Chap. 6, p. 163, the notation y (without the hat) was used for the network outputs. Also, in [78], p. 105, the cost function was defined as the mean squared error, without the factor ½. See also Footnote 52.
Footnote 88: In our notation, m is the dimension of the output array y, whereas 𝗆 (in a different font) is here the number of examples in Eq. (63), and later represents the minibatch size in Eqs. (136)-(138). The size of the whole training set, called the "full batch" (Footnote 117), is denoted by M (Footnote 144).
where X is the set of 𝗆 examples, and Y the set of the corresponding outputs, the examples $x^{|k|}$ can be i.i.d., and the half-MSE cost function for these outputs is half the expectation of the SE:

$$J(\theta) = \frac{1}{2}\, \text{MSE} = \frac{1}{2}\, E\big( \{ SE^{|k|},\ k = 1, \cdots, \mathsf{m} \} \big) = \frac{1}{2 \mathsf{m}} \sum_{k=1}^{\mathsf{m}} \| y^{|k|} - \widetilde{y}^{|k|} \|^2 . \qquad (64)$$
Remark 5.1. Information content, Shannon entropy, maximum likelihood. The expression in Eq. (65)—with the minus sign and the log function—can be abstract to readers not familiar with the probability concept of maximum likelihood, which is related to the concepts of information content and Shannon entropy. First, an event x with low probability (e.g., an asteroid will hit the Earth tomorrow) would have higher information content than an event with high probability (e.g., the sun will rise tomorrow morning). Since the probability of x, i.e., P(x), is between 0 and 1, the negative of the logarithm of P(x), i.e.,

$$I(x) = -\log P(x) , \qquad (68)$$

called the information content of x, has large values for P(x) near zero, and small values for P(x) near 1. In addition, the probability of two independent events occurring is the product of the probabilities of these events, e.g., the probability of having two heads in two coin tosses is

$$P(x = \text{head},\, y = \text{head}) = P(\text{head}) \times P(\text{head}) = \frac{1}{2} \times \frac{1}{2} = \frac{1}{4} . \qquad (69)$$
The product (chain) rule of conditional probabilities consists of expressing a joint probability of several random variables $\{x^{|1|}, x^{|2|}, \cdots, x^{|n|}\}$ as the product (Footnote 90)

$$P(x^{|1|}, \cdots, x^{|n|}) = P(x^{|1|}) \prod_{k=2}^{n} P(x^{|k|} \mid x^{|1|}, \cdots, x^{|k-1|}) \qquad (70)$$
Footnote 89: The simplified notation ⟨·⟩ for the expectation E(·), with implied probability distribution, is used in Section 6.3.4 on step-length decay and simulated annealing (Remark 6.9, Section 6.3.5) as an add-on improvement to the stochastic gradient-descent algorithm.
Footnote 90: See, e.g., [78], p. 57. The notation $x^{|k|}$ (with vertical bars enclosing the superscript k) is used to designate example k in the set X of examples in Eq. (73), instead of the notation $x^{(k)}$ (with parentheses), since the parentheses were already used to surround the layer number k, as in Figure 35.
The logarithm of the products in Eq. (69) and Eq. (70) is the sum of the logarithms of the factor probabilities, which provides another reason to use the logarithm in the expression for information content in Eq. (68): independent events have additive information. Concretely, the information content of two asteroids independently hitting the Earth should double that of one asteroid hitting the Earth.
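These properties can be checked in a few lines (our NumPy illustration):

```python
import numpy as np

info = lambda p: -np.log(p)  # information content I(x) of Eq. (68)

print(info(0.999999))  # sunrise: ~1e-6 nats, almost no information
print(info(1e-9))      # asteroid impact: ~20.7 nats, highly informative

# Additivity for independent events, cf. Eq. (69): two heads in two tosses
assert np.isclose(info(0.5 * 0.5), info(0.5) + info(0.5))
```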
The parameters $\widetilde{\theta}$ that minimize the probability cost J(θ) in Eq. (65) can be expressed as (Footnote 91)

$$\widetilde{\theta} = \arg\min_{\theta} \left[ - E_{x, y \sim \hat{p}_{\text{data}}} \log p_{\text{model}}(y \mid x; \theta) \right] = \arg\max_{\theta} \frac{1}{\mathsf{m}} \sum_{k=1}^{\mathsf{m}} \log p_{\text{model}}(y^{|k|} \mid x^{|k|}; \theta) \qquad (71)$$

$$= \arg\max_{\theta} \frac{1}{\mathsf{m}} \log \prod_{k=1}^{\mathsf{m}} p_{\text{model}}(y^{|k|} \mid x^{|k|}; \theta) = \arg\max_{\theta}\, p_{\text{model}}(\mathbb{Y} \mid \mathbb{X}; \theta) \qquad (72)$$
Remark 5.2. Relation between Mean Squared Error and Maximum Likelihood. The MSE is a particular case of the Maximum Likelihood. Consider having 𝗆 examples $\mathbb{X} = \{x^{|1|}, \cdots, x^{|\mathsf{m}|}\}$ that are independent and identically distributed (i.i.d.), as in Eq. (63). If the model probability $p_{\text{model}}(y^{|k|} \mid x^{|k|}; \theta)$ has a normal distribution, with the predicted output

$$\widetilde{y}^{|k|} = f(x^{|k|}, \theta) \qquad (75)$$

as in Eq. (66), predicting the mean of this normal distribution (Footnote 93), then

$$p_{\text{model}}(y^{|k|} \mid x^{|k|}; \theta) = \mathcal{N}(y^{|k|}; f(x^{|k|}, \theta), \sigma^2) = \mathcal{N}(y^{|k|}; \widetilde{y}^{|k|}, \sigma^2) = \frac{1}{[2\pi\sigma^2]^{1/2}} \exp\left[ - \frac{\| y^{|k|} - \widetilde{y}^{|k|} \|^2}{2\sigma^2} \right] , \qquad (76)$$

with σ designating the standard deviation, i.e., the error between the target output $y^{|k|}$ and the predicted output $\widetilde{y}^{|k|}$ is normally distributed. By taking the logarithm of $p_{\text{model}}(y^{|k|} \mid x^{|k|}; \theta)$, we have

$$\log p_{\text{model}}(y^{|k|} \mid x^{|k|}; \theta) = -\frac{1}{2} \log(2\pi\sigma^2) - \frac{\| y^{|k|} - \widetilde{y}^{|k|} \|^2}{2\sigma^2} . \qquad (77)$$

Then summing Eq. (77) over all examples k = 1, · · · , 𝗆, as in the last expression in Eq. (71), yields

$$\sum_{k=1}^{\mathsf{m}} \log p_{\text{model}}(y^{|k|} \mid x^{|k|}; \theta) = -\frac{\mathsf{m}}{2} \log(2\pi\sigma^2) - \sum_{k=1}^{\mathsf{m}} \frac{\| y^{|k|} - \widetilde{y}^{|k|} \|^2}{2\sigma^2} , \qquad (78)$$
Footnote 91: A tilde is put on top of θ to indicate that the matrix $\widetilde{\theta}$ contains the estimated values of the parameters (weights and biases), called the estimates, not the true parameters. Recall from Footnote 87 that [78] used an overhead "hat" (ˆ·) to indicate a predicted value; see [78], p. 120, where θ is defined as the true parameters, and θ̂ the predicted (or estimated) parameters.
Footnote 92: See, e.g., [78], p. 128.
Footnote 93: The normal (Gaussian) distribution of a scalar random variable x, with mean µ and variance σ², is written as $\mathcal{N}(x; \mu, \sigma^2) = (2\pi\sigma^2)^{-1/2} \exp[-(x - \mu)^2 / (2\sigma^2)]$; see, e.g., [130], p. 24.
where the MSE cost function J(θ) was defined in Eq. (64), noting that constants such as 𝗆 or 2𝗆 do not affect the value of the minimizer $\widetilde{\theta}$.
Thus finding the minimizer of the maximum-likelihood cost function in Eq. (65) is the same as finding the minimizer of the MSE in Eq. (62); see also [78], p. 130. ■
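The equivalence can also be verified numerically; in the sketch below (our illustration, with scalar outputs), the half-MSE of Eq. (64) and the negative of the log-likelihood of Eq. (78), evaluated over a grid of candidate predictions, attain their minima at the same candidate:

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=0.3, scale=1.0, size=1000)  # target outputs y^{|k|}

def half_mse(y_tilde):  # Eq. (64) for scalar outputs
    return 0.5 * np.mean((y - y_tilde) ** 2)

def neg_log_lik(y_tilde, sigma=1.0):  # negative of Eq. (78), per example
    return (0.5 * np.log(2 * np.pi * sigma**2)
            + np.mean((y - y_tilde) ** 2) / (2 * sigma**2))

grid = np.linspace(-1.0, 1.0, 401)
assert (np.argmin([half_mse(c) for c in grid])
        == np.argmin([neg_log_lik(c) for c in grid]))  # same minimizer
```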
Remark 5.2 justifies the use of the Mean Squared Error as a Maximum Likelihood estimator (Footnote 94). For the purpose of this review paper, it is sufficient to use the MSE cost function in Eq. (42) to develop the backpropagation procedure.
The output of the neural network is supposed to represent the probability $p_{\text{model}}(y = 1 \mid x; \theta)$, i.e., a real-valued number in the interval [0, 1]. A linear output layer $\widetilde{y} = y^{(L)} = z^{(L)} = W^{(L)} y^{(L-1)} + b^{(L)}$ does not meet this constraint in general. To squash the output of the linear layer into the range [0, 1], the logistic sigmoid function s (see Figure 30) can be added to the linear output unit to render $z^{(L)}$ a probability:

$$\widetilde{y} = y^{(L)} = a(z^{(L)}) , \quad a(z^{(L)}) = s(z^{(L)}) . \qquad (81)$$
In case more than two categories occur in a classification problem, a neural network is trained to estimate the probability distribution over the discrete number (k > 2) of classes. Such a distribution is referred to as a multinoulli or categorical distribution, which is parameterized by the conditional probabilities $p_i = p(y = i \mid x) \in [0, 1]$, i = 1, . . . , k, of an input x belonging to the i-th category. The output of the neural network accordingly is a k-dimensional vector $\widetilde{y} \in \mathbb{R}^{k \times 1}$, where $\widetilde{y}_i = p(y = i \mid x)$. In addition to the requirement of each component $\widetilde{y}_i$ being in the range [0, 1], we must also guarantee that all components sum up to 1 to satisfy the definition of a probability distribution.
Footnote 94: See, e.g., [78], p. 130.
Footnote 95: See, e.g., [130], p. 68.
For this purpose, the idea of exponentiation and normalization is used; it can be expressed as a change of variable in the logistic sigmoid function s (Figure 30, Section 5.3.1), as in the following example ([78], p. 177):

$$p(y) = s[(2y - 1) z] = \frac{1}{1 + \exp[-(2y - 1) z]} , \quad \text{with } y \in \{0, 1\} \text{ and } z = \text{constant} , \qquad (82)$$

$$p(0) + p(1) = s(-z) + s(z) = \frac{1}{1 + \exp(z)} + \frac{1}{1 + \exp(-z)} = 1 , \qquad (83)$$

and is then generalized to vector-valued outputs; see also Figure 46.
The softmax function (Footnote 96) converts the vector formed by a linear unit $z^{(L)} = W^{(L)} y^{(L-1)} + b^{(L)} \in \mathbb{R}^k$ into the vector of probabilities $\widetilde{y}$ by means of

$$\widetilde{y}_i = (\operatorname{softmax} z^{(L)})_i = \frac{\exp z_i^{(L)}}{\sum_{j=1}^{k} \exp z_j^{(L)}} , \qquad (84)$$
Figure 46: Softmax function for two classes, logistic sigmoid (Sections 5.1.3, 5.3.1): $s(z) = [1 + \exp(-z)]^{-1}$ and $s(-z) = [1 + \exp(z)]^{-1}$, such that $s(z) + s(-z) = 1$. See also Figure 30.
Remark 5.3. Softmax function from Bayes' theorem. For a classification with multiple classes $\{C_k, k = 1, \ldots, K\}$, particularized to the case of two classes with K = 2, the probability for class $C_1$, given the input column matrix x, is obtained from Bayes' theorem (Footnote 97) as follows ([130], p. 197):

$$p(C_1 \mid x) = \frac{p(x \mid C_1)\, p(C_1)}{p(x)} = \frac{p(x, C_1)}{p(x, C_1) + p(x, C_2)} = \frac{\exp(z)}{\exp(z) + 1} = \frac{1}{1 + \exp(-z)} = s(z) , \qquad (85)$$

$$\text{with } z := \ln \frac{p(x, C_1)}{p(x, C_2)} \;\Rightarrow\; \frac{p(x, C_1)}{p(x, C_2)} = \exp(z) , \qquad (86)$$
Footnote 96: See also [78], p. 179 and p. 78, where the softmax function is used to stabilize against the underflow and overflow problem in numerical computation.
Footnote 97: Since the probability of x and y is p(x, y) = p(y, x), and since p(x, y) = p(x|y)p(y) (which is the product rule), where p(x|y) is the probability of x given y, we have p(x|y)p(y) = p(y|x)p(x), and thus p(y|x) = p(x|y)p(y)/p(x). The sum rule is $p(x) = \sum_y p(x, y)$. See, e.g., [130], p. 15. The right-hand side of the second equation in Eq. (85) makes common sense in terms of the predator-prey problem, in which p(x, C1) would be the percentage of predators in the total predator-prey population, and p(x, C2) the percentage of prey, as the self-proclaimed "best mathematician of France" Laplace said: "probability theory is nothing but common sense reduced to calculation" [130], p. 24.
where the product rule was applied to the numerator of Eq. (85), the sum rule to the denominator, and s is the logistic sigmoid. Likewise,

$$p(C_2 \mid x) = \frac{p(x, C_2)}{p(x, C_1) + p(x, C_2)} = \frac{1}{\exp(z) + 1} = s(-z) \qquad (87)$$
It is convenient to recall here some equations developed earlier (keeping the same equation numbers) for the computation of the gradient $\partial J / \partial \theta^{(\ell)}$ of the cost function J(θ) with respect to the parameters $\theta^{(\ell)}$ in layer (ℓ), going backward from the last layer, ℓ = L, · · · , 1.
• Cost function J(θ):

$$J(\theta) = \frac{1}{2}\, \text{MSE} = \frac{1}{2 m} \| y - \widetilde{y} \|^2 = \frac{1}{2 m} \sum_{k=1}^{m} \left( y_k - \widetilde{y}_k \right)^2 , \qquad (62)$$

• Inputs $x = y^{(0)} \in \mathbb{R}^{m_{(0)} \times 1}$ with $m_{(0)} = n$, and predicted outputs $y^{(L)} = \widetilde{y} \in \mathbb{R}^{m_{(L)} \times 1}$ with $m_{(L)} = m$:

$$y^{(0)} = x^{(1)} = x \ \text{(inputs)} , \qquad y^{(L)} = f^{(L)}(x^{(L)}) = \widetilde{y} \ \text{(predicted outputs)} \qquad (19)$$
Footnote 98: See also [130], p. 115: version 1 of the softmax function, i.e., $\mu_k = \exp(\eta_k) / [1 + \sum_j \exp \eta_j]$ with $\sum_k \mu_k \le 1$, had "1" as a summand in the denominator, similar to Eq. (89), while version 2 ([130], p. 198) did not, similar to Eq. (90), and was the same as Eq. (84).
Footnote 99: See [78], Section 6.5.4, p. 206, Algorithm 6.4.
Figure 47: Backpropagation building block, typical layer (ℓ) (Section 5.2, Algorithm 1, Appendix 1). The forward-propagation path is shown in blue, with the backpropagation path in red. The update of the parameters $\theta^{(\ell)}$ in layer (ℓ) is done as soon as the gradient $\partial J / \partial \theta^{(\ell)}$ is available using a gradient-descent algorithm. The row matrix $r^{(\ell)} = \partial J / \partial z^{(\ell)}$ in Eq. (104) can be computed once for use to evaluate both the gradient $\partial J / \partial \theta^{(\ell)}$ in Eq. (105) and the gradient $\partial J / \partial y^{(\ell-1)}$ in Eq. (106), then discarded to free up memory. See pseudocode in Algorithm 1.
$$\theta = \{ \theta^{(1)}, \cdots, \theta^{(\ell)}, \cdots, \theta^{(L)} \} = \{ \Theta^{(1)}, \cdots, \Theta^{(\ell)}, \cdots, \Theta^{(L)} \} , \ \text{such that } \theta^{(\ell)} \equiv \Theta^{(\ell)} \qquad (31)$$

The gradient of the cost function J(θ) with respect to the parameters $\theta^{(\ell)}$ in layer (ℓ), for ℓ = L, · · · , 1,
Figure 48: Backpropagation in fully-connected network (Sections 5.2, 5.3, Algorithm 1, Appendix 1). Starting from the predicted output $\widetilde{y} = y^{(L)}$ in the last layer (L) at the end of any forward propagation (blue arrows), and going backward (red arrows) to the first layer with ℓ = L, · · · , 1: along the way at layer (ℓ), compute the gradient of the cost function J relative to the parameters $\theta^{(\ell)}$ to update those parameters in a gradient descent, then compute the gradient of J relative to the outputs $y^{(\ell-1)}$ of the lower-level layer (ℓ − 1) to continue the backpropagation. See pseudocode in Algorithm 1. For a particular example of the above general case, see Figure 51.
is simply:

$$\frac{\partial J(\theta)}{\partial \theta^{(\ell)}} = \frac{\partial J(\theta)}{\partial y^{(\ell)}} \frac{\partial y^{(\ell)}}{\partial \theta^{(\ell)}} \;\Leftrightarrow\; \frac{\partial J}{\partial \theta_{ij}^{(\ell)}} = \sum_{k=1}^{m_{(\ell)}} \frac{\partial J}{\partial y_k^{(\ell)}} \frac{\partial y_k^{(\ell)}}{\partial \theta_{ij}^{(\ell)}} \qquad (91)$$

$$\frac{\partial J(\theta)}{\partial \theta^{(\ell)}} = \left[ \frac{\partial J}{\partial \theta_{ij}^{(\ell)}} \right] \in \mathbb{R}^{m_{(\ell)} \times [m_{(\ell-1)} + 1]} , \quad \frac{\partial J(\theta)}{\partial y^{(\ell)}} = \left[ \frac{\partial J}{\partial y_k^{(\ell)}} \right] \in \mathbb{R}^{1 \times m_{(\ell)}} \ \text{(row)} \qquad (92)$$
The above equations are valid for the last layer, ℓ = L, since the predicted output $\widetilde{y}$ is the same as the output of the last layer (L), i.e., $\widetilde{y} \equiv y^{(L)}$ by Eq. (19). Similarly, these equations are also valid for the first layer (1), since the input for layer (1) is $x^{(1)} = x = y^{(0)}$. Using Eq. (35), we obtain (no sum on i):

$$\frac{\partial y_i^{(\ell)}}{\partial \theta_{ij}^{(\ell)}} = a'(z_i^{(\ell)})\, y_j^{(\ell-1)} \qquad (93)$$
Using Eq. (93) in Eq. (91) leads to the expressions for the gradient, both in component form (left) and in matrix form (right):

$$\frac{\partial J}{\partial \theta_{ij}^{(\ell)}} = \frac{\partial J}{\partial y_i^{(\ell)}}\, a'(z_i^{(\ell)})\, y_j^{(\ell-1)} \ \text{(no sum on } i\text{)} \;\Leftrightarrow\; \frac{\partial J}{\partial \theta^{(\ell)}} = \left\{ \left[ \frac{\partial J}{\partial y_i^{(\ell)}} \right]^T \odot \left[ a'(z_i^{(\ell)}) \right] \right\} \left[ y_j^{(\ell-1)} \right]^T , \qquad (94)$$

where ⊙ is the elementwise multiplication, known as the Hadamard operator, defined as follows:

$$[p_i], [q_i] \in \mathbb{R}^{m \times 1} \;\Rightarrow\; [p_i] \odot [q_i] = [(p_i q_i)] = [p_1 q_1, \cdots, p_m q_m]^T \in \mathbb{R}^{m \times 1} \ \text{(no sum on } i\text{)} , \qquad (95)$$
Algorithm 1: Backpropagation pseudocode (Section 5.2)
Data:
• Input into network $x = y^{(0)} \in \mathbb{R}^{m_{(0)} \times 1}$
• Learning rate ϵ; see Section 6.2 on deterministic optimization
• Results from any forward propagation:
  ⋆ Network parameters $\theta = \{\theta^{(1)}, \cdots, \theta^{(L)}\}$ (all layers)
  ⋆ Layer weighted inputs and biases $z^{(\ell)}$, for ℓ = 1, · · · , L
  ⋆ Layer outputs $y^{(\ell)}$, for ℓ = 1, · · · , L
Result: Updated network parameters θ to reduce cost function J.
Initialize:
• Gradient $\partial J / \partial y^{(L)}$ relative to the predicted output $\widetilde{y} = y^{(L)}$ of last layer L, using Eq. (99)
• Set layer counter ℓ to last layer L, i.e., ℓ ← L
while ℓ > 0, for current layer (ℓ), do
  Compute gradient $r^{(\ell)} = \partial J / \partial z^{(\ell)}$ of cost J relative to the weighted sum $z^{(\ell)}$, using Eq. (104)
  Compute gradient $g^{(\ell)} = \partial J / \partial \theta^{(\ell)}$ of cost J relative to the layer parameters $\theta^{(\ell)}$, using Eq. (105)
  Update layer parameters $\theta^{(\ell)}$ to decrease cost J, using gradient descent Eq. (120), Section 6.2
  Compute gradient $\partial J / \partial y^{(\ell-1)}$ of cost J relative to the outputs $y^{(\ell-1)}$, using Eq. (106)
  Decrement layer counter by one: ℓ ← ℓ − 1
  Propagate to lower-level layer (ℓ − 1)
end
Compute the gradient of the cost function J relative to the parameters $\theta^{(\ell)}$, with ℓ = L, · · · , 1, by backpropagation within one step of a gradient-descent algorithm to find parameters that decrease the cost function. The focus here is backpropagation, not the overall optimization. So the algorithm starts at the end of any forward propagation to begin backpropagation, from the last layer (L) back to the first layer (1), to update the parameters to decrease the cost function. See also the block diagrams in Figure 47 and Figure 48, and Appendix 1, where an alternative backprop Algorithm 9 is used to explain the equivalent Algorithm 6.4 in [78], p. 206.
and

$$\left[ \frac{\partial J}{\partial y_i^{(\ell)}} \right] \in \mathbb{R}^{1 \times m_{(\ell)}} \ \text{(row)} , \quad z^{(\ell)} = \left[ z_k^{(\ell)} \right] \in \mathbb{R}^{m_{(\ell)} \times 1} \ \text{(column)} \;\Rightarrow\; \left[ a'(z_i^{(\ell)}) \right] \in \mathbb{R}^{m_{(\ell)} \times 1} \ \text{(column)} \qquad (96)$$

$$\left[ \frac{\partial J}{\partial y_i^{(\ell)}} \right]^T \odot \left[ a'(z_i^{(\ell)}) \right] \in \mathbb{R}^{m_{(\ell)} \times 1} \ \text{(column, no sum on } i\text{)} , \ \text{and} \ \left[ y_j^{(\ell-1)} \right] \in \mathbb{R}^{[m_{(\ell-1)} + 1] \times 1} \ \text{(column)} \qquad (97)$$

$$\Rightarrow\; \frac{\partial J}{\partial \theta_{ij}^{(\ell)}} = \left\{ \left[ \frac{\partial J}{\partial y_i^{(\ell)}} \right]^T \odot \left[ a'(z_i^{(\ell)}) \right] \right\} \left[ y_j^{(\ell-1)} \right]^T \in \mathbb{R}^{m_{(\ell)} \times [m_{(\ell-1)} + 1]} , \qquad (98)$$
which then agrees with the matrix dimension in the first expression for ∂J/∂θ (ℓ) in Eq. (92). For the last
layer ℓ = L, all terms on the right-hand side of Eq. (98) are available for the computation of the gradient
$\partial J / \partial \theta_{ij}^{(L)}$, since

$$y^{(L)} = \widetilde{y} \;\Rightarrow\; \frac{\partial J}{\partial y^{(L)}} = \frac{\partial J}{\partial \widetilde{y}} = \frac{1}{\mathsf{m}} \sum_{k=1}^{\mathsf{m}} \left( \widetilde{y}^{|k|} - y^{|k|} \right)^T \in \mathbb{R}^{1 \times m_{(L)}} \ \text{(row)} \qquad (99)$$
with 𝗆 being the number of examples and $m_{(L)}$ the width of layer (L), is the mean error from the expression of the cost function in Eq. (64), with $z_i^{(L)}$ and $y_j^{(L-1)}$ already computed in the forward propagation. To compute the gradient of the cost function with respect to the parameters $\theta^{(L-1)}$ in layer (L − 1), we need the derivative $\partial J / \partial y_i^{(L-1)}$, per Eq. (98). Thus, in general, the derivative of the cost function J with respect to the output matrix $y^{(\ell-1)}$ of layer (ℓ − 1), i.e., $\partial J / \partial y_i^{(\ell-1)}$, can be expressed in terms of the previously computed derivative $\partial J / \partial y_i^{(\ell)}$ and other quantities for layer (ℓ) as follows:
$$\frac{\partial J}{\partial y_i^{(\ell-1)}} = \sum_k \frac{\partial J}{\partial y_k^{(\ell)}} \frac{\partial y_k^{(\ell)}}{\partial y_i^{(\ell-1)}} = \sum_k \frac{\partial J}{\partial y_k^{(\ell)}}\, a'(z_k^{(\ell)})\, w_{ki}^{(\ell)} = \sum_k \frac{\partial J}{\partial z_k^{(\ell)}} \frac{\partial z_k^{(\ell)}}{\partial y_i^{(\ell-1)}} \qquad (100)$$

$$\left[ \frac{\partial J}{\partial y_i^{(\ell-1)}} \right] = \left\{ \left[ \frac{\partial J}{\partial y_k^{(\ell)}} \right] \odot \left[ a'(z_k^{(\ell)}) \right]^T \right\} \left[ w_{ki}^{(\ell)} \right] \in \mathbb{R}^{1 \times m_{(\ell-1)}} \ \text{(no sum on } k\text{)} \qquad (101)$$

$$\left[ \frac{\partial J}{\partial y_i^{(\ell-1)}} \right] \in \mathbb{R}^{1 \times m_{(\ell-1)}} , \quad \left[ \frac{\partial J}{\partial y_k^{(\ell)}} \right] \in \mathbb{R}^{1 \times m_{(\ell)}} \ \text{(row)} , \quad \left[ a'(z_k^{(\ell)}) \right] \in \mathbb{R}^{m_{(\ell)} \times 1} \ \text{(column)} \qquad (102)$$

$$\left[ \frac{\partial J}{\partial y_k^{(\ell)}} \right] \odot \left[ a'(z_k^{(\ell)}) \right]^T = \left[ \frac{\partial J}{\partial z_k^{(\ell)}} \right] \in \mathbb{R}^{1 \times m_{(\ell)}} \ \text{(no sum on } k\text{)} , \quad \left[ w_{ki}^{(\ell)} \right] = \left[ \frac{\partial z_k^{(\ell)}}{\partial y_i^{(\ell-1)}} \right] \in \mathbb{R}^{m_{(\ell)} \times m_{(\ell-1)}} . \qquad (103)$$
Comparing Eq. (101) and Eq. (98), when backpropagation reaches layer (ℓ), the same row matrix

$$r^{(\ell)} := \frac{\partial J}{\partial z^{(\ell)}} = \left[ \frac{\partial J}{\partial y_i^{(\ell)}} \right] \odot \left[ a'(z_i^{(\ell)}) \right]^T = \frac{\partial J}{\partial y^{(\ell)}} \odot a'(z^{(\ell)\,T}) \in \mathbb{R}^{1 \times m_{(\ell)}} \ \text{(row)} \qquad (104)$$

needs to be computed only once, for use in computing both the gradient of the cost J relative to the parameters $\theta^{(\ell)}$ [see Eq. (98) and Figure 47]

$$\frac{\partial J}{\partial \theta^{(\ell)}} = r^{(\ell)\,T}\, y^{(\ell-1)\,T} \in \mathbb{R}^{m_{(\ell)} \times [m_{(\ell-1)} + 1]} \qquad (105)$$

and the gradient of the cost J relative to the outputs $y^{(\ell-1)}$ of layer (ℓ − 1) [see Eq. (101) and Figure 47]

$$\frac{\partial J}{\partial y^{(\ell-1)}} = r^{(\ell)}\, W^{(\ell)} \in \mathbb{R}^{1 \times m_{(\ell-1)}} . \qquad (106)$$
The block diagram for backpropagation at layer (ℓ)—as described in Eq. (104), Eq. (105), Eq. (106)—is
given in Figure 47, and for a fully-connected network in Figure 48, with pseudocode given in Algorithm 1.
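As a companion to Algorithm 1, the following NumPy sketch (our illustration: a single training example, sigmoid activations, half-MSE cost, and the standard gradient-descent update assumed for Eq. (120)) transcribes the row-matrix conventions of Eqs. (104)-(106):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, Ws, bs, eps=0.1):
    """One forward + backward pass of Algorithm 1; x and y are column
    matrices; Ws and bs hold the weights and biases of each layer."""
    ys, zs = [x], []                 # forward propagation, storing z, y
    for W, b in zip(Ws, bs):
        zs.append(W @ ys[-1] + b)
        ys.append(sigmoid(zs[-1]))
    dJ_dy = (ys[-1] - y).T           # Eq. (99), single example (row matrix)
    for l in reversed(range(len(Ws))):
        a_prime = sigmoid(zs[l]) * (1.0 - sigmoid(zs[l]))
        r = dJ_dy * a_prime.T        # Eq. (104): r = dJ/dz, a row matrix
        dJ_dW = r.T @ ys[l].T        # Eq. (105), weight part of dJ/dtheta
        dJ_dy = r @ Ws[l]            # Eq. (106), to continue backprop
        Ws[l] -= eps * dJ_dW         # gradient-descent update, cf. Eq. (120)
        bs[l] -= eps * r.T
    return Ws, bs
```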
We note immediately that the vanishing / exploding gradient problem can be resolved using the rectified linear function (ReLU, Figure 24) as activation function in combination with "normalized initialization" (Footnote 100) and "intermediate normalization layers", which are mentioned in [127], and which we will not discuss here.
The speed of learning of a hidden layer (ℓ) in Figure 49 is defined as the norm of the gradient $g^{(\ell)}$ of the cost function J(θ) with respect to the parameters $\theta^{(\ell)}$ in that hidden layer:

$$\| g^{(\ell)} \| = \left\| \frac{\partial J}{\partial \theta^{(\ell)}} \right\| \qquad (107)$$

The speed of learning in each of the four layers, as a function of the number of epochs (Footnote 101) of training, drops quickly after less than 50 training epochs, then plateaus out, as depicted in Figure 49, where the speed of learning of layer (1) was 100 times less than that of layer (4) after 400 training epochs.
Figure 49: Vanishing gradient problem (Section 5.3). Speed of learning of earlier layers is much slower than that of later layers. Here, after 400 epochs of training, the speed of learning of Layer (1) at 10^{-5} (blue line) is 100 times slower than that of Layer (4) at 10^{-3} (green line); [21], Chapter 5, "Why are deep neural networks hard to train?" (CC BY-NC 3.0).
To understand the reason for the quick and significant decrease in the speed of learning, consider a network with four layers, having one scalar input x with target scalar output y, and predicted scalar output $\widetilde{y}$, as shown in Figure 50, where each layer has one neuron (Footnote 102). The cost function and its derivative are

$$J(\theta) = \frac{1}{2} (y - \widetilde{y})^2 , \quad \frac{\partial J}{\partial \widetilde{y}} = y - \widetilde{y} \qquad (108)$$
Footnote 100: See [78], p. 295.
Footnote 101: An epoch is when all examples in the dataset have been used in a training session of the optimization process. For a formal definition of "epoch", see Section 6.3.1 on stochastic gradient descent (SGD) and Footnote 145.
Footnote 102: See also [21].
Figure 50: Neural network with four layers (Section 5.3), one neuron per layer, scalar input x, scalar output y, cost function $J(\theta) = \frac{1}{2}(y - \widetilde{y})^2$, with $\widetilde{y} = y^{(4)}$ being the predicted output, i.e., the output of layer (4), such that $f^{(\ell)}(y^{(\ell-1)}) = a(z^{(\ell)})$, with $a(\cdot)$ being the activation function, $z^{(\ell)} = w^{(\ell)} y^{(\ell-1)} + b^{(\ell)}$, for ℓ = 1, . . . , 4, and the network parameters $\theta = [w^{(1)}, \ldots, w^{(4)}, b^{(1)}, \ldots, b^{(4)}]$. The detailed block diagram is in Figure 51.
The neuron in layer (ℓ) accepts the scalar input $y^{(\ell-1)}$ to produce the scalar output $y^{(\ell)}$ according to

$$f^{(\ell)}(y^{(\ell-1)}) = a(z^{(\ell)}) , \ \text{with } z^{(\ell)} = w^{(\ell)} y^{(\ell-1)} + b^{(\ell)} . \qquad (109)$$

As an example of computing the gradient, the derivative of the cost function J with respect to the bias $b^{(1)}$ of layer (1) is given by

$$\frac{\partial J}{\partial b^{(1)}} = (y - \widetilde{y}) \left[ a'(z^{(4)}) w^{(4)} \right] \left[ a'(z^{(3)}) w^{(3)} \right] \left[ a'(z^{(2)}) w^{(2)} \right] a'(z^{(1)}) \qquad (110)$$
∂b
The back propagation procedure to compute the gradient ∂J/∂b(1) in Eq. (110) is depicted in Figure 51,
which is a particular case of the more general Figure 48.
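The geometric decay of the chain-rule product in Eq. (110) can be seen directly (our illustration): with the sigmoid, each factor $a'(z^{(\ell)}) w^{(\ell)}$ has magnitude at most 1/4 when $|w^{(\ell)}| \le 1$, so the gradient shrinks rapidly with depth:

```python
import numpy as np

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
d_sigmoid = lambda z: sigmoid(z) * (1.0 - sigmoid(z))  # max value 1/4 at z = 0

w, z = 1.0, 0.0                    # hypothetical weight and pre-activation
factor = d_sigmoid(z) * w          # one a'(z) * w factor of Eq. (110): 0.25
for depth in (1, 2, 3, 4):
    print(depth, factor ** depth)  # 0.25, 0.0625, 0.015625, 0.00390625
```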
Figure 51: Neural network with four layers in Figure 50 (Section 5.3). Detailed block diagram. Forward propagation (blue arrows) and backpropagation (red arrows). In the forward-propagation wave, at each layer (ℓ), the product $a' w^{(\ell)}$ is computed and stored, awaiting the chain-rule derivative to arrive at this layer to multiply. The cost function J(θ) is computed together with its derivative ∂J/∂y, which is the backprop starting point, from which, when following the backpropagation red arrow, the order of the factors is as in Eq. (110), until the derivative $\partial J / \partial b^{(1)}$ is reached at the head of the backprop red arrow. (To save space, only the weights are shown, not the biases, which are not needed in the backpropagation.) The speed of learning is slowed down significantly in early layers due to the vanishing gradient, as shown in Figure 49. See also the more general case in Figure 48.
Remark 5.5. While the vanishing-gradient problem for multilayer networks (static case) may be alleviated by weights that vary from layer to layer (the mixed cases mentioned above), this problem is especially critical in the case of Recurrent Neural Networks, since the weights stay constant for all state numbers (or "time") in a sequence of data. See Remark 7.3 on "short-term memory" in Section 7.2 on Long Short-Term Memory. In backpropagation through the states in a sequence of data, from the last state back to the first state, the same weight keeps being multiplied by itself. Hence, when a weight has magnitude less than 1, successive powers of its magnitude eventually decrease to zero when progressing back to the first state. ■
Figure 52: Sigmoid and hyperbolic tangent functions, derivatives (Section 5.3.1). The derivative of the sigmoid function (s′(z) = s(z)[1 − s(z)], green line) is less than 1 everywhere, whereas the derivative of the hyperbolic tangent (tanh′(z) = 1 − tanh²(z), purple line) is less than 1 everywhere, except at the abscissa z = 0, where it is equal to 1.
The exploding-gradient problem is the opposite of the vanishing-gradient problem, and occurs when the magnitude of the gradient increases in subsequent multiplications, particularly at a "cliff", which is a sharp drop in the cost function in the parameter space (Footnote 103). The gradient at the brink of a cliff (Figure 53) has large magnitude; when such gradients are multiplied with each other several times along the backpropagation path, an exploding gradient results.
“For a given input only a subset of neurons are active. Computation is linear on this subset ...
Because of this linearity, gradients flow well on the active paths of neurons (there is no gradient
Footnote 103: See [78], p. 281.
Figure 53: Cost-function cliff (Section 5.3.1). A cliff, or a sharp drop in the cost function. The parameter space is represented by a weight w and a bias b. The slope at the brink of the cliff leads to large-magnitude gradients, which when multiplied with each other several times along the backpropagation path would result in an exploding-gradient problem. [78], p. 281. (Figure reproduced with permission of the authors.)
vanishing effect due to activation non-linearities of sigmoid or tanh units), and mathematical
investigation is easier. Computations are also cheaper: there is no need for computing the
exponential function in activations, and sparsity can be exploited.”
A problem with ReLU was that some neurons were never activated, and were called "dying" or "dead" neurons, as described in [131]:
“However, ReLU units are at a potential disadvantage during optimization because the gradient
is 0 whenever the unit is not active. This could lead to cases where a unit never activates as a
gradient-based optimization algorithm will not adjust the weights of a unit that never activates
initially. Further, like the vanishing gradients problem, we might expect learning to be slow
when training ReL networks with constant 0 gradients.”
To remedy this "dying" or "dead" neuron problem, the Leaky ReLU, proposed in [131] (Footnote 104), has the expression already given in Eq. (40), and can be viewed as an approximation to the leaky diode in Figure 29. Both ReLU and Leaky ReLU had been known and used in neuroscience for years before being imported into artificial neural networks; see Section 13 for a historical review.
Figure 54: Rectified Linear Unit (ReLU, left) and Parametric ReLU (right) (Section 5.3.2), in which the slope s is a parameter to optimize; see Section 5.3.3. See also Figure 24 on ReLU.
Figure 55: Cost-function landscape (Section 6). Residual network with 56 layers (ResNet-56) on the CIFAR-10 training set: highly non-convex, with many local minima and deep, narrow valleys [132]. The training error and test error for a fully-connected network increased when the number of layers was increased from 20 to 56 (Figure 43), motivating the introduction of the residual network; see Figure 44 and Figure 45, Section 4.6.2. (Figure reproduced with permission of the authors.)
Footnote 105: A "full batch" is a complete training set of examples; see Footnote 117.
Footnote 106: A minibatch is a random subset of the training set, which is called here the "full batch"; see Footnote 117.
Footnote 107: See "CIFAR-10", Wikipedia, version 16:44, 14 October 2019.
Deterministic optimization methods (Section 6.2) include first-order gradient method (Algorithm 2) and
second-order quasi-Newton method (Algorithm 3), with line searches based on different rules, introduced
by Goldstein, Armijo, and Wolfe.
Stochastic optimization methods (Section 6.3) include
• First-order stochastic gradient descent (SGD) methods (Algorithm 4), with add-on tricks such as
momentum and accelerated gradient
• Adaptive learning-rate algorithms (Algorithm 5): Adam and variants such as AMSGrad, AdamW,
etc. that are popular in the machine-learning community
• Criticism of adaptive methods and SGD resurgence with add-on tricks such as effective tuning and
step-length decay (or annealing)
• Classical line search with stochasticity: SGD with Armijo line search (Algorithm 6), second-order Newton method with Armijo-like line search (Algorithm 7)
Figure 56: Training set, validation set, test set (Section 6.1). Partition of whole dataset. The
examples are independent. The three subsets are identically distributed.
Beyond the interpolation threshold N⋆, the variance can be decreased by using ensemble averaging, as shown by the orange line in Figure 61.
Such modern practice was the motivation for research into shallow networks with infinite width as a first step toward understanding how overparameterized networks work so well; see Figure 148 and Section 14.2, "Lack of understanding on why deep learning worked."
Figure 57: Training and validation learning curves—classical viewpoint (Section 6.1), i.e., plots of training error and validation error versus epoch number (time). While the training cost decreased continuously, the validation cost reached a minimum around epoch 20, then started to gradually increase, forming an "asymmetric U-shaped curve." Between epoch 100 and epoch 240, the training error was essentially flat, indicating convergence. Adapted from [78], p. 239. See Figure 60 (a, left), where the classical risk curve is the classical viewpoint, whereas the modern interpolation viewpoint is in the right subfigure (b). (Figure reproduced with permission of the authors.)
To develop a neural-network model, a dataset governed by the same probability distribution, such as
the CIFAR-10 dataset mentioned above, can be typically divided into three non-overlapping subsets called
training set, validation set, and test set. The validation set is also called the development set, a terminology
used in [55], in which an effective method of step-length decay was proposed; see Section 6.3.4.
It was suggested in [135], p. 61, to use 50% of the dataset as the training set, 25% as the validation set, and 25% as the test set. On the other hand, while a validation set with a size about 1/4 of the training set was suggested in [78], p. 118, there was no suggestion for the relative size of the test set (Footnote 110). See Figure 56 for a conceptual partition of the dataset.
Examples in the training set are fed into an optimizer to find the network parameter estimate $\widetilde{\theta}$ that minimizes the cost-function estimate $\widetilde{J}(\widetilde{\theta})$ (Footnote 111). As the optimization on the training set progresses from epoch
Footnote 110: Andrew Ng suggested the following partitions. For small datasets having fewer than 10^4 examples, the training/validation/test ratio of 60%/20%/20% could be used. For large datasets with on the order of 10^6 examples, use the ratio 98%/1%/1%. For datasets with much more than 10^6 examples, use the ratio 99.5%/0.25%/0.25%. See the Coursera course "Improving deep neural networks: Hyperparameter tuning, regularization and optimization", at time 4:00, video website.
Footnote 111: The word "estimate" is used here for the more general case of stochastic optimization with minibatches; see Section 6.3.1 on stochastic gradient descent and subsequent sections on stochastic algorithms. When deterministic optimization is used with the full batch of the dataset, then the cost estimate is the same as the cost, i.e., $\widetilde{J} \equiv J$, and the network parameter estimates are the same as the network parameters, i.e., $\widetilde{\theta} \equiv \theta$.
Figure 58: Validation learning curve (Section 6.1, Algorithm 4). Validation error vs epoch
number. Some validation error could oscillate wildly around the mean, resulting in an “ugly
reality”. The global minimum validation error corresponded to epoch number τ ⋆ . Since the
stopping criteria may miss this global minimum, it was suggested to monitor the validation
learning curve to find the epoch τ ⋆ at which the network parameters θeτ ⋆ would be declared
optimal. Adapted from [135], p. 55. (Figure reproduced with permission of the authors.)
to epoch (Footnote 112), examples in the validation set are fed as inputs into the network to obtain the outputs for computing the cost function $\widetilde{J}_{\text{val}}(\widetilde{\theta}_\tau)$, also called the validation error, at predetermined epochs $\{\tau_k\}$, using the network parameters $\{\widetilde{\theta}_{\tau_k}\}$ obtained from the optimization on the training set at those epochs.
Figure 57 shows the different behavior of the training error versus that of the validation error. The validation error would decrease quickly initially, reach a global minimum, then gradually increase, whereas the training error continued to decrease and plateaued out, indicating that the gradients became smaller and smaller, and there was not much decrease in the cost. From epoch 100 to epoch 240, the training error was at about the same level, with little noise. The validation error, on the other hand, had a lot of noise.
Because of the "asymmetric U-shaped curve" of the validation error, the thinking was that if the optimization process could stop early at the global minimum of the validation error, then the generalization (test) error, i.e., the value of the cost function on the test set, would also be small, hence the name "early stopping". The test set contains examples that have not been used to train the network, thus simulating inputs never seen before. The validation error could have oscillations of large amplitude around a mean curve, with many local minima; see Figure 58.
The difference between the test (generalization) error and the training error is called the generalization gap, as shown in the bias-variance trade-off [133] in Figure 59, which qualitatively delineates these errors versus model capacity, and conceptually explains the optimal model capacity as that where the generalization gap equals the training error, or the generalization error is twice the training error.
Remark 6.1. Even the best machine learning generalization capability nowadays still cannot compete with
the generalization ability of human babies; see Section 14.6 on “What’s new? Teaching machines to think
like babies”. ■
Early-stopping criteria. One criterion is to first define the lowest validation error, denoted by $\widetilde{J}^{\,\star}_{\text{val}}(\widetilde{\theta}_\tau)$, from epoch 1 up to the present epoch τ,
Footnote 112: An epoch is when all examples in the dataset have been used in a training session of the optimization process. For a formal definition of "epoch", see Section 6.3.1 on stochastic gradient descent (SGD) and Footnote 145.
Figure 59: Bias-variance trade-off (Section 6.1). Training error (cost) and test error versus model capacity. Two ways to change the model capacity: (1) change the number of network parameters, (2) change the values of these parameters (weight decay). The generalization gap is the difference between the test (generalization) error and the training error. As the model capacity increases from underfitting to overfitting, the training error decreases, but the generalization gap increases, past the optimal capacity. Figure 72 gives examples of underfitting, appropriate fitting, and overfitting. See [78], p. 112. The above is the classical viewpoint, which is still prevalent [136]; see Figure 60 for the modern viewpoint, in which an overfit high-capacity model generalizes well (small test error) in practice. (Figure reproduced with permission of the authors.)
and then define the generalization loss (in percentage) at epoch τ as the increase in validation error relative to the minimum validation error from epoch 1 to the present epoch τ:

$$G(\tau) = 100 \cdot \left( \frac{\widetilde{J}_{\text{val}}(\widetilde{\theta}_\tau)}{\widetilde{J}^{\,\star}_{\text{val}}(\widetilde{\theta}_\tau)} - 1 \right) . \qquad (117)$$

[135] then defined the "first class of stopping criteria" as follows: Stop the optimization on the training set when the generalization loss exceeds a certain threshold s (generalization-loss lower bound):

$$G_s : \ \text{Stop after epoch } \tau \ \text{if } G(\tau) > s . \qquad (118)$$
The issue is how to determine the generalization-loss lower bound s so as not to fall into a local minimum, and to catch the global minimum; see Figure 58. There were many more early-stopping criterion classes in [135]. But it is not clear whether all these increasingly sophisticated stopping criteria would work to catch the validation-error global minimum in Figure 58.
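A literal transcription of Eqs. (117)-(118) takes only a few lines (our sketch; the threshold value is arbitrary):

```python
def generalization_loss(val_errors):
    """G(tau) of Eq. (117), in percent; val_errors holds the validation
    errors from epoch 1 up to the present epoch tau."""
    return 100.0 * (val_errors[-1] / min(val_errors) - 1.0)

def stop_Gs(val_errors, s=5.0):
    """First-class stopping criterion G_s of Eq. (118): stop when G(tau) > s."""
    return generalization_loss(val_errors) > s

assert not stop_Gs([1.0, 0.8, 0.7])   # validation error still improving
assert stop_Gs([1.0, 0.8, 0.7, 0.9])  # G = 100*(0.9/0.7 - 1) ~ 28.6 > 5
```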
Moreover, the above discussion is for the classical regime in Figure 60 (a). In the context of the modern
interpolation regime in Figure 60 (b), early stopping means that the computation would cease as soon as the
training error reaches “its lowest possible value (typically zero [beyond the interpolation threshold], unless
two identical data points have two different labels)” [137]. See the green line in Figure 61.
Computational budget, learning curves. A simple method would be to set an epoch budget, i.e., the largest number of epochs of computation, sufficiently large for the training error to go down significantly, then monitor graphically both the training error (cost) and the validation error versus the epoch number. These plots are called the learning curves; see Figure 57, for which an epoch budget of 240 was used. Select the
Figure 60: Modern interpolation regime (Sections 6.1, 14.2). Beyond the interpolation threshold, the test error goes down as the model capacity (e.g., number of parameters) increases, describing the observation that networks with high capacity beyond the interpolation threshold generalize well, even though overfit in training. Risk = error or cost. Capacity = number of parameters (but could also be increased by weight decay). Figures 57 and 59 correspond to the classical regime, i.e., the old method (thinking) [136]. See Figure 61 for experimental evidence of the modern interpolation regime, and Figure 148 for a shallow network with infinite width. Permission of NAS.
global minimum of the validation learning curve, with epoch number τ⋆ (Figure 57), and use the corresponding network parameters $\widetilde{\theta}_{\tau^\star}$, which were saved periodically, as optimal parameters for the network (Footnote 113).
Remark 6.2. Since it is important to monitor the validation error during training, a whole section is devoted in [78] (Section 8.1, p. 268) to expounding on "How Learning Differs from Pure Optimization". Also for this reason, it is not yet clear what global optimization algorithms, such as in [138], could bring to network training, whereas the stochastic gradient descent (SGD) in Section 6.3.1 is quite efficient; see also Section 6.5.9 on criticism of adaptive methods. ■
Remark 6.3. Epoch budget, global iteration budget. For stochastic optimization algorithms—Sections 6.3,
6.5, 6.6, 6.7—the epoch counter is τ and the epoch budget τmax . Numerical experiments in Figure 73
had an epoch budget of τmax = 250, whereas numerical experiments in Figure 74 had an epoch budget of
τmax = 1800. The computational budget could be specified in terms of global iteration counter j as jmax .
Figure 71 had a global iteration budget of jmax = 5000. ■
Before presenting the stochastic gradient-descent (SGD) methods in Section 6.3, it is important to note
that classical deterministic methods of optimization in Section 6.2 continue to be useful in the age of deep
learning and SGD.
“One should not lose sight of the fact that [full] batch approaches possess some intrinsic ad-
vantages. First, the use full gradient information at each iterate opens the door for many deter-
ministic gradient-based optimization methods that have been developed over the past decades,
including not only the full gradient method, but also accelerated gradient, conjugate gradient,
quasi-Newton, inexact Newton methods, and can benefit from parallelization.” [80], p. 237.
Figure 61: Empirical test error vs. number of parameters (Sections 6.1, 14.2). Experiments using the MNIST handwritten-digit database in [137] confirmed the modern interpolation regime in Figure 60 [136]. Blue: Average over 20 runs. Green: Early stopping. Orange: Ensemble average over n = 20 samples, trained independently. See Figure 148 for a shallow network with infinite width. (Figure reproduced with permission of the authors.)
“Neural network researchers have long realized that the learning rate is reliably one of the
most difficult to set hyperparameters because it significantly affects model performance.” [78],
p. 298.
Footnote 114: See Figure 151 in Section 14.7 on "Lack of transparency and irreproducibility of results" in recent deep-learning papers.
In fact, this difficulty has long been well known in the field of optimization, where the learning rate is often mnemonically denoted by λ, the Greek “l” standing for “step length”; see, e.g., Polak (1971) [139].
“We can choose ϵ in several different ways. A popular approach is to set ϵ to a small constant.
Sometimes, we can solve for the step size that makes the directional derivative vanish. Another
approach is to evaluate f (x − ϵ∇x f (x)) for several values of ϵ and choose the one that results
in the smallest objective function value. This last strategy is called a line search.” [78], p. 82.
Choosing an arbitrarily small ϵ, without guidance on how small is small, is not a good approach, since
exceedingly slow convergence could result for too small ϵ. In general, it would not be possible to solve
for the step size to make the directional derivative vanish. There are several variants of line search for
computing the step size to decrease the cost function, based on the “decrease” conditions,115 among which
some are mentioned below.116
Remark 6.4. Line search in deep-learning training. Line search methods are not only important for use in
deterministic optimization with full batch of examples,117 but also in stochastic optimization (see Section 6.3)
with random mini-batches of examples [80]. The difficulty of using stochastic gradient coming from random
mini-batches is the presence of noise or “discontinuities”118 in the cost function and in the gradient. Recent
stochastic optimization methods—such as the sub-sampled Hessian-free Newton method reviewed in [80],
the probabilistic line search in [143], the first-order stochastic Armijo line search in [144], the second-order sub-sampling line search method in [145], the quasi-Newton method with probabilistic line search in [146],
etc.—where line search forms a key subprocedure, are designed to address or circumvent the noisy gradient
problem. For this reason, claims that line search methods have “fallen out of favor”119 would be misleading,
as they may encourage students not to learn the classics. A classic never dies; it just re-emerges in a different
form with additional developments to tackle new problems. ■
In view of Remark 6.4, a goal of this section is to develop a feel for some classical deterministic
line search methods for readers not familiar with these concepts to prepare for reading extensions of these
methods to stochastic line search methods.
is negative, i.e., the descent direction d and the gradient g form an obtuse angle bounded away from 90°,120 and

J(θ + ϵd) = min_λ { J(θ + λd) | λ ≥ 0 } . (122)
The minimization problem in Eq. (122) can be implemented using the Golden-section search (or infinite Fibonacci search) for unimodal functions.121 For more general non-convex cost functions, a minimizing step length may be non-existent, or difficult to compute exactly.122 In addition, a line search for a minimizing step length is only an auxiliary step in an overall optimization algorithm. It is therefore sufficient to find an approximate step length satisfying some decrease conditions to ensure convergence to a local minimum, while keeping the step length from being so small that it would hinder a reasonable advance toward such a local minimum. For these reasons, inexact line-search methods (rules) were introduced, first in [150], followed
by [151], then [152] and [153]. In view of Remark 6.4 and Footnote 119, as we present these deterministic
line-search rules, we will also immediately recall, where applicable, the recent references that generalize
these rules by adding stochasticity for use as a subprocedure (inner loop) for the stochastic gradient-descent
(SGD) algorithm.
A reason could be that the sector bounded by the two lines (1 − α)ϵ g • d and αϵ g • d may be too narrow when α is close to 0.5 from below, making (1 − α) also close to 0.5 from above. For example, it was recommended in [139], p. 33 and p. 37, to use α = 0.4, and hence 1 − α = 0.6, making a tight sector, but we could enlarge such a sector by choosing 0.6 < β < 1.
Figure 62: Inexact line search, Goldstein’s rule (Section 6.2.4). Acceptable step lengths would be such that a decrease in the cost function J, denoted by ∆J in Eq. (124), falls into an acceptable sector formed by an upper-bound line and a lower-bound line. The upper bound is given by the straight line αϵ g • d (green), with fixed constant α ∈ (0, 1/2), and ϵ g • d (with g • d < 0) being the tangent line to the curve ∆J(ϵ) at ϵ = 0. The lower-bound line (1 − α)ϵ g • d (black), adopted in [150] and [155], would be too narrow when α is close to 1/2, leaving all local minimizers such as ϵ⋆1 and ϵ⋆2 outside of the acceptable intervals I1^[1−α], I2^[1−α], and I3^[1−α] (black), which are themselves narrow. The lower-bound line βϵ g • d (purple), proposed in [156], p. 256, and [149], p. 55, with (1 − α) < β < 1, would enlarge the acceptable sector, which then may contain the minimizers inside the corresponding acceptable intervals I1^[β] and I2^[β] (purple).
The search for an appropriate step length that satisfies Eq. (123) or Eq. (124) could be carried out by a subprocedure based on, e.g., the bisection method, as suggested in [139], p. 33. Goldstein’s rule—also designated as the Goldstein principle in the classic book [156], p. 256, since it ensured a decrease in the cost function—has been “used only occasionally” per Polak (1997) [149], p. 55, largely superseded by Armijo’s rule, and has not been generalized to add stochasticity. On the other hand, the idea behind Armijo’s rule is similar to Goldstein’s rule, but with a convenient subprocedure126 to find the appropriate step length.
gradient-descent algorithm described in Section 6.3: Stochasticity was added to Armijo’s rule in [144], and
the concept was extended to second-order line search [145]. Line search based on Armijo’s rule is also
applied to quasi-Newton method for noisy functions in [158], and to exact and inexact subsampled Newton
methods in [159].128
Armijo’s rule is stated as follows: For α ∈ (0, 1), β ∈ (0, 1), and ρ > 0, use the step length ϵ such that:129

ϵ(θ) = min_a { β^a ρ | J(θ + β^a ρ d) − J(θ) ≤ α β^a ρ g • d } (125)
where the decrease in the cost function along the descent direction d, denoted by ∆J, was defined in
Eq. (124), and the descent direction d is related to the gradient g via Eq. (121). The Armijo condition in
Eq. (125) can be rewritten as
J(θ + ϵ d) ≤ J(θ) + αϵ g • d , (126)
which is also known as the Armijo sufficient decrease condition, the first of the two Wolfe conditions pre-
sented below; see [152], [149], p. 55.130
Regarding the parameters α, β, and ρ in Armijo’s rule Eq. (125), [151] selected to fix

α = β = 1/2 , and ρ ∈ (0, +∞) , (127)
and proved a convergence theorem. In practice, ρ cannot be arbitrarily large. Polak (1971) [139], p. 36, also fixed α = 1/2, but recommended to select β ∈ (0.5, 0.8), based on numerical experiments,131 and to select132 ρ = 1 to minimize the rate r of the geometric progression (from the iterate θ_i, for i = 0, 1, 2, . . ., toward the local minimizer θ⋆) for linear convergence:133

|J(θ_{i+1}) − J(θ⋆)| ≤ r^i |J(θ_0) − J(θ⋆)| , with r = 1 − ρm/(2M) , (128)

where m and M are the lower and upper bounds of the eigenvalues134 of the Hessian ∇²J, thus m/M < 1. In summary, [139] recommended:

α = 1/2 , β ∈ (0.5, 0.8) , and ρ = 1 . (129)
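A minimal Python sketch of Armijo’s rule Eq. (125) follows, using α = 1/2 and ρ = 1 from the recommendations in Eq. (129); the value β = 0.65 is an assumed choice inside the recommended interval (0.5, 0.8), and the cap a_max on the number of step reductions is also an assumption. The cost function J, parameters theta, gradient g, and descent direction d are supplied by the caller:

```python
import numpy as np

def armijo_step_length(J, theta, g, d, alpha=0.5, beta=0.65, rho=1.0, a_max=30):
    """Backtracking line search per Armijo's rule, Eq. (125): try the step
    lengths rho, beta*rho, beta^2*rho, ... and accept the first one that
    satisfies the sufficient-decrease condition Eq. (126)."""
    J0 = J(theta)
    slope = np.dot(g, d)               # g . d < 0 for a descent direction
    eps = rho
    for _ in range(a_max):
        if J(theta + eps * d) - J0 <= alpha * eps * slope:
            return eps                 # sufficient decrease achieved
        eps *= beta                    # otherwise shrink the trial step
    return eps                         # fallback after a_max reductions

# Usage on the quadratic bowl J(theta) = theta . theta, with d = -g:
J = lambda th: np.dot(th, th)
theta = np.array([1.0, 2.0])
g = 2.0 * theta
print(armijo_step_length(J, theta, g, -g))
```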
The pseudocode for deterministic gradient descent with Armijo line search is Algorithm 2, and the
pseudocode for deterministic quasi-Newton / Newton with Armijo line search is Algorithm 3. When the
Hessian H(θ) = ∂²J(θ)/(∂θ)² is positive definite, the Newton descent direction is:

d = −H^{−1}(θ) g , (130)
Armijo, but without referring to the original paper [151], such as [157], clearly indicating that Armijo’s rule is a
classic, just like there is no need to refer to Newton’s original work for Newton’s method.
128
All of these stochastic optimization methods are considered as part of a broader class known as derivative-free
optimization methods [160].
129
[156], p. 491, called the constructive technique in Eq. (125) to obtain the step length ϵ the Goldstein-Armijo algo-
rithm, since [150] and [155] did not propose a method to solve for the step length, while [151] did. See also below
Eq. (124) where it was mentioned that a bisection method can be used with Goldstein’s rule.
130
See also [157], p. 34, [148], p. 230.
131
See [139], p. 301.
132
To satisfy the condition in Eq. (121), the descent direction d is required to satisfy (−g)• d ≥ ρ ∥ d ∥∥ g ∥, with
ρ > 0, so to form an obtuse angle, bounded away from 90◦ , with the gradient g. But ρ > 1 would violate the
Schwarz inequality, which requires |g • d| ≤∥ d ∥∥ g ∥.
133
The inequality in Eq. (128) leads to linear convergence in the sense that |J(θi+1 ) − J(θ ⋆ )| ≤ r|J(θi ) − J(θ ⋆ )|, for
i = 0, 1, 2, . . ., with |J(θ0 ) − J(θ ⋆ )| being a constant. See [139], p. 245.
134
A narrow valley with the minimizer θ ⋆ at the bottom would have a very small ratio m/M . See also the use of
“small heavy sphere” (also known as “heavy ball”) method to accelerate convergence in the case of narrow valley
in Section 6.3.2 on stochastic gradient descent with momentum.
When the Hessian H(θ) is not positive definite, e.g., near a saddle point, then quasi-Newton method uses
the gradient descent direction (−g = −∂J/∂θ) as the descent direction d as in Eq. (120),
d = −g = −∂J(θ)/∂θ , (131)
and regularized Newton method uses a descent direction based on a regularized Hessian of the form:
d = −[H(θ) + δI]^{−1} g , (132)
where δ is a small perturbation parameter (line 16 in Algorithm 3 for deterministic Newton and line 17 in
Algorithm 7 for stochastic Newton).135
∂J(θ + ϵd)/∂θ • d ≥ β g • d . (134)
The first Wolfe rule, Eq. (133), is the same as Armijo’s rule in Eq. (126), which ensures that at the updated point (θ + ϵd) the cost-function value J(θ + ϵd) is below the green line αϵ g • d in Figure 62. The second Wolfe rule, Eq. (134), ensures that at the updated point (θ + ϵd) the slope of the cost function does not fall below the (negative) slope β g • d of the purple line βϵ g • d in Figure 62.
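The two Wolfe conditions can be checked with a few lines of Python; the tolerances α = 10⁻⁴ and β = 0.9 below are commonly used textbook values, assumed here for illustration rather than taken from the references above:

```python
import numpy as np

def wolfe_conditions(J, grad_J, theta, d, eps, alpha=1e-4, beta=0.9):
    """Check, for a trial step length eps, the sufficient-decrease (Armijo)
    condition Eq. (133) and the curvature condition Eq. (134)."""
    slope0 = np.dot(grad_J(theta), d)    # negative for a descent direction
    sufficient_decrease = J(theta + eps * d) <= J(theta) + alpha * eps * slope0
    curvature = np.dot(grad_J(theta + eps * d), d) >= beta * slope0
    return sufficient_decrease, curvature
```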
For other variants of line search, we refer to [161].
“The learning rate may be chosen by trial and error. This is more of an art than a science, and
most guidance on this subject should be regarded with some skepticism.” [78], p. 287.
At the time of this writing, we are aware of two review papers on optimization algorithms for machine
learning, and in particular deep learning, aiming particularly at experts in the field: [80], as mentioned
135
See, e.g., [149], p. 35, and [78], p. 302, where both cited the Levenberg-Marquardt algorithm as the first to use
regularized Hessian.
136
As of 2022.07.09, [152] was cited 1336 times in various publications (books, papers) according to Google Scholar,
and 559 times in archival journal papers according to Web of Science.
137
The authors of [140] and [141] may not be aware that Goldstein’s rule appeared before Armijo’s rule, as they cited
Goldstein’s 1967 book [154], instead of Goldstein’s 1965 paper [150], and referred often to Polak (1971) [139],
even though it was written in [139], p. 32, that a “step size rule [Eq. (124)] probably first introduced by Goldstein
(1967) [154]” was used in an algorithm. See also Footnote 123.
138
An earlier version of the 2017 paper [143] is the 2015 preprint [147].
139
See [78], p. 271, about this terminology confusion. The authors of [80] used “stochastic” optimization to mean
optimization using random “minibatches” of examples, and “batch” optimization to mean optimization using “full
batch” or full training set of examples.
above, and [162]. Our review complements these two review papers. We aim here at bringing first-time learners up to speed to benefit from, and even hopefully enjoy, reading these and other related papers. To this end, we deliberately avoid the dense mathematical-programming language used in [80], which is not familiar to readers outside the field, while providing more details than [162] on algorithms that have proved important in deep learning.
Listed below are the points that distinguish the present paper from other reviews. Similar to [78], both
[80] and [162]:
• Only mentioned briefly, in words, the connection of SGD with momentum to mechanics, without a detailed explanation using the equation of motion of the “heavy ball”, a name not as accurate as the original name “small heavy sphere” by Polyak (1964) [3]. These references also did not explain how such motion helps to accelerate convergence; see Section 6.3.2.
• Did not discuss recent practical add-on improvements to SGD such as step-length tuning (Sec-
tion 6.3.3) and step-length decay (Section 6.3.4), as proposed in [55]. This information would be
useful for first-time learners.
• Did not connect step-length decay to simulated annealing, and did not explain the reason for using the
name “annealing”140 in deep learning by connecting to stochastic differential equation and physics;
see Remark 6.9 in Section 6.3.5.
• Did not review an alternative to step-length decay by increasing minibatch size, which could be more
efficient, as proposed in [164]; see Section 6.3.5.
• Did not point out that the exponential smoothing method (or running average) used in adaptive
learning-rate algorithms dated since the 1950s in the field of forecasting. None of these references
acknowledged the contributions made in [165] and [166], in which exponential smoothing from time
series in forecasting was probably first brought to machine learning. See Section 6.5.3.
• Did not discuss recent adaptive learning-rate algorithms such as AdamW [56].141 These authors also
did not discuss the criticism of adaptive methods in [55]; see Section 6.5.10.
• Did not discuss classical line-search rules—such as [150], [151],142 [152] (Sections 6.2.2, 6.2.3,
6.2.4)—that have been recently generalized to add stochasticity, e.g., [143], [144], [145]; see Sec-
tions 6.6, 6.7.
“Nearly all of deep learning is powered by one very important algorithm: stochastic gradient
descent (SGD). Stochastic gradient descent is an extension of the gradient descent algorithm.”
[78], p. 147.
140
The authors of [162] only cited [163] for a brief mention of “simulated annealing” as an example of “heuristic
optimizers”, with no discussion, and no connection to step length decay. See also Remark 6.10 on “Metaheuristics”.
141
The authors of [162] only cited [56] in passing, without reviewing AdamW, which was not even mentioned.
142
The authors of [80] only cited Armijo (1966) [151] once for a pseudocode using line search.
143
See, e.g., [80]—in which there was a short bio of Robbins, the first author of [167]—and [162] [144].
Minibatch. The number M of examples in a training set X could be very large, rendering it prohibitively expensive to evaluate the cost function and to compute its gradient with respect to the parameters, whose number by itself could also be very large. At iteration k within a training session τ, let I_k^{|m|} be a randomly selected set of m indices, which are elements of the training-set indices [M] = {1, . . . , M}. Typically, m is much smaller than M:144
“The minibatch size m is typically chosen to be a relatively small number of examples, ranging
from one to a few hundred. Crucially, m is usually held fixed as the training set size M grows.
We may fit a training set with billions of examples using updates computed on only a hundred
examples.” [78], p. 148.
Generated as in Eq. (136), the random-index sets I_k^{|m|}, for k = 1, 2, . . ., are non-overlapping, such that after k_max = M/m iterations, all examples in the training set are covered, and a training session, or training epoch,145 is completed (line 6 in Algorithm 4). At iteration k of a training epoch τ, the random minibatch B_k^{|m|} is a set of m examples pulled out from the much larger training set X using the random indices in I_k^{|m|}, with the corresponding targets in the set T_k^{|m|}:
144
As of 2010.04.30, the ImageNet database contained more than 14 million images; see Original website, Internet
archive, Figure 3 and Footnote 14. There is a slight inconsistency in notation in [78], where on p. 148, m and m′
denote the number of examples in the training set and in the minibatch, respectively, whereas on p. 274, m denote
the number of examples in a minibatch. In our notation, m is the dimension of the output array y, whereas m (in a
different font) is the minibatch size; see Footnote 88. In theory, we write m ≤ M in Eq. (136); in practice, m ≪ M.
145
An epoch, or training session, τ is explicitly defined here as when the minibatches as generated in Eqs. (135)-(137)
covered the whole dataset. In [78], the first time the word “epoch” appeared was in Figure 7.3 caption, p. 239, where
it was defined as a “training iteration”, but there was no explicit definition of “epoch” (when it started and when it
ended), except indirectly as a “training pass through the dataset”, p. 274. See Figure 151 in Section 14.7 on “Lack
of transparency and irreproducibility of results” in recent deep-learning papers.
where we wrote the random index as i_a instead of i_{a,k} as in Eq. (136) to alleviate the notation. The corresponding gradient estimate is:

g̃(θ) = ∂J̃(θ)/∂θ = (1/m) Σ_{a=1}^{a=m≤M} ∂J_{i_a}(θ)/∂θ = (1/m) Σ_{a=1}^{a=m≤M} g_{i_a} . (140)
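In code, the non-overlapping random index sets and the minibatch gradient estimate of Eq. (140) can be sketched as follows; grad_per_example is a hypothetical function, assumed to return the gradient g_i of the per-example cost J_i:

```python
import numpy as np

def epoch_minibatch_indices(M, m, rng):
    """Non-overlapping random index sets I_k covering the training set:
    shuffle {0, ..., M-1} once per epoch, then slice into k_max = M // m batches."""
    perm = rng.permutation(M)
    return [perm[k * m:(k + 1) * m] for k in range(M // m)]

def minibatch_gradient(grad_per_example, theta, batch):
    """Gradient estimate of Eq. (140): average of per-example gradients over the minibatch."""
    return np.mean([grad_per_example(theta, i) for i in batch], axis=0)

# Usage: M = 1000 examples, minibatches of m = 50, hence k_max = 20 index sets:
rng = np.random.default_rng(0)
batches = epoch_minibatch_indices(1000, 50, rng)
```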
The pseudocode for the standard SGD is given in Algorithm 4. The epoch stopping criterion (line 1 in Algorithm 4) is usually determined by a computation “budget”,146 i.e., the maximum number of epochs allowed. For example, [145] set a budget of 1,600 epochs maximum in their numerical examples.
Problems and resurgence of SGD. There are several known problems with SGD:
“Despite the prevalent use of SGD, it has known challenges and inefficiencies. First, the direc-
tion may not represent a descent direction, and second, the method is sensitive to the step-size
(learning rate) which is often poorly overestimated.” [144]
For the above reasons, it may not be appropriate to use the norm of the gradient estimate being small as a stationarity condition, i.e., an indication of where the local minimizer or saddle point is located; see the discussion in [145] and the stochastic Newton Algorithm 7 in Section 6.7.
Despite the above problems, SGD has been brought back to the forefront as the state-of-the-art algorithm to beat, surpassing the performance of adaptive methods, as confirmed by three recent papers: [55], [168], [56]; see Section 6.5.9 on criticism of adaptive methods.
Add-on tricks to improve SGD. The following tricks can be added onto the vanilla (standard) SGD to
improve its performance; see also the pseudocode in Algorithm 4:
• Momentum and accelerated gradient: Improve (accelerate) convergence in narrow valleys, Sec-
tion 6.3.2
• Step-length decaying or annealing: Find an effective learning-rate schedule147 to decrease the step
length ϵ as a function of epoch counter τ or global iteration counter j, cyclic annealing, Section 6.3.4
• Minibatch-size increase, keeping step length fixed, equivalent annealing, Section 6.3.5
Figure 63: SGD with momentum, small heavy sphere (Section 6.3.2). The descent direction
(negative gradient, black arrows) bounces back and forth between the steep slopes of a deep
and narrow valley. The small-heavy-sphere method, or SGD with momentum, follows a faster
descent (red path) toward the bottom of the valley. See the cost-function landscape with deep
valleys in Figure 55. Figure from [78], p. 289. (Figure reproduced with permission of the authors.)
• SGD with classical momentum: γk = 0 and ζk ∈ (0, 1) (“small heavy sphere” or heavy point mass)148
[3]
• SGD with fast (accelerated) gradient:149 γk = ζk ∈ (0, 1), Nesterov (1983 [50], 2018 [51])
The continuous counterpart of the parameter update Eq. (141) with classical momentum, i.e., when γ_k = 0 and ζ_k ∈ (0, 1), is the equation of motion of a heavy point mass (thus no rotatory inertia) under viscous friction at slow motion (proportional to velocity) and applied force −g̃, given below with its discretization by finite difference in time, where h_k and h_{k−1} are the time-step sizes [169]:

d²θ̃/(dt)² + ν dθ̃/dt = −g̃ ⇒ [ (θ̃_{k+1} − θ̃_k)/h_k − (θ̃_k − θ̃_{k−1})/h_{k−1} ] / h_k + ν (θ̃_{k+1} − θ̃_k)/h_k = −g̃_k , (142)

θ̃_{k+1} − θ̃_k − ζ_k (θ̃_k − θ̃_{k−1}) = −ϵ_k g̃_k , with ζ_k = (h_{k−1}/h_k) · 1/(1 + νh_k) and ϵ_k = (h_k)²/(1 + νh_k) , (143)

which is the same as the update Eq. (141) with γ_k = 0. The term ζ_k(θ̃_k − θ̃_{k−1}) is often called the “momentum” term since it is proportional to the (discretized) velocity. [3] on the other hand explained the term ζ_k(θ̃_k − θ̃_{k−1}) as “giving inertia to the motion, [leading] to motion along the ‘essential’ direction, i.e. along ‘the bottom of the trough’ ”, and recommended to select ζ_k ∈ (0.8, 0.99), i.e., close to 1, without explanation. The reason is to have low friction, i.e., ν small, but not zero friction (ν = 0), since friction is important to slow down the motion of the sphere up and down the valley sides (like skateboarding from side to side in a half-pipe), thus accelerating convergence toward the trough of the valley; from Eq. (143), we have

h_k = h_{k−1} = h and ν ∈ [0, +∞) ⇒ ζ_k ∈ (0, 1] , with ν = 0 ⇒ ζ_k = 1 . (144)
148
Often called by the more colloquial “heavy ball” method; see Remark 6.6.
149
Sometimes referred to as Nesterov’s Accelerated Gradient (NAG) in the deep-learning literature.
Remark 6.5. The choice of the momentum parameter ζ in Eq. (141) is not trivial. If ζ is too small, the
signal will be too noisy; if ζ is too large, “the average will lag too far behind the (drifting) signal” [165],
p. 212. Even though Polyak (1964) [3] recommended to select ζ ∈ (0.8, 0.99), as explained above, it was
reported in [78], p. 290: “Common values of ζ used in practice include 0.5, 0.9, and 0.99. Like the learning
rate, ζ may also be adapted over time. Typically it begins with a small value and is later raised. Adapting ζ
over time is less important than shrinking ϵ over time”. The value of ζ = 0.5 would correspond to relatively high friction ν, slowing down the motion of the sphere, compared to ζ = 0.99.
Figure 68 from [170] shows the convergence of some adaptive learning-rate algorithms: AdaGrad,
RMSProp, SGDNesterov (accelerated gradient), AdaDelta, Adam.
In their remarkable paper, the authors of [55] used a constant momentum parameter ζ = 0.9; see crit-
icism of adaptive methods in Section 6.5 and Figure 73 comparing SGD, SGD with momentum, AdaGrad,
RMSProp, Adam.150
See Figure 151 in Section 14.7 on “Lack of transparency and irreproducibility of results” in recent
deep-learning papers. ■
For more insight into the update Eq. (143), consider the case of constant coefficients ζ_k = ζ and ϵ_k = ϵ, and rewrite this recursive relation in the form:

θ̃_{k+1} − θ̃_k = −ϵ Σ_{i=0}^{k} ζ^i g̃_{k−i} , using θ̃_1 − θ̃_0 = −ϵ g̃_0 , (145)
i.e., without momentum for the first term. So the effective gradient is the sum of all gradients from the beginning i = 0 until the present i = k, weighted by the exponential function ζ^i, so there is a fading-memory effect, i.e., gradients that are farther back in time have less influence than those closer to the present time.151 The summation term in Eq. (145) also provides an explanation of how the “inertia” (or momentum) term works: (1) two successive opposite gradients would cancel each other, whereas (2) two successive gradients in the same direction (toward the trough of the valley) would reinforce each other. See also [171], pp. 104-105, and [172], who provided a similar explanation:
“Momentum is a simple method for increasing the speed of learning when the objective func-
tion contains long, narrow and fairly straight ravines with a gentle but consistent gradient along
the floor of the ravine and much steeper gradients up the sides of the ravine. The momentum
method simulates a heavy ball rolling down a surface. The ball builds up velocity along the
floor of the ravine, but not across the ravine because the opposing gradients on opposite sides
of the ravine cancel each other out over time.”
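A minimal sketch of the update Eq. (141) with classical momentum (γ_k = 0), applied to an assumed narrow-valley quadratic cost for illustration:

```python
import numpy as np

def sgd_momentum_step(theta, theta_prev, grad, eps=0.01, zeta=0.9):
    """One update of SGD with classical momentum ('small heavy sphere'),
    Eq. (141) with gamma_k = 0: the momentum term zeta*(theta - theta_prev)
    is the discretized velocity of the heavy point mass."""
    return theta - eps * grad(theta) + zeta * (theta - theta_prev), theta

# Narrow valley J = 0.5*(x^2 + 50*y^2): opposing gradient components across
# the valley cancel out over time, while those along the floor reinforce.
grad = lambda th: np.array([th[0], 50.0 * th[1]])
theta = theta_prev = np.array([5.0, 1.0])
for _ in range(200):
    theta, theta_prev = sgd_momentum_step(theta, theta_prev, grad)
print(theta)   # approaches the minimizer at the origin
```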
In recent years, Polyak (1964) [3] (English version)152 has often been cited for the classical momentum
(“small heavy sphere”) method to accelerate the convergence in gradient descent, but not so before, e.g.,
the authors of [22] [175] [176] [177] [172] used the same method without citing [3]. Several books on
optimization not related to neural networks, many of them well-known, also did not mention this method:
[139] [178] [149] [157] [148] [179]. Both the original Russian version and the English translated version
150
A nice animation of various optimizers (SGD, SGD with momentum, AdaGrad, AdaDelta, RMSProp) can be found
in S. Ruder, ‘An overview of gradient descent optimization algorithms’, updated on 2018.09.02 (Original website).
151
See also Section 6.5.3 on time series and exponential smoothing.
152
Polyak (1964) [3]’s English version appeared before 1979, as cited in [173], where a similar classical dynamics of
a “small heavy sphere” or heavy point mass was used to develop an iterative method to solve nonlinear systems.
There, the name Polyak was spelled as “Poljak” as in the Russian version. The earliest citing of the Russian version,
with the spelling “Poljak” was in [156] and in [174], but the terminology “small heavy sphere” was not used. See
also [171], p. 104 and p. 481, where the Russian version of [3] was cited.
[3] (whose author’s name was spelled as “Poljak” before 1990) were cited in the book on neural networks
[180], in which another neural-network book [171] was referred to for a discussion of the formulation.153
Remark 6.6. Small heavy sphere, or heavy point mass, is a better name. Because the rotatory motion is not
considered in Eq. (142), the name “small heavy sphere” given in [3] is more precise than the more colloquial
name “heavy ball” often given to the SGD with classical momentum,154 since “small” implies that rotatory
motion was neglected, and a “heavy ball” could be as big as a bowling ball155 for which rotatory motion
cannot be neglected. For this reason, “heavy point mass” would be a precise alternative name. ■
Remark 6.7. For Nesterov’s fast (accelerated) gradient method, many references referred to [50].156 The
authors of [78], p. 291, also referred to Nesterov’s 2004 monograph, which was mentioned in the Preface
of, and the material of which was included in, [51]. For a special class of strongly convex functions,157 the
step length can be kept constant, while the coefficients in Nesterov’s fast gradient method varied, to achieve
optimal performance, [51], p. 92. “Unfortunately, in the stochastic gradient case, Nesterov momentum does
not improve the rate of convergence” [78], p. 292. ■
“To tune the step sizes, we evaluated a logarithmically-spaced grid of five step sizes. If the
best performance was ever at one of the extremes of the grid, we would try new grid points so
that the best performance was contained in the middle of the parameters. For example, if we
initially tried step sizes 2, 1, 0.5, 0.25, and 0.125 and found that 2 was the best performing, we
would have tried the step size 4 to see if performance was improved. If performance improved,
we would have tried 8 and so on.”
The above logarithmically-spaced grid was given by 2^k, with k = 1, 0, −1, −2, −3. This tuning method appears effective, as shown in Figure 73 on the CIFAR-10 dataset mentioned above, for which the following values of ϵ_0 had been tried for different optimizers (see the sketch after this list), even though the values did not always belong to the sequence {2^k}, but could include close, rounded values:
• SGD with momentum (Section 6.3.2): 2, 1, 0.5 (best), 0.25, 0.05, 0.01
• AdaGrad (Section 6.5): 0.1, 0.05, 0.01 (best, default), 0.0075, 0.005
153
See [180], p. 159, p. 115, and [171], p. 104, respectively. The name “Polyak” was spelled as “Poljak” before 1990,
[171], p. 481, and sometimes as “Polyack”, [169]. See also [181].
154
See, e.g., [171], p. 104, [180], p. 115, [169], [181].
155
Or the “Times Square Ball”, Wikipedia, version 05:17, 29 December 2019.
156
Reference [50] cannot be found from the Web of Science as of 2020.03.18, perhaps because it was in Russian, as
indicated in Ref. [35] in [51], p. 582, where Nesterov’s 2004 monograph was Ref. [39].
157
A function f (·) is strongly convex if there is a constant µ > 0 such that for any two points x and y, we have
f (y) ≥ f (x) + ⟨∇f (x), y − x⟩ + 21 µ ∥ y − x ∥2 , where ⟨·, ·⟩ is the inner (or dot) product, [51], p. 74.
158
The last two values {0.05, 0.01} did not belong to the sequence 2k , with k being integers, since 2−3 = 0.125,
2−4 = 0.0625 and 2−5 = 0.03125.
• Adam (Section 6.5): 0.005, 0.001 (default), 0.0005, 0.0003 (best), 0.0001, 0.00005
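A sketch of this expanding logarithmic grid search follows; the scoring function validation_accuracy(eps0) and the cap on the number of grid extensions are assumptions for illustration:

```python
def tune_initial_step_length(validation_accuracy, grid=(2.0, 1.0, 0.5, 0.25, 0.125),
                             max_extensions=10):
    """Expanding log-spaced grid search: if the best-performing step length
    sits at an edge of the grid, extend the grid in that direction (doubling
    upward, halving downward) until the best value is interior."""
    scores = {eps: validation_accuracy(eps) for eps in grid}
    for _ in range(max_extensions):
        best = max(scores, key=scores.get)
        if best == max(scores):                 # best at the large-step edge
            scores[2.0 * best] = validation_accuracy(2.0 * best)
        elif best == min(scores):               # best at the small-step edge
            scores[0.5 * best] = validation_accuracy(0.5 * best)
        else:
            break                               # best is interior: done
    return max(scores, key=scores.get)
```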
Cyclic annealing. In addition to decaying the step length ϵ, which is already annealing, cyclic annealing is introduced to further reduce the step length down to zero (“cooling”), quicker than decaying, then to bring the step length back up rapidly (“heating”), and to do so for several cycles. The cosine function is typically used, such as shown in Figure 75, as a multiplicative factor a_k ∈ [0, 1] on the step length ϵ_k in the parameter update, and thus the name “cosine annealing”:
θ̃_{k+1} = θ̃_k − a_k ϵ_k g̃_k , (152)

as an add-on to the parameter update for vanilla SGD Eq. (120), or

θ̃_{k+1} = θ̃_k − a_k ϵ_k g̃(θ̃_k + γ_k(θ̃_k − θ̃_{k−1})) + ζ_k(θ̃_k − θ̃_{k−1}) (153)

as an add-on to the parameter update for SGD with momentum and accelerated gradient Eq. (141). The cosine annealing factor can take the form [56]:

a_k = 0.5 + 0.5 cos(π T_cur / T_p) ∈ [0, 1] , with T_cur := j − Σ_{q=1}^{q=p−1} T_q , (154)

where T_cur is the number of epochs from the start of the last warm restart at the end of epoch Σ_{q=1}^{q=p−1} T_q, where a_k = 1 (“maximum heating”), j the current global iteration counter, and T_p the maximum number of epochs allowed for the current pth annealing cycle, during which T_cur would go from 0 to T_p, when a_k = 0 (“complete cooling”). Figure 75 shows 4 annealing cycles, which helped dramatically reduce the number of
epochs needed to achieve the same lower cost as obtained without annealing.
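A sketch of the cosine-annealing factor of Eq. (154) with warm restarts follows; the four cycle budgets below are assumed values for illustration, not those of Figure 75:

```python
import math

def cosine_annealing_factor(j, cycle_budgets):
    """Annealing factor a_k of Eq. (154): a_k = 1 at each warm restart
    ('maximum heating'), decaying to a_k = 0 at the end of the current
    cycle of budget T_p ('complete cooling')."""
    start = 0
    for T_p in cycle_budgets:
        if j < start + T_p:
            T_cur = j - start            # counter since the last warm restart
            return 0.5 + 0.5 * math.cos(math.pi * T_cur / T_p)
        start += T_p
    return 0.0                           # all cycles exhausted: fully cooled

# Four annealing cycles (assumed budgets):
factors = [cosine_annealing_factor(j, [50, 100, 200, 400]) for j in range(750)]
```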
Figure 74 shows the effectiveness of cosine annealing in bringing down the cost rapidly in the early stage, but there is a diminishing return: the cost reduction decreases with the number of annealing cycles, and beyond a certain point it is no longer as effective as SGD with weight decay in Section 6.3.6.
Convergence conditions. The sufficient conditions for convergence, for convex functions, are162

Σ_{j=1}^{∞} (ϵ_j)² < ∞ , and Σ_{j=1}^{∞} ϵ_j = ∞ . (155)
The inequality on the left of Eq. (155), i.e., the sum of the squares of the step lengths being finite, ensures that the step length decays quickly enough to reach the minimum, but is valid only when the minibatch size is fixed. The equation on the right of Eq. (155) ensures convergence, no matter how far the initial guess is from the minimum [164].
In Section 6.3.5, the step-length decay is shown to be equivalent to minibatch-size increase and sim-
ulated annealing in the sense that there would be less fluctuation, and thus lower “temperature” (cooling)
by analogy to the physics governed by the Langevin stochastic differential equation and its discrete version,
which is analogous to the network parameter update.
162
The conditions in Eq. (155) are called the “stepsize requirements” in [80], and the “sufficient condition for convergence” in [78], p. 287, and in [184]. Robbins & Monro (1951b) [49] were concerned with solving M(x) = α when the function M(·) is not known, but the distribution of the output, as a random variable, y = y(x), is assumed known. For the network-training problem at hand, one can think of M(x) = ∥∇J(x)∥, i.e., the magnitude of the gradient of the cost function J at x = ∥θ̃ − θ∥, the distance from a local minimizer, and α = 0, i.e., the stationarity point of J(·). In [49]—in which there was no notion of “epoch τ” but only the global iteration counter j—Eq. (6) on p. 401 corresponds to Eq. (155)₁ (first part), and Eq. (27) on p. 404 corresponds to Eq. (155)₂ (second part). Any sequence {ϵ_k} that satisfied Eq. (155) was called a sequence of type 1/k; the convergence Theorem 1 on p. 404 and Theorem 2 on p. 405 indicated that the sequence of step lengths {ϵ_k} being of type 1/k was only one among other sufficient conditions for convergence. In Theorem 2 of [49], the additional sufficient conditions were Eq. (33), M(x) = ∥∇J(x)∥ nondecreasing, Eq. (34), M(0) = ∥∇J(0)∥ = 0, and Eq. (35), M′(0) = ∥∇²J(0)∥ > 0, i.e., the iterates {x_k | k = 1, 2, . . .} fell into a local convex bowl.
Next, the mean value of the “square” of the gradient error, i.e., ⟨eᵀe⟩, in which we omitted the iteration counter subscript k to alleviate the notation, relies on some identities related to the covariance matrix ⟨e, e⟩. The mean of the square matrix x_iᵀ x_j, where {x_i, x_j} are two random row matrices, is the sum of the product of the mean values and the covariance matrix of these matrices:163

⟨x_iᵀ x_j⟩ = ⟨x_i⟩ᵀ ⟨x_j⟩ + ⟨x_i, x_j⟩ , or ⟨x_i, x_j⟩ = ⟨x_iᵀ x_j⟩ − ⟨x_i⟩ᵀ ⟨x_j⟩ , (162)
where ⟨x_i, x_j⟩ is the covariance matrix of x_i and x_j, and thus the covariance operator ⟨·, ·⟩ is bilinear due to the linearity of the mean (expectation) operator ⟨·⟩ in Eq. (157):

⟨ Σ_i α_i u_i , Σ_j β_j v_j ⟩ = Σ_i Σ_j α_i β_j ⟨u_i, v_j⟩ , ∀α_i, β_j ∈ ℝ and ∀u_i, v_j ∈ ℝⁿ random . (163)
Eq. (163) is the key relation to derive an expression for the square of the gradient error ⟨eᵀe⟩, which can be rewritten as the sum of four covariance matrices upon using Eq. (162)₁ and either Eq. (160) or Eq. (161), i.e., ⟨g̃_k⟩ = ⟨g_k⟩ = g_k, as the four terms g_kᵀ g_k cancel each other out:

⟨eᵀe⟩ = ⟨(g̃ − g)ᵀ(g̃ − g)⟩ = ⟨g̃, g̃⟩ − ⟨g̃, g⟩ − ⟨g, g̃⟩ + ⟨g, g⟩ , (164)
163
See, e.g., [185], p. 36, Eq. (2.8.3).
where the iteration counter k had been omitted to alleviate the notation. Moreover, to simplify the notation further, the gradient related to an example is simply denoted by g_a or g_b, with a, b = 1, . . . , m for a minibatch, and a, b = 1, . . . , M for the full batch:

g̃ = g̃_k = (1/m) Σ_{a=1}^{m} g_{i_a,k} = (1/m) Σ_{a=1}^{m} g_{i_a} = (1/m) Σ_{a=1}^{m} g_a , (165)

g = (1/k_max) Σ_{k=1}^{k_max} (1/m) Σ_{a=1}^{m} g_{i_a,k} = (1/M) Σ_{b=1}^{M} g_b . (166)
Now assume the covariance matrix of any pair of single-example gradients g_a and g_b depends only on the parameters θ, and is of the form:

⟨g_a, g_b⟩ = C(θ) δ_ab , ∀a, b ∈ {1, . . . , M} , (167)

where δ_ab is the Kronecker delta. Using Eqs. (165)-(166) and Eq. (167) in Eq. (164), we obtain a simple expression for ⟨eᵀe⟩:164

⟨eᵀe⟩ = (1/m − 1/M) C . (168)
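Eq. (168) can be checked empirically with a few lines of Python, here in one dimension with a scalar covariance C, and with i.i.d. per-example “gradients” so that ⟨g_a, g_b⟩ = Cδ_ab holds by construction; the numerical values are assumptions for illustration:

```python
import numpy as np

# Monte-Carlo check of Eq. (168): <e^T e> = (1/m - 1/M) C.
rng = np.random.default_rng(0)
M, m, C, trials = 1000, 50, 4.0, 20000
sq_errors = []
for _ in range(trials):
    g = rng.normal(0.0, np.sqrt(C), size=M)        # per-example gradients
    batch = rng.choice(M, size=m, replace=False)   # one random minibatch
    sq_errors.append((g[batch].mean() - g.mean())**2)   # e = g_tilde - g
print(np.mean(sq_errors), (1.0/m - 1.0/M) * C)     # both approx 0.076
```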
The authors of [164] introduced the following stochastic differential equation as a continuous counterpart of the discrete parameter update Eq. (156), as ϵ_k → 0:

dθ/dt = −g + n(t) = −dJ/dθ + n(t) , (169)

where n(t) is the noise function, the continuous counterpart of the gradient error e_k := (g_k − g̃_k). The noise n(t) is assumed to be Gaussian, i.e., with zero expectation (mean) and with a covariance function of the form (see Remark 6.9 on the Langevin stochastic differential equation):

⟨n(t)⟩ = 0 , and ⟨n(t) n(t′)⟩ = F C(θ) δ(t − t′) , (170)

where E[·] = ⟨·⟩ is the expectation of a function, F the “noise scale” or fluctuation factor, C(θ) the same gradient-error covariance matrix as in Eq. (167), and δ(t − t′) the Dirac delta. Integrating Eq. (169), we obtain:

∫_{t=0}^{t=ϵ_k} (dθ_k/dt) dt = θ_{k+1} − θ_k = −ϵ_k g_k + ∫_{t=0}^{t=ϵ_k} n(t) dt , and ⟨ ∫_{t=0}^{t=ϵ_k} n(t) dt ⟩ = ∫_{t=0}^{t=ϵ_k} ⟨n(t)⟩ dt = 0 . (171)
The fluctuation factor F can be identified by equating the square of the error in Eq. (156) to that in Eq. (171), i.e.,

ϵ² ⟨eᵀe⟩ = ∫_{t=0}^{t=ϵ} ∫_{t′=0}^{t′=ϵ} ⟨n(t) n(t′)⟩ dt dt′ ⇒ ϵ² (1/m − 1/M) C = ϵ F C ⇒ F = ϵ (1/m − 1/M) , (172)
Remark 6.8. Fluctuation factor for large training set. For large M, our fluctuation factor F is roughly proportional to the ratio of the step length to the minibatch size, i.e., F ≈ ϵ/m. Thus step-length (ϵ) decay, or equivalently minibatch-size (m) increase, corresponds to a decrease in the fluctuation factor F. On the other hand, [164] obtained their fluctuation factor G as165

G = ϵ (M/m − 1) = ϵ M (1/m − 1/M) = M F , (173)
164
Eq. (168) possesses an elegant simplicity compared to the expression ⟨α²⟩ = N(N/B − 1)F(ω) in [164], based on a different definition of the gradient, with N ≡ M, B ≡ m, ω ≡ θ, but F(ω) ≠ C(θ).
165
In [186] and [164], the fluctuation factor was expressed, in original notation, as g = ϵ(N/B − 1), where the
equivalence with our notation is g ≡ G (fluctuation factor), N ≡ M (training-set size), B ≡ m (minibatch size).
Figure 64: Optimal minibatch size vs. training-set size (Section 6.3.5). For a given training-set size, the smallest minibatch size that achieves the highest accuracy is optimal. Left figure: The optimal minibatch size moved to the right with increasing training-set size M. Right figure: The optimal minibatch size in [186] is linearly proportional to the training-set size M for large training sets (i.e., M → ∞), Eq. (173), but our fluctuation factor F is independent of M when M → ∞; see Remark 6.8. (Figure reproduced with permission of the authors.)
since their cost function was not an average, i.e., not divided by the minibatch size m, unlike our cost function in Eq. (139). When M → ∞, our fluctuation factor F → ϵ/m in Eq. (172), but their fluctuation factor G ≈ ϵM/m → ∞; i.e., for increasingly large M, our fluctuation factor F is bounded, but not their fluctuation factor G. [186] then went on to show empirically that their fluctuation factor G was proportional to the training-set size M for large M, as shown in Figure 64. On the other hand, our fluctuation factor F does not depend on the training-set size M. As a result, unlike [186] in Figure 64, our optimal minibatch size would not depend on the training-set size M. ■
It was suggested in [164] to follow the same step-length decay schedules166 ϵ(t) in Section 6.3.4 to adjust
the size of the minibatches, while keeping the step length constant at its initial value ϵ0 . To demonstrate the
equivalence between decreasing the step length and increasing minibatch size, the CIFAR-10 dataset with
three different training schedules as shown in Figure 65 was used in [164].
The results are shown in Figure 66: the number of updates decreased drastically with minibatch-size increase, allowing for a significant shortening of the training wall-clock time.
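The step-length-decay and minibatch-size-increase schedules of Figure 65 can be sketched as follows (initial values ϵ_0 = 0.1 and m_0 = 128, factor 5, milestone epochs 60, 120, 160, as in the figure); the hybrid schedule (2) is omitted for brevity:

```python
def training_schedule(epoch, mode, eps0=0.1, m0=128, factor=5, milestones=(60, 120, 160)):
    """Return (step length, minibatch size) at a given epoch. Mode 'decay_step'
    divides the step length by `factor` at each milestone; mode 'increase_batch'
    instead multiplies the minibatch size by `factor`, keeping the fluctuation
    factor F ~ eps/m (and hence the 'temperature') roughly the same."""
    n = sum(epoch >= e for e in milestones)   # milestones already passed
    if mode == "decay_step":
        return eps0 / factor**n, m0
    if mode == "increase_batch":
        return eps0, m0 * factor**n
    raise ValueError(mode)
```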
Remark 6.9. Langevin stochastic differential equation, annealing. Because the fluctuation factor F is proportional to the step length, and, in physics, fluctuation decreases with temperature (cooling), “decaying the learning rate (step length) is simulated annealing”168 [164]. Here, we connect the step length to “temperature” based on the analogy of Eq. (169), the continuous counterpart of the parameter update Eq. (156). In particular, we point to the exact references that justify the assumptions in Eq. (170).
166
See Figure 151 in Section 14.7 on “Lack of transparency and irreproducibility of results” in recent deep-learning
papers.
167
See Figure 151 in Section 14.7 on “Lack of transparency and irreproducibility of results” in recent deep-learning
papers.
168
“In metallurgy and materials science, annealing is a heat treatment that alters the physical and sometimes chemical
properties of a material to increase its ductility and reduce its hardness, making it more workable. It involves
heating a material above its recrystallization temperature, maintaining a suitable temperature for a suitable amount
of time, and then allow slow cooling.” Wikipedia, ‘Annealing (metallurgy)’, Version 11:06, 26 November 2019.
The name “simulated annealing” came from the highly cited paper [163], which received more than 20,000 citations
on Web of Science and more than 40,000 citations on Google Scholar as of 2020.01.17. See also Remark 6.10 on
“Metaheuristics”.
Figure 65: Minibatch-size increase vs. step-length decay, training schedules (Section 6.3.5).
Left figure: Step length (learning rate) vs. number of epochs. Right figure: Minibatch size vs.
number of epochs. Three learning-rate schedules167 were used for training: (1) The step length
was decayed by a factor of 5, from an initial value of 10−1 , at specific epochs (60, 120, 160),
while the minibatch size was kept constant (blue line); (2) Hybrid, i.e., the step length was
initially kept constant until epoch 120, then decreased by a factor of 5 at epoch 120, and by
another factor of 5 at epoch 160 (green line); (3) The step length was kept constant, while the
minibatch size was increased by a factor of 5, from an initial value of 128, at the same specific
epochs, 60, 120, 160 (red line). See Figure 66 for the results using the CIFAR-10 dataset [164]
(Figure reproduced with permission of the authors.)
Even though the authors of [164] referred to [187] for Eq. (169), the decomposition of the parameter update in [187]:

θ̃_{k+1} = θ̃_k − ϵ_k g_k + √ϵ_k v_k , with v_k := √ϵ_k [g_k − g̃_k] = √ϵ_k e_k , (174)

with the intriguing factor √ϵ_k, was consistent with the equivalent expression in [185], p. 53, Eq. (3.5.10),169 which was obtained from the Fokker-Planck equation:

θ(t + ∆t) = θ(t) + A(θ(t), t) ∆t + √∆t η(t) , (175)

where A(θ(t), t) is a nonlinear operator. The noise term √∆t η(t) is not related to the gradient error as in Eq. (174), and is Gaussian with zero mean and covariance matrix of the form:

√∆t ⟨η(t)⟩ = 0 , and ∆t ⟨η(t), η(t)⟩ = ∆t B(t) . (176)
The column matrix (or vector) A(θ(t), t) in Eq. (175) is called the drift vector, and the square matrix B in
Eq. (176) the diffusion matrix, [185], p. 52. Eq. (175) implies that θ(·) is a continuous function, called the
“sample path”.
To obtain a differential equation, Eq. (175) can be rewritten as

[θ(t + ∆t) − θ(t)] / ∆t = A(θ(t), t) + η(t)/√∆t , (177)

which shows that the derivative of θ(t) does not exist when taking the limit as ∆t → 0, not only due to the factor 1/√∆t → ∞, but also due to the noise η(t), [185], p. 53.
The last term η/√∆t in Eq. (177) corresponds to the random force X exerted on a pollen particle by the viscous fluid molecules in the 1-D equation of motion of the pollen particle, as derived by Langevin [188]:

m d²x/(dt)² = −6πµa (dx/dt) + X , (178)

where µ is the fluid viscosity and a the radius of the pollen particle.
169
In original notation used in [185], p. 53, Eq. (3.5.10) reads as y(t + ∆t) = y(t) + A(y(t), t)∆t + η(t)√∆t, in which the noise η(t) has zero mean, i.e., ⟨η⟩ = 0, and covariance matrix ⟨η(t), η(t)⟩ = B(t).
Figure 66: Minibatch-size increase, fewer parameter updates, faster computation (Section 6.3.5). For each of the three training schedules in Figure 65, the same learning curve is plotted in terms of the number of epochs (left figure), and again in terms of the number of parameter updates (right figure), which shows the significant decrease in the number of parameter updates, and thus computational cost, for the training schedule with minibatch-size increase. The blue curve ends at about 80,000 parameter updates for step-length decrease, whereas the red curve ends at about 29,000 parameter updates for minibatch-size increase [164] (Figure reproduced with permission of the authors.)
Eq. (178) cannot be directly integrated to obtain the velocity v in terms of the noise force X, since the derivative does not exist, as interpreted in Eq. (177). Langevin went around this problem by multiplying Eq. (178) by the displacement x(t) and taking the average to obtain, [188]:

(m/2) (dz/dt) = −3πµa z + RT/N , (180)

where z = d⟨x²⟩/dt is the time derivative of the mean-square displacement, R the ideal gas constant, and N the Avogadro number. Eq. (180) can be integrated to yield an expression for z, which led to Einstein’s result for Brownian motion. ■
Remark 6.10. Metaheuristics and nature-inspired optimization algorithms. There is a large class of nature-
inspired optimization algorithms that implemented the general conceptual metaheuristics—such as neigh-
borhood search, multi-start, hill climbing, accepting negative moves, etc.—and that include many well-
known methods such as Evolutionary Algorithms (EAs), Artificial Bee Colony (ABC), Firefly Algorithm,
etc. [190].
The most famous of these nature-inspired algorithms would be perhaps simulated annealing in [163],
which is described in [191], p. 18, as being “inspired by the annealing process of metals. It is a trajectory-
based search algorithm starting with an initial guess solution at a high temperature and gradually cooling
down the system. A move or new solution is accepted if it is better; otherwise, it is accepted with a proba-
bility, which makes it possible for the system to escape any local optima”, i.e., the metaheuristic “accepting
negative moves” mentioned in [190]. “It is then expected that if the system is cooled down slowly enough,
the global optimal solution can be reached”, [191], p. 18; that’s step-length decay or minibatch-size increase,
as mentioned above. See also Footnotes 140 and 168.
For applications of these nature-inspired algorithms, we cite the following works, without detailed
review: [191] [192] [193] [194] [195] [196] [197] [198] [199]. ■
Figure 67: Weight decay (Section 6.3.6). Effects of magnitude of weight-decay parameter d.
Adapted from [78], p. 116. (Figure reproduced with permission of the authors.)
and add the weight-decay term dθ̃_k from Eq. (181), then scale both the weight-decay term and the gradient-descent term (−ϵ_k g̃_k) by the cyclic-annealing multiplier a_k in Eq. (182), leaving the momentum term ζ_k(θ̃_k − θ̃_{k−1}) alone, to obtain:

θ̃_{k+1} = θ̃_k − a_k [ ϵ_k g̃(θ̃_k + γ_k(θ̃_k − θ̃_{k−1})) + d_k θ̃_k ] + ζ_k(θ̃_k − θ̃_{k−1}) , (183)

which is included in Algorithm 4.
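A minimal sketch of the combined update Eq. (183) for one iteration; the annealing factor a, weight-decay parameter d, momentum ζ, and accelerated-gradient parameter γ are all passed in by the caller (theta and theta_prev are assumed to be NumPy arrays or floats):

```python
def sgd_update(theta, theta_prev, grad_est, eps, a=1.0, d=0.0, zeta=0.0, gamma=0.0):
    """Parameter update of Eq. (183): the annealing factor a scales both the
    gradient-descent term and the weight-decay term d*theta, while the
    momentum term zeta*(theta - theta_prev) is left alone; gamma > 0
    evaluates the gradient estimate at a look-ahead point (Nesterov)."""
    lookahead = theta + gamma * (theta - theta_prev)
    theta_next = theta - a * (eps * grad_est(lookahead) + d * theta) \
                 + zeta * (theta - theta_prev)
    return theta_next, theta
```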
where W^(ℓ) and b^(ℓ) denote the layer’s weight matrix and bias vector, respectively. The output of the layer, y^(ℓ), is given by element-wise application of the activation function a; see Sections 4.4.1 and 4.4.2 for a detailed presentation. All components of the weight matrix W^(ℓ) are assumed to be independent of each other and to share the same probability distribution. The same holds for the components of the input vector y^(ℓ−1) and the output vector y^(ℓ). Additionally, elements of W^(ℓ) and y^(ℓ) shall be mutually independent. Further, it is assumed that W^(ℓ) and y^(ℓ) have zero mean (i.e., expectation, cf. Eq. (67)) and are symmetric around zero. In this case, the variance of z^(ℓ) ∈ ℝ^{m^(ℓ)×1} is given by

Var z_i^(ℓ) = m^(ℓ) Var( W_ij^(ℓ) y_j^(ℓ−1) ) (185)
= m^(ℓ) [ Var W_ij^(ℓ) Var y_j^(ℓ−1) + E(W_ij^(ℓ))² Var y_j^(ℓ−1) + Var W_ij^(ℓ) E(y_j^(ℓ−1))² ] (186)
= m^(ℓ) Var W_ij^(ℓ) [ Var y_j^(ℓ−1) + E(y_j^(ℓ−1))² ] , (187)
where m^(ℓ) denotes the width of the ℓ-th layer, and the fundamental relation Var(XY) = Var X Var Y + E(X)² Var Y + E(Y)² Var X has been used along with the assumption of weights having a zero mean, i.e., E(W_ij^(ℓ)) = 0. The variance of some random variable X, which is the expectation of the squared deviation of X from its mean, i.e.,

Var X = E((X − E(X))²) , (188)

is a measure of the “dispersion” of X around its mean value. The variance is the square of the standard deviation σ of the random variable X, or, conversely,

σ(X) = √(Var X) = √(E((X − E(X))²)) . (189)

As opposed to the variance, the standard deviation of some random variable X has the same physical dimension as X itself.
The elementary relation Var X = E(X²) − E(X)² gives

Var z_i^(ℓ) = m^(ℓ) Var W_ij^(ℓ) E((y_j^(ℓ−1))²) , (190)
Note that the mean of the inputs does not vanish for activation functions that are not symmetric about zero, e.g., the ReLU function (see Section 5.3.2). For the ReLU activation function, y = a(x) = max(0, x), the mean value of the squared output and the variance of the input are related by

E(y²) = ∫_{−∞}^{∞} y² P(y) dy (191)
= ∫_{−∞}^{∞} max(0, x)² P(x) dx = ∫_{0}^{∞} x² P(x) dx = (1/2) ∫_{−∞}^{∞} x² P(x) dx (192)
= (1/2) Var x . (193)
Substituting the above result in Eq. (190) provides the following relationship among the variances of the inputs to the activation function of two consecutive layers, i.e., Var z^(ℓ) and Var z^(ℓ−1), respectively:

Var z_i^(ℓ) = (m^(ℓ)/2) Var W_ij^(ℓ) Var z_j^(ℓ−1) . (194)

For a network with L layers, the following relation between the variance of inputs Var z^(1) and outputs Var z^(L) is obtained:

Var z_i^(L) = Var z_j^(1) ∏_{ℓ=2}^{L} (m^(ℓ)/2) Var W_ij^(ℓ) . (195)
To preserve the variance through all layers of the network, the following condition must be fulfilled regarding the variance of the weight matrices:

(m^(ℓ)/2) Var W_ij^(ℓ) = 1 ∀ℓ ↔ W^(ℓ) ∼ N(0, 2/m^(ℓ)) , (196)

where N(0, σ²) denotes the normal (or Gaussian) distribution with zero mean and variance σ² = 2/m^(ℓ). The above result, which is known as Kaiming He initialization, implies that the width m^(ℓ) of a layer must be taken into account in the initialization of the weight matrices. Preserving the variance of the inputs mitigates exploding or vanishing gradients and improves convergence, in particular for deep networks. The authors of [127] provided analogous results for the parametric rectified linear unit (PReLU, see Section 5.3.3).
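A sketch of Kaiming He initialization per Eq. (196) in NumPy; the sketch uses the layer’s fan-in for m^(ℓ), as is common in practice, and the layer widths in the usage example are assumed values:

```python
import numpy as np

def he_initialize(widths, rng=np.random.default_rng(0)):
    """Draw each weight matrix from N(0, 2/m), Eq. (196), so that the
    variance of the inputs z is preserved through the ReLU layers;
    biases are initialized to zero."""
    params = []
    for m_in, m_out in zip(widths[:-1], widths[1:]):
        W = rng.normal(0.0, np.sqrt(2.0 / m_in), size=(m_out, m_in))
        b = np.zeros(m_out)
        params.append((W, b))
    return params

# A deep ReLU network with ten layers of width 100:
params = he_initialize([100] * 11)
```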
1 Unified adaptive learning-rate, momentum, weight decay, cyclic annealing (for epoch τ)
Data:
• Network parameters θ̃_0 obtained from previous epoch (τ − 1)
• Select a small stabilization parameter δ > 0, e.g., 10⁻⁸
• Select t as epoch τ or global iteration j, Eq. (148), and budget τ_max or j_max
• Select learning-rate schedule ϵ(t) ∈ {Eqs. (147)-(151)}, and parameters ϵ_0, k_c, ϖ
• Select ζ if standard momentum is used, and γ if Nesterov Accelerated Gradient is used, Eq. (141)
• Select a value for m, the number of examples in the minibatches
Result: Updated network parameters θ̃_0 for the next epoch (τ + 1).
2
3 ▶ Begin training epoch τ
4 ▶ Initialize θ̃_1 = θ̃_{τ⋆} (from previous training epoch) (line 5 in Algorithm 4)
5 ② for k = 1, 2, . . . , k_max do
6 Obtain random minibatch B_k^{|m|} containing m examples, Eq. (137) ;
7 Compute gradient estimate g̃_k = g̃(θ̃_k) per Eq. (140) ;
8 Compute m_k = ϕ_t(g̃_1, . . . , g̃_k) and m̂_k = χϕ_t(g̃_1, . . . , g̃_k) as in Eq. (197) ;
9 Compute V_k = ψ_t(g̃_1, . . . , g̃_k) and V̂_k = χψ_t(g̃_1, . . . , g̃_k) as in Eq. (198) ;
10 Compute learning rate ϵ_k as in Eq. (200) ;
11 Set descent direction d̃_k = (−m̂_k) ;
12 if Stopping criterion not met (if used, Section 6.1) then
13 ▶ Update network parameter θ̃_k to θ̃_{k+1}: Use Eq. (183)
14 with Eq. (200), which includes Eq. (201) and Eq. (251) for AdamW ;
15 else
16 ▶ Stopping criterion met
17 Reset initial minimizer estimate for next training epoch: θ̃_0 ← θ̃_k ;
18 Stop ② for loop ;
19 end
20 end
21
Algorithm 5: Unified adaptive learning-rate algorithm (Sections 6.5.1, 6.5.2, 6.5.4, 6.5.5, 6.5.6, 6.5.7, 6.5.10, Algorithm 4), which includes momentum, accelerated gradient, weight decay, and cyclic annealing. The outer ① for τ = 1, . . . , τ_max loop over the training epochs, as shown in Algorithm 4 for SGD, is not presented, to focus on the ② for k = 1, . . . , k_max inner loop of one pass over the training set, for a typical training epoch τ. The differences with Algorithm 4 are highlighted in purple color. For the parameter update in lines 13-14, see also Footnote 176.
where ϵ(k) can be either Eq. (147)178 (which includes ϵ(k) = ϵ0 = constant) or Eq. (149);179 δ is a small
number to avoid division by zero;180 the operations (square root, addition, division) are element-wise, with
both ϵ0 and δ = O(10−6 ) to O(10−8 ) (depending on the algorithm) being constants.
Remark 6.11. A particular case is the AdaDelta algorithm, in which m_k in Eq. (197) is the second moment of the network-parameter increments {∆θ_i, i = 1, . . . , k}, and m̂_k, also in Eq. (197), the corrected gradient. ■
All of the above arrays—such as m_k and m̂_k in Eq. (197), V_k and V̂_k in Eq. (198), and d̃_k in Eq. (199)—together with the resulting array ϵ_k in Eq. (200), have the same structure as the network parameter array θ in Eq. (31), with P_T in Eq. (34) being the total number of parameters. The update of the network-parameter estimate θ̃ is written as follows:

θ̃_{k+1} = θ̃_k + ϵ_k ⊙ d̃_k = θ̃_k + ϵ_k d̃_k = θ̃_k + ∆θ̃_k (element-wise operations) , (201)

where the Hadamard operator ⊙ (element-wise multiplication) is omitted to alleviate the notation.181
Remark 6.12. The element-wise operations in Eq. (200) and Eq. (201) would allow each parameter in
array θ to have its own learning rate, unlike in traditional deterministic optimization algorithms, such as
in Algorithm 2 or even in the Standard SGD Algorithm 4, where the same learning rate is applied to all
parameters in θ. ■
It remains to define the functions ϕt and χϕt in Eq. (197), and ψt and χψt in Eq. (198) for each of the
particular algorithms covered by the unified Algorithm 5.
SGD. To obtain Algorithm 4 as a particular case, select the following functions for Algorithm 5:
together with learning-rate schedule ϵ(k) presented in Section 6.3.4 on step-length decay and annealing. In
other words, from Eqs. (199)-(201), the parameter update reduces to that of the vanilla SGD with the fixed
learning-rate schedule ϵ(k), without scaling:
Similarly for SGD with momentum and accelerated gradient (Section 6.3.2), step-length decay and cyclic annealing (Section 6.3.4), and weight decay (Section 6.3.6).
178
Suggested in [78], p. 287.
179
Suggested in [182], p. 3.
180
AdaDelta and RMSProp used the first form of Eq. (200), with δ outside the square root, whereas AdaGrad and Adam used the second form, with δ inside the square root. AMSGrad, Nostalgic Adam, and AdamX did not use δ, i.e., set δ = 0.
181
There are no symbols similar to the Hadamard operator symbol ⊙ for other operations such as square root, addition,
and division, as implied in Eq. (200), so there is no need to use the symbol ⊙ just for multiplication.
Figure 68: Convergence of adaptive learning-rate algorithms (Section 6.3.2): AdaGrad, RM-
SProp, SGDNesterov, AdaDelta, Adam [170]. (Figure reproduced with permission of the authors.)
χψ_k = I and δ = 10⁻⁷ ⇒ V̂_k = Σ_{i=1}^{k} (g̃_i)² and ϵ_k = ϵ(k)/√(V̂_k + δ) (element-wise operations) , (207)

leading to an update with adaptive scaling of the learning rate:

θ̃_{k+1} = θ̃_k − [ ϵ(k)/√(V̂_k + δ) ] g̃_k (element-wise operations) , (208)
in which each parameter in θ̃_k is updated with its own learning rate. For a given network parameter, say, θ_pq, its learning rate ϵ_{k,pq} is essentially ϵ(k) scaled by the inverse of the square root of the sum of all historical values of the corresponding gradient component (pq), i.e., (δ + Σ_{i=1}^{k} g̃²_{i,pq})^{−1/2}, with δ = 10⁻⁷ being very small. A consequence of such scaling is that a larger gradient component would have a smaller learning rate and a smaller per-iteration decrease in the learning rate, whereas a smaller gradient component would have a larger learning rate and a higher per-iteration decrease in the learning rate, even though the relative decrease is about the same.183 Thus progress along different directions with large differences in gradient amplitudes is evened out as the number of iterations increases.184
182
As of 2019.11.28, [52] was cited 5,385 times on Google Scholar, and 1,615 times on Web of Science. By 2022.07.11, [52] was cited 10,431 times on Google Scholar, and 3,871 times on Web of Science.
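A minimal sketch of the AdaGrad update Eqs. (207)-(208), with element-wise operations giving each parameter its own learning rate; the ill-conditioned quadratic in the usage example is an assumed test problem:

```python
import numpy as np

def adagrad_update(theta, V_hat, grad_est, eps0=0.01, delta=1e-7):
    """AdaGrad, Eqs. (207)-(208): accumulate the element-wise squares of all
    past gradient estimates in V_hat; scale the learning rate of each
    parameter by 1/sqrt(V_hat + delta)."""
    g = grad_est(theta)
    V_hat = V_hat + g**2                               # sum of historical squared gradients
    theta = theta - eps0 / np.sqrt(V_hat + delta) * g  # per-parameter learning rates
    return theta, V_hat

# Directions with large and small gradient amplitudes are evened out:
grad_est = lambda th: np.array([th[0], 50.0 * th[1]])
theta, V_hat = np.array([5.0, 1.0]), np.zeros(2)
for _ in range(500):
    theta, V_hat = adagrad_update(theta, V_hat, grad_est)
```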
Figure 68 shows the convergence of some adaptive learning-rate algorithms: AdaGrad, RMSProp,
SGDNesterov, AdaDelta, Adam.
Figure 69: Dow Jones Industrial Average (DJIA, Section 6.5.3) stock-index year-to-date (YTD) chart from 2019.01.01 to 2019.11.30, Google Finance.
“Exponential smoothing methods have been around since the 1950s, and are still the most popular fore-
casting methods used in business and industry” such as “minute-by-minute stock prices, hourly temperatures
at a weather station, daily numbers of arrivals at a medical clinic, weekly sales of a product, monthly unem-
ployment figures for a region, quarterly imports of a country, and annual turnover of a company” [206]. See
Figure 69 for the chart of a stock index showing noise.
“Exponential smoothing was proposed in the late 1950s (Brown, 1959; Holt, 1957; Winters, 1960),
and has motivated some of the most successful forecasting methods. Forecasts produced using exponential
smoothing methods are weighted averages of past observations, with the weights decaying exponentially as
the observations get older. In other words, the more recent the observation the higher the associated weight.
This framework generates reliable forecasts quickly and for a wide range of time series, which is a great
advantage and of major importance to applications in industry” [207], Chap. 7, “Exponential smoothing”. See Figure 70 for an example of an “exponential-smoothing” curve that is not “smooth”.
183 See [54]. For example, compare the sequence {1/5, 1/(5+5)} to the sequence {1/2, 1/(2+2)}. The authors of [78], p. 299, mistakenly stated: “The parameters with the largest partial derivative of the loss have a correspondingly rapid decrease in their learning rate, while parameters with small partial derivatives have a relatively small decrease in their learning rate.”
184 It was stated in [54]: “progress along each dimension evens out over time. This is very beneficial for training deep neural networks since the scale of the gradients in each layer is often different by several orders of magnitude, so the optimal learning rate should take that into account.” Such an observation made more sense than saying “The net effect is greater progress in the more gently sloped directions of parameter space,” as did the authors of [78], p. 299, who referred to AdaDelta in Section 8.5.4, p. 302, through the work of other authors, but might not have read [54].
Figure 70: Saudi Arabia oil production during 1996-2013 (Section 6.5.3). Piecewise linear
data (black) and fitted curve (red), despite the name “smoothing”. From [207], Chap. 7. (Figure
reproduced with permission of the authors.)
For neural networks, early use of exponential smoothing dates back at least to 1998 in [165] and [166].185
For adaptive learning-rate algorithms further below (RMSProp, AdaDelta, Adam, etc.), let {x(t), t =
1, 2, . . .} be a noisy raw-data time series as in Figure 69, and {s(t), t = 1, 2, . . .} its smoothed-out counter-
part. The following recurrence relation is an exponential smoothing used to predict s(t + 1) based on the
known value s(t) and the data x(t + 1):
s(t + 1) = βs(t) + (1 − β)x(t + 1) , with β ∈ [0, 1) (209)
Eq. (209) is a convex combination between s(t) and x(t + 1). A value of β closer to 1, e.g., β = 0.9 and 1 − β = 0.1, would weigh the smoothed-out past data s(t) more than the incoming raw data point x(t + 1).
From Eq. (209), we have
s(1) = βs(0) + (1 − β)x(1) (210)
It should be noted, however, that for forecasting (e.g., [207]), the following recursive equation, slightly
different from Eq. (209), is used instead:
s(t + 1) = βs(t) + (1 − β)x(t) , with β ∈ [0, 1) , (214)
where x(t) (shown in purple) is used instead of x(t + 1), since if the data at (t + 1) were already known,
there would be no need to forecast.
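A minimal sketch (our illustration) of the smoothing recurrence Eq. (209) and the one-step forecasting variant Eq. (214), applied to a synthetic noisy series:

import numpy as np

def exponential_smoothing(x, beta=0.9):
    # Eq. (209): s(t+1) = beta * s(t) + (1 - beta) * x(t+1), a convex combination.
    s = np.empty_like(x, dtype=float)
    s[0] = x[0]  # initialization; Eq. (210) starts the recursion from s(0)
    for t in range(1, len(x)):
        s[t] = beta * s[t - 1] + (1 - beta) * x[t]
    return s

def one_step_forecast(s_t, x_t, beta=0.9):
    # Eq. (214): use the known x(t), not the unknown x(t+1), to forecast s(t+1).
    return beta * s_t + (1 - beta) * x_t

x = np.sin(np.linspace(0, 10, 200)) + 0.3 * np.random.default_rng(0).standard_normal(200)
s = exponential_smoothing(x)  # smoothed-out counterpart of the noisy raw data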
RMSProp. To obtain RMSProp as a particular case of Algorithm 5, select:
χ_{ψk} = I and δ = 10⁻⁶ ⇒ V̂_k = V_k and ϵ_k = ϵ(k)/√(V̂_k + δ) (element-wise operations) , (218)
where the running average of the squared gradients is given in Eq. (216) for efficient coding, and in Eq. (217) in fully expanded form as a series with exponential coefficients β^{k−i}, for i = 1, . . . , k. Eq. (216) is the exact counterpart of the exponential-smoothing recurrence relation in Eq. (209), and Eq. (217) has its counterpart in Eq. (212) if g̃₀ = (g̃₀)² = 0; see Section 6.5.3 on forecasting time series and exponential smoothing.
RMSProp still depends on a global learning rate ϵ(k) = ϵ₀, a constant tuning hyperparameter. Even though RMSProp was one of the go-to algorithms for machine learning, the pitfalls of RMSProp, along with other adaptive learning-rate algorithms, were revealed in [55].
Quoting from the criticism of AdaGrad in [54]: “If the initial gradients are large, the learning rates will be low for the remainder of training. This can be combatted by increasing the global learning rate, making the AdaGrad method sensitive to the choice of learning rate. Also, due to the continual accumulation of squared gradients in the denominator, the learning rate will continue to decrease throughout training, eventually decreasing to zero and stopping training completely.”
AdaDelta was then introduced in [54] as an improvement over AdaGrad with two goals in mind: (1) to
avoid the continuing decay of the learning rate, and (2) to avoid having to specify ϵ(k), called the “global
learning rate”, as a constant. Instead of summing past squared gradients over a finite-size window, which is not efficient in coding, exponential smoothing was employed in [54] for both the squared gradients (g̃_k)² and the squared increments (∆θ_k)², with the increment used in the update θ_{k+1} = θ_k + ∆θ_k, by choosing the following functions for Algorithm 5:
V_k = β V_{k−1} + (1 − β)(g̃_k)² , with β ∈ [0, 1) (element-wise square) , (219)
V_k = (1 − β) Σ_{i=1}^{k} β^{k−i} (g̃_i)² = ψ_k(g̃_1, . . . , g̃_k) , (220)
Despite this progress, AdaDelta and RMSProp, along with other adaptive learning-rate algorithms,
shared the same pitfalls as revealed in [55].
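The exponentially smoothed second moment of Eq. (219), which also drives the RMSProp update Eq. (218), amounts to one changed line relative to the AdaGrad sketch above (our illustration, reusing NumPy as above):

def rmsprop_update(theta, grad, V, eps0=0.001, beta=0.9, delta=1e-6):
    # Running average of squared gradients, Eq. (219): old values decay
    # with exponential coefficients beta**(k-i), Eq. (220).
    V = beta * V + (1 - beta) * grad ** 2
    # Adaptive scaling of the global learning rate eps0, Eq. (218).
    theta = theta - eps0 / np.sqrt(V + delta) * grad
    return theta, V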
Adam. Exponential smoothing is applied to both the gradients (1st moment) and the squared gradients (2nd moment), each followed by a bias correction:
m_k = β₁ m_{k−1} + (1 − β₁) g̃_k (1st moment) , (226)
m̂_k = m_k / [1 − (β₁)^k] (bias correction) , (227)
V_k = β₂ V_{k−1} + (1 − β₂)(g̃_k)² (element-wise square, 2nd moment) , (228)
V̂_k = V_k / [1 − (β₂)^k] (bias correction) , (229)
ϵ_k = ϵ(k)/√(V̂_k + δ) (element-wise operations) , (230)
with the following recommended values of the parameters:
β₁ = 0.9 , β₂ = 0.999 , ϵ₀ = 0.001 , δ = 10⁻⁸ . (231)
Remark 6.13. RMSProp is a particular case of Adam with β₁ = 0, together with the absence of the bias-corrected 1st moment, Eq. (227), and of the bias-corrected 2nd moment, Eq. (229). Moreover, to get RMSProp from Adam, choose the constant δ and the learning rate ϵ_k as in Eq. (218), instead of Eq. (230) above; this choice is a minor point, since either choice should be fine. On the other hand, for deep-learning applications, having the 1st moment (or momentum), and thus requiring β₁ > 0, would be useful to “significantly boost the performance” [182], and is hence an advantage of Adam over RMSProp. ■
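Putting Eqs. (226)-(231) together, one Adam iteration can be sketched as follows (our illustration; k is the 1-based iteration counter):

def adam_update(theta, grad, m, V, k, eps0=0.001, beta1=0.9, beta2=0.999, delta=1e-8):
    m = beta1 * m + (1 - beta1) * grad         # 1st moment, Eq. (226)
    V = beta2 * V + (1 - beta2) * grad ** 2    # 2nd moment, Eq. (228)
    m_hat = m / (1 - beta1 ** k)               # bias correction, Eq. (227)
    V_hat = V / (1 - beta2 ** k)               # bias correction, Eq. (229)
    theta = theta - eps0 / np.sqrt(V_hat + delta) * m_hat   # Eqs. (230)-(231)
    return theta, m, V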
It follows from Eq. (212) in Section 6.5.3 on exponential smoothing of time series that the recurrence relation for gradients (1st moment) in Eq. (226) leads to the following series:
m_k = (1 − β₁) Σ_{i=1}^{k} (β₁)^{k−i} g̃_i , (232)
since m₀ = 0. Taking the expectation, as defined in Eq. (67), on both sides of Eq. (232) yields
E[m_k] = (1 − β₁) Σ_{i=1}^{k} (β₁)^{k−i} E[g̃_i] = E[g̃_i] · (1 − β₁) Σ_{i=1}^{k} (β₁)^{k−i} + D = E[g̃_i] · [1 − (β₁)^k] + D , (233)
where D is the drift from the expected value, with D = 0 for stationary random processes.190 For non-stationary processes, it was suggested in [170] to keep D small by choosing a small β₁, so that only past gradients close to the present iteration k would contribute, so as to keep any change in the mean and standard deviation in subsequent iterations small. By dividing both sides by [1 − (β₁)^k], the bias-corrected 1st moment m̂_k shown in Eq. (227) is obtained, showing that the expected value of m̂_k is the same as the expected value of the gradient g̃_i plus a small number, which could be zero for stationary processes:
E[m̂_k] = E[ m_k / (1 − (β₁)^k) ] = E[g̃_i] + D / [1 − (β₁)^k] . (234)
190 A random process is stationary when its mean and standard deviation stay constant over time.
The argument to obtain the bias-corrected 2nd moment V̂_k in Eq. (229) is of course the same.
The authors of [170] pointed out the lack of bias correction in RMSProp (Remark 6.13), leading to “very large step sizes and often divergence”, and provided numerical-experiment results to support their point.
Figure 68 shows the convergence of some adaptive learning-rate algorithms: AdaGrad, RMSProp,
SGDNesterov, AdaDelta, Adam. Their results show the superior performance of Adam compared to other
adaptive learning-rate algorithms. See Figure 151 in Section 14.7 on “Lack of transparency and irrepro-
ducibility of results” in recent deep-learning papers.
AMSGrad. To fix the non-convergence of Adam, the authors of [182] introduced “long-term memory” of past gradients into Algorithm 5 through the following choices:
β_{1k} = β_{11} ∈ [0, 1) , or β_{1k} = β_{11} λ^{k−1} with λ ∈ (0, 1) , or β_{1k} = β_{11}/k , (236)
m̂_k = m_k (no bias correction) , (237)
β₂ ∈ [0, 1) and β_{11}/√β₂ < 1 ⇔ β₂ > (β_{11})² , (239)
V̂_k = max(V̂_{k−1}, V_k) (element-wise max, “long-term memory”) , and δ not used , (240)
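The “long-term memory” of Eq. (240) amounts to one extra line compared to the Adam sketch above (our illustration; since δ = 0 here, a tiny floor on V_hat_max may be needed in practice to avoid division by zero):

def amsgrad_update(theta, grad, m, V, V_hat_max, eps0, beta1=0.9, beta2=0.999):
    m = beta1 * m + (1 - beta1) * grad          # 1st moment; m_hat = m, Eq. (237)
    V = beta2 * V + (1 - beta2) * grad ** 2     # 2nd moment
    V_hat_max = np.maximum(V_hat_max, V)        # element-wise max, Eq. (240)
    theta = theta - eps0 / np.sqrt(V_hat_max) * m   # no delta, i.e., delta = 0
    return theta, m, V, V_hat_max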
The parameter λ was not defined in Corollary 1 of [182]; such an omission could create some difficulty for first-time learners. It has to be deduced from reading the Corollary that λ ∈ (0, 1). For the step-length (or step-size) schedule ϵ(k), even though only Eq. (149) was considered in [182] for the convergence proofs, Eq. (147) (which includes ϵ(k) = ϵ₀ = constant) and Eq. (150) could also be used.192
First-time learners in this field could be overwhelmed by the complex-looking equations in this kind of paper, so it would be helpful to elucidate some key results that led to the above expressions, particularly for β_{1k}, which can be a constant or a function of the iteration number k, in Eq. (236).
It was stated in [182] that “one typically uses a constant β1k in practice (although, the proof requires
a decreasing schedule for proving convergence of the algorithm),” and hence the first choice β1k = β11 ∈
(0, 1).
The second choice, β_{1k} = β_{11} λ^{k−1} with λ ∈ (0, 1) and ϵ(k) = ϵ₀/√k as in Eq. (149), was the result stated in Corollary 1 in [182], but without proof. We fill this gap here to explain this unusual expression for β_{1k}. Only the second term on the right-hand side of the inequality in Theorem 4 needs to be bounded by
191 Sixth International Conference on Learning Representations (Website).
192 The authors of [182] distinguished the step size ϵ(k) from the scaled step size ϵ_k in Eq. (147) or Eq. (149), which were called learning rate.
[D∞² G∞ β₁ P_T / ((1 − β₁)² ϵ₀)] Σ_{k=1}^{k_max} k λ^{k−1} ≤ D∞² G∞ β₁ P_T / [(1 − β₁)² ϵ₀ (1 − λ)²] , (244)
where the following series expansion had been used:195
Σ_{k=1}^{k_max} k λ^{k−1} ≤ Σ_{k=1}^{∞} k λ^{k−1} = 1/(1 − λ)² . (245)
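The series identity behind Eq. (245) can be verified in one line (our addition) by differentiating the geometric series term by term for λ ∈ (0, 1):

\sum_{k=1}^{\infty} k \, \lambda^{k-1}
  = \frac{d}{d\lambda} \sum_{k=0}^{\infty} \lambda^{k}
  = \frac{d}{d\lambda} \left( \frac{1}{1-\lambda} \right)
  = \frac{1}{(1-\lambda)^{2}} \,,

so the finite sum in Eq. (245) is bounded by the value of the infinite series.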
Comparing the bound on the right-hand side of (244) to the corresponding bound shown in [182], Corollary
1, second term, it can be seen that two factors, PT and ϵ0 (in purple), were missing in the numerator and in
the denominator, respectively. In addition, there should be no factor β1 (in blue) as pointed out in [183] in
their correction of the proof in [182].
On the other hand, there were some slight errors in the theorem statements and in the proofs in [182] that were corrected in [183], whose authors did a good job of not skipping the mathematical details, without which the understanding and the verification of the proofs would be obscure and time consuming. It is then recommended to read [182] to get a general idea of the main convergence results of AMSGrad, then read [183] for the details, together with their variant of AMSGrad called AdamX.
The authors of [203], like those of [182], pointed out errors in the convergence proof in [170], and
proposed a fix to this proof, but did not suggest any new variant of Adam.
In the two large numerical experiments on the MNIST dataset in Figure 71,196 the authors of [182] used constant β_{1k} = β₁ = 0.9, with β₂ ∈ {0.99, 0.999}; they chose the step-size schedule ϵ(k) = ϵ₀/√k in
193 To write this term in the notation used in [182], Theorem 4 and Corollary 1, simply make the following changes in notation: k → t, k_max → T, P_T → d, V → v, ϵ(k) → α_t.
194 The notation “G” is clearly mnemonic for “gradient”, and the uppercase is used to designate an upper bound.
195 See also [183], p. 4, Lemma 2.4.
196 See a description of the MNIST dataset in Section 5.3 on “Vanishing and exploding gradients”. For the difference between logistic regression and neural network, see, e.g., [208], Raschka, “Machine Learning FAQ: What is the relation between Logistic Regression and Neural Networks and when to use which?” Original website, Internet archive. See also [78], p. 200, Figure 6.8b, for the computational graph of logistic regression (one-layer network).
Figure 71: AMSGrad vs Adam, numerical examples (Sections 6.1, 6.5.7). The MNIST dataset is used. The first two figures on the left were the results of using logistic regression (network with one layer with logistic sigmoid activation function), whereas the figure on the right was obtained using a neural network with three layers (input layer, hidden layer, output layer). The cost function decreased faster for AMSGrad compared to that of Adam. For logistic regression, the difference between the two cost values also decreased with the iteration number, and became very small at iteration 5000. For the three-layer neural network, the cost difference between AMSGrad and Adam stayed more or less constant, as the cost went down to less than one tenth of the initial cost of about 0.3; after 5000 iterations, the AMSGrad cost (≈ 0.01) was about 50% of the Adam cost (≈ 0.02). See [182]. (Figure reproduced with permission of the authors.)
the logistic regression experiment, and constant step size ϵ(k) = ϵ0 in a network with three layers (input,
hidden, output). There was no single set of optimal parameters, which appeared to be problem dependent.
The authors of [182] also did not provide any numerical example with β1k = β1 λk−1 ; such numerical
examples can be found, however, in [183] in connection with AdamX below.
Unfortunately, when comparing AMSGrad to Adam and AdamW (further below), it was remarked in
[209] that AMSGrad generated “a lot of noise for nothing”, meaning AMSGrad did not live up to its potential
and best-paper award when tested on “real-life problems”.
In addition, numerical examples were provided in [183] with β1k = β1 λk−1 , β1 = 0.9, λ = 0.001,
β2 = 0.999, and δ = 10−8 in Eq. (200), even though the pseudocode did not use δ (or set δ = 0). The
authors of [183] showed that both AMSGrad and AdamX converged with similar results, thus supporting
their theoretical investigation, in particular, correcting the errors in the proofs of [182].
Nostalgic Adam. The authors of [204] also fixed the non-convergence of Adam by introducing “long-
term memory” to the second-moment of the gradient estimates, similar to the work in [182] on AMSGrad
and in [183] on AdamX.
There are many more variants of Adam. But how do Adam and its variants compare to good old SGD?
Figure 72: Overfitting (Sections 6.5.9, 6.5.10). Left: Underfitting with 1st-order polynomial. Middle: Appropriate fitting with 2nd-order polynomial. Right: Overfitting with 9th-order polynomial. See [78], p. 110, Figure 5.2. (Figure reproduced with permission of the authors.)
The authors of [55] concluded:
“Despite the fact that our experimental evidence demonstrates that adaptive methods are not
advantageous for machine learning, the Adam algorithm remains incredibly popular. We are
not sure exactly as to why, but hope that our step-size tuning suggestions make it easier for
practitioners to use standard stochastic gradient methods in their research.”
The work of [55] has encouraged researchers who were enthusiastic with adaptive methods to take a fresh
look at SGD again to tease something more out of this classic method.200
197 See [78], pp. 301-302.
198 See [55], p. 4, Section 3.3, “Adaptivity can overfit”.
199 See [78], p. 107, regarding training error and test (generalization) error: “The ability to perform well on previously unobserved inputs is called generalization.”
200 See, e.g., [210], where [55] was not referred to directly, but through a reference to [164], in which there was a reference to [55].
Figure 73: Standard SGD and SGD with momentum vs AdaGrad, RMSProp, Adam on the CIFAR-10 dataset (Sections 6.1, 6.3.2, 6.5.9). From [55], where a method for step-size tuning and step-size decay was proposed to achieve the lowest training error and generalization (test) error for both standard SGD and SGD with momentum (“Heavy Ball”, or better yet, “Small Heavy Sphere” method), compared to adaptive methods such as AdaGrad, RMSProp, Adam. (Figure reproduced with permission of the authors.)
For L2 regularization, a penalty proportional to the squared parameter norm is added to the cost function:
J̃_reg(θ̃) = J̃(θ̃) + (d/2)(∥θ̃∥₂)² (248)
The magnitude of the coefficient d regulates (or regularizes) the behavior of the network: d = 0 would lead to overfitting (Figure 72, right), a moderate d may yield appropriate fitting (Figure 72, middle), and a large d may lead to underfitting (Figure 72, left). The gradient of the regularized cost J̃_reg is then
g̃_reg := ∂J̃_reg/∂θ̃ = ∂J̃/∂θ̃ + d θ̃ = g̃ + d θ̃ , (249)
and the update becomes:
θ̃_{k+1} = θ̃_k − ϵ_k (g̃_k + d θ̃_k) = (1 − ϵ_k d) θ̃_k − ϵ_k g̃_k , (250)
201 See [78], p. 107, Section 5.2 on “Capacity, overfitting and underfitting”; p. 115 provides a good explanation and motivation for regularization, as in Gupta 2017, ‘Deep Learning: Overfitting’, 2017.02.12, Original website, Internet archive.
Figure 74: AdamW vs Adam, SGD, and variants on the CIFAR-10 dataset (Sections 6.1, 6.5.10). While AdamW achieved the lowest training loss (error) after 1800 epochs, the results showed that SGD with weight decay (SGDW) and with warm restart (SGDWR) achieved lower test (generalization) errors than Adam, AdamW, AdamWR. See Figure 75 for the scheduling of the annealing multiplier a_k, for which the epoch numbers (100, 300, 700, 1500) for complete cooling (a_k = 0) coincided with the same epoch numbers for the sharp minima. There was, however, a diminishing return beyond the 4th cycle, as indicated by the dotted arrows, for both the training error and the test error, which actually increased at the end of the 4th cycle (right subfigure, red arrow); see Section 6.1 on early-stopping criteria and Remark 6.14. Adapted from [56]. (Figure reproduced with permission of the authors.)
which is equivalent to decaying the parameters (including the weights) in θ̃_k when (1 − ϵ_k d) ∈ (0, 1), but with a varying decay parameter d_k = ϵ_k d ∈ (0, 1) depending on the step length ϵ_k, which itself would decrease toward zero.
The same equivalence between the L2 regularization Eq. (248) and the weight decay Eq. (250), which is linear with respect to the gradient, cannot be established for adaptive methods, due to the nonlinearity with respect to the gradient in the update procedure using Eq. (200) and Eq. (201). See also the parameter update in lines 13-14 of the unified pseudocode for adaptive methods in Algorithm 5 and Footnote 176. Thus, it was proposed in [56] to explicitly add weight decay to the parameter update Eq. (201) for AdamW (lines 13-14 in Algorithm 5) as follows:
θ̃_{k+1} = θ̃_k + a_k (ϵ_k d̃_k − d θ̃_k) (element-wise operations) , (251)
where the parameter a_k ∈ [0, 1] is an annealing multiplier defined in Section 6.3.4 on step-length decay and annealing:
a_k = 0.5 + 0.5 cos(π T_cur/T_p) ∈ [0, 1] , with T_cur := j − Σ_{q=1}^{p−1} T_q , (154)
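To make the distinction concrete, a sketch (our illustration) contrasting the L2-regularized SGD update Eq. (250) with the decoupled weight decay of AdamW, Eq. (251); descent_dir stands for any descent-direction estimate, e.g., built from the Adam moments sketched earlier:

import numpy as np

def sgd_l2_update(theta, grad, eps_k, d):
    # Eq. (250): L2 regularization is equivalent to the decay factor (1 - eps_k * d).
    return (1 - eps_k * d) * theta - eps_k * grad

def adamw_update(theta, descent_dir, eps_k, d, a_k):
    # Eq. (251): weight decay -d*theta applied directly to the parameters,
    # decoupled from the (nonlinear) adaptive descent direction.
    return theta + a_k * (eps_k * descent_dir - d * theta)

def annealing_multiplier(T_cur, T_p):
    # Eq. (154): cosine annealing, a_k in [0, 1]; complete cooling at T_cur = T_p.
    return 0.5 + 0.5 * np.cos(np.pi * T_cur / T_p)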
The results of the numerical experiments on the CIFAR-10 dataset using Adam, AdamW, SGDW
(Weight decay), AdamWR (Warm Restart), and SGDWR were reported in Figure 74 [56].
Remark 6.14. Limitation of cyclic annealing. The 5th cycle of annealing, not shown in Figure 74, would end at epoch 1500 + 1600 = 3100, which is well beyond the epoch budget of 1800. In view of the diminishing return in the decrease of the training error at the end of each cycle, in addition to an increase in the test error by the end of the 4th cycle, as shown in Figure 74, it is unlikely to be worthwhile to start the 5th cycle: Not only would the computation be more expensive, due to the warm restart, but the increase in the test error also indicated that the end of the 3rd cycle was optimal, and hence the reason for [56] to stop at the end of the 4th cycle. ■
Figure 75: Cosine annealing (Sections 6.3.4, 6.5.10). Annealing factor a_k as a function of epoch number. Four annealing cycles p = 1, . . . , 4, with the following schedule for T_p in Eq. (154): (1) Cycle 1, T_1 = 100 epochs, epoch 0 to epoch 100; (2) Cycle 2, T_2 = 200 epochs, epoch 101 to epoch 300; (3) Cycle 3, T_3 = 400 epochs, epoch 301 to epoch 700; (4) Cycle 4, T_4 = 800 epochs, epoch 701 to epoch 1500. From [56]. See Figure 74, in which the curves for AdamWR and SGDWR ended at epoch 1500. (Figure reproduced with permission of the authors.)
The results for the test errors in Figure 74 appeared to confirm the criticism in [55] that adaptive methods brought about “marginal value” compared to the classic SGD. Such an observation was also in agreement with [168], where it was stated:
“In our experiments, either AdaBayes or AdaBayes-SS outperformed other adaptive methods, including AdamW (Loshchilov & Hutter, 2017), and Ada/AMSBound (Luo et al., 2019), though SGD frequently outperformed all adaptive methods.” (See Figure 76 and Figure 77.)
If AMSGrad generated “a lot of noise for nothing” compared to Adam and AdamW, according to [209], then does “marginal value” mean that adaptive methods in general generated a lot of noise for not much, compared to SGD?
The work in [55], [168], and [56] proved once more that a classic like SGD, introduced by Robbins & Monro (1951b) [49], never dies, and provides motivation to generalize classical deterministic first- and second-order optimization methods, together with line-search methods, to add stochasticity. We review in detail two papers along this line: [144] and [145].
J̃(θ̃ + ϵ d̃) ≤ J̃(θ̃) + α ϵ g̃ • d̃ , (253)
202 It is not until Section 4.7 in [144] that this version is presented for the general descent of the nonconvex case, whereas the pseudocode in their Algorithm 1 at the beginning of their paper, referred to in Section 4.5, was restricted to steepest descent for the convex case.
Figure 76: CIFAR-100 test loss using Resnet-34 and DenseNet-121 (Section 6.5.10). Comparison between various optimizers, including Adam and AdamW, showing that SGD achieved the lowest global minimum loss (blue line) compared to all adaptive methods tested, as shown in [168]. See also Figure 77 and Section 6.1 on early-stopping criteria. (Figure reproduced with permission of the authors.)
Figure 77: SGD frequently outperformed all adaptive methods (Section 6.5.10). The table contains the global minimum for each optimizer, for each of the two datasets CIFAR-10 and CIFAR-100, using two different networks. For each network, an error percentage and the loss (cost) are given. Shown in red are the lowest global minima, obtained by SGD, in the corresponding columns. Even in the three columns in which the SGD results were not the lowest, two SGD results were just slightly above those of AdamW (1st and 3rd columns), and one was even smaller (4th column). SGD clearly beat Adam, AdaGrad, AMSGrad [168]. See also Figure 76. (Figure reproduced with permission of the authors.)
1 SGD with Armijo line search and adaptive minibatch (for training epoch τ)
Data:
• Parameter estimate θ̃⋆_τ from previous epoch (τ − 1) (line 5 in Algorithm 4)
• Select 3 Armijo parameters α ∈ (0, 1), β ∈ (0, 1), ρ > 0, and reliability bound δ_1 = δ > 0
• Select probability p_J ∈ (0, 1] for cost estimate J̃ to be close to true cost J
• Select probability p_g ∈ (0, 1] for gradient estimate g̃ to be close to true gradient g
Result: Parameter estimate for next epoch (τ + 1): θ̃⋆_{τ+1}.
2 Define cost-estimate function J̃(θ̃, ϵ, p_J) using adaptive minibatch, Eq. (254);
3 Define gradient-estimate function g̃(θ̃, ϵ, p_g) using adaptive minibatch, Eq. (254);
4 ▶ Begin training epoch τ (see Algorithm 4 for standard SGD)
5 Initialize parameter estimate θ̃_1 = θ̃⋆_τ and step length ϵ_1 = ρ;
6 ②③ for k = 1, 2, . . . do
7   Compute gradient estimate g̃_k = g̃(θ̃_k, ϵ_k, p_g) with adaptive minibatch;
8   Set descent-direction estimate d̃_k = −g̃_k as in Eq. (256);
9   Compute cost estimates J̃(θ̃_k, ϵ_k, p_J) and J̃(θ̃_k + ϵ_k d̃_k, ϵ_k, p_J) with adaptive minibatch;
10  if Stopping criterion not satisfied then
11    ▶ Compute step length (learning rate) ϵ_k using stochastic Armijo's rule.
12    if Stochastic Armijo decrease condition Eq. (253) not satisfied then
13      ▶ Decrease step length: Set ϵ_k ← βϵ_k;
14    else
15      ▶ Stochastic Armijo decrease condition Eq. (253) satisfied.
16      ▶ Update network parameter θ̃_k to θ̃_{k+1}: Set θ̃_{k+1} ← θ̃_k + ϵ_k d̃_k;
17      ▶ Increase the next step length ϵ_{k+1}: Set ϵ_{k+1} = max{ρ, (ϵ_k/β)};
18      ▶ Check for reliability of step length
19      if Step length reliable, ϵ_k ∥g̃∥² ≥ δ²_k then
20        ▶ Increase reliability bound: Set δ²_{k+1} ← δ²_k/β;
21      else
22        ▶ Decrease reliability bound: Set δ²_{k+1} ← βδ²_k;
23      end
24    end
25  else
26    ▶ Stopping criterion satisfied;
27    ▶ Update parameter estimate for next epoch (τ + 1): θ̃⋆_{τ+1} ← θ̃_k
28    ▶ Stop ②③ for loop (end minimization for training epoch τ).
29  end
30  ▶ Continue to next iteration (k + 1): Set k ← k + 1;
31 end
Algorithm 6: SGD with Armijo line search and adaptive minibatch (Sections 6.3.2, 6.6, 6.7; Algorithms 2, 4, 7). From [144]. The differences compared to the deterministic gradient descent with Armijo line search in Algorithm 2, including the ②③ for loop, are highlighted in purple; see Remark 6.15. For the related stochastic methods with fixed minibatch, see the standard SGD Algorithm 4 and the Newton descent with Armijo-like line search Algorithm 7. Table 4 shows a comparison of the notations used by several authors in relation to Armijo line search.
where the overhead tilde of a quantity designates an estimate of that quantity based on a randomly selected
minibatch, i.e., Je is the cost estimate, de the descent-direction estimate, and ge the gradient estimate, similar
to those in Algorithm 4.
There is a difference, though: The standard SGD in Algorithm 4 uses a fixed minibatch for the computation of the cost estimate and the gradient estimate, whereas Algorithm 6 in [144] uses adaptive subprocedures to adjust the size of the minibatches to achieve a desired (fixed) probability p_J that the cost estimate is close to the true cost, and a desired probability p_g that the gradient estimate is close to the true gradient. These adaptive-minibatch subprocedures are also functions of the learning rate (step length) ϵ, conceptually written as:
J̃ = J̃(θ̃, ϵ, p_J) and g̃ = g̃(θ̃, ϵ, p_g) , (254)
which are the counterparts to the fixed-minibatch procedures in Eq. (139) and Eq. (140), respectively.
Remark 6.15. Since the appropriate size of the minibatch depends on the gradient estimate, which is not known and which is computed based on the minibatch itself, the adaptive-minibatch subprocedures for the cost estimate J̃ and for the gradient estimate g̃ in Eq. (254) contain a loop, started by guessing the gradient estimate, to gradually increase the size of the minibatch by adding more samples until certain criteria are met.203 In addition, since both the cost estimate J̃ and the gradient estimate g̃ depend on the step size ϵ, the Armijo line-search loop to determine the step length ϵ—denoted as the ② for loop in the deterministic Algorithm 2—is combined with the iteration loop k in Algorithm 6, where these two combined loops are denoted as the ②③ for loop. ■
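A minimal sketch of lines 7-17 of Algorithm 6 (our illustration; cost_est and grad_est stand for the adaptive-minibatch estimates of Eq. (254), left abstract here, and theta is a NumPy array):

def armijo_sgd_iteration(theta, eps, cost_est, grad_est, alpha=0.1, beta=0.5, rho=1.0):
    g = grad_est(theta)
    d = -g                                  # steepest-descent estimate, Eq. (256)
    # Stochastic Armijo decrease condition, Eq. (253).
    if cost_est(theta + eps * d) > cost_est(theta) + alpha * eps * g.dot(d):
        return theta, beta * eps            # decrease step length (line 13)
    theta = theta + eps * d                 # accept the update (line 16)
    return theta, max(rho, eps / beta)      # increase the next step length (line 17)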
For SGD, the descent-direction estimate d̃ is identified with the steepest-descent direction estimate (−g̃):
d̃ = −g̃ , (256)
For Newton-type algorithms, such as in [145] [146], the descent-direction estimate d̃ is obtained from the Hessian estimate H̃ and the steepest-descent direction estimate (−g̃) by solving
H̃ d̃ = −g̃ , (257)
Remark 6.16. In the SGD with Armijo line search and adaptive minibatch, Algorithm 6, the reliability parameter δ_k and its use are another difference from Algorithm 2, the deterministic gradient descent with Armijo line search, and similarly from Algorithm 4, the standard SGD. The reason was provided in [144]: Even when the probability of the gradient estimate and the cost estimate being accurate is near 1, it is not guaranteed that the expected value of the cost at the next iterate θ̃_{k+1} would be below the cost at the current iterate θ̃_k, due to a possible arbitrary increase of the cost. “Since random gradient may not be representative of the true gradient the function estimate accuracy and thus the expected improvement needs to be controlled by a different quantity,” δ²_k. ■
The authors of [144] provided a rigorous convergence analysis of their proposed Algorithm 6, but had
not implemented their method, and thus had no numerical results at the time of this writing.204 Without
empirical evidence that the algorithm works and is competitive compared to SGD (see adaptive methods
and their criticism in Section 6.5.9), there would be no adoption.
203 See [144], p. 7, below Eq. (2.4).
204 Based on our private communications with the authors of [144] on 2019.11.16.
Table 4: Armijo parameters (Section 6.2.3; Algorithms 2, 6, 7). Comparing our notations here, as used in Eq. (125) and Eq. (126), to those of other authors: [139] [144] [145] [151]. In [151] [139] [144], the parameters are for the first-order term (−g) • d = ∥d∥² (for steepest descent, d = −g = −∂J/∂θ) in the Taylor-series expansion of ∆J = J(θ + ϵd) − J(θ). In [145], the parameters are for the second-order term in the Taylor-series expansion, leading to the cube of the norm of the descent direction, ∥d∥³. For the stochastic optimization algorithms presented in this paper, the 4th parameter δ is introduced to represent the reliability parameter δ₀ of [144] (Algorithm 6) and the stability parameter ϵ of [145] (Algorithm 7). The deterministic algorithms in [151] and [139] do not have this 4th parameter.
Parameter type           This paper   Polak [139]   Paquette [144]   Bergou [145]   Armijo [151]
Fixed factor             α            α             θ                1/6            1/2
Varying-power factor     β            β             γ⁻¹              θ              1/2
Fixed factor             ρ            ρ             α_max            η              α
Reliability / Stability  δ            —             δ₀               ϵ              —
Figure 78: Stochastic Newton with Armijo-like 2nd order line search (Section 6.7). IJCNN1
dataset from the LIBSVM library. Three batch sizes were used (1%, 5%, 100%) for both SGD
and ALAS (stochastic Newton Algorithm 7). The exact gradient norm for each of these six
cases was plotted against the training epochs on the left, and against the iteration numbers
on the right. An epoch is the number of non-overlapping minibatches (and thus iterations)
to cover the whole training set. One epoch for a minibatch size of s% (respectively 1%, 5%,
100%) of the training set is equivalent to 100/s (respectively 100, 20, 1) iterations. Thus, for
SGD-ALGO (1%), as shown, 10 epochs is equivalent to 1,000 iterations, with the same gradient
norm. The markers on the curves were placed every 100 epochs (left) and 800 iterations (right).
For the same number of epochs, say 10, SGD with smaller minibatches yielded lower gradient
norm. The same was true for Algorithm 7 for number of epochs less than 10, but the gradient
norm plateaued out after that with a lot of noise. Second-order Algorithm 7 converged faster
than 1st-order SGD. See [145]. (Figure reproduced with permission of the authors.)
J̃(θ) = (1/m) Σ_{k=1}^{k=m≤M} J_{i_k}(θ) , with J_{i_k}(θ) = J(f(x^{|i_k|}, θ), y^{|i_k|}) , and x^{|i_k|} ∈ B^{|m|} , y^{|i_k|} ∈ T^{|m|} , (139)
g̃(θ) = ∂J̃(θ)/∂θ = (1/m) Σ_{k=1}^{k=m≤M} ∂J̃_{i_k}(θ)/∂θ , (140)
H̃(θ) = ∂²J̃(θ)/(∂θ)² = (1/m) Σ_{k=1}^{k=m≤M} ∂²J̃_{i_k}(θ)/(∂θ)² . (258)
In the above computation, the minibatch in Algorithm 7 is fixed, not adaptive as in Algorithm 6.
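A sketch of the fixed-minibatch estimates, Eqs. (139)-(140) (our illustration; per_sample_cost and per_sample_grad are placeholder callables for J_{i_k} and its gradient, and batch is a list of (x, y) pairs):

def minibatch_estimates(theta, batch, per_sample_cost, per_sample_grad):
    # Average the per-sample costs and gradients over the m samples
    # of the (fixed) minibatch: Eqs. (139)-(140).
    m = len(batch)
    J_tilde = sum(per_sample_cost(x, y, theta) for x, y in batch) / m
    g_tilde = sum(per_sample_grad(x, y, theta) for x, y in batch) / m
    return J_tilde, g_tilde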
If the current iterate θ_k is in a flat region (or plateau) or at the bottom of a local convex bowl, then the smallest eigenvalue λ̃ of the Hessian estimate H̃ would be close to zero or positive, respectively, and the gradient estimate would be zero (line 7 in Algorithm 7):
λ̃ ≥ −δ^{1/2} and ∥g̃∥ = 0 , (259)
where δ is a small positive number. In this case, no step (or update) would be taken, which is equivalent to setting the step length or the descent direction to zero,206 and the algorithm then goes to the next iteration (k + 1). Otherwise, i.e., if the conditions in Eq. (259) are not met, the remaining steps are carried out.
If the current iterate θ_k is on the downward side of a saddle point, characterized by the condition that the smallest eigenvalue λ̃_k is clearly negative (line 10 in Algorithm 7):
λ̃_k < −δ^{1/2} < 0 , (260)
then find the eigenvector ṽ_k corresponding to λ̃_k, scale it such that its norm is equal to the absolute value of this negative smallest eigenvalue and such that it forms an obtuse angle with the gradient g̃_k, then select this eigenvector as the descent direction d̃_k:
H̃_k ṽ_k = λ̃_k ṽ_k such that ∥ṽ_k∥ = −λ̃_k and ṽ_k • g̃_k < 0 ⇒ d̃_k = ṽ_k . (261)
When the iterate θ_k is in a local convex bowl, the Hessian H̃_k is positive definite, i.e., the smallest eigenvalue λ̃_k is strictly positive; then use the Newton direction as the descent direction d̃_k (line 13 in Algorithm 7):
λ̃_k > ∥g̃_k∥^{1/2} > 0 ⇒ H̃_k d̃_k = −g̃_k . (262)
The remaining case is when the iterate θ_k is close to a saddle point, such that the smallest eigenvalue λ̃_k is bounded below by −δ^{1/2} and above by ∥g̃_k∥^{1/2}; then λ̃_k is nearly zero, and thus the Hessian estimate H̃_k is nearly singular. Regularize (or stabilize) the Hessian estimate, i.e., move its smallest eigenvalue away from zero, by adding a small perturbation diagonal matrix using the sum of the bounds δ^{1/2} and ∥g̃_k∥^{1/2}. The regularized (or stabilized) Hessian estimate is no longer nearly singular, thus invertible, and can be used to find the Newton descent direction (lines 16-17 in Algorithm 7 for stochastic Newton, and line 16 in Algorithm 3 for deterministic Newton):
−δ^{1/2} ≤ λ̃_k ≤ ∥g̃_k∥^{1/2} ⇒ 0 < λ̃_k + ∥g̃_k∥^{1/2} + δ^{1/2} ⇒ [H̃_k + (∥g̃_k∥^{1/2} + δ^{1/2}) I] d̃_k = −g̃_k . (263)
205 The Armijo line search itself is 1st order; see Section 6.2 on full-batch deterministic optimization.
206 See Step 2 of Algorithm 1 in [145].
If the stopping criterion is not met, use Armijo's rule to find the step length ϵ_k to update the parameter θ_k to θ_{k+1}, then go to the next iteration (k + 1) (line 19 in Algorithm 7). The authors of [145] provided a detailed discussion of their stopping criterion. Otherwise, when the stopping criterion is met, accept the current iterate θ̃_k as the local-minimizer estimate θ̃⋆, and stop the Newton-descent ② for loop to end the current training epoch τ.
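The eigenvalue case analysis of Eqs. (259)-(263) condenses into a short sketch (our illustration; H and g are the minibatch Hessian and gradient estimates of Eqs. (258) and (140)):

def newton_descent_direction(H, g, delta=1e-6):
    eigvals, eigvecs = np.linalg.eigh(H)            # H is symmetric
    lam, v = eigvals[0], eigvecs[:, 0]              # smallest eigenvalue, eigenvector
    gnorm = np.linalg.norm(g)
    if lam >= -delta ** 0.5 and gnorm == 0.0:       # plateau or local minimizer, Eq. (259)
        return np.zeros_like(g)                     # take no step
    if lam < -delta ** 0.5:                         # downward side of a saddle, Eq. (260)
        v = -lam * v                                # scale so that ||v|| = -lam, Eq. (261)
        return v if v.dot(g) < 0 else -v            # obtuse angle with the gradient
    if lam > gnorm ** 0.5:                          # local convex bowl, Eq. (262)
        return np.linalg.solve(H, -g)               # Newton direction
    shift = gnorm ** 0.5 + delta ** 0.5             # nearly singular Hessian, Eq. (263)
    return np.linalg.solve(H + shift * np.eye(len(g)), -g)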
From Eq. (125), the deterministic 1st-order Armijo rule for steepest descent can be written as:
J(θ_k + ϵ_k d_k) ≤ J(θ_k) − α β^a ρ ∥d∥² , with d = −g , (264)
with a being the minimum power for Eq. (264) to be satisfied. In Algorithm 7, the Armijo-like 2nd-order line search reads as follows:
J̃(θ̃_k + ϵ_k d̃_k) ≤ J̃(θ̃_k) − (1/6)(β^a)³ ρ ∥d̃_k∥³ , (265)
with a being the minimum power for Eq. (265) to be satisfied. The parallelism between Eq. (265) and Eq. (264) is clear; see also Table 4.
Figure 78 shows the numerical results of Algorithm 7 on the IJCNN1 dataset [211] from the Library for Support Vector Machines (LIBSVM) by [212]. It is not often that plots versus epochs are seen side by side with plots versus iterations. Some papers may have only plots versus iterations (e.g., [182]); other papers may rely only on plots versus epochs to draw conclusions (e.g., [56]). Thus Figure 78 provides a good example to see the differences, as noted in Remark 6.17.
Remark 6.17. Epoch counter vs global iteration counter in plots. When the gradient norm is plotted versus epochs (left of Figure 78), the three curves for SGD were separated, with faster convergence for smaller minibatch sizes, Eq. (135), but the corresponding three curves fell on top of each other when plotted versus iterations (right of Figure 78). The reason is that the scale on the horizontal axis was different for each curve; e.g., 1 iteration for the full batch was equivalent to 100 iterations for a minibatch size at 1% of the full batch. The plot versus iterations was thus a zoomed-in view, but for each curve separately. To compare the rates of convergence among different algorithms and different minibatch sizes, look at the plots versus epochs, since each epoch covers the whole training set. It is just an optical illusion to think that SGD with different minibatch sizes had the same rate of convergence. ■
The authors of [145] planned to test their Algorithm 7 on large datasets such as the CIFAR-10, and
report the results in 2020.207
Another algorithm along the same line as Algorithm 6 and Algorithm 7 is the stochastic quasi-Newton
method proposed in [146], where the stochastic Wolfe line search of [143] was employed, but with no
numerical experiments on large datasets such as CIFAR-10, etc.
At the time of this writing, due to the lack of numerical results with large datasets commonly used in the deep-learning community, such as CIFAR-10, CIFAR-100 and the like, for testing, and thus the lack of performance comparisons in terms of cost and accuracy against Adam and its variants, our assessment is that SGD and its variants, or Adam and its better variants, particularly AdamW, continue to be the prevalent methods for training.
Time constraints did not allow us to review other stochastic optimization methods, such as those with the gradient-only line search in [142] and [179].
207 Per our private correspondence as of 2019.12.18.
1 Stochastic Newton with 2nd-order line search and fixed minibatch (for training epoch τ)
Data:
• Parameter estimate θ̃⋆_τ from previous epoch (τ − 1) (line 5 in Algorithm 4)
• Select 3 Armijo parameters α ∈ (0, 1), β ∈ (0, 1), ρ > 0, and stability parameter δ > 0
Result: Parameter estimate for next epoch (τ + 1): θ̃⋆_{τ+1}. (line 1 in Algorithm 6)
2 ▶ Begin training epoch τ. Set θ̃_1 = θ̃⋆_τ;
3 ② for k = 1, 2, . . . do
4   Obtain minibatch B_k^{|m|}, Eq. (137) (line 7 in Algorithm 4);
5   Compute gradient estimate g̃_k = g̃(θ̃_k), Eq. (140);
6   Compute Hessian estimate H̃_k = H̃(θ̃_k), Eq. (258), and its smallest eigenvalue λ̃_k;
7   if Plateau or local minimizer, λ̃_k ≥ −δ^{1/2} and ∥g̃_k∥ = 0, Eq. (259) then
we obtain a compact representation of the equations of motion, which relates the temporal change of the system's state q̇ to its current state q and the input b linearly by means of
q̇ = A q + B f = A q + b . (269)
We found similar relations in neuroscience,209 where the dynamics of neurons was accounted for, e.g., in the pioneering work [214], whose author modeled a neuron as an electrical circuit with capacitances. Time-continuous RNNs were considered in a paper on back-propagation [215]. The temporal change of an RNN's state is related to the current state y and the input x by
ẏ = −y + a(W y) + x , (270)
where a denotes a non-linear activation function as, e.g., the sigmoid function, and W is the weight matrix
that describes the connection among neurons.210
208 The state-space representation of time-continuous LTI systems in control theory, see, e.g., Chapter 3 of [213], “State Variables and the State Space Description of Dynamic Systems”, is typically written as ẋ = Ax + Bu, with the output equation y = Cx + Du. The state vector is denoted by x; u and y are the vectors of inputs and outputs, respectively. The ODE describes the (temporal) evolution of the system's state. The output of the system y is a linear combination of the states x and the inputs u.
209 See Section 13.2.2 on “Dynamic, time dependence, Volterra series”.
210 See the general time-continuous neural network with a continuous delay described by Eq. (514).
126 First online at CMES on 2023.03.01, DOI: 10.32604/cmes.2023.028130 - arXiv v3
Returning to mechanics, we are confronted with the problem that the equations of motion do not admit closed-form solutions in general. To construct approximate solutions, time-integration schemes need to be resorted to, of which we mention a few examples: Newmark's method [216], the Hilber-Hughes-Taylor (HHT) method [217], and the generalized-α method [218]. For simplicity, we have a closer look at the classical trapezoidal rule, in which the time integral of some function f over a time step ∆t = t_{n+1} − t_n is approximated by the average of the function values f(t_{n+1}) and f(t_n). If we apply the trapezoidal rule to the system of ODEs in Eq. (269), we obtain a system of algebraic relations that reads
q^[n+1] = q^[n] + (∆t/2)(A q^[n] + b^[n] + A q^[n+1] + b^[n+1]) . (271)
Rearranging the above relation for the new state gives the update equation for the state vector,
q^[n+1] = (I − (∆t/2) A)^{−1} (I + (∆t/2) A) q^[n] + (I − (∆t/2) A)^{−1} (∆t/2)(b^[n] + b^[n+1]) , (272)
which determines the next state q^[n+1] in terms of the state q^[n] as well as the inputs b^[n] and b^[n+1].211 To keep it short, we introduce the matrices W and U as
W = (I − (∆t/2) A)^{−1} (I + (∆t/2) A) , U = (I − (∆t/2) A)^{−1} (∆t/2) , (273)
which allows us to rewrite Eq. (272) as
q^[n+1] = W q^[n] + U (b^[n] + b^[n+1]) . (274)
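A sketch of the update Eqs. (272)-(274) (our illustration; the matrix A and the inputs are arbitrary placeholders):

import numpy as np

def trapezoidal_matrices(A, dt):
    # Eq. (273): W and U for the trapezoidal rule applied to q_dot = A q + b.
    n = A.shape[0]
    M = np.linalg.inv(np.eye(n) - 0.5 * dt * A)
    return M @ (np.eye(n) + 0.5 * dt * A), 0.5 * dt * M

def step(q_n, b_n, b_np1, W, U):
    # Eq. (274): next state from previous state and the two inputs.
    return W @ q_n + U @ (b_n + b_np1)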
The update equation of time-discrete RNNs is similar to the discretized equations of motion, Eq. (274). Unlike in feed-forward neural networks, the state h of an RNN at the n-th time step,212 denoted by h[n], depends not only on the current input x[n], but also on the state h[n−1] of the previous time step n − 1. Following the notation in [78], we introduce a transition function f that produces the new state,
h[n] = f(h[n−1], x[n]; θ). (275)
Remark 7.1. In [78], there is a distinction between the “hidden state” of an RNN cell at the n-th step
denoted by h[n] , where “h” is mnemonic for “hidden”, and the cell’s output y [n] . The output of a multi-layer
RNN is a linear combination of the last layer’s hidden state. Depending on the application, the output is not
necessarily computed at every time step, but the network can “summarize” sequences of inputs to produce
an output after a certain number of steps.213 If the output is identical to the hidden state,
y [n] ≡ h[n] , (276)
h[n] and y[n] can be used interchangeably. In the current Section 7 on “Dynamics, sequential data, sequence modeling”, the notation h[n] is used, whereas in Section 4 on “Static, feedforward networks”, the notation y(ℓ) is used to designate the output of the “hidden layer” (ℓ), keeping in mind the equivalence in Eq. (22) in Remark 4.2. Whenever necessary, readers are reminded of the equivalence in Eq. (276) to avoid possible confusion when reading the deep-learning literature. ■
211 We can regard the trapezoidal rule as a combination of Euler's explicit and implicit methods. The explicit Euler method approximates time integrals by means of rates (and inputs) at the beginning of a time step. The next state (at the end of a time step) is obtained from the previous state and the previous input as q_{n+1} = q_n + ∆t f(q_n, b_n) = (I + ∆t A) q_n + ∆t b_n. On the contrary, the implicit Euler method uses rates (and inputs) at the end of a time step, which leads to the update relation q_{n+1} = q_n + ∆t f(q_{n+1}, b_{n+1}) = (I − ∆t A)^{−1}(q_n + ∆t b_{n+1}).
212 Though the elements x[n] of a sequence are commonly referred to as “time steps”, the nature of a sequence is not necessarily temporal. The time-step index t then merely refers to a position within some given sequence.
213 Cf. [78], p. 371, Figure 10.5.
First online at CMES on 2023.03.01, DOI: 10.32604/cmes.2023.028130 - arXiv v3 127
The above relation is illustrated as a circular graph in Figure 79 (left), where the delay is explicit in the superscript of h[n−1]. The hidden state h[n−1] at n − 1, in turn, is a function of the hidden state h[n−2] and the input x[n−1],
h[n−1] = f(h[n−2], x[n−1]; θ). (277)
Continuing this unfolding process repeatedly until we reach the beginning of a sequence, the recurrence can be expressed as a function g[n],
h[n] = g [n] (x[n] , x[n−1] , x[n−2] , . . . , x[2] , x[1] , h[0] ; θ) = g [n] ({x[k] , k = 1, . . . , n}, h[0] ; θ), (278)
which takes the entire sequence up to the current step n, i.e., {x[k] , k = 1, . . . , n} (along with an initial
state h[0] and parameters θ), as input to compute the current state h[n] . The unfolded graph representation
is illustrated in Figure 79 (right).
Figure 79: Folded and unfolded discrete RNN (Section 7.1, 13.2.2). Left: Folded discrete RNN
at configuration (or state) number [k], where k is an integer, with input x[k] to a multilayer
neural network f (·) = f (1) ◦ f (2) ◦ · · · ◦ f (L) (·) as in Eq. (18), having a feedback loop h[k−1]
with delay by one step, to produce output h[k] . Right: Unfolded discrete RNN, where the
feedback loop is unfolded, centered at k = n, as represented by Eq. (275). This graphical
representation, with f (·) being a multilayer neural network, is more general than Figure 10.2
in [78], p. 366, and is a particular case of Figure 10.13b in [78], p. 388. See also the continuous
RNN explained in Section 13.2.2 on “Dynamic, time dependence, Volterra series”, Eq. (514),
Figure 135, for which we refer readers to Remark 7.1 and the notation equivalence y [k] ≡ h[k]
in Eq. (276).
As an example, consider the default (“vanilla”) single-layer RNN provided by PyTorch214 and TensorFlow215, which is also described in [78], p. 370:
h[n] = tanh(z[n]) , with z[n] = b + W h[n−1] + U x[n] . (279)
First, z[n] is formed from an affine transformation of the current input x[n], the previous hidden state h[n−1], and the bias b, using the weight matrices U and W, respectively. Subsequently, the hyperbolic tangent is applied to the elements of z[n] as activation function to produce the new hidden state h[n].
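A sketch of this vanilla RNN cell (our illustration; the weight matrices U, W and the bias b are placeholders):

def rnn_step(x_n, h_prev, U, W, b):
    # Affine transformation of current input and previous hidden state,
    # followed by the hyperbolic tangent, Eq. (279).
    z = b + W @ h_prev + U @ x_n
    return np.tanh(z)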
A common design pattern of RNNs adds a linear output layer to the simplistic example in Figure 79, i.e., the RNN has a recurrent connection between its hidden units, which represent the state h,216 and produces an output at each time step.217 Figure 80 shows a two-layer RNN, which extends our above example by a second layer.
214 See PyTorch documentation: Recurrent layers, Original website (Internet archive).
215 See TensorFlow API: TensorFlow Core r1.14: tf.keras.layers.SimpleRNN, Original website (Internet archive).
216 For this reason, h is typically referred to as the hidden state of an RNN.
217 Such a neural network is universal, i.e., any function computable by a Turing machine can be computed by an RNN of finite size; see [78], p. 368.
128 First online at CMES on 2023.03.01, DOI: 10.32604/cmes.2023.028130 - arXiv v3
Figure 80: RNN with two multilayer neural networks (MLNs) (Section 7.1), denoted by f₁(·) and f₂(·), whose outputs are fed into the loss function for optimization. This RNN is a generalization of the RNN in Figure 79, and includes the RNN in Figure 10.3 in [78], p. 369, as a particular case, where Eq. (282) for the first layer is simply h₁[n] = f₁(x[n]) = a₁(z₁[n]), with z₁[n] = b + W h₁[n−1] + U x[n] and a₁(·) = tanh(·), whereas Eq. (282) for the second layer is h₂[n] = f₂(x[n]) = a₂(z₂[n]), with z₂[n] = c + V h₁[n] and a₂(·) = softmax(·). The general relation for both MLNs is h_j[n] = f_j(h_j[n−1], h_{j−1}[n]) = a_j(z_j[n]), for j = 1, 2.
Irrespective of the number of layers, the hidden state h_j[n] of the j-th layer is generally computed from the previous hidden state h_j[n−1] and the input to the layer h_{j−1}[n],
h_j[n] = f_j(h_j[n−1], h_{j−1}[n]) = a_j(z_j[n]) . (282)
Comparing the recurrence relation in Eq. (277) and its unfolded representation in Eq. (278), we can
make the following observations:
• The unfolded representation after n steps g [n] can be regarded as a factorization into repeated appli-
cations of f . Unlike g [n] , the transition function f does not depend on the length of the sequence and
always has the same input size.
• The same transition function f with the same parameters θ is used in every time step.
Remark 7.2. Depth of RNNs. For the above reasons and Figures 79-80, “RNNs, once unfolded in time,
can be seen as very deep feedforward networks in which all the layers share the same weights” [13]. See
Section 4.6.1 on network depth and Remark 4.5. ■
By nature, RNNs are typically employed for the processing of sequential data x[1], . . . , x[τ], where the sequence length τ need not be constant. To process data of variable length, parameter sharing is a fundamental concept that characterizes RNNs. Instead of using separate parameters for each time step in a sequence, the same parameters are shared across several time steps. The idea of parameter sharing not only allows us to process sequences of variable length (and possibly not seen during training); the “statistical strength”218 is also shared across different positions in time, which is important if relevant information occurs at different positions within a sequence. A fully-connected feedforward neural network that takes each element of a sequence as input would instead need to learn all its rules separately for each position in the sequence.
218 See [78], p. 363.
Comparing the update equations Eq. (274) and Eq. (279), we note the close resemblance of dynamic systems and RNNs. Leaving aside the non-linearity of the activation function and the presence of the bias vector, both have state vectors with recurrent connections to the previous states. Employing the trapezoidal rule for the time discretization, we find a recurrence in the input, which is not present in the type of RNNs described above. The concept of parameter sharing in RNNs translates into the notion of time-invariant systems in dynamics, i.e., the state matrix A does not depend on time. In the computational-mechanics context, typical outputs of a simulation could be, e.g., the displacement of a structure at some specified point or the von Mises stress field in its interior. The computations required for determining the output from the state (e.g., nodal displacements of a finite-element model) depend on the respective nature of the output quantity and need not be linear.
The crucial challenge in RNNs is to learn long-term dependencies, i.e., relations among distant elements
in input sequences. For long sequences, we face the problem of vanishing or exploding gradients when
training the network by means of back-propagation. To understand vanishing (or exploding) gradients, we
can draw analogies between RNNs and dynamic systems once again. For this purpose, we consider an RNN
without inputs whose activation function is the identity function:
h[n] = W h[n−1] . (283)
From the dynamics point of view, the above update equation corresponds to a linear autonomous system,
whose time-discrete representation is given by
hn+1 = W hn . (284)
Clearly, the equilibrium state of the above system is the trivial state h = 0. Let h₀ denote a perturbation of the equilibrium state. An equilibrium state is called Lyapunov stable if the trajectories of the system, i.e., the states at times t ≥ 0, remain bounded. If the trajectory eventually arrives at the equilibrium state for
t → ∞ (i.e., the trajectory is attractive), the equilibrium of the system is called asymptotically stable. In
other words, an initial perturbation h0 from the equilibrium state (i.e., the initial state) vanishes over time
in the case of asymptotic stability. Linear time-discrete systems are asymptotically stable if all eigenvalues
λi of W have an absolute value smaller than one. From the unfolded representation in Eq. (278) (see also
Figure 79 (right)), it is understood that we observe a similar behavior in the RNN described above. At step
n, the initial state h[0] has been multiplied n times with the weight matrix W , i.e.,
h[n] = W^n h[0] . (285)
If the eigenvalues of W have absolute values smaller than one, h[0] decays exponentially to zero in long sequences. On the other hand, we encounter exponentially increasing values for eigenvalues of W with magnitudes greater than one, which is equivalent to an unstable system in dynamics. When performing back-propagation to train RNNs, gradients of the loss function need to be passed backwards through the unfolded network, where gradients are repeatedly multiplied with W, as is the initial state h[0] in the forward pass. The exponential decay (or increase) therefore causes gradients to vanish (or explode) in long sequences, which makes it difficult to learn long-term dependencies among distant elements of a sequence.
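The decay (or explosion) of Eq. (285) is easy to check numerically (a sketch; the random weight matrix, rescaled to a prescribed spectral radius, is our placeholder):

import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 8))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius 0.9: stable, decay
h = np.ones(8)
for _ in range(50):                              # h[n] = W**n h[0], Eq. (285)
    h = W @ h
print(np.linalg.norm(h))   # tiny; with spectral radius 1.1 instead, it explodes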
Figure 81: Folded Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM) cell (Sections 7.2, 11.3.3). The cell state at [k] is denoted by z_s[k] ≡ c[k]. Two feedback loops, one for the cell state z_s and one for the hidden state h, each with a one-step delay [k − 1]. The key unified recurring relation is F_α = A_α(z_α[k]), with α ∈ {s (state), f (forget), I (Input), g (external input), O (Output)}, where A_α is a sigmoidal activation (squashing) function, and z_α[k] is a linear combination of some inputs with weights plus biases at cell state [k]. See Figure 82 for an unfolded RNN with LSTM cells, and also Figure 15 in Section 2.3.2.
Networks based on Long Short-Term Memory (LSTM) and networks based on the gated recurrent unit (GRU) have proven to successfully overcome the vanishing-gradient problem in diverse applications. The common idea of gated RNNs is to create paths through time along which gradients neither vanish nor explode. Gated RNNs can accumulate information in their state over many time steps but, once the information has been used, they are capable of forgetting their state by, figuratively speaking, “closing gates” to stop the information flow. This concept bears a resemblance to residual networks, which introduce skip connections to circumvent vanishing gradients in deep feed-forward networks; see Section 4.6.2 on network “Architecture”.
Remark 7.3. What is “short-term” memory? The vanishing gradient at the earlier states of an RNN (or at the earlier layers in the case of a multilayer neural network) means that information in these earlier states (or layers) does not propagate forward to contribute to adjusting the predicted outputs to track the labeled outputs, so as to decrease the loss. A state [k] with two feedback loops is depicted in Figure 81. The reason for information in the earlier states not propagating forward is that the weights in these layers did not change much (i.e., did not learn), due to very small gradients, resulting from repeated multiplications of numbers with magnitude less than 1. As a result, information in earlier states (or “events”) played little role in decreasing the loss function, and thus had only “short-term” effects, rather than the needed long-term effects to be carried forward to the output layer. Hence, we have a short-term memory problem. See also Remark 5.5 in Section 5 on back-propagation for vanishing or exploding gradients in multilayer neural networks. ■
In their pioneering work on LSTM, the authors of [24] presented a mechanism that allows information
(inputs, gradients) to flow over a long duration by introducing additional states, paths and self-loops. The
additional components are encapsulated in so-called LSTM cells. LSTM cells are the building blocks for
LSTM networks, where they are connected recurrently to each other analogously to hidden neurons in con-
ventional RNNs. The introduction of a cell state219 c is one of the key ingredients of LSTM. The schematic cell representation (Figure 81) shows that the cell state can propagate through an LSTM cell without much interference, which is why this path is described as a “conveyor belt” for information in [219].
Another way of explaining, which could contribute to elucidating the concept, is: Since information in an RNN cannot be stored for a long time over many subsequent steps, the LSTM cell corrected this short-term memory problem by remembering the inputs for a long time:
“A special unit called the memory cell acts like an accumulator or a gated leaky neuron: it has
a connection to itself at the next time step that has a weight of one, so it copies its own real-
valued state and accumulates the external signal, but this self-connection is multiplicatively
gated by another unit that learns to decide when to clear the content of the memory.” [13].
In a unified manner, the various relations in the original LSTM unit depicted in Figure 81 can be expressed in a single key generic recurring relation that is more easily remembered:
F_α(x, h) = A_α(z_α) with α ∈ {s (state), f (forget), I (Input), g (external input), O (Output)} , (286)
where x = x[k] is the input at cell state [k], h = h[k−1] the hidden variable at cell state [k − 1], A_α (with “A” being mnemonic for “activation”) is a sigmoidal activation (squashing) function—which can be either the logistic sigmoid function or the hyperbolic tangent function (see Section 5.3.1)—and z_α = z_α[k] is a linear combination of some inputs with weights plus biases at cell state [k]. The choice of notation in Eq. (286) is consistent with the notation in the relation ỹ = f(x) = a(W x + b) = a(z) in the caption of Figure 32.
In Figure 81, two types of squashing functions are used: One type (three blue boxes with α ∈ {f, I, O}) squashes inputs into the range (0, 1) (e.g., the logistic sigmoid, Eq. (113)), and the other type (purple box with α = g, brown box with α = s) squashes inputs into the range (−1, +1) (e.g., the hyperbolic tangent, Eq. (114)). The gates are the activation functions Aα with α ∈ {f, I, g, O} (3 blue boxes and 1 purple box), with argument containing the input x[k] (through zα[k]).

219 The cell state is denoted with the variable s in [78], p. 399, Eq. (10.41).
Remark 7.4. The activation function As (brown box in Figure 81) is a hyperbolic tangent, but is not called
a gate, since it has the cell state c[k] , but not the input x[k] , as argument. In other words, a gate has to take in
the input x[k] in its argument. ■
There are two feedback loops, each with a delay of one step. The cell-state feedback loop (red) at the top involves the LSTM cell state zs[k−1] ≡ c[k−1], with a delay of one step. Since the cell state zs[k−1] ≡ c[k−1] is not squashed by a sigmoidal activation function, vanishing or exploding gradients are avoided; the “short-term” memory lasts longer, thus the name “Long Short-Term Memory”. See Remark 5.5 on vanishing or exploding gradients in back-propagation, and Remark 7.3 on short-term memory.
In the hidden-state feedback loop (green) at the bottom, the combination zO[k] of the input x[k] and the previous hidden state h[k−1] is squashed into the range (0, 1) by the processor FO to form a factor that filters out less important information from the cell state zs[k] ≡ c[k], which had been squashed into the range (−1, +1). See Figure 82 for an unfolded RNN with LSTM cells. See also Appendix 2 and Figure 152 for an alternative block diagram.
Figure 82: Unfolded RNN with LSTM cells (Sections 2.3.2, 7.2, 12.1): In this unfolded RNN,
the cell states are centered at the LSTM cell [k = n], preceded by the LSTM cell [k = n − 1],
and followed by the LSTM cell [k = n + 1]. See Eq. (290) for the recurring relation among the
successive cell states, and Figure 81 for a folded RNN with details of an LSTM cell. Unlike
conventional RNNs, in which the hidden state is repeatedly multiplied with its shared weights,
the additional cell state of an LSTM network can propagate over several time steps without
much interference. For this reason, LSTM networks typically perform significantly better on
long sequences as compared to conventional RNNs, which suffer from the problem of vanishing
gradients when being trained. See also Figure 117 in Section 12.2.
As the term suggests, the presence of gates that control the information flow is a further key concept
in gated RNNs and LSTM, in particular. Gates are constructed from a linear layer with a sigmoidal function
(logistic sigmoid or hyperbolic tangent) as activation function that squashes the components of a vector into
the range (0, 1) (logistic sigmoid) or (−1, +1) (hyperbolic tangent). A component-wise multiplication of
the sigmoid’s output with the cell state controls the evolution of the cell state (forward pass) and the flow of
its gradients (backward pass), respectively. Multiplication with 0 suppresses the propagation of a component
of the cell state, whereas a gate value of 1 allows a component to pass.
To understand the function of an LSTM, we follow the paths along which information is routed through an LSTM cell. At time n, we assume that the hidden state h[n−1] and the cell state c[n−1] from the previous time step, along with the current input x[n], are given. The hidden state h[n−1] and the input x[n] are inputs to the forget gate, i.e., a fully connected layer with a sigmoid non-linearity, and the general expression Eq. (286), with k = n (see Figure 82), becomes (Figure 81, where the superscript [k] on Fα was omitted to alleviate the notation)

α = f ⇒ Ff[n] = Af(zf[n]) → f[n] = s(zf[n]) , with zf[n] = Wf h[n−1] + Uf x[n] + bf . (287)
The weights associated with the hidden state and the input are Wf and Uf, respectively; the bias vector of the forget gate is denoted by bf. The forget gate determines which and to what extent components
of the previous cell state c[n−1] are to be kept subsequently. Knowing which information of the cell state to
keep, the next step is to determine how to update the cell state. For this purpose, the hidden state h[n−1] and
the input x[n] are input to a linear layer with a hyperbolic tangent as activation function, called the external
input gate, and the general expression Eq. (286), with k = n, becomes (Figure 81)
α = g ⇒ Fg[n] = Ag (zg[n] ) → g [n] = tanh(zg[n] ), with zg[n] = Wg h[n−1] + Ug x[n] + bg . (288)
Again, Wg and Ug are linear weights and bg represents the bias. The output vector of the tanh layer, which
is also referred to as cell gate, can be regarded as a candidate for updates to the cell state.
The actual updates are determined by a component-wise multiplication of the candidate values g [n] with
the input gate, which has the same structure as the forget gate but has its own parameters (i.e., Wi , Ui , bi ),
and the general expression Eq. (286) becomes (Figure 81)
α = i ⇒ Fi[n] = Ai(zi[n]) → i[n] = s(zi[n]) , with zi[n] = Wi h[n−1] + Ui x[n] + bi . (289)
The new cell state c[n] is formed by summing the scaled (by the forget gate) values of the previous cell state and the scaled (by the input gate) candidate values,
c[n] = f [n] ⊙ c[n−1] + i[n] ⊙ g [n] , (290)
where the component-wise multiplication of matrices, which is also known by the name Hadamard product,
is indicated by a “⊙”.
Remark 7.5. The path of the cell state c[k] is reminiscent of the identity map that jumps over some layers
to create a residual map inside the jump in the building block of a residual network; see Figure 44 and also
Remark 4.7 on the identity map in residual networks. ■
Finally, the hidden state of the LSTM cell is computed from the cell state c[n] . The cell state is first
squashed into the range (−1, +1) by a hyperbolic tangent tanh, i.e., the general expression Eq. (286) be-
comes (Figure 81)
α = s ⇒ Fs[n] = As (zs[n] ) = As (c[n] ) = tanh c[n] , with zs[n] ≡ c[n] , (291)
before the result Fs[n] is multiplied with the output of a third sigmoid gate, i.e., the output gate, for which
the general expression Eq. (286) becomes (Figure 81)
α = O ⇒ FO[n] = AO(zO[n]) → o[n] = s(zo[n]) , with zo[n] = Wo h[n−1] + Uo x[n] + bo . (292)
Hence, the output, i.e., the new hidden state h[n] , is given by
h[n] = FO[n] ⊙ Fs[n] = o[n] ⊙ tanh c[n] . (293)
For the LSTM cell, we thus gain an intuition for the respective choices of activation functions. The hyperbolic tangent is used to normalize and center information that is to be incorporated into the cell state or the hidden state. The forget gate, input gate, and output gate make use of the sigmoid function, which takes values between 0 and 1, to either discard information or allow it to pass.
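To make the data flow concrete, the following minimal NumPy sketch (not the authors' code; the toy dimensions and random initialization are illustrative assumptions) implements one forward pass through an LSTM cell per Eqs. (287)-(293):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell(x, h_prev, c_prev, p):
    # Forget, external-input, and input gates, Eqs. (287)-(289).
    f = sigmoid(p["Wf"] @ h_prev + p["Uf"] @ x + p["bf"])
    g = np.tanh(p["Wg"] @ h_prev + p["Ug"] @ x + p["bg"])
    i = sigmoid(p["Wi"] @ h_prev + p["Ui"] @ x + p["bi"])
    # Cell-state update, Eq. (290); "*" is the Hadamard product.
    c = f * c_prev + i * g
    # Output gate and new hidden state, Eqs. (291)-(293).
    o = sigmoid(p["Wo"] @ h_prev + p["Uo"] @ x + p["bo"])
    h = o * np.tanh(c)
    return h, c

# Toy dimensions and random parameters, for illustration only.
rng = np.random.default_rng(0)
dx, dh = 3, 4
p = {}
for a in "fgio":
    p["W" + a] = rng.standard_normal((dh, dh))
    p["U" + a] = rng.standard_normal((dh, dx))
    p["b" + a] = np.zeros(dh)
h, c = lstm_cell(rng.standard_normal(dx), np.zeros(dh), np.zeros(dh), p)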
Figure 83: Folded RNN with Gated Recurrent Unit (GRU) (Section 7.3). The cell state at [k−1], i.e., (x[k−1], h[k−1]), provides the inputs to produce the hidden state h[k]. There is one feedback loop for the hidden state h, with a one-step delay [k−1]. The key unified recurring relation is Fα = Aα(zα[k−1]), with α ∈ {r (reset), u (update), O (Output)}, where Aα is a logistic sigmoid activation function, and zα[k−1] is a linear combination of some inputs with weights plus biases at cell state [k−1]. Compare to the LSTM cell in Figure 81.
7.3 Gated Recurrent Unit (GRU)

In analogy to Eq. (286) for the LSTM cell, the various relations in the GRU cell depicted in Figure 83 can be expressed in a single generic recurring relation,

Fα(x, h) = Aα(zα) , with α ∈ {r (reset), u (update), O (Output)} , (294)

where x = x[k−1] is the input at cell state [k−1], h = h[k−1] the hidden variable at cell state [k−1], Aα the logistic sigmoid activation function, and zα = zα[k−1] a linear combination of some inputs with weights plus biases at cell state [k−1].
To facilitate a direct comparison between the GRU cell and the LSTM cell, the locations of the boxes
(gates) in the GRU cell in Figure 83 are identical to those in the LSTM in Figure 81. It can be observed that
in the GRU cell (1) There is no feedback loop for the cell state, (2) The input is x[k−1] (instead of x[k] ), (3)
The reset gate replaces the LSTM forget gate, (4) The update gate replaces the LSTM input gate, (5) The
identity map (no effect on h[k−1] ) replaces the LSTM external input gate, (6) The complement of the update
gate, i.e., (1 − Fu ) replaces the LSTM state activation As = tanh. There are fewer activations in the GRU
cell compared to the LSTM cell.
The GRU was introduced in [220], and tested against the LSTM and tanh-RNN in [221], whose concise GRU schematics are not easy to follow, unlike Figure 83. The GRU relations below follow [78], p. 400.
The hidden variable h[n], with k = n, is obtained from the GRU cell state at [n−1], including the input x[n−1], and is a convex combination of h[n−1] and the GRU output-gate effect FO[n−1], using the GRU update-gate effect Fu[n−1] as coefficient:

h[n] = Fu[n−1] ⊙ h[n−1] + (1 − Fu[n−1]) ⊙ FO[n−1] . (295)
For the GRU update-gate effect Fu[n−1], the generic relation Eq. (294) becomes (Figure 83, where the superscript [k−1] on Fα was omitted to alleviate the notation), with α = u,

Fu[n−1] = Au(zu[n−1]) → u[n−1] = s(zu[n−1]) , with zu[n−1] = Wu h[n−1] + Uu x[n−1] + bu . (296)
For the GRU output-gate effect FO[n−1], the generic Eq. (294) becomes (Figure 83), with α = O,

FO[n−1] = AO(zO[n−1]) → o[n−1] = s(zo[n−1]) , with zo[n−1] = Wo (Fr[n−1] ⊙ h[n−1]) + Uo x[n−1] + bo . (297)
For the GRU reset-gate effect Fr[n−1], the generic Eq. (294) becomes (Figure 83), with α = r,

Fr[n−1] = Ar(zr[n−1]) → r[n−1] = s(zr[n−1]) , with zr[n−1] = Wr h[n−1] + Ur x[n−1] + br . (298)
Remark 7.6. GRU has fewer activation functions compared to LSTM, and is thus likely to be more efficient,
even though it was stated in [221] that no concrete conclusion could be made as to “which of the two gating
units was better.” See Remark 7.7 on the use of GRU to solve hyperbolic problems with shock waves. ■
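As an illustration only (a sketch, not the authors' code), the GRU relations Eqs. (295)-(298) translate into a few lines of NumPy; note that, following the convention of [78] adopted in Eq. (297), the output-gate effect uses a logistic sigmoid, whereas many common GRU implementations use a hyperbolic tangent for the candidate state:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell(x, h_prev, p):
    r = sigmoid(p["Wr"] @ h_prev + p["Ur"] @ x + p["br"])        # reset gate, Eq. (298)
    u = sigmoid(p["Wu"] @ h_prev + p["Uu"] @ x + p["bu"])        # update gate, Eq. (296)
    o = sigmoid(p["Wo"] @ (r * h_prev) + p["Uo"] @ x + p["bo"])  # output-gate effect, Eq. (297)
    return u * h_prev + (1.0 - u) * o                            # convex combination, Eq. (295)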
In the conventional encoder–decoder architecture, the encoder E comprises a recurrence (transition) function f and a generally non-linear function e. In terms of our notation, let x[k] denote the k-th element in the sequence of input vectors X, and let h[k−1] denote the corresponding hidden state at time k−1; the hidden state at time k then follows from the recurrence relation (Figure 79)

h[k] = f(h[k−1], x[k]) . (299)

In Section 7.1, we referred to f as the transition function (denoted by f in Figure 79). The context vector c is generated by the generally non-linear function e, which takes the sequence of hidden states h[k], k = 1, . . . , n, as input:

c = e({h[k], k = 1, . . . , n}) . (300)
Note that e could as well be a function of just the final hidden state in the encoder-RNN.223
From a probabilistic point of view, the joint probability of the entire output sequence (i.e., the “translation”) y can be decomposed into conditional probabilities of the k-th output item y[k] given its predecessors y[1], . . . , y[k−1] and the input sequence x, which in turn can be approximated using the context vector c:

P(y[1], . . . , y[n]) = Π_{k=1}^{n} P(y[k] | y[1], . . . , y[k−1], x) ≈ Π_{k=1}^{n} P(y[k] | y[1], . . . , y[k−1], c) . (301)
Accordingly, the decoder is trained to predict the next item (word, character) in the output sequence y [k]
given the previous items y [1] , . . . , y [k−1] and the context vector c. In analogy to the encoder E, the decoder
D comprises a recurrence function g and a non-linear feedforward function d. Practically, RNNs provide
an intuitive means to realize functions of variable-length sequences, since the current hidden state in RNNs
contains information of all previous inputs. Accordingly, the decoder’s hidden state s[k] at step k follows
from the recurrence relation g as
s[k] = g(s[k−1] , y [k−1] , c). (302)
To predict the conditional probability of the next item y [k] by means of the function d, the decoder can
therefore use only the previous item y [k−1] and the current hidden state s[k] as inputs (along with the context
vector c) rather than all previously predicted items y [1] , . . . , y [k−1] :
P (y [k] | y [1] , . . . , y [k−1] , c) = d(y [k−1] , s[k] , c). (303)
We have various choices of how the context vector c and the inputs y[k] are fed into the decoder-RNN.224 The context vector c, for instance, can either be used as the decoder’s initial hidden state s[1] or, alternatively, as the first input.
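The encoder–decoder recurrences, Eqs. (299)-(303), can be summarized in the following schematic Python sketch, in which f, e, g, and d are placeholder callables standing for the (trained) transition, context, decoder-recurrence, and prediction functions (all names here are illustrative assumptions, not a library API):

def encode(x_seq, f, e, h0):
    hs, h = [], h0
    for x in x_seq:          # h[k] = f(h[k-1], x[k]), Eq. (299)
        h = f(h, x)
        hs.append(h)
    return e(hs)             # context vector c = e({h[k]}), Eq. (300)

def decode(c, g, d, s0, y0, n):
    ys, s, y = [], s0, y0
    for _ in range(n):
        s = g(s, y, c)       # s[k] = g(s[k-1], y[k-1], c), Eq. (302)
        y = d(y, s, c)       # next output item, Eq. (303)
        ys.append(y)
    return ys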
7.4.2 Attention
As the authors of [57] emphasized, the encoder “needs to be able to compress all the necessary information of a source sentence into a fixed-length vector”. For this reason, long sentences pose a challenge in
neural machine translation, in particular, as sentences to be translated are longer than the sentences networks
have seen during training, which was confirmed by the observations in [223]. To cope with long sentences,
an encoder–decoder architecture, “which learns to align and translate jointly,” was proposed in [57]. Their
approach is motivated by the observation that individual items of the target sequence correspond to different
parts of the source sequence. To account for the fact that only a subset of the source sequence is relevant
when generating a new item of the target sequence, two key ingredients (alignment and translation) to the
conventional encoder–decoder architecture described above were introduced in [57], and will be presented
below.
223 See, e.g., [78], p. 385.
224 See, e.g., [78], p. 386.
☛ The first key ingredient to their concept of “alignment” is the idea of using a distinct context vector
ck for each output item y [k] instead of a single context c. Accordingly, the recurrence relation of the decoder,
Eq. (302), is modified and takes the context ck as argument
s[k] = g(s[k−1] , y [k−1] , ck ), (304)
as is the conditional probability of the output items, Eq. (303),
P (y [k] | y [1] , . . . , y [k−1] , x) = P (y [k] | y [1] , . . . , y [k−1] , ck ) ≈ d(y [k−1] , s[k] , ck ), (305)
i.e., it is conditioned on distinct context vectors ck for each output y [k] , k = 1, . . . , n.
The k-th context vector ck is supposed to capture the information of that part of the source sequence x
which is relevant to the k-th target item y[k]. For this purpose, ck is computed as a weighted sum of all hidden states h[l], l = 1, . . . , n, of the encoder:

ck = Σ_{l=1}^{n} αkl h[l] . (306)
The k-th hidden state of a conventional RNN obeying the recurrence given by Eq. (299) only includes
information about the preceding items (1, . . . , k) in the source sequence, since the remaining items (k +
1, . . . , n) still remain to be processed. When generating the k-th output item, however, we want information
about all source items, before and after, to be contained in the k-th hidden state h[k] .
For this reason, using a bidirectional RNN225 as encoder was proposed in [57]. A bidirectional RNN combines two RNNs, i.e., a forward RNN and a backward RNN, which independently process the source sequence in the original and in reverse order, respectively. The two RNNs generate corresponding sequences of forward and backward hidden states,

hfwd[k] = ffwd(hfwd[k−1], x[k]) , hrev[k] = frev(hrev[k−1], x[n−k]) . (307)

In each step, these vectors are concatenated into a single hidden-state vector h[k], which the authors of [57] refer to as the “annotation” of the k-th source item:

h[k] = [(hfwd[k])T, (hrev[k])T]T . (308)
They mentioned “the tendency of RNNs to better represent recent inputs” as reason why the annotation h[k]
focuses around the k-th encoder input x[k] .
☛ As the second key ingredient, the authors of [57] proposed a so-called “alignment model”, i.e., a
function a to compute weights αkl needed for the context ck , Eq. (306),
akl = a(s[k−1] , h[l] ), (309)
which is meant to quantify (“score”) the relation, i.e., the alignment, between the k-th decoder output (target)
and inputs “around” the l-th position of the source sequence. The score is computed from the decoder’s
hidden state of the previous output s[k−1] and the annotation h[l] of the l-th input item, so it “reflects the
importance of the annotation h[l] with respect to the previous hidden state s[k−1] in deciding the next state
s[k] and generating y [k] .”
The alignment model is represented by a feedforward neural network, which is jointly trained along with
all other components of the encoder–decoder architecture. The weights of the annotations, in turn, follow
from the alignment scores upon exponentiation and normalization (through softmax(·) (Section 5.1.3) along
the second dimension):

αkl = exp(akl) / Σ_{j=1}^{n} exp(akj) . (310)
225 See [224].
The weighting in Eq. (306) is interpreted as a way “to compute an expected annotation, where the expectation is over possible alignments” [57]. From this perspective, αkl represents the probability of an output (target) item y[k] being aligned to an input (source) item x[l].
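A minimal NumPy sketch of Eqs. (306), (309), and (310) follows; the alignment model `align` is a placeholder callable standing for the jointly trained feedforward network, and all names are illustrative assumptions:

import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def context_vector(s_prev, H, align):
    # H: array of n annotations h[l]; align(s, h) returns a scalar score.
    scores = np.array([align(s_prev, h) for h in H])  # a_kl, Eq. (309)
    alpha = softmax(scores)                           # alpha_kl, Eq. (310)
    return alpha @ H                                  # c_k, Eq. (306)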
In neural machine translation it was possible to show that the attention model in [57] significantly
outperformed conventional encoder–decoder architectures, which encoded the entire source sequence into a
single fixed-length vector. In particular, this proposed approach turned out to perform better in translating
long sentences, where it could achieve performance on par with phrase-based statistical machine translation
approaches of that time.
7.4.3 Transformer

Scaling with the square root of the query/key dimension is supposed to prevent pushing the softmax(·) function into regions of small gradients for large dk, as scalar products grow with the dimension of queries and keys. The above attention model can be simultaneously applied to multiple queries. For this purpose, let Q = [q1, . . . , qk]T ∈ Rk×dk and C = [c1, . . . , ck]T ∈ Rk×dv denote matrices of query and context vectors, respectively. We can rewrite the attention model using matrix multiplication as follows:

C = A(Q, K, V) = softmax(QKT/√dk) V ∈ Rk×dv . (312)

Note that QKT gives a k × m matrix, for which the softmax(·) is computed along the second dimension.
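In code, Eq. (312) amounts to a few lines of NumPy (a sketch under the stated dimensions, not a reference implementation):

import numpy as np

def softmax_rows(A):
    E = np.exp(A - A.max(axis=-1, keepdims=True))
    return E / E.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # Q: (k, d_k), K: (m, d_k), V: (m, d_v); returns C: (k, d_v), Eq. (312).
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # (k, m) scaled alignment scores
    return softmax_rows(scores) @ V          # rows of C weight the values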
Based on the concept of scaled dot-product attention, the idea of using multiple attention functions in parallel rather than just a single one was proposed in [31]; see Figure 84. In their concept of “Multi-Head Attention”, each “head” represents a separate context Cj computed from scaled dot-product attention, Eq. (312), on queries Qj, keys Kj, and values Vj, respectively. The inputs to the individual scaled dot-product attention functions Qj, Kj, and Vj, j = 1, . . . , h, in turn, are head-specific (learned) projections of the queries Q ∈ Rk×d, keys K ∈ Rm×d, and values V ∈ Rm×d.
Figure 84: Scaled dot-product attention and multi-head attention (Section 7.4.3). Scaled dot-product attention (left) is the elementary building block of the Transformer model. It compares query vectors (Q) against a set of key vectors (K) to produce a context vector by weighting value vectors (V) that correspond to the keys. For this purpose, the softmax(·) function is applied to the inner product (MatMul) of the query and key vectors (scaled by a constant). The outputs of the softmax(·) represent the weighting by which value vectors are scaled when taking the inner product (MatMul), see Eq. (311). To prevent attention functions of the decoder, which generates the output sequence item by item, from using future items of the output sequence, a masking layer is introduced. By the masking, all scores beyond the current time/position are set to −∞. Multi-head attention (right) combines several (h) scaled dot-product attention functions in parallel, each of which is referred to as a “head”. For this purpose, queries, keys, and values are projected by means of head-specific linear layers (Linear), whose outputs are input to the individual scaled dot-product attention functions, see Eq. (313). The context vectors produced by each head are concatenated before being fed into one more linear layer, see Eq. (314). (Figure reproduced with permission of the authors.)
Assuming that the queries qi, keys ki, and values vi share the dimension d, the projections are represented by the matrices WQj ∈ Rd×dk, WKj ∈ Rd×dk, and WVj ∈ Rd×dv:226

Qj = Q WQj ∈ Rk×dk , Kj = K WKj ∈ Rm×dk , Vj = V WVj ∈ Rm×dv . (313)

Multi-head attention combines the individual “heads” Ci through concatenation (along the second dimension) and subsequent projection by means of WO ∈ Rdvh×d,

H = Amh(Q, K, V) = [C1, . . . , Ch] WO ∈ Rk×d , Ci = A(Qi, Ki, Vi) , (314)

where h denotes the number of heads, i.e., the number of individual scaled dot-product attention functions used in parallel. Note that the output of multi-head attention H has the same dimensions as the input queries Q, i.e., H, Q ∈ Rk×d.

226 Unlike the notation in [31], we use different symbols for the arguments of the scaled dot-product attention function, Eq. (312), and those of the multi-head attention, Eq. (314), to emphasize their distinct dimensions.
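Reusing the attention() sketch above, multi-head attention per Eqs. (313)-(314) may be sketched as follows (the per-head projections WQ, WK, WV and the recombination matrix WO are assumed given):

import numpy as np
# attention() and softmax_rows() as in the scaled dot-product sketch above.

def multi_head(Q, K, V, WQ, WK, WV, WO):
    # WQ, WK, WV: lists of h per-head projections; WO: (h*d_v, d), Eq. (314).
    heads = [attention(Q @ wq, K @ wk, V @ wv)       # C_j per head, Eq. (313)
             for wq, wk, wv in zip(WQ, WK, WV)]
    return np.concatenate(heads, axis=-1) @ WO       # concatenate and project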
To understand why the projection is essential in the Transformer architecture, we shift our attention (no pun intended) to the encoder structure illustrated in Figure 85. The encoder combines a stack of N identical layers, which, in turn, are composed of two sub-layers each. To stack the layers without further projection, all inputs and outputs of the encoder layers share the same dimension. The same holds true for each of the sub-layers, which are also designed to preserve the input dimensions. The first sub-layer is a multi-head self-attention function, in which the input sequence attends to itself; the concept of self-attention amounts to using the same input sequence for the queries, keys, and values of the multi-head attention function, Eq. (314).
The authors of [31] introduced a residual connection around the self-attention (sub-)layer, which, in view of
the same dimensions of heads and queries, reduces to a simple addition.
To prevent values from growing upon summation, the residual connection is followed by layer normalization, as proposed in [225], which scales the input to zero mean and unit variance:

N(T) = (1/σ)(T − µ I) , (316)

where µ and σ denote the mean value and the standard deviation of the input components; I is a matrix of ones with the same dimensions as the input T. The normalized output of the encoder’s first sub-layer therefore follows from the sum of the inputs X to the self-attention function and its outputs H as

Z = N(X + H) ∈ Rn×d . (317)
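A minimal sketch of Eqs. (316)-(317) follows; the choice of feature-wise statistics and the small eps safeguard are implementation assumptions:

import numpy as np

def layer_norm(T, eps=1e-6):
    # Eq. (316): scale input components to zero mean and unit variance.
    mu = T.mean(axis=-1, keepdims=True)
    sigma = T.std(axis=-1, keepdims=True)
    return (T - mu) / (sigma + eps)

def encoder_sublayer1(X, H):
    return layer_norm(X + H)   # Z = N(X + H), Eq. (317)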
The output of the first sub-layer is the input to the second sub-layer within the encoder stack, i.e., a
“position-wise feed-forward network.” Position-wise means that a fully connected feedforward network
(see Section 4 on “Static, feedforward networks”), which is subsequently represented by the function F, is
applied to each item (i.e., “position”) of the input sequence z i ∈ Z = [z 1 , . . . , z n ]T ∈ Rn×d “separately
and identically.” In particular, a network with a single hidden layer and a linear rectifier unit (see Section
5.3.2 on “Rectified linear function (ReLU)”) as activation function was used in [31] to compute the vectors
y i :227
y i = F(z i ) = W 2 max(0, W 1 z i + b1 ) + b2 ∈ Rd , (318)
where W1 ∈ Rdff×d and W2 ∈ Rd×dff denote the connection weights, and b1 ∈ Rdff and b2 ∈ Rd are bias vectors. The individual outputs yi, which correspond to the respective items xi of the input sequence X, form the matrix Y = [y1, . . . , yn]T ∈ Rn×d.
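A sketch of Eq. (318), applied row-wise to Z (weight shapes as stated above; names are our own):

import numpy as np

def position_wise_ffn(Z, W1, b1, W2, b2):
    # Each row z_i: y_i = W2 max(0, W1 z_i + b1) + b2, Eq. (318).
    return np.maximum(0.0, Z @ W1.T + b1) @ W2.T + b2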
As for the first sub-layer, the authors of [31] introduced a residual connection followed by layer normalization around the feedforward network. The output of the encoder’s second sub-layer, which, at the same time, is the output of the encoder layer, is given by

E = N(Z + Y) ∈ Rn×d . (319)
Within the transformer architecture (Figure 85), the encoder is composed of N encoder layers, which are “stacked”, i.e., the output of the ℓ-th layer is input to the subsequent layer, X(ℓ+1) = E(ℓ). Let E(ℓ) denote the ℓ-th encoder layer, composed of the layer-specific self-attention function Aself(ℓ), which takes X(ℓ) ∈ Rn×d as input, and the layer-specific feedforward network F(ℓ); the layer’s output E(ℓ) is computed as follows:

E(ℓ) = E(ℓ)(X(ℓ)) = N( F(ℓ)( N(X(ℓ) + Aself(ℓ)(X(ℓ))) ) + N(X(ℓ) + Aself(ℓ)(X(ℓ))) ) , (320)

or, alternatively, using the step-wise representation,

H(ℓ) = Aself(ℓ)(X(ℓ)) ∈ Rn×d , (321)
Z(ℓ) = N(X(ℓ) + H(ℓ)) ∈ Rn×d , (322)
Y(ℓ) = F(ℓ)(Z(ℓ)) ∈ Rn×d , (323)
E(ℓ) = N(Z(ℓ) + Y(ℓ)) ∈ Rn×d . (324)
The decoder’s structure within the transformer architecture is similar to that of the encoder, see the
right part in Figure 85. As the encoder, it is composed from N identical layers, each of which combines
three sub-layers (as opposed to two sub-layers in the encoder). In addition to the self-attention sub-layer
(first sub-layer) and the fully-connected position-wise feed-forward network (third sub-layer), the decoder
inserts a second sub-layer that “performs multi-head attention over the output of the encoder stack” in
between. Attending to outputs of the encoder enables the decoder to relate items of the source sequence to
items of the target sequence. Just as in the encoder, residual connections are introduced around each of the
decoder’s sub-layers.
Let Y1(ℓ) ∈ Rn×d denote the input to the ℓ-th decoder layer, which, for ℓ > 1, is the output D(ℓ−1) ∈ Rn×d of the previous layer. The output D(ℓ) of the ℓ-th decoder layer is then obtained through the following computations. The first sub-layer performs (multi-headed) self-attention over items of the output sequence, i.e., it establishes a relation among them:

H1(ℓ) = Aself(ℓ)(Y1(ℓ)) ∈ Rn×d , (325)
Z1(ℓ) = N(Y1(ℓ) + H1(ℓ)) ∈ Rn×d . (326)
The second sub-layer relates items of the source and the target sequences by means of multi-head attention:

Y2(ℓ) = Amh(ℓ)(E(ℓ), E(ℓ), Z1(ℓ)) ∈ Rn×d , (327)
Z2(ℓ) = N(Z1(ℓ) + Y2(ℓ)) ∈ Rn×d . (328)
The third sub-layer is a fully-connected feed-forward network (with a single hidden layer) that is applied to each element of the sequence (position-wise):

Y3(ℓ) = F(ℓ)(Z2(ℓ)) ∈ Rn×d , (329)
D(ℓ) = N(Z2(ℓ) + Y3(ℓ)) ∈ Rn×d . (330)
The output of the last (ℓ = N) decoder layer within the decoder stack is projected onto a vector that has the dimension of the “vocabulary”, i.e., the set of all feasible output items. Taking the softmax(·) over the components of this vector produces probabilities over the elements of the vocabulary to be the next item of the output sequence (see Section 5.1.3).
A Transformer model is trained on the complete source and target sequences, which are input to the encoder and the decoder, respectively. The target sequence is shifted right by one position, such that a special token indicating the start of a new sequence can be placed at the beginning, see Figure 85. To prevent the decoder from attending to future items of the output sequence, the (multi-headed) self-attention sub-layer needs to be “masked”. The masking is realized by setting to −∞ those inputs to the softmax(·) function of the scaled dot-product attention, see Eq. (311), for which query vectors are aligned with keys that correspond to items in the output sequence beyond the respective query’s position. As loss function, the negative log-likelihood is typically used; see Section 5.1.2 on maximum likelihood (probability cost) and Section 5.1.3 on classification loss functions.
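The masking can be sketched as follows (a sketch, not the authors' code), building on the scaled dot-product attention above:

import numpy as np
# softmax_rows() as in the scaled dot-product attention sketch above.

def masked_self_attention(Q, K, V):
    # Scores for keys beyond each query's position are set to -inf before
    # the softmax, so each output item attends only to itself and earlier items.
    n = Q.shape[0]
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    scores[np.triu_indices(n, k=1)] = -np.inf   # mask future positions
    return softmax_rows(scores) @ V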
As the transformer architecture does not have recurrent connections, positional encodings were added to the inputs of both encoder and decoder [31], see Figure 85. The positional encodings supply the individual items of the inputs to the encoder and the decoder with information about their positions within the respective sequences. For this purpose, the authors of [31] proposed to add vector-valued positional encodings pi ∈ Rd, which have the same dimension d as the input embedding of an item, i.e., its numerical representation as a vector. They used sine and cosine functions of an item’s position (i) within the respective sequence, whose frequency varies (decreases) with the component index (j):

pi,2j = sin θij , pi,2j−1 = cos θij , θij = i / 10000^(2j/d) , j = 1, . . . , d/2 . (331)
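A sketch of Eq. (331) in NumPy (zero- vs one-based indexing of positions and components is an implementation assumption; d is assumed even):

import numpy as np

def positional_encoding(n, d):
    pos = np.arange(n)[:, None]          # item position i
    j = np.arange(d // 2)[None, :]       # component (frequency) index
    theta = pos / 10000.0 ** (2.0 * j / d)
    P = np.empty((n, d))
    P[:, 0::2] = np.sin(theta)           # even components
    P[:, 1::2] = np.cos(theta)           # odd components
    return P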
The trained Transformer-model produces one item of the output sequence at a time. Given the input
sequence and all outputs already generated in previous steps, the Transformer predicts probabilities for the
next item in the output sequence. Models of this kind are referred to as “auto-regressive”.
The authors of [31] varied parameters of the Transformer model to study the importance of individual
components. Their “base model” used N = 6 encoder and decoder layers. Inputs and outputs of all sub-
layers are sequences of vectors, which have a dimension of d = 512. All multi-head attention functions of
the Transformer, see Eq.(321), Eq.(325) and Eq.(327), have h = 8 heads each. Before the alignment scores
are computed by each head, the dimensions of queries, keys and values were reduced to dk = dv = 64. The
hidden layer of the fully-connected feedforward network, see Eq.(318), was chosen as df f = 2048 neurons
wide. With this set of (hyper-)parameters, the Transformer model features a total of 65 × 10⁶ parameters. The training cost (in terms of FLOPS) of the attention-based Transformer was shown to be (at least) two orders of magnitude smaller than that of comparable models at the time, while the performance was maintained.
A refined (“big”) model was able to outperform all previous approaches (based on RNNs or convolutional
neural networks) in English-to-French and English-to-German translation tasks.
As of 2022, the Generative Pre-trained Transformer 3 (GPT-3) model [226], which is based on the Transformer architecture, is among the most powerful language models. GPT-3 is an autoregressive model that produces text from a given (initial) text prompt, whereby it can deal with different tasks such as translation, question-answering, cloze-tests228 and word-unscrambling, for instance. The impressive capabilities of GPT-3 are enabled by its huge capacity of 175 billion parameters, which is 10 times more than that of preceding language models.
Remark 7.7. Attention mechanism, kernel machines, physics-informed neural networks (PINNs). In [227],
a new attention architecture (mechanism) was proposed by using kernel machines discussed in Section 8,
whereas in [228], the gated recurrent units (GRU, Section 7.3) and the attention mechanism (Section 7.4.1)
were used in conjunction with Physics-Informed Neural Networks (PINNs, Section 9.5) to solve hyperbolic problems with shock waves; see Remark 9.5 and Remark 11.11. ■
was discussed in Section 13.2.2 in connection with the continuous temporal summation in neuroscience. Our aim here is only to provide first-time learners background material on kernel methods in terms of space variables (specifically the “Setup” in [235]) in preparation for reading more advanced references mentioned in this section, such as [235] [232] [236], etc.
A kernel K(x, y) is reproducing if taking the scalar product of a function f with K(·, x) reproduces the value of f at x:

⟨f(y), K(y, x)⟩y = f(x) , with ⟨f, g⟩y := ∫ f(y) g(y) dy , (332)

where the subscript y in the scalar product ⟨·, ·⟩y indicates the integration variable.
Let {ϕ1 (x), . . . , ϕn (x)} be a set of linearly independent basis functions. A function f (x) can be ex-
pressed in this basis as follows:
f(x) = Σ_{k=1}^{n} ζk ϕk(x) . (333)
The scalar product of two functions f and g is given by:
g(x) = Σ_{k=1}^{n} ηk ϕk(x) ⇒ ⟨f, g⟩ = Σ_{i,j=1}^{n} ζi ηj ⟨ϕi, ϕj⟩ = Σ_{i,j=1}^{n} ζi ηj Γij , (334)
where

Γij = ⟨ϕi, ϕj⟩ , Γ = [Γij] > 0 ∈ Rn×n , (335)

is the Gram matrix,229 which is strictly positive definite, with its inverse (also strictly positive definite) denoted by

∆ = Γ−1 = [∆ij] > 0 ∈ Rn×n . (336)

The reproducing kernel K(x, y) is then constructed from the basis functions and the inverse Gram matrix as230

K(x, y) = Σ_{j,k=1}^{n} ∆jk ϕj(x) ϕk(y) . (337)
Remark 8.1. It is easy to verify that the function K(x, y) in Eq. (337) is a reproducing kernel:

⟨f(y), K(y, x)⟩y = ⟨Σ_i ζi ϕi(y), Σ_{j,k} ∆jk ϕj(y) ϕk(x)⟩y = Σ_i Σ_{j,k} ζi ⟨ϕi(y), ϕj(y)⟩y ∆jk ϕk(x) (338)
= Σ_i Σ_{j,k} ζi Γij ∆jk ϕk(x) = Σ_i Σ_k ζi δik ϕk(x) = Σ_i ζi ϕi(x) = f(x) , (339)
229 The stiffness matrix in the displacement finite element method is a Gram matrix.
230 See Eq. (6), p. 346, in [237].
where ∥ · ∥K denotes the norm induced by the kernel K, and where the matrix notation in Eq. (7) was used.
When the basis functions in Φ are mutually orthogonal, the Gram matrix Γ in Eq. (335) is diagonal, and the reproducing kernel K in Eq. (337) takes the simple form:

K(x, y) = Σ_{k=1}^{n} ∆k ϕk(x) ϕk(y) , with ∆k = ∆ik δik = ∆(kk) > 0 , for k = 1, . . . , n , (341)

where δij is the Kronecker delta, and where the summation convention on repeated indices, except when enclosed in parentheses, was applied. In the case of an infinite-dimensional space of functions, Eq. (341), Eq. (333), and Eq. (334) would be written with n → ∞:231

f(x) = Σ_{k=1}^{∞} ζk ϕk(x) , ⟨f, g⟩ = Σ_{k=1}^{∞} ζk ηk Γk = Σ_{k=1}^{∞} ζk ηk /∆k = ζT Γ η , (342)

∥f∥²K = Σ_{k=1}^{∞} (ζk)² Γk = Σ_{k=1}^{∞} (ζk)² /∆k = ζT Γ ζ , K(x, y) = Σ_{k=1}^{∞} ∆k ϕk(x) ϕk(y) . (343)
Let L(y, f (x)) be the loss (cost) function, with x being the data, f (x) the predicted output, and y the
label. Consider the regularized minimization problem:
min_f [ Σ_{i=1}^{n} L(yi, f(xi)) + λ ∥f∥²K ] = min_{{ζj}, j=1,...,∞} [ Σ_{i=1}^{n} L(yi, Σ_{j=1}^{∞} ζj ϕj(xi)) + λ Σ_{j=1}^{∞} (ζj)²/∆j ] , (344)
which is a “ridge” penalty method232 with λ being the penalty coefficient (or regularization parameter), with the aim of forcing the norm of the minimizer, i.e., ∥f⋆∥², to be as small as possible, by penalizing the objective function (cost L plus penalty λ∥f∥²K) if it were not.233 What is remarkable is that, even though the minimization problem is infinite dimensional, the minimizer (solution) f⋆ of Eq. (344) is finite dimensional [238]:

f⋆(x) = Σ_{i=1}^{n} αi⋆ K(x, xi) , (345)
where K(·, xi ) is a basis function, and αi⋆ the corresponding coefficient, for i = 1, . . . , n.
231 Eq. (342)1 and Eqs. (343)1,2 were given as Eq. (5.46), Eq. (5.47), and Eq. (5.45), respectively, in [238], pp. 168-169, where the basis functions ϕi(x) were the “eigen-functions,” and where the general case of a non-orthogonal basis presented in Eq. (337) was not discussed. Technically, Eq. (343)2 is accompanied by the conditions ∆i ≥ 0, for i = 1, . . . , ∞, and Σ_{k=1}^{∞} (∆k)² < ∞, i.e., the sum of the squared coefficients is finite [238], p. 188.
232 See [238], Eq. (3.41), p. 61, which is in Section 3.4.1 on “Ridge regression,” which “shrinks the regression coefficients by imposing a penalty on them.” Here the penalty is imposed on the kernel-induced norm of f, i.e., ∥f∥²K.
233 In classical regularization, the loss function (1st term) in Eq. (344) is called the “empirical risk” and the penalty term (2nd term) the “stabilizer” [239].
Thus K = [K(xi, xj)] ∈ Rn×n is the Gram matrix of the set of functions {K(·, x1), . . . , K(·, xn)}. To show that K > 0, i.e., positive definite, for any set {a1, . . . , an}, consider

Σ_{i,j=1}^{n} ai K(xi, xj) aj = Σ_{i,j=1}^{n} ai [Σ_{p=1}^{∞} ∆p ϕp(xi) ϕp(xj)] aj = Σ_{p=1}^{∞} ∆p (bp)² ≥ 0 , (348)

bp := Σ_{i=1}^{n} ai ϕp(xi) , for p = 1, . . . , ∞ , (349)

which is equivalent to the matrix K being positive definite, i.e., K > 0,234 and thus the functions {K(·, x1), . . . , K(·, xn)} are linearly independent and form a basis, making expressions such as Eq. (345) possible.
A goal now is to show that the solution to the infinite-dimensional regularized minimization problem
Eq. (344) is finite dimensional, for which the coefficients αi⋆ , i = 1, . . . , n, in Eq. (345) are to be determined.
It is also remarkable that the solution of the form Eq. (345) holds in general for any type of differentiable
loss function L in Eq. (344), and not necessarily restricted to the squared-error loss [241] [239].
For notational compactness, let the objective function (loss plus penalty) in Eq. (344) be written as

L[f] := Σ_{k=1}^{n} L(yk, f(xk)) + λ ∥f∥²K = Σ_{k=1}^{n} L(yk, Σ_{j=1}^{∞} ζj ϕj(xk)) + λ Σ_{j=1}^{∞} (ζj)²/∆j , (350)

and set the derivative of L with respect to the coefficients ζp, for p = 1, . . . , ∞, to zero to solve for these coefficients:

∂L/∂ζp = − Σ_{k=1}^{n} [∂L(yk, f(xk))/∂f] [∂f(xk)/∂ζp] + λ ∂∥f∥²K/∂ζp = − Σ_{k=1}^{n} [∂L(yk, f(xk))/∂f] ϕp(xk) + 2λ ζp/∆p = 0 , (351)

⇒ ζp⋆ = (∆p/2λ) Σ_{k=1}^{n} [∂L(yk, f(xk))/∂f] ϕp(xk) = ∆p Σ_{k=1}^{n} αk⋆ ϕp(xk) , with αk⋆ := (1/2λ) ∂L(yk, f(xk))/∂f , (352)

⇒ f⋆(x) = Σ_{k=1}^{∞} ζk⋆ ϕk(x) = Σ_{k=1}^{∞} ∆k [Σ_{i=1}^{n} αi⋆ ϕk(xi)] ϕk(x) = Σ_{i=1}^{n} αi⋆ Σ_{k=1}^{∞} ∆k ϕk(xi) ϕk(x) = Σ_{i=1}^{n} αi⋆ K(xi, x) , (353)

where the last expression in Eq. (353) came from using the kernel expression in Eq. (343)2, and the end result is Eq. (345), i.e., the solution (minimizer) f⋆ is of finite dimension.235 ■
With the notation

yT = [y1, . . . , yn] ∈ R1×n , K = [Kij] = [K(xi, xj)] ∈ Rn×n , α⋆T = [α1⋆, . . . , αn⋆] ∈ R1×n , (355)

the coefficients αk⋆ in Eq. (353) (or Eq. (345)) can be computed using Eq. (352)2, which for the squared-error loss yields

αk⋆ = (1/λ)[yk − f⋆(xk)] ⇒ λ αk⋆ = yk − Σ_{j=1}^{n} Kkj αj⋆ ⇒ [K + λI] α⋆ = y ⇒ α⋆ = [K + λI]−1 y . (356)
234 See, e.g., [240], p. 11.
235 See also [238], p. 169, Eq. 5.50, and p. 185.
It is then clear from the above that the “Setup” section in [235] simply corresponded to the particular case where the penalty parameter was zero:

λ = 0 ⇒ yk = f⋆(xk) = Σ_{j=1}^{n} Kkj αj⋆ , (357)

i.e., f⋆ is interpolating.
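A small NumPy sketch of Eqs. (345) and (356) on toy data (the kernel choice and data are illustrative assumptions):

import numpy as np

def laplacian_kernel(X, Y, sigma=1.0):
    # K(x, y) = exp(-|x - y| / sigma), for 1D inputs.
    return np.exp(-np.abs(X[:, None] - Y[None, :]) / sigma)

x = np.linspace(0.0, 1.0, 8)      # training inputs (toy data)
y = np.sin(2.0 * np.pi * x)       # labels
lam = 1e-3                        # penalty coefficient lambda
K = laplacian_kernel(x, x)
alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)   # Eq. (356)
f_star = laplacian_kernel(x, x) @ alpha                # Eq. (345) at x
# As lam -> 0, f_star approaches the interpolating case, Eq. (357).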
For technical jargon such as Reproducing Kernel Hilbert Space (RKHS), Riesz Representation Theo-
rem, K(·, xi ) as a representer of evaluation at xi , etc. to describe several concepts presented above, see [237]
[242] [240] [238].236
Table 5: Some reproducing kernels (Section 8). See [239], [130], p. 296, p. 305, [244].
Remark 8.3. Laplacian kernel is reproducing. Consider the Laplacian kernel in Eq. (358)2 for the scalar
case (x, y) with σ = 1 for simplicity, without loss of generality. The goal is to show that the reproducing
236 A succinct introduction to Hilbert space and the Riesz Representation theorem, with detailed proofs, starting from the basic definitions, can be found in [243].
237 In [130], p. 305, the kernel KL in Eq. (358)2 was called “exponential kernel,” without a reference to Laplace, but referred to “the Ornstein-Uhlenbeck process originally introduced by Uhlenbeck and Ornstein (1930) to describe Brownian motion.” Similarly, in [234], p. 85, the term “exponential covariance function” (or kernel) was used for the kernel k(r) = exp(−r/ℓ), with r = |x − y|, in connection with the Ornstein-Uhlenbeck process. Even though the name “kernel” came from the theory of integral operators [234], p. 80, the attribution of the exponential kernel to Laplace came from the Laplace probability distribution (Wikipedia, version 06:55, 24 August 2022), also called the “double exponential” distribution, and not from the different kernel used in the Laplace transform. See also Remark 8.3.1 and Figure 86.
property in Eq. (332)1 holds for such a kernel, expressed in Eq. (359)1 below:

K(x, y) = exp[−|x − y|] ⇒ K′(x, y) := ∂K(x, y)/∂y = { −K(x, y) for y > x ; +K(x, y) for y < x } , (359)

⇒ K′′(x, y) := ∂²K(x, y)/(∂y)² = K(x, y) , for y ≠ x . (360)

The method is by using integration by parts and a function norm different from that in Eq. (332)2; see [240], p. 8. Now start with the integral in Eq. (332)2, and integrate by parts:

∫ f(y) K(x, y) dy = ∫ f(y) K′′(x, y) dy = [f K′]_{y=−∞}^{y=+∞} − ∫ f′ K′ dy , (361)

[f K′]_{y=−∞}^{y=+∞} = [f(+K)]_{y=−∞}^{y=x−} + [f(−K)]_{y=x+}^{y=+∞} = f(x−)K(x−, x) + f(x+)K(x+, x) = 2f(x) , (362)

⇒ ∫ f(y) K(x, y) dy + ∫ f′(y) K′(x, y) dy = 2f(x) , (363)
where x− = x − ϵ and x+ = x + ϵ, with ϵ > 0 being very small. The scalar product on the space of functions that are differentiable almost everywhere, i.e.,

⟨f, g⟩ = (1/2) [ ∫ f(y) g(y) dy + ∫ f′(y) g′(y) dy ] , (364)

together with Eq. (363) and g(y) = K(x, y), shows that the Laplacian kernel is reproducing.238 ■
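The reproducing property can also be checked numerically; the following sketch (a numerical verification under the stated assumptions, with a smooth, decaying test function of our own choosing) approximates the modified scalar product of Eq. (364) with g = K(x0, ·):

import numpy as np

x0 = 0.3
y = np.linspace(-12.0, 12.0, 200001)
dy = y[1] - y[0]
f = np.exp(-y**2)                    # test function f(y)
fp = -2.0 * y * f                    # its derivative f'(y)
K = np.exp(-np.abs(x0 - y))          # K(x0, y), Eq. (359)
Kp = np.sign(x0 - y) * K             # K'(x0, y): -K for y > x0, +K for y < x0
inner = 0.5 * ((f * K).sum() + (fp * Kp).sum()) * dy   # Eq. (364)
print(inner, np.exp(-x0**2))         # both approximately f(x0)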
“If you ask only for the properties of the function at a finite number of points, then the inference
from the Gaussian process will give you the same answer if you ignore the infinitely many other
points, as if you would have taken them into account! And these answers are consistent with
any other finite queries you may have. One of the main attractions of the Gaussian process
framework is precisely that it unites a sophisticated and consistent view with computational
tractability.”
A simple example of a Gaussian process is the linear model f(·) below [130]:

y = f(x) = Σ_{k=0}^{n} wk x^k = wT ϕ(x) , with wT = ⌊w0, . . . , wn⌋ ∈ R1×(n+1) , wk ∼ N(0, 1) , (365)
with random coefficients wk , k = 0, . . . , n, being normally distributed with zero mean µ = 0 and unit
variance σ 2 = 1, and with basis functions {ϕk (x) = xk , k = 0, . . . , n}.
More generally, ϕ(x) could be any basis of nonlinear functions in x, and the weights in w ∈ Rn×1 could have a joint Gaussian distribution with zero mean and a given covariance matrix Cww = Σ ∈ Rn×n, i.e., w ∼ N(0, Σ). If wk, k = 1, . . . , n, are independently and identically distributed (i.i.d.) with variance σ², then the covariance matrix is diagonal (and is called “isotropic” [130], p. 84), i.e., Σ = σ²I and w ∼ N(0, σ²I), with I being the identity matrix.
Formally, a Gaussian process is a probability distribution over functions f(x) such that, for any given arbitrary set of input training points x = [x1, . . . , xm]T ∈ Rm×1, the set of output values of f(·) at those points, i.e., y = [y1, . . . , ym]T = [f(x1), . . . , f(xm)]T ∈ Rm×1 with yi = f(xi), which from Eq. (365) can be written as

y = (wT Φ)T = ΦT w , yj = f(xj) = Σ_{i=1}^{n} wi ϕi(xj) , j = 1, . . . , m , with w ∼ N(0, Σ) , (367)

where Φ = [ϕi(xj)] ∈ Rn×m is the design matrix, has a joint probability distribution; see [130], p. 305.
Another way to put it succinctly, a Gaussian process describes a distribution over functions, and is
defined as a collection of random variables (representing the values of the function f (x) at location x), such
that any finite subset of which has a joint probability distribution [234], p. 13.
The multivariate (joint probability) Gaussian distribution for an m × 1 matrix y, with mean µ (element-wise expectation of y) and covariance matrix Cyy (element-wise expectation of y yT), is written as

N(y | µ, Cyy) = [(2π)^{m/2} (det Cyy)^{1/2}]^{−1} exp[ −(1/2)(y − µ)T Cyy−1 (y − µ) ] , with (368)

µ = E[y] = E[ΦT w] = ΦT E[w] = 0 , (369)

Cyy = E[y yT] = ΦT E[w wT] Φ = ΦT Σ Φ , (370)

where Cww = Σ ∈ Rn×n is the given covariance matrix of the weight matrix w ∈ Rn×1. The covariance matrix Cyy in Eq. (370) has the same mathematical structure as Eq. (337), and is therefore a reproducing kernel, with kernel function K(·, ·):

cov(yi, yj) = cov(f(xi), f(xj)) = E[f(xi) f(xj)] = K(xi, xj) = Σ_{p=1}^{n} Σ_{q=1}^{n} ϕp(xi) Σpq ϕq(xj) . (371)
In the case of an isotropic covariance matrix Cww, the kernel function takes a simple form:240

Cww = σ²I ⇒ K(yi, yj) = K(f(xi), f(xj)) = σ² Σ_{p=1}^{n} ϕp(xi) ϕp(xj) = σ² ϕT(xi) ϕ(xj) . (372)
Remark 8.4. Zero mean. In a Gaussian process, the joint distribution Eq. (368) over the outputs, i.e., the n random variables in y = f(x) ∈ Rn×1, is defined completely by “second-order statistics,” i.e., the mean µ = 0 and the covariance Cyy. In practice, the mean of y = f(x) is not available a priori, and “so by symmetry, we take it to be zero” [130], p. 305, or equivalently, we specify that the weights in w ∈ Rn×1 have zero mean, as in Eqs. (365), (367), (369). ■

240 The “precision” is the inverse of the variance, i.e., σ−2 [130], p. 304.

Figure 86: Gaussian process priors (Section 8.3). Left: Two samples with Gaussian kernel. Right: Two samples with Laplacian kernel. Parameters for both kernels: Kernel precision (inverse of variance) γ = σ−2 = 0.2 in Eq. (358), isotropic noise variance ν²I = 10−6 I added to covariance matrix Cyy of output y, and isotropic weight covariance matrix Cww = Σ = I in Eq. (374).
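A sketch of Gaussian-process prior sampling in the spirit of Figure 86 follows; the Gaussian-kernel form and the Cholesky-based sampling are assumptions (cf. Eq. (374), not reproduced here):

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-5.0, 5.0, 200)
gamma, nu2 = 0.2, 1e-6                       # kernel precision, noise variance
C = np.exp(-gamma * (x[:, None] - x[None, :])**2) + nu2 * np.eye(len(x))
L = np.linalg.cholesky(C)                    # C = L L^T
samples = L @ rng.standard_normal((len(x), 2))   # two zero-mean prior samples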
Figure 87: Gaussian process prior and posterior samplings, Gaussian kernel (Section 8.3). Top left: Gaussian-prior samples (Section 8.3.1). The shaded red zones represent the predictive density at each input location. Top right: Gaussian-posterior samples with 1 data point. Bottom left: Gaussian-posterior samples with 2 data points. Bottom right: Gaussian-posterior samples with 3 data points [247]. See Figure 88 for the noise effects and Figure 89 for an animation of GP priors and posteriors. (Figure reproduced with permission of the authors.)
Consider test inputs x̃, with unknown outputs ỹ = f(x̃) ∈ Rm̃×1. The combined function values in the matrix [y, ỹ]T ∈ R(m+m̃)×1, as random variables, are distributed “normally” (Gaussian), i.e.,

[ f(x) ; f(x̃) ] = [ y ; ỹ ] = N( [ µ(x) ; µ(x̃) ] , [ K(x, x) + ν²I , K(x, x̃) ; KT(x, x̃) , K(x̃, x̃) ] ) . (376)
The Gaussian-process posterior distribution, i.e., the conditional Gaussian distribution for the test output ỹ given the training data (x, y = f(x)), is then (see Appendix 3 for the detailed derivation, which is simpler than in [130] and in [248])242

p(ỹ | x, y, x̃) = N(ỹ | µ̃, Cỹỹ) , (377)

µ̃ = µ(x̃) + K(x̃, x) [K(x, x) + ν²I]−1 [y − µ(x)] = K(x̃, x) [K(x, x) + ν²I]−1 y , (378)

Cỹỹ = K(x̃, x̃) − K(x̃, x) [K(x, x) + ν²I]−1 K(x, x̃) , (379)
where the mean was set to zero by Remark 8.4. In Figure 87, the number m of training points varied from 1 to 3. The Gaussian posterior sampling follows the same method as in Eq. (374), but with the covariance matrix Cỹỹ of Eq. (379).

242 See, e.g., [234], p. 16, [130], p. 87, [247], p. 4. The authors of [234], in their Appendix A.2, p. 200, referred to [248] “sec. 9.3” for the derivation of Eqs. (378)-(379), but there were several sections numbered “9.3” in [248]; the correct referencing should be [248], Chapter XIII “More on distributions,” Sec. 9.3 “Marginal distributions and conditional distributions,” p. 427.
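A sketch of the GP posterior, Eqs. (377)-(379), with zero prior mean (Remark 8.4); the kernel and the toy data are illustrative assumptions:

import numpy as np

def k(A, B, gamma=0.2):
    return np.exp(-gamma * (A[:, None] - B[None, :])**2)

x = np.array([-2.0, 0.5, 3.0])        # training inputs
y = np.sin(x)                         # training outputs
xt = np.linspace(-5.0, 5.0, 100)      # test inputs
nu2 = 1e-6                            # noise variance
A = np.linalg.inv(k(x, x) + nu2 * np.eye(len(x)))
mu = k(xt, x) @ A @ y                            # posterior mean, Eq. (378)
C = k(xt, xt) - k(xt, x) @ A @ k(x, xt)          # posterior covariance, Eq. (379)
L = np.linalg.cholesky(C + 1e-8 * np.eye(len(xt)))   # jitter for stability
sample = mu + L @ np.random.default_rng(1).standard_normal(len(xt))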
Figure 88: Gaussian process posterior samplings, noise effects (Section 8.3). Not all sampled curves in Figure 87 went through the data points, such as the black line in the present zoomed-in view of the bottom-left subfigure [247]. It is easy to make the sampled curves pass closer to the data points simply by reducing the noise variance ν² in Eqs. (376), (378), (379). See Figure 89 for an animation of GP priors and posteriors. (Figure reproduced with permission of the authors.)
Figure 89: Gaussian process posterior samplings, animation (Section 8.3). Interactive Gaus-
sian Process Visualization, Infinite curiosity. Click on the plot area to specify data points. See
Figures 87 and 88.
• Linear algebra: In essence, DL boils down to algebraic operations on large sets of data arranged in
multi-dimensional arrays, which are supported by all software frameworks, see Section 4.4.246
• Optimization: DL-libraries provide a variety of optimization algorithms that have proven effective in
training of neural networks, see Section 6.
• Frontend and API: Popular DL-frameworks provide an intuitive API, which supports accessibility for first-time learners and dissemination of novel methods to scientific fields beyond computer science. Python has become the prevailing programming language in DL, since it is more approachable for first-time learners.246

246 In the context of DL, multi-dimensional arrays are also referred to as tensors, although the data often lacks the defining properties of tensors as algebraic objects. The software framework TensorFlow even reflects that by its name.
Figure 90: Top deep-learning libraries in 2018 by the “Power Score” in [249]. By 2022, using
Google Trends, the popularity of different frameworks is significantly different; see Figure 91.
In what follows, a brief description of some of the most popular software frameworks is given.
9.1 TensorFlow
TensorFlow [250] is a free and open-source software library which is being developed by the Google
Brain research team, which, in turn, is part of Google’s AI division. TensorFlow emerged from Google’s
proprietary predecessor “DistBelief” and was released to the public in November 2015. Ever since its release,
TensorFlow has rapidly become the most popular software framework in the field of deep-learning and
maintains a leading position as of 2022, although it has been outgrown in popularity by its main competitor
PyTorch particularly in research.
In 2016, Google presented its own AI accelerator hardware for TensorFlow called “Tensor Processing
Unit” (TPU), which is built around an application-specific integrated circuit (ASIC) tailored to compu-
tations needed in training and evaluation of neural networks. DeepMind’s grandmaster-beating software
AlphaGo, see Section 2.3 and Figure 2, was trained using TPUs.247 TPUs were made available to the public
as part of “Google Cloud” in 2018. A single fourth-generation TPU device has a peak computing power of
275 teraflops for 16 bit floating point numbers (bfloat16) and 8 bit integers (int8). A fourth-generation
cloud TPU “pod”, which comprises 4096 TPUs, offers a peak computing power of 1.1 exaflops.248
247 See Google’s announcement of TPUs [251].
248 Recall that the ‘exa’-prefix translates into a factor of 1 × 10¹⁸. For comparison, Nvidia’s latest H100 GPU-based accelerator has a half-precision floating point (bfloat16) performance of 1 petaflops.
Figure 91: Google Trends of deep-learning software libraries (Section 9). The chart shows the popularity of the five DL-related software libraries most “powerful” in 2018 (TensorFlow, Keras, PyTorch, Caffe, Theano) over the last 5 years (as of July 2022). See also Figure 90 for the rankings of DL frameworks in 2018.
9.1.1 Keras
Keras [252] plays a special role among the software frameworks discussed here. As a matter of fact, it is not a full-featured DL-library; rather, Keras can be considered an interface to other libraries pro-
viding a high-level API, which was originally built for various backends including TensorFlow, Theano and
the (now deprecated) Microsoft Cognitive Toolkit (CNTK). As of version 2.4, TensorFlow is the only sup-
ported framework. Keras, which is free and also open-source, is meant to further simplify experimentation
with neural networks as compared to TensorFlow’s lower level API.
9.2 PyTorch
PyTorch [253] is a free and open-source library originally released to the public in January 2017.249 As of 2022, PyTorch has evolved from a research-oriented DL-framework into a fully fledged environment for both scientific work and industrial applications, and has caught up with, if not surpassed, TensorFlow in popularity. Primarily addressing researchers in its early days, PyTorch saw a
rapid growth not least for the—at that time—unique feature of dynamic computational graphs, which allows
for great flexibility and simplifies the creation of complex network architectures. As opposed to competitors such as TensorFlow, computational graphs, which represent compositions of mathematical operations and
allow for automatic differentiation of complex expressions, are created on the fly, i.e., at the very same time
as operations are performed. Static graphs, on the other hand, need to be created in a first step, before they
can be evaluated and automatically differentiated. Some examples of applying PyTorch to computational
mechanics are provided in the next two remarks.
Remark 9.1. Reinforcement Learning (RL) is a branch of machine-learning, in which computational meth-
ods and DL-methods naturally come together. Owing to the progress in DL, reinforcement learning, which
has its roots in the early days of cybernetics and machine learning, see, e.g., the survey [256], has gained
attention in the fields of automatic control and robotics again. In their opening to a more recent review, the
authors of [257] expect no less than “deep reinforcement-learning is poised to revolutionize artificial intel-
ligence and represents a step toward building autonomous systems with a higher level understanding of the
visual world.” RL is based on the concept that an autonomous agent learns complex tasks by trial-and-error.
249 See this blog post on the history of PyTorch [254] and the YouTube talk of Yann LeCun, PyTorch co-creator Soumith Chintala, Meta’s PyTorch lead Lin Qiao, and Meta’s CTO Mike Schroepfer [255].
Interacting with its environment, the agent receives a reward if it succeeds in solving a given task. Not least to speed up training by means of parallelization, simulation has become a key ingredient in modern
RL, where agents are typically trained in virtual environments, i.e., simulation models of the physical world.
Though (computer) games are classical benchmarks, in which DeepMind’s AlphaGo and AlphaZero models
excelled humans (see Section 1, Figure 2), deep RL has proven capable of dealing with real-world appli-
cations in the field of control and robotics, see, e.g., [258]. Based on PyTorch’s introductory tutorial (see Original website) on the classic cart-pole problem, i.e., an inverted pendulum (pole) mounted on a
moving base (cart), we developed an RL-model for the control of large-deformable beams, see Figure 92 and
the video illustrating the training progress. For some large-deformable beam formulations, see, e.g., [259],
[260], [261], [262]. ■
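To make the agent-environment-reward loop concrete, here is a minimal sketch, assuming the Gymnasium package and its CartPole-v1 environment (which mirrors the cart-pole tutorial setup); a random policy stands in for the policy an RL algorithm would actually learn:

```python
# Minimal sketch of the agent-environment-reward loop (assuming the Gymnasium
# package and its CartPole-v1 environment; a random policy stands in for the
# learned policy an RL algorithm such as DQN would provide).
import gymnasium as gym

env = gym.make("CartPole-v1")
obs, info = env.reset(seed=0)
total_reward = 0.0
for _ in range(200):
    action = env.action_space.sample()   # trial-and-error: act in the environment
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward               # environment returns a reward signal
    if terminated or truncated:          # pole fell or episode ended: start over
        obs, info = env.reset()
print(total_reward)
```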
Figure 92: Positioning and pointing control of large deformable beam (Section 9, Remark 9.1).
Reinforcement learning. The agent is trained to align the tip of the flexible beam with the
target position (red ball). For this purpose, the agent can move the base of the cantilever; the
environment returns the negative Euclidean distance of the beam’s tip to the target position as
“reward” in each time-step of the simulation. Simulation video. See also GitLab repository for
code and video.
9.3 JAX
JAX [263] is a free and open-source research project driven by Google. Released to the public in 2018,
JAX is one of the more recent software frameworks that have emerged during the current wave of AI. It is
described as being "Autograd and XLA" and as "a language for expressing and composing transformations
of numerical programs", i.e., JAX focuses on accelerating evaluations of algebraic expressions and,
in particular, gradient computations. As a matter of fact, its core API, which provides a mostly NumPy-compatible
interface to many mathematical operations, is rather trimmed-down in terms of DL-specific
functions as compared to the broad scope of functionality offered by TensorFlow and PyTorch, for instance,
which, for this reason, are often referred to as end-to-end frameworks. JAX, on the other hand, considers
itself a system that facilitates "transformations" like gradient computation, just-in-time compilation, and automatic
vectorization of compositions of functions on parallel hardware such as GPUs and TPUs. A higher-level
interface to JAX's functionality, which is specifically made for ML purposes, is available through the FLAX
framework [264]. FLAX provides many fundamental building blocks essential for the creation and training of
neural networks.
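The following minimal sketch (our own toy example) illustrates the three transformations named above, composed on a scalar function:

```python
# Minimal sketch (our own toy example) of JAX's composable transformations:
# grad (differentiation), jit (XLA compilation), vmap (automatic vectorization).
import jax
import jax.numpy as jnp

def f(x):
    return jnp.sin(x) * x ** 2

df = jax.grad(f)                 # df/dx = 2 x sin(x) + x^2 cos(x)
fast_df = jax.jit(df)            # just-in-time compile the gradient via XLA
batched_df = jax.vmap(fast_df)   # vectorize over a batch of inputs

print(batched_df(jnp.linspace(0.0, 1.0, 8)))
```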
$$\mathbf{r} = [r_1, r_2, r_3]^T = \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \bullet \nabla)\, \mathbf{u} + \nabla p - \frac{1}{Re} \nabla^2 \mathbf{u} \in \mathbb{R}^{3 \times 1} \; , \quad \text{and} \quad r_4 = \nabla \bullet \mathbf{u} \; , \quad (382)$$
where Nf is the number of residual (collocation) points, which could number in the millions, generated
randomly,251 the residual r = [r1, r2, r3]T in Eq. (382)1 is the left-hand side of the balance of linear momentum
250
GPU-computing and automatic differentiation are by no means new to scientific computing, not least in the field
of computational mechanics. Project Chrono (see Original website), for instance, is well known for its support of
GPU-computing in problems of flexible multi-body and particle systems. The general purpose finite-element code
Netgen/NGSolve [265] (see Original website) offers a great degree of flexibility owing to its automatic differentiation
capabilities. Well-established commercial codes, on the other hand, are often built on a comparatively old codebase,
which dates back to times before the advent of GPU-computing.
251
For incompressible flow past a cylinder, the computational domain of dimension [−7.5, 28.5] × [−20, 20] × [0, 12.5]—with coordinates (x, y, z) non-dimensionalized by the diameter of the cylinder, with axis along the z direction, and going through the point (x, y) = (0, 0)—contained 3 × 10⁶ residual (collocation) points [268].
Figure 93: DL-frameworks in nonlinear finite-element problems (Section 9.4). The computational
efficiency of a PyTorch-based (Version 1.8) finite-element implementation was compared
against the state-of-the-art general-purpose code Netgen/NGSolve [265] for a problem of nonlinear
elasticity, see the slides of the presentation and the corresponding video. The figures
show timings (in seconds) for evaluations of the strain energy (top left), the internal forces
(residual, top right), the element-stiffness matrices (bottom left), and the global stiffness matrix
(bottom right) against the number of elements. Owing to PyTorch's parallel computation
capacity, the simple Python implementation could compete with the highly optimized finite-element
code, in particular when computations were moved to a GPU (NVIDIA Tesla V100).
(the right-hand side being zero), the residual r4 in Eq. (382)2 is the left-hand side of the incompressibility
constraint, and $(x_f^{(i)}, t_f^{(i)})$ the space-time coordinates of the collocation point (i).
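A minimal sketch of how such residual losses are assembled (PyTorch; a 1-D heat-equation residual stands in for the Navier-Stokes residual of Eq. (382), and all names are ours):

```python
# Minimal sketch (PyTorch; all names ours) of assembling a PINN residual loss at
# random collocation points; a 1-D heat-equation residual r = u_t - u_xx stands
# in for the Navier-Stokes residual of Eq. (382).
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 50), nn.Tanh(),
                    nn.Linear(50, 50), nn.Tanh(), nn.Linear(50, 1))

def pde_residual(xt):                    # xt: (N, 2), columns (x, t)
    xt.requires_grad_(True)
    u = net(xt)
    du = torch.autograd.grad(u.sum(), xt, create_graph=True)[0]
    u_x, u_t = du[:, :1], du[:, 1:]
    u_xx = torch.autograd.grad(u_x.sum(), xt, create_graph=True)[0][:, :1]
    return u_t - u_xx                    # residual evaluated by autodiff

xt_f = torch.rand(1024, 2)               # Nf random collocation points
loss_pde = pde_residual(xt_f).pow(2).mean()
# total loss: L = cP*loss_pde + cI*loss_ic + cB*loss_bc + cd*loss_data (Figure 94)
```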
Some review papers on PINN are [269] [270]; the latter is more general than [268], which was restricted
to fluid mechanics, and touches on many different fields. Table 6 lists PINN frameworks that are
currently actively developed; a few selected solvers among them are summarized below.
☛ DeepXDE [271], one of the first PINN frameworks and a solver (Table 6), was developed in Python, with
a TensorFlow backend, for both teaching and research. This framework can solve both forward problems
("given initial and boundary conditions") and inverse problems ("given some extra measures"), with domains
having complex geometry. According to the authors, DeepXDE is user-friendly, with compact user
code resembling the mathematical formulation of the problem, and customizable to different types of mechanics
problems. The site contains many published papers and a large number of demo problems: Poisson equation,
Burgers equation, diffusion-reaction equation, wave-propagation equation, fractional PDEs, etc.
Figure 94: Physics-Informed Neural Networks (PINN) concept (Section 9.5). The goal is
to find the optimal network parameters θ⋆ (weights) and PDE parameters λ⋆ that minimize
the total weighted loss function L(θ, λ), which is a linear combination of four loss functions:
(1) the residual of the PDE, $L_{PDE}$; (2) the loss due to initial conditions, $L_{IC}$; (3) the loss due to
boundary conditions, $L_{BC}$; (4) the loss due to known (labeled) data, $L_{data}$; with $c_P$, $c_I$, $c_B$, $c_d$ being
the combination coefficients. With the space-time coordinates $(x_1, \ldots, x_n, t)$ as inputs, the
neural network produces an approximated multi-physics solution û = {u, v, p, ϕ}, of which
the derivatives, estimated by automatic differentiation, are used to evaluate the loss functions.
If the total loss L is not less than a tolerance, its gradients with respect to the parameters (θ, λ)
are used to update these parameters in a descent direction toward a local minimum of L [268].
In addition, there are demos on inverse problems and operator learning. Three more backends beyond TensorFlow,
which was the only backend reported in [270], have since been added to DeepXDE: PyTorch, JAX, and Paddle.252
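A minimal DeepXDE usage sketch, following the pattern of the official demos (the TensorFlow backend is assumed, and module paths may differ slightly between DeepXDE versions):

```python
# Minimal sketch of DeepXDE usage (following the pattern of the official demos;
# the TensorFlow backend is assumed, and module paths may differ slightly
# between DeepXDE versions): solve -u'' = pi^2 sin(pi x) on (-1, 1), u(+-1) = 0.
import numpy as np
import deepxde as dde
from deepxde.backend import tf

def pde(x, y):
    dy_xx = dde.grad.hessian(y, x)       # u'' via automatic differentiation
    return -dy_xx - np.pi ** 2 * tf.sin(np.pi * x)

geom = dde.geometry.Interval(-1, 1)
bc = dde.icbc.DirichletBC(geom, lambda x: 0.0, lambda x, on_boundary: on_boundary)
data = dde.data.PDE(geom, pde, bc, num_domain=64, num_boundary=2)

net = dde.nn.FNN([1] + [50] * 3 + [1], "tanh", "Glorot uniform")
model = dde.Model(data, net)
model.compile("adam", lr=1e-3)
model.train(iterations=10000)            # older versions use epochs=10000
```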
☛ NeuroDiffEq [274], a solver, was developed at about the same time as DeepXDE, with PyTorch as
the backend. Even though it was written that the authors were "actively working on extending NeuroDiffEq
to support three spatial dimensions," this feature is not ready, and can be worked around by including the
3D boundary conditions in the loss function.253 Even though, in principle, NeuroDiffEq can be used to
solve PDEs of interest to engineering (e.g., the Navier-Stokes equations), there were no such examples in the
official documentation, except for a 2D Laplace equation and a 1D heat equation. The backend is limited to
PyTorch, and the site did not list any papers, either by the developers or by others, using this framework.
☛ NeuralPDE [275], a solver, was developed in Julia, a relatively new language that is 20 years younger
than Python and has a speed edge over Python in machine learning, but not in data science.254 Demos are
252
See Section Working with different backends. See also [269].
253
“All you need is to import GenericSolver from neurodiffeq.solvers, and Generator3D from
neurodiffeq.generators. The catch is that currently there is no reparametrization defined in
neurodiffeq.conditions to satisfy 3D boundary conditions,” which can be hacked into the loss function
“by either adding another element in your equation system or overwriting the additional_loss method of
GenericSolve.” Private communication with a developer of NeuroDiffEq on 2022.10.08.
254
There are many comparisons of Julia versus Python on the web; one is Julia vs Python: Which is Best to Learn First?, by Zulie Rane on 2022.02.05, updated on 2022.10.01.
Table 6: PINN frameworks (Section 9.5) being actively developed [270] [269]. A solver solves
the problem defined by users. A wrapper does not solve, but only wraps low-level functions
from other libraries (e.g., PyTorch) into high-level functions that are convenient for users to
implement PINN to solve the problem.
given for ODEs and generic PDEs, such as the coupled nonlinear hyperbolic PDEs of the form:
$$\frac{\partial^2 u}{\partial t^2} = \frac{a}{x^n} \frac{\partial}{\partial x}\!\left( x^n \frac{\partial u}{\partial x} \right) + u\, f\!\left(\frac{u}{w}\right) \; , \qquad \frac{\partial^2 w}{\partial t^2} = \frac{b}{x^n} \frac{\partial}{\partial x}\!\left( x^n \frac{\partial w}{\partial x} \right) + w\, g\!\left(\frac{u}{w}\right) \; , \qquad (383)$$
with f and g being arbitrary functions. Initial and boundary conditions are given, together with an exact
solution used to compute the error of the numerical solution, Figure 95. The site did not list any papers,
either by the developers or by others, using this framework.
Figure 95: Coupled nonlinear hyperbolic equations (Section 9.5). Analytical solution, pre-
dicted solution by NeuralPDE [275] and error for the coupled nonlinear hyperbolic equations
in Eq. (383).
Additional PINN software packages other than those in Table 6 are listed and summarized in [269].
Remark 9.2. PINN and activation functions. Deep neural networks (DNN), having at least two hidden
layers, with the ReLU activation function (Figure 24) were shown to correspond to linear finite-element
interpolation [280], since piecewise linear functions can be written as DNNs with ReLU activation functions
[281].
When using the strong form, such as the PDE in Eq. (383), which contains second partial derivatives
with respect to x, ReLU is unsuitable, since its second derivative is zero; activation functions such as the logistic
sigmoid (Figure 30), hyperbolic tangent (Figure 31), or the swish function (Figure 139) are recommended instead.
Because of the presence of the second partial derivatives with respect to the spatial coordinates in the general
PDE Eq. (384), particularized to the 2D Navier-Stokes Eq. (385):
$$\mathbf{u}_t + N[\mathbf{u}; \lambda] = 0 \;\text{ for } x \in \Omega \, , \; t \in [0, T] \; , \quad \text{with} \quad (384)$$
$$\mathbf{u}_t = \begin{bmatrix} u_t \\ v_t \end{bmatrix} , \quad N[\mathbf{u}; \lambda] = \begin{bmatrix} \lambda_1 (u u_x + v u_y) + p_x - \lambda_2 (u_{xx} + u_{yy}) \\ \lambda_1 (u v_x + v v_y) + p_y - \lambda_2 (v_{xx} + v_{yy}) \end{bmatrix} , \quad \lambda = \begin{bmatrix} \lambda_1 \\ \lambda_2 \end{bmatrix} , \quad (385)$$
the hyperbolic tangent (tanh) was used as activation function [282]. ■
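The vanishing second derivative of ReLU can be checked directly by automatic differentiation; a minimal sketch (PyTorch, our own toy check):

```python
# Minimal sketch (PyTorch; our own toy check): the second derivative of ReLU
# vanishes (almost everywhere), while that of tanh does not, which is why
# smooth activations are preferred for strong-form PINN losses.
import torch

x = torch.linspace(-2.0, 2.0, 5, requires_grad=True)
for act in (torch.relu, torch.tanh):
    y = act(x)
    g1 = torch.autograd.grad(y.sum(), x, create_graph=True)[0]  # first derivative
    g2 = torch.autograd.grad(g1.sum(), x)[0]                    # second derivative
    print(act.__name__, g2)
# relu: second derivative is zero everywhere it is defined
# tanh: second derivative is nonzero, so second-order residuals remain trainable
```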
Remark 9.3. Variational PINN. Similar to the finite element method, in which the weak form of the PDE,
rather than the strong form as in Remark 9.2, reduces the differentiability requirements on the trial
solution, and is discretized with numerical integration used to evaluate the coefficients of the various
matrices (e.g., mass, stiffness), PINN can be formulated using the weak form, instead of the strong form
such as Eq. (385), at the expense of having to perform numerical integration (quadrature) [283] [284] [285].
Examples of 1-D PDEs were given in [283] in which the activation function was a sine function defined
over the interval (−1, 1). To illustrate the concept, consider the following simple 1-D problem [283] (which
could model the axial displacement u(x) of an elastic bar under distributed load f (x), with prescribed end
displacements):
$$-u''(x) = -\frac{d^2 u(x)}{(dx)^2} = f(x) \, , \;\text{ for } x \in (-1, 1) \; , \quad u(-1) = g \; , \quad u(1) = h \; , \quad (386)$$
where g and h are constants. Three variational forms of the strong form Eq. (386) are:
$$A_k(u, v) = B(f, v) := \int_{-1}^{+1} f(x)\, v(x)\, dx \; , \quad \forall v \text{ with } v(-1) = v(+1) = 0 \; , \text{ and} \quad (387)$$
$$A_1(u, v) := -\int_{-1}^{+1} u''(x)\, v(x)\, dx \; , \quad (388)$$
$$A_2(u, v) := \int_{-1}^{+1} u'(x)\, v'(x)\, dx \; , \quad (389)$$
$$A_3(u, v) := -\int_{-1}^{+1} u(x)\, v''(x)\, dx + \left[ u(x)\, v'(x) \right]_{x=-1}^{x=+1} \; , \quad (390)$$
where the familiar symmetric operator $A_2$ in Eq. (389) is the weak form, with the non-symmetric operator
$A_1$ in Eq. (388) retaining the second derivative on the solution u(x), and the non-symmetric operator $A_3$ in
Eq. (390) retaining the second derivative on the test function v(x), in addition to the boundary terms. Upon
replacing the solution u(x) by its neural network (NN) approximation uN N (x) obtained using one single
hidden layer (L = 1 in Figure 23) with y = f (1) (x) and layer width m(1) , and using the sine activation
function on the interval (−1, +1):
$$u_{NN}(x) = y(x) = f^{(1)}(x) = \sum_{i=1}^{m^{(1)}} u_{NN_i}(x) = \sum_{i=1}^{m^{(1)}} c_i \sin(w_i x + b_i) \; , \quad (391)$$
which does not satisfy the essential boundary conditions (whereas the solution u(x) does), the loss function
for the VPINN method is then the squared residual of the variational form plus the squared residual of the
essential boundary conditions:
$$L(\theta) = \left[ A_k(u_{NN}, v) - B(f, v) \right]^2 + \left[ u_{NN}(x) - u(x) \right]^2_{x \in \{-1,+1\}} \; , \text{ and} \quad (392)$$
$$\theta^\star = \{ w^\star, b^\star, c^\star \} = \underset{\theta}{\arg\min}\; L(\theta) \; , \quad (393)$$
where θ⋆ = {w⋆ , b⋆ , c⋆ } are the optimal parameters for Eq. (391), with the goal of enforcing the variational
form and essential boundary conditions in Eq. (387). More details are in [283] [284].
For a symmetric variational form such as A2 (u, v) in Eq. (389), a potential energy exists, and can be
minimized to obtain the neural approximate solution uN N (x):
$$J(u) = \frac{1}{2} A_2(u, u) - B(f, u) \; , \quad \tilde{J}(\theta) = \frac{1}{2} A_2(u_{NN}, u_{NN}) - B(f, u_{NN}) \; , \quad (394)$$
$$\theta^\star = \{ w^\star, b^\star, c^\star \} = \underset{\theta}{\arg\min}\; L(\theta) \; , \quad L(\theta) = \tilde{J}(\theta) + \left[ u_{NN}(x) - u(x) \right]^2_{x \in \{-1,+1\}} \; , \quad (395)$$
which is similar to the approach taken in [280], where the ReLU activation function (Figure 24) was used,
and where a constraint on the NN parameters was used to satisfy an essential boundary condition. ■
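A minimal sketch (PyTorch; the width, load, quadrature, and optimizer settings are our own choices) of the sine-network approximation Eq. (391) trained by minimizing the discretized energy loss of Eq. (395) with a boundary penalty:

```python
# Minimal sketch (PyTorch; width, load, quadrature, and optimizer settings are
# our own choices) of the sine network of Eq. (391), trained by minimizing the
# discretized energy loss of Eq. (395) with a boundary penalty, for -u'' = f
# on (-1, 1) with u(-1) = g, u(1) = h.
import torch

m = 20                                    # hidden-layer width m^(1)
w = torch.randn(m, requires_grad=True)    # weights w_i
b = torch.randn(m, requires_grad=True)    # biases b_i
c = torch.randn(m, requires_grad=True)    # output coefficients c_i

def u_nn(x):                              # Eq. (391): sum_i c_i sin(w_i x + b_i)
    return torch.sin(x * w + b) @ c.unsqueeze(1)

f = lambda x: torch.pi ** 2 * torch.sin(torch.pi * x)   # example load f(x)
g = h = 0.0                                             # end displacements

x = torch.linspace(-1.0, 1.0, 101)[1:-1].unsqueeze(1).requires_grad_(True)
opt = torch.optim.Adam([w, b, c], lr=1e-2)
for step in range(2000):
    opt.zero_grad()
    u = u_nn(x)
    du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    dx = 2.0 / x.shape[0]
    energy = (0.5 * du ** 2 - f(x) * u).sum() * dx      # J~ of Eq. (394)
    bc = (u_nn(torch.tensor([[-1.0]])) - g).pow(2).sum() \
       + (u_nn(torch.tensor([[1.0]])) - h).pow(2).sum() # boundary penalty
    loss = energy + bc                                  # L of Eq. (395)
    loss.backward()
    opt.step()
```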
Remark 9.4. PINN, kernel machines, training, convergence problems. There is a relationship between
PINN and kernel machines in Section 8. Specifically, the neural tangent kernel [232], which “captures the
behavior of fully-connected neural networks in the infinite width limit during training via gradient descent”
was used to understand when and why PINN failed to train [286], whose authors found a “remarkable
discrepancy in the convergence rate of the different loss components contributing to the total training error,”
and proposed a new gradient descent algorithm to fix the problem.
It was often reported that PINN optimization converged to “solutions that lacked physical behaviors,”
and “reduced-domain methods improved convergence behavior of PINNs”; see [287], where a dynamical
system of the form below was studied:
$$\frac{d u_{NN}(x)}{dx} = f(u_{NN}(x)) \; , \quad L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \left[ \frac{d u_{NN}(x^{(i)})}{dx} - f(u_{NN}(x^{(i)})) \right]^2 \; , \quad (396)$$
with L(θ) being called the "physics" loss function. An example with $f(u(x)) = u(x)\left(1 - u^2(x)\right)$ was
studied to show that the “physics loss optimization predominantly results in convergence issues, leading
to incorrectly learned system dynamics”; see [287], where it was found that “solutions corresponding to
nonphysical system dynamics [could] be dominant in the physics loss landscape and optimization,” and that
“reducing the computational domain [lowered] the optimization complexity and chance of getting trapped
with nonphysical solutions.”
See also [288] for incorporating the Lyapunov stability concept into PINN formulation for CFD to
“improve the generalization error and reduce the prediction uncertainty.” ■
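A minimal sketch (PyTorch; discretization choices ours) of the "physics" loss of Eq. (396) for the example $f(u) = u(1 - u^2)$ studied in [287]:

```python
# Minimal sketch (PyTorch; discretization ours) of the "physics" loss of
# Eq. (396) for du/dx = f(u) with f(u) = u (1 - u^2), as studied in [287].
import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 1))
x = torch.linspace(0.0, 5.0, 100).unsqueeze(1).requires_grad_(True)

u = net(x)
du = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
residual = du - u * (1.0 - u ** 2)        # du/dx - f(u) at N collocation points
physics_loss = residual.pow(2).mean()     # Eq. (396), to be minimized over net
```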
Remark 9.5. PINN and attention architecture. In [228], PIANN, a Physics-Informed Attention-based Neural
Network, was proposed to connect PINN to the attention architecture, discussed in Section 7.4.3, to solve
a hyperbolic PDE with shock wave. See Remark 7.7 and Remark 11.11. ■
Remark 9.6. “Physics-Informed Learning Machine” (PILM) 2021 US Patent [289]. First note that the
patent title used the phrase “learning machine,” instead of “machine learning,” indicating that the emphasis
of the patent appeared to be on "machine," instead of on "learning" [289]. PINN was not mentioned, even
though it was first introduced in [290] [291], which were cited by the patent authors in their original PINN paper
[282].255 The abstract of this 2021 PILM US Patent [289] reads as follows:
“A method for analyzing an object includes modeling the object with a differential equation,
such as a linear partial differential equation (PDE), and sampling data associated with the
differential equation. The method uses a probability distribution device to obtain the solution
to the differential equation. The method eliminates use of discretization of the differential
equation.”
255
The 2019 paper [282] was a merger of a two-part preprint [292] [293].
The first sentence is nothing new to the readers. In the second sentence, a "probability distribution device"
could be replaced by a neural network, which would make PILM into PINN. This patent mainly focused on
Gaussian processes (Section 8.3) as an example of probability distribution (see Figure 4 in [289]). The
third sentence would be the claim-to-fame of PILM, and also of PINN. ■
Remark 9.7. Using PINN frameworks. While undergraduates with limited knowledge of the theory of the
Finite Element Method can run FE analyses of complicated structures and complex domain geometries
on a laptop using commercial FE codes, solving problems with exceedingly simple domain geometry using
a PINN framework such as DeepXDE does require knowledge of the governing PDEs, initial and boundary
conditions, artificial neural networks and frameworks (such as PyTorch, TensorFlow, etc.), the Python language,
and a more powerful computer. In addition, because there are many parameters to fiddle with,
first-time users could encounter disappointment and doubt when trying to solve a new problem beyond the
sample problems posted on the DeepXDE website. It is not clear when PINN methods will reach the
level of commercial FE codes that undergraduates can use, or whether they will just fade away after an initial
period of excitement, like the meshless methods before them. ■
$$B(1 \pm r_1 d,\, 0,\, 0), \; C(1 \pm r_2 d,\, 1 \pm r_3 d,\, \pm r_4 d), \; D(\pm r_5 d,\, 1 \pm r_6 d,\, 0), \; E(\pm r_7 d,\, \pm r_8 d,\, 1 \pm r_9 d),$$
$$F(1 \pm r_{10} d,\, \pm r_{11} d,\, 1 \pm r_{12} d), \; G(1 \pm r_{13} d,\, 1 \pm r_{14} d,\, 1 \pm r_{15} d), \; H(\pm r_{16} d,\, 1 \pm r_{17} d,\, 1 \pm r_{18} d), \quad (400)$$
where ri ∈ [0, 1], i = 1, . . . , 18 were 18 random numbers, see Figure 97. The elements were collected
into five groups according to five different degrees of maximum distortion (maximum possible nodal dis-
placement) d ∈ {0.1, 0.2, 0.3, 0.4, 0.5}. Elements in the set d = 0.5 would only be highly distorted with ri
having values closer to 1, but may only be slightly distorted with ri having values closer to 0.259 To avoid
258
Conventional linear hexahedra are known to suffer from locking, see, e.g., [39] Section “10.3.2 Locking,” which
can be alleviated by “reduced integration,” i.e., using a single integration point (1 × 1 × 1). The concept in [38],
however, immediately translates to higher-order shape functions and non-conventional finite element formulations
(e.g., mixed formulation).
259
In fact, a somewhat ambiguous description of the random generation of elements was provided in [38]. On the one
hand, the authors stated that ". . . coordinates of nodes are changed using a uniform random number r . . . ," and
did not distinguish the random numbers in Eq. (400) by subscripts. On the other hand, they noted that exaggerated
distortion may occur if nodal coordinates of an element were changed independently, and introduced the constraints
on the distortion mentioned above in that context. If the same random number r were used for all nodal coordinates,
all elements generated would exhibit the same mode of distortion.
Figure 97: Creation of randomly distorted elements (Section 10). Hexahedra forming the training
and validation sets are created by randomly displacing the nodes of a regular hexahedron.
To comply with the normalization procedure, node A remains fixed, node B is shifted along
the x-axis, and node C is displaced within the xy-plane. For each of the remaining nodes (E, F,
G, H), all three nodal coordinates are varied randomly. The elements are grouped according to
the maximum possible nodal displacement d ∈ {0.1, 0.2, 0.3, 0.4, 0.5}, from which each of the
18 nodal displacements is obtained upon multiplication with a random number $r_i \in [0, 1]$,
i = 1, . . . , 18 [38], see Eq. (400). Each of the 8 nodal zones in which the corresponding node
can be placed randomly is shown in red; all nodal zones are cubes, except for node B (an
interval on the x-axis) and for node C (a square in the plane (x, y)).
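A minimal sketch (NumPy; our own reading of Eq. (400) and Figure 97) of the random generation of distorted hexahedra:

```python
# Minimal sketch (NumPy; our own reading of Eq. (400) and Figure 97) of the
# random generation of distorted hexahedra: node A stays fixed, B moves along
# the x-axis, and the other nodes are perturbed per Eq. (400).
import numpy as np

def random_hexahedron(d, rng):
    r = rng.uniform(0.0, 1.0, 18)          # random numbers r_1, ..., r_18
    s = rng.choice([-1.0, 1.0], 18)        # the +/- signs in Eq. (400)
    A = np.array([0.0, 0.0, 0.0])
    B = np.array([1 + s[0]*r[0]*d, 0.0, 0.0])
    C = np.array([1 + s[1]*r[1]*d, 1 + s[2]*r[2]*d, s[3]*r[3]*d])
    D = np.array([s[4]*r[4]*d, 1 + s[5]*r[5]*d, 0.0])
    E = np.array([s[6]*r[6]*d, s[7]*r[7]*d, 1 + s[8]*r[8]*d])
    F = np.array([1 + s[9]*r[9]*d, s[10]*r[10]*d, 1 + s[11]*r[11]*d])
    G = np.array([1 + s[12]*r[12]*d, 1 + s[13]*r[13]*d, 1 + s[14]*r[14]*d])
    H = np.array([s[15]*r[15]*d, 1 + s[16]*r[16]*d, 1 + s[17]*r[17]*d])
    return np.vstack([A, B, C, D, E, F, G, H])   # 8 x 3 nodal coordinates

rng = np.random.default_rng(0)
element = random_hexahedron(d=0.3, rng=rng)      # one element of the d = 0.3 group
```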
neural network performed a classification task,261 where each class corresponded to the minimum number
of integration points q along a local coordinate axis for a maximum error $e_{tol} = 10^{-3}$. Figure 98 presents
the distribution of 10,000 elements, generated randomly for two degrees of maximum possible distortion,
d = 0.1 and d = 0.5, using the method of Figure 97, and classified by the minimum number of integration
points. Similar results for d = 0.3 were presented in [38], supporting the expected conclusion that "as the
shape is distorted, more integration points are needed".
Figure 98: Method 1, Optimal number of integration points, feasibility (Section 10.2.1). Dis-
tribution of minimum numbers of integration points along a local coordinate axis for a maximum
error of etol = 10−3 among 10,000 elements generated randomly using the method in Fig-
ure 97. For d = 0.1, all elements were only slightly distorted, and required 3 integration points
each. For d = 0.5, close to 5,000 elements required 4 integration points each; very few ele-
ments required 3 or 10 integration points [38]. (Figure reproduced with permission of the authors.)
ranging from 1 to 5, keeping the width fixed at 50) to determine the optimal structure of their classifier
network, which used the logistic sigmoid as activation function.264,265 Their optimal feed-forward network,
composed of 3 hidden layers of 50 neurons each, correctly predicted the number of integration points needed
for 98.6% of the elements in the training set, and for 81.6% of the elements in the validation set, Figure 99.
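A minimal sketch (PyTorch) of such a classifier; the 3 hidden layers of 50 neurons and the logistic sigmoid follow Figure 99, while the number of output classes and the training details are our assumptions:

```python
# Minimal sketch (PyTorch) of the Method-1 classifier: 18 normalized nodal
# coordinates in, one class per candidate number of integration points out.
# Layer sizes follow Figure 99; the class count is our assumption.
import torch.nn as nn

n_in, n_hidden, n_classes = 18, 50, 8        # e.g., q in {3, ..., 10}
layers = []
for _ in range(3):                           # 3 hidden layers (Figure 99)
    layers += [nn.Linear(n_in, n_hidden), nn.Sigmoid()]
    n_in = n_hidden
layers += [nn.Linear(n_hidden, n_classes)]   # class scores; train w/ cross-entropy
classifier = nn.Sequential(*layers)
```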
Figure 99: Method 1, Optimal network architecture for training (Section 10.2.2). The number
of hidden layers varies from 1 to 5, keeping the number of neurons per hidden layer constant
at 50. The network with 3 hidden layers provided the highest accuracy for both the training set
("patterns") at 98.6% and for the validation set ("test patterns") at 81.6%. Increasing the network
depth does not necessarily increase the accuracy [38]. (Table reproduced with permission of the
authors.)
For Table (b) in Figure 100, the numbers in column “Total” add up to 5 + 1553 + 2574 + 656 + 135 +
56 + 21 + 5 = 5005 elements in the validation set, which should have 5000, as written in [38]. Was there a
misprint? The diagonal coefficients add up to 1430 + 2222 + 386 + 36 + 6 = 4080 elements with correctly
predicted number of integration points, yielding the accuracy of 4080 / 5000 = 81.6%, which agrees with
Row 3 in Figure 99. As a result of this agreement, the number of elements in the validation set (“test
patterns”) should be 5000, and not 5005, i.e., there was a misprint in column “Total”.
Figure 100: Method 1, application phase (Section 10.2.3). The numbers of quadrature points
predicted by the neural network were compared to the minimum numbers of quadrature points
for maximum error etol = 10−3 [38]. Table (a) shows the results for the training set (“pat-
terns”), and Table (b) for the validation set. (Table reproduced with permission of the authors.)
i.e.,
$$\{ w_{i,j,k}^{\text{opt}} \} = \underset{w_{i,j,k}}{\arg\min}\; R_{\text{error}} \; , \quad (403)$$
were retained as target values for training, and identified with the superscript "opt", standing for "optimal".
The corresponding optimally integrated coefficients in the element stiffness matrix are denoted by $k_{ij}^{q,\text{opt}}$,
with $i, j = 1, 2, \ldots$, as they appeared in Eq. (402).
It turns out that a reduction of the quadrature error by correcting the quadrature weights is not feasible
for all element shapes. Undistorted elements, for instance, which are already integrated exactly using stan-
dard quadrature weights, naturally do not admit improvements. These 20,000 elements were classified into
two categories A and B [38], Figure 101. Quadrature weight correction was not effective for Category A
(Rerror ≥ 1), but effective for Category B (Rerror < 1). Figure 101 shows that elements with a higher degree
of maximum distortion were more likely to benefit from the quadrature weight correction as compared to
weakly distorted elements. Recall that among the elements of the group d = 0.5 were elements that were
only slightly distorted (due to the random factors $r_i$ in Eq. (400) being close to zero), and therefore would
not benefit from quadrature weight correction (Category A); there were 1489 such elements, Figure 101.
Figure 101: Method 2, Quadrature weight correction, feasibility (Section 10.3.1). Each element
was tested 1 million times with randomly generated sets of quadrature weights. There
were 4000 elements in each of the 5 groups with different degrees of maximum distortion, d.
Quadrature weight correction effectiveness increased with element distortion. Weakly distorted
elements (d = 0.1) did not have any improvement, and thus belonged to Category A (error ratio
Rerror ≥ 1). As d increased, the size of Category A decreased, while the size of Category B
increased. Among the 4000 elements in the group with d = 0.5, the stiffness matrices of 2511
elements could be integrated more accurately by correcting their quadrature weights (Category
B, Rerror < 1) [38]. (Table reproduced with permission of the authors.)
[Reproduced Table 4 of [38] (the table shown in Figure 102), with rows the correct category and columns the category estimated by the neural network. (a) Training patterns: A: 3707 (A), 29 (B), total 3736; B: 70 (A), 1194 (B), total 1264. (b) Test patterns: A: 3682 (A), 155 (B), total 3837; B: 224 (A), 939 (B), total 1163.]
The 18 non-trivial nodal coordinates, normalized following the proposed normalization procedure for linear
hexahedra, served as inputs, i.e., $x = y^{(0)} \in \mathbb{R}^{18 \times 1}$. The output of the
neural network was a single scalar $\tilde{y} \in \mathbb{R}$, where $\tilde{y} = 0$ indicated an element of Category A and $\tilde{y} = 1$ an
element of Category B.
Out of the 20,000 elements generated, 10,000 elements were selected to train the classifier network,
for which both the training set and the validation set comprised 5000 elements each [38].266 The optimal
neural network in terms of classification accuracy for this application had 4 hidden layers with 30 neurons
per layer; see Figure 102. The trained neural network succeeded in predicting the correct category for 98 %
and 92 % of the elements in the training set and in the validation set, respectively.
In the second stage, a second neural network was trained to predict the corrections to the quadrature
weights for all those elements which allowed a reduction of the quadrature error. Again, the 18 non-trivial
nodal coordinates of a normalized hexahedron were input to the neural network, i.e., $x = y^{(0)} \in \mathbb{R}^{18 \times 1}$.
The outputs of the neural network $\tilde{y} \in \mathbb{R}^{8 \times 1}$ represented the eight correction factors
$\{ w_{i,j,k}^{\text{opt}} \}$, $i, j, k = 1, 2$, to the standard weights of the Gauss-Legendre quadrature. The authors of [38] stated that 10,000 elements
with an error-reduction ratio Rerror < 1 formed equally large training and validation sets, comprising
5000 elements each.267 A neural network with 5 hidden layers of 50 neurons was reported to perform best in
predicting the corrections to the quadrature weights. The normalized error distribution268 of the correction
factors obtained is illustrated in Figure 103.
10.3.3 Method 2, application phase
The effectiveness of the numerical quadrature with corrected quadrature weights was already presented
in Figure 10 in Section 2.3.1, with the distribution of the error-reduction ratio Rerror defined in Eq. (402) for
the training set (patterns) (a) and the validation set (test patterns) (b). Further explanation is provided here
266
Even though a reason for not using the entire set of 20,000 elements was not given, it could be guessed that the
authors of [38] would want the size of the training set and of the validation set to be the same as that in Method 1,
Section 10.2. Moreover, even though details were not given, the selection of these 10,000 elements would likely be
a random process.
267
According to Figure 101, only 4868 out of in total 20,000 elements generated belonged to Category B, for which
Rerror < 1 held. Further details on the 10,000 elements that were used for training and validation were not
provided in [38].
268
No details on the normalization procedure were provided in the paper.
Figure 102: Method 2, training phase, classifier network (Section 10.3.2). The training and
validation sets comprised 5000 elements each, of which 3707 and 3682, respectively, belonged
to Category A (no improvement upon weight correction). A first neural network with 4 hidden
layers of 30 neurons correctly classified (3707 + 1194)/5000 ≈ 98 % of the elements in the training
set (a) and (3682 + 939)/5000 ≈ 92 % of the elements in the validation set (b). See also Figure 10
in Section 2.3.1 [38]. (Table reproduced with permission of the authors.)
Figure 103: Method 2, training phase, regression network (Section 10.3.2). A second neural
network estimated 8 correction factors $\{w_{i,j,k}\}$, with $i, j, k \in \{1, 2\}$, to be multiplied by
the standard quadrature weights for each element. Distribution of normalized errors, i.e., the
normalized differences between the predicted weights (outputs) $O_j$ and the true weights $T_j$, for
the elements of the training set (red) and the test set (blue). For both sets, which consist of
5000 × 8 = 40,000 correction factors each, the error has a mean of zero and seems to obey a
normal distribution [38]. (Figure reproduced with permission of the authors.)
Figure 104: Three scales in data-driven fault-reactivation simulations (Sections 2.3.2, 11.1,
11.3.5). Relative orientation of Representative Volume Elements (RVEs). Left: Microscale
(µ) RVE using Discrete Element Method (DEM), Figure 106 and Row 1 of Figure 14. Center:
Mesoscale (cm) REV using FEM; Row 2 of Figure 14. Right: Field-size macroscale (km) FEM
model; Row 3 of Figure 14 [25]. (Figure reproduced with permission of the authors.)
Figure 105: Single-physics block diagram (Section 11.2). Single physics is the easiest setting
in which to see the role of deep learning in modeling complex nonlinear constitutive behavior (stress-strain
relation, red arrow), as first realized in [23], where the balance of linear momentum and the
strain-displacement relation are definitions or accepted "universal principles" (black arrows)
[25]. (Figure reproduced with permission of the authors.)
works and conventional constitutive models. To illustrate the hierarchy among the relations of models, and to
identify which of them are phenomenological, the authors of [25] used directed graphs, which also indicated the nature of the individual relations
by the colors of the graph edges. Black edges correspond to "universal principles," whereas red edges represent
phenomenological relations; see, e.g., the classical problem in solid mechanics shown in Figure 105.
Within classical mechanics, the balance of linear momentum is axiomatic in nature, i.e., it represents a well-
accepted premise that is taken to be true. The relation between the displacement field and the strain tensor
represents a definition. The constitutive law describing the stress response, which, in the elastic case, is
an algebraic relation among stresses and strains, is the only phenomenological part in the “single-physics”
solid mechanics problem and, therefore, highlighted in red.
In many engineering problems, stress-strain relations of, possibly nonlinear, elasticity, which are pa-
rameterized by a set of elastic moduli, can be used in the modeling. For heterogeneous materials as, e.g.,
in composite structures, even the “single physics” problem of elastic solid mechanics may necessitate mul-
tiscale approaches, in which constitutive laws are replaced by RVE simulations and homogenization. This
approach was extended to multiphysics models of porous media, in which multiple scales needed to be
considered [25]. The counterpart of Figure 105 for the mechanics of porous media is complex, could be
confusing for readers not familiar with the field, does not add much to the understanding of the use of deep
learning in this study, and is therefore not included here; see [25].
The hybrid approach in [25], which was described as a graph-based machine-learning model, retained
those parts of the model which represented universal principles or definitions (black arrows). Phenomeno-
logical relations (red arrows), which, in conventional multiscale approaches, followed from microscale mod-
els, were replaced by computationally efficient data-driven models. In view of the path-dependency of the
constitutive behavior in the poromechanics problem considered, it was proposed in [25] to use recurrent neu-
ral networks (RNNs), Section 7.1, constructed with Long Short-Term Memory (LSTM) cells, Section 7.2.
where Nc is the number of contact points (which is the same as the coordination number CN ), and nc the
unit normal vector at contact point c.
Remark 11.1. Even though in Figure 14 (row 1) in Section 2.3.2 the microscale RVE was indicated to be
of micron size, the microscale RVE in Figure 106 was of size 10 cm × 10 cm × 5 cm, with particles of
0.5 cm in diameter, many orders of magnitude larger. See Remark 11.9. ■
Other microstructure data, such as the porosity and the coordination number (or number of contact
points), being scalars that do not incorporate directional data like the fabric tensor, did not help to improve
the accuracy of the network prediction, as noted in the caption of Figure 17.
To enforce objectivity of constitutive models realized as neural networks, (the history of) principal
strains and incremental rotation parameters that describe the orientation of principal directions served as
inputs to the network. Accordingly, principal stresses and incremental rotations for the principal directions
were outputs of what was referred to as Spectral RNNs [25], which preserved objectivity of constitutive
models.
Figure 108: Optimal RNN-LSTM architecture (Section 11.3.3). Training error and test errors
for 5 different configurations of RNN with LSTM units, see Figure 107 [25] (Figure reproduced
with permission of the authors.)
LSTM units (Figure 15, Section 2.3.2, and Figure 81 for the detailed original LSTM cell), and either logistic
sigmoid or ReLU as activation function, Figure 107.
Configuration 1 has 2 hidden layers with 50 LSTM units each, and with the logistic sigmoid as activation
function. Config 2 is similar, but with 80 LSTM units per hidden layer. Config 3 is similar, but with 100
LSTM units per hidden layer. Config 4 is similar to Config 2, but with 3 hidden layers. Config 5 is similar
to Config 4, but with ReLU activation function.
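A minimal sketch (PyTorch; layer sizes from Config 2, everything else our own choice, and not the actual implementation of [25]) of a stacked-LSTM model of this kind, mapping sequences of strain-like inputs to stress-like outputs:

```python
# Minimal sketch (PyTorch; layer sizes from Config 2, everything else ours,
# not the implementation of [25]) of a stacked-LSTM sequence regressor.
import torch
import torch.nn as nn

class StackedLSTM(nn.Module):
    def __init__(self, n_in, n_out, hidden=80, layers=2):   # Config 2: 2 x 80
        super().__init__()
        self.lstm = nn.LSTM(n_in, hidden, num_layers=layers, batch_first=True)
        self.head = nn.Linear(hidden, n_out)

    def forward(self, x):          # x: (batch, time, n_in)
        h, _ = self.lstm(x)        # h: (batch, time, hidden)
        return self.head(h)        # prediction at every time step

model = StackedLSTM(n_in=6, n_out=6)     # e.g., 6 strain and 6 stress components
y = model(torch.randn(4, 100, 6))        # 4 sequences of 100 time steps
```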
The training error and test error obtained from using these 5 configurations are shown in Figure 108.
The zoomed-in views of the training error and test error from epoch 3000 to epoch 5000 in Figure 109
show that Config 5 was optimal, with smaller errors, and ReLU would moreover be computationally more
efficient than the logistic sigmoid. Config 2 was, however, selected in [25], whose authors noted that the
discrepancy was "not significant", and that Config 2 gave "good training and prediction performances".
Remark 11.2. The above search for an optimal network architecture is similar to searching for an appropri-
ate degree of a polynomial function for a best fit, avoiding overfit and underfit, over a given set of data points
in least-squares curve fitting. See Figure 72 in Section 6.5.9 for an explanation of underfit and overfit, and
Figure 99 in Section 10 for a similar search of an optimal network for numerical integration by ANN. ■
Remark 11.3. Referring to Remark 4.3 and the neural network in Figure 14 and to our definition of action
depth as total number of action layers L in Figure 23 in Section 4.3.1 and Remark 4.5 in Section 4.6, it is
clear that a network layer in [25] is a state layer, i.e., an input matrix y (ℓ) , with ℓ = 0, . . . , L. Thus all configs
in Figure 107 with 2 hidden layers have an action depth of L = 3 layers and a state depth of L + 1 = 4
layers, whereas Config 4 with 3 hidden layers has an action depth of L = 4 layers and a state depth of
L + 1 = 5 layers. On the other hand, since RNNs with LSTM units were used, in view of Remark 7.2, these
networks were equivalent to “very deep feedforward networks”. ■
Remark 11.4. The same selected RNN-LSTM architecture was used with data from both the microscale RVE (Figure
106, Figure 110) and the mesoscale RVE (Figure 112, Figure 113) to produce the mesoscale
RNN with LSTM units ("Mesoscale data-driven constitutive model") and the macroscale RNN with LSTM
units ("Macroscale data-driven constitutive model"), respectively [25]. ■
Figure 109: Optimal RNN-LSTM architecture (Section 11.3.3). Training error (a) and testing
error (b), close-up views of Figure 108 from epoch 3000 to epoch 5000: Config 5 (purple line
with dots) was optimal, with smaller errors than those of Config 2 (blue dashed line). See
Figure 107 for config details [25] (Figure reproduced with permission of the authors.)
where q is the fluid mass flux (kg/(m2 s)), ρf the fluid mass density (kg / m3 ), v the fluid velocity (m/s), k
the medium permeability (m2 ), µf the fluid dynamic viscosity (Pa · s = N s/m2 ), p the pressure (N/m2 ),
and x the distance (m).
Figure 110: Mesoscale RNN with LSTM units. Traction-separation law (Sections 11.3.3,
11.3.5). Left: Sequence of imposed displacement jumps on microscale RVE (Figure 106),
normal (solid line) and tangential (dotted line, with us ≡ um in Figure 106, center). Right:
Normal traction vs. normal displacement. Cyclic loading and unloading. Microscale RVE
training data (blue) vs. Mesoscale RNN with LSTM prediction (red, Section 11.3.3), with
mean squared error 3.73 × 10−5 . See also Figure 115 on the macroscale RNN with LSTM
units [25] (Figure reproduced with permission of the authors.)
Remark 11.5. Dimension of ᾱ in Eq. (408). Each term in the balance of linear momentum Eq. (406) has
force per unit volume (F/L3 ) as dimension, which is therefore the dimension of the right-hand side of
Eq. (406), where c0 appears. As a result, c0 has the dimension of mass (M ) per unit volume (L3 ) per unit
time (T ):
$$[c_0 (\tilde{v}_m - \tilde{v}_M)] = \frac{F}{L^3} = \frac{M}{L^2 T^2} \;\Rightarrow\; [c_0] = \frac{M}{L^2 T^2} \frac{T}{L} = \frac{M}{L^3 T} \; . \quad (410)$$
Another way to verify is to identify the right-hand side of Eq. (406) with the usual inertia force per unit
volume:
$$[c_0 (\tilde{v}_m - \tilde{v}_M)] = \left[ \rho \frac{\partial v}{\partial t} \right] \;\Rightarrow\; [c_0][v] = [\rho] \frac{[v]}{T} \;\Rightarrow\; [c_0] = \frac{[\rho]}{T} = \frac{M}{L^3 T} \; . \quad (411)$$
The empirical relation Eq. (408) adopted in [25] implies that the dimension of ᾱ was
$$[\bar{\alpha}] = \frac{[c_0][\mu_f]}{[(p_M - p_m)]} = \frac{M/(L^3 T) \cdot F T / L^2}{F / L^2} = \frac{M}{L^3} \; , \quad (412)$$
i.e., mass density, whereas permeability has the dimension of area (L2 ), as seen in Darcy’s law Eq. (409). It
is not clear why it is written in [25] that ᾱ characterized the “interface permeability between the macropores
and the micropores”. A reason could be that the “semi-empirical” relation Eq. (408) was assumed to be
analogous to Darcy’s law Eq. (409). Moreover, as a result of Eq. (412), the dimension of µf /ᾱ is therefore
the same as that of the kinematic viscosity νf = µf /ρ. If ᾱ had the dimension of permeability, then the
choice of the right-hand side of the balance of linear momentum Eq. (406) was inconsistent, dimensionally
speaking. See Remark 11.6. ■
Figure 111: Continuum with embedded strong discontinuity (Section 11.3.5). Domain B =
B + ∪ B − with embedded discontinuity surface Γ, running through the middle of a narrow band
(light blue) Bh = (Bh+ ∪ Bh− ) ⊂ B between the parallel surfaces Γ+ and Γ− . Objects behind Γ
in the negative direction of the normal n to Γ are designated with the minus sign, and those in
front of Γ with the plus sign. The narrow band Bh represents an embedded strong discontinuity,
as shown in the mesoscale RVE in Figure 113, where the discretized strong discontinuity zones
were a network of straight narrow bands. Top right: No sliding, interpretation of uµ = u +
[[u]](HΓ − fΓ) in [25]. Bottom right: Sliding, interpretation of u = ū + [[u]]HΓ in [297].
The (local) porosity ϕ is the ratio of the void volume dVv over the total volume dV . Within the void
volume, let ψ be the percentage of macropores, and (1 − ψ) the percentage of micropores; we have:
$$\phi = \frac{dV_v}{dV} \; , \quad \psi = \frac{dV_M}{dV_v} \; , \quad 1 - \psi = \frac{dV_m}{dV_v} \; . \quad (413)$$
The absolute macropore flux qM and the absolute micropore flux qm are defined as follows:
$$e_M = \rho_f \phi \psi \; , \quad q_M = e_M v_M \; , \quad e_m = \rho_f \phi (1 - \psi) \; , \quad q_m = e_m v_m \; . \quad (414)$$
There is a conservation of fluid transfer between the macropores and the micropores across any closed
surface Γ, i.e.,
$$\int_\Gamma (q_M + q_m) \bullet n \, d\Gamma = 0 \;\Rightarrow\; \operatorname{div} (q_M + q_m) = 0 \; . \quad (415)$$
Generalizing Eq. (409) to 3-D, Darcy's law in tensor form governing the fluid mass fluxes qM, qm is
written as
$$\tilde{q}_M = e_M \tilde{v}_M = -\rho_f \frac{k_M}{\mu_f} \cdot (\nabla p_M - \rho_f g) \; , \quad \tilde{q}_m = e_m \tilde{v}_m = -\rho_f \frac{k_m}{\mu_f} \cdot (\nabla p_m - \rho_f g) \; , \quad (416)$$
where kM , km denote the permeability tensors on the respective scales and g is the gravitational accelera-
tion.
From Eq. (415), assuming that
$$\operatorname{div} q_M = -\operatorname{div} q_m = c_0 \; , \quad (417)$$
where c0 is given in Eq. (408), then two more governing equations are obtained:
$$\operatorname{div} q_M = e_M \operatorname{div} v + \operatorname{div} \tilde{q}_M = c_0 \; , \quad (418)$$
$$\operatorname{div} q_m = e_m \operatorname{div} v + \operatorname{div} \tilde{q}_m = -c_0 \; , \quad (419)$$
Remark 11.6. It can be verified from Eq. (417) that the dimension of c0 is
$$[c_0] = [\operatorname{div} q_M] = \frac{[q_M]}{L} = \frac{[\rho][v]}{L} = \frac{M}{L^3 T} \; , \quad (420)$$
agreeing with Eq. (411). In view of Remark 11.5, for all three governing equations, Eq. (406) and
Eqs. (418)-(419), to be dimensionally consistent, ᾱ in Eq. (408) should have the same dimension as mass density, as indicated in
Eq. (412). Our consistent notation for the fluxes (qM, qm) and (q̃M, q̃m) differs in meaning compared to [25].
■
Figure 112: Mesoscale RVE (Sections 11.3.3, 11.3.5). A 2-D domain of size 1 m × 1 m (Re-
mark 11.9). See Figure 14 (row 2, left), and also Figure 111 for a conceptual representation.
Embedded strong discontinuity zones where damage occurred formed a network of straight
narrow bands surrounded by elastic material. Both the strong discontinuity narrow bands and
the elastic domain were discretized into finite elements. Imposed displacements (uN , uM ),
with uS ≡ uN , at the top (center). See Figure 113 for the deformation (strains and displace-
ment jumps) [25]. (Figure reproduced with permission of the authors.)
Remark 11.7. For field-size simulations, the above equations do not include the changing size of the pores,
which were assumed to be of constant size, and thus constant porosity, in [25]. As a result, the collapse
of the pores that leads to nonlinearity in the stress-strain relation observed in experiments (Figure 13) is
not modelled in [25], where the nonlinearity essentially came from the embedded strong discontinuities
(displacement jumps) and the associated traction-separation law obtained from DEM simulations using the
micro RVE in Figure 106 to train the meso RNN with LSTM; see Section 11.3.5. See also Remark 11.10.
■
Figure 113: Mesoscale RVE (Section 11.3.3). Strains and displacement jumps [25] (Figure
reproduced with permission of the authors.)
where τ is the shear stress along the fault line, τp the critical shear stress for the onset of fault reactivation, C
the cohesion strength, µ the coefficient of friction, σ ′ the effective stress normal to the fault line, σ the normal
stress, and p the fluid pore pressure. The authors of [299] demonstrated that an increase in fluid-injection rate
led to an increase in peak fluid pressure and, as a result, fault reactivation, as part of a study on why there was
an exponential increase in seismic activities in Oklahoma due to the injection of wastewater from hydraulic
fracturing (i.e., fracking) [300].
But criterion Eq. (421) involves only stresses, with no displacement, and thus cannot be used to quantify
the amount of fault slip. To allow for quantitative modeling of fault slip, or displacement jump, in a
displacement-driven FEM environment, so-called "cohesive traction-separation laws," expressing the traction
(stress) vector on the fault surface as a function of fault slip, similar to those used in modeling cohesive
zones in nonlinear fracture mechanics [301], are needed. But these classical cohesive traction-separation laws
are not appropriate for handling loading-unloading cycles.
To model a continuum with displacement jumps, i.e., embedded strong discontinuities, the traction-
separation law was represented in [25] as
T ([[u]]) = σ ′ ([[u]])•n , (422)
where T is the traction vector on the fault surface, u the displacement field, [[·]] the jump operator, making
[[u]] the displacement jump (fault slip, separation), σ ′ the effective stress tensor at the fault surface, n the
normal to the fault surface. To obtain the traction-separation law represented by the function T ([[u]]), a
neural network can be used provided that data are available for training and testing.
The authors of [25] assumed that cracks were pre-existing and did not propagate (see the mesoscale RVE in
Figure 113), and then set out to use the microscale RVE in Figure 106 to generate training data and test data for
the mesoscale RNN with LSTM units, called the "Mesoscale data-driven constitutive model", to represent
the traction-separation law for porous media. Their results are shown in Figure 110.
Remark 11.8. The microscale RVE in Figure 106 did not represent any real-world porous rock sample,
such as the Majella limestone with macroporosity 11.4% ≈ 0.1 and microporosity 19.6% ≈ 0.2 shown in
Figure 12 in Section 2.3.2, but was a rather simple assembly of mono-disperse (identical) solid spheres with
no information on size and no clear contact force-displacement relation; see [302]. Another problem was
that a realistic porosity of 0.1 or 0.2 could not be achieved with this microscale RVE, which yielded a porosity above
0.3, the total porosity (= macroporosity + microporosity) of the highly porous Majella limestone.
A goal of [25] was only to demonstrate the methodology, not to present realistic results. ■
Remark 11.9. Even though in Figure 14 (row 2) in Section 2.3.2 the mesoscale RVE was indicated to be
of centimeter size, the mesoscale RVE in Figure 113 was of size 1 m × 1 m, many orders of magnitude
larger. See Remark 11.1. ■
To analyze the mesoscale RVE in Figure 113 (Figure 104, center) and the macroscale (field-size) model
(Figure 104, right) by finite elements, both with embedded strong discontinuities (Figure 111), the authors
of [25] adopted a formulation that looked similar to [297] to represent strong discontinuities, which could
result from fractures or shear bands, by the local displacement field $u^\mu$ as273
$$u^\mu = u + [[u]] (H_\Gamma - f_\Gamma) \; , \quad \bar{u} := u - [[u]] f_\Gamma \;\Rightarrow\; u^\mu = \bar{u} + [[u]] H_\Gamma \; , \quad (423)$$
which differs from the global smooth displacement field u by the displacement jump vector [[u]] across the
singular surface Γ that represents a discontinuity, multiplied by the function $(H_\Gamma - f_\Gamma)$, where $H_\Gamma$ is the
Heaviside function, such that $H_\Gamma = 0$ in $B^-$ and $H_\Gamma = 1$ in $B^+$, and $f_\Gamma$ a smooth ramp function equal to
zero in $(B^- - B_h^-)$ and going smoothly up to 1 in $(B^+ - B_h^+)$, as defined in [297]. So Eq. (423) means that
the displacement field $u^\mu$ only had a smooth "bump" with support being the band $B_h$, as a result of a local
displacement jump, with no fault sliding, as shown in the top right subfigure in Figure 111.
[Reproduced from [25], Fig. 26: loading paths of three selected training cases TR1, TR2, TR3 and three selected testing cases TE1, TE2, TE3 on the mesoscale RVE. $u_n$ and $u_s$ are the normal and tangential displacement jumps. TR2 represents a tensile-shear loading case ($u_n$ positive); TR1 and TR3 represent compressive-shear loading cases ($u_n$ negative). The numbers mark the sequence of 3 loading-unloading cycles.]
Eq. (423) was the starting point in [297], but without using the definition of ū in Eq. (423)2:274
$$u = \bar{u} + [[u]] H_\Gamma \; , \quad \dot{u} = \dot{\bar{u}} + [[\dot{u}]] H_\Gamma \; . \quad (424)$$
[Reproduced from [25], Fig. 27 (see Figure 115); panel mean squared errors: (a) TR1, MSE = 4.96e−5; (b) TR2, MSE = 2.46e−4; (c) TR3, MSE = 2.13e−4; (d) TE1, MSE = 8.22e−5; (e) TE2, MSE = 6.04e−3; (f) TE3, MSE = 3.57e−4.]
Figure 115: Macroscale RNN with LSTM units (Section 11.3.5). Normal traction (Tn) vs imposed
displacement jumps (Un) on mesoscale RVE (Figure 112, Figure 113). Blue: Training
data (TR1-TR3) and test data (TE1-TE3) from the mesoscale FEM-LSTM model, where numbers
indicate the sequence of loading-unloading steps similar to those in Figure 110 for the
mesoscale RNN with LSTM units. Red: Corresponding predictions of the trained macroscale
RNN with LSTM units. The mean squared error (MSE) was used as loss function [25]. (Figure
reproduced with permission of the authors.)
Eq. (424) allows for fault sliding, as shown in the bottom right subfigure in Figure 111. Assuming that the jump [[u]] has zero gradient in $B^+$,
take the gradient of the rate form in Eq. (424)2 and symmetrize to obtain the small-strain rate275
$$(\nabla [[u]] = 0 \text{ and } \nabla H_\Gamma = n \, \delta_\Gamma) \;\Rightarrow\; \dot{\epsilon} = \operatorname{sym}(\nabla \dot{u}) + \operatorname{sym}([[\dot{u}]] \otimes n \, \delta_\Gamma) \; , \quad \text{with } \operatorname{sym}(b) := \frac{1}{2} (b + b^T) \; . \quad (425)$$
Later in [297], an equation that looked similar to Eq. (423)1, but in rate form, was introduced:276
$$\dot{u} = \dot{\bar{u}} + [[\dot{u}]] H_\Gamma = (\dot{\bar{u}} + [[\dot{u}]] f_\Gamma) + [[\dot{u}]](H_\Gamma - f_\Gamma) = \dot{\tilde{u}} + [[\dot{u}]](H_\Gamma - f_\Gamma) \; , \quad \text{with } \dot{\tilde{u}} := \dot{\bar{u}} + [[\dot{u}]] f_\Gamma \; , \quad (426)$$
where ũ, defined with the term +[[u]]fΓ, is not the same as ū, defined with the term −[[u]]fΓ in Eq. (423)2 of [25],
even though Eq. (423)3 looked similar to Eq. (426)1. From Eq. (426)3, the small-strain rate ϵ̇ in Eq. (425)3
in terms of the smoothed velocity $\dot{\tilde{u}}$ is277
$$\dot{\epsilon} = \operatorname{sym}(\nabla \dot{\tilde{u}}) + \operatorname{sym}([[\dot{u}]] \otimes n \, \delta_\Gamma) - \operatorname{sym}([[\dot{u}]] \otimes \nabla f_\Gamma) \; , \quad (427)$$
which, when removing the overhead dot, is similar to, but different from, the small-strain expression in [25],
written as
$$\epsilon = \operatorname{sym}(\nabla u) + \operatorname{sym}([[u]] \otimes n \, \delta_\Gamma) - \operatorname{sym}([[u]] \otimes \nabla f_\Gamma) \; , \quad (428)$$
where the first term was u and not ũ = u + [[u]]fΓ in the notation of [25].278
275
[297], Eq. (2.2).
276
[297], Eq. (2.15).
277
[297], Eq. (2.17).
278
Recall that u in [25] (Eq. (423)1 ), the “large-scale (or conformal) displacement field” without displacement jump,
is equivalent to ū in [297] (Eq. (424)1 ), but is of course not the ū in [25] (Eq. (423)2 ).
Typically, in this type of formulation [297], once the traction-separation law T ([[u]]) in Eq. (422) was
available (e.g., Figure 110), then given the traction T , the displacement jump [[u]] was solved for using
Eq. (422) at each Gauss point within a constant-strain triangular (CST) element [25].
At this point, it is not necessary to review this continuum formulation for displacement
jumps further; we return to the training of the macroscale RNN with LSTM units, which the authors of [25] called
the "Macroscale data-driven constitutive model" (Figure 14, row 3, right), using the data generated from
simulations with the mesoscale RVE (Figure 113) and the mesoscale RNN with LSTM units, called the
"Mesoscale data-driven constitutive model" (Figure 14, row 2, right), obtained earlier.
The mesoscale RNN with LSTM units ("mesoscale data-driven constitutive model") was first validated
using the mesoscale RVE with embedded discontinuities (Figure 113), discretized into finite elements, and
subjected to imposed displacements at the top. This combination of FEM and RNN with LSTM units on the
mesoscale RVE is denoted FEM-LSTM, with results that compared well with those obtained from the coupled
FEM and DEM (denoted FEM-DEM), as shown in Figure 114.
Once validated, the FEM-LSTM model for the mesoscale RVE was used to generate data to train the
macroscale RNN with LSTM units (called the "Macroscale data-driven constitutive model") by imposing
displacement jumps at the top of the mesoscale RVE (Figure 112), very much like what was done with the
microscale RVE (Figure 106, Figure 110), just at a larger scale.
The accuracy of the macroscale RNN with LSTM units ("Macroscale data-driven constitutive model")
is illustrated in Figure 115, where the normal tractions under displacement loading were compared to results
obtained with the mesoscale RVE (Figure 112, Figure 113), which was used for generating the training data.
Once established, the macroscale RNN with LSTM units is used in field-size macroscale simulations. Since
there are no further interesting insights into the use of deep learning, we stop our review of [25] here.
Remark 11.10. No non-linear stress-strain relation. In the end, the authors of [25] only used Figure 12 to
motivate the double porosity (in Majella limestone) in their macroscale modeling and simulations, which did
not include the characteristic non-linear stress-strain relation found experimentally in Majella limestone as
shown in Figure 13. All nonlinear responses considered in [25] came from the nonlinear traction-separation
law obtained from DEM simulations in which the particles themselves were elastic, even though the Hertz
contact force-displacement relation was nonlinear [303] [304] [302]. See Remark 11.7. ■
Remark 11.11. Physics-Informed Neural Networks (PINNs) applied to solid mechanics. The PINN method
discussed in Section 9.5 has been applied to problems in solid mechanics [305]: Linear elasticity (square
plate, plane strain, trigonometric body force, with exact solution), nonlinear elasto-plasticity (perforated
plate with circular hole, under plane-strain condition and von-Mises elastoplasticity, subjected to uniform
extension, showing localized shear band). Less accuracy was encountered for solutions that presented discontinuities (localized high gradients) in the material properties or at the boundary conditions; see Remark 7.7 and Remark 9.5. ■
Figure 116: 2-D datasets for training neural networks (Sections 2.3.3, 12.1). Extract 2-D
datasets from 3-D turbulent flow field evolving in time. From the 3-D flow field, extract N
equidistant 2-D planes (slices). Within each 2-D plane, select a region (yellow square), and k
temporal snapshots of this region as it evolves in time to produce a dataset. Among these N
datasets, each containing k snapshots of the same region within each 2-D plane, the majority
of the datasets is used for training, and the rest for testing; see Remark 12.1. For each dataset,
the reduced POD basis consists of m ≪ k POD modes with highest eigenvalues of a matrix
constructed from the k snapshots (Figure 18) [26]. (Figure reproduced with permission of the
author.)
Figure 117: LSTM unit and BiLSTM unit (Sections 2.3.2, 2.3.3, 7.2, 12.2). Each blue dot is
an original LSTM unit (in folded form Figure 81 in Section 7.2, without peepholes as shown
in Figure 15), thus a single hidden layer. The above LSTM architecture (left) in unfolded form
corresponds to Figure 82, with the inputs at state [k] designated by x[k] and the corresponding
outputs by h[k] , for k = . . . , n − 1, n, n + 1, . . .. In the BiLSTM architecture (right), there are
two LSTM units in the hidden layer, with the forward flow of information in the bottom LSTM
unit, and the backward flow in the top LSTM unit [26]. (Figure reproduced with permission of the
author.)
which is equivalent to maximizing the amplitude α(t), and thus the information content of u(x, t) in ϕ(x),
which in turn is also called “coherent structure”. The square of the amplitude in Eq. (433) can be written as
λ := α²(t) = (u, ϕ)² = ∫_B [ ∫_B u(x, t) u(y, t) ϕ(y) dy ] ϕ(x) dx , (434)
so that λ is the component (or projection) of the term in square brackets in Eq. (434) along the direction
ϕ(x). The component λ is maximal if this term in square brackets is colinear with (or “parallel to”) ϕ(x),
i.e.,
∫_B u(x, t) u(y, t) ϕ(y) dy = λ ϕ(x) ⇒ ∫_B ⟨u(x, t) u(y, t)⟩ ϕ(y) dy = λ ϕ(x) , (435)
which is a continuous eigenvalue problem with the eigenpair being (λ, ϕ(x)). In practice, the dynamic
quantity u(x, t) is sampled at k discrete times {t1 , . . . , tk } to produce k snapshots, which are functions of
x, assumed to be linearly independent, and ordered in matrix form as follows
{u(x, t1), . . . , u(x, tk)} =: {u1(x), . . . , uk(x)} =: u(x) . (436)
The coherent structure ϕ(x) can be expressed on the basis of the snapshots as
ϕ(x) = ∑_{i=1}^{k} βi ui(x) = β • u , with β = {β1, . . . , βk} . (437)
As a result of the discrete nature of Eq. (436) and Eq. (437), the eigenvalue problem in Eq. (435) is dis-
cretized into
Cβ = λβ , with C := (1/k) ∫_B u(y) ⊗ u(y) dy , (438)
where the matrix C is symmetric and positive definite, leading to positive eigenvalues in k eigenpairs (λi, βi), with i = 1, . . . , k. With ϕ(x) now decomposed into k linearly independent directions ϕi(x) := βi • u(x)
Figure 118: LSTM/BiLSTM training strategy (Sections 12.2.1, 12.2.2). From the 1-D time series αi(t) of each dominant mode ϕi, for i = 1, . . . , m, use a moving window to extract thousands of samples αi(t), t ∈ [tk, tk^spl], with tk being the time of snapshot k. Each sample is subdivided into an input signal αi(t), t ∈ [tk, tk + tinp] and an output signal αi(t), t ∈ [tk + tinp, tk^spl], with tk^spl − tk = tinp + tout and 0 < tout ≤ tinp. The windows can be overlapping. These thousands of input/output pairs were then used to train LSTM-ROM networks, in which LSTM can be replaced by BiLSTM (Figure 117). The trained LSTM/BiLSTM-ROM networks were then used to predict α̃i(t), t ∈ [tk + tinp, tk^spl] of the test datasets, given αi(t), t ∈ [tk, tk + tinp] [26]. (Adapted with permission of the author.)
according to Eq. (437), the dynamic quantity u(x, t) in Eq. (429) can now be written as a linear combination
of ϕi (x), with i = 1, . . . , k, each with a different time-dependent amplitude αi (t), i.e.,
u(x, t) = ∑_{i=1}^{k} αi(t) ϕi(x) , (439)
which is called a proper orthogonal decomposition of u(x, t), and is recorded in Figure 18 as “Full POD
reconstruction”. Technically, Eq. (437) and Eq. (439) are approximations of infinite-dimensional functions
by a finite number of linearly-independent functions.
Usually, a subset of m ≪ k POD modes are selected such that the error committed by truncating the
basis as done in Eq. (4) would be small compared to Eq. (439), and recalled here for convenience:
u(x, t) ≈ ∑_{j=1}^{m} ϕ_{i_j}(x) α_{i_j}(t) , with m ≪ k and i_j ∈ {1, . . . , k} . (4)
One way is to select the POD modes corresponding to the highest eigenvalues (or energies) in Eq. (435); see
Step (2) in Section 12.2.2.
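As a concrete illustration of the snapshot-POD construction in Eqs. (436)-(439) and the truncation in Eq. (4), the following minimal NumPy sketch (ours, not from [26]) builds the correlation matrix of Eq. (438) from synthetic snapshots, extracts the m dominant modes, and reconstructs the field. The uniform-grid quadrature (plain dot products in place of the integral over B) and the synthetic low-rank data are assumptions made here for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
N, k, m = 2000, 100, 5                             # grid points, snapshots, kept modes
u_snap = rng.standard_normal((N, 3)) @ rng.standard_normal((3, k)) \
         + 0.01 * rng.standard_normal((N, k))      # nearly rank-3 snapshot matrix

C = (u_snap.T @ u_snap) / k                        # correlation matrix, Eq. (438)
lam, beta = np.linalg.eigh(C)                      # eigenpairs (lambda_i, beta_i)
order = np.argsort(lam)[::-1]                      # sort by descending eigenvalue
lam, beta = lam[order], beta[:, order]

phi = u_snap @ beta[:, :m]                         # m dominant POD modes, Eq. (437)
phi /= np.linalg.norm(phi, axis=0)                 # normalize each mode
alpha = phi.T @ u_snap                             # coefficients alpha_i(t_j)
u_pod = phi @ alpha                                # truncated reconstruction, Eq. (4)
print(np.linalg.norm(u_snap - u_pod) / np.linalg.norm(u_snap))  # small error
```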
Remark 12.1. Reduced-order POD. Data for two physical problems were available from numerical simulations: (1) the Forced Isotropic Turbulence (ISO) dataset, and (2) the Magnetohydrodynamic Turbulence (MHD) dataset [105]. For each physical problem, the authors of [26] employed N = 6 equidistant 2-D planes (slices, Figure 116), with 5 of those 2-D planes used for training, and the 1 remaining 2-D plane used for testing (see Section 6.1). The same sub-region of the 6 equidistant 2-D planes (yellow squares in Figure 116) was used to generate 6 training / testing datasets. For each training / testing dataset, k = 5,023 snapshots for the ISO dataset and k = 1,024 snapshots for the MHD dataset were used in [26], because the ISO dataset contained 5,023 time steps, whereas the MHD dataset contained 1,024 time steps; the number of snapshots was thus the same as the number of time steps. These snapshots were reduced to m ∈ {5, . . . , 10} POD modes with the highest eigenvalues (thus energies), which were much fewer than the original snapshots, since m ≪ k. See Remark 12.3. ■

Figure 119: Two methods of developing LSTM-ROM (Sections 2.3.3, 12.2.2). For each physical problem (ISO or MHD): (a) Multiple-network method: each dominant POD mode has its own network to predict αi(t + t′), given αi(t); (b) Single-network method: all m dominant POD modes share the same network to predict {αi(t + t′), i = 1, . . . , m}, given {αi(t), i = 1, . . . , m} [26]. (Figure reproduced with permission of the author.)

Remark 12.2. Another method of finding the POD modes without forming the symmetric matrix C in Eq. (438) is by using the Singular Value Decomposition (SVD) directly on the rectangular matrix of the sampled snapshots, discrete in both space and time. The POD modes are then obtained from the left singular vectors times the corresponding singular values. A reduced POD basis is obtained next based on an information-content matrix. See [306], where POD was applied to efficiently solve nonlinear electromagnetic problems governed by Maxwell's equations with nonlinear hysteresis at low frequency (10 kHz), called static hysteresis, discretized by a finite-element method. See also Remark 12.4. ■
Figure 120: Hurst exponent vs POD-mode rank for Isotropic Turbulence (ISO) (Section 12.3). POD modes with larger eigenvalues (Eq. (438)) are higher ranked, and have lower rank numbers, e.g., POD mode rank 7 has a larger eigenvalue, and is thus more dominant, than POD mode rank 50. The Hurst exponent, even though fluctuating, trends downward with the POD mode rank number, but not monotonically, i.e., for two POD modes sufficiently far apart (e.g., mode 7 and mode 50), the lower-ranked POD mode (larger rank number) generally has a lower Hurst exponent. The first 800 POD modes for the ISO problem have Hurst exponents higher than 0.5, and are thus persistent [26]. (Adapted with permission of the author.)
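As a side note to Remark 12.2, the SVD route to the POD modes can be sketched in a few lines (our illustration, not code from [306] or [26]):

```python
import numpy as np

def pod_svd(u_snap, m):
    """POD via thin SVD of the (space x time) snapshot matrix, avoiding the
    explicit correlation matrix C of Eq. (438)."""
    U, s, _ = np.linalg.svd(u_snap, full_matrices=False)
    lam = s**2 / u_snap.shape[1]    # eigenvalues of C: lambda_i = s_i^2 / k
    return U[:, :m], lam[:m]        # m dominant POD modes and their energies
```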
(1) From the 3-D computational domain of a physical problem (ISO or MHD), select N equidistant 2-D
planes that slice through this 3-D domain, and select the same subregion for all of these planes, the
majority of which is used for the training datasets, and the rest for the test datasets (Figure 116 and
Remark 12.1 for the actual value of N and the number of training datasets and test datasets employed
in [26]).
(2) For each of the training datasets and test datasets, extract from the k snapshots a few (m ≪ k) dominant
POD modes ϕi , i = 1, . . . , m (with the highest energies / eigenvalues) and their corresponding
coefficients αi (t), i = 1, . . . , m, by solving the eigenvalue problem Eq. (438), then use Eq. (437) to
obtain the POD modes ϕi , and Eq. (431) to obtain αi (t) for use in Step (3).
(3) The time series of the coefficient αi(t) of the dominant POD mode ϕi(x) of a training dataset is chunked by a moving window into thousands of small samples with t ∈ [tk, tk^spl], where tk is the time of the kth snapshot. Each sample is subdivided into two parts: The input part with time length tinp and the output part with time length tout, Figure 118; see also the code sketch after this list. These thousands of input/output pairs were then used to train LSTM/BiLSTM networks in Step (4). See Remark 12.3.
(4) Use the input/output pairs generated from the training datasets in Step (3) to train LSTM/BiLSTM-
ROM networks. Two methods were considered in [26]:
(a) Multiple-network method: Use a separate RNN for each of the m dominant POD modes to
separately predict the coefficient αi (t), for i = 1, . . . , m, given a sample input, see Figure 119a
and Figure 118. Hyperparameters (layer width, learning rate, batch size) are tuned for the most
dominant POD mode and reused for training the other neural networks.
(b) Single-network method: Use the same RNN to predict the coefficients αi (t), i = 1, . . . , m of
all m dominant POD modes at once, given a sample input, see Figure 119b and Figure 118.
The single-network method better captures the inter-modal interactions that describe the energy trans-
fer from larger to smaller scales. Vortices that spread over multiple dominant POD modes also support
the single-network method, which does not artificially constrain flow features to separate POD modes.
(5) Validation: Input/output pairs similar to those for training in Step (3) were generated from the test
dataset for validation.279 With a short time series of the coefficient αi (t) of the dominant POD mode
ϕi (x) of the test dataset, with t ∈ [tk , tk + tinp ], where tk is the time of the kth snapshot, and
0 < tinp , as input, use the trained LSTM/BiLSTM networks in Step (4) to predict the values of α̃i (t),
with t ∈ [tk + tinp, tk^spl], where tk^spl is the time at the end of the sample, such that the sample length is tk^spl − tk = tinp + tout and 0 < tout ≤ tinp; see Figure 118. Compute the error between the predicted
value α̃i (t+tout ) and the target value of αi (t+tout ) from the test dataset. Repeat the same prediction
procedure for all m dominant POD modes chosen in Step (2). See Remark 12.3.
(6) At time t, use the predicted coefficients α̃i (t) together with the dominant POD modes ϕi (x), for
i = 1, . . . , m, to compute the flow field dynamics u(x, t) using Eq. (4).
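A minimal sketch of the moving-window chunking of Step (3); the helper name, the stride, and the window lengths of 10 steps for input and output (cf. Remark 12.3 below) are our assumptions:

```python
import numpy as np

def make_window_samples(alpha, n_inp=10, n_out=10, stride=1):
    """Chunk a 1-D time series alpha[t] into overlapping (input, output) pairs.

    Each sample spans n_inp + n_out consecutive steps: the first n_inp steps
    form the input signal, the following n_out steps the output (target).
    """
    X, Y = [], []
    n_spl = n_inp + n_out
    for k in range(0, len(alpha) - n_spl + 1, stride):
        X.append(alpha[k : k + n_inp])
        Y.append(alpha[k + n_inp : k + n_spl])
    return np.asarray(X), np.asarray(Y)

# Example: k = 5023 snapshots of one POD coefficient alpha_i(t)
alpha_i = np.random.randn(5023)          # placeholder for a POD time series
X, Y = make_window_samples(alpha_i)      # thousands of overlapping samples
print(X.shape, Y.shape)                  # (5004, 10) (5004, 10)
```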
Remark 12.3. Even though the value of tinp = tout = 0.1 sec was given in [26] as an example, all
LSTM/BiLSTM networks in their numerical examples were trained using input/output pairs with tinp =
tout = 10 × ∆t, i.e., 10 time steps for both input and output samples. With the overall simulation time for
both physical problems (ISO and MHD) being 2.056 secs, the time step size is ∆tISO = 2.056/5023 = 4.1×
10−4 sec for ISO, and ∆tMHD = 2.056/1024 = 2 × 10−3 sec = 5∆tISO for MHD. See also Remark 12.1.
■
The “U-velocity field for all results” was mentioned in [26], but without a definition of “U-velocity”, which was possibly the x component of the 2-D velocity field in the 2-D planes (slices) used for the datasets, with “V-velocity” being the corresponding y component.

The BiLSTM networks in the numerical examples were not as accurate as the LSTM networks for both physical problems (ISO and MHD), despite involving more computations; see Figure 117 above and Figure 20 in Section 2.3.3. The authors of [26] conjectured that a reason could be the random nature of turbulent flows, as opposed to the high long-term correlation found in natural human languages, which BiLSTM was designed to address.
Since LSTM architecture was designed specifically for sequential data with memory, it was sought in
[26] to quantify whether there was “memory” (or persistence) in the time series of the coefficients αi (t) of
279 The authors of [26] did not have a validation dataset as defined in Section 6.1.
the POD mode ϕi (x). To this end, the Hurst exponent280 was used to quantify the presence or absence of
long-term memory in the time series Si (k, n):
Si(k, n) := {αi(t_{k+1}), . . . , αi(t_{k+n})} , E_(i,n fixed)[ Ri(k, n) / σi(k, n) ] = Ci n^{Hi} , (440)

Ri(k, n) = max Si(k, n) − min Si(k, n) , σi(k, n) = stdev[Si(k, n)] , (441)
where Si(k, n) is a sequence of n steps of αi(t) starting at snapshot time t_{k+1}; E is the expectation of the ratio of the range Ri over the standard deviation σi for many samples Si(k, n) with different values of k, keeping (i, n) fixed; Ri(k, n) is the range of the sequence Si(k, n); σi(k, n) is the standard deviation of the sequence Si(k, n); Ci is a constant; and Hi is the Hurst exponent for POD mode ϕi(x).
A Hurst exponent of H = 1 indicates persistent behavior, i.e., an upward trend in a sequence is followed by an upward trend. If H = 0, the behavior represented by the time-series data is anti-persistent, i.e., an upward trend is followed by a downward trend (and vice versa). The case H = 0.5 indicates random behavior, which implies a lack of memory in the underlying process.
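The following rescaled-range (R/S) sketch estimates H for a given time series (our illustration, not the implementation of [26]); note that the classic R/S procedure takes the range of the mean-adjusted cumulative sum within each segment, a slight refinement of the plain range in Eq. (441):

```python
import numpy as np

def hurst_rs(series, window_sizes=(8, 16, 32, 64, 128, 256)):
    """Estimate the Hurst exponent H by rescaled-range (R/S) analysis.

    For each window length n, the series is split into segments; within each
    segment the range R of the mean-adjusted cumulative sum is divided by the
    standard deviation sigma. H is the slope of log E[R/sigma] vs. log n.
    """
    log_n, log_rs = [], []
    for n in window_sizes:
        rs_vals = []
        for start in range(0, len(series) - n + 1, n):
            seg = series[start : start + n]
            dev = np.cumsum(seg - seg.mean())     # mean-adjusted cumulative sum
            r = dev.max() - dev.min()             # range R
            s = seg.std()                         # standard deviation sigma
            if s > 0:
                rs_vals.append(r / s)
        log_n.append(np.log(n))
        log_rs.append(np.log(np.mean(rs_vals)))   # E[R/sigma] for this n
    h, _ = np.polyfit(log_n, log_rs, 1)           # slope = Hurst exponent
    return h

# White noise has no long-term memory, so H should come out roughly 0.5.
rng = np.random.default_rng(0)
print(hurst_rs(rng.standard_normal(5000)))
```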
The effects of prediction horizon and persistence on the prediction accuracy of the LSTM networks were studied in [26]. Horizon is the number of steps after the input sample over which an LSTM model would predict the values of αi(t), and is proportional to the output time length tout, assuming a constant time step size.
Persistence refers to the amount of correlation among subsequent realizations within sequential data, i.e.,
the presence of the long-term memory.
To this end, they selected one dataset (training or testing), and followed the multiple-network method
in Step (4) of Section 12.2.2 to develop a different LSTM network model for each POD mode with “non-
negligible eigenvalue”. For both the ISO (Figure 120) and MHD problems, the 800 highest ranked POD
modes were used.
A baseline horizon of 10 steps was used, for which the prediction errors were {1.08, 1.37, 1.94, 2.36, 2.03, 2.00} for POD ranks R = {7, 15, 50, 100, 400, 800}, with Hurst exponents {0.78, 0.76, 0.652, 0.653, 0.56, 0.54}, respectively. So the prediction error increased from 1.08 for POD rank 7 to 2.36 for POD rank 100, then decreased slightly for POD ranks 400 and 800. The time histories of the corresponding coefficients αi(t), with i ∈ R, on the right of Figure 120 provided only some qualitative comparison between the predicted values and the true values, but did not provide the scale of the actual magnitude of αi(t), nor the time interval of these plots. For example, the magnitude of α800(t) could be very small compared to that of α7(t), given that the POD modes were normalized in the sense of Eq. (431). Qualitatively, the predicted values compared well with the true values for POD ranks 7, 15, 50. So even though there was a divergence between the predicted values and the true values for POD ranks 100 and 800, if the magnitudes of α100(t) and α800(t) were very small compared to those of the dominant POD modes, there would be less of a concern.
Another expected result was that, for a given POD mode with rank lower than 50, the error increased
dramatically with the prediction horizon. For example, for POD rank 7, the errors were {1.08, 6.86, 15.03,
19.42, 21.40} for prediction horizons of {10, 25, 50, 75, 100} steps, respectively. Thus the error (at 21.40)
for POD rank 7 with horizon 100 steps was more than ten times higher than the error (at 2.00) for POD rank
800 with horizon 10 steps. For POD mode rank 50 and higher, the error did not increase as much with the
horizon, but stayed at about the same order of magnitude as the error for horizon 25 steps.
A final note is whether the above trained LSTM-ROM networks could produce accurate predictions for flow dynamics with different parameters, such as Reynolds number, mass density, viscosity, geometry, initial conditions, etc., particularly since both the ISO and MHD datasets were created for a single Reynolds number, whose value was not mentioned in [26]; the authors of [26] noted that their method (and POD in general) would work for a “narrow range of Reynolds numbers” for which the flow dynamics is qualitatively similar, and for “simplified flow fields and geometries”.

280 “Hurst exponent”, Wikipedia, version 22:09, 30 October 2020.
Remark 12.4. Use of POD-ROM for different systems. The authors of [306] studied the flexibility of POD reduced-order models to solve nonlinear electromagnetic problems by varying the excitation form (e.g., square wave instead of sine wave) and by using the undamped (without the first-order time-derivative term) snapshots in the simulation of the damped case (with the first-order time-derivative term). They demonstrated via numerical examples involving nonlinear power-magnetic-component simulations that the reduced-order models by POD are quite flexible and robust. See also Remark 12.2. ■
Remark 12.5. pyMOR - Model Order Reduction with Python. Finally, we mention the software pyMOR,281
which is “a software library for building model order reduction applications with the Python programming
language. Implemented algorithms include reduced basis methods for parametric linear and non-linear prob-
lems, as well as system-theoretic methods such as balanced truncation or IRKA (Iterative Rational Krylov
Algorithm). All algorithms in pyMOR are formulated in terms of abstract interfaces for seamless integration
with external PDE (Partial Differential Equation) solver packages. Moreover, pure Python implementations
of FEM (Finite Element Method) and FVM (Finite Volume Method) discretizations using the NumPy/SciPy
scientific computing stack are provided for getting started quickly.” It is noted that pyMOR includes POD
and “Model order reduction with artificial neural networks”, among many other methods; see the documen-
tation of pyMOR. Clearly, this software tool would be applicable to many physical problems, e.g., solids,
structures, fluids, electromagnetics, coupled electro-thermal simulation, etc. ■
Figure 121: Space-time solution of the inviscid 1-D Burgers' equation (Section 12.4.1). The solution shows a characteristic steep spatial gradient, which shifts and further steepens in the course of time. The FOM solution (left) and the solution of the proposed hyper-reduced ROM (center), in which the solution subspace is represented by a nonlinear manifold in the form of a feedforward neural network (Section 4) (NM-LSPG-HR), show an excellent agreement, whereas the spatial gradient is significantly blurred in the solution obtained with a hyper-reduced ROM based on a linear subspace (LS-LSPG-HR) (right). The FOM is obtained from a finite-difference approximation in the spatial domain with Ns = 1001 grid points (i.e., degrees of freedom); the backward Euler scheme is employed for time-integration using a step size ∆t = 1 × 10⁻³. Both ROMs use only ns = 5 generalized coordinates. The NM-LSPG-HR achieves a maximum relative error of less than 1 %, while the maximum relative error of the LS-LSPG-HR is approximately 6 %. (Figure reproduced with permission of the authors.)
281 See the pyMOR website https://pymor.org/.
were used to construct the individual ROMs. Note that solution data corresponding to the parameter value
µ = 1, for which the ROMs were evaluated, was not used in this example.
Figure 121 shows a zoomed view of solutions to the above problem obtained with the FOM (left), the
proposed nonlinear manifold-based ROM (NM-LSPG-HR, center) and a conventional ROM, in which the
full-order solution is represented by a linear subspace. The initial solution is characterized by a “bump”
in the left half of the domain, which is centered at x = 1/2. The advective nature of Burgers’ problem
causes the bump to move right, which also results in slopes of the bump to increase in movement direction
but decrease on the averted side. The zoomed view in Figure 121 shows the region at the end (t = 0.5)
of the considered time-span, in which the (negative) gradient of the solution has already steepened signif-
icantly. With as little as ns = 5 generalized coordinates, the proposed nonlinear manifold-based approach
(NM-LSPG-HR) succeeds in reproducing the FOM solution, which is obtained by a finite-difference ap-
proximation of the spatial domain using Ns = 1001 grid points; time-integration is performed by means
of the backward Euler scheme with a constant step size of ∆t = 1 × 10−3 , which translates into a total of
nt = 500 time steps.
The ROM based on a linear subspace of the full-order solution (LS-LSPG-HR) fails to accurately reproduce the steep spatial gradient that develops over time, see Figure 121 (right). Instead, the bump is substantially blurred in the linear-subspace-based ROM as compared to the FOM's solution (left). The maximum error over all time steps tn relative to the full-order solution, defined as

max_n ( ‖x^n − x̃^n‖₂ / ‖x^n‖₂ ) , (444)

where x and x̃ denote FOM and ROM solution vectors, respectively, was considered in [47]. In terms
of the above metric, the proposed nonlinear-manifold-based ROM achieves a maximum error of approx-
imately 1 %, whereas the linear-subspace-based ROM shows a maximum relative error of 6 %. For the
given problem, the linear-subspace-based ROM is approximately 5 to 6 times faster than the FOM. The
nonlinear-manifold-based ROM, however, does not achieve any speed-up unless hyper-reduction is employed. Hyper-reduction (HR) methods provide means to efficiently evaluate nonlinear terms in ROMs without evaluations of the FOM (see Hyper-reduction, Section 12.4.4). Using hyper-reduction, a factor-2 speed-up is achieved for both the nonlinear-manifold and the linear-subspace-based ROMs, i.e., the effective speed-ups amount to factors of 2 and 9–10, respectively.
The solution manifold can be represented by means of a shallow, sparsely connected feed-forward
neural network [47] (see Statics, feedforward networks, Sections 4, 4.6). The network is trained in an
unsupervised manner using the concept of autoencoders (see Autoencoder, Section 12.4.3).
ẋ = dx/dt = f(x, t; µ) , x(0; µ) = x0(µ) . (445)

In the above relation, x(t; µ) denotes the parameterized solution of the problem, where µ ∈ D ⊆ R^{nµ} is an nµ-dimensional vector of parameters; x0(µ) is the initial state. The function f represents the rate of change of the state, which is assumed to be nonlinear in the state x and possibly also in its other arguments.
The fundamental idea of any projection-based ROM is to approximate the original solution space of
the FOM by a comparatively low-dimensional space S. In view of the aforementioned shortcomings of
linear subspaces, the authors of [47] proposed a representation in terms of a nonlinear manifold, which was
described by the vector-valued function g, whose dimensionality was supposed to be much smaller than that
of the FOM:
S = {g(v̂) | v̂ ∈ Rns } , g : Rns → RNs , dim(S) = ns ≪ Ns . (447)
Using the nonlinear function g, an approximation x̃ to the FOM's solution x was constructed using a set of generalized coordinates x̂ ∈ R^{ns}:

x ≈ x̃ = xref + g(x̂) , (448)

where xref denoted a (fixed) reference solution. The rate of change ẋ was approximated by

dx/dt ≈ dx̃/dt = Jg(x̂) dx̂/dt , Jg(x̂) = ∂g/∂x̂ ∈ R^{Ns×ns} , (449)

where the Jacobian Jg spanned the tangent space to the manifold at x̂. The initial conditions for the generalized coordinates were given by x̂0 = g⁻¹(x0 − xref), where g⁻¹ denoted the inverse function of g.
Note that linear subspace methods are included in the above relations as the special case where g is a linear function, which can be written in terms of a constant matrix Φ ∈ R^{Ns×ns}, i.e., g(x̂) = Φx̂. In this case, the approximation to the solution in Eq. (448) and its rate in Eq. (449) are given by

g(x̂) = Φx̂ ⇒ x ≈ x̃ = xref + Φx̂ , dx/dt ≈ dx̃/dt = Φ dx̂/dt . (450)
As opposed to the nonlinear manifold Eq. (449), the tangent space of the (linear) solution manifold is
constant, i.e., Jg = Φ; see [307]. We also mention the example of eigenmodes as a classical choice for
basis vectors of reduced subspaces [309].
The authors of [47] defined a residual function r̃ : R^{ns} × R^{ns} × R × D → R^{Ns} in the reduced set of coordinates by rewriting the governing ODE Eq. (445) and substituting the approximation of the state Eq. (448) and its rate Eq. (449):286

r̃(dx̂/dt, x̂, t; µ) := Jg(x̂) dx̂/dt − f(xref + g(x̂), t; µ) , (451)

whose (squared) norm is minimized with respect to the rate of the generalized coordinates,

dx̂/dt = arg min_{v̂ ∈ R^{ns}} ‖r̃(v̂, x̂, t; µ)‖₂² . (452)

Requiring the derivative of the (squared) residual to vanish, we obtain the following set of equations,287

∂‖r̃(dx̂/dt, x̂, t; µ)‖₂² / ∂(dx̂/dt) = 2 Jg^T [ Jg dx̂/dt − f(xref + g(x̂), t; µ) ] = 0 , (453)
285 In solid mechanics, we typically deal with second-order ODEs, which can be converted into a system of first-order ODEs by including velocities in the state space. For this reason, we prefer to use the term ‘rate’ rather than ‘velocity’ of x in what follows.
286 The tilde above the symbol for the residual function r̃ in Eq. (451) was used to indicate an approximation, consistent with the use of the tilde in the approximation of the state in Eq. (448), i.e., x ≈ x̃.
287 As the authors of [47] omitted a step-by-step derivation, we introduce it here for the sake of completeness.
which can be rearranged for the rate of the reduced vector of generalized coordinates as:

dx̂/dt = Jg†(x̂) f(xref + g(x̂), t; µ) , x̂(0; µ) = x̂0(µ) , Jg† = (Jg^T Jg)⁻¹ Jg^T . (454)
The above system of ODEs, in which Jg† denotes the Moore-Penrose inverse of the Jacobian Jg ∈ R^{Ns×ns}, also referred to as the pseudo-inverse, constitutes the ROM corresponding to the FOM governed by Eq. (445). Equation (453) reveals that the minimization problem is equivalent to a projection of the full residual onto the low-dimensional subspace S by means of the Jacobian Jg. Note that the rate of the approximation, dx̃/dt = Jg dx̂/dt, lies in the tangent space spanned by Jg. Therefore, the authors of
[47] described the projection Eq. (453) as Galerkin projection and the ROM Eq. (454) as nonlinear manifold
(NM) Galerkin ROM. To construct the solutions, suitable time-integration schemes needed to be applied to
the semi-discrete ROM Eq. (454).
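As an illustration of Eq. (454), a minimal sketch of evaluating the NM-Galerkin ROM rate (ours, not from [47]); the decoder g, its Jacobian jac_g, the full-order rate f, and the reference solution x_ref are assumed to be user-supplied:

```python
import numpy as np

def rom_rate(x_hat, t, g, jac_g, f, x_ref):
    """Rate of the reduced coordinates for the NM-Galerkin ROM, Eq. (454):
    dx_hat/dt = pinv(J_g(x_hat)) @ f(x_ref + g(x_hat), t)."""
    J = jac_g(x_hat)                          # (Ns, ns) tangent-space basis
    rhs = f(x_ref + g(x_hat), t)              # full-order rate, shape (Ns,)
    # Solve the least-squares problem J x_hat_dot ~ rhs instead of forming
    # the pseudo-inverse explicitly (better conditioned).
    x_hat_dot, *_ = np.linalg.lstsq(J, rhs, rcond=None)
    return x_hat_dot

# Explicit time stepping of the ROM (a minimal sketch; [47] used implicit
# schemes such as backward Euler):
# x_hat += dt * rom_rate(x_hat, t, g, jac_g, f, x_ref)
```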
An alternative approach was also presented in [47] for the construction of ROMs, in which time-
discretization was performed prior to the projection onto the low-dimensional solution subspace. For
this purpose a uniform time-discretization with a step size ∆t was assumed; the solution at time tn =
tn−1 + ∆t = n∆t was denoted by xn = x(tn ; µ). To develop the method for implicit integration schemes,
the backward Euler method was chosen as example [47]:
xn − xn−1 = ∆t fn . (455)

The above integration rule implies that the span of the rate fn, which is given by evaluating (or by taking the ‘snapshot’ of) the nonlinear function f, is included in the span of the states (‘solution snapshots’) xn−1, xn:
span {fn } ⊆ span {xn−1 , xn } → span {f1 , . . . , fNt } ⊆ span {x0 , . . . , xNt } . (456)
For the backward Euler scheme Eq. (455), a residual function was defined in [47] as the difference

r̃_BE^n(x̂n; x̂n−1, µ) = g(x̂n) − g(x̂n−1) − ∆t f(xref + g(x̂n), tn; µ) . (457)
Just as in the time-continuous domain, the system of equations r̃_BE^n(x̂n; x̂n−1, µ) = 0 is over-determined and, hence, was reformulated as a least-squares problem for the generalized coordinates x̂n:

x̂n = arg min_{v̂ ∈ R^{ns}} ½ ‖r̃_BE^n(v̂; x̂n−1, µ)‖₂² . (458)
To solve the least-squares problem, the Gauss-Newton method with starting point x̂n−1 was applied, i.e., the residual Eq. (457) was expanded into a first-order Taylor polynomial in the generalized coordinates [47]:

r̃_BE^n(x̂n; x̂n−1, µ) ≈ r̃_BE^n(x̂n−1; x̂n−1, µ) + [∂r̃_BE^n/∂x̂n]|_{x̂n = x̂n−1} (x̂n − x̂n−1)
= −∆t fn−1 + [Jg(x̂n−1) − ∆t Jf(xref + g(x̂n−1)) Jg(x̂n−1)] (x̂n − x̂n−1) , (459)
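A minimal Gauss-Newton sketch for one backward-Euler step, Eqs. (457)-(459) (ours, not from [47]); g, jac_g, f, jac_f, and x_ref are assumed user-supplied, f is taken autonomous for brevity, and a fixed iteration count stands in for a proper convergence test:

```python
import numpy as np

def nm_lspg_step(x_prev, dt, g, jac_g, f, jac_f, x_ref, iters=5):
    """One backward-Euler step of the NM-LSPG ROM: solve the least-squares
    problem of Eq. (458) by Gauss-Newton, starting from the previous
    reduced state, with the linearization of Eq. (459)."""
    x_n = x_prev.copy()
    for _ in range(iters):
        Jg = jac_g(x_n)                            # decoder Jacobian (Ns, ns)
        x_full = x_ref + g(x_n)                    # reconstructed full state
        r = g(x_n) - g(x_prev) - dt * f(x_full)    # residual, Eq. (457)
        J = Jg - dt * jac_f(x_full) @ Jg           # residual Jacobian, Eq. (459)
        dx, *_ = np.linalg.lstsq(J, -r, rcond=None)
        x_n += dx                                  # Gauss-Newton update
    return x_n
```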
12.4.3 Autoencoder

Within the NM-ROM approach, it was proposed in [47] to construct g by training an autoencoder, i.e., a neural network that reproduces its input vector:

x ≈ x̃ = ae(x) = de(en(x)) . (460)
Figure 122: Dense vs. shallow decoder networks (Section 12.4.3). Contributing neurons (orange “nodes”) and connections (orange “edges”) lie in the “active” paths arriving at the selected outputs (solid orange “nodes”) from the decoder's inputs. In dense networks such as the one in (a), each neuron in a layer is connected to all neurons in both the preceding layer (if it exists) and the succeeding layer (if it exists). Fully-connected networks are characterized by dense weight matrices, see Section 4.4. In sparse networks such as the decoder in (b), several connections among successive layers are dropped, resulting in sparsely populated weight matrices. (Figure reproduced with permission of the authors.)
As the above relation reveals, autoencoders are typically composed of two parts, i.e., ae(x) = (de ∘ en)(x): The encoder ‘codes’ inputs x into a so-called latent state h = en(x) (not to be confused with hidden states of neural networks). The decoder then reconstructs (‘decodes’) an approximation of the input from the latent state, i.e., x̃ = de(h). Both encoder and decoder can be represented by neural networks, e.g., feed-forward networks as done in [47]. As the authors of [78] noted, an autoencoder is “not especially useful” if it exactly learns the identity mapping for all possible inputs x. Instead, autoencoders are typically restricted in some way, e.g., by reducing the dimensionality of the latent space as compared to the dimension of the inputs, i.e., dim(h) < dim(x). The restriction forces an autoencoder to focus on those aspects of the input, or input ‘features’, which are essential for the reconstruction of the input. The encoder network of such an undercomplete autoencoder288 performs a dimensionality reduction, which is exactly the aim of projection-based methods for constructing ROMs. Using nonlinear encoder/decoder networks, undercomplete autoencoders can be trained to represent low-dimensional subspaces of solutions to high-dimensional dynamical systems governed by Eq. (445).289 In particular, the decoder network represents the nonlinear manifold S, which is described by the function g that maps the generalized coordinates of the ROM x̂ onto the corresponding element x̃ of the FOM's solution space. Accordingly, the generalized coordinates are identified with the autoencoder's latent state, i.e., x̂ = h; the encoder network represents the inverse mapping g⁻¹, which “captures the most salient features” of the FOM.
The input data, which was formed by the snapshots of solutions X = [x1, . . . , xm], where m denoted the number of snapshots, was normalized when training the network [47]. The shift by the referential state xref and the scaling by xscale, i.e.,

xnormal = xscale ⊙ (x − xref) ∈ R^{Ns} , (461)

is such that each component of the normalized input lies within [−1, 1] or [0, 1].290 The encoder maps a normalized input onto the reduced (latent) coordinates x̂.
288 See [78], Chapter 14, p. 493.
289 In fact, linear decoder networks combined with MSE-loss are equivalent to Principal Components Analysis (PCA) (see [78], p. 494), which, in turn, is equivalent to the discrete variant of POD by means of singular value decomposition (see Remark 12.2).
290 The authors of [47] did not provide further details.
Figure 123: Sparsity masks (Section 12.4.3) used to realize sparse decoders in one- and two-
dimensional problems. The structure of the respective binary-valued mask matrices S is in-
spired by grid-points required in the finite-difference approximation of the Laplace operator in
one and two dimensions, respectively. (Figure reproduced with permission of the authors.)
Using our notation for feedforward networks and activation functions introduced in Section 4.4, the encoder network has the following structure:

x̂ = en(xnormal) := Wen^(2) y^(1) + ben^(2) , y^(1) = s(z^(1)) , z^(1) = Wen^(1) y^(0) + ben^(1) , y^(0) = xnormal , (465)

where Wen^(i), i = 1, 2, are dense matrices. The masked decoder, on the contrary, is characterized by sparse connections between the hidden layer and the output layer, which is realized as an element-wise multiplication of a dense weight matrix Wde^(2) with a binary-valued “mask matrix” S reflecting the connectivity among the two layers:

xscale ⊙ (x̃ − xref) = de(x̂) := (S ⊙ Wde^(2)) y^(1) + bde^(2) , y^(1) = s(z^(1)) , z^(1) = Wde^(1) y^(0) + bde^(1) , y^(0) = x̂ . (466)
The structure of the sparsity mask S, in turn, is inspired by the pattern (“stencil”) of grid-points involved in a finite-difference approximation of the Laplace operator, see Figure 123 for the one- and two-dimensional cases.
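A minimal sketch of the masked shallow decoder of Eq. (466) above (ours, not from [47]); the activation s = tanh and the array shapes are assumptions:

```python
import numpy as np

def masked_decoder(x_hat, W1, b1, W2, b2, S, x_scale, x_ref):
    """Sketch of the masked shallow decoder of Eq. (466), with a binary
    sparsity mask S applied element-wise (Hadamard product) to the dense
    output weight matrix W2; activation s = tanh is assumed."""
    y1 = np.tanh(W1 @ x_hat + b1)          # single hidden layer
    x_normal = (S * W2) @ y1 + b2          # sparse output layer
    return x_normal / x_scale + x_ref      # undo the normalization, Eq. (461)
```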
12.4.4 Hyper-reduction

Irrespective of how small the dimensionality of the ROM's solution subspace S is, we cannot expect a reduction in computational efforts in nonlinear problems. Both the time-continuous NM-Galerkin ROM in Eq. (454) and the time-discrete NM-LSPG ROM in Eq. (458) require repeated evaluations of the nonlinear function f and, in case of implicit time-integration schemes, also of its Jacobian Jf, whose computational complexity is determined by the FOM's dimensionality. The term hyper-reduction subsumes techniques in model-order reduction to overcome the necessity for evaluations that scale with the FOM's size. Among hyper-reduction techniques, the Discrete Empirical Interpolation Method (DEIM) [310] and Gauss-Newton with Approximated Tensors (GNAT) [311] have gained significant attention in recent years.

In [47], a variant of GNAT relying on solution snapshots for the approximation of the nonlinear residual term r̃ was used, hence the extension of the GNAT acronym with “SNS” for ‘solution-based subspace’, i.e., GNAT-SNS, see [312]. The idea of DEIM, GNAT, and their SNS variants takes up the leitmotif of projection-based methods: The full-order residual r̃ is in turn linearly interpolated by a low-dimensional vector r̂ and an appropriate set of basis vectors ϕ_{r,i} ∈ R^{Ns}, i = 1, . . . , nr:

r̃ ≈ Φr r̂ , Φr = [ϕ_{r,1}, . . . , ϕ_{r,nr}] ∈ R^{Ns×nr} , ns ≤ nr ≪ Ns . (467)

In DEIM methods, the residual r̃ of the time-continuous problem in Eq. (451) is approximated, whereas r̃ is to be replaced by r̃_BE^n in the above relation when applying GNAT, which builds upon Petrov-Galerkin ROMs as in Eq. (459). Both DEIM and GNAT variants use gappy POD to determine the basis Φr for the interpolation of the nonlinear residual. Gappy POD originates in a method for image reconstruction proposed in [313] under the name of a Karhunen-Loève procedure, in which images were reconstructed from individual pixels, i.e., from gappy data. In the present context of MOR, gappy POD aims at reconstructing the full-order residual r̃ from a small, nr-dimensional subset of its components.

The matrix Φr was computed by a singular value decomposition (SVD) on the snapshots of data. Unlike the original DEIM and GNAT methods, their SNS variants did not use snapshots of the nonlinear residuals r̃ and r̃_BE^n, respectively; instead, SVD was performed on the solution snapshots X. The use of solution snapshots was motivated by the fact that the span of snapshots of the nonlinear term f was included in the span of solution snapshots for conventional time-integration schemes, see Eq. (456). The vector of “generalized coordinates of the nonlinear residual term” minimizes the square of the Euclidean distance of selected components of the full-order residual r̃ and the respective components of its reconstruction:

r̂ = arg min_{v̂ ∈ R^{nr}} ½ ‖Z^T (r̃ − Φr v̂)‖₂² , Z^T = [e_{p1}, . . . , e_{p_nz}]^T ∈ R^{nz×Ns} , ns ≤ nr ≤ nz ≪ Ns . (468)

The matrix Z^T, which was referred to as the sampling matrix, extracts a set of components from full-order vectors v ∈ R^{Ns}. For this purpose, the sampling matrix was built from unit vectors e_{pi} ∈ R^{Ns}, having the value one at the pi-th component, which corresponded to the component of v to be selected (‘sampled’), and the value zero otherwise. The components selected by Z^T are equivalently described by the (ordered) set of sampling indices I = {p1, . . . , p_nz}, which is represented by the vector i = [p1, . . . , p_nz]^T ∈ R^{nz}. As the number of components of r̃ used for its reconstruction may be larger than the dimensionality of the reduced-order vector r̂, i.e., nz ≥ nr, Eq. (468) generally constitutes a least-squares problem, the solution of which follows as

r̂ = (Z^T Φr)† Z^T r̃ . (469)

Substituting the above result in Eq. (467), the FOM's residual can be interpolated using an oblique projection matrix P as

r̃ ≈ P r̃ , P = Φr (Z^T Φr)† Z^T ∈ R^{Ns×Ns} . (470)

The above representation in terms of the projection matrix P somewhat hides the main point of hyper-reduction. In fact, we do not apply P to the full-order residual r̃, which would be tautological. Unrolling the definition of P, we note that Z^T r̃ ∈ R^{nz} is a vector containing only a small subset of components of the full-order residual r̃, and only these sampled components need to be evaluated.
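To illustrate the gappy reconstruction of Eqs. (467)-(469), a minimal sketch with synthetic data (ours, not code from [47]); only the sampled rows of the residual basis enter the least-squares solve:

```python
import numpy as np

def gappy_reconstruct(Phi_r, idx, r_sampled):
    """Gappy POD, Eqs. (468)-(470): reconstruct a full-order vector from
    nz sampled components r_sampled = r[idx], using the residual basis Phi_r.
    Only the sampled rows Phi_r[idx, :] are needed (no full-order evaluation)."""
    # Solve (Z^T Phi_r) r_hat ~ Z^T r in the least-squares sense, Eq. (469)
    r_hat, *_ = np.linalg.lstsq(Phi_r[idx, :], r_sampled, rcond=None)
    return Phi_r @ r_hat              # interpolated full-order vector, Eq. (467)

# Minimal usage sketch with synthetic data:
rng = np.random.default_rng(1)
Phi_r = np.linalg.qr(rng.standard_normal((1000, 20)))[0]   # orthonormal basis
r_true = Phi_r @ rng.standard_normal(20)                   # lies in span(Phi_r)
idx = rng.choice(1000, size=30, replace=False)             # nz = 30 sampling indices
r_rec = gappy_reconstruct(Phi_r, idx, r_true[idx])
print(np.max(np.abs(r_rec - r_true)))                      # ~ machine precision
```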
Remark 12.6. Equivalent minimization problems. Note the subtle difference between the minimization problems that govern the ROMs with and without hyper-reduction, see Eqs. (452) and (474), respectively: For the case without hyper-reduction, see Eq. (452), the minimum is sought for the approximate full-dimensional residual r̃. In the hyper-reduced variant Eq. (474), the authors of [47] aimed, however, at minimizing the projected residual r̂, which was related to its full-dimensional counterpart by the residual basis matrix Φr, i.e., r̃ = Φr r̂. Using the full-order residual also in the hyper-reduced ROM translates into the following minimization problem:

dx̂/dt = arg min_{v̂ ∈ R^{ns}} ‖Φr r̂(v̂, x̂, t; µ)‖₂² = arg min_{v̂ ∈ R^{ns}} ‖Φr (Z^T Φr)† Z^T r̃(v̂, x̂, t; µ)‖₂² . (478)

Carrying out the minimization, one obtains

dx̂/dt = [(Z^T Φr)† Z^T Jg]† (Z^T Φr)† Z^T f(xref + g(x̂), t; µ) ,

i.e., using the identity Φr† Φr = I_nr, we recover exactly the same result as in the hyper-reduced case in Eq. (477). The only requirement is that the residual basis vectors ϕ_{r,j}, j = 1, . . . , nr, need to be linearly independent. ■
Remark 12.7. Further reduction of the system operator? At first glance, the operator in the ROM's governing equations in Eq. (477) appears to be further reducible:

[(Z^T Φr)† Z^T Jg]† (Z^T Φr)† = (Z^T Jg)† (Z^T Φr)(Z^T Φr)† . (480)

Note that the product (Z^T Φr)(Z^T Φr)† generally does not, however, evaluate to the identity, since our particular definition of the pseudo-inverse of A ∈ R^{m×n} is a left inverse, for which A†A = I_n holds, but AA† ≠ I_m. ■
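A quick numerical check of this left-inverse property (our illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 3))             # tall matrix, like Z^T Phi_r with nz > nr
A_pinv = np.linalg.pinv(A)
print(np.allclose(A_pinv @ A, np.eye(3)))   # True: left inverse, A^+ A = I_n
print(np.allclose(A @ A_pinv, np.eye(6)))   # False: A A^+ is only a projector
```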
Operators of this kind do not depend on the solution and, hence, “can be pre-computed once for all”. The products Z^T Φr, Z^T Φ, and Z^T f need not be evaluated explicitly.
Figure 124: Subnet construction (Section 12.4.4). To reduce computational cost, a subnet representing the set of active paths, which comprise all neurons and connections needed for the evaluation of selected outputs (highlighted in orange), i.e., the reduced residual r̂, is constructed (left). The size of the hidden layer of the subnet depends on which output components of the decoder are needed for the reconstruction of the full-order residual. If the full-order residual is reconstructed from successive outputs of the decoder, the number of neurons in the hidden layer involved in the evaluation becomes minimal due to the specific sparsity patterns proposed. Uniformly distributed components show the least overlap in terms of hidden-layer neurons required, which is why the subnet and therefore the computational cost in the hyper-reduction approach is maximal (right). (Figure reproduced with permission of the authors.)
Instead, only those rows of Φr, Φ, and f which are selected by the sampling matrix Z^T need to be extracted or computed when evaluating the above operator. In the context of MOR, pre-computations such as the above operator, but, more importantly, also the computationally demanding collection of full-order solution and residual snapshots and the subsequent SVDs, are attributed to the offline phase or stage, see, e.g., [307]. Ideally, the online phase only requires evaluations of quantities that scale with the dimensionality of the ROM.

By keeping track of which components of the full-order solution x̃ are involved in the computation of selected components of the nonlinear term, i.e., Z^T f, the computational cost can be reduced even further. In other words, we need not reconstruct all components of the full-order solution x̃, but only those components which are required for the evaluation of Z^T f. However, the number of components of x̃ that is needed for this purpose, which translates into the number of products of rows in Φ with the reduced-order solution x̂, is typically much larger than the number of sampling indices pi, i.e., the cardinality of the set I.
To explain this discrepancy, assume the full-order model to be obtained upon a finite-element discretization. Given some particular nodal point, all finite elements sharing the node contribute to the corresponding components of the nonlinear function f. So we generally must evaluate several elements when computing a single component of f corresponding to a single sampling index pi ∈ I, which, in turn, involves the coordinates of all elements associated with the pi-th degree of freedom.292
For nonlinear manifold methods, we cannot expect much improvement in computational efficiency by the hyper-reduction. As a matter of fact, the ‘nonlinearity’ becomes twofold if the reduced subspace is a nonlinear manifold: We do not only have to compute selected components of the nonlinear term Z^T f, we also need to evaluate the nonlinear manifold g. More importantly, from a computational point of view, the relevant rows of the Jacobian of the nonlinear manifold, which are extracted by Z^T Jg(x̂), must be re-computed for every update of the (reduced-order) solution x̂, see Eq. (477).
For Petrov-Galerkin-type variants of ROMs, hyper-reduction works in exactly the same way as for their Galerkin counterparts. The residual in the minimization problem in Eq. (458) is approximated by a gappy reconstruction, i.e.,

x̂n = arg min_{v̂ ∈ R^{ns}} ½ ‖(Z^T Φr)† Z^T r̃_BE^n(v̂; x̂n−1, µ)‖₂² . (482)
From a computational point of view, the same implications apply to Petrov-Galerkin ROMs as for Galerkin-
type ROMs, which is why we focus on the latter in our review.
In the approach of [47], the nonlinear manifold g was represented by a feed-forward neural network,
i.e., essentially the decoder of the proposed sparse autoencoder, see Eq. (463).
The computational cost of evaluating the decoder and its Jacobian scales with the number of parameters of the neural network. Both the shallowness and the sparsity of the decoder network already account for computational efficiency in regard to the number of parameters.
Additionally, the authors of [47] traced “active paths” when evaluating selected components of the decoder and its Jacobian of the hyper-reduced model. The set of active paths comprises all those connections and neurons of the decoder network which are involved in evaluations of its outputs. Figure 124 (left) highlights, in orange, the active paths for the computations of the components of the reduced residual r̂, from which the full residual vector r̃ is reconstructed within the hyper-reduction method.
Given all active paths, a subnet of the decoder network is constructed to only evaluate those components
of the full-order state which are required to compute the hyper-reduced residual. The computational costs to
292 The “Unassembled DEIM” (UDEIM) method proposed in [315] provides a partial remedy for that issue in the context of finite-element problems. In UDEIM, the algorithm is applied to the unassembled residual vector, i.e., the set of element residuals, which restricts the dependency among generalized coordinates to individual elements.
compute the residual and its Jacobian depend on the size of the subnet. As both input and output dimensions are given, size translates into the width of the (single) hidden layer. The size of the hidden layer, in turn, depends on the distribution of the sampling indices pi, i.e., on which components the full residual r̃ is reconstructed from.
For the sparsity patterns assumed, successive output components show the largest overlap in terms of
the number of neurons in the hidden layer involved in the evaluation, whereas the overlap is minimal in case
of equally spaced outputs.
The cases of successive and equally distributed sampling indices constitute extremal cases, for which the computational times for the evaluation of both the residual and its Jacobian in the 2-D example (Section 12.4.5) are illustrated as a function of the dimensionality of the reduced residual (“number of sampling points”) in Figure 124 (right).
Figure 125: 2-D Burgers' equation. Solution snapshots of full and reduced-order models (Section 12.4.5). From left to right, the components u (top row) and v (bottom row) of the velocity field at time t = 2 are shown for the FOM, the hyper-reduced nonlinear-manifold-based ROM (NM-LSPG-HR) and the hyper-reduced linear-subspace-based ROM (LS-LSPG-HR). Both ROMs have a dimension of ns = 5; with respect to hyper-reduction, the residual basis of the NM-LSPG-HR ROM has a dimension of nr = 55, and nz = 58 components of the full-order residual (“sampling indices”) are used in the gappy reconstruction of the reduced residual. For the LS-LSPG-HR ROM, both the dimension of the residual basis and the number of sampling indices are nr = nz = 59. Due to the advection, a steep, shock-wave-like gradient develops. While the FOM and NM-LSPG-HR solutions are visually indistinguishable, the LS-LSPG-HR fails to reproduce the FOM's solution by a large margin (right column). Spurious oscillation patterns characteristic of advection-dominated problems (Brooks & Hughes (1982) [316]) occur. (Figure reproduced with permission of the authors.)
Figure 126: 2-D Burgers' equation. Reynolds number vs. singular values (Section 12.4.5). Performing SVD on FOM solution snapshots, which were partitioned into x and y components, the influence of the Reynolds number on the singular values is illustrated. In diffusion-dominated problems, which are characterized by a low Reynolds number, a rapid decay of singular values was observed: Fewer than 100 singular values were non-zero (in terms of double-precision accuracy) in the present example for Re = 100. In problems with a high Reynolds number, in which advection dominates over diffusive processes, the decay of singular values was much slower: As many as 200 singular values were different from zero in the case Re = 1 × 10⁴. (Figure reproduced with permission of the authors.)
The semi-discrete FOM was a finite-difference approximation in the spatial dimension on a uniform 60 × 60 grid of points (xi, yj), where i ∈ {1, 2, . . . , 60}, j ∈ {1, 2, . . . , 60}. First spatial derivatives were approximated by backward differences; central differences were used to approximate second derivatives. The spatial discretization led to a set of ODEs, which was partitioned into two subsets that corresponded to the two velocity components u and v (cf. Figure 125).
Figure 127: 2D-Burgers’ equation: relative errors of nonlinear manifold and linear subspace
ROMs (Section 12.4.5). (Figure reproduced with permission of the authors.)
Table 7: 2-D Burgers' equation. Juxtaposition of hyper-reduced ROMs: speed-up and accuracy (Section 12.4.5). The 6 respective best least-squares Petrov-Galerkin ROMs built upon nonlinear-manifold approximation by means of autoencoders and linear subspaces are compared, where ‘best’ refers to the maximum error relative to the FOM (Eq. (444)). The optimal dimension of the basis Φr and the number of sampling indices pj used in the gappy reconstruction of the nonlinear residual lie in similar ranges for all ROMs listed. The hyper-reduced ROMs achieve speed-ups in wall-clock time from a factor of approximately 11 for the nonlinear-manifold-based approach up to a factor of almost 30 for linear-subspace ROMs. While the former show a maximum relative error of below 1 %, the latter fail to reproduce the FOM's behavior by a large margin. (Table reproduced with permission of the authors.)
ntrain = 2, 4, 6, 8 values of µ ∈ D, which were referred to as “parameter instances,” were created, where the target value µ = 1 remained excluded, i.e.,

Dtrain = {0.9 + 0.2 i / ntrain , i = 0, . . . , ntrain} \ {1} ; (487)
the reduced dimension was set to ns = 5. Figure 127 (right) reveals that, for the NM-LSPG ROM, 4 “pa-
rameter instances” were sufficient to reduce the maximum relative error below 1 % in the present problem.
None of the ROMs benefited from increasing the parameter set, for which the training data were generated.
Hyper-reduction turned out to be crucial with respect to computational efficiency. For a reduced di-
mension of ns = 5, all ROMs except for the NM-LSPG ROM, which achieved a minor speedup, were less
efficient than the FOM in terms of wall-clock time. The dimension of the residual basis nr and the number
of sampling indices nz were both varied in the range from 40 to 60 to quantify their relation to the maximum
relative error. For this purpose, the number of training instances was again set to ntrain = 4 and the reduced
dimension was fixed to ns = 5. Table 7 compares the 6 best—in terms of maximum error relative to the
FOM—hyper-reduced least-squares Petrov-Galerkin ROMs based on nonlinear manifolds and linear sub-
spaces, respectively. The NM-LSPG-HR ROM in [47] was able to achieve a speed-up of more than a factor
of 11 while keeping the maximum relative error below 1 %. Though the speed-up of the linear-subspace
counterpart was more than twice as large, relative errors beyond 34 % rendered these ROMs worthless.
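The gappy reconstruction step behind Table 7 can be sketched as a small least-squares problem (sizes and names below are assumptions for illustration, not the setup of [47]): given a residual basis and a set of sampling indices, the full residual is recovered from its sampled entries.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_r, n_z = 3600, 50, 55                               # full dim, basis size, samples (assumed)
Phi_r = np.linalg.qr(rng.standard_normal((n, n_r)))[0]   # orthonormal residual basis
idx = rng.choice(n, size=n_z, replace=False)             # sampling indices
r_full = Phi_r @ rng.standard_normal(n_r)                # a residual lying in range(Phi_r)

# Solve min_c || Phi_r[idx, :] c - r_full[idx] ||_2, then lift back to the full space
c, *_ = np.linalg.lstsq(Phi_r[idx, :], r_full[idx], rcond=None)
r_hat = Phi_r @ c
print(np.linalg.norm(r_hat - r_full))                    # ~ 0 for in-range residuals
```

Evaluating only n_z sampled entries of the nonlinear residual, instead of all n, is what produces the wall-clock speed-ups reported in Table 7.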
Figure 128: Machine-learning accelerated CFD (Section 12.4.5). Speed-up factor, compared
to direct integration, was much higher than those obtained from nonlinear model-order reduc-
tion in Table 7 [317]. Permission of NAS.
Figure 129: Machine-learning accelerated CFD (Section 12.4.5). Good accuracy and good generalization, avoiding non-physical solutions [317]. Permission of NAS.
Figure 130: Machine-learning accelerated CFD (Section 12.4.5). The neural network gener-
ates interpolation coefficients based on local-flow properties, while ensuring at least first-order
accuracy relative to the grid spacing [317]. Permission of NAS.
Remark 12.8. Machine-learning accelerated CFD. A hybrid method between traditional direct integration of the Navier-Stokes equations and machine-learning (ML) interpolation was presented in [317] (Figure 130), where a speed-up factor close to 90, many times higher than those in Table 7, was obtained (Figure 128), while generalizing well (Figure 129). Grounded in traditional direct integration, such a hybrid method
would avoid the non-physical solutions of pure machine-learning methods, such as physics-inspired machine learning (Section 9.5, Remark 9.4), maintain the higher accuracy obtained with direct integration, and at the same time benefit from an acceleration due to the learned interpolation. ■
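A minimal sketch of the constraint idea behind Figure 130 (an assumed illustration, not the code of [317]): stencil coefficients proposed by a network are projected so that they sum to one, which guarantees exact interpolation of constant fields and hence at least first-order accuracy relative to the grid spacing.

```python
import numpy as np

def constrain(alpha_raw: np.ndarray) -> np.ndarray:
    """Project raw coefficients onto the affine set {alpha : sum(alpha) = 1}."""
    n = alpha_raw.size
    return alpha_raw + (1.0 - alpha_raw.sum()) / n

alpha = constrain(np.array([0.3, 0.1, 0.2, 0.1]))  # e.g., raw network outputs (assumed)
u = np.full(4, 7.0)                                # a constant field on the stencil
print(alpha.sum(), alpha @ u)                      # 1.0 and 7.0: constants reproduced
```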
Remark 12.9. In concluding this section, we mention the 2023 review paper [318], brought to our attention
by a reviewer, on “A state-of-the-art review on machine learning-based multiscale modeling, simulation,
homogenization and design of materials.” This review paper would nicely complement our present review
paper. ■
13 Historical perspective
13.1 Early inspiration from biological neurons
In the early days, many papers on artificial neural networks, particularly for applications in engineering,
started to motivate readers with a figure of a biological neuron as in Figure 131 (see, e.g., [23], Figure 1a),
before displaying an artificial neuron (e.g., [23], Figure 1b). When artificial neural networks gained a foothold
Figure 131: Biological Neuron and signal flow (Sections 4.4.4, 13.1, 13.2.2) along myelinated
axon, with inputs at the synapses (input points) in the dendrites and with outputs at the axon
terminals (output points, which are also the synapses for the next neuron). Each input current
xi is multiplied by the weight wi , then all weighted input currents are summed together (linear
combination), with i = 1, . . . , n, to form the total synaptic input current Is into the soma (cell
body). The corresponding artificial neuron is in Figure 36 in Section 4.4.4. (Figure adapted from
Wikipedia version 14:29, 2 May 2019).
in the research community, there was no need to motivate with a biological neuron, e.g., [38] [20], which
began directly with an artificial neuron.
The author of [21] described the perceptron in terms of its "output" and "some threshold value," and attributed the following equation to Rosenblatt:
\text{output} = \begin{cases} 0 & \text{if } \sum_j w_j x_j \le \text{threshold} \\ 1 & \text{if } \sum_j w_j x_j > \text{threshold} \end{cases}   (489)
where the threshold is simply the negative of the bias $b_i^{(\ell)}$ in Eq. (26),293 or $(-b)$ in Figure 132, which is
a graphical representation of Eq. (489). The author of [21] went on to say, “That’s all there is to how a
Figure 132: The perceptron network (Sections 4.5, 13.2)—introduced by Rosenblatt (1958) [119], (1962) [120]—has a linear combination with weights and bias as expressed in $z^{(1)}(x_i) = w x_i + b \in \mathbb{R}$, but differs from the one-layer network in Figure 37 in that it is activated by the Heaviside function. The Rosenblatt perceptron cannot represent the XOR function; see Section 4.5.
perceptron works!" Such a statement could be highly misleading for first-time learners in discounting Rosenblatt's important contributions, which were extensively inspired by neuroscience, and were not limited
to the perceptron as a machine-learning algorithm, but also to the development of the Mark I computer, a
hardware implementation of the perceptron; see Figure 133 [319] [120] [320].
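For concreteness, Eq. (489) translates into a few lines of code (a sketch with assumed toy weights), which also exhibits the XOR limitation discussed in Section 4.5:

```python
import numpy as np

def perceptron_unit(x: np.ndarray, w: np.ndarray, threshold: float) -> int:
    """Eq. (489): Heaviside activation of the weighted sum of inputs."""
    return int(w @ x > threshold)   # 0 if sum_j w_j x_j <= threshold, else 1

# XOR counterexample: no single unit (w, threshold) reproduces the targets y
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 1, 1, 0])
print([perceptron_unit(x, np.array([1.0, 1.0]), 0.5) for x in X], "vs", y.tolist())
```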
Moreover, adding to the confusion for first-time learners, another error and misleading statement about
the “Rosenblatt perceptron” in connection with Eq. (488) and Eq. (489)—which represent a single neuron—
is in [78], p. 13, where it was stated that the “Rosenblatt perceptron” involved only a “single neuron”:
“The first wave started with cybernetics in the 1940s-1960s, with the development of theo-
ries of biological learning (McCulloch and Pitts, 1943; Hebb, 1949) and implementations of
the first models, such as the perceptron (Rosenblatt, 1958), enabling the training of a single
neuron.” [78], p. 13.
The error of considering the Rosenblatt perceptron as having a “single neuron” is also reported in
Figure 42, which is Figure 1.11 in [78], p. 23. But the Rosenblatt perceptron as described in the cited
reference Rosenblatt (1958) [119] and in Rosenblatt (1960) [319] was a network, called a “nerve net”:
“Any perceptron, or nerve net, consists of a network of “cells,” or signal generating units, and
connections between them.”
293 “The Perceptron’s design was much like that of the modern neural net, except that it had only one layer with
adjustable weights and thresholds, sandwiched between input and output layers” [77]. In the neuroscientific ter-
minology that Rosenblatt (1958) [119] used, the input layer contains the sensory units, the middle (hidden) layer
contains the “association units,” and the output layer contains the response units. Due to the difference in notation
and due to “neurodynamics” as a new field for most readers, we provide here some markers that could help track
down where Rosenblatt used linear combination of the inputs. Rosenblatt (1962) [2], p. 82, defined the “transmis-
sion function $c^\star_{ij}$” for the connection between two “units” (neurons) $u_i$ and $u_j$, with $c^\star_{ij}$ playing the same role as that of the term $w^{(\ell)}_{ij} y^{(\ell-1)}_j$ in $z^{(\ell)}_i$ in Eq. (26). Then for an “elementary perceptron”, the transmission function $c^\star_{ij}$ was defined in [2], p. 85, to be equal to the output of unit $u_i$ (equivalent to $y^{(\ell-1)}_i$) multiplied by the “coupling coefficient” $v_{ij}$ (between unit $u_i$ and unit $u_j$), with $v_{ij}$ being the equivalent of the weight $w^{(\ell)}_{ij}$ in Eq. (26), ignoring the time dependence. The word “weight,” meaning coefficient, was not used often in [2], and not at all in [119].
Figure 133: Rosenblatt and the Mark I computer (Sections 4.6.1, 13.2) based on the percep-
tron, described in the New York Times article titled “New Navy device learns by doing” on
1958 July 8 (Internet archive), as a “computer designed to read and grow wiser”, and would
be able to “walk, talk, see, write, reproduce itself and be conscious of its existence. The first
perceptron will have about 1,000 electronic “association cells” [A-units] receiving electrical
impulses from an eye-like scanning device with 400 photo cells”. See also the Youtube video
“Perceptron Research from the 50’s & 60’s, clip”. Sometimes, it is incorrectly thought that
Rosenblatt’s network had only one neuron (A-unit); see Figure 42, Section 4.6.1.
Such a “nerve net” would surely not just contain a “single neuron”. Indeed, the report by Rosenblatt (1957)
[1] that appeared a year earlier mentioned a network (with one layer) containing as many as a thousand neurons,
called “association cells” (or A-units):294
“Thus with 1000 A-units connected to each R-unit [response unit or output], and a system in
which 1% of the A-units respond to stimuli of a given size (i.e., Pa = .01), the probability of
making a correct discrimination with one unit of training, after 106 stimuli have been associ-
ated to each response in the system, is equal to the 2.23 sigma level, or a probability of 0.987
of being correct.” [1], p. 16.
The perceptron with one thousand A-units mentioned in [1] was also reported in the New York Times article
“New Navy device learns by doing” on 1958 July 8 (Internet archive); see Figure 133. Even if the report by
Rosenblatt (1957) [1] were not immediately accessible, it was stated in no uncertain terms that the perceptron
was a machine with many neurons.
294 We were so thoroughly misled in thinking that the “Rosenblatt perceptron” was a single neuron that we were
surprised to learn that Rosenblatt had built the Mark I computer with many neurons.
Figure 1 in [119] described a network (“nerve net”) with many A-units (neurons). Does anyone still read
the classics anymore?
Rosenblatt’s (1962) book [2], p. 33, provided the following neuroscientific explanation for using a linear
combination (weighted sum / voting) of the inputs in both time and space:
“The arrival of a single (excitatory) impulse gives rise to a partial depolarization of the post-
synaptic298 membrane surface, which spreads over an appreciable area, and decays exponen-
tially with time. This is called a local excitatory state (l.e.s.). The l.e.s. due to successive
impulses is (approximately) additive. Several impulses arriving in sufficiently close succes-
sion may thus combine to touch off an impulse in the receiving neuron if the local excita-
tory state at the base of the axon achieves the threshold level. This phenomenon is called
temporal summation. Similarly, impulses which arrive at different points on the cell body or
on the dendrites may combine by spatial summation to trigger an impulse if the l.e.s. induced
at the base of the axon is strong enough.”
The spatial summation of the input synaptic currents is also consistent with Kirchhoff’s current law of
summing the electrical currents at a junction in an electrical network.299 We first look at linear combination
in the static case, followed by the dynamic case with Volterra series.
Readers not interested in reading the classics can skip this section. Here, we will not review the percep-
tron algorithm,300 but focus our attention on the historical details not found in many modern references, and
connect Eq. (488) and Eq. (489) to the original paper by Rosenblatt (1958) [119]. But the problem is that
such a task is not directly obvious for readers of modern literature, such as [78], for the following reasons:
295 Heaviside activation function; see Figure 132 for the case of one neuron.
296 Weighted sum / voting (or linear combination) of inputs; see Eq. (488), Eq. (493), Eq. (494).
297 The negative of the bias $b_i^{(\ell)}$ in Eq. (26).
298 Refer to Figure 131. A synapse (meaning “junction”) is “a structure that permits a neuron (or nerve cell) to pass an
electrical or chemical signal to another neuron”, and consists of three parts: the presynaptic part (which is an axon
terminal of an upstream neuron from which the signal came), the gap called the synaptic cleft, and the postsynaptic
part, located on a dendrite or on the neuron cell body (called the soma); [19], p. 6; “Synapse”, Wikipedia, version
16:33, 3 March 2019. A dendrite is a conduit for transmitting the electrochemical signal received from another
neuron, and passing through a synapse located on that dendrite; “Dendrite”, Wikipedia, version 23:39, 15 April
2019. A synapse is thus an input point to a neuron in a biological neural network. An axon, or nerve fiber, is
a “long, slender projection of a neuron, that conducts electrical impulses known as action potentials” away from
the soma to the axon terminals, which are the presynaptic parts; “Axon terminal”, Wikipedia, version 18:13, 27
February 2019.
299 See “Kirchhoff’s circuit laws”, Wikipedia, version 14:24, 1 May 2019.
300 See, e.g., “Perceptron”, Wikipedia, version 13:11, 10 May 2019, and many other references.
• Rosenblatt’s work in [119] was based on neuroscience, which is confusing to those without this
background;
• Unfamiliar notations and concepts for readers coming from deep-learning literature, such as [21],
[78];
• The word “weight” was not used at all in [119], and thus cannot be used to indirectly search for hints
of equations similar to Eq. (488) or Eq. (489);
• The word “threshold” was used several times, such as in the sentence “If the algebraic sum of exci-
tatory and inhibitory impulse intensities is equal to or greater than the threshold (θ)”301 of the A-unit,
then the A-unit fires, again on an all-or-nothing basis”. The threshold θ is used in Eq. (2) of [119]:
e - i - l_e + l_i + g_e - g_i \ge \theta \,,   (490)
where $e$ is the number of excitatory stimulus components received by the A-unit (neuron, or associated unit), $i$ the number of inhibitory stimulus components, $l_e$ the number of lost excitatory components, $l_i$ the number of lost inhibitory components, $g_e$ the number of gained excitatory components, and $g_i$ the number of gained inhibitory components. But all these quantities are positive integers, and thus would not be the real-number weights $w_j$ in Eq. (489), whose inputs $x_j$ likewise have no clear equivalent in Eq. (490).
As will be shown below, it was misleading to refer to [119] for equations such as Eq. (488) and
Eq. (489), even though [119] contained the seed ideas leading to these equations upon refinement as pre-
sented in [120], which was in turn based on the book by Rosenblatt (1962) [2].
Instead of a direct reading of [119], we suggest reading key publications in reverse chronological order. We also use the original notations to help readers quickly identify the relevant equations in the classic literature.
The authors of [121] introduced a general class of machines, each known under different names, but
decided to call all these machines “perceptrons” in honor of the pioneering work of Rosenblatt. General
perceptrons were defined in [121], p. 10, as follows. Let φi be the ith image characteristic, called an image
predicate, which consists of a verb and an object, such as “is a circle”, “is a convex figure”, etc. An image
predicate is also known as an image feature.302 For example, let the ith image characteristic be whether an image “is a circle”; then $\varphi_i \equiv \varphi_{\text{circle}}$. If an image $X$ is a circle, then $\varphi_{\text{circle}}(X) = 1$; if not, $\varphi_{\text{circle}}(X) = 0$:
\varphi_{\text{circle}}(X) = \begin{cases} 1 & \text{if } X \text{ is a circle} \\ 0 & \text{if } X \text{ is not a circle} \end{cases}   (491)
Let Φ be a family of simple image predicates:
\Phi = [\varphi_1, \ldots, \varphi_n]   (492)
A general perceptron was defined as a more complex predicate, denoted by ψ, which was a weighted voting
or linear combination of the simple predicates in Φ such that303
\psi(X) = 1 \iff \alpha_1 \varphi_1 + \cdots + \alpha_n \varphi_n > \theta   (493)
301 Of course, the notation θ (lightface) here does not designate the set of parameters, denoted by θ (boldface) in
Eq. (31).
302 See [78], p. 3. Another example of a feature is a piece of information about a patient for medical diagnostics. “For
many tasks, it is difficult to know which features should be extracted.” For example, to detect cars, we can try to
detect the wheels, but “it is difficult to describe exactly what a wheel looks like in terms of pixel values”, due to
shadows, glares, objects obscuring parts of a wheel, etc.
303 [121], p. 10.
with αi being the weight associated with the ith image predicate φi , and θ the threshold or the negative of
the bias. As such “each predicate of Φ is supposed to provide some evidence about whether ψ is true for
any figure X.”304 The expression on the right of the equivalence sign, written with the notations used here,
is the general case of Eq. (489). The authors of [121], p. 12, then defined the Rosenblatt perceptron as a
special case of Eq. (493) in which the image predicates in Φ were random Boolean functions, generated by
a random process according to a probability distribution.
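A toy reading of Eq. (493) in code (the predicates below are hypothetical, chosen only to illustrate the weighted voting):

```python
def psi(X, predicates, alphas, theta):
    """Eq. (493): psi(X) = 1 iff alpha_1 phi_1(X) + ... + alpha_n phi_n(X) > theta."""
    return int(sum(a * phi(X) for a, phi in zip(alphas, predicates)) > theta)

# Hypothetical predicates on a tiny 'image' represented as a set of pixels
phi_nonempty = lambda X: int(len(X) > 0)   # "has at least one pixel"
phi_small    = lambda X: int(len(X) < 4)   # "has fewer than 4 pixels"
print(psi({(0, 0), (0, 1)}, [phi_nonempty, phi_small], [1.0, 1.0], 1.5))  # 1
```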
The next paper to read is [120], which was based on the book by Rosenblatt (1962) [2], and from which the following equation305 can be identified as being similar to Eq. (493) and Eq. (489), again in its original notation:
\sum_{\mu=1}^{N_a} y_\mu b_{\mu i} > \Theta \,, \quad \text{for } i = 1, \ldots, n \,,   (494)
where $y_\mu$ was the weight corresponding to the input $b_{\mu i}$ to the “associated unit” $a_\mu$ (neuron) from the stimulus pattern $S_i$ (ith example in the dataset), $N_a$ the number of “associated units” (neurons), $n$ the number of “stimulus patterns” (examples in the dataset), and $\Theta$ the second of two thresholds, which were fixed real non-negative numbers, and which corresponded to the negative of the bias $b_i^{(\ell)}$ in Eq. (26), or $(-b)$ in Figure 132.
To discriminate between two classes, the input bµi took the value +1 or −1, when there was excitation
coming from the stimulus pattern Si to the neuron aµ , and the value 0 when there was no excitation from
Si to aµ . When the weighted voting or linear combination in Eq. (494) surpassed the threshold Θ, then the
response was correct (i.e., yielded the value +1).
If the algebraic sum αµi of the connection strengths Cσµ between the neuron (associated unit) aµ and
the sensory unit sσ inside the pattern (example) Si surpassed a threshold θ (which was the first of two
thresholds, and which does not correspond to the negative of the bias in modern networks), then the neuron
aµ was activated:306
\alpha_{\mu i} := \sum_{\sigma :\, s_\sigma \in S_i} C_{\sigma \mu} \ge \theta   (495)
Eq. (495) in [120] would correspond to Eq. (490) in [119], with the connection strengths Cσµ being “random
numbers having the possible values +1, -1, 0”.
That Eq. (495) was not numbered in [120] indicates that it played a minor role in this paper. The reason
is clear, since the author of [120] stated307 that “the connections Cσµ do not change”, and thus “we may
disregard the sensory retina altogether”, i.e., Eq. (495).
Moreover, the very first sentence in [120] was “The perceptron is a self-organizing and adaptive system
proposed by Rosenblatt”, and the book by [2] was immediately cited as Ref. 1, whereas only much later on
the fourth page of [120] did the author write “With the Perceptron, Rosenblatt offered for the first time a
model...”, and cited Rosenblatt’s 1958 report first as Ref. 34, followed by the paper [119] as Ref. 35.
In a major work on AI dedicated to Rosenblatt after his death in a boat accident, the authors of [121],
p. xi, in the Prologue of their book, referred to Rosenblatt’s (1962) book [2] and not Rosenblatt’s (1958) paper [119].
304 [121], p. 11.
305 Eq. (4) in [120].
306 First equation, unnumbered, in [120]. That this equation was unnumbered also indicated that it would not be
subsequently referred to (and hence perhaps not considered as important).
307 See above Eq. (1) in [120].
In fact, Rosenblatt’s (1958) paper [119] was never referred to in [121], except for a brief mention of
the influence of “Rosenblatt’s [1958]” work on p. 19, without the full bibliographic details. The authors of
[121] wrote:
“However, it is not our goal here to evaluate these theories [to model brain functioning], but
only to sketch a picture of the intellectual stage that was set for the perceptron concept. In
this setting, Rosenblatt’s [1958] schemes quickly took root, and soon there were perhaps as
many as a hundred groups, large and small, experimenting with the model either as a ‘learning machine’ or in the guise of ‘adaptive’ or ‘self-organizing’ networks or ‘automatic control’
systems.”
So why was [119] often referred to for Eq. (488) or Eq. (489), instead of [120] or [2],308 which would
be much better references for these equations? One reason could be that citing [120] would not do justice to
[119], which contained the germ of the idea, even though not as refined as four years later in [120] and [2].
Another reason could be the herd effect by following other authors who referred to [119], without actually
reading the paper, or without comparing this paper to [120] or [2]. The best approach would be to refer to both
[119] and [120], as papers like these would be more accessible than books like [2].
Remark 13.1. The hype on the Rosenblatt perceptron Mark I computer described in the 1958 New York
Times article shown in Figure 133, together with the criticism of the Rosenblatt perceptron in [121] for
failing to represent the XOR function, led to an early great disappointment in the possibilities of AI when overreached expectations for such a device did not pan out, and contributed to the first AI winter that lasted
until the 1980s, with a resurgence in interest due to the development of backpropagation and application
in psychology as reported in [22]. But some sixty years since the Mark I computer, AI still cannot even
think like human babies yet: “Understanding babies and young children may be one key to ensuring that the
current “AI spring” continues—despite some chilly autumnal winds in the air” [321]. ■
308 A search on the Web of Science on 2019.07.04 indicated that [119] received 2,346 citations, whereas [120] received
168 citations. A search on Google Books on the same day indicated that [2] received 21 citations.
309 [19], p. 46. “The Volterra series is a model for non-linear behavior similar to the Taylor series. It differs from the
Taylor series in its ability to capture ’memory’ effects. The Taylor series can be used for approximating the response
of a nonlinear system to a given input if the output of this system depends strictly on the input at that particular time.
In the Volterra series the output of the nonlinear system depends on the input to the system at all other times. This
provides the ability to capture the ’memory’ effect of devices like capacitors and inductors. It has been applied in the
fields of medicine (biomedical engineering) and biology, especially neuroscience. In mathematics, a Volterra series
denotes a functional expansion of a dynamic, nonlinear, time-invariant functional,” in “Volterra series”, Wikipedia,
version 12:49, 13 August 2018.
where $K_n(\tau_1, \ldots, \tau_n)$ is the kernel of the $n$th order, with all integrals going from $\tau_j = 0$ to $\tau_j = +\infty$, for $j = 1, \ldots, n$. The linear-order approximation of the Volterra series in Eq. (496) is then
z(t) \approx K_0 + \int_{\tau_1 = 0}^{\tau_1 = +\infty} K_1(\tau_1) \, x(t - \tau_1) \, d\tau_1 = K_0 + \int_{\tau = -\infty}^{\tau = t} K_1(t - \tau) \, x(\tau) \, d\tau   (497)
with the continuous linear combination (weighted sum) appearing in the second term. The convolution integral in Eq. (497) is the basis for convolutional networks for highly effective and efficient image recognition, inspired by the mammalian visual system. A review of convolutional networks is outside the scope here, despite their being the “greatest success story of biologically inspired artificial intelligence” [78], p. 353.
For biological neuron models, both the input x(t) and the continuous weighted sum z(t) can be either
currents, with nA (nano Ampere) as dimension, or firing rates (frequency), with Hz (Hertz) as dimension.
Eq. (497) is the continuous temporal summation, counterpart of the discrete spatial summation in Eq. (26), with the constant term $K_0$ playing a role similar to that of the bias $b_i^{(\ell)}$. The linear kernel $K_1(\tau_1)$ (also called the Wiener kernel, or synaptic kernel in brain modeling) is the weight on the input $x(t - \tau_1)$,310,311 with the argument $t - \tau_1$ ranging from $-\infty$ to the current time $t$. In other words, the whole history of the input $x(t)$ prior to
the current time has an influence on the output z(t), with typically smaller weight for more distant input
(fading memory). For this reason, the synaptic kernel used in the biological neuron firing-rate models is
often chosen to have an exponential decay of the form:312
K_s(t) := K_1(t) = \frac{1}{\tau_s} \exp\left( \frac{-t}{\tau_s} \right)   (498)
where τs is the synaptic time constant such that the smaller τs is, the less memory of past input values, and
\tau_s \to 0 \;\Rightarrow\; K_s \to \delta(t) \;\Rightarrow\; z(t) \to K_0 + x(t)   (499)
i.e., the continuous weighted sum z(t) would correspond to the instantaneous x(t) (without memory of past
input values) as the synaptic time constant τs goes to zero (no memory).
Remark 13.2. The discrete counterpart of the linear part of the Volterra series in Eq. (497) can be found in
the exponential-smoothing time series in Eq. (212), with the kernel $K_1(t - \tau)$ being the exponential function $\beta^{(t-i)}$; see Section 6.5.3 on exponential smoothing in forecasting. The similarity is even closer when the
synaptic kernel K1 is of exponential form as in Eq. (498). ■
In firing-rate models of the brain (see Figure 27), the function x(t) represents an input firing-rate at a
synapse, K_0 is called the background firing rate, and the weighted sum z(t) has the dimension of firing rate (Hz).
For a neuron with n pre-synaptic inputs [x1 (t), . . . , xn (t)] (either currents or firing rates) as depicted in
Figure 131 of a biological neuron, the total input z(t) (current or firing rate, respectively) going into the soma
310 Since the first two terms in the Volterra series coincide with the first two terms in the Wiener series; see [19], p. 46.
311 See [19], p. 234.
312 See [19], p. 234, below Eq. (7.3).
(cell body, Figure 131), called total somatic input, is a discrete weighted sum of all post-synaptic continuous
weighted sums zi (t) expressed in Eq. (497), assuming the same synaptic kernel Ks at all synapses:
z(t) = K_0 + \sum_{i=1}^{n} w_i \int_{\tau = -\infty}^{\tau = t} K_s(t - \tau) \, x_i(\tau) \, d\tau   (500)
with K0 being the constant bias,313 and wi the synaptic weight associated with the synapse i.
Using the synaptic kernel Eq. (498) in Eq. (500) for the total somatic input z(t), and differentiating,314 the following ordinary differential equation is obtained:
\tau_s \frac{dz}{dt} = -z + \sum_i w_i x_i   (501)
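A forward-Euler integration of Eq. (501) (a sketch with assumed parameter values) shows the relaxation to the steady state $z_\infty = \sum_i w_i x_i$ discussed in Remark 13.3 below:

```python
import numpy as np

tau_s, dt, n_steps = 10e-3, 1e-4, 5000     # synaptic time constant, step, steps (assumed)
w = np.array([0.5, -0.2, 1.0])             # synaptic weights (assumed)
x = np.array([1.0, 2.0, 0.5])              # constant inputs (assumed)

z = 0.0
for _ in range(n_steps):
    z += dt / tau_s * (-z + w @ x)         # Euler step of Eq. (501)
print(z, "->", w @ x)                      # z relaxes to z_inf = sum_i w_i x_i
```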
Remark 13.3. The second term in Eq. (501), with time-independent input xi , is the steady state of the total
somatic input z(t):
z(t) = [z_0 - z_\infty] \exp\left( \frac{-t}{\tau_s} \right) + z_\infty \,,   (502)
\text{with } z_0 := z(0) \,, \text{ and } z_\infty := \sum_i w_i x_i   (503)
z(t) \to z_\infty \text{ as } t \to \infty \,.   (504)
As a result, the subscript ∞ is often used to denote the steady-state solution, such as $R_\infty$ in the model of neocortical neurons, Eq. (509) [118]. ■
For constant total somatic input z, the output firing-rate y is given by an activation function a(·) (e.g., scaled ReLU in Figure 25 and Figure 26) through the relation315
y = a(z) \,.   (505)
Remark 13.4. The steady-state output y∞ in the first firing-rate model described in Eq. (500) and Eq. (505)
in the case of constant inputs xi is therefore
y∞ = a(z∞ ) , (506)
where z∞ is given by Eq. (503). ■
313 The negative of the bias K_0 is the threshold. The constant bias K_0 is called the background firing rate when the inputs x_i(t) are firing rates.
314 In general, for $I(t) = \int_{x=A(t)}^{x=B(t)} f(x,t) \, dx$, then $\frac{dI(t)}{dt} = f(B(t),t) \, \dot{B}(t) - f(A(t),t) \, \dot{A}(t) + \int_{x=A(t)}^{x=B(t)} \frac{\partial f(x,t)}{\partial t} \, dx$.
315 See [19], p. 234, in original notation as $v = F(I_s)$, with $v$ being the output firing rate, $F$ an activation function, and $I_s$ the total synaptic current.
316 See [19], p. 234, subsection “The Firing Rate”.
The second firing-rate model consists of using Eq. (501) for the total somatic input firing rate z, which is then used as input for the following ODE for the output firing-rate y:
\tau_r \frac{dy}{dt} = -y + a(z(t)) \,,   (507)
where the activation function a(·) is applied on the time-dependent total somatic input firing rate z(t), but
with a different time constant τr , which describes how fast the output firing rate y approaches steady state
for constant input z.
Eq. (507) is a recurring theme that has been frequently used in papers in neuroscience and artificial
neural networks. Below are a few relevant papers for this review, particularly the continuous recurrent
neural networks (RNNs)—such as Eq. (510), Eq. (512), and Eqs. (515)-(516)—which are the counterparts
of the discrete RNNs in Section 7.1.
Figure 134: Model of neocortical neurons in [118] as a simplification of the model in [322] (Section 13.2.2): A capacitor C with a potential V across its plates, in parallel with the equilibrium potentials $E_{Na}$ (sodium) and $E_K$ (potassium) in opposite direction. Two variable resistors, $[m_\infty(V)]^{-1}$ and $[g_K R(V)]^{-1}$, are each in series with one of the mentioned two equilibrium potentials. The capacitor C is also in parallel with a current source I. The notation R is used here for the “recovery variable”, not a resistor. See Eqs. (508)-(509).
The model for neocortical neurons in [118], a simplification of the model by [322], was employed in [116] as the starting point to develop a formulation that produced SubFigure (b) in Figure 28, and consists of two coupled ODEs317
C \frac{dV}{dt} = -m_\infty(V) (V - E_{Na}) - g_K R (V - E_K) + I   (508)
\tau \frac{dR}{dt} = -R + R_\infty(V)   (509)
dt
where m∞ (V ) and R∞ (V ) are prescribed quadratic polynomials in the potential V , making the right-hand
side of Eq. (508) of cubic order. Eq. (508) describes the change in the membrane potential V due to the
capacitance C in parallel with other circuit elements shown in Figure 134, with (1) m∞ (V ) and ENa being
the activation function and equilibrium potential for the sodium ion (Na+ ), respectively, (2) gK , R, and EK
being the conductance, recovery variable, and equilibrium potential for the potassium ion (K+ ), respectively,
and (3) I the stimulating current. Eq. (509) for the recovery variable R has the same form as Eq. (507), with
R∞ (V ) being the steady state; see Remark 13.4.
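The two coupled ODEs (508)-(509) can be integrated by forward Euler as sketched below; the quadratics m_inf(V) and R_inf(V) and all parameter values are placeholders for illustration, not the coefficients of [118]:

```python
import numpy as np

C, g_K, tau = 1.0, 1.0, 5.0                     # capacitance, conductance, time constant (assumed)
E_Na, E_K, I = 50.0, -90.0, 10.0                # equilibrium potentials, stimulating current (assumed)
m_inf = lambda V: 1e-4 * (V + 40.0) ** 2        # placeholder quadratic, not from [118]
R_inf = lambda V: 1e-4 * (V + 60.0) ** 2        # placeholder quadratic, not from [118]

V, R, dt = -65.0, 0.0, 1e-3
for _ in range(5000):
    dV = (-m_inf(V) * (V - E_Na) - g_K * R * (V - E_K) + I) / C   # Eq. (508)
    dR = (-R + R_inf(V)) / tau                                    # Eq. (509)
    V, R = V + dt * dV, R + dt * dR
print(V, R)   # qualitative trajectory only, given the placeholder coefficients
```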
To create a continuous recurrent neural network described by ODEs in Eq. (510), the input xi (t) in
Eq. (500) is replaced by the output yj (t) (i.e., a feedback), and the bias K0 becomes an input, now de-
noted by x_i. Electrical circuits can be designed to approximate the dynamical behavior of a spatially-discrete,
317 Eq. (1) and Eq. (2) in [116].
temporally-continuous recurrent neural network (RNN) described by Eq. (510), which is Eq. (1) in [32]:318
\tau_i \frac{dy_i}{dt} = -y_i + \left[ x_i + \sum_j w_{ij} y_j \right]_+ .   (510)
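A minimal simulation of Eq. (510) (assumed toy weights and inputs, not those of [32]), with the rectification $[\cdot]_+$ of footnote 318 written out as a ReLU:

```python
import numpy as np

relu = lambda u: np.maximum(u, 0.0)
tau = np.array([0.01, 0.02])                # per-neuron time constants (assumed)
W = np.array([[0.0, 0.5], [-0.3, 0.0]])     # recurrent weights (assumed)
x = np.array([1.0, 0.2])                    # constant inputs (assumed)

y, dt = np.zeros(2), 1e-4
for _ in range(5000):
    y += dt / tau * (-y + relu(x + W @ y))  # Euler step of Eq. (510)
print(y)                                    # settles near the fixed point (1, 0)
```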
Densely distributed pre-synaptic input points [see Eq. (500) and Figure 131 of a biological neuron]
can be approximated by a continuous distribution in space, represented by x(s, t), with s being the space
variable, and t the time variable. In this case, a continuous RNN in both space and time, called “continuously
labeled RNN”, can be written as follows:320
\tau_r \frac{\partial y(s,t)}{\partial t} = -y(s,t) + a(z(s,t))   (515)
z(s,t) = \rho_s \int \left[ W(s,\gamma) \, x(\gamma,t) + M(s,\gamma) \, y(\gamma,t) \right] d\gamma   (516)
where ρs is the neuron density, assumed to be constant. Space-time continuous RNNs such as Eqs. (515)-
(516) have been used to model, e.g., the visually responsive neurons in the premotor cortex [19], p. 242.
318 In original notation, Eq. (510) was written as $\tau_i \, dx_i/dt + x_i = \left[ b_i + \sum_j W_{ij} x_j \right]_+$ in [32], whose outputs $x_i$ in the previous expression are now rewritten as $y_i$ in Eq. (510), and the biases $b_i(t)$, playing the role of inputs, are rewritten as $x_i(t)$ to be consistent with the notation for inputs and outputs used throughout the present work; see Section 4.2 on matrix notation, Eq. (7), and Section 7.1 on discrete recurrent neural networks, which are discrete in both space and time. The paper [32] was cited in both [19] and [36], with the latter leading us to it.
319 In original notation, Eq. (512) was written as $\dot{z} = -Az + f(Wz(t - h(t)) + J)$ in [323], where $z = y$, $A = T^{-1}$, $h(t) = d(t)$, and $J = x$.
320 [19], p. 240, Eq. (7.14).
Figure 135: Continuous recurrent neural network with time-dependent delay d(t) (green feed-
back loop, Section 13.2.2), as expressed in Eq. (514), where f (·) is the operator with the first
derivative term plus a standard static term—which is an activation function acting on a linear
combination of input and bias, i.e., a(z(t)) as in Eq. (35) and Eq. (32)—x(t) the input, y(t)
the output with the red feedback loop, and y(t − d) the delayed output with the green feedback
loop. This figure is the more general continuous counterpart of the discrete RNN in Figure 79,
represented by Eq. (275), which is a particular case of Eq. (514). We also refer readers to Re-
mark 7.1 and the notation equivalence y(t) ≡ h(t) as noted in Eq. (276).
The use of the logistic sigmoid function (Figure 30) in neuroscience dates back to the seminal work of Nobel Laureates Hodgkin & Huxley (1952) [322] in the form of an electrical circuit (Figure 134), and to the work reported in [35] in a form closer to today’s networks; see also [325] and [37].
The authors of [78], p. 219, remarked: “Despite the early popularity of rectification (see next Sec-
tion 13.3.2), it was largely replaced by sigmoids in the 1980s, perhaps because sigmoids perform better
when neural networks are very small.”
The rectified linear function has, however, made a comeback and was a key component responsible for the success of deep learning, and helped inspire a variant that in 2015 surpassed human-level performance
in image classification, as “it expedites convergence of the training procedure [16] and leads to better solu-
tions [21, 8, 20, 34] than conventional sigmoid-like units” [61]. See Section 5.3.3 on Parametric Rectified
Linear Unit.
Figure 137: Crayfish giant motor synapse (Section 13.3.2). The (pre-synaptic) lateral giant
fiber was connected to the (post-synaptic) giant motor fiber through a synapse where the two
fibers cross each other at the location annotated by “Giant motor synapse” in the figure. This
synapse was right underneath the giant motor fiber, at the crossing and contact point, and thus
could not be seen. The two left electrodes (including the second electrode from left) were
inserted in the lateral giant fiber, with the two right electrodes in the giant motor fiber. Cur-
rents were injected into the two electrodes indicated by solid red arrows, and electrical outputs
recorded from the two electrodes indicated by dashed blue arrows [326]. (Figure reproduced with
permission of the publisher Wiley.)
“The big question was how we could train deeper networks... Then a few years later, we dis-
covered that we didn’t need these approaches [Restricted Boltzmann Machines, autoencoders]
to train deep networks, we could just change the nonlinearity. One of my students was work-
ing with neuroscientists, and we thought that we should try rectified linear units (ReLUs)—we
called them rectifiers in those days—because they were more biologically plausible, and this
is an example of actually taking inspiration from the brain. We had previously used a sigmoid
function to train neural nets, but it turned out that by using ReLUs we could suddenly train
very deep nets much more easily. That was another big change that occurred around 2010 or
2011.”
The student mentioned by Bengio was likely the first author of [113]; see also the earlier Section 4.4.2 on
activation functions.
We were aware of Ref. [32], appearing in year 2000—in which a spatially-discrete, temporally-continuous recurrent neural network was used with a rectified linear function, as expressed in Eq. (510)—through Ref. [36]. On the other hand, prior to its introduction in deep neural networks, the rectified linear unit had been used in neuroscience since at least 1995, but [327] was a book, as cited in [113]. Research results published
in papers would appear in book form several years later:
“The current and third wave, deep learning, started around 2006 (Hinton et al., 2006; Bengio
et al., 2007; Ranzato et al., 2007a) and is just now appearing in book form as of 2016. The
other two waves [cybernetics and connectionism] similarly appeared in book form much later
than the corresponding scientific activity occurred” [78], p. 13.
Another clue that the rectified linear function was a well-known, well-accepted concept—similar to the relation Kd = F in the finite element method—is that the authors of [32] did not provide any reference to their own important Eq. (1), which is reproduced in Eq. (510), as if it were already obvious to anyone in
neuroscience.
Indeed, more than sixty years ago, in a series of papers [58] [326] [59], Furshpan & Potter established
that current flows through a crayfish neuron synapse (Figure 136 and Figure 137) in essentially one direction,
thus deducing that the synapse can be modeled as a rectifier, i.e., a diode in series with a resistance, as shown in
Figure 138.
Figure 139: Swish function (Section 13.3.3) x · s(βx), with s(·) being the logistic sigmoid in
Figure 30, and other activation functions [36]. (Figure reproduced with permission of the authors.)
13.4.1 Back-propagation
In an interview published in 2018 [79], Hinton confirmed that backpropagation was independently
invented by many people before his own 1986 paper [22]. Here we focus on information that is not found in
the review of backpropagation in [12].
For example, the success reported in [22] lay not in backpropagation itself, but in its use in psychology:
321 The “Square Nonlinearity (SQNL)” activation, having a shape similar to that of the hyperbolic tangent function, appeared in the article “Activation function” for the last time in version 22:00, 17 March 2021, and was removed
from the table of activation functions starting from version 18:13, 11 April 2021 with the comment “Remove SQLU
since it has 0 citations; it needs to be broadly adopted to be in this list; Remove SQNL (also from the same author,
and this also does not have broad adoption)”; see the article History.
322 See [78], Chap. 8, “Optimization for training deep models”, p. 267.
“Back in the mid-1980s, when computers were very slow, I used a simple example where you
would have a family tree, and I would tell you about relationships within that family tree. I
would tell you things like Charlotte’s mother is Victoria, so I would say Charlotte and mother,
and the correct answer is Victoria. I would also say Charlotte and father, and the correct
answer is James. Once I’ve said those two things, because it’s a very regular family tree with
no divorces, you could use conventional AI to infer using your knowledge of family relations
that Victoria must be the spouse of James because Victoria is Charlotte’s mother and James
is Charlotte’s father. The neural net could infer that too, but it didn’t do it by using rules of
inference, it did it by learning a bunch of features for each person. Victoria and Charlotte
would both be a bunch of separate features, and then by using interactions between those
vectors of features, that would cause the output to be the features for the correct person. From
the features for Charlotte and from the features for mother, it could derive the features for
Victoria, and when you trained it, it would learn to do that. The most exciting thing was that
for these different words, it would learn these feature vectors, and it was learning distributed
representations of words.” [79]
For psychologists, “a learning algorithm that could learn representations of things was a big breakthrough,”
and Hinton’s contribution in [22] was to show that “backpropagation would learn these distributed represen-
tations, and that was what was interesting to psychologists, and eventually, to AI people.” But backpropa-
gation lost ground to other technologies in machine learning:
“In the early 1990s, ... the support vector machine did better at recognizing handwritten digits
than backpropagation, and handwritten digits had been a classic example of backpropagation
doing something really well. Because of that, the machine learning community really lost
interest in backpropagation” [79].323
Despite such setback, psychologists still considered backpropagation as an interesting approach, and con-
tinued to work with this method:
There is “a distinction between AI and machine learning on the one hand, and psychology
on the other hand. Once backpropagation became popular in 1986, a lot of psychologists got
interested in it, and they didn’t really lose their interest in it, they kept believing that it was
an interesting algorithm, maybe not what the brain did, but an interesting way of developing
representations” [79].
The 2015 review paper [12] referred to Werbos’ 1974 PhD dissertation for a preliminary discussion of
backpropagation (BP),
“Efficient BP was soon explicitly used to minimize cost functions by adapting control param-
eters (weights) (Dreyfus, 1973). Compare some preliminary, NN-specific discussion (Wer-
bos, 1974, Section 5.5.1), a method for multilayer threshold NNs (Bobrowski, 1978), and a
computer program for automatically deriving and implementing BP for given differentiable
systems (Speelpenning, 1980).”
and explicitly attributed to Werbos early applications of backpropagation in neural networks (NN):
323 See Footnote 31 on how research on kernel methods (Section 8) for Support Vector Machines has been recently used in connection with networks with infinite width to understand how deep learning works (Section 14.2).
“To my knowledge, the first NN-specific application of efficient backpropagation was de-
scribed in 1981 (Werbos, 1981, 2006). Related work was published several years later (LeCun,
1985, 1988; Parker, 1985). A paper of 1986 significantly contributed to the popularization of
BP for NNs (Rumelhart, Hinton, & Williams, 1986), experimentally demonstrating the emer-
gence of useful internal representations in hidden layers.”
See also [112] [328] [330]. The 1986 paper mentioned above was [22].
“The deep learning community has been somewhat isolated from the broader computer science
community and has largely developed its own cultural attitudes concerning how to perform
differentiation. More generally, the field of automatic differentiation is concerned with how
to compute derivatives algorithmically. The back-propagation algorithm described here is only
one approach to automatic differentiation. It is a special case of a broader class of techniques
called reverse mode accumulation.”
Let’s decode what was said above. The deep learning community was isolated because it was not in the
mainstream of computer science research during the last AI winter, as Hinton described in an interview
published in 2018 [79]:
“This was at a time when all of us would have been a bit isolated in a fairly hostile environment—
the environment for deep learning was fairly hostile until quite recently—it was very helpful to
have this funding that allowed us to spend quite a lot of time with each other in small meetings,
where we could really share unpublished ideas.”
Had Hinton not moved from Carnegie Mellon University in the US to the University of Toronto in Canada, he would have had to change research topics to get funding, the AI winter would have lasted longer, and he might not have received the Turing Award along with LeCun and Bengio [331]:
“The Turing Award, which was introduced in 1966, is often called the Nobel Prize of comput-
ing, and it includes a $1 million prize, which the three scientists will share.”
A recent review of AD is given in [329], where backprop was described as a particular case of AD,
known as “reverse mode AD”; see also [12].324
“For computer vision, 2012 was the inflection point. For speech, the inflection point was a few
years earlier. Two different graduate students at Toronto showed in 2009 that you could make a
better speech recognizer using deep learning. They went as interns to IBM and Microsoft, and
324 We only want to point out the connection between backprop and AD, together with a recent review paper on AD,
but will not review AD itself here.
a third student took their system to Google. The basic system that they had built was devel-
oped further, and over the next few years, all these companies’ labs converted to doing speech
recognition using neural nets. Many of the best people in speech recognition had switched to
believing in neural networks before 2012, but the big public impact was in 2012, when the
vision community, almost overnight, got turned on its head and this crazy approach turned out
to win.”
The mentioned 2009 breakthrough of applying deep learning to speech recognition did not receive as much attention from the non-technical press as the 2012 breakthrough in computer vision (e.g., [75] [74]), and was thus not
popularly known, except inside the deep-learning community.
Deep learning is being developed and used to guide consumers in nutrition [332]:
“Using machine learning, a subtype of artificial intelligence, the billions of data points were
analyzed to see what drove the glucose response to specific foods for each individual. In that
way, an algorithm was built without the biases of the scientists.
There are other efforts underway in the field as well. In some continuing nutrition studies,
smartphone photos of participants’ plates of food are being processed by deep learning, an-
other subtype of A.I., to accurately determine what they are eating. This avoids the hassle
of manually logging in the data and the use of unreliable food diaries (as long as participants
remember to take the picture).
But that is a single type of data. What we really need to do is pull in multiple types of data—
activity, sleep, level of stress, medications, genome, microbiome and glucose—from multiple
devices, like skin patches and smartwatches. With advanced algorithms, this is eminently
doable. In the next few years, you could have a virtual health coach that is deep learning about
your relevant health metrics and providing you with customized dietary recommendations.”
“The clear consensus was that AI tools had made little, if any, impact in the fight against
covid.”
A large collection of 37,421 titles (published and preprint reports) on Covid-19 models up to July 2020 was examined in [335], where only 169 studies describing 232 prediction models were selected based on CHARMS (CHecklist for critical Appraisal and data extraction for Systematic Reviews of prediction Modeling Studies) [337] for detailed analysis, with the risk of bias assessed using PROBAST (Prediction
model Risk Of Bias ASsessment Tool) [338]. A follow-up study [336] examined 2,215 titles up to Oct
2020, using the same methodology as in [335] with the added requirement of “sufficiently documented
325 As of 2020.12.18, the COVID-19 pandemic was still raging across the entire United States.
Figure 140: MIT COVID-19 diagnosis by cough recordings. Machine learning architecture.
Audio Mel Frequency Cepstrum Coefficients (MFCC) as input. Each cough signal is split
into 6 audio chunks, processed by the MFCC package, then passed through the Biomarker 1
to check on muscular degradation. The output of Biomarker 1 is input into each of the three
Convolutional Neural Networks (CNNs), representing Biomarker 2 (Vocal cords), Biomarker
3 (Lungs & Respiratory Tract), Biomarker 4 (Sentiment). The outputs of these CNNs are
concatenated and “pooled” together to serve as (1) input for “Competing Aggregator Models”
to produce a “longitudinal saliency map”, and as (2) input for a deep and dense network with
ReLU activation, followed by a “binary dense layer” with sigmoid activation to produce Covid-
19 diagnosis [333]. (CC BY 4.0.)
methodologies”, to narrow down to 62 titles for review “in most details”. In the words of the lead developer
of PROBAST, “unfortunately” journals outside the medical field were not included since it would be a
“surprise” that “the reporting and conduct of AI health models is better outside the medical literature”.326
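To fix ideas, a hedged PyTorch sketch of the overall shape of the architecture in Figure 140 follows; all layer sizes, kernel sizes, and the pooling are assumptions, since the authors of [333] did not release their code, and the "Competing Aggregator Models" branch is omitted:

```python
import torch
import torch.nn as nn

class CoughNet(nn.Module):
    """Sketch: three parallel CNN 'biomarkers' on MFCC input, dense ReLU head, sigmoid output."""
    def __init__(self):
        super().__init__()
        def branch():   # one CNN biomarker (vocal cords / lungs / sentiment)
            return nn.Sequential(
                nn.Conv2d(1, 8, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten())
        self.branches = nn.ModuleList([branch() for _ in range(3)])
        self.head = nn.Sequential(                    # deep/dense ReLU part
            nn.Linear(3 * 8 * 4 * 4, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Sigmoid())           # binary Covid-19 diagnosis

    def forward(self, mfcc):                          # mfcc: (batch, 1, n_mfcc, n_frames)
        pooled = torch.cat([b(mfcc) for b in self.branches], dim=1)
        return self.head(pooled)

print(CoughNet()(torch.randn(2, 1, 40, 100)).shape)   # torch.Size([2, 1])
```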
Covid-19 diagnosis from cough recordings. MIT researchers developed a cough-test smartphone
app that diagnoses Covid-19 from cough recordings [333], and claimed that their app achieved excellent
results:327
“When validated with subjects diagnosed using an official test, the model achieves COVID-19
sensitivity of 98.5% with a specificity of 94.2% (AUC: 0.97). For asymptomatic subjects it
achieves sensitivity of 100% with a specificity of 83.2%.” [333].
making one wonder why it had not been made available for use by everyone, since “These inventions could help our coronavirus crisis now. But delays mean they may not be adopted until the worst of the ...” Several concerns about such cough-recording models were, however, raised:
326 Private communication with Karel (Carl) Moons on 2021 Oct 28. In other words, only medical journals included in
PROBAST would report Covid-19 models that cannot be beaten by models reported in non-medical journals, such
as in [333], which was indeed not “fit for clinical use” to use the same phrase in [334].
327 “In medical diagnosis, test sensitivity is the ability of a test to correctly identify those with the disease (true
positive rate), whereas test specificity is the ability of the test to correctly identify those without the disease (true
negative rate).” See “Sensitivity and specificity”, Wikipedia version 02:21, 22 February 2021. For the definition of
“AUC” (Area Under the ROC Curve), with “ROC” abbreviating for “Receiver Operating characteristic Curve”, see
“Classification: ROC Curve and AUC”, in “Machine Learning Crash Course”, Website. Internet archive.
(1) Machine-learning models did not detect Covid-19, but only distinguished between healthy people and
sick people, a not so useful task.
(2) The surrounding acoustic environment may introduce biases into the cough sound recordings, e.g., Covid-
19 positive people tend to stay indoors, and Covid-19 negative people outdoors.
(3) Participants providing coughs for the datasets may know their Covid-19 status, and that knowledge
would affect their emotion, and hence the machine learning models.
(4) The machine-learning models can only be as accurate as the cough recording labels, which may not
be valid since participants self reported their Covid-19 status.
(5) Most researchers, like the authors of [333], don’t share codes and datasets, or even information on
their method as mentioned above; see also Section 14.7 “Lack of transparency”.
(6) The influence of factors such as comorbidity, ethnicity, geography, socio-economics, on Covid-19 is
complex and unequal, and could introduce biases in the datasets.
(7) Lack of population control (participant identity not recorded) led to non-disjoint training, development, and test sets.
Other Covid-19 machine-learning models. A comprehensive review of machine learning for Covid-
19 diagnosis based on medical-data collection, preprocessing of medical images, feature extraction, and classification is provided in [341], where methods based on cough sound recordings were not
included. Seven methods were reviewed in detail: (1) transfer learning, (2) ensemble learning, (3) un-
supervised learning and (4) semi-supervised learning, (5) convolutional neural networks, (6) graph neural
networks, (7) explainable deep neural networks.
In [342], deep-learning methods together with transfer learning were reviewed for classification and
detection of Covid-19 based on chest X-ray, computer-tomography (CT) images, and lung-ultrasound im-
ages. Also reviewed were machine-learning methods for selection of vaccine candidates, natural-language-
processing methods to analyze public sentiment during the pandemic.
For multi-disease (including Covid-19) prediction, methods based on (1) logistic regression, (2) ma-
chine learning, and in particular (3) deep learning were reviewed, with difficulties encountered forming a
basis for future developments pointed out, in [343].
328 One author of the present article (LVQ), more than one year after the preprint of [333], still spit into a tube for a Covid
test instead of coughing into a phone.
Information on the collection of genes, called genotype, related to Covid-19, was predicted by search-
ing and scoring similarities between the seed genes (obtained from prior knowledge) and candidate genes
(obtained from the biomedical literature) with the goal to establish the molecular mechanism of Covid-19
[344].
In [345], the proteins associated with Covid-19 were predicted using ligand329 design and molecular modeling.
In [346], after evaluating various computer-science techniques using Fuzzy-Analytic Hierarchy Process
integrated with the Technique for Order Preference by Similarity to Ideal Solution (TOPSIS), it was recommended
to use Blockchain as the most effective technique to be used by healthcare workers to address Covid-19
problems in Saudi Arabia.
Other Covid-19 machine learning models include the use of regression algorithms for real-time analysis
of Covid-19 pandemic [347], forecasting the number of infected people using the logistic growth curve
and the Gompertz growth curve [348], a generalization of the SEIR330 model and logistic regression for
forecasting [349].
Other applications of deep learning include a real-time maskless-face detector using deep residual net-
works [356], topology optimization with embedded physical law and physical constraints [357], prediction
of stress-strain relations in granular materials from triaxial test results [358], surrogate model for flight-load
analysis [359], classification of domestic refuse in medical institutions based on transfer learning and con-
volutional neural network [360], convolutional neural network for arrhythmia diagnosis [361], e-commerce
dynamic pricing by deep reinforcement learning [362], network intrusion detection [363], road pavement
distress detection for smart maintenance [364], traffic flow statistics [365], multi-view gait recognition using
deep CNN and channel attention mechanism [366], mortality risk assessment of ICU patients [367], stereo
matching method based on space-aware network model to reduce the limitation of GPU RAM [368], air
quality forecasting in Internet of Things [369], analysis of cardiac disease abnormal ECG signals [370], de-
tection of mechanical parts (nuts, bolts, gaskets, etc.) by machine vision [371], asphalt road crack detection
[372], steel commondity selection using bidirectional encoder representations from transformers (BERT)
[373], short-term traffic flow prediction using LSTM-XGBoost combination model [374], emotion analysis
based on multi-channel CNN in social networks [375].
Figure 141: Tesla Full-Self-Driving (FSD) controversy (Section 14.1). Left: Tesla in FSD
mode hit a child-size mannequin, repeatedly in safety tests by The Dawn Project, a software
competitor to Tesla, 2022.08.09 [376] [377]. Right: Tesla in FSD mode went around a child-
size mannequin at 15 mph in a residential area, 2022.08.14 [378] [379]. Would a prudent driver
stop completely, waiting for the kid to move out of the road, before proceeding forward? The
driver, a Tesla investor, did not use his own child, indicating that his maneuver was not safe.
“If a neural network is trained on images that show a coffee cup only from a side, for example,
it is unlikely to recognize a coffee cup turned upside down.”
Figure 142: Tesla Full-Self-Driving (FSD) controversy (Section 14.1). The Tesla was about to
run down the child-size mannequin at 23 mph, hitting it at 24 mph. The driver did not hold on to the steering wheel, but only kept his hands close to it for safety, and did not put his foot on the accelerator. There were no cones on either side of the road, and there was room to go around
the mannequin. The weather was clear, sunny. The mannequin wore a bright safety jacket.
Visibility was excellent, 2022.08.15 [380] [381].
Hyundai’s pilot program, announced in June 2022, was reported as a step “forward in the country’s efforts to make autonomous vehicles an everyday reality. The new service, called
RoboRide, features Hyundai Ioniq 5 electric cars equipped with Level 4 autonomous driving capabilities.
The technology allows the taxis to move independently in real-life traffic without the need for human con-
trol, although a safety driver will remain in the car” [385]. According to Hyundai, the safety driver “only
intervenes under limited conditions,” which were not explicitly specified to the public, whereas the car itself
would “perceive, make decisions and control its own driving status.”
But what is “Level 4 autonomous driving”? Let’s look at the autonomous-driving company
Waymo. Their Level 4 consists of “mapping the territory in a granular fashion (including lane markers,
traffic signs and lights, curbs, and crosswalks). The solution incorporates both GPS signals and real-time
sensor data to always determine the vehicle’s exact location. Further, the system relies on more than 20
million miles of real-world driving and more than 20 billion miles in simulation, to allow the Waymo Driver
to anticipate what other road users, pedestrians, or other objects might do” [386].
Yet Level 4 is still far from Level 5, for which “vehicles are fully automated with no need for the
driver to do anything but set the destination and ride along. They can drive themselves anywhere under any
conditions, safely” [387], and which would still be many years away [386].
Indeed, exactly two months after Hyundai’s announcement of their Level 4 test pilot program, on
2022.08.09, The Guardian reported that in a series of safety tests, a “professional test driver using Tesla’s
Full Self-Driving mode repeatedly hit a child-sized mannequin in its path” [377]; Figure 141, left. “It’s a
lethal threat to all Americans, putting children at great risk in communities across the country,” warned The
Dawn Project’s founder, Dan O’Dowd, who described the test results as “deeply disturbing,” as the vehicle
tended to “mow down children at crossroads,” and who argued for prohibiting Tesla vehicles from the
streets until Tesla’s self-driving software could be proven safe.
The Dawn Project test results were contested by a Tesla investor, who posted a video on 2022.08.14
to prove that the Tesla Full-Self-Driving (FSD) system worked as advertised (Figure 141, right). The next
day, 2022.08.15, Dan O’Dowd posted a video proving that the Tesla under FSD mode ran over a child-size
mannequin at 24 mph in clear weather, with excellent visibility, no cones on either side of the Tesla, and
Figure 143: Tesla crash (Section 14.1). July 2020. Left: “Less than a half-second after [the
Tesla driver] flipped on her turn signal, Autopilot started moving the car into the right lane and
gradually slowed, video and sensor data showed.” Right: “Halfway through, the Tesla sensed
an obstruction—possibly a truck stopped on the side of the road—and paused its lane change.
The car then veered left and decelerated rapidly” [382]. See also Figures 144, 145, 146. (Data
and video provided by QuantivRisk.)
without the driver pressing his foot on the accelerator (Figure 142).
“In June [2022], the National Highway Traffic Safety Administration (NHTSA) said it was expand-
ing an investigation into 830,000 Tesla cars across all four current model lines. The expansion came after
analysis of a number of accidents revealed patterns in the car’s performance and driver behavior” [377].
“Since 2016, the agency has investigated 30 crashes involving Teslas equipped with automated driving sys-
tems, 19 of them fatal. NHTSA’s Office of Defects Investigation is also looking at the company’s autopilot
technology in at least 11 crashes where Teslas hit emergency vehicles.”
In 2019, it was reported that several car executives thought that driverless cars were still several years
in the future because of the difficulty in anticipating human behavior [388]. The progress of Hyundai’s
driverless taxis has not solved the challenge of dealing with human behavior, as there was still a need for a
“safety driver.”
“On [2022] May 6, Lyft, the ride-sharing service that competes with Uber, sold its Level 5 division, an
autonomous-vehicle unit, to Woven Planet, a Toyota subsidiary. After four years of research and develop-
ment, the company seems to realize that autonomous driving is a tough nut to crack—much tougher than
the team had anticipated.
“Uber came to the same conclusion, but even earlier, in December. The company sold Advanced
Technologies Group, its self-driving unit, to Aurora Innovation, citing high costs and more than 30 crashes,
culminating in a fatality as the reason for cutting its losses.
“Finally, several smaller companies, including Zoox, a robo-taxi company; Ike, an autonomous-trucking
startup; and Voyage, a self-driving startup; have also passed the torch to companies with bigger budgets”
[384].
“Those startups, like many in the industry, have underestimated the sheer difficulty of “leveling up”
vehicle autonomy to the fabled Level 5 (full driving automation, no human required)” [384].
On top of the difficulty in addressing human behavior, there were other problems, perhaps less
challenging in principle, or so we thought, as reported in [386]: “widespread adoption of autonomous driving is still
years away from becoming a reality, largely due to the challenges involved with the development of accurate
sensors and cameras, as well as the refinement of algorithms that act upon the data captured by these sensors.
Figure 144: Tesla crash (Section 14.1). July 2020. “Less than a second after the Tesla has
slowed to roughly 55 m.p.h. [Left], its rear camera shows a car rapidly approaching [Right]”
[382]. There were no moving cars in either lane in front of the Tesla for a long distance ahead
(perhaps a quarter of a mile). See also Figures 143, 145, 146. (Data and video provided by
QuantivRisk.)
“This process is extremely data-intensive, given the large variety of potential objects that could be
encountered, as well as the near-infinite ways objects can move or react to stimuli (for example, road signs
may not be accurately identified due to lighting conditions, glare, or shadows, and animals and people do
not all respond the same way when a car is hurtling toward them).
“Algorithms in use still have difficulty identifying objects in real-world scenarios; in one accident in-
volving a Tesla Model X, the vehicle’s sensing cameras failed to identify a truck’s white side against a
brightly lit sky.”
In addition to Figure 141, another example was a Tesla crash in July 2020 in clear, sunny weather with
few clouds, as shown in Figures 143, 144, 145, 146. The self-driving system could not detect that a static
truck was parked on the side of a highway, and due to the forward, lane-changing motion of the Tesla, the
software could have determined that the Tesla was running into the truck, and veered left while rapidly
decelerating to avoid a collision. As a result, the Tesla was rear-ended by a fast-approaching car on its left
side [382].
“Pony.ai is the latest autonomous car company to make headlines for the wrong reasons. It has just lost
its permit to test its fleet of autonomous vehicles in California over concerns about the driving record of the
safety drivers it employs. It’s a big blow for the company, and highlights the interesting spot the autonomous
car industry is in right now. After a few years of very bad publicity, a number of companies have made real
progress in getting self-driving cars on the road” [389].
The 2022 article “‘I’m the Operator’: The Aftermath of a Self-Driving Tragedy” [390] described these
“few years of very bad publicity” in stunning, tragic detail about an Uber autonomous-vehicle operator,
Rafaela Vasquez, who did not take over control of the vehicle in time, and killed a jaywalking pedestrian.
The classification software of the Uber autonomous driving system could not recognize the pedestrian,
but vacillated among “vehicle,” “other,” and “bicycle” [390].
“At 2.6 seconds from the object, the system identified it as ‘bicycle.’ At 1.5 seconds, it switched back
to considering it ‘other.’ Then back to ‘bicycle’ again. The system generated a plan to try to steer around
whatever it was, but decided it couldn’t. Then, at 0.2 seconds to impact, the car let out a sound to alert
Vasquez that the vehicle was going to slow down. At two-hundredths of a second before impact, traveling
at 39 mph, Vasquez grabbed the steering wheel, which wrested the car out of autonomy and into manual
Figure 145: Tesla crash (Section 14.1). July 2020. The fast-approaching blue car rear-ended the
Tesla, denting its own front bumper, with shards of its broken glass (or clear plastic) cover flying,
as captured by the Tesla rear camera [382]. See also Figures 143, 144, 146. (Data and video
provided by QuantivRisk.)
mode. It was too late. The smashed bike scraped a 25-foot wake on the pavement. A person lay crumpled
in the road” [390].
The operator training program manager said “I felt shame when I heard that a lone frontline employee
had been singled out to be charged with negligent homicide with a dangerous instrument. We owed Rafaela
better oversight and support. We also put her in a tough position.” Another program manager said “You
can’t put the blame on just that one person. I mean, it’s absurd. Uber had to know this would happen. We
get distracted in regular driving. It’s not like somebody got into their car and decided to run into someone.
They were working within a framework. And that framework created the conditions that allowed that to
happen.” [390].
After the above-mentioned fatality caused by an Uber autonomous car with a single operator in it,
“many companies temporarily took their cars off the road, and after it was revealed that only one technician
was inside the Uber car, most companies resolved to keep two people in their test vehicles at all times”
[391]. Having two operators in a car would help to avoid accidents, but the pandemic social-distancing rules
often prevented such an arrangement.
“Many self-driving car companies have no revenue, and the operating costs are unusually high. Au-
tonomous vehicle start-ups spend $1.6 million a month on average—four times the rate at financial tech or
health care companies” [391].
“Companies like Uber and Lyft, worried about blowing through their cash in pursuit of autonomous
technology, have tapped out. Only the deepest-pocketed outfits like Waymo, which is a subsidiary of
Google’s parent company, Alphabet; auto giants; and a handful of start-ups are managing to stay in the
game.
“Late last month, Lyft sold its autonomous vehicle unit to a Toyota subsidiary, Woven Planet, in a deal
valued at $550 million. Uber offloaded its autonomous vehicle unit to another competitor in December. And
three prominent self-driving start-ups have sold themselves to companies with much bigger budgets over the
past year” [392].
Figure 146: Tesla crash (Section 14.1). After hitting the Tesla, the blue car “spun across the
highway [Left] and onto the far shoulder [Right],” as another car was approaching in the right
lane (left in photo), but still at a safe distance so as not to hit it [382]. See also Figures 143,
144, 145. (Data and video provided by QuantivRisk.)
Similar problems exist in building autonomous boats to ply the oceans without a need for a crew on
board [393]:
“When compared with autonomous cars, ships have the advantage of not having to make split-
second decisions in order to avoid catastrophe. The open ocean is also free of jaywalking
pedestrians, stoplights and lane boundaries. That said, robot ships share some of the problems
that have bedeviled autonomous vehicles on land, namely, that they’re bad at anticipating what
humans will do, and have limited ability to communicate with them.”
Shipping is a dangerous profession: some 41 large ships were lost at sea due to fires, rogue
waves, or other accidents in 2019 alone. But before an autonomous ship can reach the ocean, it must get
out of port, and that remains a technical hurdle not yet overcome:
“ ’Technically, it’s not possible yet to make an autonomous ship that operates safely and effi-
ciently in crowded areas and in port areas,’ says Rudy Negenborn, a professor at TU Delft who
researches and designs systems for autonomous shipping.
Makers of autonomous ships handle these problems by giving humans remote control. But
what happens when the connection is lost? Satisfactory solutions to these problems have yet
to arrive, adds Dr. Negenborn.”
The onboard deep-learning computer vision system was trained to recognize “kayaks, canoes, Sea-
Doos”, but a person standing on a paddle board would look like someone walking on water to the system
[393]. See also Figures 143, 144, 145, 146 on the failure of the Tesla computer vision system in detecting a
parked truck on the side of a highway.
Beyond the possible loss of connection in a human remote-controlled ship, mechanical failure did occur,
such as the one that befell the Mayflower autonomous ship shown in Figure 147 [394]. Measures would have
to be taken when mechanical failure happens to a crewless ship in the middle of a vast ocean.
See also the interview of S.J. Russell in [79] on the need to develop hybrid systems that combine clas-
sical AI with deep learning, which has limitations, even though it is good at classification and
perception,332 and Section 14.3 on the barrier of meaning in AI.
332 S.J. Russell also appeared in the video “AI is making it easier...” mentioned at the end of this closure section.
Figure 147: Mayflower autonomous ship (Section 14.1) sailing from Plymouth, UK, planning
to arrive at Plymouth, MA, U.S., like the original Mayflower 400 years ago, but instead arriving
at Halifax, Nova Scotia, Canada, on 2022 Jun 05, due to mechanical problems [394]. (CC BY-
SA 4.0, Wikipedia, version 16:43, 17 July 2022.)
“Compared with conventional computer programs, [AI that teaches itself] acts for reasons in-
comprehensible to the outside world. It can be trained, as a parrot can, by rewarding the desired
behaviour; in fact, this describes the whole of its learning process. But it can’t be consciously
designed in all its details, in the way that a passenger jet can be. If an airliner crashes, it is in
theory possible to reconstruct all the little steps that led to the catastrophe and to understand
why each one happened, and how each led to the next. Conventional computer programs can
be debugged that way. This is true even when they interact in baroquely complicated ways.
But neural networks, the kind of software used in almost everything we call AI, can’t even in
principle be debugged that way. We know they work, and can by training encourage them to
work better. But in their natural state it is quite impossible to reconstruct the process by which
they reach their (largely correct) conclusions.”
The 2021 breakthrough in computer science, as declared by Quanta Magazine [233], was the discov-
ery of the connection between shallow networks with infinite width (Figure 148) and kernel machines (or
methods), as a first step toward understanding how deep-learning networks work; see Section 8 on “Kernel
machines” and Footnote 31.
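The following is a minimal numerical sketch (synthetic, not the analysis behind [233]) of that infinite-width connection: by the central limit theorem, the scalar output of a randomly initialized one-hidden-layer network at a fixed input approaches a Gaussian as the hidden width grows, which is the Gaussian-process limit illustrated in Figure 148:

```python
# Minimal sketch: the output of a randomly initialized one-hidden-layer
# ReLU network at a fixed input approaches a Gaussian as the width grows
# (the infinite-width / Gaussian-process connection of Figure 148).
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)                     # a fixed input point

def random_net_output(width):
    # 1/sqrt(fan_in) scaling keeps the output variance O(1) at any width
    W1 = rng.normal(size=(width, x.size)) / np.sqrt(x.size)
    w2 = rng.normal(size=width) / np.sqrt(width)
    return w2 @ np.maximum(W1 @ x, 0.0)    # scalar network output

for width in (1, 10, 100, 1000):
    samples = np.array([random_net_output(width) for _ in range(20000)])
    z = (samples - samples.mean()) / samples.std()
    # excess kurtosis -> 0 signals convergence to a Gaussian distribution
    print(f"width={width:5d}  excess kurtosis={np.mean(z**4) - 3.0:+.3f}")
```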
“ Machine learning algorithms don’t yet understand things the way humans do—with some-
times disastrous consequences.
Even more worrisome are recent demonstrations of the vulnerability of A.I. systems to
so-called adversarial examples. In these, a malevolent hacker can make specific changes to
images, sound waves or text documents that while imperceptible or irrelevant to humans will
cause a program to make potentially catastrophic errors.
Figure 148: Network with infinite width (left) and Gaussian distribution (right) (Sections 6.1,
14.2). “A number of recent results have shown that DNNs that are allowed to become in-
finitely wide converge to another, simpler, class of models called Gaussian processes. In this
limit, complicated phenomena (like Bayesian inference or gradient descent dynamics of a con-
volutional neural network) boil down to simple linear algebra equations. Insights from these
infinitely wide networks frequently carry over to their finite counterparts. As such, infinite-
width networks can be used as a lens to study deep learning, but also as useful models in their
own right” [279] [231]. See Figures 60 and 61 for the motivation for networks with infinite
width. (CC BY-SA 4.0, Wikipedia, version 03:51, 18 June 2022.)
The possibility of such attacks has been demonstrated in nearly every application domain
of A.I., including computer vision, medical image processing, speech recognition and language
processing. Numerous studies have demonstrated the ease with which hackers could, in princi-
ple, fool face- and object-recognition systems with specific minuscule changes to images, put
inconspicuous stickers on a stop sign to make a self-driving car’s vision system mistake it for
a yield sign or modify an audio signal so that it sounds like background music to a human but
instructs a Siri or Alexa system to perform a silent command.
These potential vulnerabilities illustrate the ways in which current progress in A.I. is
stymied by the barrier of meaning. Anyone who works with A.I. systems knows that behind the
facade of humanlike visual abilities, linguistic fluency and game-playing prowess, these pro-
grams do not—in any humanlike way—understand the inputs they process or the outputs they
produce. The lack of such understanding renders these programs susceptible to unexpected
errors and undetectable attacks.
As the A.I. researcher Pedro Domingos noted in his book The Master Algorithm, ‘People
worry that computers will get too smart and take over the world, but the real problem is that
they’re too stupid and they’ve already taken over the world.’ ”
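One classic construction of such adversarial examples is the fast gradient sign method (FGSM), which nudges every pixel in the direction that increases the classification loss; a minimal PyTorch sketch (the tiny linear model, input, and label below are placeholders, not taken from the studies quoted above):

```python
# Minimal sketch of the fast gradient sign method (FGSM) for building
# adversarial examples; the model here is a placeholder, but the recipe
# applies to any differentiable classifier.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # placeholder
loss_fn = nn.CrossEntropyLoss()

x = torch.rand(1, 1, 28, 28, requires_grad=True)  # stand-in image in [0, 1]
y = torch.tensor([3])                             # its (assumed) true label

loss_fn(model(x), y).backward()                   # gradient of loss w.r.t. x

eps = 0.05                                        # imperceptible budget
x_adv = (x + eps * x.grad.sign()).clamp(0.0, 1.0).detach()
# x_adv differs from x by at most eps per pixel, yet may flip the prediction
print("prediction flipped:",
      model(x).argmax(dim=1).item() != model(x_adv).argmax(dim=1).item())
```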
Such a barrier of meaning is also a barrier for AI to tackle human controversies; see Section 14.5. See also
Section 14.1 on driverless cars not coming any time soon, which is related to the above barrier of meaning.
The editors of The Guardian [67], having recounted AI’s mastery of “immensely complex” games such as
Go (see Section 13.5 on resurgence of AI and current state), also reported another breakthrough as a more
ominous warning on a “Power struggle” to preserve liberal democracies against authoritarian governments
and criminals:
“The second great development of the last year makes bad outcomes much more likely. This
is the much wider availability of powerful software and hardware. Although vast quantities of
data and computing power are needed to train most neural nets, once trained a net can run on
very cheap and simple hardware. This is often called the democratisation of technology but
it is really the anarchisation of it. Democracies have means of enforcing decisions; anarchies
have no means even of making them. The spread of these powers to authoritarian governments
on the one hand and criminal networks on the other poses a double challenge to liberal democ-
racies. Technology grants us new and almost unimaginable powers but at the same time it takes
away some powers, and perhaps some understanding too, that we thought we would always
possess.”
Nearly three years later, a report of a national poll of 2,200 adults in the U.S., released on 2021.11.15,
indicated that three in four adults were concerned about the loss of privacy, along with “loss of trust in
elections (57%), in threats to democracy (52%), and in loss of trust in institutions (56%). Additionally,
58% of respondents say it has contributed to the spread of misinformation” [396].
Figure 149: Deepfake images (Section 14.4.1). AI-generated portraits using Generative Ad-
versarial Network (GAN) models. See also [397] [398], Chap. 8, “GAN Fingerprints in Face
Image Synthesis.” (Images from ‘This Person Does Not Exist’ site.)
14.4.1 Deepfakes
AI software available online that helps create videos showing someone saying or doing things that the
person never said or did represents a clear danger to democracy, as these deepfake videos could affect the
outcome of an election, among other misdeeds, with risks to national security. Advances in machine learning
have made deepfakes “ever more realistic and increasingly resistant to detection” [399]; see Figure 149. The
authors of [400] concurred:
authors of [400] concurred:
“Deepfake videos made with artificial intelligence can be a powerful force because they make
it appear that someone did or said something that they never did, altering how the viewers see
politicians, corporate executives, celebrities and other public figures. The tools necessary to
make these videos are available online, with some people making celebrity mashups and one
app offering to insert users’ faces into famous movie scenes.”
To be sure, deepfakes do have benefits in education, the arts, and individual autonomy [399]. In education,
deepfakes could be used to provide information to students in a more interesting manner. For example,
deepfakes make it possible to “manufacture videos of historical figures speaking directly to students, giving
an otherwise unappealing lecture a new lease on life”. In the arts, deepfake technology has made it possible
to resurrect long-dead actors for fresh roles in new movies; an example is a recent Star Wars movie featuring
the deceased actress Carrie Fisher. In helping to maintain some personal autonomy, deepfake audio technology
could help restore the ability to speak for a person suffering from some form of paralysis that prevents normal
speech.
On the other hand, the authors of [399] cited a long list of harmful uses of deepfakes, from harm to
individuals or organizations (e.g., exploitation, sabotage), to harm to society (e.g., distortion of democratic
discourse, manipulation of elections, eroding trust in institutions, exacerbating social division, undermin-
ing public safety, undermining diplomacy, jeopardizing national security, undermining journalism, and
dismissing authentic news as “deepfake news”, the so-called liar’s dividend).333 See also [402] [403] [404] [405].
Researchers have been racing to develop methods to detect deepfakes, a difficult technological chal-
lenge [406]. One method is to spot the subtle characteristics of how someone speaks, providing a basis to
determine whether a video is true or fake [400]. But that method was not among the top-five winners of the
DeepFake Detection Challenge (DFDC) [407], organized in the period 2019-2020 by “The Partnership for AI, in
collaboration with large companies including Facebook, Microsoft, and Amazon,” with a total prize money
of one million dollars, divided among the top five winners, out of more than two thousand teams [408].
Humans’ ability to detect deepfakes compared well with that of the “leading model,” i.e., the DFDC top
winner [408]. The results were “at odds with the commonly held view in media forensics that ordinary
people have extremely limited ability to detect media manipulations” [408]; see Figure 150, where the
width of a violin plot,334 at a given accuracy, represents the number of participants with that accuracy.
In Col. 2 of Figure 150, the area of the blue violin above the leading-model accuracy of 65% corresponds
to 82% of the participants, the whole violin representing all participants. A crowd does have a collective
accuracy comparable to (or, for those who viewed at least 10 videos, better than) the leading model; see
Cols. 5, 6, 7 in Figure 150.
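For readers unfamiliar with violin plots, a minimal matplotlib sketch with synthetic participant accuracies (illustrative only, not the DFDC data plotted in Figure 150) shows how the width at each accuracy level encodes the number of participants:

```python
# Minimal sketch of a violin plot in the style of Figure 150 (synthetic
# accuracies, not the DFDC data): the width of each violin at a given
# accuracy is proportional to the number of participants at that accuracy.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
acc_e1 = rng.normal(80.0, 10.0, 500).clip(0.0, 100.0)  # stand-in E1 (%)
acc_e2 = rng.normal(66.0, 12.0, 500).clip(0.0, 100.0)  # stand-in E2 (%)

fig, ax = plt.subplots()
ax.violinplot([acc_e1, acc_e2], showmeans=True)
ax.axhline(65.0, linestyle="--", color="gray", label="leading model (65%)")
ax.set_xticks([1, 2])
ax.set_xticklabels(["E1", "E2 (recruited)"])
ax.set_ylabel("accuracy (%)")
ax.legend()
plt.show()
```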
While it is difficult to detect AI deepfakes, the MIT Media Lab DeepFake detection project advised
paying attention to the following eight facial features [411]:
(1) “Face. High-end DeepFake manipulations are almost always facial transformations.
(2) “Cheeks and forehead. Does the skin appear too smooth or too wrinkly? Is the agedness of the skin
similar to the agedness of the hair and eyes? DeepFakes are often incongruent on some dimensions.
(3) “Eyes and eyebrows. Do shadows appear in places that you would expect? DeepFakes often fail to
fully represent the natural physics of a scene.
(4) “Glasses. Is there any glare? Is there too much glare? Does the angle of the glare change when the
person moves? Once again, DeepFakes often fail to fully represent the natural physics of lighting.
(5) “Facial hair or lack thereof. Does this facial hair look real? DeepFakes might add or remove a
mustache, sideburns, or beard. But, DeepFakes often fail to make facial hair transformations fully
natural.
(6) “Facial moles. Does the mole look real?
(7) “Blinking. Does the person blink enough or too much?
(8) “Size and color of the lips. Does the size and color match the rest of the person’s face?”
333 Watch also Danielle Citron’s 2019 TED talk “How deepfakes undermine truth and threaten democracy” [401].
334 See the classic original paper [409], which was cited 1,554 times on Google Scholar as of 2022.08.24. See also [410] with Python code and resulting images on GitHub.
Figure 150: DeepFake detection (Section 14.4.1). Violin plots. • Individual vs machine. The
leading model had an accuracy of 65% on 4,000 videos (Col. 1). In Experiment 1 (E1), 5,524
participants were asked to identify a deepfake from each of 56 pairs of videos. The participants
had a mean accuracy of 80% (white dot in Col. 2), with 82% of the participants having an
accuracy better than that of the leading model (65%). In Experiment 2 (E2), using a subset of
randomly sampled videos, the recruited (R) participants had a mean accuracy of 66% (Col. 3), the
non-recruited (NR) participants 69% (Col. 4), and the leading model 80%. • Crowd wisdom
vs machine. Crowd mean is the average accuracy by participants for each video. R participants
had a crowd-mean average accuracy at 74%, NR participants at 80%, which was the same for
the leading model, and NR participants who viewed at least 10 videos at 86% [408]. (CC BY-
NC-ND 4.0)
“Clearview AI, the controversial and secretive facial recognition company, recently experi-
enced its first major data breach—a scary prospect considering the sheer amount and scope
of personal information in its database, as well as the fact that access to it is supposed to be
restricted to law enforcement agencies.”
The leaked documents showed that Clearview AI had a wide range of customers, from law-enforcement
agencies (both domestic and international) to large retail stores (Macy’s, Best Buy, Walmart). Experts
described Clearview AI’s plan to produce a publicly available face-recognition app as “dangerous”. So we
got screwed again.
There was a documented wrongful arrest by a face-recognition algorithm that demonstrated racism, i.e.,
a bias against people of color [414]. A detective showed the wrongful-arrest victim a photo that was clearly
not the victim, and asked “Is this you?” to which the victim replied “You think all black men look alike?”
It is well known that AI has “propensity to replicate, reinforce or amplify harmful existing social biases”
[415], such as racial bias [416] among others: “An early example arose in 2015, when a software engineer
pointed out that Google’s image-recognition system had labeled his Black friends as ‘gorillas.’ Another
example arose when Joy Buolamwini, an algorithmic fairness researcher at MIT, tried facial recognition on
herself—and found that it wouldn’t recognize her, a Black woman, until she put a white mask over her face.
These examples highlighted facial recognition’s failure to achieve another type of fairness: representational
fairness” [417].335
A legal measure has been taken against gathering data for facial-recognition software. In May 2022,
Clearview AI was fined “$10 million for scraping UK faces from the web,” and “that might not be the
end of it”; in addition, “the firm was also ordered to delete all of the data it holds on UK citizens” [419].
There were more such measures: “Earlier this year, Italian data protection authorities fined Clearview
AI €20 million ($21 million) for breaching data protection rules. Authorities in Australia, Canada, France,
and Germany have reached similar conclusions.
Even in the US, which does not have a federal data protection law, Clearview AI is facing increasing
scrutiny. Earlier this month the ACLU won a major settlement that restricts Clearview from selling its
database across the US to most businesses. In the state of Illinois, which has a law on biometric data,
Clearview AI cannot sell access to its database to anyone, even the police, for five years” [419].
“The problem is that these new algorithms are beginning to bump up against significant lim-
itations. They need enormous amounts of data, only some kinds of data will do, and they’re
not very good at generalizing from that data. Babies seem to learn much more general and
powerful kinds of knowledge than AIs do, from much less and much messier data. In fact,
human babies are the best learners in the universe. How do they do it? And could we get an
AI to do the same?
First, there’s the issue of data. AIs need enormous amounts of it; they have to be trained
on hundreds of millions of images or games.
Children, on the other hand, can learn new categories from just a small number of exam-
ples. A few storybook pictures can teach them not only about cats and dogs but jaguars and
rhinos and unicorns.
AIs also need what computer scientists call “supervision.” In order to learn, they must be
given a label for each image they “see” or a score for each move in a game. Baby data, by
contrast, is largely unsupervised.
Even with a lot of supervised data, AIs can’t make the same kinds of generalizations
that human children can. Their knowledge is much narrower and more limited, and they are
easily fooled by what are called “adversarial examples.” For instance, an AI image recognition
system will confidently say that a mixed-up jumble of pixels is a dog if the jumble happens to
fit the right statistical pattern—a mistake a baby would never make.”
Regarding early stopping and generalization error in network training, see Remark 6.1 in Section 6.1. To
make AIs into more robust and resilient learners, researchers are developing methods to build curiosity into
AIs, instead of focusing on immediate rewards.
Figure 151: Lack of transparency and irreproducibility (Section 14.7). The table shows many
missing pieces of information for the three networks—Lesion, Breast, and Case models—used
to detect breast cancer. Learning rate, Section 6.2. Learning-rate schedule, Section 6.3.1,
Figure 65 in Section 6.3.5. SGD with momentum, Section 6.3.2 and Remark 6.5. Adam
algorithm, Section 6.5.6. Batch size, Sections 6.3.1 and 6.3.5. Epoch, Footnote 145. [421].
(Figure reproduced with permission of the authors)
A group of scientists [421] had had enough and protested the lack of transparency in AI research in a
“damning” article in Nature, a major scientific journal.
“We couldn’t take it anymore,” says Benjamin Haibe-Kains, the lead author of the response,
who studies computational genomics at the University of Toronto. “It’s not about this study in
particular—it’s a trend we’ve been witnessing for multiple years now that has started to really
bother us.” [422]
The contentious study in question was published by the Google-Health authors of [423] on the use of AI in
medical imaging to detect breast cancer. But the authors of [423] provided so little information about
their code and how it was tested that their article read more like a “promotion of proprietary tech” than
a scientific paper. Figure 151 shows the pieces of crucial information missing for reproducing the results. A
question immediately comes to mind: Why would a reputable journal like Nature accept such a paper?
Was the review rigorous enough?
“When we saw that paper from Google, we realized that it was yet another example of a
very high-profile journal publishing a very exciting study that has nothing to do with science,”
Haibe-Kains says. “It’s more an advertisement for cool technology. We can’t really do any-
thing with it.” [422]
According to [421], even though the Google-Health authors of [423] stated that “all experiments and imple-
mentation details were described in sufficient detail in the supplementary methods section of their Article
to ‘support replication with non-proprietary libraries’,” that was a subjective statement, and replicating their
results would be a difficult task, since such a textual description can hide a high level of complexity in the
code, and nuances in the computer code can have large effects on the training and evaluation results.
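A low-cost step toward the replicability demanded in [421] would be to publish, alongside the code, a machine-readable record of exactly the hyperparameters that Figure 151 flags as missing; a minimal sketch with illustrative values (not those of [423]):

```python
# Minimal sketch: serialize the training details that Figure 151 flags as
# missing (learning rate and schedule, optimizer settings, batch size,
# epochs) so that an independent team can attempt replication.
import json

config = {                              # illustrative values, not [423]'s
    "optimizer": "SGD with momentum",   # cf. Section 6.3.2, Remark 6.5
    "momentum": 0.9,
    "learning_rate": 1e-2,              # cf. Section 6.2
    "lr_schedule": {"type": "step", "decay": 0.1, "every_epochs": 30},
    "batch_size": 64,                   # cf. Sections 6.3.1 and 6.3.5
    "epochs": 90,                       # cf. Footnote 145
    "random_seed": 42,
    "code_version": "<git commit hash of the exact training code>",
}

with open("training_config.json", "w") as f:
    json.dump(config, f, indent=2)
```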
“AI is feeling the heat for several reasons. For a start, it is a newcomer. It has only really
become an experimental science in the past decade, says Joelle Pineau, a computer scientist at
Facebook AI Research and McGill University, who coauthored the complaint. ‘It used to be
theoretical, but more and more we are running experiments,’ she says. ‘And our dedication to
sound methodology is lagging behind the ambition of our experiments.’ ” [422]
No progress in science could be made if results were not verifiable and replicable by independent
researchers.
References
1. Rosenblatt, F. (1957). The perceptron: A perceiving and recognizing automaton. Technical report,
Cornell University, Report No. 85-460-1, Project PARA, January. Internet archive.
2, 11, 55, 211, 271
2. Rosenblatt, F. (1962). Principles of neurodynamics: Perceptrons and the theory of brain mechanisms.
Spartan Books. 2, 11, 46, 55, 210, 212, 213, 214, 215, 271
3. Polyak, B. (1964). Some methods of speeding up the convergence of iteration methods. USSR Com-
putational Mathematics and Mathematical Physics, 4(5), 1–17. DOI 10.1016/0041-5553(64)90137-5.
2, 10, 11, 85, 89, 90, 91
4. Roose, K. (2022). An A.I.-Generated Picture Won an Art Prize. Artists Aren’t Happy. New York Times,
(Sep 2). Original website. 6, 7
5. Jumper, J., Evans, R., Pritzel, A., Green, T., Figurnov, M., et al. (2021). Highly accurate protein
structure prediction with AlphaFold. Nature, 596(7873), 583–589. 7
6. Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., et al. (2016). Mastering the game of Go
with deep neural networks and tree search. Nature, 529(7587), 484+. Original website. 7, 12, 13
7. Moyer, C. How Google’s AlphaGo Beat a Go World Champion. 2016 Mar 28, Original website. 7
8. Edwards, B. (2022). DeepMind breaks 50-year math record using AI; new record falls a week later.
Ars Technica, (Oct 13). Original website, Internet archive. 7
9. Vu-Quoc, L., Humer, A. (2022). Deep learning applied to computational mechanics: A comprehensive
review, state of the art, and the classics. arXiv:2212.08989. 8
10. Roose, K. (2023). Bing (Yes, Bing) Just Made Search Interesting Again. New York Times, (Feb 8).
Original website. 8
11. Knight, W. (2023). Meet Bard, Google’s Answer to ChatGPT. WIRED, (Feb 6). Original website. 8
12. Schmidhuber, J. (2015). Deep learning in neural networks: An overview. Neural Networks, 61, 87–
117. 8, 36, 38, 52, 223, 224, 225, 272
13. LeCun, Y., Bengio, Y., Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. 8, 12, 14, 38,
52, 53, 54, 129, 131
14. Khan, S., Yairi, T. (2018). A review on the application of deep learning in system health management.
Mechanical Systems and Signal Processing, 107, 241–265. 8
15. Sanchez-Lengeling, B., Aspuru-Guzik, A. (2018). Inverse molecular design using machine learning:
Generative models for matter engineering. Science, 361(6400, SI), 360–365. 8
16. Ching, T., Himmelstein, D. S., Beaulieu-Jones, B. K., Kalinin, A. A., Do, B. T., et al. (2018). Opportu-
nities and obstacles for deep learning in biology and medicine. Journal of the Royal Society Interface,
15(141). 8
17. Quinn, J. A., Nyhan, M. M., Navarro, C., Coluccia, D., Bromley, L., et al. (2018). Humanitarian
applications of machine learning with remote-sensing data: review and case study in refugee settlement
mapping. Philosophical Transactions of the Royal Society A-Mathematical Physical and Engineering
Sciences, 376(2128). 8
18. Higham, C. F., Higham, D. J. (2019). Deep learning: An introduction for applied mathematicians.
SIAM Review, 61(4), 860–891. 8
19. Dayan, P., Abbott, L. (2001). Theoretical Neuroscience: Computational and Mathematical Modeling
of Neural Systems. MIT Press. 8, 9, 11, 30, 31, 38, 39, 40, 41, 43, 212, 215, 216, 217, 219
20. Sze, V., Chen, Y.-H., Yang, T.-J., Emer, J. S. (2017). Efficient Processing of Deep Neural Networks:
A Tutorial and Survey. Proceedings of the IEEE, 105(12), 2295–2329. 8, 17, 32, 38, 209
21. Nielsen, M. (2015). Neural Networks and Deep Learning. Determination Press. Original website.
Internet archive. 8, 32, 38, 66, 67, 209, 210, 213
22. Rumelhart, D., Hinton, G., Williams, R. (1986). Learning representations by back-propagating errors.
Nature, 323(6088), 533–536. 8, 90, 215, 223, 224, 225, 271
23. Ghaboussi, J., Garrett, J., Wu, X. (1991). Knowledge-based modeling of material behavior with neural
networks. Journal of Engineering Mechanics-ASCE, 117(1), 132–153. 8, 9, 26, 32, 173, 209, 272
24. Hochreiter, S., Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
336 The landmark paper “Little (1974)” was not listed in the Web of Science database as of Nov 2018, using the search
keywords [au=(little) and py=(1974) and ts=(brain)]. On the other hand, [au=(little) and
ts=(The existence of persistent states in the brain)], i.e., the author’s last name and the
full title of the paper, led to the 1995 collection of Little’s papers edited by Cabrera et al., in which “Little (1974)” was
found.
International Joint Conference on Neural Networks (IJCNN). IEEE. Rio de Janeiro, Brazil. 9, 220
38. Oishi, A., Yagawa, G. (2017). Computational mechanics enhanced by deep learning. Computer Meth-
ods in Applied Mechanics and Engineering, 327, 327–351. 9, 11, 18, 19, 20, 21, 32, 46, 53, 60, 163,
164, 165, 166, 167, 168, 169, 170, 171, 172, 209
39. Zienkiewicz, O., Taylor, R., Zhu, J. (2013). The Finite Element Method: Its Basis and Fundamentals.
Oxford: Butterworth-Heineman. 7th edition. 9, 35, 163, 164
40. Barlow, J. (1976). Optimal stress locations in finite-element models. International Journal for Numer-
ical Methods in Engineering, 10(2), 243–251. 9
41. Barlow, J. (1977). Optimal stress locations in finite-element models - reply. International Journal for
Numerical Methods in Engineering, 11(3), 604. 9
42. Abaqus 6.14. Theory Guide. Simulia Systems, Dassault Systèmes. Subsection 3.2.4 Solid isoparamet-
ric quadrilaterals and hexahedra. (Website, go to Section Reference, Abaqus Theory Guide, Section 3
Elements, Section 3.2 Continuum elements, then Section 3.2.4.). 9
43. Ghaboussi, J., Garrett, J., Wu, X. (1990). Material Modeling with Neural Networks. In Pande, G. N.,
Middleton, J. (editors), Numerical Methods in Engineering: Theory and Applications, Vol. 2. 3rd International
Conf. on Numerical Methods in Engineering: Theory and Applications (NUMETA 90), Univ. Coll.
Swansea, Swansea, Wales, Jan 07-11, 1990. 9
44. Chen, C. (1989). Applying and validating neural network technology for nondestructive evaluation of
materials. In 1989 IEEE International Conference on Systems, Man, and Cybernetics, Vols 1-3: Con-
ference Proceedings. 1989 IEEE International Conf. on Systems, Man, and Cybernetics: Decision-
Making in Large-Scale Systems, Cambridge, MA, Nov 14-17, 1989. 9
45. Sayeh, M., Viswanathan, R., Dhali, S. (1990). Neural networks for assessment of impact and stress
relief on composite materials. In Genisio, M. (editor), Sixth Annual Conference on Materials Technology: Com-
posite Technology. 6th Annual Conf. on Materials Technology: Composite Technology, Southern
Illinois Univ. Carbondale, Carbondale, IL, Apr 10-11, 1990. 9
46. Chen, C., Leclair, S. (1991). A probability neural network (pnn) estimator for improved reliability of
noisy sensor data. Journal of Reinforced Plastics and Composites, 10(4), 379–390. 9
47. Kim, Y., Choi, Y., Widemann, D., Zohdi, T. (2020). A fast and accurate physics-informed neural
network reduced order model with shallow masked autoencoder. (Sep 28). Version 2, 2020.09.28:
arXiv:2009.11990v2, 2009.11990. 9, 10, 11, 193, 194, 195, 196, 197, 198, 199, 200, 201, 203, 205,
206, 207
48. Kim, Y., Choi, Y., Widemann, D., Zohdi, T. (2020). Efficient nonlinear manifold reduced order model.
(Nov 13). arXiv:2011.07727, 2011.07727. 9, 10, 11, 193
49. Robbins, H., Monro, S. (1951b). Stochastic approximation. Annals of Mathematical Statistics, 22(2),
316. 10, 87, 93, 117
50. Nesterov, I. (1983). A method of the solution of the convex-programming problem with a speed of
convergence O(1/k²). Doklady Akademii Nauk SSSR, 269(3), 543–547. In Russian. 10, 89, 91
51. Nesterov, Y. (2018). Lectures on Convex Optimization. 2nd edition. Switzerland: Springer Nature. 10,
89, 91
52. Duchi, J., Hazan, E., Singer, Y. (2011). Adaptive Subgradient Methods for Online Learning and
Stochastic Optimization. Journal of Machine Learning Research, 12, 2121–2159. 10, 105
53. Tieleman, T., Hinton, G. (2012). Lecture 6e, rmsprop: Divide the gradient by a running average of its
recent magnitude. Youtube video, time 5:54. Lecture notes, p.29: Original website, Internet archive.
10, 108
54. Zeiler, M. D. (2012). ADADELTA: An adaptive learning rate method. (Dec 22). arXiv:1212.5701.
10, 106, 108, 109
55. Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., Recht, B. (2018). The marginal value of adaptive
gradient methods in machine learning. (May 22). arXiv:1705.08292v2. Version 1 appeared in 2017,
see also the Reviews for NIPS 2017. 10, 73, 85, 87, 90, 91, 92, 108, 110, 114, 115, 117
56. Loshchilov, I., Hutter, F. (2019). Decoupled weight decay regularization. (Jan 4). arXiv:1711.05101v3.
OpenReview. 10, 85, 87, 92, 93, 99, 106, 109, 115, 116, 117, 123
57. Bahdanau, D., Cho, K., Bengio, Y. (2015). Neural machine translation by jointly learning to align and
translate. CoRR, abs/1409.0473. arXiv:1409.0473. 11, 135, 136, 137, 138
58. Furshpan, E., Potter, D. (1957). Mechanism of nerve-impulse transmission at a crayfish synapse.
Nature, 180(4581), 342–343. 11, 222
59. Furshpan, E., Potter, D. (1959b). Slow post-synaptic potentials recorded from the giant motor fibre of
the crayfish. Journal of Physiology-London, 145(2), 326–335. 11, 222
60. Gershgorn, D. (2017). The data that transformed AI research—and possibly the world. Quartz, (Jul
26). Original website. Internet archive (blurry images). 11, 13
61. He, K., Zhang, X., Ren, S., Sun, J. (2015). Delving Deep into Rectifiers: Surpassing Human-Level
Performance on ImageNet Classification. CoRR, abs/1502.01852. arXiv:1502.01852, 1502.01852.
12, 40, 70, 206, 220
62. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., et al. (2015). ImageNet Large Scale Visual
Recognition Challenge. International Journal of Computer Vision, 115(3), 211–252. 12, 13
63. Park, E., Liu, W., Russakovsky, O., Deng, J., Li, F., et al. (2017). ImageNet Large scale visual recogni-
tion challenge (ILSVRC) 2017, Overview. ILSVRC 2017, (Jul 26). Original website Internet archive.
12, 13
64. Beckwith, W. Science’s 2021 Breakthrough: AI-powered Protein Prediction. 2022 Dec 17, Original
website. 11, 12
65. AlphaFold reveals the structure of the protein universe. DeepMind, 2022 Jul 28, Original website,
Internet archive. 12
66. Callaway, E. DeepMind’s AI predicts structures for a vast trove of proteins. 2021 Jul 21, Original
website. 12
67. Editorial (2019). The Guardian view on the future of AI: Great power, great irresponsibility. The
Guardian, (Jan 01). Original website. Internet archive. 12, 236, 237
68. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., et al. (2018). A general reinforcement
learning algorithm that masters chess, shogi, and Go through self-play. Science, 362(6419), 1140+. 12
69. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., et al. (2015). Human-level control
through deep reinforcement learning. Nature, 518(7540), 529–533. 13
70. Racaniere, S., Weber, T., Reichert, D. P., Buesing, L., Guez, A., et al. (2017). Imagination-Augmented
Agents for Deep Reinforcement Learning. In Guyon, I and Luxburg, UV and Bengio, S and Wallach, H
and Fergus, R and Vishwanathan, S and Garnett, R, editor, Advances in Neural Information Processing
Systems 30 (NIPS 2017), volume 30 of Advances in Neural Information Processing Systems. 13
71. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., et al. (2017). Mastering the
game of Go without human knowledge. Nature, 550(7676), 354+. 13
72. Cellan-Jones, R. (2017). Artificial intelligence - hype, hope and fear. BBC, (Oct 16). Original
website. Internet archive. 13
73. Campbell, M. (2018). Mastering board games. A single algorithm can learn to play three hard board
games. Science, 362(6419), 1118. 13
74. The Economist (2016). Why artificial intelligence is enjoying a renaissance. (Jul 15).
(https://goo.gl/Grkofq). 13, 54, 226
75. The Economist (2016). From not working to neural networking. (Jun 25). (https://goo.gl/z1c9pc). 13,
52, 54, 226
76. Dodge, S., Karam, L. (2017). A Study and Comparison of Human and Deep Learning Recogni-
tion Performance Under Visual Distortions. (May 6). CoRR (Computing Research Repository),
abs/1705.02498.337 arXiv:1705.02498. 13
77. Hardesty, L. (2017). Explained: Neural networks. MIT News, (Apr 14). Original website. Internet
archive. 13, 210
78. Goodfellow, I., Bengio, Y., Courville, A. (2016). Deep Learning. Cambridge, MA: The MIT Press.
14, 16, 17, 27, 32, 34, 35, 36, 37, 38, 39, 40, 44, 46, 47, 48, 49, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61,
62, 65, 67, 69, 70, 72, 73, 75, 76, 77, 78, 84, 85, 86, 87, 89, 90, 91, 92, 93, 99, 100, 102, 104, 106,
108, 109, 112, 114, 115, 126, 127, 128, 129, 131, 134, 135, 136, 197, 209, 210, 212, 213, 216, 220,
221, 222, 223, 225, 266, 267, 268, 271, 272, 275
79. Ford, K. (2018). Architects of Intelligence: The truth about AI from the people building it. Packt
Publishing. 14, 16, 221, 223, 224, 225, 235
80. Bottou, L., Curtis, F. E., Nocedal, J. (2018). Optimization Methods for Large-Scale Machine Learning.
SIAM Review, 60(2), 223–311. 14, 76, 78, 84, 85, 87, 93, 106, 108, 109
81. Khullar, D. (2019). A.I. Could Worsen Health Disparities. New York Times, (Jan 31). Original website.
14
82. Kornfield, M., Firozi, P. (2020). Artificial intelligence use is growing in the U.S. healthcare system.
Washington Post, (Feb 24). Original website. 14
83. Lee, K. (2018a). AI Superpowers: China, Silicon Valley, and the New World Order. Houghton Mifflin
Harcourt. 14
84. Lee, K. (2018b). How AI can save our humanity. TED2018, (Apr). Original website. 14
85. Dunjko, V., Briegel, H. J. (2018). Machine learning & artificial intelligence in the quantum domain: a
review of recent progress. Reports on Progress in Physics, 81(7), article no.074001. 16, 17
86. Hinton, G. E., Osindero, S., Teh, Y.-W. (2006). A fast learning algorithm for deep belief nets. Neural
Computation, 18(7), 1527–1554. 16
87. Merolla, P. A., Arthur, J. V., Alvarez-Icaza, R., Cassidy, A. S., Sawada, J., et al. (2014). A mil-
lion spiking-neuron integrated circuit with a scalable communication network and interface. Science,
345(6197), 668–673. 17
88. Esser, S. K., Merolla, P. A., Arthur, J. V., Cassidy, A. S., Appuswamy, R., et al. (2016). Convolutional
networks for fast, energy-efficient neuromorphic computing. Proceedings of the National Academy of
Sciences of the United States of America, 113(41), 11441–11446. 17
89. Warren, J., Root, P. (1963). The behavior of naturally fractured reservoirs. Society of Petroleum
Engineers Journal, 3(03), 245–255. 22
90. Ji, Y., Hall, S. A., Baud, P., Wong, T.-F. (2015). Characterization of pore structure and strain localiza-
tion in Majella limestone by X-ray computed tomography and digital image correlation. Geophysical
Journal International, 200(2), 701–719. 23, 24
91. Christensen, R. (2013). The Theory of Materials Failure. 1st edition. Oxford University Press. 22
92. Balogun, A. S., Kazemi, H., Ozkan, E., Al-Kobaisi, M., Ramirez, B. A., et al. (2007). Verification and
proper use of water-oil transfer function for dual-porosity and dual-permeability reservoirs. In SPE
Middle East Oil and Gas Show and Conference. Society of Petroleum Engineers. 22
337 In a rather cryptic manner to outsiders, several computer-science papers refer to papers in the Computing Re-
search Repository (CoRR) such as, e.g., “CoRR abs/1706.03762v5”, which means that the abstract of paper
number “1706.03762v5” (version 5) can be accessed by prepending to “abs/1706.03762v5” the CoRR web ad-
dress https://arxiv.org/ to form https://arxiv.org/abs/1706.03762v5, which can also be obtained via a web
search of “abs/1706.03762v5”, and where the PDF of the paper can be downloaded. An equivalent reference is
“arXiv preprint arXiv:1706.03762v5”, which may be clearer since more non-computer-science readers would have
heard of the arXiv rather than the CoRR. Papers such as [31] use both types of references, which are also used
in the present review paper so readers become familiar with both. To refer to the specific version 5, use “CoRR
abs/1706.03762v5”; to refer to the latest version (which may be different from version 5), remove “v5” to use only
“CoRR abs/1706.03762”.
93. Ho, C. K. (2000). Dual porosity vs. dual permeability models of matrix diffusion in fractured
rock. Technical report. International High-Level Radioactive Waste Conference, Las Vegas, NV
(US), 04/29/2001-05/03/2001. Sandia National Laboratories, Albuquerque, NM (US), Report No.
SAND2000-2336C. Office of Scientific & Technical Information Report Number 763324. PDF
archived at the International Atomic Energy Agency. 22, 23
94. Datta-Gupta, A., King, M. J. (2007). Streamline simulation: Theory and practice, volume 11. Society
of Petroleum Engineers Richardson. 22, 23, 24
95. Croizé, D., Renard, F., Gratier, J.-P. (2013). Chapter 3 - compaction and porosity reduction in
carbonates: A review of observations, theory, and experiments. In R. Dmowska, editor, Advances in
Geophysics, volume 54 of Advances in Geophysics. Elsevier, 181 – 238. 23, 24
96. Lu, J., Qu, J., Rahman, M. M. (2019). A new dual-permeability model for naturally fractured reser-
voirs. Special Topics & Reviews in Porous Media: An International Journal, 10(5). 23
97. Gers, F. A., Schmidhuber, J. (2000). Recurrent nets that time and count. In Proceedings of the IEEE-
INNS-ENNS International Joint Conference on Neural Networks. IEEE. 24
98. Santamarina, J. C. (2003). Soil behavior at the microscale: particle forces. In Soil behavior and soft
ground construction. 25–56. Proc. of the Symposium in honor of Charles C. Ladd, October 2001, MIT.
27
99. Alam, M. F., Haque, A., Ranjith, P. G. (2018). A study of the particle-level fabric and morphology
of granular soils under one-dimensional compression using insitu x-ray ct imaging. Materials, 11(6),
919. 26
100. Karatza, Z., Andò, E., Papanicolopulos, S.-A., Viggiani, G., Ooi, J. Y. (2019). Effect of particle
morphology and contacts on particle breakage in a granular assembly studied using x-ray tomography.
Granular Matter, 21(3), 44. 26
101. Shire, T., O’Sullivan, C., Hanley, K., Fannin, R. J. (2014). Fabric and effective stress distribution
in internally unstable soils. Journal of Geotechnical and Geoenvironmental Engineering, 140(12),
04014072. 26
102. Kanatani, K.-I. (1984). Distribution of directional data and fabric tensors. International journal of
engineering science, 22(2), 149–164. 26, 174
103. Fu, P., Dafalias, Y. F. (2015). Relationship between void-and contact normal-based fabric tensors for
2d idealized granular materials. International Journal of Solids and Structures, 63, 68–81. 26
104. Graves, A., Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and
other neural network architectures. Neural Networks, 18(5–6), 602–610. 29
105. Graham, J., Kanov, K., Yang, X., Lee, M., Malaya, N., et al. (2016). A web services accessible
database of turbulent channel flow and its use for testing a new integral wall model for les. Journal of
Turbulence, 17(2), 181–215. 29, 188
106. Rossant, C., Goodman, D. F. M., Fontaine, B., Platkiewicz, J., Magnusson, A. K., et al. (2011). Fitting
neuron models to spike trains. Front. Neurosci., Feb 23. 31
107. Brillouin, L. (1964). Tensors in Mechanics and Elasticity. New York: Academic Press. 32, 34
108. Misner, C., Thorne, K., Wheeler, J. (1973). Gravitation. New York: W.H. Freeman and Company. 32
109. Malvern, L. (1969). Introduction to the Mechanics of a Continuous Medium. Englewood Cliffs, New
Jersey: Prentice Hall. 34
110. Marsden, J., Hughes, T. (1994). Mathematical Foundation of Elasticity. New York: Dover. 34
111. Vu-Quoc, L., Li, S. (1995). Dynamics of sliding geometrically-exact beams - large-angle maneuver
and parametric resonance. Computer Methods in Applied Mechanics and Engineering, 120(1-2), 65–
118. 34
112. Werbos, P. (1988). Backpropagation: Past and future. IEEE 1988 International Conference on Neural
132. Li, H., Xu, Z., Taylor, G., Studer, C., Goldstein, T. (2018). Visualizing the loss landscape of neural
nets. (Nov 7). arXiv:1712.09913v3. 71
133. Geman, S., Bienenstock, E., Doursat, R. (1992). Neural networks and the bias/variance dilemma.
Neural computation, 4(1), 1–58. pdf, pdf. 72, 74
134. Hastie, T., Tibshirani, R., Friedman, J. H. (2001). The elements of statistical learning: Data mining,
inference, prediction. 1st edition. Springer. 2nd edition, corrected, 12 printing, 2017 Jan 13. 72
135. Prechelt, L. (1998). Early Stopping — But When? In G. Orr, K. Müller (editors), Neural Networks: Tricks of
the Trade. Springer. LNCS State-of-the-Art Survey. Paper pdf, Internet archive. 73, 74, 75
136. Belkin, M., Hsu, D., Ma, S., Mandal, S. (2019). Reconciling modern machine-learning practice and the
classical bias–variance trade-off. Proceedings of the National Academy of Sciences, 116(32), 15849–
15854. Original website, arXiv:1812.11118. 75, 76, 77
137. Geiger, M., Jacot, A., Spigler, S., Gabriel, F., Sagun, L., et al. (2020). Scaling description of gener-
alization with number of parameters in deep learning. Journal of Statistical Mechanics: Theory and
Experiment, 2020(2), 023401. Original website, arXiv:1901.01608. 75, 77
138. Sampaio, P. R. (2020). Deft-funnel: an open-source global optimization solver for constrained grey-
box and black-box problems. (Jan 2020). arXiv:1912.12637. 76
139. Polak, E. (1971). Computational Methods in Optimization: A Unified Approach. Academic Press. 78,
79, 80, 81, 82, 83, 84, 90, 121
140. Lewis, R., Torczon, V., Trosset, M. (2000). Direct search methods: then and now. Journal of Compu-
tational and Applied Mathematics, 124(1-2), 191–207. 78, 84
141. Kolda, T., Lewis, R., Torczon, V. (2003). Optimization by direct search: New perspectives on some
classical and modern methods. SIAM Review, 45(3), 385–482. 78, 84
142. Kafka, D., Wilke, D. (2018). Gradient-only line searches: An alternative to probabilistic line searches.
(Mar 22). arXiv:1903.09383. 78, 125
143. Mahsereci, M., Hennig, P. (2017). Probabilistic line searches for stochastic optimization. Jour-
nal of Machine Learning Research, 18. Article No.1. Also, CoRR, abs/1703.10034v2, Jun 30.
arXiv:1703.10034v2, 1703.10034. 78, 84, 85, 123
144. Paquette, C., Scheinberg, K. (2018). A stochastic line search method with convergence rate analysis.
(Jul 20). arXiv:1807.07994v1. 78, 81, 82, 83, 85, 87, 117, 119, 120, 121
145. Bergou, E., Diouane, Y., Kungurtsev, V., Royer, C. W. (2018). A subsampling line-search method with
second-order results. (Nov 21). arXiv:1810.07211v2. 78, 81, 82, 83, 85, 87, 117, 120, 121, 122, 123,
124
146. Wills, A., Schön, T. (2018). Stochastic quasi-newton with adaptive step lengths for large-scale
problems. (Feb 22). arXiv:1802.04310v1. 78, 120, 123
147. Mahsereci, M., Hennig, P. (2015). Probabilistic line searches for stochastic optimization. CoRR, (Feb
10). Abs/1502.02846. arXiv:1502.02846. 78, 84
148. Luenberger, D., Ye, Y. (2016). Linear and Nonlinear Programming. 4th edition. Springer. 79, 81, 90
149. Polak, E. (1997). Optimization: Algorithms and Consistent Approximations. Springer Verlag. 79, 80,
81, 84, 90
150. Goldstein, A. (1965). On steepest descent. SIAM Journal of Control, Series A, 3(1), 147–151. 79, 80,
81, 84, 85
151. Armijo, L. (1966). Minimization of functions having lipschitz continuous partial derivatives. Pacific
Journal of Mathematics, 16(1), 1–3. 79, 80, 81, 85, 121
152. Wolfe, P. (1969). Convergence conditions for ascent methods. SIAM Review, 11(2), 226–235. 79, 81,
84, 85
153. Wolfe, P. (1971). Convergence conditions for ascent methods. II: Some corrections. SIAM Review,
13(2), 185–188. 79, 84
154. Goldstein, A. (1967). Constructive Real Analysis. New York: Harper. 79, 84
155. Goldstein, A., Price, J. (1967). An effective algorithm for minimization. Numerische Mathematik, 10,
184–189. 79, 80, 81
156. Ortega, J., Rheinboldt, W. (1970). Iterative Solution of Nonlinear Equations in Several Variables. New
York: Academic Press. Republished in 2000 by SIAM, Classics in Applied Mathematics, Vol.30. 79,
80, 81, 90
157. Nocedal, J., Wright, S. (2006). Numerical Optimization. Springer. 2nd edition. 81, 90
158. Bollapragada, R., Byrd, R. H., Nocedal, J. (2019). Exact and inexact subsampled Newton methods for
optimization. IMA Journal of Numerical Analysis, 39(2), 545–578. 81
159. Berahas, A. S., Byrd, R. H., Nocedal, J. (2019). Derivative-free optimization of noisy functions via
quasi-newton methods. SIAM Journal on Optimization, 29(2), 965–993. 81
160. Larson, J., Menickelly, M., Wild, S. M. (2019). Derivative-free optimization methods. (Jun 25).
arXiv:1904.11585v2. 81
161. Shi, Z., Shen, J. (2005). Step-size estimation for unconstrained optimization methods. Computational
and Applied Mathematics, 24(3), 399–416. 84
162. Sun, S., Cao, Z., Zhu, H., Zhao, J. (2019). A survey of optimization methods from a machine learning
perspective. (Oct 23). arXiv:1906.06821v2. 85, 106, 108, 109
163. Kirkpatrick, S., Gelatt, C., Vecchi, M. (1983). Optimization by simulated annealing. Science,
220(4598), 671–680. 85, 96, 99
164. Smith, S. L., Kindermans, P.-J., Ying, C., Le, Q. V. (2018). Don’t decay the learning rate, increase the
batch size. (Feb 2018). arXiv:1711.00489v2. OpenReview. 85, 93, 95, 96, 97, 98, 114
165. Schraudolph, N. (1998). Centering Neural Network Gradient Factors. In G. Orr, K. Muller, Neural
Networks: Tricks of the Trade. Springer. LNCS State-of-the-Art Survey. 85, 90, 107
166. Neuneier, R., Zimmermann, H. (1998). How to Train Neural Networks. In G. Orr, K. Muller, Neural
Networks: Tricks of the Trade. Springer. LNCS State-of-the-Art Survey. 85, 107
167. Robbins, H., Monro, S. (1951a). A stochastic approximation method. Annals of Mathematical Statis-
tics, 22(3), 400–407. 85
168. Aitchison, L. (2019). Bayesian filtering unifies adaptive and non-adaptive neural network optimization
methods. (Jul 31). arXiv:1807.07540v4. 87, 107, 117, 118
169. Goudou, X., Munier, J. (2009). The gradient and heavy ball with friction dynamical systems: The qua-
siconvex case. Mathematical Programming, 116(1-2), 173–191. 7th French-Latin American Congress
in Applied Mathematics, Univ Chile, Santiago, CHILE, JAN, 2005. 89, 91
170. Kingma, D. P., Ba, J. (2014). Adam: A method for stochastic optimization. (Dec 22). Version 1,
2014.12.22: arXiv:1412.6980v1. Version 9, 2017.01.30: arXiv:1412.6980v9. 90, 102, 105, 106, 110,
111, 112, 113
171. Bertsekas, D., Tsitsiklis, J. (1995). Neuro-Dynamic Programming. Athena Scientific. 90, 91
172. Hinton, G. (2012). A Practical Guide to Training Restricted Boltzmann Machines. In G. Montavon,
G. Orr, K. Muller, Neural Networks: Tricks of the Trade. Springer. LNCS State-of-the-Art Survey.
90
173. Incerti, S., Parisi, V., Zirilli, F. (1979). New method for solving non-linear simultaneous equations.
SIAM Journal on Numerical Analysis, 16(5), 779–789. 90
174. Voigt, R. (1971). Rates of convergence for a class of iterative procedures. SIAM Journal on Numerical
Analysis, 8(1), 127–134. 90
175. Plaut, D. C., Nowlan, S. J., Hinton, G. E. (1986). Experiments on learning by back propagation.
Technical Report Technical Report CMU-CS-86-126, June. Website. 90, 99
176. Jacobs, R. (1988). Increased rates of convergence through learning rate adaptation. Neural Networks,
1(4), 295–307. 90
199. Mirjalili, S., Dong, J. S., Lewis, A. (2020). Nature-Inspired Optimizers. Springer. 99
200. Smith, L. N., Topin, N. (2018). Super-convergence: Very fast training of residual networks using large
learning rates. (May 2018). arXiv:1708.07120v3. OpenReview. 99
201. Rögnvaldsson, T. S. (1998). A Simple Trick for Estimating the Weight Decay Parameter. In G. Orr,
K. Muller, Neural Networks: Tricks of the Trade. Springer. LNCS State-of-the-Art Survey. 99
202. Glorot, X., Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural net-
works. In Proceedings of the thirteenth international conference on artificial intelligence and statistics.
JMLR Workshop and Conference Proceedings. 100
203. Bock, S., Goppold, J., Weiss, M. (2018). An improvement of the convergence proof of the ADAM-
optimizer. (Apr 27). arXiv:1804.10587v1. 102, 112, 113
204. Huang, H., Wang, C., Dong, B. (2019). Nostalgic Adam: Weighting more of the past gradients when
designing the adaptive learning rate. (Feb 23). arXiv:1805.07557v2. 102, 113
205. Chen, X., Liu, S., Sun, R., Hong, M. (2019). On the convergence of a class of Adam-type algorithms
for non-convex optimization. (Mar 10). arXiv:1808.02941v2. OpenReview. 106
206. Hyndman, R. J., Koehler, A. B., Ord, J. K., Snyder, R. D. (2008). Forecasting with Exponential
Smoothing: A state space approach. Springer. 106
207. Hyndman, R. J., Athanasopoulos, G. (2018). Forecasting: Principles and Practice. 2nd edition.
OTexts: Melbourne, Australia. Original website, open online text. 107, 108
208. Dreiseitl, S., Ohno-Machado, L. (2002). Logistic regression and artificial neural network classification
models: a methodology review. Journal of Biomedical Informatics, 35, 352–359. 112
209. Gugger, S., Howard, J. (2018). AdamW and Super-convergence is now the fastest way to train neural
nets. Fast.AI, (Jul 02). Original website, Internet Archive. 113, 117
210. Xing, C., Arpit, D., Tsirigotis, C., Bengio, Y. (2018). A walk with sgd. (May 2018).
arXiv:1802.08770v4. OpenReview. 114
211. Prokhorov, D. (2001). IJCNN 2001 neural network competition. Slide presentation in IJCNN’01, Ford
Research Laboratory, 2001. Internet Archive. 123
212. Chang, C.-C., Lin, C.-J. (2011). LIBSVM: A Library for Support Vector Machines. ACM Transactions
on Intelligent Systems and Technology, 2(3). Article 27, April 2011. Original website for software
(Version 3.24 released 2019.09.11), Internet Archive. 123
213. Brogan, W. L. (1990). Modern Control Theory. 3rd edition. Pearson. 125
214. Hopfield, J. J. (1984). Neurons with graded response have collective computational properties like
those of two-state neurons. Proceedings of the National Academy of Sciences, 81(10), 3088–3092.
Original website. 125
215. Pineda, F. J. (1987). Generalization of back-propagation to recurrent neural networks. Physical Review
Letters, 59(19), 2229–2232. 125
216. Newmark, N. M. (1959). A method of computation for structural dynamics. Journal of the Engineering
Mechanics Division, 85(3), 67–94. American Society of Civil Engineers. 126
217. Hilber, H. M., Hughes, T. J., Taylor, R. L. (1977). Improved numerical dissipation for time integration
algorithms in structural dynamics. Earthquake Engineering & Structural Dynamics, 5(3), 283–292.
Original website. 126
218. Chung, J., Hulbert, G. M. (1993). A Time Integration Algorithm for Structural Dynamics With Im-
proved Numerical Dissipation: The Generalized-α Method. Journal of Applied Mechanics, 60(2), 371.
Original website. 126
219. Olah, C. (2015). Understanding LSTM Networks. colah’s blog, (Aug 27). Original website. Internet
archive. 131
220. Cho, K., van Merrienboer, B., Bahdanau, D., Bengio, Y. (2014). On the properties of neural machine
translation: Encoder-decoder approaches. arXiv:1409.1259. 134
221. Chung, J., Gulcehre, C., Cho, K., Bengio, Y. (2014). Empirical evaluation of gated recurrent neural
networks on sequence modeling. arXiv:1412.3555. 134, 135
222. Kim, Y., Denton, C., Hoang, L., Rush, A. M. (2017). Structured attention networks. International
Conference on Learning Representations, OpenReview.net, arXiv:1702.00887. 135
223. Cho, K., van Merriënboer, B., Bahdanau, D., Bengio, Y. (2014). On the properties of neural ma-
chine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax,
Semantics and Structure in Statistical Translation. Doha, Qatar: Association for Computational Lin-
guistics. arXiv:1409.1259. 136
224. Schuster, M., Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE transactions on
Signal Processing, 45(11), 2673–2681. 137
225. Ba, J. L., Kiros, J. R., Hinton, G. E. (2016). Layer normalization. arXiv:1607.06450. 141
226. Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., et al. (2020). Language models are
few-shot learners. arXiv:2005.14165v4. 143
227. Tsai, Y.-H. H., Bai, S., Yamada, M., Morency, L.-P., Salakhutdinov, R. (2019). Transformer dissection:
A unified understanding of transformer’s attention via the lens of kernel. arXiv:1908.11775. 143
228. Rodriguez-Torrado, R., Ruiz, P., Cueto-Felgueroso, L., Green, M. C., Friesen, T., et al. (2022).
Physics-informed attention-based neural network for hyperbolic partial differential equations: appli-
cation to the buckley–leverett problem. Scientific reports, 12(1), 1–12. Original website. 143, 162
229. Bahri, Y. (2019). Towards an Understanding of Wide, Deep Neural Networks. Youtube. 143
230. Ananthaswamy, A. (2021). A New Link to an Old Model Could Crack the Mystery of Deep Learning.
Quanta Magazine, (Oct 11). Original website. 143
231. Lee, J., Bahri, Y., Novak, R., Schoenholz, S. S., Pennington, J., et al. (2018). Deep neural networks as
gaussian processes. arXiv:1711.00165. 143, 237
232. Jacot, A., Gabriel, F., Hongler, C. (2018). Neural tangent kernel: Convergence and generalization in
neural networks. arXiv:1806.07572. 143, 144, 162
233. 2021’s Biggest Breakthroughs in Math and Computer Science. Quanta Magazine, 2021 Dec 31.
Youtube. 143, 236
234. Rasmussen, C. E., Williams, C. K. (2006). Gaussian processes for machine learning. MIT Press,
Cambridge, MA. MIT website, GaussianProcess.org. 143, 147, 148, 149, 151, 271
235. Belkin, M., Ma, S., Mandal, S. (2018). To understand deep learning we need to understand kernel
learning. arXiv:1802.01396. 144, 147
236. Lee, J., Schoenholz, S. S., Pennington, J., Adlam, B., Xiao, L., et al. (2020). Finite versus infinite
neural networks: an empirical study. arXiv:2007.15801. 144
237. Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American mathematical
society, 68(3), 337–404. 144, 147
238. Hastie, T., Tibshirani, R., Friedman, J. H. (2017). The elements of statistical learning: Data
mining, inference, and prediction. 2nd edition. Springer. Corrected 12th printing, 2017 Jan 13. 145, 146, 147
239. Evgeniou, T., Pontil, M., Poggio, T. (2000). Regularization networks and support vector machines.
Advances in computational mathematics, 13(1), 1–50. Semantic Scholar. 145, 146, 147
240. Berlinet, A., Thomas-Agnan, C. (2004). Reproducing kernel Hilbert spaces in probability and statis-
tics. New York: Springer Science & Business Media. 146, 147, 148
241. Girosi, F. (1998). An equivalence between sparse approximation and support vector machines. Neural
computation, 10(6), 1455–1480. Original website, Semantic Scholar. 146, 147
242. Wahba, G. (1990). Spline Models for Observational Data. Philadelphia, Pennsylvania: SIAM. 4th
printing 2002. 147
243. Adler, B. (2021). Hilbert spaces and the Riesz Representation Theorem. The University of Chicago
Mathematics REU 2021, Original website, Internet archive. 147
244. Schaback, R., Wendland, H. (2006). Kernel techniques: From machine learning to meshless methods.
Acta numerica, 15, 543–639. 147
245. Yaida, S. (2020). Non-gaussian processes and neural networks at finite widths. In Mathematical and
Scientific Machine Learning, Proceedings of Machine Learning Research. PMLR site. 148
246. Sendera, M., Tabor, J., Nowak, A., Bedychaj, A., Patacchiola, M., et al. (2021). Non-gaussian gaussian
processes for few-shot regression. Advances in Neural Information Processing Systems, 34, 10285–
10298. arXiv:2110.13561. 148
247. Duvenaud, D. (2014). Automatic model construction with Gaussian processes. Ph.D. thesis, University
of Cambridge. PhD dissertation. Thesis repository, CC BY-SA 2.0 UK. 151, 152
248. von Mises, R. (1964). Mathematical theory of probability and statistics. Elsevier. Book site. 151, 271
249. Hale, J. (2018). Deep Learning Framework Power Scores 2018. Towards Data Science, (Sep 19).
Original website. Internet archive. 152, 154
250. Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., et al. (2015). TensorFlow: Large-scale
machine learning on heterogeneous systems. Whitepaper pdf, Software available from tensorflow.org.
154
251. Jouppi, N. Google supercharges machine learning tasks with TPU custom chip. Original website. 154
252. Chollet, F., et al. (2015). Keras. Original website. 155
253. Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., et al. (2019). Pytorch: An imperative style,
high-performance deep learning library. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc,
E. Fox, R. Garnett, editors, Advances in Neural Information Processing Systems 32. Curran Associates,
Inc., 8024–8035. Paper pdf. 155
254. Chintala, S. (2022). Decisions and pivots on pytorch. 2022 Jan 19, Original website, Internet archive.
155
255. PyTorch Turns 5! 2022 Jan 20, Youtube. 155
256. Kaelbling, L. P., Littman, M. L., Moore, A. W. (1996). Reinforcement learning: A survey. Journal of
Artificial Intelligence Research, 4, 237–285. 155
257. Arulkumaran, K., Deisenroth, M. P., Brundage, M., Bharath, A. A. (2017). Deep reinforcement learn-
ing: A brief survey. IEEE Signal Processing Magazine, 34, 26–38. 155
258. Sünderhauf, N., Brock, O., Scheirer, W., Hadsell, R., Fox, D., et al. (2018). The limits and potentials
of deep learning for robotics. The International Journal of Robotics Research, 37, 405–420. 156
259. Simo, J. C., Vu-Quoc, L. (1988). On the dynamics in space of rods undergoing large motions–a
geometrically exact approach. Computer Methods in Applied Mechanics and Engineering, 66, 125–
161. 156
260. Humer, A. (2013). Dynamic modeling of beams with non-material, deformation-dependent boundary
conditions. Journal of sound and vibration, 332(3), 622–641. 156
261. Steinbrecher, I., Humer, A., Vu-Quoc, L. (2017). On the numerical modeling of sliding beams: A
comparison of different approaches. Journal of Sound and Vibration, 408, 270–290. 156
262. Humer, A., Steinbrecher, I., Vu-Quoc, L. (2020). General sliding-beam formulation: A non-material
description for analysis of sliding structures and axially moving beams. Journal of Sound and Vibra-
tion, 480, 115341. Original website. 156
263. Bradbury, J., Frostig, R., Hawkins, P., Johnson, M. J., Leary, C., et al. (2018). JAX: composable
transformations of Python+NumPy programs. Original website. 156
264. Heek, J., Levskaya, A., Oliver, A., Ritter, M., Rondepierre, B., et al. (2020). Flax: A neural network
library and ecosystem for JAX. Original website. 156
265. Schoeberl, J. (2014). C++11 Implementation of Finite Elements in NGSolve. Scientific report. 157,
158
266. Weitzhofer, S., Humer, A. (2021). Machine-Learning Frameworks in Scientific Computing: Finite
Element Analysis and Multibody Simulation. Talk slides, Video talk. 157
267. Lavin, A., Zenil, H., Paige, B., Krakauer, D., Gottschlich, J., et al. (2021). Simulation Intelligence:
Towards a New Generation of Scientific Methods. arXiv:2112.03235. 157
268. Cai, S., Mao, Z., Wang, Z., Yin, M., Karniadakis, G. E. (2021). Physics-informed neural networks
(PINNs) for fluid mechanics: A review. Acta Mechanica Sinica, 37(12), 1727–1738. Original website,
arXiv:2105.09506. 157, 158, 159
269. Cuomo, S., di Cola, V. S., Giampaolo, F., Rozza, G., Raissi, M., et al. (2022). Scientific Machine
Learning through Physics-Informed Neural Networks: Where we are and What’s next. Journal of
Scientific Computing, 92(3). Article No. 88, Original website, arXiv:2201.05624. 158, 159, 160
270. Karniadakis, G. E., Kevrekidis, I. G., Lu, L., Perdikaris, P., Wang, S., et al. (2021). Physics-informed
machine learning. Nature Reviews Physics, 3(6), 422–440. Original website. 158, 159, 160
271. Lu, L., Meng, X., Mao, Z., Karniadakis, G. E. (2021). DeepXDE: A deep learning library for solving
differential equations. SIAM Review, 63(1), 208–228. Original website, pdf, arXiv:1907.04502. 158,
160
272. Hennigh, O., Narasimhan, S., Nabian, M. A., Subramaniam, A., Tangsali, K., et al. (2020). NVIDIA
SimNet (tm): an AI-accelerated multi-physics simulation framework. arXiv:2012.07938. The software
name “SimNet” has been changed to “Modulus”; see NVIDIA Modulus. 160
273. Koryagin, A., Khudorozkov, R., Tsimfer, S. (2019). PyDEns: a Python Framework for Solving Dif-
ferential Equations with Neural Networks. arXiv:1909.11544. 160
274. Chen, F., Sondak, D., Protopapas, P., Mattheakis, M., Liu, S., et al. (2020). NeuroDiffEq: A python
package for solving differential equations with neural networks. Journal of Open Source Software,
5(46), 1931. Original website. 159, 160
275. Rackauckas, C., Nie, Q. (2017). DifferentialEquations.jl – a performant and feature-rich ecosystem for
solving differential equations in Julia. Journal of Open Research Software, 5(1). Original website.
159, 160
276. Haghighat, E., Juanes, R. (2021). Sciann: A keras/tensorflow wrapper for scientific computations
and physics-informed deep learning using artificial neural networks. Computer Methods in Applied
Mechanics and Engineering, 373, 113552. 160
277. Xu, K., Darve, E. (2020). ADCME: Learning Spatially-varying Physical Fields using Deep Neural
Networks. arXiv:2011.11955. 160
278. Gardner, J. R., Pleiss, G., Bindel, D., Weinberger, K. Q., Wilson, A. G. (2018). Gpytorch:
Blackbox matrix-matrix gaussian process inference with gpu acceleration. (v6, 2021 Jun 29).
arXiv:1809.11165. 160
279. Schoenholz, S. S., Novak, R. Fast and Easy Infinitely Wide Networks with Neural Tangents. Google
AI Blog, 2020 Mar 13, Original website. 160, 237
280. He, J., Li, L., Xu, J., Zheng, C. (2020). ReLU deep neural networks and linear finite elements. Journal
of Computational Mathematics, 38(3), 502–527. arXiv:1807.03973. 160, 162
281. Arora, R., Basu, A., Mianjy, P., Mukherjee, A. (2016). Understanding deep neural networks with
rectified linear units. arXiv:1611.01491. 160
282. Raissi, M., Perdikaris, P., Karniadakis, G. E. (2019). Physics-informed neural networks: A deep
learning framework for solving forward and inverse problems involving nonlinear partial differential
equations. Journal of Computational physics, 378, 686–707. Original website. 161, 162
283. Kharazmi, E., Zhang, Z., Karniadakis, G. E. (2019). Variational physics-informed neural networks for
solving partial differential equations. arXiv:1912.00873. 161, 162
284. Kharazmi, E., Zhang, Z., Karniadakis, G. E. (2021). hp-vpinns: Variational physics-informed neural
networks with domain decomposition. Computer Methods in Applied Mechanics and Engineering,
374, 113547. See also arXiv:1912.00873. 161, 162
285. Berrone, S., Canuto, C., Pintore, M. (2022). Variational physics informed neural networks: the role of
quadratures and test functions. Journal of Scientific Computing, 92(3), 1–27. Original website. 161
286. Wang, S., Yu, X., Perdikaris, P. (2020). When and why pinns fail to train: A neural tangent kernel
perspective. arXiv:2007.14527. 162
287. Rohrhofer, F. M., Posch, S., Gössnitzer, C., Geiger, B. C. (2022). Understanding the difficulty of
training physics-informed neural networks on dynamical systems. arXiv:2203.13648. 162
288. Erichson, N. B., Muehlebach, M., Mahoney, M. W. (2019). Physics-informed Autoencoders for
Lyapunov-stable Fluid Flow Prediction. arXiv:1905.10866. 162
289. Raissi, M., Perdikaris, P., Karniadakis, G. E. (2021). Physics informed learning machine. US Patent
10,963,540, Mar 30. Google Patents, pdf. 162, 163
290. Lagaris, I. E., Likas, A., Fotiadis, D. I. (1998). Artificial neural networks for solving ordinary and
partial differential equations. IEEE transactions on neural networks, 9(5), 987–1000. Original website.
162
291. Lagaris, I. E., Likas, A. C., Papageorgiou, D. G. (2000). Neural-network methods for boundary value
problems with irregular boundaries. IEEE Transactions on Neural Networks, 11(5), 1041–1049. Orig-
inal website. 162
292. Raissi, M., Perdikaris, P., Karniadakis, G. E. (2017). Physics Informed Deep Learning (Part I): Data-
driven Solutions of Nonlinear Partial Differential Equations. arXiv:1711.10561. 162
293. Raissi, M., Perdikaris, P., Karniadakis, G. E. (2017). Physics Informed Deep Learning (Part II): Data-
driven Discovery of Nonlinear Partial Differential Equations. arXiv:1711.10566. 162
294. Gupta, S., Agrawal, A., Gopalakrishnan, K., Narayanan, P. (2015). Deep learning with limited numer-
ical precision. In International Conference on Machine Learning. arXiv:1502.02551. 171
295. Courbariaux, M., Hubara, I., Soudry, D., El-Yaniv, R., Bengio, Y. (2016). Binarized neural networks:
Training deep neural networks with weights and activations constrained to +1 or −1. arXiv:1602.02830.
171
296. De Sa, C., Feldman, M., Ré, C., Olukotun, K. (2017). Understanding and optimizing asynchronous
low-precision stochastic gradient descent. In Proceedings of the 44th Annual International Symposium
on Computer Architecture. https://dl.acm.org/doi/abs/10.1145/3079856.3080248. 171
297. Borja, R. I. (2000). A finite element model for strain localization analysis of strongly discontinu-
ous fields based on standard galerkin approximation. Computer Methods in Applied Mechanics and
Engineering, 190(11-12), 1529–1549. 179, 182, 183, 184
298. Sibson, R. H. (1985). A note on fault reactivation. Journal of Structural Geology, 7(6), 751–754. 180
299. Passelègue, F. X., Brantut, N., Mitchell, T. M. (2018). Fault reactivation by fluid injection: Controls
from stress state and injection rate. Geophysical Research Letters, 45(23), 12,837–12,846. 181
300. Kuchment, A. (2019). Even if injection of fracking wastewater stops, quakes won’t. Scientific Ameri-
can. Sep 9. 181
301. Park, K., Paulino, G. H. (2011). Cohesive zone models: a critical review of traction-separation rela-
tionships across fracture surfaces. Applied Mechanics Reviews, 64(6). 181
302. Zhang, X., Vu-Quoc, L. (2007). An accurate elasto-plastic frictional tangential force–displacement
model for granular-flow simulations: Displacement-driven formulation. Journal of Computational
Physics, 225(1), 730–752. 181, 184
303. Vu-Quoc, L., Zhang, X. (1999). An accurate and efficient tangential force–displacement model for
elastic frictional contact in particle-flow simulations. Mechanics of materials, 31(4), 235–269. 184
304. Vu-Quoc, L., Zhang, X., Lesburg, L. (2001). Normal and tangential force–displacement relations for
frictional elasto-plastic contact of spheres. International journal of solids and structures, 38(36-37),
6455–6489. 184
305. Haghighat, E., Raissi, M., Moure, A., Gomez, H., Juanes, R. (2021). A physics-informed deep learning
framework for inversion and surrogate modeling in solid mechanics. Computer Methods in Applied
Mechanics and Engineering, 379, 113741. 184
306. Zhai, Y., Vu-Quoc, L. (2007). Analysis of power magnetic components with nonlinear static hysteresis:
Proper orthogonal decomposition and model reduction. IEEE Transactions on Magnetics, 43(5), 1888–
1897. 184, 188, 192
307. Benner, P., Gugercin, S., Willcox, K. (2015). A Survey of Projection-Based Model Reduction Methods
for Parametric Dynamical Systems. SIAM Review, 57(4), 483–531. 193, 195, 203
308. Greif, C., Urban, K. (2019). Decay of the Kolmogorov N-width for wave problems. Applied Mathe-
matics Letters, 96, 216–222. Original website. 193
309. Craig, R. R., Bampton, M. C. C. (1968). Coupling of substructures for dynamic analyses. AIAA
Journal, 6(7), 1313–1319. 195
310. Chaturantabut, S., Sorensen, D. C. (2010). Nonlinear Model Reduction via Discrete Empirical Inter-
polation. SIAM Journal on Scientific Computing, 32(5), 2737–2764. 199, 200
311. Carlberg, K., Bou-Mosleh, C., Farhat, C. (2011). Efficient non-linear model reduction via a least-
squares Petrov-Galerkin projection and compressive tensor approximations. International Journal for
Numerical Methods in Engineering, 86(2), 155–181. 199, 200
312. Choi, Y., Coombs, D., Anderson, R. (2020). SNS: A Solution-Based Nonlinear Subspace Method
for Time-Dependent Model Order Reduction. SIAM Journal on Scientific Computing, 42(2), A1116–
A1146. 199
313. Everson, R., Sirovich, L. (1995). Karhunen-Loève procedure for gappy data. Journal of the Optical
Society of America A, 12(8), 1657. 199
314. Carlberg, K., Farhat, C., Cortial, J., Amsallem, D. (2013). The GNAT method for nonlinear model
reduction: Effective implementation and application to computational fluid dynamics and turbulent
flows. Journal of Computational Physics, 242, 623–647. 200
315. Tiso, P., Rixen, D. J. (2013). Discrete empirical interpolation method for finite element structural
dynamics. 203
316. Brooks, A. N., Hughes, T. J. (1982). Streamline upwind/petrov-galerkin formulations for convection
dominated flows with particular emphasis on the incompressible navier-stokes equations. Computer
Methods in Applied Mechanics and Engineering, 32(1), 199–259. Original website. 204
317. Kochkov, D., Smith, J. A., Alieva, A., Wang, Q., Brenner, M. P., et al. (2021). Machine learning–
accelerated computational fluid dynamics. Proceedings of the National Academy of Sciences, 118(21),
e2101784118. 208
318. Bishara, D., Xie, Y., Liu, W. K., Li, S. (2023). A state-of-the-art review on machine learning-based
multiscale modeling, simulation, homogenization and design of materials. Archives of computational
methods in engineering, 30(1), 191–222. 209
319. Rosenblatt, F. (1960). Perceptron simulation experiments. Proceedings of the Institute of Radio Engi-
neers, 48(3), 301–309. 210
320. Block, H., Knight, B., Rosenblatt, F. (1962b). Analysis of a 4-layer series-coupled perceptron. II.
Reviews of Modern Physics, 34(1), 135–142. 210
321. Gopnik, A. (2019). The ultimate learning machines. The Wall Street Journal, Oct 11. Original website.
215, 241
322. Hodgkin, A., Huxley, A. (1952). A quantitative description of membrane current and its application to
conduction and excitation in nerve. Journal of Physiology-London, 117(4), 500–544. 218, 220
323. Dimirovski, G. M., Wang, R., Yang, B. (2017). Delay and recurrent neural networks: Computational
cybernetics of systems biology? In 2017 IEEE International Conference on Systems, Man, and Cyber-
netics (SMC). IEEE. Original website. 219
324. Gherardi, F., Souty-Grosset, C., Vogt, G., Dieguez-Uribeondo, J., Crandall, K. (2009). Infraorder
astacidea latreille, 1802 p.p.: The freshwater crayfish. In F. Schram, C. von Vaupel Klein, editors,
Treatise on Zoology - Anatomy, Taxonomy, Biology. The Crustacea, Volume 9 Part A, chapter 67.
Leiden, Netherlands: Brill, 269–423. 220
325. Han, J., Moraga, C. (1995). The Influence of the Sigmoid Function Parameters on the Speed of
Backpropagation Learning. In IWANN ’96: Proceedings of the International Workshop on Artificial
Neural Networks: From Natural to Artificial Neural Computation, Jun 07-09. Springer-Verlag. 220
326. Furshpan, E., Potter, D. (1959a). Transmission at the giant motor synapses of the crayfish. Journal of
Physiology-London, 145(2), 289–325. 221, 222
327. Bush, P., Sejnowski, T. (1995). The Cortical Neuron. Oxford University Press. 221
328. Werbos, P. (1990). Backpropagation through time - what it does and how to do it. Proceedings of the
IEEE, 78(10), 1550–1560. 223, 225
329. Baydin, A. G., Pearlmutter, B. A., Radul, A. A., Siskind, J. M. (2018). Automatic Differentiation in
Machine Learning: a Survey. Journal of Machine Learning Research, 18. 223, 225
330. Werbos, P. J., Davis, J. J. J. (2016). Regular Cycles of Forward and Backward Signal Propagation in
Prefrontal Cortex and in Consciousness. Frontiers in Systems Neuroscience, 10. 225
331. Metz, C. (2019). Turing Award Won by 3 Pioneers in Artificial Intelligence. New York Times, (Mar
27). Original website. 225
332. Topol, E. (2019). The A.I. Diet. New York Times, (Mar 02). Original website. 226
333. Laguarta, J., Hueto, F., Subirana, B. (2020). Covid-19 artificial intelligence diagnosis using only cough
recordings. IEEE Open Journal of Engineering in Medicine and Biology. 227, 228
334. Heaven, W. (2021). Hundreds of AI tools have been built to catch covid. None of them helped. MIT
Technology Review. July 30. 226, 227, 228
335. Wynants, L., Van Calster, B., Collins, G. S., Riley, R. D., Heinze, G., et al. (2021). Prediction models
for diagnosis and prognosis of covid-19: systematic review and critical appraisal. BMJ, 369. 226
336. Roberts, M., Driggs, D., Thorpe, M., Gilbey, J., Yeung, M., et al. (2021). Common pitfalls and
recommendations for using machine learning to detect and prognosticate for covid-19 using chest
radiographs and ct scans. Nature Machine Intelligence, 3(3), 199–217. 226
337. Moons, K. G., de Groot, J. A., Bouwmeester, W., Vergouwe, Y., Mallett, S., et al. (2014). Criti-
cal appraisal and data extraction for systematic reviews of prediction modelling studies: the charms
checklist. PLoS medicine, 11(10), e1001744. 226
338. Wolff, R. F., Moons, K. G., Riley, R. D., Whiting, P. F., Westwood, M., et al. (2019). Probast: a tool
to assess the risk of bias and applicability of prediction model studies. Annals of internal medicine,
170(1), 51–58. 226
339. Matei, A. (2020). An app could catch 98.5% of all Covid-19 infections. Why isn’t it available? The
Guardian, (Dec 16). Original website. 228
340. Coppock, H., Jones, L., Kiskin, I., Schuller, B. (2021). Covid-19 detection from audio: seven grains
of salt. The Lancet Digital Health, 3(9), e537–e538. 228
341. Guo, X., Zhang, Y.-D., Lu, S., Lu, Z. (2022). A Survey on Machine Learning in COVID-19 Diagnosis.
CMES-Computer Modeling in Engineering & Sciences, 130(1), 23–71. 228, 229
342. Li, W., Deng, X., Shao, H., Wang, X. (2021). Deep Learning Applications for COVID-19 Analysis: A
State-of-the-Art Survey. CMES-Computer Modeling in Engineering & Sciences, 129(1), 65–98. 228,
229
343. Xie, S., Yu, Z., Lv, Z. (2021). Multi-Disease Prediction Based on Deep Learning: A Survey. CMES-
Computer Modeling in Engineering & Sciences, 128(2), 489–522. 228, 229
344. Gong, L., Zhang, X., Zhang, L., Gao, Z. (2021). Predicting Genotype Information Related to COVID-
19 for Molecular Mechanism Based on Computational Methods. CMES-Computer Modeling in Engi-
neering & Sciences, 129(1), 31–45. 229
345. Monajjemi, M., Esmkhani, R., Mollaamin, F., Shahriari, S. (2020). Prediction of Proteins Associated
with COVID-19 Based Ligand Designing and Molecular Modeling. CMES-Computer Modeling in
Engineering & Sciences, 125(3), 907–926. 229
346. Attaallah, A., Ahmad, M., Seh, A. H., Agrawal, A., Kumar, R., et al. (2021). Estimating the Impact of
COVID-19 Pandemic on the Research Community in the Kingdom of Saudi Arabia. CMES-Computer
Modeling in Engineering & Sciences, 126(1), 419–436. 229
347. Gupta, M., Jain, R., Gupta, A., Jain, K. (2020). Real-Time Analysis of COVID-19 Pandemic on Most
Populated Countries Worldwide. CMES-Computer Modeling in Engineering & Sciences, 125(3), 943–
965. 229
348. Areepong, Y., Sunthornwat, R. (2020). Predictive Models for Cumulative Confirmed COVID-19 Cases
by Day in Southeast Asia. CMES-Computer Modeling in Engineering & Sciences, 125(3), 927–942.
229
349. Singh, A., Bajpai, M. K. (2020). SEIHCRD Model for COVID-19 Spread Scenarios, Disease Pre-
dictions and Estimates the Basic Reproduction Number, Case Fatality Rate, Hospital, and ICU Beds
Requirement. CMES-Computer Modeling in Engineering & Sciences, 125(3), 991–1031. 229
350. Akyol, K. (2020). Growing and Pruning Based Deep Neural Networks Modeling for Effective Parkin-
son’s Disease Diagnosis. CMES-Computer Modeling in Engineering & Sciences, 122(2), 619–632.
229
351. Hemalakshmi, G. R., Santhi, D., Mani, V. R. S., Geetha, A., Prakash, N. B. (2020). Deep Residual
Network Based on Image Priors for Single Image Super Resolution in FFA Images. CMES-Computer
Modeling in Engineering & Sciences, 125(1), 125–143. 229
352. Vu-Quoc, L., Zhai, Y., Ngo, K. D. T. (2021). Model reduction by generalized Falk method for efficient
field-circuit simulations. CMES-Computer Modeling in Engineering & Sciences, 129(3), 1441–1486.
DOI: 10.32604/cmes.2021.016784. 229
353. Lu, Y., Li, H., Saha, S., Mojumder, S., Al Amin, A., et al. (2021). Reduced Order Machine Learn-
ing Finite Element Methods: Concept, Implementation, and Future Applications. CMES-Computer
Modeling in Engineering & Sciences. DOI: 10.32604/cmes.2021.017719. 229
354. Deng, X., Shao, H., Hu, C., Jiang, D., Jiang, Y. (2020). Wind Power Forecasting Methods Based on
Deep Learning: A Survey. CMES-Computer Modeling in Engineering & Sciences, 122(1), 273–301.
229
355. Liu, D., Zhao, J., Xi, A., Wang, C., Huang, X., et al. (2020). Data Augmentation Technology Driven By
Image Style Transfer in Self-Driving Car Based on End-to-End Learning. CMES-Computer Modeling
in Engineering & Sciences, 122(2), 593–617. 229
356. Sethi, S., Kathuria, M., Kaushik, T. (2021). A Real-Time Integrated Face Mask Detector to Curtail
Spread of Coronavirus. CMES-Computer Modeling in Engineering & Sciences, 127(2), 389–409. 230
357. Luo, J., Li, Y., Zhou, W., Gong, Z., Zhang, Z., et al. (2021). An Improved Data-Driven Topology
Optimization Method Using Feature Pyramid Networks with Physical Constraints. CMES-Computer
Modeling in Engineering & Sciences, 128(3), 823–848. 230
358. Qu, T., Di, S., Feng, Y. T., Wang, M., Zhao, T., et al. (2021). Deep Learning Predicts Stress-Strain
Relations of Granular Materials Based on Triaxial Testing Data. CMES-Computer Modeling in Engi-
neering & Sciences, 128(1), 129–144. 230
359. Li, H., Zhang, Q., Chen, X. (2021). Deep Learning-Based Surrogate Model for Flight Load Analysis.
CMES-Computer Modeling in Engineering & Sciences, 128(2), 605–621. 230
360. Guo, D., Yang, Q., Zhang, Y.-D., Jiang, T., Yan, H. (2021). Classification of Domestic Refuse in
Medical Institutions Based on Transfer Learning and Convolutional Neural Network. CMES-Computer
Modeling in Engineering & Sciences, 127(2), 599–620. 230
361. Yang, F., Zhang, X., Zhu, Y. (2020). PDNet: A Convolutional Neural Network Has Potential to
380. Musk’s Full Self-Driving @Tesla ruthlessly mowing down a child mannequin. Dan O’Dowd, The
Dawn Project, 2022.08.15, Tweet. 231
381. Hawkins, A. J. (2022). Tesla wants videos of its cars running over child-sized dummies taken down.
The Verge, (Aug 25). Original website. 231
382. Metz, C., Koeze, E. (2022). Can Tesla Data Help Us Understand Car Crashes? New York Times, (Aug
18). Original website. 232, 233, 234, 235
383. Metz, C. (2017). A New Way for Machines to See, Taking Shape in Toronto. New York Times, (Nov
28). Original website. 230
384. Dujmovic, J. (2021). You will not be traveling in a self-driving car anytime soon. Here’s what the
future will look like. Market Watch, (June 16 - Updated June 19). Original website. 230, 232
385. Maresca, T. (2022). Hyundai’s self-driving taxis roll out on the streets of South Korea. UPI, (Jun 09).
Original website. 231
386. Kirkpatrick, K. (2022). Still Waiting for Self-Driving Cars. Communications of the ACM, (April).
Original website. 231, 232
387. Bogna, J. (2022). Is Your Car Autonomous? The 6 Levels of Self-Driving Explained. PC Magazine,
(June 14). Original website. 231
388. Boudette, N. (2019). Despite High Hopes, Self-Driving Cars Are ’Way in the Future’. New York
Times, (Jul 07). Original website. 232
389. Guinness, H. (2022). What’s going on with self-driving cars right now? Popular Science, (May 28).
Original website. 233
390. Smiley, L. (2022). ‘I’m the Operator’: The Aftermath of a Self-Driving Tragedy. WIRED, (Mar 8).
Original website. 233, 234
391. Metz, C., Griffith, E. (2022). This Was Supposed to Be the Year Driverless Cars Went Mainstream.
New York Times, (May 12 - Updated Sep 15). Original website. 234
392. Metz, C. (2022). The Costly Pursuit of Self-Driving Cars Continues On. And On. And On. New York
Times, (May 24 - Updated Sep 15). Original website. 234
393. Nims, C. (2020). Robot Boats Leave Autonomous Cars in Their Wake — Unmanned ships don’t have
to worry about crowded roads. But crossing of the Atlantic is still a challenge. Wall Street Journal,
(Aug 29). 235
394. O’Brien, M. (2022). Autonomous Mayflower reaches American shores — in Canada. ABC News, (Jun
05). Original website. 235, 236
395. Mitchell, M. (2018). Artificial intelligence hits the barrier of meaning. The New York Times. 236
396. New Survey: Americans Think AI Is a Threat to Democracy, Will Become Smarter than Humans and
Overtake Jobs, Yet Believe its Benefits Outweigh its Risks. Stevens Institute of Technology, 2021 Nov
15, Website, Internet archive. 238
397. Hu, S., Li, Y., Lyu, S. (2020). Exposing gan-generated faces using inconsistent corneal specular
highlights. arXiv:2009.11924. 238
398. Sencar, H. T., Verdoliva, L., Memon, N. (2022). Multimedia forensics. 238
399. Chesney, R., Citron, D. K. (2019). Deep Fakes: A Looming Challenge for Privacy, Democracy, and
National Security. 107 California Law Review 1753. Original website U of Texas Law, Public Law
Research Paper No. 692. U of Maryland Legal Studies Research Paper No. 2018-21. 238, 239
400. Ingram, D., Ward, J. (2019). How do you spot a deepfake? A clue hides within our voices, researchers
say. NBC News, (Dec 16). Original website. 238, 239
401. Citron, D. How deepfakes undermine truth and threaten democracy. TEDSummit 2019. Website. 239
402. Manheim, K., Kaplan, L. (2019). Artificial Intelligence: Risks to Privacy and Democracy. Yale Journal
of Law & Technology, 21, 106–188. Original website. 239
403. Hao, K. (2019). Why AI is a threat to democracy—and what we can do to stop it. MIT Technology
Review.
426. Copeland, A. (1949). A cybernetic model of memory and recognition. Bulletin of the American
Mathematical Society, 55(7), 698. 272
427. Chavalarias, D. (2020). From inert matter to the global society life as multi-level networks of pro-
cesses. Philosophical Transactions of the Royal Society B-Biological Sciences, 375(1796, SI). 272
428. Togashi, E., Miyata, M., Yamamoto, Y. (2020). The first world championship in cybernetic building
optimization. Journal of Building Performance Simulation, 13(3), 391–408. 272
429. Jube, S. (2020). Labour and international accounting standards: A question of social justice. Interna-
tional Labour Review. Early Access Date MAR 2020. 272
430. McCulloch, W., Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin
of Mathematical Biophysics, 5, 115–133. Reprinted in the Bulletin of Mathematical Biology, Vol.52,
No.1-2, pp.99-115, 1990. 272, 275
431. Kline, R. (2015). The Cybernetic Moment, or why we call our Age the Information Age. Baltimore:
Johns Hopkins University Press. 272, 273, 274, 275
432. Cariani, P. (2017). The Cybernetics Moment: or Why We Call Our Age the Information Age. Cognitive
Systems Research, 43, 119–124. 272, 273
433. Eisenhart, C. (1949). Cybernetics - A new discipline. Science, 109(2834), 397–399. 272
434. W.E.H. (1949). Book Review: Cybernetics: Or Control and Communication in the Animal and the
Machine. Quarterly Journal of Experimental Psychology, 1(4), 193–194.
https://doi.org/10.1080/17470214908416765. 273
Appendices
1 Backprop pseudocodes, notation comparison
To connect the backpropagation Algorithm 1 in Section 5 (which uses a “while” loop) to Algorithm 6.4
in [78], p.206, Section 6.5.4 on “Back-Propagation Computation in Fully Connected MLP”,^338 a different
form of Algorithm 1 is provided in Algorithm 9 (which uses a “for” loop). This information would be
especially useful for first-time learners. See also Remark 5.4.
1 Backpropagation pseudocode 2
Data:
• Input into network x = y^(0) ∈ R^{m^(0)×1}
• Learning rate ϵ; see Section 6.2 on deterministic optimization
• Results from any forward propagation:
⋆ Network parameters θ = {θ^(1), ..., θ^(L)} (all layers)
⋆ Layer weighted inputs and biases z^(ℓ), for ℓ = 1, ..., L
⋆ Layer outputs y^(ℓ), for ℓ = 1, ..., L
Result: Updated network parameters θ to reduce cost function J.
2 Initialize:
3 • Gradient r = ∂J/∂y^(L) ∈ R^{1×m^(L)} (row) on predicted output ỹ = y^(L), Eq. (99);
4 for ℓ = L, ..., 1 do
5 ▶ Compute gradient on weighted inputs (pre-nonlinear activation) z^(ℓ):
6 r ← ∂J/∂z^(ℓ) = r ⊙ [a′(z^(ℓ))]^T ∈ R^{1×m^(ℓ)} (row), Eq. (104);
7 ▶ Compute gradient on layer parameters θ^(ℓ):
8 ∂J/∂θ^(ℓ) = r^T [y^(ℓ−1)]^T ∈ R^{m^(ℓ)×[m^(ℓ−1)+1]}, Eq. (105);
9 ▶ Compute gradient on layer outputs y^(ℓ−1):
10 r ← ∂J/∂y^(ℓ−1) = r W^(ℓ) ∈ R^{1×m^(ℓ−1)} (row), Eq. (106);
11 end
Algorithm 9: Backpropagation pseudocode. Alternative presentation of Algorithm 1, where the
“while” loop is used, to compare to Algorithm 6.4 in [78], p.206, where the “for” loop was used, and
where there was no layer-parameter update step (Line 8 of Algorithm 1). See also Table 8 for the corre-
spondence between the notations employed here and those in [78], and the block diagrams in Figure 47
and Figure 48.
In Algorithm 9, the regularization of the cost function J is not considered, i.e., we omit the penalty (or
regularization) term λΩ(θ) used in Algorithm 6.4 in [78], p.206, where the regularized cost was J(θ) +
λΩ(θ). As pointed out in Section 6.3.6, weight decay is more general than L2 regularization, and would be
the preferred method to avoid overfitting.
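For first-time learners, the following minimal NumPy sketch mirrors the loop structure of Algorithm 9, together with the layer-parameter update (cf. Line 8 of Algorithm 1). It is a sketch under stated assumptions, not a verbatim transcription: the ReLU activation, the quadratic cost J = 0.5 ||y^(L) − y_target||^2, and the column-vector convention (instead of the row-gradient convention used above) are choices made here for illustration only:

    import numpy as np

    def a(z):                                  # activation; ReLU assumed for illustration
        return np.maximum(z, 0.0)

    def a_prime(z):                            # derivative a'(z)
        return (z > 0.0).astype(z.dtype)

    def forward(x, Ws, bs):
        """Forward propagation storing the weighted inputs z and outputs y of every layer."""
        ys, zs = [x], []
        for W, b in zip(Ws, bs):
            z = W @ ys[-1] + b
            zs.append(z)
            ys.append(a(z))
        return ys, zs

    def backprop_update(x, y_target, Ws, bs, eps=1.0e-3):
        """One gradient-descent step following the loop structure of Algorithm 9.
        The quadratic cost J = 0.5 * ||y(L) - y_target||^2 is an assumption."""
        ys, zs = forward(x, Ws, bs)
        r = ys[-1] - y_target                  # dJ/dy(L), kept as a column vector here
        for l in reversed(range(len(Ws))):     # layers L, ..., 1 (0-based indices)
            r = r * a_prime(zs[l])             # dJ/dz(l): Hadamard product, cf. Eq. (104)
            dW = r @ ys[l].T                   # weight part of dJ/dtheta(l), cf. Eq. (105)
            db = r.copy()                      # bias part of dJ/dtheta(l)
            r = Ws[l].T @ r                    # dJ/dy(l-1), cf. Eq. (106), before updating W
            Ws[l] -= eps * dW                  # layer-parameter update (cf. Line 8 of Algorithm 1)
            bs[l] -= eps * db
        return Ws, bs

    # Tiny usage example: a 2-3-1 network on a single sample.
    rng = np.random.default_rng(0)
    Ws = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
    bs = [np.zeros((3, 1)), np.zeros((1, 1))]
    Ws, bs = backprop_update(np.array([[0.5], [-1.0]]), np.array([[1.0]]), Ws, bs)

Note that the gradient r with respect to the layer input is computed before the weights of that layer are updated, so that all gradients are evaluated at the current parameters.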
In Table 8, the correspondence between the notations employed here and those in [78], p.206, is pro-
vided.
^338 MLP = MultiLayer Perceptron.
Table 8: Equivalence of backprop Algorithm 9 and Algorithm 6.4 in [78], p.206. Comparison
of notations. The mathematical expressions in Algorithm 6.4 are reproduced here in their
original notations, except for the matrix dimensions, which were not given in Algorithm 6.4 of
[78].
Algorithm 9, current notation | Goodfellow Algorithm 6.4, original notation
Figure 152: Folded RNN and LSTM cell, two feedback loops with delay, block diagram. Typical state [k].
Corrections for the figures in (1) Figure 10.16 in the online book Deep Learning by Goodfellow et al 2016,
Chap.10, p.405 (referred to here as “DL-A”, or “Deep Learning, version A”), and (2) [78], p.398 (referred
to as “DL-B”). (The above figure is adapted from a figure reproduced with permission of the authors.)
Figure 10.16 in the online book Deep Learning by Goodfellow et al 2016, Chap.10, p.405 (referred
to here as “DL-A”, or “Deep Learning, version A”), was incomplete, with missing important details.
Even the updated Figure 10.16 in [78], p.398 (referred to as “DL-B”), was still incomplete (or incorrect).
The corrected arrows, added annotations, and colors correspond to those in the equivalent Figure 81.
The corrections are described below.
Error 1: The cell state c[k] should be squashed by the state sigmoidal activation function As with range
(−1, +1) (brown dot, e.g., tanh function) before being multiplied by the scaling factor FO coming from the
output gate to produce the hidden state h[k]. This correction is for both DL-A and DL-B.
Error 2: The hidden-state feedback loop (green) should start from the hidden state h[k], delayed by one
step, i.e., h[k−1], which is fed to all four gates: (1) the externally-input gate g (purple box) with activation
function Ag having the range (−1, +1), (2) the input gate I, (3) the forget gate f, (4) the output gate O. The
activations Aα, with α ∈ {I, f, O} (3 blue boxes), all have the interval (0, 1) as their range. This hidden-
state feedback loop was missing in DL-A, whereas the hidden-state feedback loop in DL-B incorrectly
started from the summation operation (grey circle, just below c[k]) in the cell-state feedback loop, and did
not feed into the input gate g.
Error 3: Four pairs of arrows pointing into the four gates {g, I, f, O}, with one pair per gate, were
intended as inputs to these gates, but were without annotation, and thus unclear and confusing. Here, for each
gate, one arrow is used for the hidden state h[k−1], and the other arrow for the input x[k]. This correction is
for both DL-A and DL-B.
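To make the corrected data flow concrete, the following minimal NumPy sketch performs one LSTM state update consistent with the corrected Figure 152: the cell state c[k] is squashed by tanh before the output-gate scaling (Error 1), and the delayed hidden state h[k−1] feeds all four gates g, I, f, O (Error 2). The per-gate weight layout (dictionaries W, U, b) and the sizes in the usage example are illustrative assumptions, not the notation of [78]:

    import numpy as np

    def sigmoid(u):
        return 1.0 / (1.0 + np.exp(-u))

    def lstm_step(x_k, h_prev, c_prev, W, U, b):
        """One LSTM state update, state [k]; W, U, b are dicts keyed by gate name."""
        # Error 2 fix: the delayed hidden state h[k-1] feeds ALL four gates.
        g = np.tanh(W['g'] @ x_k + U['g'] @ h_prev + b['g'])   # externally-input gate g, range (-1, +1)
        i = sigmoid(W['i'] @ x_k + U['i'] @ h_prev + b['i'])   # input gate I, range (0, 1)
        f = sigmoid(W['f'] @ x_k + U['f'] @ h_prev + b['f'])   # forget gate f, range (0, 1)
        o = sigmoid(W['o'] @ x_k + U['o'] @ h_prev + b['o'])   # output gate O, range (0, 1)
        c = f * c_prev + i * g                                 # cell-state feedback loop
        # Error 1 fix: squash c[k] with tanh BEFORE the output-gate scaling factor.
        h = o * np.tanh(c)
        return h, c

    # Tiny usage example with assumed sizes: input dimension 4, hidden dimension 3.
    rng = np.random.default_rng(1)
    W = {k: rng.normal(size=(3, 4)) for k in 'gifo'}
    U = {k: rng.normal(size=(3, 3)) for k in 'gifo'}
    b = {k: np.zeros((3, 1)) for k in 'gifo'}
    h, c = lstm_step(rng.normal(size=(4, 1)), np.zeros((3, 1)), np.zeros((3, 1)), W, U, b)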
\[
\hat{y} := \begin{bmatrix} y \\ \tilde{y} \end{bmatrix} , \quad
\hat{\mu} := \begin{bmatrix} \mu \\ \tilde{\mu} \end{bmatrix} , \quad
C_{\hat{y}\hat{y}} := \begin{bmatrix} C_{yy,\nu} & C_{y\tilde{y}} \\ C_{\tilde{y}y} & C_{\tilde{y}\tilde{y}} \end{bmatrix}
= \begin{bmatrix} K(x,x) + \nu^2 I & K(x,\tilde{x}) \\ K^T(x,\tilde{x}) & K(\tilde{x},\tilde{x}) \end{bmatrix} , \tag{518}
\]
\[
C_{\hat{y}\hat{y}}^{-1} := \begin{bmatrix} C_{yy,\nu} & C_{y\tilde{y}} \\ C_{\tilde{y}y} & C_{\tilde{y}\tilde{y}} \end{bmatrix}^{-1}
= \begin{bmatrix} D_{yy} & D_{y\tilde{y}} \\ D_{y\tilde{y}}^T & D_{\tilde{y}\tilde{y}} \end{bmatrix} , \quad
\text{with } D_{y\tilde{y}}^T = D_{\tilde{y}y} , \tag{519}
\]
then expand the exponent in the Gaussian joint probability Eq. (368), with $(\hat{y} - \hat{\mu})$ instead of $(y - \mu)$, to
have the Mahalanobis distance $\Delta$ squared, a quadratic form in terms of $(\hat{y} - \hat{\mu})$,^340 written as:
\[
\Delta^2 := (\hat{y} - \hat{\mu})^T C_{\hat{y}\hat{y}}^{-1} (\hat{y} - \hat{\mu})
= (y - \mu)^T D_{yy} (y - \mu) + (y - \mu)^T D_{y\tilde{y}} (\tilde{y} - \tilde{\mu})
+ (\tilde{y} - \tilde{\mu})^T D_{\tilde{y}y} (y - \mu) + (\tilde{y} - \tilde{\mu})^T D_{\tilde{y}\tilde{y}} (\tilde{y} - \tilde{\mu}) , \tag{520}
\]
which is also a quadratic form in terms of $(\tilde{y} - \tilde{\mu})$, based on the symmetry of $C_{\hat{y}\hat{y}}^{-1}$, the inverse of the co-
variance matrix, in Eq. (519), implying that the distribution of the test values $\tilde{y}$ is Gaussian. The covariance
matrix and the mean of the Gaussian distribution over $\tilde{y}$ are determined by identifying the quadratic term
and the linear term in $\tilde{y}$, compared to the expansion of the general Gaussian distribution $\mathcal{N}(z \,|\, m, C)$ over
the variable $z$ with mean $m$ and covariance matrix $C$:
\[
\Delta^2 = (z - m)^T C^{-1} (z - m) = z^T C^{-1} z - 2 z^T C^{-1} m + \text{constant} . \tag{521}
\]
Gather the terms in Eq. (520) that are quadratic and linear in $\tilde{y}$,
\[
\Delta^2 = \tilde{y}^T D_{\tilde{y}\tilde{y}} \tilde{y}
- 2 \tilde{y}^T \left[ D_{\tilde{y}\tilde{y}} \tilde{\mu} - D_{\tilde{y}y} (y - \mu) \right] + \text{constant} , \tag{522}
\]
and compare to Eq. (521), then for the conditional distribution $p(\tilde{y} \,|\, y)$ of the test values $\tilde{y}$ at $\tilde{x}$, given the
observed values $y$:
\[
C_{\tilde{y}|y}^{-1} = D_{\tilde{y}\tilde{y}} \;\Rightarrow\; C_{\tilde{y}|y} = D_{\tilde{y}\tilde{y}}^{-1} , \tag{523}
\]
\[
C_{\tilde{y}|y}^{-1} \mu_{\tilde{y}|y} = D_{\tilde{y}\tilde{y}} \tilde{\mu} - D_{\tilde{y}y} (y - \mu)
\;\Rightarrow\; \mu_{\tilde{y}|y} = C_{\tilde{y}|y} \left[ D_{\tilde{y}\tilde{y}} \tilde{\mu} - D_{\tilde{y}y} (y - \mu) \right] , \tag{524}
\]
\[
\Rightarrow\; \mu_{\tilde{y}|y} = \tilde{\mu} - D_{\tilde{y}\tilde{y}}^{-1} D_{\tilde{y}y} (y - \mu) , \tag{525}
\]
in which Eq. (523)$_2$ had been used.
At this point, the submatrices $D_{\tilde{y}\tilde{y}}$ and $D_{\tilde{y}y}$ can be expressed in terms of the submatrices of the
partitioned matrix $C_{\hat{y}\hat{y}}$ in Eq. (518) as follows. From the definition of the matrix $D_{\hat{y}\hat{y}}$, the inverse of the
covariance matrix $C_{\hat{y}\hat{y}}$:
\[
D_{\hat{y}\hat{y}} C_{\hat{y}\hat{y}}
= \begin{bmatrix} D_{yy} & D_{y\tilde{y}} \\ D_{\tilde{y}y} & D_{\tilde{y}\tilde{y}} \end{bmatrix}
\begin{bmatrix} C_{yy} & C_{y\tilde{y}} \\ C_{\tilde{y}y} & C_{\tilde{y}\tilde{y}} \end{bmatrix}
= \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix} , \tag{526}
\]
the 2nd row gives rise to a system of two equations for the two unknowns $D_{\tilde{y}y}$ and $D_{\tilde{y}\tilde{y}}$:
\[
\begin{bmatrix} D_{\tilde{y}y} & D_{\tilde{y}\tilde{y}} \end{bmatrix}
\begin{bmatrix} C_{yy} & C_{y\tilde{y}} \\ C_{\tilde{y}y} & C_{\tilde{y}\tilde{y}} \end{bmatrix}
= \begin{bmatrix} 0 & I \end{bmatrix} , \tag{527}
\]
in which the covariance matrix $C_{\hat{y}\hat{y}}$ is symmetric, and is a particular case of the non-symmetric problem of
expressing $(F, G)$ in terms of $(P, Q, R, S)$:
\[
\begin{bmatrix} F & G \end{bmatrix}
\begin{bmatrix} P & Q \\ R & S \end{bmatrix}
= \begin{bmatrix} 0 & I \end{bmatrix}
\;\Rightarrow\; G = -F P R^{-1}
\;\Rightarrow\; G^{-1} F = -R P^{-1} , \tag{528}
\]
from the first equation, and leads to
\[
F Q + G S = I \;\Rightarrow\; F \left( Q - P R^{-1} S \right) = I , \tag{529}
\]
\[
\Rightarrow\; F = \left( Q - P R^{-1} S \right)^{-1} , \quad
G = -\left( Q - P R^{-1} S \right)^{-1} P R^{-1} , \quad
G^{-1} = S - R P^{-1} Q , \tag{530}
\]
which, after using Eq. (527) to identify $(F, G) = (D_{\tilde{y}y}, D_{\tilde{y}\tilde{y}})$ and $(P, Q, R, S) = (C_{yy}, C_{y\tilde{y}},
C_{y\tilde{y}}^T, C_{\tilde{y}\tilde{y}})$, and then when replaced in Eq. (523) for the conditional covariance $C_{\tilde{y}|y}$ and Eq. (525) for the
conditional mean $\mu_{\tilde{y}|y}$, yields Eq. (379) and Eq. (378), respectively.
Remark 3.1. Another way to obtain indirectly Eq. (379) and Eq. (378), without derivation, is to use the
identity of the inverse of a partitioned matrix in Eq. (531), as done in [130], p. 87:
\[
\begin{bmatrix} P & Q \\ R & S \end{bmatrix}^{-1}
= \begin{bmatrix} M^{-1} & -M^{-1} Q S^{-1} \\ -S^{-1} R M^{-1} & S^{-1} + S^{-1} R M^{-1} Q S^{-1} \end{bmatrix} ,
\quad M := P - Q S^{-1} R . \tag{531}
\]
This method is less satisfactory since, without derivation, there is no feel for where the matrix elements in
Eq. (531) came from. In fact, the derivation of the 1st row of Eq. (531) follows exactly the same line as for
Eqs. (528)-(530). The 2nd row in Eq. (531) looks complex, but before getting into its derivation, we note that
exactly the same line of derivation as for the 1st row can be straightforwardly followed to arrive at different,
and simpler, expressions for the 2nd-row matrix elements (2,1) and (2,2), which are similar to those in the
1st row, and which were already derived in Eq. (530).
It can be easily verified that
\[
\begin{bmatrix} M^{-1} & -M^{-1} Q S^{-1} \\ -S^{-1} R M^{-1} & S^{-1} + S^{-1} R M^{-1} Q S^{-1} \end{bmatrix}
\begin{bmatrix} P & Q \\ R & S \end{bmatrix}
= \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix} . \tag{532}
\]
To derive the 2nd row of Eq. (531), premultiply the 1st row (which had been derived as mentioned above)
of Eq. (532) by $(-S^{-1} R)$ to have
\[
(-S^{-1} R) \begin{bmatrix} M^{-1} & -M^{-1} Q S^{-1} \end{bmatrix}
\begin{bmatrix} P & Q \\ R & S \end{bmatrix}
= (-S^{-1} R) \begin{bmatrix} I & 0 \end{bmatrix}
= \begin{bmatrix} -S^{-1} R & 0 \end{bmatrix} . \tag{533}
\]
To make the right-hand side become $\begin{bmatrix} 0 & I \end{bmatrix}$, add to both sides of Eq. (533) the matrix $\begin{bmatrix} S^{-1} R & I \end{bmatrix}$ to
obtain Eq. (532)'s 2nd row, whose complex expressions did not contribute to the derivation of the conditional
Gaussian posterior mean Eq. (378) and covariance Eq. (379).
Yet another way to derive Eqs. (378)-(379) is to use the more complex proof in [248], p. 429, which
was referred to in [234], p. 200 (see also Footnote 242). ■
^339 See, e.g., [130], p. 85.
^340 See, e.g., [130], p. 80.
In summary, the above derivation is simpler and more direct than in [130], p. 87, and in [248], p. 429.
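As a sanity check on Eqs. (523) and (525), the following short NumPy sketch compares the conditional covariance and mean computed from the partitioned inverse $D_{\hat{y}\hat{y}}$ with the standard Gaussian-process posterior formulas of Eqs. (378)-(379), reproduced here in their standard form. The squared-exponential kernel, the problem sizes, the noise level, and the random seed are illustrative assumptions, not choices made in this review:

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, nu = 5, 3, 0.1                      # training points, test points, noise level (assumed)
    x, xt = rng.normal(size=(n, 1)), rng.normal(size=(m, 1))

    def K(a, b):                              # squared-exponential kernel (an assumption)
        return np.exp(-0.5 * (a - b.T) ** 2)

    Cyy = K(x, x) + nu**2 * np.eye(n)         # C_{yy,nu} in Eq. (518)
    Cyt = K(x, xt)                            # C_{y ytilde}
    Ctt = K(xt, xt)                           # C_{ytilde ytilde}

    # Partitioned covariance and its inverse, Eqs. (518)-(519)
    C = np.block([[Cyy, Cyt], [Cyt.T, Ctt]])
    D = np.linalg.inv(C)
    Dty, Dtt = D[n:, :n], D[n:, n:]           # D_{ytilde y}, D_{ytilde ytilde}

    y = rng.normal(size=(n, 1))               # synthetic observed values
    mu, mut = np.zeros((n, 1)), np.zeros((m, 1))

    # Conditional covariance and mean via Eqs. (523) and (525)
    cov_D = np.linalg.inv(Dtt)
    mean_D = mut - cov_D @ Dty @ (y - mu)

    # Standard GP posterior; the covariance is the Schur complement, cf. G^{-1} in Eq. (530)
    cov_gp = Ctt - Cyt.T @ np.linalg.solve(Cyy, Cyt)
    mean_gp = mut + Cyt.T @ np.linalg.solve(Cyy, y - mu)

    assert np.allclose(cov_D, cov_gp) and np.allclose(mean_D, mean_gp)

Both routes agree to machine precision, which numerically confirms the block-inverse identities in Eqs. (528)-(530).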
Figure 153: The first two waves of AI, according to [78], p.13, showing the “cybernetics”
wave (blue line) that started in the 1940s, peaked before 1970, then gradually declined toward 2006
and beyond. The results were based on a search for the frequency of words in Google Books. It
was mentioned, incorrectly, that the work of Rosenblatt (1957-1962) [1]-[2] was limited to one
neuron; see Figure 42 and Figure 133. (Figure reproduced with permission of the authors.)
Figure 154: Cybernetics papers, (Appendix 4). Web of Science search on 2020.04.15, spanning
more than 100 Web of Science categories. The first paper was [426]. There was no clear wave
that crested before 1970; on the contrary, the number of papers in Cybernetics continued to increase
over the years.
The first paper, in 1949 [426], was categorized as Mathematics. More recent papers fall under categories
such as Biological Science, e.g., [427], Building Construction, e.g., [428], and Accounting, e.g., [429].
It is interesting to note that McCulloch, who co-authored the well-known paper [430], was part of the
original cybernetics movement that started in the 1940s, as noted in [431]:
“Warren McCulloch, the “chronic chairman” and founder of the cybernetics conferences.343
An eccentric physiologist, McCulloch had coauthored a foundational article of cybernetics on
the brain’s neural network.”
But the McCulloch & Pitts 1943 paper [430]—often cited in artificial-neural-network papers (e.g., [23], [12])
and books (e.g., [78]), and dated six years before [426]—was placed in the Web of Science category “Bi-
ology; Mathematical & Computational Biology,” and thus did not show up in the search with keyword
“cyberneti*” shown in Figure 154. A reason is that [430] did not contain the word “cybernetics,” which was
not invented until 1948 with the famous book by Wiener, and which was part of the title of [426]. Cybernetics
was a “new science” with a “mysterious name and universal aspirations” [431], p.5.
“What exactly is (or was) cybernetics? This has been a perennial ongoing topic of debate
within the American Society for Cybernetics throughout its 50-year history. ... the word has
a much older history reaching back to Plato, Ampère (“Cybernétique = the art of growing”),
and others. “Cybernetics” comes from the Greek word for governance, kybernetike, and the
related word, kybernetes, steersman or captain” [432].
Figure 155: Cybernetics papers, (Appendix 4). Web of Science search on 2020.04.17, ALL
Computer-Science categories (3,555 papers): Cybernetics (2,666), Artificial Intelligence (602),
Information Systems (432), Theory Methods (300), Interdisciplinary Applications (293), Soft-
ware Engineering (163). The wave crest was in 2007, with a tiny bump in 1980.
“... (feedback) control and communication theory pertinent to the description, analysis, or
construction of systems that involve (1) mechanisms (receptors) for the reception of messages
or stimuli, (2) means (circuits) for communication of these to (3) a central control unit that
responds by feeding back through the system (4) instructions that (will or tend to) produce
specific actions on the part of (5) particular elements (effectors) of the system. ... The cen-
tral concept in cybernetics is a feedback mechanism that, in response to information (stimuli,
messages) received through the system, feeds back to the system instructions that modify or
otherwise alter the performance of the system.”
Even though [432] did not use the word “control”, the definition is similar:
“The core concepts involved natural and artificial systems organized to attain internal stabil-
ity (homeostasis), to adjust internal structure and behavior in light of experience (adaptive,
self-organizing systems), and to pursue autonomous goal-directed (purposeful, purposive) be-
havior.” [432]
“If “cybernetics” means “control and communication,” what does it not mean? It would be
difficult to think of any process in which nothing is either controlled or communicated.”
which is the reason why cybernetics is found in a large number of different fields. [431], p.4, offered a
similar, more detailed explanation of cybernetics as encompassing all fields of knowledge:
“Wiener and Shannon defined the amount of information transmitted in communications sys-
tems with a formula mathematically equivalent to entropy (a measure of the degradation of
energy). Defining information in terms of one of the pillars of physics convinced many
researchers that information theory could bridge the physical, biological, and social sciences.
The allure of cybernetics rested on its promise to model mathematically the purposeful behav-
ior of all organisms, as well as inanimate systems. Because cybernetics included information
theory in its purview, its proponents thought it was more universal than Shannon’s theory, that
it applied to all fields of knowledge.”
Figure 156: Cybernetics papers, (Appendix 4). Web of Science search on 2020.04.15 (two
days before Figure 155), category Computer Science Cybernetics (2,665 papers). The wave
crest was in 2007, with a tiny bump in 1980.
Figure 157: Cybernetics papers, (Appendix 4). Web of Science search on 2020.04.15 (two
days before Figure 155), category Computer Science Artificial Intelligence (601 papers). Sim-
ilar to Figure 156, the wave crest was in 2007, but with no tiny bump in 1980, since the first
paper was in 1982.
In 1969, the then president of the International Association of Cybernetics asked “But after all what is
cybernetics? Or rather what is it not, for paradoxically the more people talk about cybernetics the less they
seem to agree on a definition,” then identified several meanings: A mathematical control theory, automa-
tion, computerization, communication theory, study of human-machine analogies, philosophy explaining
the mysteries of life! [431], p.5.
[Figure 158 labels: Cybernetics; Artificial Intelligence (AI), e.g., knowledge bases; Machine Learning (ML),
e.g., Support Vector Machine (SVM); neuromorphic (spiking) computing; neighboring fields: Biology,
Mathematics, Psychology, Neuroscience, Social sciences, and more.]
Figure 158: Artificial Intelligence (AI), Machine Learning (ML), and Deep Learning (DL).
Cybernetics is broad and encompasses many fields, including AI. See also Figure 6.
So was there a first wave in AI called “cybernetics”? Back in Oct 2018, we conveyed our search result
at that time—which was similar to Figure 154, but clearly did not support the existence of the cybernetics
wave shown in Figure 153—to Y. Bengio of [78], who then replied:
“Ian [Goodfellow] did those figures, but my take on your observations is that the later surge in
’new cybernetics’ does not have much more to do with artificial neural networks. I’m not sure
why the Google Books search did not catch that usage, though.”
We then selected only the categories that had the words “Computer Science” in their names; there were only
six such categories among more than 100 categories, as shown in Figure 155. A similar figure obtained in
Oct 2018 was also shared with Bengio, who had no further comment. The wave crest in Figure 155 occurred
in 2007, with a tiny bump in 1980, but not before 1970 as in Figure 153.
Figure 156 is the histogram for the largest single category, Computer Science Cybernetics, with 2,665
papers. In this figure, similar to Figure 155, the wave crest also occurred in 2007, with a tiny bump in
1980.
Figure 157 is the histogram for the category Computer Science Artificial Intelligence with 601 papers.
Again, similar to Figure 155 and Figure 156, the wave crest occurred in 2007, but with no bump
in 1980. The first document, a 5-year plan report of Latvia, appeared in 1982. There is a large “impulse” in
the number of papers in 2007, and a smaller “impulse” in 2014, but no smooth bump. There were no papers
for the nine years between 1982 and 1992; in 1992, a single paper appeared in the series “Lecture Notes in
Artificial Intelligence” on cooperative agents.
Cybernetics, including the original cybernetics moment as described in [431], encompassed many
fields and involved many researchers not working on neural nets, such as Wiener, John von Neumann, Mar-
garet Mead (anthropologist), etc., whereas the physiologist McCulloch co-authored the first “foundational
article of cybernetics on the brain’s neural network”. So it is not easy to attribute even the original cyber-
netics moment to research on neural nets alone. Moreover, many topics of interest to researchers at the time
involved natural systems (including [430]), and thus natural intelligence, instead of artificial intelligence.