
Enforcing constraints for time series prediction in supervised, unsupervised and reinforcement learning

arXiv:1905.07501v1 [cs.LG] 17 May 2019

Panos Stinis
Advanced Computing, Mathematics and Data Division, Pacific Northwest National Laboratory, Richland, WA 99354

Abstract
We assume that we are given a time series of data from a dynamical
system and our task is to learn the flow map of the dynamical system.
We present a collection of results on how to enforce constraints com-
ing from the dynamical system in order to accelerate the training of
deep neural networks to represent the flow map of the system as well
as increase their predictive ability. In particular, we provide ways to
enforce constraints during training for all three major modes of learn-
ing, namely supervised, unsupervised and reinforcement learning. In
general, the dynamic constraints need to include terms which are anal-
ogous to memory terms in model reduction formalisms. Such memory
terms act as a restoring force which corrects the errors committed by
the learned flow map during prediction.
For supervised learning, the constraints are added to the objective
function. For the case of unsupervised learning, in particular genera-
tive adversarial networks, the constraints are introduced by augment-
ing the input of the discriminator. Finally, for the case of reinforce-
ment learning and in particular actor-critic methods, the constraints
are added to the reward function. In addition, for the reinforcement
learning case, we present a novel approach based on homotopy of the
action-value function in order to stabilize and accelerate training. We
use numerical results for the Lorenz system to illustrate the various
constructions.

Introduction
Scientific machine learning, which combines the strengths of scientific com-
puting with those of machine learning, is becoming a rather active area of

research. Several related priority research directions were stated in the re-
cently published report [2]. In particular, two priority research directions
are: (i) how to leverage scientific domain knowledge in machine learning
(e.g. physical principles, symmetries, constraints); and (ii) how can ma-
chine learning enhance scientific computing (e.g. reduced-order or sub-grid
physics models, parameter optimization in multiscale simulations).
Our aim in the current work is to present a collection of results that
contribute to both of the aforementioned priority research directions. On
the one hand, we provide ways to enforce constraints coming from a dynam-
ical system during the training of a neural network to represent the flow
map of the system. Thus, prior domain knowledge is incorporated in the
neural network training. On the other hand, as we will show, the accurate
representation of the dynamical system flow map through a neural network
is equivalent to constructing a temporal integrator for the dynamical system
modified to account for unresolved temporal scales. Thus, machine learning
can enhance scientific computing.
We assume that we are given data in the form of a time series of the
states of a dynamical system (a training trajectory). Our task is to train a
neural network to learn the flow map of the dynamical system. This means
to optimize the parameters of the neural network so that when it is presented
with the state of the system at one instant, it will predict accurately the
state of the system at another instant which is a fixed time interval apart.
If we want to use the data alone to train a neural network to represent the
flow map, then it is easy to construct simple examples where the trained
flow map has rather poor predictive ability [24]. The reason is that the
given data train the flow map to learn how to respond accurately as long
as the state of the system is on the trajectory. However, at every timestep,
when we invoke the flow map to predict the estimate of the state at the next
timestep, we commit an error. After some steps, the predicted trajectory
veers into parts of phase space where the neural network has not trained.
When this happens, the neural network’s predictive ability degrades rapidly.
One way to aid the neural network in its training task is to provide
data that account for this inevitable error. In [24], we advanced the idea of
using a noisy version of the training data i.e. a noisy version of the training
trajectory. In particular, we attach a noise cloud around each point on the
training trajectory. During training, the neural network learns how to take
as input points from the noise cloud, and map them back to the noiseless
trajectory at the next time instant. This is an implicit way of encoding
a restoring force in the parameters of the neural network (see Section 1.1
for more details). We have found that this modification can improve the

predictive ability of the trained neural network but up to a point (see Section
3 for numerical results).
We want to aid the neural network further by enforcing constraints that
we know the state of the system satisfies. In particular, we assume that we
have knowledge of the differential equations that govern the evolution of the
system (our constructions work also if we assume algebraic constraints see
e.g. [24]). Except for special cases, it is not advisable to try to enforce the
differential equations directly at the continuum level. Instead we can dis-
cretize the equations in time using various numerical methods. We want to
incorporate the discretized dynamics into the training process of the neural
network. The purpose of such an attempt can be explained in two ways: (i)
we want to aid the neural network so that it does not have to discover the
dynamics (physics) from scratch; and (ii) we want the constraints to act as
regularizers for the optimization problem which determines the parameters
of the neural network.
Closer inspection of the concept of noisy data and of enforcing the dis-
cretized constraints reveals that they can be combined. However, this needs
to be done with care. Recall that when we use noisy data we train the neural
network to map a point from the noise cloud back to the noiseless point at
the next time instant. Thus, we cannot enforce the discretized constraints
as they are because the dynamics have been modified. In particular, the
use of noisy data requires that the discretized constraints be modified to
account explicitly for the restoring force. We have called the modification of
the discretized constraints the explicit error-correction (see Section 1.2).
The meaning of the restoring force is analogous to that of memory terms
in model reduction formalisms [6]. To see this, note that the flow map as
well as the discretization of the original constraints are based on a finite
timestep. The timescales that are smaller than the timestep used are not
resolved explicitly. However, their effect on the resolved timescales cannot
be ignored. In fact, it is what causes the inevitable error at each applica-
tion of the flow map. The restoring force that we include in the modified
constraints is there to remedy this error i.e. to account for the unresolved
timescales albeit in a simplified manner. This is precisely the role played by
memory terms in model reduction formalisms. In the current work we have
restricted attention to linear error-correction terms. The linear terms come
with coefficients whose magnitude is optimized as part of the training. In
this respect, optimizing the error-correction term coefficients becomes akin
to temporal renormalization. This means that the coefficients depend on the
temporal scale at which we probe the system [8, 3]. Finally, we note that
the error-correction term can be more complex than linear. In fact, it can

be modeled by a separate neural network. Results for such more elaborate
error-correction terms will be presented elsewhere.
We have implemented constraint enforcing in all three major modes of
learning. For supervised learning, the constraints are added to the objective
function (see Section 2.1). For the case of unsupervised learning, in partic-
ular generative adversarial networks [9], the constraints are introduced by
augmenting the input of the discriminator (see Section 2.2 and [24]). Finally,
for the case of reinforcement learning and in particular actor-critic methods
[25], the constraints are added to the reward function. In addition, for the
reinforcement learning case, we have developed a novel approach based on
homotopy of the action-value function in order to stabilize and accelerate
training (see Section 2.3).
In recent years, there has been considerable interest in the develop-
ment of methods that utilize data and physical constraints in order to
train predictors for dynamical systems and differential equations e.g. see
[4, 20, 5, 13, 22, 7, 26, 16] and references therein. Our approach is different:
it introduces the novel concept of training on purpose with modified (noisy)
data in order to incorporate (implicitly or explicitly) a restoring force in the
dynamics learned by the neural network flow map. We have also provided
the connection between the incorporation of such restoring forces and the
concept of memory in model reduction.
The paper is organized as follows. Section 1 explains the need for con-
straints to increase the accuracy/efficiency of time series prediction as well as
the form that these constraints can have. Section 2 presents ways to enforce
such constraints in supervised learning (Section 2.1), unsupervised learning
(Section 2.2) and reinforcement learning (Section 2.3). Section 3 contains
numerical results for the various constructions using the Lorenz system as
an illustrative example. Finally, Section 4 contains a brief discussion of the
results as well as some ideas for current and future work.

1 Constraints for time series prediction of dynamical systems
Suppose that we are given a dynamical system described by an M-dimensional
set of differential equations

dx/dt = f(x),    (1)

where x ∈ R^M. The system (1) needs to be supplemented with an initial condition x(0) = x_0. Furthermore, suppose that we are provided with time series data from the system (1). This means a sequence of points from a trajectory of the system, {x_i^data}_{i=1}^N, recorded at time intervals of length ∆t. We would like to use this time series data to train a neural network to represent the flow map of the system, i.e. a map H^∆t with the property H^∆t x(t) = x(t + ∆t) for all t and x(t).
We want to find ways to enforce during training the constraints implied
by the system (1). Before we proceed, we should mention that in addition
to (1), one could have extra constraints. For example, if the system (1)
is Hamiltonian, we have extra algebraic constraints since the system must
evolve on an energy surface determined by its initial condition. We note that
the framework we present below can also enforce algebraic constraints but
in the current work we will focus on the enforcing of dynamic constraints
like the system (1). Enforcing dynamic constraints is more demanding than
enforcing algebraic ones. Technically, the enforcing of algebraic constraints
requires only knowledge of the state of the system. On the other hand, the
enforcing of dynamic constraints requires knowledge of the state and of the
rate of change of the state.
It is not advisable to attempt enforcing directly the constraints in (1).
To do that requires that the output of the neural network includes both the
state of the system and its rate of change. This doubles the dimension of the
output and makes the necessary neural network size larger and its training
more demanding. Instead, we will enforce constraints that involve only the
state, albeit at more than one instant. For example, we can consider the
simplest temporal discretization scheme, the forward Euler scheme [12], and
discretize (1) with timestep ∆t to obtain

x̂(t + ∆t) = x̂(t) + ∆tf (x̂(t)) (2)


Then, we can choose to enforce (2) during training of the flow map. To
be more precise, we can train the neural network representation of the flow
map such that (2) holds for all the training data {x_i^data}_{i=1}^N. In addition, one
can consider more elaborate temporal discretization schemes e.g. explicit
Runge-Kutta methods [12]. In such a case, (2) is replaced by

x̂(t + ∆t) = x̂(t) + ∆tf RK (x̂(t)) (3)

where f RK (x̂(t)) represents the functional form of the Runge-Kutta update.
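For instance, for the classical fourth-order Runge-Kutta method, f^RK takes the following form (a brief NumPy sketch of the standard RK4 increment, not taken from the paper, assuming an autonomous right-hand side f):

import numpy as np

def f_rk4(f, x, dt):
    # Classical RK4 increment function: x(t + dt) = x(t) + dt * f_rk4(f, x, dt).
    k1 = f(x)
    k2 = f(x + 0.5 * dt * k1)
    k3 = f(x + 0.5 * dt * k2)
    k4 = f(x + dt * k3)
    return (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0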


Such an approach of enforcing the constraint is not enough to guaran-
tee that the trained neural network representation of the flow map will be

accurate. In fact, as can be seen by simple numerical examples [24], the
trained neural network flow map can lose its predictive ability rather fast.
The reason is that we have used data from a time series i.e. a trajectory to
train the neural network. However, a single trajectory is extremely unlikely
(has measure zero) in the phase space of the system. Thus, the trained
network predicts accurately as long as the predicted state remains on the
training trajectory. But this is impossible, since the action of the flow map
at every (finite) timestep involves an inevitable approximation error. If left
unchecked, this approximation error causes the prediction to deviate into a
region of phase space that the network has never trained on. Soon after, all
the predictive ability of the network is lost.
This observation highlights the need for an alternate way of enforcing the
constraints. In fact, as we will explain now, it points towards the need for the
enforcing of alternate constraints altogether. In particular, this observation
underlines the need for the training of the neural network to include some
kind of error-correcting mechanism. Such an error-correcting mechanism
can help restore the trajectory predicted by the learned flow map when it
inevitably starts deviating due to the finiteness of the used timestep.
The way we have devised to implement this error-correcting mechanism
can be implicit or explicit. By implicit we mean that we do not specify
the functional form of the mechanism but only what we want it to achieve
(Section 1.1). On the other hand, the explicit implementation of the error-
correcting mechanism does involve the specification of the functional form
of the mechanism (Section 1.2).
The common ingredient for both implicit and explicit implementations
is the use of a noisy version of the data during training. The main idea is the
fact that the training of the neural network must address the inevitability
of error that comes with the use of a finite timestep. For example, suppose
that we are given data that are recorded every ∆t. The flow map we wish to
train will produce states of the system at time instants that are ∆t apart.
Every time the flow map is applied, even if it is applied on a point from
the exact trajectory it will produce a state that has deviated from the exact
trajectory at the next timestep. So, the trained flow map must learn how
to correct such a deviation.

1.1 Implicit error-correction


The implicit implementation of the error-correcting mechanism can be real-
ized by the following simple procedure. We can consider each point of the
given time series data and we can enhance it by a (random) cloud of points

centered at the point on the time series. Such a cloud of points accounts
for our ignorance about the inevitable error that the flow map commits at
every step. The next step is to train the neural network to map a point from
this cloud back to the noiseless trajectory at the next timestep. In this way,
the neural network is trained to incorporate an error-correcting mechanism
implicitly.
Of course, there are the questions of the extent of the cloud of noisy
points as well as the number of samples we need from it. These parameters
depend on the magnitude of the interval ∆t and the accuracy of the train-
ing data. For example, if the training data were produced by a numerical
method with a known order of accuracy then we expect the necessary ex-
tent of the noisy cloud to follow a scaling law with respect to the interval
∆t. Similarly, if the training data were produced by a numerical experiment
with known measurement error, we expect the extent of the noisy cloud to
depend on the measurement error.

1.2 Explicit error-correction


The explicit implementation of the error-correcting mechanism requires the
specification of the functional form of the mechanism in addition to enhanc-
ing the given time series data by a noisy cloud. The main idea is that the
need for the incorporation of the error-correcting mechanism means that
the flow map we have to learn is not of the original system but of a mod-
ified system. Symbolically, the dynamics that the neural network based
flow map must learn are given by “learned dynamics = original dynamics
+ error-correction”. As we have explained before, the error-correction term
is needed due to the inevitable error caused by the use of a finite timestep.
Such error-correction terms can be interpreted as memory terms appearing
in model reduction formalisms [6]. However, note that here the reduction is
in the temporal sense since it is caused by the use of a finite timestep. It can
be thought of as a way to account for all the timescales that are contained
in the interval ∆t and it is akin to temporal renormalization. Another way
to interpret such an error-correction term is as a control mechanism [14].
For the specific case of the forward Euler scheme given in (2), the explicit
implementation of the error-correcting mechanism will mean that we want
our trained flow map to satisfy

x̂(t + ∆t) = x̂(t) + ∆tf (x̂(t)) + ∆tf C (x̂(t)), (4)


where f C (x̂(t))
is the error-correcting term. The obvious question is what
is the form of f C (x̂(t)). The simplest state-dependent approximation is

to assume f C (x̂(t)) is a linear function of the state x̂(t). For example,
f C (x̂(t)) = Ax̂(t), where A is a M × M matrix whose entries need to be
determined. The entries of A can be estimated during the training of the
flow map neural network. There is no need to restrict the form of the cor-
rection term f C (x̂(t)) to a linear one. In fact, we can consider a separate
neural network to represent f C (x̂(t)). We have explored such constructions
although a detailed presentation will appear in a future publication. A fur-
ther question to be explored is the dependence of the elements of A or of
the parameters of the network for the representation of the error-correcting
term on the timestep ∆t. In fact, we expect a scaling law dependence on ∆t
which would be a manifestation of incomplete similarity [3].
We also note that there is a further generalization of the error-correcting
term f C (x̂(t)), if we allow it to depend on the state of the system for times
before t. Given the analogy to memory terms alluded to above, such a de-
pendence on the history of the evolution of the state of the system is an
instance of a non-Markovian memory [6].
Finally, we note that (4) offers one more way to interpret the error-
correction term, namely as a modified numerical method, a modified Euler
scheme in this particular case, where the role of the error-correction term is
to account for the error of the Euler scheme.
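To make the modified scheme concrete, the following NumPy sketch (a hypothetical helper, not the paper's code) evaluates one step of (4) with the linear choice f^C(x̂) = Ax̂; during training the entries of A would be optimized together with the network weights.

import numpy as np

def corrected_euler_step(x, f, A, dt):
    # Modified forward Euler step (4): x(t+dt) = x(t) + dt*f(x(t)) + dt*A@x(t),
    # where A is the (trainable) linear error-correction matrix.
    return x + dt * f(x) + dt * A @ x

# Illustrative use with a placeholder linear system f(x) = -x.
M = 3
A = -0.01 * np.eye(M)            # diagonal linear error-correction term
x_next = corrected_euler_step(np.ones(M), lambda x: -x, A, dt=1.5e-2)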

2 Enforcing constraints in supervised, unsupervised and reinforcement learning
In this section we will examine ways of enforcing the constraints in the 3 ma-
jor modes of learning, namely supervised, unsupervised and reinforcement
learning.

2.1 Supervised learning


The case of supervised learning is the most straightforward. Let us assume
that the flow map is represented by a deep neural network denoted by G
depending on the parameter vector θG (the parameter vector θG contains the
weights and biases of the neural network). The simplest objective function
is the L2 discrepancy between the network predictions and the training data
trajectory given by
Loss_supervised = (1/Λ) Σ_{i=1}^{Λ} (G(z_i) − x_i^data)^2,    (5)

where zi is a point from the noise cloud around a point of the given training trajectory and x_i^data is the noiseless point on the given training trajectory
after time ∆t. Note that we allowed freedom here in choosing the value of
Λ to accommodate various implementation choices e.g. number of samples
from the noise cloud, mini-batch sampling etc. The parameter vector θG
can be estimated by minimizing the objective function Losssupervised .
For the sake of simplicity, suppose that we want to enforce the constraints
given in (4) with a diagonal linear representation of the error-correcting term
i.e. fjC (x̂(t)) = −aj x̂j (t), for j = 1, . . . , M (the minus sign is in anticipation
of this being a restoring force). Then we can consider the modified objective
function given by
Loss^constraints_supervised = (1/Λ) Σ_{i=1}^{Λ} [ (G(z_i) − x_i^data)^2 + Σ_{j=1}^{M} (G_j(z_i) − z_{ij} − ∆t f_j(z_i) + ∆t a_j z_{ij})^2 ],    (6)

where zij is the j-th component of the noise cloud point zi and fj is the
j-th component of the vector f. Notice that the minimization of the mod-
ified objective function Loss^constraints_supervised leads to the determination of both
the parameter vector θG and the error-correcting representation parameters
aj , j = 1, . . . , M. Also note that if instead of the forward Euler scheme we
use e.g. a more elaborate Runge-Kutta method as given in (3), then we can
still use (6) but with the vector f replaced by f RK .
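As an illustration, here is a minimal PyTorch sketch of the constrained objective (6). The network G, the right-hand side f and the batch of cloud points are placeholders of our own choosing; the correction coefficients a_j are registered as trainable parameters alongside the network weights. This is a sketch of the idea, not the author's implementation.

import torch

def constrained_supervised_loss(G, a, z, x_data, f, dt):
    # Objective (6): data misfit plus the forward Euler dynamic constraint
    # with a diagonal linear error-correction term.
    # z, x_data: (batch, M) cloud points at time t and noiseless points at t + dt.
    pred = G(z)
    data_term = ((pred - x_data) ** 2).sum(dim=1)
    residual = pred - z - dt * f(z) + dt * a * z     # constraint residual per component
    return (data_term + (residual ** 2).sum(dim=1)).mean()

# Hypothetical usage with a small fully connected flow map network.
M, dt = 3, 1.5e-2
G = torch.nn.Sequential(torch.nn.Linear(M, 20), torch.nn.Tanh(), torch.nn.Linear(20, M))
a = torch.nn.Parameter(torch.zeros(M))                # error-correction coefficients a_j
opt = torch.optim.Adam(list(G.parameters()) + [a], lr=1e-3)
f = lambda x: -x                                      # placeholder right-hand side
z, x_data = torch.randn(100, M), torch.randn(100, M)  # placeholder training batch
loss = constrained_supervised_loss(G, a, z, x_data, f, dt)
opt.zero_grad(); loss.backward(); opt.step()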

2.2 Unsupervised learning - Generative Adversarial Networks


The next mode of learning that we will examine how to enforce constraints
for is unsupervised learning, in particular Generative Adversarial Networks
(GANs) [9]. This material appeared first in [24] albeit with different nota-
tion. We repeat it here with the current notation for the sake of complete-
ness.
Generative Adversarial Networks consist of two networks, a generator
and a discriminator. The target is to train the generator’s output distribu-
tion pg (x) to be close to that of the true data pdata . We define a prior input
pz (z) on the generator input variables z and a mapping G(z; θG ) to the data
space where G is a differentiable function represented by a neural network
with parameters θG . We also define a second neural network (the discrimina-
tor) D(x; θD ), which outputs the probability that x came from the true data

distribution pdata rather than pg . We train D to maximize the probability of
assigning the correct label to both training examples and samples from the
generator G. Simultaneously, we train G to minimize log(1 − D(G(z))). We
can express the adversarial nature of the relation between D and G as the
two-player min-max game with value function V (D, G):

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))].    (7)

The min-max problem can be formulated as a bilevel minimization problem for the discriminator and the generator using the objective functions −E_{x∼p_data(x)}[log D(x)] and −E_{z∼p_z(z)}[log(D(G(z)))] respectively. The mod-
ification of the objective function for the generator has been suggested to
avoid early saturation of log(1 − D(G(z))) due to the faster training of the
discriminator [9]. On the other hand, while this modification avoids satu-
ration, the well-documented instability of GAN training appears [1]. Even
though the min-max game can be formulated as a bilevel minimization prob-
lem, in practice the discriminator and generator neural networks are usually
updated iteratively.
We are interested in training the generator G to represent the flow map
of the dynamical system. That means that if z is the state of the system
at a time instant t, we would like to train the generator G to produce
as output G(z), an accurate estimate of the state of the system at time
t + ∆t. In [24] we have presented a way to enforce constraints in the output
of the generator G that respects the game-theoretic setup of GANs. We
can do so by augmenting the input of the discriminator with the constraint residuals, i.e. how well a sample satisfies the constraints. Of course, such
an augmentation of the discriminator input should be applied both to the
generator-created samples as well as the samples from the true distribution.
This means that we consider a two-player min-max game with the modified
value function

min_G max_D V^constraints(D, G) = E_{x∼p_data(x)}[log D(x, ε_D(x))] + E_{z∼p_z(z)}[log(1 − D(G(z), ε_G(z)))],    (8)

where ε_D(x) is the constraint residual for the true sample and ε_G(z) is the constraint residual for the generator-created sample. Note that in our setup,
the generator input distribution pz (z) will be from the noise cloud around
the training trajectory. On the other hand, the true data distribution pdata
is the distribution of values of the (noiseless) training trajectory.

As explained in [24], taking the constraint residual ε_D(x) to be zero for the true samples can exacerbate the well-known saturation (instability) issue with training GANs. Thus, we take ε_D(x) to be a random variable with mean zero and small variance dictated by Monte-Carlo or other numerical/mathematical/physical considerations. On the other hand, for ε_G(z) we can use the constraint we want to enforce. For example, for the constraint based on the forward Euler scheme with diagonal linear error-correcting term, we take ε_j^G(z) = G_j(z) − z_j − ∆t f_j(z) + ∆t a_j z_j for j = 1, . . . , M, where z is a sample from the noise cloud around a point of the training time series data. The expression for the constraint residual ε_j^G(z) can be easily generalized for more elaborate numerical methods and error-correcting
terms.
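A minimal PyTorch sketch of this discriminator-input augmentation follows; the network sizes, the residual noise level for the true samples and the helper names are illustrative assumptions rather than the paper's settings, and the alternating generator/discriminator updates are omitted.

import torch

M, dt = 3, 1.5e-2
G = torch.nn.Sequential(torch.nn.Linear(M, 20), torch.nn.Tanh(), torch.nn.Linear(20, M))
# The discriminator sees a sample together with its M constraint residuals: input size 2*M.
D = torch.nn.Sequential(torch.nn.Linear(2 * M, 20), torch.nn.Tanh(),
                        torch.nn.Linear(20, 1), torch.nn.Sigmoid())
a = torch.nn.Parameter(torch.zeros(M))        # error-correction coefficients a_j
f = lambda x: -x                              # placeholder right-hand side

def residual_generated(z, Gz):
    # epsilon_G(z): forward Euler constraint residual for a generated sample.
    return Gz - z - dt * f(z) + dt * a * z

def residual_true(x, scale=1e-3):
    # epsilon_D(x): zero-mean noise with small variance instead of exactly zero.
    return scale * torch.randn_like(x)

z = torch.randn(64, M)                        # points from the noise cloud (placeholder)
x_true = torch.randn(64, M)                   # noiseless trajectory points (placeholder)
Gz = G(z)
d_true = D(torch.cat([x_true, residual_true(x_true)], dim=1))
d_fake = D(torch.cat([Gz, residual_generated(z, Gz)], dim=1))
loss_D = -(torch.log(d_true) + torch.log(1.0 - d_fake)).mean()   # discriminator objective
loss_G = -torch.log(d_fake).mean()                               # non-saturating generator objective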

2.3 Reinforcement learning - Actor-critic methods


The third and final mode of learning for which we will examine how to
enforce constraints for time series prediction is reinforcement learning and
in particular Actor-Critic (AC) methods [25, 10, 21, 15, 19]. We will also
present a novel approach based on homotopy in order to stabilize and accel-
erate training.

2.3.1 General setup of AC methods


We will begin with the general setup of an AC method and then provide the
necessary modifications to turn it into a computational device for training
flow map neural network representations. The setup consists of an agent (ac-
tor) interacting with an environment in discrete timesteps. At each timestep
t the agent is supplied with an observation of the environment and the agent
state st . Based on the state st it takes an action at and receives a scalar re-
ward rt . An agent’s behavior is based on an action policy π, which is a map
from the states to a probability distribution over the actions, π : S → Pπ (A)
where S is the state space and A is the action space. We also need to specify
an initial state distribution p0 (s0 ), the transition function P(st+1 |st , at ) and
the reward distribution R(st , at ).
The aim of an AC method is to learn in tandem an action-value function
(critic)

Q^π(s_t, a_t) = E_{s_{t+k+1}∼P, r_{t+k}∼R, a_{t+k+1}∼π}[ Σ_{k=0}^{∞} γ^k r_{t+k} | s_t, a_t ]    (9)

and an action policy that is optimal for the action-value function

π* = argmax_π E_{s_0∼p_0, a_0∼π}[Q^π(s_0, a_0)].    (10)

The parameter γ ∈ [0, 1] is called the discount factor and it expresses the
degree of trust in future actions. Eq. (9) can be rewritten in a recursive
manner as

Q^π(s_t, a_t) = E_{r_t∼R, s_{t+1}∼P}[r_t + γ E_{a_{t+1}∼π}[Q^π(s_{t+1}, a_{t+1})]]    (11)

which is called the Bellman equation. Thus, the task of finding the action-
value function is equivalent to solving the Bellman equation. We can solve
the Bellman equation by reformulating it as an optimization problem

Q^π = argmin_Q E_{s_t, a_t∼π}[(Q(s_t, a_t) − y_t)^2]    (12)

where

y_t = E_{r_t∼R, s_{t+1}∼P, a_{t+1}∼π}[r_t + γ Q(s_{t+1}, a_{t+1})]    (13)
is called the target. In (12), instead of the square of the distance of the
action-value function from the target, we could have used any other di-
vergence that is positive except when the action-value function and target are equal [19]. Using the objective functions E_{s_t, a_t∼π}[(Q(s_t, a_t) − y_t)^2] and −E_{s_0∼p_0, a_0∼π}[Q^π(s_0, a_0)] for the action-value function and action policy respectively, we can express the task of reinforcement learning also as a bilevel
minimization problem [19]. However, as in the case of GANs discussed
before, in practice the action-value function and action policy are usually
updated iteratively.
Before we adapt the AC setup to our task of enforcing constraints for
time series prediction we will focus on two special choices: (i) the use of
deterministic target policies and (ii) the use of neural networks to represent
both the action-value function and the action policy [21, 15].
We start with the effect of using a deterministic target policy denoted
as µ : S → A. Then, the Bellman equation (11) can be written as

Q^µ(s_t, a_t) = E_{r_t∼R, s_{t+1}∼P}[r_t + γ Q^µ(s_{t+1}, µ(s_{t+1}))]    (14)

Note that the use of a deterministic target policy for a_{t+1} has allowed us to drop the expectation with respect to a_{t+1} that appeared in (13) and find

y_t = E_{r_t∼R, s_{t+1}∼P}[r_t + γ Q(s_{t+1}, µ(s_{t+1}))].    (15)

Also, note that the expectations in (14) and (15) depend only on the environ-
ment. This means that it is possible to learn Qµ off-policy, using transitions
that are generated from a different stochastic behavior policy β. We can
rewrite the optimization problem (12)-(13) as

Q^µ = argmin_Q E_{s_t∼ρ^β, a_t∼β, r_t∼R}[(Q(s_t, a_t) − y_t)^2]    (16)

where

y_t = r_t + γ Q(s_{t+1}, µ(s_{t+1})).    (17)
The state visitation distribution ρβ is related to the policy β. We will use
below this flexibility to introduce our noise cloud around the training tra-
jectory.
We continue with the effect of using neural networks to represent both
the action-value function and the policy. We restrict attention to the case of
a deterministic policy since this will be the type of policy we will use later
for our time series prediction application. To motivate the introduction of
neural networks we begin with the concept of Q-learning as a way to learn the
action-value function and the policy [27, 17]. In Q-learning, the optimization
problem (16)-(17) to find the action-value function is coupled with the greedy
policy estimate µ(s) = argmax_a Q(s, a). Thus, the greedy policy requires an optimization at every timestep. This can become prohibitively costly for
the type of action spaces that are encountered in many applications. This
has led to (i) the adoption of (deep) neural networks for the representation
of the action-value function and the policy and (ii) the update of the neural
network for the policy after each Q-learning iteration for the action-value
function [21].
We assume that the action-value function Q(st , at |θQ ) is represented by
a neural network with parameter vector θQ and the deterministic policy
µ(s|θµ ) by a neural network with parameter vector θµ . The deterministic
policy gradient algorithm [21] uses (16)-(17) to learn Q(st , at |θQ ). The policy
µ(s|θµ ) is updated after every iteration of the Q-optimization using the
policy gradient

∇_{θµ} E_{s_t∼ρ^β}[Q(s, a|θQ)|_{s=s_t, a=µ(s_t|θµ)}] = E_{s_t∼ρ^β}[∇_{θµ} Q(s, a|θQ)|_{s=s_t, a=µ(s_t|θµ)}]    (18)

which can be computed through the chain rule [21, 15].
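The following PyTorch fragment sketches one iteration of the resulting deterministic actor-critic update, i.e. a critic regression onto the target (16)-(17) followed by a policy step along (18); the networks, optimizers and the transition batch are placeholders, and the target networks discussed in Section 3.3 are omitted here for brevity.

import torch

state_dim, action_dim, gamma = 3, 3, 1.0
Q = torch.nn.Sequential(torch.nn.Linear(state_dim + action_dim, 20),
                        torch.nn.Tanh(), torch.nn.Linear(20, 1))
mu = torch.nn.Sequential(torch.nn.Linear(state_dim, 20),
                         torch.nn.Tanh(), torch.nn.Linear(20, action_dim))
opt_Q = torch.optim.Adam(Q.parameters(), lr=1e-3)
opt_mu = torch.optim.Adam(mu.parameters(), lr=1e-4)

# Placeholder off-policy transition batch (s_t, a_t, r_t, s_{t+1}).
s, act, r, s_next = (torch.randn(32, state_dim), torch.randn(32, action_dim),
                     torch.randn(32, 1), torch.randn(32, state_dim))

# Critic update, eqs. (16)-(17): regress Q(s_t, a_t) onto the target y_t.
with torch.no_grad():
    y = r + gamma * Q(torch.cat([s_next, mu(s_next)], dim=1))
critic_loss = ((Q(torch.cat([s, act], dim=1)) - y) ** 2).mean()
opt_Q.zero_grad(); critic_loss.backward(); opt_Q.step()

# Policy update, eq. (18): ascend the action-value function along theta_mu.
actor_loss = -Q(torch.cat([s, mu(s)], dim=1)).mean()
opt_mu.zero_grad(); actor_loss.backward(); opt_mu.step()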

2.3.2 AC methods for time series prediction and enforcing constraints
We explain now how an AC method can be used to train the flow map of a
dynamical system. In addition, we provide a way for enforcing constraints
during training.
We begin by identifying the state st with the state of the dynamical
system at time t. Also, we identify the discrete timesteps with the iterations
of the flow map that advance the state of the dynamical system by ∆t units
in time. The action policy µ(st |θµ ) is the action that needs to be taken to
bring the state of the system from st to st+1 . However, instead of learning
separately the action policy that results in st being mapped to st+1 , we can
identify the policy µ(st |θµ ) with the state st+1 i.e. µ(st |θµ ) = st+1 . In this
way, training for the policy µ(st |θµ ) is equivalent to training for the flow
map of the dynamical system.
We also take advantage of the off-policy aspect of (16)-(17) to choose
the distribution of states ρβ to be the one corresponding to the noise cloud
around the training trajectory needed to implement the error-correction.
Thus, we see that the intrinsic statistical nature of the AC method fits well with our approach to error-correction.
To complete the specification of the AC method as a method for training
the flow map of a dynamical system we need to specify the reward function.
The specification of the reward function is an important aspect of AC meth-
ods and reinforcement learning in general [23, 11]. We have chosen a simple
negative reward function. To conform with the notation from Sections 2.1
and 2.2 we specify the reward function as
r(z, x) = − Σ_{j=1}^{M} (µ_j(z) − x_j^data)^2,    (19)

where z is a point on the noise cloud at time t, µ_j(z) is the j-th component of its image through the flow map and x_j^data is the j-th component of the noiseless point on the training trajectory at time t + ∆t that it is
mapped to. Similarly, for the case when we want to enforce constraints e.g.
diagonal linear representation for the error-correcting term we can define
the reward function as
r(z, x) = − Σ_{j=1}^{M} [ (µ_j(z) − x_j^data)^2 + (µ_j(z) − z_j − ∆t f_j(z) + ∆t a_j z_j)^2 ].    (20)

For each time t, the reward function that we have chosen uses information
only from the state of the system at time t and t + ∆t. Of course, how

much credit we assign to this information is determined by the value of the
discount factor γ.
If γ = 0, then we disregard any information beyond time t + ∆t. In this
case, the AC method becomes a supervised learning method in disguise.
In fact, from (16)-(18) we see that when γ = 0, the task of maximizing the
action-value function is equivalent to maximizing the reward. The maximum
value for our negative reward (19) is 0 which is the optimal value for the
supervised learning loss function (5) (the average over the noise cloud in (5)
is the same as the average w.r.t. to ρβ ). A similar conclusion holds for the
case of the reward with constraints (20) and the supervised learning loss
function with constraints (6).
If on the other hand we set γ = 1, we assign equal importance to current
and future rewards. This corresponds to the case where the environment is
deterministic and thus the same actions result in the same rewards.
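For concreteness, a hypothetical NumPy version of the constrained reward (20); mu_z is the flow-map (policy) output for a noise-cloud point z, x_data the matching noiseless trajectory point, f the right-hand side of the dynamical system and a the vector of error-correction coefficients.

import numpy as np

def reward(mu_z, z, x_data, f, a, dt):
    # Reward (20): negative of the data misfit plus the forward Euler
    # constraint residual with the diagonal linear error-correction term.
    data_term = np.sum((mu_z - x_data) ** 2)
    residual = mu_z - z - dt * f(z) + dt * a * z
    return -(data_term + np.sum(residual ** 2))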

2.3.3 Homotopy for the action-value function


AC methods utilizing deep neural networks to represent the action-value
function and the policy have proven to be difficult to train due to instabili-
ties. As a result, a lot of techniques have been devised to stabilize training
(see e.g. [19] and references therein for a review of stabilizing techniques).
In our numerical experiments we tried some of these techniques but could not get satisfactory training results for either the action-value function or the policy. This is the reason we developed a novel approach based on
homotopy which indeed resulted in successful training (see results in Section
3.3).
To motivate our approach we examine the case when γ = 0, although
similar arguments hold for the other values of γ. As we have discussed in
the previous section, when γ = 0, the AC method is a supervised learning
method in disguise. In fact, the AC method tries to achieve the same result
as a supervised learning method but does it in a rather inefficient way. If we
look at (16)-(18), we see that in the action-value update (which is effected
through (16)), the AC method tries to minimize the distance between the
action-value function and the reward function. Then, in the action policy
update step (which is effected through the use of (18)), the AC method
tries to maximize the action-value function. In essence, through this two-
step procedure, the AC method tries to maximize the reward but does so in
a roundabout way.
If we think of a plane (in function space) where on one axis we have
the action-value function Q(st , at ) and on the other the reward rt , we are

trying to find the point on the line Q(st , at ) = rt which maximizes Q(st , at ).
But hitting this line from a random initialization of the neural networks for
Q(st , at ) and of the policy µ(st ) is extremely unlikely. We would be better
off if we started our optimization from a point on the line and then look for
the maximum of Q(st , at ). In other words, for the case of γ = 0, we have
a better chance of training accurately if we let Q(st , at ) = rt in (18). A
similar argument for the case γ ≠ 0 shows why we will have a better chance
of training if we let Q(st , at ) = rt + γQ(st+1 , µ(st+1 )) in (18).
There is an extra mathematical reason why the identification Q(st , at ) =
rt + γQ(st+1 , µ(st+1 )) can result in better training. Recall from (20) that
the reward function rt contains all the information from the training trajec-
tory and the constraints we wish to enforce. In addition, rt depends on the
parameter vector θµ for the neural network that represents the action policy
µ. Thus, when we use the expression rt + γQ(st+1 , µ(st+1 )) in (18) for the
update step of θµ , we back-propagate directly the available information from
the training trajectory and the constraints to θµ . This is because we differ-
entiate directly rt w.r.t. θµ . On the other hand, in the original formulation
we do not differentiate rt at all, because there rt appears only in the update
step for the action-value function. That update step involves differentiation
w.r.t. the action-value function parameter vector θQ but not θµ .
Of course, if we make the identification Q(st , at ) = rt +γQ(st+1 , µ(st+1 ))
in (18) we have modified the original problem. The question is how is
the solution to the modified problem related to the original one. Through
algebraic inequalities, one can show that the optimum for Q(st , at ) for the
modified problem provides a lower bound on the optimum for the original
problem. It can also provide an upper bound if we make extra assumptions
about the difference Q(st , at )−rt −γQ(st+1 , µ(st+1 )) e.g. the convex-concave
assumptions appearing in the min-max theorem [18].
To avoid the need for such extra assumptions, we have developed an
alternative approach. We initialize the training procedure with the identifi-
cation Q(st , at ) = rt + γQ(st+1 , µ(st+1 )) in (18). As the training progresses
we morph the modified problem back to the original one via homotopy. In
particular, we use in (18) instead of Q(st , at ) the expression

δ × Q(st , at ) + (1 − δ) × [rt + γQ(st+1 , µ(st+1 ))], (21)

where δ is the homotopy parameter. A user-defined schedule evolves δ during training from 0 (modified problem) to 1 (original problem). The accuracy
of the training is of course dependent on the schedule for δ. However, in
our numerical experiments we obtained good results without the need for a

very refined schedule. One general rule of thumb is that the schedule should
be slower for larger values of γ i.e. allow more iterations between increases
in the value of δ. This is to be expected because for larger values of γ, the
influence of rt in the optimization of rt +γQ(st+1 , µ(st+1 )) is reduced. Thus,
it is more difficult to back-propagate the information from rt to the action
policy parameter vector θµ . However, note that larger values of γ allow us
to take more into account future rewards, thus allowing the AC method to
be more versatile.
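The PyTorch fragment below sketches how the homotopy (21) can enter the policy update step; recomputing the reward from µ(s) inside the objective is what lets the training data and constraints back-propagate directly to θµ while δ < 1. The reward_fn signature and the linear schedule for δ are our own illustrative assumptions.

import torch

def actor_objective(Q, mu, reward_fn, s, s_next, gamma, delta):
    # Quantity maximized in the policy update with the homotopy blend (21):
    # delta * Q(s, mu(s)) + (1 - delta) * [r + gamma * Q(s', mu(s'))].
    a_s = mu(s)
    r = reward_fn(a_s, s)                    # reward depends on the policy output
    q_orig = Q(torch.cat([s, a_s], dim=1))
    q_modified = r + gamma * Q(torch.cat([s_next, mu(s_next)], dim=1))
    return (delta * q_orig + (1.0 - delta) * q_modified).mean()

def delta_schedule(iteration, ramp_iters=10000):
    # Illustrative schedule: delta ramps linearly from 0 (modified problem)
    # to 1 (original problem); larger gamma would call for a slower ramp.
    return min(1.0, iteration / ramp_iters)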

3 Numerical results
We use the example of the Lorenz system to illustrate the constructions
presented in Sections 1 and 2.
The Lorenz system is given by
dx_1/dt = σ(x_2 − x_1)    (22)
dx_2/dt = ρx_1 − x_2 − x_1x_3    (23)
dx_3/dt = x_1x_2 − βx_3    (24)
where σ, ρ and β are positive. We have chosen for the numerical experiments
the commonly used values σ = 10, ρ = 28 and β = 8/3. For these values
of the parameters the Lorenz system is chaotic and possesses an attractor
for almost all initial points. We have chosen the initial condition x1 (0) = 0,
x2 (0) = 1 and x3 (0) = 0.
We have used as training data the trajectory that starts from the spec-
ified initial condition and is computed by the Euler scheme with timestep
δt = 10−4 . In particular, we have used data from a trajectory for t ∈ [0, 3].
For all three modes of learning, we have trained the neural network to rep-
resent the flow map with timestep ∆t = 1.5×10−2 i.e. 150 times larger than
the timestep used to produce the training data. After we trained the neural
network that represents the flow map, we used it to predict the solution for
t ∈ [0, 9]. Thus, the trained flow map’s task is to predict (through iterative
application) the whole training trajectory for t ∈ [0, 3] starting from the
given initial condition and then keep producing predictions for t ∈ (3, 9].
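A NumPy sketch of how such a training trajectory can be generated under the stated choices: integrate the Lorenz system with the forward Euler scheme at δt = 10^−4 over [0, 3] and keep every 150th state so that consecutive stored points are ∆t = 1.5 × 10^−2 apart.

import numpy as np

def lorenz_rhs(x, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    x1, x2, x3 = x
    return np.array([sigma * (x2 - x1), rho * x1 - x2 - x1 * x3, x1 * x2 - beta * x3])

def euler_trajectory(x0, dt, t_final):
    # Forward Euler trajectory used as ground truth / training data.
    n_steps = int(round(t_final / dt))
    traj = np.empty((n_steps + 1, 3))
    traj[0] = x0
    for n in range(n_steps):
        traj[n + 1] = traj[n] + dt * lorenz_rhs(traj[n])
    return traj

fine_traj = euler_trajectory(np.array([0.0, 1.0, 0.0]), dt=1e-4, t_final=3.0)
training_points = fine_traj[::150]      # states Delta t = 1.5e-2 apart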
This is a severe test of the learned flow map’s predictive abilities for
four reasons. First, due to the chaotic nature of the Lorenz system there is
no guarantee that the flow map can correct its errors so that it can follow
closely the training trajectory even for the interval [0, 3] used for training.

Second, by extending the interval of prediction beyond the one used for
training we want to check whether the neural network has actually learned
the map of the Lorenz system and is not just overfitting the training data.
Third, we have chosen an initial condition that is far away from the attractor
but our integration interval is long enough so that the system does reach
the attractor and then evolves on it. In other words, we want the neural
network to learn both the evolution of the transient and the evolution on the
attractor. Fourth, we have chosen to train the neural network to represent
the flow map corresponding to a much larger timestep than the one used
to produce the training trajectory in order to check the ability of the error-
correcting term to account for a significant range of unresolved timescales
(relative to the training trajectory).
We performed experiments with different values for the various param-
eters that enter in our constructions. We present here indicative results
for the case of N = 2 × 10^4 samples (N/3 for training, N/3 for validation and N/3 for testing). We have chosen Ncloud = 100 for the cloud of points around each input. Thus, the timestep ∆t = 1.5 × 10^−2. This is because there are 20000/100 = 200 time instants in the interval [0, 3] at a distance ∆t = 3/200 = 1.5 × 10^−2 apart.
The noise cloud for the neural network at a time t was constructed using the point x_i(t), i = 1, 2, 3, on the training trajectory and adding random disturbances so that it becomes the collection x_il(t) = x_i(t)(1 − Rrange + 2Rrange × ξ_il), where l = 1, . . . , Ncloud. The random variables ξ_il ∼ U[0, 1] and Rrange = 2 × 10^−2. As we have explained before, we want to train the neural network to map the input from the noise cloud at a time t to the noiseless point x_i(t + ∆t) (for i = 1, 2, 3) on the training trajectory at time t + ∆t.
We have to also motivate the value of Rrange for the range of the noise
cloud. Recall that the training trajectory was computed with the Euler
scheme which is a first-order scheme. For the interval ∆t = 1.5 × 10−2 we
expect the error committed by the flow map to be of similar magnitude and
thus we should accommodate this error by considering a cloud of points
within this range. We found that taking Rrange slightly larger and equal to
2 × 10−2 helps the accuracy of the training.
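A small NumPy sketch of this noise cloud construction with the stated choices Ncloud = 100 and Rrange = 2 × 10^−2 (the function name is ours):

import numpy as np

def noise_cloud(x_t, n_cloud=100, r_range=2e-2, rng=None):
    # Multiplicative uniform disturbances around a trajectory point x(t):
    # each cloud point is x(t) * (1 - r_range + 2 * r_range * xi), xi ~ U[0, 1].
    rng = np.random.default_rng() if rng is None else rng
    xi = rng.uniform(size=(n_cloud, x_t.size))
    return x_t * (1.0 - r_range + 2.0 * r_range * xi)

# Each cloud point at time t is paired with the noiseless state at t + Delta t
# as the training target for the flow map.
cloud = noise_cloud(np.array([1.2, -3.4, 10.0]))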
We denote by (F1 (z), F2 (z), F3 (z)) the output of the neural network flow
map for an input z. This corresponds to (G1 (z), G2 (z), G3 (z)) for the no-
tation of Section 2.1 (supervised learning) and 2.2 (unsupervised learning)
and to (µ1 (z), µ2 (z), µ3 (z)) for the notation of Section 2.3.
As explained in detail in [24], we employ a learning rate schedule that
we have developed and which uses the relative error of the neural network

flow map. For a mini-batch of size m, we define the relative error as
RE_m = (1/m) Σ_{j=1}^{m} (1/3) [ |F_1(z_j) − x_1(t_j + ∆t)|/|x_1(t_j + ∆t)| + |F_2(z_j) − x_2(t_j + ∆t)|/|x_2(t_j + ∆t)| + |F_3(z_j) − x_3(t_j + ∆t)|/|x_3(t_j + ∆t)| ],

where (F_1(z_j), F_2(z_j), F_3(z_j)) is the neural network flow map prediction at t_j + ∆t for the input vector z_j = (z_{j1}, z_{j2}, z_{j3}) from the noise cloud at time t_j. Also, (x_1(t_j + ∆t), x_2(t_j + ∆t), x_3(t_j + ∆t)) is the point on the training trajectory computed by the Euler scheme with δt = 10^−4. The tolerance for the relative error was set to TOL = 1/√(N/3) = 1/√(2 × 10^4/3) ≈ 0.0122 (see [24] for more details about TOL). For the mini-batch size we have chosen
m = 1000 for the supervised and unsupervised cases and m = 33 for the
reinforcement learning case.
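A NumPy sketch of the relative error RE_m mirroring the formula above (the learning-rate schedule that uses it is described in [24] and is not reproduced here):

import numpy as np

def relative_error(pred, truth):
    # RE_m: mean over the mini-batch of the component-averaged relative error.
    # pred, truth: arrays of shape (m, 3) holding F(z_j) and x(t_j + dt).
    return np.mean(np.abs(pred - truth) / np.abs(truth))

# Tolerance quoted above: TOL = 1/sqrt(N/3) for N = 2e4.
TOL = 1.0 / np.sqrt(2e4 / 3.0)          # approximately 0.0122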
We also need to specify the constraints that we want to enforce. Using
the notation introduced above, we want to train the neural network flow
map so that its output (F1 (zj ), F2 (zj ), F3 (zj )) for an input data point zj =
(zj1 , zj2 , zj3 ) from the noise cloud satisfies
F1 (zj ) = zj1 + ∆t[σ(zj2 − zj1 )] − ∆ta1 zj1 (25)
F2 (zj ) = zj2 + ∆t[ρzj1 − zj2 − zj1 zj3 ] − ∆ta2 zj2 (26)
F3 (zj ) = zj3 + ∆t[zj1 zj2 − βzj3 ] − ∆ta3 zj3 (27)
where a1 , a2 and a3 are parameters to be optimized during training. The
first two terms on the RHS of (25)-(27) come from the forward Euler scheme,
while the third is the diagonal linear error-correcting term.

3.1 Supervised learning


We begin the presentation of results with the case of supervised learning. Our
aim in this subsection is threefold: (i) show that the explicit enforcing of
the constraints is better than the implicit one, (ii) show that the addition of
noise to the training trajectory is beneficial and (iii) show that the addition
of error-correcting terms to the constraints can be beneficial even if we use
the noiseless trajectory. The latter point highlights once again the promising influx of ideas from model reduction into predictive machine learning.

3.1.1 Implicit versus explicit constraint enforcing


We used a deep neural network for the representation of the flow map with 10
hidden layers of width 20. We note that because the solution of the Lorenz

system acquires values outside the range of the activation function, we have removed the activation function from the last layer of the generator (alternatively we could have used batch normalization and kept the activation
function). Fig. 1 compares the evolution of the prediction for x1 (t) of the
neural network flow map starting at t = 0 and computed with a timestep
∆t = 1.5 × 10−2 to the ground truth (training trajectory) computed with
the forward Euler scheme with timestep δt = 10−4 . We show plots only for
x1 (t) since the results are similar for x2 (t) and x3 (t). We want to make
two observations.
First, the prediction of the neural network flow map is able to follow
with adequate accuracy the ground truth not only during the interval [0, 3]
that was used for training, but also during the interval (3, 9]. Second, the
explicit enforcing of constraints i.e. the enforcing of the constraints (25)-(27)
(see results in Fig. 1(b)) is better than the implicit enforcing of constraints.

Figure 1: Supervised learning. Comparison of ground truth for x1(t) computed with the Euler scheme with timestep δt = 10^−4 (blue dots) and the neural network flow map prediction with timestep ∆t = 1.5 × 10^−2 (red crosses). (a) noisy data without enforced constraints during training; (b) noisy data with enforced constraints during training (see text for details).

3.1.2 Noisy versus noiseless training trajectory


We have advocated the use of a noisy version of the training trajectory in
order for the neural network flow map to be exposed to larger parts of the
phase space. The objective of such an exposure is to train the flow map to
know how to respond to points away from the training trajectory where it is
bound to wander due to the inevitable error committed through its repeated
application during prediction. In this subsection we present results which
corroborate our hypothesis.

Fig. 2 compares the predictions of neural networks trained with noisy
and noiseless training data. In addition, we perform such comparison both
for the case with enforced constraints during training and without enforced
constraints. Fig. 2(a) shows that when the constraints are not enforced
during training, the use of noisy data can have a significant impact. This is
along the lines of our argument that the data from a single training trajec-
tory are not enough by themselves to train the neural network accurately
for prediction purposes. Fig. 2(b) shows that when the constraints are en-
forced during training, the difference between the predictions based on noisy
and noiseless training data is reduced. However, using noisy data results in
better predictions for parts of the trajectory where there are rapid changes.
Also the use of noisy data helps the prediction to stay “in phase” with the
ground truth for longer times.
We have to stress that we conducted several numerical experiments and
the performance of the neural network flow map trained with noisy data
was consistently more robust than when it was trained with noiseless data.
A thorough comparison will appear in a future publication.

Figure 2: Supervised learning. Comparison of ground truth for x1(t) computed with the Euler scheme with timestep δt = 10^−4 (blue dots), the neural network flow map prediction with timestep ∆t = 1.5 × 10^−2 using noisy training data (red crosses) and the neural network flow map prediction with timestep ∆t = 1.5 × 10^−2 using noiseless training data (green triangles). (a) without enforced constraints during training; (b) with enforced constraints during training (see text for details).

3.1.3 Error-correction for training with noiseless trajectory


The results from Fig. 2(b) prompted us to examine in more detail the role
of the error-correction term in the case of training with noiseless data. In

particular, we would like to see how much of the predictive accuracy is due
to enforcing the forward Euler scheme alone i.e. set a1 = a2 = a3 = 0 in
(25)-(27) versus allowing a1 , a2 , a3 to be optimized during training.

Figure 3: Supervised learning. (a) Comparison of ground truth for x1(t) computed with the Euler scheme with timestep δt = 10^−4 (blue dots), the neural network flow map prediction with timestep ∆t = 1.5 × 10^−2 using noiseless training data and enforcing the Euler scheme without error-correction (red crosses) and the neural network flow map prediction with timestep ∆t = 1.5 × 10^−2 using noiseless training data and without enforcing any constraints (green triangles); (b) Comparison of ground truth for x1(t) computed with the Euler scheme with timestep δt = 10^−4 (blue dots), the neural network flow map prediction with timestep ∆t = 1.5 × 10^−2 using noiseless training data and enforcing the Euler scheme without error-correction (red crosses) and the neural network flow map prediction with timestep ∆t = 1.5 × 10^−2 using noiseless training data and enforcing the Euler scheme with error-correction (orange triangles) (see text for details).

Fig. 3(a) compares to the ground truth the prediction from the trained
neural network flow map when we do not enforce any constraints and the
prediction from the trained neural network flow map when we enforce only
the forward Euler part of the constraints (25)-(27) (a1 = a2 = a3 = 0).
We see that indeed, even if we enforce only the forward Euler part of the
constraint we obtain much more accurate results than not enforcing any
constraint at all.
Fig. 3(b) examines how the performance of the neural network is affected further if we also allow the error-correcting term in (25)-(27), i.e. optimize a1, a2, a3 during training. The inclusion of the error-correcting term allows
the solution to remain for longer “in phase” with the ground truth than if
the error-correction term is absent. This is expected since we are examining
the solution of the Lorenz system as it transitions from an initial condition

far from the attractor to the attractor and then evolves on it. While on
the attractor the solution remains oscillatory and bounded, so the main
error of the neural network flow map prediction comes from going “out of
phase” with the ground truth. The error-correcting term keeps the predicted
trajectory closer to the ground truth thus reducing the loss of phase. Recall
that the error-correcting term is one of the simplest possible. From our prior
experience with model reduction, we anticipate larger gains in accuracy if
we use more sophisticated error-correcting terms.
We want to stress again that training with noiseless data is significantly
less robust than training with noisy data. However, we have chosen to
present results of training with noiseless data that exhibit good prediction
accuracy to raise various issues that should be more thoroughly investigated.

3.2 Unsupervised learning


We continue with the case of unsupervised learning and in particular the
case of a GAN. We have used for the GAN generator a deep neural network
with 9 hidden layers of width 20 and for the discriminator a neural network
with 2 hidden layers of width 20. The numbers of hidden layers both for the
generator and the discriminator were chosen as the smallest that allowed
the GAN training to reach its game-theoretic optimum without at the same
time requiring large scale computations. Fig. 4 compares the evolution of
the prediction of the neural network flow map starting at t = 0 and computed
with a timestep ∆t = 1.5 × 10−2 to the ground truth (training trajectory)
computed with the forward Euler scheme with timestep δt = 10−4 .
Fig. 4(a) shows results for the implicit enforcing of constraints. We see
that this is not enough to produce a neural network flow map with long-
term predictive accuracy. Fig. 4(b) shows the significant improvement in the
predictive accuracy when we enforce the constraints explicitly. The results
for this specific example are not as good as in the case of supervised learning
presented earlier. We note that training a GAN with or without constraints
is a delicate numerical task as explained in more detail in [24]. One needs
to find the right balance between the expressive strengths of the generator
and the discriminator (game-theoretic optimum) to avoid instabilities but
also train the neural network flow map i.e. the GAN generator, so that it
has predictive accuracy.
We also note that training with noiseless data is even more brittle. For
the very few experiments where we avoided instability the predicted solution
from the trained GAN generator was not accurate at all.


Figure 4: Unsupervised learning (GAN). Comparison of ground truth for x1(t)
computed with the Euler scheme with timestep δt = 10−4 (blue dots) and the neural
network flow map (GAN generator) prediction with timestep ∆t = 1.5 × 10−2 (red
crosses). (a) noisy data without enforced constraints during training; (b) noisy data
with enforced constraints during training (see text for details).

3.3 Reinforcement learning


The last case we examine is that of reinforcement learning. In particular,
we want to see how an actor-critic method performs in the difficult case
when the discount factor γ = 1. We repeat that γ = 1 corresponds to
the case of a deterministic environment which means that the same actions
always produce the same rewards. This is the situation in our numerical
experiments where we are given a training trajectory that does not change.
We have conducted more experiments for other values of γ but a detailed
presentation of those results will await a future publication.
For the representation of the action-value function we used a deep neural
network with 15 hidden layers of width 20. For the representation of the
deterministic action policy, i.e., the neural network flow map in our parlance,
we used a deep neural network with 10 hidden layers of width 20. The task
of learning an accurate representation of the action-value function is more
difficult than that of finding the action policy. This justifies the need for a
stronger network to represent the action-value function.
As we have mentioned in Section 2.3, researchers have developed various
modifications and tricks to stabilize the training of AC methods [19]. The
one that enabled us to stabilize results in the first place is that of target
networks [17, 15]. However, the predictive accuracy of the trained neural
network flow map, i.e., the action policy, was extremely poor unless we also
used our homotopy approach for the action-value function. This was true for

both cases of enforcing and of not enforcing the constraints explicitly during
training. With this in mind, we present results with and without the homotopy
approach for the action-value function to highlight the accuracy improvement
afforded by the use of homotopy.
Before we present the results we provide some details about the target
networks, the reward function and the specifics of the homotopy schedule.
The target network concept uses different networks to represent the action-
value function and the action policy that appear in the expression for the
target (17). In particular, if θ^Q and θ^µ are the parameter vectors for the
action-value function and the action policy respectively, then we use neural
networks with parameter vectors θ^{Q'} and θ^{µ'} (the target networks) to
evaluate the target expression (17). The vectors θ^{Q'} and θ^{µ'} can be
initialized with the same values as θ^Q and θ^µ but they evolve in a different
way. In fact, after every iteration update for θ^Q and θ^µ we apply the
update rule

θ^{Q'} ← τ θ^Q + (1 − τ) θ^{Q'}                                  (28)

θ^{µ'} ← τ θ^µ + (1 − τ) θ^{µ'}                                  (29)

where we have taken τ = 0.001 [15].
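In code, (28)-(29) amount to a Polyak average of the parameters applied after each
optimization step; a minimal PyTorch sketch (the module names critic, actor,
critic_target and actor_target are illustrative) is:

    import torch

    def soft_update(target_net, net, tau=0.001):
        """Polyak averaging of the target-network parameters, cf. (28)-(29)."""
        with torch.no_grad():
            for p_target, p in zip(target_net.parameters(), net.parameters()):
                p_target.mul_(1.0 - tau).add_(tau * p)

    # After every update of theta^Q and theta^mu:
    # soft_update(critic_target, critic)   # theta^{Q'} <- tau*theta^Q + (1-tau)*theta^{Q'}
    # soft_update(actor_target, actor)     # theta^{mu'} <- tau*theta^mu + (1-tau)*theta^{mu'}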


The reward function (with constraints) for an input point z from the
noise cloud is

r(z, x) = −[ Σ_{j=1}^{3} (µ_j(z) − x_j^{data})^2                          (30)

          + (µ_1(z) − z_1 − ∆t[σ(z_2 − z_1)] + ∆t a_1 z_1)^2              (31)

          + (µ_2(z) − z_2 − ∆t[ρz_1 − z_2 − z_1 z_3] + ∆t a_2 z_2)^2      (32)

          + (µ_3(z) − z_3 − ∆t[z_1 z_2 − βz_3] + ∆t a_3 z_3)^2 ]          (33)

where x^{data} is the noiseless point from the training trajectory. As we have
explained in Section 2.3 (see the comment after (17)), in the AC method
context the distribution of the noise cloud of the input data points at every
timestep corresponds to the state visitation distribution ρ^β appearing in
(16).
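A direct transcription of (30)-(33) into code is given below; the function signature,
the default Lorenz parameter values and the treatment of a1, a2, a3 as fixed inputs
are assumptions made for the sketch.

    def reward(z, x_data, mu, a, dt=1.5e-2, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
        """Constrained reward of (30)-(33): z is the (noisy) input state, x_data the
        noiseless next point on the training trajectory, mu the action proposed by
        the policy (i.e. the predicted next state) and a = (a1, a2, a3)."""
        data_misfit = sum((mu[j] - x_data[j]) ** 2 for j in range(3))
        r1 = (mu[0] - z[0] - dt * (sigma * (z[1] - z[0])) + dt * a[0] * z[0]) ** 2
        r2 = (mu[1] - z[1] - dt * (rho * z[0] - z[1] - z[0] * z[2]) + dt * a[1] * z[1]) ** 2
        r3 = (mu[2] - z[2] - dt * (z[0] * z[1] - beta * z[2]) + dt * a[2] * z[2]) ** 2
        return -(data_misfit + r1 + r2 + r3)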
The homotopy schedule we used is a rudimentary one that we did not
attempt to optimize. Obviously, this is a topic of further investigation that
will appear elsewhere. We initialized the homotopy parameter δ at 0, and
increased its value (until it reached 1) every 2000 iterations of the optimiza-
tion.
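A sketch of such a schedule is given below; the size of the increment and the
indicated placement of δ inside the critic target are assumptions made for the
sketch rather than a prescription of the exact construction.

    def homotopy_delta(iteration, step_every=2000, increment=0.1):
        """Piecewise-constant schedule: delta starts at 0 and is raised every
        step_every iterations until it saturates at 1 (increment is a hypothetical choice)."""
        return min(1.0, (iteration // step_every) * increment)

    # One possible (assumed) placement: scale the bootstrapped part of the critic target,
    #     y = r + delta * Q'(s', mu'(s')),
    # so that early training fits the rewards only and the full action-value target
    # is phased in gradually.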


Figure 5: Reinforcement learning (Actor-critic). Comparison of ground truth for
x1(t) computed with the Euler scheme with timestep δt = 10−4 (blue dots), the
neural network flow map prediction with timestep ∆t = 1.5 × 10−2 with homotopy
for the action-value function during training (red crosses) and the neural network
flow map prediction with timestep ∆t = 1.5 × 10−2 without homotopy for the
action-value function during training (green triangles). (a) noisy data without
enforced constraints during training; (b) noisy data with enforced constraints
during training (see text for details).

Fig. 5 presents results of the prediction performance of the neural network
flow map when it was trained with and without the use of homotopy for the
action-value function. In Fig. 5(a) we have results for the implicit enforcing
of constraints, while in Fig. 5(b) for the explicit enforcing of constraints.
We make two observations. First, both for implicit and explicit enforcing of
the constraints, the use of homotopy leads to accurate results for long times.
This is especially true for the case of explicit enforcing, which gave us some
of the best results from all the numerical experiments we conducted for the
different modes of learning. Second, if we do not use homotopy, the predictions
are extremely poor both for implicit and explicit enforcing. Indeed, the green
curve in Fig. 5(a), representing the prediction of x1(t) for the case of implicit
constraint enforcing without homotopy, is as inaccurate as it looks. It starts
at 0 and within a few steps drops to a negative value and does not change much
after that. The predictions for x2(t) and x3(t) are equally inaccurate.

4 Discussion and future work
We have presented a collection of results about the enforcing of known con-
straints for a dynamical system during the training of a neural network to
represent the flow map of the system. We have provided ways that the
constraints can be enforced in all three major modes of learning, namely
supervised, unsupervised and reinforcement learning. In line with the law
of scientific computing that one should build into an algorithm as much prior
information as possible, we observe a striking improvement in performance
when known constraints are enforced during training. We have
also shown the benefit of training with noisy data and how this corresponds
to the incorporation of a restoring force in the dynamics of the system. This
restoring force is analogous to memory terms appearing in model reduction
formalisms. In our framework, the reduction is in a temporal sense, i.e., it
allows us to construct a flow map that remains accurate even though it is
defined for large timesteps.
The model reduction connection opens an interesting avenue of research
that makes contact with complex systems appearing in real-world problems.
The ability to use larger timesteps for the neural network flow map than for
the ground truth, without sacrificing too much accuracy, is important. We can
imagine an online setting where observations come at sparsely placed time
instants and are used to update the parameters of the neural network flow
map. The use of sparse observations could be dictated by necessity, e.g., if
it is hard to obtain frequent measurements, or by efficiency, e.g., when the
local processing of data in field-deployed sensors is costly. Thus, if the
trained flow map is capable of accurate estimates using larger timesteps,
then successfully updating its training using only sparse observations
becomes more probable.
The constructions presented in the current work depend on a large num-
ber of details that can potentially affect their performance. A thorough
study of the relative merits of enforcing constraints for the different modes
of learning needs to be undertaken and will be presented in a future pub-
lication. We do believe though that the framework provides a promising
research direction at the nexus of scientific computing and machine learn-
ing.

5 Acknowledgements
The author would like to thank Court Corley, Tobias Hagge, Nathan Hodas,
George Karniadakis, Kevin Lin, Paris Perdikaris, Maziar Raissi, Alexandre

Tartakovsky, Ramakrishna Tipireddy, Xiu Yang and Enoch Yeung for help-
ful discussions and comments. The work presented here was partially sup-
ported by the PNNL-funded “Deep Learning for Scientific Discovery Agile
Investment” and the DOE-ASCR-funded ”Collaboratory on Mathematics
and Physics-Informed Learning Machines for Multiscale and Multiphysics
Problems (PhILMs)”. Pacific Northwest National Laboratory is operated by
Battelle Memorial Institute for DOE under Contract DE-AC05-76RL01830.

References
[1] Martin Arjovsky and Léon Bottou. Towards principled meth-
ods for training Generative Adversarial Networks. arXiv preprint
arXiv:1701.04862, 2017.

[2] Nathan Baker, Frank Alexander, Timo Bremer, Aric Hagberg, Yan-
nis Kevrekidis, Habib Najm, Manish Parashar, Abani Patra, James
Sethian, Stefan Wild, and Karen Willcox. Workshop report on basic
research needs for scientific machine learning: Core technologies for
artificial intelligence. 2019.

[3] Grigory I Barenblatt. Scaling. Cambridge University Press, 2003.

[4] Tyrus Berry, Dimitrios Giannakis, and John Harlim. Nonparamet-


ric forecasting of low-dimensional dynamical systems. Phys. Rev. E,
91:032915, Mar 2015.

[5] Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, and David


Duvenaud. Neural ordinary differential equations. arXiv preprint
arXiv:1806.07366v3, 2018.

[6] Alexandre J Chorin and Panagiotis Stinis. Problem reduction, renor-


malization and memory. Communications in Applied Mathematics and
Computational Science, 1:1–27, 2007.

[7] L. Felsberger and P.S. Koutsourelakis. Physics-constrained, data-


driven discovery of coarse-grained dynamics. arXiv preprint
arXiv:1802.03824v1, 2018.

[8] N Goldenfeld. Lectures on Phase Transitions and the Renormalization


Group. Perseus Books, 1992.

[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Gen-
erative adversarial nets. Advances in neural information processing sys-
tems, pages 2672–2680, 2014.

[10] Ivo Grondman, Lucian Busoniu, Gabriel AD Lopes, and Robert


Babuska. A survey of actor-critic reinforcement learning: Standard
and natural policy gradients. IEEE Transactions on Systems, Man,
and Cybernetics, Part C (Applications and Reviews), 42(6):1291–1307,
2012.

[11] Xiaoxiao Guo, Satinder Singh, Richard Lewis, and Honglak Lee. Deep
learning for reward design to improve monte carlo tree search in atari
games. In Proceedings of the Twenty-Fifth International Joint Confer-
ence on Artificial Intelligence, IJCAI’16, pages 1519–1525. AAAI Press,
2016.

[12] Ernst Hairer, S.P. Nørsett, and G. Wanner. Solving Ordinary Differ-
ential Equations I. Springer, NY, 1987.

[13] Jiequn Han, Arnulf Jentzen, and Weinan E. Solving high-dimensional


partial differential equations using deep learning. Proceedings of the
National Academy of Sciences, 115(34):8505–8510, 2018.

[14] Alberto Isidori. Nonlinear Control Systems. Springer, London, 1995.

[15] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas


Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra.
Continuous control with deep reinforcement learning. arXiv preprint
arXiv:1509.02971, 2015.

[16] Huanfei Ma, Siyang Leng, Kazuyuki Aihara, Wei Lin, and Luonan
Chen. Randomly distributed embedding making short-term high-
dimensional data predictable. Proceedings of the National Academy
of Sciences, 115(43):E9994–E10002, 2018.

[17] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu,


Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, An-
dreas K. Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie,
Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran,
Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level con-
trol through deep reinforcement learning. Nature, 518(7540):529–533,
February 2015.

[18] Martin J Osborne et al. An introduction to game theory, volume 3.
Oxford University Press New York, 2004.

[19] David Pfau and Oriol Vinyals. Connecting generative adversarial net-
works and actor-critic methods. arXiv preprint arXiv:1610.01945, 2016.

[20] Maziar Raissi, Paris Perdikaris, and George Karniadakis. Numerical


Gaussian processes for time-dependent and nonlinear partial differential
equations. SIAM J. Sci. Comput., 40:A172–A198, 2018.

[21] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra,
and Martin Riedmiller. Deterministic policy gradient algorithms. In
ICML, 2014.

[22] Justin Sirignano and Konstantinos Spiliopoulos. DGM: A deep learning


algorithm for solving partial differential equations. Journal of Compu-
tational Physics, 375:1339 – 1364, 2018.

[23] Jonathan Sorg, Richard L Lewis, and Satinder P Singh. Reward de-
sign via online gradient ascent. In Advances in Neural Information
Processing Systems, pages 2190–2198, 2010.

[24] P. Stinis, T. Hagge, A. M. Tartakovsky, and E. Yeung. Enforcing con-


straints for interpolation and extrapolation in generative adversarial
networks. arXiv preprint arXiv:1803.08182, 2018.

[25] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Man-
sour. Policy gradient methods for reinforcement learning with function
approximation. In Proceedings of the 12th International Conference
on Neural Information Processing Systems, NIPS’99, pages 1057–1063,
Cambridge, MA, USA, 1999. MIT Press.

[26] ZY Wan, P Vlachas, P Koumoutsakos, and T Sapsis. Data-assisted


reduced-order modeling of extreme events in complex dynamical sys-
tems. PLoS ONE, 13:e0197704, 2018.

[27] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learn-
ing, 8(3-4):279–292, 1992.
