Enforcing Constraints For Time Series Prediction in Supervised, Unsupervised and Reinforcement Learning
Panos Stinis
Advanced Computing, Mathematics and Data Division, Pacific Northwest
National Laboratory, Richland WA 99354
Abstract
We assume that we are given a time series of data from a dynamical
system and our task is to learn the flow map of the dynamical system.
We present a collection of results on how to enforce constraints com-
ing from the dynamical system in order to accelerate the training of
deep neural networks to represent the flow map of the system as well
as increase their predictive ability. In particular, we provide ways to
enforce constraints during training for all three major modes of learn-
ing, namely supervised, unsupervised and reinforcement learning. In
general, the dynamic constraints need to include terms which are anal-
ogous to memory terms in model reduction formalisms. Such memory
terms act as a restoring force which corrects the errors committed by
the learned flow map during prediction.
For supervised learning, the constraints are added to the objective
function. For the case of unsupervised learning, in particular genera-
tive adversarial networks, the constraints are introduced by augment-
ing the input of the discriminator. Finally, for the case of reinforce-
ment learning and in particular actor-critic methods, the constraints
are added to the reward function. In addition, for the reinforcement
learning case, we present a novel approach based on homotopy of the
action-value function in order to stabilize and accelerate training. We
use numerical results for the Lorenz system to illustrate the various
constructions.
Introduction
Scientific machine learning, which combines the strengths of scientific com-
puting with those of machine learning, is becoming a rather active area of
research. Several related priority research directions were stated in the re-
cently published report [2]. In particular, two priority research directions
are: (i) how to leverage scientific domain knowledge in machine learning
(e.g. physical principles, symmetries, constraints); and (ii) how machine
learning can enhance scientific computing (e.g. reduced-order or sub-grid
physics models, parameter optimization in multiscale simulations).
Our aim in the current work is to present a collection of results that
contribute to both of the aforementioned priority research directions. On
the one hand, we provide ways to enforce constraints coming from a dynam-
ical system during the training of a neural network to represent the flow
map of the system. Thus, prior domain knowledge is incorporated in the
neural network training. On the other hand, as we will show, the accurate
representation of the dynamical system flow map through a neural network
is equivalent to constructing a temporal integrator for the dynamical system
modified to account for unresolved temporal scales. Thus, machine learning
can enhance scientific computing.
We assume that we are given data in the form of a time series of the
states of a dynamical system (a training trajectory). Our task is to train a
neural network to learn the flow map of the dynamical system. This means
to optimize the parameters of the neural network so that when it is presented
with the state of the system at one instant, it will predict accurately the
state of the system at another instant which is a fixed time interval apart.
If we want to use the data alone to train a neural network to represent the
flow map, then it is easy to construct simple examples where the trained
flow map has rather poor predictive ability [24]. The reason is that the
given data train the flow map to learn how to respond accurately as long
as the state of the system is on the trajectory. However, at every timestep,
when we invoke the flow map to predict the estimate of the state at the next
timestep, we commit an error. After some steps, the predicted trajectory
veers into parts of phase space where the neural network has not trained.
When this happens, the neural network’s predictive ability degrades rapidly.
One way to aid the neural network in its training task is to provide
data that account for this inevitable error. In [24], we advanced the idea of
using a noisy version of the training data i.e. a noisy version of the training
trajectory. In particular, we attach a noise cloud around each point on the
training trajectory. During training, the neural network learns how to take
as input points from the noise cloud, and map them back to the noiseless
trajectory at the next time instant. This is an implicit way of encoding
a restoring force in the parameters of the neural network (see Section 1.1
for more details). We have found that this modification can improve the
predictive ability of the trained neural network but up to a point (see Section
3 for numerical results).
We want to aid the neural network further by enforcing constraints that
we know the state of the system satisfies. In particular, we assume that we
have knowledge of the differential equations that govern the evolution of the
system (our constructions also work if we assume algebraic constraints, see
e.g. [24]). Except for special cases, it is not advisable to try to enforce the
differential equations directly at the continuum level. Instead we can dis-
cretize the equations in time using various numerical methods. We want to
incorporate the discretized dynamics into the training process of the neural
network. The purpose of such an attempt can be explained in two ways: (i)
we want to aid the neural network so that it does not have to discover the
dynamics (physics) from scratch; and (ii) we want the constraints to act as
regularizers for the optimization problem which determines the parameters
of the neural network.
Closer inspection of the concept of noisy data and of enforcing the dis-
cretized constraints reveals that they can be combined. However, this needs
to be done with care. Recall that when we use noisy data we train the neural
network to map a point from the noise cloud back to the noiseless point at
the next time instant. Thus, we cannot enforce the discretized constraints
as they are because the dynamics have been modified. In particular, the
use of noisy data requires that the discretized constraints be modified to
account explicitly for the restoring force. We have called the modification of
the discretized constraints the explicit error-correction (see Section 1.2).
The meaning of the restoring force is analogous to that of memory terms
in model reduction formalisms [6]. To see this, note that the flow map as
well as the discretization of the original constraints are based on a finite
timestep. The timescales that are smaller than the timestep used are not
resolved explicitly. However, their effect on the resolved timescales cannot
be ignored. In fact, it is what causes the inevitable error at each applica-
tion of the flow map. The restoring force that we include in the modified
constraints is there to remedy this error i.e. to account for the unresolved
timescales albeit in a simplified manner. This is precisely the role played by
memory terms in model reduction formalisms. In the current work we have
restricted attention to linear error-correction terms. The linear terms come
with coefficients whose magnitude is optimized as part of the training. In
this respect, optimizing the error-correction term coefficients becomes akin
to temporal renormalization. This means that the coefficients depend on the
temporal scale at which we probe the system [8, 3]. Finally, we note that
the error-correction term can be more complex than linear. In fact, it can
be modeled by a separate neural network. Results for such more elaborate
error-correction terms will be presented elsewhere.
We have implemented constraint enforcing in all three major modes of
learning. For supervised learning, the constraints are added to the objective
function (see Section 2.1). For the case of unsupervised learning, in partic-
ular generative adversarial networks [9], the constraints are introduced by
augmenting the input of the discriminator (see Section 2.2 and [24]). Finally,
for the case of reinforcement learning and in particular actor-critic methods
[25], the constraints are added to the reward function. In addition, for the
reinforcement learning case, we have developed a novel approach based on
homotopy of the action-value function in order to stabilize and accelerate
training (see Section 2.3).
In recent years, there has been considerable interest in the develop-
ment of methods that utilize data and physical constraints in order to
train predictors for dynamical systems and differential equations e.g. see
[4, 20, 5, 13, 22, 7, 26, 16] and references therein. Our approach is different:
it introduces the novel concept of training on purpose with modified (noisy)
data in order to incorporate (implicitly or explicitly) a restoring force in the
dynamics learned by the neural network flow map. We have also provided
the connection between the incorporation of such restoring forces and the
concept of memory in model reduction.
The paper is organized as follows. Section 1 explains the need for con-
straints to increase the accuracy/efficiency of time series prediction as well as
the form that these constraints can have. Section 2 presents ways to enforce
such constraints in supervised learning (Section 2.1), unsupervised learning
(Section 2.2) and reinforcement learning (Section 2.3). Section 3 contains
numerical results for the various constructions using the Lorenz system as
an illustrative example. Finally, Section 4 contains a brief discussion of the
results as well as some ideas for current and future work.
1 The need for constraints and their form

Suppose that the dynamical system of interest is governed by the system of differential equations

dx/dt = f(x),    (1)
where x ∈ R^M. The system (1) needs to be supplemented with an initial
condition x(0) = x0 . Furthermore, suppose that we are provided with time
series data from the system (1). This means a sequence of points from a
trajectory of the system {x_i^data}_{i=1}^N recorded at time intervals of length ∆t.
We would like to use this time series data to train a neural network to
represent the flow map of the system, i.e. a map H^∆t with the property
H^∆t x(t) = x(t + ∆t) for all t and x(t).
We want to find ways to enforce during training the constraints implied
by the system (1). Before we proceed, we should mention that in addition
to (1), one could have extra constraints. For example, if the system (1)
is Hamiltonian, we have extra algebraic constraints since the system must
evolve on an energy surface determined by its initial condition. We note that
the framework we present below can also enforce algebraic constraints but
in the current work we will focus on the enforcing of dynamic constraints
like the system (1). Enforcing dynamic constraints is more demanding than
enforcing algebraic ones. Technically, the enforcing of algebraic constraints
requires only knowledge of the state of the system. On the other hand, the
enforcing of dynamic constraints requires knowledge of the state and of the
rate of change of the state.
It is not advisable to attempt to enforce the constraints in (1) directly.
Doing so requires that the output of the neural network include both the
state of the system and its rate of change. This doubles the dimension of the
output and makes the necessary neural network size larger and its training
more demanding. Instead, we will enforce constraints that involve only the
state, albeit at more than one time instant. For example, we can consider the
simplest temporal discretization scheme, the forward Euler scheme [12], and
discretize (1) with timestep ∆t to obtain
x̂(t + ∆t) = x̂(t) + ∆t f(x̂(t)).    (2)

Higher-order schemes can be used in the same way; for example, a Runge-Kutta method can be written in the same form with f replaced by the corresponding increment function f^RK,

x̂(t + ∆t) = x̂(t) + ∆t f^RK(x̂(t)).    (3)

However, even if a discretized constraint like (2) or (3) is enforced with the training data used as is, there is no guarantee that the predictions of the trained flow map will remain
accurate. In fact, as can be seen by simple numerical examples [24], the
trained neural network flow map can lose its predictive ability rather fast.
The reason is that we have used data from a time series i.e. a trajectory to
train the neural network. However, a single trajectory is extremely unlikely
(has measure zero) in the phase space of the system. Thus, the trained
network predicts accurately as long as the predicted state remains on the
training trajectory. But this is impossible, since the action of the flow map
at every (finite) timestep involves an inevitable approximation error. If left
unchecked, this approximation error causes the prediction to deviate into a
region of phase space that the network has never trained on. Soon after, all
the predictive ability of the network is lost.
This observation highlights the need for an alternate way of enforcing the
constraints. In fact, as we will explain now, it points towards the need for the
enforcing of alternate constraints altogether. In particular, this observation
underlines the need for the training of the neural network to include some
kind of error-correcting mechanism. Such an error-correcting mechanism
can help restore the trajectory predicted by the learned flow map when it
inevitably starts deviating due to the finiteness of the used timestep.
The way we have devised to implement this error-correcting mechanism
can be implicit or explicit. By implicit we mean that we do not specify
the functional form of the mechanism but only what we want it to achieve
(Section 1.1). On the other hand, the explicit implementation of the error-
correcting mechanism does involve the specification of the functional form
of the mechanism (Section 1.2).
The common ingredient for both implicit and explicit implementations
is the use of a noisy version of the data during training. The main idea is
that the training of the neural network must address the inevitability
of error that comes with the use of a finite timestep. For example, suppose
that we are given data that are recorded every ∆t. The flow map we wish to
train will produce states of the system at time instants that are ∆t apart.
Every time the flow map is applied, even if it is applied on a point from
the exact trajectory, it will produce a state that has deviated from the exact
trajectory at the next timestep. So, the trained flow map must learn how
to correct such a deviation.
1.1 Implicit error-correction

The implicit implementation attaches to each point of the training time series a cloud of noisy points
centered at the point on the time series. Such a cloud of points accounts
for our ignorance about the inevitable error that the flow map commits at
every step. The next step is to train the neural network to map a point from
this cloud back to the noiseless trajectory at the next timestep. In this way,
the neural network is trained to incorporate an error-correcting mechanism
implicitly.
Of course, there are the questions of the extent of the cloud of noisy
points as well as the number of samples we need from it. These parameters
depend on the magnitude of the interval ∆t and the accuracy of the train-
ing data. For example, if the training data were produced by a numerical
method with a known order of accuracy then we expect the necessary ex-
tent of the noisy cloud to follow a scaling law with respect to the interval
∆t. Similarly, if the training data were produced by a numerical experiment
with known measurement error, we expect the extent of the noisy cloud to
depend on the measurement error.
1.2 Explicit error-correction

The explicit implementation modifies the discretized constraint itself by adding an error-correcting term f^C to the dynamics; for the forward Euler scheme this gives

x̂(t + ∆t) = x̂(t) + ∆t f(x̂(t)) + ∆t f^C(x̂(t)),    (4)

where f^C accounts, in a simplified manner, for the unresolved timescales. The simplest choice is
to assume that f^C(x̂(t)) is a linear function of the state x̂(t). For example,
f^C(x̂(t)) = A x̂(t), where A is an M × M matrix whose entries need to be
determined. The entries of A can be estimated during the training of the
flow map neural network. There is no need to restrict the form of the
correction term f^C(x̂(t)) to a linear one. In fact, we can consider a separate
neural network to represent f^C(x̂(t)). We have explored such constructions
although a detailed presentation will appear in a future publication. A fur-
ther question to be explored is the dependence of the elements of A or of
the parameters of the network for the representation of the error-correcting
term on the timestep ∆t. In fact, we expect a scaling law dependence on ∆t
which would be a manifestation of incomplete similarity [3].
We also note that there is a further generalization of the error-correcting
term f^C(x̂(t)), if we allow it to depend on the state of the system for times
before t. Given the analogy to memory terms alluded to above, such a de-
pendence on the history of the evolution of the state of the system is an
instance of a non-Markovian memory [6].
Finally, we note that (4) offers one more way to interpret the error-
correction term, namely as a modified numerical method, a modified Euler
scheme in this particular case, where the role of the error-correction term is
to account for the error of the Euler scheme.
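As a minimal illustration, the sketch below (Python with NumPy; the function names and the toy right-hand side are our own choices for the example, not part of the implementation used later) advances the state by one step of the modified Euler scheme (4) with a diagonal linear error-correcting term.

    import numpy as np

    def modified_euler_step(x, f, dt, a):
        # One step of the modified forward Euler scheme (4):
        # x(t + dt) = x(t) + dt*f(x(t)) + dt*f_C(x(t)),
        # with the diagonal linear error-correcting term f_C(x) = -a*x.
        return x + dt * f(x) + dt * (-a * x)

    # Toy usage with a placeholder right-hand side (not the Lorenz system).
    f = lambda x: -x
    x = np.array([1.0, 0.5])
    a = np.array([0.1, 0.2])      # correction coefficients, to be learned
    x_next = modified_euler_step(x, f, dt=1.5e-2, a=a)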
2 Enforcing the constraints

2.1 Supervised learning

In the supervised learning setting, the flow map is represented by a neural network G with parameter vector θ_G. The baseline (unconstrained) objective function is the squared distance between the flow map output and the noiseless training data, averaged over the noise cloud samples,

Loss_supervised = (1/Λ) ∑_{i=1}^{Λ} (G(z_i) − x_i^data)^2,    (5)
where z_i is a point from the noise cloud around a point of the given training
trajectory and x_i^data is the noiseless point on the given training trajectory
after time ∆t. Note that we allowed freedom here in choosing the value of
Λ to accommodate various implementation choices e.g. number of samples
from the noise cloud, mini-batch sampling etc. The parameter vector θG
can be estimated by minimizing the objective function Losssupervised .
For the sake of simplicity, suppose that we want to enforce the constraints
given in (4) with a diagonal linear representation of the error-correcting term
i.e. f_j^C(x̂(t)) = −a_j x̂_j(t), for j = 1, . . . , M (the minus sign is in anticipation
of this being a restoring force). Then we can consider the modified objective
function given by
Loss_supervised^constraints = (1/Λ) ∑_{i=1}^{Λ} [ (G(z_i) − x_i^data)^2 + ∑_{j=1}^{M} (G_j(z_i) − z_ij − ∆t f_j(z_i) + ∆t a_j z_ij)^2 ],    (6)
where z_ij is the j-th component of the noise cloud point z_i and f_j is the
j-th component of the vector f. Notice that the minimization of the modified
objective function Loss_supervised^constraints leads to the determination of both
the parameter vector θ_G and the error-correcting representation parameters
a_j, j = 1, . . . , M. Also note that if, instead of the forward Euler scheme, we
use e.g. a more elaborate Runge-Kutta method as given in (3), then we can
still use (6) but with the vector f replaced by f^RK.
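The following sketch (Python/NumPy; G stands for any callable flow-map model and the array shapes are our own convention) evaluates the constrained objective (6) for a batch of noise-cloud points; in practice G and the coefficients a_j would be optimized jointly by a gradient-based method.

    import numpy as np

    def loss_supervised_constraints(G, z, x_data, f, dt, a):
        # Objective (6): data misfit plus the forward-Euler constraint residual
        # with the diagonal linear error-correction, averaged over the batch.
        # z: (L, M) noise-cloud inputs, x_data: (L, M) noiseless targets,
        # a: (M,) error-correction coefficients.
        Gz = np.array([G(zi) for zi in z])              # (L, M) predictions
        data_term = np.sum((Gz - x_data) ** 2, axis=1)  # (G(z_i) - x_i^data)^2
        fz = np.array([f(zi) for zi in z])              # f evaluated on the cloud points
        residual = Gz - z - dt * fz + dt * a * z        # constraint residual per component
        constraint_term = np.sum(residual ** 2, axis=1)
        return np.mean(data_term + constraint_term)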
2.2 Unsupervised learning

For unsupervised learning we use generative adversarial networks (GANs) [9]. A generator network G maps samples z from an input distribution p_z(z) to samples G(z) with distribution p_g, while a discriminator network D is trained to decide whether a sample comes from the true data
distribution p_data rather than p_g. We train D to maximize the probability of
assigning the correct label to both training examples and samples from the
generator G. Simultaneously, we train G to minimize log(1 − D(G(z))). We
can express the adversarial nature of the relation between D and G as the
two-player min-max game with value function V(D, G):

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))].    (7)
To enforce the constraints, we augment the input of the discriminator with the corresponding constraint residuals and consider the value function

min_G max_D V(D, G) = E_{x∼p_data(x)}[log D(x, ε^D(x))] + E_{z∼p_z(z)}[log(1 − D(G(z), ε^G(z)))],    (8)

where ε^D(x) is the constraint residual for the true sample and ε^G(z) is the
constraint residual for the generator-created sample. Note that in our setup,
the generator input distribution p_z(z) will be from the noise cloud around
the training trajectory. On the other hand, the true data distribution p_data
is the distribution of values of the (noiseless) training trajectory.
As explained in [24], taking the constraint residual ε^D(x) to be zero for
the true samples can exacerbate the well-known saturation (instability) issue
with training GANs. Thus, we take ε^D(x) to be a random variable with
mean zero and small variance dictated by Monte Carlo or other numerical/mathematical/physical considerations. On the other hand, for ε^G(z) we
can use the constraint we want to enforce. For example, for the constraint
based on the forward Euler scheme with the diagonal linear error-correcting
term, we take ε_j^G(z) = G_j(z) − z_j − ∆t f_j(z) + ∆t a_j z_j for j = 1, . . . , M,
where z is a sample from the noise cloud around a point of the training time
series data. The expression for the constraint residual ε_j^G(z) can be easily
generalized for more elaborate numerical methods and error-correcting terms.
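A sketch of the augmented discriminator inputs is given below (Python/NumPy; the helper names, array shapes and the residual scale eps_scale are illustrative assumptions, not the settings of [24]).

    import numpy as np

    def residual_G(G, z, f, dt, a):
        # Constraint residual for a generated sample: forward Euler with
        # the diagonal linear error-correcting term.
        return G(z) - z - dt * f(z) + dt * a * z

    def augment_inputs(x_true, z, G, f, dt, a, eps_scale=1e-3):
        # Build augmented discriminator inputs [sample, residual].
        # x_true, z: (L, M) arrays of true samples and noise-cloud points.
        eps_D = eps_scale * np.random.randn(*x_true.shape)   # small zero-mean residual
        eps_G = np.array([residual_G(G, zi, f, dt, a) for zi in z])
        true_in = np.concatenate([x_true, eps_D], axis=1)
        fake_in = np.concatenate([np.array([G(zi) for zi in z]), eps_G], axis=1)
        return true_in, fake_in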
2.3 Reinforcement learning

2.3.1 Actor-critic methods

For reinforcement learning we use actor-critic (AC) methods [25]. At each discrete timestep t, an agent in state s_t ∈ S takes an action a_t ∈ A, receives a reward r_t drawn from a reward distribution R and moves to a new state s_{t+1} drawn from the environment transition distribution P. The aim is to learn both an action-value function, defined as the expected discounted cumulative reward obtained when following an action policy π : S → A,

Q^π(s_t, a_t) = E_{r_{i≥t}∼R, s_{i>t}∼P, a_{i>t}∼π} [ ∑_{i=t}^{∞} γ^{i−t} r_i ],    (9)
and an action policy that is optimal for the action-value function.
The parameter γ ∈ [0, 1] is called the discount factor and it expresses the
degree of trust in future actions. Eq. (9) can be rewritten in a recursive
manner as
Q^π(s_t, a_t) = E_{r_t∼R, s_{t+1}∼P} [ r_t + γ E_{a_{t+1}∼π} [Q^π(s_{t+1}, a_{t+1})] ],    (11)
which is called the Bellman equation. Thus, the task of finding the action-
value function is equivalent to solving the Bellman equation. We can solve
the Bellman equation by reformulating it as an optimization problem

min_Q E_{s_t, a_t∼π} [ (Q(s_t, a_t) − y_t)^2 ],    (12)

where

y_t = E_{r_t∼R, s_{t+1}∼P, a_{t+1}∼π} [ r_t + γ Q(s_{t+1}, a_{t+1}) ]    (13)
is called the target. In (12), instead of the square of the distance of the
action-value function from the target, we could have used any other divergence
that is positive except when the action-value function and target
are equal [19]. Using the objective functions E_{s_t, a_t∼π}[(Q(s_t, a_t) − y_t)^2] and
−E_{s_0∼p_0, a_0∼π}[Q^π(s_0, a_0)] for the action-value function and action policy re-
spectively, we can express the task of reinforcement learning also as a bilevel
minimization problem [19]. However, as in the case of GANs discussed
before, in practice the action-value function and action policy are usually
updated iteratively.
Before we adapt the AC setup to our task of enforcing constraints for
time series prediction we will focus on two special choices: (i) the use of
deterministic target policies and (ii) the use of neural networks to represent
both the action-value function and the action policy [21, 15].
We start with the effect of using a deterministic target policy denoted
as µ : S → A. Then, the Bellman equation (11) can be written as

Q^µ(s_t, a_t) = E_{r_t∼R, s_{t+1}∼P} [ r_t + γ Q^µ(s_{t+1}, µ(s_{t+1})) ].    (14)

Note that the use of a deterministic target policy for a_{t+1} has allowed us to
drop the expectation with respect to a_{t+1} that appeared in (13) and find
y_t = E_{r_t∼R, s_{t+1}∼P} [ r_t + γ Q(s_{t+1}, µ(s_{t+1})) ].    (15)
Also, note that the expectations in (14) and (15) depend only on the environ-
ment. This means that it is possible to learn Qµ off-policy, using transitions
that are generated from a different stochastic behavior policy β. We can
rewrite the optimization problem (12)-(13) as

min_Q E_{s_t∼ρ^β, a_t∼β, r_t∼R, s_{t+1}∼P} [ (Q(s_t, a_t) − y_t)^2 ],    (16)

where

y_t = r_t + γ Q(s_{t+1}, µ(s_{t+1})).    (17)
The state visitation distribution ρ^β is related to the behavior policy β. We will use
this flexibility below to introduce our noise cloud around the training tra-
jectory.
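In code, the off-policy objective (16) with target (17) can be evaluated as in the sketch below (Python/NumPy; Q and mu stand for generic callables representing the action-value function and the deterministic policy, both assumed given).

    import numpy as np

    def td_target(r, s_next, Q, mu, gamma):
        # Target (17): y_t = r_t + gamma * Q(s_{t+1}, mu(s_{t+1})).
        return r + gamma * np.array([Q(s, mu(s)) for s in s_next])

    def critic_loss(Q, mu, s, a, r, s_next, gamma):
        # Squared Bellman residual of (16), averaged over a batch of transitions.
        y = td_target(r, s_next, Q, mu, gamma)
        q = np.array([Q(si, ai) for si, ai in zip(s, a)])
        return np.mean((q - y) ** 2)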
We continue with the effect of using neural networks to represent both
the action-value function and the policy. We restrict attention to the case of
a deterministic policy since this will be the type of policy we will use later
for our time series prediction application. To motivate the introduction of
neural networks we begin with the concept of Q-learning as a way to learn the
action-value function and the policy [27, 17]. In Q-learning, the optimization
problem (16)-(17) to find the action-value function is coupled with the greedy
policy estimate µ(s) = arg max_a Q(s, a). Thus, the greedy policy requires an
optimization at every timestep. This can become prohibitively costly for
the type of action spaces that are encountered in many applications. This
has led to (i) the adoption of (deep) neural networks for the representation
of the action-value function and the policy and (ii) the update of the neural
network for the policy after each Q-learning iteration for the action-value
function [21].
We assume that the action-value function Q(st , at |θQ ) is represented by
a neural network with parameter vector θQ and the deterministic policy
µ(s|θµ ) by a neural network with parameter vector θµ . The deterministic
policy gradient algorithm [21] uses (16)-(17) to learn Q(st , at |θQ ). The policy
µ(s|θµ ) is updated after every iteration of the Q-optimization using the
policy gradient
∇_{θ^µ} J ≈ E_{s_t∼ρ^β} [ ∇_{θ^µ} Q(s_t, µ(s_t|θ^µ)|θ^Q) ].    (18)
2.3.2 AC methods for time series prediction and enforcing con-
straints
We explain now how an AC method can be used to train the flow map of a
dynamical system. In addition, we provide a way for enforcing constraints
during training.
We begin by identifying the state st with the state of the dynamical
system at time t. Also, we identify the discrete timesteps with the iterations
of the flow map that advance the state of the dynamical system by ∆t units
in time. The action policy µ(st |θµ ) is the action that needs to be taken to
bring the state of the system from st to st+1 . However, instead of learning
separately the action policy that results in st being mapped to st+1 , we can
identify the policy µ(st |θµ ) with the state st+1 i.e. µ(st |θµ ) = st+1 . In this
way, training for the policy µ(st |θµ ) is equivalent to training for the flow
map of the dynamical system.
We also take advantage of the off-policy aspect of (16)-(17) to choose
the distribution of states ρβ to be the one corresponding to the noise cloud
around the training trajectory needed to implement the error-correction.
Thus, we see that the intrinsic statistical nature of the AC method fits
well with our approach to error-correcting.
To complete the specification of the AC method as a method for training
the flow map of a dynamical system we need to specify the reward function.
The specification of the reward function is an important aspect of AC meth-
ods and reinforcement learning in general [23, 11]. We have chosen a simple
negative reward function. To conform with the notation from Sections 2.1
and 2.2 we specify the reward function as
r(z, x) = − ∑_{j=1}^{M} (µ_j(z) − x_j^data)^2,    (19)
where z is a point of the noise cloud at time t, µ_j(z) is the j-th component
of its image through the flow map and x_j^data is the j-th component
of the noiseless point on the training trajectory at time t + ∆t that it is
mapped to. Similarly, for the case when we want to enforce constraints e.g.
diagonal linear representation for the error-correcting term we can define
the reward function as
r(z, x) = − ∑_{j=1}^{M} [ (µ_j(z) − x_j^data)^2 + (µ_j(z) − z_j − ∆t f_j(z) + ∆t a_j z_j)^2 ].    (20)
For each time t, the reward function that we have chosen uses information
only from the state of the system at time t and t + ∆t. Of course, how
much credit we assign to this information is determined by the value of the
discount factor γ.
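The rewards (19) and (20) can be computed as in the following sketch (Python/NumPy; mu denotes the policy/flow-map network and f the right-hand side of the dynamical system, both assumed given).

    import numpy as np

    def reward_plain(mu, z, x_data):
        # Reward (19): negative squared distance to the noiseless target.
        return -np.sum((mu(z) - x_data) ** 2)

    def reward_constrained(mu, z, x_data, f, dt, a):
        # Reward (20): adds the forward-Euler constraint residual with the
        # diagonal linear error-correcting term.
        m = mu(z)
        residual = m - z - dt * f(z) + dt * a * z
        return -np.sum((m - x_data) ** 2 + residual ** 2)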
If γ = 0, then we disregard any information beyond time t + ∆t. In this
case, the AC method becomes a supervised learning method in disguise.
In fact, from (16)-(18) we see that when γ = 0, the task of maximizing the
action-value function is equivalent to maximizing the reward. The maximum
value for our negative reward (19) is 0 which is the optimal value for the
supervised learning loss function (5) (the average over the noise cloud in (5)
is the same as the average w.r.t. ρ^β). A similar conclusion holds for the
case of the reward with constraints (20) and the supervised learning loss
function with constraints (6).
If on the other hand we set γ = 1, we assign equal importance to current
and future rewards. This corresponds to the case where the environment is
deterministic and thus the same actions result in the same rewards.
The training of AC methods is known to be difficult to stabilize. To see why, consider the case γ = 0. Then the target (17) reduces to y_t = r_t, and the coupled updates (16)-(18) amount to
trying to find the point on the line Q(st , at ) = rt which maximizes Q(st , at ).
But hitting this line from a random initialization of the neural networks for
Q(st , at ) and of the policy µ(st ) is extremely unlikely. We would be better
off if we started our optimization from a point on the line and then looked for
the maximum of Q(s_t, a_t). In other words, for the case of γ = 0, we have
a better chance of training accurately if we let Q(st , at ) = rt in (18). A
similar argument for the case γ ≠ 0 shows why we will have a better chance
of training if we let Q(st , at ) = rt + γQ(st+1 , µ(st+1 )) in (18).
There is an extra mathematical reason why the identification Q(st , at ) =
rt + γQ(st+1 , µ(st+1 )) can result in better training. Recall from (20) that
the reward function rt contains all the information from the training trajec-
tory and the constraints we wish to enforce. In addition, rt depends on the
parameter vector θµ for the neural network that represents the action policy
µ. Thus, when we use the expression rt + γQ(st+1 , µ(st+1 )) in (18) for the
update step of θµ , we back-propagate directly the available information from
the training trajectory and the constraints to θµ . This is because we differ-
entiate directly rt w.r.t. θµ . On the other hand, in the original formulation
we do not differentiate rt at all, because there rt appears only in the update
step for the action-value function. That update step involves differentiation
w.r.t. the action-value function parameter vector θQ but not θµ .
Of course, if we make the identification Q(st , at ) = rt +γQ(st+1 , µ(st+1 ))
in (18), we have modified the original problem. The question is how the
solution of the modified problem is related to that of the original one. Through
algebraic inequalities, one can show that the optimum for Q(st , at ) for the
modified problem provides a lower bound on the optimum for the original
problem. It can also provide an upper bound if we make extra assumptions
about the difference Q(st , at )−rt −γQ(st+1 , µ(st+1 )) e.g. the convex-concave
assumptions appearing in the min-max theorem [18].
To avoid the need for such extra assumptions, we have developed an
alternative approach. We initialize the training procedure with the identifi-
cation Q(st , at ) = rt + γQ(st+1 , µ(st+1 )) in (18). As the training progresses
we morph the modified problem back to the original one via homotopy. In
particular, we use in (18) instead of Q(st , at ) the expression
(1 − δ) [ r_t + γ Q(s_{t+1}, µ(s_{t+1})) ] + δ Q(s_t, a_t),    (21)

where the homotopy parameter δ starts at 0 and is gradually increased to 1 according to a prescribed schedule. We have not experimented with a
very refined schedule. One general rule of thumb is that the schedule should
be slower for larger values of γ i.e. allow more iterations between increases
in the value of δ. This is to be expected because for larger values of γ, the
influence of rt in the optimization of rt +γQ(st+1 , µ(st+1 )) is reduced. Thus,
it is more difficult to back-propagate the information from rt to the action
policy parameter vector θµ . However, note that larger values of γ allow us
to take more into account future rewards, thus allowing the AC method to
be more versatile.
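A sketch of the homotopy expression (21) used in place of Q(s_t, a_t) in (18), together with a simple stepwise schedule for δ, is given below (Python; Section 3 only specifies that δ starts at 0 and is increased every 2000 iterations until it reaches 1, so the step size here is an illustrative assumption).

    def homotopy_value(Q, mu, s_t, a_t, r_t, s_next, gamma, delta):
        # Expression (21): (1 - delta)*(r_t + gamma*Q(s_{t+1}, mu(s_{t+1})))
        # + delta*Q(s_t, a_t). delta = 0 gives the modified problem,
        # delta = 1 recovers the original action-value function.
        return (1.0 - delta) * (r_t + gamma * Q(s_next, mu(s_next))) + delta * Q(s_t, a_t)

    def delta_schedule(iteration, step=0.1, every=2000):
        # Stepwise increase of delta toward 1 (step size is illustrative).
        return min(1.0, (iteration // every) * step)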
3 Numerical results
We use the example of the Lorenz system to illustrate the constructions
presented in Sections 1 and 2.
The Lorenz system is given by
dx_1/dt = σ(x_2 − x_1),    (22)
dx_2/dt = ρx_1 − x_2 − x_1 x_3,    (23)
dx_3/dt = x_1 x_2 − βx_3,    (24)
where σ, ρ and β are positive. We have chosen for the numerical experiments
the commonly used values σ = 10, ρ = 28 and β = 8/3. For these values
of the parameters the Lorenz system is chaotic and possesses an attractor
for almost all initial points. We have chosen the initial condition x1 (0) = 0,
x2 (0) = 1 and x3 (0) = 0.
We have used as training data the trajectory that starts from the spec-
ified initial condition and is computed by the forward Euler scheme with timestep
δt = 10^{-4}. In particular, we have used data from a trajectory for t ∈ [0, 3].
For all three modes of learning, we have trained the neural network to represent
the flow map with timestep ∆t = 1.5 × 10^{-2}, i.e. 150 times larger than
the timestep used to produce the training data. After we trained the neural
network that represents the flow map, we used it to predict the solution for
t ∈ [0, 9]. Thus, the trained flow map’s task is to predict (through iterative
application) the whole training trajectory for t ∈ [0, 3] starting from the
given initial condition and then keep producing predictions for t ∈ (3, 9].
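For reference, the ground-truth trajectory and the iterative use of the flow map described above can be reproduced along the following lines (Python/NumPy sketch; flow_map stands for the trained network and is assumed to be given).

    import numpy as np

    SIGMA, RHO, BETA = 10.0, 28.0, 8.0 / 3.0

    def lorenz(x):
        # Right-hand side of the Lorenz system (22)-(24).
        return np.array([SIGMA * (x[1] - x[0]),
                         RHO * x[0] - x[1] - x[0] * x[2],
                         x[0] * x[1] - BETA * x[2]])

    def euler_trajectory(x0, dt, n_steps):
        # Forward Euler trajectory used as ground truth (dt = 1e-4 here).
        traj = np.empty((n_steps + 1, 3))
        traj[0] = x0
        for n in range(n_steps):
            traj[n + 1] = traj[n] + dt * lorenz(traj[n])
        return traj

    truth = euler_trajectory(np.array([0.0, 1.0, 0.0]), dt=1e-4, n_steps=90000)  # t in [0, 9]
    train = truth[:30001:150]   # every 150th point of t in [0, 3] gives Delta_t = 1.5e-2

    def rollout(flow_map, x0, n_steps):
        # Iterative application of the learned flow map for prediction.
        xs = [x0]
        for _ in range(n_steps):
            xs.append(flow_map(xs[-1]))
        return np.array(xs)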
This is a severe test of the learned flow map’s predictive abilities for
four reasons. First, due to the chaotic nature of the Lorenz system there is
no guarantee that the flow map can correct its errors so that it can follow
closely the training trajectory even for the interval [0, 3] used for training.
Second, by extending the interval of prediction beyond the one used for
training we want to check whether the neural network has actually learned
the flow map of the Lorenz system and is not just overfitting the training data.
Third, we have chosen an initial condition that is far away from the attractor
but our integration interval is long enough so that the system does reach
the attractor and then evolves on it. In other words, we want the neural
network to learn both the evolution of the transient and the evolution on the
attractor. Fourth, we have chosen to train the neural network to represent
the flow map corresponding to a much larger timestep than the one used
to produce the training trajectory in order to check the ability of the error-
correcting term to account for a significant range of unresolved timescales
(relative to the training trajectory).
We performed experiments with different values for the various param-
eters that enter in our constructions. We present here indicative results
for the case of N = 2 × 10^4 samples (N/3 for training, N/3 for validation
and N/3 for testing). We have chosen N_cloud = 100 for the number of points
in the noise cloud around each input. Thus, the timestep is ∆t = 1.5 × 10^{-2}:
there are 20000/100 = 200 time instants in the interval [0, 3], at a distance
∆t = 3/200 = 1.5 × 10^{-2} apart.
The noise cloud at a time instant t was constructed by taking the point
x_i(t), i = 1, 2, 3, on the training trajectory and adding random disturbances,
so that the cloud is the collection x_il(t) = x_i(t)(1 − R_range + 2R_range ξ_il),
where l = 1, . . . , N_cloud, the random variables ξ_il ∼ U[0, 1] and R_range =
2 × 10^{-2}. As we have explained before, we want to train the neural network
to map an input from the noise cloud at time t to the noiseless point
x_i(t + ∆t), i = 1, 2, 3, on the training trajectory at time t + ∆t.
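A sketch of the noise cloud construction just described (Python/NumPy; the function name and the handling of the random number generator are our own choices):

    import numpy as np

    def noise_cloud(x_t, n_cloud=100, r_range=2e-2, rng=None):
        # Cloud of n_cloud perturbed copies of the trajectory point x_t:
        # each component is multiplied by (1 - r_range + 2*r_range*xi),
        # xi ~ U[0, 1], so the relative perturbation lies in [-r_range, r_range].
        rng = np.random.default_rng() if rng is None else rng
        xi = rng.uniform(size=(n_cloud, x_t.size))
        return x_t * (1.0 - r_range + 2.0 * r_range * xi)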
We also have to motivate the value of R_range for the extent of the noise
cloud. Recall that the training trajectory was computed with the forward Euler
scheme, which is a first-order scheme. For the interval ∆t = 1.5 × 10^{-2} we
expect the error committed by the flow map to be of similar magnitude, and
thus we should accommodate this error by considering a cloud of points
within this range. We found that taking R_range slightly larger, equal to
2 × 10^{-2}, helps the accuracy of the training.
We denote by (F1 (z), F2 (z), F3 (z)) the output of the neural network flow
map for an input z. This corresponds to (G1 (z), G2 (z), G3 (z)) for the no-
tation of Section 2.1 (supervised learning) and 2.2 (unsupervised learning)
and to (µ1 (z), µ2 (z), µ3 (z)) for the notation of Section 2.3.
As explained in detail in [24], we employ a learning rate schedule that
we have developed and which uses the relative error of the neural network
flow map. For a mini-batch of size m, we define the relative error as
RE_m = (1/m) ∑_{j=1}^{m} (1/3) [ |F_1(z_j) − x_1(t_j + ∆t)| / |x_1(t_j + ∆t)| + |F_2(z_j) − x_2(t_j + ∆t)| / |x_2(t_j + ∆t)| + |F_3(z_j) − x_3(t_j + ∆t)| / |x_3(t_j + ∆t)| ],
where (F1 (zj ), F2 (zj ), F3 (zj )) is the neural network flow map prediction at
tj + ∆t for the input vector zj = (zj1 , zj2 , zj3 ) from the noise cloud at time
t_j. Also, (x_1(t_j + ∆t), x_2(t_j + ∆t), x_3(t_j + ∆t)) is the point on the training
trajectory computed by the Euler scheme with δt = 10^{-4}. The tolerance for
the relative error was set to TOL = 1/√(N/3) = 1/√(2 × 10^4/3) ≈ 0.0122 (see
[24] for more details about TOL). For the mini-batch size we have chosen
m = 1000 for the supervised and unsupervised cases and m = 33 for the
reinforcement learning case.
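The relative error RE_m and the tolerance TOL can be computed as in the following sketch (Python/NumPy; F stands for the trained flow map, assumed given).

    import numpy as np

    def relative_error(F, z_batch, x_next_batch):
        # RE_m: component-wise relative error of the flow-map predictions,
        # averaged over the three components and over the mini-batch.
        pred = np.array([F(z) for z in z_batch])
        return np.mean(np.abs(pred - x_next_batch) / np.abs(x_next_batch))

    # Tolerance used in the experiments: TOL = 1/sqrt(N/3) with N = 2e4.
    TOL = 1.0 / np.sqrt(2e4 / 3.0)   # approximately 0.0122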
We also need to specify the constraints that we want to enforce. Using
the notation introduced above, we want to train the neural network flow
map so that its output (F1 (zj ), F2 (zj ), F3 (zj )) for an input data point zj =
(zj1 , zj2 , zj3 ) from the noise cloud satisfies
F_1(z_j) = z_j1 + ∆t [σ(z_j2 − z_j1)] − ∆t a_1 z_j1,    (25)
F_2(z_j) = z_j2 + ∆t [ρ z_j1 − z_j2 − z_j1 z_j3] − ∆t a_2 z_j2,    (26)
F_3(z_j) = z_j3 + ∆t [z_j1 z_j2 − β z_j3] − ∆t a_3 z_j3,    (27)
where a1 , a2 and a3 are parameters to be optimized during training. The
first two terms on the RHS of (25)-(27) come from the forward Euler scheme,
while the third is the diagonal linear error-correcting term.
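In code, the residuals of the constraints (25)-(27) take the form below (Python/NumPy sketch; F_out is the flow-map output for the input z and a = (a_1, a_2, a_3) are the trainable correction coefficients).

    import numpy as np

    def lorenz_constraint_residuals(F_out, z, dt, a, sigma=10.0, rho=28.0, beta=8.0/3.0):
        # Residuals of (25)-(27): F(z) - z - dt*f(z) + dt*a*z, component-wise.
        f_z = np.array([sigma * (z[1] - z[0]),
                        rho * z[0] - z[1] - z[0] * z[2],
                        z[0] * z[1] - beta * z[2]])
        return F_out - z - dt * f_z + dt * np.asarray(a) * z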
Because the solution of the Lorenz system acquires values outside of the region of the activation function,
we have removed the activation function from the last layer of the generator (al-
ternatively we could have used batch normalization and kept the activation
function). Fig. 1 compares the evolution of the prediction for x_1(t) of the
neural network flow map, starting at t = 0 and computed with a timestep
∆t = 1.5 × 10^{-2}, to the ground truth (training trajectory) computed with
the forward Euler scheme with timestep δt = 10^{-4}. We show plots only for
x_1(t) since the results are similar for x_2(t) and x_3(t). We want to make
two observations.
First, the prediction of the neural network flow map is able to follow
with adequate accuracy the ground truth not only during the interval [0, 3]
that was used for training, but also during the interval (3, 9]. Second, the
explicit enforcing of constraints i.e. the enforcing of the constraints (25)-(27)
(see results in Fig. 1(b)) is better than the implicit enforcing of constraints.
[Figure 1: Evolution of the prediction for x_1(t) by the neural network flow map vs. the ground truth; (a) implicit enforcing of constraints, (b) explicit enforcing of the constraints (25)-(27).]
Fig. 2 compares the predictions of neural networks trained with noisy
and noiseless training data. In addition, we perform such comparison both
for the case with enforced constraints during training and without enforced
constraints. Fig. 2(a) shows that when the constraints are not enforced
during training, the use of noisy data can have a significant impact. This is
along the lines of our argument that the data from a single training trajec-
tory are not enough by themselves to train the neural network accurately
for prediction purposes. Fig. 2(b) shows that when the constraints are en-
forced during training, the difference between the predictions based on noisy
and noiseless training data is reduced. However, using noisy data results in
better predictions for parts of the trajectory where there are rapid changes.
Also the use of noisy data helps the prediction to stay “in phase” with the
ground truth for longer times.
We have to stress that we conducted several numerical experiments and
the performance of the neural network flow map trained with noisy data
was consistently more robust than when it was trained with noiseless data.
A thorough comparison will appear in a future publication.
[Figure 2: Evolution of the prediction for x_1(t) with noisy vs. noiseless training data; (a) without enforced constraints, (b) with enforced constraints.]

We also want to examine the contribution of the error-correcting term itself. In
particular, we would like to see how much of the predictive accuracy is due
to enforcing the forward Euler scheme alone i.e. set a1 = a2 = a3 = 0 in
(25)-(27) versus allowing a1 , a2 , a3 to be optimized during training.
[Figure 3: Evolution of the prediction for x_1(t); (a) no constraints vs. only the forward Euler part of (25)-(27) (a_1 = a_2 = a_3 = 0), (b) the effect of also optimizing the error-correcting coefficients a_1, a_2, a_3.]
Fig. 3(a) compares to the ground truth the prediction of the flow map
trained without enforcing any constraints and the prediction of the flow map
trained while enforcing only the forward Euler part of the constraints (25)-(27)
(a_1 = a_2 = a_3 = 0).
We see that indeed, even if we enforce only the forward Euler part of the
constraint we obtain much more accurate results than not enforcing any
constraint at all.
Fig. 3(b) examines how the performance of the neural network is affected
further if we also include the error-correcting term in (25)-(27), i.e. optimize
a_1, a_2, a_3 during training. The inclusion of the error-correcting term allows
the solution to remain for longer “in phase” with the ground truth than if
the error-correction term is absent. This is expected since we are examining
the solution of the Lorenz system as it transitions from an initial condition
far from the attractor to the attractor and then evolves on it. While on
the attractor, the solution remains oscillatory and bounded, and the main
error of the neural network flow map prediction comes from going “out of
phase” with the ground truth. The error-correcting term keeps the predicted
trajectory closer to the ground truth thus reducing the loss of phase. Recall
that the error-correcting term is one of the simplest possible. From our prior
experience with model reduction, we anticipate larger gains in accuracy if
we use more sophisticated error-correcting terms.
We want to stress again that training with noiseless data is significantly
less robust than training with noisy data. However, we have chosen to
present results of training with noiseless data that exhibit good prediction
accuracy to raise various issues that should be more thoroughly investigated.
[Figure 4: Evolution of the prediction for x_1(t) vs. time; panels (a) and (b).]

For the reinforcement learning case, we found that without the homotopy approach of Section 2.3 it was difficult to train the flow map accurately, in
both cases of enforcing or not constraints explicitly during training. With
this in mind we present results with and without the homotopy approach for
the action-value function to highlight the accuracy improvement afforded by
the use of homotopy.
Before we present the results we provide some details about the target
networks, the reward function and the specifics of the homotopy schedule.
The target network concept uses different networks to represent the action-
value function and the action policy that appear in the expression for the
target (17). In particular, if θQ and θµ are the parameter vectors for the
action-value function and action policy respectively, then we use neural net-
works with parameter vectors θ^{Q′} and θ^{µ′} (the target networks) to evaluate
the target expression (17). The vectors θ^{Q′} and θ^{µ′} can be initialized with
the same values as θQ and θµ but they evolve in a different way. In fact,
after every iteration update for θ^Q and θ^µ we apply the update rule

θ^{Q′} ← τ θ^Q + (1 − τ) θ^{Q′},    θ^{µ′} ← τ θ^µ + (1 − τ) θ^{µ′},

with τ ≪ 1, so that the target values are constrained to change slowly during training. For the reward function we used the expressions (19) and (20),
where x^data is the noiseless point from the training trajectory. As we have
explained in Section 2.3 (see the comment after (17)), in the AC method
context the distribution of the noise cloud of the input data points at every
timestep corresponds to the state visitation distribution ρβ appearing in
(16).
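A common realization of such target networks is the soft update sketched below (Python; whether this exact rule was used for the experiments here is an assumption, and the value of τ is illustrative).

    def soft_update(theta_target, theta, tau=1e-3):
        # Soft target-network update: theta' <- tau*theta + (1 - tau)*theta'.
        # A small tau keeps the target values changing slowly.
        return tau * theta + (1.0 - tau) * theta_target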
The homotopy schedule we used is a rudimentary one that we did not
attempt to optimize. Obviously, this is a topic of further investigation that
will appear elsewhere. We initialized the homotopy parameter δ at 0, and
increased its value (until it reached 1) every 2000 iterations of the optimiza-
tion.
[Figure 5: Evolution of the prediction for x_1(t) by the flow map trained with the AC method, with and without the homotopy approach for the action-value function.]
4 Discussion and future work
We have presented a collection of results about the enforcing of known con-
straints for a dynamical system during the training of a neural network to
represent the flow map of the system. We have provided ways that the
constraints can be enforced in all three major modes of learning, namely
supervised, unsupervised and reinforcement learning. In line with the law
of scientific computing that one should build into an algorithm as much prior
information as possible, we observe a striking improvement in per-
formance when known constraints are enforced during training. We have
also shown the benefit of training with noisy data and how this corresponds
to the incorporation of a restoring force in the dynamics of the system. This
restoring force is analogous to memory terms appearing in model reduction
formalisms. In our framework, the reduction is in a temporal sense i.e. it
allows us to construct a flow map that remains accurate though it is defined
for large timesteps.
The model reduction connection opens an interesting avenue of research
that makes contact with complex systems appearing in real-world problems.
The use of larger timesteps for the neural network flow map than the ground
truth without sacrificing too much accuracy is important. We can imagine
an online setting where observations come at sparsely placed time instants
and are used to update the parameters of the neural network flow map. The
use of sparse observations could be dictated by necessity, e.g. if it is hard
to obtain frequent measurements, or by efficiency, e.g. if the local processing of
data in field-deployed sensors is costly. Thus, if the trained flow map
is capable of accurate estimates using larger timesteps, then its successful
updating using only sparse observations becomes more probable.
The constructions presented in the current work depend on a large num-
ber of details that can potentially affect their performance. A thorough
study of the relative merits of enforcing constraints for the different modes
of learning needs to be undertaken and will be presented in a future pub-
lication. We do believe though that the framework provides a promising
research direction at the nexus of scientific computing and machine learn-
ing.
5 Acknowledgements
The author would like to thank Court Corley, Tobias Hagge, Nathan Hodas,
George Karniadakis, Kevin Lin, Paris Perdikaris, Maziar Raissi, Alexandre
Tartakovsky, Ramakrishna Tipireddy, Xiu Yang and Enoch Yeung for help-
ful discussions and comments. The work presented here was partially sup-
ported by the PNNL-funded “Deep Learning for Scientific Discovery Agile
Investment” and the DOE-ASCR-funded “Collaboratory on Mathematics
and Physics-Informed Learning Machines for Multiscale and Multiphysics
Problems (PhILMs)”. Pacific Northwest National Laboratory is operated by
Battelle Memorial Institute for DOE under Contract DE-AC05-76RL01830.
References
[1] Martin Arjovsky and Léon Bottou. Towards principled methods
for training Generative Adversarial Networks. arXiv preprint
arXiv:1701.04862, 2017.
[2] Nathan Baker, Frank Alexander, Timo Bremer, Aric Hagberg, Yan-
nis Kevrekidis, Habib Najm, Manish Parashar, Abani Patra, James
Sethian, Stefan Wild, and Karen Willcox. Workshop report on basic
research needs for scientific machine learning: Core technologies for
artificial intelligence. 2019.
[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David
Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Gen-
erative adversarial nets. Advances in neural information processing sys-
tems, pages 2672–2680, 2014.
[11] Xiaoxiao Guo, Satinder Singh, Richard Lewis, and Honglak Lee. Deep
learning for reward design to improve monte carlo tree search in atari
games. In Proceedings of the Twenty-Fifth International Joint Confer-
ence on Artificial Intelligence, IJCAI’16, pages 1519–1525. AAAI Press,
2016.
[12] Ernst Hairer, S. P. Nørsett, and G. Wanner. Solving Ordinary Differential
Equations I. Springer, NY, 1987.
[16] Huanfei Ma, Siyang Leng, Kazuyuki Aihara, Wei Lin, and Luonan
Chen. Randomly distributed embedding making short-term high-
dimensional data predictable. Proceedings of the National Academy
of Sciences, 115(43):E9994–E10002, 2018.
[18] Martin J Osborne et al. An introduction to game theory, volume 3.
Oxford University Press New York, 2004.
[19] David Pfau and Oriol Vinyals. Connecting generative adversarial net-
works and actor-critic methods. arXiv preprint arXiv:1610.01945, 2016.
[21] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra,
and Martin Riedmiller. Deterministic policy gradient algorithms. In
ICML, 2014.
[23] Jonathan Sorg, Richard L Lewis, and Satinder P Singh. Reward de-
sign via online gradient ascent. In Advances in Neural Information
Processing Systems, pages 2190–2198, 2010.
[25] Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Man-
sour. Policy gradient methods for reinforcement learning with function
approximation. In Proceedings of the 12th International Conference
on Neural Information Processing Systems, NIPS’99, pages 1057–1063,
Cambridge, MA, USA, 1999. MIT Press.
[27] Christopher JCH Watkins and Peter Dayan. Q-learning. Machine learn-
ing, 8(3-4):279–292, 1992.