Output Feedback Reinforcement Learning Control for Linear Systems
Control Engineering
Series Editor: William S. Levine, Department of Electrical and Computer Engineering, University of Maryland, College Park, MD, USA
Syed Ali Asad Rizvi, Electrical and Computer Engineering, Tennessee Technological University, Cookeville, TN, USA
Zongli Lin, Electrical and Computer Engineering, University of Virginia, Charlottesville, VA, USA
Preface
In the model-based optimal control framework, the optimality of the solution is subject to the accuracy of the system model. The requirement of an accurate system model is hard to satisfy in practice because
of the presence of unavoidable uncertainties. Furthermore, these uncertainties add up as the complexity of the system increases. More often than not, the modeling and system identification process is undesirable owing to the cost and effort it involves. As a result, the dynamics of the underlying system is often unknown for
the control design. Even when the system dynamics is available in the early design
phase, the system parameters are subject to change over the operating life span of
the system due to variations in the process itself. These model variations may arise
as a result of aging, faults, or subsequent upgrades of the system components. Thus,
it is desirable to develop optimal control methods that do not rely on the knowledge
of the system dynamics.
Machine learning is an area of artificial intelligence and involves the development
of computational intelligence algorithms that learn by analyzing the system behavior
instead of being explicitly programmed to perform certain tasks. Recent advances in
machine learning have opened a new avenue for the design of intelligent controllers.
Machine learning techniques come into the picture when accurate system models
are not known but examples of system behavior are available or a measure of the
goodness of the behavior can be assigned. Machine learning algorithms are formally
categorized into three types based on the kind of supervision available, namely
supervised learning, unsupervised learning, and reinforcement learning. Supervised
learning is undoubtedly the most common type of machine learning which finds use
in applications where examples of the desired behavior are available in the form
of input-output training data. However, the nature of control problems requires
selecting control actions whose consequences emerge over time. As a result, the
optimal control inputs that achieve the desired output are not known in advance.
In these scenarios, reinforcement learning can be used to learn the desired control
inputs by providing the controller with a suitable evaluation of its performance
during the course of learning.
Today, more and more engineering designs are finding inspiration from nature. Living organisms tend to learn, adapt, and optimize their behavior
over time by interacting with their environment. This is the main motivation behind
reinforcement learning techniques, which are computational intelligence algorithms
based on the principle of action and reward that naturally incorporate a feedback
mechanism to capture optimal behavior. The ideas of adaptability and optimality are
also present in the control community in the form of adaptive control and dynamic
programming (optimal control). However, reinforcement learning has provided a
new direction of learning optimal adaptive control by observing a reward stimulus.
Recently, the idea of using reinforcement learning to solve optimal control problems
has attracted a lot of attention in the control community.
The treatment of RL problems in control settings calls for a more mathematically
rigorous formulation so that connections with the fundamental control concepts such
as stability, controllability, and observability can be made. We see such sophistication right from the beginning, when we are required to select a rigorously formulated objective function that satisfies certain control assumptions.
While significant progress has been made in recent years, the reinforcement
learning (RL) control paradigm continues to undergo developments more rapidly
than ever before. Reinforcement learning for dynamic systems requires taking into account the system dynamics. Control of dynamic systems gives prime importance to closed-loop stability. Control algorithms are required to guarantee closed-loop stability, which is a bare minimum requirement for a feedback system. Control systems without performance, robustness, and safety margins are not acceptable to industry. Current developments in the RL control literature are directed towards
making RL algorithms applicable in real world scenarios.
This book focuses on the recent developments in the design of RL controllers for
general linear systems represented by state space models, either in continuous-time
or in discrete-time. It is dedicated to the design of output feedback RL controllers.
While the early developments in RL algorithms have been primarily attributed to the
computer science community where it is of practical interest to pose the problem in
the stochastic setting, the scope of this research monograph is towards enhancing the
output feedback capability of the mainstream RL algorithms developed within the
control engineering community, which are primarily in the deterministic setting.
This specialized choice of topics sets this research monograph apart from the
leading mainstream books that develop the conceptual foundations of the theory
of approximate dynamic programming [8], reinforcement learning [9], and their
connection with feedback control [54, 68]. The book presents control algorithms that are aimed at dealing with the challenges associated with learning under limited sensing capability. Fundamental to these algorithms are the issues of exploration
bias and stability guarantee. Model-free solutions to the classical optimal control
problems such as stabilization and tracking are presented for both discrete-time and
continuous-time dynamic systems. Output feedback control for linear systems is
currently more formalized in the latest literature compared to the ongoing extensions
for nonlinear systems; therefore, the discussions for the most part are dedicated to
linear systems.
The output feedback formulation presented in this book differs from the tra-
ditional design method that involves a separate observer design. As such, the
emphasis of the monograph is not on optimal state estimation, as is involved in
an observer and control design process that leads to the optimal output feedback
control. Instead, our motivation is to learn the optimal control parameters directly
based on a state parameterization approach. This, in turn, circumvents the two-step learning procedure, dictated by the separation principle, of optimal state estimation followed by optimal control design.
The results covered in the book extend beyond the classical problems of
stabilization and tracking. A variety of practical challenges are studied, ranging from disturbance rejection and control constraints to communication delays. Model-
free H∞ controllers are developed using output feedback based on the principles
of game theory. The ideas of low gain feedback control are employed to develop
RL controllers that achieve global stability under control constraints. New results
on the design of model-free optimal controllers for systems subject to both state
and input delays are presented based on an extended state augmentation approach,
which requires neither the knowledge of the lengths of the delays nor the knowledge
of the number of the delays.
The organization of the book is as follows. In Chap. 1, we introduce the readers
to optimal control theory and reinforcement learning. Fundamental concepts and
results are reviewed. This chapter also provides a brief survey of the recent develop-
ments in the field of RL control. Challenges associated with the RL controllers are
highlighted.
Chapter 2 presents model-free output feedback RL algorithms to solve the
optimal stabilization problem. Both continuous-time systems and discrete-time
systems are considered. A review of existing mainstream approaches and algorithms
is presented to highlight some of the challenges in guaranteeing closed-loop
stability and optimality. Q-learning and integral reinforcement learning algorithms
are developed based on the parameterization of the system state. The issues of
discounting factor and exploration bias are discussed in detail, and improved output
feedback RL methods are presented to overcome these difficulties.
Chapter 3 brings attention to the disturbance rejection control problem formu-
lated as an H∞ control problem. The framework of game theory is employed to
develop both continuous-time and discrete-time algorithms. A literature review is
provided to elaborate on the difficulties in some of the recent RL-based disturbance
rejection algorithms. Q-learning and integral reinforcement learning algorithms
are developed based on the input-output parameterization of the system state, and
convergence to the optimal solution is established.
Chapter 4 presents model-free algorithms for global asymptotic stabilization
of linear systems subject to actuator saturation. Existing reinforcement learning
approaches to solving the constrained control problem are first discussed. The idea
of gain-scheduled low gain feedback is presented to develop control laws that avoid
saturation and achieve global asymptotic stabilization. To design these control laws,
we employ the parameterized ARE-based low gain design technique. Reinforcement
learning algorithms based on this approach are then presented to find the solution of
the parameterized ARE without requiring any knowledge of the system dynamics.
The presented scheme has the advantage that the resulting control laws have a
linear structure and global asymptotic stability is ensured without causing actuator
saturation. Both continuous-time systems and discrete-time systems are considered.
The last two chapters build upon the fundamental results established in the earlier chapters to solve some important problems in control theory and practice: control of systems in the presence of time delays, the optimal tracking problem, and the multi-agent synchronization problem. In particular, Chap. 5 focuses on the
control problems involving time delays. A review of some existing RL approaches
to addressing the time-delay control problem of linear systems is first provided.
The design of model-free RL controllers is presented based on an extended state
augmentation approach. It is shown that discrete-time delay systems with input
and/or state delays can be brought into a delay-free form. Q-learning is then
employed to learn the optimal control parameters for the extended dynamic system.
Systems with arbitrarily large delays can be dealt with using the presented approach.
Furthermore, this method requires knowledge of neither the number nor the lengths of the delays.
Contents
4.3.2 Q-learning Based Global Asymptotic Stabilization Using State Feedback
4.3.3 Q-learning Based Global Asymptotic Stabilization by Output Feedback
4.3.4 Numerical Simulation
4.4 Global Asymptotic Stabilization of Continuous-Time Systems
4.4.1 Model-Based Iterative Algorithms
4.4.2 Learning Algorithms for Global Asymptotic Stabilization by State Feedback
4.4.3 Learning Algorithms for Global Asymptotic Stabilization by Output Feedback
4.4.4 Numerical Simulation
4.5 Summary
4.6 Notes and References
5 Model-Free Control of Time Delay Systems
5.1 Introduction
5.2 Literature Review
5.3 Problem Description
5.4 Extended State Augmentation
5.5 State Feedback Q-learning Control of Time Delay Systems
5.6 Output Feedback Q-learning Control of Time Delay Systems
5.7 Numerical Simulation
5.8 Summary
5.9 Notes and References
6 Model-Free Optimal Tracking Control and Multi-Agent Synchronization
6.1 Introduction
6.2 Literature Review
6.3 Q-learning Based Linear Quadratic Tracking
6.4 Experience Replay Based Q-learning for Estimating the Optimal Feedback Gain
6.5 Adaptive Tracking Law
6.6 Multi-Agent Synchronization
6.7 Numerical Examples
6.8 Summary
6.9 Notes and References
References
Index
Notation and Acronyms
Chapter 1
Introduction to Optimal Control and Reinforcement Learning
1.1 Introduction
Control algorithms are unarguably one of the most ubiquitous elements found at
the heart of many of today’s real world systems. Owing to the demand for high
performance of these systems, the development of better control techniques has
become ever so important. A consequence of their increasing complexity is that
it has become more difficult to model and control these systems to meet the desired
objectives. As a result, the traditional model-based control paradigm requires novel
techniques to cope with these increasing challenges. Optimal control, a control
framework that is regarded as an important development in modern control theory
for its capability to integrate optimization theory in controls, is now experiencing
a paradigm shift from the offline model-based setting to the online data-driven
approach. Such a shift has been fueled by the recent surge of the research and
developments in the area of machine learning pioneered by the artificial intelligence
(AI) community and has led to the developments in optimal decision making AI
algorithms popularly known as reinforcement learning.
While reinforcement learning holds the promise of rescuing the classical optimal
control from the curse of modeling, there are also several potential challenges
that stem from the marriage of these two paradigms owing to the mathematically
rigorous requirements inherent in control theory. In this chapter we will first provide
a concise but self-contained background of optimal control and reinforcement
learning in the context of linear dynamical systems with the aim of highlighting
the connections between the two. Iterative techniques are introduced that play an
essential role in the design of algorithms presented in the later chapters. A detailed
discussion then follows that highlights the existing difficulties found in the literature
of reinforcement learning based optimal control.
1.2 Optimal Control of Dynamic Systems
Ensuring stable operation that meets basic control objectives is an essential, though
preliminary, requirement in any control design. Besides stability, practical control
systems are required to meet certain specifications from both the performance and
cost standpoints. This requires the control problem to be solved in a certain optimal
fashion that takes into account various practical operational aspects. Optimization
techniques play an essential role in meeting such specifications. An optimal control
problem basically revolves around the idea of performing a certain control task such
as stabilization or trajectory tracking in a way that a good balance of performance
and control cost is maintained while executing the control task. Pioneering developments in optimal control date back to the 1950s with Pontryagin's minimum principle, which provides necessary conditions under which the problem is solvable [12]. Around the same time, the introduction of Bellman's dynamic programming [7] method enabled solving dynamic optimization problems backward in time in a sequential manner.
Suppose we are given a dynamic system described by the state equation $x_{k+1} = f(x_k, u_k)$, where $x_k \in \mathbb{R}^n$ represents the state, $u_k \in \mathbb{R}^m$ is the control input, and $k$ is the time index. The basic objective is to find an optimal control sequence $u_k^*$, $k \ge 0$, that causes $x_k \to 0$ as $k \to \infty$, while minimizing a long term running cost of the form
$$V(x_0) = \sum_{k=0}^{\infty} r(x_k, u_k), \qquad (1.2)$$
where $r(x_k, u_k)$ is a utility function that penalizes the state and the control. The optimal value function, obtained by following an optimal policy $\pi^*$, is given by
$$V^*(x_k) = \sum_{i=k}^{\infty} r\big(x_i, \pi^*(x_i)\big) = \min_{u \in U}\, \sum_{i=k}^{\infty} r(x_i, u_i), \qquad (1.3)$$
and the optimal policy is the one that attains this minimum, that is,
$$\pi^*(x_k) = \arg\min_{u \in U}\, \sum_{i=k}^{\infty} r(x_i, u_i). \qquad (1.4)$$
To solve this problem using the framework of dynamic programming, we first need
to recall the Bellman principle of dynamic programming. The Bellman principle
states that “An optimal policy has the property that whatever the initial state and
initial decision are, the remaining decisions must constitute an optimal policy with
regard to the state resulting from the first decision.” [57]
The Bellman principle in essence provides a necessary and sufficient condition for a policy to be optimal, namely, that every sub-trajectory under that policy is also optimal, independent of where on the optimal trajectory we start taking optimal actions. The idea itself is actually very intuitive and simple. Consider, for example, that we are to travel by air from one place to another while minimizing the travel costs (such as time and expenses). By finding out the information about
the possible routes (the model in this case) we obtain an optimal flight route that
goes through certain connections along the route. However, if we had to reschedule
that flight from a certain intermediate connection, the flight starting from that
intermediate connection following the rest of the route of the original flight will
still be the optimal choice. Otherwise, the original flight route would not have been
optimal and can be improved by adopting better flights for the later portion of the
flight route.
Coming back to our original problem, mathematically, the Bellman optimality principle implies that the optimal value function satisfies the following relationship,
$$V^*(x_k) = \min_{u \in U}\left\{ r(x_k, u_k) + V^*(x_{k+1}) \right\}, \qquad (1.5)$$
which suggests that, in theory, we can compute the optimal cost backward in time, working from the terminal stage all the way back to the present instant, a process that requires offline planning and invokes the system dynamics. Equation (1.5) is called the Bellman optimality equation, which is in fact the discrete-time counterpart of a popular equation known as the Hamilton-Jacobi-Bellman (HJB) equation (a partial differential equation) used to solve the continuous-time version of the same problem, as discussed below.
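To make the backward-in-time character of (1.5) concrete, here is a small Python sketch (ours, not from the book; the stage graph and costs are arbitrary illustrations) that computes the optimal cost-to-go of a three-stage decision problem by sweeping backward from the terminal stage:

```python
# Backward dynamic programming on a small finite-horizon example.
# The stage graph and one-step costs below are made up for illustration only.
transitions = {
    0: {("s0", "a"): ("s1", 2.0), ("s0", "b"): ("s2", 4.0)},
    1: {("s1", "a"): ("s3", 6.0), ("s1", "b"): ("s4", 1.0),
        ("s2", "a"): ("s3", 3.0), ("s2", "b"): ("s4", 7.0)},
    2: {("s3", "a"): ("goal", 2.0), ("s4", "a"): ("goal", 5.0)},
}

N = 3                      # horizon (number of decision stages)
V = {"goal": 0.0}          # terminal cost
policy = {}

# Bellman recursion, computed backward in time: V_k(x) = min_u { r(x, u) + V_{k+1}(x') }.
for k in reversed(range(N)):
    V_next, V = V, {}
    for (x, u), (x_next, cost) in transitions[k].items():
        q = cost + V_next[x_next]
        if x not in V or q < V[x]:
            V[x], policy[(k, x)] = q, u

print(V["s0"], policy)     # optimal cost from s0 and the decisions that achieve it
```

The sweep needs the full transition model in advance; this offline, model-based character is exactly what the reinforcement learning methods discussed later avoid.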
It is interesting to note that the right-hand side of the HJB equation contains the Hamiltonian function used in solving optimization problems. Once the solution (the optimal value function) of the HJB equation is obtained, the optimal control can be readily computed by minimizing the Hamiltonian with respect to the control u(t).
The Bellman optimality equation and the HJB equation are functional equations.
Their solution amounts to finding the optimal value function V ∗ . These equations
are difficult to solve for general dynamic systems and cost functions except for
some special cases. Thus, the standard dynamic programming framework faces two
big challenges. First, we need perfect models to solve (1.7). Secondly, even when
such models are available, we still need some method to approximate their solutions
[124].
One of the most widely discussed optimal control problems is the linear quadratic regulator (LQR) problem. The problem deals with linear dynamic systems whose cost functions are energy-like functions that take a quadratic form. In this problem
we are to solve a regulation problem while minimizing a long term quadratic cost
function. Depending on whether we require finite-time or asymptotic regulation,
the cost function is defined accordingly. We will discuss only the more common
asymptotic regulation here as we will not discuss finite-time control problems in
this book. Suppose we are given a linear dynamic system in the state space form,
$$x_{k+1} = A x_k + B u_k, \qquad (1.8)$$
where $x_k \in \mathbb{R}^n$ represents the state, $u_k \in \mathbb{R}^m$ is the control input, and $k$ is the time index. The basic objective is to find an optimal control sequence $u_k^*$, $k \ge 0$, that causes $x_k \to 0$ as $k \to \infty$, while minimizing a long term running cost of the form [55],
$$V(x_0) = \sum_{k=0}^{\infty} \left( x_k^{\mathrm T} Q x_k + u_k^{\mathrm T} R u_k \right), \qquad (1.9)$$
where $Q \ge 0$ and $R > 0$ are the weighting matrices, with $\big(A, \sqrt{Q}\big)$, $\sqrt{Q}^{\mathrm T}\sqrt{Q} = Q$, being observable and $(A, B)$ being controllable. It is worth mentioning that the optimal LQR controller may exist under less restrictive conditions [33, 118]. Utility functions of this form will be the focus throughout the book.
For the linear dynamic system (1.8) and the quadratic cost function (1.9), solving the Bellman optimality equation (1.7) amounts to solving a matrix algebraic Riccati equation (ARE) of the form,
$$A^{\mathrm T} P A - P + Q - A^{\mathrm T} P B \left(R + B^{\mathrm T} P B\right)^{-1} B^{\mathrm T} P A = 0. \qquad (1.10)$$
The solution of the ARE (1.10) gives the optimal value function
$$V^*(x_k) = x_k^{\mathrm T} P^* x_k,$$
and the optimal control
$$u_k^* = K^* x_k = -\left(R + B^{\mathrm T} P^* B\right)^{-1} B^{\mathrm T} P^* A x_k,$$
where $P^* > 0$ is the unique positive definite solution to the ARE (1.10). The uniqueness of $P^*$ is ensured under the standard controllability and observability assumptions [55].
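As a quick numerical check of (1.10) and the gain formula above, the following Python sketch uses SciPy's discrete-time ARE solver on a small placeholder system (the matrices are chosen arbitrarily for illustration):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Placeholder system and weights; any controllable/observable choice works.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = np.array([[1.0]])

# P* solves  A'PA - P + Q - A'PB (R + B'PB)^{-1} B'PA = 0.
P = solve_discrete_are(A, B, Q, R)

# Optimal gain:  u_k = K* x_k  with  K* = -(R + B'PB)^{-1} B'PA.
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# The ARE residual should be numerically zero.
res = A.T @ P @ A - P + Q - A.T @ P @ B @ np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
print(K, np.linalg.norm(res))
```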
Parallel results for the continuous-time version of the same problem can be obtained by considering the linear dynamics $\dot{x}(t) = A x(t) + B u(t)$ together with a quadratic cost of the same form. In this case, the ARE takes the form
$$A^{\mathrm T} P + P A + Q - P B R^{-1} B^{\mathrm T} P = 0, \qquad (1.12)$$
and the optimal control is given by
$$u^*(t) = K^* x(t) = -R^{-1} B^{\mathrm T} P^* x(t), \qquad (1.13)$$
where $P^*$ is the positive definite solution of (1.12).
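The continuous-time counterpart (1.12)-(1.13) can be checked with the analogous SciPy call; the matrices below are again placeholders:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [-1.0, -0.5]])   # placeholder dynamics x_dot = Ax + Bu
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

P = solve_continuous_are(A, B, Q, R)        # A'P + PA + Q - PBR^{-1}B'P = 0
K = -np.linalg.solve(R, B.T @ P)            # u(t) = K x(t) = -R^{-1} B'P x(t)
print(K)
```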
Optimal control problems rely on solving the HJB equation (for a general nonlinear system and/or a general cost function) or the ARE (for a linear system and a quadratic cost function). Even when accurate system models are available,
these equations are still difficult to solve analytically and, therefore, computational
methods have been developed in the literature to solve these equations. Many of
these methods are based on two computational techniques called policy iteration and
value iteration [9]. In this subsection we will give an overview of these algorithms
as they will serve as the basis of the online learning based methods that we will
introduce in the subsequent chapters.
The mathematical basis of the iterative procedures for solving the Bellman and ARE equations is that these equations satisfy a fixed-point property under some general conditions on the cost function. This means that we can start with some suboptimal solution and successive improvements, when fed back to these equations, would converge to the optimal one. To see this, we note that we can obtain the value function from (1.2) corresponding to some policy $\pi(x_k)$ for state $x_k$ as
$$V_\pi(x_k) = r\big(x_k, \pi(x_k)\big) + V_\pi(x_{k+1}), \qquad (1.14)$$
which is known as the Bellman equation. This equation is used to evaluate the cost of policy $\pi$. We can obtain an improved policy $\pi'$, and feeding this update back into the Bellman equation will give us the value of the new policy, $V_{\pi'}$, which satisfies $V_{\pi'} \le V_\pi$. The process can be repeated until $V_{\pi'} = V_\pi$ for all future iterations. This method is formally referred to as policy iteration (PI) [113]. It allows finding the optimal policy without directly solving the Bellman optimality equation (1.5). This PI procedure is described in Algorithm 1.1, which terminates when $\|V_\pi^j - V_\pi^{j-1}\| < \varepsilon$ for some small $\varepsilon > 0$, where $j$ is the iteration index.
In each iteration of the policy iteration algorithm, the Bellman equation is
first solved by evaluating the cost or value of the current control policy and
then the policy is improved based on its policy evaluation. These two steps of
policy evaluation and policy update are repeated until the algorithm sees no further
improvement in the policy, and the final policy is said to be the optimal policy.
Notice that the PI algorithm iterates on the policies by evaluating them, hence the
name policy iteration. An important aspect of the algorithm is that it requires an
admissible policy because admissible policies have finite cost which is needed in
the policy evaluation step.
The second popular iterative method to solve the Bellman optimality equation is the value iteration (VI) method. The method differs from the policy iteration method in that it does not iterate on the policies by evaluating them but rather iterates on the value function directly. This is a consequence of the fact that the Bellman optimality equation is a fixed-point functional equation, which means that iterating the value functions directly would lead to the optimal value function. This VI procedure is described in Algorithm 1.2.
The value iteration algorithm differs from the policy iteration algorithm only
in the policy evaluation step, in which the value iteration evaluates a policy value
based on the previous policy value. It should be noted that policy iteration generally
requires solving a system of equations in each iteration, whereas the value iteration
simply performs a one-step recursion, which is computationally economical. In
contrast to the policy iteration algorithm, the value iteration algorithm does not
actually find the policy value corresponding to the current policy at each step but
it takes only a step closer to that value. Also, the policy iteration algorithm must be suitably initialized to converge, i.e., the initial policy must be admissible or stabilizing, a requirement that the value iteration algorithm does not impose. Like Algorithm 1.1, Algorithm 1.2 terminates when $\|V_\pi^j - V_\pi^{j-1}\| < \varepsilon$ for some small $\varepsilon > 0$.
For the LQR problem, the value function of a stabilizing policy is quadratic in the state, $V(x_k) = x_k^{\mathrm T} P x_k$. Note that this particular structure of the value function is advantageous from the learning perspective, which is a consequence of the quadratic utility function, as known from linear optimal control theory [55]. The control policy in this case is $u_k = K x_k$, which gives us
$$x_k^{\mathrm T} P x_k = x_k^{\mathrm T}\left(Q + K^{\mathrm T} R K\right) x_k + x_{k+1}^{\mathrm T} P x_{k+1},$$
or, equivalently,
$$(A + BK)^{\mathrm T} P (A + BK) - P + Q + K^{\mathrm T} R K = 0.$$
That is, the Bellman equation for the LQR problem actually corresponds to a
Lyapunov equation. This is analogous to the previous observation that the Bellman
optimality equation corresponds to the Riccati equation. These observations suggest
that we could apply iterations on the Lyapunov equation in the same way they are
applied on the Bellman equation. In this case, Lyapunov iterations under the standard LQR conditions would converge to the solution of the Riccati equation. A Newton's
iteration method was developed in the early literature that does exactly what a policy
iteration algorithm does and is presented in Algorithm 1.3.
Algorithm 1.3 finds the solution of the LQR Riccati equation iteratively. Instead
of solving the ARE (1.10), which is a nonlinear equation, Algorithm 1.3 solves
a Lyapunov equation which is linear in the unknown matrix P . Similar to the
general PI algorithm, Algorithm 1.3 also needs to be initialized with a stabilizing
policy. Such initiation is essential because the policy evaluation step involves
finding the positive definite solution of the Lyapunov equation, which requires the
feedback gain to be stabilizing. The algorithm is known to converge with a quadratic
convergence rate under the standard LQR conditions as shown in [35].
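A minimal Python sketch of this Lyapunov-equation-based policy iteration, assuming a stabilizing initial gain is available (system matrices and the initial gain are illustrative placeholders, not taken from the book):

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov, solve_discrete_are

A = np.array([[1.0, 0.1], [0.0, 1.0]])      # placeholder system
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.array([[1.0]])
K = np.array([[-1.0, -2.0]])                # assumed stabilizing: A + BK is Schur stable

for _ in range(50):
    Acl = A + B @ K
    # Policy evaluation: solve the Lyapunov equation  Acl'P Acl - P + Q + K'RK = 0.
    P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
    # Policy improvement:  K <- -(R + B'PB)^{-1} B'PA.
    K_new = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    if np.linalg.norm(K_new - K) < 1e-10:
        break
    K = K_new

# At convergence P coincides with the solution of the ARE (1.10).
print(np.allclose(P, solve_discrete_are(A, B, Q, R), atol=1e-6))
```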
We can also apply value iteration to find the solution of the Riccati equation.
Similar to the general value iteration algorithm, Algorithm 1.2, we perform recur-
sions on the Lyapunov equation to carry out value iterations on the matrix P for
value updates. That is, instead of solving the Lyapunov equation, we only perform
recursions, which are computationally faster. The policy update step still remains the
same as in Algorithm 1.3. Under the standard LQR conditions, the value iteration
LQR algorithm, Algorithm 1.4, converges to the solution of the LQR ARE.
Like Algorithm 1.3, Algorithm 1.4 terminates when $\|P^j - P^{j-1}\| < \varepsilon$ for some small $\varepsilon > 0$; the corresponding continuous-time algorithm uses the policy update $K^{j+1} = -R^{-1} B^{\mathrm T} P^j$.
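A corresponding value iteration sketch simply iterates the one-step Riccati recursion (the value update and policy update steps combined) and needs no stabilizing initial policy; the matrices are once more placeholders:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0, 0.1], [0.0, 1.0]])      # placeholder system
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.array([[1.0]])

P = np.zeros((2, 2))                        # arbitrary P0 >= 0; no stabilizing policy needed
for _ in range(5000):
    # One-step value update (recursion instead of solving a Lyapunov equation):
    P_new = Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    if np.linalg.norm(P_new - P) < 1e-12:
        break
    P = P_new

K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
print(np.allclose(P, solve_discrete_are(A, B, Q, R), atol=1e-8))
```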
1.3 Reinforcement Learning Based Optimal Control
The primary motivation of reinforcement learning stems from the way living beings learn to perform tasks. A key feature of intelligence in these beings is the way they adapt to their environment and optimize their actions by interacting with it. Reinforcement learning techniques were originally introduced in the computer science community to serve as computational intelligence algorithms for automating sequential decision making. Reinforcement learning solves these decision making problems by searching for an optimal policy that maximizes the expected value of the performance measure by examining the reward at each step. Examples of such sequential decision processes where reinforcement
at each step. Examples of such sequential decision processes where reinforcement
learning has been successfully applied include robot navigation problems [44],
board games [76] such as chess [106] and, more recently, Google's AlphaGo [107].
Recently, reinforcement learning for the control of dynamic systems has received
significant attention in the automatic control community [59]. These dynamic
systems are represented by differential or difference equations. System dynamics
plays an essential part in the design of human engineered systems. RL based control of dynamic systems requires incorporating the system dynamics, together with an infinite state space, which makes it different from the traditional RL algorithms. With the
recent advances in functional approximation such as neural networks, RL techniques
have been extended to approximate the solution of the HJB equation based on the
Bellman principle of dynamic programming. These methods are often referred to as
approximate or adaptive dynamic programming (ADP) in the control literature.
The development of reinforcement learning controllers is motivated by their
optimal and model-free nature. Reinforcement learning controllers are inherently
optimal because the problem formulation embeds a performance criterion or cost
function. Several optimization criteria, such as minimum energy and minimum time, can be taken into account. In addition to being optimal, reinforcement learning also inherits an adaptation capability in that the controller is able to adapt to changes in the system dynamics during its operation by observing the real-time data. In other words, the controller does not need to be reprogrammed if some parameters of the system are changed.
Reinforcement learning control is different from the adaptive control theory in
the sense that adaptive control techniques adapt the controller parameters based
on the error between the desired output and the measured output. As a result, the
learned controllers do not take into account the optimality aspect of the problem.
On the other hand, RL techniques are based on the Bellman equation, which incorporates a reward signal to adapt the controller parameters and, therefore, takes optimality into account while ensuring that the control objectives are achieved. The
controllers based on RL methods, however, do share a feature of direct adaptive
controllers in the sense that the RL controller is designed without the model
identification process and the optimal control is learned online through the learning
episodes by reinforcing the past control actions that give the maximum reward or
the minimum control utility.
Model-Free Control
Classical dynamic programming methods have been used to solve optimal control
problems. However, these techniques are offline in nature and require complete
knowledge of the system dynamics. Reinforcement learning addresses the core lim-
itation of requiring complete knowledge of the system dynamics by assuming only
basic properties of the system dynamics such as controllability and observability.
RL control is different from indirect adaptive control, where system identification
techniques are first used to estimate the system parameters and then a controller is
designed based on the identified parameters. Instead, it learns the optimal control
policies directly based on the real-time system data.
A central step in these learning methods is to approximate the value function with a parametric structure of the form
$$V(x_k) = W^{\mathrm T} \phi(x_k),$$
where the vector $W$ contains the unknown weights corresponding to the user-defined basis set $\phi(x_k)$. Using the universal approximation property of neural networks, the value function $V(x_k)$ can be approximated with an arbitrary accuracy provided that a sufficient number of terms are used in the approximation. For linear dynamic
systems, we know that the value function is quadratic in the state $x_k$, that is, $V(x_k) = x_k^{\mathrm T} P x_k$. As a result, we can represent the value function exactly with a finite basis set $\phi(x)$ defined as
$$\phi(x) = \left[\,x_1^2,\ x_1 x_2,\ \cdots,\ x_1 x_n,\ x_2^2,\ x_2 x_3,\ \cdots,\ x_2 x_n,\ \cdots,\ x_n^2\,\right]^{\mathrm T}.$$
In terms of this parameterization, the Bellman equation (1.14) becomes
$$W^{\mathrm T} \phi(x_k) = r(x_k, u_k) + W^{\mathrm T} \phi(x_{k+1}), \qquad (1.16)$$
where the utility function $r(x_k, u_k)$ plays the role of the reward or penalty signal.
Equation (1.16) is employed in policy evaluation and value update steps found in the
policy iteration and value iteration algorithms. The equation is linear in the unknown
vector W and standard linear equation solving techniques such as the least-squares
method can be employed to solve it based only on the datasets of (xk , uk , xk+1 ) and
without involving the system dynamics. This serves as a major step towards making
the iterative algorithms data-driven.
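To illustrate how an equation of the form (1.16) is solved from data, the sketch below evaluates a fixed stabilizing policy by least squares using the quadratic basis described above. The system matrices appear only to simulate the data set (x_k, u_k, x_{k+1}) and the observed costs; the least-squares step itself never uses them. All numerical choices are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

A = np.array([[1.0, 0.1], [0.0, 1.0]])     # used only to generate data
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.array([[1.0]])
K = np.array([[-1.0, -2.0]])               # fixed stabilizing policy being evaluated

def phi(x):
    # Quadratic basis [x1^2, x1*x2, x2^2] for a two-dimensional state.
    return np.array([x[0]**2, x[0]*x[1], x[1]**2])

regressors, targets = [], []
for _ in range(200):
    x = rng.normal(size=2)                 # sampled state (trajectory data would also work)
    u = K @ x                              # action taken by the policy
    x_next = A @ x + B @ u                 # observed next state
    r = x @ Q @ x + u @ R @ u              # observed one-step cost
    regressors.append(phi(x) - phi(x_next))
    targets.append(r)

# Least-squares solution of  W'phi(x_k) = r(x_k, u_k) + W'phi(x_{k+1}).
W, *_ = np.linalg.lstsq(np.array(regressors), np.array(targets), rcond=None)

# W collects the entries of P in V_K(x) = x'Px:  P = [[W0, W1/2], [W1/2, W2]].
P = np.array([[W[0], W[1] / 2], [W[1] / 2, W[2]]])
Acl = A + B @ K
print(np.linalg.norm(Acl.T @ P @ Acl - P + Q + K.T @ R @ K))   # Lyapunov residual, ~0
```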
Recalling from the dynamic programming method in Sect. 1.2.1, we find that
the Bellman optimality equation (1.5) provides a backward in time procedure to
obtain the optimal value function. Such a procedure inherently involves offline
planning using models to perform operations in reverse time. Now consider the data-
driven reinforcement learning Bellman equation (1.16). The sequence of operations
required to solve the data-driven learning equation proceeds forward in time. That is,
at a given time index k, an action is applied to the system and the resulting reward
or penalty r(xk , uk ) corresponding to this action in the current state is observed.
The goal is to minimize the difference between the predicted performance and the
sum of the observed reward and the current estimate of the future performance. This
forward in time sequence is an important difference that distinguishes reinforcement
learning from dynamic programming (see Fig. 1.2).
A value function that is quite frequently used in model-free reinforcement learning is the “Quality Function,” or the Q-function [113]. The Q-function is defined similarly to the right-hand side of the Bellman equation (1.14),
$$Q_\pi(x_k, u_k) = r(x_k, u_k) + V_\pi(x_{k+1}).$$
Like the value function, the Q-function also provides a measure of the cost of the policy
π . However, unlike the value function, it is explicit in uk and gives the single step
cost of executing an arbitrary control uk from state xk at time index k together with
the cost of executing policy π from time index k + 1 on. The Q-function description
is in a sense more comprehensive than the value function description as it covers
both the state and action spaces and, therefore, the best control action in each state
can be selected by knowing only the Q-function. Once the optimal Q-function is
found, the optimal control can be readily obtained by finding the control action that
minimizes or maximizes the optimal Q-function. Similar to the value function, we
can estimate the Q-function using some function approximator such as
$$Q_\pi(x_k, u_k) = W^{\mathrm T} \phi(x_k, u_k).$$
The Q-function also satisfies a Bellman equation, which can be obtained by using the relationship $Q_\pi(x_k, \pi(x_k)) = V_\pi(x_k)$ as follows,
$$Q_\pi(x_k, u_k) = r(x_k, u_k) + Q_\pi\big(x_{k+1}, \pi(x_{k+1})\big). \qquad (1.18)$$
The Bellman equation (1.18) can be parameterized to find the unknown vector $W$ by solving the linear equation
$$W^{\mathrm T} \phi(x_k, u_k) = r(x_k, u_k) + W^{\mathrm T} \phi\big(x_{k+1}, \pi(x_{k+1})\big),$$
which gives us the required Q-function. Then, following the policy iteration and
value iteration algorithms, Algorithms 1.1 and 1.2, we can iteratively solve the above
equation to obtain the optimal Q-function. Once we have the optimal Q-function
$Q^*(x_k, u_k)$, the optimal control $u_k^*$ is obtained by solving
$$\frac{\partial}{\partial u_k} Q^*(x_k, u_k) = 0.$$
For the LQR problem, the Q-function associated with a stabilizing policy $K$ takes the quadratic form
$$Q_K(x_k, u_k) = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^{\mathrm T} \begin{bmatrix} Q + A^{\mathrm T} P A & A^{\mathrm T} P B \\ B^{\mathrm T} P A & R + B^{\mathrm T} P B \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix} = z_k^{\mathrm T} H z_k = z_k^{\mathrm T} \begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix} z_k, \qquad (1.19)$$
where $z_k = \big[\,x_k^{\mathrm T}\ \ u_k^{\mathrm T}\,\big]^{\mathrm T}$.
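The partitioned structure in (1.19) is what makes the Q-function so convenient: once H is known, the minimizing control follows from its blocks alone, without using A or B. A small sketch (placeholder matrices; H is built from the model here purely to exhibit its structure) confirming that the gain recovered from H matches the ARE gain:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[1.0, 0.1], [0.0, 1.0]])     # placeholders
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.array([[1.0]])
n = 2                                      # state dimension

P = solve_discrete_are(A, B, Q, R)

# H as in (1.19); in Q-learning this matrix is estimated from data instead.
H = np.block([[Q + A.T @ P @ A, A.T @ P @ B],
              [B.T @ P @ A,     R + B.T @ P @ B]])

Hux, Huu = H[n:, :n], H[n:, n:]

# Minimizing z'Hz over u_k gives u_k = -Huu^{-1} Hux x_k, the same gain as from the ARE.
K_from_H = -np.linalg.solve(Huu, Hux)
K_from_are = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
print(np.allclose(K_from_H, K_from_are))
```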
Algorithm 1.6 Q-learning policy iteration algorithm for the LQR problem
input: input-state data
output: $H^*$ and $K^*$
1: initialize. Select an admissible policy $K^0$ such that $A + BK^0$ is Schur stable. Set $j \leftarrow 0$.
2: repeat
3: policy evaluation. Solve the following Bellman equation for $H^j$,
$$z_k^{\mathrm T} H^j z_k = r(x_k, u_k) + z_{k+1}^{\mathrm T} H^j z_{k+1}, \qquad z_{k+1} = \begin{bmatrix} x_{k+1} \\ K^j x_{k+1} \end{bmatrix}.$$
4: policy update. $K^{j+1} = -\left(H_{uu}^j\right)^{-1} H_{ux}^j$.
5: $j \leftarrow j + 1$
6: until $\|H^j - H^{j-1}\| < \varepsilon$ for some small $\varepsilon > 0$.
Algorithm 1.6 is a policy iteration algorithm for solving the LQR problem
without requiring the knowledge of the system dynamics. It solves the Q-learning
Bellman equation (1.18) for the LQR Q-function matrix H . The algorithm is
initialized with a stabilizing control policy K 0 . In the policy evaluation step the
cost of the current policy K j is evaluated by estimating the Q-function matrix
H j associated with policy K j . The improved control policy K j +1 is obtained by
minimizing the Q-function
$$Q^j = z_k^{\mathrm T} H^j z_k.$$
Subsequent iterations of these steps have been shown to converge to the optimal cost
function matrix H ∗ and the optimal control K ∗ under the standard LQR conditions.
To address the difficulty of having any a priori knowledge of a stabilizing policy
K 0 , we recall the following value iteration algorithm, Algorithm 1.7. The Q-learning
equation in this case is recursive in terms of the matrix H . Instead of evaluating
the policy K j , the value (in this case the Q-function itself) is iterated towards the
optimal value.
Algorithms 1.6 and 1.7 represent an important development in the design of
model-free optimal controllers for discrete-time systems, and they will serve as the
foundation of the designs presented in the later chapters.
Algorithm 1.7 Q-learning value iteration algorithm for the LQR problem
input: input-state data
output: $H^*$ and $K^*$
1: initialize. Select an arbitrary policy $K^0$ and $H^0 \ge 0$. Set $j \leftarrow 0$.
2: repeat
3: value update. Solve the following Bellman equation for $H^{j+1}$,
$$z_k^{\mathrm T} H^{j+1} z_k = r(x_k, u_k) + z_{k+1}^{\mathrm T} H^j z_{k+1}, \qquad z_{k+1} = \begin{bmatrix} x_{k+1} \\ K^j x_{k+1} \end{bmatrix}.$$
4: policy update. $K^{j+1} = -\left(H_{uu}^{j+1}\right)^{-1} H_{ux}^{j+1}$.
5: $j \leftarrow j + 1$
6: until $\|H^j - H^{j-1}\| < \varepsilon$ for some small $\varepsilon > 0$.
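The following sketch conveys the flavor of Algorithm 1.6 in Python: the Q-function matrix H of the current policy is estimated by least squares from input-state data (with exploration noise added to meet the excitation requirement), and the policy is then updated from the blocks of H. The simulated system, noise level, and sample counts are illustrative assumptions; the learning loop itself never reads A or B:

```python
import numpy as np

rng = np.random.default_rng(1)

A = np.array([[1.0, 0.1], [0.0, 1.0]])     # simulation only; treated as unknown by the learner
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.array([[1.0]])
n, m = 2, 1
K = np.array([[-1.0, -2.0]])               # admissible (stabilizing) initial policy

def quad_features(z):
    # Independent entries of the symmetric matrix zz' (off-diagonal terms doubled).
    rows, cols = np.triu_indices(len(z))
    scale = np.where(rows == cols, 1.0, 2.0)
    return scale * np.outer(z, z)[rows, cols]

for _ in range(20):
    regressors, targets = [], []
    x = rng.normal(size=n)
    for _ in range(200):
        u = K @ x + 0.1 * rng.normal(size=m)           # exploration noise
        x_next = A @ x + B @ u
        r = x @ Q @ x + u @ R @ u                      # observed stage cost
        z = np.concatenate([x, u])
        z_next = np.concatenate([x_next, K @ x_next])  # next action follows the current policy
        regressors.append(quad_features(z) - quad_features(z_next))
        targets.append(r)
        x = x_next
    theta, *_ = np.linalg.lstsq(np.array(regressors), np.array(targets), rcond=None)

    # Rebuild the symmetric H from its upper-triangular parameters.
    H = np.zeros((n + m, n + m))
    H[np.triu_indices(n + m)] = theta
    H = H + H.T - np.diag(np.diag(H))

    K_new = -np.linalg.solve(H[n:, n:], H[n:, :n])     # policy update from the blocks of H
    if np.linalg.norm(K_new - K) < 1e-6:
        break
    K = K_new

print(K)   # approaches the optimal LQR gain as the iterations proceed
```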
1.4 Recent Developments and Challenges in Reinforcement Learning Control
The control laws discussed so far, such as $u_k = K x_k$, rely on measurement of the full state vector for feedback. This is referred to as state feedback, which requires as many sensors as the order of
the system. The difficulty with state feedback is that access to the complete state is
generally not available in practice. The state of the system may not be a physically
measurable quantity and a sensor may not be available to measure that state. Even
when a sensor is available, it may not be feasible to install a sensor to measure every component of the state owing to the cost and complexity. Furthermore, as
the order of the system increases, the requirement becomes more difficult to satisfy.
In contrast to state feedback, control methods which employ feedback of the system
output are more desirable. It is known that under a certain observability assumption
the system input and output signals can be utilized to reconstruct the full state of
the system. However, output feedback control becomes quite challenging in the
reinforcement learning paradigm because of the unavailability of the system model,
which is needed to reconstruct the internal state. Thus, model-free output feedback
methods should be sought to overcome the limitation of full state feedback.
Discounted cost functions have commonly been employed in the design of reinforcement learning controllers to address the excitation noise bias
issue. However, it has been pointed out in the recent control literature [88] that the
closed-loop system stability may be compromised due to the use of discounted cost
functions. The issue stems from the need to make the long term cost of the control
finite by assigning less weight to the future costs. However, this discounting factor
masks the long term effect of the state energy in the cost function and, therefore, the
convergence of the state is not guaranteed even when the cost is finite. Although the
use of discounting factor is common in many sequential decision making problems,
its application may not be feasible in control applications. Recent works tend to
find a bound on this discounting factor which could still ensure the closed-loop
stability [50, 80]. However, the computation of this bound requires the knowledge
of the system model, which is unavailable in the model-free reinforcement learning
control. Thus, undiscounted reinforcement learning controllers are called for.
Disturbance is a key issue in the standard control setting. However, the primary
formulation of RL is oriented towards solving decision making problems,
where disturbance does not readily fit in. The traditional control literature offers
various frameworks such as H∞ control and the internal model principle to handle
disturbances. Differently from these formulations, a game theory based approach
has been found to be more fitting in the RL setting, in which the disturbance is
treated as some intelligent decision maker that plays an adversarial role in the
system dynamics. Recently, some state feedback RL methods have been proposed
to solve the robust H∞ control problem using game theoretic arguments [3, 4, 35,
70]. However, these works solve the full-information H∞ control problem, where
the measurements of both state and disturbances are required. Development of more
practical approaches to disturbance rejection is in order.
With the growing complexity of today’s systems, it has become difficult to solve
control and decision making problems in a centralized setting. In this regard, multi-
agent distributed approaches are promising as they provide solutions to the control
problems that were once considered intractable in a centralized manner. This has
naturally led to more interest in the RL community in designing distributed learning algorithms [14, 15, 105]. However, RL algorithms are in principle single-agent
based and, therefore, the leap from single-agent control to distributed algorithms for
multi-agent systems is challenging. Issues such as the coordination of these agents
and the exchange of information during the learning phase need to be carefully
addressed. More importantly, there are open questions on how to harness the model-
free power of RL to deal with not only unknown dynamics but also unknown
network topologies. In addition to this, the problem size in the multi-agent scenario
may become quite large, which is a major challenge for the current RL algorithms.
1.5 Notes and References
On the other hand, value iteration algorithms do not suffer from the limitation of requiring a stabilizing initial policy, as they iterate by exploiting the fixed-point property of the Bellman optimality
equation. A value iteration algorithm for the discrete-time LQR problem was
proposed in [53]. Value iteration for continuous-time problems is, however, not
straightforward. Recently, some approximation based value iteration methods have
also been proposed towards solving the continuous-time ARE, where the
requirement of a stabilizing initial policy was removed [11]. It should be noted that
all of these design algorithms are model-based and, therefore, require the complete
knowledge of the system dynamics.
Reinforcement learning was introduced in Sect. 1.3.1 as an approach to solving
optimal control problems without requiring full model information. The idea behind
this approach is to approximate the solution of the HJB equation by performing
some functional approximations using tools such as neural networks. We discussed
some advantages of reinforcement learning in Sect. 1.3.3, where we emphasized its capability to achieve adaptive optimal control and its applicability to a large
class of control problems. Studies of RL based control methods generally consider
one of the two main types of algorithms, actor-critic learning and Q-learning [114].
We only discussed Q-learning in this chapter as it will be used in the following
chapters. However, it is worth pointing out here that the actor-critic structure for
RL control was introduced by Werbos in [128–131], where it is referred to as
approximate dynamic programming (ADP). The structure consists of a critic sub-
system and an actor sub-system. The critic component assesses the cost of the current action based on some optimality criterion, similar to the policy evaluation step in PI and VI algorithms, while the actor component estimates an improved policy.
However, in RL ADP, the critic and actor employ approximators such as neural
networks to approximate the cost and control functions. Actor-critic algorithms
employ value function approximation (VFA) to evaluate the value of the current
policy. The use of functional approximators instead of lookup tables overcomes
the curse of dimensionality problem in traditional DP, which occurs when the state
space grows large. Actor-critic algorithms for the discrete-time and continuous-
time LQR problems have been described in [59]. Partial knowledge of the system
dynamics (the input coupling matrix) is needed in these methods.
The Q-learning algorithm was discussed in detail in Sect. 1.3.5. This technique
was introduced by Watkins [126] and is based on the idea of Q-function (Quality
function), which is a function of the state and the control. The main task in
the Q-learning algorithm is to estimate the optimal Q-function. Once the optimal
Q-function is found, the optimal control can be readily obtained by finding the
control action that minimizes or maximizes the optimal Q-function. The Q-function
description is more comprehensive than the value function description as it spans
the state-action space instead of the state space, which enables Q-learning to employ
only one functional approximation structure instead of two separate approximators
for critic (value function) and actor (control function). If the state space is
sufficiently explored, then the Q-learning algorithm will eventually converge to the
optimal Q-function [127]. The success of Q-learning in controls is evident from the
fact that it has provided model-free solutions to popular control problems. A Q-learning algorithm for the discrete-time linear quadratic regulation problem was
proposed in [13]. In [13], the Q-learning iterations are based on the policy iteration
method and, therefore, the knowledge of a stabilizing initial policy is required. It was
shown that, under a certain excitation condition, the Q-learning algorithm converges
to the optimal LQR controller. Later in [53], a value iteration based Q-learning LQR
algorithm was presented and convergence to the optimal controller was shown. The
requirement of a stabilizing initial policy was also obviated.
In the final part of this chapter, we highlighted some of the recent developments
and challenges in reinforcement learning control. Issues of state feedback and output
feedback were discussed in detail. Discussion on the deleterious effects of dis-
counted cost functions was provided which is crucial in automatic control from the
stability perspective. Difficulties associated with extending reinforcement learning
to solve continuous-time control problems were highlighted. More advanced and
challenging problems for reinforcement learning were also discussed briefly.
Chapter 2
Model-Free Design of Linear Quadratic Regulator
2.1 Introduction
The linear quadratic regulator (LQR) is one of the most effective formulations of
the control problem. It aims to minimize a quadratic cost function and, under some
mild conditions on the cost function and the system dynamics, leads to a linear
asymptotically stable closed-loop system. The quadratic cost function represents the
long term cost of the control and the state in the form of energies of the signals. The
LQR control problem is essentially a multi-objective optimization problem, where
it is desired to minimize both the state and the control energies. However, there is a
trade-off between these two objectives because achieving better state performance generally comes at the expense of a higher control effort.
In Chap. 1, the LQR problem was introduced briefly and a model-based solution
was presented based on dynamic programming. For the LQR problem, the Bellman
equation reduces to an algebraic Riccati equation (ARE). Since AREs are nonlinear
in the unknown parameter and, therefore, difficult to solve, iterative techniques have
been developed to solve them. In particular, iterative techniques of policy iteration
(PI) and value iteration (VI) have been developed that would find the solution
of the LQR ARE by iteratively solving a Lyapunov equation, which is a linear
equation. A Q-learning technique has also been introduced to design model-free
policy iteration and value iteration algorithms. All results introduced in Chap. 1
focus on state feedback design, which requires the measurement of the full state
for its implementation.
Output feedback control eliminates the need for full state measurement required by state feedback control and involves fewer sensors, making it cost-effective and
more reliable. Output feedback in reinforcement learning is more challenging as the
system dynamics is unknown. In particular, the key difficulty lies in the design of a
state observer. Classical state estimation techniques used in the design of observer
based output feedback control laws involve a dynamic model of the system. The
observer relies on the system model to estimate the state of the system from the input
and output measurements. While the input-output data is available, the dynamic
model is not known in a reinforcement learning problem, making the traditional
model-based state observers not applicable.
Model-free output feedback design approaches that have been reported in the
reinforcement learning literature can be classified as neural network observer based
and input-output data based methods. Neural network observer based designs
generally try to estimate the system dynamics and then reconstruct the state with
bounded estimation errors. This approach requires a separate approximator in
addition to the approximators employed for learning and control. Such a design
tends to be complicated because of the difficulty in proving stability properties
as the separation principle does not directly apply. Furthermore, the use of a separate approximation structure also makes the implementation of the design more complicated.
On the other hand, the data based output feedback design approach has attracted
more attention recently. In contrast to the neural network observer based design
approach, this approach has the advantage that no external observer is needed and,
as such, does not suffer from state estimation errors. Consequently, an optimal solution can be obtained even when full state feedback is not available. This direct output
feedback approach is based on the idea of parameterizing the state in terms of some
function of the input and output. For linear systems, the state can be reconstructed
based on a linear combination of some functions of the input and output mea-
surements (in the absence of unknown disturbances). However, incorporating this
parameterization has the potential of incurring bias in the estimation process as the
Bellman equation is modified to cater to the output feedback design.
In this chapter, we will present model-free output feedback reinforcement
learning algorithms for solving the linear quadratic regulation problem. The design
will be carried out in both discrete-time and continuous-time settings. We will
present some parameterization of the system state that would allow us to develop
new learning and control equations that do not involve the state information. It will
be shown that the output feedback control law that is learned by the RL algorithm
is the steady-state equivalent of the optimal state feedback control law. Both policy
iteration and value iteration based algorithms will be presented. For the discrete-
time problems, we will present output feedback Q-learning algorithms. On the
other hand, the treatment of the continuous-time problems will be different and will
employ ideas from integral reinforcement learning to develop the output feedback
learning equations.
2.2 Literature Review
As discussed in Sect. 2.1, the neural network observers and the input-output data
based methods have been the two key approaches in the RL literature to solving
the output feedback problems. For instance, [142] presented a suboptimal controller
using output feedback to solve the LQR problem by incorporating a neural network
observer. This, however, results in bounded estimation errors. In the same spirit, [22,
34, 67, 136, 140, 141] performed state estimation based on these observers together
with the estimation of the critic and actor networks to achieve near optimal solution.
Again, only ultimate boundedness of the estimation errors was demonstrated.
On the other hand, the state parameterization approach to output feedback RL
control has been gaining increasing attention. In particular, [1, 28, 50, 56, 80]
followed this approach to avoid the issue of estimation errors and to take the
advantage that no external observer is required. In the pioneering work of [1],
identification of the Markov parameters was used to design a data-driven output
feedback optimal controller. The authors of [56] were the first to build upon this
work and extend the idea in the RL setting by employing the VFA approach using
PI and VI output feedback RL LQR algorithms. Following the VFA approach of
[56], the authors of [50] solved the RL optimal linear quadratic tracking problem.
For the continuous-time LQR problem, a partially model-free output feedback
solution was proposed in [142]. The method requires the system to be static output
feedback stabilizable. Motivated by the work of [56], the authors of [80] solved
the continuous-time output feedback LQR problem using a model-free IRL method,
which does not require the system to be static output feedback stabilizable.
However, in all these works on state parameterization based output feedback
RL control, a discounting factor is introduced in the cost function, which helps to
diminish the effect of excitation noise bias as studied in [56]. The end result is that
the discounted controller is suboptimal and does not correspond to the solution of
the Riccati equation. More importantly, this discounting factor has the potential to
compromise system stability as reported recently in [80, 88]. The stability analysis
carried out in [88] has shown that the discounted cost function in general cannot
guarantee stability. At best, “semiglobal stability” can be achieved in the sense that
stability is guaranteed only when the discounting factor is chosen above a certain
lower bound (upper bound in the continuous-time setting). However, knowing this
bound requires the knowledge of the system dynamics, which is assumed to be
unknown in RL. Furthermore, the discounted controller loses the optimality aspect of the control problem as it does not correspond to the optimal solution of the original ARE.
Consider a discrete-time linear system given by the following state space representation,
x_{k+1} = A x_k + B u_k,
y_k = C x_k,    (2.1)
where x_k \in R^n, u_k \in R^m, and y_k \in R^p are the state, the control input, and the output, respectively. We would like to find the feedback control sequence u_k = K x_k that minimizes the long term cost [55],
V_K(x_k) = \sum_{i=k}^{\infty} r(x_i, u_i),    (2.2)
where the utility function is
r(x_k, u_k) = y_k^T Q_y y_k + u_k^T R u_k,    (2.3)
and Q_y \geq 0 and R > 0 are the performance matrices. The optimal state feedback control that minimizes (2.2) is given by
u_k^* = K^* x_k = -\left(R + B^T P^* B\right)^{-1} B^T P^* A x_k,    (2.4)
and its associated cost is V^*(x_k) = x_k^T P^* x_k, under the conditions of controllability of (A, B) and observability of (A, \sqrt{Q}), where \sqrt{Q}^T \sqrt{Q} = Q, Q = C^T Q_y C. Here, P^* = (P^*)^T is the unique positive definite solution to the following ARE,
A^T P A - P + Q - A^T P B\left(R + B^T P B\right)^{-1} B^T P A = 0.    (2.5)
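As an illustration, the nominal solution of (2.5) and the gain (2.4) can be computed numerically; the sketch below uses SciPy's solve_discrete_are, with an arbitrary second-order system and unit weights chosen purely for illustration.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Hypothetical second-order system, used only for illustration.
A = np.array([[0.5, 0.1],
              [0.0, 0.6]])
B = np.array([[0.0],
              [1.0]])
C = np.array([[1.0, 0.0]])
Qy = np.array([[1.0]])   # output weight
R  = np.array([[1.0]])   # input weight

Q = C.T @ Qy @ C                                  # Q = C^T Qy C, as in (2.5)
P_star = solve_discrete_are(A, B, Q, R)           # solves the ARE (2.5)
K_star = -np.linalg.solve(R + B.T @ P_star @ B,   # gain of (2.4)
                          B.T @ P_star @ A)
print(P_star, K_star)
```

Such a model-based computation serves only as a benchmark; the algorithms developed in this chapter avoid it by learning the controller parameters directly from data.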
Before working out our way to the model-free output feedback solution, we will
revisit some results for the state feedback LQR problem. For the sake of continuity,
we will recall the concept of Q-functions from Chap. 1 and provide details of the
LQR Q-function, which plays a fundamental role in the design of the Q-learning
algorithms that we will subsequently present.
Consider the cost function given in (2.2). Under a stabilizing feedback control policy u_k = Kx_k (not necessarily optimal), the total cost incurred when starting at time index k from state x_k is quadratic in the state as given by [13]
V_K(x_k) = x_k^T P x_k,    (2.6)
which satisfies the Bellman equation
V_K(x_k) = y_k^T Q_y y_k + u_k^T R u_k + V_K(x_{k+1}),    (2.7)
where V_K(x_{k+1}) is the cost of following policy u_k = Kx_k (or, simply, policy K) at all future time indices.
Next, we use (2.7) to define a Q-function as the sum of the one-step cost of taking an arbitrary action u_k at time index k and the total cost that would be incurred if the policy K is followed at time index k + 1 and all the subsequent time indices [113],
Q_K(x_k, u_k) = y_k^T Q_y y_k + u_k^T R u_k + V_K(x_{k+1}).    (2.8)
Since V_K is quadratic in the state, the Q-function is quadratic in x_k and u_k,
Q_K(x_k, u_k) = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T H \begin{bmatrix} x_k \\ u_k \end{bmatrix} = r(x_k, u_k) + (Ax_k + Bu_k)^T P (Ax_k + Bu_k),    (2.9)
where
H = H^T = \begin{bmatrix} Q + A^T P A & A^T P B \\ B^T P A & R + B^T P B \end{bmatrix}.
The optimal LQR controller u_k^*, which minimizes the long term cost, can be obtained by solving
\frac{\partial Q^*}{\partial u_k} = 0.
The result is the same as given in (2.4), which was obtained by solving the ARE (2.5).
Model-based iterative algorithms based on policy iteration and value iteration
were presented in Algorithms 1.3 and 1.4. The model-free versions of these
algorithms were presented in Algorithms 1.6 and 1.7, which were based on Q-
learning.
The use of the discounted cost function (2.11) does not ensure closed-loop stability. Notice that, for 0 < γ < 1, the boundedness of the discounted cost (2.11), that is,
\sum_{i=k}^{\infty} γ^{i-k} x_i^T\left(Q + K^T R K\right) x_i < \infty,
does not imply that the state x_i converges to zero, since the weighting term γ^{i-k} itself decays to zero.
Classical state estimation techniques make use of the dynamic model of the system
to reconstruct the state by means of a state observer. A state observer of a system
is essentially a user-defined dynamic system that employs some error correction
mechanism to reconstruct, based on the dynamic model of the system, the state of
the system from its input and output data. The problem of state estimation is not
straightforward when the system model is not available.
A key technique in adaptive control of unknown systems is to parameterize a
quantity in terms of the unknown parameters and the measurable variables and
use these measurable variables to estimate the unknown parameters. The following
result provides a parameterized expression of the state in terms of the measurable
input and output data that will be used to derive the output feedback learning
equations in the subsequent subsections.
Theorem 2.1 Consider system (2.1). Let the pair (A, C) be observable. Then, there exists a parameterization of the state in the form of
x_k = W_u σ_k + W_y ω_k + (A + LC)^k x_0,    (2.12)
where L is an observer gain such that A + LC is Schur stable, the parameterization matrices W_u = \begin{bmatrix} W_u^1 & W_u^2 & \cdots & W_u^m \end{bmatrix} and W_y = \begin{bmatrix} W_y^1 & W_y^2 & \cdots & W_y^p \end{bmatrix} are given in the form of
W_u^i = \begin{bmatrix} a^{i1}_{u(n-1)} & a^{i1}_{u(n-2)} & \cdots & a^{i1}_{u0} \\ a^{i2}_{u(n-1)} & a^{i2}_{u(n-2)} & \cdots & a^{i2}_{u0} \\ \vdots & \vdots & \ddots & \vdots \\ a^{in}_{u(n-1)} & a^{in}_{u(n-2)} & \cdots & a^{in}_{u0} \end{bmatrix}, \quad i = 1, 2, \cdots, m,
W_y^i = \begin{bmatrix} a^{i1}_{y(n-1)} & a^{i1}_{y(n-2)} & \cdots & a^{i1}_{y0} \\ a^{i2}_{y(n-1)} & a^{i2}_{y(n-2)} & \cdots & a^{i2}_{y0} \\ \vdots & \vdots & \ddots & \vdots \\ a^{in}_{y(n-1)} & a^{in}_{y(n-2)} & \cdots & a^{in}_{y0} \end{bmatrix}, \quad i = 1, 2, \cdots, p,
whose elements are the coefficients of the numerators in the transfer function matrix of a Luenberger observer with inputs u_k and y_k, and σ_k = \begin{bmatrix} (σ_k^1)^T & (σ_k^2)^T & \cdots & (σ_k^m)^T \end{bmatrix}^T and ω_k = \begin{bmatrix} (ω_k^1)^T & (ω_k^2)^T & \cdots & (ω_k^p)^T \end{bmatrix}^T represent the states of the user-defined dynamics driven by the individual input u_k^i and output y_k^i as given by
σ_{k+1}^i = A σ_k^i + B u_k^i, \quad σ_0^i = 0, \quad i = 1, 2, \cdots, m,
ω_{k+1}^i = A ω_k^i + B y_k^i, \quad ω_0^i = 0, \quad i = 1, 2, \cdots, p,
for a Schur matrix A whose eigenvalues coincide with those of A + LC and an input vector B of the form
A = \begin{bmatrix} -α_{n-1} & -α_{n-2} & \cdots & -α_1 & -α_0 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}, \quad B = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.
Proof From the linear systems theory, it is known that if the pair (A, C) is
observable then a full state observer can be constructed as
\hat{x}_{k+1} = A\hat{x}_k + Bu_k - L\left(y_k - C\hat{x}_k\right)
= (A + LC)\hat{x}_k + Bu_k - Ly_k,    (2.13)
where x̂k is the estimate of the state xk and L is the observer gain chosen such that
the matrix A+LC has all its eigenvalues strictly inside the unit circle. This observer
is a dynamic system driven by uk and yk with the dynamics matrix A + LC. This
dynamic system can be written in the filter form by treating both uk and yk as inputs
as follows,
\hat{x}_k = \frac{U(z)}{\Delta(z)}[u_k] + \frac{Y(z)}{\Delta(z)}[y_k] + (A + LC)^k \hat{x}_0,    (2.14)
where \Delta(z) is the characteristic polynomial of A + LC, and U(z) and Y(z) are polynomial matrices formed from \mathrm{adj}(zI - A - LC)B and -\mathrm{adj}(zI - A - LC)L, respectively. We now show that the [u_k] and [y_k] terms in (2.14) can be linearly parameterized.
Consider first each input filter term \frac{U^i(z)}{\Delta(z)}[u_k^i],
\frac{U^i(z)}{\Delta(z)}[u_k^i] = (zI - A - LC)^{-1} B_i u_k^i,
\frac{U^i(z)}{\Delta(z)}[u_k^i] = \begin{bmatrix} \dfrac{a^{i1}_{u(n-1)} z^{n-1} + a^{i1}_{u(n-2)} z^{n-2} + \cdots + a^{i1}_{u0}}{z^n + α_{n-1}z^{n-1} + α_{n-2}z^{n-2} + \cdots + α_0} \\ \dfrac{a^{i2}_{u(n-1)} z^{n-1} + a^{i2}_{u(n-2)} z^{n-2} + \cdots + a^{i2}_{u0}}{z^n + α_{n-1}z^{n-1} + α_{n-2}z^{n-2} + \cdots + α_0} \\ \vdots \\ \dfrac{a^{in}_{u(n-1)} z^{n-1} + a^{in}_{u(n-2)} z^{n-2} + \cdots + a^{in}_{u0}}{z^n + α_{n-1}z^{n-1} + α_{n-2}z^{n-2} + \cdots + α_0} \end{bmatrix} u_k^i
= W_u^i σ_k^i,
σ_{k+1}^i = A σ_k^i + B u_k^i,
where
where
A = \begin{bmatrix} -α_{n-1} & -α_{n-2} & \cdots & -α_1 & -α_0 \\ 1 & 0 & \cdots & 0 & 0 \\ 0 & 1 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}, \quad B = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.
Similarly, each output filter term can be written as
\frac{Y^i(z)}{\Delta(z)}[y_k^i] = W_y^i ω_k^i, \quad ω_{k+1}^i = A ω_k^i + B y_k^i.
Note that \Delta(z) is the characteristic polynomial of A + LC and is a stable polynomial. This implies that all the eigenvalues of A are strictly inside the unit circle and, therefore, the dynamics of σ_k and ω_k is asymptotically stable. Finally, by combining the input and output terms, we can write (2.14) as
\hat{x}_k = W_u σ_k + W_y ω_k + (A + LC)^k \hat{x}_0.    (2.15)
Since the estimation error satisfies
e_k = x_k - \hat{x}_k = (A + LC)^k e_0,
the state itself can be written as
x_k = W_u σ_k + W_y ω_k + (A + LC)^k x_0.    (2.16)
Since A + LC is Schur stable, the term (A + LC)^k \hat{x}_0 in (2.15) and the term (A + LC)^k x_0 in (2.16) vanish as k → ∞. This completes the proof.
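The parameterization of Theorem 2.1 can be checked numerically. Rather than constructing W_u and W_y explicitly, the sketch below uses the equivalent observer form (2.13) initialized at zero, whose state equals W_u σ_k + W_y ω_k by the proof above; the second-order system matrices, the input signal, and the observer eigenvalues (0.1 and 0.2) are arbitrary illustrative choices.

```python
import numpy as np
from scipy.signal import place_poles

# Hypothetical observable pair (A, C), with B and a test input for excitation.
A = np.array([[0.5, 0.1],
              [0.0, 0.6]])
B = np.array([[0.0],
              [1.0]])
C = np.array([[1.0, 0.0]])

# Choose L so that A + LC is Schur (eigenvalues 0.1 and 0.2 here).
Kp = place_poles(A.T, C.T, [0.1, 0.2]).gain_matrix
L = -Kp.T

x = np.array([[1.0], [-1.0]])    # true state, unknown to the observer
xhat = np.zeros((2, 1))          # observer started at zero: equals Wu*sigma_k + Wy*omega_k

for k in range(30):
    u = np.array([[np.sin(0.3 * k)]])            # arbitrary input signal
    y = C @ x
    xhat = (A + L @ C) @ xhat + B @ u - L @ y    # filter form of (2.13)
    x = A @ x + B @ u

# Reconstruction error equals (A + LC)^k x_0 and is negligible after the transient.
print(np.linalg.norm(x - xhat))
```

Placing the observer eigenvalues at zero (deadbeat) would make the reconstruction error vanish in at most n steps, in line with the discussion that follows.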
The state parameterization presented in Theorem 2.1 is derived based on the
Luenberger observer (2.13). The parameterization (2.16) contains a transient term
(A + LC)k x0 , which depends on the initial condition. This transient term in turn
depends on the choice of the design matrix L, which assigns the eigenvalues of
matrix A + LC. The user-defined matrix A plays the role of the matrix A + LC in
the parameterization. Ideally, we would like the observer dynamics to be as fast as
possible so that x̄k converges to xk quickly. For discrete-time systems, this can be
achieved by placing all the eigenvalues of matrix A + LC or, equivalently, matrix
A, at 0. That is, the coefficients αi of matrix A are all chosen to be zero. In this
case, it can be verified that, for any x0 , (A + LC)k x0 vanishes in no more than n
time steps. Motivated by this property of discrete-time systems, a special case of the
above parameterization was presented in [56]. This special state parameterization
result is recalled below.
Theorem 2.2 Consider system (2.1). Let the pair (A, C) be observable. Then, for k ≥ N, the system state can be uniquely represented in terms of the measured input and output data as
x_k = W_u \bar{u}_{k-1,k-N} + W_y \bar{y}_{k-1,k-N},    (2.17)
where \bar{u}_{k-1,k-N} = \begin{bmatrix} u_{k-1}^T & u_{k-2}^T & \cdots & u_{k-N}^T \end{bmatrix}^T and \bar{y}_{k-1,k-N} = \begin{bmatrix} y_{k-1}^T & y_{k-2}^T & \cdots & y_{k-N}^T \end{bmatrix}^T collect the past N inputs and outputs, and the matrices W_u and W_y are constructed from
V_N = \begin{bmatrix} (CA^{N-1})^T & \cdots & (CA)^T & C^T \end{bmatrix}^T,
U_N = \begin{bmatrix} B & AB & \cdots & A^{N-1}B \end{bmatrix},
T_N = \begin{bmatrix} 0 & CB & CAB & \cdots & CA^{N-2}B \\ 0 & 0 & CB & \cdots & CA^{N-3}B \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & CB \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}.
Remark 2.1 The parameterization matrices Wu and Wy in (2.17) are the same as
Wu and Wy in (2.12) if N = n and all eigenvalues of matrix A or, equivalently,
matrix A + LC, used in Theorem 2.1 are zero. This can be seen as follows. Recall
from the proof of Theorem 2.1 that the state can be represented by
xk = Wu σk + Wy ωk + (A + LC)k x0 . (2.18)
With all the eigenvalues of A placed at zero, \Delta(z) = z^n, and the filter states reduce to the stacked past inputs and outputs,
σ_k = \begin{bmatrix} \frac{z^{n-1}}{\Delta(z)}[u_k] \\ \frac{z^{n-2}}{\Delta(z)}[u_k] \\ \vdots \\ \frac{1}{\Delta(z)}[u_k] \end{bmatrix} = \bar{u}_{k-1,k-n},
ω_k = \begin{bmatrix} \frac{z^{n-1}}{\Delta(z)}[y_k] \\ \frac{z^{n-2}}{\Delta(z)}[y_k] \\ \vdots \\ \frac{1}{\Delta(z)}[y_k] \end{bmatrix} = \bar{y}_{k-1,k-n}.
Hence, for k ≥ N = n, the parameterization (2.18) reduces to
x_k = W_u \bar{u}_{k-1,k-N} + W_y \bar{y}_{k-1,k-N}, \quad k ≥ N,
which coincides with (2.17).

Theorem 2.3 The state parameterization matrix W = \begin{bmatrix} W_u & W_y \end{bmatrix} is of full row rank if the pair (A + LC, B) or the pair (A + LC, L) is controllable.

Proof From the proof of Theorem 2.1, the input contribution can be written as
\frac{U(z)}{\Delta(z)}[u_k] = \frac{\mathrm{adj}(zI - A - LC)B}{\Delta(z)}[u_k]
= \begin{bmatrix} D_0 B & D_1 B & \cdots & D_{n-1}B \end{bmatrix} \begin{bmatrix} \frac{1}{\Delta(z)}[u_k] \\ \frac{z}{\Delta(z)}[u_k] \\ \vdots \\ \frac{z^{n-1}}{\Delta(z)}[u_k] \end{bmatrix}
= W_u σ_k.
Here the matrices Di contain the coefficients of the adjoint matrix. It can be verified
that we can express Di in terms of the matrix A + LC and the coefficients of its
characteristic polynomial (s) as follows,
Dn−1 = I,
Dn−2 = (A + LC) + αn−1 I,
Dn−3 = (A + LC)2 + αn−1 (A + LC) + αn−2 I,
..
.
D0 = (A + LC)n−1 + αn−1 (A + LC)n−2 + · · · + α2 (A + LC) + α1 I.
Substituting the above Di ’s in the expression for Wu and analyzing the rank of the
resulting expression for Wu , we have
ρ(W_u) = ρ\left(\begin{bmatrix} (A + LC)^{n-1}B + α_{n-1}(A + LC)^{n-2}B + \cdots + α_2(A + LC)B + α_1 B & \cdots & (A + LC)B + α_{n-1}B & B \end{bmatrix}\right),
which is the controllability condition of the pair (A + LC, B). Thus, the controlla-
bility of the pair (A + LC, B) implies full row rank of matrix Wu and hence full row
rank of matrix W .
A similar analysis of the matrix Wy yields that the controllability of the pair
(A + LC, L) would imply full row rank of matrix Wy and hence full row rank of
matrix W . This completes the proof.
We note that the controllability condition of (A + LC, B) or (A + LC, L) in Theorem 2.3 is difficult to verify since it involves the observer gain matrix L, whose determination requires the knowledge of the system dynamics. Under the observability condition of (A, C), even though L can be chosen to place the eigenvalues of matrix A + LC arbitrarily, it is not easy to choose an L that satisfies the conditions of Theorem 2.3. As a result, in a model-free setting, we do not rely on Theorem 2.3 to guarantee the full row rank of W.
Substituting the state parameterization into the Q-function (2.9), with the vanishing transient term omitted, yields the output feedback Q-function
Q_K = z_k^T H z_k,    (2.19)
where
z_k = \begin{bmatrix} σ_k^T & ω_k^T & u_k^T \end{bmatrix}^T,
H = H^T \in R^{(mn+pn+m)\times(mn+pn+m)},
H = \begin{bmatrix} W^T(Q + A^T P A)W & W^T A^T P B \\ B^T P A W & R + B^T P B \end{bmatrix}.    (2.20)
Given the optimal cost function V ∗ with the cost matrix P ∗ , we obtain the
corresponding optimal output feedback matrix H ∗ by substituting P = P ∗ in (2.20).
It is worth recalling that the notion of optimality in our context is with respect to the
controller since there are an infinite number of state parameterizations depending
on the choice of user-defined dynamics matrix A. However, in the discrete-time
setting, it is common to select A to have all zero eigenvalues, which gives the dead-
beat response due to the finite-time convergence property of the discrete-time state
parameterization. In this case, the optimal output feedback Q-function is given by
Q^* = z_k^T H^* z_k.    (2.21)
Setting
\frac{\partial Q^*}{\partial u_k} = 0
yields the output feedback control law
u_k^* = -\left(H^*_{uu}\right)^{-1}\left(H^*_{uσ} σ_k + H^*_{uω} ω_k\right),    (2.22)
where H_{uσ}, H_{uω}, and H_{uu} denote the blocks of H conformal with the partition of z_k. This control law solves the LQR output feedback control problem without requiring access to the state x_k.
We now show the relation between the presented output feedback Q-function and
the output feedback value function. The output feedback value function as used is
given by
V_K = \begin{bmatrix} σ_k \\ ω_k \end{bmatrix}^T \bar{P} \begin{bmatrix} σ_k \\ ω_k \end{bmatrix},    (2.23)
where
\bar{P} = \begin{bmatrix} W_u^T P W_u & W_u^T P W_y \\ W_y^T P W_u & W_y^T P W_y \end{bmatrix}.
The value function (2.23), by definition (2.6), gives the cost of executing the policy
\bar{K} = K \begin{bmatrix} W_u & W_y \end{bmatrix} = -(H_{uu})^{-1}\begin{bmatrix} H_{uσ} & H_{uω} \end{bmatrix}.
Using the relation Q_K(x_k, Kx_k) = V_K(x_k), the output feedback value function matrix \bar{P} can be readily obtained as
\bar{P} = \begin{bmatrix} I & \bar{K}^T \end{bmatrix} H \begin{bmatrix} I & \bar{K}^T \end{bmatrix}^T.    (2.24)
Theorem 2.5 The output feedback law (2.22) is the steady-state equivalent of the
optimal LQR control law (2.4).
Proof Consider the output feedback control law (2.22). Substitution of H^*_{uσ} = B^T P^* A W_u, H^*_{uω} = B^T P^* A W_y, and H^*_{uu} = R + B^T P^* B in (2.22) results in
u_k^* = -\left(R + B^T P^* B\right)^{-1}\left(B^T P^* A W_u σ_k + B^T P^* A W_y ω_k\right).
As k → ∞, the state parameterization (2.12) reduces to
x_k = W_u σ_k + W_y ω_k,
since A + LC is Schur stable. Thus, the output feedback controller (2.22) is the
steady-state equivalent of
u_k^* = -\left(R + B^T P^* B\right)^{-1} B^T P^* A x_k,
which is the optimal state feedback control law (2.4). This completes the proof.
It should be noted at this point that the output feedback learning equation (2.27)
is precise for k ≥ N when A + LC is nilpotent, which is always achievable since
(A, C) is observable. The Q-function matrix H is unknown and is to be learned. We
can separate the unknowns in H by parameterizing (2.19) as
Q_K = \bar{H}^T \bar{z}_k,    (2.30)
where \bar{H} = \mathrm{vec}(H) \in R^{l(l+1)/2} collects the distinct entries of the symmetric matrix H (with the symmetric off-diagonal entries combined), l = mn + pn + m, and \bar{z}_k is the quadratic basis obtained from the Kronecker product z_k \otimes z_k with the redundant terms removed,
\bar{z}_k = \begin{bmatrix} z_{k1}^2 & z_{k1}z_{k2} & \cdots & z_{k1}z_{kl} & z_{k2}^2 & z_{k2}z_{k3} & \cdots & z_{k2}z_{kl} & \cdots & z_{kl}^2 \end{bmatrix}^T.
Substituting (2.30) into the Bellman equation and stacking the equations obtained over the collected data samples results in a linear system of equations of the form
Φ \bar{H} = Υ,    (2.32)
where Φ and Υ are data matrices constructed from the measurements, and \bar{H} is the unknown vector to be found. In order to solve this linear equation, we require at least L ≥ l(l + 1)/2 data samples.
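A sketch of how the quadratic regressor and the least-squares solve can be implemented is given below. The convention of doubling the off-diagonal products in the regressor, the use of Φ and Υ as names for the data matrices, and the helper names are our own choices for illustration; the construction of z_{k+1} (in particular, taking u_{k+1} as the action of the policy being evaluated) is assumed to be handled by the caller.

```python
import numpy as np

def zbar(z):
    # Quadratic regressor: products z_i z_j for i <= j, with off-diagonal products
    # doubled so that zbar(z) @ hbar equals z^T H z when hbar stacks the
    # upper-triangular entries of the symmetric matrix H row by row.
    l = len(z)
    return np.array([(1.0 if i == j else 2.0) * z[i] * z[j]
                     for i in range(l) for j in range(i, l)])

def unpack_H(hbar, l):
    # Rebuild the symmetric matrix H from its stacked upper-triangular entries.
    H = np.zeros((l, l))
    H[np.triu_indices(l)] = hbar
    return H + H.T - np.diag(np.diag(H))

def policy_evaluation(Z, Znext, U, Y, Qy, R):
    # Least-squares solution of the Bellman equation used in Algorithm 2.1:
    #   hbar^T (zbar_k - zbar_{k+1}) = y_k^T Qy y_k + u_k^T R u_k.
    Phi = np.array([zbar(zk) - zbar(zk1) for zk, zk1 in zip(Z, Znext)])
    Ups = np.array([y @ Qy @ y + u @ R @ u for u, y in zip(U, Y)])
    hbar, *_ = np.linalg.lstsq(Phi, Ups, rcond=None)
    return unpack_H(hbar, len(Z[0]))
```

Once H has been recovered, the policy update only requires partitioning H conformally with z_k = [σ_k^T ω_k^T u_k^T]^T and forming K̄ = -(H_uu)^{-1}[H_uσ H_uω].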
In what follows, we present reinforcement learning techniques of policy iteration
and value iteration to learn the output feedback LQR control law.
The policy iteration algorithm, Algorithm 2.1, consists of two steps. The policy
evaluation step uses the parameterized learning equation (2.31) to solve for H̄ by
collecting L ≥ l(l + 1)/2 observations of uk , yk , σk , ωk , σk+1 , and ωk+1 to form the
data matrices. Then, the solution of (2.31) is obtained as
\bar{H}^j = \left(Φ^T Φ\right)^{-1} Φ^T Υ,    (2.33)
where H̄ j gives the Q-function matrix associated with the j th policy. In the policy
j +1
update step, we obtain a better policy uk by minimizing the Q-function of the
j th policy. From the estimation perspective, a difficulty arises due to the linear
dependence of u_k in (2.22) on σ_k and ω_k. On the other hand, the data matrix Φ already has columns formed from σ_k and ω_k. As a result, the column entries corresponding to u_k become linearly dependent, as they are formed by a linear combination of σ_k and ω_k through the matrix H. This implies that Φ^T Φ is singular. To make Φ^T Φ nonsingular and the solution (2.33) unique, we add an excitation signal in u_k. That is, we need to satisfy the following rank condition,
\mathrm{rank}(Φ) = \frac{l(l + 1)}{2}.    (2.34)
Algorithm 2.1 Output feedback Q-learning policy iteration algorithm for the LQR
problem
input: input-output data
output: H ∗
1: initialize. Select an admissible policy u0k . Set j ← 0.
2: repeat
3: policy evaluation. Solve the following Bellman equation for \bar{H}^j,
(\bar{H}^j)^T(\bar{z}_k - \bar{z}_{k+1}) = y_k^T Q_y y_k + u_k^T R u_k.
4: policy update. Update the control policy as
u_k^{j+1} = -(H^j_{uu})^{-1}\left(H^j_{uσ}σ_k + H^j_{uω}ω_k\right).
5: j ← j + 1
6: until \|\bar{H}^j - \bar{H}^{j-1}\| < ε for some small ε > 0.
where ε > 0 is some small constant. The convergence of the output feedback PI
algorithm is established in the theorem below.
Theorem 2.6 Let (A, B) be controllable, (A, \sqrt{Q}) be observable, and u_k^0 be a stabilizing initial control. Then, the sequence of policies \{\bar{K}^j\}, j = 1, 2, 3, \ldots, converges to the optimal output feedback policy \bar{K}^* as j → ∞ provided that the state parameterization matrix W is of full row rank and the rank condition (2.34) holds.
Proof By Theorem 2.5, the optimal output feedback control law (2.22) converges
to the optimal state feedback control law (2.4). Thus, we need to show that the
policy iterations on the output feedback cost matrix P̄ (or the Q-function matrix H )
and output control matrix K̄ converge to their optimal values. Recall the following
iterations from Algorithm 2.1,
z_k^T H^j z_k = y_k^T Q_y y_k + u_k^T R u_k + z_{k+1}^T H^j z_{k+1},    (2.36)
\bar{K}^{j+1} = -\left(H^j_{uu}\right)^{-1}\begin{bmatrix} H^j_{uσ} & H^j_{uω} \end{bmatrix}.    (2.37)
If the rank condition (2.34) holds, then we can solve a system of equations based on
Equations (2.36) and (2.37) to obtain H j and K̄ j +1 . From the definition of H , we
have
H^j = \begin{bmatrix} W^T(Q + A^T P^j A)W & W^T A^T P^j B \\ B^T P^j A W & R + B^T P^j B \end{bmatrix}.    (2.38)
By Theorem 2.3 (or Theorem 2.4) we have the full row rank of W , which results in
T T
P j = Q + A + BK j P j A + BK j + K j RK j . (2.39)
Algorithm 2.2 Output feedback Q-learning value iteration algorithm for the LQR
problem
input: input-output data
output: H ∗
1: initialize. Select an arbitrary policy u0k and H 0 ≥ 0. Set j ← 0.
2: repeat
3: value update. Solve the following Bellman equation for \bar{H}^{j+1},
(\bar{H}^{j+1})^T\bar{z}_k = y_k^T Q_y y_k + u_k^T R u_k + (\bar{H}^j)^T\bar{z}_{k+1}.
4: policy update. Update the control policy as
u_k^{j+1} = -(H^{j+1}_{uu})^{-1}\left(H^{j+1}_{uσ}σ_k + H^{j+1}_{uω}ω_k\right).
5: j ← j + 1
6: until \|\bar{H}^j - \bar{H}^{j-1}\| < ε for some small ε > 0.
These matrices are used to obtain the least-squares solution given by (2.33). The
rank condition (2.34) must be met by the addition of an excitation noise in the
control uk . The convergence of the output feedback VI algorithm is established in
the following theorem.
Theorem 2.7 Let (A, B) be controllable and (A, \sqrt{Q}) be observable. Then, the sequence of policies \{\bar{K}^j\}, j = 1, 2, 3, \ldots, converges to the optimal output feedback policy \bar{K}^* as j → ∞ provided that the state parameterization matrix W is of full row rank and the rank condition (2.34) holds.
Proof By Theorem 2.5, the optimal output feedback control law (2.22) converges
to the optimal state feedback control law (2.4). Thus, we need to show that the value
iterations on the output feedback cost matrix P̄ (or the Q-function matrix H ) and
the output feedback control matrix K̄ converge to their optimal values. Recall the
following recursion from Algorithm 2.2,
z_k^T H^{j+1} z_k = y_k^T Q_y y_k + u_k^T R u_k + z_{k+1}^T H^j z_{k+1}.    (2.41)
If the rank condition (2.34) holds, then we can solve Equation (2.41) to obtain H^{j+1}. Based on Theorem 2.1 (or Theorem 2.2) and Equations (2.26) and (2.27), the terms on the right-hand side of Equation (2.41) can be written as
z_k^T H^{j+1} z_k = y_k^T Q_y y_k + u_k^T R u_k + x_{k+1}^T P^j x_{k+1}
= z_k^T \begin{bmatrix} W^T(Q + A^T P^j A)W & W^T A^T P^j B \\ B^T P^j A W & R + B^T P^j B \end{bmatrix} z_k,
Since
\bar{K}^{j+1} = -\left(H^{j+1}_{uu}\right)^{-1}\begin{bmatrix} H^{j+1}_{uσ} & H^{j+1}_{uω} \end{bmatrix} = -\left(R + B^T P^j B\right)^{-1} B^T P^j A W,
and
\bar{P}^j = W^T P^j W,
From Theorem 2.3 (or Theorem 2.4), we have the full row rank of W , which results
in
P^{j+1} = Q + A^T P^j A - A^T P^j B\left(R + B^T P^j B\right)^{-1} B^T P^j A.
The above equation gives recursions in terms of the ARE (2.5) that converge to P^* as j → ∞ under the standard controllability and observability assumptions on (A, B) and (A, \sqrt{Q}), respectively [52]. This implies that \bar{P}^j converges to \bar{P}^*, and therefore, from the definition of H in (2.20), we have the convergence of H^j to H^*. Then, by the relation
\bar{K} = -(H_{uu})^{-1}\begin{bmatrix} H_{uσ} & H_{uω} \end{bmatrix},
the sequence of policies \bar{K}^j converges to the optimal output feedback policy \bar{K}^*. This completes the proof.
Let Ĥ be the estimate of the Q-function matrix H obtained using û. From (2.9), it
follows that
\begin{bmatrix} x_k \\ \hat{u}_k \end{bmatrix}^T \hat{H} \begin{bmatrix} x_k \\ \hat{u}_k \end{bmatrix} = r(x_k, \hat{u}_k) + (Ax_k + B\hat{u}_k)^T P (Ax_k + B\hat{u}_k).    (2.43)
We expand (2.43) and separate out the noise dependent terms involving νk ,
\begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \hat{H} \begin{bmatrix} x_k \\ u_k \end{bmatrix} + x_k^T A^T P B ν_k + ν_k^T B^T P A x_k + ν_k^T R u_k + ν_k^T B^T P B u_k + u_k^T R ν_k + u_k^T B^T P B ν_k + ν_k^T R ν_k + ν_k^T B^T P B ν_k
= x_k^T Q x_k + u_k^T R u_k + ν_k^T R ν_k + ν_k^T R u_k + u_k^T R ν_k + (Ax_k + Bu_k)^T P (Ax_k + Bu_k) + x_k^T A^T P B ν_k + u_k^T B^T P B ν_k + ν_k^T B^T P A x_k + ν_k^T B^T P B u_k + ν_k^T B^T P B ν_k.
As can be readily seen, the noise dependent terms get canceled out and we have
\begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \hat{H} \begin{bmatrix} x_k \\ u_k \end{bmatrix} = x_k^T Q x_k + u_k^T R u_k + (Ax_k + Bu_k)^T P (Ax_k + Bu_k).    (2.44)
In light of Equations (2.25), (2.26), and (2.19), we have the noise-free output
feedback Bellman equation
Fig. 2.1 Actual states x1, x2 and their reconstructed estimates x̂1, x̂2 versus time (sec)
Let Qy = 1 and R = 1 be the weights in the utility function (2.3). The eigenvalues
of the open-loop system are 0.5 and 0.6. We first verify the state reconstruction result
of Theorem 2.1. Let the characteristic polynomial of the observer be (z) = z2 . We
apply a sinusoidal signal to the system and compare the actual state trajectory with
that of the reconstructed state using the parameterization given in Theorem 2.1. It
can be seen in Fig. 2.1 that the estimated state converges exponentially to the true
state.
We use the PI algorithm as the system is open-loop stable. Sinusoids of different
frequencies and magnitudes are added in the control to satisfy the excitation con-
dition. We compare here the state feedback Q-learning algorithm (Algorithm 1.6),
the value function approximation (VFA) based output feedback method [56] and
the output feedback Q-learning algorithm (Algorithm 2.1). By solving the Riccati
equation (2.5), we obtain the optimal control matrices for the state feedback control
law (Algorithm 1.6) as
H^*_{ux} = \begin{bmatrix} 0.3100 & -0.3151 \end{bmatrix},
H^*_{uu} = 2.0504.
For the VFA based output feedback method [56], the control parameters p_0^*, p_u^*, and p_y^* are obtained from the output feedback value function matrix
\bar{P}^* = W^T P^* W = \begin{bmatrix} p_0^* & p_u^* & p_y^* \\ (p_u^*)^T & P_{22}^* & P_{23}^* \\ (p_y^*)^T & P_{32}^* & P_{33}^* \end{bmatrix},
as
p_0^* = 1.0504,
p_u^* = -0.8079,
p_y^* = \begin{bmatrix} 1.1179 & -0.0253 \end{bmatrix}.
On the other hand, the nominal values of the Q-learning based output feedback algorithm (Algorithm 2.1) are computed by solving ARE (2.5) as
H^*_{uσ} = \begin{bmatrix} 0.3100 & -0.9635 \end{bmatrix},
H^*_{uω} = \begin{bmatrix} 0.9895 & -0.3613 \end{bmatrix},
H^*_{uu} = 2.0504.
The state trajectories under these three different methods are shown in Figs. 2.2, 2.3,
and 2.4, respectively. The convergence of the parameter estimates under these three
methods are shown in Figs. 2.5, 2.6, and 2.7, respectively. The final parameter
estimates obtained are
Ĥux = 0.3100 −0.3151 ,
Ĥuu = 2.0504,
p̂0 = 1.0559,
p̂u = −0.8181,
p̂y = 1.1323 −0.2959 ,
Fig. 2.2 Example 2.1: State trajectory of the closed-loop system under state feedback Q-learning
Fig. 2.3 Example 2.1: State trajectory of the closed-loop system under output feedback value
function learning [56]
Ĥuu = 2.0506,
Fig. 2.4 Example 2.1: State trajectory of the closed-loop system under output feedback Q-learning
Fig. 2.5 Example 2.1: Convergence of the parameter estimates under state feedback Q-learning
It should be noted that, in contrast to [56], we did not introduce a discounting factor in the proposed method (Algorithms 2.1 and 2.2). Moreover, no bias problem is observed in the proposed scheme. Furthermore, the excitation noise can be removed once the convergence criterion is satisfied.
Fig. 2.6 Example 2.1: Convergence of the parameter estimates under output feedback value
function learning [56]
Fig. 2.7 Example 2.1: Convergence of the parameter estimates under output feedback Q-learning
p0∗ = 2.0467,
pu∗ = −1.2713,
Fig. 2.8 Example 2.2 State trajectory of the closed-loop system under state feedback Q-learning
py∗ = 3.8129 −1.9578 ,
for the output feedback Q-learning algorithm (Algorithm 2.2). We use the VI
algorithm as the system is unstable. The initial estimate is H 0 = I . The excitation
condition is ensured by adding sinusoidal probing noises. The convergence criterion
of ε = 0.01 was chosen for all three algorithms. Seven data samples were
collected for the state feedback algorithm, that is, L = 7, whereas L = 18
for the output feedback algorithm. The state trajectories under the state feedback
Q-learning, the VFA based output feedback method [56], and the output feedback Q-
learning method are shown in Figs. 2.8, 2.9, and 2.10, respectively. The convergence
of the parameter estimates under these three different methods are shown in
Figs. 2.11, 2.12, and 2.13, respectively. The final parameter estimates obtained are
Ĥux = 2.5416 −1.5759 ,
Ĥuu = 3.0466,
p̂0 = 1.3639,
p̂u = −0.6771,
Fig. 2.9 Example 2.2: State trajectory of the closed-loop system under output feedback value
function learning [56]
Fig. 2.10 Example 2.2: State trajectory of the closed-loop system under the proposed output
feedback Q-learning
p̂y = 2.5069 −1.1909 ,
Ĥuu = 3.0396,
for the output feedback Q-learning algorithm. It can be seen that both the state feedback Q-learning algorithm and the output feedback Q-learning algorithm converge to the solution of the undiscounted ARE (2.5), whereas the parameter estimates in the VFA based output feedback method [56] differ even further from their nominal values.
Fig. 2.11 Example 2.2: Convergence of the parameter estimates under state feedback Q-learning
Fig. 2.12 Example 2.2: Convergence of the parameter estimates under output feedback value
function learning [56]
Example 2.3 (A Balance Beam System) We consider the balance beam system [64,
135] as shown in Fig. 2.16. It serves as a test platform for magnetic bearing systems.
Two magnetic coils are located at the ends of a metal beam. The coil currents serve as the control inputs that generate forces to balance the beam. The motion of the beam
is restricted to ±0.013 rad and proximity sensors are used to measure the beam tilt
angle. This system is modeled by the following continuous-time state equation,
Fig. 2.13 Example 2.2: Convergence of the parameter estimates under the proposed output
feedback Q-learning
Fig. 2.14 Example 2.2: State trajectory of the closed-loop system under output feedback with a
better choice of initial estimates
with
A_c = \begin{bmatrix} 0 & 1 \\ 9248 & -1.635 \end{bmatrix}, \quad B_c = \begin{bmatrix} 0 \\ 281.9 \end{bmatrix}, \quad C_c = \begin{bmatrix} 1 & 0 \end{bmatrix},
Fig. 2.15 Example 2.2: Convergence of the parameter estimates under output feedback with a
better choice of initial estimates
where x1 (t) = θ (t) and x2 (t) = θ̇ (t) are the angular displacement and angular
velocity, respectively, and u(t) is the control current that is applied on top of a
fixed bias current to generate a differential electromagnetic force between the two
coils. We discretize the model with a sampling period of 0.5 ms. The discretized
system model is in the form of system (2.1) with
A = \begin{bmatrix} 1.0012 & 0.0005 \\ 4.6239 & 1.0003 \end{bmatrix}, \quad B = \begin{bmatrix} 0.00004 \\ 0.14094 \end{bmatrix}, \quad C = \begin{bmatrix} 1 & 0 \end{bmatrix}.
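The discretized matrices above can be reproduced with a standard zero-order-hold discretization; a short sketch using scipy.signal.cont2discrete:

```python
import numpy as np
from scipy.signal import cont2discrete

Ac = np.array([[0.0, 1.0],
               [9248.0, -1.635]])
Bc = np.array([[0.0],
               [281.9]])
Cc = np.array([[1.0, 0.0]])
Ts = 0.5e-3                          # 0.5 ms sampling period

A, B, C, D, _ = cont2discrete((Ac, Bc, Cc, np.zeros((1, 1))), Ts, method='zoh')
print(np.round(A, 4))    # approx. [[1.0012, 0.0005], [4.6239, 1.0003]]
print(np.round(B, 5))    # approx. [[0.00004], [0.14094]]
```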
The optimal state feedback gain obtained by solving the ARE (2.5) is
K^* = \begin{bmatrix} -64.0989 & -0.6608 \end{bmatrix},
and the corresponding nominal output feedback controller parameters are
H^*_{uσ} = \begin{bmatrix} 0.1049 & 0.0537 \end{bmatrix},
H^*_{uω} = \begin{bmatrix} 1599 & -1523 \end{bmatrix},
H^*_{uu} = 1.1001.
Because the system is unstable, we use the VI algorithm. Since we already have
a validated experimental model, we initialize the controller with 60% parametric
uncertainty (0.4 times the nominal model). The excitation requirement is met by
the addition of sinusoidal noises. The system response is shown in Fig. 2.17. The
controller parameters converge to the optimal parameters and are given as
Ĥuσ = 0.1049 0.0537 ,
Ĥuω = 1599 −1523 ,
Ĥuu = 1.1000.
Note that, in this example, it took around 40 iterations to converge to the optimal
parameters as the initial estimation error was large. In each iteration, L = 30 data
samples were collected. The convergence of the parameter estimates is shown in
Fig. 2.18. Comparing this result with the discounted cost function approach of [56],
we find that the discounted controller gain Kγ results in an unstable closed-loop
system. However, using the output feedback Q-learning method (Algorithm 2.2),
the closed-loop stability is preserved and the controller converges to the optimal
LQR solution.
Example 2.4 (A Higher Order System) In this example, we further test the output
feedback Q-learning scheme on a higher order practical system. Consider the power
system example for the load frequency control of an electric system [121]. A
Fig. 2.17 Example 2.3: State trajectory of the closed-loop system under output feedback
Fig. 2.18 Example 2.3: Convergence of the parameter estimates under output feedback
practical problem arises when the actual power plant parameters are not precisely
known, yet an optimal feedback controller is desired. We discretize the following
continuous-time plant with a sampling period of 0.1s,
A_c = \begin{bmatrix} -0.0665 & 11.5 & 0 & 0 \\ 0 & -2.5 & 2.5 & 0 \\ -9.5 & 0 & -13.736 & -13.736 \\ 0.6 & 0 & 0 & 0 \end{bmatrix}, \quad B_c = \begin{bmatrix} 0 \\ 0 \\ 13.736 \\ 0 \end{bmatrix}, \quad C_c = \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix}.
Fig. 2.19 Example 2.4: State trajectory of the closed-loop system under output feedback
Ĥuσ = 0.6214 0.1190 −0.8929 −0.1551 ,
Ĥuω = 10.1889 −18.4709 10.4217 −1.5110 ,
Ĥuu = 1.5542.
Fig. 2.20 Example 2.4: Convergence of the parameter estimates under output feedback
for which the new nominal output feedback optimal controller parameters are
∗
Huσ = 1.7900 −1.4292 ,
∗
Huω = 3.4329 −1.1434 ,
∗
Huu = 2.7032.
Under the output feedback Q-learning algorithm (Algorithm 2.2), the closed-loop
system maintains stability as shown in Fig. 2.21. Furthermore, the controller adapts
to the new system dynamics and converges to the new optimal controller as
Ĥuσ = 1.7583 −1.3983 ,
Ĥuω = 3.3643 −1.1187 ,
Ĥuu = 2.6787.
The parameter convergence is shown in Fig. 2.22. We see that, after the second
iteration, the controller begins to adapt to the new system dynamics and converges
to the new optimal parameters in 7 iterations. In each iteration, L = 20 data samples
were collected.
Fig. 2.21 Example 2.5: State trajectory of the closed-loop system with changing dynamics and
under output feedback
Fig. 2.22 Example 2.5: Convergence of the parameter estimates with changing dynamics and
under output feedback
Consider a continuous-time linear system given by
\dot{x} = Ax + Bu,
y = Cx,    (2.45)
where x \in R^n, u \in R^m, and y \in R^p are the state, the control input, and the output, respectively. The LQR problem is to find the state feedback control law
u^* = K^*x    (2.46)
that minimizes a quadratic cost of the running utility r(x, u) = y^T Q_y y + u^T R u, with Q_y \geq 0 and R > 0. The cost associated with a stabilizing gain K is quadratic in the state,
V(x) = x^T P x,    (2.49)
and the optimal feedback gain is given by
K^* = -R^{-1} B^T P^*,    (2.50)
where P^* > 0 is the unique positive definite solution to the following ARE [55],
A^T P + P A + Q - P B R^{-1} B^T P = 0,    (2.51)
with Q = C^T Q_y C.
The key equation in Algorithm 2.3 is the Bellman equation, which is a Lyapunov
equation and is easier to solve than the ARE (2.51). The algorithm essentially con-
sists of a policy evaluation step followed by a policy update step. We first compute
the cost P^j of the control policy K^j by solving the Lyapunov equation (2.52). In
the second step, we compute an updated policy K j +1 . It has been proven in [51]
that given a stabilizing initial policy K 0 , the successive iterations on the Lyapunov
equation converge to the optimal solution P ∗ and K ∗ .
Algorithm 2.3 requires a stabilizing initial policy. For an open-loop stable system,
the stabilizing initial policy K 0 can be set to zero. However, for an unstable system,
it is difficult to obtain such a stabilizing policy when the system dynamics is unknown.
Algorithm 2.3 Model-based policy iteration for solving the LQR problem
input: system dynamics (A, B)
output: P ∗ and K ∗
1: initialize. Select an admissible policy K 0 such that A + BK 0 is Hurwitz. Set j ← 0.
2: repeat
3: policy evaluation. Solve the following Bellman equation for P^j,
(A + BK^j)^T P^j + P^j(A + BK^j) + Q + (K^j)^T R K^j = 0.    (2.52)
4: policy update. Update the control gain as
K^{j+1} = -R^{-1}B^T P^j.    (2.53)
5: j ← j + 1
6: until \|P^j - P^{j-1}\| < ε for some small ε > 0.
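A minimal sketch of Algorithm 2.3, assuming the system matrices and a stabilizing initial gain K^0 are available, is given below; it alternates the Lyapunov solve (2.52) with the update (2.53) until the cost matrix converges. The tolerances and iteration cap are arbitrary illustrative choices.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def model_based_pi(A, B, Q, R, K0, tol=1e-9, max_iter=100):
    """Sketch of Algorithm 2.3 starting from a stabilizing gain K0."""
    K, P_prev = K0, None
    for _ in range(max_iter):
        Acl = A + B @ K
        # Policy evaluation (2.52): (A+BK)^T P + P (A+BK) + Q + K^T R K = 0
        P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))
        # Policy update (2.53): K <- -R^{-1} B^T P
        K = -np.linalg.solve(R, B.T @ P)
        if P_prev is not None and np.linalg.norm(P - P_prev) < tol:
            break
        P_prev = P
    return P, K
```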
To obviate this requirement, value iteration algorithms are used that
perform recursive updates on the cost matrix P instead of solving the Lyapunov
equation in every iteration.
In [11], a VI algorithm was proposed for the continuous-time LQR problem. We recall the following definitions before introducing the VI algorithm. Let \{B_q\}_{q=0}^{\infty} be some bounded nonempty sets that satisfy
B_q \subseteq B_{q+1}, \quad q \in \mathbb{Z}_+,
and
\lim_{q\to\infty} B_q = \mathcal{P}_n^+,
where \mathcal{P}_n^+ is the set of n-dimensional positive definite matrices. For example,
B_q = \left\{P \in \mathcal{P}_n^+ : |P| \leq q + 1\right\}, \quad q = 0, 1, 2, \ldots
Algorithm 2.4 Model-based value iteration for solving the LQR ARE
Input: system dynamics (A, B)
Output: P ∗
Initialization. P 0 > 0, j ← 0, q ← 0.
1: loop
2: P̃^{j+1} ← P^j + ϵ_j\left(A^T P^j + P^j A + Q - P^j B R^{-1} B^T P^j\right).
3: if P̃^{j+1} ∉ B_q then
4: P^{j+1} ← P^0
5: q ← q + 1
6: else if \|P̃^{j+1} - P^j\| / ϵ_j < ε, for some small ε > 0, then
7: return P j as P ∗
8: else
9: P j +1 ← P̃ j +1
10: end if
11: j ←j +1
12: end loop
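The following is an illustrative sketch of Algorithm 2.4. The choice of the bounding sets B_q as norm balls, the step-size sequence, and the tolerances are arbitrary assumptions made for illustration; practical use may require tuning.

```python
import numpy as np

def model_based_vi(A, B, Q, R, P0, eps=1e-4, max_iter=200000):
    """Sketch of Algorithm 2.4: recursive Riccati-residual updates with resets."""
    P, q = P0.copy(), 0
    for j in range(max_iter):
        eps_j = 1.0 / (j + 10)                       # step sizes decaying to zero
        resid = A.T @ P + P @ A + Q - P @ B @ np.linalg.solve(R, B.T @ P)
        P_new = P + eps_j * resid
        if np.linalg.norm(P_new) > q + 1:            # left B_q: reset and enlarge set
            P, q = P0.copy(), q + 1
        elif np.linalg.norm(P_new - P) / eps_j < eps:
            return P                                 # approximate convergence to P*
        else:
            P = P_new
    return P
```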
In contrast, the Bellman equation for the discrete-time problems (2.7) is a recursion
between two consecutive values of the cost function and does not involve the system
dynamics. Recently, the idea of integral reinforcement learning (IRL) [121] has been
used to overcome this difficulty. The IRL Bellman equation is given by
V(x(t)) = \int_t^{t+T} r\left(x(τ), Kx(τ)\right) dτ + V(x(t + T)).
The idea of using an interval integral in the learning equation has been successfully
used to design RL based control algorithms and will be adopted here to solve
continuous-time output feedback RL control problems. It is worth mentioning that
the above learning equation is employed in an on-policy setting. In such algorithms,
the behavioral policy that is employed during the learning phase follows the policy
that has been learned. Since we are interested in learning the linear feedback policy,
the behavioral policy is confined to this structural constraint. A downside to this
method is that it is hard to achieve sufficient exploration of the state and action
space. Often, episodic learning involving resetting of system states is used in this
class of algorithms.
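The IRL Bellman equation above can be verified numerically for a known closed-loop system: with P obtained from the Lyapunov equation of A + BK, the interval cost plus the terminal value reproduces the value at the start of the interval. The system, gain, initial state, and horizon below are arbitrary illustrative choices.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov
from scipy.integrate import solve_ivp, trapezoid

# Hypothetical closed-loop system used to check the IRL Bellman equation.
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
Q = np.diag([1.0, 0.0])
R = np.array([[1.0]])
K = np.array([[-1.0, -1.0]])                 # some stabilizing gain

Acl = A + B @ K
P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))

x0, T = np.array([1.0, -0.5]), 0.5
sol = solve_ivp(lambda t, x: Acl @ x, (0.0, T), x0,
                dense_output=True, rtol=1e-9, atol=1e-12)
xT = sol.y[:, -1]

# Numerical quadrature of the running cost r(x, Kx) over [0, T]
ts = np.linspace(0.0, T, 2001)
xs = sol.sol(ts)
r = np.einsum('it,ij,jt->t', xs, Q + K.T @ R @ K, xs)

lhs = x0 @ P @ x0
rhs = trapezoid(r, ts) + xT @ P @ xT
print(lhs, rhs)   # the two sides agree up to integration error
```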
Differently from the on-policy approach, off-policy learning methods have been
presented in the literature that solve the optimal control problems without requiring
model information. The learning equation in such methods involves an explicit control term that is not restricted to the feedback policy being learned, which allows a good degree of freedom in the choice of exploration. As a result, the exploration
bias issue does not arise, which is different from the output feedback bias issue,
as will be seen next. In the control setting, some pioneering developments along
this line of work were first made in the continuous-time setting in [42] to solve
the state feedback LQR problem. This model-free state feedback algorithm based
on an off-policy policy iteration algorithm is presented in Algorithm 2.5. Recently,
a number of optimal control problems have been solved based on this formulation
both in the continuous-time and the discrete-time settings [18, 48, 79, 80]. Interested
readers can refer to [43] for more discussions on this topic. The continuous-time
output feedback results in this book build upon the state feedback off-policy learning
equations based on this approach.
Although Algorithm 2.5 is model-free, it requires a stabilizing initial control
policy, similar to the model-based PI algorithm, Algorithm 2.3. This could be quite
a restrictive requirement when the open-loop system is unstable and the system
model is not available. Recently, a model-free ADP value iteration algorithm,
Algorithm 2.6, was proposed that overcomes this situation [11].
It can be seen that the model-free Algorithm 2.6 does not require a stabilizing
control policy for its initialization. The algorithm is based on the recursive learning
equation (2.55), which is used to find the unknown matrices H j = AT P j + P j A
and K j = −R −1 B T P j .
Both model-free algorithms discussed in this subsection make use of the full
state, which is not always available in practical scenarios. To obviate the requirement
of a measurement of the full internal state of the system, we next propose a dynamic
output feedback scheme to solve the model-free LQR problem. It will be shown that
the proposed scheme is immune to the exploration noise bias and does not require a
discounted cost function. As a result, the closed-loop stability and optimality of the
solution are ensured.
ẋ = (A + BK)x
5: j← j +1
6: until P j − P j −1 < ε for some small ε > 0.
Algorithm 2.6 Model-free state feedback based continuous-time LQR value itera-
tion algorithm
Input: input-state data
Output: P ∗
Initialization. Select P 0 > 0 and set j ← 0, q ← 0.
Collect Online Data. Apply u0 to the system to collect online data for t ∈ [t0 , tl ], where
tl = t0 + lT and T is the interval length. Based on this data, perform the following iterations,
1: loop
2: Find the solution, H j and K j , of the following equation,
3: P̃^{j+1} ← P^j + ϵ_j\left(H^j + C^T Q_y C - (K^j)^T R K^j\right)
4: if P̃^{j+1} ∉ B_q then
5: P^{j+1} ← P^0
6: q ← q + 1
7: else if \|P̃^{j+1} - P^j\| / ϵ_j < ε then
8: return P j as P ∗
9: else
10: P j +1 ← P̃ j +1
11: end if
12: j ←j +1
13: end loop
where
\bar{y} = \begin{bmatrix} y^T(t) & y^T(t - T) & \cdots & y^T(t - (N - 1)T) \end{bmatrix}^T,
G = \begin{bmatrix} C^T & e^{-T(A+BK)^T}C^T & \cdots & e^{-(N-1)T(A+BK)^T}C^T \end{bmatrix}^T.
The method is elegant; however, it assumes that u = Kx and, therefore, does not
consider the excitation noise in the input. Thus, introducing excitation noise violates
the above relations, which ultimately leads to bias in the parameter estimates in the
ADP learning equation. As a result, the estimated control parameters converge to
sub-optimal parameters [80]. To address this problem, a discounting factor γ is
typically introduced in the cost function [56, 80], which helps to suppress the noise
bias. The resulting discounted cost function takes the form of
V(x(t)) = \int_t^{\infty} e^{-γ(τ-t)} r\left(x(τ), u(τ)\right) dτ.    (2.56)
The introduction of the discounting factor γ , however, changes the solution of the
Riccati equation, and the resulting discounted control is no longer optimal. More
importantly, the introduction of the discounting factor does not guarantee closed-
loop stability as the original optimal control (2.46) would. To see this, we note that,
for a time function α(t) and a discounting factor γ > 0, the boundedness
\int_t^{\infty} e^{-γ(τ-t)} α(τ)\,dτ < \infty
can hold even when α(t) does not converge to zero.
where L is the observer gain such that A + LC is Hurwitz, the parameterization matrices W_u = \begin{bmatrix} W_u^1 & W_u^2 & \cdots & W_u^m \end{bmatrix} and W_y = \begin{bmatrix} W_y^1 & W_y^2 & \cdots & W_y^p \end{bmatrix} are given in the form of
W_u^i = \begin{bmatrix} a^{i1}_{u(n-1)} & a^{i1}_{u(n-2)} & \cdots & a^{i1}_{u0} \\ a^{i2}_{u(n-1)} & a^{i2}_{u(n-2)} & \cdots & a^{i2}_{u0} \\ \vdots & \vdots & \ddots & \vdots \\ a^{in}_{u(n-1)} & a^{in}_{u(n-2)} & \cdots & a^{in}_{u0} \end{bmatrix}, \quad i = 1, 2, \cdots, m,
W_y^i = \begin{bmatrix} a^{i1}_{y(n-1)} & a^{i1}_{y(n-2)} & \cdots & a^{i1}_{y0} \\ a^{i2}_{y(n-1)} & a^{i2}_{y(n-2)} & \cdots & a^{i2}_{y0} \\ \vdots & \vdots & \ddots & \vdots \\ a^{in}_{y(n-1)} & a^{in}_{y(n-2)} & \cdots & a^{in}_{y0} \end{bmatrix}, \quad i = 1, 2, \cdots, p,
whose elements are the coefficients of the numerators in the transfer function matrix of a Luenberger observer with inputs u(t) and y(t), and ζ_u = \begin{bmatrix} (ζ_u^1)^T & (ζ_u^2)^T & \cdots & (ζ_u^m)^T \end{bmatrix}^T and ζ_y = \begin{bmatrix} (ζ_y^1)^T & (ζ_y^2)^T & \cdots & (ζ_y^p)^T \end{bmatrix}^T represent the
states of the user-defined dynamics driven by the individual input u^i(t) and output y^i(t) as given by
\dot{ζ}_u^i = A ζ_u^i + B u^i(t), \quad ζ_u^i(0) = 0, \quad i = 1, 2, \cdots, m,
\dot{ζ}_y^i = A ζ_y^i + B y^i(t), \quad ζ_y^i(0) = 0, \quad i = 1, 2, \cdots, p,
where the pair (A, B) is in the controllable canonical form defined in the proof below.
Proof Under the observability condition of (A, C), the estimate of the state, x̂(t),
can be obtained based on the following observer,
\dot{\hat{x}}(t) = A\hat{x}(t) + Bu(t) - L\left(y(t) - C\hat{x}(t)\right)
= (A + LC)\hat{x}(t) + Bu(t) - Ly(t),    (2.58)
where L is a user-defined observer gain selected such that matrix A+LC is Hurwitz.
The system input and output serve as the inputs to the observer, which can be written in a filter notation as follows,
\hat{x}(t) = \frac{U(s)}{\Delta(s)} u + \frac{Y(s)}{\Delta(s)} y + e^{(A+LC)t}\hat{x}(0),    (2.59)
where \Delta(s) is the characteristic polynomial of A + LC, and U(s) and Y(s) are polynomial matrices formed from \mathrm{adj}(sI - A - LC)B and -\mathrm{adj}(sI - A - LC)L, respectively.
We first consider the contribution to \hat{x}(t) from the input u. Note that the term \frac{U^i(s)}{\Delta(s)} u^i in (2.59) is given by
\frac{U^i(s)}{\Delta(s)} u^i = (sI - A - LC)^{-1} B_i u^i,    (2.60)
\frac{U^i(s)}{\Delta(s)} u^i = \begin{bmatrix} \dfrac{a^{i1}_{n-1}s^{n-1} + a^{i1}_{n-2}s^{n-2} + \cdots + a^{i1}_{0}}{s^n + α_{n-1}s^{n-1} + α_{n-2}s^{n-2} + \cdots + α_0} \\ \dfrac{a^{i2}_{n-1}s^{n-1} + a^{i2}_{n-2}s^{n-2} + \cdots + a^{i2}_{0}}{s^n + α_{n-1}s^{n-1} + α_{n-2}s^{n-2} + \cdots + α_0} \\ \vdots \\ \dfrac{a^{in}_{n-1}s^{n-1} + a^{in}_{n-2}s^{n-2} + \cdots + a^{in}_{0}}{s^n + α_{n-1}s^{n-1} + α_{n-2}s^{n-2} + \cdots + α_0} \end{bmatrix} u^i
= \begin{bmatrix} a^{i1}_{0} & a^{i1}_{1} & \cdots & a^{i1}_{n-1} \\ a^{i2}_{0} & a^{i2}_{1} & \cdots & a^{i2}_{n-1} \\ \vdots & \vdots & \ddots & \vdots \\ a^{in}_{0} & a^{in}_{1} & \cdots & a^{in}_{n-1} \end{bmatrix} \begin{bmatrix} \frac{1}{\Delta(s)} u^i \\ \frac{s}{\Delta(s)} u^i \\ \vdots \\ \frac{s^{n-1}}{\Delta(s)} u^i \end{bmatrix}
= W_u^i ζ_u^i(t),
where the parameterization matrix W_u^i \in R^{n\times n} contains the coefficients of the polynomial vector U^i(s) and ζ_u^i \in R^n is the result of a filtering operation on the ith input signal u^i, which can also be obtained through the following dynamic system,
\dot{ζ}_u^i(t) = A ζ_u^i(t) + B u^i(t), \quad ζ_u^i(0) = 0,
where
A = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ -α_0 & -α_1 & -α_2 & \cdots & -α_{n-1} \end{bmatrix}, \quad B = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix}.
The same procedure can also be applied for the contribution to \hat{x}(t) from the ith output y^i as
\frac{Y^i(s)}{\Delta(s)} y^i = W_y^i ζ_y^i(t),
where
ζ_u = \begin{bmatrix} (ζ_u^1)^T & (ζ_u^2)^T & \cdots & (ζ_u^m)^T \end{bmatrix}^T \in R^{mn},
ζ_y = \begin{bmatrix} (ζ_y^1)^T & (ζ_y^2)^T & \cdots & (ζ_y^p)^T \end{bmatrix}^T \in R^{pn},
and
W_u = \begin{bmatrix} W_u^1 & W_u^2 & \cdots & W_u^m \end{bmatrix} \in R^{n\times mn},
W_y = \begin{bmatrix} W_y^1 & W_y^2 & \cdots & W_y^p \end{bmatrix} \in R^{n\times pn}.
Combining the input and output contributions, we can write
\hat{x}(t) = W_u ζ_u(t) + W_y ζ_y(t) + e^{(A+LC)t}\hat{x}(0),    (2.63)
and, since the estimation error satisfies e(t) = x(t) - \hat{x}(t) = e^{(A+LC)t}e(0),
x(t) = W_u ζ_u(t) + W_y ζ_y(t) + e^{(A+LC)t}x(0).    (2.64)
It can be seen that e^{(A+LC)t}\hat{x}(0) in (2.63) and e^{(A+LC)t}x(0) in (2.64) converge to zero as t → ∞ because A + LC is Hurwitz stable. This completes the proof.
Theorem 2.10 The state parameterization matrix W = \begin{bmatrix} W_u & W_y \end{bmatrix} is of full row rank if the pair (A + LC, B) or the pair (A + LC, L) is controllable.

Proof From the proof of Theorem 2.9, the input contribution can be written as
\frac{U(s)}{\Delta(s)} u = \frac{\mathrm{adj}(sI - A - LC)B}{\Delta(s)} u
= \begin{bmatrix} D_0 B & D_1 B & \cdots & D_{n-1}B \end{bmatrix} \begin{bmatrix} \frac{1}{\Delta(s)} u \\ \frac{s}{\Delta(s)} u \\ \vdots \\ \frac{s^{n-1}}{\Delta(s)} u \end{bmatrix}
= W_u ζ_u.
Here the matrices Di contain the coefficients of the adjoint matrix. It can be verified
that we can express Di in terms of the matrix A + LC and the coefficients of its
characteristic polynomial (s) as follows,
Dn−1 = I,
Dn−2 = (A + LC) + αn−1 I,
Dn−3 = (A + LC)2 + αn−1 (A + LC) + αn−2 I,
..
.
D0 = (A + LC)n−1 + αn−1 (A + LC)n−2 + · · · + α2 (A + LC) + α1 I.
Substituting the expressions for Di ’s in the expression for Wu and analyzing the
rank of Wu , we have
ρ(W_u) = ρ\left(\begin{bmatrix} (A + LC)^{n-1}B + α_{n-1}(A + LC)^{n-2}B + \cdots + α_2(A + LC)B + α_1 B & \cdots & (A + LC)B + α_{n-1}B & B \end{bmatrix}\right),
which is the controllability condition of the pair (A + LC, B). Thus, the controlla-
bility of the pair (A + LC, B) implies full row rank of matrix Wu , and hence full
row rank of matrix W .
A similar analysis of the matrix Wy yields that controllability of the pair (A +
LC, L) would also imply full row rank of matrix Wy . This completes the proof.
We note that the controllability condition of (A + LC, B) or (A + LC, L) in
Theorem 2.3 is difficult to verify since they involve the observer gain matrix L,
designing which would require the knowledge of the system dynamics. Under the
observability condition of (A, C), even though L can be chosen to place eigenvalues
of matrix A + LC arbitrarily, it is not easy to choose an L that satisfies the
conditions of Theorem 2.3. As a result, in a model-free setting, we would not rely
on Theorem 2.3 to guarantee full row rank of matrix W . It is worth pointing out that
we do not design L for the state parameterization. Instead, we form a user-defined
dynamics A that contains the desired eigenvalues of matrix A + LC. We need a
condition in terms of these eigenvalues instead of the matrix L. The following result
establishes this condition.
Theorem 2.11 The parameterization matrix W is of full row rank if matrices A and
A + LC have no common eigenvalues.
Proof By Theorem 2.10 matrix Wy has full row rank if the pair (A + LC, L)
is controllable. We will show that, if matrices A and A + LC have no common
eigenvalues, then the pair (A + LC, L) is indeed controllable. By the Popov–Belevitch–Hautus (PBH) test, the pair (A + LC, L) loses controllability if and only if q^T\begin{bmatrix} A + LC - λI & L \end{bmatrix} = 0 for a left eigenvector q associated with an eigenvalue λ of A + LC. Then, q^T(A + LC) = λq^T and q^T L = 0, and, therefore, q^T A = λq^T.
Then, λ must also be an eigenvalue of A if the pair (A + LC, L) is not controllable.
This completes the proof.
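The filtered signals ζ_u and ζ_y are the only quantities the output feedback algorithms need to integrate online. The sketch below builds the companion-form filter for a single input channel and a single output channel with Δ(s) = (s + 2)²; the signals u(t) and y(t) are placeholder test signals, not data from any particular plant.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Filter characteristic polynomial Delta(s) = (s + 2)^2 = s^2 + 4 s + 4
alpha = np.array([4.0, 4.0])          # [alpha_0, alpha_1]
n = alpha.size
Af = np.vstack([np.hstack([np.zeros((n - 1, 1)), np.eye(n - 1)]),
                -alpha.reshape(1, -1)])        # companion form used in the proof
Bf = np.zeros((n, 1)); Bf[-1, 0] = 1.0

def filt_rhs(t, zeta, signal):
    # zeta' = Af zeta + Bf * signal(t); one copy per input/output channel.
    return (Af @ zeta.reshape(-1, 1) + Bf * signal(t)).ravel()

u = lambda t: np.sin(2.0 * t)          # placeholder measured input channel
y = lambda t: np.cos(1.5 * t)          # placeholder measured output channel

t_grid = np.linspace(0.0, 5.0, 501)
zeta_u = solve_ivp(filt_rhs, (0, 5), np.zeros(n), t_eval=t_grid, args=(u,)).y
zeta_y = solve_ivp(filt_rhs, (0, 5), np.zeros(n), t_eval=t_grid, args=(y,)).y
z = np.vstack([zeta_u, zeta_y])        # regressor z(t) = [zeta_u; zeta_y] used in (2.66)
```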
Next, we aim to use this state parameterization to describe the cost function
in (2.49). Substitution of (2.57) in (2.49) results in
V = \begin{bmatrix} ζ_u \\ ζ_y \end{bmatrix}^T \begin{bmatrix} W_u & W_y \end{bmatrix}^T P \begin{bmatrix} W_u & W_y \end{bmatrix} \begin{bmatrix} ζ_u \\ ζ_y \end{bmatrix}    (2.65)
= z^T \bar{P} z,    (2.66)
where
z = \begin{bmatrix} ζ_u \\ ζ_y \end{bmatrix} \in R^N,
\bar{P} = \bar{P}^T = \begin{bmatrix} W_u^T P W_u & W_u^T P W_y \\ W_y^T P W_u & W_y^T P W_y \end{bmatrix} \in R^{N\times N},
with N = (m + p)n. In terms of the parameterized state, the state feedback control law can be expressed in steady state as
u = Kx    (2.67)
= K \begin{bmatrix} W_u & W_y \end{bmatrix} \begin{bmatrix} ζ_u \\ ζ_y \end{bmatrix}    (2.68)
= \bar{K}z,    (2.69)
where \bar{K} = K \begin{bmatrix} W_u & W_y \end{bmatrix} \in R^{m\times N}. Therefore, the optimal cost matrix is given by \bar{P}^* and the corresponding optimal output feedback control law is given by
u^* = \bar{K}^* z.    (2.70)
Theorem 2.12 The output feedback law (2.70) is the steady-state equivalent of the
optimal LQR control law (2.46).
Proof Consider the optimal output feedback LQR controller (2.70). Substituting \bar{K}^* = K^* \begin{bmatrix} W_u & W_y \end{bmatrix} in (2.70) results in
u^* = K^*\left(W_u ζ_u(t) + W_y ζ_y(t)\right).
By Theorem 2.9, W_u ζ_u(t) + W_y ζ_y(t) converges to x(t) as t → ∞ since e^{(A+LC)t}x(0) vanishes. Thus, in steady state,
u^* = K^* x.
The discussion so far in this section focused on finding the nominal solution of the
continuous-time LQR problem when the system model is known. In the following,
we will derive an output feedback learning equation that will allow us to learn the
optimal output feedback LQR controller based on the input and output data.
In view of the state parameterization, we can write the key learning equa-
tion (2.54) in the output feedback form as follows,
We can stack l instances of Equation (2.71) and write them in the following compact form,
Φ^j \begin{bmatrix} \mathrm{vecs}(\bar{P}^j) \\ \mathrm{vec}(\bar{K}^{j+1}) \end{bmatrix} = Ψ^j,    (2.72)
where Φ^j and Ψ^j are data matrices constructed from the measured data, with
\bar{Q}^j = (\bar{K}^j)^T R\bar{K}^j,
\bar{z} = \begin{bmatrix} z_1^2 & 2z_1z_2 & \cdots & z_2^2 & 2z_2z_3 & \cdots & z_N^2 \end{bmatrix}^T,
\mathrm{vecs}(\bar{P}^j) = \begin{bmatrix} \bar{P}^j_{11} & \bar{P}^j_{12} & \cdots & \bar{P}^j_{1N} & \bar{P}^j_{22} & \bar{P}^j_{23} & \cdots & \bar{P}^j_{NN} \end{bmatrix}^T.
Algorithm 2.7 Model-free output feedback policy iteration algorithm for the
continuous-time LQR problem
input: input-output data
output: P ∗ and K ∗
1: initialize. Select a stabilizing control policy u0 = K̄ 0 z + ν, where ν is an exploration signal,
and set the iteration index j ← 0.
2: collect data. Apply u0 to the system to collect online data for t ∈ [t0 , tl ], where tl = t0 + lT
and T is the interval length. Based on this data, perform the following iterations.
3: repeat
4: evaluate and improve policy. Find the solution, P̄ j and K̄ j +1 , of the following learning
equation,
5: j← j +1
6: until P̄ j − P̄ j −1 < ε for some small ε > 0.
In Algorithm 2.7, we collect only the filtered input and output data to compute
their quadratic integrals and form the data matrices. Note that we use a stabilizing
initial policy K̄ 0 to collect data, which will be reused in the subsequent iterations.
Since there are N(N + 1)/2 + mN unknowns in P̄ j and K̄ j +1 , we need l ≥ N(N +
1)/2+mN data sets to solve (2.71). Furthermore, since u = K̄z(t) depends linearly
on the input output data z, we add an exploration signal ν in u0 to find the unique
least-squares solution of (2.71). In other words, the following rank condition needs
to be satisfied,
\mathrm{rank}(Φ^j) = \frac{N(N + 1)}{2} + mN.    (2.73)
Typical examples of exploration signals include sinusoids of various frequencies and
magnitudes. Moreover, we do not require this exploration condition once parameter
convergence has been achieved.
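A typical exploration signal is a sum of sinusoids added to the control during learning and removed afterwards; a small sketch (the amplitudes and frequencies are arbitrary illustrative choices):

```python
import numpy as np

def exploration(t, amps=(0.5, 0.3, 0.2), freqs=(1.0, 3.7, 7.3)):
    """Sum-of-sinusoids exploration signal nu(t)."""
    return sum(a * np.sin(w * t) for a, w in zip(amps, freqs))

# During learning: u(t) = Kbar @ z(t) + exploration(t); after convergence the
# exploration term is simply dropped.
```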
We now address the problem of requiring a stabilizing initial control policy. It
can be seen that Algorithm 2.7 solves the output feedback LQR problem without
using any knowledge of the system dynamics. However, it requires a stabilizing
initial control policy. When the system is unstable and a stabilizing initial policy is
hard to obtain, we propose a dynamic output feedback value iteration algorithm.
To this end, consider the following Lyapunov function candidate,
V (x) = x T P x. (2.74)
\frac{d}{dt}\left(x^T(t)Px(t)\right) = (Ax(t) + Bu(t))^T P x(t) + x^T(t)P(Ax(t) + Bu(t))
= x^T(t)Hx(t) - 2(Ru(t))^T Kx(t),    (2.75)
where H = A^T P + PA and K = -R^{-1}B^T P. Integrating (2.75) over the interval [t - T, t] and adding \int_{t-T}^{t} y^T(τ)Q_y y(τ)\,dτ to both sides gives an integral learning relation in the state. Next, we use the state parameterization (2.57) in the above equation to result in
z^T(t)\bar{P}z(t) - z^T(t - T)\bar{P}z(t - T) + \int_{t-T}^{t} y^T(τ)Q_y y(τ)\,dτ
= \int_{t-T}^{t} z^T(τ)W^T\left(A^T P + PA + C^T Q_y C\right)W z(τ)\,dτ - 2\int_{t-T}^{t} (Ru(τ))^T \bar{K}z(τ)\,dτ,
or more compactly,
z^T(t)\bar{P}z(t) - z^T(t - T)\bar{P}z(t - T) + \int_{t-T}^{t} y^T(τ)Q_y y(τ)\,dτ
= \int_{t-T}^{t} z^T(τ)\bar{H}z(τ)\,dτ - 2\int_{t-T}^{t} (Ru(τ))^T \bar{K}z(τ)\,dτ,    (2.77)
where W = \begin{bmatrix} W_u & W_y \end{bmatrix} and \bar{H} = W^T\left(A^T P + PA + C^T Q_y C\right)W. Equation (2.77)
serves as the key equation for the output feedback based value iteration algorithm.
Note that (2.77) is a scalar equation, which is linear in the unknowns H̄ and K̄.
These matrices are the output feedback counterparts of the matrices H and K in
the state feedback case. As there are more unknowns than the number of equations,
we develop a system of l number of such equations by performing l finite window
integrals each of length T . To solve this system of linear equations, we define the
following data matrices,
\mathrm{rank}(Φ) = \frac{N(N + 1)}{2} + mN,    (2.80)
which can be met by injecting sufficiently exciting exploration signal in the control
input.
Before presenting the output feedback value iteration algorithm, we recall the following definitions. Let \{B_q\}_{q=0}^{\infty} be some bounded nonempty sets that satisfy B_q \subseteq B_{q+1}, q \in \mathbb{Z}_+, and \lim_{q\to\infty} B_q = \mathcal{P}_n^+, where \mathcal{P}_n^+ is the set of n-dimensional positive definite matrices. Also, let \{ϵ_j\} be the step size sequence satisfying \lim_{j\to\infty} ϵ_j = 0. With these definitions, the continuous-time model-free output feedback LQR algorithm is presented in Algorithm 2.8.
3: \tilde{\bar{P}}^{j+1} ← \bar{P}^j + ϵ_j\left(\bar{H}^j - (\bar{K}^j)^T R\bar{K}^j\right)
4: if \tilde{\bar{P}}^{j+1} ∉ B_q then
5: \bar{P}^{j+1} ← \bar{P}^0
6: q ← q + 1
7: else if \|\tilde{\bar{P}}^{j+1} - \bar{P}^j\| / ϵ_j < ε then
8: return \bar{P}^j as \bar{P}^*
9: else
10: P̄ j +1 ← P̄˜ j +1
11: end if
12: j ←j +1
13: end loop
Proof By Theorem 2.12, we know that, in steady state, the optimal output feedback
control is equivalent to optimal state feedback control. Thus, we need to show that
the output feedback algorithms converge to the optimal output feedback solution.
For the PI algorithm, consider the output feedback learning equation (2.71), which
can be written as
By the full row rank condition of W of Theorem 2.10 (or Theorem 2.11), the above
equation reduces to
\left(A + BK^j\right)^T P^j + P^j\left(A + BK^j\right) + C^T Q_y C + (K^j)^T R K^j = 0,
\bar{K}^{j+1} = -R^{-1}B^T P^j W.
By the full row rank condition of W from Theorem 2.10 (or Theorem 2.11), the
above equation reduces to
\tilde{P}^{j+1} = P^j + ϵ_j\left(A^T P^j + P^j A + C^T Q_y C - (K^j)^T R K^j\right).
Remark 2.5 In comparison with the previous output feedback LQR works based on
RL [80, 142], the output feedback VI algorithm, Algorithm 2.8, does not require a
stabilizing initial policy.
We now establish the immunity to the exploration bias problem of the output
feedback algorithms. We have the following result.
Theorem 2.14 The output feedback algorithms, Algorithms 2.7 and 2.8, are
immune to the exploration bias problem.
Consider the learning equation (2.54) with input \hat{u},
x^T(t)\hat{P}^j x(t) - x^T(t - T)\hat{P}^j x(t - T) = -\int_{t-T}^{t} x^T(τ)\left(C^T Q_y C + (\hat{K}^j)^T R\hat{K}^j\right)x(τ)\,dτ
- 2\int_{t-T}^{t} \left(\hat{u}(τ) - \hat{K}^j x(τ)\right)^T R\hat{K}^{j+1} x(τ)\,dτ.
2x T P̂ j Bν = −2ν T R K̂ j +1 x,
we have
+ x^T(t - T)\left(C^T Q_y C + (\hat{K}^j)^T R\hat{K}^j\right)x(t - T)
+ 2\left(u(t - T) - \hat{K}^j x(t - T)\right)^T R\hat{K}^{j+1} x(t - T).
We now regroup the terms and integrate back to obtain the integral form of the learning equation, which gives us the learning equation in terms of the control u, free of the exploration signal. It can be seen that Equation (2.82) is the same as the state feedback learning equation (2.54). Therefore, \hat{P}^j = P^j and \hat{K}^{j+1} = K^{j+1}.
We next consider the output feedback case. By the equivalency of the learning
equations (2.54) and (2.71) following from Theorem 2.9, we have the bias-free
output feedback equation (2.71), that is,
We now show the noise bias immunity of Algorithm 2.8. Consider the learning
equation (2.77) with the excited input û. Let P̄ˆ j , H̄ˆ j , and K̄ˆ j be the parameter
estimates obtained as a result of the excited input. We have
z^T(t)\hat{\bar{P}}^j z(t) - z^T(t - T)\hat{\bar{P}}^j z(t - T) + \int_{t-T}^{t} y^T(τ)Q_y y(τ)\,dτ
= \int_{t-T}^{t} z^T(τ)\hat{\bar{H}}^j z(τ)\,dτ - 2\int_{t-T}^{t} (R\hat{u}(τ))^T \hat{\bar{K}}^j z(τ)\,dτ.
2z^T(t)\hat{\bar{P}}^j \dot{z}(t) - 2z^T(t-T)\hat{\bar{P}}^j \dot{z}(t-T) + y^T(t)Q_y y(t) - y^T(t-T)Q_y y(t-T)
= z^T(t)\hat{\bar{H}}^j z(t) - z^T(t-T)\hat{\bar{H}}^j z(t-T) - 2\left(R\hat{u}(t)\right)^T \hat{\bar{K}}^j z(t) + 2\left(R\hat{u}(t-T)\right)^T \hat{\bar{K}}^j z(t-T).
= z^T(t)\hat{\bar{H}}^j z(t) - z^T(t-T)\hat{\bar{H}}^j z(t-T) - 2(Ru(t))^T \hat{\bar{K}}^j z(t) + 2(Ru(t-T))^T \hat{\bar{K}}^j z(t-T)
Noting that the combined filter state satisfies \dot{z} = \bar{A}z + \bar{B}η, where η = \begin{bmatrix} u^T & y^T \end{bmatrix}^T, \bar{A} = \mathrm{diag}(\bar{A}_1, \bar{A}_2), and \bar{B} = \mathrm{diag}(\bar{B}_1, \bar{B}_2),
in which each Āi and Bi is further block diagonalized, respectively, with blocks of
A and B defined in Theorem 2.9, with the number of such blocks being equal to the
number of components in the individual vectors u and y.
Using the fact that W B̄1 = B, we have
= z^T(t)\hat{\bar{H}}^j z(t) - z^T(t-T)\hat{\bar{H}}^j z(t-T) - 2(Ru(t))^T \hat{\bar{K}}^j z(t) + 2(Ru(t-T))^T \hat{\bar{K}}^j z(t-T).
Comparing (2.83) with (2.77), we have P̄ˆ j = P̄ j , H̄ˆ j = H̄ j and K̄ˆ j = K̄ j . This
establishes the bias-free property of Algorithm 2.8.
Example 2.6 (A Power System) We test the output feedback RL scheme on the load
frequency control of power systems [121]. Although power systems are nonlinear,
a linear model can be employed to develop the optimal controllers for operation
under the normal conditions. The main difficulty arises from determining the plant
parameters in order to design an optimal controller. This motivates the use of model-
free optimal control methods. We use the policy iteration algorithms (Algorithms 2.5
and 2.7) as the system is open-loop stable.
The nominal system model parameters corresponding to (2.45) are
A = \begin{bmatrix} -0.0665 & 8 & 0 & 0 \\ 0 & -3.663 & 3.663 & 0 \\ -6.86 & 0 & -13.736 & -13.736 \\ 0.6 & 0 & 0 & 0 \end{bmatrix}, \quad B = \begin{bmatrix} 0 \\ 0 \\ 13.736 \\ 0 \end{bmatrix}, \quad C = \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix}.
The control parameters are initialized to zero. We choose 100 learning intervals
of period T = 0.05s. The exploration condition is met by injecting sinusoidal
signals of different frequencies in the control. We compare the results of both the
state feedback PI algorithm (Algorithm 2.5) and the output feedback PI algorithm
(Algorithm 2.7). The state feedback results are shown in Figs. 2.23 and 2.24, and
the output feedback results are shown in Figs. 2.25 and 2.26. It can be seen that,
similar to the state feedback results, the output feedback parameters also converge
to their nominal values with performance quite close to that of the state feedback
case. However, the output feedback PI Bellman equation contains more unknown
terms than the state feedback PI Bellman equation. As a result, it takes the output
feedback algorithm longer to converge. It is worth noting that these results are
obtained without the use of a discounting factor. Furthermore, no exploration bias
is observed from the use of exploration signals, and these exploration signals can be
removed once the convergence criterion has been met.
Example 2.7 (An Unstable System) We now test the output feedback RL scheme on
an unstable system. Consider the double integrator system with
Fig. 2.23 Example 2.6: State trajectory of the closed-loop system under state feedback (Algorithm
2.5)
Fig. 2.24 Example 2.6: Convergence of the parameter estimates under state feedback (Algorithm
2.5)
Fig. 2.25 Example 2.6: State trajectory of the closed-loop system under output feedback (Algo-
rithm 2.7)
Fig. 2.26 Example 2.6: Convergence of the parameter estimates under output feedback (Algo-
rithm 2.7)
A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \quad B = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \quad C = \begin{bmatrix} 1 & 0 \end{bmatrix}.
Both state feedback and output feedback VI algorithms (Algorithms 2.6 and 2.8)
are evaluated. We choose the performance index parameters as Qy = 1 and R = 1.
The eigenvalues of matrix A are all placed at −2. The optimal control parameters
as found by solving the ARE (2.51) are
P^* = \begin{bmatrix} 1.4142 & 1.0000 \\ 1.0000 & 1.4142 \end{bmatrix}, \quad K^* = \begin{bmatrix} -1.0000 & -1.4142 \end{bmatrix},
and the corresponding state parameterization matrices are
W_u = \begin{bmatrix} 1 & 0 \\ 4 & 1 \end{bmatrix}, \quad W_y = \begin{bmatrix} 4 & 4 \\ 0 & 4 \end{bmatrix}.
The initial controller parameters are set to zero. We choose 20 learning intervals
of period T = 0.05s. We also choose the step size as
ϵ_j = \left(j^{0.2} + 5\right)^{-1}, \quad j = 0, 1, 2, \ldots
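The nominal values P^* and K^* quoted above can be reproduced directly from the ARE (2.51); a short check:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])
Qy, R = np.array([[1.0]]), np.array([[1.0]])

P_star = solve_continuous_are(A, B, C.T @ Qy @ C, R)   # expect [[1.4142, 1], [1, 1.4142]]
K_star = -np.linalg.solve(R, B.T @ P_star)             # expect [[-1.0000, -1.4142]]
print(np.round(P_star, 4), np.round(K_star, 4))
```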
Fig. 2.27 Example 2.7: State trajectory of the closed-loop system under state feedback (Algorithm
2.6)
2.5 Summary
In this chapter, a new output feedback Q-learning scheme was first presented to
solve the LQR problem for discrete-time systems. An embedded observer based
approach was presented that enables learning and control using output feedback
without requiring the knowledge of the system dynamics. To this end, we presented
Fig. 2.28 Example 2.7: Convergence of the parameter estimates under state feedback (Algorithm
2.6)
Fig. 2.29 Example 2.7: State trajectory of the closed-loop system under output feedback (Algo-
rithm 2.8)
a parameterization of the state in terms of the past input-output data. A new LQR
Q-function was presented that uses only the input-output data instead of the full
state. This Q-function was used to derive an equivalent output feedback LQR
controller. We presented output feedback Q-learning algorithms of policy iteration
and value iteration, where the latter does not require a stabilizing initial controller.
The proposed scheme has the advantage that it does not incur bias in the parameter
estimates. As a result, the need of using a discounted cost function has been obviated
and closed-loop stability is guaranteed. It was shown that the output feedback Q-
learning algorithms converge to the nominal solution as obtained by solving the
LQR ARE. A comprehensive simulation study was conducted that validates the
proposed designs.
The formulation in the case of continuous-time dynamics was found to be
quite different from its discrete-time counterpart. This was due to the fact that the
original formulation of reinforcement learning was developed for MDPs.
Fig. 2.30 Example 2.7: Convergence of the parameter estimates under output feedback (Algorithm 2.8)
Recent developments in integral reinforcement learning have enabled the design of the output
feedback algorithms presented in this chapter. In [142], a static output feedback
scheme was proposed to solve the continuous-time counterpart of this problem,
where some partial information of the system dynamics is needed. Furthermore,
this method imposes an additional condition of static output feedback stabilizability.
Later on, this stringent condition was relaxed in [80], where a completely model-
free output feedback solution to the continuous-time LQR problem was proposed.
However, similar to its discrete-time model-free counterpart, the work had to resort
to a discounted cost function. In particular, there is an upper bound (lower bound) on
the feasible discounting factor for continuous-time (discrete-time) systems, which
can only be precisely computed using the system model [80, 88].
In this chapter, a filtering based observer approach was presented for param-
eterizing the state in terms of the filtered inputs and outputs. Based on this
parameterization, we derived two new output feedback learning equations. We
considered both policy iteration and value iteration algorithms to learn the optimal
solution of the LQR problem, based on the system output measurements and
without using any system model information. Compared to previous RL works, the
proposed scheme operates entirely in continuous time. Moreover, for the value
iteration algorithm, the need of a stabilizing output feedback initial policy was
obviated, which is useful for the control design of unknown unstable systems.
It was shown that the proposed scheme is not prone to exploration bias and
thus circumvents the need of employing a discounting factor. Under the proposed
scheme, the closed-loop stability is guaranteed and the resulting output feedback
control parameters converge to the optimal solution as obtained by solving the LQR
ARE. A comprehensive simulation study was carried out to validate the presented
results.
2.6 Notes and References
The presentation in this chapter expands on our results on the discrete-time and
continuous-time output feedback LQR problems presented in [91, 94] and [96, 101],
respectively, by providing a new perspective on the rank conditions of the state
parameterizations and their due roles in the convergence of the output feedback
learning algorithms.
Chapter 3
Model-Free H∞ Disturbance Rejection
and Linear Quadratic Zero-Sum Games
3.1 Introduction
Disturbance rejection is a core problem in control theory that has long been
recognized as a motivation of feedback control. Control designs capable of rejecting
disturbances are of strong theoretical and practical interest because control systems
are often subject to external disturbances. Addressing these disturbances is of utmost
importance, as failing to do so may prevent the control objectives from being met and
may even result in instability. Owing to the significance of
this problem, the control literature has witnessed a vast variety of approaches to
addressing this issue under the paradigm of robust control. Among such approaches
is H∞ optimal control. A major portion of the robust control literature is dedicated to
this design methodology owing to its versatility in designing worst-case controllers
for a large class of dynamic systems that are prone to deleterious effects of external
disturbances.
Early designs of the H∞ control were formulated in the frequency domain based
on the sensitivity analysis and optimization techniques using the H∞ operator
norm. The frequency domain approach was found to be relatively complicated
as it involved advanced mathematical tools based on operator theory and spectral
factorization. Later developments, however, presented designs in the time-domain,
where the key machinery involved was based on the more familiar algebraic Riccati
equations similar to the ones found in the popular linear quadratic regulation (LQR)
framework. The time-domain approach also led to further extensions to cater for
more general scenarios such as those involving nonlinear dynamics, time-varying
systems, and finite horizon problems.
A striking feature in the time-domain formulation of the H∞ problem is that
it matches well with the formulation of the zero-sum game problem found in
game theory. The framework of game theory provides strong mathematical tools to
describe situations involving strategic decision makers. These situations are known
as games and the rational decision makers are referred to as the players. Each player
in the game has its own interest, which is represented in the form of its own objective
function. Depending on the nature of the game, the objective of each player could
be in conflict with the interests of the other players. The fundamental idea in game
theory is that the decision of each player not only affects its own outcomes but also
affects the outcomes of the other players. Game theory, which has been successfully
applied in diverse areas such as social science, economics, political science, and
computer science, can be used to analyze many real-world scenarios.
The connection between games and the H∞ control problem stems from the
nature of the H∞ control problem, which is formulated as a minimax dynamic
optimization problem similar to the zero-sum game problem by considering the
controller and the disturbance as two independent players who have competing, in
fact, opposite objectives. That is, the controller can be considered as a minimizing
player who minimizes a quadratic cost similar to the one encountered in the LQR
problem discussed in Chap. 2. On the other hand, different from the LQR problem,
the disturbance acts as an independent agent that tries to have a negative impact
on the control performance by maximizing the cost. As the objective functions of
the controller and the disturbance are exactly opposite, the sum of their respective
functions is identically zero, and hence the name zero-sum game. The Bellman
dynamic programming principle plays a fundamental role in solving problems in
game theory. The key step in solving these problems using dynamic programming
involves finding the solution to the Hamilton-Jacobi-Isaacs (HJI) equation,
0 = \max_{w \in \mathcal{W}} \min_{u \in \mathcal{U}} \left[ r\big(x(t), u(t), w(t)\big) + \left( \frac{\partial V}{\partial x} \right)^{T} f\big(x(t), u(t), w(t)\big) \right], \qquad (3.1)
where w(t) is the maximizing player or disturbance that influences the game
dynamics f (x(t), u(t), w(t)) as well as the cost utility r (x(t), u(t), w(t)). It is
worthwhile to note that the Hamilton-Jacobi-Isaacs equation (3.1) is a generalization
of the Hamilton-Jacobi-Bellman PDE introduced in Chap. 1, and, as a result, its
solution is generally intractable. The discrete-time version of the HJI equation is the
following nonlinear difference equation often referred to as the Isaacs equation,
V^*(x_k) = \max_{w \in \mathcal{W}} \min_{u \in \mathcal{U}} \left[ r(x_k, u_k, w_k) + V^*(x_{k+1}) \right], \qquad (3.2)
where the control input u acts as the minimizing player and the disturbance input
w acts as the maximizing player. For the H∞ control problem, the utility function
r(x_k, u_k, w_k) takes the quadratic form
r(x_k, u_k, w_k) = x_k^T Q x_k + u_k^T u_k - \gamma^2 w_k^T w_k,
where Q = C^T Q_y C. Suppose that the game algebraic Riccati equation (GARE) (3.7) admits a solution P \ge 0 satisfying
I - \gamma^{-2} E^T P E > 0. \qquad (3.8)
Then, given that the system dynamics is completely known and the full state xk is
available for feedback, there exist a unique optimal stabilizing controller u∗k = K ∗ xk
and a unique worst-case disturbance wk∗ = G∗ xk that solve the linear quadratic zero-
sum game [6], where
K^* = \left[ I + B^T P B - B^T P E \left( E^T P E - \gamma^2 I \right)^{-1} E^T P B \right]^{-1}
      \left[ B^T P E \left( E^T P E - \gamma^2 I \right)^{-1} E^T P A - B^T P A \right], \qquad (3.9)
G^* = \left[ E^T P E - \gamma^2 I - E^T P B \left( I + B^T P B \right)^{-1} B^T P E \right]^{-1}
      \left[ E^T P B \left( I + B^T P B \right)^{-1} B^T P A - E^T P A \right]. \qquad (3.10)
The value function associated with the policies satisfies the Bellman equation
V(x_k) = r(x_k, u_k, w_k) + V(x_{k+1}),
which, for the quadratic function (3.5), can be expressed in terms of the quadratic value function V(x_k) = x_k^T P x_k as
x_k^T P x_k = x_k^T Q x_k + u_k^T u_k - \gamma^2 w_k^T w_k + x_{k+1}^T P x_{k+1}.
The policies in this case are u_k = K x_k and w_k = G x_k, which gives us
\left( A + BK + EG \right)^T P \left( A + BK + EG \right) - P + Q + K^T K - \gamma^2 G^T G = 0.
That is, the Bellman equation for the zero-sum game actually corresponds to a
Lyapunov equation, which is similar to the connection that exists between the LQR
Bellman equation and a Lyapunov equation. This observation suggests that we can
apply iterations on the Lyapunov equation in the same way as they are applied on
the Bellman equation in Chap. 1. In such a case, Lyapunov iterations under the
H∞ control design conditions would converge to the solution of the GARE. A
Newton’s iteration method is often used in the literature that does exactly what a
policy iteration algorithm does, and is presented in Algorithm 3.1.
Algorithm 3.1 finds the solution of the GARE (3.7) iteratively. Instead of solving
the GARE, which is a nonlinear equation, Algorithm 3.1 only involves solving
Lyapunov equations, which are linear in the unknown matrix P j . As is the case with
other PI algorithms, Algorithm 3.1 also needs to be initialized with a stabilizing
policy. Such initialization is essential because the policy evaluation step involves
finding the positive definite solution of the Lyapunov equation, which requires the
feedback gain to be stabilizing. The algorithm is known to converge with a quadratic
convergence rate under the stated conditions.
We can also apply value iteration to find the solution of the GARE. Similar to
the LQR value iteration algorithm presented in Chap. 1, we perform recursions on
the GARE to carry out value iterations on the matrix P for value updates. That
is, instead of solving the Lyapunov equation, we only perform recursions, which
are computationally faster. The policy update step still remains the same as in
Algorithm 3.1. Under the solvability conditions of the linear quadratic zero-sum
game, the value iteration algorithm for the zero-sum game, Algorithm 3.2, converges
to the solution of the GARE.
The iterative algorithms provide a numerically feasible way of solving the
GARE. The fixed-point property of the Bellman and GARE equations enables us
to perform successive approximation of the solution in a way similar to the way
for solving the LQR problem. This successive approximation property is inherited
from the dynamic programming approach that has enabled us to break a complex
optimization problem into smaller ones. However, these methods also inherit the
requirement of complete knowledge of the system dynamics, which motivates the
model-free designs developed in the remainder of this chapter.
Algorithm 3.1 Model-based policy iteration algorithm for the discrete-time zero-sum game
input: system dynamics
output: P ∗ , K ∗ and G∗
1: initialize. Select an admissible policy K 0 such that A + BK 0 is Schur stable. Select G0 = 0. Set j ← 0.
2: repeat
3: policy evaluation. Solve the following Lyapunov equation for P j ,
\left( A + BK^j + EG^j \right)^T P^j \left( A + BK^j + EG^j \right) - P^j + Q + \left( K^j \right)^T K^j - \gamma^2 \left( G^j \right)^T G^j = 0.
4: policy improvement. Update the policies as
K^{j+1} = \left[ I + B^T P^j B - B^T P^j E \left( E^T P^j E - \gamma^2 I \right)^{-1} E^T P^j B \right]^{-1}
          \left[ B^T P^j E \left( E^T P^j E - \gamma^2 I \right)^{-1} E^T P^j A - B^T P^j A \right],
G^{j+1} = \left[ E^T P^j E - \gamma^2 I - E^T P^j B \left( I + B^T P^j B \right)^{-1} B^T P^j E \right]^{-1}
          \left[ E^T P^j B \left( I + B^T P^j B \right)^{-1} B^T P^j A - E^T P^j A \right].
5: j ← j + 1
6: until \| P^j - P^{j-1} \| < ε for some small ε > 0.
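To make the iteration concrete, here is a minimal Python sketch of Algorithm 3.1, not the book's implementation; it assumes numpy/scipy, a stabilizing initial gain K0, and uses scipy's discrete Lyapunov solver for the policy evaluation step. The function name and tolerances are placeholders.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def zero_sum_policy_iteration(A, B, E, Q, gamma, K0, tol=1e-8, max_iter=100):
    """Sketch of the model-based PI of Algorithm 3.1: evaluate (K, G) through a
    discrete Lyapunov equation, then improve both policies via (3.9)-(3.10)."""
    n = A.shape[0]
    K, G = K0, np.zeros((E.shape[1], n))
    P_prev = np.zeros((n, n))
    for _ in range(max_iter):
        Acl = A + B @ K + E @ G
        Qbar = Q + K.T @ K - gamma**2 * (G.T @ G)
        # Solves Acl.T @ P @ Acl - P + Qbar = 0
        P = solve_discrete_lyapunov(Acl.T, Qbar)
        Ruu = np.eye(B.shape[1]) + B.T @ P @ B
        Rww = E.T @ P @ E - gamma**2 * np.eye(E.shape[1])
        K = np.linalg.solve(Ruu - B.T @ P @ E @ np.linalg.solve(Rww, E.T @ P @ B),
                            B.T @ P @ E @ np.linalg.solve(Rww, E.T @ P @ A) - B.T @ P @ A)
        G = np.linalg.solve(Rww - E.T @ P @ B @ np.linalg.solve(Ruu, B.T @ P @ E),
                            E.T @ P @ B @ np.linalg.solve(Ruu, B.T @ P @ A) - E.T @ P @ A)
        if np.linalg.norm(P - P_prev) < tol:
            break
        P_prev = P
    return P, K, G
```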
Algorithm 3.2 Model-based value iteration algorithm for the discrete-time zero-sum game
input: system dynamics
output: P ∗ , K ∗ and G∗
1: initialize. Select an arbitrary policy K 0 , G0 = 0, and a value function matrix P 0 > 0. Set j ← 0.
2: repeat
3: value update. Perform the following recursion,
P^{j+1} = A^T P^j A + Q - \begin{bmatrix} A^T P^j B & A^T P^j E \end{bmatrix}
\begin{bmatrix} I + B^T P^j B & B^T P^j E \\ E^T P^j B & E^T P^j E - \gamma^2 I \end{bmatrix}^{-1}
\begin{bmatrix} B^T P^j A \\ E^T P^j A \end{bmatrix}.
4: policy improvement. Update the policies as
K^{j+1} = \left[ I + B^T P^{j+1} B - B^T P^{j+1} E \left( E^T P^{j+1} E - \gamma^2 I \right)^{-1} E^T P^{j+1} B \right]^{-1}
          \left[ B^T P^{j+1} E \left( E^T P^{j+1} E - \gamma^2 I \right)^{-1} E^T P^{j+1} A - B^T P^{j+1} A \right],
G^{j+1} = \left[ E^T P^{j+1} E - \gamma^2 I - E^T P^{j+1} B \left( I + B^T P^{j+1} B \right)^{-1} B^T P^{j+1} E \right]^{-1}
          \left[ E^T P^{j+1} B \left( I + B^T P^{j+1} B \right)^{-1} B^T P^{j+1} A - E^T P^{j+1} A \right].
5: j ← j + 1
6: until \| P^j - P^{j-1} \| < ε for some small ε > 0.
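The value-update recursion of Algorithm 3.2 can be sketched in a few lines of Python; this is illustrative only (policy improvement is identical to Algorithm 3.1 and is omitted), and the function name and stopping tolerance are assumptions.

```python
import numpy as np

def zero_sum_value_iteration(A, B, E, Q, gamma, P0, tol=1e-8, max_iter=5000):
    """Sketch of the value-update recursion in Algorithm 3.2."""
    P = P0
    for _ in range(max_iter):
        S = np.block([[np.eye(B.shape[1]) + B.T @ P @ B, B.T @ P @ E],
                      [E.T @ P @ B, E.T @ P @ E - gamma**2 * np.eye(E.shape[1])]])
        L = np.hstack([A.T @ P @ B, A.T @ P @ E])
        R = np.vstack([B.T @ P @ A, E.T @ P @ A])
        P_next = A.T @ P @ A + Q - L @ np.linalg.solve(S, R)
        if np.linalg.norm(P_next - P) < tol:
            return P_next
        P = P_next
    return P
```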
Under the control policy u_k = Kx_k and the disturbance policy w_k = Gx_k, the total cost incurred when starting with any state x_k is quadratic in the state [6], that is,
V(x_k) = x_k^T P x_k,
for some positive definite matrix P ∈ R^{n×n}. Motivated by the Bellman optimality principle, Equation (3.13) can be written recursively as
V(x_k) = r(x_k, Kx_k, Gx_k) + V(x_{k+1}),
where V(x_{k+1}) is the cost of following policies K and G in all future time indices.
We define the Q-function associated with the policies K and G as
Q_K(x_k, u_k, w_k) = r(x_k, u_k, w_k) + V_K(x_{k+1}), \qquad (3.16)
which is the sum of the one-step cost of taking an arbitrary action u_k under some disturbance w_k from state x_k and the total cost that would be incurred if the policies K and G were followed at time index k + 1 and all subsequent time indices. Note that the Q-function (3.16) is similar to the cost function (3.15) but is explicit in x_k, u_k, and w_k.
For the zero-sum game, we have a quadratic cost, and the corresponding Q-function can be expressed as
Q_K(x_k, u_k, w_k) =
\begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix}^T
\begin{bmatrix} H_{xx} & H_{xu} & H_{xw} \\ H_{ux} & H_{uu} & H_{uw} \\ H_{wx} & H_{wu} & H_{ww} \end{bmatrix}
\begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix}
= z_k^T H z_k, \qquad (3.17)
where z_k = \begin{bmatrix} x_k^T & u_k^T & w_k^T \end{bmatrix}^T and
Hxx = Q + AT P A ∈ Rn×n ,
Hxu = AT P B ∈ Rn×m1 ,
Hxw = AT P E ∈ Rn×m2 ,
Hux = B T P A ∈ Rm1 ×n ,
Huu = B T P B + I ∈ Rm1 ×m1 ,
Huw = B T P E ∈ Rm1 ×m2 ,
Hwx = E T P A ∈ Rm2 ×n ,
Hwu = E T P B ∈ Rm2 ×m1 ,
Hww = E T P E − γ 2 I ∈ Rm2 ×m2 .
Given the optimal cost V ∗ , we can compute K ∗ and G∗ . To do so, we define the
optimal Q-function as the cost of executing an arbitrary control uk and disturbance
wk , and then following the optimal policies K ∗ and G∗ , as given by
Q^*(x_k, u_k, w_k) = r(x_k, u_k, w_k) + V^*(x_{k+1}). \qquad (3.18)
That is, the optimal policies are obtained by performing the minimization and
maximization of (3.18), which in turn can be carried out by simultaneously solving
\frac{\partial Q^*}{\partial u_k} = 0, \qquad \frac{\partial Q^*}{\partial w_k} = 0,
for uk and wk . The result is the same as given in (3.9) and (3.10), which was obtained
by solving the GARE.
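As an illustration of how the gains follow from the blocks of H, here is a short Python sketch under the block structure of (3.17); it is not the book's code, and the function names are placeholders.

```python
import numpy as np

def build_H(A, B, E, P, Q, gamma):
    """Assemble the Q-function matrix H of (3.17) from P and the model matrices."""
    m1, m2 = B.shape[1], E.shape[1]
    return np.block([
        [Q + A.T @ P @ A, A.T @ P @ B,               A.T @ P @ E],
        [B.T @ P @ A,     B.T @ P @ B + np.eye(m1),  B.T @ P @ E],
        [E.T @ P @ A,     E.T @ P @ B,               E.T @ P @ E - gamma**2 * np.eye(m2)],
    ])

def policies_from_H(H, n, m1, m2):
    """Extract the minimizing and maximizing gains from the blocks of H,
    i.e., solve the stationarity conditions dQ*/du = dQ*/dw = 0."""
    Hux = H[n:n+m1, :n]; Huu = H[n:n+m1, n:n+m1]; Huw = H[n:n+m1, n+m1:]
    Hwx = H[n+m1:, :n];  Hwu = H[n+m1:, n:n+m1];  Hww = H[n+m1:, n+m1:]
    K = np.linalg.solve(Huu - Huw @ np.linalg.solve(Hww, Hwu),
                        Huw @ np.linalg.solve(Hww, Hwx) - Hux)
    G = np.linalg.solve(Hww - Hwu @ np.linalg.solve(Huu, Huw),
                        Hwu @ np.linalg.solve(Huu, Hux) - Hwx)
    return K, G
```

Note that the second function uses only the blocks of H, which is exactly what makes the Q-learning formulation model-free once H has been estimated from data.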
In the discussion so far we have obtained the form of the Q-function for the zero-sum game. The next logical step is to develop Q-learning algorithms that can learn the optimal Q-function.
From (3.15) and the definition of (3.16), we have the following relationship,
z_k^T H z_k = r(x_k, u_k, w_k) + z_{k+1}^T H z_{k+1}, \qquad (3.20)
where u_{k+1} = Kx_{k+1} and w_{k+1} = Gx_{k+1} appear in z_{k+1}. The above equation is a key learning equation for solving model-free zero-sum games. We present the following state feedback Q-learning algorithms based
on policy iteration and value iteration that employ the Q-learning Bellman equation
(3.20). These algorithms are the extensions of the state feedback LQR Q-learning
algorithms introduced in Chap. 1.
Algorithm 3.3 is a policy iteration algorithm for solving the zero-sum game
without requiring the knowledge of the system dynamics. It is an extension of the
LQR Q-learning policy iteration algorithm to the case when two decision makers,
instead of a single one, are involved. An interesting feature of this algorithm is that it
updates the policies of the two players simultaneously based on a single Q-learning
equation. The players (or the agents corresponding to the control and disturbance)
Algorithm 3.3 Q-learning policy iteration algorithm for the zero-sum game
input: input-state data
output: P ∗ , K ∗ and G∗
1: initialize. Apply a stabilizing policy uk = K 0 xk + nk and wk = νk with nk and νk being the exploration signals. Set G0 = 0 and j ← 0.
2: repeat
3: policy evaluation. Solve the following Bellman equation for H j ,
z_k^T H^j z_k = x_k^T Q x_k + u_k^T u_k - \gamma^2 w_k^T w_k + z_{k+1}^T H^j z_{k+1}.
4: policy improvement. Update the policies as
K^{j+1} = \left( H_{uu}^j - H_{uw}^j \left( H_{ww}^j \right)^{-1} H_{wu}^j \right)^{-1} \left( H_{uw}^j \left( H_{ww}^j \right)^{-1} H_{wx}^j - H_{ux}^j \right),
G^{j+1} = \left( H_{ww}^j - H_{wu}^j \left( H_{uu}^j \right)^{-1} H_{uw}^j \right)^{-1} \left( H_{wu}^j \left( H_{uu}^j \right)^{-1} H_{ux}^j - H_{wx}^j \right).
5: j ← j + 1
6: until \| H^j - H^{j-1} \| < ε for some small ε > 0.
have competing objectives and, therefore, the aim of the algorithm is to find the best
case control and the worst-case disturbance under which the system still satisfies
the H∞ performance criterion (3.6). Note that the policy updates in K j and Gj are
fed back to the policy evaluation step through the variables uk+1 = Kxk+1 and
wk+1 = Gxk+1 present in zk+1 . Variables uk and wk do not necessarily follow the
policies K and G. They follow from the definition of the Q-function. As a policy
iteration algorithm, Algorithm 3.3 requires a stabilizing initial control policy K 0 .
Subsequent iterations of these steps have been shown [60] to converge to the optimal
cost function matrix H ∗ and the optimal strategies K ∗ and G∗ under the solvability
conditions for the zero-sum game.
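One common way to implement the policy evaluation step of such Q-learning iterations is to rewrite the quadratic Bellman equation as a linear least-squares problem over collected data. The following is a minimal sketch under that interpretation; the function names, data layout, and the use of numpy's least-squares solver are my assumptions, not the book's implementation.

```python
import numpy as np

def sym_basis(z):
    """Quadratic basis with the distinct entries of z z^T, scaled so that
    z^T H z = sym_basis(z) @ vecs(H) for a symmetric matrix H."""
    l = z.shape[0]
    idx = np.triu_indices(l)
    outer = np.outer(z, z)
    scale = np.where(idx[0] == idx[1], 1.0, 2.0)   # off-diagonal products appear twice
    return scale * outer[idx]

def evaluate_policy_lstsq(Z, Z_next, utilities):
    """One policy-evaluation step: solve z_k^T H z_k - z_{k+1}^T H z_{k+1} = r_k
    in the least-squares sense. Z, Z_next have shape (L, l); utilities has shape (L,)."""
    Phi = np.array([sym_basis(zk) - sym_basis(zk1) for zk, zk1 in zip(Z, Z_next)])
    vecs_H, *_ = np.linalg.lstsq(Phi, utilities, rcond=None)
    l = Z.shape[1]
    H = np.zeros((l, l))
    H[np.triu_indices(l)] = vecs_H
    return (H + H.T) - np.diag(np.diag(H))   # rebuild the full symmetric matrix
```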
A value iteration algorithm for the model-free zero-sum game has also been developed in the literature, which relaxes the requirement of the knowledge of a stabilizing initial policy K 0 . The algorithm is recalled in Algorithm 3.4. Similar to the Q-learning value iteration for the LQR problem, this algorithm recursively updates the value matrix H towards its optimal value. The policies of the players are updated in the same fashion as in Algorithm 3.3.
An important consideration in these model-free algorithms is that they need
information of the full state xk , which is not available in our problem setting. To
circumvent this situation, we will next present a state reconstruction technique that
employs input-output and disturbance data of the system to observe the state. This
parameterization will play a key role in developing the output feedback Q-learning
equation for the zero-sum game and the associated H∞ control problem.
Algorithm 3.4 Q-learning value iteration algorithm for the zero-sum game
input: input-state data
output: P ∗ , K ∗ and G∗
1: initialize. Apply an arbitrary policy uk = K 0 xk + nk and wk = νk with nk and νk being the exploration signals. Set H 0 ≥ 0 and j ← 0.
2: repeat
3: value update. Solve the following Bellman equation for H j +1 ,
z_k^T H^{j+1} z_k = x_k^T Q x_k + u_k^T u_k - \gamma^2 w_k^T w_k + z_{k+1}^T H^j z_{k+1}.
4: policy improvement. Update the policies as
K^{j+1} = \left( H_{uu}^{j+1} - H_{uw}^{j+1} \left( H_{ww}^{j+1} \right)^{-1} H_{wu}^{j+1} \right)^{-1} \left( H_{uw}^{j+1} \left( H_{ww}^{j+1} \right)^{-1} H_{wx}^{j+1} - H_{ux}^{j+1} \right),
G^{j+1} = \left( H_{ww}^{j+1} - H_{wu}^{j+1} \left( H_{uu}^{j+1} \right)^{-1} H_{uw}^{j+1} \right)^{-1} \left( H_{wu}^{j+1} \left( H_{uu}^{j+1} \right)^{-1} H_{ux}^{j+1} - H_{wx}^{j+1} \right).
5: j ← j + 1
6: until \| H^j - H^{j-1} \| < ε for some small ε > 0.
Theorem 3.1 Consider system (3.3). Let the pair (A, C) be observable. Then, there exists a parameterization of the state in the form of
x_k = W_u \sigma_k + W_y \omega_k + W_w \upsilon_k + (A + LC)^k x_0, \qquad (3.22)
where
W_u^i = \begin{bmatrix}
a_{u(n-1)}^{i1} & a_{u(n-2)}^{i1} & \cdots & a_{u0}^{i1} \\
a_{u(n-1)}^{i2} & a_{u(n-2)}^{i2} & \cdots & a_{u0}^{i2} \\
\vdots & \vdots & \ddots & \vdots \\
a_{u(n-1)}^{in} & a_{u(n-2)}^{in} & \cdots & a_{u0}^{in}
\end{bmatrix}, \quad i = 1, 2, \cdots, m_1,
W_w^i = \begin{bmatrix}
a_{w(n-1)}^{i1} & a_{w(n-2)}^{i1} & \cdots & a_{w0}^{i1} \\
a_{w(n-1)}^{i2} & a_{w(n-2)}^{i2} & \cdots & a_{w0}^{i2} \\
\vdots & \vdots & \ddots & \vdots \\
a_{w(n-1)}^{in} & a_{w(n-2)}^{in} & \cdots & a_{w0}^{in}
\end{bmatrix}, \quad i = 1, 2, \cdots, m_2,

W_y^i = \begin{bmatrix}
a_{y(n-1)}^{i1} & a_{y(n-2)}^{i1} & \cdots & a_{y0}^{i1} \\
a_{y(n-1)}^{i2} & a_{y(n-2)}^{i2} & \cdots & a_{y0}^{i2} \\
\vdots & \vdots & \ddots & \vdots \\
a_{y(n-1)}^{in} & a_{y(n-2)}^{in} & \cdots & a_{y0}^{in}
\end{bmatrix}, \quad i = 1, 2, \cdots, p,
whose elements are the coefficients of the numerators in the transfer function matrix of a Luenberger observer with inputs u_k, w_k and y_k, and \sigma_k = [\sigma_k^1 \; \sigma_k^2 \; \cdots \; \sigma_k^{m_1}]^T, \upsilon_k = [\upsilon_k^1 \; \upsilon_k^2 \; \cdots \; \upsilon_k^{m_2}]^T and \omega_k = [\omega_k^1 \; \omega_k^2 \; \cdots \; \omega_k^{p}]^T represent the states of the user-defined dynamics driven by the individual input u_k^i, disturbance w_k^i and output y_k^i as given by
\sigma_{k+1}^i = A \sigma_k^i + B u_k^i, \quad \sigma^i(0) = 0, \quad i = 1, 2, \cdots, m_1,
\upsilon_{k+1}^i = A \upsilon_k^i + B w_k^i, \quad \upsilon^i(0) = 0, \quad i = 1, 2, \cdots, m_2,
\omega_{k+1}^i = A \omega_k^i + B y_k^i, \quad \omega^i(0) = 0, \quad i = 1, 2, \cdots, p,
for a Schur matrix A whose eigenvalues coincide with those of A+LC and an input
vector B of the following form,
A = \begin{bmatrix}
-\alpha_{n-1} & -\alpha_{n-2} & \cdots & -\alpha_0 \\
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1 \;\; 0
\end{bmatrix}, \qquad
B = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.
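As an illustration, the following is a minimal sketch of how the per-channel filter states σ_k^i, υ_k^i, ω_k^i above could be propagated numerically; the function names and the coefficient ordering of the characteristic polynomial are illustrative assumptions.

```python
import numpy as np

def companion(alpha):
    """Companion realization with characteristic polynomial
    z^n + alpha[n-1] z^(n-1) + ... + alpha[0]; Schur stable if the roots are chosen so."""
    n = len(alpha)
    A = np.zeros((n, n))
    A[0, :] = -np.asarray(alpha)[::-1]   # first row: -alpha_{n-1}, ..., -alpha_0
    A[1:, :-1] = np.eye(n - 1)           # shifted identity below
    B = np.zeros((n, 1)); B[0, 0] = 1.0
    return A, B

def step_filters(A, B, sigma, upsilon, omega, u, w, y):
    """One step of the user-defined dynamics driven by the input, disturbance,
    and output channels (lists of scalar channels)."""
    sigma   = [A @ s + B * ui for s, ui in zip(sigma, u)]
    upsilon = [A @ v + B * wi for v, wi in zip(upsilon, w)]
    omega   = [A @ o + B * yi for o, yi in zip(omega, y)]
    return sigma, upsilon, omega
```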
Proof The proof follows from the proof of the state parameterization result
presented in Chap. 2. By considering the disturbance as an additional input, given
(A, C) is observable, we can obtain a full state observer as
\hat{x}_{k+1} = A \hat{x}_k + B u_k + E w_k - L \left( y_k - C \hat{x}_k \right) = (A + LC) \hat{x}_k + B u_k + E w_k - L y_k,
where \hat{x}_k is the estimate of the state x_k and L is the observer gain chosen such that
the matrix A + LC has all its eigenvalues strictly inside the unit circle. This observer is a dynamic system driven by u_k, w_k, and y_k with the dynamics matrix A + LC. Following the arguments in the proof of the state parameterization result in Chap. 2, the contribution of the ith disturbance channel to the state estimate can be written as
W_w^i \upsilon_k^i, \quad i = 1, 2, \cdots, m_2,
where W_w^i \in \mathbb{R}^{n \times n} is the parametric matrix corresponding to the contribution to the state from the ith disturbance input w_k^i, and \upsilon_k^i can be obtained as
\upsilon_{k+1}^i = A \upsilon_k^i + B w_k^i,
with
A = \begin{bmatrix}
-\alpha_{n-1} & -\alpha_{n-2} & \cdots & -\alpha_0 \\
1 & 0 & \cdots & 0 \\
0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & 1 \;\; 0
\end{bmatrix}, \qquad
B = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.
The estimation error satisfies
e_k = x_k - \hat{x}_k = (A + LC)^k e_0,
and the observer state admits the parameterization
\hat{x}_k = W_u \sigma_k + W_w \upsilon_k + W_y \omega_k + (A + LC)^k \hat{x}_0. \qquad (3.23)
Combining the two gives
x_k = W_u \sigma_k + W_w \upsilon_k + W_y \omega_k + (A + LC)^k x_0. \qquad (3.24)
Since A + LC is Schur stable, the term (A + LC)k x̂0 in (3.23) and the term (A +
LC)k x0 in (3.24) vanish as k → ∞. This completes the proof.
It was shown in Chap. 2 that, for discrete-time systems, a special case of
the above result can be obtained if all the eigenvalues of matrix A + LC or,
equivalently, matrix A, are placed at 0. This property also pertains to the above
extended parameterization with the disturbance term. The following result presents
this special case.
Theorem 3.2 Consider system (3.3). Let the pair (A, C) be observable. Then,
the system state can be uniquely represented in terms of the input, output, and
disturbance as
x_k = W_u \bar{u}_{k-1,k-N} + W_w \bar{w}_{k-1,k-N} + W_y \bar{y}_{k-1,k-N}, \qquad (3.25)
where \bar{u}_{k-1,k-N}, \bar{w}_{k-1,k-N} and \bar{y}_{k-1,k-N} collect the past N inputs, disturbances and outputs, respectively, and
W_y = A^N \left( V_N^T V_N \right)^{-1} V_N^T, \qquad W_u = U_N - W_y T_{N1}, \qquad W_w = W_N - W_y T_{N2},
with
V_N = \begin{bmatrix} \left( CA^{N-1} \right)^T & \cdots & (CA)^T & C^T \end{bmatrix}^T,
U_N = \begin{bmatrix} B & AB & \cdots & A^{N-1}B \end{bmatrix},
W_N = \begin{bmatrix} E & AE & \cdots & A^{N-1}E \end{bmatrix},
T_{N1} = \begin{bmatrix}
0 & CB & CAB & \cdots & CA^{N-2}B \\
0 & 0 & CB & \cdots & CA^{N-3}B \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & 0 & CB \\
0 & 0 & 0 & 0 & 0
\end{bmatrix}, \qquad
T_{N2} = \begin{bmatrix}
0 & CE & CAE & \cdots & CA^{N-2}E \\
0 & 0 & CE & \cdots & CA^{N-3}E \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & 0 & CE \\
0 & 0 & 0 & 0 & 0
\end{bmatrix}.
Remark 3.1 The parameterization matrices Wu , Wy , and Ww in (3.25) are the same
as those in (3.22) if N = n and all eigenvalues of matrix A or, equivalently, matrix
A + LC, used in Theorem 3.1 are zero. This can be seen as follows. Recall from the
proof of Theorem 3.1 that the state can be represented by
x_k = W_u \sigma_k + W_y \omega_k + W_w \upsilon_k + (A + LC)^k x_0. \qquad (3.26)
In particular, with all the eigenvalues of A placed at the origin, the characteristic polynomial is \Lambda(z) = z^n, and
\upsilon_k = \begin{bmatrix}
\frac{z^{n-1}}{\Lambda(z)}[w_k] \\
\frac{z^{n-2}}{\Lambda(z)}[w_k] \\
\vdots \\
\frac{1}{\Lambda(z)}[w_k]
\end{bmatrix} = \bar{w}_{k-1,k-n}.
Remark 3.2 The condition discussed in Theorem 3.4 will not be satisfied if we
use the parameterization (3.25) and the system happens to have a zero eigenvalue.
This is due to the dead-beat nature of this special parameterization that places
the eigenvalues of matrix A + LC all at the origin. In such a case, the general parameterization (3.22) gives us extra flexibility in satisfying this condition by placing the eigenvalues of A at locations other than zero. Compared to the disturbance-free
parameterization presented in Chap. 2, the disturbance matrix Ww provides another
degree of freedom to satisfy the full rank condition of W for both parameterizations
(3.22) and (3.25).
In this subsection we will solve the zero-sum game and the associated H∞ control
problem by using only the input-output and disturbance data. No knowledge of the
system dynamics (A, B, C, E) and no measurement of the state xk are assumed
available. We now proceed to apply the state parameterization (3.25) to describe the
Q-function in (3.17) in terms of the input, output, and disturbance. It can be easily
verified that substitution of the parameterization (3.25) for xk in (3.17) results in
Q_K = \begin{bmatrix} \bar{u}_{k-1,k-N} \\ \bar{w}_{k-1,k-N} \\ \bar{y}_{k-1,k-N} \\ u_k \\ w_k \end{bmatrix}^T
\begin{bmatrix}
H_{\bar{u}\bar{u}} & H_{\bar{u}\bar{w}} & H_{\bar{u}\bar{y}} & H_{\bar{u}u} & H_{\bar{u}w} \\
H_{\bar{w}\bar{u}} & H_{\bar{w}\bar{w}} & H_{\bar{w}\bar{y}} & H_{\bar{w}u} & H_{\bar{w}w} \\
H_{\bar{y}\bar{u}} & H_{\bar{y}\bar{w}} & H_{\bar{y}\bar{y}} & H_{\bar{y}u} & H_{\bar{y}w} \\
H_{u\bar{u}} & H_{u\bar{w}} & H_{u\bar{y}} & H_{uu} & H_{uw} \\
H_{w\bar{u}} & H_{w\bar{w}} & H_{w\bar{y}} & H_{wu} & H_{ww}
\end{bmatrix}
\begin{bmatrix} \bar{u}_{k-1,k-N} \\ \bar{w}_{k-1,k-N} \\ \bar{y}_{k-1,k-N} \\ u_k \\ w_k \end{bmatrix}
= z_k^T H z_k, \qquad (3.27)
where
z_k = \begin{bmatrix} \bar{u}_{k-1,k-N}^T & \bar{w}_{k-1,k-N}^T & \bar{y}_{k-1,k-N}^T & u_k^T & w_k^T \end{bmatrix}^T,
H = H^T \in \mathbb{R}^{(m_1 N + m_2 N + pN + m_1 + m_2) \times (m_1 N + m_2 N + pN + m_1 + m_2)},
H_{\bar{u}\bar{u}} = W_u^T \left( Q + A^T P A \right) W_u \in \mathbb{R}^{m_1 N \times m_1 N},
H_{\bar{u}\bar{w}} = W_u^T \left( Q + A^T P A \right) W_w \in \mathbb{R}^{m_1 N \times m_2 N},
H_{\bar{u}\bar{y}} = W_u^T \left( Q + A^T P A \right) W_y \in \mathbb{R}^{m_1 N \times pN},
H_{\bar{u}u} = W_u^T A^T P B \in \mathbb{R}^{m_1 N \times m_1},
H_{\bar{u}w} = W_u^T A^T P E \in \mathbb{R}^{m_1 N \times m_2},
H_{\bar{w}\bar{u}} = W_w^T \left( Q + A^T P A \right) W_u \in \mathbb{R}^{m_2 N \times m_1 N},
H_{\bar{w}\bar{w}} = W_w^T \left( Q + A^T P A \right) W_w \in \mathbb{R}^{m_2 N \times m_2 N},
H_{\bar{w}\bar{y}} = W_w^T \left( Q + A^T P A \right) W_y \in \mathbb{R}^{m_2 N \times pN},
H_{\bar{w}u} = W_w^T A^T P B \in \mathbb{R}^{m_2 N \times m_1},
H_{\bar{w}w} = W_w^T A^T P E \in \mathbb{R}^{m_2 N \times m_2},
H_{\bar{y}\bar{u}} = W_y^T \left( Q + A^T P A \right) W_u \in \mathbb{R}^{pN \times m_1 N},
H_{\bar{y}\bar{w}} = W_y^T \left( Q + A^T P A \right) W_w \in \mathbb{R}^{pN \times m_2 N},
H_{\bar{y}\bar{y}} = W_y^T \left( Q + A^T P A \right) W_y \in \mathbb{R}^{pN \times pN},
H_{\bar{y}u} = W_y^T A^T P B \in \mathbb{R}^{pN \times m_1},
H_{\bar{y}w} = W_y^T A^T P E \in \mathbb{R}^{pN \times m_2},
H_{u\bar{u}} = B^T P A W_u \in \mathbb{R}^{m_1 \times m_1 N},
H_{u\bar{w}} = B^T P A W_w \in \mathbb{R}^{m_1 \times m_2 N},
H_{u\bar{y}} = B^T P A W_y \in \mathbb{R}^{m_1 \times pN},
H_{w\bar{u}} = E^T P A W_u \in \mathbb{R}^{m_2 \times m_1 N},
H_{w\bar{w}} = E^T P A W_w \in \mathbb{R}^{m_2 \times m_2 N},
H_{w\bar{y}} = E^T P A W_y \in \mathbb{R}^{m_2 \times pN},
H_{uu} = B^T P B + I \in \mathbb{R}^{m_1 \times m_1},
H_{uw} = B^T P E \in \mathbb{R}^{m_1 \times m_2},
H_{wu} = E^T P B \in \mathbb{R}^{m_2 \times m_1},
H_{ww} = E^T P E - \gamma^2 I \in \mathbb{R}^{m_2 \times m_2}.
Given the optimal cost function V ∗ with the cost matrix P ∗ , we obtain the
corresponding optimal output feedback matrix H ∗ by substituting P = P ∗ in
(3.3.3). Then, the optimal output feedback Q-function is given by
Q∗ = zkT H ∗ zk , (3.28)
The optimal output feedback policies are obtained by simultaneously solving
\frac{\partial Q^*}{\partial u_k} = 0, \qquad \frac{\partial Q^*}{\partial w_k} = 0,
for u_k and w_k, which gives
u_k^* = \left( H_{uu}^* - H_{uw}^* \left( H_{ww}^* \right)^{-1} H_{wu}^* \right)^{-1}
\Big[ H_{uw}^* \left( H_{ww}^* \right)^{-1} \big( H_{w\bar{u}}^* \bar{u}_{k-1,k-N} + H_{w\bar{w}}^* \bar{w}_{k-1,k-N} + H_{w\bar{y}}^* \bar{y}_{k-1,k-N} \big)
- \big( H_{u\bar{u}}^* \bar{u}_{k-1,k-N} + H_{u\bar{w}}^* \bar{w}_{k-1,k-N} + H_{u\bar{y}}^* \bar{y}_{k-1,k-N} \big) \Big], \qquad (3.29)
w_k^* = \left( H_{ww}^* - H_{wu}^* \left( H_{uu}^* \right)^{-1} H_{uw}^* \right)^{-1}
\Big[ H_{wu}^* \left( H_{uu}^* \right)^{-1} \big( H_{u\bar{u}}^* \bar{u}_{k-1,k-N} + H_{u\bar{w}}^* \bar{w}_{k-1,k-N} + H_{u\bar{y}}^* \bar{y}_{k-1,k-N} \big)
- \big( H_{w\bar{u}}^* \bar{u}_{k-1,k-N} + H_{w\bar{w}}^* \bar{w}_{k-1,k-N} + H_{w\bar{y}}^* \bar{y}_{k-1,k-N} \big) \Big]. \qquad (3.30)
These policies solve the output feedback zero-sum game without requiring access
to the full state xk .
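To make the point concrete, the following sketch evaluates a policy of the form (3.29) purely from the last N inputs, disturbances, and outputs; the dictionary keys used for the sub-blocks of H are my own placeholder names, not the book's notation.

```python
import numpy as np

def output_feedback_action(Hblocks, u_hist, w_hist, y_hist):
    """Evaluate the learned output feedback policy (3.29) from past data only.
    `Hblocks` is a dict holding the relevant sub-blocks of the learned matrix H."""
    ubar = np.concatenate(u_hist)   # [u_{k-1}; ...; u_{k-N}]
    wbar = np.concatenate(w_hist)
    ybar = np.concatenate(y_hist)
    Huu, Huw, Hwu, Hww = (Hblocks[k] for k in ("uu", "uw", "wu", "ww"))
    rhs_w = Hblocks["wub"] @ ubar + Hblocks["wwb"] @ wbar + Hblocks["wyb"] @ ybar
    rhs_u = Hblocks["uub"] @ ubar + Hblocks["uwb"] @ wbar + Hblocks["uyb"] @ ybar
    M = Huu - Huw @ np.linalg.solve(Hww, Hwu)
    return np.linalg.solve(M, Huw @ np.linalg.solve(Hww, rhs_w) - rhs_u)
```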
We next show that the output feedback policies (3.29) and (3.30) are equivalent
to the state feedback policies (3.9) and (3.10). To this end, we show the relation
between the presented output feedback Q-function and the output feedback value
function. The output feedback value function as used is given by
V = \begin{bmatrix} \bar{u}_{k-1,k-N} \\ \bar{w}_{k-1,k-N} \\ \bar{y}_{k-1,k-N} \end{bmatrix}^T \bar{P} \begin{bmatrix} \bar{u}_{k-1,k-N} \\ \bar{w}_{k-1,k-N} \\ \bar{y}_{k-1,k-N} \end{bmatrix}, \qquad (3.31)
where
\bar{P} = \begin{bmatrix} W_u^T P W_u & W_u^T P W_w & W_u^T P W_y \\ W_w^T P W_u & W_w^T P W_w & W_w^T P W_y \\ W_y^T P W_u & W_y^T P W_w & W_y^T P W_y \end{bmatrix}.
The value function (3.31), by definition (3.14), gives the cost of executing the
policies
\bar{K} = K \begin{bmatrix} W_u & W_w & W_y \end{bmatrix}, \qquad \bar{G} = G \begin{bmatrix} W_u & W_w & W_y \end{bmatrix}.
Using the relation Q_K(x_k, Kx_k, Gx_k) = V_K(x_k), the output feedback value function matrix \bar{P} can be readily obtained as
\bar{P} = \begin{bmatrix} I \\ \bar{K} \\ \bar{G} \end{bmatrix}^T H \begin{bmatrix} I \\ \bar{K} \\ \bar{G} \end{bmatrix}. \qquad (3.32)
Theorem 3.5 The output feedback policies given by (3.29) and (3.30) converge to
the state feedback policies (3.9) and (3.10), respectively, that solve the zero-sum
game (3.3)–(3.5).
Proof We know that the state vector can be represented by the input-output data
sequence as in (3.25). It can be easily verified that substituting (3.25) and (3.3.3)
in (3.29) and (3.30) results in the state feedback policies (3.9) and (3.10), which
solve the zero-sum game. So, the output feedback policies (3.29) and (3.30) are the
equivalent policies that also solve the zero-sum game (3.3)–(3.5). This completes
the proof.
For learning purposes, the Q-function (3.27) can be expressed linearly in the unknown parameters as
Q_K = \bar{H}^T \bar{z}_k, \qquad (3.34)
where
\bar{H} = \mathrm{vec}(H), \qquad
\bar{z}_k = z_k \otimes z_k = \begin{bmatrix} z_{k1}^2 & z_{k1}z_{k2} & \cdots & z_{k1}z_{kl} & z_{k2}^2 & z_{k2}z_{k3} & \cdots & z_{k2}z_{kl} & \cdots & z_{kl}^2 \end{bmatrix}^T,
and z_k = \begin{bmatrix} z_{k1} & z_{k2} & \cdots & z_{kl} \end{bmatrix}^T. Collecting the resulting learning equation over L data samples, it follows from Equation (3.34) that
\Phi^T \bar{H} = \Upsilon, \qquad (3.36)
where \Phi and \Upsilon are data matrices constructed from the regression vectors \bar{z}_k and the measured utilities, respectively.
Note that we require at least L ≥ l(l + 1)/2 data samples. Furthermore, since u_k and w_k are linearly dependent on the vectors \bar{u}_{k-1,k-N}, \bar{w}_{k-1,k-N} and \bar{y}_{k-1,k-N}, we add excitation signals in u_k and w_k so that all the involved vectors are linearly independent, which ensures a unique least-squares solution to (3.36). That is, we need to satisfy the following rank condition,
\mathrm{rank}(\Phi) = \frac{l(l+1)}{2}. \qquad (3.38)
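As a quick numerical check, and under the assumption that the rank condition amounts to the data matrix having full row rank, one can verify whether the collected excitation is sufficient as follows; the function names are placeholders.

```python
import numpy as np

def quad_features(z):
    """Distinct quadratic monomials of z (the vector z-bar used in (3.34))."""
    i, j = np.triu_indices(z.shape[0])
    return np.outer(z, z)[i, j]

def excitation_ok(Z):
    """Z is an (L, l) array of regression vectors z_k collected under excitation.
    Returns True if the data matrix has full row rank l(l+1)/2 (scaling of the
    off-diagonal terms does not affect the rank)."""
    Phi = np.column_stack([quad_features(z) for z in Z])   # shape (l(l+1)/2, L)
    return np.linalg.matrix_rank(Phi) == Phi.shape[0]
```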
Algorithm 3.5 Output feedback Q-learning policy iteration algorithm for the zero-sum game
input: input-output data
output: H ∗
1: initialize. Select a stabilizing output feedback policy u0k and disturbance wk0 along with their exploration signals nk and νk . Set j ← 0.
2: repeat
3: policy evaluation. Solve the following Bellman equation for H̄ j ,
\left( \bar{H}^j \right)^T \left( \bar{z}_k - \bar{z}_{k+1} \right) = y_k^T Q_y y_k + u_k^T u_k - \gamma^2 w_k^T w_k.
4: policy improvement. Update the policies as
u_k^{j+1} = \left( H_{uu}^j - H_{uw}^j (H_{ww}^j)^{-1} H_{wu}^j \right)^{-1}
\Big[ H_{uw}^j (H_{ww}^j)^{-1} \big( H_{w\bar{u}}^j \bar{u}_{k-1,k-N} + H_{w\bar{w}}^j \bar{w}_{k-1,k-N} + H_{w\bar{y}}^j \bar{y}_{k-1,k-N} \big)
- \big( H_{u\bar{u}}^j \bar{u}_{k-1,k-N} + H_{u\bar{w}}^j \bar{w}_{k-1,k-N} + H_{u\bar{y}}^j \bar{y}_{k-1,k-N} \big) \Big],
w_k^{j+1} = \left( H_{ww}^j - H_{wu}^j (H_{uu}^j)^{-1} H_{uw}^j \right)^{-1}
\Big[ H_{wu}^j (H_{uu}^j)^{-1} \big( H_{u\bar{u}}^j \bar{u}_{k-1,k-N} + H_{u\bar{w}}^j \bar{w}_{k-1,k-N} + H_{u\bar{y}}^j \bar{y}_{k-1,k-N} \big)
- \big( H_{w\bar{u}}^j \bar{u}_{k-1,k-N} + H_{w\bar{w}}^j \bar{w}_{k-1,k-N} + H_{w\bar{y}}^j \bar{y}_{k-1,k-N} \big) \Big].
5: j ← j + 1
6: until \| \bar{H}^j - \bar{H}^{j-1} \| < ε for some small ε > 0.
Algorithm 3.5 requires a stabilizing control to start with, which can be quite
restrictive when the system itself is open-loop unstable. To obviate this requirement,
a value iteration algorithm, Algorithm 3.6, is presented next.
In Algorithm 3.6, the data matrices \Phi \in \mathbb{R}^{(l(l+1)/2) \times L} and \Upsilon \in \mathbb{R}^{L \times 1} are defined by
\Phi = \begin{bmatrix} \bar{z}_{k_1} & \bar{z}_{k_2} & \cdots & \bar{z}_{k_L} \end{bmatrix},
\Upsilon = \begin{bmatrix} r^1(y_k, u_k, w_k) + \left( \bar{H}^j \right)^T \bar{z}_{k+1}^1 & r^2(y_k, u_k, w_k) + \left( \bar{H}^j \right)^T \bar{z}_{k+1}^2 & \cdots & r^L(y_k, u_k, w_k) + \left( \bar{H}^j \right)^T \bar{z}_{k+1}^L \end{bmatrix}^T.
Algorithm 3.6 Output feedback Q-learning value iteration algorithm for the zero-sum game
input: input-output data
output: H ∗
1: initialize. Apply arbitrary u0k and wk0 along with their exploration signals nk and νk . Set H 0 ≥ 0 and j ← 0.
2: repeat
3: value update. Solve the following Bellman equation for H̄ j +1 ,
\left( \bar{H}^{j+1} \right)^T \bar{z}_k = y_k^T Q_y y_k + u_k^T u_k - \gamma^2 w_k^T w_k + \left( \bar{H}^j \right)^T \bar{z}_{k+1}.
4: policy improvement. Update the policies as
u_k^{j+1} = \left( H_{uu}^{j+1} - H_{uw}^{j+1} (H_{ww}^{j+1})^{-1} H_{wu}^{j+1} \right)^{-1}
\Big[ H_{uw}^{j+1} (H_{ww}^{j+1})^{-1} \big( H_{w\bar{u}}^{j+1} \bar{u}_{k-1,k-N} + H_{w\bar{w}}^{j+1} \bar{w}_{k-1,k-N} + H_{w\bar{y}}^{j+1} \bar{y}_{k-1,k-N} \big)
- \big( H_{u\bar{u}}^{j+1} \bar{u}_{k-1,k-N} + H_{u\bar{w}}^{j+1} \bar{w}_{k-1,k-N} + H_{u\bar{y}}^{j+1} \bar{y}_{k-1,k-N} \big) \Big],
w_k^{j+1} = \left( H_{ww}^{j+1} - H_{wu}^{j+1} (H_{uu}^{j+1})^{-1} H_{uw}^{j+1} \right)^{-1}
\Big[ H_{wu}^{j+1} (H_{uu}^{j+1})^{-1} \big( H_{u\bar{u}}^{j+1} \bar{u}_{k-1,k-N} + H_{u\bar{w}}^{j+1} \bar{w}_{k-1,k-N} + H_{u\bar{y}}^{j+1} \bar{y}_{k-1,k-N} \big)
- \big( H_{w\bar{u}}^{j+1} \bar{u}_{k-1,k-N} + H_{w\bar{w}}^{j+1} \bar{w}_{k-1,k-N} + H_{w\bar{y}}^{j+1} \bar{y}_{k-1,k-N} \big) \Big].
5: j ← j + 1
6: until \| \bar{H}^j - \bar{H}^{j-1} \| < ε for some small ε > 0.
These matrices are used to obtain the least-squares solution given by (3.37). The
rank condition (3.38) must be met by the addition of excitation noises in the control
uk and the disturbance wk . Convergence of the output feedback learning algorithms,
Algorithms 3.5 and 3.6, is established in the following theorem.
Theorem 3.6 Consider system (3.3). Assume that the linear quadratic zero-sum game is solvable. Then, the output feedback Q-learning algorithms (Algorithms 3.5 and 3.6) each generate policies u_k^j, j = 1, 2, 3, \ldots, and w_k^j, j = 1, 2, 3, \ldots, that converge to the optimal output feedback policies given in (3.29) and (3.30) as j → ∞ if the rank condition (3.38) holds.
Proof The proof follows from [5], where it is shown that, under sufficient excitation, the state feedback Q-learning iterative algorithm generates policies u_k^j, j = 1, 2, 3, \ldots, and w_k^j, j = 1, 2, 3, \ldots, that converge to the optimal state feedback policies (3.9) and (3.10). By the state parameterization (3.25), we see
that the state feedback and output feedback Q-functions are equivalent and, by
Theorem 3.5, the output feedback policies (3.29) and (3.30) are equivalent to (3.9)
and (3.10), respectively. Therefore, following the result in [5], we can conclude
that, under sufficient excitation, such that the rank condition (3.38) holds, the output
feedback Q-learning algorithm generates policies that converge to the optimal
output feedback policies as j → ∞. This completes the proof.
We now show that the Q-learning scheme for solving the zero-sum game is
immune to the excitation noise bias.
Theorem 3.7 The output feedback Q-learning scheme does not incur bias in the
parameter estimates.
Proof Based on the state parameterization (3.25), we know that the output feedback
Q-function in (3.27) is equivalent to the original state feedback Q-function (3.17).
Under the excitation noise, we can write the Q-function as
Q(xk , ûk , ŵk ) = r xk , ûk , ŵk + V (xk+1 ),
where ûk = uk + nk and ŵk = wk + vk with nk and vk being the excitation noise
signals.
Let Ĥ be the estimate of H obtained using û_k and ŵ_k. It then follows from (3.17)
that
\begin{bmatrix} x_k \\ \hat{u}_k \\ \hat{w}_k \end{bmatrix}^T \hat{H} \begin{bmatrix} x_k \\ \hat{u}_k \\ \hat{w}_k \end{bmatrix}
= r\big(x_k, \hat{u}_k, \hat{w}_k\big) + \big( A x_k + B \hat{u}_k + E \hat{w}_k \big)^T P \big( A x_k + B \hat{u}_k + E \hat{w}_k \big)
= x_k^T Q x_k + u_k^T u_k - \gamma^2 w_k^T w_k + n_k^T n_k + n_k^T u_k + u_k^T n_k - \gamma^2 w_k^T v_k - \gamma^2 v_k^T w_k - \gamma^2 v_k^T v_k
+ (Ax_k + Bu_k + Ew_k)^T P (Ax_k + Bu_k + Ew_k) + (Ax_k + Bu_k + Ew_k)^T P B n_k + n_k^T B^T P B n_k
+ (B n_k)^T P (Ax_k + Bu_k + Ew_k) + n_k^T B^T P E v_k + v_k^T E^T P B n_k
+ (Ax_k + Bu_k + Ew_k)^T P E v_k + (E v_k)^T P (Ax_k + Bu_k + Ew_k) + v_k^T E^T P E v_k.
It can be easily verified that all the terms involving nk and vk get canceled on both
sides of the equation, and we are left with
\begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix}^T \hat{H} \begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix}
= r(x_k, u_k, w_k) + (Ax_k + Bu_k + Ew_k)^T P (Ax_k + Bu_k + Ew_k).
That is, we have obtained the Bellman equation in the absence of excitation noise
as given in (3.35). This completes the proof.
The three states are given by x = [x1 x2 x3 ]T , where x1 is the angle of attack, x2
is the rate of pitch, and x3 is the elevator angle of deflection. The initial state of
the system is x0 = [10 5 − 2]T . The user-defined cost function parameters are
Qy = 1 and γ = 1. The algorithm is initialized with u0k = nk and wk0 = vk .
The PE condition is ensured by adding sinusoidal noise nk of different frequencies
and amplitudes in the input uk and assuming that the disturbance wk is sufficiently
exciting. In the simulation study, we let wk = vk be sinusoidal noises of different
frequencies and amplitudes. The system order is 3, so N = 3 is selected. In
comparison with the output feedback model-free Q-learning schemes, the state
feedback case leads to a smaller H matrix but requires the measurement of the
full state. The nominal values of the state feedback control parameters are obtained
by solving the GARE (3.7) as follows,
H_{ux}^* = \begin{bmatrix} -0.0861 & -0.0708 & 0.0001 \end{bmatrix},
H_{wx}^* = \begin{bmatrix} 0.1000 & 0.0671 & -0.0001 \end{bmatrix},
H_{uu}^* = 1.0009, \qquad H_{uw}^* = -0.0008, \qquad H_{ww}^* = -0.9990.
On the other hand, the output feedback case leads to a larger H matrix, whose nominal values are
H_{u\bar{u}}^* = \begin{bmatrix} 0.0009 & -0.0006 & -0.0002 \end{bmatrix},
H_{u\bar{w}}^* = \begin{bmatrix} -0.0008 & 0.0087 & -0.0011 \end{bmatrix},
H_{u\bar{y}}^* = \begin{bmatrix} -0.9987 & 0.9443 & -0.1077 \end{bmatrix},
H_{w\bar{u}}^* = \begin{bmatrix} -0.0008 & 0.0005 & 0.0002 \end{bmatrix},
H_{w\bar{w}}^* = \begin{bmatrix} 0.0009 & -0.0084 & 0.0011 \end{bmatrix},
H_{w\bar{y}}^* = \begin{bmatrix} 0.9813 & -0.9132 & 0.1039 \end{bmatrix},
H_{uu}^* = 1.0009, \qquad H_{uw}^* = -0.0008, \qquad H_{ww}^* = -0.9990.
Consequently, it takes longer for the estimates of the output feedback parameters
to converge than in the state feedback case. We first present the result of the
state feedback and output feedback Q-learning policy iteration algorithms, Algo-
rithms 3.3 and 3.5. The final estimated state feedback control parameters obtained
by Algorithm 3.3 are
Ĥux = −0.0861 −0.0708 0.0001 ,
Ĥwx = 0.1000 0.0671 −0.0001 ,
Ĥuu = 1.0009,
Ĥuw = −0.0008,
Ĥww = −0.9990,
whereas the output feedback control parameters obtained by Algorithm 3.5 are
Ĥuū = 0.0009 −0.0006 −0.0002 ,
Ĥuw̄ = −0.0008 0.0087 −0.0011 ,
Ĥuȳ = −0.9983 0.9439 −0.1076 ,
Ĥwū = −0.0008 0.0005 0.0002 ,
Ĥww̄ = 0.0009 −0.0084 0.0011 ,
Ĥwȳ = 0.9813 −0.9132 0.1039 ,
Ĥuu = 1.0009,
Ĥuw = −0.0008,
Ĥww = −0.9990.
The closed-loop response and the convergence of the parameter estimates under
the state feedback PI algorithm, Algorithm 3.3, are shown in Figs. 3.1 and 3.2,
respectively. The corresponding results for the output feedback PI Algorithm 3.5 are
shown in Figs. 3.3 and 3.4, respectively. It can be seen that both algorithms converge
to the nominal solution. Furthermore, the output feedback algorithm circumvents
the need of full state measurement at the cost of a longer learning time as it requires
more parameters to be learnt.
We now test the state feedback and output feedback Q-learning value iteration
algorithms, Algorithms 3.4 and 3.6. Figures 3.5 and 3.6 show the closed-loop state
response and the convergence of the parameter estimates, respectively, under the
state feedback Q-learning value iteration Algorithm 3.4. The final estimated state
feedback control parameters under Algorithm 3.4 are
Ĥux = −0.0857 −0.0705 0.0001 ,
Ĥwx = 0.0997 0.0668 −0.0001 ,
Ĥuu = 1.0009,
Fig. 3.1 Example 3.1: State trajectory of the closed-loop system under the state feedback Q-learning PI algorithm, Algorithm 3.3
Fig. 3.2 Example 3.1: Convergence of the parameter estimates under the state feedback Q-learning PI algorithm, Algorithm 3.3
Ĥuw = −0.0008,
Ĥww = −0.9990.
The corresponding results for the output feedback Q-learning value iteration algorithm, Algorithm 3.6, are shown in Figs. 3.7 and 3.8. As with the output feedback policy iteration algorithm, it takes longer for the output feedback parameters to converge in Algorithm 3.6 than in the state feedback case, Algorithm 3.4. The final
estimated control parameters by Algorithm 3.6 are
Fig. 3.3 Example 3.1: State trajectory of the closed-loop system under the output feedback Q-learning PI algorithm, Algorithm 3.5
Fig. 3.4 Example 3.1: Convergence of the parameter estimates under the output feedback Q-learning PI algorithm, Algorithm 3.5
Ĥuū = 0.0009 −0.0006 −0.0002 ,
Ĥuw̄ = −0.0008 0.0087 −0.0011 ,
Ĥuȳ = −0.9980 0.9436 −0.1076 ,
Ĥwū = −0.0008 0.0005 0.0002 ,
Ĥww̄ = 0.0009 −0.0084 0.0011 ,
Fig. 3.5 Example 3.1: State trajectory of the closed-loop system under the state feedback Q-learning VI algorithm, Algorithm 3.4 [5]
Fig. 3.6 Example 3.1: Convergence of the parameter estimates under the state feedback Q-learning VI algorithm, Algorithm 3.4 [5]
Ĥwȳ = 0.9811 −0.9130 0.1039 ,
Ĥuu = 1.0009,
Ĥuw = −0.0008,
Ĥww = −0.9990.
Fig. 3.7 Example 3.1: State trajectory of the closed-loop system under the output feedback Q-learning VI algorithm, Algorithm 3.6
Fig. 3.8 Example 3.1: Convergence of the parameter estimates under the output feedback Q-learning VI algorithm, Algorithm 3.6
Note that in both the state feedback algorithms, we used 20 data samples in each iteration, while 70 data samples were used for both the output feedback algorithms, in order to satisfy the minimum number of data samples needed to satisfy the rank condition (3.38). This example also demonstrates that, for a given system under the same initial conditions, the policy iteration algorithms converge in fewer iterations than the value iteration algorithms in both the state feedback and output feedback cases. Finally, it can be seen that the output feedback Q-learning algorithms are able to maintain closed-loop stability and the controller parameters converge to their optimal values.
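A quick back-of-the-envelope check of these sample counts, using the dimensions stated in the example (n = N = 3, m1 = m2 = p = 1) and the requirement L ≥ l(l + 1)/2, is sketched below; the variable names are mine.

```python
# Minimum number of data samples L >= l(l+1)/2 for Example 3.1
n, N, m1, m2, p = 3, 3, 1, 1, 1
l_state  = n + m1 + m2                      # z_k = [x_k; u_k; w_k]   -> 5
l_output = m1*N + m2*N + p*N + m1 + m2      # z_k as in (3.27)        -> 11
print(l_state * (l_state + 1) // 2)         # 15  (state feedback, 20 samples used)
print(l_output * (l_output + 1) // 2)       # 66  (output feedback, 70 samples used)
```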
The convergence criterion was chosen as ε = 0.1. Notice that no discounting
factor was employed and the results correspond to those obtained by solving the
game algebraic Riccati equation. Furthermore, the excitation noise did not introduce
any bias in the estimates, which is an advantage of the presented scheme. Moreover,
the excitation noise was removed after the convergence of parameter estimates.
3.4 Continuous-Time Zero-Sum Game and H∞ Control Problem
In the continuous-time setting, the value function associated with the policies u = Kx and w = Gx satisfies
0 = r\big(x(t), Kx(t), Gx(t)\big) + \left( \frac{\partial V}{\partial x} \right)^{T} \big( Ax(t) + Bu(t) + Ew(t) \big),
which involves the derivative of the cost function and requires the knowledge
of the system dynamics. In Chap. 2, we presented algorithms based on integral
reinforcement learning (IRL) that circumvent this difficulty by providing a Bellman
equation in a recursive form. The IRL Bellman equation for the differential zero-
sum game is given by
V(x(t)) = \int_{t}^{t+T} r\big(x(\tau), Kx(\tau), Gx(\tau)\big)\, d\tau + V\big(x(t+T)\big).
In the remainder of this section, we will first introduce the model-based iterative
techniques for solving the continuous-time problem. Then, we will present model-
free iterative techniques to find the optimal strategies for the linear quadratic
differential zero-sum game, which we will refer to as the differential zero-sum game
for simplicity. We will first introduce the state feedback learning algorithms before
presenting the output feedback learning algorithms.
Consider a continuous-time linear system in the state space form,
\dot{x} = Ax + Bu + Ew, \qquad y = Cx, \qquad (3.39)
where x ∈ Rn is the state, u ∈ Rm1 is the control input (of Player 1), w ∈ Rm2 is
the disturbance input (of the opposing player, Player 2), and y ∈ Rp is the output.
We assume that the pair (A, B) is controllable and the pair (A, C) is observable.
Let us define the infinite horizon cost function as
J(x(0), u, w) = \int_{0}^{\infty} r\big(x(\tau), u(\tau), w(\tau)\big)\, d\tau. \qquad (3.40)
For a linear quadratic differential game, the utility function r(x, u, w) takes the following quadratic form,
r(x, u, w) = y^T Q_y y + u^T u - \gamma^2 w^T w.
If the feedback policies are such that the resulting closed-loop system is asymptotically stable, then V(x) is quadratic in the state [6], that is,
V (x) = x T P x, (3.44)
for some P > 0. The optimal value function associated with the zero-sum game is
of the form
V^*(x(t)) = \min_{u} \max_{w} \int_{t}^{\infty} \left( y^T(\tau) Q_y y(\tau) + u^T(\tau) u(\tau) - \gamma^2 w^T(\tau) w(\tau) \right) d\tau, \qquad (3.45)
where the input u acts as the minimizing player and the input w acts as the
maximizing player. Equivalently, the aim of the zero-sum game is to find the
saddle point solution (u∗ , w ∗ ) that satisfies the following pair of Nash equilibrium
inequalities,
J\big(x(0), u^*, w\big) \le J\big(x(0), u^*, w^*\big) \le J\big(x(0), u, w^*\big), \qquad (3.46)
for any feedback policies u(x) and w(x). For the special case of the H∞ control
problem, the maximizing player w is an L_2[0, \infty) disturbance.
Under the additional assumption of the observability of \big(A, \sqrt{Q}\big), where \sqrt{Q}^{\,T}\sqrt{Q} = Q and Q = C^T Q_y C, the problem is solvable if there exists a unique positive definite matrix P^* that satisfies the following continuous-time game algebraic Riccati equation (GARE),
A^T P + P A + Q - P \left( BB^T - \gamma^{-2} EE^T \right) P = 0. \qquad (3.47)
In this case, there exist the unique state feedback policies u∗ = K ∗ x and w ∗ = G∗ x
that achieve the objective (3.45), where
K ∗ = −B T P ∗ , (3.48)
G∗ = γ −2 E T P ∗ . (3.49)
From the above discussion it is clear that, in order to find the optimal game
strategies, we need to have full knowledge of the system matrices for the solution
of the GARE (3.47). Furthermore, access to the information of the state is needed
for the implementation of the optimal strategies. In what follows, we will present
alternate approaches to finding the solution of this GARE.
Even when the system model information is available, the GARE (3.47) is difficult
to solve owing to its nonlinear nature. Iterative computational methods have
been developed to address this difficulty. We recall the policy iteration algorithm,
Algorithm 3.7, from [132].
Algorithm 3.7 finds the optimal strategies for the differential zero-sum game
by iteratively solving the Lyapunov equation (3.50). As is the case with the
previously discussed PI algorithms, a stabilizing policy is required to ensure that
the policy evaluation step results in a finite cost. For games with stable dynamics,
the feedback control may simply be initialized to zero. However, for an unstable
system, it is difficult to obtain such a stabilizing policy when the system dynamics
is unknown. To obviate this requirement, value iteration algorithms are used that
perform recursive updates on the cost matrix P instead of solving the Lyapunov
equation in every iteration.
In [11], a VI algorithm was proposed for the H∞ control problem. The algorithm is an extension of the VI algorithm presented in Chap. 2 for solving the LQR problem.
Algorithm 3.7 Model-based policy iteration for the differential zero-sum game
input: system dynamics (A, B, E)
output: P ∗ , K ∗ and G∗
1: initialize. Select an admissible policy K 0 such that A + BK 0 is Hurwitz. Set G0 = 0. Set j ← 0.
2: repeat
3: policy evaluation. Solve the following Lyapunov equation for P j ,
\left( A + BK^j + EG^j \right)^T P^j + P^j \left( A + BK^j + EG^j \right) + Q + \left( K^j \right)^T K^j - \gamma^2 \left( G^j \right)^T G^j = 0. \qquad (3.50)
4: policy improvement. Update the policies as
K^{j+1} = -B^T P^j, \qquad G^{j+1} = \gamma^{-2} E^T P^j.
5: j ← j + 1
6: until \| P^j - P^{j-1} \| < ε for some small ε > 0.
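For concreteness, here is a minimal Python sketch of Algorithm 3.7 using scipy's continuous Lyapunov solver; it is not the book's implementation, and the function name, tolerance, and iteration limit are placeholders.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def differential_game_pi(A, B, E, Q, gamma, K0, tol=1e-9, max_iter=100):
    """Sketch of Algorithm 3.7: evaluate (K, G) via a continuous Lyapunov
    equation, then update K = -B'P and G = gamma^-2 E'P."""
    n = A.shape[0]
    K, G = K0, np.zeros((E.shape[1], n))
    P_prev = np.zeros((n, n))
    for _ in range(max_iter):
        Acl = A + B @ K + E @ G
        Qbar = Q + K.T @ K - gamma**2 * (G.T @ G)
        # Solves Acl.T @ P + P @ Acl + Qbar = 0
        P = solve_continuous_lyapunov(Acl.T, -Qbar)
        K = -B.T @ P
        G = gamma**-2 * (E.T @ P)
        if np.linalg.norm(P - P_prev) < tol:
            break
        P_prev = P
    return P, K, G
```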
The VI algorithm in [11] makes use of a sequence of bounded sets \mathcal{B}_q satisfying
\mathcal{B}_q \subseteq \mathcal{B}_{q+1}, \; q \in \mathbb{Z}_+, \qquad \text{and} \qquad \lim_{q \to \infty} \mathcal{B}_q = \mathcal{P}_n^+,
where Pn+ is the set of n-dimensional positive definite matrices. With these
definitions, the VI algorithm is presented in Algorithm 3.8.
Algorithm 3.8 recursively solves the GARE equation (3.47), instead of solving a
Lyapunov equation. As a result, it does not require the knowledge of a stabilizing
initial policy. However, both Algorithms 3.7 and 3.8 are model-based as they require
the full model information (A, B, C, E).
Learning methods have been presented in the literature that solve the optimal control
problems without requiring the model information. The model-free state feedback
policy iteration algorithm, Algorithm 3.9, was developed in [60].
Algorithm 3.8 Model-based value iteration for the differential zero-sum game
Input: system dynamics (A, B, E)
Output: P ∗
Initialization. Set P 0 > 0, j ← 0, q ← 0.
1: loop
2: \tilde{P}^{j+1} \leftarrow P^j + \epsilon_j \left( A^T P^j + P^j A + Q - P^j B B^T P^j + \gamma^{-2} P^j E E^T P^j \right).
3: if \tilde{P}^{j+1} \notin \mathcal{B}_q then
4: P^{j+1} \leftarrow P^0
5: q \leftarrow q + 1
6: else if \| \tilde{P}^{j+1} - P^j \| / \epsilon_j < ε, for some small ε > 0, then
7: return P^j as P^*
8: else
9: P^{j+1} \leftarrow \tilde{P}^{j+1}
10: end if
11: j \leftarrow j + 1
12: end loop
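The value-update step of Algorithm 3.8 can be sketched as follows; this is illustrative only, the projection onto the sets B_q is omitted, and the diminishing step-size schedule shown is an arbitrary placeholder.

```python
import numpy as np

def gare_value_iteration(A, B, E, Q, gamma, P0, steps=20000):
    """Sketch of the value-update recursion in Algorithm 3.8 (no B_q projection)."""
    P = P0.copy()
    for j in range(steps):
        eps_j = 1.0 / (j + 10.0)            # illustrative step-size schedule
        R = A.T @ P + P @ A + Q - P @ B @ B.T @ P + gamma**-2 * (P @ E @ E.T @ P)
        P_next = P + eps_j * R
        if np.linalg.norm(P_next - P) / eps_j < 1e-8:
            return P_next
        P = P_next
    return P
```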
Algorithm 3.9 Model-free state feedback policy iteration algorithm for the differ-
ential zero-sum game
input: input-state data
output: P ∗ , K ∗ and G∗
1: initialize. Select a stabilizing control policy K 0 and apply u0 = K 0 x + n and w 0 = ν, where
n and ν are the exploration signals. Set G0 = 0. Set j ← 0.
2: collect data. Apply u0 to the system to collect online data for t ∈ [t0 , tl ], where l is the number
of learning intervals of length tk − tk−1 = T , k = 1, 2, · · · , l. Based on this data, perform the
following iterations,
3: repeat
4: evaluate and improve policies. Find the solution, P j , K j +1 and Gj +1 , of the following
learning equation,
5: j← j +1
6: until P j − P j −1 < ε for some small ε > 0.
As is the case with other PI algorithms, Algorithm 3.9 requires a stabilizing initial
control policy for the policy evaluation step to have a finite cost. To overcome this difficulty, a model-free value iteration algorithm, Algorithm 3.10, was proposed in [11].
It can be seen that the model-free Algorithm 3.10 does not require a stabilizing
control policy for its initialization. The algorithm is based on the recursive learning
equation (3.52), which is used to find the unknown matrices H j = AT P j + P j A,
K j = −B T P j and Gj = γ −2 E T P j .
Algorithm 3.10 Model-free state feedback value iteration algorithm for the differ-
ential zero-sum game
Input: input-state data
Output: P ∗ , K ∗ and G∗
Initialization. Select P 0 > 0 and set j ← 0, q ← 0.
Collect Online Data. Apply u0 = n and w 0 = ν with n and ν being the exploration signals
to the system and collect online data for t ∈ [t0 , tl ], where tl = t0 + lT and T is the interval
length. Based on this data, perform the following iterations,
1: loop
2: Find the solution, H j , K j and Gj , of the following equation,
3: \tilde{P}^{j+1} \leftarrow P^j + \epsilon_j \left( H^j + Q - \left( K^j \right)^T K^j + \gamma^2 \left( G^j \right)^T G^j \right)
4: if \tilde{P}^{j+1} \notin \mathcal{B}_q then
5: P^{j+1} \leftarrow P^0
6: q \leftarrow q + 1
7: else if \| \tilde{P}^{j+1} - P^j \| / \epsilon_j < ε then
8: return P j , K j and Gj as P ∗ , K ∗ and G∗
9: else
10: P j +1 ← P̃ j +1
11: end if
12: j ←j +1
13: end loop
In the discussion so far in this subsection we have found that the model-free
algorithms make use of the full state measurement. In the following subsections, we
will present a dynamic output feedback scheme to solve the model-free differential
zero-sum game and the associated H∞ control problem. Along the lines of the
discussion in Chap. 2, we will first present a parameterization of the state of the
continuous-time system. Based on this parameterization, we will present output
feedback learning equations that will form the basis of the model-free output
feedback learning algorithms. Finally, it will be shown that, similar to the output
feedback algorithms in Chap. 2, these algorithms also possess the exploration bias
immunity and, therefore, do not require a discounted cost function. Maintaining the
original undiscounted cost function ensures the stability of the closed-loop and the
optimality of the solution.
In the previous developments we learnt that a key idea in developing the output
feedback learning equations is the parameterization of the state. Two parameteriza-
tions have been presented. One parameterization is based on the derivation using
the embedded observer and filtering approach, and the other is more direct in the
sense that it requires just the delayed measurements of the input, output, and the
disturbance signals. Based on the direct parameterization we developed the output
feedback learning equations. However, as has been discussed earlier in Chap. 2, the
direct parameterization does not extend to the continuous-time setting as it would
involve the derivatives of the input, output, and disturbance signals. On the other
hand, the observer and filtering based parameterization (3.22) result does have a
continuous-time counterpart, which is why it was introduced in the discrete-time
setting.
In the following, we will present a state parameterization procedure to represent
the state of a general continuous-time linear system in terms of the filtered input,
output, and disturbance.
Theorem 3.8 Consider system (3.39). Let the pair (A, C) be observable. Then, there exists a parameterization of the state in the form of
\bar{x}(t) = W_u \zeta_u(t) + W_w \zeta_w(t) + W_y \zeta_y(t), \qquad (3.53)
where x(t) - \bar{x}(t) \to 0 as t \to \infty, and
W_u^i = \begin{bmatrix}
a_{u(n-1)}^{i1} & a_{u(n-2)}^{i1} & \cdots & a_{u0}^{i1} \\
a_{u(n-1)}^{i2} & a_{u(n-2)}^{i2} & \cdots & a_{u0}^{i2} \\
\vdots & \vdots & \ddots & \vdots \\
a_{u(n-1)}^{in} & a_{u(n-2)}^{in} & \cdots & a_{u0}^{in}
\end{bmatrix}, \quad i = 1, 2, \cdots, m_1,
W_w^i = \begin{bmatrix}
a_{w(n-1)}^{i1} & a_{w(n-2)}^{i1} & \cdots & a_{w0}^{i1} \\
a_{w(n-1)}^{i2} & a_{w(n-2)}^{i2} & \cdots & a_{w0}^{i2} \\
\vdots & \vdots & \ddots & \vdots \\
a_{w(n-1)}^{in} & a_{w(n-2)}^{in} & \cdots & a_{w0}^{in}
\end{bmatrix}, \quad i = 1, 2, \cdots, m_2,

W_y^i = \begin{bmatrix}
a_{y(n-1)}^{i1} & a_{y(n-2)}^{i1} & \cdots & a_{y0}^{i1} \\
a_{y(n-1)}^{i2} & a_{y(n-2)}^{i2} & \cdots & a_{y0}^{i2} \\
\vdots & \vdots & \ddots & \vdots \\
a_{y(n-1)}^{in} & a_{y(n-2)}^{in} & \cdots & a_{y0}^{in}
\end{bmatrix}, \quad i = 1, 2, \cdots, p,
whose elements are the coefficients of the numerators in the transfer function matrix of a Luenberger observer with inputs u(t), w(t) and y(t), and \zeta_u(t) = \begin{bmatrix} \zeta_u^1(t)^T & \zeta_u^2(t)^T & \cdots & \zeta_u^{m_1}(t)^T \end{bmatrix}^T, \zeta_w(t) = \begin{bmatrix} \zeta_w^1(t)^T & \cdots & \zeta_w^{m_2}(t)^T \end{bmatrix}^T and \zeta_y(t) = \begin{bmatrix} \zeta_y^1(t)^T & \cdots & \zeta_y^{p}(t)^T \end{bmatrix}^T represent the states of the user-defined dynamics driven by the individual input u^i(t), disturbance w^i(t) and output y^i(t) as given by
\dot{\zeta}_u^i = A \zeta_u^i + B u^i, \quad \zeta_u^i(0) = 0, \quad i = 1, 2, \cdots, m_1,
\dot{\zeta}_w^i = A \zeta_w^i + B w^i, \quad \zeta_w^i(0) = 0, \quad i = 1, 2, \cdots, m_2,
\dot{\zeta}_y^i = A \zeta_y^i + B y^i, \quad \zeta_y^i(0) = 0, \quad i = 1, 2, \cdots, p,
for a Hurwitz matrix A whose eigenvalues coincide with those of A + LC and an input vector B of the form given in the proof below.
Proof The proof follows from the proof of the state parameterization result
presented in Chap. 2. Given that (A, C) is observable, we can obtain, by considering
the disturbance as an additional input, a state observer as
\dot{\hat{x}}(t) = A\hat{x}(t) + Bu(t) + Ew(t) - L\big( y(t) - C\hat{x}(t) \big) = (A + LC)\hat{x}(t) + Bu(t) + Ew(t) - Ly(t),
where x̂(t) is the estimate of the state x(t) and L is the observer gain chosen such
that matrix A + LC has all its eigenvalues in the open left-half plane. This observer
is a dynamic system driven by u, w, and y with the dynamics matrix A + LC. By
linearity, the effect of the disturbance can be added in the same way as that of the u
and y as carried out in proof of Theorem 2.9. Therefore, following the arguments in
the proof of Theorem 2.9, we can express the effect of the disturbance w as
\frac{W_w^i(s)}{\Lambda(s)}\, w^i =
\begin{bmatrix}
\dfrac{a_{n-1}^{i1} s^{n-1} + a_{n-2}^{i1} s^{n-2} + \cdots + a_0^{i1}}{s^n + \alpha_{n-1}s^{n-1} + \alpha_{n-2}s^{n-2} + \cdots + \alpha_0} \\[1ex]
\dfrac{a_{n-1}^{i2} s^{n-1} + a_{n-2}^{i2} s^{n-2} + \cdots + a_0^{i2}}{s^n + \alpha_{n-1}s^{n-1} + \alpha_{n-2}s^{n-2} + \cdots + \alpha_0} \\
\vdots \\
\dfrac{a_{n-1}^{in} s^{n-1} + a_{n-2}^{in} s^{n-2} + \cdots + a_0^{in}}{s^n + \alpha_{n-1}s^{n-1} + \alpha_{n-2}s^{n-2} + \cdots + \alpha_0}
\end{bmatrix} w^i
=
\begin{bmatrix}
a_{w0}^{i1} & a_{w1}^{i1} & \cdots & a_{w(n-1)}^{i1} \\
a_{w0}^{i2} & a_{w1}^{i2} & \cdots & a_{w(n-1)}^{i2} \\
\vdots & \vdots & \ddots & \vdots \\
a_{w0}^{in} & a_{w1}^{in} & \cdots & a_{w(n-1)}^{in}
\end{bmatrix}
\begin{bmatrix}
\frac{1}{\Lambda(s)}\, w^i \\
\frac{s}{\Lambda(s)}\, w^i \\
\vdots \\
\frac{s^{n-1}}{\Lambda(s)}\, w^i
\end{bmatrix}
= W_w^i \zeta_w^i,
where
A = \begin{bmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1 \\
-\alpha_0 & -\alpha_1 & \cdots & \cdots & -\alpha_{n-1}
\end{bmatrix}, \qquad
B = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix}.
It can be seen that e(A+LC)t x̂(0) in (3.57) and e(A+LC)t x(0) in (3.58) converge to
zero as t → ∞ because A + LC is Hurwitz stable. This completes the proof.
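As an illustration of how the filter states ζ could be generated in practice, the following sketch integrates one filter channel driven by a scalar signal; the function names and the coefficient ordering of Λ(s) are my assumptions, and Λ(s) must be chosen Hurwitz.

```python
import numpy as np
from scipy.integrate import solve_ivp

def filter_matrices(alpha):
    """Companion realization of the filters 1/Lambda(s), ..., s^(n-1)/Lambda(s)
    with Lambda(s) = s^n + alpha[n-1] s^(n-1) + ... + alpha[0]."""
    n = len(alpha)
    A = np.zeros((n, n))
    A[:-1, 1:] = np.eye(n - 1)
    A[-1, :] = -np.asarray(alpha)
    B = np.zeros(n); B[-1] = 1.0
    return A, B

def run_filter(alpha, signal, t_span, t_eval):
    """Propagate one filter state zeta driven by a scalar signal u(t), w(t), or y(t)."""
    A, B = filter_matrices(alpha)
    rhs = lambda t, zeta: A @ zeta + B * signal(t)
    return solve_ivp(rhs, t_span, np.zeros(len(alpha)), t_eval=t_eval).y
```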
Remark 3.3 The variables ζu , ζw , and ζy are generated through the filter matrix A
that is independent of the system dynamics. The connection between the param-
eterized state x̄ and the actual system state x is established through the matrices
W_u, W_w, and W_y, which contain the numerator coefficients of the transfer functions U(s)/\Lambda(s), W(s)/\Lambda(s) and Y(s)/\Lambda(s) and thus depend on (A, B, C, E, L).
Analogous to the discrete-time counterpart of this parameterization given in (3.22), we have the same full row rank conditions for the parameterization matrix W = [W_u \; W_w \; W_y] in the state parameterization (3.53).
Theorem 3.9 The state parameterization matrix W = [W_u \; W_w \; W_y] in (3.53) is of full row rank if either of (A + LC, B), (A + LC, E), or (A + LC, L) is controllable.
Proof The proof follows the same procedure as in the proof of Theorem 3.3
by replacing the transfer function variable z corresponding to the discrete-time
dynamics of ζu , ζw and ζy with the transfer function variable s corresponding to
the continuous-time dynamics.
Using the state parameterization (3.53), the steady-state value function (3.44) can be expressed as
V = \bar{x}^T P \bar{x} = z^T W^T P W z = z^T \bar{P} z, \qquad (3.59)
where
z = \begin{bmatrix} \zeta_u^T & \zeta_w^T & \zeta_y^T \end{bmatrix}^T \in \mathbb{R}^N, \qquad W = \begin{bmatrix} W_u & W_w & W_y \end{bmatrix} \in \mathbb{R}^{n \times N},
\bar{P} = \bar{P}^T = \begin{bmatrix} W_u^T P W_u & W_u^T P W_w & W_u^T P W_y \\ W_w^T P W_u & W_w^T P W_w & W_w^T P W_y \\ W_y^T P W_u & W_y^T P W_w & W_y^T P W_y \end{bmatrix} \in \mathbb{R}^{N \times N},
with N = m_1 n + m_2 n + pn.
By (3.59) we have obtained a new description of the steady-state cost function in
terms of the inputs and output of the system. The corresponding steady-state output
feedback policies are given by
⎡ ⎤
ζu
u = K Wu Ww Wy ⎣ζw ⎦ = K̄z, (3.60)
ζy
⎡ ⎤
ζu
w = G Wu Ww Wy ⎣ζw ⎦ = Ḡz, (3.61)
ζy
where
K̄ = K Wu Ww Wy ∈ Rm1 ×(m1 n+m2 n+pn)
and
Ḡ = G Wu Ww Wy ∈ Rm2 ×(m1 n+m2 n+pn) .
Therefore, the optimal cost matrix is given by P̄ ∗ and the corresponding steady-state
optimal output feedback policies are given by
ū∗ = K̄ ∗ z, (3.62)
w̄ ∗ = Ḡ∗ z. (3.63)
Theorem 3.11 The output feedback policies given by (3.62) and (3.63) converge,
respectively, to the optimal policies (3.48) and (3.49), which solve the zero-sum
game.
Proof By the definitions of \bar{K} and \bar{G},
\bar{u}^* = \bar{K}^* z = K^* W z, \qquad \bar{w}^* = \bar{G}^* z = G^* W z.
Since the parameterized state satisfies \bar{x} = W z, these policies asymptotically coincide with
\bar{u}^* = K^* x, \qquad \bar{w}^* = G^* x,
which are the optimal steady-state strategies of the differential zero-sum game. This
completes the proof.
In Sect. 3.4.1, we have presented algorithms that learn the optimal state feedback
strategies that solve the differential zero-sum game. We now direct our attention
towards developing model-free techniques to learn the optimal output feedback
strategies that solve the differential zero-sum game. We first recall from Algo-
rithm 3.9 the following state feedback learning equation,
We can write the l instances of Equation (3.65) in the following compact form,
\Theta^j \begin{bmatrix} \mathrm{vecs}\big( \bar{P}^j \big) \\ \mathrm{vec}\big( \bar{K}^{j+1} \big) \\ \mathrm{vec}\big( \bar{G}^{j+1} \big) \end{bmatrix} = \Xi^j,
where
\Theta^j \in \mathbb{R}^{\, l \times \left( \frac{N(N+1)}{2} + (m_1 + m_2) N \right)}, \qquad \Xi^j = -I_{zz}\, \mathrm{vec}\big( \bar{Q}^j \big) - I_{yy}\, \mathrm{vec}\big( Q_y \big) \in \mathbb{R}^l,
with
\bar{Q}^j = \big( \bar{K}^j \big)^T \bar{K}^j - \gamma^2 \big( \bar{G}^j \big)^T \bar{G}^j,
\bar{z} = \begin{bmatrix} z_1^2 & 2z_1 z_2 & \cdots & z_2^2 & 2z_2 z_3 & \cdots & z_N^2 \end{bmatrix},
\mathrm{vecs}\big( \bar{P}^j \big) = \begin{bmatrix} \bar{P}_{11}^j & \bar{P}_{12}^j & \cdots & \bar{P}_{1N}^j & \bar{P}_{22}^j & \bar{P}_{23}^j & \cdots & \bar{P}_{NN}^j \end{bmatrix}.
We are now ready to present the output feedback policy iteration algorithm, Algorithm 3.11, to learn the solution of the differential zero-sum game.
Algorithm 3.11 Model-free output feedback policy iteration for the differential
zero-sum game
Input: Filtered input-output data
Output: P̄*, K̄*, and Ḡ*
1: Initialize. Select policies u^0 = K̄^0 z + ν_1 and w^0 = ν_2, with K̄^0 being a stabilizing policy,
   and ν_1 and ν_2 being the corresponding exploration signals. Set Ḡ^0 = 0. Set j ← 0.
2: Acquire Data. Apply u^0 and w^0 during t ∈ [t_0, t_l], where l is the number of learning intervals
   of length t_k − t_{k−1} = T, k = 1, 2, · · · , l. Collect the filtered input and output data for each
   interval.
3: loop
4:   Find P̄^j, K̄^{j+1}, and Ḡ^{j+1} by solving (3.65).
5:   if ‖P̄^j − P̄^{j−1}‖ < ε then
6:     return P̄^j, K̄^{j+1}, and Ḡ^{j+1} as P̄*, K̄*, and Ḡ*
7:   else
8:     j ← j + 1
9:   end if
10: end loop
In Algorithm 3.11, we collect only the filtered input, output, and disturbance data
to compute their quadratic integrals and form the data matrices. The least-squares
problem corresponding to (3.65) is solved based on this data set. Note that we use a
stabilizing initial policy K̄^0 to collect data, and this data set is reused in the subsequent
iterations. Since there are N(N + 1)/2 + (m_1 + m_2)N unknowns corresponding to
P̄^j, K̄^{j+1}, and Ḡ^{j+1}, we need l ≥ N(N + 1)/2 + (m_1 + m_2)N data sets to solve
(3.65).
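To make the data requirement concrete, the following is a minimal Python sketch of the batch least-squares step that recovers P̄^j, K̄^{j+1}, and Ḡ^{j+1} from l stacked instances of (3.65). The names Phi and Psi for the stacked data matrix and target vector, as well as the row-major unpacking of the policy matrices, are illustrative assumptions rather than the book's own code.

```python
import numpy as np

def pi_least_squares(Phi, Psi, N, m1, m2):
    """Solve Phi @ theta = Psi for theta = [vecs(P_bar); vec(K_bar); vec(G_bar)].

    Assumes Phi has l >= N(N+1)/2 + (m1+m2)N rows (one per learning interval)
    and that each off-diagonal entry of P_bar appears once, paired with the
    doubled cross terms in the regressor z_bar defined above.
    """
    n_p = N * (N + 1) // 2
    theta, *_ = np.linalg.lstsq(Phi, Psi, rcond=None)
    # unpack the symmetric cost matrix P_bar from its half-vectorization
    P_bar = np.zeros((N, N))
    idx = 0
    for r in range(N):
        for c in range(r, N):
            P_bar[r, c] = P_bar[c, r] = theta[idx]
            idx += 1
    # unpack the two policy matrices (row-major stacking assumed for illustration)
    K_bar = theta[n_p:n_p + m1 * N].reshape(m1, N)
    G_bar = theta[n_p + m1 * N:n_p + (m1 + m2) * N].reshape(m2, N)
    return P_bar, K_bar, G_bar
```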
We now address the problem of requiring a stabilizing initial control policy. It
can be seen that Algorithm 3.11 solves the output feedback differential zero-sum
game without using any knowledge of the system dynamics. However, it requires
a stabilizing initial control policy. In the situation where the system is unstable and
a stabilizing initial policy is not available, we propose an output feedback value
iteration algorithm. We start with the following equations, in which Equation (3.67)
is obtained by taking the derivative d/dt (x^T P^j x) along the trajectory of (3.39) and
then integrating both sides of the resulting equation.
We now work towards deriving an equivalent output feedback learning equation
corresponding to Equation (3.67). Notice that the recursive GARE equation (3.68)
uses the state weighting Q, which in the output feedback case is given by Q =
C^T Q_y C. However, this would require the knowledge of the output matrix C. To
overcome this difficulty, we add ∫_{t−T}^{t} y^T(τ) Q_y y(τ) dτ to both sides of (3.67) so
that Q can be lumped together with the unknown H^j = A^T P^j + P^j A, that is,

x^T(t) P^j x(t) − x^T(t−T) P^j x(t−T) + ∫_{t−T}^{t} y^T(τ) Q_y y(τ) dτ
  = ∫_{t−T}^{t} x^T(τ) (A^T P^j + P^j A + C^T Q_y C) x(τ) dτ + 2∫_{t−T}^{t} u^T(τ) B^T P^j x(τ) dτ
    + 2∫_{t−T}^{t} w^T(τ) E^T P^j x(τ) dτ.
Next, we apply the state parameterization (3.53) to obtain the steady-state output
feedback learning equation,

z^T(t) P̄^j z(t) − z^T(t−T) P̄^j z(t−T) + ∫_{t−T}^{t} y^T(τ) Q_y y(τ) dτ
  = ∫_{t−T}^{t} z^T(τ) H̄^j z(τ) dτ + 2∫_{t−T}^{t} u^T(τ) B^T P^j W z(τ) dτ
    + 2∫_{t−T}^{t} w^T(τ) E^T P^j W z(τ) dτ,       (3.69)
where the data over each learning interval, k = 1, 2, · · · , l, is collected as in
Algorithm 3.11. We can then write the l instances of Equation (3.69) in the following
compact form,
Φ^j [ vecs(H̄^j)
      vec(B^T P^j W)
      vec(E^T P^j W) ] = Ψ^j,

whose least-squares solution is given by

[ vecs(H̄^j)
  vec(B^T P^j W)
  vec(E^T P^j W) ] = (Φ^{jT} Φ^j)^{−1} Φ^{jT} Ψ^j.       (3.70)
B_q ⊆ B_{q+1},  q ∈ Z_+,
and
lim_{q→∞} B_q = P_+,
where P_+ is the set of positive semi-definite matrices. Note that the purpose of the
set B_q is to prevent the estimates P̄˜^{j+1} from escaping to infinity. If an upper bound
on ‖P̄*‖ is known, then B_q can be fixed to a set B̄ that contains P̄^0 and P̄* in its
interior. This assumption of an upper bound on P̄*, although quite restrictive, is
helpful in the projection of the estimate of P̄* in the value iteration algorithms. A
discussion on this assumption can be found in the relevant state feedback works,
such as [11].
It can be seen that Algorithm 3.12 does not require a stabilizing control policy
for its initialization. The updates in the difference Riccati equation are performed
with a varying step size ε_j that satisfies lim_{j→∞} ε_j = 0. It is worth pointing out
that the least-squares solution (3.70) for Algorithm 3.12 provides lumped parameter
estimates of the terms B^T P* W and E^T P* W.
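The value iteration update and its safeguard can be summarized in a few lines. The sketch below assumes the lumped estimates H̄^j, K̄^j, and Ḡ^j have already been extracted from the least-squares solution (3.70); the vanishing step-size rule and the modeling of B_q as a Frobenius-norm ball of growing radius are illustrative assumptions.

```python
import numpy as np

def vi_update(P_bar, H_bar, K_bar, G_bar, gamma, j, q, P_bar0, radius0=10.0):
    """One recursion of the output feedback value iteration with projection."""
    eps_j = 1.0 / (j + 1)                                   # vanishing step size (assumed form)
    P_tilde = P_bar + eps_j * (H_bar - K_bar.T @ K_bar + gamma**2 * G_bar.T @ G_bar)
    if np.linalg.norm(P_tilde, 'fro') > radius0 * 2**q:     # P_tilde escaped the current set B_q
        return P_bar0.copy(), q + 1                         # reset the estimate and enlarge B_q
    return P_tilde, q
```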
Remark 3.4 Compared to their state feedback counterparts (3.51) and (3.52),
Equations (3.66) and (3.70) solve the least-squares problem using input-output data
instead of input-state data. Similar to the state feedback problem, in order to solve
these least-squares problems, we need to inject exploration signals ν_1 and ν_2 in
u and w, respectively. Since the data matrix Φ^j has columns associated with the
filtered measurements as well as the actions u and w, the two exploration signals
ν_1 and ν_2 are selected not only independent of these filtered measurements but also
independent of each other. Both of these independence conditions are necessary to
ensure that Φ^j is of full column rank. In other words, the following rank condition
needs to hold for all j,

rank(Φ^j) = N(N + 1)/2 + (m_1 + m_2)N.       (3.71)
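In practice, the exploration signals can be taken as sums of sinusoids with randomly chosen frequencies, as is done in the simulation examples later in this section. The sketch below generates two such independent signals and verifies the rank condition (3.71) on an assembled data matrix; the function and variable names are illustrative.

```python
import numpy as np

def sinusoidal_exploration(t, n_freq=10, amplitude=0.1, seed=0):
    """Sum of sinusoids with random frequencies and phases, evaluated at times t."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.1, 10.0, n_freq)
    phases = rng.uniform(0.0, 2.0 * np.pi, n_freq)
    return amplitude * np.sin(np.outer(t, freqs) + phases).sum(axis=1)

def rank_condition_holds(Phi, N, m1, m2):
    """Check rank(Phi) = N(N+1)/2 + (m1+m2)N, i.e., condition (3.71)."""
    return np.linalg.matrix_rank(Phi) == N * (N + 1) // 2 + (m1 + m2) * N

# two independent exploration signals: use different seeds for nu1 and nu2
t = np.arange(0.0, 6.5, 0.01)
nu1, nu2 = sinusoidal_exploration(t, seed=1), sinusoidal_exploration(t, seed=2)
```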
Algorithm 3.12 Model-free output feedback value iteration for the differential zero-
sum game
Input: Filtered input-output data
Output: P̄*, K̄*, and Ḡ*
1: Initialize. Choose u^0 = ν_1 and w^0 = ν_2. Initialize P̄^0 > 0, and set j ← 0, q ← 0.
2: Acquire Data. Apply u^0 and w^0 during t ∈ [t_0, t_l], where l is the number of learning intervals
   of length t_k − t_{k−1} = T, k = 1, 2, · · · , l. Collect the filtered input-output data for each interval.
3: loop
4:   Find H̄^j, K̄^j, and Ḡ^j by solving
       z^T(t) P̄^j z(t) − z^T(t−T) P̄^j z(t−T) + ∫_{t−T}^{t} y^T(τ) Q_y y(τ) dτ
         = ∫_{t−T}^{t} z^T(τ) H̄^j z(τ) dτ − 2∫_{t−T}^{t} u^T(τ) K̄^j z(τ) dτ + 2γ^2 ∫_{t−T}^{t} w^T(τ) Ḡ^j z(τ) dτ,
5:   P̄˜^{j+1} ← P̄^j + ε_j (H̄^j − (K̄^j)^T K̄^j + γ^2 (Ḡ^j)^T Ḡ^j)
6:   if P̄˜^{j+1} ∉ B_q then
7:     P̄^{j+1} ← P̄^0
8:     q ← q + 1
9:   else if ‖P̄˜^{j+1} − P̄^j‖/ε_j < ε then
10:    return P̄^j, K̄^j, and Ḡ^j as P̄*, K̄*, and Ḡ*
11:  else
12:    P̄^{j+1} ← P̄˜^{j+1}
13:  end if
14:  j ← j + 1
15: end loop
We now show that the proposed output feedback algorithms, Algorithms 3.11
and 3.12, converge to the optimal output feedback solution.
Theorem 3.12 Consider system (3.39). Assume that the output feedback zero-sum
game, and hence, the H∞ problem, are solvable. Then, the proposed output feedback
algorithms, Algorithms 3.11 and 3.12, generate a sequence P̄ j that converges to the
optimal output feedback solution P̄ ∗ as j → ∞, provided that the rank condition
(3.71) holds.
Proof Consider a Lyapunov function

V^j(x) = x^T P^j x.       (3.72)

The time derivative of V^j(x) along the trajectory of the closed-loop system with the
output feedback controls u = K̄^j z + ν_1 and w = Ḡ^j z + ν_2 is given by

V̇^j(x) = x^T (A^T P^j + P^j A) x + 2x^T P^j (B K̄^j + E Ḡ^j) z + 2(Bν_1 + Eν_2)^T P^j x
       = x^T (A^T P^j + P^j A) x + 2x^T P^j (B K^j + E G^j) W z + 2(Bν_1 + Eν_2)^T P^j x,

where we have used

W z = x − e^{(A+LC)t} x(0),
K^{j+1} = −B^T P^j,
G^{j+1} = γ^{−2} E^T P^j,
ν_1 = u − K̄^j z = u − K^j (x − e^{(A+LC)t} x(0)),
ν_2 = w − Ḡ^j z = w − G^j (x − e^{(A+LC)t} x(0)).
Moreover,

−2∫_{t−T}^{t} (K^j e^{(A+LC)τ} x(0))^T K^{j+1} x(τ) dτ = 2∫_{t−T}^{t} x^T(τ) P^j B K^j e^{(A+LC)τ} x(0) dτ,

2γ^2 ∫_{t−T}^{t} (G^j e^{(A+LC)τ} x(0))^T G^{j+1} x(τ) dτ = 2∫_{t−T}^{t} x^T(τ) P^j E G^j e^{(A+LC)τ} x(0) dτ,

which results in the cancellation of all the exponentially decaying terms. Comparing
the resulting equation with the learning equation (3.51), we have

(A + BK^j + EG^j)^T P^j + P^j (A + BK^j + EG^j) = −(C^T Q_y C + (K^j)^T K^j − γ^2 (G^j)^T G^j),       (3.75)
which is the Lyapunov equation associated with the zero-sum game. Thus, iterating
on the solution of the output feedback learning equation (3.65) is equivalent to
iterating on the Lyapunov equation (3.75). The existence of the unique solution
to the equation (3.65) is guaranteed under the rank condition (3.71). In [132], it
has been shown that the iterations on the Lyapunov equation (3.75) converge to the
solution of the GARE (3.47) under the controllability and observability conditions.
Therefore, we can conclude that the proposed iterative algorithm, Algorithm 3.11,
also converges to the solution of the GARE (3.47), provided that the least-squares
problem (3.66) corresponding to (3.65) is solvable.
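For comparison and verification purposes, the Lyapunov iterations (3.75) can be carried out directly when the model is available. The following sketch, which, unlike the learning algorithms, assumes knowledge of (A, B, E, C), iterates the Lyapunov equation and the two policy updates until convergence to the GARE solution; it is a minimal illustration, not the book's implementation.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def game_lyapunov_iterations(A, B, E, C, Qy, gamma, K0, tol=1e-9, max_iter=200):
    """Model-based policy iteration for the zero-sum game: iterate (3.75)."""
    n = A.shape[0]
    K, G = K0, np.zeros((E.shape[1], n))
    Q = C.T @ np.atleast_2d(Qy) @ C
    P_prev = np.zeros((n, n))
    for _ in range(max_iter):
        Acl = A + B @ K + E @ G
        # (A+BK+EG)' P + P (A+BK+EG) = -(Q + K'K - gamma^2 G'G)
        P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ K - gamma**2 * G.T @ G))
        K = -B.T @ P                     # Player 1 update
        G = gamma**-2 * E.T @ P          # Player 2 update
        if np.linalg.norm(P - P_prev) < tol:
            break
        P_prev = P
    return P, K, G
```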
We next prove the convergence of the output feedback value iteration Algo-
rithm 3.12. Consider the following recursion in Algorithm 3.12,

P̄˜^{j+1} = P̄^j + ε_j (H̄^j − W^T P^j B B^T P^j W + γ^{−2} W^T P^j E E^T P^j W).

Equivalently,

W^T P̃^{j+1} W = W^T P^j W + ε_j (W^T H^j W + W^T C^T Q_y C W − W^T P^j B B^T P^j W + γ^{−2} W^T P^j E E^T P^j W).

Recall from Theorems 3.9 and 3.10 that W is of full row rank. Thus, the above
equation reduces to

P̃^{j+1} = P^j + ε_j (H^j + Q − P^j B B^T P^j + γ^{−2} P^j E E^T P^j),
We now establish the immunity of the output feedback algorithms to the exploration
bias problem. We have the following result.
Theorem 3.13 The output feedback algorithms, Algorithms 3.11 and 3.12, are
immune to the exploration bias problem.
Proof Define the inputs of Player 1 and Player 2 with the excitation signals as
û = u + ν_1 and ŵ = w + ν_2. As a result of applying the excited inputs, the estimates
of P̄^j, K̄^j, and Ḡ^j in the j-th iteration are represented as P̄ˆ^j, K̄ˆ^j, and Ḡˆ^j.
We first consider the following output feedback learning equation (3.65) with the
exploration inputs,
−2(û(t) − K̄ˆ^j z(t))^T K̄ˆ^{j+1} z(t) + 2γ^2 (ŵ(t) − Ḡˆ^j z(t))^T Ḡˆ^{j+1} z(t)
  + y^T(t−T) Q_y y(t−T) + z^T(t−T) (K̄ˆ^j)^T K̄ˆ^j z(t−T)
  − γ^2 z^T(t−T) (Ḡˆ^j)^T Ḡˆ^j z(t−T) + 2(û(t−T) − K̄ˆ^j z(t−T))^T K̄ˆ^{j+1} z(t−T)
  − 2γ^2 (ŵ(t−T) − Ḡˆ^j z(t−T))^T Ḡˆ^{j+1} z(t−T).
Substituting û = u + ν_1 and ŵ = w + ν_2 gives

−2(u(t) − K̄ˆ^j z(t))^T K̄ˆ^{j+1} z(t) − 2ν_1^T(t) K̄ˆ^{j+1} z(t) + 2γ^2 (w(t) − Ḡˆ^j z(t))^T Ḡˆ^{j+1} z(t)
  + 2γ^2 ν_2^T(t) Ḡˆ^{j+1} z(t) + y^T(t−T) Q_y y(t−T) + z^T(t−T) (K̄ˆ^j)^T K̄ˆ^j z(t−T)
  − γ^2 z^T(t−T) (Ḡˆ^j)^T Ḡˆ^j z(t−T) + 2(u(t−T) − K̄ˆ^j z(t−T))^T K̄ˆ^{j+1} z(t−T)
  − 2γ^2 (w(t−T) − Ḡˆ^j z(t−T))^T Ḡˆ^{j+1} z(t−T)
ż = Āz + B̄η,

in which each Ā_i and B̄_i is further a diagonal matrix whose blocks correspond to
the matrices A and B defined in Theorem 3.8, with the number of such blocks being
equal to the number of components in the individual vectors u, w, and y.
Next, we recall the following facts that hold under state feedback laws,

2x^T P̂^j B ν_1 = −2ν_1^T K̂^{j+1} x,
2x^T P̂^j E ν_2 = 2γ^2 ν_2^T Ĝ^{j+1} x,
Regrouping the terms and performing integration on both sides of the above equation
lead to

2z^T(t) P̄ˆ^j ż(t) − 2z^T(t−T) P̄ˆ^j ż(t−T) + y^T(t) Q_y y(t) − y^T(t−T) Q_y y(t−T)
  = z^T(t) H̄ˆ^j z(t) − z^T(t−T) H̄ˆ^j z(t−T) − 2û^T(t) K̄ˆ^j z(t) + 2γ^2 ŵ^T(t) Ḡˆ^j z(t),

in which the left-hand side can be expanded, using ż = Āz + B̄_1 u + B̄_2 w + B̄_3 y, as

2z^T(t) P̄ˆ^j (Āz(t) + B̄_1 u(t) + B̄_2 w(t) + B̄_3 y(t))
  − 2z^T(t−T) P̄ˆ^j (Āz(t−T) + B̄_1 u(t−T) + B̄_2 w(t−T) + B̄_3 y(t−T))
  + y^T(t) Q_y y(t) − y^T(t−T) Q_y y(t−T)
E = [1
     0
     0],
C = [1  0  0].
The three states are given by x = [x1 x2 x3 ]T , where x1 is the angle of attack,
x2 is the rate of pitch, and x3 is the elevator angle of deflection [110]. Here u is a
stabilizing player and w is a perturbation on the angle of attack. Let the user-defined
cost function be specified by Q_y = 1 and γ = 3. The eigenvalues of the filter matrix A
are all chosen to be at −10. The nominal output feedback solution can be found by
solving the Riccati equation (3.47) and then computing Wu , Ww , Wy , P̄ ∗ , K̄ ∗ , and
Ḡ∗ as
P* = [  0.9149   0.5387  −0.0677
        0.5387   0.4248  −0.0607
       −0.0677  −0.0607   0.0098 ],

K* = [−0.3385  −0.3035  0.0492],

G* = [0.1017  0.0599  −0.0075],

Wu = [  −0.80    −0.01    0
       −21.96    −0.87    0
       1332.7    146.5    5 ],

Ww = [   −1.07     2.07   1
      −1092.5   −260.89   0
       5104.1    4737.4   0 ],

Wy = [  1029.8    303.5     27.2
        1144.7   1382.3    261.7
       −1674.9  −9930.8  −4737.3 ],

P̄* = [Wu  Ww  Wy]^T P* [Wu  Ww  Wy],
K̄* = K* [Wu  Ww  Wy],
Ḡ* = G* [Wu  Ww  Wy].
The policies K̄ 0 and Ḡ0 are initialized to zero. The learning period is chosen as
T = 0.1s and l = 65 learning intervals are performed. It is assumed that the
disturbance is sufficiently exciting. In the simulation, the rank condition (3.71) is
satisfied by adding sinusoidal signals of random frequencies to both the control
input and the disturbance input. These excitation signals are removed after the
convergence of the algorithm. For comparison, we carry out the state feedback policy
iteration, Algorithm 3.9. The algorithm is carried out using input-state data with the
same objective function and identical learning period as that of the output feedback
algorithm, Algorithm 3.11. It should be noted that the learning phase of both the
state and output feedback algorithms uses randomly generated frequencies. We
would like to compare the performance achievable under the state feedback and
output feedback algorithms.
The closed-loop response under the state feedback PI algorithm and the closed-
loop response under the output feedback PI algorithm are shown in Figs. 3.9
and 3.10, respectively. It can be seen that the post-learning trajectories with the
converged optimal solutions of both the state and output feedback algorithms are
similar, which shows that the output feedback policy recovers the performance
achievable under the state feedback policy. This confirms the result in Theorem 3.11
that the output feedback policy converges to its state feedback counterpart. It can be
seen that the proposed output feedback algorithm is able to achieve stabilization
similar to the state feedback algorithm but with the advantage that the access to
the full state of the system is no longer needed. The results of the convergence of
parameter estimates of the output feedback algorithm, Algorithm 3.11, are shown in
Figs. 3.11, 3.12, and 3.13. It can be seen that convergence to the optimal parameters
is achieved without incurring any estimation bias and the closed-loop stability is
guaranteed.
Example 3.3 (A Double Integrator with Disturbance) We now test Algorithm 3.12
on an unstable system. We consider the double integrator system, that is, system
(3.39) with
A = [0  1
     0  0],

B = [0
     1],
Fig. 3.9 Example 3.2: State trajectory of the closed-loop system under state feedback
Fig. 3.10 Example 3.2: State trajectory of the closed-loop system under output feedback
Fig. 3.11 Example 3.2: Convergence of the output feedback cost matrix P̄ j
Fig. 3.12 Example 3.2: Convergence of the Player 1 output feedback policy K̄ j
Fig. 3.13 Example 3.2: Convergence of the Player 2 output feedback policy Ḡj
E = [1
     1],
C = [1  0].
The double integrator model represents a large class of practical systems including
satellite attitude control and rigid body motion. It is known that such systems are not
static output feedback stabilizable. We choose the performance index parameters as
Q_y = 1 and γ = 3. The eigenvalues of the filter matrix A are placed at −2. The optimal
control parameters are found by solving the GARE (3.47) and then computing Wu ,
Ww , Wy , P̄ ∗ , K̄ ∗ , and Ḡ∗ as
P* = [1.7997  1.4821
      1.4821  2.0941],

K* = [1.4821  2.0941],

G* = [0.3646  0.3974],

Wu = [1  0
      4  1],

Ww = [1  1
      0  1],

Wy = [4  4
      0  4].
Fig. 3.14 Example 3.3: State trajectory of the closed-loop system under state feedback
Fig. 3.15 Example 3.3: State trajectory of the closed-loop system under output feedback
The initial controller parameters are set to zero. The learning period is T = 0.05s
with a total of l = 20 intervals. The GARE recursions are performed with the step
size

ε_i = (i^{0.2} + 5)^{−1},  i = 0, 1, 2, . . .
The closed-loop response under the state feedback algorithm, Algorithm 3.10,
is shown in Fig. 3.14 and the closed-loop response under the output feedback
algorithm, Algorithm 3.12, is shown in Fig. 3.15. The convergence of the parameter
estimates in the output feedback algorithm is shown in Figs. 3.16, 3.17, and 3.18.
Fig. 3.16 Example 3.3: Convergence of the output feedback cost matrix P̄ j
Fig. 3.17 Example 3.3: Convergence of the Player 1 output feedback policy K̄ j
Fig. 3.18 Example 3.3: Convergence of the Player 2 output feedback policy Ḡj
3.5 Summary
The aim of this chapter is to present model-free output feedback algorithms for the
linear quadratic zero-sum game and the associated H∞ control problem. Both the
discrete-time and differential zero-sum games have been considered. In the first part
of the chapter, we presented an output feedback model-free solution for the discrete-
time linear quadratic zero-sum game and the associated H∞ control problem.
In particular, we developed an output feedback Q-function description, which is
more comprehensive than the value function description [56] due to the explicit
dependence of the Q-function on the control inputs and disturbances. In contrast to
[56], the issue of excitation noise bias is not present in our work due to the inclusion
of the input terms in the cost function, which results in the cancellation of noise
dependent terms in the Bellman equation. A proof of excitation noise immunity of
the proposed Q-learning scheme was provided. As a result, the presented algorithm
does not require a discounting factor which has been used in output feedback value
function learning. It was established that the presented method guarantees closed-
loop stability and that the learned output feedback controller is the optimal controller
corresponding to the solution of the original Riccati equation. Also, our approach
is different from the recently proposed off-policy technique used in [48], which
also addresses the excitation noise issue but requires a stabilizing initial policy and
full state feedback. Both of these requirements are not present in the presented
output feedback work. We note that the output feedback design presented here is
completely model-free. While other output feedback control schemes exist in the
literature, they require certain knowledge of the system dynamics and employ a
separate state estimator.
The second half of this chapter was devoted to the differential game, which
is the continuous-time counterpart of the discrete-time results in the first part of
the chapter. Similar to the discrete-time case, two player zero-sum differential
games have the same problem formulation as that of the continuous-time H∞
control problem. The framework of integral reinforcement learning was employed
to develop the learning equations. In [120], the continuous-time linear quadratic
differential zero-sum game was solved using a partially model-free method. Later,
a completely model-free solution to the same problem was presented in [60]. The
authors in [10] provided a learning scheme which further relaxes the requirement
of a stabilizing initial policy. All of these works, however, require the measurement
of the full state of the system. In this chapter, we presented an extension of the
state parameterization result introduced in Chap. 2. The extended parameterization
incorporates the effect of disturbance and serves as the basis of the output feedback
learning equations for solving the differential zero-sum game. Compared to the
recently proposed static output feedback solution to this problem [83], we presented
a dynamic output feedback scheme that does not require the system to be stabilizable
by static output feedback and, therefore, can also stabilize systems that are only
stabilizable by dynamic output feedback. Differently from the adaptive observer
approaches [86], we presented a type of parameterization of the state that can
3.6 Notes and References 161
be directly embedded into the learning equation, thereby eliminating the need to
identify the unknown observer model parameters and to estimate the system state,
which would otherwise complicate the learning process. Instead, the optimal output
feedback policies are learned directly without involving a state estimator. Finally,
compared to the recent output feedback works [56, 80], the scheme in this paper
implicitly takes into account the exploration signals, which makes it immune to the
exploration bias and eliminates the need of a discounted cost function.
3.6 Notes and References

The zero-sum game involves a two-agent scenario in which the two agents, or players,
have opposing objectives. Coincidentally, the formulation of this problem matches the
time-domain formulation of the H∞ control problem. We presented an extension of
the state parameterization for both the discrete-time and the continuous-time H∞
problems to develop new Q-learning and integral reinforcement learning equations
that include the effect of the disturbance. For results on nonlinear systems, interested
readers can refer to [71]. It is worth mentioning that the approach does not
involve value function learning, which would otherwise involve a discounting factor.
Although there exists a lower bound (upper bound for the differential game) on the
value of the discounting factor above (below) which the closed-loop stability may
be ensured, the computation of this bound requires the knowledge of the system
dynamics, which is assumed to be unknown in this problem [88].
The presentation in Sect. 3.3 follows from our results in [95] and provides an
extension to the policy iteration algorithm that results in faster convergence. The
extension incorporates the rank analysis of the state parameterization subject to
disturbance and provides detailed convergence analysis of the learning algorithms
under this rank condition. The presentation of the continuous-time results in
Sect. 3.4 follows from our recent results in [103] and further includes the analysis
of the full row rank condition of the state parameterization in the convergence of the
output feedback policy iteration algorithm used in solving differential games.
Chapter 4
Model-Free Stabilization in the Presence
of Actuator Saturation
4.1 Introduction
a family of linear feedback laws, either of state feedback or output feedback type,
parameterized in a scalar positive parameter, called the low gain parameter. By semi-
global asymptotic stabilization by linear low gain feedback we mean that, for any
a priori given, arbitrarily large, bounded set of the state space, the value of the low
gain parameter can be tuned small enough so that the given set is contained in the
domain of attraction of the resulting closed-loop system. There are different ways to
construct low gain feedback [63]. One of such constructions is based on the solution
of a parameterized algebraic Riccati equation (ARE), parameterized in the low gain
parameter.
The low gain feedback design is a model-based approach, which relies on
solving the parameterized ARE. This chapter builds on the model-free techniques
that we introduced in Chap. 2 to solve the LQR problem, which also involves the
solution of an ARE, to develop model-free low gain feedback designs for global
asymptotic stabilization of linear systems in the presence of actuator saturation.
Of particular interest is the question of how to avoid the actuator saturation in
the learning algorithm as the standard LQR learning algorithm does not consider
actuator saturation. To address this question, we introduce a scheduling mechanism
in the learning algorithm that helps to learn an appropriate value of the low gain
parameter so that actuator saturation is avoided for any initial condition of the
system. As a result, the model-free technique presented in this chapter will achieve
global asymptotic stabilization. Both state feedback and output feedback designs, in
both the discrete-time and continuous-time settings, will be considered.
where γ ∈ (0, 1] is the low gain parameter. Based on the solution of the ARE (4.3),
a family of low gain feedback laws can be constructed as
uk = K ∗ (γ )xk , (4.4)
where

K*(γ) = −(B^T P*(γ) B + I)^{−1} B^T P*(γ) A

and P*(γ) is the unique positive definite solution to the ARE and the value of γ ∈
(0, 1] is appropriately chosen so that actuator saturation is avoided for any initial
condition inside the a priori given bounded set. As will be shown in later sections,
the appropriate value of γ is determined in an iterative manner.
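For a fixed value of γ, the parameterized ARE is simply the LQR ARE with Q = γI and R = I, so P*(γ) and K*(γ) can be computed with a standard solver when the model is known. The following is a minimal model-based Python sketch (the learning algorithms developed below remove the need for A and B); the function name and interface are illustrative.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def low_gain_gain_dt(A, B, gamma):
    """Return P*(gamma) and K*(gamma) of (4.4) for the given low gain parameter."""
    n, m = B.shape
    P = solve_discrete_are(A, B, gamma * np.eye(n), np.eye(m))     # Q = gamma*I, R = I
    K = -np.linalg.solve(B.T @ P @ B + np.eye(m), B.T @ P @ A)     # K*(gamma) in (4.4)
    return P, K
```

Running this sketch for progressively smaller values of γ also illustrates the property lim_{γ→0} P*(γ) = 0 stated in Lemma 4.1 below.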
We recall the following lemma from [63].
Lemma 4.1 Let Assumption 4.1 hold. Then, for each γ ∈ (0, 1], there exists a
unique positive definite matrix P ∗ (γ ) that solves the ARE (4.3). Moreover, such a
P ∗ (γ ) satisfies
1. limγ →0 P ∗ (γ ) = 0,
2. There exists a γ* ∈ (0, 1] such that

‖P*(γ)^{1/2} A P*(γ)^{−1/2}‖ ≤ √2,  γ ∈ (0, γ*].
In [63], it has been shown that the family of low gain feedback control laws (4.4)
achieve semi-global exponential stabilization of system (4.1) by choosing the value
of the low gain parameter γ sufficiently small to avoid actuator saturation. We recall
from [63] the following result.
Theorem 4.1 Consider the system (4.1). Under Assumption 4.1, for any a priori
given (arbitrarily large) bounded set W, there exists a γ ∗ ∈ (0, 1] such that for any
γ ∈ (0, γ ∗ ], the low gain feedback control law (4.4) renders the closed-loop system
exponentially stable at the origin with W contained in the domain of attraction.
Moreover, for any initial condition in W, actuator saturation does not occur.
Remark 4.1 The upper bound on the value of the low gain parameter γ depends on the
a priori given set of initial conditions as well as the actuator saturation limit b. The
learning based algorithms we are to develop in this chapter schedule the value of the
low gain parameter, and hence the feedback gain, as a function of the state of the
system to achieve global asymptotic stabilization.
4.3 Global Asymptotic Stabilization of Discrete-Time Systems
The parameterized ARE (4.3) is an LQR ARE. Thus, solving it amounts to solving
the optimization problem
V*(x_k) = min_u Σ_{i=k}^{∞} (γ x_i^T x_i + u_i^T u_i),       (4.5)
subject to the dynamics (4.1), where the LQR weighting matrices are Q = γ I
and R = I . The parameterized ARE possesses the characteristics of the LQR
ARE. In particular, its nonlinear characteristic makes it difficult to be solved
analytically. Along the lines of Chap. 2, we will present iterative algorithms to solve
the parameterized ARE to compute the low gain feedback gain K ∗ (γ ) for a given
value of γ ∈ (0, 1]. The Bellman equation corresponding to the problem (4.5) is
given by
which can be expressed in terms of the quadratic value function V (xk ) = xkT P (γ )xk
as
The state feedback policy in this case is uk = K(γ )xk , which gives us
That is, the Bellman equation for the low gain feedback design corresponds to a
parameterized Lyapunov equation that is parameterized in the low gain parameter
γ. For a given value of the low gain parameter, we can run iterations on the
parameterized Lyapunov equation in the same way as on the Bellman equation in
Chap. 2. In such a case, the Lyapunov iterations would converge to the solution
of the parameterized ARE. It is worth pointing out that, a suitable choice of the
value of the low gain parameter γ is implicit in the parameterized Lyapunov
equation (4.7) to ensure the resulting control policy satisfies the saturation avoidance
condition. A parameterized version of the policy iteration Algorithm 1.3 is presented
in Algorithm 4.1.
5: j ← j + 1
6: until ‖P^j(γ) − P^{j−1}(γ)‖ < ε for some small ε > 0.

5: j ← j + 1
6: until ‖P^j − P^{j−1}‖ < ε for some small ε > 0.
the policy leading to the low gain feedback policy. The limitation of these iterative
algorithms is that they not only require the knowledge of the system dynamics but
also apply only to an appropriately chosen value of the low gain parameter γ . This
low gain parameter plays a crucial role in satisfying the control constraints.
Motivated by the results in Chap. 2, we will now extend the Q-learning algo-
rithms to not only cater for the problem of unknown dynamics but also learn
an appropriate value of the low gain parameter so that the control constraint is
satisfied. It is also worth pointing out that the model-based techniques for solving
the parameterized ARE result in semi-global asymptotic stabilization in the sense
that the domain of attraction of the resulting closed-loop system can be enlarged
to enclose any a priori given, arbitrarily large, bounded set of the state space by
choosing the value of the low gain parameter sufficiently small. In the following
subsection it will be shown that the learning algorithms are able to adjust the value
of the low gain parameter to achieve global asymptotic stabilization.
which is the utility function used to solve the linear quadratic regulation (LQR)
problem with the user-defined performance matrices chosen as Q = γ I and R = I .
Next, we define the following cost function starting from state xk at time k that we
would like to minimize,
V(x_k) = Σ_{i=k}^{∞} r(x_i, u_i, γ).       (4.9)
Under a stabilizing feedback control policy uk = K(γ )xk , the total cost incurred
when starting with any state xk is quadratic in the state as given by
where VK (xk+1 ) is the cost of following policy K(γ ) in all future states. We now
use (4.11) to define a Q-function as
which is the sum of the one-step cost of executing an arbitrary control uk at time
index k together with the cost of executing policy K(γ ) from time index k + 1 on.
The low gain parameterized Q-function can be expressed as
or, equivalently, as
The optimal stabilizing controller for a given value of the low gain parameter γ
has a cost V ∗ with the associated optimal Q-function Q∗ and the optimal Q-function
matrix H ∗ (γ ). This optimal stabilizing controller can be obtained by solving
∂Q*/∂u_k = 0.
The above control law is the same as given in (4.4), which was obtained by solving
the parameterized ARE (4.3).
We now seek to develop a Q-learning Bellman equation for our problem. Using
(4.11) and (4.12), we see that
The above Bellman equation is linear in the unknown matrix H (γ ), that is, it can be
written in a linearly parameterized form by defining
where

H̄(γ) = vec(H(γ))
      = [h_11  2h_12  · · ·  2h_1(n+m)  h_22  2h_23  · · ·  2h_2(n+m)  · · ·  h_(n+m)(n+m)]^T
      ∈ R^{(n+m)(n+m+1)/2},
z̄_k = z_k ⊗ z_k,
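As a sketch of this parameterization, the regression vector paired with H̄(γ) can be formed by collecting the distinct quadratic products of z_k, with the factor of 2 on the off-diagonal terms carried by H̄(γ) as listed above. The helper below is an illustrative convention, not the book's code.

```python
import numpy as np

def quadratic_regressor(z):
    """Distinct products z_i*z_j for i <= j, ordered row by row to match H_bar."""
    z = np.asarray(z, dtype=float)
    return np.concatenate([z[i] * z[i:] for i in range(len(z))])   # length d(d+1)/2

# for the state feedback case: z_k = np.concatenate([x_k, u_k])
```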
Based on (4.19), we can utilize the Q-learning technique to learn the Q-function
matrix. The Bellman equation (4.17) is parameterized in the low gain parameter γ .
This parameter can be tuned and the utility function (4.8) can be updated such that
the resulting control law does not saturate the actuator. We are now ready to present
iterative Q-learning algorithms to find a control law that globally asymptotically
stabilizes the system. Both policy iteration and value iteration based Q-learning
algorithms will be presented.
Algorithms 4.3 and 4.4 employ the Q-learning based policy iteration and
value iteration techniques for finding a low gain feedback control law that avoids
saturation and achieves global asymptotic stabilization. In both the algorithms, we
begin by selecting a value of the low gain parameter γ ∈ (0, 1]. Different from
the unconstrained algorithms, we apply an open-loop control, which contains an
exploration signal that satisfies the saturation avoidance condition, to collect online
data. This initial control is referred to as the behavioral policy and is used to collect
system data to be used in Step 2. Note that, in the case of the PI algorithm, the
initial policy K 0 is used to compute the prediction term uk+1 = K 0 xk+1 on the
right-hand side of the Bellman equation and is chosen such that A + BK 0 is Schur
stable. However, it does not have to satisfy the control constraint as this policy is not
a behavioral policy and, therefore, will not be applied to the system to collect data.
The value iteration algorithm, Algorithm 4.4, uplifts this stabilizing initial policy
requirement but takes more iterations to converge as has been seen with other value
iteration algorithms studied in the previous chapters. In both the algorithms, the data
collection is performed only once and we use the same dataset repeatedly to learn
all future control policies. In each iteration, we use the collected data to solve the
Bellman Q-learning equation and then update the policy. The iterations eventually
converge to an optimal policy K ∗ (γ ) for a given γ . The key step in these constrained
control learning algorithms is the control constraint check, in which we check the
newly learned K ∗ (γ ) for the avoidance of actuator saturation. When the control
constraint is satisfied, we apply uk = K ∗ (γ )xk to the system, otherwise we reduce γ
and carry out Steps 3 to 7 with the updated cost function. We employ a proportional
Algorithm 4.3 Q-learning policy iteration algorithm for global asymptotic stabi-
lization by state feedback
input: input-state data
output: H*(γ) and K*(γ)
1: initialize. Select an admissible policy K^0 such that A + BK^0 is Schur stable. Pick a γ ∈ (0, 1]
   and set j ← 0.
2: collect online data. Apply an open-loop control u_k = ν_k with ν_k being the exploration signal
   and u_k satisfying the control constraint ‖u_k‖_∞ ≤ b. Collect L data samples, where
   L ≥ (n + m)(n + m + 1)/2.
3: repeat
4:   policy evaluation. Solve the following Bellman equation for H^j(γ),
6:   j ← j + 1
7: until ‖H^j(γ) − H^{j−1}(γ)‖ < ε for some small ε > 0.
8: control saturation check. For each k = L, L + 1, · · · , check the following control constraint,
   ‖K^j(γ) x_k‖_∞ ≤ b.
rule γ_{i+1} = αγ_i for 0 < α < 1 to update the value of the low gain parameter. As
can be seen, the newly learned control law is checked against the control constraint
prior to being applied to the system in order to avoid actuator saturation. The control
constraint is checked for all future times. It should be noted that the low gain
scheduling mechanism ensures that γ will be small enough in a finite number of
iterations in Step 8 at every time index.
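The overall scheduling logic can be sketched as follows. Here `learn_gain` is a hypothetical placeholder for Steps 3 to 7 of Algorithm 4.3 (or 4.4), reusing the same recorded dataset for every value of γ; the stopping threshold and reduction factor are illustrative assumptions.

```python
import numpy as np

def schedule_low_gain(learn_gain, states, b, alpha=0.5, gamma=1.0, gamma_min=1e-8):
    """Reduce gamma by the rule gamma <- alpha*gamma until K(gamma) avoids saturation."""
    while gamma > gamma_min:
        K = learn_gain(gamma)                    # converged gain for the current gamma
        controls = states @ K.T                  # u_k = K(gamma) x_k for the recorded states
        if np.max(np.abs(controls)) <= b:        # control constraint satisfied
            return gamma, K
        gamma *= alpha                           # proportional reduction rule
    raise RuntimeError("no admissible low gain parameter found above gamma_min")
```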
Note that the L data samples refer to different points in time, k, which are used
to form the data matrices Φ ∈ R^{((n+m)(n+m+1)/2)×L} and ϒ ∈ R^{L×1}. For the case of
Algorithm 4.3, these data matrices are defined as

Φ = [z̄_k^1 − z̄_{k+1}^1   z̄_k^2 − z̄_{k+1}^2   · · ·   z̄_k^L − z̄_{k+1}^L],
Algorithm 4.4 Q-learning value iteration algorithm for global asymptotic stabiliza-
tion by state feedback
input: input-state data
output: H* and K*
1: initialize. Select an arbitrary matrix H^0(γ) ≥ 0. Pick a γ ∈ (0, 1] and set j ← 0.
2: collect online data. Apply an open-loop control u_k = ν_k with ν_k being the exploration signal
   and u_k satisfying the control constraint ‖u_k‖_∞ ≤ b. Collect L data samples, where
   L ≥ (n + m)(n + m + 1)/2.
3: repeat
4:   policy evaluation. Solve the following Bellman equation for H^{j+1}(γ),
6:   j ← j + 1
7: until ‖H^j(γ) − H^{j−1}(γ)‖ < ε for some small ε > 0.
8: control saturation check. For each k = L, L + 1, · · · , check the following saturation
   constraint,
   ‖K^j(γ) x_k‖_∞ ≤ b.
ϒ = [r_1  r_2  · · ·  r_L]^T,
Notice that the control u_k = K^j(γ)x_k is linearly dependent on x_k, which means that
ΦΦ^T will not be invertible. To overcome this issue, one adds an excitation noise ν_k to
u_k, which guarantees a unique solution to (4.20). In other words, the following rank
condition needs to be satisfied,
This excitation condition can be met in several ways, such as adding sinusoidal noise
of various frequencies, exponentially decaying noise, or Gaussian noise. A detailed
discussion on such a persistent excitation condition can be found in the adaptive
control literature [116].
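A minimal sketch of the least-squares step and of the policy update that follows it is given below. It assumes the parameter vector is ordered as in H̄(γ) above, with the factor of 2 on the off-diagonal entries, and recovers the gain from the partition of the learned Q-function matrix as K(γ) = −H_uu^{-1} H_ux; the data matrix names are illustrative.

```python
import numpy as np

def q_learning_ls_step(Phi, Upsilon, n, m):
    """Least-squares solution of the Bellman equation and the implied feedback gain."""
    h_bar, *_ = np.linalg.lstsq(Phi.T, Upsilon, rcond=None)     # Phi stores one column per sample
    d = n + m
    H = np.zeros((d, d))
    idx = 0
    for i in range(d):
        for j in range(i, d):
            # H_bar carries 2*h_ij for the off-diagonal entries
            H[i, j] = H[j, i] = h_bar[idx] / (2.0 if j > i else 1.0)
            idx += 1
    Hux, Huu = H[n:, :n], H[n:, n:]
    K = -np.linalg.solve(Huu, Hux)                              # K(gamma) = -Huu^{-1} Hux
    return H, K
```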
Theorem 4.2 states the convergence of Algorithms 4.3 and 4.4.
Theorem 4.2 Consider system (4.1). Under Assumption 4.1 and the rank condition
(4.21), the iterative Q-learning algorithms, Algorithms 4.3 and 4.4, globally asymp-
totically stabilize the system at the origin.
Proof The proof is carried out in two steps, which correspond, respectively, to
the iteration on the low gain parameter γ and the value iteration on the matrix
H (γ ). With respect to the iterations on γ , denoted as γi , we have γi < γi−1 with
γi ∈ (0, 1]. By Lemma 4.1, there exists a unique P ∗ (γi ) > 0 that satisfies the
parameterized ARE (4.3) and, by definition (4.14), there exists a unique H ∗ (γi ).
By Theorem 4.1, for any given initial condition, there exists a γ ∗ ∈ (0, 1] such
that, for all γ_i ∈ (0, γ*], the closed-loop system is exponentially stable. As i
increases, γ_i decreases monotonically and eventually enters (0, γ*]. Note that, by Lemma 4.1, the value of the low
gain parameter can be made sufficiently small to accommodate any initial condition.
Under the stabilizability assumption on (A, B) and rank condition (4.21), the policy
iteration (in Algorithm 4.3) and value iteration (in Algorithm 4.4) steps converge to
the optimal H ∗ [13, 53]. Therefore, H j (γi ) converges to H ∗ (γi ), γi ∈ (0, γ ∗ ], as
j → ∞. Finally, we have the convergence of K j (γi ) to K ∗ (γi ), γi ∈ (0, γ ∗ ]. This
completes the proof.
Remark 4.2 The presented learning scheme does not seek to find γ ∗ , rather it
searches for a γ ∈ (0, γ ∗ ] that suffices to ensure closed-loop stability without
saturating the actuator. However, better closed-loop performance could be obtained
if the final γ is closer to γ ∗ , which can be achieved by applying smaller decrements
to γ in each iteration at the expense of more iterations.
The results in the previous subsections pertain to the full state feedback. Both
model-based and model-free solutions make use of the feedback of the full state
xk in the low gain feedback laws. This subsection builds upon the output feedback
techniques that we have developed in Chap. 2 to extend the state feedback results
in the previous subsection for designing a low gain based output feedback Q-
learning scheme. We recall from Chap. 2 that a key ingredient in achieving output
feedback Q-learning is the parameterization of the state in terms of the input-output
measurements. As will be seen next, the state parameterization result remains an
essential tool that allows us to develop the output feedback learning equations to
learn the low gain based output feedback controller that achieves global asymptotic
stabilization.
More specifically, we will extend Algorithms 4.3 and 4.4 to the output feedback
case by the method of state parameterization. To this end, we present the constrained
control counterpart of the state parameterization result in Theorem 2.2.
Theorem 4.3 Consider system (4.1). Let Assumption 4.2 hold. Then, the system
state can be represented in terms of the constrained input and measured output as
with
V_N = [(CA^{N−1})^T  · · ·  (CA)^T  C^T]^T,
U_N = [B  AB  · · ·  A^{N−1}B],

T_N = [ 0  CB  CAB  · · ·  CA^{N−2}B
        0  0   CB   · · ·  CA^{N−3}B
        ⋮  ⋮   ⋱    ⋱     ⋮
        0  0   · · ·  0    CB
        0  0   0     0     0 ].
Proof View the constrained control signal σ (uk ) as a new input signal and the result
follows directly from the result of Theorem 2.2.
We would like to use the above state parameterization to constitute a state-free Q-
function. We can write the expression in (4.22) as

x_k = [W_u  W_y] [ σ(ū_{k−1,k−N})
                   ȳ_{k−1,k−N}   ].       (4.23)

where

ζ_k = [σ^T(ū_{k−1,k−N})  ȳ^T_{k−1,k−N}  u_k^T]^T,

u_k* = −(H*_{uu})^{−1} (H*_{uū} σ(ū_{k−1,k−N}) + H*_{uȳ} ȳ_{k−1,k−N})
     = K*(γ) [ σ(ū_{k−1,k−N})
               ȳ_{k−1,k−N}   ].       (4.26)
Q_K = H̄^T(γ) ζ̄_k,       (4.28)

With the above parameterization, we have the following linear equation with
unknowns in H̄,

H̄^T(γ) ζ̄_k = γ y_k^T y_k + u_k^T u_k + H̄^T(γ) ζ̄_{k+1}.       (4.29)
Based on (4.29), we can utilize the Q-learning technique to learn the output feedback
Q-function matrix H. We now present the policy iteration and the value iteration
based Q-learning algorithms, Algorithms 4.5 and 4.6, that globally asymptotically
stabilize the system.
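The regression vector ζ_k used by these algorithms only requires storing the last N saturated inputs and the last N outputs. A small helper of the following form, with an assumed most-recent-first ordering, illustrates the bookkeeping; it is a sketch, not the book's implementation.

```python
import numpy as np
from collections import deque

class InputOutputHistory:
    """Maintains sigma(u_{k-1}),...,sigma(u_{k-N}) and y_{k-1},...,y_{k-N}."""

    def __init__(self, N, m, p):
        self.u_hist = deque([np.zeros(m)] * N, maxlen=N)
        self.y_hist = deque([np.zeros(p)] * N, maxlen=N)

    def zeta(self, u_k):
        """zeta_k = [sigma(u_bar)', y_bar', u_k']' of dimension mN + pN + m."""
        return np.concatenate([*self.u_hist, *self.y_hist, u_k])

    def push(self, sat_u_k, y_k):
        """Store the applied (saturated) input and the measured output."""
        self.u_hist.appendleft(np.asarray(sat_u_k, dtype=float))
        self.y_hist.appendleft(np.asarray(y_k, dtype=float))
```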
The output feedback Q-function matrix H has (mN + pN + m)(mN + pN +
m + 1)/2 unknowns. The data matrices Φ ∈ R^{((mN+pN+m)(mN+pN+m+1)/2)×L} and
ϒ ∈ R^{L×1} used in Algorithm 4.5 are given by

Φ = [ζ̄_k^1 − ζ̄_{k+1}^1   ζ̄_k^2 − ζ̄_{k+1}^2   · · ·   ζ̄_k^L − ζ̄_{k+1}^L],
ϒ = [r_1  r_2  · · ·  r_L]^T,
Algorithm 4.5 Q-learning policy iteration algorithm for global asymptotic stabi-
lization by output feedback
input: input-output data
output: H*(γ) and K*(γ)
1: initialize. Select an admissible policy K^0 such that A + BK^0 is Schur stable, where K^0 =
   K^0 W. Set γ ← 1 and j ← 0.
2: collect online data. Apply an open-loop control u_k = ν_k with ν_k being the exploration signal
   and u_k satisfying the control constraint ‖u_k‖_∞ ≤ b.
3: repeat
4:   policy evaluation. Solve the following Bellman equation for H^j(γ),
6:   j ← j + 1
7: until ‖H^j(γ) − H^{j−1}(γ)‖ < ε for some small ε > 0.
8: control saturation check. For each k = L, L + 1, · · · , check the following control constraint,
   ‖K^j(γ) [σ^T(ū_{k−1,k−N})  ȳ^T_{k−1,k−N}]^T‖_∞ ≤ b.
ϒ = [r_1 + (H̄^j(γ))^T ζ̄_{k+1}^1   r_2 + (H̄^j(γ))^T ζ̄_{k+1}^2   · · ·   r_L + (H̄^j(γ))^T ζ̄_{k+1}^L]^T.
Then the least-squares solution of the output feedback Bellman equation is given by

H̄^j(γ) = (ΦΦ^T)^{−1} Φ ϒ.       (4.30)
As was the case in the output feedback algorithms studied in Chap. 2, an excitation
signal ν_k is added to the control input during the learning phase to satisfy the
following rank condition,
Algorithm 4.6 Q-learning value iteration algorithm for global asymptotic stabiliza-
tion by output feedback
input: input-output data
output: H* and K*
1: initialize. Select an arbitrary matrix H^0 ≥ 0. Set γ ← 1 and j ← 0.
2: collect online data. Apply an open-loop control u_k = ν_k with ν_k being the exploration signal
   and u_k satisfying the control constraint (4.2). Collect L datasets of (x_k, u_k) for k ∈ [0, L − 1]
   along with their quadratic terms, where L ≥ (n + m)(n + m + 1)/2.
3: repeat
4:   policy evaluation. Solve the following Bellman equation for H^{j+1}(γ),
6:   j ← j + 1
7: until ‖H^j(γ) − H^{j−1}(γ)‖ < ε for some small ε > 0.
8: control saturation check. For each k = L, L + 1, · · · , check the following saturation
   condition,
   ‖K^j(γ) [σ^T(ū_{k−1,k−N})  ȳ^T_{k−1,k−N}]^T‖_∞ ≤ b.
and the corresponding control matrix K(γ ) without saturating the actuators. This
completes the proof.
The closed-loop eigenvalues have the magnitudes of 0.6132, 0.6132, 0.4193 and
0.4193, which are all less than 1. Hence A + BK is Schur stable. The closed-loop
response from the initial condition x0 = [1 1 1 1 ]T is shown in Fig. 4.1. As can
be seen, even though the feedback controller is chosen to be stabilizing, it violates
the control constraints and leads to instability. This motivates the design of low gain
feedback.
We first validate Algorithm 4.3 which is based on policy iteration and uses state
feedback. The initial state of the system remains as x0 = [ 1 1 1 1 ]T . The
algorithm is initialized with γ = 1 and
K^0 = [0.9339  −2.5171  3.2847  −1.8801],
with A+BK 0 being Schur stable but not necessarily satisfying the control constraint
as shown previously in Fig. 4.1. Figure 4.2 shows the state response and the control
effort. In every main iteration of the algorithm, the value of the low gain parameter
Fig. 4.1 Closed-loop response under linear state feedback control without taking actuator satura-
tion into consideration
γ is reduced by a factor of one half. In each sub-iteration under a given value
of the low gain parameter γ, the convergence criterion of ε = 0.01 was selected on
the controller parameters. In this simulation, we collected L = 15 data samples to
satisfy the rank condition (4.21). These data samples are collected only once using
a behavioral policy comprising sinusoidal signals of different frequencies and
magnitudes such that the control constraint is satisfied. Once these data samples are
available, we can repeatedly use this dataset to perform policy iterations for different values
of γ . It can be seen that the algorithm is able to find a suitable value of the low gain
parameter γ and learn the corresponding low gain state feedback control matrix
K(γ ) that guarantees convergence of the state without saturating the actuator. The
final value of the low gain parameter is γ = 2.4 × 10−4 and the corresponding low
gain feedback gain matrix is obtained by solving the ARE as
K*(γ) = [0.3002  −0.6717  0.6801  −0.2537].
The convergence of the low gain matrix is shown in Fig. 4.3 and its final estimate is
K(γ) = [0.3003  −0.6718  0.6802  −0.2537],
Fig. 4.3 Algorithm 4.3: Convergence of the feedback gain matrix with x0 = [ 1 1 1 1 ]T
The convergence of the low gain feedback gain matrix is shown in Fig. 4.5 and its
final estimate is
K(γ ) = 0.2222 −0.4899 0.4860 −0.1781 .
It can be seen that even with this larger initial condition, the algorithm is able to
stabilize the system with a lower value of the low gain parameter. Therefore, these
two cases illustrate global asymptotic stabilization of the presented scheme.
Fig. 4.5 Algorithm 4.3: Convergence of the control matrix with x0 = [5 5 5 5]T
We shall now validate that the model-free value iteration algorithm, Algo-
rithm 4.4, uplifts the requirement of a stabilizing initial policy. The algorithm is
thus initialized with H 0 = I , which implies that
K^0 = [0  0  0  0].
All other conditions remain the same as in the simulation of Algorithm 4.3. The
closed-loop response and the convergence of the parameter estimates for x0 =
[ 1 1 1 1 ]T are shown in Figs. 4.6 and 4.7, respectively.
The final estimate of the low gain feedback gain matrix is
K(γ) = [0.3025  −0.6771  0.6856  −0.2558].
Fig. 4.7 Algorithm 4.4: Convergence of the feedback gain matrix with x0 = [ 1 1 1 1 ]T
The corresponding results for the second initial condition x0 = [5 5 5 5]T are shown
in Figs. 4.8 and 4.9 with the final estimated low gain feedback matrix being
K(γ) = [0.2254  −0.4970  0.4930  −0.1807].
Upon comparing the simulation results of Algorithms 4.3 and 4.4, we see that both
algorithms are able to arrive at an appropriate value of the low gain parameter and
the associated low gain feedback matrix. Furthermore, Algorithm 4.4 eliminates the
need of a stabilizing initial policy at the expense of more iterations.
We now validate Algorithm 4.5, which uses output feedback. We choose N = 4
as the bound on the observability index. The initial state of the system is x0 =
[ 1 1 1 1 ]T . The algorithm is initialized with γ = 1 and
Fig. 4.9 Algorithm 4.4: Convergence of the feedback gain matrix with x0 = [ 5 5 5 5 ]T
K^0 = [−1.4728  −1.3258  −0.1333  1.6318  2.8714  −5.5783  4.7487  −1.6318].
Fig. 4.11 Algorithm 4.5: Convergence of the feedback gain matrix with x0 = [ 1 1 1 1 ]T
The convergence of the low gain feedback gain matrix is shown in Fig. 4.11 and its
final estimate is
K(γ) = [−0.1496  −0.0123  0.1462  0.2300  0.1805  −0.4941  0.5043  −0.2300],
Fig. 4.13 Algorithm 4.5: Convergence of the feedback gain matrix with x0 = [ 5 5 5 5 ]T
The convergence of the low gain feedback gain matrix is shown in Fig. 4.13 and its
final estimate is
K(γ) = [−0.0904  −0.0042  0.0897  0.1351  0.1016  −0.2822  0.2923  −0.1351].
It can be seen that even with this larger initial condition, the algorithm is able to
stabilize the system with a lower value of the low gain parameter. Also, it is worth
noting that the output feedback Algorithm 4.5 does not require full state feedback
but it needs more data samples owing to the larger number of unknown parameters
to be learned. As a result, the learning phase of Algorithm 4.5 is longer and results
in a smaller value of the low gain parameter as compared to the state feedback
Algorithm 4.3.
We shall now validate that the output feedback model-free value iteration
algorithm, Algorithm 4.6, uplifts the requirement of a stabilizing initial policy. The
algorithm is thus initialized with H^0 = I, which implies that
K^0(γ) = [0  0  0  0  0  0  0  0].
All other conditions remain the same as in the simulation of Algorithm 4.5. The
closed-loop response and the convergence of the parameter estimates for x0 =
[1 1 1 1]T are shown in Figs. 4.14 and 4.15, respectively. The final estimate
of the low gain feedback gain matrix is
K(γ) = [−0.1515  −0.0126  0.1477  0.2325  0.1826  −0.4996  0.5099  −0.2325].
Fig. 4.15 Algorithm 4.6: Convergence of the feedback gain matrix with x0 = [ 1 1 1 1 ]T
The corresponding results for the second initial condition x0 = [5 5 5 5]T are
shown in Figs. 4.16 and 4.17, with the final estimated low gain feedback gain matrix
being
K(γ) = [−0.0904  −0.0042  0.0897  0.1351  0.1016  −0.2822  0.2923  −0.1351].
Fig. 4.17 Algorithm 4.6: Convergence of the feedback gain matrix with x0 = [ 5 5 5 5 ]T
4.4 Global Asymptotic Stabilization of Continuous-Time Systems

In this section, we will address the problem of global asymptotic stabilization for
continuous-time linear systems subject to actuator saturation. The results presented
in this section are the continuous-time counterparts of the discrete-time results
presented in Sect. 4.3. The presented algorithms build upon the learning techniques
for continuous-time systems that were developed in Chap. 2 and incorporate a low
gain scheduling mechanism. In particular, we will first introduce the model-based
iterative techniques for designing a low gain feedback control law for continuous-
time systems. Then, we will present model-free techniques to learn the low gain
feedback control law. State feedback algorithms will be developed first and then
extended to arrive at output feedback learning algorithms.
Consider a continuous-time linear system subject to actuator saturation, where the
saturation function acts component-wise on the control input, that is, for
i = 1, 2, · · · , m,

σ(u_i) = −b    if u_i < −b,
         u_i   if −b ≤ u_i ≤ b,       (4.33)
         b     if u_i > b,
Assumption 4.3
1. (A, B) is stabilizable.
2. All eigenvalues of the system matrix A are in the closed left-half s-plane.
It is worth pointing out that the parameterized ARE (4.34) is the ARE found in
the solution of the LQR problem, in which the weighting matrix Q = γ I is
parameterized in the low gain parameter γ and R = I . As a result, the parameterized
ARE can be obtained directly by the substitution of these weights in the LQR ARE.
The resulting family of parameterized low gain feedback control laws is given by
u = K ∗ (γ )x, (4.35)
where
K ∗ (γ ) = −B T P ∗ (γ )
and P ∗ (γ ) > 0 is the unique positive definite solution of the ARE (4.34),
parameterized in the low gain parameter γ ∈ (0, 1].
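As in the discrete-time case, for a fixed γ the design reduces to an LQR problem, so a model-based sketch is immediate; running it for decreasing γ also illustrates the vanishing of P*(γ) stated in Lemma 4.2 below. The double-integrator data used here is only an illustration.

```python
import numpy as np
from scipy.linalg import solve_continuous_are

def low_gain_ct(A, B, gamma):
    """P*(gamma) and K*(gamma) = -B' P*(gamma) for the parameterized ARE (4.34)."""
    n, m = B.shape
    P = solve_continuous_are(A, B, gamma * np.eye(n), np.eye(m))
    return P, -B.T @ P

A = np.array([[0.0, 1.0], [0.0, 0.0]])    # open-loop eigenvalues on the imaginary axis
B = np.array([[0.0], [1.0]])
for gamma in (1e-1, 1e-3, 1e-5):
    P, K = low_gain_ct(A, B, gamma)
    print(gamma, np.linalg.norm(P))       # the norm of P*(gamma) decreases with gamma
```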
We recall the following results from [63].
Lemma 4.2 Under Assumption 4.3, for each γ ∈ (0, 1], there exists a unique
positive definite solution P*(γ) to the ARE (4.34) that satisfies

lim_{γ→0} P*(γ) = 0.
Theorem 4.5 Consider system (4.32). Let Assumption 4.3 hold. Then, for any a
priori given (arbitrarily large) bounded set W, there exists a γ ∗ such that for any
γ ∈ (0, γ ∗ ], the low gain feedback control law (4.35) renders the closed-loop system
exponentially stable at the origin with W contained in the domain of attraction.
Moreover, for any initial condition in W, actuator saturation does not occur.
Remark 4.3 The domain of attraction can be made arbitrarily large by making γ →
0. The upper limit of 1 for the value of γ is chosen for the sake of convenience only.
Theorem 4.5 establishes that semi-global asymptotic stabilization can be
achieved with a constant low gain parameter. Furthermore, as shown in [63],
by scheduling the low gain parameter as a function of the state, global asymptotic
stabilization can be achieved.
The results discussed so far in this section rely on solving the parameterized
ARE (4.34), which requires the complete knowledge of the system dynamics. In
this section, we are interested in solving the global asymptotic stabilization problem
by using measurable data without invoking the system dynamics (A, B, C). An
iterative learning approach is presented in which the low gain parameter is scheduled
online and the corresponding low gain feedback gain matrices are learned to achieve
global asymptotic stabilization.
subject to the dynamics (4.32), where the LQR weighting matrices are Q = γ I and
R = I with parameter γ being the low gain parameter.
In Chap. 2, we presented iterative algorithms that provide a computationally
feasible way of solving the continuous-time LQR ARE based on the solution of
the Lyapunov equation in the policy iteration technique or by performing recursion
on the ARE itself in the value iteration technique. Extensions of these algorithms to
solving the parameterized ARE will be presented next that enable us to find the
low gain feedback gain matrix K ∗ (γ ) for global asymptotic stabilization of the
system (4.32) without causing the actuator to saturate. A parameterized version of
the policy iteration algorithm, Algorithm 2.3 used for solving the continuous-time
LQR problem, is presented in Algorithm 4.7. As was the case with its discrete-
time counterpart, Algorithm 4.1, Algorithm 4.7 requires a suitable value of the low
gain parameter and a stabilizing initial control policy K 0 such that A + BK 0 is
Hurwitz. Also, recall that this policy does not have to be the one corresponding to
an appropriate value of γ . In other words, K 0 does not have to satisfy the control
constraint. This is an advantage of policy iteration in the constrained control setting
that will become more apparent when we develop policy iteration based learning algorithms in
K^{j+1}(γ) = −B^T P^j(γ).
5: j ← j + 1
6: until ‖P^j(γ) − P^{j−1}(γ)‖ < ε for some small ε > 0.
the following subsections. The convergence properties of Algorithm 4.7 remain the
same as those of Algorithm 2.3.
Following the results in Chap. 2, a value iteration algorithm for solving the
parameterized ARE is presented that circumvents the need of a stabilizing initial
policy at the expense of more iterations. Recall from Chap. 2 that {B_q}_{q=0}^{∞} is a
sequence of bounded nonempty sets that satisfy B_q ⊆ B_{q+1}, q ∈ Z_+, and
lim_{q→∞} B_q = P_n^+, where P_n^+ is the set of n-dimensional positive definite matrices.
Also, let {ε_j} be the step size sequence satisfying lim_{j→∞} ε_j = 0. With these
definitions, the value iteration algorithm is presented in Algorithm 4.8.
Algorithms 4.7 and 4.8 provide alternative ways of designing a low gain
feedback controller without requiring the parameterized ARE (4.34) to be solved directly.
These algorithms make use of the knowledge of the system dynamics and require
an appropriate value of the low gain parameter for an a priori bounded set of
initial conditions. As a result, these algorithms provide semi-global asymptotic
stabilization. In the remainder of this section, we will develop model-free learning
techniques to arrive at scheduled low gain feedback laws for global asymptotic
stabilization of system (4.32).
V(x) = x^T P^j(γ) x

for system (4.32). We can also represent the dynamics of system (4.32) as follows,

ẋ = (A + BK^j(γ)) x + B(σ(u) − K^j(γ) x).       (4.37)

Taking the time derivative of the Lyapunov function along the trajectory of (4.37) gives

V̇ = x^T ((A + BK^j(γ))^T P^j(γ) + P^j(γ)(A + BK^j(γ))) x
     + 2(σ(u) − K^j(γ) x)^T B^T P^j(γ) x.
$$K^{j+1}(\gamma) = -B^{\mathsf T}P^{j}(\gamma). \tag{4.40}$$
The unknowns $P^{j}(\gamma)$ and $K^{j+1}(\gamma)$ are obtained in the least-squares sense as the stacked vector
$$\begin{bmatrix}\operatorname{vecs}\bigl(P^{j}(\gamma)\bigr)\\ \operatorname{vec}\bigl(K^{j+1}(\gamma)\bigr)\end{bmatrix},$$
with the associated right-hand-side data vector $-I_{xx}\operatorname{vec}\bigl(Q^{j}\bigr)\in\mathbb{R}^{l}$, where
$$Q^{j} = \gamma I + \bigl(K^{j}(\gamma)\bigr)^{\mathsf T}K^{j}(\gamma),$$
$$\bar x = \bigl[x_1^2\ \ 2x_1x_2\ \cdots\ x_2^2\ \ 2x_2x_3\ \cdots\ x_n^2\bigr],$$
$$\operatorname{vecs}\bigl(P^{j}(\gamma)\bigr) = \bigl[P_{11}^{j}(\gamma)\ \ P_{12}^{j}(\gamma)\ \cdots\ P_{1n}^{j}(\gamma)\ \ P_{22}^{j}(\gamma)\ \ P_{23}^{j}(\gamma)\ \cdots\ P_{nn}^{j}(\gamma)\bigr].$$
The above discussion has focused on the learning equation (4.41), which is
parameterized in the low gain parameter γ . Note, however, that the appropriate
value of the low gain parameter in our model-free setting is not known a priori.
For this reason, we need a low gain parameter scheduling mechanism, which can
be embedded in the learning equation (4.41). To this end, we present the model-free
state feedback based policy iteration algorithm, Algorithm 4.9 for the continuous-
time system (4.32).
Algorithm 4.9 is an extension of Algorithm 2.5 in Chap. 2. The key difference
between the two algorithms is the inclusion of the low gain scheduling mechanism,
which results in an LQR problem with a time-varying objective function. This
mechanism allows us to learn an appropriate value of the low gain parameter that
would result in a linear feedback controller that avoids actuator saturation. The
learning equation (4.41) merges the policy evaluation and policy update steps of
policy iteration. This also implies that we need a stabilizing initial policy such that
A + BK 0 is Hurwitz as we have seen in the previous policy iteration algorithms.
However, the initial policy is used only to initialize the iterations and does not need
to satisfy the control constraint as it is not applied to the system. Instead, an open-
loop policy that satisfies the control constraint can be applied as a behavioral policy for
data generation purposes. It is also worth pointing out that the control constraint
check in Step 7, which drives the low gain parameter scheduling mechanism, is a
Algorithm 4.9 Model-free state feedback policy iteration algorithm for global asymptotic stabilization
input: input-state data
output: P∗(γ) and K∗(γ)
1: initialize. Select an admissible policy K^0 such that A + BK^0 is Hurwitz stable. Set γ ← 1 and j ← 0.
2: collect data. Collect system data for time t ∈ [t_0, t_l], where l is the number of learning intervals of length t_k − t_{k−1} = T, k = 1, 2, ···, l, by applying an open-loop control u = ν, with ν being the exploration signal and u satisfying the control constraint ‖u‖_∞ ≤ b.
3: repeat
4: evaluate and improve policy. Find the solution, P^j(γ) and K^{j+1}(γ), of the learning equation (4.41) in its least-squares form (4.42).
5: j ← j + 1
6: until ‖P^j(γ) − P^{j−1}(γ)‖ < ε for some small ε > 0.
7: control saturation check. Check the following control constraint,
‖K^j(γ)x‖_∞ ≤ b, t ≥ t_l,
where t_l = lT. If the control constraint is violated, reduce γ, reset j ← 0, and carry out Steps 3 to 6 with the updated value of the low gain parameter.
function of the current state, which in turn enables us to find a value of the low
gain parameter for any initial condition. As a result, we achieve global asymptotic
stabilization, instead of the semi-global asymptotic stabilization achieved by the
model-based policy iteration Algorithm 4.7.
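To make the interplay between the inner policy iteration and the outer γ-scheduling concrete, the following Python sketch mirrors the structure of Algorithm 4.9. It is a minimal illustration under stated assumptions, not the book's implementation: the helper solve_learning_equation (a least-squares solve of the learning equation from stored data), the recorded trajectory x_traj used for the saturation check, and the reduction factor a are placeholders.

```python
import numpy as np

def scheduled_low_gain_pi(solve_learning_equation, x_traj, b, K0,
                          a=0.5, eps=0.01, max_outer=20):
    """Outer loop of a scheduled low-gain policy iteration (sketch).

    solve_learning_equation(gamma, K) -> (P, K_next): least-squares solve of
    the learning equation for the current gamma and policy K (assumed given).
    x_traj: recorded state trajectory used for the saturation check.
    b: actuator saturation limit; a: proportional reduction factor for gamma.
    """
    gamma = 1.0
    P = K = None
    for _ in range(max_outer):
        K, P_prev = K0, None
        for _ in range(100):                      # inner policy-iteration loop
            P, K = solve_learning_equation(gamma, K)
            if P_prev is not None and np.linalg.norm(P - P_prev) < eps:
                break
            P_prev = P
        # control saturation check (Step 7): ||K x||_inf <= b along the data
        if np.max(np.abs(x_traj @ K.T)) <= b:
            return gamma, P, K                    # constraint satisfied
        gamma *= a                                # otherwise reduce gamma
    return gamma, P, K
```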
Parallel to Sect. 4.3, we would also like to develop a value iteration variant of
Algorithm 4.9 that would uplift the requirement of a stabilizing initial policy. To
this end, consider the Lyapunov function candidate parameterized in the low gain parameter γ,
$$V(x) = x^{\mathsf T}P(\gamma)x. \tag{4.43}$$
Evaluating the derivative of (4.43) along the trajectory of system (4.32), we have
$$\frac{d}{dt}\bigl(x^{\mathsf T}P(\gamma)x\bigr) = \bigl(Ax + B\sigma(u)\bigr)^{\mathsf T}P(\gamma)x + x^{\mathsf T}P(\gamma)\bigl(Ax + B\sigma(u)\bigr) = x^{\mathsf T}H(\gamma)x - 2\sigma^{\mathsf T}(u)K(\gamma)x, \tag{4.44}$$
or, equivalently,
where
$$\bar x = \bigl[x_1^2\ \ 2x_1x_2\ \cdots\ x_2^2\ \ 2x_2x_3\ \cdots\ x_n^2\bigr].$$
Based on the solution of Equation (4.46), we perform the recursion on the following equation,
$$P^{j+1}(\gamma) = P^{j}(\gamma) + \epsilon_j\Bigl(H^{j}(\gamma) + \gamma I - \bigl(K^{j}(\gamma)\bigr)^{\mathsf T}K^{j}(\gamma)\Bigr).$$
We are now ready to propose our scheduled low gain learning algorithm that
achieves model-free global asymptotic stabilization of system (4.32). Global asymp-
totic stabilization is achieved by preventing saturation under the scheduled low gain
feedback.
The details of Algorithm 4.10 are as follows. The algorithm is initialized with
a value of the low gain parameter γ ∈ (0, 1], say γ = 1, and an arbitrary control
policy comprising the exploration signal ν is used to generate system data. Note
that this initial policy is open-loop and not necessarily stabilizing. Furthermore,
it is selected so that actuator saturation is avoided. The trajectory data is used to
solve the learning equation in Step 4, where a stabilizing low gain feedback gain
is obtained corresponding to our choice of the low gain parameter. However, this
control policy needs to be checked for the control constraint to ensure that the value
of the low gain parameter is appropriate, which is done in Step 16. The value of
the low gain parameter γ can be updated under a proportional rule γj +1 = aγj , for
some a ∈ (0, 1), in future iterations. The novelty in Algorithm 4.10 is that there is
a scheduling mechanism for the time-varying low gain parameter that ensures the
satisfaction of the control constraint. This is in contrast to Algorithm 4.8, where the
knowledge of (A, B) as well as an appropriate value of the low gain parameter are
needed. Furthermore, the scheduling of the value of the low gain parameter enables
global asymptotic stabilization as the scheduling mechanism is updated according
to the current state of the system.
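The inner value-iteration recursion of Algorithm 4.10 can be sketched in the same spirit. The helper solve_H_K, the representation of the sets B_q as norm balls of growing radius, and the step size sequence (taken here as the one used later in the simulations) are assumptions made only for illustration.

```python
import numpy as np

def low_gain_value_iteration(solve_H_K, n, gamma, P0, eps=0.01,
                             ball_radius=10.0, max_iter=1000):
    """Inner value-iteration loop of the scheduled low-gain design (sketch).

    solve_H_K(gamma, P) -> (H, K): least-squares solve of the value-iteration
    learning equation for the current gamma and value matrix P (assumed given).
    The sets B_q are modeled here as norm balls of growing radius (an assumption).
    """
    step = lambda j: 1.0 / (j ** 0.5 + 5.0)      # step size used in the simulations
    P, q = P0.copy(), 0
    for j in range(max_iter):
        H, K = solve_H_K(gamma, P)
        P_tilde = P + step(j) * (H + gamma * np.eye(n) - K.T @ K)
        if np.linalg.norm(P_tilde) > ball_radius * (q + 1):  # P_tilde outside B_q
            P, q = P0.copy(), q + 1                           # reset and enlarge B_q
        elif np.linalg.norm(P_tilde - P) / step(j) < eps:     # converged
            return P
        else:
            P = P_tilde
    return P
```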
Algorithm 4.10 Model-free state feedback value iteration algorithm for global asymptotic stabilization
Input: input-state data
Output: P∗(γ) and K∗(γ)
1: initialize. Select P^0 > 0 and set γ ← 1, j ← 0 and q ← 0.
2: collect data. Collect system data for time t ∈ [t_0, t_l], where l is the number of learning intervals of length t_k − t_{k−1} = T, k = 1, 2, ···, l, by applying an open-loop control u = ν, with ν being the exploration signal and u satisfying the control constraint ‖u‖_∞ ≤ b.
3: loop
4: Find the solution, H^j(γ) and K^j(γ), of the learning equation (4.46).
5: P̃^{j+1}(γ) ← P^j(γ) + ε_j(H^j(γ) + γI − (K^j(γ))^T K^j(γ))
6: if P̃^{j+1}(γ) ∉ B_q then
7: P^{j+1}(γ) ← P^0
8: q ← q + 1
9: else if ‖P̃^{j+1}(γ) − P^j(γ)‖/ε_j < ε then
10: return P^j(γ) as an estimate of P∗(γ)
11: else
12: P^{j+1}(γ) ← P̃^{j+1}(γ)
13: end if
14: j ← j + 1
15: end loop
16: control saturation check. Check the following control constraint,
‖K^j(γ)x‖_∞ ≤ b, t ≥ t_l = lT.
If the control constraint is violated, reduce γ, reset j ← 0, and carry out Steps 3 to 15 with the updated value of the low gain parameter.
Recall from the previous chapters that, in order to solve the least-squares problem
of the forms (4.42) and (4.46), we need an exploration signal ν during the data
collection phase to satisfy the following rank condition,
$$\operatorname{rank}\bigl[\,I_{xx}\ \ I_{xu}\,\bigr] = n(n+1)/2 + mn. \tag{4.47}$$
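As a rough illustration of how the rank condition (4.47) can be monitored from data, the sketch below builds the matrices I_xx and I_xu by integrating the Kronecker products x ⊗ x and x ⊗ σ(u) over the learning intervals. A simple Riemann sum is used, and the sample arrays and interval length are hypothetical.

```python
import numpy as np

def check_rank_condition(x_samples, u_samples, dt, samples_per_interval):
    """Check rank([I_xx  I_xu]) = n(n+1)/2 + mn from sampled data (sketch).

    x_samples: (N, n) states, u_samples: (N, m) saturated inputs, sampled at dt.
    Each learning interval consists of samples_per_interval consecutive samples.
    """
    N, n = x_samples.shape
    m = u_samples.shape[1]
    rows_xx, rows_xu = [], []
    for start in range(0, N - samples_per_interval, samples_per_interval):
        sl = slice(start, start + samples_per_interval)
        xx = np.einsum('ti,tj->tij', x_samples[sl], x_samples[sl])   # x ⊗ x
        xu = np.einsum('ti,tj->tij', x_samples[sl], u_samples[sl])   # x ⊗ σ(u)
        rows_xx.append((xx.sum(axis=0) * dt).ravel())                # Riemann sum
        rows_xu.append((xu.sum(axis=0) * dt).ravel())
    data = np.hstack([np.array(rows_xx), np.array(rows_xu)])
    return np.linalg.matrix_rank(data) == n * (n + 1) // 2 + m * n
```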
The scheduled low gain parameter thus converges to a stabilizing γ ∈ (0, γ∗] since 0 < a < 1. The control constraint
‖K̂(γ)x‖_∞ ≤ b can be met by Theorem 4.5. This shows the convergence of
the scheduling mechanism found in the outer loops of Algorithms 4.9 and 4.10,
assuming that P∗(γ) and K∗(γ) can be obtained. To show the convergence
to P∗(γ) and K∗(γ), we next consider the inner loop with iteration index j
corresponding to the iterations on P(γ) and K(γ). In the case of Algorithm 4.9,
the inner loop has the following learning equation corresponding to a current low
gain parameter γ_i,
which has a unique solution provided that the rank condition (4.47) holds. Then,
the iterations on this equation are equivalent to the iterations on the parameterized
Lyapunov equation used in Algorithm 4.7 as shown in the derivation of (4.41) earlier
in this section. These iterations are the LQR Lyapunov iterations [51] with Q = γI
and converge under the controllability condition of (A, B) and the observability
condition of (A, √Q).
In the case of Algorithm 4.10, the inner-loop has the following learning equation
corresponding to the current value of the low gain parameter γi ,
This equation has a unique solution under the rank condition (4.47). Then, the
recursion
$$\tilde P^{j+1}(\gamma_i) \leftarrow P^{j}(\gamma_i) + \epsilon_j\Bigl(H^{j}(\gamma_i) + \gamma_i I - \bigl(K^{j}(\gamma_i)\bigr)^{\mathsf T}K^{j}(\gamma_i)\Bigr)$$
corresponds to Algorithm 4.8 for finding K∗(γ), which is the LQR recursive ARE
with Q = γ_i I, and its convergence is shown in [11] under the controllability
condition of (A, B) and the observability condition of (A, √Q).
In order to establish that the converged K ∗ (γ ) indeed stabilizes the system while
ensuring that actuator saturation does not occur, we perform the following Lyapunov
analysis.
We consider system (4.32) in the following closed-loop form,
$$\dot x = Ax + B\sigma(u) = \bigl(A - BB^{\mathsf T}P^{*}(\gamma)\bigr)x + B\bigl(\sigma(u) - u\bigr). \tag{4.48}$$
For the given initial condition x(0), let W be a bounded set that contains x(0) and
let c > 0 be a constant such that
and let γ∗ be such that, for all γ ∈ (0, γ∗], x ∈ L_V(c) implies that
$$\bigl\|B^{\mathsf T}P^{*}(\gamma)x\bigr\|_{\infty} \le b.$$
In other words, we can always find a γ that makes the above norm small enough so
that the control operates in the linear region of the saturation function. The evaluation
of the derivative of V along the trajectory of the closed-loop system (4.48) shows
that, for all x ∈ L_V(c),
$$\dot V = x^{\mathsf T}P^{*}(\gamma)\bigl[\bigl(A - BB^{\mathsf T}P^{*}(\gamma)\bigr)x + B(\sigma(u) - u)\bigr] + \bigl[\bigl(A - BB^{\mathsf T}P^{*}(\gamma)\bigr)x + B(\sigma(u) - u)\bigr]^{\mathsf T}P^{*}(\gamma)x$$
$$= x^{\mathsf T}\bigl(A^{\mathsf T}P^{*}(\gamma) + P^{*}(\gamma)A\bigr)x - 2x^{\mathsf T}P^{*}(\gamma)BB^{\mathsf T}P^{*}(\gamma)x + 2x^{\mathsf T}P^{*}(\gamma)B\bigl(\sigma(u) - u\bigr).$$
Since P∗(γ) satisfies the parameterized ARE with Q = γI and R = I, we have
$$A^{\mathsf T}P^{*}(\gamma) + P^{*}(\gamma)A = -\gamma I + P^{*}(\gamma)BB^{\mathsf T}P^{*}(\gamma).$$
Since
$$\bigl\|B^{\mathsf T}P^{*}(\gamma)x\bigr\|_{\infty} \le b,$$
it follows from the definition of the saturation function (4.33) that σ(u) = u, which results in
$$\dot V = -\gamma x^{\mathsf T}x - x^{\mathsf T}P^{*}(\gamma)BB^{\mathsf T}P^{*}(\gamma)x \le -\gamma x^{\mathsf T}x,$$
which in turn implies that, for any γ ∈ (0, γ∗], the equilibrium x = 0 of the closed-
loop system is asymptotically stable, with W ⊂ L_V(c) contained in the domain of
attraction. Since x(0) is arbitrary, we establish global
asymptotic stabilization. This completes the proof.
where
$$\zeta_u = \bigl[\zeta_u^{1\mathsf T}\ \ \zeta_u^{2\mathsf T}\ \cdots\ \zeta_u^{m\mathsf T}\bigr]^{\mathsf T}, \qquad \zeta_y = \bigl[\zeta_y^{1\mathsf T}\ \ \zeta_y^{2\mathsf T}\ \cdots\ \zeta_y^{p\mathsf T}\bigr]^{\mathsf T},$$
and ζ_u^i and ζ_y^i are constructed, respectively, from the ith input u^i and the ith output y^i as
$$\dot\zeta_u^{i}(t) = A\zeta_u^{i}(t) + b\,\sigma\bigl(u^{i}(t)\bigr), \quad i = 1, 2, \cdots, m,$$
$$\dot\zeta_y^{i}(t) = A\zeta_y^{i}(t) + b\,y^{i}(t), \quad i = 1, 2, \cdots, p,$$
for any Hurwitz stable matrix A in the controllable form and $b = \bigl[0\ \ 0\ \cdots\ 1\bigr]^{\mathsf T}$, with
$\zeta_u^{i}(0) = 0$ and $\zeta_y^{i}(0) = 0$. Then, x̄(t) converges to the state x(t) as t → ∞.
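A minimal sketch of these state parameterization filters is given below, assuming sampled input-output data and forward-Euler integration; the coefficient list alphas defining the companion-form matrix and the sampling step dt are illustrative choices, not prescriptions from the text.

```python
import numpy as np

def parameterization_filters(u_sig, y_sig, alphas, dt):
    """Propagate zeta_u' = A zeta_u + b*sigma(u^i) and zeta_y' = A zeta_y + b*y^i (sketch).

    u_sig: (N, m) saturated input samples, y_sig: (N, p) output samples.
    alphas: [a_0, ..., a_{n-1}], coefficients of a Hurwitz characteristic
    polynomial defining the companion-form matrix A; b = [0, ..., 0, 1]^T.
    Forward-Euler integration with step dt is used for simplicity.
    """
    n = len(alphas)
    A = np.zeros((n, n)); A[:-1, 1:] = np.eye(n - 1); A[-1, :] = -np.asarray(alphas)
    b = np.zeros(n); b[-1] = 1.0
    N, m = u_sig.shape; p = y_sig.shape[1]
    zu = np.zeros((m, n)); zy = np.zeros((p, n))            # zero initial conditions
    zu_hist, zy_hist = [], []
    for k in range(N):
        zu = zu + dt * (zu @ A.T + np.outer(u_sig[k], b))   # one filter per input
        zy = zy + dt * (zy @ A.T + np.outer(y_sig[k], b))   # one filter per output
        zu_hist.append(zu.ravel()); zy_hist.append(zy.ravel())
    return np.array(zu_hist), np.array(zy_hist)             # stacked zeta_u(t), zeta_y(t)
```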
In the following we will develop the output feedback learning equations for the
constrained control problem. We introduce the following definitions,
$$z(t) = \bigl[\zeta_u^{\mathsf T}(t)\ \ \zeta_y^{\mathsf T}(t)\bigr]^{\mathsf T} \in \mathbb{R}^{N},$$
$$W = \bigl[W_u\ \ W_y\bigr] \in \mathbb{R}^{n\times N},$$
$$\bar P(\gamma) = W^{\mathsf T}P(\gamma)W \in \mathbb{R}^{N\times N},$$
$$\bar K(\gamma) = K(\gamma)W \in \mathbb{R}^{m\times N},$$
where N = mn + pn. In view of these definitions and Theorem 2.9, and using
y = Cx, we can write the constrained control learning equation (4.41) as
Equation (4.52) is the output feedback version of the constrained control learning
equation (4.41). It can be seen that, with the help of the auxiliary dynamics z, the
full state has been eliminated in this equation. As was the case in its state feedback
counterpart, embedded in this equation are the steps of policy evaluation and policy
update found in a model-based policy iteration algorithm. The two unknowns in
this equation are P̄ j and K̄ j +1 , which contain a total of N(N + 1)/2 + Nm scalar
unknowns. To solve (4.52) in the least-squares sense, we can collect l ≥ N(N +
1)/2 + N m datasets of input-output data to form the following data matrices,
$$\delta_{zz} = \bigl[\bar z^{\mathsf T}(t_1) - \bar z^{\mathsf T}(t_0)\ \ \ \bar z^{\mathsf T}(t_2) - \bar z^{\mathsf T}(t_1)\ \ \cdots\ \ \bar z^{\mathsf T}(t_l) - \bar z^{\mathsf T}(t_{l-1})\bigr]^{\mathsf T},$$
$$I_{zu} = \Bigl[\int_{t_0}^{t_1}z(\tau)\otimes\sigma(u(\tau))\,d\tau\ \ \ \int_{t_1}^{t_2}z(\tau)\otimes\sigma(u(\tau))\,d\tau\ \ \cdots\ \ \int_{t_{l-1}}^{t_l}z(\tau)\otimes\sigma(u(\tau))\,d\tau\Bigr]^{\mathsf T},$$
$$I_{zz} = \Bigl[\int_{t_0}^{t_1}z(\tau)\otimes z(\tau)\,d\tau\ \ \cdots\ \ \int_{t_{l-1}}^{t_l}z(\tau)\otimes z(\tau)\,d\tau\Bigr]^{\mathsf T},$$
$$I_{yy} = \Bigl[\int_{t_0}^{t_1}y(\tau)\otimes y(\tau)\,d\tau\ \ \cdots\ \ \int_{t_{l-1}}^{t_l}y(\tau)\otimes y(\tau)\,d\tau\Bigr]^{\mathsf T}.$$
The associated right-hand-side data vector is $-I_{zz}\operatorname{vec}\bigl(\bar Q^{j}\bigr) - \gamma I_{yy}\operatorname{vec}(I) \in \mathbb{R}^{l}$, with
$$\bar Q^{j} = \bigl(\bar K^{j}(\gamma)\bigr)^{\mathsf T}\bar K^{j}(\gamma),$$
$$\bar z = \bigl[z_1^2\ \ 2z_1z_2\ \cdots\ z_2^2\ \ 2z_2z_3\ \cdots\ z_N^2\bigr],$$
$$\operatorname{vecs}\bigl(\bar P^{j}(\gamma)\bigr) = \bigl[\bar P_{11}^{j}(\gamma)\ \ \bar P_{12}^{j}(\gamma)\ \cdots\ \bar P_{1N}^{j}(\gamma)\ \ \bar P_{22}^{j}(\gamma)\ \ \bar P_{23}^{j}(\gamma)\ \cdots\ \bar P_{NN}^{j}(\gamma)\bigr].$$
Algorithm 4.11 Model-free output feedback policy iteration algorithm for asymptotic stabilization
input: input-output data
output: P̄∗(γ) and K̄∗(γ)
1: initialize. Select an admissible policy K̄^0 such that A + BK^0 is Hurwitz. Set γ ← 1 and j ← 0.
2: collect data. Collect system data for time t ∈ [t_0, t_l], where l is the number of learning intervals of length t_k − t_{k−1} = T, k = 1, 2, ···, l, by applying an open-loop control u = ν, with ν being the exploration signal and u satisfying the control constraint ‖u‖_∞ ≤ b.
3: repeat
4: evaluate and improve policy. Find the solution, P̄^j(γ) and K̄^{j+1}(γ), of the learning equation (4.52) in its least-squares form (4.53).
5: j ← j + 1
6: until ‖P̄^j(γ) − P̄^{j−1}(γ)‖ < ε for some small ε > 0.
7: control saturation check. Check the following control constraint,
‖K̄^j(γ)z‖_∞ ≤ b, t ≥ t_l = lT.
If the control constraint is violated, reduce γ, reset j ← 0, and carry out Steps 3 to 6 with the updated value of the low gain parameter.
$$-\,2\int_{t-T}^{t}\sigma^{\mathsf T}\bigl(u(\tau)\bigr)K(\gamma)x(\tau)\,d\tau.$$
or, equivalently,
$$z^{\mathsf T}(t)\otimes z^{\mathsf T}(t)\Big|_{t-T}^{t}\operatorname{vec}\bigl(\bar P(\gamma)\bigr) + \gamma\int_{t-T}^{t}y^{\mathsf T}(\tau)\otimes y^{\mathsf T}(\tau)\,d\tau\operatorname{vec}(I) = \int_{t-T}^{t}\bar z(\tau)\,d\tau\operatorname{vecs}\bigl(\bar H(\gamma)\bigr) - 2\int_{t-T}^{t}z^{\mathsf T}(\tau)\otimes\sigma^{\mathsf T}\bigl(u(\tau)\bigr)d\tau\operatorname{vec}\bigl(\bar K(\gamma)\bigr), \tag{4.54}$$
where
$$\bar H(\gamma) = W^{\mathsf T}\bigl(A^{\mathsf T}P(\gamma) + P(\gamma)A + \gamma C^{\mathsf T}C\bigr)W$$
and
$$\bar z = \bigl[z_1^2\ \ 2z_1z_2\ \ 2z_1z_3\ \cdots\ z_2^2\ \ 2z_2z_3\ \cdots\ z_N^2\bigr].$$
Equation (4.54) is a parameterized learning equation that uses only output feedback.
Similar to the state feedback equation (4.45), it is a scalar equation linear in the
unknowns H̄ (γ ) and K̄(γ ). These matrices are the output feedback counterparts of
the matrices H (γ ) and K(γ ) in the state feedback case. As there are more unknowns
than the number of equations, we develop a system of l such equations by performing
l finite window integrals, each of length T. To solve this linear system
of equations, we define the following data matrices,
$$\delta_{zz} = \Bigl[\,z\otimes z\big|_{t_0}^{t_1}\ \ \ z\otimes z\big|_{t_1}^{t_2}\ \ \cdots\ \ z\otimes z\big|_{t_{l-1}}^{t_l}\Bigr]^{\mathsf T},$$
$$I_{zu} = \Bigl[\int_{t_0}^{t_1}z(\tau)\otimes\sigma(u(\tau))\,d\tau\ \ \cdots\ \ \int_{t_{l-1}}^{t_l}z(\tau)\otimes\sigma(u(\tau))\,d\tau\Bigr]^{\mathsf T},$$
$$I_{zz} = \Bigl[\int_{t_0}^{t_1}\bar z^{\mathsf T}(\tau)\,d\tau\ \ \cdots\ \ \int_{t_{l-1}}^{t_l}\bar z^{\mathsf T}(\tau)\,d\tau\Bigr]^{\mathsf T},$$
$$I_{yy} = \Bigl[\int_{t_0}^{t_1}y(\tau)\otimes y(\tau)\,d\tau\ \ \cdots\ \ \int_{t_{l-1}}^{t_l}y(\tau)\otimes y(\tau)\,d\tau\Bigr]^{\mathsf T},$$
where each row corresponds to one learning interval [t_{k−1}, t_k], k = 1, 2, ···, l. The least-squares solution is then given by
$$\begin{bmatrix}\operatorname{vecs}\bigl(\bar H(\gamma)\bigr)\\ \operatorname{vec}\bigl(\bar K(\gamma)\bigr)\end{bmatrix} = \Bigl(\bigl[I_{zz}\ \ -2I_{zu}\bigr]^{\mathsf T}\bigl[I_{zz}\ \ -2I_{zu}\bigr]\Bigr)^{-1}\bigl[I_{zz}\ \ -2I_{zu}\bigr]^{\mathsf T}\bigl(\delta_{zz}\operatorname{vec}\bigl(\bar P(\gamma)\bigr) + \gamma I_{yy}\operatorname{vec}(I)\bigr). \tag{4.55}$$
Based on the solution of (4.55), we perform the recursion on the following equation,
$$\bar P^{j+1}(\gamma) = \bar P^{j}(\gamma) + \epsilon_j\Bigl(\bar H^{j}(\gamma) - \bigl(\bar K^{j}(\gamma)\bigr)^{\mathsf T}\bar K^{j}(\gamma)\Bigr).$$
As has been the case with the previous algorithms, the learning equation discussed
above requires an appropriate value of the low gain parameter as a target objective
function. This is achieved by a low gain parameter scheduling mechanism. The
resulting model-free output feedback value iteration algorithm for global asymptotic
stabilization of system (4.32) is presented in Algorithm 4.12.
Compared to the state feedback Algorithms 4.9 and 4.10, the output feedback
Algorithms 4.11 and 4.12 involve more unknown parameters and require the
following rank condition for the solution of the least-squares problems (4.53) and
(4.55),
$$\operatorname{rank}\bigl[\,I_{zz}\ \ I_{zu}\,\bigr] = N(N+1)/2 + mN. \tag{4.56}$$
In this subsection we will validate the model-free algorithms for the continuous-time
system (4.32). Consider system (4.32) with
Algorithm 4.12 Model-free output feedback value iteration algorithm for global asymptotic stabilization
Input: input-output data
Output: P̄∗(γ) and K̄∗(γ)
1: initialize. Select P̄^0 ≥ 0 and set γ ← 1, j ← 0 and q ← 0.
2: collect data. Collect system data for time t ∈ [t_0, t_l], where l is the number of learning intervals of length t_k − t_{k−1} = T, k = 1, 2, ···, l, by applying an open-loop control u = ν, with ν being the exploration signal and u satisfying the control constraint ‖u‖_∞ ≤ b.
3: loop
4: Find the solution, H̄^j(γ) and K̄^j(γ), of the learning equation (4.54) in its least-squares form (4.55).
5: $\tilde{\bar P}^{j+1}(\gamma) \leftarrow \bar P^{j}(\gamma) + \epsilon_j\bigl(\bar H^{j}(\gamma) - (\bar K^{j}(\gamma))^{\mathsf T}\bar K^{j}(\gamma)\bigr)$
6: if $\tilde{\bar P}^{j+1}(\gamma) \notin B_q$ then
7: P̄^{j+1}(γ) ← P̄^0
8: q ← q + 1
9: else if $\|\tilde{\bar P}^{j+1}(\gamma) - \bar P^{j}(\gamma)\|/\epsilon_j < \varepsilon$ then
10: return P̄^j(γ) and K̄^j(γ) as P̄∗(γ) and K̄∗(γ), respectively
11: else
12: $\bar P^{j+1}(\gamma) \leftarrow \tilde{\bar P}^{j+1}(\gamma)$
13: end if
14: j ← j + 1
15: end loop
16: control saturation check. Check the following control constraint,
‖K̄^j(γ)z‖_∞ ≤ b, t ≥ t_l.
If the saturation condition is violated, reduce γ, reset j ← 0, and carry out Steps 3 to 15 with the updated value of the low gain parameter.
$$A = \begin{bmatrix} 0 & 1 & 0 & 0\\ 0 & 0 & 1 & 0\\ 0 & 0 & 0 & 1\\ -1 & 0 & -2 & 0\end{bmatrix}, \qquad B = \begin{bmatrix}0\\0\\0\\1\end{bmatrix},$$
$$C = \begin{bmatrix}1 & 0 & 0 & 0\end{bmatrix}.$$
Matrix A has a pair of repeated eigenvalues at ±j and, therefore, the system is open-
loop unstable. The actuator saturation limit is b = 1. Both the state feedback and
the output feedback algorithms will be validated along with their policy iteration and
value iteration variants. In order to appreciate the motivation for low gain feedback,
let us first test the system with a general stabilizing state feedback control law with
feedback gain matrix
$$K = \begin{bmatrix}-23 & -50 & -33 & -10\end{bmatrix}.$$
This state feedback law results in the closed-loop eigenvalues of {−1, −2, −3, −4}.
Figure 4.18 shows the closed-loop response under this controller. Upon examining
these results it is evident that this controller repeatedly violates the control constraint
imposed by actuator saturation, which causes instability of the closed-loop system
even though K is chosen such that A + BK is Hurwitz.
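The effect described above can be reproduced qualitatively with a few lines of simulation. The sketch below integrates the closed-loop system under the saturated version of this feedback law with a forward-Euler scheme; the integration step, the initial condition, and the reported diagnostic are choices made for illustration only.

```python
import numpy as np

# System and saturating state feedback from the example (sketch; Euler integration).
A = np.array([[0., 1., 0., 0.],
              [0., 0., 1., 0.],
              [0., 0., 0., 1.],
              [-1., 0., -2., 0.]])
B = np.array([[0.], [0.], [0.], [1.]])
K = np.array([[-23., -50., -33., -10.]])   # places closed-loop poles at -1, -2, -3, -4
b_limit = 1.0                              # actuator saturation level

def simulate(x0, T=60.0, dt=1e-3):
    x, xs, violations = np.asarray(x0, float), [], 0
    for _ in range(int(T / dt)):
        u = K @ x                                    # unconstrained feedback demand
        u_sat = np.clip(u, -b_limit, b_limit)        # what the actuator delivers
        violations += int(np.any(np.abs(u) > b_limit))
        x = x + dt * (A @ x + (B @ u_sat).ravel())
        xs.append(x.copy())
    return np.array(xs), violations

traj, hits = simulate([0.25, -0.5, -0.5, 0.25])
print("samples where the control constraint was violated:", hits)
```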
We will now focus on the model-free scheduled low gain feedback designs.
Consider first the state feedback policy iteration Algorithm 4.9. Let the initial state
of the system be $x_0 = \bigl[0.25\ \ -0.5\ \ -0.5\ \ 0.25\bigr]^{\mathsf T}$ and initialize the algorithm with
γ = 1 and $K^0 = \bigl[-23\ \ -50\ \ -33\ \ -10\bigr]$, which is stabilizing, with A + BK^0
being Hurwitz stable, but does not meet the control constraint, as shown in Fig. 4.18.
Figure 4.19 shows the state response and the control effort under Algorithm 4.9.
Figure 4.19 shows the state response and the control effort under Algorithm 4.9.
Fig. 4.18 Closed-loop response under a stabilizing state feedback law designed without taking
actuator saturation into consideration
Fig. 4.19 Algorithm 4.9: Closed-loop response with $x_0 = [0.25\ \ -0.5\ \ -0.5\ \ 0.25]^{\mathsf T}$
In every main iteration of the algorithm, the low gain parameter γ is reduced by a
factor of one half. In each iteration under a given value of the low gain parameter γ ,
the convergence criterion of ε = 0.01 is selected for the iteration on the controller
parameters. In this simulation, we collected L = 15 data samples to satisfy the rank
condition (4.47). These data samples are collected only once using a behavioral
policy comprising sinusoidal signals of different frequencies and magnitudes
such that the control constraint is satisfied. Once these data samples are available,
we repeatedly use this dataset in policy iterations for different values of γ . It can be
seen that the algorithm is able to find a suitable value of the low gain parameter
γ and learn the corresponding low gain state feedback gain matrix K(γ ) that
guarantees convergence of the state without saturating the actuator. The final value
of γ is 0.1250 and the corresponding gain matrix obtained by solving the ARE is
$$K^{*}(\gamma) = \bigl[-0.0607\ \ -1.4046\ \ -0.7567\ \ -1.2800\bigr].$$
The convergence of the low gain feedback gain is shown in Fig. 4.20 and its final
estimate is
$$K(\gamma) = \bigl[-0.0605\ \ -1.4055\ \ -0.7571\ \ -1.2809\bigr],$$
which is close to its nominal value K∗(γ).
Fig. 4.20 Algorithm 4.9: Convergence of the feedback gain matrix with $x_0 = [0.25\ \ -0.5\ \ -0.5\ \ 0.25]^{\mathsf T}$
We next consider a larger initial condition, $x_0 = [0.5\ \ -1\ \ -1\ \ 0.5]^{\mathsf T}$. Fig. 4.21 shows the closed-loop response. The convergence of the estimate of the
low gain feedback gain matrix is shown in Fig. 4.22 and its final estimate is
$$K(\gamma) = \bigl[-0.0085\ \ -0.7534\ \ -0.2577\ \ -0.7294\bigr].$$
It can be seen that, with this larger initial condition, the algorithm stabilizes
the system with a lower value of the low gain parameter. This illustrates that
Algorithm 4.9 achieves global asymptotic stabilization without using the knowledge
of the system dynamics.
We will now test the model-free value iteration Algorithm 4.10, which uplifts the
requirement of a stabilizing initial policy K^0. The algorithm is initialized with
γ = 1 and $P^{0}(\gamma) = 0.01I_4$. We set the step size
$$\epsilon_k = \bigl(k^{0.5} + 5\bigr)^{-1}, \quad k = 0, 1, 2, \cdots$$
Fig. 4.21 Algorithm 4.9: Closed-loop response with $x_0 = [0.5\ \ -1\ \ -1\ \ 0.5]^{\mathsf T}$
Fig. 4.22 Algorithm 4.9: Convergence of the feedback gain matrix with $x_0 = [0.5\ \ -1\ \ -1\ \ 0.5]^{\mathsf T}$
Results for the first initial condition $x(0) = [0.25\ \ -0.5\ \ -0.5\ \ 0.25]^{\mathsf T}$ are shown
in Fig. 4.23. The convergence of the low gain feedback gain matrix is shown in
Fig. 4.24. The final value of the low gain parameter is γ = 0.1250 and the final
estimate of the corresponding low gain feedback gain matrix is
$$K(\gamma) = \bigl[-0.0600\ \ -1.4006\ \ -0.7550\ \ -1.2773\bigr].$$
Fig. 4.23 Algorithm 4.10: Closed-loop response with $x_0 = [0.25\ \ -0.5\ \ -0.5\ \ 0.25]^{\mathsf T}$
Fig. 4.24 Algorithm 4.10: Convergence of the feedback gain matrix with $x_0 = [0.25\ \ -0.5\ \ -0.5\ \ 0.25]^{\mathsf T}$
Figures 4.25 and 4.26 show, respectively, the closed-loop response with the larger
initial condition $x_0 = [0.5\ \ -1\ \ -1\ \ 0.5]^{\mathsf T}$ and the convergence of the low gain
feedback gain matrix. The final estimate is
$$K(\gamma) = \bigl[-0.0095\ \ -0.7463\ \ -0.2570\ \ -0.7224\bigr].$$
The simulation we have carried out above pertains to state feedback. We now
carry out simulation on the output feedback policy iteration Algorithm 4.11. For
the state parameterization, we construct the user-defined system matrix A with a
choice of the desired eigenvalues all at −1. This corresponds to α0 = 1, α1 =
4, α2 = 6, and α3 = 4, which are the entries of matrix A. These constants are
obtained from the characteristic polynomial $(s+1)^{4}$ corresponding to our choice
of the eigenvalues of matrix A at −1. The nominal values of the corresponding
output feedback parameters are computed from these matrices for comparison.
Fig. 4.25 Algorithm 4.10: Closed-loop response with $x_0 = [0.5\ \ -1\ \ -1\ \ 0.5]^{\mathsf T}$
Fig. 4.26 Algorithm 4.10: Convergence of the feedback gain matrix with $x_0 = [0.5\ \ -1\ \ -1\ \ 0.5]^{\mathsf T}$
It should be noted that the above matrices are used only to compute the nominal
output feedback parameters for comparison with the results of Algorithm 4.11.
Algorithm 4.11 itself does not require the knowledge of these matrices because
these parameters are learned directly. The initial state of the system is
$x(0) = [0.25\ \ -0.5\ \ -0.5\ \ 0.25]^{\mathsf T}$ and the algorithm is initialized with γ = 1 and
$$\bar K^{0}(\gamma) = \bigl[-315.95\ \ -222.37\ \ -73.06\ \ -10.00\ \ 292.81\ \ 79.66\ \ 332.45\ \ -80.93\bigr].$$
It can be observed from the results in Figs. 4.27 and 4.28 that the algorithm finds
a suitable value of the low gain parameter γ and learns the corresponding low
gain feedback gain matrix K̄(γ ) to guarantee the convergence to zero of the state.
It should be noted that the exploration signal is only needed in the first l = 60
intervals and is removed afterwards. The final estimate of the output feedback low
gain feedback gain matrix is
$$\bar K(\gamma) = \bigl[-2.2073\ \ -6.0467\ \ -3.8709\ \ -0.8715\ \ 2.1462\ \ 4.7728\ \ 3.4244\ \ 4.1578\bigr].$$
Fig. 4.27 Algorithm 4.11: Closed-loop response with $x_0 = [0.25\ \ -0.5\ \ -0.5\ \ 0.25]^{\mathsf T}$
Fig. 4.28 Algorithm 4.11: Convergence of the feedback gain matrix with $x_0 = [0.25\ \ -0.5\ \ -0.5\ \ 0.25]^{\mathsf T}$
We next consider the larger initial condition $x_0 = [0.5\ \ -1\ \ -1\ \ 0.5]^{\mathsf T}$. The final value
of the low gain parameter is found to be γ = 0.0156 and the corresponding low
gain feedback gain matrix is found by solving the ARE as
$$\bar K^{*}(\gamma) = \bigl[-0.6579\ \ -3.0932\ \ -2.1612\ \ -0.5074\ \ 0.6500\ \ 2.5196\ \ 1.1348\ \ 2.3912\bigr].$$
Figures 4.29 and 4.30 show the results under this new initial condition. The final
estimate of the low gain feedback gain matrix is
Fig. 4.29 Algorithm 4.11: Closed-loop response with $x_0 = [0.5\ \ -1\ \ -1\ \ 0.5]^{\mathsf T}$
Fig. 4.30 Algorithm 4.11: Convergence of the feedback gain matrix with $x_0 = [0.5\ \ -1\ \ -1\ \ 0.5]^{\mathsf T}$
$$\bar K(\gamma) = \bigl[-0.6580\ \ -3.0937\ \ -2.1615\ \ -0.5074\ \ 0.6501\ \ 2.5198\ \ 1.1349\ \ 2.3914\bigr],$$
which is close to its nominal value K̄∗(γ). As can be seen, even with this
larger initial condition, the output feedback Algorithm 4.11 is able to find a suitable
value of the low gain parameter and stabilize the system, which illustrates the global
asymptotic stabilization capability of Algorithm 4.11.
Finally, we shall verify the final algorithm of this chapter, which is the output
feedback value iteration algorithm, Algorithm 4.12. The algorithm is initialized with
γ = 1 and $\bar P^{0}(\gamma) = 0.01I_8$. We set the step size
$$\epsilon_k = \bigl(k^{0.5} + 5\bigr)^{-1}, \quad k = 0, 1, 2, \cdots.$$
All other simulation parameters remain the same as in the simulation of Algo-
rithm 4.11. It is worth noting that, compared to Algorithm 4.11, we no longer require
a stabilizing initial output feedback policy K̄ 0 during initialization. For the first
initial condition $x(0) = [0.25\ \ -0.5\ \ -0.5\ \ 0.25]^{\mathsf T}$, the algorithm finds the low gain
parameter γ = 0.1250. The closed-loop response and the convergence of the low
gain feedback gain matrix are shown in Figs. 4.31 and 4.32, respectively. The final
estimate of the low gain feedback gain matrix is
$$\bar K(\gamma) = \bigl[-2.2034\ \ -6.0429\ \ -3.8691\ \ -0.8711\ \ 2.1439\ \ 4.7732\ \ 3.4223\ \ 4.1588\bigr].$$
4.5 Summary
This chapter was motivated by the strong connection that exists between reinforce-
ment learning and Riccati equations as we saw in Chap. 2. The key idea revolves
around learning an appropriate value of the low gain parameter and finding the
solution of the corresponding parameterized ARE based on reinforcement learning.
The LQR learning techniques developed in Chap. 2 alone may not be able to learn
an appropriate low gain feedback control law that could also satisfy the control
constraint. As a result, a scheduling mechanism was introduced that would update
the value of the low gain parameter as a function of the current state to ensure that
Fig. 4.31 Algorithm 4.12: Closed-loop response with $x_0 = [0.25\ \ -0.5\ \ -0.5\ \ 0.25]^{\mathsf T}$
Fig. 4.32 Algorithm 4.12: Convergence of the feedback gain matrix with $x_0 = [0.25\ \ -0.5\ \ -0.5\ \ 0.25]^{\mathsf T}$
the learned policy avoids actuator saturation. This technique enables us to perform
global asymptotic stabilization by scheduling the value of the low gain parameter
as a function of the state. Compared to the results in Chap. 2, the scheduling of the
low gain parameter results in a time-varying objective function. It was shown that
the proposed learning algorithms are able to adapt to the changes in the objective
function and learn the appropriate low gain feedback controller. Both discrete-
time and continuous-time systems were considered. Model-free policy iteration and
value iteration using both full state and output feedback were presented and their
performance was thoroughly verified by numerical simulation.
Fig. 4.33 Algorithm 4.12: Closed-loop response with $x_0 = [0.5\ \ -1\ \ -1\ \ 0.5]^{\mathsf T}$
Fig. 4.34 Algorithm 4.12: Convergence of the feedback gain matrix with $x_0 = [0.5\ \ -1\ \ -1\ \ 0.5]^{\mathsf T}$

4.6 Notes and References
The past decades have witnessed strong interest in designing control algorithms for
constrained control systems. Early results in this area of work have focused on
obtaining stabilizability conditions subject to the operating range of the system.
It was recognized that global asymptotic stabilization is in general not possible
even for very simple systems that are subject to actuator saturation. It has been
established that global asymptotic stabilization under such a constraint is possible
only for a limited class of systems. For linear systems, in particular, it has been
shown in [112] that global asymptotic stabilization could be achieved only when
the system is asymptotically null controllable with bounded controls (ANCBC). An
224 4 Model-Free Stabilization in the Presence of Actuator Saturation
ANCBC system may be polynomially unstable but not exponentially unstable. Even
for such systems, one generally has to resort to nonlinear control laws to achieve
global results [26, 109]. The construction of such nonlinear laws is based on a good
insight of the particular problem at hand [112, 117]. Parallel results in the discrete-
time setting have also been presented, such as in [133].
Optimal control theory provides another formulation of the constrained control
problem by taking into account the control constraint in the objective function. This,
however, is not straightforward. The presence of such a constraint makes the optimal
control problem even more challenging as it results in further nonlinearity in the
Hamilton-Jacobi-Bellman (HJB) equation. Along this line of work, the idea of using
nonquadratic cost functional to encode control constraints has also gained popularity
[73]. One needs to resort to approximation techniques such as neural networks to
solve the complicated HJB equation. This difficulty extends even to problems that
are otherwise linear, since the use of such nonquadratic cost functionals results in a
nonlinear control law.
Reinforcement learning techniques have also been presented in the recent
literature based on the nonquadratic cost functional approach mentioned above. One
of the early developments along this approach was presented in [3], in which a
model-based near optimal solution to the constrained HJB equation was obtained
by employing neural network approximation. Follow-up works were focused on
uplifting the knowledge of the system dynamics. Partially model-free solutions
employing reinforcement learning were proposed in [46, 78, 81] towards solving this
problem in both the continuous-time and discrete-time settings. In these approaches
only local stability could be demonstrated and only in the form of uniform ultimate
boundedness rather than asymptotic stability.
An alternative paradigm called low gain feedback was introduced by the authors
[63] to deal with the constrained control problem. The key motivation in this
framework is to design a controller with a simple structure and at the same time
improve the overall stability characteristics. This technique takes a preventive
approach to designing a control law that operates within the saturation limits. In
our earlier works [63, 65, 66], we presented semi-global asymptotic stabilization
of the ANCBC systems for an a priori given bounded set of initial conditions.
An appealing technique for designing such controllers is based on the idea of
parameterized ARE, which is an LQR ARE parameterized in a low gain parameter.
The key advantage of the low gain feedback approach is that the resulting control
law remains linear. While the earlier works focused on semi-global results, later on
it was demonstrated that it is possible to convert semi-global results into global ones
by scheduling the value of the low gain parameter as a function of the state [63]. All
these designs make use of the knowledge of the system dynamics.
The presentation in this chapter follows our results on the model-free low gain
feedback designs for the discrete-time [92, 98] and continuous-time [99, 102] con-
strained control problems. Extensions to the policy iteration algorithm are presented
that provide a faster alternative to designing model-free learning algorithms, which
also uplift the control constraint on the stabilizing initial policy.
Chapter 5
Model-Free Control of Time Delay
Systems
5.1 Introduction
If the time delay system can be transformed into a delay-free form, existing control techniques would become readily applicable. In this line of
work, a popular technique known as the Smith predictor [108] or, more formally, the
method of finite spectrum assignment [74], was introduced in the early literature. In
this method, the control law utilizes a predicted future value of the state based on
the dynamic model of the system in order to offset the delay at the time the control
input signal arrives at the plant. This cancellation of the delay results in a closed-
loop system free from delays and enables us to apply existing control design tools
such as optimal control techniques [75].
An underlying assumption in the predictor feedback based optimal control of
time delay systems is the use of a model of the system to predict the future state
based on the current state and the history of the control input. Unfortunately, in
a model-free reinforcement learning setting we do not have the knowledge of
the system dynamics, which prevents us from predicting the future state. On the
other hand, it is known that the predictor feedback assigns a finite spectrum to
the closed-loop system resulting in a closed-loop system that has a finite number
of modes. This feature of predictor feedback is particularly useful for continuous-
time delay systems, which are inherently infinite dimensional due to the presence
of delays. In contrast, a remarkable property of discrete-time delay systems is
that they remain finite dimensional even in the presence of time delays. It is this
property that allows us to bring the original open-loop time delay system into a
delay-free form by the so-called state augmentation or the lifting technique. The
method transforms the delayed variables into additional state variables to avoid
prediction. The lifting technique, however, requires the knowledge of the delay in
the augmentation process.
This chapter builds upon the Q-learning technique that was presented in Chap. 2
to solve the discrete-time linear quadratic regulator problem. Because of the
presence of delays, the methods developed in Chap. 2 are, however, not readily
applicable as the Q-learning Bellman equation does not hold true if the delays
are neglected. In this chapter, we address the difficulty of designing a Q-learning
scheme in the presence of delays that are not known. Both state and input delays will
be considered. Instead of using the exact knowledge of the delays, we use an upper
bound on these delays to transform the time delay system into a delay-free form
by means of an extended state augmentation. This approach essentially converts
the LQR problem of unknown dynamics and unknown delays (both the lengths and
number of delays) to one involving higher order unknown dynamics free of delays.
Properties of the extended augmented system will be thoroughly analyzed in terms
of the controllability and observability conditions to establish the solvability of the
optimal control problem of the time delay system. We will then present the policy
iteration and value iteration based Q-learning algorithms using both state feedback
and output feedback to learn the optimal control policies of the time delay system.
The time delay problem is an important one but only a handful of developments have
been carried out along this line of work. One of the primary difficulties associated
with applying the predictor feedback approach in RL is that the prediction signal
is obtained from an embedded model of the system, which is unfortunately not
available in the model-free framework. This difficulty was addressed in [139], where
a bicausal change of coordinates was used to bring the discrete-time delay system
into a delay-free form. Differently from the predictor approach, this approach
renders the open-loop system into a form free of delays without requiring a predictor
feedback. This approach has been extended to solve the optimal tracking problem
[70]. It turns out that the existence of the bicausal transformation is a restrictive
assumption that is very hard to verify when the system dynamics is unknown.
This, in turn, limits feasibility of the approach to solve more general time delay
problems. Very recently, a different output feedback approach was proposed to solve
the problem in RL setting without requiring this bicausal transformation [29]. It is
worth noting that all these existing approaches require a precise knowledge of the
delays.
5.3 Problem Description

Consider a discrete-time linear system given by the following state space representation,
$$x_{k+1} = \sum_{i=0}^{S}A_i x_{k-i} + \sum_{i=0}^{T}B_i u_{k-i}, \qquad y_k = Cx_k, \tag{5.1}$$
Assumption 5.1 The following rank condition holds,
$$\rho\Bigl[\;\textstyle\sum_{i=0}^{S}A_i\lambda^{S-i} - \lambda^{S+1}I\quad \sum_{i=0}^{T}B_i\lambda^{T-i}\;\Bigr] = n,$$
for any
$$\lambda \in \Bigl\{\lambda\in\mathbb{C} : \det\Bigl(\textstyle\sum_{i=0}^{S}A_i\lambda^{S-i} - \lambda^{S+1}I\Bigr) = 0\Bigr\}.$$
Assumption 5.2 The following rank condition holds,
$$\rho\Bigl[\;\Bigl(\textstyle\sum_{i=0}^{S}A_i\lambda^{S-i} - \lambda^{S+1}I\Bigr)^{\mathsf T}\quad C^{\mathsf T}\;\Bigr] = n,$$
for any
$$\lambda \in \Bigl\{\lambda\in\mathbb{C} : \det\Bigl(\textstyle\sum_{i=0}^{S}A_i\lambda^{S-i} - \lambda^{S+1}I\Bigr) = 0 \text{ and } \lambda \ne 0\Bigr\},$$
and ρ(A_S) = n.
Assumption 5.3 The upper bounds S ≤ S̄ and T ≤ T̄ on the state and input delays
are known.
As will be shown, Assumptions 5.1 and 5.2 are the generalization of the
controllability and observability conditions of delay-free linear systems to systems
with multiple state and input delays. Assumption 5.3 is needed for the extended
augmentation to be presented in this chapter. Note that this assumption is mild
because it is often possible for us to determine upper bounds S̄ and T̄ such that
the conditions S ≤ S̄ and T ≤ T̄ hold. Under these assumptions, we are interested
in solving the linear quadratic regulation (LQR) problem for the time delay system
(5.1) with the following cost function,
$$V(x, u) = \sum_{i=k}^{\infty}r(x_i, u_i), \tag{5.2}$$
where $r(x_i, u_i) = x_i^{\mathsf T}Qx_i + u_i^{\mathsf T}Ru_i$ is the utility function, and Q ≥ 0 and R > 0 correspond to the desired parameters for penalizing the
states and the control, respectively.
5.4 Extended State Augmentation

We first present a state augmentation procedure that brings the system into a delay-free form. The augmentation is carried out by introducing the delayed states and the
delayed control inputs as additional states. To this end, let us define the augmented
state vector as
$$X_k = \bigl[x_k^{\mathsf T}\ \ x_{k-1}^{\mathsf T}\ \cdots\ x_{k-S}^{\mathsf T}\ \ u_{k-T}^{\mathsf T}\ \ u_{k-T+1}^{\mathsf T}\ \cdots\ u_{k-1}^{\mathsf T}\bigr]^{\mathsf T}. \tag{5.4}$$
The dynamic equation of the augmented system can be obtained from the original dynamics (5.1) as follows,
$$X_{k+1} = \begin{bmatrix}
A_0 & A_1 & \cdots & A_S & B_T & B_{T-1} & \cdots & B_1\\
I_n & 0 & \cdots & 0 & 0 & 0 & \cdots & 0\\
\vdots & \ddots & \ddots & \vdots & \vdots & \vdots & & \vdots\\
0 & \cdots & I_n & 0 & 0 & 0 & \cdots & 0\\
0 & \cdots & 0 & 0 & 0 & I_m & \cdots & 0\\
\vdots & & & \vdots & \vdots & & \ddots & \vdots\\
0 & \cdots & 0 & 0 & 0 & 0 & \cdots & I_m\\
0 & \cdots & 0 & 0 & 0 & 0 & \cdots & 0
\end{bmatrix}X_k +
\begin{bmatrix}B_0\\ 0\\ \vdots\\ 0\\ 0\\ \vdots\\ 0\\ I_m\end{bmatrix}u_k. \tag{5.5}$$
Since the maximum state delay S and input delay T are not known, we will extend
the augmentation further up to their upper bounds S̄ and T̄, respectively. For this
purpose, we introduce the extended augmented state vector as
$$\bar X_k = \bigl[\underbrace{x_k^{\mathsf T}\ \ x_{k-1}^{\mathsf T}\ \cdots\ x_{k-S}^{\mathsf T}\ \ x_{k-S-1}^{\mathsf T}\ \cdots\ x_{k-\bar S}^{\mathsf T}}_{\bar S + 1\ \text{blocks}}\ \ \underbrace{u_{k-\bar T}^{\mathsf T}\ \ u_{k-\bar T+1}^{\mathsf T}\ \cdots\ u_{k-T}^{\mathsf T}\ \cdots\ u_{k-1}^{\mathsf T}}_{\bar T\ \text{blocks}}\bigr]^{\mathsf T}, \tag{5.6}$$
whose dynamics are given by
$$\bar X_{k+1} = \begin{bmatrix}
A_0 & \cdots & A_S & 0 & \cdots & 0 & 0 & \cdots & B_T & \cdots & B_1\\
I_n & \cdots & 0 & 0 & \cdots & 0 & 0 & \cdots & 0 & \cdots & 0\\
\vdots & \ddots & \vdots & \vdots & & \vdots & \vdots & & \vdots & & \vdots\\
0 & \cdots & I_n & 0 & \cdots & 0 & 0 & \cdots & 0 & \cdots & 0\\
0 & \cdots & 0 & 0 & \cdots & 0 & 0 & I_m & \cdots & 0 & 0\\
\vdots & & & & & \vdots & \vdots & & \ddots & & \vdots\\
0 & \cdots & 0 & 0 & \cdots & 0 & 0 & \cdots & 0 & \cdots & I_m\\
0 & \cdots & 0 & 0 & \cdots & 0 & 0 & \cdots & 0 & \cdots & 0
\end{bmatrix}\bar X_k +
\begin{bmatrix}B_0\\ 0\\ \vdots\\ 0\\ 0\\ \vdots\\ 0\\ I_m\end{bmatrix}u_k
\;\overset{\Delta}{=}\; \bar A\bar X_k + \bar B u_k. \tag{5.7}$$
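For readers who wish to experiment numerically, the following sketch assembles the extended augmented pair (Ā, B̄) from given delay-difference matrices. It assumes the true A_i and B_i are available, which is only meaningful for verification, since the learning algorithms of this chapter never use them; the only delay information actually exploited is the pair of upper bounds (S̄, T̄).

```python
import numpy as np

def extended_augmentation(A_list, B_list, S_bar, T_bar):
    """Build (A_bar, B_bar) of the extended augmented system (5.7) (sketch).

    A_list = [A_0, ..., A_S], B_list = [B_0, ..., B_T]; S_bar >= S, T_bar >= T.
    State ordering follows (5.6): x_k, ..., x_{k-S_bar}, u_{k-T_bar}, ..., u_{k-1}.
    """
    n = A_list[0].shape[0]
    m = B_list[0].shape[1]
    dim = n * (S_bar + 1) + m * T_bar
    A_bar = np.zeros((dim, dim))
    B_bar = np.zeros((dim, m))
    # first block row:  x_{k+1} = sum_i A_i x_{k-i} + sum_i B_i u_{k-i}
    for i, Ai in enumerate(A_list):
        A_bar[:n, i * n:(i + 1) * n] = Ai
    base = n * (S_bar + 1)
    for i, Bi in enumerate(B_list[1:], start=1):      # B_1, ..., B_T
        col = base + m * (T_bar - i)                  # block holding u_{k-i}
        A_bar[:n, col:col + m] = Bi
    B_bar[:n, :] = B_list[0]                          # B_0 multiplies u_k
    # shift registers for the delayed states and inputs
    for i in range(S_bar):                            # x_{k-i} becomes x_{(k+1)-(i+1)}
        A_bar[n * (i + 1):n * (i + 2), n * i:n * (i + 1)] = np.eye(n)
    for j in range(T_bar - 1):                        # shift the stored inputs
        A_bar[base + m * j:base + m * (j + 1),
              base + m * (j + 1):base + m * (j + 2)] = np.eye(m)
    B_bar[base + m * (T_bar - 1):, :] = np.eye(m)     # newest stored input is u_k
    return A_bar, B_bar
```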
Comparing the extended augmented dynamics (5.7) with the original dynamics
(5.1), we see that the problem of finding an optimal controller for a time delay
system with both unknown delays and unknown dynamics is equivalent to finding
an optimal controller for an augmented delay-free system with unknown dynamics.
We will now study the controllability property of the augmented systems (5.5)
and (5.7).
Theorem 5.1 The delay-free augmented systems (5.5) and (5.7) are controllable if
and only if
$$\rho\Bigl[\;\textstyle\sum_{i=0}^{S}A_i\lambda^{S-i} - \lambda^{S+1}I\quad \sum_{i=0}^{T}B_i\lambda^{T-i}\;\Bigr] = n, \tag{5.8}$$
for any
$$\lambda \in \Bigl\{\lambda\in\mathbb{C} : \det\Bigl(\textstyle\sum_{i=0}^{S}A_i\lambda^{S-i} - \lambda^{S+1}I\Bigr) = 0\Bigr\}.$$
Proof Consider the columns associated with the B_i's. Adding λ times the last column
block to the second last column block, we have
$$\rho\bigl[A - \lambda I\ \ B\bigr] = \rho\begin{bmatrix}
\ast & B_T & B_{T-1} & \cdots & B_1 + B_0\lambda & B_0\\
\ast & 0 & 0 & \cdots & 0 & 0\\
\vdots & \vdots & \vdots & & \vdots & \vdots\\
\ast & 0 & 0 & \cdots & 0 & 0\\
0 & -\lambda I & I_m & \cdots & 0 & 0\\
\vdots & & \ddots & \ddots & \vdots & \vdots\\
0 & 0 & 0 & \cdots & I_m & 0\\
0 & 0 & 0 & \cdots & 0 & I_m
\end{bmatrix}.$$
Repeating the above step for each of the remaining columns results in
$$\rho\bigl[A - \lambda I\ \ B\bigr] = \rho\begin{bmatrix}
\ast & B_T + B_{T-1}\lambda + \cdots + B_0\lambda^{T} & \cdots & B_1 + B_0\lambda & B_0\\
\ast & 0 & \cdots & 0 & 0\\
\vdots & \vdots & & \vdots & \vdots\\
\ast & 0 & \cdots & 0 & 0\\
0 & 0 & I_m & \cdots & 0\\
\vdots & \vdots & & \ddots & \vdots\\
0 & 0 & 0 & \cdots & I_m
\end{bmatrix}.$$
Similarly, row operations can be used to cancel the entries in the first row and result
in
$$\rho\bigl[A - \lambda I\ \ B\bigr] = \rho\begin{bmatrix}
\ast & B_T + B_{T-1}\lambda + \cdots + B_0\lambda^{T} & 0 & \cdots & 0\\
\ast & 0 & 0 & \cdots & 0\\
\vdots & \vdots & \vdots & & \vdots\\
\ast & 0 & 0 & \cdots & 0\\
0 & 0 & I_m & \cdots & 0\\
\vdots & \vdots & & \ddots & \vdots\\
0 & 0 & 0 & \cdots & I_m
\end{bmatrix}.$$
Writing out the entries denoted by ∗ explicitly, we have
$$\rho\bigl[A - \lambda I\ \ B\bigr] = \rho\begin{bmatrix}
A_0 - \lambda I & A_1 & \cdots & A_S & B_T + B_{T-1}\lambda + \cdots + B_0\lambda^{T} & 0 & \cdots & 0\\
I_n & -\lambda I & \cdots & 0 & 0 & 0 & \cdots & 0\\
\vdots & \ddots & \ddots & \vdots & \vdots & \vdots & & \vdots\\
0 & \cdots & I_n & -\lambda I & 0 & 0 & \cdots & 0\\
0 & \cdots & 0 & 0 & 0 & I_m & \cdots & 0\\
\vdots & & & \vdots & \vdots & & \ddots & \vdots\\
0 & \cdots & 0 & 0 & 0 & 0 & \cdots & I_m
\end{bmatrix}.$$
Applying similar row and column operations to cancel out the entries corresponding
to the columns of Ai ’s results in
$$\rho\bigl[A - \lambda I\ \ B\bigr] = \rho\begin{bmatrix}
0 & \cdots & 0 & \sum_{i=0}^{S}A_i\lambda^{S-i} - \lambda^{S+1}I & \sum_{i=0}^{T}B_i\lambda^{T-i} & 0 & \cdots & 0\\
I_n & \cdots & 0 & 0 & 0 & 0 & \cdots & 0\\
\vdots & \ddots & \vdots & \vdots & \vdots & \vdots & & \vdots\\
0 & \cdots & I_n & 0 & 0 & 0 & \cdots & 0\\
0 & \cdots & 0 & 0 & 0 & I_m & \cdots & 0\\
\vdots & & \vdots & \vdots & \vdots & & \ddots & \vdots\\
0 & \cdots & 0 & 0 & 0 & 0 & \cdots & I_m
\end{bmatrix}
= \rho\Bigl[\;\textstyle\sum_{i=0}^{S}A_i\lambda^{S-i} - \lambda^{S+1}I\quad \sum_{i=0}^{T}B_i\lambda^{T-i}\;\Bigr] + nS + mT.$$
For full row rank, we need $\rho\bigl[A - \lambda I\ \ B\bigr] = n + nS + mT$. Thus, the full row rank of $\bigl[A - \lambda I\ \ B\bigr]$
is equivalent to
$$\rho\Bigl[\;\textstyle\sum_{i=0}^{S}A_i\lambda^{S-i} - \lambda^{S+1}I\quad \sum_{i=0}^{T}B_i\lambda^{T-i}\;\Bigr] = n, \quad \lambda\in\mathbb{C}.$$
For the extended augmented system (5.7), we evaluate $\rho\bigl[\bar A - \lambda I\ \ \bar B\bigr]$ by using
similar row and column operations on
$$\bigl[\bar A - \lambda I\ \ \bar B\bigr] = \begin{bmatrix}
A_0 - \lambda I & \cdots & A_S & 0 & \cdots & 0 & 0 & \cdots & B_T & \cdots & B_1 & B_0\\
I_n & -\lambda I & \cdots & 0 & \cdots & 0 & 0 & \cdots & 0 & \cdots & 0 & 0\\
\vdots & \ddots & \ddots & \vdots & & \vdots & \vdots & & \vdots & & \vdots & \vdots\\
0 & \cdots & I_n & -\lambda I & \cdots & 0 & 0 & \cdots & 0 & \cdots & 0 & 0\\
0 & \cdots & 0 & 0 & \cdots & 0 & -\lambda I & I_m & \cdots & 0 & 0 & 0\\
\vdots & & & \vdots & & \vdots & & \ddots & \ddots & & \vdots & \vdots\\
0 & \cdots & 0 & 0 & \cdots & 0 & 0 & \cdots & 0 & -\lambda I & I_m & 0\\
0 & \cdots & 0 & 0 & \cdots & 0 & 0 & \cdots & 0 & 0 & -\lambda I & I_m
\end{bmatrix},$$
which results in
$$\rho\bigl[\bar A - \lambda I\ \ \bar B\bigr] = \rho\Bigl[\;\textstyle\sum_{i=0}^{S}A_i\lambda^{S-i} - \lambda^{S+1}I\quad \sum_{i=0}^{T}B_i\lambda^{T-i}\;\Bigr] + n\bar S + m\bar T,$$
where we have used the fact that the padded zero columns do not affect the row
rank, whereas the S̄ number of In and T̄ number of Im matrices contribute nS̄ + mT̄
to the row
rank. For the controllability of the extended augmented system (5.7), we
need ρ Ā − λI B̄ = n + nS̄ + mT̄ , which is equivalent to
S T T −i
ρ i=0 Ai λ
S−i − λS+1 I i=0 Bi λ = n, λ ∈ C,
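An equivalent way to check controllability of the augmented pair numerically is the Popov-Belevitch-Hautus (PBH) test, sketched below for matrices produced, for example, by the augmentation helper shown earlier; the rank tolerance is an arbitrary choice.

```python
import numpy as np

def pbh_controllable(A_bar, B_bar, tol=1e-9):
    """PBH test (sketch): the pair is controllable iff rank[A_bar - lam*I, B_bar]
    equals the state dimension at every eigenvalue lam of A_bar."""
    dim = A_bar.shape[0]
    for lam in np.linalg.eigvals(A_bar):
        M = np.hstack([A_bar - lam * np.eye(dim), B_bar])
        if np.linalg.matrix_rank(M, tol) < dim:
            return False
    return True
```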
We now bring the focus to the design of the optimal controller. We rewrite the
original utility function for (5.1) in terms of the extended augmented state (5.6)
as
$$r\bigl(\bar X_k, u_k\bigr) = \bar X_k^{\mathsf T}\bar Q\bar X_k + u_k^{\mathsf T}Ru_k, \tag{5.9}$$
where
$$\bar Q = \begin{bmatrix}Q & 0\\ 0 & 0\end{bmatrix}.$$
Since the system is now in a delay-free form, we can readily compute the optimal
controller. From optimal control theory [55], we know that there exists a unique
optimal control sequence
$$u_k^{*} = \bar K^{*}\bar X_k = -\bigl(R + \bar B^{\mathsf T}\bar P^{*}\bar B\bigr)^{-1}\bar B^{\mathsf T}\bar P^{*}\bar A\bar X_k \tag{5.10}$$
that minimizes (5.2) under the conditions of the stabilizability of $(\bar A, \bar B)$ and the
detectability of $\bigl(\bar A, \sqrt{\bar Q}\bigr)$, where $\bar P^{*} = \bigl(\bar P^{*}\bigr)^{\mathsf T}$ is the unique positive semi-definite
solution to the following ARE,
$$\bar A^{\mathsf T}\bar P\bar A - \bar P + \bar Q - \bar A^{\mathsf T}\bar P\bar B\bigl(R + \bar B^{\mathsf T}\bar P\bar B\bigr)^{-1}\bar B^{\mathsf T}\bar P\bar A = 0. \tag{5.11}$$
Remark 5.2 The extended augmented states uk−T −1 to uk−T̄ and xk−S−1 to xk−S̄
are fictitious states and, therefore, are not reflected in the optimal control law, as
will be seen in the simulation results. That is, the control coefficients corresponding
to these states are zero.
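For comparison with the learned solutions, the nominal optimal gain can be computed directly from (5.10) and (5.11) when a model is available. A minimal sketch using SciPy's discrete-time ARE solver is shown below; it assumes the supplied matrices satisfy the stabilizability and detectability conditions stated above, and the function name is illustrative.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def nominal_augmented_lqr(A_bar, B_bar, Q, R):
    """Model-based solution of the augmented ARE (5.11) for comparison (sketch).

    Q penalizes only the current state x_k, so Q_bar = blkdiag(Q, 0).
    """
    dim = A_bar.shape[0]
    n = Q.shape[0]
    Q_bar = np.zeros((dim, dim))
    Q_bar[:n, :n] = Q
    P_bar = solve_discrete_are(A_bar, B_bar, Q_bar, R)
    K_bar = -np.linalg.solve(R + B_bar.T @ P_bar @ B_bar,
                             B_bar.T @ P_bar @ A_bar)
    return P_bar, K_bar
```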
We next work towards deriving the conditions for the observability of the
extended augmented system. To this end, we introduce an augmented output as
$$Y_k = \bigl[y_k^{\mathsf T}\ \ u_{k-T}^{\mathsf T}\bigr]^{\mathsf T} = CX_k,$$
where
$$C = \begin{bmatrix} C & 0 & \cdots & 0 & 0 & 0 & \cdots & 0\\ 0 & 0 & \cdots & 0 & I_m & 0 & \cdots & 0 \end{bmatrix},$$
in which the first group of S + 1 block columns corresponds to the delayed states and the second group of T block columns to the delayed inputs.
Similarly, we define the extended augmented output as
$$\bar Y_k = \bigl[y_k^{\mathsf T}\ \ u_{k-\bar T}^{\mathsf T}\bigr]^{\mathsf T} = \bar C\bar X_k,$$
where
$$\bar C = \begin{bmatrix} C & 0 & \cdots & 0 & 0 & 0 & \cdots & 0\\ 0 & 0 & \cdots & 0 & I_m & 0 & \cdots & 0 \end{bmatrix},$$
in which the first group of S̄ + 1 block columns corresponds to the delayed states and the second group of T̄ block columns to the delayed inputs.
We will now study the observability property of the augmented systems (5.5) and
(5.7).
Theorem 5.2 The delay-free augmented system (5.5) is observable if and only if
$$\rho\Bigl[\;\Bigl(\textstyle\sum_{i=0}^{S}A_i\lambda^{S-i} - \lambda^{S+1}I\Bigr)^{\mathsf T}\quad C^{\mathsf T}\;\Bigr] = n, \tag{5.12}$$
for any
$$\lambda \in \Bigl\{\lambda\in\mathbb{C} : \det\Bigl(\textstyle\sum_{i=0}^{S}A_i\lambda^{S-i} - \lambda^{S+1}I\Bigr) = 0 \text{ and } \lambda \ne 0\Bigr\},$$
and
$$\rho(A_S) = n. \tag{5.13}$$
The extended augmented system (5.7) is detectable if and only if (5.12) holds for any
$$\lambda \in \Bigl\{\lambda\in\mathbb{C} : \det\Bigl(\textstyle\sum_{i=0}^{S}A_i\lambda^{S-i} - \lambda^{S+1}I\Bigr) = 0 \text{ and } |\lambda| \ge 1\Bigr\}.$$
Proof By duality, the observability of (A, C) implies the controllability of $(A^{\mathsf T}, C^{\mathsf T})$.
Thus, (A, C) is observable if and only if $\bigl[A^{\mathsf T} - \lambda I\ \ C^{\mathsf T}\bigr]$ has a full row
rank of (S + 1)n + mT. We evaluate $\rho\bigl[A^{\mathsf T} - \lambda I\ \ C^{\mathsf T}\bigr]$ as
$$\rho\bigl[A^{\mathsf T} - \lambda I\ \ C^{\mathsf T}\bigr] = \rho\begin{bmatrix}
A_0^{\mathsf T} - \lambda I & I_n & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 & C^{\mathsf T} & 0\\
A_1^{\mathsf T} & -\lambda I & I_n & \cdots & 0 & 0 & 0 & \cdots & 0 & 0 & 0\\
\vdots & & \ddots & \ddots & \vdots & \vdots & \vdots & & \vdots & \vdots & \vdots\\
A_S^{\mathsf T} & 0 & \cdots & 0 & -\lambda I & 0 & 0 & \cdots & 0 & 0 & 0\\
B_T^{\mathsf T} & 0 & \cdots & 0 & 0 & -\lambda I & 0 & \cdots & 0 & 0 & I_m\\
B_{T-1}^{\mathsf T} & 0 & \cdots & 0 & 0 & I_m & -\lambda I & \cdots & 0 & 0 & 0\\
\vdots & \vdots & & & \vdots & & \ddots & \ddots & \vdots & \vdots & \vdots\\
B_1^{\mathsf T} & 0 & \cdots & 0 & 0 & 0 & \cdots & I_m & -\lambda I & 0 & 0
\end{bmatrix}.$$
Moving the last column to the beginning of the right partition of the above
partitioned matrix and then adding λ times this column to the next column to its
right results in
$$\rho\bigl[A^{\mathsf T} - \lambda I\ \ C^{\mathsf T}\bigr] = \rho\begin{bmatrix}
A_0^{\mathsf T} - \lambda I & I_n & \cdots & 0 & 0 & 0 & \cdots & 0 & 0 & C^{\mathsf T}\\
A_1^{\mathsf T} & -\lambda I & \ddots & 0 & 0 & 0 & \cdots & 0 & 0 & 0\\
\vdots & & \ddots & I_n & \vdots & \vdots & & \vdots & \vdots & \vdots\\
A_S^{\mathsf T} & 0 & \cdots & -\lambda I & 0 & 0 & \cdots & 0 & 0 & 0\\
B_T^{\mathsf T} & 0 & \cdots & 0 & I_m & 0 & \cdots & 0 & 0 & 0\\
B_{T-1}^{\mathsf T} & 0 & \cdots & 0 & 0 & I_m & -\lambda I & \cdots & 0 & 0\\
\vdots & \vdots & & \vdots & \vdots & & \ddots & \ddots & \vdots & \vdots\\
B_1^{\mathsf T} & 0 & \cdots & 0 & 0 & 0 & \cdots & I_m & -\lambda I & 0
\end{bmatrix}.$$
Repeating the above step for each of the remaining columns results in
$$\rho\bigl[A^{\mathsf T} - \lambda I\ \ C^{\mathsf T}\bigr] = \rho\begin{bmatrix}
A_0^{\mathsf T} - \lambda I & I_n & \cdots & 0 & 0 & 0 & \cdots & 0 & C^{\mathsf T}\\
A_1^{\mathsf T} & -\lambda I & \ddots & 0 & 0 & 0 & \cdots & 0 & 0\\
\vdots & & \ddots & I_n & \vdots & \vdots & & \vdots & \vdots\\
A_S^{\mathsf T} & 0 & \cdots & -\lambda I & 0 & 0 & \cdots & 0 & 0\\
B_T^{\mathsf T} & 0 & \cdots & 0 & I_m & 0 & \cdots & 0 & 0\\
B_{T-1}^{\mathsf T} & 0 & \cdots & 0 & 0 & I_m & \cdots & 0 & 0\\
\vdots & \vdots & & \vdots & \vdots & & \ddots & \vdots & \vdots\\
B_1^{\mathsf T} & 0 & \cdots & 0 & 0 & 0 & \cdots & I_m & 0
\end{bmatrix}.$$
Similarly, we can cancel all the Bi entries in the left partitions using the identity
columns from the right partition to result in
$$\rho\bigl[A^{\mathsf T} - \lambda I\ \ C^{\mathsf T}\bigr] = \rho\begin{bmatrix}
A_0^{\mathsf T} - \lambda I & I_n & \cdots & 0 & 0 & 0 & \cdots & 0 & C^{\mathsf T}\\
A_1^{\mathsf T} & -\lambda I & \ddots & 0 & 0 & 0 & \cdots & 0 & 0\\
\vdots & & \ddots & I_n & \vdots & \vdots & & \vdots & \vdots\\
A_S^{\mathsf T} & 0 & \cdots & -\lambda I & 0 & 0 & \cdots & 0 & 0\\
0 & 0 & \cdots & 0 & I_m & 0 & \cdots & 0 & 0\\
\vdots & \vdots & & \vdots & & \ddots & \ddots & \vdots & \vdots\\
0 & 0 & \cdots & 0 & 0 & 0 & \cdots & I_m & 0
\end{bmatrix}.$$
The rows in the lower partitions involve T number of Im matrices and contribute
mT to the rank. We next examine the upper partition without the zero columns,
$$X = \begin{bmatrix}
A_0^{\mathsf T} - \lambda I & I_n & 0 & \cdots & 0 & C^{\mathsf T}\\
A_1^{\mathsf T} & -\lambda I & I_n & \cdots & 0 & 0\\
A_2^{\mathsf T} & 0 & -\lambda I & \ddots & \vdots & 0\\
\vdots & \vdots & & \ddots & I_n & \vdots\\
A_S^{\mathsf T} & 0 & 0 & \cdots & -\lambda I & 0
\end{bmatrix}.$$
For λ ≠ 0, we can perform elementary row and column operations on the above
matrix to result in
$$\rho(X) = \rho\begin{bmatrix}
A_0^{\mathsf T} - \lambda I + \frac{1}{\lambda}A_1^{\mathsf T} + \cdots + \frac{1}{\lambda^{S}}A_S^{\mathsf T} & 0 & 0 & \cdots & 0 & C^{\mathsf T}\\
0 & I_n & 0 & \cdots & 0 & 0\\
0 & 0 & I_n & \ddots & 0 & 0\\
\vdots & \vdots & & \ddots & 0 & \vdots\\
0 & 0 & 0 & \cdots & I_n & 0
\end{bmatrix},$$
or, equivalently,
$$\rho\Bigl[\;\textstyle\sum_{i=0}^{S}A_i^{\mathsf T}\lambda^{S-i} - \lambda^{S+1}I\quad C^{\mathsf T}\;\Bigr] = n,$$
If λ = 0, then
$$\rho\begin{bmatrix}
A_0^{\mathsf T} & I_n & 0 & \cdots & 0 & C^{\mathsf T}\\
A_1^{\mathsf T} & 0 & I_n & \cdots & 0 & 0\\
A_2^{\mathsf T} & 0 & 0 & \ddots & \vdots & 0\\
\vdots & \vdots & \vdots & \ddots & I_n & \vdots\\
A_S^{\mathsf T} & 0 & 0 & \cdots & 0 & 0
\end{bmatrix} = (S+1)n$$
if and only if
$$\rho(A_S) = n.$$
Remark 5.3 The observability condition for the augmented system (5.5) can also be
relaxed to detectability, in which case we require only condition (5.12) to hold for
any
$$\lambda \in \Bigl\{\lambda\in\mathbb{C} : \det\Bigl(\textstyle\sum_{i=0}^{S}A_i\lambda^{S-i} - \lambda^{S+1}I\Bigr) = 0 \text{ and } |\lambda| \ge 1\Bigr\}.$$
5.5 State Feedback Q-learning Control of Time Delay Systems

In this section, we will present a Q-learning scheme for learning the optimal control
parameters that uplifts the requirement of the knowledge of the system dynamics
and the delays.
Under a stabilizing policy $u_k = \bar K\bar X_k$, the value function is quadratic in the extended
augmented state, $V_{\bar K}(\bar X_k) = \bar X_k^{\mathsf T}\bar P\bar X_k$, for some positive definite matrix P̄. The above
infinite horizon value function can be recursively written as
$$V_{\bar K}\bigl(\bar X_k\bigr) = \bar X_k^{\mathsf T}\bar Q\bar X_k + \bar X_k^{\mathsf T}\bar K^{\mathsf T}R\bar K\bar X_k + V_{\bar K}\bigl(\bar X_{k+1}\bigr).$$
Similar to the value function above, we can define a Q-function that gives the value
of executing an arbitrary control u_k instead of $u_k = \bar K\bar X_k$ at time k and then
following policy K̄ from time k + 1 on,
$$Q_{\bar K}\bigl(\bar X_k, u_k\bigr) = \bar X_k^{\mathsf T}\bar Q\bar X_k + u_k^{\mathsf T}Ru_k + V_{\bar K}\bigl(\bar X_{k+1}\bigr), \tag{5.15}$$
or, equivalently,
$$Q_{\bar K}\bigl(\bar X_k, u_k\bigr) = \begin{bmatrix}\bar X_k\\ u_k\end{bmatrix}^{\mathsf T}H\begin{bmatrix}\bar X_k\\ u_k\end{bmatrix}, \tag{5.16}$$
where
$$H = \begin{bmatrix}H_{XX} & H_{Xu}\\ H_{uX} & H_{uu}\end{bmatrix} = \begin{bmatrix}\bar Q + \bar A^{\mathsf T}\bar P\bar A & \bar A^{\mathsf T}\bar P\bar B\\ \bar B^{\mathsf T}\bar P\bar A & R + \bar B^{\mathsf T}\bar P\bar B\end{bmatrix}.$$
Setting
$$\frac{\partial}{\partial u_k}Q^{*} = 0$$
yields the optimal control in terms of the optimal Q-function matrix, $u_k^{*} = -\bigl(H_{uu}^{*}\bigr)^{-1}H_{uX}^{*}\bar X_k$.
It can be seen that the problem of finding an optimal controller boils down to
finding the optimal matrix H∗ or the optimal Q-function Q∗.
Q-learning is a model-free learning technique that estimates the optimal Q-function
without requiring the knowledge of the system dynamics. It does so by means
of the following Bellman Q-learning equation,
$$Q_{\bar K}\bigl(\bar X_k, u_k\bigr) = \bar X_k^{\mathsf T}\bar Q\bar X_k + u_k^{\mathsf T}Ru_k + Q_{\bar K}\bigl(\bar X_{k+1}, \bar K\bar X_{k+1}\bigr), \tag{5.17}$$
which is obtained by substituting $V_{\bar K}(\bar X_k) = Q_{\bar K}\bigl(\bar X_k, \bar K\bar X_k\bigr)$ in (5.15). Let
$$z_k = \begin{bmatrix}\bar X_k\\ u_k\end{bmatrix}.$$
Then the Bellman equation (5.17) can be written as
$$z_k^{\mathsf T}Hz_k = \bar X_k^{\mathsf T}\bar Q\bar X_k + u_k^{\mathsf T}Ru_k + z_{k+1}^{\mathsf T}Hz_{k+1}, \tag{5.18}$$
which is linear in the unknown matrix H. We can perform the following parameterization on Equation (5.17),
$$Q_{\bar K}(z_k) = Q_{\bar K}\bigl(\bar X_k, u_k\bigr) = \bar H^{\mathsf T}\bar z_k,$$
where
$$\bar H = \operatorname{vec}(H) = \bigl[h_{11}\ \ 2h_{12}\ \cdots\ 2h_{1l}\ \ h_{22}\ \ 2h_{23}\ \cdots\ 2h_{2l}\ \cdots\ h_{ll}\bigr]^{\mathsf T} \in \mathbb{R}^{l(l+1)/2},$$
with $h_{ij}$ being the elements of matrix H and $l = n + n\bar S + m\bar T + m$. The regressor
$\bar z_k \in \mathbb{R}^{l(l+1)/2}$ is defined as the following quadratic basis set,
$$\bar z = \bigl[z_1^2\ \ z_1z_2\ \cdots\ z_1z_l\ \ z_2^2\ \ z_2z_3\ \cdots\ z_2z_l\ \cdots\ z_l^2\bigr]^{\mathsf T}.$$
With this parameterization, the Bellman equation (5.18) becomes
$$\bar H^{\mathsf T}\bar z_k = \bar X_k^{\mathsf T}\bar Q\bar X_k + u_k^{\mathsf T}Ru_k + \bar H^{\mathsf T}\bar z_{k+1}. \tag{5.19}$$
Notice that (5.19) is a scalar equation with l(l + 1)/2 unknowns. We can solve this
equation in the least-squares sense by collecting at least L ≥ l(l + 1)/2 datasets of
X̄_k and u_k. The least-squares solution of (5.19) is given by
$$\bar H = \bigl(\Phi^{\mathsf T}\Phi\bigr)^{-1}\Phi^{\mathsf T}\Upsilon, \tag{5.20}$$
where $\Phi$ and $\Upsilon$ denote the data matrices obtained by stacking, over the collected datasets, the regressors and the target values of (5.19), respectively.
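A compact sketch of this least-squares policy-evaluation step is given below. The quadratic basis follows the ordering of z̄ above, and the function names, data lists, and use of a batch numpy solve are illustrative assumptions rather than the book's implementation.

```python
import numpy as np

def quad_basis(z):
    """Quadratic basis z_bar = [z1^2, z1 z2, ..., z1 zl, z2^2, ..., zl^2]."""
    return np.outer(z, z)[np.triu_indices(len(z))]

def evaluate_policy_lsq(X_list, U_list, K_bar, Q_bar, R):
    """Least-squares solve of H_bar^T z_bar_k = r_k + H_bar^T z_bar_{k+1} (sketch).

    X_list, U_list: lists of augmented states and applied inputs; K_bar is the
    policy being evaluated. Returns the parameter vector H_bar of (5.19).
    """
    Phi, Ups = [], []
    for k in range(len(X_list) - 1):
        zk = np.concatenate([X_list[k], U_list[k]])
        zk1 = np.concatenate([X_list[k + 1], K_bar @ X_list[k + 1]])
        Phi.append(quad_basis(zk) - quad_basis(zk1))          # regressor difference
        Ups.append(X_list[k] @ Q_bar @ X_list[k] + U_list[k] @ R @ U_list[k])
    H_bar, *_ = np.linalg.lstsq(np.array(Phi), np.array(Ups), rcond=None)
    return H_bar
```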
Algorithm 5.1 State feedback Q-learning policy iteration algorithm for time delay systems
input: input-state data
output: H∗
1: initialize. Select an admissible policy K̄^0 such that Ā + B̄K̄^0 is Schur stable. Set j ← 0.
2: collect data. Apply the initial policy u^0 to collect L datasets of (X̄_k, u_k).
3: repeat
4: policy evaluation. Solve the following Bellman equation for H^j,
$$\bigl(\bar H^{j}\bigr)^{\mathsf T}\bar z_k = \bar X_k^{\mathsf T}\bar Q\bar X_k + u_k^{\mathsf T}Ru_k + \bigl(\bar H^{j}\bigr)^{\mathsf T}\bar z_{k+1}. \tag{5.22}$$
5: policy improvement. Update the policy as $\bar K^{j+1} = -\bigl(H_{uu}^{j}\bigr)^{-1}H_{uX}^{j}$.
6: j ← j + 1
7: until ‖H^j − H^{j−1}‖ < ε for some small ε > 0.
Algorithm 5.2 State feedback Q-learning value iteration algorithm for time delay systems
input: input-state data
output: H∗
1: initialize. Select an arbitrary policy u_k^0 and H^0 ≥ 0. Set j ← 0.
2: collect data. Apply the initial policy u^0 to collect L datasets of (X̄_k, u_k).
3: repeat
4: value update. Solve the following Bellman equation for H^{j+1},
$$\bigl(\bar H^{j+1}\bigr)^{\mathsf T}\bar z_k = \bar X_k^{\mathsf T}\bar Q\bar X_k + u_k^{\mathsf T}Ru_k + \bigl(\bar H^{j}\bigr)^{\mathsf T}\bar z_{k+1}. \tag{5.24}$$
5: policy update. Update the policy as $\bar K^{j+1} = -\bigl(H_{uu}^{j+1}\bigr)^{-1}H_{uX}^{j+1}$.
6: j ← j + 1
7: until ‖H^j − H^{j−1}‖ < ε for some small ε > 0.
The Bellman equations (5.22) and (5.24) can be solved using the least-squares technique
as shown in Chaps. 1 and 2. In the next step, we perform a minimization of the Q-function with respect
to u_k, which gives us an improved policy. These iterations are carried out until
we see no further updates in the estimate of matrix H within a sufficiently small
range specified by the positive constant ε. We next establish the convergence of
Algorithms 5.1 and 5.2.
Theorem 5.3 Consider the time delay system (5.1) and its extended augmented
version (5.7). Under the stabilizability and detectability conditions of $(\bar A, \bar B)$ and
$\bigl(\bar A, \sqrt{\bar Q}\bigr)$, respectively, the state feedback Q-learning Algorithms 5.1 and 5.2 each
generates a sequence of controls $u_k^{j}$, j = 1, 2, 3, ..., that converges to the optimal
feedback controller given in (5.10) as j → ∞ if the rank condition (5.21) is
satisfied.
Proof Algorithms 5.1 and 5.2 are the standard state feedback Q-learning Algorithms
1.6 and 1.7, respectively, applied to the extended augmented system (5.7).
Under the controllability condition of $(\bar A, \bar B)$ and the observability condition
of $\bigl(\bar A, \sqrt{\bar Q}\bigr)$, it follows from the convergence of Algorithms 1.6 and 1.7 that
Algorithms 5.1 and 5.2 converge to the optimal solution as j → ∞ under the rank
condition (5.21). This completes the proof.
Remark 5.4 Compared with the previous works [137, 139], the proposed scheme
relaxes the assumption of the existence of a bicausal change of coordinates.
Furthermore, unlike in [137, 139], the information of the state and input delays
(both the numbers and lengths of the delays) is not needed in our proposed scheme.
5.6 Output Feedback Q-learning Control of Time Delay Systems

The control design technique presented in the previous section was based on state
feedback. That is, access to the full state is needed in the learning process and in
the implementation of the resulting control law. However, in many applications it is
often the case that the measurement of the full state is not available but only a subset
of the state is measurable via system output. Output feedback techniques enable the
design of control algorithms without involving the information of the full state. This
section will present a Q-learning based control algorithm to stabilize the system
using the measurements of the system output instead of the full state. We recall
from [56] the following lemma, which allows the reconstruction of the system state
by means of the delayed measurements of the system input and output.
Lemma 5.1 Consider the extended augmented system (5.7). Under the observabil-
ity assumptions of the pair (A, C), the system state can be represented in terms of
the measured input and output sequence as
with
$$V_N = \bigl[\bigl(CA^{N-1}\bigr)^{\mathsf T}\ \cdots\ (CA)^{\mathsf T}\ \ C^{\mathsf T}\bigr]^{\mathsf T}, \qquad U_N = \bigl[B\ \ AB\ \cdots\ A^{N-1}B\bigr],$$
$$T_N = \begin{bmatrix}
0 & CB & CAB & \cdots & CA^{N-2}B\\
0 & 0 & CB & \cdots & CA^{N-3}B\\
\vdots & \vdots & \ddots & \ddots & \vdots\\
0 & 0 & 0 & \cdots & CB\\
0 & 0 & 0 & \cdots & 0
\end{bmatrix}.$$
Similarly, for the extended augmented system (5.7), we have the following parameterization of the extended augmented state,
$$\bar X_k = \bigl[\bar M_u\ \ \bar M_y\bigr]\begin{bmatrix}u_{k-1,k-N}\\ \bar Y_{k-1,k-N}\end{bmatrix}, \tag{5.28}$$
where $\bar M_u$ and $\bar M_y$ are formed using the extended augmented system matrices
$\bigl(\bar A, \bar B, \bar C\bigr)$. It can be easily verified that substitution of (5.28) in (5.16) results in
$$Q_{\bar K} = \begin{bmatrix}u_{k-1,k-N}\\ \bar Y_{k-1,k-N}\\ u_k\end{bmatrix}^{\mathsf T}
\begin{bmatrix}H_{\bar u\bar u} & H_{\bar u\bar y} & H_{\bar u u}\\ H_{\bar y\bar u} & H_{\bar y\bar y} & H_{\bar y u}\\ H_{u\bar u} & H_{u\bar y} & H_{uu}\end{bmatrix}
\begin{bmatrix}u_{k-1,k-N}\\ \bar Y_{k-1,k-N}\\ u_k\end{bmatrix}, \tag{5.29}$$
where
$$\zeta_k = \bigl[u_{k-1,k-N}^{\mathsf T}\ \ \bar Y_{k-1,k-N}^{\mathsf T}\ \ u_k^{\mathsf T}\bigr]^{\mathsf T},$$
$$H_{\bar u\bar u} = \bar M_u^{\mathsf T}\bigl(\bar Q + \bar A^{\mathsf T}\bar P\bar A\bigr)\bar M_u \in \mathbb{R}^{mN\times mN},$$
$$H_{\bar u\bar y} = \bar M_u^{\mathsf T}\bigl(\bar Q + \bar A^{\mathsf T}\bar P\bar A\bigr)\bar M_y \in \mathbb{R}^{mN\times(p+m)N},$$
$$H_{\bar u u} = \bar M_u^{\mathsf T}\bar A^{\mathsf T}\bar P\bar B \in \mathbb{R}^{mN\times m}, \tag{5.30}$$
$$H_{\bar y\bar y} = \bar M_y^{\mathsf T}\bigl(\bar Q + \bar A^{\mathsf T}\bar P\bar A\bigr)\bar M_y \in \mathbb{R}^{(p+m)N\times(p+m)N},$$
$$H_{\bar y u} = \bar M_y^{\mathsf T}\bar A^{\mathsf T}\bar P\bar B \in \mathbb{R}^{(p+m)N\times m},$$
$$H_{uu} = R + \bar B^{\mathsf T}\bar P\bar B \in \mathbb{R}^{m\times m}.$$
Setting
$$\frac{\partial}{\partial u_k}Q^{*} = 0$$
and solving for u_k results in our output feedback LQR control law,
$$u_k^{*} = -\bigl(H_{uu}^{*}\bigr)^{-1}\bigl(H_{u\bar u}^{*}u_{k-1,k-N} + H_{u\bar y}^{*}\bar Y_{k-1,k-N}\bigr) = \bar K^{*}\bigl[u_{k-1,k-N}^{\mathsf T}\ \ \bar Y_{k-1,k-N}^{\mathsf T}\bigr]^{\mathsf T}. \tag{5.31}$$
Now that we have an output feedback form of the Q-function for the extended
augmented time delay system, the next step is to learn the optimal Q-function Q∗
and the corresponding output feedback optimal controller (5.31).
Consider the state feedback Q-learning equation (5.18). We employ the equivalent output feedback Q-function to write this equation as
$$\zeta_k^{\mathsf T}H\zeta_k = \bar Y_k^{\mathsf T}\bar Q_y\bar Y_k + u_k^{\mathsf T}Ru_k + \zeta_{k+1}^{\mathsf T}H\zeta_{k+1}. \tag{5.32}$$
It should be noted that, in the output feedback learning, we apply the user-defined
weighting matrix Q̄y to the output. The term X̄kT Q̄X̄k can be replaced with ȲkT Q̄y Ȳk
T
without requiring the knowledge of C̄ when Q̄ = C̄ Q̄y C̄ and Ȳk = C̄X̄k , where Ȳk
is measurable. Here, uk+1 is computed as
$$u_{k+1} = -\bigl(H_{uu}\bigr)^{-1}\bigl(H_{u\bar u}u_{k,k-N+1} + H_{u\bar y}\bar Y_{k,k-N+1}\bigr) = \bar K\bigl[u_{k,k-N+1}^{\mathsf T}\ \ \bar Y_{k,k-N+1}^{\mathsf T}\bigr]^{\mathsf T}. \tag{5.33}$$
Equation (5.32) is the Bellman equation for the Q-function in the output feedback
form, from which we will develop a reinforcement learning algorithm. We will
parameterize the Q-function in (5.29) so that we can separate the unknown matrix
H.
Consider the output feedback Q-function in (5.29), which can be linearly parameterized as
\[
Q_{\bar K} = \bar H^T\bar\zeta_k, \tag{5.34}
\]
where
\[
\bar H = \mathrm{vec}(H) = \begin{bmatrix} H_{11} & 2H_{12} & \cdots & 2H_{1l} & H_{22} & 2H_{23} & \cdots & 2H_{2l} & \cdots & H_{ll} \end{bmatrix}^T \in \mathbb{R}^{l(l+1)/2},\quad l = mN + (p+m)N + m,
\]
is the vector that contains the upper triangular portion of the matrix H. Since H is symmetric, the off-diagonal entries are included as 2H_{ij}. The regression vector ζ̄_k ∈ R^{l(l+1)/2} is formed from the Kronecker product
\[
\bar\zeta_k = \zeta_k\otimes\zeta_k,
\]
with the redundant entries arising from symmetry removed. Substituting (5.34) into the Bellman equation (5.32) gives
\[
\bar H^T\bar\zeta_k = \bar Y_k^T\bar Q_y\bar Y_k + u_k^T R u_k + \bar H^T\bar\zeta_{k+1}. \tag{5.35}
\]
Notice that (5.35) is a scalar equation with l(l + 1)/2 unknowns. We can solve this
equation in the least-squares sense by collecting at least L ≥ l(l + 1)/2 datasets of
Ȳ_k and u_k. The least-squares solution of (5.19) is given by
\[
\bar H = \left(\Phi^T\Phi\right)^{-1}\Phi^T\Upsilon, \tag{5.36}
\]
where the rows of Φ stack the regression differences (ζ̄_k − ζ̄_{k+1})^T and ϒ stacks the corresponding one-step costs.
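To make this least-squares step concrete, here is a minimal sketch; the helper names are ours, and the data-matrix symbols Φ and ϒ follow the reconstruction of (5.36) above rather than the book's Chap. 2 definitions.

```python
import numpy as np

def quad_basis(z):
    """Quadratic basis of z with symmetric terms merged, matching the
    parameterization H_bar = [H11, 2H12, ..., Hll] used in (5.34)."""
    l = z.size
    return np.array([z[i] * z[j] for i in range(l) for j in range(i, l)])

def solve_bellman_ls(Z, Z_next, costs):
    """Least-squares solution of H_bar^T (zbar_k - zbar_{k+1}) = cost_k.
    np.linalg.lstsq computes (Phi^T Phi)^{-1} Phi^T Upsilon as in (5.36)."""
    Phi = np.array([quad_basis(zk) - quad_basis(zk1) for zk, zk1 in zip(Z, Z_next)])
    Upsilon = np.asarray(costs)
    H_bar, *_ = np.linalg.lstsq(Phi, Upsilon, rcond=None)
    return H_bar

def unvec(H_bar, l):
    """Rebuild the symmetric H from its upper-triangular parameterization."""
    H = np.zeros((l, l))
    idx = 0
    for i in range(l):
        for j in range(i, l):
            H[i, j] = H_bar[idx] if i == j else H_bar[idx] / 2.0
            H[j, i] = H[i, j]
            idx += 1
    return H
```

Here Z and Z_next would stack the vectors ζ_k and ζ_{k+1} collected under the behavioral policy, and costs the one-step costs Ȳ_k^T Q̄_y Ȳ_k + u_k^T R u_k.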
Algorithm 5.3 Output feedback Q-learning policy iteration algorithm for time delay
systems
input: input-output data
output: H∗
1: initialize. Select an admissible policy u0k . Set j ← 0.
2: collect data. Apply the initial policy u0k to collect L datasets of Ȳk , uk .
3: repeat
4: policy evaluation. Solve the following Bellman equation for H^j,
\[
\left(\bar H^{j}\right)^T\bar\zeta_k = \bar Y_k^T\bar Q_y\bar Y_k + u_k^T R u_k + \left(\bar H^{j}\right)^T\bar\zeta_{k+1}. \tag{5.38}
\]
5: policy improvement. Update the policy gain as
\[
\bar K^{j+1} = -\left(H_{uu}^{j}\right)^{-1}\begin{bmatrix} H_{u\bar u}^{j} & H_{u\bar y}^{j}\end{bmatrix}. \tag{5.39}
\]
6: j ← j + 1
7: until ‖H^j − H^{j−1}‖ < ε for some small ε > 0.
Remark 5.6 When N < T̄ , Ȳk−1,k−N will contain entries from uk−1,k−N that will
prevent the rank condition (5.37) from being satisfied even in the presence of an
exploration signal v_k. However, in the proof of Theorem 5.2, we see that increasing T to an arbitrarily large T̄ does not affect the observability. Furthermore, the rank
condition (5.12) corresponding to the original output yk remains unchanged. This
implies that the sum of the observability indices from the original output yk also
remains unchanged. Therefore, uk−T̄ in the augmented output vector contributes to
the observability of the extended states. This causes the sum of the observability
indices corresponding to the augmented output vector to increase to mT̄ , with each
component contributing T̄ equally. Thus, for an arbitrarily large T̄ , the observability
index becomes N = T̄ and the rank condition (5.37) can be satisfied.
In what follows, we present an iterative Q-learning algorithm to learn our output
feedback Q-function and the optimal control parameters.
Algorithms 5.3 and 5.4 present the output feedback Q-learning algorithms for
time delay systems. As is the case with PI algorithms, Algorithm 5.3 makes use of
a stabilizing initial policy. On the other hand, Algorithm 5.4 does not require such
an initial policy. As seen in Chap. 2, for both policy iteration and value iteration
based Q-learning algorithms, we require an exploration signal vk such that the rank
condition (5.37) is satisfied. The Bellman equation in Algorithms 5.3 and 5.4 can
Algorithm 5.4 Output feedback Q-learning value iteration algorithm for time delay
systems
input: input-output data
output: H∗
1: initialize. Select an arbitrary policy u0k and H0 ≥ 0. Set j ← 0.
2: collect data. Apply the initial policy u0k to collect L datasets of Ȳk , uk .
3: repeat
4: value update. Solve the following Bellman equation for H^{j+1},
\[
\left(\bar H^{j+1}\right)^T\bar\zeta_k = \bar Y_k^T\bar Q_y\bar Y_k + u_k^T R u_k + \left(\bar H^{j}\right)^T\bar\zeta_{k+1}. \tag{5.40}
\]
5: policy update. Update the policy gain as
\[
\bar K^{j+1} = -\left(H_{uu}^{j+1}\right)^{-1}\begin{bmatrix} H_{u\bar u}^{j+1} & H_{u\bar y}^{j+1}\end{bmatrix}. \tag{5.41}
\]
6: j ← j + 1
7: until ‖H^j − H^{j−1}‖ < ε for some small ε > 0.
be solved using the least-squares technique by forming the data matrices as shown
in Chap. 2. We next establish the convergence of Algorithms 5.3 and 5.4.
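Before turning to the convergence analysis, the following minimal sketch shows how such a value-iteration update can be carried out on a stored behavioural dataset; the function and variable names and the data layout (past data z_k followed by the applied input) are ours, and u_{k+1} inside ζ̄_{k+1} is recomputed with the current gain as in (5.33).

```python
import numpy as np

def vi_q_learning(Zpast, U, Zpast_next, costs, iters=100, tol=1e-3):
    """Value-iteration Q-learning sketch in the spirit of Algorithm 5.4.
    Zpast[k] stacks [u_{k-1,k-N}; Ybar_{k-1,k-N}], U[k] is the applied input
    (with exploration), Zpast_next[k] the shifted data, costs[k] the one-step cost."""
    L, d = Zpast.shape
    m = U.shape[1]
    l = d + m
    Zeta = np.hstack([Zpast, U])                       # zeta_k with applied inputs
    Phi = np.einsum('ki,kj->kij', Zeta, Zeta).reshape(L, l * l)
    H = np.zeros((l, l))                               # H^0 >= 0
    K = np.zeros((m, d))                               # arbitrary (zero) initial policy
    for _ in range(iters):
        U_next = Zpast_next @ K.T                      # u_{k+1} from the current gain
        Zeta_next = np.hstack([Zpast_next, U_next])
        targets = costs + np.einsum('ki,ij,kj->k', Zeta_next, H, Zeta_next)
        h, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
        H_new = 0.5 * (h.reshape(l, l) + h.reshape(l, l).T)   # enforce symmetry
        K = -np.linalg.solve(H_new[d:, d:], H_new[d:, :d])    # policy update
        if np.max(np.abs(H_new - H)) < tol:
            return H_new, K
        H = H_new
    return H, K
```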
Theorem 5.4 Consider the time delay system (5.1) and its extended augmented version (5.7). Let S = S̄. Under the stabilizability condition on (Ā, B̄), the observability conditions on (Ā, C̄) and (Ā, Q̄_y^{1/2}C̄), and the full row rank of M̄ = [M̄u M̄y], the output feedback Algorithms 5.3 and 5.4 each generate a sequence of controls u_k^j, j = 1, 2, 3, ..., that converges to the optimal feedback controller given in (5.31) as j → ∞ if the rank condition (5.37) is satisfied.
Proof Algorithms 5.3 and 5.4 are the output feedback Q-learning Algorithms 2.1 and 2.2, respectively, applied to the extended augmented system (5.6). It follows from the proofs of Algorithms 2.1 and 2.2 that, under the stated conditions, Algorithms 5.3 and 5.4 converge to the optimal output feedback solution as j → ∞. This completes the proof.
5.7 Numerical Simulation

In this section, we test the proposed scheme using numerical simulation. Consider the discrete-time system (5.1) with
\[
A_0 = \begin{bmatrix} 0.6 & 0.3 \\ 0.2 & 0.5 \end{bmatrix},\quad
A_1 = \begin{bmatrix} 0.2 & 0.5 \\ 0.4 & 0.1 \end{bmatrix},\quad
B_1 = \begin{bmatrix} 0.6 \\ 0.4 \end{bmatrix},\quad
B_2 = \begin{bmatrix} 0.1 \\ -0.1 \end{bmatrix},\quad
C = \begin{bmatrix} 1 & -0.8 \end{bmatrix}.
\]
There are two input delays and one state delay present in the system. Notice that,
although matrices A0 and A1 are both Schur stable, the system is unstable due to
delays, which can be checked by finding the roots of det P(λ), where P(λ) = A₀λ + A₁ − λ²I, or by evaluating the eigenvalues of the augmented matrix A as
defined in (5.5). It can be verified that the controllability condition of the delayed
system
\[
\rho\!\left(\begin{bmatrix} \displaystyle\sum_{i=0}^{S} A_i\lambda^{S-i} - \lambda^{S+1}I & \displaystyle\sum_{i=0}^{T} B_i\lambda^{T-i}\end{bmatrix}\right) = n,\qquad \forall\,\lambda\in\mathbb{C},
\]
holds. Hence, the extended augmented system (5.6) is controllable. The maximum
input and state delays present in the system are T = 2 and S = 1, respectively. Let
T̄ = 3 and S̄ = 1 be the upper bounds on the input and state delays, respectively.
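As a quick sanity check of the instability claim, the following sketch inspects the eigenvalues of the open-loop augmented matrix; it assumes the example system takes the form x_{k+1} = A₀x_k + A₁x_{k−1} + B₁u_{k−1} + B₂u_{k−2}, which is consistent with P(λ) and the stated delays but is our reading, not a quoted equation.

```python
import numpy as np

# Example state matrices from the text
A0 = np.array([[0.6, 0.3], [0.2, 0.5]])
A1 = np.array([[0.2, 0.5], [0.4, 0.1]])

# Open-loop augmented state [x_k; x_{k-1}]: its nonzero eigenvalues are the
# roots of det(lambda^2 I - A0 lambda - A1); input delays only add eigenvalues
# at the origin in open loop.
A_aug = np.block([[A0, A1],
                  [np.eye(2), np.zeros((2, 2))]])
print(np.abs(np.linalg.eigvals(A_aug)))                 # one magnitude exceeds 1
print(np.max(np.abs(np.linalg.eigvals(A0))) < 1,
      np.max(np.abs(np.linalg.eigvals(A1))) < 1)        # A0, A1 individually Schur
```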
We specify the user-defined performance index as Q = I and R = 1. The nominal
optimal feedback control matrix as obtained by solving the ARE with known delays
is
K̄* = [0.7991  0.8376  0.3622  0.3643  0.0007  0.6191].
We first validate the state feedback policy iteration algorithm, Algorithm 5.1. For
this algorithm, we employ the following stabilizing initial policy,
K̄⁰ = [0.4795  0.5025  0.2173  0.2186  0.0000  0.0004  0.3714].
Fig. 5.1 Algorithm 5.1: State trajectory of the closed-loop system under state feedback
Fig. 5.2 Algorithm 5.1: Convergence of the parameter estimates under state feedback
The final estimate of the control gain is K̄̂ = [0.7991  0.8376  0.3622  0.3643  −0.0000  0.0007  0.6191].
It can be seen that the estimated control parameters correspond only to the actual
delays present in the system while the one term corresponding to the extra state is
very small. In other words, the final control is equal to the one obtained using the
exact knowledge of the delays and system dynamics. Moreover, the rank condition
(5.21) is no longer needed once the convergence criterion is met.
We next validate the state feedback value iteration algorithm, Algorithm 5.2. The
algorithm is initialized with a zero feedback policy, which is clearly not stabilizing.
All other design and simulation parameters are the same as in the policy iteration algorithm.
It can be seen in Fig. 5.3 that, due to the absence of a stabilizing initial policy, the
system response is unstable during the learning phase. However, Fig. 5.4 shows that
even in such a scenario, Algorithm 5.2 still manages to converge to the optimal
solution, which eventually stabilizes the unstable system. The final estimate of the
control gain is
Fig. 5.3 Algorithm 5.2: State trajectory of the closed-loop system under state feedback
Fig. 5.4 Algorithm 5.2: Convergence of the parameter estimates under state feedback
K̄̂ = [0.7996  0.8380  0.3625  0.3645  0.0000  0.0007  0.6194].
The simulation performed so far focuses on full state feedback. We will now
validate the output feedback algorithms, Algorithms 5.3 and 5.4. The extended
augmented system is observable since the observability conditions in Theorem 5.2
hold. We specify the user-defined performance index as Q̄y = I and R = 1. The
bound on the state delay is the same but the bound on the input delay has been
increased to T̄ = 4 to make the observability index N = T̄ in order to satisfy the
output feedback rank condition (5.37). The nominal output feedback optimal control
parameters as obtained by solving the ARE with known delays are
K̄* = [0.5100  0.6229  −0.2150  −0.6116  3.5739  −0.3037
to satisfy the rank condition (5.37) to solve (5.35). These data samples are collected
by applying some sinusoidal signals of different frequencies and magnitudes in
the control. For the output feedback policy iteration algorithm, Algorithm 5.3, we
choose the following stabilizing initial policy,
K̄⁰ = [0.3366  0.4111  −0.1419  −0.4036  2.3588  −0.2005
The state response is shown in Fig. 5.5. It takes around 5 iterations to converge to the
optimal controller as shown in Fig. 5.6. It can be seen in Fig. 5.5 that the resulting
controller is able to achieve the closed-loop stability. The final estimate of the output
feedback gain matrix is
For the output feedback value iteration algorithm, Algorithm 5.4, we initialize with
a zero feedback policy. All other parameters are the same as in the simulation of the
policy iteration algorithm, Algorithm 5.3. The final estimate of the output feedback
gain matrix is
Figure 5.7 shows the state response, where it can be seen that due to the absence of a
stabilizing initial policy, the closed-loop system remains unstable during the initial
learning phase. However, after the initial data collection phase, the system trajectory
converges to zero. The convergence to the optimal output feedback parameters is
shown in Fig. 5.8. As can be seen in the state feedback and output feedback results,
while the transient performance of the state feedback algorithm is superior due
to fewer unknown parameters to be learnt, the output feedback algorithm has the
advantage that it does not require the measurement of the full state.
5.8 Summary
This chapter is built upon the idea of exploiting the finite dimensionality property of discrete-time delay systems to bring them into a delay-free form. Compared with predictor feedback, which requires a feedback controller to bring the system into a delay-free form, the technique presented here brings the open-loop system into a delay-free form by assigning delayed variables as additional states, which are then augmented
Fig. 5.5 Algorithm 5.3: State trajectory of the closed-loop system under output feedback
Fig. 5.6 Algorithm 5.3: Convergence of the parameter estimates under output feedback
Fig. 5.7 Algorithm 5.4: State trajectory of the closed-loop system under output feedback
with the system state. The standard state augmentation technique, however, still
requires the knowledge of the delays. To overcome this difficulty, we presented an
extended state augmentation approach that only requires the knowledge of upper
bounds of the delays. Both state and input delays were considered. We presented both state feedback and output feedback Q-learning algorithms that learn the optimal controller of the extended augmented system without requiring the knowledge of the system dynamics or of the delays.
Fig. 5.8 Algorithm 5.4: Convergence of the parameter estimates under output feedback
There have been some recent developments in solving the optimal control
problem for time delay systems in a model-free manner. Instead of applying
the predictor feedback, the idea of bicausal change of coordinates has been
recently employed in developing reinforcement learning and approximate dynamic
programming techniques for model-free optimal control of time delay systems.
However, the key challenge in this approach is the issue of the existence of a bicausal
transformation, which becomes difficult to verify when the system dynamics is not
known. Moreover, even though the schemes mentioned above are model-free, a
precise knowledge of the delay is still required.
In this chapter, the presentation of the extended state augmentation technique
and the associated state feedback algorithms follows from our preliminary results
in [104]. Section 5.6, along with Theorem 5.2 in Sect. 5.4, extends these results to
the output feedback case and provides a detailed convergence analysis of the output
feedback Q-learning algorithms for time delay problems.
Chapter 6
Model-Free Optimal Tracking Control
and Multi-Agent Synchronization
6.1 Introduction
Tracking control has unarguably been the most traditional control design problem
beyond stabilization owing to its practical relevance. As its name suggests, the goal
in a tracking problem is to design a controller that enables the system to closely
follow a specified reference trajectory. In general, such a reference signal could take
various forms and be very dynamic in nature, making the problem more challenging
in comparison with a stabilization problem that only involves regulation of the
system state to an equilibrium point. There are several domains such as robotics,
aerospace, automation, manufacturing, power systems, and automobiles, where a
tracking controller is required to match the output and/or the state of the system
with that of a reference generator, hereafter referred to as the exosystem. Very often,
tracking a reference trajectory is only a basic requirement in a control application.
Instead, it is desirable to achieve tracking in a prescribed optimal manner, which
involves solving an optimization problem.
In the setting of linear systems, the structure of the linear quadratic tracking
(LQT) control law involves an optimal feedback term (similar to the stabilization
problem) and an optimal feedforward term that corresponds to the reference trajec-
tory. While the design of the optimal feedback term follows the same procedures that
we have seen in the previous chapters, the design procedure for the feedforward term
is more involved. It is due to this difficulty that the application of LQT design has
received less attention in the literature. The traditional approach to computing this
feedforward term requires solving a noncausal equation and involves a backward in
time procedure to precompute and store the reference trajectory [55]. An alternative
method to calculate the feedforward term involves the idea of dynamic inversion.
This technique, however, requires the input coupling matrix to be invertible. A
promising paradigm for designing tracking controllers is the output regulation
framework that involves solving the regulator equations to compute the feedforward
term. All these approaches to finding the feedforward term are model-based as they
require the complete knowledge of the system dynamics.
One notable extension of the tracking problem that has recently gained significant
popularity is the multi-agent synchronization problem. As the name suggests, the
problem involves a group of agents that make decisions to achieve synchronization
among themselves. The individual agents may have identical dynamics (homo-
geneous agents) or they can assume different dynamics (heterogeneous agents).
The primary motivation of this problem comes from the cooperative behavior that
is found in many species. Such cooperations are essential in achieving goals or
performing tasks that would otherwise be difficult to carry out by individual agents
operating solo. Their distributed nature also gives multi-agent systems a remarkable edge over centralized approaches in terms of scalability with the problem size and robustness to failures. Similar to the optimal tracking problem, the multi-agent
synchronization problem can also take into account the optimality aspect of each
agent based on the optimal control theory. Output regulation framework has also
been very instrumental in solving the multi-agent synchronization problems. The
key challenge in the multi-agent control design pertains to the limited information
that is available to each agent. Furthermore, as is the case with a tracking problem,
difficulties exist in computing the optimal feedback and feedforward control terms
when the dynamics of the agents are unknown.
Model-free reinforcement learning (RL) holds the promise of solving individual
agent and multi-agent problems without involving the system dynamics. However,
there are additional challenges RL faces in finding the solution of the optimal
tracking and synchronization problems. The primary difficulty lies in selecting an
appropriate cost function, which is challenging to formulate due to the presence of
the reference signal. In particular, in the infinite horizon setting, the presence of a
general non-vanishing reference signal can make the long-term cost ill-defined. The
traditional approach to finding the feedforward term involves a noncausal difference
equation, which is not readily applicable in RL as it involves a backward in time
method in contrast to the forward in time approach in RL.
In this chapter, we will first present a two degrees of freedom learning approach
to learning the optimal feedback and feedforward control terms. The optimal
feedback term is obtained using the LQR Q-learning method that we developed
in Chap. 2, whereas an adaptive algorithm is presented that learns the feedforward
term. The presented scheme has the advantage that it does not require a discounting
factor. As a result, convergence to the optimal parameters is achieved and optimal
asymptotic output tracking is ensured. We will restrict this discussion to the
single input single output case, which is commonly addressed in this context. The
treatment of the general multiple input multiple output case and its continuous-time
extension are more involved. In the second half of this chapter, we will extend
the single-agent scheme to solve the multi-agent synchronization problem. The
focus of the multi-agent design will be to achieve synchronization using only the
neighborhood information. A leader-follower style formulation will be presented
and distributed adaptive laws will be designed that require only the system input-
output data and the information of the neighboring agents.
6.2 Literature Review
The optimal tracking problem has been covered quite comprehensively in the RL
control literature. Most recent works involve the idea of state augmentation in which
the dynamics of the trajectory generator is merged into that of the system. The
problem is then solved as a linear quadratic regulation problem using the standard
RL techniques such as Q-learning as presented in the earlier chapters. For instance,
the authors of [49] employed Q-learning to solve the optimal tracking problem by
simultaneously learning the feedback and feedforward control gains. One advantage
of the state augmentation approach is that it can also be readily applied to solve continuous-time tracking problems, as done in [77]. However, this approach does not guarantee asymptotic tracking error convergence owing to the need to employ a discounting factor in the cost function.
The main utility of the discounting factor in the case of optimal tracking problem
is to make the cost function well-defined. It is worth pointing out that this require-
ment of a discounting factor is different from the one in output feedback control, as
seen in Chap. 2. In fact, the above mentioned works involve the measurement of the
internal state while still requiring a discounted cost function. Motivated by the work
of [56], the authors of [50] solved the output feedback optimal tracking problem,
but asymptotic error convergence could not be ensured, again due to the use of
a discounting factor. Here the discounting factor serves the additional purpose of
diminishing the effect of the exploration noise bias as explained in [56]. Continuous-
time extensions based on the state augmentation approach combined with a state
parameterization in terms of delayed output measurements have been carried out
in [80]. As discussed in Chap. 2, such a state parameterization also leads to bias
as it assumes that the control is strictly feedback, thereby, ignoring the effect of
exploration signal. Similarly to the discrete-time counterpart, the discounting factor
serves two utilities, to circumvent the exploration bias issue and to make the cost
function well-defined, or, equivalently, to make the augmented system stabilizable.
Applications of RL in solving a variety of multi-agent control problems have
been proposed in the literature where the requirement of the knowledge of system
dynamics [15, 105] has been relaxed. The authors of [119] developed a model-
free reinforcement learning scheme to solve the optimal consensus problem of
multi-agent systems in the continuous-time setting. In this work the leader and the
follower agents were assumed to have identical dynamics. Extension of this work to
the discrete-time setting was presented in [2]. In contrast, solution of consensus
problems for heterogeneous multi-agent systems involves a different approach,
which is based on the output regulation theory. This line of work, however,
encounters certain obstacles because of the need to solve regulator equations in
order to find the feedforward term. The solution of the regulator equations entails
the knowledge of the dynamics of both the leader and follower agents.
In [82], this difficulty was addressed in solving the continuous-time hetero-
geneous consensus problem based on model-free RL approach. Discrete-time
extensions of this work involving Q-learning were proposed in [47]. It is worth
noting that [47, 82] involve the simultaneous estimation of the feedforward and
feedback term, as with the solution of the tracking problem discussed previously.
An augmented algebraic Riccati equation is obtained by augmenting the leader
dynamics with the dynamics of the follower agents. Similarly to the tracking
problem, the multi-agent problem also faces the issue that the resulting augmented
system is not stabilizable owing to the autonomous and neutrally stable nature of
the leader dynamics. Furthermore, the infinite horizon cost function in the multi-
agent synchronization problem is also ill-defined as the control does not converge
to zero due to the non-decaying state of the leader. In order to make the local cost
function of each agent well-defined, the authors of [47, 82] resorted to incorporating
a discounting factor in the cost function.
6.3 Q-learning Based Linear Quadratic Tracking

Consider a discrete-time linear system given by the following state space representation,
\[
x_{k+1} = Ax_k + Bu_k,\qquad y_k = Cx_k, \tag{6.1}
\]
where x_k ∈ R^n, u_k ∈ R, and y_k ∈ R are, respectively, the state, input, and output of the system, and the reference trajectory is generated by the exosystem
\[
w_{k+1} = Sw_k,\qquad y_{rk} = Fw_k, \tag{6.2}
\]
where w_k ∈ R^q is the state of the exosystem and y_{rk} ∈ R is the reference
output. In [50], the linear quadratic tracking (LQT) problem has been addressed
by augmenting the system dynamics (6.1) with the exosystem dynamics (6.2) and
using the VFA method. However, the augmented system is not stabilizable as the
exosystem is autonomous and neutrally stable and the resulting cost function of the
augmented system becomes ill-defined due to the non-decaying reference trajectory
wk . To address these difficulties, a discounting factor 0 < γ < 1 is introduced in
the cost function as
\[
V_K(X_k) = \sum_{i=k}^{\infty}\gamma^{\,i-k}\, r\!\left(X_i, K_\gamma X_i\right), \tag{6.3}
\]
where X_k = [x_k^T\ \ w_k^T]^T is the augmented state and r(X_k, K_γX_k) is a one-step quadratic cost function. In general, the boundedness of V_K does not ensure x_k → 0
as k → ∞ for all values of γ . Only for a particular range of γ may asymptotic
tracking be ensured but the lower bound on γ is not known. Furthermore, the
discounted solution is only suboptimal.
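For instance (a scalar illustration of this point, ours rather than the text's), if the closed-loop dynamics under some gain K_γ were x_{k+1} = 2x_k with one-step cost r = x_k², then
\[
V_K(x_k) = \sum_{i=k}^{\infty}\gamma^{\,i-k}x_i^2 = x_k^2\sum_{j=0}^{\infty}(4\gamma)^j = \frac{x_k^2}{1-4\gamma},
\]
which is finite for every γ < 1/4 even though x_k diverges; a small discounting factor can thus mask an unstable closed loop.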
To solve the LQT problem, we make the following standing assumptions.
Assumption 6.1 (A, B) is stabilizable.
Assumption 6.3 The relative degree n* and the sign of the high frequency gain k_p are known, where k_p = CA^{n*−1}B.

Assumption 6.4
\[
\rho\begin{bmatrix} A - \lambda I & B \\ C & 0 \end{bmatrix} = n + 1,\qquad \forall\,\lambda\in\sigma(S),
\]
where σ(S) denotes the spectrum of S.
We formulate the optimal tracking problem as an output regulation problem by defining the error dynamics as
\[
\tilde x_{k+1} = A\tilde x_k + B\tilde u_k,\qquad e_k = C\tilde x_k, \tag{6.4}
\]
where x̃_k = x_k − Xw_k, ũ_k = u_k − Uw_k, e_k = y_k − y_{rk}, and (X, U) solve the regulator equations
\[
XS = AX + BU,\qquad 0 = CX - F. \tag{6.5}
\]
The value function that represents the infinite horizon cost of the feedback law
ũk = K x̃k is defined as
\[
V(\tilde x_k) = \sum_{i=k}^{\infty} r(e_i, \tilde u_i), \tag{6.6}
\]
where
\[
r(e_i, \tilde u_i) = Q_y e_i^2 + R\,\tilde u_i^2,
\]
with Q_y ≥ 0 and R > 0 being the user-defined weighting scalars. Under the stated assumptions, there exists a unique optimal feedback control ũ_k^* = K^*x̃_k, where K^* is obtained from the stabilizing solution P^* of the ARE
\[
P^* = A^T P^* A - A^T P^* B\left(R + B^T P^* B\right)^{-1} B^T P^* A + C^T Q_y C, \tag{6.8}
\]
and is given by
\[
K^* = -\left(R + B^T P^* B\right)^{-1} B^T P^* A. \tag{6.9}
\]
The optimal tracking control then takes the form u_k^* = K^*x_k + K_r^*w_k, where K_r^* is the optimal feedforward gain, which is related to the optimal feedback gain as
\[
K_r^* = -K^*X + U.
\]
It is interesting to note that the original system (6.1) and the error dynamics
system (6.4) are algebraically equivalent with the same ARE and, therefore, have the
same optimal feedback control term K ∗ . This suggests that K ∗ can be first estimated
by treating the problem as an LQR stabilization problem as done in Chap. 2. Then,
the feedforward term Kr∗ is estimated such that the tracking error ek → 0 as k → ∞.
In this way, we do not need to solve the Sylvester equations (6.5), whose solution
would require the knowledge of system dynamics.
The optimal output feedback tracking controller is given as the sum of the
optimal output feedback and feedforward terms,
\[
u_k^* = -\left(H_{uu}^*\right)^{-1}\left(H_{u\sigma}^*\sigma_k + H_{u\omega}^*\omega_k\right) + K_r^*\, w_k, \tag{6.10}
\]
Since this equation holds for all state trajectories, we can reuse the stored dataset
to learn new policies in every iteration. In this method, a fixed initial policy called
the behavioral policy u0k is first used to generate the required system data. Then,
repeated applications of policy evaluation and policy improvement on the same
dataset enable us to learn new policies. That is, experience replay improves the utilization of the data employed in Q-learning. In a conventional on-policy learning
scheme, a new dataset corresponding to the newly learned policy is collected in
each iteration. In contrast, the presented experience replay method makes use of
the behavioral policy dataset only. This technique has been shown to be more
data efficient [4]. Furthermore, the use of historic data during learning relaxes the
exploration requirement. Unlike the usual persistence of excitation (PE) condition,
which is hard to maintain in every iteration due to the converging trend of the system
trajectories [13], we only need a rank condition on the data matrices for a single
behavioral policy dataset.
We now present the experience replay Q-learning algorithms to learn the optimal
feedback control parameters. These algorithms are essentially Algorithms 2.1
Algorithm 6.1 Output feedback Q-learning policy iteration algorithm for computing
the feedback gain using experience replay
input: input-output data
output: H ∗
1: initialize. Select an admissible policy
\[
K_0 = -\left(H_{uu}^{0}\right)^{-1}\begin{bmatrix} H_{u\sigma}^{0} & H_{u\omega}^{0} \end{bmatrix}.
\]
Set j ← 0.
2: collect online data. Apply the behavioral policy
\[
u_k = K_0\begin{bmatrix} \sigma_k^T & \omega_k^T\end{bmatrix}^T + \nu_k,
\]
where ν_k is the exploration signal. Collect L datasets of (σ_k, ω_k, u_k) for k ∈ [0, L − 1], with L ≥ (mn + pn + m)(mn + pn + m + 1)/2.
3: repeat
4: policy evaluation. Solve the following Bellman equation for H̄^j,
\[
\left(\bar H^{j}\right)^T\left(\bar z_k - \bar z_{k+1}\right) = y_k^T Q_y y_k + u_k^T R u_k.
\]
5: policy improvement. Update the policy gain as
\[
K_{j+1} = -\left(H_{uu}^{j}\right)^{-1}\begin{bmatrix} H_{u\sigma}^{j} & H_{u\omega}^{j}\end{bmatrix}.
\]
6: j ← j + 1
7: until ‖H̄^j − H̄^{j−1}‖ < ε for some small ε > 0.
and 2.2 with the experience replay mechanism. As a result, the convergence
properties of these algorithms remain unchanged.
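A minimal sketch of this reuse of a single behavioural dataset is given below; the function and variable names and the data layout (z̄_k split into the measured part [σ_k; ω_k] and the applied input u_k) are ours, and only the off-policy Bellman identity described above is used.

```python
import numpy as np

def pi_experience_replay(Zpast, U, Zpast_next, Y, Qy, R, K0, iters=30, tol=1e-3):
    """Policy iteration with experience replay: one behavioural dataset is
    collected once and reused in every iteration. Zpast[k] = [sigma_k; omega_k],
    U[k] is the applied input (with exploration), Y[k] the measured output,
    Qy (p x p) and R (m x m) the weights, K0 an admissible initial gain."""
    L, d = Zpast.shape
    m = U.shape[1]
    l = d + m
    costs = np.einsum('ki,ki->k', Y @ Qy, Y) + np.einsum('ki,ki->k', U @ R, U)
    Zeta = np.hstack([Zpast, U])                    # zeta_k with the applied inputs
    H, K = np.zeros((l, l)), K0
    for _ in range(iters):
        U_next = Zpast_next @ K.T                   # next action under the evaluated policy
        Zeta_next = np.hstack([Zpast_next, U_next])
        Phi = (np.einsum('ki,kj->kij', Zeta, Zeta) -
               np.einsum('ki,kj->kij', Zeta_next, Zeta_next)).reshape(L, l * l)
        h, *_ = np.linalg.lstsq(Phi, costs, rcond=None)          # policy evaluation
        H_new = 0.5 * (h.reshape(l, l) + h.reshape(l, l).T)
        K_new = -np.linalg.solve(H_new[d:, d:], H_new[d:, :d])   # policy improvement
        if np.max(np.abs(H_new - H)) < tol:
            return H_new, K_new
        H, K = H_new, K_new
    return H, K
```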
6.5 Adaptive Tracking Law

We now proceed to the design of the feedforward control gain. The closed-loop tracking error is given by
\[
e_k = y_k - y_{rk} = C\left(Ax_{k-1} + Bu_{k-1}\right) - Fw_k. \tag{6.11}
\]
Assume that the optimal feedback parameters have already been learnt following the
discussion in the previous section. We now design an estimator of the feedforward
gain Kr∗ based on the tracking error. Using the estimated feedforward gain K̂r in the
Algorithm 6.2 Output feedback Q-learning value iteration algorithm for computing
the feedback gain using experience replay
input: input-output data
output: H ∗
1: initialize. Select an arbitrary policy u0k and H 0 ≥ 0. Set j ← 0.
2: collect online data. Apply the behavioral policy
uk = u0k + νk ,
where νk is the exploration signal. Collect L datasets of (σk , ωk , uk ) for k ∈ [0, L − 1], with
L ≥ (mn + pn + m)(mn + pn + m + 1)/2.
3: repeat
4: value update. Solve the following Bellman equation for H̄^{j+1},
\[
\left(\bar H^{j+1}\right)^T\bar z_k = y_k^T Q_y y_k + u_k^T R u_k + \left(\bar H^{j}\right)^T\bar z_{k+1}.
\]
5: policy update. Update the policy gain as
\[
K_{j+1} = -\left(H_{uu}^{j+1}\right)^{-1}\begin{bmatrix} H_{u\sigma}^{j+1} & H_{u\omega}^{j+1}\end{bmatrix}.
\]
6: j ← j + 1
7: until ‖H̄^j − H̄^{j−1}‖ < ε for some small ε > 0.
and hence, the tracking error is directly related to the estimation error of the feedforward gain as
\[
e_k = CB\left(\hat K_r w_{k-1} - K_r^* w_{k-1}\right).
\]
The above equation is for systems with a relative degree 1. In general, for systems
with a relative degree n∗ , we have
\[
CA^iB = 0,\quad 0 \le i < n^*-1,\qquad CA^{n^*-1}B \neq 0,
\]
\[
\hat K_{r,k+1} = \hat K_{r,k} - \Gamma\,\frac{\operatorname{sign}(k_p)\,w_{k-n^*}\,e_k}{m_k^2}, \tag{6.13}
\]
where Γ is the adaptation gain satisfying
\[
0 < \Gamma = \Gamma^T < \frac{2}{|k_p|}\,I,
\]
and m_k is the normalizing factor. The resulting adaptive control law is then given by
\[
u_k = -\left(H_{uu}^*\right)^{-1}\left(H_{u\sigma}^*\sigma_k + H_{u\omega}^*\omega_k\right) + \hat K_r\, w_k. \tag{6.14}
\]
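The adaptation step is cheap to implement. The sketch below is ours: it assumes a normalizing signal of the common form m_k² = 1 + wᵀw (the text's definition of m_k is not reproduced in this excerpt), treats K̂_r as a row vector, and uses a scalar adaptation gain.

```python
import numpy as np

def adapt_feedforward(K_hat, w_delayed, e_k, gamma, sign_kp):
    """One normalized-gradient update of the feedforward gain estimate, in the
    spirit of (6.13). Assumptions (ours): K_hat and w_delayed are 1-D arrays,
    m_k^2 = 1 + w'w, and gamma is a small scalar with 0 < gamma < 2/|k_p|."""
    m_sq = 1.0 + float(w_delayed @ w_delayed)      # assumed normalizing factor
    return K_hat - gamma * sign_kp * e_k * w_delayed / m_sq

# Hypothetical usage with the scalar exosystem of Example 6.1 (S = 1, F = 1):
# K_hat = np.zeros(1)
# e_k = y_k - yr_k
# K_hat = adapt_feedforward(K_hat, w_delayed=w_hist, e_k=e_k, gamma=0.5, sign_kp=+1.0)
# u_k = feedback_term + float(K_hat @ w_k)
```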
Theorem 6.1 Under the adaptive laws (6.13) and (6.14), all signals are bounded
and the tracking error converges to zero asymptotically.
Proof We consider the following Lyapunov function,
\[
V_k \triangleq V(\tilde K_{rk}) = |k_p|\,\tilde K_{rk}\,\Gamma^{-1}\tilde K_{rk}^T,
\]
where K̃_{rk} = K̂_{rk} − K_r^* is the parameter estimation error. Evaluating the difference of V_k along the trajectories of (6.13) yields, for some α > 0,
\[
\Delta V_k = V_{k+1} - V_k \le -\alpha\,\frac{e_k^2}{m_k^2}. \tag{6.15}
\]
As a result, Vk , and hence, K̃rk and K̂rk are all bounded. Furthermore, by (6.12) and
the boundedness of the reference signal wk , ek , and mk are also bounded, that is,
e_k ∈ l_∞ and m_k ∈ l_∞. Summing (6.15) from k = 0 to ∞, we have
\[
\alpha\sum_{k=0}^{\infty}\frac{e_k^2}{m_k^2} = V_0 - V_\infty \le V_0,
\]
that is,
\[
\frac{e_k}{m_k}\in l_2,
\]
which, together with the boundedness of m_k, implies
\[
\lim_{k\to\infty} e_k = 0.
\]
Remark 6.2 The adaptation mechanism for estimating the feedforward gain uses
only the information of the relative degree n∗ and the sign of the high frequency
system gain k_p. By selecting Γ sufficiently small, we can satisfy the design condition
\[
0 < \Gamma < \frac{2}{|k_p|}\,I.
\]
Consider a multi-agent system consisting of N follower agents described by
\[
x_{i,k+1} = A_i x_{i,k} + B_i u_{i,k},\qquad y_{i,k} = C_i x_{i,k},\qquad i = 1, 2, \ldots, N, \tag{6.16}
\]
and a leader agent (exosystem) described by
\[
w_{k+1} = Sw_k,\qquad y_{rk} = Fw_k, \tag{6.17}
\]
where xi,k ∈ Rni , ui,k ∈ R, and yi,k ∈ R are, respectively, the state, input, and
output of follower agent i, and wk ∈ Rn0 and yrk ∈ R are, respectively, the state and
output of the leader agent. All agents are allowed to have different dynamics. Let
the relative degree of follower agent i be n∗i . The follower agents can be divided into
two categories depending upon their connectivity with the leader agent. Agents that
are directly connected to the leader are referred to as informed agents, while agents
that do not have direct access to the leader information are called uninformed agents.
The exchange of information among the agents is formally described by a
directed graph (digraph) G = {V, E}, comprising the vertices (nodes) V = {v_0, v_1, v_2, ..., v_N} and the edges E ⊆ V × V. A pair (v_i, v_j) ∈ E represents an edge that enables the information flow from node v_i to node v_j. In this case, v_i and v_j are referred to as the parent node and the child node, respectively. The set of (in-)neighbors of v_i is defined as N_i = {v_j : (v_j, v_i) ∈ E}, that is, N_i represents the set
of the parent nodes of node vi . A directed path from vi1 to vil is a sequence of edges
(vi1 , vi2 ), (vi2 , vi3 ), · · · , (vil−1 , vil ). Node vil is said to be reachable from node vi1
if there exists a directed path from vi1 to vil .
Assumption 6.5 The leader agent is reachable from every follower agent in the graph, and every follower agent knows which of its neighbors can reach the leader without going through the agent itself.
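The connectivity part of Assumption 6.5 is easy to check from the digraph alone. The sketch below (data structures and names are ours) verifies, by a breadth-first traversal along the information-flow edges, that every follower has a chain of parents leading back to the leader, which is how the assumption is used in the proof of Theorem 6.2.

```python
from collections import deque

def leader_reaches_all(edges, followers, leader=0):
    """Check that, starting from the leader and following parent -> child edges,
    every follower is visited. `edges` is a list of (parent, child) pairs."""
    children = {}
    for parent, child in edges:
        children.setdefault(parent, []).append(child)
    seen, queue = {leader}, deque([leader])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return all(f in seen for f in followers)

# Hypothetical line graph: leader 0 -> agent 1 -> agent 2 -> agent 3 -> agent 4
print(leader_reaches_all([(0, 1), (1, 2), (2, 3), (3, 4)], followers=[1, 2, 3, 4]))  # True
```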
The distributed optimal output feedback tracking controller is given as the sum
of the optimal output feedback and the feedforward terms,
\[
u_{i,k}^* = -\left(H_{i,uu}^*\right)^{-1}\left(H_{i,u\sigma}^*\sigma_{i,k} + H_{i,u\omega}^*\omega_{i,k}\right) + \bar K_{ri}^*\,\omega_{rk}. \tag{6.18}
\]
In Equation (6.18), we have used the parameterization w_k = M_{ry}ω_{rk}, obtained from the state parameterization result in Theorem 2.1, which results in the relationship
Here ω_{rk} depends on the leader output y_{rk}. This parameterization is needed because the agents may not have access to the internal state of the leader. The adaptation mechanism for the informed agents is derived similarly to that for the single-agent case discussed in the previous section, that is,
where Γ_i is the adaptation gain of agent i, satisfying
\[
0 < \Gamma_i = \Gamma_i^T < \frac{2}{|k_i|}\,I.
\]
We next consider the case of the uninformed follower agents, which do not have
access to the output measurement yrk of the leader and, therefore, access to ωrk . For
the uninformed agents, we suggest the following adaptation mechanism,
where
and ωj,k is obtained by applying the state parameterization result in Theorem 2.1
to the leader dynamics and using the output yj,k of any neighboring parent agent j
that is reachable to the leader under Assumption 6.5. The resulting adaptive control
law is then given by
\[
u_{i,k} = -\left(H_{i,uu}^*\right)^{-1}\left(H_{i,u\sigma}^*\sigma_{i,k} + H_{i,u\omega}^*\omega_{i,k}\right) + \hat{\bar K}_{ri,k}\,\omega_{j,k}. \tag{6.22}
\]
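Since (6.19) through (6.21) are not fully reproduced in this excerpt, the following is only a hedged sketch of the distributed adaptation: each agent applies the single-agent normalized-gradient law, with informed agents using ω_r built from the leader output and uninformed agents using the signal ω_j of a parent that can reach the leader. All names, the agent data structure, and the normalizing signal are ours.

```python
import numpy as np

def distributed_feedforward_step(K_hat, omega_ref, e_i, gamma_i, sign_k_i):
    """One adaptation step for agent i, mirroring the single-agent law.
    omega_ref is omega_r for informed agents or a parent's omega_j otherwise;
    e_i is the local tracking error; m_i^2 = 1 + omega'omega is assumed."""
    m_sq = 1.0 + float(omega_ref @ omega_ref)
    return K_hat - gamma_i * sign_k_i * e_i * omega_ref / m_sq

def synchronization_sweep(agents, get_omega):
    """Sweep over the agents: get_omega(i) supplies the reference signal that
    agent i is allowed to use under the graph in Assumption 6.5."""
    for agent in agents:
        omega = get_omega(agent["id"])
        agent["K_hat"] = distributed_feedforward_step(
            agent["K_hat"], omega, agent["e"], agent["gamma"], agent["sign_k"])
```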
Theorem 6.2 Let Assumption 6.5 hold. Under the distributed adaptive laws (6.19),
(6.20), (6.21) and (6.22), all signals are bounded and the tracking errors of all the
follower agents converge to zero asymptotically.
Proof For the case of informed agents, the proof for the adaptive control laws (6.19)
and (6.20) is the same as that of the single-agent tracking problem as shown in
Theorem 6.1. For the case of uninformed agents, we know from Assumption 6.5
that the leader is reachable from every follower agent, which implies that, for any
uninformed follower agent i, there is a parent node in Ni (which itself may be
uninformed) that synchronizes with the leader. Let vj be such a parent node with
possible successive parent nodes vj1 , vj2 , · · · , vjl and let vjl be connected to an
informed agent vinformed whose output yinformed synchronizes with that of the leader
by Theorem 6.1. Thus, ωinformed k → ωrk as k → ∞ by Theorem 6.1. As a result,
yinformed k can serve as a reference output for vjl and thus, vjl can use ωinformed k
instead of ωrk in (6.19), which by Theorem 6.1 results in yjl ,k → yrk as k → ∞.
Similarly, the outputs of nodes vjl−1 , · · · , vj2 , vj1 , vj synchronize with the leader’s
output yrk , which in turn implies that, for the child node vi , yi,k → yrk as k → ∞.
This completes the proof.
6.7 Numerical Examples

Example 6.1 (DC Motor Speed Control) Consider a DC motor system given in the state space form (6.1) with
\[
A = \begin{bmatrix} 0.3678 & 0.0564 \\ -0.0011 & 0.8187 \end{bmatrix},\quad
B = \begin{bmatrix} 0.0069 \\ 0.1813 \end{bmatrix},\quad
C = \begin{bmatrix} 1 & 0 \end{bmatrix},\quad
S = 1,\quad F = 1.
\]
The states x1 and x2 represent the motor speed and current, respectively, and
the control u is the applied motor voltage. A constant unit speed reference is
generated using the exosystem. Let Qy = 1 and R = 1 define the performance
index. We employ the above model to simulate the target system only and the
controller does not utilize any knowledge of these matrices. Based on these model
parameters, we compute the nominal optimal parameters for comparison with
the estimates resulting from the proposed algorithm. The nominal optimal output
feedback controller parameters as found by solving the ARE (6.8) are as follows,
Fig. 6.1 Tracking response of the DC motor: system output y(k) and reference y_r(k) versus time steps (k)
H*_uσ = [0.0008  0.0003],
H*_uω = [0.0544  −0.0193],
H*_uu = 1.0008,
K*_r = 10.0557.
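This model-based cross-check can be reproduced in a few lines. The sketch below (ours) uses scipy's discrete ARE solver for the feedback part and the regulator equations (6.5), together with the relation K_r* = −K*X + U, for the feedforward part; the printed values should be close to H*_uu ≈ 1.0008 and K_r* ≈ 10.06 reported above.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# DC motor model of Example 6.1 (used only to generate the nominal solution)
A = np.array([[0.3678, 0.0564], [-0.0011, 0.8187]])
B = np.array([[0.0069], [0.1813]])
C = np.array([[1.0, 0.0]])
S, F = np.array([[1.0]]), np.array([[1.0]])
Qy, R = np.array([[1.0]]), np.array([[1.0]])

# Optimal feedback gain from the ARE (6.8) and (6.9)
P = solve_discrete_are(A, B, C.T @ Qy @ C, R)
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# Regulator equations (6.5), stacked as one linear system in (vec(X), vec(U))
n, q = A.shape[0], S.shape[0]
M = np.block([[np.kron(S.T, np.eye(n)) - np.kron(np.eye(q), A), -np.kron(np.eye(q), B)],
              [np.kron(np.eye(q), C), np.zeros((q, q))]])
rhs = np.concatenate([np.zeros(n * q), F.flatten()])
sol = np.linalg.solve(M, rhs)
X = sol[:n * q].reshape(n, q, order='F')
U = sol[n * q:].reshape(1, q)

Kr = -K @ X + U                       # feedforward gain, K_r* = -K*X + U
print(R + B.T @ P @ B, Kr)            # compare with H*_uu ~ 1.0008, K_r* ~ 10.06
```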
Since the system is open-loop stable, we use the PI algorithm. The initial controller
parameters are set to zero and the rank condition is ensured by adding sinusoids of
different frequencies in the behavioral policy. The system has a relative degree n∗ =
1 with the high frequency gain kp = 1. The convergence criterion ε = 0.001 and the
adaptation gain Γ = 0.5I are chosen. The matrix A in the state parameterization is defined as
\[
A = \begin{bmatrix} 0 & 0 \\ 1 & 0 \end{bmatrix}.
\]
Figure 6.1 shows the tracking response. It takes 3 iterations for the feedback
parameters to converge to their optimal values as shown in Fig. 6.2, and L = 18
data samples are collected using the behavioral policy. After these iterations, the
gradient algorithm drove the tracking error ek to approach 0, as shown in Fig. 6.1.
The convergence of the feedforward parameters is shown in Fig. 6.3. Notice that no
discounting factor is employed in the proposed method, and the result is identical
to that obtained by solving the ARE and the Sylvester equations. Furthermore, the
exploration signal does not incur any bias in the estimates, which is an advantage of
the proposed scheme.
Example 6.2 (Discrete-time Double Integrator) The double integrator system rep-
resents a large class of practical systems, such as satellite attitude control and rigid
body motion. Consider a discrete-time double integrator system given in the state
space form (6.1) with
Fig. 6.2 Convergence of the feedback parameter estimates, Ĥ → H*, over the iterations
Fig. 6.3 Convergence of the feedforward parameter estimate, K̂_r → K_r*, over the time steps (k)
\[
A = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix},\quad
B = \begin{bmatrix} 0 \\ 1 \end{bmatrix},\quad
C = \begin{bmatrix} 1 & 0 \end{bmatrix},\quad
S = \begin{bmatrix} \cos(0.1) & \sin(0.1) \\ -\sin(0.1) & \cos(0.1) \end{bmatrix},\quad
F = \begin{bmatrix} 1 & 0 \end{bmatrix}.
\]
The nominal optimal output feedback controller parameters as found by solving the ARE (6.8) are as follows,
H*_uσ = [5.4117  7.4927],
H*_uω = [9.5737  −7.4927],
H*_uu = 4.3306,
Figure 6.4 shows the tracking response. It takes 10 iterations for the feedback
parameters to converge to their optimal values as shown in Fig. 6.5, and L = 18
data samples are collected using the behavioral policy. After these iterations, the
gradient algorithm drives the tracking error ek to approach 0, as shown in Fig. 6.4.
The convergence of the feedforward parameters is shown in Fig. 6.6. It can be
seen that the result matches the solution of the Riccati and Sylvester equations.
Furthermore, the proposed output feedback method is free from discounting factor
and exploration noise bias.
Fig. 6.4 Tracking response of the double integrator: system output y(k) and reference y_r(k) versus time steps (k)
Fig. 6.5 Convergence of the feedback parameter estimates, Ĥ → H*, over the iterations
Fig. 6.6 Convergence of the feedforward parameter estimate, K̂_r → K_r*, over the time steps (k)
Consider the multi-agent system (6.16) with N = 4 and the dynamics matrices given by
\[
A_1 = 1,\quad B_1 = 1,\quad C_1 = 1,
\]
\[
A_2 = \begin{bmatrix} 1.1 & -0.3 \\ 1 & 0 \end{bmatrix},\quad
B_2 = \begin{bmatrix} 1 \\ 0 \end{bmatrix},\quad
C_2 = \begin{bmatrix} 1 & -0.8 \end{bmatrix},
\]
\[
A_3 = \begin{bmatrix} 1.8 & -0.7 \\ 1 & 0 \end{bmatrix},\quad
B_3 = \begin{bmatrix} 1 \\ 0 \end{bmatrix},\quad
C_3 = \begin{bmatrix} 1 & -0.5 \end{bmatrix},
\]
\[
A_4 = \begin{bmatrix} 0.2 & 0.5 & 0 \\ 0.3 & 0.6 & 0.5 \\ 0 & 0 & 0.8 \end{bmatrix},\quad
B_4 = \begin{bmatrix} 1 \\ 0 \end{bmatrix},\quad
C_4 = \begin{bmatrix} 1 & 0 & 1 \end{bmatrix},
\]
\[
S = \begin{bmatrix} \cos(0.5) & \sin(0.5) \\ -\sin(0.5) & \cos(0.5) \end{bmatrix},\quad
F = \begin{bmatrix} 1 & 0 \end{bmatrix}.
\]
H*_{3,uω} = [4.9055  −2.9360],
H*_{3,uu} = 3.0467,
H*_{4,uσ} = [2.2045  −1.2962  −0.5412],
H*_{4,uω} = [3.4558  −1.8868  −0.0722],
H*_{4,uu} = 2.9172,
The initial controller parameters are set to be zero and the rank condition in Chap. 2
is ensured by adding sinusoids of different frequencies in the behavioral policy.
The convergence criterion ε = 0.001 and the adaptation gain Γ_i = I are chosen for all the follower agents. The output tracking responses of the agents are shown
in Fig. 6.8. The synchronization errors converge to zero as shown in Fig. 6.9. The
convergence of the output feedback control parameters is shown in Fig. 6.10. The
estimated feedback control parameters are
Ĥ1,uσ = 1.6180,
Ĥ1,uω = 1.6180,
Fig. 6.8 Output trajectories y_i(k) of the leader and the follower agents versus time steps k
Fig. 6.9 Tracking errors e_i(k) of the follower agents versus time steps k
Fig. 6.10 Convergence of the output feedback parameter estimates, Ĥ_i → H_i*, of the follower agents over the iterations
Ĥ1,uu = 2.6180,
Ĥ_{2,uσ} = [0.3100  −0.9635],
Ĥ_{2,uω} = [0.9895  −0.3613],
Ĥ_{2,uu} = 2.0504,
Ĥ_{3,uσ} = [2.5416  −1.9065],
Ĥ_{3,uω} = [4.9055  −2.9360],
Ĥ_{3,uu} = 3.0467,
Ĥ_{4,uσ} = [2.2045  −1.2962  −0.5412],
Ĥ_{4,uω} = [3.4558  −1.8868  −0.0722],
Ĥ4,uu = 2.9172,
Notice that different numbers of data samples L are collected for each agent so
as to satisfy the rank condition given in Chap. 2. Once the Q-learning iterations for
the optimal feedback parameters are completed, the gradient algorithm also starts
to converge as the tracking error ei,k → 0. Notice also that no discounting factor
is employed in the proposed method and the result is identical to that obtained by
solving the ARE and the Sylvester equations.
6.8 Summary
In this chapter, we presented model-free algorithms for the optimal tracking and multi-agent synchronization problems that do not require the knowledge of the system dynamics or the full state of the system. Simulation results have been presented that confirm the effectiveness of the proposed method.
6.9 Notes and References

Tracking control remains one of the most common applications of control theory.
Some of the most popular and successful applications of tracking control can be
found in the areas of robotics, aerospace, automobiles, and more recently, the multi-
agent systems [23, 24, 36–38, 45, 61, 84, 87, 115, 144]. The significant demand for improved tracking schemes in such diverse applications has resulted in a wide variety of approaches to solving this problem. Notable techniques include the traditional PID tracking control [19], the tracking control designs [90], robust designs [89], nonlinear trackers [134], adaptive tracking [25, 116] and intelligent tracking control [123].
Ideas from optimal control theory for solving the optimal stabilization problem
have been successfully extended to find the solution of the optimal tracking problem.
In particular, the linear quadratic tracker (LQT) is regarded as a classical optimal
tracking controller since it is a natural extension of the celebrated linear quadratic
regulator (LQR). However, different from LQR, the optimal tracking control
involves a feedforward term that makes the tracking problem non-trivial. There have been several efforts devoted to addressing the difficulty of finding this feedforward
component. The classical LQT controller employs a noncausal equation to compute
the feedforward trajectory by solving it in a backward in time manner [55], which
generally involves a pre-computation through the model and an offline storage of
the trajectory. On the other hand, if the input coupling matrix is invertible, the
technique of dynamic inversion is also applicable for finding the feedforward term
[136]. Nevertheless, one of the most formal frameworks of solving the tracking
problems is the output regulation paradigm [39], which employs the internal model
principle to compute the feedforward term. This method also bears the advantage
of handling disturbances. However, all these techniques are model-based since they
require the precise knowledge of the system dynamics.
The success of reinforcement learning in solving the classical optimal control
problems such as the LQR problem has led to the development of model-free
designs for solving the LQT problem. A variety of tracking schemes have been
proposed in the RL control literature for different classes of systems [77, 78, 143].
One of the primary challenges in the RL tracking control design is associated with
the learning of the feedforward term. RL relies on a suitable cost function during the
learning phase. Such a cost function is difficult to form in the presence of a reference
signal that may not decay to zero. In particular, it leads to an ill-defined cost function
as the resulting infinite horizon cost is not finite. An elegant approach to addressing
this difficulty in RL based LQT designs involves the idea of augmenting the system
dynamics with that of the reference generator (exosystem). The RL problem then
boils down to learning the optimal feedback controller for the augmented system,
which implicitly incorporates the feedforward term. The approach is very effective
as long as the reference generator is asymptotically stable. In that case, the problem
then reduces to a stabilization problem.
Very often in a practical setting the exosystem is required to be neutrally stable to
generate non-vanishing reference trajectories. Since the exosystem is autonomous,
the resulting augmented system is no longer stabilizable. Consequently, the infinite
horizon cost function becomes ill-defined due to the presence of the non-decaying
state. In the RL control literature, this difficulty is addressed by means of introducing
a discounting factor that makes the cost function well-defined [77]. Equivalently, the
discounting factor can be considered a modifier to the augmented system dynamics,
thereby rendering it stabilizable. Based on this approach, Q-learning schemes
have been proposed to solve the tracking problem for discrete-time linear systems
[49]. Output feedback extensions based on the output feedback value function
approximation approach [56] have also been developed later in [50]. A difficulty in
the state augmentation based approaches is that the discounted solution is different
from the nominal solution of the original LQT problem. More importantly, the
discounted solution may not be a stabilizing one or may not guarantee asymptotic
tracking if the discounting factor is not carefully selected [88]. Therefore, it is
desirable to uplift these design restrictions. Along these lines, some notable RL
results have been presented that uplift such restrictions. In [27], an output regulation
formulation is adopted to learn the optimal feedback and the feedforward terms
based on adaptive dynamic programming. This approach involves an additional
mechanism to approximate the solution of the regulator equations. Some knowledge
of system dynamics and a stabilizing initial policy are required. Output feedback
extensions to this approach have also been presented recently [27].
Tracking control designs find applications in a wide range of multi-agent control
problems. In particular, the multi-agent synchronization problem can be formulated
as a leader-follower synchronization problem, in which each agent is required to
track the leader based on the neighborhood information. The generalization to the
case of heterogeneous agents is often desirable in applications that involve agents
having different dynamics such as rescue operations that require a combination
of ground, aerial and under water support. The output regulation framework has
been successfully employed to solve these synchronization problems based on the
knowledge of system dynamics [111]. The extension of single-agent reinforcement
learning has opened a new avenue to solving these problems without requiring
model information [15, 105]. Similar to the tracking problem, the idea of aug-
menting the agent dynamics with that of the leader dynamics has been employed in
solving model-free optimal synchronization problems [47, 82]. However, discounted
cost functions are employed in these works due to the reasons highlighted earlier. A
challenging problem in multi-agent leader-following schemes is that the information
of the leader is not readily available to each agent. Distributed adaptive observers
are employed to address this difficulty by estimating the leader state for every agent
in both model-based [17, 40] and data-driven [47, 82] approaches. However, some
knowledge of the leader dynamics matrix along with the knowledge of the graph
network is needed in designing the observer [47]. In [30, 31], output regulation
References
1. Aangenent, W., Kostic, D., de Jager, B., van de Molengraft, R., Steinbuch, M.: Data-based
optimal control. In: Proceedings of the 2005 American Control Conference, pp. 1460–1465
(2005)
2. Abouheaf, M.I., Lewis, F.L., Vamvoudakis, K.G., Haesaert, S., Babuska, R.: Multi-agent
discrete-time graphical games and reinforcement learning solutions. Automatica 50(12),
3038–3053 (2014)
3. Abu-Khalaf, M., Lewis, F.L.: Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5), 779–791 (2005)
4. Adam, S., Busoniu, L., Babuska, R.: Experience replay for real-time reinforcement learning
control. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42(2), 201–212 (2012)
5. Al-Tamimi, A., Lewis, F.L., Abu-Khalaf, M.: Model-free Q-learning designs for linear
discrete-time zero-sum games with application to H-infinity control. Automatica 43(3), 473–
481 (2007)
6. Başar, T., Bernhard, P.: H-infinity Optimal Control and Related Minimax Design Problems: A Dynamic Game Approach. Springer Science & Business Media, New York, NY (2008)
7. Bellman, R.E.: Dynamic Programming. Princeton University Press (1957), Princeton, NJ
8. Bertsekas, D.: Dynamic Programming and Optimal Control: Volume I and II. Athena
Scientific (2012), Belmont, MA
9. Bertsekas, D.: Reinforcement Learning and Optimal Control. Athena Scientific (2019),
Belmont, MA
10. Bian, T., Jiang, Z.P.: Data-driven robust optimal control design for uncertain cascaded systems
using value iteration. In: Proceedings of the 54th Annual Conference on Decision and Control
(CDC), pp. 7610–7615. IEEE (2015)
11. Bian, T., Jiang, Z.P.: Value iteration and adaptive dynamic programming for data-driven
adaptive optimal control design. Automatica 71, 348–360 (2016)
12. Boltyanskii, V., Gamkrelidze, R., Pontryagin, L.: On the theory of optimal processes. Dokl. Akad. Nauk SSSR 110(1), 7–10 (1956)
13. Bradtke, S.J., Ydstie, B.E., Barto, A.G.: Adaptive linear quadratic control using policy
iteration. In: Proceedings of the 1994 American Control Conference, pp. 3475–3479 (1994)
14. Buşoniu, L., Babuška, R., De Schutter, B.: Multi-agent Reinforcement Learning: An Overview. Innovations in Multi-agent Systems and Applications, pp. 183–221. Springer, Berlin, Heidelberg (2010)
15. Busoniu, L., Babuska, R., De Schutter, B.: A comprehensive survey of multiagent reinforcement learning. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 38(2), 156–172 (2008)
16. Busoniu, L., Babuska, R., De Schutter, B., Ernst, D.: Reinforcement Learning and Dynamic
Programming Using Function Approximators. CRC Press (2017)
17. Cai, H., Lewis, F.L., Hu, G., Huang, J.: The adaptive distributed observer approach to the
cooperative output regulation of linear multi-agent systems. Automatica 75, 299–305 (2017)
18. Chen, C., Lewis, F.L., Xie, K., Xie, S., Liu, Y.: Off-policy learning for adaptive optimal output
synchronization of heterogeneous multi-agent systems. Automatica 119, 109081 (2020)
19. Choi, Y., Chung, W.K.: PID Trajectory Tracking Control for Mechanical Systems, vol. 298.
Springer (2004)
20. Ding, Z., Dong, H.: Challenges of reinforcement learning. In: H. Dong, Z. Ding, S. Zhang
(eds.) Deep Reinforcement Learning: Fundamentals, Research and Applications, pp. 249–
272. Springer Singapore, Singapore (2020)
21. Dong, L., Zhong, X., Sun, C., He, H.: Event-triggered adaptive dynamic programming for
continuous-time systems with control constraints. IEEE Trans. Neural Netw. Learn. Syst.
28(8), 1941–1952 (2017)
22. Dong, L., Zhong, X., Sun, C., He, H.: Adaptive event-triggered control based on heuristic
dynamic programming for nonlinear discrete-time systems. IEEE Trans. Neural Netw. Learn.
Syst. (to appear)
23. Encarnação, P., Pascoal, A.: Combined trajectory tracking and path following: an application to the coordinated control of autonomous marine craft. In: Proceedings of the 40th IEEE Conference on Decision and Control, vol. 1, pp. 964–969. IEEE (2001)
24. Fleming, A.J., Aphale, S.S., Moheimani, S.R.: A new method for robust damping and tracking
control of scanning probe microscope positioning stages. IEEE Trans. Nanotechnol. 9(4),
438–448 (2010)
25. Fukao, T., Nakagawa, H., Adachi, N.: Adaptive tracking control of a nonholonomic mobile
robot. IEEE Trans. Robot. Autom. 16(5), 609–615 (2000)
26. Fuller, A.: In-the-large stability of relay and saturating control systems with linear controllers.
Int. J. Control. 10(4), 457–480 (1969)
27. Gao, W., Jiang, Z.P.: Adaptive dynamic programming and adaptive optimal output regulation
of linear systems. IEEE Trans. Autom. Control 61(12), 4164–4169 (2016)
28. Gao, W., Jiang, Z.P.: Data-driven adaptive optimal output-feedback control of a 2-DOF
helicopter. In: Proceedings of the 2016 American Control Conference, pp. 2512–2517 (2016)
29. Gao, W., Jiang, Z.P.: Adaptive optimal output regulation of time-delay systems via measure-
ment feedback. IEEE Trans. Neural Netw. Learn. Syst. 30(3), 938–945 (2018)
30. Gao, W., Jiang, Z.P., Lewis, F.L., Wang, Y.: Leader-to-formation stability of multi-agent
systems: An adaptive optimal control approach. IEEE Trans. Autom. Control 63(10), 3581–
3587 (2018)
31. Gao, W., Liu, Y., Odekunle, A., Yu, Y., Lu, P.: Adaptive dynamic programming and
cooperative output regulation of discrete-time multi-agent systems. Int. J. Control Autom.
Syst. 16(5), 2273–2281 (2018)
32. Gu, K., Chen, J., Kharitonov, V.L.: Stability of Time-Delay Systems. Springer Science &
Business Media (2003)
33. Hagander, P., Hansson, A.: Existence of discrete-time LQG-controllers. Syst. Control Lett.
26(4), 231–238 (1995)
34. He, P., Jagannathan, S.: Reinforcement learning-based output feedback control of nonlinear
systems with input constraints. IEEE Trans. Syst. Man Cybern. Part B Cybern. 35(1), 150–
154 (2005)
35. Hewer, G.: An iterative technique for the computation of the steady state gains for the discrete
optimal regulator. IEEE Trans. Autom. Control 16(4), 382–384 (1971)
36. Hoffmann, G., Waslander, S., Tomlin, C.: Quadrotor helicopter trajectory tracking control. In:
AIAA Guidance, Navigation and Control Conference and Exhibit, p. 7410 (2008)
37. Hong, Y., Hu, J., Gao, L.: Tracking control for multi-agent consensus with an active leader
and variable topology. Automatica 42(7), 1177–1182 (2006)
38. Hu, J., Feng, G.: Distributed tracking control of leader–follower multi-agent systems under
noisy measurement. Automatica 46(8), 1382–1387 (2010)
39. Huang, J.: Nonlinear Output Regulation: Theory and Applications. SIAM (2004)
40. Huang, J.: The cooperative output regulation problem of discrete-time linear multi-agent
systems by the adaptive distributed observer. IEEE Trans. Autom. Control 62(4), 1979–1984
(2017)
41. Ioannou, P., Fidan, B.: Adaptive Control Tutorial. SIAM (2006)
42. Jiang, Y., Jiang, Z.P.: Computational adaptive optimal control for continuous-time linear
systems with completely unknown dynamics. Automatica 48(10), 2699–2704 (2012)
43. Jiang, Y., Jiang, Z.P.: Robust Adaptive Dynamic Programming. John Wiley & Sons (2017)
44. Kahn, G., Villaflor, A., Ding, B., Abbeel, P., Levine, S.: Self-supervised deep reinforcement
learning with generalized computation graphs for robot navigation. In: 2018 IEEE Interna-
tional Conference on Robotics and Automation (ICRA), pp. 5129–5136. IEEE (2018)
45. Kaminer, I., Pascoal, A., Hallberg, E., Silvestre, C.: Trajectory tracking for autonomous
vehicles: an integrated approach to guidance and control. J. Guid. Control Dynam. 21(1),
29–38 (1998)
46. Kiumarsi, B., Lewis, F.L.: Actor-critic-based optimal tracking for partially unknown nonlin-
ear discrete-time systems. IEEE Trans. Neural Netw. Learn. Syst. 26(1), 140–151 (2015)
47. Kiumarsi, B., Lewis, F.L.: Output synchronization of heterogeneous discrete-time systems: a
model-free optimal approach. Automatica 84, 86–94 (2017)
48. Kiumarsi, B., Lewis, F.L., Jiang, Z.P.: H∞ control of linear discrete-time systems: off-policy
reinforcement learning. Automatica 78, 144–152 (2017)
49. Kiumarsi, B., Lewis, F.L., Modares, H., Karimpour, A., Naghibi-Sistani, M.B.: Reinforce-
ment Q-learning for optimal tracking control of linear discrete-time systems with unknown
dynamics. Automatica 50(4), 1167–1175 (2014)
50. Kiumarsi, B., Lewis, F.L., Naghibi-Sistani, M.B., Karimpour, A.: Optimal tracking control of
unknown discrete-time linear systems using input-output measured data. IEEE Trans. Cybern.
45(12), 2770–2779 (2015)
51. Kleinman, D.: On an iterative technique for Riccati equation computations. IEEE Trans.
Autom. Control 13(1), 114–115 (1968)
52. Lancaster, P., Rodman, L.: Algebraic Riccati Equations. Clarendon Press (1995)
53. Landelius, T.: Reinforcement learning and distributed local model synthesis. Ph.D. thesis,
Linköping University Electronic Press (1997)
54. Lewis, F.L., Liu, D.: Reinforcement Learning and Approximate Dynamic Programming for
Feedback Control, vol. 17. John Wiley & Sons (2013)
55. Lewis, F.L., Syrmos, V.L.: Optimal Control. John Wiley & Sons (1995)
56. Lewis, F.L., Vamvoudakis, K.G.: Reinforcement learning for partially observable dynamic
processes: adaptive dynamic programming using measured output data. IEEE Trans. Syst.
Man Cybern. Part B Cybern. 41(1), 14–25 (2011)
57. Lewis, F.L., Vrabie, D.: Reinforcement learning and adaptive dynamic programming for
feedback control. IEEE Circuits Syst. Mag. 9(3), 32–50 (2009)
58. Lewis, F.L., Vrabie, D., Syrmos, V.L.: Optimal Control. John Wiley & Sons (2012)
59. Lewis, F.L., Vrabie, D., Vamvoudakis, K.G.: Reinforcement learning and feedback control:
Using natural decision methods to design optimal adaptive controllers. IEEE Control Syst.
Mag. 32(6), 76–105 (2012)
60. Li, H., Liu, D., Wang, D., Yang, X.: Integral reinforcement learning for linear continuous-time
zero-sum games with completely unknown dynamics. IEEE Trans. Autom. Sci. Eng. 11(3),
706–714 (2014)
61. Liao, F., Wang, J.L., Yang, G.H.: Reliable robust flight tracking control: an LMI approach.
IEEE Trans. Control Syst. Technol. 10(1), 76–89 (2002)
62. Lin, X., Huang, Y., Cao, N., Lin, Y.: Optimal control scheme for nonlinear systems with
saturating actuator using ε-iterative adaptive dynamic programming. In: Proceedings of 2012
UKACC International Conference on Control, pp. 58–63. IEEE (2012)
63. Lin, Z.: Low Gain Feedback. Springer (1999)
286 References
64. Lin, Z., Glauser, M., Hu, T., Allaire, P.E.: Magnetically suspended balance beam with
disturbances: a test rig for nonlinear output regulation. In: 2004 43rd IEEE Conference on
Decision and Control (CDC), vol. 5, pp. 4577–4582. IEEE (2004)
65. Lin, Z., Saberi, A.: Semi-global exponential stabilization of linear systems subject to input
saturation via linear feedbacks. Syst. Control Lett. 21(3), 225–239 (1993)
66. Lin, Z., Saberi, A.: Semi-global exponential stabilization of linear discrete-time systems
subject to input saturation via linear feedbacks. Syst. Control Lett. 24(2), 125–132 (1995)
67. Liu, D., Huang, Y., Wang, D., Wei, Q.: Neural-network-observer-based optimal control for
unknown nonlinear systems using adaptive dynamic programming. Int. J. Control 86(9),
1554–1566 (2013)
68. Liu, D., Wei, Q., Wang, D., Yang, X., Li, H.: Adaptive Dynamic Programming with
Applications in Optimal Control. Springer (2017)
69. Liu, D., Yang, X., Wang, D., Wei, Q.: Reinforcement-learning-based robust controller design
for continuous-time uncertain nonlinear systems subject to input constraints. IEEE Trans.
Cybern. 45(7), 1372–1385 (2015)
70. Liu, Y., Zhang, H., Luo, Y., Han, J.: ADP based optimal tracking control for a class of linear
discrete-time system with multiple delays. J. Franklin Inst. 353(9), 2117–2136 (2016)
71. Luo, B., Wu, H.N., Huang, T.: Off-policy reinforcement learning for H∞ control design.
IEEE Trans. Cybern. 45(1), 65–76 (2015)
72. Lyashevskiy, S.: Control of linear dynamic systems with constraints: optimization issues and
applications of nonquadratic functionals. In: Proceedings of the 35th IEEE Conference on
Decision and Control, 1996, vol. 3, pp. 3206–3211. IEEE (1996)
73. Lyshevski, S.E.: Optimal control of nonlinear continuous-time systems: design of bounded
controllers via generalized nonquadratic functionals. In: Proceedings of the 1998 American
Control Conference, vol. 1, pp. 205–209. IEEE (1998)
74. Manitius, A., Olbrot, A.: Finite spectrum assignment problem for systems with delays. IEEE
Trans. Autom. Control 24(4), 541–552 (1979)
75. Mee, D.: An extension of predictor control for systems with control time-delays. Int. J.
Control 18(6), 1151–1168 (1973)
76. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves,
A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Human-level control through deep
reinforcement learning. Nature 518(7540), 529–533 (2015)
77. Modares, H., Lewis, F.L.: Linear quadratic tracking control of partially-unknown continuous-
time systems using reinforcement learning. IEEE Trans. Autom. Control 59(11), 3051–3056
(2014)
78. Modares, H., Lewis, F.L.: Optimal tracking control of nonlinear partially-unknown
constrained-input systems using integral reinforcement learning. Automatica 50(7), 1780–
1792 (2014)
79. Modares, H., Lewis, F.L., Jiang, Z.P.: H∞ tracking control of completely unknown
continuous-time systems via off-policy reinforcement learning. IEEE Trans. Neural Netw.
Learn. Syst. 26(10), 2550–2562 (2015)
80. Modares, H., Lewis, F.L., Jiang, Z.P.: Optimal output-feedback control of unknown
continuous-time linear systems using off-policy reinforcement learning. IEEE Trans. Cybern.
46(11), 2401–2410 (2016)
81. Modares, H., Lewis, F.L., Naghibi-Sistani, M.B.: Integral reinforcement learning and experi-
ence replay for adaptive optimal control of partially-unknown constrained-input continuous-
time systems. Automatica 50(1), 193–202 (2014)
82. Modares, H., Nageshrao, S.P., Lopes, G.A.D., Babuška, R., Lewis, F.L.: Optimal model-free
output synchronization of heterogeneous systems using off-policy reinforcement learning.
Automatica 71, 334–341 (2016)
83. Moghadam, R., Lewis, F.L.: Output-feedback H-infinity quadratic tracking control of linear
systems using reinforcement learning. Int. J. Adapt. Control Signal Process. 33, 300–314
(2019)
84. Mu, C., Ni, Z., Sun, C., He, H.: Air-breathing hypersonic vehicle tracking control based
on adaptive dynamic programming. IEEE Trans. Neural Netw. Learn. Syst. 28(3), 584–598
(2017)
85. Mu, C., Wang, D., He, H.: Novel iterative neural dynamic programming for data-based
approximate optimal control design. Automatica 81, 240–252 (2017)
86. Narendra, K.S., Annaswamy, A.M.: Stable Adaptive Systems. Prentice Hall (1989)
87. Olfati-Saber, R.: Flocking for multi-agent dynamic systems: algorithms and theory. IEEE
Trans. Autom. Control 51(3), 401–420 (2006)
88. Postoyan, R., Busoniu, L., Nesic, D., Daafouz, J.: Stability analysis of discrete-time infinite-
horizon optimal control with discounted cost. IEEE Trans. Autom. Control 62(6), 2736–2749
(2017)
89. Qu, Z., Dorsey, J.: Robust tracking control of robots by a linear feedback law. IEEE Trans.
Autom. Control 36(9), 1081–1084 (1991)
90. Raptis, I.A., Valavanis, K.P., Vachtsevanos, G.J.: Linear tracking control for small-scale
unmanned helicopters. IEEE Trans. Control Syst. Technol. 20(4), 995–1010 (2012)
91. Rizvi, S.A.A., Lin, Z.: Output feedback reinforcement Q-learning control for the discrete-
time linear quadratic regulator problem. In: 2017 IEEE 56th Annual Conference on Decision
and Control (CDC), pp. 1311–1316. IEEE (2017)
92. Rizvi, S.A.A., Lin, Z.: Model-free global stabilization of discrete-time linear systems with
saturating actuators using reinforcement learning. In: 2018 IEEE Conference on Decision
and Control (CDC), pp. 5276–5281. IEEE (2018)
93. Rizvi, S.A.A., Lin, Z.: Output feedback optimal tracking control using reinforcement Q-
learning. In: 2018 Annual American Control Conference (ACC), pp. 3423–3428. IEEE (2018)
94. Rizvi, S.A.A., Lin, Z.: Output feedback Q-learning control for the discrete-time linear
quadratic regulator problem. IEEE Trans. Neural Netw. Learn. Syst. 30(5), 1523–1536 (2018)
95. Rizvi, S.A.A., Lin, Z.: Output feedback Q-learning for discrete-time linear zero-sum games
with application to the H-infinity control. Automatica 95, 213–221 (2018)
96. Rizvi, S.A.A., Lin, Z.: Output feedback reinforcement learning control for the continuous-
time linear quadratic regulator problem. In: 2018 Annual American Control Conference
(ACC), pp. 3417–3422. IEEE (2018)
97. Rizvi, S.A.A., Lin, Z.: Experience replay-based output feedback Q-learning scheme for
optimal output tracking control of discrete-time linear systems. Int. J. Adapt. Control Signal
Process. 33(12), 1825–1842 (2019)
98. Rizvi, S.A.A., Lin, Z.: An iterative Q-learning scheme for the global stabilization of discrete-
time linear systems subject to actuator saturation. Int. J. Robust Nonlinear Control 29(9),
2660–2672 (2019)
99. Rizvi, S.A.A., Lin, Z.: Model-free global stabilization of continuous-time linear systems with
saturating actuators using adaptive dynamic programming. In: 2019 IEEE 58th Conference
on Decision and Control (CDC), pp. 145–150. IEEE (2019)
100. Rizvi, S.A.A., Lin, Z.: Output feedback reinforcement learning based optimal output syn-
chronisation of heterogeneous discrete-time multi-agent systems. IET Control Theory Appl.
13(17), 2866–2876 (2019)
101. Rizvi, S.A.A., Lin, Z.: Reinforcement learning-based linear quadratic regulation of
continuous-time systems using dynamic output feedback. IEEE Trans. Cybern. 50(11), 4670–
4679 (2019)
102. Rizvi, S.A.A., Lin, Z.: Adaptive dynamic programming for model-free global stabilization of
control constrained continuous-time systems. IEEE Trans. Cybern. 52(2), 1048–1060 (2022)
103. Rizvi, S.A.A., Lin, Z.: Output feedback adaptive dynamic programming for linear differential
zero-sum games. Automatica 122, 109272 (2020)
104. Rizvi, S.A.A., Wei, Y., Lin, Z.: Model-free optimal stabilization of unknown time delay
systems using adaptive dynamic programming. In: 2019 IEEE 58th Conference on Decision
and Control (CDC), pp. 6536–6541. IEEE (2019)
105. Shoham, Y., Powers, R., Grenager, T.: Multi-agent reinforcement learning: a critical survey.
Technical report, Stanford University (2003)
106. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre,
L., Kumaran, D., Graepel, T., et al.: A general reinforcement learning algorithm that masters
chess, shogi, and go through self-play. Science 362(6419), 1140–1144 (2018)
107. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T.,
Baker, L., Lai, M., Bolton, A., et al.: Mastering the game of go without human knowledge.
Nature 550(7676), 354–359 (2017)
108. Smith, O.J.: Closer control of loops with dead time. Chem. Eng. Prog. 53, 217–219 (1957)
109. Sontag, E.D., Sussmann, H.J.: Nonlinear output feedback design for linear systems with
saturating controls. In: Proceedings of the 29th IEEE Conference on Decision and Control,
pp. 3414–3416. IEEE (1990)
110. Stevens, B., Lewis, F.L.: Aircraft Control and Simulation. Wiley (2003)
111. Su, Y., Huang, J.: Cooperative output regulation of linear multi-agent systems. IEEE Trans.
Autom. Control 57(4), 1062–1066 (2012)
112. Sussmann, H., Sontag, E., Yang, Y.: A general result on the stabilization of linear systems
using bounded controls. In: Proceedings of 32nd IEEE Conference on Decision and Control,
pp. 1802–1807. IEEE (1993)
113. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge
(1998)
114. Sutton, R.S., Barto, A.G., Williams, R.J.: Reinforcement learning is direct adaptive optimal
control. IEEE Control Syst. 12(2), 19–22 (1992)
115. Tang, Y., Xing, X., Karimi, H.R., Kocarev, L., Kurths, J.: Tracking control of networked
multi-agent systems under new characterizations of impulses and its applications in robotic
systems. IEEE Trans. Ind. Electron. 63(2), 1299–1307 (2016)
116. Tao, G.: Adaptive Control Design and Analysis. John Wiley & Sons (2003)
117. Teel, A.R.: Global stabilization and restricted tracking for multiple integrators with bounded
controls. Syst. Control Lett. 18(3), 165–171 (1992)
118. Trentelman, H.L., Stoorvogel, A.A.: Sampled-data and discrete-time H2 optimal control.
SIAM J. Control Optim. 33(3), 834–862 (1995)
119. Vamvoudakis, K.G., Lewis, F.L., Hudas, G.R.: Multi-agent differential graphical games:
online adaptive learning solution for synchronization with optimality. Automatica 48(8),
1598–1611 (2012)
120. Vrabie, D., Lewis, F.: Adaptive dynamic programming for online solution of a zero-sum
differential game. J. Control Theory Appl. 9(3), 353–360 (2011)
121. Vrabie, D., Pastravanu, O., Abu-Khalaf, M., Lewis, F.L.: Adaptive optimal control for
continuous-time linear systems based on policy iteration. Automatica 45(2), 477–484 (2009)
122. Vrabie, D., Vamvoudakis, K.G., Lewis, F.L.: Optimal Adaptive Control and Differential
Games by Reinforcement Learning Principles, vol. 2. IET (2013)
123. Wai, R.J., Chen, P.C.: Intelligent tracking control for robot manipulator including actuator
dynamics via TSK-type fuzzy neural network. IEEE Trans. Fuzzy Syst. 12(4), 552–560 (2004)
124. Wang, F.Y., Zhang, H., Liu, D.: Adaptive dynamic programming: an introduction. IEEE
Comput. Intell. Mag. 4(2), 39–47 (2009)
125. Wang, L.Y., Li, C., Yin, G.G., Guo, L., Xu, C.Z.: State observability and observers of linear-
time-invariant systems under irregular sampling and sensor limitations. IEEE Trans. Autom.
Control 56(11), 2639–2654 (2011)
126. Watkins, C.J.: Learning from delayed rewards. Ph.D. thesis, University of Cambridge,
England (1989)
127. Watkins, C.J., Dayan, P.: Q-learning. Mach. Learn. 8(3–4), 279–292 (1992)
128. Werbos, P.: Beyond regression: new tools for prediction and analysis in the behavioral
sciences. Ph.D. dissertation, Harvard University (1974)
129. Werbos, P.J.: Neural networks for control and system identification. In: Proceedings of the
28th IEEE Conference on Decision and Control, 1989, pp. 260–265. IEEE (1989)
130. Werbos, P.J.: Approximate dynamic programming for real-time control and neural modeling.
In: Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pp. 493–525.
Van Nostrand Reinhold, New York (1992)
131. Werbos, P.J.: A menu of designs for reinforcement learning over time. In: Neural Networks
for Control, pp. 67–95. MIT Press (1995)
132. Wu, H.N., Luo, B.: Simultaneous policy update algorithms for learning the solution of linear
continuous-time H∞ state feedback control. Inf. Sci. 222, 472–485 (2013)
133. Yang, Y., Sontag, E.D., Sussmann, H.J.: Global stabilization of linear discrete-time systems
with bounded feedback. Syst. Control Lett. 30(5), 273–281 (1997)
134. Yeh, H.H., Nelson, E., Sparks, A.: Nonlinear tracking control for satellite formations. J. Guid.
Control Dynam. 25(2), 376–386 (2002)
135. Yoon, S.Y., Anantachaisilp, P., Lin, Z.: An LMI approach to the control of exponentially
unstable systems with input time delay. In: Proceedings of the 52nd IEEE Conference on
Decision and Control, pp. 312–317 (2013)
136. Zhang, H., Cui, L., Zhang, X., Luo, Y.: Data-driven robust approximate optimal tracking
control for unknown general nonlinear systems using adaptive dynamic programming
method. IEEE Trans. Neural Netw. 22(12), 2226–2236 (2011)
137. Zhang, H., Liu, Y., Xiao, G., Jiang, H.: Data-based adaptive dynamic programming for a class
of discrete-time systems with multiple delays. IEEE Trans. Syst. Man Cybern. Part A Syst.
Hum. 50, 1–10 (2017)
138. Zhang, H., Qin, C., Luo, Y.: Neural-network-based constrained optimal control scheme for
discrete-time switched nonlinear system using dual heuristic programming. IEEE Trans.
Autom. Sci. Eng. 11(3), 839–849 (2014)
139. Zhang, J., Zhang, H., Luo, Y., Feng, T.: Model-free optimal control design for a class of
linear discrete-time systems with multiple delays using adaptive dynamic programming.
Neurocomputing 135, 163–170 (2014)
140. Zhao, Q., Xu, H., Jagannathan, S.: Near optimal output feedback control of nonlinear discrete-
time systems based on reinforcement neural network learning. IEEE/CAA J. Autom. Sinica
1(4), 372–384 (2014)
141. Zhong, X., He, H.: An event-triggered ADP control approach for continuous-time system
with unknown internal states. IEEE Trans. Cybern. 47(3), 683–694 (2017)
142. Zhu, L.M., Modares, H., Peen, G.O., Lewis, F.L., Yue, B.: Adaptive suboptimal output-
feedback control for linear systems using integral reinforcement learning. IEEE Trans.
Control Syst. Technol. 23(1), 264–273 (2015)
143. Zhu, Y., Zhao, D., Li, X.: Using reinforcement learning techniques to solve continuous-time
non-linear optimal tracking problem without system dynamics. IET Control Theory Appl.
10(12), 1339–1347 (2016)
144. Zuo, Z.: Trajectory tracking control design with command-filtered compensation for a
quadrotor. IET Control Theory Appl. 4(11), 2343–2355 (2010)
Index
E
Exploration/excitation bias, 20, 21, 28, 32, 49, 54, 60, 61, 63, 69, 71, 85–87, 89, 93–95, 121, 129, 135, 149, 152, 155, 160, 161, 241
Exponentially decaying, 45, 148, 241
Extended augmented system, 226, 233–235, 238, 242–244, 248, 249, 254
Extended state augmentation, 226, 228, 253

F
F-16 fighter aircraft, 122, 153
Finite dimensional, 226, 252
Finite spectrum assignment, 226
Fixed-point property, 6, 11, 24, 102, 168
Function approximation, 13, 17

G
Game algebraic Riccati equation (GARE), 98, 99
  continuous-time, 131, 132, 143, 148, 149, 154, 157, 158
  discrete-time, 101–103, 106, 123, 129

K
Kleinman’s algorithm, 10

L
Least-squares, 16, 48, 79, 80, 82, 118, 120, 141–145, 148, 149, 175, 179, 196, 197, 200, 201, 206, 207, 209, 210, 240, 241, 246, 248
Lifting technique, 226
Linearly dependent, 118, 175, 241, 246
Linearly independent, 118
Linear quadratic regulator (LQR), 4, 8, 27, 99, 226, 254, 279
Linear quadratic tracking (LQT), 257, 260
Linear time-invariant, 65, 129, 260
Lower bound, 94, 95, 161, 261
Low gain feedback, 163–169, 172, 176, 181–185, 187–195, 200, 205, 207, 212–216, 218, 219, 221, 222, 224
Lyapunov equation, 9–11, 27, 66, 67, 84, 99, 102, 103, 131, 132, 148, 167, 168, 202
Lyapunov iterations, 9, 84, 102, 167, 202