
Control Engineering

Syed Ali Asad Rizvi
Zongli Lin

Output Feedback Reinforcement Learning Control for Linear Systems
Control Engineering

Series Editor
William S. Levine, Department of Electrical and Computer Engineering, University
of Maryland, College Park, MD, USA

Editorial Board Members


Richard Braatz, Room E19-551, MIT, Cambridge, MA, USA
Graham C Goodwin, School of Elect Eng and Com Sc, University of Newcastle,
Callaghan, Australia
Zongli Lin, Electrical and Computer Engineering, University of Virginia, Charlottesville, VA, USA
Mark W. Spong, Erik Jonsson School of Eng & Comp, University of Texas at Dallas,
Richardson, USA
Maarten Steinbuch, Mechanical Engineering, Technische Universiteit Eindhoven,
EINDHOVEN, Noord-Brabant, The Netherlands
Mathukumalli Vidyasagar, University of Texas at Dallas, Cecil & Ida Green Chair in Systems Biology, Richardson, TX, USA
Yutaka Yamamoto, Dept. of Applied Analysis, Kyoto University, Sakyo-ku, Kyoto,
Kyoto, Japan
Syed Ali Asad Rizvi • Zongli Lin

Output Feedback Reinforcement Learning Control for Linear Systems

Syed Ali Asad Rizvi
Electrical and Computer Engineering
Tennessee Technological University
Cookeville, TN, USA

Zongli Lin
Electrical and Computer Engineering
University of Virginia
Charlottesville, VA, USA

ISSN 2373-7719 ISSN 2373-7727 (electronic)


Control Engineering
ISBN 978-3-031-15857-5 ISBN 978-3-031-15858-2 (eBook)
https://doi.org/10.1007/978-3-031-15858-2

Mathematics Subject Classification: 49N05, 49N30, 49N90, 93C40, 68T05

© Springer Nature Switzerland AG 2023


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
The publisher, the authors, and the editors are safe to assume that the advice and information in this book
are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or
the editors give a warranty, expressed or implied, with respect to the material contained herein or for any
errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional
claims in published maps and institutional affiliations.

This book is published under the imprint Birkhäuser, www.birkhauser-science.com by the registered
company Springer Nature Switzerland AG
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface

We live in a technology-driven world that is strongly dependent on the operation of


complex systems found in automobiles, aerospace, industrial automation, manufac-
turing systems, communication, computer systems, defense, medicine, and finance.
Smooth and efficient operation of such systems is ensured by means of automatic
control and decision-making methods. Control theory provides mathematical tools
for designing control systems to enable desired system behavior and ensure stable
operation of these systems. With the increasing complexity of systems, the demand
for ever better performance from these control systems continues to grow. While
stability is essential, it is only a minimum requirement
for the operation of a control system. It is desirable that a control system operates
in a prescribed optimal manner that meets certain cost and performance objectives.
Traditional control theory has been pushed to its limits, and new techniques are
being sought that can keep up with the accelerating evolution pace of today’s
complex systems and their performance specifications.
Optimal control theory is one of the founding pillars of modern control that
enables the synthesis of optimal controllers. An optimal control problem is the
problem of designing a control law that fulfills a control task while minimizing a
prescribed cost function that reflects the balance between the performance achieved
and the available resources consumed. The foundations of optimal control theory
were laid in the 1950s based on Pontryagin's minimum principle and Bellman's
theory of dynamic programming. Formally, the solution of the optimal control
problem is obtained by solving the Hamilton-Jacobi-Bellman (HJB) equation.
Although mathematically elegant, the HJB equation is intractable because it is a
partial differential equation. For the special case of linear systems with a quadratic
cost function, solving the HJB equation amounts to finding the solution of an
algebraic Riccati equation (ARE). Iterative computational schemes have been
developed to solve AREs. All these optimal control design algorithms are offline
in nature and require a complete knowledge of the system dynamics.
Optimal control theory is fundamentally a model-based control paradigm that
employs a system model to obtain the optimal control law. The HJB equation and
the ARE incorporate the complete knowledge of the dynamic system, and therefore,


the optimality of the solution is subject to the accuracy of the system model. The
requirement of an accurate system model is hard to satisfy in practice because
of the presence of unavoidable uncertainties. Furthermore, these uncertainties add
up as the complexity of the system increases. More often than not, the modeling and system
identification process is undesirable owing to the cost and effort associated with this
process. As a result, the dynamics of the underlying system is often unknown for
the control design. Even when the system dynamics is available in the early design
phase, the system parameters are subject to change over the operating life span of
the system due to variations in the process itself. These model variations may arise
as a result of aging, faults, or subsequent upgrades of the system components. Thus,
it is desirable to develop optimal control methods that do not rely on the knowledge
of the system dynamics.
Machine learning is an area of artificial intelligence and involves the development
of computational intelligence algorithms that learn by analyzing the system behavior
instead of being explicitly programmed to perform certain tasks. Recent advances in
machine learning have opened a new avenue for the design of intelligent controllers.
Machine learning techniques come into the picture when accurate system models
are not known but examples of system behavior are available or a measure of the
goodness of the behavior can be assigned. Machine learning algorithms are formally
categorized into three types based on the kind of supervision available, namely
supervised learning, unsupervised learning, and reinforcement learning. Supervised
learning is undoubtedly the most common type of machine learning, which finds use
in applications where examples of the desired behavior are available in the form
of input-output training data. However, the nature of control problems requires
selecting control actions whose consequences emerge over time. As a result, the
optimal control inputs that achieve the desired output are not known in advance.
In these scenarios, reinforcement learning can be used to learn the desired control
inputs by providing the controller with a suitable evaluation of its performance
during the course of learning.
Today, more and more engineering designs are finding inspiration from our
mother nature. Living organisms tend to learn, adapt, and optimize their behavior
over time by interacting with their environment. This is the main motivation behind
reinforcement learning techniques, which are computational intelligence algorithms
based on the principle of action and reward that naturally incorporate a feedback
mechanism to capture optimal behavior. The ideas of adaptability and optimality are
also present in the control community in the form of adaptive control and dynamic
programming (optimal control). However, reinforcement learning has provided a
new direction of learning optimal adaptive control by observing a reward stimulus.
Recently, the idea of using reinforcement learning to solve optimal control problems
has attracted a lot of attention in the control community.
The treatment of RL problems in control settings calls for a more mathematically
rigorous formulation so that connections with the fundamental control concepts such
as stability, controllability, and observability can be made. We see such sophistication
right from the beginning, when we are required to select a rigorously
formulated objective function that satisfies certain control assumptions.

While significant progress has been made in recent years, the reinforcement
learning (RL) control paradigm continues to undergo developments more rapidly
than ever before. Reinforcement learning for dynamic systems requires taking into
account the system dynamics. Control of dynamic systems gives prime importance
to the closed-loop stability. Control algorithms are required to guarantee closed-
loop stability, which is a bare minimum requirement for a feedback system. Control
systems without performance, robustness, and safety margins are not acceptable to
industry. Current developments in the RL control literature are directed towards
making RL algorithms applicable in real-world scenarios.
This book focuses on the recent developments in the design of RL controllers for
general linear systems represented by state space models, either in continuous-time
or in discrete-time. It is dedicated to the design of output feedback RL controllers.
While the early developments in RL algorithms have been primarily attributed to the
computer science community where it is of practical interest to pose the problem in
the stochastic setting, the scope of this research monograph is towards enhancing the
output feedback capability of the mainstream RL algorithms developed within the
control engineering community, which are primarily in the deterministic setting.
This specialized choice of topics sets this research monograph apart from the
leading mainstream books that develop the conceptual foundations of the theory
of approximate dynamic programming [8], reinforcement learning [9], and their
connection with feedback control [54, 68]. The book presents control algorithms
that are aimed to deal with the challenges associated with learning under limited
sensing capability. Fundamental to these algorithms are the issues of exploration
bias and stability guarantee. Model-free solutions to the classical optimal control
problems such as stabilization and tracking are presented for both discrete-time and
continuous-time dynamic systems. Output feedback control for linear systems is
currently more formalized in the latest literature compared to the ongoing extensions
for nonlinear systems; therefore, the discussions for the most part are dedicated to
linear systems.
The output feedback formulation presented in this book differs from the tra-
ditional design method that involves a separate observer design. As such, the
emphasis of the monograph is not on optimal state estimation, as is involved in
an observer and control design process that leads to the optimal output feedback
control. Instead, our motivation is to learn the optimal control parameters directly
based on a state parameterization approach. This in turn circumvents the separation
principle of performing two-step learning for optimal state estimation and optimal
control design.
The results covered in the book extend beyond the classical problems of
stabilization and tracking. A variety of practical challenges are studied, including
disturbance rejection, control constraints, and communication delays. Model-
free H∞ controllers are developed using output feedback based on the principles
of game theory. The ideas of low gain feedback control are employed to develop
RL controllers that achieve global stability under control constraints. New results
on the design of model-free optimal controllers for systems subject to both state
and input delays are presented based on an extended state augmentation approach,
which requires neither the knowledge of the lengths of the delays nor the knowledge
of the number of the delays.
The organization of the book is as follows. In Chap. 1, we introduce the readers
to optimal control theory and reinforcement learning. Fundamental concepts and
results are reviewed. This chapter also provides a brief survey of the recent develop-
ments in the field of RL control. Challenges associated with the RL controllers are
highlighted.
Chapter 2 presents model-free output feedback RL algorithms to solve the
optimal stabilization problem. Both continuous-time systems and discrete-time
systems are considered. A review of existing mainstream approaches and algorithms
is presented to highlight some of the challenges in guaranteeing closed-loop
stability and optimality. Q-learning and integral reinforcement learning algorithms
are developed based on the parameterization of the system state. The issues of
discounting factor and exploration bias are discussed in detail, and improved output
feedback RL methods are presented to overcome these difficulties.
Chapter 3 brings attention to the disturbance rejection control problem formu-
lated as an H∞ control problem. The framework of game theory is employed to
develop both continuous-time and discrete-time algorithms. A literature review is
provided to elaborate the difficulties in some of the recent RL-based disturbance
rejection algorithms. Q-learning and integral reinforcement learning algorithms
are developed based on the input-output parameterization of the system state, and
convergence to the optimal solution is established.
Chapter 4 presents model-free algorithms for global asymptotic stabilization
of linear systems subject to actuator saturation. Existing reinforcement learning
approaches to solving the constrained control problem are first discussed. The idea
of gain-scheduled low gain feedback is presented to develop control laws that avoid
saturation and achieve global asymptotic stabilization. To design these control laws,
we employ the parameterized ARE-based low gain design technique. Reinforcement
learning algorithms based on this approach are then presented to find the solution of
the parameterized ARE without requiring any knowledge of the system dynamics.
The presented scheme has the advantage that the resulting control laws have a
linear structure and global asymptotic stability is ensured without causing actuator
saturation. Both continuous-time systems and discrete-time systems are considered.
The last two chapters build upon the fundamental results established in the
earlier chapters to solve some important problems in control theory and practice: control
of systems in the presence of time delays, the optimal tracking problem, and
the multi-agent synchronization problem. In particular, Chap. 5 focuses on the
control problems involving time delays. A review of some existing RL approaches
to addressing the time-delay control problem of linear systems is first provided.
The design of model-free RL controllers is presented based on an extended state
augmentation approach. It is shown that discrete-time delay systems with input
and/or state delays can be brought into a delay-free form. Q-learning is then
employed to learn the optimal control parameters for the extended dynamic system.
Systems with arbitrarily large delays can be dealt with using the presented approach.
Furthermore, this method requires knowledge of neither the number nor the lengths of the delays.

Chapter 6 develops model-free output feedback RL algorithms to solve the


optimal tracking problem along with their extension to the synchronization problem
of multi-agent systems. Existing RL approaches to solving a multi-agent synchro-
nization problem based on output feedback are first described. A distributed two
degree of freedom approach is presented that separates the learning of the optimal
output feedback and the feedforward terms. The approach overcomes the prevailing
discounting factor issue and ensures asymptotic optimal output tracking.
This monograph was typeset by the authors using LaTeX. All simulation and
numerical computation were carried out in MATLAB.

Cookeville, TN, USA Syed Ali Asad Rizvi


Charlottesville, VA, USA Zongli Lin
Contents

1 Introduction to Optimal Control and Reinforcement Learning . . . . . . . . 1


1.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Optimal Control of Dynamic Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.1 Dynamic Programming Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
1.2.2 The Linear Quadratic Regulation Problem . . . . . . . . . . . . . . . . . . . . 4
1.2.3 Iterative Numerical Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
1.3 Reinforcement Learning Based Optimal Control. . . . . . . . . . . . . . . . . . . . . . 11
1.3.1 Principles of Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.3.2 Reinforcement Learning for Automatic Control . . . . . . . . . . . . . . 12
1.3.3 Advantages of Reinforcement Learning Control . . . . . . . . . . . . . . 13
1.3.4 Limitations of Reinforcement Learning Control . . . . . . . . . . . . . . 14
1.3.5 Reinforcement Learning Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.4 Recent Developments and Challenges in Reinforcement
Learning Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.4.1 State Feedback versus Output Feedback Designs . . . . . . . . . . . . . 19
1.4.2 Exploration Signal/Noise and Estimation Bias . . . . . . . . . . . . . . . . 20
1.4.3 Discounted versus Undiscounted Cost Functions . . . . . . . . . . . . . 20
1.4.4 Requirement of a Stabilizing Initial Policy . . . . . . . . . . . . . . . . . . . . 21
1.4.5 Optimal Tracking Problems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.4.6 Reinforcement Learning in Continuous-Time . . . . . . . . . . . . . . . . . 22
1.4.7 Disturbance Rejection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.4.8 Distributed Reinforcement Learning. . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.5 Notes and References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2 Model-Free Design of Linear Quadratic Regulator . . . . . . . . . . . . . . . . . . . . . . 27
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.2 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.3 Discrete-Time LQR Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.1 Iterative Schemes Based on State Feedback . . . . . . . . . . . . . . . . . . . 30
2.3.2 Model-Free Output Feedback Solution . . . . . . . . . . . . . . . . . . . . . . . . 32
2.3.3 State Parameterization of Discrete-Time Linear Systems . . . . 32


2.3.4 Output Feedback Q-function for LQR . . . . . . . . . . . . . . . . . . . . . . . . . 40


2.3.5 Output Feedback Based Q-learning for the LQR Problem . . . 43
2.3.6 Numerical Examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.4 Continuous-Time LQR Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
2.4.1 Model-Based Iterative Schemes for the LQR Problem . . . . . . . 66
2.4.2 Model-Free Schemes Based on State Feedback . . . . . . . . . . . . . . . 68
2.4.3 Model-Free Output Feedback Solution . . . . . . . . . . . . . . . . . . . . . . . . 69
2.4.4 State Parameterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
2.4.5 Learning Algorithms for Continuous-Time Output
Feedback LQR Control . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
2.4.6 Exploration Bias Immunity of the Output Feedback
Learning Algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
2.4.7 Numerical Examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
2.6 Notes and References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 95
3 Model-Free H∞ Disturbance Rejection and Linear Quadratic
Zero-Sum Games. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
3.2 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
3.3 Discrete-Time Zero-Sum Game and H∞ Control Problem . . . . . . . . . . . 100
3.3.1 Model-Based Iterative Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
3.3.2 State Parameterization of Discrete-Time Linear
Systems Subject to Disturbances. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
3.3.3 Output Feedback Q-function for Zero-Sum Game . . . . . . . . . . . . 114
3.3.4 Output Feedback Based Q-learning for Zero-Sum
Game and H∞ Control Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
3.3.5 A Numerical Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
3.4 Continuous-Time Zero-Sum Game and H∞ Control Problem . . . . . . . 129
3.4.1 Model-Based Iterative Schemes for Zero-Sum Game
and H∞ Control Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
3.4.2 Model-Free Schemes Based on State Feedback . . . . . . . . . . . . . . . 132
3.4.3 State Parameterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
3.4.4 Learning Algorithms for Output Feedback
Differential Zero-Sum Game and H∞ Control Problem . . . . . . 140
3.4.5 Exploration Bias Immunity of the Output Feedback
Learning Algorithms. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
3.4.6 Numerical Examples. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
3.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
3.6 Notes and References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
4 Model-Free Stabilization in the Presence of Actuator Saturation . . . . . . 163
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 163
4.2 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 164
4.3 Global Asymptotic Stabilization of Discrete-Time Systems . . . . . . . . . . 165
4.3.1 Model-Based Iterative Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167

4.3.2 Q-learning Based Global Asymptotic Stabilization
Using State Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
4.3.3 Q-learning Based Global Asymptotic Stabilization
by Output Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
4.3.4 Numerical Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
4.4 Global Asymptotic Stabilization of Continuous-Time Systems . . . . . . 191
4.4.1 Model-Based Iterative Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 193
4.4.2 Learning Algorithms for Global Asymptotic
Stabilization by State Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
4.4.3 Learning Algorithms for Global Asymptotic
Stabilization by Output Feedback . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
4.4.4 Numerical Simulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
4.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
4.6 Notes and References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 223
5 Model-Free Control of Time Delay Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 225
5.2 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
5.3 Problem Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
5.4 Extended State Augmentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 228
5.5 State Feedback Q-learning Control of Time Delay Systems . . . . . . . . . . 238
5.6 Output Feedback Q-learning Control of Time Delay Systems. . . . . . . . 243
5.7 Numerical Simulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 248
5.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 252
5.9 Notes and References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 254
6 Model-Free Optimal Tracking Control and Multi-Agent
Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 257
6.2 Literature Review . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
6.3 Q-learning Based Linear Quadratic Tracking. . . . . . . . . . . . . . . . . . . . . . . . . . 260
6.4 Experience Replay Based Q-learning for Estimating the
Optimal Feedback Gain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 263
6.5 Adaptive Tracking Law . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 264
6.6 Multi-Agent Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 268
6.7 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 270
6.8 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 278
6.9 Notes and References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 279

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 283
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
Notation and Acronyms

N the set of natural numbers


Z the set of integers
R the set of real numbers
Z+ the set of positive integers
R+ the set of positive real numbers
R+0 the set of non-negative real numbers
Rn the set of real vectors of dimension n
Rn×m the set of real matrices of dimensions n × m
C the set of complex numbers
j the imaginary unit √−1
Re(·) (Im(·)) the real (imaginary) part of a complex number
|·| the absolute value of a scalar, the Euclidean norm of a vector, or
the norm of a matrix induced by the vector Euclidean norm
0 a zero scalar, vector, or matrix of appropriate dimensions
I (In ) an identity matrix of appropriate dimensions (of dimensions n ×
n)
I [a, b] the set of integers within the interval [a, b], where a, b ∈ R and
a ≤ b. Either side of the interval can be open if a or b is replaced
by ∞
tr(·) the trace of a square matrix
det(·) the determinant of a square matrix
λ(·) the set of eigenvalues of a square matrix
λmin (·) (λmax (·)) the minimum (maximum) eigenvalue of a real symmetric matrix
v T (AT ) the transpose of a vector v (a matrix A)
A > B (A ≥ B) A − B is positive definite (semidefinite), where A and B are real
symmetric matrices
A < B (A ≤ B) A−B is negative definite (semidefinite), where A and B are real
symmetric matrices
ρ(·) or ρ[·] the rank of a matrix


l2 [0, ∞) the set of all sequences {xk ∈ Rn : k = 0, 1, · · · } such that
∑_{k=0}^∞ |xk|^2 < ∞ for some n ∈ Z+
l∞ [0, ∞) the set of all sequences {xk ∈ Rn : k = 0, 1, · · · } such that
sup_{k≥0} max_{0≤i≤n} |xi,k| < ∞ for some n ∈ Z+
L2 [0, ∞) the set of all measurable functions x : [0, ∞) → Rn such that
∫_0^∞ |x(t)|^2 dt < ∞ for some n ∈ Z+
L∞ [0, ∞) the set of all measurable functions x : [0, ∞) → Rn such that
sup_{t≥0} max_{0≤i≤n} |xi(t)| < ∞ for some n ∈ Z+
vec(·) a vector containing the stacked columns of a matrix
vecs(·) a vector containing the upper right triangular entries of a
symmetric matrix
‖·‖∞ the infinity norm of a signal
AI artificial intelligence
ARE algebraic Riccati equation
ADP approximate / adaptive dynamic programming
GARE game algebraic Riccati equation
HJB Hamilton–Jacobi–Bellman
HJI Hamilton–Jacobi–Isaacs
IRL integral reinforcement learning
LQR linear quadratic regulator
LQT linear quadratic tracker
MDP Markov decision process
PBH Popov-Belevitch-Hautus
PDE partial differential equation
PE persistence of excitation
PI policy iteration
POMDP partially observable Markov decision process
RL reinforcement learning
VFA value function approximation
VI value iteration
Chapter 1
Introduction to Optimal Control and Reinforcement Learning

1.1 Introduction

Control algorithms are unarguably one of the most ubiquitous elements found at
the heart of many of today’s real world systems. Owing to the demand for high
performance of these systems, the development of better control techniques has
become ever so important. A consequence of their increasing complexity is that
it has become more difficult to model and control these systems to meet the desired
objectives. As a result, the traditional model-based control paradigm requires novel
techniques to cope with these increasing challenges. Optimal control, a control
framework that is regarded as an important development in modern control theory
for its capability to integrate optimization theory in controls, is now experiencing
a paradigm shift from the offline model-based setting to the online data-driven
approach. Such a shift has been fueled by the recent surge of the research and
developments in the area of machine learning pioneered by the artificial intelligence
(AI) community and has led to the developments in optimal decision making AI
algorithms popularly known as reinforcement learning.
While reinforcement learning holds the promise of rescuing the classical optimal
control from the curse of modeling, there are also several potential challenges
that stem from the marriage of these two paradigms owing to the mathematically
rigorous requirements inherent in control theory. In this chapter we will first provide
a concise but self-contained background of optimal control and reinforcement
learning in the context of linear dynamical systems with the aim of highlighting
the connections between the two. Iterative techniques are introduced that play an
essential role in the design of algorithms presented in the later chapters. A detailed
discussion then follows that highlights the existing difficulties found in the literature
of reinforcement learning based optimal control.


1.2 Optimal Control of Dynamic Systems

Ensuring stable operation that meets basic control objectives is an essential, though
preliminary, requirement in any control design. Besides stability, practical control
systems are required to meet certain specifications from both the performance and
cost standpoints. This requires the control problem to be solved in a certain optimal
fashion that takes into account various practical operational aspects. Optimization
techniques play an essential role in meeting such specifications. An optimal control
problem basically revolves around the idea of performing a certain control task such
as stabilization or trajectory tracking in a way that a good balance of performance
and control cost is maintained while executing the control task. Pioneering devel-
opments in optimal control date back to the 1950s with the Pontryagin’s minimum
principle that provides necessary conditions under which the problem is solvable
[12]. Around the same time the introduction of Bellman’s dynamic programming[7]
method enabled solving dynamic optimization problems backward in time in a
sequential manner.

1.2.1 Dynamic Programming Method

We will develop an understanding of the dynamic programming method by consid-


ering a general optimal control problem for dynamic systems. Suppose we are given
a dynamic system of the form

xk+1 = f (xk , uk ), k ≥ 0, (1.1)

where xk ∈ Rn represents the state, uk ∈ Rm is the control input, and k is the time
index. The basic objective is to find an optimal control sequence u∗k , k ≥ 0, that
causes xk → 0 as k → ∞, while minimizing a long term running cost of the
form


$$V(x_0) = \sum_{k=0}^{\infty} r(x_k, u_k), \qquad (1.2)$$

where r(xk , uk ) is an appropriately chosen utility function (meeting certain condi-


tions) that represents the performance objectives. Note that the cost function in (1.2)
is an explicit function of the control and the state. The notation in (1.2) implies that
the optimal cost is computed for a given initial state x0 . In the case of an arbitrary
state, the notation for the value function Vπ (xk ) is used that essentially provides the
long term cost or value starting from the state xk at time k under some control policy,
say ui = π(xi ), i ≥ k, that belongs to a set of admissible policies U. In the special
case of linear systems, admissible policies are simply the stabilizing feedback gains
of the form ui = Kxi that yield a finite value VK (xk ). Formally, the optimal control
problem is to find π ∗ that results in the optimal cost

$$V^*(x_k) = \sum_{i=k}^{\infty} r\big(x_i, \pi^*(x_i)\big) = \min_{u \in U} \sum_{i=k}^{\infty} r(x_i, u_i), \qquad (1.3)$$

that is,
$$\pi^*(x_k) = \arg\min_{u \in U} \left\{ \sum_{i=k}^{\infty} r(x_i, u_i) \right\}. \qquad (1.4)$$


To solve this problem using the framework of dynamic programming, we first need
to recall the Bellman principle of dynamic programming. The Bellman principle
states that “An optimal policy has the property that whatever the initial state and
initial decision are, the remaining decisions must constitute an optimal policy with
regard to the state resulting from the first decision.” [57]
The Bellman principle in essence provides necessary and sufficient conditions
for the policy to be optimal, which is that every sub-trajectory under that policy
is also optimal independent of where we start taking optimal actions on the
optimal trajectory. The idea itself is actually very intuitive and simple. Consider, for
example, that we are to travel by air from one place to another while minimizing
the travel costs (such as time and expenses). By finding out the information about
the possible routes (the model in this case) we obtain an optimal flight route that
goes through certain connections along the route. However, if we had to reschedule
that flight from a certain intermediate connection, the flight starting from that
intermediate connection following the rest of the route of the original flight will
still be the optimal choice. Otherwise, the original flight route would not have been
optimal and can be improved by adopting better flights for the later portion of the
flight route.
Coming back to our original problem, mathematically, the Bellman optimality
principle implies that the optimal value function holds the following relationship,
 
$$V^*(x_k) = \min_{u \in U} \big\{ r(x_k, u_k) + V^*(x_{k+1}) \big\}, \qquad (1.5)$$

which suggests that, in theory, we can compute the optimal cost backwards in
time, working all the way back to the present instant, a process that requires
offline planning and invokes the system dynamics. Equation (1.5) is called the
Bellman Optimality equation, which is in fact the discrete-time counterpart of a
popular equation known as the Hamilton-Jacobi-Bellman (HJB) equation (a partial
differential equation) used to solve the continuous-time version of the same problem
as discussed below.
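
To make the backward-in-time computation in (1.5) concrete, the following sketch (hypothetical Python code, not from the book; the route graph, node names, and costs are invented for illustration) applies the Bellman optimality recursion to a small deterministic shortest-route problem in the spirit of the flight-route analogy above. The optimal cost-to-go is computed backward from the destination, and the optimal decision at each node is the minimizer of the right-hand side of (1.5).

```python
# Backward dynamic programming on a small deterministic routing example,
# in the spirit of Eq. (1.5): V*(x) = min_u { r(x, u) + V*(x') }.
# The graph, node names, and costs below are made up for illustration.

edges = {                      # edges[x] = {successor node: travel cost}
    "A": {"B": 3.0, "C": 1.0},
    "B": {"D": 2.0, "E": 4.0},
    "C": {"D": 6.0, "E": 2.0},
    "D": {"G": 5.0},
    "E": {"G": 1.0},
    "G": {},                   # destination (terminal state, zero cost-to-go)
}

V = {"G": 0.0}                 # optimal cost-to-go, filled in backward
policy = {}                    # optimal successor from each node
for x in ["D", "E", "B", "C", "A"]:          # reverse topological order
    # Bellman optimality: minimize immediate cost plus successor cost-to-go.
    nxt, cost = min(edges[x].items(), key=lambda kv: kv[1] + V[kv[0]])
    V[x] = cost + V[nxt]
    policy[x] = nxt

print(V["A"])    # optimal total cost from A to the destination G
print(policy)    # optimal decision at each node
```

Restarting from any intermediate node, say C, the remaining route prescribed by policy is still optimal, which is exactly the Bellman principle quoted above.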

For the continuous-time case, we have the following dynamics,

ẋ(t) = f (x(t), u(t)) , (1.6)

and the cost function to be minimized is given by



$$V(x_0) = \int_0^{\infty} r\big(x(\tau), u(\tau)\big)\, d\tau.$$

Solving this problem amounts to solving the following HJB equation



$$0 = \min_{u \in U} \left\{ r\big(x(t), u(t)\big) + \left(\frac{\partial V}{\partial x}\right)^{T} f\big(x(t), u(t)\big) \right\}. \qquad (1.7)$$

It is interesting to note that the right-hand side of this equation contains the
Hamiltonian function used in solving optimization problems. Once the solution (the
optimal value function) of the HJB equation is obtained, the optimal control can be
readily computed by minimizing the Hamiltonian with respect to the control u(t).
The Bellman optimality equation and the HJB equation are functional equations.
Their solution amounts to finding the optimal value function V ∗ . These equations
are difficult to solve for general dynamic systems and cost functions except for
some special cases. Thus, the standard dynamic programming framework faces two
big challenges. First, we need perfect models to solve (1.7). Secondly, even when
such models are available, we still need some method to approximate their solutions
[124].

1.2.2 The Linear Quadratic Regulation Problem

One of the most widely discussed optimal control problems is the linear quadratic
regulator (LQR) problem. The problem deals with linear dynamic systems whose
cost functions are energy like functions that take a quadratic form. In this problem
we are to solve a regulation problem while minimizing a long term quadratic cost
function. Depending on whether we require finite-time or asymptotic regulation,
the cost function is defined accordingly. We will discuss only the more common
asymptotic regulation here as we will not discuss finite-time control problems in
this book. Suppose we are given a linear dynamic system in the state space form,

xk+1 = Axk + Buk , (1.8)

where xk ∈ Rn represents the state, uk ∈ Rm is the control input and k is the time
index. The basic objective is to find an optimal control sequence u∗k , k ≥ 0, that
causes xk → 0 as k → ∞, while minimizing a long term running cost of the
form [55],

$$V(x_0) = \sum_{k=0}^{\infty} \big( x_k^T Q x_k + u_k^T R u_k \big), \qquad (1.9)$$

where Q ≥ 0 and R > 0 are the weighting matrices, with (A, √Q), √Q^T √Q = Q, being
observable and (A, B) being controllable. It is worth mentioning that the
optimal LQR controller may exist under less restrictive conditions [33, 118]. Utility
functions of this form will be the focus throughout the book.
For the linear dynamic system (1.8) and the quadratic cost function (1.9), solving
the Bellman optimality equation (1.5) amounts to solving a matrix algebraic Riccati
equation (ARE) of the form,
$$A^T P A - P + Q - A^T P B \big( R + B^T P B \big)^{-1} B^T P A = 0. \qquad (1.10)$$

The solution of the ARE (1.10) gives the optimal value function,

$$V^*(x_k) = x_k^T P^* x_k,$$

and the associated optimal control,

$$u_k^* = K^* x_k = -\big( R + B^T P^* B \big)^{-1} B^T P^* A x_k,$$

where P ∗ > 0 is the unique positive definite solution to the ARE (1.10). The
uniqueness of P ∗ is ensured under the standard controllability and observability
assumptions [55].
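
For a concrete feel of these formulas, the ARE (1.10) and the associated optimal gain can be computed numerically for a small example. The following is a hypothetical Python sketch (the book's own computations are carried out in MATLAB); the system matrices are made-up illustrative values, and SciPy's solve_discrete_are is assumed to be available for solving (1.10).

```python
import numpy as np
from scipy.linalg import solve_discrete_are  # assumed available in SciPy

# Illustrative second-order discrete-time system (values made up).
A = np.array([[1.0, 0.1],
              [0.0, 1.0]])
B = np.array([[0.0],
              [0.1]])
Q = np.eye(2)          # state weighting, Q >= 0
R = np.array([[1.0]])  # control weighting, R > 0

# P* solves A'PA - P + Q - A'PB (R + B'PB)^{-1} B'PA = 0, Eq. (1.10).
P = solve_discrete_are(A, B, Q, R)

# Optimal feedback gain: u_k = K* x_k with K* = -(R + B'P*B)^{-1} B'P*A.
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

print("P* =\n", P)
print("K* =", K)
print("closed-loop eigenvalues:", np.linalg.eigvals(A + B @ K))
```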
Parallel results for the continuous-time version of the same problem can be
obtained by considering the linear dynamics

ẋ(t) = Ax(t) + Bu(t), (1.11)

and the cost function


$$V(x_0) = \int_0^{\infty} \big( x^T(\tau) Q x(\tau) + u^T(\tau) R\, u(\tau) \big)\, d\tau,$$

where (A, B) is controllable, Q ≥ 0 and R > 0 are the weighting matrices with
(A, √Q) being observable. In the continuous-time setting, we solve the following
ARE,

$$A^T P + P A + Q - P B R^{-1} B^T P = 0, \qquad (1.12)$$

to find the optimal control law,

$$u^*(t) = K^* x(t) = -R^{-1} B^T P^* x(t), \qquad (1.13)$$

where P ∗ > 0 is the unique positive definite solution to the continuous-time


ARE (1.12). The uniqueness of P ∗ is ensured under the standard controllability
and observability assumptions [55].
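
An analogous numerical sketch applies in continuous time for the ARE (1.12) and the gain (1.13), again with made-up matrices and assuming SciPy's solve_continuous_are. The resulting P* and K* are the reference values that the model-free algorithms of the later chapters aim to recover without knowledge of A and B.

```python
import numpy as np
from scipy.linalg import solve_continuous_are  # assumed available in SciPy

# Illustrative continuous-time system (values made up).
A = np.array([[0.0,  1.0],
              [-1.0, -0.5]])
B = np.array([[0.0],
              [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])

# P* solves A'P + PA + Q - P B R^{-1} B'P = 0, Eq. (1.12).
P = solve_continuous_are(A, B, Q, R)

# Optimal feedback gain: u(t) = K* x(t) with K* = -R^{-1} B'P*, Eq. (1.13).
K = -np.linalg.solve(R, B.T @ P)

print("P* =\n", P)
print("K* =", K)
print("closed-loop eigenvalues:", np.linalg.eigvals(A + B @ K))
```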
In the above discussion we observe that, for the LQR problem, solving the
Bellman optimality equation or the HJB equation is simplified to finding the solution
of an algebraic Riccati equation (ARE). The ARE is, however, a nonlinear equation
(in P ) and, therefore, still difficult to solve for higher order systems. In the next
subsection we will review some iterative numerical approaches that have been
developed in the optimal control literature to address the difficulty associated with
solving the AREs.

1.2.3 Iterative Numerical Methods

Optimal control problems rely on solving the HJB equation (for a general nonlinear
system and/or a general cost function) or the ARE equation (for a linear system
and a quadratic cost function). Even when accurate system models are available,
these equations are still difficult to solve analytically and, therefore, computational
methods have been developed in the literature to solve these equations. Many of
these methods are based on two computational techniques called policy iteration and
value iteration [9]. In this subsection we will give an overview of these algorithms
as they will serve as the basis of the online learning based methods that we will
introduce in the subsequent chapters.
The mathematical basis of the iterative procedures for solving the Bellman and
the ARE equations is that these equations satisfy a fixed-point property under some
general conditions on the cost function. This means that we can start with some sub-
optimal solution and successive improvements, when fed back to these equations,
would converge to the optimal one. To see this, we note that we can obtain the value
function from (1.2) corresponding to some policy π(xk ) for state xk as

$$V_\pi(x_k) = r\big(x_k, \pi(x_k)\big) + V_\pi(x_{k+1}), \qquad (1.14)$$

which is known as the Bellman equation. This equation is used to evaluate the cost
of policy π. We can obtain an improved policy π′, and feeding this update back into
the Bellman equation gives us the value of the new policy Vπ′, which satisfies
Vπ′ ≤ Vπ. The process can be repeated until Vπ′ = Vπ for all future iterations.
This method is formally referred to as policy iteration (PI) [113]. It allows finding
the optimal policy without directly solving the Bellman optimality equation (1.5). This
PI procedure is described in Algorithm 1.1.

Algorithm 1.1 Policy iteration algorithm
input: system dynamics
output: V* and π*
1: initialize. Select an admissible policy π^0. Set j ← 0.
2: repeat
3:   policy evaluation. Solve the Bellman equation for $V_{\pi^j}$,
       $V_{\pi^j}(x_k) = r\big(x_k, \pi^j(x_k)\big) + V_{\pi^j}(x_{k+1}).$
4:   policy update. Find an improved policy as
       $\pi^{j+1}(x_k) = \arg\min_{\pi(\cdot)} \big\{ r\big(x_k, \pi(x_k)\big) + V_{\pi^j}(x_{k+1}) \big\}.$
5:   j ← j + 1
6: until $\|V_{\pi^j} - V_{\pi^{j-1}}\| < \varepsilon$ for some small ε > 0.

In each iteration of the policy iteration algorithm, the Bellman equation is
first solved by evaluating the cost or value of the current control policy and
then the policy is improved based on its policy evaluation. These two steps of
policy evaluation and policy update are repeated until the algorithm sees no further
improvement in the policy, and the final policy is said to be the optimal policy.
Notice that the PI algorithm iterates on the policies by evaluating them, hence the
name policy iteration. An important aspect of the algorithm is that it requires an
admissible policy because admissible policies have finite cost which is needed in
the policy evaluation step.
The second popular iteration method to solve the Bellman optimality equation is the
value iteration (VI) method. The method differs from the policy iteration method in
that it does not iterate on the policies by evaluating them but rather iterates on the
value function directly. This is a consequence of the fact that the Bellman optimality
equation is a fixed-point functional equation, which means that iterating the value
functions directly would lead to the optimal value function. This VI procedure is
described in Algorithm 1.2.
The value iteration algorithm differs from the policy iteration algorithm only
in the policy evaluation step, in which the value iteration evaluates a policy value
based on the previous policy value. It should be noted that policy iteration generally
requires solving a system of equations in each iteration, whereas the value iteration
simply performs a one-step recursion, which is computationally economical. In
contrast to the policy iteration algorithm, the value iteration algorithm does not
actually find the policy value corresponding to the current policy at each step but
it takes only a step closer to that value.

Algorithm 1.2 Value iteration algorithm
input: system dynamics
output: V* and π*
1: initialize. Select an arbitrary policy π^0 and a value function V_{π^0} ≥ 0. Set j ← 0.
2: repeat
3:   value update. Calculate
       $V_{\pi^{j+1}}(x_k) = r\big(x_k, \pi^j(x_k)\big) + V_{\pi^j}(x_{k+1}).$
4:   policy update. Find an improved policy as
       $\pi^{j+1}(x_k) = \arg\min_{\pi(\cdot)} \big\{ r\big(x_k, \pi(x_k)\big) + V_{\pi^{j+1}}(x_{k+1}) \big\}.$
5:   j ← j + 1
6: until $\|V_{\pi^j} - V_{\pi^{j-1}}\| < \varepsilon$ for some small ε > 0.

Also, the policy iteration algorithm must be suitably initialized to converge, i.e., the
initial policy must be admissible or stabilizing. This is a limitation because a
stabilizing policy may be difficult to obtain


for complex systems. In contrast, the value iteration algorithm does not impose this
requirement. Although each step of value iteration is simpler than that of policy
iteration, the policy iteration algorithm generally converges to the optimal solution
in fewer iterations, as it does more work at each step. Moreover, the search space
of policy iteration is limited to stabilizing policies only. On the other hand, since
the value iteration takes place on the value space, it does not have restriction on the
policy space, which contributes to its slower convergence. Another property of the
value iteration algorithm is that only the final solution is guaranteed to be stabilizing
and the iterative control sequences may not be stabilizing. It is worth mentioning
here that, while the most common formulation of PI and VI algorithms is aimed
towards discrete state and action spaces, the methods presented in this book do not
have this restriction.
In the following we discuss how these iterative algorithms can be used to
solve the linear quadratic regulator problem. The Bellman equation (1.14) can be
expressed, after substituting the quadratic utility function, in terms of the quadratic
value function $V_K(x_k) = x_k^T P x_k$ as

$$x_k^T P x_k = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P x_{k+1}.$$

Note that this particular structure of the value function is advantageous from the
learning perspective, which is a consequence of the quadratic utility function as
known from the linear optimal control theory [55]. The control policy in this case is
uk = Kxk , which gives us

$$x_k^T P x_k = x_k^T Q x_k + x_k^T K^T R K x_k + x_k^T (A + BK)^T P (A + BK) x_k,$$


which, in turn, leads to the Lyapunov equation

$$(A + BK)^T P (A + BK) - P + Q + K^T R K = 0. \qquad (1.15)$$

That is, the Bellman equation for the LQR problem actually corresponds to a
Lyapunov equation. This is analogous to the previous observation that the Bellman
optimality equation corresponds to the Riccati equation. These observations suggest
that we could apply iterations on the Lyapunov equation in the same way they are
applied on the Bellman equation. In this case, Lyapunov iterations under the standard
LQR conditions would converge to the solution of the Riccati equation. A Newton's
iteration method was developed in the early literature that does exactly what a policy
iteration algorithm does and is presented in Algorithm 1.3.

Algorithm 1.3 Discrete-time LQR policy iteration algorithm
input: system dynamics
output: P* and K*
1: initialize. Select an admissible policy K^0 such that A + BK^0 is Schur stable. Set j ← 0.
2: repeat
3:   policy evaluation. Solve the following Lyapunov equation for P^j,
       $(A + BK^j)^T P^j (A + BK^j) - P^j + Q + (K^j)^T R K^j = 0.$
4:   policy update. Find an improved policy as
       $K^{j+1} = -\big( R + B^T P^j B \big)^{-1} B^T P^j A.$
5:   j ← j + 1
6: until $\|P^j - P^{j-1}\| < \varepsilon$ for some small ε > 0.

Algorithm 1.3 finds the solution of the LQR Riccati equation iteratively. Instead
of solving the ARE (1.10), which is a nonlinear equation, Algorithm 1.3 solves
a Lyapunov equation which is linear in the unknown matrix P . Similar to the
general PI algorithm, Algorithm 1.3 also needs to be initialized with a stabilizing
policy. Such initialization is essential because the policy evaluation step involves
finding the positive definite solution of the Lyapunov equation, which requires the
feedback gain to be stabilizing. The algorithm is known to converge with a quadratic
convergence rate under the standard LQR conditions as shown in [35].
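
The following is a minimal implementation sketch of Algorithm 1.3 (hypothetical Python code; the matrices are illustrative and the function name is ours). The policy evaluation step is delegated to SciPy's solve_discrete_lyapunov, which is assumed to follow the convention A X Aᵀ − X + Q = 0.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov  # assumed: solves A X A^T - X + Q = 0

def lqr_policy_iteration(A, B, Q, R, K0, tol=1e-9, max_iter=100):
    """Algorithm 1.3: Newton/policy iteration for the discrete-time LQR ARE.
    K0 must be stabilizing, i.e., A + B K0 is Schur stable."""
    K, P_prev = K0, None
    for _ in range(max_iter):
        Acl = A + B @ K
        # Policy evaluation: (A+BK)^T P (A+BK) - P + Q + K^T R K = 0.
        P = solve_discrete_lyapunov(Acl.T, Q + K.T @ R @ K)
        # Policy update: K = -(R + B^T P B)^{-1} B^T P A.
        K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        if P_prev is not None and np.linalg.norm(P - P_prev) < tol:
            break
        P_prev = P
    return P, K

# Illustrative data (made up); K0 = 0 is admissible here since A is already Schur stable.
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
P, K = lqr_policy_iteration(A, B, Q, R, K0=np.zeros((1, 2)))
print(P, K)
```

If SciPy is available, the converged pair can be cross-checked against solve_discrete_are as in the earlier sketch.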
We can also apply value iteration to find the solution of the Riccati equation.
Similar to the general value iteration algorithm, Algorithm 1.2, we perform recur-
sions on the Lyapunov equation to carry out value iterations on the matrix P for
value updates. That is, instead of solving the Lyapunov equation, we only perform
recursions, which are computationally faster. The policy update step still remains the
same as in Algorithm 1.3. Under the standard LQR conditions, the value iteration
LQR algorithm, Algorithm 1.4, converges to the solution of the LQR ARE.

Algorithm 1.4 Discrete-time LQR value iteration algorithm
input: system dynamics
output: P* and K*
1: initialize. Select an arbitrary policy K^0 and a value function P^0 > 0. Set j ← 0.
2: repeat
3:   value update. Calculate
       $P^{j+1} = (A + BK^j)^T P^j (A + BK^j) + Q + (K^j)^T R K^j.$
4:   policy improvement. Find an improved policy as
       $K^{j+1} = -\big( R + B^T P^{j+1} B \big)^{-1} B^T P^{j+1} A.$
5:   j ← j + 1
6: until $\|P^j - P^{j-1}\| < \varepsilon$ for some small ε > 0.
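
Below is a corresponding sketch of the value iteration recursion in Algorithm 1.4 (hypothetical Python code with the same made-up matrices). Note that no stabilizing initial gain is required, and each iteration is a simple matrix recursion rather than a Lyapunov solve.

```python
import numpy as np

def lqr_value_iteration(A, B, Q, R, tol=1e-9, max_iter=10000):
    """Algorithm 1.4: value iteration for the discrete-time LQR ARE.
    Starts from an arbitrary P0 > 0 and an arbitrary gain K0."""
    n, m = A.shape[0], B.shape[1]
    P = np.eye(n)                         # arbitrary P0 > 0
    K = np.zeros((m, n))                  # arbitrary initial policy
    for _ in range(max_iter):
        Acl = A + B @ K
        # Value update (one-step recursion): P <- (A+BK)^T P (A+BK) + Q + K^T R K.
        P_next = Acl.T @ P @ Acl + Q + K.T @ R @ K
        # Policy improvement: K <- -(R + B^T P B)^{-1} B^T P A, using the updated P.
        K = -np.linalg.solve(R + B.T @ P_next @ B, B.T @ P_next @ A)
        if np.linalg.norm(P_next - P) < tol:
            P = P_next
            break
        P = P_next
    return P, K

# Same illustrative system as in the policy iteration sketch (values made up).
A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
P, K = lqr_value_iteration(A, B, np.eye(2), np.array([[1.0]]))
print(P, K)
```

Consistent with the discussion above, the intermediate gains produced by this recursion need not be stabilizing; only the limit is.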

The above algorithms are applicable to discrete-time dynamics. Iterative tech-
niques have also been developed for continuous-time problems. For example, one
of the popular algorithms for solving continuous-time Riccati equations is Kleinman's
algorithm. The algorithm is essentially a policy iteration algorithm that performs
iterations in a fashion similar to Newton's method. On the other hand, value iteration
in the continuous-time setting does not follow readily from its discrete-time
counterpart, and the discussion on these algorithms is delayed until later chapters.
Algorithm 1.5 presents a policy iteration algorithm for the continuous-time LQR
problem.

Algorithm 1.5 Continuous-time LQR policy iteration algorithm
input: system dynamics
output: P* and K*
1: initialize. Select an admissible policy K^0 such that A + BK^0 is Hurwitz. Set j ← 0.
2: repeat
3:   policy evaluation. Solve the following Lyapunov equation for P^j,
       $(A + BK^j)^T P^j + P^j (A + BK^j) + Q + (K^j)^T R K^j = 0.$
4:   policy update. Find an improved policy as
       $K^{j+1} = -R^{-1} B^T P^j.$
5:   j ← j + 1
6: until $\|P^j - P^{j-1}\| < \varepsilon$ for some small ε > 0.
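
A sketch of Algorithm 1.5 in the same hypothetical Python setting is shown below. The policy evaluation step is now a continuous-time Lyapunov equation, solved here with SciPy's solve_continuous_lyapunov, which is assumed to follow the convention A X + X Aᵀ = Q.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov  # assumed: solves A X + X A^T = Q

def ct_lqr_policy_iteration(A, B, Q, R, K0, tol=1e-9, max_iter=100):
    """Algorithm 1.5 (Kleinman's method): policy iteration for the continuous-time
    LQR ARE. K0 must be stabilizing, i.e., A + B K0 is Hurwitz."""
    K, P_prev = K0, None
    for _ in range(max_iter):
        Acl = A + B @ K
        # Policy evaluation: (A+BK)^T P + P (A+BK) + Q + K^T R K = 0.
        P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))
        # Policy update: K = -R^{-1} B^T P.
        K = -np.linalg.solve(R, B.T @ P)
        if P_prev is not None and np.linalg.norm(P - P_prev) < tol:
            break
        P_prev = P
    return P, K

# Illustrative data (made up); A is Hurwitz, so K0 = 0 is admissible.
A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
P, K = ct_lqr_policy_iteration(A, B, np.eye(2), np.array([[1.0]]), K0=np.zeros((1, 2)))
print(P, K)
```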

In summary, the iterative techniques discussed in this subsection provide a


numerically feasible way of solving the Bellman and the Riccati equations. These
techniques are based on the fixed-point property of these equations, which enables
us to perform successive approximation of the solutions of the otherwise compli-
cated equations. Inherent in these methods is the notion that dynamic programming
makes it easier to solve complex optimization problems by breaking them into
smaller problems. On the other hand, these techniques inherit the curse of modeling
associated with dynamic programming. It can be seen that all these iterative
techniques employ the dynamic model of the system to solve the Bellman equation
or the Lyapunov equation and, therefore, they can lead to sub-optimal or even
destabilizing solutions in the presence of modeling uncertainties. This problem
will be the focus of the remainder of the chapter, where we will present tools and
techniques that address the issue of modeling.

1.3 Reinforcement Learning Based Optimal Control

Recent surge of research in artificial intelligence (AI) algorithms has accelerated


the development of intelligent control techniques. Reinforcement learning (RL)
is a class of AI and machine learning algorithms that has gained significant
attention in the control community for its potential in designing intelligent optimal
controllers. In the machine learning community, reinforcement learning is regarded
as the third type of machine learning, which is different from the other two types,
namely the supervised and unsupervised learning. Unlike supervised learning, an
RL algorithm does not require labeled data sets to learn, which would otherwise be
hard to obtain in an optimal control problem as optimal control sequences and their
responses are not known a priori in control applications. Instead, RL uses a partial
supervision mechanism that provides some measure of the performance of the RL
controller. In the controls community, reinforcement learning is often referred to as
an approximate dynamic programming approach to solving optimal control problems,
as it involves approximating the solution of the Bellman or HJB equations [122].
Furthermore, unlike the iterative methods discussed in the previous subsection, RL
has the potential of solving optimal decision making problems without requiring the
knowledge of the system dynamics.

1.3.1 Principles of Reinforcement Learning

The primary motivation of reinforcement learning stems from the way living
beings learn to perform tasks. A key feature of intelligence in these beings is the
way they adapt to their environment and optimize their actions by interacting with
it. Reinforcement learning techniques were originally introduced in the computer
science community to serve as computational intelligence algorithms to automate
Fig. 1.1 Reinforcement learning and the role of feedback: the agent (controller) applies an action u to the environment (system) and receives the state x and a reward r in return

tasks without being explicitly programmed. The underlying mechanism in these
algorithms is based on the action and reward principle. The entity executing an
RL algorithm is referred to as an agent, which plays the role of a controller in
the optimal control setting. This agent is allowed to interact with the environment,
which is the system or plant in the context of control. The controller or the agent is
given the freedom to change (potentially improve) its actions based on some policy
or strategy, which is referred to as the control law. The agent is entitled to receive
a feedback stimulus called reward as a consequence of its actions that provides a
measure of its performance. This reward is some function of the current state of
the environment. It is this interactive action based feedback learning mechanism
that makes RL different from the other types of machine learning approaches [113].
In essence, there exists a cause and effect relationship between the action and the
reward stimulus.
It is interesting to see that, similar to the concept of feedback in controls, an RL
agent tries to improve its performance based on a reward feedback mechanism. As a
result, it serves as a good candidate to solve feedback control problems of dynamical
systems in an optimal manner. The mechanism of an RL decision maker and its
connection with the feedback control is demonstrated in Fig. 1.1, where it is seen
that, in addition to state feedback found in controls, the RL algorithm receives an
evaluative feedback in the form of a reward or penalty signal (depending on whether
we are trying to minimize a cost measure or maximize a performance measure). It
is essentially this evaluation mechanism that enables an RL controller to improve its
control strategy.

1.3.2 Reinforcement Learning for Automatic Control

The early formulation of reinforcement learning was focused on solving sequential
decision making problems with a finite state space. These are the types of problems
that are described in the framework of Markov decision processes (MDP). Dynamic
programming techniques were considered to be the primary tool for solving such
problems, where a full stochastic description of the MDP was mandatory. These
techniques, however, could not be scaled to larger problems, particularly those with
an infinite state space, owing to the curse of dimensionality. Reinforcement learning
solves these decision making problems by searching for an optimal policy that
maximizes the expected value of the performance measure by examining the reward
at each step. Examples of such sequential decision processes where reinforcement
learning has been successfully applied include robot navigation problems [44],
board games [76] such as chess [106], and more recently, the Google Alpha Go
[107].
Recently, reinforcement learning for the control of dynamic systems has received
significant attention in the automatic control community [59]. These dynamic
systems are represented by differential or difference equations. System dynamics
plays an essential part in the design of human engineered systems. RL based control
of dynamic systems requires incorporating the system dynamics with an infinite
state space, which makes it different from the traditional RL algorithms. With the
recent advances in functional approximation such as neural networks, RL techniques
have been extended to approximate the solution of the HJB equation based on the
Bellman principle of dynamic programming. These methods are often referred to as
approximate or adaptive dynamic programming (ADP) in the control literature.
The development of reinforcement learning controllers is motivated by their
optimal and model-free nature. Reinforcement learning controllers are inherently
optimal because the problem formulation embeds a performance criterion or cost
function. Several optimization criteria, such as minimum energy and minimum
time, can be taken into account. In addition to being optimal, reinforcement learning
also inherits adaptation capability in that the controller is able to adapt to the changes
in system dynamics during its operation by observing the real-time data. In other
words, the controller does not need to be reprogrammed if some parameters of the
systems are changed.
Reinforcement learning control is different from the adaptive control theory in
the sense that adaptive control techniques adapt the controller parameters based
on the error between the desired output and the measured output. As a result, the
learned controllers do not take into account the optimality aspect of the problem.
On the other hand, RL techniques are based on the Bellman equation, which
incorporates a reward signal to adapt the controller parameters and, therefore,
consider optimality while ensuring the control objectives to be achieved. The
controllers based on RL methods, however, do share a feature of direct adaptive
controllers in the sense that the RL controller is designed without the model
identification process and the optimal control is learned online through the learning
episodes by reinforcing the past control actions that give the maximum reward or
the minimum control utility.

1.3.3 Advantages of Reinforcement Learning Control

We highlight here some potential advantages of reinforcement learning control.
Optimality and Adaptivity

Reinforcement learning controllers are inherently optimal because the problem
formulation takes a performance criterion or cost function into consideration.
Several optimization criteria, such as minimum energy and minimum time, can
be adopted. In addition to being optimal, reinforcement learning also possesses
adaptation capability in that the controller is able to adapt to the changes in system
dynamics during its operation by observing the real-time data. In other words, the
controller does not need to be reprogrammed if some parameters of the systems
are changed. Therefore, reinforcement learning controllers possess the features of
adaptive control and optimal control learned online based on the real-time system
data. This notion of optimality and adaptivity presumes that a suitable reward
function is designed.

Model-Free Control

Classical dynamic programming methods have been used to solve optimal control
problems. However, these techniques are offline in nature and require complete
knowledge of the system dynamics. Reinforcement learning addresses the core lim-
itation of requiring complete knowledge of the system dynamics by assuming only
basic properties of the system dynamics such as controllability and observability.
RL control is different from indirect adaptive control, where system identification
techniques are first used to estimate the system parameters and then a controller is
designed based on the identified parameters. Instead, it learns the optimal control
policies directly based on the real-time system data.

Large Spectrum of Applications

Control of dynamic systems is a fundamental engineering problem which encompasses
a wide horizon of applications, including aerospace, automobiles, ships, industrial
processes, mechanical systems, robotics, game theory, and many others. Reinforcement
learning can be applied to solve many control problems, such as
optimal control, tracking, disturbance rejection, and distributed control of multi-
agent systems.

1.3.4 Limitations of Reinforcement Learning Control

Reinforcement learning, by its very nature, is a reactive approach to improving
decisions over a long period of time. It is semi-supervised in principle and relies on
some form of reward and penalty mechanism requiring interaction with either the
real system or some model of the actual system. It involves making decisions that could
potentially be unsafe or violate certain hard constraints of the system. Depending
on the complexity and nature of the problem, reinforcement learning may not
meet performance expectations when compared with other passive learning approaches.
Many results tend to be asymptotic in nature, which translates to a potentially longer
learning phase. Finally, the semi-supervised nature requires a carefully designed
reward function that is essential for the convergence of the algorithms. The choice
of the reward function also has direct consequences on the optimality of the solution,
which is optimal only with respect to the particular reward function chosen; a better
reward function may exist for a particular problem.

1.3.5 Reinforcement Learning Algorithms

The primary task in reinforcement learning control is to approximate the solutions
of the HJB equation. One of the tools that reinforcement learning employs to
accomplish this task is a functional approximation structure like that of a neural
network [16]. The function of interest in solving an optimal control problem is the
value function that can be represented in a parametric form

V (xk ) = W T φ(xk ),

where the vector W contains the unknown weights corresponding to the user-defined
basis set φ(xk ). Using the universal approximation property of neural networks,
the value function V (xk ) can be approximated with an arbitrary accuracy provided
that a sufficient number of terms are used in the approximation. For linear dynamic
systems, we know that the value function is quadratic in state xk , that is,

V (xk ) = xkT P xk , P > 0.

As a result, we can exactly obtain the value function with a finite basis set φ(x)
defined as

φ(x) = [ x_1^2, x_1 x_2, \cdots, x_1 x_n, x_2^2, x_2 x_3, \cdots, x_2 x_n, \cdots, x_n^2 ].

Functional approximators are essential in RL based control problems because,
unlike MDP problems, which have a finite state space, the state space is generally
infinite in control problems. In this case the use of functional approximators is much
more efficient than maintaining lookup tables for representing the value function.
Based on this parametric representation, we can write a parameterization of the
Bellman equation as

W T (φ(xk ) − φ(xk+1 )) = r(xk , uk ), (1.16)
Fig. 1.2 Reinforcement learning operates forward in time

where the utility function r(xk , uk ) plays the role of the reward or penalty signal.
Equation (1.16) is employed in policy evaluation and value update steps found in the
policy iteration and value iteration algorithms. The equation is linear in the unknown
vector W and standard linear equation solving techniques such as the least-squares
method can be employed to solve it based only on the datasets of (xk , uk , xk+1 ) and
without involving the system dynamics. This serves as a major step towards making
the iterative algorithms data-driven.
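As a minimal illustration of this data-driven step, consider the following Python sketch, which evaluates a fixed stabilizing policy by solving (1.16) in the least-squares sense from data tuples (x_k, u_k, x_{k+1}). The system matrices, the gain K, and the sample count below are assumptions chosen only for this illustration; the model is used solely to generate the data and to verify the answer.

```python
import numpy as np

# Assumed second-order system and a stabilizing policy u_k = K x_k (illustration only)
A = np.array([[0.9, 0.2], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.eye(1)
K = np.array([[-0.1, -0.3]])

def phi(x):
    # Quadratic basis; the cross term carries a factor 2 so that W = [P11, P12, P22]
    return np.array([x[0]**2, 2*x[0]*x[1], x[1]**2])

# Collect data tuples (x_k, u_k, x_{k+1}) under the policy
rng = np.random.default_rng(0)
Phi, rew = [], []
for _ in range(50):
    x = rng.standard_normal(2)
    u = K @ x
    x_next = A @ x + B @ u
    Phi.append(phi(x) - phi(x_next))      # regressor of the Bellman equation (1.16)
    rew.append(x @ Q @ x + u @ R @ u)     # observed one-step cost r(x_k, u_k)

# Solve W^T (phi(x_k) - phi(x_{k+1})) = r(x_k, u_k) in the least-squares sense
W, *_ = np.linalg.lstsq(np.array(Phi), np.array(rew), rcond=None)

# Model-based check: P from the Lyapunov recursion for the same policy
M = A + B @ K
P = np.zeros((2, 2))
for _ in range(2000):
    P = Q + K.T @ R @ K + M.T @ P @ M
print("data-based  W =", np.round(W, 4))  # approximately [P11, P12, P22]
print("model-based P =\n", np.round(P, 4))
```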
Recalling from the dynamic programming method in Sect. 1.2.1, we find that
the Bellman optimality equation (1.5) provides a backward in time procedure to
obtain the optimal value function. Such a procedure inherently involves offline
planning using models to perform operations in reverse time. Now consider the data-
driven reinforcement learning Bellman equation (1.16). The sequence of operations
required to solve the data-driven learning equation proceeds forward in time. That is,
at a given time index k, an action is applied to the system and the resulting reward
or penalty r(xk , uk ) corresponding to this action in the current state is observed.
The goal is to minimize the difference between the predicted performance and the
sum of the observed reward and the current estimate of the future performance. This
forward in time sequence is an important difference that distinguishes reinforcement
learning from dynamic programming (see Fig. 1.2).
A value function that is quite frequently used in model-free reinforcement
learning is the “Quality Function” or the Q-function [113]. The Q-function is
defined similar to the right-hand side of the Bellman equation (1.14),

Qπ (xk , uk ) = r(xk , uk ) + Vπ (xk+1 ). (1.17)

Like the value function, Q-function also provides a measure of the cost of the policy
π . However, unlike the value function, it is explicit in uk and gives the single step
cost of executing an arbitrary control uk from state xk at time index k together with
the cost of executing policy π from time index k + 1 on. The Q-function description
is in a sense more comprehensive than the value function description as it covers
both the state and action spaces and, therefore, the best control action in each state
can be selected by knowing only the Q-function. Once the optimal Q-function is
found, the optimal control can be readily obtained by finding the control action that
minimizes or maximizes the optimal Q-function. Similar to the value function, we
can estimate the Q-function using some function approximator such as

Qπ (xk , uk ) = W T φ(xk , uk ).

The Q-function also satisfies the Bellman equation, which can be obtained by using
the relationship Qπ (xk , π(xk )) = Vπ (xk ) as follows,

Qπ (xk , uk ) = r(xk , uk ) + Qπ (xk+1 , π(xk+1 )) . (1.18)

The Bellman equation (1.18) can be parameterized to find the unknown vector W
by solving the linear equation

W T (φ(xk , uk ) − φ(xk+1 , π(xk+1 ))) = r(xk , uk ),

which gives us the required Q-function. Then, following the policy iteration and
value iteration algorithms, Algorithms 1.1 and 1.2, we can iteratively solve the above
equation to obtain the optimal Q-function. Once we have the optimal Q-function
Q∗ (xk , uk ), the optimal control u∗k is obtained by solving

u_k^* = \arg\min_u Q^*.

A popular Q-learning algorithm in reinforcement learning is designed to learn the
Q-function. It is a simple yet powerful reinforcement learning algorithm that does
not require any knowledge of the system dynamics to design optimal controllers.
As the Q-function description spans the state-action space, it enables Q-learning
to employ only one function approximation structure instead of two separate
approximators for the policy evaluation and policy improvement steps. If the state
space is sufficiently explored, then the Q-learning algorithm eventually converges
to the optimal Q-function. The success of Q-learning in controls is evident from the
fact that it has provided model-free solutions to some popular control problems.
We will apply the Q-learning algorithm to solve the LQR problem. For the LQR
problem, the Q-function in (1.17) can be obtained using the value function V (xk ) =
xkT P xk and the dynamics (1.8) as

Q_K(x_k, u_k) = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} Q + A^T P A & A^T P B \\ B^T P A & R + B^T P B \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix}
             = z_k^T H z_k
             = z_k^T \begin{bmatrix} H_{xx} & H_{xu} \\ H_{ux} & H_{uu} \end{bmatrix} z_k.    (1.19)

Equation (1.19) shows that the Q-function can be parameterized in terms of
matrix H . An estimate of the Q-function can be obtained by estimating matrix H .
Depending on whether a priori knowledge of a stabilizing policy is available or
not, the standard policy iteration and value iteration algorithms can be developed
[13, 53].

Algorithm 1.6 Q-learning policy iteration algorithm for the LQR problem
input: input-state data
output: H^* and K^*
1: initialize. Select an admissible policy K^0 such that A + BK^0 is Schur stable. Set j ← 0.
2: repeat
3: policy evaluation. Solve the following Bellman equation for H^j,
   z_k^T H^j z_k = x_k^T Q x_k + u_k^T R u_k + z_{k+1}^T H^j z_{k+1}.
4: policy update. Find an improved policy as
   K^{j+1} = -(H_{uu}^j)^{-1} H_{ux}^j.
5: j ← j + 1
6: until ||H^j − H^{j−1}|| < ε for some small ε > 0.

Algorithm 1.6 is a policy iteration algorithm for solving the LQR problem
without requiring the knowledge of the system dynamics. It solves the Q-learning
Bellman equation (1.18) for the LQR Q-function matrix H . The algorithm is
initialized with a stabilizing control policy K 0 . In the policy evaluation step the
cost of the current policy K j is evaluated by estimating the Q-function matrix
H j associated with policy K j . The improved control policy K j +1 is obtained by
minimizing the Q-function

Q^j = z_k^T H^j z_k.

Subsequent iterations of these steps have been shown to converge to the optimal cost
function matrix H ∗ and the optimal control K ∗ under the standard LQR conditions.
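To make Algorithm 1.6 concrete, the following Python sketch runs Q-learning policy iteration on a small open-loop stable example so that K^0 = 0 is admissible. The system matrices, exploration noise level, sample counts, and tolerances are assumptions made only for this illustration; the model appears only to generate data and to verify the learned gain against the LQR solution.

```python
import numpy as np

# Illustrative open-loop stable system so that K^0 = 0 is admissible; the model
# is used only to generate data and to verify the learned gain at the end.
A = np.array([[0.8, 0.3], [0.0, 0.7]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)
n, m = 2, 1
rng = np.random.default_rng(1)

def qbasis(z):
    # Quadratic basis of z = [x; u]; cross terms carry a factor 2 so that the
    # weight vector contains the entries of the symmetric matrix H directly
    return np.array([z[i]*z[j]*(1.0 if i == j else 2.0)
                     for i in range(n+m) for j in range(i, n+m)])

def to_H(theta):
    H = np.zeros((n+m, n+m)); idx = 0
    for i in range(n+m):
        for j in range(i, n+m):
            H[i, j] = H[j, i] = theta[idx]; idx += 1
    return H

K = np.zeros((m, n))                                  # admissible initial policy K^0
for j in range(20):                                   # policy iteration loop
    Phi, rew = [], []
    for _ in range(200):                              # data for policy evaluation (step 3)
        x = rng.standard_normal(n)
        u = K @ x + 0.5*rng.standard_normal(m)        # exploration on top of the policy
        x1 = A @ x + B @ u
        z, z1 = np.hstack([x, u]), np.hstack([x1, K @ x1])
        Phi.append(qbasis(z) - qbasis(z1))            # regressor of the Q-Bellman equation
        rew.append(x @ Q @ x + u @ R @ u)
    theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(rew), rcond=None)
    H = to_H(theta)                                   # policy evaluation result H^j
    K_new = -np.linalg.solve(H[n:, n:], H[n:, :n])    # step 4: K = -(H_uu)^{-1} H_ux
    if np.linalg.norm(K_new - K) < 1e-8:
        K = K_new
        break
    K = K_new

# Model-based check via the Riccati recursion (for verification only)
P = np.eye(n)
for _ in range(2000):
    P = Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
print("Q-learning K:", np.round(K, 4))
print("LQR gain  K*:", np.round(-np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A), 4))
```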
To address the difficulty of requiring a priori knowledge of a stabilizing policy
K^0, we recall the following value iteration algorithm, Algorithm 1.7. The Q-learning
equation in this case is recursive in terms of the matrix H . Instead of evaluating
the policy K j , the value (in this case the Q-function itself) is iterated towards the
optimal value.
Algorithms 1.6 and 1.7 represent an important development in the design of
model-free optimal controllers for discrete-time systems, and they will serve as the
foundation of the designs presented in the later chapters.

Algorithm 1.7 Q-learning value iteration algorithm for the LQR problem
input: input-state data
output: H^* and K^*
1: initialize. Select an arbitrary policy K^0 and H^0 ≥ 0. Set j ← 0.
2: repeat
3: value update. Solve the following Bellman equation for H^{j+1},
   z_k^T H^{j+1} z_k = x_k^T Q x_k + u_k^T R u_k + z_{k+1}^T H^j z_{k+1}.
4: policy update. Find an improved policy as
   K^{j+1} = -(H_{uu}^{j+1})^{-1} H_{ux}^{j+1}.
5: j ← j + 1
6: until ||H^j − H^{j−1}|| < ε for some small ε > 0.
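A corresponding sketch of Algorithm 1.7 is given below (again with assumed example matrices). Note that, unlike the policy iteration sketch above, no stabilizing initial policy is needed: H^0 = 0 and an arbitrary K^0 are used, and the previous iterate H^j appears on the right-hand side of the regression.

```python
import numpy as np

A = np.array([[1.1, 0.4], [0.0, 0.9]])   # assumed example; open-loop unstable
B = np.array([[0.0], [1.0]])
Q, R, n, m = np.eye(2), np.eye(1), 2, 1
rng = np.random.default_rng(2)

qb = lambda z: np.array([z[i]*z[j]*(1.0 if i == j else 2.0)
                         for i in range(n+m) for j in range(i, n+m)])

def to_H(theta):
    H = np.zeros((n+m, n+m)); idx = 0
    for i in range(n+m):
        for j in range(i, n+m):
            H[i, j] = H[j, i] = theta[idx]; idx += 1
    return H

K, H = np.zeros((m, n)), np.zeros((n+m, n+m))        # arbitrary K^0 and H^0 = 0
for j in range(200):                                 # value iteration loop
    Phi, target = [], []
    for _ in range(200):
        x = rng.standard_normal(n)
        u = rng.standard_normal(m)                   # exploratory input
        x1 = A @ x + B @ u
        z, z1 = np.hstack([x, u]), np.hstack([x1, K @ x1])
        Phi.append(qb(z))
        target.append(x @ Q @ x + u @ R @ u + z1 @ H @ z1)   # value update (step 3)
    theta, *_ = np.linalg.lstsq(np.array(Phi), np.array(target), rcond=None)
    H = to_H(theta)
    K = -np.linalg.solve(H[n:, n:], H[n:, :n])       # policy update (step 4)

# Model-based check via the Riccati recursion (for verification only)
P = np.eye(n)
for _ in range(2000):
    P = Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
print("value iteration K:", np.round(K, 4))
print("LQR gain       K*:", np.round(-np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A), 4))
```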

1.4 Recent Developments and Challenges in Reinforcement Learning Control

While there have been significant developments in reinforcement learning based control
algorithms in recent years, the design of reinforcement learning controllers has met
some challenges that need to be overcome to harness the full potential of the method
[20]. The key difference between the classical reinforcement learning and the
reinforcement learning for dynamic systems is that in the latter, the system dynamics
has to be taken into consideration. Unlike the traditional reinforcement learning
applications such as AI games, the control design for dynamic systems gives prime
importance to the closed-loop system stability. Control algorithms are required to
ensure closed-loop stability, which is a bare minimum requirement for feedback
control. Control systems without performance, robustness, and safety margin are not
acceptable by the industry. As a result of these challenges, more rigorous results are
needed that could provide sufficient stability and performance guarantees in order
for these learning techniques to become mainstream in control applications. We
highlight here some challenges being tackled in applying reinforcement learning to
control applications.

1.4.1 State Feedback versus Output Feedback Designs

Reinforcement learning is a data-driven approach and, like other machine learning
methods, relies on the availability of sufficient data to make decisions. This data
is obtained in real-time based on the feedback of the system state and is used to
compute the reward signal and to determine the feedback control input to the system.
This is referred to as state feedback, which requires as many sensors as the order of
the system. The difficulty with state feedback is that access to the complete state is
generally not available in practice. The state of the system may not be a physically
measurable quantity and a sensor may not be available to measure that state. Even
when a sensor is available, it may not be feasible to install a sensor to measure
every component of the state owing to the cost and complexity. Furthermore, as
the order of the system increases, the requirement becomes more difficult to satisfy.
In contrast to state feedback, control methods which employ feedback of the system
output are more desirable. It is known that under a certain observability assumption
the system input and output signals can be utilized to reconstruct the full state of
the system. However, output feedback control becomes quite challenging in the
reinforcement learning paradigm because of the unavailability of the system model,
which is needed to reconstruct the internal state. Thus, model-free output feedback
methods should be sought to overcome the limitation of full state feedback.

1.4.2 Exploration Signal/Noise and Estimation Bias

An important requirement in reinforcement learning based methods is the explo-
ration condition. In order to learn new policies, the state space needs to be
sufficiently explored. However, there exists a trade-off between exploration of the
new policies and exploitation of the current policies. During the exploration phase,
the performance of the system may become undesirable. That is, the learning and the
exploration itself incur a cost. On the other hand, if the state space is not explored
and only the exploitation of the already learned policies is carried out, then the
improvement of the control policies is not possible. In practice, this exploration
condition is met by injecting an exploration signal/noise in the control input to excite
the system to learn the system behavior. This exploration requirement is similar
to the excitation condition found in adaptive control. However, in reinforcement
learning based optimal control this condition is more important because the prime
objective of an RL controller is to learn the optimal control parameters, and the
convergence to these optimal parameters is only ensured when the state space is
well explored. Recent studies have shown that, in certain situations, the excitation
signal, which is used to satisfy the exploration requirements, may result in bias in
the parameter estimates [56]. This excitation noise issue needs to be resolved to
guarantee convergence to the optimal parameters.

1.4.3 Discounted versus Undiscounted Cost Functions

In the original formulation of reinforcement learning for sequential decision
problems, a discounting factor is employed in the cost function to ensure that the
cost function is well-defined. Discounted cost functions have also been used in
the design of reinforcement learning controllers to address the excitation noise bias
issue. However, it has been pointed out in the recent control literature [88] that the
closed-loop system stability may be compromised due to the use of discounted cost
functions. The issue stems from the need to make the long term cost of the control
finite by assigning less weight to the future costs. However, this discounting factor
masks the long term effect of the state energy in the cost function and, therefore, the
convergence of the state is not guaranteed even when the cost is finite. Although the
use of discounting factor is common in many sequential decision making problems,
its application may not be feasible in control applications. Recent works tend to
find a bound on this discounting factor which could still ensure the closed-loop
stability [50, 80]. However, the computation of this bound requires the knowledge
of the system model, which is unavailable in the model-free reinforcement learning
control. Thus, undiscounted reinforcement learning controllers are sought.

1.4.4 Requirement of a Stabilizing Initial Policy

Many reinforcement learning control algorithms require a stabilizing initial policy
for their initialization and to guarantee convergence to the optimal stabilizing
policy. This can be a restrictive requirement in some control applications because
computing a stabilizing initial policy requires knowledge of system dynamics,
which is assumed to be unavailable in the case of reinforcement learning systems.
For open-loop stable systems, the problem can be overcome by initializing the
algorithm without feedback. However, for open-loop unstable systems, the selection
of a stabilizing initial policy is not straightforward.

1.4.5 Optimal Tracking Problems

In addition to the optimal stabilization problem, reinforcement learning has been
used to solve the model-free optimal tracking problem. The key additional difficulty
in this problem is the computation of the feedforward term. Most RL methods
are based on the state augmentation approach, in which the dynamics of the
reference generator is augmented with the system dynamics and the feedback
and feedforward terms are computed simultaneously. However, the difficulty with
this approach is that the reference generator is autonomous and neutrally stable
or unstable, and is therefore not stabilizable. As a result, the augmented system
violates the stabilizability assumption. To overcome this difficulty, a discounting
factor is introduced in the cost function to render the cost function well-defined.
The discounted solution is, however, sub-optimal and may not ensure asymptotic
tracking if the discounting factor is not chosen appropriately.

1.4.6 Reinforcement Learning in Continuous-Time

Most of the existing literature on RL control focuses on discrete-time systems
as the continuous-time problem is more involved. The additional difficulty in the
continuous-time setting is associated with the continuous-time Bellman equation,
which requires the knowledge of the system dynamics for the expression of the
time derivative of the value function. This is in contrast to the discrete-time
Bellman equation, where only the difference of the value function is involved in
the Bellman equation. To address this problem, reference [121] introduced integral
reinforcement learning (IRL), in which the IRL Bellman equation does not involve
system dynamics. In IRL, the reward signal is evaluated over some learning interval
rather than at discrete-time instants. As a result, the need of employing derivatives of
the cost function and consequently invoking the system dynamics is circumvented.
However, the knowledge of the input coupling matrix is still needed. Furthermore,
the continuous-time output feedback problems bring additional challenges. This
difficulty is due to the fact that direct translation of discrete-time RL based output
feedback results requires derivatives of the input and output measurements, which is
generally prohibitive in practice. In summary, the development of continuous-time
RL algorithms needs further attention.

1.4.7 Disturbance Rejection

Disturbance is a key issue in the standard control setting. However, the primary
formulation of RL is oriented towards solving decision making problems,
where disturbance does not readily fit in. The traditional control literature offers
various frameworks such as H∞ control and the internal model principle to handle
disturbances. Differently from these formulations, a game theory based approach
has been found to be more fitting in the RL setting, in which the disturbance is
treated as some intelligent decision maker that plays an adversarial role in the
system dynamics. Recently, some state feedback RL methods have been proposed
to solve the robust H∞ control problem using game theoretic arguments [3, 4, 35,
70]. However, these works solve the full-information H∞ control problem, where
the measurements of both state and disturbances are required. Development of more
practical approaches to disturbance rejection is in order.

1.4.8 Distributed Reinforcement Learning

With the growing complexity of today’s systems, it has become difficult to solve
control and decision making problems in a centralized setting. In this regard, multi-
agent distributed approaches are promising as they provide solutions to the control
problems that were once considered intractable in a centralized manner. This has
naturally led to more interest in RL community to design distributed learning
algorithms [14, 15, 105]. However, RL algorithms are in principle single-agent
based and, therefore, the leap from single-agent control to distributed algorithms for
multi-agent systems is challenging. Issues such as the coordination of these agents
and the exchange of information during the learning phase need to be carefully
addressed. More importantly, there are open questions on how to harness the model-
free power of RL to deal with not only unknown dynamics but also unknown
network topologies. In addition to this, the problem size in the multi-agent scenario
may become quite large, which is a major challenge for the current RL algorithms.

1.5 Notes and References

This chapter gave an overview of optimal control and reinforcement learning
control. We observed that the traditional optimal control theory relies on the
availability of accurate system models, which may be difficult to obtain owing to
the increasing complexity of systems and the ever-present modeling uncertainties.
The underlying foundation of the classical optimal control tools is based on the
theory of dynamic programming introduced by the pioneering work of Richard
E. Bellman [7]. The underlying Bellman principle of dynamic programming is
elegant due to its intuition and simplicity and provides necessary and sufficient
conditions for optimality in decision making problems. Mathematical treatment of
the Bellman principle of dynamic programming, however, leads to the Bellman
optimality equation (for the discrete-time case), which is a nonlinear functional
equation and, in general, difficult to solve. Continuous-time extension of dynamic
programming involves the Hamilton-Jacobi-Bellman equation, a partial differential
equation, whose solution is in general intractable. In addition to the difficulty
of solving these equations, full knowledge of the underlying system dynamics is
needed in the dynamic programming framework. It was shown that, for the special
case of linear systems with a quadratic cost function, the problem boils down to
solving an algebraic Riccati equation.
Even when accurate system models are available, these equations are still difficult
to solve analytically. Therefore, computational methods have been developed in
the literature to solve these equations. Many of these methods are based on
two computational techniques called policy iteration and value iteration, which
were originally introduced to solve dynamic programming problems for sequential
decision processes such as MDP. Policy iteration and value iteration algorithms
were introduced in Sect. 1.2.3. These iterative algorithms recursively solve the
Bellman equation, which provides a recursive relationship between the current cost
and the next step cost. Policy iteration algorithms for solving the discrete-time and
continuous-time Riccati equations were proposed in [35] and [51], respectively.
These algorithms evaluate the cost of the policy, and therefore, in order to have
a finite cost at each step they need to be initialized with a stabilizing policy.
On the other hand, value iteration algorithms do not suffer from this limitation
as they iterate by exploiting the fixed-point property of the Bellman optimality
equation. A value iteration algorithm for the discrete-time LQR problem was
proposed in [53]. Value iteration for continuous-time problems is, however, not
straightforward. Recently, some approximation based value iteration methods have
also been proposed towards solving the continuous-time ARE, where the
requirement of a stabilizing initial policy was removed [11]. It should be noted that
all of these design algorithms are model-based and, therefore, require the complete
knowledge of the system dynamics.
Reinforcement learning was introduced in Sect. 1.3.1 as an approach to solving
optimal control problems without requiring full model information. The idea behind
this approach is to approximate the solution of the HJB equation by performing
some functional approximations using tools such as neural networks. We discussed
some advantages of reinforcement learning in Sect. 1.3.3, where we emphasized
its capability to achieve adaptive optimal control and its applicability to a large
class of control problems. Studies of RL based control methods generally consider
one of the two main types of algorithms, actor-critic learning and Q-learning [114].
We only discussed Q-learning in this chapter as it will be used in the following
chapters. However, it is worth pointing out here that the actor-critic structure for
RL control was introduced by Werbos in [128–131], where it is referred to as
approximate dynamic programming (ADP). The structure consists of a critic sub-
system and an actor sub-system. The critic component assesses the cost of current
action based on some optimality criterion similar to the policy evaluation step in
PI and VI algorithms, while the actor component estimates an improved policy.
However, in RL ADP, the critic and actor employ approximators such as neural
networks to approximate the cost and control functions. Actor-critic algorithms
employ value function approximation (VFA) to evaluate the value of the current
policy. The use of functional approximators instead of lookup tables overcomes
the curse of dimensionality problem in traditional DP, which occurs when the state
space grows large. Actor-critic algorithms for the discrete-time and continuous-
time LQR problems have been described in [59]. Partial knowledge of the system
dynamics (the input coupling matrix) is needed in these methods.
The Q-learning algorithm was discussed in detail in Sect. 1.3.5. This technique
was introduced by Watkins [126] and is based on the idea of Q-function (Quality
function), which is a function of the state and the control. The main task in
the Q-learning algorithm is to estimate the optimal Q-function. Once the optimal
Q-function is found, the optimal control can be readily obtained by finding the
control action that minimizes or maximizes the optimal Q-function. The Q-function
description is more comprehensive than the value function description as it spans
the state-action space instead of the state space, which enables Q-learning to employ
only one functional approximation structure instead of two separate approximators
for critic (value function) and actor (control function). If the state space is
sufficiently explored, then the Q-learning algorithm will eventually converge to the
optimal Q-function [127]. The success of Q-learning in controls is evident from the
fact that it has provided model-free solutions to the popular control problems. A
Q-learning algorithm for the discrete-time linear quadratic regulation problem was
proposed in [13]. In [13], the Q-learning iterations are based on the policy iteration
method and, therefore, the knowledge of a stabilizing initial policy is required. It was
shown that, under a certain excitation condition, the Q-learning algorithm converges
to the optimal LQR controller. Later in [53], a value iteration based Q-learning LQR
algorithm was presented and convergence to the optimal controller was shown. The
requirement of a stabilizing initial policy was also obviated.
In the final part of this chapter, we highlighted some of the recent developments
and challenges in reinforcement learning control. Issues of state feedback and output
feedback were discussed in detail. Discussion on the deleterious effects of dis-
counted cost functions was provided, which is crucial in automatic control from the
stability perspective. Difficulties associated with extending reinforcement learning
to solve continuous-time control problems were highlighted. More advanced and
challenging problems for reinforcement learning were also discussed briefly.
Chapter 2
Model-Free Design of Linear Quadratic Regulator

2.1 Introduction

The linear quadratic regulator (LQR) is one of the most effective formulations of
the control problem. It aims to minimize a quadratic cost function and, under some
mild conditions on the cost function and the system dynamics, leads to a linear
asymptotically stable closed-loop system. The quadratic cost function represents the
long term cost of the control and the state in the form of energies of the signals. The
LQR control problem is essentially a multi-objective optimization problem, where
it is desired to minimize both the state and the control energies. However, there is a
trade-off between these two objectives because achieving better state performance
generally comes at a higher control effort.
In Chap. 1, the LQR problem was introduced briefly and a model-based solution
was presented based on dynamic programming. For the LQR problem, the Bellman
equation reduces to an algebraic Riccati equation (ARE). Since AREs are nonlinear
in the unknown parameter and, therefore, difficult to solve, iterative techniques have
been developed to solve them. In particular, iterative techniques of policy iteration
(PI) and value iteration (VI) have been developed that would find the solution
of the LQR ARE by iteratively solving a Lyapunov equation, which is a linear
equation. A Q-learning technique has also been introduced to design model-free
policy iteration and value iteration algorithms. All results introduced in Chap. 1
focus on state feedback design, which requires the measurement of the full state
for its implementation.
Output feedback control eliminates the need of full state measurement required
by state feedback control and involves fewer sensors, making it cost-effective and
more reliable. Output feedback in reinforcement learning is more challenging as the
system dynamics is unknown. In particular, the key difficulty lies in the design of a
state observer. Classical state estimation techniques used in the design of observer
based output feedback control laws involve a dynamic model of the system. The
observer relies on the system model to estimate the state of the system from the input

and output measurements. While the input-output data is available, the dynamic
model is not known in a reinforcement learning problem, making the traditional
model-based state observers not applicable.
Model-free output feedback design approaches that have been reported in the
reinforcement learning literature can be classified as neural network observer based
and input-output data based methods. Neural network observer based designs
generally try to estimate the system dynamics and then reconstruct the state with
bounded estimation errors. This approach requires a separate approximator in
addition to the approximators employed for learning and control. Such a design
tends to be complicated because of the difficulty in proving stability properties
as the separation principle does not directly apply. Furthermore, the use of a
separate approximation structure also makes the implementation of the design more
complicated.
On the other hand, the data based output feedback design approach has attracted
more attention recently. In contrast to the neural network observer based design
approach, this approach has the advantage that no external observer is needed and,
as such, does not suffer from state estimation errors. Consequently, the optimal solution
can be obtained even when full state feedback is not available. This direct output
feedback approach is based on the idea of parameterizing the state in terms of some
function of the input and output. For linear systems, the state can be reconstructed
based on a linear combination of some functions of the input and output mea-
surements (in the absence of unknown disturbances). However, incorporating this
parameterization has the potential of incurring bias in the estimation process as the
Bellman equation is modified to cater to the output feedback design.
In this chapter, we will present model-free output feedback reinforcement
learning algorithms for solving the linear quadratic regulation problem. The design
will be carried out in both discrete-time and continuous-time settings. We will
present some parameterization of the system state that would allow us to develop
new learning and control equations that do not involve the state information. It will
be shown that the output feedback control law that is learned by the RL algorithm
is the steady-state equivalent of the optimal state feedback control law. Both policy
iteration and value iteration based algorithms will be presented. For the discrete-
time problems, we will present output feedback Q-learning algorithms. On the
other hand, the treatment of the continuous-time problems will be different and will
employ ideas from integral reinforcement learning to develop the output feedback
learning equations.

2.2 Literature Review

As discussed in Sect. 2.1, the neural network observers and the input-output data
based methods have been the two key approaches in the RL literature to solving
the output feedback problems. For instance, [142] presented a suboptimal controller
using output feedback to solve the LQR problem by incorporating a neural network
observer. This, however, results in bounded estimation errors. In the same spirit, [22,
34, 67, 136, 140, 141] performed state estimation based on these observers together
with the estimation of the critic and actor networks to achieve near optimal solution.
Again, only ultimate boundedness of the estimation errors was demonstrated.
On the other hand, the state parameterization approach to output feedback RL
control has been gaining increasing attention. In particular, [1, 28, 50, 56, 80]
followed this approach to avoid the issue of estimation errors and to take the
advantage that no external observer is required. In the pioneering work of [1],
identification of the Markov parameters was used to design a data-driven output
feedback optimal controller. The authors of [56] were the first to build upon this
work and extend the idea in the RL setting by employing the VFA approach using
PI and VI output feedback RL LQR algorithms. Following the VFA approach of
[56], the authors of [50] solved the RL optimal linear quadratic tracking problem.
For the continuous-time LQR problem, a partially model-free output feedback
solution was proposed in [142]. The method requires the system to be static output
feedback stabilizable. Motivated by the work of [56], the authors of [80] solved
the continuous-time output feedback LQR problem using a model-free IRL method,
which does not require the system to be static output feedback stabilizable.
However, in all these works on state parameterization based output feedback
RL control, a discounting factor is introduced in the cost function, which helps to
diminish the effect of excitation noise bias as studied in [56]. The end result is that
the discounted controller is suboptimal and does not correspond to the solution of
the Riccati equation. More importantly, this discounting factor has the potential to
compromise system stability as reported recently in [80, 88]. The stability analysis
carried out in [88] has shown that the discounted cost function in general cannot
guarantee stability. At best, “semiglobal stability” can be achieved in the sense that
stability is guaranteed only when the discounting factor is chosen above a certain
lower bound (upper bound in the continuous-time setting). However, knowing this
bound requires the knowledge of the system dynamics, which is assumed to be
unknown in RL. Furthermore, the discounted controller loses optimality aspect of
the control problem as it does not correspond to the optimal solution of the original
ARE.

2.3 Discrete-Time LQR Problem

Consider a discrete-time linear system given by the following state space represen-
tation,

x_{k+1} = A x_k + B u_k,
y_k = C x_k,    (2.1)

where xk ∈ Rn is the state, uk ∈ Rm is the input, and yk ∈ Rp is the output.
Under the usual controllability and observability assumptions on (A, B) and (A, C),
respectively, we would like to find the feedback control sequence uk = Kxk that
minimizes the long term cost [55],


V_K(x_k) = \sum_{i=k}^{\infty} r(x_i, u_i),    (2.2)

with the following r(xk , uk ) being a quadratic function,

r(x_k, u_k) = y_k^T Q_y y_k + u_k^T R u_k,    (2.3)

where Qy ≥ 0 and R > 0 are the performance matrices. The optimal state feedback
control that minimizes (2.2) is given by

u_k^* = K^* x_k = -(R + B^T P^* B)^{-1} B^T P^* A x_k,    (2.4)

and its associated cost is V^*(x_k) = x_k^T P^* x_k, under the conditions of controllability
of (A, B) and observability of (A, \sqrt{Q}), where \sqrt{Q}^T \sqrt{Q} = Q, Q = C^T Q_y C.
Here, P^* = (P^*)^T is the unique positive definite solution to the following ARE,

A^T P A - P + Q - A^T P B (R + B^T P B)^{-1} B^T P A = 0.    (2.5)
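For later reference, the following Python sketch computes the solution of the ARE (2.5) by iterating the Riccati recursion and then forms the gain (2.4). The matrices below are assumptions chosen only for illustration; this model-based computation is exactly what the model-free algorithms in this chapter aim to reproduce from input-output data.

```python
import numpy as np

# Assumed example system and weights
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
C = np.array([[1.0, 0.0]])
Qy = np.array([[1.0]])
R = np.array([[1.0]])
Q = C.T @ Qy @ C                       # Q = C^T Qy C as in (2.3)

# Fixed-point iteration of the Riccati recursion; converges to the
# stabilizing solution P* of the ARE (2.5) under the stated conditions
P = np.eye(2)
for _ in range(5000):
    G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    P_next = A.T @ P @ A + Q - A.T @ P @ B @ G
    if np.max(np.abs(P_next - P)) < 1e-12:
        P = P_next
        break
    P = P_next

K_star = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)   # optimal gain (2.4)
print("P* =\n", np.round(P, 4))
print("K* =", np.round(K_star, 4))
```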

2.3.1 Iterative Schemes Based on State Feedback

Before working our way to the model-free output feedback solution, we will
revisit some results for the state feedback LQR problem. For the sake of continuity,
we will recall the concept of Q-functions from Chap. 1 and provide details of the
LQR Q-function, which plays a fundamental role in the design of the Q-learning
algorithms that we will subsequently present.
Consider the cost function given in (2.2). Under a stabilizing feedback control
policy uk = Kxk (not necessarily optimal), the total cost incurred when starting at
time index k from state xk is quadratic in the state as given by [13]

VK (xk ) = xkT P xk , P > 0. (2.6)

Motivated by the Bellman optimality principle, Equation (2.2) can be written recursively as

VK (xk ) = r(xk , Kxk ) + VK (xk+1 ), (2.7)


where VK (xk+1 ) is the cost of following policy uk = Kxk (or, simply, policy K) at
all future time indices.
Next, we use (2.7) to define a Q-function as the sum of the one-step cost of taking
an arbitrary action uk at time index k and the total cost that would incur if the policy
K is followed at time index k + 1 and all the subsequent time indices [113],

QK (xk , uk ) = r(xk , uk ) + VK (xk+1 ), (2.8)

which is similar to the cost function but is explicit in both uk and xk .


For the LQR, the Q-function can be represented as

Q_K(x_k, u_k) = x_k^T Q x_k + u_k^T R u_k + x_{k+1}^T P x_{k+1}
             = x_k^T Q x_k + u_k^T R u_k + (A x_k + B u_k)^T P (A x_k + B u_k)
             = \begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \begin{bmatrix} Q + A^T P A & A^T P B \\ B^T P A & R + B^T P B \end{bmatrix} \begin{bmatrix} x_k \\ u_k \end{bmatrix},    (2.9)

which is quadratic in uk and xk [58].


Given the optimal cost V ∗ , we can compute K ∗ . To do so, we define the optimal
Q-function as the cost of executing an arbitrary control u and then following the
optimal policy K ∗ , as given by

Q∗ (xk , uk ) = r(xk , uk ) + V ∗ (xk+1 ), (2.10)

and the optimal policy is given by

K^* x_k = \arg\min_{u_k} Q^*(x_k, u_k).

The optimal LQR controller u∗k , which minimizes the long term cost, can be
obtained by solving


\frac{\partial Q^*}{\partial u_k} = 0.

The result is the same as given in (2.4), which was obtained by solving the
ARE (2.5).
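To see numerically that minimizing the Q-function reproduces (2.4), the short sketch below (with assumed example matrices, and with P^* computed here by a Riccati iteration purely for the check) builds the matrix H of (2.9) and confirms that -(H_uu)^{-1} H_ux coincides with the LQR gain.

```python
import numpy as np

# Assumed example system and weights
A = np.array([[0.9, 0.2], [0.1, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)

# P* from the Riccati recursion (model-based, for this check only)
P = np.eye(2)
for _ in range(5000):
    P = A.T @ P @ A + Q - A.T @ P @ B @ np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

# Q-function matrix H of (2.9) evaluated at P = P*
H = np.block([[Q + A.T @ P @ A, A.T @ P @ B],
              [B.T @ P @ A,     R + B.T @ P @ B]])
n = A.shape[0]
K_from_H = -np.linalg.solve(H[n:, n:], H[n:, :n])          # -(H_uu)^{-1} H_ux
K_lqr    = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # gain (2.4)
print(np.allclose(K_from_H, K_lqr))    # True: both expressions coincide
```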
Model-based iterative algorithms based on policy iteration and value iteration
were presented in Algorithms 1.3 and 1.4. The model-free versions of these
algorithms were presented in Algorithms 1.6 and 1.7, which were based on Q-
learning.

2.3.2 Model-Free Output Feedback Solution

In the previous results of Q-learning, the Q-function that we obtained in (2.9)
involves information of the state xk , which is not available in our problem setting.
To circumvent this situation, we next present a state reconstruction technique
that employs input and output data of the system to observe the internal state.
In this section, we will present reinforcement learning techniques for solving
discrete-time output feedback LQR problem. Discrete-time linear systems are the
deterministic analog to the Markov decision processes (MDPs). Reinforcement
learning techniques were originally developed to address optimal decision making
problems in MDPs, where the notion of sequential decision making and iterative
improvements are readily incorporated. Similar form of decision making is involved
in discrete-time dynamical systems and we will develop output feedback reinforce-
ment learning techniques for these systems. In the MDP framework the output
feedback problems are often referred to as partially observable Markov decision
processes (POMDPs).
In [56], this problem was addressed using the RL value function approximation
(VFA) method. However, this approach requires a discounting factor 0 < γ < 1 in
the cost function to overcome the exploration bias issue, that is,


V_K(x_k) = \sum_{i=k}^{\infty} \gamma^{i-k} r(x_i, K x_i).    (2.11)

The use of the discounted cost function (2.11) does not ensure closed-loop stability.
Notice that, for 0 < γ < 1, the boundedness of the discounted cost (2.11), that is,

 
\sum_{i=k}^{\infty} \gamma^{i-k} x_i^T \left( Q + K^T R K \right) x_i < \infty,

does not ensure the convergence of xk to zero as k → ∞.
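The point is easily seen on a scalar example. For x_{k+1} = a x_k with |a| > 1 and u_k = 0, the discounted cost \sum_{i=k}^{\infty} \gamma^{i-k} x_i^2 = x_k^2 \sum_{i=0}^{\infty} (\gamma a^2)^i is finite whenever \gamma a^2 < 1, even though x_k diverges. A small sketch of this observation (the scalar values are chosen only for illustration):

```python
import numpy as np

a, gamma, x0 = 1.2, 0.5, 1.0          # unstable scalar system, gamma*a^2 = 0.72 < 1

x, discounted_cost = x0, 0.0
for i in range(200):
    discounted_cost += gamma**i * x**2    # running discounted state cost
    x = a * x                             # x_{k+1} = a x_k (no control)

print("discounted cost ~", round(discounted_cost, 4),      # converges to 1/(1 - 0.72)
      " closed form:", round(1/(1 - gamma*a**2), 4))
print("state after 200 steps:", x)                         # diverges (about 1.2**200)
```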


We now proceed to develop a Q-learning scheme without employing a discounted
cost function. It will be shown that the controller parameters converge to the solution
of the undiscounted ARE and the closed-loop stability is guaranteed.

2.3.3 State Parameterization of Discrete-Time Linear Systems

Classical state estimation techniques make use of the dynamic model of the system
to reconstruct the state by means of a state observer. A state observer of a system
is essentially a user-defined dynamic system that employs some error correction
mechanism to reconstruct, based on the dynamic model of the system, the state of
the system from its input and output data. The problem of state estimation is not
straightforward when the system model is not available.
A key technique in adaptive control of unknown systems is to parameterize a
quantity in terms of the unknown parameters and the measurable variables and
use these measurable variables to estimate the unknown parameters. The following
result provides a parameterized expression of the state in terms of the measurable
input and output data that will be used to derive the output feedback learning
equations in the subsequent subsections.
Theorem 2.1 Consider system (2.1). Let the pair (A, C) be observable. Then, there
exists a parameterization of the state in the form of

xk = Wu σk + Wy ωk + (A + LC)k x0 , (2.12)

where L is the observer gain chosen such that A + LC is Schur stable, the
parameterization matrices W_u = [W_u^1 \; W_u^2 \; \cdots \; W_u^m] and W_y = [W_y^1 \; W_y^2 \; \cdots \; W_y^p]
are given in the form of

W_u^i = \begin{bmatrix} a_{u(n-1)}^{i1} & a_{u(n-2)}^{i1} & \cdots & a_{u0}^{i1} \\ a_{u(n-1)}^{i2} & a_{u(n-2)}^{i2} & \cdots & a_{u0}^{i2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{u(n-1)}^{in} & a_{u(n-2)}^{in} & \cdots & a_{u0}^{in} \end{bmatrix}, \quad i = 1, 2, \cdots, m,

W_y^i = \begin{bmatrix} a_{y(n-1)}^{i1} & a_{y(n-2)}^{i1} & \cdots & a_{y0}^{i1} \\ a_{y(n-1)}^{i2} & a_{y(n-2)}^{i2} & \cdots & a_{y0}^{i2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{y(n-1)}^{in} & a_{y(n-2)}^{in} & \cdots & a_{y0}^{in} \end{bmatrix}, \quad i = 1, 2, \cdots, p,

whose elements are the coefficients of the numerators in the transfer function matrix
of a Luenberger observer with inputs u_k and y_k, and σ_k = [σ_k^1 \; σ_k^2 \; \cdots \; σ_k^m]^T and
ω_k = [ω_k^1 \; ω_k^2 \; \cdots \; ω_k^p]^T represent the states of the user-defined dynamics driven by
the individual inputs u_k^i and outputs y_k^i as given by

σ_{k+1}^i = A σ_k^i + B u_k^i,  σ_0^i = 0,  i = 1, 2, \cdots, m,
ω_{k+1}^i = A ω_k^i + B y_k^i,  ω_0^i = 0,  i = 1, 2, \cdots, p,

for a Schur matrix A whose eigenvalues coincide with those of A + LC and an input
vector B of the form

A = \begin{bmatrix} -α_{n-1} & -α_{n-2} & \cdots & \cdots & -α_0 \\ 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}, \quad B = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.

Proof From the linear systems theory, it is known that if the pair (A, C) is
observable then a full state observer can be constructed as
 
x̂_{k+1} = A x̂_k + B u_k − L (y_k − C x̂_k)
       = (A + LC) x̂_k + B u_k − L y_k,    (2.13)

where x̂k is the estimate of the state xk and L is the observer gain chosen such that
the matrix A+LC has all its eigenvalues strictly inside the unit circle. This observer
is a dynamic system driven by uk and yk with the dynamics matrix A + LC. This
dynamic system can be written in the filter form by treating both uk and yk as inputs
as follows,

x̂_k = (zI − A − LC)^{-1} B [u_k] − (zI − A − LC)^{-1} L [y_k] + (A + LC)^k x̂_0
    = \sum_{i=1}^{m} \frac{U^i(z)}{Λ(z)} [u_k^i] − \sum_{i=1}^{p} \frac{Y^i(z)}{Λ(z)} [y_k^i] + (A + LC)^k x̂_0
    = \frac{U(z)}{Λ(z)} [u_k] − \frac{Y(z)}{Λ(z)} [y_k] + (A + LC)^k x̂_0,    (2.14)
 
where u_k^i and y_k^i are the i-th input and output, respectively, U = [U^1 \; U^2 \; \cdots \; U^m],
Y = [Y^1 \; Y^2 \; \cdots \; Y^p], and U^i(z) and Y^i(z) are some n-dimensional polynomial
vectors in the operator z. Here the bracket notation [u_k] is used to denote a time
signal being operated on by an operator represented by a transfer function to result
in a new time signal, for example, z^{-n} [u_k] = u_{k-n}. The polynomial matrices U(z)
and Y(z) depend on the quadruple (A, B, C, L). The characteristic polynomial Λ(z)
is given by

Λ(z) = det(zI − A − LC) = z^n + α_{n-1} z^{n-1} + α_{n-2} z^{n-2} + \cdots + α_0.

We now show that the [u_k] and [y_k] terms in (2.14) can be linearly parameterized.
Consider first each input filter term \frac{U^i(z)}{Λ(z)} [u_k^i],

\frac{U^i(z)}{Λ(z)} [u_k^i] = (zI − A − LC)^{-1} B_i [u_k^i],

where Bi is the i th column of B and U(z)i (z)


is an n-dimensional vector of rational
functions, representing n filters each of degree n, which are applied to the input uik .
Therefore, we can express this term as

⎡ ⎤
i1
au(n−1) zn−1 + au(n−2)
i1 zn−2 + · · · + au0
i1
⎢ n ⎥
⎢ z + αn−1 zn−1 + αn−2 zn−2 + · · · + α0 ⎥
⎢ ⎥
⎢ a i2 n−1 + a i2 n−2 + · · · + a i2 ⎥
⎢ u(n−1) z z u0 ⎥
  ⎢ u(n−2)
⎥ 
i
U (z) i ⎢ z + αn−1 z
n n−1 + αn−2 z n−2 + · · · + α0 ⎥
uk = ⎢

⎥ ui
⎥ k
(z) ⎢ .
.. ⎥
⎢ ⎥
⎢ ⎥
⎢ ⎥
⎢ in ⎥
⎣ au(n−1) z n−1 + au(n−2) z
in n−2 + · · · + au0 ⎦
in

zn + αn−1 zn−1 + αn−2 zn−2 + · · · + α0


⎡ ⎤⎡  i ⎤
i1
au(n−1) i1
au(n−2) · · · au0
i1 zn−1
uk
⎢ i2 ⎥⎢ (z)
 ⎥
⎢a i2 ··· au0 ⎥
i2 ⎢z ⎥
⎢ u(n−1) au(n−2)
n−2
⎥ ⎢ (z) uik ⎥
=⎢
⎢ .. ..
⎥⎢
.. ⎥ ⎢


⎢ . ..
⎣ . . . ⎥⎦⎣
⎢ ..
.


1
 i

in
au(n−1) in
au(n−2) ··· in
au0 (z) u k

= Wui σki ,

where W_u^i ∈ R^{n×n} is the parametric matrix corresponding to the coefficients of
U^i(z) for the input u_k^i. Notice that σ_k^i ∈ R^n contains the outputs of n filters when
applied to the input u_k^i. Instead of applying n filters to u_k^i, we can also obtain σ_k^i
using the n-dimensional state space system driven by u_k^i as

σ_{k+1}^i = A σ_k^i + B u_k^i,

where

A = \begin{bmatrix} -α_{n-1} & -α_{n-2} & \cdots & \cdots & -α_0 \\ 1 & 0 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{bmatrix}, \quad B = \begin{bmatrix} 1 \\ 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.

This holds for every input u^i, i = 1, 2, \cdots, m, and thus, we can generate σ_k =
[σ_k^1 \; σ_k^2 \; \cdots \; σ_k^m]^T ∈ R^{mn} with W_u = [W_u^1 \; W_u^2 \; \cdots \; W_u^m] ∈ R^{n×mn}. Similarly, we
can repeat the same procedure for the output terms in (2.14). That is,
ω_{k+1}^i = A ω_k^i + B y_k^i,

which holds for every output y^i, i = 1, 2, \cdots, p, and thus, we can generate ω_k =
[ω_k^1 \; ω_k^2 \; \cdots \; ω_k^p]^T ∈ R^{pn} with W_y = [W_y^1 \; W_y^2 \; \cdots \; W_y^p] ∈ R^{n×pn}. We now show
that the internal dynamics of σ_k and ω_k is asymptotically stable. The eigenvalues of
matrix A are the roots of its characteristic polynomial

det(zI − A) = z^n + α_{n-1} z^{n-1} + α_{n-2} z^{n-2} + \cdots + α_0 = Λ(z).

Note that Λ(z) is the characteristic polynomial of A + LC and is a stable polynomial.
This implies that all the eigenvalues of A are strictly inside the unit circle and,
therefore, the dynamics of σk and ωk is asymptotically stable. Finally, by combining
the input and output terms, we can write (2.14) as,

x̂k = Wu σk + Wy ωk + (A + LC)k x̂0 . (2.15)

Note that A + LC represents the observer error dynamics, that is,

ek = xk − x̂k
= (A + LC)k e0 .

As a result, the state xk can be parameterized as

xk = Wu σk + Wy ωk + (A + LC)k x0 . (2.16)

Since A + LC is Schur stable, the term (A + LC)k x̂0 in (2.15) and the term (A +
LC)k x0 in (2.16) vanish as k → ∞. This completes the proof.
The state parameterization presented in Theorem 2.1 is derived based on the
Luenberger observer (2.13). The parameterization (2.16) contains a transient term
(A + LC)k x0 , which depends on the initial condition. This transient term in turn
depends on the choice of the design matrix L, which assigns the eigenvalues of
matrix A + LC. The user-defined matrix A plays the role of the matrix A + LC in
the parameterization. Ideally, we would like the observer dynamics to be as fast as
possible so that x̄k converges to xk quickly. For discrete-time systems, this can be
achieved by placing all the eigenvalues of matrix A + LC or, equivalently, matrix
A, at 0. That is, the coefficients αi of matrix A are all chosen to be zero. In this
case, it can be verified that, for any x0 , (A + LC)k x0 vanishes in no more than n
time steps. Motivated by this property of discrete-time systems, a special case of the
above parameterization was presented in [56]. This special state parameterization
result is recalled below.

Theorem 2.2 Consider system (2.1). Let the pair (A, C) be observable. Then, the
system state can be uniquely represented in terms of measured input and output as

xk = Wy ȳk−1,k−N + Wu ūk−1,k−N , k ≥ N, (2.17)

where N ≤ n is an upper bound on the system's observability index, ū_{k−1,k−N} ∈ R^{mN} and ȳ_{k−1,k−N} ∈ R^{pN} are the past input and output data vectors defined as

\[
\bar{u}_{k-1,k-N} = \begin{bmatrix} u_{k-1}^T & u_{k-2}^T & \cdots & u_{k-N}^T \end{bmatrix}^T,
\qquad
\bar{y}_{k-1,k-N} = \begin{bmatrix} y_{k-1}^T & y_{k-2}^T & \cdots & y_{k-N}^T \end{bmatrix}^T,
\]

and the parameterization matrices take the special form

\[
W_y = A^N \big(V_N^T V_N\big)^{-1} V_N^T,
\qquad
W_u = U_N - A^N \big(V_N^T V_N\big)^{-1} V_N^T T_N,
\]

with

\[
V_N = \begin{bmatrix} \big(CA^{N-1}\big)^T & \cdots & (CA)^T & C^T \end{bmatrix}^T,
\qquad
U_N = \begin{bmatrix} B & AB & \cdots & A^{N-1}B \end{bmatrix},
\]
\[
T_N = \begin{bmatrix}
0 & CB & CAB & \cdots & CA^{N-2}B \\
0 & 0 & CB & \cdots & CA^{N-3}B \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & 0 & CB \\
0 & 0 & 0 & 0 & 0
\end{bmatrix}.
\]
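A direct way to appreciate Theorem 2.2 is to assemble W_y and W_u from (A, B, C) and check the identity (2.17) against simulated data. The following sketch (Python/NumPy) does exactly that; the test system, the choice N = n = 2, and the random input are assumptions of the sketch.

```python
import numpy as np

# Sketch: check the dead-beat parameterization (2.17) on an assumed SISO example.
A = np.array([[1.1, -0.3], [1.0, 0.0]])
B = np.array([[1.0], [0.0]])
C = np.array([[1.0, -0.8]])
n = A.shape[0]; N = n

VN = np.vstack([C @ np.linalg.matrix_power(A, N - 1 - i) for i in range(N)])
UN = np.hstack([np.linalg.matrix_power(A, i) @ B for i in range(N)])
TN = np.zeros((N, N))
for row in range(N):                        # block row for y_{k-1-row}
    for col in range(row + 1, N):           # depends on u_{k-1-col}
        TN[row, col] = (C @ np.linalg.matrix_power(A, col - row - 1) @ B).item()

pinvV = np.linalg.inv(VN.T @ VN) @ VN.T     # left inverse of VN (observability)
Wy = np.linalg.matrix_power(A, N) @ pinvV
Wu = UN - Wy @ TN

# Simulate and verify for k >= N.
rng = np.random.default_rng(0)
x = np.array([[0.7], [-0.2]])
us, ys = [], []
for k in range(12):
    u = rng.standard_normal((1, 1))
    ys.append(C @ x); us.append(u)
    if k >= N:
        ubar = np.vstack([us[k - j] for j in range(1, N + 1)])   # [u_{k-1}; ...; u_{k-N}]
        ybar = np.vstack([ys[k - j] for j in range(1, N + 1)])
        assert np.allclose(x, Wy @ ybar + Wu @ ubar)
    x = A @ x + B @ u
print("x_k reconstructed exactly from the last N input/output samples")
```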

Remark 2.1 The parameterization matrices Wu and Wy in (2.17) are the same as
Wu and Wy in (2.12) if N = n and all eigenvalues of matrix A or, equivalently,
matrix A + LC, used in Theorem 2.1 are zero. This can be seen as follows. Recall
from the proof of Theorem 2.1 that the state can be represented by

xk = Wu σk + Wy ωk + (A + LC)k x0 . (2.18)

If all eigenvalues of matrix A + LC are zero, then its characteristic polynomial is Λ(z) = z^n. Then, from the definition of σ_k and ω_k, we have

\[
σ_k = \begin{bmatrix}
\frac{z^{n-1}}{Λ(z)}[u_k] \\
\frac{z^{n-2}}{Λ(z)}[u_k] \\
\vdots \\
\frac{1}{Λ(z)}[u_k]
\end{bmatrix} = \bar{u}_{k-1,k-n},
\qquad
ω_k = \begin{bmatrix}
\frac{z^{n-1}}{Λ(z)}[y_k] \\
\frac{z^{n-2}}{Λ(z)}[y_k] \\
\vdots \\
\frac{1}{Λ(z)}[y_k]
\end{bmatrix} = \bar{y}_{k-1,k-n}.
\]

Given that A + LC has all its eigenvalues at zero, A + LC is a nilpotent matrix. Then, (A + LC)^k = 0 for k ≥ n. Note that here we consider only the full order observer case by choosing the nilpotency index N = n, for which (2.18) reduces to

\[
x_k = W_u \bar{u}_{k-1,k-N} + W_y \bar{y}_{k-1,k-N}, \quad k ≥ N,
\]

which is the parameterization (2.17) in Theorem 2.2. Hence, Theorem 2.2 is a special case of Theorem 2.1 when all eigenvalues of matrix A + LC or, equivalently, matrix A, in Theorem 2.1 are zero and the upper bound N of the observability index is chosen as n.
While the observability of (A, C) is necessary to guarantee the convergence of the state parameterization to the actual state, the rank of the parameterization matrix W = [W_u W_y] depends on the observer dynamics matrix as well as the system matrices. We will analyze the full row rank condition of matrix W for the parameterization (2.12), of which (2.16) is a special case. This rank condition is needed for establishing the convergence of the output feedback algorithms to the optimal solution in Theorems 2.6 and 2.7. To this end, we present the following result.
 
Theorem 2.3 The state parameterization matrix W = Wu Wy in (2.12) is of full
row rank if (A + LC, B) or (A + LC, L) is controllable.
Proof We note that the full row rank of either Wu or Wy suffices for W to be of full
row rank. Consider first the matrix Wu associated with the input. Recall that Wu can
be obtained from (2.14) as follows,

\[
(zI - (A + LC))^{-1} B[u_k]
= \frac{D_{n-1}z^{n-1} + D_{n-2}z^{n-2} + \cdots + D_0}{z^n + α_{n-1}z^{n-1} + α_{n-2}z^{n-2} + \cdots + α_0}\, B[u_k]
= \begin{bmatrix} D_0 B & D_1 B & \cdots & D_{n-1}B \end{bmatrix}
\begin{bmatrix}
\frac{1}{Λ(z)}[u_k] \\
\frac{z}{Λ(z)}[u_k] \\
\vdots \\
\frac{z^{n-1}}{Λ(z)}[u_k]
\end{bmatrix}
= W_u σ_k.
\]

Here the matrices D_i contain the coefficients of the adjoint matrix. It can be verified that we can express D_i in terms of the matrix A + LC and the coefficients of its characteristic polynomial Λ(z) as follows,

\[
\begin{aligned}
D_{n-1} &= I,\\
D_{n-2} &= (A + LC) + α_{n-1} I,\\
D_{n-3} &= (A + LC)^2 + α_{n-1}(A + LC) + α_{n-2} I,\\
&\;\;\vdots\\
D_0 &= (A + LC)^{n-1} + α_{n-1}(A + LC)^{n-2} + \cdots + α_2 (A + LC) + α_1 I.
\end{aligned}
\]

Substituting the above Di ’s in the expression for Wu and analyzing the rank of the
resulting expression for Wu , we have

ρ(Wu )=ρ (A + LC)n−1 B + αn−1 (A + LC)n−2 B + · · · + α2 (A + LC)B + α1 B,

· · · (A + LC)B + αn−1 B B .

The terms of αi can be eliminated by performing column operations to result in


 
ρ(Wu ) = ρ (A + LC)n−1 B · · · (A + LC)B B ,

which is the controllability condition of the pair (A + LC, B). Thus, the controlla-
bility of the pair (A + LC, B) implies full row rank of matrix Wu and hence full row
rank of matrix W .
A similar analysis of the matrix Wy yields that the controllability of the pair
(A + LC, L) would imply full row rank of matrix Wy and hence full row rank of
matrix W . This completes the proof.
We note that the controllability condition of (A + LC, B) or (A + LC, L) in
Theorem 2.3 is difficult to verify since it involves the observer gain matrix L,
whose determination requires the knowledge of the system dynamics. Under the
observability condition of (A, C), even though L can be chosen to place eigenvalues
of matrix A + LC arbitrarily, it is not easy to choose an L that satisfies the

conditions of Theorem 2.3. As a result, in a model-free setting, we would not rely


on Theorem 2.3 to guarantee full row rank of matrix W . It is worth pointing out that
we do not design L for the state parameterization. Instead, we form a user-defined
dynamics A that contains the desired eigenvalues of matrix A + LC. As a result, we
need a condition in terms of these eigenvalues instead of the matrix L. The following
result establishes this condition.
Theorem 2.4 The parameterization matrix W is of full row rank if matrices A and
A + LC have no common eigenvalues.
Proof By Theorem 2.3, matrix Wy , and hence matrix W , are of full row rank if the
pair (A + LC, L) is controllable. We will show that, if matrices A and A + LC have
no common eigenvalues, then the pair (A + LC, L) is indeed controllable. By the
Popov–Belevitch–Hautus (PBH) test, the pair (A + LC, L) loses controllability if and only if q^T [A + LC − λI   L] = 0 for a left eigenvector q associated with an eigenvalue λ of A + LC. Consequently, q^T(A + LC) = λq^T and q^T L = 0, and, therefore, q^T A = λq^T. Consequently, λ must also be an eigenvalue of A if the pair (A + LC, L) is not controllable. This completes the proof.

Remark 2.2 In contrast to the parameterization (2.12), the parameterization in (2.16) is simpler and gives a dead-beat response as the eigenvalues in this case are all fixed to zero. However, if the system dynamics matrix A happens to have a zero eigenvalue, then W_y may not have full row rank. In other words, the general parameterization (2.12) gives us extra flexibility in satisfying conditions that would ensure full row rank of matrix W.

2.3.4 Output Feedback Q-function for LQR

We now proceed to apply the state parameterization (2.16) to describe the Q-


function in (2.9) in terms of the input and output of the system. It can be easily
verified that the substitution of the parameterization x̄k in (2.12) for xk in (2.9)
results in
\[
Q_K = \begin{bmatrix} σ_k \\ ω_k \\ u_k \end{bmatrix}^T
\begin{bmatrix}
H_{σσ} & H_{σω} & H_{σu} \\
H_{ωσ} & H_{ωω} & H_{ωu} \\
H_{uσ} & H_{uω} & H_{uu}
\end{bmatrix}
\begin{bmatrix} σ_k \\ ω_k \\ u_k \end{bmatrix}
= z_k^T H z_k,  (2.19)
\]

where

\[
z_k = \begin{bmatrix} σ_k^T & ω_k^T & u_k^T \end{bmatrix}^T,
\]

H = H^T ∈ R^{(mn+pn+m)×(mn+pn+m)},

and the submatrices are given as

\[
\begin{aligned}
H_{σσ} &= W_u^T (Q + A^T P A) W_u ∈ R^{mn×mn},\\
H_{σω} &= W_u^T (Q + A^T P A) W_y ∈ R^{mn×pn},\\
H_{σu} &= W_u^T A^T P B ∈ R^{mn×m},\\
H_{ωσ} &= W_y^T (Q + A^T P A) W_u ∈ R^{pn×mn},\\
H_{ωω} &= W_y^T (Q + A^T P A) W_y ∈ R^{pn×pn}, \qquad (2.20)\\
H_{ωu} &= W_y^T A^T P B ∈ R^{pn×m},\\
H_{uσ} &= B^T P A W_u ∈ R^{m×mn},\\
H_{uω} &= B^T P A W_y ∈ R^{m×pn},\\
H_{uu} &= R + B^T P B ∈ R^{m×m}.
\end{aligned}
\]

Given the optimal cost function V ∗ with the cost matrix P ∗ , we obtain the
corresponding optimal output feedback matrix H ∗ by substituting P = P ∗ in (2.20).
It is worth recalling that the notion of optimality in our context is with respect to the
controller since there are an infinite number of state parameterizations depending
on the choice of user-defined dynamics matrix A. However, in the discrete-time
setting, it is common to select A to have all zero eigenvalues, which gives the dead-
beat response due to the finite-time convergence property of the discrete-time state
parameterization. In this case, the optimal output feedback Q-function is given by

Q∗ = zkT H ∗ zk , (2.21)

which provides a state-free representation of the LQR Q-function. Solving

\[
\frac{∂Q^*}{∂u_k} = 0
\]

for u_k results in the desired control law,

\[
u_k^* = -\big(H_{uu}^*\big)^{-1}\big(H_{uσ}^* σ_k + H_{uω}^* ω_k\big)
= \bar{K}^* \begin{bmatrix} σ_k^T & ω_k^T \end{bmatrix}^T.  (2.22)
\]

This control law solves the LQR output feedback control problem without requiring
access to the state xk .
We now show the relation between the presented output feedback Q-function and
the output feedback value function. The output feedback value function as used is
given by

\[
V = \begin{bmatrix} σ_k \\ ω_k \end{bmatrix}^T \bar{P} \begin{bmatrix} σ_k \\ ω_k \end{bmatrix},  (2.23)
\]

where

\[
\bar{P} = \begin{bmatrix}
W_u^T P W_u & W_u^T P W_y \\
W_y^T P W_u & W_y^T P W_y
\end{bmatrix}.
\]

The value function (2.23), by definition (2.6), gives the cost of executing the policy

\[
\bar{K} = K \begin{bmatrix} W_u & W_y \end{bmatrix}
= -\left(H_{uu}\right)^{-1} \begin{bmatrix} H_{uσ} & H_{uω} \end{bmatrix}.
\]

Using the relation Q_K(x_k, Kx_k) = V_K(x_k), the output feedback value function matrix P̄ can be readily obtained as

\[
\bar{P} = \begin{bmatrix} I & \bar{K}^T \end{bmatrix} H \begin{bmatrix} I & \bar{K}^T \end{bmatrix}^T.  (2.24)
\]

Theorem 2.5 The output feedback law (2.22) is the steady-state equivalent of the
optimal LQR control law (2.4).
Proof Consider the output feedback control law (2.22). Substitution of H_{uσ}^* = B^T P^* A W_u, H_{uω}^* = B^T P^* A W_y, and H_{uu}^* = R + B^T P^* B in (2.22) results in

\[
u_k^* = -\big(R + B^T P^* B\big)^{-1}\big(B^T P^* A W_u σ_k + B^T P^* A W_y ω_k\big).
\]

By Theorem 2.1, the parameterization (2.12) in the steady-state reduces to

xk = Wu σk + Wy ωk ,

since A + LC is Schur stable. Thus, the output feedback controller (2.22) is the
steady-state equivalent of
\[
u_k^* = -\big(R + B^T P^* B\big)^{-1} B^T P^* A x_k,
\]

which is the optimal state feedback control law (2.4). This completes the proof.
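To make the structure of (2.20) and (2.22) concrete, the sketch below (Python/NumPy/SciPy, reusing the dead-beat parameterization of Theorem 2.2 on an assumed second-order SISO example) assembles H from a solution P of the ARE and verifies the statement of Theorem 2.5 algebraically, namely that the output feedback gain equals the state feedback LQR gain composed with [W_u W_y].

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Sketch (assumed 2nd-order SISO example): build H of (2.20) and check (2.22).
A = np.array([[1.1, -0.3], [1.0, 0.0]])
B = np.array([[1.0], [0.0]])
C = np.array([[1.0, -0.8]])
Qy, R = np.eye(1), np.eye(1)
n = 2; N = n

# Dead-beat parameterization matrices (Theorem 2.2).
VN = np.vstack([C @ np.linalg.matrix_power(A, N - 1 - i) for i in range(N)])
UN = np.hstack([np.linalg.matrix_power(A, i) @ B for i in range(N)])
TN = np.zeros((N, N)); TN[0, 1] = (C @ B).item()
Wy = np.linalg.matrix_power(A, N) @ np.linalg.inv(VN.T @ VN) @ VN.T
Wu = UN - Wy @ TN

# Optimal cost matrix and the blocks of H in (2.20).
Q = C.T @ Qy @ C
P = solve_discrete_are(A, B, Q, R)
M = Q + A.T @ P @ A
H = np.block([
    [Wu.T @ M @ Wu, Wu.T @ M @ Wy, Wu.T @ A.T @ P @ B],
    [Wy.T @ M @ Wu, Wy.T @ M @ Wy, Wy.T @ A.T @ P @ B],
    [B.T @ P @ A @ Wu, B.T @ P @ A @ Wy, R + B.T @ P @ B]])

Huu = H[-1:, -1:]
Kbar = -np.linalg.solve(Huu, H[-1:, :-1])              # output feedback gain (2.22)
Kx = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)    # state feedback LQR gain (2.4)
# Steady-state equivalence (Theorem 2.5): Kbar = Kx [Wu Wy].
assert np.allclose(Kbar, Kx @ np.hstack([Wu, Wy]))
print("output feedback gain matches the LQR state feedback gain")
```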

2.3.5 Output Feedback Based Q-learning for the LQR Problem

In this subsection, we present an output feedback based Q-learning scheme.


Consider the model-based output feedback control law (2.22), in which the optimal
matrix H ∗ is obtained by using information of (A, B, C) as defined in (2.20). We
use the Bellman optimality principle to derive the recursive form of the Q-function
so that the reinforcement learning techniques may be applied. From (2.7) and the
definition of (2.8), we have the following relationship,

QK (xk , Kxk ) = VK (xk ). (2.25)

Based on (2.25), substitution of QK (xk+1 , Kxk+1 ) for VK (xk+1 ) in (2.8) results in


the following recursive relationship,

QK (xk , uk ) = xkT Qxk + uTk Ruk + QK (xk+1 , Kxk+1 ), (2.26)

which is referred to as the LQR Bellman Q-learning equation. Equation (2.26)


is state dependent. We can go one step further and use the output feedback Q-
function (2.19) to result in

\[
z_k^T H z_k = y_k^T Q_y y_k + u_k^T R u_k + z_{k+1}^T H z_{k+1},  (2.27)
\]

where we have used

xkT Qxk = ykT Qy yk , (2.28)

because xk is not available. Therefore, we apply weights to the available output


yk = Cxk only, in which case Q = C T Qy C. In (2.27), uk+1 is computed as

uk+1 = −(Huu )−1 (Huσ σk+1 + Huω ωk+1 ) . (2.29)

It should be noted at this point that the output feedback learning equation (2.27)
is precise for k ≥ N when A + LC is nilpotent, which is always achievable since
(A, C) is observable. The Q-function matrix H is unknown and is to be learned. We
can separate H by parameterizing (2.19) as

QK = H̄ T z̄k , (2.30)

where

H̄ = vec(H )

= [h11 2h12 · · · 2h1l h22 2h23 · · · 2h2l · · · hll ]T ,



with l = mn + pn + m. The regression vector z̄_k ∈ R^{l(l+1)/2} is made up of quadratic basis functions as follows,

\[
\bar{z}_k = z_k \otimes z_k
= \begin{bmatrix} z_{k1}^2 & z_{k1}z_{k2} & \cdots & z_{k1}z_{kl} & z_{k2}^2 & z_{k2}z_{k3} & \cdots & z_{k2}z_{kl} & \cdots & z_{kl}^2 \end{bmatrix}^T,
\]

where z_k = [z_{k1} z_{k2} · · · z_{kl}]^T. This yields the following equation,

\[
\bar{H}^T \bar{z}_k = y_k^T Q_y y_k + u_k^T R u_k + \bar{H}^T \bar{z}_{k+1}.  (2.31)
\]

Equation (2.31) is a linear equation and can be written as

\[
Φ^T \bar{H} = Υ,  (2.32)
\]

where Φ ∈ R^{(l(l+1)/2)×L} and Υ ∈ R^{L×1} are the data matrices defined by

\[
Φ = \begin{bmatrix} \bar{z}_k^1 - \bar{z}_{k+1}^1 & \bar{z}_k^2 - \bar{z}_{k+1}^2 & \cdots & \bar{z}_k^L - \bar{z}_{k+1}^L \end{bmatrix},
\qquad
Υ = \begin{bmatrix} r^1(y_k, u_k) & r^2(y_k, u_k) & \cdots & r^L(y_k, u_k) \end{bmatrix}^T,
\]

and H̄ is the unknown vector to be found. In order to solve this linear equation, we require at least L ≥ l(l + 1)/2 data samples.
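As an illustration of the bookkeeping behind (2.30)-(2.32), the sketch below (Python/NumPy) builds the quadratic basis z̄_k, with the 2h_{ij} weighting absorbed into H̄, and solves the least-squares problem. The helper names and the synthetic data are assumptions of this sketch; the targets are generated from a known H purely to exercise the regression.

```python
import numpy as np

# Sketch: quadratic-basis parameterization and the least-squares solve.
def vec_sym(H):
    """Stack H as [h11, 2h12, ..., 2h1l, h22, 2h23, ..., hll]."""
    l = H.shape[0]
    return np.concatenate([[H[i, i]] + [2 * H[i, j] for j in range(i + 1, l)]
                           for i in range(l)])

def quad_basis(z):
    """zbar = [z1^2, z1 z2, ..., z1 zl, z2^2, ..., zl^2]."""
    z = z.ravel(); l = z.size
    return np.concatenate([[z[i] * z[j] for j in range(i, l)] for i in range(l)])

rng = np.random.default_rng(1)
l = 5                                    # l = mn + pn + m in the text
H = rng.standard_normal((l, l)); H = (H + H.T) / 2
Hbar = vec_sym(H)

# Identity check: Hbar^T zbar = z^T H z for any z.
z = rng.standard_normal(l)
assert np.isclose(Hbar @ quad_basis(z), z @ H @ z)

# Recover Hbar from L >= l(l+1)/2 samples of Bellman-difference data.
L = 40
Z = rng.standard_normal((L, l))
Znext = rng.standard_normal((L, l))
Phi = np.stack([quad_basis(zi) - quad_basis(zj) for zi, zj in zip(Z, Znext)], axis=1)
Upsilon = np.array([zi @ H @ zi - zj @ H @ zj for zi, zj in zip(Z, Znext)])
Hbar_hat = np.linalg.solve(Phi @ Phi.T, Phi @ Upsilon)     # least-squares solution
assert np.allclose(Hbar_hat, Hbar, atol=1e-6)
print("Hbar recovered from", L, "samples; l(l+1)/2 =", l * (l + 1) // 2)
```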
In what follows, we present reinforcement learning techniques of policy iteration
and value iteration to learn the output feedback LQR control law.
The policy iteration algorithm, Algorithm 2.1, consists of two steps. The policy
evaluation step uses the parameterized learning equation (2.31) to solve for H̄ by
collecting L ≥ l(l + 1)/2 observations of uk , yk , σk , ωk , σk+1 , and ωk+1 to form the
data matrices. Then, the solution of (2.31) is obtained as

\[
\bar{H}^j = \big(Φ Φ^T\big)^{-1} Φ\, Υ,  (2.33)
\]

where H̄^j gives the Q-function matrix associated with the jth policy. In the policy update step, we obtain a better policy u_k^{j+1} by minimizing the Q-function of the jth policy. From the estimation perspective, a difficulty arises due to the linear dependence of u_k in (2.22) on σ_k and ω_k. On the other hand, the data matrix Φ already has columns of σ_k and ω_k. As a result, the column entries corresponding to u_k become linearly dependent as they are formed by a linear combination of σ_k and ω_k using matrix H. It implies that ΦΦ^T is singular. To make ΦΦ^T nonsingular and the solution (2.33) unique, we add an excitation signal in u_k. That is, we need to satisfy the following rank condition,

\[
\operatorname{rank}(Φ) = l(l + 1)/2.  (2.34)
\]



Algorithm 2.1 Output feedback Q-learning policy iteration algorithm for the LQR problem
input: input-output data
output: H ∗
1: initialize. Select an admissible policy u_k^0. Set j ← 0.
2: repeat
3: policy evaluation. Solve the following Bellman equation for H̄^j,

\[
\big(\bar{H}^j\big)^T (\bar{z}_k - \bar{z}_{k+1}) = y_k^T Q_y y_k + u_k^T R u_k.
\]

4: policy update. Find an improved policy as

\[
u_k^{j+1} = -\big(H_{uu}^j\big)^{-1}\big(H_{uσ}^j σ_k + H_{uω}^j ω_k\big).
\]

5: j ← j + 1
6: until ‖H̄^j − H̄^{j−1}‖ < ε for some small ε > 0.

Examples of excitation signals include sinusoidal signals, exponentially decaying


signals and Gaussian noise. We will be mostly using additive sinusoidal signals of
different frequencies and magnitudes in the control signal for this purpose.
Remark 2.3 It should be noted that the excitation noise is essential in the learning
process, which enables the learning of new policies that may turn out to be better
than the current policy. However, there always is a trade-off between exploration and
exploitation of the current policies because the performance during the exploration
phase may not be as good as that of the policies already learned. A comprehensive
discussion on the excitation condition is available in adaptive control texts [41, 116].
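For readers who prefer to see Algorithm 2.1 in code, the following sketch (Python/NumPy) outlines one policy evaluation and policy update cycle in a model-free fashion: it only consumes logged tuples (σ_k, ω_k, u_k, y_k, σ_{k+1}, ω_{k+1}) and the current policy gain. The data format, probing-noise design, and helper names are assumptions of this sketch rather than the book's implementation.

```python
import numpy as np

# Sketch of one iteration of Algorithm 2.1.  `data` is a list of tuples
# (sigma_k, omega_k, u_k, y_k, sigma_k1, omega_k1) of 1-D arrays logged from
# the plant under the behavior policy plus probing noise.
def quad_basis(z):
    l = z.size
    return np.concatenate([[z[i] * z[j] for j in range(i, l)] for i in range(l)])

def unvec_sym(hbar, l):
    H, idx = np.zeros((l, l)), 0
    for i in range(l):
        for j in range(i, l):
            H[i, j] = H[j, i] = hbar[idx] if i == j else hbar[idx] / 2
            idx += 1
    return H

def pi_step(data, Kbar, Qy, R):
    """Evaluate the policy Kbar, then return (H estimate, improved Kbar)."""
    cols, targets = [], []
    for sig, om, u, y, sig1, om1 in data:
        u1 = Kbar @ np.concatenate([sig1, om1])        # on-policy action, cf. (2.29)
        zk = np.concatenate([sig, om, u])
        zk1 = np.concatenate([sig1, om1, u1])
        cols.append(quad_basis(zk) - quad_basis(zk1))  # Bellman difference
        targets.append(float(y @ Qy @ y + u @ R @ u))
    Phi, Upsilon = np.stack(cols, axis=1), np.array(targets)
    hbar = np.linalg.solve(Phi @ Phi.T, Phi @ Upsilon) # least squares, cf. (2.33)
    l, m = zk.size, u.size
    H = unvec_sym(hbar, l)
    return H, -np.linalg.solve(H[-m:, -m:], H[-m:, :-m])   # policy update, cf. (2.22)
```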
Finally, the convergence of H̄^j can be checked as follows,

\[
\big\| \bar{H}^j - \bar{H}^{j-1} \big\| < ε,  (2.35)
\]

where ε > 0 is some small constant. The convergence of the output feedback PI
algorithm is established in the theorem below.
Theorem 2.6 Let (A, B) be controllable, (A, √Q) be observable and u_k^0 be a stabilizing initial control. Then, the sequence of policies {K̄^j}, j = 1, 2, 3, . . . , converges to the optimal output feedback policy K̄^* as j → ∞ provided that the state parameterization matrix W is of full row rank and the rank condition (2.34) holds.
Proof By Theorem 2.5, the optimal output feedback control law (2.22) converges
to the optimal state feedback control law (2.4). Thus, we need to show that the
policy iterations on the output feedback cost matrix P̄ (or the Q-function matrix H )

and output control matrix K̄ converge to their optimal values. Recall the following
iterations from Algorithm 2.1,
\[
z_k^T H^j z_k = y_k^T Q_y y_k + u_k^T R u_k + z_{k+1}^T H^j z_{k+1},  (2.36)
\]
\[
\bar{K}^{j+1} = -\big(H_{uu}^j\big)^{-1}\begin{bmatrix} H_{uσ}^j & H_{uω}^j \end{bmatrix}.  (2.37)
\]

If the rank condition (2.34) holds, then we can solve a system of equations based on Equations (2.36) and (2.37) to obtain H^j and K̄^{j+1}. From the definition of H, we have

\[
H^j = \begin{bmatrix}
W^T\big(Q + A^T P^j A\big) W & W^T A^T P^j B \\
B^T P^j A W & R + B^T P^j B
\end{bmatrix}.  (2.38)
\]

Using the relationship between P̄ and H from (2.24), we have

\[
\bar{P}^j = \begin{bmatrix} I & \bar{K}^{jT} \end{bmatrix} H^j \begin{bmatrix} I & \bar{K}^{jT} \end{bmatrix}^T.
\]

Substituting (2.38) in the above equation results in

\[
\bar{P}^j = W^T\big(Q + A^T P^j A\big)W + \bar{K}^{jT} B^T P^j A W + W^T A^T P^j B \bar{K}^j
+ \bar{K}^{jT}\big(R + B^T P^j B\big)\bar{K}^j.
\]

Using the definitions K̄ = KW and P̄ = W^T P W, we have

\[
\begin{aligned}
W^T P^j W &= W^T\big(Q + A^T P^j A\big)W + W^T K^{jT} B^T P^j A W + W^T A^T P^j B K^j W
+ W^T K^{jT}\big(R + B^T P^j B\big)K^j W \\
&= W^T Q W + W^T\big(A + BK^j\big)^T P^j \big(A + BK^j\big) W + W^T K^{jT} R K^j W.
\end{aligned}
\]

By Theorem 2.3 (or Theorem 2.4) we have the full row rank of W, which results in

\[
P^j = Q + \big(A + BK^j\big)^T P^j \big(A + BK^j\big) + K^{jT} R K^j.  (2.39)
\]

Note that the policy update

\[
\bar{K}^{j+1} = -\big(R + B^T P^j B\big)^{-1} B^T P^j A W  (2.40)
\]

corresponds to the state feedback policy update K^{j+1} based on K = K̄ W^T(W W^T)^{-1}, given that W is of full row rank. Then, Equations (2.39) and (2.40) are equivalent to Hewer's algorithm [35] and, under the controllability condition on (A, B) and the observability condition on (A, √Q), converge to P^* and K^* as j → ∞ with A + BK^j being Schur stable for j = 0, 1, 2, · · · . This implies that P̄^j converges to P̄^*, and therefore, from the definition of H in (2.20), we have the convergence of H^j to H^*. Then, by the relation

\[
\bar{K} = -\left(H_{uu}\right)^{-1}\begin{bmatrix} H_{uσ} & H_{uω} \end{bmatrix},
\]

we have the convergence of K̄ to K̄ ∗ . Therefore, the output feedback Q-learning


policy iteration algorithm converges to the optimal controller as j → ∞. This
completes the proof.
Algorithm 2.1 requires a stabilizing control to start with, which can be quite
restrictive when the system itself is open-loop unstable. To obviate this requirement,
we present the value iteration algorithm, Algorithm 2.2.
The VI and PI methods differ in their selection of the initial policy and in the policy evaluation step. The VI method does not require a stabilizing control to start with and performs recursive updates on the unknown parameter vector. In the case of VI, the data matrices Φ ∈ R^{(l(l+1)/2)×L} and Υ ∈ R^{L×1} are defined by

\[
Φ = \begin{bmatrix} \bar{z}_k^1 & \bar{z}_k^2 & \cdots & \bar{z}_k^L \end{bmatrix},
\qquad
Υ = \begin{bmatrix} r^1(y_k, u_k) + \big(\bar{H}^j\big)^T \bar{z}_{k+1}^1 & r^2(y_k, u_k) + \big(\bar{H}^j\big)^T \bar{z}_{k+1}^2 & \cdots & r^L(y_k, u_k) + \big(\bar{H}^j\big)^T \bar{z}_{k+1}^L \end{bmatrix}^T.
\]

Algorithm 2.2 Output feedback Q-learning value iteration algorithm for the LQR problem
input: input-output data
output: H ∗
1: initialize. Select an arbitrary policy u_k^0 and H^0 ≥ 0. Set j ← 0.
2: repeat
3: value update. Solve the Bellman equation for H̄^{j+1},

\[
\big(\bar{H}^{j+1}\big)^T \bar{z}_k = y_k^T Q_y y_k + u_k^T R u_k + \big(\bar{H}^j\big)^T \bar{z}_{k+1}.
\]

4: policy update. Find an improved policy as

\[
u_k^{j+1} = -\big(H_{uu}^{j+1}\big)^{-1}\big(H_{uσ}^{j+1} σ_k + H_{uω}^{j+1} ω_k\big).
\]

5: j ← j + 1
6: until ‖H̄^j − H̄^{j−1}‖ < ε for some small ε > 0.
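Before returning to these data matrices, note that the value update of Algorithm 2.2 differs from the PI sketch shown earlier only in how the regression targets are formed. The fragment below (Python/NumPy) highlights that difference; the data format and helper functions mirror the earlier sketch and are assumptions, not the book's code.

```python
import numpy as np

# Sketch of one iteration of Algorithm 2.2 (value update, then policy update).
def quad_basis(z):
    l = z.size
    return np.concatenate([[z[i] * z[j] for j in range(i, l)] for i in range(l)])

def unvec_sym(hbar, l):
    H, idx = np.zeros((l, l)), 0
    for i in range(l):
        for j in range(i, l):
            H[i, j] = H[j, i] = hbar[idx] if i == j else hbar[idx] / 2
            idx += 1
    return H

def vi_step(data, hbar_j, Kbar_j, Qy, R):
    cols, targets = [], []
    for sig, om, u, y, sig1, om1 in data:
        u1 = Kbar_j @ np.concatenate([sig1, om1])      # policy action at k+1, cf. (2.29)
        zk = np.concatenate([sig, om, u])
        zk1 = np.concatenate([sig1, om1, u1])
        cols.append(quad_basis(zk))                    # Phi holds zbar_k only in VI
        targets.append(float(y @ Qy @ y + u @ R @ u) + float(hbar_j @ quad_basis(zk1)))
    Phi, Upsilon = np.stack(cols, axis=1), np.array(targets)
    hbar_next = np.linalg.solve(Phi @ Phi.T, Phi @ Upsilon)   # recursive value update
    l, m = zk.size, u.size
    H = unvec_sym(hbar_next, l)
    return hbar_next, -np.linalg.solve(H[-m:, -m:], H[-m:, :-m])  # improved policy
```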

These matrices are used to obtain the least-squares solution given by (2.33). The
rank condition (2.34) must be met by the addition of an excitation noise in the
control uk . The convergence of the output feedback VI algorithm is established in
the following theorem.
Theorem 2.7 Let (A, B) be controllable and (A, √Q) be observable. Then, the sequence of policies {K̄^j}, j = 1, 2, 3, . . . , converges to the optimal output feedback policy K̄^* as j → ∞ provided that the state parameterization matrix W is of full row rank and the rank condition (2.34) holds.
Proof By Theorem 2.5, the optimal output feedback control law (2.22) converges
to the optimal state feedback control law (2.4). Thus, we need to show that the value
iterations on the output feedback cost matrix P̄ (or the Q-function matrix H ) and
the output feedback control matrix K̄ converge to their optimal values. Recall the
following recursion from Algorithm 2.2,
\[
z_k^T H^{j+1} z_k = y_k^T Q_y y_k + u_k^T R u_k + z_{k+1}^T H^j z_{k+1}.  (2.41)
\]

If the rank condition (2.34) holds, then we can solve Equation (2.41) to obtain H^{j+1}. Based on Theorem 2.1 (or Theorem 2.2) and Equations (2.26) and (2.27), the terms on the right-hand side of Equation (2.41) can be written as

\[
z_k^T H^{j+1} z_k = y_k^T Q_y y_k + u_k^T R u_k + x_{k+1}^T P^j x_{k+1}
= z_k^T \begin{bmatrix}
W^T\big(Q + A^T P^j A\big)W & W^T A^T P^j B \\
B^T P^j A W & R + B^T P^j B
\end{bmatrix} z_k,
\]

which results in the following recursion,

\[
H^{j+1} = \begin{bmatrix}
W^T\big(Q + A^T P^j A\big)W & W^T A^T P^j B \\
B^T P^j A W & R + B^T P^j B
\end{bmatrix}.  (2.42)
\]

Using the relationship between P̄ and H from (2.24), we have

\[
\bar{P}^{j+1} = \begin{bmatrix} I & \bar{K}^{(j+1)T} \end{bmatrix} H^{j+1} \begin{bmatrix} I & \bar{K}^{(j+1)T} \end{bmatrix}^T.
\]

Substituting (2.42) in the above equation results in

\[
\bar{P}^{j+1} = W^T\big(Q + A^T P^j A\big)W + \bar{K}^{(j+1)T} B^T P^j A W + W^T A^T P^j B \bar{K}^{j+1}
+ \bar{K}^{(j+1)T}\big(R + B^T P^j B\big)\bar{K}^{j+1}.
\]

Since

\[
\bar{K}^{j+1} = -\big(H_{uu}^{j+1}\big)^{-1}\begin{bmatrix} H_{uσ}^{j+1} & H_{uω}^{j+1} \end{bmatrix}
= -\big(R + B^T P^j B\big)^{-1} B^T P^j A W,
\]

and

\[
\bar{P}^j = W^T P^j W,
\]

the above equation reduces to

\[
W^T P^{j+1} W = W^T\big(Q + A^T P^j A\big)W - W^T A^T P^j B \big(R + B^T P^j B\big)^{-1} B^T P^j A W.
\]

From Theorem 2.3 (or Theorem 2.4), we have the full row rank of W, which results in

\[
P^{j+1} = Q + A^T P^j A - A^T P^j B \big(R + B^T P^j B\big)^{-1} B^T P^j A.
\]

The above equation gives recursions in terms of the ARE (2.5) that converge to P^* as j → ∞ under the standard controllability and observability assumptions on (A, B) and (A, √Q), respectively [52]. This implies that P̄^j converges to P̄^*, and therefore, from the definition of H in (2.20), we have the convergence of H^j to H^*. Then, by the relation

\[
\bar{K} = -\left(H_{uu}\right)^{-1}\begin{bmatrix} H_{uσ} & H_{uω} \end{bmatrix},
\]

we have the convergence of K̄ to K̄^*. Therefore, the output feedback Q-learning value iteration algorithm converges to the optimal controller as j → ∞. This completes the proof.
We now show that the Q-learning scheme presented in this subsection is immune
to the excitation noise bias.
Theorem 2.8 The output feedback Q-learning scheme does not incur bias in the
parameter estimates.
Proof Let û_k = u_k + ν_k be the control input with excitation noise ν_k. The corresponding Q-function is given by

\[
Q_K(x_k, \hat{u}_k) = r(x_k, \hat{u}_k) + V_K(x_{k+1}).  (2.43)
\]

Let Ĥ be the estimate of the Q-function matrix H obtained using û. From (2.9), it follows that

\[
\begin{bmatrix} x_k \\ \hat{u}_k \end{bmatrix}^T \hat{H} \begin{bmatrix} x_k \\ \hat{u}_k \end{bmatrix}
= r(x_k, \hat{u}_k) + \big(Ax_k + B\hat{u}_k\big)^T P \big(Ax_k + B\hat{u}_k\big).
\]

We expand (2.43) and separate out the noise dependent terms involving ν_k,

\[
\begin{aligned}
\begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \hat{H} \begin{bmatrix} x_k \\ u_k \end{bmatrix}
&+ x_k^T A^T P B ν_k + ν_k^T B^T P A x_k + ν_k^T R u_k + ν_k^T B^T P B u_k + u_k^T R ν_k
+ u_k^T B^T P B ν_k + ν_k^T R ν_k + ν_k^T B^T P B ν_k \\
&= x_k^T Q x_k + u_k^T R u_k + ν_k^T R ν_k + ν_k^T R u_k + u_k^T R ν_k + (Ax_k + Bu_k)^T P (Ax_k + Bu_k) \\
&\quad + x_k^T A^T P B ν_k + u_k^T B^T P B ν_k + ν_k^T B^T P A x_k + ν_k^T B^T P B u_k + ν_k^T B^T P B ν_k.
\end{aligned}
\]

As can be readily seen, the noise dependent terms get canceled out and we have

\[
\begin{bmatrix} x_k \\ u_k \end{bmatrix}^T \hat{H} \begin{bmatrix} x_k \\ u_k \end{bmatrix}
= x_k^T Q x_k + u_k^T R u_k + (Ax_k + Bu_k)^T P (Ax_k + Bu_k).  (2.44)
\]

Comparing Equation (2.44) with (2.9), we have Ĥ = H, resulting in the noise-free Bellman equation

\[
Q_K(x_k, u_k) = r(x_k, u_k) + V_K(x_{k+1}).
\]

In light of Equations (2.25), (2.26), and (2.19), we have the noise-free output feedback Bellman equation

\[
z_k^T H z_k = y_k^T Q_y y_k + u_k^T R u_k + z_{k+1}^T H z_{k+1}.
\]

This completes the proof.

2.3.6 Numerical Examples

In this subsection, we present numerical simulations of the proposed design using


various examples. We consider both stable and unstable dynamic systems. We show
here that the proposed algorithms converge to the optimal control parameters and
achieve closed-loop stability.
Fig. 2.1 Example 2.1: State reconstruction

Example 2.1 (A Stable System) Consider system (2.1) with

\[
A = \begin{bmatrix} 1.1 & -0.3 \\ 1 & 0 \end{bmatrix}, \qquad
B = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad
C = \begin{bmatrix} 1 & -0.8 \end{bmatrix}.
\]

Let Q_y = 1 and R = 1 be the weights in the utility function (2.3). The eigenvalues of the open-loop system are 0.5 and 0.6. We first verify the state reconstruction result of Theorem 2.1. Let the characteristic polynomial of the observer be Λ(z) = z^2. We apply a sinusoidal signal to the system and compare the actual state trajectory with that of the reconstructed state using the parameterization given in Theorem 2.1. It can be seen in Fig. 2.1 that the estimated state converges exponentially to the true state.
We use the PI algorithm as the system is open-loop stable. Sinusoids of different
frequencies and magnitudes are added in the control to satisfy the excitation con-
dition. We compare here the state feedback Q-learning algorithm (Algorithm 1.6),
the value function approximation (VFA) based output feedback method [56] and
the output feedback Q-learning algorithm (Algorithm 2.1). By solving the Riccati
equation (2.5), we obtain the optimal control matrices for the state feedback control
law (Algorithm 1.6) as


 
\[
H_{ux} = \begin{bmatrix} 0.3100 & -0.3151 \end{bmatrix}, \qquad H_{uu} = 2.0504.
\]
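These nominal values can be reproduced directly from the model as a check on the learned parameters; a short sketch using SciPy's discrete-time Riccati solver is shown below. This is only a verification aid; the learning algorithms themselves never use (A, B, C).

```python
import numpy as np
from scipy.linalg import solve_discrete_are

# Reproduce the nominal state feedback quantities of Example 2.1 from the model.
A = np.array([[1.1, -0.3], [1.0, 0.0]])
B = np.array([[1.0], [0.0]])
C = np.array([[1.0, -0.8]])
Qy, R = np.eye(1), np.eye(1)

P = solve_discrete_are(A, B, C.T @ Qy @ C, R)   # solution of the ARE (2.5)
Hux = B.T @ P @ A                               # approximately [0.3100, -0.3151]
Huu = R + B.T @ P @ B                           # approximately 2.0504
print(np.round(Hux, 4), np.round(Huu, 4))
```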

For the VFA based output feedback method [56], the control parameters p_0^*, p_u^* and p_y^* are obtained from the output feedback value function matrix

\[
\bar{P}^* = W^T P^* W =
\begin{bmatrix}
p_0^* & p_u^* & p_y^* \\
\big(p_u^*\big)^T & P_{22}^* & P_{23}^* \\
\big(p_y^*\big)^T & P_{32}^* & P_{33}^*
\end{bmatrix},
\]

where W is based on the parameterization (2.17). The nominal values of these


parameters are computed as

p0∗ = 1.0504,
pu∗ = −0.8079,
 
py∗ = 1.1179 −0.0253 .

On the other hand, the nominal values of the Q-learning based output feedback
algorithm (Algorithm 2.1) are computed by solving ARE (2.5) as

 
Huσ = 0.3100 −0.9635 ,

 
Huω = 0.9895 −0.3613 ,

Huu = 2.0504.

The state trajectories under these three different methods are shown in Figs. 2.2, 2.3,
and 2.4, respectively. The convergence of the parameter estimates under these three
methods are shown in Figs. 2.5, 2.6, and 2.7, respectively. The final parameter
estimates obtained are
 
Ĥux = 0.3100 −0.3151 ,

Ĥuu = 2.0504,

for the state feedback algorithm,

p̂0 = 1.0559,
p̂u = −0.8181,
 
p̂y = 1.1323 −0.2959 ,

for the VFA based output feedback method [56], and


 
Ĥuσ = 0.3102 −0.9628 ,
 
Ĥuω = 0.9889 −0.3611 ,

Fig. 2.2 Example 2.1: State trajectory of the closed-loop system under state feedback Q-learning


Fig. 2.3 Example 2.1: State trajectory of the closed-loop system under output feedback value
function learning [56]

Ĥuu = 2.0506,

for the output feedback Q-learning algorithm (Algorithm 2.1).


The convergence criterion of ε = 0.01 was selected for the controller parameters.
Seven data samples were collected for the state feedback algorithm, that is, L = 7,
whereas L = 18 for the output feedback algorithm. It can be seen that both the
state feedback Q-learning algorithm and the output feedback Q-learning algorithm
converge to the solution of the ARE (2.5), whereas the VFA based output feedback
method [56] differs from the optimal parameters owing to the use of the discounting
factor. It should be noted that the discounted solution in this particular example
happens to be quite close to the optimal solution as the system is already stable.
Furthermore, the state feedback method requires fewer data samples due to fewer
unknowns, and therefore, the optimal policies are learned faster as compared to the
output feedback method. The faster learning in turn results in better state response.

Fig. 2.4 Example 2.1: State trajectory of the closed-loop system under output feedback Q-learning


Fig. 2.5 Example 2.1: Convergence of the parameter estimates under state feedback Q-learning

It should be noted that we did not introduce a discounting factor in our proposed
method (Algorithms 2.1 and 2.2) as compared to [56]. Moreover, no bias problem is
observed in the proposed scheme. Furthermore, the excitation noise can be removed
once the convergence criterion is satisfied.

Example 2.2 (An Unstable System) Consider system (2.1) with

\[
A = \begin{bmatrix} 1.8 & -0.77 \\ 1 & 0 \end{bmatrix}, \qquad
B = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \qquad
C = \begin{bmatrix} 1 & -0.5 \end{bmatrix}.
\]

Fig. 2.6 Example 2.1: Convergence of the parameter estimates under output feedback value
function learning [56]


Fig. 2.7 Example 2.1: Convergence of the parameter estimates under output feedback Q-learning

Let Q_y = 1 and R = 1. The eigenvalues of the open-loop system are 0.7 and 1.1, and hence, the system is unstable. Let the characteristic polynomial of the observer be Λ(z) = z^2. We compare here the state feedback Q-learning algorithm (Algorithm 1.7), the VFA based output feedback method [56] and the output feedback Q-learning algorithm (Algorithm 2.2). By solving the Riccati equation (2.5), we obtain the optimal control matrices as

 
Hux = 2.5416 −1.5759 ,

Huu = 3.0467,

for the state feedback algorithm (Algorithm 1.7),

p0∗ = 2.0467,
pu∗ = −1.2713,

Fig. 2.8 Example 2.2 State trajectory of the closed-loop system under state feedback Q-learning

 
py∗ = 3.8129 −1.9578 ,

for the VFA based output feedback method [56], and



 
Huσ = 2.5416 −1.9065 ,

 
Huω = 4.9055 −2.9360 ,

Huu = 3.0467,

for the output feedback Q-learning algorithm (Algorithm 2.2). We use the VI
algorithm as the system is unstable. The initial estimate is H 0 = I . The excitation
condition is ensured by adding sinusoidal probing noises. The convergence criterion
of ε = 0.01 was chosen for all three algorithms. Seven data samples were
collected for the state feedback algorithm, that is, L = 7, whereas L = 18
for the output feedback algorithm. The state trajectories under the state feedback
Q-learning, the VFA based output feedback method [56], and the output feedback Q-
learning method are shown in Figs. 2.8, 2.9, and 2.10, respectively. The convergence
of the parameter estimates under these three different methods are shown in
Figs. 2.11, 2.12, and 2.13, respectively. The final parameter estimates obtained are
 
Ĥux = 2.5416 −1.5759 ,

Ĥuu = 3.0466,

for the state feedback algorithm,

p̂0 = 1.3639,
p̂u = −0.6771,

Fig. 2.9 Example 2.2: State trajectory of the closed-loop system under output feedback value
function learning [56]


Fig. 2.10 Example 2.2: State trajectory of the closed-loop system under the proposed output
feedback Q-learning

 
p̂y = 2.5069 −1.1909 ,

for the VFA based output feedback method [56], and


 
Ĥuσ = 2.4641 −1.9371 ,
 
Ĥuω = 4.8020 −2.9832 ,

Ĥuu = 3.0396,

for the output feedback Q-learning algorithm. It can be seen that both the state
feedback Q-learning algorithm and the output feedback Q-learning algorithm
converge to the solution of the undiscounted ARE (2.5), whereas the parameter
estimates in the VFA based output feedback method [56] differ even further from

Fig. 2.11 Example 2.2: Convergence of the parameter estimates under state feedback Q-learning


Fig. 2.12 Example 2.2: Convergence of the parameter estimates under output feedback value
function learning [56]

the optimal parameters as compared to Example 2.1. In fact, it will be shown in


Example 2.3 that the discounted control law may not even be able to stabilize the
system.
We now show that a better choice of the initial policy can improve the conver-
gence speed and system response. Instead of initializing the controller parameters
to zero (open-loop), we choose initial controller parameters to be within 1% of the
nominal parameters. We again use the VI algorithm. Figure 2.14 shows the state
trajectories of the system. The parameters converge to the optimal parameters in 4
iterations as shown in Fig. 2.15.

Example 2.3 (A Balance Beam System) We consider the balance beam system [64, 135] as shown in Fig. 2.16. It serves as a test platform for magnetic bearing systems. Two magnetic coils are located at the ends of a metal beam. Coil currents serve as the control inputs that generate forces to balance the beam. The motion of the beam is restricted to ±0.013 rad and proximity sensors are used to measure the beam tilt angle. This system is modeled by the following continuous-time state equation,

Fig. 2.13 Example 2.2: Convergence of the parameter estimates under the proposed output
feedback Q-learning


Fig. 2.14 Example 2.2: State trajectory of the closed-loop system under output feedback with a
better choice of initial estimates

\[
\dot{x}(t) = A_c x(t) + B_c u(t), \qquad y(t) = C_c x(t),
\]

with

\[
A_c = \begin{bmatrix} 0 & 1 \\ 9248 & -1.635 \end{bmatrix}, \qquad
B_c = \begin{bmatrix} 0 \\ 281.9 \end{bmatrix}, \qquad
C_c = \begin{bmatrix} 1 & 0 \end{bmatrix},
\]

Fig. 2.15 Example 2.2: Convergence of the parameter estimates under output feedback with a
better choice of initial estimates

Fig. 2.16 Example 2.3: A balance beam test rig [135]

where x1 (t) = θ (t) and x2 (t) = θ̇ (t) are the angular displacement and angular
velocity, respectively, and u(t) is the control current that is applied on the top of a
fixed bias current to generate a differential electromagnetic force between the two
coils. We discretize the model with a sampling period of 0.5 ms. The discretized
system model is in the form of system (2.1) with

\[
A = \begin{bmatrix} 1.0012 & 0.0005 \\ 4.6239 & 1.0003 \end{bmatrix}, \qquad
B = \begin{bmatrix} 0.00004 \\ 0.14094 \end{bmatrix}, \qquad
C = \begin{bmatrix} 1 & 0 \end{bmatrix}.
\]
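The discretization step itself is routine; a sketch of it using SciPy's zero-order-hold conversion and the sampling period stated above is included here for completeness.

```python
import numpy as np
from scipy.signal import cont2discrete

# Zero-order-hold discretization of the balance beam model with Ts = 0.5 ms.
Ac = np.array([[0.0, 1.0], [9248.0, -1.635]])
Bc = np.array([[0.0], [281.9]])
Cc = np.array([[1.0, 0.0]])
Dc = np.zeros((1, 1))

Ad, Bd, Cd, Dd, _ = cont2discrete((Ac, Bc, Cc, Dc), dt=0.5e-3, method='zoh')
print(np.round(Ad, 4))   # approximately [[1.0012, 0.0005], [4.6239, 1.0003]]
print(np.round(Bd, 5))   # approximately [[0.00004], [0.14094]]
```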

We use the same user-defined performance matrices and Λ(z) = z^2 as in Example 2.2. It can be verified that the state feedback optimal control policy K^* as obtained by solving the ARE is

 
\[
K^* = \begin{bmatrix} -64.0989 & -0.6608 \end{bmatrix}.
\]

The eigenvalues of the closed-loop system matrix A + BK^* are λ = 0.9530 ± j0.0006 with |λ| = 0.9530002, which means that the closed-loop system is stable. On the other hand, the discounted controller with γ = 0.9 as obtained by solving the discounted ARE is given by

\[
K_γ = \begin{bmatrix} -0.1327 & -0.0001 \end{bmatrix} \times 10^{-4}.
\]

The corresponding eigenvalues of the closed-loop system are λ1 = 1.0488 and


λ2 = 0.9527. Since |λ1 | > 1, the closed-loop system is unstable. Only when γ is
chosen sufficiently close to 1 is the resulting closed-loop system stable. However,
as γ tends to 1, the noise bias dominates the solution and, therefore, learning does
not converge [56]. The nominal output feedback controller matrices are found to be


 
Huσ = 0.1049 0.0537 ,

 
Huω = 1599 −1523 ,

Huu = 1.1001.

Because the system is unstable, we use the VI algorithm. Since we already have
a validated experimental model, we initialize the controller with 60% parametric
uncertainty (0.4 times the nominal model). The excitation requirement is met by
the addition of sinusoidal noises. The system response is shown in Fig. 2.17. The
controller parameters converge to the optimal parameters and are given as
 
Ĥuσ = 0.1049 0.0537 ,
 
Ĥuω = 1599 −1523 ,

Ĥuu = 1.1000.

Note that, in this example, it took around 40 iterations to converge to the optimal
parameters as the initial estimation error was large. In each iteration, L = 30 data
samples were collected. The convergence of the parameter estimates is shown in
Fig. 2.18. Comparing this result with the discounted cost function approach of [56],
we find that the discounted controller gain Kγ results in an unstable closed-loop
system. However, using the output feedback Q-learning method (Algorithm 2.2),
the closed-loop stability is preserved and the controller converges to the optimal
LQR solution.

Example 2.4 (A Higher Order System) In this example, we further test the output
feedback Q-learning scheme on a higher order practical system. Consider the power
system example for the load frequency control of an electric system [121]. A

Fig. 2.17 Example 2.3: State trajectory of the closed-loop system under output feedback


Fig. 2.18 Example 2.3: Convergence of the parameter estimates under output feedback

practical problem arises when the actual power plant parameters are not precisely
known, yet an optimal feedback controller is desired. We discretize the following
continuous-time plant with a sampling period of 0.1s,
\[
A_c = \begin{bmatrix}
-0.0665 & 11.5 & 0 & 0 \\
0 & -2.5 & 2.5 & 0 \\
-9.5 & 0 & -13.736 & -13.736 \\
0.6 & 0 & 0 & 0
\end{bmatrix}, \qquad
B_c = \begin{bmatrix} 0 \\ 0 \\ 13.736 \\ 0 \end{bmatrix}, \qquad
C_c = \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix}.
\]

Fig. 2.19 Example 2.4: State trajectory of the closed-loop system under output feedback

Let Q_y = 1 and R = 1 be the weights in the utility function (2.3). Let the characteristic polynomial of the observer be Λ(z) = z^4. The nominal output feedback parameters as found by solving the ARE are
feedback parameters as found by solving the ARE are

 
Huσ = 0.6213 0.1190 −0.8928 −0.1551 ,

 
Huω = 10.1884 −18.4698 10.4211 −1.5109 ,

Huu = 1.5541.

Let H 0 = I be the initial parameter estimate. We add sinusoidal signals to meet


the excitation condition. The response of the states is shown in Fig. 2.19. Under
the output feedback Q-learning algorithm (Algorithm 2.1), the system maintains
stability and the controller parameters converge to the optimal parameters as below,

 
Ĥuσ = 0.6214 0.1190 −0.8929 −0.1551 ,
 
Ĥuω = 10.1889 −18.4709 10.4217 −1.5110 ,

Ĥuu = 1.5542.

The convergence criterion was chosen as ε = 0.005. Convergence is achieved


in 6 iterations as shown in Fig. 2.20, and L = 50 observations were taken in
every iteration. It should be noted that, for this higher-order system, we have more
unknowns and, therefore, we require a richer excitation noise. It can be seen that
even with a richer excitation noise, the algorithm remains immune to the bias
problem.

Fig. 2.20 Example 2.4: Convergence of the parameter estimates under output feedback

Example 2.5 (Adaptation Capability) We now validate the adaptation capability of our output feedback Q-learning control algorithms. Consider again the unstable system in Example 2.2. After two iterations, we change the system dynamics to

\[
A = \begin{bmatrix} 1.5 & -0.4 \\ 1 & 0 \end{bmatrix},
\]

for which the new nominal output feedback optimal controller parameters are

 
Huσ = 1.7900 −1.4292 ,

 
Huω = 3.4329 −1.1434 ,

Huu = 2.7032.

Under the output feedback Q-learning algorithm (Algorithm 2.2), the closed-loop
system maintains stability as shown in Fig. 2.21. Furthermore, the controller adapts
to the new system dynamics and converges to the new optimal controller as
 
Ĥuσ = 1.7583 −1.3983 ,
 
Ĥuω = 3.3643 −1.1187 ,

Ĥuu = 2.6787.

The parameter convergence is shown in Fig. 2.22. We see that, after the second
iteration, the controller begins to adapt to the new system dynamics and converges
to the new optimal parameters in 7 iterations. In each iteration, L = 20 data samples
were collected.

Fig. 2.21 Example 2.5: State trajectory of the closed-loop system with changing dynamics and
under output feedback


Fig. 2.22 Example 2.5: Convergence of the parameter estimates with changing dynamics and
under output feedback

2.4 Continuous-Time LQR Problem

Consider a continuous-time linear system in the state space form,

ẋ = Ax + Bu,
(2.45)
y = Cx,

where x ∈ Rn is the state, u ∈ Rm is the input, and y ∈ Rp is the output.


Under the usual controllability and observability assumptions on (A, B) and (A, C),
respectively, we are to solve the LQR problem by output feedback. We will adopt
an observer based controller structure. When the full state is available for feedback,
the control law takes the form

u∗ = K ∗ x (2.46)

and minimizes the cost function of the form

\[
V(x(t)) = \int_t^{\infty} r(x(τ), u(τ))\, dτ,  (2.47)
\]

where r(x, u) is the utility function of a quadratic form

r(x, u) = y T Qy y + uT Ru, (2.48)

with Qy ≥ 0 and R > 0 being the user-defined weighting matrices. Under a


stabilizing policy u = Kx (not necessarily optimal), the cost is quadratic in the
state [121],

V (x) = x T P x, (2.49)

where P > 0. Under the conditions of controllability of (A, B) and observability of (A, √Q), where (√Q)^T √Q = Q, Q = C^T Q_y C, there exists a unique optimal feedback gain given by

K ∗ = −R −1 B T P ∗ , (2.50)

where P ∗ > 0 is the unique positive definite solution to the following ARE [55],

AT P + P A + Q − P BR −1 B T P = 0. (2.51)

2.4.1 Model-Based Iterative Schemes for the LQR Problem

Even when the system model information is available, the LQR ARE (2.51) is
difficult to solve owing to its nonlinear nature and, therefore, computational iterative
methods have been developed to address this difficulty. We recall the following policy iteration algorithm from [51].
The key equation in Algorithm 2.3 is the Bellman equation, which is a Lyapunov equation and is easier to solve than the ARE (2.51). The algorithm essentially consists of a policy evaluation step followed by a policy update step. We first compute the cost P^j of the control policy K^j by solving the Lyapunov equation (2.52). In the second step, we compute an updated policy K^{j+1}. It has been proven in [51]
that given a stabilizing initial policy K 0 , the successive iterations on the Lyapunov
equation converge to the optimal solution P ∗ and K ∗ .
Algorithm 2.3 requires a stabilizing initial policy. For an open-loop stable system,
the stabilizing initial policy K 0 can be set to zero. However, for an unstable system,
it is difficult to obtain such a stabilizing policy when the system dynamics is

Algorithm 2.3 Model-based policy iteration for solving the LQR problem
input: system dynamics (A, B)
output: P ∗ and K ∗
1: initialize. Select an admissible policy K^0 such that A + BK^0 is Hurwitz. Set j ← 0.
2: repeat
3: policy evaluation. Solve the following Bellman equation for P^j,

\[
\big(A + BK^j\big)^T P^j + P^j \big(A + BK^j\big) + Q + K^{jT} R K^j = 0.  (2.52)
\]

4: policy update. Find an improved policy as

\[
K^{j+1} = -R^{-1} B^T P^j.  (2.53)
\]

5: j ← j + 1
6: until ‖P^j − P^{j−1}‖ < ε for some small ε > 0.

unknown. To obviate this requirement, value iteration algorithms are used that
perform recursive updates on the cost matrix P instead of solving the Lyapunov
equation in every iteration.
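When the model is available, Algorithm 2.3 amounts to only a few lines of code; a sketch (Python/SciPy) on an arbitrary stable example is given below. The particular system, the zero initial gain, and the stopping tolerance are assumptions of the sketch.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov, solve_continuous_are

# Model-based PI (Algorithm 2.3) on an illustrative system; requires a
# stabilizing initial gain K0.
A = np.array([[0.0, 1.0], [-2.0, -3.0]])
B = np.array([[0.0], [1.0]])
Q = np.diag([1.0, 1.0])
R = np.eye(1)
K = np.zeros((1, 2))            # A is already Hurwitz here, so K0 = 0 is admissible

for j in range(50):
    Acl = A + B @ K
    # Policy evaluation: (A+BK)^T P + P (A+BK) + Q + K^T R K = 0, cf. (2.52)
    P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ R @ K))
    K_new = -np.linalg.solve(R, B.T @ P)          # policy update, cf. (2.53)
    if np.linalg.norm(K_new - K) < 1e-10:
        break
    K = K_new

P_are = solve_continuous_are(A, B, Q, R)          # reference solution of (2.51)
assert np.allclose(P, P_are, atol=1e-8)
print("PI converged to the ARE solution in", j + 1, "iterations")
```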
In [11], a VI algorithm was proposed for the continuous-time LQR problem. We recall the following definitions before introducing the VI algorithm. Let {B_q}_{q=0}^{∞} be some bounded nonempty sets that satisfy

\[
B_q \subseteq B_{q+1}, \quad q \in Z_+,
\]

and

\[
\lim_{q\to\infty} B_q = P_n^+,
\]

where P_n^+ is the set of n-dimensional positive definite matrices. For example,

\[
B_q = \big\{ P \in P_2^+ : |P| \le q + 1 \big\}, \quad q = 0, 1, 2, \ldots.
\]

With these definitions, Algorithm 2.4 presents the LQR VI algorithm.


The VI algorithm, Algorithm 2.4, performs recursive updates on the difference Riccati equation with the step size sequence ε_j satisfying lim_{j→∞} ε_j = 0. It can be seen that the algorithm no longer requires a stabilizing initial policy. Furthermore, the recursion on the difference equation is computationally faster than solving the Lyapunov equation in each iteration. However, it should be noted that the VI algorithms generally take more iterations to converge [59]. Finally, both Algorithms 2.3 and 2.4 are model-based as they require full model information (A, B, C).

Algorithm 2.4 Model-based value iteration for solving the LQR ARE
Input: system dynamics (A, B)
Output: P ∗
Initialization. P^0 > 0, j ← 0, q ← 0.
1: loop
2: P̃^{j+1} ← P^j + ε_j (A^T P^j + P^j A + Q − P^j B R^{-1} B^T P^j).
3: if P̃^{j+1} ∉ B_q then
4: P^{j+1} ← P^0
5: q ← q + 1
6: else if ‖P̃^{j+1} − P^j‖/ε_j < ε, for some small ε > 0, then
7: return P^j as P ∗
8: else
9: P^{j+1} ← P̃^{j+1}
10: end if
11: j ← j + 1
12: end loop

2.4.2 Model-Free Schemes Based on State Feedback

Reinforcement learning control algorithms were originally developed for discrete-


time dynamic systems. The treatment of continuous-time reinforcement learning
control problems is quite different from its discrete-time counterpart discussed in
Sect. 2.3. One of the notable difficulties in the continuous-time case arises from the
involvement of the derivatives of the cost function in the Bellman equation, which
requires knowledge of the system dynamics, as given by the following equation,

0 = r (x(t), Kx(t)) + ( V )T (Ax(t) + Bu(t)) .

In contrast, the Bellman equation for the discrete-time problems (2.7) is a recursion
between two consecutive values of the cost function and does not involve the system
dynamics. Recently, the idea of integral reinforcement learning (IRL) [121] has been
used to overcome this difficulty. The IRL Bellman equation is given by

t+T
V (x(t)) = r (x(τ ), Kx(τ )) dτ + V (x(t + T )).
t
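In practice the interval integral in the IRL Bellman equation is evaluated from sampled data; a minimal sketch of that bookkeeping (trapezoidal integration of the utility over one interval of length T; the sampling choices and data are assumptions) is shown below.

```python
import numpy as np

# Accumulate the integral reinforcement int_t^{t+T} (y'Qy y + u'R u) d tau
# from samples logged every dt seconds (trapezoidal rule); a sketch only.
def integral_reward(y_samples, u_samples, Qy, R, dt):
    r = np.array([float(y @ Qy @ y + u @ R @ u)
                  for y, u in zip(y_samples, u_samples)])
    return np.trapz(r, dx=dt)

# Example usage with synthetic data over one interval T = 0.05 s, dt = 1 ms.
rng = np.random.default_rng(0)
ys = rng.standard_normal((51, 1))
us = rng.standard_normal((51, 1))
rho = integral_reward(ys, us, np.eye(1), np.eye(1), 1e-3)
print("integral reinforcement over the interval:", rho)
```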

The idea of using an interval integral in the learning equation has been successfully
used to design RL based control algorithms and will be adopted here to solve
continuous-time output feedback RL control problems. It is worth mentioning that
the above learning equation is employed in an on-policy setting. In such algorithms,
the behavioral policy that is employed during the learning phase follows the policy
that has been learned. Since we are interested in learning the linear feedback policy,
the behavioral policy is confined to this structural constraint. A downside to this
method is that it is hard to achieve sufficient exploration of the state and action

space. Often, episodic learning involving resetting of system states is used in this
class of algorithms.
Differently from the on-policy approach, off-policy learning methods have been presented in the literature that solve optimal control problems without requiring model information. The learning equation in such methods involves an explicit control term that is not restricted to the feedback policy being learned, which allows a good degree of freedom in the choice of exploration. As a result, the exploration bias issue does not arise, which is different from the output feedback bias issue, as will be seen next. In the control setting, some pioneering developments along
this line of work were first made in the continuous-time setting in [42] to solve
the state feedback LQR problem. This model-free state feedback algorithm based
on an off-policy policy iteration algorithm is presented in Algorithm 2.5. Recently,
a number of optimal control problems have been solved based on this formulation
both in the continuous-time and the discrete-time settings [18, 48, 79, 80]. Interested
readers can refer to [43] for more discussions on this topic. The continuous-time
output feedback results in this book build upon the state feedback off-policy learning
equations based on this approach.
Although Algorithm 2.5 is model-free, it requires a stabilizing initial control
policy, similar to the model-based PI algorithm, Algorithm 2.3. This could be quite
a restrictive requirement when the open-loop system is unstable and the system
model is not available. Recently, a model-free ADP value iteration algorithm,
Algorithm 2.6, was proposed that overcomes this situation [11].
It can be seen that the model-free Algorithm 2.6 does not require a stabilizing
control policy for its initialization. The algorithm is based on the recursive learning
equation (2.55), which is used to find the unknown matrices H j = AT P j + P j A
and K j = −R −1 B T P j .
Both model-free algorithms discussed in this subsection make use of the full
state, which is not always available in practical scenarios. To obviate the requirement
of a measurement of the full internal state of the system, we next propose a dynamic
output feedback scheme to solve the model-free LQR problem. It will be shown that
the proposed scheme is immune to the exploration noise bias and does not require a
discounted cost function. As a result, the closed-loop stability and optimality of the
solution are ensured.

2.4.3 Model-Free Output Feedback Solution

A model-free output feedback solution for the continuous-time output feedback


LQR problem was recently proposed in [80], where the authors used the N -sample
observability theorem [125] to reconstruct the state of a continuous-time system by
sampling delayed output measurements. If the control is of the full state feedback
form, u = Kx, then the state of the closed-loop system

ẋ = (A + BK)x

Algorithm 2.5 Model-free state feedback based continuous-time LQR policy iteration algorithm
input: input-state data
output: P ∗ and K ∗
1: initialize. Select a stable control policy u^0 = K^0 x + ν, where ν is the exploration signal, and set the iteration index j ← 0.
2: collect data. Apply u^0 to the system to collect online data for t ∈ [t_0, t_l], where t_l = t_0 + lT and T is the interval length. Based on this data, perform the following iterations.
3: repeat
4: evaluate and improve policy. Find the solution, P^j and K^{j+1}, of the following learning equation,

\[
x^T(t) P^j x(t) - x^T(t - T) P^j x(t - T)
= -\int_{t-T}^{t} x^T(τ)\big(C^T Q_y C + K^{jT} R K^j\big) x(τ)\, dτ
- 2\int_{t-T}^{t} \big(u(τ) - K^j x(τ)\big)^T R K^{j+1} x(τ)\, dτ.  (2.54)
\]

5: j ← j + 1
6: until ‖P^j − P^{j−1}‖ < ε for some small ε > 0.

Algorithm 2.6 Model-free state feedback based continuous-time LQR value iteration algorithm
Input: input-state data
Output: P ∗
Initialization. Select P^0 > 0 and set j ← 0, q ← 0.
Collect Online Data. Apply u^0 to the system to collect online data for t ∈ [t_0, t_l], where t_l = t_0 + lT and T is the interval length. Based on this data, perform the following iterations,
1: loop
2: Find the solution, H^j and K^j, of the following equation,

\[
x^T(t) P^j x(t) - x^T(t - T) P^j x(t - T)
= \int_{t-T}^{t} x^T(τ) H^j x(τ)\, dτ - 2\int_{t-T}^{t} \big(R u(τ)\big)^T K^j x(τ)\, dτ.  (2.55)
\]

3: P̃^{j+1} ← P^j + ε_j (H^j + C^T Q_y C − K^{jT} R K^j)
4: if P̃^{j+1} ∉ B_q then
5: P^{j+1} ← P^0
6: q ← q + 1
7: else if ‖P̃^{j+1} − P^j‖/ε_j < ε then
8: return P^j as P ∗
9: else
10: P^{j+1} ← P̃^{j+1}
11: end if
12: j ← j + 1
13: end loop

is related to the delayed output through,

y(t − iT ) = Ce−iT (A+BK) x(t),

where i = 0, 1, · · · , N − 1. Here, N is the minimum number of samples needed to


observe the system in some interval (ti , tf ) and depends on the system dynamics.
Thus, N delayed output measurements can be used to reconstruct the state as

\[
x(t) = \big(G^T G\big)^{-1} G^T \bar{y},
\]

where

\[
\bar{y} = \begin{bmatrix} y^T(t) & y^T(t - T) & \cdots & y^T(t - (N - 1)T) \end{bmatrix}^T,
\qquad
G = \begin{bmatrix} C^T & \big(C e^{-T(A+BK)}\big)^T & \cdots & \big(C e^{-(N-1)T(A+BK)}\big)^T \end{bmatrix}^T.
\]

The method is elegant, however, it assumes that u = Kx, and therefore, does not
consider the excitation noise in the input. Thus, introducing excitation noise violates
the above relations, which ultimately leads to bias in the parameter estimates in the
ADP learning equation. As a result, the estimated control parameters converge to
sub-optimal parameters [80]. To address this problem, a discounting factor γ is
typically introduced in the cost function [56, 80], which helps to suppress the noise
bias. The resulting discounted cost function takes the form of

V (x(t)) = e−γ (τ −t) r (x(τ ), u(τ )) dτ. (2.56)
t

The introduction of the discounting factor γ , however, changes the solution of the
Riccati equation, and the resulting discounted control is no longer optimal. More
importantly, the introduction of the discounting factor does not guarantee closed-
loop stability as the original optimal control (2.46) would. To see this, we note that,
for a time function α(t) and a discounting factor γ > 0,

\[
\int_t^{\infty} e^{-γ(τ - t)} α(τ)\, dτ < \infty
\]

does not guarantee that α(t) → 0 as t → ∞.


In what follows, we will develop output feedback algorithms that do not suffer from the exploration bias problem and, as a result, do not require a discounted cost function. The development of such output feedback algorithms will enable us to guarantee optimality and closed-loop stability. To this end, we will present a different parameterization of the state that takes into account the exploration component of the input signal.

2.4.4 State Parameterization

In the discrete-time setting, we were fortunate to obtain an exact parameterization of


the state, given in Theorem 2.2, in terms of the past input-output data. This was made
possible by placing the eigenvalues of the observer error dynamics at the origin.
However, the direct translation of this discrete-time result into the continuous-time
setting would invoke the derivatives of the input and output measurements, which is
generally prohibitive in practice.
To overcome this difficulty, [80] proposed the parameterization based on delayed
output measurements, which does not include the effect of exploration noise
component present in the input, as mentioned in Sect. 2.3. Interestingly, the
general parameterization result derived based on the Luenberger observer theory
in Theorem 2.1 in the discrete-time setting can be extended to the continuous-time
setting. This was the motivation of introducing the general parameterization (2.16)
before its special case (2.17). In the following, we present a state parameterization
procedure to represent the state in terms of the filtered input and output for a general
continuous-time linear system.
Theorem 2.9 Consider system (2.45). Let the pair (A, C) be observable. Then, there exists a parameterization of the state in the form of

\[
x(t) = W_u ζ_u(t) + W_y ζ_y(t) + e^{(A+LC)t} x(0),  (2.57)
\]

where L is the observer gain such that A + LC is Hurwitz, the parameterization matrices W_u = [W_u^1 W_u^2 · · · W_u^m] and W_y = [W_y^1 W_y^2 · · · W_y^p] are given in the form of

\[
W_u^i = \begin{bmatrix}
a^{i1}_{u(n-1)} & a^{i1}_{u(n-2)} & \cdots & a^{i1}_{u0} \\
a^{i2}_{u(n-1)} & a^{i2}_{u(n-2)} & \cdots & a^{i2}_{u0} \\
\vdots & \vdots & \ddots & \vdots \\
a^{in}_{u(n-1)} & a^{in}_{u(n-2)} & \cdots & a^{in}_{u0}
\end{bmatrix}, \quad i = 1, 2, \cdots, m,
\]

\[
W_y^i = \begin{bmatrix}
a^{i1}_{y(n-1)} & a^{i1}_{y(n-2)} & \cdots & a^{i1}_{y0} \\
a^{i2}_{y(n-1)} & a^{i2}_{y(n-2)} & \cdots & a^{i2}_{y0} \\
\vdots & \vdots & \ddots & \vdots \\
a^{in}_{y(n-1)} & a^{in}_{y(n-2)} & \cdots & a^{in}_{y0}
\end{bmatrix}, \quad i = 1, 2, \cdots, p,
\]

whose elements are the coefficients of the numerators in the transfer function matrix of a Luenberger observer with inputs u(t) and y(t), and ζ_u = [(ζ_u^1)^T (ζ_u^2)^T · · · (ζ_u^m)^T]^T and ζ_y = [(ζ_y^1)^T (ζ_y^2)^T · · · (ζ_y^p)^T]^T represent the

states of the user-defined dynamics driven by the individual input u^i(t) and output y^i(t) as given by

\[
\dot{ζ}_u^i(t) = A ζ_u^i(t) + B u^i(t), \quad i = 1, 2, \cdots, m,
\]
\[
\dot{ζ}_y^i(t) = A ζ_y^i(t) + B y^i(t), \quad i = 1, 2, \cdots, p,
\]

for a Hurwitz matrix A whose eigenvalues coincide with those of A + LC and an input vector B of the form

\[
A = \begin{bmatrix}
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \ddots & \ddots & \vdots \\
0 & 0 & \cdots & 0 & 1 \\
-α_0 & -α_1 & \cdots & \cdots & -α_{n-1}
\end{bmatrix}, \qquad
B = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix}.
\]

Proof Under the observability condition of (A, C), the estimate of the state, x̂(t), can be obtained based on the following observer,

\[
\dot{\hat{x}}(t) = A\hat{x}(t) + Bu(t) - L\big(y(t) - C\hat{x}(t)\big)
= (A + LC)\hat{x}(t) + Bu(t) - Ly(t),  (2.58)
\]

where L is a user-defined observer gain selected such that matrix A + LC is Hurwitz. The system input and output serve as the inputs to the observer, which can be written in a filter notation as follows,

\[
\hat{x}(t) = (sI - A - LC)^{-1} B[u] - (sI - A - LC)^{-1} L[y] + e^{(A+LC)t}\hat{x}(0)
= \sum_{i=1}^{m} \frac{U^i(s)}{Λ(s)}\big[u^i\big] - \sum_{i=1}^{p} \frac{Y^i(s)}{Λ(s)}\big[y^i\big] + e^{(A+LC)t}\hat{x}(0)
= \frac{U(s)}{Λ(s)}[u] - \frac{Y(s)}{Λ(s)}[y] + e^{(A+LC)t}\hat{x}(0),  (2.59)
\]

where u^i and y^i are the components of the input and output vectors, respectively, U = [U^1 U^2 · · · U^m], Y = [Y^1 Y^2 · · · Y^p], and U^i(s) and Y^i(s) are some n-dimensional polynomial vectors in the differential operator s. We represent [u] as a time signal acted upon by some transfer function to result in another time signal, for example, s[u] = \frac{d}{dt}u. Note that U(s) and Y(s) are dependent on (A, B, C, L). We define Λ(s) as the characteristic polynomial of matrix A + LC,

\[
Λ(s) = \det(sI - A - LC) = s^n + α_{n-1}s^{n-1} + α_{n-2}s^{n-2} + \cdots + α_1 s + α_0.
\]

U i (s)  i
We first consider the contribution to x̂(t) from the input u. Note that (s) u
in (2.59) is given by

U i (s)  i   
u = (sI − A − LC)−1 Bi ui , (2.60)
(s)

where $B_i$ is the $i$th column of B. Equation (2.60) can be further expanded as follows,

$$\frac{U^i(s)}{\Delta(s)}\big[u^i\big] = \begin{bmatrix} \dfrac{a_{n-1}^{i1}s^{n-1} + a_{n-2}^{i1}s^{n-2} + \cdots + a_0^{i1}}{s^n + \alpha_{n-1}s^{n-1} + \alpha_{n-2}s^{n-2} + \cdots + \alpha_0} \\ \dfrac{a_{n-1}^{i2}s^{n-1} + a_{n-2}^{i2}s^{n-2} + \cdots + a_0^{i2}}{s^n + \alpha_{n-1}s^{n-1} + \alpha_{n-2}s^{n-2} + \cdots + \alpha_0} \\ \vdots \\ \dfrac{a_{n-1}^{in}s^{n-1} + a_{n-2}^{in}s^{n-2} + \cdots + a_0^{in}}{s^n + \alpha_{n-1}s^{n-1} + \alpha_{n-2}s^{n-2} + \cdots + \alpha_0} \end{bmatrix}\big[u^i\big]
= \begin{bmatrix} a_0^{i1} & a_1^{i1} & \cdots & a_{n-1}^{i1} \\ a_0^{i2} & a_1^{i2} & \cdots & a_{n-1}^{i2} \\ \vdots & \vdots & \ddots & \vdots \\ a_0^{in} & a_1^{in} & \cdots & a_{n-1}^{in} \end{bmatrix}\begin{bmatrix} \frac{1}{\Delta(s)}\big[u^i\big] \\ \frac{s}{\Delta(s)}\big[u^i\big] \\ \vdots \\ \frac{s^{n-1}}{\Delta(s)}\big[u^i\big] \end{bmatrix} = W_u^i\,\zeta_u^i(t),$$

where the parameterization matrix $W_u^i \in \mathbb{R}^{n\times n}$ contains the coefficients of the polynomial vector $U^i(s)$ and $\zeta_u^i \in \mathbb{R}^n$ is the result of a filtering operation on the $i$th input signal $u^i$, which can also be obtained through the following dynamic system,

ζ˙ui (t) = Aζui (t) + Bui (t), (2.61)

where

$$A = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ -\alpha_0 & -\alpha_1 & -\alpha_2 & \cdots & -\alpha_{n-1} \end{bmatrix}, \qquad B = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix}.$$

The same procedure can also be applied for the contribution to x̂(t) from the ith
output y i as

$$\frac{Y^i(s)}{\Delta(s)}\big[y^i\big] = W_y^i\,\zeta_y^i(t),$$

where ζyi (t) is obtained through the following dynamic system,

ζ˙yi (t) = Aζyi (t) + By i (t). (2.62)

Note that the characteristic polynomial of A,

$$\det(sI - A) = s^n + \alpha_{n-1}s^{n-1} + \alpha_{n-2}s^{n-2} + \cdots + \alpha_0,$$

is the same as $\Delta(s)$, which is the characteristic polynomial of matrix A + LC and is a stable polynomial. Thus, the dynamics of $\zeta_u^i$ and $\zeta_y^i$ are also asymptotically stable.
Based on (2.61) and (2.62), (2.59) can be written as

x̂(t) = Wu ζu (t) + Wy ζy (t) + e(A+LC)t x̂(0), (2.63)

where

$$\zeta_u = \big[(\zeta_u^1)^T \; (\zeta_u^2)^T \; \cdots \; (\zeta_u^m)^T\big]^T \in \mathbb{R}^{mn}, \qquad \zeta_y = \big[(\zeta_y^1)^T \; (\zeta_y^2)^T \; \cdots \; (\zeta_y^p)^T\big]^T \in \mathbb{R}^{pn},$$

and

$$W_u = \big[W_u^1 \; W_u^2 \; \cdots \; W_u^m\big] \in \mathbb{R}^{n\times mn}, \qquad W_y = \big[W_y^1 \; W_y^2 \; \cdots \; W_y^p\big] \in \mathbb{R}^{n\times pn}.$$

It should be noted that A + LC represents the error dynamics of the observer,

e(t) = x(t) − x̂(t)


= e(A+LC)t e(0),

and therefore x(t) can be written as

x(t) = Wu ζu (t) + Wy ζy (t) + e(A+LC)t x(0). (2.64)

It can be seen that e(A+LC)t x̂(0) in (2.63) and e(A+LC)t x(0) in (2.64) converge to
zero as t → ∞ because A + LC is Hurwitz stable. This completes the proof.
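As a computational aside, the filtered signals ζu and ζy of Theorem 2.9 are easy to generate in practice: one integrates a bank of identical companion-form filters, one n-dimensional filter per input and per output channel, driven by the measured u(t) and y(t). Because the filter poles are the user-chosen eigenvalues of A + LC, no knowledge of the plant matrices is needed to run these filters. The following is a minimal sketch in Python/SciPy under a zero-order-hold sampling assumption; the function names and the sampling step are illustrative choices, not part of the theorem.

```python
import numpy as np
from scipy.signal import cont2discrete

def companion_filter(alphas):
    """Companion-form (A, B) built from [alpha_0, ..., alpha_{n-1}], the
    coefficients of a chosen stable polynomial s^n + alpha_{n-1}s^{n-1} + ... + alpha_0."""
    n = len(alphas)
    A = np.zeros((n, n))
    A[:-1, 1:] = np.eye(n - 1)          # shift structure
    A[-1, :] = -np.asarray(alphas)      # last row carries the coefficients
    B = np.zeros((n, 1))
    B[-1, 0] = 1.0
    return A, B

def filtered_signals(u_hist, y_hist, alphas, dt):
    """Propagate zeta_u and zeta_y (one filter per channel) from sampled data.
    u_hist: (samples, m), y_hist: (samples, p); dt is the sample period (assumed uniform)."""
    A, B = companion_filter(alphas)
    n = len(alphas)
    Ad, Bd, *_ = cont2discrete((A, B, np.eye(n), np.zeros((n, 1))), dt)  # ZOH discretization
    def run(signal):
        zeta = np.zeros((signal.shape[1], n))          # one row per channel
        traj = []
        for sample in signal:
            zeta = zeta @ Ad.T + np.outer(sample, Bd.ravel())
            traj.append(zeta.ravel())                  # stacked [zeta^1; zeta^2; ...]
        return np.array(traj)
    return run(u_hist), run(y_hist)
```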

Remark 2.4 The parameterization matrices Wu and Wy depend on the quadruple


(A, B, C, L) and, therefore, require the knowledge of system dynamics. It will be
shown that, by embedding matrices A, B, C, and L in the learning equation, we
will be able to directly learn the optimal output feedback policy without actually
computing them, making the design model-free.
While the observability of (A, C) is necessary to guarantee the convergence of the
state parameterization
  to the actual state, the rank of the parameterization matrix
W = Wu Wy depends on the observer error dynamics matrix A + LC as well
as the system matrices. The full row rank condition is needed in establishing
the convergence of the output feedback algorithms to the optimal solution in
Theorem 2.13. We will analyze the full row rank condition of matrix W for the
parameterization (2.57). To this end we present the following result.
 
Theorem 2.10 The state parameterization matrix W = Wu Wy is of full row
rank if (A + LC, B) or (A + LC, L) is controllable.
Proof We note that the full row rank of either Wu or Wy suffices for W to be of full
row rank. Consider first the matrix Wu associated with the input. Recall that Wu can
be obtained from (2.14) as follows,

$$(sI - (A+LC))^{-1}B[u] = \frac{D_{n-1}s^{n-1} + D_{n-2}s^{n-2} + \cdots + D_0}{s^n + \alpha_{n-1}s^{n-1} + \alpha_{n-2}s^{n-2} + \cdots + \alpha_0}B[u]
= \big[D_0B \;\; D_1B \;\; \cdots \;\; D_{n-1}B\big]\begin{bmatrix} \frac{1}{\Delta(s)}[u] \\ \frac{s}{\Delta(s)}[u] \\ \vdots \\ \frac{s^{n-1}}{\Delta(s)}[u] \end{bmatrix} = W_u\zeta_u.$$

Here the matrices $D_i$ contain the coefficients of the adjoint matrix. It can be verified that we can express $D_i$ in terms of the matrix A + LC and the coefficients of its characteristic polynomial $\Delta(s)$ as follows,

$$D_{n-1} = I,$$
$$D_{n-2} = (A + LC) + \alpha_{n-1}I,$$
$$D_{n-3} = (A + LC)^2 + \alpha_{n-1}(A + LC) + \alpha_{n-2}I,$$
$$\vdots$$
$$D_0 = (A + LC)^{n-1} + \alpha_{n-1}(A + LC)^{n-2} + \cdots + \alpha_2(A + LC) + \alpha_1 I.$$

Substituting the expressions for $D_i$'s in the expression for $W_u$ and analyzing the rank of $W_u$, we have

$$\rho(W_u) = \rho\Big(\big[(A+LC)^{n-1}B + \alpha_{n-1}(A+LC)^{n-2}B + \cdots + \alpha_2(A+LC)B + \alpha_1B \;\; \cdots \;\; (A+LC)B + \alpha_{n-1}B \;\; B\big]\Big).$$

The terms involving $\alpha_i$ can be eliminated by performing column operations to result in,

$$\rho(W_u) = \rho\Big(\big[(A+LC)^{n-1}B \;\; \cdots \;\; (A+LC)B \;\; B\big]\Big),$$

which is the controllability condition of the pair (A + LC, B). Thus, the controlla-
bility of the pair (A + LC, B) implies full row rank of matrix Wu , and hence full
row rank of matrix W .
A similar analysis of the matrix Wy yields that controllability of the pair (A +
LC, L) would also imply full row rank of matrix Wy . This completes the proof.
We note that the controllability conditions of (A + LC, B) or (A + LC, L) in Theorem 2.10 are difficult to verify since they involve the observer gain matrix L, the design of which would require the knowledge of the system dynamics. Under the observability condition of (A, C), even though L can be chosen to place the eigenvalues of matrix A + LC arbitrarily, it is not easy to choose an L that satisfies the conditions of Theorem 2.10. As a result, in a model-free setting, we would not rely on Theorem 2.10 to guarantee the full row rank of matrix W. It is worth pointing out that
we do not design L for the state parameterization. Instead, we form a user-defined
dynamics A that contains the desired eigenvalues of matrix A + LC. We need a
condition in terms of these eigenvalues instead of the matrix L. The following result
establishes this condition.
Theorem 2.11 The parameterization matrix W is of full row rank if matrices A and
A + LC have no common eigenvalues.
Proof By Theorem 2.10 matrix Wy has full row rank if the pair (A + LC, L)
is controllable. We will show that, if matrices A and A + LC have no common
eigenvalues, then the pair (A + LC, L) is indeed controllable. By the Popov–
Belevitch–Hautus
 (PBH) test, the pair (A + LC, L) loses controllability if and only
if q T A + LC − λI L = 0 for a left eigenvector q associated with an eigenvalue
λ of A + LC. Then, q T (A + LC) = λq T and q T L = 0, and, therefore, q T A = λq T .
Then, λ must also be an eigenvalue of A if the pair (A + LC, L) is not controllable.
This completes the proof.
Next, we aim to use this state parameterization to describe the cost function
in (2.49). Substitution of (2.57) in (2.49) results in

$$V = \begin{bmatrix} \zeta_u \\ \zeta_y \end{bmatrix}^T\big[W_u \;\; W_y\big]^T P \big[W_u \;\; W_y\big]\begin{bmatrix} \zeta_u \\ \zeta_y \end{bmatrix} \qquad (2.65)$$
$$\;\; = z^T\bar{P}z, \qquad (2.66)$$
where

$$z = \begin{bmatrix} \zeta_u \\ \zeta_y \end{bmatrix} \in \mathbb{R}^N, \qquad \bar{P} = \bar{P}^T = \begin{bmatrix} W_u^TPW_u & W_u^TPW_y \\ W_y^TPW_u & W_y^TPW_y \end{bmatrix} \in \mathbb{R}^{N\times N},$$

with N = mn + pn being the dimension of the equivalent output feedback system.


By (2.65) we have obtained a new description of the cost function in terms of
the filtered input and output of the system. The corresponding output feedback
controller is given by

$$u = Kx \qquad (2.67)$$
$$\;\; = K\big[W_u \;\; W_y\big]\begin{bmatrix} \zeta_u \\ \zeta_y \end{bmatrix} \qquad (2.68)$$
$$\;\; = \bar{K}z, \qquad (2.69)$$

where $\bar{K} = K\big[W_u \;\; W_y\big] \in \mathbb{R}^{m\times N}$. Therefore, the optimal cost matrix is given by
P̄ ∗ and the corresponding optimal output feedback control law is given by

u∗ = K̄ ∗ z. (2.70)

Theorem 2.12 The output feedback law (2.70) is the steady-state equivalent of the
optimal LQR control law (2.46).
Proof Consider the optimal output feedback LQR controller (2.70). Substituting $\bar{K}^* = K^*\big[W_u \;\; W_y\big]$ in (2.70) results in
 
u∗ = K ∗ Wu ζu (t) + Wy ζy (t) .

By Theorem 2.9, the state parameterization (2.57) in the steady-state reduces to


x(t) = Wu ζu (t) + Wy ζy (t) since A + LC is Hurwitz. Thus, the output feedback
control law (2.70) is the steady-state equivalent of

u∗ = K ∗ x.

This completes the proof.



2.4.5 Learning Algorithms for Continuous-Time Output Feedback LQR Control

The discussion so far in this section focused on finding the nominal solution of the
continuous-time LQR problem when the system model is known. In the following,
we will derive an output feedback learning equation that will allow us to learn the
optimal output feedback LQR controller based on the input and output data.
In view of the state parameterization, we can write the key learning equa-
tion (2.54) in the output feedback form as follows,

$$z^T(t)\bar{P}^jz(t) - z^T(t-T)\bar{P}^jz(t-T) = -\int_{t-T}^{t} y^T(\tau)Q_yy(\tau)d\tau - \int_{t-T}^{t} z^T(\tau)(\bar{K}^j)^TR\bar{K}^jz(\tau)d\tau - 2\int_{t-T}^{t}\big(u(\tau) - \bar{K}^jz(\tau)\big)^TR\bar{K}^{j+1}z(\tau)d\tau, \qquad (2.71)$$

where the unknowns $\bar{P}^j$ and $\bar{K}^{j+1}$ can be found by solving a least-squares problem based on the system data y and z. As there are more unknowns than equations, we develop a system of l such equations by performing l finite window integrals, each over a period of length T. To solve this linear system of equations, we define the following data matrices,

$$\delta_{zz} = \big[\bar{z}^T(t_1)-\bar{z}^T(t_0) \;\; \bar{z}^T(t_2)-\bar{z}^T(t_1) \;\; \cdots \;\; \bar{z}^T(t_l)-\bar{z}^T(t_{l-1})\big]^T,$$
$$I_{zu} = \Big[\int_{t_0}^{t_1}(z(\tau)\otimes u(\tau))d\tau \;\; \int_{t_1}^{t_2}(z(\tau)\otimes u(\tau))d\tau \;\; \cdots \;\; \int_{t_{l-1}}^{t_l}(z(\tau)\otimes u(\tau))d\tau\Big]^T,$$
$$I_{zz} = \Big[\int_{t_0}^{t_1}(z(\tau)\otimes z(\tau))d\tau \;\; \int_{t_1}^{t_2}(z(\tau)\otimes z(\tau))d\tau \;\; \cdots \;\; \int_{t_{l-1}}^{t_l}(z(\tau)\otimes z(\tau))d\tau\Big]^T,$$
$$I_{yy} = \Big[\int_{t_0}^{t_1}(y(\tau)\otimes y(\tau))d\tau \;\; \int_{t_1}^{t_2}(y(\tau)\otimes y(\tau))d\tau \;\; \cdots \;\; \int_{t_{l-1}}^{t_l}(y(\tau)\otimes y(\tau))d\tau\Big]^T.$$

We can write the l instances of Equation (2.71) in the following compact form,

$$\Theta^j\begin{bmatrix} \mathrm{vecs}\big(\bar{P}^j\big) \\ \mathrm{vec}\big(\bar{K}^{j+1}\big) \end{bmatrix} = \Xi^j,$$

where the data matrices are given by

$$\Theta^j = \Big[\delta_{zz} \;\;\; -2I_{zz}\big(I_N\otimes(\bar{K}^j)^TR\big) + 2I_{zu}\big(I_N\otimes R\big)\Big] \in \mathbb{R}^{l\times\left(\frac{N(N+1)}{2}+mN\right)},$$
$$\Xi^j = -I_{zz}\,\mathrm{vec}\big(\bar{Q}^j\big) - I_{yy}\,\mathrm{vec}\big(Q_y\big) \in \mathbb{R}^{l},$$
with

$$\bar{Q}^j = (\bar{K}^j)^TR\bar{K}^j,$$
$$\bar{z} = \big[z_1^2 \;\; 2z_1z_2 \;\; \cdots \;\; z_2^2 \;\; 2z_2z_3 \;\; \cdots \;\; z_N^2\big],$$
$$\mathrm{vecs}\big(\bar{P}^j\big) = \big[\bar{P}_{11}^j \;\; \bar{P}_{12}^j \;\; \cdots \;\; \bar{P}_{1N}^j \;\; \bar{P}_{22}^j \;\; \bar{P}_{23}^j \;\; \cdots \;\; \bar{P}_{NN}^j\big].$$

The least-squares solution is given by

$$\begin{bmatrix} \mathrm{vecs}\big(\bar{P}^j\big) \\ \mathrm{vec}\big(\bar{K}^{j+1}\big) \end{bmatrix} = \big((\Theta^j)^T\Theta^j\big)^{-1}(\Theta^j)^T\Xi^j. \qquad (2.72)$$

Algorithm 2.7 Model-free output feedback policy iteration algorithm for the
continuous-time LQR problem
input: input-output data
output: P ∗ and K ∗
1: initialize. Select a stabilizing control policy u0 = K̄ 0 z + ν, where ν is an exploration signal,
and set the iteration index j ← 0.
2: collect data. Apply u0 to the system to collect online data for t ∈ [t0 , tl ], where tl = t0 + lT
and T is the interval length. Based on this data, perform the following iterations.
3: repeat
4: evaluate and improve policy. Find the solution, P̄ j and K̄ j +1 , of the following learning
equation,

$$z^T(t)\bar{P}^jz(t) - z^T(t-T)\bar{P}^jz(t-T) = -\int_{t-T}^{t} y^T(\tau)Q_yy(\tau)d\tau - \int_{t-T}^{t} z^T(\tau)(\bar{K}^j)^TR\bar{K}^jz(\tau)d\tau - 2\int_{t-T}^{t}\big(u(\tau) - \bar{K}^jz(\tau)\big)^TR\bar{K}^{j+1}z(\tau)d\tau.$$

5: j ← j + 1
6: until ‖P̄^j − P̄^{j−1}‖ < ε for some small ε > 0.

In Algorithm 2.7, we collect only the filtered input and output data to compute
their quadratic integrals and form the data matrices. Note that we use a stabilizing
initial policy K̄ 0 to collect data, which will be reused in the subsequent iterations.
Since there are N(N + 1)/2 + mN unknowns in P̄ j and K̄ j +1 , we need l ≥ N(N +
1)/2+mN data sets to solve (2.71). Furthermore, since u = K̄z(t) depends linearly
on the input output data z, we add an exploration signal ν in u0 to find the unique
least-squares solution of (2.71). In other words, the following rank condition needs
to be satisfied,
$$\mathrm{rank}\big(\Theta^j\big) = \frac{N(N+1)}{2} + mN. \qquad (2.73)$$
Typical examples of exploration signals include sinusoids of various frequencies and
magnitudes. Moreover, we do not require this exploration condition once parameter
convergence has been achieved.
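To make the data-processing step of Algorithm 2.7 concrete, the sketch below assembles δzz, Izz, Izu, and Iyy from uniformly sampled z, u, and y (approximating the finite-window integrals by rectangle sums) and solves the least-squares problem (2.72) for one iteration. This is only an illustrative sketch: the variable names are not from the text, and persistently exciting data are assumed.

```python
import numpy as np

def quad_basis(v):
    """z_bar(v) = [v1^2, 2v1v2, ..., v2^2, ..., vN^2], paired with vecs(P)."""
    out = []
    for i in range(len(v)):
        row = v[i] * v[i:]
        out.append(np.concatenate(([row[0]], 2.0 * row[1:])))
    return np.concatenate(out)

def pi_step(z, u, y, Kbar, Qy, R, window, dt):
    """One iteration of Algorithm 2.7 from sampled data.
    z: (T, N), u: (T, m), y: (T, p); 'window' = samples per learning interval."""
    N, m = z.shape[1], u.shape[1]
    Qbar = Kbar.T @ R @ Kbar
    Theta, Xi = [], []
    for k0 in range(0, z.shape[0] - window, window):
        k1 = k0 + window
        d_zz = quad_basis(z[k1]) - quad_basis(z[k0])
        I_zu = dt * sum(np.kron(z[k], u[k]) for k in range(k0, k1))
        I_zz = dt * sum(np.kron(z[k], z[k]) for k in range(k0, k1))
        I_yy = dt * sum(np.kron(y[k], y[k]) for k in range(k0, k1))
        blockK = (-2 * I_zz @ np.kron(np.eye(N), Kbar.T @ R)
                  + 2 * I_zu @ np.kron(np.eye(N), R))
        Theta.append(np.concatenate([d_zz, blockK]))
        Xi.append(-I_zz @ Qbar.reshape(-1) - I_yy @ Qy.reshape(-1))
    sol, *_ = np.linalg.lstsq(np.array(Theta), np.array(Xi), rcond=None)
    nP = N * (N + 1) // 2
    Kbar_next = sol[nP:].reshape(N, m).T      # undo the column-stacked vec(K)
    return sol[:nP], Kbar_next                # vecs(P_bar^j), K_bar^{j+1}
```

In an actual run, this step would be repeated with the updated K̄ until the change in P̄ falls below the chosen tolerance, exactly as in the repeat loop of Algorithm 2.7.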
We now address the problem of requiring a stabilizing initial control policy. It
can be seen that Algorithm 2.7 solves the output feedback LQR problem without
using any knowledge of the system dynamics. However, it requires a stabilizing
initial control policy. When the system is unstable and a stabilizing initial policy is
hard to obtain, we propose a dynamic output feedback value iteration algorithm.
To this end, consider the following Lyapunov function candidate,

V (x) = x T P x. (2.74)

The derivative of V along the closed-loop trajectory can be evaluated as

$$\frac{d}{dt}\big(x^T(t)Px(t)\big) = (Ax(t)+Bu(t))^TPx(t) + x^T(t)P(Ax(t)+Bu(t)) = x^T(t)Hx(t) - 2(Ru(t))^TKx(t), \qquad (2.75)$$

where $H = A^TP + PA$ and $K = -R^{-1}B^TP$. Performing finite window integrals of length T > 0 on both sides of (2.75) results in the following equation,

$$x^T(t)Px(t) - x^T(t-T)Px(t-T) = \int_{t-T}^{t} x^T(\tau)Hx(\tau)d\tau - 2\int_{t-T}^{t}(Ru(\tau))^TKx(\tau)d\tau. \qquad (2.76)$$

Adding $\int_{t-T}^{t} y^T(\tau)Q_yy(\tau)d\tau$ to both sides of Equation (2.76) and substituting y(t) = Cx(t) and $H = A^TP + PA$ on the right-hand side, we have

$$x^T(t)Px(t) - x^T(t-T)Px(t-T) + \int_{t-T}^{t} y^T(\tau)Q_yy(\tau)d\tau = \int_{t-T}^{t} x^T(\tau)\big(A^TP + PA + C^TQ_yC\big)x(\tau)d\tau - 2\int_{t-T}^{t}(Ru(\tau))^TKx(\tau)d\tau.$$

Next, we use the state parameterization (2.57) in the above equation to result in

$$z^T(t)\bar{P}z(t) - z^T(t-T)\bar{P}z(t-T) + \int_{t-T}^{t} y^T(\tau)Q_yy(\tau)d\tau = \int_{t-T}^{t} z^T(\tau)W^T\big(A^TP + PA + C^TQ_yC\big)Wz(\tau)d\tau - 2\int_{t-T}^{t}(Ru(\tau))^T\bar{K}z(\tau)d\tau,$$
or more compactly,

$$z^T(t)\bar{P}z(t) - z^T(t-T)\bar{P}z(t-T) + \int_{t-T}^{t} y^T(\tau)Q_yy(\tau)d\tau = \int_{t-T}^{t} z^T(\tau)\bar{H}z(\tau)d\tau - 2\int_{t-T}^{t}(Ru(\tau))^T\bar{K}z(\tau)d\tau, \qquad (2.77)$$

where $W = \big[W_u \;\; W_y\big]$ and $\bar{H} = W^T\big(A^TP + PA + C^TQ_yC\big)W$. Equation (2.77)
serves as the key equation for the output feedback based value iteration algorithm.
Note that (2.77) is a scalar equation, which is linear in the unknowns H̄ and K̄.
These matrices are the output feedback counterparts of the matrices H and K in
the state feedback case. As there are more unknowns than the number of equations, we develop a system of l such equations by performing l finite window integrals, each of length T. To solve this system of linear equations, we define the following data matrices,

$$\delta_{zz} = \big[z(t_1)\otimes z(t_1) - z(t_0)\otimes z(t_0), \;\; z(t_2)\otimes z(t_2) - z(t_1)\otimes z(t_1), \;\; \cdots, \;\; z(t_l)\otimes z(t_l) - z(t_{l-1})\otimes z(t_{l-1})\big]^T,$$
$$I_{zu} = \Big[\int_{t_0}^{t_1}(z(\tau)\otimes Ru(\tau))d\tau \;\; \int_{t_1}^{t_2}(z(\tau)\otimes Ru(\tau))d\tau \;\; \cdots \;\; \int_{t_{l-1}}^{t_l}(z(\tau)\otimes Ru(\tau))d\tau\Big]^T,$$
$$I_{zz} = \Big[\int_{t_0}^{t_1}\bar{z}^T(\tau)d\tau \;\; \int_{t_1}^{t_2}\bar{z}^T(\tau)d\tau \;\; \cdots \;\; \int_{t_{l-1}}^{t_l}\bar{z}^T(\tau)d\tau\Big]^T,$$
$$I_{yy} = \Big[\int_{t_0}^{t_1}(y(\tau)\otimes y(\tau))d\tau \;\; \int_{t_1}^{t_2}(y(\tau)\otimes y(\tau))d\tau \;\; \cdots \;\; \int_{t_{l-1}}^{t_l}(y(\tau)\otimes y(\tau))d\tau\Big]^T.$$

We rewrite the l instances of Equation (2.77) in the following form,

$$\big[I_{zz} \;\; -2I_{zu}\big]\begin{bmatrix} \mathrm{vecs}\big(\bar{H}\big) \\ \mathrm{vec}\big(\bar{K}\big) \end{bmatrix} = \delta_{zz}\,\mathrm{vec}\big(\bar{P}\big) + I_{yy}\,\mathrm{vec}\big(Q_y\big),$$

whose least-squares solution is given by

$$\begin{bmatrix} \mathrm{vecs}\big(\bar{H}\big) \\ \mathrm{vec}\big(\bar{K}\big) \end{bmatrix} = \Big(\big[I_{zz} \;\; -2I_{zu}\big]^T\big[I_{zz} \;\; -2I_{zu}\big]\Big)^{-1}\big[I_{zz} \;\; -2I_{zu}\big]^T\Big(\delta_{zz}\,\mathrm{vec}\big(\bar{P}\big) + I_{yy}\,\mathrm{vec}\big(Q_y\big)\Big) \qquad (2.78)$$
$$\;\; = \big((\Theta^j)^T\Theta^j\big)^{-1}(\Theta^j)^T\Xi^j, \qquad (2.79)$$

where $\Theta^j = \big[I_{zz} \;\; -2I_{zu}\big]$ and $\Xi^j = \delta_{zz}\,\mathrm{vec}\big(\bar{P}^j\big) + I_{yy}\,\mathrm{vec}\big(Q_y\big)$.

Similar to Algorithm 2.7, we need the following rank condition to be satisfied to obtain the unique solution to the above least-squares problem,

$$\mathrm{rank}\big(\Theta^j\big) = \frac{N(N+1)}{2} + mN, \qquad (2.80)$$
which can be met by injecting sufficiently exciting exploration signal in the control
input.
Before presenting the output feedback value iteration algorithm, we recall the following definitions. Let $\{B_q\}_{q=0}^{\infty}$ be some bounded nonempty sets that satisfy $B_q \subseteq B_{q+1}$, $q \in \mathbb{Z}_+$, and $\lim_{q\to\infty}B_q = \mathcal{P}_+^n$, where $\mathcal{P}_+^n$ is the set of n-dimensional positive definite matrices. Also, let $\epsilon_j$ be the step size sequence satisfying $\lim_{j\to\infty}\epsilon_j = 0$. With these definitions, the continuous-time model-free output feedback LQR algorithm is presented in Algorithm 2.8.

Algorithm 2.8 Output feedback based model-free continuous-time LQR value iteration algorithm
Input: input-output data
Output: P ∗
Initialization. Select P 0 > 0 and set j ← 0, q ← 0.
Collect Online Data. Apply u0 = K̄ 0 z + v, where v is an exploration signal, to the system to
collect online data for t ∈ [t0 , tl ], where tl = t0 + lT and T is the interval length. Based on
this data, perform the following iterations.
1: loop
2: Find the solution, H̄^j and K̄^j, of the following equation,

$$z^T(t)\bar{P}^jz(t) - z^T(t-T)\bar{P}^jz(t-T) + \int_{t-T}^{t} y^T(\tau)Q_yy(\tau)d\tau = \int_{t-T}^{t} z^T(\tau)\bar{H}^jz(\tau)d\tau - 2\int_{t-T}^{t}(Ru(\tau))^T\bar{K}^jz(\tau)d\tau.$$
3: $\tilde{\bar{P}}^{j+1} \leftarrow \bar{P}^j + \epsilon_j\big(\bar{H}^j - (\bar{K}^j)^TR\bar{K}^j\big)$
4: if $\tilde{\bar{P}}^{j+1} \notin B_q$ then
5: $\bar{P}^{j+1} \leftarrow P^0$
6: $q \leftarrow q + 1$
7: else if $\|\tilde{\bar{P}}^{j+1} - \bar{P}^j\|/\epsilon_j < \varepsilon$ then
8: return P j as P ∗
9: else
10: P̄ j +1 ← P̄˜ j +1
11: end if
12: j ←j +1
13: end loop
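The value-update and projection logic in steps 3-11 of Algorithm 2.8 can be isolated in a few lines of code once H̄^j and K̄^j have been estimated from (2.78)-(2.79). The sketch below is illustrative only; the step size and the bound defining B_q are borrowed from Example 2.7 and are assumptions of the sketch, not requirements of the algorithm.

```python
import numpy as np

def vi_update(P_bar, H_bar, K_bar, R, j, q, P0, eps_tol=1e-4, bound=800.0):
    """One pass through steps 3-11 of Algorithm 2.8.
    B_q is taken here as {P : ||P|| <= bound*(q+1)}, an illustrative choice."""
    eps_j = 1.0 / (j ** 0.2 + 5.0)                   # illustrative step size sequence
    P_tilde = P_bar + eps_j * (H_bar - K_bar.T @ R @ K_bar)
    if np.linalg.norm(P_tilde) > bound * (q + 1):    # P_tilde not in B_q: reset
        return P0, q + 1, False
    if np.linalg.norm(P_tilde - P_bar) / eps_j < eps_tol:
        return P_bar, q, True                        # converged: return current estimate
    return P_tilde, q, False                         # accept the value update
```

The estimation of H̄^j and K̄^j themselves comes from the least-squares problem (2.78)-(2.79), built from the same kind of data matrices as in the policy iteration sketch above.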

Theorem 2.13 Under the controllability condition of (A, B), the observability condition of $(A, \sqrt{Q})$, the full row rank of W, and the rank conditions (2.73) and (2.80), Algorithms 2.7 and 2.8 each generate a sequence of control policies that asymptotically converges to the optimal output feedback policy K̄* as j → ∞.
84 2 Model-Free Design of Linear Quadratic Regulator

Proof By Theorem 2.12, we know that, in steady state, the optimal output feedback
control is equivalent to optimal state feedback control. Thus, we need to show that
the output feedback algorithms converge to the optimal output feedback solution.
For the PI algorithm, consider the output feedback learning equation (2.71), which
can be written as

zT (t)W T P j z(t) − zT (t − T )W P j W z(t − T )


t T
= − zT (τ )W T C T Qy C + K j RK j W z(τ )dτ
t−T
t T
−2 u(τ ) − K̄ j z(τ ) R K̄ j +1 z(τ )dτ. (2.81)
t−T

From Equation (2.52), we have

$$-W^T\big(C^TQ_yC + (K^j)^TRK^j\big)W = W^T\Big(\big(A+BK^j\big)^TP^j + P^j\big(A+BK^j\big)\Big)W.$$

By the full row rank condition of W of Theorem 2.10 (or Theorem 2.11), the above equation reduces to

$$\big(A+BK^j\big)^TP^j + P^j\big(A+BK^j\big) + C^TQ_yC + (K^j)^TRK^j = 0,$$
which is a Lyapunov equation. Note that the policy update

K̄ j +1 = −R −1 B T P j W

corresponds to the state feedback policy update $K^{j+1}$ based on $K = \bar{K}W^T\big(WW^T\big)^{-1}$, given that W is of full row rank. Since $\bar{P}^j$ and $\bar{K}^{j+1}$ are uniquely
solvable from (2.81) under sufficient excitation based on the rank condition (2.73),
it follows that the iterations on the data-based equation (2.81) are equivalent to the
model-based Lyapunov iterations in Algorithm 2.3, which converges to the optimal
solution P ∗ and K ∗ . Then, from their definitions, P̄ and K̄ converge to P̄ ∗ and K̄ ∗ .
This shows that the output feedback PI algorithm, Algorithm 2.7, converges to the
optimal output feedback solution as j → ∞.
For the VI algorithm, consider the output feedback learning equation (2.77).
Under the rank condition (2.80), the equation has the unique least-squares solution
H̄ j and K̄ j . By Theorem 2.9, this solution corresponds to the solution H j and K j of
the equivalent state feedback learning equation. Given the output feedback solution,
we perform the following recursion,

$$\tilde{\bar{P}}^{j+1} \leftarrow \bar{P}^j + \epsilon_j\big(\bar{H}^j - (\bar{K}^j)^TR\bar{K}^j\big),$$
which can be written as

$$W^T\tilde{P}^{j+1}W = W^TP^jW + \epsilon_j\Big(W^T\big(A^TP^j + P^jA + C^TQ_yC\big)W - W^T(K^j)^TRK^jW\Big).$$

By the full row rank condition of W from Theorem 2.10 (or Theorem 2.11), the above equation reduces to

$$\tilde{P}^{j+1} = P^j + \epsilon_j\big(A^TP^j + P^jA + C^TQ_yC - (K^j)^TRK^j\big).$$

The above recursion is given in Algorithm 2.4, which converges to P ∗ as shown in


[11]. Then, from their definitions, P̄ and K̄ converge to P̄ ∗ and K̄ ∗ . This shows that
the output feedback VI algorithm, Algorithm 2.8, converges to the optimal output
feedback solution as j → ∞. This completes the proof.

Remark 2.5 In comparison with the previous output feedback LQR works based on
RL [80, 142], the output feedback VI algorithm, Algorithm 2.8, does not require a
stabilizing initial policy.

2.4.6 Exploration Bias Immunity of the Output Feedback Learning Algorithms

We now establish the immunity to the exploration bias problem of the output
feedback algorithms. We have the following result.
Theorem 2.14 The output feedback algorithms, Algorithms 2.7 and 2.8, are
immune to the exploration bias problem.
Proof Consider the learning equation (2.54) with input û,
$$x^T(t)\hat{P}^jx(t) - x^T(t-T)\hat{P}^jx(t-T) = -\int_{t-T}^{t} x^T(\tau)\big(C^TQ_yC + (\hat{K}^j)^TR\hat{K}^j\big)x(\tau)d\tau - 2\int_{t-T}^{t}\big(\hat{u}(\tau) - \hat{K}^jx(\tau)\big)^TR\hat{K}^{j+1}x(\tau)d\tau.$$

We differentiate both sides of the above equation as follows,

$$2x^T(t)\hat{P}^j\dot{x}(t) - 2x^T(t-T)\hat{P}^j\dot{x}(t-T) = -x^T(t)\big(C^TQ_yC + (\hat{K}^j)^TR\hat{K}^j\big)x(t) - 2\big(\hat{u}(t) - \hat{K}^jx(t)\big)^TR\hat{K}^{j+1}x(t) + x^T(t-T)\big(C^TQ_yC + (\hat{K}^j)^TR\hat{K}^j\big)x(t-T) + 2\big(\hat{u}(t-T) - \hat{K}^jx(t-T)\big)^TR\hat{K}^{j+1}x(t-T).$$
Upon further expansion, we obtain

$$2x^T(t)\hat{P}^j\big(Ax(t)+Bu(t)\big) + 2x^T(t)\hat{P}^jB\nu(t) - 2x^T(t-T)\hat{P}^j\big(Ax(t-T)+Bu(t-T)\big) - 2x^T(t-T)\hat{P}^jB\nu(t-T)$$
$$= -x^T(t)\big(C^TQ_yC + (\hat{K}^j)^TR\hat{K}^j\big)x(t) - 2\big(u(t) - \hat{K}^jx(t)\big)^TR\hat{K}^{j+1}x(t) - 2\nu^T(t)R\hat{K}^{j+1}x(t) + x^T(t-T)\big(C^TQ_yC + (\hat{K}^j)^TR\hat{K}^j\big)x(t-T) + 2\big(u(t-T) - \hat{K}^jx(t-T)\big)^TR\hat{K}^{j+1}x(t-T) + 2\nu^T(t-T)R\hat{K}^{j+1}x(t-T).$$

Rearranging terms and using K̂ j +1 = −R −1 B T P̂ j , which implies that

2x T P̂ j Bν = −2ν T R K̂ j +1 x,

we have

$$2x^T(t)\hat{P}^j\big(Ax(t)+Bu(t)\big) - 2x^T(t-T)\hat{P}^j\big(Ax(t-T)+Bu(t-T)\big) = -x^T(t)\big(C^TQ_yC + (\hat{K}^j)^TR\hat{K}^j\big)x(t) - 2\big(u(t) - \hat{K}^jx(t)\big)^TR\hat{K}^{j+1}x(t) + x^T(t-T)\big(C^TQ_yC + (\hat{K}^j)^TR\hat{K}^j\big)x(t-T) + 2\big(u(t-T) - \hat{K}^jx(t-T)\big)^TR\hat{K}^{j+1}x(t-T).$$

We now regroup the terms and integrate back to obtain the integral equation,

$$x^T(t)\hat{P}^jx(t) - x^T(t-T)\hat{P}^jx(t-T) = -\int_{t-T}^{t} x^T(\tau)\big(C^TQ_yC + (\hat{K}^j)^TR\hat{K}^j\big)x(\tau)d\tau - 2\int_{t-T}^{t}\big(u(\tau) - \hat{K}^jx(\tau)\big)^TR\hat{K}^{j+1}x(\tau)d\tau, \qquad (2.82)$$

which gives us the learning equation in the control u free of exploration signal.
It can be seen that Equation (2.82) is the same as the state feedback learning
equation (2.54). Therefore, P̂ j = P j and K̂ j +1 = K j +1 .
We next consider the output feedback case. By the equivalency of the learning
equations (2.54) and (2.71) following from Theorem 2.9, we have the bias-free
output feedback equation (2.71), that is,
$$z^T(t)\bar{P}^jz(t) - z^T(t-T)\bar{P}^jz(t-T) = -\int_{t-T}^{t} y^T(\tau)Q_yy(\tau)d\tau - \int_{t-T}^{t} z^T(\tau)(\bar{K}^j)^TR\bar{K}^jz(\tau)d\tau - 2\int_{t-T}^{t}\big(u(\tau) - \bar{K}^jz(\tau)\big)^TR\bar{K}^{j+1}z(\tau)d\tau.$$

We now show the noise bias immunity of Algorithm 2.8. Consider the learning
equation (2.77) with the excited input û. Let P̄ˆ j , H̄ˆ j , and K̄ˆ j be the parameter
estimates obtained as a result of the excited input. We have

$$z^T(t)\hat{\bar{P}}^jz(t) - z^T(t-T)\hat{\bar{P}}^jz(t-T) + \int_{t-T}^{t} y^T(\tau)Q_yy(\tau)d\tau = \int_{t-T}^{t} z^T(\tau)\hat{\bar{H}}^jz(\tau)d\tau - 2\int_{t-T}^{t}\big(R\hat{u}(\tau)\big)^T\hat{\bar{K}}^jz(\tau)d\tau.$$

Taking time derivative results in

$$2z^T(t)\hat{\bar{P}}^j\dot{z}(t) - 2z^T(t-T)\hat{\bar{P}}^j\dot{z}(t-T) + y^T(t)Q_yy(t) - y^T(t-T)Q_yy(t-T) = z^T(t)\hat{\bar{H}}^jz(t) - z^T(t-T)\hat{\bar{H}}^jz(t-T) - 2\big(R\hat{u}(t)\big)^T\hat{\bar{K}}^jz(t) + 2\big(R\hat{u}(t-T)\big)^T\hat{\bar{K}}^jz(t-T).$$

Further expansion of terms yields

$$2z^T(t)\hat{\bar{P}}^j\big(\bar{A}z(t) + \bar{B}_1u(t) + \bar{B}_2y(t)\big) + 2z^T(t)\hat{\bar{P}}^j\bar{B}_1\nu(t) + y^T(t)Q_yy(t) - y^T(t-T)Q_yy(t-T) - 2z^T(t-T)\hat{\bar{P}}^j\big(\bar{A}z(t-T) + \bar{B}_1u(t-T) + \bar{B}_2y(t-T)\big) - 2z^T(t-T)\hat{\bar{P}}^j\bar{B}_1\nu(t-T)$$
$$= z^T(t)\hat{\bar{H}}^jz(t) - z^T(t-T)\hat{\bar{H}}^jz(t-T) - 2\big(Ru(t)\big)^T\hat{\bar{K}}^jz(t) + 2\big(Ru(t-T)\big)^T\hat{\bar{K}}^jz(t-T) - 2\nu^T(t)\hat{\bar{K}}^jz(t) + 2\nu^T(t-T)\hat{\bar{K}}^jz(t-T),$$

where we have combined the dynamics of z based on the input-output dynamics of the observer, in which Ā and B̄_i represent the combined system dynamics and input matrices B̄_1 and B̄_2 corresponding to the dynamics of ζu and ζy as

$$\begin{bmatrix} \dot{\zeta}_u \\ \dot{\zeta}_y \end{bmatrix} = \begin{bmatrix} \bar{A}_1 & 0 \\ 0 & \bar{A}_2 \end{bmatrix}\begin{bmatrix} \zeta_u \\ \zeta_y \end{bmatrix} + \big[\bar{B}_1 \;\; \bar{B}_2\big]\begin{bmatrix} u \\ y \end{bmatrix} = \begin{bmatrix} \bar{A}_1 & 0 \\ 0 & \bar{A}_2 \end{bmatrix}\begin{bmatrix} \zeta_u \\ \zeta_y \end{bmatrix} + \begin{bmatrix} B_1 & 0 \\ 0 & B_2 \end{bmatrix}\begin{bmatrix} u \\ y \end{bmatrix} = \bar{A}z + \bar{B}\eta,$$

in which each Āi and Bi is further block diagonalized, respectively, with blocks of
A and B defined in Theorem 2.9, with the number of such blocks being equal to the
number of components in the individual vectors u and y.
Using the fact that W B̄1 = B, we have

2zT P̄ˆ j B̄1 ν = −2ν T K̄ˆ j z,

thereby cancelling the delayed and non-delayed ν terms. Thus, we have

$$2z^T(t)\hat{\bar{P}}^j\big(\bar{A}z(t) + \bar{B}_1u(t) + \bar{B}_2y(t)\big) + y^T(t)Q_yy(t) - y^T(t-T)Q_yy(t-T) - 2z^T(t-T)\hat{\bar{P}}^j\big(\bar{A}z(t-T) + \bar{B}_1u(t-T) + \bar{B}_2y(t-T)\big)$$
$$= z^T(t)\hat{\bar{H}}^jz(t) - z^T(t-T)\hat{\bar{H}}^jz(t-T) - 2\big(Ru(t)\big)^T\hat{\bar{K}}^jz(t) + 2\big(Ru(t-T)\big)^T\hat{\bar{K}}^jz(t-T).$$

Reversing the previous operations and performing integration result in

$$z^T(t)\hat{\bar{P}}^jz(t) - z^T(t-T)\hat{\bar{P}}^jz(t-T) + \int_{t-T}^{t} y^T(\tau)Q_yy(\tau)d\tau = \int_{t-T}^{t} z^T(\tau)\hat{\bar{H}}^jz(\tau)d\tau - 2\int_{t-T}^{t}\big(Ru(\tau)\big)^T\hat{\bar{K}}^jz(\tau)d\tau. \qquad (2.83)$$

Comparing (2.83) with (2.77), we have P̄ˆ j = P̄ j , H̄ˆ j = H̄ j and K̄ˆ j = K̄ j . This
establishes the bias-free property of Algorithm 2.8.

2.4.7 Numerical Examples

Example 2.6 (A Power System) We test the output feedback RL scheme on the load
frequency control of power systems [121]. Although power systems are nonlinear,
a linear model can be employed to develop the optimal controllers for operation
under the normal conditions. The main difficulty arises from determining the plant
parameters in order to design an optimal controller. This motivates the use of model-
free optimal control methods. We use the policy iteration algorithms (Algorithms 2.5
and 2.7) as the system is open-loop stable.
The nominal system model parameters corresponding to (2.45) are

$$A = \begin{bmatrix} -0.0665 & 8 & 0 & 0 \\ 0 & -3.663 & 3.663 & 0 \\ -6.86 & 0 & -13.736 & -13.736 \\ 0.6 & 0 & 0 & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 \\ 0 \\ 13.736 \\ 0 \end{bmatrix}, \qquad C = \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix}.$$

We choose the performance index parameters as Qy = 1 and R = 1. The eigenvalues of the user-defined filter dynamics (equivalently, of A + LC) are all placed at −1. The optimal control parameters as obtained by solving the ARE (2.51) are

$$P^* = \begin{bmatrix} 0.3135 & 0.2864 & 0.0509 & 0.1912 \\ 0.2864 & 0.4156 & 0.0903 & 0.0789 \\ 0.0509 & 0.0903 & 0.0210 & 0 \\ 0.1912 & 0.0789 & 0 & 1.1868 \end{bmatrix},$$
$$K^* = \begin{bmatrix} -0.6994 & -1.2404 & -0.2890 & 0 \end{bmatrix},$$
$$W_u = \begin{bmatrix} 0 & 402.5 & 0 & 0 \\ 0 & -673.9 & 50.3 & 0 \\ 0 & 1918.5 & -133.6 & 13.7 \\ 1 & 0 & 0 & 0 \end{bmatrix}, \qquad W_y = \begin{bmatrix} -240.5 & -200.4 & -45.5 & -13.5 \\ 404.3 & 312.1 & 61.2 & 23.6 \\ -1151.1 & -893.7 & -185.1 & -72.2 \\ 1.9 & 3.5 & 2.4 & 0.6 \end{bmatrix}.$$
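These nominal values of P* and K* can be reproduced with a standard Riccati solver. A minimal check, assuming SciPy is available (the printed entries should match the values above up to rounding):

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[-0.0665, 8.0, 0.0, 0.0],
              [0.0, -3.663, 3.663, 0.0],
              [-6.86, 0.0, -13.736, -13.736],
              [0.6, 0.0, 0.0, 0.0]])
B = np.array([[0.0], [0.0], [13.736], [0.0]])
C = np.array([[1.0, 0.0, 0.0, 0.0]])
Qy = np.array([[1.0]])
R = np.array([[1.0]])

Q = C.T @ Qy @ C                       # output weighting mapped to the state
P = solve_continuous_are(A, B, Q, R)   # solves A'P + PA - PBR^{-1}B'P + Q = 0
K = -np.linalg.solve(R, B.T @ P)       # u = Kx with K = -R^{-1}B'P
print(np.round(P, 4), np.round(K, 4), sep="\n")
```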

The control parameters are initialized to zero. We choose 100 learning intervals
of period T = 0.05s. The exploration condition is met by injecting sinusoidal
signals of different frequencies in the control. We compare the results of both the
state feedback PI algorithm (Algorithm 2.5) and the output feedback PI algorithm
(Algorithm 2.7). The state feedback results are shown in Figs. 2.23 and 2.24, and
the output feedback results are shown in Figs. 2.25 and 2.26. It can be seen that,
similar to the state feedback results, the output feedback parameters also converge
to their nominal values with performance quite close to that of the state feedback
case. However, the output feedback PI Bellman equation contains more unknown
terms than the state feedback PI Bellman equation. As a result, it takes the output
feedback algorithm longer to converge. It is worth noting that these results are
obtained without the use of a discounting factor. Furthermore, no exploration bias
is observed from the use of exploration signals, and these exploration signals can be
removed once the convergence criterion has been met.

Fig. 2.23 Example 2.6: State trajectory of the closed-loop system under state feedback (Algorithm 2.5) (states x1, x2, x3, x4 versus time (sec))

Fig. 2.24 Example 2.6: Convergence of the parameter estimates under state feedback (Algorithm 2.5)

Fig. 2.25 Example 2.6: State trajectory of the closed-loop system under output feedback (Algorithm 2.7) (states x1, x2, x3, x4 versus time (sec))

Fig. 2.26 Example 2.6: Convergence of the parameter estimates under output feedback (Algorithm 2.7)

Example 2.7 (An Unstable System) We now test the output feedback RL scheme on an unstable system. Consider the double integrator system with

 
$$A = \begin{bmatrix} 0 & 1 \\ 0 & 0 \end{bmatrix}, \qquad B = \begin{bmatrix} 0 \\ 1 \end{bmatrix}, \qquad C = \begin{bmatrix} 1 & 0 \end{bmatrix}.$$

Both state feedback and output feedback VI algorithms (Algorithms 2.6 and 2.8) are evaluated. We choose the performance index parameters as Qy = 1 and R = 1. The eigenvalues of the user-defined filter dynamics (equivalently, of A + LC) are all placed at −2. The optimal control parameters as found by solving the ARE (2.51) are

$$P^* = \begin{bmatrix} 1.4142 & 1.0000 \\ 1.0000 & 1.4142 \end{bmatrix}, \qquad K^* = \begin{bmatrix} -1.0000 & -1.4142 \end{bmatrix},$$
$$W_u = \begin{bmatrix} 1 & 0 \\ 4 & 1 \end{bmatrix}, \qquad W_y = \begin{bmatrix} 4 & 4 \\ 0 & 4 \end{bmatrix}.$$
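For this example the ARE solution has the simple closed form P11 = P22 = √2, P12 = 1, which a numerical solver confirms; a quick check assuming SciPy:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
C = np.array([[1.0, 0.0]])

P = solve_continuous_are(A, B, C.T @ C, np.eye(1))   # Qy = 1, R = 1
K = -B.T @ P                                         # K = -R^{-1}B'P with R = 1
print(np.round(P, 4), np.round(K, 4), sep="\n")      # expect [[1.4142, 1], [1, 1.4142]] and [[-1, -1.4142]]
```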

The initial controller parameters are set to zero. We choose 20 learning intervals of period T = 0.05s. We also choose the step size as

$$\epsilon_j = \big(j^{0.2} + 5\big)^{-1}, \quad j = 0, 1, 2, \ldots,$$

and the set

$$B_q = \big\{\bar{P} \in \mathcal{P}_+^4 : |\bar{P}| \le 800(q+1)\big\}, \quad q = 0, 1, 2, \ldots$$

Fig. 2.27 Example 2.7: State trajectory of the closed-loop system under state feedback (Algorithm 2.6) (states x1, x2 versus time (sec))

The exploration condition is met by injecting sinusoidal signals of different


frequencies in the control. We compare here both the state feedback VI algorithm
(Algorithm 2.6) and the output feedback VI algorithm (Algorithm 2.8). The state
feedback results are shown in Figs. 2.27 and 2.28 and the output feedback results
are shown in Figs. 2.29 and 2.30. It can be seen that, as with the state feedback
results, the output feedback parameters also converge to their nominal values with
performance quite close to that of the state feedback. However, it should be noted
that the output feedback VI Bellman equation contains more unknown terms than
the state feedback VI Bellman equation. As a result, it takes the output feedback
algorithm longer to converge. Furthermore, the exploration noise can be safely
removed after the initial data collection phase.
It can also be seen that, compared with the PI algorithms (Algorithms 2.5 and 2.7),
the VI algorithms (Algorithms 2.6 and 2.8) relax the requirement of a stabilizing
initial policy. However, the VI algorithms take more iterations to converge. Thus, as
demonstrated in Examples 2.6 and 2.7, depending on the particular application, one
algorithm can be better suited than the other.

Fig. 2.28 Example 2.7: Convergence of the parameter estimates under state feedback (Algorithm 2.6)

Fig. 2.29 Example 2.7: State trajectory of the closed-loop system under output feedback (Algorithm 2.8) (states x1, x2 versus time (sec))

2.5 Summary

In this chapter, a new output feedback Q-learning scheme was first presented to solve the LQR problem for discrete-time systems. An embedded observer based approach was presented that enables learning and control using output feedback without requiring the knowledge of the system dynamics. To this end, we presented a parameterization of the state in terms of the past input-output data. A new LQR
Q-function was presented that uses only the input-output data instead of the full
state. This Q-function was used to derive an equivalent output feedback LQR
controller. We presented output feedback Q-learning algorithms of policy iteration
and value iteration, where the latter does not require a stabilizing initial controller.
The proposed scheme has the advantage that it does not incur bias in the parameter
estimates. As a result, the need of using a discounted cost function has been obviated
and closed-loop stability is guaranteed. It was shown that the output feedback Q-
learning algorithms converge to the nominal solution as obtained by solving the
LQR ARE. A comprehensive simulation study was conducted that validates the
proposed designs.
The formulation in the case of continuous-time dynamics was found to be
quite different from its discrete-time counterpart. This was due to the fact that the
original formulation of reinforcement learning was developed for MDPs.

Fig. 2.30 Example 2.7: Convergence of the parameter estimates under output feedback (Algorithm 2.8)

Recent developments in integral reinforcement learning have enabled the design of the output
feedback algorithms presented in this chapter. In [142], a static output feedback
scheme was proposed to solve the continuous-time counterpart of this problem,
where some partial information of the system dynamics is needed. Furthermore,
this method imposes an additional condition of static output feedback stabilizability.
Later on, this stringent condition was relaxed in [80], where a completely model-
free output feedback solution to the continuous-time LQR problem was proposed.
However, similar to its discrete-time model-free counterpart, the work had to resort
to a discounted cost function. In particular, there is an upper bound (lower bound) on
the feasible discounting factor for continuous-time (discrete-time) systems, which
can only be precisely computed using the system model [80, 88].
In this chapter, a filtering based observer approach was presented for param-
eterizing the state in terms of the filtered inputs and outputs. Based on this
parameterization, we derived two new output feedback learning equations. We
considered both policy iteration and value iteration algorithms to learn the optimal
solution of the LQR problem, based on the system output measurements and
without using any system model information. Compared to previous RL works,
the proposed scheme is completely in continuous-time. Moreover, for the value
iteration algorithm, the need of a stabilizing output feedback initial policy was
obviated, which is useful for the control design of unknown unstable systems.
It was shown that the proposed scheme is not prone to exploration bias and
thus circumvents the need of employing a discounting factor. Under the proposed
scheme, the closed-loop stability is guaranteed and the resulting output feedback
control parameters converge to the optimal solution as obtained by solving the LQR
ARE. A comprehensive simulation study was carried out to validate the presented
results.

2.6 Notes and References

This chapter presented the designs of model-free output feedback reinforcement


learning algorithms for both discrete-time and continuous-time linear systems.
These techniques build upon the previous state feedback results. A PI based Q-
learning scheme, which requires a stabilizing initial control, was proposed in [13]
to solve the LQR problem. Later in [53], a VI based Q-learning scheme was adopted,
where this requirement was obviated. However, full state feedback was needed in
these earlier works. We discussed two popular approaches in the literature that
have been used to perform RL control using input-output data instead of state
feedback. The use of neural networks is convincing for state estimation because
they have already been successfully used in actor-critic learning. Recent results have
confirmed the efficacy of this approach in solving a variety of RL control problems.
Interested readers can refer to [85] where neural networks were employed for state
estimation, critic and actor networks were used to achieve near optimal solution,
and ultimate boundedness of the estimation errors was shown. In contrast to neural
network based designs, this chapter is based on a state parameterization based
approach, which has the advantage of not requiring an observer. Instead of focusing
on state estimation, this approach is to learn the optimal output feedback control
sequence without requiring state estimation. Consequently, we do not have to deal
with the errors in learning that arise as a result of state estimation. In the discrete-
time setting, we showed that the state parameterization is obtained in terms of
Markov parameters [1]. It should be noted that the approach presented in this chapter
does not involve the identification of these Markov parameters. Furthermore, it
can be readily used in conjunction with both value function approximation (VFA)
methods proposed in the earlier work [56] or with our presented Q-learning method.
This chapter also discussed the continuous-time formulation, which turns
out to be fundamentally different from the discrete-time setting. The key difficulty
in the continuous-time setting is that the state parameterization is not finite time.
Motivated by the discrete-time results, [80] presented a delayed input-output state
parameterization. However, such a parameterization assumes that the control is
in the form of feedback, which neglects the effect of the exploration signal and
ultimately causes bias. A discounting factor is introduced to help to overcome
the problem of the excitation noise bias. As a result, the resulting controller is
not optimal as it differs from the solution of the original undiscounted problem.
Furthermore, the stability analysis carried out in [88] has shown that the discounted
cost function in general cannot guarantee stability. Closed-loop stability can be
achieved only when the value of the discounting factor is chosen above a lower
bound. However, determination of this bound requires the knowledge of the system
dynamics, which is assumed to be unknown. In contrast, this chapter presented an
observer based formulation to construct the state parameterization using a filtering
based approach. The discrete-time equivalent of this parameterization was also
shown. The presented solution does not lead to bias and eliminates the need of the
discounting factor.

The presentation in this chapter expands on our results on the discrete-time and
continuous-time output feedback LQR problems presented in [91, 94] and [96, 101],
respectively, by providing a new perspective on the rank conditions of the state
parameterizations and their due roles in the convergence of the output feedback
learning algorithms.
Chapter 3
Model-Free H∞ Disturbance Rejection
and Linear Quadratic Zero-Sum Games

3.1 Introduction

Disturbance rejection is a core problem in control theory that has long been
recognized as a motivation of feedback control. Control designs capable of rejecting
disturbances are of strong theoretical and practical interest because control systems
are often subject to external disturbances. Addressing the presence of external
disturbances is of utmost importance as it would otherwise cause failure to meet
control objectives and may even result in instabilities. Owing to the significance of
this problem, the control literature has witnessed a vast variety of approaches to
addressing this issue under the paradigm of robust control. Among such approaches
is H∞ optimal control. A major portion of the robust control literature is dedicated to
this design methodology owing to its versatility in designing worst-case controllers
for a large class of dynamic systems that are prone to deleterious effects of external
disturbances.
Early designs of the H∞ control were formulated in the frequency domain based
on the sensitivity analysis and optimization techniques using the H∞ operator
norm. The frequency domain approach was found to be relatively complicated
as it involved advanced mathematical tools based on operator theory and spectral
factorization. Later developments, however, presented designs in the time-domain,
where the key machinery involved was based on the more familiar algebraic Riccati
equations similar to the ones found in the popular linear quadratic regulation (LQR)
framework. The time-domain approach also led to further extensions to cater for
more general scenarios such as those involving nonlinear dynamics, time-varying
and finite horizon problems.
A striking feature in the time-domain formulation of the H∞ problem is that
it matches well with the formulation of the zero-sum game problem found in
game theory. The framework of game theory provides strong mathematical tools to
describe situations involving strategic decision makers. These situations are known
as games and the rational decision makers are referred to as the players. Each player


in the game has its own interest, which is represented in the form of its own objective
function. Depending on the nature of the game, the objective of each player could
be in conflict with the interests of the other players. The fundamental idea in game
theory is that the decision of each player not only affects its own outcomes but also
affects the outcomes of the other players. Game theory, which has been successfully
applied in diverse areas such as social science, economics, political science, and
computer science, can be used to analyze many real-world scenarios.
The connection between games and the H∞ control problem stems from the
nature of the H∞ control problem, which is formulated as a minimax dynamic
optimization problem similar to the zero-sum game problem by considering the
controller and the disturbance as two independent players who have competing, in
fact, opposite objectives. That is, the controller can be considered as a minimizing
player who minimizes a quadratic cost similar to the one encountered in the LQR
problem discussed in Chap. 2. On the other hand, different from the LQR problem,
the disturbance acts as an independent agent that tries to have a negative impact
on the control performance by maximizing the cost. As the objective functions of
the controller and the disturbance are exactly opposite, the sum of their respective
functions is identically zero, and hence the name zero-sum game. The Bellman
dynamic programming principle plays a fundamental role in solving problems in
game theory. The key step in solving these problems using dynamic programming
involves finding the solution to the Hamilton-Jacobi-Isaacs (HJI) equation,

$$0 = \max_{w\in W}\min_{u\in U}\left[r\big(x(t), u(t), w(t)\big) + \left(\frac{\partial V}{\partial x}\right)^T f\big(x(t), u(t), w(t)\big)\right], \qquad (3.1)$$

where w(t) is the maximizing player or disturbance that influences the game
dynamics f (x(t), u(t), w(t)) as well as the cost utility r (x(t), u(t), w(t)). It is
worthwhile to note that the Hamilton-Jacobi-Isaacs equation (3.1) is a generalization
of the Hamilton-Jacobi-Bellman PDE introduced in Chap. 1, and, as a result, its
solution is generally intractable. The discrete-time version of the HJI equation is the
following nonlinear difference equation often referred to as the Isaacs equation,

$$V^*(x_k) = \max_{w\in W}\min_{u\in U}\big[r(x_k, u_k, w_k) + V^*(x_{k+1})\big], \qquad (3.2)$$

which is a generalization of the Bellman optimality equation introduced in Chap. 1.


Recall that these are nonlinear functional equations that in general cannot be solved
analytically. For a linear quadratic game represented by a linear differential or
difference equation with a quadratic cost function, hereafter sometimes referred to
as a linear quadratic zero sum game or a zero-sum game for simplicity, its solution
boils down to finding the solution of the so-called (generalized) game algebraic
Riccati equation (GARE). This GARE is solved offline based on the complete
information of the system dynamics. Iterative algorithms such as policy iteration
and value iteration are commonly employed to recursively solve a GARE.

In the previous chapters we developed model-based and model-free iterative state


feedback and output feedback algorithms for solving the LQR problems. These
algorithms were based on the iterative techniques of policy iteration (PI) and value
iteration (VI) that would find the solution of the LQR ARE by iteratively solving a
Lyapunov equation in the model-based scenario or a Bellman learning equation in
the model-free case. Q-learning and integral reinforcement learning were the two
key learning algorithms to solve the discrete-time and continuous-time versions
of the problem. In this chapter we will extend the previous designs to introduce
disturbance rejection capability in both discrete-time and continuous-time designs.
One important difference between the LQR ARE and the GARE is that the
latter involves a sign indefinite term resulting from the presence of the disturbance.
Therefore, additional solvability conditions beyond the standard controllability and
observability conditions are needed in solving the linear quadratic zero-sum game
or the H∞ problem (both terms will be used interchangeably). First, model-based
iterative algorithms of policy iteration and value iteration will be presented for
solving the zero-sum game. Then, a learning based model-free solutions to the
problem will be presented based on the full state feedback. This will solve the
full information H∞ control problem without requiring the model information.
Finally, a solution to the partial information H∞ control problem will be presented
in which only the measurement of the output, instead of the full state, is available
for feedback. Extended parameterization of the system state will be presented that
will allow us to develop output feedback learning equations and find optimal output
feedback strategies for solving the zero-sum game and the associated H∞ control
problem.

3.2 Literature Review

Recently, RL based methods have been successfully applied to arrive at a model-


free solution to the zero-sum game problem. A Q-learning solution to the linear
discrete-time quadratic zero-sum game was first developed in [5] and its application
to the H∞ control problem was shown. Later, the continuous-time zero-sum game
problem was solved using partially model-free [120] and completely model-free
[60] integral reinforcement learning methods. Recently, an off-policy algorithm was
proposed to solve the discrete-time linear quadratic zero-sum game problem in [48],
where the issue of excitation noise bias was also addressed. In all these works, the
measurement of the complete state vector is required. One approach to dealing with
the output feedback H∞ control and the zero-sum game problems was presented
in [83], which employed static output feedback. However, this approach requires
the system to be static output feedback stabilizable, which is a more stringent
condition than the dynamic output feedback stabilizability. Furthermore, a separate
state estimator is still needed in this approach during the learning phase since the
RL learning equations themselves are not necessarily static output based, thereby,
requiring state estimates.

Although dynamic output feedback RL algorithms based on value functions [56]


can be used to solve the zero-sum game and the H∞ problems, these methods
are prone to the excitation noise bias as discussed in Chap. 2. Consequently, a
discounting factor is needed for convergence to the sub-optimal solution. More
importantly, the closed-loop stability may not be ensured when a discounting factor
is involved.

3.3 Discrete-Time Zero-Sum Game and H∞ Control Problem

Consider a discrete-time linear system in the state space form,

$$x_{k+1} = Ax_k + Bu_k + Ew_k, \qquad y_k = Cx_k, \qquad (3.3)$$

where xk ∈ Rn is the state, uk ∈ Rm1 is the input, wk ∈ Rm2 is the disturbance


input, and yk ∈ Rp is the output. We assume that the pair (A, B) is controllable and
the pair (A, C) is observable. The zero-sum game can be formulated as a minimax
problem with the optimal value function of the form [6],



$$V(x_k) = \min_{u_i}\max_{w_i}\sum_{i=k}^{\infty} r(x_i, u_i, w_i), \qquad (3.4)$$

where the control input u acts as the minimizing player and the disturbance input
w acts as the maximizing player. For the H∞ control problem, the utility function
r(xi , ui , wi ) takes the quadratic form,

r(xi , ui , wi ) = yiT Qy yi + uTi ui − γ 2 wiT wi , (3.5)

where Qy ≥ 0 is the user-defined weighting matrix and γ is an upper bound on the


desired l2 gain from the disturbance to the performance measure ykT Qy yk + uTk uk .
The aim of the H∞ control problem is to find the optimal control policy u∗k such that
the closed-loop system is asymptotically stable when wk = 0 and satisfies the H∞
constraint with the following disturbance attenuation condition,

$$\sum_{i=0}^{\infty}\big(y_i^TQ_yy_i + u_i^Tu_i\big) \le \gamma^2\sum_{i=0}^{\infty} w_i^Tw_i, \quad w \in l_2[0,\infty), \qquad (3.6)$$

for a given γ ≥ 0. The minimum value of γ , denoted as γ ∗ , for which the H∞


control problem (3.4) has a solution (the optimal control u and the worst-case

disturbance w) is referred to as the H∞ infimum. This H∞ control problem is often


referred to as the H∞ disturbance decoupling problem.
In the H∞ control problem, it is required that (3.6) holds for all disturbances.
Thus, the H∞ control problem is a “worst-case” control problem and can be
solved as a linear quadratic zero-sum game. Under the additional assumption of the observability of $(A, \sqrt{Q})$, where $\sqrt{Q}^T\sqrt{Q} = Q$, $Q = C^TQ_yC$, the problem
is solvable if there exists a unique positive definite matrix P ∗ that satisfies the
following game algebraic Riccati equation (GARE),

$$P = A^TPA + Q - \big[A^TPB \;\; A^TPE\big]\begin{bmatrix} I + B^TPB & B^TPE \\ E^TPB & E^TPE - \gamma^2I \end{bmatrix}^{-1}\begin{bmatrix} B^TPA \\ E^TPA \end{bmatrix}, \qquad (3.7)$$

and the inequality

I − γ −2 E T P E > 0. (3.8)

Then, given that the system dynamics is completely known and the full state xk is
available for feedback, there exist a unique optimal stabilizing controller u∗k = K ∗ xk
and a unique worst-case disturbance wk∗ = G∗ xk that solve the linear quadratic zero-
sum game [6], where

$$K^* = \Big(I + B^TPB - B^TPE\big(E^TPE - \gamma^2I\big)^{-1}E^TPB\Big)^{-1}\Big(B^TPE\big(E^TPE - \gamma^2I\big)^{-1}E^TPA - B^TPA\Big), \qquad (3.9)$$
$$G^* = \Big(E^TPE - \gamma^2I - E^TPB\big(I + B^TPB\big)^{-1}B^TPE\Big)^{-1}\Big(E^TPB\big(I + B^TPB\big)^{-1}B^TPA - E^TPA\Big). \qquad (3.10)$$
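When a candidate solution P of the GARE is available, the gains (3.9)-(3.10) and the feasibility condition (3.8) translate directly into code. The sketch below is illustrative only (the function names are not from the text) and simply transcribes the two formulas.

```python
import numpy as np

def saddle_point_gains(P, A, B, E, gamma):
    """Gains K*, G* from (3.9)-(3.10) for a given GARE solution P."""
    m1, m2 = B.shape[1], E.shape[1]
    Huu = np.eye(m1) + B.T @ P @ B                  # I + B'PB
    Hww = E.T @ P @ E - gamma**2 * np.eye(m2)       # E'PE - gamma^2 I
    Huw = B.T @ P @ E                               # B'PE
    K = np.linalg.solve(Huu - Huw @ np.linalg.solve(Hww, Huw.T),
                        Huw @ np.linalg.solve(Hww, E.T @ P @ A) - B.T @ P @ A)
    G = np.linalg.solve(Hww - Huw.T @ np.linalg.solve(Huu, Huw),
                        Huw.T @ np.linalg.solve(Huu, B.T @ P @ A) - E.T @ P @ A)
    return K, G

def attenuation_feasible(P, E, gamma):
    """Check condition (3.8): I - gamma^{-2} E'PE > 0."""
    M = np.eye(E.shape[1]) - E.T @ P @ E / gamma**2
    return bool(np.all(np.linalg.eigvalsh((M + M.T) / 2) > 0))
```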

3.3.1 Model-Based Iterative Algorithms

The GARE is a nonlinear equation and is difficult to solve analytically. In the


following we discuss how iterative algorithms can be developed to solve the GARE
in order to find the optimal strategies for the zero-sum game. The Bellman equation
for the zero-sum game is given by

V (xk ) = r(xk , Kxk , Gxk ) + V (xk+1 ), (3.11)



which, for the quadratic function (3.5), can be expressed in terms of the quadratic
value function V (xk ) = xkT P xk as

$$x_k^TPx_k = x_k^TC^TQ_yCx_k + u_k^Tu_k - \gamma^2w_k^Tw_k + x_{k+1}^TPx_{k+1}.$$
The policies in this case are uk = Kxk and wk = Gxk , which gives us

$$x_k^TPx_k = x_k^TQx_k + x_k^TK^TKx_k - \gamma^2x_k^TG^TGx_k + x_k^T(A + BK + EG)^TP(A + BK + EG)x_k,$$

which, in turn, leads to

(A + BK + EG)T P (A + BK + EG) − P + Q + K T K − γ 2 GT G = 0. (3.12)

That is, the Bellman equation for the zero-sum game actually corresponds to a
Lyapunov equation, which is similar to the connection that exists between the LQR
Bellman equation and a Lyapunov equation. This observation suggests that we can
apply iterations on the Lyapunov equation in the same way as they are applied on
the Bellman equation in Chap. 1. In such a case, Lyapunov iterations under the
H∞ control design conditions would converge to the solution of the GARE. A
Newton’s iteration method is often used in the literature that does exactly what a
policy iteration algorithm does, and is presented in Algorithm 3.1.
Algorithm 3.1 finds the solution of the GARE (3.7) iteratively. Instead of solving
the GARE, which is a nonlinear equation, Algorithm 3.1 only involves solving
Lyapunov equations, which are linear in the unknown matrix P j . As is the case with
other PI algorithms, Algorithm 3.1 also needs to be initialized with a stabilizing
policy. Such initialization is essential because the policy evaluation step involves
finding the positive definite solution of the Lyapunov equation, which requires the
feedback gain to be stabilizing. The algorithm is known to converge with a quadratic
convergence rate under the stated conditions.
We can also apply value iteration to find the solution of the GARE. Similar to
the LQR value iteration algorithm presented in Chap. 1, we perform recursions on
the GARE to carry out value iterations on the matrix P for value updates. That
is, instead of solving the Lyapunov equation, we only perform recursions, which
are computationally faster. The policy update step still remains the same as in
Algorithm 3.1. Under the solvability conditions of the linear quadratic zero-sum
game, the value iteration algorithm for the zero-sum game, Algorithm 3.2, converges
to the solution of the GARE.
The iterative algorithms provide a numerically feasible way of solving the
GARE. The fixed-point property of the Bellman and GARE equations enables us
to perform successive approximation of the solution in a way similar to the way
for solving the LQR problem. This successive approximation property is inherited
from the dynamic programming approach that has enabled us to break a complex
optimization problem into smaller ones.

Algorithm 3.1 Model-based policy iteration algorithm for the discrete-time zero-
sum game
input: system dynamics
output: P ∗ , K ∗ and G∗
1: initialize. Select an admissible policy K 0 such that A + BK 0 is Schur stable. Select G0 = 0.
Set j ← 0.
2: repeat
3: policy evaluation. Solve the following Lyapunov equation for P^j,

$$\big(A + BK^j + EG^j\big)^TP^j\big(A + BK^j + EG^j\big) - P^j + Q + (K^j)^TK^j - \gamma^2(G^j)^TG^j = 0.$$

4: policy update. Find improved policies as

$$K^{j+1} = \Big(I + B^TP^jB - B^TP^jE\big(E^TP^jE - \gamma^2I\big)^{-1}E^TP^jB\Big)^{-1}\Big(B^TP^jE\big(E^TP^jE - \gamma^2I\big)^{-1}E^TP^jA - B^TP^jA\Big),$$
$$G^{j+1} = \Big(E^TP^jE - \gamma^2I - E^TP^jB\big(I + B^TP^jB\big)^{-1}B^TP^jE\Big)^{-1}\Big(E^TP^jB\big(I + B^TP^jB\big)^{-1}B^TP^jA - E^TP^jA\Big).$$

5: j ← j + 1
6: until ‖P^j − P^{j−1}‖ < ε for some small ε > 0.
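A direct implementation of Algorithm 3.1 needs only a discrete Lyapunov solver for the policy-evaluation step; the policy-update step reuses the expressions of (3.9)-(3.10) with P^j in place of P. A minimal sketch, assuming SciPy and a stabilizing initial gain K0:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def zero_sum_pi(A, B, E, Q, gamma, K0, tol=1e-9, max_iter=200):
    """Algorithm 3.1: Lyapunov-equation policy evaluation + policy update."""
    n, m1 = B.shape
    m2 = E.shape[1]
    K, G = K0, np.zeros((m2, n))
    P_prev = np.zeros((n, n))
    for _ in range(max_iter):
        Acl = A + B @ K + E @ G
        Qcl = Q + K.T @ K - gamma**2 * G.T @ G
        P = solve_discrete_lyapunov(Acl.T, Qcl)     # solves Acl'P Acl - P + Qcl = 0
        Huu = np.eye(m1) + B.T @ P @ B
        Hww = E.T @ P @ E - gamma**2 * np.eye(m2)
        Huw = B.T @ P @ E
        K = np.linalg.solve(Huu - Huw @ np.linalg.solve(Hww, Huw.T),
                            Huw @ np.linalg.solve(Hww, E.T @ P @ A) - B.T @ P @ A)
        G = np.linalg.solve(Hww - Huw.T @ np.linalg.solve(Huu, Huw),
                            Huw.T @ np.linalg.solve(Huu, B.T @ P @ A) - E.T @ P @ A)
        if np.max(np.abs(P - P_prev)) < tol:
            break
        P_prev = P
    return P, K, G
```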

However, these methods also inherit the curse of modeling associated with dynamic programming as these techniques


employ the dynamic model of the system to solve the Lyapunov equation or the
recursive GARE. Therefore, they can lead to sub-optimal or even destabilizing
solutions in the presence of modeling uncertainties. The remaining portion of this
subsection will be dedicated to developing techniques that circumvent the issue of
modeling by designing data-driven algorithms.
We will first introduce state feedback Q-learning schemes for solving the
zero-sum game and the associated H∞ control problem without requiring model
information. To this end, we present the Q-function associated with the linear
quadratic zero-sum game and the H∞ control problem. Consider the cost function
given by

 
V (xk ) = yiT Qy yi + uTi ui − γ 2 wiT wi . (3.13)
i=k

Algorithm 3.2 Model-based value iteration algorithm for the discrete-time zero-
sum game
input: system dynamics
output: P ∗ , K ∗ and G∗
1: initialize. Select an arbitrary policy K 0 , G0 = 0, and a value function matrix P 0 > 0. Set
j ← 0.
2: repeat
3: value update. Perform the following recursion,

$$P^{j+1} = A^TP^jA + Q - \big[A^TP^jB \;\; A^TP^jE\big]\begin{bmatrix} I + B^TP^jB & B^TP^jE \\ E^TP^jB & E^TP^jE - \gamma^2I \end{bmatrix}^{-1}\begin{bmatrix} B^TP^jA \\ E^TP^jA \end{bmatrix}.$$

4: policy improvement. Find improved policies as

$$K^{j+1} = \Big(I + B^TP^{j+1}B - B^TP^{j+1}E\big(E^TP^{j+1}E - \gamma^2I\big)^{-1}E^TP^{j+1}B\Big)^{-1}\Big(B^TP^{j+1}E\big(E^TP^{j+1}E - \gamma^2I\big)^{-1}E^TP^{j+1}A - B^TP^{j+1}A\Big),$$
$$G^{j+1} = \Big(E^TP^{j+1}E - \gamma^2I - E^TP^{j+1}B\big(I + B^TP^{j+1}B\big)^{-1}B^TP^{j+1}E\Big)^{-1}\Big(E^TP^{j+1}B\big(I + B^TP^{j+1}B\big)^{-1}B^TP^{j+1}A - E^TP^{j+1}A\Big).$$

5: j ← j + 1
6: until ‖P^j − P^{j−1}‖ < ε for some small ε > 0.
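The value-update of Algorithm 3.2 amounts to iterating the GARE map itself, which requires only the inversion of the indicated 2x2 block matrix. A minimal sketch (illustrative names; convergence is assumed to hold under the stated solvability conditions):

```python
import numpy as np

def zero_sum_vi(A, B, E, Q, gamma, tol=1e-10, max_iter=20000):
    """Algorithm 3.2: iterate the GARE map starting from P^0 = I."""
    n, m1 = B.shape
    m2 = E.shape[1]
    P = np.eye(n)
    for _ in range(max_iter):
        M = np.block([[np.eye(m1) + B.T @ P @ B, B.T @ P @ E],
                      [E.T @ P @ B, E.T @ P @ E - gamma**2 * np.eye(m2)]])
        left = np.hstack([A.T @ P @ B, A.T @ P @ E])     # [A'PB  A'PE]
        rhs = np.vstack([B.T @ P @ A, E.T @ P @ A])      # [B'PA; E'PA]
        P_next = A.T @ P @ A + Q - left @ np.linalg.solve(M, rhs)
        if np.max(np.abs(P_next - P)) < tol:
            return P_next
        P = P_next
    return P
```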

Under the control policy uk = Kxk and the disturbance policy wk = Gxk , the total
cost incurred when starting with any state xk is quadratic in the state [6], that is,

V (xk ) = xkT P xk , (3.14)

for some positive definite matrix P ∈ Rn×n . Motivated by the Bellman optimality
principle, Equation (3.13) can be written recursively as

V (xk ) = r(xk , Kxk , Gxk ) + V (xk+1 ), (3.15)

where V (xk+1 ) is the cost of following policies K and G in all future time indices.

Next, we use (3.15) to define a Q-function as

Q(xk , uk , wk ) = r(xk , uk , wk ) + V (xk+1 ), (3.16)

which is the sum of the one-step cost of taking an arbitrary action uk under some
disturbance wk from state xk and the total cost that would be incurred if the policies K
and G are followed at time index k + 1 and all subsequent time indices. Note that
the Q-function (3.16) is similar to the cost function (3.15) but is explicit in xk , uk ,
and wk .
For the zero-sum game, we have a quadratic cost and the corresponding Q-
function can be expressed as

$$Q(x_k, u_k, w_k) = x_k^TQx_k + u_k^Tu_k - \gamma^2w_k^Tw_k + x_{k+1}^TPx_{k+1}$$
$$= x_k^TQx_k + u_k^Tu_k - \gamma^2w_k^Tw_k + (Ax_k + Bu_k + Ew_k)^TP(Ax_k + Bu_k + Ew_k)$$
$$= \begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix}^T\begin{bmatrix} H_{xx} & H_{xu} & H_{xw} \\ H_{ux} & H_{uu} & H_{uw} \\ H_{wx} & H_{wu} & H_{ww} \end{bmatrix}\begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix} = z_k^THz_k, \qquad (3.17)$$

where

Hxx = Q + AT P A ∈ Rn×n ,
Hxu = AT P B ∈ Rn×m1 ,
Hxw = AT P E ∈ Rn×m2 ,
Hux = B T P A ∈ Rm1 ×n ,
Huu = B T P B + I ∈ Rm1 ×m1 ,
Huw = B T P E ∈ Rm1 ×m2 ,
Hwx = E T P A ∈ Rm2 ×n ,
Hwu = E T P B ∈ Rm2 ×m1 ,
Hww = E T P E − γ 2 I ∈ Rm2 ×m2 .
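The block structure of H makes the relation between the value function and the policies explicit. As a quick numerical sanity check, one can assemble H from a given P and recover the gains by performing the min-max on this quadratic form (the same expressions reappear in the policy update of Algorithm 3.3 below). The following sketch is illustrative and assumes P, A, B, E, Q, and γ are given.

```python
import numpy as np

def q_matrix(P, A, B, E, Q, gamma):
    """Assemble the H matrix of (3.17) from P and the system matrices."""
    n = A.shape[0]
    m1, m2 = B.shape[1], E.shape[1]
    ABE = np.hstack([A, B, E])                    # x_{k+1} = [A B E][x; u; w]
    H = ABE.T @ P @ ABE
    H[:n, :n] += Q                                # Q + A'PA block
    H[n:n+m1, n:n+m1] += np.eye(m1)               # I + B'PB block
    H[n+m1:, n+m1:] -= gamma**2 * np.eye(m2)      # E'PE - gamma^2 I block
    return H

def policies_from_H(H, n, m1):
    """Read off K and G from the blocks of H (as in the update of Algorithm 3.3)."""
    Hux, Huu, Huw = H[n:n+m1, :n], H[n:n+m1, n:n+m1], H[n:n+m1, n+m1:]
    Hwx, Hwu, Hww = H[n+m1:, :n], H[n+m1:, n:n+m1], H[n+m1:, n+m1:]
    K = np.linalg.solve(Huu - Huw @ np.linalg.solve(Hww, Hwu),
                        Huw @ np.linalg.solve(Hww, Hwx) - Hux)
    G = np.linalg.solve(Hww - Hwu @ np.linalg.solve(Huu, Huw),
                        Hwu @ np.linalg.solve(Huu, Hux) - Hwx)
    return K, G
```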

Given the optimal cost V ∗ , we can compute K ∗ and G∗ . To do so, we define the
optimal Q-function as the cost of executing an arbitrary control uk and disturbance
wk , and then following the optimal policies K ∗ and G∗ , as given by

Q∗ (xk , uk , wk ) = r(xk , uk , wk ) + V ∗ (xk+1 ), (3.18)



and the optimal policies are given by

$$K^*x_k = \arg\min_{u}Q^*(x_k, u_k, w_k), \qquad G^*x_k = \arg\max_{w}Q^*(x_k, u_k, w_k).$$

That is, the optimal policies are obtained by performing the minimization and maximization of (3.18), which in turn can be carried out by simultaneously solving

\frac{\partial Q^*}{\partial u_k} = 0,  \quad  \frac{\partial Q^*}{\partial w_k} = 0,

for u_k and w_k. The result is the same as given in (3.9) and (3.10), which was obtained by solving the GARE.
In the discussion so far we have obtained the form of Q-function for the zero-
sum game. The next logical step would be to develop Q-learning algorithms that
can learn the optimal Q-function.
From (3.15) and the definition of (3.16), we have the following relationship,

Q(x_k, Kx_k, Gx_k) = V(x_k).    (3.19)

In view of (3.19), substitution of Q(x_{k+1}, Kx_{k+1}, Gx_{k+1}) for V(x_{k+1}) in (3.16) results in the recursive relationship

Q(x_k, u_k, w_k) = x_k^T Q x_k + u_k^T u_k - \gamma^2 w_k^T w_k + Q(x_{k+1}, Kx_{k+1}, Gx_{k+1}),    (3.20)

which gives us the following Q-learning Bellman equation,

z_k^T H z_k = y_k^T Q_y y_k + u_k^T u_k - \gamma^2 w_k^T w_k + z_{k+1}^T H z_{k+1}.    (3.21)
The above equation is a key learning equation for solving the model-free zero-
sum games. We present the following state feedback Q-learning algorithms based
on policy iteration and value iteration that employ the Q-learning Bellman equation
(3.20). These algorithms are the extensions of the state feedback LQR Q-learning
algorithms introduced in Chap. 1.
Algorithm 3.3 is a policy iteration algorithm for solving the zero-sum game
without requiring the knowledge of the system dynamics. It is an extension of the
LQR Q-learning policy iteration algorithm to the case when two decision makers,
instead of a single one, are involved. An interesting feature of this algorithm is that it
updates the policies of the two players simultaneously based on a single Q-learning
equation. The players (or the agents corresponding to the control and disturbance)
Algorithm 3.3 Q-learning policy iteration algorithm for the zero-sum game
input: input-state data
output: P^*, K^* and G^*
1: initialize. Apply a stabilizing policy u_k = K^0 x_k + n_k and w_k = ν_k with n_k and ν_k being the exploration signals. Set G^0 = 0 and j ← 0.
2: repeat
3: policy evaluation. Solve the following Bellman equation for H^j,

z_k^T H^j z_k = y_k^T Q_y y_k + u_k^T u_k - \gamma^2 w_k^T w_k + z_{k+1}^T H^j z_{k+1}.

4: policy update. Find improved policies as

K^{j+1} = \left( H_{uu}^j - H_{uw}^j (H_{ww}^j)^{-1} H_{wu}^j \right)^{-1} \left( H_{uw}^j (H_{ww}^j)^{-1} H_{wx}^j - H_{ux}^j \right),
G^{j+1} = \left( H_{ww}^j - H_{wu}^j (H_{uu}^j)^{-1} H_{uw}^j \right)^{-1} \left( H_{wu}^j (H_{uu}^j)^{-1} H_{ux}^j - H_{wx}^j \right).

5: j ← j + 1
6: until ||H^j − H^{j−1}|| < ε for some small ε > 0.
have competing objectives and, therefore, the aim of the algorithm is to find the best-case control and the worst-case disturbance under which the system still satisfies
the H∞ performance criterion (3.6). Note that the policy updates in K j and Gj are
fed back to the policy evaluation step through the variables uk+1 = Kxk+1 and
wk+1 = Gxk+1 present in zk+1 . Variables uk and wk do not necessarily follow the
policies K and G. They follow from the definition of the Q-function. As a policy
iteration algorithm, Algorithm 3.3 requires a stabilizing initial control policy K 0 .
Subsequent iterations of these steps have been shown [60] to converge to the optimal
cost function matrix H ∗ and the optimal strategies K ∗ and G∗ under the solvability
conditions for the zero-sum game.
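A minimal sketch of the simultaneous policy-update step shared by Algorithms 3.3 and 3.4 is given below. It operates directly on the blocks of a learned matrix H and assumes the ordering z_k = [x_k; u_k; w_k] with dimensions n, m_1, and m_2, so the slicing indices are tied to that assumed ordering.

```python
import numpy as np

def update_policies(H, n, m1, m2):
    """Extract K^{j+1} and G^{j+1} from the blocks of H (z_k = [x_k; u_k; w_k] assumed)."""
    Hux = H[n:n+m1, :n]
    Huu = H[n:n+m1, n:n+m1]
    Huw = H[n:n+m1, n+m1:]
    Hwx = H[n+m1:, :n]
    Hwu = H[n+m1:, n:n+m1]
    Hww = H[n+m1:, n+m1:]
    K = np.linalg.solve(Huu - Huw @ np.linalg.solve(Hww, Hwu),
                        Huw @ np.linalg.solve(Hww, Hwx) - Hux)
    G = np.linalg.solve(Hww - Hwu @ np.linalg.solve(Huu, Huw),
                        Hwu @ np.linalg.solve(Huu, Hux) - Hwx)
    return K, G
```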
A value iteration algorithm for the model-free zero-sum game has also been developed in the literature, which relaxes the requirement of the knowledge of a stabilizing initial policy K^0. The algorithm is recalled in Algorithm 3.4. Similar
to the Q-learning value iteration for the LQR problem, this algorithm recursively
updates the value matrix H towards its optimal value. The policies of the players
are updated in the same fashion as in Algorithm 3.3.
An important consideration in these model-free algorithms is that they need
information of the full state xk , which is not available in our problem setting. To
circumvent this situation, we will next present a state reconstruction technique that
employs input-output and disturbance data of the system to observe the state. This
parameterization will play a key role in developing the output feedback Q-learning
equation for the zero-sum game and the associated H∞ control problem.
Algorithm 3.4 Q-learning value iteration algorithm for the zero-sum game
input: input-state data
output: P^*, K^* and G^*
1: initialize. Apply an arbitrary policy u_k = K^0 x_k + n_k and w_k = ν_k with n_k and ν_k being the exploration signals. Set H^0 ≥ 0 and j ← 0.
2: repeat
3: value update. Solve the following Bellman equation for H^{j+1},

z_k^T H^{j+1} z_k = y_k^T Q_y y_k + u_k^T u_k - \gamma^2 w_k^T w_k + z_{k+1}^T H^j z_{k+1}.

4: policy update. Find improved policies as

K^{j+1} = \left( H_{uu}^{j+1} - H_{uw}^{j+1} (H_{ww}^{j+1})^{-1} H_{wu}^{j+1} \right)^{-1} \left( H_{uw}^{j+1} (H_{ww}^{j+1})^{-1} H_{wx}^{j+1} - H_{ux}^{j+1} \right),
G^{j+1} = \left( H_{ww}^{j+1} - H_{wu}^{j+1} (H_{uu}^{j+1})^{-1} H_{uw}^{j+1} \right)^{-1} \left( H_{wu}^{j+1} (H_{uu}^{j+1})^{-1} H_{ux}^{j+1} - H_{wx}^{j+1} \right).

5: j ← j + 1
6: until ||H^j − H^{j−1}|| < ε for some small ε > 0.
3.3.2 State Parameterization of Discrete-Time Linear Systems Subject to Disturbances

In this subsection, we present an extension of the state parameterization result that was first introduced in Chap. 2 for use in developing an output feedback Q-learning scheme for solving the zero-sum game.
Theorem 3.1 Consider system (3.3). Let the pair (A, C) be observable. Then, there exists a parameterization of the state in the form of

x_k = W_u \sigma_k + W_y \omega_k + W_w \upsilon_k + (A + LC)^k x_0,    (3.22)

where L is the observer gain chosen such that A + LC is Schur stable, the parameterization matrices W_u = [W_u^1 \; W_u^2 \; \cdots \; W_u^{m_1}], W_w = [W_w^1 \; W_w^2 \; \cdots \; W_w^{m_2}] and W_y = [W_y^1 \; W_y^2 \; \cdots \; W_y^{p}] are given in the form of

W_u^i = \begin{bmatrix} a_{u(n-1)}^{i1} & a_{u(n-2)}^{i1} & \cdots & a_{u0}^{i1} \\ a_{u(n-1)}^{i2} & a_{u(n-2)}^{i2} & \cdots & a_{u0}^{i2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{u(n-1)}^{in} & a_{u(n-2)}^{in} & \cdots & a_{u0}^{in} \end{bmatrix},  i = 1, 2, \cdots, m_1,

W_w^i = \begin{bmatrix} a_{w(n-1)}^{i1} & a_{w(n-2)}^{i1} & \cdots & a_{w0}^{i1} \\ a_{w(n-1)}^{i2} & a_{w(n-2)}^{i2} & \cdots & a_{w0}^{i2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{w(n-1)}^{in} & a_{w(n-2)}^{in} & \cdots & a_{w0}^{in} \end{bmatrix},  i = 1, 2, \cdots, m_2,

W_y^i = \begin{bmatrix} a_{y(n-1)}^{i1} & a_{y(n-2)}^{i1} & \cdots & a_{y0}^{i1} \\ a_{y(n-1)}^{i2} & a_{y(n-2)}^{i2} & \cdots & a_{y0}^{i2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{y(n-1)}^{in} & a_{y(n-2)}^{in} & \cdots & a_{y0}^{in} \end{bmatrix},  i = 1, 2, \cdots, p,

whose elements are the coefficients of the numerators in the transfer function matrix of a Luenberger observer with inputs u_k, w_k and y_k, and \sigma_k = [\sigma_k^1 \; \sigma_k^2 \; \cdots \; \sigma_k^{m_1}]^T, \upsilon_k = [\upsilon_k^1 \; \upsilon_k^2 \; \cdots \; \upsilon_k^{m_2}]^T and \omega_k = [\omega_k^1 \; \omega_k^2 \; \cdots \; \omega_k^{p}]^T represent the states of the user-defined dynamics driven by the individual input u_k^i, disturbance w_k^i and output y_k^i as given by

\sigma_{k+1}^i = A\sigma_k^i + Bu_k^i,  \sigma^i(0) = 0,  i = 1, 2, \cdots, m_1,
\upsilon_{k+1}^i = A\upsilon_k^i + Bw_k^i,  \upsilon^i(0) = 0,  i = 1, 2, \cdots, m_2,
\omega_{k+1}^i = A\omega_k^i + By_k^i,  \omega^i(0) = 0,  i = 1, 2, \cdots, p,

for a Schur matrix A whose eigenvalues coincide with those of A + LC and an input vector B of the following form,

A = \begin{bmatrix} -\alpha_{n-1} & -\alpha_{n-2} & \cdots & -\alpha_0 \\ 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 \;\; 0 \end{bmatrix},  \quad  B = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}.
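The user-defined dynamics above are straightforward to realize in code. The sketch below builds the companion-form pair, renamed (A_f, B_f) here to avoid clashing with the plant matrices (A, B), from a chosen Schur characteristic polynomial and performs one update of a single filter channel; the coefficient vector alpha is a design choice, not something prescribed by the theorem.

```python
import numpy as np

def filter_matrices(alpha):
    """Companion-form pair for the polynomial z^n + alpha[0] z^{n-1} + ... + alpha[-1]."""
    n = len(alpha)
    Af = np.zeros((n, n))
    Af[0, :] = -np.asarray(alpha, dtype=float)    # first row: [-alpha_{n-1}, ..., -alpha_0]
    Af[1:, :-1] = np.eye(n - 1)                   # shifted identity below the first row
    Bf = np.zeros(n)
    Bf[0] = 1.0
    return Af, Bf

def filter_step(Af, Bf, state, signal):
    """One update sigma_{k+1} = A_f sigma_k + B_f * signal for a single scalar channel."""
    return Af @ state + Bf * signal
```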
Proof The proof follows from the proof of the state parameterization result presented in Chap. 2. By considering the disturbance as an additional input, given that (A, C) is observable, we can obtain a full state observer as

\hat{x}_{k+1} = A\hat{x}_k + Bu_k + Ew_k - L(y_k - C\hat{x}_k)
             = (A + LC)\hat{x}_k + Bu_k + Ew_k - Ly_k,

where \hat{x}_k is the estimate of the state x_k and L is the observer gain chosen such that the matrix A + LC has all its eigenvalues strictly inside the unit circle. This observer is a dynamic system driven by u_k, w_k, and y_k with the dynamics matrix A + LC. By linearity, the effect of the disturbance can be determined in the same way as that of u_k and y_k as carried out in the proof of Theorem 2.1. Therefore, following the arguments in the proof of Theorem 2.1, we can express the disturbance term as

\frac{W_w^i(z)}{\Lambda(z)}\left[w_k^i\right] = \begin{bmatrix} \dfrac{a_{w(n-1)}^{i1}z^{n-1} + a_{w(n-2)}^{i1}z^{n-2} + \cdots + a_{w0}^{i1}}{z^n + \alpha_{n-1}z^{n-1} + \alpha_{n-2}z^{n-2} + \cdots + \alpha_0} \\ \dfrac{a_{w(n-1)}^{i2}z^{n-1} + a_{w(n-2)}^{i2}z^{n-2} + \cdots + a_{w0}^{i2}}{z^n + \alpha_{n-1}z^{n-1} + \alpha_{n-2}z^{n-2} + \cdots + \alpha_0} \\ \vdots \\ \dfrac{a_{w(n-1)}^{in}z^{n-1} + a_{w(n-2)}^{in}z^{n-2} + \cdots + a_{w0}^{in}}{z^n + \alpha_{n-1}z^{n-1} + \alpha_{n-2}z^{n-2} + \cdots + \alpha_0} \end{bmatrix}\left[w_k^i\right]
= \begin{bmatrix} a_{w(n-1)}^{i1} & a_{w(n-2)}^{i1} & \cdots & a_{w0}^{i1} \\ a_{w(n-1)}^{i2} & a_{w(n-2)}^{i2} & \cdots & a_{w0}^{i2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{w(n-1)}^{in} & a_{w(n-2)}^{in} & \cdots & a_{w0}^{in} \end{bmatrix} \begin{bmatrix} \frac{z^{n-1}}{\Lambda(z)}\left[w_k^i\right] \\ \frac{z^{n-2}}{\Lambda(z)}\left[w_k^i\right] \\ \vdots \\ \frac{1}{\Lambda(z)}\left[w_k^i\right] \end{bmatrix}
= W_w^i \upsilon_k^i,  \quad i = 1, 2, \cdots, m_2,

where \Lambda(z) denotes the characteristic polynomial of A + LC, and W_w^i \in \mathbb{R}^{n\times n} is the parametric matrix corresponding to the contribution to the state from the ith disturbance input w_k^i. The corresponding filter state \upsilon_k^i can be obtained as

\upsilon_{k+1}^i = A\upsilon_k^i + Bw_k^i,

with A and B as defined in the statement of the theorem. Then, following the arguments used in Theorem 2.1, we have

\hat{x}_k = W_u\sigma_k + W_w\upsilon_k + W_y\omega_k + (A + LC)^k\hat{x}_0.    (3.23)
Note that A + LC represents the observer error dynamics, that is,

e_k = x_k - \hat{x}_k = (A + LC)^k e_0.

As a result, the state x_k can be parameterized as

x_k = W_u\sigma_k + W_w\upsilon_k + W_y\omega_k + (A + LC)^k x_0.    (3.24)

Since A + LC is Schur stable, the term (A + LC)^k\hat{x}_0 in (3.23) and the term (A + LC)^k x_0 in (3.24) vanish as k → ∞. This completes the proof.
It was shown in Chap. 2 that, for discrete-time systems, a special case of
the above result can be obtained if all the eigenvalues of matrix A + LC or,
equivalently, matrix A, are placed at 0. This property also pertains to the above
extended parameterization with the disturbance term. The following result presents
this special case.
Theorem 3.2 Consider system (3.3). Let the pair (A, C) be observable. Then, the system state can be uniquely represented in terms of the input, output, and disturbance as

x_k = W_u\bar{u}_{k-1,k-N} + W_w\bar{w}_{k-1,k-N} + W_y\bar{y}_{k-1,k-N},  \quad k \geq N,    (3.25)

where N \leq n is an upper bound on the observability index of the system, \bar{u}_{k-1,k-N} \in \mathbb{R}^{m_1 N}, \bar{w}_{k-1,k-N} \in \mathbb{R}^{m_2 N} and \bar{y}_{k-1,k-N} \in \mathbb{R}^{pN} are the past input, disturbance, and output data vectors defined as

\bar{u}_{k-1,k-N} = \left[ u_{k-1}^T \; u_{k-2}^T \; \cdots \; u_{k-N}^T \right]^T,
\bar{w}_{k-1,k-N} = \left[ w_{k-1}^T \; w_{k-2}^T \; \cdots \; w_{k-N}^T \right]^T,
\bar{y}_{k-1,k-N} = \left[ y_{k-1}^T \; y_{k-2}^T \; \cdots \; y_{k-N}^T \right]^T,

and the parameterization matrices take the special form

W_u = U_N - A^N V_N^T\left(V_N V_N^T\right)^{-1} T_{N1},
W_w = W_N - A^N V_N^T\left(V_N V_N^T\right)^{-1} T_{N2},
W_y = A^N V_N^T\left(V_N V_N^T\right)^{-1},
with

V_N = \left[ (CA^{N-1})^T \; \cdots \; (CA)^T \; C^T \right]^T,
U_N = \left[ B \; AB \; \cdots \; A^{N-1}B \right],
W_N = \left[ E \; AE \; \cdots \; A^{N-1}E \right],

T_{N1} = \begin{bmatrix} 0 & CB & CAB & \cdots & CA^{N-2}B \\ 0 & 0 & CB & \cdots & CA^{N-3}B \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & CB \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix},

T_{N2} = \begin{bmatrix} 0 & CE & CAE & \cdots & CA^{N-2}E \\ 0 & 0 & CE & \cdots & CA^{N-3}E \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & CE \\ 0 & 0 & 0 & 0 & 0 \end{bmatrix}.
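For an offline sanity check of Theorem 3.2 when a model happens to be available, the following sketch forms W_u, W_w, and W_y from (A, B, C, E). It uses the Moore–Penrose pseudo-inverse of V_N, which coincides with the expression V_N^T(V_N V_N^T)^{-1} when V_N is square and invertible; the model-free algorithms developed below never compute these matrices.

```python
import numpy as np

def parameterization_matrices(A, B, C, E, N):
    """Form W_u, W_w, W_y of Theorem 3.2 from a known model (offline verification only)."""
    p, m1, m2 = C.shape[0], B.shape[1], E.shape[1]
    Ap = [np.linalg.matrix_power(A, i) for i in range(N + 1)]
    VN = np.vstack([C @ Ap[N - 1 - i] for i in range(N)])      # [CA^{N-1}; ...; CA; C]
    UN = np.hstack([Ap[i] @ B for i in range(N)])              # [B, AB, ..., A^{N-1}B]
    WN = np.hstack([Ap[i] @ E for i in range(N)])              # [E, AE, ..., A^{N-1}E]
    TN1 = np.zeros((N * p, N * m1))
    TN2 = np.zeros((N * p, N * m2))
    for i in range(N):                                         # block row i
        for j in range(i + 1, N):                              # strictly upper block columns
            TN1[i*p:(i+1)*p, j*m1:(j+1)*m1] = C @ Ap[j - i - 1] @ B
            TN2[i*p:(i+1)*p, j*m2:(j+1)*m2] = C @ Ap[j - i - 1] @ E
    M = Ap[N] @ np.linalg.pinv(VN)                             # A^N V_N^+ (pseudo-inverse)
    return UN - M @ TN1, WN - M @ TN2, M                       # W_u, W_w, W_y
```

On any trajectory of (3.3), one can then check numerically that x_k equals W_u ū_{k−1,k−N} + W_w w̄_{k−1,k−N} + W_y ȳ_{k−1,k−N} for k ≥ N.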

Remark 3.1 The parameterization matrices W_u, W_y, and W_w in (3.25) are the same as those in (3.22) if N = n and all eigenvalues of matrix A or, equivalently, matrix A + LC, used in Theorem 3.1 are zero. This can be seen as follows. Recall from the proof of Theorem 3.1 that the state can be represented by

x_k = W_u\sigma_k + W_y\omega_k + W_w\upsilon_k + (A + LC)^k x_0.    (3.26)

If all eigenvalues of matrix A + LC are zero, then its characteristic polynomial is \Lambda(z) = z^n. Then, from the definition of \sigma_k, \omega_k, and \upsilon_k, we have

\sigma_k = \begin{bmatrix} \frac{z^{n-1}}{\Lambda(z)}[u_k] \\ \frac{z^{n-2}}{\Lambda(z)}[u_k] \\ \vdots \\ \frac{1}{\Lambda(z)}[u_k] \end{bmatrix} = \bar{u}_{k-1,k-n},
\omega_k = \begin{bmatrix} \frac{z^{n-1}}{\Lambda(z)}[y_k] \\ \frac{z^{n-2}}{\Lambda(z)}[y_k] \\ \vdots \\ \frac{1}{\Lambda(z)}[y_k] \end{bmatrix} = \bar{y}_{k-1,k-n},
\upsilon_k = \begin{bmatrix} \frac{z^{n-1}}{\Lambda(z)}[w_k] \\ \frac{z^{n-2}}{\Lambda(z)}[w_k] \\ \vdots \\ \frac{1}{\Lambda(z)}[w_k] \end{bmatrix} = \bar{w}_{k-1,k-n}.

With all its eigenvalues at zero, matrix A + LC is a nilpotent matrix. Thus, (A + LC)^k = 0 for k \geq n. Note here that we consider only the full order observer case by choosing N = n, which reduces (3.26) to

x_k = W_u\bar{u}_{k-1,k-N} + W_y\bar{y}_{k-1,k-N} + W_w\bar{w}_{k-1,k-N},  \quad k \geq N,

which is the parameterization (3.25) in Theorem 3.2. Hence, Theorem 3.2 is a special case of Theorem 3.1 when all eigenvalues of matrix A + LC or, equivalently, matrix A, in Theorem 3.1 are zero and the upper bound N of the observability index is chosen as n.
Similar to the case of the state parameterization result presented in Chap. 2,
it is important that the full row rank condition of the parameterization matrix
W = [Wu Ww Wy ] needs to be established so that the input-output and disturbance
correspond to a unique state xk . We present the extension of the full row rank
condition in Chap. 2 to include Ww in matrix W .
 
Theorem 3.3 The state parameterization matrix W = Wu Ww Wy in (3.22) (in
(3.25) as a special case) is of full row rank if either of (A + LC, B),(A + LC, E),
or (A + LC, L) is controllable.
Proof The conditions for the matrices Wu and Wy remain the same as in Theo-
rem 2.3. For Ww , we can treat the disturbance as an additional input and, following
the arguments similar to those in the proof of Theorem 2.3, we have that, Ww , and
hence W , is of full row rank if the pair (A + LC, E) is controllable. This completes
the proof.
In a model-free setting, the conditions mentioned in Theorem 3.3 are difficult to
verify since they involve the knowledge of the system dynamics. For this reason,
we do not design matrix L. Instead, we form a user-defined matrix A that contains
the desired eigenvalues of matrix A + LC. As a result, we need a condition in
terms of these eigenvalues instead of matrix L. The following result establishes this
condition.
Theorem 3.4 The parameterization matrix W is of full row rank if matrices A and
A + LC have no common eigenvalues.
Proof The proof is similar to the proof of Theorem 2.4.
Remark 3.2 The condition discussed in Theorem 3.4 will not be satisfied if we
use the parameterization (3.25) and the system happens to have a zero eigenvalue.
This is due to the dead-beat nature of this special parameterization that places
the eigenvalues of matrix A + LC all at the origin. In such a case, the general
parameterization (3.22) gives us extra flexibility in satisfying this condition by placing the eigenvalues of A at locations other than zero. Compared to the disturbance-free
parameterization presented in Chap. 2, the disturbance matrix Ww provides another
degree of freedom to satisfy the full rank condition of W for both parameterizations
(3.22) and (3.25).

3.3.3 Output Feedback Q-function for Zero-Sum Game

In this subsection we will solve the zero-sum game and the associated H∞ control
problem by using only the input-output and disturbance data. No knowledge of the
system dynamics (A, B, C, E) and no measurement of the state xk are assumed
available. We now proceed to apply the state parameterization (3.25) to describe the
Q-function in (3.17) in terms of the input, output, and disturbance. It can be easily
verified that substitution of the parameterization (3.25) for x_k in (3.17) results in

Q_K = \begin{bmatrix} \bar{u}_{k-1,k-N} \\ \bar{w}_{k-1,k-N} \\ \bar{y}_{k-1,k-N} \\ u_k \\ w_k \end{bmatrix}^T \begin{bmatrix} H_{\bar{u}\bar{u}} & H_{\bar{u}\bar{w}} & H_{\bar{u}\bar{y}} & H_{\bar{u}u} & H_{\bar{u}w} \\ H_{\bar{w}\bar{u}} & H_{\bar{w}\bar{w}} & H_{\bar{w}\bar{y}} & H_{\bar{w}u} & H_{\bar{w}w} \\ H_{\bar{y}\bar{u}} & H_{\bar{y}\bar{w}} & H_{\bar{y}\bar{y}} & H_{\bar{y}u} & H_{\bar{y}w} \\ H_{u\bar{u}} & H_{u\bar{w}} & H_{u\bar{y}} & H_{uu} & H_{uw} \\ H_{w\bar{u}} & H_{w\bar{w}} & H_{w\bar{y}} & H_{wu} & H_{ww} \end{bmatrix} \begin{bmatrix} \bar{u}_{k-1,k-N} \\ \bar{w}_{k-1,k-N} \\ \bar{y}_{k-1,k-N} \\ u_k \\ w_k \end{bmatrix}
  = z_k^T H z_k,    (3.27)

where

z_k = \left[ \bar{u}_{k-1,k-N}^T \; \bar{w}_{k-1,k-N}^T \; \bar{y}_{k-1,k-N}^T \; u_k^T \; w_k^T \right]^T,
H = H^T \in \mathbb{R}^{(m_1 N + m_2 N + pN + m_1 + m_2)\times(m_1 N + m_2 N + pN + m_1 + m_2)},

and the submatrices are given as
H_{\bar{u}\bar{u}} = W_u^T(Q + A^T P A)W_u \in \mathbb{R}^{m_1 N \times m_1 N},
H_{\bar{u}\bar{w}} = W_u^T(Q + A^T P A)W_w \in \mathbb{R}^{m_1 N \times m_2 N},
H_{\bar{u}\bar{y}} = W_u^T(Q + A^T P A)W_y \in \mathbb{R}^{m_1 N \times pN},
H_{\bar{u}u} = W_u^T A^T P B \in \mathbb{R}^{m_1 N \times m_1},
H_{\bar{u}w} = W_u^T A^T P E \in \mathbb{R}^{m_1 N \times m_2},
H_{\bar{w}\bar{u}} = W_w^T(Q + A^T P A)W_u \in \mathbb{R}^{m_2 N \times m_1 N},
H_{\bar{w}\bar{w}} = W_w^T(Q + A^T P A)W_w \in \mathbb{R}^{m_2 N \times m_2 N},
H_{\bar{w}\bar{y}} = W_w^T(Q + A^T P A)W_y \in \mathbb{R}^{m_2 N \times pN},
H_{\bar{w}u} = W_w^T A^T P B \in \mathbb{R}^{m_2 N \times m_1},
H_{\bar{w}w} = W_w^T A^T P E \in \mathbb{R}^{m_2 N \times m_2},
H_{\bar{y}\bar{u}} = W_y^T(Q + A^T P A)W_u \in \mathbb{R}^{pN \times m_1 N},
H_{\bar{y}\bar{w}} = W_y^T(Q + A^T P A)W_w \in \mathbb{R}^{pN \times m_2 N},
H_{\bar{y}\bar{y}} = W_y^T(Q + A^T P A)W_y \in \mathbb{R}^{pN \times pN},
H_{\bar{y}u} = W_y^T A^T P B \in \mathbb{R}^{pN \times m_1},
H_{\bar{y}w} = W_y^T A^T P E \in \mathbb{R}^{pN \times m_2},
H_{u\bar{u}} = B^T P A W_u \in \mathbb{R}^{m_1 \times m_1 N},
H_{u\bar{w}} = B^T P A W_w \in \mathbb{R}^{m_1 \times m_2 N},
H_{u\bar{y}} = B^T P A W_y \in \mathbb{R}^{m_1 \times pN},
H_{w\bar{u}} = E^T P A W_u \in \mathbb{R}^{m_2 \times m_1 N},
H_{w\bar{w}} = E^T P A W_w \in \mathbb{R}^{m_2 \times m_2 N},
H_{w\bar{y}} = E^T P A W_y \in \mathbb{R}^{m_2 \times pN},
H_{uu} = B^T P B + I \in \mathbb{R}^{m_1 \times m_1},
H_{uw} = B^T P E \in \mathbb{R}^{m_1 \times m_2},
H_{wu} = E^T P B \in \mathbb{R}^{m_2 \times m_1},
H_{ww} = E^T P E - \gamma^2 I \in \mathbb{R}^{m_2 \times m_2}.

Given the optimal cost function V ∗ with the cost matrix P ∗ , we obtain the
corresponding optimal output feedback matrix H ∗ by substituting P = P ∗ in
(3.3.3). Then, the optimal output feedback Q-function is given by

Q∗ = zkT H ∗ zk , (3.28)

which provides a state-free representation of the zero-sum game Q-function. Solving

\frac{\partial Q^*}{\partial u_k} = 0,  \quad  \frac{\partial Q^*}{\partial w_k} = 0,

simultaneously for u_k and w_k results in the desired control law,

u_k^* = \left( H_{uu}^* - H_{uw}^*(H_{ww}^*)^{-1}H_{wu}^* \right)^{-1} \left[ H_{uw}^*(H_{ww}^*)^{-1}\left( H_{w\bar{u}}^*\bar{u}_{k-1,k-N} + H_{w\bar{w}}^*\bar{w}_{k-1,k-N} + H_{w\bar{y}}^*\bar{y}_{k-1,k-N} \right) - \left( H_{u\bar{u}}^*\bar{u}_{k-1,k-N} + H_{u\bar{w}}^*\bar{w}_{k-1,k-N} + H_{u\bar{y}}^*\bar{y}_{k-1,k-N} \right) \right],    (3.29)

w_k^* = \left( H_{ww}^* - H_{wu}^*(H_{uu}^*)^{-1}H_{uw}^* \right)^{-1} \left[ H_{wu}^*(H_{uu}^*)^{-1}\left( H_{u\bar{u}}^*\bar{u}_{k-1,k-N} + H_{u\bar{w}}^*\bar{w}_{k-1,k-N} + H_{u\bar{y}}^*\bar{y}_{k-1,k-N} \right) - \left( H_{w\bar{u}}^*\bar{u}_{k-1,k-N} + H_{w\bar{w}}^*\bar{w}_{k-1,k-N} + H_{w\bar{y}}^*\bar{y}_{k-1,k-N} \right) \right].    (3.30)

These policies solve the output feedback zero-sum game without requiring access
to the full state xk .
We next show that the output feedback policies (3.29) and (3.30) are equivalent to the state feedback policies (3.9) and (3.10). To this end, we show the relation between the presented output feedback Q-function and the output feedback value function. The output feedback value function is given by

V = \begin{bmatrix} \bar{u}_{k-1,k-N} \\ \bar{w}_{k-1,k-N} \\ \bar{y}_{k-1,k-N} \end{bmatrix}^T \bar{P} \begin{bmatrix} \bar{u}_{k-1,k-N} \\ \bar{w}_{k-1,k-N} \\ \bar{y}_{k-1,k-N} \end{bmatrix},    (3.31)

where

\bar{P} = \begin{bmatrix} W_u^T P W_u & W_u^T P W_w & W_u^T P W_y \\ W_w^T P W_u & W_w^T P W_w & W_w^T P W_y \\ W_y^T P W_u & W_y^T P W_w & W_y^T P W_y \end{bmatrix}.

The value function (3.31), by definition (3.14), gives the cost of executing the policies

\bar{K} = K\left[ W_u \; W_w \; W_y \right],
\bar{G} = G\left[ W_u \; W_w \; W_y \right].

Using the relation Q_K(x_k, Kx_k, Gx_k) = V_K(x_k), the output feedback value function matrix \bar{P} can be readily obtained as

\bar{P} = \begin{bmatrix} I \\ \bar{K} \\ \bar{G} \end{bmatrix}^T H \begin{bmatrix} I \\ \bar{K} \\ \bar{G} \end{bmatrix}.    (3.32)

Theorem 3.5 The output feedback policies given by (3.29) and (3.30) converge to
the state feedback policies (3.9) and (3.10), respectively, that solve the zero-sum
game (3.3)–(3.5).
Proof We know that the state vector can be represented by the input-output data
sequence as in (3.25). It can be easily verified that substituting (3.25) and (3.3.3)
in (3.29) and (3.30) results in the state feedback policies (3.9) and (3.10), which
solve the zero-sum game. So, the output feedback policies (3.29) and (3.30) are the
equivalent policies that also solve the zero-sum game (3.3)–(3.5). This completes
the proof.

3.3.4 Output Feedback Based Q-learning for Zero-Sum Game and H∞ Control Problem

In this subsection, we present an output feedback Q-learning scheme for solving the zero-sum game and the associated H∞ control problem. Using the state parameterization (3.25) in the state feedback Q-learning equation (3.21), we have the following output feedback Q-learning equation,

z_k^T H z_k = y_k^T Q_y y_k + u_k^T u_k - \gamma^2 w_k^T w_k + z_{k+1}^T H z_{k+1},    (3.33)

in which u_{k+1} and w_{k+1} are given by

u_{k+1} = \left( H_{uu} - H_{uw}H_{ww}^{-1}H_{wu} \right)^{-1} \left[ H_{uw}H_{ww}^{-1}\left( H_{w\bar{u}}\bar{u}_{k,k-N+1} + H_{w\bar{w}}\bar{w}_{k,k-N+1} + H_{w\bar{y}}\bar{y}_{k,k-N+1} \right) - \left( H_{u\bar{u}}\bar{u}_{k,k-N+1} + H_{u\bar{w}}\bar{w}_{k,k-N+1} + H_{u\bar{y}}\bar{y}_{k,k-N+1} \right) \right],

w_{k+1} = \left( H_{ww} - H_{wu}H_{uu}^{-1}H_{uw} \right)^{-1} \left[ H_{wu}H_{uu}^{-1}\left( H_{u\bar{u}}\bar{u}_{k,k-N+1} + H_{u\bar{w}}\bar{w}_{k,k-N+1} + H_{u\bar{y}}\bar{y}_{k,k-N+1} \right) - \left( H_{w\bar{u}}\bar{u}_{k,k-N+1} + H_{w\bar{w}}\bar{w}_{k,k-N+1} + H_{w\bar{y}}\bar{y}_{k,k-N+1} \right) \right].

The Q-function matrix H is unknown and is to be learned. We can separate H by parameterizing (3.27) as

Q_K = \bar{H}^T \bar{z}_k,    (3.34)

where

\bar{H} = vec(H) = \left[ h_{11} \; 2h_{12} \; \cdots \; 2h_{1l} \; h_{22} \; 2h_{23} \; \cdots \; 2h_{2l} \; \cdots \; h_{ll} \right]^T,

with l = m_1 N + m_2 N + pN + m_1 + m_2. The regression vector \bar{z}_k \in \mathbb{R}^{l(l+1)/2} is made up of quadratic basis functions as follows,

\bar{z}_k = z_k \otimes z_k = \left[ z_{k1}^2 \; z_{k1}z_{k2} \; \cdots \; z_{k1}z_{kl} \; z_{k2}^2 \; z_{k2}z_{k3} \; \cdots \; z_{k2}z_{kl} \; \cdots \; z_{kl}^2 \right]^T,
where z_k = [z_{k1} \; z_{k2} \; \cdots \; z_{kl}]^T. Then, it follows from Equation (3.34) that

\bar{H}^T \bar{z}_k = y_k^T Q_y y_k + u_k^T u_k - \gamma^2 w_k^T w_k + \bar{H}^T \bar{z}_{k+1}.    (3.35)

Equation (3.35) is a linear equation and can be written as

\Phi^T \bar{H} = \Upsilon,    (3.36)

where \Phi \in \mathbb{R}^{(l(l+1)/2)\times L} and \Upsilon \in \mathbb{R}^{L\times 1} are the data matrices defined by

\Phi = \left[ \bar{z}_k^1 - \bar{z}_{k+1}^1 \;\; \bar{z}_k^2 - \bar{z}_{k+1}^2 \;\; \cdots \;\; \bar{z}_k^L - \bar{z}_{k+1}^L \right],
\Upsilon = \left[ r^1(y_k, u_k, w_k) \;\; r^2(y_k, u_k, w_k) \;\; \cdots \;\; r^L(y_k, u_k, w_k) \right]^T,

and \bar{H} is the unknown vector to be found. We can use the least-squares technique to obtain the following closed-form solution,

\bar{H}^j = \left( \Phi\Phi^T \right)^{-1}\Phi\Upsilon.    (3.37)

Note that we require at least L \geq l(l+1)/2 data samples. Furthermore, since u_k and w_k are linearly dependent on the vectors \bar{u}_{k-1,k-N}, \bar{w}_{k-1,k-N} and \bar{y}_{k-1,k-N}, we add excitation signals in u_k and w_k so that all the involved vectors are linearly independent to ensure a unique least-squares solution to (3.36). That is, we need to satisfy the following rank condition,

rank(\Phi) = l(l+1)/2.    (3.38)
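The sketch below shows one way to implement the policy-evaluation step (3.35)–(3.37): build the quadratic basis z̄_k, stack L samples into the data matrices, and solve for the vectorized H̄ in the least-squares sense. The array names and shapes are assumptions about how the closed-loop data have been logged.

```python
import numpy as np

def quad_basis(z):
    """Quadratic basis z_k (Kronecker square) with duplicate cross terms merged; length l(l+1)/2."""
    return np.concatenate([z[i] * z[i:] for i in range(len(z))])

def solve_H_bar(Z, Z_next, rewards):
    """Z, Z_next: (L, l) arrays holding z_k and z_{k+1}; rewards: (L,) array of
    r_k = y_k' Q_y y_k + u_k' u_k - gamma^2 w_k' w_k.  Returns H_bar as in (3.37)."""
    Phi = np.stack([quad_basis(a) - quad_basis(b) for a, b in zip(Z, Z_next)], axis=1)
    Upsilon = np.asarray(rewards, dtype=float)
    # H_bar = (Phi Phi^T)^{-1} Phi Upsilon; requires the rank condition (3.38) to hold.
    # Diagonal entries come out as h_ii, off-diagonal entries as 2 h_ij, matching vec(H).
    return np.linalg.solve(Phi @ Phi.T, Phi @ Upsilon)
```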

In what follows, we present policy iteration and value iteration algorithms to learn the output feedback strategies for the zero-sum game.
Algorithm 3.5 Output feedback Q-learning policy iteration algorithm for the zero-sum game
input: input-output data
output: H^*
1: initialize. Select a stabilizing output feedback policy u_k^0 and disturbance w_k^0 along with their exploration signals n_k and ν_k. Set j ← 0.
2: repeat
3: policy evaluation. Solve the following Bellman equation for \bar{H}^j,

(\bar{H}^j)^T(\bar{z}_k - \bar{z}_{k+1}) = y_k^T Q_y y_k + u_k^T u_k - \gamma^2 w_k^T w_k.

4: policy update. Find improved policies as

u_k^{j+1} = \left( H_{uu}^j - H_{uw}^j(H_{ww}^j)^{-1}H_{wu}^j \right)^{-1} \left[ H_{uw}^j(H_{ww}^j)^{-1}\left( H_{w\bar{u}}^j\bar{u}_{k-1,k-N} + H_{w\bar{w}}^j\bar{w}_{k-1,k-N} + H_{w\bar{y}}^j\bar{y}_{k-1,k-N} \right) - \left( H_{u\bar{u}}^j\bar{u}_{k-1,k-N} + H_{u\bar{w}}^j\bar{w}_{k-1,k-N} + H_{u\bar{y}}^j\bar{y}_{k-1,k-N} \right) \right],

w_k^{j+1} = \left( H_{ww}^j - H_{wu}^j(H_{uu}^j)^{-1}H_{uw}^j \right)^{-1} \left[ H_{wu}^j(H_{uu}^j)^{-1}\left( H_{u\bar{u}}^j\bar{u}_{k-1,k-N} + H_{u\bar{w}}^j\bar{w}_{k-1,k-N} + H_{u\bar{y}}^j\bar{y}_{k-1,k-N} \right) - \left( H_{w\bar{u}}^j\bar{u}_{k-1,k-N} + H_{w\bar{w}}^j\bar{w}_{k-1,k-N} + H_{w\bar{y}}^j\bar{y}_{k-1,k-N} \right) \right].

5: j ← j + 1
6: until ||\bar{H}^j − \bar{H}^{j−1}|| < ε for some small ε > 0.

Algorithm 3.5 requires a stabilizing control to start with, which can be quite restrictive when the system itself is open-loop unstable. To obviate this requirement, a value iteration algorithm, Algorithm 3.6, is presented next.

In Algorithm 3.6, the data matrices \Phi \in \mathbb{R}^{(l(l+1)/2)\times L} and \Upsilon \in \mathbb{R}^{L\times 1} are defined by

\Phi = \left[ \bar{z}_k^1 \; \bar{z}_k^2 \; \cdots \; \bar{z}_k^L \right],
\Upsilon = \left[ r^1(y_k, u_k, w_k) + (\bar{H}^j)^T\bar{z}_{k+1}^1 \;\; r^2(y_k, u_k, w_k) + (\bar{H}^j)^T\bar{z}_{k+1}^2 \;\; \cdots \;\; r^L(y_k, u_k, w_k) + (\bar{H}^j)^T\bar{z}_{k+1}^L \right]^T.
Algorithm 3.6 Output feedback Q-learning value iteration algorithm for the zero-sum game
input: input-output data
output: H^*
1: initialize. Apply arbitrary u_k^0 and w_k^0 along with their exploration signals n_k and ν_k. Set H^0 ≥ 0 and j ← 0.
2: repeat
3: value update. Solve the following Bellman equation for \bar{H}^{j+1},

(\bar{H}^{j+1})^T\bar{z}_k = y_k^T Q_y y_k + u_k^T u_k - \gamma^2 w_k^T w_k + (\bar{H}^j)^T\bar{z}_{k+1}.

4: policy update. Find improved policies as

u_k^{j+1} = \left( H_{uu}^{j+1} - H_{uw}^{j+1}(H_{ww}^{j+1})^{-1}H_{wu}^{j+1} \right)^{-1} \left[ H_{uw}^{j+1}(H_{ww}^{j+1})^{-1}\left( H_{w\bar{u}}^{j+1}\bar{u}_{k-1,k-N} + H_{w\bar{w}}^{j+1}\bar{w}_{k-1,k-N} + H_{w\bar{y}}^{j+1}\bar{y}_{k-1,k-N} \right) - \left( H_{u\bar{u}}^{j+1}\bar{u}_{k-1,k-N} + H_{u\bar{w}}^{j+1}\bar{w}_{k-1,k-N} + H_{u\bar{y}}^{j+1}\bar{y}_{k-1,k-N} \right) \right],

w_k^{j+1} = \left( H_{ww}^{j+1} - H_{wu}^{j+1}(H_{uu}^{j+1})^{-1}H_{uw}^{j+1} \right)^{-1} \left[ H_{wu}^{j+1}(H_{uu}^{j+1})^{-1}\left( H_{u\bar{u}}^{j+1}\bar{u}_{k-1,k-N} + H_{u\bar{w}}^{j+1}\bar{w}_{k-1,k-N} + H_{u\bar{y}}^{j+1}\bar{y}_{k-1,k-N} \right) - \left( H_{w\bar{u}}^{j+1}\bar{u}_{k-1,k-N} + H_{w\bar{w}}^{j+1}\bar{w}_{k-1,k-N} + H_{w\bar{y}}^{j+1}\bar{y}_{k-1,k-N} \right) \right].

5: j ← j + 1
6: until ||\bar{H}^j − \bar{H}^{j−1}|| < ε for some small ε > 0.

These matrices are used to obtain the least-squares solution given by (3.37). The rank condition (3.38) must be met by the addition of excitation noises in the control u_k and the disturbance w_k. Convergence of the output feedback learning algorithms, Algorithms 3.5 and 3.6, is established in the following theorem.

Theorem 3.6 Consider system (3.3). Assume that the linear quadratic zero-sum game is solvable. Then, the output feedback Q-learning algorithms (Algorithms 3.5 and 3.6) each generates policies \{u_k^j, j = 1, 2, 3, ...\} and \{w_k^j, j = 1, 2, 3, ...\} that converge to the optimal output feedback policies given in (3.29) and (3.30) as j → ∞ if the rank condition (3.38) holds.
Proof The proof follows from [5], where it is shown that, under sufficient excitation, the state feedback Q-learning iterative algorithm generates policies \{u_k^j, j = 1, 2, 3, ...\} and \{w_k^j, j = 1, 2, 3, ...\} that converge to the optimal state feedback policies (3.9) and (3.10). By the state parameterization (3.25), we see that the state feedback and output feedback Q-functions are equivalent and, by Theorem 3.5, the output feedback policies (3.29) and (3.30) are equivalent to (3.9) and (3.10), respectively. Therefore, following the result in [5], we can conclude that, under sufficient excitation, such that the rank condition (3.38) holds, the output feedback Q-learning algorithm generates policies that converge to the optimal output feedback policies as j → ∞. This completes the proof.
We now show that the Q-learning scheme for solving the zero-sum game is
immune to the excitation noise bias.
Theorem 3.7 The output feedback Q-learning scheme does not incur bias in the
parameter estimates.
Proof Based on the state parameterization (3.25), we know that the output feedback
Q-function in (3.27) is equivalent to the original state feedback Q-function (3.17).
Under the excitation noise, we can write the Q-function as

Q(x_k, \hat{u}_k, \hat{w}_k) = r(x_k, \hat{u}_k, \hat{w}_k) + V(x_{k+1}),

where \hat{u}_k = u_k + n_k and \hat{w}_k = w_k + v_k with n_k and v_k being the excitation noise
signals.
Let \hat{H} be the estimate of H obtained using \hat{u}_k and \hat{w}_k. It then follows from (3.17) that

\begin{bmatrix} x_k \\ \hat{u}_k \\ \hat{w}_k \end{bmatrix}^T \hat{H} \begin{bmatrix} x_k \\ \hat{u}_k \\ \hat{w}_k \end{bmatrix} = r(x_k, \hat{u}_k, \hat{w}_k) + (Ax_k + B\hat{u}_k + E\hat{w}_k)^T P (Ax_k + B\hat{u}_k + E\hat{w}_k).
Upon expansion, we have

\begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix}^T \hat{H} \begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix} + x_k^T A^T P B n_k + n_k^T B^T P A x_k + n_k^T (I + B^T P B) u_k + n_k^T B^T P E w_k
+ v_k^T E^T P A x_k + u_k^T (I + B^T P B) n_k + n_k^T (I + B^T P B) n_k + x_k^T A^T P E v_k
+ w_k^T (E^T P E - \gamma^2 I) v_k + n_k^T B^T P E v_k + v_k^T (E^T P E - \gamma^2 I) v_k + u_k^T B^T P E v_k
+ v_k^T E^T P B u_k + w_k^T E^T P B n_k + v_k^T E^T P B n_k + v_k^T (E^T P E - \gamma^2 I) w_k

= x_k^T Q x_k + u_k^T u_k - \gamma^2 w_k^T w_k + n_k^T n_k + n_k^T u_k + u_k^T n_k - \gamma^2 w_k^T v_k - \gamma^2 v_k^T w_k
- \gamma^2 v_k^T v_k + (Ax_k + Bu_k + Ew_k)^T P (Ax_k + Bu_k + Ew_k) + (Ax_k + Bu_k + Ew_k)^T P B n_k
+ n_k^T B^T P B n_k + (Bn_k)^T P (Ax_k + Bu_k + Ew_k) + n_k^T B^T P E v_k
+ v_k^T E^T P B n_k + (Ax_k + Bu_k + Ew_k)^T P E v_k + (Ev_k)^T P (Ax_k + Bu_k + Ew_k)
+ v_k^T E^T P E v_k.
It can be easily verified that all the terms involving n_k and v_k get canceled on both sides of the equation, and we are left with

\begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix}^T \hat{H} \begin{bmatrix} x_k \\ u_k \\ w_k \end{bmatrix} = r(x_k, u_k, w_k) + (Ax_k + Bu_k + Ew_k)^T P (Ax_k + Bu_k + Ew_k).

Comparing the above equation with (3.17), we have \hat{H} = H, that is,

Q(x_k, u_k, w_k) = r(x_k, u_k, w_k) + V(x_{k+1}).
In view of (3.27), we have

z_k^T H z_k = y_k^T Q_y y_k + u_k^T u_k - \gamma^2 w_k^T w_k + z_{k+1}^T H z_{k+1}.

That is, we have obtained the Bellman equation in the absence of excitation noise as given in (3.35). This completes the proof.

3.3.5 A Numerical Example

In this subsection we present numerical simulations of the proposed design.


Example 3.1 (H∞ Autopilot Control) In this example we test the proposed design
by numerical simulations of a model-free H∞ autopilot controller for the F-16
aircraft. Consider the discrete-time model of the fighter aircraft autopilot system
from [5], which is in the form of (3.3) with

A = \begin{bmatrix} 0.906488 & 0.0816012 & -0.0005 \\ 0.0741349 & 0.90121 & -0.0007083 \\ 0 & 0 & 0.132655 \end{bmatrix},  \quad B = \begin{bmatrix} -0.00150808 \\ -0.0096 \\ 0.867345 \end{bmatrix},
C = \begin{bmatrix} 1 & 0 & 0 \end{bmatrix},  \quad E = \begin{bmatrix} 0.00951892 \\ 0.00038373 \\ 0 \end{bmatrix}.
The three states are given by x = [x1 x2 x3 ]T , where x1 is the angle of attack, x2
is the rate of pitch, and x3 is the elevator angle of deflection. The initial state of
the system is x0 = [10 5 − 2]T . The user-defined cost function parameters are
Qy = 1 and γ = 1. The algorithm is initialized with u0k = nk and wk0 = vk .
The PE condition is ensured by adding sinusoidal noise nk of different frequencies
and amplitudes in the input uk and assuming that the disturbance wk is sufficiently
exciting. In the simulation study, we let wk = vk be sinusoidal noises of different
frequencies and amplitudes. The system order is 3, so N = 3 is selected. In
comparison with the output feedback model-free Q-learning schemes, the state
feedback case leads to a smaller H matrix but requires the measurement of the
full state. The nominal values of the state feedback control parameters are obtained
by solving the GARE (3.7) as follows,

 
Hux = −0.0861 −0.0708 0.0001 ,

 
Hwx = 0.1000 0.0671 −0.0001 ,

Huu = 1.0009,

Huw = −0.0008,

Hww = −0.9990.

On the other hand, the output feedback case leads to a larger H matrix, whose
nominal values are
 
Hu∗ū = 0.0009 −0.0006 −0.0002 ,
 
Hu∗w̄ = −0.0008 0.0087 −0.0011 ,
 
Hu∗ȳ = −0.9987 0.9443 −0.1077 ,
 
Hw∗ ū = −0.0008 0.0005 0.0002 ,
 
Hw∗ w̄ = 0.0009 −0.0084 0.0011 ,
 
Hw∗ ȳ = 0.9813 −0.9132 0.1039 ,

Huu = 1.0009,

Huw = −0.0008,

Hww = −0.9990.

Consequently, it takes longer for the estimates of the output feedback parameters
to converge than in the state feedback case. We first present the result of the
state feedback and output feedback Q-learning policy iteration algorithms, Algo-
rithms 3.3 and 3.5. The final estimated state feedback control parameters obtained
by Algorithm 3.3 are
 
Ĥux = −0.0861 −0.0708 0.0001 ,
 
Ĥwx = 0.1000 0.0671 −0.0001 ,

Ĥuu = 1.0009,
Ĥuw = −0.0008,
Ĥww = −0.9990,

whereas the output feedback control parameters obtained by Algorithm 3.5 are
 
Ĥuū = 0.0009 −0.0006 −0.0002 ,
 
Ĥuw̄ = −0.0008 0.0087 −0.0011 ,
 
Ĥuȳ = −0.9983 0.9439 −0.1076 ,
 
Ĥwū = −0.0008 0.0005 0.0002 ,
 
Ĥww̄ = 0.0009 −0.0084 0.0011 ,
 
Ĥwȳ = 0.9813 −0.9132 0.1039 ,

Ĥuu = 1.0009,
Ĥuw = −0.0008,
Ĥww = −0.9990.

The closed-loop response and the convergence of the parameter estimates under
the state feedback PI algorithm, Algorithm 3.3, are shown in Figs. 3.1 and 3.2,
respectively. The corresponding results for the output feedback PI Algorithm 3.5 are
shown in Figs. 3.3 and 3.4, respectively. It can be seen that both algorithms converge
to the nominal solution. Furthermore, the output feedback algorithm circumvents
the need of full state measurement at the cost of a longer learning time as it requires
more parameters to be learnt.
We now test the state feedback and output feedback Q-learning value iteration
algorithms, Algorithms 3.4 and 3.6. Figures 3.5 and 3.6 show the closed-loop state
response and the convergence of the parameter estimates, respectively, under the
state feedback Q-learning value iteration Algorithm 3.4. The final estimated state
feedback control parameters under Algorithm 3.4 are
 
Ĥux = −0.0857 −0.0705 0.0001 ,
 
Ĥwx = 0.0997 0.0668 −0.0001 ,

Ĥuu = 1.0009,
Fig. 3.1 Example 3.1: State trajectory of the closed-loop system under the state feedback Q-learning PI algorithm, Algorithm 3.3 (plots of x1, x2, x3 versus time step k)

Fig. 3.2 Example 3.1: Convergence of the parameter estimates under the state feedback Q-learning PI algorithm, Algorithm 3.3 (plot of ||Ĥ|| approaching ||H*|| versus iterations)

Ĥuw = −0.0008,
Ĥww = −0.9990.

The corresponding results for the output feedback Q-learning value iteration algo-
rithm, Algorithm 3.6, are shown in Figs. 3.7 and 3.8. As with the output feedback
policy iteration algorithm, it takes longer for the output feedback parameters to
converge in Algorithm 3.6 than in the state feedback case, Algorithm 3.5. The final
estimated control parameters by Algorithm 3.6 are
Fig. 3.3 Example 3.1: State trajectory of the closed-loop system under the output feedback Q-learning PI algorithm, Algorithm 3.5 (plots of x1, x2, x3 versus time step k)

Fig. 3.4 Example 3.1: Convergence of the parameter estimates under the output feedback Q-learning PI algorithm, Algorithm 3.5 (plot of ||Ĥ|| approaching ||H*|| versus iterations)

 
Ĥuū = 0.0009 −0.0006 −0.0002 ,
 
Ĥuw̄ = −0.0008 0.0087 −0.0011 ,
 
Ĥuȳ = −0.9980 0.9436 −0.1076 ,
 
Ĥwū = −0.0008 0.0005 0.0002 ,
 
Ĥww̄ = 0.0009 −0.0084 0.0011 ,
Fig. 3.5 Example 3.1: State trajectory of the closed-loop system under the state feedback Q-learning VI algorithm, Algorithm 3.4 [5] (plots of x1, x2, x3 versus time step k)

Fig. 3.6 Example 3.1: Convergence of the parameter estimates under the state feedback Q-learning VI algorithm, Algorithm 3.4 [5] (plot of ||Ĥ|| approaching ||H*|| versus iterations)

 
Ĥwȳ = 0.9811 −0.9130 0.1039 ,

Ĥuu = 1.0009,
Ĥuw = −0.0008,
Ĥww = −0.9990.
Fig. 3.7 Example 3.1: State trajectory of the closed-loop system under the output feedback Q-learning VI algorithm, Algorithm 3.6 (plots of x1, x2, x3 versus time step k)

Fig. 3.8 Example 3.1: Convergence of the parameter estimates under the output feedback Q-learning VI algorithm, Algorithm 3.6 (plot of ||Ĥ|| approaching ||H*|| versus iterations)

Note that in both of the state feedback algorithms we used 20 data samples in each iteration, while 70 data samples were used for both of the output feedback algorithms, in order to satisfy the minimum number of data samples needed to meet the rank condition (3.38). This example also demonstrates that, for a given system under the same initial conditions, the policy iteration algorithms converge in fewer iterations than the value iteration algorithms in both the state feedback and output feedback cases. Finally, it can be seen that the output feedback Q-learning algorithms are able to maintain closed-loop stability and the controller parameters converge to their optimal values.
The convergence criterion was chosen as ε = 0.1. Notice that no discounting
factor was employed and the results correspond to those obtained by solving the
game algebraic Riccati equation. Furthermore, the excitation noise did not introduce
any bias in the estimates, which is an advantage of the presented scheme. Moreover,
the excitation noise was removed after the convergence of parameter estimates.
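For reference, the sketch below illustrates the kind of sum-of-sinusoids excitation used in this example; the particular frequencies and amplitudes are arbitrary placeholders, chosen only to show how the exploration signal is formed and then switched off once the parameter estimates have converged.

```python
import numpy as np

def exploration_noise(k, converged=False):
    """Sum-of-sinusoids excitation added to u_k (and, in simulation, used as w_k)."""
    if converged:
        return 0.0                         # noise removed after the estimates converge
    freqs = (0.1, 0.25, 0.7, 1.3)          # assumed frequencies (rad/sample)
    amps = (1.0, 0.8, 0.6, 0.4)            # assumed amplitudes
    return float(sum(a * np.sin(w * k) for a, w in zip(amps, freqs)))
```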

3.4 Continuous-Time Zero-Sum Game and H∞ Control Problem

In this section, we will present model-free solutions to the differential zero-sum game, which is the continuous-time counterpart of the problem discussed in the previous section. Like its discrete-time counterpart, the formulation of the differential zero-sum game is analogous to the H∞ control problem for continuous-time systems. However, as we have seen in Chaps. 1 and 2, unlike its discrete-time counterpart, the Bellman equation associated with the continuous-time problems is not in a recursive form. More specifically, for the case of the linear quadratic differential game, we have the following partial differential equation,

0 = r(x(t), Kx(t), Gx(t)) + \left( \frac{\partial V}{\partial x} \right)^T (Ax(t) + Bu(t) + Ew(t)),

which involves the derivative of the cost function and requires the knowledge of the system dynamics. In Chap. 2, we presented algorithms based on integral reinforcement learning (IRL) that circumvent this difficulty by providing a Bellman equation in a recursive form. The IRL Bellman equation for the differential zero-sum game is given by

V(x(t)) = \int_t^{t+T} r(x(\tau), Kx(\tau), Gx(\tau))\,d\tau + V(x(t+T)).

In the remainder of this section, we will first introduce the model-based iterative
techniques for solving the continuous-time problem. Then, we will present model-
free iterative techniques to find the optimal strategies for the linear quadratic
differential zero-sum game, which we will refer to simply as the differential zero-sum game
for simplicity. We will first introduce the state feedback learning algorithms before
presenting the output feedback learning algorithms.
Consider a continuous-time linear system in the state space form,

\dot{x} = Ax + Bu + Ew,
y = Cx,    (3.39)
where x \in \mathbb{R}^n is the state, u \in \mathbb{R}^{m_1} is the control input (of Player 1), w \in \mathbb{R}^{m_2} is the disturbance input (of the opposing player, Player 2), and y \in \mathbb{R}^p is the output. We assume that the pair (A, B) is controllable and the pair (A, C) is observable.

Let us define the infinite horizon cost function as

J(x(0), u, w) = \int_0^{\infty} r(x(\tau), u(\tau), w(\tau))\,d\tau.    (3.40)

For a linear quadratic differential game, the utility function r(x, u, w) takes the following quadratic form,

r(x, u, w) = y^T(t)Q_y y(t) + u^T(t)u(t) - \gamma^2 w^T(t)w(t),    (3.41)

where Q_y ≥ 0 is the user-defined weighting matrix and γ ≥ 0 is an upper bound on the desired L_2 gain from the disturbance w to the performance measure y^T(t)Q_y y(t) + u^T(t)u(t), that is,

\int_0^{\infty} \left( y^T(\tau)Q_y y(\tau) + u^T(\tau)u(\tau) \right) d\tau \leq \gamma^2 \int_0^{\infty} w^T(\tau)w(\tau)\,d\tau,  \quad w \in L_2[0, \infty).    (3.42)
The problem of finding a control u such that (3.42) is satisfied is an H∞ control
problem. The minimum value of γ , denoted as γ ∗ , for which the above H∞ control
problem has a solution is referred to as the H∞ infimum. This H∞ control problem
is often referred to as the H∞ disturbance decoupling problem.
We introduce a value function that gives the cost of executing the given policies,

V(x(t)) = \int_t^{\infty} \left( y^T(\tau)Q_y y(\tau) + u^T(\tau)u(\tau) - \gamma^2 w^T(\tau)w(\tau) \right) d\tau.    (3.43)

If the feedback policies are such that the resulting closed-loop system is asymptotically stable, then V(x) is quadratic in the state [6], that is,

V(x) = x^T P x,    (3.44)

for some P > 0. The optimal value function associated with the zero-sum game is of the form

V^*(x(t)) = \min_u \max_w \int_t^{\infty} \left( y^T(\tau)Q_y y(\tau) + u^T(\tau)u(\tau) - \gamma^2 w^T(\tau)w(\tau) \right) d\tau,    (3.45)

where the input u acts as the minimizing player and the input w acts as the maximizing player. Equivalently, the aim of the zero-sum game is to find the saddle point solution (u^*, w^*) that satisfies the following pair of Nash equilibrium inequalities,
J(x(0), u^*, w) \leq J(x(0), u^*, w^*) \leq J(x(0), u, w^*),    (3.46)

for any feedback policies u(x) and w(x). For the special case of the H∞ control problem, the maximizing player w is an L_2[0, \infty) disturbance.

Under the additional assumption of the observability of (A, \sqrt{Q}), where \sqrt{Q}^T\sqrt{Q} = Q, Q = C^T Q_y C, the problem is solvable if there exists a unique positive definite matrix P^* that satisfies the following continuous-time game algebraic Riccati equation (GARE),

A^T P + PA + Q - P\left( BB^T - \gamma^{-2}EE^T \right)P = 0.    (3.47)

In this case, there exist the unique state feedback policies u^* = K^*x and w^* = G^*x that achieve the objective (3.45), where

K^* = -B^T P^*,    (3.48)
G^* = \gamma^{-2}E^T P^*.    (3.49)
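Given a candidate solution P of (3.47), the sketch below checks the GARE residual and extracts the saddle-point gains (3.48)–(3.49). It assumes the model (A, B, E, Q) is available and is intended only for offline verification of a computed or learned solution.

```python
import numpy as np

def gare_residual_and_gains(A, B, E, Q, P, gamma):
    """Residual of the GARE (3.47) together with K* and G* from (3.48)-(3.49)."""
    residual = A.T @ P + P @ A + Q - P @ (B @ B.T - gamma**-2 * E @ E.T) @ P
    K_star = -B.T @ P                      # u* = K* x
    G_star = gamma**-2 * E.T @ P           # w* = G* x
    return np.linalg.norm(residual), K_star, G_star
```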

From the above discussion it is clear that, in order to find the optimal game
strategies, we need to have full knowledge of the system matrices for the solution
of the GARE (3.47). Furthermore, access to the information of the state is needed
for the implementation of the optimal strategies. In what follows, we will present
alternate approaches to finding the solution of this GARE.

3.4.1 Model-Based Iterative Schemes for Zero-Sum Game and H∞ Control Problem

Even when the system model information is available, the GARE (3.47) is difficult
to solve owing to its nonlinear nature. Iterative computational methods have
been developed to address this difficulty. We recall the policy iteration algorithm,
Algorithm 3.7, from [132].
Algorithm 3.7 finds the optimal strategies for the differential zero-sum game
by iteratively solving the Lyapunov equation (3.50). As is the case with the
previously discussed PI algorithms, a stabilizing policy is required to ensure that
the policy evaluation step results in a finite cost. For games with stable dynamics,
the feedback control may simply be initialized to zero. However, for an unstable
system, it is difficult to obtain such a stabilizing policy when the system dynamics
is unknown. To obviate this requirement, value iteration algorithms are used that
perform recursive updates on the cost matrix P instead of solving the Lyapunov
equation in every iteration.
In [11], a VI algorithm was proposed for the H∞ control problem. The algorithm
is an extension of the VI algorithm presented in Chap. 2 for solving the LQR
Algorithm 3.7 Model-based policy iteration for the differential zero-sum game
input: system dynamics (A, B, E)
output: P^*, K^* and G^*
1: initialize. Select an admissible policy K^0 such that A + BK^0 is Hurwitz. Set G^0 = 0. Set j ← 0.
2: repeat
3: policy evaluation. Solve the following Lyapunov equation for P^j,

(A + BK^j + EG^j)^T P^j + P^j(A + BK^j + EG^j) + Q + (K^j)^T K^j - \gamma^2 (G^j)^T G^j = 0.    (3.50)

4: policy update. Find improved policies as

K^{j+1} = -B^T P^j,
G^{j+1} = \gamma^{-2}E^T P^j.

5: j ← j + 1
6: until ||P^j − P^{j−1}|| < ε for some small ε > 0.
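A minimal sketch of Algorithm 3.7 using SciPy's continuous Lyapunov solver is shown below. It assumes a stabilizing initial gain K0 and full model knowledge (A, B, E, Q), and it omits the convergence test for brevity.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def policy_iteration_zero_sum(A, B, E, Q, gamma, K0, iters=30):
    """Sketch of Algorithm 3.7 (model-based PI for the differential zero-sum game)."""
    K = K0
    G = np.zeros((E.shape[1], A.shape[0]))
    P = None
    for _ in range(iters):
        Acl = A + B @ K + E @ G
        # Lyapunov equation (3.50): Acl' P + P Acl = -(Q + K'K - gamma^2 G'G)
        P = solve_continuous_lyapunov(Acl.T, -(Q + K.T @ K - gamma**2 * G.T @ G))
        K = -B.T @ P
        G = gamma**-2 * E.T @ P
    return P, K, G
```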

problem. We recall the following definitions before introducing the VI algorithm. Let \{B_q\}_{q=0}^{\infty} be some bounded nonempty sets that satisfy

B_q \subseteq B_{q+1},  \quad q \in \mathbb{Z}_+,

and

\lim_{q\to\infty} B_q = \mathcal{P}_+^n,

where \mathcal{P}_+^n is the set of n-dimensional positive definite matrices. With these definitions, the VI algorithm is presented in Algorithm 3.8.
Algorithm 3.8 recursively solves the GARE equation (3.47), instead of solving a
Lyapunov equation. As a result, it does not require the knowledge of a stabilizing
initial policy. However, both Algorithms 3.7 and 3.8 are model-based as they require
the full model information (A, B, C, E).

3.4.2 Model-Free Schemes Based on State Feedback

Learning methods have been presented in the literature that solve the optimal control
problems without requiring the model information. The model-free state feedback
policy iteration algorithm, Algorithm 3.9, was developed in [60].
Algorithm 3.8 Model-based value iteration for the differential zero-sum game
Input: system dynamics (A, B, E)
Output: P^*
Initialization. Set P^0 > 0, j ← 0, q ← 0.
1: loop
2: \tilde{P}^{j+1} ← P^j + \epsilon_j\left( A^T P^j + P^j A + Q - P^j BB^T P^j + \gamma^{-2}P^j EE^T P^j \right), where \epsilon_j > 0 is the step size.
3: if \tilde{P}^{j+1} \notin B_q then
4: P^{j+1} ← P^0
5: q ← q + 1
6: else if ||\tilde{P}^{j+1} − P^j||/\epsilon_j < ε, for some small ε > 0, then
7: return P^j as P^*
8: else
9: P^{j+1} ← \tilde{P}^{j+1}
10: end if
11: j ← j + 1
12: end loop

Algorithm 3.9 Model-free state feedback policy iteration algorithm for the differential zero-sum game
input: input-state data
output: P^*, K^* and G^*
1: initialize. Select a stabilizing control policy K^0 and apply u^0 = K^0 x + n and w^0 = ν, where n and ν are the exploration signals. Set G^0 = 0. Set j ← 0.
2: collect data. Apply u^0 to the system to collect online data for t ∈ [t_0, t_l], where l is the number of learning intervals of length t_k − t_{k−1} = T, k = 1, 2, · · · , l. Based on this data, perform the following iterations,
3: repeat
4: evaluate and improve policies. Find the solution, P^j, K^{j+1} and G^{j+1}, of the following learning equation,

x^T(t)P^j x(t) - x^T(t-T)P^j x(t-T)
  = -\int_{t-T}^{t} x^T(\tau)\left( Q + (K^j)^T K^j - \gamma^2 (G^j)^T G^j \right)x(\tau)\,d\tau
    - 2\int_{t-T}^{t} \left( u(\tau) - K^j x(\tau) \right)^T K^{j+1} x(\tau)\,d\tau
    + 2\gamma^2 \int_{t-T}^{t} \left( w(\tau) - G^j x(\tau) \right)^T G^{j+1} x(\tau)\,d\tau.    (3.51)

5: j ← j + 1
6: until ||P^j − P^{j−1}|| < ε for some small ε > 0.
As is the case with other PI algorithms, Algorithm 3.9 requires a stabilizing initial control policy for the policy evaluation step to have a finite cost. To overcome this difficulty, a model-free value iteration algorithm, Algorithm 3.10, was proposed in [11].
It can be seen that the model-free Algorithm 3.10 does not require a stabilizing
control policy for its initialization. The algorithm is based on the recursive learning
equation (3.52), which is used to find the unknown matrices H j = AT P j + P j A,
K j = −B T P j and Gj = γ −2 E T P j .

Algorithm 3.10 Model-free state feedback value iteration algorithm for the differential zero-sum game
Input: input-state data
Output: P^*, K^* and G^*
Initialization. Select P^0 > 0 and set j ← 0, q ← 0.
Collect Online Data. Apply u^0 = n and w^0 = ν with n and ν being the exploration signals to the system and collect online data for t ∈ [t_0, t_l], where t_l = t_0 + lT and T is the interval length. Based on this data, perform the following iterations,
1: loop
2: Find the solution, H^j, K^j and G^j, of the following equation,

x^T(t)P^j x(t) - x^T(t-T)P^j x(t-T)
  = \int_{t-T}^{t} x^T(\tau)H^j x(\tau)\,d\tau - 2\int_{t-T}^{t} u^T(\tau)K^j x(\tau)\,d\tau
    + 2\gamma^2 \int_{t-T}^{t} w^T(\tau)G^j x(\tau)\,d\tau.    (3.52)

3: \tilde{P}^{j+1} ← P^j + \epsilon_j\left( H^j + Q - (K^j)^T K^j + \gamma^2 (G^j)^T G^j \right), where \epsilon_j > 0 is the step size.
4: if \tilde{P}^{j+1} \notin B_q then
5: P^{j+1} ← P^0
6: q ← q + 1
7: else if ||\tilde{P}^{j+1} − P^j||/\epsilon_j < ε then
8: return P^j, K^j and G^j as P^*, K^* and G^*
9: else
10: P^{j+1} ← \tilde{P}^{j+1}
11: end if
12: j ← j + 1
13: end loop
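In practice, the finite-window integrals that enter the learning equations (3.51) and (3.52) are computed from sampled trajectories. The sketch below approximates them with the trapezoidal rule; the trajectory arrays and the sampling period dt are assumptions about how the online data are recorded.

```python
import numpy as np

def trapezoid(y, dt):
    """Trapezoidal rule along the first axis of y."""
    return dt * (0.5 * (y[0] + y[-1]) + y[1:-1].sum(axis=0))

def window_integrals(x_traj, u_traj, w_traj, dt):
    """x_traj: (S, n), u_traj: (S, m1), w_traj: (S, m2) samples over one window [t-T, t].
    Returns the integrals of x⊗x, u⊗x and w⊗x over the window, in which the unknowns
    H^j, K^j and G^j of (3.52) enter linearly."""
    Ixx = trapezoid(np.einsum('si,sj->sij', x_traj, x_traj), dt)
    Iux = trapezoid(np.einsum('si,sj->sij', u_traj, x_traj), dt)
    Iwx = trapezoid(np.einsum('si,sj->sij', w_traj, x_traj), dt)
    return Ixx, Iux, Iwx
```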

In the discussion so far in this subsection we have found that the model-free
algorithms make use of the full state measurement. In the following subsections, we
will present a dynamic output feedback scheme to solve the model-free differential
zero-sum game and the associated H∞ control problem. Along the lines of the
discussion in Chap. 2, we will first present a parameterization of the state of the
continuous-time system. Based on this parameterization, we will present output
feedback learning equations that will form the basis of the model-free output
feedback learning algorithms. Finally, it will be shown that, similar to the output
feedback algorithms in Chap. 2, these algorithms also possess the exploration bias
immunity and, therefore, do not require a discounted cost function. Maintaining the
original undiscounted cost function ensures the stability of the closed-loop and the
optimality of the solution.

3.4.3 State Parameterization

In the previous developments we learnt that a key idea in developing the output
feedback learning equations is the parameterization of the state. Two parameteriza-
tions have been presented. One parameterization is based on the derivation using
the embedded observer and filtering approach, and the other is more direct in the
sense that it requires just the delayed measurements of the input, output, and the
disturbance signals. Based on the direct parameterization we developed the output
feedback learning equations. However, as has been discussed earlier in Chap. 2, the
direct parameterization does not extend to the continuous-time setting as it would
involve the derivatives of the input, output, and disturbance signals. On the other
hand, the observer and filtering based parameterization (3.22) result does have a
continuous-time counterpart, which is why it was introduced in the discrete-time
setting.
In the following, we will present a state parameterization procedure to represent
the state of a general continuous-time linear system in terms of the filtered input,
output, and disturbance.
Theorem 3.8 Consider system (3.39). Let the pair (A, C) be observable. Then, there exists a parameterization of the state in the form of

\bar{x}(t) = W_u\zeta_u(t) + W_w\zeta_w(t) + W_y\zeta_y(t) + e^{(A+LC)t}x(0),    (3.53)

where L is the observer gain chosen such that A + LC is Hurwitz stable, the parameterization matrices W_u = [W_u^1 \; W_u^2 \; \cdots \; W_u^{m_1}], W_w = [W_w^1 \; W_w^2 \; \cdots \; W_w^{m_2}] and W_y = [W_y^1 \; W_y^2 \; \cdots \; W_y^{p}] are given in the form of

W_u^i = \begin{bmatrix} a_{u(n-1)}^{i1} & a_{u(n-2)}^{i1} & \cdots & a_{u0}^{i1} \\ a_{u(n-1)}^{i2} & a_{u(n-2)}^{i2} & \cdots & a_{u0}^{i2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{u(n-1)}^{in} & a_{u(n-2)}^{in} & \cdots & a_{u0}^{in} \end{bmatrix},  i = 1, 2, \cdots, m_1,

W_w^i = \begin{bmatrix} a_{w(n-1)}^{i1} & a_{w(n-2)}^{i1} & \cdots & a_{w0}^{i1} \\ a_{w(n-1)}^{i2} & a_{w(n-2)}^{i2} & \cdots & a_{w0}^{i2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{w(n-1)}^{in} & a_{w(n-2)}^{in} & \cdots & a_{w0}^{in} \end{bmatrix},  i = 1, 2, \cdots, m_2,

W_y^i = \begin{bmatrix} a_{y(n-1)}^{i1} & a_{y(n-2)}^{i1} & \cdots & a_{y0}^{i1} \\ a_{y(n-1)}^{i2} & a_{y(n-2)}^{i2} & \cdots & a_{y0}^{i2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{y(n-1)}^{in} & a_{y(n-2)}^{in} & \cdots & a_{y0}^{in} \end{bmatrix},  i = 1, 2, \cdots, p,

whose elements are the coefficients of the numerators in the transfer function matrix of a Luenberger observer with inputs u(t), w(t) and y(t), and \zeta_u(t) = [\zeta_u^1(t) \; \zeta_u^2(t) \; \cdots \; \zeta_u^{m_1}(t)]^T, \zeta_w(t) = [\zeta_w^1(t) \; \zeta_w^2(t) \; \cdots \; \zeta_w^{m_2}(t)]^T and \zeta_y(t) = [\zeta_y^1(t) \; \zeta_y^2(t) \; \cdots \; \zeta_y^{p}(t)]^T represent the states of the user-defined dynamics driven by the individual input u^i(t), disturbance w^i(t) and output y^i(t) as given by

\dot{\zeta}_u^i(t) = A\zeta_u^i(t) + Bu^i(t),  \zeta_u^i(0) = 0,  i = 1, 2, \cdots, m_1,    (3.54)
\dot{\zeta}_w^i(t) = A\zeta_w^i(t) + Bw^i(t),  \zeta_w^i(0) = 0,  i = 1, 2, \cdots, m_2,    (3.55)
\dot{\zeta}_y^i(t) = A\zeta_y^i(t) + By^i(t),  \zeta_y^i(0) = 0,  i = 1, 2, \cdots, p,    (3.56)

for a Hurwitz matrix A whose eigenvalues coincide with those of A + LC and an input vector B of the form

A = \begin{bmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \\ -\alpha_0 & -\alpha_1 & \cdots & \cdots & -\alpha_{n-1} \end{bmatrix},  \quad  B = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ 1 \end{bmatrix}.
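The continuous-time filters (3.54)–(3.56) can be realized directly from the companion-form pair above. In the sketch below the pair is renamed (A_f, B_f) to avoid clashing with the plant matrices; any Hurwitz characteristic polynomial may be chosen for the filter.

```python
import numpy as np

def filter_matrices_ct(alpha):
    """Companion pair for s^n + alpha[n-1] s^{n-1} + ... + alpha[0], alpha = [alpha_0, ..., alpha_{n-1}]."""
    n = len(alpha)
    Af = np.zeros((n, n))
    Af[:-1, 1:] = np.eye(n - 1)                    # shifted identity above the last row
    Af[-1, :] = -np.asarray(alpha, dtype=float)    # last row: [-alpha_0, ..., -alpha_{n-1}]
    Bf = np.zeros(n)
    Bf[-1] = 1.0
    return Af, Bf

def zeta_dot(Af, Bf, zeta, signal):
    """Right-hand side of (3.54)-(3.56) for one scalar channel."""
    return Af @ zeta + Bf * signal
```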
Proof The proof follows from the proof of the state parameterization result presented in Chap. 2. Given that (A, C) is observable, we can obtain, by considering the disturbance as an additional input, a state observer as

\dot{\hat{x}}(t) = A\hat{x}(t) + Bu(t) + Ew(t) - L\left( y(t) - C\hat{x}(t) \right)
              = (A + LC)\hat{x}(t) + Bu(t) + Ew(t) - Ly(t),

where \hat{x}(t) is the estimate of the state x(t) and L is the observer gain chosen such that matrix A + LC has all its eigenvalues in the open left-half plane. This observer is a dynamic system driven by u, w, and y with the dynamics matrix A + LC. By linearity, the effect of the disturbance can be added in the same way as that of u and y as carried out in the proof of Theorem 2.9. Therefore, following the arguments in the proof of Theorem 2.9, we can express the effect of the disturbance w as

\frac{W_w^i(s)}{\Lambda(s)}\left[w^i\right] = \begin{bmatrix} \dfrac{a_{w(n-1)}^{i1}s^{n-1} + a_{w(n-2)}^{i1}s^{n-2} + \cdots + a_{w0}^{i1}}{s^n + \alpha_{n-1}s^{n-1} + \alpha_{n-2}s^{n-2} + \cdots + \alpha_0} \\ \dfrac{a_{w(n-1)}^{i2}s^{n-1} + a_{w(n-2)}^{i2}s^{n-2} + \cdots + a_{w0}^{i2}}{s^n + \alpha_{n-1}s^{n-1} + \alpha_{n-2}s^{n-2} + \cdots + \alpha_0} \\ \vdots \\ \dfrac{a_{w(n-1)}^{in}s^{n-1} + a_{w(n-2)}^{in}s^{n-2} + \cdots + a_{w0}^{in}}{s^n + \alpha_{n-1}s^{n-1} + \alpha_{n-2}s^{n-2} + \cdots + \alpha_0} \end{bmatrix}\left[w^i\right]
= \begin{bmatrix} a_{w0}^{i1} & a_{w1}^{i1} & \cdots & a_{w(n-1)}^{i1} \\ a_{w0}^{i2} & a_{w1}^{i2} & \cdots & a_{w(n-1)}^{i2} \\ \vdots & \vdots & \ddots & \vdots \\ a_{w0}^{in} & a_{w1}^{in} & \cdots & a_{w(n-1)}^{in} \end{bmatrix} \begin{bmatrix} \frac{1}{\Lambda(s)}\left[w^i\right] \\ \frac{s}{\Lambda(s)}\left[w^i\right] \\ \vdots \\ \frac{s^{n-1}}{\Lambda(s)}\left[w^i\right] \end{bmatrix}
= W_w^i\zeta_w^i,

where \Lambda(s) denotes the characteristic polynomial of A + LC, and W_w^i \in \mathbb{R}^{n\times n} is the parametric matrix corresponding to the contribution to the state from the ith disturbance w^i. The corresponding filter state \zeta_w^i can be obtained as

\dot{\zeta}_w^i(t) = A\zeta_w^i(t) + Bw^i(t),

with A and B as defined in the statement of the theorem. Then, following the arguments used in Theorem 2.9, we have

\hat{x}(t) = W_u\zeta_u(t) + W_w\zeta_w(t) + W_y\zeta_y(t) + e^{(A+LC)t}\hat{x}(0).    (3.57)

It should be noted that A + LC represents the error dynamics of the observer,

e(t) = x(t) - \hat{x}(t) = e^{(A+LC)t}e(0),

and therefore x(t) can be written as

x(t) = W_u\zeta_u(t) + W_w\zeta_w(t) + W_y\zeta_y(t) + e^{(A+LC)t}x(0).    (3.58)

It can be seen that e^{(A+LC)t}\hat{x}(0) in (3.57) and e^{(A+LC)t}x(0) in (3.58) converge to zero as t → ∞ because A + LC is Hurwitz stable. This completes the proof.

Remark 3.3 The variables ζu , ζw , and ζy are generated through the filter matrix A
that is independent of the system dynamics. The connection between the param-
eterized state x̄ and the actual system state x is established through the matrices
W_u, W_w, and W_y, which contain the coefficients of the numerator polynomials of the observer transfer functions from u, w, and y (each with denominator \Lambda(s)) that depend on (A, B, C, E, L).
Analogous to the discrete-time counterpart of this parameterization given in (3.22),
we have the same full row rank conditions for the parameterization matrix W =
Wu Ww Wy in the state parameterization (3.53).
 
Theorem 3.9 The state parameterization matrix W = Wu Ww Wy in (3.53) is of
full row rank if either of (A + LC, B),(A + LC, E), or (A + LC, L) is controllable.
Proof The proof follows the same procedure as in the proof of Theorem 3.3
by replacing the transfer function variable z corresponding to the discrete-time
dynamics of ζu , ζw and ζy with the transfer function variable s corresponding to
the continuous-time dynamics.

Theorem 3.10 The parameterization matrix W in (3.53) is of full row rank if


matrices A and A + LC have no common eigenvalues.
Proof The proof is similar to the proof of Theorem 2.11.
Next, we use the state parameterization (3.53) to describe the cost function in (3.44). It can be easily verified that substitution of (3.53) in (3.44) results in

V = \begin{bmatrix} \zeta_u \\ \zeta_w \\ \zeta_y \end{bmatrix}^T \left[ W_u \; W_w \; W_y \right]^T P \left[ W_u \; W_w \; W_y \right] \begin{bmatrix} \zeta_u \\ \zeta_w \\ \zeta_y \end{bmatrix} = z^T\bar{P}z,    (3.59)

where

z = \left[ \zeta_u^T \; \zeta_w^T \; \zeta_y^T \right]^T \in \mathbb{R}^N,
W = \left[ W_u \; W_w \; W_y \right] \in \mathbb{R}^{n\times N},
\bar{P} = \bar{P}^T = \begin{bmatrix} W_u^T P W_u & W_u^T P W_w & W_u^T P W_y \\ W_w^T P W_u & W_w^T P W_w & W_w^T P W_y \\ W_y^T P W_u & W_y^T P W_w & W_y^T P W_y \end{bmatrix} \in \mathbb{R}^{N\times N},

with N = m_1 n + m_2 n + pn.
By (3.59) we have obtained a new description of the steady-state cost function in terms of the inputs and output of the system. The corresponding steady-state output feedback policies are given by

u = K\left[ W_u \; W_w \; W_y \right]\begin{bmatrix} \zeta_u \\ \zeta_w \\ \zeta_y \end{bmatrix} = \bar{K}z,    (3.60)

w = G\left[ W_u \; W_w \; W_y \right]\begin{bmatrix} \zeta_u \\ \zeta_w \\ \zeta_y \end{bmatrix} = \bar{G}z,    (3.61)

where

\bar{K} = K\left[ W_u \; W_w \; W_y \right] \in \mathbb{R}^{m_1\times(m_1 n + m_2 n + pn)}

and

\bar{G} = G\left[ W_u \; W_w \; W_y \right] \in \mathbb{R}^{m_2\times(m_1 n + m_2 n + pn)}.

Therefore, the optimal cost matrix is given by \bar{P}^* and the corresponding steady-state optimal output feedback policies are given by

\bar{u}^* = \bar{K}^*z,    (3.62)
\bar{w}^* = \bar{G}^*z.    (3.63)

Theorem 3.11 The output feedback policies given by (3.62) and (3.63) converge,
respectively, to the optimal policies (3.48) and (3.49), which solve the zero-sum
game.
Proof We can write the output feedback strategies as

\bar{u}^* = \bar{K}^*z = K^*Wz,
\bar{w}^* = \bar{G}^*z = G^*Wz.

Applying the result in Theorem 3.8, we have

\bar{x} = Wz,

which converges exponentially to x, and thus,

\bar{u}^* = K^*x,
\bar{w}^* = G^*x,

which are the optimal steady-state strategies of the differential zero-sum game. This completes the proof.

3.4.4 Learning Algorithms for Output Feedback Differential Zero-Sum Game and H∞ Control Problem

In Sect. 3.4.1, we have presented algorithms that learn the optimal state feedback strategies that solve the differential zero-sum game. We now direct our attention towards developing model-free techniques to learn the optimal output feedback strategies that solve the differential zero-sum game. We first recall from Algorithm 3.9 the following state feedback learning equation,

x^T(t)P^j x(t) - x^T(t-T)P^j x(t-T)
  = -\int_{t-T}^{t} x^T(\tau)\left( Q + (K^j)^T K^j - \gamma^2 (G^j)^T G^j \right)x(\tau)\,d\tau
    - 2\int_{t-T}^{t} \left( u(\tau) - K^j x(\tau) \right)^T K^{j+1} x(\tau)\,d\tau
    + 2\gamma^2\int_{t-T}^{t} \left( w(\tau) - G^j x(\tau) \right)^T G^{j+1} x(\tau)\,d\tau,    (3.64)

where K^{j+1} = -B^T P^j and G^{j+1} = \gamma^{-2}E^T P^j. Substituting \bar{x} = Wz for x and letting x^T Q x = y^T Q_y y in (3.64), we have the following steady-state output feedback learning equation,
z^T(t)\bar{P}^j z(t) - z^T(t-T)\bar{P}^j z(t-T)
  = -\int_{t-T}^{t} y^T(\tau)Q_y y(\tau)\,d\tau - \int_{t-T}^{t} z^T(\tau)(\bar{K}^j)^T\bar{K}^j z(\tau)\,d\tau
    + \gamma^2\int_{t-T}^{t} z^T(\tau)(\bar{G}^j)^T\bar{G}^j z(\tau)\,d\tau - 2\int_{t-T}^{t}\left( u(\tau) - \bar{K}^j z(\tau) \right)^T\bar{K}^{j+1} z(\tau)\,d\tau
    + 2\gamma^2\int_{t-T}^{t}\left( w(\tau) - \bar{G}^j z(\tau) \right)^T\bar{G}^{j+1} z(\tau)\,d\tau,    (3.65)

where K̄^{j+1} = K^{j+1}W = −B^{T}P^{j}W and Ḡ^{j+1} = G^{j+1}W = γ^{−2}E^{T}P^{j}W. In (3.65), P̄^{j}, K̄^{j+1}, and Ḡ^{j+1} are the unknowns to be found using the filtered input and output data z. As there are more unknowns than equations, we develop a system of l such equations by performing l finite window integrals, each over a period of length T, and solve them in the least-squares sense. To solve this linear system of equations, we define the following data matrices,
$$
\delta_{zz}=\begin{bmatrix}\bar{z}^{T}(t_1)-\bar{z}^{T}(t_0) & \bar{z}^{T}(t_2)-\bar{z}^{T}(t_1) & \cdots & \bar{z}^{T}(t_l)-\bar{z}^{T}(t_{l-1})\end{bmatrix}^{T},
$$
$$
I_{zu}=\begin{bmatrix}\int_{t_0}^{t_1}z\otimes u\,d\tau & \int_{t_1}^{t_2}z\otimes u\,d\tau & \cdots & \int_{t_{l-1}}^{t_l}z\otimes u\,d\tau\end{bmatrix}^{T},\quad
I_{zw}=\begin{bmatrix}\int_{t_0}^{t_1}z\otimes w\,d\tau & \cdots & \int_{t_{l-1}}^{t_l}z\otimes w\,d\tau\end{bmatrix}^{T},
$$
$$
I_{zz}=\begin{bmatrix}\int_{t_0}^{t_1}z\otimes z\,d\tau & \cdots & \int_{t_{l-1}}^{t_l}z\otimes z\,d\tau\end{bmatrix}^{T},\quad
I_{yy}=\begin{bmatrix}\int_{t_0}^{t_1}y\otimes y\,d\tau & \cdots & \int_{t_{l-1}}^{t_l}y\otimes y\,d\tau\end{bmatrix}^{T}.
$$

We can write the l equations of the form (3.65) in the following compact form,
$$
\Theta^{j}\begin{bmatrix}\mathrm{vecs}\big(\bar{P}^{j}\big)\\ \mathrm{vec}\big(\bar{K}^{j+1}\big)\\ \mathrm{vec}\big(\bar{G}^{j+1}\big)\end{bmatrix}=\Xi^{j},
$$
where the data matrices are given by
$$
\Theta^{j}=\begin{bmatrix}\delta_{zz} & -2I_{zz}\big(I_N\otimes(\bar{K}^{j})^{T}\big)+2I_{zu} & 2\gamma^{2}I_{zz}\big(I_N\otimes(\bar{G}^{j})^{T}\big)-2\gamma^{2}I_{zw}\end{bmatrix}
\in\mathbb{R}^{l\times\big(\frac{N(N+1)}{2}+(m_1+m_2)N\big)},
$$
$$
\Xi^{j}=-I_{zz}\,\mathrm{vec}\big(\bar{Q}^{j}\big)-I_{yy}\,\mathrm{vec}\big(Q_{y}\big)\in\mathbb{R}^{l},
$$
with
$$
\bar{Q}^{j}=(\bar{K}^{j})^{T}\bar{K}^{j}-\gamma^{2}(\bar{G}^{j})^{T}\bar{G}^{j},
$$
$$
\bar{z}=\begin{bmatrix}z_1^{2} & 2z_1z_2 & \cdots & z_2^{2} & 2z_2z_3 & \cdots & z_N^{2}\end{bmatrix},
$$
$$
\mathrm{vecs}\big(\bar{P}^{j}\big)=\begin{bmatrix}\bar{P}_{11}^{j} & \bar{P}_{12}^{j} & \cdots & \bar{P}_{1N}^{j} & \bar{P}_{22}^{j} & \bar{P}_{23}^{j} & \cdots & \bar{P}_{NN}^{j}\end{bmatrix}.
$$

The least-squares solution is given by
$$
\begin{bmatrix}\mathrm{vecs}\big(\bar{P}^{j}\big)\\ \mathrm{vec}\big(\bar{K}^{j+1}\big)\\ \mathrm{vec}\big(\bar{G}^{j+1}\big)\end{bmatrix}
=\big((\Theta^{j})^{T}\Theta^{j}\big)^{-1}(\Theta^{j})^{T}\Xi^{j}. \qquad (3.66)
$$

We are now ready to present the output feedback policy algorithm, Algo-
rithm 3.11, to learn the solution of the differential zero-sum game.

Algorithm 3.11 Model-free output feedback policy iteration for the differential
zero-sum game
Input: Input-Output filtering data
Output: P̄ ∗ , K̄ ∗ and Ḡ∗
1: Initialize. Select policies u0 = K̄ 0 z + ν1 and w 0 = ν2 , with K̄ 0 being a stabilizing policy,
and ν1 and ν2 being the corresponding exploration signals. Set G0 = 0. Set j ← 0.
2: Acquire Data. Apply u0 and w 0 during t ∈ [t0 , tl ], where l is the number of learning intervals
of length tk − tk−1 = T , k = 1, 2, · · · , l. Collect the filtered input and output data for each
interval.
3: loop
4: Find P̄ j , K̄ j +1 and Ḡj +1 by solving (3.65), that is,

$$
z^{T}(t)\bar{P}^{j}z(t)-z^{T}(t-T)\bar{P}^{j}z(t-T)
=-\int_{t-T}^{t}y^{T}Q_{y}y\,d\tau
-\int_{t-T}^{t}z^{T}(\bar{K}^{j})^{T}\bar{K}^{j}z\,d\tau
+\gamma^{2}\int_{t-T}^{t}z^{T}(\bar{G}^{j})^{T}\bar{G}^{j}z\,d\tau
$$
$$
-2\int_{t-T}^{t}\big(u-\bar{K}^{j}z\big)^{T}\bar{K}^{j+1}z\,d\tau
+2\gamma^{2}\int_{t-T}^{t}\big(w-\bar{G}^{j}z\big)^{T}\bar{G}^{j+1}z\,d\tau.
$$

 
5: if ‖P̄^{j} − P̄^{j−1}‖ < ε, then
6: return P̄ j , K̄ j +1 and Ḡj +1 as P̄ ∗ , K̄ ∗ and Ḡ∗
7: else
8: j ←j +1
9: end if
10: end loop
In Algorithm 3.11, we collect only the filtered input, output, and disturbance data
to compute their quadratic integrals and form the data matrices. The least-squares
problem corresponding to (3.65) is solved based on this data set. Note that we use a
stabilizing initial policy K̄ 0 to collect data, which will be reused in the subsequent
iterations. Since there are N(N + 1)/2 + (m1 + m2 )N unknowns corresponding to
P̄ j , K̄ j +1 and Ḡj +1 we need l ≥ N(N + 1)/2 + (m1 + m2 )N data sets to solve
(3.65).
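The assembly of the data matrices and the least-squares step (3.66) can be organized as in the sketch below. The array names (`delta_zz`, `I_zz`, `I_zu`, `I_zw`, `I_yy`) stand for the finite-window integrals defined above and are assumed to be computed elsewhere from the filtered signals; the Kronecker/vectorization ordering is an assumption and must match the way the ⊗ integrals are stacked.

```python
import numpy as np

def pi_least_squares(delta_zz, I_zz, I_zu, I_zw, I_yy, K_bar, G_bar, Qy, gamma, N):
    """One policy-evaluation step of Algorithm 3.11: solve (3.66) in the
    least-squares sense for vecs(P_bar^j), vec(K_bar^{j+1}), vec(G_bar^{j+1})."""
    m1, m2 = K_bar.shape[0], G_bar.shape[0]
    Q_bar = K_bar.T @ K_bar - gamma**2 * G_bar.T @ G_bar
    # Regression matrix Theta^j and target Xi^j of the compact form above
    Theta = np.hstack([
        delta_zz,
        -2.0 * I_zz @ np.kron(np.eye(N), K_bar.T) + 2.0 * I_zu,
        2.0 * gamma**2 * I_zz @ np.kron(np.eye(N), G_bar.T) - 2.0 * gamma**2 * I_zw,
    ])
    Xi = -I_zz @ Q_bar.reshape(-1) - I_yy @ np.asarray(Qy).reshape(-1)
    theta, *_ = np.linalg.lstsq(Theta, Xi, rcond=None)
    n_p = N * (N + 1) // 2
    P_bar_vecs = theta[:n_p]
    K_next = theta[n_p:n_p + m1 * N].reshape(m1, N)
    G_next = theta[n_p + m1 * N:].reshape(m2, N)
    return P_bar_vecs, K_next, G_next
```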
We now address the problem of requiring a stabilizing initial control policy. It
can be seen that Algorithm 3.11 solves the output feedback differential zero-sum
game without using any knowledge of the system dynamics. However, it requires
a stabilizing initial control policy. In the situation when the system is unstable and
a stabilizing initial policy is not available, we propose an output feedback value
iteration algorithm. We start with the following equations,

$$
x^{T}(t)P^{j}x(t)-x^{T}(t-T)P^{j}x(t-T)
=\int_{t-T}^{t}x^{T}(\tau)\big(A^{T}P^{j}+P^{j}A\big)x(\tau)\,d\tau
+2\int_{t-T}^{t}u^{T}(\tau)B^{T}P^{j}x(\tau)\,d\tau
+2\int_{t-T}^{t}w^{T}(\tau)E^{T}P^{j}x(\tau)\,d\tau, \qquad (3.67)
$$
$$
P^{j+1}=P^{j}+\epsilon_{j}\big(H^{j}+Q-P^{j}BB^{T}P^{j}+\gamma^{-2}P^{j}EE^{T}P^{j}\big). \qquad (3.68)
$$

Equation (3.67) is obtained by taking the derivative $\frac{d}{dt}(x^{T}P^{j}x)$ along the trajectory of (3.39) and then integrating both sides of the resulting equation.
We now work towards deriving an equivalent output feedback learning equation corresponding to Equation (3.67). Notice that the recursive GARE equation (3.68) uses the state weighting Q, which in the output feedback case is given by Q = C^{T}Q_{y}C. However, this would require the knowledge of the output matrix C. To overcome this difficulty, we add $\int_{t-T}^{t}y^{T}(\tau)Q_{y}y(\tau)\,d\tau$ to both sides of (3.67) so that Q can be lumped together with the unknown H^{j} = A^{T}P^{j} + P^{j}A, that is,
$$
x^{T}(t)P^{j}x(t)-x^{T}(t-T)P^{j}x(t-T)+\int_{t-T}^{t}y^{T}(\tau)Q_{y}y(\tau)\,d\tau
=\int_{t-T}^{t}x^{T}(\tau)\big(A^{T}P^{j}+P^{j}A+C^{T}Q_{y}C\big)x(\tau)\,d\tau
+2\int_{t-T}^{t}u^{T}(\tau)B^{T}P^{j}x(\tau)\,d\tau
+2\int_{t-T}^{t}w^{T}(\tau)E^{T}P^{j}x(\tau)\,d\tau.
$$

Next, we obtain the steady-state output feedback learning equation by using the state
parameterization (3.53) to obtain the following equation,
$$
z^{T}(t)\bar{P}^{j}z(t)-z^{T}(t-T)\bar{P}^{j}z(t-T)+\int_{t-T}^{t}y^{T}(\tau)Q_{y}y(\tau)\,d\tau
=\int_{t-T}^{t}z^{T}(\tau)\bar{H}^{j}z(\tau)\,d\tau
+2\int_{t-T}^{t}u^{T}(\tau)B^{T}P^{j}Wz(\tau)\,d\tau
+2\int_{t-T}^{t}w^{T}(\tau)E^{T}P^{j}Wz(\tau)\,d\tau, \qquad (3.69)
$$

where H̄^{j} = W^{T}(A^{T}P^{j} + P^{j}A + C^{T}Q_{y}C)W. As (3.69) is a scalar equation, we need to build l such equations using l datasets. To solve the resulting linear system of equations, we define the following data matrices,
$$
\delta_{zz}=\begin{bmatrix}z\otimes z\big|_{t_0}^{t_1} & z\otimes z\big|_{t_1}^{t_2} & \cdots & z\otimes z\big|_{t_{l-1}}^{t_l}\end{bmatrix}^{T},\quad
I_{zu}=\begin{bmatrix}\int_{t_0}^{t_1}z\otimes u\,d\tau & \cdots & \int_{t_{l-1}}^{t_l}z\otimes u\,d\tau\end{bmatrix}^{T},
$$
$$
I_{zw}=\begin{bmatrix}\int_{t_0}^{t_1}z\otimes w\,d\tau & \cdots & \int_{t_{l-1}}^{t_l}z\otimes w\,d\tau\end{bmatrix}^{T},\quad
I_{zz}=\begin{bmatrix}\int_{t_0}^{t_1}\bar{z}^{T}(\tau)\,d\tau & \cdots & \int_{t_{l-1}}^{t_l}\bar{z}^{T}(\tau)\,d\tau\end{bmatrix}^{T},
$$
$$
I_{yy}=\begin{bmatrix}\int_{t_0}^{t_1}y\otimes y\,d\tau & \cdots & \int_{t_{l-1}}^{t_l}y\otimes y\,d\tau\end{bmatrix}^{T},
$$

where, for k = 1, 2, · · · , l,
$$
z\otimes z\big|_{t_{k-1}}^{t_k}=z(t_k)\otimes z(t_k)-z(t_{k-1})\otimes z(t_{k-1}).
$$

Then, we can write the l equations of the form (3.69) in the following compact form,
$$
\Theta^{j}\begin{bmatrix}\mathrm{vecs}\big(\bar{H}^{j}\big)\\ \mathrm{vec}\big(B^{T}P^{j}W\big)\\ \mathrm{vec}\big(E^{T}P^{j}W\big)\end{bmatrix}=\Xi^{j},
$$
where the data matrices are given by
$$
\Theta^{j}=\begin{bmatrix}I_{zz} & 2I_{zu} & 2I_{zw}\end{bmatrix}\in\mathbb{R}^{l\times\big(\frac{N(N+1)}{2}+(m_1+m_2)N\big)},\quad
\Xi^{j}=\delta_{zz}\,\mathrm{vec}\big(\bar{P}^{j}\big)+I_{yy}\,\mathrm{vec}\big(Q_{y}\big)\in\mathbb{R}^{l}.
$$

The least-squares solution is then given by
$$
\begin{bmatrix}\mathrm{vecs}\big(\bar{H}^{j}\big)\\ \mathrm{vec}\big(B^{T}P^{j}W\big)\\ \mathrm{vec}\big(E^{T}P^{j}W\big)\end{bmatrix}
=\big((\Theta^{j})^{T}\Theta^{j}\big)^{-1}(\Theta^{j})^{T}\Xi^{j}. \qquad (3.70)
$$

We now propose the output feedback value iteration algorithm, Algorithm 3.12, for the differential zero-sum game based on (3.69). Let $\{\mathcal{B}_q\}_{q=0}^{\infty}$ be a collection of sets of norm-limited positive semi-definite matrices with nonempty interiors that satisfy
$$
\mathcal{B}_q\subseteq\mathcal{B}_{q+1},\quad q\in\mathbb{Z}_{+},
$$
and
$$
\lim_{q\to\infty}\mathcal{B}_q=\mathcal{P}_{+},
$$

where $\mathcal{P}_{+}$ is the set of positive semi-definite matrices. Note that the purpose of the set $\mathcal{B}_q$ is to prevent the estimates $\tilde{\bar{P}}^{j+1}$ from escaping. If an upper bound on $|\bar{P}^{*}|$ is known, then $\mathcal{B}_q$ can be fixed to a $\bar{\mathcal{B}}$ that contains $\bar{P}^{0}$ and $\bar{P}^{*}$ in its interior. This assumption of an upper bound on $\bar{P}^{*}$, although quite restrictive, is helpful in the projection of the estimate of $\bar{P}^{*}$ in the value iteration algorithms. A discussion on this assumption can be found in the relevant state feedback works, such as [11].
It can be seen that Algorithm 3.12 does not require a stabilizing control policy for its initialization. The updates in the difference Riccati equation are performed with a varying step size $\epsilon_j$ that satisfies $\lim_{j\to\infty}\epsilon_j=0$. It is worth pointing out that the least-squares solution (3.70) for Algorithm 3.12 provides lumped parameter estimates of the terms $B^{T}P^{*}W$ and $E^{T}P^{*}W$.
Remark 3.4 Compared to their state feedback counterparts (3.51) and (3.52), Equations (3.66) and (3.70) solve the least-squares problem using input-output data instead of input-state data. Similar to the state feedback problem, in order to solve these least-squares problems, we need to inject exploration signals ν₁ and ν₂ in u and w, respectively. Since the data matrix $\Theta^{j}$ has columns associated with the filtered measurements as well as the actions u and w, the two exploration signals ν₁ and ν₂ are selected not only independent of these filtered measurements but also independent of one another. Both of these independence conditions are necessary to ensure that $\Theta^{j}$ is of full column rank. In other words, the following rank condition needs to hold for all j,
$$
\mathrm{rank}\big(\Theta^{j}\big)=\frac{N(N+1)}{2}+(m_1+m_2)N. \qquad (3.71)
$$
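A minimal sketch of how the rank condition (3.71) and the independence of the two exploration signals can be handled in practice is given below; the frequency range, the amplitude, and the name `Theta` for the assembled regression matrix are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def exploration_signal(t, n_freqs=10, amplitude=0.5):
    """Sum of sinusoids with randomly drawn frequencies and phases; calling it
    twice (once for nu_1, once for nu_2) yields two mutually independent signals."""
    freqs = rng.uniform(0.1, 10.0, n_freqs)
    phases = rng.uniform(0.0, 2.0 * np.pi, n_freqs)
    return amplitude * np.sum(np.sin(np.outer(np.atleast_1d(t), freqs) + phases), axis=1)

def rank_condition_holds(Theta, N, m1, m2):
    """Verify rank(Theta^j) = N(N+1)/2 + (m1+m2)N before solving (3.66)/(3.70)."""
    return np.linalg.matrix_rank(Theta) == N * (N + 1) // 2 + (m1 + m2) * N
```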
Algorithm 3.12 Model-free output feedback value iteration for the differential zero-sum game
Input: Input-output filtering data
Output: P̄^{∗}, K̄^{∗}, and Ḡ^{∗}
1: Initialize. Choose u^{0} = ν₁ and w^{0} = ν₂. Initialize P̄^{0} > 0, and set j ← 0, q ← 0.
2: Acquire Data. Apply u^{0} and w^{0} during t ∈ [t₀, t_l], where l is the number of learning intervals of length t_k − t_{k−1} = T, k = 1, 2, · · · , l. Collect the filtered input-output data for each interval.
3: loop
4: Find H̄^{j}, K̄^{j} and Ḡ^{j} by solving
$$
z^{T}(t)\bar{P}^{j}z(t)-z^{T}(t-T)\bar{P}^{j}z(t-T)+\int_{t-T}^{t}y^{T}(\tau)Q_{y}y(\tau)\,d\tau
=\int_{t-T}^{t}z^{T}(\tau)\bar{H}^{j}z(\tau)\,d\tau
-2\int_{t-T}^{t}u^{T}(\tau)\bar{K}^{j}z(\tau)\,d\tau
+2\gamma^{2}\int_{t-T}^{t}w^{T}(\tau)\bar{G}^{j}z(\tau)\,d\tau,
$$
5: $\tilde{\bar{P}}^{j+1}\leftarrow\bar{P}^{j}+\epsilon_{j}\big(\bar{H}^{j}-(\bar{K}^{j})^{T}\bar{K}^{j}+\gamma^{2}(\bar{G}^{j})^{T}\bar{G}^{j}\big)$
6: if $\tilde{\bar{P}}^{j+1}\notin\mathcal{B}_q$ then
7: P̄^{j+1} ← P̄^{0}
8: q ← q + 1
9: else if $\|\tilde{\bar{P}}^{j+1}-\bar{P}^{j}\|/\epsilon_{j}<\varepsilon$ then
10: return P̄^{j}, K̄^{j} and Ḡ^{j} as P̄^{∗}, K̄^{∗} and Ḡ^{∗}
11: else
12: $\bar{P}^{j+1}\leftarrow\tilde{\bar{P}}^{j+1}$
13: end if
14: j ← j + 1
15: end loop
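Steps 5–12 of Algorithm 3.12 reduce to a simple recursion followed by a projection. The sketch below models the set $\mathcal{B}_q$ as a Frobenius-norm ball of radius `bound_q`, which is one admissible choice of the nested family of bounded sets; any family with the stated properties can be substituted.

```python
import numpy as np

def vi_update(P_bar, H_bar, K_bar, G_bar, eps_j, gamma, P_bar0, bound_q):
    """One value-iteration update of Algorithm 3.12 with projection onto B_q.
    Returns the new iterate and a flag indicating whether the estimate escaped
    B_q (in which case the caller enlarges the set by incrementing q)."""
    P_tilde = P_bar + eps_j * (H_bar - K_bar.T @ K_bar + gamma**2 * G_bar.T @ G_bar)
    if np.linalg.norm(P_tilde) > bound_q:   # the estimate left B_q
        return P_bar0, True
    return P_tilde, False
```

The caller stops the loop once $\|\tilde{\bar{P}}^{j+1}-\bar{P}^{j}\|/\epsilon_{j}$ falls below the tolerance ε, exactly as in Step 9.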

We now show that the proposed output feedback algorithms, Algorithms 3.11
and 3.12, converge to the optimal output feedback solution.
Theorem 3.12 Consider system (3.39). Assume that the output feedback zero-sum
game, and hence, the H∞ problem, are solvable. Then, the proposed output feedback
algorithms, Algorithms 3.11 and 3.12, generate a sequence P̄ j that converges to the
optimal output feedback solution P̄ ∗ as j → ∞, provided that the rank condition
(3.71) holds.
Proof Consider a Lyapunov function

V j (x) = x T P j x. (3.72)

The time derivative of V j (x) along the trajectory of the closed-loop system with the
output feedback controls u = K̄ j z + ν1 and w = Ḡj z + ν2 is given by
 
$$
\dot{V}^{j}(x)=x^{T}\big(A^{T}P^{j}+P^{j}A\big)x+2x^{T}P^{j}\big(B\bar{K}^{j}+E\bar{G}^{j}\big)z+2(B\nu_1+E\nu_2)^{T}P^{j}x
$$
$$
=x^{T}\big(A^{T}P^{j}+P^{j}A\big)x+2x^{T}P^{j}\big(BK^{j}+EG^{j}\big)Wz+2(B\nu_1+E\nu_2)^{T}P^{j}x. \qquad (3.73)
$$

By Theorem 3.8, we have
$$
Wz=x-e^{(A+LC)t}x(0).
$$
Using this relation, together with the facts that
$$
K^{j+1}=-B^{T}P^{j},\quad G^{j+1}=\gamma^{-2}E^{T}P^{j},
$$
$$
\nu_1=u-\bar{K}^{j}z=u-K^{j}\big(x-e^{(A+LC)t}x(0)\big)
$$
and
$$
\nu_2=w-\bar{G}^{j}z=w-G^{j}\big(x-e^{(A+LC)t}x(0)\big),
$$

we integrate both sides of (3.73) to obtain
$$
x^{T}(t)P^{j}x(t)-x^{T}(t-T)P^{j}x(t-T)
=\int_{t-T}^{t}x^{T}(\tau)\Big(\big(A+BK^{j}+EG^{j}\big)^{T}P^{j}+P^{j}\big(A+BK^{j}+EG^{j}\big)\Big)x(\tau)\,d\tau
$$
$$
-2\int_{t-T}^{t}\Big(u-K^{j}\big(x-e^{(A+LC)\tau}x(0)\big)\Big)^{T}K^{j+1}x(\tau)\,d\tau
+2\gamma^{2}\int_{t-T}^{t}\Big(w-G^{j}\big(x-e^{(A+LC)\tau}x(0)\big)\Big)^{T}G^{j+1}x(\tau)\,d\tau
$$
$$
-2\int_{t-T}^{t}x^{T}(\tau)P^{j}\big(BK^{j}+EG^{j}\big)e^{(A+LC)\tau}x(0)\,d\tau. \qquad (3.74)
$$

Noting that K^{j+1} = −B^{T}P^{j} and G^{j+1} = γ^{−2}E^{T}P^{j}, we have
$$
-2\int_{t-T}^{t}\big(K^{j}e^{(A+LC)\tau}x(0)\big)^{T}K^{j+1}x(\tau)\,d\tau
=2\int_{t-T}^{t}x^{T}(\tau)P^{j}BK^{j}e^{(A+LC)\tau}x(0)\,d\tau,
$$
$$
2\gamma^{2}\int_{t-T}^{t}\big(G^{j}e^{(A+LC)\tau}x(0)\big)^{T}G^{j+1}x(\tau)\,d\tau
=2\int_{t-T}^{t}x^{T}(\tau)P^{j}EG^{j}e^{(A+LC)\tau}x(0)\,d\tau,
$$
which results in the cancellation of all the exponentially decaying terms. Comparing the resulting equation with the learning equation (3.51), we have
$$
\big(A+BK^{j}+EG^{j}\big)^{T}P^{j}+P^{j}\big(A+BK^{j}+EG^{j}\big)
+C^{T}Q_{y}C+(K^{j})^{T}K^{j}-\gamma^{2}(G^{j})^{T}G^{j}=0, \qquad (3.75)
$$

which is the Lyapunov equation associated with the zero-sum game. Thus, iterating
on the solution of the output feedback learning equation (3.65) is equivalent to
iterating on the Lyapunov equation (3.75). The existence of the unique solution
to the equation (3.65) is guaranteed under the rank condition (3.71). In [132], it
has been shown that the iterations on the Lyapunov equation (3.75) converge to the
solution of the GARE (3.47) under the controllability and observability conditions.
Therefore, we can conclude that the proposed iterative algorithm, Algorithm 3.11,
also converges to the solution of the GARE (3.47), provided that the least-squares
problem (3.66) corresponding to (3.65) is solvable.
We next prove the convergence of the output feedback value iteration Algo-
rithm 3.12. Consider the following recursion in Algorithm 3.12,

$$
\tilde{\bar{P}}^{j+1}=\bar{P}^{j}+\epsilon_{j}\big(\bar{H}^{j}-W^{T}P^{j}BB^{T}P^{j}W+\gamma^{-2}W^{T}P^{j}EE^{T}P^{j}W\big).
$$
In view of the definitions of P̄ and H̄, we have
$$
W^{T}\tilde{P}^{j+1}W=W^{T}P^{j}W+\epsilon_{j}\big(W^{T}H^{j}W+W^{T}C^{T}Q_{y}CW-W^{T}P^{j}BB^{T}P^{j}W+\gamma^{-2}W^{T}P^{j}EE^{T}P^{j}W\big).
$$
Recall from Theorems 3.9 and 3.10 that W is of full row rank. Thus, the above equation reduces to
$$
\tilde{P}^{j+1}=P^{j}+\epsilon_{j}\big(H^{j}+Q-P^{j}BB^{T}P^{j}+\gamma^{-2}P^{j}EE^{T}P^{j}\big),
$$
where Q = C T Qy C. The recursions on the above equation converge to the GARE


solution P ∗ under the controllability and observability conditions if H j , B T P j , and
E T P j are the unique solution to (3.52) as shown in [10], which in turn requires
the unique solution H̄ j , B T P j W , and E T P j W of the least-squares problem (3.70)
associated with (3.69). The existence of the unique solution of (3.70) is guaranteed
under the rank condition (3.71). This completes the proof.

3.4.5 Exploration Bias Immunity of the Output Feedback Learning Algorithms

We now establish the immunity of the output feedback algorithms to the exploration
bias problem. We have the following result.
Theorem 3.13 The output feedback algorithms, Algorithms 3.11 and 3.12, are
immune to the exploration bias problem.
Proof Define the inputs of Player 1 and Player 2 with the excitation signals as û = u + ν₁ and ŵ = w + ν₂. As a result of applying the excited inputs, the estimates of P̄^{j}, K̄^{j} and Ḡ^{j} in the jth iteration are represented as $\hat{\bar{P}}^{j}$, $\hat{\bar{K}}^{j}$ and $\hat{\bar{G}}^{j}$. We first consider the following output feedback learning equation (3.65) with the exploration inputs,

$$
z^{T}(t)\hat{\bar{P}}^{j}z(t)-z^{T}(t-T)\hat{\bar{P}}^{j}z(t-T)
=-\int_{t-T}^{t}y^{T}Q_{y}y\,d\tau
-\int_{t-T}^{t}z^{T}(\hat{\bar{K}}^{j})^{T}\hat{\bar{K}}^{j}z\,d\tau
+\gamma^{2}\int_{t-T}^{t}z^{T}(\hat{\bar{G}}^{j})^{T}\hat{\bar{G}}^{j}z\,d\tau
$$
$$
-2\int_{t-T}^{t}\big(\hat{u}-\hat{\bar{K}}^{j}z\big)^{T}\hat{\bar{K}}^{j+1}z\,d\tau
+2\gamma^{2}\int_{t-T}^{t}\big(\hat{w}-\hat{\bar{G}}^{j}z\big)^{T}\hat{\bar{G}}^{j+1}z\,d\tau.
$$

Taking the time derivative results in
$$
2z^{T}(t)\hat{\bar{P}}^{j}\dot{z}(t)-2z^{T}(t-T)\hat{\bar{P}}^{j}\dot{z}(t-T)
=-y^{T}(t)Q_{y}y(t)-z^{T}(t)\big((\hat{\bar{K}}^{j})^{T}\hat{\bar{K}}^{j}-\gamma^{2}(\hat{\bar{G}}^{j})^{T}\hat{\bar{G}}^{j}\big)z(t)
$$
$$
-2\big(\hat{u}(t)-\hat{\bar{K}}^{j}z(t)\big)^{T}\hat{\bar{K}}^{j+1}z(t)
+2\gamma^{2}\big(\hat{w}(t)-\hat{\bar{G}}^{j}z(t)\big)^{T}\hat{\bar{G}}^{j+1}z(t)
+y^{T}(t-T)Q_{y}y(t-T)+z^{T}(t-T)(\hat{\bar{K}}^{j})^{T}\hat{\bar{K}}^{j}z(t-T)
$$
$$
-\gamma^{2}z^{T}(t-T)(\hat{\bar{G}}^{j})^{T}\hat{\bar{G}}^{j}z(t-T)
+2\big(\hat{u}(t-T)-\hat{\bar{K}}^{j}z(t-T)\big)^{T}\hat{\bar{K}}^{j+1}z(t-T)
-2\gamma^{2}\big(\hat{w}(t-T)-\hat{\bar{G}}^{j}z(t-T)\big)^{T}\hat{\bar{G}}^{j+1}z(t-T).
$$

Expansion of terms yields
$$
2z^{T}(t)\hat{\bar{P}}^{j}\big(\bar{A}z(t)+\bar{B}_1u(t)+\bar{B}_2w(t)+\bar{B}_3y(t)\big)
+2z^{T}(t)\hat{\bar{P}}^{j}\bar{B}_1\nu_1(t)+2z^{T}(t)\hat{\bar{P}}^{j}\bar{B}_2\nu_2(t)
$$
$$
-2z^{T}(t-T)\hat{\bar{P}}^{j}\big(\bar{A}z(t-T)+\bar{B}_1u(t-T)+\bar{B}_2w(t-T)+\bar{B}_3y(t-T)\big)
-2z^{T}(t-T)\hat{\bar{P}}^{j}\bar{B}_1\nu_1(t-T)-2z^{T}(t-T)\hat{\bar{P}}^{j}\bar{B}_2\nu_2(t-T)
$$
$$
=-y^{T}(t)Q_{y}y(t)-z^{T}(t)\big((\hat{\bar{K}}^{j})^{T}\hat{\bar{K}}^{j}-\gamma^{2}(\hat{\bar{G}}^{j})^{T}\hat{\bar{G}}^{j}\big)z(t)
-2\big(u(t)-\hat{\bar{K}}^{j}z(t)\big)^{T}\hat{\bar{K}}^{j+1}z(t)-2\nu_1^{T}(t)\hat{\bar{K}}^{j+1}z(t)
$$
$$
+2\gamma^{2}\big(w(t)-\hat{\bar{G}}^{j}z(t)\big)^{T}\hat{\bar{G}}^{j+1}z(t)+2\gamma^{2}\nu_2^{T}(t)\hat{\bar{G}}^{j+1}z(t)
+y^{T}(t-T)Q_{y}y(t-T)+z^{T}(t-T)(\hat{\bar{K}}^{j})^{T}\hat{\bar{K}}^{j}z(t-T)
$$
$$
-\gamma^{2}z^{T}(t-T)(\hat{\bar{G}}^{j})^{T}\hat{\bar{G}}^{j}z(t-T)
+2\big(u(t-T)-\hat{\bar{K}}^{j}z(t-T)\big)^{T}\hat{\bar{K}}^{j+1}z(t-T)
-2\gamma^{2}\big(w(t-T)-\hat{\bar{G}}^{j}z(t-T)\big)^{T}\hat{\bar{G}}^{j+1}z(t-T)
$$
$$
+2\nu_1^{T}(t-T)\hat{\bar{K}}^{j+1}z(t-T)-2\gamma^{2}\nu_2^{T}(t-T)\hat{\bar{G}}^{j+1}z(t-T),
$$

where we have combined the dynamics of z based on the input-output dynamics of the observer, in which Ā and B̄ represent the combined system dynamics and B̄₁, B̄₂ and B̄₃ the input matrices corresponding to (3.54)-(3.56), as
$$
\begin{bmatrix}\dot{\zeta}_u\\ \dot{\zeta}_w\\ \dot{\zeta}_y\end{bmatrix}
=\begin{bmatrix}\bar{A}_1 & 0 & 0\\ 0 & \bar{A}_2 & 0\\ 0 & 0 & \bar{A}_3\end{bmatrix}
\begin{bmatrix}\zeta_u\\ \zeta_w\\ \zeta_y\end{bmatrix}
+\begin{bmatrix}\bar{B}_1 & \bar{B}_2 & \bar{B}_3\end{bmatrix}
\begin{bmatrix}u\\ w\\ y\end{bmatrix}
=\begin{bmatrix}\bar{A}_1 & 0 & 0\\ 0 & \bar{A}_2 & 0\\ 0 & 0 & \bar{A}_3\end{bmatrix}
\begin{bmatrix}\zeta_u\\ \zeta_w\\ \zeta_y\end{bmatrix}
+\begin{bmatrix}B_1 & 0 & 0\\ 0 & B_2 & 0\\ 0 & 0 & B_3\end{bmatrix}
\begin{bmatrix}u\\ w\\ y\end{bmatrix}
=\bar{A}z+\bar{B}\eta,
$$
in which each Āᵢ and Bᵢ is further a block diagonal matrix whose blocks correspond to the matrices A and B defined in Theorem 3.8, with the number of such blocks being equal to the number of components in the individual vectors u, w, and y.
Next, we recall the following facts that hold under state feedback laws,
$$
2x^{T}\hat{P}^{j}B\nu_1=-2\nu_1^{T}\hat{K}^{j+1}x,\quad
2x^{T}\hat{P}^{j}E\nu_2=2\gamma^{2}\nu_2^{T}\hat{G}^{j+1}x,
$$
which together with the application of Theorem 3.8 give
$$
2z^{T}\hat{\bar{P}}^{j}\bar{B}_1\nu_1=-2\nu_1^{T}\hat{\bar{K}}^{j+1}z,\quad
2z^{T}\hat{\bar{P}}^{j}\bar{B}_2\nu_2=2\gamma^{2}\nu_2^{T}\hat{\bar{G}}^{j+1}z.
$$

As a result, we have the cancellation of both delayed and non-delayed terms involving ν₁ and ν₂, which results in
 
$$
2z^{T}(t)\hat{\bar{P}}^{j}\big(\bar{A}z(t)+\bar{B}_1u(t)+\bar{B}_2w(t)+\bar{B}_3y(t)\big)
-2z^{T}(t-T)\hat{\bar{P}}^{j}\big(\bar{A}z(t-T)+\bar{B}_1u(t-T)+\bar{B}_2w(t-T)+\bar{B}_3y(t-T)\big)
$$
$$
=-y^{T}(t)Q_{y}y(t)-z^{T}(t)\big((\hat{\bar{K}}^{j})^{T}\hat{\bar{K}}^{j}-\gamma^{2}(\hat{\bar{G}}^{j})^{T}\hat{\bar{G}}^{j}\big)z(t)
-2\big(u(t)-\hat{\bar{K}}^{j}z(t)\big)^{T}\hat{\bar{K}}^{j+1}z(t)
+2\gamma^{2}\big(w(t)-\hat{\bar{G}}^{j}z(t)\big)^{T}\hat{\bar{G}}^{j+1}z(t)
$$
$$
+y^{T}(t-T)Q_{y}y(t-T)+z^{T}(t-T)(\hat{\bar{K}}^{j})^{T}\hat{\bar{K}}^{j}z(t-T)
-\gamma^{2}z^{T}(t-T)(\hat{\bar{G}}^{j})^{T}\hat{\bar{G}}^{j}z(t-T)
$$
$$
+2\big(u(t-T)-\hat{\bar{K}}^{j}z(t-T)\big)^{T}\hat{\bar{K}}^{j+1}z(t-T)
-2\gamma^{2}\big(w(t-T)-\hat{\bar{G}}^{j}z(t-T)\big)^{T}\hat{\bar{G}}^{j+1}z(t-T).
$$
Repacking the terms and performing integration on both sides of the above equation lead to
$$
z^{T}(t)\hat{\bar{P}}^{j}z(t)-z^{T}(t-T)\hat{\bar{P}}^{j}z(t-T)
=-\int_{t-T}^{t}y^{T}Q_{y}y\,d\tau
-\int_{t-T}^{t}z^{T}(\hat{\bar{K}}^{j})^{T}\hat{\bar{K}}^{j}z\,d\tau
+\gamma^{2}\int_{t-T}^{t}z^{T}(\hat{\bar{G}}^{j})^{T}\hat{\bar{G}}^{j}z\,d\tau
$$
$$
-2\int_{t-T}^{t}\big(u-\hat{\bar{K}}^{j}z\big)^{T}\hat{\bar{K}}^{j+1}z\,d\tau
+2\gamma^{2}\int_{t-T}^{t}\big(w-\hat{\bar{G}}^{j}z\big)^{T}\hat{\bar{G}}^{j+1}z\,d\tau, \qquad (3.76)
$$
which is the bias-free output feedback learning equation in u and w. Comparing (3.76) with (3.65), we have $\hat{\bar{P}}^{j}=\bar{P}^{j}$, $\hat{\bar{K}}^{j+1}=\bar{K}^{j+1}$, and $\hat{\bar{G}}^{j+1}=\bar{G}^{j+1}$. This establishes the bias-free property of Algorithm 3.11.
We now show that Algorithm 3.12 is also free from exploration bias. Consider the VI learning equation (3.52) with the excited inputs û and ŵ. Let $\hat{\bar{P}}^{j}$, $\hat{\bar{H}}^{j}$, $\hat{\bar{K}}^{j}$ and $\hat{\bar{G}}^{j}$ be the parameter estimates obtained as a result of these excited inputs. Then,
$$
z^{T}(t)\hat{\bar{P}}^{j}z(t)-z^{T}(t-T)\hat{\bar{P}}^{j}z(t-T)+\int_{t-T}^{t}y^{T}(\tau)Q_{y}y(\tau)\,d\tau
=\int_{t-T}^{t}z^{T}(\tau)\hat{\bar{H}}^{j}z(\tau)\,d\tau
-2\int_{t-T}^{t}\hat{u}^{T}(\tau)\hat{\bar{K}}^{j}z(\tau)\,d\tau
+2\gamma^{2}\int_{t-T}^{t}\hat{w}^{T}(\tau)\hat{\bar{G}}^{j}z(\tau)\,d\tau.
$$

Taking the time derivative results in
$$
2z^{T}(t)\hat{\bar{P}}^{j}\dot{z}(t)-2z^{T}(t-T)\hat{\bar{P}}^{j}\dot{z}(t-T)+y^{T}(t)Q_{y}y(t)-y^{T}(t-T)Q_{y}y(t-T)
$$
$$
=z^{T}(t)\hat{\bar{H}}^{j}z(t)-z^{T}(t-T)\hat{\bar{H}}^{j}z(t-T)
-2\hat{u}^{T}(t)\hat{\bar{K}}^{j}z(t)+2\gamma^{2}\hat{w}^{T}(t)\hat{\bar{G}}^{j}z(t)
+2\hat{u}^{T}(t-T)\hat{\bar{K}}^{j}z(t-T)-2\gamma^{2}\hat{w}^{T}(t-T)\hat{\bar{G}}^{j}z(t-T).
$$

Expansion of terms yields
$$
2z^{T}(t)\hat{\bar{P}}^{j}\big(\bar{A}z(t)+\bar{B}_1u(t)+\bar{B}_2w(t)+\bar{B}_3y(t)\big)
+2z^{T}(t)\hat{\bar{P}}^{j}\bar{B}_1\nu_1(t)+2z^{T}(t)\hat{\bar{P}}^{j}\bar{B}_2\nu_2(t)
+y^{T}(t)Q_{y}y(t)-y^{T}(t-T)Q_{y}y(t-T)
$$
$$
-2z^{T}(t-T)\hat{\bar{P}}^{j}\big(\bar{A}z(t-T)+\bar{B}_1u(t-T)+\bar{B}_2w(t-T)+\bar{B}_3y(t-T)\big)
-2z^{T}(t-T)\hat{\bar{P}}^{j}\bar{B}_1\nu_1(t-T)-2z^{T}(t-T)\hat{\bar{P}}^{j}\bar{B}_2\nu_2(t-T)
$$
$$
=z^{T}(t)\hat{\bar{H}}^{j}z(t)-z^{T}(t-T)\hat{\bar{H}}^{j}z(t-T)
-2u^{T}(t)\hat{\bar{K}}^{j}z(t)+2\gamma^{2}w^{T}(t)\hat{\bar{G}}^{j}z(t)
+2u^{T}(t-T)\hat{\bar{K}}^{j}z(t-T)-2\gamma^{2}w^{T}(t-T)\hat{\bar{G}}^{j}z(t-T)
$$
$$
-2\nu_1^{T}(t)\hat{\bar{K}}^{j}z(t)+2\gamma^{2}\nu_2^{T}(t)\hat{\bar{G}}^{j}z(t)
+2\nu_1^{T}(t-T)\hat{\bar{K}}^{j}z(t-T)-2\gamma^{2}\nu_2^{T}(t-T)\hat{\bar{G}}^{j}z(t-T).
$$
Using the fact that WB̄₁ = B and WB̄₂ = E, we have $2z^{T}\hat{\bar{P}}^{j}\bar{B}_1\nu_1=-2\nu_1^{T}\hat{\bar{K}}^{j}z$ and $2z^{T}\hat{\bar{P}}^{j}\bar{B}_2\nu_2=2\gamma^{2}\nu_2^{T}\hat{\bar{G}}^{j}z$, thereby cancelling the delayed and non-delayed νᵢ terms. Thus, we have
 
$$
2z^{T}(t)\hat{\bar{P}}^{j}\big(\bar{A}z(t)+\bar{B}_1u(t)+\bar{B}_2w(t)+\bar{B}_3y(t)\big)
-2z^{T}(t-T)\hat{\bar{P}}^{j}\big(\bar{A}z(t-T)+\bar{B}_1u(t-T)+\bar{B}_2w(t-T)+\bar{B}_3y(t-T)\big)
+y^{T}(t)Q_{y}y(t)-y^{T}(t-T)Q_{y}y(t-T)
$$
$$
=z^{T}(t)\hat{\bar{H}}^{j}z(t)-z^{T}(t-T)\hat{\bar{H}}^{j}z(t-T)
-2u^{T}(t)\hat{\bar{K}}^{j}z(t)+2\gamma^{2}w^{T}(t)\hat{\bar{G}}^{j}z(t)
+2u^{T}(t-T)\hat{\bar{K}}^{j}z(t-T)-2\gamma^{2}w^{T}(t-T)\hat{\bar{G}}^{j}z(t-T).
$$
Reversing the previous operations and performing integration result in
$$
z^{T}(t)\hat{\bar{P}}^{j}z(t)-z^{T}(t-T)\hat{\bar{P}}^{j}z(t-T)+\int_{t-T}^{t}y^{T}(\tau)Q_{y}y(\tau)\,d\tau
=\int_{t-T}^{t}z^{T}(\tau)\hat{\bar{H}}^{j}z(\tau)\,d\tau
-2\int_{t-T}^{t}u^{T}(\tau)\hat{\bar{K}}^{j}z(\tau)\,d\tau
+2\gamma^{2}\int_{t-T}^{t}w^{T}(\tau)\hat{\bar{G}}^{j}z(\tau)\,d\tau. \qquad (3.77)
$$

Comparing (3.77) with (3.69), we have $\hat{\bar{P}}^{j}=\bar{P}^{j}$, $\hat{\bar{H}}^{j}=\bar{H}^{j}$, $\hat{\bar{K}}^{j}=-B^{T}P^{j}W$, and $\hat{\bar{G}}^{j}=\gamma^{-2}E^{T}P^{j}W$. This establishes the bias-free property of Algorithm 3.12.

3.4.6 Numerical Examples

In this subsection, we present numerical examples to verify the output feedback algorithms, Algorithms 3.11 and 3.12.
Example 3.2 (An Auto Pilot System) In this example, we validate Algorithm 3.11
on the F-16 fighter aircraft auto pilot system whose dynamics is represented by
(3.39) with
$$
A=\begin{bmatrix}-1.01887 & 0.90506 & -0.00215\\ 0.82225 & -1.07741 & -0.17555\\ 0 & 0 & -1\end{bmatrix},\quad
B=\begin{bmatrix}0\\ 0\\ 5\end{bmatrix},\quad
E=\begin{bmatrix}1\\ 0\\ 0\end{bmatrix},\quad
C=\begin{bmatrix}1 & 0 & 0\end{bmatrix}.
$$

The three states are given by x = [x1 x2 x3 ]T , where x1 is the angle of attack,
x2 is the rate of pitch, and x3 is the elevator angle of deflection [110]. Here u is a
stabilizing player and w is a perturbation on the angle of attack. Let the user-defined
cost function be specified by Qy = 1 and γ = 3. The eigenvalues of matrix A
are all chosen to be at −10. The nominal output feedback solution can be found by
solving the Riccati equation (3.47) and then computing Wu , Ww , Wy , P̄ ∗ , K̄ ∗ , and
Ḡ∗ as
$$
P^{*}=\begin{bmatrix}0.9149 & 0.5387 & -0.0677\\ 0.5387 & 0.4248 & -0.0607\\ -0.0677 & -0.0607 & 0.0098\end{bmatrix},\quad
K^{*}=\begin{bmatrix}-0.3385 & -0.3035 & 0.0492\end{bmatrix},\quad
G^{*}=\begin{bmatrix}0.1017 & 0.0599 & -0.0075\end{bmatrix},
$$
$$
W_u=\begin{bmatrix}-0.80 & -0.01 & 0\\ -21.96 & -0.87 & 0\\ 1332.7 & 146.5 & 5\end{bmatrix},\quad
W_w=\begin{bmatrix}-1.07 & 2.07 & 1\\ -1092.5 & -260.89 & 0\\ 5104.1 & 4737.4 & 0\end{bmatrix},\quad
W_y=\begin{bmatrix}1029.8 & 303.5 & 27.2\\ 1144.7 & 1382.3 & 261.7\\ -1674.9 & -9930.8 & -4737.3\end{bmatrix},
$$
$$
\bar{P}^{*}=\begin{bmatrix}W_u & W_w & W_y\end{bmatrix}^{T}P^{*}\begin{bmatrix}W_u & W_w & W_y\end{bmatrix},\quad
\bar{K}^{*}=K^{*}\begin{bmatrix}W_u & W_w & W_y\end{bmatrix},\quad
\bar{G}^{*}=G^{*}\begin{bmatrix}W_u & W_w & W_y\end{bmatrix}.
$$
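For reference, the nominal state feedback solution above can be reproduced with a short model-based computation that mirrors the Lyapunov iteration (3.75) discussed in the proof of Theorem 3.12. This is a verification sketch only (the learning algorithms themselves never use A, B, E, C), and the iteration count is an arbitrary choice; since A is Hurwitz in this example, zero initial gains are admissible.

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

A = np.array([[-1.01887, 0.90506, -0.00215],
              [0.82225, -1.07741, -0.17555],
              [0.0, 0.0, -1.0]])
B = np.array([[0.0], [0.0], [5.0]])
E = np.array([[1.0], [0.0], [0.0]])
C = np.array([[1.0, 0.0, 0.0]])
Qy, gamma = np.array([[1.0]]), 3.0
Q = C.T @ Qy @ C

K = np.zeros((1, 3))   # admissible since A is Hurwitz
G = np.zeros((1, 3))
for _ in range(50):
    Ac = A + B @ K + E @ G
    # (A+BK+EG)^T P + P (A+BK+EG) + Q + K^T K - gamma^2 G^T G = 0
    P = solve_continuous_lyapunov(Ac.T, -(Q + K.T @ K - gamma**2 * G.T @ G))
    K, G = -B.T @ P, gamma**-2 * E.T @ P
print(P)   # should approach the P* listed above
```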

The policies K̄ 0 and Ḡ0 are initialized to zero. The learning period is chosen as
T = 0.1s and l = 65 learning intervals are performed. It is assumed that the
disturbance is sufficiently exciting. In the simulation, the rank condition (3.71) is
satisfied by adding sinusoidal signals of random frequencies to both the control
input and the disturbance input. These excitation signals are removed after the
convergence of the algorithm. As comparison, we carry out the state feedback policy
iteration, Algorithm 3.9. The algorithm is carried out using input-state data with the
same objective function and identical learning period as that of the output feedback
algorithm, Algorithm 3.11. It should be noted that the learning phase of both the
state and output feedback algorithms uses randomly generated frequencies. We
would like to compare the performance achievable under the state feedback and
output feedback algorithms.
The closed-loop response under the state feedback PI algorithm and the closed-
loop response under the output feedback PI algorithm are shown in Figs. 3.9
and 3.10, respectively. It can be seen that the post-learning trajectories with the
converged optimal solutions of both the state and output feedback algorithms are
similar, which shows that the output feedback policy recovers the performance
achievable under the state feedback policy. This confirms the result in Theorem 3.11
that the output feedback policy converges to its state feedback counterpart. It can be
seen that the proposed output feedback algorithm is able to achieve stabilization
similar to the state feedback algorithm but with the advantage that the access to
the full state of the system is no longer needed. The results of the convergence of
parameter estimates of the output feedback algorithm, Algorithm 3.11, are shown in
Figs. 3.11, 3.12, and 3.13. It can be seen that convergence to the optimal parameters
is achieved without incurring any estimation bias and the closed-loop stability is
guaranteed.

Example 3.3 (A Double Integrator with Disturbance) We now test Algorithm 3.12
on an unstable system. We consider the double integrator system, that is, system
(3.39) with
 
$$
A=\begin{bmatrix}0 & 1\\ 0 & 0\end{bmatrix},\quad
B=\begin{bmatrix}0\\ 1\end{bmatrix},
$$

Fig. 3.9 Example 3.2: State trajectory of the closed-loop system under state feedback
Fig. 3.10 Example 3.2: State trajectory of the closed-loop system under output feedback

Fig. 3.11 Example 3.2: Convergence of the output feedback cost matrix P̄ j

Fig. 3.12 Example 3.2: Convergence of the Player 1 output feedback policy K̄ j
Fig. 3.13 Example 3.2: Convergence of the Player 2 output feedback policy Ḡj

 
$$
E=\begin{bmatrix}1\\ 1\end{bmatrix},\quad
C=\begin{bmatrix}1 & 0\end{bmatrix}.
$$

The double integrator model represents a large class of practical systems including
satellite attitude control and rigid body motion. It is known that such systems are not
static output feedback stabilizable. We choose the performance index parameters as
Qy = 1 and γ = 3. The eigenvalues of matrix A are placed at −2. The optimal
control parameters are found by solving the GARE (3.47) and then computing Wu ,
Ww , Wy , P̄ ∗ , K̄ ∗ , and Ḡ∗ as
 
$$
P^{*}=\begin{bmatrix}1.7997 & 1.4821\\ 1.4821 & 2.0941\end{bmatrix},\quad
K^{*}=\begin{bmatrix}1.4821 & 2.0941\end{bmatrix},\quad
G^{*}=\begin{bmatrix}0.3646 & 0.3974\end{bmatrix},
$$
$$
W_u=\begin{bmatrix}1 & 0\\ 4 & 1\end{bmatrix},\quad
W_w=\begin{bmatrix}1 & 1\\ 0 & 1\end{bmatrix},\quad
W_y=\begin{bmatrix}4 & 4\\ 0 & 4\end{bmatrix}.
$$
Fig. 3.14 Example 3.3: State trajectory of the closed-loop system under state feedback

Fig. 3.15 Example 3.3: State trajectory of the closed-loop system under output feedback

The initial controller parameters are set to zero. The learning period is T = 0.05 s with a total of l = 20 intervals. The GARE recursions are performed with the step size
$$
\epsilon_{i}=\big(i^{0.2}+5\big)^{-1},\quad i=0,1,2,\ldots
$$
and the set
$$
\mathcal{B}_q=\big\{\bar{P}\in\mathcal{P}_{+}^{4}:|\bar{P}|\le 200(q+1)\big\},\quad q=0,1,2,\ldots.
$$

The closed-loop response under the state feedback algorithm, Algorithm 3.10,
is shown in Fig. 3.14 and the closed-loop response under the output feedback
algorithm, Algorithm 3.12, is shown in Fig. 3.15. The convergence of the parameter
estimates in the output feedback algorithm is shown in Figs. 3.16, 3.17, and 3.18.
Fig. 3.16 Example 3.3: Convergence of the output feedback cost matrix P̄ j

Fig. 3.17 Example 3.3: Convergence of the Player 1 output feedback policy K̄ j

Fig. 3.18 Example 3.3: Convergence of the Player 2 output feedback policy Ḡj
3.5 Summary

The aim of this chapter is to present model-free output feedback algorithms for the
linear quadratic zero-sum game and the associated H∞ control problem. Both the
discrete-time and differential zero-sum games have been considered. In the first part
of the chapter, we presented an output feedback model-free solution for the discrete-
time linear quadratic zero-sum game and the associated H∞ control problem.
In particular, we developed an output feedback Q-function description, which is
more comprehensive than the value function description [56] due to the explicit
dependence of the Q-function on the control inputs and disturbances. In contrast to
[56], the issue of excitation noise bias is not present in our work due to the inclusion
of the input terms in the cost function, which results in the cancellation of noise
dependent terms in the Bellman equation. A proof of excitation noise immunity of
the proposed Q-learning scheme was provided. As a result, the presented algorithm
does not require a discounting factor which has been used in output feedback value
function learning. It was established that the presented method guarantees closed-
loop stability and that the learned output feedback controller is the optimal controller
corresponding to the solution of the original Riccati equation. Also, our approach
is different from the recently proposed off-policy technique used in [48], which
also addresses the excitation noise issue but requires a stabilizing initial policy and
full state feedback. Both of these requirements are not present in the presented
output feedback work. We note that the output feedback design presented here is
completely model-free. While other output feedback control schemes exist in the
literature, they require certain knowledge of the system dynamics and employ a
separate state estimator.
The second half of this chapter was devoted to the differential game, which
is the continuous-time counterpart of the discrete-time results in the first part of
the chapter. Similar to the discrete-time case, two player zero-sum differential
games have the same problem formulation as that of the continuous-time H∞
control problem. The framework of integral reinforcement learning was employed
to develop the learning equations. In [120], the continuous-time linear quadratic
differential zero-sum game was solved using a partially model-free method. Later,
a completely model-free solution to the same problem was presented in [60]. The
authors in [10] provided a learning scheme which further relaxes the requirement
of a stabilizing initial policy. All of these works, however, require the measurement
of the full state of the system. In this chapter, we presented an extension of the
state parameterization result introduced in Chap. 2. The extended parameterization
incorporates the effect of disturbance and serves as the basis of the output feedback
learning equations for solving the differential zero-sum game. Compared to the
recently proposed static output feedback solution to this problem [83], we presented
a dynamic output feedback scheme that does not require the system to be stabilizable
by static output feedback and, therefore, can also stabilize systems that are only
stabilizable by dynamic output feedback. Differently from the adaptive observer
approaches [86], we presented a type of parameterization of the state that can
be directly embedded into the learning equation, thereby eliminating the need to
identify the unknown observer model parameters and to estimate the system state,
which would otherwise complicate the learning process. Instead, the optimal output
feedback policies are learned directly without involving a state estimator. Finally,
compared to the recent output feedback works [56, 80], the scheme in this chapter
implicitly takes into account the exploration signals, which makes it immune to the
exploration bias and eliminates the need of a discounted cost function.

3.6 Notes and References

The zero-sum game involves a two agent scenario in which the two agents or players
have opposing objectives. Coincidentally, the formulation of this problem matches the
time-domain formulation of the H∞ control problem. We presented an extension of
the state parameterization for both the discrete-time and the continuous-time H∞
problems to develop new Q-learning and integral reinforcement learning equations
that include the effect of the disturbance. For results on nonlinear systems, interested
readers can refer to [71]. It is worth mentioning that the approach does not
involve value function learning, which would otherwise involve a discounting factor.
Although there exists a lower bound (upper bound for the differential game) on the
value of the discounting factor above (below) which the closed-loop stability may
be ensured, the computation of this bound requires the knowledge of the system
dynamics, which is assumed to be unknown in this problem [88].
The presentation in Sect. 3.3 follows from our results in [95] and provides
extension to the policy iteration algorithm that results in faster convergence. The
extension incorporates the rank analysis of the state parameterization subject to
disturbance and provides detailed convergence analysis of the learning algorithms
under this rank condition. The presentation of the continuous-time results in
Sect. 3.4 follows from our recent results in [103] and further includes the analysis
of the full row rank condition of the state parameterization in the convergence of the
output feedback policy iteration algorithm used in solving differential games.
Chapter 4
Model-Free Stabilization in the Presence
of Actuator Saturation

4.1 Introduction

A commonly encountered difficulty in implementing a control is the limited


actuation capability. The actuator provides control actions determined by the
controller to the plant. Physical limitations within the actuator impose constraints
on the magnitude of the control signal that could be delivered to the plant. Actuator
saturation is a very common form of control constraint that happens when the
control demand exceeds the actuator output capability. Despite its ubiquity, actuator
saturation is often not taken into consideration in control design. The underlying
difficulty comes from the nonlinear nature of the saturation constraint, which
drastically changes the system behavior. Even an otherwise linear system behaves
nonlinearly when saturation occurs. The effect of actuator saturation is readily seen
in the form of degradation of the system performance. In a severe situation, actuator
saturation may even cause the closed-loop system to lose its stability.
Owing to its practical significance, several lines of work have been dedicated to
addressing the problem of actuator saturation. For the fundamental understanding
of the problem of actuator saturation, several important theoretical results have
been established in the literature. In general, actuator saturation limits the region of
stability of the closed-loop system. In other words, a controller subject to actuator
saturation would in general only be able to stabilize the system when operated within
a certain range. Even for linear systems, global asymptotic stabilization is only
achievable under certain conditions on the open-loop systems, as will be stated in the
next section. Furthermore, for a system satisfying such conditions, one, in general,
has to resort to nonlinear control laws to achieve global asymptotic stabilization.
An intuitive idea in dealing with actuator saturation is to design a control
law that operates in the linear region of the actuator characteristics. It has been
established in our previous work [63] that, for a linear system satisfying the above
mentioned conditions for global asymptotic stabilizability, we can achieve semi-
global asymptotic stabilization by linear low gain feedback. Low gain feedback is


a family of linear feedback laws, either of state feedback or output feedback type,
parameterized in a scalar positive parameter, called the low gain parameter. By semi-
global asymptotic stabilization by linear low gain feedback we mean that, for any
a priori given, arbitrarily large, bounded set of the state space, the value of the low
gain parameter can be tuned small enough so that the given set is contained in the
domain of attraction of the resulting closed-loop system. There are different ways to
construct low gain feedback [63]. One of such constructions is based on the solution
of a parameterized algebraic Riccati equation (ARE), parameterized in the low gain
parameter.
The low gain feedback design is a model-based approach, which relies on
solving the parameterized ARE. This chapter builds on the model-free techniques
that we introduced in Chap. 2 to solve the LQR problem, which also involves the
solution of an ARE, to develop model-free low gain feedback designs for global
asymptotic stabilization of linear systems in the presence of actuator saturation.
Of particular interest is the question of how to avoid the actuator saturation in
the learning algorithm as the standard LQR learning algorithm does not consider
actuator saturation. To address this question, we introduce a scheduling mechanism
in the learning algorithm that helps to learn an appropriate value of the low gain
parameter so that actuator saturation is avoided for any initial condition of the
system. As a result, the model-free technique presented in this chapter will achieve
global asymptotic stabilization. Both state feedback and output feedback designs, in
both the discrete-time and continuous-time settings, will be considered.

4.2 Literature Review

Reinforcement learning control literature has seen limited development when it


comes to control design in the presence of actuator saturation. This is mainly
because the inclusion of actuator constraints leads to additional nonlinearities in
the HJB equation. The original work [3] employed the idea of nonquadratic cost
functionals [73] to incorporate the control constraint and provided a model-based
near optimal solution to the constrained HJB equation using neural networks. Later
on, [46, 78, 81] extended this idea to solve the partially model-free constrained
control problem online using reinforcement learning. Recently, a similar approach
has been followed in [21, 62, 69, 138] to encode the control constraint in the
objective function. A common feature in these works is that local stability could be
guaranteed only in the form of uniform ultimate boundedness. Furthermore, the use
of nonquadratic cost functional leads to nonlinear control laws even when the system
dynamics is linear [72]. As discussed in the previous section, we can obtain control
laws with linear structure for such problems without resorting to nonquadratic cost
functionals.
4.3 Global Asymptotic Stabilization of Discrete-Time Systems

Consider a discrete-time linear system subject to actuator saturation, given by the following state space representation,
$$
\begin{aligned}
x_{k+1}&=Ax_{k}+B\sigma(u_{k}),\\
y_{k}&=Cx_{k},
\end{aligned} \qquad (4.1)
$$
where $x_k=\begin{bmatrix}x_k^{1} & x_k^{2} & \cdots & x_k^{n}\end{bmatrix}^{T}\in\mathbb{R}^{n}$ refers to the state, $u_k=\begin{bmatrix}u_k^{1} & u_k^{2} & \cdots & u_k^{m}\end{bmatrix}^{T}\in\mathbb{R}^{m}$ is the input, and $y_k=\begin{bmatrix}y_k^{1} & y_k^{2} & \cdots & y_k^{p}\end{bmatrix}^{T}\in\mathbb{R}^{p}$ is the output. Without loss of generality, we assume that $\sigma:\mathbb{R}^{m}\to\mathbb{R}^{m}$ is a standard saturation function, that is, for $i=1,2,\cdots,m$,
$$
\sigma(u_k^{i})=
\begin{cases}
-b & \text{if } u_k^{i}<-b,\\
u_k^{i} & \text{if } -b\le u_k^{i}\le b,\\
b & \text{if } u_k^{i}>b,
\end{cases} \qquad (4.2)
$$
with b being the saturation limit.
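In simulation, the standard saturation (4.2) is simply a component-wise clipping operation; a one-line numpy rendering is shown below.

```python
import numpy as np

def sat(u, b):
    """Component-wise standard saturation (4.2) with limit b."""
    return np.clip(u, -b, b)
```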


Assumption 4.1 The pair (A, B) is asymptotically null controllable with bounded
controls (ANCBC), i.e.,
1. (A, B) is stabilizable,
2. All eigenvalues of A are on or inside the unit circle.

Assumption 4.2 The pair (A, C) is observable.


Condition 1 in Assumption 4.1 is a standard requirement for stabilization. Condition
2 is a necessary condition for global asymptotic stabilization in the presence of
actuator saturation [133]. Note that this condition allows systems to be polynomially
unstable, that is, to have repeated poles on the unit circle. Furthermore, as is
known in the literature on control systems with actuator saturation, when there
is an eigenvalue of A outside the unit circle, global asymptotic stabilization is not
achievable. Assumption 4.2 is needed for output feedback control.
The objective here is to find a stabilizing feedback control law that avoids
actuator saturation and achieves global asymptotic stabilization of the system.
A model-based solution to semi-global asymptotic stabilization [63] employs a
parameterized ARE based approach to designing the low gain feedback matrix. The
design process involves solving the following parameterized ARE,
$$
A^{T}P(\gamma)A-P(\gamma)+\gamma I-A^{T}P(\gamma)B\big(B^{T}P(\gamma)B+I\big)^{-1}B^{T}P(\gamma)A=0, \qquad (4.3)
$$
where γ ∈ (0, 1] is the low gain parameter. Based on the solution of the ARE (4.3),
a family of low gain feedback laws can be constructed as

uk = K ∗ (γ )xk , (4.4)

where
$$
K^{*}(\gamma)=-\big(B^{T}P^{*}(\gamma)B+I\big)^{-1}B^{T}P^{*}(\gamma)A
$$
and P ∗ (γ ) is the unique positive definite solution to the ARE and the value of γ ∈
(0, 1] is appropriately chosen so that actuator saturation is avoided for any initial
condition inside the a priori given bounded set. As will be shown in later sections,
the appropriate value of γ is determined in an iterative manner.
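When the model is available, K*(γ) can be computed directly from the parameterized ARE (4.3); a minimal sketch using SciPy's discrete-time ARE solver (with Q = γI and R = I) is given below.

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def low_gain_feedback(A, B, gamma):
    """Solve the parameterized ARE (4.3) and form the low gain feedback (4.4)."""
    n, m = B.shape
    P = solve_discrete_are(A, B, gamma * np.eye(n), np.eye(m))
    K = -np.linalg.solve(B.T @ P @ B + np.eye(m), B.T @ P @ A)
    return K, P
```

The value of γ is then decreased until the resulting control stays within the saturation limit along the closed-loop trajectory.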
We recall the following lemma from [63].
Lemma 4.1 Let Assumption 4.1 hold. Then, for each γ ∈ (0, 1], there exists a
unique positive definite matrix P ∗ (γ ) that solves the ARE (4.3). Moreover, such a
P ∗ (γ ) satisfies
1. limγ →0 P ∗ (γ ) = 0,
2. There exists a γ* ∈ (0, 1] such that
$$
\left\|P^{*\frac{1}{2}}(\gamma)\,A\,P^{*-\frac{1}{2}}(\gamma)\right\|\le\sqrt{2},\quad\gamma\in(0,\gamma^{*}].
$$
In [63], it has been shown that the family of low gain feedback control laws (4.4)
achieve semi-global exponential stabilization of system (4.1) by choosing the value
of the low gain parameter γ sufficiently small to avoid actuator saturation. We recall
from [63] the following result.
Theorem 4.1 Consider the system (4.1). Under Assumption 4.1, for any a priori
given (arbitrarily large) bounded set W, there exists a γ ∗ ∈ (0, 1] such that for any
γ ∈ (0, γ ∗ ], the low gain feedback control law (4.4) renders the closed-loop system
exponentially stable at the origin with W contained in the domain of attraction.
Moreover, for any initial condition in W, actuator saturation does not occur.

Remark 4.1 The upper bound on value of the low gain parameter γ depends on the
a priori given set of initial conditions as well as the actuator saturation limit b. The
learning based algorithms we are to develop in this chapter schedule the value of the
low gain parameter, and hence the feedback gain, as a function of the state of the
system to achieve global asymptotic stabilization.
4.3.1 Model-Based Iterative Algorithms

The parameterized ARE (4.3) is an LQR ARE. Thus, solving it amounts to solving
the optimization problem

 
$$
V^{*}(x_k)=\min_{u}\sum_{i=k}^{\infty}\big(\gamma x_i^{T}x_i+u_i^{T}u_i\big), \qquad (4.5)
$$

subject to the dynamics (4.1), where the LQR weighting matrices are Q = γ I
and R = I . The parameterized ARE possesses the characteristics of the LQR
ARE. In particular, its nonlinear characteristic makes it difficult to be solved
analytically. Along the lines of Chap. 2, we will present iterative algorithms to solve
the parameterized ARE to compute the low gain feedback gain K ∗ (γ ) for a given
value of γ ∈ (0, 1]. The Bellman equation corresponding to the problem (4.5) is
given by

V (xk ) = γ xkT xk + uTk uk + V (xk+1 ), (4.6)

which can be expressed in terms of the quadratic value function V (xk ) = xkT P (γ )xk
as

xkT P (γ )xk = γ xkT xk + uTk uk + xk+1


T
P (γ )xk+1 .

The state feedback policy in this case is uk = K(γ )xk , which gives us

$$
x_k^{T}P(\gamma)x_k=\gamma x_k^{T}x_k+x_k^{T}K^{T}(\gamma)K(\gamma)x_k
+x_k^{T}\big(A+BK(\gamma)\big)^{T}P(\gamma)\big(A+BK(\gamma)\big)x_k,
$$
which, in turn, leads to the following Lyapunov equation,
$$
\big(A+BK(\gamma)\big)^{T}P(\gamma)\big(A+BK(\gamma)\big)-P(\gamma)+\gamma I+K^{T}(\gamma)K(\gamma)=0. \qquad (4.7)
$$

That is, the Bellman equation for the low gain feedback design corresponds to a
parameterized Lyapunov equation that is parameterized in the low gain parameter
γ . For a given the value of the low gain parameter, we can run iterations on the
parameterized Lyapunov equation in the same way as on the Bellman equation in
Chap. 2. In such a case, the Lyapunov iterations would converge to the solution
of the parameterized ARE. It is worth pointing out that, a suitable choice of the
value of the low gain parameter γ is implicit in the parameterized Lyapunov
equation (4.7) to ensure the resulting control policy satisfies the saturation avoidance
condition. A parameterized version of the policy iteration Algorithm 1.3 is presented
in Algorithm 4.1.
Algorithm 4.1 Discrete-time low gain feedback policy iteration algorithm
input: γ ∈ (0, γ*] and system dynamics (A, B)
output: P*(γ) and K*(γ)
1: initialize. Select an admissible policy K^{0} such that A + BK^{0} is Schur stable. Set j ← 0.
2: repeat
3: policy evaluation. Solve the following Lyapunov equation for P^{j}(γ),
$$
\big(A+BK^{j}(\gamma)\big)^{T}P^{j}(\gamma)\big(A+BK^{j}(\gamma)\big)-P^{j}(\gamma)+Q+\big(K^{j}(\gamma)\big)^{T}K^{j}(\gamma)=0.
$$
4: policy update. Find an improved policy as
$$
K^{j+1}=-\big(I+B^{T}P^{j}(\gamma)B\big)^{-1}B^{T}P^{j}(\gamma)A.
$$
5: j ← j + 1
6: until ‖P^{j}(γ) − P^{j−1}(γ)‖ < ε for some small ε > 0.
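A model-based sketch of Algorithm 4.1 is given below; it assumes a stabilizing K0 and uses SciPy's discrete Lyapunov solver for the policy-evaluation step.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def low_gain_policy_iteration(A, B, K0, gamma, tol=1e-9, max_iter=200):
    """Iterate Steps 3-4 of Algorithm 4.1 until P^j(gamma) converges."""
    n, m = B.shape
    K, P_prev = K0, np.zeros((n, n))
    for _ in range(max_iter):
        Ac = A + B @ K
        # (A+BK)^T P (A+BK) - P + gamma*I + K^T K = 0
        P = solve_discrete_lyapunov(Ac.T, gamma * np.eye(n) + K.T @ K)
        K = -np.linalg.solve(np.eye(m) + B.T @ P @ B, B.T @ P @ A)
        if np.linalg.norm(P - P_prev) < tol:
            break
        P_prev = P
    return P, K
```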

Algorithm 4.1 provides an iterative procedure to obtain a low gain feedback


control policy K ∗ (γ ) without explicitly solving the parameterized ARE (4.3).
As a PI algorithm for an LQR ARE, Algorithm 4.1 solves a Lyapunov equation
that is parameterized in γ and is linear in the unknown matrix P (γ ). Also, as a
PI algorithm, Algorithm 4.1 needs to be initialized with a stabilizing policy. An
important point to note is that the stabilizing initial policy K 0 does not need to
satisfy the saturation avoidance condition but only needs to be chosen such that
A + BK 0 is Schur stable. It is the appropriate choice of γ that results in the
convergence to an appropriate low gain feedback control policy that satisfies the
saturation avoidance condition. An appropriate value of γ ∈ (0, γ ∗ ) always exists
under Assumption 4.1, which ensures that the constraints are satisfied [63]. The
algorithm has the same convergence characteristics as that of Algorithm 1.3.
While Algorithm 4.1 provides a simpler alternative to solving the parameterized
ARE (4.3), it still requires a stabilizing initial policy, independent of what value of
the low gain parameter γ is. To obviate this requirement, we can extend the LQR
value iteration algorithm, Algorithm 1.4, to perform recursions on the parameterized
ARE through the value function matrix P (γ ). Compared to its policy iteration
counterpart, Algorithm 4.2 only performs recursions, which are computationally
faster than solving the parameterized Lyapunov equation. By eliminating the need
of solving a Lyapunov equation, the requirement of a stabilizing initial policy is thus
circumvented. The convergence properties of this algorithm remain the same as that
of Algorithm 1.4.
The algorithms we have discussed in this subsection provide a computationally
feasible way of finding the solution of the parameterized ARE (4.3) and the low gain
feedback control law (4.4). The iterative algorithms build upon the standard model-
based policy iteration and value iteration algorithms used to solve the LQR problem
as introduced in Chap. 1. Consequently, they exploit the fixed-point property of the
parameterized Bellman and ARE equations to obtain successive approximations of
Algorithm 4.2 Discrete-time low gain feedback value iteration algorithm
input: γ ∈ (0, γ*] and system dynamics (A, B)
output: P*(γ) and K*(γ)
1: initialize. Select an arbitrary policy K^{0} and a value function P^{0} > 0. Set j ← 0.
2: repeat
3: value update. Compute P^{j+1} as
$$
P^{j+1}(\gamma)=\big(A+BK^{j}(\gamma)\big)^{T}P^{j}(\gamma)\big(A+BK^{j}(\gamma)\big)+Q+\big(K^{j}(\gamma)\big)^{T}K^{j}(\gamma).
$$
4: policy improvement. Find an improved policy as
$$
K^{j+1}=-\big(R+B^{T}P^{j+1}(\gamma)B\big)^{-1}B^{T}P^{j+1}(\gamma)A.
$$
5: j ← j + 1
6: until ‖P^{j} − P^{j−1}‖ < ε for some small ε > 0.

the policy leading to the low gain feedback policy. The limitation of these iterative
algorithms is that they not only require the knowledge of the system dynamics but
also apply only to an appropriately chosen value of the low gain parameter γ . This
low gain parameter plays a crucial role in satisfying the control constraints.
Motivated by the results in Chap. 2, we will now extend the Q-learning algo-
rithms to not only cater for the problem of unknown dynamics but also learn
an appropriate value of the low gain parameter so that the control constraint is
satisfied. It is also worth pointing out that the model-based techniques for solving
the parameterized ARE result in semi-global asymptotic stabilization in the sense
that the domain of attraction of the resulting closed-loop system can be enlarged
to enclose any a priori given, arbitrarily large, bounded set of the state space by
choosing the value of the low gain parameter sufficiently small. In the following
subsection it will be shown that the learning algorithms are able to adjust the value
of the low gain parameter to achieve global asymptotic stabilization.

4.3.2 Q-learning Based Global Asymptotic Stabilization Using State Feedback

In this subsection, a Q-learning technique employing a low gain scheduling


mechanism will be developed to learn the control law (4.4) without solving the
parameterized ARE. The technique will allow us to solve the problem of global
asymptotic stabilization without requiring the knowledge of the system dynamics.
We will first formulate this constrained control problem as an optimal control
problem so that we can apply the Bellman optimality principle that plays an
instrumental role in reinforcement learning control. To this end, we define a
quadratic utility function,
r(xk , uk , γ ) = γ xkT xk + uTk uk , (4.8)

which is the utility function used to solve the linear quadratic regulation (LQR)
problem with the user-defined performance matrices chosen as Q = γ I and R = I .
Next, we define the following cost function starting from state xk at time k that we
would like to minimize,


V (xk ) = r(xi , ui , γ ). (4.9)
i=k

Under a stabilizing feedback control policy uk = K(γ )xk , the total cost incurred
when starting with any state xk is quadratic in the state as given by

VK (xk ) = xkT P (γ )xk , (4.10)

for some positive definite matrix P (γ ). Motivated by the Bellman optimality


principle, Equation (4.9) can be written recursively as

VK (xk ) = r(xk , Kxk , γ ) + VK (xk+1 ), (4.11)

where VK (xk+1 ) is the cost of following policy K(γ ) in all future states. We now
use (4.11) to define a Q-function as

QK (xk , uk , γ ) = r(xk , uk , γ ) + VK (xk+1 ), (4.12)

which is the sum of the one-step cost of executing an arbitrary control uk at time
index k together with the cost of executing policy K(γ ) from time index k + 1 on.
The low gain parameterized Q-function can be expressed as

$$
Q_{K}(x_k,u_k,\gamma)=\gamma x_k^{T}x_k+u_k^{T}u_k+x_{k+1}^{T}P(\gamma)x_{k+1}
=\gamma x_k^{T}x_k+u_k^{T}u_k+(Ax_k+Bu_k)^{T}P(\gamma)(Ax_k+Bu_k), \qquad (4.13)
$$

or, equivalently, as

$$
Q_{K}(z_k,\gamma)=z_k^{T}H(\gamma)z_k, \qquad (4.14)
$$
where $z_k=\begin{bmatrix}x_k^{T} & u_k^{T}\end{bmatrix}^{T}$ and the matrix H(γ) is defined as
$$
H(\gamma)=\begin{bmatrix}H_{xx} & H_{xu}\\ H_{ux} & H_{uu}\end{bmatrix}
=\begin{bmatrix}\gamma I+A^{T}P(\gamma)A & A^{T}P(\gamma)B\\ B^{T}P(\gamma)A & B^{T}P(\gamma)B+I\end{bmatrix}.
$$
The optimal stabilizing controller for a given value of the low gain parameter γ
has a cost V ∗ with the associated optimal Q-function Q∗ and the optimal Q-function
matrix H ∗ (γ ). This optimal stabilizing controller can be obtained by solving


Q∗ = 0
∂uk

for uk . That is,


 ∗ −1 ∗
uk = − Huu Hux xk .

The above control law is the same as given in (4.4), which was obtained by solving
the parameterized ARE (4.3).
We now seek to develop a Q-learning Bellman equation for our problem. Using
(4.11) and (4.12), we see that

QK (xk , Kxk , γ ) = VK (xk ), (4.15)

which can be written in a recursive form as

QK (xk , uk , γ ) = γ xkT xk + uTk uk + QK (xk+1 , Kxk+1 , γ ). (4.16)

Equation (4.16) is referred to as the parameterized Bellman Q-learning equation that


will be employed for reinforcement learning.
For our problem, estimating the solution of the Q-learning equation amounts to
estimating the parameterized Q-function matrix H (γ ). Once we have an estimate of
H (γ ), we can compute the low gain control matrix K(γ ). That is, we are to solve
the following equation for H (γ ),

zkT H (γ )zk = γ xkT xk + uTk uk + zk+1


T
H (γ )zk+1 . (4.17)

The above Bellman equation is linear in the unknown matrix H (γ ), that is, it can be
written in a linearly parameterized form by defining

QK (xk , zk , γ ) = H̄ T (γ )z̄k , (4.18)

where

$$
\bar{H}(\gamma)=\mathrm{vec}(H(\gamma))
=\begin{bmatrix}h_{11} & 2h_{12} & \cdots & 2h_{1(n+m)} & h_{22} & 2h_{23} & \cdots & 2h_{2(n+m)} & \cdots & h_{(n+m)(n+m)}\end{bmatrix}^{T}\in\mathbb{R}^{(n+m)(n+m+1)/2}
$$
is the vector that includes the components of matrix H (γ ). Since H (γ ) is


symmetric, the off-diagonal entries are included as 2hij . The regression vector
z̄k ∈ R(n+m)(n+m+1)/2 is defined by

z̄k = zk ⊗ zk ,

which is the quadratic basis set formed by the products as


$$
\bar{z}=\begin{bmatrix}z_1^{2} & z_1z_2 & \cdots & z_1z_{n+m} & z_2^{2} & z_2z_3 & \cdots & z_2z_{(n+m)} & \cdots & z_{(n+m)}^{2}\end{bmatrix}^{T}.
$$

This yields the following equation,

H̄ T (γ )z̄k = γ xkT xk + uTk uk + H̄ T (γ )z̄k+1 . (4.19)
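The regression vector z̄_k can be generated with a small helper. The sketch below lists each product z_i z_j (i ≤ j) once, which is consistent with the factor 2 being absorbed into the off-diagonal entries of H̄(γ) in (4.18).

```python
import numpy as np

def quad_basis(z):
    """Quadratic basis z_bar formed from z = [x^T u^T]^T for use in (4.19)."""
    z = np.asarray(z).ravel()
    return np.array([z[i] * z[j] for i in range(z.size) for j in range(i, z.size)])
```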

Based on (4.19), we can utilize the Q-learning technique to learn the Q-function
matrix. The Bellman equation (4.17) is parameterized in the low gain parameter γ .
This parameter can be tuned and the utility function (4.8) can be updated such that
the resulting control law does not saturate the actuator. We are now ready to present
iterative Q-learning algorithms to find a control law that globally asymptotically
stabilizes the system. Both policy iteration and value iteration based Q-learning
algorithms will be presented.
Algorithms 4.3 and 4.4 employ the Q-learning based policy iteration and
value iteration techniques for finding a low gain feedback control law that avoids
saturation and achieves global asymptotic stabilization. In both the algorithms, we
begin by selecting a value of the low gain parameter γ ∈ (0, 1]. Different from
the unconstrained algorithms, we apply an open-loop control, which contains an
exploration signal that satisfies the saturation avoidance condition, to collect online
data. This initial control is referred to as the behavioral policy and is used to collect
system data to be used in Step 2. Note that, in the case of the PI algorithm, the
initial policy K 0 is used to compute the prediction term uk+1 = K 0 xk+1 on the
right-hand side of the Bellman equation and is chosen such that A + BK 0 is Schur
stable. However, it does not have to satisfy the control constraint as this policy is not
a behavioral policy and, therefore, will not be applied to the system to collect data.
The value iteration algorithm, Algorithm 4.4, uplifts this stabilizing initial policy
requirement but takes more iterations to converge as has been seen with other value
iteration algorithms studied in the previous chapters. In both the algorithms, the data
collection is performed only once and we use the same dataset repeatedly to learn
all future control policies. In each iteration, we use the collected data to solve the
Bellman Q-learning equation and then update the policy. The iterations eventually
converge to an optimal policy K ∗ (γ ) for a given γ . The key step in these constrained
control learning algorithms is the control constraint check, in which we check the
newly learned K ∗ (γ ) for the avoidance of actuator saturation. When the control
constraint is satisfied, we apply uk = K ∗ (γ )xk to the system, otherwise we reduce γ
and carry out Steps 3 to 7 with the updated cost function. We employ a proportional
Algorithm 4.3 Q-learning policy iteration algorithm for global asymptotic stabi-
lization by state feedback
input: input-state data
output: H ∗ (γ ) and K ∗ (γ )
1: initialize. Select an admissible policy K 0 such that A + BK 0 is Schur stable. Pick a γ ∈ (0, 1]
and set j ← 0.
2: collect online data. Apply an open-loop control uk = νk with νk being the exploration signal
and uk satisfying the control constraint

$\|u_k\|_{\infty}\le b.$

Collect L datasets of (xk , uk ) for k ∈ [0, L − 1], where

L ≥ (n + m)(n + m + 1)/2.

3: repeat
4: policy evaluation. Solve the following Bellman equation for H j (γ ),

zkT H j (γ )zk = γ xkT xk + uTk uk + +zk+1


T
H j (γ )zk+1 .

5: policy update. Find an improved policy as


$$
K^{j+1}(\gamma)=-\big(H_{uu}^{j}\big)^{-1}H_{ux}^{j}.
$$

6: j ← j + 1
7: until ‖H^{j}(γ) − H^{j−1}(γ)‖ < ε for some small ε > 0.
8: control saturation check. For each k = L, L + 1, · · · , check the following control constraint,

 
$$
\big\|K^{j}(\gamma)x_k\big\|_{\infty}\le b.
$$
If for any k = L, L + 1, · · · , the control constraint is violated, reduce γ, set K^{0} ← K^{j}, reset j ← 0, and carry out Steps 3 to 7 with the updated value of the low gain parameter.

rule γi+1 = αγi for 0 < α < 1 to update the value of the low gain parameter. As
can be seen, the control constraint is checked prior to being applied to the system
in order to avoid actuator saturation. The control constraint is checked for all future
times. It should be noted that the low gain scheduling mechanism ensures that γ
will be small enough in a finite number of iterations in Step 8 at every time index.
Note that the L data samples refer to different points in time, k, which are used to form the data matrices $\Phi\in\mathbb{R}^{((n+m)(n+m+1)/2)\times L}$ and $\Upsilon\in\mathbb{R}^{L\times 1}$. For the case of Algorithm 4.3, these data matrices are defined as
$$
\Phi=\begin{bmatrix}\bar{z}_{k}^{1}-\bar{z}_{k+1}^{1} & \bar{z}_{k}^{2}-\bar{z}_{k+1}^{2} & \cdots & \bar{z}_{k}^{L}-\bar{z}_{k+1}^{L}\end{bmatrix},
$$
Algorithm 4.4 Q-learning value iteration algorithm for global asymptotic stabiliza-
tion by state feedback
input: input-state data
output: H ∗ and K ∗
1: initialize. Select an arbitrary policy H 0 (γ ) ≥ 0. Pick a γ ∈ (0, 1] and set j ← 0.
2: collect online data. Apply an open-loop control uk = νk with νk being the exploration signal
and uk satisfying the control constraint

$\|u_k\|_{\infty}\le b.$

Collect L datasets of (xk , uk ) for k ∈ [0, L − 1], where

L ≥ (n + m)(n + m + 1)/2.

3: repeat
4: policy evaluation. Solve the following Bellman equation for H j +1 (γ ),

$$
z_k^{T}H^{j+1}(\gamma)z_k=\gamma x_k^{T}x_k+u_k^{T}u_k+z_{k+1}^{T}H^{j}(\gamma)z_{k+1}.
$$

5: policy update. Find an improved policy as



j +1 −1 j +1
K j +1 (γ ) = − Huu Hux .

6: j←  j +1 
7: until H j (γ ) − H j −1 (γ ) < ε for some small ε > 0.
8: control saturation check. For each k = L, L + 1, · · · , check the following saturation
constraint,
 
 j 
K (γ )xk  ≤ b.

If for any k = L, L + 1, · · · , the saturation constraint is violated, reduce γ , H 0 ← H j , reset


j ← 0, and carry out Steps 3 to 7 with the updated value of the low gain parameter.

 T
ϒ = r1 r2 · · · rL ,

whereas for Algorithm 4.4, they are given by


 
 = z̄k1 z̄k2 · · · z̄kL ,
   1  T 2  T L T
ϒ = r 1 + H̄ j (γ ) T z̄k+1 r 2 + H̄ j (γ ) z̄k+1 · · · r L + H̄ j (γ ) z̄k+1 .
4.3 Global Asymptotic Stabilization of Discrete-Time Systems 175

Then, the least-squares solution of the Bellman equation is given by


−1
H̄ (γ ) = T ϒ. (4.20)

Notice that the control uk = K j (γ )xk is linearly dependent on xk , which means that
T will not be invertible. To overcome this issue, one adds excitation noise νk in
uk , which guarantees a unique solution to (4.20). In other words, the following rank
condition needs to be satisfied,

rank() = (n + m)(n + m + 1)/2. (4.21)

This excitation condition can be met in several ways such as adding sinusoidal noise
of various frequencies, exponentially decaying noise and gaussian noise. A detailed
discussion on such persistent excitation condition can be found in adaptive control
literature [116].
Theorem 4.2 states the convergence of Algorithms 4.3 and 4.4.
Theorem 4.2 Consider system (4.1). Under Assumption 4.1 and the rank condition
(4.21), the iterative Q-learning algorithms, Algorithms 4.3 and 4.4 globally asymp-
totically stabilize the system at the origin.
Proof The proof is carried out in two steps, which correspond, respectively, to
the iteration on the low gain parameter γ and the value iteration on the matrix
H (γ ). With respect to the iterations on γ , denoted as γi , we have γi < γi−1 with
γi ∈ (0, 1]. By Lemma 4.1, there exists a unique P ∗ (γi ) > 0 that satisfies the
parameterized ARE (4.3) and, by definition (4.14), there exists a unique H ∗ (γi ).
By Theorem 4.1, for any given initial condition, there exists a γ ∗ ∈ (0, 1] such
that, for all γi ∈ (0, γ ∗ ], the closed-loop system is exponentially stable. As i
increases, we have γi ∈ (0, γ ∗ ]. Note that, by Lemma 4.1, the value of the low
gain parameter can be made sufficiently small to accommodate any initial condition.
Under the stabilizability assumption on (A, B) and rank condition (4.21), the policy
iteration (in Algorithm 4.3) and value iteration (in Algorithm 4.4) steps converge to
the optimal H ∗ [13, 53]. Therefore, H j (γi ) converges to H ∗ (γi ), γi ∈ (0, γ ∗ ], as
j → ∞. Finally, we have the convergence of K j (γi ) to K ∗ (γi ), γi ∈ (0, γ ∗ ]. This
completes the proof.

Remark 4.2 The presented learning scheme does not seek to find γ ∗ , rather it
searches for a γ ∈ (0, γ ∗ ] that suffices to ensure closed-loop stability without
saturating the actuator. However, better closed-loop performance could be obtained
if the final γ is closer to γ ∗ , which can be achieved by applying smaller decrements
to γ in each iteration at the expense of more iterations.
176 4 Model-Free Stabilization in the Presence of Actuator Saturation

4.3.3 Q-learning Based Global Asymptotic Stabilization by


Output Feedback

The results in the previous subsections pertain to the full state feedback. Both
model-based and model-free solutions make use of the feedback of the full state
xk in the low gain feedback laws. This subsection builds upon the output feedback
techniques that we have developed in Chap. 2 to extend the state feedback results
in the previous subsection for designing a low gain based output feedback Q-
learning scheme. We recall from Chap. 2 that a key ingredient in achieving output
feedback Q-learning is the parameterization of the state in terms of the input-output
measurements. As will be seen next, the state parameterization result remains an
essential tool that allows us to develop the output feedback learning equations to
learn the low gain based output feedback controller that achieves global asymptotic
stabilization.
More specifically, we will extend Algorithms 4.3 and 4.4 to the output feedback
case by the method of state parameterization. To this end, we present the constrained
control counterpart of state parameterization result in Theorem 2.2.
Theorem 4.3 Consider system (4.1). Let Assumption 4.2 hold. Then, the system
state can be represented in terms of the constrained input and measured output as

xk = Wy ȳk−1,k−N + Wu σ̄ (uk−1,k−N ), k ≥ N, (4.22)

where N ≤ n is an upper bound on the observability index of the system,


σ̄ (uk−1,k−N ) ∈ RmN and ȳk−1,k−N ∈ RpN are the past constrained input and
measured output data vectors defined as
      T
σ̄ (uk−1,k−N ) = σ uTk−1 σ uTk−2 · · · σ uTk−N ,
 T
ȳk−1,k−N = yk−1
T T
yk−2 · · · yk−N
T ,

and the parameterization matrices take the special form


−1
Wy = AN VNT VN VNT ,
−1
Wu = UN − AN VNT VN VNT TN ,

with
 T T
VN = CAN −1 · · · (CA)T C T ,
 
UN = B AB · · · AN −1 B ,
4.3 Global Asymptotic Stabilization of Discrete-Time Systems 177

⎡ ⎤
0 CB CAB · · · CAN −2 B
⎢0 0 CB · · · CAN −3 B ⎥
⎢ ⎥
⎢ ⎥
TN = ⎢ ... ... .. ..
. .
..
. ⎥.
⎢ ⎥
⎣0 0 ··· 0 CB ⎦
0 0 0 0 0

Proof View the constrained control signal σ (uk ) as a new input signal and the result
follows directly from the result of Theorem 2.2.
We would like to use the above state parameterization to constitute a state-free Q-
function. We can write the expression in (4.22) as
 
  σ (ūk−1,k−N )
xk = Wu Wy . (4.23)
ȳk−1,k−N

Then using (4.23) in (4.14) yields


⎡ ⎤T ⎡ ⎤⎡ ⎤
σ (ūk−1,k−N ) Hūū Hūȳ Hūu σ (ūk−1,k−N )
QK (ζk , γ ) = ⎣ ȳk−1,k−N ⎦ ⎣Hȳ ū Hȳ ȳ Hȳu ⎦⎣ ȳk−1,k−N ⎦
uk Huū Huȳ Huu uk

= ζkT Hζk , (4.24)

where
 T
ζk = σ T (ūk−1,k−N ) ȳk−1,k−N
T uTk ,

H = HT ∈ R(mN +pN +m)×(mN +pN +m) ,

and the submatrices are given as


 
Hūū = WuT γ I + AT P (γ )A Wu ∈ RmN ×mN ,
 
Hūȳ = WuT γ I + AT P (γ )A Wy ∈ RmN ×pN ,
Hūu = WuT AT P (γ )B ∈ RmN ×m ,
  (4.25)
Hȳ ȳ = WyT γ I + AT P (γ )A Wy ∈ RpN ×pN ,
Hȳu = WyT AT P (γ )B ∈ RpN ×m ,
Huu = R + B T P (γ )B ∈ Rm×m .

Let H∗ be the optimal matrix corresponding to P ∗ (γ ) for a given value of the


low gain parameter γ . Then the optimal output feedback controller is given by
178 4 Model-Free Stabilization in the Presence of Actuator Saturation

 −1 
u∗k = − H∗uu H∗uū σ (ūk−1,k−N ) + H∗uȳ ȳk−1,k−N
 
σ (ūk−1,k−N )
= K∗ (γ ) . (4.26)
ȳk−1,k−N

We now proceed to develop an output feedback Q-learning equation, which


enables us to learn the output feedback controller online. Using the output feedback
Q-function (4.24) in the Q-learning equation (4.16) results in

ζkT H(γ )ζk = γ ykT yk + uTk uk + ζk+1


T
H(γ )ζk+1 , (4.27)

which can be linearly parameterized as

T
QK = H̄ (γ )ζ̄k , (4.28)

where H̄(γ ) = vec(H(γ )) and



ζ̄k = ζ12 ζ1 ζ2 · · · ζ1 ζmN +pN +m ζ22 ζ2 ζ3 · · ·
T
ζ2 ζmN +pN +m · · · ζmN
2
+pN +m .

With the above parameterization, we have the following linear equation with
unknowns in H̄,
T T
H̄ (γ )ζ̄k = γ ykT yk + uTk uk + H̄ (γ )ζ̄k+1 . (4.29)

Based on (4.29), we can utilize the Q-learning technique to learn the output feedback
Q-function matrix H. We now present the policy iteration and the value iteration
based Q-learning algorithms, Algorithms 4.5 and 4.6 that globally asymptotically
stabilize the system.
The output feedback Q-function matrix H has (mN + pN + m)(mN + pN +
m + 1)/2 unknowns. The data matrices  ∈ R((mN+pN +m)(mN +pN +m+1)/2)×L and
ϒ ∈ RL×1 used in Algorithm 4.5 are given by
 
 = ζ̄k1 − ζ̄k+1
1 ζ̄k2 − ζ̄k+1
2 · · · ζ̄kL − ζ̄k+1
L ,
 T
ϒ = r1 r2 · · · rL ,

whereas, for Algorithm 4.6, they are defined as


 
 = ζ̄k1 ζ̄k2 · · · ζ̄kL ,
4.3 Global Asymptotic Stabilization of Discrete-Time Systems 179

Algorithm 4.5 Q-learning Policy iteration algorithm for global asymptotic stabi-
lization by output feedback
input: input-output data
output: H∗ (γ ) and K∗ (γ )
1: initialize. Select an admissible policy K0 such that A + BK 0 is Schur stable, where K0 =
K 0 W . Set γ ← 1 and j ← 0.
2: collect online data. Apply an open-loop control uk = νk with νk being the exploration signal
and uk satisfying the control constraint

uk  ≤ b.

Collect L datasets of (xk , uk ) for k ∈ [0, L − 1], where

L ≥ (mN + pN + m)(mN + pN + m + 1)/2.

3: repeat
4: policy evaluation. Solve the following Bellman equation for H j (γ ),

ζkT Hj (γ )ζk = γ ykT yk + uTk uk + +ζk+1


T
Hj (γ )ζk+1 .

5: policy update. Find an improved policy as


−1  
Kj +1 (γ ) = − Hjuu j j
Huū Huȳ .

6: j←  j +1 
 
7: until Hj (γ ) − Hj −1 (γ ) < ε for some small ε > 0.
8: control saturation check. For each k = L, L + 1, · · · , check the following control constraint,

  T 
 j 
K (γ ) σ T (ūk−1,k−N ) ȳ T  ≤ b.
 k−1,k−N 

If for any k = L, L + 1, · · · , the saturation condition is violated, reduce γ , K0 ← Kj , reset


j ← 0, and carry out Steps 3 to 7 with the updated value of the low gain parameter.

 T T T T
j j j
ϒ= r1 + H̄ (γ ) 1
ζ̄k+1 r2 + H̄ (γ ) 2
ζ̄k+1 ··· rL + H̄ (γ ) L
ζ̄k+1 .

Then the least-squares solution of the output feedback Bellman equation is given by

j
−1
H̄ (γ ) = T ϒ. (4.30)

As was the case in the output feedback algorithms studied in Chap. 2, an excitation
signal vk is added in the control input during the learning phase to satisfy the
following rank condition,
180 4 Model-Free Stabilization in the Presence of Actuator Saturation

Algorithm 4.6 Q-learning value iteration algorithm for global asymptotic stabiliza-
tion by output feedback
input: input-output data
output: H∗ and K∗
1: initialize. Select an arbitrary matrix H0 ≥ 0. Set γ ← 1 and j ← 0.
2: collect online data. Apply an open-loop control uk = νk with νk being the exploration signal
and uk satisfying the control constraint (4.2). Collect L datasets of (xk , uk ) for k ∈ [0, L − 1]
along with their quadratic terms, where

L ≥ (n + m)(n + m + 1)/2.

3: repeat
4: policy evaluation. Solve the following Bellman equation for H j +1 (γ ),

ζkT Hj +1 (γ )zk = γ ykT yk + uTk uk + ζk+1


T
Hj (γ )ζk+1 .

5: policy update. Find a greedy policy by


−1  
j +1 j +1
Kj +1 (γ ) = − Hjuu+1 Huū Huȳ .

6: j←  j +1 
 
7: until Hj (γ ) − Hj −1 (γ ) < ε for some small ε > 0.
8: control saturation check. For each k = L, L + 1, · · · , check the following saturation
condition,
  T 
 j 
K (γ ) σ T (ūk−1, k−N ) ȳ T  ≤ b.
 k−1, k−N 

If for any k = L, L + 1, · · · , the control constraint is violated, reduce γ , H0 ← Hj , reset


j ← 0, and carry out Steps 3 to 7 with the updated value of the low gain parameter.

rank() = (mN + pN + m)(mN + pN + m + 1)/2. (4.31)

Theorem 4.4 shows the convergence of Algorithms 4.5 and 4.6.


Theorem 4.4 Consider system (4.1). Under Assumptions 4.1 and 4.2, together with
the full row rank condition of W and the rank condition (4.31), both Algorithms 4.5
and 4.6 globally asymptotically stabilize the system at the origin.
Proof Note that Steps 3 to 7 of the output feedback algorithms, Algorithms 4.5
and 4.6, are the output feedback LQR Q-learning algorithms, whose convergence
has been shown in Chap. 2 under the stated conditions. Then, following the
arguments in the proof of Theorem 4.1 for the low gain scheduling mechanism, the
output feedback algorithms, Algorithms 4.5 and 4.6, also achieve global asymptotic
stabilization of the system by finding a suitable value of the low gain parameter γ
4.3 Global Asymptotic Stabilization of Discrete-Time Systems 181

and the corresponding control matrix K(γ ) without saturating the actuators. This
completes the proof.

4.3.4 Numerical Simulation

In this subsection we present numerical simulation of the presented designs.


Consider the discrete-time system (4.1) with
⎡ ⎤
0 1 0 0
⎢0 0 1 0 ⎥
A=⎢
⎣0
⎥,
0
√ 0 √1 ⎦
−1 2 2 −4 2 2
⎡ ⎤
0
⎢0⎥
B=⎢ ⎥
⎣0⎦ ,
1
 
C= 1000 .
√ √
Matrix A has a pair of repeated eigenvalues at 22 ± j 22 , which lie on the unit
circle, and therefore, the system is open-loop unstable. The actuator saturation limit
is b = 1. Before presenting the learning based low gain feedback designs, we shall
see the result of a standard model-based feedback controller. Let
 
K = 0.9339 −2.5171 3.2847 −1.8801 .

The closed-loop eigenvalues have the magnitudes of 0.6132, 0.6132, 0.4193 and
0.4193, which are all less than 1. Hence A + BK is Schur stable. The closed-loop
response from the initial condition x0 = [1 1 1 1 ]T is shown in Fig. 4.1. As can
be seen, even though the feedback controller is chosen to be stabilizing, it violates
the control constraints and leads to instability. This motivates the design of low gain
feedback.
We first validate Algorithm 4.3 which is based on policy iteration and uses state
feedback. The initial state of the system remains as x0 = [ 1 1 1 1 ]T . The
algorithm is initialized with γ = 1 and
 
K 0 = 0.9339 −2.5171 3.2847 −1.8801 ,

with A+BK 0 being Schur stable but not necessarily satisfying the control constraint
as shown previously in Fig. 4.1. Figure 4.2 shows the state response and the control
effort. In every main iteration of the algorithm, the value of the low gain parameter
182 4 Model-Free Stabilization in the Presence of Actuator Saturation

1000
x1k
500
x2k
States

0 x3k
x4k
-500

-1000
0 10 20 30 40 50 60 70 80 90 100
time (k)

1
Control uk

-1

-2
0 10 20 30 40 50 60 70 80 90 100
time (k)

Fig. 4.1 Closed-loop response under linear state feedback control without taking actuator satura-
tion into consideration

γ is reduced by a factor of one half. In each sub-iteration under a given the value
of the low gain parameter γ , the convergence criterion of ε = 0.01 was selected on
the controller parameters. In this simulation, we collected L = 15 data samples to
satisfy the rank condition (4.21). These data samples are collected only once using
a behavioral policy comprising of sinusoidal signals of different frequencies and
magnitudes such that the control constraint is satisfied. Once these data samples are
available, we can repeatedly use this dataset to policy iterations for different values
of γ . It can be seen that the algorithm is able to find a suitable value of the low gain
parameter γ and learn the corresponding low gain state feedback control matrix
K(γ ) that guarantees convergence of the state without saturating the actuator. The
final value of the low gain parameter is γ = 2.4 × 10−4 and the corresponding low
gain feedback gain matrix is obtained by solving the ARE as
 
K ∗ (γ ) = 0.3002 −0.6717 0.6801 −0.2537 .

The convergence of the low gain matrix is shown in Fig. 4.3 and its final estimate is
 
K(γ ) = 0.3003 −0.6718 0.6802 −0.2537 ,

which shows convergence to the solution of the ARE.


We simulate again Algorithm 4.3 with a different initial condition, x0 =
[5 5 5 5]T . All other simulation parameters are kept the same. Figure 4.4 shows
the closed-loop response. The final value of γ is 6.1 × 10−5 and the corresponding
4.3 Global Asymptotic Stabilization of Discrete-Time Systems 183

20
x1k
10
x2k
States

0 x3k
x4k
-10

-20
0 10 20 30 40 50 60 70 80 90 100
time (k)

0.5
Control uk

-0.5

-1
0 10 20 30 40 50 60 70 80 90 100
time (k)

Fig. 4.2 Algorithm 4.3: Closed-loop response with x0 = [ 1 1 1 1 ]T

4
K *( J )

2

K( J )

0
0 1 2 3 4 5 6 7 8
iterations

Fig. 4.3 Algorithm 4.3: Convergence of the feedback gain matrix with x0 = [ 1 1 1 1 ]T

low gain feedback gain matrix is obtained by solving the ARE as


 
K ∗ (γ ) = 0.2222 −0.4899 0.4860 −0.1781 .

The convergence of the low gain feedback gain matrix is shown in Fig. 4.5 and its
final estimate is
 
K(γ ) = 0.2222 −0.4899 0.4860 −0.1781 .

It can be seen that even with this larger initial condition, the algorithm is able to
stabilize the system with a lower value of the low gain parameter. Therefore, these
two cases illustrate global asymptotic stabilization of the presented scheme.
184 4 Model-Free Stabilization in the Presence of Actuator Saturation

40
x1k
20
States x2k
0 x3k
x4k
-20

-40
0 10 20 30 40 50 60 70 80 90 100
time (k)

0.5
Control uk

-0.5

-1
0 10 20 30 40 50 60 70 80 90 100
time (k)

Fig. 4.4 Algorithm 4.3: Closed-loop response with x0 = [5 5 5 5]T

4
K *( J )

2

K( J )

0
0 1 2 3 4 5 6 7 8 9
iterations

Fig. 4.5 Algorithm 4.3: Convergence of the control matrix with x0 = [5 5 5 5]T

We shall now validate that the model-free value iteration algorithm, Algo-
rithm 4.4, uplifts the requirement of a stabilizing initial policy. The algorithm is
thus initialized with H 0 = I , which implies that
 
K0 = 0 0 0 0 .

All other conditions remain the same as in the simulation of Algorithm 4.3. The
closed-loop response and the convergence of the parameter estimates for x0 =
[ 1 1 1 1 ]T are shown in Figs. 4.6 and 4.7, respectively.
The final estimate of the low gain feedback gain matrix is
 
K(γ ) = 0.3025 −0.6771 0.6856 −0.2558 .
4.3 Global Asymptotic Stabilization of Discrete-Time Systems 185

20
x1k
10
States x2k
0 x3k
x4k
-10

-20
0 10 20 30 40 50 60 70 80 90 100
time (k)

0.5
Control uk

-0.5

-1
0 10 20 30 40 50 60 70 80 90 100
time (k)

Fig. 4.6 Algorithm 4.4: Closed-loop response with x0 = [ 1 1 1 1 ]T

1.5
K *( J )

1

K( J )

0.5

0
0 5 10 15 20 25
iterations

Fig. 4.7 Algorithm 4.4: Convergence of the feedback gain matrix with x0 = [ 1 1 1 1 ]T

The corresponding results for the second initial condition x0 = [5 5 5 5]T are shown
in Figs. 4.8 and 4.9 with the final estimated low gain feedback matrix being
 
K(γ ) = 0.2254 −0.4970 0.4930 −0.1807 .

Upon comparing the simulation results of Algorithms 4.3 and 4.4, we see that both
algorithms are able to arrive at an appropriate value of the low gain parameter and
the associated low gain feedback matrix. Furthermore, Algorithm 4.4 eliminates the
need of a stabilizing initial policy at the expense of more iterations.
We now validate Algorithm 4.5, which uses output feedback. We choose N = 4
as the bound on the observability index. The initial state of the system is x0 =
[ 1 1 1 1 ]T . The algorithm is initialized with γ = 1 and
186 4 Model-Free Stabilization in the Presence of Actuator Saturation

40
x1k
20
States x2k
0 x3k
x4k
-20

-40
0 10 20 30 40 50 60 70 80 90 100
time (k)

0.5
Control uk

-0.5

-1
0 10 20 30 40 50 60 70 80 90 100
time (k)

Fig. 4.8 Algorithm 4.4: Closed-loop response with x0 = [ 5 5 5 5 ]T

0.8
K *( J )

0.6

0.4

K( J )

0.2

0
0 5 10 15 20 25 30
iterations

Fig. 4.9 Algorithm 4.4: Convergence of the feedback gain matrix with x0 = [ 5 5 5 5 ]T

 
K0 = −1.4728 −1.3258 −0.1333 1.6318 2.8714 −5.5783 4.7487 −1.6318 .

Note that K0 = K 0 W is stabilizing in the sense that A + BK 0 is Schur. However,


it does not satisfy the control constraint and would eventually result in instability
as seen in Fig. 4.1 for its state feedback equivalent. Figure 4.10 shows the state
response and the control effort under Algorithm 4.5. The low gain parameter γ is
reduced by a factor of one half and the convergence criterion of ε = 0.01 is selected.
In this simulation, we collect L = 50 data samples to satisfy the rank condition
(4.31). It can be seen that the algorithm is able to find a suitable value of the low gain
parameter γ and the corresponding low gain state feedback control matrix K(γ )
that guarantees convergence of the state without saturating the actuators. The final
value of the low gain parameter is γ = 1.2 × 10−4 and the corresponding low gain
4.3 Global Asymptotic Stabilization of Discrete-Time Systems 187

100
x1k
50
x2k
States

0 x3k
x4k
-50

-100
0 50 100 150 200 250
time (k)

0.5
Control uk

-0.5

-1
0 50 100 150 200 250
time (k)

Fig. 4.10 Algorithm 4.5: Closed-loop response with x0 = [ 1 1 1 1 ]T

8
K(γ) →K∗ (γ)

0
0 1 2 3 4 5 6 7
iterations

Fig. 4.11 Algorithm 4.5: Convergence of the feedback gain matrix with x0 = [ 1 1 1 1 ]T

feedback gain matrix is obtained by solving the ARE as



K∗ (γ ) = −0.1495 −0.0123 0.1460 0.2298 0.1804

−0.4937 0.5038 −0.2298 .

The convergence of the low gain feedback gain matrix is shown in Fig. 4.11 and its
final estimate is
 
K(γ ) = −0.1496 −0.0123 0.1462 0.2300 0.1805 −0.4941 0.5043 −0.2300 ,

which shows convergence to value obtained by solving the ARE.


188 4 Model-Free Stabilization in the Presence of Actuator Saturation

200
x1k
100
States x2k
0 x3k
x4k
-100

-200
0 50 100 150 200 250
time (k)

0.5
Control uk

-0.5

-1
0 50 100 150 200 250
time (k)

Fig. 4.12 Algorithm 4.5: Closed-loop response with x0 = [5 5 5 5]T

8
K(γ) →K∗ (γ)

0
0 1 2 3 4 5 6 7 8 9
iterations

Fig. 4.13 Algorithm 4.5: Convergence of the feedback gain matrix with x0 = [ 5 5 5 5 ]T

We next simulate again Algorithm 4.5 with a different initial condition, x0 =


[ 5 5 5 5 ]T . All other simulation parameters are kept the same. Figure 4.12 shows
the closed-loop response. The final value of the low gain parameter is γ = 1.5 ×
10−5 and the corresponding low gain feedback gain matrix is obtained by solving
the ARE as

K∗ (γ ) = −0.0892 −0.0042 0.0885 0.1332 0.1002

−0.2784 0.2882 −0.1332 .

The convergence of the low gain feedback gain matrix is shown in Fig. 4.13 and its
final estimate is
 
K(γ ) = −0.0904 −0.0042 0.0897 0.1351 0.1016 −0.2822 0.2923 −0.1351 .
4.3 Global Asymptotic Stabilization of Discrete-Time Systems 189

100
x1k
50
States x2k
0 x3k
x4k
-50

-100
0 50 100 150 200 250
time (k)

0.5
Control uk

-0.5

-1
0 50 100 150 200 250
time (k)

Fig. 4.14 Algorithm 4.6: Closed-loop response with x0 = [ 1 1 1 1 ]T

It can be seen that even with this larger initial condition, the algorithm is able to
stabilize the system with a lower value of the low gain parameter. Also, it is worth
noting that the output feedback Algorithm 4.5 does not require full state feedback
but it needs more data samples owing to the larger number of unknown parameters
to be learned. As a result, the learning phase of Algorithm 4.5 is longer and results
in a smaller value of the low gain parameter as compared to the state feedback
Algorithm 4.3.
We shall now validate that the output feedback model-free value iteration
algorithm, Algorithm 4.6, uplifts the requirement of a stabilizing initial policy. The
algorithm is thus initialized with H0 = I , which implies that
 
K0 (γ ) = 0 0 0 0 0 0 0 0 .

All other conditions remain the same as in the simulation of Algorithm 4.5. The
closed-loop response and the convergence of the parameter estimates for x0 =
[1 1 1 1]T are shown in Figs. 4.14 and 4.15, respectively. The final estimate
of the low gain feedback gain matrix is

K(γ ) = −0.1515 −0.0126 0.1477 0.2325 0.1826

−0.4996 0.5099 −0.2325 .
190 4 Model-Free Stabilization in the Presence of Actuator Saturation

K(γ) →K∗ (γ)


0.5

0
0 5 10 15 20 25 30 35 40
iterations

Fig. 4.15 Algorithm 4.6: Convergence of the feedback gain matrix with x0 = [ 1 1 1 1 ]T

200
x1k
100 x2k
x3k
States

0
x4k
-100

-200
0 50 100 150 200 250
time (k)

0.5
Control uk

-0.5

-1
0 50 100 150 200 250
time (k)

Fig. 4.16 Algorithm 4.6: Closed-loop response with x0 = [ 5 5 5 5 ]T

The corresponding results for the second initial condition x0 = [5 5 5 5]T are
shown in Figs. 4.16 and 4.17, with the final estimated low gain feedback gain matrix
being

K(γ ) = −0.0904 −0.0042 0.0897 0.1351 0.1016

−0.2822 0.2923 −0.1351 .
4.4 Global Asymptotic Stabilization of Continuous-Time Systems 191

0.6

K(γ) →K∗ (γ)


0.4

0.2

0
0 10 20 30 40 50 60
iterations

Fig. 4.17 Algorithm 4.6: Convergence of the feedback gain matrix with x0 = [ 5 5 5 5 ]T

4.4 Global Asymptotic Stabilization of Continuous-Time


Systems

In this section, we will address the problem of global asymptotic stabilization for
continuous-time linear systems subject to actuator saturation. The results presented
in this section are the continuous-time counterparts of the discrete-time results
presented in Sect. 4.3. The presented algorithms build upon the learning techniques
for continuous-time systems that were developed in Chap. 2 and incorporate a low
gain scheduling mechanism. In particular, we will first introduce the model-based
iterative techniques for designing a low gain feedback control law for continuous-
time systems. Then, we will present model-free techniques to learn the low gain
feedback control law. State feedback algorithms will be developed first and then
extended to arrive at output feedback learning algorithms.
Consider a continuous-time linear system subject to actuator saturation,

ẋ(t) = Ax(t) + Bσ (u(t)),


(4.32)
y(t) = Cx(t),

where x = [x1 x2 · · · xn ]T ∈ Rn is the state, u = [u1 u2 · · · um ]T ∈ Rm is the


 T
input and y = y1 y2 · · · yp ∈ Rp is the output. Without loss of generality,
we assume that σ : R → R is a standard saturation function, that is, for i =
m m

1, 2, · · · , m,


⎨−b if ui < −b,

σ (ui ) = ui if − b ≤ ui ≤ b, (4.33)


⎩b if u > b, i

where b denotes the actuator limit.


Assumption 4.3 The pair (A, B) is asymptotically null controllable with bounded
controls (ANCBC), i.e.,
192 4 Model-Free Stabilization in the Presence of Actuator Saturation

1. (A, B) is stabilizable,
2. All eigenvalues of the system matrix A are in the closed left-half s-plane.

Assumption 4.4 The pair (A, C) is observable.


Condition 1 in Assumption 4.3 is a standard requirement for stabilization. Condition
2) is a necessary condition for global asymptotic stabilization in the presence of
actuator saturation [109]. Note that this condition allows systems to be polynomially
unstable, that is, to have repeated poles on the imaginary axis. Furthermore, as is
known in the literature on control systems with actuator saturation, when there is an
eigenvalue of A in the open right-half plane, global asymptotic stabilization is not
achievable. Assumption 4.4 is needed for output feedback control.
Our objective is to find a scheduled low gain feedback law that avoids actuator
saturation and achieves global asymptotic stabilization.
To reach our objective, we recall from [63] a method to find the low gain feedback
gain matrix based on the solution of the following parameterized ARE,

AT P (γ ) + P (γ )A + γ I − P (γ )BB T P (γ ) = 0, γ ∈ (0, 1]. (4.34)

It is worth pointing out that the parameterized ARE (4.34) is the ARE found in
the solution of the LQR problem, in which the weighting matrix Q = γ I is
parameterized in the low gain parameter γ and R = I . As a result, the parameterized
ARE can be obtained directly by the substitution of these weights in the LQR ARE.
The resulting family of parameterized low gain feedback control laws is given by

u = K ∗ (γ )x, (4.35)

where

K ∗ (γ ) = −B T P ∗ (γ )

and P ∗ (γ ) > 0 is the unique positive definite solution of the ARE (4.34),
parameterized in the low gain parameter γ ∈ (0, 1].
We recall the following results from [63].
Lemma 4.2 Under Assumption 4.3, for each γ ∈ (0, 1], there exists a unique
positive definite solution P ∗ (γ ) to the ARE (4.34) that satisfies

lim P ∗ (γ ) = 0.
γ →0

Theorem 4.5 Consider system (4.32). Let Assumption 4.3 hold. Then, for any a
priori given (arbitrarily large) bounded set W, there exists a γ ∗ such that for any
γ ∈ (0, γ ∗ ], the low gain feedback control law (4.35) renders the closed-loop system
exponentially stable at the origin with W contained in the domain of attraction.
Moreover, for any initial condition in W, actuator saturation does not occur.
4.4 Global Asymptotic Stabilization of Continuous-Time Systems 193

Remark 4.3 The domain of attraction can be made arbitrarily large by making γ →
0. The upper limit of 1 for the value of γ is chosen for the sake of convenience only.
Theorem 4.5 establishes that semi-global asymptotic stabilization can be
achieved with a constant low gain parameter. Furthermore, as shown in [63],
by scheduling the low gain parameter as a function of the state, global asymptotic
stabilization can be achieved.
The results discussed so far in this section rely on solving the parameterized
ARE (4.34), which requires the complete knowledge of the system dynamics. In
this section, we are interested in solving the global asymptotic stabilization problem
by using measurable data without invoking the system dynamics (A, B, C). An
iterative learning approach is presented where the low gain parameter is scheduled
online and the corresponding low gain feedback gain matrices are learned that
achieves global asymptotic stabilization.

4.4.1 Model-Based Iterative Algorithms

As discussed in Sect. 4.3 in the discrete-time setting, the parameterized ARE is an


LQR ARE. By comparing the parameterized ARE (4.34) with the LQR ARE (2.51)
in Chap. 2, we can see that this is also the case in the continuous-time setting. The
LQR ARE arises from solving the following optimization problem,
∞ 
V ∗ (x(t)) = min γ x T (τ )x(τ ) + uT (τ )u(τ ) dτ, (4.36)
u t

subject to the dynamics (4.32), where the LQR weighting matrices are Q = γ I and
R = I with parameter γ being the low gain parameter.
In Chap. 2, we presented iterative algorithms that provide a computationally
feasible way of solving the continuous-time LQR ARE based on the solution of
the Lyapunov equation in the policy iteration technique or by performing recursion
on the ARE itself in the value iteration technique. Extensions of these algorithms to
solving the parameterized ARE will be presented next that enable us to find the
low gain feedback gain matrix K ∗ (γ ) for global asymptotic stabilization of the
system (4.32) without causing the actuator to saturate. A parameterized version of
the policy iteration algorithm, Algorithm 2.3 used for solving the continuous-time
LQR problem, is presented in Algorithm 4.7. As was the case with its discrete-
time counterpart, Algorithm 4.1, Algorithm 4.7 requires a suitable value of the low
gain parameter and a stabilizing initial control policy K 0 such that A + BK 0 is
Hurwitz. Also, recall that this policy does not have to be the one corresponding to
an appropriate value of γ . In other words, K 0 does not have to satisfy the control
constraint. This is an advantage of the policy iteration in constrained control that will
be more appreciable when we develop policy iteration based learning algorithms in
194 4 Model-Free Stabilization in the Presence of Actuator Saturation

Algorithm 4.7 Continuous-time low gain feedback policy iteration algorithm


input: γ ∈ (0, γ ∗ ] and system dynamics (A, B)
output: P ∗ (γ ) and K ∗ (γ )
1: initialize. Select an admissible policy K 0 such that A + BK 0 is Hurwitz. Set j ← 0.
2: repeat
3: policy evaluation. Solve the following Bellman equation for P ,
T
(A + BK(γ ))T P j (γ ) + P j (γ )(A + BK(γ )) + γ I + K j (γ ) K j = 0.

4: policy update. Find an improved policy as

K j +1 (γ ) = −B T P j (γ ).

5: j←  j +1 
6: until P j (γ ) − P j −1 (γ ) < ε for some small ε > 0.

Algorithm 4.8 Continuous-time low gain feedback value iteration algorithm


Input: γ ∈ (0, γ ∗ ] and system dynamics (A, B)
Output: P ∗ (γ ) and K ∗ (γ )
Initialization. Set P 0 > 0, j ← 0, q ← 0.
1: loop  
2: P̃ j +1 (γ ) ← P j (γ ) + j AT P j (γ ) + P j (γ )A + γ I − P j (γ )BB T P j (γ ) .
3: if P̃ j +1 (γ ) ∈
/ Bq then
4: P j +1 (γ ) ← P 0
5: q ← q + 1 
 
6: else if P̃ j +1 (γ )−P j (γ ) /j < ε, for some small ε > 0, then
7: return P j (γ ) and −B T P j (γ ) as P ∗ (γ ) and K ∗ (γ ),
8: else
9: P j +1 (γ ) ← P̃ j +1 (γ )
10: end if
11: j ←j +1
12: end loop

the following subsections. The convergence properties of Algorithm 4.7 remain the
same as those of Algorithm 2.3.
Following the results in Chap. 2, a value iteration algorithm for solving the
parameterized ARE is presented that circumvents the need of a stabilizing
!∞ initial
policy at the expense of more iterations. Recall from Chap. 2 that Bq q=0 is some
bounded nonempty sets that satisfy Bq ⊆ Bq+1 , q ∈ Z+ and limq→∞ Bq = Pn+ ,
where Pn+ is the set of n-dimensional positive definite matrices. Also, let j is the
step size sequence satisfying limj →∞ j = 0. With these definitions, the value
iteration algorithm is presented in Algorithm 4.8.
Algorithms 4.7 and 4.8 provide alternative ways of designing a low gain
feedback controller without requiring to solve the parameterized ARE (4.34).
4.4 Global Asymptotic Stabilization of Continuous-Time Systems 195

These algorithms make use of the knowledge of the system dynamics and require
an appropriate value of the low gain parameter for an a priori bounded set of
initial conditions. As a result, these algorithms provide semi-global asymptotic
stabilization. In the remainder of this section, we will develop model-free learning
techniques to arrive at scheduled low gain feedback laws for global asymptotic
stabilization of system (4.32).

4.4.2 Learning Algorithms for Global Asymptotic Stabilization


by State Feedback

This subsection focuses on developing learning based algorithms for designing a


scheduled low gain feedback law for system (4.32). The two main objectives are
to uplift the knowledge of system dynamics and to circumvent the need of an a
priori given value of the low gain parameter while guaranteeing global asymptotic
stability of the closed-loop system. In Sect. 4.3 we learnt that we could achieve these
objectives for discrete-time systems by enhancing the Q-learning algorithms with a
low gain parameter scheduling mechanism, which leads to a learning formulation
in which the cost function itself is time-varying. The adaptation capability of
reinforcement learning has enabled us to learn the optimal solution corresponding
to the time-varying cost function resulting from the low gain parameter scheduling
mechanism. Building upon the continuous-time learning algorithms introduced in
Chap. 2, this subsection develops low gain feedback based model-free learning
algorithms for global asymptotic stabilization of system (4.32). We will present the
design of both policy iteration and value iteration based state feedback learning
algorithms.
Consider a Lyapunov function

V (x) = x T P j (γ )x

for system (4.32). We can also represent the dynamics of system (4.32) as follows,
 
ẋ = A + BK j (γ ) x + B σ (u) − K j (γ )x . (4.37)

Taking the time derivative of the Lyapunov function along the trajectory of (4.37) in

T 
V̇ = x T A + BK j P j (γ ) + P j A + BK j (γ ) x
T
+2 σ (u) − K j (γ )x B T P j (γ )x.

By integrating both sides of the above equation over a finite time interval, we have
196 4 Model-Free Stabilization in the Presence of Actuator Saturation

x T (t)P j x(t) − x T (t − T )P j x(t − T )


t T 
= − x T (τ ) A + BK j (γ ) P j + P j (γ ) A + BK j (γ ) x(τ )dτ
t−T
t T
−2 σ (u(τ )) − K j (γ )x(τ ) B T P j x(τ )dτ. (4.38)
t−T

From the policy iteration algorithm, Algorithm 4.7, we have


T  T
A + BK j (γ ) P j (γ ) + P j (γ ) A + BK j (γ ) = −γ I − K j (γ ) K j (γ ),
(4.39)

K j +1 (γ ) = −B T P j (γ ). (4.40)

Substitution of (4.39) and (4.40) in (4.38) gives us the following equation,

x T (t)P j (γ )x(t) − x T (t − T )P j (γ )x(t − T )


t T
= − x T (τ ) γ I + K j (γ ) K j (γ ) x(τ )dτ
t−T
t T
−2 σ (u(τ )) − K j (γ )x(τ ) K j +1 (γ )x(τ )dτ. (4.41)
t−T

The above equation is a parameterized learning equation subject to a control


constraint but does not involve the knowledge of the system dynamics. Notice how
the policy evaluation and policy update steps of the policy iteration in Algorithm 4.7
have been embedded in this equation. Interestingly, the equation is in the form of
the learning equation (2.54) used for solving the unconstrained LQR problem in
Chap. 2. As a result, this equation forms the basis of the model-free policy iteration
algorithm for system (4.32), which is subject to actuator saturation.
Equation (4.41) is a scalar equation with n(n + 1)/2 + nm unknowns corre-
sponding to P j (γ ) and K j +1 (γ ). In order to solve this equation as a least-squares
problem, we collect l ≥ n(n + 1)/2 + nm input-state datasets, each for an interval
of length T , to form the following data matrices,
 T
δxx = x̄ T (t1 ) − x̄ T (t0 ) x̄ T (t2 ) − x̄ T (t1 ) · · · x̄ T (tl ) − x̄ T (tl−1 ) ,
 t  tl T
t
Ixu = t01 x ⊗ σ (u)dτ t12 x ⊗ σ (u)dτ · · · tl−1 x ⊗ σ (u)dτ ,
  t2  tl T
t1
Ixx = t0 x ⊗ xdτ t1 x ⊗ xdτ · · · tl−1 x ⊗ xdτ .

We can write the corresponding l number of Equation (4.41) in the following


compact form,
4.4 Global Asymptotic Stabilization of Continuous-Time Systems 197

  
vecs P j (γ )
 j
 j +1  =  j ,
vec K (γ )

where the data matrices are given by


 T  n(n+1)

l× +mn
 = δxx 2Ixx In ⊗ K (γ )
j j
− 2Izu ∈ R 2
,

 j = −Ixx vec(Qj ) ∈ Rl ,

with
T
Qj = γ I + K j (γ ) K j (γ ),
 
x̄ = x12 2x1 x2 · · · z22 2z2 z3 · · · zn2 ,
  
j j j j j j
vecs P j (γ ) = P11 (γ ) P12 (γ ) · · · P1n (γ ) P22 (γ ) P23 (γ ) · · · Pnn (γ ) .

The least-squares solution is given by


  
vecs P j (γ ) T −1 T
  = j j j j . (4.42)
vec K j +1 (γ )

The above discussion has focused on the learning equation (4.41), which is
parameterized in the low gain parameter γ . Note, however, that the appropriate
value of the low gain parameter in our model-free setting is not known a priori.
For this reason, we need a low gain parameter scheduling mechanism, which can
be embedded in the learning equation (4.41). To this end, we present the model-free
state feedback based policy iteration algorithm, Algorithm 4.9 for the continuous-
time system (4.32).
Algorithm 4.9 is an extension of Algorithm 2.5 in Chap. 2. The key difference
between the two algorithms is the inclusion of the low gain scheduling mechanism,
which results in an LQR problem with a time-varying objective function. This
mechanism allows us to learn an appropriate value of the low gain parameter that
would result in a linear feedback controller that avoids actuator saturation. The
learning equation (4.41) merges the policy evaluation and policy update steps of
policy iteration. This also implies that we need a stabilizing initial policy such that
A + BK 0 is Hurwitz as we have seen in the previous policy iteration algorithms.
However, the initial policy is used only to initialize the iterations and does not need
to satisfy the control constraint as it is not applied to the system. Instead, an open-
loop policy that satisfies control constraint can be applied as a behavioral policy for
the data generation purpose. It is also worth pointing out that the control constraint
check in Step 7, which drives the low gain parameter scheduling mechanism, is a
198 4 Model-Free Stabilization in the Presence of Actuator Saturation

Algorithm 4.9 Model-free state feedback policy iteration algorithm for global
asymptotic stabilization
input: input-state data
output: P ∗ (γ ) and K ∗ (γ )
1: initialize. Select an admissible policy K 0 such that A + BK 0 is Hurwitz stable. Set γ ← 1
and j ← 0.
2: collect data. Collect system data for time t ∈ [t0 , tl ], where l is the number of learning
intervals of length tk − tk−1 = T , k = 1, 2, · · · , l, by applying an open-loop control u = ν
with ν being the exploration signal and u satisfying the control constraint

u∞ ≤ b.

3: repeat
4: evaluate and improve policy. Find the solution, P j (γ ) and K j +1 (γ ), of the following
learning equation,

x T (t)P j (γ )x(t) − x T (t − T )P j (γ )x(t − T )


t T
= − x T (τ ) γ I + K j (γ ) K j (γ ) x(τ )dτ
t−T
t T
−2 σ (u(τ )) − K j (γ )x(τ ) K j +1 (γ )x(τ )dτ.
t−T

5: j←  j +1 
6: until P j (γ ) − P j −1 (γ ) < ε for some small ε > 0.
7: control saturation check. Check the following control constraint,
 
 j 
K (γ )x  ≤ b, t ≥ tl ,

where tl = lT . If the control constraint is violated, reduce γ , reset j ← 0, and carry out Steps
3 to 6 with the updated value of the low gain parameter.

function of the current state, which in turn enables us to find a value of the low
gain parameter for any initial condition. As a result, we achieve global asymptotic
stabilization instead of semi-global asymptotic stabilization as compared to the
model-based policy iteration Algorithm 4.7.
Parallel to Sect. 4.3, we would also like to develop a value iteration variant of
Algorithm 4.9 that would uplift the requirement of a stabilizing initial policy. To
this end, consider the Lyapunov function candidate parameterized in the low gain
parameter γ ,

V (x) = x T P (γ )x. (4.43)

Evaluating the derivative of (4.43) along the trajectory of system (4.32), we have
4.4 Global Asymptotic Stabilization of Continuous-Time Systems 199

d T
(x P (γ )x) = (Ax + Bσ (u))T P (γ )x + x T P (γ ) (Ax + Bσ (u))
dt
= x T H (γ )x − 2σ T (u)K(γ )x, (4.44)

where H (γ ) = AT P (γ ) + P T (γ )A and K(γ ) = −B T P (γ ). We are interested in


finding the unknown matrices H (γ ) and K(γ ) so that we can use the recursions
in the model-based value iteration algorithm, Algorithm 4.8, without requiring the
knowledge of (A, B).
Performing finite window integrals of length T > 0 on both sides of Equation
(4.44) results in the following learning equation,

x T (t)P (γ )x(t) − x T (t − T )P (γ )x(t − T )


t t
= x T (τ )H (γ )x(τ )dτ − 2 σ T (u(τ ))K(γ )x(τ )dτ, (4.45)
t−T t−T

or, equivalently,

x T (t) ⊗ x T (t)|tt−T vec (P (γ ))


t t
= x̄(τ )dτ vecs(H (γ )) − 2 x T (τ ) ⊗ σ T (u(τ ))dτ vec (K(γ )) ,
t−T t−T

where
 
x̄ = x12 2x1 x2 · · · x22 2x2 x3 · · · xn2

and

vecs (H (γ )) = [H11 H12 · · · H1n H22 H23 · · · Hnn ] .

Equation (4.45) is a parameterized version of the value iteration learning equation


(2.55) presented in Chap. 2 for the unconstrained LQR problem. It is a scalar
equation linear in the unknowns H (γ ) and K(γ ). Clearly, there are more unknowns
than the number of equations. Therefore, we develop a system of l number of such
equations by performing l finite window integrals each of length T . To solve this
linear system of equations, we define the following data matrices,
 T
δxx = x ⊗ x|tt10 x ⊗ x|tt21 · · · x ⊗ x|ttll−1 ,
  t2  tl T
t1
Ixu = t0 x(τ ) ⊗ σ (u)(τ )dτ t1 x(τ ) ⊗ σ (u(τ ))dτ · · · tl−1 x(τ ) ⊗ σ (u(τ ))dτ ,
  t2  tl T
t1
Ixx = t0 x̄(τ )T dτ t1 x̄(τ )T dτ · · · tl−1 x̄(τ ) T dτ
,
200 4 Model-Free Stabilization in the Presence of Actuator Saturation

where, for k = 1, 2, · · · , l,

x ⊗ x|ttkk−1 = x(tk ) ⊗ x(tk ) − x(tk−1 ) ⊗ x(tk−1 ).

We write the corresponding l number of Equation (4.45) as the following matrix


equation,
 
  vecs (H (γ ))
Ixx −2Ixu = δxx vec (P (γ )) ,
vec (K(γ ))

whose least-squares solution is given by


 
vecs (H (γ ))  T  −1
= Ixx −2Ixu Ixx −2Ixu
vec (K(γ ))
 T
× Ixx −2Ixu δxx vec(P (γ )). (4.46)

Based on the solution of Equation (4.46), we perform the recursion on the following
equation,
T
P j +1 (γ ) = P j (γ ) + j H j (γ ) + γ I − K j (γ ) K j (γ ) .

We are now ready to propose our scheduled low gain learning algorithm that
achieves model-free global asymptotic stabilization of system (4.32). Global asymp-
totic stabilization is achieved by preventing saturation under the scheduled low gain
feedback.
The details of Algorithm 4.10 are as follows. The algorithm is initialized with
a value of the low gain parameter γ ∈ (0, 1], say γ = 1, and an arbitrary control
policy comprising of exploration signal ν is used to generate system data. Note
that this initial policy is open-loop and not necessarily stabilizing. Furthermore,
it is selected so that actuator saturation is avoided. The trajectory data is used to
solve the learning equation in Step 4, where a stabilizing low gain feedback gain
is obtained corresponding to our choice of the low gain parameter. However, this
control policy needs to be checked for the control constraint to ensure that the value
of the low gain parameter is appropriate, which is done in Step 16. The value of
the low gain parameter γ can be updated under a proportional rule γj +1 = aγj , for
some a ∈ (0, 1), in future iterations. The novelty in Algorithm 4.10 is that there is
a scheduling mechanism for the time-varying low gain parameter that ensures the
satisfaction of the control constraint. This is in contrast to Algorithm 4.8, where the
knowledge of (A, B) as well as an appropriate value of the low gain parameter are
needed. Furthermore, the scheduling of the value of the low gain parameter enables
global asymptotic stabilization as the scheduling mechanism is updated according
to the current state of the system.
4.4 Global Asymptotic Stabilization of Continuous-Time Systems 201

Algorithm 4.10 Model-free state feedback value iteration algorithm for global
asymptotic stabilization
Input: input-state data
Output: P ∗ (γ ) and K ∗ (γ )
1: initialize. Select P 0 > 0 and set γ ← 1, j ← 0 and q ← 0.
2: collect data. Collect system data for time t ∈ [t0 , tl ], where l is the number of learning
intervals of length tk − tk−1 = T , k = 1, 2, · · · , l, by applying an open-loop control u = ν
with ν being the exploration signal and u satisfying the control constraint

u∞ ≤ b.

3: loop
4: Find the solution, H j (γ ) and K j (γ ), of the following equation,

x T (t)P j (γ )x(t) − x T (t − T )P j (γ )x(t − T )


t t
= x T (τ )H j (γ )x(τ )dτ − 2 (σ (u(τ )))T K j (γ )x(τ )dτ.
t−T t−T

 
5: P̃ j +1 (γ ) ← P j (γ ) + j H j + γ I − (K j )T (γ )K j (γ )
6: if P̃ j +1 (γ ) ∈
/ Bq then
7: P j +1 (γ ) ← P 0
8: q ← q + 1 
 
9: else if P̃ j +1 (γ )−P j (γ ) / j < ε then
10: return P j (γ ) as an estimate of P ∗ (γ )
11: else
12: P j +1 (γ ) ← P̃ j +1 (γ )
13: end if
14: j ←j +1
15: end loop
16: control saturation check. Check the following control constraint ∀t ≥ tf ,
 
 j 
K (γ )x  ≤ b, t ≥ tl = lT .

If the control constraint is violated, reduce γ , reset j ← 0, and carry out Steps 3 to 15 with
the updated value of the low gain parameter.

Recall from the previous chapters that, in order to solve the least-squares problem
of the forms (4.42) and (4.46), we need an exploration signal ν during the data
collection phase to satisfy the following rank condition,
 
rank Ixx Ixu = n(n + 1)/2 + mn. (4.47)

Compared to the exploration mechanism in the previous chapters, where we could


use an unconstrained exploration signal together with a feedback signal, the use
of feedback signal is avoided in constrained control problems in order to prevent
violation of the control constraint during the learning phase.
202 4 Model-Free Stabilization in the Presence of Actuator Saturation

Theorem 4.6 below establishes the convergence of the model-free algorithms,


Algorithms 4.9 and 4.10.
Theorem 4.6 Consider system (4.32). Under Assumption 4.3, both Algorithms 4.9
and 4.10 globally asymptotically stabilize the system at the origin if the rank
condition (4.47) is satisfied.
Proof The proof is carried out in two steps, which correspond, respectively, to the
iterations on the low gain parameter γ and the iterations on the matrix K(γ ) for a
given value of the low gain parameter γ . Let i be the iteration index on the low gain
parameter γ . Then, the gain scheduling law, denoted as

γi+1 = aγi , γ0 ∈ (0, 1],


converges
  to a stabilizing γ ∈ (0, γ ] since 0 < a < 1. The control constraint
 
K̂(γ )x  ≤ b can be met by Theorem 4.5. This shows the convergence of

the scheduling mechanism found in the outer loops of Algorithms 4.9 and 4.10,
assuming that P ∗ (γ ) and K ∗ (γ ) can be obtained. To show the convergence
to P ∗ (γ ) and K ∗ (γ ), we next consider the inner-loop with iteration index j
corresponding to the iterations on P (γ ) and K(γ ). In the case of Algorithm 4.9,
the inner-loop has the following learning equation corresponding to a current low
gain parameter γi ,

x T (t)P j (γi )x(t) − x T (t − T )P j (γi )x(t − T )


t T
= − x T (τ ) γi I + K j (γi ) K j (γi ) x(τ )dτ
t−T
t 
−2 σ (u(τ )) − K j (γi ) x(τ ))T K j +1 (γi )x(τ )dτ,
t−T

which has a unique solution provided that the rank condition (4.47) holds. Then,
the iterations on this equation are equivalent to the iterations on the parameterized
Lyapunov equation used in Algorithm 4.7 as shown in the derivation of (4.41) earlier
in this section. These iterations are the LQR Lyapunov iterations [51] with Q = γ I
and converge under√ the controllability condition of (A, B) and the observability
condition of A, Q .
In the case of Algorithm 4.10, the inner-loop has the following learning equation
corresponding to the current value of the low gain parameter γi ,

x T (t)P j (γi )x(t) − x T (t − T )P j (γi )x(t − T )


t t
= x T (τ )H j (γi )x(τ )dτ − 2 σ T (u(τ ))K j (γi )x(τ )dτ.
t−T t−T

This equation has a unique solution under the rank condition (4.47). Then, the
recursion
4.4 Global Asymptotic Stabilization of Continuous-Time Systems 203

T
P̃ j +1 (γi ) ← P j (γi ) + j H j (γi ) + γi I − K j (γi )K j (γi )

corresponds to Algorithm 4.8 for finding K ∗ (γ ), which is the LQR recursive ARE
with Q = γi I and its convergence is shown in [11]  under
√  the controllability
condition of (A, B) and the observability condition of A, Q .
In order to establish that the converged K ∗ (γ ) indeed stabilizes the system while
ensuring that actuator saturation does not occur, we perform the following Lyapunov
analysis.
We consider system (4.32) in the following closed-loop form,

ẋ = Ax + Bσ (u)
 
= A − BB T P ∗ (γ ) x + B (σ (u) − u) . (4.48)

We choose the candidate Lyapunov function

V (x) = x T P ∗ (γ )x. (4.49)

For the given initial condition x(0), let W be a bounded set that contains x(0) and
let c > 0 be a constant such that

c≥ sup x T P ∗ (γ )x. (4.50)


x∈W,γ ∈(0,1]

Such c exists because W is bounded and limγ →0 P ∗ (γ ) = 0 by Lemma 4.2. Let us


define a level set
!
LV (c) = x ∈ Rn : V (x) ≤ c

and let γ ∗ be such that, for all γ ∈ (0, γ ∗ ], x ∈ LV (c) implies that
 
 T ∗ 
B P (γ )x  ≤ b.

To see that such γ ∗ exists, we note that


    1/2  ∗ 1/2 
 T ∗   
B P (γ )x  = B T P ∗ (γ ) P (γ ) x
  1/2 
 √
≤ B T P ∗ (γ )  c,

where we have used inequality (4.50). Since limγ →0 P ∗ (γ ) = 0 by Lemma 4.2,


the factor involving P ∗ (γ ) can be made arbitrary small and, therefore, there exists
γ ∗ ∈ (0, 1] such that
204 4 Model-Free Stabilization in the Presence of Actuator Saturation

 
 T ∗ 
B P (γ )x  ≤b

for all γ ∈ (0, γ ∗ ]. In other words, we can always find a γ that makes the above
norm small enough so that the control operates in the linear region of the saturation
function. The evaluation of the derivative of V along the trajectory of the closed-
loop system (4.48) shows that, for all x ∈ LV (c),

  
V̇ = x T P ∗ (γ ) A − BB T P ∗ (γ ) x + B (σ (u) − u)
  T
+ A − BB T P ∗ (γ ) x + B (σ (u) − u) P ∗ (γ )x
   
= x T AT P ∗ (γ ) + P ∗ (γ )A x − 2x T P ∗ (γ )BB T P ∗ (γ )x + 2x T P ∗ (γ )B σ (u) − u .

By the ARE (4.34), we have

AT P ∗ (γ ) + P ∗ (γ )A = −γ I + P ∗ (γ )BB T P ∗ (γ ).

It then follows that


 
V̇ = x T − γ I +P ∗ (γ )BB T P ∗ (γ ) x−2x T P ∗ (γ )BB T P ∗ (γ )x + 2x T P ∗ (γ )B (σ (u) − u)

= −γ x T x − x T P ∗ (γ )BB T P ∗ (γ )x + 2x T P ∗ (γ )B(σ (u) − u).

Since
 
 T ∗ 
B P (γ )x  ≤ b,

it follows from the definition of the saturation function (4.33) that σ (u) = u, which
results in

V̇ = −γ x T x − x T P ∗ (γ )BB T P ∗ (γ )x
≤ −γ x T x,

which in turn implies that, for any γ ∈ (0, γ ∗ ], the equilibrium x = 0 of the closed-
loop system is asymptotically stable at the origin with x(0) ∈ W ⊂ LV (c) is
contained in the domain of attraction. Since x(0) is arbitrary, we establish global
asymptotic stabilization. This completes the proof.
4.4 Global Asymptotic Stabilization of Continuous-Time Systems 205

4.4.3 Learning Algorithms for Global Asymptotic Stabilization


by Output Feedback

The discussion in the previous subsection was focused on developing model-free


state feedback algorithms for the continuous-time system (4.32). Both the scheduled
low gain feedback controller and the associated learning mechanism require the
measurement of the full state of the system. In this subsection we will develop
output feedback counterparts of the learning algorithms introduced in the previous
subsection. The results in this subsection are based on the state parameterization
result of Theorem 2.9 presented in Chap. 2 for the continuous-time linear systems in
the absence of actuator saturation. We have seen in earlier in Sect. 4.3.3 that the state
parameterization result Theorem 2.2 for discrete-time linear systems in the absence
of actuator saturation remains applicable in the presence of actuator saturation by
viewing the saturated input as a new input. The following result shows that this
extension is also applicable in the continuous-time setting.
Theorem 4.7 Consider system (4.32). Let Assumption 4.4 hold. Consider the state
parameterization

x̄(t) = Wu ζu (t) + Wy ζy (t), (4.51)

where
 1 T  2 T  T T
ζu = ζu ζu · · · ζum ,
 T T 
 p T T
ζy = ζy1 ζy2 · · · ζy ,

and ζui and ζyi are constructed, respectively, from the ith input ui and the ith output
y i as

ζ˙ui (t) = Aζui (t) + bσ ui (t) , i = 1, 2, · · · , m,

ζ˙yi (t) = Aζyi (t) + by i (t), i = 1, 2, · · · , p,

 T
for any Hurwitz stable matrix A in the controllable form and b = 0 0 · · · 1 with
ζui (0) = 0 and ζyi (0) = 0. Then, x̄(t) converges to the state x(t) as t → ∞.
In the following we will develop the output feedback learning equations for the
constrained control problem. We introduce the following definitions,
 T
z(t) = ζuT (t) ζyT (t) ∈ RN ,
 
W = Wu Wy ∈ Rn×N ,
206 4 Model-Free Stabilization in the Presence of Actuator Saturation

P̄ (γ ) = W T P (γ )W ∈ RN ×N ,
K̄(γ ) = K(γ )W ∈ RmN ,

where N = mn + pn. In view of these definitions and Theorem 2.9, and using
y = Cx, we can write the constrained control learning equation (4.41) as

zT (t)P̄ j (γ )z(t) − zT (t − T )P̄ j (γ )z(t − T )


t t T
= −γ y T (τ )y(τ )dτ − zT (τ ) K̄ j (γ ) K̄ j (γ )z(τ )dτ
t−T t−T
t T
−2 σ (u(τ )) − K̄ j (γ )z(τ ) K̄ j +1 (γ )z(τ )dτ. (4.52)
t−T

Equation (4.52) is the output feedback version of the constrained control learning
equation (4.41). It can be seen that, with the help of the auxiliary dynamics z, the
full state has been eliminated in this equation. As was the case in its state feedback
counterpart, embedded in this equation are the steps of policy evaluation and policy
update found in a model-based policy iteration algorithm. The two unknowns in
this equation are P̄ j and K̄ j +1 , which contain a total of N(N + 1)/2 + Nm scalar
unknowns. To solve (4.52) in the least-squares sense, we can collect l ≥ N(N +
1)/2 + N m datasets of input-output data to form the following data matrices,
 T
δzz = z̄T (t1 ) − z̄T (t0 ) z̄T (t2 ) − z̄T (t1 ) · · · z̄T (tl ) − z̄T (tl−1 ) ,
 t
t
Izu = t01 z(τ ) ⊗ σ (u(τ ))dτ t12 z(τ ) ⊗ σ (u(τ ))dτ · · ·
 tl T
tl−1 z(τ ) ⊗ σ (u(τ ))dτ ,
  t2  tl T
t1
Izz = t0 z(τ ) ⊗ z(τ )dτ t1 z(τ ) ⊗ z(τ )dτ · · · tl−1 z(τ ) ⊗ z(τ )dτ ,
  t2  tl T
t1
Iyy = t0 y(τ ) ⊗ y(τ )dτ t1 y(τ ) ⊗ y(τ )dτ · · · tl−1 y(τ ) ⊗ y(τ )dτ .

We can write the corresponding l number of Equation (4.52) in the following


compact form,
  
vecs P̄ j (γ )
 j
  = j ,
vec K̄ j +1 (γ )

where the data matrices are given by


 T  N(N+1)

l× +mN
 = δzz
j
−2Izz In ⊗ K̄ (γ ) j
+2Izu ∈ R 2
,
4.4 Global Asymptotic Stabilization of Continuous-Time Systems 207


 j = −Izz vec Q̄j − γ Iyy vec(I ) ∈ Rl ,

with
T
Q̄j = K̄ j (γ ) K̄ j (γ ),
 
z̄ = z12 2z1 z2 · · · z22 2z2 z3 · · · zN
2
,
  
j j j j j j
vecs P j (γ ) = P11 (γ ) P12 (γ ) · · · P1n (γ ) P22 (γ ) P23 (γ ) · · · PN N (γ ) .

The least-squares solution is then given by


  
vecs P̄ j (γ ) T −1 T
  = j j j j . (4.53)
vec K̄ j +1 (γ )

In addition to solving (4.52), we also need a scheduling mechanism to find


an appropriate value of the low gain parameter so that the solution of (4.52)
corresponds to an appropriate low gain control policy that satisfies the control
constraint. Furthermore, the scheduling mechanism is used to update the value of the
low gain parameter γ as a function of the filtered input-output data corresponding
to the current state and helps to result in global asymptotic stabilization. The model-
free output feedback policy iteration algorithm for global asymptotic stabilization
of system (4.32) algorithm is presented in Algorithm 4.11. Algorithm 4.11 works
essentially on the same principle as Algorithm 4.9 but does so without requiring the
measurement of the full state of the system. The learning equation and the resulting
low gain feedback control use output feedback. This results in more unknown
parameters and, therefore, incurs a longer learning time. In the output feedback case,
Algorithm 4.11, we need a stabilizing initial policy K̄ 0 = K 0 W such that A + BK 0
is Hurwitz. However, K̄ 0 does not need to satisfy the control constraint as it will
not be applied to the system during the learning phase. Instead, only an open-loop
signal u = ν is applied as a behavioral policy to generate data. In the following, we
present a value iteration algorithm that uplifts the requirement of such a stabilizing
K̄. t
Consider the value iteration learning equation (4.45). Adding γ t−T y T (τ )y(τ )dτ
to both sides of this equation, and substituting y(t) = Cx(t) and H = AT P + P A
on the right-hand side, we have
t
x T (t)P (γ )x(t) − x T (t − T )P (γ )x(t − T ) + γ y T (τ )y(τ )dτ
t−T
t 
= x T (τ ) AT P (γ ) + P (γ )A + γ C T C x(τ )dτ
t−T
Algorithm 4.11 Model-free output feedback policy iteration algorithm for asymptotic stabilization
input: input-output data
output: $\bar P^*(\gamma)$ and $\bar K^*(\gamma)$
1: initialize. Select an admissible policy $\bar K^0$ such that $A + BK^0$ is Hurwitz. Set $\gamma \leftarrow 1$ and $j \leftarrow 0$.
2: collect data. Collect system data for time $t \in [t_0, t_l]$, where $l$ is the number of learning intervals of length $t_k - t_{k-1} = T$, $k = 1, 2, \cdots, l$, by applying an open-loop control $u = \nu$ with $\nu$ being the exploration signal and $u$ satisfying the control constraint $\|u\|_\infty \le b$.
3: repeat
4: evaluate and improve policy. Find the solution, $\bar P^j(\gamma)$ and $\bar K^{j+1}(\gamma)$, of the following learning equation,
$$z^{\mathrm T}(t)\bar P^j(\gamma)z(t) - z^{\mathrm T}(t-T)\bar P^j(\gamma)z(t-T) = -\gamma\int_{t-T}^{t} y^{\mathrm T}(\tau)y(\tau)\,\mathrm d\tau - \int_{t-T}^{t} z^{\mathrm T}(\tau)\bar K^{j\mathrm T}(\gamma)\bar K^j(\gamma)z(\tau)\,\mathrm d\tau - 2\int_{t-T}^{t}\big(\sigma(u(\tau)) - \bar K^j(\gamma)z(\tau)\big)^{\mathrm T}\bar K^{j+1}(\gamma)z(\tau)\,\mathrm d\tau.$$
5: $j \leftarrow j + 1$
6: until $\big\|\bar P^j(\gamma) - \bar P^{j-1}(\gamma)\big\| < \varepsilon$ for some small $\varepsilon > 0$.
7: control saturation check. Check the following control constraint,
$$\big\|\bar K^j(\gamma)z(t)\big\|_\infty \le b, \quad t \ge t_l = lT.$$
If the control constraint is violated, reduce $\gamma$, reset $j \leftarrow 0$, and carry out Steps 3 to 6 with the updated value of the low gain parameter.
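The outer low gain scheduling of Algorithm 4.11 can be summarized by the following sketch; `pi_solver` and `constraint_ok` are hypothetical helpers standing for Steps 3–6 (run to convergence on the stored dataset) and the saturation check of Step 7, and the halving of γ matches the simulations later in this section.

```python
def schedule_low_gain(pi_solver, constraint_ok, gamma0=1.0, factor=0.5, gamma_min=1e-6):
    gamma = gamma0
    while gamma > gamma_min:
        P_bar, K_bar = pi_solver(gamma)      # Steps 3-6 on the same stored dataset
        if constraint_ok(K_bar):             # Step 7: ||K_bar z||_inf <= b along the data?
            return gamma, P_bar, K_bar       # constraint met: stop scheduling
        gamma *= factor                      # otherwise reduce the low gain parameter
    raise RuntimeError("low gain parameter was reduced below gamma_min")
```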

Consider the value iteration learning equation (4.45). Adding $\gamma\int_{t-T}^{t} y^{\mathrm T}(\tau)y(\tau)\,\mathrm d\tau$ to both sides of this equation, and substituting $y(t) = Cx(t)$ and $H = A^{\mathrm T}P + PA$ on the right-hand side, we have
$$x^{\mathrm T}(t)P(\gamma)x(t) - x^{\mathrm T}(t-T)P(\gamma)x(t-T) + \gamma\int_{t-T}^{t} y^{\mathrm T}(\tau)y(\tau)\,\mathrm d\tau = \int_{t-T}^{t} x^{\mathrm T}(\tau)\big(A^{\mathrm T}P(\gamma) + P(\gamma)A + \gamma C^{\mathrm T}C\big)x(\tau)\,\mathrm d\tau - 2\int_{t-T}^{t}\sigma^{\mathrm T}(u(\tau))K(\gamma)x(\tau)\,\mathrm d\tau.$$

Using the state parameterization (4.51), we have
$$z^{\mathrm T}(t)\bar P(\gamma)z(t) - z^{\mathrm T}(t-T)\bar P(\gamma)z(t-T) + \gamma\int_{t-T}^{t} y^{\mathrm T}(\tau)y(\tau)\,\mathrm d\tau = \int_{t-T}^{t} z^{\mathrm T}(\tau)\bar H(\gamma)z(\tau)\,\mathrm d\tau - 2\int_{t-T}^{t}\sigma^{\mathrm T}(u(\tau))\bar K(\gamma)z(\tau)\,\mathrm d\tau, \qquad (4.54)$$
or, equivalently,
$$\big(z^{\mathrm T}(t)\otimes z^{\mathrm T}(t)\big)\big|_{t-T}^{t}\operatorname{vec}\big(\bar P(\gamma)\big) + \gamma\int_{t-T}^{t} y^{\mathrm T}(\tau)\otimes y^{\mathrm T}(\tau)\,\mathrm d\tau\operatorname{vec}(I) = \int_{t-T}^{t}\bar z(\tau)\,\mathrm d\tau\operatorname{vecs}\big(\bar H(\gamma)\big) - 2\int_{t-T}^{t} z^{\mathrm T}(\tau)\otimes\sigma^{\mathrm T}(u(\tau))\,\mathrm d\tau\operatorname{vec}\big(\bar K(\gamma)\big),$$
where
$$\bar H(\gamma) = W^{\mathrm T}\big(A^{\mathrm T}P(\gamma) + P(\gamma)A + \gamma C^{\mathrm T}C\big)W$$
and
$$\bar z = \begin{bmatrix} z_1^2 & 2z_1z_2 & 2z_1z_3 & \cdots & z_2^2 & 2z_2z_3 & \cdots & z_N^2 \end{bmatrix}.$$

Equation (4.54) is a parameterized learning equation that uses only output feedback.
Similar to the state feedback equation (4.45), it is a scalar equation linear in the
unknowns H̄ (γ ) and K̄(γ ). These matrices are the output feedback counterparts of
the matrices H (γ ) and K(γ ) in the state feedback case. As there are more unknowns
than the number of equations, we develop a system of l number of such equations
by performing l finite window integrals each of length T . To solve this linear system
of equations, we define the following data matrices,
$$\delta_{zz} = \begin{bmatrix} z\otimes z\big|_{t_0}^{t_1} & z\otimes z\big|_{t_1}^{t_2} & \cdots & z\otimes z\big|_{t_{l-1}}^{t_l} \end{bmatrix}^{\mathrm T},$$
$$I_{zu} = \begin{bmatrix} \int_{t_0}^{t_1} z(\tau)\otimes\sigma(u(\tau))\,\mathrm d\tau & \int_{t_1}^{t_2} z(\tau)\otimes\sigma(u(\tau))\,\mathrm d\tau & \cdots & \int_{t_{l-1}}^{t_l} z(\tau)\otimes\sigma(u(\tau))\,\mathrm d\tau \end{bmatrix}^{\mathrm T},$$
$$I_{zz} = \begin{bmatrix} \int_{t_0}^{t_1}\bar z^{\mathrm T}(\tau)\,\mathrm d\tau & \int_{t_1}^{t_2}\bar z^{\mathrm T}(\tau)\,\mathrm d\tau & \cdots & \int_{t_{l-1}}^{t_l}\bar z^{\mathrm T}(\tau)\,\mathrm d\tau \end{bmatrix}^{\mathrm T},$$
$$I_{yy} = \begin{bmatrix} \int_{t_0}^{t_1} y(\tau)\otimes y(\tau)\,\mathrm d\tau & \int_{t_1}^{t_2} y(\tau)\otimes y(\tau)\,\mathrm d\tau & \cdots & \int_{t_{l-1}}^{t_l} y(\tau)\otimes y(\tau)\,\mathrm d\tau \end{bmatrix}^{\mathrm T},$$
where, for $k = 1, 2, \cdots, l$,
$$z\otimes z\big|_{t_{k-1}}^{t_k} = z(t_k)\otimes z(t_k) - z(t_{k-1})\otimes z(t_{k-1}).$$

We write the corresponding $l$ equations of the form (4.54) in the following compact form,
$$\begin{bmatrix} I_{zz} & -2I_{zu} \end{bmatrix}\begin{bmatrix} \operatorname{vecs}\big(\bar H(\gamma)\big) \\ \operatorname{vec}\big(\bar K(\gamma)\big) \end{bmatrix} = \delta_{zz}\operatorname{vec}\big(\bar P(\gamma)\big) + \gamma I_{yy}\operatorname{vec}(I),$$
whose least-squares solution is given by
$$\begin{bmatrix} \operatorname{vecs}\big(\bar H(\gamma)\big) \\ \operatorname{vec}\big(\bar K(\gamma)\big) \end{bmatrix} = \Big(\begin{bmatrix} I_{zz} & -2I_{zu} \end{bmatrix}^{\mathrm T}\begin{bmatrix} I_{zz} & -2I_{zu} \end{bmatrix}\Big)^{-1}\begin{bmatrix} I_{zz} & -2I_{zu} \end{bmatrix}^{\mathrm T}\Big(\delta_{zz}\operatorname{vec}\big(\bar P(\gamma)\big) + \gamma I_{yy}\operatorname{vec}(I)\Big). \qquad (4.55)$$

Based on the solution of (4.55), we perform the recursion on the following equation,
$$\bar P^{j+1}(\gamma) = \bar P^j(\gamma) + \epsilon_j\Big(\bar H^j(\gamma) - \bar K^{j\mathrm T}(\gamma)\bar K^j(\gamma)\Big),$$
where $\epsilon_j > 0$ is the step size.
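A minimal sketch of this value iteration recursion is given below; `solve_H_K` is a hypothetical helper implementing the least-squares solution (4.55), and the step size and bounding set follow the choices used in the simulations later in this section.

```python
import numpy as np

def vi_recursion(solve_H_K, P0, bound0=800.0, eps_tol=1e-2, max_iter=10000):
    P, q, j = P0.copy(), 0, 0
    K = None
    while j < max_iter:
        H, K = solve_H_K(P)                      # least-squares solution (4.55)
        eps_j = 1.0 / (j ** 0.5 + 5.0)           # step size
        P_tilde = P + eps_j * (H - K.T @ K)
        if np.linalg.norm(P_tilde) > bound0 * (q + 1):
            P, q = P0.copy(), q + 1              # escaped the bounding set: reset
        elif np.linalg.norm(P_tilde - P) / eps_j < eps_tol:
            return P_tilde, K                    # converged
        else:
            P = P_tilde
        j += 1
    return P, K
```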

As has been the case with the previous algorithms, the learning equation discussed
above requires an appropriate value of the low gain parameter as a target objective
function. This is achieved by a low gain parameter scheduling mechanism. The
resulting model-free output feedback value iteration algorithm for global asymptotic
stabilization of system (4.32) is presented in Algorithm 4.12.
Compared to the state feedback Algorithms 4.9 and 4.10, the output feedback
Algorithms 4.11 and 4.12 involve more unknown parameters and require the
following rank condition for the solution of the least-squares problems (4.53) and
(4.55),
 
$$\operatorname{rank}\big(\begin{bmatrix} I_{zz} & I_{zu} \end{bmatrix}\big) = N(N+1)/2 + mN. \qquad (4.56)$$

Theorem 4.8 establishes the convergence of Algorithms 4.11 and 4.12.


Theorem 4.8 Consider system (4.32). Under Assumptions 4.3 and 4.4, together
with the full row rank condition of W and the rank condition (4.56), both
Algorithms 4.11 and 4.12 globally asymptotically stabilize the system at the origin.
Proof Note that Steps 3 to 6 in Algorithm 4.11 and Steps 3 to 15 in Algorithm 4.12
are, respectively, the output feedback LQR learning Algorithms 2.7 and 2.8
with Qy = γ I , whose convergence has been established in Chap. 2
under the stated conditions. Then, following the arguments in Theorem 4.6 for the
low gain scheduling mechanism, the output feedback algorithms, Algorithms 4.11
and 4.12, also achieve global asymptotic stabilization of the system by finding a
suitable value of the low gain parameter γ and the corresponding control matrix
K̄(γ ) that avoids actuator saturation. This completes the proof.

4.4.4 Numerical Simulation

In this subsection we will validate the model-free algorithms for the continuous-time
system (4.32). Consider system (4.32) with

Algorithm 4.12 Model-free output feedback value iteration algorithm for global asymptotic stabilization
Input: input-output data
Output: $\bar P^*(\gamma)$ and $\bar K^*(\gamma)$
1: initialize. Select $\bar P^0 \ge 0$ and set $\gamma \leftarrow 1$, $j \leftarrow 0$ and $q \leftarrow 0$.
2: collect data. Collect system data for time $t \in [t_0, t_l]$, where $l$ is the number of learning intervals of length $t_k - t_{k-1} = T$, $k = 1, 2, \cdots, l$, by applying an open-loop control $u = \nu$ with $\nu$ being the exploration signal and $u$ satisfying the control constraint $\|u\|_\infty \le b$.
3: loop
4: Find the solution, $\bar H^j(\gamma)$ and $\bar K^j(\gamma)$, of the following equation,
$$z^{\mathrm T}(t)\bar P^j(\gamma)z(t) - z^{\mathrm T}(t-T)\bar P^j(\gamma)z(t-T) = \int_{t-T}^{t} z^{\mathrm T}(\tau)\bar H^j(\gamma)z(\tau)\,\mathrm d\tau - 2\int_{t-T}^{t}\sigma^{\mathrm T}(u(\tau))\bar K^j(\gamma)z(\tau)\,\mathrm d\tau.$$
5: $\tilde{\bar P}^{j+1}(\gamma) \leftarrow \bar P^j(\gamma) + \epsilon_j\big(\bar H^j(\gamma) - \bar K^{j\mathrm T}(\gamma)\bar K^j(\gamma)\big)$
6: if $\tilde{\bar P}^{j+1}(\gamma) \notin \mathcal B_q$ then
7: $\bar P^{j+1}(\gamma) \leftarrow \bar P^0$
8: $q \leftarrow q + 1$
9: else if $\big\|\tilde{\bar P}^{j+1}(\gamma) - \bar P^j(\gamma)\big\|/\epsilon_j < \varepsilon$ then
10: return $\bar P^j(\gamma)$ and $\bar K^j(\gamma)$ as $\bar P^*(\gamma)$ and $\bar K^*(\gamma)$
11: else
12: $\bar P^{j+1}(\gamma) \leftarrow \tilde{\bar P}^{j+1}(\gamma)$
13: end if
14: $j \leftarrow j + 1$
15: end loop
16: control saturation check. Check the following control constraint,
$$\big\|\bar K^j(\gamma)z(t)\big\|_\infty \le b, \quad t \ge t_l.$$
If the saturation condition is violated, reduce $\gamma$, reset $j \leftarrow 0$, and carry out Steps 3 to 15 with the updated value of the low gain parameter.

$$A = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \\ -1 & 0 & -2 & 0 \end{bmatrix}, \quad B = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}, \quad C = \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix}.$$

Matrix A has a pair of repeated eigenvalues at ±j and, therefore, the system is open-
loop unstable. The actuator saturation limit is b = 1. Both the state feedback and
the output feedback algorithms will be validated along with their policy iteration and
value iteration variants. In order to appreciate the motivation for low gain feedback,
let us first test the system with a general stabilizing state feedback control law with
feedback gain matrix
 
$$K = \begin{bmatrix} -23 & -50 & -33 & -10 \end{bmatrix}.$$

This state feedback law results in the closed-loop eigenvalues of {−1, −2, −3, −4}.
Figure 4.18 shows the closed-loop response under this controller. Upon examining
these results it is evident that this controller repeatedly violates the control constraint
imposed by actuator saturation, which causes instability of the closed-loop system
even though K is chosen such that A + BK is Hurwitz.
We will now focus on the model-free scheduled low gain feedback designs.
Consider first the state feedback policy iteration Algorithm 4.9. Let the initial state
 T
of the system be x0 = 0.25 −0.5 −0.5 0.25 and initialize the algorithm with
 
γ = 1 and K 0 = −23 −50 −33 −10 , which is stabilizing with A + BK 0
being Hurwitz stable but does not meet the control constraint as shown in Fig. 4.18.
Figure 4.19 shows the state response and the control effort under Algorithm 4.9.

Fig. 4.18 Closed-loop response under a stabilizing state feedback law designed without taking
actuator saturation into consideration

Fig. 4.19 Algorithm 4.9: Closed-loop response (states and control input) with $x_0 = [0.25\ -0.5\ -0.5\ 0.25]^{\mathrm T}$

In every main iteration of the algorithm, the low gain parameter γ is reduced by a
factor of one half. In each iteration under a given value of the low gain parameter γ ,
the convergence criterion of ε = 0.01 is selected for the iteration on the controller
parameters. In this simulation, we collected L = 15 data samples to satisfy the rank
condition (4.47). These data samples are collected only once using a behavioral
policy comprising of sinusoidal signals of different frequencies and magnitudes
such that the control constraint is satisfied. Once these data samples are available,
we repeatedly use this dataset in policy iterations for different values of γ . It can be
seen that the algorithm is able to find a suitable value of the low gain parameter
γ and learn the corresponding low gain state feedback gain matrix K(γ ) that
guarantees convergence of the state without saturating the actuator. The final value
of γ = 0.1250 and the corresponding gain matrix is obtained by solving the ARE
as
 
K ∗ (γ ) = −0.0607 −1.4046 −0.7567 −1.2800 .
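For reference, such a nominal gain can be reproduced from the model, assuming the parameterized ARE of this chapter is the standard continuous-time LQR ARE with state weight $\gamma C^{\mathrm T}C$ and control weight $I$ (which is what the learning equation above penalizes); this model knowledge is, of course, never used by Algorithm 4.9 itself:

```python
import numpy as np
from scipy.linalg import solve_continuous_are

A = np.array([[0., 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1], [-1, 0, -2, 0]])
B = np.array([[0.], [0], [0], [1]])
C = np.array([[1., 0, 0, 0]])
gamma = 0.125
P = solve_continuous_are(A, B, gamma * C.T @ C, np.eye(1))  # low gain ARE (assumed form)
K_star = -B.T @ P                                           # nominal gain K*(gamma)
```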

The convergence of the low gain feedback gain is shown in Fig. 4.20 and its final
estimate is
 
K(γ ) = −0.0605 −1.4055 −0.7571 −1.2809 ,

which shows convergence to its nominal value K ∗ (γ ).



Fig. 4.20 Algorithm 4.9: Convergence of the feedback gain matrix with $x_0 = [0.25\ -0.5\ -0.5\ 0.25]^{\mathrm T}$

We next verify the global asymptotic stabilization capability of Algorithm 4.9.


To this end, we simulate the system with a larger initial condition x(0) =
 T
0.5 −1 −1 0.5 . All other simulation parameters are the same. In this case,
the value of the low gain parameter can be expected to be lower since the
initial condition is larger. Indeed, this is the case and the final value of the low
gain parameter as found by the scheduling mechanism is γ = 0.0156 and the
corresponding nominal gain matrix as found by solving the ARE is
 
K ∗ (γ ) = −0.0078 −0.7510 −0.2567 −0.7273 .

Fig. 4.21 shows the closed-loop response. The convergence of the estimate of the
low gain feedback gain matrix is shown in Fig. 4.22 and its final estimate is
 
K(γ ) = −0.0085 −0.7534 −0.2577 −0.7294 .

It can be seen that, with this larger initial condition, the algorithm stabilizes
the system with a lower value of the low gain parameter. This illustrates that
Algorithm 4.9 achieves global asymptotic stabilization without using the knowledge
of the system dynamics.
We will now test the model-free value iteration Algorithm 4.10 that uplifts the requirement of a stabilizing initial policy $K^0$. The algorithm is initialized with $\gamma = 1$ and $P^0(\gamma) = 0.01I_4$. We set the step size
$$\epsilon_k = \big(k^{0.5} + 5\big)^{-1}, \quad k = 0, 1, 2, \cdots,$$
and choose the set
$$\mathcal B_q = \big\{P \in \mathcal P_+^4 : |P| \le 200(q+1)\big\}, \quad q = 0, 1, 2, \cdots.$$

Fig. 4.21 Algorithm 4.9: Closed-loop response (states and control input) with $x_0 = [0.5\ -1\ -1\ 0.5]^{\mathrm T}$

Fig. 4.22 Algorithm 4.9: Convergence of the feedback gain matrix with $x_0 = [0.5\ -1\ -1\ 0.5]^{\mathrm T}$

 T
Results for the first initial condition $x(0) = [0.25\ -0.5\ -0.5\ 0.25]^{\mathrm T}$ are shown
in Fig. 4.23. The convergence of the low gain feedback gain matrix is shown in
Fig. 4.24. The final value of the low gain parameter is γ = 0.1250 and the final
estimate of the corresponding low gain feedback gain matrix is
 
K(γ ) = −0.0600 −1.4006 −0.7550 −1.2773 .

To validate the global asymptotic stabilization capability, we will increase the


 T
initial conditions to 0.5 −1 −1 0.5 as done previously for verifying the policy
iteration algorithm. Under this new initial condition, the final value of the low gain
parameter is 0.0156. Figures 4.25 and 4.26 show the closed-loop response and the

Fig. 4.23 Algorithm 4.10: Closed-loop response (states and control input) with $x_0 = [0.25\ -0.5\ -0.5\ 0.25]^{\mathrm T}$

Fig. 4.24 Algorithm 4.10: Convergence of the feedback gain matrix with $x_0 = [0.25\ -0.5\ -0.5\ 0.25]^{\mathrm T}$

convergence of the low gain feedback gain matrix. The final estimate is
 
K(γ ) = −0.0095 −0.7463 −0.2570 −0.7224 .

The simulation we have carried out above pertains to state feedback. We now
carry out simulation on the output feedback policy iteration Algorithm 4.11. For
the state parameterization, we construct the user-defined system matrix A with a
choice of the desired eigenvalues all at −1. This corresponds to α0 = 1, α1 =
4, α2 = 6, and α3 = 4, which are the entries of matrix A. These constants are
obtained from the characteristic polynomial (s) = (s + 1)4 corresponding to our
choice of eigenvalues −1 of matrix A. The nominal values of the corresponding

 T
Fig. 4.25 Algorithm 4.10: Closed-loop response with x0 = 0.5 −1 −1 0.5

 T
Fig. 4.26 Algorithm 4.10: Convergence of the feedback gain matrix with x0 = 0.5 −1 −1 0.5

state parameterization matrices are as follows,


$$W_u = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 4 & 1 & 0 & 0 \\ 4 & 4 & 1 & 0 \\ -4 & 4 & 4 & 1 \end{bmatrix}, \quad W_y = \begin{bmatrix} 0 & 4 & 4 & 4 \\ -4 & 0 & -4 & 4 \\ -4 & -4 & -8 & -4 \\ 4 & -4 & 4 & -8 \end{bmatrix}.$$

It should be noted that the above matrices are used only to compute the nominal
output feedback parameters for comparison with the results of Algorithm 4.11.
Algorithm 4.11 itself does not require the knowledge of these matrices because
these parameters are learned directly. The initial state of the system is x(0) =
 T
0.25 −0.5 −0.5 0.25 and the algorithm is initialized with γ = 1 and

K̄ 0 (γ ) = −315.95 −222.37 −73.06 −10.00 292.81

79.66 332.45 −80.93 ,

where K̄ 0 is the output feedback equivalent of the K 0 used earlier to initialize


the state feedback Algorithm 4.9. As seen in the state feedback results, while K 0
has been chosen such that A + BK 0 is Hurwitz, it does not satisfy the control
constraint and results in instability of the closed-loop system in the presence of
actuator saturation, as shown earlier in Fig. 4.18. Consequently, the same is also
true for K̄ 0 . In every main iteration of the algorithm, the value of the low gain
parameter γ is reduced by a factor of one half. Since n = 4, m = 1, and p = 1,
we need at least (mn + pn)(mn + pn + 1)/2 + m(mn + pn) = 44 learning
intervals to solve (4.54). In the simulation, we use a behavior policy comprising of
sinusoids of different frequencies and magnitudes that meets the control constraint.
This trajectory data is used in subsequent iterations of the low gain parameter γ .
We choose l = 60 learning intervals of period T = 0.2 seconds to satisfy the above
mentioned condition. The final value of the low gain parameter is γ = 0.1250. The
nominal output feedback low gain feedback gain matrix corresponding to this γ is
computed by solving the ARE as

K̄ ∗ (γ ) = −2.2071 −6.0464 −3.8707 −0.8715 2.1460

4.7727 3.4243 4.1577 .

It can be observed from the results in Figs. 4.27 and 4.28 that the algorithm finds
a suitable value of the low gain parameter γ and learns the corresponding low
gain feedback gain matrix K̄(γ ) to guarantee the convergence to zero of the state.
It should be noted that the exploration signal is only needed in the first l = 60
intervals and is removed afterwards. The final estimate of the output feedback low
gain feedback gain matrix is

K̄(γ ) = −2.2073 −6.0467 −3.8709 −0.8715 2.1462

4.7728 3.4244 4.1578 .

We will now demonstrate the global asymptotic stabilization characteristic of the


output feedback Algorithm 4.11. For this purpose, we test this algorithm with a
 T
different initial condition x(0) = 0.5 −1 −1 0.5 . It is expected that the value of
the low gain parameter will decrease for this larger initial condition.
Fig. 4.27 Algorithm 4.11: Closed-loop response (states and control input) with $x_0 = [0.25\ -0.5\ -0.5\ 0.25]^{\mathrm T}$
Fig. 4.28 Algorithm 4.11: Convergence of the feedback gain matrix with $x_0 = [0.25\ -0.5\ -0.5\ 0.25]^{\mathrm T}$
The final value of the low gain parameter is found to be γ = 0.0156 and the corresponding low
gain feedback gain matrix is found by solving the ARE as

K̄ ∗ (γ ) = −0.6579 −3.0932 −2.1612 −0.5074 0.6500

2.5196 1.1348 2.3912 .

Figures 4.29 and 4.30 show the results under this new initial condition.
Fig. 4.29 Algorithm 4.11: Closed-loop response (states and control input) with $x_0 = [0.5\ -1\ -1\ 0.5]^{\mathrm T}$
Fig. 4.30 Algorithm 4.11: Convergence of the feedback gain matrix with $x_0 = [0.5\ -1\ -1\ 0.5]^{\mathrm T}$
The final estimate of the low gain feedback gain matrix is
$$\bar K(\gamma) = \begin{bmatrix} -0.6580 & -3.0937 & -2.1615 & -0.5074 & 0.6501 & 2.5198 & 1.1349 & 2.3914 \end{bmatrix},$$

which is close to its nominal value K̄ ∗ (γ ). As can be seen, even with this
larger initial condition, the output feedback Algorithm 4.11 is able to find a suitable
value of the low gain parameter and stabilize the system, which illustrates the global
asymptotic stabilization capability of Algorithm 4.11.
Finally, we shall verify the final algorithm of this chapter, which is the output
feedback value iteration algorithm, Algorithm 4.12. The algorithm is initialized with
γ = 1 and P̄0 (γ ) = 0.01I8 . We set the step size
$$\epsilon_k = \big(k^{0.5} + 5\big)^{-1}, \quad k = 0, 1, 2, \cdots,$$
and choose the set
$$\mathcal B_q = \big\{\bar P \in \mathcal P_+^8 : |\bar P| \le 800(q+1)\big\}, \quad q = 0, 1, 2, \cdots.$$

All other simulation parameters remain the same as in the simulation of Algo-
rithm 4.11. It is worth noting that, compared to Algorithm 4.11, we no longer require
a stabilizing initial output feedback policy K̄ 0 during initialization. For the first
initial condition x(0) = 0.25 −0.5 −0.5 0.25 , the algorithm finds the low gain
parameter γ = 0.1250. The closed-loop response and the convergence of the low
gain feedback gain matrix are shown in Figs. 4.31 and 4.32, respectively. The final
estimate of the low gain feedback gain matrix is

K̄(γ ) = −2.2034 −6.0429 −3.8691 −0.8711 2.1439

4.7732 3.4223 4.1588 .

To illustrate global asymptotic stabilization, we increase the initial condition to
$x(0) = [0.5\ -1\ -1\ 0.5]^{\mathrm T}$. The corresponding value of the low gain parameter
found by the algorithm is γ = 0.0156. The results under this new initial condition
are shown in Figs. 4.33 and 4.34. As can be seen, the algorithm responds to the
increase in the initial condition and is able to adapt to an even lower value of the low
gain parameter, with the corresponding final estimate of the low gain feedback gain
matrix given by

K̄(γ ) = −0.6540 −3.1001 −2.1678 −0.5091 0.6468

2.5277 1.1326 2.4016 .

Clearly, Algorithm 4.12 also achieves global asymptotic stabilization.

4.5 Summary

This chapter was motivated by the strong connection that exists between reinforce-
ment learning and Riccati equations as we saw in Chap. 2. The key idea revolves
around learning an appropriate value of the low gain parameter and finding the
solution of the corresponding parameterized ARE based on reinforcement learning.
The LQR learning techniques developed in Chap. 2 alone may not be able to learn
an appropriate low gain feedback control law that could also satisfy the control
constraint. As a result, a scheduling mechanism was introduced that would update
the value of the low gain parameter as a function of the current state to ensure that
Fig. 4.31 Algorithm 4.12: Closed-loop response (states and control input) with $x_0 = [0.25\ -0.5\ -0.5\ 0.25]^{\mathrm T}$
Fig. 4.32 Algorithm 4.12: Convergence of the feedback gain matrix with $x_0 = [0.25\ -0.5\ -0.5\ 0.25]^{\mathrm T}$

the learned policy avoids actuator saturation. This technique enables us to perform
global asymptotic stabilization by scheduling the value of the low gain parameter
as a function of the state. Compared to the results in Chap. 2, the scheduling of the
low gain parameter results in a time-varying objective function. It was shown that
the proposed learning algorithms are able to adapt to the changes in the objective
function and learn the appropriate low gain feedback controller. Both discrete-
time and continuous-time systems were considered. Model-free policy iteration and
value iteration using both full state and output feedback were presented and their
performance was thoroughly verified by numerical simulation.
Fig. 4.33 Algorithm 4.12: Closed-loop response (states and control input) with $x_0 = [0.5\ -1\ -1\ 0.5]^{\mathrm T}$
Fig. 4.34 Algorithm 4.12: Convergence of the feedback gain matrix with $x_0 = [0.5\ -1\ -1\ 0.5]^{\mathrm T}$

4.6 Notes and References

The past decades have witnessed strong interest in designing control algorithms for
constrained control systems. Early results in this area of work have focused on
obtaining stabilizability conditions subject to the operating range of the system.
It was recognized that global asymptotic stabilization is in general not possible
even for very simple systems that are subject to actuator saturation. It has been
established that global asymptotic stabilization under such a constraint is possible
only for a limited class of systems. For linear systems, in particular, it has been
shown in [112] that, global asymptotic stabilization could be achieved only when
the system is asymptotically null controllable with bounded controls (ANCBC). An

ANCBC system may be polynomially unstable but not exponentially unstable. Even
for such systems, one generally has to resort to nonlinear control laws to achieve
global results [26, 109]. The construction of such nonlinear laws is based on a good
insight of the particular problem at hand [112, 117]. Parallel results in the discrete-
time setting have also been presented, such as in [133].
Optimal control theory provides another formulation of the constrained control
problem by taking into account the control constraint in the objective function. This,
however, is not straightforward. The presence of such a constraint makes the optimal
control problem even more challenging as it results in further nonlinearity in the
Hamilton-Jacobi-Bellman (HJB) equation. Along this line of work, the idea of using
nonquadratic cost functional to encode control constraints has also gained popularity
[73]. One needs to resort to approximation techniques such as neural networks to
solve the complicated HJB equation. This difficulty also extends to problems
that are otherwise linear, since the use of such
cost functionals results in a nonlinear control law.
Reinforcement learning techniques have also been presented in the recent
literature based on the nonquadratic cost functional approach mentioned above. One
of the early developments along this approach was presented in [3], in which a
model-based near optimal solution to the constrained HJB equation was obtained
by employing neural network approximation. Follow-up works were focused on
uplifting the knowledge of the system dynamics. Partially model-free solutions
employing reinforcement learning were proposed in[46, 78, 81] towards solving this
problem in both the continuous-time and discrete-time settings. In these approaches
only local stability could be demonstrated and only in the form of uniform ultimate
boundedness rather than asymptotic stability.
An alternative paradigm called low gain feedback was introduced by the authors
[63] to deal with the constrained control problem. The key motivation in this
framework is to design a controller with a simple structure and at the same time
improve the overall stability characteristics. This technique takes a preventive
approach to designing a control law that operates within the saturation limits. In
our earlier works [63, 65, 66], we presented semi-global asymptotic stabilization
of the ANCBC systems for an a priori given bounded set of initial conditions.
An appealing technique for designing such controllers is based on the idea of
parameterized ARE, which is an LQR ARE parameterized in a low gain parameter.
The key advantage of the low gain feedback approach is that the resulting control
law remains linear. While the earlier works focused on semi-global results, later on
it was demonstrated that it is possible to convert semi-global results into global ones
by scheduling the value of the low gain parameter as a function of the state [63]. All
these designs make use of the knowledge of the system dynamics.
The presentation in this chapter follows our results on the model-free low gain
feedback designs for the discrete-time [92, 98] and continuous-time [99, 102] con-
strained control problems. Extensions to the policy iteration algorithm are presented
that provide a faster alternative to designing model-free learning algorithms, which
also uplift the control constraint on the stabilizing initial policy.
Chapter 5
Model-Free Control of Time Delay Systems

5.1 Introduction

A feedback control system is formed by the interconnection of the fundamental


building blocks including controllers, sensors, actuators, and the plant itself. These
blocks work in conjunction with each other by means of a timely exchange of
information. Ideally, this communication should take place instantly so that only the
latest available information is employed in the control loop towards achieving the
desired system behavior. However, in a practical setting there is always an element
of delay, no matter how small it may be, in transferring the information from one end
to another. Such a delay may not only be due to the information propagation time
but also arise as a result of processing or computation time incurred at different
points within the control loop. Time delays of this sort are often safely neglected
in the control designs as long as they are sufficiently small as to not cause any
noticeable performance issues. The difficulty arises when these delays are prolonged
to the extent that they begin to hinder the control objectives from being reached.
More importantly, it has been established in control theory that delays that become
sufficiently large relative to the system dynamics could result in a loss of stability.
In such scenarios, it becomes necessary to take into account the time delays in the
control design so that the closed-loop stability could be ensured and the control
objectives could be reached.
The time delay problem has enjoyed a rich history in the controls literature owing
to its practical significance [32]. There are essentially two fundamental approaches
in dealing with time delays. The first approach is a robust design approach, in
which the controller is designed based on a stability criterion obtained from the
traditional Lyapunov analysis approach to maintain closed-loop stability for a range
of admissible delays. The second approach, which has received more popularity, is
based on the idea of active delay compensation and effectively cancels the effect
of delay by transforming a time delay system into a form free of delays. The key
motivation behind this approach is that, once the system is brought into a delay-free


form, existing control techniques would become readily applicable. In this line of
work, a popular technique known as Smith predictor [108] or, more formally, the
method of finite spectrum assignment [74], was introduced in the early literature. In
this method, the control law utilizes a predicted future value of the state based on
the dynamic model of the system in order to offset the delay at the time the control
input signal arrives at the plant. This cancellation of the delay results in a closed-
loop system free from delays and enables us to apply existing control design tools
such as optimal control techniques [75].
An underlying assumption in the predictor feedback based optimal control of
time delay systems is the use of a model of the system to predict the future state
based on the current state and the history of the control input. Unfortunately, in
a model-free reinforcement learning setting we do not have the knowledge of
the system dynamics, which prevents us from predicting the future state. On the
other hand, it is known that the predictor feedback assigns a finite spectrum to
the closed-loop system resulting in a closed-loop system that has a finite number
of modes. This feature of predictor feedback is particularly useful for continuous-
time delay systems, which are inherently infinite dimensional due to the presence
of delays. In contrast, a remarkable property of discrete-time delay systems is
that they remain finite dimensional even in the presence time delays. It is this
property that allows us to bring the original open-loop time delay system into a
delay-free form by the so-called state augmentation or the lifting technique. The
method transforms the delayed variables into additional state variables to avoid
prediction. The lifting technique, however, requires the knowledge of the delay in
the augmentation process.
This chapter builds upon the Q-learning technique that was presented in Chap. 2
to solve the discrete-time linear quadratic regulator problem. Because of the
presence of delays, the methods developed in Chap. 2 are, however, not readily
applicable as the Q-learning Bellman equation does not hold true if the delays
are neglected. In this chapter, we address the difficulty of designing a Q-learning
scheme in the presence of delays that are not known. Both state and input delays will
be considered. Instead of using the exact knowledge of the delays, we use an upper
bound on these delays to transform the time delay system into a delay-free form
by means of an extended state augmentation. This approach essentially converts
the LQR problem of unknown dynamics and unknown delays (both the lengths and
number of delays) to one involving higher order unknown dynamics free of delays.
Properties of the extended augmented system will be thoroughly analyzed in terms
of the controllability and observability conditions to establish the solvability of the
optimal control problem of the time delay system. We will then present the policy
iteration and value iteration based Q-learning algorithms using both state feedback
and output feedback to learn the optimal control policies of the time delay system.

5.2 Literature Review

The time delay problem is an important one but only a handful of developments have
been carried out along this line of work. One of the primary difficulties associated
with applying the predictor feedback approach in RL is that the prediction signal
is obtained from an embedded model of the system, which is unfortunately not
available in the model-free framework. This difficulty was addressed in [139], where
a bicausal change of coordinates was used to bring the discrete-time delay system
into a delay-free form. Differently from the predictor approach, this approach
renders the open-loop system into a form free of delays without requiring a predictor
feedback. This approach has been extended to solve the optimal tracking problem
[70]. It turns out that the existence of the bicausal transformation is a restrictive
assumption that is very hard to verify when the system dynamics is unknown.
This, in turn, limits feasibility of the approach to solve more general time delay
problems. Very recently, a different output feedback approach was proposed to solve
the problem in RL setting without requiring this bicausal transformation [29]. It is
worth noting that all these existing approaches require a precise knowledge of the
delays.

5.3 Problem Description

Consider a discrete-time linear system given by the following state space represen-
tation,
 
$$\begin{aligned} x_{k+1} &= \sum_{i=0}^{S} A_i x_{k-i} + \sum_{i=0}^{T} B_i u_{k-i}, \\ y_k &= C x_k, \end{aligned} \qquad (5.1)$$

where xk ∈ Rn is the state, uk ∈ Rm is the input and yk ∈ Rp is the output. The


system is subject to multiple delays in both the state and the input, with S and T
being the maximum state delay and the maximum input delay, respectively.
We assume that neither the system dynamics (A0 , A1 , · · · , AS , B0 , B1 , · · · , BT , C)
nor the information of delays (both numbers and lengths) are known. The only
assumptions we make are given as follows.
Assumption 5.1 The following rank condition holds,
$$\rho\Big(\Big[\textstyle\sum_{i=0}^{S} A_i\lambda^{S-i} - \lambda^{S+1}I \quad \sum_{i=0}^{T} B_i\lambda^{T-i}\Big]\Big) = n,$$
for any
$$\lambda \in \Big\{\lambda \in \mathbb C : \det\Big(\textstyle\sum_{i=0}^{S} A_i\lambda^{S-i} - \lambda^{S+1}I\Big) = 0\Big\},$$

where ρ(·) denotes the rank of a matrix.

Assumption 5.2 The following rank conditions hold,
$$\rho\Big(\Big[\textstyle\sum_{i=0}^{S} A_i^{\mathrm T}\lambda^{S-i} - \lambda^{S+1}I \quad C^{\mathrm T}\Big]\Big) = n,$$
for any
$$\lambda \in \Big\{\lambda \in \mathbb C : \det\Big(\textstyle\sum_{i=0}^{S} A_i\lambda^{S-i} - \lambda^{S+1}I\Big) = 0 \text{ and } \lambda \ne 0\Big\},$$
and $\rho(A_S) = n$.

Assumption 5.3 The upper bounds S ≤ S̄ and T ≤ T̄ on the state and input delays
are known.
As will be shown, Assumptions 5.1 and 5.2 are the generalization of the
controllability and observability conditions of delay-free linear systems to systems
with multiple state and input delays. Assumption 5.3 is needed for the extended
augmentation to be presented in this chapter. Note that this assumption is mild
because it is often possible for us to determine upper bounds S̄ and T̄ such that
the conditions S ≤ S̄ and T ≤ T̄ hold. Under these assumptions, we are interested
in solving the linear quadratic regulation (LQR) problem for the time delay system
(5.1) with the following cost function,


$$V(x, u) = \sum_{i=k}^{\infty} r(x_i, u_i), \qquad (5.2)$$
with $r(x_k, u_k)$ taking the following quadratic form,
$$r(x_k, u_k) = x_k^{\mathrm T} Q x_k + u_k^{\mathrm T} R u_k, \qquad (5.3)$$

where Q ≥ 0 and R > 0 correspond to the desired parameters for penalizing the
states and the control, respectively.

5.4 Extended State Augmentation

We first present a state augmentation procedure that brings the system into a delay-
free form. The augmentation is carried out by introducing the delayed states and the
delayed control inputs as additional states. To this end, let us define the augmented
state vector as
 T
$$X_k = \begin{bmatrix} x_k^{\mathrm T} & x_{k-1}^{\mathrm T} & \cdots & x_{k-S}^{\mathrm T} & u_{k-T}^{\mathrm T} & u_{k-T+1}^{\mathrm T} & \cdots & u_{k-1}^{\mathrm T} \end{bmatrix}^{\mathrm T}. \qquad (5.4)$$
The dynamic equation of the augmented system can be obtained from the original dynamics (5.1) as follows,
$$X_{k+1} = \begin{bmatrix}
A_0 & A_1 & \cdots & A_S & B_T & B_{T-1} & \cdots & B_1\\
I_n & 0 & \cdots & 0 & 0 & 0 & \cdots & 0\\
\vdots & \ddots & \ddots & \vdots & \vdots & \vdots & & \vdots\\
0 & \cdots & I_n & 0 & 0 & 0 & \cdots & 0\\
0 & \cdots & 0 & 0 & 0 & I_m & \cdots & 0\\
\vdots & & & \vdots & \vdots & & \ddots & \vdots\\
0 & \cdots & 0 & 0 & 0 & 0 & \cdots & I_m\\
0 & \cdots & 0 & 0 & 0 & 0 & \cdots & 0
\end{bmatrix} X_k + \begin{bmatrix} B_0\\ 0\\ \vdots\\ 0\\ 0\\ \vdots\\ 0\\ I_m \end{bmatrix} u_k = A X_k + B u_k, \qquad (5.5)$$
where the first $S+1$ block columns of $A$ correspond to $x_k, \ldots, x_{k-S}$ and the last $T$ block columns to $u_{k-T}, \ldots, u_{k-1}$.

Since the maximum state delay S and input delay T are not known, we will extend
the augmentation further up to their upper bounds S̄ and T̄ , respectively. For this
purpose, we introduce the extended augmented state vector as

$$\bar X_k = \begin{bmatrix} x_k^{\mathrm T} & x_{k-1}^{\mathrm T} & \cdots & x_{k-S}^{\mathrm T} & x_{k-S-1}^{\mathrm T} & \cdots & x_{k-\bar S}^{\mathrm T} & u_{k-\bar T}^{\mathrm T} & u_{k-\bar T+1}^{\mathrm T} & \cdots & u_{k-T}^{\mathrm T} & \cdots & u_{k-1}^{\mathrm T} \end{bmatrix}^{\mathrm T}, \qquad (5.6)$$

and the corresponding extended augmented dynamics is given by

$$\bar X_{k+1} = \begin{bmatrix}
A_0 & \cdots & A_S & 0 & \cdots & 0 & 0 & \cdots & 0 & B_T & \cdots & B_1\\
I_n & \cdots & 0 & 0 & \cdots & 0 & 0 & \cdots & 0 & 0 & \cdots & 0\\
\vdots & \ddots & & & & \vdots & \vdots & & & & & \vdots\\
0 & \cdots & & \cdots & I_n & 0 & 0 & \cdots & 0 & 0 & \cdots & 0\\
0 & \cdots & & & \cdots & 0 & 0 & I_m & \cdots & 0 & \cdots & 0\\
\vdots & & & & & \vdots & \vdots & & \ddots & & & \vdots\\
0 & \cdots & & & \cdots & 0 & 0 & \cdots & & 0 & \cdots & I_m\\
0 & \cdots & & & \cdots & 0 & 0 & \cdots & & 0 & \cdots & 0
\end{bmatrix}\bar X_k + \begin{bmatrix} B_0\\ 0\\ \vdots\\ \vdots\\ 0\\ I_m \end{bmatrix} u_k \triangleq \bar A\bar X_k + \bar B u_k, \qquad (5.7)$$
where the first $\bar S+1$ block columns are associated with $x_k, \ldots, x_{k-\bar S}$ (with $A_i = 0$ for $i > S$) and the last $\bar T$ block columns with $u_{k-\bar T}, \ldots, u_{k-1}$ (with $B_i = 0$ for $i > T$).
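A sketch of how Ā and B̄ can be assembled when a model and the delay bounds are available is given below; the learning algorithms developed later never form these matrices, and the sketch is only meant to make the block structure of (5.7) explicit (variable names are ours):

```python
import numpy as np

def extended_augmentation(A_list, B_list, S_bar, T_bar):
    """Build (A_bar, B_bar) of (5.7) from A_0..A_S and B_0..B_T, padding with
    zero blocks up to the delay upper bounds S_bar >= S and T_bar >= T."""
    n, m = A_list[0].shape[0], B_list[0].shape[1]
    A_pad = A_list + [np.zeros((n, n))] * (S_bar + 1 - len(A_list))
    B_pad = B_list + [np.zeros((n, m))] * (T_bar + 1 - len(B_list))
    N = n * (S_bar + 1) + m * T_bar
    A_bar, B_bar = np.zeros((N, N)), np.zeros((N, m))
    # first block row: x_{k+1} = sum_i A_i x_{k-i} + sum_i B_i u_{k-i}
    for i in range(S_bar + 1):
        A_bar[:n, i * n:(i + 1) * n] = A_pad[i]
    for i in range(1, T_bar + 1):                     # delayed inputs u_{k-i}
        col = n * (S_bar + 1) + m * (T_bar - i)
        A_bar[:n, col:col + m] = B_pad[i]
    B_bar[:n, :] = B_pad[0]
    # shift registers for delayed states and delayed inputs
    for i in range(S_bar):
        A_bar[n * (i + 1):n * (i + 2), n * i:n * (i + 1)] = np.eye(n)
    for i in range(T_bar - 1):
        r = n * (S_bar + 1) + m * i
        A_bar[r:r + m, r + m:r + 2 * m] = np.eye(m)
    if T_bar > 0:
        B_bar[N - m:, :] = np.eye(m)                  # u_k enters the newest slot
    return A_bar, B_bar
```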

Comparing the extended augmented dynamics (5.7) with the original dynamics
(5.1), we see that the problem of finding an optimal controller for a time delay
system with both unknown delays and unknown dynamics is equivalent to finding
an optimal controller for an augmented delay-free system with unknown dynamics.
We will now study the controllability property of the augmented systems (5.5)
and (5.7).
Theorem 5.1 The delay-free augmented systems (5.5) and (5.7) are controllable if and only if
$$\rho\Big(\Big[\textstyle\sum_{i=0}^{S} A_i\lambda^{S-i} - \lambda^{S+1}I \quad \sum_{i=0}^{T} B_i\lambda^{T-i}\Big]\Big) = n, \qquad (5.8)$$
for any
$$\lambda \in \Big\{\lambda \in \mathbb C : \det\Big(\textstyle\sum_{i=0}^{S} A_i\lambda^{S-i} - \lambda^{S+1}I\Big) = 0\Big\}.$$

Proof Consider the augmented system (5.5).


 From linear systems theory, the system
is controllable if and only if A − λI B has a full row rank for all λ ∈ C. We
 
evaluate ρ A − λI B as
⎡ ⎤
A0 −λI A1 A2 · · · AS BT BT −1 BT −2 · · · B1 B0
⎢ In −λI 0 · · · 0 0 ··· 0 ··· 0 0 ⎥
⎢ ⎥
⎢ .. .. ⎥
⎢ 0 In −λI . ··· ··· 0 0 ⎥
⎢ . 0 0 ⎥
⎢ .. .. .. .. .. .. .. .. .. .. ⎥
⎢ . . . . ··· . ⎥
⎢ . . . . . ⎥
  ⎢ ⎥
⎢ 0 · · · 0 In −λI 0 ··· 0 ··· 0 0 ⎥
ρ A−λI B = ρ ⎢ ⎥.
⎢ 0 · · · 0 0 0 −λI Im 0 ··· 0 0 ⎥
⎢ ⎥
⎢ .. ⎥
⎢ 0 ··· 0 0 0 0 −λI Im . 0 0 ⎥
⎢ ⎥
⎢ .. .. .. .. .. .. .. .. .. ⎥
⎢ . ··· . . . . ··· . . . . ⎥
⎢ ⎥
⎣ 0 ··· 0 0 0 0 0 · · · −λI Im 0 ⎦
0 ··· 0 0 0 0 0 ··· 0 −λI Im

Consider the columns associated with the Bi ’s. Adding λ times the last column to
last but one column, we have

⎡ ⎤
∗ BT BT −1 BT −2 · · · B1 + B0 λ B0
⎢∗ 0 ··· ··· 0 ⎥
⎢ 0 0 ⎥
⎢. . .. .. .. .. .. ⎥
⎢ .. .. . ⎥
⎢ . . . . ⎥
  ⎢ ⎥
⎢∗ 0 0 ··· ··· 0 0 ⎥
ρ A − λI B = ρ ⎢ ⎥.
⎢ 0 −λI Im 0 ··· 0 0 ⎥
⎢ ⎥
⎢ 0 0 −λI Im ··· 0 0 ⎥
⎢ ⎥
⎢ .. .. ⎥
⎣0 0 0 . . Im 0 ⎦
0 0 0 0 0 0 Im

Repeating the above step for each of the remaining columns results in
⎡ ⎤
∗ BT + BT −1 λ + · · · + B0 λT ··· ··· · · · B1 + B0 λ B0
⎢∗ 0 0 ··· ··· 0 0 ⎥
⎢ ⎥
⎢. .. .. .. .. .. .. ⎥
⎢ .. . ⎥
⎢ . . . . . ⎥
⎢∗ 0 ··· ··· 0 ⎥
  ⎢ 0 0 ⎥
⎢ ⎥
ρ A − λI B = ρ ⎢ 0 0 Im 0 · · · 0 0 ⎥.
⎢ ⎥
⎢0 0 0 Im · · · 0 0 ⎥
⎢ ⎥
⎢ .. .. . . ⎥
⎢. . · · · .. .. 0 0 ⎥
⎢ ⎥
⎣0 0 0 0 0 Im 0 ⎦
0 0 0 0 0 0 Im

Similarly, row operations can be used to cancel the entries in the first row and result
in
⎡ ⎤
∗ BT + BT −1 λ + · · · + B0 λT 0 0 · · · 0
⎢∗ 0 0 ··· 0 ⎥
⎢ 0 ⎥
⎢. . .. .. .. .. ⎥
⎢ .. .. . . . . ⎥
⎢ ⎥
  ⎢ ⎥
⎢∗ 0 0 0 ··· 0 ⎥
ρ A − λI B = ρ ⎢ ⎥.
⎢0 0 Im 0 · · · 0 ⎥
⎢ ⎥
⎢0 0 0 Im 0 0 ⎥
⎢ ⎥
⎢ .. .. .. . . . . .. ⎥
⎣. . . . . . ⎦
0 0 0 0 0 Im

We can now consider the columns pertaining to the Ai ’s in



⎡ ⎤
A0 −λI A1 ··· AS BT + BT −1 λ+ · · · +B0 λT 0 0 ··· 0
⎢ I −λI ··· ··· 0 ⎥
⎢ n 0 0 0 0 ⎥
⎢ .. ⎥
⎢ .. .. .. .. .. .. .. ⎥
⎢ 0 . . . . . . . . ⎥
  ⎢ ⎥
⎢ 0 ··· In −λI 0 0 0 ··· 0 ⎥
ρ A−λI B = ρ ⎢
⎢ 0
⎥.
⎢ 0 0 0 0 Im 0 ··· 0 ⎥ ⎥
⎢ ⎥
⎢ 0 0 0 0 0 0 Im 0 0 ⎥
⎢ .. ⎥
⎢ .. .. .. .. .. .. .. .. ⎥
⎣ . . . . . . . . . ⎦
0 0 0 0 0 0 0 ··· Im

Applying similar row and column operations to cancel out the entries corresponding
to the columns of Ai ’s results in
⎡  T T −i 0 0

0 0 · · · Si=0 Ai λS−i − λS+1 I i=0 Bi λ ··· 0
⎢I 0 ··· ··· 0 ⎥
⎢ n 0 0 0 0 ⎥
⎢ . .. ⎥
⎢ . . . .. .. .. ⎥
⎢ 0 . . .. .. .. . . . . ⎥
  ⎢ ⎥
⎢ 0 · · · In 0 0 0 0 ··· 0 ⎥
ρ A − λI B = ρ ⎢
⎢0 0 0

⎢ 0 0 I m 0 ··· 0 ⎥ ⎥
⎢ ⎥
⎢0 0 0 0 0 0 Im 0 0 ⎥
⎢ . . . .. ⎥
⎢ . . . .. .. .. . . .. ⎥
⎣ . . . . . . . . . ⎦
0 0 0 0 0 0 0 ··· Im
 T 
=ρ S
A λS−i − λS+1 I B λT −i + nS + mT .
i=0 i i=0 i

 
At full row rank, ρ A − λI B = n + nS + mT . Thus, full rank of ρ[A − λI B]
is equivalent to
 S T T −i

ρ i=0 Ai λ
S−i − λS+1 I i=0 Bi λ = n, λ ∈ C.

On the other hand, let P (λ) = AS + AS−1 λ + · · · + A0 λS − λS+1 I be a matrix


polynomial of λ. Then, P (λ) drops its rank to below n for any
&  S '

λ ∈ λ ∈ C : det Ai λ S−i
−λ S+1
I =0 .
i=0

Thus, only for such λ’s does it need to be ensured that


 S T T −i

ρ i=0 Ai λ
S−i − λS+1 I i=0 Bi λ = n,

which is condition (5.8) for


&  S '

λ ∈ λ ∈ C : det Ai λS−i −λS+1 I =0 .
i=0

 
For the extended augmented system (5.7), we evaluate ρ Ā−λI B̄ by using
similar row and column operations to result in
⎡ ⎤
A0 −λI · · · AS 0 · · · 0 · · · BT · · · B1 B0
⎢ In −λI 0 · · · 0 0 ··· 0 ··· 0 0 ⎥
⎢ ⎥
⎢ .. .. ⎥
⎢ 0 In −λI . ··· ··· 0 0 ⎥
⎢ . 0 0 ⎥
⎢ .. .. .. .. .. .. .. .. ⎥
⎢ 0 . . . . ··· ··· . ⎥
⎢ . . . ⎥
  ⎢ ⎥
⎢ 0 · · · 0 In −λI 0 ··· 0 ··· 0 0 ⎥
ρ Ā − λI B̄ = ρ ⎢ ⎥
⎢ 0 · · · 0 0 0 −λI Im 0 ··· 0 0 ⎥
⎢ ⎥
⎢ .. ⎥
⎢ 0 ··· 0 −λI Im
0 0 0 . 0 0 ⎥
⎢ ⎥
⎢ .. .. .. . . . .
.. .. .. ⎥
⎢ . ··· . .
. . . . . 0 0 ⎥
⎢ ⎥
⎣ 0 ··· 0 0 0 0 0 · · · −λI Im 0 ⎦
0 ··· 0 0 0 0 0 ··· 0 −λI Im
 S T 
T −i + nS̄ + mT̄ ,
=ρ i=0 Ai λ
S−i − λS+1 I
i=0 Bi λ

where we have used the fact that the padded zero columns do not affect the row
rank, whereas the S̄ number of In and T̄ number of Im matrices contribute nS̄ + mT̄
to the row
 rank. For the controllability of the extended augmented system (5.7), we
need ρ Ā − λI B̄ = n + nS̄ + mT̄ , which is equivalent to
 S T T −i

ρ i=0 Ai λ
S−i − λS+1 I i=0 Bi λ = n, λ ∈ C,

which in turn is again equivalent to condition (5.8) for


&  S '

λ ∈ λ ∈ C : det Ai λ S−i
−λ S+1
I =0 .
i=0

This completes the proof.

Remark 5.1 The controllability condition can be relaxed to the stabilizability condition, in which case (5.8) needs to hold only for any
$$\lambda \in \Big\{\lambda \in \mathbb C : \det\Big(\textstyle\sum_{i=0}^{S} A_i\lambda^{S-i} - \lambda^{S+1}I\Big) = 0 \text{ and } |\lambda| \ge 1\Big\}.$$
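By Theorem 5.1 and Remark 5.1, stabilizability of the augmented pair (Ā, B̄) can also be verified numerically, when a model is available for offline checking, with a standard PBH test restricted to |λ| ≥ 1; a minimal sketch:

```python
import numpy as np

def pbh_stabilizable(A_bar, B_bar, tol=1e-9):
    n = A_bar.shape[0]
    for lam in np.linalg.eigvals(A_bar):
        if abs(lam) >= 1.0 - tol:                        # only marginal/unstable modes
            M = np.hstack([A_bar - lam * np.eye(n), B_bar])
            if np.linalg.matrix_rank(M, tol=tol) < n:
                return False
    return True
```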

We now bring the focus to the design of the optimal controller. We rewrite the
original utility function for (5.1) in terms of the extended augmented state (5.6)
as
$$r\big(\bar X_k, u_k\big) = \bar X_k^{\mathrm T}\bar Q\bar X_k + u_k^{\mathrm T} R u_k, \qquad (5.9)$$
where
$$\bar Q = \begin{bmatrix} Q & 0 \\ 0 & 0 \end{bmatrix}.$$
Since the system is now in a delay-free form, we can readily compute the optimal controller. From optimal control theory [55], we know that there exists a unique optimal control sequence
$$u_k^* = \bar K^*\bar X_k = -\big(R + \bar B^{\mathrm T}\bar P^*\bar B\big)^{-1}\bar B^{\mathrm T}\bar P^*\bar A\bar X_k \qquad (5.10)$$
that minimizes (5.2) under the conditions of the stabilizability of $(\bar A, \bar B)$ and the detectability of $\big(\bar A, \sqrt{\bar Q}\big)$, where $\bar P^* = \bar P^{*\mathrm T}$ is the unique positive semi-definite solution to the following ARE,
$$\bar A^{\mathrm T}\bar P\bar A - \bar P + \bar Q - \bar A^{\mathrm T}\bar P\bar B\big(R + \bar B^{\mathrm T}\bar P\bar B\big)^{-1}\bar B^{\mathrm T}\bar P\bar A = 0. \qquad (5.11)$$
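When a model is available, the nominal solution of (5.10)–(5.11) can be computed directly and used as a baseline against which the model-free algorithms developed next are compared; a sketch:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def nominal_lqr(A_bar, B_bar, Q_bar, R):
    P = solve_discrete_are(A_bar, B_bar, Q_bar, R)                      # solves (5.11)
    K = -np.linalg.solve(R + B_bar.T @ P @ B_bar, B_bar.T @ P @ A_bar)  # gain in (5.10)
    return K, P
```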

Remark 5.2 The extended augmented states uk−T −1 to uk−T̄ and xk−S−1 to xk−S̄
are fictitious states and, therefore, are not reflected in the optimal control law, as
will be seen in the simulation results. That is, the control coefficients corresponding
to these states are zero.
We next work towards deriving the conditions for the observability of the
extended augmented system. To this end, we introduce an augmented output as
 T
$$Y_k = \begin{bmatrix} y_k^{\mathrm T} & u_{k-T}^{\mathrm T} \end{bmatrix}^{\mathrm T} = \mathcal C X_k,$$
where
$$\mathcal C = \begin{bmatrix} C & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 & I_m & 0 & \cdots & 0 \end{bmatrix},$$
with the first $S+1$ block columns corresponding to the delayed states and the last $T$ block columns to the delayed inputs.

In the same spirit, we can obtain an extended augmented output vector by


incorporating the fictitious states corresponding to the delayed states xk−S−1 to xk−S̄
and the delayed inputs uk−T −1 to uk−T̄ as follows,
 T
$$\bar Y_k = \begin{bmatrix} y_k^{\mathrm T} & u_{k-\bar T}^{\mathrm T} \end{bmatrix}^{\mathrm T} = \bar{\mathcal C}\bar X_k,$$
where
$$\bar{\mathcal C} = \begin{bmatrix} C & 0 & \cdots & 0 & 0 & 0 & \cdots & 0 \\ 0 & 0 & \cdots & 0 & I_m & 0 & \cdots & 0 \end{bmatrix},$$
with $\bar S+1$ state block columns and $\bar T$ input block columns.

We will now study the observability property of the augmented systems (5.5) and
(5.7).
Theorem 5.2 The delay-free augmented system (5.5) is observable if and only if
$$\rho\Big(\Big[\textstyle\sum_{i=0}^{S} A_i^{\mathrm T}\lambda^{S-i} - \lambda^{S+1}I \quad C^{\mathrm T}\Big]\Big) = n, \qquad (5.12)$$
for any
$$\lambda \in \Big\{\lambda \in \mathbb C : \det\Big(\textstyle\sum_{i=0}^{S} A_i\lambda^{S-i} - \lambda^{S+1}I\Big) = 0 \text{ and } \lambda \ne 0\Big\},$$
and
$$\rho(A_S) = n. \qquad (5.13)$$
The extended augmented system (5.7) is detectable if and only if (5.12) holds for any
$$\lambda \in \Big\{\lambda \in \mathbb C : \det\Big(\textstyle\sum_{i=0}^{S} A_i\lambda^{S-i} - \lambda^{S+1}I\Big) = 0 \text{ and } |\lambda| \ge 1\Big\}.$$

Proof
 T TBy  duality, the observability of (A, C) implies
 the controllability
 of
A , C . Thus, (A, C) is observable if and only if AT −λI CT has a full row
 
rank of (S + 1)n + mT . We evaluate ρ AT −λI CT as
⎡ ⎤
AT0 −λI In 0 ··· 0 0 0 0 ··· 0 CT 0
⎢ AT −λI In · · · 0 ··· 0 0 ⎥
⎢ 1 0 0 0 0 ⎥
⎢ ⎥
⎢ . ⎥
⎢ AT2 0 −λI . . 0 0 0 0 ··· 0 0 0 ⎥
⎢ ⎥
⎢ .. .. .. .. .. .. .. .. .. .. ⎥
⎢ . . . . In . . 0 . . . . ⎥
⎢ ⎥
⎢ AT · · · 0 0 −λI ··· 0 0 0 ⎥
  ⎢ 0 0 0 ⎥
⎢ S

ρ AT −λI CT = ρ ⎢ ⎥.
⎢ BT ··· 0 0 0 −λI 0 0 ··· 0 0 Im ⎥
⎢ T ⎥
⎢ .. ⎥
⎢ BT ··· −λI 0 0 0 ⎥
⎢ T −1 0 0 0 Im 0 . ⎥
⎢ .. .. .. .. .. ⎥
⎢ ⎥
⎢ . . . . . 0 Im −λI 0 ··· 0 0 ⎥
⎢ ⎥
⎢ .. .. .. ⎥
⎣ . ··· 0 0 0 0 0 . . 0 0 0 ⎦
B1T ··· 0 0 0 0 0 ··· Im −λI 0 0

Moving the last column to the beginning of the right partitions of the above
partitioned matrix and then adding λ times this column to the next column to its
right result in
⎡ ⎤
AT0 −λI In 0 ··· 0 0 0 0 ··· 0 0 CT
⎢ AT −λI In · · · 0 0 ··· 0 ··· 0 0 0 ⎥
⎢ 1 ⎥
⎢ . ⎥
⎢ AT 0 −λI . . 0 ··· ··· ··· 0 0 0 ⎥
⎢ 2 0 ⎥
⎢ .. .. .. .. . . .. . .. .. ⎥
⎢ . . In .. .. . · · · .. . ⎥
⎢ . . . ⎥
⎢ ⎥
  ⎢ AS T ··· 0 0 −λI 0 0 · · · · · · 0 0 0 ⎥
⎢ ⎥
ρ AT −λI CT = ρ ⎢ ⎥.
⎢ BTT 0 ··· 0 0 Im 0 ··· 0
0 0 0 ⎥
⎢ ⎥
⎢ T . ⎥
⎢ BT −1
⎢ 0 ··· 0 0 0 Im −λI 0 .. 0 0 ⎥ ⎥
⎢ .. . .. ⎥
⎢ . 0 · · · .. . 0 0 Im −λI 0 0 0 ⎥
⎢ ⎥
⎢ .. .. .. . . .. .. ⎥
⎣ . 0 ··· 0 0 . 0 . . . 0 . ⎦
B1T 0 ··· 0 0 0 0 0 · · · Im −λI 0

Repeating the above step for each of the remaining columns results in
⎡ ⎤
AT0 −λI In 0 ··· 0 0 0 · · · 0 0 CT
0
⎢ AT −λI In · · · 0 ··· 0 ··· 0 0 0 ⎥
0
⎢ 1 ⎥
⎢ . .. ⎥
⎢ AT 0 −λI . . 0 ··· 0 ··· 0 0 0 ⎥
.
⎢ 2 ⎥
⎢ .. .. .. .. .. .. .. .. .. .. ⎥
⎢ . . In . . . · · · . . . ⎥
⎢ . . ⎥
⎢ AT ··· 0 0 −λI 0 0 0 · · · 0 0 0 ⎥
  ⎢ ⎥
ρ AT −λI CT = ρ ⎢ ⎥.
S
⎢ ⎥
⎢ BT T 0 ··· 0 0 Im 0 0 · · · 0 0 0 ⎥
⎢ ⎥
⎢ T . ⎥
⎢ BT −1 0 ··· 0 0 0 Im 0 0 .. 0 0 ⎥
⎢ T ⎥
⎢ B 0 ··· 0 0 0 0 Im 0 0 0 0 ⎥
⎢ T −2 ⎥
⎢ .. .. .. ⎥
⎣ . 0 ··· 0 0 0 0 . . 0 0 0 ⎦
B1T 0 ··· 0 0 0 0 0 · · · Im 0 0

Similarly, we can cancel all the Bi entries in the left partitions using the identity
columns from the right partition to result in

⎡ ⎤
AT0 −λI In 0 ··· 0 0 0 0 · · · 0 0 CT
⎢ AT −λI In · · · 0 0 ··· 0 ··· 0 0 0 ⎥
⎢ 1 ⎥
⎢ . .. ⎥
⎢ AT 0 −λI . . 0 ··· ··· ··· 0 0 0 ⎥
⎢ 2 . ⎥
⎢ .. .. .. .. .. .. .. .. .. ⎥
⎢ . . In . · · · · · · . . . . ⎥
⎢ . . ⎥
⎢ ⎥
 T  ⎢ A T 0 0 ··· −λI 0 · · · 0 · · · 0 0 0 ⎥
⎢ S ⎥
ρ A −λI C = ρ ⎢
T ⎥.
⎢ 0 ··· 0 0 0 Im 0 0 · · · 0 0 0 ⎥
⎢ ⎥
⎢ .. ⎥
⎢ 0 ··· 0 0 0 0 Im 0 0 . 0 0 ⎥
⎢ ⎥
⎢ .. .. .. .. .. ⎥
⎢ . . . . . 0 0 Im 0 0 · · · 0 ⎥
⎢ ⎥
⎢ .. .. ⎥
⎣ 0 ··· 0 0 0 0 ··· . . 0 0 0 ⎦
0 ··· 0 0 0 0 0 0 · · · Im 0 0

The rows in the lower partitions involve T number of Im matrices and contribute
mT to the rank. We next examine the upper partition without the zero columns,
⎡ ⎤
AT0 −λI In 0 ··· 0 CT
⎢ AT −λI In · · · 0 0 ⎥
⎢ 1 ⎥
⎢ . .. ⎥

X = ⎢ AT2 0 −λI . . 0 ⎥
. ⎥.
⎢ .. .. .. .. .. ⎥
⎣ . . . . In . ⎦
AS T 0 0 ··· −λI 0

For λ = 0, we can perform elementary row and column operations on the above
matrix to result in
⎡ ⎤
AT0 − λI + λ1 AT1 + · · · + λ1S ATS 0 0 · · · 0 C T
⎢ ⎥
⎢ 0 In 0 · · · 0 0 ⎥
⎢ ⎥
⎢ . . .. ⎥
ρ(X) = ρ ⎢ 0 0 In . . 0 ⎥.
⎢ . ⎥
⎢ .. .. . . . . ⎥
⎣ . . . . 0 .. ⎦
0 0 0 · · · In 0

The S number of In matrices contribute Sn to the rank. Thus, (A, C) is observable


if and only if
 
ρ AT0 − λI + λ1 AT1 + · · · + 1 T
A
λS S
C T = n,

or, equivalently,
 S 
ρ T S−i−λS+1 I
i=0 Ai λ C T = n,

is condition (5.12) for any


&  S '

λ ∈ λ ∈ C : det Ai λ S−i
−λ S+1
I = 0 and λ = 0 .
i=0

If λ = 0, then
⎡ ⎤
AT0 In 0 · · · 0 CT
⎢ AT 0 In · · · 0 0 ⎥
⎢ 1 ⎥
⎢ . . ⎥
ρ⎢
⎢ A2
T 0 0 . . .. 0 ⎥⎥ = (S + 1)n
⎢ . .. . . . . .. ⎥
⎣ .. . . . In . ⎦
ATS · · · 0 0 0 0

if and only if

ρ (AS ) = n.

which is condition (5.13).


For the case of the extended augmented system, it can be readily verified that for
λ = 0, condition (5.12) remains the same due to the additional identity matrices, as
in the proof of the controllability results in Theorem 5.1. The second condition in
this case becomes ρ(AS̄ ) = n. However, this condition cannot be satisfied if S̄ > S
because AS̄ = 0 for S̄ > S. As a result, the observability condition (5.13) for the
extended augmented system loses rank for λ = 0. Since λ = 0 is a stable eigenvalue,
we still have the detectability condition satisfied. This completes the proof.

Remark 5.3 The observability condition for the augmented system (5.5) can also be
relaxed to detectability, in which case we require only condition (5.12) to hold for
any
$$\lambda \in \Big\{\lambda \in \mathbb C : \det\Big(\textstyle\sum_{i=0}^{S} A_i\lambda^{S-i} - \lambda^{S+1}I\Big) = 0 \text{ and } |\lambda| \ge 1\Big\}.$$

5.5 State Feedback Q-learning Control of Time Delay Systems

In this section, we will present a Q-learning scheme for learning the optimal control
parameters that uplifts the requirement of the knowledge of the system dynamics
and the delays.

Consider a stabilizing control policy uk = K̄ X̄k , which is not necessarily optimal


with respect to our utility function. Corresponding to this controller, there is a cost
given by the following value function that represents the infinite horizon cost of
executing the control starting from state X̄k from time k to time ∞,
 
VK̄ X̄k = X̄kT P̄ X̄k , (5.14)

for some positive definite matrix P̄ . The above infinite horizon value function can
be recursively written as
   
VK̄ X̄k = X̄kT Q̄X̄k + X̄kT K̄ T R K̄ X̄k + VK̄ X̄k+1 .

Similar to the value function above, we can define a Q-function that gives the value
of executing an arbitrary control uk instead of uk = K̄ X̄k at time k and then
following policy K̄ from time k + 1 on,
   
QK̄ X̄k , uk = X̄kT Q̄X̄k + uTk Ruk + VK̄ X̄k+1 . (5.15)

Substituting the dynamics (5.7) in (5.15) results in


 
$$Q_{\bar K}\big(\bar X_k, u_k\big) = \bar X_k^{\mathrm T}\bar Q\bar X_k + u_k^{\mathrm T}Ru_k + \bar X_{k+1}^{\mathrm T}\bar P\bar X_{k+1} = \bar X_k^{\mathrm T}\bar Q\bar X_k + u_k^{\mathrm T}Ru_k + \big(\bar A\bar X_k + \bar Bu_k\big)^{\mathrm T}\bar P\big(\bar A\bar X_k + \bar Bu_k\big),$$
or, equivalently,
$$Q_{\bar K}\big(\bar X_k, u_k\big) = \begin{bmatrix}\bar X_k\\ u_k\end{bmatrix}^{\mathrm T} H \begin{bmatrix}\bar X_k\\ u_k\end{bmatrix}, \qquad (5.16)$$
where
$$H = \begin{bmatrix} H_{XX} & H_{Xu} \\ H_{uX} & H_{uu} \end{bmatrix} = \begin{bmatrix} \bar Q + \bar A^{\mathrm T}\bar P\bar A & \bar A^{\mathrm T}\bar P\bar B \\ \bar B^{\mathrm T}\bar P\bar A & R + \bar B^{\mathrm T}\bar P\bar B \end{bmatrix}.$$

When K̄ = K̄ ∗ , we have P̄ = P̄ ∗ , QK̄ = Q∗ , and the optimal LQR controller


can be obtained by solving $\partial Q^*/\partial u_k = 0$ for $u_k$, which corresponds to (5.10).



It can be seen that the problem of finding an optimal controller boils down to
finding the optimal matrix H ∗ or the optimal Q-function Q∗ .
Q-learning is a model-free learning technique that estimates the optimal Q-
function without requiring the knowledge of system dynamics. It does so by means
of the following Bellman Q-learning equation,
   
$$Q_{\bar K}\big(\bar X_k, u_k\big) = \bar X_k^{\mathrm T}\bar Q\bar X_k + u_k^{\mathrm T}Ru_k + Q_{\bar K}\big(\bar X_{k+1}, \bar K\bar X_{k+1}\big), \qquad (5.17)$$
which is obtained by substituting $V_{\bar K}(\bar X_k) = Q_{\bar K}\big(\bar X_k, \bar K\bar X_k\big)$ in (5.15). Let
$$z_k = \begin{bmatrix} \bar X_k \\ u_k \end{bmatrix}.$$

Then, by definition (5.16), we can write Equation (5.17) as
$$z_k^{\mathrm T} H z_k = \bar X_k^{\mathrm T}\bar Q\bar X_k + u_k^{\mathrm T}Ru_k + z_{k+1}^{\mathrm T} H z_{k+1}, \qquad (5.18)$$
which is linear in the unknown matrix $H$. We can perform the following parameterization on Equation (5.17),
$$Q_{\bar K}(z_k) = Q_{\bar K}\big(\bar X_k, u_k\big) = \bar H^{\mathrm T}\bar z_k,$$
where
$$\bar H = \operatorname{vec}(H) = \begin{bmatrix} h_{11} & 2h_{12} & \cdots & 2h_{1l} & h_{22} & 2h_{23} & \cdots & 2h_{2l} & \cdots & h_{ll} \end{bmatrix}^{\mathrm T} \in \mathbb R^{l(l+1)/2},$$
with $h_{ij}$ being the elements of matrix $H$ and $l = n + n\bar S + m\bar T + m$. The regressor $\bar z_k \in \mathbb R^{l(l+1)/2}$ is defined as the following quadratic basis set,
$$\bar z = \begin{bmatrix} z_1^2 & z_1z_2 & \cdots & z_1z_l & z_2^2 & z_2z_3 & \cdots & z_2z_l & \cdots & z_l^2 \end{bmatrix}^{\mathrm T}.$$
With this parameterization, we have, from (5.18), the following equation,
$$\bar H^{\mathrm T}(\bar z_k - \bar z_{k+1}) = \bar X_k^{\mathrm T}\bar Q\bar X_k + u_k^{\mathrm T}Ru_k. \qquad (5.19)$$

Notice that (5.19) is a scalar equation with $l(l+1)/2$ unknowns. We can solve this equation in the least-squares sense by collecting at least $L \ge l(l+1)/2$ datasets of $\bar X_k$ and $u_k$. The least-squares solution of (5.19) is given by
$$\bar H = \big(\Phi\Phi^{\mathrm T}\big)^{-1}\Phi\,\Upsilon, \qquad (5.20)$$
where $\Phi \in \mathbb R^{l(l+1)/2\times L}$ and $\Upsilon \in \mathbb R^{L\times 1}$ are the data matrices defined as
$$\Phi = \begin{bmatrix} \bar z_{k-L+1} - \bar z_{k-L+2} & \bar z_{k-L+2} - \bar z_{k-L+3} & \cdots & \bar z_k - \bar z_{k+1} \end{bmatrix},$$
$$\Upsilon = \begin{bmatrix} r\big(\bar X_{k-L+1}, u_{k-L+1}\big) & r\big(\bar X_{k-L+2}, u_{k-L+2}\big) & \cdots & r\big(\bar X_k, u_k\big) \end{bmatrix}^{\mathrm T}.$$
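A minimal sketch of this least-squares step on logged data is given below (implementing (5.19) literally on the stored samples; the function and variable names are ours, and the utility is the quadratic form (5.9)):

```python
import numpy as np

def quad_regressor(z):
    # z_bar = [z1^2, z1 z2, ..., z1 zl, z2^2, ..., zl^2] as defined above
    l = len(z)
    return np.array([z[i] * z[j] for i in range(l) for j in range(i, l)])

def solve_H_bar(Z, Xbar, U, Q_bar, R):
    """Z: (L+1, l) samples of z_k = [Xbar_k; u_k]; Xbar, U: matching samples of
    Xbar_k and u_k (with exploration already injected). Returns the vector
    H_bar = [h11, 2*h12, ..., hll] solving (5.19) in the least-squares sense."""
    Phi = np.array([quad_regressor(Z[k]) - quad_regressor(Z[k + 1])
                    for k in range(len(Z) - 1)])
    Ups = np.array([Xbar[k] @ Q_bar @ Xbar[k] + U[k] @ R @ U[k]
                    for k in range(len(Z) - 1)])
    H_bar, *_ = np.linalg.lstsq(Phi, Ups, rcond=None)
    return H_bar
```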

It is important to note that since $u_k = \bar K\bar X_k$ is linearly dependent on $\bar X_k$, the least-squares problem cannot be solved unless we inject an independent exploration signal $v_k$ in $u_k$ in order to guarantee the invertibility of $\Phi\Phi^{\mathrm T}$ in (5.20). In other words, the following rank condition should hold,
$$\operatorname{rank}(\Phi) = l(l+1)/2. \qquad (5.21)$$

As we have seen, typical examples of exploration signals include sinusoids,


exponentially decaying signals and white noise. It has been shown in Chap. 2 that
these exploration signals do not incur any bias in the Q-learning algorithm. In
what follows, we present the policy iteration and value iteration based Q-learning
algorithms to learn the optimal control parameters.
Algorithms 5.1 and 5.2 present the policy iteration and value iteration Q-learning
algorithms for solving the LQR problem of time delay systems. For the policy
iteration algorithm, we require a stabilizing initial policy, which is not required for
the value iteration algorithm. An exploration signal vk is injected in the control
signal such that the rank condition (5.21) is satisfied during the learning phase.
Based on the collected data, we solve the Q-learning Bellman equation using the
least-squares method by forming appropriate data matrices as shown in Chaps. 1 and 2.

Algorithm 5.1 State feedback Q-learning policy iteration algorithm for time delay systems
input: input-state data
output: $H^*$
1: initialize. Select an admissible policy $\bar K^0$ such that $\bar A + \bar B\bar K^0$ is Schur stable. Set $j \leftarrow 0$.
2: collect data. Apply the initial policy $u^0$ to collect $L$ datasets of $(\bar X_k, u_k)$.
3: repeat
4: policy evaluation. Solve the following Bellman equation for $H^j$,
$$\bar H^{j\mathrm T}\bar z_k = \bar X_k^{\mathrm T}\bar Q\bar X_k + u_k^{\mathrm T}Ru_k + \bar H^{j\mathrm T}\bar z_{k+1}. \qquad (5.22)$$
5: policy update. Find an improved policy as
$$\bar K^{j+1} = -\big(H_{uu}^j\big)^{-1}H_{uX}^j. \qquad (5.23)$$
6: $j \leftarrow j + 1$
7: until $\|H^j - H^{j-1}\| < \varepsilon$ for some small $\varepsilon > 0$.

Algorithm 5.2 State feedback Q-learning value iteration algorithm for time delay systems
input: input-state data
output: $H^*$
1: initialize. Select an arbitrary policy $u_k^0$ and $H^0 \ge 0$. Set $j \leftarrow 0$.
2: collect data. Apply the initial policy $u^0$ to collect $L$ datasets of $(\bar X_k, u_k)$.
3: repeat
4: value update. Solve the following Bellman equation for $H^{j+1}$,
$$\bar H^{(j+1)\mathrm T}\bar z_k = \bar X_k^{\mathrm T}\bar Q\bar X_k + u_k^{\mathrm T}Ru_k + \bar H^{j\mathrm T}\bar z_{k+1}. \qquad (5.24)$$
5: policy update. Find an improved policy as
$$\bar K^{j+1} = -\big(H_{uu}^{j+1}\big)^{-1}H_{uX}^{j+1}. \qquad (5.25)$$
6: $j \leftarrow j + 1$
7: until $\|H^j - H^{j-1}\| < \varepsilon$ for some small $\varepsilon > 0$.
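The policy update steps (5.23) and (5.25) amount to partitioning the re-assembled symmetric H matrix and inverting its lower-right block; a minimal sketch:

```python
import numpy as np

def policy_from_H(H, m):
    # split H as in (5.16): the last m rows/columns correspond to u_k
    Huu = H[-m:, -m:]
    HuX = H[-m:, :-m]
    return -np.linalg.solve(Huu, HuX)     # K_bar^{j+1} = -Huu^{-1} HuX
```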

In the next step, we perform a minimization of the Q-function with respect
to uk , which gives us an improved policy. These iterations are carried out until
we see no further updates in the estimate of matrix H within a sufficiently small
range specified by the positive constant ε. We next establish the convergence of
Algorithms 5.1 and 5.2.
Theorem 5.3 Consider the time delay system (5.1) and its extended augmented version (5.7). Under the stabilizability and detectability conditions of $(\bar A, \bar B)$ and $\big(\bar A, \sqrt{\bar Q}\big)$, respectively, the state feedback Q-learning Algorithms 5.1 and 5.2 each generates a sequence of controls $u_k^j$, $j = 1, 2, 3, \ldots$, that converges to the optimal feedback controller given in (5.10) as $j \to \infty$ if the rank condition (5.21) is satisfied.
Proof Algorithms 5.1 and 5.2 are the standard state feedback Q-learning Algorithms 1.6 and 1.7, respectively, applied to the extended augmented system (5.6). Under the controllability condition of $(\bar A, \bar B)$ and the observability condition of $\big(\bar A, \sqrt{\bar Q}\big)$, it follows from the convergence of Algorithms 1.6 and 1.7 that Algorithms 5.1 and 5.2 converge to the optimal solution as $j \to \infty$ under the rank condition (5.21). This completes the proof.

Remark 5.4 Compared with the previous works [137, 139], the proposed scheme
relaxes the assumption of the existence of a bicausal change of coordinates.
Furthermore, unlike in [137, 139], the information of the state and input delays
(both the numbers and lengths of the delays) is not needed in our proposed scheme.

5.6 Output Feedback Q-learning Control of Time Delay Systems

The control design technique presented in the previous section was based on state feedback. That is, access to the full state is needed both in the learning process and in the implementation of the resulting control law. In many applications, however, the full state is not available for measurement; only a subset of the state is accessible through the system output. Output feedback techniques enable the design of control algorithms without involving the information of the full state. This section will present a Q-learning based control algorithm that stabilizes the system using measurements of the system output instead of the full state. We recall from [56] the following lemma, which allows the reconstruction of the system state by means of the delayed measurements of the system input and output.
Lemma 5.1 Consider the extended augmented system (5.7). Under the observability assumption on the pair (A, C), the system state can be represented in terms of the measured input and output sequences as

X_k = M_y Y_{k−1,k−N} + M_u u_{k−1,k−N},    (5.26)

where N ≤ n(S + 1) + mT is an upper bound on the observability index of the system, and u_{k−1,k−N} ∈ R^{mN} and Y_{k−1,k−N} ∈ R^{(p+m)N} are the delayed input and delayed output data vectors defined as

u_{k−1,k−N} = [u_{k−1}^T  u_{k−2}^T  ···  u_{k−N}^T]^T,
Y_{k−1,k−N} = [y_{k−1}^T  u_{k−T−1}^T  ···  y_{k−N}^T  u_{k−T−N}^T]^T,

and the parameterization matrices take the special form

M_y = A^N (V_N^T V_N)^{-1} V_N^T,
M_u = U_N − A^N (V_N^T V_N)^{-1} V_N^T T_N,

with

V_N = [(CA^{N−1})^T  ···  (CA)^T  C^T]^T,
U_N = [B  AB  ···  A^{N−1}B],

T_N = [0  CB  CAB  ···  CA^{N−2}B;
       0  0   CB   ···  CA^{N−3}B;
       ⋮  ⋮    ⋱    ⋱       ⋮    ;
       0  0   0    ···      CB   ;
       0  0   0    ···      0   ].

In Lemma 5.1, V_N and U_N are the observability and controllability matrices, respectively, and T_N is referred to as the Toeplitz matrix.
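The parameterization matrices of Lemma 5.1 are straightforward to form numerically. The following Python sketch does so for a generic triple (A, B, C) and a horizon N at least as large as the observability index; in this chapter these would be the extended augmented matrices and the augmented output. The function name is an illustrative choice, and the Moore–Penrose pseudo-inverse is used for V_N^+, which coincides with (V_N^T V_N)^{-1} V_N^T when V_N has full column rank.

import numpy as np
from numpy.linalg import matrix_power, pinv

def state_parameterization(A, B, C, N):
    """Return (M_y, M_u) such that x_k = M_y Y_{k-1,k-N} + M_u u_{k-1,k-N}."""
    n, m, p = A.shape[0], B.shape[1], C.shape[0]
    # V_N stacks CA^{N-1}, ..., CA, C from top to bottom, as in Lemma 5.1
    V_N = np.vstack([C @ matrix_power(A, N - 1 - i) for i in range(N)])
    # U_N = [B  AB  ...  A^{N-1}B]
    U_N = np.hstack([matrix_power(A, i) @ B for i in range(N)])
    # Block Toeplitz matrix T_N with CA^{j-i-1}B strictly above the block diagonal
    T_N = np.zeros((p * N, m * N))
    for i in range(N):
        for j in range(i + 1, N):
            T_N[i*p:(i+1)*p, j*m:(j+1)*m] = C @ matrix_power(A, j - i - 1) @ B
    M_y = matrix_power(A, N) @ pinv(V_N)   # left pseudo-inverse of V_N
    M_u = U_N - M_y @ T_N
    return M_y, M_u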
Remark 5.5 The state parameterization (5.26) involves the invertibility of the observability matrix. To ensure the observability of the extended augmented pair (Ā, C̄), we will assume in this section that S̄ = S.
Next, we aim to use the state parameterization (5.26) to describe the Q-function in (5.16). Notice that (5.26) can be written as

X_k = [M_u  M_y] [u_{k−1,k−N}^T  Y_{k−1,k−N}^T]^T.    (5.27)

Similarly, for the extended augmented system (5.7), we have the following parameterization of the extended augmented state,

X̄_k = [M̄_u  M̄_y] [u_{k−1,k−N}^T  Ȳ_{k−1,k−N}^T]^T,    (5.28)

where M̄_u and M̄_y are formed using the extended augmented system matrices (Ā, B̄, C̄). It can be easily verified that substitution of (5.28) into (5.16) results in

Q_K̄ = [u_{k−1,k−N}; Ȳ_{k−1,k−N}; u_k]^T [H_ūū  H_ūȳ  H_ūu; H_ȳū  H_ȳȳ  H_ȳu; H_uū  H_uȳ  H_uu] [u_{k−1,k−N}; Ȳ_{k−1,k−N}; u_k]
     = ζ_k^T H ζ_k,    (5.29)

where

ζ_k = [u_{k−1,k−N}^T  Ȳ_{k−1,k−N}^T  u_k^T]^T,
H = H^T ∈ R^{(mN+(p+m)N+m)×(mN+(p+m)N+m)},

and the submatrices of H are defined as

H_ūū = M̄_u^T (Q̄ + Ā^T P̄ Ā) M̄_u ∈ R^{mN×mN},
H_ūȳ = M̄_u^T (Q̄ + Ā^T P̄ Ā) M̄_y ∈ R^{mN×(p+m)N},
H_ūu = M̄_u^T Ā^T P̄ B̄ ∈ R^{mN×m},
H_ȳȳ = M̄_y^T (Q̄ + Ā^T P̄ Ā) M̄_y ∈ R^{(p+m)N×(p+m)N},    (5.30)
H_ȳu = M̄_y^T Ā^T P̄ B̄ ∈ R^{(p+m)N×m},
H_uu = R + B̄^T P̄ B̄ ∈ R^{m×m}.

By (5.29) we have obtained a new description of the Q-function of the LQR in terms of the input and output of the system. We are now in a position to derive our output feedback LQR controller based on this Q-function. We seek an optimal controller that minimizes the cost expressed by the Q-function (5.29). To obtain the optimal controller, we perform the minimization of the optimal Q-function Q*, with H* being the optimal H. Setting

∂Q*/∂u_k = 0

and solving for u_k results in our output feedback LQR control law,

u_k* = −(H_{uu}*)^{-1} (H_{uū}* u_{k−1,k−N} + H_{uȳ}* Ȳ_{k−1,k−N})
     = K̄ [u_{k−1,k−N}^T  Ȳ_{k−1,k−N}^T]^T.    (5.31)

Now that we have an output feedback form of the Q-function for the extended
augmented time delay system, the next step is to learn the optimal Q-function Q∗
and the corresponding output feedback optimal controller (5.31).
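Once an estimate of H is available (for example, at convergence of the algorithms presented below), evaluating (5.31) amounts to a block partition of H followed by a linear solve. The Python sketch below illustrates this; the partition sizes follow (5.29), and all function and variable names are illustrative rather than taken from the text.

import numpy as np

def output_feedback_gain(H, m, p, N):
    """Extract K_bar from H partitioned as in (5.29)."""
    nu, ny = m * N, (p + m) * N            # sizes of the input- and output-history blocks
    H_uu = H[nu + ny:, nu + ny:]           # R^{m x m}
    H_upast = H[nu + ny:, :nu + ny]        # [H_{u ubar}  H_{u ybar}]
    return -np.linalg.solve(H_uu, H_upast)

def control(K_bar, u_hist, Y_hist):
    """u_k = K_bar [u_{k-1,k-N}; Y_{k-1,k-N}] as in (5.31)."""
    return K_bar @ np.concatenate([u_hist, Y_hist])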
Consider the state feedback Q-learning equation (5.18). We employ the equivalent output feedback Q-function to write this equation as

ζ_k^T H ζ_k = Ȳ_k^T Q̄_y Ȳ_k + u_k^T R u_k + ζ_{k+1}^T H ζ_{k+1}.    (5.32)

It should be noted that, in the output feedback learning, we apply the user-defined weighting matrix Q̄_y to the output. The term X̄_k^T Q̄ X̄_k can be replaced with Ȳ_k^T Q̄_y Ȳ_k without requiring the knowledge of C̄ when Q̄ = C̄^T Q̄_y C̄ and Ȳ_k = C̄ X̄_k, where Ȳ_k is measurable. Here, u_{k+1} is computed as

u_{k+1} = −(H_{uu})^{-1} (H_{uū} u_{k,k−N+1} + H_{uȳ} Ȳ_{k,k−N+1})
        = K̄ [u_{k,k−N+1}^T  Ȳ_{k,k−N+1}^T]^T.    (5.33)

Equation (5.32) is the Bellman equation for the Q-function in the output feedback
form, from which we will develop a reinforcement learning algorithm. We will
parameterize the Q-function in (5.29) so that we can separate the unknown matrix
H.
Consider the output feedback Q-function in (5.29), which can be linearly parameterized as

Q_K̄ = H̄^T ζ̄_k,    (5.34)

where

H̄ = vec(H)
  = [H_{11}  2H_{12}  ···  2H_{1l}  H_{22}  2H_{23}  ···  2H_{2l}  ···  H_{ll}]^T ∈ R^{l(l+1)/2},  l = mN + (p + m)N + m,

is the vector that contains the upper triangular portion of the matrix H. Since H is symmetric, the off-diagonal entries are included as 2H_{ij}. The regression vector ζ̄_k ∈ R^{l(l+1)/2} is defined by

ζ̄_k = ζ_k ⊗ ζ_k,

which is the quadratic basis set formed as

ζ̄ = [ζ_1^2  ζ_1ζ_2  ···  ζ_1ζ_l  ζ_2^2  ζ_2ζ_3  ···  ζ_2ζ_l  ···  ζ_l^2]^T.

With the parameterization (5.34), we have the following equation,

H̄^T ζ̄_k = Ȳ_k^T Q̄_y Ȳ_k + u_k^T R u_k + H̄^T ζ̄_{k+1}.    (5.35)

Notice that (5.35) is a scalar equation with l(l + 1)/2 unknowns. We can solve this equation in the least-squares sense by collecting at least L ≥ l(l + 1)/2 datasets of Ȳ_k and u_k. The least-squares solution of (5.35) is given by

H̄ = (ΦΦ^T)^{-1} Φ Υ,    (5.36)

where Φ ∈ R^{l(l+1)/2×L} and Υ ∈ R^{L×1} are the data matrices defined as

Φ = [ζ̄_{k−L+1} − ζ̄_{k−L+2}   ζ̄_{k−L+2} − ζ̄_{k−L+3}   ···   ζ̄_k − ζ̄_{k+1}],
Υ = [r(Ȳ_{k−L+1}, u_{k−L+1})   r(Ȳ_{k−L+2}, u_{k−L+2})   ···   r(Ȳ_k, u_k)]^T.

It is important to note that, since u_k is linearly dependent on u_{k−1,k−N} and Ȳ_{k−1,k−N}, the least-squares problem cannot be solved unless we inject an independent exploration signal v_k into u_k in order to guarantee the invertibility of ΦΦ^T in

Algorithm 5.3 Output feedback Q-learning policy iteration algorithm for time delay systems
input: input-output data
output: H*
1: initialize. Select an admissible policy u_k^0. Set j ← 0.
2: collect data. Apply the initial policy u_k^0 to collect L datasets of (Ȳ_k, u_k).
3: repeat
4:   policy evaluation. Solve the following Bellman equation for H^j,

     (H̄^j)^T ζ̄_k = Ȳ_k^T Q̄_y Ȳ_k + u_k^T R u_k + (H̄^j)^T ζ̄_{k+1}.    (5.38)

5:   policy update. Find an improved policy as

     u_k^{j+1} = −(H_{uu}^j)^{-1} (H_{uū}^j u_{k−1,k−N} + H_{uȳ}^j Ȳ_{k−1,k−N}).    (5.39)

6:   j ← j + 1
7: until ‖H^j − H^{j−1}‖ < ε for some small ε > 0.

(5.36). In other words, the following rank condition should be satisfied,

rank(Φ) = l(l + 1)/2.    (5.37)
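For concreteness, the Python sketch below shows one way the policy evaluation step could be carried out: the quadratic basis ζ̄ is formed, the data matrices of (5.36) are assembled from stored samples, and H̄ is obtained by least squares before the symmetric H is recovered (the off-diagonal entries of H̄ represent 2H_{ij}, so they are halved). The helper names are illustrative, and the data are assumed to have been generated with an exploring input so that (5.37) holds.

import numpy as np

def quad_basis(zeta):
    """Quadratic basis [z1^2, z1*z2, ..., z1*zl, z2^2, ..., zl^2]."""
    l = len(zeta)
    return np.array([zeta[i] * zeta[j] for i in range(l) for j in range(i, l)])

def solve_H(zeta_list, r_list):
    """Solve H_bar from H_bar^T (zbar_k - zbar_{k+1}) = r_k, then rebuild H."""
    Phi = np.array([quad_basis(zeta_list[k]) - quad_basis(zeta_list[k + 1])
                    for k in range(len(r_list))])            # L x l(l+1)/2
    H_bar, *_ = np.linalg.lstsq(Phi, np.array(r_list), rcond=None)
    l = len(zeta_list[0])
    H = np.zeros((l, l))
    idx = 0
    for i in range(l):
        for j in range(i, l):
            H[i, j] = H_bar[idx] if i == j else H_bar[idx] / 2.0
            H[j, i] = H[i, j]
            idx += 1
    return H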

Remark 5.6 When N < T̄, Ȳ_{k−1,k−N} will contain entries from u_{k−1,k−N}, which prevents the rank condition (5.37) from being satisfied even in the presence of an exploration signal v_k. However, in the proof of Theorem 5.2, we see that increasing T to an arbitrarily large T̄ does not affect the observability. Furthermore, the rank condition (5.12) corresponding to the original output y_k remains unchanged, which implies that the sum of the observability indices associated with the original output y_k also remains unchanged. Therefore, u_{k−T̄} in the augmented output vector contributes to the observability of the extended states. This causes the sum of the observability indices corresponding to the augmented output vector to increase to mT̄, with each component contributing T̄ equally. Thus, for an arbitrarily large T̄, the observability index becomes N = T̄ and the rank condition (5.37) can be satisfied.
In what follows, we present an iterative Q-learning algorithm to learn our output
feedback Q-function and the optimal control parameters.
Algorithms 5.3 and 5.4 present the output feedback Q-learning algorithms for
time delay systems. As is the case with PI algorithms, Algorithm 5.3 makes use of
a stabilizing initial policy. On the other hand, Algorithm 5.4 does not require such
an initial policy. As seen in Chap. 2, for both policy iteration and value iteration
based Q-learning algorithms, we require an exploration signal vk such that the rank
condition (5.37) is satisfied. The Bellman equation in Algorithms 5.3 and 5.4 can

Algorithm 5.4 Output feedback Q-learning value iteration algorithm for time delay systems
input: input-output data
output: H*
1: initialize. Select an arbitrary policy u_k^0 and H^0 ≥ 0. Set j ← 0.
2: collect data. Apply the initial policy u_k^0 to collect L datasets of (Ȳ_k, u_k).
3: repeat
4:   value update. Solve the following Bellman equation for H^{j+1},

     (H̄^{j+1})^T ζ̄_k = Ȳ_k^T Q̄_y Ȳ_k + u_k^T R u_k + (H̄^j)^T ζ̄_{k+1}.    (5.40)

5:   policy update. Find an improved policy by

     u_k^{j+1} = −(H_{uu}^{j+1})^{-1} (H_{uū}^{j+1} u_{k−1,k−N} + H_{uȳ}^{j+1} Ȳ_{k−1,k−N}).    (5.41)

6:   j ← j + 1
7: until ‖H^j − H^{j−1}‖ < ε for some small ε > 0.

be solved using the least-squares technique by forming the data matrices as shown
in Chap. 2. We next establish the convergence of Algorithms 5.3 and 5.4.
Theorem 5.4 Consider the time delay system (5.1) and its extended augmented version (5.7). Let S = S̄. Under the stabilizability condition on (Ā, B̄), the observability conditions on (Ā, C̄) and (Ā, √Q̄_y C̄), and the full row rank condition on M̄ = [M̄_u  M̄_y], the output feedback Algorithms 5.3 and 5.4 each generates a sequence of controls {u_k^j}, j = 1, 2, 3, ..., that converges to the optimal feedback controller given in (5.31) as j → ∞ if the rank condition (5.37) is satisfied.

Proof Algorithms 5.3 and 5.4 are the output feedback Q-learning Algorithms 2.1 and 2.2, respectively, applied to the extended augmented system (5.6). It follows from the proofs of Algorithms 2.1 and 2.2 that, under the stated conditions, Algorithms 5.3 and 5.4 converge to the optimal output feedback solution as j → ∞. This completes the proof.
5.7 Numerical Simulation

In this section, we test the proposed scheme using numerical simulation. Consider
the discrete-time system (5.1) with

 
A_0 = [0.6  0.3; 0.2  0.5],   A_1 = [0.2  0.5; 0.4  0.1],   B_1 = [0.6; 0.4],   B_2 = [0.1; −0.1],   C = [1  −0.8].

There are two input delays and one state delay present in the system. Notice that, although matrices A_0 and A_1 are both Schur stable, the system is unstable due to the delays, which can be checked by finding the roots of the polynomial matrix P(λ) = A_0λ + A_1 − λ²I or by evaluating the eigenvalues of the augmented matrix A as defined in (5.5). It can be verified that the controllability condition of the delayed system

ρ([ Σ_{i=0}^{S} A_i λ^{S−i} − λ^{S+1} I    Σ_{i=0}^{T} B_i λ^{T−i} ]) = n

holds. Hence, the extended augmented system (5.6) is controllable. The maximum input and state delays present in the system are T = 2 and S = 1, respectively. Let T̄ = 3 and S̄ = 1 be the upper bounds on the input and state delays, respectively. We specify the user-defined performance index as Q = I and R = 1.
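The instability claim can be verified numerically: the roots of P(λ) = A_0λ + A_1 − λ²I coincide with the eigenvalues of the block companion matrix [[A_0, A_1], [I, 0]], so a few lines of Python suffice. This is only a verification aid and is not part of the learning algorithms.

import numpy as np

A0 = np.array([[0.6, 0.3], [0.2, 0.5]])
A1 = np.array([[0.2, 0.5], [0.4, 0.1]])

# Roots of det(lambda^2 I - A0*lambda - A1) via the block companion form
companion = np.block([[A0, A1], [np.eye(2), np.zeros((2, 2))]])
roots = np.linalg.eigvals(companion)
print("roots of P(lambda):", roots)
print("unstable:", np.any(np.abs(roots) > 1.0))   # True for this example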
The nominal optimal feedback control matrix as obtained by solving the ARE with known delays is

K̄* = [0.7991  0.8376  0.3622  0.3643  0.0007  0.6191].
We first validate the state feedback policy iteration algorithm, Algorithm 5.1. For this algorithm, we employ the following stabilizing initial policy,

K̄^0 = [0.4795  0.5025  0.2173  0.2186  0.0000  0.0004  0.3714].
The convergence criterion of ε = 0.001 was selected for the controller parameters.


Since l = n(S̄ + 1) + m(T̄ + 1) = 8, we need at least l(l + 1)/2 = 36 data samples
to satisfy the rank condition (5.21) to solve (5.19). These data samples are collected
by applying some sinusoidal signals of different frequencies and magnitudes in the
control. The state response is shown in Fig. 5.1. It takes 4 iterations to converge to
the optimal controller as shown in Fig. 5.2. It can be seen in Fig. 5.1 that the controller is able to maintain closed-loop stability. The final estimate of the feedback gain
matrix is

Fig. 5.1 Algorithm 5.1: State trajectory of the closed-loop system under state feedback

Fig. 5.2 Algorithm 5.1: Convergence of the parameter estimates under state feedback

 
K̄ˆ = [0.7991  0.8376  0.3622  0.3643  −0.0000  0.0007  0.6191].

It can be seen that the estimated control parameters correspond only to the actual
delays present in the system while the one term corresponding to the extra state is
very small. In other words, the final control is equal to the one obtained using the
exact knowledge of the delays and system dynamics. Moreover, the rank condition
(5.21) is no longer needed once the convergence criterion is met.
We next validate the state feedback value iteration algorithm, Algorithm 5.2. The
algorithm is initialized with a zero feedback policy, which is clearly not stabilizing.
All other design and simulation parameters are the same as in the policy iteration algorithm.
It can be seen in Fig. 5.3 that, due to the absence of a stabilizing initial policy, the
system response is unstable during the learning phase. However, Fig. 5.4 shows that
even in such a scenario, Algorithm 5.2 still manages to converge to the optimal
solution, which eventually stabilizes the unstable system. The final estimate of the
control gain is

Fig. 5.3 Algorithm 5.2: State trajectory of the closed-loop system under state feedback

Fig. 5.4 Algorithm 5.2: Convergence of the parameter estimates under state feedback

 
K̄ˆ = [0.7996  0.8380  0.3625  0.3645  0.0000  0.0007  0.6194].

The simulation performed so far focuses on full state feedback. We will now
validate the output feedback algorithms, Algorithms 5.3 and 5.4. The extended
augmented system is observable since the observability conditions in Theorem 5.2
hold. We specify the user-defined performance index as Q̄y = I and R = 1. The
bound on the state delay is the same but the bound on the input delay has been
increased to T̄ = 4 to make the observability index N = T̄ in order to satisfy the
output feedback rank condition (5.37). The nominal output feedback optimal control
parameters as obtained by solving the ARE with known delays are

K̄ = [0.5100  0.6229  −0.2150  −0.6116  3.5739  −0.3037  0.6210  0.0000  0.5693  0.0000  0.5062  0.0000].

The convergence criterion of ε = 0.001 is selected for the controller parameters.


Since l = mN + (p + m)N + m = 13, we need at least l(l + 1)/2 = 91 data samples

to satisfy the rank condition (5.37) to solve (5.35). These data samples are collected
by applying some sinusoidal signals of different frequencies and magnitudes in
the control. For the output feedback policy iteration algorithm, Algorithm 5.3, we
choose the following stabilizing initial policy,

0
K̄ = [0.3366 0.4111 −0.1419 −0.4036 2.3588 −0.2005

0.4099 0.0000 0.3758 0 0.3341 0.0000].

The state response is shown in Fig. 5.5. It takes around 5 iterations to converge to the
optimal controller as shown in Fig. 5.6. It can be seen in Fig. 5.5 that the resulting
controller is able to achieve the closed-loop stability. The final estimate of the output
feedback gain matrix is

K̄ = [0.5100  0.6229  −0.2150  −0.6116  3.5739  −0.3037  0.6210  0.0000  0.5693  0.0000  0.5062  0.0000].

For the output feedback value iteration algorithm, Algorithm 5.4, we initialize with
a zero feedback policy. All other parameters are the same as in the simulation of the
policy iteration algorithm, Algorithm 5.3. The final estimate of the output feedback
gain matrix is

K̄ = [0.5092  0.6220  −0.2148  −0.6107  3.5687  −0.3033  0.6201  0.0000  0.5685  0.0000  0.5055  0.0000].

Figure 5.7 shows the state response, where it can be seen that due to the absence of a
stabilizing initial policy, the closed-loop system remains unstable during the initial
learning phase. However, after the initial data collection phase, the system trajectory
converges to zero. The convergence to the optimal output feedback parameters is
shown in Fig. 5.8. As can be seen in the state feedback and output feedback results,
while the transient performance of the state feedback algorithm is superior due
to fewer unknown parameters to be learnt, the output feedback algorithm has the
advantage that it does not require the measurement of the full state.

5.8 Summary

This chapter is built upon the idea that exploits the finite dimensionality property
of the discrete-time delay systems to bring them into a delay-free form. Compared
to the predictor feedback, which requires a feedback controller to bring the system
into a delay-free form, the technique brings the open-loop system into a delay-free
form by assigning delayed variables as additional states, which are then augmented

Fig. 5.5 Algorithm 5.3: State trajectory of the closed-loop system under output feedback

Fig. 5.6 Algorithm 5.3: Convergence of the parameter estimates under output feedback

Fig. 5.7 Algorithm 5.4: State trajectory of the closed-loop system under output feedback

with the system state. The standard state augmentation technique, however, still
requires the knowledge of the delays. To overcome this difficulty, we presented an
extended state augmentation approach that only requires the knowledge of upper
bounds of the delays. Both state and input delays were considered. We presented

Fig. 5.8 Algorithm 5.4: Convergence of the parameter estimates under output feedback

a comprehensive analysis of the controllability and observability properties of the
extended augmented system, which is essential in obtaining the optimal control
solution. Both state feedback and output feedback designs based on policy iteration
and value iteration Q-learning techniques were presented and the convergence of
the presented algorithms was established. A detailed simulation study was carried
out to validate the presented designs.

5.9 Notes and References

Time delay is a commonly experienced phenomenon in science and engineering.
Time delays are frequently encountered in control systems at various points within
the control loop. Effective control involves the use of real-time information of
the system. In general, time delays can cause performance issues, and in a severe
situation, may even result in loss of the closed-loop stability. Therefore, it is essential
to design control laws that take into account the effect of time delays, particularly
when they are noticeably long relative to the closed-loop dynamics. Owing to
its very practical nature, the problem of time delay compensation has received a
comprehensive coverage in the controls literature.
Predictor feedback is an effective technique to design control systems that are
subject to time delays [74, 108]. The approach is motivated by the idea of bringing
a time delay system into a delay-free form by means of feedback control, which
involves the feedback of the predicted state of the system. This state prediction
is accomplished by embedding the dynamic model of the system in the predictor
feedback law. Based on this technique, the design of optimal linear quadratic
regulator has been presented in the earlier works, such as [75]. However, the precise
knowledge of the system dynamics is needed for predicting the state and to design
the optimal controller.

There have been some recent developments in solving the optimal control
problem for time delay systems in a model-free manner. Instead of applying
the predictor feedback, the idea of bicausal change of coordinates has been
recently employed in developing reinforcement learning and approximate dynamic
programming techniques for model-free optimal control of time delay systems.
However, the key challenge in this approach is the issue of the existence of a bicausal
transformation, which becomes difficult to verify when the system dynamics is not
known. Moreover, even though the schemes mentioned above are model-free, a
precise knowledge of the delay is still required.
In this chapter, the presentation of the extended state augmentation technique
and the associated state feedback algorithms follows from our preliminary results
in [104]. Section 5.6, along with Theorem 5.2 in Sect. 5.4, extends these results to
the output feedback case and provides a detailed convergence analysis of the output
feedback Q-learning algorithms for time delay problems.
Chapter 6
Model-Free Optimal Tracking Control and Multi-Agent Synchronization
6.1 Introduction

Tracking control has unarguably been the most traditional control design problem
beyond stabilization owing to its practical relevance. As its name suggests, the goal
in a tracking problem is to design a controller that enables the system to closely
follow a specified reference trajectory. In general, such a reference signal could take
various forms and be very dynamic in nature, making the problem more challenging
in comparison with a stabilization problem that only involves regulation of the
system state to an equilibrium point. There are several domains such as robotics,
aerospace, automation, manufacturing, power systems, and automobiles, where a
tracking controller is required to match the output and/or the state of the system
with that of a reference generator, hereafter referred to as the exosystem. Very often,
tracking a reference trajectory is only a basic requirement in a control application.
Instead, it is desirable to achieve tracking in a prescribed optimal manner, which
involves solving an optimization problem.
In the setting of linear systems, the structure of the linear quadratic tracking
(LQT) control law involves an optimal feedback term (similar to the stabilization
problem) and an optimal feedforward term that corresponds to the reference trajec-
tory. While the design of the optimal feedback term follows the same procedures that
we have seen in the previous chapters, the design procedure for the feedforward term
is more involved. It is due to this difficulty that the application of LQT design has
received less attention in the literature. The traditional approach to computing this
feedforward term requires solving a noncausal equation and involves a backward in
time procedure to precompute and store the reference trajectory [55]. An alternative
method to calculate the feedforward term involves the idea of dynamic inversion.
This technique, however, requires the input coupling matrix to be invertible. A
promising paradigm for designing tracking controllers is the output regulation
framework that involves solving the regulator equations to compute the feedforward

term. All these approaches to finding the feedforward term are model-based as they
require the complete knowledge of the system dynamics.
One notable extension of the tracking problem that has recently gained significant
popularity is the multi-agent synchronization problem. As the name suggests, the
problem involves a group of agents that make decisions to achieve synchronization
among themselves. The individual agents may have identical dynamics (homo-
geneous agents) or they can assume different dynamics (heterogeneous agents).
The primary motivation of this problem comes from the cooperative behavior that
is found in many species. Such cooperations are essential in achieving goals or
performing tasks that would otherwise be difficult to carry out by individual agents
operating solo. Their distributed nature also gives multi-agent systems a remarkable edge over centralized approaches in terms of scalability to the problem size and robustness to failures. Similar to the optimal tracking problem, the multi-agent
synchronization problem can also take into account the optimality aspect of each
agent based on the optimal control theory. Output regulation framework has also
been very instrumental in solving the multi-agent synchronization problems. The
key challenge in the multi-agent control design pertains to the limited information
that is available to each agent. Furthermore, as is the case with a tracking problem,
difficulties exist in computing the optimal feedback and feedforward control terms
when the dynamics of the agents are unknown.
Model-free reinforcement learning (RL) holds the promise of solving individual
agent and multi-agent problems without involving the system dynamics. However,
there are additional challenges RL faces in finding the solution of the optimal
tracking and synchronization problems. The primary difficulty lies in selecting an
appropriate cost function, which is challenging to formulate due to the presence of
the reference signal. In particular, in the infinite horizon setting, the presence of a
general non-vanishing reference signal can make the long-term cost ill-defined. The
traditional approach to finding the feedforward term involves a noncausal difference equation, which is not readily applicable in RL because it requires a backward-in-time computation, in contrast to the forward-in-time nature of RL.
In this chapter, we will first present a two degrees of freedom learning approach
to learning the optimal feedback and feedforward control terms. The optimal
feedback term is obtained using the LQR Q-learning method that we developed
in Chap. 2, whereas an adaptive algorithm is presented that learns the feedforward
term. The presented scheme has the advantage that it does not require a discounting
factor. As a result, convergence to the optimal parameters is achieved and optimal
asymptotic output tracking is ensured. We will restrict this discussion to the
single input single output case, which is commonly addressed in this context. The
treatment of the general multiple input multiple output case and its continuous-time
extension are more involved. In the second half of this chapter, we will extend
the single-agent scheme to solve the multi-agent synchronization problem. The
focus of the multi-agent design will be to achieve synchronization using only the
neighborhood information. A leader-follower style formulation will be presented
and distributed adaptive laws will be designed that require only the system input-
output data and the information of the neighboring agents.

6.2 Literature Review

The optimal tracking problem has been covered quite comprehensively in the RL
control literature. Most recent works involve the idea of state augmentation in which
the dynamics of the trajectory generator is merged into that of the system. The
problem is then solved as a linear quadratic regulation problem using the standard
RL techniques such as Q-learning as presented in the earlier chapters. For instance,
the authors of [49] employed Q-learning to solve the optimal tracking problem by
simultaneously learning the feedback and feedforward control gains. One advantage
of the state augmentation approach is that it can be also readily applied to solve the
continuous-time tracking problems as done in [77]. However, this approach does not
guarantee asymptotic tracking error convergence owing to the need of employing a
discounting factor in the cost function.
The main utility of the discounting factor in the case of optimal tracking problem
is to make the cost function well-defined. It is worth pointing out that this require-
ment of a discounting factor is different from the one in output feedback control, as
seen in Chap. 2. In fact, the above mentioned works involve the measurement of the
internal state while still requiring a discounted cost function. Motivated by the work
of [56], the authors of [50] solved the output feedback optimal tracking problem,
but asymptotic error convergence could not be ensured, again due to the use of
a discounting factor. Here the discounting factor serves the additional purpose of
diminishing the effect of the exploration noise bias as explained in [56]. Continuous-
time extensions based on the state augmentation approach combined with a state
parameterization in terms of delayed output measurements have been carried out
in [80]. As discussed in Chap. 2, such a state parameterization also leads to bias
as it assumes that the control is strictly feedback, thereby, ignoring the effect of
exploration signal. Similarly to the discrete-time counterpart, the discounting factor
serves two utilities, to circumvent the exploration bias issue and to make the cost
function well-defined, or, equivalently, to make the augmented system stabilizable.
Applications of RL in solving a variety of multi-agent control problems have
been proposed in the literature where the requirement of the knowledge of system
dynamics [15, 105] has been relaxed. The authors of [119] developed a model-
free reinforcement learning scheme to solve the optimal consensus problem of
multi-agent systems in the continuous-time setting. In this work the leader and the
follower agents were assumed to have identical dynamics. Extension of this work to
the discrete-time setting was presented in [2]. In contrast, solution of consensus
problems for heterogeneous multi-agent systems involves a different approach,
which is based on the output regulation theory. This line of work, however,
encounters certain obstacles because of the need to solve regulator equations in
order to find the feedforward term. The solution of the regulator equations entails
the knowledge of the dynamics of both the leader and follower agents.
In [82], this difficulty was addressed in solving the continuous-time hetero-
geneous consensus problem based on model-free RL approach. Discrete-time
extensions of this work involving Q-learning were proposed in [47]. It is worth

noting that [47, 82] involve the simultaneous estimation of the feedforward and
feedback term, as with the solution of the tracking problem discussed previously.
An augmented algebraic Riccati equation is obtained by augmenting the leader
dynamics with the dynamics of the follower agents. Similarly to the tracking
problem, the multi-agent problem also faces the issue that the resulting augmented
system is not stabilizable owing to the autonomous and neutrally stable nature of
the leader dynamics. Furthermore, the infinite horizon cost function in the multi-
agent synchronization problem is also ill-defined as the control does not converge
to zero due to the non-decaying state of the leader. In order to make the local cost
function of each agent well-defined, the authors of [47, 82] resorted to incorporating
a discounting factor in the cost function.

6.3 Q-learning Based Linear Quadratic Tracking

Consider a discrete-time linear system given by the following state space representation,

x_{k+1} = A x_k + B u_k,
y_k = C x_k,    (6.1)

where x_k ∈ R^n is the state, u_k ∈ R is the input, and y_k ∈ R is the output. The reference trajectory to be tracked is generated by the following exosystem,

w_{k+1} = S w_k,
y_{rk} = F w_k,    (6.2)
where w_k ∈ R^q is the state of the exosystem and y_{rk} ∈ R is the reference output. In [50], the linear quadratic tracking (LQT) problem has been addressed by augmenting the system dynamics (6.1) with the exosystem dynamics (6.2) and using the VFA method. However, the augmented system is not stabilizable, as the exosystem is autonomous and neutrally stable, and the resulting cost function of the augmented system becomes ill-defined due to the non-decaying reference trajectory w_k. To address these difficulties, a discounting factor 0 < γ < 1 is introduced in the cost function as

V_K(X_k) = Σ_{i=k}^{∞} γ^{i−k} r(X_i, K_γ X_i),    (6.3)

where X_k = [x_k^T  w_k^T]^T is the augmented state and r(X_k, K_γ X_k) is a one-step quadratic cost function. In general, the boundedness of V_K does not ensure x_k → 0 as k → ∞ for all values of γ. Only for a particular range of γ may asymptotic
tracking be ensured but the lower bound on γ is not known. Furthermore, the
discounted solution is only suboptimal.
To solve the LQT problem, we make the following standing assumptions.
Assumption 6.1 (A, B) is stabilizable.

Assumption 6.2 (A, C) is observable.

Assumption 6.3 The relative degree n* and the sign of the high frequency gain k_p are known, where

k_p = lim_{z→∞} z^{n*} C(zI − A)^{-1} B.

Assumption 6.4 ρ([A − λI  B; C  0]) = n + 1 for all λ ∈ σ(S), where σ(S) denotes the spectrum of S.
We formulate the optimal tracking problem as an output regulation problem by defining the error dynamics as

x̃_{k+1} = A x̃_k + B ũ_k,
e_k = C x̃_k,    (6.4)

where e_k = y_k − y_{rk}, ũ_k = u_k − U w_k and x̃_k = x_k − X w_k. Here, X and U are the solution to the following Sylvester equations, which are commonly referred to as the regulator equations,

X S = A X + B U,
0 = C X − F.    (6.5)
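When the model is available, the regulator equations (6.5) are linear in (X, U) and can be solved by vectorization; this is how nominal feedforward gains used for comparison (such as those reported in Sect. 6.7) can be computed. The Python sketch below is one such implementation with illustrative names; the learning scheme developed in this chapter does not require this step.

import numpy as np

def solve_regulator_equations(A, B, C, S, F):
    """Solve the regulator equations XS = AX + BU, CX = F for X and U."""
    A, B, C = np.atleast_2d(A), np.atleast_2d(B), np.atleast_2d(C)
    S, F = np.atleast_2d(S), np.atleast_2d(F)
    n, m = B.shape
    p, q = C.shape[0], S.shape[0]
    # Unknown z = [vec(X); vec(U)] with column-major (Fortran) vectorization
    top = np.hstack([np.kron(S.T, np.eye(n)) - np.kron(np.eye(q), A),
                     -np.kron(np.eye(q), B)])
    bottom = np.hstack([np.kron(np.eye(q), C), np.zeros((p * q, m * q))])
    rhs = np.concatenate([np.zeros(n * q), F.flatten(order='F')])
    z, *_ = np.linalg.lstsq(np.vstack([top, bottom]), rhs, rcond=None)
    X = z[:n * q].reshape((n, q), order='F')
    U = z[n * q:].reshape((m, q), order='F')
    return X, U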

The value function that represents the infinite horizon cost of the feedback law ũ_k = K x̃_k is defined as

V(x̃_k) = Σ_{i=k}^{∞} r(e_i, ũ_i),    (6.6)

where

r(e_i, ũ_i) = e_i^T Q_y e_i + ũ_i^T R ũ_i,

with Q_y ≥ 0 and R > 0 being the user-defined weighting scalars. Under the stated assumptions, there exists a unique optimal feedback control, given by

u_k* = K* x_k + K_r* w_k,    (6.7)

where K* is the optimal feedback gain obtained by solving the ARE,

A^T P A − P + Q − A^T P B (R + B^T P B)^{-1} B^T P A = 0,    (6.8)

and is given by

K* = −(R + B^T P* B)^{-1} B^T P* A,    (6.9)

and K_r* is the optimal feedforward gain, which is related to the optimal feedback gain as

K_r* = K* X + U.

It is interesting to note that the original system (6.1) and the error dynamics system (6.4) are algebraically equivalent with the same ARE and, therefore, have the same optimal feedback control term K*. This suggests that K* can be first estimated by treating the problem as an LQR stabilization problem, as done in Chap. 2. Then, the feedforward term K_r* is estimated such that the tracking error e_k → 0 as k → ∞. In this way, we do not need to solve the Sylvester equations (6.5), whose solution would require the knowledge of the system dynamics.

The optimal output feedback tracking controller is given as the sum of the optimal output feedback and feedforward terms,

u_k* = −(H_{uu}*)^{-1} (H_{uσ}* σ_k + H_{uω}* ω_k) + K_r* w_k,    (6.10)

where σ_k and ω_k are variables in the parameterization of the state x_k, as presented in Theorem 2.1.
Remark 6.1 The above output feedback tracking law makes use of the state of the
exosystem. However, this is not a requirement as the term Kr∗ wk can be replaced
with K̄r∗ ωrk that depends on the exosystem output yrk and can be obtained by
applying the same state parameterization result in Theorem 2.1 as is done for the
state feedback terms. In this case the input dependent parameterization terms will
be zero since the exosystem is autonomous.
In the following two sections we will present learning and adaptive algorithms to
learn the above control law.

6.4 Experience Replay Based Q-learning for Estimating the Optimal Feedback Gain

In this section we present an experience replay based Q-learning scheme for
estimating the optimal feedback gain. Experience replay is a technique to relax
the excitation requirement in adaptive and learning based control algorithms by
increasing the data usage. This approach comes handy in our two degree of freedom
approach to solving the tracking problem that involves two separate learning and
adaptation phases corresponding to the feedback and feedforward terms. It is clear
from the relationship between the feedforward gain and the feedback gain that the
optimal feedback parameters need to be learnt for the feedforward term to track the
reference trajectory. Therefore, it is desirable that the learning phase of the optimal
feedback term is made shorter so that simultaneous learning of the feedforward term
and tracking of the reference signal could begin as early as possible.
Recall from Chap. 2 that, in order to solve the Q-learning equation, we require
sufficiently rich system trajectory data (σk , ωk , uk ). In the output feedback Q-
learning algorithms, we renew this dataset after every iteration once a new policy has
been learned. That is, every policy update is used to generate a new dataset, which
implies that the excitation condition needs to be maintained for all the subsequent
iterations. As a result, the exploration of state trajectories lasts for the entire learning
period. In the tracking problem, this has the effect of delaying the convergence of
the feedforward term and consequently the tracking error.
The idea behind experience replay is to make efficient use of the learning datasets
by reusing them. Recall the following output feedback Q-learning equation from
Chap. 2,

z_k^T H z_k = y_k^T Q_y y_k + u_k^T R u_k + z_{k+1}^T H z_{k+1}.

Since this equation holds for all state trajectories, we can reuse the stored dataset
to learn new policies in every iteration. In this method, a fixed initial policy called
the behavioral policy u0k is first used to generate the required system data. Then,
repeated applications of policy evaluation and policy improvement on the same
dataset enable us to learn new policies. That is, experience replay affects the
utilization of the data employed in Q-learning. In a conventional on-policy learning
scheme, a new dataset corresponding to the newly learned policy is collected in
each iteration. In contrast, the presented experience replay method makes use of
the behavioral policy dataset only. This technique has been shown to be more
data efficient [4]. Furthermore, the use of historic data during learning relaxes the
exploration requirement. Unlike the usual persistence of excitation (PE) condition,
which is hard to maintain in every iteration due to the converging trend of the system
trajectories [13], we only need a rank condition on the data matrices for a single
behavioral policy dataset.
We now present the experience replay Q-learning algorithms to learn the optimal
feedback control parameters. These algorithms are essentially Algorithms 2.1

Algorithm 6.1 Output feedback Q-learning policy iteration algorithm for computing the feedback gain using experience replay
input: input-output data
output: H*
1: initialize. Select an admissible policy

   K^0 = −(H_{uu}^0)^{-1} [H_{uσ}^0  H_{uω}^0].

   Set j ← 0.
2: collect online data. Apply the behavioral policy

   u_k = K^0 [σ_k^T  ω_k^T]^T + ν_k,

   where ν_k is the exploration signal. Collect L datasets of (σ_k, ω_k, u_k) for k ∈ [0, L − 1], with L ≥ (mn + pn + m)(mn + pn + m + 1)/2.
3: repeat
4:   policy evaluation. Solve the following Bellman equation for H̄^j,

     (H̄^j)^T (z̄_k − z̄_{k+1}) = y_k^T Q_y y_k + u_k^T R u_k.

5:   policy update. Find an improved policy as

     u_k^{j+1} = −(H_{uu}^j)^{-1} (H_{uσ}^j σ_k + H_{uω}^j ω_k).

6:   j ← j + 1
7: until ‖H̄^j − H̄^{j−1}‖ < ε for some small ε > 0.
and 2.2 with the experience replay mechanism. As a result, the convergence
properties of these algorithms remain unchanged.
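To make the data reuse concrete, the Python sketch below implements the value update of Algorithm 6.2 on a single stored batch: the quadratic regressors z̄_k and z̄_{k+1} and the one-step costs are computed once from the behavioral data and are then reused in every iteration. The helper quad_basis builds the usual quadratic basis, and all names are illustrative.

import numpy as np

def quad_basis(z):
    l = len(z)
    return np.array([z[i] * z[j] for i in range(l) for j in range(i, l)])

def replay_value_iteration(z_list, r_list, n_iter=50, tol=1e-3):
    """Iterate (H_bar^{j+1})^T zbar_k = r_k + (H_bar^j)^T zbar_{k+1} on fixed data."""
    Z      = np.array([quad_basis(z_list[k])     for k in range(len(r_list))])
    Z_next = np.array([quad_basis(z_list[k + 1]) for k in range(len(r_list))])
    r = np.array(r_list)
    H_bar = np.zeros(Z.shape[1])                  # H^0 = 0 satisfies H^0 >= 0
    for _ in range(n_iter):
        target = r + Z_next @ H_bar               # right-hand side with the old estimate
        H_new, *_ = np.linalg.lstsq(Z, target, rcond=None)
        converged = np.linalg.norm(H_new - H_bar) < tol
        H_bar = H_new
        if converged:
            break
    return H_bar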

6.5 Adaptive Tracking Law

We now proceed to the design of the feedforward control gain. The closed-loop tracking error is given by

e_k = y_k − y_{rk}
    = C(A x_{k−1} + B u_{k−1}) − F w_k.    (6.11)

Assume that the optimal feedback parameters have already been learnt following the
discussion in the previous section. We now design an estimator of the feedforward
gain Kr∗ based on the tracking error. Using the estimated feedforward gain K̂r in the

Algorithm 6.2 Output feedback Q-learning value iteration algorithm for computing the feedback gain using experience replay
input: input-output data
output: H*
1: initialize. Select an arbitrary policy u_k^0 and H^0 ≥ 0. Set j ← 0.
2: collect online data. Apply the behavioral policy

   u_k = u_k^0 + ν_k,

   where ν_k is the exploration signal. Collect L datasets of (σ_k, ω_k, u_k) for k ∈ [0, L − 1], with L ≥ (mn + pn + m)(mn + pn + m + 1)/2.
3: repeat
4:   value update. Solve the following Bellman equation for H̄^{j+1},

     (H̄^{j+1})^T z̄_k = y_k^T Q_y y_k + u_k^T R u_k + (H̄^j)^T z̄_{k+1}.

5:   policy update. Find an improved policy as

     u_k^{j+1} = −(H_{uu}^{j+1})^{-1} (H_{uσ}^{j+1} σ_k + H_{uω}^{j+1} ω_k).

6:   j ← j + 1
7: until ‖H̄^j − H̄^{j−1}‖ < ε for some small ε > 0.

expression of the tracking error (6.11), we have

e_k = C(A x_{k−1} + B(K* x_{k−1} + K̂_r w_{k−1})) − F w_k
    = C(A + BK*) x_{k−1} + CB K_r* w_{k−1} + CB(K̂_r w_{k−1} − K_r* w_{k−1}) − F w_k.

For precise tracking,

F w_k = C(A + BK*) x_{k−1} + CB K_r* w_{k−1},

and hence, the tracking error is directly related to the estimation error of the feedforward gain as

e_k = CB(K̂_r w_{k−1} − K_r* w_{k−1}).

The above equation is for systems with relative degree 1. In general, for systems with relative degree n*, we have

CA^i B = 0,  0 ≤ i < n* − 1,
CA^{n*−1} B ≠ 0,

and the tracking error can be written as

e_k = CA^{n*−1} B (K̂_r w_{k−n*} − K_r* w_{k−n*})
    = k_p (K̂_r − K_r*) w_{k−n*}
    = k_p K̃_r w_{k−n*},    (6.12)

where K̃_r = K̂_r − K_r* and k_p = CA^{n*−1} B is the high frequency gain of the system.
The tracking error equation (6.12) suggests the following gradient adaptive law for updating the feedforward parameter vector K̂_r,

K̂_{r,k+1} = K̂_{r,k} − Γ sign(k_p) w_{k−n*} e_k / m_k²,    (6.13)

where

0 < Γ = Γ^T < (2/|k_p|) I

is the desired adaptation gain and

m_k = √(1 + w_{k−n*}^T w_{k−n*})

is the normalizing factor. The resulting adaptive control law is then given by

u_k* = −(H_{uu}*)^{-1} (H_{uσ}* σ_k + H_{uω}* ω_k) + K̂_r w_k.    (6.14)

Theorem 6.1 Under the adaptive laws (6.13) and (6.14), all signals are bounded
and the tracking error converges to zero asymptotically.
Proof We consider the following Lyapunov function,

V_k ≜ V(K̃_{r,k}) = |k_p| K̃_{r,k} Γ^{-1} K̃_{r,k}^T.

Evaluation of the change in V_k along the evolution of K̂_{r,k} as described by Equation (6.13) results in

V_{k+1} − V_k = −(2 − |k_p| w_{k−n*}^T Γ w_{k−n*} / m_k²) e_k² / m_k²
             ≤ −α e_k² / m_k²,    (6.15)

for some α > 0 if

0 < Γ = Γ^T < (2/|k_p|) I.

As a result, V_k, and hence K̃_{r,k} and K̂_{r,k}, are all bounded. Furthermore, by (6.12) and the boundedness of the reference signal w_k, e_k and m_k are also bounded, that is, e_k ∈ l_∞ and m_k ∈ l_∞. Summing (6.15) from k to ∞, we have

α Σ_{i=k}^{∞} e_i² / m_i² ≤ V_k − V_∞ ≤ V_0,

that is,

e_k / m_k ∈ l_2,

which implies that

lim_{k→∞} e_k / m_k = 0.

Since m_k > 0 is bounded from above, we have

lim_{k→∞} e_k = 0.

This completes the proof.

Remark 6.2 The adaptation mechanism for estimating the feedforward gain uses only the information of the relative degree n* and the sign of the high frequency gain k_p. By selecting Γ sufficiently small, we can satisfy the design condition

0 < Γ < (2/|k_p|) I

without requiring the exact knowledge of the model parameters.
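A minimal Python sketch of one step of the normalized gradient law (6.13) is given below; it is written for the single-input case treated here, assumes the delayed exosystem state w_{k−n*} and the tracking error e_k are available at each step, and uses illustrative names.

import numpy as np

def feedforward_update(K_r_hat, w_delayed, e_k, Gamma, sign_kp):
    """One step of the normalized gradient law (6.13).

    K_r_hat   : current estimate of the feedforward gain (length-q array)
    w_delayed : exosystem state w_{k - n*} (length-q array)
    e_k       : scalar tracking error y_k - y_rk
    Gamma     : q x q adaptation gain with 0 < Gamma < (2/|k_p|) I
    sign_kp   : sign of the high frequency gain k_p
    """
    m_sq = 1.0 + float(w_delayed @ w_delayed)   # m_k^2 with m_k = sqrt(1 + w^T w)
    return K_r_hat - sign_kp * e_k * (Gamma @ w_delayed) / m_sq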



6.6 Multi-Agent Synchronization

Consider a discrete-time linear multi-agent system comprising N follower agents, denoted as v_i, i = 1, 2, ···, N, with dynamics given by

x_{i,k+1} = A_i x_{i,k} + B_i u_{i,k},
y_{i,k} = C_i x_{i,k},    (6.16)

and a leader agent, denoted as v_0, with its dynamics given by

w_{k+1} = S w_k,
y_{rk} = F w_k,    (6.17)

where xi,k ∈ Rni , ui,k ∈ R, and yi,k ∈ R are, respectively, the state, input, and
output of follower agent i, and wk ∈ Rn0 and yrk ∈ R are, respectively, the state and
output of the leader agent. All agents are allowed to have different dynamics. Let
the relative degree of follower agent i be n∗i . The follower agents can be divided into
two categories depending upon their connectivity with the leader agent. Agents that
are directly connected to the leader are referred to as informed agents, while agents
that do not have direct access to the leader information are called uninformed agents.
The exchange of information among the agents is formally described by a
directed graph (digraph) G = {V, E}, comprising of the vertices (nodes) V =
{v0 , v1 , v2 , · · · , vN } and the edges E ⊆ V × V. A pair (vi , vj ) ∈ E represents an
edge that enables the information flow from node vi to node vj . In this case, the
nodes v_i and v_j are referred to as the parent and child node of each other. The set of (in-)neighbors of v_i is defined as N_i = {v_j : (v_j, v_i) ∈ E}, that is, N_i represents the set of the parent nodes of node v_i. A directed path from v_{i1} to v_{il} is a sequence of edges
(vi1 , vi2 ), (vi2 , vi3 ), · · · , (vil−1 , vil ). Node vil is said to be reachable from node vi1
if there exists a directed path from vi1 to vil .
Assumption 6.5 The leader agent is reachable from every follower agent in the graph, and every follower agent knows which of its neighbors can reach the leader without going through the agent itself.
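Assumption 6.5 can be checked directly from the edge set of the digraph: starting from each follower and moving through its (in-)neighbors, the leader v_0 must eventually be encountered. The Python sketch below performs this check with a breadth-first search; node labels and the edge representation are illustrative.

from collections import deque

def leader_reachable_from_all(edges, followers, leader=0):
    """edges: iterable of (parent, child) pairs; followers: iterable of node ids."""
    # in_neighbors[i] = set of parents of node i
    in_neighbors = {}
    for parent, child in edges:
        in_neighbors.setdefault(child, set()).add(parent)
    for i in followers:
        visited, queue = {i}, deque([i])
        found = False
        while queue:
            node = queue.popleft()
            if node == leader:
                found = True
                break
            for parent in in_neighbors.get(node, ()):
                if parent not in visited:
                    visited.add(parent)
                    queue.append(parent)
        if not found:
            return False
    return True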
The distributed optimal output feedback tracking controller is given as the sum of the optimal output feedback and feedforward terms,

u_{i,k}* = −(H_{i,uu}*)^{-1} (H_{i,uσ}* σ_{i,k} + H_{i,uω}* ω_{i,k}) + K̄_{ri}* ω_{rk}.    (6.18)

In Equation (6.18), we have used the parameterization w_k = M_{ry} ω_{rk} obtained from the state parameterization result in Theorem 2.1, which results in the relationship

K_{ri}* w_k = K_{ri}* M_{ry} ω_{rk} ≜ K̄_{ri}* ω_{rk}.

Here ωrk depends on the leader output yrk . This parameterization is needed because
the agents may not have access to the internal state of the leader. The adaptation
mechanism for the informed agent is derived similar to that for the single-agent
case discussed in the previous section, that is,

K̄ˆ_{ri,k+1} = K̄ˆ_{ri,k} − Γ_i sign(k_i) ω_{r,k−n_i*} e_{i,k} / m_{i,k}²,    (6.19)

where

0 < Γ_i = Γ_i^T < (2/|k_i|) I

is the adaptation gain and

m_{i,k} = √(1 + ω_{r,k−n_i*}^T ω_{r,k−n_i*})

is the normalizing factor. The resulting adaptive control law is given by

u_{i,k} = −(H_{i,uu}*)^{-1} (H_{i,uσ}* σ_{i,k} + H_{i,uω}* ω_{i,k}) + K̄ˆ_{ri,k} ω_{rk}.    (6.20)

We next consider the case of the uninformed follower agents, which do not have access to the output measurement y_{rk} of the leader and, therefore, to ω_{rk}. For the uninformed agents, we suggest the following adaptation mechanism,

K̄ˆ_{ri,k+1} = K̄ˆ_{ri,k} − Γ_i sign(k_i) ω_{j,k−n_i*} e_{ij,k} / m_{j,k}²,    (6.21)

where

e_{ij,k} = y_{i,k} − y_{j,k},
m_{j,k} = √(1 + ω_{j,k−n_i*}^T ω_{j,k−n_i*}),

and ω_{j,k} is obtained by applying the state parameterization result in Theorem 2.1 to the leader dynamics and using the output y_{j,k} of any neighboring parent agent j that can reach the leader under Assumption 6.5. The resulting adaptive control law is then given by

u_{i,k} = −(H_{i,uu}*)^{-1} (H_{i,uσ}* σ_{i,k} + H_{i,uω}* ω_{i,k}) + K̄ˆ_{ri,k} ω_{j,k}.    (6.22)
Theorem 6.2 Let Assumption 6.5 hold. Under the distributed adaptive laws (6.19),
(6.20), (6.21) and (6.22), all signals are bounded and the tracking errors of all the
follower agents converge to zero asymptotically.
Proof For the case of informed agents, the proof for the adaptive control laws (6.19)
and (6.20) is the same as that of the single-agent tracking problem as shown in
Theorem 6.1. For the case of uninformed agents, we know from Assumption 6.5
that the leader is reachable from every follower agent, which implies that, for any
uninformed follower agent i, there is a parent node in Ni (which itself may be
uninformed) that synchronizes with the leader. Let vj be such a parent node with
possible successive parent nodes vj1 , vj2 , · · · , vjl and let vjl be connected to an
informed agent vinformed whose output yinformed synchronizes with that of the leader
by Theorem 6.1. Thus, ωinformed k → ωrk as k → ∞ by Theorem 6.1. As a result,
yinformed k can serve as a reference output for vjl and thus, vjl can use ωinformed k
instead of ωrk in (6.19), which by Theorem 6.1 results in yjl ,k → yrk as k → ∞.
Similarly, the outputs of nodes vjl−1 , · · · , vj2 , vj1 , vj synchronize with the leader’s
output yrk , which in turn implies that, for the child node vi , yi,k → yrk as k → ∞.
This completes the proof.

6.7 Numerical Examples

Example 6.1 (DC Motor Speed Control) Consider a DC motor system given in the state space form (6.1) with

A = [0.3678  0.0564; −0.0011  0.8187],   B = [0.0069; 0.1813],   C = [1  0],   S = 1,   F = 1.

The states x1 and x2 represent the motor speed and current, respectively, and
the control u is the applied motor voltage. A constant unit speed reference is
generated using the exosystem. Let Qy = 1 and R = 1 define the performance
index. We employ the above model to simulate the target system only and the
controller does not utilize any knowledge of these matrices. Based on these model
parameters, we compute the nominal optimal parameters for comparison with
the estimates resulting from the proposed algorithm. The nominal optimal output
feedback controller parameters as found by solving the ARE (6.8) are as follows,

Fig. 6.1 Example 6.1: Output tracking response


 
H_uσ = [0.0008  0.0003],
H_uω = [0.0544  −0.0193],
H_uu = 1.0008,

and the optimal feedforward gain as obtained by solving (6.5) is

K_r* = 10.0557.
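These nominal values can be cross-checked with a few lines of Python when the model is known, assuming the ARE (6.8) is solved with the weighting Q = CᵀQ_yC; the sketch below uses SciPy's discrete-time Riccati solver and a direct linear solve of the regulator equations (6.5) for this constant-reference case, and should reproduce values close to K* and K_r* reported above. It is only a verification aid and is not used by the learning scheme.

import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[0.3678, 0.0564], [-0.0011, 0.8187]])
B = np.array([[0.0069], [0.1813]])
C = np.array([[1.0, 0.0]])
Qy, R = 1.0, np.array([[1.0]])

# Optimal feedback gain from the ARE (6.8) with Q = C^T Qy C (assumed weighting)
P = solve_discrete_are(A, B, Qy * C.T @ C, R)
K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)        # (6.9)

# Regulator equations (6.5) with S = 1, F = 1: (I - A)X - BU = 0, CX = 1
M = np.block([[np.eye(2) - A, -B], [C, np.zeros((1, 1))]])
sol = np.linalg.solve(M, np.array([0.0, 0.0, 1.0]))
X, U = sol[:2].reshape(2, 1), sol[2:].reshape(1, 1)

Kr = K @ X + U                                            # optimal feedforward gain
print("K =", K, " Kr =", Kr)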

Since the system is open-loop stable, we use the PI algorithm. The initial controller
parameters are set to zero and the rank condition is ensured by adding sinusoids of
different frequencies in the behavioral policy. The system has a relative degree n∗ =
1 with the high frequency gain kp = 1. The convergence criterion ε = 0.001 and the
adaptation gain Γ = 0.5I are chosen. The matrix A in the state parameterization is defined as

A = [0  0; 1  0].

Figure 6.1 shows the tracking response. It takes 3 iterations for the feedback
parameters to converge to their optimal values as shown in Fig. 6.2, and L = 18
data samples are collected using the behavioral policy. After these iterations, the
gradient algorithm drives the tracking error e_k to approach 0, as shown in Fig. 6.1. The convergence of the feedforward parameters is shown in Fig. 6.3. Notice that no
discounting factor is employed in the proposed method, and the result is identical
to that obtained by solving the ARE and the Sylvester equations. Furthermore, the
exploration signal does not incur any bias in the estimates, which is an advantage of
the proposed scheme.

Example 6.2 (Discrete-time Double Integrator) The double integrator system rep-
resents a large class of practical systems, such as satellite attitude control and rigid
body motion. Consider a discrete-time double integrator system given in the state
space form (6.1) with

Fig. 6.2 Example 6.1: Convergence of the feedback parameters

Fig. 6.3 Example 6.1: Convergence of the feedforward parameters

 
A = [1  1; 0  1],   B = [0; 1],   C = [1  0],   S = [cos(0.1)  sin(0.1); −sin(0.1)  cos(0.1)],   F = [1  0].

A sinusoidal reference signal is generated using the exosystem. Let Qy = 1 and


R = 1 define the performance index. We employ the given model to simulate the
target system only and the controller does not utilize any knowledge of the model.
Based on the model parameters, we compute the nominal optimal parameters for
comparison with the estimates resulting from the proposed algorithm. The nominal
output feedback controller parameters are


 
H_uσ = [5.4117  7.4927],
H_uω = [9.5737  −7.4927],
H_uu = 4.3306,

and the optimal feedforward gain as obtained by solving (6.5) is given by

K_r* = [0.4644  0.1237].

The system has repeated eigenvalues equal to 1 and, therefore, is open-loop


unstable. Thus, we use the VI algorithm. The system has a relative degree n∗ = 2
with the high frequency gain kp = 1. The initial controller parameters are set to zero
and the rank condition is ensured by adding sinusoids of different frequencies. The
convergence criterion ε = 0.1 and the adaptation gain Γ = 0.3I are chosen. The matrix A in the state parameterization is defined as

A = [0  0; 1  0].

Figure 6.4 shows the tracking response. It takes 10 iterations for the feedback
parameters to converge to their optimal values as shown in Fig. 6.5, and L = 18
data samples are collected using the behavioral policy. After these iterations, the
gradient algorithm drives the tracking error ek to approach 0, as shown in Fig. 6.4.
The convergence of the feedforward parameters is shown in Fig. 6.6. It can be
seen that the result matches the solution of the Riccati and Sylvester equations.
Furthermore, the proposed output feedback method is free from the discounting factor and the exploration noise bias.

Example 6.3 (Multi-Agent Synchronization) We will now validate the distributed


multi-agent control algorithm through numerical simulation. We show here that the
resulting distributed controller is able to achieve optimal output synchronization
with convergence to the optimal parameters.

Fig. 6.4 Example 6.2: Output tracking response



150

Ĥ →H ∗  100

50

0
0 1 2 3 4 5 6 7 8 9 10
iterations

Fig. 6.5 Example 6.2: Convergence of the feedback parameters

Fig. 6.6 Example 6.2: Convergence of the feedforward parameters

Consider the multi-agent system (6.16) with N = 4 and the dynamics matrices given by

A_1 = 1,   B_1 = 1,   C_1 = 1,
A_2 = [1.1  −0.3; 1  0],   B_2 = [1; 0],   C_2 = [1  −0.8],
A_3 = [1.8  −0.7; 1  0],   B_3 = [1; 0],   C_3 = [1  −0.5],

Fig. 6.7 Example 6.3: Communication topology of the multi-agent system

A_4 = [0.2  0.5  0; 0.3  0.6  0.5; 0  0  0.8],   B_4 = [1; 0],   C_4 = [1  0  1],
S = [cos(0.5)  sin(0.5); −sin(0.5)  cos(0.5)],   F = [1  0].

The communication topology of the agents is provided in Fig. 6.7.


Let the user-defined performance index of the agents be specified by Q_{iy} = 1
and Ri = 1 for i = 1, 2, 3, 4. For comparison, the nominal optimal output feedback
and feedforward parameters can be found by solving the Riccati equation (6.8) and
the regulator equations (6.5) and by computing the coefficients Miu and Miy . The
nominal output feedback controller parameters are

H_{1,uσ} = 1.6180,
H_{1,uω} = 1.6180,
H_{1,uu} = 2.6180,
H_{2,uσ} = [0.3100  −0.9635],
H_{2,uω} = [0.9895  −0.3613],
H_{2,uu} = 2.0504,
H_{3,uσ} = [2.5416  −1.9065],


 
H_{3,uω} = [4.9055  −2.9360],
H_{3,uu} = 3.0467,
H_{4,uσ} = [2.2045  −1.2962  −0.5412],
H_{4,uω} = [3.4558  −1.8868  −0.0722],
H_{4,uu} = 2.9172,

and the optimal feedforward gains as obtained by solving (6.5) are

K̄_{r1} = [1.4102  −1.3732],
K̄_{r2} = [1.7727  −1.5139],
K̄_{r3} = [1.2694  −1.2625],
K̄_{r4} = [1.2680  −1.2282].

The initial controller parameters are set to be zero and the rank condition in Chap. 2
is ensured by adding sinusoids of different frequencies in the behavioral policy.
The convergence criterion ε = 0.001 and the adaptation gain Γ_i = I are chosen
for all the follower agents. The output tracking responses of the agents are shown
in Fig. 6.8. The synchronization errors converge to zero as shown in Fig. 6.9. The
convergence of the output feedback control parameters is shown in Fig. 6.10. The
estimated feedback control parameters are

Ĥ_{1,uσ} = 1.6180,
Ĥ_{1,uω} = 1.6180,

Fig. 6.8 Example 6.3: Output synchronization responses of agents



Fig. 6.9 Example 6.3: Synchronization errors

Fig. 6.10 Example 6.3: Convergence of the feedback parameters

Ĥ_{1,uu} = 2.6180,
Ĥ_{2,uσ} = [0.3100  −0.9635],
Ĥ_{2,uω} = [0.9895  −0.3613],
Ĥ_{2,uu} = 2.0504,
Ĥ_{3,uσ} = [2.5416  −1.9065],
Ĥ_{3,uω} = [4.9055  −2.9360],
Ĥ_{3,uu} = 3.0467,
Ĥ_{4,uσ} = [2.2045  −1.2962  −0.5412],
Ĥ_{4,uω} = [3.4558  −1.8868  −0.0722],

Ĥ_{4,uu} = 2.9172,

and the estimated feedforward control parameters are

K̄ˆ_{r1} = [1.4102  −1.3732],
K̄ˆ_{r2} = [1.7727  −1.5138],
K̄ˆ_{r3} = [1.2696  −1.2627],
K̄ˆ_{r4} = [1.2682  −1.2283].

Notice that different numbers of data samples L are collected for each agent so as to satisfy the rank condition given in Chap. 2. Once the Q-learning iterations for the optimal feedback parameters are completed, the gradient algorithm also converges and the tracking error ei,k → 0. Notice also that no discounting factor is employed in the proposed method, and the result is identical to that obtained by solving the ARE and the Sylvester equations.
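
As an aside, the way such a rank condition is typically enforced and checked in practice can be sketched as follows. The code is a minimal illustration only (the frequencies, amplitudes, and function names are not taken from the text): it injects a sum of sinusoids of different frequencies into the behavioral policy and verifies that the stacked data matrix used in the least-squares Q-learning update has full column rank before the parameters are estimated.

import numpy as np

freqs = [0.1, 0.25, 0.7, 1.3, 2.9]          # distinct exploration frequencies (illustrative)

def exploration_signal(k, amplitude=0.1):
    # Sum of sinusoids added to the behavioral (data-generating) policy.
    return amplitude * sum(np.sin(w * k) for w in freqs)

def behavioral_input(u_nominal, k):
    return u_nominal + exploration_signal(k)

def rank_condition_satisfied(Phi, tol=1e-8):
    # Phi stacks one regression row per collected sample (L rows in total);
    # the least-squares parameter estimate is unique only if Phi has full column rank.
    return np.linalg.matrix_rank(Phi, tol=tol) == Phi.shape[1]

In this spirit, a data matrix built from L samples would be accepted only when the rank check holds; otherwise more samples are collected, which is why the number of samples L may differ from agent to agent.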

6.8 Summary

In this chapter, we have presented model-free output feedback RL algorithms to solve the optimal tracking problem along with its extension to the problem of
multi-agent system synchronization. A distributed two degree of freedom approach
is presented that separates the learning of the optimal output feedback and the
feedforward terms. The optimal feedback parameters are learned using the proposed
output feedback Q-learning Bellman equation, whereas the estimation of the
optimal feedforward control parameters is achieved using an adaptive algorithm
that guarantees convergence to zero of the tracking error. This scheme is extended
to solve the output synchronization problem for multi-agent systems, in which
the agents are categorized as informed and uninformed follower agents depending
on their connectivity with the leader agent. The distributed control law for the
informed follower agents corresponds to the one obtained by solving the optimal
tracking problem, whereas the remaining uninformed follower agents follow any parent node that reaches the leader without passing back through the follower agent itself. Compared with previous works, the resulting distributed control laws neither require access to the full state of the agents nor need an additional distributed leader state observer. The two degree of freedom approach has the advantage over the more common state augmentation approach in that it circumvents the need to introduce a discounting factor in the performance function. It is shown that
the proposed algorithms converge to the optimal solution of the algebraic Riccati
equation and the Sylvester equations without requiring the knowledge of the system
dynamics or the full state of the system. Simulation results have been presented that
confirm the effectiveness of the proposed method.
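
To make the structure of the two degree of freedom design summarized above concrete, the following schematic sketch outlines the separation of the two learning tasks. It is not the book's implementation: the update routines are placeholders for the least-squares solution of the output feedback Q-learning Bellman equation and for the adaptive feedforward law described in this chapter.

import numpy as np

def two_degree_of_freedom_learning(collect_data, q_learning_update, adapt_feedforward,
                                   eps=1e-3, max_iterations=50):
    # Phase 1: learn the optimal output feedback parameters H from measured data only.
    H_hat = None
    for _ in range(max_iterations):
        data = collect_data()                    # input/output samples under the behavioral policy
        H_new = q_learning_update(H_hat, data)   # least-squares solution of the Bellman equation
        if H_hat is not None and np.linalg.norm(H_new - H_hat) < eps:
            H_hat = H_new
            break
        H_hat = H_new
    # Phase 2: adapt the feedforward gain online; the measured tracking error drives
    # the update until the error converges to zero.
    K_r_hat = adapt_feedforward(H_hat)
    return H_hat, K_r_hat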

6.9 Notes and References

Tracking control remains one of the most common applications of control theory.
Some of the most popular and successful applications of tracking control can be
found in the areas of robotics, aerospace, automobiles, and more recently, the multi-
agent systems [23, 24, 36–38, 45, 61, 84, 87, 115, 144]. The significant demand
for improved tracking schemes in such diverse applications have resulted in a
wide variety of approaches to solving this problem. Notable techniques include
the traditional PID tracking control [19], the tracking control designs [90], robust
designs[89], nonlinear trackers [134], adaptive tracking [25, 116] and intelligent
tracking control [123].
Ideas from optimal control theory for solving the optimal stabilization problem
have been successfully extended to find the solution of the optimal tracking problem.
In particular, the linear quadratic tracker (LQT) is regarded as a classical optimal
tracking controller since it is a natural extension of the celebrated linear quadratic
regulator (LQR). However, different from LQR, the optimal tracking control
involves a feedforward term that makes the tracking problem non-trivial. Several efforts have been devoted to addressing the difficulty of finding this feedforward component. The classical LQT controller employs a noncausal equation that is solved backward in time to compute the feedforward trajectory [55], which generally requires pre-computation using the model and offline storage of the trajectory. On the other hand, if the input coupling matrix is invertible, the technique of dynamic inversion is also applicable for finding the feedforward term [136]. Nevertheless, one of the most formal frameworks for solving the tracking problem is the output regulation paradigm [39], which employs the internal model principle to compute the feedforward term. This method also has the advantage of handling disturbances. However, all these techniques are model-based since they require precise knowledge of the system dynamics.
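
For reference, the model-based feedforward computation in the output regulation framework amounts to solving the regulator equations. A minimal sketch is given below; it assumes the standard form ΠS = AΠ + BΓ, CΠ = F of these equations (Π mapping the exosystem state to the plant state and Γ giving the feedforward input), uses a Kronecker-product vectorization, and treats the model matrices as known placeholders.

import numpy as np

def solve_regulator_equations(A, B, C, F, S):
    # Solve Pi*S = A*Pi + B*Gamma and C*Pi = F for (Pi, Gamma) by vectorization,
    # with A (n x n), B (n x m), C (p x n), F (p x q), and exosystem matrix S (q x q).
    n, m = B.shape
    p, q = F.shape
    I_q = np.eye(q)
    # Unknown z = [vec(Pi); vec(Gamma)] with column-wise vec:
    # kron(S', I_n) vec(Pi) - kron(I_q, A) vec(Pi) - kron(I_q, B) vec(Gamma) = 0
    top = np.hstack([np.kron(S.T, np.eye(n)) - np.kron(I_q, A), -np.kron(I_q, B)])
    # kron(I_q, C) vec(Pi) = vec(F)
    bot = np.hstack([np.kron(I_q, C), np.zeros((p * q, m * q))])
    M = np.vstack([top, bot])
    rhs = np.concatenate([np.zeros(n * q), F.flatten(order="F")])
    z, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    Pi = z[: n * q].reshape((n, q), order="F")
    Gamma = z[n * q:].reshape((m, q), order="F")
    return Pi, Gamma
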
The success of reinforcement learning in solving the classical optimal control
problems such as the LQR problem has led to the development of model-free
designs for solving the LQT problem. A variety of tracking schemes have been
proposed in the RL control literature for different classes of systems [77, 78, 143].
One of the primary challenges in the RL tracking control design is associated with
the learning of the feedforward term. RL relies on a suitable cost function during the
learning phase. Such a cost function is difficult to form in the presence of a reference
signal that may not decay to zero. In particular, it leads to an ill-defined cost function
as the resulting infinite horizon cost is not finite. An elegant approach to addressing
this difficulty in RL based LQT designs involves the idea of augmenting the system
dynamics with that of the reference generator (exosystem). The RL problem then
boils down to learning the optimal feedback controller for the augmented system,
which implicitly incorporates the feedforward term. The approach is very effective
as long as the reference generator is asymptotically stable. In that case, the problem
then reduces to a stabilization problem.
Very often in a practical setting the exosystem is required to be neutrally stable to
generate non-vanishing reference trajectories. Since the exosystem is autonomous,
the resulting augmented system is no longer stabilizable. Consequently, the infinite
horizon cost function becomes ill-defined due to the presence of the non-decaying
state. In the RL control literature, this difficulty is addressed by means of introducing
a discounting factor that makes the cost function well-defined [77]. Equivalently, the
discounting factor can be considered a modifier to the augmented system dynamics,
thereby rendering it stabilizable. Based on this approach, Q-learning schemes
have been proposed to solve the tracking problem for discrete-time linear systems
[49]. Output feedback extensions based on the output feedback value function
approximation approach [56] have also been developed later in [50]. A difficulty in
the state augmentation based approaches is that the discounted solution is different
from the nominal solution of the original LQT problem. More importantly, the
discounted solution may not be a stabilizing one or may not guarantee asymptotic
tracking if the discounting factor is not carefully selected [88]. Therefore, it is desirable to lift these design restrictions. Along these lines, some notable RL results have been presented that do so. In [27], an output regulation
formulation is adopted to learn the optimal feedback and the feedforward terms
based on adaptive dynamic programming. This approach involves an additional
mechanism to approximate the solution of the regulator equations. Some knowledge
of system dynamics and a stabilizing initial policy are required. Output feedback
extensions to this approach have also been presented recently [27].
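
For concreteness, the discounted performance index used in these augmented formulations typically takes a form along the lines of

J = Σ_{k=0}^∞ γ^k (x̄_kᵀ Q̄ x̄_k + u_kᵀ R u_k),  0 < γ < 1,

where x̄_k denotes the augmented state and γ the discounting factor (the notation here is illustrative rather than the chapter's); the factor γ^k keeps J finite even though the exosystem component of x̄_k does not decay.
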
Tracking control designs find applications in a wide range of multi-agent control
problems. In particular, the multi-agent synchronization problem can be formulated
as a leader-follower synchronization problem, in which each agent is required to
track the leader based on the neighborhood information. The generalization to the
case of heterogeneous agents is often desirable in applications that involve agents
having different dynamics, such as rescue operations that require a combination of ground, aerial, and underwater support. The output regulation framework has
been successfully employed to solve these synchronization problems based on the
knowledge of system dynamics [111]. The extension of single-agent reinforcement
learning has opened a new avenue to solving these problems without requiring
model information [15, 105]. Similar to the tracking problem, the idea of aug-
menting the agent dynamics with that of the leader dynamics has been employed in
solving model-free optimal synchronization problems [47, 82]. However, discounted
cost functions are employed in these works due to the reasons highlighted earlier. A
challenging problem in multi-agent leader-following schemes is that the information
of the leader is not readily available to each agent. Distributed adaptive observers
are employed to address this difficulty by estimating the leader state for every agent
in both model-based [17, 40] and data-driven [47, 82] approaches. However, some
knowledge of the leader dynamics matrix along with the knowledge of the graph
network is needed in designing the observer [47]. In [30, 31], output regulation
based approximate dynamic programming techniques have been presented to achieve optimal output synchronization of heterogeneous multi-agent systems. All
these approaches make use of the full states of the follower agents.
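
For context, the local synchronization error that drives such distributed designs is commonly built from neighborhood information only, for example

e_i = Σ_{j∈N_i} a_{ij} (y_j − y_i) + g_i (y_0 − y_i),

where a_{ij} are the adjacency weights of the communication graph, g_i is nonzero only for followers pinned to the leader, and y_0 is the leader output. This is a representative form rather than the precise definition used in this chapter, but it makes explicit that each follower needs only the outputs of its neighbors and, when pinned, of the leader.
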
The presentation of the single-agent tracking results in this chapter follows from
[93, 97], whereas the results on the multi-agent synchronization problem are from
[100].
References

1. Aangenent, W., Kostic, D., de Jager, B., van de Molengraft, R., Steinbuch, M.: Data-based
optimal control. In: Proceedings of the 2005 American Control Conference, pp. 1460–1465
(2005)
2. Abouheaf, M.I., Lewis, F.L., Vamvoudakis, K.G., Haesaert, S., Babuska, R.: Multi-agent
discrete-time graphical games and reinforcement learning solutions. Automatica 50(12),
3038–3053 (2014)
3. Abu-Khalaf, M., Lewis, F.L.: Nearly optimal control laws for nonlinear systems with
saturating actuators using a neural network HJB approach. Automatica 41(5), 779–791 (2005)
4. Adam, S., Busoniu, L., Babuska, R.: Experience replay for real-time reinforcement learning
control. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 42(2), 201–212 (2012)
5. Al-Tamimi, A., Lewis, F.L., Abu-Khalaf, M.: Model-free Q-learning designs for linear
discrete-time zero-sum games with application to H-infinity control. Automatica 43(3), 473–
481 (2007)
6. Başar, T., Bernhard, P.: H-infinity Optimal Control and Related Minimax Design Problems:
A Dynamic Game Approach. Springer Science & Business Media (2008), New York, NY
7. Bellman, R.E.: Dynamic Programming. Princeton University Press (1957), Princeton, NJ
8. Bertsekas, D.: Dynamic Programming and Optimal Control: Volume I and II. Athena
Scientific (2012), Belmont, MA
9. Bertsekas, D.: Reinforcement Learning and Optimal Control. Athena Scientific (2019),
Belmont, MA
10. Bian, T., Jiang, Z.P.: Data-driven robust optimal control design for uncertain cascaded systems
using value iteration. In: Proceedings of the 54th Annual Conference on Decision and Control
(CDC), pp. 7610–7615. IEEE (2015)
11. Bian, T., Jiang, Z.P.: Value iteration and adaptive dynamic programming for data-driven
adaptive optimal control design. Automatica 71, 348–360 (2016)
12. Boltyanskii, V., Gamkrelidze, R., Pontryagin, L.: On the theory of optimal processes. Dokl. Akad. Nauk SSSR 110(1), 7–10 (1956)
13. Bradtke, S.J., Ydstie, B.E., Barto, A.G.: Adaptive linear quadratic control using policy
iteration. In: Proceedings of the 1994 American Control Conference, pp. 3475–3479 (1994)
14. Buşoniu, L., Babuška, R., De Schutter, B.: Multi-agent Reinforcement Learning: An
Overview. Innovations in Multi-agent Systems and Applications, pp. 183–221 (2010),
Springer, Berlin, Heidelberg
15. Busoniu, L., Babuska, R., De Schutter, B.: A comprehensive survey of multiagent reinforce-
ment learning. IEEE Trans. Syst. Man Cybern. Part C Appl. Rev. 38(2), 156–172 (2008)


16. Busoniu, L., Babuska, R., De Schutter, B., Ernst, D.: Reinforcement Learning and Dynamic
Programming Using Function Approximators. CRC Press (2017)
17. Cai, H., Lewis, F.L., Hu, G., Huang, J.: The adaptive distributed observer approach to the
cooperative output regulation of linear multi-agent systems. Automatica 75, 299–305 (2017)
18. Chen, C., Lewis, F.L., Xie, K., Xie, S., Liu, Y.: Off-policy learning for adaptive optimal output
synchronization of heterogeneous multi-agent systems. Automatica 119, 109081 (2020)
19. Choi, Y., Chung, W.K.: PID Trajectory Tracking Control for Mechanical Systems, vol. 298.
Springer (2004)
20. Ding, Z., Dong, H.: Challenges of reinforcement learning. In: H. Dong, Z. Ding, S. Zhang
(eds.) Deep Reinforcement Learning: Fundamentals, Research and Applications, pp. 249–
272. Springer Singapore, Singapore (2020)
21. Dong, L., Zhong, X., Sun, C., He, H.: Event-triggered adaptive dynamic programming for
continuous-time systems with control constraints. IEEE Trans. Neural Netw. Learn. Syst.
28(8), 1941–1952 (2017)
22. Dong, L., Zhong, X., Sun, C., He, H.: Adaptive event-triggered control based on heuristic
dynamic programming for nonlinear discrete-time systems. IEEE Trans. Neural Netw. Learn.
Syst. (to appear)
23. Encarnação, P., Pascoal, A.: Combined trajectory tracking and path following: an application
to the coordinated control of autonomous marine craft. In: Proceedings of the 40th IEEE
Conference on Decision and Control 2001, vol. 1, pp. 964–969. IEEE (2001)
24. Fleming, A.J., Aphale, S.S., Moheimani, S.R.: A new method for robust damping and tracking
control of scanning probe microscope positioning stages. IEEE Trans. Nanotechnol. 9(4),
438–448 (2010)
25. Fukao, T., Nakagawa, H., Adachi, N.: Adaptive tracking control of a nonholonomic mobile
robot. IEEE Trans. Robot. Autom. 16(5), 609–615 (2000)
26. Fuller, A.: In-the-large stability of relay and saturating control systems with linear controllers.
Int. J. Control. 10(4), 457–480 (1969)
27. Gao, W., Jiang, Z.P.: Adaptive dynamic programming and adaptive optimal output regulation
of linear systems. IEEE Trans. Autom. Control 61(12), 4164–4169 (2016)
28. Gao, W., Jiang, Z.P.: Data-driven adaptive optimal output-feedback control of a 2-DOF
helicopter. In: Proceedings of the 2016 American Control Conference, pp. 2512–2517 (2016)
29. Gao, W., Jiang, Z.P.: Adaptive optimal output regulation of time-delay systems via measure-
ment feedback. IEEE Trans. Neural Netw. Learn. Syst. 30(3), 938–945 (2018)
30. Gao, W., Jiang, Z.P., Lewis, F.L., Wang, Y.: Leader-to-formation stability of multi-agent
systems: An adaptive optimal control approach. IEEE Trans. Autom. Control 63(10), 3581–
3587 (2018)
31. Gao, W., Liu, Y., Odekunle, A., Yu, Y., Lu, P.: Adaptive dynamic programming and
cooperative output regulation of discrete-time multi-agent systems. Int. J. Control Autom.
Syst. 16(5), 2273–2281 (2018)
32. Gu, K., Chen, J., Kharitonov, V.L.: Stability of Time-Delay Systems. Springer Science &
Business Media (2003)
33. Hagander, P., Hansson, A.: Existence of discrete-time LQG-controllers. Syst. Control Lett.
26(4), 231–238 (1995)
34. He, P., Jagannathan, S.: Reinforcement learning-based output feedback control of nonlinear
systems with input constraints. IEEE Trans. Syst. Man Cybern. Part B Cybern. 35(1), 150–
154 (2005)
35. Hewer, G.: An iterative technique for the computation of the steady state gains for the discrete
optimal regulator. IEEE Trans. Autom. Control 16(4), 382–384 (1971)
36. Hoffmann, G., Waslander, S., Tomlin, C.: Quadrotor helicopter trajectory tracking control. In:
AIAA Guidance, Navigation and Control Conference and Exhibit, p. 7410 (2008)
37. Hong, Y., Hu, J., Gao, L.: Tracking control for multi-agent consensus with an active leader
and variable topology. Automatica 42(7), 1177–1182 (2006)
38. Hu, J., Feng, G.: Distributed tracking control of leader–follower multi-agent systems under
noisy measurement. Automatica 46(8), 1382–1387 (2010)

39. Huang, J.: Nonlinear Output Regulation: Theory and Applications. SIAM (2004)
40. Huang, J.: The cooperative output regulation problem of discrete-time linear multi-agent
systems by the adaptive distributed observer. IEEE Trans. Autom. Control 62(4), 1979–1984
(2017)
41. Ioannou, P., Fidan, B.: Adaptive Control Tutorial. SIAM (2006)
42. Jiang, Y., Jiang, Z.P.: Computational adaptive optimal control for continuous-time linear
systems with completely unknown dynamics. Automatica 48(10), 2699–2704 (2012)
43. Jiang, Y., Jiang, Z.P.: Robust Adaptive Dynamic Programming. John Wiley & Sons (2017)
44. Kahn, G., Villaflor, A., Ding, B., Abbeel, P., Levine, S.: Self-supervised deep reinforcement
learning with generalized computation graphs for robot navigation. In: 2018 IEEE Interna-
tional Conference on Robotics and Automation (ICRA), pp. 5129–5136. IEEE (2018)
45. Kaminer, I., Pascoal, A., Hallberg, E., Silvestre, C.: Trajectory tracking for autonomous
vehicles: an integrated approach to guidance and control. J. Guid. Control Dynam. 21(1),
29–38 (1998)
46. Kiumarsi, B., Lewis, F.L.: Actor-critic-based optimal tracking for partially unknown nonlin-
ear discrete-time systems. IEEE Trans. Neural Netw. Learn. Syst. 26(1), 140–151 (2015)
47. Kiumarsi, B., Lewis, F.L.: Output synchronization of heterogeneous discrete-time systems: a
model-free optimal approach. Automatica 84, 86–94 (2017)
48. Kiumarsi, B., Lewis, F.L., Jiang, Z.P.: H∞ control of linear discrete-time systems: off-policy
reinforcement learning. Automatica 78, 144–152 (2017)
49. Kiumarsi, B., Lewis, F.L., Modares, H., Karimpour, A., Naghibi-Sistani, M.B.: Reinforce-
ment Q-learning for optimal tracking control of linear discrete-time systems with unknown
dynamics. Automatica 50(4), 1167–1175 (2014)
50. Kiumarsi, B., Lewis, F.L., Naghibi-Sistani, M.B., Karimpour, A.: Optimal tracking control of
unknown discrete-time linear systems using input-output measured data. IEEE Trans. Cybern.
45(12), 2770–2779 (2015)
51. Kleinman, D.: On an iterative technique for Riccati equation computations. IEEE Trans.
Autom. Control 13(1), 114–115 (1968)
52. Lancaster, P., Rodman, L.: Algebraic Riccati Equations. Clarendon Press (1995)
53. Landelius, T.: Reinforcement learning and distributed local model synthesis. Ph.D. thesis,
Linköping University Electronic Press (1997)
54. Lewis, F.L., Liu, D.: Reinforcement Learning and Approximate Dynamic Programming for
Feedback Control, vol. 17. John Wiley & Sons (2013)
55. Lewis, F.L., Syrmos, V.L.: Optimal Control. John Wiley & Sons (1995)
56. Lewis, F.L., Vamvoudakis, K.G.: Reinforcement learning for partially observable dynamic
processes: adaptive dynamic programming using measured output data. IEEE Trans. Syst.
Man Cybern. Part B Cybern. 41(1), 14–25 (2011)
57. Lewis, F.L., Vrabie, D.: Reinforcement learning and adaptive dynamic programming for
feedback control. IEEE Circuits Syst. Mag. 9(3), 32–50 (2009)
58. Lewis, F.L., Vrabie, D., Syrmos, V.L.: Optimal Control. John Wiley & Sons (2012)
59. Lewis, F.L., Vrabie, D., Vamvoudakis, K.G.: Reinforcement learning and feedback control:
Using natural decision methods to design optimal adaptive controllers. IEEE Control Syst.
Mag. 32(6), 76–105 (2012)
60. Li, H., Liu, D., Wang, D., Yang, X.: Integral reinforcement learning for linear continuous-time
zero-sum games with completely unknown dynamics. IEEE Trans. Autom. Sci. Eng. 11(3),
706–714 (2014)
61. Liao, F., Wang, J.L., Yang, G.H.: Reliable robust flight tracking control: an LMI approach.
IEEE Trans. Control Syst. Technol. 10(1), 76–89 (2002)
62. Lin, X., Huang, Y., Cao, N., Lin, Y.: Optimal control scheme for nonlinear systems with
saturating actuator using ε-iterative adaptive dynamic programming. In: Proceedings of 2012
UKACC International Conference on Control, pp. 58–63. IEEE (2012)
63. Lin, Z.: Low Gain Feedback. Springer (1999)

64. Lin, Z., Glauser, M., Hu, T., Allaire, P.E.: Magnetically suspended balance beam with
disturbances: a test rig for nonlinear output regulation. In: 2004 43rd IEEE Conference on
Decision and Control (CDC)(IEEE Cat. No. 04CH37601), vol. 5, pp. 4577–4582. IEEE
(2004)
65. Lin, Z., Saberi, A.: Semi-global exponential stabilization of linear systems subject to input
saturation via linear feedbacks. Syst. Control Lett. 21(3), 225–239 (1993)
66. Lin, Z., Saberi, A.: Semi-global exponential stabilization of linear discrete-time systems
subject to input saturation via linear feedbacks. Syst. Control Lett. 24(2), 125–132 (1995)
67. Liu, D., Huang, Y., Wang, D., Wei, Q.: Neural-network-observer-based optimal control for
unknown nonlinear systems using adaptive dynamic programming. Int. J. Control 86(9),
1554–1566 (2013)
68. Liu, D., Wei, Q., Wang, D., Yang, X., Li, H.: Adaptive Dynamic Programming with
Applications in Optimal Control. Springer (2017)
69. Liu, D., Yang, X., Wang, D., Wei, Q.: Reinforcement-learning-based robust controller design
for continuous-time uncertain nonlinear systems subject to input constraints. IEEE Trans.
Cybern. 45(7), 1372–1385 (2015)
70. Liu, Y., Zhang, H., Luo, Y., Han, J.: ADP based optimal tracking control for a class of linear
discrete-time system with multiple delays. J. Franklin Inst. 353(9), 2117–2136 (2016)
71. Luo, B., Wu, H.N., Huang, T.: Off-policy reinforcement learning for h-infinity control design.
IEEE Trans. Cybern. 45(1), 65–76 (2015)
72. Lyashevskiy, S.: Control of linear dynamic systems with constraints: optimization issues and
applications of nonquadratic functionals. In: Proceedings of the 35th IEEE Conference on
Decision and Control, 1996, vol. 3, pp. 3206–3211. IEEE (1996)
73. Lyshevski, S.E.: Optimal control of nonlinear continuous-time systems: design of bounded
controllers via generalized nonquadratic functionals. In: Proceedings of the 1998 American
Control Conference, vol. 1, pp. 205–209. IEEE (1998)
74. Manitius, A., Olbrot, A.: Finite spectrum assignment problem for systems with delays. IEEE
Trans. Autom. Control 24(4), 541–552 (1979)
75. Mee, D.: An extension of predictor control for systems with control time-delays. Int. J.
Control 18(6), 1151–1168 (1973)
76. Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves,
A., Riedmiller, M., Fidjeland, A.K., Ostrovski, G., et al.: Human-level control through deep
reinforcement learning. Nature 518(7540), 529–533 (2015)
77. Modares, H., Lewis, F.L.: Linear quadratic tracking control of partially-unknown continuous-
time systems using reinforcement learning. IEEE Trans. Autom. Control 59(11), 3051–3056
(2014)
78. Modares, H., Lewis, F.L.: Optimal tracking control of nonlinear partially-unknown
constrained-input systems using integral reinforcement learning. Automatica 50(7), 1780–
1792 (2014)
79. Modares, H., Lewis, F.L., Jiang, Z.P.: H∞ tracking control of completely unknown
continuous-time systems via off-policy reinforcement learning. IEEE Trans. Neural Netw.
Learn. Syst. 26(10), 2550–2562 (2015)
80. Modares, H., Lewis, F.L., Jiang, Z.P.: Optimal output-feedback control of unknown
continuous-time linear systems using off-policy reinforcement learning. IEEE Trans. Cybern.
46(11), 2401–2410 (2016)
81. Modares, H., Lewis, F.L., Naghibi-Sistani, M.B.: Integral reinforcement learning and experi-
ence replay for adaptive optimal control of partially-unknown constrained-input continuous-
time systems. Automatica 50(1), 193–202 (2014)
82. Modares, H., Nageshrao, S.P., Lopes, G.A.D., Babuška, R., Lewis, F.L.: Optimal model-free
output synchronization of heterogeneous systems using off-policy reinforcement learning.
Automatica 71, 334–341 (2016)
83. Moghadam, R., Lewis, F.L.: Output-feedback H-infinity quadratic tracking control of linear
systems using reinforcement learning. Int. J. Adapt. Control Signal Process. 33, 300–314
(2019)

84. Mu, C., Ni, Z., Sun, C., He, H.: Air-breathing hypersonic vehicle tracking control based
on adaptive dynamic programming. IEEE Trans. Neural Netw. Learn. Syst. 28(3), 584–598
(2017)
85. Mu, C., Wang, D., He, H.: Novel iterative neural dynamic programming for data-based
approximate optimal control design. Automatica 81, 240–252 (2017)
86. Narendra, K.S., Annaswamy, A.M.: Stable Adaptive Systems. Prentice Hall (1989)
87. Olfati-Saber, R.: Flocking for multi-agent dynamic systems: algorithms and theory. IEEE
Trans. Autom. Control 51(3), 401–420 (2006)
88. Postoyan, R., Busoniu, L., Nesic, D., Daafouz, J.: Stability analysis of discrete-time infinite-
horizon optimal control with discounted cost. IEEE Trans. Autom. Control 62(6), 2736–2749
(2017). https://doi.org/10.1109/TAC.2016.2616644
89. Qu, Z., Dorsey, J.: Robust tracking control of robots by a linear feedback law. IEEE Trans.
Autom. Control 36(9), 1081–1084 (1991)
90. Raptis, I.A., Valavanis, K.P., Vachtsevanos, G.J.: Linear tracking control for small-scale
unmanned helicopters. IEEE Trans. Control Syst. Technol. 20(4), 995–1010 (2012)
91. Rizvi, S.A.A., Lin, Z.: Output feedback reinforcement Q-learning control for the discrete-
time linear quadratic regulator problem. In: 2017 IEEE 56th Annual Conference on Decision
and Control (CDC), pp. 1311–1316. IEEE (2017)
92. Rizvi, S.A.A., Lin, Z.: Model-free global stabilization of discrete-time linear systems with
saturating actuators using reinforcement learning. In: 2018 IEEE Conference on Decision
and Control (CDC), pp. 5276–5281. IEEE (2018)
93. Rizvi, S.A.A., Lin, Z.: Output feedback optimal tracking control using reinforcement Q-
learning. In: 2018 Annual American Control Conference (ACC), pp. 3423–3428. IEEE (2018)
94. Rizvi, S.A.A., Lin, Z.: Output feedback Q-learning control for the discrete-time linear
quadratic regulator problem. IEEE Trans. Neural Netw. Learn. Syst. 30(5), 1523–1536 (2018)
95. Rizvi, S.A.A., Lin, Z.: Output feedback Q-learning for discrete-time linear zero-sum games
with application to the H-infinity control. Automatica 95, 213–221 (2018)
96. Rizvi, S.A.A., Lin, Z.: Output feedback reinforcement learning control for the continuous-
time linear quadratic regulator problem. In: 2018 Annual American Control Conference
(ACC), pp. 3417–3422. IEEE (2018)
97. Rizvi, S.A.A., Lin, Z.: Experience replay–based output feedback q-learning scheme for
optimal output tracking control of discrete-time linear systems. Int. J. Adapt. Control Signal
Process. 33(12), 1825–1842 (2019)
98. Rizvi, S.A.A., Lin, Z.: An iterative Q-learning scheme for the global stabilization of discrete-
time linear systems subject to actuator saturation. Int. J. Robust Nonlinear Control 29(9),
2660–2672 (2019)
99. Rizvi, S.A.A., Lin, Z.: Model-free global stabilization of continuous-time linear systems with
saturating actuators using adaptive dynamic programming. In: 2019 IEEE 58th Conference
on Decision and Control (CDC), pp. 145–150. IEEE (2019)
100. Rizvi, S.A.A., Lin, Z.: Output feedback reinforcement learning based optimal output syn-
chronisation of heterogeneous discrete-time multi-agent systems. IET Control Theory Appl.
13(17), 2866–2876 (2019)
101. Rizvi, S.A.A., Lin, Z.: Reinforcement learning-based linear quadratic regulation of
continuous-time systems using dynamic output feedback. IEEE Trans. Cybern. 50(11), 4670–
4679 (2019)
102. Rizvi, S.A.A., Lin, Z.: Adaptive dynamic programming for model-free global stabilization of
control constrained continuous-time systems. IEEE Trans. Cybern. 52(2), 1048–1060 (2022)
103. Rizvi, S.A.A., Lin, Z.: Output feedback adaptive dynamic programming for linear differential
zero-sum games. Automatica 122, 109272 (2020)
104. Rizvi, S.A.A., Wei, Y., Lin, Z.: Model-free optimal stabilization of unknown time delay
systems using adaptive dynamic programming. In: 2019 IEEE 58th Conference on Decision
and Control (CDC), pp. 6536–6541. IEEE (2019)
105. Shoham, Y., Powers, R., Grenager, T.: Multi-agent reinforcement learning: a critical survey.
Technical report, Stanford University (2003)

106. Silver, D., Hubert, T., Schrittwieser, J., Antonoglou, I., Lai, M., Guez, A., Lanctot, M., Sifre,
L., Kumaran, D., Graepel, T., et al.: A general reinforcement learning algorithm that masters
chess, shogi, and go through self-play. Science 362(6419), 1140–1144 (2018)
107. Silver, D., Schrittwieser, J., Simonyan, K., Antonoglou, I., Huang, A., Guez, A., Hubert, T.,
Baker, L., Lai, M., Bolton, A., et al.: Mastering the game of go without human knowledge.
Nature 550(7676), 354–359 (2017)
108. Smith, O.J.: Closed control of loop with dead time. Chem. Eng. Process. 53, 217–219 (1957)
109. Sontag, E.D., Sussmann, H.J.: Nonlinear output feedback design for linear systems with
saturating controls. In: Proceedings of the 29th IEEE Conference on Decision and Control,
pp. 3414–3416. IEEE (1990)
110. Stevens, B., Lewis, F.L.: Aircraft Control and Simulation. Wiley (2003)
111. Su, Y., Huang, J.: Cooperative output regulation of linear multi-agent systems. IEEE Trans.
Autom. Control 57(4), 1062–1066 (2012)
112. Sussmann, H., Sontag, E., Yang, Y.: A general result on the stabilization of linear systems
using bounded controls. In: Proceedings of 32nd IEEE Conference on Decision and Control,
pp. 1802–1807. IEEE (1993)
113. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge
(1998)
114. Sutton, R.S., Barto, A.G., Williams, R.J.: Reinforcement learning is direct adaptive optimal
control. IEEE Control Syst. 12(2), 19–22 (1992)
115. Tang, Y., Xing, X., Karimi, H.R., Kocarev, L., Kurths, J.: Tracking control of networked
multi-agent systems under new characterizations of impulses and its applications in robotic
systems. IEEE Trans. Ind. Electron. 63(2), 1299–1307 (2016)
116. Tao, G.: Adaptive Control Design and Analysis. John Wiley & Sons (2003)
117. Teel, A.R.: Global stabilization and restricted tracking for multiple integrators with bounded
controls. Syst. Control Lett. 18(3), 165–171 (1992)
118. Trentelman, H.L., Stoorvogel, A.A.: Sampled-data and discrete-time H2 optimal control.
SIAM J. Control Optim. 33(3), 834–862 (1995)
119. Vamvoudakis, K.G., Lewis, F.L., Hudas, G.R.: Multi-agent differential graphical games:
online adaptive learning solution for synchronization with optimality. Automatica 48(8),
1598–1611 (2012)
120. Vrabie, D., Lewis, F.: Adaptive dynamic programming for online solution of a zero-sum
differential game. J. Control Theory Appl. 9(3), 353–360 (2011)
121. Vrabie, D., Pastravanu, O., Abu-Khalaf, M., Lewis, F.L.: Adaptive optimal control for
continuous-time linear systems based on policy iteration. Automatica 45(2), 477–484 (2009)
122. Vrabie, D., Vamvoudakis, K.G., Lewis, F.L.: Optimal Adaptive Control and Differential
Games by Reinforcement Learning Principles, vol. 2. IET (2013)
123. Wai, R.J., Chen, P.C.: Intelligent tracking control for robot manipulator including actuator
dynamics via tsk-type fuzzy neural network. IEEE Trans. Fuzzy Syst. 12(4), 552–560 (2004)
124. Wang, F.Y., Zhang, H., Liu, D.: Adaptive dynamic programming: an introduction. IEEE
Comput. Intell. Mag. 4(2), 39–47 (2009)
125. Wang, L.Y., Li, C., Yin, G.G., Guo, L., Xu, C.Z.: State observability and observers of linear-
time-invariant systems under irregular sampling and sensor limitations. IEEE Trans. Autom.
Control 56(11), 2639–2654 (2011)
126. Watkins, C.J.: Learning from delayed rewards. Ph.D. thesis, University of Cambridge,
England (1989)
127. Watkins, C.J., Dayan, P.: Q-learning. Mach. Learn. 8(3–4), 279–292 (1992)
128. Werbos, P.: Beyond regression: new tools for prediction and analysis in the behavioral
sciences. Ph.D. dissertation, Harvard University (1974)
129. Werbos, P.J.: Neural networks for control and system identification. In: Proceedings of the
28th IEEE Conference on Decision and Control, 1989, pp. 260–265. IEEE (1989)
130. Werbos, P.J.: Approximate dynamic programming for real-time control and neural modeling.
In: Handbook of Intelligent Control: Neural, Fuzzy, and Adaptive Approaches, pp. 493–525.
Nostrand, New York (1992)

131. Werbos, P.J.: A menu of designs for reinforcement learning over time. In: Neural Networks
for Control, pp. 67–95. MIT Press (1995)
132. Wu, H.N., Luo, B.: Simultaneous policy update algorithms for learning the solution of linear
continuous-time H∞ state feedback control. Inf. Sci. 222, 472–485 (2013)
133. Yang, Y., Sontag, E.D., Sussmann, H.J.: Global stabilization of linear discrete-time systems
with bounded feedback. Syst. Control Lett. 30(5), 273–281 (1997)
134. Yeh, H.H., Nelson, E., Sparks, A.: Nonlinear tracking control for satellite formations. J. Guid.
Control Dynam. 25(2), 376–386 (2002)
135. Yoon, S.Y., Anantachaisilp, P., Lin, Z.: An LMI approach to the control of exponentially
unstable systems with input time delay. In: Proceedings of the 52nd IEEE Conference on
Decision and Control, pp. 312–317 (2013)
136. Zhang, H., Cui, L., Zhang, X., Luo, Y.: Data-driven robust approximate optimal tracking
control for unknown general nonlinear systems using adaptive dynamic programming
method. IEEE Trans. Neural Netw. 22(12), 2226–2236 (2011)
137. Zhang, H., Liu, Y., Xiao, G., Jiang, H.: Data-based adaptive dynamic programming for a class
of discrete-time systems with multiple delays. IEEE Trans. Syst. Man Cybern. Part A Syst.
Hum. 50, 1–10 (2017)
138. Zhang, H., Qin, C., Luo, Y.: Neural-network-based constrained optimal control scheme for
discrete-time switched nonlinear system using dual heuristic programming. IEEE Trans.
Autom. Sci. Eng. 11(3), 839–849 (2014)
139. Zhang, J., Zhang, H., Luo, Y., Feng, T.: Model-free optimal control design for a class of
linear discrete-time systems with multiple delays using adaptive dynamic programming.
Neurocomputing 135, 163–170 (2014)
140. Zhao, Q., Xu, H., Jagannathan, S.: Near optimal output feedback control of nonlinear discrete-
time systems based on reinforcement neural network learning. IEEE/CAA J. Autom. Sinica
1(4), 372–384 (2014)
141. Zhong, X., He, H.: An event-triggered ADP control approach for continuous-time system
with unknown internal states. IEEE Trans. Cybern. 47(3), 683–694 (2017)
142. Zhu, L.M., Modares, H., Peen, G.O., Lewis, F.L., Yue, B.: Adaptive suboptimal output-
feedback control for linear systems using integral reinforcement learning. IEEE Trans.
Control Syst. Technol. 23(1), 264–273 (2015)
143. Zhu, Y., Zhao, D., Li, X.: Using reinforcement learning techniques to solve continuous-time
non-linear optimal tracking problem without system dynamics. IET Control Theory Appl.
10(12), 1339–1347 (2016)
144. Zuo, Z.: Trajectory tracking control design with command-filtered compensation for a
quadrotor. IET Control Theory Appl. 4(11), 2343–2355 (2010)
Index

A discrete-time parameterized LQR VI,


Actor-critic, 24 168, 169
Actuator saturation, 163–166, 172, 173, 181, discrete-time zero-sum game PI, 102,
182, 191, 192, 196, 197, 200, 203, 103
205, 210, 212, 218, 222, 223 discrete-time zero-sum game VI, 102,
Adaptation, 13, 14, 64, 195 104
Adaptive/approximate dynamic programming, zero-sum game policy iteration, 98
13, 255, 280, 281 zero-sum game value iteration, 98
Adaptive control, 13, 14, 24, 33, 45, 175 model-free
Adaptive control law, 266, 269, 270 continuous-time low gain feedback PI,
Algebraic Riccati equation (ARE), 6, 23, 24, 197, 198, 202, 207, 210, 212–215,
27, 99, 168 217, 218
continuous-time, 5, 6, 66, 68, 89, 91, 94, continuous-time low gain feedback VI,
192, 193, 203, 224 200–202, 214, 216, 217
discrete-time, 5, 9, 30–32, 49, 53, 57, 60, continuous-time low gain output
61, 93, 167, 168, 262, 270, 271, 273, feedback PI, 207, 208, 210, 214,
275 216, 218–221
time-delay, 234, 249, 251 continuous-time low gain output
Algorithm feedback VI, 210, 211, 220–223
model-based continuous-time LQR PI, 69, 70, 88, 89,
continuous-time LQR PI, 10, 66, 67, 84, 92, 197
193, 194 continuous-time LQR VI, 69, 70, 91, 92
continuous-time LQR VI, 67, 85 continuous-time output feedback LQR
continuous-time parameterized LQR PI, PI, 80–85, 88, 89, 92, 210
193, 194, 196, 198, 202 continuous-time output feedback LQR
continuous-time parameterized LQR VI, 83, 85, 87, 88, 91, 92, 210
VI, 194, 199, 200 continuous-time output feedback
continuous-time zero-sum game PI, zero-sum game PI, 142, 143, 146,
131, 132 149, 152, 153, 155
continuous-time zero-sum game VI, continuous-time output feedback
132 zero-sum game VI, 145, 146, 149,
discrete-time LQR PI, 9, 167, 168 153
discrete-time LQR VI, 9, 10, 168 continuous-time output feedback
discrete-time parameterized LQR PI, zero-sum game VI, 148, 152, 155,
167, 168, 193 158


Algorithm (cont.) 101, 102, 106, 107, 118, 119, 122,


continuous-time parameterized LQR 129, 160, 167, 171, 172, 174, 175,
VI, 203 179, 180, 194, 226, 241, 242, 245,
continuous-time zero-sum game PI, 247, 248, 264, 265, 278
132, 134, 140, 155 Bellman optimality principle, 3, 13, 23, 30, 43,
continuous-time zero-sum game VI, 98, 104, 169, 170
134, 158 Bounded (signals, sets), 28, 67, 83, 132, 164,
discrete-time experience replay LQR 166, 169, 192, 194, 195, 203, 224,
PI, 264 266, 267, 270
discrete-time experience replay LQR
VI, 265
discrete-time low gain feedback PI, 172, C
173, 175, 176, 181–185 Closed-loop stability, 21, 32, 50, 61, 69, 71,
discrete-time low gain feedback VI, 93, 94, 129, 155, 160, 161, 175, 225,
172, 174–176, 184–186 249, 252
discrete-time low gain output feedback Computational methods, 6, 23, 66, 131
PI, 178–180, 185–189 Control constraints, 163, 169, 172–174,
discrete-time low gain output feedback 179–182, 186, 193, 196–198,
VI, 178, 180, 189–191 200–202, 207, 208, 211–213, 218,
discrete-time LQR PI, 18, 51, 263 221, 224
discrete-time LQR VI, 18, 19, 55, 264 Controllability, 14, 29, 30, 39, 40, 47, 65, 66,
discrete-time output feedback LQR PI, 77, 83, 99, 148, 149, 202, 203, 226,
44–47, 51–53, 63, 248 228, 230, 233, 235, 238, 244, 249,
discrete-time output feedback LQR VI, 254
47, 48, 55, 56, 61, 64, 248 Controllable, 5, 38, 40, 45, 48, 76, 77, 100, 113,
discrete-time output feedback zero-sum 130, 138, 205, 230, 249
game PI, 118, 119, 123, 124, 126 Control law, 12, 163
discrete-time output feedback zero-sum continuous-time low gain feedback, 191,
game VI, 119, 124, 125, 128 192, 212
discrete-time zero-sum game PI, 106, continuous-time LQR, 6, 65
107, 123–125 delay compensated, 234, 243
discrete-time zero-sum game VI, 107, delay compensated output feedback, 245
124 discrete-time low gain feedback, 165, 166,
time delay LQR PI, 241, 242, 248–250 168, 169, 171, 172
time delay LQR VI, 241, 242, 248, 250, discrete-time output feedback LQR, 41, 43
251 discrete-time output feedback zero-sum
time delay output feedback LQR PI, game, 116
247, 248, 251–253 distributed, 278
time delay output feedback LQR VI, linear quadratic tracking, 257, 262
247, 248, 251–254 low gain feedback, 221, 224
policy iteration, 7, 17 nonlinear, 224
value iteration, 7–9, 17 predictor feedback, 226
Asymptotically null controllable with bounded Convergence criterion, 53, 54, 56, 89, 129,
(ANCBC), 165, 191, 223 249–251, 271, 273, 276
Asymptotically stable, 27, 36, 75, 100, 130, Convergence rate, 8, 9, 58, 102
204, 280 Cost function, 2, 4, 6, 13, 14, 18, 20, 160, 279,
Augmented system, 21, 229, 230, 235, 238, 280
260, 280 augmented, 260
Autonomous system, 21, 260, 262, 280 continuous-time, 22, 129
continuous-time LQR, 5, 66, 68, 71, 77, 78
continuous-time zero-sum game, 130, 138
B discounted, 20, 21, 32, 61, 71, 93–95, 135,
Bellman equation, 6–9, 11, 13, 15–19, 22, 23, 161, 260, 280
27, 28, 45, 47, 50, 66–68, 89, 92, discrete-time, 68

discrete-time LQR, 4, 30 Game theory, 14, 97, 98


discrete-time zero-sum game, 103, 105 Global asymptotic stability, 193, 195
LQR, 27, 98 Global asymptotic stabilization, 163–166,
non-quadratic, 224 169, 172–174, 176, 179, 180, 183,
optimal, 41, 107, 115 191–193, 195, 198, 200, 201, 204,
output feedback, 139 207, 208, 210, 211, 214, 215, 218,
parameterized, 170, 172, 195 220–223
parameters, 123, 154
time-delay, 228
tracking problem, 258 H
Hamiltonian function, 4
Hamilton-Jacobi-Bellman (HJB) equation, 3,
D 23, 98, 224
DC motor, 270 Hamilton-Jacobi-Isaacs (HJI) equation, 98
Delay-free, 225, 226, 228, 230, 234, 235, 252, Heterogeneous agents, 258, 280, 281
254 Hewer’s algorithm, 47
Detectability, 234, 238, 242 Homogeneous agents, 258
Detectable, 235
Discounting factor, 20, 21, 95, 129, 160, 161,
258, 260, 271, 273, 278, 280 I
Double integrator, 89, 271 Infinite dimensional, 226
Dynamic inversion, 257, 279 Infinite horizon, 130, 239, 258, 261, 279, 280
Dynamic programming, 2–4, 11–14, 16, 23, 27, Informed agents, 268, 270
98, 102, 103 Instability, 181, 186, 212, 218
Internal model principle, 279

E
Exploration/excitation bias, 20, 21, 28, 32, 49, K
54, 60, 61, 63, 69, 71, 85–87, 89, Kleinman’s algorithm, 10
93–95, 121, 129, 135, 149, 152, 155,
160, 161, 241
Exponentially decaying, 45, 148, 241 L
Extended augmented system, 226, 233–235, Least-squares, 16, 48, 79, 80, 82, 118, 120,
238, 242–244, 248, 249, 254 141–145, 148, 149, 175, 179, 196,
Extended state augmentation, 226, 228, 253 197, 200, 201, 206, 207, 209, 210,
240, 241, 246, 248
Lifting technique, 226
F Linearly dependent, 118, 175, 241, 246
F-16 fighter aircraft, 122, 153 Linearly independent, 118
Finite dimensional, 226, 252 Linear quadratic regulator (LQR), 4, 8, 27, 99,
Finite spectrum assignment, 226 226, 254, 279
Fixed-point property, 6, 11, 24, 102, 168 Linear quadratic tracking (LQT), 257, 260
Function approximation, 13, 17 Linear time-invariant, 65, 129, 260
Lower bound, 94, 95, 161, 261
Low gain feedback, 163–169, 172, 176,
G 181–185, 187–195, 200, 205, 207,
Game algebraic Riccati equation (GARE), 98, 212–216, 218, 219, 221, 222, 224
99 Lyapunov equation, 9–11, 27, 66, 67, 84, 99,
continuous-time, 131, 132, 143, 148, 149, 102, 103, 131, 132, 148, 167, 168,
154, 157, 158 202
discrete-time, 101–103, 106, 123, 129 Lyapunov iterations, 9, 84, 102, 167, 202

M 121, 160, 170–172, 177, 178, 239,


Markov decision process, 12, 32 240, 242, 244–246
Multi-agent synchronization, 257, 258, 273, Q-learning, 17–19, 24, 25, 30, 32, 43, 45, 47,
280 49, 51–57, 61, 63, 64, 92, 93, 95, 99,
Multi-agent systems, 258, 268, 274, 275, 278, 103, 106–108, 117–121, 123–128,
279 160, 169, 171–176, 178–180, 195,
226, 238, 240–243, 245, 247, 248,
254, 258, 260, 263–265, 278, 280
N
Nash equilibrium, 130
Neural network, 13, 15, 24, 28, 95 S
Newton’s method, 9, 10, 102 Semi-global asymptotic stabilization, 163–165,
169, 193, 195, 198, 224
Semi-global exponential stabilization, 166
O Separation principle, 28
Observability, 14, 20, 29, 30, 38, 39, 47, 49, 65, Smith predictor, 226
66, 69, 73, 76, 77, 83, 99, 101, 131, State augmentation, 226, 228, 253
148, 149, 176, 185, 202, 203, 226, System identification, 14
228, 234, 235, 238, 243, 244, 247,
248, 251, 254
Observability index, 37, 38, 111, 113, 243, 247, T
251 Time delay, 225, 226, 228, 230, 238, 241–243,
Observable, 5, 32, 34, 37, 45, 48, 72, 100, 108, 245, 247, 248, 252, 254, 255
109, 111, 130, 135, 136, 192, 235,
237, 251, 261 U
Uninformed agents, 268–270
Uniform ultimate boundedness, 224
P Upper bound, 37, 38, 94, 100, 111, 113, 130,
Parameterized algebraic Riccati equation 145, 161, 166, 176, 226, 228, 229,
(ARE), 164, 165 243, 249, 253
continuous-time, 192–194, 204, 213, 218, Utility function, 2, 5, 8, 16, 51, 63, 66, 100,
219, 221, 224 130, 169, 170, 172, 233, 239
discrete-time, 166–169, 171, 175, 182, 183,
187, 188
Partially observable Markov decision process, V
32 Value function approximation, 24, 32, 95, 280
Pontryagin’s minimum principle, 2
Predictor feedback, 226, 252, 254, 255
Z
Zero-sum game, 97–108, 114–121, 129–132,
Q 134, 139, 140, 142, 143, 145, 146,
Q-function, 16–18, 24, 30–32, 40, 41, 43–45, 148, 149, 152, 153, 155, 158, 160,
48–50, 93, 103, 105, 106, 114–117, 161
