Reinforcement Learning:

Theory and Algorithms


Alekh Agarwal Nan Jiang Sham M. Kakade Wen Sun

November 27, 2020

WORKING DRAFT:
We will be frequently updating the book this fall, 2020. Please email
bookrltheory@[Link] with any typos or errors you find.
We appreciate it!
Contents

1 Fundamentals 3

1 Markov Decision Processes


and Computational Complexity 5
1.1 (Discounted) Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.1 The objective, policies, and values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.1.2 Bellman consistency equations for stationary policies . . . . . . . . . . . . . . . . . . . . . . 7
1.1.3 Bellman optimality equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
1.2 (Episodic) Markov Decision Processes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.3 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
1.4 Iterative Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4.1 Value Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
1.4.2 Policy Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
1.5 The Linear Programming Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
1.5.1 The Primal LP and A Polynomial Time Algorithm . . . . . . . . . . . . . . . . . . . . . . . 15
1.5.2 The Dual LP and the State-Action Polytope . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.6 Advantages and The Performance Difference Lemma . . . . . . . . . . . . . . . . . . . . . . . . . . 16
1.7 Bibliographic Remarks and Further Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18

2 Sample Complexity 19
2.1 Warmup: a naive model-based approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2.2 Sublinear Sample Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
2.3 Minimax Optimal Sample Complexity with the Model Based Approach . . . . . . . . . . . . . . . . 22
2.3.1 Lower Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.2 Variance Lemmas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.3.3 Completing the proof . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

2.4 Scalings and Effective Horizon Dependencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.5 Bibliographic Remarks and Further Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27

3 Approximate Value Function Methods 29


3.1 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2 Approximate Greedy Policy Selector . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.2.1 Implementing Approximate Greedy Policy Selector using Classification . . . . . . . . . . . . 30
3.2.2 Implementing Approximate Greedy Policy Selector using Regression . . . . . . . . . . . . . 31
3.3 Approximate Policy Iteration (API) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.4 Failure Case of API Without Assumption 3.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.5 Can we relax the concentrability notion? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.6 Bibliographic Remarks and Further Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

4 Generalization 35
4.1 Review: Binary Classification and Generalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
4.2 Generalization and Agnostic Learning in RL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
4.2.1 Upper Bounds: Data Reuse and Importance Sampling . . . . . . . . . . . . . . . . . . . . . 37
4.2.2 Lower Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
4.3 Interpretation: How should we study generalization in RL? . . . . . . . . . . . . . . . . . . . . . . . 40
4.4 Approximation Limits with Linearity Assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.5 Bibliographic Remarks and Further Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41

2 Strategic Exploration 43

5 Multi-armed & Linear Bandits 45


5.1 The K-Armed Bandit Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.1.1 The Upper Confidence Bound (UCB) Algorithm . . . . . . . . . . . . . . . . . . . . . . . . 45
5.2 Linear Bandits: Handling Large Action Spaces . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
5.2.1 The LinUCB algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
5.2.2 Upper and Lower Bounds . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3 LinUCB Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
5.3.1 Regret Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
5.3.2 Confidence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

5.4 Bibliographic Remarks and Further Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

6 Strategic Exploration in Tabular MDPs 55


6.1 The UCB-VI algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
6.2 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
6.3 Bibliographic Remarks and Further Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

7 Linearly Parameterized MDPs 61


7.1 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.1.1 Low-Rank MDPs and Linear MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
7.2 Planning in Linear MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
7.3 Learning Transition using Ridge Linear Regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
7.4 Uniform Convergence via Covering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
7.5 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
7.6 Analysis of UCBVI for Linear MDPs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.6.1 Proving Optimism . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69
7.6.2 Regret Decomposition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
7.6.3 Concluding the Final Regret Bound . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
7.7 Bibliographic Remarks and Further Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

8 Parametric Models with Bounded Bellman Rank 73


8.1 Problem setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
8.2 Value-function approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.3 Bellman Rank . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
8.3.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
8.3.2 Linear MDP with Bounded Degree of Freedom . . . . . . . . . . . . . . . . . . . . . . . . . 76
8.3.3 Examples that do not have low Bellman Rank . . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.4 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
8.5 Extension to Model-based Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
8.6 Bibliographic Remarks and Further Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

3 Policy Optimization 81

9 Policy Gradient Methods and Non-Convex Optimization 83

9.1 Policy Gradient Expressions and the Likelihood Ratio Method . . . . . . . . . . . . . . . . . . . . . 84
9.2 (Non-convex) Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
9.2.1 Gradient ascent and convergence to stationary points . . . . . . . . . . . . . . . . . . . . . . 86
9.2.2 Monte Carlo estimation and stochastic gradient ascent . . . . . . . . . . . . . . . . . . . . . 86
9.3 Bibliographic Remarks and Further Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88

10 Optimality 89
10.1 Vanishing Gradients and Saddle Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
10.2 Policy Gradient Ascent . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
10.3 Log Barrier Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
10.4 The Natural Policy Gradient . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
10.5 Bibliographic Remarks and Further Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

11 Function Approximation and the NPG 99


11.1 Compatible function approximation and the NPG . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
11.2 Examples: NPG and Q-NPG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
11.2.1 Log-linear Policy Classes and Soft Policy Iteration . . . . . . . . . . . . . . . . . . . . . . . 101
11.2.2 Neural Policy Classes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
11.3 The NPG “Regret Lemma” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
11.4 Q-NPG: Performance Bounds for Log-Linear Policies . . . . . . . . . . . . . . . . . . . . . . . . . 104
11.4.1 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
11.5 Q-NPG Sample Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
11.6 Bibliographic Remarks and Further Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

12 CPI, TRPO, and More 109


12.1 Conservative Policy Iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
12.1.1 The CPI Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
12.2 Trust Region Methods and Covariant Policy Search . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
12.2.1 Proximal Policy Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
12.3 Bibliographic Remarks and Further Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117

4 Further Topics 119

13 Linear Quadratic Regulators 121

13.1 The LQR Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
13.2 Bellman Optimality:
Value Iteration & The Algebraic Ricatti Equations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
13.2.1 Planning and Finite Horizon LQRs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
13.2.2 Planning and Infinite Horizon LQRs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
13.3 Convex Programs to find P and K ? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
13.3.1 The Primal for Infinite Horizon LQR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
13.3.2 The Dual . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
13.4 Policy Iteration, Gauss Newton, and NPG . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
13.4.1 Gradient Expressions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126
13.4.2 Convergence Rates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
13.4.3 Gauss-Newton Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
13.5 System Level Synthesis for Linear Dynamical Systems . . . . . . . . . . . . . . . . . . . . . . . . . 130
13.6 Bibliographic Remarks and Further Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

14 Imitation Learning 135


14.1 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
14.2 Offline IL: Behavior Cloning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
14.3 The Hybrid Setting: Statistical Benefit and Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 136
14.3.1 Extension to Agnostic Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
14.4 Maximum Entropy Inverse Reinforcement Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . 139
14.4.1 MaxEnt IRL: Formulation and The Principle of Maximum Entropy . . . . . . . . . . . . . . 140
14.4.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
14.4.3 Maximum Entropy RL: Implementing the Planning Oracle in Eq. 0.4 . . . . . . . . . . . . . 141
14.5 Interactive Imitation Learning:
AggreVaTe and Its Statistical Benefit over Offline IL Setting . . . . . . . . . . . . . . . . . . . . . . 142
14.6 Bibliographic Remarks and Further Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 145

15 Offline Reinforcement Learning 147


15.1 Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 147
15.2 Algorithm: Fitted Q Iteration (FQI) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
15.3 Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
15.4 Bibliographic Remarks and Further Readings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151

16 Partially Observable Markov Decision Processes 153

Bibliography 155

A Concentration 163

Notation

The reader might find it helpful to refer back to this notation section.

• We slightly abuse notation and let [K] denote the set {0, 1, 2, . . . K − 1} for an integer K.

• For a vector $v$, we let $(v)^2$, $\sqrt{v}$, and $|v|$ denote the component-wise square, square root, and absolute value operations.

• Inequalities between vectors are elementwise, e.g. for vectors $v, v'$, we say $v \le v'$ if the inequality holds elementwise.

• For a vector $v$, we refer to the $j$-th component of this vector by either $v(j)$ or $[v]_j$.
• Denote the variance of any real valued $f$ under a distribution $D$ as:
$$\mathrm{Var}_D(f) := E_{x\sim D}[f(x)^2] - \left(E_{x\sim D}[f(x)]\right)^2.$$

• We overload notation where, for a distribution $\mu$ over $S$, we write:
$$V^\pi(\mu) = E_{s\sim\mu}\left[V^\pi(s)\right].$$

• It is helpful to overload notation and let $P$ also refer to a matrix of size $(|S|\cdot|A|) \times |S|$ where the entry $P_{(s,a),s'}$ is equal to $P(s'|s,a)$. We also will define $P^\pi$ to be the transition matrix on state-action pairs induced by a deterministic policy $\pi$. In particular, $P^\pi_{(s,a),(s',a')} = P(s'|s,a)$ if $a' = \pi(s')$ and $P^\pi_{(s,a),(s',a')} = 0$ if $a' \neq \pi(s')$.

With this notation,
$$Q^\pi = r + \gamma P V^\pi, \qquad Q^\pi = r + \gamma P^\pi Q^\pi, \qquad Q^\pi = (I - \gamma P^\pi)^{-1} r.$$

• For a vector $Q \in \mathbb{R}^{|S\times A|}$, denote the greedy policy and value as:
$$\pi_Q(s) := \operatorname*{argmax}_{a\in A} Q(s,a), \qquad V_Q(s) := \max_{a\in A} Q(s,a).$$

• For a vector $Q \in \mathbb{R}^{|S\times A|}$, the Bellman optimality operator $\mathcal{T} : \mathbb{R}^{|S\times A|} \to \mathbb{R}^{|S\times A|}$ is defined as:
$$\mathcal{T}Q := r + \gamma P V_Q. \tag{0.1}$$
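As a concrete illustration of this notation (not code from the text), here is a minimal NumPy sketch of the matrix view of $P$ and of the operator $\mathcal{T}$; the array names, shapes, and row ordering are our own assumptions.

```python
import numpy as np

# Assumed tabular representation (our own convention):
#   P has shape (S*A, S): row (s*A + a) is the distribution P(. | s, a)
#   r has shape (S*A,):   r[s*A + a] = r(s, a)

def bellman_optimality_operator(Q, r, P, gamma, S, A):
    """Apply T Q = r + gamma * P V_Q, where V_Q(s) = max_a Q(s, a)."""
    V_Q = Q.reshape(S, A).max(axis=1)   # V_Q as a length-S vector
    return r + gamma * P @ V_Q          # back to a length-(S*A) vector
```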

Part 1

Fundamentals

Chapter 1

Markov Decision Processes and Computational Complexity

1.1 (Discounted) Markov Decision Processes

In reinforcement learning, the interactions between the agent and the environment are often described by a discounted
Markov Decision Process (MDP) M = (S, A, P, r, γ, µ), specified by:

• A state space S, which may be finite or infinite. For mathematical convenience, we will assume that S is finite or countably infinite.
• An action space A, which also may be discrete or infinite. For mathematical convenience, we will assume that
A is finite.
• A transition function P : S × A → ∆(S), where ∆(S) is the space of probability distributions over S (i.e., the probability simplex). P(s'|s, a) is the probability of transitioning into state s' upon taking action a in state s. We use P_{s,a} to denote the vector P(·|s, a).
• A reward function r : S × A → [0, 1]. r(s, a) is the immediate reward associated with taking action a in state s.
• A discount factor γ ∈ [0, 1), which defines a horizon for the problem.
• An initial state distribution µ ∈ ∆(S), which specifies how the initial state s0 is generated.

In many cases, we will assume that the initial state is fixed at s0 , i.e. µ is a distribution supported only on s0 .

1.1.1 The objective, policies, and values

Policies. In a given MDP M = (S, A, P, r, γ, µ), the agent interacts with the environment according to the following
protocol: the agent starts at some state s0 ∼ µ; at each time step t = 0, 1, 2, . . ., the agent takes an action at ∈ A,
obtains the immediate reward rt = r(st , at ), and observes the next state st+1 sampled according to st+1 ∼ P (·|st , at ).
The interaction record at time t,
$$\tau_t = (s_0, a_0, r_0, s_1, \ldots, s_t),$$
is called a trajectory, which includes the observed state at time t.

In the most general setting, a policy specifies a decision-making strategy in which the agent chooses actions adaptively
based on the history of observations; precisely, a policy is a (possibly randomized) mapping from a trajectory to an
action, i.e. π : H → ∆(A) where H is the set of all possible trajectories (of all lengths) and ∆(A) is the space of
probability distributions over A. A stationary policy π : S → ∆(A) specifies a decision-making strategy in which
the agent chooses actions based only on the current state, i.e. at ∼ π(·|st ). A deterministic, stationary policy is of the
form π : S → A.

Values. We now define values for (general) policies. For a fixed policy and a starting state $s_0 = s$, we define the value function $V^\pi_M : S \to \mathbb{R}$ as the discounted sum of future rewards,
$$V^\pi_M(s) = E\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, \pi,\ s_0 = s\right],$$
where the expectation is with respect to the randomness of the trajectory, that is, the randomness in state transitions and the stochasticity of $\pi$. Here, since $r(s,a)$ is bounded between 0 and 1, we have $0 \le V^\pi_M(s) \le 1/(1-\gamma)$.
Similarly, the action-value (or Q-value) function $Q^\pi_M : S \times A \to \mathbb{R}$ is defined as
$$Q^\pi_M(s, a) = E\left[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\middle|\, \pi,\ s_0 = s,\ a_0 = a\right],$$
and $Q^\pi_M(s,a)$ is also bounded by $1/(1-\gamma)$.

Goal. Given a state s, the goal of the agent is to find a policy π that maximizes the value, i.e. the optimization
problem the agent seeks to solve is:
$$\max_\pi V^\pi_M(s) \tag{0.1}$$

where the max is over all (possibly non-stationary and randomized) policies. As we shall see, there exists a determin-
istic and stationary policy which is simultaneously optimal for all starting states s.
We drop the dependence on M and write V π when it is clear from context.
Example 1.1 (Navigation). Navigation is perhaps the simplest example of RL. The state of the agent is their current location. The four actions might be moving 1 step along each of east, west, north or south. The transitions in the simplest setting are deterministic: taking the north action moves the agent one step north of their location, assuming that the size of a step is standardized. The agent might have a goal state g they are trying to reach, and the reward is 0 until the agent reaches the goal, and 1 upon reaching the goal state. Since the discount factor γ < 1, there is an incentive to reach the goal state earlier in the trajectory. As a result, the optimal behavior in this setting corresponds to finding the shortest path from the initial state to the goal state, and the value of a state under a policy is $\gamma^d$, where d is the number of steps the policy requires to reach the goal state.
Example 1.2 (Conversational agent). This is another fairly natural RL problem. The state of an agent can be the
current transcript of the conversation so far, along with any additional information about the world, such as the context
for the conversation, characteristics of the other agents or humans in the conversation etc. Actions depend on the
domain. In the most basic form, we can think of it as the next statement to make in the conversation. Sometimes,
conversational agents are designed for task completion, such as travel assistant or tech support or a virtual office
receptionist. In these cases, there might be a predefined set of slots which the agent needs to fill before they can find a
good solution. For instance, in the travel agent case, these might correspond to the dates, source, destination and mode
of travel. The actions might correspond to natural language queries to fill these slots.
In task completion settings, reward is naturally defined as a binary outcome on whether the task was completed or not,
such as whether the travel was successfully booked or not. Depending on the domain, we could further refine it based

on the quality or the price of the travel package found. In more generic conversational settings, the ultimate reward is
whether the conversation was satisfactory to the other agents or humans, or not.
Example 1.3 (Strategic games). This is a popular category of RL applications, where RL has been successful in
achieving human level performance in Backgammon, Go, Chess, and various forms of Poker. The usual setting consists
of the state being the current game board, actions being the potential next moves and reward being the eventual win/loss
outcome or a more detailed score when it is defined in the game. Technically, these are multi-agent RL settings, and,
yet, the algorithms used are often non-multi-agent RL algorithms.

1.1.2 Bellman consistency equations for stationary policies

Stationary policies satisfy the following consistency conditions:


Lemma 1.4. Suppose that $\pi$ is a stationary policy. Then $V^\pi$ and $Q^\pi$ satisfy the following Bellman consistency equations: for all $s \in S$, $a \in A$,
$$V^\pi(s) = Q^\pi(s, \pi(s)),$$
$$Q^\pi(s, a) = r(s, a) + \gamma\, E_{s' \sim P(\cdot|s,a)}\left[V^\pi(s')\right].$$

We leave the proof as an exercise to the reader.


It is helpful to view $V^\pi$ as a vector of length $|S|$ and $Q^\pi$ and $r$ as vectors of length $|S|\cdot|A|$. We overload notation and let $P$ also refer to a matrix of size $(|S|\cdot|A|) \times |S|$ where the entry $P_{(s,a),s'}$ is equal to $P(s'|s,a)$.
We also will define $P^\pi$ to be the transition matrix on state-action pairs induced by a stationary policy $\pi$, specifically:
$$P^\pi_{(s,a),(s',a')} := P(s'|s,a)\,\pi(a'|s').$$
In particular, for deterministic policies we have:
$$P^\pi_{(s,a),(s',a')} := \begin{cases} P(s'|s,a) & \text{if } a' = \pi(s') \\ 0 & \text{if } a' \neq \pi(s'). \end{cases}$$
With this notation, it is straightforward to verify:
$$Q^\pi = r + \gamma P V^\pi, \qquad Q^\pi = r + \gamma P^\pi Q^\pi.$$

Corollary 1.5. We have that:
$$Q^\pi = (I - \gamma P^\pi)^{-1} r \tag{0.2}$$
where $I$ is the identity matrix.

Proof: To see that $I - \gamma P^\pi$ is invertible, observe that for any non-zero vector $x \in \mathbb{R}^{|S||A|}$,
$$\begin{aligned}
\|(I - \gamma P^\pi)x\|_\infty &= \|x - \gamma P^\pi x\|_\infty \\
&\ge \|x\|_\infty - \gamma\|P^\pi x\|_\infty && \text{(triangle inequality for norms)} \\
&\ge \|x\|_\infty - \gamma\|x\|_\infty && \text{(each element of $P^\pi x$ is an average of $x$)} \\
&= (1-\gamma)\|x\|_\infty > 0 && (\gamma < 1,\ x \neq 0),
\end{aligned}$$
which implies $I - \gamma P^\pi$ is full rank.
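To make Equation 0.2 concrete, here is a minimal NumPy sketch (our own illustration, using the array conventions assumed earlier) that evaluates a deterministic policy by solving the linear system $(I - \gamma P^\pi) Q^\pi = r$.

```python
import numpy as np

def policy_evaluation(P, r, gamma, pi, S, A):
    """Exact evaluation of a deterministic policy pi (array of length S).

    P: (S*A, S) transition matrix, r: (S*A,) rewards.
    Returns Q^pi of shape (S*A,), via Q^pi = (I - gamma P^pi)^{-1} r.
    """
    # Build P^pi of shape (S*A, S*A): column (s', pi(s')) receives P(s'|s, a).
    P_pi = np.zeros((S * A, S * A))
    for s_next in range(S):
        P_pi[:, s_next * A + pi[s_next]] = P[:, s_next]
    return np.linalg.solve(np.eye(S * A) - gamma * P_pi, r)
```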


The following is also a helpful lemma:

Lemma 1.6. We have that:
$$\left[(1-\gamma)(I - \gamma P^\pi)^{-1}\right]_{(s,a),(s',a')} = (1-\gamma)\sum_{h=0}^{\infty} \gamma^h \Pr{}^\pi(s_h = s', a_h = a' \mid s_0 = s, a_0 = a),$$
so we can view the $(s,a)$-th row of this matrix as an induced distribution over states and actions when following $\pi$ after starting with $s_0 = s$ and $a_0 = a$.

We leave the proof as an exercise to the reader.

1.1.3 Bellman optimality equations

A remarkable and convenient property of MDPs is that there exists a stationary and deterministic policy that simulta-
neously maximizes V π (s) for all s ∈ S. This is formalized in the following theorem:

Theorem 1.7. Let $\Pi$ be the set of all non-stationary and randomized policies. Define:
$$V^\star(s) := \sup_{\pi\in\Pi} V^\pi(s), \qquad Q^\star(s,a) := \sup_{\pi\in\Pi} Q^\pi(s,a),$$
which is finite since $V^\pi(s)$ and $Q^\pi(s,a)$ are bounded between $0$ and $1/(1-\gamma)$.
There exists a stationary and deterministic policy $\pi$ such that for all $s \in S$ and $a \in A$,
$$V^\pi(s) = V^\star(s), \qquad Q^\pi(s,a) = Q^\star(s,a).$$

We refer to such a π as an optimal policy.

Proof: First, let us show that conditioned on $(s_0, a_0, r_0, s_1) = (s, a, r, s')$, the maximum future discounted value, from time $1$ onwards, is not a function of $s, a, r$. Specifically,
$$\sup_{\pi\in\Pi} E\left[\sum_{t=1}^{\infty} \gamma^t r(s_t,a_t) \,\middle|\, \pi,\ (s_0,a_0,r_0,s_1)=(s,a,r,s')\right] = \gamma V^\star(s').$$
For any policy $\pi$, define an "offset" policy $\pi_{(s,a,r)}$, which is the policy that chooses actions on a trajectory $\tau$ according to the same distribution that $\pi$ chooses actions on the trajectory $(s,a,r,\tau)$. For example, $\pi_{(s,a,r)}(a_0 = a'|s_0 = s')$ is equal to the probability $\pi(a_1 = a'|(s_0,a_0,r_0,s_1) = (s,a,r,s'))$. By the Markov property, we have that:
$$E\left[\sum_{t=1}^{\infty} \gamma^t r(s_t,a_t) \,\middle|\, \pi,\ (s_0,a_0,r_0,s_1)=(s,a,r,s')\right] = \gamma E\left[\sum_{t=0}^{\infty} \gamma^t r(s_t,a_t) \,\middle|\, \pi_{(s,a,r)},\ s_0 = s'\right] = \gamma V^{\pi_{(s,a,r)}}(s').$$
Hence, since $V^{\pi_{(s,a,r)}}(s')$ is not a function of $(s,a,r)$, we have
$$\sup_{\pi\in\Pi} E\left[\sum_{t=1}^{\infty} \gamma^t r(s_t,a_t) \,\middle|\, \pi,\ (s_0,a_0,r_0,s_1)=(s,a,r,s')\right] = \gamma \sup_{\pi\in\Pi} V^{\pi_{(s,a,r)}}(s') = \gamma \sup_{\pi\in\Pi} V^{\pi}(s') = \gamma V^\star(s'),$$
thus proving the claim.

We now show that the deterministic and stationary policy $\pi(s) = \operatorname{argmax}_{a\in A} \sup_{\pi'\in\Pi} Q^{\pi'}(s,a)$ satisfies $V^\pi(s) = \sup_{\pi'\in\Pi} V^{\pi'}(s)$. For this, we have that:
$$\begin{aligned}
V^\star(s_0) &= \sup_{\pi\in\Pi} E\left[r(s_0,a_0) + \sum_{t=1}^{\infty}\gamma^t r(s_t,a_t)\right] \\
&= \sup_{\pi\in\Pi} E\left[r(s_0,a_0) + E\left[\sum_{t=1}^{\infty}\gamma^t r(s_t,a_t) \,\middle|\, \pi,\ (s_0,a_0,r_0,s_1)\right]\right] \\
&\le \sup_{\pi\in\Pi} E\left[r(s_0,a_0) + \sup_{\pi'\in\Pi} E\left[\sum_{t=1}^{\infty}\gamma^t r(s_t,a_t) \,\middle|\, \pi',\ (s_0,a_0,r_0,s_1)\right]\right] \\
&= \sup_{\pi\in\Pi} E\left[r(s_0,a_0) + \gamma V^\star(s_1)\right] \\
&= E\left[r(s_0,a_0) + \gamma V^\star(s_1) \,\middle|\, \pi\right],
\end{aligned}$$
where the second equality is by the tower property of conditional expectations, and the last equality follows from the definition of $\pi$. Now, by recursion,
$$V^\star(s_0) \le E\left[r(s_0,a_0) + \gamma V^\star(s_1) \,\middle|\, \pi\right] \le E\left[r(s_0,a_0) + \gamma r(s_1,a_1) + \gamma^2 V^\star(s_2) \,\middle|\, \pi\right] \le \ldots \le V^\pi(s_0).$$
Since $V^\pi(s) \le \sup_{\pi'\in\Pi} V^{\pi'}(s) = V^\star(s)$, we have that $V^\pi = V^\star$, which completes the proof of the first claim.
For the same policy $\pi$, an analogous argument can be used to prove the second claim.
This shows that we may restrict ourselves to using stationary and deterministic policies without any loss in perfor-
mance. The following theorem, also due to [Bellman, 1956], gives a precise characterization of the optimal value
function.
Let us say that a vector $Q \in \mathbb{R}^{|S||A|}$ satisfies the Bellman optimality equations if:
$$Q(s,a) = r(s,a) + \gamma\, E_{s'\sim P(\cdot|s,a)}\left[\max_{a'\in A} Q(s',a')\right].$$

Theorem 1.8 (Bellman Optimality Equations). For any $Q \in \mathbb{R}^{|S||A|}$, we have that $Q = Q^\star$ if and only if $Q$ satisfies the Bellman optimality equations. Furthermore, the deterministic policy $\pi(s) \in \operatorname{argmax}_{a\in A} Q^\star(s,a)$ is an optimal policy (where ties are broken in some arbitrary and deterministic manner).

Before we prove this claim, we will provide a few definitions. Let $\pi_Q$ denote the greedy policy with respect to a vector $Q \in \mathbb{R}^{|S||A|}$, i.e.
$$\pi_Q(s) := \operatorname*{argmax}_{a\in A} Q(s,a),$$
where ties are broken in some arbitrary (and deterministic) manner. With this notation, by the above theorem, the optimal policy $\pi^\star$ is given by:
$$\pi^\star = \pi_{Q^\star}.$$
Let us also use the following notation to turn a vector $Q \in \mathbb{R}^{|S||A|}$ into a vector of length $|S|$:
$$V_Q(s) := \max_{a\in A} Q(s,a).$$
The Bellman optimality operator $\mathcal{T} : \mathbb{R}^{|S||A|} \to \mathbb{R}^{|S||A|}$ is defined as:
$$\mathcal{T}Q := r + \gamma P V_Q. \tag{0.3}$$

This allows us to rewrite the Bellman optimality equation in the concise form:

Q = T Q,

and, so, the previous theorem states that Q = Q? if and only if Q is a fixed point of the operator T .
Proof: We first show sufficiency, i.e. that $Q^\star$ (the state-action value of an optimal policy) satisfies $Q^\star = \mathcal{T}Q^\star$. Let $\pi^\star$ be an optimal stationary and deterministic policy, which exists by Theorem 1.7. First let us show that $V^\star(s) = \max_a Q^\star(s,a)$. We have that $V^\star(s) = V^{\pi^\star}(s) = Q^{\pi^\star}(s,\pi^\star(s)) = Q^\star(s,\pi^\star(s))$, by Lemma 1.4 and Theorem 1.7. Also,
$$\max_a Q^\star(s,a) \ge Q^\star(s,\pi^\star(s)) = V^\star(s) \ge \max_a \max_\pi Q^\pi(s,a) \ge \max_a Q^{\pi^\star}(s,a) = \max_a Q^\star(s,a),$$
which proves the claim. Now for all actions $a \in A$, we have:
$$\begin{aligned}
Q^\star(s,a) &= \max_\pi Q^\pi(s,a) = r(s,a) + \gamma \max_\pi E_{s'\sim P(\cdot|s,a)}\left[V^\pi(s')\right] \\
&\overset{(a)}{=} r(s,a) + \gamma E_{s'\sim P(\cdot|s,a)}\left[V^\star(s')\right] \\
&= r(s,a) + \gamma E_{s'\sim P(\cdot|s,a)}\left[\max_{a'} Q^\star(s',a')\right].
\end{aligned}$$
Here the equality (a) follows from Theorem 1.7. This proves sufficiency.
For the converse, suppose $Q = \mathcal{T}Q$ for some $Q$. We now show that $Q = Q^\star$. Let $\pi = \pi_Q$. That $Q = \mathcal{T}Q$ implies that $Q = r + \gamma P^{\pi_Q} Q$, and so:
$$Q = (I - \gamma P^{\pi_Q})^{-1} r = Q^\pi,$$
using Equation 0.2 in the last step. In other words, $Q$ is the action value of the policy $\pi_Q$. Now observe for any other deterministic and stationary policy $\pi'$:
$$\begin{aligned}
Q - Q^{\pi'} &= Q^\pi - Q^{\pi'} \\
&= Q^\pi - (I - \gamma P^{\pi'})^{-1} r \\
&= (I - \gamma P^{\pi'})^{-1}\left((I - \gamma P^{\pi'}) - (I - \gamma P^{\pi})\right) Q^\pi \\
&= \gamma (I - \gamma P^{\pi'})^{-1} (P^\pi - P^{\pi'}) Q^\pi.
\end{aligned}$$
The proof is completed by noting that $(P^\pi - P^{\pi'})Q^\pi \ge 0$. To see this, recall that $(1-\gamma)(I - \gamma P^{\pi'})^{-1}$ is a matrix with non-negative entries (see Lemma 1.6), and now we can observe that:
$$\left[(P^\pi - P^{\pi'})Q^\pi\right]_{s,a} = E_{s'\sim P(\cdot|s,a)}\left[Q^\pi(s',\pi(s')) - Q^\pi(s',\pi'(s'))\right] \ge 0,$$
where the last step uses that $\pi = \pi_Q$. Thus we have that $Q \ge Q^{\pi'}$ for all deterministic and stationary $\pi'$, which shows $Q = Q^\star$, using Theorem 1.7. This completes the proof.

1.2 (Episodic) Markov Decision Processes

It will also be natural for us to work with episodic Markov decision processes. In reinforcement learning, the interactions between the agent and the environment are often described by an episodic, time-dependent Markov Decision Process (MDP) $M = (S, A, \{P_h\}, \{r_h\}, H, \mu)$, specified by:

• A state space S, which may be finite or infinite.

• An action space A, which also may be discrete or infinite.

• A time-dependent transition function $P_h : S \times A \to \Delta(S)$, where $\Delta(S)$ is the space of probability distributions over S (i.e., the probability simplex). $P_h(s'|s,a)$ is the probability of transitioning into state s' upon taking action a in state s at time step h. Note that the time-dependent setting generalizes the stationary setting, in which all steps share the same transition function.

• A reward function $r_h : S \times A \to [0,1]$. $r_h(s,a)$ is the immediate reward associated with taking action a in state s at time step h.

• An integer H which defines the horizon of the problem.

• An initial state distribution $\mu \in \Delta(S)$, which specifies how the initial state $s_0$ is generated.

Here, for a policy $\pi$, a state $s$, and $h \in \{0,\ldots,H-1\}$, we define the value function $V^\pi_h : S \to \mathbb{R}$ as
$$V^\pi_h(s) = E\left[\sum_{t=h}^{H-1} r_t(s_t,a_t) \,\middle|\, \pi,\ s_h = s\right],$$
where again the expectation is with respect to the randomness of the trajectory, that is, the randomness in state transitions and the stochasticity of $\pi$. Similarly, the state-action value (or Q-value) function $Q^\pi_h : S \times A \to \mathbb{R}$ is defined as
$$Q^\pi_h(s,a) = E\left[\sum_{t=h}^{H-1} r_t(s_t,a_t) \,\middle|\, \pi,\ s_h = s,\ a_h = a\right].$$
We also use the notation $V^\pi(s) = V^\pi_0(s)$.

Again, given a state $s$, the goal of the agent is to find a policy $\pi$ that maximizes the value, i.e. the optimization problem the agent seeks to solve is:
$$\max_\pi V^\pi(s) \tag{0.4}$$
where recall that $V^\pi(s) = V^\pi_0(s)$.


Theorem 1.9 (Bellman optimality equations). Define
$$Q^\star_h(s,a) = \sup_{\pi\in\Pi} Q^\pi_h(s,a),$$
where the sup is over all non-stationary and randomized policies. We have that $Q_h = Q^\star_h$ for all $h \in [H]$ if and only if for all $h \in [H]$ (with the convention $Q_H = 0$),
$$Q_h(s,a) = r_h(s,a) + E_{s'\sim P_h(\cdot|s,a)}\left[\max_{a'\in A} Q_{h+1}(s',a')\right].$$
Furthermore, $\pi(s,h) = \operatorname{argmax}_{a\in A} Q^\star_h(s,a)$ is an optimal policy.

We leave the proof as an exercise to the reader.
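The characterization in Theorem 1.9 suggests computing $Q^\star_h$ by backward induction over $h$. The following is a minimal NumPy sketch under our own assumed representation (time-indexed arrays P[h] of shape (S*A, S) and r[h] of shape (S*A,)); it is an illustration, not code from the text.

```python
import numpy as np

def finite_horizon_optimal_values(P, r, H, S, A):
    """Backward induction: returns Q_star[h] of shape (S*A,) and the greedy policy."""
    Q_star = np.zeros((H + 1, S * A))       # Q_star[H] = 0 by convention
    greedy = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        V_next = Q_star[h + 1].reshape(S, A).max(axis=1)    # max_{a'} Q_{h+1}(s', a')
        Q_star[h] = r[h] + P[h] @ V_next
        greedy[h] = Q_star[h].reshape(S, A).argmax(axis=1)  # pi(s, h)
    return Q_star[:H], greedy
```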

1.3 Computational Complexity

The remainder of this section will be concerned with computing an optimal policy when given knowledge of the MDP
M = (S, A, P, r, γ). While much of this book is concerned with statistical limits, understanding the computational
limits can be informative. We will consider algorithms which give both exact and approximately optimal policies. In
particular, we will be interested in polynomial time (and strongly polynomial time) algorithms.

                 Value Iteration                                               Policy Iteration                                                                                                            LP-Algorithms
Poly?            $|S|^2|A|\,\frac{L(P,r,\gamma)\log\frac{1}{1-\gamma}}{1-\gamma}$   $(|S|^3 + |S|^2|A|)\,\frac{L(P,r,\gamma)\log\frac{1}{1-\gamma}}{1-\gamma}$                                              $|S|^3|A|\,L(P,r,\gamma)$
Strongly Poly?   ✗                                                             $(|S|^3 + |S|^2|A|)\cdot\min\left\{\frac{|A|^{|S|}}{|S|},\ \frac{|S|^2|A|\log\frac{|S|^2}{1-\gamma}}{1-\gamma}\right\}$     $|S|^4|A|^4\log\frac{|S|}{1-\gamma}$

Table 0.1: Computational complexities of various approaches (we drop universal constants). Polynomial time algorithms depend on the bit complexity, $L(P,r,\gamma)$, while strongly polynomial algorithms do not. Note that only for a fixed value of $\gamma$ are value and policy iteration polynomial time algorithms; otherwise, they are not polynomial time algorithms. Similarly, only for a fixed value of $\gamma$ is policy iteration a strongly polynomial time algorithm. In contrast, the LP-approach leads to both polynomial time and strongly polynomial time algorithms; for the latter, the approach is an interior point algorithm. See text for further discussion, and Section 1.7 for references. Here, $|S|^2|A|$ is the assumed runtime per iteration of value iteration, and $|S|^3 + |S|^2|A|$ is the assumed runtime per iteration of policy iteration (note that for this complexity we would directly update the values $V$ rather than $Q$ values, as described in the text); these runtimes are consistent with assuming cubic complexity for linear system solving.

Suppose that (P, r, γ) in our MDP M is specified with rational entries. Let L(P, r, γ) denote the total bit-size required
to specify M , and assume that basic arithmetic operations +, −, ×, ÷ take unit time. Here, we may hope for an
algorithm which (exactly) returns an optimal policy whose runtime is polynomial in L(P, r, γ) and the number of
states and actions.
More generally, it may also be helpful to understand which algorithms are strongly polynomial. Here, we do not want
to explicitly restrict (P, r, γ) to be specified by rationals. An algorithm is said to be strongly polynomial if it returns
an optimal policy with runtime that is polynomial in only the number of states and actions (with no dependence on
L(P, r, γ)).

1.4 Iterative Methods


Planning refers to the problem of computing $\pi^\star_M$ given the MDP specification $M = (S, A, P, r, \gamma)$. This section reviews classical planning algorithms that compute $Q^\star$.

1.4.1 Value Iteration

A simple algorithm is to iteratively apply the fixed point mapping: starting at some $Q$, we iteratively apply $\mathcal{T}$:
$$Q \leftarrow \mathcal{T}Q.$$
This algorithm is referred to as Q-value iteration.


Lemma 1.10 (Contraction). For any two vectors $Q, Q' \in \mathbb{R}^{|S||A|}$,
$$\|\mathcal{T}Q - \mathcal{T}Q'\|_\infty \le \gamma \|Q - Q'\|_\infty.$$

Proof: First, let us show that for all $s \in S$, $|V_Q(s) - V_{Q'}(s)| \le \max_{a\in A} |Q(s,a) - Q'(s,a)|$. Assume $V_Q(s) > V_{Q'}(s)$ (the other direction is symmetric), and let $a$ be the greedy action for $Q$ at $s$. Then
$$|V_Q(s) - V_{Q'}(s)| = Q(s,a) - \max_{a'\in A} Q'(s,a') \le Q(s,a) - Q'(s,a) \le \max_{a\in A} |Q(s,a) - Q'(s,a)|.$$
Using this,
$$\begin{aligned}
\|\mathcal{T}Q - \mathcal{T}Q'\|_\infty &= \gamma\|P V_Q - P V_{Q'}\|_\infty \\
&= \gamma\|P(V_Q - V_{Q'})\|_\infty \\
&\le \gamma\|V_Q - V_{Q'}\|_\infty \\
&= \gamma\max_s |V_Q(s) - V_{Q'}(s)| \\
&\le \gamma\max_s\max_a |Q(s,a) - Q'(s,a)| \\
&= \gamma\|Q - Q'\|_\infty,
\end{aligned}$$
where the first inequality uses that each element of $P(V_Q - V_{Q'})$ is a convex average of $V_Q - V_{Q'}$ and the second inequality uses our claim above.
The following result bounds the sub-optimality of the greedy policy itself, based on the error in the Q-value function.

Lemma 1.11 (Q-Error Amplification). For any vector $Q \in \mathbb{R}^{|S||A|}$,
$$V^{\pi_Q} \ge V^\star - \frac{2\|Q - Q^\star\|_\infty}{1-\gamma}\,\mathbf{1},$$
where $\mathbf{1}$ denotes the vector of all ones.

Proof: Fix state $s$ and let $a = \pi_Q(s)$. We have:
$$\begin{aligned}
V^\star(s) - V^{\pi_Q}(s) &= Q^\star(s,\pi^\star(s)) - Q^{\pi_Q}(s,a) \\
&= Q^\star(s,\pi^\star(s)) - Q^\star(s,a) + Q^\star(s,a) - Q^{\pi_Q}(s,a) \\
&= Q^\star(s,\pi^\star(s)) - Q^\star(s,a) + \gamma E_{s'\sim P(\cdot|s,a)}\left[V^\star(s') - V^{\pi_Q}(s')\right] \\
&\le Q^\star(s,\pi^\star(s)) - Q(s,\pi^\star(s)) + Q(s,a) - Q^\star(s,a) + \gamma E_{s'\sim P(\cdot|s,a)}\left[V^\star(s') - V^{\pi_Q}(s')\right] \\
&\le 2\|Q - Q^\star\|_\infty + \gamma\|V^\star - V^{\pi_Q}\|_\infty,
\end{aligned}$$
where the first inequality uses $Q(s,\pi^\star(s)) \le Q(s,\pi_Q(s)) = Q(s,a)$ due to the definition of $\pi_Q$. Since this holds for all states $s$, rearranging gives $\|V^\star - V^{\pi_Q}\|_\infty \le \frac{2\|Q - Q^\star\|_\infty}{1-\gamma}$, which completes the proof.
Theorem 1.12 (Q-value iteration convergence). Set $Q^{(0)} = 0$. For $k = 0, 1, \ldots$, suppose:
$$Q^{(k+1)} = \mathcal{T}Q^{(k)}.$$
Let $\pi^{(k)} = \pi_{Q^{(k)}}$. For $k \ge \frac{\log\frac{2}{(1-\gamma)^2\epsilon}}{1-\gamma}$,
$$V^{\pi^{(k)}} \ge V^\star - \epsilon\,\mathbf{1}.$$

Proof: Since $\|Q^\star\|_\infty \le 1/(1-\gamma)$, $Q^{(k)} = \mathcal{T}^k Q^{(0)}$ and $Q^\star = \mathcal{T}Q^\star$, Lemma 1.10 gives
$$\|Q^{(k)} - Q^\star\|_\infty = \|\mathcal{T}^k Q^{(0)} - \mathcal{T}^k Q^\star\|_\infty \le \gamma^k \|Q^{(0)} - Q^\star\|_\infty = (1 - (1-\gamma))^k \|Q^\star\|_\infty \le \frac{\exp(-(1-\gamma)k)}{1-\gamma}.$$
The proof is completed with our choice of $k$ and using Lemma 1.11.
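The following is a minimal NumPy sketch of Q-value iteration, run for the iteration count suggested by Theorem 1.12 (our own illustration, reusing the assumed (S*A, S) representation of P); it is not meant as an optimized implementation.

```python
import numpy as np

def q_value_iteration(P, r, gamma, S, A, eps=1e-3):
    """Run Q <- T Q for the iteration count from Theorem 1.12; return Q and the greedy policy."""
    Q = np.zeros(S * A)
    num_iters = int(np.ceil(np.log(2.0 / ((1 - gamma) ** 2 * eps)) / (1 - gamma)))
    for _ in range(num_iters):
        V_Q = Q.reshape(S, A).max(axis=1)    # V_Q(s) = max_a Q(s, a)
        Q = r + gamma * P @ V_Q              # T Q = r + gamma P V_Q
    greedy = Q.reshape(S, A).argmax(axis=1)  # pi_Q, an eps-optimal policy by Theorem 1.12
    return Q, greedy
```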

Iteration complexity for an exact solution. With regards to computing an exact optimal policy, when the gap between the current objective value and the optimal objective value is smaller than $2^{-L(P,r,\gamma)}$, then the greedy policy will be optimal. This leads to the claimed complexity in Table 0.1. Value iteration is not a strongly polynomial algorithm because, in finite time, it may never return the optimal policy.

1.4.2 Policy Iteration

The policy iteration algorithm starts from an arbitrary policy $\pi_0$, and repeats the following iterative procedure: for $k = 0, 1, 2, \ldots$

1. Policy evaluation. Compute $Q^{\pi_k}$.

2. Policy improvement. Update the policy:
$$\pi_{k+1} = \pi_{Q^{\pi_k}}.$$

In each iteration, we compute the Q-value function of $\pi_k$, using the analytical form given in Equation 0.2, and update the policy to be greedy with respect to this new Q-value. The first step is often called policy evaluation, and the second step is often called policy improvement.
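Here is a minimal NumPy sketch of these two steps (our own illustration under the same assumed array conventions, reusing the policy_evaluation helper sketched after Corollary 1.5).

```python
import numpy as np

def policy_iteration(P, r, gamma, S, A, max_iters=1000):
    """Alternate exact policy evaluation and greedy policy improvement."""
    pi = np.zeros(S, dtype=int)                        # arbitrary initial deterministic policy
    for _ in range(max_iters):
        Q = policy_evaluation(P, r, gamma, pi, S, A)   # Q^{pi_k} via (I - gamma P^pi)^{-1} r
        new_pi = Q.reshape(S, A).argmax(axis=1)        # pi_{k+1} greedy w.r.t. Q^{pi_k}
        if np.array_equal(new_pi, pi):                 # policy is stable: it is optimal
            break
        pi = new_pi
    return pi, Q
```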
Lemma 1.13. We have that:

1. $Q^{\pi_{k+1}} \ge \mathcal{T}Q^{\pi_k} \ge Q^{\pi_k}$

2. $\|Q^{\pi_{k+1}} - Q^\star\|_\infty \le \gamma\|Q^{\pi_k} - Q^\star\|_\infty$

Proof: First let us show that $\mathcal{T}Q^{\pi_k} \ge Q^{\pi_k}$. Note that the policies produced in policy iteration are always deterministic, so $V^{\pi_k}(s) = Q^{\pi_k}(s,\pi_k(s))$ for all iterations $k$ and states $s$. Hence,
$$\mathcal{T}Q^{\pi_k}(s,a) = r(s,a) + \gamma E_{s'\sim P(\cdot|s,a)}\left[\max_{a'} Q^{\pi_k}(s',a')\right] \ge r(s,a) + \gamma E_{s'\sim P(\cdot|s,a)}\left[Q^{\pi_k}(s',\pi_k(s'))\right] = Q^{\pi_k}(s,a).$$
Now let us prove that $Q^{\pi_{k+1}} \ge \mathcal{T}Q^{\pi_k}$. First, let us see that $Q^{\pi_{k+1}} \ge Q^{\pi_k}$:
$$Q^{\pi_k} = r + \gamma P^{\pi_k} Q^{\pi_k} \le r + \gamma P^{\pi_{k+1}} Q^{\pi_k} \le \sum_{t=0}^{\infty} \gamma^t (P^{\pi_{k+1}})^t r = Q^{\pi_{k+1}},$$
where we have used that $\pi_{k+1}$ is the greedy policy in the first inequality and recursion in the second inequality. Using this,
$$\begin{aligned}
Q^{\pi_{k+1}}(s,a) &= r(s,a) + \gamma E_{s'\sim P(\cdot|s,a)}\left[Q^{\pi_{k+1}}(s',\pi_{k+1}(s'))\right] \\
&\ge r(s,a) + \gamma E_{s'\sim P(\cdot|s,a)}\left[Q^{\pi_k}(s',\pi_{k+1}(s'))\right] \\
&= r(s,a) + \gamma E_{s'\sim P(\cdot|s,a)}\left[\max_{a'} Q^{\pi_k}(s',a')\right] = \mathcal{T}Q^{\pi_k}(s,a),
\end{aligned}$$
which completes the proof of the first claim.

For the second claim,
$$\|Q^\star - Q^{\pi_{k+1}}\|_\infty \le \|Q^\star - \mathcal{T}Q^{\pi_k}\|_\infty = \|\mathcal{T}Q^\star - \mathcal{T}Q^{\pi_k}\|_\infty \le \gamma\|Q^\star - Q^{\pi_k}\|_\infty,$$
where we have used that $Q^\star \ge Q^{\pi_{k+1}} \ge \mathcal{T}Q^{\pi_k}$ in the first step and the contraction property of $\mathcal{T}$ (see Lemma 1.10) in the last step.
With this lemma, a convergence rate for the policy iteration algorithm immediately follows.
Theorem 1.14 (Policy iteration convergence). Let $\pi_0$ be any initial policy. For $k \ge \frac{\log\frac{1}{(1-\gamma)\epsilon}}{1-\gamma}$, the $k$-th policy in policy iteration has the following performance bound:
$$Q^{\pi_k} \ge Q^\star - \epsilon\,\mathbf{1}.$$

Iteration complexity for an exact solution. With regards to computing an exact optimal policy, it is clear from the previous results that policy iteration is no worse than value iteration. However, with regards to obtaining an exact solution with runtime independent of the bit complexity $L(P,r,\gamma)$, improvements are possible (where we assume basic arithmetic operations on real numbers are order one cost). Naively, the number of iterations of policy iteration is bounded by the number of policies, namely $|A|^{|S|}$; here, a small improvement is possible, where the number of iterations of policy iteration can be bounded by $\frac{|A|^{|S|}}{|S|}$. Remarkably, for a fixed value of $\gamma$, policy iteration can be shown to be a strongly polynomial time algorithm, where policy iteration finds an exact policy in at most $\frac{|S|^2|A|\log\frac{|S|^2}{1-\gamma}}{1-\gamma}$ iterations. See Table 0.1 for a summary, and Section 1.7 for references.

1.5 The Linear Programming Approach

It is helpful to understand an alternative approach to finding an optimal policy for a known MDP. With regards to
computation, consider the setting where our MDP M = (S, A, P, r, γ, µ) is known and P , r, and γ are all specified by
rational numbers. Here, from a computational perspective, the previous iterative algorithms are, strictly speaking, not
polynomial time algorithms, due to that they depend polynomially on 1/(1 − γ), which is not polynomial in the de-
1
scription length of the MDP . In particular, note that any rational value of 1 − γ may be specified with only O(log 1−γ )
bits of precision. In this context, we may hope for a fully polynomial time algorithm, when given knowledge of the
MDP, which would have a computation time which would depend polynomially on the description length of the MDP
M , when the parameters are specified as rational numbers. We now see that the LP approach provides a polynomial
time algorithm.

1.5.1 The Primal LP and A Polynomial Time Algorithm

Consider the following optimization problem with variables $V \in \mathbb{R}^{|S|}$:
$$\begin{aligned}
\min \quad & \sum_s \mu(s) V(s) \\
\text{subject to} \quad & V(s) \ge r(s,a) + \gamma\sum_{s'} P(s'|s,a) V(s') \quad \forall a \in A,\ s \in S.
\end{aligned}$$
Here, the optimal value function $V^\star(s)$ is the unique solution to this linear program. With regards to computation time, linear programming approaches only depend on the description length of the coefficients in the program, since this determines the computational complexity of basic additions and multiplications. Thus, this approach will only depend on the bit length description of the MDP, when the MDP is specified by rational numbers.
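As an illustration of this primal LP (not code from the text), the following sketch sets it up in the tabular representation assumed earlier and solves it with scipy.optimize.linprog; the variable names are our own.

```python
import numpy as np
from scipy.optimize import linprog

def solve_primal_lp(P, r, gamma, mu, S, A):
    """Solve min_V mu^T V  s.t.  V(s) >= r(s,a) + gamma * sum_s' P(s'|s,a) V(s')."""
    # E maps V to the length-(S*A) vector with entry V(s) in row (s, a).
    E = np.repeat(np.eye(S), A, axis=0)           # shape (S*A, S)
    A_ub = gamma * P - E                          # constraints rewritten as (gamma P - E) V <= -r
    b_ub = -r
    res = linprog(c=mu, A_ub=A_ub, b_ub=b_ub, bounds=[(None, None)] * S)
    return res.x                                  # the optimal value function V*
```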

Computational complexity for an exact solution. Table 0.1 shows the runtime complexity for the LP approach,
where we assume a standard runtime for solving a linear program. The strongly polynomial algorithm is an interior
point algorithm. See Section 1.7 for references.

Policy iteration and the simplex algorithm. It turns out that the policy iteration algorithm is actually the simplex
method with block pivot. While the simplex method, in general, is not a strongly polynomial time algorithm, the
policy iteration algorithm is a strongly polynomial time algorithm, provided we keep the discount factor fixed. See
[Ye, 2011].

1.5.2 The Dual LP and the State-Action Polytope

For a fixed (possibly stochastic) policy $\pi$, let us define the state-action visitation distribution $\nu^\pi_\mu$ as:
$$\nu^\pi_\mu(s,a) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t \Pr{}^\pi(s_t = s, a_t = a),$$
where $\Pr^\pi(s_t = s, a_t = a)$ is the state-action visitation probability when we execute $\pi$ in $M$ starting at state $s_0 \sim \mu$. Recall that Lemma 1.6 provides a way to easily compute $\nu^\pi_\mu(s,a)$ through an appropriate vector-matrix multiplication.
It is straightforward to verify that $\nu^\pi_\mu$ satisfies, for all states $s \in S$:
$$\sum_a \nu^\pi_\mu(s,a) = (1-\gamma)\mu(s) + \gamma\sum_{s',a'} P(s|s',a')\,\nu^\pi_\mu(s',a').$$
Let us define the state-action polytope as follows:
$$K := \left\{\nu \;\middle|\; \nu \ge 0 \ \text{ and }\ \sum_a \nu(s,a) = (1-\gamma)\mu(s) + \gamma\sum_{s',a'} P(s|s',a')\,\nu(s',a')\right\}.$$

We now see that this set precisely characterizes all state-action visitation distributions.
Proposition 1.15. We have that K is equal to the set of all feasible state-action distributions, i.e. ν ∈ K if and only if
there exists a stationary (and possibly randomized) policy π such that νµπ = ν.

With respect to the variables $\nu \in \mathbb{R}^{|S|\cdot|A|}$, the dual LP formulation is as follows:
$$\begin{aligned}
\max \quad & \frac{1}{1-\gamma}\sum_{s,a} \nu(s,a)\, r(s,a) \\
\text{subject to} \quad & \nu \in K.
\end{aligned}$$
Note that $K$ is itself a polytope, and one can verify that this is indeed the dual of the aforementioned LP. This provides an alternative approach to finding an optimal solution.
If $\nu^\star$ is the solution to this LP, then we have that:
$$\pi^\star(a|s) = \frac{\nu^\star(s,a)}{\sum_{a'} \nu^\star(s,a')}.$$
An alternative optimal policy is $\operatorname{argmax}_a \nu^\star(s,a)$ (and these policies are identical if the optimal policy is unique).
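To make the polytope constraint concrete, here is a small NumPy sketch (our own illustration) that computes $\nu^\pi_\mu$ for a deterministic policy via Lemma 1.6 and checks the flow constraints defining $K$.

```python
import numpy as np

def visitation_distribution(P, gamma, mu, pi, S, A):
    """nu^pi_mu as a length-(S*A) vector, for a deterministic policy pi."""
    P_pi = np.zeros((S * A, S * A))
    for s_next in range(S):
        P_pi[:, s_next * A + pi[s_next]] = P[:, s_next]
    occupancy = (1 - gamma) * np.linalg.inv(np.eye(S * A) - gamma * P_pi)  # rows indexed by (s0, a0)
    start = np.zeros(S * A)
    start[np.arange(S) * A + pi] = mu          # initial state-action distribution under pi
    nu = start @ occupancy                     # nu^pi_mu; sums to 1

    # Flow constraint: sum_a nu(s,a) = (1-gamma) mu(s) + gamma sum_{s',a'} P(s|s',a') nu(s',a')
    lhs = nu.reshape(S, A).sum(axis=1)
    rhs = (1 - gamma) * mu + gamma * nu @ P
    assert np.allclose(lhs, rhs)
    return nu
```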

1.6 Advantages and The Performance Difference Lemma

Throughout, we will overload notation where, for a distribution $\mu$ over $S$, we write:
$$V^\pi(\mu) = E_{s\sim\mu}\left[V^\pi(s)\right].$$
The advantage $A^\pi(s,a)$ of a policy $\pi$ is defined as
$$A^\pi(s,a) := Q^\pi(s,a) - V^\pi(s).$$
Note that:
$$A^\star(s,a) := A^{\pi^\star}(s,a) \le 0$$
for all state-action pairs.


Analogous to the state-action visitation distribution, define the discounted state visitation distribution $d^\pi_{s_0}$ as:
$$d^\pi_{s_0}(s) = (1-\gamma)\sum_{t=0}^{\infty} \gamma^t \Pr{}^\pi(s_t = s \mid s_0), \tag{0.5}$$
where $\Pr^\pi(s_t = s|s_0)$ is the state visitation probability, under $\pi$ starting at state $s_0$. We also write:
$$d^\pi_\mu(s) = E_{s_0\sim\mu}\left[d^\pi_{s_0}(s)\right]$$
for a distribution $\mu$ over $S$.


The following lemma is helpful in a number of our analyses.

Lemma 1.16 (The performance difference lemma). For all policies $\pi, \pi'$ and distributions $\mu$ over $S$,
$$V^\pi(\mu) - V^{\pi'}(\mu) = \frac{1}{1-\gamma}\, E_{s'\sim d^\pi_\mu}\, E_{a'\sim\pi(\cdot|s')}\left[A^{\pi'}(s',a')\right].$$

Proof: Let $\Pr^\pi(\tau|s_0 = s)$ denote the probability of observing a trajectory $\tau$ when starting in state $s$ and following the policy $\pi$. By definition of $d^\pi_{s_0}$, observe that for any function $f : S \times A \to \mathbb{R}$,
$$E_{\tau\sim\Pr^\pi}\left[\sum_{t=0}^{\infty}\gamma^t f(s_t,a_t)\right] = \frac{1}{1-\gamma}\, E_{s\sim d^\pi_{s_0}}\, E_{a\sim\pi(\cdot|s)}\left[f(s,a)\right]. \tag{0.6}$$
Using a telescoping argument, we have:
$$\begin{aligned}
V^\pi(s) - V^{\pi'}(s) &= E_{\tau\sim\Pr^\pi(\tau|s_0=s)}\left[\sum_{t=0}^{\infty}\gamma^t r(s_t,a_t)\right] - V^{\pi'}(s) \\
&= E_{\tau\sim\Pr^\pi(\tau|s_0=s)}\left[\sum_{t=0}^{\infty}\gamma^t\left(r(s_t,a_t) + V^{\pi'}(s_t) - V^{\pi'}(s_t)\right)\right] - V^{\pi'}(s) \\
&\overset{(a)}{=} E_{\tau\sim\Pr^\pi(\tau|s_0=s)}\left[\sum_{t=0}^{\infty}\gamma^t\left(r(s_t,a_t) + \gamma V^{\pi'}(s_{t+1}) - V^{\pi'}(s_t)\right)\right] \\
&\overset{(b)}{=} E_{\tau\sim\Pr^\pi(\tau|s_0=s)}\left[\sum_{t=0}^{\infty}\gamma^t\left(r(s_t,a_t) + \gamma E\left[V^{\pi'}(s_{t+1})\mid s_t,a_t\right] - V^{\pi'}(s_t)\right)\right] \\
&\overset{(c)}{=} E_{\tau\sim\Pr^\pi(\tau|s_0=s)}\left[\sum_{t=0}^{\infty}\gamma^t\left(Q^{\pi'}(s_t,a_t) - V^{\pi'}(s_t)\right)\right] \\
&= E_{\tau\sim\Pr^\pi(\tau|s_0=s)}\left[\sum_{t=0}^{\infty}\gamma^t A^{\pi'}(s_t,a_t)\right] = \frac{1}{1-\gamma}\, E_{s'\sim d^\pi_s}\, E_{a'\sim\pi(\cdot|s')}\left[A^{\pi'}(s',a')\right],
\end{aligned}$$
where (a) rearranges terms in the summation via telescoping; (b) uses the tower property of conditional expectations; (c) follows by definition; and the final equality follows from Equation 0.6. The lemma follows by taking the expectation over $s \sim \mu$.
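For intuition, the identity can be checked numerically on a small random MDP; the sketch below is our own illustration (reusing the policy_evaluation and visitation_distribution helpers sketched earlier) and compares both sides of Lemma 1.16 for two random deterministic policies.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=S * A)        # random transition matrix, shape (S*A, S)
r = rng.random(S * A)
mu = np.ones(S) / S
pi, pi_prime = rng.integers(A, size=S), rng.integers(A, size=S)

Q_pi = policy_evaluation(P, r, gamma, pi, S, A)
Q_pp = policy_evaluation(P, r, gamma, pi_prime, S, A)
V_pi = Q_pi.reshape(S, A)[np.arange(S), pi]
V_pp = Q_pp.reshape(S, A)[np.arange(S), pi_prime]
adv_pp = Q_pp.reshape(S, A) - V_pp[:, None]      # A^{pi'}(s, a)

nu = visitation_distribution(P, gamma, mu, pi, S, A)   # d^pi_mu(s) concentrated on a = pi(s)
lhs = mu @ (V_pi - V_pp)                               # V^pi(mu) - V^{pi'}(mu)
rhs = (nu.reshape(S, A) * adv_pp).sum() / (1 - gamma)
assert np.isclose(lhs, rhs)
```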

1.7 Bibliographic Remarks and Further Reading

We refer the reader to [Puterman, 1994] for a more detailed treatment of dynamic programming and MDPs. [Puterman, 1994] also contains a thorough treatment of the dual LP, along with a proof of Proposition 1.15.
With regards to the computational complexity of policy iteration, [Ye, 2011] showed that policy iteration is a strongly polynomial time algorithm for a fixed discount rate¹. Also, see [Ye, 2011] for a good summary of the computational complexities of various approaches. [Mansour and Singh, 1999] showed that the number of iterations of policy iteration can be bounded by $\frac{|A|^{|S|}}{|S|}$.
With regards to a strongly polynomial algorithm, the CIPA algorithm [Ye, 2005] is an interior point algorithm with the claimed runtime in Table 0.1.
Lemma 1.11 is due to [Singh and Yee, 1994].
The performance difference lemma is due to [Kakade and Langford, 2002, Kakade, 2003], though the lemma was implicit in the analysis of a number of prior works.

¹The stated strongly polynomial runtime in Table 0.1 for policy iteration differs from that in [Ye, 2011] because we assume that the runtime per iteration of policy iteration is $|S|^3 + |S|^2|A|$.

Chapter 2

Sample Complexity

Let us now look at the statistical complexity of learning a near optimal policy. Here, we look at a more abstract sampling model, a generative model, which allows us to study the minimum number of transitions we need to observe. This chapter characterizes the minimax optimal sample complexity of estimating $Q^\star$ and of learning a near optimal policy.
In this chapter, we will assume that the reward function is known (and deterministic). This is often a mild assumption, particularly because much of the difficulty in RL is due to the uncertainty in the transition model $P$. This will also not affect the minimax sample complexity.
This chapter follows the results due to [Azar et al., 2013], along with some improved rates due to [Agarwal et al., 2020c].

Generative models. A generative model provides us with a sample $s' \sim P(\cdot|s,a)$ upon input of a state-action pair $(s,a)$. Let us consider the most naive approach to learning (when we have access to a generative model): suppose we call our simulator $N$ times at each state-action pair. Let $\widehat{P}$ be our empirical model, defined as follows:
$$\widehat{P}(s'|s,a) = \frac{\mathrm{count}(s',s,a)}{N},$$
where $\mathrm{count}(s',s,a)$ is the number of times the state-action pair $(s,a)$ transitions to state $s'$. As $N$ is the number of calls for each state-action pair, the total number of calls to our generative model is $|S||A|N$. As before, we can view $\widehat{P}$ as a matrix of size $|S||A| \times |S|$.
The generative model setting is a reasonable abstraction for understanding the statistical limit, without having to directly address exploration.
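A minimal sketch of this estimator (our own illustration): given a sampler for the generative model, draw N next-states per state-action pair and form the empirical counts.

```python
import numpy as np

def estimate_model(sample_next_state, S, A, N, rng):
    """Build P_hat of shape (S*A, S) from N generative-model calls per (s, a).

    sample_next_state(s, a, rng) is an assumed callback returning s' ~ P(.|s, a).
    """
    P_hat = np.zeros((S * A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(N):
                s_next = sample_next_state(s, a, rng)
                P_hat[s * A + a, s_next] += 1.0   # count(s', s, a)
    return P_hat / N                               # P_hat(s'|s,a) = count / N
```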

We define $\widehat{M}$ to be the empirical MDP that is identical to the original $M$, except that it uses $\widehat{P}$ instead of $P$ for the transition model. When clear from context, we drop the subscript $M$ on the values and action values (and on the one-step variances and variances which we define later). We let $\widehat{V}^\pi$, $\widehat{Q}^\pi$, $\widehat{Q}^\star$, and $\widehat{\pi}^\star$ denote the value function, state-action value function, optimal state-action value, and optimal policy in $\widehat{M}$, respectively.

A key question here is:

Do we require an accurate model of the world in order to find a near optimal policy?

Let us first start by looking at the naive approach where we build an accurate model of the world, which will be sufficient for learning a near optimal policy. In particular, as we shall see, $O(|S|^2|A|)$ samples are sufficient to provide us with an accurate model (this is consistent with parameter counting, since $P$ is specified by $O(|S|^2|A|)$ parameters). The question is if we can improve upon this and find a near optimal policy with a number of samples that is sub-linear in the model size, i.e. use a number of samples that is smaller than $O(|S|^2|A|)$. Furthermore, we also wish to characterize the minimax dependence on the effective horizon, i.e. on the dependence on $1/(1-\gamma)$.

2.1 Warmup: a naive model-based approach

Note that since $P$ has $|S|^2|A|$ parameters, a naive approach would be to estimate $P$ accurately and then use our accurate model $\widehat{P}$ for planning.

Proposition 2.1. There exists an absolute constant $c$ such that the following holds. Suppose $\epsilon \in \left(0, \frac{1}{1-\gamma}\right)$ and that we obtain
$$\#\text{ samples from generative model} = |S||A|N \ge \frac{\gamma\, |S|^2|A| \log(c|S||A|/\delta)}{(1-\gamma)^4\,\epsilon^2},$$
where we uniformly sample every state-action pair. Then, with probability greater than $1-\delta$, we have:

• (Model accuracy) The transition model has error bounded as:
$$\max_{s,a}\|P(\cdot|s,a) - \widehat{P}(\cdot|s,a)\|_1 \le (1-\gamma)^2\epsilon.$$

• (Uniform value accuracy) For all policies $\pi$,
$$\|Q^\pi - \widehat{Q}^\pi\|_\infty \le \epsilon.$$

• (Near optimal planning) Suppose that $\widehat{\pi}$ is the optimal policy in $\widehat{M}$. We have that:
$$\|\widehat{Q}^\star - Q^\star\|_\infty \le \epsilon, \quad\text{and}\quad \|Q^{\widehat{\pi}} - Q^\star\|_\infty \le 2\epsilon.$$

Before we provide the proof, the following lemmas will be helpful throughout:
Lemma 2.2 (Simulation Lemma). For all $\pi$ we have that:
$$Q^\pi - \widehat{Q}^\pi = \gamma(I - \gamma\widehat{P}^\pi)^{-1}(P - \widehat{P})V^\pi.$$

Proof: Using our matrix equality for $Q^\pi$ (see Equation 0.2), we have:
$$\begin{aligned}
Q^\pi - \widehat{Q}^\pi &= (I - \gamma P^\pi)^{-1} r - (I - \gamma\widehat{P}^\pi)^{-1} r \\
&= (I - \gamma\widehat{P}^\pi)^{-1}\left((I - \gamma\widehat{P}^\pi) - (I - \gamma P^\pi)\right) Q^\pi \\
&= \gamma(I - \gamma\widehat{P}^\pi)^{-1}(P^\pi - \widehat{P}^\pi) Q^\pi \\
&= \gamma(I - \gamma\widehat{P}^\pi)^{-1}(P - \widehat{P}) V^\pi,
\end{aligned}$$
which proves the claim.
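The identity is easy to sanity-check numerically; the sketch below is our own illustration (reusing the policy_evaluation helper and the (S*A, S) matrix conventions assumed earlier) and compares both sides for a random MDP and a perturbed model.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 4, 2, 0.9
P = rng.dirichlet(np.ones(S), size=S * A)
P_hat = rng.dirichlet(5 * np.ones(S), size=S * A)   # a different ("estimated") model
r = rng.random(S * A)
pi = rng.integers(A, size=S)

def P_pi_matrix(P_mat, pi):
    M = np.zeros((S * A, S * A))
    for s_next in range(S):
        M[:, s_next * A + pi[s_next]] = P_mat[:, s_next]
    return M

Q = policy_evaluation(P, r, gamma, pi, S, A)
Q_hat = policy_evaluation(P_hat, r, gamma, pi, S, A)
V = Q.reshape(S, A)[np.arange(S), pi]
rhs = gamma * np.linalg.inv(np.eye(S * A) - gamma * P_pi_matrix(P_hat, pi)) @ ((P - P_hat) @ V)
assert np.allclose(Q - Q_hat, rhs)
```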


Lemma 2.3. For any policy $\pi$, MDP $M$ and vector $v \in \mathbb{R}^{|S|\times|A|}$, we have $\left\|(I - \gamma P^\pi)^{-1} v\right\|_\infty \le \|v\|_\infty/(1-\gamma)$.

Proof: Note that $v = (I - \gamma P^\pi)(I - \gamma P^\pi)^{-1} v = (I - \gamma P^\pi) w$, where $w = (I - \gamma P^\pi)^{-1} v$. By the triangle inequality, we have
$$\|v\|_\infty = \|(I - \gamma P^\pi) w\|_\infty \ge \|w\|_\infty - \gamma\|P^\pi w\|_\infty \ge \|w\|_\infty - \gamma\|w\|_\infty,$$
where the final inequality follows since $P^\pi w$ is an average of the elements of $w$ by the definition of $P^\pi$, so that $\|P^\pi w\|_\infty \le \|w\|_\infty$. Rearranging terms completes the proof.
Now we are ready to complete the proof of our proposition.

Proof: Using the concentration of a distribution in the $\ell_1$ norm (Lemma A.4), we have that for a fixed $s, a$, with probability greater than $1-\delta$,
$$\|P(\cdot|s,a) - \widehat{P}(\cdot|s,a)\|_1 \le c\sqrt{\frac{|S|\log(1/\delta)}{m}},$$
where $m$ is the number of samples used to estimate $\widehat{P}(\cdot|s,a)$. The first claim now follows by the union bound (and redefining $\delta$ and $c$ appropriately).
For the second claim, we have that:
$$\|Q^\pi - \widehat{Q}^\pi\|_\infty = \left\|\gamma(I - \gamma\widehat{P}^\pi)^{-1}(P - \widehat{P})V^\pi\right\|_\infty \le \frac{\gamma}{1-\gamma}\|(P - \widehat{P})V^\pi\|_\infty \le \frac{\gamma}{1-\gamma}\max_{s,a}\|P(\cdot|s,a) - \widehat{P}(\cdot|s,a)\|_1\,\|V^\pi\|_\infty \le \frac{\gamma}{(1-\gamma)^2}\max_{s,a}\|P(\cdot|s,a) - \widehat{P}(\cdot|s,a)\|_1,$$
where the penultimate step uses Hölder's inequality. The second claim now follows.
For the final claim, first observe that $|\sup_x f(x) - \sup_x g(x)| \le \sup_x |f(x) - g(x)|$, where $f$ and $g$ are real valued functions. This implies:
$$|\widehat{Q}^\star(s,a) - Q^\star(s,a)| = \left|\sup_\pi \widehat{Q}^\pi(s,a) - \sup_\pi Q^\pi(s,a)\right| \le \sup_\pi |\widehat{Q}^\pi(s,a) - Q^\pi(s,a)| \le \epsilon,$$
which proves the first inequality. The second inequality is left as an exercise to the reader.

2.2 Sublinear Sample Complexity

In the previous approach, we are able to accurately estimate the value of every policy in the unknown MDP $M$. However, with regards to planning, we only need an accurate estimate $\widehat{Q}^\star$ of $Q^\star$, which we may hope would require fewer samples. Let us now see that the model based approach can be refined to obtain minimax optimal sample complexity, which we will see is sublinear in the model size.
We will state our results in terms of $N$, and recall that $N$ is the number of calls to the generative model per state-action pair, so that:
$$\#\text{ samples from generative model} = |S||A|N.$$
Let us start with a crude bound on the optimal action-values, which provides a sublinear rate. In the next section, we will improve upon this to obtain the minimax optimal rate.

Proposition 2.4 (Crude Value Bounds). Let $\delta \ge 0$. With probability greater than $1-\delta$,
$$\|Q^\star - \widehat{Q}^\star\|_\infty \le \Delta_{\delta,N}, \qquad \|Q^\star - \widehat{Q}^{\pi^\star}\|_\infty \le \Delta_{\delta,N},$$
where:
$$\Delta_{\delta,N} := \frac{\gamma}{(1-\gamma)^2}\sqrt{\frac{2\log(2|S||A|/\delta)}{N}}.$$

Note that the first inequality above shows a sublinear rate on estimating the value function. Ultimately, we are interested in the value $V^{\widehat{\pi}^\star}$ when we execute $\widehat{\pi}^\star$, not just an estimate $\widehat{Q}^\star$ of $Q^\star$. Here, by Lemma 1.11, we lose an additional horizon factor and have:
$$\|Q^\star - Q^{\widehat{\pi}^\star}\|_\infty \le \frac{1}{1-\gamma}\,\Delta_{\delta,N}.$$
We return to this point in Corollary 2.7 and Theorem 2.8.
Before we provide the proof, the following lemma will be helpful throughout.
Lemma 2.5 (Component-wise Bounds). We have that:
$$Q^\star - \widehat{Q}^\star \le \gamma(I - \gamma\widehat{P}^{\pi^\star})^{-1}(P - \widehat{P})V^\star,$$
$$Q^\star - \widehat{Q}^\star \ge \gamma(I - \gamma\widehat{P}^{\widehat{\pi}^\star})^{-1}(P - \widehat{P})V^\star.$$

Proof: For the first claim, the optimality of $\widehat{\pi}^\star$ in $\widehat{M}$ implies:
$$Q^\star - \widehat{Q}^\star = Q^{\pi^\star} - \widehat{Q}^{\widehat{\pi}^\star} \le Q^{\pi^\star} - \widehat{Q}^{\pi^\star} = \gamma(I - \gamma\widehat{P}^{\pi^\star})^{-1}(P - \widehat{P})V^\star,$$
where we have used Lemma 2.2 in the final step. This proves the first claim.
For the second claim,
$$\begin{aligned}
Q^\star - \widehat{Q}^\star &= Q^{\pi^\star} - \widehat{Q}^{\widehat{\pi}^\star} \\
&= (I - \gamma P^{\pi^\star})^{-1} r - (I - \gamma\widehat{P}^{\widehat{\pi}^\star})^{-1} r \\
&= (I - \gamma\widehat{P}^{\widehat{\pi}^\star})^{-1}\left((I - \gamma\widehat{P}^{\widehat{\pi}^\star}) - (I - \gamma P^{\pi^\star})\right) Q^\star \\
&= \gamma(I - \gamma\widehat{P}^{\widehat{\pi}^\star})^{-1}(P^{\pi^\star} - \widehat{P}^{\widehat{\pi}^\star}) Q^\star \\
&\ge \gamma(I - \gamma\widehat{P}^{\widehat{\pi}^\star})^{-1}(P^{\pi^\star} - \widehat{P}^{\pi^\star}) Q^\star \\
&= \gamma(I - \gamma\widehat{P}^{\widehat{\pi}^\star})^{-1}(P - \widehat{P}) V^\star,
\end{aligned}$$
where the inequality follows from $\widehat{P}^{\widehat{\pi}^\star} Q^\star \le \widehat{P}^{\pi^\star} Q^\star$, due to the optimality of $\pi^\star$, along with the entrywise non-negativity of $(I - \gamma\widehat{P}^{\widehat{\pi}^\star})^{-1}$ (see Lemma 1.6). This proves the second claim.
Proof: Following from the simulation lemma (Lemma 2.2) and Lemma 2.3, we have:
$$\|Q^\star - \widehat{Q}^{\pi^\star}\|_\infty \le \frac{\gamma}{1-\gamma}\|(P - \widehat{P})V^\star\|_\infty.$$
Also, the previous lemma implies that:
$$\|Q^\star - \widehat{Q}^\star\|_\infty \le \frac{\gamma}{1-\gamma}\|(P - \widehat{P})V^\star\|_\infty.$$
By applying Hoeffding's inequality and the union bound,
$$\|(P - \widehat{P})V^\star\|_\infty = \max_{s,a}\left|E_{s'\sim P(\cdot|s,a)}[V^\star(s')] - E_{s'\sim\widehat{P}(\cdot|s,a)}[V^\star(s')]\right| \le \frac{1}{1-\gamma}\sqrt{\frac{2\log(2|S||A|/\delta)}{N}},$$
which holds with probability greater than $1-\delta$. This completes the proof.

2.3 Minimax Optimal Sample Complexity with the Model Based Approach
We now refine the crude bound on $\widehat{Q}^\star$ to be optimal:

Theorem 2.6 (Value estimation). For $\delta \ge 0$ and with probability greater than $1-\delta$,
$$\|Q^\star - \widehat{Q}^\star\|_\infty \le \gamma\sqrt{\frac{c\log(c|S||A|/\delta)}{(1-\gamma)^3 N}} + \frac{c\gamma\log(c|S||A|/\delta)}{(1-\gamma)^3 N},$$
where $c$ is an absolute constant.

Corollary 2.7. Provided that $\epsilon \le 1$, we have that if
$$N \ge \frac{c\log(c|S||A|/\delta)}{(1-\gamma)^3\,\epsilon^2},$$
i.e. a total of $\frac{c|S||A|\log(c|S||A|/\delta)}{(1-\gamma)^3\epsilon^2}$ samples from the generative model, then with probability greater than $1-\delta$,
$$\|Q^\star - \widehat{Q}^\star\|_\infty \le \epsilon.$$
Note that this implies $\|Q^\star - Q^{\widehat{\pi}^\star}\|_\infty \le \epsilon/(1-\gamma)$.

Ultimately, we are interested in the value $V^{\widehat{\pi}^\star}$ when we execute $\widehat{\pi}^\star$, not just an estimate $\widehat{Q}^\star$ of $Q^\star$. The above corollary is not sharp with regards to finding a near optimal policy. The following theorem shows that in fact both value estimation and policy estimation have the same rate.

Theorem 2.8. Provided that $\epsilon \le \sqrt{\frac{1}{1-\gamma}}$, we have that if
$$N \ge \frac{c\log(c|S||A|/\delta)}{(1-\gamma)^3\,\epsilon^2},$$
i.e. a total of $\frac{c|S||A|\log(c|S||A|/\delta)}{(1-\gamma)^3\epsilon^2}$ samples from the generative model, then with probability greater than $1-\delta$,
$$\|Q^\star - Q^{\widehat{\pi}^\star}\|_\infty \le \epsilon, \quad\text{and}\quad \|Q^\star - \widehat{Q}^\star\|_\infty \le \epsilon.$$
We state this improved theorem without proof, as it is more involved, and only prove Theorem 2.6. See Section 2.5 for further discussion.

2.3.1 Lower Bounds


b ? , is (, δ)-good on MDP M
Let us say that an estimation algorithm A, which is a map from samples to an estimate Q
? ?
if kQ − Q b k∞ ≤  holds with probability greater than 1 − δ.

Theorem 2.9. There exists 0 , δ0 , c and a set of MDPs M such that for  ∈ (0, 0 ) and δ ∈ (0, δ0 ) if algorithm A is
(, δ)-good on all M ∈ M, then A must use a number of samples that is lower bounded as follows
c |S||A| log(c|S||A|/δ)
# samples from generative model ≥ ,
(1 − γ)3 2

2.3.2 Variance Lemmas

The key to the sharper analysis is to more sharply characterize the variance in our estimates.
Denote the variance of any real valued f under a distribution D as:

VarD (f ) := Ex∼D [f (x)2 ] − (Ex∼D [f (x)])2

23
Slightly abusing the notation, for V ∈ R|S| , we define the vector VarP (V ) ∈ R|S||A| as:

VarP (V )(s, a) := VarP (·|s,a) (V )

Equivalently,
VarP (V ) = P (V )2 − (P V )2 .

Now we characterize a relevant deviation in terms of the its variance.


Lemma 2.10. Let δ > 0. With probability greater than 1 − δ,
r
? 2 log(2|S||A|/δ) p 1 2 log(2|S||A|/δ)
|(P − P )V | ≤
b VarP (V ? ) + 1.
N 1−γ 3N

Proof: The claims follows from Bernstein’s inequality along with a union bound over all state-action pairs.
? p ? p
The key ideas in the proof are in how we bound k(I − γ Pbπ )−1 VarP (V ? )k∞ and k(I − γ Pbπb )−1 VarP (V ? )k∞ .
It is helpful to define ΣπM as the variance of the discounted reward, i.e.
 !2 
X∞
ΣπM (s, a) := E  γ t r(st , at ) − QπM (s, a) s0 = s, a0 = a
t=0

where the expectation is induced under the trajectories induced by π in M . It is straightforward to verify that
kΣπM k∞ ≤ γ 2 /(1 − γ)2 .
The following lemma shows that ΣπM satisfies a Bellman consistency condition.
Lemma 2.11. (Bellman consistency of Σ) For any MDP M ,

ΣπM = γ 2 VarP (VM


π
) + γ 2 P π ΣπM (0.1)

where P is the transition model in MDP M .

The proof is left as an exercise to the reader.


Lemma 2.12. (Weighted Sum of Deviations) For any policy π and MDP M ,
s
2
q
(I − γP π )−1 VarP (VM π) ≤ ,
∞ (1 − γ)3

where P is the transition model of M .

Proof:Note that (1 − γ)(I − γP π )−1 is matrix whose rows are a probability distribution. For a positive
√ vector v and

a distribution ν (where ν is vector of the same dimension of v), Jensen’s inequality implies that ν · v ≤ ν · v. This
implies:
√ 1 √
k(I − γP π )−1 vk∞ = k(1 − γ)(I − γP π )−1 vk∞
1−γ
r
1
≤ (I − γP π )−1 v
1−γ ∞
r
2
≤ (I − γ 2 P π )−1 v .
1−γ ∞

24
where we have used that k(I − γP π )−1 vk∞ ≤ 2k(I − γ 2 P π )−1 vk∞ (which we will prove shortly). The proof is
completed as follows: by Equation 0.1, ΣπM = γ 2 (I − γ 2 P π )−1 VarP (VM
π π
), so taking v = VarP (VM ) and using that
π 2 2
kΣM k∞ ≤ γ /(1 − γ) completes the proof.
Finally, to see that k(I − γP π )−1 vk∞ ≤ 2k(I − γ 2 P π )−1 vk∞ , observe:
k(I − γP π )−1 vk∞ = k(I − γP π )−1 (I − γ 2 P π )(I − γ 2 P π )−1 vk∞
 
= k(I − γP π )−1 (1 − γ)I + γ(I − γP π ) (I − γ 2 P π )−1 vk∞
 
= k (1 − γ)(I − γP π )−1 + γI (I − γ 2 P π )−1 vk∞
≤ (1 − γ)k(I − γP π )−1 (I − γ 2 P π )−1 vk∞ + γk(I − γ 2 P π )−1 vk∞
1−γ
≤ k(I − γ 2 P π )−1 vk∞ + γk(I − γ 2 P π )−1 vk∞
1−γ
≤ 2k(I − γ 2 P π )−1 vk∞
which proves the claim.

2.3.3 Completing the proof

Lemma 2.13. Let δ ≥ 0. With probability greater than 1 − δ, we have:


?
VarP (V ? ) ≤ 2VarPb (Vb π ) + ∆0δ,N 1
VarP (V ? ) ≤ 2VarPb (Vb ? ) + ∆0δ,N 1
where r
1 18 log(6|S||A|/δ) 1 4 log(6|S||A|/δ)
∆0δ,N := + .
(1 − γ)2 N (1 − γ)4 N

Proof: By definition,
VarP (V ? ) = VarP (V ? ) − VarPb (V ? ) + VarPb (V ? )
= P (V ? )2 − (P V ? )2 − Pb(V ? )2 + (PbV ? )2 + VarPb (V ? )
 
= (P − Pb)(V ? )2 − (P V ? )2 − (PbV ? )2 + VarPb (V ? )

Now we bound each of these terms with Hoeffding’s inequality and the union bound. For the first term, with probability
greater than 1 − δ, r
? 2 1 2 log(2|S||A|/δ)
k(P − P )(V ) k∞ ≤
b
2
.
(1 − γ) N
For the second term, again with probability greater than 1 − δ,
k(P V ? )2 − (PbV ? )2 k∞ ≤ kP V ? + PbV ? k∞ kP V ? − PbV ? k∞
r
2 ? 2 2 log(2|S||A|/δ)
≤ k(P − P )V k∞ ≤
b .
1−γ (1 − γ)2 N
where we have used that (·)2 is a component-wise operation in the second step. For the last term:
? ?
VarPb (V ? ) = VarPb (V ? − Vb π + Vb π )
? ?
≤ 2Var b (V ? − Vb π ) + 2Var b (Vb π )
P P
? ?
≤ 2kV ? − Vb π k2∞ + 2VarPb (Vb π )
?
= 2∆2 + 2Var b (Vb π ) .
δ,N P

25
where ∆δ,N is defined in Proposition 2.4. To obtain a cumulative probability of error less than δ, we replace δ in the
above claims with δ/3. Combining these bounds completes the proof of the first claim. The argument in the above
display also implies that VarPb (V ? ) ≤ 2∆2δ,N + 2VarPb (Vb ? ) which proves the second claim.
Using Lemma 2.10 and 2.13, we have the following corollary.
Corollary 2.14. Let δ ≥ 0. With probability greater than 1 − δ, we have:
s
VarPb (Vb π? ) log(c|S||A|/δ)
|(P − Pb)V ? | ≤ c + ∆00δ,N 1
N
s
VarPb (Vb ? ) log(c|S||A|/δ)
|(P − Pb)V ? | ≤ c + ∆00δ,N 1 ,
N
where  3/4
1 log(c|S||A|/δ) c log(c|S||A|/δ)
∆00δ,N := c + 2
,
1−γ N (1 − γ) N
and where c is an absolute constant.

Proof:(of Theorem 2.6) The proof consists of bounding the terms in Lemma 2.5. We have:
?
γk(I − γ Pbπ )−1 (P − Pb)V ? k∞
r  3/4
log(c|S||A|/δ) cγ log(c|S||A|/δ)
q
π ? −1 π ?
≤ cγ k(I − γ P )
b VarPb (V )k∞ +
b
N (1 − γ)2 N
cγ log(c|S||A|/δ)
+
(1 − γ)3 N
s r  3/4
2 log(c|S||A|/δ) cγ log(c|S||A|/δ) cγ log(c|S||A|/δ)
≤ γ + +
(1 − γ)3 N (1 − γ)2 N (1 − γ)3 N
s r
1 log(c|S||A|/δ) cγ log(c|S||A|/δ)
≤ 3γ c +2 ,
(1 − γ)3 N (1 − γ)3 N

where the first step uses Corollary 2.14; the second uses Lemma 2.12; and the last step uses that 2ab ≤ a2 + b2
(and choosing a, b appropriately). The proof of the lower bound is analogous. Taking a different absolute constant
completes the proof.

2.4 Scalings and Effective Horizon Dependencies

It will be helpful to more intuitively understand why 1/(1 − γ)3 is the effective horizon dependency one might hope
to expect, from a dimensional analysis viewpoint. Due to that Q? is a quantity that is as large as 1/(1 − γ), to account
for this scaling, it is natural to look at obtaining relative accuracy.
In particular, if
c |S||A| log(c|S||A|/δ)
N≥ ,
1−γ 2
then with probability greater than 1 − δ, then
?  b ? k∞ ≤  .
kQ? − Qπb k∞ ≤ , and kQ? − Q
1−γ 1−γ

26

(provided that  ≤ 1 − γ using Theorem 2.8). In other words, if we had normalized the value functions 2 , then for
additive accuracy (on our normalized value functions) our sample size would scale linearly with the effective horizon.

2.5 Bibliographic Remarks and Further Readings

The notion of a generative model was first introduced in [Kearns and Singh, 1999], which made the argument that,
up to horizon factors and logarithmic factors, both model based methods and model free methods are comparable.
[Kakade, 2003] gave an improved version of this rate (analogous to the crude bounds seen here).
Theorem 2.6 is due to [Azar et al., 2013], and the proof in this section largely follows this work. Improvements are
possible with regards to bounding the quality of π b? ; here, Theorem 2.8 shows that the model based approach is near
optimal even for policy itself; showing that the quality of πb? does suffer any amplification factor of 1/(1−γ). [Sidford
et al., 2018] provides the first proof of this improvement using a variance reduction algorithm with value iteration. The
improvement in Theorem 2.8 is due to [Agarwal et al., 2020c], which shows that the naive model based approach is
sufficient.
Finally, we remark that we may hope for the bounds on our value estimation to hold up to  ≤ 1/(1 − γ), which
would be consistent with the lower bounds. Here, the work in [Li et al., 2020] shows this limit is achievable, albeit
with a slightly different algorithm where they introduce perturbations. It is an open question if the naive model based
approach also achieves the non-asymptotic statistical limit.

2 Rescaling the value functions by multiplying by (1 − γ), i.e. Qπ ← (1 − γ)Qπ , would keep the values bounded between 0 and 1. Throughout,

this book it is helpful to understand sample size with regards to normalized quantities.

27
28
Chapter 3

Approximate Value Function Methods

For large MDPs, when the underlying MDPs are unknown and we do not have a generative model, we cannot directly
perform policy iteration. This chapter will consider a simple approach where we learn an approximate Q function and
then update our policy greedily with respect to the estimated Q function.
This chapter focuses on obtaining of average case function approximation error bounds, provided we have a somewhat
stringent condition on how the underlying MDP behaves, quantified by the concentrability coefficient. This notion was
introduced in [Munos, 2003, 2005]. While the notion is somewhat stringent, we will see that there is reason to believe
it is not avoidable. Chapters 11 and 12 seek to relax this notion.

3.1 Setting

We consider infinite discounted MDPs M = (S, A, P, r, γ, µ) in this chapter. Here the MDP might have large or even
continuous state space. We assume action space is discrete and we denote A = |A| as the number of actions. We are
given a policy class Π = {π : S 7→ A} ⊂ S 7→ A. Note that the policy class is a restricted policy class which is a
subset of the class of all mappings from S to A. We denote the best policy in policy class as π ? , which is the policy
that maximizes the expected total reward with µ as the initial state distribution:
"∞ #
X
π ? ∈ argmaxπ∈Π E γ h r(sh , ah )|ah = π(sh ) .
h=0
?
Note that π is the best policy in policy class that maximizes the objective function and it is not necessarily true that
π ? will be the optimal policy of the MDP M which maximizes total reward starting from any state simultaneously
(i.e., the policy class might not be rich enough to contain the optimal policy of M ).

3.2 Approximate Greedy Policy Selector

Given a policy π 0 , one intuitive approach we attempt to do is to act greedily with respect π 0 at every state (recall
Policy Iteration in a known tabular MDP). However due to the unknown MDP and large state space, we will not be
0
able to have Aπ (s, a) at every state-action pair. Instead, we can act greedily in the average sense:
h 0 i
b ∈ argmaxπ∈Π Es∼dπ0 Aπ (s, π(s)) .
π
µ

29
We call the above procedure as greedy policy selector. We aim to pick a policy that acts greedily with respect to π 0
under the states visited by π 0 .
Implement the exact greedy policy selector is not possible due to the fact that we do not know the exact Aπ . In this
section, we explain how to achieve an ε-approximate greedy policy selector, which is defined in the definition below.
Definition 3.1 (ε-approximate Greedy Policy Selector ). Given a policy π, we denote Gε (π, Π, µ) as the oracle that
b ∈ Π, such that:
returns a policy π
Es∼dπµ Aπ (s, π
b(s)) ≥ max Es∼dπµ Aπ (s, π
e(s)) − ε.
e∈Π
π

Below we study two approaches to implement the above selector: one is via a reduction to classification with the
policy class Π and the other one is via a reduction to regression using value function approximation.

3.2.1 Implementing Approximate Greedy Policy Selector using Classification

Below we explain that we can implement such approximate Greedy Policy Selector via reduction to a classic super-
vised learning oracle—weighted multi-class classification. We first define a weighted classification oracle as follows.
Definition 3.2 (Weighted Classification Oracle). Given a dataset D = {si , ci }N A
i=1 where ci ∈ R , and a policy class
Π, the weight classification oracle returns the best classifier:
N
X
CO(D, Π) = argmaxπ∈Π ci [π(s)],
i=1

where c[a] denotes the value in the entry in c that corresponds to action a.

Weighted classification oracle is a standard oracle in supervised learning setting, and weighted classification oracle
can be further reduced to a regular classification oracle or a regression oracle. We will assume the existence of CO.
Now we can implement an approximate greedy policy selector via the CO oracle using data from dπµ up to statistical
error. We draw a dataset D = {si , ai , Aei }, where si ∼ dπ , ai ∼ U (A) (where we denote U (A) as the uniform
µ
ei is an unbiased estimate of Aπ (si , ai ) computed from a single rollout. We
distribution over action space A), and A
can perform the policy selection procedure using the CO oracle as follows:
N
X
π
b := argmaxπ∈Π ci [π(si )],
e (0.1)
i=1

Ai
where e ci ∈ RA is a one-hot vector with zeros everywhere, except the entry that corresponds to ai contains 1/A .
e

Essentially we are performing importance weighting here so that an e ci is indeed an unbiased hPestimate of the vector i
A
ei (si ,ai )
[Aπ (si , a)]> a∈A ∈ R A
, given si . To see that, note that for any a ∈ A, we have c
E[e i [a]|si ] = E 1
ai A 1{a = a i } 1/A =
h e i
A (s ,a)
E A1 i1/A i
= Aπ (si , a).

Theorem 3.3 (Approximate Greedy Policy Selector). Given a dataset D = {si , ai , A ei }, where si ∼ dπµ , ai ∼ U (A),
and Aei is an unbiased estimate of Aπ (si , ai ) computed from a single rollout, denote π
b as the return in Eq. 0.1. We
have that with probability at least 1 − δ:
r
4A ln(|Π|/δ)
Es∼dπµ Aπ (s, π
b(s)) ≥ max Es∼dπµ Aπ (s, π
e(s)) − .
e∈Π
π 1−γ N

30
Proof:We can apply Hoeffding’s inequality for a fixed policy π 0 ∈ Π and then a union bound over all π 0 ∈ Π. With
probability at least 1 − δ, we have that for all π 0 ∈ Π
N
r
X
0 π 0 2A ln (|Π|/δ)
ci [π (si )]/N − Es∼dµ A (s, π (s)) ≤
π := εstat

e
i=1
1 γ N
To see this, note that first of all, we have:
ci [π 0 (si )] si = Es∼dπµ Aπ (s, π 0 (s)),
  
Esi E e

ci |si ] = [Aπ (si , a)]>


as E[e ci . Second, note that we have:
a∈A , due to the importance weighting in e

A
|e
ci [a]| ≤ .
1−γ
With the uniform convergence result, we can conclude that:
N N
1 X 1 X
Es∼dπµ Aπ (s, π
b(s)) ≥ e π (si )] − εstat ≥
ci [b ci [π 0 (si )] − εstat ≥ Es∼dπµ Aπ (s, π 0 (s)) − 2εstat ,
e
N i=1 N i=1

for any π 0 ∈ Π including argmaxπe∈Π Es∼dπµ Aπ (s, π


e(s)).
Note that the above analysis shows that we can approximately optimize maxπe∈Π Es∼dπµ Aπ (s, π
e(s)) up to statistical
A
p
error. We can set 1−γ ln(|Π|/δ)/N = ε and solve for N which is the total number of i.i.d samples we need to draw
in order to get an ε-approximate policy selector with probability at least 1 − δ.

3.2.2 Implementing Approximate Greedy Policy Selector using Regression

Here we present an implementation based on value function approximation. Specifically, instead of starting directly
from a restrict policy class Π and a reduction to classification, we start from a restricted value function class F = {f :
S × A 7→ [1, 1/(1 − γ)]}. In this case, one can think about the policies class Π consisting of all greedy policies with
respect to f ∈ F, i.e., Π = {π(s) = argmaxa f (s, a) : f ∈ F}.
We perform a reduction to regression. Consider the following least square regression problem. Given the dataset
ei } with si ∼ dπµ , ai ∼ U (A) and A
{si , ai , A ei is an unbiased estimate of Aπ (si , ai ), we perform the following
regression:
N 
X 2
fb ∈ argmaxf ∈F ei
f (si , ai ) − A .
i=1

With fb, the approximate greedy policy is set as:


b(s) = argmaxa∈A fb(s, a), ∀s.
π
Using a similar uniform convergence argument as in the proof of Theorem 3.3, it is not hard to get a similar general-
ization bound as in Theorem 3.3.

3.3 Approximate Policy Iteration (API)

With the above approximate greedy policy selector, now we introduce the Approximate Policy Iteration (API) algo-
rithm, which is described in the following iteration:
π t+1 := Gε (π t , Π, µ). (0.2)

31
Note that API does not guarantee policy improvement nor convergence without additional assumption. We will give an
example in the next section where we show that even with the exact approximate greedy policy selection, i.e., ε = 0,
API cannot make any policy improvement and could oscillate between two suboptimal policies forever.
To have meaningful guarantees of policy improvement and convergence for API, we introduce the following concen-
tration assumption on the initial distribution µ:

µ (s)
Assumption 3.4 (Bounded Concentration Coefficient). We assume that C := maxπ∈Π sups∈S µ(s) < ∞.

With this assumption, we can show that API has monotonic improvement as long as there is local improvement, i.e.,
t
maxπ∈Π Es∼dπt Aπ (s, π(s)) is reasonably big.
µ

Theorem 3.5 (Monotonic Policy Improvement). For any t, we have:


 
t+1 t 1 h t i
Vπ −Vπ ≥ max Es∼dπt Aπ (s, π(s)) − ε .
C π∈Π µ

Proof: We start with Performance Difference Lemma.


 t+1 t
 h t i
(1 − γ) V π − V π = Es∼dπt+1 Aπ (s, π t+1 (s))
µ
t+1
dµπ (s) h πt i (1 − γ)µ(s) h πt i
= Es∼dπt πt A (s, π t+1 (s)) ≥ Es∼dπt A (s, π t+1
(s))
µ dµ (s) µ dπµt (s)
µ(s) h πt i 1−γ h t i
≥ (1 − γ)Es∼dπt inf πt A (s, π t+1 (s)) ≥ Es∼dπt Aπ (s, π t+1 (s)) ,
µ s dµ (s) C µ

where the last inequality uses the definition of C in Assumption 3.4.


This implies that:
t+1 t 1 h t i 1 h t i ε
Vπ −Vπ ≥ Es∼dπt Aπ (s, π t+1 (s)) ≥ max Es∼dπt Aπ (s, π(s)) − .
C µ C π∈Π µ C

t
The above theorem implies that when maxπ∈Π Es∼dπt Aπ (s, π(s)) > ε and C < ∞, then we make monotonic
µ
improvement every iteration.

3.4 Failure Case of API Without Assumption 3.4

In this section, we show that API indeed will fail to provide policy improvement if C = ∞. To illustrate this
phenomena, we simply consider the exact greedy policy selector, i.e., we assume that for Gε (π, Π, µ), we have ε = 0.
Claim 3.6. There exists a policy class Π, an MDP, a µ restart distribution where C = ∞, and two policies π 0 and π 00 ,
such that if one start API with π 0 ∈ {π 0 , π 00 }, π t and π t+1 will oscillate between π 0 and π 00 which are both γ away
from the optimal policy. Namely API with π t+1 = G0 (π t , Π, µ) will not be able to make any policy improvement nor
will it converge.

Proof: The MDP is shown in Fig. 0.1 where the transition is deterministic and µ(s1 ) = 1. We consider Π that contains
all stationary policies. We consider the two policies π 0 and π 00 as follows:

π 0 (s1 ) = a1 , π 0 (s2 ) = a2 , π 0 (s3 ) = a1 ; π 00 (s1 ) = a2 , π 00 (s2 ) = a1 , π 00 (s3 ) = a2 .

32
Figure 0.1: The example MDP. The MDP has deterministic transition and µ has probability mass on s1 . We have
reward zero every where except r(s2 , a1 ) = r(s3 , a1 ) = 1.

0 0 00 00
Hence for π 0 , dπµ (s3 ) = 0 and dπµ (s) > 0 for s 6= s3 . Similarly for π 00 , we have dπµ (s2 ) = 0 and dπµ (s) > 0 for
s 6= s2 .
Consider the greedy policy selection under π 0 :
0
π ∈ argmaxπ∈Π Es∼dπµ0 Aπ (s, π(s)).
0
We claim that π 00 is one of the maximizers of the above procedure. This is because dπµ (s3 ) = 0 and thus π(s3 ) does
0 0
not affect the objective function at all. For s1 , note that Qπ (s1 , a1 ) = 0 while Qπ (s1 , a2 ) > 0. Thus a greedy policy
0 0
will pick a2 which is consistent to the choice of π 00 . For s2 , we have Qπ (s2 , a1 ) > 0 while Qπ (s2 , a2 ) = 0. Thus a
greedy policy will pick a1 at s2 which again is consistent with the choice of π 00 at s2 . This concludes that π 00 is one of
the greedy policies under π 0 .
00
Similarly, one can argue that π 0 ∈ argmaxπ∈Π Es∼dπ00 Aπ (s, π(s)), i.e., at π 00 , API will switch back to π 0 in the next
h
iteration.
Thus, we have proven that when running API with either π 0 or π 00 as initialization, API can oscillate between π 0 and
π 00 forever. Note that π 0 and π 00 have the same value and are γ away from the optimal policy’s value.
The above phenomena really comes from the fact that API is making abrupt policy update, i.e., there is no way we
can guarantee π t+1 is close to π t , for instance, in terms of their resulting state distribution. Thus, for states s that have
t
really small probability under dπµ , π t+1 (·|s) and π t (·|s) could be different. Intuitively, if we look at the Performance
t+1 t
Difference between π and π , we see that:
t+1 t 1 h t i
Vπ −Vπ = Es∼dπt+1 Aπ (s, π t+1 (s)) ,
1−γ µ

which says that in order to make policy improvement, we need the new policy π t+1 to be greedy with respect to π t
t+1
under dπµ —the state distribution of the new policy. However, the greedy policy selector only selects a policy that is
t t t+1
greedy with respect to π t under dπµ . Hence, unless dπµ and dπµ are close, we will not be able to directly transfer the
t t+1 t
potential local one-step improvement Es∼dπt Aπ (s, π t+1 (s)) to policy improvement V π − V π .

33
3.5 Can we relax the concentrability notion?

There is another family of policy optimization algorithm which use incremental policy updates, i.e., when we perform
t+1 t
policy update, we ensure that dµπ is not all that different from dπµ . Incremental policy update will the core of Part 3.
For instance, the algorithm Conservative Policy Iteration, which we study in Chapter 12, uses a conservative policy
t+1 t
update π t+1 (·|s) := (1 − α)π t (·|s) + αGε (π t , Π, µ) for all s, which, for small α, ensures that dπµ and dπµ are
not too far apart. Properly setting the step size α, we can show that CPI makes monotonic policy improvement and
converges to a local optimal policy with a more relaxed condition over Assumption 3.4. We will explain the benefit of
incremental policy updating in more detail in Part III.

3.6 Bibliographic Remarks and Further Readings

The notion of concentrability was developed in [Munos, 2003, 2005] in order to permitting sharper bounds in terms
of average case function approximation error, provided that the concentrability coefficient is bounded. These methods
also permit sample based fitting methods, with sample size and error bounds, provided there is a data collection policy
that induces a bounded concentrability coefficient [Munos, 2005, Szepesvári and Munos, 2005, Antos et al., 2008,
Lazaric et al., 2016]. Chen and Jiang [2019] provide a more detailed discussion on this quantity.

34
Chapter 4

Generalization

Up to now we have focussed on “tabular” MDPs. While studying this setting is theoretically important, we ultimately
seek to have learnability results which are applicable to cases where number of states is large (or, possibly, countably
or uncountably infinite). This is a question of generalization.
A fundamental question here is:

To what extent is generalization in RL similar to (or different from) that in supervised learning?

This is the focus of this chapter. Understanding this question is crucial in how we study (and design) scalable algo-
rithms. These insights will also help us to motivate the various more refined assumptions (and settings) that we will
consider in subsequent chapters.
In supervised learning (and binary classification in particular), it is helpful to distinguish between two different objec-
tives: First, it is not difficult to see that, in general, it is not possible to learn the Bayes optimal classifier in a sample
efficient manner without strong underlying assumptions on the data generating process.1 Alternatively, given some
restricted set of classifiers (our hypothesis class H, which may not contain the Bayes optimal classifier), we may hope
to do as well as the best classifier in this set, i.e. we seek low (statistical) regret. This objective is referred to as agnostic
learning; here, obtaining low regret is possible, provided some measure of the complexity of our hypothesis set is not
too large.
With regards to reinforcement learning, we may ask a similar question. It is not difficult to see that in order to provably
learn the truly optimal policy in a sample efficient manner (say that does not depend on the number of states |S|), then
we must rely on quite strong assumptions. Analogous to the agnostic learning question in supervised learning, we
may ask the following question: given some restricted (and low complexity) policy class Π (which may not contain
the optimal policy π ? ), what is the sample complexity of doing nearly as well as the best policy in this class?
This chapter follows the reduction from reinforcement learning to supervised learning that was first introduced in [Kearns
et al., 2000], which used a different algorithm (the “trajectory tree” algorithm), and our discussion here largely follows
the motivation discussed in [Kearns et al., 2000, Kakade, 2003].
Before we address this question, a few remarks are in order.

Binary classification as a γ = 0 RL problem. Let us observe that the problem of binary classification can be
thought of as learning in an MDP: take γ = 0 (i.e. the effective horizon is 1); suppose we have a distribution of
1 Such impossibility results are often referred to as a “No free lunch theorem” theorems.

35
starting states s0 ∼ µ; suppose |A| = 2; and the reward function is r(s, a) = 1(label(s) = a). In other words,
we equate our action with the prediction of the binary class, and the reward function is 1 or 0, determined by if our
prediction is correct.

Sampling model In this chapter, we consider a weaker (and more realistic) sampling model where we have a starting
state distribution µ over states. We assume sampling access to the MDP where we start at a state s0 ∼ µ; we can
rollout a policy π of our choosing; and we can terminate the trajectory at will. We are interested in learning with a
small number of observed trajectories.

4.1 Review: Binary Classification and Generalization

One of the most important concepts for learning binary classifiers is that it is possible to generalize even when the
state space is infinite. Here note that the domain of our classifiers, often denoted by X , is analogous to the state
space S. We now briefly review some basics of supervised learning before we turn to the question of generalization in
reinforcement learning.
Consider the problem of binary classification with N labeled examples of the form (xi , yi )N i=1 , with xi ∈ X and
yi ∈ {0, 1}. Suppose we have a (finite or infinte) set H of binary classifiers where each h ∈ H is a mapping of the
form h : X → {0, 1}. Let 1(h(x) 6= y) be an indicator which takes the value 0 if h(x) = y and 1 otherwise. We
assume that our samples are drawn i.i.d. according to a fixed joint distribution D over (x, y).
Define the empirical error and the true error as:
N
1 X
rr(h) =
ec 1(h(xi ) 6= yi ), err(h) = E(X,Y )∼D 1(h(X) 6= Y ).
N i=1

For a given h ∈ H, Hoeffding’s inequality implies that with probability at least 1 − δ:


r
1 2
|err(h) − ec
rr(h)| ≤ log .
2N δ
This and the union bound give rise to what is often referred to as the “Occam’s razor” bound:
Proposition 4.1. (The “Occam’s razor” bound) Suppose H is finite. Let b rr(h) and h? = arg minh∈H err(h).
h = arg minh∈H ec
With probability at least 1 − δ: r
? 2 2|H|
err(h) − err(h ) ≤
b log .
N δ

Hence, provided that


c log 2|H|
δ
N≥ ,
2
then with probability at least 1 − δ, we have that:

h) − err(h? ) ≤ .
err(b

A key observation here is that the our regret — the regret is the left hand side of the above inequality — has no
dependence on the size of X (i.e. S) which may be infinite and is only logarithmic in the number of hypothesis in our
class.
In the supervised learning setting, a crucial observation is that even though a hypothesis set H may be infinite, the
number of possible behaviors of on a finite set of states is not necessarily exhaustive. Let us review the definition of

36
the VC dimension for a hypothesis set of boolean functions. We say that the set {x1 , x2 , . . . xd } is shattered if there
exists an h ∈ H that can realize any of the possible 2d labellings. The Vapnik–Chervonenkis (VC) dimension is the
size of the largest shattered set. If d = V C(H), then the Sauer–Shelah lemma states the number of possible labellings
d
on a set of N points by functions in H is at most eN d . For d << N , this is much less than 2N .
The following classical bound highlights how generalization is possible on infinite hypothesis classes with VC dimen-
sion.
h = arg minh∈H ec
Proposition 4.2. (VC dimension and generalization) Let b rr(h) and h? = arg minh∈H err(h).
Suppose H has a bounded VC dimension. For m ≥ VC(H), we have that with probability at least 1 − δ:
s  
? c 2N 2
h) − err(h ) ≤
err(b VC(H) log + log ,
N VC(H) δ

where c is an absolute constant

4.2 Generalization and Agnostic Learning in RL

Now consider the case where we have a set of policies Π (either finite or infinite). For example, Π could be a parametric
set. Alternatively, we could have a set of parametric value functions V = {fθ : S × A → R| θ ∈ Rd }, and Π could be
the set of policies that are greedy with respect to values in V.
The goal of agnostic learning can be formulated by the following optimization problem:

max Es0 ∼µ V π (s0 )


π∈Π

As before, we only hope to perform favorably against the best policy in Π. Recall that in our aforementioned sampling
model we have the ability to obtain trajectories from s0 ∼ µ under policies of our choosing. As we have seen, agnostic
learning is possible in the supervised learning setting, with regret bounds that have no dependence on the size of the
domain — the size of domain is analogous to the size the state space |S|.

4.2.1 Upper Bounds: Data Reuse and Importance Sampling

We now provide a reduction of RL to the supervised learning problem. The key issue is how to efficiently reuse data.
Here, we will simply collect N trajectories by executing a policy which chooses samples uniformly at random; let
πuar denote this policy. For simplicity, we only consider deterministic policies.
The following shows how we can obtain a nearly unbiased estimate of the reward with this uniform policy:
Lemma 4.3. (Near unbiased estimation of V π (s0 )) We have that:
" H
# "H #
 X X
H t t
|A| Eπuar 1 π(s0 ) = a0 , . . . , π(sH ) = aH γ r(st , at ) s0 = Eπ γ r(st , at ) s0 .
t=0 t=0

(Truncation) We also have that:


"H #
X
π
|V (s0 ) − Eπ γ r(st , at ) | ≤ γ H /(1 − γ),
t

t=0

log 1/ (1−γ))
which implies that for H = 1−γ we will have an  approximation to V π (s0 ).

37
In other words, the estimated reward of π on a trajectory is nonzero only when π takes exactly identical actions to
those taken by πuar on the trajectory, in which case the estimated value of π is |A|H times that of πuar . Note the factor
of |A|H , due to importance sampling, leads this being a high variance estimate. We will return to this point in the next
section.
Proof: To be added...
Denote the n-th sampled trajectory by (sn0 , an0 , r1n , sn1 , . . . , snH ), where H is the cutoff time where the trajectory ends.
We can then use following to estimate the γ-discounted reward of any given policy π:
N H
|A|H X  n X
Vb π (s0 ) = 1 π(s0 ) = an0 , . . . π(snH ) = anH γ t r(snt , ant ).
N n=1 t=0

Proposition 4.4. (Generalization


 b = arg maxπ∈Π Vb π (s0 ). Using
in RL) Suppose Π is a finite set of policies. Let π
log 2/ (1−γ))
H= 1−γ we have that with probability at least 1 − δ:
r
 2 2|Π|
V (s0 ) ≥ arg max V (s0 ) − − |A|H
π
b π
log .
π∈Π 2 N δ

Hence, provided that


c log(2|Π|/δ)
N ≥ |A|H ,
2
then with probability at least 1 − δ, we have that:

V πb (s0 ) ≥ arg max V π (s0 ) − .


π∈Π

This is the analogue of the Occam’s razor bound for RL.


Importantly, the above shows that we can avoid dependence on the size of the state space, though this comes at the
price of an exponential dependence on the horizon. As we see in the next section, this dependence is unavoidable
(without making further assumptions).
With regards to infinite hypothesis classes of policies, extending the Occam’s razor bound can be done with standard
approaches from statistical learning theory. For example, consider the case where |A| = 2, where Π is class of
deterministic policies. Here, as each π ∈ Π can be viewed as Boolean function, VC(Π) is defined in the usual manner.
Here, we have:

Proposition 4.5. (Bounded VC dimension) Suppose |A|


 = 2 and that suppose Π has a bounded VC dimension. Let
log 2/ (1−γ))
b = arg maxπ∈Π Vb π (s0 ). Using H =
π 1−γ and for N ≥ VC(Π), we have that with probability at least
1 − δ: s  
 c 2N 2
V (s0 ) ≥ arg max V (s0 ) − − 2H
π
b π
VC(Π) log + log ,
π∈Π 2 N VC(Π) δ

where c is an absolute constant.

We do not prove this result here, which follows a standard argument using results in statistical learning theory. The key
observation here is that, the Sauer–Shelah lemma bounds the number of possible labellings on a set of N trajectories
d
(each of length H) by eNdH , where d = VC(Π).
See Section 4.5.

38
4.2.2 Lower Bounds

Clearly, the drawback of these bounds are that they are exponential in the problem horizon. We now see that if we
desire a sample complexity that scales with O(log |Π|), then an exponential dependence on the effective horizon is
unavoidable, without making further assumptions.
An algorithm is a procedure which sequentially samples trajectories and then returns some policy π (we often say the
algorithm is proper if it returns a π ∈ Π). An algorithm is deterministic if it executes a policy (to obtain a trajectory)
in manner that is a deterministic function of the data that it has collected. We only consider deterministic algorithms
in this section, which does not quantitatively change the conclusions.
First, let us present the following simple observation, which already shows that avoiding an exp(1/(1−γ)) dependence
is not possible.
Proposition 4.6. (Lower Bound for The Complete Policy Class) Suppose |A| = 2 and |S| = 2H , where H = b log(2) 1−γ c.
H
Let Π be the set of all 2 policies. There exists a family of MDPs such that if a deterministic algorithm A is guaranteed
to find a policy π such that:
V πb (s0 ) ≥ arg max V π (s0 ) − 1/4.
π∈Π

then A must use N ≥ 2H trajectories.

Observe that log |Π| = H log(2), so this already rules out the possibility of logarithmic dependence on the size of
the policy class, without having an exponential dependence on H. The proof is straightforward, where we consider a
family of binary trees where the rewards are at one of the terminal leaf nodes.
Proof: Consider a family of deterministic MDPs, where each in each MDP the dynamics are specified by a binary tree
of depth H, with H = b log(2)
1−γ c and where there is a reward at one of the terminal leaf nodes. Note that for setting of
H, γ H ≤ exp(−(1 − γ)H) ≥ 1/2. Since Π is the set of all 2H policies, then we must check every leaf, in the worst
case (due that our algorithm is deterministic). This completes the proof.
In the previous proposition, our policy class was the complete class. Often, we are dealing with policies class which
are far more restrictive. Even in this case, the following proposition strengthens this lower bound to be applicable to
arbitrary policy classes, showing that even here (if we seek no dependence on |S|), we must either have exponential
dependence on the effective horizon or we must exhaustively try what is the effective size of all our policies.
Proposition 4.7. (Lower Bound for an Arbitrary Policy Class) Define H = b log(2) 1−γ c. Suppose |A| = 2 and let Π be
an arbitrary policy class. There exists a family of MDPs such that if a deterministic algorithm A is guaranteed to find
a policy π
b such that: h i
E V πb (s0 ) ≥ arg max V π (s0 ) − .
π∈Π
(where the expectation is with respect to the trajectories the algorithm observes) then A must use an expected number
of trajectories N where
min{2H , 2VC(Π) }
N ≥c ,
2
where c is a universal constant.

We can interpret 2VC(Π) is the effective the number of policies in our policy class (by the definition of the VC dimen-
sion, it is number of different behaviors in our policy set). Thus, requiring O(2VC(Π) ) samples shows that, in the worst
case, we are not able to effectively reuse data (as was the case in supervised learning), unless have an exponential
dependence on the horizon.
Proof: We will only prove this result for  = 1/4, where we will see that we need
N ≥ min{2H , 2VC(Π) }

39
By definition of the VC dimension, our policy class can exhibit 2VC(Π) distinct action sequences on VC(Π) states.
Suppose VC(Π) ≤ H. Here, we can construct a binary tree where the set of distinct leaves visited by Π will be
precisely equal to 2VC(Π) . By placing a unit reward at one of these leaves, the algorithm will be forced to explore all
of the leaves. If VC(Π) ≤ H, then exploring the full binary tree is necessary.
We leave the general case as an exercise for the reader. As a hint, consider two different types of leaf nodes: for all
but one of the leaf nodes, we obtain unit reward with 1/2 probability, and, if the remaining leaf node is reached, we
obtain unit reward with 1/2 +  probability.

4.3 Interpretation: How should we study generalization in RL?

The above clearly shows that, without further assumptions, agnostic learning (in the standard supervised learning
sense) is not possible in RL, unless we can tolerate an exponential dependence on the horizon 1/(1 − γ). Note that
agnostic learning is not about being (unconditionally) optimal, but only being competitive among some restricted
(hopefully lower complexity) set of models. Regardless, even with this weaker success criterion, avoiding the expo-
nential dependence on the effective horizon is simply not possible.
This motivates the study of RL to consider either stronger assumptions or means in which the agent can obtain side
information. Three examples of approaches that we will consider in this book are:

• Structural (and Modelling) Assumptions: By making stronger assumptions about the world, we can move away
from agnostic learning and escape the curse of dimensionality. We will see examples of this in Part 2.

• Distribution Dependent Results (and Distribution Shift): When we move to policy gradient methods (in Part 3),
we will consider results which depend on given distribution of how we obtain samples. Here, we will make
connections to transfer learning.

• Imitation learning and behavior cloning: here will consider models where the agent has input from, effectively,
a teacher, and we will see how this alleviates the problem of curse of dimensionality.

4.4 Approximation Limits with Linearity Assumptions

Given our previous lower bounds and discussion, it is natural to consider making assumptions. A common assumption
is that the Q-function (or value function) is a (nearly) linear function of some given features (our representation); this
is a natural assumption to begin our study of function approximation. In practice, suche features are either hand-crafted
or a pre-trained neural network that transforms a state-action pair to a d-dimensional embedding 2 .
We now see that, even when we make such linearity assumptions, there are hard thresholds, on the worst case approx-
imation error of our representation, that have to be satisfied in order for our linearity assumption to be helpful.
We now provide a lower bound on the approximation limits for value-based learning, when we have a approximate
linear representation. Formally, the agent is given a feature extractor φ : S × A → Rd , which can be hand-crafted or a
pre-trained neural network that transforms a state-action pair to a d-dimensional embedding. The following assumption
states that the given feature extractor can be used to predict the Q-function (of any policy) with approximation error
at most approx linear function.
In this section, we assume we are in the finite horizon (undiscounted) setting.

2 The more challenging question is to learn the features

40
Assumption 4.8 (Linear Value Function Approximation). There exists approx > 0, such that for any h ∈ [H] and any
policy π, there exists θhπ ∈ Rd such that for any (s, a) ∈ S × A, |Qπh (s, a) − hθh , φ (s, a)i| ≤ approx .

Here approx is the approximation error, which indicates the quality of the representation. If approx = 0, then all Q-
functions can be perfectly represented by a linear function of φ (·, ·). In general, as we increase the dimension of φ we
expect that approx becomes smaller , since larger dimension usually has more expressive power.
Later on, we will see that if approx is 0, then sample efficient learning is possible (with a polynomial dependence on
H and d, but no dependence on |S| and |A|). The following theorem shows that such assumptions, necessarily, need
approx close to 0, else sample efficient learning is not possible, which is consistent with our agnostic learning lower
q 
H
bounds in this chapter. In particular, The following theorem shows when approx = Ω d , the agent needs to
sample exponential number of trajectories to find a near-optimal policy.
Theorem 4.9 (Exponential Lower Bound for Value-based Learning). There exists a family of MDPs with |A| = 2
and a feature extractor φ that satisfy Assumption 4.8, such that anyalgorithm that returns a 1/2-optimal policy with
probability 0.9 needs to sample Ω min{|S|, 2H , exp(d2approx /16)} trajectories.

We state the theorem without proof. The lower bound is again based on a the deterministic binary tree hard instance,
with only one rewarding node (i.e. state) at a leaf. With no further assumptions, as before to find a 1/2-optimal policy
for such MDPs, the agent must enumerate all possible states in level H − 1 to find the state with reward R = 1. Doing
so intrinsically induces a sample complexity of Ω(2H ).
Th key idea of the proof is that we can construct a set of features so that Assumption 4.8 holds, and, yet, these features
reveal no additional information to the learner (and, so, the previous lower bound still applies). The main idea in the
construction uses the following fact regarding the -approximate rank of the identity matrix of size 2H : this (large)
identity matrix can be approximated to - accuracy (in the spectral norm) with a matrix of rank only O(H2 ) 3 . In our
context, this fact can be used to construct a set of features φ, all of which live in an O(H2 ) dimensional subspace,
where these features well approximate all 2H value function; the crucial property here is that the features can be
constructed with no knowledge of the actual reward function.

4.5 Bibliographic Remarks and Further Readings

The reduction from reinforcement learning to supervised learning was first introduced in [Kearns et al., 2000], which
used a different algorithm (the “trajectory tree” algorithm), as opposed to the importance sampling approach presented
here. [Kearns et al., 2000] made the connection to the VC dimension of the policy. The fundamental sample complexity
tradeoff — between polynomial dependence on the size of the state space and exponential dependence on the horizon
— was discussed in depth in [Kakade, 2003].
The approximation limits with linear function approximation are results from [Du et al., 2019].

3 Such a result can be proven with the e Johnson-Lindenstrauss Lemma

41
42
Part 2

Strategic Exploration

43
Chapter 5

Multi-armed & Linear Bandits

For the case, where γ = 0 (or H = 1 in the undiscounted case), the problem of learning in an unknown MDP reduce
to the multi-armed bandit problem. The basic algorithms and proof methodologies here are important to understand
in their own right, due to that we will have to extend these with more sophisticated variants to handle the exploration-
exploitation tradeoff in the more challenging reinforcement learning problem.
This chapter follows analysis of the LinUCB algorithm from the original proof in [Dani et al., 2008], with a simplified
concentration analysis due to [Abbasi-Yadkori et al., 2011].

5.1 The K-Armed Bandit Problem

The setting is where we have K decisions (the “arms”), where when we play arm i ∈ {1, 2, . . . K} we obtain a random
reward ri which has mean reward:
E[ri ] = µi
where we assume µi ∈ [−1, 1].
Every iteration t, the learner will pick an arm It ∈ [1, 2, . . . K]. Our cumulative regret is defined as:
T
X −1
RT = T · max µi − µIt
i
t=0

We denote a? = argmaxi µi as the optimal arm. We define gap ∆a = µa? − µ(a) for any arm a.

Theorem 5.1. There exists an algorithm such that with probability at least 1 − δ, we have:
   
p X ln(T K/δ) 
RT = O min KT · ln(T K/δ), + K .
 ?
∆a 
a6=a

5.1.1 The Upper Confidence Bound (UCB) Algorithm

We summarize the upper confidence bound (UCB) algorithm in Alg. 1. For simplicity, we allocate the first K rounds
to pull each arm once.

45
Algorithm 1 UCB
1: Play each arm once and denote received reward as ra for all a ∈ {1, 2, . . . K}
2: for t = 0 → T − 1 − K do  q 
3: Execute arm It = arg maxi∈[K] µ̂t (i) + log(T K/δ)
N t (i)
4: Observe rIt
5: end for

where every iteration t, we main counts of each arm:

t−1
X
t
N (a) = 1 + 1{Ii = a},
i=0

where It is the index of the arm that is picked by the algorithm at iteration t. We main the empirical mean for each
arm as follows:
t−1
!
t 1 X
µ
b (a) = t ra + 1{Ii = a}ri .
N (a) i=0

Recall that ra is the reward of arm a we got during the first K rounds.
We also main the upper confidence bound for each arm as follows:
s
ln(T K/δ)
bt (a) + 2
µ .
N t (a)

The following lemma shows that this is a valid upper confidence bound with high probability.

Lemma 5.2 (Upper Confidence Bound). For all t ∈ [0, . . . , T − 1] and a ∈ [1, 2, . . . K], we have that with probability
at least 1 − δ,
s
t ln(T K/δ)
b (a) − µa ≤ 2
µ . (0.1)
N t (a)

The proof of the above lemma uses Azuma-Hoeffding’s inequality (Theorem A.2) for each arm a and iteration t and
then apply a union bound over all T iterations and K arms.
Now we can conclude the proof of the main theorem.
Proof: Below we conditioned on the above Inequality 0.1 holds. This gives us the following optimism:
s
t ln(T K/δ)
µa ≤ µ
b (a) + 2 , ∀a, t.
N t (a)

Thus, we can upper bound the regret as follows:


s s
? t ln(T K/δ) ln(T K/δ)
µ − µIt ≤ µ
b (It ) + 2 − µIt ≤ 4 .
N t (It ) N t (It )

46
Sum over all iterations, we get:
T −1 T −1
s
X p X 1
µ? − µIt ≤ 4 ln(T K/δ)
t=0 t=0
N t (It )
N T (a) s
p X X 1 p Xq p X
= 4 ln(T K/δ) √ ≤ 8 ln(T K/δ) T
N (a) ≤ 8 ln(T K/δ) K N T (a)
a i=1
i a a
p √
≤ 8 ln(T K/δ) KT .

Note that our algorithm has regret K at the first K rounds.


On the other hand, if for each arm a, the gap ∆a > 0, then, we must have:

4 ln(T K/δ)
N T (a) ≤ .
∆2a

which is because after the UCB of an arm a is below µ? , UCB algorithm will never pull this arm a again (the UCB of
the µ? is no smaller than µ? ).
Thus for the regret calculation, we get:
T
X −1 X X 4 ln(T K/δ)
µ? − µIt ≤ NkT (a)∆a = .
∆a
t=0 a6=a? ?
a6=a

Together with the fact that Inequality 0.1 holds with probability at least 1 − δ, we conclude the proof.

5.2 Linear Bandits: Handling Large Action Spaces

Let D ⊂ Rd be a compact (but otherwise arbitrary) set of decisions. On each round, we must choose a decision
xt ∈ D. Each such choice results in a reward rt ∈ [−1, 1].
We assume that, regardless of the history H of decisions and observed rewards, the conditional expectation of rt is a
fixed linear function, i.e. for all x ∈ D,

E[rt |xt = x] = µ? · x ∈ [−1, 1],

where x ∈ D is arbitrary. Here, observe that we have assumed the mean reward for any decision is bounded in [−1, 1].
Under these assumptions, the noise sequence,

ηt = rt − µ? · xt

is a martingale difference sequence.


The is problem is essentially a bandit version of a fundamental geometric optimization problem, in which the agent’s
feedback on each round t is only the observed reward rt and where the agent does not know µ? apriori.
If x0 , . . . xT −1 are the decisions made in the game, then define the cumulative regret by
T
X −1
RT = µ? · x? − µ? · xt
t=0

47
Algorithm 2 The Linear UCB algorithm
Input: λ, βt
1: for t = 0, 1 . . . do
2: Execute
xt = argmaxx∈D max µ · x
µ∈BALLt

and observe the reward rt .


3: Update BALLt+1 (as specified in Equation 0.2).
4: end for

where x? ∈ D is an optimal decision for µ? , i.e.

x? ∈ argmaxx∈D µ? · x

which exists since D is compact. Observe that if the mean µ? were known, then the optimal strategy would be to play
x? every round. Since the expected loss for each decision x equals µ? · x, the cumulative regret is just the difference
PT −1 loss for the actual decisions xt . By the Hoeffding-
between the expected loss of the optimal algorithm and the expected
Azuma inequality (see Lemma A.2), the observed reward t=0 rt will be close to their (conditional) expectations
PT −1 ?
t=0 µ · xt .

Since the sequence of decisions x1 , . . . , xT −1 may depend on the particular sequence of random noise encountered,
RT is a random variable. Our goal in designing an algorithm is to keep RT as small as possible.

5.2.1 The LinUCB algorithm

LinUCB is based on “optimism in the face of uncertainty,” which is described in Algorithm 2. At episode t, we use all
previous experience to define an uncertainty region (an ellipse) BALLt . The center of this region, µ
bt , is the solution of
the following regularized least squares problem:
t−1
X
µ
bt = arg min kµ · xτ − rτ k22 + λkµk22
µ
τ =0
t−1
X
= Σ−1
t rτ xτ ,
τ =0

where λ is a parameter and where


t−1
X
Σt = λI + xτ x>
τ , with Σ0 = λI.
τ =0

The shape of the region BALLt is defined through the feature covariance Σt .
Precisely, the uncertainty region, or confidence ball, is defined as:

µt − µ? )> Σt (b µt − µ? ) ≤ βt ,

BALLt = (b (0.2)

where βt is a parameter of the algorithm.

Computation. Suppose that we have an efficient linear optimization oracle, i.e. that we can efficiently solve the
problem:
max ν · x
x∈D

48
for any ν. Even with this, Step 2 of LinUCB may not be computationally tractable. For example, suppose that D
is provided to us as a polytope, then the above oracle can be efficiently computed using linear programming, while
LinUCB is an NP-hard optimization. Here, we can actually use a wider confidence region, where we can keep track
of `1 ball which contains BALLt . See Section 5.4 for further reading.

5.2.2 Upper and Lower Bounds

Our main result here is that we have sublinear regret with only a polynomial dependence on the dimension d and,
importantly, no dependence on the cardinality of the decision space D, i.e. on |D|.
Theorem 5.3. Suppose that the noise ηt is σ 2 sub-Gaussian 1 , that kµ? k ≤ W , and that kxk ≤ B for all x ∈ D. Set
λ = σ 2 /W 2 and
T B2W 2
   
βt := σ 2 2 + 4d log 1 + + 8 log(4/δ) .
d
We have that with probability greater than 1 − δ, that (simultaneously) for all T ≥ 0,
√ T B2W 2
   
RT ≤ cσ T d log 1 + + log(4/δ)
dσ 2

where c is an absolute constant. In other words, we have that RT is O? (d T ) with high probability.

The following shows that no algorithm can do better.


Theorem 5.4. (Lower bound) There exists a distribution over linear bandit problems (i.e. a distribution over µ) with
rewards the rewards being bounded by 1 in magnitude and σ 2 ≤ 1, such that for every (randomized) algorithm, we
have for n ≥ max{256, d2 /16},
1 √
Eµ ERT ≥ d T.
2500
where the inner expectation is with respect to randomness in the problem and the algorithm.

5.3 LinUCB Analysis

In establishing the upper bounds there are two main propositions from which the upper bounds follow. The first is in
showing that the confidence region is appropriate.
Proposition 5.5. (Confidence) Let δ > 0. We have that
Pr(∀t, µ? ∈ BALLt ) ≥ 1 − δ.

Section 5.3.2 is devoted to establishing this confidence bound. In essence, the proof seeks to understand the growth of
µt − µ? )> Σt (b
the quantity (b µt − µ? ).
The second main step in analyzing LinUCB is to show that as long as the aforementioned high-probability event holds,
we have some control on the growth of the regret. Let us define
regrett = µ? · x∗ − µ? · xt
which denotes the instantaneous regret.
The following bounds the sum of the squares of instantaneous regret.
1 Roughly speaking, this say that tail probabilities of ηt decay no more slowly than a Gaussian distribution. If the noise is bounded, i.e. |ηt | ≤ B

49
Proposition 5.6. (Sum of Squares Regret Bound) Suppose that kxk ≤ B for x ∈ D. Suppose βt is increasing and
larger than 1. For LinUCB, if µ? ∈ BALLt for all t, then
T −1
T B2
X  
regret2t ≤ 4βT d log 1 +
t=0

This is proven in Section 5.3.1. The idea of the proof involves a potential function argument on the log volume (i.e. the
log determinant) of the “precision matrix” Σt (which tracks how accurate our estimates of µ? are in each direction).
The proof involves relating the growth of this volume to the regret.
Using these two results we are able to prove our upper bound as follows:
Proof:[Proof of Theorem 5.3] By Propositions 5.5 and 5.6 along with the Cauchy-Schwarz inequality, we have, with
probability at least 1 − δ,
v s
T −1 u T −1
T B2
X u X  
2
RT = regrett ≤ T
t regrett ≤ 4T βT d log 1 + .
t=0 t=0

The remainder of the proof follows from using our chosen value of βT and algebraic manipulations (that 2ab ≤
a2 + b2 ).
We now provide the proofs of these two propositions.

5.3.1 Regret Analysis

In this section, we prove Proposition 5.6, which says that the sum of the squares of the instantaneous regrets of the
algorithm is small, assuming the evolving confidence balls always contain the true mean µ? . An important observation
is that on any round t in which µ? ∈ BALLt , the instantaneous regret is at most the “width” of the ellipsoid in the
direction of the chosen decision. Moreover, the algorithm’s choice of decisions forces the ellipsoids to shrink at a rate
that ensures that the sum of the squares of the widths is small. We now formalize this.
Unless explicitly stated, all norms refer to the `2 norm.

Lemma 5.7. Let x ∈ D. If µ ∈ BALLt and x ∈ D. Then


q
bt )> x| ≤
|(µ − µ βt x> Σ−1
t x

Proof: By Cauchy-Schwarz, we have:


1/2 −1/2 1/2 −1/2
bt )> x| = |(µ − µ
|(µ − µ bt )> Σt Σt x| = |(Σt (µ − µbt ))> Σt x|
q q
1/2 −1/2 1/2
≤ kΣt (µ − µ
bt )kkΣt xk = kΣt (µ − µ bt )k x> Σ−1
t x≤ βt x> Σ−1
t x

where the last inequality holds since µ ∈ BALLt .


Define q
−1
wt := x>
t Σt x t

which we√ interpret as the “normalized width” at time t in the direction of the chosen decision. We now see that the
width, 2 βt wt , is an upper bound for the instantaneous regret.

50
Lemma 5.8. Fix t ≤ T . If µ? ∈ BALLt , then
p p
regrett ≤ 2 min ( βt wt , 1) ≤ 2 βT min (wt , 1)

e> xt . By choice of xt , we have


e ∈ BALLt denote the vector which minimizes the dot product µ
Proof: Let µ
e> xt = max max µ> x ≥ (µ? )> x∗ ,
µ
µ∈BALLt x∈D

where the inequality used the hypothesis µ? ∈ BALLt . Hence,


regrett = (µ? )> x∗ − (µ? )> xt ≤ (e
µ − µ? )> xt
p
= (eµ−µ bt )> xt + (b
µt − µ? )> xt ≤ 2 βt wt
where the last step follows from Lemma 5.7 since µ e and µ? are in BALLt . Since rt ∈ [−1, 1], regrett is always at
most 2 and the first inequality follows. The final inequality is due to that βt is increasing and larger than 1.
The following two lemmas prove useful in showing that we can treat the log determinant as a potential function, where
can bound the sum of widths independently of the choices made by the algorithm.
Lemma 5.9. We have:
−1
TY
det ΣT = det Σ0 (1 + wt2 ).
t=0

Proof: By the definition of Σt+1 , we have


1/2 −1/2 −1/2 1/2
det Σt+1 = det(Σt + xt x>
t ) = det(Σt (I + Σt xt x>
t Σt )Σt )
−1/2 −1/2
= det(Σt ) det(I + Σt xt (Σt xt )> ) = det(Σt ) det(I + vt vt> ),
−1/2
where vt := Σt xt . Now observe that vt> vt = wt2 and
(I + vt vt> )vt = vt + vt (vt> vt ) = (1 + wt2 )vt
Hence (1 + wt2 ) is an eigenvalue of I + vt vt> . Since vt vt> is a rank one matrix, all other eigenvalues of I + vt vt> equal
1. Hence, det(I + vt vt> ) is (1 + wt2 ), implying det Σt+1 = (1 + wt2 ) det Σt . The result follows by induction.
Lemma 5.10. (“Potential Function” Bound) For any sequence x0 , . . . xT −1 such that, for t < T , kxt k2 ≤ B, we
have:
T −1
!
T B2
 
  1 X >
log det ΣT −1 / det Σ0 = log det I + xt xt ≤ d log 1 + .
λ t=0 dλ

PT −1
Proof: Denote the eigenvalues of xt x>
t=0 t as σ1 , . . . σd , and note:

−1
d T
! T −1
X X X
σi = Trace xt x>t = kxt k2 ≤ T B 2 .
i=1 t=0 t=0

Using the AM-GM inequality,


T −1
! d
!
1 X Y
log det I + xt x>
t = log (1 + σi /λ)
λ t=0 i=1
d
!1/d d
!
T B2
 
Y 1X
= d log (1 + σi /λ) ≤ d log (1 + σi /λ) ≤ d log 1 + ,
i=1
d i=1 dλ

51
which concludes the proof.
Finally, we are ready to prove that if µ? always stays within the evolving confidence region, then our regret is under
control.
Proof:[Proof of Proposition 5.6] Assume that µ? ∈ BALLt for all t. We have that:
T
X −1 T
X −1 T
X −1
regret2t ≤ 4βt min(wt2 , 1) ≤ 4βT min(wt2 , 1)
t=0 t=0 t=0
T −1
T B2
X    
≤ 4βT ln(1 + wt2 ) ≤ 4βT log det ΣT −1 / det Σ0 = 4βT d log 1 +
t=0

where the first inequality follow from By Lemma 5.8; the second from that βt is an increasing function of t; the third
uses that for 0 ≤ y ≤ 1, ln(1 + y) ≥ y/2; the final two inequalities follow by Lemmas 5.9 and 5.10.

5.3.2 Confidence Analysis

Proof:[Proof of Proposition 5.5] Since rτ = xτ · µ? + ητ , we have:


t−1
X t−1
X
bt − µ? = Σ−1
µ t rτ xτ − µ? = Σ−1
t xτ (xτ · µ? + ητ ) − µ?
τ =0 τ =0
t−1
! t−1 t−1
X X X
= Σ−1
t xτ (xτ )> µ? − µ? + Σ−1
t ητ xτ = λΣ−1 ? −1
t µ + Σt η τ xτ
τ =0 τ =0 τ =0

For any 0 < δt < 1, using Lemma A.5, it holds with probability at least 1 − δt ,
q
µt − µ? ) = k(Σt )1/2 (b
µt − µ? )> Σt (b
(b µt − µ? )k
t−1
−1/2 ? −1/2
X
≤ λΣt µ + Σt ητ xτ
τ =0
√ p
≤ λkµ? k + 2σ 2 log (det(Σt ) det(Σ0 )−1 /δt ).

where we have also used the triangle inequality and that kΣ−1
t k ≤ 1/λ.

We seek to lower bound $\Pr(\forall t,\ \mu^\star\in\mathrm{BALL}_t)$. Note that at $t=0$, by our choice of $\lambda$, we have that $\mathrm{BALL}_0$ contains $W^\star$, so $\Pr(\mu^\star\notin\mathrm{BALL}_0)=0$. For $t\ge 1$, let us assign failure probability $\delta_t = (3/\pi^2)/t^2$ to the $t$-th event, which, using the above, gives an upper bound on the total failure probability of
$$1 - \Pr(\forall t,\ \mu^\star\in\mathrm{BALL}_t) = \Pr(\exists t,\ \mu^\star\notin\mathrm{BALL}_t) \le \sum_{t=1}^{\infty}\Pr(\mu^\star\notin\mathrm{BALL}_t) < \sum_{t=1}^{\infty}(1/t^2)(3/\pi^2) = 1/2.$$
This along with Lemma 5.10 completes the proof.
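To make the pieces of the analysis concrete, here is a minimal numpy sketch of the LinUCB loop, assuming the standard construction used in this chapter: $\Sigma_t = \lambda I + \sum_{\tau<t} x_\tau x_\tau^\top$, the ridge estimate $\widehat\mu_t$, and the optimistic choice $x_t = \arg\max_x \max_{\mu\in\mathrm{BALL}_t}\mu^\top x$. The finite decision set, the reward callback, and the particular `beta_fn` are illustrative placeholders, not the book's exact parameters.

```python
import numpy as np

def linucb(decisions, reward_fn, T, lam=1.0, beta_fn=lambda t: 4.0):
    """Sketch of LinUCB. decisions: (n, d) array of feasible x's;
    reward_fn(x): returns a noisy reward; beta_fn(t): confidence radius beta_t."""
    n, d = decisions.shape
    Sigma = lam * np.eye(d)                 # Sigma_0 = lambda * I
    b = np.zeros(d)                         # sum_tau r_tau x_tau
    rewards = []
    for t in range(T):
        Sigma_inv = np.linalg.inv(Sigma)
        mu_hat = Sigma_inv @ b              # ridge estimate mu_hat_t
        # optimistic value: mu_hat . x + sqrt(beta_t) * ||x||_{Sigma_t^{-1}}
        widths = np.sqrt(np.einsum('nd,de,ne->n', decisions, Sigma_inv, decisions))
        ucb = decisions @ mu_hat + np.sqrt(beta_fn(t)) * widths
        x = decisions[np.argmax(ucb)]       # chosen decision x_t
        r = reward_fn(x)
        rewards.append(r)
        Sigma += np.outer(x, x)             # Sigma_{t+1}
        b += r * x
    return np.array(rewards)
```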

5.4 Bibliographic Remarks and Further Readings

The original multi-armed bandit model goes back to [Robbins, 1952]. The linear bandit model was first introduced
in [Abe and Long, 1999]. Our analysis of the LinUCB algorithm follows from the original proof in [Dani et al., 2008],

with a simplified concentration analysis due to [Abbasi-Yadkori et al., 2011]. The first sub-linear regret bound here
was due to [Auer et al., 2002], which used a more complicated algorithm.
The lower bound we present is also due to [Dani et al., 2008], which also shows that LinUCB is minimax optimal.


Chapter 6

Strategic Exploration in Tabular MDPs

We now turn to how an agent acting in an unknown MDP can obtain a near-optimal reward over time. Compared with
the previous setting with access to a generative model, we no longer have easy access to transitions at each state, but
only have the ability to execute trajectories in the MDP. The main complexity this adds to the learning process is that
the agent has to engage in exploration, that is, plan to reach new states where enough samples have not been seen yet,
so that optimal behavior in those states can be learned.
Learning occurs in an episodic setting, where in every episode $k$ the learner acts for $H$ steps starting from a fixed starting state $s_0$ and, at the end of the $H$-length episode, the state is reset. It is straightforward to extend this setting to the case where the starting state is sampled from a distribution, i.e., $s_0 \sim \mu$. In particular, we will follow the episodic MDP model in Section 1.2.
The goal of the agent is to minimize her expected cumulative regret over $K$ episodes:
$$\mathrm{Regret} := \mathbb{E}\Big[K V^\star(s_0) - \sum_{k=0}^{K-1}\sum_{h=0}^{H-1} r(s_h^k, a_h^k)\Big],$$

where the expectation is with respect to the randomness of the MDP environment and, possibly, any randomness of
the agent’s strategy.
In this chapter, we consider tabular MDPs where S and A are discrete. We denote S = |S| and A = |A|.
We will now present a sub-linear regret algorithm, UCB-Value Iteration. This chapter follows the proof in [Azar et al.,
2017], with a number of simplifications, albeit with a worse sample complexity.
We denote V π as the expected total reward of π, i.e., V π := Es0 ∼µ V π (s0 ).

6.1 The UCB-VI algorithm

The learner executes $\pi^k$ in the underlying MDP to generate a single trajectory $\tau^k = \{s_h^k, a_h^k\}_{h=0}^{H-1}$ with $a_h^k = \pi_h^k(s_h^k)$ and $s_{h+1}^k \sim P_h(\cdot|s_h^k, a_h^k)$. We first define some notation. Consider the very beginning of episode $k$. We use the history information up to the end of episode $k-1$ (denoted $\mathcal{H}_{<k}$) to form some statistics. Specifically,
Algorithm 3 UCBVI
Input: reward function r (assumed to be known), confidence parameters
1: for k = 0 . . . K do
2: Compute Pbhk as the empirical estimates, for all h (Eq. 0.1)
3: Compute reward bonus bkh for all h (Eq. 0.2)
4: Run Value-Iteration on {Pbhk , r + bkh }H−1
h=0 (Eq. 0.3)
5: Set π k as the returned policy of VI.
6: end for

we define:
$$N_h^k(s,a,s') = \sum_{i=0}^{k-1}\mathbf{1}\{(s_h^i, a_h^i, s_{h+1}^i) = (s,a,s')\}, \qquad N_h^k(s,a) = \sum_{i=0}^{k-1}\mathbf{1}\{(s_h^i, a_h^i) = (s,a)\}, \quad \forall h, s, a.$$

Namely, we maintain counts of how many times $(s,a,s')$ and $(s,a)$ are visited at time step $h$ from the beginning of the learning process to the end of episode $k-1$. We use these statistics to form an empirical model:
$$\widehat{P}_h^k(s'|s,a) = \frac{N_h^k(s,a,s')}{N_h^k(s,a)}, \quad \forall h, s, a, s'. \tag{0.1}$$

We will also use the counts to define a reward bonus, denoted $b_h^k(s,a)$ for all $h,s,a$. Denote $L := \ln(SAHK/\delta)$ ($\delta$, as usual, represents the failure probability, which we will define later). We define the reward bonus as follows:
$$b_h^k(s,a) = H\sqrt{\frac{L}{N_h^k(s,a)}}. \tag{0.2}$$

With the reward bonus and the empirical model, the learner runs Value Iteration on the empirical transition $\widehat{P}_h^k$ and the combined reward $r_h + b_h^k$. Starting at $H$ (note that $H$ is a fictitious extra step, as an episode terminates at $H-1$), we perform dynamic programming all the way down to $h=0$:
$$\widehat{V}_H^k(s) = 0, \ \forall s,$$
$$\widehat{Q}_h^k(s,a) = \min\Big\{r_h(s,a) + b_h^k(s,a) + \widehat{P}_h^k(\cdot|s,a)\cdot\widehat{V}_{h+1}^k,\ H\Big\},$$
$$\widehat{V}_h^k(s) = \max_a \widehat{Q}_h^k(s,a), \qquad \pi_h^k(s) = \operatorname{argmax}_a \widehat{Q}_h^k(s,a), \quad \forall h, s, a. \tag{0.3}$$
Note that when using $\widehat{V}_{h+1}^k$ to compute $\widehat{Q}_h^k$, we truncate the value at $H$. This is because we know that, due to the assumption $r(s,a)\in[0,1]$, no policy's $Q$ value can ever be larger than $H$.
Denote $\pi^k = \{\pi_0^k,\ldots,\pi_{H-1}^k\}$. The learner then executes $\pi^k$ in the MDP to get a new trajectory $\tau^k$.
UCBVI repeats the above procedure for $K$ episodes.
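To make the procedure concrete, below is a minimal numpy sketch of the planning step of one UCBVI episode (Eqs. 0.1–0.3), assuming tabular arrays for the counts and a known reward table; the function name, interface, and the `max(N,1)` convention for unvisited pairs are illustrative choices, not part of the formal algorithm.

```python
import numpy as np

def ucbvi_episode(counts_sas, counts_sa, r, H, S, A, K, delta):
    """One planning round of UCBVI: build P_hat, bonus, run truncated value iteration.
    counts_sas: (H, S, A, S) visit counts N_h^k(s,a,s'); counts_sa: (H, S, A); r: (H, S, A)."""
    L = np.log(S * A * H * K / delta)
    V = np.zeros((H + 1, S))                      # V_H = 0
    pi = np.zeros((H, S), dtype=int)
    for h in reversed(range(H)):
        N = np.maximum(counts_sa[h], 1)           # avoid division by zero for unvisited (s,a)
        P_hat = counts_sas[h] / N[:, :, None]     # empirical model, Eq. 0.1
        bonus = H * np.sqrt(L / N)                # reward bonus, Eq. 0.2
        Q = r[h] + bonus + P_hat @ V[h + 1]       # Bellman backup under P_hat
        Q = np.minimum(Q, H)                      # truncate at H, Eq. 0.3
        V[h] = Q.max(axis=1)
        pi[h] = Q.argmax(axis=1)
    return pi, V
```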

6.2 Analysis

We will prove the following theorem.

Theorem 6.1 (Regret Bound of UCBVI). UCBVI achieves the following regret bound:
$$\mathrm{Regret} := \mathbb{E}\Big[\sum_{k=0}^{K-1}\big(V^\star - V^{\pi^k}\big)\Big] \le 2H^2 S\sqrt{AK\cdot\ln(SAH^2K^2)} = \widetilde{O}\big(H^2 S\sqrt{AK}\big)$$

Remark. While the above regret bound is sub-optimal, the algorithm presented here in fact achieves a sharper bound of $\widetilde{O}(H^2\sqrt{SAK})$ in the leading term [Azar et al., 2017], which gives the tight dependence on $S$, $A$, and $K$. The dependence on $H$ is not tight, and tightening it requires modifying the reward bonus (using Bernstein's inequality rather than Hoeffding's inequality in the bonus design).
We prove the above theorem in this section.
We start with bounding the error from the learned model Pbhk .
Lemma 6.2 (State-action wise $\ell_1$ model error). Fix $\delta\in(0,1)$. With probability at least $1-\delta$, for all $k\in\{0,\ldots,K-1\}$, $s\in S$, $a\in A$, $h\in\{0,\ldots,H-1\}$, we have:
$$\big\|\widehat{P}_h^k(\cdot|s,a) - P_h^\star(\cdot|s,a)\big\|_1 \le \sqrt{\frac{S\ln(SAHK/\delta)}{N_h^k(s,a)}}.$$
The proof of the above lemma uses Proposition A.4 and a union bound over all $s, a, k, h$.
The following lemma is still about model error, but this time we consider an average model error.
Lemma 6.3 (State-action wise average model error). Fix $\delta\in(0,1)$. With probability at least $1-\delta$, for all $k\in\{1,\ldots,K-1\}$, $s\in S$, $a\in A$, $h\in\{0,\ldots,H-1\}$, and for $V_{h+1}^\star: S\to[0,H]$, we have:
$$\Big|\big(\widehat{P}_h^k(\cdot|s,a) - P_h^\star(\cdot|s,a)\big)\cdot V_{h+1}^\star\Big| \le H\sqrt{\frac{\ln(SAHK/\delta)}{N_h^k(s,a)}}.$$

Proof: We provide a proof sketch. Consider a fixed $(s,a,k,h)$. We have:
$$\widehat{P}_h^k(\cdot|s,a)\cdot V_{h+1}^\star = \frac{1}{N_h^k(s,a)}\sum_{i=0}^{k-1}\mathbf{1}\{(s_h^i,a_h^i)=(s,a)\}\,V_{h+1}^\star(s_{h+1}^i).$$
Note that for any $i$ with $(s_h^i,a_h^i) = (s,a)$, we have $\mathbb{E}\big[V_{h+1}^\star(s_{h+1}^i)\,\big|\,s_h^i,a_h^i\big] = P_h(\cdot|s,a)\cdot V_{h+1}^\star$. Thus, we can apply Hoeffding's inequality to bound $\widehat{P}_h^k(\cdot|s,a)\cdot V_{h+1}^\star - P_h(\cdot|s,a)\cdot V_{h+1}^\star$. With a union bound over all $s,a,k,h$, we conclude the proof.
We denote by $E_{model}$ the event that the two inequalities in Lemma 6.2 and Lemma 6.3 hold. Note that the failure probability of $E_{model}$ is at most $2\delta$. Below we condition on $E_{model}$ being true (we deal with the failure event at the very end).
Now we study the effect of the reward bonus. Similar to the idea in multi-armed bandits, we want to pick a policy $\pi^k$ such that the value of $\pi^k$ under the combined reward $r_h + b_h^k$ and the empirical model $\widehat{P}_h^k$ is optimistic, i.e., we want $\widehat{V}_0^k(s_0) \ge V_0^\star(s_0)$ for all $s_0$. The following lemma shows that, via the reward bonus, we are able to achieve this optimism.
Lemma 6.4 (Optimism). Assume $E_{model}$ is true. For all episodes $k$, we have:
$$\widehat{V}_0^k(s_0) \ge V_0^\star(s_0), \quad \forall s_0\in S,$$
where $\widehat{V}_h^k$ is computed by the VI in Eq. 0.3.

Proof: We prove this via induction. At the additional time step $H$ we have $\widehat{V}_H^k(s) = V_H^\star(s) = 0$ for all $s$.
Assuming that $\widehat{V}_{h+1}^k(s) \ge V_{h+1}^\star(s)$ for all $s$, we move to time step $h$.
Consider any $(s,a)\in S\times A$. First, if $\widehat{Q}_h^k(s,a) = H$ (i.e., the truncation is active), then $\widehat{Q}_h^k(s,a) \ge Q_h^\star(s,a)$ since $Q_h^\star(s,a)\le H$. Otherwise,
$$\widehat{Q}_h^k(s,a) - Q_h^\star(s,a) = b_h^k(s,a) + \widehat{P}_h^k(\cdot|s,a)\cdot\widehat{V}_{h+1}^k - P_h^\star(\cdot|s,a)\cdot V_{h+1}^\star \ge b_h^k(s,a) + \big(\widehat{P}_h^k(\cdot|s,a) - P_h^\star(\cdot|s,a)\big)\cdot V_{h+1}^\star \ge b_h^k(s,a) - H\sqrt{\frac{\ln(SAHK/\delta)}{N_h^k(s,a)}} \ge 0,$$
where the first inequality uses the inductive hypothesis, the second uses Lemma 6.3, and the last follows from the definition of the bonus $b_h^k$.
From $\widehat{Q}_h^k(s,a) \ge Q_h^\star(s,a)$ for all $s,a$, one finishes the proof by noting that $\widehat{V}_h^k(s) = \max_a\widehat{Q}_h^k(s,a) \ge \max_a Q_h^\star(s,a) = V_h^\star(s)$ for all $s$.

Now we are ready to prove the main theorem.


Proof:[Proof of Theorem 6.1]
Let us consider episode $k$ and denote by $\mathcal{H}_{<k}$ the history up to the end of episode $k-1$. We consider bounding $V^\star - V^{\pi^k}$. Using optimism and the simulation lemma, we get the following:
$$V^\star - V^{\pi^k} \le \widehat{V}_0^k(s_0) - V_0^{\pi^k}(s_0) \le \sum_{h=0}^{H-1}\mathbb{E}_{s_h,a_h\sim d_h^{\pi^k}}\Big[b_h^k(s_h,a_h) + \big(\widehat{P}_h^k(\cdot|s_h,a_h) - P^\star(\cdot|s_h,a_h)\big)\cdot\widehat{V}_{h+1}^k\Big]. \tag{0.4}$$
We leave the proof of the above inequality (Eq. 0.4) as an exercise for the reader. Note that it is slightly different from the usual simulation lemma, as here we truncate $\widehat{V}$ at $H$ during VI.
 
Under $E_{model}$, we can bound $\big(\widehat{P}_h^k(\cdot|s_h,a_h) - P^\star(\cdot|s_h,a_h)\big)\cdot\widehat{V}_{h+1}^k$ (recall Lemma 6.2) with Hölder's inequality:
$$\Big|\big(\widehat{P}_h^k(\cdot|s_h,a_h) - P^\star(\cdot|s_h,a_h)\big)\cdot\widehat{V}_{h+1}^k\Big| \le \big\|\widehat{P}_h^k(\cdot|s_h,a_h) - P^\star(\cdot|s_h,a_h)\big\|_1\,\big\|\widehat{V}_{h+1}^k\big\|_\infty \le H\sqrt{\frac{S\ln(SAKH/\delta)}{N_h^k(s_h,a_h)}}.$$
Hence, returning to the per-episode regret $V^\star - V^{\pi^k}$, we get:
$$V^\star - V^{\pi^k} \le \sum_{h=0}^{H-1}\mathbb{E}_{s_h,a_h\sim d_h^{\pi^k}}\Big[b_h^k(s_h,a_h) + H\sqrt{S\ln(SAHK/\delta)/N_h^k(s_h,a_h)}\Big] \le \sum_{h=0}^{H-1}\mathbb{E}_{s_h,a_h\sim d_h^{\pi^k}}\Big[2H\sqrt{S\ln(SAHK/\delta)/N_h^k(s_h,a_h)}\Big] = 2H\sqrt{S\ln(SAHK/\delta)}\,\mathbb{E}\Bigg[\sum_{h=0}^{H-1}\frac{1}{\sqrt{N_h^k(s_h^k,a_h^k)}}\;\Bigg|\;\mathcal{H}_{<k}\Bigg],$$
where in the last term the expectation is taken with respect to the trajectory $\{s_h^k, a_h^k\}$ (generated by $\pi^k$), conditioned on the history $\mathcal{H}_{<k}$ up to and including the end of episode $k-1$.

Now we sum over all episodes and take the failure event into consideration:
$$\mathbb{E}\Big[\sum_{k=0}^{K-1}\big(V^\star - V^{\pi^k}\big)\Big] = \mathbb{E}\Big[\mathbf{1}\{E_{model}\}\Big(\sum_{k=0}^{K-1}V^\star - V^{\pi^k}\Big)\Big] + \mathbb{E}\Big[\mathbf{1}\{\overline{E_{model}}\}\Big(\sum_{k=0}^{K-1}V^\star - V^{\pi^k}\Big)\Big] \le \mathbb{E}\Big[\mathbf{1}\{E_{model}\}\Big(\sum_{k=0}^{K-1}V^\star - V^{\pi^k}\Big)\Big] + 2\delta KH \le 2H\sqrt{S\ln(SAHK/\delta)}\,\mathbb{E}\Bigg[\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\frac{1}{\sqrt{N_h^k(s_h^k,a_h^k)}}\Bigg] + 2\delta KH.$$

We can bound the double summation above using Lemma 6.5, concluding that:
$$\mathbb{E}\Big[\sum_{k=0}^{K-1}\big(V^\star - V^{\pi^k}\big)\Big] \le 4H^2 S\sqrt{AK\ln(SAHK/\delta)} + 2\delta KH.$$
Now setting $\delta = 1/(KH)$, we get:
$$\mathbb{E}\Big[\sum_{k=0}^{K-1}\big(V^\star - V^{\pi^k}\big)\Big] \le 4H^2 S\sqrt{AK\ln(SAH^2K^2)} + 2 = O\Big(H^2 S\sqrt{AK\ln(SAH^2K^2)}\Big).$$
This concludes the proof of Theorem 6.1.


Lemma 6.5. Consider an arbitrary sequence of $K$ trajectories $\tau^k = \{s_h^k, a_h^k\}_{h=0}^{H-1}$ for $k = 0,\ldots,K-1$. We have
$$\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\frac{1}{\sqrt{N_h^k(s_h^k,a_h^k)}} \le 2H\sqrt{SAK}.$$

Proof: We swap the order of the two summations:
$$\sum_{k=0}^{K-1}\sum_{h=0}^{H-1}\frac{1}{\sqrt{N_h^k(s_h^k,a_h^k)}} = \sum_{h=0}^{H-1}\sum_{k=0}^{K-1}\frac{1}{\sqrt{N_h^k(s_h^k,a_h^k)}} = \sum_{h=0}^{H-1}\sum_{s,a\in S\times A}\sum_{i=1}^{N_h^K(s,a)}\frac{1}{\sqrt{i}} \le 2\sum_{h=0}^{H-1}\sum_{s,a\in S\times A}\sqrt{N_h^K(s,a)} \le 2\sum_{h=0}^{H-1}\sqrt{SA\sum_{s,a}N_h^K(s,a)} = 2H\sqrt{SAK},$$
where in the first inequality we use the fact that $\sum_{i=1}^{N}1/\sqrt{i}\le 2\sqrt{N}$, and in the second inequality we use the Cauchy–Schwarz inequality.
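As a purely numerical illustration of this pigeonhole bound (not part of the proof), one can simulate arbitrary trajectories on a small tabular problem, maintain the counts $N_h^k$, and check the inequality; the sizes below and the `max(N,1)` convention for the first visit to a pair are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, H, K = 5, 3, 4, 200
N = np.zeros((H, S, A), dtype=int)   # N_h^k(s,a): counts from episodes before k
total = 0.0
for k in range(K):
    traj = [(rng.integers(S), rng.integers(A)) for _ in range(H)]  # arbitrary trajectory
    for h, (s, a) in enumerate(traj):
        total += 1.0 / np.sqrt(max(N[h, s, a], 1))
    for h, (s, a) in enumerate(traj):
        N[h, s, a] += 1                   # update counts after the episode
print(total <= 2 * H * np.sqrt(S * A * K))   # True
```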

6.3 Bibliographic Remarks and Further Readings

The first provably correct PAC algorithm for reinforcement learning (one that finds a near-optimal policy) was due to Kearns and Singh [2002], which provided the E3 algorithm; it achieves polynomial sample complexity in tabular MDPs. Brafman and Tennenholtz [2002] present the Rmax algorithm, which provides a refined PAC analysis over E3.
Both are model-based approaches. [Kakade, 2003] improves the sample complexity to $O(S^2A)$. Both E3 and Rmax use the concept of absorbing MDPs to achieve optimism and balance exploration and exploitation.
Jaksch et al. [2010] provide the first $O(\sqrt{T})$ regret bound, where $T$ is the number of timesteps in the MDP ($T$ is proportional to $K$ in our setting); this dependence on $T$ is optimal. Subsequently, Azar et al. [2017], Dann et al. [2017] provide algorithms that, asymptotically, achieve the minimax regret bound in tabular MDPs. By this, we mean that for sufficiently large $T$ (for $T\ge\Omega(|S|^2)$), the results in Azar et al. [2017], Dann et al. [2017] obtain optimal dependencies on $|S|$ and $|A|$. The requirement that $T\ge\Omega(|S|^2)$ before these bounds hold is essentially a requirement that nontrivial model accuracy be attainable. It is an open question to remove this dependence.
Lower bounds are provided in [Dann and Brunskill, 2015, Osband and Van Roy, 2016, Azar et al., 2017].

Further exploration strategies. References and discussion for Q-learning, reward-free exploration, and Thompson sampling to be added...

Chapter 7

Linearly Parameterized MDPs

In this chapter, we consider learning and exploration in linearly parameterized MDPs—linear MDPs. Linear MDPs generalize tabular MDPs to MDPs with potentially infinitely many state and action pairs.
This chapter largely follows the model and analysis first provided in [Jin et al., 2020].

7.1 Setting

We consider an episodic finite-horizon MDP with horizon $H$, $M = \{S, A, \{r_h\}_h, \{P_h\}_h, H, s_0\}$, where $s_0$ is a fixed initial state, and $r_h: S\times A\mapsto[0,1]$ and $P_h: S\times A\mapsto\Delta(S)$ are time-dependent reward functions and transition kernels. Note that for a time-dependent finite-horizon MDP, the optimal policy will be time-dependent as well. For simplicity, we overload notation a bit and denote $\pi = \{\pi_0,\ldots,\pi_{H-1}\}$, where each $\pi_h: S\mapsto A$. We also denote $V^\pi := V_0^\pi(s_0)$, i.e., the expected total reward of $\pi$ starting at $h=0$ and $s_0$.
We define the learning protocol below. Learning happens in an episodic setting. In every episode $n$, the learner first proposes a policy $\pi^n$ based on all the history information up to the end of episode $n-1$. The learner then executes $\pi^n$ in the underlying MDP to generate a single trajectory $\tau^n = \{s_h^n, a_h^n\}_{h=0}^{H-1}$ with $a_h^n = \pi_h^n(s_h^n)$ and $s_{h+1}^n\sim P_h(\cdot|s_h^n, a_h^n)$. The goal of the learner is to minimize the following cumulative regret over $N$ episodes:
$$\mathrm{Regret} := \mathbb{E}\Big[\sum_{n=0}^{N-1}\big(V^\star - V^{\pi^n}\big)\Big],$$

where the expectation is with respect to the randomness of the MDP environment and potentially the randomness of
the learner (i.e., the learner might make decisions in a randomized fashion).

7.1.1 Low-Rank MDPs and Linear MDPs

Note that here we no longer assume $S$ and $A$ are finite. Indeed, in this chapter both of them could be continuous. Without any further structural assumption, the lower bounds we saw in the Generalization chapter preclude a polynomial regret bound.
The structural assumption we make in this chapter is a linear structure in both the reward and the transition.

Definition 7.1 (Linear MDPs). Consider transitions $\{P_h\}_h$ and rewards $\{r_h\}_h$. A linear MDP has the following structure on $r_h$ and $P_h$:
$$r_h(s,a) = \theta_h^\star\cdot\phi(s,a), \qquad P_h(\cdot|s,a) = \mu_h^\star\phi(s,a), \quad \forall h,$$
where $\phi$ is a known state-action feature map $\phi: S\times A\mapsto\mathbb{R}^d$ and $\mu_h^\star\in\mathbb{R}^{|S|\times d}$. Here $\phi$ and $\theta_h^\star$ are known to the learner, while $\mu^\star$ is unknown. We further assume the following norm bounds on the parameters: (1) $\sup_{s,a}\|\phi(s,a)\|_2\le 1$; (2) $\|v^\top\mu_h^\star\|_2\le\sqrt{d}$ for any $v$ with $\|v\|_\infty\le 1$ and all $h$; and (3) $\|\theta_h^\star\|_2\le W$ for all $h$. We assume $r_h(s,a)\in[0,1]$ for all $h$ and $s,a$.

The model essentially says that the transition matrix $P_h\in\mathbb{R}^{|S|\times|S||A|}$ has rank at most $d$, and $P_h = \mu_h^\star\Phi$, where $\Phi\in\mathbb{R}^{d\times|S||A|}$ and each column of $\Phi$ corresponds to $\phi(s,a)$ for a pair $(s,a)\in S\times A$.

Linear Algebra Notation. For a real-valued matrix $A$, we denote by $\|A\|_2 = \sup_{x:\|x\|_2=1}\|Ax\|_2$ its maximum singular value, and by $\|A\|_F$ its Frobenius norm, $\|A\|_F^2 = \sum_{i,j}A_{i,j}^2$, where $A_{i,j}$ denotes the $(i,j)$-th entry of $A$. For any positive definite matrix $\Lambda$, we write $x^\top\Lambda x = \|x\|_\Lambda^2$. We denote by $\det(A)$ the determinant of the matrix $A$; for a PD matrix $\Lambda$, note that $\det(\Lambda) = \prod_{i=1}^{d}\sigma_i$, where the $\sigma_i$ are the eigenvalues of $\Lambda$. For notational simplicity, during inequality derivations we use $\lesssim$ and $\eqsim$ to suppress absolute constants, and $\widetilde{O}$ to suppress absolute constants and logarithmic terms.

7.2 Planning in Linear MDPs

We first study how to perform value iteration in a linear MDP when $\mu^\star$ is given.
We start from $Q_{H-1}^\star(s,a) = \theta_{H-1}^\star\cdot\phi(s,a)$, with $\pi_{H-1}^\star(s) = \operatorname{argmax}_a Q_{H-1}^\star(s,a) = \operatorname{argmax}_a\theta_{H-1}^\star\cdot\phi(s,a)$ and $V_{H-1}^\star(s) = \max_a Q_{H-1}^\star(s,a)$.
Now we perform dynamic programming from $h+1$ to $h$:
$$Q_h^\star(s,a) = \theta_h^\star\cdot\phi(s,a) + \mathbb{E}_{s'\sim P_h(\cdot|s,a)}V_{h+1}^\star(s') = \theta_h^\star\cdot\phi(s,a) + P_h(\cdot|s,a)\cdot V_{h+1}^\star = \theta_h^\star\cdot\phi(s,a) + \big(\mu_h^\star\phi(s,a)\big)^\top V_{h+1}^\star \tag{0.1}$$
$$= \phi(s,a)\cdot\big(\theta_h^\star + (\mu_h^\star)^\top V_{h+1}^\star\big) = \phi(s,a)\cdot w_h, \tag{0.2}$$
where we denote $w_h := \theta_h^\star + (\mu_h^\star)^\top V_{h+1}^\star$. Namely, we see that $Q_h^\star(s,a)$ is a linear function of $\phi(s,a)$!
We continue by defining $\pi_h^\star(s) = \operatorname{argmax}_a Q_h^\star(s,a)$ and $V_h^\star(s) = \max_a Q_h^\star(s,a)$.
At the end, we obtain a sequence of linear $Q^\star$ functions, i.e., $Q_h^\star(s,a) = w_h\cdot\phi(s,a)$, and the optimal policy is also simple: $\pi_h^\star(s) = \operatorname{argmax}_a w_h\cdot\phi(s,a)$ for all $h = 0,\ldots,H-1$.
One key property of linear MDP is that a Bellman Backup of any function f : S 7→ R is a linear function with respect
to φ(s, a). We summarize the key property in the following claim.

Claim 7.2. Consider any arbitrary function f : S 7→ [0, H]. At any time step h ∈ [0, . . . H − 1], there must exist a
w ∈ Rd , such that, for all s, a ∈ S × A:

rh (s, a) + Ph (·|s, a) · f = w> φ(s, a).

The proof of the above claim is essentially the derivation in Eq. 0.1.
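To make the backward recursion of Eqs. 0.1–0.2 concrete, here is a minimal numpy sketch of value iteration in a linear MDP with $\mu^\star$ known; for concreteness the sketch assumes finite state and action spaces, and the array layout is an illustrative choice.

```python
import numpy as np

def linear_mdp_value_iteration(phi, theta, mu, H):
    """Exact VI when mu^* is known (Eqs. 0.1-0.2). Finite S, A for concreteness:
    phi: (S, A, d) features; theta[h]: (d,) reward parameter; mu[h]: (S, d),
    so that P_h(.|s,a) = mu[h] @ phi[s, a]."""
    S, A, d = phi.shape
    V_next = np.zeros(S)                 # V_H^* = 0
    ws, pis = [None] * H, [None] * H
    for h in reversed(range(H)):
        w = theta[h] + mu[h].T @ V_next  # w_h = theta_h^* + (mu_h^*)^T V_{h+1}^*
        Q = phi @ w                      # Q_h^*(s,a) = w_h . phi(s,a), shape (S, A)
        pis[h] = Q.argmax(axis=1)        # greedy (optimal) policy at step h
        V_next = Q.max(axis=1)           # V_h^*
        ws[h] = w
    return ws, pis
```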

7.3 Learning Transition using Ridge Linear Regression

In this section, we consider the following simple question: given a dataset of state-action-next state tuples, how can
we learn the transition Ph for all h?
Note that µ? ∈ R|S|×d . Hence explicitly writing down and storing the parameterization µ? takes time at least |S|. We
show that we can represent the model in a non-parametric way.
We consider a particular episode $n$. As in Tabular-UCBVI, at the very beginning of episode $n$ we learn a model using all data from the previous episodes. We denote this dataset as:
$$\mathcal{D}_h^n = \big\{(s_h^i, a_h^i, s_{h+1}^i)\big\}_{i=0}^{n-1}.$$
We maintain the following statistics using $\mathcal{D}_h^n$:
$$\Lambda_h^n = \sum_{i=0}^{n-1}\phi(s_h^i, a_h^i)\phi(s_h^i, a_h^i)^\top + \lambda I,$$
where $\lambda\in\mathbb{R}^+$ (it will eventually be set to 1, but we keep it general here).
To get intuition for $\Lambda_h^n$, think about the tabular setting where $\phi(s,a)$ is a one-hot vector (zeros everywhere except that the entry corresponding to $(s,a)$ is one). Then $\Lambda_h^n$ is a diagonal matrix and the diagonal entries contain $N_h^n(s,a)$—the number of times $(s,a)$ has been visited.
We consider the following multi-variate linear regression problem. Denote by $\delta(s)$ the one-hot vector that is zero everywhere except that the entry corresponding to $s$ is one. Denote $\epsilon_h^i = \delta(s_{h+1}^i) - P_h(\cdot|s_h^i, a_h^i)$. Conditioned on the history $\mathcal{H}_h^i$ ($\mathcal{H}_h^i$ denotes all information from the very beginning of the learning process up to and including $(s_h^i, a_h^i)$), we have:
$$\mathbb{E}\big[\epsilon_h^i\,\big|\,\mathcal{H}_h^i\big] = 0,$$
simply because $s_{h+1}^i$ is sampled from $P_h(\cdot|s_h^i, a_h^i)$ conditioned on $(s_h^i, a_h^i)$. Also note that $\|\epsilon_h^i\|_1\le 2$ for all $h, i$.
Since $\mu_h^\star\phi(s_h^i, a_h^i) = P_h(\cdot|s_h^i, a_h^i)$, and $\delta(s_{h+1}^i)$ is an unbiased estimate of $P_h(\cdot|s_h^i, a_h^i)$ conditioned on $(s_h^i, a_h^i)$, it is natural to learn $\mu^\star$ via regression from $\phi(s_h^i, a_h^i)$ to $\delta(s_{h+1}^i)$. This leads to the following ridge linear regression:

$$\widehat{\mu}_h^n = \operatorname{argmin}_{\mu\in\mathbb{R}^{|S|\times d}}\ \sum_{i=0}^{n-1}\big\|\mu\,\phi(s_h^i,a_h^i) - \delta(s_{h+1}^i)\big\|_2^2 + \lambda\|\mu\|_F^2.$$

Ridge linear regression has the following closed-form solution:
$$\widehat{\mu}_h^n = \sum_{i=0}^{n-1}\delta(s_{h+1}^i)\phi(s_h^i,a_h^i)^\top\big(\Lambda_h^n\big)^{-1}. \tag{0.3}$$

Note that $\widehat{\mu}_h^n\in\mathbb{R}^{|S|\times d}$, so we never want to store it explicitly. Note that we will always use $\widehat{\mu}_h^n$ together with a specific $(s,a)$ pair and a value function $V$ (think of the value iteration case), i.e., we care about $\widehat{P}_h^n(\cdot|s,a)\cdot V := (\widehat{\mu}_h^n\phi(s,a))\cdot V$, which can be re-written as:
$$\widehat{P}_h^n(\cdot|s,a)\cdot V := (\widehat{\mu}_h^n\phi(s,a))\cdot V = \phi(s,a)^\top(\Lambda_h^n)^{-1}\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\,V(s_{h+1}^i),$$
where we use the fact that $\delta(s)^\top V = V(s)$. Thus evaluating the operator $\widehat{P}_h^n(\cdot|s,a)\cdot V$ simply requires storing all the data, and it can be computed via simple linear algebra with computational complexity $\mathrm{poly}(d, n)$—no polynomial dependence on $|S|$.
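The following short numpy sketch illustrates this non-parametric evaluation of $\widehat{P}_h^n(\cdot|s,a)\cdot V$ without ever forming $\widehat{\mu}_h^n$; the function name and argument layout are illustrative assumptions.

```python
import numpy as np

def phat_dot_v(phi_sa, Phi_data, next_idx, V, lam=1.0):
    """Evaluate Phat_h^n(.|s,a) . V = phi(s,a)^T (Lambda_h^n)^{-1} sum_i phi_i V(s'_i).
    phi_sa: (d,) feature of the query (s,a); Phi_data: (n, d) features phi(s_h^i, a_h^i);
    next_idx: (n,) indices of the observed next states s_{h+1}^i; V: (S,) value vector."""
    d = phi_sa.shape[0]
    Lambda = lam * np.eye(d) + Phi_data.T @ Phi_data   # Lambda_h^n
    target = Phi_data.T @ V[next_idx]                  # sum_i phi_i * V(s'_i)
    return phi_sa @ np.linalg.solve(Lambda, target)    # cost poly(d, n), independent of |S|
```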
Let us calculate the difference between $\widehat{\mu}_h^n$ and $\mu_h^\star$.
Lemma 7.3 (Difference between $\widehat{\mu}_h^n$ and $\mu_h^\star$). For all $n$ and $h$, we must have:
$$\widehat{\mu}_h^n - \mu_h^\star = -\lambda\mu_h^\star(\Lambda_h^n)^{-1} + \sum_{i=0}^{n-1}\epsilon_h^i\,\phi(s_h^i,a_h^i)^\top(\Lambda_h^n)^{-1}.$$

Proof: We start from the closed-form solution of $\widehat{\mu}_h^n$:
$$\widehat{\mu}_h^n = \sum_{i=0}^{n-1}\delta(s_{h+1}^i)\phi(s_h^i,a_h^i)^\top(\Lambda_h^n)^{-1} = \sum_{i=0}^{n-1}\big(P_h(\cdot|s_h^i,a_h^i) + \epsilon_h^i\big)\phi(s_h^i,a_h^i)^\top(\Lambda_h^n)^{-1}$$
$$= \mu_h^\star\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\phi(s_h^i,a_h^i)^\top(\Lambda_h^n)^{-1} + \sum_{i=0}^{n-1}\epsilon_h^i\,\phi(s_h^i,a_h^i)^\top(\Lambda_h^n)^{-1} = \mu_h^\star(\Lambda_h^n - \lambda I)(\Lambda_h^n)^{-1} + \sum_{i=0}^{n-1}\epsilon_h^i\,\phi(s_h^i,a_h^i)^\top(\Lambda_h^n)^{-1}$$
$$= \mu_h^\star - \lambda\mu_h^\star(\Lambda_h^n)^{-1} + \sum_{i=0}^{n-1}\epsilon_h^i\,\phi(s_h^i,a_h^i)^\top(\Lambda_h^n)^{-1}.$$
Rearranging terms, we conclude the proof.


Lemma 7.4. Fix $V: S\mapsto[0,H]$ and $\delta\in(0,1)$. With probability at least $1-\delta$, for all $n$ and $h$, we have:
$$\Big\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\big(V^\top\epsilon_h^i\big)\Big\|_{(\Lambda_h^n)^{-1}} \le 3H\sqrt{\ln\frac{H\det(\Lambda_h^n)^{1/2}\det(\lambda I)^{-1/2}}{\delta}}.$$

Proof: We first examine the noise terms $\{V^\top\epsilon_h^i\}_{h,i}$. Since $V$ is independent of the data (it is a pre-fixed function), by linearity of expectation we have:
$$\mathbb{E}\big[V^\top\epsilon_h^i\,\big|\,\mathcal{H}_h^i\big] = 0, \qquad |V^\top\epsilon_h^i| \le \|V\|_\infty\|\epsilon_h^i\|_1 \le 2H, \quad \forall h, i.$$
Hence, this is a martingale difference sequence. Using the self-normalized vector-valued martingale bound (Lemma A.5), we have that for all $n$, with probability at least $1-\delta$:
$$\Big\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\big(V^\top\epsilon_h^i\big)\Big\|_{(\Lambda_h^n)^{-1}} \le 3H\sqrt{\ln\frac{\det(\Lambda_h^n)^{1/2}\det(\lambda I)^{-1/2}}{\delta}}.$$
Applying a union bound over all $h\in[H]$, we get that with probability at least $1-\delta$, for all $n, h$:
$$\Big\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\big(V^\top\epsilon_h^i\big)\Big\|_{(\Lambda_h^n)^{-1}} \le 3H\sqrt{\ln\frac{H\det(\Lambda_h^n)^{1/2}\det(\lambda I)^{-1/2}}{\delta}}. \tag{0.4}$$

7.4 Uniform Convergence via Covering

Now we first take a detour and consider how to achieve a uniform convergence result over a function class $\mathcal{F}$ that contains infinitely many functions. We already know how to get uniform convergence if $\mathcal{F}$ is finite—we simply use a union bound. However, when $\mathcal{F}$ contains infinitely many functions, we cannot directly apply a union bound. We will use a covering argument here.
Consider the following ball with radius $R$: $\Theta = \{\theta\in\mathbb{R}^d : \|\theta\|_2\le R\}$. Fix an $\epsilon > 0$. An $\epsilon$-net $\mathcal{N}_\epsilon\subset\Theta$ is a set such that for any $\theta\in\Theta$, there exists a $\theta'\in\mathcal{N}_\epsilon$ with $\|\theta-\theta'\|_2\le\epsilon$. We call the smallest $\epsilon$-net the $\epsilon$-cover. Abusing notation a bit, we simply denote by $\mathcal{N}_\epsilon$ the $\epsilon$-cover.
The $\epsilon$-covering number is the size of the $\epsilon$-cover $\mathcal{N}_\epsilon$. We define the covering dimension as $\ln(|\mathcal{N}_\epsilon|)$.
Lemma 7.5. The $\epsilon$-covering number of the ball $\Theta = \{\theta\in\mathbb{R}^d : \|\theta\|_2\le R\}$ is upper bounded by $(1+2R/\epsilon)^d$.

We can extend the above definition to a function class. Specifically, we consider the following functions. For a triple $(w,\beta,\Lambda)$ where $w\in\mathbb{R}^d$ with $\|w\|_2\le L$, $\beta\in[0,B]$, and $\Lambda$ such that $\sigma_{\min}(\Lambda)\ge\lambda$, we define $f_{w,\beta,\Lambda}: S\mapsto[0,H]$ as follows:
$$f_{w,\beta,\Lambda}(s) = \min\Big\{\max_a\Big(w^\top\phi(s,a) + \beta\sqrt{\phi(s,a)^\top\Lambda^{-1}\phi(s,a)}\Big),\ H\Big\}, \quad \forall s\in S. \tag{0.5}$$
We denote the function class $\mathcal{F}$ as:
$$\mathcal{F} = \big\{f_{w,\beta,\Lambda} : \|w\|_2\le L,\ \beta\in[0,B],\ \sigma_{\min}(\Lambda)\ge\lambda\big\}. \tag{0.6}$$
Note that $\mathcal{F}$ contains infinitely many functions as the parameters are continuous. However, we will show that it has a finite covering number that scales exponentially with the number of parameters in $(w,\beta,\Lambda)$.
Why do we look at $\mathcal{F}$? As we will see later in this chapter, $\mathcal{F}$ contains all possible $\widehat{Q}_h$ functions one could encounter during the learning process.

Lemma 7.6 ($\epsilon$-covering dimension of $\mathcal{F}$). Consider $\mathcal{F}$ defined in Eq. 0.6 and denote its $\epsilon$-cover by $\mathcal{N}_\epsilon$, with the $\ell_\infty$ norm as the distance metric, i.e., $d(f_1,f_2) = \|f_1-f_2\|_\infty$ for any $f_1,f_2\in\mathcal{F}$. We have:
$$\ln(|\mathcal{N}_\epsilon|) \le d\ln(1+6L/\epsilon) + \ln\big(1+6B/(\sqrt{\lambda}\,\epsilon)\big) + d^2\ln\big(1+18B^2\sqrt{d}/(\lambda\epsilon^2)\big).$$
Note that the $\epsilon$-covering dimension scales quadratically with $d$.


Proof: We start by building a net over the parameter space $(w,\beta,\Lambda)$, and then we convert the net over the parameter space to an $\epsilon$-net over $\mathcal{F}$ under the $\ell_\infty$ distance metric.
Pick two functions corresponding to parameters $(w,\beta,\Lambda)$ and $(\hat w,\hat\beta,\hat\Lambda)$. For any $s$,
$$|f(s) - \hat f(s)| \le \max_a\Big|\Big(w^\top\phi(s,a) + \beta\sqrt{\phi(s,a)^\top\Lambda^{-1}\phi(s,a)}\Big) - \Big(\hat w^\top\phi(s,a) + \hat\beta\sqrt{\phi(s,a)^\top\hat\Lambda^{-1}\phi(s,a)}\Big)\Big|$$
$$\le \max_a\big|(w-\hat w)^\top\phi(s,a)\big| + \max_a|\beta-\hat\beta|\sqrt{\phi(s,a)^\top\Lambda^{-1}\phi(s,a)} + \max_a\hat\beta\Big|\sqrt{\phi(s,a)^\top\Lambda^{-1}\phi(s,a)} - \sqrt{\phi(s,a)^\top\hat\Lambda^{-1}\phi(s,a)}\Big|$$
$$\le \|w-\hat w\|_2 + |\beta-\hat\beta|/\sqrt{\lambda} + B\sqrt{\big|\phi(s,a)^\top(\Lambda^{-1}-\hat\Lambda^{-1})\phi(s,a)\big|} \le \|w-\hat w\|_2 + |\beta-\hat\beta|/\sqrt{\lambda} + B\sqrt{\|\Lambda^{-1}-\hat\Lambda^{-1}\|_F}.$$
Note that $\Lambda^{-1}$ is a PD matrix with $\sigma_{\max}(\Lambda^{-1})\le 1/\lambda$.
Now we consider an $\epsilon/3$-net $\mathcal{N}_{\epsilon/3,w}$ over $\{w:\|w\|_2\le L\}$, a $\sqrt{\lambda}\epsilon/3$-net $\mathcal{N}_{\sqrt{\lambda}\epsilon/3,\beta}$ over the interval $[0,B]$ for $\beta$, and an $\epsilon^2/(9B^2)$-net $\mathcal{N}_{\epsilon^2/(9B^2),\Lambda}$ (in Frobenius norm) over $\{\Lambda^{-1} : \|\Lambda^{-1}\|_F\le\sqrt{d}/\lambda\}$. The product of these three nets provides an $\epsilon$-cover for $\mathcal{F}$, which means that the size of the $\epsilon$-net $\mathcal{N}_\epsilon$ for $\mathcal{F}$ is upper bounded as:
$$\ln|\mathcal{N}_\epsilon| \le \ln|\mathcal{N}_{\epsilon/3,w}| + \ln|\mathcal{N}_{\sqrt{\lambda}\epsilon/3,\beta}| + \ln|\mathcal{N}_{\epsilon^2/(9B^2),\Lambda}| \le d\ln(1+6L/\epsilon) + \ln\big(1+6B/(\sqrt{\lambda}\,\epsilon)\big) + d^2\ln\big(1+18B^2\sqrt{d}/(\lambda\epsilon^2)\big).$$

Remark. Covering gives a way to measure the complexity of a function class (or hypothesis class). Relating to VC theory, the covering number is roughly upper bounded by $\exp(d)$ with $d$ being the VC dimension. However, there are cases where the VC dimension is infinite but the covering number is finite.
Now we can build a uniform convergence argument for all $f\in\mathcal{F}$.
Lemma 7.7 (Uniform Convergence). Set $\lambda = 1$ and fix $\delta\in(0,1)$. With probability at least $1-\delta$, for all $n, h$, all $s,a$, and all $f\in\mathcal{F}$, we have:
$$\Big|\big(\widehat{P}_h^n(\cdot|s,a) - P(\cdot|s,a)\big)\cdot f\Big| \lesssim H\,\|\phi(s,a)\|_{(\Lambda_h^n)^{-1}}\Big(\sqrt{\ln\frac{H}{\delta}} + \sqrt{d\ln(1+6L\sqrt{N})} + \sqrt{d^2\ln(1+18B^2\sqrt{d}N)}\Big).$$

Proof: Recall Lemma 7.4: with probability at least $1-\delta$, for all $n, h$ and a pre-fixed $V$ (independent of the random process):
$$\Big\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\big(V^\top\epsilon_h^i\big)\Big\|_{(\Lambda_h^n)^{-1}}^2 \le 9H^2\ln\frac{H\det(\Lambda_h^n)^{1/2}\det(\lambda I)^{-1/2}}{\delta} \le 9H^2\Big(\ln\frac{H}{\delta} + d\ln(1+N)\Big),$$
where we have used the facts that $\|\phi\|_2\le 1$, $\lambda = 1$, and $\|\Lambda_h^n\|_2\le N+1$.
Denote the $\epsilon$-cover of $\mathcal{F}$ by $\mathcal{N}_\epsilon$. With a union bound over all functions in $\mathcal{N}_\epsilon$, we have that with probability at least $1-\delta$, for all $V\in\mathcal{N}_\epsilon$ and all $n, h$:
$$\Big\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\big(V^\top\epsilon_h^i\big)\Big\|_{(\Lambda_h^n)^{-1}}^2 \le 9H^2\Big(\ln\frac{H}{\delta} + \ln(|\mathcal{N}_\epsilon|) + d\ln(1+N)\Big).$$
Recalling Lemma 7.6 and substituting the expression for $\ln|\mathcal{N}_\epsilon|$ into the above inequality, we get:
$$\Big\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\big(V^\top\epsilon_h^i\big)\Big\|_{(\Lambda_h^n)^{-1}}^2 \le 9H^2\Big(\ln\frac{H}{\delta} + d\ln(1+6L/\epsilon) + d^2\ln(1+18B^2\sqrt{d}/\epsilon^2) + d\ln(1+N)\Big).$$

Now consider an arbitrary $f\in\mathcal{F}$. By the definition of the $\epsilon$-cover, there exists a $V\in\mathcal{N}_\epsilon$ such that $\|f - V\|_\infty\le\epsilon$. Thus, we have:
$$\Big\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\big(f^\top\epsilon_h^i\big)\Big\|_{(\Lambda_h^n)^{-1}}^2 \le 2\Big\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\big(V^\top\epsilon_h^i\big)\Big\|_{(\Lambda_h^n)^{-1}}^2 + 2\Big\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\big((V-f)^\top\epsilon_h^i\big)\Big\|_{(\Lambda_h^n)^{-1}}^2$$
$$\le 2\Big\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\big(V^\top\epsilon_h^i\big)\Big\|_{(\Lambda_h^n)^{-1}}^2 + 8\epsilon^2 N \le 9H^2\Big(\ln\frac{H}{\delta} + d\ln(1+6L/\epsilon) + d^2\ln(1+18B^2\sqrt{d}/\epsilon^2) + d\ln(1+N)\Big) + 8\epsilon^2 N,$$
where in the second inequality we use the fact that $\big\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)(V-f)^\top\epsilon_h^i\big\|_{(\Lambda_h^n)^{-1}}^2 \le 4\epsilon^2 N$, which follows from
$$\Big\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)(V-f)^\top\epsilon_h^i\Big\|_{(\Lambda_h^n)^{-1}}^2 \le \frac{4\epsilon^2}{\lambda}\sum_{i=0}^{n-1}\big\|\phi(s_h^i,a_h^i)\big\|_2^2 \le 4\epsilon^2 N.$$


Setting $\epsilon = 1/\sqrt{N}$, we get:
$$\Big\|\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\big(f^\top\epsilon_h^i\big)\Big\|_{(\Lambda_h^n)^{-1}}^2 \le 9H^2\Big(\ln\frac{H}{\delta} + d\ln(1+6L\sqrt{N}) + d^2\ln(1+18B^2\sqrt{d}N) + d\ln(1+N)\Big) + 8 \lesssim H^2\Big(\ln\frac{H}{\delta} + d\ln(1+6L\sqrt{N}) + d^2\ln(1+18B^2\sqrt{d}N)\Big),$$
where we recall that $\lesssim$ ignores absolute constants.


 
Now recall that we can express $\big(\widehat{P}_h^n(\cdot|s,a) - P(\cdot|s,a)\big)\cdot f = \phi(s,a)^\top(\widehat{\mu}_h^n - \mu_h^\star)^\top f$. Recalling Lemma 7.3, we have:
$$\big|(\widehat{\mu}_h^n\phi(s,a) - \mu_h^\star\phi(s,a))\cdot f\big| \le \Big|\lambda\,\phi(s,a)^\top(\Lambda_h^n)^{-1}(\mu_h^\star)^\top f\Big| + \Big|\phi(s,a)^\top(\Lambda_h^n)^{-1}\sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)(\epsilon_h^i)^\top f\Big|$$
$$\lesssim H\sqrt{d}\,\|\phi(s,a)\|_{(\Lambda_h^n)^{-1}} + \|\phi(s,a)\|_{(\Lambda_h^n)^{-1}}\sqrt{H^2\Big(\ln\frac{H}{\delta} + d\ln(1+6L\sqrt{N}) + d^2\ln(1+18B^2\sqrt{d}N)\Big)}$$
$$\lesssim H\,\|\phi(s,a)\|_{(\Lambda_h^n)^{-1}}\Big(\sqrt{\ln\frac{H}{\delta}} + \sqrt{d\ln(1+6L\sqrt{N})} + \sqrt{d^2\ln(1+18B^2\sqrt{d}N)}\Big).$$
7.5 Algorithm

Our algorithm, Upper Confidence Bound Value Iteration (UCB-VI), will use a reward bonus to ensure optimism. Specifically, we use the following reward bonus, which is motivated by the bonus used in linear bandits:
$$b_h^n(s,a) = \beta\sqrt{\phi(s,a)^\top(\Lambda_h^n)^{-1}\phi(s,a)}, \tag{0.7}$$
where $\beta$ contains polynomial factors of $H$ and $d$, along with other constants and logarithmic terms. Again, to gain intuition, think about what this bonus looks like when we specialize the linear MDP to a tabular MDP.

Algorithm 4 UCBVI for Linear MDPs


1: Input: parameters β, λ
2: for n = 1 . . . N do
3: Compute Pbhn for all h (Eq. 0.3)
4: Compute reward bonus bnh for all h (Eq. 0.7)
5: Run Value-Iteration on {Pbhn , rh + bnh }H−1
h=0 (Eq. 0.8)
6: Set π n as the returned policy of VI.
7: end for

With the above setup, we now describe the algorithm. In every episode $n$, we learn the model $\widehat{\mu}_h^n$ via ridge linear regression. We then form the reward bonus as shown in Eq. 0.7. With that, we perform the following truncated Value Iteration (always truncating the value at $H$):
$$\widehat{V}_H^n(s) = 0, \ \forall s,$$
$$\widehat{Q}_h^n(s,a) = \theta_h^\star\cdot\phi(s,a) + \beta\sqrt{\phi(s,a)^\top(\Lambda_h^n)^{-1}\phi(s,a)} + \phi(s,a)^\top(\widehat{\mu}_h^n)^\top\widehat{V}_{h+1}^n = \beta\sqrt{\phi(s,a)^\top(\Lambda_h^n)^{-1}\phi(s,a)} + \big(\theta_h^\star + (\widehat{\mu}_h^n)^\top\widehat{V}_{h+1}^n\big)^\top\phi(s,a),$$
$$\widehat{V}_h^n(s) = \min\Big\{\max_a\widehat{Q}_h^n(s,a),\ H\Big\}, \qquad \pi_h^n(s) = \operatorname{argmax}_a\widehat{Q}_h^n(s,a). \tag{0.8}$$
Note that $\widehat{Q}_h^n$ above contains two components: the square root of a quadratic form (the bonus) and a linear component. Moreover, $\widehat{V}_h^n$ has the form of $f_{w,\beta,\Lambda}$ defined in Eq. 0.5.
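For concreteness, here is a minimal numpy sketch of the truncated value iteration in Eq. 0.8, evaluating $\widehat{\mu}_h^n$ implicitly from the data as in Section 7.3; the finite state/action layout, the data container, and the choice of `beta` are illustrative assumptions.

```python
import numpy as np

def ucbvi_linear_plan(phi, theta, data, beta, H, lam=1.0):
    """Truncated VI of Eq. 0.8 for a linear MDP, using only stored data (no explicit mu_hat).
    phi: (S, A, d) features; theta[h]: known reward parameter (d,);
    data[h] = (Phi_h, next_idx) with Phi_h: (n, d) features phi(s_h^i, a_h^i),
    next_idx: (n,) indices of s_{h+1}^i."""
    S, A, d = phi.shape
    V_next = np.zeros(S)
    policies = [None] * H
    for h in reversed(range(H)):
        Phi_h, nxt = data[h]
        Lambda = lam * np.eye(d) + Phi_h.T @ Phi_h          # Lambda_h^n
        Lambda_inv = np.linalg.inv(Lambda)
        w_hat = Lambda_inv @ (Phi_h.T @ V_next[nxt])         # (mu_hat_h)^T V_{h+1}
        flat = phi.reshape(S * A, d)
        bonus = beta * np.sqrt(np.einsum('nd,de,ne->n', flat, Lambda_inv, flat))
        Q = (flat @ (theta[h] + w_hat) + bonus).reshape(S, A)
        V_next = np.minimum(Q.max(axis=1), H)                # truncate at H
        policies[h] = Q.argmax(axis=1)
    return policies
```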
The following lemma bounds the norm of the linear weights in $\widehat{Q}_h^n$.
Lemma 7.8. Assume $\beta\in[0,B]$. For all $n, h$, $\widehat{V}_h^n$ has the form of Eq. 0.5, and $\widehat{V}_h^n$ falls into the following class:
$$\mathcal{V} = \Big\{f_{w,\beta,\Lambda} : \|w\|_2 \le W + \frac{HN}{\lambda},\ \beta\in[0,B],\ \sigma_{\min}(\Lambda)\ge\lambda\Big\}. \tag{0.9}$$

Proof: We just need to show that $\theta_h^\star + (\widehat{\mu}_h^n)^\top\widehat{V}_{h+1}^n$ has bounded $\ell_2$ norm. This is straightforward, since we always have $\|\widehat{V}_{h+1}^n\|_\infty\le H$ due to the truncation in Value Iteration:
$$\big\|\theta_h^\star + (\widehat{\mu}_h^n)^\top\widehat{V}_{h+1}^n\big\|_2 \le W + \big\|(\widehat{\mu}_h^n)^\top\widehat{V}_{h+1}^n\big\|_2.$$
Now we use the closed form of $\widehat{\mu}_h^n$ from Eq. 0.3:
$$\big\|(\widehat{\mu}_h^n)^\top\widehat{V}_{h+1}^n\big\|_2 = \Big\|\sum_{i=0}^{n-1}(\Lambda_h^n)^{-1}\phi(s_h^i,a_h^i)\,\widehat{V}_{h+1}^n(s_{h+1}^i)\Big\|_2 \le H\sum_{i=0}^{n-1}\big\|(\Lambda_h^n)^{-1}\phi(s_h^i,a_h^i)\big\|_2 \le \frac{Hn}{\lambda},$$
where we use the facts that $\|\widehat{V}_{h+1}^n\|_\infty\le H$, $\sigma_{\max}\big((\Lambda_h^n)^{-1}\big)\le 1/\lambda$, and $\sup_{s,a}\|\phi(s,a)\|_2\le 1$.

7.6 Analysis of UCBVI for Linear MDPs

In this section, we prove the following regret bound for UCBVI.

Theorem 7.9 (Regret Bound). Set $\beta = \widetilde{O}(Hd)$ and $\lambda = 1$. UCBVI (Algorithm 4) achieves the following regret bound:
$$\mathbb{E}\Big[\sum_{n=0}^{N-1}\big(V^\star - V^{\pi^n}\big)\Big] \le \widetilde{O}\big(H^2\sqrt{d^3 N}\big)$$

The main steps of the proof are similar to those for UCBVI in tabular MDPs. We first prove optimism via induction, and then we use optimism to upper bound the per-episode regret. Finally, we use the simulation lemma to decompose the per-episode regret.
In this section, to keep the notation simple, we set $\lambda = 1$ directly.

7.6.1 Proving Optimism

Proving optimism requires us to first bound the model error, which we did via the uniform convergence result in Lemma 7.7, namely the bound on $(\widehat{P}_h^n(\cdot|s,a) - P(\cdot|s,a))\cdot f$ for all $f\in\mathcal{V}$. Recall Lemma 7.7, this time replacing $\mathcal{F}$ by $\mathcal{V}$ defined in Eq. 0.9. With probability at least $1-\delta$, for all $n, h, s, a$ and all $f\in\mathcal{V}$,
$$\big|(\widehat{P}_h^n(\cdot|s,a) - P(\cdot|s,a))\cdot f\big| \lesssim H\,\|\phi(s,a)\|_{(\Lambda_h^n)^{-1}}\Big(\sqrt{\ln\frac{H}{\delta}} + \sqrt{d\ln\big(1+6(W+HN)\sqrt{N}\big)} + \sqrt{d^2\ln(1+18B^2\sqrt{d}N)}\Big)$$
$$\lesssim Hd\,\|\phi(s,a)\|_{(\Lambda_h^n)^{-1}}\Big(\sqrt{\ln\frac{H}{\delta}} + \sqrt{\ln(WN + HN^2)} + \sqrt{\ln(B^2\sqrt{d}N)}\Big) \lesssim \|\phi(s,a)\|_{(\Lambda_h^n)^{-1}}\underbrace{Hd\Big(\sqrt{\ln\frac{H}{\delta}} + \sqrt{\ln(W+H)} + \sqrt{\ln B} + \sqrt{\ln d} + \sqrt{\ln N}\Big)}_{:=\beta}.$$
Denote the above inequality as the event $E_{model}$. Below we condition on $E_{model}$ holding. Note that, for notational simplicity, we have denoted
$$\beta = Hd\Big(\sqrt{\ln\frac{H}{\delta}} + \sqrt{\ln(W+H)} + \sqrt{\ln B} + \sqrt{\ln d} + \sqrt{\ln N}\Big) = \widetilde{O}(Hd).$$

Remark. Note that in the definition of $\mathcal{V}$ (Eq. 0.9) we have $\beta\in[0,B]$, and in the above formula for $\beta$, $B$ appears inside a log term. So we need to set $B$ such that $\beta\le B$; we can obtain a valid $B$ by solving the inequality $\beta\le B$ for $B$.

Lemma 7.10 (Optimism). Assume the event $E_{model}$ is true. For all $n$ and $h$,
$$\widehat{V}_h^n(s) \ge V_h^\star(s), \quad \forall s.$$

Proof: Consider a fixed episode $n$. We prove the claim via induction. At $h = H$, $\widehat{V}_H^n(s) = V_H^\star(s) = 0$ gives the base case. Assume that $\widehat{V}_{h+1}^n(s) \ge V_{h+1}^\star(s)$ for all $s$. For time step $h$, we have:
$$\widehat{Q}_h^n(s,a) - Q_h^\star(s,a) = \theta_h^\star\cdot\phi(s,a) + \beta\sqrt{\phi(s,a)^\top(\Lambda_h^n)^{-1}\phi(s,a)} + \phi(s,a)^\top(\widehat{\mu}_h^n)^\top\widehat{V}_{h+1}^n - \theta_h^\star\cdot\phi(s,a) - \phi(s,a)^\top(\mu_h^\star)^\top V_{h+1}^\star$$
$$\ge \beta\sqrt{\phi(s,a)^\top(\Lambda_h^n)^{-1}\phi(s,a)} + \phi(s,a)^\top\big(\widehat{\mu}_h^n - \mu_h^\star\big)^\top\widehat{V}_{h+1}^n,$$
where in the last inequality we use the inductive hypothesis $\widehat{V}_{h+1}^n(s)\ge V_{h+1}^\star(s)$ together with the fact that $\mu_h^\star\phi(s,a)$ is a valid distribution (note that $\widehat{\mu}_h^n\phi(s,a)$ is not necessarily a valid distribution). We need to show that the bonus is large enough to offset the model error $\phi(s,a)^\top(\widehat{\mu}_h^n - \mu_h^\star)^\top\widehat{V}_{h+1}^n$. Since the event $E_{model}$ holds, we have:
$$\big|(\widehat{P}_h^n(\cdot|s,a) - P(\cdot|s,a))\cdot\widehat{V}_{h+1}^n\big| \lesssim \beta\,\|\phi(s,a)\|_{(\Lambda_h^n)^{-1}},$$
since, by the construction of $\mathcal{V}$, we know that $\widehat{V}_{h+1}^n\in\mathcal{V}$.
This concludes the proof.

7.6.2 Regret Decomposition

Now we can upper bound the per-episode regret as follows:
$$V^\star - V^{\pi^n} \le \widehat{V}_0^n(s_0) - V_0^{\pi^n}(s_0).$$
We can further bound the RHS of the above inequality using the simulation lemma. Recall Eq. 0.4, which we derived in the tabular MDP chapter (Chapter 6):
$$\widehat{V}_0^n(s_0) - V_0^{\pi^n}(s_0) \le \sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d_h^{\pi^n}}\Big[b_h^n(s,a) + \big(\widehat{P}_h^n(\cdot|s,a) - P(\cdot|s,a)\big)\cdot\widehat{V}_{h+1}^n\Big].$$
(Recall that the simulation lemma holds for any MDP—it is not specialized to tabular MDPs.)
Under the event $E_{model}$, we already know that for any $s, a, h, n$ we have $\big|\big(\widehat{P}_h^n(\cdot|s,a) - P(\cdot|s,a)\big)\cdot\widehat{V}_{h+1}^n\big| \lesssim \beta\,\|\phi(s,a)\|_{(\Lambda_h^n)^{-1}} = b_h^n(s,a)$. Hence, under $E_{model}$, we have:
$$\widehat{V}_0^n(s_0) - V_0^{\pi^n}(s_0) \le \sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d_h^{\pi^n}}\big[2 b_h^n(s,a)\big] \lesssim \sum_{h=0}^{H-1}\mathbb{E}_{s,a\sim d_h^{\pi^n}}\big[b_h^n(s,a)\big].$$
Summing over all episodes, we obtain the following statement.
Lemma 7.11 (Regret Bound). Assume the event $E_{model}$ holds. We have:
$$\sum_{n=0}^{N-1}\big(V_0^\star(s_0) - V_0^{\pi^n}(s_0)\big) \lesssim \sum_{n=0}^{N-1}\sum_{h=0}^{H-1}\mathbb{E}_{s_h^n,a_h^n\sim d_h^{\pi^n}}\big[b_h^n(s_h^n,a_h^n)\big]$$
7.6.3 Concluding the Final Regret Bound

We first consider the following elliptical potential argument, which is similar to what we saw in the linear bandit chapter.
Lemma 7.12 (Elliptical Potential). Consider an arbitrary sequence of state-action pairs $\{(s_h^i, a_h^i)\}$. Assume $\sup_{s,a}\|\phi(s,a)\|_2\le 1$ and denote $\Lambda_h^n = I + \sum_{i=0}^{n-1}\phi(s_h^i,a_h^i)\phi(s_h^i,a_h^i)^\top$. We have:
$$\sum_{i=0}^{N-1}\phi(s_h^i,a_h^i)^\top(\Lambda_h^i)^{-1}\phi(s_h^i,a_h^i) \le 2\ln\Big(\frac{\det(\Lambda_h^N)}{\det(I)}\Big) \lesssim 2d\ln(N).$$

Proof: By Lemmas 5.9 and 5.10 from the linear bandit chapter,
$$\sum_{i=0}^{N-1}\phi(s_h^i,a_h^i)^\top(\Lambda_h^i)^{-1}\phi(s_h^i,a_h^i) \le 2\sum_{i=0}^{N-1}\ln\big(1 + \phi(s_h^i,a_h^i)^\top(\Lambda_h^i)^{-1}\phi(s_h^i,a_h^i)\big) \le 2\ln\Big(\frac{\det(\Lambda_h^N)}{\det(I)}\Big) \le 2d\ln\Big(1+\frac{N}{d}\Big) \lesssim 2d\ln(N),$$
where the first inequality uses that $\ln(1+y)\ge y/2$ for $0\le y\le 1$ (note that $\phi^\top(\Lambda_h^i)^{-1}\phi\le 1$ since $\Lambda_h^i\succeq I$ and $\|\phi\|_2\le 1$).
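As a quick numerical sanity check of this elliptical potential bound (not part of the analysis), the snippet below streams random unit-norm features and compares the left-hand side with $2\ln\det(\Lambda^N)$ and $2d\ln(1+N/d)$; the dimension and sample count are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
d, N = 6, 500
Lam = np.eye(d)                          # Lambda^0 = I
lhs = 0.0
for i in range(N):
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)               # ||phi|| = 1
    lhs += x @ np.linalg.solve(Lam, x)   # phi^T (Lambda^i)^{-1} phi
    Lam += np.outer(x, x)
rhs = 2 * np.linalg.slogdet(Lam)[1]      # 2 ln det(Lambda^N), det(I) = 1
print(lhs <= rhs <= 2 * d * np.log(1 + N / d))   # True
```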
Now we use Lemma 7.11 together with the above inequality to conclude the proof.
Proof:[Proof of main Theorem 7.9]
We split the expected regret based on the event $E_{model}$:
$$\mathbb{E}\Big[\sum_{n=0}^{N-1}\big(V^\star - V^{\pi^n}\big)\Big] = \mathbb{E}\Big[\mathbf{1}\{E_{model}\text{ holds}\}\Big(\sum_{n=0}^{N-1}V^\star - V^{\pi^n}\Big)\Big] + \mathbb{E}\Big[\mathbf{1}\{E_{model}\text{ does not hold}\}\Big(\sum_{n=0}^{N-1}V^\star - V^{\pi^n}\Big)\Big]$$
$$\le \mathbb{E}\Big[\mathbf{1}\{E_{model}\text{ holds}\}\Big(\sum_{n=0}^{N-1}V^\star - V^{\pi^n}\Big)\Big] + \delta N H \lesssim \mathbb{E}\Big[\sum_{n=0}^{N-1}\sum_{h=0}^{H-1}b_h^n(s_h^n,a_h^n)\Big] + \delta N H.$$
Note that:
$$\sum_{n=0}^{N-1}\sum_{h=0}^{H-1}b_h^n(s_h^n,a_h^n) = \beta\sum_{h=0}^{H-1}\sum_{n=0}^{N-1}\sqrt{\phi(s_h^n,a_h^n)^\top(\Lambda_h^n)^{-1}\phi(s_h^n,a_h^n)} \le \beta\sum_{h=0}^{H-1}\sqrt{N\sum_{n=0}^{N-1}\phi(s_h^n,a_h^n)^\top(\Lambda_h^n)^{-1}\phi(s_h^n,a_h^n)} \lesssim \beta H\sqrt{Nd\ln(N)}.$$
Recalling that $\beta = \widetilde{O}(Hd)$ concludes the proof.

7.7 Bibliographic Remarks and Further Readings

There are a number of ways to linearly parameterize an MDP so that it permits efficient reinforcement learning (both statistically and computationally). The first observation that such assumptions lead to statistically efficient algorithms was due to [Jiang et al., 2017], owing to the fact that these models have low Bellman rank (as we shall see in Chapter 8). The first statistically and computationally efficient algorithm for a linearly parameterized MDP model was due to [Yang and Wang, 2019a,b]. Subsequently, [Jin et al., 2020] provided a computationally and statistically efficient algorithm for a simplified version of this model, which is the model we consider here. The models of [Modi et al., 2020, Jia et al., 2020, Ayoub et al., 2020, Zhou et al., 2020] provide another linearly parameterized family, which can be viewed as parameterizing $P(s'|s,a)$ as a linear combination of feature functions $\phi(s,a,s')$. One notable aspect of the model we present here, where $P_h(\cdot|s,a) = \mu_h^\star\phi(s,a)$, is that it has $|S|\cdot d$ free parameters (note that $\mu$ is unknown and of size $|S|\cdot d$), and yet the statistical complexity does not depend on $|S|$. Notably, this implies that accurate model estimation would require on the order of $|S|$ samples, while the regret for reinforcement learning is only polynomial in $d$. The linearly parameterized models of [Modi et al., 2020, Jia et al., 2020, Ayoub et al., 2020, Zhou et al., 2020] are parameterized by $O(d)$ parameters, and, while $O(d)$ free parameters suggests lower model capacity (accurate model-based estimation requires only polynomially many samples in $d$), these models are incomparable to the linearly parameterized model presented in this chapter.
It is worth observing that all of these models permit statistically efficient estimation because they have bounded Bellman rank [Jiang et al., 2017] (and bounded Witness rank [Sun et al., 2019a]), a point which we return to in the next chapter.
The specific linear model we consider here was originally introduced by [Jin et al., 2020]. The non-parametric model-based algorithm we study here was first introduced by [Lykouris et al., 2019] (in the context of adversarial attacks).
The analysis we present here does not easily extend to infinite-dimensional features $\phi$ (e.g., an RBF kernel); [Agarwal et al., 2020a] provide an algorithm and an analysis that extend to infinite-dimensional $\phi$, i.e., where we have a Reproducing Kernel Hilbert Space (RKHS) and the regret is based on the concept of Information Gain.

Chapter 8

Parametric Models with Bounded Bellman


Rank

Our previous chapters on exploration in RL focused on the UCBVI algorithm designed for tabular MDPs and linear MDPs. While linear MDPs extend tabular MDPs to the function approximation regime, they are still limited to linear function approximation, and indeed the assumption that the Bellman backup of any function is itself a linear function is a strong one. In this chapter, we move beyond tabular and linear representations. We aim to design algorithms with general function approximation that work for a large family of MDPs subsuming not only tabular MDPs and linear MDPs, but also other models such as linear function approximation with Bellman completeness (which generalizes linear MDPs), reactive predictive state representations (PSRs), and reactive Partially Observable Markov Decision Processes (POMDPs).

8.1 Problem setting

We consider a finite-horizon episodic time-dependent Markov Decision Process (MDP) $M = (\{S\}_h, \{A\}_h, P_h, s_0, r, H)$, where $S_h$ is the state space for time step $h$, and we assume $S_0, S_1, \ldots, S_{H-1}$ are disjoint ($A_0, A_1, \ldots, A_{H-1}$ are disjoint as well). We assume $S_0 = \{s_0\}$, though we can generalize to an arbitrary $S_0$ with an initial state distribution.

Remark. This setting generalizes our previous finite-horizon setting with a fixed $S$ and $A$ but time-dependent $P_h$ and $r_h$: we just need to fold the time step $h$ into the state space $S$ to create $S_h$, i.e., every state $s\in S_h$ encodes the time step $h$. The benefit of this slightly more general notation is that it allows us to drop the subscript $h$ from $P$, $r$, and from $\pi$, $V$, and $Q$, since the state-action pair now carries the time step $h$.
The goal of an agent is to maximize the cumulative expected reward it obtains over $H$ steps:
$$\max_{\pi} V^\pi(s_0).$$
In this chapter, we focus on a PAC (Probably Approximately Correct) guarantee. Namely, our goal is to find a policy $\widehat{\pi}$ such that $V^{\widehat{\pi}}(s_0) \ge V^\star(s_0) - \epsilon$.
We make the following boundedness assumption on the rewards.
Assumption 8.1. Almost surely, for any trajectory $\tau$ and step $h$, $0\le r_h\le 1$. Additionally, $0\le\sum_{h=0}^{H-1}r_h\le 1$ almost surely for any trajectory $\tau$.

While the first part of the assumption is the standard boundedness assumption we have made throughout, the second part assumes that the trajectory-level reward is also bounded by 1, instead of $H$, which is helpful for capturing sparse-reward, goal-directed problems with a reward at only one point of a successful trajectory. While normalizing the per-step rewards by $H$ would also keep the total reward bounded, doing so makes the total reward scale only as $1/H$ when the rewards are sparse along the trajectory.

8.2 Value-function approximation

We consider a model-free, value-function-based approach here. More specifically, denoting $S = \cup_h S_h$ and $A := \cup_h A_h$, we assume that we are given the following function class:
$$\mathcal{F} = \{f : S\times A\to[0,1]\}.$$

Since we want to learn a near-optimal behavior, we seek to approximate the Q-value function of the optimal policy,
namely Q? using f ∈ F. To this end, we start with a simplifying assumption that Q? lies in F. In practice, this can
be weakened to having a good approximation for Q? in F, but we focus on exact containment for the cleanest setting.
Formally, we make the following realizability assumption.
Assumption 8.2 (Value-function realizability). The function class F satisfies Q? ∈ F.

Armed with this assumption, we may ask whether we can find $Q^\star$ using a number of samples which does not scale with $|S|$, trading it off for a statistical complexity measure of $\mathcal{F}$ such as $\ln|\mathcal{F}|$. The next result, adapted from Krishnamurthy et al. [2016], shows that this is not possible.
Theorem 8.3. Fix $H, K\in\mathbb{N}$ with $K\ge 2$ and $\epsilon\in(0,\sqrt{1/8}]$. For any algorithm, there exists an MDP with horizon $H$ and $K$ actions, a class of predictors $\mathcal{F}$ with $|\mathcal{F}| = K^H$ and $Q^\star\in\mathcal{F}$, and a constant $c>0$, such that the probability that the algorithm outputs a policy $\widehat{\pi}$ with $V^{\widehat{\pi}}\ge V^\star - \epsilon$ after collecting $T$ trajectories from the MDP is at most $2/3$ for all $T\le cK^H/\epsilon^2$.

In words, the theorem says that for any algorithm, there exists an MDP in which it cannot find a good policy using fewer than a number of samples exponential in the planning horizon, even when $Q^\star\in\mathcal{F}$. Furthermore, the size of the class $\mathcal{F}$ required for this result is $K^H$, so a logarithmic dependence on $|\mathcal{F}|$ cannot explain the lower bound. The lower bound construction essentially uses the binary tree example that we saw in the Generalization chapter.
The lower bound indicates that, in order to learn with polynomial sample complexity, we need additional assumptions. While we have seen that polynomial sample complexity is possible in linear MDPs and tabular MDPs, our goal in this chapter is to significantly weaken the structural assumptions on the MDP.

8.3 Bellman Rank

Having concluded that we cannot find a near-optimal policy using a reasonable number of samples under the realizability assumption alone, it is clear that additional structural assumptions on the problem are required in order to make progress. We now give one example of such structure, named Bellman rank, which was introduced by Jiang et al. [2017]. In order to motivate and define this quantity, we need some additional notation. For a function $f\in\mathcal{F}$, let us define $\pi_f(s) = \operatorname{argmax}_{a\in A}f(s,a)$; namely, $\pi_f$ is the greedy policy with respect to $f$. For a policy $\pi$, function $f\in\mathcal{F}$, and $h\in[H]$, let us also define the average Bellman error of the function approximator $f$:
$$\mathcal{E}(f;\pi,h) = \mathbb{E}_{s_h,a_h\sim d_h^\pi}\Big[f(s_h,a_h) - r_h - \mathbb{E}_{s_{h+1}\sim P(\cdot|s_h,a_h)}\max_{a\in A_{h+1}}f(s_{h+1},a)\Big]. \tag{0.1}$$

This is called the average Bellman error as it is not the error at an individual state-action pair $(s,a)$, but an expected error under the state-action distribution induced by the policy $\pi$. To see why the definition is natural, we note the following property of $Q^\star$ from the Bellman optimality equations.
Fact 8.4. $\mathcal{E}(Q^\star;\pi,h) = 0$, for all policies $\pi$ and levels $h$.
The fact holds because $Q^\star$ satisfies the Bellman optimality condition. We have also seen that if $f$ satisfies $f(s,a) = r(s,a) + \mathbb{E}_{s'\sim P(\cdot|s,a)}\max_{a'}f(s',a')$ for all $(s,a)\in S\times A$, then $f = Q^\star$.
Thus, if we ever discover a function $f$ such that $\mathcal{E}(f;\pi,h)\ne 0$ for some policy $\pi$ and time step $h$, then we know that $f\ne Q^\star$. The average Bellman error therefore allows us to detect wrong function approximators, i.e., $f$ with $f\ne Q^\star$.
We now make a structural assumption on the average Bellman errors, which allows us to reason about the Bellman errors induced by all policies $\pi_f$ in a sample-efficient manner. For any $h\in[H-1]$, let us define the Bellman error matrix $\mathcal{E}_h\in\mathbb{R}^{|\mathcal{F}|\times|\mathcal{F}|}$ as
$$[\mathcal{E}_h]_{g,f} = \mathcal{E}(f;\pi_g,h). \tag{0.2}$$

That is, each entry in the matrix captures the Bellman error of the function indexed by the column f under the greedy
policy πg induced by the row g at step h. With this notation, we define the Bellman rank of the MDP and a function
class F below.
Definition 8.5 (Bellman Rank). The Bellman rank of an MDP and a function class F is the smallest integer M such
that rank(Eh ) ≤ M for all h ∈ [H − 1].

Intuitively, if the Bellman rank is small, then for any level h, the number of linearly independent rows is small. That is, the average Bellman error of any function under most policies can be expressed as a linear combination of the Bellman errors of that function on a small set of policies corresponding to the linearly independent rows. Note that
the definition presented here is a modification of the original definition from Jiang et al. [2017] in order to precisely
capture the linear MDP example we covered before. [Jiang et al., 2017] showed that Bellman rank can be further
upper bounded in terms of latent quantities such as the rank of the transition matrix, or the number of latent states if
the MDP has an equivalent formulation as an MDP with a small number of latent states. We refer the reader to Jiang
et al. [2017] for detailed examples, as well as connections of Bellman rank with other rank type notions in the RL
literature to measure problem complexity.
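To make Definition 8.5 concrete, the following sketch computes the average Bellman error of Eq. 0.1 exactly on a small tabular MDP; stacking these values into the matrix $[\mathcal{E}_h]_{g,f}$ and taking its numerical rank then illustrates the Bellman rank bound for the tabular case (Section 8.3.1). The array layout and the use of a stationary transition table are illustrative assumptions of this sketch.

```python
import numpy as np

def avg_bellman_error(P, r, H, d0, f, g):
    """Exact E(f; pi_g, h) for all h on a small tabular MDP (Eq. 0.1).
    P: (S, A, S) transition, r: (S, A) reward, d0: (S,) initial distribution,
    f, g: (H, S, A) value tables; pi_g is greedy w.r.t. g at each step."""
    S, A, _ = P.shape
    errs = np.zeros(H)
    d = d0.copy()                                    # state distribution d_h^{pi_g}
    for h in range(H):
        pi = g[h].argmax(axis=1)                     # greedy action of g at step h
        v_next = f[h + 1].max(axis=1) if h + 1 < H else np.zeros(S)
        resid = f[h] - r - P @ v_next                # Bellman residual of f, shape (S, A)
        errs[h] = np.sum(d * resid[np.arange(S), pi])
        d = d @ P[np.arange(S), pi]                  # propagate state distribution
    return errs

# Filling [E_h]_{g, f} = avg_bellman_error(...)[h] for all pairs (g, f) and calling
# np.linalg.matrix_rank(E_h) gives a rank of at most S * A in the tabular case.
```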

8.3.1 Examples

Before moving on to the algorithm, we first present three examples that admit small Bellman rank: tabular MDPs, linear MDPs, and Bellman completion with a linear function class. We note that the setting of Bellman completion with a linear function class subsumes linear MDPs, and linear MDPs subsume tabular MDPs.

Tabular MDPs

For tabular MDPs, we can directly rewrite the Bellman error as follows. Focusing on an arbitrary $h$:
$$\mathcal{E}(f;\pi_g,h) = \sum_{s_h,a_h\in S_h\times A_h}d_h^{\pi_g}(s_h,a_h)\Big(f(s_h,a_h) - r(s_h,a_h) - \mathbb{E}_{s_{h+1}\sim P(\cdot|s_h,a_h)}\max_a f(s_{h+1},a)\Big) = \Big\langle d_h^{\pi_g},\ f(\cdot,\cdot) - r(\cdot,\cdot) - \mathbb{E}_{s_{h+1}\sim P(\cdot|\cdot,\cdot)}\max_a f(s_{h+1},a)\Big\rangle.$$
Namely, $\mathcal{E}(f;\pi_g,h)$ can be written as an inner product of two vectors of dimension $|S_h||A_h|$, so the Bellman rank is at most $|S_h||A_h|$. Note that in tabular MDPs, the number of states and the number of actions are the natural complexity quantities in sample complexity and regret bounds.

Linear MDPs

Recall the linear MDP definition. In linear MDPs, we know that $Q^\star$ is a linear function of the feature vector $\phi$, i.e., $Q^\star(s_h,a_h) = (w^\star)^\top\phi(s_h,a_h)$. We parameterize $\mathcal{F}$ as the following linear function class:
$$\mathcal{F} = \{w^\top\phi(s,a) : w\in\mathbb{R}^d, \|w\|_2\le W\},$$
where we assume $\|w^\star\|_2\le W$. Recall that the transition $P$ and reward are also linear, i.e., $P(\cdot|s_h,a_h) = \mu^\star\phi(s_h,a_h)$ and $r(s_h,a_h) = \phi(s_h,a_h)^\top\theta^\star$. In this case, for the average Bellman error, we have:
$$\mathcal{E}(f;\pi_g,h) = \mathbb{E}_{s_h,a_h\sim d_h^{\pi_g}}\Big[w^\top\phi(s_h,a_h) - (\theta^\star)^\top\phi(s_h,a_h) - \mathbb{E}_{s_{h+1}\sim P(\cdot|s_h,a_h)}\max_{a\in A_{h+1}}w^\top\phi(s_{h+1},a)\Big]$$
$$= \mathbb{E}_{s_h,a_h\sim d_h^{\pi_g}}\Big[w^\top\phi(s_h,a_h) - (\theta^\star)^\top\phi(s_h,a_h) - \phi(s_h,a_h)^\top(\mu^\star)^\top\Big(\max_{a\in A_{h+1}}w^\top\phi(\cdot,a)\Big)\Big]$$
$$= \Big\langle w - \theta^\star - (\mu^\star)^\top\Big(\max_{a\in A_{h+1}}w^\top\phi(\cdot,a)\Big),\ \mathbb{E}_{s_h,a_h\sim d_h^{\pi_g}}\phi(s_h,a_h)\Big\rangle.$$
Namely, the Bellman error is written as an inner product of two vectors of dimension $d$. Thus, the Bellman rank is at most $d$ for linear MDPs.

Bellman Completion in Linear Function Approximation

The last example we study here further generalizes linear MDPs. The setting of Bellman completion with linear function approximation is defined as follows.
Definition 8.6 (Bellman Completion under a Linear Function Class). We assume $r(s_h,a_h) = \theta^\star\cdot\phi(s_h,a_h)$ and consider the linear function class $\mathcal{F} = \{w^\top\phi(s,a) : w\in\mathbb{R}^d, \|w\|_2\le W\}$. Recall the Bellman operator $\mathcal{T}$. We have Bellman completion under $\mathcal{F}$ if and only if for any $f\in\mathcal{F}$, $\mathcal{T}f\in\mathcal{F}$.
One can verify that the linear MDP is a special instance of this setting.
Consider $f(s,a) := w^\top\phi(s,a)$, and denote $\mathcal{T}f(s,a) := \widetilde{w}^\top\phi(s,a)$. We can rewrite the Bellman error as follows:
$$\mathcal{E}(f;\pi_g,h) = \mathbb{E}_{s_h,a_h\sim d_h^{\pi_g}}\Big[w^\top\phi(s_h,a_h) - (\theta^\star)^\top\phi(s_h,a_h) - \mathbb{E}_{s_{h+1}\sim P(\cdot|s_h,a_h)}\max_{a\in A_{h+1}}w^\top\phi(s_{h+1},a)\Big] = \mathbb{E}_{s_h,a_h\sim d_h^{\pi_g}}\Big[w^\top\phi(s_h,a_h) - \widetilde{w}^\top\phi(s_h,a_h)\Big] = \big\langle w - \widetilde{w},\ \mathbb{E}_{s_h,a_h\sim d_h^{\pi_g}}[\phi(s_h,a_h)]\big\rangle.$$
Namely, we can write the Bellman error as an inner product of two vectors of dimension $d$. This indicates that the Bellman rank is at most $d$.

8.3.2 Linear MDP with Bounded Degree of Freedom

In the literature, there is another class of linear models defined as follows:
$$P^\star(s'|s,a) := \langle\theta^\star,\phi(s,a,s')\rangle, \qquad \|\theta^\star\|_2\le W.$$
Below we assume the reward function $r$ is known.
We consider the following model class $\mathcal{P} = \{P_\theta(s'|s,a) = \theta\cdot\phi(s,a,s') : \theta\in\mathbb{R}^d, \|\theta\|_2\le W\}$.

To apply Bellman rank and OLIVE, we need to convert the model class $\mathcal{P}$ to a $Q$-function class. For each model associated with $\theta$, we denote the model's optimal $Q$ function by $Q_\theta$ (i.e., the $Q^\star$ under reward $r$ and transition $P_\theta(s'|s,a) := \theta\cdot\phi(s,a,s')$). We form a $Q$-function class as follows:
$$\mathcal{F} := \{Q_\theta : \theta\in\mathbb{R}^d, \|\theta\|_2\le W\}.$$
Now let us write down the average Bellman error:
$$\mathcal{E}(Q_\theta;\pi_g,h) = \mathbb{E}_{s,a\sim d_h^{\pi_g}}\Big[Q_\theta(s,a) - r(s,a) - \mathbb{E}_{s'\sim P^\star(\cdot|s,a)}\max_{a'}Q_\theta(s',a')\Big].$$
Now we use the fact that $Q_\theta$ is the optimal $Q$ function under $r$ and $P_\theta$. Applying the Bellman equation to $Q_\theta(s,a)$ with $r$ and $P_\theta$, we have:
$$\mathcal{E}(Q_\theta;\pi_g,h) = \mathbb{E}_{s,a\sim d_h^{\pi_g}}\Big[\mathbb{E}_{s'\sim P_\theta(\cdot|s,a)}V_\theta(s') - \mathbb{E}_{s'\sim P^\star(\cdot|s,a)}V_\theta(s')\Big] = \mathbb{E}_{s,a\sim d_h^{\pi_g}}\Big[(\theta-\theta^\star)^\top\sum_{s'}\phi(s,a,s')V_\theta(s')\Big] = \Big\langle\mathbb{E}_{s,a\sim d_h^{\pi_g}}\sum_{s'}\phi(s,a,s')V_\theta(s'),\ \theta-\theta^\star\Big\rangle.$$
Hence the Bellman rank is at most $d$ in this model as well.

8.3.3 Examples that do not have low Bellman Rank

[Jiang et al., 2017] also demonstrate that a few more examples (with a slight modification of the definition of the average Bellman error) admit low Bellman rank, including reactive POMDPs and reactive PSRs. So far in the literature, the only example that does not admit low Bellman rank but has polynomial sample complexity algorithms is factored MDPs [Kearns and Koller, 1999]. Sun et al. [2019a] showed that the Bellman rank can be exponential in factored MDPs.

8.4 Algorithm

Having defined our main structural assumption, we now describe an algorithm whose sample complexity depends on the Bellman rank, with no explicit dependence on $|S|$ and $|A|$ and only logarithmic scaling with $|\mathcal{F}|$. For ease of presentation, we will assume that all the expectations can be measured exactly with no errors, which serves to illustrate the key idea of explore-or-terminate. For a more careful analysis with finite samples, we refer the reader to Jiang et al. [2017]. The algorithm, named OLIVE (Optimism-Led Iterative Value-function Elimination), is an iterative algorithm which successively prunes value functions that have non-zero average Bellman error (recall that $Q^\star$ has zero Bellman error at every state-action pair). It then uses the principle of optimism in the face of uncertainty to select its next policy, which allows us to detect whether an optimal policy has been found. The algorithm is described in Algorithm 5.
The key step is that during the elimination procedure (Line 6), we evaluate all remaining $f\in\mathcal{F}_{t-1}$ under a fixed roll-in policy $\pi_t$; i.e., we can estimate $\mathcal{E}(f;\pi_t,h)$ for all $f$ via a single dataset collected from $\pi_t$. Assume we collect the dataset $\{s_h^i, a_h^i, r_h^i, s_{h+1}^i\}_{i=1}^N$ where $s_h^i, a_h^i\sim d_h^{\pi_t}$ and $s_{h+1}^i\sim P(\cdot|s_h^i,a_h^i)$. For any $f\in\mathcal{F}_{t-1}$, we can form the following empirical estimate of $\mathcal{E}(f;\pi_t,h)$:
$$\widetilde{\mathcal{E}}(f;\pi_t,h) = \frac{1}{N}\sum_{i=1}^{N}\Big(f(s_h^i,a_h^i) - r_h^i - \max_a f(s_{h+1}^i,a)\Big).$$
We can apply Hoeffding's inequality (Lemma A.1) together with a union bound over $\mathcal{F}_{t-1}$ to get a uniform convergence guarantee: for any $f\in\mathcal{F}_{t-1}$, we have $\big|\widetilde{\mathcal{E}}(f;\pi_t,h) - \mathcal{E}(f;\pi_t,h)\big| = \widetilde{O}\big(\sqrt{\ln(|\mathcal{F}|/\delta)/N}\big)$.
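The Monte Carlo estimator above is simple to implement; here is a minimal sketch, where the dataset layout, the callable interface for $f$, and the explicit action list are illustrative assumptions.

```python
def empirical_bellman_error(dataset, f, actions):
    """Monte Carlo estimate of E(f; pi_t, h) from a roll-in dataset at a single step h.
    dataset: list of (s_h, a_h, r_h, s_next) tuples collected by rolling in with pi_t;
    f(s, a): candidate value function; actions: actions available at step h+1."""
    total = 0.0
    for s, a, r, s_next in dataset:
        total += f(s, a) - r - max(f(s_next, ap) for ap in actions)
    return total / len(dataset)
```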

Algorithm 5 The OLIVE algorithm for MDPs with low Bellman rank
Input: Function class F.
1: Initialize F0 = F.
2: for t = 1, 2, . . . , do
3: Define ft = argmaxf ∈Ft−1 maxa f (s0 , a) and πt = πft .
4: if maxa ft (s0 , a) = V πt then return πt .
5: else
6: Update Ft = {f ∈ Ft−1 : E(f ; πt , h) = 0, for all h ∈ [H − 1]}.
7: end if
8: end for

Since we assume that all the expectations are available exactly, the main complexity analysis in OLIVE concerns the
number of iterations before it terminates. When we estimate expectations using samples, this iteration complexity
is critical as it also scales the sample complexity of the algorithm. We will state and prove the following theorem
regarding the iteration complexity of OLIVE.
We need the following lemma, which characterizes the difference between $\max_a f(s_0,a)$ and the value $V^{\pi_f}$ of the induced greedy policy.
Lemma 8.7 (Performance Difference). For any $f$, we have:
$$\max_a f(s_0,a) - V^{\pi_f}(s_0) = \sum_{h=0}^{H-1}\mathbb{E}_{s_h,a_h\sim d_h^{\pi_f}}\Big[f(s_h,a_h) - r(s_h,a_h) - \mathbb{E}_{s_{h+1}\sim P(\cdot|s_h,a_h)}\max_a f(s_{h+1},a)\Big].$$
We leave the proof of the above lemma as an exercise (HW2).

We leave the proof of the above lemma in HW2.


Theorem 8.8. For any MDP and F with Bellman rank M , OLIVE terminates in at most M H iterations and outputs
π? .

Proof: Consider an iteration $t$ of OLIVE. Due to Assumption 8.2 and Fact 8.4, we know that $Q^\star\in\mathcal{F}_{t-1}$. Suppose OLIVE terminates at this iteration and returns $\pi_t$. Then, writing $V_f := \max_a f(s_0,a)$, we have
$$V^{\pi_t} = V_{f_t} = \max_{f\in\mathcal{F}_{t-1}}V_f \ge V_{Q^\star} = V^\star,$$
since $Q^\star\in\mathcal{F}_{t-1}$. So the algorithm correctly outputs an optimal policy when it terminates.
On the other hand, if it does not terminate, then $V^{\pi_t}\ne V_{f_t}$, and Lemma 8.7 implies that $\mathcal{E}(f_t;\pi_t,h) > 0$ for some step $h\in[H-1]$. This certainly ensures that $f_t\notin\mathcal{F}_t$, but it has significantly stronger implications. Note that $f_t\in\mathcal{F}_{t-1}$ implies that $\mathcal{E}(f_t;\pi_s,h) = 0$ for all $s<t$ and $h\in[H-1]$. Since we just concluded that $\mathcal{E}(f_t;\pi_t,h) > 0$ for some $h$, it must be the case that the row corresponding to $\pi_t$ is linearly independent of the rows corresponding to $\pi_1,\ldots,\pi_{t-1}$ in the matrix $\mathcal{E}_h$. Consequently, at each non-final iteration, OLIVE finds a row that is linearly independent of all previously identified rows (indexed by $f_1, f_2,\ldots,f_{t-1}$). Since each $\mathcal{E}_h$ has rank at most $M$, the total number of linearly independent rows one can find per matrix is at most $M$. Recalling that there are at most $H$ Bellman error matrices, the algorithm must terminate within $HM$ iterations, which gives the statement of the theorem.
The proof of the theorem makes it precise that the factorization underlying Bellman rank really plays the role of an
efficient basis for exploration in complex MDPs. Extending these ideas to noisy estimates of expectations requires
some care since algebraic notions like rank are not robust to noise. Instead Jiang et al. [2017] use a more general
volumetric argument to analyze the noisy case, as well as describe robustness to requirements of exact low-rank
factorization and realizability.

Unfortunately, the OLIVE algorithm is not computationally efficient, and a computational hardness result was established by Dann et al. [2018]; whether there exists a computationally efficient algorithm for the low Bellman rank setting remains an open problem. Developing statistically and computationally efficient exploration algorithms for RL with rich observations is an area of active research.

8.5 Extension to Model-based Setting

We briefly discuss the extension to the model-based setting. Different from the model-free, value-based setting, in model-based learning we start with a class of candidate transition models,
$$\mathcal{P} = \{P : S\times A\to\Delta(S)\}.$$
To ease presentation, we denote the ground-truth model by $P^\star$. Again we assume realizability:

Assumption 8.9. We assume P ? ∈ P.

Given any P ∈ P, we can define the optimal value function and optimal Q function under P . Specifically, we denote
VP? and Q?P as the optimal value and Q function under model P , and πP as the optimal policy under P .
We introduce a class of witness functions (a.k.a. discriminators),
$$\mathcal{G} = \{g : S\times A\times S\mapsto\mathbb{R}\}.$$
We define the witness model misfit of a model $P$ as follows:
$$\mathcal{W}(P;\bar{P},h,\mathcal{G}) := \sup_{g\in\mathcal{G}}\ \mathbb{E}_{s_h,a_h\sim d_h^{\pi_{\bar{P}}}}\Big[\mathbb{E}_{s_{h+1}\sim P(\cdot|s_h,a_h)}g(s_h,a_h,s_{h+1}) - \mathbb{E}_{s_{h+1}\sim P^\star(\cdot|s_h,a_h)}g(s_h,a_h,s_{h+1})\Big].$$
Namely, we use the discriminator class $\mathcal{G}$ to distinguish the two distributions $d_h^{\pi_{\bar{P}}}\circ P$ and $d_h^{\pi_{\bar{P}}}\circ P^\star$. Note that for the ground-truth model $P^\star$, the witness misfit is always zero, i.e., $\mathcal{W}(P^\star;\bar{P},h,\mathcal{G}) = 0$ for any $\bar{P}, h, \mathcal{G}$.
To further compare with Bellman rank, we make the following assumption on the discriminators:
Assumption 8.10. We assume $\mathcal{G}$ contains the functions $r + V_P^\star$ for all $P\in\mathcal{P}$, i.e.,
$$\{r(s,a) + V_P^\star(s') : P\in\mathcal{P}\}\subseteq\mathcal{G}.$$
With this assumption, we can show Bellman domination: $\mathcal{W}(P;\bar{P},h,\mathcal{G}) \ge \mathcal{E}(Q_P^\star;\pi_{\bar{P}},h)$, the average Bellman error of $Q_P^\star$ under the state-action distribution induced by the optimal policy of $\bar{P}$.
We define Witness rank as follows.

Definition 8.11 (Witness Rank). Consider the Witness misfit matrix Wh ∈ R|P|×|P| , where [Wh ]P̄ ,P := W(P ; P̄ , h, G).
Witness rank is the smallest number M such that rank(Wh ) ≤ M for all h.

Sun et al. [2019a] provides a more refined definition of Witness rank which is at least as small as the Bellman rank
under the value functions Q = {Q?P : P ∈ P}, and shows that for factored MDPs, Bellman rank with Q could be
exponentially large with respect to horizon H while witness rank remains small and captures the right complexity
measure in factored MDPs.

79
Indeed one can show that there exists a factored MDP, such that no model-free algorithms using Q as a function
class input is able to achieve polynomial sample complexity, while the algorithm OLIME which we present later,
achieves polynomially sample complexity [Sun et al., 2019a]. This demonstrates an exponential separation in sample
complexity between model-based approaches and model-free approaches.
We close here by providing the algorithm Optimism-Led Iterative Model Elimination (OLIME) below.

Algorithm 6 The OLIME algorithm for MDPs with low Witness rank
Input: model class P.
1: Initialize P0 = P.
2: for t = 1, 2, . . . , do
3: Define Pt = argmaxP ∈Pt−1 VP? (s0 ) and πt = πPt .
4: if VPπtt = V πt then return πt .
5: else
6: Update Pt = {f ∈ Pt−1 : W(P ; πt , h, G) = 0, for all h ∈ [H − 1]}.
7: end if
8: end for

The algorithm use optimism and maintains Pt in each iteration. Every iteration, it picks the most optimistic model Pt
from the current model class Pt . Again optimism allows us to identify if the algorithm has find the optimal policy.
If the algorithm does not terminate, then we are guaranteed to find a model Pt that admits non-zero witness misfit.
We then update Pt by eliminating all models that have non-zero model misfit under the distribution of πt . Again,
following the similar argument we had for OLIVE, one can show that OLIME terminates in at most M H iterations,
with M being the witness rank.

8.6 Bibliographic Remarks and Further Readings

Bellman rank was original proposed by Jiang et al. [2017]. Note that the definition of our average Bellman error is
a simple modification of the original average Bellman error definition in [Jiang et al., 2017]. Our definition allows it
to capture linear MDPs and the Bellman Completion in Linear Function Approximation, without assuming discrete
action space and paying a polynomial dependency on the number of actions.
Witness rank was proposed by Sun et al. [2019a] for model-based setting. Sun et al. [2019a] showed that witness
rank captures the correct complexity in Factored MDPs [Kearns and Koller, 1999] while Bellman rank could be
exponentially large in H when applied to Factored MDPs. Sun et al. [2019a] also demonstrates an exponential sample
complexity gap between model-based algorithms and model-free algorithms with general function approximation.

80
Part 3

Policy Optimization

81
Chapter 9

Policy Gradient Methods and Non-Convex


Optimization

For a distribution ρ over states, define:


V π (ρ) := Es0 ∼ρ [V π (s0 )] ,

where we slightly overload notation. Consider a class of parametric policies {πθ |θ ∈ Θ ⊂ Rd }. The optimization
problem of interest is:
max V πθ (ρ) .
θ∈Θ

We drop the MDP subscript in this chapter.


One immediate issue is that if the policy class {πθ } consists of deterministic policies then πθ will, in general, not be
differentiable. This motivates us to consider policy classes that are stochastic, which permit differentiability.

Example 9.1 (Softmax policies). It is instructive to explicitly consider a “tabular” policy representation, given by the
softmax policy:
exp(θs,a )
πθ (a|s) = P , (0.1)
a0 exp(θs,a )
0

where the parameter space is Θ = R|S||A| . Note that (the closure of) the set of softmax policies contains all stationary
and deterministic policies.

Example 9.2 (Log-linear policies). For any state, action pair s, a, suppose we have a feature mapping φs,a ∈ Rd . Let
us consider the policy class
exp(θ · φs,a )
πθ (a|s) = P
a ∈A exp(θ · φs,a )
0 0

with θ ∈ Rd .

Example 9.3 (Neural softmax policies). Here we may be interested in working with the policy class

exp fθ (s, a)
πθ (a|s) = P 0

a0 ∈A exp fθ (s, a )

where the scalar function fθ (s, a) may be parameterized by a neural network, with θ ∈ Rd .

83
9.1 Policy Gradient Expressions and the Likelihood Ratio Method

Let τ denote a trajectory, whose unconditional distribution Prπµ (τ ) under π with starting distribution µ, is

Prπµ (τ ) = µ(s0 )π(a0 |s0 )P (s1 |s0 , a0 )π(a1 |s1 ) · · · . (0.2)

We drop the µ subscript when it is clear from context.


It is convenient to define the discounted total reward of a trajectory as:

X
R(τ ) := γ t r(st , at )
t=0

where st , at are the state-action pairs in τ . Observe that:

V πθ (µ) = Eτ ∼Prπµθ [R(τ )] .

Theorem 9.4. (Policy gradients) The following are expressions for ∇θ V πθ (µ):

• REINFORCE:

" #
X
πθ
∇V (µ) = Eτ ∼Prπµθ R(τ ) ∇ log πθ (at |st )
t=0

• Action value expression:



" #
X
πθ t πθ
∇V (µ) = Eτ ∼Prπµθ γ Q (st , at )∇ log πθ (at |st )
t=0
1 h i
= Es∼dπθ Ea∼πθ (·|s) Qπθ (s, a)∇ log πθ (a|s)
1−γ

• Advantage expression:
1 h i
∇V πθ (µ) = Es∼dπθ Ea∼πθ (·|s) Aπθ (s, a)∇ log πθ (a|s)
1−γ

The alternative expressions are more helpful to use when we turn to Monte Carlo estimation.
Proof: We have:
X
∇V πθ (µ) = ∇ R(τ )Prπµθ (τ )
τ
X
= R(τ )∇Prπµθ (τ )
τ
X
= R(τ )Prπµθ (τ )∇ log Prπµθ (τ )
τ
X
= R(τ )Prπµθ (τ )∇ log (µ(s0 )πθ (a0 |s0 )P (s1 |s0 , a0 )πθ (a1 |s1 ) · · · )
τ

!
X X
= R(τ )Prπµθ (τ ) ∇ log πθ (at |st )
τ t=0

which completes the proof of the first claim.

84
For the second claim, for any state s0

∇V πθ (s0 )
X
= ∇ πθ (a0 |s0 )Qπθ (s0 , a0 )
a0
X  X
= ∇πθ (a0 |s0 ) Qπθ (s0 , a0 ) + πθ (a0 |s0 )∇Qπθ (s0 , a0 )
a0 a0
X  
= πθ (a0 |s0 ) ∇ log πθ (a0 |s0 ) Qπθ (s0 , a0 )
a0
X  X 
+ πθ (a0 |s0 )∇ r(s0 , a0 ) + γ P (s1 |s0 , a0 )V πθ (s1 )
a0 s1
X   X
= πθ (a0 |s0 ) ∇ log πθ (a0 |s0 ) Qπθ (s0 , a0 ) + γ πθ (a0 |s0 )P (s1 |s0 , a0 )∇V πθ (s1 )
a0 a0 ,s1
πθ
= E π
τ ∼Prs0θ [Q (s0 , a0 )∇ log πθ (a0 |s0 )] + γE π
τ ∼Prs0θ [∇V πθ (s1 )] .

By linearity of expectation,

∇V πθ (µ)
= Eτ ∼Prπµθ [Qπθ (s0 , a0 )∇ log πθ (a0 |s0 )] + γEτ ∼Prπµθ [∇V πθ (s1 )]
= Eτ ∼Prπµθ [Qπθ (s0 , a0 )∇ log πθ (a0 |s0 )] + γEτ ∼Prπµθ [Qπθ (s1 , a1 )∇ log πθ (a1 |s1 )] + . . . .

where the last step follows from recursion. This completes the proof of the second claim.
The proof of the final claim is left as an exercise to the reader.

9.2 (Non-convex) Optimization

It is worth explicitly noting that V πθ (s) is non-concave in θ for the softmax parameterizations, so the standard tools
of convex optimization are not applicable.

Lemma 9.5. (Non-convexity) There is an MDP M (described in Figure 0.1) such that the optimization problem V πθ (s)
is not concave for both the direct and softmax parameterizations.

Proof: Recall the MDP in Figure 0.1. Note that since actions in terminal states s3 , s4 and s5 do not change the
expected reward, we only consider actions in states s1 and s2 . Let the ”up/above” action as a1 and ”right” action as
a2 . Note that
V π (s1 ) = π(a2 |s1 )π(a1 |s2 ) · r

Now consider
θ(1) = (log 1, log 3, log 3, log 1), θ(2) = (− log 1, − log 3, − log 3, − log 1)
where θ is written as a tuple (θa1 ,s1 , θa2 ,s1 , θa1 ,s2 , θa2 ,s2 ). Then, for the softmax parameterization, we have:

3 3 9
π (1) (a2 |s1 ) = ; π (1) (a1 |s2 ) = ; V (1) (s1 ) = r
4 4 16
and
1 1 1
π (2) (a2 |s1 ) = ; π (2) (a1 |s2 ) = ; V (2) (s1 ) = r.
4 4 16

85
0 0

s3 s4

0
0 r>0

s1 0 s2 0 s5

Figure 0.1: (Non-concavity example) A deterministic MDP corresponding to Lemma 9.5 where V πθ (s) is not concave.
Numbers on arrows represent the rewards for each action.

θ (1) +θ (2)
Also, for θ(mid) = 2 ,
1 1 1
π (mid) (a2 |s1 ) = ; π (mid) (a1 |s2 ) = ; V (mid) (s1 ) = r.
2 2 4

This gives
V (1) (s1 ) + V (2) (s1 ) > 2V (mid) (s1 ),
which shows that V π is non-concave.

9.2.1 Gradient ascent and convergence to stationary points

Let us say a function f : Rd → R is β -smooth if


k∇f (w) − ∇f (w0 )k ≤ βkw − w0 k , (0.3)
where the norm k · k is the Euclidean norm. In other words, the derivatives of f do not change too quickly.
Gradient ascent, with a fixed stepsize η, follows the update rule:
θt+1 = θt + η∇V πθt (µ) .
It is convenient to use the shorthand notation:
π (t) := πθt , V (t) := V πθt

The next lemma is standard in non-convex optimization.


Lemma 9.6. (Convergence to Stationary Points) Assume that for all θ ∈ Θ, V πθ is β-smooth and bounded below by
V∗ . Suppose we use the constant stepsize η = 1/β. For all T , we have that

2β(V ∗ (µ) − V (0) (µ))


min k∇V (t) (µ)k2 ≤ .
t≤T T

9.2.2 Monte Carlo estimation and stochastic gradient ascent

One difficulty is that even if we know the MDP M , computing the gradient may be computationally intensive. It turns
out that we can obtain unbiased estimates of π with only simulation based access to our model, i.e. assuming we can
obtain sampled trajectories τ ∼ Prπµθ .

86
With respect to a trajectory τ , define:

X 0
Q
dπθ (s , a )
t t := γ t −t r(st0 , at0 )
t0 =t
X∞
∇V
\ πθ (µ) := γtQ
dπθ (s , a )∇ log π (a |s )
t t θ t t
t=0

We now show this provides an unbiased estimated of the gradient:


Lemma 9.7. (Unbiased gradient estimate) We have :
h i
E πθ ∇V
\ πθ (µ) = ∇V πθ (µ)
τ ∼Prµ

Proof: Observe:
"∞ #
X
πθ (µ)] td
πθ
E[∇V
\ = E γ Q (st , at )∇ log πθ (at |st )
t=0
"∞ #
(a) X
t
= E γ E[Qπθ (s , a )|s , a ]∇ log π (a |s )
t t t t θ t t
d
t=0

" #
(b) X
t πθ
= E γ Q (st , at )∇ log πθ (at |st )
t=0

where (a) follows from the tower property of the conditional expectations and (b) follows from that the Markov
property implies E[Qπθ (s , a )|s , a ] = Qπθ (s , a ).
t t t t t t
d

Hence, the following procedure is a stochastic gradient ascent algorithm:

1. initialize θ0 .
2. For t = 0, 1, . . .
(a) Sample τ ∼ Prπµθ .
(b) Update:
θt+1 = θt + ηt ∇V
\ πθ (µ)

\
where ηt is the stepsize and ∇V πθ (µ) estimated with τ .

Note here that we are ignoring that τ is an infinte length sequence. It can be truncated appropriately so as to control
the bias.
The following is standard result with regards to non-convex optimization. Again, with reasonably bounded variance,
we will obtain a point θt with small gradient norm.
Lemma 9.8. (Stochastic Convergence to Stationary Points) Assume that for all θ ∈ Θ, V πθ is β-smooth and bounded
below by V∗ . Suppose the variance is bounded as follows:
E[k∇V
\ πθ (µ) − ∇V πθ (µ)k2 ] ≤ σ 2

∗ (0) 2
p t ≤ β(V (µ) − V (µ))/σ , suppose we use a constant stepsize of ηt = 1/β, and thereafter, we use ηt =
For
2/(βT ). For all T , we have:
r
(t) 2 2β(V ∗ (µ) − V (0) (µ)) 2σ 2
min E[k∇V (µ)k ] ≤ + .
t≤T T T

87
Baselines and stochastic gradient ascent

A significant practical issue is that the variance σ 2 is often large in practice. Here, a form of variance reduction is
often critical in practice. A common method is as follows.
Let f : S → R.

1. Construct f as an estimate of V πθ (µ). This can be done using any previous data.
2. Sample a new trajectory τ , and define:

X 0
Q
dπθ (s , a )
t t := γ t −t r(st0 , at0 )
t0 =t
X∞  
∇V
\ πθ (µ) := γt Qπθ (s , a ) − f (s ) ∇ log π (a |s )
d t t t θ t t
t=0

We often refer to f (s) as a baseline at state s.


Lemma 9.9. (Unbiased gradient estimate with Variance Reduction) For any procedure used to construct to the base-
line function f : S → R, if the samples used to construct f are independent of the trajectory τ , where Qπθ (s , a ) is
t t
d
constructed using τ , then:
"∞ #
X  
E γ Qπθ (st , at ) − f (st ) ∇ log πθ (at |st ) = ∇V πθ (µ)
t d

t=0

where the expectation is with respect to both the random trajectory τ and the random function f (·).

Proof: For any function g(s),


X X X
E [∇ log π(a|s)g(s)] = ∇π(a|s)g(s) = g(s) ∇π(a|s) = g(s)∇ π(a|s) = g(s)∇1 = 0
a a a

Using that f (·) is independent of τ , we have that for all t


"∞ #
X
t
E γ f (st )∇ log πθ (at |st ) = 0
t=0

The result now follow froms Lemma 9.7.

9.3 Bibliographic Remarks and Further Readings

The REINFOCE algorithm is due to [Williams, 1992], which is an example of the likelihood ratio method for gradient
estimation [Glynn, 1990].
For standard optimization results in non-convex optimization (e.g. Lemma 9.6 and 9.8), we refer the reader to [Beck,
2017]. Our results for convergence rates for SGD to approximate stationary points follow from [Ghadimi and Lan,
2013].

88
Chapter 10

Optimality

We now seek to understand the global convergence properties of policy gradient methods, when given exact gradients.
Here, we will largely limit ourselves to the (tabular) softmax policy class in Example 9.1.
Given our a starting distribution ρ over states, recall our objective is:

max V πθ (ρ) .
θ∈Θ

where {πθ |θ ∈ Θ ⊂ Rd } is some class of parametric policies.


While we are interested in good performance under ρ, we will see how it is helpful to optimize under a different
measure µ. Specifically, we consider optimizing V πθ (µ), i.e.

max V πθ (µ) ,
θ∈Θ

even though our ultimate goal is performance under V πθ (ρ).


We now consider the softmax policy parameterization (0.1). Here, we still have a non-concave optimization problem
in general, as shown in Lemma 9.5, though we do show that global optimality can be reached under certain regu-
larity conditions. From a practical perspective, the softmax parameterization of policies is preferable to the direct
parameterization, since the parameters θ are unconstrained and standard unconstrained optimization algorithms can
be employed. However, optimization over this policy class creates other challenges as we study in this section, as the
optimal policy (which is deterministic) is attained by sending the parameters to infinity.
This chapter will study three algorithms for this problem, for the softmax policy class. The first performs direct policy
gradient ascent on the objective without modification, while the second adds a log barrier regularizer to keep the
parameters from becoming too large, as a means to ensure adequate exploration. Finally, we study the natural policy
gradient algorithm and establish a global optimality convergence rate, with no dependence on the dimension-dependent
factors.
The presentation in this chapter largely follows the results in [Agarwal et al., 2020d].

10.1 Vanishing Gradients and Saddle Points

To understand the necessity of optimizing under a distribution µ that is different from ρ, let us first give an informal
argument that some condition on the state distribution of π, or equivalently µ, is necessary for stationarity to imply

89
a1
a4 a4
a3 a3
a2 a2
s0 s1 ··· sH sH+1
a1
a1 a1

Figure 0.1: (Vanishing gradient example) A deterministic, chain MDP of length H + 2. We consider a policy where
π(a|si ) = θsi ,a for i = 1, 2, . . . , H. Rewards are 0 everywhere other than r(sH+1 , a1 ) = 1. See Proposition 10.1.

optimality. For example, in a sparse-reward MDP (where the agent is only rewarded upon visiting some small set of
states), a policy that does not visit any rewarding states will have zero gradient, even though it is arbitrarily suboptimal
in terms of values. Below, we give a more quantitative version of this intuition, which demonstrates that even if π
chooses all actions with reasonable probabilities (and hence the agent will visit all states if the MDP is connected),
then there is an MDP where a large fraction of the policies π have vanishingly small gradients, and yet these policies
are highly suboptimal in terms of their value.
Concretely, consider the chain MDP of length H + 2 shown in Figure 0.1. The starting state of interest is state s0 and
the discount factor γ = H/(H + 1). Suppose we work with the direct parameterization, where πθ (a|s) = θs,a for
a = a1 , a2 , a3 and πθ (a4 |s) = 1 − θs,a1 − θs,a2 − θs,a3 . Note we do not over-parameterize the policy. For this MDP
and policy structure, if we were to initialize the probabilities over actions, say deterministically, then there is an MDP
(obtained by permuting the actions) where all the probabilities for a1 will be less than 1/4.
The following result not only shows that the gradient is exponentially small in H, it also shows that many higher order
derivatives, up to O(H/ log H), are also exponentially small in H.
Proposition 10.1 (Vanishing gradients at suboptimal parameters). Consider the chain MDP of Figure 0.1, with H + 2
states, γ = H/(H + 1), and with the direct policy parameterization (with 3|S| parameters, as described in the text
H
above). Suppose θ is such that 0 < θ < 1 (componentwise) and θs,a1 < 1/4 (for all states s). For all k ≤ 40 log(2H) −1,
we have ∇kθ V πθ (s0 ) ≤ (1/3)H/4 , where ∇kθ V πθ (s0 ) is a tensor of the kth order derivatives of V πθ (s0 ) and the
norm is the operator norm of the tensor.1 Furthermore, V ? (s0 ) − V πθ (s0 ) ≥ (H + 1)/8 − (H + 1)2 /3H .

We do not prove this lemma here (see Section 10.5). The lemma illustrates that lack of good exploration can indeed be
detrimental in policy gradient algorithms, since the gradient can be small either due to π being near-optimal, or, simply
because π does not visit advantageous states often enough. Furthermore, this lemma also suggests that varied results
in the non-convex optimization literature, on escaping from saddle points, do not directly imply global convergence
due to that the higher order derivatives are small.
While the chain MDP of Figure 0.1, is a common example where sample based estimates of gradients will be 0 under
random exploration strategies; there is an exponentially small in H chance of hitting the goal state under a random
exploration strategy. Note that this lemma is with regards to exact gradients. This suggests that even with exact
computations (along with using exact higher order derivatives) we might expect numerical instabilities.

10.2 Policy Gradient Ascent

Let us now return to the softmax policy class, from Equation 0.1, where:
exp(θs,a )
πθ (a|s) = P ,
a0 exp(θs,a )
0

⊗k
1 The operator norm of a kth -order tensor J ∈ Rd is defined as supu1 ,...,uk ∈Rd : kui k2 =1 hJ, u1 ⊗ . . . ⊗ ud i.

90
where the number of parameters in this policy class is |S||A|.
Observe that:
∂ log πθ (a|s) h i h i 
= 1 s = s0 1 a = a0 − πθ (a0 |s)
∂θs0 ,a0
where 1[E] is the indicator of E being true.

Lemma 10.2. For the softmax policy class, we have:

∂V πθ (µ) 1
= dπθ (s)πθ (a|s)Aπθ (s, a) (0.1)
∂θs,a 1−γ µ

Proof: Using the advantage expression for the policy gradient (see Theorem 9.4),

∂V πθ (µ) 1 h ∂ log πθ (a|s) i


= Es∼dπµθ Ea∼πθ (·|s) Aπθ (s, a)
∂θs0 ,a0 1−γ ∂θs0 ,a0
1 h h i h i i
= Es∼dπµθ Ea∼πθ (·|s) Aπθ (s, a)1 s = s0 1 a = a0 − πθ (a0 |s)
1−γ
1 h  h i i
= dπµθ (s0 )Ea∼πθ (·|s0 ) Aπθ (s0 , a) 1 a = a0 − πθ (a0 |s0 )
1−γ
1  h i h i
dπµθ (s0 ) Ea∼πθ (·|s0 ) Aπθ (s0 , a)1 a = a0 − πθ (a0 |s0 )Ea∼πθ (·|s0 ) Aπθ (s0 , a)

=
1−γ
1
= dπθ (s0 )πθ (a0 |s0 )Aπθ (s0 , a0 ) − 0 ,
1−γ µ

where the last step uses that for any policy a π(a|s)Aπ (s, a) = 0.
P

The update rule for gradient ascent is:


θ(t+1) = θ(t) + η∇θ V (t) (µ). (0.2)

Recall from Lemma 9.5 that, even for the case of the softmax policy class (which contains all stationary policies), our
optimization problem is non-convex. Furthermore, due to the exponential scaling with the parameters θ in the softmax
parameterization, any policy that is nearly deterministic will have gradients close to 0. Specifically, for any sequence
θt
of policies π θt that becomes deterministic, k∇V π k → 0.
In spite of these difficulties, it turns out we have a positive result that gradient ascent asymptotically converges to the
global optimum for the softmax parameterization.

Theorem 10.3 (Global convergence for softmax parameterization). Assume we follow the gradient ascent update rule
as specified in Equation (0.2) and that the distribution µ is strictly positive i.e. µ(s) > 0 for all states s. Suppose
3
η ≤ (1−γ)
8 , then we have that for all states s, V (t) (s) → V ? (s) as t → ∞.

The proof is somewhat technical, and we do not provide a proof here (see Section 10.5).
A few remarks are in order. Theorem 10.3 assumed that optimization distribution µ was strictly positive, i.e. µ(s) > 0
for all states s. We conjecture that any gradient ascent may not globally converge if this condition is not met. The
concern is that if this condition is not met, then gradient ascent may not globally converge due to that dπµθ (s) effectively
scales down the learning rate for the parameters associated with state s (see Equation 0.1).
Furthermore, there is strong reason to believe that the convergence rate for this is algorithm (in the worst case) is
exponentially slow in some of the relevant quantities, such as in terms of the size of state space. We now turn to a
regularization based approach to ensure convergence at a polynomial rate in all relevant quantities.

91
10.3 Log Barrier Regularization

Due to the exponential scaling with the parameters θ, policies can rapidly become near deterministic, when optimizing
under the softmax parameterization, which can result in slow convergence. Indeed a key challenge in the asymptotic
analysis in the previous section was to handle the growth of the absolute values of parameters as they tend to infinity.
Recall that the relative-entropy for distributions p and q is defined as:

KL(p, q) := Ex∼p [− log q(x)/p(x)].

Denote the uniform distribution over a set X by UnifX , and define the following log barrier regularized objective as:
 
πθ
Lλ (θ) := V (µ) − λ Es∼UnifS KL(UnifA , πθ (·|s))

λ X
= V πθ (µ) + log πθ (a|s) + λ log |A| , (0.3)
|S| |A| s,a

where λ is a regularization parameter. The constant (i.e. the last term) is not relevant with regards to optimization.
This regularizer is different from the more commonly utilized entropy regularizer, a point which we return to later.
The policy gradient ascent updates for Lλ (θ) are given by:

θ(t+1) = θ(t) + η∇θ Lλ (θ(t) ). (0.4)

We now see that any approximate first-order stationary points of the entropy-regularized objective is approximately
globally optimal, provided the regularization is sufficiently small.
Theorem 10.4. (Log barrier regularization) Suppose θ is such that:

k∇θ Lλ (θ)k2 ≤ opt

and opt ≤ λ/(2|S| |A|). Then we have that for all starting state distributions ρ:
?

πθ ? 2λ dπρ
V (ρ) ≥ V (ρ) − .
1−γ µ

?

ρ
We refer to µ as the distribution mismatch coefficient. The above theorem shows the importance of having an

appropriate measure µ(s) in order for the approximate first-order stationary points to be near optimal.
Proof: The proof consists of showing that maxa Aπθ (s, a) ≤ 2λ/(µ(s)|S|) for all states. To see that this is sufficient,
observe that by the performance difference lemma (Lemma 1.16),
1 X π?
V ? (ρ) − V πθ (ρ) = d (s)π ? (a|s)Aπθ (s, a)
1 − γ s,a ρ
1 X π?
≤ d (s) max Aπθ (s, a)
1−γ s ρ a∈A

1 X ?
≤ 2dπρ (s)λ/(µ(s)|S|)
1−γ s
?
!
2λ dπρ (s)
≤ max .
1−γ s µ(s)

92
which would then complete the proof.
We now proceed to show that maxa Aπθ (s, a) ≤ 2λ/(µ(s)|S|). For this, it suffices to bound Aπθ (s, a) for any state-
action pair s, a where Aπθ (s, a) ≥ 0 else the claim is trivially true. Consider an (s, a) pair such that Aπθ (s, a) > 0.
Using the policy gradient expression for the softmax parameterization (see Equation 0.1),
 
∂Lλ (θ) 1 λ 1
= dπµθ (s)πθ (a|s)Aπθ (s, a) + − πθ (a|s) . (0.5)
∂θs,a 1−γ |S| |A|

The gradient norm assumption k∇θ Lλ (θ)k2 ≤ opt implies that:


 
∂Lλ (θ) 1 λ 1
opt ≥ = dπθ (s)πθ (a|s)Aπθ (s, a) + − πθ (a|s)
∂θs,a 1−γ µ |S| |A|
 
λ 1
≥ − πθ (a|s) ,
|S| |A|

where we have used Aπθ (s, a) ≥ 0. Rearranging and using our assumption opt ≤ λ/(2|S| |A|),

1 opt |S| 1
πθ (a|s) ≥ − ≥ .
|A| λ 2|A|

Solving for Aπθ (s, a) in (0.5), we have:


  
πθ 1−γ 1 ∂Lλ (θ) λ 1
A (s, a) = + 1−
dπµθ (s) πθ (a|s) ∂θs,a |S| πθ (a|s)|A|
 
1−γ λ
≤ 2|A|opt +
dπµθ (s) |S|
1−γ λ
≤ 2 πθ
dµ (s) |S|
≤ 2λ/(µ(s)|S|) ,

where the penultimate step uses opt ≤ λ/(2|S| |A|) and the final step uses dπµθ (s) ≥ (1 − γ)µ(s). This completes the
proof.
The policy gradient ascent updates for Lλ (θ) are given by:

θ(t+1) = θ(t) + η∇θ Lλ (θ(t) ). (0.6)

By combining the above theorem with the convergence of gradient ascent to first order stationary points (Lemma 9.6),
we obtain the following corollary.
8γ 2λ
Corollary 10.5. (Iteration complexity with log barrier regularization) Let βλ := (1−γ)3 + |S| . Starting from any
(0) (1−γ)
initial θ , consider the updates (0.6) with λ = dπ
? and η = 1/βλ . Then for all starting state distributions ρ,
ρ
2 µ

we have
? 2
n o 320|S|2 |A|2 dπρ
min V ? (ρ) − V (t) (ρ) ≤  whenever T ≥ .
t<T (1 − γ)6 2 µ

The corollary shows the importance of balancing how the regularization parameter λ is set relative to the desired
accuracy , as well as the importance of the initial distribution µ to obtain global optimality.

93
Proof:[of Corollary 10.5] Let βλ be the smoothness of Lλ (θ). A valid upper bound on βλ is:
8γ 2λ
βλ = + ,
(1 − γ)3 |S|
where we leave the proof as an exercise to the reader.
(1−γ)
Using Theorem 10.4, the desired optimality gap  will follow if we set λ = dπ
? and if k∇θ Lλ (θ)k2 ≤
ρ
2 µ

λ/(2|S| |A|). In order to complete the proof, we need to bound the iteration complexity of making the gradient
sufficiently small.
By Lemma 9.6, after T iterations of gradient ascent with stepsize of 1/βλ , we have
2 2βλ (Lλ (θ? ) − Lλ (θ(0) )) 2βλ
min ∇θ Lλ (θ(t) ) ≤ ≤ , (0.7)
t≤T 2 T (1 − γ) T
where βλ is an upper bound on the smoothness of Lλ (θ). We seek to ensure
s
2βλ λ
opt ≤ ≤
(1 − γ) T 2|S| |A|

8βλ |S|2 |A|2


Choosing T ≥ (1−γ) λ2 satisfies the above inequality. Hence,

8βλ |S|2 |A|2 64 |S|2 |A|2 16 |S||A|2


≤ +
(1 − γ) λ2 (1 − γ)4 λ2 (1 − γ) λ
80 |S|2 |A|2

(1 − γ)4 λ2
? 2
320 |S|2 |A|2 dπρ
=
(1 − γ)6 2 µ

where we have used that λ < 1. This completes the proof.

Entropy vs. log barrier regularization. A commonly considered regularizer is the entropy, where the regularizer
would be:
1 X 1 XX
H(πθ (·|s)) = −πθ (a|s) log πθ (a|s).
|S| s |S| s a
Note the entropy is far less aggressive in penalizing small probabilities, in comparison to the log barrier, which is
equivalent to the relative entropy. In particular, the entropy regularizer is always bounded between 0 and log |A|,
while the relative entropy (against the uniform distribution over actions), is bounded between 0 and infinity, where it
tends to infinity as probabilities tend to 0. Here, it can be shown that he convergence rate is asymptotically O(1) (see
Section 10.5) though it is unlikely that the convergence rate for this method is polynomial in other relevant quantities,
including |S|, |A|, 1/(1 − γ), and the distribution mismatch coefficient. The polynomial convergence rate using
the log barrier (KL) regularizer crucially relies on the aggressive nature in which the relative entropy prevents small
probabilities.

10.4 The Natural Policy Gradient

Observe that a policy constitutes a family of probability distributions {πθ (·|s)|s ∈ S}. We now consider a pre-
conditioned gradient descent method based on this family of distributions. Recall that the Fisher information matrix

94
of a parameterized density pθ (x) is defined as Ex∼pθ ∇ log pθ (x)∇ log pθ (x)> . Now we let us define Fρθ as an
 

(average) Fisher information matrix on the family of distributions {πθ (·|s)|s ∈ S} as follows:
Fρθ := Es∼dπρ θ Ea∼πθ (·|s) (∇ log πθ (a|s))∇ log πθ (a|s)> .
 

Note that the average is under the state-action visitation frequencies. The NPG algorithm performs gradient updates
in the geometry induced by this matrix as follows:
θ(t+1) = θ(t) + ηFρ (θ(t) )† ∇θ V (t) (ρ), (0.8)
where M † denotes the Moore-Penrose pseudoinverse of the matrix M .
Throughout this section, we restrict to using the initial state distribution ρ ∈ ∆(S) in our update rule in (0.8) (so
our optimization measure µ and the performance measure ρ are identical). Also, we restrict attention to states s ∈ S
reachable from ρ, since, without loss of generality, we can exclude states that are not reachable under this start state
distribution2 .
For the softmax parameterization, this method takes a particularly convenient form; it can be viewed as a soft policy
iteration update.
Lemma 10.6. (Softmax NPG as soft policy iteration) For the softmax parameterization (0.1), the NPG updates (0.8)
take the form:

(t+1) (t) η (t) (t+1) (t) exp ηA(t) (s, a)/(1 − γ)
θ =θ + A + ηv and π (a|s) = π (a|s) ,
1−γ Zt (s)

where Zt (s) = a∈A π (t) (a|s) exp ηA(t) (s, a)/(1 − γ) . Here, v is only a state dependent offset (i.e. vs,a = cs for
P
some cs ∈ R for each state s), and, owing to the normalization Zt (s), v has no effect on the update rule.

It is important to note that while the ascent direction was derived using the gradient ∇θ V (t) (ρ), which depends on
ρ, the NPG update rule actually has no dependence on the measure ρ. Furthermore, there is no dependence on the
(t)
state distribution dρ , which is due to the pseudoinverse of the Fisher information cancelling out the effect of the state
distribution in NPG.
Proof: By definition of the Moore-Penrose pseudoinverse, we have that (Fρθ )† ∇V πθ (ρ) = w? if an only if w? is the
minimum norm solution of:
min k∇V πθ (ρ) − Fρθ wk2 .
w

Let us first evaluate Fρθ w. For the softmax policy parameterization, Lemma 0.1 implies:
X
w> ∇θ log πθ (a|s) = ws,a − ws,a0 πθ (a0 |s) := ws,a − ws
a0 ∈A

where ws is not a function of a. This implies that:


  
θ >
Fρ w = Es∼dρ θ Ea∼πθ (·|s) ∇ log πθ (a|s) w ∇θ log πθ (a|s)
π

   h i
= Es∼dρ θ Ea∼πθ (·|s) ∇ log πθ (a|s) ws,a − ws
π = Es∼dπρ θ Ea∼πθ (·|s) ws,a ∇ log πθ (a|s) ,

where the last equality uses that ws is not a function of s. Again using the functional form of derivative of the softmax
policy parameterization, we have:
   
θ πθ 0 0 0
Fρ w = d (s )πθ (a |s ) ws0 ,a0 − ws0 .
s0 ,a0
2 Specifically, we restrict the MDP to the set of states {s ∈ S : ∃π such that dπ
ρ (s) > 0}.

95
This implies:

X  2
1
k∇V πθ (ρ) − Fρθ wk2 = dπθ (s)πθ (a|s) Aπθ (s, a) − Fρθ w s,a
 
s,a
1−γ
 !2
X 1 X
= dπθ (s)πθ (a|s) Aπθ (s, a) − ws,a − ws,a0 πθ (a0 |s) .
s,a
1−γ 0 a ∈A

1 1
Due to that w = 1−γ Aπθ leads to 0 error, the above implies that all 0 error solutions are of the form w = 1−γ Aπθ + v,
where v is only a state dependent offset (i.e. vs,a = cs for some cs ∈ R for each state s). The first claim follows due to
that the minimum norm solution is one of these solutions. The proof of the second claim now follows by the definition
of the NPG update rule, along with that v has no effect on the update rule due to the normalization constant Zt (s).
We now see that this algorithm enjoys a dimension free convergence rate.

Theorem 10.7 (Global convergence for NPG). Suppose we run the NPG updates (0.8) using ρ ∈ ∆(S) and with
θ(0) = 0. Fix η > 0. For all T > 0, we have:

log |A| 1
V (T ) (ρ) ≥ V ∗ (ρ) − − .
ηT (1 − γ)2 T

Note in the above the theorem that the NPG algorithm is directly applied to the performance measure V π (ρ), and
the guarantees are also with respect to ρ. In particular, there is no distribution mismatch coefficient in the rate of
convergence.
Now setting η ≥ (1 − γ)2 log |A|, we see that NPG finds an -optimal policy in a number of iterations that is at most:

2
T ≤ ,
(1 − γ)2 

which has no dependence on the number of states or actions, despite the non-concavity of the underlying optimization
problem.
The proof strategy we take borrows ideas from the classical multiplicative weights algorithm (see Section!10.5).
First, the following improvement lemma is helpful:

Lemma 10.8 (Improvement lower bound for NPG). For the iterates π (t) generated by the NPG updates (0.8), we have
for all starting state distributions µ

(1 − γ)
V (t+1) (µ) − V (t) (µ) ≥ Es∼µ log Zt (s) ≥ 0.
η

Proof: First, let us show that log Zt (s) ≥ 0. To see this, observe:
X
log Zt (s) = log π (t) (a|s) exp(ηA(t) (s, a)/(1 − γ))
a
X η X (t)
≥ π (t) (a|s) log exp(ηA(t) (s, a)/(1 − γ)) = π (a|s)A(t) (s, a) = 0.
a
1−γ a

π (t) (a|s)A(t) (s, a) = 0. By the


P
where we have used Jensen’s inequality on the concave function log x and that a

96
performance difference lemma,
1 X
V (t+1) (µ) − V (t) (µ) = Es∼d(t+1) π (t+1) (a|s)A(t) (s, a)
1−γ µ
a
1 X π (t+1) (a|s)Zt (s)
= Es∼d(t+1) π (t+1) (a|s) log
η µ
a
π (t) (a|s)
1 1
= E (t+1) KL(πs(t+1) ||πs(t) ) + Es∼d(t+1) log Zt (s)
η s∼dµ η µ

1 1−γ
≥ Es∼d(t+1) log Zt (s) ≥ Es∼µ log Zt (s),
η µ η
(t+1)
where the last step uses that dµ ≥ (1 − γ)µ (by (0.5)) and that log Zt (s) ≥ 0.
With this lemma, we now prove Theorem 10.7.
?
Proof:[of Theorem 10.7] Since ρ is fixed, we use d? as shorthand for dπρ ; we also use πs as shorthand for the vector
of π(·|s). By the performance difference lemma (Lemma 1.16),
? 1 X
V π (ρ) − V (t) (ρ) = Es∼d? π ? (a|s)A(t) (s, a)
1−γ a
1 X π (t+1) (a|s)Zt (s)
= Es∼d? π ? (a|s) log
η a
π (t) (a|s)
!
1 X

= Es∼d? KL(πs? ||πs(t) ) − KL(πs? ||πs(t+1) ) + π (a|s) log Zt (s)
η a
1  
= Es∼d? KL(πs? ||πs(t) ) − KL(πs? ||πs(t+1) ) + log Zt (s) ,
η
where we have used the closed form of our updates from Lemma 10.6 in the second step.
By applying Lemma 10.8 with d? as the starting state distribution, we have:
1 1  (t+1) ? 
Es∼d? log Zt (s) ≤ V (d ) − V (t) (d? )
η 1−γ
which gives us a bound on Es∼d? log Zt (s).
Using the above equation and that V (t+1) (ρ) ≥ V (t) (ρ) (as V (t+1) (s) ≥ V (t) (s) for all states s by Lemma 10.8), we
have:
T −1
π? (T −1) 1 X π?
V (ρ) − V (ρ) ≤ (V (ρ) − V (t) (ρ))
T t=0
T −1 T −1
1 X 1 X
= Es∼d? (KL(πs? ||πs(t) ) − KL(πs? ||πs(t+1) )) + Es∼d? log Zt (s)
ηT t=0 ηT t=0
T −1 
Es∼d? KL(πs? ||π (0) ) 1 X 
≤ + V (t+1) (d? ) − V (t) (d? )
ηT (1 − γ)T t=0
Es∼d? KL(πs? ||π (0) ) V (T ) (d? ) − V (0) (d? )
= +
ηT (1 − γ)T
log |A| 1
≤ + .
ηT (1 − γ)2 T
The proof is completed using that V (T ) (ρ) ≥ V (T −1) (ρ).

97
10.5 Bibliographic Remarks and Further Readings

The natural policy gradient method was originally presented in [Kakade, 2001]; a number of arguments for this method
have been provided based on information geometry [Kakade, 2001, Bagnell and Schneider, 2003, Peters and Schaal,
2008].
The convergence rates in this chapter are largely derived from [Agarwal et al., 2020d]. The proof strategy for the NPG
analysis has origins in the online regret framework in changing MDPs [Even-Dar et al., 2009], which would result in
a worst rate in comparison to [Agarwal et al., 2020d]. This observation that the proof strategy from [Even-Dar et al.,
2009] provided a convergence rate for the NPG was made in [Neu et al., 2017]. The faster NPG rate we present here
is due to [Agarwal et al., 2020d]. The analysis of the MD-MPI algorithm [Geist et al., 2019] also implies a O(1/T )
rate for the NPG, though with worse dependencies on other parameters.
Building on ideas in [Agarwal et al., 2020d], [Mei et al., 2020] showed that, for the softmax policy class, both the
gradient ascent and entropy regularized gradient ascent asymptotically converge at a O(1/t); it is unlikely these meth-
ods are have finite rate which are polynomial in other quantities (such as the |S|, |A|, 1/(1 − γ), and the distribution
mismatch coefficient).
[Mnih et al., 2016] introduces the entropy regularizer (also see [Ahmed et al., 2019] for a more detailed empirical
investigation).

98
Chapter 11

Function Approximation and the NPG

We now analyze the case of using parametric policy classes:

Π = {πθ | θ ∈ Rd },

where Π may not contain all stochastic policies (and it may not even contain an optimal policy). In contrast with the
tabular results in the previous sections, the policy classes that we are often interested in are not fully expressive, e.g.
d  |S||A| (indeed |S| or |A| need not even be finite for the results in this section); in this sense, we are in the regime
of function approximation.
We focus on obtaining agnostic results, where we seek to do as well as the best policy in this class (or as well as some
other comparator policy). While we are interested in a solution to the (unconstrained) policy optimization problem

max V πθ (ρ),
θ∈Rd

(for a given initial distribution ρ), we will see that optimization with respect to a different distribution will be helpful,
just as in the tabular case,
We will consider variants of the NPG update rule (0.8):

θ ← θ + ηFρ (θ)† ∇θ V θ (ρ) . (0.1)

Our analysis will leverage a close connection between the NPG update rule (0.8) with the notion of compatible function
approximation. We start by formalizing this connection. The compatible function approximation error also provides a
measure of the expressivity of our parameterization, allowing us to quantify the relevant notion of approximation error
for the NPG algorithm.
The main results in this chapter establish the effectiveness of NPG updates where there is error both due to statistical
estimation (where we may not use exact gradients) and approximation (due to using a parameterized function class).
We will see an precise estimation/approximation decomposition based on the compatible function approximation error.
The presentation in this chapter largely follows the results in [Agarwal et al., 2020d].

11.1 Compatible function approximation and the NPG

We now introduce the notion of compatible function approximation, which both provides some intuition with regards
to policy gradient methods and it will help us later on with regards to characterizing function approximation.

99
Lemma 11.1 (Gradients and compatible function approximation). Let w? denote the following minimizer:
 
2
w? ∈ Es∼dπµθ Ea∼πθ (·|s) Aπθ (s, a) − w · ∇θ log πθ (a|s) ,

where the squared error above is referred to as the compatible function approximation. Denote the best linear predic-
tor of Aπθ (s, a) using ∇θ log πθ (a|s) by A
bπθ (s, a), i.e.

bπθ (s, a) := w? · ∇θ log πθ (a|s).


A

We have that:
1
∇θ V πθ (µ) bπθ (s, a) .
 
= E πθ Ea∼πθ (·|s) ∇θ log πθ (a|s)A
1 − γ s∼dµ

Proof: The first order optimality conditions for w? imply


h i
Es∼dπµθ Ea∼πθ (·|s) Aπθ (s, a) − w? · ∇θ log πθ (a|s) ∇θ log πθ (a|s) = 0

(0.2)

bπθ (s, a),


Rearranging and using the definition of A
1
∇θ V πθ (µ) Es∼dπµθ Ea∼πθ (·|s) ∇θ log πθ (a|s)Aπθ (s, a)
 
=
1−γ
1 
bπθ (s, a) ,

= Es∼dπµθ Ea∼πθ (·|s) ∇θ log πθ (a|s)A
1−γ
which completes the proof.
The next lemma shows that the weight vector above is precisely the NPG ascent direction. Precisely,
Lemma 11.2. We have that:
1
Fρ (θ)† ∇θ V θ (ρ) = w? , (0.3)
1−γ
where w? is a minimizer of the following regression problem:

w? ∈ argminw Es∼dπρ θ ,a∼πθ (·|s) (w> ∇θ log πθ (·|s) − Aπθ (s, a))2 .
 

Proof: The above is a straightforward consequence of the first order optimality conditions (see Equation 0.2). Specifi-
cally, Equation 0.2, along with the advantage expression for the policy gradient (see Theorem 9.4), imply that w? must
satisfy:
1
∇θ V θ (ρ) = Fρ (θ)w?
1−γ
which completes the proof.
This lemma implies that we might write the NPG update rule as:
η
θ←θ+ w? .
1−γ
where w? is minimizer of the compatible function approximation error (which depends on θ.
The above regression problem can be viewed as “compatible” function approximation: we are approximating Aπθ (s, a)
using the ∇θ log πθ (·|s) as features. We also consider a variant of the above update rule, Q-NPG, where instead of us-
ing advantages in the above regression we use the Q-values. This viewpoint provides a methodology for approximate
updates, where we can solve the relevant regression problems with samples.

100
11.2 Examples: NPG and Q-NPG

In practice, the most common policy classes are of the form:


(  )
exp fθ (s, a) d
Π = πθ (a|s) = P  θ∈R , (0.4)
0
a0 ∈A exp fθ (s, a )

where fθ is a differentiable function. For example, the tabular softmax policy class is one where fθ (s, a) = θs,a .
Typically, fθ is either a linear function or a neural network. Let us consider the NPG algorithm, and a variant Q-NPG,
in each of these two cases.

11.2.1 Log-linear Policy Classes and Soft Policy Iteration

For any state-action pair (s, a), suppose we have a feature mapping φs,a ∈ Rd . Each policy in the log-linear policy
class is of the form:
exp(θ · φs,a )
πθ (a|s) = P ,
a ∈A exp(θ · φs,a )
0 0

with θ ∈ Rd . Here, we can take fθ (s, a) = θ · φs,a .


With regards to compatible function approximation for the log-linear policy class, we have:
θ θ
∇θ log πθ (a|s) = φs,a , where φs,a = φs,a − Ea0 ∼πθ (·|s) [φs,a0 ],
θ
that is, φs,a is the centered version of φs,a . With some abuse of notation, we accordingly also define φ̄π for any policy
π. Here, using (0.3), the NPG update rule (0.1) is equivalent to:
h θ 2
i
NPG: θ ← θ + ηw? , w? ∈ argminw Es∼dπρ θ ,a∼πθ (·|s) Aπθ (s, a) − w · φs,a .

(We have rescaled the learning rate η in comparison to (0.1)). Note that we recompute w? for every update of θ. Here,
the compatible function approximation error measures the expressivity of our parameterization in how well linear
functions of the parameterization can capture the policy’s advantage function.
We also consider a variant of the NPG update rule (0.1), termed Q-NPG, where:
h 2 i
Q-NPG: θ ← θ + ηw? , w? ∈ argminw Es∼dπρ θ ,a∼πθ (·|s) Qπθ (s, a) − w · φs,a .

Note we do not center the features for Q-NPG; observe that Qπ (s, a) is also not 0 in expectation under π(·|s), unlike
the advantage function.
(NPG/Q-NPG and Soft-Policy Iteration) We now see how we can view both NPG and Q-NPG as an incremental (soft)
version of policy iteration, just as in Lemma 10.6 for the tabular case. Rather than writing the update rule in terms of
the parameter θ, we can write an equivalent update rule directly in terms of the (log-linear) policy π:
h π 2
i
NPG: π(a|s) ← π(a|s) exp(w? · φs,a )/Zs , w? ∈ argminw Es∼dπρ ,a∼π(·|s) Aπ (s, a) − w · φs,a ,
π
where Zs is normalization constant. While the policy update uses the original features φ instead of φ , whereas the
π
quadratic error minimization is terms of the centered features φ , this distinction is not relevant due to that we may
π
also instead use φ (in the policy update) which would result in an equivalent update; the normalization makes the
update invariant to (constant) translations of the features. Similarly, an equivalent update for Q-NPG, where we update
π directly rather than θ, is:
h 2 i
Q-NPG: π(a|s) ← π(a|s) exp(w? · φs,a )/Zs , w? ∈ argminw Es∼dπρ ,a∼π(·|s) Qπ (s, a) − w · φs,a .

101
(On the equivalence of NPG and Q-NPG) If it is the case that the compatible function approximation error is 0, then
it straightforward to verify that the NPG and Q-NPG are equivalent algorithms, in that their corresponding policy
updates will be equivalent to each other.

11.2.2 Neural Policy Classes

Now suppose fθ (s, a) is a neural network parameterized by θ ∈ Rd , where the policy class Π is of form in (0.4).
Observe:
∇θ log πθ (a|s) = gθ (s, a), where gθ (s, a) = ∇θ fθ (s, a) − Ea0 ∼πθ (·|s) [∇θ fθ (s, a0 )],
and, using (0.3), the NPG update rule (0.1) is equivalent to:
h 2 i
NPG: θ ← θ + ηw? , w? ∈ argminw Es∼dπρ θ ,a∼πθ (·|s) Aπθ (s, a) − w · gθ (s, a)

(Again, we have rescaled the learning rate η in comparison to (0.1)).


The Q-NPG variant of this update rule is:
h 2 i
Q-NPG: θ ← θ + ηw? , w? ∈ argminw Es∼dπρ θ ,a∼πθ (·|s) Qπθ (s, a) − w · ∇θ fθ (s, a) .

11.3 The NPG “Regret Lemma”

It is helpful for us to consider NPG more abstractly, as an update rule of the form

θ(t+1) = θ(t) + ηw(t) . (0.5)

We will now provide a lemma where w(t) is an arbitrary (bounded) sequence, which will be helpful when specialized.
Recall a function f : Rd → R is said to be β-smooth if for all x, x0 ∈ Rd :

k∇f (x) − ∇f (x0 )k2 ≤ βkx − x0 k2 ,

and, due to Taylor’s theorem, recall that this implies:

β 0
f (x0 ) − f (x) − ∇f (x) · (x0 − x) ≤ kx − xk22 . (0.6)
2

The following analysis of NPG is draws close connections to the mirror-descent approach used in online learning (see
Section 11.6), which motivates us to refer to it as a “regret lemma”.

Lemma 11.3. (NPG Regret Lemma) Fix a comparison policy π e and a state distribution ρ. Assume for all s ∈ S
and a ∈ A that log πθ (a|s) is a β-smooth function of θ. Consider the update rule (0.5), where π (0) is the uniform
distribution (for all states) and where the sequence of weights w(0) , . . . , w(T ) , satisfies kw(t) k2 ≤ W (but is otherwise
arbitrary). Define: h i
errt = Es∼de Ea∼eπ(·|s) A(t) (s, a) − w(t) · ∇θ log π (t) (a|s) .

We have that:
T −1
!
n o 1 log |A| ηβW 2 1 X
min V πe (ρ) − V (t) (ρ) ≤ + + errt .
t<T 1−γ ηT 2 T t=0

102
This lemma is the key tool in understanding the role of function approximation of various algorithms. We will consider
one example in detail with regards to the log-linear policy class (from Example 9.2).
Note that when errt = 0,pas will be the case with the (tabular) softmax
p policy class with exact gradients, we obtain
a convergence rate of O( 1/T ) using a learning rate of η = O( 1/T ). Note that this is slower than the faster rate
of O(1/T ), provided in Theorem 10.7. Obtaining a bound that leads to a faster rate in the setting with errors requires
more complicated dependencies on errt than those stated above.
Proof: By smoothness (see (0.6)),

π (t+1) (a|s)  β
log ≥ ∇θ log π (t) (a|s) · θ(t+1) − θ(t) − kθ(t+1) − θ(t) k22
π (t) (a|s) 2
β
= η∇θ log π (t) (a|s) · w(t) − η 2 kw(t) k22 .
2

We use de as shorthand for dπρe (note ρ and π


e are fixed); for any policy π, we also use πs as shorthand for the vector
π(·|s). Using the performance difference lemma (Lemma 1.16),
 
Es∼de KL(eπs ||πs(t) ) − KL(e πs ||πs(t+1) )
π (t+1) (a|s)
 
= Es∼de Ea∼eπ(·|s) log (t)
π (a|s)
h i β
≥ ηEs∼de Ea∼eπ(·|s) ∇θ log π (t) (a|s) · w(t) − η 2 kw(t) k22 (using previous display)
2

h i
(t) (t) 2
= ηEs∼de Ea∼eπ(·|s) A (s, a) − η kw k2
h 2 i
+ ηEs∼de Ea∼eπ(·|s) ∇θ log π (a|s) · w(t) − A(t) (s, a)
(t)

 
β
= (1 − γ)η V (ρ) − V (ρ) − η 2 kw(t) k22 − η errt
π
e (t)
2

Rearranging, we have:
 
1 1   ηβ
V πe (ρ) − V (t) (ρ) ≤ πs ||πs(t) ) − KL(e
Es∼de KL(e πs ||πs(t+1) ) + W 2 + errt
1−γ η 2

Proceeding,

T −1 T −1
1 X πe 1 X
(V (ρ) − V (t) (ρ)) ≤ E e (KL(eπs ||πs(t) ) − KL(e
πs ||πs(t+1) ))
T t=0 ηT (1 − γ) t=0 s∼d
T −1 
ηβW 2

1 X
+ + errt
T (1 − γ) t=0 2
T −1
πs ||π (0) )
Es∼de KL(e ηβW 2 1 X
≤ + + errt
ηT (1 − γ) 2(1 − γ) T (1 − γ) t=0
T −1
log |A| ηβW 2 1 X
≤ + + errt ,
ηT (1 − γ) 2(1 − γ) T (1 − γ) t=0

which completes the proof.

103
11.4 Q-NPG: Performance Bounds for Log-Linear Policies

For a state-action distribution υ, define:


 
πθ
2
L(w; θ, υ) := Es,a∼υ Q (s, a) − w · φs,a .

The iterates of the Q-NPG algorithm can be viewed as minimizing this loss under some (changing) distribution υ.
We now specify an approximate version of Q-NPG. It is helpful to consider a slightly more general version of the
algorithm in the previous section, where instead of optimizing under a starting state distribution ρ, we have a different
starting state-action distribution ν. The motivation for this is similar in spirit to our log barrier regularization: we seek
to maintain exploration (and estimation) over the action space even if the current policy does not have coverage over
the action space.
Analogous to the definition of the state visitation measure, dπµ , we can define a visitation measure over states and
actions induced by following π after s0 , a0 ∼ ν. We overload notation using dπν to also refer to the state-action
visitation measure; precisely,

X
dπν (s, a) := (1 − γ)Es0 ,a0 ∼ν γ t Prπ (st = s, at = a|s0 , a0 ) (0.7)
t=0

where Prπ (st = s, at = a|s0 , a0 ) is the probability that st = s and at = a, after starting at state s0 , taking action a0 ,
and following π thereafter. While we overload notation for visitation distributions (dπµ (s) and dπν (s, a)) for notational
convenience, note that the state-action measure dπν uses the subscript ν, which is a state-action measure.
Q-NPG will be defined with respect to the on-policy state action measure starting with s0 , a0 ∼ ν. As per our
convention, we define
(t)
d(t) := dνπ .
The approximate version of this algorithm is:

Approx. Q-NPG: θ(t+1) = θ(t) + ηw(t) , w(t) ≈ argminkwk2 ≤W L(w; θ(t) , d(t) ), (0.8)

where the above update rule also permits us to constrain the norm of the update direction w(t) (alternatively, we could
use `2 regularization as is also common in practice). The exact minimizer is denoted as:
(t)
w? ∈ argminkwk2 ≤W L(w; θ(t) , d(t) ).
(t)
Note that w? depends on the current parameter θ(t) .
Our analysis will take into account both the excess risk (often also referred to as estimation error) and the approxima-
tion error. The standard approximation-estimation error decomposition is as follows:
(t) (t)
L(w(t) ; θ(t) , d(t) ) = L(w(t) ; θ(t) , d(t) ) − L(w? ; θ(t) , d(t) ) + L(w? ; θ(t) , d(t) )
| {z } | {z }
Excess risk Approximation error

Using a sample based approach, we would expect stat = O(1/ N ) or better, where N is the number of samples
(t)
used to estimate. w? In constrast, the approximation error is due to modeling error, and does not tend to 0 with more
samples. We will see how these two errors have strikingly different impact on our final performance bound.
Note that we have already considered two cases where approx = 0. For the tabular softmax policy class, it is immediate
that approx = 0. A more interesting example (where the state and action space could be infinite) is provided by
the linear parameterized MDP model from Chapter 7. Here, provided that we use the log-linear policy class (see

104
Section 11.2.1) with features corresponding to the linear MDP features, it is straightforward to see that approx = 0
for this log-linear policy class. More generally, we will see the effect of model misspecification in our performance
bounds.
We make the following assumption on these errors:
Assumption 11.4 (Approximation/estimation error bounds). Let w(0) , w(1) , . . . w(T −1) be the sequence of iterates
used by the Q-NPG algorithm Suppose the following holds for all t < T :

1. (Excess risk) Suppose the estimation error is bounded as follows:


(t)
L(w(t) ; θ(t) , d(t) ) − L(w? ; θ(t) , d(t) ) ≤ stat

2. (Approximation error) Suppose the approximation error is bounded as follows:


(t)
L(w? ; θ(t) , d(t) ) ≤ approx .

We will also see how, with regards to our estimation error, we will need a far more mild notion of coverage. Here,
with respect to any state-action distribution υ, define:

Συ = Es,a∼υ φs,a φ>


 
s,a .

We make a the following conditioning assumption:


Assumption 11.5 (Relative condition number). Fix a state distribution ρ (this will be what ultimately be the perfor-
mance measure that we seek to optimize). Consider an arbitrary comparator policy π ? (not necessarily an optimal
policy). With respect to π ? , define the state-action measure d? as
?
d? (s, a) = dπρ (s) · UnifA (a)
?
i.e. d? samples states from the comparators state visitation measure, dπρ and actions from the uniform distribution.
Define
w > Σd ? w
sup >
= κ,
w∈Rd w Σν w

and assume that κ is finite.

We later discuss why it is reasonable to expect that κ is not a quantity related to the size of the state space.
The main result of this chapter shows how the approximation error, the excess risk, and the conditioning, determine
the final performance.
Theorem 11.6. Fix a state distribution ρ; a state-action distribution ν; an arbitrary comparator policy π ? (not
necessarily an optimal policy). Suppose Assumption 11.5 holds with respect to thesepchoices and that kφs,a k2 ≤ B
for all s, a. Suppose the Q-NPG update rule (in (0.8)) starts with θ(0) = 0, η = 2 log |A|/(B 2 W 2 T ), and the
(random) sequence of iterates satisfies Assumption 11.4. We have that:
r s
d?
 o  
n ?
π (t) BW 2 log |A| 4|A|
E min V (ρ) − V (ρ) ≤ + κ ·  stat + ·  approx .
t<T 1−γ T (1 − γ)3 ν ∞

p
Note when approx = 0, our convergence rate is O( 1/T ) plus a term that depends on the excess risk; hence, provided
we obtain enough samples, then stat will also tend to 0, and we will be competitive with the comparison policy π ? .
The above also shows the striking difference between the effects of estimation error and approximation error. A few
remarks are now in order.

105
Transfer learning, distribution shift, and the approximation error. In large scale problems, the worst case distri-
?
bution mismatch factor dν is unlikely to be small. However, this factor is ultimately due to transfer learning. Our

(t)
approximation error is with respect to the fitting distribution d(t) , where we assume that L(w? ; θ(t) , d(t) ) ≤ approx .
(t)
As the proof will show, the relevant notion of approximation error will be L(w? ; θ(t) , d? ), where d? is the fixed
comparators measure. In others words, to get a good performance bound we need to successfully have low transfer
learning error to the fixed measure d? . Furthermore, in many modern machine learning applications, this error is often
is favorable, in that it is substantially better than worst case theory might suggest.
See Section 11.6 for further remarks on this point.

Dimension dependence in κ and the importance of ν. It is reasonable to think about κ as being dimension de-
pendent (or worse), but it is not necessarily related to the size of the state space. For example, if kφs,a k2 ≤ B, then
B2
κ ≤ σmin (Es,a∼ν [φs,a φ>
though this bound may be pessimistic. Here, we also see the importance of choice of ν
s,a ])
in having a small (relative) condition number; in particular, this is the motivation for considering the generalization
which allows for a starting state-action distribution ν vs. just a starting state distribution µ (as we did in the tabular
case). Roughly speaking, we desire a ν which provides good coverage over the features. As the following lemma
shows, there always exists a universal distribution ν, which can be constructed only with knowledge of the feature set
(without knowledge of d? ), such that κ ≤ d.

Lemma 11.7. (κ ≤ d is always possible) Let Φ = {φ(s, a)|(s, a) ∈ S × A} ⊂ Rd and suppose Φ is a compact set.
There always exists a state-action distribution ν, which is supported on at most d2 state-action pairs and which can
be constructed only with knowledge of Φ (without knowledge of the MDP or d? ), such that:

κ ≤ d.

Proof: The distribution can be found through constructing the minimal volume ellipsoid containing Φ, i.e. the Loẅner-
John ellipsoid. To be added...

Direct policy optimization vs. approximate value function programming methods Part of the reason for the
success of the direct policy optimization approaches is to due their more mild dependence on the approximation error.
?
Here, our theoretical analysis has a dependence on a distribution mismatch coefficient, dν , while approximate

value function methods have even worse dependencies. See Chapter 3. As discussed earlier and as can be seen in
the regret lemma (Lemma 11.3), the distribution mismatch coefficient is due to that the relevant error for NPG is a
transfer error notion to a fixed comparator distribution, while approximate value function methods have more stringent
conditions where the error has to be small under, essentially, the distribution of any other policy.

11.4.1 Analysis

Proof: (of Theorem 11.6) For the log-linear policy class, due to that the feature mapping φ satisfies kφs,a k2 ≤ B,
then it is not difficult to verify that log πθ (a|s) is a B 2 -smooth function. Using this and the NPG regret lemma
(Lemma 11.3), we have:
T −1
r
n ? o BW 2 log |A| 1 X
min V π (ρ) − V (t) (ρ) ≤ + errt .
t<T 1−γ T (1 − γ)T t=0

where we have used our setting of η.

106
We make the following decomposition of errt :
h i
(t)
errt = Es∼d?ρ ,a∼π? (·|s) A(t) (s, a) − w? · ∇θ log π (t) (a|s)
h i
(t)
+ Es∼d?ρ ,a∼π? (·|s) w? − w(t) · ∇θ log π (t) (a|s) .


For the first term, using that ∇θ log πθ (a|s) = φs,a − Ea0 ∼πθ (·|s) [φs,a0 ] (see Section 11.2.1), we have:
h i
(t)
Es∼d?ρ ,a∼π? (·|s) A(t) (s, a) − w? · ∇θ log π (t) (a|s)
h i h i
(t) (t)
= Es∼d?ρ ,a∼π? (·|s) Q(t) (s, a) − w? · φs,a − Es∼d?ρ ,a0 ∼π(t) (·|s) Q(t) (s, a0 ) − w? · φs,a0
r r
  2  2
(t) (t)
≤ Es∼d?ρ ,a∼π? (·|s) Q(t) (s, a) − w? · φs,a + Es∼d?ρ ,a0 ∼π(t) (·|s) Q(t) (s, a0 ) − w? · φs,a0
r h 2 i q
(t) (t)
≤ 2 |A|Es∼d?ρ ,a∼UnifA Q(t) (s, a) − w? · φs,a = 2 |A|L(w? ; θ(t) , d? ).

(t)
where we have used the definition of d? and L(w? ; θ(t) , d? ) in the last step. Using following crude upper bound,

(t) d? (t) 1 d? (t)


L(w? ; θ(t) , d? ) ≤ L(w? ; θ(t) , d(t) ) ≤ L(w? ; θ(t) , d(t) ),
d(t) ∞ 1−γ ν ∞

(where the last step uses the defintion of d(t) , see Equation 0.7), we have that:
s
h
(t) (t) (t)
i |A| d? (t)
Es∼d?ρ ,a∼π? (·|s) A (s, a) − w? · ∇θ log π (a|s) ≤ 2 L(w? ; θ(t) , d(t) ). (0.9)
1−γ ν ∞

For the second term, let us now show that:


h i
(t)
Es∼d?ρ ,a∼π? (·|s) w? − w(t) · ∇θ log π (t) (a|s)

s
|A|κ  (t)

≤ 2 L(w(t) ; θ(t) , d(t) ) − L(w? ; θ(t) , d(t) ) (0.10)
1−γ

To see this, first observe that a similar argument to the above leads to:
h i
(t)
Es∼d?ρ ,a∼π? (·|s) w? − w(t) · ∇θ log π (t) (a|s)

h i h i
(t) (t)
= Es∼d?ρ ,a∼π? (·|s) w? − w(t) · φs,a − Es∼d?ρ ,a0 ∼π(t) (·|s) w? − w(t) · φs,a0
 
r h 2 i q
(t)  (t)
≤ 2 |A|Es,a∼d? w? − w(t) · φs,a = 2 |A| · kw? − w(t) k2Σd? ,

where we use the notation kxk2M := x> M x for a matrix M and a vector x. From the definition of κ,
(t) (t) κ (t)
kw? − w(t) k2Σd? ≤ κkw? − w(t) k2Σν ≤ kw? − w(t) k2Σ (t)
1−γ d

(t) (t)
using that (1 − γ)ν ≤ dπν (see (0.7)). Due to that w? minimizes L(w; θ(t) , d(t) ) over the set W := {w : kwk2 ≤
(t)
W }, for any w ∈ W the first-order optimality conditions for w? imply that:
(t) (t)
(w − w? ) · ∇L(w? ; θ(t) , d(t) ) ≥ 0.

Therefore, for any $w\in\mathcal{W}$,
$$
L(w;\theta^{(t)},d^{(t)}) - L(w^{(t)}_\star;\theta^{(t)},d^{(t)})
= \mathbb{E}_{s,a\sim d^{(t)}}\Big[\big(w\cdot\phi(s,a) - w^{(t)}_\star\cdot\phi(s,a) + w^{(t)}_\star\cdot\phi(s,a) - Q^{(t)}(s,a)\big)^2\Big] - L(w^{(t)}_\star;\theta^{(t)},d^{(t)})
$$
$$
= \mathbb{E}_{s,a\sim d^{(t)}}\Big[\big((w - w^{(t)}_\star)\cdot\phi(s,a)\big)^2\Big]
+ 2\,(w - w^{(t)}_\star)\cdot\mathbb{E}_{s,a\sim d^{(t)}}\Big[\phi(s,a)\big(w^{(t)}_\star\cdot\phi(s,a) - Q^{(t)}(s,a)\big)\Big]
$$
$$
= \|w - w^{(t)}_\star\|^2_{\Sigma_{d^{(t)}}} + (w - w^{(t)}_\star)\cdot\nabla L(w^{(t)}_\star;\theta^{(t)},d^{(t)})
\ge \|w - w^{(t)}_\star\|^2_{\Sigma_{d^{(t)}}}.
$$

Noting that w(t) ∈ W by construction in Algorithm 0.8 yields the claimed bound on the second term in (0.10).
Using the bounds on the first and second terms in (0.9) and (0.10), along with concavity of the square root function,
we have that:
$$
\mathrm{err}_t \le 2\sqrt{\frac{|A|}{1-\gamma}\left\|\frac{d^\star}{\nu}\right\|_\infty L(w^{(t)}_\star;\theta^{(t)},d^{(t)})}
+ 2\sqrt{\frac{|A|\kappa}{1-\gamma}\Big(L(w^{(t)};\theta^{(t)},d^{(t)}) - L(w^{(t)}_\star;\theta^{(t)},d^{(t)})\Big)}.
$$

The proof is completed by substitution and using our assumptions on $\epsilon_{\mathrm{stat}}$ and $\epsilon_{\mathrm{bias}}$.

11.5 Q-NPG Sample Complexity

To be added...

11.6 Bibliographic Remarks and Further Readings

The notion of compatible function approximation was due to [Sutton et al., 1999], which also proved the claim in
Lemma 11.1. The close connection of the NPG update rule to compatible function approximation (Lemma 0.3) was
noted in [Kakade, 2001].
The regret lemma (Lemma 11.3) for the NPG analysis has origins in the online regret framework in changing MDPs [Even-
Dar et al., 2009]. The convergence rates in this chapter are largely derived from [Agarwal et al., 2020d]. The Q-NPG
algorithm for the log-linear policy classes is essentially the same algorithm as POLITEX [Abbasi-Yadkori et al., 2019],
with the distinction that it is important to use a state-action measure over the initial distribution. The analysis and error
decomposition of Q-NPG is from [Agarwal et al., 2020d], which has a more general analysis of NPG with function
approximation under the regret lemma. This more general approach also permits the analysis of neural policy classes,
as shown in [Agarwal et al., 2020d]. Also, [Liu et al., 2019] provide an analysis of the TRPO algorithm [Schulman
et al., 2015] (essentially the same as NPG) for neural network parameterizations in the somewhat restrictive linearized
“neural tangent kernel” regime.

Chapter 12

CPI, TRPO, and More

In this chapter, we consider conservative policy iteration (CPI) and trust-region constrained policy optimization (TRPO). Both CPI and TRPO can be understood as making small incremental updates to the policy by forcing the new policy's state-action distribution to not be too far away from the current policy's. We will see that CPI achieves this by forming a new policy that is a mixture of the current policy and a local greedy policy, while TRPO enforces it by explicitly adding a KL constraint (over the policies' induced trajectory distributions) to the optimization procedure. We will show that TRPO gives an update procedure equivalent to Natural Policy Gradient.
Along the way, we discuss the benefit of incremental policy updates by contrasting them with another family of policy update procedures called Approximate Policy Iteration (API), which performs local greedy policy search and could potentially lead to abrupt policy changes. We show that API in general fails to converge or make local improvement, unless under a much stronger concentrability ratio assumption.
The algorithm and analysis of CPI are adapted from the original ones in [Kakade and Langford, 2002], and we follow the presentation of TRPO from [Schulman et al., 2015], while making a connection to the NPG algorithm.

12.1 Conservative Policy Iteration

As the name suggests, we will now describe a more conservative version of the policy iteration algorithm, which
shifts the next policy away from the current policy with a small step size to prevent drastic shifts in successive state
distributions.
We consider the discounted MDP {S, A, P, r, γ, ρ} here, where ρ is the initial state distribution. Similar to Policy Gradient Methods, we assume that we have a restart distribution µ (i.e., the µ-restart setting). Throughout this section, for any policy π, we denote by $d^\pi_\mu$ the state visitation distribution starting from $s_0\sim\mu$ instead of ρ, and by $d^\pi$ the state visitation distribution starting from the true initial state distribution ρ, i.e., $s_0\sim\rho$. Similarly, we denote by $V^\pi_\mu$ the expected discounted total reward of policy π starting at µ, and by $V^\pi$ the expected discounted total reward of π with ρ as the initial state distribution. We assume A is discrete, but S could be continuous.
CPI is based on the concept of Reduction to Supervised Learning. Specifically, we will use the Approximate Greedy Policy Selector defined in Chapter 3 (Definition 3.1). We recall the definition of the ε-approximate Greedy Policy Selector $G_\varepsilon(\pi,\Pi,\mu)$ below. Given a policy π, a policy class Π, and a restart distribution µ, denoting $\hat\pi = G_\varepsilon(\pi,\Pi,\mu)$, we have that:
$$
\mathbb{E}_{s\sim d^\pi_\mu}\big[A^\pi(s,\hat\pi(s))\big] \ge \max_{\tilde\pi\in\Pi}\mathbb{E}_{s\sim d^\pi_\mu}\big[A^\pi(s,\tilde\pi(s))\big] - \varepsilon.
$$

Recall that in Chapter 3 we explained two approaches to implementing such an approximate oracle: one via a reduction to a classification oracle, and the other via a reduction to a regression oracle.

12.1.1 The CPI Algorithm

CPI, summarized in Alg. 7, iteratively generates a sequence of policies $\pi^t$. Note we use $\pi_\alpha = (1-\alpha)\pi + \alpha\pi'$ to refer to a randomized policy which, at any state s, chooses an action according to π with probability 1−α and according to π′ with probability α. The greedy policy π′ is computed using the ε-approximate greedy policy selector $G_\varepsilon(\pi^t,\Pi,\mu)$. The algorithm terminates when there is no significant one-step improvement over $\pi^t$, i.e., $\mathbb{E}_{s\sim d^{\pi^t}_\mu}\big[A^{\pi^t}(s,\pi'(s))\big] \le \varepsilon$.

Algorithm 7 Conservative Policy Iteration (CPI)

Input: Initial policy $\pi^0 \in \Pi$, accuracy parameter ε.
1: for t = 0, 1, 2, . . . do
2:   $\pi' = G_\varepsilon(\pi^t, \Pi, \mu)$
3:   if $\mathbb{E}_{s\sim d^{\pi^t}_\mu}\big[A^{\pi^t}(s,\pi'(s))\big] \le \varepsilon$ then
4:     Return $\pi^t$
5:   end if
6:   Update $\pi^{t+1} = (1-\alpha)\pi^t + \alpha\pi'$
7: end for
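To make the update concrete, here is a minimal Python sketch of CPI in a small tabular MDP. It is an illustration, not part of the original algorithm statement: it uses exact policy evaluation, an exact greedy selector over deterministic policies (so the selector error is zero), and a fixed mixing weight α rather than the adaptive choice from Theorem 12.2.

import numpy as np

def q_values(P, r, gamma, pi):
    # V^pi solves (I - gamma P_pi) V = r_pi, where P_pi[s, s'] = sum_a pi[s,a] P[s,a,s'].
    S, A = r.shape
    P_pi = np.einsum("sa,sap->sp", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, r)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    Q = r + gamma * P @ V                      # shape (S, A)
    return Q, V

def state_visitation(P, gamma, pi, mu):
    # d^pi_mu(s) = (1 - gamma) sum_t gamma^t Pr[s_t = s | s_0 ~ mu, pi]
    S = mu.shape[0]
    P_pi = np.einsum("sa,sap->sp", pi, P)
    return np.linalg.solve(np.eye(S) - gamma * P_pi.T, (1 - gamma) * mu)

def cpi(P, r, gamma, mu, alpha=0.05, eps=1e-3, iters=500):
    S, A = r.shape
    pi = np.ones((S, A)) / A                   # start from the uniform policy
    for _ in range(iters):
        Q, V = q_values(P, r, gamma, pi)
        adv = Q - V[:, None]                   # A^{pi^t}(s, a)
        d = state_visitation(P, gamma, pi, mu)
        greedy = np.argmax(adv, axis=1)        # exact greedy selector (selector error 0)
        if d @ adv[np.arange(S), greedy] <= eps:   # termination test of Algorithm 7
            return pi
        pi = (1 - alpha) * pi + alpha * np.eye(A)[greedy]   # conservative mixture update
    return pi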

The main intuition behind the algorithm is that the stepsize α controls the difference between the state distributions of $\pi^t$ and $\pi^{t+1}$. Let us look at the performance difference lemma to get some intuition about this conservative update. From PDL, we have:
$$
V^{\pi^{t+1}}_\mu - V^{\pi^t}_\mu = \frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi^{t+1}}_\mu}\Big[A^{\pi^t}\big(s,\pi^{t+1}(s)\big)\Big]
= \frac{\alpha}{1-\gamma}\mathbb{E}_{s\sim d^{\pi^{t+1}}_\mu}\Big[A^{\pi^t}\big(s,\pi'(s)\big)\Big],
$$
where in the last equality we use the fact that $\pi^{t+1} = (1-\alpha)\pi^t + \alpha\pi'$ and $A^{\pi^t}(s,\pi^t(s)) = 0$ for all s. Thus, if we can search for a policy $\pi'\in\Pi$ that maximizes $\mathbb{E}_{s\sim d^{\pi^{t+1}}_\mu}\big[A^{\pi^t}(s,\pi'(s))\big]$ and makes it positive, then we can guarantee policy improvement. However, at episode t we do not know the state distribution of $\pi^{t+1}$; all we have access to is $d^{\pi^t}_\mu$. Thus, we explicitly make the policy update procedure conservative, so that $d^{\pi^t}_\mu$ and the new policy's distribution $d^{\pi^{t+1}}_\mu$ are guaranteed to not be that different. We can then hope that $\mathbb{E}_{s\sim d^{\pi^{t+1}}_\mu}\big[A^{\pi^t}(s,\pi'(s))\big]$ is close to $\mathbb{E}_{s\sim d^{\pi^t}_\mu}\big[A^{\pi^t}(s,\pi'(s))\big]$, and the latter is something that we can manipulate using the greedy policy selector.

Below we formalize the above intuition and show that with a small enough α, we can indeed ensure monotonic policy improvement.
We start from the following lemma, which shows that $\pi^{t+1}$ and $\pi^t$ are close to each other in total variation distance at every state, and that $d^{\pi^{t+1}}_\mu$ and $d^{\pi^t}_\mu$ are close as well.

Lemma 12.1 (Similar Policies imply similar state visitations). Consider any t. We have that:
$$
\big\|\pi^{t+1}(\cdot|s) - \pi^t(\cdot|s)\big\|_1 \le 2\alpha,\quad \forall s;
$$
further, we have:
$$
\big\|d^{\pi^{t+1}}_\mu - d^{\pi^t}_\mu\big\|_1 \le \frac{2\alpha\gamma}{1-\gamma}.
$$

Proof: The first claim follows from the definition of the policy update:
$$
\big\|\pi^{t+1}(\cdot|s) - \pi^t(\cdot|s)\big\|_1 = \alpha\big\|\pi^t(\cdot|s) - \pi'(\cdot|s)\big\|_1 \le 2\alpha.
$$
Denote by $\mathbb{P}^\pi_h$ the state distribution resulting from π at time step h with µ as the initial state distribution. We consider bounding $\|\mathbb{P}^{\pi^{t+1}}_h - \mathbb{P}^{\pi^t}_h\|_1$ for h ≥ 1:
$$
\mathbb{P}^{\pi^{t+1}}_h(s') - \mathbb{P}^{\pi^t}_h(s')
= \sum_{s,a}\Big(\mathbb{P}^{\pi^{t+1}}_{h-1}(s)\pi^{t+1}(a|s) - \mathbb{P}^{\pi^t}_{h-1}(s)\pi^t(a|s)\Big)P(s'|s,a)
$$
$$
= \sum_{s,a}\Big(\mathbb{P}^{\pi^{t+1}}_{h-1}(s)\pi^{t+1}(a|s) - \mathbb{P}^{\pi^{t+1}}_{h-1}(s)\pi^t(a|s)
+ \mathbb{P}^{\pi^{t+1}}_{h-1}(s)\pi^t(a|s) - \mathbb{P}^{\pi^t}_{h-1}(s)\pi^t(a|s)\Big)P(s'|s,a)
$$
$$
= \sum_s \mathbb{P}^{\pi^{t+1}}_{h-1}(s)\sum_a\Big(\pi^{t+1}(a|s) - \pi^t(a|s)\Big)P(s'|s,a)
+ \sum_s\Big(\mathbb{P}^{\pi^{t+1}}_{h-1}(s) - \mathbb{P}^{\pi^t}_{h-1}(s)\Big)\sum_a\pi^t(a|s)P(s'|s,a).
$$
Taking absolute values on both sides and summing over s', we get:
$$
\sum_{s'}\Big|\mathbb{P}^{\pi^{t+1}}_h(s') - \mathbb{P}^{\pi^t}_h(s')\Big|
\le \sum_s\mathbb{P}^{\pi^{t+1}}_{h-1}(s)\sum_a\Big|\pi^{t+1}(a|s) - \pi^t(a|s)\Big|\sum_{s'}P(s'|s,a)
+ \sum_s\Big|\mathbb{P}^{\pi^{t+1}}_{h-1}(s) - \mathbb{P}^{\pi^t}_{h-1}(s)\Big|\sum_{s'}\sum_a\pi^t(a|s)P(s'|s,a)
$$
$$
\le 2\alpha + \big\|\mathbb{P}^{\pi^{t+1}}_{h-1} - \mathbb{P}^{\pi^t}_{h-1}\big\|_1
\le 4\alpha + \big\|\mathbb{P}^{\pi^{t+1}}_{h-2} - \mathbb{P}^{\pi^t}_{h-2}\big\|_1
\le \cdots \le 2h\alpha.
$$
Now, using the definition of $d^\pi_\mu$, we have:
$$
d^{\pi^{t+1}}_\mu - d^{\pi^t}_\mu = (1-\gamma)\sum_{h=0}^{\infty}\gamma^h\Big(\mathbb{P}^{\pi^{t+1}}_h - \mathbb{P}^{\pi^t}_h\Big).
$$
Taking the $\ell_1$ norm on both sides, we get:
$$
\big\|d^{\pi^{t+1}}_\mu - d^{\pi^t}_\mu\big\|_1 \le (1-\gamma)\sum_{h=0}^{\infty}\gamma^h\,2h\alpha.
$$
It is not hard to verify that $\sum_{h=0}^{\infty}h\gamma^h = \frac{\gamma}{(1-\gamma)^2}$. Thus, we can conclude that:
$$
\big\|d^{\pi^{t+1}}_\mu - d^{\pi^t}_\mu\big\|_1 \le \frac{2\alpha\gamma}{1-\gamma}.
$$

The above lemma states that if $\pi^{t+1}$ and $\pi^t$ are close in total variation distance at every state, then the total variation distance between the resulting state visitations of $\pi^{t+1}$ and $\pi^t$ will be small, up to an effective-horizon $1/(1-\gamma)$ amplification factor.
This lemma captures the key property of the conservative policy update: via the conservative update, we make sure that $d^{\pi^{t+1}}_\mu$ and $d^{\pi^t}_\mu$ are close to each other in total variation distance. Now we use the above lemma to show monotonic policy improvement.
Theorem 12.2 (Monotonic Improvement in CPI). Consider any episode t. Denote $\mathbb{A} = \mathbb{E}_{s\sim d^{\pi^t}_\mu}\big[A^{\pi^t}(s,\pi'(s))\big]$. We have:
$$
V^{\pi^{t+1}}_\mu - V^{\pi^t}_\mu \ge \frac{\alpha}{1-\gamma}\left(\mathbb{A} - \frac{2\alpha\gamma}{(1-\gamma)^2}\right).
$$
Setting $\alpha = \frac{\mathbb{A}(1-\gamma)^2}{4\gamma}$, we get:
$$
V^{\pi^{t+1}}_\mu - V^{\pi^t}_\mu \ge \frac{\mathbb{A}^2(1-\gamma)}{8\gamma}.
$$
The above theorem shows that as long as we have a positive one-step improvement, i.e., $\mathbb{A} > 0$, the new policy $\pi^{t+1}$ is strictly better than $\pi^t$.
Proof: Via PDL, we have:
$$
V^{\pi^{t+1}}_\mu - V^{\pi^t}_\mu = \frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi^{t+1}}_\mu}\big[\alpha A^{\pi^t}(s,\pi'(s))\big].
$$
Recalling Lemma 12.1, we have:
$$
(1-\gamma)\Big(V^{\pi^{t+1}}_\mu - V^{\pi^t}_\mu\Big)
= \mathbb{E}_{s\sim d^{\pi^t}_\mu}\big[\alpha A^{\pi^t}(s,\pi'(s))\big]
+ \mathbb{E}_{s\sim d^{\pi^{t+1}}_\mu}\big[\alpha A^{\pi^t}(s,\pi'(s))\big]
- \mathbb{E}_{s\sim d^{\pi^t}_\mu}\big[\alpha A^{\pi^t}(s,\pi'(s))\big]
$$
$$
\ge \mathbb{E}_{s\sim d^{\pi^t}_\mu}\big[\alpha A^{\pi^t}(s,\pi'(s))\big]
- \alpha\sup_{s,a,\pi}\big|A^\pi(s,a)\big|\,\big\|d^{\pi^t}_\mu - d^{\pi^{t+1}}_\mu\big\|_1
\ge \mathbb{E}_{s\sim d^{\pi^t}_\mu}\big[\alpha A^{\pi^t}(s,\pi'(s))\big]
- \frac{\alpha}{1-\gamma}\big\|d^{\pi^t}_\mu - d^{\pi^{t+1}}_\mu\big\|_1
$$
$$
\ge \mathbb{E}_{s\sim d^{\pi^t}_\mu}\big[\alpha A^{\pi^t}(s,\pi'(s))\big] - \frac{2\alpha^2\gamma}{(1-\gamma)^2}
= \alpha\left(\mathbb{A} - \frac{2\alpha\gamma}{(1-\gamma)^2}\right),
$$
where in the first inequality we use the fact that for any two distributions $P_1$ and $P_2$ and any function f, $|\mathbb{E}_{x\sim P_1}f(x) - \mathbb{E}_{x\sim P_2}f(x)| \le \sup_x|f(x)|\,\|P_1 - P_2\|_1$; in the second inequality we use the fact that $|A^\pi(s,a)| \le 1/(1-\gamma)$ for any π, s, a; and the last inequality uses Lemma 12.1.
For the second part of the theorem, we want to maximize the per-iteration improvement over the choice of α, so we pick the α which maximizes $\alpha\big(\mathbb{A} - 2\alpha\gamma/(1-\gamma)^2\big)$. This gives the α claimed in the theorem; plugging it back into $\alpha\big(\mathbb{A} - 2\alpha\gamma/(1-\gamma)^2\big)$ concludes the second part of the theorem.

The above theorem indicates that with the right choice of α, the policy is guaranteed to improve as long as $\mathbb{A} > 0$. Recall the termination criterion in CPI: we terminate when $\mathbb{A} \le \varepsilon$. Putting these results together, we obtain the following overall convergence guarantee for the CPI algorithm.
Theorem 12.3 (Local optimality of CPI). Algorithm 7 terminates in at most $\frac{8\gamma}{(1-\gamma)^2\varepsilon^2}$ iterations and outputs a policy $\pi^t$ satisfying $\max_{\pi\in\Pi}\mathbb{E}_{s\sim d^{\pi^t}_\mu}\big[A^{\pi^t}(s,\pi(s))\big] \le 2\varepsilon$.

Proof: Note that the reward is bounded in [0, 1], which means that $V^\pi_\mu \in [0, 1/(1-\gamma)]$. We have shown in Theorem 12.2 that at every iteration t we have a policy improvement of at least $\frac{\mathbb{A}^2(1-\gamma)}{8\gamma}$, where recall that $\mathbb{A}$ at episode t is defined as $\mathbb{A} = \mathbb{E}_{s\sim d^{\pi^t}_\mu}\big[A^{\pi^t}(s,\pi'(s))\big]$. If the algorithm does not terminate at episode t, then we are guaranteed that:
$$
V^{\pi^{t+1}}_\mu \ge V^{\pi^t}_\mu + \frac{\varepsilon^2(1-\gamma)}{8\gamma}.
$$
Since $V^\pi_\mu$ is upper bounded by $1/(1-\gamma)$, the algorithm can make such improvements for at most $\frac{8\gamma}{(1-\gamma)^2\varepsilon^2}$ iterations.
Finally, recalling that π′ is the output of the ε-approximate greedy policy selector, $\pi' = G_\varepsilon(\pi^t,\Pi,\mu)$, upon termination we have:
$$
\max_{\pi\in\Pi}\mathbb{E}_{s\sim d^{\pi^t}_\mu}\big[A^{\pi^t}(s,\pi(s))\big]
\le \mathbb{E}_{s\sim d^{\pi^t}_\mu}\big[A^{\pi^t}(s,\pi'(s))\big] + \varepsilon \le 2\varepsilon.
$$
This concludes the proof.
Theorem 12.3 can be viewed as a local optimality guarantee: it shows that when CPI terminates, we cannot find a policy $\pi\in\Pi$ that achieves a local advantage over the returned policy of more than 2ε. However, this does not necessarily imply that the value of π is close to $V^\star$. Similar to the policy gradient analysis, we can turn this local guarantee into a global one when the restart distribution µ covers $d^{\pi^\star}$. We formalize this intuition next.
Theorem 12.4 (Global optimality of CPI). Upon termination, we have a policy π such that:
$$
V^\star - V^\pi \le \frac{2\varepsilon + \varepsilon_\Pi}{(1-\gamma)^2}\left\|\frac{d^{\pi^\star}}{\mu}\right\|_\infty,
$$
where $\varepsilon_\Pi := \mathbb{E}_{s\sim d^\pi_\mu}\big[\max_{a\in A}A^\pi(s,a)\big] - \max_{\tilde\pi\in\Pi}\mathbb{E}_{s\sim d^\pi_\mu}\big[A^\pi(s,\tilde\pi(s))\big]$.
In other words, if our policy class is rich enough to approximate the greedy policy $\mathrm{argmax}_{a\in A}A^\pi(s,a)$ under $d^\pi_\mu$, i.e., $\varepsilon_\Pi$ is small, and µ covers $d^{\pi^\star}$ in the sense that $\big\|d^{\pi^\star}/\mu\big\|_\infty < \infty$, then CPI is guaranteed to find a near-optimal policy.

Proof: By the performance difference lemma,
$$
V^\star - V^\pi = \frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi^\star}}\big[A^\pi(s,\pi^\star(s))\big]
\le \frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi^\star}}\Big[\max_{a\in A}A^\pi(s,a)\Big]
\le \frac{1}{1-\gamma}\left\|\frac{d^{\pi^\star}}{d^\pi_\mu}\right\|_\infty\mathbb{E}_{s\sim d^\pi_\mu}\Big[\max_{a\in A}A^\pi(s,a)\Big]
\le \frac{1}{(1-\gamma)^2}\left\|\frac{d^{\pi^\star}}{\mu}\right\|_\infty\mathbb{E}_{s\sim d^\pi_\mu}\Big[\max_{a\in A}A^\pi(s,a)\Big]
$$
$$
= \frac{1}{(1-\gamma)^2}\left\|\frac{d^{\pi^\star}}{\mu}\right\|_\infty
\left(\mathbb{E}_{s\sim d^\pi_\mu}\Big[\max_{a\in A}A^\pi(s,a)\Big] - \max_{\hat\pi\in\Pi}\mathbb{E}_{s\sim d^\pi_\mu}\big[A^\pi(s,\hat\pi(s))\big]
+ \max_{\hat\pi\in\Pi}\mathbb{E}_{s\sim d^\pi_\mu}\big[A^\pi(s,\hat\pi(s))\big]\right)
\le \frac{1}{(1-\gamma)^2}\left\|\frac{d^{\pi^\star}}{\mu}\right\|_\infty\big(\varepsilon_\Pi + 2\varepsilon\big),
$$
where the second inequality holds due to the fact that $\max_a A^\pi(s,a) \ge 0$, the third inequality uses the fact that $d^\pi_\mu(s) \ge (1-\gamma)\mu(s)$ for any s and π, and the last inequality uses the definition of $\varepsilon_\Pi$ and Theorem 12.3.
It is informative to contrast CPI and policy gradient algorithms due to the similarity of their guarantees. Both provide
local optimality guarantees. For CPI, the local optimality always holds, while for policy gradients it requires a smooth
value function as a function of the policy parameters. If the distribution mismatch between an optimal policy and the
output of the algorithm is not too large, then both algorithms further yield a near optimal policy. The similarities are
not so surprising. Both algorithms operate by making local improvements to the current policy at each iteration, by
inspecting its advantage function. The changes made to the policy are controlled using a stepsize parameter in both
the approaches. It is the actual mechanism of the improvement which differs in the two cases. Policy gradients assume
that the policy’s reward is a differentiable function of the parameters, and hence make local improvements through
gradient ascent. The differentiability is certainly an assumption and does not necessarily hold for all policy classes.
An easy example is when the policy itself is not an easily differentiable function of its parameters. For instance, if the
policy is parametrized by regression trees, then performing gradient updates can be challenging.
In CPI, on the other hand, the basic computational primitive required of the policy class is the ability to maximize the advantage function relative to the current policy. Notice that Algorithm 7 does not necessarily restrict to a parametric policy class, such as the set of parametrized policies used in policy gradients. Indeed, due to the reduction-to-supervised-learning approach (e.g., using the weighted classification oracle CO), we can parameterize the policy class via decision trees, for instance. This property makes CPI extremely attractive: any policy class over which efficient supervised learning algorithms exist can be adapted to reinforcement learning with performance guarantees.
A second important difference between CPI and policy gradients is in the notion of locality. Policy gradient updates
are local in the parameter space, and we hope that this makes small enough changes to the state distribution that the
new policy is indeed an improvement on the older one (for instance, when we invoke the performance difference
lemma between successive iterates). While this is always true in expectation for correctly chosen stepsizes based on
properties of stochastic gradient ascent on smooth functions, the variance of the algorithm and lack of robustness to
suboptimal stepsizes can make the algorithm somewhat finicky. Indeed, there are a host of techniques in the literature
to both lower the variance (through control variates) and explicitly control the state distribution mismatch between
successive iterates of policy gradients (through trust region techniques). On the other hand, CPI explicitly controls the
amount of perturbation to the state distribution by carefully mixing policies in a manner which does not drastically
alter the trajectories with high probability. Indeed, this insight is central to the proof of CPI, and has been instrumental
in several follow-ups, both in the direct policy improvement as well as policy gradient literature.

12.2 Trust Region Methods and Covariant Policy Search

So far we have seen policy gradient methods and CPI, which all use a small step size to ensure incremental updates of the policy. Another popular approach for incremental policy updates is to explicitly enforce a small change in the policy's distribution via a trust region constraint. More specifically, let us go back to the general policy parameterization $\pi_\theta$. At iteration t with current policy $\pi_{\theta_t}$, we are interested in the following local trust-region constrained optimization:
$$
\max_\theta\ \mathbb{E}_{s\sim d^{\pi_{\theta_t}}_\mu}\mathbb{E}_{a\sim\pi_\theta(\cdot|s)}\big[A^{\pi_{\theta_t}}(s,a)\big]
\quad\text{s.t.}\quad \mathrm{KL}\Big(\mathrm{Pr}^{\pi_{\theta_t}}_\mu \,\big\|\, \mathrm{Pr}^{\pi_\theta}_\mu\Big) \le \delta,
$$
where recall that $\mathrm{Pr}^\pi_\mu(\tau)$ is the trajectory distribution induced by π starting at $s_0\sim\mu$, and $\mathrm{KL}(P_1\|P_2)$ is the KL divergence between two distributions $P_1$ and $P_2$. Namely, we explicitly perform local policy search with a constraint forcing the new policy to not be too far away from $\mathrm{Pr}^{\pi_{\theta_t}}_\mu$ in terms of KL divergence.
As we are interested in a small local update in the parameters, we can perform sequential quadratic programming here: we linearize the objective function at $\theta_t$ and quadratize the KL constraint at $\theta_t$ to form the local quadratic program:
$$
\max_\theta\ \Big\langle \mathbb{E}_{s\sim d^{\pi_{\theta_t}}_\mu}\mathbb{E}_{a\sim\pi_{\theta_t}(\cdot|s)}\big[\nabla_\theta\ln\pi_{\theta_t}(a|s)\,A^{\pi_{\theta_t}}(s,a)\big],\ \theta\Big\rangle \tag{0.1}
$$
$$
\text{s.t.}\quad \Big\langle\nabla_\theta\mathrm{KL}\big(\mathrm{Pr}^{\pi_{\theta_t}}_\mu\big\|\mathrm{Pr}^{\pi_\theta}_\mu\big)\Big|_{\theta=\theta_t},\ \theta-\theta_t\Big\rangle
+ \frac{1}{2}(\theta-\theta_t)^\top\nabla^2_\theta\mathrm{KL}\big(\mathrm{Pr}^{\pi_{\theta_t}}_\mu\big\|\mathrm{Pr}^{\pi_\theta}_\mu\big)\Big|_{\theta=\theta_t}(\theta-\theta_t) \le \delta, \tag{0.2}
$$

where we denote by $\nabla^2\mathrm{KL}|_{\theta=\theta_t}$ the Hessian of the KL constraint evaluated at $\theta_t$. Note that the KL divergence is not a valid metric, as it is not symmetric. However, its local quadratic approximation can serve as a valid local distance metric: we prove below that the Hessian $\nabla^2\mathrm{KL}|_{\theta=\theta_t}$ is a positive semi-definite matrix. Indeed, we will show that the Hessian of the KL constraint is exactly equal to the Fisher information matrix, and that the above quadratic program exactly recovers the Natural Policy Gradient update. Hence natural policy gradient can also be interpreted as performing sequential quadratic programming with a KL constraint over the policies' trajectory distributions.
To match the practical algorithms in the literature (e.g., TRPO), below we again focus on the episodic, finite-horizon setting (i.e., an MDP with horizon H).

Claim 12.5. Consider a finite horizon MDP with horizon H and any fixed $\theta_t$. We have:
$$
\nabla_\theta\mathrm{KL}\big(\mathrm{Pr}^{\pi_{\theta_t}}_\mu\big\|\mathrm{Pr}^{\pi_\theta}_\mu\big)\Big|_{\theta=\theta_t} = 0,
\qquad
\nabla^2_\theta\mathrm{KL}\big(\mathrm{Pr}^{\pi_{\theta_t}}_\mu\big\|\mathrm{Pr}^{\pi_\theta}_\mu\big)\Big|_{\theta=\theta_t}
= H\,\mathbb{E}_{s,a\sim d^{\pi_{\theta_t}}}\big[\nabla\ln\pi_{\theta_t}(a|s)\,(\nabla\ln\pi_{\theta_t}(a|s))^\top\big].
$$

Proof: We first recall the trajectory distribution in the finite horizon setting:
$$
\mathrm{Pr}^\pi_\mu(\tau) = \mu(s_0)\prod_{h=0}^{H-1}\pi(a_h|s_h)P(s_{h+1}|s_h,a_h).
$$
We first prove that the gradient of the KL divergence is zero. Note that:
$$
\mathrm{KL}\big(\mathrm{Pr}^{\pi_{\theta_t}}_\mu\big\|\mathrm{Pr}^{\pi_\theta}_\mu\big)
= \sum_\tau \mathrm{Pr}^{\pi_{\theta_t}}_\mu(\tau)\ln\frac{\mathrm{Pr}^{\pi_{\theta_t}}_\mu(\tau)}{\mathrm{Pr}^{\pi_\theta}_\mu(\tau)}
= \sum_\tau \mathrm{Pr}^{\pi_{\theta_t}}_\mu(\tau)\sum_{h=0}^{H-1}\ln\frac{\pi_{\theta_t}(a_h|s_h)}{\pi_\theta(a_h|s_h)}
= \sum_{h=0}^{H-1}\mathbb{E}_{s_h,a_h\sim\mathbb{P}^{\pi_{\theta_t}}_h}\left[\ln\frac{\pi_{\theta_t}(a_h|s_h)}{\pi_\theta(a_h|s_h)}\right].
$$
Thus, for $\nabla_\theta\mathrm{KL}\big(\mathrm{Pr}^{\pi_{\theta_t}}_\mu\|\mathrm{Pr}^{\pi_\theta}_\mu\big)$, we have:
$$
\nabla\mathrm{KL}\big(\mathrm{Pr}^{\pi_{\theta_t}}_\mu\big\|\mathrm{Pr}^{\pi_\theta}_\mu\big)\Big|_{\theta=\theta_t}
= -\sum_{h=0}^{H-1}\mathbb{E}_{s_h,a_h\sim\mathbb{P}^{\pi_{\theta_t}}_h}\big[\nabla\ln\pi_{\theta_t}(a_h|s_h)\big]
= -\sum_{h=0}^{H-1}\mathbb{E}_{s_h\sim\mathbb{P}^{\pi_{\theta_t}}_h}\mathbb{E}_{a_h\sim\pi_{\theta_t}(\cdot|s_h)}\big[\nabla\ln\pi_{\theta_t}(a_h|s_h)\big] = 0,
$$
where we have seen the last step when we argued the unbiasedness of the policy gradient with an action-independent baseline.
Now we move to the Hessian:
$$
\nabla^2\mathrm{KL}\big(\mathrm{Pr}^{\pi_{\theta_t}}_\mu\big\|\mathrm{Pr}^{\pi_\theta}_\mu\big)\Big|_{\theta=\theta_t}
= -\sum_{h=0}^{H-1}\mathbb{E}_{s_h,a_h\sim\mathbb{P}^{\pi_{\theta_t}}_h}\Big[\nabla^2\ln\pi_\theta(a_h|s_h)\big|_{\theta=\theta_t}\Big]
= -\sum_{h=0}^{H-1}\mathbb{E}_{s_h,a_h\sim\mathbb{P}^{\pi_{\theta_t}}_h}\left[\nabla\left(\frac{\nabla\pi_\theta(a_h|s_h)}{\pi_\theta(a_h|s_h)}\right)\right]
$$
$$
= -\sum_{h=0}^{H-1}\mathbb{E}_{s_h,a_h\sim\mathbb{P}^{\pi_{\theta_t}}_h}\left[\frac{\nabla^2\pi_\theta(a_h|s_h)}{\pi_\theta(a_h|s_h)} - \frac{\nabla\pi_\theta(a_h|s_h)\nabla\pi_\theta(a_h|s_h)^\top}{\pi^2_\theta(a_h|s_h)}\right]
= \sum_{h=0}^{H-1}\mathbb{E}_{s_h,a_h\sim\mathbb{P}^{\pi_{\theta_t}}_h}\big[\nabla_\theta\ln\pi_\theta(a_h|s_h)\nabla_\theta\ln\pi_\theta(a_h|s_h)^\top\big],
$$
where all gradients are evaluated at $\theta = \theta_t$, and in the last equality we use the fact that $\mathbb{E}_{s_h,a_h\sim\mathbb{P}^{\pi_{\theta_t}}_h}\big[\frac{\nabla^2\pi_\theta(a_h|s_h)}{\pi_\theta(a_h|s_h)}\big] = 0$. Finally, since $d^{\pi_{\theta_t}}(s,a) = \frac{1}{H}\sum_{h=0}^{H-1}\mathbb{P}^{\pi_{\theta_t}}_h(s,a)$, the last expression equals $H\,\mathbb{E}_{s,a\sim d^{\pi_{\theta_t}}}\big[\nabla\ln\pi_{\theta_t}(a|s)(\nabla\ln\pi_{\theta_t}(a|s))^\top\big]$.

The above claim shows that a second-order Taylor expansion of the KL constraint over trajectory distributions gives a local distance metric at $\theta_t$:
$$
(\theta-\theta_t)^\top F_{\theta_t}(\theta-\theta_t),
$$
where again $F_{\theta_t} := H\,\mathbb{E}_{s,a\sim d^{\pi_{\theta_t}}}\big[\nabla\ln\pi_{\theta_t}(a|s)(\nabla\ln\pi_{\theta_t}(a|s))^\top\big]$ is proportional to the Fisher information matrix. Note that $F_{\theta_t}$ is a PSD matrix and thus $d(\theta,\theta_t) := (\theta-\theta_t)^\top F_{\theta_t}(\theta-\theta_t)$ is a valid distance metric. Through sequential quadratic programming, we are using local geometric information of the trajectory distribution manifold induced by the parameterization θ, rather than the naive Euclidean distance in the parameter space. Such a method is sometimes referred to as Covariant Policy Search, as the policy update procedure is invariant to linear transformations of the parameterization (see Section 12.3 for further discussion).
Now, using the results from Claim 12.5, we can verify that the local policy optimization procedure in Eq. 0.2 exactly recovers the NPG update, where the step size is determined by the trust region parameter δ. Denoting $\Delta = \theta - \theta_t$, we have:
$$
\max_\Delta\ \big\langle\Delta,\ \nabla_\theta V^{\pi_{\theta_t}}\big\rangle
\quad\text{s.t.}\quad \Delta^\top F_{\theta_t}\Delta \le \delta,
$$
which gives the following update procedure:
$$
\theta_{t+1} = \theta_t + \Delta = \theta_t + \sqrt{\frac{\delta}{(\nabla V^{\pi_{\theta_t}})^\top F_{\theta_t}^{-1}\nabla V^{\pi_{\theta_t}}}}\cdot F_{\theta_t}^{-1}\nabla V^{\pi_{\theta_t}},
$$
where note that we use the self-normalized learning rate computed from the trust region parameter δ.
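As a concrete illustration (a sketch under assumed inputs, not a prescribed implementation), the update above can be computed from sampled score vectors and advantage estimates as follows; the damping term added to the Fisher estimate is a numerical assumption of ours, not part of the derivation.

import numpy as np

def npg_trpo_step(theta, grad_logp, adv, delta=0.01, damping=1e-3):
    # grad_logp[i] = gradient of log pi_{theta_t}(a_i | s_i); adv[i] = advantage estimate.
    n, p = grad_logp.shape
    g = (grad_logp * adv[:, None]).mean(axis=0)               # policy gradient estimate
    F = (grad_logp.T @ grad_logp) / n + damping * np.eye(p)   # Fisher matrix estimate
    nat_grad = np.linalg.solve(F, g)                          # F^{-1} g
    step_size = np.sqrt(delta / max(g @ nat_grad, 1e-12))     # self-normalized learning rate
    return theta + step_size * nat_grad

With exact gradients, this reproduces the self-normalized NPG step; with samples, it is the standard plug-in estimate of that step.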

12.2.1 Proximal Policy Optimization

Here we consider an $\ell_\infty$-style trust region constraint:
$$
\max_\theta\ \mathbb{E}_{s\sim d^{\pi_{\theta_t}}_\mu}\mathbb{E}_{a\sim\pi_\theta(\cdot|s)}\big[A^{\pi_{\theta_t}}(s,a)\big] \tag{0.3}
$$
$$
\text{s.t.}\quad \sup_s\ \big\|\pi_\theta(\cdot|s) - \pi_{\theta_t}(\cdot|s)\big\|_{tv} \le \delta. \tag{0.4}
$$

Namely, we restrict the new policy to be close to $\pi_{\theta_t}$ at every state s under the total variation distance. Recall CPI's update: CPI indeed makes sure that the new policy is close to the old policy at every state. In other words, the new policy computed by CPI is a feasible solution of the constraint Eq. 0.4, but not necessarily the optimal solution of the above constrained optimization program. Also, one downside of the CPI algorithm is that one needs to keep all previously learned policies around, which requires large storage space when policies are parameterized by large deep neural networks.
Proximal Policy Optimization (PPO) aims to directly optimize the objective Eq. 0.3 using multiple steps of gradient updates, approximating the constraint Eq. 0.4 via a clipping trick. We first rewrite the objective function using importance weighting:
$$
\max_\theta\ \mathbb{E}_{s\sim d^{\pi_{\theta_t}}_\mu}\mathbb{E}_{a\sim\pi_{\theta_t}(\cdot|s)}\left[\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)}A^{\pi_{\theta_t}}(s,a)\right], \tag{0.5}
$$
where we can easily approximate the expectation via finite samples $s\sim d^{\pi_{\theta_t}}_\mu$, $a\sim\pi_{\theta_t}(\cdot|s)$.

To make sure $\pi_\theta(a|s)$ and $\pi_{\theta_t}(a|s)$ are not too different, PPO modifies the objective function by clipping the density ratio between $\pi_\theta(a|s)$ and $\pi_{\theta_t}(a|s)$:
$$
L(\theta) := \mathbb{E}_{s\sim d^{\pi_{\theta_t}}_\mu}\mathbb{E}_{a\sim\pi_{\theta_t}(\cdot|s)}\left[\min\left\{\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)}A^{\pi_{\theta_t}}(s,a),\ \mathrm{clip}\left(\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)};\,1-\epsilon,\,1+\epsilon\right)A^{\pi_{\theta_t}}(s,a)\right\}\right], \tag{0.6}
$$
where
$$
\mathrm{clip}(x;\,1-\epsilon,\,1+\epsilon) = \begin{cases} 1-\epsilon & x \le 1-\epsilon,\\ 1+\epsilon & x \ge 1+\epsilon,\\ x & \text{otherwise.}\end{cases}
$$
The clipping operator ensures that, at any state-action pair where $\pi_\theta(a|s)/\pi_{\theta_t}(a|s)\notin[1-\epsilon,1+\epsilon]$, we get zero gradient, i.e., $\nabla_\theta\,\mathrm{clip}\big(\frac{\pi_\theta(a|s)}{\pi_{\theta_t}(a|s)};1-\epsilon,1+\epsilon\big)A^{\pi_{\theta_t}}(s,a) = 0$. The outer min makes sure the objective function L(θ) is a lower bound of the original objective. PPO then proposes to collect a dataset of pairs (s, a) with $s\sim d^{\pi_{\theta_t}}_\mu$ and $a\sim\pi_{\theta_t}(\cdot|s)$, and then perform multiple steps of mini-batch stochastic gradient ascent on L(θ). A sketch of this clipped objective is given below.
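The following is a minimal numpy sketch (illustrative, not the canonical PPO implementation) of the clipped surrogate in Eq. 0.6 evaluated on a batch of sampled pairs; logp_new, logp_old, adv, and eps are assumed inputs.

import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    # logp_new / logp_old are log pi_theta(a|s) and log pi_{theta_t}(a|s) on sampled (s, a);
    # adv holds advantage estimates A^{pi_{theta_t}}(s, a).
    ratio = np.exp(logp_new - logp_old)                # pi_theta(a|s) / pi_{theta_t}(a|s)
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps)
    return np.mean(np.minimum(ratio * adv, clipped * adv))

In practice one would compute logp_new with an autodiff framework and run several epochs of mini-batch gradient ascent on this objective using data collected from the current policy; note the ratio equals 1 at the first gradient step, where θ = θt.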
One of the key differences between PPO and other algorithms such as NPG is that PPO aims to optimize objective Eq. 0.5 via multiple steps of mini-batch stochastic gradient ascent with mini-batch data from $d^{\pi_{\theta_t}}_\mu$ and $\pi_{\theta_t}$, while algorithms such as NPG optimize the first-order Taylor expansion of Eq. 0.5 at $\theta_t$, i.e.,
$$
\max_\theta\ \Big\langle\theta-\theta_t,\ \mathbb{E}_{s,a\sim d^{\pi_{\theta_t}}_\mu}\big[\nabla_\theta\ln\pi_{\theta_t}(a|s)A^{\pi_{\theta_t}}(s,a)\big]\Big\rangle,
$$
subject to some trust region constraint (e.g., $\|\theta-\theta_t\|_2 \le \delta$ for policy gradient, and $\|\theta-\theta_t\|^2_{F_{\theta_t}} \le \delta$ for NPG).

12.3 Bibliographic Remarks and Further Readings

The analysis of CPI is adapted from the original one in [Kakade and Langford, 2002]. There have been a few further
interpretations of CPI. One interesting perspective is that CPI can be treated as a boosting algorithm [Scherrer and
Geist, 2014].
More generally, CPI and NPG are part of a family of incremental algorithms, including Policy Search by Dynamic Programming (PSDP) [Bagnell et al., 2004] and MD-MPI [Geist et al., 2019]. PSDP operates in a finite horizon setting and optimizes a sequence of time-dependent policies; going from the last time step to the first, every iteration of PSDP only updates the policy at the current time step while holding the future policies fixed, thus making an incremental update to the policy. See [Scherrer, 2014] for a more detailed discussion and comparison of some of these approaches. The Mirror Descent-Modified Policy Iteration (MD-MPI) algorithm [Geist et al., 2019] is a family of actor-critic style algorithms which is based on regularization and is incremental in nature; with the negative entropy as the Bregman divergence (for the tabular case), MD-MPI recovers NPG in the tabular case (with the softmax parameterization).
Broadly speaking, these incremental algorithms can improve upon the stringent concentrability conditions for approximate value iteration methods, presented in Chapter 3. Scherrer [2014] provides a more detailed discussion of bounds which depend on these density ratios. As discussed in the last chapter, the density ratio for NPG can be interpreted as a factor due to transfer learning to a single, fixed distribution.
The interpretation of NPG as Covariant Policy Search is due to [Bagnell and Schneider, 2003], as the policy update
procedure will be invariant to linear transformations of the parameterization; see [Bagnell and Schneider, 2003] for a
more detailed discussion on this.
The TRPO algorithm is due to [Schulman et al., 2015]. The original TRPO analysis provides performance guarantees,
largely relying on a reduction to the CPI guarantees. In this Chapter, we make a sharper connection of TRPO to NPG,
which was subsequently observed by a number of researchers; this connection provides a sharper analysis for the
generalization and approximation behavior of TRPO (e.g. via the results presented in Chapter 11). In practice, a
popular variant is the Proximal Policy Optimization (PPO) algorithm [Schulman et al., 2017].

Part 4

Further Topics

Chapter 13

Linear Quadratic Regulators

This chapter introduces some of the fundamentals of optimal control for the linear quadratic regulator (LQR) model. This model is an MDP with continuous states and actions. While the model itself is often inadequate as a global model, it can be quite effective as a locally linear model (provided our system does not deviate from the regime where the linear model is a reasonable approximation).
The basics of optimal control theory can be found in any number of standard texts [Anderson and Moore, 1990, Evans, 2005, Bertsekas, 2017]. The treatment of Gauss-Newton and the NPG algorithm is due to Fazel et al. [2018].

13.1 The LQR Model

In the standard optimal control problem, a dynamical system is described as

xt+1 = ft (xt , ut , wt ) ,

where ft maps a state xt ∈ Rd , a control (the action) ut ∈ Rk , and a disturbance wt , to the next state xt+1 ∈ Rd ,
starting from an initial state $x_0$. The objective is to find the control policy π which minimizes the long term cost,
$$
\text{minimize}\quad \mathbb{E}_\pi\left[\sum_{t=0}^{H}c_t(x_t,u_t)\right]
\qquad\text{such that}\quad x_{t+1} = f_t(x_t,u_t,w_t),\quad t = 0,\dots,H,
$$

where H is the time horizon (which can be finite or infinite).


In practice, this is often solved by considering the linearized control (sub-)problem where the dynamics are approximated by
$$x_{t+1} = A_tx_t + B_tu_t + w_t,$$
where the matrices $A_t$ and $B_t$ are derivatives of the dynamics f, and where the costs are approximated by a quadratic function in $x_t$ and $u_t$.
This chapter focuses on an important special case: the finite and infinite horizon problems referred to as the linear quadratic regulator (LQR) problem. We can view this model as a local approximation to a non-linear model. However, we will analyze these models under the assumption that they are globally valid.

Finite Horizon LQRs. The finite horizon LQR problem is given by
$$
\text{minimize}\quad \mathbb{E}\left[x_H^\top Qx_H + \sum_{t=0}^{H-1}\big(x_t^\top Qx_t + u_t^\top Ru_t\big)\right]
\qquad\text{such that}\quad x_{t+1} = A_tx_t + B_tu_t + w_t,\quad x_0\sim D,\ w_t\sim N(0,\sigma^2I),
$$
where the initial state $x_0\sim D$ is randomly distributed according to distribution D; the disturbance $w_t\in\mathbb{R}^d$ follows the law of a multivariate normal with covariance $\sigma^2I$; the matrices $A_t\in\mathbb{R}^{d\times d}$ and $B_t\in\mathbb{R}^{d\times k}$ are referred to as system (or transition) matrices; and $Q\in\mathbb{R}^{d\times d}$ and $R\in\mathbb{R}^{k\times k}$ are both positive definite matrices that parameterize the quadratic costs. Note that this model is a finite horizon MDP with $S = \mathbb{R}^d$ and $A = \mathbb{R}^k$.

Infinite Horizon LQRs. We also consider the infinite horizon LQR problem:
$$
\text{minimize}\quad \lim_{H\to\infty}\frac{1}{H}\,\mathbb{E}\left[\sum_{t=0}^{H}\big(x_t^\top Qx_t + u_t^\top Ru_t\big)\right]
\qquad\text{such that}\quad x_{t+1} = Ax_t + Bu_t + w_t,\quad x_0\sim D,\ w_t\sim N(0,\sigma^2I).
$$
Note that here we assume the dynamics are time homogeneous. We will assume that the optimal objective value (i.e. the optimal average cost) is finite; this is referred to as the system being controllable. This is a special case of an MDP with an average reward objective.
Throughout this chapter, we assume A and B are such that the optimal cost is finite. Due to the geometric nature of the system dynamics (say, for a controller which takes controls $u_t$ that are linear in the state $x_t$), there may exist linear controllers with infinite cost. This potential instability of LQRs (at least for some A, B and some controllers) means that theoretical analyses often make various assumptions on A and B in order to guarantee some notion of stability. The finite horizon setting is more commonly used in practice, particularly because the LQR model is only a good local approximation of the system dynamics; the infinite horizon model tends to be largely of theoretical interest. See Section 13.6.

The infinite horizon discounted case? The infinite horizon discounted case tends not to be studied for LQRs. This is largely because, for the undiscounted case (with the average cost objective), we may have infinite costs (due to the aforementioned geometric nature of the system dynamics); in such cases, discounting will not necessarily make the average cost finite.

13.2 Bellman Optimality:

Value Iteration & The Algebraic Riccati Equations

A standard result in optimal control theory shows that the optimal control input can be written as a linear function in
the state. As we shall see, this is a consequence of the Bellman equations.

13.2.1 Planning and Finite Horizon LQRs

Slightly abusing notation, it is convenient to define the value function and state-action value function with respect to the costs as follows. For a policy π, a state x, and $h\in\{0,\dots,H-1\}$, we define the value function $V^\pi_h : \mathbb{R}^d\to\mathbb{R}$ as
$$
V^\pi_h(x) = \mathbb{E}\left[x_H^\top Qx_H + \sum_{t=h}^{H-1}\big(x_t^\top Qx_t + u_t^\top Ru_t\big)\ \Big|\ \pi,\ x_h = x\right],
$$
where again the expectation is with respect to the randomness of the trajectory, that is, the randomness in the state transitions. Similarly, the state-action value (or Q-value) function $Q^\pi_h : \mathbb{R}^d\times\mathbb{R}^k\to\mathbb{R}$ is defined as
$$
Q^\pi_h(x,u) = \mathbb{E}\left[x_H^\top Qx_H + \sum_{t=h}^{H-1}\big(x_t^\top Qx_t + u_t^\top Ru_t\big)\ \Big|\ \pi,\ x_h = x,\ u_h = u\right].
$$
We define $V^\star$ and $Q^\star$ analogously.


The following theorem provides a characterization of the optimal policy via the algebraic Riccati equations. These equations are simply the value iteration algorithm for the special case of LQRs.

Theorem 13.1. (Value Iteration and the Riccati Equations). Suppose R is positive definite. The optimal policy is a linear controller specified by:
$$\pi^\star(x_t) = -K^\star_tx_t,$$
where
$$K^\star_t = (B_t^\top P_{t+1}B_t + R)^{-1}B_t^\top P_{t+1}A_t.$$
Here, $P_t$ can be computed iteratively, in a backwards manner, using the following algebraic Riccati equations, where for $t\in[H]$,
$$
P_t := A_t^\top P_{t+1}A_t + Q - A_t^\top P_{t+1}B_t(B_t^\top P_{t+1}B_t + R)^{-1}B_t^\top P_{t+1}A_t
= A_t^\top P_{t+1}A_t + Q - (K^\star_t)^\top(B_t^\top P_{t+1}B_t + R)K^\star_t,
$$
and where $P_H = Q$. (The above equation is simply the value iteration algorithm.)
Furthermore, for $t\in[H]$, we have that:
$$
V^\star_t(x) = x^\top P_tx + \sigma^2\sum_{h=t+1}^{H}\mathrm{Trace}(P_h).
$$
We often refer to the $K^\star_t$ as the optimal control gain matrices. It is straightforward to generalize the above when σ ≥ 0. We have assumed that R is strictly positive definite to avoid having to work with the pseudo-inverse; the theorem is still true when R = 0, provided we use a pseudo-inverse.
Proof: By the Bellman optimality conditions (Theorem 1.9 for episodic MDPs), we know the optimal policy (among all possibly history-dependent, non-stationary, and randomized policies) is given by a deterministic policy which is only a function of $x_t$ and t. We have that:
$$
Q^\star_{H-1}(x,u) = \mathbb{E}\Big[(A_{H-1}x + B_{H-1}u + w_{H-1})^\top Q(A_{H-1}x + B_{H-1}u + w_{H-1})\Big] + x^\top Qx + u^\top Ru
$$
$$
= (A_{H-1}x + B_{H-1}u)^\top Q(A_{H-1}x + B_{H-1}u) + \sigma^2\mathrm{Trace}(Q) + x^\top Qx + u^\top Ru,
$$
since $x_H = A_{H-1}x + B_{H-1}u + w_{H-1}$ and $\mathbb{E}[w_{H-1}^\top Qw_{H-1}] = \sigma^2\mathrm{Trace}(Q)$. Since this is a quadratic function of u, we can immediately derive that the optimal control is given by:
$$
\pi^\star_{H-1}(x) = -(B_{H-1}^\top QB_{H-1} + R)^{-1}B_{H-1}^\top QA_{H-1}x = -K^\star_{H-1}x,
$$
where the last step uses that $P_H := Q$.

For notational convenience, let $K = K^\star_{H-1}$, $A = A_{H-1}$, and $B = B_{H-1}$. Using the optimal control at x, i.e. $u = -K^\star_{H-1}x$, we have:
$$
V^\star_{H-1}(x) = Q^\star_{H-1}(x,-K^\star_{H-1}x)
= x^\top\Big((A-BK)^\top Q(A-BK) + Q + K^\top RK\Big)x + \sigma^2\mathrm{Trace}(Q)
$$
$$
= x^\top\Big(A^\top QA + Q - 2K^\top B^\top QA + K^\top(B^\top QB + R)K\Big)x + \sigma^2\mathrm{Trace}(Q)
= x^\top\Big(A^\top QA + Q - 2K^\top(B^\top QB + R)K + K^\top(B^\top QB + R)K\Big)x + \sigma^2\mathrm{Trace}(Q)
$$
$$
= x^\top\Big(A^\top QA + Q - K^\top(B^\top QB + R)K\Big)x + \sigma^2\mathrm{Trace}(Q)
= x^\top P_{H-1}x + \sigma^2\mathrm{Trace}(Q),
$$
where the fourth step uses our expression for $K = K^\star_{H-1}$. This proves our claim for $t = H-1$.
This implies that:
$$
Q^\star_{H-2}(x,u) = \mathbb{E}\big[V^\star_{H-1}(A_{H-2}x + B_{H-2}u + w_{H-2})\big] + x^\top Qx + u^\top Ru
$$
$$
= (A_{H-2}x + B_{H-2}u)^\top P_{H-1}(A_{H-2}x + B_{H-2}u) + \sigma^2\mathrm{Trace}(P_{H-1}) + \sigma^2\mathrm{Trace}(Q) + x^\top Qx + u^\top Ru.
$$
The remainder of the proof follows from a recursive argument, which can be verified along identical lines to the $t = H-1$ case.
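The backward recursion of Theorem 13.1 is easy to implement directly. Below is a minimal numerical sketch (an illustration, not part of the original text) for a time-homogeneous system; the example matrices are assumed toy values.

import numpy as np

def finite_horizon_lqr(A, B, Q, R, H):
    # P_H = Q; then for t = H-1, ..., 0:
    #   K_t = (B^T P_{t+1} B + R)^{-1} B^T P_{t+1} A
    #   P_t = A^T P_{t+1} A + Q - A^T P_{t+1} B K_t
    P = Q.copy()
    gains = []
    for _ in range(H):
        K = np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)
        P = A.T @ P @ A + Q - A.T @ P @ B @ K
        gains.append(K)
    gains.reverse()                 # gains[t] = K_t, control u_t = -K_t x_t
    return gains, P                 # the final P equals P_0

# Example usage on an (assumed) double-integrator system:
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
gains, P0 = finite_horizon_lqr(A, B, Q, R, H=50)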

13.2.2 Planning and Infinite Horizon LQRs

Theorem 13.2. Suppose that the optimal cost is finite and that R is positive definite. Let P be a solution to the following algebraic Riccati equation:
$$
P = A^\top PA + Q - A^\top PB(B^\top PB + R)^{-1}B^\top PA. \tag{0.1}
$$
(Note that P is a positive definite matrix.) We have that the optimal policy is
$$\pi^\star(x) = -K^\star x,$$
where the optimal control gain is
$$K^\star = (B^\top PB + R)^{-1}B^\top PA. \tag{0.2}$$
We have that P is unique and that the optimal average cost is $\sigma^2\mathrm{Trace}(P)$.

As before, P parameterizes the optimal value function. We do not prove this theorem here, though it follows along similar lines to the previous proof, via a limiting argument.
To find P, we can again run the recursion:
$$
P \leftarrow Q + A^\top PA - A^\top PB(R + B^\top PB)^{-1}B^\top PA,
$$
starting with P = Q, which can be shown to converge to the unique positive semidefinite solution of the Riccati equation (since one can show the fixed-point iteration is contractive). Again, this approach is simply value iteration.
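For concreteness, here is a minimal sketch (an illustration under assumed toy system matrices) of this fixed-point iteration, checked against SciPy's discrete algebraic Riccati solver.

import numpy as np
from scipy.linalg import solve_discrete_are

def riccati_value_iteration(A, B, Q, R, iters=1000, tol=1e-10):
    P = Q.copy()
    for _ in range(iters):
        P_next = Q + A.T @ P @ A - A.T @ P @ B @ np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        if np.max(np.abs(P_next - P)) < tol:
            return P_next
        P = P_next
    return P

A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R = np.eye(2), np.eye(1)
P = riccati_value_iteration(A, B, Q, R)
assert np.allclose(P, solve_discrete_are(A, B, Q, R), atol=1e-6)
K_star = np.linalg.solve(B.T @ P @ B + R, B.T @ P @ A)   # optimal gain, u = -K x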

13.3 Convex Programs to find P and $K^\star$

For the infinite horizon LQR problem, the optimization may be formulated as a convex program; in particular, it can be expressed as a semidefinite program (SDP) with variable P. We now present this primal program along with the dual program.

Note that these programs are the analogues of the linear programs from Section 1.5 for MDPs. While specifying these
linear programs for an LQR (as an LQR is an MDP) would result in infinite dimensional linear programs, the special
structure of the LQR implies these primal and dual programs have a more compact formulation when specified as an
SDP.

13.3.1 The Primal for Infinite Horizon LQR

The primal optimization problem is given as:
$$
\begin{aligned}
\text{maximize}\quad & \sigma^2\,\mathrm{Trace}(P)\\
\text{subject to}\quad & \begin{pmatrix} A^\top PA + Q - P & A^\top PB\\ B^\top PA & B^\top PB + R\end{pmatrix}\succeq 0,\quad P\succeq 0,
\end{aligned}
$$
where the optimization variable is P. This SDP has a unique solution, $P^\star$, which satisfies the algebraic Riccati equation (Equation 0.1); the optimal average cost of the infinite horizon LQR is $\sigma^2\mathrm{Trace}(P^\star)$; and the optimal policy is given by Equation 0.2.
The SDP can be derived by relaxing the equality in the Riccati equation to an inequality and then using the Schur complement lemma to rewrite the resulting Riccati inequality as a linear matrix inequality. In particular, we can consider the relaxation where P must satisfy:
$$
P \preceq A^\top PA + Q - A^\top PB(B^\top PB + R)^{-1}B^\top PA.
$$
That the solution to this relaxed optimization problem leads to the optimal $P^\star$ is due to the Bellman optimality conditions.
Now, the Schur complement lemma for positive semi-definiteness is as follows: define
$$
X = \begin{pmatrix} D & E\\ E^\top & F\end{pmatrix}
$$
(for matrices D, E, and F of appropriate size, with D and F being square symmetric matrices and D positive definite). We have that X is PSD if and only if
$$
F - E^\top D^{-1}E \succeq 0.
$$
This shows that the constraint set is equivalent to the above relaxation.

13.3.2 The Dual

The dual optimization problem is given as:
$$
\begin{aligned}
\text{minimize}\quad & \mathrm{Trace}\left(\Sigma\cdot\begin{pmatrix} Q & 0\\ 0 & R\end{pmatrix}\right)\\
\text{subject to}\quad & \Sigma_{xx} = (A\ \ B)\,\Sigma\,(A\ \ B)^\top + \sigma^2I,\quad \Sigma\succeq 0,
\end{aligned}
$$
where the optimization variable is a symmetric $(d+k)\times(d+k)$ matrix Σ with the block structure:
$$
\Sigma = \begin{pmatrix}\Sigma_{xx} & \Sigma_{xu}\\ \Sigma_{ux} & \Sigma_{uu}\end{pmatrix}.
$$
The interpretation of Σ is that it is the covariance matrix of the stationary distribution. This is analogous to the state-visitation measure for an MDP.
This SDP has a unique solution, say $\Sigma^\star$. The optimal gain matrix is then given by:
$$
K^\star = -\Sigma^\star_{ux}(\Sigma^\star_{xx})^{-1}.
$$

13.4 Policy Iteration, Gauss Newton, and NPG

Note that the noise variance $\sigma^2$ does not impact the optimal policy, in either the discounted case or in the infinite horizon case.
Here, when we examine local search methods, it is more convenient to work with the case where σ = 0. In this case, we can work with the cumulative cost rather than the average cost. Precisely, when σ = 0, the infinite horizon LQR problem takes the form:
$$
\min_K C(K),\qquad\text{where}\quad C(K) = \mathbb{E}_{x_0\sim D}\left[\sum_{t=0}^{\infty}\big(x_t^\top Qx_t + u_t^\top Ru_t\big)\right],
$$
where the dynamics evolve as
$$x_{t+1} = (A - BK)x_t.$$
Note that we have directly parameterized our policy as a linear policy in terms of the gain matrix K, since we know the optimal policy is linear in the state. Again, we assume that $C(K^\star)$ is finite; this assumption is referred to as the system being controllable.
We now examine local search based approaches, where we will see a close connection to policy iteration. Again, we have a non-convex optimization problem:
Lemma 13.3. (Non-convexity) If d ≥ 3, there exists an LQR optimization problem, $\min_K C(K)$, which is not convex or quasi-convex.
Regardless, we will see that gradient based approaches are effective. For local search based approaches, the importance of (some) randomization, either in $x_0$ or through a noise disturbance, is analogous to our use of a wide-coverage distribution µ (for MDPs).

13.4.1 Gradient Expressions

Gradient descent on C(K), with a fixed stepsize η, follows the update rule:
$$K \leftarrow K - \eta\nabla C(K).$$
It is helpful to explicitly write out the functional form of the gradient. Define $P_K$ as the solution to:
$$P_K = Q + K^\top RK + (A-BK)^\top P_K(A-BK),$$
and, under this definition, it follows that C(K) can be written as:
$$C(K) = \mathbb{E}_{x_0\sim D}\big[x_0^\top P_Kx_0\big].$$
Also, define $\Sigma_K$ as the (un-normalized) state correlation matrix, i.e.
$$\Sigma_K = \mathbb{E}_{x_0\sim D}\sum_{t=0}^{\infty}x_tx_t^\top.$$
Lemma 13.4. (Policy Gradient Expression) The policy gradient is:
$$\nabla C(K) = 2\Big((R + B^\top P_KB)K - B^\top P_KA\Big)\Sigma_K.$$
For convenience, define $E_K$ to be
$$E_K = (R + B^\top P_KB)K - B^\top P_KA,$$
so that the gradient can be written as $\nabla C(K) = 2E_K\Sigma_K$.

Proof: Observe:
$$
C_K(x_0) = x_0^\top P_Kx_0 = x_0^\top\big(Q + K^\top RK\big)x_0 + x_0^\top(A-BK)^\top P_K(A-BK)x_0
= x_0^\top\big(Q + K^\top RK\big)x_0 + C_K\big((A-BK)x_0\big).
$$
Let ∇ denote the gradient with respect to K; note that $\nabla C_K((A-BK)x_0)$ has two terms as a function of K, one with respect to the K in the subscript and one with respect to the input $(A-BK)x_0$. This implies
$$
\nabla C_K(x_0) = 2RKx_0x_0^\top - 2B^\top P_K(A-BK)x_0x_0^\top + \nabla C_K(x_1)\big|_{x_1=(A-BK)x_0}
= 2\Big((R + B^\top P_KB)K - B^\top P_KA\Big)\sum_{t=0}^{\infty}x_tx_t^\top,
$$
where we have used the recursion and that $x_1 = (A-BK)x_0$. Taking expectations completes the proof.
The natural policy gradient. Let us now motivate a version of the natural gradient. The natural policy gradient follows the update:
$$
\theta \leftarrow \theta - \eta F(\theta)^{-1}\nabla C(\theta),\qquad\text{where}\quad F(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty}\nabla\log\pi_\theta(u_t|x_t)\nabla\log\pi_\theta(u_t|x_t)^\top\right]
$$
is the Fisher information matrix. A natural special case is a linear policy with additive Gaussian noise, i.e.
$$
\pi_K(\cdot|x) = \mathcal{N}(-Kx,\ \sigma^2I), \tag{0.3}
$$
where $K\in\mathbb{R}^{k\times d}$ and $\sigma^2$ is the noise variance. In this case, the natural policy gradient of K (when σ is considered fixed) takes the form:
$$
K \leftarrow K - \eta\nabla C(\pi_K)\Sigma_K^{-1}. \tag{0.4}
$$
Note that a subtlety here is that $C(\pi_K)$ is the cost of the randomized policy.
To see this, one can verify that the Fisher matrix of size $kd\times kd$, which is indexed as $[G_K]_{(i,j),(i',j')}$ where $i,i'\in\{1,\dots,k\}$ and $j,j'\in\{1,\dots,d\}$, has a block diagonal form where the only non-zero blocks are $[G_K]_{(i,\cdot),(i,\cdot)} = \Sigma_K$ (this is the block corresponding to the i-th coordinate of the action, as i ranges from 1 to k). This form holds more generally, for any diagonal noise covariance.

13.4.2 Convergence Rates

We consider three exact update rules, where we assume access to exact gradients. As before, we can also estimate these gradients through simulation. For gradient descent, the update is
$$
K_{n+1} = K_n - \eta\nabla C(K_n). \tag{0.5}
$$
For natural policy gradient descent, the direction is defined so that it is consistent with the stochastic case, as per Equation 0.4; in the exact case the update is:
$$
K_{n+1} = K_n - \eta\nabla C(K_n)\Sigma_{K_n}^{-1}. \tag{0.6}
$$
One can show that the Gauss-Newton update is:
$$
K_{n+1} = K_n - \eta(R + B^\top P_{K_n}B)^{-1}\nabla C(K_n)\Sigma_{K_n}^{-1}. \tag{0.7}
$$
(Gauss-Newton is a non-linear optimization approach which uses a certain Hessian approximation; it can be shown that this leads to the above update rule.) Interestingly, for the case when η = 1, the Gauss-Newton method is equivalent to the policy iteration algorithm, which optimizes a one-step deviation from the current policy.
The Gauss-Newton method requires the most complex oracle to implement: it requires access to ∇C(K), $\Sigma_K$, and $R + B^\top P_KB$; as we shall see, it also enjoys the strongest convergence rate guarantee. At the other extreme, gradient descent requires oracle access to only ∇C(K) and has the slowest convergence rate. The natural policy gradient sits in between, requiring oracle access to ∇C(K) and $\Sigma_K$, and having a convergence rate between the other two methods.
In this theorem, $\|M\|_2$ denotes the spectral norm of a matrix M.
Theorem 13.5. (Global Convergence of Gradient Methods) Suppose $C(K_0)$ is finite and, for µ defined as
$$
\mu := \sigma_{\min}\big(\mathbb{E}_{x_0\sim D}[x_0x_0^\top]\big),
$$
suppose µ > 0.

• Gauss-Newton case: For η = 1, the Gauss-Newton algorithm (Equation 0.7) enjoys the following performance bound:
$$
C(K_N) - C(K^\ast) \le \epsilon,\qquad\text{for}\quad N \ge \frac{\|\Sigma_{K^\ast}\|_2}{\mu}\log\frac{C(K_0) - C(K^\ast)}{\epsilon}.
$$
• Natural policy gradient case: For a stepsize $\eta = 1/\big(\|R\|_2 + \frac{\|B\|_2^2C(K_0)}{\mu}\big)$, natural policy gradient descent (Equation 0.6) enjoys the following performance bound:
$$
C(K_N) - C(K^\ast) \le \epsilon,\qquad\text{for}\quad N \ge \frac{\|\Sigma_{K^\ast}\|_2}{\mu}\left(\frac{\|R\|_2}{\sigma_{\min}(R)} + \frac{\|B\|_2^2C(K_0)}{\mu\,\sigma_{\min}(R)}\right)\log\frac{C(K_0) - C(K^\ast)}{\epsilon}.
$$
• Gradient descent case: For any starting policy $K_0$, there exists a (constant) stepsize η (which could be a function of $K_0$), such that:
$$
C(K_N) \to C(K^\ast),\ \text{as}\ N\to\infty.
$$

13.4.3 Gauss-Newton Analysis

We only provide a proof for the Gauss-Newton case (see Section 13.6 for further readings).
We overload notation and let K denote the policy $\pi(x) = Kx$. For the infinite horizon cost function, define:
$$
V_K(x) := \sum_{t=0}^{\infty}\big(x_t^\top Qx_t + u_t^\top Ru_t\big) = x^\top P_Kx,
$$
and
$$
Q_K(x,u) := x^\top Qx + u^\top Ru + V_K(Ax + Bu),
$$
and
$$
A_K(x,u) := Q_K(x,u) - V_K(x).
$$
The next lemma is identical to the performance difference lemma.
Lemma 13.6. (Cost difference lemma) Suppose K and K′ have finite costs. Let $\{x'_t\}$ and $\{u'_t\}$ be the state and action sequences generated by K′, i.e. starting with $x'_0 = x$ and using $u'_t = -K'x'_t$. It holds that:
$$
V_{K'}(x) - V_K(x) = \sum_t A_K(x'_t,u'_t).
$$
Also, for any x, the advantage is:
$$
A_K(x,K'x) = 2x^\top(K'-K)^\top E_Kx + x^\top(K'-K)^\top(R + B^\top P_KB)(K'-K)x. \tag{0.8}
$$

Proof: Let $c'_t$ be the cost sequence generated by K′. Telescoping the sum appropriately:
$$
V_{K'}(x) - V_K(x) = \sum_{t=0}^{\infty}c'_t - V_K(x) = \sum_{t=0}^{\infty}\big(c'_t + V_K(x'_t) - V_K(x'_t)\big) - V_K(x)
= \sum_{t=0}^{\infty}\big(c'_t + V_K(x'_{t+1}) - V_K(x'_t)\big) = \sum_{t=0}^{\infty}A_K(x'_t,u'_t),
$$
which completes the first claim (the third equality uses the fact that $x = x_0 = x'_0$).
For the second claim, observe that:
$$
V_K(x) = x^\top\big(Q + K^\top RK\big)x + x^\top(A-BK)^\top P_K(A-BK)x.
$$
And, for $u = K'x$,
$$
A_K(x,u) = Q_K(x,u) - V_K(x)
= x^\top\big(Q + (K')^\top RK'\big)x + x^\top(A-BK')^\top P_K(A-BK')x - V_K(x)
$$
$$
= x^\top\big(Q + (K'-K+K)^\top R(K'-K+K)\big)x
+ x^\top\big(A-BK-B(K'-K)\big)^\top P_K\big(A-BK-B(K'-K)\big)x - V_K(x)
$$
$$
= 2x^\top(K'-K)^\top\Big((R + B^\top P_KB)K - B^\top P_KA\Big)x
+ x^\top(K'-K)^\top(R + B^\top P_KB)(K'-K)x,
$$
which completes the proof.
We have the following corollary, which can be viewed as an analogue of a smoothness lemma.
Corollary 13.7. (“Almost” smoothness) C(K) satisfies:
$$
C(K') - C(K) = -2\,\mathrm{Trace}\big(\Sigma_{K'}(K-K')^\top E_K\big) + \mathrm{Trace}\big(\Sigma_{K'}(K-K')^\top(R + B^\top P_KB)(K-K')\big).
$$
To see why this is related to smoothness (recall the definition of a smooth function in Equation 0.6), suppose K′ is sufficiently close to K so that:
$$
\Sigma_{K'} \approx \Sigma_K + O(\|K - K'\|), \tag{0.9}
$$
in which case the leading order term $2\,\mathrm{Trace}\big(\Sigma_{K'}(K'-K)^\top E_K\big)$ behaves as $\mathrm{Trace}\big((K'-K)^\top\nabla C(K)\big)$.
Proof: The claim immediately results from Lemma 13.6, by using Equation 0.8 and taking an expectation.
We now use this cost difference lemma to show that C(K) is gradient dominated.
Lemma 13.8. (Gradient domination) Let $K^\ast$ be an optimal policy. Suppose K has finite cost and µ > 0. It holds that:
$$
C(K) - C(K^\ast) \le \|\Sigma_{K^\ast}\|\,\mathrm{Trace}\big(E_K^\top(R + B^\top P_KB)^{-1}E_K\big)
\le \frac{\|\Sigma_{K^\ast}\|}{\mu^2\sigma_{\min}(R)}\mathrm{Trace}\big(\nabla C(K)^\top\nabla C(K)\big).
$$
Proof: From Equation 0.8 and by completing the square,
$$
A_K(x,K'x) = Q_K(x,K'x) - V_K(x)
= 2\,\mathrm{Trace}\big(xx^\top(K'-K)^\top E_K\big) + \mathrm{Trace}\big(xx^\top(K'-K)^\top(R + B^\top P_KB)(K'-K)\big)
$$
$$
= \mathrm{Trace}\Big(xx^\top\big(K'-K + (R + B^\top P_KB)^{-1}E_K\big)^\top(R + B^\top P_KB)\big(K'-K + (R + B^\top P_KB)^{-1}E_K\big)\Big)
- \mathrm{Trace}\big(xx^\top E_K^\top(R + B^\top P_KB)^{-1}E_K\big)
$$
$$
\ge -\mathrm{Trace}\big(xx^\top E_K^\top(R + B^\top P_KB)^{-1}E_K\big), \tag{0.10}
$$
with equality when $K' = K - (R + B^\top P_KB)^{-1}E_K$.
Let $x^\ast_t$ and $u^\ast_t$ be the sequence generated under $K^\ast$. Using this and Lemma 13.6,
$$
C(K) - C(K^\ast) = -\mathbb{E}\sum_t A_K(x^\ast_t,u^\ast_t)
\le \mathbb{E}\sum_t\mathrm{Trace}\big(x^\ast_t(x^\ast_t)^\top E_K^\top(R + B^\top P_KB)^{-1}E_K\big)
= \mathrm{Trace}\big(\Sigma_{K^\ast}E_K^\top(R + B^\top P_KB)^{-1}E_K\big)
\le \|\Sigma_{K^\ast}\|\,\mathrm{Trace}\big(E_K^\top(R + B^\top P_KB)^{-1}E_K\big),
$$
which completes the proof of the first inequality. For the second inequality, observe:
$$
\mathrm{Trace}\big(E_K^\top(R + B^\top P_KB)^{-1}E_K\big)
\le \big\|(R + B^\top P_KB)^{-1}\big\|\,\mathrm{Trace}\big(E_K^\top E_K\big)
\le \frac{1}{\sigma_{\min}(R)}\mathrm{Trace}\big(E_K^\top E_K\big)
$$
$$
= \frac{1}{4\sigma_{\min}(R)}\mathrm{Trace}\big(\Sigma_K^{-1}\nabla C(K)^\top\nabla C(K)\Sigma_K^{-1}\big)
\le \frac{1}{4\,\sigma_{\min}(\Sigma_K)^2\,\sigma_{\min}(R)}\mathrm{Trace}\big(\nabla C(K)^\top\nabla C(K)\big)
\le \frac{1}{\mu^2\sigma_{\min}(R)}\mathrm{Trace}\big(\nabla C(K)^\top\nabla C(K)\big),
$$
which completes the proof of the upper bound. Here we used $\nabla C(K) = 2E_K\Sigma_K$; the last step is because $\Sigma_K \succeq \mathbb{E}[x_0x_0^\top]$, so $\sigma_{\min}(\Sigma_K) \ge \mu$ (and we drop the factor of 1/4).

The next lemma bounds the one-step progress of Gauss-Newton.

Lemma 13.9. (Gauss-Newton Contraction) Suppose that:
$$
K' = K - \eta(R + B^\top P_KB)^{-1}\nabla C(K)\Sigma_K^{-1}.
$$
If η ≤ 1, then
$$
C(K') - C(K^\ast) \le \left(1 - \frac{\eta\mu}{\|\Sigma_{K^\ast}\|}\right)\big(C(K) - C(K^\ast)\big).
$$
Proof: Observe that for PSD matrices A and B, we have that $\mathrm{Trace}(AB) \ge \sigma_{\min}(A)\mathrm{Trace}(B)$. Also, observe $K' = K - \eta(R + B^\top P_KB)^{-1}E_K$. Using Corollary 13.7 and the condition on η,
$$
C(K') - C(K)
= -2\eta\,\mathrm{Trace}\big(\Sigma_{K'}E_K^\top(R + B^\top P_KB)^{-1}E_K\big) + \eta^2\,\mathrm{Trace}\big(\Sigma_{K'}E_K^\top(R + B^\top P_KB)^{-1}E_K\big)
$$
$$
\le -\eta\,\mathrm{Trace}\big(\Sigma_{K'}E_K^\top(R + B^\top P_KB)^{-1}E_K\big)
\le -\eta\,\sigma_{\min}(\Sigma_{K'})\,\mathrm{Trace}\big(E_K^\top(R + B^\top P_KB)^{-1}E_K\big)
$$
$$
\le -\eta\mu\,\mathrm{Trace}\big(E_K^\top(R + B^\top P_KB)^{-1}E_K\big)
\le -\frac{\eta\mu}{\|\Sigma_{K^\ast}\|}\big(C(K) - C(K^\ast)\big),
$$
where the last step uses Lemma 13.8.

With this lemma, the proof of the convergence rate of the Gauss-Newton algorithm is immediate.
Proof: (of Theorem 13.5, Gauss-Newton case) The theorem follows since η = 1 gives a contraction factor of $1 - \frac{\mu}{\|\Sigma_{K^\ast}\|}$ at every step.

13.5 System Level Synthesis for Linear Dynamical Systems

We demonstrate another parameterization of controllers which admits convexity. Specifically, in this section we consider the finite horizon setting, i.e.,
$$
x_{t+1} = Ax_t + Bu_t + w_t,\quad x_0\sim D,\ w_t\sim N(0,\sigma^2I),\quad t\in\{0,1,\dots,H-1\}.
$$
Instead of focusing on quadratic cost functions, we consider general convex cost functions $c_t(x_t,u_t)$, and the goal is to optimize:
$$
\mathbb{E}_\pi\left[\sum_{t=0}^{H-1}c_t(x_t,u_t)\right],
$$
where π is a time-dependent policy.


While we relax the assumption on the cost function from quadratic costs to general convex costs, we still focus on linearly parameterized policies; i.e., we are still interested in searching for a sequence of time-dependent linear controllers $\{-K^\star_t\}_{t=0}^{H-1}$ with $u_t = -K^\star_tx_t$ that minimizes the above objective function. We again assume the system is controllable, i.e., the expected total cost under $\{-K^\star_t\}_{t=0}^{H-1}$ is finite.
Note that since the cost function is no longer quadratic, the Riccati equations we studied before do not hold, and it is not necessarily true that the value function of a linear controller will be a quadratic function.
We present another parameterization of linear controllers which admits convexity of the objective function with respect to the parameterization (recall that the objective function is non-convex with respect to the linear controllers $-K_t$). With the new parameterization, we will see that, due to the convexity, we can directly apply gradient descent to find the globally optimal solution.
Consider any fixed time-dependent linear controllers $\{-K_t\}_{t=0}^{H-1}$. We start by rolling out the linear system under these controllers. Note that during the rollout, once we observe $x_{t+1}$, we can compute $w_t$ as $w_t = x_{t+1} - Ax_t - Bu_t$. With $x_0\sim\mu$, we have at time step t:
$$
u_t = -K_tx_t = -K_t\big(Ax_{t-1} - BK_{t-1}x_{t-1} + w_{t-1}\big)
= -K_tw_{t-1} - K_t(A - BK_{t-1})x_{t-1}
$$
$$
= -K_tw_{t-1} - K_t(A - BK_{t-1})\big(Ax_{t-2} - BK_{t-2}x_{t-2} + w_{t-2}\big)
= -K_tw_{t-1} - K_t(A - BK_{t-1})w_{t-2} - K_t(A - BK_{t-1})(A - BK_{t-2})x_{t-2}
$$
$$
= \cdots = -K_t\left(\prod_{\tau=1}^{t}(A - BK_{t-\tau})\right)x_0 - \sum_{\tau=0}^{t-1}K_t\left(\prod_{h=1}^{t-1-\tau}(A - BK_{t-h})\right)w_\tau.
$$
Note that we can equivalently write $u_t$ using $x_0$ and the noises $w_0,\dots,w_{t-1}$. Hence, for time step t, let us introduce the following re-parameterization. Denote
$$
M_t := -K_t\prod_{\tau=1}^{t}(A - BK_{t-\tau}),\qquad
M_{\tau;t} := -K_t\prod_{h=1}^{t-1-\tau}(A - BK_{t-h}),\quad \tau\in\{0,\dots,t-1\}.
$$
Now we can express the control $u_t$ using $M_{\tau;t}$ for $\tau\in\{0,\dots,t-1\}$ and $M_t$ as follows:
$$
u_t = M_tx_0 + \sum_{\tau=0}^{t-1}M_{\tau;t}w_\tau,
$$
which is equal to $u_t = -K_tx_t$. Note that above we only reasoned about time step t; we can repeat the same calculation for all $t = 0,\dots,H-1$.
The above calculation essentially proves the following claim.
Claim 13.10. For any linear controllers $\pi := \{-K_0,\dots,-K_{H-1}\}$, there exists a parameterization $\tilde\pi := \{\{M_t, M_{0;t},\dots,M_{t-1;t}\}\}_{t=0}^{H-1}$ such that, when π and $\tilde\pi$ are executed under any initialization $x_0$ and any sequence of noises $\{w_t\}_{t=0}^{H-1}$, they generate exactly the same state-control trajectory.
We can execute $\tilde\pi := \{\{M_t, M_{0;t},\dots,M_{t-1;t}\}\}_{t=0}^{H-1}$ in the following way. Given any $x_0$, we execute $u_0 = M_0x_0$ and observe $x_1$; at time step t, with the observed $x_t$, we calculate $w_{t-1} = x_t - Ax_{t-1} - Bu_{t-1}$ and execute the control $u_t = M_tx_0 + \sum_{\tau=0}^{t-1}M_{\tau;t}w_\tau$. We repeat until we execute the last control $u_{H-1}$ and reach $x_H$. A code sketch of this execution is given below.
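Below is a minimal sketch (illustrative; the interface of the step function and all names are our own assumptions) of this execution procedure for the disturbance-based parameterization.

import numpy as np

def rollout_disturbance_controller(A, B, M, Mw, x0, step):
    # M[t]: matrix multiplying x0; Mw[t]: list of t matrices multiplying w_0,...,w_{t-1};
    # step(x, u) returns the next observed state, e.g. A @ x + B @ u + noise.
    H = len(M)
    xs, us, ws = [x0], [], []
    for t in range(H):
        x_t = xs[-1]
        if t > 0:
            # recover w_{t-1} from the observed transition
            ws.append(x_t - A @ xs[-2] - B @ us[-1])
        u_t = M[t] @ x0 + sum(Mw[t][tau] @ ws[tau] for tau in range(t))
        x_next = step(x_t, u_t)
        us.append(u_t)
        xs.append(x_next)
    return xs, us, ws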
What is the benefit of the above parameterization? Note that for any t, it is clearly over-parameterized: the simple linear controller $-K_t$ has $d\times k$ parameters, while the new controller $\{M_t; \{M_{\tau;t}\}_{\tau=0}^{t-1}\}$ has $(t+1)\times d\times k$ parameters. The benefit of the above parameterization is that the objective function is now convex! The following claim formally shows the convexity.
Claim 13.11. Given $\tilde\pi := \{\{M_t, M_{0;t},\dots,M_{t-1;t}\}\}_{t=0}^{H-1}$, denote its expected total cost as $J(\tilde\pi) := \mathbb{E}\big[\sum_{t=0}^{H-1}c_t(x_t,u_t)\big]$, where the expectation is with respect to the noise $w_t\sim N(0,\sigma^2I)$ and the initial state $x_0\sim\mu$, and each $c_t$ is a convex function of (x, u). We have that $J(\tilde\pi)$ is convex with respect to the parameters $\{\{M_t, M_{0;t},\dots,M_{t-1;t}\}\}_{t=0}^{H-1}$.

Proof: Consider a fixed $x_0$ and a fixed sequence of noises $\{w_t\}_{t=0}^{H-1}$. Recall that we can write $u_t$ as:
$$
u_t = M_tx_0 + \sum_{\tau=0}^{t-1}M_{\tau;t}w_\tau,
$$
which is clearly linear with respect to the parameterization $\{M_t; M_{0;t},\dots,M_{t-1;t}\}$.
For $x_t$, we can show by induction that it is linear with respect to $\{M_\tau; M_{0;\tau},\dots,M_{\tau-1;\tau}\}_{\tau=0}^{t-1}$. This is clearly true for $x_0$, which is independent of the parameterization. Assume the claim holds for $x_t$ with t ≥ 0; we now check $x_{t+1}$. We have:
$$
x_{t+1} = Ax_t + Bu_t = Ax_t + BM_tx_0 + \sum_{\tau=0}^{t-1}BM_{\tau;t}w_\tau.
$$
By the inductive hypothesis, $x_t$ is linear with respect to $\{M_\tau; M_{0;\tau},\dots,M_{\tau-1;\tau}\}_{\tau=0}^{t-1}$, and the term $BM_tx_0 + \sum_{\tau=0}^{t-1}BM_{\tau;t}w_\tau$ is clearly linear with respect to $\{M_t; M_{0;t},\dots,M_{t-1;t}\}$. Together, this implies that $x_{t+1}$ is linear with respect to $\{M_\tau; M_{0;\tau},\dots,M_{\tau-1;\tau}\}_{\tau=0}^{t}$. We conclude that every $x_t$, for $t = 0,\dots,H-1$, is linear with respect to $\{M_\tau; M_{0;\tau},\dots,M_{\tau-1;\tau}\}_{\tau=0}^{H-1}$.
Note that the total cost of the trajectory is $\sum_{t=0}^{H-1}c_t(x_t,u_t)$. Since $c_t$ is convex with respect to $(x_t,u_t)$, and $x_t$ and $u_t$ are linear with respect to $\{M_\tau; M_{0;\tau},\dots,M_{\tau-1;\tau}\}_{\tau=0}^{H-1}$, we have that $\sum_{t=0}^{H-1}c_t(x_t,u_t)$ is convex with respect to $\{M_\tau; M_{0;\tau},\dots,M_{\tau-1;\tau}\}_{\tau=0}^{H-1}$.
In the last step, we simply take the expectation with respect to $x_0$ and $w_0,\dots,w_{H-1}$; since an expectation of convex functions is convex, this concludes the proof.
The above immediately suggests that (sub)gradient-based algorithms, such as projected gradient descent on the parameters $\{M_t, M_{0;t},\dots,M_{t-1;t}\}_{t=0}^{H-1}$, can converge to the globally optimal solution for any convex cost functions $c_t$. Recall that the best linear controllers $\{-K^\star_t\}_{t=0}^{H-1}$ have their own corresponding parameterization $\{M^\star_t, M^\star_{0;t},\dots,M^\star_{t-1;t}\}_{t=0}^{H-1}$. Thus, gradient-based optimization (with some care regarding the boundedness of the parameters $\{M^\star_t, M^\star_{0;t},\dots,M^\star_{t-1;t}\}_{t=0}^{H-1}$) can find a solution that is at least as good as the best linear controllers $\{-K^\star_t\}_{t=0}^{H-1}$.

Remark The above claims easily extend to time-dependent transitions and cost functions, i.e., when $A_t$, $B_t$, $c_t$ are time-dependent. One can also extend this to the episodic online control setting with adversarial noises $w_t$ and adversarial cost functions, using no-regret online convex programming algorithms [Shalev-Shwartz, 2011]. In episodic online control, at every episode k an adversary determines a sequence of bounded noises $w^k_0,\dots,w^k_{H-1}$ and a cost function $c^k(x,u)$; the learner proposes a sequence of controllers $\tilde\pi^k = \{M^k_t, M^k_{0;t},\dots,M^k_{t-1;t}\}_{t=0}^{H-1}$ and executes them (the learner does not know the cost function $c^k$ until the end of the episode, and the noises are revealed as she observes $x^k_t$ and calculates $w^k_{t-1}$ as $x^k_t - Ax^k_{t-1} - Bu^k_{t-1}$); at the end of the episode, the learner observes $c^k$ and suffers total cost $\sum_{t=0}^{H-1}c^k(x^k_t,u^k_t)$. The goal of episodic online control is to be no-regret with respect to the best linear controllers in hindsight:
$$
\sum_{k=0}^{K-1}\sum_{t=0}^{H-1}c^k(x^k_t,u^k_t) - \min_{\{-K^\star_t\}_{t=0}^{H-1}}\sum_{k=0}^{K-1}J^k\big(\{-K^\star_t\}_{t=0}^{H-1}\big) = o(K),
$$
where we denote by $J^k(\{-K^\star_t\}_{t=0}^{H-1})$ the total cost of executing $\{-K^\star_t\}_{t=0}^{H-1}$ in episode k under $c^k$ and noises $\{w^k_t\}$, i.e., $\sum_{t=0}^{H-1}c^k(x_t,u_t)$ with $u_t = -K^\star_tx_t$ and $x_{t+1} = Ax_t + Bu_t + w^k_t$, for all t.

13.6 Bibliographic Remarks and Further Readings

The basics of optimal control theory can be found in any number of standard texts [Anderson and Moore, 1990, Evans, 2005, Bertsekas, 2017]. The primal and dual formulations for the continuous time LQR are derived in Balakrishnan and Vandenberghe [2003], while the dual formulation for the discrete time LQR, which we use here, is derived in Cohen et al. [2019].
The treatment of the Gauss-Newton, NPG, and PG algorithms is due to Fazel et al. [2018]. We have only provided the proof for the Gauss-Newton case; the proofs of the convergence rates for NPG and PG can be found in Fazel et al. [2018].
For many applications, the finite horizon LQR model is widely used as a model of locally linear dynamics, e.g. [Ahn et al., 2007, Todorov and Li, 2005, Tedrake, 2009, Perez et al., 2012]. The issue of instabilities is largely due to model misspecification and the accuracy of the Taylor expansion; it is less evident how the infinite horizon LQR model captures these issues. In contrast, for MDPs, practice tends to deal with stationary (and discounted) MDPs, since stationary policies are more convenient to represent and learn; here, the non-stationary, finite horizon MDP model tends to be more of theoretical interest, since it is straightforward (and often more effective) to simply incorporate temporal information into the state. Roughly, if our policy is parametric, then practical representational constraints lead us to use stationary policies (incorporating temporal information into the state), while if we tend to use non-parametric policies (e.g. through some rollout based procedure, say with “on the fly” computations like in model predictive control, e.g. [Williams et al., 2017]), then it is often more effective to work with finite horizon, non-stationary models.

Sample Complexity and Regret for LQRs. We have not treated the sample complexity of learning in an LQR (see
[Dean et al., 2017, Simchowitz et al., 2018, Mania et al., 2019] for rates). Here, the basic analysis follows from the
online regression approach, which was developed in the study of linear bandits [Dani et al., 2008, Abbasi-Yadkori
et al., 2011]; in particular, the self-normalized bound for vector-valued martingales [Abbasi-Yadkori et al., 2011] (see
Theorem A.5) provides a direct means to obtain sharp confidence intervals for estimating the system matrices A and
B from data from a single trajectory (e.g. see [Simchowitz and Foster, 2020]).
Another family of work provides regret analyses of online LQR problems [Abbasi-Yadkori and Szepesvári, 2011,
Dean et al., 2018, Mania et al., 2019, Cohen et al., 2019, Simchowitz and Foster, 2020]. Here, naive random search
suffices for sample efficient learning of LQRs [Simchowitz and Foster, 2020]. For the learning and control of more complex nonlinear dynamical systems, one would expect this to be insufficient and strategic exploration to be required for sample efficient learning, just as in the case of MDPs (e.g. the UCB-VI algorithm).

Convex Parameterization of Linear Controllers The convex parameterization in Section 13.5 is based on [Agarwal et al., 2019], which is equivalent to the System Level Synthesis (SLS) parameterization [Wang et al., 2019]. Agarwal et al. [2019] use the SLS parameterization in the infinite horizon online control setting and leverage a reduction to online learning with memory. Note that in the episodic online control setting, we can just use a classic no-regret online learner such as projected gradient descent [Zinkevich, 2003]. For partially observable linear systems, the classic Youla parameterization [Youla et al., 1976] provides a convex parameterization. We refer readers to [Simchowitz et al., 2020] for a more detailed discussion of the Youla parameterization and its generalizations. Moreover, using the performance difference lemma, the exact optimal control policy for adversarial noise with full observations can be characterized exactly [Foster and Simchowitz, 2020, Goel and Hassibi, 2020] and yields a form reminiscent of the SLS parameterization [Wang et al., 2019].

Chapter 14

Imitation Learning

In this chapter, we study imitation learning. Unlike the Reinforcement Learning setting, in Imitation Learning, we do
not have access to the ground truth reward function (or cost function), but instead, we have expert demonstrations. We
often assume that the expert is a policy that approximately optimizes the underlying reward (or cost) functions. The
goal is to leverage the expert demonstrations to learn a policy that performs as well as the expert.
We consider three settings of imitation learning: (1) the pure offline setting, where we only have expert demonstrations and no further real-world interaction is allowed; (2) the hybrid setting, where we have expert demonstrations and are also able to interact with the real world (e.g., have access to the ground truth transition dynamics); (3) the interactive setting, where we have an interactive expert and also have access to the underlying reward (cost) function.

14.1 Setting

We will focus on finite horizon MDPs $\mathcal{M} = \{S, A, r, \mu, P, H\}$ where $r$ is the reward function, which is unknown to the learner. We represent the expert as a closed-loop policy $\pi^\star: S \mapsto \Delta(A)$. For analysis simplicity, we assume the expert $\pi^\star$ is indeed the optimal policy of the original MDP $\mathcal{M}$ with respect to the ground truth reward $r$. Again, our goal is to learn a policy $\widehat{\pi}: S \mapsto \Delta(A)$ that performs as well as the expert, i.e., $V^{\widehat{\pi}}$ needs to be close to $V^\star$, where $V^\pi$ denotes the expected total reward of policy $\pi$ under the MDP $\mathcal{M}$. We denote $d^\pi$ as the state-action visitation of policy $\pi$ under $\mathcal{M}$.
We assume we have a pre-collected expert dataset in the format of $\{s_i^\star, a_i^\star\}_{i=1}^M$ where $s_i^\star, a_i^\star \sim d^{\pi^\star}$.

14.2 Offline IL: Behavior Cloning

We study offline IL here. Specifically, we study the classic Behavior Cloning algorithm.
We consider a policy class $\Pi = \{\pi : S \mapsto \Delta(A)\}$ and make the following realizability assumption.
Assumption 14.1. We assume Π is rich enough such that π ? ∈ Π.

For analysis simplicity, we assume Π is discrete. But our sample complexity will only scale with respect to ln(|Π|).
Behavior cloning is one of the simplest Imitation Learning algorithms: it only uses the expert data $\mathcal{D}^\star$ and does not require any further interaction with the MDP. It computes a policy via a reduction to supervised learning. Specifically,

we consider a reduction to Maximum Likelihood Estimation (MLE):
$$\text{Behavior Cloning (BC):}\quad \widehat{\pi} = \mathop{\mathrm{argmax}}_{\pi\in\Pi}\sum_{i=1}^{M}\ln\pi(a_i^\star \mid s_i^\star). \qquad (0.1)$$

Namely we try to find a policy from Π that has the maximum likelihood of fitting the training data. As this is a
reduction to MLE, we can leverage the existing classic analysis of MLE (e.g., Chapter 7 in [Geer and van de Geer,
2000]) to analyze the performance of the learned policy.
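As an illustration of the reduction, the following sketch instantiates Eq. 0.1 with a small, explicitly enumerated policy class and synthetic expert data (both are placeholders, not part of the text): the BC step is literally an argmax of the log-likelihood over $\Pi$.

```python
import numpy as np

# Minimal sketch of Behavior Cloning (Eq. 0.1) with an explicitly enumerated
# policy class Pi; the MDP-free expert data below is synthetic/illustrative.
S, A = 4, 3
rng = np.random.default_rng(0)

def random_policy():
    p = rng.random((S, A))
    return p / p.sum(axis=1, keepdims=True)   # rows are action distributions

Pi = [random_policy() for _ in range(50)]
expert = Pi[7]                                # pretend the expert is in Pi (realizability)

# Synthetic expert dataset (s_i, a_i), here with states drawn uniformly.
M = 500
states = rng.integers(S, size=M)
actions = np.array([rng.choice(A, p=expert[s]) for s in states])

def log_likelihood(pi):
    return np.sum(np.log(pi[states, actions] + 1e-12))

pi_hat = max(Pi, key=log_likelihood)          # BC: argmax over Pi of the log-likelihood
print("recovered the expert:", pi_hat is expert)
```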
Theorem 14.2 (MLE Guarantee). Consider the MLE procedure (Eq. 0.1). With probability at least $1-\delta$, we have:
$$\mathbb{E}_{s\sim d^{\pi^\star}}\big\|\widehat{\pi}(\cdot|s) - \pi^\star(\cdot|s)\big\|_{tv}^2 \;\le\; \frac{2\ln(|\Pi|/\delta)}{M}.$$

We refer readers to Section E, Theorem 21 in [Agarwal et al., 2020b] for a detailed proof of the above MLE guarantee.
Now we can transfer the average divergence between $\widehat{\pi}$ and $\pi^\star$ into their performance difference, i.e., the gap between $V^{\widehat{\pi}}$ and $V^{\pi^\star}$.
One thing to note is that BC only ensures that the learned policy $\widehat{\pi}$ is close to $\pi^\star$ under $d^{\pi^\star}$. Outside of $d^{\pi^\star}$'s support, we have no guarantee that $\widehat{\pi}$ and $\pi^\star$ will be close to each other. The following theorem shows that a compounding error occurs when we study the performance $V^{\widehat{\pi}}$ of the learned policy.

Theorem 14.3 (Sample Complexity of BC). With probability at least $1-\delta$, BC returns a policy $\widehat{\pi}$ such that:
$$V^\star - V^{\widehat{\pi}} \;\le\; \frac{2}{(1-\gamma)^2}\sqrt{\frac{2\ln(|\Pi|/\delta)}{M}}.$$

Proof: We start with the performance difference lemma and the fact that $\mathbb{E}_{a\sim\pi(\cdot|s)} A^\pi(s,a) = 0$ (applied here with $\pi = \widehat{\pi}$):
$$\begin{aligned}
(1-\gamma)\big(V^\star - V^{\widehat{\pi}}\big) &= \mathbb{E}_{s\sim d^{\pi^\star}}\mathbb{E}_{a\sim\pi^\star(\cdot|s)} A^{\widehat{\pi}}(s,a) \\
&= \mathbb{E}_{s\sim d^{\pi^\star}}\mathbb{E}_{a\sim\pi^\star(\cdot|s)} A^{\widehat{\pi}}(s,a) - \mathbb{E}_{s\sim d^{\pi^\star}}\mathbb{E}_{a\sim\widehat{\pi}(\cdot|s)} A^{\widehat{\pi}}(s,a) \\
&\le \frac{1}{1-\gamma}\,\mathbb{E}_{s\sim d^{\pi^\star}}\big\|\pi^\star(\cdot|s) - \widehat{\pi}(\cdot|s)\big\|_1 \\
&\le \frac{1}{1-\gamma}\sqrt{\mathbb{E}_{s\sim d^{\pi^\star}}\big\|\pi^\star(\cdot|s) - \widehat{\pi}(\cdot|s)\big\|_1^2} \\
&= \frac{1}{1-\gamma}\sqrt{4\,\mathbb{E}_{s\sim d^{\pi^\star}}\big\|\pi^\star(\cdot|s) - \widehat{\pi}(\cdot|s)\big\|_{tv}^2},
\end{aligned}$$
where we use the fact that $\sup_{s,a,\pi}|A^\pi(s,a)| \le \frac{1}{1-\gamma}$, and the fact that $(\mathbb{E}[x])^2 \le \mathbb{E}[x^2]$.
Using Theorem 14.2 and rearranging terms concludes the proof.
For Behavior Cloning, the quadratic dependency on the effective horizon $1/(1-\gamma)$ (on top of the supervised learning error) is not avoidable in the worst case [Ross and Bagnell, 2010]. This is often referred to as the distribution shift issue in the literature. Note that $\widehat{\pi}$ is trained under $d^{\pi^\star}$, but during execution, $\widehat{\pi}$ makes predictions on states that are generated by itself, i.e., from $d^{\widehat{\pi}}$, instead of the training distribution $d^{\pi^\star}$.

14.3 The Hybrid Setting: Statistical Benefit and Algorithm

The question we want to answer here is: if we know the underlying MDP's transition $P$ (but the reward is still unknown), can we improve over Behavior Cloning? In other words:

what is the benefit of the known transition in addition to the expert demonstrations?

Instead of a quadratic dependency on the effective horizon, we should expect a linear dependency on the effective
horizon. The key benefit of the known transition is that we can test our policy using the known transition to see how
far away we are from the expert’s demonstrations, and then use the known transition to plan to move closer to the
expert demonstrations.
In this section, we consider a statistically efficient, but computationally intractable, algorithm, which we use to demonstrate that, information theoretically, by interacting with the underlying known transition, we can do better than Behavior Cloning. In the next section, we will introduce a popular and computationally efficient algorithm, Maximum Entropy Inverse Reinforcement Learning (MaxEnt-IRL), which operates under this setting (i.e., expert demonstrations with a known transition).
We start from the same policy class Π as we have in the BC algorithm, and again we assume realizability (Assump-
tion 14.1) and Π is discrete.
Since we know the transition $P$, information theoretically, for any policy $\pi$ we have $d^\pi$ available (though it is computationally intractable to compute $d^\pi$ for large-scale MDPs). We have $(s_i^\star, a_i^\star)_{i=1}^M \sim d^{\pi^\star}$.

Below we present an algorithm which we call Distribution Matching with Scheffé Tournament (DM-ST).
For any two policies $\pi$ and $\pi'$, we denote $f_{\pi,\pi'}$ as the following witness function:
$$f_{\pi,\pi'} := \mathop{\mathrm{argmax}}_{f:\|f\|_\infty\le 1}\Big[\mathbb{E}_{s,a\sim d^\pi} f(s,a) - \mathbb{E}_{s,a\sim d^{\pi'}} f(s,a)\Big].$$
We denote the set of witness functions as:
$$\mathcal{F} = \{f_{\pi,\pi'} : \pi, \pi' \in \Pi,\ \pi \ne \pi'\}.$$
Note that $|\mathcal{F}| \le |\Pi|^2$.
DM-ST selects $\widehat{\pi}$ using the following procedure:
$$\text{DM-ST:}\quad \widehat{\pi} \in \mathop{\mathrm{argmin}}_{\pi\in\Pi}\;\max_{f\in\mathcal{F}}\Big[\mathbb{E}_{s,a\sim d^\pi} f(s,a) - \frac{1}{M}\sum_{i=1}^M f(s_i^\star, a_i^\star)\Big].$$
The following theorem captures the sample complexity of DM-ST.


Theorem 14.4 (Sample Complexity of DM-ST). With probability at least $1-\delta$, DM-ST finds a policy $\widehat{\pi}$ such that:
$$V^\star - V^{\widehat{\pi}} \;\le\; \frac{4}{1-\gamma}\sqrt{\frac{2\ln(|\Pi|) + \ln\frac{1}{\delta}}{M}}.$$

Proof: The proof relies on a uniform convergence argument over $\mathcal{F}$, whose size is at most $|\Pi|^2$. First we note that for all policies $\pi\in\Pi$:
$$\max_{f\in\mathcal{F}}\big[\mathbb{E}_{s,a\sim d^\pi} f(s,a) - \mathbb{E}_{s,a\sim d^\star} f(s,a)\big] = \max_{f:\|f\|_\infty\le 1}\big[\mathbb{E}_{s,a\sim d^\pi} f(s,a) - \mathbb{E}_{s,a\sim d^\star} f(s,a)\big] = \big\|d^\pi - d^{\pi^\star}\big\|_1,$$
where the first equality comes from the fact that $\mathcal{F}$ includes $\mathrm{argmax}_{f:\|f\|_\infty\le 1}\big[\mathbb{E}_{s,a\sim d^\pi} f(s,a) - \mathbb{E}_{s,a\sim d^\star} f(s,a)\big]$.
Via Hoeffding's inequality and a union bound over $\mathcal{F}$, we get that with probability at least $1-\delta$, for all $f\in\mathcal{F}$:
$$\Big|\frac{1}{M}\sum_{i=1}^M f(s_i^\star, a_i^\star) - \mathbb{E}_{s,a\sim d^\star} f(s,a)\Big| \;\le\; 2\sqrt{\frac{\ln(|\mathcal{F}|/\delta)}{M}} \;:=\; \epsilon_{stat}.$$

Denote $\widehat{f} := \mathrm{argmax}_{f\in\mathcal{F}}\big[\mathbb{E}_{s,a\sim d^{\widehat{\pi}}} f(s,a) - \mathbb{E}_{s,a\sim d^\star} f(s,a)\big]$ and $\widetilde{f} := \mathrm{argmax}_{f\in\mathcal{F}}\big[\mathbb{E}_{s,a\sim d^{\widehat{\pi}}} f(s,a) - \frac{1}{M}\sum_{i=1}^M f(s_i^\star, a_i^\star)\big]$. Hence, for $\widehat{\pi}$, we have:
$$\begin{aligned}
\big\|d^{\widehat{\pi}} - d^\star\big\|_1 &= \mathbb{E}_{s,a\sim d^{\widehat{\pi}}}\widehat{f}(s,a) - \mathbb{E}_{s,a\sim d^\star}\widehat{f}(s,a) \le \mathbb{E}_{s,a\sim d^{\widehat{\pi}}}\widehat{f}(s,a) - \frac{1}{M}\sum_{i=1}^M \widehat{f}(s_i^\star,a_i^\star) + \epsilon_{stat}\\
&\le \mathbb{E}_{s,a\sim d^{\widehat{\pi}}}\widetilde{f}(s,a) - \frac{1}{M}\sum_{i=1}^M \widetilde{f}(s_i^\star,a_i^\star) + \epsilon_{stat}\\
&\le \max_{f\in\mathcal{F}}\Big[\mathbb{E}_{s,a\sim d^{\pi^\star}} f(s,a) - \frac{1}{M}\sum_{i=1}^M f(s_i^\star,a_i^\star)\Big] + \epsilon_{stat}\\
&\le \max_{f\in\mathcal{F}}\Big[\mathbb{E}_{s,a\sim d^{\pi^\star}} f(s,a) - \mathbb{E}_{s,a\sim d^\star} f(s,a)\Big] + 2\epsilon_{stat} = 2\epsilon_{stat},
\end{aligned}$$
where the third inequality uses the optimality of $\widehat{\pi}$ in the DM-ST objective (together with $\pi^\star\in\Pi$), and the last equality uses $d^\star = d^{\pi^\star}$.
Recalling that $V^\pi = \mathbb{E}_{s,a\sim d^\pi} r(s,a)/(1-\gamma)$, we have:
$$\big|V^{\widehat{\pi}} - V^\star\big| = \frac{1}{1-\gamma}\Big|\mathbb{E}_{s,a\sim d^{\widehat{\pi}}} r(s,a) - \mathbb{E}_{s,a\sim d^\star} r(s,a)\Big| \le \frac{\sup_{s,a}|r(s,a)|}{1-\gamma}\,\big\|d^{\widehat{\pi}} - d^\star\big\|_1 \le \frac{2}{1-\gamma}\,\epsilon_{stat}.$$
This concludes the proof.


Note that the above theorem confirms the statistical benefit of having access to a known transition: compared to the classic Behavior Cloning algorithm, the error of DM-ST only scales linearly with respect to the horizon $1/(1-\gamma)$ instead of quadratically.

14.3.1 Extension to Agnostic Setting

So far we have focused on the realizable setting. What would happen if $\pi^\star\not\in\Pi$? We can still run our DM-ST algorithm as is. We state the sample complexity of DM-ST in the agnostic setting below.

Theorem 14.5 (Agnostic Guarantee of DM-ST). Assume $\Pi$ is finite, but $\pi^\star\not\in\Pi$. With probability at least $1-\delta$, DM-ST learns a policy $\widehat{\pi}$ such that:
$$V^\star - V^{\widehat{\pi}} \;\le\; \frac{1}{1-\gamma}\big\|d^\star - d^{\widehat{\pi}}\big\|_1 \;\le\; \frac{3}{1-\gamma}\min_{\pi\in\Pi}\big\|d^\pi - d^\star\big\|_1 + O\left(\frac{1}{1-\gamma}\sqrt{\frac{\ln(|\Pi|) + \ln(1/\delta)}{M}}\right).$$

Proof: We first define some terms below. Denote $\widetilde{\pi} := \mathrm{argmin}_{\pi\in\Pi}\|d^\pi - d^\star\|_1$. Let us denote:
$$\widetilde{f} = \mathop{\mathrm{argmax}}_{f\in\mathcal{F}}\Big[\mathbb{E}_{s,a\sim d^{\widehat{\pi}}} f(s,a) - \mathbb{E}_{s,a\sim d^{\widetilde{\pi}}} f(s,a)\Big],$$
$$f = \mathop{\mathrm{argmax}}_{f\in\mathcal{F}}\Big[\mathbb{E}_{s,a\sim d^{\widehat{\pi}}} f(s,a) - \frac{1}{M}\sum_{i=1}^M f(s_i^\star, a_i^\star)\Big],$$
$$f' = \mathop{\mathrm{argmax}}_{f\in\mathcal{F}}\Big[\mathbb{E}_{s,a\sim d^{\widetilde{\pi}}} f(s,a) - \frac{1}{M}\sum_{i=1}^M f(s_i^\star, a_i^\star)\Big].$$

Starting with a triangle inequality, we have:
$$\begin{aligned}
\big\|d^{\widehat{\pi}} - d^{\pi^\star}\big\|_1 &\le \big\|d^{\widehat{\pi}} - d^{\widetilde{\pi}}\big\|_1 + \big\|d^{\widetilde{\pi}} - d^{\pi^\star}\big\|_1\\
&= \mathbb{E}_{s,a\sim d^{\widehat{\pi}}}\widetilde{f}(s,a) - \mathbb{E}_{s,a\sim d^{\widetilde{\pi}}}\widetilde{f}(s,a) + \big\|d^{\widetilde{\pi}} - d^{\pi^\star}\big\|_1\\
&= \mathbb{E}_{s,a\sim d^{\widehat{\pi}}}\widetilde{f}(s,a) - \frac{1}{M}\sum_{i=1}^M\widetilde{f}(s_i^\star,a_i^\star) + \frac{1}{M}\sum_{i=1}^M\widetilde{f}(s_i^\star,a_i^\star) - \mathbb{E}_{s,a\sim d^{\widetilde{\pi}}}\widetilde{f}(s,a) + \big\|d^{\widetilde{\pi}} - d^{\pi^\star}\big\|_1\\
&\le \Big[\mathbb{E}_{s,a\sim d^{\widehat{\pi}}} f(s,a) - \frac{1}{M}\sum_{i=1}^M f(s_i^\star,a_i^\star)\Big] + \Big|\frac{1}{M}\sum_{i=1}^M\widetilde{f}(s_i^\star,a_i^\star) - \mathbb{E}_{s,a\sim d^\star}\widetilde{f}(s,a)\Big| + \Big[\mathbb{E}_{s,a\sim d^\star}\widetilde{f}(s,a) - \mathbb{E}_{s,a\sim d^{\widetilde{\pi}}}\widetilde{f}(s,a)\Big] + \big\|d^{\widetilde{\pi}} - d^{\pi^\star}\big\|_1\\
&\le \mathbb{E}_{s,a\sim d^{\widetilde{\pi}}} f'(s,a) - \frac{1}{M}\sum_{i=1}^M f'(s_i^\star,a_i^\star) + 2\sqrt{\frac{\ln(|\mathcal{F}|/\delta)}{M}} + 2\big\|d^{\widetilde{\pi}} - d^\star\big\|_1\\
&\le \mathbb{E}_{s,a\sim d^{\widetilde{\pi}}} f'(s,a) - \mathbb{E}_{s,a\sim d^\star} f'(s,a) + 4\sqrt{\frac{\ln(|\mathcal{F}|/\delta)}{M}} + 2\big\|d^{\widetilde{\pi}} - d^\star\big\|_1\\
&\le 3\big\|d^\star - d^{\widetilde{\pi}}\big\|_1 + 4\sqrt{\frac{\ln(|\mathcal{F}|/\delta)}{M}},
\end{aligned}$$
where the second inequality uses the definition of $f$ (after adding and subtracting $\mathbb{E}_{s,a\sim d^\star}\widetilde{f}$), and the third inequality uses the fact that $\widehat{\pi}$ is the minimizer of $\max_{f\in\mathcal{F}}\big[\mathbb{E}_{s,a\sim d^\pi} f(s,a) - \frac{1}{M}\sum_{i=1}^M f(s_i^\star,a_i^\star)\big]$ over $\Pi$ (so this quantity at $\widehat{\pi}$ is at most its value at $\widetilde{\pi}$, which equals $\mathbb{E}_{s,a\sim d^{\widetilde{\pi}}} f'(s,a) - \frac{1}{M}\sum_i f'(s_i^\star,a_i^\star)$); along the way we also use Hoeffding's inequality, which gives that with probability at least $1-\delta$, $\big|\mathbb{E}_{s,a\sim d^\star} f(s,a) - \frac{1}{M}\sum_{i=1}^M f(s_i^\star,a_i^\star)\big| \le 2\sqrt{\ln(|\mathcal{F}|/\delta)/M}$ for all $f\in\mathcal{F}$, together with $\mathbb{E}_{s,a\sim d^\star} g - \mathbb{E}_{s,a\sim d^{\widetilde{\pi}}} g \le \|d^\star - d^{\widetilde{\pi}}\|_1$ for any $\|g\|_\infty\le 1$. Combining with $V^\star - V^{\widehat{\pi}} \le \frac{1}{1-\gamma}\|d^{\widehat{\pi}} - d^{\pi^\star}\|_1$, exactly as in the proof of Theorem 14.4, concludes the proof.
As we can see, compared to the realizable setting, here we have an extra term that is related to $\min_{\pi\in\Pi}\|d^\pi - d^\star\|_1$. Note that the dependency on the horizon also scales linearly in this case. In general, the constant 3 in front of $\min_{\pi\in\Pi}\|d^\pi - d^\star\|_1$ is not avoidable for the Scheffé estimator [Devroye and Lugosi, 2012].

14.4 Maximum Entropy Inverse Reinforcement Learning

Similar to Behavior Cloning, we assume we have a dataset of state-action pairs from the expert, $\mathcal{D}^\star = \{s_i^\star, a_i^\star\}_{i=1}^N$ where $s_i^\star, a_i^\star \sim d^{\pi^\star}$. Different from Behavior Cloning, here we assume that we have access to the underlying MDP's transition, i.e., the transition is known and we can do planning if we are given a cost function.
We assume that we are given a state-action feature mapping $\phi: S\times A\mapsto\mathbb{R}^d$ (this can be extended to an infinite dimensional feature space in an RKHS, but we present the finite dimensional setting for simplicity). We assume the ground truth cost function is $c(s,a) := \theta^\star\cdot\phi(s,a)$ with $\|\theta^\star\|_2 \le 1$ and $\theta^\star$ unknown.
The goal of the learner is to compute a policy $\pi: S\mapsto\Delta(A)$ such that, when measured under the true cost function, it performs as well as the expert, i.e., $\mathbb{E}_{s,a\sim d^\pi}\theta^\star\cdot\phi(s,a) \approx \mathbb{E}_{s,a\sim d^{\pi^\star}}\theta^\star\cdot\phi(s,a)$.
We will focus on the finite horizon MDP setting in this section. We denote $\rho_\pi(\tau)$ as the trajectory distribution induced by $\pi$ and $d^\pi$ as the average state-action distribution induced by $\pi$.

14.4.1 MaxEnt IRL: Formulation and The Principle of Maximum Entropy

MaxEnt IRL uses the principle of Maximum Entropy and poses the following policy optimization program:
$$\max_\pi\; -\sum_\tau \rho_\pi(\tau)\ln\rho_\pi(\tau), \qquad \text{s.t.}\quad \mathbb{E}_{s,a\sim d^\pi}\phi(s,a) = \frac{1}{N}\sum_{i=1}^N\phi(s_i^\star, a_i^\star).$$

Note that $\frac{1}{N}\sum_{i=1}^N\phi(s_i^\star,a_i^\star)$ is an unbiased estimate of $\mathbb{E}_{s,a\sim d^{\pi^\star}}\phi(s,a)$. MaxEnt-IRL searches for a policy that maximizes the entropy of its trajectory distribution subject to a moment matching constraint.
Note that there could be many policies that satisfy the moment matching constraint, i.e., $\mathbb{E}_{s,a\sim d^\pi}\phi(s,a) = \mathbb{E}_{s,a\sim d^{\pi^\star}}\phi(s,a)$, and any feasible solution is guaranteed to achieve the same performance as the expert under the ground truth cost function $\theta^\star\cdot\phi(s,a)$. The maximum entropy objective ensures that the solution is unique.
Using the Markov property, we notice that:
$$\mathop{\mathrm{argmax}}_\pi\; -\sum_\tau \rho_\pi(\tau)\ln\rho_\pi(\tau) = \mathop{\mathrm{argmax}}_\pi\; -\mathbb{E}_{s,a\sim d^\pi}\ln\pi(a|s).$$
Thus, we can rewrite MaxEnt-IRL as follows:
$$\min_\pi\;\mathbb{E}_{s,a\sim d^\pi}\ln\pi(a|s), \qquad (0.2)$$
$$\text{s.t.}\quad \mathbb{E}_{s,a\sim d^\pi}\phi(s,a) = \frac{1}{N}\sum_{i=1}^N\phi(s_i^\star, a_i^\star). \qquad (0.3)$$

14.4.2 Algorithm

To better interpret the objective function, below we replace $\frac{1}{N}\sum_i\phi(s_i^\star,a_i^\star)$ by its expectation $\mathbb{E}_{s,a\sim d^{\pi^\star}}\phi(s,a)$. Note that we can use standard concentration inequalities to bound the difference between $\frac{1}{N}\sum_i\phi(s_i^\star,a_i^\star)$ and $\mathbb{E}_{s,a\sim d^{\pi^\star}}\phi(s,a)$.
Using Lagrange multipliers, we can rewrite the constrained optimization program in Eq. 0.2 as follows:
$$\min_\pi\;\mathbb{E}_{s,a\sim d^\pi}\ln\pi(a|s) + \max_\theta\Big[\mathbb{E}_{s,a\sim d^\pi}\theta^\top\phi(s,a) - \mathbb{E}_{s,a\sim d^{\pi^\star}}\theta^\top\phi(s,a)\Big].$$

The above objective conveys a clear goal of our imitation learning problem: we are searching for a $\pi$ that minimizes the MMD between $d^\pi$ and $d^{\pi^\star}$, with a (negative) entropy regularization on the trajectory distribution of policy $\pi$.
To derive an algorithm that optimizes the above objective, we first swap the min-max order via the minimax theorem:
$$\max_\theta\min_\pi\;\Big[\mathbb{E}_{s,a\sim d^\pi}\theta^\top\phi(s,a) - \mathbb{E}_{s,a\sim d^{\pi^\star}}\theta^\top\phi(s,a) + \mathbb{E}_{s,a\sim d^\pi}\ln\pi(a|s)\Big].$$
The above objective suggests a natural algorithm where we perform projected gradient ascent on $\theta$, while performing a best response update on $\pi$, i.e., given $\theta$, we solve the following planning problem:
$$\mathop{\mathrm{argmin}}_\pi\;\mathbb{E}_{s,a\sim d^\pi}\theta^\top\phi(s,a) + \mathbb{E}_{s,a\sim d^\pi}\ln\pi(a|s). \qquad (0.4)$$

Note that the above objective can be understood as planning with cost function $\theta^\top\phi(s,a)$ plus an additional negative entropy regularization on the trajectory distribution. On the other hand, given $\pi$, we can compute the gradient with respect to $\theta$, which is simply the difference between the expected features:
$$\mathbb{E}_{s,a\sim d^\pi}\phi(s,a) - \mathbb{E}_{s,a\sim d^{\pi^\star}}\phi(s,a),$$
which gives the following gradient ascent update on $\theta$:
$$\theta := \theta + \eta\Big(\mathbb{E}_{s,a\sim d^\pi}\phi(s,a) - \mathbb{E}_{s,a\sim d^{\pi^\star}}\phi(s,a)\Big). \qquad (0.5)$$

Algorithm of MaxEnt-IRL MaxEnt-IRL updates $\pi$ and $\theta$ alternately using Eq. 0.4 and Eq. 0.5, respectively. We summarize the algorithm in Alg. 8. Note that the stochastic gradient of $\theta_t$ is the empirical average feature difference between the current policy $\pi_t$ and the expert policy $\pi^\star$.

Algorithm 8 MaxEnt-IRL
Input: Expert data $\mathcal{D}^\star = \{s_i^\star, a_i^\star\}_{i=1}^M$, MDP $\mathcal{M}$, parameters $\beta, \eta, N$.
1: Initialize $\theta_0$ with $\|\theta_0\|_2 \le 1$.
2: for $t = 1, 2, \ldots$ do
3:   Entropy-regularized planning with cost $\theta_t^\top\phi(s,a)$: $\pi_t \in \mathrm{argmin}_\pi\,\mathbb{E}_{s,a\sim d^\pi}\big[\theta_t^\top\phi(s,a) + \beta\ln\pi(a|s)\big]$.
4:   Draw samples $\{s_i, a_i\}_{i=1}^N \sim d^{\pi_t}$ by executing $\pi_t$ in $\mathcal{M}$.
5:   Stochastic gradient update: $\theta_{t+1} = \theta_t + \eta\Big[\frac{1}{N}\sum_{i=1}^N\phi(s_i,a_i) - \frac{1}{M}\sum_{i=1}^M\phi(s_i^\star,a_i^\star)\Big]$.
6: end for
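A minimal sketch of the alternating updates in Algorithm 8 is given below. The planner, feature map, and rollout sampler are passed in as black boxes (all assumptions for illustration; a concrete entropy-regularized planner is sketched in Section 14.4.3), and the projection onto the unit ball reflects the projected gradient ascent described above.

```python
import numpy as np

def maxent_irl(expert_sa, phi, soft_planner, sample_rollout, d, eta=0.1, T=100):
    """Sketch of the MaxEnt-IRL loop (Alg. 8), with assumed callbacks:
    expert_sa: list of (s, a) pairs from the expert,
    phi(s, a) -> feature vector in R^d,
    soft_planner(theta) -> policy minimizing E[theta.phi + ln pi]  (Eq. 0.4),
    sample_rollout(pi, n) -> n state-action pairs drawn from d^pi."""
    expert_feat = np.mean([phi(s, a) for s, a in expert_sa], axis=0)
    theta = np.zeros(d)
    for _ in range(T):
        pi = soft_planner(theta)                           # best-response planning step
        samples = sample_rollout(pi, n=len(expert_sa))
        policy_feat = np.mean([phi(s, a) for s, a in samples], axis=0)
        theta = theta + eta * (policy_feat - expert_feat)  # gradient ascent (Eq. 0.5)
        norm = np.linalg.norm(theta)
        if norm > 1.0:                                     # project to the unit ball
            theta = theta / norm                           # (ball radius is an assumption)
    return soft_planner(theta)
```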

Note that Alg. 8 uses an entropy-regularized planning oracle. Below we discuss how to implement such an entropy-regularized planning oracle via dynamic programming.

14.4.3 Maximum Entropy RL: Implementing the Planning Oracle in Eq. 0.4

The planning oracle in Eq. 0.4 can be implemented in a value iteration fashion using dynamic programming. We denote $c(s,a) := \theta\cdot\phi(s,a)$.
We are interested in implementing the following planning objective:
$$\mathop{\mathrm{argmin}}_\pi\;\mathbb{E}_{s,a\sim d^\pi}\big[c(s,a) + \ln\pi(a|s)\big].$$
This subsection is of independent interest beyond the framework of imitation learning: this maximum entropy regularized planning formulation is widely used in RL as well, and it is closely connected to the framework of RL as Inference.
As usual, we start from the last time step $H-1$. For any policy $\pi$ and any state-action pair $(s,a)$, we have the cost-to-go $Q^\pi_{H-1}(s,a)$:
$$Q^\pi_{H-1}(s,a) = c(s,a) + \ln\pi(a|s), \qquad V^\pi_{H-1}(s) = \sum_a \pi(a|s)\big(c(s,a) + \ln\pi(a|s)\big).$$

We have:
$$V^\star_{H-1}(s) = \min_{\rho\in\Delta(A)}\sum_a\big[\rho(a)\,c(s,a) + \rho(a)\ln\rho(a)\big]. \qquad (0.6)$$
Taking the gradient with respect to $\rho$ (subject to the simplex constraint), setting it to zero, and solving for $\rho$, we get:
$$\pi^\star_{H-1}(a|s) \propto \exp\big(-c(s,a)\big), \quad \forall s,a.$$

Substituting $\pi^\star_{H-1}$ back into the expression in Eq. 0.6, we get:
$$V^\star_{H-1}(s) = -\ln\Big(\sum_a \exp\big(-c(s,a)\big)\Big),$$

i.e., we apply a softmin operator rather than the usual min operator (recall here we are minimizing cost).
With $V^\star_{h+1}$, we can continue to time step $h$. Denote $Q^\star_h(s,a)$ as follows:
$$Q^\star_h(s,a) = c(s,a) + \mathbb{E}_{s'\sim P(\cdot|s,a)} V^\star_{h+1}(s').$$
For $V^\star_h$, we have:
$$V^\star_h(s) = \min_{\rho\in\Delta(A)}\sum_a\rho(a)\Big(c(s,a) + \ln\rho(a) + \mathbb{E}_{s'\sim P(\cdot|s,a)} V^\star_{h+1}(s')\Big).$$

Again we can show that the minimizer of the above program has the following form:
$$\pi^\star_h(a|s) \propto \exp\big(-Q^\star_h(s,a)\big). \qquad (0.7)$$
Substituting $\pi^\star_h$ back, we can show that:
$$V^\star_h(s) = -\ln\Big(\sum_a \exp\big(-Q^\star_h(s,a)\big)\Big),$$
where we see again that $V^\star_h$ is obtained by a softmin operator applied to $Q^\star_h$.
Thus the soft value iteration can be summarized as follows:
Soft Value Iteration:
$$Q^\star_{H-1}(s,a) = c(s,a), \qquad Q^\star_h(s,a) = c(s,a) + \mathbb{E}_{s'\sim P(\cdot|s,a)} V^\star_{h+1}(s') \;\;(h < H-1),$$
$$\pi^\star_h(a|s) \propto \exp\big(-Q^\star_h(s,a)\big), \qquad V^\star_h(s) = -\ln\Big(\sum_a \exp\big(-Q^\star_h(s,a)\big)\Big), \quad \forall h.$$
We can continue the above procedure down to $h = 0$, which gives us the optimal policies, all in the form of Eq. 0.7.
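Below is a minimal tabular sketch of this soft value iteration; the MDP and the cost table (standing in for $c(s,a) = \theta\cdot\phi(s,a)$) are synthetic placeholders.

```python
import numpy as np

# Minimal tabular sketch of soft value iteration; MDP and cost are synthetic.
S, A, H = 4, 3, 6
rng = np.random.default_rng(0)
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)          # P[s, a, s'] = transition probability
c = rng.random((S, A))                     # cost, e.g. c(s, a) = theta . phi(s, a)

def soft_value_iteration(P, c, H):
    S, A = c.shape
    V = np.zeros((H + 1, S))               # convention: V[H] = 0
    pi = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        Q = c + P @ V[h + 1]               # Q_h(s,a) = c(s,a) + E_{s'} V_{h+1}(s')
        pi[h] = np.exp(-Q)                 # pi_h(a|s) proportional to exp(-Q_h(s,a))
        pi[h] /= pi[h].sum(axis=1, keepdims=True)
        V[h] = -np.log(np.exp(-Q).sum(axis=1))   # softmin over actions
    return pi, V

pi, V = soft_value_iteration(P, c, H)
print("soft-optimal value at h=0:", V[0])
```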

14.5 Interactive Imitation Learning: AggreVaTe and Its Statistical Benefit over the Offline IL Setting

We study the Interactive Imitation Learning setting where we have an expert policy that can be queried at any time
during training, and we also have access to the ground truth reward signal.
We present AggreVaTe (Aggregate Values to Imitate) first and then analyze its sample complexity under the realizable
setting.
Again we start with a realizable policy class Π that is discrete. We denote ∆(Π) as the convex hull of Π and each
policy π ∈ ∆(Π) is a mixture policy represented by a distribution ρ ∈ R|Π| with ρ[i] ≥ 0 and kρk1 = 1. With
this parameterization, our decision space simply becomes ∆(|Π|), i.e., any point in ∆(|Π|) corresponds to a mixture
policy. Notation wise, given ρ ∈ ∆(|Π|), we denote πρ as the corresponding mixture policy. We denote πi as the i-th
policy in Π. We denote ρ[i] as the i-th element of the vector ρ.

Algorithm 9 AggreVaTe
Input: The interactive expert, regularization $\lambda$.
1: Initialize $\rho_0$ to be the uniform distribution over $\Pi$.
2: for $t = 0, 1, 2, \ldots$ do
3:   Sample $s_t \sim d^{\pi_{\rho_t}}$.
4:   Query the expert to get $A^\star(s_t, a)$ for all $a\in A$.
5:   Policy update: $\rho_{t+1} = \mathrm{argmax}_{\rho\in\Delta(|\Pi|)}\sum_{j=0}^{t}\sum_{i=1}^{|\Pi|}\rho[i]\,\mathbb{E}_{a\sim\pi_i(\cdot|s_j)} A^\star(s_j, a) - \lambda\sum_{i=1}^{|\Pi|}\rho[i]\ln(\rho[i])$.
6: end for

AggreVaTe assumes an interactive expert from whom we can query for action feedback. Basically, given a state $s$, let us assume that the expert returns the advantages of all actions, i.e., one query of the expert oracle at any state $s$ returns $A^\star(s,a)$ for all $a\in A$.$^1$
The algorithm is summarized in Alg. 9.
$^1$ Technically one cannot use a single query to get $A^\star(s,a)$ for all $a$. However, one can use importance weighting to get an unbiased estimate of $A^\star(s,a)$ for all $a$ via just one expert roll-out. For analysis simplicity, we assume one expert query at $s$ returns the whole vector $A^\star(s,\cdot)\in\mathbb{R}^{|A|}$.
P|Π|
To analyze the algorithm, let us introduce some additional notations. Let us denote `t (ρ) = i=1 ρ[i]Ea∼πi (·|s) A? (st , a),
which is dependent on state st generated at iteration t, and is a linear function with respect to decision variable ρ. Ag-
greVaTe is essentially running the specific online learning algorithm, Follow-the-Regularized Leader (FTRL) (e.g.,
see [Shalev-Shwartz, 2011]) with Entropy regularization:
t
X
ρt+1 = argmaxρ∈∆(|Π|) `t (ρ) + λEntropy(ρ).
i=0

Denote c = maxs,a Aπ (s, a). This implies that supρ,t k`t (ρ)k ≤ c. FTRL with linear loss functions and entropy
regularization gives the following deterministic regret guarantees ([Shalev-Shwartz, 2011]):

T
X −1 T
X −1 p
max `t (ρ) − `t (ρt ) ≤ c log(|Π|)T . (0.8)
ρ
t=0 t=0

We will analyze AggreVaTe’s sample complexity using the above result.
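Since the losses are linear in $\rho$ and the regularizer is the entropy, the FTRL update in Algorithm 9 has a closed form over the simplex: $\rho_{t+1}[i] \propto \exp\big(\frac{1}{\lambda}\sum_{j\le t} g_j[i]\big)$ with $g_j[i] = \mathbb{E}_{a\sim\pi_i(\cdot|s_j)} A^\star(s_j,a)$. The sketch below illustrates this update with synthetic advantage vectors standing in for the expert feedback.

```python
import numpy as np

# Sketch of the FTRL-with-entropy update in Algorithm 9. The advantage vectors
# g below are synthetic placeholders for E_{a~pi_i(.|s_t)} A*(s_t, a).
n_policies, T, lam = 10, 50, 1.0
rng = np.random.default_rng(0)

cum = np.zeros(n_policies)                    # running sum of per-round linear losses
rho = np.ones(n_policies) / n_policies        # rho_0: uniform mixture
for t in range(T):
    g = rng.uniform(-1.0, 0.0, size=n_policies)   # g[i] = E_{a~pi_i(.|s_t)} A*(s_t, a)
    cum += g
    z = cum / lam
    z -= z.max()                              # for numerical stability
    rho = np.exp(z) / np.exp(z).sum()         # closed-form FTRL update over the simplex
print("final mixture weights:", rho)
```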

Theorem 14.6 (Sample Complexity of AggreVaTe). Denote $c = \sup_{s,a}|A^\star(s,a)|$. Let us denote $\epsilon_\Pi$ and $\epsilon_{stat}$ as follows:
$$\epsilon_\Pi := \max_{\rho\in\Delta(|\Pi|)}\frac{1}{M}\sum_{t=0}^{M-1}\ell_t(\rho), \qquad \epsilon_{stat} := \sqrt{\frac{\ln(|\Pi|)}{M}} + 2\sqrt{\frac{\ln(1/\delta)}{M}}.$$
With probability at least $1-\delta$, after $M$ iterations (i.e., $M$ calls of the expert oracle), AggreVaTe finds a policy $\widehat{\pi}$ (one of the mixture policies $\pi_{\rho_t}$ computed during its run) such that:
$$V^\star - V^{\widehat{\pi}} \;\le\; \frac{c}{1-\gamma}\,\epsilon_{stat} - \frac{1}{1-\gamma}\,\epsilon_\Pi.$$

Proof: At each iteration $t$, let us define $\tilde{\ell}_t(\rho_t)$ as follows:
$$\tilde{\ell}_t(\rho_t) = \mathbb{E}_{s\sim d^{\pi_{\rho_t}}}\sum_{i=1}^{|\Pi|}\rho_t[i]\,\mathbb{E}_{a\sim\pi_i(\cdot|s)} A^\star(s,a).$$
Denote $\mathbb{E}_t[\ell_t(\rho_t)]$ as the conditional expectation of $\ell_t(\rho_t)$, conditioned on all history up to and including the end of iteration $t-1$. Thus, we have $\mathbb{E}_t[\ell_t(\rho_t)] = \tilde{\ell}_t(\rho_t)$, as $\rho_t$ only depends on the history up to the end of iteration $t-1$. Also note that $|\ell_t(\rho)| \le c$. Thus by the Azuma-Hoeffding inequality (Theorem A.2), we get:
$$\Big|\frac{1}{M}\sum_{t=0}^{M-1}\tilde{\ell}_t(\rho_t) - \frac{1}{M}\sum_{t=0}^{M-1}\ell_t(\rho_t)\Big| \;\le\; 2c\sqrt{\frac{\ln(1/\delta)}{M}},$$
with probability at least $1-\delta$. Now use Eq. 0.8, and denote $\rho^\star = \mathrm{argmax}_{\rho\in\Delta(|\Pi|)}\frac{1}{M}\sum_{t=0}^{M-1}\ell_t(\rho)$, i.e., the comparator that maximizes the average of the $\ell_t$. We get:
$$\frac{1}{M}\sum_{t=0}^{M-1}\tilde{\ell}_t(\rho_t) \;\ge\; \frac{1}{M}\sum_{t=0}^{M-1}\ell_t(\rho_t) - 2c\sqrt{\frac{\ln(1/\delta)}{M}} \;\ge\; \frac{1}{M}\sum_{t=0}^{M-1}\ell_t(\rho^\star) - c\sqrt{\frac{\ln(|\Pi|)}{M}} - 2c\sqrt{\frac{\ln(1/\delta)}{M}},$$
which means that there must exist a $\hat{t}\in\{0,\ldots,M-1\}$ such that:
$$\tilde{\ell}_{\hat{t}}(\rho_{\hat{t}}) \;\ge\; \frac{1}{M}\sum_{t=0}^{M-1}\ell_t(\rho^\star) - c\sqrt{\frac{\ln(|\Pi|)}{M}} - 2c\sqrt{\frac{\ln(1/\delta)}{M}}.$$
Now, using the performance difference lemma, we have:
$$(1-\gamma)\big(V^{\pi_{\rho_{\hat{t}}}} - V^\star\big) = \mathbb{E}_{s\sim d^{\pi_{\rho_{\hat{t}}}}}\sum_{i=1}^{|\Pi|}\rho_{\hat{t}}[i]\,\mathbb{E}_{a\sim\pi_i(\cdot|s)} A^\star(s,a) = \tilde{\ell}_{\hat{t}}(\rho_{\hat{t}}) \;\ge\; \frac{1}{M}\sum_{t=0}^{M-1}\ell_t(\rho^\star) - c\sqrt{\frac{\ln(|\Pi|)}{M}} - 2c\sqrt{\frac{\ln(1/\delta)}{M}}.$$
Rearranging terms, we get:
$$V^\star - V^{\pi_{\rho_{\hat{t}}}} \;\le\; -\frac{1}{1-\gamma}\Big[\frac{1}{M}\sum_{t=0}^{M-1}\ell_t(\rho^\star)\Big] + \frac{c}{1-\gamma}\Big[\sqrt{\frac{\ln(|\Pi|)}{M}} + 2\sqrt{\frac{\ln(1/\delta)}{M}}\Big] = \frac{c}{1-\gamma}\,\epsilon_{stat} - \frac{1}{1-\gamma}\,\epsilon_\Pi,$$
which concludes the proof.

Remark We analyze $\epsilon_\Pi$ by discussing the realizable setting and the non-realizable setting separately. When $\Pi$ is realizable, i.e., $\pi^\star\in\Pi$, by the definition of our loss function $\ell_t$ we immediately have $\sum_{t=0}^{M-1}\ell_t(\rho^\star) \ge 0$, since the point mass on $\pi^\star$ is a feasible $\rho$ and $\mathbb{E}_{a\sim\pi^\star(\cdot|s)}A^\star(s,a) = 0$ for all $s\in S$. Moreover, when $\pi^\star$ is not the globally optimal policy, it is possible that $\epsilon_\Pi := \frac{1}{M}\sum_{t=0}^{M-1}\ell_t(\rho^\star) > 0$, which implies that when $M\to\infty$ (i.e., $\epsilon_{stat}\to 0$), AggreVaTe can indeed learn a policy that outperforms the expert policy $\pi^\star$. In general, when $\pi^\star\not\in\Pi$, there might not exist a mixture policy $\rho$ that achieves positive advantage against $\pi^\star$ under the $M$ training samples $\{s_0,\ldots,s_{M-1}\}$. In this case, $\epsilon_\Pi$ can be negative.
Does AggreVaTe avoid distribution shift? Under the realizable setting, note that the bound explicitly depends on $\sup_{s,a}|A^\star(s,a)|$ (i.e., it depends on $|\ell_t(\rho_t)|$ for all $t$). In the worst case, it is possible that $\sup_{s,a}|A^\star(s,a)| = \Theta(1/(1-\gamma))$, which implies that AggreVaTe could suffer a quadratic horizon dependency, i.e., $1/(1-\gamma)^2$. Note that DM-ST provably scales linearly with respect to $1/(1-\gamma)$, but DM-ST requires the stronger assumption that the transition $P$ is known.
When $\sup_{s,a}|A^\star(s,a)| = o(1/(1-\gamma))$, AggreVaTe performs strictly better than BC. This is possible when the MDP mixes quickly under $\pi^\star$, or when $Q^\star$ is $L$-Lipschitz continuous, i.e., $|Q^\star(s,a) - Q^\star(s,a')| \le L\|a - a'\|$, with a bounded action range, e.g., $\sup_{a,a'}\|a - a'\| \le \beta\in\mathbb{R}^+$. In this case, if $L$ and $\beta$ are independent of $1/(1-\gamma)$, then $\sup_{s,a}|A^\star(s,a)| \le L\beta$, which leads to a $\frac{L\beta}{1-\gamma}$ dependency.

When $\pi^\star\not\in\Pi$, the agnostic result of AggreVaTe and the agnostic result of DM-ST are not directly comparable. Note that the model class error that DM-ST suffers is algorithm-independent, i.e., it is $\min_{\pi\in\Pi}\|d^\pi - d^{\pi^\star}\|_1$ and it only depends on $\pi^\star$ and $\Pi$, while the model class term $\epsilon_\Pi$ in AggreVaTe is dependent on the algorithm's path, i.e., in addition to $\pi^\star$ and $\Pi$, it depends on the policies $\pi_1, \pi_2, \ldots$ computed during the learning process. Another difference is that $\min_{\pi\in\Pi}\|d^\pi - d^{\pi^\star}\|_1 \in [0,2]$, while $\epsilon_\Pi$ could indeed scale linearly with respect to $1/(1-\gamma)$, i.e., $\epsilon_\Pi \in [-\frac{1}{1-\gamma}, 0]$ (assuming $\pi^\star$ is the globally optimal policy for simplicity).

14.6 Bibliographic Remarks and Further Readings

Behavior cloning was used in autonomous driving back in 1989 [Pomerleau, 1989], and the distribution shift and
compounding error issue was studied by Ross and Bagnell [2010]. Ross and Bagnell [2010], Ross et al. [2011]
proposed using interactive experts to alleviate the distribution shift issue.
MaxEnt-IRL was proposed by Ziebart et al. [2008]. The original MaxEnt-IRL focuses on deterministic MDPs and derives a distribution over trajectories. Later, Ziebart et al. [2010] proposed the Principle of Maximum Causal Entropy framework, which captures MDPs with stochastic transitions. MaxEnt-IRL has been widely used in real applications (e.g., [Kitani et al., 2012, Ziebart et al., 2009]).
To the best of our knowledge, the algorithm Distribution Matching with Scheffé Tournament introduced in this chapter
is new here and is the first to demonstrate the statistical benefit (in terms of sample complexity of the expert policy) of
the hybrid setting over the pure offline setting with general function approximation.
Recently, there have been approaches that extend the linear cost functions used in MaxEnt-IRL to deep neural networks, which are treated as discriminators to distinguish between expert datasets and the datasets generated by the learned policies. Different distribution divergences have been proposed, for instance, the JS-divergence [Ho and Ermon, 2016], general f-divergences which generalize the JS-divergence [Ke et al., 2019], and Integral Probability Metrics (IPM) which include the Wasserstein distance [Sun et al., 2019b]. While these adversarial IL approaches are promising as they fall into the hybrid setting, it has been observed that, empirically, adversarial IL sometimes cannot outperform simple algorithms such as Behavior Cloning, which operates in the pure offline setting (e.g., see experimental results on common RL benchmarks from Brantley et al. [2019]).
Another important direction in IL is to combine expert demonstrations and reward functions together (i.e., perform
Imitation together with RL). There are many works in this direction, including learning with pre-collected expert data
[Hester et al., 2017, Rajeswaran et al., 2017, Cheng et al., 2018, Le et al., 2018, Sun et al., 2018], learning with an
interactive expert [Daumé et al., 2009, Ross and Bagnell, 2014, Chang et al., 2015, Sun et al., 2017, Cheng et al.,
2020, Cheng and Boots, 2018]. The algorithm AggreVaTe was originally introduced in [Ross and Bagnell, 2014]. A
policy gradient and natural policy gradient version of AggreVaTe are introduced in [Sun et al., 2017]. The precursor
of AggreVaTe is the algorithm Data Aggregation (DAgger) [Ross et al., 2011], which leverages interactive experts and
a reduction to no-regret online learning, but without assuming access to the ground truth reward signals.
The maximum entropy RL formulation has been widely used in the RL literature as well. For instance, Guided Policy Search (GPS) (and its variants) uses the maximum entropy RL formulation as its planning subroutine [Levine and Koltun, 2013, Levine and Abbeel, 2014]. We refer readers to the excellent survey from Levine [2018] for more details on the maximum entropy RL formulation and its connections to the framework of RL as Probabilistic Inference.

Chapter 15

Offline Reinforcement Learning

Offline reinforcement learning broadly refers to reinforcement learning problems in which the learner does not get to
interact with the environment. Instead, the learner is simply presented with a batch of experience collected by some
decision-making policy, and the goal is to use this data to learn a near-optimal (or, at the very least, better) policy. This
setting is quite important for high-stakes decision-making scenarios which might arise in precision medicine or where
safety is a serious concern. On the other hand, one significant challenge is that exploration is not controlled by the
learner. Thus we will either (a) require some assumptions that ensure that the data-collection policy effectively covers
the state-action space, or (b) not be able to find a global near-optimal policy.
The fitted Q-iteration sample complexity analysis, which this chapter focuses on, is originally due to [Munos, 2003, Munos and Szepesvári, 2008, Antos et al., 2008].

15.1 Setting

We consider an infinite horizon discounted MDP $\mathcal{M} = \{S, A, \gamma, P, r, \rho\}$ where $\rho$ is the initial state distribution. We assume the reward is bounded, i.e., $r(s,a)\in[0,1]$ for all $s,a$. For any policy $\pi: S\mapsto A$, we denote $V^\pi$ and $Q^\pi$ as the value and Q function of $\pi$, and we denote $d^\pi\in\Delta(S\times A)$ as the state-action visitation of $\pi$. For notational simplicity, we denote $V_{\max} := 1/(1-\gamma)$.
Given any $f: S\times A\mapsto\mathbb{R}$, we denote the Bellman operator $\mathcal{T}f: S\times A\mapsto\mathbb{R}$ as follows. For all $(s,a)\in S\times A$,
$$\mathcal{T}f(s,a) := r(s,a) + \mathbb{E}_{s'\sim P(\cdot|s,a)}\max_{a'\in A} f(s',a').$$

In batch RL, rather than interact with the environment to collect data, we will be presented with n tuples (s, a, r, s0 )
where (s, a) ∼ µ, r = r(s, a) and s0 ∼ P (· | s, a). Here µ is an approximation of the data collection policy, and
it is only an approximation because we think of the tuples as iid. This is primarily to simplify the analysis, and it
is possible to obtain results when we replace the iid dataset with one actually collected by a policy, which involves
dealing with temporal correlations. See Section 15.4 for further discussion.
Given the dataset $\mathcal{D} := \{(s_i, a_i, r_i, s_i')\}_{i=1}^n$, our goal is to output a near-optimal policy for the MDP; that is, we would like our algorithm to produce a policy $\widehat{\pi}$ such that, with probability at least $1-\delta$, $V^{\widehat{\pi}} \ge V^\star - \epsilon$, for a given $(\epsilon,\delta)$ pair. As usual, the number of samples $n$ will depend on the accuracy parameters $(\epsilon,\delta)$ and we would like $n$ to scale favorably with these.
Denote a function class F = {f : S × A 7→ [0, Vmax ]}. We assume F is discrete and also contains Q? .

Assumption 15.1 (Realizability). We assume F is rich enough such that Q? ∈ F.

We require that the sample complexity of the learning algorithm scale polynomially with respect to $\ln(|\mathcal{F}|)$.
Since in offline RL the learner cannot interact with the environment at all, we require that the data distribution $\mu$ be exploratory enough.

Assumption 15.2 (Concentrability). There exists a constant $C$ such that for any policy $\pi$ (including non-stationary policies), we have:
$$\forall \pi, s, a:\quad \frac{d^\pi(s,a)}{\mu(s,a)} \le C.$$

Note that concentrability does not require that the state space is finite, but it does place some constraints on the system dynamics. Note that the above assumption requires $\mu$ to cover all possible policies' state-action distributions, even including non-stationary policies. Recall the concentrability assumptions in Approximate Policy Iteration (Chapter 3) and Conservative Policy Iteration (Chapter 12). The concentrability assumption here is the strongest, as it requires $\mu$ to cover even non-stationary policies' state-action distributions.
In addition to the above two assumptions, we need an assumption on the representational condition of the class $\mathcal{F}$.

Assumption 15.3 (Bellman Completion). We assume that for any f ∈ F, T f ∈ F.

Note that this implies that $Q^\star\in\mathcal{F}$ (as $Q^\star$ is the convergence point of value iteration), which is the weaker assumption we would hope is sufficient. However, as we discuss in Section 15.4, the Bellman completion assumption is necessary in order to learn with polynomial sample complexity.

15.2 Algorithm: Fitted Q Iteration (FQI)

Fitted Q Iteration (FQI) simply performs the following iteration. Starting with some $f_0\in\mathcal{F}$, FQI iterates:
$$\text{FQI:}\quad f_t \in \mathop{\mathrm{argmin}}_{f\in\mathcal{F}}\sum_{i=1}^n\Big(f(s_i,a_i) - r_i - \gamma\max_{a'\in A} f_{t-1}(s_i',a')\Big)^2. \qquad (0.1)$$

After $k$ iterations, we output the policy $\pi_k(s) := \mathrm{argmax}_a f_k(s,a)$, $\forall s$.
Note that the Bayes optimal solution of the regression problem is $\mathcal{T}f_{t-1}$. Due to the Bellman completion assumption, the Bayes optimal solution $\mathcal{T}f_{t-1}\in\mathcal{F}$. Thus, we should expect that $f_t$ is close to the Bayes optimal $\mathcal{T}f_{t-1}$ under the distribution $\mu$; i.e., with a uniform convergence argument, for the generalization bound we should expect that:
$$\mathbb{E}_{s,a\sim\mu}\big(f_t(s,a) - \mathcal{T}f_{t-1}(s,a)\big)^2 \approx \sqrt{1/n}.$$
Indeed, as we demonstrate in Lemma 15.5, for the square loss, under the realizability assumption that the Bayes optimal solution belongs to $\mathcal{F}$ (here, $\mathcal{T}f_{t-1}\in\mathcal{F}$), we obtain a sharper generalization error scaling on the order of $1/n$. Thus, at a high level, $f_t \approx \mathcal{T}f_{t-1}$ since our data distribution $\mu$ is exploratory, and we know from value iteration that $\mathcal{T}f_{t-1}$ is a better approximation of $Q^\star$ than $f_{t-1}$, i.e., $\|\mathcal{T}f_{t-1} - Q^\star\|_\infty \le \gamma\|f_{t-1} - Q^\star\|_\infty$; hence we can expect FQI to converge to the optimal solution as $n\to\infty$ and $t\to\infty$. We formalize the above intuition below.
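A minimal tabular sketch of the FQI update (Eq. 0.1) is given below. The MDP and the offline dataset are synthetic, and the function class is taken to be all tables in $[0, V_{\max}]^{S\times A}$, for which the least-squares step reduces to averaging the regression targets per $(s,a)$; with a restricted class one would instead fit a regressor from $\mathcal{F}$ to the same targets.

```python
import numpy as np

# Minimal tabular sketch of Fitted Q Iteration (Eq. 0.1); MDP and data synthetic.
S, A, gamma, n, K = 5, 3, 0.9, 20000, 50
rng = np.random.default_rng(0)
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))

# Offline dataset: (s, a) ~ mu (uniform here), reward r(s, a), s' ~ P(.|s, a).
s = rng.integers(S, size=n)
a = rng.integers(A, size=n)
rew = r[s, a]
s_next = np.array([rng.choice(S, p=P[s[i], a[i]]) for i in range(n)])

f = np.zeros((S, A))
for _ in range(K):
    target = rew + gamma * f[s_next].max(axis=1)      # r_i + gamma * max_a' f(s'_i, a')
    f_new = np.zeros((S, A))
    for ss in range(S):                               # least squares over all tables =
        for aa in range(A):                           # per-(s,a) average of the targets
            mask = (s == ss) & (a == aa)
            if mask.any():
                f_new[ss, aa] = target[mask].mean()
    f = f_new

pi = f.argmax(axis=1)                                 # greedy policy w.r.t. the last iterate
print("greedy policy:", pi)
```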

15.3 Analysis

With ft from FQI (Eq. 0.1), denote πt (s) := argmaxa ft (s, a) for all s ∈ S.
We first state the performance guarantee of FQI.
Theorem 15.4 (FQI guarantee). The $k$-th iterate of Fitted Q Iteration guarantees that, with probability at least $1-\delta$,
$$V^\star - V^{\pi_k} \;\le\; O\left(\frac{1}{(1-\gamma)^3}\sqrt{\frac{C\log(|\mathcal{F}|/\delta)}{n}} + \frac{2\gamma^k}{(1-\gamma)^2}\right).$$

The first term is an estimation error term, which goes to 0 as we get more data. The second term is an "optimization error" term that goes to 0 as we do more iterations; it can always be made arbitrarily small at the expense of more computation.
We now prove the theorem. Given any distribution $\nu\in\Delta(S\times A)$ and any function $f: S\times A\mapsto\mathbb{R}$, we write $\|f\|_{2,\nu}^2 := \mathbb{E}_{s,a\sim\nu} f^2(s,a)$ and $\|f\|_{1,\nu} := \mathbb{E}_{s,a\sim\nu}|f(s,a)|$. For any $\nu\in\Delta(S)$ and a policy $\pi$, we denote $\nu\circ\pi$ as the joint state-action distribution, i.e., $s\sim\nu,\ a\sim\pi(\cdot|s)$.
Proof: We start from the performance difference lemma:
$$\begin{aligned}
(1-\gamma)\big(V^\star - V^{\pi_k}\big) &= \mathbb{E}_{s\sim d^{\pi_k}}\big[-A^\star(s,\pi_k(s))\big]\\
&= \mathbb{E}_{s\sim d^{\pi_k}}\big[Q^\star(s,\pi^\star(s)) - Q^\star(s,\pi_k(s))\big]\\
&\le \mathbb{E}_{s\sim d^{\pi_k}}\big[Q^\star(s,\pi^\star(s)) - f_k(s,\pi^\star(s)) + f_k(s,\pi_k(s)) - Q^\star(s,\pi_k(s))\big]\\
&\le \|Q^\star - f_k\|_{1,\,d^{\pi_k}\circ\pi^\star} + \|Q^\star - f_k\|_{1,\,d^{\pi_k}\circ\pi_k}\\
&\le \|Q^\star - f_k\|_{2,\,d^{\pi_k}\circ\pi^\star} + \|Q^\star - f_k\|_{2,\,d^{\pi_k}\circ\pi_k},
\end{aligned}$$
where the first inequality comes from the fact that $\pi_k$ is the greedy policy of $f_k$, i.e., $f_k(s,\pi_k(s)) \ge f_k(s,a)$ for any other $a$, including $\pi^\star(s)$. Now we bound each term on the RHS of the above inequality. We do this by considering a general state-action distribution $\nu$. We have:
$$\begin{aligned}
\|Q^\star - f_k\|_{2,\nu} &\le \|Q^\star - \mathcal{T}f_{k-1}\|_{2,\nu} + \|f_k - \mathcal{T}f_{k-1}\|_{2,\nu}\\
&\le \gamma\sqrt{\mathbb{E}_{s,a\sim\nu}\Big(\mathbb{E}_{s'\sim P(\cdot|s,a)}\Big[\max_a Q^\star(s',a) - \max_a f_{k-1}(s',a)\Big]\Big)^2} + \|f_k - \mathcal{T}f_{k-1}\|_{2,\nu}\\
&\le \gamma\sqrt{\mathbb{E}_{s,a\sim\nu,\,s'\sim P(\cdot|s,a)}\max_a\big(Q^\star(s',a) - f_{k-1}(s',a)\big)^2} + \sqrt{C}\,\|f_k - \mathcal{T}f_{k-1}\|_{2,\mu},
\end{aligned}$$
where in the last inequality we use the facts that $(\mathbb{E}[x])^2 \le \mathbb{E}[x^2]$ and $(\max_x f(x) - \max_x g(x))^2 \le \max_x(f(x) - g(x))^2$ for any two functions $f$ and $g$, together with Assumption 15.2.
Denote $\nu'(s',a') := \sum_{s,a}\nu(s,a)P(s'|s,a)\mathbf{1}\{a' = \mathrm{argmax}_a\,(Q^\star(s',a) - f_{k-1}(s',a))^2\}$; the above inequality becomes:
$$\|Q^\star - f_k\|_{2,\nu} \le \gamma\,\|Q^\star - f_{k-1}\|_{2,\nu'} + \sqrt{C}\,\|f_k - \mathcal{T}f_{k-1}\|_{2,\mu}.$$
We can recursively repeat the same process for $\|Q^\star - f_{k-1}\|_{2,\nu'}$ down to iterate $0$:
$$\|Q^\star - f_k\|_{2,\nu} \le \sqrt{C}\sum_{t=0}^{k-1}\gamma^t\,\|f_{k-t} - \mathcal{T}f_{k-t-1}\|_{2,\mu} + \gamma^k\,\|Q^\star - f_0\|_{2,\widetilde{\nu}},$$
where $\widetilde{\nu}$ is some valid state-action distribution.
Note that the first term on the RHS of the above inequality can be bounded using Lemma 15.5. With probability at least $1-\delta$, we have:
$$\sqrt{C}\sum_{t=0}^{k-1}\gamma^t\,\|f_{k-t} - \mathcal{T}f_{k-t-1}\|_{2,\mu} \le O\left(\sqrt{C}\sum_{t=0}^{k-1}\gamma^t\sqrt{\frac{V_{\max}^2\ln(|\mathcal{F}|/\delta)}{n}}\right) = O\left(\frac{V_{\max}\sqrt{C\ln(|\mathcal{F}|/\delta)}}{(1-\gamma)\sqrt{n}}\right).$$
For the second term, we have:
$$\gamma^k\,\|Q^\star - f_0\|_{2,\widetilde{\nu}} \le \gamma^k V_{\max}.$$
Thus, we have:
$$\|Q^\star - f_k\|_{2,\nu} = O\left(\frac{V_{\max}\sqrt{C\ln(|\mathcal{F}|/\delta)}}{(1-\gamma)\sqrt{n}} + \gamma^k V_{\max}\right),$$
for all $\nu$, including $\nu = d^{\pi_k}\circ\pi^\star$ and $\nu = d^{\pi_k}\circ\pi_k$. This concludes the proof.
The following lemma studies the generalization error for the least squares problems in FQI. Specifically, it leverages the fact that for the square loss, under the realizability assumption (the Bayes optimal solution belongs to the function class), the generalization error scales on the order of $O(1/n)$.
Lemma 15.5 (Least Squares Generalization Error). Given $f'\in\mathcal{F}$, denote $\widehat{f}_{f'} := \mathrm{argmin}_{f\in\mathcal{F}}\sum_{i=1}^n\big(f(s_i,a_i) - r_i - \gamma\max_{a'} f'(s_i',a')\big)^2$. With probability at least $1-\delta$, for all $f'\in\mathcal{F}$, we have:
$$\mathbb{E}_{s,a\sim\mu}\Big(\widehat{f}_{f'}(s,a) - \mathcal{T}f'(s,a)\Big)^2 = O\left(\frac{V_{\max}^2\ln(|\mathcal{F}|/\delta)}{n}\right).$$
Note that the above lemma indicates that, with probability at least $1-\delta$, for all $t = 1, 2, \ldots$, we must have:
$$\mathbb{E}_{s,a\sim\mu}\big(f_t(s,a) - \mathcal{T}f_{t-1}(s,a)\big)^2 = O\left(\frac{V_{\max}^2\ln(|\mathcal{F}|/\delta)}{n}\right).$$
Proof: Let us consider a fixed function $f'\in\mathcal{F}$ first, and denote $\widehat{f} = \mathrm{argmin}_{f\in\mathcal{F}}\sum_{i=1}^n\big(f(s_i,a_i) - r_i - \gamma\max_{a'\in A} f'(s_i',a')\big)^2$. Note that $\widehat{f}$ is fully determined by $f'$. At the end, we will apply a union bound over all $f'\in\mathcal{F}$.
For any fixed $f\in\mathcal{F}$, let us denote the random variable $z_i^f$:
$$z_i^f := \Big(f(s_i,a_i) - r_i - \gamma\max_{a'\in A} f'(s_i',a')\Big)^2 - \Big(\mathcal{T}f'(s_i,a_i) - r_i - \gamma\max_{a'\in A} f'(s_i',a')\Big)^2.$$
First note that
$$|z_i^f| \le V_{\max}^2,$$
which comes from the assumption that $f(s,a)\in[0,V_{\max}]$ for all $s,a,f$.
Next, we compute the expectation of $z_i^f$:
$$\begin{aligned}
\mathbb{E}_{s_i,a_i\sim\mu,\,s_i'\sim P(\cdot|s_i,a_i)}\big[z_i^f\big] &= \mathbb{E}\Big[\big(f(s_i,a_i) - \mathcal{T}f'(s_i,a_i)\big)\Big(f(s_i,a_i) + \mathcal{T}f'(s_i,a_i) - 2\big(r_i + \gamma\max_{a'\in A} f'(s_i',a')\big)\Big)\Big]\\
&= \mathbb{E}_{s_i,a_i\sim\mu}\Big[\big(f(s_i,a_i) - \mathcal{T}f'(s_i,a_i)\big)\Big(f(s_i,a_i) + \mathcal{T}f'(s_i,a_i) - 2\big(r_i + \gamma\,\mathbb{E}_{s_i'\sim P(\cdot|s_i,a_i)}\max_{a'} f'(s_i',a')\big)\Big)\Big]\\
&= \mathbb{E}_{s_i,a_i\sim\mu}\big(f(s_i,a_i) - \mathcal{T}f'(s_i,a_i)\big)^2 = \mathbb{E}_{s,a\sim\mu}\big(f(s,a) - \mathcal{T}f'(s,a)\big)^2,
\end{aligned}$$
where the third equality uses the definition of the Bellman operator, i.e., that $r_i + \gamma\,\mathbb{E}_{s_i'\sim P(\cdot|s_i,a_i)}\max_{a'} f'(s_i',a') = \mathcal{T}f'(s_i,a_i)$.
Now we calculate the second moment of $z_i^f$:
$$\begin{aligned}
\mathbb{E}_{s_i,a_i\sim\mu,\,s_i'\sim P(\cdot|s_i,a_i)}\big[(z_i^f)^2\big] &= \mathbb{E}\Big[\big(f(s_i,a_i) - \mathcal{T}f'(s_i,a_i)\big)^2\Big(f(s_i,a_i) + \mathcal{T}f'(s_i,a_i) - 2\big(r_i + \gamma\max_{a'} f'(s_i',a')\big)\Big)^2\Big]\\
&\le 4V_{\max}^2\,\mathbb{E}_{s,a\sim\mu}\big(f(s,a) - \mathcal{T}f'(s,a)\big)^2,
\end{aligned}$$
where in the inequality we again use the assumption that $f(s,a)\in[0,V_{\max}]$ for all $s,a,f$.
Now we can use Bernstein's inequality to bound $\mathbb{E}_{s,a\sim\mu}\big(f(s,a)-\mathcal{T}f'(s,a)\big)^2 - \frac{1}{n}\sum_{i=1}^n z_i^f$. Together with a union bound over all $f\in\mathcal{F}$, we have that with probability at least $1-\delta$, for any $f\in\mathcal{F}$:
$$\Big|\mathbb{E}_{s,a\sim\mu}\big(f(s,a)-\mathcal{T}f'(s,a)\big)^2 - \frac{1}{n}\sum_{i=1}^n z_i^f\Big| \le \sqrt{\frac{8V_{\max}^2\,\mathbb{E}_{s,a\sim\mu}\big[(f(s,a)-\mathcal{T}f'(s,a))^2\big]\ln(|\mathcal{F}|/\delta)}{n}} + \frac{4V_{\max}^2\ln(|\mathcal{F}|/\delta)}{3n}.$$
Setting $f = \widehat{f}$, and using the fact that $\widehat{f}$ is the minimizer of the least squares objective, i.e., $\frac{1}{n}\sum_{i=1}^n z_i^{\widehat{f}} \le \frac{1}{n}\sum_{i=1}^n z_i^{\mathcal{T}f'} = 0$ (recall $\mathcal{T}f'\in\mathcal{F}$ by Assumption 15.3), we have:
$$\mathbb{E}_{s,a\sim\mu}\big(\widehat{f}(s,a) - \mathcal{T}f'(s,a)\big)^2 \le \sqrt{\frac{8V_{\max}^2\,\mathbb{E}_{s,a\sim\mu}\big[(\widehat{f}(s,a)-\mathcal{T}f'(s,a))^2\big]\ln(|\mathcal{F}|/\delta)}{n}} + \frac{4V_{\max}^2\ln(|\mathcal{F}|/\delta)}{3n}.$$
Solving for $\mathbb{E}_{s,a\sim\mu}\big(\widehat{f}(s,a) - \mathcal{T}f'(s,a)\big)^2$, we get:
$$\mathbb{E}_{s,a\sim\mu}\big(\widehat{f}(s,a) - \mathcal{T}f'(s,a)\big)^2 \le \Big(\sqrt{2} + \sqrt{10/3}\Big)^2\,\frac{V_{\max}^2\ln(|\mathcal{F}|/\delta)}{n}.$$
Note that the above result holds for a fixed $f'\in\mathcal{F}$. Applying a union bound over all $f'\in\mathcal{F}$ concludes the proof.

Note that the above lemma and its proof show that, more generally, we can obtain an $O(1/n)$ generalization error (as opposed to the common $O(1/\sqrt{n})$) for the square loss under the realizability assumption.

15.4 Bibliographic Remarks and Further Readings

The authors graciously acknowledge Akshay Krishnamurthy for sharing the lecture notes on which this chapter is based.
The fitted Q-iteration sample complexity analysis is originally due to [Munos, 2003, Munos and Szepesvári, 2008,
Antos et al., 2008], under the concentrability based assumptions. More generally, the offline RL literature [Munos,
2003, Szepesvári and Munos, 2005, Antos et al., 2008, Munos and Szepesvári, 2008, Tosatto et al., 2017, Chen and
Jiang, 2019, Xie and Jiang, 2020] largely analyzes the sample complexity of approximate dynamic programming-based
approaches under either of the following two categories of assumptions: (i) density based assumptions on the state-
action space, where it is assumed there is low distribution shift with regards to the offline data collection distribution (e.g. concentrability based); (ii) representation conditions that assume some given linear function class satisfies a completeness condition (e.g. Assumption 15.3 considered here), along with some coverage assumption over the feature space (rather than coverage over the state-action space).

With regards to the simpler question of just policy evaluation — estimating the value of a given policy with offline
data — [Duan and Wang, 2020] provide minimax optimal rates. With regards to necessary conditions for sample
efficient offline RL, [Wang et al., 2020] proves that offline RL is impossible, under only the weaker assumptions of
realizability and feature coverage. In particular, the latter result shows how fitted Q-iteration can have exponential error
amplification; furthermore, it shows that substantially stronger assumptions are required for offline RL algorithms to
be statistically efficient (such as the assumptions considered in this chapter).

Chapter 16

Partially Observable Markov Decision


Processes

To be added...

Bibliography

Yasin Abbasi-Yadkori and Csaba Szepesvári. Regret bounds for the adaptive control of linear quadratic systems. In
Conference on Learning Theory, pages 1–26, 2011.

Yasin Abbasi-Yadkori, Dávid Pál, and Csaba Szepesvári. Improved algorithms for linear stochastic bandits. In Ad-
vances in Neural Information Processing Systems, pages 2312–2320, 2011.

Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gellért Weisz. POLITEX:
Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pages
3692–3702, 2019.

Naoki Abe and Philip M. Long. Associative reinforcement learning using linear probabilistic concepts. In Proc. 16th
International Conf. on Machine Learning, pages 3–11. Morgan Kaufmann, San Francisco, CA, 1999.

Alekh Agarwal, Mikael Henaff, Sham Kakade, and Wen Sun. Pc-pg: Policy cover directed exploration for provable
policy gradient learning. NeurIPS, 2020a.

Alekh Agarwal, Sham Kakade, Akshay Krishnamurthy, and Wen Sun. Flambe: Structural complexity and representa-
tion learning of low rank mdps. NeurIPS, 2020b.

Alekh Agarwal, Sham Kakade, and Lin F. Yang. Model-based reinforcement learning with a generative model is
minimax optimal. In COLT, volume 125, pages 67–83, 2020c.

Alekh Agarwal, Sham M. Kakade, Jason D. Lee, and Gaurav Mahajan. On the theory of policy gradient methods:
Optimality, approximation, and distribution shift, 2020d.

Naman Agarwal, Brian Bullins, Elad Hazan, Sham M Kakade, and Karan Singh. Online control with adversarial
disturbances. arXiv preprint arXiv:1902.08721, 2019.

Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans, editors. Understanding the impact of
entropy on policy optimization, 2019. URL [Link]

Hyo-Sung Ahn, YangQuan Chen, and Kevin L. Moore. Iterative learning control: Brief survey and categorization.
IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 37(6):1099–1121, 2007.

Brian D. O. Anderson and John B. Moore. Optimal Control: Linear Quadratic Methods. Prentice-Hall, Inc., Upper
Saddle River, NJ, USA, 1990. ISBN 0-13-638560-5.

András Antos, Csaba Szepesvári, and Rémi Munos. Learning near-optimal policies with bellman-residual minimiza-
tion based fitted policy iteration and a single sample path. Machine Learning, 71(1):89–129, 2008.

P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Mach. Learn., 47
(2-3):235–256, 2002. ISSN 0885-6125.

Alex Ayoub, Zeyu Jia, Csaba Szepesvári, Mengdi Wang, and Lin F. Yang. Model-based reinforcement learning with
value-targeted regression. arXiv preprint arXiv:2006.01107, 2020. URL [Link]
01107.
Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax pac bounds on the sample complexity of
reinforcement learning with a generative model. Machine learning, 91(3):325–349, 2013.
Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In
Doina Precup and Yee Whye Teh, editors, Proceedings of Machine Learning Research, volume 70, pages 263–272,
International Convention Centre, Sydney, Australia, 06–11 Aug 2017. PMLR.
J. Andrew Bagnell and Jeff Schneider. Covariant policy search. Proceedings of the 18th International Joint Confer-
ence on Artificial Intelligence, pages 1019–1024, 2003. URL [Link]
1630659.1630805.
J Andrew Bagnell, Sham M Kakade, Jeff G Schneider, and Andrew Y Ng. Policy search by dynamic programming.
In Advances in neural information processing systems, pages 831–838, 2004.
V. Balakrishnan and L. Vandenberghe. Semidefinite programming duality and linear time-invariant systems. IEEE
Transactions on Automatic Control, 48(1):30–41, 2003.
A. Beck. First-Order Methods in Optimization. Society for Industrial and Applied Mathematics, Philadelphia, PA,
2017. doi: 10.1137/1.9781611974997.
Richard Bellman. Dynamic programming and Lagrange multipliers. Proceedings of the National Academy of Sciences,
42(10):767–769, 1956.
Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 2017.
Ronen I Brafman and Moshe Tennenholtz. R-max-a general polynomial time algorithm for near-optimal reinforcement
learning. Journal of Machine Learning Research, 3(Oct):213–231, 2002.
Kianté Brantley, Wen Sun, and Mikael Henaff. Disagreement-regularized imitation learning. In International Confer-
ence on Learning Representations, 2019.
Kai-Wei Chang, Akshay Krishnamurthy, Alekh Agarwal, Hal Daume, and John Langford. Learning to search better
than your teacher. In International Conference on Machine Learning, pages 2058–2066. PMLR, 2015.
Jinglin Chen and Nan Jiang. Information-theoretic considerations in batch reinforcement learning. In International
Conference on Machine Learning, pages 1042–1051, 2019.
Ching-An Cheng and Byron Boots. Convergence of value aggregation for imitation learning. arXiv preprint
arXiv:1801.07292, 2018.
Ching-An Cheng, Xinyan Yan, Nolan Wagener, and Byron Boots. Fast policy learning through imitation and rein-
forcement. arXiv preprint arXiv:1805.10413, 2018.
Ching-An Cheng, Andrey Kolobov, and Alekh Agarwal. Policy improvement from multiple experts. arXiv preprint
arXiv:2007.00795, 2020.
Alon Cohen, Tomer Koren, and Yishay Mansour. Learning linear-quadratic regulators efficiently with only $\sqrt{T}$
regret. In International Conference on Machine Learning, pages 1300–1309, 2019.
Varsha Dani, Thomas P Hayes, and Sham M Kakade. Stochastic linear optimization under bandit feedback. In COLT,
pages 355–366, 2008.
Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Ad-
vances in Neural Information Processing Systems, pages 2818–2826, 2015.

Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic
reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
Christoph Dann, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. On
oracle-efficient PAC reinforcement learning with rich observations. In Advances in Neural Information Processing
Systems 31, 2018.
Hal Daumé, John Langford, and Daniel Marcu. Search-based structured prediction. Machine learning, 75(3):297–325,
2009.
S. Dean, H. Mania, N. Matni, B. Recht, and S. Tu. On the sample complexity of the linear quadratic regulator. ArXiv
e-prints, 2017.
Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control
of the linear quadratic regulator. In Advances in Neural Information Processing Systems, pages 4188–4197, 2018.
Luc Devroye and Gábor Lugosi. Combinatorial methods in density estimation. Springer Science & Business Media,
2012.
Simon S Du, Sham M Kakade, Ruosong Wang, and Lin F Yang. Is a good representation sufficient for sample efficient
reinforcement learning? In International Conference on Learning Representations, 2019.
Yaqi Duan and Mengdi Wang. Minimax-optimal off-policy evaluation with linear function approximation. arXiv,
abs/2002.09516, 2020. URL [Link]
Lawrence C. Evans. An introduction to mathematical optimal control theory. University of California, Department of
Mathematics, page 126, 2005. ISSN 14712334.
Eyal Even-Dar, Sham M Kakade, and Yishay Mansour. Online Markov decision processes. Mathematics of Operations
Research, 34(3):726–736, 2009.
Maryam Fazel, Rong Ge, Sham M. Kakade, and Mehran Mesbahi. Global Convergence of Policy Gradient
Methods for the Linear Quadratic Regulator. In Proceedings of the 35th International Conference on Machine
Learning, pages 1466–1475. PMLR, 2018.
Dylan J Foster and Max Simchowitz. Logarithmic regret for adversarial online control. arXiv preprint
arXiv:2003.00189, 2020.
Sara A Geer and Sara van de Geer. Empirical Processes in M-estimation, volume 6. Cambridge university press, 2000.
Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized markov decision processes. arXiv
preprint arXiv:1901.11275, 2019.
Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming.
SIAM Journal on Optimization, 23(4):2341–2368, 2013.
Peter W. Glynn. Likelihood ratio gradient estimation for stochastic systems. Commun. ACM, 33(10):75–84, 1990.
ISSN 0001-0782.
Gautam Goel and Babak Hassibi. The power of linear controllers in lqr control. arXiv preprint arXiv:2002.02574,
2020.
Todd Hester, Matej Vecerik, Olivier Pietquin, Marc Lanctot, Tom Schaul, Bilal Piot, Dan Horgan, John Quan, Andrew
Sendonaris, Gabriel Dulac-Arnold, et al. Deep q-learning from demonstrations. arXiv preprint arXiv:1704.03732,
2017.
Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning. In Advances in neural information pro-
cessing systems, pages 4565–4573, 2016.

Daniel Hsu, Sham Kakade, and Tong Zhang. A spectral algorithm for learning hidden markov models. Journal of
Computer and System Sciences, 78, 11 2008.

Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of
Machine Learning Research, 11(Apr):1563–1600, 2010.

Zeyu Jia, Lin Yang, Csaba Szepesvari, and Mengdi Wang. Model-based reinforcement learning with value-targeted
regression. Proceedings of Machine Learning Research, 120:666–686, 2020.

Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, John Langford, and Robert E. Schapire. Contextual decision
processes with low Bellman rank are PAC-learnable. In International Conference on Machine Learning, 2017.

Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear
function approximation. In Conference on Learning Theory, pages 2137–2143, 2020.

S. Kakade. A natural policy gradient. In NIPS, 2001.

Sham Kakade and John Langford. Approximately Optimal Approximate Reinforcement Learning. In Proceedings of
the 19th International Conference on Machine Learning, volume 2, pages 267–274, 2002.

Sham Machandranath Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of College
London, 2003.

Liyiming Ke, Matt Barnes, Wen Sun, Gilwoo Lee, Sanjiban Choudhury, and Siddhartha Srinivasa. Imitation learning
as f -divergence minimization. arXiv preprint arXiv:1905.12888, 2019.

Michael Kearns and Daphne Koller. Efficient reinforcement learning in factored mdps. In IJCAI, volume 16, pages
740–747, 1999.

Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49
(2-3):209–232, 2002.

Michael J Kearns and Satinder P Singh. Finite-sample convergence rates for q-learning and indirect algorithms. In
Advances in neural information processing systems, pages 996–1002, 1999.

Michael J. Kearns, Yishay Mansour, and Andrew Y. Ng. Approximate planning in large pomdps via reusable trajec-
tories. In S. A. Solla, T. K. Leen, and K. Müller, editors, Advances in Neural Information Processing Systems 12,
pages 1001–1007. MIT Press, 2000.

Kris M Kitani, Brian D Ziebart, James Andrew Bagnell, and Martial Hebert. Activity forecasting. In European
Conference on Computer Vision, pages 201–214. Springer, 2012.

Akshay Krishnamurthy, Alekh Agarwal, and John Langford. PAC reinforcement learning with rich observations. In
Advances in Neural Information Processing Systems, pages 1840–1848, 2016.

Alessandro Lazaric, Mohammad Ghavamzadeh, and Rémi Munos. Analysis of classification-based policy iteration
algorithms. The Journal of Machine Learning Research, 17(1):583–612, 2016.

Hoang M Le, Nan Jiang, Alekh Agarwal, Miroslav Dudı́k, Yisong Yue, and Hal Daumé III. Hierarchical imitation and
reinforcement learning. arXiv preprint arXiv:1803.00590, 2018.

Sergey Levine. Reinforcement learning and control as probabilistic inference: Tutorial and review. arXiv preprint
arXiv:1805.00909, 2018.

Sergey Levine and Pieter Abbeel. Learning neural network policies with guided policy search under unknown dynam-
ics. In Advances in Neural Information Processing Systems, pages 1071–1079, 2014.

Sergey Levine and Vladlen Koltun. Guided policy search. In International Conference on Machine Learning, pages
1–9, 2013.
Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Breaking the sample size barrier in model-based
reinforcement learning with a generative model. CoRR, abs/2005.12900, 2020.
Boyi Liu, Qi Cai, Zhuoran Yang, and Zhaoran Wang. Neural proximal/trust region policy optimization attains globally
optimal policy. CoRR, abs/1906.10306, 2019. URL [Link]
Thodoris Lykouris, Max Simchowitz, Aleksandrs Slivkins, and Wen Sun. Corruption robust exploration in episodic
reinforcement learning. arXiv preprint arXiv:1911.08689, 2019.
Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalent control of LQR is efficient. arXiv preprint
arXiv:1902.07826, 2019.
Yishay Mansour and Satinder Singh. On the complexity of policy iteration. UAI, 01 1999.
C. McDiarmid. On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. Cambridge
University Press, 1989.
Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the global convergence rates of softmax
policy gradient methods, 2020.
Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Sil-
ver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In International conference
on machine learning, pages 1928–1937, 2016.
Aditya Modi, Nan Jiang, Ambuj Tewari, and Satinder P. Singh. Sample complexity of reinforcement learning using
linearly combined model ensembles. In The 23rd International Conference on Artificial Intelligence and Statistics,
AISTATS, volume 108 of Proceedings of Machine Learning Research, 2020.
Rémi Munos. Error bounds for approximate policy iteration. In ICML, volume 3, pages 560–567, 2003.
Rémi Munos. Error bounds for approximate value iteration. In AAAI, 2005.
Rémi Munos and Csaba Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning
Research, 9(May):815–857, 2008.
Gergely Neu, Anders Jonsson, and Vicenç Gómez. A unified view of entropy-regularized markov decision processes.
CoRR, abs/1705.07798, 2017.
Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. ArXiv, abs/1608.02732,
2016.
Alejandro Perez, Robert Platt, George Konidaris, Leslie Kaelbling, and Tomas Lozano-Perez. LQR-RRT*: Optimal
sampling-based motion planning with automatically derived extension heuristics. In IEEE International Conference
on Robotics and Automation, pages 2537–2542, 2012.
Jan Peters and Stefan Schaal. Natural actor-critic. Neurocomputing, 71(7-9):1180–1190, 2008.
Dean A Pomerleau. Alvinn: An autonomous land vehicle in a neural network. In Advances in neural information
processing systems, pages 305–313, 1989.
Martin Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley-Interscience, 1994.
Aravind Rajeswaran, Vikash Kumar, Abhishek Gupta, Giulia Vezzani, John Schulman, Emanuel Todorov, and Sergey
Levine. Learning complex dexterous manipulation with deep reinforcement learning and demonstrations. arXiv
preprint arXiv:1709.10087, 2017.

H. Robbins. Some aspects of the sequential design of experiments. In Bulletin of the American Mathematical Society,
volume 55, 1952.

Stéphane Ross and Drew Bagnell. Efficient reductions for imitation learning. In Proceedings of the thirteenth inter-
national conference on artificial intelligence and statistics, pages 661–668, 2010.

Stephane Ross and J Andrew Bagnell. Reinforcement and imitation learning via interactive no-regret learning. arXiv
preprint arXiv:1406.5979, 2014.

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to
no-regret online learning. In Proceedings of the fourteenth international conference on artificial intelligence and
statistics, pages 627–635, 2011.

Bruno Scherrer. Approximate policy iteration schemes: a comparison. In International Conference on Machine
Learning, pages 1314–1322, 2014.

Bruno Scherrer and Matthieu Geist. Local policy search in a convex space and conservative policy iteration as boosted
policy search. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages
35–50. Springer, 2014.

John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization.
In International Conference on Machine Learning, pages 1889–1897, 2015.

John Schulman, F. Wolski, Prafulla Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms.
ArXiv, abs/1707.06347, 2017.

Shai Shalev-Shwartz. Online learning and online convex optimization. Foundations and Trends in Machine Learning,
4(2):107–194, 2011.

Aaron Sidford, Mengdi Wang, Xian Wu, Lin F. Yang, and Yinyu Ye. Near-optimal time and sample complexities
for solving discounted markov decision processes with a generative model. In Advances in Neural Information
Processing Systems 31, 2018.

Max Simchowitz and Dylan J Foster. Naive exploration is optimal for online LQR. arXiv preprint arXiv:2001.09576,
2020.

Max Simchowitz, Horia Mania, Stephen Tu, Michael I Jordan, and Benjamin Recht. Learning without mixing: To-
wards a sharp analysis of linear system identification. In COLT, 2018.

Max Simchowitz, Karan Singh, and Elad Hazan. Improper learning for non-stochastic control. arXiv preprint
arXiv:2001.09254, 2020.

Satinder Singh and Richard Yee. An upper bound on the loss from approximate optimal-value functions. Machine
Learning, 16(3):227–233, 1994.

Wen Sun, Arun Venkatraman, Geoffrey J Gordon, Byron Boots, and J Andrew Bagnell. Deeply aggrevated: Differen-
tiable imitation learning for sequential prediction. arXiv preprint arXiv:1703.01030, 2017.

Wen Sun, J Andrew Bagnell, and Byron Boots. Truncated horizon policy search: Combining reinforcement learning
& imitation learning. arXiv preprint arXiv:1805.11240, 2018.

Wen Sun, Nan Jiang, Akshay Krishnamurthy, Alekh Agarwal, and John Langford. Model-based rl in contextual
decision processes: Pac bounds and exponential improvements over model-free approaches. In Conference on
Learning Theory, pages 2898–2933, 2019a.

Wen Sun, Anirudh Vemula, Byron Boots, and J Andrew Bagnell. Provably efficient imitation learning from observation
alone. arXiv preprint arXiv:1905.10948, 2019b.

Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforce-
ment learning with function approximation. In Advances in Neural Information Processing Systems, volume 99,
pages 1057–1063, 1999.
Csaba Szepesvári and Rémi Munos. Finite time bounds for sampling based fitted value iteration. In Proceedings of
the 22nd international conference on Machine learning, pages 880–887. ACM, 2005.
Russ Tedrake. LQR-trees: Feedback motion planning on sparse randomized trees. The International Journal of
Robotics Research, 35, 2009.
Emanuel Todorov and Weiwei Li. A generalized iterative LQG method for locally-optimal feedback control of con-
strained nonlinear stochastic systems. In American Control Conference, pages 300–306, 2005.
Samuele Tosatto, Matteo Pirotta, Carlo d’Eramo, and Marcello Restelli. Boosted fitted q-iteration. In International
Conference on Machine Learning, pages 3434–3443. PMLR, 2017.
Ruosong Wang, Dean P. Foster, and Sham M. Kakade. What are the statistical limits of offline rl with linear function
approximation?, 2020.
Yuh-Shyang Wang, Nikolai Matni, and John C Doyle. A system-level approach to controller synthesis. IEEE Trans-
actions on Automatic Control, 64(10):4079–4093, 2019.
Grady Williams, Andrew Aldrich, and Evangelos A. Theodorou. Model predictive path integral control: From theory
to parallel computation. Journal of Guidance, Control, and Dynamics, 40(2):344–357, 2017.
Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine
learning, 8(3-4):229–256, 1992.
Tengyang Xie and Nan Jiang. Q* approximation schemes for batch reinforcement learning: A theoretical comparison.
arXiv preprint arXiv:2003.03924, 2020.
Lin F Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound.
arXiv preprint arXiv:1905.10389, 2019a.
Lin F. Yang and Mengdi Wang. Sample-optimal parametric q-learning using linearly additive features. In International
Conference on Machine Learning, pages 6995–7004, 2019b.
Yinyu Ye. A new complexity result on solving the markov decision problem. Math. Oper. Res., 30:733–749, 08 2005.
Yinyu Ye. The simplex and policy-iteration methods are strongly polynomial for the markov decision problem with a
fixed discount rate. Math. Oper. Res., 36(4):593–603, 2011.
Dante Youla, Hamid Jabr, and Joseph Bongiorno Jr. Modern Wiener-Hopf design of optimal controllers–Part II: The
multivariable case. IEEE Transactions on Automatic Control, 21(3):319–338, 1976.
Dongruo Zhou, Jiafan He, and Quanquan Gu. Provably efficient reinforcement learning for discounted mdps with
feature mapping. arXiv preprint arXiv:2006.13165, 2020.
Brian D Ziebart, Andrew L Maas, J Andrew Bagnell, and Anind K Dey. Maximum entropy inverse reinforcement
learning. In AAAI, pages 1433–1438, 2008.
Brian D Ziebart, Nathan Ratliff, Garratt Gallagher, Christoph Mertz, Kevin Peterson, J Andrew Bagnell, Martial
Hebert, Anind K Dey, and Siddhartha Srinivasa. Planning-based prediction for pedestrians. In 2009 IEEE/RSJ
International Conference on Intelligent Robots and Systems, pages 3931–3936. IEEE, 2009.
Brian D Ziebart, J Andrew Bagnell, and Anind K Dey. Modeling interaction via the principle of maximum causal
entropy. In Proceedings of the 27th International Conference on Machine Learning (ICML-10). Carnegie Mellon
University, 2010.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the
20th international conference on machine learning (icml-03), pages 928–936, 2003.

Appendix A

Concentration

Lemma A.1. (Hoeffding's inequality) Suppose $X_1, X_2, \ldots, X_n$ is a sequence of independent, identically distributed (i.i.d.) random variables with mean $\mu$. Let $\bar{X}_n = n^{-1} \sum_{i=1}^n X_i$. Suppose that $X_i \in [b_-, b_+]$ with probability $1$. Then, for all $\epsilon > 0$,
\[
P\big(\bar{X}_n \geq \mu + \epsilon\big) \leq e^{-2n\epsilon^2/(b_+ - b_-)^2}.
\]
Similarly,
\[
P\big(\bar{X}_n \leq \mu - \epsilon\big) \leq e^{-2n\epsilon^2/(b_+ - b_-)^2}.
\]
The Chernoff bound implies that, with probability at least $1 - \delta$,
\[
\bar{X}_n - \mathbb{E}X \leq (b_+ - b_-)\sqrt{\ln(1/\delta)/(2n)}.
\]
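To see how the last display follows from the first tail bound (a short derivation included here only for the reader's convenience), set the right-hand side equal to $\delta$ and solve for $\epsilon$:
\[
e^{-2n\epsilon^2/(b_+ - b_-)^2} = \delta \quad \Longleftrightarrow \quad \epsilon = (b_+ - b_-)\sqrt{\frac{\ln(1/\delta)}{2n}},
\]
so the event $\bar{X}_n - \mathbb{E}X \geq (b_+ - b_-)\sqrt{\ln(1/\delta)/(2n)}$ has probability at most $\delta$.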

Theorem A.2 (Hoeffding-Azuma Inequality). Suppose $X_1, \ldots, X_N$ is a martingale difference sequence where each $X_i$ is $\sigma$ sub-Gaussian. Then, for every $\epsilon > 0$,
\[
\Pr\Big(\sum_{i=1}^{N} X_i \geq \epsilon\Big) \leq \exp\Big(\frac{-\epsilon^2}{2N\sigma^2}\Big).
\]
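For instance (an illustration, not part of the theorem as stated), if each increment is bounded, say $|X_i| \leq c$ almost surely for a constant $c$ introduced only for this sketch, then each $X_i$ is conditionally $c$ sub-Gaussian by Hoeffding's lemma, and the theorem recovers the classical Azuma-Hoeffding bound for bounded martingale differences:
\[
\Pr\Big(\sum_{i=1}^{N} X_i \geq \epsilon\Big) \leq \exp\Big(\frac{-\epsilon^2}{2Nc^2}\Big).
\]
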
Lemma A.3. (Bernstein's inequality) Suppose $X_1, \ldots, X_n$ are independent random variables. Let $\bar{X}_n = n^{-1} \sum_{i=1}^n X_i$, $\mu = \mathbb{E}\bar{X}_n$, and let $\mathrm{Var}(X_i)$ denote the variance of $X_i$. If $X_i - \mathbb{E}X_i \leq b$ for all $i$, then
\[
P\big(\bar{X}_n \geq \mu + \epsilon\big) \leq \exp\bigg( -\frac{n^2 \epsilon^2}{2 \sum_{i=1}^{n} \mathrm{Var}(X_i) + 2nb\epsilon/3} \bigg).
\]
If all the variances are equal, the Bernstein inequality implies that, with probability at least $1 - \delta$,
\[
\bar{X}_n - \mathbb{E}X \leq \sqrt{2 \mathrm{Var}(X) \ln(1/\delta)/n} + \frac{2b \ln(1/\delta)}{3n}.
\]
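To see why Bernstein's inequality can be much sharper than Hoeffding's inequality, consider the following illustration (not part of the lemma): let the $X_i$ be i.i.d. $\mathrm{Bernoulli}(p)$, so that $\mathrm{Var}(X_i) = p(1-p) \leq p$ and $X_i - \mathbb{E}X_i \leq 1$, i.e., we may take $b = 1$. The display above then gives, with probability at least $1 - \delta$,
\[
\bar{X}_n - p \leq \sqrt{\frac{2p(1-p)\ln(1/\delta)}{n}} + \frac{2\ln(1/\delta)}{3n},
\]
whose leading term scales with $\sqrt{p}$, whereas Hoeffding's inequality only gives $\sqrt{\ln(1/\delta)/(2n)}$ no matter how small $p$ is.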

The following concentration bound is a simple application of McDiarmid's inequality [McDiarmid, 1989] (e.g., see [Hsu et al., 2008] for a proof).

Proposition A.4. (Concentration for Discrete Distributions) Let $z$ be a discrete random variable that takes values in $\{1, \ldots, d\}$, distributed according to $q$. We write $q$ as a vector where $\vec{q} = [\Pr(z = j)]_{j=1}^{d}$. Assume we have $N$ i.i.d. samples, and that our empirical estimate of $\vec{q}$ is $[\widehat{q}]_j = \sum_{i=1}^{N} \mathbf{1}[z_i = j]/N$. We have that, for all $\epsilon > 0$,
\[
\Pr\big( \|\widehat{q} - \vec{q}\|_2 \geq 1/\sqrt{N} + \epsilon \big) \leq e^{-N\epsilon^2},
\]
which implies that
\[
\Pr\big( \|\widehat{q} - \vec{q}\|_1 \geq \sqrt{d}\,(1/\sqrt{N} + \epsilon) \big) \leq e^{-N\epsilon^2}.
\]
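The second display follows from the first via the Cauchy-Schwarz inequality, a one-line step spelled out here for completeness: for any $v \in \mathbb{R}^d$,
\[
\|v\|_1 = \sum_{j=1}^{d} |v_j| \leq \sqrt{d}\, \|v\|_2,
\]
so the event $\|\widehat{q} - \vec{q}\|_1 \geq \sqrt{d}\,(1/\sqrt{N} + \epsilon)$ is contained in the event $\|\widehat{q} - \vec{q}\|_2 \geq 1/\sqrt{N} + \epsilon$, and hence has no larger probability.
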
Lemma A.5 (Self-Normalized Bound for Vector-Valued Martingales; [Abbasi-Yadkori et al., 2011]). Let $\{\varepsilon_i\}_{i=1}^{\infty}$ be a real-valued stochastic process with corresponding filtration $\{\mathcal{F}_i\}_{i=1}^{\infty}$ such that $\varepsilon_i$ is $\mathcal{F}_i$ measurable, $\mathbb{E}[\varepsilon_i \,|\, \mathcal{F}_{i-1}] = 0$, and $\varepsilon_i$ is conditionally $\sigma$-sub-Gaussian with $\sigma \in \mathbb{R}^{+}$. Let $\{X_i\}_{i=1}^{\infty}$ be a stochastic process with $X_i \in \mathcal{H}$ (some Hilbert space) and $X_i$ being $\mathcal{F}_{i-1}$ measurable. Assume that a linear operator $\Sigma_0 : \mathcal{H} \to \mathcal{H}$ is positive definite, i.e., $x^{\top} \Sigma_0 x > 0$ for any $x \in \mathcal{H}$. For any $t$, define the linear operator $\Sigma_t = \Sigma_0 + \sum_{i=1}^{t} X_i X_i^{\top}$ (here $x x^{\top}$ denotes the outer product in $\mathcal{H}$). With probability at least $1 - \delta$, we have for all $t \geq 1$:
\[
\bigg\| \sum_{i=1}^{t} X_i \varepsilon_i \bigg\|_{\Sigma_t^{-1}}^{2} \leq \sigma^2 \log\bigg( \frac{\det(\Sigma_t) \det(\Sigma_0)^{-1}}{\delta^2} \bigg).
\]
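As an illustrative special case (not part of the lemma as stated), suppose $\mathcal{H} = \mathbb{R}^d$, $\Sigma_0 = \lambda I$ for some $\lambda > 0$, and $\|X_i\|_2 \leq B$ for all $i$, where $\lambda$ and $B$ are placeholder constants for this sketch. Then $\mathrm{tr}(\Sigma_t) \leq d\lambda + tB^2$, and the AM-GM inequality gives $\det(\Sigma_t) \leq (\mathrm{tr}(\Sigma_t)/d)^d \leq (\lambda + tB^2/d)^d$, while $\det(\Sigma_0) = \lambda^d$. The bound above then specializes to
\[
\bigg\| \sum_{i=1}^{t} X_i \varepsilon_i \bigg\|_{\Sigma_t^{-1}}^{2} \leq \sigma^2 \Big( d \log\big(1 + \tfrac{tB^2}{d\lambda}\big) + 2 \log(1/\delta) \Big),
\]
so the self-normalized norm grows only logarithmically in $t$.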

