Lecture 4: Model-Free Control and Value Function Approximation
Emma Brunskill
Winter 2025
Initialize policy π
Repeat:
Policy evaluation: compute Q^π
Policy improvement: update π given Q^π
May need to modify policy evaluation:
If π is deterministic, can't compute Q(s, a) for any a ≠ π(s)
How to interleave policy evaluation and improvement?
Policy improvement is now using an estimated Q
Computational complexity?
Converge to the optimal Q* function?
Empirical performance?
Definition of GLIE (Greedy in the Limit with Infinite Exploration)
All state-action pairs are visited an infinite number of times:
lim_{i→∞} N_i(s, a) → ∞
The behavior policy converges to the greedy policy:
lim_{i→∞} π_i(a|s) = 1(a = arg max_{a′} Q_i(s, a′))
Theorem
GLIE Monte-Carlo control converges to the optimal state-action value
function: Q(s, a) → Q*(s, a)
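A minimal sketch of GLIE Monte Carlo control for a small tabular problem, assuming a hypothetical environment interface env.reset() -> s and env.step(a) -> (s_next, r, done) with hashable states and actions 0..nA-1; the schedule ε_k = 1/k and the 1/N(s, a) incremental-mean step size are one choice that satisfies GLIE.

import random
from collections import defaultdict

def glie_mc_control(env, nA, num_episodes=10_000, gamma=1.0):
    Q = defaultdict(lambda: [0.0] * nA)   # Q(s, a) estimates
    N = defaultdict(lambda: [0] * nA)     # visit counts N(s, a)
    for k in range(1, num_episodes + 1):
        eps = 1.0 / k                     # GLIE: exploration decays to zero
        # Roll out one episode with the current epsilon-greedy policy.
        episode, s, done = [], env.reset(), False
        while not done:
            if random.random() < eps:
                a = random.randrange(nA)
            else:
                a = max(range(nA), key=lambda a_: Q[s][a_])
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next
        # Returns G_t for every step, computed backwards.
        G, returns = 0.0, []
        for (_, _, r) in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        # First-visit incremental-mean update of Q.
        seen = set()
        for (s, a, _), G_t in zip(episode, returns):
            if (s, a) in seen:
                continue
            seen.add((s, a))
            N[s][a] += 1
            Q[s][a] += (G_t - Q[s][a]) / N[s][a]
    return Q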
Initialize policy π
Repeat:
Policy evaluation: compute Q^π using temporal difference updating with an ε-greedy policy
Policy improvement: same as Monte Carlo policy improvement, set π to ε-greedy(Q^π)
Method 1: SARSA
On-policy: SARSA computes an estimate Q of the policy used to act
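A minimal sketch of the tabular SARSA algorithm with an ε-greedy behavior policy, under the same assumed environment interface as the Monte Carlo sketch above; α, γ, and ε are illustrative constants.

import random
from collections import defaultdict

def epsilon_greedy(Q, s, nA, eps):
    # Pick a random action with probability eps, else the greedy action.
    if random.random() < eps:
        return random.randrange(nA)
    return max(range(nA), key=lambda a: Q[s][a])

def sarsa(env, nA, num_episodes=10_000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(lambda: [0.0] * nA)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        a = epsilon_greedy(Q, s, nA, eps)
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, nA, eps)
            # On-policy target: bootstrap from the action actually taken next.
            target = r if done else r + gamma * Q[s_next][a_next]
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s_next, a_next
    return Q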
Computational complexity?
Converge to the optimal Q* function? Recall:
Q(s_t, a_t) ← Q(s_t, a_t) + α(r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t))
π(s_t) = arg max_a Q(s_t, a) with probability 1 − ε, else random
Q is an estimate of the performance of a policy that may be changing at each time step
Empirical performance?
Theorem
SARSA for finite-state and finite-action MDPs converges to the optimal
action-value function, Q(s, a) → Q*(s, a), under the following conditions:
1. The policy sequence π_t(a|s) satisfies the condition of GLIE
2. The step sizes α_t satisfy the Robbins-Monro conditions:
   ∑_{t=1}^∞ α_t = ∞ and ∑_{t=1}^∞ α_t² < ∞
For example, α_t = 1/t satisfies these conditions.
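As a quick check of the example: ∑_{t=1}^∞ 1/t diverges (the harmonic series), while ∑_{t=1}^∞ 1/t² = π²/6 is finite, so both Robbins-Monro conditions hold.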
On-policy learning
Direct experience
Learn to estimate and evaluate a policy from experience obtained from
following that policy
Off-policy learning
Learn to estimate and evaluate a policy using experience gathered from
following a different policy
Q-learning:
See optional worked example and optional understanding check at the end
of the slides
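For reference, a minimal sketch of tabular Q-learning, the off-policy method referenced above: the behavior policy is ε-greedy, but the TD target bootstraps from the greedy (max) next action rather than the action actually taken. Same assumed environment interface as the earlier sketches.

import random
from collections import defaultdict

def q_learning(env, nA, num_episodes=10_000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(lambda: [0.0] * nA)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            # Behave epsilon-greedily (exploration policy)...
            if random.random() < eps:
                a = random.randrange(nA)
            else:
                a = max(range(nA), key=lambda a_: Q[s][a_])
            s_next, r, done = env.step(a)
            # ...but bootstrap from the greedy next action (target policy).
            target = r if done else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q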
Avoid explicitly storing or learning the following for every single state
and action
Dynamics or reward model
Value
State-action value
Policy
Want a more compact representation that generalizes across states, or across states and actions
Reduce memory needed to store (P, R) / V / Q / π
Reduce computation needed to compute (P, R) / V / Q / π
Reduce experience needed to find a good (P, R) / V / Q / π
First assume we could query any state s and action a and an oracle would return the true value for Q^π(s, a)
Similar to supervised learning: assume given ((s, a), Q^π(s, a)) pairs
The objective is to find the best approximate representation of Q^π given a particular parameterized function Q̂(s, a; w)
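One concrete choice is a linear approximator over a hand-designed feature map x(s, a); the sketch below, with hypothetical names x, q_hat, and mse_gradient_step, shows the representation and one stochastic gradient step on the squared error against an oracle target.

import numpy as np

def q_hat(x, s, a, w):
    # Q_hat(s, a; w) = x(s, a) . w  (linear in the weights)
    return float(np.dot(x(s, a), w))

def mse_gradient_step(x, s, a, q_target, w, alpha=0.01):
    # One SGD step on (q_target - Q_hat(s, a; w))^2; the gradient of a
    # linear Q_hat with respect to w is simply the feature vector x(s, a).
    features = x(s, a)
    error = q_target - np.dot(features, w)
    return w + alpha * error * features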
1: Initialize w, k = 1
2: loop
3:   Sample k-th episode (s_{k,1}, a_{k,1}, r_{k,1}, s_{k,2}, . . . , s_{k,L_k}) given π
4:   for t = 1, . . . , L_k do
5:     if first visit to (s, a) in episode k then
6:       G_t(s, a) = ∑_{j=t}^{L_k} r_{k,j}
7:       ∇_w J(w) = −2 [G_t(s, a) − Q̂(s_t, a_t; w)] ∇_w Q̂(s_t, a_t; w)   (compute gradient)
8:       Update weights ∆w
9:     end if
10:  end for
11:  k = k + 1
12: end loop
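A sketch of this Monte Carlo loop with the linear Q̂ from the earlier sketch; the policy pi, feature map x, feature dimension d, and step size alpha are assumptions, and the undiscounted return matches line 6 above.

import numpy as np

def mc_linear_q_evaluation(env, pi, x, d, num_episodes=1000, alpha=0.01):
    w = np.zeros(d)
    for _ in range(num_episodes):
        # Sample one episode under pi (line 3 of the pseudocode).
        episode, s, done = [], env.reset(), False
        while not done:
            a = pi(s)
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next
        # Undiscounted returns G_t (line 6), computed backwards.
        G, returns = 0.0, []
        for (_, _, r) in reversed(episode):
            G = r + G
            returns.append(G)
        returns.reverse()
        # One gradient step per first visit of (s, a) (lines 5-8).
        seen = set()
        for (s, a, _), G_t in zip(episode, returns):
            if (s, a) in seen:
                continue
            seen.add((s, a))
            feats = x(s, a)
            w += alpha * (G_t - np.dot(feats, w)) * feats
    return w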
1: Initialize w, s
2: loop
3:   Given s, sample a ∼ π(s), r(s, a), s′ ∼ p(s′|s, a)
4:   ∇_w J(w) = −2 [r + γ V̂(s′; w) − V̂(s; w)] ∇_w V̂(s; w)
5:   Update weights ∆w
6:   if s′ is not a terminal state then
7:     Set s = s′
8:   else
9:     Restart episode, sample initial state s
10:  end if
11: end loop
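A sketch of this TD(0) loop with a linear V̂(s; w) = φ(s)·w; phi, pi, d, and the step count are assumptions, and the bootstrapped target is held fixed when differentiating (semi-gradient), matching line 4.

import numpy as np

def td0_linear_v_evaluation(env, pi, phi, d, num_steps=100_000,
                            alpha=0.01, gamma=0.99):
    w = np.zeros(d)
    s = env.reset()
    for _ in range(num_steps):
        a = pi(s)                                 # line 3: a ~ pi(s)
        s_next, r, done = env.step(a)
        v_s = np.dot(phi(s), w)
        v_next = 0.0 if done else np.dot(phi(s_next), w)
        # Semi-gradient TD(0): the bootstrapped target is treated as a constant.
        td_error = r + gamma * v_next - v_s
        w += alpha * td_error * phi(s)            # line 5: update weights
        s = env.reset() if done else s_next       # lines 6-10
    return w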
Q̂^π(s, a; w) ≈ Q^π(s, a)
Minimize the mean-squared error between the true action-value function Q^π(s, a) and the approximate action-value function:
J(w) = E_π[(Q^π(s, a) − Q̂(s, a; w))²]
∆w = α (r + γ max_{a′} Q̂(s′, a′; w) − Q̂(s, a; w)) ∇_w Q̂(s, a; w)
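A sketch of one such update with a linear Q̂(s, a; w) = x(s, a)·w; the function name, feature map x, and hyperparameters are illustrative, and the max over next actions is treated as a constant when differentiating (semi-gradient), as in the ∆w rule above.

import numpy as np

def q_learning_vfa_step(x, nA, s, a, r, s_next, done, w,
                        alpha=0.01, gamma=0.99):
    # Linear Q_hat(s, a; w) = x(s, a) . w
    q_sa = np.dot(x(s, a), w)
    q_next = 0.0 if done else max(np.dot(x(s_next, a2), w) for a2 in range(nA))
    # Semi-gradient: the max target is treated as a constant when
    # differentiating, so grad_w Q_hat(s, a; w) = x(s, a).
    return w + alpha * (r + gamma * q_next - q_sa) * x(s, a)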
To help improve stability, fix the target weights used in the target
calculation for multiple updates
Target network uses a different set of weights than the weights being
updated
Let parameters w⁻ be the set of weights used in the target, and w be the weights that are being updated
Slight change to computation of target value:
(s, a, r, s′) ∼ D: sample an experience tuple from the dataset
Compute the target value for the sampled s: r + γ max_{a′} Q̂(s′, a′; w⁻)
Use stochastic gradient descent to update the network weights
∆w = α (r + γ max_{a′} Q̂(s′, a′; w⁻) − Q̂(s, a; w)) ∇_w Q̂(s, a; w)
These choices introduce additional hyperparameters: the learning rate, and how often to update the target network. Often a fixed-size replay buffer is used for experience replay, which introduces a parameter to control its size, and the need to decide how to populate it.
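A simplified sketch of these pieces together, experience replay plus a periodically copied target network, using a linear Q̂ as a stand-in for the deep network so the mechanism stays visible; the class name, buffer size, batch size, and target-update period are illustrative assumptions, not the DQN paper's settings.

import random
from collections import deque
import numpy as np

class ReplayDQNSketch:
    def __init__(self, x, nA, d, alpha=0.01, gamma=0.99,
                 buffer_size=10_000, batch_size=32, target_period=500):
        self.x, self.nA = x, nA
        self.w = np.zeros(d)            # online weights, updated every step
        self.w_target = self.w.copy()   # target weights w-, updated rarely
        self.buffer = deque(maxlen=buffer_size)
        self.alpha, self.gamma = alpha, gamma
        self.batch_size, self.target_period = batch_size, target_period
        self.updates = 0

    def store(self, s, a, r, s_next, done):
        # (s, a, r, s') ~ D: tuples are sampled from this dataset later.
        self.buffer.append((s, a, r, s_next, done))

    def update(self):
        if len(self.buffer) < self.batch_size:
            return
        for s, a, r, s_next, done in random.sample(list(self.buffer),
                                                   self.batch_size):
            # Target uses w-: r + gamma * max_a' Q_hat(s', a'; w-)
            q_next = 0.0 if done else max(
                np.dot(self.x(s_next, a2), self.w_target)
                for a2 in range(self.nA))
            feats = self.x(s, a)
            td_error = r + self.gamma * q_next - np.dot(feats, self.w)
            self.w += self.alpha * td_error * feats
        self.updates += 1
        if self.updates % self.target_period == 0:
            self.w_target = self.w.copy()   # periodic hard copy of the weights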
In DQN we compute the target value for the sampled (s, a, r, s′) using a separate set of target weights: r + γ max_{a′} Q̂(s′, a′; w⁻)
Select all that are true
This doubles the computation time compared to a method that does
not have a separate set of weights
This doubles the memory requirements compared to a method that
does not have a separate set of weights
Not sure
Game             Linear   Deep Network
Breakout              3              3
Enduro               62             29
River Raid         2345           1453
Seaquest            656            275
Space Invaders      301            302
Game             Linear   Deep Network   DQN w/ fixed Q
Breakout              3              3               10
Enduro               62             29              141
River Raid         2345           1453             2868
Seaquest            656            275             1003
Space Invaders      301            302              373
Theorem
For any ε-greedy policy π_i, the ε-greedy policy w.r.t. Q^{π_i}, π_{i+1}, is a monotonic improvement: V^{π_{i+1}} ≥ V^{π_i}