
Lecture 4: Model Free Control and Function

Approximation

Emma Brunskill

CS234 Reinforcement Learning.

Winter 2025

Structure and content drawn in part from David Silver’s Lecture 5 and Lecture 6. For additional reading please see SB Sections 5.2-5.4, 6.4, 6.5, 6.7.

Check Your Understanding L4N1: Model-free Generalized
Policy Improvement

Consider policy iteration


Repeat:
Policy evaluation: compute Q π
Policy improvement πi+1 (s) = arg maxa Q πi (s, a)
Question: is this πi+1 deterministic or stochastic? Assume for each
state s there is a unique maxa Q πi (s, a).
Answer: Deterministic, Stochastic, Not Sure
Now consider evaluating this new policy πi+1 . Recall that in
model-free policy evaluation, we estimated V π using π to generate
new trajectories
Question: Can we compute Q πi+1 (s, a) ∀s, a by using this πi+1 to
generate new trajectories?

Check Your Understanding L4N1: Model-free Generalized
Policy Improvement
Consider policy iteration
Repeat:
Policy evaluation: compute Q π
Policy improvement πi+1 (s) = arg maxa Q πi (s, a)
Question: is this πi+1 deterministic or stochastic? Assume for each
state s there is a unique maxa Q πi (s, a).

Now consider evaluating this new policy πi+1 . Recall that in
model-free policy evaluation, we estimated V π using π to generate
new trajectories
Question: Can we compute Q πi+1 (s, a) ∀s, a by using this πi+1 to
generate new trajectories?

Class Structure

Last time: Policy evaluation with no knowledge of how the world


works (MDP model not given)
Control (making decisions) without a model of how the world works
Generalization – Value function approximation

Today’s Lecture

Generalized Policy Improvement


Monte-Carlo Control with Tabular Representations
Greedy in the Limit of Infinite Exploration
Temporal Difference Methods for Control

1 Model Free Value Function Approximation


Policy Evaluation
Monte Carlo Policy Evaluation
Temporal Difference TD(0) Policy Evaluation

2 Control using Value Function Approximation


Control using General Value Function Approximators
Deep Q-Learning

Table of Contents

Generalized Policy Improvement


Monte-Carlo Control with Tabular Representations
Greedy in the Limit of Infinite Exploration
Temporal Difference Methods for Control
Policy Evaluation
Monte Carlo Policy Evaluation
Temporal Difference TD(0) Policy Evaluation
Control using General Value Function Approximators
Deep Q-Learning

Model-free Policy Iteration

Initialize policy π
Repeat:
Policy evaluation: compute Q π
Policy improvement: update π given Q π
May need to modify policy evaluation:
If π is deterministic, can’t compute Q(s, a) for any a ̸= π(s)
How to interleave policy evaluation and improvement?
Policy improvement is now using an estimated Q

The Problem of Exploration

Goal: Learn to select actions to maximize total expected future reward


Problem: Can’t learn about actions without trying them (need to
explore)
Problem: But if we try new actions, we spend less time taking the actions
that our past experience suggests will yield high reward (need to
exploit knowledge of the domain to achieve high rewards)
ϵ-greedy Policies

Simple idea to balance exploration and achieving rewards


Let |A| be the number of actions
Then an ϵ-greedy policy w.r.t. a state-action value Q(s, a) is
π(a|s) = 1 − ϵ + ϵ/|A|   if a = arg maxa′ Q(s, a′ )
π(a|s) = ϵ/|A|           otherwise

In words: select argmax action with probability 1 − ϵ, else select


action uniformly at random
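
As a concrete illustration, here is a minimal Python sketch of ϵ-greedy action selection over a tabular Q; the array layout, function name, and random-number handling are illustrative assumptions, not part of the slides.

import numpy as np

def epsilon_greedy_action(Q, state, epsilon, rng=None):
    """Sample an action from an epsilon-greedy policy w.r.t. Q.

    Q is assumed to be an array of shape (num_states, num_actions).
    The greedy action ends up with total probability 1 - epsilon + epsilon/|A|,
    every other action with probability epsilon/|A|.
    """
    rng = rng or np.random.default_rng()
    num_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(num_actions))   # explore: uniform over all actions
    return int(np.argmax(Q[state]))             # exploit: greedy action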

Policy Improvement with ϵ-greedy policies

Recall we proved that policy iteration, using given dynamics and
reward models, is guaranteed to monotonically improve
That proof assumed policy improvement outputs a deterministic policy
The same property holds for ϵ-greedy policies

Monotonic ϵ-greedy Policy Improvement
Theorem
For any ϵ-greedy policy πi , the ϵ-greedy policy w.r.t. Q πi , πi+1 is a
monotonic improvement V πi+1 ≥ V πi
Q πi (s, πi+1 (s)) = Σa∈A πi+1 (a|s) Q πi (s, a)
                  = (ϵ/|A|) Σa∈A Q πi (s, a) + (1 − ϵ) maxa Q πi (s, a)
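
A hedged sketch of how the standard argument concludes (the reweighting step below is not shown on the slide; it uses that the weights (πi (a|s) − ϵ/|A|)/(1 − ϵ) are non-negative and sum to one because πi is ϵ-greedy):

\max_a Q^{\pi_i}(s,a) \;\ge\; \sum_{a \in A} \frac{\pi_i(a\mid s) - \epsilon/|A|}{1-\epsilon}\, Q^{\pi_i}(s,a)
\;\;\Longrightarrow\;\;
Q^{\pi_i}(s,\pi_{i+1}(s)) \;\ge\; \sum_{a \in A} \pi_i(a\mid s)\, Q^{\pi_i}(s,a) \;=\; V^{\pi_i}(s)

Then V πi+1 ≥ V πi follows from the policy improvement theorem.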

Today: Model-free Control

Generalized policy improvement


Importance of exploration
Monte Carlo control
Model-free control with temporal difference (SARSA, Q-learning)

Table of Contents

Generalized Policy Improvement


Monte-Carlo Control with Tabular Representations
Greedy in the Limit of Infinite Exploration
Temporal Difference Methods for Control
Policy Evaluation
Monte Carlo Policy Evaluation
Temporal Difference TD(0) Policy Evaluation
Control using General Value Function Approximators
Deep Q-Learning

Recall Monte Carlo Policy Evaluation, Now for Q

1: Initialize Q(s, a) = 0, N(s, a) = 0 ∀(s, a), k = 1; input ϵ = 1, π
2: loop
3:   Sample k-th episode (sk,1 , ak,1 , rk,1 , sk,2 , . . . , sk,Tk ) given π
4:   Compute Gk,t = rk,t + γrk,t+1 + γ 2 rk,t+2 + · · · + γ Tk −t rk,Tk ∀t
5:   for t = 1, . . . , Tk do
6:     if first visit to (s, a) in episode k then
7:       N(s, a) = N(s, a) + 1
8:       Q(st , at ) = Q(st , at ) + (1/N(s, a)) (Gk,t − Q(st , at ))
9:     end if
10:  end for
11:  k = k + 1
12: end loop

Monte Carlo Online Control / On Policy Improvement

1: Initialize Q(s, a) = 0, N(s, a) = 0 ∀(s, a), set ϵ = 1, k = 1
2: πk = ϵ-greedy(Q) // Create initial ϵ-greedy policy
3: loop
4:   Sample k-th episode (sk,1 , ak,1 , rk,1 , sk,2 , . . . , sk,Tk ) given πk
5:   Compute Gk,t = rk,t + γrk,t+1 + γ 2 rk,t+2 + · · · + γ Tk −t rk,Tk ∀t
6:   for t = 1, . . . , Tk do
7:     if first visit to (s, a) in episode k then
8:       N(s, a) = N(s, a) + 1
9:       Q(st , at ) = Q(st , at ) + (1/N(s, a)) (Gk,t − Q(st , at ))
10:    end if
11:  end for
12:  k = k + 1, ϵ = 1/k
13:  πk = ϵ-greedy(Q) // Policy improvement
14: end loop
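
A minimal Python sketch of this GLIE Monte-Carlo control loop, assuming a small finite MDP exposed through a hypothetical env with reset()/step(a) methods and reusing the epsilon_greedy_action helper above; it illustrates the update rule rather than being a definitive implementation.

import numpy as np

def mc_glie_control(env, num_states, num_actions, num_episodes=1000, gamma=1.0):
    """First-visit Monte-Carlo control with a GLIE epsilon_k = 1/k schedule."""
    Q = np.zeros((num_states, num_actions))
    N = np.zeros((num_states, num_actions))
    rng = np.random.default_rng()
    for k in range(1, num_episodes + 1):
        epsilon = 1.0 / k                             # policy improvement: eps-greedy(Q) with eps = 1/k
        episode, s, done = [], env.reset(), False
        while not done:                               # roll out one episode under pi_k
            a = epsilon_greedy_action(Q, s, epsilon, rng)
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next
        G, first_visit_return = 0.0, {}
        for s, a, r in reversed(episode):             # accumulate returns backwards
            G = r + gamma * G
            first_visit_return[(s, a)] = G            # earliest visit overwrites last => first-visit return
        for (s, a), G in first_visit_return.items():  # incremental first-visit MC update
            N[s, a] += 1
            Q[s, a] += (G - Q[s, a]) / N[s, a]
    return Q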

Optional Worked Example: MC for On Policy Control
Mars rover with new actions:
r (−, a1 ) = [ 1 0 0 0 0 0 +10], r (−, a2 ) = [ 0 0 0 0 0 0 +5], γ = 1.
Assume current greedy π(s) = a1 ∀s, ϵ=.5. Q(s, a) = 0 for all (s, a)
Sample trajectory from ϵ-greedy policy
Trajectory = (s3 , a1 , 0, s2 , a2 , 0, s3 , a1 , 0, s2 , a2 , 0, s1 , a1 , 1, terminal)
First visit MC estimate of Q of each (s, a) pair after this trajectory? (Select all)
Q ϵ−π (−, a1 ) = [1 0 1 0 0 0 0]
Q ϵ−π (−, a2 ) = [0 0 0 0 0 0 0]
The new greedy policy would be: π = [1 tie 1 tie tie tie tie]
The new greedy policy would be: π = [1 2 1 tie tie tie tie]
If ϵ = 1/3, prob of selecting a1 in s1 in the new ϵ-greedy policy is 1/9.
If ϵ = 1/3, prob of selecting a1 in s1 in the new ϵ-greedy policy is 2/3.
If ϵ = 1/3, prob of selecting a1 in s1 in the new ϵ-greedy policy is 5/6.
Not sure
Properties of MC control with ϵ-greedy policies

Computational complexity?
Converge to optimal Q ∗ function?
Empirical performance?

L4N2 Check Your Understanding: Monte Carlo Online
Control / On Policy Improvement

1: Initialize Q(s, a) = 0, N(s, a) = 0 ∀(s, a), set ϵ = 1, k = 1
2: πk = ϵ-greedy(Q) // Create initial ϵ-greedy policy
3: loop
4:   Sample k-th episode (sk,1 , ak,1 , rk,1 , sk,2 , . . . , sk,Tk ) given πk
5:   Compute Gk,t = rk,t + γrk,t+1 + γ 2 rk,t+2 + · · · + γ Tk −t rk,Tk ∀t
6:   for t = 1, . . . , Tk do
7:     if first visit to (s, a) in episode k then
8:       N(s, a) = N(s, a) + 1
9:       Q(st , at ) = Q(st , at ) + (1/N(s, a)) (Gk,t − Q(st , at ))
10:    end if
11:  end for
12:  k = k + 1, ϵ = 1/k
13:  πk = ϵ-greedy(Q) // Policy improvement
14: end loop

Is Q an estimate of Q πk ? When might this procedure fail to compute


the optimal Q ∗ ?
Table of Contents

Generalized Policy Improvement


Monte-Carlo Control with Tabular Representations
Greedy in the Limit of Infinite Exploration
Temporal Difference Methods for Control
Policy Evaluation
Monte Carlo Policy Evaluation
Temporal Difference TD(0) Policy Evaluation
Control using General Value Function Approximators
Deep Q-Learning

Greedy in the Limit of Infinite Exploration (GLIE)

Definition of GLIE
All state-action pairs are visited an infinite number of times

limi→∞ Ni (s, a) → ∞

Behavior policy (policy used to act in the world) converges to greedy


policy

Greedy in the Limit of Infinite Exploration (GLIE)

Definition of GLIE
All state-action pairs are visited an infinite number of times

limi→∞ Ni (s, a) → ∞

Behavior policy (policy used to act in the world) converges to greedy


policy

A simple GLIE strategy is ϵ-greedy where ϵ is reduced to 0 with the


following rate: ϵi = 1/i

GLIE Monte-Carlo Control using Tabular Representations

Theorem
GLIE Monte-Carlo control converges to the optimal state-action value
function Q(s, a) → Q ∗ (s, a)

Table of Contents

Generalized Policy Improvement


Monte-Carlo Control with Tabular Representations
Greedy in the Limit of Infinite Exploration
Temporal Difference Methods for Control
Policy Evaluation
Monte Carlo Policy Evaluation
Temporal Difference TD(0) Policy Evaluation
Control using General Value Function Approximators
Deep Q-Learning

Model-free Policy Iteration with TD Methods

Initialize policy π
Repeat:
Policy evaluation: compute Q π using temporal difference updating
with ϵ-greedy policy
Policy improvement: Same as Monte Carlo policy improvement, set π
to ϵ-greedy(Q π )
Method 1: SARSA
On policy: SARSA computes an estimate Q of policy used to act

General Form of SARSA Algorithm

1: Set initial ϵ-greedy policy π randomly, t = 0, initial state st = s0


2: Take at ∼ π(st )
3: Observe (rt , st+1 )
4: loop
5: Take action at+1 ∼ π(st+1 ) // Sample action from policy
6: Observe (rt+1 , st+2 )
7: Update Q given (st , at , rt , st+1 , at+1 ):

8: Perform policy improvement:

9: t =t +1
10: end loop

General Form of SARSA Algorithm

1: Set initial ϵ-greedy policy π, t = 0, initial state st = s0


2: Take at ∼ π(st ) // Sample action from policy
3: Observe (rt , st+1 )
4: loop
5: Take action at+1 ∼ π(st+1 )
6: Observe (rt+1 , st+2 )
7: Q(st , at ) ← Q(st , at ) + α(rt + γQ(st+1 , at+1 ) − Q(st , at ))
8: π(st ) = arg maxa Q(st , a) w.prob 1 − ϵ, else random
9: t =t +1
10: end loop

See worked example with Mars rover at end of slides
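
A short Python sketch of the same on-policy SARSA update for a tabular Q, again assuming a hypothetical env with reset()/step(a) and the epsilon_greedy_action helper above; the step size and ϵ are left as plain parameters rather than schedules.

import numpy as np

def sarsa(env, num_states, num_actions, num_episodes=500,
          alpha=0.5, gamma=1.0, epsilon=0.1):
    Q = np.zeros((num_states, num_actions))
    rng = np.random.default_rng()
    for _ in range(num_episodes):
        s, done = env.reset(), False
        a = epsilon_greedy_action(Q, s, epsilon, rng)
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy_action(Q, s_next, epsilon, rng)
            # On-policy TD target: bootstrap from the action the policy will actually take
            target = r if done else r + gamma * Q[s_next, a_next]
            Q[s, a] += alpha * (target - Q[s, a])
            # Acting eps-greedily w.r.t. the updated Q is the policy improvement step
            s, a = s_next, a_next
    return Q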

Properties of SARSA with ϵ-greedy policies

Computational complexity?
Converge to optimal Q ∗ function? Recall:
Q(st , at ) ← Q(st , at ) + α(rt + γQ(st+1 , at+1 ) − Q(st , at ))
π(st ) = arg maxa Q(st , a) w.prob 1 − ϵ, else random
Q is an estimate of the performance of a policy that may be changing
at each time step
Empirical performance?

Convergence Properties of SARSA

Theorem
SARSA for finite-state and finite-action MDPs converges to the optimal
action-value, Q(s, a) → Q ∗ (s, a), under the following conditions:
1 The policy sequence πt (a|s) satisfies the condition of GLIE
2 The step-sizes αt satisfy the Robbins-Monro conditions:

Σ_{t=1}^∞ αt = ∞,   Σ_{t=1}^∞ αt2 < ∞

For example, αt = 1/t satisfies the above conditions.
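
To see why the 1/t schedule qualifies (standard facts about the harmonic series, stated here for completeness):

\sum_{t=1}^{\infty} \frac{1}{t} = \infty, \qquad \sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6} < \infty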

Properties of SARSA with ϵ-greedy policies

Result builds on stochastic approximation


Relies on step sizes decreasing at the right rate
Relies on Bellman backup contraction property
Relies on bounded rewards and value function

On and Off-Policy Learning

On-policy learning
Direct experience
Learn to estimate and evaluate a policy from experience obtained from
following that policy
Off-policy learning
Learn to estimate and evaluate a policy using experience gathered from
following a different policy

Q-Learning: Learning the Optimal State-Action Value

SARSA is an on-policy learning algorithm


SARSA estimates the value of the current behavior policy (the policy
being used to take actions in the world)
And then updates that (behavior) policy
Alternatively, can we directly estimate the value of π ∗ while acting
with another behavior policy πb ?
Yes! Q-learning, an off-policy RL algorithm

Q-Learning: Learning the Optimal State-Action Value

SARSA is an on-policy learning algorithm


Estimates the value of the behavior policy (the policy being used to take
actions in the world)
And then updates the behavior policy
Q-learning
estimate the Q value of π ∗ while acting with another behavior policy πb
Key idea: Maintain Q estimates and bootstrap for best future value
Recall SARSA

Q(st , at ) ← Q(st , at ) + α((rt + γQ(st+1 , at+1 )) − Q(st , at ))

Q-learning:

Q(st , at ) ← Q(st , at ) + α(rt + γ maxa′ Q(st+1 , a′ ) − Q(st , at ))

Q-Learning with ϵ-greedy Exploration

1: Initialize Q(s, a),∀s ∈ S, a ∈ A t = 0, initial state st = s0


2: Set πb to be ϵ-greedy w.r.t. Q
3: loop
4: Take at ∼ πb (st ) // Sample action from policy
5: Observe (rt , st+1 )
6: Q(st , at ) ← Q(st , at ) + α(rt + γ maxa Q(st+1 , a) − Q(st , at ))
7: π(st ) = arg maxa Q(st , a) w.prob 1 − ϵ, else random
8: t =t +1
9: end loop

See optional worked example and optional understanding check at the end
of the slides
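
The corresponding Python sketch differs from the SARSA sketch above only in the target: it bootstraps from maxa Q(s′, a) rather than from the action actually taken next (same hypothetical env interface and helper as before).

import numpy as np

def q_learning(env, num_states, num_actions, num_episodes=500,
               alpha=0.5, gamma=1.0, epsilon=0.1):
    Q = np.zeros((num_states, num_actions))
    rng = np.random.default_rng()
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy_action(Q, s, epsilon, rng)     # behavior policy pi_b
            s_next, r, done = env.step(a)
            # Off-policy target: bootstrap from the greedy action at s', not the one pi_b takes
            target = r if done else r + gamma * np.max(Q[s_next])
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q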

Q-Learning with ϵ-greedy Exploration

What conditions are sufficient to ensure that Q-learning with ϵ-greedy


exploration converges to optimal Q ∗ ?

What conditions are sufficient to ensure that Q-learning with ϵ-greedy


exploration converges to optimal π ∗ ?

Table of Contents

Generalized Policy Improvement


Monte-Carlo Control with Tabular Representations
Greedy in the Limit of Infinite Exploration
Temporal Difference Methods for Control

1 Model Free Value Function Approximation


Policy Evaluation
Monte Carlo Policy Evaluation
Temporal Difference TD(0) Policy Evaluation

2 Control using Value Function Approximation


Control using General Value Function Approximators
Deep Q-Learning

Motivation for Function Approximation

Avoid explicitly storing or learning the following for every single state
and action
Dynamics or reward model
Value
State-action value
Policy
Want more compact representation that generalizes across state or
states and actions
Reduce memory needed to store (P, R)/V /Q/π
Reduce computation needed to compute (P, R)/V /Q/π
Reduce experience needed to find a good (P, R)/V /Q/π

State Action Value Function Approximation for Policy
Evaluation with an Oracle

First assume we could query any state s and action a and an oracle
would return the true value for Q π (s, a)
Similar to supervised learning: assume given ((s, a), Q π (s, a)) pairs
The objective is to find the best approximate representation of Q π
given a particular parameterized function Q̂(s, a; w )

Stochastic Gradient Descent
Goal: Find the parameter vector w that minimizes the loss between a
true value function Q π (s, a) and its approximation Q̂(s, a; w ) as
represented with a particular function class parameterized by w .
Generally use mean squared error and define the loss as

J(w ) = Eπ [(Q π (s, a) − Q̂(s, a; w ))2 ]

Can use gradient descent to find a local minimum


∆w = −(1/2) α ∇w J(w )
Stochastic gradient descent (SGD) uses a finite number of (often
one) samples to compute an approximate gradient:

Expected SGD is the same as the full gradient update


Stochastic Gradient Descent
Goal: Find the parameter vector w that minimizes the loss between a
true value function Q π (s, a) and its approximation Q̂(s, a; w ) as
represented with a particular function class parameterized by w .
Generally use mean squared error and define the loss as
J(w ) = Eπ [(Q π (s, a) − Q̂(s, a; w ))2 ]
Can use gradient descent to find a local minimum
∆w = −(1/2) α ∇w J(w )
Stochastic gradient descent (SGD) uses a finite number of (often
one) samples to compute an approximate gradient:
∇w J(w ) = ∇w Eπ [(Q π (s, a) − Q̂(s, a; w ))2 ]
         = −2 Eπ [(Q π (s, a) − Q̂(s, a; w )) ∇w Q̂(s, a; w )]
Expected SGD is the same as the full gradient update
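
As a concrete sketch, with a linear approximator Q̂(s, a; w) = x(s, a)ᵀw the per-sample SGD step looks like the following; the feature function x and the name sgd_step are assumed placeholders, and the target is whatever estimate of Q π (s, a) is available.

import numpy as np

def sgd_step(w, x_sa, q_target, alpha):
    """One SGD step for a linear approximator Q_hat(s, a; w) = x(s, a) . w.

    q_target stands in for the oracle value Q^pi(s, a); later slides replace it
    with a Monte-Carlo return or a TD target.
    """
    q_hat = x_sa @ w
    # For the squared error (q_target - q_hat)^2 the gradient w.r.t. w is
    # -2 (q_target - q_hat) x(s, a); the constant factor is folded into alpha below.
    return w + alpha * (q_target - q_hat) * x_sa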
Table of Contents

Generalized Policy Improvement


Monte-Carlo Control with Tabular Representations
Greedy in the Limit of Infinite Exploration
Temporal Difference Methods for Control

1 Model Free Value Function Approximation


Policy Evaluation
Monte Carlo Policy Evaluation
Temporal Difference TD(0) Policy Evaluation
Control using General Value Function Approximators
Deep Q-Learning

Model Free VFA Policy Evaluation

No oracle to tell true Q π (s, a) for any state s and action a


Use model-free state-action value function approximation

Model Free VFA Prediction / Policy Evaluation

Recall model-free policy evaluation (Lecture 3)


Following a fixed policy π (or had access to prior data)
Goal is to estimate V π and/or Q π
Maintained a lookup table to store estimates V π and/or Q π
Updated these estimates after each episode (Monte Carlo methods)
or after each step (TD methods)
Now: in value function approximation, change the estimate
update step to include fitting the function approximator

Table of Contents

Generalized Policy Improvement


Monte-Carlo Control with Tabular Representations
Greedy in the Limit of Infinite Exploration
Temporal Difference Methods for Control

1 Model Free Value Function Approximation


Policy Evaluation
Monte Carlo Policy Evaluation
Temporal Difference TD(0) Policy Evaluation
Control using General Value Function Approximators
Deep Q-Learning

Monte Carlo Value Function Approximation

Return Gt is an unbiased but noisy sample of the true expected return


Q π (st , at )
Therefore can reduce MC VFA to doing supervised learning on a set
of (state,action,return) pairs:
⟨(s1 , a1 ), G1 ⟩, ⟨(s2 , a2 ), G2 ⟩, . . . , ⟨(sT , aT ), GT ⟩
Substitute Gt for the true Q π (st , at ) when fitting the function approximator

MC Value Function Approximation for Policy Evaluation

1: Initialize w, k = 1
2: loop
3: Sample k-th episode (sk,1 , ak,1 , rk,1 , sk,2 , . . . , sk,Lk ) given π
4: for t = 1, . . . , Lk do
5: if first visit to (s, a) in episode k then
6:   Gt (s, a) = Σ_{j=t}^{Lk} γ j−t rk,j
7:   ∇w J(w ) = −2[Gt (s, a) − Q̂(st , at ; w )]∇w Q̂(st , at ; w ) (compute gradient)
8: Update weights ∆w
9: end if
10: end for
11: k =k +1
12: end loop
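
A sketch of this loop in Python, reusing the sgd_step helper from the earlier sketch and using first-visit returns as regression targets; the featurize function and the list-of-(s, a, r) episode format are assumptions.

import numpy as np

def mc_vfa_policy_evaluation(episodes, featurize, num_features, alpha=0.01, gamma=1.0):
    """episodes: list of trajectories [(s, a, r), ...] collected by following pi."""
    w = np.zeros(num_features)
    for episode in episodes:
        G, first_visit_return = 0.0, {}
        for s, a, r in reversed(episode):
            G = r + gamma * G
            first_visit_return[(s, a)] = G        # keeps the first-visit return per (s, a)
        for (s, a), G in first_visit_return.items():
            w = sgd_step(w, featurize(s, a), G, alpha)   # regress Q_hat(s, a; w) toward G
    return w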

Table of Contents

Generalized Policy Improvement


Monte-Carlo Control with Tabular Representations
Greedy in the Limit of Infinite Exploration
Temporal Difference Methods for Control

1 Model Free Value Function Approximation


Policy Evaluation
Monte Carlo Policy Evaluation
Temporal Difference TD(0) Policy Evaluation
Control using General Value Function Approximators
Deep Q-Learning

Recall: Temporal Difference Learning w/ Lookup Table

Uses bootstrapping and sampling to approximate V π


Updates V π (s) after each transition (s, a, r , s ′ ):

V π (s) = V π (s) + α(r + γV π (s ′ ) − V π (s))

Target is r + γV π (s ′ ), a biased estimate of the true value V π (s)


Represent value for each state with a separate table entry
Note: Unlike MC we will focus on V instead of Q for policy
evaluation here, because there are more ways to create TD targets
from Q values than V values

Temporal Difference TD(0) Learning with Value Function
Approximation

Uses bootstrapping and sampling to approximate true V π


Updates estimate V π (s) after each transition (s, a, r , s ′ ):

V π (s) = V π (s) + α(r + γV π (s ′ ) − V π (s))

Target is r + γV π (s ′ ), a biased estimate of the true value V π (s)


In value function approximation, target is r + γ V̂ π (s ′ ; w ), a biased
and approximated estimate of the true value V π (s)
3 forms of approximation:
1 Sampling
2 Bootstrapping
3 Value function approximation

Temporal Difference TD(0) Learning with Value Function
Approximation

In value function approximation, target is r + γ V̂ π (s ′ ; w ), a biased


and approximated estimate of the true value V π (s)
Can reduce doing TD(0) learning with value function approximation
to supervised learning on a set of data pairs:
⟨s1 , r1 + γ V̂ π (s2 ; w )⟩, ⟨s2 , r2 + γ V̂ (s3 ; w )⟩, . . .
Find weights to minimize mean squared error

J(w ) = Eπ [(rj + γ V̂ π (sj+1 ; w ) − V̂ π (sj ; w ))2 ]

Use stochastic gradient descent, as in MC methods

TD(0) Value Function Approximation for Policy Evaluation

1: Initialize w, s
2: loop
3: Given s sample a ∼ π(s), r (s, a),s ′ ∼ p(s ′ |s, a)
4: ∇w J(w ) = −2[r + γ V̂ (s ′ ; w ) − V̂ (s; w )]∇w V̂ (s; w )
5: Update weights ∆w
6: if s ′ is not a terminal state then
7: Set s = s ′
8: else
9: Restart episode, sample initial state s
10: end if
11: end loop
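
A Python sketch of this TD(0) loop with a linear V̂(s; w) = x(s)ᵀw; note the gradient is taken only through V̂(s; w), not through the bootstrapped target (a semi-gradient update). The env interface, policy, and featurize helper are assumptions.

import numpy as np

def td0_vfa_policy_evaluation(env, policy, featurize, num_features,
                              num_episodes=500, alpha=0.01, gamma=1.0):
    w = np.zeros(num_features)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            x_s = featurize(s)
            v_next = 0.0 if done else featurize(s_next) @ w
            # Semi-gradient TD(0): the target r + gamma * v_next is treated as a constant
            w = w + alpha * (r + gamma * v_next - x_s @ w) * x_s
            s = s_next
    return w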

Table of Contents

Generalized Policy Improvement


Monte-Carlo Control with Tabular Representations
Greedy in the Limit of Infinite Exploration
Temporal Difference Methods for Control

1 Model Free Value Function Approximation


Policy Evaluation
Monte Carlo Policy Evaluation
Temporal Difference TD(0) Policy Evaluation

2 Control using Value Function Approximation


Control using General Value Function Approximators
Deep Q-Learning

Table of Contents

Generalized Policy Improvement


Monte-Carlo Control with Tabular Representations
Greedy in the Limit of Infinite Exploration
Temporal Difference Methods for Control
Policy Evaluation
Monte Carlo Policy Evaluation
Temporal Difference TD(0) Policy Evaluation

2 Control using Value Function Approximation


Control using General Value Function Approximators
Deep Q-Learning

Control using Value Function Approximation

Use value function approximation to represent state-action values


Q̂ π (s, a; w ) ≈ Q π
Interleave
Approximate policy evaluation using value function approximation
Perform ϵ-greedy policy improvement
Can be unstable. Generally involves intersection of the following:
Function approximation
Bootstrapping
Off-policy learning

Action-Value Function Approximation with an Oracle

Q̂ π (s, a; w ) ≈ Q π
Minimize the mean-squared error between the true action-value
function Q π (s, a) and the approximate action-value function:

J(w ) = Eπ [(Q π (s, a) − Q̂ π (s, a; w ))2 ]

Use stochastic gradient descent to find a local minimum

∇w J(w ) = −2 Eπ [(Q π (s, a) − Q̂ π (s, a; w )) ∇w Q̂ π (s, a; w )]

Stochastic gradient descent (SGD) samples the gradient

Incremental Model-Free Control Approaches
Similar to policy evaluation, true state-action value function for a
state is unknown and so substitute a target value for true Q(st , at )

∆w = α(Q(st , at ) − Q̂(st , at ; w ))∇w Q̂(st , at ; w )

In Monte Carlo methods, use a return Gt as a substitute target

∆w = α(Gt − Q̂(st , at ; w ))∇w Q̂(st , at ; w )

SARSA: Use TD target r + γ Q̂(s ′ , a′ ; w ) which leverages the current


function approximation value

∆w = α(r + γ Q̂(s ′ , a′ ; w ) − Q̂(s, a; w ))∇w Q̂(s, a; w )

Q-learning: Uses related TD target r + γ maxa′ Q̂(s ′ , a′ ; w )

∆w = α(r + γ maxa′ Q̂(s ′ , a′ ; w ) − Q̂(s, a; w ))∇w Q̂(s, a; w )
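
Side by side, the three substitute targets plug into the same weight-update form; a short sketch with the linear features from before (the names featurize and actions are illustrative, not from the slides):

import numpy as np

def vfa_control_update(w, x_sa, target, alpha):
    # Shared form of all three updates: w <- w + alpha * (target - Q_hat) * grad_w Q_hat,
    # with Q_hat(s, a; w) = x(s, a) . w so that grad_w Q_hat = x(s, a)
    return w + alpha * (target - x_sa @ w) * x_sa

# Monte Carlo:  target = G_t
# SARSA:        target = r + gamma * (featurize(s_next, a_next) @ w)
# Q-learning:   target = r + gamma * max(featurize(s_next, a) @ w for a in actions)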

“Deadly Triad” which Can Cause Instability

Informally, updates involve doing an (approximate) Bellman backup
followed by trying to best fit the underlying value function to a particular
feature representation
Bellman operators are contractions, but value function approximation
fitting can be an expansion
To learn more, see Baird example in Sutton and Barto 2018
“Deadly Triad” can lead to oscillations or lack of convergence
Bootstrapping
Function Approximation
Off policy learning (e.g. Q-learning)

Table of Contents

Generalized Policy Improvement


Monte-Carlo Control with Tabular Representations
Greedy in the Limit of Infinite Exploration
Temporal Difference Methods for Control
Policy Evaluation
Monte Carlo Policy Evaluation
Temporal Difference TD(0) Policy Evaluation

2 Control using Value Function Approximation


Control using General Value Function Approximators
Deep Q-Learning

Using these ideas to do Deep RL in Atari

Q-Learning with Neural Networks

Q-learning converges to optimal Q ∗ (s, a) using tabular representation


In value function approximation Q-learning minimizes MSE loss by
stochastic gradient descent using a target Q estimate instead of true
Q
But Q-learning with VFA can diverge
Two of the issues causing problems:
Correlations between samples
Non-stationary targets
Deep Q-learning (DQN) addresses these challenges by using
Experience replay
Fixed Q-targets

DQNs: Experience Replay

To help remove correlations, store dataset (called a replay buffer) D


from prior experience

To perform experience replay, repeat the following:


(s, a, r , s ′ ) ∼ D: sample an experience tuple from the dataset
Compute the target value for the sampled s: r + γ maxa′ Q̂(s ′ , a′ ; w )
Use stochastic gradient descent to update the network weights

∆w = α(r + γ maxa′ Q̂(s ′ , a′ ; w ) − Q̂(s, a; w ))∇w Q̂(s, a; w )

DQNs: Experience Replay
To help remove correlations, store dataset D from prior experience

To perform experience replay, repeat the following:


(s, a, r , s ′ ) ∼ D: sample an experience tuple from the dataset
Compute the target value for the sampled s: r + γ maxa′ Q̂(s ′ , a′ ; w )
Use stochastic gradient descent to update the network weights

∆w = α(r + γ maxa′ Q̂(s ′ , a′ ; w ) − Q̂(s, a; w ))∇w Q̂(s, a; w )

The target is treated as a fixed scalar, but the function weights will get
updated on the next round, changing the target value
DQNs: Fixed Q-Targets

To help improve stability, fix the target weights used in the target
calculation for multiple updates
Target network uses a different set of weights than the weights being
updated
Let parameters w − be the set of weights used in the target, and w
be the weights that are being updated
Slight change to computation of target value:
(s, a, r , s ′ ) ∼ D: sample an experience tuple from the dataset
Compute the target value for the sampled s: r + γ maxa′ Q̂(s ′ , a′ ; w − )
Use stochastic gradient descent to update the network weights

∆w = α(r + γ maxa′ Q̂(s ′ , a′ ; w − ) − Q̂(s, a; w ))∇w Q̂(s, a; w )

DQN Pseudocode

1: Input C , α, D = {}, Initialize w , w − = w , t = 0


2: Get initial state s0
3: loop
4: Sample action at given ϵ-greedy policy for current Q̂(st , a; w )
5: Observe reward rt and next state st+1
6: Store transition (st , at , rt , st+1 ) in replay buffer D
7: Sample random minibatch of tuples (si , ai , ri , si+1 ) from D
8: for each sampled tuple (si , ai , ri , si+1 ) in the minibatch do
9: if episode terminated at step i + 1 then
10: yi = ri
11: else
12: yi = ri + γ maxa′ Q̂(si+1 , a′ ; w − )
13: end if
14: Do gradient descent step on (yi − Q̂(si , ai ; w ))2 for parameters w : ∆w = α(yi − Q̂(si , ai ; w ))∇w Q̂(si , ai ; w )
15: end for
16: t =t+1
17: if mod(t,C) == 0 then
18: w− ← w
19: end if
20: end loop
Note there are several hyperparameters and algorithm choices. One needs to choose the neural network architecture, the learning rate, and how often to update the target network. Often a fixed-size replay buffer is used for experience replay, which introduces a parameter to control the size, and the need to decide how to populate it.
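
To make the replay-buffer and fixed-target mechanics concrete, here is a hedged numpy sketch with a linear Q̂ standing in for the deep network; the buffer size, featurize helper, and function names are illustrative assumptions, not the DQN of Mnih et al.

import random
from collections import deque
import numpy as np

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)      # old transitions are evicted automatically
    def add(self, transition):                    # transition = (s, a, r, s_next, done)
        self.buffer.append(transition)
    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def dqn_style_minibatch_update(w, w_target, buffer, featurize, actions,
                               alpha=0.01, gamma=0.99, batch_size=32):
    """One minibatch of Q-learning updates using fixed target weights w_target."""
    for s, a, r, s_next, done in buffer.sample(batch_size):
        if done:
            y = r
        else:
            # Target computed with the frozen weights w_target (fixed Q-target)
            y = r + gamma * max(featurize(s_next, a2) @ w_target for a2 in actions)
        x_sa = featurize(s, a)
        w = w + alpha * (y - x_sa @ w) * x_sa     # gradient step on (y - Q_hat(s,a;w))^2
    return w

# Every C steps the caller copies w into w_target, mirroring the w- <- w step in the pseudocode.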

Check Your Understanding L4N3: Fixed Targets

In DQN we compute the target value for the sampled (s, a, r , s ′ ) using
a separate set of target weights: r + γ maxa′ Q̂(s ′ , a′ ; w − )
Select all that are true
This doubles the computation time compared to a method that does
not have a separate set of weights
This doubles the memory requirements compared to a method that
does not have a separate set of weights
Not sure

Check Your Understanding L4N3: Fixed Targets.
Solutions

In DQN we compute the target value for the sampled (s, a, r , s ′ ) using
a separate set of target weights: r + γ maxa′ Q̂(s ′ , a′ ; w − )
Select all that are true
This doubles the computation time compared to a method that does
not have a separate set of weights
This doubles the memory requirements compared to a method that
does not have a separate set of weights
Not sure

DQNs Summary

DQN uses experience replay and fixed Q-targets


Store transition (st , at , rt+1 , st+1 ) in replay memory D
Sample random mini-batch of transitions (s, a, r , s ′ ) from D
Compute Q-learning targets w.r.t. old, fixed parameters w −
Optimizes MSE between Q-network and Q-learning targets
Uses stochastic gradient descent

DQNs in Atari

End-to-end learning of values Q(s, a) from pixels s


Input state s is stack of raw pixels from last 4 frames
Output is Q(s, a) for 18 joystick/button positions
Reward is change in score for that step
Used a deep neural network with CNN
Network architecture and hyperparameters fixed across all games

DQN

Figure: Human-level control through deep reinforcement learning, Mnih et al, 2015

DQN Results in Atari

Figure: Human-level control through deep reinforcement learning, Mnih et al, 2015
Which Aspects of DQN were Important for Success?

Game             Linear   Deep Network
Breakout              3              3
Enduro               62             29
River Raid         2345           1453
Seaquest            656            275
Space Invaders      301            302

Note: just using a deep NN actually hurt performance sometimes!

Which Aspects of DQN were Important for Success?

Game             Linear   Deep Network   DQN w/ fixed Q
Breakout              3              3               10
Enduro               62             29              141
River Raid         2345           1453             2868
Seaquest            656            275             1003
Space Invaders      301            302              373

Which Aspects of DQN were Important for Success?

Game             Linear   Deep Network   DQN w/ fixed Q   DQN w/ replay   DQN w/ replay and fixed Q
Breakout              3              3               10             241                         317
Enduro               62             29              141             831                        1006
River Raid         2345           1453             2868            4102                        7447
Seaquest            656            275             1003             823                        2894
Space Invaders      301            302              373             826                        1089

Replay is hugely important


Why? Beyond helping with correlation between samples, what does
replaying do?

Deep RL

Success in Atari has led to huge excitement in using deep neural


networks to do value function approximation in RL
Some immediate improvements (many others!)
Double DQN (Deep Reinforcement Learning with Double Q-Learning,
Van Hasselt et al, AAAI 2016)
Prioritized Replay (Prioritized Experience Replay, Schaul et al, ICLR
2016)
Dueling DQN (best paper ICML 2016) (Dueling Network Architectures
for Deep Reinforcement Learning, Wang et al, ICML 2016)

What You Should Understand

Be able to implement TD(0) and MC on policy evaluation


Be able to implement Q-learning and SARSA and MC control
algorithms
List the 3 issues that can cause instability and describe the problems
qualitatively: function approximation, bootstrapping and off-policy
learning
Know some of the key features in DQN that were critical (experience
replay, fixed targets)

Class Structure

Last time and start of this time: Model-free reinforcement learning


with function approximation
Next time: Policy gradients

Monotonic ϵ-greedy Policy Improvement

Theorem
For any ϵ-greedy policy πi , the ϵ-greedy policy w.r.t. Q πi , πi+1 is a
monotonic improvement V πi+1 ≥ V πi

Therefore V πi+1 ≥ V πi (from the policy improvement theorem)

SARSA Initialization Conceptual Question

Mars rover with new actions:


r (−, a1 ) = [ 1 0 0 0 0 0 +10], r (−, a2 ) = [ 0 0 0 0 0 0 +5], γ = 1.
Initialize ϵ = 1/k, k = 1, and α = 0.5, Q(−, a1 ) = r (−, a1 ),
Q(−, a2 ) = r (−, a2 )
SARSA: (s6 , a1 , 0, s7 , a2 , 5, s7 ).
Does how Q is initialized matter (initially? asymptotically?)?

Optional Worked Example: MC for On Policy Control
Solution

Mars rover with new actions:


r (−, a1 ) = [ 1 0 0 0 0 0 +10], r (−, a2 ) = [ 0 0 0 0 0 0 +5], γ = 1.
Assume current greedy π(s) = a1 ∀s, ϵ=.5. Q(s, a) = 0 for all (s, a)
Sample trajectory from ϵ-greedy policy
Trajectory = (s3 , a1 , 0, s2 , a2 , 0, s3 , a1 , 0, s2 , a2 , 0, s1 , a1 , 1, terminal)
First visit MC estimate of Q of each (s, a) pair?
Q ϵ−π (−, a1 ) = [1 0 1 0 0 0 0]
After this trajectory:
Q ϵ−π (−, a2 ) = [0 1 0 0 0 0 0]
The new greedy policy would be: π = [1 2 1 tie tie tie tie]
If ϵ = 1/3, prob of selecting a1 in s1 in the new ϵ-greedy policy is 5/6.
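The 5/6 in the last line comes directly from the ϵ-greedy definition with |A| = 2 actions and a1 greedy at s1 :
π(a1 |s1 ) = 1 − ϵ + ϵ/|A| = 2/3 + 1/6 = 5/6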

Optional Worked Example SARSA for Mars Rover

1: Set initial ϵ-greedy policy π, t = 0, initial state st = s0


2: Take at ∼ π(st ) // Sample action from policy
3: Observe (rt , st+1 )
4: loop
5: Take action at+1 ∼ π(st+1 )
6: Observe (rt+1 , st+2 )
7: Q(st , at ) ← Q(st , at ) + α(rt + γQ(st+1 , at+1 ) − Q(st , at ))
8: π(st ) = arg maxa Q(st , a) w.prob 1 − ϵ, else random
9: t =t +1
10: end loop
Initialize ϵ = 1/k, k = 1, and α = 0.5, Q(−, a1 ) = [ 1 0 0 0 0 0 +10],
Q(−, a2 ) =[ 1 0 0 0 0 0 +5], γ = 1
Assume starting state is s6 and sample a1

Worked Example: SARSA for Mars Rover

1: Set initial ϵ-greedy policy π, t = 0, initial state st = s0


2: Take at ∼ π(st ) // Sample action from policy
3: Observe (rt , st+1 )
4: loop
5: Take action at+1 ∼ π(st+1 )
6: Observe (rt+1 , st+2 )
7: Q(st , at ) ← Q(st , at ) + α(rt + γQ(st+1 , at+1 ) − Q(st , at ))
8: π(st ) = arg maxa Q(st , a) w.prob 1 − ϵ, else random
9: t =t +1
10: end loop
Initialize ϵ = 1/k, k = 1, and α = 0.5, Q(−, a1 ) = [ 1 0 0 0 0 0 +10],
Q(−, a2 ) =[ 1 0 0 0 0 0 +5], γ = 1
Assume starting state is s6 and sample a1

Worked Example: SARSA for Mars Rover

1: Set initial ϵ-greedy policy π, t = 0, initial state st = s0


2: Take at ∼ π(st ) // Sample action from policy
3: Observe (rt , st+1 )
4: loop
5: Take action at+1 ∼ π(st+1 )
6: Observe (rt+1 , st+2 )
7: Q(st , at ) ← Q(st , at ) + α(rt + γQ(st+1 , at+1 ) − Q(st , at ))
8: π(st ) = arg maxa Q(st , a) w.prob 1 − ϵ, else random
9: t =t +1
10: end loop
Initialize ϵ = 1/k, k = 1, and α = 0.5, Q(−, a1 ) = [ 1 0 0 0 0 0 +10],
Q(−, a2 ) =[ 1 0 0 0 0 0 +5], γ = 1
Tuple: (s6 , a1 , 0, s7 , a2 , 5, s7 ).
Q(s6 , a1 ) = .5 ∗ 0 + .5 ∗ (0 + γQ(s7 , a2 )) = 2.5

Worked Example: ϵ-greedy Q-Learning Mars

1: Initialize Q(s, a),∀s ∈ S, a ∈ A t = 0, initial state st = s0


2: Set πb to be ϵ-greedy w.r.t. Q
3: loop
4: Take at ∼ πb (st ) // Sample action from policy
5: Observe (rt , st+1 )
6: Q(st , at ) ← Q(st , at ) + α(rt + γ maxa Q(st+1 , a) − Q(st , at ))
7: π(st ) = arg maxa Q(st , a) w.prob 1 − ϵ, else random
8: t =t +1
9: end loop
Initialize ϵ = 1/k, k = 1, and α = 0.5, Q(−, a1 ) = [ 1 0 0 0 0 0 +10],
Q(−, a2 ) =[ 1 0 0 0 0 0 +5], γ = 1
Like in SARSA example, start in s6 and take a1 .

Worked Example: ϵ-greedy Q-Learning Mars

1: Initialize Q(s, a),∀s ∈ S, a ∈ A t = 0, initial state st = s0


2: Set πb to be ϵ-greedy w.r.t. Q
3: loop
4: Take at ∼ πb (st ) // Sample action from policy
5: Observe (rt , st+1 )
6: Q(st , at ) ← Q(st , at ) + α(rt + γ maxa Q(st+1 , a) − Q(st , at ))
7: π(st ) = arg maxa Q(st , a) w.prob 1 − ϵ, else random
8: t =t +1
9: end loop
Initialize ϵ = 1/k, k = 1, and α = 0.5, Q(−, a1 ) = [ 1 0 0 0 0 0 +10],
Q(−, a2 ) =[ 1 0 0 0 0 0 +5], γ = 1
Tuple: (s6 , a1 , 0, s7 ).
Q(s6 , a1 ) = 0 + .5 ∗ (0 + γ maxa′ Q(s7 , a′ ) − 0) = .5*10 = 5
Recall that in the SARSA update we saw Q(s6 , a1 ) = 2.5 because we used
the actual action taken at s7 instead of the max
Does how Q is initialized matter (initially? asymptotically?)?

Optional Check Your Understanding L4: SARSA and
Q-Learning

SARSA: Q(st , at ) ← Q(st , at ) + α(rt + γQ(st+1 , at+1 ) − Q(st , at ))


Q-Learning:
Q(st , at ) ← Q(st , at ) + α(rt + γ maxa′ Q(st+1 , a′ ) − Q(st , at ))
Select all that are true
1 Both SARSA and Q-learning may update their policy after every step
2 If ϵ = 0 for all time steps, and Q is initialized randomly, a SARSA Q
state update will be the same as a Q-learning Q state update
3 Not sure

Optional Check Your Understanding SARSA and
Q-Learning Solutions

SARSA: Q(st , at ) ← Q(st , at ) + α(rt + γQ(st+1 , at+1 ) − Q(st , at ))


Q-Learning:
Q(st , at ) ← Q(st , at ) + α(rt + γ maxa′ Q(st+1 , a′ ) − Q(st , at ))
Select all that are true
1 Both SARSA and Q-learning may update their policy after every step
2 If ϵ = 0 for all time steps, and Q is initialized randomly, a SARSA Q
state update will be the same as a Q-learning Q state update
3 Not sure

