Lecture 4: Model-Free Control and Value Function Approximation
Emma Brunskill
Winter 2025
Initialize policy π
Repeat:
Policy evaluation: compute Q^π
Policy improvement: update π given Q^π
May need to modify policy evaluation:
If π is deterministic, can't compute Q(s, a) for any a ≠ π(s)
How to interleave policy evaluation and improvement?
Policy improvement is now using an estimated Q
Computational complexity?
Converge to the optimal Q* function?
Empirical performance?
Definition of GLIE (Greedy in the Limit with Infinite Exploration)
All state-action pairs are visited an infinite number of times:
lim_{i→∞} N_i(s, a) → ∞
The behavior policy converges to the greedy policy:
lim_{i→∞} π_i(a|s) = 1(a = arg max_{a′} Q_i(s, a′))
Theorem
GLIE Monte-Carlo control converges to the optimal state-action value
function: Q(s, a) → Q*(s, a)
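A minimal sketch of GLIE Monte Carlo control for a small tabular problem, assuming a hypothetical environment interface env.reset() -> s and env.step(a) -> (s_next, r, done) with hashable states and actions 0..nA-1; the schedule ε_k = 1/k and the 1/N(s, a) incremental-mean step size are one choice that satisfies GLIE.

import random
from collections import defaultdict

def glie_mc_control(env, nA, num_episodes=10_000, gamma=1.0):
    Q = defaultdict(lambda: [0.0] * nA)   # Q(s, a) estimates
    N = defaultdict(lambda: [0] * nA)     # visit counts N(s, a)
    for k in range(1, num_episodes + 1):
        eps = 1.0 / k                     # GLIE: exploration decays to zero
        # Roll out one episode with the current epsilon-greedy policy.
        episode, s, done = [], env.reset(), False
        while not done:
            if random.random() < eps:
                a = random.randrange(nA)
            else:
                a = max(range(nA), key=lambda a_: Q[s][a_])
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next
        # Returns G_t for every step, computed backwards.
        G, returns = 0.0, []
        for (_, _, r) in reversed(episode):
            G = r + gamma * G
            returns.append(G)
        returns.reverse()
        # First-visit incremental-mean update of Q.
        seen = set()
        for (s, a, _), G_t in zip(episode, returns):
            if (s, a) in seen:
                continue
            seen.add((s, a))
            N[s][a] += 1
            Q[s][a] += (G_t - Q[s][a]) / N[s][a]
    return Q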
Initialize policy π
Repeat:
Policy evaluation: compute Q^π using temporal difference updating with an ε-greedy policy
Policy improvement: same as Monte Carlo policy improvement, set π to ε-greedy(Q^π)
Method 1: SARSA
On-policy: SARSA computes an estimate Q of the policy used to act
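A minimal sketch of the tabular SARSA algorithm with an ε-greedy behavior policy, under the same assumed environment interface as the Monte Carlo sketch above; α, γ, and ε are illustrative constants.

import random
from collections import defaultdict

def epsilon_greedy(Q, s, nA, eps):
    # Pick a random action with probability eps, else the greedy action.
    if random.random() < eps:
        return random.randrange(nA)
    return max(range(nA), key=lambda a: Q[s][a])

def sarsa(env, nA, num_episodes=10_000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(lambda: [0.0] * nA)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        a = epsilon_greedy(Q, s, nA, eps)
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, nA, eps)
            # On-policy target: bootstrap from the action actually taken next.
            target = r if done else r + gamma * Q[s_next][a_next]
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s_next, a_next
    return Q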
Computational complexity?
Converge to the optimal Q* function? Recall:
Q(s_t, a_t) ← Q(s_t, a_t) + α(r_t + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t))
π(s_t) = arg max_a Q(s_t, a) with probability 1 − ε, else random
Q is an estimate of the performance of a policy that may be changing at each time step
Empirical performance?
Theorem
SARSA for finite-state and finite-action MDPs converges to the optimal
action-value function, Q(s, a) → Q*(s, a), under the following conditions:
1. The policy sequence π_t(a|s) satisfies the condition of GLIE
2. The step sizes α_t satisfy the Robbins-Monro conditions:
   ∑_{t=1}^∞ α_t = ∞ and ∑_{t=1}^∞ α_t² < ∞
For example, α_t = 1/t satisfies these conditions.
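As a quick check of the example: ∑_{t=1}^∞ 1/t diverges (the harmonic series), while ∑_{t=1}^∞ 1/t² = π²/6 is finite, so both Robbins-Monro conditions hold.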
On-policy learning
Direct experience
Learn to estimate and evaluate a policy from experience obtained from
following that policy
Off-policy learning
Learn to estimate and evaluate a policy using experience gathered from
following a different policy
Q-learning:
See optional worked example and optional understanding check at the end
of the slides
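For reference, a minimal sketch of tabular Q-learning, the off-policy method referenced above: the behavior policy is ε-greedy, but the TD target bootstraps from the greedy (max) next action rather than the action actually taken. Same assumed environment interface as the earlier sketches.

import random
from collections import defaultdict

def q_learning(env, nA, num_episodes=10_000, alpha=0.1, gamma=0.99, eps=0.1):
    Q = defaultdict(lambda: [0.0] * nA)
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            # Behave epsilon-greedily (exploration policy)...
            if random.random() < eps:
                a = random.randrange(nA)
            else:
                a = max(range(nA), key=lambda a_: Q[s][a_])
            s_next, r, done = env.step(a)
            # ...but bootstrap from the greedy next action (target policy).
            target = r if done else r + gamma * max(Q[s_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s = s_next
    return Q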
Avoid explicitly storing or learning the following for every single state
and action
Dynamics or reward model
Value
State-action value
Policy
Want a more compact representation that generalizes across states, or across states and actions
Reduce memory needed to store (P, R) / V / Q / π
Reduce computation needed to compute (P, R) / V / Q / π
Reduce experience needed to find a good (P, R) / V / Q / π
First assume we could query any state s and action a and an oracle would return the true value for Q^π(s, a)
Similar to supervised learning: assume given ((s, a), Q^π(s, a)) pairs
The objective is to find the best approximate representation of Q^π given a particular parameterized function Q̂(s, a; w)
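One concrete choice is a linear approximator over a hand-designed feature map x(s, a); the sketch below, with hypothetical names x, q_hat, and mse_gradient_step, shows the representation and one stochastic gradient step on the squared error against an oracle target.

import numpy as np

def q_hat(x, s, a, w):
    # Q_hat(s, a; w) = x(s, a) . w  (linear in the weights)
    return float(np.dot(x(s, a), w))

def mse_gradient_step(x, s, a, q_target, w, alpha=0.01):
    # One SGD step on (q_target - Q_hat(s, a; w))^2; the gradient of a
    # linear Q_hat with respect to w is simply the feature vector x(s, a).
    features = x(s, a)
    error = q_target - np.dot(features, w)
    return w + alpha * error * features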
1: Initialize w, k = 1
2: loop
3:   Sample k-th episode (s_{k,1}, a_{k,1}, r_{k,1}, s_{k,2}, . . . , s_{k,L_k}) given π
4:   for t = 1, . . . , L_k do
5:     if first visit to (s, a) in episode k then
6:       G_t(s, a) = ∑_{j=t}^{L_k} r_{k,j}
7:       ∇_w J(w) = −2 [G_t(s, a) − Q̂(s_t, a_t; w)] ∇_w Q̂(s_t, a_t; w)   (compute gradient)
8:       Update weights ∆w
9:     end if
10:  end for
11:  k = k + 1
12: end loop
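A sketch of this Monte Carlo loop with the linear Q̂ from the earlier sketch; the policy pi, feature map x, feature dimension d, and step size alpha are assumptions, and the undiscounted return matches line 6 above.

import numpy as np

def mc_linear_q_evaluation(env, pi, x, d, num_episodes=1000, alpha=0.01):
    w = np.zeros(d)
    for _ in range(num_episodes):
        # Sample one episode under pi (line 3 of the pseudocode).
        episode, s, done = [], env.reset(), False
        while not done:
            a = pi(s)
            s_next, r, done = env.step(a)
            episode.append((s, a, r))
            s = s_next
        # Undiscounted returns G_t (line 6), computed backwards.
        G, returns = 0.0, []
        for (_, _, r) in reversed(episode):
            G = r + G
            returns.append(G)
        returns.reverse()
        # One gradient step per first visit of (s, a) (lines 5-8).
        seen = set()
        for (s, a, _), G_t in zip(episode, returns):
            if (s, a) in seen:
                continue
            seen.add((s, a))
            feats = x(s, a)
            w += alpha * (G_t - np.dot(feats, w)) * feats
    return w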
1: Initialize w, s
2: loop
3:   Given s, sample a ∼ π(s), r(s, a), s′ ∼ p(s′|s, a)
4:   ∇_w J(w) = −2 [r + γ V̂(s′; w) − V̂(s; w)] ∇_w V̂(s; w)
5:   Update weights ∆w
6:   if s′ is not a terminal state then
7:     Set s = s′
8:   else
9:     Restart episode, sample initial state s
10:  end if
11: end loop
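A sketch of this TD(0) loop with a linear V̂(s; w) = φ(s)·w; phi, pi, d, and the step count are assumptions, and the bootstrapped target is held fixed when differentiating (semi-gradient), matching line 4.

import numpy as np

def td0_linear_v_evaluation(env, pi, phi, d, num_steps=100_000,
                            alpha=0.01, gamma=0.99):
    w = np.zeros(d)
    s = env.reset()
    for _ in range(num_steps):
        a = pi(s)                                 # line 3: a ~ pi(s)
        s_next, r, done = env.step(a)
        v_s = np.dot(phi(s), w)
        v_next = 0.0 if done else np.dot(phi(s_next), w)
        # Semi-gradient TD(0): the bootstrapped target is treated as a constant.
        td_error = r + gamma * v_next - v_s
        w += alpha * td_error * phi(s)            # line 5: update weights
        s = env.reset() if done else s_next       # lines 6-10
    return w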
Q̂^π(s, a; w) ≈ Q^π(s, a)
Minimize the mean-squared error between the true action-value function Q^π(s, a) and the approximate action-value function:
J(w) = E_π[(Q^π(s, a) − Q̂(s, a; w))²]
∆w = α (r + γ max_{a′} Q̂(s′, a′; w) − Q̂(s, a; w)) ∇_w Q̂(s, a; w)
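A sketch of one such update with a linear Q̂(s, a; w) = x(s, a)·w; the function name, feature map x, and hyperparameters are illustrative, and the max over next actions is treated as a constant when differentiating (semi-gradient), as in the ∆w rule above.

import numpy as np

def q_learning_vfa_step(x, nA, s, a, r, s_next, done, w,
                        alpha=0.01, gamma=0.99):
    # Linear Q_hat(s, a; w) = x(s, a) . w
    q_sa = np.dot(x(s, a), w)
    q_next = 0.0 if done else max(np.dot(x(s_next, a2), w) for a2 in range(nA))
    # Semi-gradient: the max target is treated as a constant when
    # differentiating, so grad_w Q_hat(s, a; w) = x(s, a).
    return w + alpha * (r + gamma * q_next - q_sa) * x(s, a)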
To help improve stability, fix the target weights used in the target
calculation for multiple updates
Target network uses a different set of weights than the weights being
updated
Let parameters w⁻ be the set of weights used in the target, and w be the weights that are being updated
Slight change to computation of target value:
(s, a, r, s′) ∼ D: sample an experience tuple from the dataset
Compute the target value for the sampled s: r + γ max_{a′} Q̂(s′, a′; w⁻)
Use stochastic gradient descent to update the network weights
∆w = α (r + γ max_{a′} Q̂(s′, a′; w⁻) − Q̂(s, a; w)) ∇_w Q̂(s, a; w)
These choices introduce additional hyperparameters: the learning rate, and how often to update the target network. Often a fixed-size replay buffer is used for experience replay, which introduces a parameter to control its size, and the need to decide how to populate it.
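A simplified sketch of these pieces together, experience replay plus a periodically copied target network, using a linear Q̂ as a stand-in for the deep network so the mechanism stays visible; the class name, buffer size, batch size, and target-update period are illustrative assumptions, not the DQN paper's settings.

import random
from collections import deque
import numpy as np

class ReplayDQNSketch:
    def __init__(self, x, nA, d, alpha=0.01, gamma=0.99,
                 buffer_size=10_000, batch_size=32, target_period=500):
        self.x, self.nA = x, nA
        self.w = np.zeros(d)            # online weights, updated every step
        self.w_target = self.w.copy()   # target weights w-, updated rarely
        self.buffer = deque(maxlen=buffer_size)
        self.alpha, self.gamma = alpha, gamma
        self.batch_size, self.target_period = batch_size, target_period
        self.updates = 0

    def store(self, s, a, r, s_next, done):
        # (s, a, r, s') ~ D: tuples are sampled from this dataset later.
        self.buffer.append((s, a, r, s_next, done))

    def update(self):
        if len(self.buffer) < self.batch_size:
            return
        for s, a, r, s_next, done in random.sample(list(self.buffer),
                                                   self.batch_size):
            # Target uses w-: r + gamma * max_a' Q_hat(s', a'; w-)
            q_next = 0.0 if done else max(
                np.dot(self.x(s_next, a2), self.w_target)
                for a2 in range(self.nA))
            feats = self.x(s, a)
            td_error = r + self.gamma * q_next - np.dot(feats, self.w)
            self.w += self.alpha * td_error * feats
        self.updates += 1
        if self.updates % self.target_period == 0:
            self.w_target = self.w.copy()   # periodic hard copy of the weights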
In DQN we compute the target value for the sampled (s, a, r, s′) using a separate set of target weights: r + γ max_{a′} Q̂(s′, a′; w⁻)
Select all that are true
This doubles the computation time compared to a method that does
not have a separate set of weights
This doubles the memory requirements compared to a method that
does not have a separate set of weights
Not sure
Game             Linear   Deep Network
Breakout              3              3
Enduro               62             29
River Raid         2345           1453
Seaquest            656            275
Space Invaders      301            302
Game             Linear   Deep Network   DQN w/ fixed Q
Breakout              3              3               10
Enduro               62             29              141
River Raid         2345           1453             2868
Seaquest            656            275             1003
Space Invaders      301            302              373
Theorem
For any ε-greedy policy π_i, the ε-greedy policy w.r.t. Q^{π_i}, π_{i+1}, is a monotonic improvement: V^{π_{i+1}} ≥ V^{π_i}