
Reinforcement Learning

Lecture 5: Monte Carlo methods

Chris G. Willcocks
Durham University
Lecture overview

This lecture covers Chapter 5 of Sutton & Barto [1], with adaptations from David Silver [2]

1 Introduction
history of Monte Carlo methods
definition

2 Monte Carlo prediction
overview
definition
incremental means
prediction with incremental updates

3 Monte Carlo control
policy iteration using the action-value function
don't just be greedy!
ε-greedy exploration
greedy in the limit of infinite exploration
Introduction: history of Monte Carlo methods

History: Monte Carlo methods

Invented by Stanislaw Ulam in the 1940s, when trying to calculate the probability of a successful Canfield solitaire. He randomly laid the cards out 100 times and simply counted the number of successful plays.

Widely used today, for example:
• Path tracing in computer graphics
• Computational physics, chemistry, ...
• Grid-free PDE solvers [3]

Example: Monte Carlo path tracing (figure)

Introduction: definition

Definition: Monte Carlo method

Apply repeated random sampling to obtain numerical results for problems that are difficult or otherwise impossible to solve directly.

General approach:
1. Define a domain of possible inputs
2. Generate inputs randomly from a probability distribution over the domain
3. Perform a deterministic computation on the inputs
4. Aggregate (e.g. average) the results

Example: approximating π (figure; a minimal code sketch follows below)
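The figure's π example follows exactly these four steps. Here is a minimal sketch of it (my own illustration, not taken from the slides): sample points uniformly in the unit square and count those that land inside the quarter circle.

import numpy as np

def approximate_pi(n_samples=1_000_000):
    # 1. Domain: the unit square [0, 1) x [0, 1)
    # 2. Generate inputs uniformly at random
    x = np.random.rand(n_samples)
    y = np.random.rand(n_samples)
    # 3. Deterministic computation: does each point land inside the quarter circle?
    inside = x**2 + y**2 <= 1.0
    # 4. Aggregate: the fraction inside estimates pi/4
    return 4.0 * inside.mean()

print(approximate_pi())  # ~3.1416 for large n_samples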

Monte Carlo reinforcement learning: overview

Overview: MC reinforcement learning

MC RL samples complete episodes (figure)

Monte Carlo reinforcement learning learns from episodes of experience:
1. Recap: empirical risk minimisation
2. It's model-free (requires no knowledge of MDP transitions/rewards)
3. Learns from complete episodes (you have to play a full game from start to finish)
4. One simple idea: the value function = the empirical mean return

Monte Carlo reinforcement learning: definition

Definition: MC reinforcement learning

Example: episode (figure)

Putting this together, we sample episodes from experience under policy π,

$S_1, A_1, R_2, S_2, A_2, \ldots, S_k \sim \pi,$

where we're going to look at the total discounted reward (the return) from each timestep onwards,

$G_t = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{T-1} R_T,$

and our value function is the expected return,

$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s].$

With MC reinforcement learning, we use an empirical mean instead of the expected return.

Monte Carlo reinforcement learning: incremental means

Definition: Incremental means

Example: episode (figure)

RL algorithms use incremental means, where the sequence of means µ_1, µ_2, ... is computed incrementally:

$\mu_k = \frac{1}{k}\sum_{j=1}^{k} x_j$
$\;\;= \frac{1}{k}\left(x_k + \sum_{j=1}^{k-1} x_j\right)$
$\;\;= \frac{1}{k}\left(x_k + (k-1)\mu_{k-1}\right)$
$\;\;= \mu_{k-1} + \frac{1}{k}\left(x_k - \mu_{k-1}\right)$
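A quick numerical sanity check of the final update rule (a sketch of my own, not from the lecture):

import numpy as np

def incremental_mean(xs):
    # mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k
    mu = 0.0
    for k, x in enumerate(xs, start=1):
        mu += (x - mu) / k
    return mu

xs = np.random.randn(1000)
assert np.isclose(incremental_mean(xs), xs.mean())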

Monte Carlo methods: prediction with incremental updates

Definition: MC prediction, incremental updates

Example: episode (figure)

Putting this together, we sample episodes from experience under policy π,

$S_1, A_1, R_2, S_2, A_2, \ldots, S_T \sim \pi,$

and every time we visit a state, we increase a visit counter and then use our running mean:

$N(S_t) \leftarrow N(S_t) + 1$
$V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)$

It's also common to just track a running mean with a fixed step size and forget about old episodes:

$V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)$
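As a sketch of how these updates sit inside a full every-visit prediction loop (my own illustration, assuming a classic gym-style env whose step returns (s_next, reward, done, info), and a policy function mapping a state to an action):

from collections import defaultdict

def mc_prediction(env, policy, num_episodes=10_000, gamma=1.0):
    # Every-visit Monte Carlo prediction of V under a fixed policy.
    V = defaultdict(float)   # V(s), the running mean return
    N = defaultdict(int)     # N(s), the visit counter
    for _ in range(num_episodes):
        # 1. Sample one complete episode under the policy
        episode = []
        s, done = env.reset(), False
        while not done:
            a = policy(s)
            s_next, reward, done, _ = env.step(a)
            episode.append((s, reward))
            s = s_next
        # 2. Walk backwards through the episode, accumulating the return G_t
        G = 0.0
        for s, reward in reversed(episode):
            G = reward + gamma * G
            N[s] += 1
            V[s] += (G - V[s]) / N[s]   # incremental mean update from the slide
    return V

Replacing (G - V[s]) / N[s] with alpha * (G - V[s]) for a fixed alpha gives the running-mean variant from the last line above.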

Monte Carlo methods: policy iteration using the action-value function

Problem: model-free learning. Solution: Q

Example: caching Q-values (figure)

Simply greedily improving the policy over V(s) requires a model:

$\pi'(s) = \arg\max_{a \in \mathcal{A}} \; \mathcal{R}_s^a + \mathcal{P}_{ss'}^a V(s'),$

whereas greedy policy improvement over Q(s, a) is model-free:

$\pi'(s) = \arg\max_{a \in \mathcal{A}} \; Q(s, a)$
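To make the contrast concrete, here is a small sketch (my own illustration; P[s, a, s'], R[s, a], V[s] and Q[s, a] are assumed tabular numpy arrays, and the discount and sum over next states, left implicit in the slide's shorthand, are written out):

import numpy as np

def greedy_from_v(V, P, R, gamma=1.0):
    # Model-based improvement: needs the MDP's transitions P[s, a, s'] and rewards R[s, a].
    q_model = R + gamma * np.einsum('ijk,k->ij', P, V)   # R_s^a + gamma * sum_s' P_ss'^a V(s')
    return np.argmax(q_model, axis=1)

def greedy_from_q(Q):
    # Model-free improvement: only needs the cached action values Q[s, a].
    return np.argmax(Q, axis=1)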

Follow along in Colab.

Monte Carlo methods: don't just be greedy!

Algorithm: greedy MC that will get stuck

(figure: environment with rewards R=0, R=1, R=5)

# assumes: import numpy as np, a gym-style env, and n_states, n_actions, num_episodes defined
Q = np.zeros([n_states, n_actions])
n_visits = np.zeros([n_states, n_actions])

for episode in range(num_episodes):
    s, done = env.reset(), False
    results_list, result_sum = [], 0.0
    while not done:
        a = np.argmax(Q[s, :])  # always act greedily: exploration never happens
        s_next, reward, done, _ = env.step(a)
        results_list.append((s, a))
        result_sum += reward
        s = s_next

    for (s, a) in results_list:
        n_visits[s, a] += 1.0
        alpha = 1.0 / n_visits[s, a]
        Q[s, a] += alpha * (result_sum - Q[s, a])

Monte Carlo methods: ε-greedy exploration

Definition: ε-greedy exploration

Problem: local minima (figure: rewards R=0, R=1, R=5)

The simplest idea to avoid local minima is:
• choose a random action with probability ε
• choose the action greedily with probability 1 − ε
• where all m actions are tried with non-zero probability

This gives the updated policy:

$\pi(a \mid s) = \begin{cases} \epsilon/m + 1 - \epsilon & \text{if } a^* = \arg\max_{a \in \mathcal{A}} Q(s, a) \\ \epsilon/m & \text{otherwise} \end{cases}$

Proof of convergence in Equation 5.2 of [1].
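A minimal action sampler implementing this policy (a sketch of my own, not the lecture's Colab code):

import numpy as np

def epsilon_greedy_action(Q, s, epsilon):
    m = Q.shape[1]                       # number of actions
    if np.random.rand() < epsilon:
        return np.random.randint(m)      # explore: each action with probability epsilon/m
    return int(np.argmax(Q[s, :]))       # exploit: the greedy action gets the extra 1 - epsilon

With probability ε an action is drawn uniformly (so each of the m actions gets ε/m), and otherwise the greedy action is taken, matching the case split above.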

Monte Carlo methods: don't just explore!

Asymptotically, we can't just explore...

Monte Carlo methods: greedy in the limit of infinite exploration

Definition: greedy in the limit with infinite exploration (GLIE)

Defines a schedule for exploration, such that these two conditions are met:

1. You continue to explore everything:

$\lim_{k \to \infty} N_k(s, a) = \infty$

2. The policy converges on a greedy policy:

$\lim_{k \to \infty} \pi_k(a \mid s) = \mathbf{1}\!\left(a = \arg\max_{a' \in \mathcal{A}} Q_k(s, a')\right)$

Monte Carlo methods: greedy in the limit of infinite exploration

Algorithm: greedy in the limit of ∞ exploration

(figure: rewards R=1, R=5)

# ... Q, n_visits, env and num_episodes set up as in the previous algorithm
for episode in range(num_episodes):
    s, done = env.reset(), False
    results_list, result_sum = [], 0.0
    while not done:
        epsilon = min(1.0, 10000.0 / (episode + 1))  # exploration schedule decaying towards greedy
        if np.random.rand() > epsilon:
            a = np.argmax(Q[s, :])            # exploit
        else:
            a = env.action_space.sample()     # explore
        s_next, reward, done, _ = env.step(a)
        results_list.append((s, a))
        result_sum += reward
        s = s_next

    for (s, a) in results_list:
        n_visits[s, a] += 1.0
        alpha = 1.0 / n_visits[s, a]
        Q[s, a] += alpha * (result_sum - Q[s, a])

Take Away Points

Summary

In summary, Monte Carlo RL methods:

• are a solution to the reinforcement learning problem
• require training with complete episodes
• are model-free
• can balance exploration vs exploitation
• eventually converge on the optimal action-value function

References

[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction (second edition). MIT Press, 2018. Available online.
[2] David Silver. Reinforcement Learning lectures. https://www.davidsilver.uk/teaching/, 2015.
[3] Rohan Sawhney and Keenan Crane. "Monte Carlo geometry processing: a grid-free approach to PDE-based methods on volumetric domains". ACM Transactions on Graphics (TOG) 39.4 (2020), article 123.

