Reinforcement Learning
Lecture 5: Monte Carlo methods
Chris G. Willcocks
Durham University
Lecture overview
This lecture covers Chapter 5 of Sutton & Barto [1], with adaptations from David Silver's lectures [2]
1 Introduction
history of Monte Carlo methods
definition
2 Monte Carlo prediction
overview
definition
incremental means
prediction with incremental updates
3 Monte Carlo control
policy iteration using action-value function
don’t just be greedy!
ε-greedy exploration
greedy at the limit of infinite exploration
Introduction history of Monte Carlo methods
History: Monte Carlo methods
Example: Monte Carlo path tracing
Invented by Stanislaw Ulam in the 1940s, when trying to calculate the probability of a successful Canfield solitaire. He randomly laid the cards out 100 times and simply counted the number of successful plays.
Widely used today, for example:
• Path tracing in computer graphics
• Computational physics, chemistry, ...
• Grid-free PDE solvers [3]
Introduction definition
Definition: Monte Carlo method
Example: approximating π
Apply repeated random sampling to obtain
numerical results for difficult or otherwise
impossible problems
General approach:
1. Define a domain of possible inputs
2. Generate inputs randomly from a
probability distribution over the domain
3. Perform a deterministic computation on
the inputs
4. Aggregate (e.g. average) the results
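A minimal sketch of these four steps, using the slide's example of approximating π (the function name estimate_pi and the sample count are illustrative, not from the slides):

import numpy as np

def estimate_pi(n_samples=1_000_000, seed=0):
    rng = np.random.default_rng(seed)
    # steps 1-2: the domain is the unit square; draw inputs uniformly at random
    x, y = rng.random(n_samples), rng.random(n_samples)
    # step 3: deterministic computation -- does each point land inside the quarter circle?
    inside = x**2 + y**2 <= 1.0
    # step 4: aggregate -- the fraction inside, times 4, approximates pi
    return 4.0 * inside.mean()

print(estimate_pi())  # roughly 3.141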
Monte Carlo reinforcement learning overview
Overview: MC reinforcement learning
MC RL samples complete episodes
Monte Carlo reinforcement learning learns
from episodes of experience:
1. Recap: empirical risk minimisation
2. It’s model-free (requires no knowledge
of MDP transitions/rewards)
3. Learns from complete episodes (you
have to play a full game from start to
finish)
4. One simple idea: the value function =
the empirical mean return
Monte Carlo reinforcement learning definition
Definition: MC reinforcement learning
Example: episode
Putting this together, we sample episodes from experience under policy π,
$$S_1, A_1, R_2, S_2, A_2, \ldots, S_k \sim \pi,$$
where we're going to look at the total discounted reward (the return) at each timestep onwards,
$$G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T,$$
and our value function as the expected return,
$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s].$$
With MC reinforcement learning, we use an empirical
mean instead of the expected return.
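As an illustrative aside (not from the slides), the return at every timestep of an episode can be computed with a single backwards pass over its rewards, since G_t = R_{t+1} + γ G_{t+1}:

def returns(rewards, gamma):
    # rewards[t] is the reward received after timestep t, i.e. R_{t+1}
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G          # G_t = R_{t+1} + gamma * G_{t+1}
        out.append(G)
    return list(reversed(out))     # [G_0, G_1, ..., G_{T-1}]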
Monte Carlo reinforcement learning incremental means
Definition: Incremental means
Example: episode
RL algorithms use incremental means, where the means $\mu_1, \mu_2, \ldots$ of a sequence are computed incrementally:
$$\begin{aligned}
\mu_k &= \frac{1}{k}\sum_{j=1}^{k} x_j \\
&= \frac{1}{k}\left(x_k + \sum_{j=1}^{k-1} x_j\right) \\
&= \frac{1}{k}\left(x_k + (k-1)\,\mu_{k-1}\right) \\
&= \mu_{k-1} + \frac{1}{k}\left(x_k - \mu_{k-1}\right)
\end{aligned}$$
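A quick numerical check (purely illustrative) that the incremental form gives the same answer as the batch mean:

import numpy as np

xs = np.random.default_rng(0).normal(size=1000)
mu = 0.0
for k, x in enumerate(xs, start=1):
    mu += (x - mu) / k                 # mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k
print(np.isclose(mu, xs.mean()))       # True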
Monte Carlo methods prediction with incremental updates
Definition: MC prediction, incremental updates
Example: episode
Putting this together, we sample episodes from experience under policy π,
$$S_1, A_1, R_2, S_2, A_2, \ldots, S_T \sim \pi,$$
and every time we visit a state, we're going to increase a visit counter, then we will use our running mean:
$$N(S_t) \leftarrow N(S_t) + 1$$
$$V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)$$
It's common to also just track a running mean and forget about old episodes:
$$V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)$$
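A sketch of every-visit MC prediction built from these updates, assuming a gym-style env, a policy(s) function, and n_states, num_episodes, gamma defined elsewhere (these names are assumptions, not from the slides):

import numpy as np

# n_states, num_episodes, gamma, env and policy(s) are assumed to exist
V = np.zeros(n_states)
N = np.zeros(n_states)

for episode in range(num_episodes):
    s, done, trajectory = env.reset(), False, []
    while not done:
        a = policy(s)                            # act according to pi
        s_next, reward, done, _ = env.step(a)
        trajectory.append((s, reward))
        s = s_next
    G = 0.0
    for s, reward in reversed(trajectory):       # accumulate returns backwards
        G = reward + gamma * G
        N[s] += 1
        V[s] += (G - V[s]) / N[s]                # running-mean update
        # or: V[s] += alpha * (G - V[s])         # constant step size, forgets old episodes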
Monte Carlo methods policy iteration using action-value function
Problem: model-free learning. Solution: Q
Example: caching Q-values
Simply greedily improving the policy over V(s) requires a model:
$$\pi'(s) = \arg\max_{a \in \mathcal{A}} \; \mathcal{R}^a_s + \mathcal{P}^a_{ss'} V(s'),$$
whereas greedy policy improvement over Q(s, a) is model-free:
$$\pi'(s) = \arg\max_{a \in \mathcal{A}} Q(s, a)$$
Follow along in Colab.
Monte Carlo methods don’t just be greedy!
Algorithm: greedy MC that will get stuck
import numpy as np

# n_states, n_actions, num_episodes and a gym-style env are assumed to exist
Q = np.zeros([n_states, n_actions])
n_visits = np.zeros([n_states, n_actions])

for episode in range(num_episodes):
    s, done = env.reset(), False
    results_list, result_sum = [], 0.0
    while not done:
        a = np.argmax(Q[s, :])                 # always act greedily: no exploration, so it gets stuck
        s_next, reward, done, _ = env.step(a)
        results_list.append((s, a))
        result_sum += reward
        s = s_next
    for (s, a) in results_list:                # update Q with the episode return
        n_visits[s, a] += 1.0
        alpha = 1.0 / n_visits[s, a]
        Q[s, a] += alpha * (result_sum - Q[s, a])
Monte Carlo methods ε-greedy exploration
Definition: ε-greedy exploration
Problem: local minima
The simplest idea to avoid local minima is:
• choose a random action with probability ε
• choose the action greedily with probability 1 − ε
• so that all m actions are tried with non-zero probability
This gives the updated policy:
$$\pi(a \mid s) = \begin{cases} \epsilon/m + 1 - \epsilon & \text{if } a^* = \arg\max_{a \in \mathcal{A}} Q(s, a) \\ \epsilon/m & \text{otherwise} \end{cases}$$
Proof of convergence in Equation 5.2 of [1]
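A minimal sketch of this distribution in NumPy (the function names are illustrative): every one of the m actions gets probability ε/m, and the greedy action receives the extra 1 − ε:

import numpy as np

def epsilon_greedy_probs(q_row, eps):
    m = len(q_row)
    probs = np.full(m, eps / m)           # every action: eps/m
    probs[np.argmax(q_row)] += 1.0 - eps  # greedy action: eps/m + 1 - eps
    return probs

def epsilon_greedy_action(q_row, eps, rng=np.random.default_rng()):
    return rng.choice(len(q_row), p=epsilon_greedy_probs(q_row, eps))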
Monte Carlo methods don’t just explore!
Asymptotically we can’t just explore...
Monte Carlo methods greedy at the limit of infinite exploration
Definition: greedy in the limit with infinite exploration (GLIE)
Defines a schedule for exploration, such that these two conditions are met:
1. You continue to explore everything:
$$\lim_{k \to \infty} N_k(s, a) = \infty$$
2. The policy converges on a greedy policy:
$$\lim_{k \to \infty} \pi_k(a \mid s) = \mathbf{1}\!\left(a = \arg\max_{a' \in \mathcal{A}} Q_k(s, a')\right)$$
Monte Carlo methods greedy at the limit of infinite exploration
Algorithm: greedy at the limit of ∞ exploration
# ... Q and n_visits initialised as before
for episode in range(num_episodes):
    s, done = env.reset(), False
    results_list, result_sum = [], 0.0
    while not done:
        epsilon = min(1.0, 10000.0 / (episode + 1))   # GLIE schedule: epsilon decays towards 0
        if np.random.rand() > epsilon:
            a = np.argmax(Q[s, :])                    # exploit: greedy action
        else:
            a = env.action_space.sample()             # explore: random action
        s_next, reward, done, _ = env.step(a)
        results_list.append((s, a))
        result_sum += reward
        s = s_next
    for (s, a) in results_list:                        # update Q with the episode return
        n_visits[s, a] += 1.0
        alpha = 1.0 / n_visits[s, a]
        Q[s, a] += alpha * (result_sum - Q[s, a])
Take Away Points
Summary
In summary, Monte Carlo RL methods:
• are a solution to the reinforcement learning
problem
• require training with complete episodes
• are model-free
• can balance exploration vs exploitation
• eventually converge on the optimal
action-value function
References I
[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction (second edition). MIT Press, 2018. Available online.
[2] David Silver. Reinforcement Learning lectures. https://www.davidsilver.uk/teaching/. 2015.
[3] Rohan Sawhney and Keenan Crane. "Monte Carlo geometry processing: a grid-free approach to PDE-based methods on volumetric domains". In: ACM Transactions on Graphics (TOG) 39.4 (2020), Article 123.