Reinforcement Learning
Lecture 5: Monte Carlo methods
Chris G. Willcocks
Durham University
Lecture overview
This lecture covers Chapter 5 of Sutton & Barto [1], with adaptations from David Silver's lectures [2]
1 Introduction
history of Monte Carlo methods
definition
2 Monte Carlo prediction
overview
definition
incremental means
prediction with incremental updates
3 Monte Carlo control
policy iteration using action-value function
don’t just be greedy!
ε-greedy exploration
greedy at the limit of infinite exploration
Introduction history of Monte Carlo methods
History: Monte Carlo methods
Example: Monte Carlo path tracing
Invented by Stanislaw Ulam in the 1940s, when trying to calculate the probability of a successful Canfield solitaire. He randomly laid the cards out 100 times and simply counted the number of successful plays.
Widely used today, for example:
• Path tracing in computer graphics
• Computational physics, chemistry, ...
• Grid-free PDE solvers [3]
Introduction definition
Definition: Monte Carlo method
Example: approximating π
Apply repeated random sampling to obtain
numerical results for difficult or otherwise
impossible problems
General approach:
1. Define a domain of possible inputs
2. Generate inputs randomly from a
probability distribution over the domain
3. Perform a deterministic computation on
the inputs
4. Aggregate (e.g. average) the results
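A minimal sketch of these four steps, using the slide's example of approximating π (the function name estimate_pi and the sample count are illustrative, not from the slides):

import numpy as np

def estimate_pi(n_samples=1_000_000, seed=0):
    rng = np.random.default_rng(seed)
    # steps 1-2: the domain is the unit square; draw inputs uniformly at random
    x, y = rng.random(n_samples), rng.random(n_samples)
    # step 3: deterministic computation -- does each point land inside the quarter circle?
    inside = x**2 + y**2 <= 1.0
    # step 4: aggregate -- the fraction inside, times 4, approximates pi
    return 4.0 * inside.mean()

print(estimate_pi())  # roughly 3.141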
Monte Carlo reinforcement learning overview
Overview: MC reinforcement learning
MC RL samples complete episodes
Monte Carlo reinforcement learning learns
from episodes of experience:
1. Recap: empirical risk minimisation
2. It’s model-free (requires no knowledge
of MDP transitions/rewards)
3. Learns from complete episodes (you
have to play a full game from start to
finish)
4. One simple idea: the value function =
the empirical mean return
Monte Carlo reinforcement learning definition
Definition: MC reinforcement learning
Example: episode
Putting this together, we sample episodes from experience under policy π,
$$S_1, A_1, R_2, S_2, A_2, \ldots, S_k \sim \pi,$$
where we're going to look at the total discounted reward (the return) at each timestep onwards,
$$G_t = R_{t+1} + \gamma R_{t+2} + \ldots + \gamma^{T-1} R_T,$$
and our value function as the expected return,
$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s].$$
With MC reinforcement learning, we use an empirical
mean instead of the expected return.
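As an illustrative aside (not from the slides), the return at every timestep of an episode can be computed with a single backwards pass over its rewards, since G_t = R_{t+1} + γ G_{t+1}:

def returns(rewards, gamma):
    # rewards[t] is the reward received after timestep t, i.e. R_{t+1}
    G, out = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G          # G_t = R_{t+1} + gamma * G_{t+1}
        out.append(G)
    return list(reversed(out))     # [G_0, G_1, ..., G_{T-1}]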
Monte Carlo reinforcement learning incremental means
Definition: Incremental means
Example: episode
RL algorithms use incremental means, where the means $\mu_1, \mu_2, \ldots$ of a sequence are computed incrementally:
$$\begin{aligned}
\mu_k &= \frac{1}{k}\sum_{j=1}^{k} x_j \\
&= \frac{1}{k}\left(x_k + \sum_{j=1}^{k-1} x_j\right) \\
&= \frac{1}{k}\left(x_k + (k-1)\,\mu_{k-1}\right) \\
&= \mu_{k-1} + \frac{1}{k}\left(x_k - \mu_{k-1}\right)
\end{aligned}$$
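A quick numerical check (purely illustrative) that the incremental form gives the same answer as the batch mean:

import numpy as np

xs = np.random.default_rng(0).normal(size=1000)
mu = 0.0
for k, x in enumerate(xs, start=1):
    mu += (x - mu) / k                 # mu_k = mu_{k-1} + (x_k - mu_{k-1}) / k
print(np.isclose(mu, xs.mean()))       # True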
Monte Carlo methods prediction with incremental updates
Definition: MC prediction, incremental updates
Example: episode
Putting this together, we sample episodes from experience under policy π,
$$S_1, A_1, R_2, S_2, A_2, \ldots, S_T \sim \pi,$$
and every time we visit a state, we're going to increase a visit counter, then we will use our running mean:
$$N(S_t) \leftarrow N(S_t) + 1$$
$$V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)$$
It's common to also just track a running mean and forget about old episodes:
$$V(S_t) \leftarrow V(S_t) + \alpha\left(G_t - V(S_t)\right)$$
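A sketch of every-visit MC prediction built from these updates, assuming a gym-style env, a policy(s) function, and n_states, num_episodes, gamma defined elsewhere (these names are assumptions, not from the slides):

import numpy as np

# n_states, num_episodes, gamma, env and policy(s) are assumed to exist
V = np.zeros(n_states)
N = np.zeros(n_states)

for episode in range(num_episodes):
    s, done, trajectory = env.reset(), False, []
    while not done:
        a = policy(s)                            # act according to pi
        s_next, reward, done, _ = env.step(a)
        trajectory.append((s, reward))
        s = s_next
    G = 0.0
    for s, reward in reversed(trajectory):       # accumulate returns backwards
        G = reward + gamma * G
        N[s] += 1
        V[s] += (G - V[s]) / N[s]                # running-mean update
        # or: V[s] += alpha * (G - V[s])         # constant step size, forgets old episodes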
Monte Carlo methods policy iteration using action-value function
Problem: model-free learning. Solution: Q
Example: caching Q-values
Simply greedily improving the policy over V(s) requires a model:
$$\pi'(s) = \arg\max_{a \in \mathcal{A}} \; \mathcal{R}^a_s + \mathcal{P}^a_{ss'} V(s'),$$
whereas greedy policy improvement over Q(s, a) is model-free:
$$\pi'(s) = \arg\max_{a \in \mathcal{A}} Q(s, a)$$
Follow along in Colab.
Monte Carlo methods don’t just be greedy!
Algorithm: greedy MC that will get stuck
import numpy as np

# n_states, n_actions, num_episodes and a gym-style env are assumed to exist
Q = np.zeros([n_states, n_actions])
n_visits = np.zeros([n_states, n_actions])

for episode in range(num_episodes):
    s, done = env.reset(), False
    results_list, result_sum = [], 0.0
    while not done:
        a = np.argmax(Q[s, :])                 # always act greedily: no exploration, so it gets stuck
        s_next, reward, done, _ = env.step(a)
        results_list.append((s, a))
        result_sum += reward
        s = s_next
    for (s, a) in results_list:                # update Q with the episode return
        n_visits[s, a] += 1.0
        alpha = 1.0 / n_visits[s, a]
        Q[s, a] += alpha * (result_sum - Q[s, a])
Monte Carlo methods ε-greedy exploration
Definition: ε-greedy exploration
Problem: local minima
The simplest idea to avoid local minima is:
• choose a random action with probability ε
• choose the action greedily with probability 1 − ε
• so that all m actions are tried with non-zero probability
This gives the updated policy:
$$\pi(a \mid s) = \begin{cases} \epsilon/m + 1 - \epsilon & \text{if } a^* = \arg\max_{a \in \mathcal{A}} Q(s, a) \\ \epsilon/m & \text{otherwise} \end{cases}$$
Proof of convergence in Equation 5.2 of [1]
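A minimal sketch of this distribution in NumPy (the function names are illustrative): every one of the m actions gets probability ε/m, and the greedy action receives the extra 1 − ε:

import numpy as np

def epsilon_greedy_probs(q_row, eps):
    m = len(q_row)
    probs = np.full(m, eps / m)           # every action: eps/m
    probs[np.argmax(q_row)] += 1.0 - eps  # greedy action: eps/m + 1 - eps
    return probs

def epsilon_greedy_action(q_row, eps, rng=np.random.default_rng()):
    return rng.choice(len(q_row), p=epsilon_greedy_probs(q_row, eps))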
Monte Carlo methods don’t just explore!
Asymptotically we can’t just explore...
Monte Carlo methods greedy at the limit of infinite exploration
Definition: greedy in the limit with infinite exploration (GLIE)
Defines a schedule for exploration, such that these two conditions are met:
1. You continue to explore everything:
$$\lim_{k \to \infty} N_k(s, a) = \infty$$
2. The policy converges on a greedy policy:
$$\lim_{k \to \infty} \pi_k(a \mid s) = \mathbf{1}\!\left(a = \arg\max_{a' \in \mathcal{A}} Q_k(s, a')\right)$$
Monte Carlo methods greedy at the limit of infinite exploration
Algorithm: greedy at the limit of ∞ exploration
# ... Q and n_visits initialised as before
for episode in range(num_episodes):
    s, done = env.reset(), False
    results_list, result_sum = [], 0.0
    while not done:
        epsilon = min(1.0, 10000.0 / (episode + 1))   # GLIE schedule: epsilon decays towards 0
        if np.random.rand() > epsilon:
            a = np.argmax(Q[s, :])                    # exploit: greedy action
        else:
            a = env.action_space.sample()             # explore: random action
        s_next, reward, done, _ = env.step(a)
        results_list.append((s, a))
        result_sum += reward
        s = s_next
    for (s, a) in results_list:                        # update Q with the episode return
        n_visits[s, a] += 1.0
        alpha = 1.0 / n_visits[s, a]
        Q[s, a] += alpha * (result_sum - Q[s, a])
Take Away Points
Summary
In summary, Monte Carlo RL methods:
• are a solution to the reinforcement learning
problem
• require training with complete episodes
• are model-free
• can balance exploration vs exploitation
• eventually converge on the optimal
action-value function
References I
[1] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction (second edition). MIT Press, 2018. Available online.
[2] David Silver. Reinforcement Learning lectures. https://www.davidsilver.uk/teaching/. 2015.
[3] Rohan Sawhney and Keenan Crane. "Monte Carlo geometry processing: a grid-free approach to PDE-based methods on volumetric domains". In: ACM Transactions on Graphics (TOG) 39.4 (2020), Article 123.