Reinforcement Learning (2)
Recall that we can solve the Bellman equations for the utilities of states…
U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a) U(s')
…to find the optimal policy 𝜋 ∗ from the utilities of states 𝑈 𝑠 .
\pi^*(s) = \operatorname{argmax}_{a \in A(s)} \sum_{s'} P(s' \mid s, a) U(s')
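As a minimal sketch of these two steps (assuming a transition model stored as P[s][a] = list of (next_state, probability) pairs and a reward table R[s], which are representation choices not given in the slides):

# Sketch: one round of Bellman backups, then greedy policy extraction.
# Assumes every state has at least one available action.
def bellman_backup(states, actions, P, R, U, gamma=1.0):
    # U(s) = R(s) + gamma * max_a sum_s' P(s'|s,a) U(s')
    return {s: R[s] + gamma * max(sum(p * U[s2] for s2, p in P[s][a])
                                  for a in actions(s))
            for s in states}

def greedy_policy(states, actions, P, U):
    # pi*(s) = argmax_a sum_s' P(s'|s,a) U(s')
    return {s: max(actions(s), key=lambda a: sum(p * U[s2] for s2, p in P[s][a]))
            for s in states}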
Recap: Grid World: Utilities of States
(Figure: utilities of the grid-world states, computed for γ = 1.)
Recap: Policies and Utilities of States
Recall that we used the Bellman equation…
U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a) U(s')
…to find the optimal policy 𝜋 ∗ from the utilities of states 𝑈 𝑠 .
…and that, even when the action from the optimal policy is
taken, it may not be optimal in the context of the rest of the
agent’s policy.
Recall that, in the context of the multi-armed bandit problem, we said that:
As we get information, we don’t want to keep taking seemingly bad
options frequently…
…but a finite number of trials is never enough to be certain about the
result of a stochastic process – we are never done with exploration.
Our two previous algorithms explore, but don’t prioritise their exploration.
We can encourage exploration of unvisited states by optimistically
estimating their utility, for example with an exploration function
f(u, n) = R⁺ if n < 𝑁𝑒 , and u otherwise,
applied with n = 𝑁(𝑠, 𝑎), where R⁺ is an optimistic estimate of the best possible
reward, 𝑁(𝑠, 𝑎) is the count of times the agent has taken action 𝑎 in state 𝑠,
and 𝑁𝑒 is a parameter of the algorithm.
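A minimal sketch of such an exploration function; the optimistic value R_PLUS and threshold N_E below are assumed illustrative values, not taken from the slides:

# Sketch: optimistic exploration function f(u, n).
R_PLUS = 2.0   # assumed optimistic estimate of the best possible reward
N_E = 5        # assumed exploration threshold N_e

def exploration_f(u, n, r_plus=R_PLUS, n_e=N_E):
    # Pretend a (state, action) pair is worth r_plus until it has been
    # tried at least n_e times; afterwards use the learned estimate u.
    return r_plus if n < n_e else u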
Grid World: Exploration Function
SARSA and Q-learning
TDL For Active Reinforcement Learning
Recall the TD update for the utility of a state:
U(s) = U(s) + \alpha(N_s(s)) [ R(s) + \gamma U(s') - U(s) ]
…and the greedy policy derived from those utilities:
\pi^*(s) = \operatorname{argmax}_{a \in A(s)} \sum_{s'} P(s' \mid s, a) U(s')
Note that choosing actions this way requires a model of the transition probabilities P(s′ | s, a).
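A minimal sketch of this TD update, assuming utilities and visit counts kept in dictionaries and a decaying learning rate α(n) = 60/(59 + n) (an assumed schedule, not fixed by the slides):

# Sketch: one TD update of U(s) after observing reward R(s) and successor s'.
def alpha(n):
    return 60.0 / (59.0 + n)   # assumed decaying learning-rate schedule

def td_update(U, Ns, s, s_next, reward, gamma=1.0):
    Ns[s] = Ns.get(s, 0) + 1
    u_s, u_next = U.get(s, 0.0), U.get(s_next, 0.0)
    # U(s) = U(s) + alpha(N_s(s)) * (R(s) + gamma * U(s') - U(s))
    U[s] = u_s + alpha(Ns[s]) * (reward + gamma * u_next - u_s)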
One approach is not to learn the utilities of states 𝑈(𝑠), but instead to
learn Q-values 𝑄(𝑠, 𝑎) – the expected utility of taking action 𝑎 in state 𝑠.
U(s) = \max_{a} Q(s, a)
and to greedily choose actions:
\pi^*(s) = \operatorname{argmax}_{a} Q(s, a)
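As a sketch, with the Q-values kept in a dictionary keyed by (state, action) pairs (an assumed representation):

# Sketch: utility of a state and greedy action choice from a Q-table.
def utility(Q, s, actions):
    return max(Q.get((s, a), 0.0) for a in actions)        # U(s) = max_a Q(s, a)

def greedy_action(Q, s, actions):
    return max(actions, key=lambda a: Q.get((s, a), 0.0))  # argmax_a Q(s, a)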
SARSA
In state 𝑠 it takes action 𝑎, resulting in reward 𝑅(𝑠) and new state 𝑠′, in which it then chooses its next action 𝑎′:
Q(s, a) = Q(s, a) + \alpha [ R(s) + \gamma Q(s', a') - Q(s, a) ]
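A minimal sketch of one SARSA update under the same assumed Q-table representation; α and γ are assumed parameters:

# Sketch: one SARSA update, using the action a_next actually chosen in s_next.
def sarsa_update(Q, s, a, reward, s_next, a_next, alpha=0.1, gamma=1.0):
    q_sa = Q.get((s, a), 0.0)
    target = reward + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = q_sa + alpha * (target - q_sa)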
Q-learning
In state 𝑠 it takes action 𝑎, resulting in reward 𝑅(𝑠) and new state 𝑠′:
Q(s, a) = Q(s, a) + \alpha [ R(s) + \gamma \max_{a'} Q(s', a') - Q(s, a) ]
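And a corresponding sketch of one Q-learning update, which maximises over the actions available in the next state rather than using the action actually taken:

# Sketch: one Q-learning update; the target maximises over next actions.
def q_learning_update(Q, s, a, reward, s_next, actions, alpha=0.1, gamma=1.0):
    q_sa = Q.get((s, a), 0.0)
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    Q[(s, a)] = q_sa + alpha * (reward + gamma * best_next - q_sa)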
SARSA and Q-Learning Compared
(Figure: a three-state MDP with states 𝑠0, 𝑠1, 𝑠2 and actions 𝑎0, 𝑎1; arcs are labelled with transition probabilities 0.1, 0.5, 0.9 and 1.0; rewards are 𝑅(𝑠0) = −0.50, 𝑅(𝑠1) = −0.75, 𝑅(𝑠2) = −0.10.)
SARSA and Q-learning Illustrated
Imagine that we have run one of SARSA or Q-learning with an 𝜖-greedy policy
and obtained the following estimates of Q-values:
𝑄(𝑠, 𝑎)     𝑠0       𝑠1
𝑎0         −0.8     −1.35
𝑎1         −0.7     −0.85
Let us assume that the agent starts in state 𝑠0 and chooses to take action 𝑎1 ,
receiving reward -0.5 and transitioning to state 𝑠1 .
For the purposes of the SARSA update, we will assume that, in this state, it
chooses action 𝑎0 (exploration).
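As a small numeric sketch of the update this sets up (α = 0.1 and γ = 1.0 are assumed values, not given above), SARSA and Q-learning would move Q(𝑠0, 𝑎1) to different places:

# Worked example with the Q-values from the table above; alpha and gamma assumed.
alpha, gamma = 0.1, 1.0
Q = {("s0", "a0"): -0.80, ("s0", "a1"): -0.70,
     ("s1", "a0"): -1.35, ("s1", "a1"): -0.85}
reward = -0.5                                        # R(s0)

# SARSA: target uses the action actually chosen in s1, here a0 (exploration).
sarsa_target = reward + gamma * Q[("s1", "a0")]      # -0.5 + -1.35 = -1.85
print(Q[("s0", "a1")] + alpha * (sarsa_target - Q[("s0", "a1")]))   # ≈ -0.815

# Q-learning: target uses the best action available in s1, here a1.
q_target = reward + gamma * max(Q[("s1", "a0")], Q[("s1", "a1")])   # -0.5 + -0.85 = -1.35
print(Q[("s0", "a1")] + alpha * (q_target - Q[("s0", "a1")]))       # ≈ -0.765

Because SARSA backs up the exploratory action 𝑎0 while Q-learning backs up the best available action 𝑎1, SARSA’s estimate of Q(𝑠0, 𝑎1) drops further here.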
SARSA and Q-learning Illustrated