
Machine Learning

Reinforcement Learning
(2)

Dr. Harry Goldingay


[email protected]
Learning Outcomes

At the end of this lecture you should:

Understand the difference between active and passive reinforcement learning, and understand the role of exploration in the former.

Understand two active reinforcement learning algorithms – SARSA and Q-learning – well enough to implement them.

Be aware of other important approaches to reinforcement learning.
Motivation
Aims of Reinforcement Learning

In the previous unit, we introduced passive reinforcement learning…
Given a policy, how do we evaluate it?

…but when talking about MDPs more generally, we did better: we could find optimal policies.

Can we do the same in a reinforcement learning setting?


Scenario 1

Our agent has a policy.

It acts in the environment following this policy.

It uses the information it observes to learn about the utility of the environmental states.

Based on this information, it calculates the optimal policy.

Any issues with this?


Scenario 2

Our agent has no policy – instead it acts randomly.

It acts in the environment.

It uses the information it observes to learn about the utility of the environmental states.

Based on this information, it calculates the optimal policy.

Any issues with this?


Which Slot Machine To Play?
How to Act in Grid World?
Active Reinforcement Learning
Greedy Algorithm

We were able to use the Bellman equations…

U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s'|s,a) U(s')

…to find the optimal policy π* from the utilities of states U(s):

\pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s'|s,a) U(s')
Recap: Grid World: Utilities of States

For 𝛾 = 1
Recap: Policies and Utilities of States

If we know U(s) for each state, we can use it to infer the optimal policy:
Choose the action which results in the highest expected utility of the next state.

What action should an agent take in state (3,1) in the previous example?
Up: 0.8 × 0.660 + 0.1 × 0.655 + 0.1 × 0.388 ≈ 0.63
Left: 0.8 × 0.655 + 0.1 × 0.660 + 0.1 × 0.611 ≈ 0.65
The agent should choose Left.

Key takeaway: π* can be inferred directly from utility.
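A quick numerical check of these two expected utilities (a minimal sketch; the utilities are the grid-world values from the recap slide, and the 0.8/0.1/0.1 transition model is the one from the earlier unit):

```python
# Expected utility of Up vs Left from state (3,1), using the recap utilities.
U = {(2, 1): 0.655, (3, 1): 0.611, (3, 2): 0.660, (4, 1): 0.388}

up   = 0.8 * U[(3, 2)] + 0.1 * U[(2, 1)] + 0.1 * U[(4, 1)]
left = 0.8 * U[(2, 1)] + 0.1 * U[(3, 2)] + 0.1 * U[(3, 1)]  # bumping the wall keeps the agent in (3,1)

print(round(up, 3), round(left, 3))  # 0.632 0.651 -> Left is preferred
```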


Greedy Agent

We were able to use the Bellman equations…

U(s) = R(s) + \gamma \max_{a \in A(s)} \sum_{s'} P(s'|s,a) U(s')

…to find the optimal policy π* from the utilities of states U(s).

A greedy agent acts in the environment to estimate the required quantities…
U(s)
R(s)
P(s'|s,a)
…and then at each step chooses the action which maximises expected utility from its current state.

The choice of action makes this an example of active reinforcement learning.
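As a rough sketch (names are illustrative, not from the slides), greedy action selection from learned estimates might look like this, where U_hat maps states to estimated utilities and P_hat maps (state, action) pairs to estimated next-state distributions:

```python
def greedy_action(state, actions, P_hat, U_hat):
    """Choose the action with the highest expected utility of the next state."""
    def expected_utility(a):
        return sum(p * U_hat.get(s_next, 0.0)
                   for s_next, p in P_hat.get((state, a), {}).items())
    return max(actions, key=expected_utility)
```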


Greedy Agent: Multi-armed Bandit
Greedy Agent: Grid World
Greedy Agent: Grid World Policy
Exploration

We have seen in the previous examples that the greedy algorithm is flawed.

Taking the current estimated optimal action restricts data gathering – it may mean that the optimal action is not discovered…

…and that, even when the action from the optimal policy is taken, it may not be optimal in the context of the rest of the agent's policy.

Greedy agents exploit their knowledge – we need an agent which also explores.
𝜖-greedy Algorithm

We have already talked about an excellent way of exploring – acting randomly in all states!

Can we get the advantages of both the greedy and the random approaches?

Simple solution: the ε-greedy algorithm:
Act randomly with probability ε (some parameter of the algorithm).
Otherwise act greedily.

This algorithm makes the trade-off between exploration and exploitation clear.
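A minimal sketch of ε-greedy action selection, assuming a Q-table stored as a dict keyed by (state, action) pairs (the names Q, actions, epsilon are illustrative):

```python
import random

def epsilon_greedy(state, actions, Q, epsilon=0.1):
    """Act randomly with probability epsilon, otherwise act greedily w.r.t. Q."""
    if random.random() < epsilon:
        return random.choice(actions)                          # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))  # exploit
```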
Balancing Exploration and Exploitation

Recall that, in the context of the multi-armed bandit problem, we said that:
As we get information, we don't want to keep taking seemingly bad options frequently…
…but a finite number of trials is never enough to be certain about the result of a stochastic process – we are never done with exploration.

In theory we want our algorithms to be greedy in the limit with infinite exploration (GLIE):
Infinite exploration: they should take all actions in all states an unbounded number of times.
Greedy in the limit: they should approach greedy behaviour arbitrarily closely as time increases.

Simple example, ε_t-greedy: at step t the agent acts randomly a fraction 1/t of the time.
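A tiny sketch of such a schedule (assuming t counts completed steps, starting at 1): ε decays as 1/t, so every action is still taken unboundedly often in the limit while behaviour becomes arbitrarily close to greedy.

```python
def epsilon_schedule(t):
    """GLIE-style exploration rate: act randomly with probability 1/t."""
    return 1.0 / max(t, 1)
```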
Exploration Function

Our two previous algorithms explore, but don't prioritise their exploration.
We can encourage exploration of unvisited states by optimistically estimating their utility.

Consider an active version of the previously discussed value iteration algorithm, in which an agent acts greedily at each step based on an optimistic estimated utility U^+(s), updated as follows:

U^+(s) = R(s) + \gamma \max_{a} f\Big( \sum_{s'} P(s'|s,a) U^+(s'),\ N(s,a) \Big)

where

f(u, n) = \begin{cases} R^+ & \text{if } n < N_e \\ u & \text{otherwise} \end{cases}

and where N(s,a) is the count of times the agent has taken action a in state s, R^+ is an optimistic estimate of the best possible reward, and N_e is a parameter of the algorithm.
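A minimal sketch of the exploration function and the optimistic backup, assuming tabular estimates P_hat and R_hat, visit counts N, and illustrative values for the optimistic reward R_PLUS and threshold N_E:

```python
R_PLUS, N_E, GAMMA = 1.0, 5, 0.9   # illustrative parameter values

def f(u, n):
    """Exploration function: treat rarely-tried (s, a) pairs optimistically."""
    return R_PLUS if n < N_E else u

def update_U_plus(s, actions, P_hat, R_hat, U_plus, N):
    """One optimistic value-iteration-style backup for state s."""
    def expected(a):
        return sum(p * U_plus.get(s2, 0.0)
                   for s2, p in P_hat.get((s, a), {}).items())
    return R_hat[s] + GAMMA * max(f(expected(a), N.get((s, a), 0)) for a in actions)
```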
Grid World: Exploration Function
SARSA and Q-learning
TDL For Active Reinforcement Learning

Recall the update equation for TDL:

U(s) = U(s) + \alpha(N_s(s)) [ R(s) + \gamma U(s') - U(s) ]

And our criterion for choosing an action based on the Bellman equations:

\pi^*(s) = \arg\max_{a \in A(s)} \sum_{s'} P(s'|s,a) U(s')

What issues does this imply in an active reinforcement learning setting?
The passive version of TDL does not give us enough information to choose an action!
We would also need to know/learn the transition model P(s'|s,a).
Q-values

Can we adapt the ideas of TDL to give us a model-free approach to active reinforcement learning?

One approach is not to learn the utilities of states U(s), but instead to learn Q-values Q(s,a) – the expected utility of taking action a in state s.

Note that, given Q-values, it is easy to recover state utilities:

U(s) = \max_{a} Q(s,a)

and to greedily choose actions:

\pi^*(s) = \arg\max_{a} Q(s,a)
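A minimal sketch of both equations over a Q-table dict keyed by (state, action) (illustrative names):

```python
def utility(state, actions, Q):
    """U(s) = max_a Q(s, a)."""
    return max(Q.get((state, a), 0.0) for a in actions)

def greedy_choice(state, actions, Q):
    """pi*(s) = argmax_a Q(s, a)."""
    return max(actions, key=lambda a: Q.get((state, a), 0.0))
```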
SARSA

How do we learn Q-values? And how do we act on them?

In SARSA (State-Action-Reward-State-Action) an agent has some policy π based on Q (e.g. ε-greedy).

In state s it takes action a, resulting in reward R(s) and new state s'.

In state s' it takes action a'.

It updates the Q-values as follows:

Q(s,a) = Q(s,a) + \alpha [ R(s) + \gamma Q(s',a') - Q(s,a) ]

The updated Q-values may cause the policy to change.
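A minimal sketch of one SARSA step, reusing the Q-table and the ε-greedy selector sketched earlier; ALPHA and GAMMA are illustrative learning-rate and discount parameters:

```python
ALPHA, GAMMA = 0.1, 0.9   # illustrative values

def sarsa_update(Q, s, a, r, s_next, a_next):
    """On-policy update: the target uses the action a_next actually chosen in s_next."""
    target = r + GAMMA * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + ALPHA * (target - Q.get((s, a), 0.0))
```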


Q-Learning

Q-learning is a related approach.

An agent has some policy π based on Q (e.g. ε-greedy).

In state s it takes action a, resulting in reward R(s) and new state s'.

It updates the Q-values as follows:

Q(s,a) = Q(s,a) + \alpha [ R(s) + \gamma \max_{a'} Q(s',a') - Q(s,a) ]

The updated Q-values may cause the policy to change.
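A minimal sketch of one Q-learning step over the same Q-table (ALPHA and GAMMA as in the SARSA sketch); the only change is that the target uses the best action in the new state rather than the action actually taken:

```python
def q_learning_update(Q, s, a, r, s_next, actions):
    """Off-policy update: the target uses max over a' of Q(s', a')."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    target = r + GAMMA * best_next
    Q[(s, a)] = Q.get((s, a), 0.0) + ALPHA * (target - Q.get((s, a), 0.0))
```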


TDL vs SARSA vs Q-learning

Compare the update equations.

Passive Temporal Difference Learning:

U(s) = U(s) + \alpha(N_s(s)) [ R(s) + \gamma U(s') - U(s) ]

SARSA (the expected utility term uses the action actually taken from the policy):

Q(s,a) = Q(s,a) + \alpha [ R(s) + \gamma Q(s',a') - Q(s,a) ]

Q-learning (the expected utility term is based on the greedy policy):

Q(s,a) = Q(s,a) + \alpha [ R(s) + \gamma \max_{a'} Q(s',a') - Q(s,a) ]
SARSA and Q-Learning Compared

[Figure: grid-world example comparing SARSA and Q-learning, with a state of reward −10^6 and action choices marked "?".]
SARSA and Q-Learning Compared

If we knew the true Q-values, then behaving greedily would be optimal…
Q-learning tries to learn this optimal policy regardless of the actual policy the agent is using.
It is an off-policy learning algorithm.

…however, if an agent is not following a greedy policy (e.g. because it is exploring), then basing actions on Q-values learned for a greedy policy can cause the agent to act suboptimally.
SARSA instead tries to learn the optimal Q-values for its current policy.
It is an on-policy learning algorithm.

…and can cause issues with convergence.

SARSA and Q-learning Illustrated

[Figure: a three-state MDP with states s0, s1, s2 and actions a0, a1; arrows show transition probabilities (1.0, 0.9, 0.5, 0.5, 0.1). Rewards: R(s0) = −0.50, R(s1) = −0.75, R(s2) = −0.10.]
SARSA and Q-learning Illustrated

Imagine that we have run one of SARSA or Q-learning with an ε-greedy policy and obtained the following estimates of the Q-values (rows are actions, columns are states):

Q(s,a)    s0      s1
a0       -0.80   -1.35
a1       -0.70   -0.85

What is the greedy policy?
Take action a1 in state s0 and action a1 in state s1.

Let us assume that the agent starts in state s0 and chooses to take action a1, receiving reward -0.5 and transitioning to state s1.

For the purposes of the SARSA update, we will assume that, in this state, it chooses action a0 (exploration).
SARSA and Q-learning Illustrated

Sequence: {s0, a1} −0.5→ {s1, a0}

The agent has taken action a1 in state s0, so it must update Q(s0, a1).
We will set α = 0.1 and γ = 0.75.

SARSA (the target uses a0, the action actually chosen in state s1):
Q(s_0, a_1) = Q(s_0, a_1) + \alpha [ R(s_0) + \gamma Q(s_1, a_0) - Q(s_0, a_1) ]
Q(s_0, a_1) = -0.7 + 0.1 (-0.5 - 0.75 \times 1.35 + 0.7) \approx -0.78

Q-learning (the target uses the best action in state s1):
Q(s_0, a_1) = Q(s_0, a_1) + \alpha [ R(s_0) + \gamma \max_{a'} Q(s_1, a') - Q(s_0, a_1) ]
Q(s_0, a_1) = -0.7 + 0.1 (-0.5 - 0.75 \times 0.85 + 0.7) \approx -0.74
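A quick numerical check of the two updates above (values taken from the slides):

```python
Q = {("s0", "a0"): -0.80, ("s0", "a1"): -0.70,
     ("s1", "a0"): -1.35, ("s1", "a1"): -0.85}
alpha, gamma, r = 0.1, 0.75, -0.5

sarsa = Q[("s0", "a1")] + alpha * (r + gamma * Q[("s1", "a0")] - Q[("s0", "a1")])
qlearn = Q[("s0", "a1")] + alpha * (
    r + gamma * max(Q[("s1", "a0")], Q[("s1", "a1")]) - Q[("s0", "a1")])

print(round(sarsa, 3), round(qlearn, 3))  # -0.781 and -0.744, i.e. ~-0.78 and ~-0.74
```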
Generalisation
Scaling Up

The approaches we have discussed so far depend on estimating quantities per state…
Q(s,a)

…or even quadratic in the number of states:
P(s'|s,a)

We need (multiple) samples for each state.

Chess has been estimated to have 10^40 board states.

We need some way to generalise from data we have gathered about states we have visited.
Generalisation – The Problem

Assume that we want to generalise Q-values. Then:

We have a series of state-action pairs:
[(σ_1, α_1), (σ_2, α_2), …, (σ_N, α_N)]

For each state-action pair we have a sample of the corresponding Q-value:
[Q(σ_1, α_1), Q(σ_2, α_2), …, Q(σ_N, α_N)]

We want to predict the Q-value function for unseen state-action pairs:
Q(σ, α)
Function Approximation

The problem we have just described is just a supervised learning problem.

We choose some model to approximate our Q-value function:
\hat{Q}_{\theta}(s, a)

We want to update our parameters θ to minimise some error.
We will use the sum-of-squares error.

We can use on-line learning to update the model parameters after each trial.
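As a sketch (not from the slides): with a squared TD error, a standard online gradient step towards a Q-learning-style target is θ ← θ + α [R(s) + γ max_{a'} \hat{Q}_θ(s',a') − \hat{Q}_θ(s,a)] ∇_θ \hat{Q}_θ(s,a). A minimal linear-feature version, with illustrative names, might look like this:

```python
import numpy as np

def q_hat(theta, phi, s, a):
    """Linear approximation: Q_hat(s, a) = theta . phi(s, a)."""
    return float(theta @ phi(s, a))

def online_update(theta, phi, s, a, r, s_next, actions, alpha=0.01, gamma=0.9):
    """One gradient step on the squared TD error (gradient of Q_hat w.r.t. theta is phi)."""
    target = r + gamma * max(q_hat(theta, phi, s_next, a2) for a2 in actions)
    td_error = target - q_hat(theta, phi, s, a)
    return theta + alpha * td_error * phi(s, a)
```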
Other Approaches
Dynamic Programming

We have focussed primarily on model-free approaches to reinforcement learning (temporal difference approaches)…

…but model-based approaches, such as adaptive dynamic programming, are also successful:
Learn the model by acting in the environment.
Use offline approaches (e.g. value iteration) to infer the optimal policy.

These approaches often converge faster than model-free approaches…

…but don't scale as well.
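A minimal sketch of the model-learning half of adaptive dynamic programming, assuming tabular counts (illustrative names); the earlier unit's value iteration would then be run on the estimated model P_hat and R_hat:

```python
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s_next]
R_hat = {}                                       # observed reward for each state

def record(s, a, r, s_next):
    """Update the model estimates from one observed transition."""
    counts[(s, a)][s_next] += 1
    R_hat[s] = r

def P_hat(s, a):
    """Estimated transition distribution P(s' | s, a) from visit counts."""
    total = sum(counts[(s, a)].values())
    return {s2: n / total for s2, n in counts[(s, a)].items()} if total else {}
```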


Policy Search

We have introduced ways of constructing policies from parameters (e.g. Q-values).

We can estimate the quality of a policy by running trials of an agent with that policy in the environment and measuring the observed reward.

We want to maximise the quality of the policy.

This is an optimisation problem! We can apply tools such as gradient descent.
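A minimal sketch of policy search using simple hill climbing rather than gradient descent (run_trials and the parameter vector theta are illustrative placeholders: run_trials(theta, n) is assumed to return the average observed reward over n trials of the policy parameterised by theta):

```python
import random

def hill_climb(theta, run_trials, steps=100, noise=0.1, n_trials=20):
    """Keep a random perturbation of the policy parameters whenever it improves measured reward."""
    best = run_trials(theta, n_trials)
    for _ in range(steps):
        candidate = [t + random.gauss(0, noise) for t in theta]
        score = run_trials(candidate, n_trials)
        if score > best:
            theta, best = candidate, score
    return theta
```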
Conclusion

You should now:

Understand the difference between active and passive reinforcement learning and understand the role of exploration in the former.

Understand two active reinforcement learning algorithms – SARSA and Q-learning – well enough to implement them.

Be aware of other important approaches to reinforcement learning.
