Reinforcement Learning
Geoff Hulten
Reinforcement Learning
• Learning to interact with an environment
  • Robots, games, process control
  • With limited human training
  • Where the ‘right thing’ isn’t obvious
[Diagram: the Agent takes an Action in the Environment; the Environment returns the new State and a Reward]
• Supervised Learning:
  • Goal: learn a function that predicts the label from the features
  • Data: labeled examples <x, y>
• Reinforcement Learning:
  • Goal: Maximize the (discounted) sum of rewards the agent collects
  • Data: <state, action, reward> sequences from interacting with the environment
TD-Gammon – Tesauro ~1995
State: Board State
Actions: Valid Moves
Reward: Win or Lose
• Net with 80 hidden units, initialized to random weights
• Select move based on the network’s estimate of P(win) & a shallow search
• Learn by playing against itself
• 1.5 million games of training -> competitive with world-class players
Atari 2600 games
State: Raw Pixels
Actions: Valid Moves
Reward: Game Score
• Same model/parameters for ~50 games
https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf
Robotics and Locomotion
State: Joint States/Velocities, Accelerometer/Gyroscope, Terrain
Actions: Apply Torque to Joints
Reward: Velocity – { stuff }
https://youtu.be/hx_bgoTF7bs
2017 paper https://arxiv.org/pdf/1707.02286.pdf
AlphaGo
State: Board State
Actions: Valid Moves
Reward: Win or Lose
• Learning how to beat humans at ‘hard’ games (search space too big)
• Far surpasses (human) supervised learning
• The algorithm learned to outplay humans at chess in 24 hours
https://deepmind.com/documents/119/agz_unformatted_nature.pdf
How Reinforcement Learning is Different
• Delayed Reward
• Agent chooses training data
• Explore vs Exploit (lifelong learning)
• Very different terminology (can be confusing)
Setup for Reinforcement Learning

Markov Decision Process (environment):
• Discrete-time stochastic control process
• Each time step, t:
  • Agent chooses action a_t from the set of available actions
  • Moves to new state s_{t+1} with probability P(s_{t+1} | s_t, a_t) – the probability of moving to each state
  • Receives reward R(s_t, a_t) – the reward for making that move
• Every outcome depends on s_t and a_t
• Nothing depends on previous states/actions (the Markov property)

Policy (agent’s behavior):
• π(s) – the action to take in state s
• Goal: maximize the expected discounted reward, E[ Σ_t γ^t r_t ]
  • γ – tradeoff between immediate and future reward
• V^π(s) – the value of being in state s when following π
Simple Example of Agent in an Environment

State: Map locations <0,0> through <2,2> (a 3x3 grid)
Actions: Move within the map; reaching the chest ends the episode
Reward: 100 at the chest, 0 for all other moves
[Figure: the 3x3 grid of cells <0,0>–<2,2>, with the chest (reward 100) in the corner cell <2,0>]
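The grid world is small enough to write down directly. A minimal sketch (my own illustration, not code from the slides), assuming the chest sits at <2,0> and moves are clipped at the map edges:

# Hypothetical sketch of the slides' 3x3 grid world.
# Entering the chest cell ends the episode with reward 100; everything else is reward 0.
CHEST = (2, 0)
MOVES = {"up": (0, -1), "down": (0, 1), "left": (-1, 0), "right": (1, 0)}

def step(state, action):
    """Apply a move, clipping to the 3x3 map. Returns (new_state, reward, done)."""
    x, y = state
    dx, dy = MOVES[action]
    new_state = (min(max(x + dx, 0), 2), min(max(y + dy, 0), 2))
    if new_state == CHEST:
        return new_state, 100, True   # reaching the chest ends the episode
    return new_state, 0, False

print(step((1, 0), "right"))   # -> ((2, 0), 100, True)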
Policies

A policy π specifies the action to take in each state, e.g. “Move to <1,1>”, “Move to <0,1>”, “Move to <1,0>”, “Move to <2,0>”.

Evaluating a policy: V^π(s) is the discounted reward earned by following π starting from state s.
[Figure: the 3x3 grid annotated with one policy’s moves and the resulting values, e.g. 100 next to the chest, 50 one step further back, 12.5 further away]

This policy could be better.
Q learning

Learn a policy that optimizes V for all states, using:
• No prior knowledge of the state transition probabilities: P(s' | s, a)
• No prior knowledge of the reward function: R(s, a)

Approach:
• Initialize the estimate of discounted reward for every state/action pair: Q̂(s, a) = 0
• Repeat (for a while):
  • Take a random action a from state s (see the exploration policy below)
  • Receive s' and r from the environment
  • Update: Q̂(s, a) = r + γ * max_a' Q̂(s', a')
  • Random restart if in a terminal state
• Exploration policy: choose action a_i in state s with probability proportional to 1 / (1 + visits(s, a_i))
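A minimal tabular Q-learning sketch of this loop (my own illustration, building on the grid-world sketch above; the episode count and the deterministic update are assumptions matching the slides’ simple example):

import random
from collections import defaultdict

GAMMA = 0.5                      # discount factor, as in the slides' example
ACTIONS = list(MOVES)            # reuses the grid-world sketch above
Q = defaultdict(float)           # Q[(state, action)], initialized to 0
visits = defaultdict(int)

def choose_action(state):
    """Exploration policy: P(a) proportional to 1 / (1 + visits(s, a))."""
    weights = [1.0 / (1 + visits[(state, a)]) for a in ACTIONS]
    return random.choices(ACTIONS, weights=weights)[0]

for episode in range(5000):
    state = (random.randint(0, 2), random.randint(0, 2))   # random restart
    while state != CHEST:
        action = choose_action(state)
        visits[(state, action)] += 1
        new_state, reward, done = step(state, action)
        # Deterministic update: reward plus discounted value of the best next action
        Q[(state, action)] = reward + GAMMA * max(Q[(new_state, a)] for a in ACTIONS)
        state = new_state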
Example of Q learning (round 1)
• Initialize Q̂(s, a) to 0 for every state/action pair
• Start in a random initial state
• Take a random action; the reward is 0 and the next state’s best Q̂ is 0, so the update leaves Q̂(s, a) = 0
• Take another random action and update again; values stay 0 until a move reaches the chest, which sets that move’s Q̂ to 100
• No more moves possible (the episode ended at the chest), start again…
[Figure: the 3x3 grid showing the Q̂ value of every move, all still 0 except the move into the chest]
Example of Q learning (round 2), γ = 0.5
• Start in a new random initial state
• Take a random action; where no reward has propagated yet, the update still leaves Q̂ = 0
• Take a random action that lands next to the chest and update: Q̂(s, a) = 0 + 0.5 * 100 = 50
• No more moves possible, start again…
[Figure: the 3x3 grid; the move into the square adjacent to the chest now has Q̂ = 50]
Example of Q learning (some acceleration…), γ = 0.5
• Keep starting in random states and taking exploratory actions
• Whenever a move lands in a state whose best action already has a nonzero Q̂, the update propagates value backward: 0 + 0.5 * 100 = 50, then 0 + 0.5 * 50 = 25, …
[Figure: the 3x3 grid; Q̂ values of 100, 50, and 25 spreading outward from the chest]
Example of Q learning (some acceleration…), γ = 0.5
• More runs push the values further from the chest: moves two steps away reach Q̂ = 25, the next ring reaches 12.5, and so on
[Figure: the 3x3 grid; Q̂ values of 100, 50, and 25 now cover most of the map]
Example of Q learning (after many, many runs…)
• Q̂ has converged: with γ = 0.5 the values halve with each step away from the chest (100, 50, 25, 12.5, 6.25)
• The learned policy is π(s) = argmax_a Q̂(s, a): in every state, take the highest-valued action, which always moves toward the chest
[Figure: the 3x3 grid with the converged Q̂ value for every state/action pair]
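Reading the policy out of the converged table is just an argmax. A small follow-on sketch (same assumptions as the Q-learning sketch above):

def greedy_policy(state):
    """pi(s) = argmax_a Q_hat(s, a): act greedily with respect to the learned values."""
    return max(ACTIONS, key=lambda a: Q[(state, a)])

print(greedy_policy((0, 2)))   # after training, expected to head toward the chest at (2, 0)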
Challenges for Reinforcement Learning
• When there are many states and actions
• When the episode can end without reward
• When there is a ‘narrow’ path to reward
[Figure: an agent crossing a rope with 15 turns remaining; each step has ~50% probability of falling off the rope, and random exploring goes the wrong way ~97% of the time, so P(reaching goal) ~ 0.01%]
Reward Shaping
• Hand craft intermediate objectives that yield reward
• Encourage the right type of exploration
• Requires custom human work
• Risk of learning to game the rewards
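A hypothetical illustration in the grid world above (my own example, not from the slides): wrap the environment’s step with a small bonus for moving closer to the chest, so exploration gets intermediate feedback.

def shaped_step(state, action):
    """Wrap step() with a hand-crafted shaping bonus (illustrative only)."""
    new_state, reward, done = step(state, action)
    # Manhattan distance to the chest before and after the move
    old_dist = abs(state[0] - CHEST[0]) + abs(state[1] - CHEST[1])
    new_dist = abs(new_state[0] - CHEST[0]) + abs(new_state[1] - CHEST[1])
    bonus = 1 if new_dist < old_dist else 0   # small reward for progress
    return new_state, reward + bonus, done

The gaming risk is visible even here: an agent could learn to oscillate back and forth to keep collecting the bonus instead of ending the episode at the chest, so the bonus has to stay small relative to the real reward.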
Memory
• Retrain on previous explorations: replay an exploration, replay a different exploration, do it a bunch of times
• Maintain samples of past transitions: <s, a, r, s'>
• Useful when it is cheaper to use some RAM/CPU than to run more simulations
• It is hard to get to reward, so you want to leverage it as much as possible when it happens
[Figure: the grid world with stored explorations being replayed to update Q̂]
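A minimal sketch of this idea (my own illustration, reusing the Q-learning sketch above): keep the observed transitions and re-apply the update to random samples of them between episodes.

replay_buffer = []   # stored <s, a, r, s'> samples

def remember(state, action, reward, new_state):
    replay_buffer.append((state, action, reward, new_state))

def replay(num_samples=32):
    """Re-apply the Q update to randomly chosen stored transitions."""
    for state, action, reward, new_state in random.sample(
            replay_buffer, min(num_samples, len(replay_buffer))):
        Q[(state, action)] = reward + GAMMA * max(Q[(new_state, a)] for a in ACTIONS)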
Gym – toolkit for reinforcement learning

CartPole: reward is +1 per step the pole remains up
MountainCar: reward is 200 at the flag, -1 per step

import gym
import random
import QLearning # Your implementation goes here...
import Assignment7Support

env = gym.make('CartPole-v0')
trainingIterations = 20000

qlearner = QLearning.QLearning(<Parameters>)

for trialNumber in range(trainingIterations):
    observation = env.reset()
    reward = 0
    for i in range(300):
        env.render() # Comment out to make much faster...
        currentState = ObservationToStateSpace(observation)
        action = qlearner.GetAction(currentState, <Parameters>)

        oldState = ObservationToStateSpace(observation)
        observation, reward, isDone, info = env.step(action)
        newState = ObservationToStateSpace(observation)

        qlearner.ObserveAction(oldState, action, newState, reward, …)

        if isDone:
            if (trialNumber % 1000) == 0:
                print(trialNumber, i, reward)
            break

# Now you have a policy in qlearner – use it...

https://gym.openai.com/docs/
Some Problems with QLearning
• State space is continuous
  • Must approximate by discretizing
• Treats states as identities
  • No knowledge of how states relate
  • Requires many iterations to fill in
• Converging can be difficult with randomized transitions/rewards

CartPole’s observation ranges:
print(env.observation_space.high)
#> array([ 2.4 , inf, 0.20943951, inf])
print(env.observation_space.low)
#> array([-2.4 , -inf, -0.20943951, -inf])
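One common way to handle the continuous state space is to clip and bucket each observation dimension. A sketch of what an ObservationToStateSpace-style helper might do (the bounds and bin count here are illustrative assumptions, not the assignment’s values):

import numpy as np

# Illustrative bounds for CartPole's 4 observations:
# cart position, cart velocity, pole angle, pole angular velocity.
LOW  = np.array([-2.4, -3.0, -0.20943951, -3.5])
HIGH = np.array([ 2.4,  3.0,  0.20943951,  3.5])
BINS = 10

def observation_to_state(observation):
    """Discretize a continuous observation into a tuple of bucket indices."""
    clipped = np.clip(observation, LOW, HIGH)
    scaled = (clipped - LOW) / (HIGH - LOW)               # each dimension mapped to [0, 1]
    buckets = np.minimum((scaled * BINS).astype(int), BINS - 1)
    return tuple(buckets)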
Policy Gradients
• Q-learning -> learn a value function
  • Q̂(s, a) = an estimate of the expected discounted reward of taking action a from state s
  • Performance time: take the action that has the highest estimated value
• Policy Gradient -> learn the policy directly
  • A probability distribution over actions, π(a | s)
  • Performance time: choose an action by sampling from the distribution
Example from: https://www.youtube.com/watch?v=tqrcjHuNdmQ
Policy Gradients
• Receive a frame
• Forward propagate to get the probability of each action
• Select an action by sampling from that distribution
• Find the gradient that makes the selected action more likely – store it (one gradient per action taken)
• Play the rest of the game
• If won, take a step in the direction of the stored gradients; if lost, take a step in the opposite direction
• Sum the gradients and step in the correct direction
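A bare-bones sketch of that loop (my own illustration with a tiny logistic policy over made-up features; env_step is a hypothetical callback, not the Pong setup from the video):

import numpy as np

rng = np.random.default_rng(0)
weights = rng.normal(scale=0.01, size=4)   # tiny linear policy over a 4-feature state

def p_action_1(state):
    """Probability of choosing action 1 under a logistic policy."""
    return 1.0 / (1.0 + np.exp(-np.dot(weights, state)))

def play_episode(env_step, initial_state, learning_rate=0.01):
    """One REINFORCE-style episode: store per-action gradients, then step by the outcome."""
    global weights
    grads, state, done = [], initial_state, False
    while not done:
        p1 = p_action_1(state)
        action = 1 if rng.random() < p1 else 0
        grads.append((action - p1) * state)      # gradient of log pi(action | state)
        state, reward, done = env_step(state, action)
    outcome = 1.0 if reward > 0 else -1.0        # won vs lost
    weights += learning_rate * outcome * np.sum(grads, axis=0)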
Policy Gradients – reward shaping
Not relevant to outcome(?)
Less important to outcome More important to outcome
Summary

Reinforcement Learning:
• Goal: Maximize the discounted sum of rewards
• Data: <state, action, reward> sequences from interacting with the environment
[Diagram: the Agent/Environment loop of Action, State, and Reward]

Many (awesome) recent successes:
• Robotics
• Surpassing humans at difficult games
• Doing it with (essentially) zero human knowledge

(Simple) Approaches:
• Q-Learning -> learn the discounted reward of each action, Q̂(s, a)
• Policy Gradients -> learn a probability distribution over actions, π(a | s)

Challenges:
• When the episode can end without reward
• When there is a ‘narrow’ path to reward
• When there are many states and actions

Things that help:
• Reward Shaping
• Memory
• Lots of parameter tweaking…