L13 Reinforcement Learning
Semester I, 2024-25
Reinforcement Learning
Rohan Paul
Outline
• Last Class
• Markov Decision Processes
• This Class
• Reinforcement Learning
• Reference Material
• Please follow the notes as the primary reference on this topic. Supplementary reading on topics covered in class: AIMA Ch. 21, Sections 21.1–21.4.
Acknowledgement
These slides are intended for teaching purposes only. Some material
has been used/adapted from web sources and from slides by Doina
Precup, Dorsa Sadigh, Percy Liang, Mausam, Parag, Emma Brunskill,
Alexander Amini, Dan Klein, Anca Dragan, Nicholas Roy and others.
Reinforcement Learning Setup
• Markov decision process (MDP):
• A set of states s ∈ S
• A set of actions (per state) a ∈ A
• A transition model T(s,a,s')
• A reward function R(s,a,s')
• Goal is to determine a policy π(s)
• Don't know T or R
• We don't know which states are good or what the actions do.
• Must actually try actions and visit states to learn their utility.
• The agent's goal is still to act optimally, i.e., determine the optimal policy.
Reinforcement Learning Setup
• Markov decision process (MDP):
• A set of states s ∈ S
• A set of actions (per state) a ∈ A
• A transition model T(s,a,s')
• A reward function R(s,a,s')
• Goal is to determine a policy π(s)
How should the agent act in this setup so as to maximize expected future rewards? Think of this setup as acting in an "unknown" MDP.
Another View: Offline (MDP) vs Online (RL)
• Offline (MDP)
• We are given an MDP.
• We use policy or value iteration to learn a policy. This is computed offline.
• At runtime the agent only executes the computed policy.
(The agent solves the MDP and computes a policy; then it simply acts with it.)
• Online (RL)
• We do not have full knowledge of the MDP.
• We must interact with the world to learn which states are good and which actions eventually lead to good rewards.
(The agent interacts with the world and learns, for example, that one of the states is bad.)
Reinforcement Learning
• Given
• Agent, states, actions, immediate rewards and environment.
• Not given
• Transition function (the agent cannot predict which state it will land in once it takes an action)
• Reward function (the agent does not know what reward it will get in a state)
• Agent's Task
• Learn a policy that maximizes expected rewards (the objective has not changed)
Reinforcement Learning (RL)
• People and animals learn by interacting with their environment.
• It is active rather than passive.
• Interactions are often sequential: future interactions can depend on earlier ones.
(Biological motivation)
• Reward Hypothesis
• Any goal can be formalized as the outcome of maximizing a cumulative reward.
• We can learn without examples of optimal behaviour. Instead, we optimise some reward signal.
Reinforcement Learning: Setup
(Figure: the agent-environment loop; the agent receives state s and reward r from the Environment and sends actions a back to it.)
Key characteristic of reinforcement learning
• Only evaluative feedback is present.
• The agent takes an action and is provided feedback (reward).
• The agent is not told which action it should take in a state.
Reinforcement Learning: Approaches
• Different learning agents
• Utility-based agent
• Learns a value/utility function on states and uses it to select actions that maximize the expected outcome utility.
• Q-learning agent
• Learns an action-utility function (or Q-function) giving the expected utility of taking a given action in a given state.
• Reflex agent
• Directly learns a mapping from states to actions (a policy).
• Approaches
• Passive Learning
• The agent's policy is fixed. The task is to learn the utilities of states (or state-action pairs); this could involve learning a model of the environment.
• It cannot select the actions during training.
• Active Learning
• The agent can select actions; it must also learn what actions to take.
• Issue of exploration: an agent must experience as much as possible of the environment in order to learn how to behave in it.
Passive Reinforcement Learning
• Setup
• Input: a fixed policy π(s)
• Don't know the transitions T(s,a,s')
• Don't know the rewards R(s,a,s')
• The agent executes a set of trials or episodes using its policy π(s).
• In each episode or trial it starts in a state and experiences a sequence of state transitions and rewards till it reaches a terminal state.
• Goal: learn (estimate) the state values Vπ(s)
The learner is provided with a policy (it cannot change it); using that policy it executes trials or episodes in the world. The goal is to determine the value of states.
Model-Based Reinforcement Learning
• Model-Based Idea
• Learn an approximate model (R(), T()) based on experiences
• Compute the value function using the learned model (as if it were correct).
Example: Model-Based Learning
Input Policy (gridworld with states A, B, C, D, E; assume γ = 1)

Observed Episodes (Training)
Episode 1: B, east, C, -1 / C, east, D, -1 / D, exit, x, +10
Episode 2: B, east, C, -1 / C, east, D, -1 / D, exit, x, +10
Episode 3: E, north, C, -1 / C, east, D, -1 / D, exit, x, +10
Episode 4: E, north, C, -1 / C, east, A, -1 / A, exit, x, -10

Learned Model
T(s,a,s'): T(B, east, C) = 1.00, T(C, east, D) = 0.75, T(C, east, A) = 0.25, ...
R(s,a,s'): R(B, east, C) = -1, R(C, east, D) = -1, R(D, exit, x) = +10, ...
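A minimal Python sketch of this counting/averaging estimation (the episode format, function name, and variable names are illustrative assumptions, not from the slides):

```python
from collections import defaultdict

def estimate_model(episodes):
    """Estimate T(s,a,s') by normalized counts and R(s,a,s') by averaging
    observed rewards. `episodes` is a list of episodes, each a list of
    (s, a, s_next, r) tuples (illustrative format)."""
    counts = defaultdict(lambda: defaultdict(int))   # (s,a) -> {s': count}
    rewards = defaultdict(list)                      # (s,a,s') -> [r, ...]
    for episode in episodes:
        for (s, a, s_next, r) in episode:
            counts[(s, a)][s_next] += 1
            rewards[(s, a, s_next)].append(r)
    T, R = {}, {}
    for (s, a), nexts in counts.items():
        total = sum(nexts.values())
        for s_next, n in nexts.items():
            T[(s, a, s_next)] = n / total
            R[(s, a, s_next)] = sum(rewards[(s, a, s_next)]) / n
    return T, R

# Episodes from the slide: T(C, east, D) comes out 0.75 and T(C, east, A) 0.25.
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]
T_hat, R_hat = estimate_model(episodes)
```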
Toy example: Model-Based vs Model-Free Estimation
Goal: Compute the expected age of COL333/671 students.
• If P(A) is known, the expectation can be computed directly.
• Model-based: estimate P(A) from samples; eventually we learn the right model, which gives the right estimates.
• Model-free: bypass the model construction and average the samples directly; averaging works because the samples appear with the right frequencies.
Model-Based vs. Model Free RL
Model-based RL:
• Explore environment and learn model, T=P(s’|s,a) and R(s,a), (almost)
everywhere. Use model to plan a policy (solving the MDP)
• Suitable when the state-space is manageable
Model-free RL:
• Do not learn a model; learn value function or policy directly
• Suitable when the state space is large
Toy Example: Monte Carlo Method
Input Policy (same gridworld with states A, B, C, D, E; assume γ = 1)

Observed Episodes (Training)
Episode 1: B, east, C, -1 / C, east, D, -1 / D, exit, x, +10
Episode 2: B, east, C, -1 / C, east, D, -1 / D, exit, x, +10
Episode 3: E, north, C, -1 / C, east, D, -1 / D, exit, x, +10
Episode 4: E, north, C, -1 / C, east, A, -1 / A, exit, x, -10

Value/utility of state C: V(C) = (9 + 9 + 9 + (-11)) / 4 = 4

Output Values: V(A) = -10, V(B) = +8, V(C) = +4, V(D) = +10, V(E) = -2
First-Visit Monte Carlo (FVMC) Policy Evaluation
Pseudo-code from Sutton and Barto (Reinforcement Learning) Ch 5 (Sec. 5.1)
First-Visit Monte Carlo (FVMC)
• First-Visit Monte Carlo (FVMC)
• Averages the returns following the first visit to a state s in the episode.
• Every-visit Monte Carlo (EVMC)
• Averages returns following all the visits to s.
• Convergence
• FVMC – error falls as 1/N(s). Needs lots of data
• EVMC - error falls quadratically, slightly better data efficiency.
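A minimal Python sketch of first-visit Monte Carlo policy evaluation, consistent with the Sutton and Barto pseudocode referenced above (the episode format and names are illustrative assumptions):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo policy evaluation.
    Each episode is a list of (state, reward) pairs, where the reward is the
    one received on leaving that state (illustrative format). Returns the
    estimated value V(s) for each state."""
    returns = defaultdict(list)
    for episode in episodes:
        # Index of the first visit to each state in this episode.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Return G_t following each time step, computed backwards.
        G_after = [0.0] * (len(episode) + 1)
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G_after[t] = r + gamma * G_after[t + 1]
        for s, t in first_visit.items():
            returns[s].append(G_after[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Episodes from the toy example above:
episodes = [
    [("B", -1), ("C", -1), ("D", +10)],
    [("B", -1), ("C", -1), ("D", +10)],
    [("E", -1), ("C", -1), ("D", +10)],
    [("E", -1), ("C", -1), ("A", -10)],
]
V = first_visit_mc(episodes)   # V["C"] comes out to 4.0, as on the slide
```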
Monte Carlo Methods: Pros and Cons
• Pros
• Does not estimate a model. It does not require any knowledge of T(), R().
• With a large number of runs it computes the correct average values, using just sample transitions.
• Cons
• Each state must be learned separately; this loses the state-connection information.
• The estimate of one state does not take advantage of the estimates of the other states.
• Note: the Bellman equations tell us that the value function for states has a recursive relationship.
• Can only be used in an episodic setting.
(Values from the toy example: V(A) = -10, V(B) = +8, V(C) = +4, V(D) = +10, V(E) = -2.)
• Problem: we have lost the connection between states. If B and E both go to C under this policy, how can their values be different?
Temporal difference (TD) learning, introduced next:
• Can be used in episodic or infinite-horizon non-episodic settings.
• Immediately updates the estimate of V(s) after each (s, a, s', r) tuple.
"If one had to identify one idea that is central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning", Sutton and Barto 2017
Temporal Difference (TD) Learning
• Temporal difference learning of values
• The policy π is still fixed.
• Move values toward the value of the successor state that is encountered: take action π(s) in s and observe the successor s'.
• Keep a running average.
• Don't need to store all the experience to build models; it is a model-free approach.
What is observed: samples for V(s), e.g. B, east, C, -2 and then C, east, D, -2 (assume γ = 1, α = 1/2).
Worked example (grid with states A / B C D / E, where V(D) = 8 and all other values start at 0):
• After observing B, east, C, -2: V(B) ← (1/2)·0 + (1/2)·(-2 + V(C)) = -1.
• After observing C, east, D, -2: V(C) ← (1/2)·0 + (1/2)·(-2 + V(D)) = 3.
TD Learning (intuitively)
- Nudge our prior estimate of the value function for a state using the given experience.
- Shift the estimate based on the error between what we are experiencing and the estimate we had before.
- Weighted by the learning rate.
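In symbols, the TD(0) update being described (standard form, consistent with the worked example above):

```latex
\text{sample} = R(s,\pi(s),s') + \gamma\, V^{\pi}(s'), \qquad
V^{\pi}(s) \leftarrow (1-\alpha)\,V^{\pi}(s) + \alpha\cdot\text{sample}
\;=\; V^{\pi}(s) + \alpha\big(\text{sample} - V^{\pi}(s)\big)
```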
TD Value Learning: Problems
• TD Value Learning
• Model-free approach to perform policy evaluation
• Incorporates Bellman updates with running sample averages
• Output of TD Value Learning
• TD Value learning outputs the value function
• How to turn the learned values into a new policy?
• Can use the following relationships (see the equations below):
• Q-Value Iteration
• Start with Q0(s,a) = 0.
• Given Qk, calculate the depth k+1 q-values for all q-states (update shown below):
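Standard forms of these relations and of the Q-value iteration update:

```latex
\pi(s) = \arg\max_a Q(s,a), \qquad
Q(s,a) = \sum_{s'} T(s,a,s')\big[\,R(s,a,s') + \gamma\, V(s')\,\big]
\\[4pt]
Q_{k+1}(s,a) \leftarrow \sum_{s'} T(s,a,s')\Big[\,R(s,a,s') + \gamma \max_{a'} Q_k(s',a')\,\Big]
```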
Q-Learning
• Sample-based Q-value iteration
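The sample-based update referred to above, in its standard form:

```latex
\text{sample} = R(s,a,s') + \gamma \max_{a'} Q(s',a'), \qquad
Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\cdot\text{sample}
```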
Q-Learning: Procedure
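A minimal Python sketch of the overall procedure (the environment interface, ε-greedy choice, and names here are illustrative assumptions, not the slide's exact pseudocode):

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Tabular Q-learning. `env` is assumed to expose reset() -> state,
    step(state, action) -> (next_state, reward, done), and actions(state)."""
    Q = defaultdict(float)  # (state, action) -> value, initialized to 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection over the current Q estimates.
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(s, a)
            # Sample-based Q-value update.
            target = r if done else r + gamma * max(
                Q[(s_next, a_next)] for a_next in env.actions(s_next))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
    return Q
```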
Q-Learning
The agent does not know the rewards a priori and learns the effect of the east action over time. Only the actions we actually take are updated. The agent occasionally falls into the cliff and receives the negative reward. Note that the max of the Q-values is propagated to other states (green values), as the update approximates the optimal Q-value.
Q-Learning: Properties
• Off-policy learning
• Q-learning converges to the optimal policy, even if the agent is acting sub-optimally.
SARSA
• State-Action-Reward-State-Action (SARSA)
• Update using (s, a, r, s’, a’)
• Note
• SARSA waits until an action is taken and backs up the Q-value for that action. Learns
the Q-function from actual transitions.
• On-policy algorithm
• More realistic if the policy is partly controlled by other agents. Learns from more
realistic values.
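The two updates side by side, in standard notation (SARSA uses the action a' actually taken in s'; Q-learning uses the max):

```latex
\text{SARSA:}\quad Q(s,a) \leftarrow Q(s,a) + \alpha\big[\,r + \gamma\, Q(s',a') - Q(s,a)\,\big]
\\[4pt]
\text{Q-learning:}\quad Q(s,a) \leftarrow Q(s,a) + \alpha\big[\,r + \gamma \max_{a'} Q(s',a') - Q(s,a)\,\big]
```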
Active Reinforcement Learning
• Q-learning so far
• Q-learning allows action selection because we are learning the Q(s,a) function.
• This means there is a policy that can be derived from the Q function learned.
• Should the agent follow this policy exactly or should it explore at times?
• Active RL
• The agent can select actions
• Actions play two roles
• A means to collect reward (exploitation)
• Help in acquiring a model of the environment (exploration)
• Exploration vs. Exploitation trade-off
• Act according to the current optimal policy (based on the Q-values), or
• Pick a different action to explore.
• Example: a new tea stall opens in IIT. Should you try the new one or stick to the old one? The goal is to maximize tea utility over time.
Exploration Strategy
• How to force exploration?
• An ε-greedy approach
• Every time step: either pick a random action or act on the current policy.
• With (small) probability ε, pick a random action.
• With (large) probability 1-ε, act based on the current policy (based on the current Q-values in the table that the agent is updating).
• What is the problem with ε-greedy?
• It takes a long time to explore.
• Exploration is not directed towards states about which we have less information.
Exploration Functions
• How to direct exploration towards state-action pairs that are less explored?
• Exploration function
• Trades off exploitation (preference for high values of u) vs. exploration (preference
for actions that have not been tried out as yet).
• f(u,n): increasing in u (the value estimate) and decreasing in n (the visit count).
• Effects (with the modified update shown below)
• The lower N(s',a') is, the higher the exploration bonus.
• The exploration bonus makes those states favorable which lead to unknown states (propagation).
• This has a cascading effect when an exploration action is taken.
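A minimal sketch of one common choice of exploration function and the corresponding modified update (this particular form, with bonus k/n, is an assumption in the style of the Berkeley CS188 treatment and need not match the slide's exact equation):

```latex
f(u,n) = u + \frac{k}{n}, \qquad
Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\Big[\,R(s,a,s') + \gamma \max_{a'} f\big(Q(s',a'),\, N(s',a')\big)\,\Big]
```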
Multi-arm Bandits
Multi-armed bandits are equivalent to a one-state MDP. The goal is to learn a policy that picks actions such that the expected rewards are maximized.
Source: Emma Brunskill (CS232 Course) Lecture 10
Toy Example: Treatment planning
• Consider deciding how to best treat patients with broken toes.
• Imagine we have 3 possible options:
• Do surgery
• Perform buddy taping of the broken toe with another toe
• Do nothing
• The outcome measure / reward is a binary variable: whether the toe has healed (reward +1) or not healed (reward 0) after 6 weeks, as assessed by x-ray.
• We can model this problem as a multi-arm bandit problem with 3 arms.
• Imagine the true (unknown) Bernoulli reward parameters for each arm (action) are fixed.
Source: Emma Brunskill (CS232 Course) Lecture 10
Toy Example: Treatment Planning
• Greedy actions are those that look best at the present, but some of the
other actions may be better.
• Epsilon-greedy forces the non-greedy actions to be tried.
• Indiscriminately, with no preference for those actions that are nearly greedy or
particularly uncertain.
• It would be better to select among the non-greedy actions according to
their potential for being optimal.
• Take into account how close their estimates are to being maximal and the
uncertainties in their estimates.
Upper Confidence Bound
• Upper Confidence Bound (UCB) for action
selection
• Square-root term is a measure of uncertainty
or variance in the estimate of the action’s
value.
• When a is selected, Nt(a) is incremented; the uncertainty reduces as the denominator increases.
• Each time a is not selected, t increases but Nt(a) does not; the uncertainty estimate increases (numerator).
• Natural logarithm
• Increases get smaller over time but are
unbounded; all actions will eventually be selected.
• Actions with lower value estimates will be selected
with lower frequency.
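The UCB action-selection rule being described (standard form from Sutton and Barto, Ch. 2; c is an exploration constant):

```latex
A_t = \arg\max_a \left[\, Q_t(a) + c\,\sqrt{\frac{\ln t}{N_t(a)}}\, \right]
```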
Upper Confidence Bound
Source: Emma Brunskill (CS232 Course)
Example
Example of k-armed bandits in practice. The figure shows a 10-armed bandit test bed in which the agent learns the rewards for the 10 options. The average reward collected goes up as learning progresses.
Figure from S&B Book Chapter 2
Problem of Generalization
• Problem when the state space is very large
• Visiting all states is not possible at training time.
• Memory issue: cannot fit the Q-table in memory.
• Need a succinct representation of the state.
(Example: when the state is raw pixels, the state space is so large that we cannot experience all states.)
• Feature-based Representations
• Features or properties are functions from states to real numbers (often 0/1) that capture important properties of the state.
• Example features that can be computed from the state:
• Distance to the closest ghost
• Distance to the closest dot
• Number of ghosts
(Two states may be essentially similar, but for tabular Q-learning each is a different entry; the agent does not know that in essence these states are the same.)
Linear Value Functions
• Using a feature representation, a q function (or value function) can be expressed for any state
using a set of weights:
• Now the goal of Q-learning is to estimate these weights from experience. Once the weights
are learned the resulting Q-values will hopefully be close to the true Q values.
• Main consequence:
• A new state that shares features with a previously seen state will have the same Q-values.
• Disadvantage
• If there are too few features, actually different states (good and bad states) may start looking the same.
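The linear form being referred to, in standard notation (f_1, ..., f_n are the features and w_1, ..., w_n their weights):

```latex
V(s) = w_1 f_1(s) + w_2 f_2(s) + \cdots + w_n f_n(s), \qquad
Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \cdots + w_n f_n(s,a)
```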
Approximate Q-learning
• Q-learning with linear Q-functions:
• Exact Q-function updates vs. approximate Q-function updates: the error (the "target" minus the "prediction") drives the update (see the equations below).
• Interpretation:
• Adjust the weights of the active features.
• If the difference is positive (what we get is higher than what we predicted) and the feature value is 1, then the weight increases and the Q-value increases. No change if the feature value is 0.
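In standard notation, the "difference" is the target minus the prediction, and the two updates the slide contrasts are:

```latex
\text{difference} = \big[\, r + \gamma \max_{a'} Q(s',a')\,\big] - Q(s,a)
\\[4pt]
\text{Exact Q update:}\quad Q(s,a) \leftarrow Q(s,a) + \alpha\cdot\text{difference}
\\[4pt]
\text{Approximate (feature-weight) update:}\quad w_i \leftarrow w_i + \alpha\cdot\text{difference}\cdot f_i(s,a)
```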
Deep Q-Networks (DQNs)
(Example figure: the goal is a policy to move the base.)
Modeling the Q-function via a neural network: what we want to estimate is the utility of this state and of each action we can take in it.
Training the Deep Q Network
How to train a network to predict the Q-value for an action in a state?
Key idea: turn the Q-learning equation into a loss. Optimize using back-propagation.
Need lots of data to train! Obtain roll-outs from the simulator.
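One standard way to write the resulting loss (a sketch: θ are the network weights and θ⁻ a periodically-frozen copy used for the target, as in the original DQN work; the slide may use a simpler variant):

```latex
\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s')}\Big[\big(\, r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\,\big)^2\Big]
```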
Conclusions
• Learning from reinforcement applies to problems where there is no knowledge of which states have good rewards or how the actions lead to new states.
• There are various approaches for performing RL: estimating a model, estimating the value function, or directly optimizing the policy (not covered yet). They vary in whether we learn from the whole episode or from each transition.
• Data is essentially the experience of the agent. The feedback is evaluative, not prescriptive; this makes RL different from supervised learning.