
COL333/671: Introduction to AI

Semester I, 2024-25

Reinforcement Learning

Rohan Paul

1
Outline
• Last Class
• Markov Decision Processes
• This Class
• Reinforcement Learning
• Reference Material
• Please follow the notes as the primary reference for this topic. Supplementary reading: AIMA Ch. 21, Sections 21.1–21.4.

2
Acknowledgement
These slides are intended for teaching purposes only. Some material
has been used/adapted from web sources and from slides by Doina
Precup, Dorsa Sadigh, Percy Liang, Mausam, Parag, Emma Brunskill,
Alexander Amini, Dan Klein, Anca Dragan, Nicholas Roy and others.

3
Reinforcement Learning Setup
• Markov decision process (MDP):
• A set of states s ∈ S
• A set of actions (per state) A
• A model T(s,a,s’)
• A reward function R(s,a,s’)
• Goal is to determine a policy π(s)

• Don’t know T or R
• We don’t know which states are good or what
the actions do
• Must actually try actions and states to learn
their utility.
• The agent’s goal is still to act optimally, i.e., determine the optimal policy.
4
Reinforcement Learning Setup
• Markov decision process (MDP):
• A set of states s ∈ S
• A set of actions (per state) A
• A model T(s,a,s’)
• A reward function R(s,a,s’)
• Goal is to determine a policy π(s)

• But we don’t know T or R
• I.e., we don’t know which states are good or what the actions do
• Must actually try actions and states out to learn

In the Car MDP (discussed previously) we now do not know the transition model and the rewards. How should the agent act in this setup to maximize expected future rewards?

How should the agent act in this setup so as to maximize expected future rewards? Also, think of the setup as acting in an “unknown” MDP.
Another View: Offline (MDP) vs Online (RL)
• Offline (MDP)
• We are given an MDP
• We use policy or value iteration to learn a policy. This is computed offline.
• At runtime the agent only executes the computed policy
(The agent solves the MDP and computes a policy. Now it simply acts with it.)

• Online (RL)
• We do not have full knowledge of the MDP
• We must interact with the world to learn which states are good and which actions eventually lead to good rewards
(The agent interacts with the world and learns that one of the states is bad.)
6
Reinforcement Learning
• Given
• Agent, states, actions, immediate rewards and environment.
• Not given
• Transition function (the agent cannot predict which state it will land in once it takes an action)
• Reward function (the agent does not know what reward it will get in a state)
• Agent’s Task
• Learn a policy that maximizes expected rewards (the objective has not changed)

7
Reinforcement Learning (RL)
• People and animals learn by interacting with our
environment
• It is active rather than passive.
• Interactions are often sequential — future
interactions can depend on earlier ones
Biological motivation
• Reward Hypothesis
• Any goal can be formalized as the outcome of
maximizing a cumulative reward.
• We can learn without examples of optimal behaviour. Instead, we optimise some reward signal.
Reinforcement Learning: Setup
(Diagram: the agent receives a state s and reward r from the environment and sends back actions a.)

Key characteristic of reinforcement learning
• Only evaluative feedback is present.
• The agent takes an action and is provided feedback (a reward).
• The agent is not told which action it should take in a state.

• Receive feedback in the form of rewards


• Agent’s utility is defined by the reward function
• Must (learn to) act so as to maximize expected rewards
• All learning is based on observed samples of outcomes!
9
Classes of Learning Problems

Figure: Alexander Amini


Examples of RL
Examples: Learning to Walk

[Kohl and Stone, ICRA 2004]


Initial
Examples: Learning to Walk

[Kohl and Stone, ICRA 2004]


Training
Examples: Learning to Walk

[Kohl and Stone, ICRA 2004]


Finished
Examples: Game Play

Google DeepMind


Examples: Healthcare Domain

Blood pressure control

16
Reinforcement Learning: Approaches
• Different learning agents
• Utility-based agent
• Learn a value/utility function on states and use it to select actions that maximize the expected outcome utility.
• Q-learning
• Learn an action-utility function (or Q-function) giving the expected utility of taking a given action in a given state.
• Reflex agent
• Directly learn a mapping from states to actions (a policy).
• Approaches
• Passive Learning
• The agent’s policy is fixed; the task is to learn the utilities of states (or state-action pairs). This could involve learning a model of the environment.
• The agent cannot select actions during training.
• Active Learning
• The agent can select actions; it must also learn what actions to take.
• Issue of exploration: an agent must experience as much as possible of the environment in order to learn how to behave in it.

17
Passive Reinforcement Learning
• Setup
• Input: a fixed policy π(s)
• Don’t know the transitions T(s,a,s’)
• Don’t know the rewards R(s,a,s’)
• The agent executes a set of trials or episodes using its policy π(s)
• In each episode or trial it starts in a state and experiences a sequence of state transitions and rewards till it reaches a terminal state.
• Goal: learn (estimate) the state values Vπ(s)

The learner is provided with a policy (it cannot change it); using the policy, it executes trials or episodes in the world. The goal is to determine the value of states.

18
Model-Based Reinforcement Learning
• Model-Based Idea
• Learn an approximate model (R(), T()) based on experiences
• Compute the value function using the learned model (as if it were correct).

• Step 1: Learn an empirical MDP model
• Count outcomes s’ for each (s, a)
• Normalize to give an estimate of the transition model T(s,a,s’)
• Discover each reward R(s,a,s’) when we experience (s, a, s’)

Note: this goes through an intermediate stage of learning the model. In contrast, “model-free” approaches do not learn the intermediate model.

• Step 2: Solve the learned MDP model
• For example, use value iteration to obtain the final policy.
• Plug the estimated T and R into the following equation:
19
Example: Model-Based Learning
Input Policy  Observed Episodes (Training) Learned Model
Episode 1 Episode 2 T(s,a,s’).
B, east, C, -1 B, east, C, -1 T(B, east, C) = 1.00
A T(C, east, D) = 0.75
C, east, D, -1 C, east, D, -1
T(C, east, A) = 0.25
D, exit, x, +10 D, exit, x, +10 …
B C D
Episode 3 Episode 4 R(s,a,s’).
E R(B, east, C) = -1
E, north, C, -1 E, north, C, -1
R(C, east, D) = -1
C, east, D, -1 C, east, A, -1 R(D, exit, x) = +10
Assume:  = 1 D, exit, x, +10 A, exit, x, -10 …
20
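A minimal sketch of Step 1 on these episodes, assuming the data is given as (state, action, next state, reward) tuples (this representation is an illustration, not part of the slides):

```python
from collections import defaultdict

# Episodes from the slide, written as (s, a, s', r) tuples.
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]

counts = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s'] = number of observations
rewards = {}                                     # rewards[(s, a, s')] = observed reward

for episode in episodes:
    for s, a, s_next, r in episode:
        counts[(s, a)][s_next] += 1
        rewards[(s, a, s_next)] = r              # rewards are deterministic in this toy example

# Normalize the counts to estimate the transition model T(s, a, s').
T_hat = {
    (s, a, s_next): n / sum(successors.values())
    for (s, a), successors in counts.items()
    for s_next, n in successors.items()
}

print(T_hat[("C", "east", "D")])   # 0.75, as in the learned model above
print(T_hat[("C", "east", "A")])   # 0.25
```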
Toy example: Model-Based vs Model-Free Estimation
Goal: Compute the expected age of COL333/671 students

Known P(A): compute the expectation directly from the distribution.

Without P(A), instead collect samples [a1, a2, …, aN].

Unknown P(A), “Model Based”: eventually we learn the right model, which gives the right estimates.

Unknown P(A), “Model Free”: bypass the model construction. Averaging works because the samples appear with the right frequencies.
Model-Based vs. Model Free RL
Model-based RL:
• Explore environment and learn model, T=P(s’|s,a) and R(s,a), (almost)
everywhere. Use model to plan a policy (solving the MDP)
• Suitable when the state-space is manageable

Model-free RL:
• Do not learn a model; learn value function or policy directly
• Suitable when the state space is large

Next, we look at a model free approach to RL


22
Monte Carlo Methods

• Learning the state-value function for a given policy


• What is the value of a state?
• Expected return – expected cumulative future discounted reward
• Key Idea
• Sample trajectories from the world directly and estimate a value function
without a model
• Simply average the returns observed after visits to that state.
• As more returns are observed the average should converge to the expected
value.
• Each occurrence of a state in an episode is called a visit to the state.
23
Toy Example: Monte Carlo Method
Input Policy  Observed Episodes (Training)
Episode 1 Episode 2 Value/utility of state c
V(C) = ((9 + 9 + 9 +(-11))/4)
B, east, C, -1 B, east, C, -1 =4
A C, east, D, -1 C, east, D, -1
D, exit, x, +10 D, exit, x, +10
B C D
Episode 3 Episode 4
E
E, north, C, -1 E, north, C, -1
C, east, D, -1 C, east, A, -1
Assume:  = 1 D, exit, x, +10 A, exit, x, -10

24
Toy Example: Monte Carlo Method
Input Policy  Observed Episodes (Training)
Episode 1 Episode 2 Value/utility of state c
V(C) = ((9 + 9 + 9 +(-11))/4)
B, east, C, -1 B, east, C, -1 =4
A C, east, D, -1 C, east, D, -1
D, exit, x, +10 D, exit, x, +10 Output Values
B C D -10
Episode 3 Episode 4 A
E +8 +4 +10
E, north, C, -1 E, north, C, -1
C, east, D, -1 C, east, A, -1 B C D
Assume:  = 1 D, exit, x, +10 A, exit, x, -10
-2
E
25
First-Visit Monte Carlo (FVMC) Policy Evaluation

• Initialize the value function arbitrarily.
• Loop over the episode from the end till the start.
• Account for the discounting of the reward.
• Unless the state has been visited from time 0 till (t-1), append the return and average out the results.
• I.e., only update the value estimate if this is the first visit to the state.

26
Pseudo-code from Sutton and Barto (Reinforcement Learning) Ch 5 (Sec. 5.1)
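A minimal sketch of first-visit MC policy evaluation in the spirit of that pseudo-code, assuming each episode is given as a list of (state, reward) pairs where the reward is received on leaving the state (this data format is an illustration):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    returns = defaultdict(list)          # state -> list of first-visit returns
    V = defaultdict(float)
    for episode in episodes:
        states = [s for s, _ in episode]
        G = 0.0
        # Loop over the episode from the end till the start.
        for t in reversed(range(len(episode))):
            s, r = episode[t]
            G = gamma * G + r            # discounted return from time t onwards
            if s not in states[:t]:      # only update on the first visit to s
                returns[s].append(G)
                V[s] = sum(returns[s]) / len(returns[s])
    return dict(V)

# The four toy episodes from the earlier slides:
eps = [[("B", -1), ("C", -1), ("D", +10)],
       [("B", -1), ("C", -1), ("D", +10)],
       [("E", -1), ("C", -1), ("D", +10)],
       [("E", -1), ("C", -1), ("A", -10)]]
print(first_visit_mc(eps))   # values: B = 8.0, C = 4.0, D = 10.0, E = -2.0, A = -10.0
```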
First-Visit Monte Carlo (FVMC)
• First-Visit Monte Carlo (FVMC)
• Averages the returns following the first visit to a state s in the episode.
• Every-visit Monte Carlo (EVMC)
• Averages returns following all the visits to s.
• Convergence
• FVMC: error falls as 1/N(s); needs lots of data.
• EVMC: error falls quadratically; slightly better data efficiency.

27
Monte Carlo Methods: Pros and Cons
• Pros
• Does not estimate a model. It does not require any knowledge of T() or R().
• With a large number of runs it computes the correct average values, using just sample transitions.

• Cons
• Each state must be learned separately; the connection information between states is lost.
• The estimate of one state does not take advantage of the estimates of the other states.
• Note: the Bellman equations tell us that the value function over states has a recursive relationship.
• Can only be used in an episodic setting.

(Values from the toy example: A = -10, B = +8, C = +4, D = +10, E = -2.)
• Problem: we have lost the connection between states.
• If B and E both go to C under this policy, how can their values be different?

How can the Bellman equation be incorporated in this learning?


Temporal Difference (TD) Learning
• Model-Free combination of
• Monte Carlo (learning from sample trajectories/experience) and
• Dynamic programming (via Bellman Equations)
• Incorporate Bootstrapping
• Update value function estimates of a state based on others
• Adjust the value function estimate using the Bellman Equation relationship between the value function
of successor states.
• More data-efficient than a Monte Carlo method

• Setting
• Can be used in episodic or infinite-horizon (non-episodic) settings
• Immediately updates the estimate of V(s) after each (s, a, s’, r) tuple.

“If one had to identify one idea that is central and novel to reinforcement learning,
it would undoubtedly be temporal-difference (TD) learning”, Sutton and Barto 2017

29
Temporal Difference (TD) Learning
• Temporal difference learning of values
• Policy still fixed
• Move values toward the value of the successor that is encountered
• Keep a running average
• Don’t need to store all the experience to build models. It is a model-free approach.

(Diagram: from state s, the policy takes action π(s) and the successor state s’ is observed.)

What is observed: a sample of V(s).
Update to V(s): modify the old value.
Same update, written in terms of the sign and magnitude of the difference.
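In their standard form, the three quantities above are (with learning rate α):

$$\text{sample} = R(s,\pi(s),s') + \gamma\, V^{\pi}(s')$$
$$V^{\pi}(s) \leftarrow (1-\alpha)\, V^{\pi}(s) + \alpha \cdot \text{sample}$$
$$V^{\pi}(s) \leftarrow V^{\pi}(s) + \alpha\,\big(\text{sample} - V^{\pi}(s)\big)$$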
Temporal Difference Learning: Example
States: gridworld with A above C, B, C, D in a row, and E below C.    Assume: γ = 1, α = 1/2

Observed transitions: B, east, C, -2   followed by   C, east, D, -2

Value estimates over time (A and E stay at 0 throughout):
• Initially: V(B) = 0, V(C) = 0, V(D) = 8
• After B, east, C, -2: V(B) = -1, V(C) = 0, V(D) = 8
• After C, east, D, -2: V(B) = -1, V(C) = 3, V(D) = 8
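A worked check of the two updates, using the TD rule above with γ = 1 and α = 1/2:

$$V(B) \leftarrow (1-\tfrac12)\cdot 0 + \tfrac12\big(-2 + V(C)\big) = \tfrac12(-2+0) = -1$$
$$V(C) \leftarrow (1-\tfrac12)\cdot 0 + \tfrac12\big(-2 + V(D)\big) = \tfrac12(-2+8) = 3$$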
Temporal Difference

TD Learning (intuitively)
- Nudge our prior estimate of the
value function for a state using the
given experience.
- Shift the estimate based on the error between what we are experiencing and the estimate we had before
- Weighted by the learning rate.
TD Value Learning: Problems
• TD Value Learning
• Model-free approach to perform policy evaluation
• Incorporates Bellman updates with running sample averages
• Output of TD Value Learning
• TD Value learning outputs the value function
• How to turn the learned values into a new policy?
• Can use the following relationships (shown below):

• Problem: we don’t have T and R


• Next: Can we learn the Q-values and not values? Then select actions
for the new policy using the relationships above.
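The relationships referred to above, in their standard form:

$$\pi^*(s) = \arg\max_a Q^*(s,a), \qquad Q^*(s,a) = \sum_{s'} T(s,a,s')\left[R(s,a,s') + \gamma\, V^*(s')\right]$$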
33
Estimating the value function may be useful on its own.
From Value Iteration to Q-Value Iteration
• Value Iteration
• Start with V0(s) = 0
• Given Vk, calculate Vk+1 for all states as:

• Q-Value Iteration
• Start with Q0(s,a) = 0
• Given Qk, calculate the depth k+1 Q-values for all q-states:
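In standard form, the two updates are:

$$V_{k+1}(s) \leftarrow \max_a \sum_{s'} T(s,a,s')\left[R(s,a,s') + \gamma\, V_k(s')\right]$$
$$Q_{k+1}(s,a) \leftarrow \sum_{s'} T(s,a,s')\left[R(s,a,s') + \gamma \max_{a'} Q_k(s',a')\right]$$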

35
Q-Learning
• Sample-based Q-value iteration
• Estimate Q(s,a) values from samples:
• Receive a sample (s, a, s’, r)
• Consider your old estimate: Q(s,a)
• Consider your new sample estimate
• Incorporate the new estimate into a running average:
36
Q-Learning: Procedure

37
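A minimal sketch of the tabular Q-learning procedure with ε-greedy action selection. The environment interface (reset() returning a state and step(a) returning (next state, reward, done)) is an assumption for illustration, not something defined in the slides:

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=1000, alpha=0.5, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)                     # Q[(s, a)], initialized to 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy: explore with probability epsilon, else act greedily on Q.
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Q-learning update: move Q(s, a) toward the sample r + gamma * max_a' Q(s', a').
            sample = r + gamma * max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * sample
            s = s_next
    return Q
```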
Q-Learning

(Gridworld demo: an exit with +10 reward and a cliff at the bottom with a negative reward.)

The agent does not know the rewards a priori. It learns the effect of the east action over time. Only the actions we actually take are updated. Occasionally the agent falls into the cliff and gets the negative reward. Note that the max of the Q-values is propagated (green values) to other states, since Q-learning approximates the optimal Q-value.
Q-Learning: Properties
• Off-policy learning
• Q-learning converges to the optimal policy -- even if the agent is acting sub-optimally.

• Some technical conditions:


• Exploration must be sufficient.
• In the limit, it does not matter how the actions are selected.
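The technical conditions are usually stated as: every state-action pair is tried infinitely often, and the learning-rate schedule satisfies the standard stochastic-approximation conditions

$$\sum_t \alpha_t = \infty, \qquad \sum_t \alpha_t^2 < \infty .$$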

39
SARSA
• State-Action-Reward-State-Action (SARSA)
• Update using (s, a, r, s’, a’)

• SARSA update equation (shown below)

• Note
• SARSA waits until an action is taken and backs up the Q-value for that action. Learns
the Q-function from actual transitions.
• On-policy algorithm
• More realistic if the policy is partly controlled by other agents. Learns from more
realistic values.
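The standard SARSA update referred to above:

$$Q(s,a) \leftarrow Q(s,a) + \alpha\big[r + \gamma\, Q(s',a') - Q(s,a)\big]$$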
40
Active Reinforcement Learning
• Q-learning so far
• Q-learning allows action selection because we are learning the Q(s,a) function.
• This means there is a policy that can be derived from the Q function learned.
• Should the agent follow this policy exactly or should it explore at times?
• Active RL
• The agent can select actions
• Actions play two roles
• A means to collect reward (exploitation)
• Help in acquiring a model of the environment (exploration)
• Exploration vs. Exploitation trade-off
• Act according to the current optimal (based on Q-Values)
• Pick a different action to explore.
• Example: a new tea stall opens in IIT. Should you try the new one or stick to the old one? The goal is to maximize tea utility over time.

41
Exploration Strategy
• How to force exploration?
• An -greedy approach
• Every time step: either pick a random action or act on the current policy
• With (small) probability , pick a random action
• With (large) probability 1-, act based on the current policy (based on the current
Q –values in the table that the agent is updating)
• What is the problem with -greedy?
• It takes a long time to explore.
• Exploration is not directed towards states of which we have less information.

42
Exploration Functions
• How to direct exploration towards state-action pairs that are less explored?

• Exploration function
• Trades off exploitation (preference for high values of u) vs. exploration (preference
for actions that have not been tried out as yet).
• f(u,n): Increasing in u and decreasing in n.

• What is the exploration function trying to achieve?


• The exploration function provides an optimistic estimate of the best possible value
attainable in any state.
• Makes the agent think that there is high reward propagated from states that are
explored less.
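The specific form of f is not fixed by this slide; one simple, commonly used choice (for some constant k) is

$$f(u,n) = u + \frac{k}{n}$$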
43
Exploratory Q-Learning
• In Q-learning
• Explicitly encode the value of exploration in the Q-function
• Exploratory Q-Learning: modified update (shown below)

• Effects
• The lower N(s’, a’) is, the higher the exploration bonus.
• The exploration bonus makes favorable those states which lead to unknown states (propagation).
• This has a cascading effect when an exploratory action is taken.
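The modified update is typically written with the exploration function applied to the successor Q-values (shown here in its common form; bookkeeping details may differ):

$$Q(s,a) \xleftarrow{\alpha} R(s,a,s') + \gamma \max_{a'} f\big(Q(s',a'),\, N(s',a')\big)$$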
44
Multi-armed Bandits
Multi-armed bandits are equivalent to a one-state MDP.
The goal is to learn a policy that picks actions such that the expected rewards are maximized.

45
Source: Emma Brunskill (CS234 Course), Lecture 10
Toy Example: Treatment planning
• Consider deciding how best to treat patients with broken toes
• Imagine we have 3 possible options:
• Do surgery
• Perform buddy taping (taping the broken toe to another toe)
• Do nothing
• The outcome measure / reward is a binary variable: whether the toe has healed (reward +1) or not healed (reward 0) after 6 weeks, as assessed by x-ray

• We can model this problem as a multi-armed bandit problem with 3 arms
• Imagine the true (unknown) Bernoulli reward parameters for each arm (action) are given

46
Source: Emma Brunskill (CS234 Course), Lecture 10
Toy Example: Treatment Planning

Average the rewards for each action.

Exploration vs. exploitation trade-off!

If the reward variance is high, then the epsilon-greedy approach works better than the purely greedy one.
47
Source: Emma Brunskill (CS234 Course)
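A minimal sketch of an ε-greedy agent on this 3-armed Bernoulli bandit. The true success probabilities below are made-up placeholders for illustration; the parameters used in the lecture are not specified here:

```python
import random

true_p = {"surgery": 0.95, "buddy_taping": 0.90, "do_nothing": 0.10}   # hypothetical values
arms = list(true_p)

counts = {a: 0 for a in arms}      # N(a): number of times each arm was pulled
q_est = {a: 0.0 for a in arms}     # running average reward per arm
epsilon = 0.1

for t in range(10000):
    # Explore with probability epsilon, otherwise exploit the current best estimate.
    if random.random() < epsilon:
        a = random.choice(arms)
    else:
        a = max(arms, key=q_est.get)
    reward = 1 if random.random() < true_p[a] else 0       # toe healed (+1) or not (0)
    counts[a] += 1
    q_est[a] += (reward - q_est[a]) / counts[a]            # incremental average

print(q_est)   # estimates for frequently pulled arms approach their true success probabilities
```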
Upper Confidence Bound

• Greedy actions are those that look best at present, but some of the other actions may be better.
• Epsilon-greedy forces the non-greedy actions to be tried.
• Indiscriminately, with no preference for those actions that are nearly greedy or
particularly uncertain.
• It would be better to select among the non-greedy actions according to
their potential for being optimal.
• Take into account how close their estimates are to being maximal and the
uncertainties in their estimates.

48
Upper Confidence Bound
• Upper Confidence Bound (UCB) for action selection
• The square-root term is a measure of uncertainty or variance in the estimate of the action’s value.
• When a is selected, Nt(a) is incremented. The uncertainty reduces as the denominator increases.
• Each time a is not selected, t increases but Nt(a) does not. The uncertainty estimate increases (numerator).
• Natural logarithm
• Increases get smaller over time but are unbounded; all actions will eventually be selected.
• Actions with lower value estimates will be selected with lower frequency.
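The UCB action-selection rule described above, in the form given in Sutton & Barto (Ch. 2), with exploration constant c:

$$A_t = \arg\max_a \left[\, Q_t(a) + c\,\sqrt{\frac{\ln t}{N_t(a)}} \,\right]$$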
49
Upper Confidence Bound

50
Source: Emma Brunskill (CS234 Course)
Example
Example of K-armed bandits in practice. The figure shows a 10-armed bandit test bed in which the agent learns the rewards for the 10 options. The average reward collected goes up as learning progresses.

51
Figure from S&B Book Chapter 2
Problem of Generalization
• Problem when the state space is very large
• Visiting all states is not possible at training time.
• Memory issue: cannot fit the Q-table in memory.
• Need a succinct representation of the state.
• (Example: if the state is raw pixels, the state space is huge and we cannot experience all states.)

• Need for Generalization
• Learn from the small number of states encountered during training
• Generalize that experience to new, similar situations not encountered before

• Feature-based Representations
• Features or properties are functions from states to real numbers (often 0/1) that capture important properties of the state.
• Example features that can be computed from the state:
• Distance to the closest ghost
• Distance to the closest dot
• Number of ghosts
• (Two such states may look similar, but for tabular Q-learning each is a different entry; it does not know that in essence these states are the same.)
Linear Value Functions
• Using a feature representation, a Q-function (or value function) can be expressed for any state using a set of weights (linear form shown below):

• Now the goal of Q-learning is to estimate these weights from experience. Once the weights
are learned the resulting Q-values will hopefully be close to the true Q values.

• Main consequence:
• A new state that shares features with a previously seen state will have the same Q-values.
• Disadvantage
• If there are too few features, actually different states (good and bad states) may start looking the same.
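The linear form referred to at the top of this slide (standard notation, with features f_i and weights w_i):

$$Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \cdots + w_n f_n(s,a)$$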

53
Approximate Q-learning
• Q-learning with linear Q-functions:
• Exact Q-function updates
• Approximate Q-function updates
• The update is driven by the difference between the “target” and the “prediction”.

• Interpretation:
• Adjust weights of active features.
• If the difference is positive (what we get is higher than the previous estimate) and the feature value is 1, then the weight increases and the Q-value increases. There is no change if the feature value is 0.
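In their standard form, the updates and the “target”/“prediction” difference are:

$$\text{difference} = \underbrace{\big[r + \gamma \max_{a'} Q(s',a')\big]}_{\text{``target''}} - \underbrace{Q(s,a)}_{\text{``prediction''}}$$
$$\text{Exact: } \; Q(s,a) \leftarrow Q(s,a) + \alpha \cdot \text{difference}$$
$$\text{Approximate: } \; w_i \leftarrow w_i + \alpha \cdot \text{difference} \cdot f_i(s,a)$$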

54
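A minimal sketch of the approximate Q-learning weight update with linear features. The features(s, a) helper returning a dict of feature values is an assumed interface for illustration:

```python
def q_value(w, s, a, features):
    # Linear Q-function: weighted sum of feature values.
    return sum(w.get(name, 0.0) * value for name, value in features(s, a).items())

def update_weights(w, s, a, r, s_next, actions, features, alpha=0.05, gamma=1.0):
    # difference = target - prediction
    target = r + gamma * max(q_value(w, s_next, a2, features) for a2 in actions)
    difference = target - q_value(w, s, a, features)
    # Adjust the weights of the active features in proportion to the difference.
    for name, value in features(s, a).items():
        w[name] = w.get(name, 0.0) + alpha * difference * value
    return w
```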
Deep Q-Networks (DQNs)
Goal: a policy to move the base.
Model the Q-function via a neural network.
What we want to estimate: the utility of this state and of each action we can take.

What is the advantage of the neural network?

55
Training the Deep Q Network
How to train a network to predict the Q-value for an action in a state?
Key idea: turn the Q-learning update into a loss, and optimize using backpropagation.

Need lots of data to train! Obtain rollouts from the simulator.
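The standard squared TD-error loss used to train a deep Q-network (the full DQN algorithm additionally uses a separate target network with parameters θ⁻):

$$L(\theta) = \mathbb{E}\Big[\big(r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\big)^2\Big]$$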

56
Conclusions
• Learning from reinforcement applies to problems where there is no knowledge of which states have good rewards or of how the actions lead to new states.

• There are various approaches for performing RL: estimating a model, estimating the value function, or directly optimizing the policy (not covered yet). They also vary in whether we learn from the whole episode or from each transition.

• Data is essentially the experience of the agent. The feedback is evaluative, not prescriptive; RL is different from supervised learning.

• There is a fundamental tradeoff between exploration and exploitation. Generalization is a challenge in large state spaces.

57
