L13 Reinforcement Learning
Semester I, 2024-25
Reinforcement Learning
Rohan Paul
Outline
• Last Class
• Markov Decision Processes
• This Class
• Reinforcement Learning
• Reference Material
• Please follow the notes as the primary reference on this topic. Supplementary reading on topics covered in class: AIMA Ch. 21, Sections 21.1–21.4.
Acknowledgement
These slides are intended for teaching purposes only. Some material
has been used/adapted from web sources and from slides by Doina
Precup, Dorsa Sadigh, Percy Liang, Mausam, Parag, Emma Brunskill,
Alexander Amini, Dan Klein, Anca Dragan, Nicholas Roy and others.
Reinforcement Learning Setup
• Markov decision process (MDP):
• A set of states s ∈ S
• A set of actions (per state) a ∈ A
• A transition model T(s,a,s')
• A reward function R(s,a,s')
• Goal is to determine a policy π(s)
• Don't know T or R
• We don't know which states are good or what the actions do.
• Must actually try actions and visit states to learn their utility.
• The agent's goal is still to act optimally, i.e., determine the optimal policy.
Reinforcement Learning Setup
• Markov decision process (MDP):
• A set of states s ∈ S
• A set of actions (per state) a ∈ A
• A transition model T(s,a,s')
• A reward function R(s,a,s')
• Goal is to determine a policy π(s)
How should the agent act in this setup so as to maximize expected future rewards? Think of this setup as acting in an "unknown" MDP.
Another View: Offline (MDP) vs Online (RL)
• Offline (MDP)
• We are given an MDP.
• We use policy or value iteration to learn a policy. This is computed offline.
• At runtime the agent only executes the computed policy.
(The agent solves the MDP and computes a policy; then it simply acts with it.)
• Online (RL)
• We do not have full knowledge of the MDP.
• We must interact with the world to learn which states are good and which actions eventually lead to good rewards.
(The agent interacts with the world and learns, for example, that one of the states is bad.)
Reinforcement Learning
• Given
• Agent, states, actions, immediate rewards and environment.
• Not given
• Transition function (the agent cannot predict which state it will land in once it takes an action)
• Reward function (the agent does not know what reward it will get in a state)
• Agent's Task
• Learn a policy that maximizes expected rewards (the objective has not changed)
Reinforcement Learning (RL)
• People and animals learn by interacting with their environment.
• It is active rather than passive.
• Interactions are often sequential: future interactions can depend on earlier ones.
(Biological motivation)
• Reward Hypothesis
• Any goal can be formalized as the outcome of maximizing a cumulative reward.
• We can learn without examples of optimal behaviour. Instead, we optimise some reward signal.
Reinforcement Learning: Setup
(Figure: the agent-environment loop; the agent receives state s and reward r from the Environment and sends actions a back to it.)
Key characteristic of reinforcement learning
• Only evaluative feedback is present.
• The agent takes an action and is provided feedback (reward).
• The agent is not told which action it should take in a state.
Reinforcement Learning: Approaches
• Different learning agents
• Utility-based agent
• Learns a value/utility function on states and uses it to select actions that maximize the expected outcome utility.
• Q-learning agent
• Learns an action-utility function (or Q-function) giving the expected utility of taking a given action in a given state.
• Reflex agent
• Directly learns a mapping from states to actions (a policy).
• Approaches
• Passive Learning
• The agent's policy is fixed. The task is to learn the utilities of states (or state-action pairs); this could involve learning a model of the environment.
• It cannot select the actions during training.
• Active Learning
• The agent can select actions; it must also learn what actions to take.
• Issue of exploration: an agent must experience as much as possible of the environment in order to learn how to behave in it.
Passive Reinforcement Learning
• Setup
• Input: a fixed policy π(s)
• Don't know the transitions T(s,a,s')
• Don't know the rewards R(s,a,s')
• The agent executes a set of trials or episodes using its policy π(s).
• In each episode or trial it starts in a state and experiences a sequence of state transitions and rewards till it reaches a terminal state.
• Goal: learn (estimate) the state values Vπ(s)
The learner is provided with a policy (it cannot change it); using that policy it executes trials or episodes in the world. The goal is to determine the value of states.
Model-Based Reinforcement Learning
• Model-Based Idea
• Learn an approximate model (R(), T()) based on experiences
• Compute the value function using the learned model (as if it were correct).
Example: Model-Based Learning
Input Policy (gridworld with states A, B, C, D, E; assume γ = 1)

Observed Episodes (Training)
Episode 1: B, east, C, -1 / C, east, D, -1 / D, exit, x, +10
Episode 2: B, east, C, -1 / C, east, D, -1 / D, exit, x, +10
Episode 3: E, north, C, -1 / C, east, D, -1 / D, exit, x, +10
Episode 4: E, north, C, -1 / C, east, A, -1 / A, exit, x, -10

Learned Model
T(s,a,s'): T(B, east, C) = 1.00, T(C, east, D) = 0.75, T(C, east, A) = 0.25, ...
R(s,a,s'): R(B, east, C) = -1, R(C, east, D) = -1, R(D, exit, x) = +10, ...
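A minimal Python sketch of this counting/averaging estimation (the episode format, function name, and variable names are illustrative assumptions, not from the slides):

```python
from collections import defaultdict

def estimate_model(episodes):
    """Estimate T(s,a,s') by normalized counts and R(s,a,s') by averaging
    observed rewards. `episodes` is a list of episodes, each a list of
    (s, a, s_next, r) tuples (illustrative format)."""
    counts = defaultdict(lambda: defaultdict(int))   # (s,a) -> {s': count}
    rewards = defaultdict(list)                      # (s,a,s') -> [r, ...]
    for episode in episodes:
        for (s, a, s_next, r) in episode:
            counts[(s, a)][s_next] += 1
            rewards[(s, a, s_next)].append(r)
    T, R = {}, {}
    for (s, a), nexts in counts.items():
        total = sum(nexts.values())
        for s_next, n in nexts.items():
            T[(s, a, s_next)] = n / total
            R[(s, a, s_next)] = sum(rewards[(s, a, s_next)]) / n
    return T, R

# Episodes from the slide: T(C, east, D) comes out 0.75 and T(C, east, A) 0.25.
episodes = [
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("B", "east", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "D", -1), ("D", "exit", "x", +10)],
    [("E", "north", "C", -1), ("C", "east", "A", -1), ("A", "exit", "x", -10)],
]
T_hat, R_hat = estimate_model(episodes)
```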
Toy example: Model-Based vs Model-Free Estimation
Goal: Compute the expected age of COL333/671 students.
• If P(A) is known, the expectation can be computed directly.
• Model-based: estimate P(A) from samples; eventually we learn the right model, which gives the right estimates.
• Model-free: bypass the model construction and average the samples directly; averaging works because the samples appear with the right frequencies.
Model-Based vs. Model Free RL
Model-based RL:
• Explore environment and learn model, T=P(s’|s,a) and R(s,a), (almost)
everywhere. Use model to plan a policy (solving the MDP)
• Suitable when the state-space is manageable
Model-free RL:
• Do not learn a model; learn value function or policy directly
• Suitable when the state space is large
Toy Example: Monte Carlo Method
Input Policy (same gridworld with states A, B, C, D, E; assume γ = 1)

Observed Episodes (Training)
Episode 1: B, east, C, -1 / C, east, D, -1 / D, exit, x, +10
Episode 2: B, east, C, -1 / C, east, D, -1 / D, exit, x, +10
Episode 3: E, north, C, -1 / C, east, D, -1 / D, exit, x, +10
Episode 4: E, north, C, -1 / C, east, A, -1 / A, exit, x, -10

Value/utility of state C: V(C) = (9 + 9 + 9 + (-11)) / 4 = 4

Output Values: V(A) = -10, V(B) = +8, V(C) = +4, V(D) = +10, V(E) = -2
First-Visit Monte Carlo (FVMC) Policy Evaluation
Pseudo-code from Sutton and Barto (Reinforcement Learning) Ch 5 (Sec. 5.1)
First-Visit Monte Carlo (FVMC)
• First-Visit Monte Carlo (FVMC)
• Averages the returns following the first visit to a state s in the episode.
• Every-visit Monte Carlo (EVMC)
• Averages returns following all the visits to s.
• Convergence
• FVMC – error falls as 1/N(s). Needs lots of data
• EVMC - error falls quadratically, slightly better data efficiency.
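A minimal Python sketch of first-visit Monte Carlo policy evaluation, consistent with the Sutton and Barto pseudocode referenced above (the episode format and names are illustrative assumptions):

```python
from collections import defaultdict

def first_visit_mc(episodes, gamma=1.0):
    """First-visit Monte Carlo policy evaluation.
    Each episode is a list of (state, reward) pairs, where the reward is the
    one received on leaving that state (illustrative format). Returns the
    estimated value V(s) for each state."""
    returns = defaultdict(list)
    for episode in episodes:
        # Index of the first visit to each state in this episode.
        first_visit = {}
        for t, (s, _) in enumerate(episode):
            first_visit.setdefault(s, t)
        # Return G_t following each time step, computed backwards.
        G_after = [0.0] * (len(episode) + 1)
        for t in range(len(episode) - 1, -1, -1):
            s, r = episode[t]
            G_after[t] = r + gamma * G_after[t + 1]
        for s, t in first_visit.items():
            returns[s].append(G_after[t])
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

# Episodes from the toy example above:
episodes = [
    [("B", -1), ("C", -1), ("D", +10)],
    [("B", -1), ("C", -1), ("D", +10)],
    [("E", -1), ("C", -1), ("D", +10)],
    [("E", -1), ("C", -1), ("A", -10)],
]
V = first_visit_mc(episodes)   # V["C"] comes out to 4.0, as on the slide
```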
Monte Carlo Methods: Pros and Cons
• Pros
• Does not estimate a model. It does not require any knowledge of T(), R().
• With a large number of runs it computes the correct average values, using just sample transitions.
• Cons
• Each state must be learned separately; this loses the state-connection information.
• The estimate of one state does not take advantage of the estimates of the other states.
• Note: the Bellman equations tell us that the value function for states has a recursive relationship.
• Can only be used in an episodic setting.
(Values from the toy example: V(A) = -10, V(B) = +8, V(C) = +4, V(D) = +10, V(E) = -2.)
• Problem: we have lost the connection between states. If B and E both go to C under this policy, how can their values be different?
Temporal difference (TD) learning, introduced next:
• Can be used in episodic or infinite-horizon non-episodic settings.
• Immediately updates the estimate of V(s) after each (s, a, s', r) tuple.
"If one had to identify one idea that is central and novel to reinforcement learning, it would undoubtedly be temporal-difference (TD) learning", Sutton and Barto 2017
Temporal Difference (TD) Learning
• Temporal difference learning of values
• The policy π is still fixed.
• Move values toward the value of the successor state that is encountered: take action π(s) in s and observe the successor s'.
• Keep a running average.
• Don't need to store all the experience to build models; it is a model-free approach.
What is observed: samples for V(s), e.g. B, east, C, -2 and then C, east, D, -2 (assume γ = 1, α = 1/2).
Worked example (grid with states A / B C D / E, where V(D) = 8 and all other values start at 0):
• After observing B, east, C, -2: V(B) ← (1/2)·0 + (1/2)·(-2 + V(C)) = -1.
• After observing C, east, D, -2: V(C) ← (1/2)·0 + (1/2)·(-2 + V(D)) = 3.
TD Learning (intuitively)
- Nudge our prior estimate of the value function for a state using the given experience.
- Shift the estimate based on the error between what we are experiencing and the estimate we had before.
- Weighted by the learning rate.
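In symbols, the TD(0) update being described (standard form, consistent with the worked example above):

```latex
\text{sample} = R(s,\pi(s),s') + \gamma\, V^{\pi}(s'), \qquad
V^{\pi}(s) \leftarrow (1-\alpha)\,V^{\pi}(s) + \alpha\cdot\text{sample}
\;=\; V^{\pi}(s) + \alpha\big(\text{sample} - V^{\pi}(s)\big)
```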
TD Value Learning: Problems
• TD Value Learning
• Model-free approach to perform policy evaluation
• Incorporates Bellman updates with running sample averages
• Output of TD Value Learning
• TD Value learning outputs the value function
• How to turn the learned values into a new policy?
• Can use the following relationships (see the equations below):
• Q-Value Iteration
• Start with Q0(s,a) = 0.
• Given Qk, calculate the depth k+1 q-values for all q-states (update shown below):
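Standard forms of these relations and of the Q-value iteration update:

```latex
\pi(s) = \arg\max_a Q(s,a), \qquad
Q(s,a) = \sum_{s'} T(s,a,s')\big[\,R(s,a,s') + \gamma\, V(s')\,\big]
\\[4pt]
Q_{k+1}(s,a) \leftarrow \sum_{s'} T(s,a,s')\Big[\,R(s,a,s') + \gamma \max_{a'} Q_k(s',a')\,\Big]
```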
Q-Learning
• Sample-based Q-value iteration
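The sample-based update referred to above, in its standard form:

```latex
\text{sample} = R(s,a,s') + \gamma \max_{a'} Q(s',a'), \qquad
Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\cdot\text{sample}
```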
Q-Learning: Procedure
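A minimal Python sketch of the overall procedure (the environment interface, ε-greedy choice, and names here are illustrative assumptions, not the slide's exact pseudocode):

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Tabular Q-learning. `env` is assumed to expose reset() -> state,
    step(state, action) -> (next_state, reward, done), and actions(state)."""
    Q = defaultdict(float)  # (state, action) -> value, initialized to 0
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Epsilon-greedy action selection over the current Q estimates.
            if random.random() < epsilon:
                a = random.choice(env.actions(s))
            else:
                a = max(env.actions(s), key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(s, a)
            # Sample-based Q-value update.
            target = r if done else r + gamma * max(
                Q[(s_next, a_next)] for a_next in env.actions(s_next))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s_next
    return Q
```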
Q-Learning
The agent does not know the rewards a priori and learns the effect of the east action over time. Only the actions we actually take are updated. The agent occasionally falls into the cliff and receives the negative reward. Note that the max of the Q-values is propagated to other states (green values), as the update approximates the optimal Q-value.
Q-Learning: Properties
• Off-policy learning
• Q-learning converges to the optimal policy, even if the agent is acting sub-optimally.
SARSA
• State-Action-Reward-State-Action (SARSA)
• Update using (s, a, r, s’, a’)
• Note
• SARSA waits until an action is taken and backs up the Q-value for that action. Learns
the Q-function from actual transitions.
• On-policy algorithm
• More realistic if the policy is partly controlled by other agents. Learns from more
realistic values.
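The two updates side by side, in standard notation (SARSA uses the action a' actually taken in s'; Q-learning uses the max):

```latex
\text{SARSA:}\quad Q(s,a) \leftarrow Q(s,a) + \alpha\big[\,r + \gamma\, Q(s',a') - Q(s,a)\,\big]
\\[4pt]
\text{Q-learning:}\quad Q(s,a) \leftarrow Q(s,a) + \alpha\big[\,r + \gamma \max_{a'} Q(s',a') - Q(s,a)\,\big]
```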
Active Reinforcement Learning
• Q-learning so far
• Q-learning allows action selection because we are learning the Q(s,a) function.
• This means there is a policy that can be derived from the Q function learned.
• Should the agent follow this policy exactly or should it explore at times?
• Active RL
• The agent can select actions
• Actions play two roles
• A means to collect reward (exploitation)
• Help in acquiring a model of the environment (exploration)
• Exploration vs. Exploitation trade-off
• Act according to the current optimal policy (based on the Q-values), or
• Pick a different action to explore.
• Example: a new tea stall opens in IIT. Should you try the new one or stick to the old one? The goal is to maximize tea utility over time.
Exploration Strategy
• How to force exploration?
• An ε-greedy approach
• Every time step: either pick a random action or act on the current policy.
• With (small) probability ε, pick a random action.
• With (large) probability 1-ε, act based on the current policy (based on the current Q-values in the table that the agent is updating).
• What is the problem with ε-greedy?
• It takes a long time to explore.
• Exploration is not directed towards states about which we have less information.
Exploration Functions
• How to direct exploration towards state-action pairs that are less explored?
• Exploration function
• Trades off exploitation (preference for high values of u) vs. exploration (preference
for actions that have not been tried out as yet).
• f(u,n): increasing in u (the value estimate) and decreasing in n (the visit count).
• Effects (with the modified update shown below)
• The lower N(s',a') is, the higher the exploration bonus.
• The exploration bonus makes those states favorable which lead to unknown states (propagation).
• This has a cascading effect when an exploration action is taken.
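A minimal sketch of one common choice of exploration function and the corresponding modified update (this particular form, with bonus k/n, is an assumption in the style of the Berkeley CS188 treatment and need not match the slide's exact equation):

```latex
f(u,n) = u + \frac{k}{n}, \qquad
Q(s,a) \leftarrow (1-\alpha)\,Q(s,a) + \alpha\Big[\,R(s,a,s') + \gamma \max_{a'} f\big(Q(s',a'),\, N(s',a')\big)\,\Big]
```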
Multi-arm Bandits
Multi-armed bandits are equivalent to a one-state MDP. The goal is to learn a policy that picks actions such that the expected rewards are maximized.
Source: Emma Brunskill (CS232 Course) Lecture 10
Toy Example: Treatment planning
• Consider deciding how to best treat patients with broken toes.
• Imagine we have 3 possible options:
• Do surgery
• Perform buddy taping of the broken toe with another toe
• Do nothing
• The outcome measure / reward is a binary variable: whether the toe has healed (reward +1) or not healed (reward 0) after 6 weeks, as assessed by x-ray.
• We can model this problem as a multi-arm bandit problem with 3 arms.
• Imagine the true (unknown) Bernoulli reward parameters for each arm (action) are fixed.
Source: Emma Brunskill (CS232 Course) Lecture 10
Toy Example: Treatment Planning
• Greedy actions are those that look best at the present, but some of the
other actions may be better.
• Epsilon-greedy forces the non-greedy actions to be tried.
• Indiscriminately, with no preference for those actions that are nearly greedy or
particularly uncertain.
• It would be better to select among the non-greedy actions according to
their potential for being optimal.
• Take into account how close their estimates are to being maximal and the
uncertainties in their estimates.
Upper Confidence Bound
• Upper Confidence Bound (UCB) for action
selection
• Square-root term is a measure of uncertainty
or variance in the estimate of the action’s
value.
• When a is selected, Nt(a) is incremented; the uncertainty reduces as the denominator increases.
• Each time a is not selected, t increases but Nt(a) does not; the uncertainty estimate increases (numerator).
• Natural logarithm
• Increases get smaller over time but are
unbounded; all actions will eventually be selected.
• Actions with lower value estimates will be selected
with lower frequency.
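The UCB action-selection rule being described (standard form from Sutton and Barto, Ch. 2; c is an exploration constant):

```latex
A_t = \arg\max_a \left[\, Q_t(a) + c\,\sqrt{\frac{\ln t}{N_t(a)}}\, \right]
```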
Upper Confidence Bound
Source: Emma Brunskill (CS232 Course)
Example
Example of k-armed bandits in practice. The figure shows a 10-armed bandit test bed in which the agent learns the rewards for the 10 options. The average reward collected goes up as learning progresses.
Figure from S&B Book Chapter 2
Problem of Generalization
• Problem when the state space is very large
• Visiting all states is not possible at training time.
• Memory issue: cannot fit the Q-table in memory.
• Need a succinct representation of the state.
(Example: when the state is raw pixels, the state space is so large that we cannot experience all states.)
• Feature-based Representations
• Features or properties are functions from states to real numbers (often 0/1) that capture important properties of the state.
• Example features that can be computed from the state:
• Distance to the closest ghost
• Distance to the closest dot
• Number of ghosts
(Two states may be essentially similar, but for tabular Q-learning each is a different entry; the agent does not know that in essence these states are the same.)
Linear Value Functions
• Using a feature representation, a q function (or value function) can be expressed for any state
using a set of weights:
• Now the goal of Q-learning is to estimate these weights from experience. Once the weights
are learned the resulting Q-values will hopefully be close to the true Q values.
• Main consequence:
• A new state that shares features with a previously seen state will have the same Q-values.
• Disadvantage
• If there are too few features, actually different states (good and bad states) may start looking the same.
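The linear form being referred to, in standard notation (f_1, ..., f_n are the features and w_1, ..., w_n their weights):

```latex
V(s) = w_1 f_1(s) + w_2 f_2(s) + \cdots + w_n f_n(s), \qquad
Q(s,a) = w_1 f_1(s,a) + w_2 f_2(s,a) + \cdots + w_n f_n(s,a)
```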
Approximate Q-learning
• Q-learning with linear Q-functions:
• Exact Q-function updates vs. approximate Q-function updates: the error (the "target" minus the "prediction") drives the update (see the equations below).
• Interpretation:
• Adjust the weights of the active features.
• If the difference is positive (what we get is higher than what we predicted) and the feature value is 1, then the weight increases and the Q-value increases. No change if the feature value is 0.
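In standard notation, the "difference" is the target minus the prediction, and the two updates the slide contrasts are:

```latex
\text{difference} = \big[\, r + \gamma \max_{a'} Q(s',a')\,\big] - Q(s,a)
\\[4pt]
\text{Exact Q update:}\quad Q(s,a) \leftarrow Q(s,a) + \alpha\cdot\text{difference}
\\[4pt]
\text{Approximate (feature-weight) update:}\quad w_i \leftarrow w_i + \alpha\cdot\text{difference}\cdot f_i(s,a)
```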
Deep Q-Networks (DQNs)
(Example figure: the goal is a policy to move the base.)
Modeling the Q-function via a neural network: what we want to estimate is the utility of this state and of each action we can take in it.
Training the Deep Q Network
How to train a network to predict the Q-value for an action in a state?
Key idea: turn the Q-learning equation into a loss. Optimize using back-propagation.
Need lots of data to train! Obtain roll-outs from the simulator.
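One standard way to write the resulting loss (a sketch: θ are the network weights and θ⁻ a periodically-frozen copy used for the target, as in the original DQN work; the slide may use a simpler variant):

```latex
\mathcal{L}(\theta) = \mathbb{E}_{(s,a,r,s')}\Big[\big(\, r + \gamma \max_{a'} Q(s',a';\theta^{-}) - Q(s,a;\theta)\,\big)^2\Big]
```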
Conclusions
• Learning from reinforcement applies to problems where there is no knowledge of which states have good rewards or how the actions lead to new states.
• There are various approaches for performing RL: estimating a model, estimating the value function, or directly optimizing the policy (not covered yet). They vary in whether we learn from the whole episode or from each transition.
• Data is essentially the experience of the agent. The feedback is evaluative, not prescriptive; this makes RL different from supervised learning.