Lecture3__InsideAnAgent

The document discusses the components and functions of an agent in a reinforcement learning context, including agent state, policy, value functions, and models. It differentiates between fully observable and partially observable environments, explaining how agents construct their state representations based on history and beliefs. Additionally, it categorizes agents into value-based, policy-based, actor-critic, model-free, and model-based types, highlighting their unique characteristics and learning approaches.

Inside an Agent

Inside an Agent
• Agent State

• Policy

• Value functions

• Model
Agent Environment Loop
Agent state
• Everything the agent takes with it from one time step to
another.
Agent state
• Agent state is the
information used
to determine what
happens next

• Formally, state is a
function of the
history: St = f (Ht)
History
• The history is the sequence of observations, actions,
rewards

Ht = O1, R1, A1, ..., At−1, Ot, Rt

– i.e. all observable variables up to time t


– E.g. the sensorimotor stream of a robot or embodied agent

• What happens next depends on the history:


– The agent selects actions
– The environment selects observations/rewards
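
As a rough sketch (not from the slides), the history can be kept as a growing list of (observation, reward, action) tuples, and the agent state computed as some function f of it; here f hypothetically keeps only the most recent observation:

```python
# Minimal sketch: the history as a growing list of (observation, reward, action)
# tuples, and an agent state S_t = f(H_t) built from it.
from typing import Any, List, Tuple

History = List[Tuple[Any, float, Any]]  # (O_t, R_t, A_t) entries

def agent_state(history: History) -> Any:
    """S_t = f(H_t); here f hypothetically keeps only the latest observation."""
    last_obs, _, _ = history[-1]
    return last_obs

history: History = []
history.append(("obs_1", 0.0, "action_1"))   # O_1, R_1, A_1
history.append(("obs_2", 1.0, None))         # O_2, R_2 (next action not chosen yet)
print(agent_state(history))                  # -> "obs_2"
```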
Environment State
Information State
• An information state (a.k.a. Markov state) contains all
useful information from the history.

• “The future is independent of the past given the present”
• Once the state is known, the history may be thrown
away
Rat Example
Fully Observable Environment
• Agent directly observes environment state
• Agent state = environment state = information
state
• Formally, this is a Markov decision process
(MDP)
Partially Observable Environments
• Partial observability: agent indirectly observes
environment
– A robot with camera vision isn’t told its absolute
location
– A poker playing agent only observes public cards
Partially Observable Environments
• Now agent state != environment state
• Formally this is a partially observable Markov
decision process (POMDP)
• The agent must construct its own state representation based on, for example:
– The complete history
– Beliefs about the environment state
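
A rough sketch of one way such a state could be built recursively from the previous state, the previous action, and the new observation, St = u(St−1, At−1, Ot); the exponential-moving-average "belief" used here is purely illustrative:

```python
# Sketch of a recurrent agent-state update for a partially observable
# environment: S_t = u(S_{t-1}, A_{t-1}, O_t).
import numpy as np

def update_state(prev_state: np.ndarray,
                 prev_action: np.ndarray,
                 observation: np.ndarray,
                 mix: float = 0.9) -> np.ndarray:
    """Fold the newest observation (and last action) into the running state."""
    return mix * prev_state + (1.0 - mix) * np.concatenate([observation, prev_action])

state = np.zeros(4)                                   # initial agent state
obs, action = np.array([0.2, 0.7]), np.array([1.0, 0.0])
state = update_state(state, action, obs)              # agent state != environment state
print(state)
```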
Partially Observable Environments
Partially Observable Environments
Partially Observable Environments
Policy
• A policy is the agent’s behavior

• It is a map from state to action

• Deterministic policy: a = π(s)


– A deterministic policy outputs an action with probability one.

• Stochastic policy: π(a|s) = P[At = a|St = s]


– A stochastic policy outputs a probability distribution over actions given a state.

• Football scenario
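
A minimal sketch of the two kinds of policy, using made-up football-style states and actions (the tables and probabilities are purely illustrative):

```python
# Sketch: a deterministic policy a = pi(s) and a stochastic policy pi(a|s)
# over a toy set of football-style actions.
import random

# Deterministic policy: each state maps to exactly one action (probability one).
deterministic_pi = {"near_goal": "shoot", "midfield": "pass"}

# Stochastic policy: each state maps to a distribution over actions.
stochastic_pi = {
    "near_goal": {"pass": 0.1, "shoot": 0.8, "dribble": 0.1},
    "midfield":  {"pass": 0.6, "shoot": 0.1, "dribble": 0.3},
}

def act_deterministic(state: str) -> str:
    return deterministic_pi[state]

def act_stochastic(state: str) -> str:
    probs = stochastic_pi[state]
    return random.choices(list(probs), weights=list(probs.values()))[0]

print(act_deterministic("near_goal"))  # always "shoot"
print(act_stochastic("midfield"))      # sampled from pi(a | "midfield")
```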
Trajectory
• Trajectory τ is a sequence of states and actions
State Value function
• The state-value Vπ(s) is the expected total reward when starting from state s and acting according to policy π.
• Used to evaluate the goodness/badness of
states and therefore to select between actions

Vπ(s) = E[ Rt+1 + γ Rt+2 + γ² Rt+3 + … | St = s, π ]

• γ is called the discount factor and lies in the interval [0, 1]
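
A small sketch of the discounted return Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + … for a toy reward sequence, showing how γ trades short-term against long-term rewards (the reward values are made up):

```python
# Sketch: the discounted return for a toy reward sequence, for different gamma.
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [1.0, 0.0, 0.0, 10.0]           # a large reward arrives late
print(discounted_return(rewards, 0.0))    # 1.0   (myopic: only the first reward counts)
print(discounted_return(rewards, 0.5))    # 2.25  (later rewards heavily discounted)
print(discounted_return(rewards, 1.0))    # 11.0  (undiscounted: all rewards count equally)
```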


State Value function
• Every action at a subsequent time step is
selected based on the policy π.

• The conditioning is not based on the sequence of actions but on the policy, which chooses an action based on the state.

• The value of a state will change if the policy is


changed.
State Value function
• The discount factor gives more importance to short-term rewards and less importance to long-term rewards.

• Extreme cases
– γ = 0 (myopic policy)
– γ = 1 (undiscounted policy)

• If we find an optimal value function, an optimal policy can be derived from it.
State Value function
Vπ(s) = E[ Rt+1 + γ Rt+2 + γ² Rt+3 + … | St = s ]

is updated as

Vπ(s) = E[ Rt+1 + γ Vπ(St+1) | St = s, At = π(s) ]

This is called the Bellman equation.
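
A minimal sketch of iterative policy evaluation on a made-up two-state MDP, repeatedly applying the Bellman equation as an update rule (the transition table, policy, and γ are all illustrative):

```python
# Sketch: iterative policy evaluation on a tiny, made-up two-state MDP,
# applying v(s) <- E[R + gamma * v(s') | s, a = pi(s)] until values settle.
gamma = 0.9

# Hypothetical deterministic MDP: (state, action) -> (next_state, reward).
transitions = {
    ("A", "go"):   ("B", 0.0),
    ("B", "stay"): ("B", 1.0),
}
policy = {"A": "go", "B": "stay"}        # a fixed deterministic policy pi(s)

v = {"A": 0.0, "B": 0.0}                 # initial value estimates
for _ in range(200):                     # sweep until (approximately) converged
    for s in v:
        s_next, r = transitions[(s, policy[s])]
        v[s] = r + gamma * v[s_next]     # Bellman update for this deterministic MDP

print(v)   # v(B) ~= 1 / (1 - gamma) = 10, v(A) ~= gamma * v(B) = 9
```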


Optimal state-value function
• It has the highest possible value, compared to every other value function, for all states

• If we know the optimal value function, then the policy that corresponds to the optimal value function is the optimal policy π*
Action value function
• The action value Qπ(s, a) is the expected return for an agent that starts from state s, takes an arbitrary action a, and forever after acts according to policy π.

• The optimal Q-function Q*(s, a) is the highest possible Q-value for an agent starting from state s and choosing action a.

• Q*(s, a) indicates how good it is for the agent to pick action a while in state s.
Optimal Q and Optimal V
• If we know the optimal Q-function Q*(s, a), the optimal
policy can be easily extracted by choosing the action a that
gives maximum Q*(s, a) for state s.

• Since V*(s) is the maximum expected total reward when starting from state s, it is the maximum of Q*(s, a) over all possible actions: V*(s) = maxₐ Q*(s, a).
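
A small sketch of both facts, assuming a hypothetical optimal Q-table: the greedy policy picks the action with the largest Q*(s, a), and V*(s) is the maximum of Q*(s, a) over actions:

```python
# Sketch: extracting the greedy policy and V* from a (hypothetical) optimal
# Q-table: pi*(s) = argmax_a Q*(s, a) and V*(s) = max_a Q*(s, a).
q_star = {
    "s1": {"left": 1.0, "right": 3.0},
    "s2": {"left": 2.5, "right": 0.5},
}

pi_star = {s: max(qa, key=qa.get) for s, qa in q_star.items()}
v_star  = {s: max(qa.values())    for s, qa in q_star.items()}

print(pi_star)   # {'s1': 'right', 's2': 'left'}
print(v_star)    # {'s1': 3.0, 's2': 2.5}
```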
Model
• A model predicts what the environment will
do next
• P predicts the next state, e.g. P(s′ | s, a) = P[St+1 = s′ | St = s, At = a]
• R predicts the next (immediate) reward, e.g. R(s, a) = E[Rt+1 | St = s, At = a]
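
One way (among others) to obtain such a model from interaction with the environment is to count transitions and average rewards; the sketch below is illustrative, not a method prescribed by the slides:

```python
# Sketch: a simple tabular model estimated from experience by counting,
# giving P(s' | s, a) and the average immediate reward R(s, a).
from collections import defaultdict

counts  = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s_next]
rewards = defaultdict(list)                       # rewards[(s, a)] -> observed rewards

def record(s, a, r, s_next):
    counts[(s, a)][s_next] += 1
    rewards[(s, a)].append(r)

def predicted_next_state_dist(s, a):
    total = sum(counts[(s, a)].values())
    return {s_next: n / total for s_next, n in counts[(s, a)].items()}

def predicted_reward(s, a):
    rs = rewards[(s, a)]
    return sum(rs) / len(rs)

record("A", "go", 0.0, "B")
record("A", "go", 0.0, "B")
record("A", "go", 1.0, "C")
print(predicted_next_state_dist("A", "go"))  # {'B': 0.666..., 'C': 0.333...}
print(predicted_reward("A", "go"))           # 0.333...
```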
Example
Policy

Arrows represent policy π(s) for each state s


Value Function

Numbers represent value vπ(s) of each state s


Model

• May be obtained by interacting with the environment


Agent categories
• Value Based
– The agent has a notion of Value Function
– No explicit policy is available
– The objective is to minimize the loss between the predicted value and the target value, so as to approximate the true action-value function
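
As one concrete (but not the only) instance of this idea, a tabular Q-learning update nudges the predicted value toward a target value; the slide does not fix a particular algorithm, and all numbers below are illustrative:

```python
# Sketch: a tabular Q-learning update, one way of shrinking the error
# between the predicted value and a target value.
from collections import defaultdict

alpha, gamma = 0.1, 0.99
Q = defaultdict(float)   # Q[(state, action)]; the policy is implicit (greedy over Q)

def q_update(s, a, r, s_next, actions):
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])   # move prediction toward target

q_update("s1", "right", 1.0, "s2", actions=["left", "right"])
print(Q[("s1", "right")])   # 0.1 after one update from a zero-initialised table
```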
Agent categories
• Policy Based
– The agent has a notion of Policy
– No Value Function
– It learns the policy directly without learning the
value function.
– A stochastic policy outputs a probability
distribution over actions.
– Here, the policy is parameterized and the best
parameters are learnt to give the optimal policy
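
A minimal REINFORCE-style sketch of learning the policy parameters directly, using a softmax policy over two actions in a single bandit-like state (the task, seed, and learning rate are made up):

```python
# Sketch: a REINFORCE-style update for a parameterised softmax policy,
# theta <- theta + lr * G * grad log pi(action); purely illustrative.
import numpy as np

theta = np.zeros(2)                      # one preference per action

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_step(action, ret, lr=0.1):
    global theta
    probs = softmax(theta)
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0           # d/dtheta log softmax(theta)[action]
    theta = theta + lr * ret * grad_log_pi

rng = np.random.default_rng(0)
for _ in range(500):
    a = rng.choice(2, p=softmax(theta))
    G = 1.0 if a == 1 else 0.0           # hypothetical task: action 1 pays off
    reinforce_step(a, G)

print(softmax(theta))                    # probability of action 1 should now be high
```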
Agent categories
• Policy Based
Agent categories
• Actor Critic
– Has two networks: an actor and a critic
– The actor decides which action should be taken
– The critic informs the actor how good the action was and how it should adjust
– The actor learns via a policy gradient approach
– The critic evaluates the action produced by the actor by computing the value function
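
A minimal tabular sketch of one actor-critic update, where the critic's TD error tells the actor how good the chosen action was (the sizes, learning rates, and transition are illustrative):

```python
# Sketch: one tabular actor-critic update step.
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

n_states, n_actions = 3, 2
actor_prefs = np.zeros((n_states, n_actions))   # actor: softmax action preferences
critic_v    = np.zeros(n_states)                # critic: state-value estimates

def actor_critic_step(s, a, r, s_next, gamma=0.99, lr_actor=0.1, lr_critic=0.1):
    td_error = r + gamma * critic_v[s_next] - critic_v[s]   # critic's evaluation of the action
    critic_v[s] += lr_critic * td_error                     # critic updates its value function
    grad_log_pi = -softmax(actor_prefs[s])
    grad_log_pi[a] += 1.0
    actor_prefs[s] += lr_actor * td_error * grad_log_pi     # actor follows the policy gradient

actor_critic_step(s=0, a=1, r=1.0, s_next=1)
print(critic_v[0], softmax(actor_prefs[0]))
```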
Agent categories
• Model Free
– No explicit model of the environment.
– Has value function and/or policy

• Model based
– The agent has the model of the environment.
– Optionally has value function and/or policy
Thank You
