Lecture3__InsideAnAgent

The document discusses the components and functions of an agent in a reinforcement learning context, including agent state, policy, value functions, and models. It differentiates between fully observable and partially observable environments, explaining how agents construct their state representations based on history and beliefs. Additionally, it categorizes agents into value-based, policy-based, actor-critic, model-free, and model-based types, highlighting their unique characteristics and learning approaches.

Inside an Agent

Inside an Agent
• Agent State

• Policy

• Value functions

• Model
Agent Environment Loop
Agent state
• Everything the agent takes with it from one time step to
another.
Agent state
• Agent state is the
information used
to determine what
happens next

• Formally, state is a
function of the
history: St = f (Ht)
History
• The history is the sequence of observations, actions,
rewards

Ht = O1, R1, A1, ..., At−1, Ot, Rt

– i.e. all observable variables up to time t


– E.g. the sensorimotor stream of a robot or embodied agent

• What happens next depends on the history:


– The agent selects actions
– The environment selects observations/rewards
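
As a rough sketch (not from the slides), the history can be kept as a growing list of (observation, reward, action) tuples, and the agent state computed as some function f of it; here f hypothetically keeps only the most recent observation:

```python
# Minimal sketch: the history as a growing list of (observation, reward, action)
# tuples, and an agent state S_t = f(H_t) built from it.
from typing import Any, List, Tuple

History = List[Tuple[Any, float, Any]]  # (O_t, R_t, A_t) entries

def agent_state(history: History) -> Any:
    """S_t = f(H_t); here f hypothetically keeps only the latest observation."""
    last_obs, _, _ = history[-1]
    return last_obs

history: History = []
history.append(("obs_1", 0.0, "action_1"))   # O_1, R_1, A_1
history.append(("obs_2", 1.0, None))         # O_2, R_2 (next action not chosen yet)
print(agent_state(history))                  # -> "obs_2"
```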
Environment State
Information State
• An information state (a.k.a. Markov state) contains all
useful information from the history.

• “The future is independent of the past given the present”
• Once the state is known, the history may be thrown
away
Rat Example
Fully Observable Environment
• Agent directly observes environment state
• Agent state = environment state = information
state
• Formally, this is a Markov decision process
(MDP)
Partially Observable Environments
• Partial observability: agent indirectly observes
environment
– A robot with camera vision isn’t told its absolute
location
– A poker playing agent only observes public cards
Partially Observable Environments
• Now agent state != environment state
• Formally this is a partially observable Markov
decision process (POMDP)
• The agent must construct its own state representation based on, for example:
– The complete history
– Beliefs about the environment state
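
A rough sketch of one way such a state could be built recursively from the previous state, the previous action, and the new observation, St = u(St−1, At−1, Ot); the exponential-moving-average "belief" used here is purely illustrative:

```python
# Sketch of a recurrent agent-state update for a partially observable
# environment: S_t = u(S_{t-1}, A_{t-1}, O_t).
import numpy as np

def update_state(prev_state: np.ndarray,
                 prev_action: np.ndarray,
                 observation: np.ndarray,
                 mix: float = 0.9) -> np.ndarray:
    """Fold the newest observation (and last action) into the running state."""
    return mix * prev_state + (1.0 - mix) * np.concatenate([observation, prev_action])

state = np.zeros(4)                                   # initial agent state
obs, action = np.array([0.2, 0.7]), np.array([1.0, 0.0])
state = update_state(state, action, obs)              # agent state != environment state
print(state)
```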
Partially Observable Environments
Partially Observable Environments
Partially Observable Environments
Policy
• A policy is the agent’s behavior

• It is a map from state to action

• Deterministic policy: a = π(s)


– A deterministic policy outputs an action with probability one.

• Stochastic policy: π(a|s) = P[At = a|St = s]


– A stochastic policy outputs a probability distribution over actions given a state.

• Football scenario
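
A minimal sketch of the two kinds of policy, using made-up football-style states and actions (the tables and probabilities are purely illustrative):

```python
# Sketch: a deterministic policy a = pi(s) and a stochastic policy pi(a|s)
# over a toy set of football-style actions.
import random

# Deterministic policy: each state maps to exactly one action (probability one).
deterministic_pi = {"near_goal": "shoot", "midfield": "pass"}

# Stochastic policy: each state maps to a distribution over actions.
stochastic_pi = {
    "near_goal": {"pass": 0.1, "shoot": 0.8, "dribble": 0.1},
    "midfield":  {"pass": 0.6, "shoot": 0.1, "dribble": 0.3},
}

def act_deterministic(state: str) -> str:
    return deterministic_pi[state]

def act_stochastic(state: str) -> str:
    probs = stochastic_pi[state]
    return random.choices(list(probs), weights=list(probs.values()))[0]

print(act_deterministic("near_goal"))  # always "shoot"
print(act_stochastic("midfield"))      # sampled from pi(a | "midfield")
```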
Trajectory
• Trajectory τ is a sequence of states and actions
State Value function
• The state-value Vπ(s) is the expected total reward when starting from state s and acting according to policy π.
• Used to evaluate the goodness/badness of
states and therefore to select between actions

Vπ(s) = E[ Rt+1 + γ Rt+2 + γ² Rt+3 + … | St = s, π ]

• γ is called the discount factor and lies in the interval [0, 1]
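
A small sketch of the discounted return Gt = Rt+1 + γ Rt+2 + γ² Rt+3 + … for a toy reward sequence, showing how γ trades short-term against long-term rewards (the reward values are made up):

```python
# Sketch: the discounted return for a toy reward sequence, for different gamma.
def discounted_return(rewards, gamma):
    """G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ..."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

rewards = [1.0, 0.0, 0.0, 10.0]           # a large reward arrives late
print(discounted_return(rewards, 0.0))    # 1.0   (myopic: only the first reward counts)
print(discounted_return(rewards, 0.5))    # 2.25  (later rewards heavily discounted)
print(discounted_return(rewards, 1.0))    # 11.0  (undiscounted: all rewards count equally)
```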


State Value function
• Every action at a subsequent time step is
selected based on the policy π.

• The conditioning is not based on the sequence of actions but on the policy, which chooses an action based on the state.

• The value of a state will change if the policy is


changed.
State Value function
• The discount factor gives more importance to short-term rewards and less importance to long-term rewards.

• Extreme cases
– γ = 0 (myopic policy)
– γ = 1 (undiscounted policy)

• If we find an optimal value function, an optimal policy can be derived from it.
State Value function
Vπ(s) = E[ Rt+1 + γ Rt+2 + γ² Rt+3 + … | St = s ]

is updated as

Vπ(s) = E[ Rt+1 + γ Vπ(St+1) | St = s, At = π(s) ]

This is called the Bellman equation.
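
A minimal sketch of iterative policy evaluation on a made-up two-state MDP, repeatedly applying the Bellman equation as an update rule (the transition table, policy, and γ are all illustrative):

```python
# Sketch: iterative policy evaluation on a tiny, made-up two-state MDP,
# applying v(s) <- E[R + gamma * v(s') | s, a = pi(s)] until values settle.
gamma = 0.9

# Hypothetical deterministic MDP: (state, action) -> (next_state, reward).
transitions = {
    ("A", "go"):   ("B", 0.0),
    ("B", "stay"): ("B", 1.0),
}
policy = {"A": "go", "B": "stay"}        # a fixed deterministic policy pi(s)

v = {"A": 0.0, "B": 0.0}                 # initial value estimates
for _ in range(200):                     # sweep until (approximately) converged
    for s in v:
        s_next, r = transitions[(s, policy[s])]
        v[s] = r + gamma * v[s_next]     # Bellman update for this deterministic MDP

print(v)   # v(B) ~= 1 / (1 - gamma) = 10, v(A) ~= gamma * v(B) = 9
```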


Optimal state-value function
• It has the highest possible value, compared to every other value function, for all states

• If we know the optimal value function, then the policy that corresponds to the optimal value function is the optimal policy π*
Action value function
• The action value Qπ(s, a) is the expected return for an agent that starts from state s, takes an arbitrary action a, and forever after acts according to policy π.

• The optimal Q-function Q*(s, a) is the highest possible Q-value for an agent starting from state s and choosing action a.

• Q*(s, a) indicates how good it is for the agent to pick action a while in state s.
Optimal Q and Optimal V
• If we know the optimal Q-function Q*(s, a), the optimal
policy can be easily extracted by choosing the action a that
gives maximum Q*(s, a) for state s.

• Since V*(s) is the maximum expected total reward when starting from state s, it is the maximum of Q*(s, a) over all possible actions: V*(s) = maxₐ Q*(s, a).
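
A small sketch of both facts, assuming a hypothetical optimal Q-table: the greedy policy picks the action with the largest Q*(s, a), and V*(s) is the maximum of Q*(s, a) over actions:

```python
# Sketch: extracting the greedy policy and V* from a (hypothetical) optimal
# Q-table: pi*(s) = argmax_a Q*(s, a) and V*(s) = max_a Q*(s, a).
q_star = {
    "s1": {"left": 1.0, "right": 3.0},
    "s2": {"left": 2.5, "right": 0.5},
}

pi_star = {s: max(qa, key=qa.get) for s, qa in q_star.items()}
v_star  = {s: max(qa.values())    for s, qa in q_star.items()}

print(pi_star)   # {'s1': 'right', 's2': 'left'}
print(v_star)    # {'s1': 3.0, 's2': 2.5}
```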
Model
• A model predicts what the environment will
do next
• P predicts the next state, e.g. P(s′ | s, a) = P[St+1 = s′ | St = s, At = a]
• R predicts the next (immediate) reward, e.g. R(s, a) = E[Rt+1 | St = s, At = a]
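
One way (among others) to obtain such a model from interaction with the environment is to count transitions and average rewards; the sketch below is illustrative, not a method prescribed by the slides:

```python
# Sketch: a simple tabular model estimated from experience by counting,
# giving P(s' | s, a) and the average immediate reward R(s, a).
from collections import defaultdict

counts  = defaultdict(lambda: defaultdict(int))   # counts[(s, a)][s_next]
rewards = defaultdict(list)                       # rewards[(s, a)] -> observed rewards

def record(s, a, r, s_next):
    counts[(s, a)][s_next] += 1
    rewards[(s, a)].append(r)

def predicted_next_state_dist(s, a):
    total = sum(counts[(s, a)].values())
    return {s_next: n / total for s_next, n in counts[(s, a)].items()}

def predicted_reward(s, a):
    rs = rewards[(s, a)]
    return sum(rs) / len(rs)

record("A", "go", 0.0, "B")
record("A", "go", 0.0, "B")
record("A", "go", 1.0, "C")
print(predicted_next_state_dist("A", "go"))  # {'B': 0.666..., 'C': 0.333...}
print(predicted_reward("A", "go"))           # 0.333...
```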
Example
Policy

Arrows represent policy π(s) for each state s


Value Function

Numbers represent value vπ(s) of each state s


Model

• May be obtained by interacting with the environment


Agent categories
• Value Based
– The agent has a notion of Value Function
– No explicit policy is available
– The objective is to minimize the loss between the predicted value and the target value, so as to approximate the true action-value function
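
As one concrete (but not the only) instance of this idea, a tabular Q-learning update nudges the predicted value toward a target value; the slide does not fix a particular algorithm, and all numbers below are illustrative:

```python
# Sketch: a tabular Q-learning update, one way of shrinking the error
# between the predicted value and a target value.
from collections import defaultdict

alpha, gamma = 0.1, 0.99
Q = defaultdict(float)   # Q[(state, action)]; the policy is implicit (greedy over Q)

def q_update(s, a, r, s_next, actions):
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])   # move prediction toward target

q_update("s1", "right", 1.0, "s2", actions=["left", "right"])
print(Q[("s1", "right")])   # 0.1 after one update from a zero-initialised table
```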
Agent categories
• Policy Based
– The agent has a notion of Policy
– No Value Function
– It learns the policy directly without learning the
value function.
– A stochastic policy outputs a probability
distribution over actions.
– Here, the policy is parameterized and the best
parameters are learnt to give the optimal policy
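
A minimal REINFORCE-style sketch of learning the policy parameters directly, using a softmax policy over two actions in a single bandit-like state (the task, seed, and learning rate are made up):

```python
# Sketch: a REINFORCE-style update for a parameterised softmax policy,
# theta <- theta + lr * G * grad log pi(action); purely illustrative.
import numpy as np

theta = np.zeros(2)                      # one preference per action

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def reinforce_step(action, ret, lr=0.1):
    global theta
    probs = softmax(theta)
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0           # d/dtheta log softmax(theta)[action]
    theta = theta + lr * ret * grad_log_pi

rng = np.random.default_rng(0)
for _ in range(500):
    a = rng.choice(2, p=softmax(theta))
    G = 1.0 if a == 1 else 0.0           # hypothetical task: action 1 pays off
    reinforce_step(a, G)

print(softmax(theta))                    # probability of action 1 should now be high
```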
Agent categories
• Policy Based
Agent categories
• Actor Critic
– Has two networks: an actor and a critic
– The actor decides which action should be taken
– The critic informs the actor how good the action was and how it should adjust
– The actor learns via a policy gradient approach
– The critic evaluates the action produced by the actor by computing the value function
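
A minimal tabular sketch of one actor-critic update, where the critic's TD error tells the actor how good the chosen action was (the sizes, learning rates, and transition are illustrative):

```python
# Sketch: one tabular actor-critic update step.
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

n_states, n_actions = 3, 2
actor_prefs = np.zeros((n_states, n_actions))   # actor: softmax action preferences
critic_v    = np.zeros(n_states)                # critic: state-value estimates

def actor_critic_step(s, a, r, s_next, gamma=0.99, lr_actor=0.1, lr_critic=0.1):
    td_error = r + gamma * critic_v[s_next] - critic_v[s]   # critic's evaluation of the action
    critic_v[s] += lr_critic * td_error                     # critic updates its value function
    grad_log_pi = -softmax(actor_prefs[s])
    grad_log_pi[a] += 1.0
    actor_prefs[s] += lr_actor * td_error * grad_log_pi     # actor follows the policy gradient

actor_critic_step(s=0, a=1, r=1.0, s_next=1)
print(critic_v[0], softmax(actor_prefs[0]))
```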
Agent categories
• Model Free
– No explicit model of the environment.
– Has value function and/or policy

• Model based
– The agent has the model of the environment.
– Optionally has value function and/or policy
Thank You
