
Reinforcement Learning

Finite Markov Decision Process

U Kang
Seoul National University
In This Lecture

 Markov Decision Process


 Definition
 Return, state, and action
 Policy and value functions
 Optimality and approximation

Overview

 Markov Decision Process (MDP)


 A classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations (states) and, through those, future rewards
 MDPs involve delayed reward and the need to trade off immediate and delayed reward
 The problem is to estimate the value q*(s, a) of each action a in each state s, or the value v*(s) of each state given optimal action selections

Outline

Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion

Agent-Environment Interface

 Objective: learning from interaction to achieve a goal
 Agent: the learner and decision maker
 Environment: everything outside the agent
 Environment gives rewards, special numerical
values that the agent seeks to maximize over time

(Figure from Sutton and Barto, Reinforcement Learning, 2018)
Agent-Environment Interface

 Consider a sequence of time steps, t = 0, 1, 2, …


 At each time t, the agent receives state 𝑆𝑡 and
reward 𝑅𝑡 (when t != 0), and on that basis selects
an action 𝐴𝑡
 One step later, the agent receives state 𝑆𝑡+1 and
reward 𝑅𝑡+1 , and on that basis selects an action
𝐴𝑡+1
 Sequence of interactions
 𝑆0 , 𝐴0 , 𝑅1 , 𝑆1 , 𝐴1 , 𝑅2 , 𝑆2 , 𝐴2 , …
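As a minimal sketch of this interaction loop (not from the slides; the `env` object with `reset()`/`step()` methods and the `select_action` function are hypothetical placeholders), the sequence S_0, A_0, R_1, S_1, A_1, R_2, ... can be generated like this:

```python
# Minimal sketch of the agent-environment loop producing S0, A0, R1, S1, A1, R2, ...
# `env.reset()`, `env.step(action)`, and `select_action(state)` are hypothetical placeholders.
def run_episode(env, select_action, max_steps=1000):
    trajectory = []                                   # holds (S_t, A_t, R_{t+1}) triples
    state = env.reset()                               # S_0
    for _ in range(max_steps):
        action = select_action(state)                 # A_t, chosen on the basis of S_t
        next_state, reward, done = env.step(action)   # environment emits R_{t+1} and S_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
        if done:                                      # terminal state reached (episodic task)
            break
    return trajectory
```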

MDP

 Finite MDP
 States, actions, and rewards are finite
 p(s′, r | s, a) = Pr{S_t = s′, R_t = r | S_{t−1} = s, A_{t−1} = a}
 The probability distribution p defines the dynamics of the MDP
 p : S × R × S × A → [0, 1]
 Σ_{s′} Σ_r p(s′, r | s, a) = 1, for all s, a

 The probability p(s′, r | s, a) depends only on the preceding state and action, not on earlier ones
 Markov property
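To make the data structure concrete, here is a small sketch (not from the slides) that stores the four-argument dynamics function p(s′, r | s, a) of a finite MDP as a table; the two states, two actions, rewards, and probabilities are hypothetical, and the normalization Σ_{s′,r} p(s′, r | s, a) = 1 is checked:

```python
# Dynamics of a finite MDP as a table: p[(s, a)] maps (s', r) -> probability.
# The states, actions, rewards, and probabilities below are hypothetical.
p = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 0.0): 1.0},
    ("s1", "a0"): {("s0", 2.0): 1.0},
    ("s1", "a1"): {("s1", -1.0): 0.3, ("s0", 0.0): 0.7},
}

# Markov property: p depends only on (s, a); each row must sum to 1.
for (s, a), outcomes in p.items():
    assert abs(sum(outcomes.values()) - 1.0) < 1e-9, (s, a)
```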

MDP

 From p(s′, r | s, a), we can compute anything we want to know about the environment
 State-transition probabilities
 p(s′ | s, a) = Pr{S_t = s′ | S_{t−1} = s, A_{t−1} = a} = Σ_r p(s′, r | s, a)
 Expected reward
 r(s, a) = E[R_t | S_{t−1} = s, A_{t−1} = a] = Σ_r r Σ_{s′} p(s′, r | s, a)
 Expected reward for state-action-next state
 r(s, a, s′) = E[R_t | S_{t−1} = s, A_{t−1} = a, S_t = s′] = Σ_r r · p(s′, r | s, a) / p(s′ | s, a)
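Assuming the same hypothetical table format as above (p[(s, a)] maps (s′, r) to a probability), these three derived quantities follow directly from their definitions; a sketch:

```python
def state_transition_prob(p, s, a, s_next):
    """p(s' | s, a) = sum_r p(s', r | s, a)"""
    return sum(prob for (sp, r), prob in p[(s, a)].items() if sp == s_next)

def expected_reward(p, s, a):
    """r(s, a) = sum_r r * sum_{s'} p(s', r | s, a)"""
    return sum(r * prob for (sp, r), prob in p[(s, a)].items())

def expected_reward_given_next(p, s, a, s_next):
    """r(s, a, s') = sum_r r * p(s', r | s, a) / p(s' | s, a)"""
    num = sum(r * prob for (sp, r), prob in p[(s, a)].items() if sp == s_next)
    return num / state_transition_prob(p, s, a, s_next)
```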

Flexibility of MDP

 The time steps need not refer to fixed intervals of real time; they can refer to arbitrary successive stages of decision making and acting
 The actions can be low-level controls, such as the
voltages applied to the motors of a robot arm, or
high-level decisions, such as whether or not to
have lunch or to go to graduate school
 The states can take a wide variety of forms. E.g.,
low-level sensor readings, or high-level symbolic
descriptions of objects in a room

Agent-Environment Boundary

 Different from physical boundary


 The motors and mechanical linkages of a robot and its
sensing hardware should usually be considered parts of
the environment rather than parts of the agent
 Anything that cannot be changed arbitrarily by the
agent is considered to be outside of it and thus
part of its environment
 The agent–environment boundary represents the
limit of the agent’s absolute control, not of its
knowledge

Abstraction by MDP

 MDP: a considerable abstraction of the problem of goal-directed learning from interaction
 Any problem of learning goal-directed behavior can be reduced to three signals between agent and environment
 Actions: choices made by the agent
 States: basis on which the choices are made
 Rewards: the agent’s goal

Example: Bioreactor

 Goal: determine moment-by-moment temperatures and stirring rates for a bioreactor (a large vat of nutrients and bacteria used to produce useful chemicals)
 Actions: target temperatures and target stirring rates
 Represented as a vector
 States: thermocouple and other sensory readings, perhaps
filtered and delayed, plus symbolic inputs representing the
ingredients in the vat and the target chemical
 Represented as a vector
 Rewards: moment-by-moment measures of the rate at
which the useful chemical is produced by the bioreactor
 Represented as a scalar

Example: Pick-and-Place Robot

 Goal: control the motion of a robot arm in a repetitive pick-and-place task, to learn movements that are fast and smooth
 Actions: voltages applied to each motor at each joint
 States: latest readings of joint angles and velocities
 Reward: +1 for each object successfully picked up and
placed
 To encourage smooth movements, on each time step a small,
negative reward can be given as a function of the moment-to-
moment “jerkiness” of the motion.

Example: Recycling Robot

 A mobile robot collecting empty soda cans, running on a rechargeable battery
 State: battery level (high, low)
 Actions: search (for a can), wait (for someone to
bring a can), recharge (its battery)
 Action sets
 A(high) = {search, wait}
 A(low) = {search, wait, recharge}

Example: Recycling Robot

(Figure from Sutton and Barto, Reinforcement Learning, 2018)
Outline

Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion

Goals and Rewards

 Reward hypothesis: the goal of an agent is to maximize the expected value of the cumulative reward
 This hypothesis has proved to be flexible and widely applicable

Examples

 Making a robot learn to walk


 Reward: proportional to the robot’s forward motion, on each time step
 Making a robot learn how to escape from a maze
 Reward: −1 for every time step that passes prior to escape; this
encourages the agent to escape as quickly as possible
 Making a robot learn to find and collect empty soda cans for
recycling
 Reward: 0 most of the time, and +1 for each can collected. One might
also want to give the robot negative rewards when it bumps into
things or when somebody yells at it
 Learning to play chess
 Reward: +1 for winning, −1 for losing, and 0 for drawing and for all
nonterminal positions
Maximizing Rewards

 The agent always learns to maximize its long-term reward


 If we want it to do something for us, we must provide
rewards to it in such a way that in maximizing them the
agent will also achieve our goals.
 The reward signal is not the place to impart to the agent prior knowledge about how to achieve what we want it to do
 E.g., a chess-playing agent should be rewarded only for actually winning, not for achieving subgoals such as taking its opponent’s pieces or gaining control of the center of the board
 Prior knowledge about how can instead be exploited through the initial policy or value functions
 The reward signal is your way of communicating to the robot what you want it to achieve, not how you want it achieved

Outline

Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion

Returns and Episodes

 Agent’s goal: maximize the cumulative rewards in the long run

 Maximize the expected return G_t
 G_t = R_{t+1} + R_{t+2} + R_{t+3} + … + R_T
 T: final time step

 What is the implication of T?

Returns and Episodes

 Episode (trial): a separated subsequence in the agent-environment interaction
 Plays of a game
 Trips through a maze
 Any sort of repeated interaction

 Each episode ends in a special terminal state, followed by a reset to a starting state

 Each episode may give different rewards

Returns and Episodes

 Episodic tasks: tasks with episodes


 S: set of all nonterminal states
 S+: set of all states plus the terminal state

 Continuing tasks
 Agent-environment interaction does not break into episodes, but
goes on continually without limit
 E.g., a robot with a long life span
 The return may be infinite; thus we need a concept of discounting

Returns and Episodes

 An agent selects actions to maximize the sum of discounted rewards
 G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + … = Σ_{k=0}^∞ γ^k R_{t+k+1}
 γ: discount rate, in [0, 1]

 Returns at successive time steps are related to each other
 G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + γ³R_{t+4} + …
     = R_{t+1} + γ(R_{t+2} + γR_{t+3} + γ²R_{t+4} + …)
     = R_{t+1} + γG_{t+1}
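A small sketch (not from the slides) that computes discounted returns from a finite reward sequence using exactly this recursion, working backwards from the end of the episode:

```python
def discounted_returns(rewards, gamma):
    """Given [R_1, R_2, ..., R_T], return [G_0, G_1, ..., G_{T-1}] using
    the recursion G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

# Example: constant reward +1 with gamma = 0.9; G_0 approaches 1 / (1 - gamma) = 10.
print(discounted_returns([1.0] * 100, 0.9)[0])   # about 9.9997
```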

Returns and Episodes

 Discount rate determines the present value of future rewards
 A reward received k time steps in the future is worth only γ^(k−1) times what it would be worth if it were received immediately
 If γ < 1, G_t is finite as long as the reward sequence {R_k} is bounded
 If the reward is a constant +1, then G_t = Σ_{k=0}^∞ γ^k = 1/(1 − γ)

 If γ = 0, the agent is concerned only with the immediate rewards

 As γ approaches 1, the agent becomes more far-sighted

Example: Pole-Balancing

 Objective: apply forces to a cart moving along a track to keep a pole hinged to the cart from falling over
 Failure: if the pole falls past a given angle from vertical, or if the cart runs off the track

(Figure from Sutton and Barto, Reinforcement Learning, 2018)
Example: Pole-Balancing

 Episodic task formulation


 Reward +1 for every time step on which failure did not occur
 Continuing task formulation (with discounting)
 Reward −1 on each failure and 0 at all other times
 The return at each time is related to −γ^K, where K is the number of time steps before failure
 In either case, the return is maximized by keeping the pole
balanced for as long as possible

Outline

Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion

Episodic and Continuing Tasks

 Agent-environment interaction
 Episodic tasks: interaction is broken into separate
episodes
 Continuing tasks

 It is useful to establish one notation to refer to both episodic and continuing tasks

Episodic and Continuing Tasks

 Absorbing state
 Transitions only to itself and generates only rewards of 0
 Useful to unify episodic and continuing tasks
 Setting T = 3 or T = ∞ gives the same return (for the reward sequence shown in the figure)

(Figure from Sutton and Barto, Reinforcement Learning, 2018)
Episodic and Continuing Tasks

 Unified notation for episodic and continuing tasks
 G_t = Σ_{k=t+1}^{T} γ^{k−t−1} R_k
 Episodic tasks: γ = 1
 T can be finite, or T = ∞ with an absorbing state
 Continuing tasks: γ < 1, T = ∞

Outline

Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion

Policies and Value Functions

 Almost all RL algorithms involve estimating value functions
 Value function: maps a state (or state-action pair) to a value representing how good it is in terms of expected return

 The rewards the agent can expect to receive in the future depend on what actions it will take; thus value functions are defined with respect to particular ways of acting, called policies

Policies and Value Functions

 Policy: a mapping from a state to probabilities of selecting each action
 π(a|s): the probability of selecting A_t = a, if S_t = s
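As a tiny sketch, a stochastic policy over a finite MDP can be stored as a nested dictionary; the states and actions below are those of the recycling robot example, while the probabilities are hypothetical:

```python
# pi[s][a] = probability of selecting action a in state s; each row sums to 1.
pi = {
    "high": {"search": 0.7, "wait": 0.3},                    # probabilities are made up
    "low":  {"search": 0.2, "wait": 0.3, "recharge": 0.5},
}
assert all(abs(sum(probs.values()) - 1.0) < 1e-9 for probs in pi.values())
```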

Policies and Value Functions

 Value function
 State-value function for policy π: expected return when starting in s and following π thereafter
 v_π(s) = E_π[G_t | S_t = s] = E_π[Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s]

 Action-value function for policy π: expected return when starting in s, taking action a, and following π thereafter
 q_π(s, a) = E_π[G_t | S_t = s, A_t = a] = E_π[Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s, A_t = a]

 The value of the final state is always 0

Policies and Value Functions

 Value functions v_π and q_π can be estimated from experience
 Monte Carlo methods: average over many random
samples of actual returns
 𝑣𝜋 (𝑠): maintain average of the actual returns that have followed
the state
 𝑞𝜋 (𝑠, 𝑎): maintain average of the actual returns that have
followed the state and the action
 Function approximator
 Monte Carlo method is limited when there are many states
 Maintain 𝑣𝜋 and 𝑞𝜋 as parameterized functions (with fewer
parameters than states)
 Learn parameters to better match the observed returns
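A minimal sketch of the Monte Carlo idea (the `generate_episode` sampler is a hypothetical placeholder that returns one episode as a list of (S_t, R_{t+1}) pairs under the given policy):

```python
from collections import defaultdict

def mc_state_values(generate_episode, policy, gamma, num_episodes=10_000):
    """Every-visit Monte Carlo estimate of v_pi: for each state, average the
    actual returns that followed it over many sampled episodes."""
    total = defaultdict(float)
    count = defaultdict(int)
    for _ in range(num_episodes):
        episode = generate_episode(policy)        # [(S_0, R_1), (S_1, R_2), ...]
        G = 0.0
        for state, reward in reversed(episode):   # accumulate returns backwards
            G = reward + gamma * G
            total[state] += G
            count[state] += 1
    return {s: total[s] / count[s] for s in total}
```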
Value Function

 Recursive relationship of the value function v_π
 v_π(s) = E_π[G_t | S_t = s]
        = E_π[R_{t+1} + γG_{t+1} | S_t = s]
        = Σ_a π(a|s) Σ_{s′} Σ_r p(s′, r | s, a) [r + γ E_π[G_{t+1} | S_{t+1} = s′]]
        = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [r + γ v_π(s′)]
 This is the Bellman equation for v_π
 The value function v_π is the unique solution to its Bellman equation
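Because v_π is the unique solution (for γ < 1), repeatedly applying the right-hand side of the Bellman equation converges to it. A sketch of this iterative policy evaluation, assuming the hypothetical dictionary formats used earlier (p[(s, a)] maps (s′, r) to a probability, pi[s] maps a to a probability):

```python
def policy_evaluation(states, p, pi, gamma, theta=1e-8):
    """Sweep v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
    until the largest change is below theta. Terminal states may simply be
    omitted from `states`; their value is taken to be 0."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                pi_a * prob * (r + gamma * v.get(s_next, 0.0))
                for a, pi_a in pi[s].items()
                for (s_next, r), prob in p[(s, a)].items()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v
```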

(Figure from Sutton and Barto, Reinforcement Learning, 2018)
Example: Gridworld
(Figure from Sutton and Barto, Reinforcement Learning, 2018)
 State: each cell
 Actions: north, south, east, and west
 Rewards: +10 (A → A′), +5 (B → B′), −1 for going “off the grid”, and 0 otherwise

v_π(s) for the equiprobable random policy, when γ = 0.9
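A sketch reproducing this example (the 5×5 grid and the positions of A, A′, B, B′ follow the figure in Sutton and Barto, 2018, which is not shown here), evaluating the equiprobable random policy with γ = 0.9 by repeated Bellman sweeps:

```python
import numpy as np

N, GAMMA = 5, 0.9
A, A_PRIME, B, B_PRIME = (0, 1), (4, 1), (0, 3), (2, 3)   # positions from the book's figure
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]              # north, south, west, east

def step(state, action):
    """Gridworld dynamics: return (next_state, reward)."""
    if state == A:
        return A_PRIME, 10.0          # every action in A jumps to A' with reward +10
    if state == B:
        return B_PRIME, 5.0           # every action in B jumps to B' with reward +5
    i, j = state[0] + action[0], state[1] + action[1]
    if 0 <= i < N and 0 <= j < N:
        return (i, j), 0.0
    return state, -1.0                # off the grid: state unchanged, reward -1

v = np.zeros((N, N))
for _ in range(1000):                 # sweep the Bellman equation until (nearly) converged
    new_v = np.zeros_like(v)
    for i in range(N):
        for j in range(N):
            for a in ACTIONS:         # equiprobable random policy: probability 1/4 each
                (ni, nj), r = step((i, j), a)
                new_v[i, j] += 0.25 * (r + GAMMA * v[ni, nj])
    v = new_v
print(np.round(v, 1))                 # v at A comes out below 10, v at B above 5
```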
Example: Gridworld
(Figure from Sutton and Barto, Reinforcement Learning, 2018)
 Observations
 Value of A is less than 10; value of B is greater than 5
 Value of A′ is negative
 Upper cells have higher values than bottom cells

v_π(s) for the equiprobable random policy, when γ = 0.9
Example: Golf

 State: location of the ball
 Action: club (and direction)
 Reward: −1 for each stroke until hitting the ball into the hole
 Observations
 v_putt
 Shorter gaps of values than with the driver
 −∞ value for sand
 q*(s, driver)
 Best action-value for the ‘driver’ action

(Figure from Sutton and Barto, Reinforcement Learning, 2018)
Outline

Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion

Optimal Policies and Values

 The goal of RL is to find a policy that achieves the best reward in the long run
 A policy is better than or equal to another if its expected return is greater than or equal to the other’s for all states
 π ≥ π′ iff v_π(s) ≥ v_π′(s), for all s
 Optimal policy π*: better than or equal to all other policies
 Optimal policies share the optimal state-value function v*
 v*(s) = max_π v_π(s), for all s
 Optimal policies share the optimal action-value function q*
 q*(s, a) = max_π q_π(s, a), for all a and s
 q*(s, a) = E[R_{t+1} + γ v*(S_{t+1}) | S_t = s, A_t = a]

Optimal Policies and Values

 Bellman optimality equation
 The value of a state under an optimal policy must equal the expected return for the best action from that state
 v*(s) = max_a q_{π*}(s, a)
       = max_a E_{π*}[G_t | S_t = s, A_t = a]
       = max_a E_{π*}[R_{t+1} + γG_{t+1} | S_t = s, A_t = a]
       = max_a E[R_{t+1} + γ v*(S_{t+1}) | S_t = s, A_t = a]
       = max_a Σ_{s′,r} p(s′, r | s, a) [r + γ v*(s′)]

 q*(s, a) = E[R_{t+1} + γ max_{a′} q*(S_{t+1}, a′) | S_t = s, A_t = a]
          = Σ_{s′,r} p(s′, r | s, a) [r + γ max_{a′} q*(s′, a′)]
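The Bellman optimality equation can also be applied as an update rule; iterating it to a fixed point is the value-iteration method of dynamic programming. A sketch, using the same hypothetical dynamics format as before (p[(s, a)] maps (s′, r) to a probability, actions[s] lists the actions available in s):

```python
def value_iteration(states, actions, p, gamma, theta=1e-8):
    """Sweep v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')] until
    the largest change is below theta; states absent from v (e.g., terminal
    states) are treated as having value 0."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(prob * (r + gamma * v.get(s_next, 0.0))
                    for (s_next, r), prob in p[(s, a)].items())
                for a in actions[s]
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            return v
```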

Optimal Policies and Values

 Backup diagram for 𝑣∗ and 𝑞∗

(Figure from Sutton and Barto, Reinforcement Learning, 2018)
Optimal Policies and Values

 How to find the best policy, given v*?
 v*(s) = max_a Σ_{s′,r} p(s′, r | s, a) [r + γ v*(s′)]
 For each state s, there will be one or more actions at which the maximum is obtained in the Bellman optimality equation; any policy that assigns nonzero probability only to these actions is optimal
 Can be thought of as a one-step search: the actions that appear best after a one-step search are optimal
 Greedy search: any policy that is greedy with respect to the optimal evaluation function v* is optimal
 Considering the one-step consequences of actions is enough, since v* already takes into account the reward consequences of all possible future behavior
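A sketch of this one-step (greedy) lookahead, extracting a best action in state s from v* under the same hypothetical dynamics format:

```python
def greedy_action_from_v(s, actions, p, v_star, gamma):
    """One-step search: return an action maximizing
    sum_{s',r} p(s',r|s,a) [r + gamma * v_star(s')]."""
    def backup(a):
        return sum(prob * (r + gamma * v_star.get(s_next, 0.0))
                   for (s_next, r), prob in p[(s, a)].items())
    return max(actions[s], key=backup)
```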

Optimal Policies and Values

 How to find the best policy, given q*?
 q*(s, a) = Σ_{s′,r} p(s′, r | s, a) [r + γ max_{a′} q*(s′, a′)]
 The agent does not even have to do a one-step search!
 For any state s, find any action that maximizes q*(s, a)
 The action-value function effectively caches the results of all one-step-ahead searches
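With q* the dynamics model drops out entirely; a one-line sketch (q_star is assumed to be a dictionary keyed by (state, action)):

```python
def greedy_action_from_q(s, actions, q_star):
    """Return any action maximizing q*(s, a); no dynamics model or lookahead needed."""
    return max(actions[s], key=lambda a: q_star[(s, a)])
```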

Example: Gridworld

Example: Recycling Robot

 h, l, s, w, re: high, low, search, wait, recharge
 For any choice of r_s, r_w, α, β, and γ, with 0 ≤ γ < 1 and 0 ≤ α, β ≤ 1, there is exactly one pair of numbers v*(h) and v*(l) that simultaneously satisfies the following equations

 v*(h) = max{ p(h|h,s)[r(h,s,h) + γv*(h)] + p(l|h,s)[r(h,s,l) + γv*(l)],
              p(h|h,w)[r(h,w,h) + γv*(h)] + p(l|h,w)[r(h,w,l) + γv*(l)] }
       = max{ α[r_s + γv*(h)] + (1 − α)[r_s + γv*(l)],
              1·[r_w + γv*(h)] + 0·[r_w + γv*(l)] }
       = max{ r_s + γ[αv*(h) + (1 − α)v*(l)],
              r_w + γv*(h) }

 v*(l) = max{ βr_s − 3(1 − β) + γ[(1 − β)v*(h) + βv*(l)],
              r_w + γv*(l),
              γv*(h) }
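A sketch that solves this pair of equations numerically by fixed-point iteration; the parameter values below are arbitrary placeholders chosen only to make the script runnable, not values from the slides:

```python
# Recycling robot: solve the two optimality equations by fixed-point iteration.
r_s, r_w, alpha, beta, gamma = 2.0, 1.0, 0.8, 0.6, 0.9   # placeholder parameters

v_h = v_l = 0.0
for _ in range(10_000):
    new_v_h = max(r_s + gamma * (alpha * v_h + (1 - alpha) * v_l),
                  r_w + gamma * v_h)
    new_v_l = max(beta * r_s - 3 * (1 - beta) + gamma * ((1 - beta) * v_h + beta * v_l),
                  r_w + gamma * v_l,
                  gamma * v_h)
    v_h, v_l = new_v_h, new_v_l

print(v_h, v_l)   # the unique pair v*(high), v*(low) for these parameter values
```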

Bellman Optimality Equation

 Explicitly solving the Bellman optimality equation leads to solving the RL problem; however, it is rarely useful in practice
 1) We do not know the exact dynamics of the environment
 2) We do not have enough computational resources to complete the computation
 3) The Markov property may not hold

 E.g., backgammon
 It has about 10^20 states, and thus solving the Bellman equation would take thousands of years
 Approximate computation is needed

Bellman Optimality Equation

 Many decision-making methods can be viewed as ways of approximately solving the Bellman optimality equation
 v*(s) = max_a Σ_{s′,r} p(s′, r | s, a) [r + γ v*(s′)]
 Heuristic search methods
 Expand the Bellman optimality equation several times (up to some depth), forming a “tree” of possibilities, and then use a heuristic evaluation function to approximate v* at the “leaf” nodes
 DP (dynamic programming) is more closely related to the Bellman optimality equation
 Many RL methods approximately solve the Bellman optimality equation using actual experienced transitions in place of knowledge of the expected transitions
Outline

Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion

Optimality and Approximation

 An agent that learns an optimal policy has done very well, but in practice this rarely happens
 For the kinds of tasks in which we are interested,
optimal policies can be generated only with
extreme computational cost
 Optimality is an ideal that agents can only
approximate to varying degrees, due to
 Computational limitation
 Memory limitation

Optimality and Approximation

 Computational limitation
 Chess: it is difficult to exactly solve the Bellman
optimality equation
 Memory limitation
 For small and finite states, it is possible to make a table
of storing values, policies, and models
 When there are too many states, we need parameterized
function approximation

Optimality and Approximation

 Approximation for RL presents challenges


 In approximating optimal behavior, there may be many states that
the agent faces with such a low probability that selecting
suboptimal actions for them has little impact on the amount of
reward the agent receives
 E.g., Tesauro’s backgammon player plays with exceptional skill even
though it might make very bad decisions on board configurations
that never occur in games against experts.
 The online nature of RL makes it possible to approximate optimal
policies in ways that put more effort into learning to make good
decisions for frequently encountered states, at the expense of less
effort for infrequently encountered states
 This is one key property that distinguishes RL from other
approaches to approximately solving MDPs

Outline

Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion

Conclusion

 RL
 Learning from interaction how to behave in order to achieve a goal
 RL agent and its environment interact over a sequence of discrete
time steps
 Actions: choices made by the agent
 States: basis for making the choices
 Rewards: the basis for evaluating the choices
 Agent-environment boundary: everything inside the agent is completely known and controllable by the agent; everything outside is incompletely controllable but may or may not be completely known
 Policy: a stochastic rule from state to action
 Agent’s goal: maximize the amount of reward it receives over time

Conclusion

 Markov decision process (MDP)


 State, action, and reward sets with well-defined transition
probabilities
 Return: function of future rewards that the agent seeks to maximize
in expectation
 Undiscounted return: appropriate for episodic tasks
 Discounted return: appropriate for continuing tasks

Conclusion

 Markov decision process (MDP)


 A policy’s value functions map each state, or state–action pair to the
expected return from that state, or state–action pair, given that the
agent uses the policy
 Optimal value functions: give the largest expected return achievable
by any policy
 Optimal policy: a policy whose value functions are optimal
 The optimal value functions are unique for a given MDP, but there
can be many optimal policies
 Any policy that is greedy with respect to the optimal value functions
must be an optimal policy
 Bellman optimality equations should be satisfied by the optimal
value functions; given the optimal value functions, an optimal policy
can be determined easily
Conclusion

 Approximation
 RL agent may have complete knowledge of the environment, or
incomplete knowledge of it
 Even if the agent has complete knowledge of the environment, it cannot compute an exact solution, due to
 Computational limitation
 Memory limitation
 Approximation is the key to RL

Exercise

 (Question 1)
 Suppose that we have a small gridworld and the agent travels each state with the equiprobable random policy. There are four actions possible in each state, A = {up, down, right, left}, which deterministically cause the corresponding state transitions, except that actions that would take the agent off the grid in fact leave the state unchanged. The reward is −1 on all transitions until the terminal state (shaded) is reached.

Exercise

 (Question 1)
 Now, suppose a new state 15 is added to the gridworld just below state
13, and its actions, left, up, right, and down, take the agent to states 12,
13, 14, and 15, respectively. Assume that the transitions and the values
of the original states are unchanged. What is 𝑣𝜋 (15) with the discount
rate 𝛾 = 0.9 for the equiprobable random policy in this case?

Exercise

 (Answer)
 We need to calculate the state-value function for state 15. Since we have the converged state-value function for all states except the new state 15, we can calculate v_π(15) from the values of its neighboring states.
 From the state-value function equation
 v_π(s) = Σ_{a∈A} π(a|s) Σ_{s′∈S} P^a_{s,s′} [r^a_{s,s′} + γ v_π(s′)],
 the value of the new state 15 can be written as
 v_π(15) = (1/4)[(−1 + γv_π(12)) + (−1 + γv_π(13)) + (−1 + γv_π(14)) + (−1 + γv_π(15))]
 Since the down action leads from state 15 back to itself, the last term contains v_π(15); solving this linear equation gives
 v_π(15) = (−4 + γ(v_π(12) + v_π(13) + v_π(14))) / (4 − γ)
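A one-function sketch of this closed form; the converged values of states 12, 13, and 14 come from the figure (not reproduced here), so they are left as arguments rather than filled in:

```python
def v15(v12, v13, v14, gamma=0.9):
    """v15 = (1/4)[(-1 + g*v12) + (-1 + g*v13) + (-1 + g*v14) + (-1 + g*v15)],
    solved for v15:  v15 = (-4 + g*(v12 + v13 + v14)) / (4 - g)."""
    return (-4.0 + gamma * (v12 + v13 + v14)) / (4.0 - gamma)
```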

Questions?
