
Reinforcement Learning

Finite Markov Decision Process

U Kang
Seoul National University
In This Lecture

 Markov Decision Process


 Definition
 Return, state, and action
 Policy and value functions
 Optimality and approximation

Overview

 Markov Decision Process (MDP)


 A classical formalization of sequential decision making, where actions influence not just immediate rewards, but also subsequent situations (states) and, through those, future rewards
 MDPs involve delayed reward and the need to trade off immediate and delayed reward
 The problem is to estimate the value q*(s, a) of each action a in each state s, or the value v*(s) of each state given optimal action selections

Outline

Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion

Agent-Environment Interface

 Objective: learning from interaction to achieve a goal
 Agent: the learner and decision maker
 Environment: everything outside the agent
 Environment gives rewards, special numerical
values that the agent seeks to maximize over time

(Figure from Sutton and Barto, Reinforcement Learning, 2018)
Agent-Environment Interface

 Consider a sequence of time steps, t = 0, 1, 2, …


 At each time t, the agent receives state 𝑆𝑡 and
reward 𝑅𝑡 (when t != 0), and on that basis selects
an action 𝐴𝑡
 One step later, the agent receives state 𝑆𝑡+1 and
reward 𝑅𝑡+1 , and on that basis selects an action
𝐴𝑡+1
 Sequence of interactions
 𝑆0 , 𝐴0 , 𝑅1 , 𝑆1 , 𝐴1 , 𝑅2 , 𝑆2 , 𝐴2 , …
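As a minimal sketch of this interaction loop (not from the slides; the `env` object with `reset()`/`step()` methods and the `select_action` function are hypothetical placeholders), the sequence S_0, A_0, R_1, S_1, A_1, R_2, ... can be generated like this:

```python
# Minimal sketch of the agent-environment loop producing S0, A0, R1, S1, A1, R2, ...
# `env.reset()`, `env.step(action)`, and `select_action(state)` are hypothetical placeholders.
def run_episode(env, select_action, max_steps=1000):
    trajectory = []                                   # holds (S_t, A_t, R_{t+1}) triples
    state = env.reset()                               # S_0
    for _ in range(max_steps):
        action = select_action(state)                 # A_t, chosen on the basis of S_t
        next_state, reward, done = env.step(action)   # environment emits R_{t+1} and S_{t+1}
        trajectory.append((state, action, reward))
        state = next_state
        if done:                                      # terminal state reached (episodic task)
            break
    return trajectory
```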

MDP

 Finite MDP
 States, actions, and rewards are finite
 p(s′, r | s, a) = Pr{S_t = s′, R_t = r | S_{t−1} = s, A_{t−1} = a}
 The probability distribution p defines the dynamics of the MDP
 p : S × R × S × A → [0, 1]
 Σ_{s′} Σ_r p(s′, r | s, a) = 1, for all s, a

 The probability p(s′, r | s, a) depends only on the preceding state and action, not on earlier ones
 Markov property
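To make the data structure concrete, here is a small sketch (not from the slides) that stores the four-argument dynamics function p(s′, r | s, a) of a finite MDP as a table; the two states, two actions, rewards, and probabilities are hypothetical, and the normalization Σ_{s′,r} p(s′, r | s, a) = 1 is checked:

```python
# Dynamics of a finite MDP as a table: p[(s, a)] maps (s', r) -> probability.
# The states, actions, rewards, and probabilities below are hypothetical.
p = {
    ("s0", "a0"): {("s0", 0.0): 0.5, ("s1", 1.0): 0.5},
    ("s0", "a1"): {("s1", 0.0): 1.0},
    ("s1", "a0"): {("s0", 2.0): 1.0},
    ("s1", "a1"): {("s1", -1.0): 0.3, ("s0", 0.0): 0.7},
}

# Markov property: p depends only on (s, a); each row must sum to 1.
for (s, a), outcomes in p.items():
    assert abs(sum(outcomes.values()) - 1.0) < 1e-9, (s, a)
```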

MDP

 From p(s′, r | s, a), we can compute anything we want to know about the environment
 State-transition probabilities
 p(s′ | s, a) = Pr{S_t = s′ | S_{t−1} = s, A_{t−1} = a} = Σ_r p(s′, r | s, a)
 Expected reward
 r(s, a) = E[R_t | S_{t−1} = s, A_{t−1} = a] = Σ_r r Σ_{s′} p(s′, r | s, a)
 Expected reward for state-action-next state
 r(s, a, s′) = E[R_t | S_{t−1} = s, A_{t−1} = a, S_t = s′] = Σ_r r · p(s′, r | s, a) / p(s′ | s, a)
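Assuming the same hypothetical table format as above (p[(s, a)] maps (s′, r) to a probability), these three derived quantities follow directly from their definitions; a sketch:

```python
def state_transition_prob(p, s, a, s_next):
    """p(s' | s, a) = sum_r p(s', r | s, a)"""
    return sum(prob for (sp, r), prob in p[(s, a)].items() if sp == s_next)

def expected_reward(p, s, a):
    """r(s, a) = sum_r r * sum_{s'} p(s', r | s, a)"""
    return sum(r * prob for (sp, r), prob in p[(s, a)].items())

def expected_reward_given_next(p, s, a, s_next):
    """r(s, a, s') = sum_r r * p(s', r | s, a) / p(s' | s, a)"""
    num = sum(r * prob for (sp, r), prob in p[(s, a)].items() if sp == s_next)
    return num / state_transition_prob(p, s, a, s_next)
```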

Flexibility of MDP

 The time steps need not refer to fixed intervals of real time; they can refer to arbitrary successive stages of decision making and acting
 The actions can be low-level controls, such as the
voltages applied to the motors of a robot arm, or
high-level decisions, such as whether or not to
have lunch or to go to graduate school
 The states can take a wide variety of forms. E.g.,
low-level sensor readings, or high-level symbolic
descriptions of objects in a room

Agent-Environment Boundary

 Different from physical boundary


 The motors and mechanical linkages of a robot and its
sensing hardware should usually be considered parts of
the environment rather than parts of the agent
 Anything that cannot be changed arbitrarily by the
agent is considered to be outside of it and thus
part of its environment
 The agent–environment boundary represents the
limit of the agent’s absolute control, not of its
knowledge

Abstraction by MDP

 MDP: a considerable abstraction of the problem of goal-directed learning from interaction
 Any problem of learning goal-directed behavior can be reduced to three signals between agent and environment
 Actions: choices made by the agent
 States: basis on which the choices are made
 Rewards: the agent’s goal

Example: Bioreactor

 Goal: determine moment-by-moment temperatures and stirring rates for a bioreactor (a large vat of nutrients and bacteria used to produce useful chemicals)
 Actions: target temperatures and target stirring rates
 Represented as a vector
 States: thermocouple and other sensory readings, perhaps
filtered and delayed, plus symbolic inputs representing the
ingredients in the vat and the target chemical
 Represented as a vector
 Rewards: moment-by-moment measures of the rate at
which the useful chemical is produced by the bioreactor
 Represented as a scalar

Example: Pick-and-Place Robot

 Goal: control the motion of a robot arm in a repetitive pick-and-place task, to learn movements that are fast and smooth
 Actions: voltages applied to each motor at each joint
 States: latest readings of joint angles and velocities
 Reward: +1 for each object successfully picked up and
placed
 To encourage smooth movements, on each time step a small,
negative reward can be given as a function of the moment-to-
moment “jerkiness” of the motion.

Example: Recycling Robot

 A mobile robot collecting empty soda cans, running on a rechargeable battery
 State: battery level (high, low)
 Actions: search (for a can), wait (for someone to
bring a can), recharge (its battery)
 Action sets
 A(high) = {search, wait}
 A(low) = {search, wait, recharge}

Example: Recycling Robot

(Figure from Sutton and Barto, Reinforcement Learning, 2018)
Outline

Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion

Goals and Rewards

 Reward hypothesis: the goal of an agent is to maximize the expected value of the cumulative reward
 This hypothesis has proved to be flexible and widely applicable

Examples

 Making a robot learn to walk


 Reward: proportional to the robot’s forward motion, on each time step
 Making a robot learn how to escape from a maze
 Reward: −1 for every time step that passes prior to escape; this
encourages the agent to escape as quickly as possible
 Making a robot learn to find and collect empty soda cans for
recycling
 Reward: 0 most of the time, and +1 for each can collected. One might
also want to give the robot negative rewards when it bumps into
things or when somebody yells at it
 Learning to play chess
 Reward: +1 for winning, −1 for losing, and 0 for drawing and for all
nonterminal positions
Maximizing Rewards

 The agent always learns to maximize its long-term reward


 If we want it to do something for us, we must provide
rewards to it in such a way that in maximizing them the
agent will also achieve our goals.
 The reward signal is not the place to impart to the agent prior knowledge about how to achieve what we want it to do
 E.g., a chess-playing agent should be rewarded only for actually winning, not for achieving subgoals such as taking its opponent’s pieces or gaining control of the center of the board
 Prior knowledge about how can instead be exploited through the initial policy or value functions
 The reward signal is your way of communicating to the robot what you want it to achieve, not how you want it achieved

Outline

Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion

Returns and Episodes

 Agent’s goal: maximize the cumulative rewards in the long run

 Maximize the expected return G_t
 G_t = R_{t+1} + R_{t+2} + R_{t+3} + … + R_T
 T: final time step

 What is the implication of T?

Returns and Episodes

 Episode (trial): a separated subsequence in the agent-environment interaction
 Plays of a game
 Trips through a maze
 Any sort of repeated interaction

 Each episode ends in a special terminal state, followed by a reset to a starting state

 Each episode may give different rewards

Returns and Episodes

 Episodic tasks: tasks with episodes


 S: set of all nonterminal states
 S+: set of all states plus the terminal state

 Continuing tasks
 Agent-environment interaction does not break into episodes, but
goes on continually without limit
 E.g., a robot with a long life span
 The return may be infinite; thus we need a concept of discounting

Returns and Episodes

 An agent selects actions to maximize the sum of discounted rewards
 G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + … = Σ_{k=0}^∞ γ^k R_{t+k+1}
 γ: discount rate, in [0, 1]

 Returns at successive time steps are related to each other
 G_t = R_{t+1} + γR_{t+2} + γ²R_{t+3} + γ³R_{t+4} + …
     = R_{t+1} + γ(R_{t+2} + γR_{t+3} + γ²R_{t+4} + …)
     = R_{t+1} + γG_{t+1}
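A small sketch (not from the slides) that computes discounted returns from a finite reward sequence using exactly this recursion, working backwards from the end of the episode:

```python
def discounted_returns(rewards, gamma):
    """Given [R_1, R_2, ..., R_T], return [G_0, G_1, ..., G_{T-1}] using
    the recursion G_t = R_{t+1} + gamma * G_{t+1}, with G_T = 0."""
    G = 0.0
    returns = []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

# Example: constant reward +1 with gamma = 0.9; G_0 approaches 1 / (1 - gamma) = 10.
print(discounted_returns([1.0] * 100, 0.9)[0])   # about 9.9997
```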

Returns and Episodes

 Discount rate determines the present value of future rewards
 A reward received k time steps in the future is worth only γ^(k−1) times what it would be worth if it were received immediately
 If γ < 1, G_t is finite as long as the reward sequence {R_k} is bounded
 If the reward is a constant +1, then G_t = Σ_{k=0}^∞ γ^k = 1/(1 − γ)

 If γ = 0, the agent is concerned only with the immediate rewards

 As γ approaches 1, the agent becomes more far-sighted

Example: Pole-Balancing

 Objective: apply forces to a cart moving along a track to keep a pole hinged to the cart from falling over
 Failure: if the pole falls past a given angle from vertical, or if the cart runs off the track

(Figure from Sutton and Barto, Reinforcement Learning, 2018)
Example: Pole-Balancing

 Episodic task formulation


 Reward +1 for every time step on which failure did not occur
 Continuing task formulation (with discounting)
 Reward −1 on each failure and 0 at all other times
 The return at each time is related to −γ^K, where K is the number of time steps before failure
 In either case, the return is maximized by keeping the pole
balanced for as long as possible

Outline

Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion

Episodic and Continuing Tasks

 Agent-environment interaction
 Episodic tasks: interaction is broken into separate
episodes
 Continuing tasks

 It is useful to establish one notation to refer to both episodic and continuing tasks

Episodic and Continuing Tasks

 Absorbing state
 Transitions only to itself and generates only rewards of 0
 Useful to unify episodic and continuing tasks
 Setting T = 3 or T = ∞ gives the same return (for the reward sequence shown in the figure)

(Figure from Sutton and Barto, Reinforcement Learning, 2018)
Episodic and Continuing Tasks

 Unified notation for episodic and continuing tasks
 G_t = Σ_{k=t+1}^{T} γ^{k−t−1} R_k
 Episodic tasks: γ = 1
 T can be finite, or T = ∞ with an absorbing state
 Continuing tasks: γ < 1, T = ∞

Outline

Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion

Policies and Value Functions

 Almost all RL algorithms involve estimating value functions
 Value function: maps a state (or state-action pair) to a value representing how good it is in terms of expected return

 The rewards the agent can expect to receive in the future depend on what actions it will take; thus value functions are defined with respect to particular ways of acting, called policies

Policies and Value Functions

 Policy: a mapping from a state to probabilities of selecting each action
 π(a|s): the probability of selecting A_t = a, if S_t = s
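As a tiny sketch, a stochastic policy over a finite MDP can be stored as a nested dictionary; the states and actions below are those of the recycling robot example, while the probabilities are hypothetical:

```python
# pi[s][a] = probability of selecting action a in state s; each row sums to 1.
pi = {
    "high": {"search": 0.7, "wait": 0.3},                    # probabilities are made up
    "low":  {"search": 0.2, "wait": 0.3, "recharge": 0.5},
}
assert all(abs(sum(probs.values()) - 1.0) < 1e-9 for probs in pi.values())
```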

Policies and Value Functions

 Value function
 State-value function for policy π: expected return when starting in s and following π thereafter
 v_π(s) = E_π[G_t | S_t = s] = E_π[Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s]

 Action-value function for policy π: expected return when starting in s, taking action a, and following π thereafter
 q_π(s, a) = E_π[G_t | S_t = s, A_t = a] = E_π[Σ_{k=0}^∞ γ^k R_{t+k+1} | S_t = s, A_t = a]

 The value of the final state is always 0

Policies and Value Functions

 Value functions v_π and q_π can be estimated from experience
 Monte Carlo methods: average over many random
samples of actual returns
 𝑣𝜋 (𝑠): maintain average of the actual returns that have followed
the state
 𝑞𝜋 (𝑠, 𝑎): maintain average of the actual returns that have
followed the state and the action
 Function approximator
 Monte Carlo method is limited when there are many states
 Maintain 𝑣𝜋 and 𝑞𝜋 as parameterized functions (with fewer
parameters than states)
 Learn parameters to better match the observed returns
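A minimal sketch of the Monte Carlo idea (the `generate_episode` sampler is a hypothetical placeholder that returns one episode as a list of (S_t, R_{t+1}) pairs under the given policy):

```python
from collections import defaultdict

def mc_state_values(generate_episode, policy, gamma, num_episodes=10_000):
    """Every-visit Monte Carlo estimate of v_pi: for each state, average the
    actual returns that followed it over many sampled episodes."""
    total = defaultdict(float)
    count = defaultdict(int)
    for _ in range(num_episodes):
        episode = generate_episode(policy)        # [(S_0, R_1), (S_1, R_2), ...]
        G = 0.0
        for state, reward in reversed(episode):   # accumulate returns backwards
            G = reward + gamma * G
            total[state] += G
            count[state] += 1
    return {s: total[s] / count[s] for s in total}
```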
Value Function

 Recursive relationship of the value function v_π
 v_π(s) = E_π[G_t | S_t = s]
        = E_π[R_{t+1} + γG_{t+1} | S_t = s]
        = Σ_a π(a|s) Σ_{s′} Σ_r p(s′, r | s, a) [r + γ E_π[G_{t+1} | S_{t+1} = s′]]
        = Σ_a π(a|s) Σ_{s′,r} p(s′, r | s, a) [r + γ v_π(s′)]
 This is the Bellman equation for v_π
 The value function v_π is the unique solution to its Bellman equation
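Because v_π is the unique solution (for γ < 1), repeatedly applying the right-hand side of the Bellman equation converges to it. A sketch of this iterative policy evaluation, assuming the hypothetical dictionary formats used earlier (p[(s, a)] maps (s′, r) to a probability, pi[s] maps a to a probability):

```python
def policy_evaluation(states, p, pi, gamma, theta=1e-8):
    """Sweep v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v(s')]
    until the largest change is below theta. Terminal states may simply be
    omitted from `states`; their value is taken to be 0."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            new_v = sum(
                pi_a * prob * (r + gamma * v.get(s_next, 0.0))
                for a, pi_a in pi[s].items()
                for (s_next, r), prob in p[(s, a)].items()
            )
            delta = max(delta, abs(new_v - v[s]))
            v[s] = new_v
        if delta < theta:
            return v
```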

(Figure from Sutton and Barto, Reinforcement Learning, 2018)
Example: Gridworld
(Figure from Sutton and Barto, Reinforcement Learning, 2018)
 State: each cell
 Actions: north, south, east, and west
 Rewards: +10 (A → A′), +5 (B → B′), −1 for going “off the grid”, and 0 otherwise

v_π(s) for the equiprobable random policy, when γ = 0.9
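A sketch reproducing this example (the 5×5 grid and the positions of A, A′, B, B′ follow the figure in Sutton and Barto, 2018, which is not shown here), evaluating the equiprobable random policy with γ = 0.9 by repeated Bellman sweeps:

```python
import numpy as np

N, GAMMA = 5, 0.9
A, A_PRIME, B, B_PRIME = (0, 1), (4, 1), (0, 3), (2, 3)   # positions from the book's figure
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]              # north, south, west, east

def step(state, action):
    """Gridworld dynamics: return (next_state, reward)."""
    if state == A:
        return A_PRIME, 10.0          # every action in A jumps to A' with reward +10
    if state == B:
        return B_PRIME, 5.0           # every action in B jumps to B' with reward +5
    i, j = state[0] + action[0], state[1] + action[1]
    if 0 <= i < N and 0 <= j < N:
        return (i, j), 0.0
    return state, -1.0                # off the grid: state unchanged, reward -1

v = np.zeros((N, N))
for _ in range(1000):                 # sweep the Bellman equation until (nearly) converged
    new_v = np.zeros_like(v)
    for i in range(N):
        for j in range(N):
            for a in ACTIONS:         # equiprobable random policy: probability 1/4 each
                (ni, nj), r = step((i, j), a)
                new_v[i, j] += 0.25 * (r + GAMMA * v[ni, nj])
    v = new_v
print(np.round(v, 1))                 # v at A comes out below 10, v at B above 5
```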
Example: Gridworld
(Figure from Sutton and Barto, Reinforcement Learning, 2018)
 Observations
 Value of A is less than 10; value of B is greater than 5
 Value of A′ is negative
 Upper cells have higher values than bottom cells

v_π(s) for the equiprobable random policy, when γ = 0.9
Example: Golf

 State: location of the ball
 Action: club (and direction)
 Reward: −1 for each stroke until hitting the ball into the hole
 Observations
 v_putt
 Shorter gaps of values than with the driver
 −∞ value for sand
 q*(s, driver)
 Best action-value for the ‘driver’ action

(Figure from Sutton and Barto, Reinforcement Learning, 2018)
Outline

Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion

Optimal Policies and Values

 The goal of RL is to find a policy that achieves the best reward in the long run
 A policy is better than or equal to another if its expected return is greater than or equal to the other’s for all states
 π ≥ π′ iff v_π(s) ≥ v_π′(s), for all s
 Optimal policy π*: better than or equal to all other policies
 Optimal policies share the optimal state-value function v*
 v*(s) = max_π v_π(s), for all s
 Optimal policies share the optimal action-value function q*
 q*(s, a) = max_π q_π(s, a), for all a and s
 q*(s, a) = E[R_{t+1} + γ v*(S_{t+1}) | S_t = s, A_t = a]

Optimal Policies and Values

 Bellman optimality equation
 The value of a state under an optimal policy must equal the expected return for the best action from that state
 v*(s) = max_a q_{π*}(s, a)
       = max_a E_{π*}[G_t | S_t = s, A_t = a]
       = max_a E_{π*}[R_{t+1} + γG_{t+1} | S_t = s, A_t = a]
       = max_a E[R_{t+1} + γ v*(S_{t+1}) | S_t = s, A_t = a]
       = max_a Σ_{s′,r} p(s′, r | s, a) [r + γ v*(s′)]

 q*(s, a) = E[R_{t+1} + γ max_{a′} q*(S_{t+1}, a′) | S_t = s, A_t = a]
          = Σ_{s′,r} p(s′, r | s, a) [r + γ max_{a′} q*(s′, a′)]
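The Bellman optimality equation can also be applied as an update rule; iterating it to a fixed point is the value-iteration method of dynamic programming. A sketch, using the same hypothetical dynamics format as before (p[(s, a)] maps (s′, r) to a probability, actions[s] lists the actions available in s):

```python
def value_iteration(states, actions, p, gamma, theta=1e-8):
    """Sweep v(s) <- max_a sum_{s',r} p(s',r|s,a) [r + gamma v(s')] until
    the largest change is below theta; states absent from v (e.g., terminal
    states) are treated as having value 0."""
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            best = max(
                sum(prob * (r + gamma * v.get(s_next, 0.0))
                    for (s_next, r), prob in p[(s, a)].items())
                for a in actions[s]
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            return v
```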

Optimal Policies and Values

 Backup diagram for 𝑣∗ and 𝑞∗

(Figure from Sutton and Barto, Reinforcement Learning, 2018)
Optimal Policies and Values

 How to find the best policy, given v*?
 v*(s) = max_a Σ_{s′,r} p(s′, r | s, a) [r + γ v*(s′)]
 For each state s, there will be one or more actions at which the maximum is obtained in the Bellman optimality equation; any policy that assigns nonzero probability only to these actions is optimal
 Can be thought of as a one-step search: the actions that appear best after a one-step search are optimal
 Greedy search: any policy that is greedy with respect to the optimal evaluation function v* is optimal
 Considering the one-step consequences of actions is enough, since v* already takes into account the reward consequences of all possible future behavior
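A sketch of this one-step (greedy) lookahead, extracting a best action in state s from v* under the same hypothetical dynamics format:

```python
def greedy_action_from_v(s, actions, p, v_star, gamma):
    """One-step search: return an action maximizing
    sum_{s',r} p(s',r|s,a) [r + gamma * v_star(s')]."""
    def backup(a):
        return sum(prob * (r + gamma * v_star.get(s_next, 0.0))
                   for (s_next, r), prob in p[(s, a)].items())
    return max(actions[s], key=backup)
```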

Optimal Policies and Values

 How to find the best policy, given q*?
 q*(s, a) = Σ_{s′,r} p(s′, r | s, a) [r + γ max_{a′} q*(s′, a′)]
 The agent does not even have to do a one-step search!
 For any state s, find any action that maximizes q*(s, a)
 The action-value function effectively caches the results of all one-step-ahead searches
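With q* the dynamics model drops out entirely; a one-line sketch (q_star is assumed to be a dictionary keyed by (state, action)):

```python
def greedy_action_from_q(s, actions, q_star):
    """Return any action maximizing q*(s, a); no dynamics model or lookahead needed."""
    return max(actions[s], key=lambda a: q_star[(s, a)])
```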

Example: Gridworld

Example: Recycling Robot

 h, l, s, w, re: high, low, search, wait, recharge
 For any choice of r_s, r_w, α, β, and γ, with 0 ≤ γ < 1 and 0 ≤ α, β ≤ 1, there is exactly one pair of numbers v*(h) and v*(l) that simultaneously satisfies the following equations

 v*(h) = max{ p(h|h,s)[r(h,s,h) + γv*(h)] + p(l|h,s)[r(h,s,l) + γv*(l)],
              p(h|h,w)[r(h,w,h) + γv*(h)] + p(l|h,w)[r(h,w,l) + γv*(l)] }
       = max{ α[r_s + γv*(h)] + (1 − α)[r_s + γv*(l)],
              1·[r_w + γv*(h)] + 0·[r_w + γv*(l)] }
       = max{ r_s + γ[αv*(h) + (1 − α)v*(l)],
              r_w + γv*(h) }

 v*(l) = max{ βr_s − 3(1 − β) + γ[(1 − β)v*(h) + βv*(l)],
              r_w + γv*(l),
              γv*(h) }
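A sketch that solves this pair of equations numerically by fixed-point iteration; the parameter values below are arbitrary placeholders chosen only to make the script runnable, not values from the slides:

```python
# Recycling robot: solve the two optimality equations by fixed-point iteration.
r_s, r_w, alpha, beta, gamma = 2.0, 1.0, 0.8, 0.6, 0.9   # placeholder parameters

v_h = v_l = 0.0
for _ in range(10_000):
    new_v_h = max(r_s + gamma * (alpha * v_h + (1 - alpha) * v_l),
                  r_w + gamma * v_h)
    new_v_l = max(beta * r_s - 3 * (1 - beta) + gamma * ((1 - beta) * v_h + beta * v_l),
                  r_w + gamma * v_l,
                  gamma * v_h)
    v_h, v_l = new_v_h, new_v_l

print(v_h, v_l)   # the unique pair v*(high), v*(low) for these parameter values
```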

Bellman Optimality Equation

 Explicitly solving the Bellman optimality equation leads to solving the RL problem; however, it is rarely useful in practice
 1) We do not know the exact dynamics of the environment
 2) We do not have enough computational resources to complete the computation
 3) The Markov property may not hold

 E.g., backgammon
 It has about 10^20 states, and thus solving the Bellman equation would take thousands of years
 Approximate computation is needed

Bellman Optimality Equation

 Many decision-making methods can be viewed as ways of approximately solving the Bellman optimality equation
 v*(s) = max_a Σ_{s′,r} p(s′, r | s, a) [r + γ v*(s′)]
 Heuristic search methods
 Expand the Bellman optimality equation several times (up to some depth), forming a “tree” of possibilities, and then use a heuristic evaluation function to approximate v* at the “leaf” nodes
 DP (dynamic programming) is more closely related to the Bellman optimality equation
 Many RL methods approximately solve the Bellman optimality equation using actual experienced transitions in place of knowledge of the expected transitions
Outline

Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion

Optimality and Approximation

 An agent that learns an optimal policy has done very well, but in practice this rarely happens
 For the kinds of tasks in which we are interested,
optimal policies can be generated only with
extreme computational cost
 Optimality is an ideal that agents can only
approximate to varying degrees, due to
 Computational limitation
 Memory limitation

Optimality and Approximation

 Computational limitation
 Chess: it is difficult to exactly solve the Bellman
optimality equation
 Memory limitation
 For small and finite states, it is possible to make a table
of storing values, policies, and models
 When there are too many states, we need parameterized
function approximation

Optimality and Approximation

 Approximation for RL presents challenges


 In approximating optimal behavior, there may be many states that
the agent faces with such a low probability that selecting
suboptimal actions for them has little impact on the amount of
reward the agent receives
 E.g., Tesauro’s backgammon player plays with exceptional skill even
though it might make very bad decisions on board configurations
that never occur in games against experts.
 The online nature of RL makes it possible to approximate optimal
policies in ways that put more effort into learning to make good
decisions for frequently encountered states, at the expense of less
effort for infrequently encountered states
 This is one key property that distinguishes RL from other
approaches to approximately solving MDPs

Outline

Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion

Conclusion

 RL
 Learning from interaction how to behave in order to achieve a goal
 RL agent and its environment interact over a sequence of discrete
time steps
 Actions: choices made by the agent
 States: basis for making the choices
 Rewards: the basis for evaluating the choices
 Agent-environment boundary: everything inside the agent is completely known and controllable by the agent; everything outside is incompletely controllable but may or may not be completely known
 Policy: a stochastic rule from state to action
 Agent’s goal: maximize the amount of reward it receives over time

Conclusion

 Markov decision process (MDP)


 State, action, and reward sets with well-defined transition
probabilities
 Return: function of future rewards that the agent seeks to maximize
in expectation
 Undiscounted return: appropriate for episodic tasks
 Discounted return: appropriate for continuing tasks

Conclusion

 Markov decision process (MDP)


 A policy’s value functions map each state, or state–action pair to the
expected return from that state, or state–action pair, given that the
agent uses the policy
 Optimal value functions: give the largest expected return achievable
by any policy
 Optimal policy: a policy whose value functions are optimal
 The optimal value functions are unique for a given MDP, but there
can be many optimal policies
 Any policy that is greedy with respect to the optimal value functions
must be an optimal policy
 Bellman optimality equations should be satisfied by the optimal
value functions; given the optimal value functions, an optimal policy
can be determined easily
Conclusion

 Approximation
 RL agent may have complete knowledge of the environment, or
incomplete knowledge of it
 Even if the agent has complete knowledge of the environment, it cannot compute an exact solution, due to
 Computational limitation
 Memory limitation
 Approximation is the key to RL

Exercise

 (Question 1)
 Suppose that we have a small gridworld and the agent travels each state with the equiprobable random policy. There are four actions possible in each state, A = {up, down, right, left}, which deterministically cause the corresponding state transitions, except that actions that would take the agent off the grid in fact leave the state unchanged. The reward is −1 on all transitions until the terminal state (shaded) is reached.

Exercise

 (Question 1)
 Now, suppose a new state 15 is added to the gridworld just below state
13, and its actions, left, up, right, and down, take the agent to states 12,
13, 14, and 15, respectively. Assume that the transitions and the values
of the original states are unchanged. What is 𝑣𝜋 (15) with the discount
rate 𝛾 = 0.9 for the equiprobable random policy in this case?

Exercise

 (Answer)
 We need to calculate the state-value function for state 15. Since we have the converged state-value function for all states except the new state 15, we can calculate v_π(15) from the values of its neighboring states.
 From the state-value function equation
 v_π(s) = Σ_{a∈A} π(a|s) Σ_{s′∈S} P^a_{s,s′} [r^a_{s,s′} + γ v_π(s′)],
 the value of the new state 15 can be written as
 v_π(15) = (1/4)[(−1 + γv_π(12)) + (−1 + γv_π(13)) + (−1 + γv_π(14)) + (−1 + γv_π(15))]
 Since the down action leads from state 15 back to itself, the last term contains v_π(15); solving this linear equation gives
 v_π(15) = (−4 + γ(v_π(12) + v_π(13) + v_π(14))) / (4 − γ)
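A one-function sketch of this closed form; the converged values of states 12, 13, and 14 come from the figure (not reproduced here), so they are left as arguments rather than filled in:

```python
def v15(v12, v13, v14, gamma=0.9):
    """v15 = (1/4)[(-1 + g*v12) + (-1 + g*v13) + (-1 + g*v14) + (-1 + g*v15)],
    solved for v15:  v15 = (-4 + g*(v12 + v13 + v14)) / (4 - g)."""
    return (-4.0 + gamma * (v12 + v13 + v14)) / (4.0 - gamma)
```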

Questions?
