U Kang
Seoul National University
In This Lecture
Overview
Outline
Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion
Agent-Environment Interface
MDP
Finite MDP
States, actions, and rewards are finite
$p(s', r \mid s, a) = \Pr\{S_t = s', R_t = r \mid S_{t-1} = s, A_{t-1} = a\}$
The probability distribution $p$ defines the dynamics of the MDP
$p : \mathcal{S} \times \mathcal{R} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$
$\sum_{s'} \sum_{r} p(s', r \mid s, a) = 1$, for all $s, a$
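To make the dynamics function concrete, here is a minimal Python sketch (not from the lecture) that stores $p(s', r \mid s, a)$ as a lookup table for a made-up two-state, two-action MDP and checks the normalization condition above.

```python
from itertools import product

# Minimal sketch: finite-MDP dynamics p(s', r | s, a) as a lookup table.
# The states, actions, rewards, and probabilities are illustrative placeholders.
states = ["s0", "s1"]
actions = ["a0", "a1"]

# dynamics[(s, a)] is a list of (next_state, reward, probability) triples.
dynamics = {
    ("s0", "a0"): [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)],
    ("s0", "a1"): [("s1", 0.0, 1.0)],
    ("s1", "a0"): [("s0", 2.0, 1.0)],
    ("s1", "a1"): [("s1", 0.0, 0.2), ("s0", 1.0, 0.8)],
}

def p(s_next, r, s, a):
    """p(s', r | s, a): probability of landing in s_next with reward r."""
    return sum(prob for (sn, rew, prob) in dynamics[(s, a)]
               if sn == s_next and rew == r)

print(p("s1", 1.0, "s0", "a0"))   # 0.5

# Normalization check: sum over s' and r of p(s', r | s, a) must equal 1.
for s, a in product(states, actions):
    total = sum(prob for (_, _, prob) in dynamics[(s, a)])
    assert abs(total - 1.0) < 1e-9, (s, a, total)
```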
MDP
Flexibility of MDP
Agent-Environment Boundary
Abstraction by MDP
Example: Bioreactor
Example: Pick-and-Place Robot
Example: Recycling Robot
Outline
Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion
Goals and Rewards
Examples
Outline
Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion
Returns and Episodes
Continuing tasks
Agent-environment interaction does not break into episodes, but
goes on continually without limit
E.g., a robot with a long life span
The return may be infinite; thus we need a concept of discounting
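A small sketch of the discounted return, assuming an arbitrary constant reward of +1 per step (not an example from the slides): the discounted sum converges to $1/(1-\gamma)$ even though the undiscounted sum would diverge.

```python
# Sketch: discounted return G_t = sum_{k=0}^inf gamma^k R_{t+k+1}.
# With a constant reward of +1 per step, the undiscounted sum diverges,
# while the discounted sum converges to 1 / (1 - gamma).
gamma = 0.9

def discounted_return(rewards, gamma):
    """Sum of gamma^k * rewards[k] over a (truncated) reward sequence."""
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

rewards = [1.0] * 1000                        # long truncation; the tail is negligible
print(discounted_return(rewards, gamma))      # approximately 1 / (1 - 0.9) = 10
```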
Returns and Episodes
Example: Pole-Balancing
Outline
Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion
Episodic and Continuing Tasks
Agent-environment interaction
Episodic tasks: interaction is broken into separate episodes
Continuing tasks: interaction goes on continually without limit
Episodic and Continuing Tasks
Absorbing state
Transitions only to itself and generates only rewards of 0
Useful to unify episodic and continuing tasks
Setting $T = 3$ or $T = \infty$ gives the same result
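A quick numerical check, assuming the usual textbook reward sequence of +1, +1, +1 followed by the zero-reward absorbing state and $\gamma = 1$ (the figure itself is not reproduced here):

```python
# Check that the episodic (T = 3) and continuing (T = infinity) views agree,
# assuming rewards +1, +1, +1 and then the zero-reward absorbing state, gamma = 1.
gamma = 1.0

episodic_return = sum(gamma ** k * 1.0 for k in range(3))          # T = 3
continuing_return = sum(gamma ** k * (1.0 if k < 3 else 0.0)       # T -> infinity,
                        for k in range(10_000))                    # truncated in practice
print(episodic_return, continuing_return)                          # both 3.0
```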
Episodic and Continuing Tasks
Outline
Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion
Policies and Value Functions
Value function
State-value function for policy 𝜋: expected return when
starting in s and following 𝜋 thereafter
$v_\pi(s) = E_\pi[G_t \mid S_t = s] = E_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \mid S_t = s\right]$
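To illustrate the definition, here is a hedged sketch that estimates $v_\pi(s)$ by averaging sampled returns on a made-up two-state MDP under the equiprobable random policy; the MDP and the truncation horizon are illustrative assumptions, not part of the lecture.

```python
import random

# Sketch: estimate v_pi(s) as the average of sampled (truncated) returns.
gamma = 0.9
actions = ["a0", "a1"]

def step(s, a):
    """Deterministic toy transition: returns (next_state, reward)."""
    if s == "s0":
        return ("s1", 1.0) if a == "a0" else ("s0", 0.0)
    return ("s0", 2.0) if a == "a0" else ("s1", 0.0)

def pi(s):
    """Equiprobable random policy: ignores the state."""
    return random.choice(actions)

def sample_return(s, horizon=200):
    """One sampled discounted return G_t, truncated at `horizon` steps."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        s, r = step(s, pi(s))
        g += discount * r
        discount *= gamma
    return g

# Monte Carlo estimate of v_pi(s0): average of many sampled returns.
episodes = 5000
print(sum(sample_return("s0") for _ in range(episodes)) / episodes)
```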
Policies and Value Functions
Example: Gridworld
Sutton and Barto, Reinforcement Learning, 2018
State: each cell
Actions: north, south, east, and west
Rewards: +10 (A -> A'), +5 (B -> B'), -1 for "off the grid", and 0 otherwise
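A sketch of iterative policy evaluation for this example under the equiprobable random policy. The slide lists only the rewards; the 5x5 layout, the positions of A, A', B, B', and $\gamma = 0.9$ below are assumed from Sutton and Barto's Example 3.5.

```python
import numpy as np

# Iterative policy evaluation of the equiprobable random policy on the gridworld.
N, gamma = 5, 0.9
A, A_prime, B, B_prime = (0, 1), (4, 1), (0, 3), (2, 3)   # assumed textbook layout
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]                # north, south, west, east

def step(state, move):
    """Return (next_state, reward) for one deterministic transition."""
    if state == A:
        return A_prime, 10.0
    if state == B:
        return B_prime, 5.0
    r, c = state[0] + move[0], state[1] + move[1]
    if 0 <= r < N and 0 <= c < N:
        return (r, c), 0.0
    return state, -1.0                                    # "off the grid": stay put, -1

V = np.zeros((N, N))
for _ in range(1000):                                     # sweep until approximately converged
    V_new = np.zeros_like(V)
    for r in range(N):
        for c in range(N):
            for m in moves:                               # each action has probability 1/4
                (nr, nc), rew = step((r, c), m)
                V_new[r, c] += 0.25 * (rew + gamma * V[nr, nc])
    V = V_new
print(np.round(V, 1))                                     # state values under the random policy
```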
Outline
Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion
Optimal Policies and Values
Example: Gridworld
Example: Recycling Robot
$v_*(l) = \max\begin{cases} \beta r_s - 3(1-\beta) + \gamma\left[(1-\beta)\,v_*(h) + \beta\,v_*(l)\right], \\ r_w + \gamma\,v_*(l), \\ \gamma\,v_*(h) \end{cases}$
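The equation can be solved numerically by fixed-point iteration, as in the hedged sketch below. The parameter values ($\alpha$, $\beta$, $r_{search}$, $r_{wait}$, $\gamma$) are arbitrary placeholders, and the $v_*(h)$ recursion is the textbook companion of the $v_*(l)$ equation shown above.

```python
# Sketch: solve the recycling robot's Bellman optimality equations by iteration.
# Parameter values are placeholders; only the form of the equations is fixed.
alpha, beta = 0.8, 0.6        # P(stay high | search from high), P(stay low | search from low)
r_search, r_wait = 2.0, 1.0   # expected rewards for searching / waiting
gamma = 0.9

v_h, v_l = 0.0, 0.0
for _ in range(1000):
    new_v_h = max(r_search + gamma * (alpha * v_h + (1 - alpha) * v_l),
                  r_wait + gamma * v_h)
    new_v_l = max(beta * r_search - 3 * (1 - beta) + gamma * ((1 - beta) * v_h + beta * v_l),
                  r_wait + gamma * v_l,
                  gamma * v_h)                     # recharge: reward 0, back to high
    v_h, v_l = new_v_h, new_v_l
print(v_h, v_l)                                    # approximate v*(high), v*(low)
```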
Bellman Optimality Equation
E.g., backgammon
It has about $10^{20}$ states, so solving the Bellman equation exactly would take thousands of years
Needs approximate computation
Bellman Optimality Equation
Outline
Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion
Optimality and Approximation
Computational limitation
Chess: it is difficult to solve the Bellman optimality equation exactly
Memory limitation
For small, finite state sets, it is possible to store values, policies, and models in tables
When there are too many states, we need parameterized function approximation (see the sketch below)
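An illustrative sketch of the contrast (the names and feature map are made up, not a specific method from the lecture): a table needs one entry per state, while a parameterized approximator needs only a fixed number of weights.

```python
import numpy as np

# Tabular: one stored value per state -- feasible only for small state sets.
num_states = 1000
value_table = {s: 0.0 for s in range(num_states)}   # memory grows with the number of states

# Parameterized: approximate v(s) as a linear function of a fixed-length
# feature vector, v_hat(s) = w . x(s). Memory is the number of weights,
# independent of the number of states; w would be learned, not set by hand.
num_features = 8
w = np.zeros(num_features)

def features(s):
    """Placeholder feature map: deterministic pseudo-features for state index s."""
    return np.random.default_rng(s).standard_normal(num_features)

def v_hat(s):
    """Approximate value of state s under the current weights."""
    return float(w @ features(s))

print(len(value_table), w.size)   # 1000 table entries vs. 8 weights
```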
Optimality and Approximation
Outline
Agent-Environment Interface
Goals and Rewards
Returns and Episodes
Episodic and Continuing Tasks
Policies and Value Functions
Optimal Policies and Value Functions
Optimality and Approximation
Conclusion
Conclusion
RL
Learning from interaction how to behave in order to achieve a goal
RL agent and its environment interact over a sequence of discrete
time steps
Actions: choices made by the agent
States: basis for making the choices
Rewards: the basis for evaluating the choices
Agent-environment boundary: everything inside the agent is completely known and controllable by the agent; everything outside is incompletely controllable but may or may not be completely known
Policy: a stochastic rule from state to action
Agent’s goal: maximize the amount of reward it receives over time
Conclusion
Approximation
RL agent may have complete knowledge of the environment, or
incomplete knowledge of it
Even if the agent has complete knowledge of the environment, it cannot exactly compute the optimal solution
Computational limitation
Memory limitation
Approximation is the key to RL
Exercise
(Question 1)
Suppose that we have a small gridworld in which the agent moves between states following the equiprobable random policy. There are four actions possible in each state, $A = \{up, down, right, left\}$, which deterministically cause the corresponding state transitions, except that actions that would take the agent off the grid in fact leave the state unchanged. The reward is -1 on all transitions until the terminal state (shaded) is reached.
Exercise
(Question 1)
Now, suppose a new state 15 is added to the gridworld just below state 13, and its actions, left, up, right, and down, take the agent to states 12, 13, 14, and 15, respectively. Assume that the transitions and the values of the original states are unchanged. What is $v_\pi(15)$ with the discount rate $\gamma = 0.9$ for the equiprobable random policy in this case?
Exercise
(Answer)
We need to calculate the state-value function for state 15. Since we have the converged state-value function for all states except the new state 15, we can calculate $v_\pi(15)$ from the neighboring state values.
From the state-value function equation $V_\pi(s) = \sum_{a \in A} \pi(s, a) \sum_{s' \in S} P^a_{s,s'}\,[r^a_{s,s'} + \gamma V_\pi(s')]$, the state-value function for the new state 15 can be formulated as
$V_\pi(15) = \frac{1}{4}\left[(-1 + \gamma V_\pi(12)) + (-1 + \gamma V_\pi(13)) + (-1 + \gamma V_\pi(14)) + (-1 + \gamma V_\pi(15))\right]$
Since the down action leaves the agent in state 15, $V_\pi(15)$ appears on both sides; solving for it gives $V_\pi(15) = \dfrac{-4 + \gamma\,(V_\pi(12) + V_\pi(13) + V_\pi(14))}{4 - \gamma}$.
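A small sketch that carries out this calculation; the values of the neighboring states below are placeholders, not results from the slides, so plug in the converged values from the original gridworld figure.

```python
# Sketch: solve the self-consistent equation for V_pi(15). Since the "down"
# action keeps the agent in state 15, V_pi(15) appears on both sides, giving
# V_pi(15) = (-4 + gamma * (V12 + V13 + V14)) / (4 - gamma).
gamma = 0.9
V12, V13, V14 = -22.0, -20.0, -14.0   # placeholder neighbor values, not from the slides

V15 = (-4 + gamma * (V12 + V13 + V14)) / (4 - gamma)
print(V15)

# Check: V15 satisfies the Bellman equation for the equiprobable policy.
rhs = 0.25 * sum(-1 + gamma * v for v in (V12, V13, V14, V15))
assert abs(V15 - rhs) < 1e-9
```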
Questions?