MIT 6.036 Lecture
• State machine
• Observation vs. State
• Markov decision process (MDP)
• Policy: value, optimal
• Finite horizon value iteration, optimal policy
• Return
• Bellman equations: expectation, optimality
• Infinite horizon value iteration
State Machine
[Diagram: a state machine with states {standing, moving} and inputs {slow, fast}; the machine emits an observation O, which may differ from its internal state S]
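A minimal Python sketch of the idea; the transition table below is an illustrative placeholder, not the lecture's exact machine:

    # Minimal state-machine sketch: a transition table mapping (state, input) -> next state.
    # The specific transitions are illustrative placeholders.
    TRANSITIONS = {
        ("standing", "slow"): "moving",
        ("standing", "fast"): "moving",
        ("moving",   "slow"): "moving",
        ("moving",   "fast"): "standing",
    }

    def step(state, action):
        # Deterministic state-machine step: next state given current state and input.
        return TRANSITIONS[(state, action)]

    print(step("standing", "slow"))   # -> "moving"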
Markov Process
[Diagram: a Markov chain over states {standing, moving, fallen}, with transition probabilities labeling the arrows (e.g. p = 3/5, p = 3/4, p = 1)]
Transition P(s, a, s′)
• States: s ∈ {standing, moving, fallen}
• Actions: a ∈ {slow, fast}
• P(s, a, s′): probability of ending up in state s′ after taking action a in state s
[Diagram: transition graph with probability-labeled arrows, e.g. p = 3/5, p = 3/4, p = 1]
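A Python sketch of the transition function as a nested dictionary; only the state and action sets come from the slides, the probabilities are illustrative placeholders:

    import random

    # T[s][a] is a probability distribution over next states s'.
    T = {
        "standing": {"slow": {"moving": 1.0},
                     "fast": {"moving": 0.6, "fallen": 0.4}},
        "moving":   {"slow": {"moving": 1.0},
                     "fast": {"moving": 0.75, "fallen": 0.25}},
        "fallen":   {"slow": {"standing": 0.4, "fallen": 0.6},
                     "fast": {"fallen": 1.0}},
    }

    def transition_prob(s, a, s2):
        # P(s' | s, a); zero for next states that are not listed.
        return T[s][a].get(s2, 0.0)

    def sample_next_state(s, a):
        dist = T[s][a]
        return random.choices(list(dist), weights=list(dist.values()), k=1)[0]

    print(transition_prob("standing", "fast", "fallen"), sample_next_state("standing", "fast"))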
Markov Decision Process (MDP)
• Reward function R(s, a):
  R(fallen, slow) = 1      R(fallen, fast) = 0
  R(standing, slow) = 1    R(standing, fast) = 2
  R(moving, slow) = 1      R(moving, fast) = -1
• γ: discount factor
[Diagram: transition graph with edges labeled by probability and reward, e.g. p = 2/5 r = 1, p = 3/5 r = -1, p = 1 r = 0]
Reward R(s, a)
• States: s ∈ {standing, moving, fallen}
• Actions: a ∈ {slow, fast}
• R(s, a): reward collected for taking action a in state s
• With horizon 1, the optimal policy is myopic: in each state, pick the action with the highest immediate reward
[Diagram: transition graph with edges labeled by probability and reward, e.g. p = 3/5 r = 2, p = 1 r = 1, p = 1 r = 0]
Markov Decision Process (MDP)
• Defined by the tuple (S, A, T, R, γ):
  state space S, action space A, transition function T, reward function R, discount factor γ
• Policy π: S → A
• A rule book: what action to take in each state s ∈ {fallen, standing, moving}?
[Diagram: robot transition graph with probability-labeled arrows]
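A sketch of the full MDP tuple in Python. The rewards follow the R(s, a) table above; the transition probabilities and the discount factor γ = 0.9 are illustrative assumptions:

    from collections import namedtuple

    MDP = namedtuple("MDP", ["S", "A", "T", "R", "gamma"])

    robot_mdp = MDP(
        S=["standing", "moving", "fallen"],
        A=["slow", "fast"],
        T={"standing": {"slow": {"moving": 1.0},
                        "fast": {"moving": 0.6, "fallen": 0.4}},
           "moving":   {"slow": {"moving": 1.0},
                        "fast": {"moving": 0.75, "fallen": 0.25}},
           "fallen":   {"slow": {"standing": 0.4, "fallen": 0.6},
                        "fast": {"fallen": 1.0}}},        # placeholder probabilities
        R={"standing": {"slow": 1, "fast": 2},            # R(s, a) values from the slides
           "moving":   {"slow": 1, "fast": -1},
           "fallen":   {"slow": 1, "fast": 0}},
        gamma=0.9,                                        # assumed discount factor
    )

    print(robot_mdp.R["standing"]["fast"])   # -> 2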
Policy
• Which policy is best?
• π: S → A
• Example robot policies:
  – πA: always slow
• A policy may also be stochastic: randomness in the agent's action choices
[Diagram: robot transition graph]
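A sketch of both kinds of policy in Python; the stochastic action probabilities are illustrative:

    import random

    # Deterministic policy pi: S -> A, e.g. pi_A = "always slow".
    pi_A = {"standing": "slow", "moving": "slow", "fallen": "slow"}

    # Stochastic policy: a distribution over actions in each state.
    pi_stochastic = {
        "standing": {"slow": 0.5, "fast": 0.5},
        "moving":   {"slow": 0.2, "fast": 0.8},
        "fallen":   {"slow": 1.0, "fast": 0.0},
    }

    def sample_action(pi, s):
        actions, probs = zip(*pi[s].items())
        return random.choices(actions, weights=probs, k=1)[0]

    print(pi_A["moving"], sample_action(pi_stochastic, "moving"))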
State-Action Diagram
• In state s
• Take the action a given by policy π: S → A
• Transition to state s′
• Repeating this yields a tree of (s, a, r, s′) transitions
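A sketch of generating such a sequence of (s, a, r, s′) tuples by rolling out a policy; the tiny MDP below is an illustrative placeholder in the same dictionary format as the earlier sketches:

    import random

    T = {"standing": {"slow": {"standing": 1.0}},
         "fallen":   {"slow": {"standing": 0.4, "fallen": 0.6}}}
    R = {"standing": {"slow": 1}, "fallen": {"slow": 0}}
    pi = {"standing": "slow", "fallen": "slow"}

    def rollout(s, pi, T, R, steps):
        traj = []
        for _ in range(steps):
            a = pi[s]                                   # take the action given by the policy
            dist = T[s][a]
            s2 = random.choices(list(dist), weights=list(dist.values()), k=1)[0]
            traj.append((s, a, R[s][a], s2))            # record (s, a, r, s')
            s = s2                                      # repeat from the next state
        return traj

    print(rollout("fallen", pi, T, R, steps=4))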
What is the Value of a Policy?
• π: S → A
• What is the value of a policy? It depends on the number of steps.
• Example: renting the robot for h time steps
[Diagram: robot transition graph with probabilities and rewards]
What is the Value of a Policy?
• π: S → A
• h: horizon, the number of time steps left
• V_h^π(s): value (expected reward) of following policy π for h steps, starting at state s
• Computed by induction on the number of steps left to go, h
• The value of the policy at state s at horizon h is the reward in s plus the next state's expected horizon-(h−1) value
• For h = 1:  V_1^π(s) = R(s, π(s))
• For h = 2:  V_2^π(s) = R(s, π(s)) + Σ_{s′} T(s, π(s), s′) V_1^π(s′)
• For any h:  V_h^π(s) = R(s, π(s)) + Σ_{s′} T(s, π(s), s′) V_{h−1}^π(s′)
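A Python sketch of this finite-horizon recursion; the function name and the toy MDP numbers are my own illustrative choices:

    # V_h(s) = R(s, pi(s)) + sum_{s'} T(s, pi(s), s') * V_{h-1}(s'), with V_0(s) = 0.
    def evaluate_policy(S, T, R, pi, horizon):
        V = {s: 0.0 for s in S}                          # V_0: no steps left, no reward
        for _ in range(horizon):
            V = {s: R[s][pi[s]] + sum(p * V[s2] for s2, p in T[s][pi[s]].items())
                 for s in S}
        return V

    # Tiny illustrative MDP in the same dictionary format as the earlier sketches.
    S = ["standing", "fallen"]
    T = {"standing": {"slow": {"standing": 1.0}},
         "fallen":   {"slow": {"standing": 0.4, "fallen": 0.6}}}
    R = {"standing": {"slow": 1.0}, "fallen": {"slow": 0.0}}
    pi = {"standing": "slow", "fallen": "slow"}          # the "always slow" policy
    print(evaluate_policy(S, T, R, pi, horizon=3))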
Updating State Value Function
Finite Horizon Value Iteration Algorithm
Finite Horizon Optimal Policy
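A sketch of finite-horizon value iteration and the greedy optimal action it implies; the names and toy numbers are my own illustrative choices, not the lecture's exact pseudocode:

    # V_h(s) = max_a [ R(s, a) + sum_{s'} T(s, a, s') * V_{h-1}(s') ]
    # The optimal action generally depends on the number of steps left; the pi returned
    # here is the policy for the final horizon.
    def value_iteration(S, A, T, R, horizon):
        V = {s: 0.0 for s in S}                          # V_0 = 0
        pi = {}
        for _ in range(horizon):
            Q = {s: {a: R[s][a] + sum(p * V[s2] for s2, p in T[s][a].items())
                     for a in A} for s in S}
            V = {s: max(Q[s].values()) for s in S}
            pi = {s: max(A, key=lambda a: Q[s][a]) for s in S}
        return V, pi

    S, A = ["standing", "fallen"], ["slow", "fast"]
    T = {"standing": {"slow": {"standing": 1.0},
                      "fast": {"standing": 0.6, "fallen": 0.4}},
         "fallen":   {"slow": {"standing": 0.4, "fallen": 0.6},
                      "fast": {"fallen": 1.0}}}
    R = {"standing": {"slow": 1, "fast": 2}, "fallen": {"slow": 1, "fast": 0}}
    print(value_iteration(S, A, T, R, horizon=3))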
Action
Policy
Return and Discount Factor
• Discount factor γ ∈ (0, 1)
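A small sketch of the discounted return; the reward sequence and γ value are illustrative:

    # Discounted return: G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...
    def discounted_return(rewards, gamma):
        # rewards[0] plays the role of R_{t+1}.
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    print(discounted_return([1, 1, 2, -1], gamma=0.9))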
Action-Value Function
• Q^π(s, a): expected return from starting in state s, taking action a, and then following policy π
Value Functions
Returns at Successive Time Steps
• Recursive relationship: G_t = R_{t+1} + γ G_{t+1}
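A sketch of computing returns backwards through an episode using this recursion; the reward values and γ are illustrative:

    # G_t = R_{t+1} + gamma * G_{t+1}: sweep once backwards over the rewards.
    def returns_from_rewards(rewards, gamma):
        G = 0.0
        out = []
        for r in reversed(rewards):        # work backwards from the end of the episode
            G = r + gamma * G
            out.append(G)
        return list(reversed(out))         # out[t] is G_t, with rewards[t] = R_{t+1}

    print(returns_from_rewards([1, 1, 2, -1], gamma=0.9))
    # Matches summing gamma**k * R_{t+k+1} directly.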
4 Bellman Equations
• Expectation and optimality equations, for both the state-value function V and the action-value function Q
Infinite Horizon
• The Bellman equation averages over all possibilities, weighting each by its probability of occurring:
  V^π(s) = R(s, π(s)) + γ Σ_{s′} T(s, π(s), s′) V^π(s′)
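A sketch of infinite-horizon value iteration (the optimality version of this update, with a max over actions), run until the values stop changing; the function name, tolerance, and toy MDP are my own illustrative choices:

    # Infinite-horizon value iteration: repeatedly apply
    #   V(s) <- max_a [ R(s, a) + gamma * sum_{s'} T(s, a, s') * V(s') ]
    # until the largest change falls below a tolerance.
    def infinite_horizon_value_iteration(S, A, T, R, gamma, eps=1e-6):
        V = {s: 0.0 for s in S}
        while True:
            V_new = {s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in T[s][a].items())
                            for a in A)
                     for s in S}
            if max(abs(V_new[s] - V[s]) for s in S) < eps:
                return V_new
            V = V_new

    S, A = ["standing", "fallen"], ["slow", "fast"]
    T = {"standing": {"slow": {"standing": 1.0},
                      "fast": {"standing": 0.6, "fallen": 0.4}},
         "fallen":   {"slow": {"standing": 0.4, "fallen": 0.6},
                      "fast": {"fallen": 1.0}}}
    R = {"standing": {"slow": 1, "fast": 2}, "fallen": {"slow": 1, "fast": 0}}
    print(infinite_horizon_value_iteration(S, A, T, R, gamma=0.9))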
Bellman Equation for Action-Value Function
• This is a linear equation in the values Q^π(s, a), so it can be solved exactly as a linear system
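A sketch of exploiting that linearity: stack the Q^π(s, a) values into a vector and solve (I − γM) q = r with NumPy; the function name and toy MDP are my own illustrative choices:

    import numpy as np

    # Bellman expectation equation for Q^pi:
    #   Q(s, a) = R(s, a) + gamma * sum_{s'} T(s, a, s') * Q(s', pi(s'))
    # Linear in Q, so it can be solved exactly as a linear system.
    def solve_q_pi(S, A, T, R, pi, gamma):
        idx = {(s, a): i for i, (s, a) in enumerate([(s, a) for s in S for a in A])}
        n = len(idx)
        M = np.zeros((n, n))
        r = np.zeros(n)
        for (s, a), i in idx.items():
            r[i] = R[s][a]
            for s2, p in T[s][a].items():
                M[i, idx[(s2, pi[s2])]] += p
        q = np.linalg.solve(np.eye(n) - gamma * M, r)
        return {sa: q[i] for sa, i in idx.items()}

    S, A = ["standing", "fallen"], ["slow", "fast"]
    T = {"standing": {"slow": {"standing": 1.0},
                      "fast": {"standing": 0.6, "fallen": 0.4}},
         "fallen":   {"slow": {"standing": 0.4, "fallen": 0.6},
                      "fast": {"fallen": 1.0}}}
    R = {"standing": {"slow": 1, "fast": 2}, "fallen": {"slow": 1, "fast": 0}}
    pi = {"standing": "slow", "fallen": "slow"}
    print(solve_q_pi(S, A, T, R, pi, gamma=0.9))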
Finding Optimal Policy
• Via the Bellman optimality equation
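A small sketch of reading off the greedy policy from an optimal action-value function Q*; the Q* values shown are illustrative placeholders:

    # pi*(s) = argmax_a Q*(s, a)
    def greedy_policy(S, A, Q):
        # Q is a dict mapping (state, action) pairs to values.
        return {s: max(A, key=lambda a: Q[(s, a)]) for s in S}

    Q_star = {("standing", "slow"): 1.0, ("standing", "fast"): 2.0,
              ("fallen", "slow"): 1.0, ("fallen", "fast"): 0.0}
    print(greedy_policy(["standing", "fallen"], ["slow", "fast"], Q_star))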