MIT 6.036 Lecture

The lecture, led by Prof. Iddo Drori, covers State Machines and Markov Decision Processes (MDPs), focusing on concepts such as states, actions, policies, and value functions. Key topics include transition functions, reward structures, and the Bellman equations for evaluating policies over both finite and infinite horizons. Course materials will be available on the course website, and questions can be directed to Piazza.


Lecture: starts Tuesday 9:35am

Course website: introml.odl.mit.edu


Who's talking? Prof. Iddo Drori
Questions? Piazza
Materials: Will be available on course website
Today’s Plan: State Machines and
Markov Decision Processes (MDPs)

• State machine
• Observation vs. State
• Markov decision process (MDP)
• Policy: value, optimal
• Finite horizon value iteration, optimal policy
• Return
• Bellman equations: expectation, optimality
• Infinite horizon value iteration
State Machine

• S = set of possible states = {standing, moving}
• X = set of possible inputs = {slow, fast}
• f: S × X → S transition function
• s0 ∈ S = initial state = standing
• s1 = f(s0, fast) = ?

[Diagram: state transition graph over {standing, moving} with edges labeled slow and fast]

State Machine

• S = set of possible states = {standing, moving}
• X = set of possible inputs = {slow, fast}
• f: S × X → S transition function
• s0 ∈ S = initial state = standing
• s1 = f(s0, fast) = moving
• We may not observe the states {standing, moving} directly;
  what we observe is, for example, a sensor measurement

[Diagram: state transition graph over {standing, moving} with edges labeled slow and fast]
Example: Observation vs. State

[Figure: observation O vs. state S]
State Machine

• S = set of possible states = {fallen, standing, moving}
• X = set of possible inputs = {slow, fast}
• f: S × X → S transition function
• s0 ∈ S = initial state = standing
• s1 = f(s0, fast) = moving
• We may not observe the states directly
• Y = set of possible outputs
• g: S → Y output function
• y1 = g(s1) = moving

[Diagram: transition graph over {standing, moving} with edges labeled slow and fast]
State Machine

• S = set of possible states = {fallen, standing, moving}
• X = set of possible inputs = {slow, fast}
• f: S × X → S transition function
• Y = set of possible outputs
• g: S → Y output function
• s0 ∈ S = initial state = standing
• Iteratively compute for t ≥ 1:
    s_t = f(s_{t-1}, x_t)
    y_t = g(s_t)

[Diagram: transition graph over {standing, moving} with edges labeled slow and fast]
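A minimal sketch (not from the lecture) of this iterative computation in Python; the function name run_state_machine and its argument names are illustrative assumptions.

def run_state_machine(f, g, s0, inputs):
    # Iteratively compute s_t = f(s_{t-1}, x_t) and y_t = g(s_t) for t = 1, 2, ...
    s = s0
    outputs = []
    for x in inputs:
        s = f(s, x)            # state update
        outputs.append(g(s))   # output / observation of the new state
    return outputs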


State Machine Example

• Reads a binary string
• The string has an even number of zeros iff the machine ends at state S1
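A small sketch of this example in Python, assuming the second state is called S2 (the slide only names S1):

def has_even_zeros(binary_string):
    # Two states: S1 = "even number of zeros seen so far" (start), S2 = "odd".
    # Reading '0' toggles the state; reading '1' leaves it unchanged.
    state = "S1"
    for ch in binary_string:
        if ch == "0":
            state = "S2" if state == "S1" else "S1"
    return state == "S1"   # even number of zeros iff we end at S1

# has_even_zeros("1001") -> True (two zeros); has_even_zeros("10") -> False (one zero)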


State and Reward

?
Markov Model

Markov Process

• set of possible states S = {fallen, standing, moving}
• set of possible actions A = {slow, fast}
• transition model T: S × A × S → ℝ
• the transition function is stochastic: it defines a probability distribution
  over the next state given the previous state and action
• the output is the state (g is the identity)

[Diagram: stochastic transition graph over {standing, moving, fallen} with edge
probabilities p = 1, 1, 3/5, 3/4, 2/5, 2/5, 1/5, 3/5, 1]
Transition P(s, a, s')

• s ∈ {standing, moving, fallen}, a ∈ {slow, fast}; each entry is a probability P(s, a, s')

[Tables: one transition-probability table per action, with rows s and columns s'
ranging over {fallen, standing, moving}; the values appear on the transition diagram
(p = 1, 1, 3/5, 3/4, 2/5, 2/5, 1/5, 3/5, 1)]
Markov Decision Process (MDP)

• set of possible states S = {fallen, standing, moving}
• set of possible actions A = {slow, fast}
• transition model T: S × A × S → ℝ
• reward function R: S × A → ℝ, reward based on state & action:
    R(fallen, slow) = 1      R(fallen, fast) = 0
    R(standing, slow) = 1    R(standing, fast) = 2
    R(moving, slow) = 1      R(moving, fast) = -1

[Diagram: stochastic transition graph over {standing, moving, fallen}]
Markov Decision Process (MDP)

• S = set of possible states {fallen, standing, moving}
• A = set of possible actions {slow, fast}
• T: S × A × S → ℝ transition model
• R: S × A → ℝ reward function
• γ discount factor

[Diagram: stochastic transition graph over {standing, moving, fallen}]
Markov Decision Process (MDP)

• set of possible states S = {fallen, standing, moving}
• set of possible actions A = {slow, fast}
• transition model T: S × A × S → ℝ
• rewards may be probabilistic, given together with the transition function
• γ discount factor

[Diagram: transition graph with edges labeled by probability p and reward r:
(p=1, r=1), (p=1, r=1), (p=3/5, r=2), (p=3/4, r=2), (p=2/5, r=1), (p=2/5, r=-1),
(p=1/5, r=-1), (p=3/5, r=-1), (p=1, r=0)]
Reward R(s, a)

              slow    fast
  fallen        1       0
  standing      1       2
  moving        1      -1

• s ∈ {standing, moving, fallen}, a ∈ {slow, fast}
• horizon 1: the optimal myopic policy takes, in each state, the action with the
  largest immediate reward (fallen → slow, standing → fast, moving → slow)
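A small sketch of reading the horizon-1 myopic policy off the reward table above by taking, for each state, the argmax of R(s, a) over actions; the dictionary layout is an illustrative choice, and the reward values are the ones listed on the earlier MDP slide.

R = {("fallen", "slow"): 1,   ("fallen", "fast"): 0,
     ("standing", "slow"): 1, ("standing", "fast"): 2,
     ("moving", "slow"): 1,   ("moving", "fast"): -1}

states  = ["fallen", "standing", "moving"]
actions = ["slow", "fast"]

# Horizon-1 (myopic) policy: in each state pick the action with the largest immediate reward.
myopic_policy = {s: max(actions, key=lambda a: R[(s, a)]) for s in states}
# -> {'fallen': 'slow', 'standing': 'fast', 'moving': 'slow'}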
Markov Decision Process (MDP)

• Defined by the tuple (S, A, T, R, γ):
  state space S, action space A, transition function T, reward function R, discount factor γ
• At every time step t the agent finds itself in a state s ∈ S and selects an action a ∈ A;
  it transitions to the next state s' and selects a new action
• In contrast, in reinforcement learning the agent does not know T and R;
  it learns by sampling the environment
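One possible way (an illustrative sketch, not the course's code) to hold the (S, A, T, R, γ) tuple in Python, with T stored as a nested mapping T[s][a][s'] = probability; the later sketches in this document assume this layout.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class MDP:
    states:  List[str]                          # S
    actions: List[str]                          # A
    T: Dict[str, Dict[str, Dict[str, float]]]   # T[s][a][s'] = P(next state s' | s, a)
    R: Dict[str, Dict[str, float]]              # R[s][a] = expected immediate reward
    gamma: float                                # discount factor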
Policy

• π: S → A
• Rule book: what action to take in each state?
• In state s ∈ {fallen, standing, moving}, take action a ∈ {slow, fast}

[Diagram: stochastic transition graph over {standing, moving, fallen}]
Policy

• Which policy is best?
• π: S → A
• Example robot policies:
  – πA: always slow
  – πB: always fast
  – πC: if fallen slow, else fast
  – πD: if moving fast, else slow

[Diagram: stochastic transition graph over {standing, moving, fallen}]
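The four example policies above, written directly as state-to-action dictionaries (an illustrative sketch):

states = ["fallen", "standing", "moving"]

pi_A = {s: "slow" for s in states}                               # always slow
pi_B = {s: "fast" for s in states}                               # always fast
pi_C = {s: "slow" if s == "fallen" else "fast" for s in states}  # if fallen slow, else fast
pi_D = {s: "fast" if s == "moving" else "slow" for s in states}  # if moving fast, else slow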
Stochastic Policy

• π: S → A
• A policy may be stochastic: randomness in the agent's actions
• For example, πE: from all states,
  with probability 0.3 do slow,
  with probability 0.7 do fast
• This is in addition to the transitions being stochastic

[Diagram: stochastic transition graph over {standing, moving, fallen}]
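A quick sketch of sampling actions from the stochastic policy πE using Python's random module (the function name pi_E is just a label):

import random

def pi_E(state):
    # From every state: slow with probability 0.3, fast with probability 0.7.
    return random.choices(["slow", "fast"], weights=[0.3, 0.7])[0]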
State-Action Diagram

• In state s
• Take action a given by policy π: S → A
• Transition to state s', wherever the transition function takes us
• Repeat
• This unrolls into a tree of (s, a, r, s') tuples

[Backup diagram: s → take action → (s, a) → s']
What is the Value of a Policy?

• π: S → A
• What is the value of a policy?
• It depends on the number of steps
• Example: renting a robot for h time steps, after which it will be destroyed

[Diagram: transition graph with probabilities p and rewards r]
What is the Value of a Policy?

• π: S → A
• h: horizon, number of time steps left
• V_π^h(s): value (expected total reward) of starting at state s and following policy π for h steps
• Example robot policies:
  – πA: always slow
  – πB: always fast
  – πC: if fallen slow, else fast
  – πD: if moving fast, else slow

[Diagram: transition graph with probabilities p and rewards r]
What is the Value of a Policy?

• π: S → A
• h: horizon, number of time steps left
• V_π^h(s): value (expected reward) with policy π starting at s
• By induction on the number of steps left to go, h
• Base case: with no steps remaining, no matter what state we're in, the value is
  V_π^0(s) = 0
What is the Value of a Policy?

• π: S → A
• h: horizon, number of time steps left
• V_π^h(s): value (expected reward) with policy π starting at s
• Value of the policy at state s at horizon h is the
  reward in s + the next state's expected horizon h-1 value
• For h = 1:
  V_π^1(s) = R(s, π(s)) + Σ_{s'} T(s, π(s), s') V_π^0(s') = R(s, π(s))
What is the Value of a Policy?

• π: S → A
• h: horizon, number of time steps left
• V_π^h(s): value (expected reward) with policy π starting at s
• Value of the policy at state s at horizon h is the
  reward in s + the next state's expected horizon h-1 value
• For h = 2:
  V_π^2(s) = R(s, π(s)) + Σ_{s'} T(s, π(s), s') V_π^1(s')
What is the Value of a Policy?

• π: S → A
• h: horizon, number of time steps left
• V_π^h(s): value (expected reward) with policy π starting at s
• Value of the policy at state s at horizon h is the
  reward in s + the next state's expected horizon h-1 value
• For any h:
  V_π^h(s) = R(s, π(s)) + Σ_{s'} T(s, π(s), s') V_π^{h-1}(s')
What is the Value of a Policy?
Updating State Value Function

Finite Horizon Value Iteration Algorithm

• Compute V^h: start from horizon 0 and store the values for each horizon
• Use V^{h-1} to compute V^h
• For n = |S|, m = |A|, horizon h, computation time O(nmh)
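A sketch of finite horizon value iteration under the assumptions of the MDP sketch earlier (nested dicts T[s][a][s'] and R[s][a]); it applies the same one-step recursion as the preceding slides but, as in value iteration, takes a max over actions at each step (evaluating a fixed policy instead would replace the max with π(s)), stores V^0, ..., V^h, and records the greedy action at each horizon.

def finite_horizon_value_iteration(S, A, T, R, h):
    # V[k][s] holds the horizon-k value; base case V^0(s) = 0 for all s.
    V = [{s: 0.0 for s in S}]
    greedy = []                                   # greedy[k-1][s] = best action at horizon k
    for k in range(1, h + 1):
        Vk, pik = {}, {}
        for s in S:
            # Q^k(s, a) = R[s][a] + sum_{s'} T[s][a][s'] * V^{k-1}(s')
            q = {a: R[s][a] + sum(T[s][a][sp] * V[k - 1][sp] for sp in S) for a in A}
            pik[s] = max(q, key=q.get)            # best action at this horizon
            Vk[s] = q[pik[s]]                     # V^k(s) = max_a Q^k(s, a)
        V.append(Vk)
        greedy.append(pik)
    return V, greedy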
Updating Action Value Function

Finite Horizon Optimal Policy

• Given the horizon-h action values Q^h, the optimal finite horizon policy is
  π_h*(s) = argmax_a Q^h(s, a)
• There may be multiple optimal policies


State

Action

Policy
Return and Discount Factor

• Return: G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + …
• Discount factor γ ∈ (0, 1)
• If γ = 0 the agent is myopic, maximizing only immediate rewards
• As γ → 1 the agent becomes farsighted
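A tiny illustrative sketch of computing the discounted return of a finite reward sequence:

def discounted_return(rewards, gamma):
    # G = r_1 + gamma * r_2 + gamma^2 * r_3 + ...
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# discounted_return([1, 2, -1], 0.9) -> 1 + 0.9*2 + 0.81*(-1) = 1.99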
State Value Function

• V_π(s): expected return when starting in state s and following policy π

Action-Value Function

• Q_π(s, a): expected return when starting in state s, taking action a, and then following policy π

Value Functions

• For all states s:
Returns at Successive Time Steps

• Recursive relationship: G_t = r_{t+1} + γ G_{t+1}
4 Bellman Equations

• Expectation for state value (linear):
  V_π(s) = R(s, π(s)) + γ Σ_{s'} T(s, π(s), s') V_π(s')
• Expectation for action value (linear):
  Q_π(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') Q_π(s', π(s'))
• Optimality for state value function (non-linear):
  V*(s) = max_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V*(s') ]
• Optimality for action value function (non-linear):
  Q*(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') max_{a'} Q*(s', a')
Infinite Horizon

• We don't know when the game will be over
• Problem: the total (undiscounted) reward may be infinite, so we cannot select one policy over another
• Solution: find the policy that maximizes the infinite horizon discounted value
  V_π(s) = E[ Σ_{t≥0} γ^t R(s_t, π(s_t)) | s_0 = s ]
• t denotes the number of steps from the starting state


Policy Evaluation

• Expected infinite horizon value of state s under policy π:
  V_π(s) = E[ Σ_{t≥0} γ^t R(s_t, π(s_t)) | s_0 = s ]
• t denotes the number of steps from the starting state
• Writing the Bellman expectation equation for every state gives n = |S| linear equations:
  a system of linear equations
State-Value Function for Policy

• The expected return starting in s and following policy π satisfies a recursive relationship:
  the Bellman equation for V_π
• It relates the value of a state to the values of its successor states
Bellman Equation for State-Value Function

• V_π is the unique solution to its Bellman equation (a linear equation)
• Vector notation: V_π = R_π + γ T_π V_π
• The value of the start state s is the expected reward plus the discounted expected value
  of the next state, wherever the transition function takes us
• The Bellman equation averages over all possibilities, weighting each by its probability of occurring

[Backup diagram: s → take action → (s, a) → reward r → s']
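A sketch of exact policy evaluation via the vector form above, assuming NumPy, a deterministic policy given as a dict pi[s] = a, and the nested-dict T and R used earlier:

import numpy as np

def evaluate_policy(S, T, R, pi, gamma):
    # Build T_pi (n x n) and R_pi (n) for the deterministic policy pi, then solve
    # (I - gamma * T_pi) V = R_pi, i.e. V_pi = (I - gamma * T_pi)^{-1} R_pi.
    n = len(S)
    idx = {s: i for i, s in enumerate(S)}
    T_pi = np.zeros((n, n))
    R_pi = np.zeros(n)
    for s in S:
        a = pi[s]
        R_pi[idx[s]] = R[s][a]
        for sp in S:
            T_pi[idx[s], idx[sp]] = T[s][a].get(sp, 0.0)
    V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    return {s: float(V[idx[s]]) for s in S}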
Bellman Equation for Action-Value Function

• A linear equation:
  Q_π(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') Q_π(s', π(s'))
• The sum is over where the transition function will take us;
  the next action is the one the policy will take
Finding Optimal Policy

• In the infinite horizon case there exists a stationary optimal policy π* (at least one)
  such that for all s ∈ S and all other policies π:  V_{π*}(s) ≥ V_π(s)
• Stationary: the policy does not change over time


Infinite Horizon Value Iteration

• Q*(s, a): expected infinite horizon discounted value of being in state s,
  taking action a, and then executing the optimal policy π*
• With n = |S| and m = |A|, this gives nm non-linear equations with a unique solution
Bellman Optimality Equation for Q*

• A non-linear equation:
  Q*(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') max_{a'} Q*(s', a')
• The expectation is over where the transition function will take us;
  the max is over the actions we can take
• Once we have Q* we can act optimally
Finding Optimal Policy

• If we know the optimal action-value function Q*, then we can derive an optimal policy:
  π*(s) = argmax_a Q*(s, a)
• The optimal policy is not unique


Infinite Horizon Value Iteration Algorithm
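A sketch of infinite horizon value iteration on the action-value function, assuming the nested-dict T and R used earlier; the tolerance and iteration cap are illustrative choices, and the greedy policy at the end follows the previous slide.

def q_value_iteration(S, A, T, R, gamma, tol=1e-6, max_iters=10000):
    # Repeatedly apply Q(s,a) <- R[s][a] + gamma * sum_{s'} T[s][a][s'] * max_{a'} Q(s',a')
    # until the largest change falls below tol.
    Q = {s: {a: 0.0 for a in A} for s in S}
    for _ in range(max_iters):
        delta = 0.0
        for s in S:
            for a in A:
                new_q = R[s][a] + gamma * sum(T[s][a][sp] * max(Q[sp].values()) for sp in S)
                delta = max(delta, abs(new_q - Q[s][a]))
                Q[s][a] = new_q
        if delta < tol:
            break
    # Once we have (an approximation of) Q*, act greedily: pi*(s) = argmax_a Q(s, a).
    pi_star = {s: max(Q[s], key=Q[s].get) for s in S}
    return Q, pi_star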
