MIT 6.036 Lecture

The lecture, led by Prof. Iddo Drori, covers State Machines and Markov Decision Processes (MDPs), focusing on concepts such as states, actions, policies, and value functions. Key topics include transition functions, reward structures, and the Bellman equations for evaluating policies over both finite and infinite horizons. Course materials will be available on the course website, and questions can be directed to Piazza.


Lecture: starts Tuesday 9:35am

Course website: introml.odl.mit.edu


Who's talking? Prof. Iddo Drori
Questions? Piazza
Materials: Will be available on course website
Today’s Plan: State Machines and
Markov Decision Processes (MDPs)

• State machine
• Observation vs. State
• Markov decision process (MDP)
• Policy: value, optimal
• Finite horizon value iteration, optimal policy
• Return
• Bellman equations: expectation, optimality
• Infinite horizon value iteration
State Machine

• S = set of possible states = {standing, moving}
• X = set of possible inputs = {slow, fast}
• f: S × X → S transition function
• s0 ∈ S = initial state = standing
• s1 = f(s0, fast) = ?

[Diagram: state transition graph over {standing, moving} with edges labeled slow and fast]

State Machine

• S = set of possible states = {standing, moving}
• X = set of possible inputs = {slow, fast}
• f: S × X → S transition function
• s0 ∈ S = initial state = standing
• s1 = f(s0, fast) = moving
• We may not observe the states {standing, moving} directly;
  what we observe is, for example, a sensor measurement

[Diagram: state transition graph over {standing, moving} with edges labeled slow and fast]
Example: Observation vs. State

[Figure: observation O vs. state S]
State Machine

• S = set of possible states = {fallen, standing, moving}
• X = set of possible inputs = {slow, fast}
• f: S × X → S transition function
• s0 ∈ S = initial state = standing
• s1 = f(s0, fast) = moving
• We may not observe the states directly
• Y = set of possible outputs
• g: S → Y output function
• y1 = g(s1) = moving

[Diagram: transition graph over {standing, moving} with edges labeled slow and fast]
State Machine

• S = set of possible states = {fallen, standing, moving}
• X = set of possible inputs = {slow, fast}
• f: S × X → S transition function
• Y = set of possible outputs
• g: S → Y output function
• s0 ∈ S = initial state = standing
• Iteratively compute for t ≥ 1:
    s_t = f(s_{t-1}, x_t)
    y_t = g(s_t)

[Diagram: transition graph over {standing, moving} with edges labeled slow and fast]
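A minimal sketch (not from the lecture) of this iterative computation in Python; the function name run_state_machine and its argument names are illustrative assumptions.

def run_state_machine(f, g, s0, inputs):
    # Iteratively compute s_t = f(s_{t-1}, x_t) and y_t = g(s_t) for t = 1, 2, ...
    s = s0
    outputs = []
    for x in inputs:
        s = f(s, x)            # state update
        outputs.append(g(s))   # output / observation of the new state
    return outputs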


State Machine Example

• Reads a binary string
• The string has an even number of zeros iff the machine ends at state S1
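A small sketch of this example in Python, assuming the second state is called S2 (the slide only names S1):

def has_even_zeros(binary_string):
    # Two states: S1 = "even number of zeros seen so far" (start), S2 = "odd".
    # Reading '0' toggles the state; reading '1' leaves it unchanged.
    state = "S1"
    for ch in binary_string:
        if ch == "0":
            state = "S2" if state == "S1" else "S1"
    return state == "S1"   # even number of zeros iff we end at S1

# has_even_zeros("1001") -> True (two zeros); has_even_zeros("10") -> False (one zero)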


State and Reward

?
Markov Model

Markov Process

• set of possible states S = {fallen, standing, moving}
• set of possible actions A = {slow, fast}
• transition model T: S × A × S → ℝ
• the transition function is stochastic: it defines a probability distribution
  over the next state given the previous state and action
• the output is the state (g is the identity)

[Diagram: stochastic transition graph over {standing, moving, fallen} with edge
probabilities p = 1, 1, 3/5, 3/4, 2/5, 2/5, 1/5, 3/5, 1]
Transition P(s, a, s')

• s ∈ {standing, moving, fallen}, a ∈ {slow, fast}; each entry is a probability P(s, a, s')

[Tables: one transition-probability table per action, with rows s and columns s'
ranging over {fallen, standing, moving}; the values appear on the transition diagram
(p = 1, 1, 3/5, 3/4, 2/5, 2/5, 1/5, 3/5, 1)]
Markov Decision Process (MDP)

• set of possible states S = {fallen, standing, moving}
• set of possible actions A = {slow, fast}
• transition model T: S × A × S → ℝ
• reward function R: S × A → ℝ, reward based on state & action:
    R(fallen, slow) = 1      R(fallen, fast) = 0
    R(standing, slow) = 1    R(standing, fast) = 2
    R(moving, slow) = 1      R(moving, fast) = -1

[Diagram: stochastic transition graph over {standing, moving, fallen}]
Markov Decision Process (MDP)

• S = set of possible states {fallen, standing, moving}
• A = set of possible actions {slow, fast}
• T: S × A × S → ℝ transition model
• R: S × A → ℝ reward function
• γ discount factor

[Diagram: stochastic transition graph over {standing, moving, fallen}]
Markov Decision Process (MDP)

• set of possible states S = {fallen, standing, moving}
• set of possible actions A = {slow, fast}
• transition model T: S × A × S → ℝ
• rewards may be probabilistic, given together with the transition function
• γ discount factor

[Diagram: transition graph with edges labeled by probability p and reward r:
(p=1, r=1), (p=1, r=1), (p=3/5, r=2), (p=3/4, r=2), (p=2/5, r=1), (p=2/5, r=-1),
(p=1/5, r=-1), (p=3/5, r=-1), (p=1, r=0)]
Reward R(s, a)

              slow    fast
  fallen        1       0
  standing      1       2
  moving        1      -1

• s ∈ {standing, moving, fallen}, a ∈ {slow, fast}
• horizon 1: the optimal myopic policy takes, in each state, the action with the
  largest immediate reward (fallen → slow, standing → fast, moving → slow)
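A small sketch of reading the horizon-1 myopic policy off the reward table above by taking, for each state, the argmax of R(s, a) over actions; the dictionary layout is an illustrative choice, and the reward values are the ones listed on the earlier MDP slide.

R = {("fallen", "slow"): 1,   ("fallen", "fast"): 0,
     ("standing", "slow"): 1, ("standing", "fast"): 2,
     ("moving", "slow"): 1,   ("moving", "fast"): -1}

states  = ["fallen", "standing", "moving"]
actions = ["slow", "fast"]

# Horizon-1 (myopic) policy: in each state pick the action with the largest immediate reward.
myopic_policy = {s: max(actions, key=lambda a: R[(s, a)]) for s in states}
# -> {'fallen': 'slow', 'standing': 'fast', 'moving': 'slow'}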
Markov Decision Process (MDP)

• Defined by the tuple (S, A, T, R, γ):
  state space S, action space A, transition function T, reward function R, discount factor γ
• At every time step t the agent finds itself in a state s ∈ S and selects an action a ∈ A;
  it transitions to the next state s' and selects a new action
• In contrast, in reinforcement learning the agent does not know T and R;
  it learns by sampling the environment
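One possible way (an illustrative sketch, not the course's code) to hold the (S, A, T, R, γ) tuple in Python, with T stored as a nested mapping T[s][a][s'] = probability; the later sketches in this document assume this layout.

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class MDP:
    states:  List[str]                          # S
    actions: List[str]                          # A
    T: Dict[str, Dict[str, Dict[str, float]]]   # T[s][a][s'] = P(next state s' | s, a)
    R: Dict[str, Dict[str, float]]              # R[s][a] = expected immediate reward
    gamma: float                                # discount factor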
Policy

• π: S → A
• Rule book: what action to take in each state?
• In state s ∈ {fallen, standing, moving}, take action a ∈ {slow, fast}

[Diagram: stochastic transition graph over {standing, moving, fallen}]
Policy

• Which policy is best?
• π: S → A
• Example robot policies:
  – πA: always slow
  – πB: always fast
  – πC: if fallen slow, else fast
  – πD: if moving fast, else slow

[Diagram: stochastic transition graph over {standing, moving, fallen}]
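The four example policies above, written directly as state-to-action dictionaries (an illustrative sketch):

states = ["fallen", "standing", "moving"]

pi_A = {s: "slow" for s in states}                               # always slow
pi_B = {s: "fast" for s in states}                               # always fast
pi_C = {s: "slow" if s == "fallen" else "fast" for s in states}  # if fallen slow, else fast
pi_D = {s: "fast" if s == "moving" else "slow" for s in states}  # if moving fast, else slow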
Stochastic Policy

• π: S → A
• A policy may be stochastic: randomness in the agent's actions
• For example, πE: from all states,
  with probability 0.3 do slow,
  with probability 0.7 do fast
• This is in addition to the transitions being stochastic

[Diagram: stochastic transition graph over {standing, moving, fallen}]
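A quick sketch of sampling actions from the stochastic policy πE using Python's random module (the function name pi_E is just a label):

import random

def pi_E(state):
    # From every state: slow with probability 0.3, fast with probability 0.7.
    return random.choices(["slow", "fast"], weights=[0.3, 0.7])[0]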
State-Action Diagram

• In state s
• Take action a given by policy π: S → A
• Transition to state s', wherever the transition function takes us
• Repeat
• This unrolls into a tree of (s, a, r, s') tuples

[Backup diagram: s → take action → (s, a) → s']
What is the Value of a Policy?

• π: S → A
• What is the value of a policy?
• It depends on the number of steps
• Example: renting a robot for h time steps, after which it will be destroyed

[Diagram: transition graph with probabilities p and rewards r]
What is the Value of a Policy?

• π: S → A
• h: horizon, number of time steps left
• V_π^h(s): value (expected total reward) of starting at state s and following policy π for h steps
• Example robot policies:
  – πA: always slow
  – πB: always fast
  – πC: if fallen slow, else fast
  – πD: if moving fast, else slow

[Diagram: transition graph with probabilities p and rewards r]
What is the Value of a Policy?

• π: S → A
• h: horizon, number of time steps left
• V_π^h(s): value (expected reward) with policy π starting at s
• By induction on the number of steps left to go, h
• Base case: with no steps remaining, no matter what state we're in, the value is
  V_π^0(s) = 0
What is the Value of a Policy?

• π: S → A
• h: horizon, number of time steps left
• V_π^h(s): value (expected reward) with policy π starting at s
• Value of the policy at state s at horizon h is the
  reward in s + the next state's expected horizon h-1 value
• For h = 1:
  V_π^1(s) = R(s, π(s)) + Σ_{s'} T(s, π(s), s') V_π^0(s') = R(s, π(s))
What is the Value of a Policy?

• π: S → A
• h: horizon, number of time steps left
• V_π^h(s): value (expected reward) with policy π starting at s
• Value of the policy at state s at horizon h is the
  reward in s + the next state's expected horizon h-1 value
• For h = 2:
  V_π^2(s) = R(s, π(s)) + Σ_{s'} T(s, π(s), s') V_π^1(s')
What is the Value of a Policy?

• π: S → A
• h: horizon, number of time steps left
• V_π^h(s): value (expected reward) with policy π starting at s
• Value of the policy at state s at horizon h is the
  reward in s + the next state's expected horizon h-1 value
• For any h:
  V_π^h(s) = R(s, π(s)) + Σ_{s'} T(s, π(s), s') V_π^{h-1}(s')
What is the Value of a Policy?
Updating State Value Function

Finite Horizon Value Iteration Algorithm

• Compute V^h: start from horizon 0 and store the values for each horizon
• Use V^{h-1} to compute V^h
• For n = |S|, m = |A|, horizon h, computation time O(nmh)
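A sketch of finite horizon value iteration under the assumptions of the MDP sketch earlier (nested dicts T[s][a][s'] and R[s][a]); it applies the same one-step recursion as the preceding slides but, as in value iteration, takes a max over actions at each step (evaluating a fixed policy instead would replace the max with π(s)), stores V^0, ..., V^h, and records the greedy action at each horizon.

def finite_horizon_value_iteration(S, A, T, R, h):
    # V[k][s] holds the horizon-k value; base case V^0(s) = 0 for all s.
    V = [{s: 0.0 for s in S}]
    greedy = []                                   # greedy[k-1][s] = best action at horizon k
    for k in range(1, h + 1):
        Vk, pik = {}, {}
        for s in S:
            # Q^k(s, a) = R[s][a] + sum_{s'} T[s][a][s'] * V^{k-1}(s')
            q = {a: R[s][a] + sum(T[s][a][sp] * V[k - 1][sp] for sp in S) for a in A}
            pik[s] = max(q, key=q.get)            # best action at this horizon
            Vk[s] = q[pik[s]]                     # V^k(s) = max_a Q^k(s, a)
        V.append(Vk)
        greedy.append(pik)
    return V, greedy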
Updating Action Value Function

Finite Horizon Optimal Policy

• Given the horizon-h action values Q^h, the optimal finite horizon policy is
  π_h*(s) = argmax_a Q^h(s, a)
• There may be multiple optimal policies


State

Action

Policy
Return and Discount Factor

• Return: G_t = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + …
• Discount factor γ ∈ (0, 1)
• If γ = 0 the agent is myopic, maximizing only immediate rewards
• As γ → 1 the agent becomes farsighted
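A tiny illustrative sketch of computing the discounted return of a finite reward sequence:

def discounted_return(rewards, gamma):
    # G = r_1 + gamma * r_2 + gamma^2 * r_3 + ...
    return sum(gamma ** k * r for k, r in enumerate(rewards))

# discounted_return([1, 2, -1], 0.9) -> 1 + 0.9*2 + 0.81*(-1) = 1.99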
State Value Function

• V_π(s): expected return when starting in state s and following policy π

Action-Value Function

• Q_π(s, a): expected return when starting in state s, taking action a, and then following policy π

Value Functions

• For all states s:
Returns at Successive Time Steps

• Recursive relationship: G_t = r_{t+1} + γ G_{t+1}
4 Bellman Equations

• Expectation for state value (linear):
  V_π(s) = R(s, π(s)) + γ Σ_{s'} T(s, π(s), s') V_π(s')
• Expectation for action value (linear):
  Q_π(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') Q_π(s', π(s'))
• Optimality for state value function (non-linear):
  V*(s) = max_a [ R(s, a) + γ Σ_{s'} T(s, a, s') V*(s') ]
• Optimality for action value function (non-linear):
  Q*(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') max_{a'} Q*(s', a')
Infinite Horizon

• We don't know when the game will be over
• Problem: the total (undiscounted) reward may be infinite, so we cannot select one policy over another
• Solution: find the policy that maximizes the infinite horizon discounted value
  V_π(s) = E[ Σ_{t≥0} γ^t R(s_t, π(s_t)) | s_0 = s ]
• t denotes the number of steps from the starting state


Policy Evaluation

• Expected infinite horizon value of state s under policy π:
  V_π(s) = E[ Σ_{t≥0} γ^t R(s_t, π(s_t)) | s_0 = s ]
• t denotes the number of steps from the starting state
• Writing the Bellman expectation equation for every state gives n = |S| linear equations:
  a system of linear equations
State-Value Function for Policy

• The expected return starting in s and following policy π satisfies a recursive relationship:
  the Bellman equation for V_π
• It relates the value of a state to the values of its successor states
Bellman Equation for State-Value Function

• V_π is the unique solution to its Bellman equation (a linear equation)
• Vector notation: V_π = R_π + γ T_π V_π
• The value of the start state s is the expected reward plus the discounted expected value
  of the next state, wherever the transition function takes us
• The Bellman equation averages over all possibilities, weighting each by its probability of occurring

[Backup diagram: s → take action → (s, a) → reward r → s']
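A sketch of exact policy evaluation via the vector form above, assuming NumPy, a deterministic policy given as a dict pi[s] = a, and the nested-dict T and R used earlier:

import numpy as np

def evaluate_policy(S, T, R, pi, gamma):
    # Build T_pi (n x n) and R_pi (n) for the deterministic policy pi, then solve
    # (I - gamma * T_pi) V = R_pi, i.e. V_pi = (I - gamma * T_pi)^{-1} R_pi.
    n = len(S)
    idx = {s: i for i, s in enumerate(S)}
    T_pi = np.zeros((n, n))
    R_pi = np.zeros(n)
    for s in S:
        a = pi[s]
        R_pi[idx[s]] = R[s][a]
        for sp in S:
            T_pi[idx[s], idx[sp]] = T[s][a].get(sp, 0.0)
    V = np.linalg.solve(np.eye(n) - gamma * T_pi, R_pi)
    return {s: float(V[idx[s]]) for s in S}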
Bellman Equation for Action-Value Function

• A linear equation:
  Q_π(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') Q_π(s', π(s'))
• The sum is over where the transition function will take us;
  the next action is the one the policy will take
Finding Optimal Policy

• In the infinite horizon case there exists a stationary optimal policy π* (at least one)
  such that for all s ∈ S and all other policies π:  V_{π*}(s) ≥ V_π(s)
• Stationary: the policy does not change over time


Infinite Horizon Value Iteration

• Q*(s, a): expected infinite horizon discounted value of being in state s,
  taking action a, and then executing the optimal policy π*
• With n = |S| and m = |A|, this gives nm non-linear equations with a unique solution
Bellman Optimality Equation for Q*

• A non-linear equation:
  Q*(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') max_{a'} Q*(s', a')
• The expectation is over where the transition function will take us;
  the max is over the actions we can take
• Once we have Q* we can act optimally
Finding Optimal Policy

• If we know the optimal action-value function Q*, then we can derive an optimal policy:
  π*(s) = argmax_a Q*(s, a)
• The optimal policy is not unique


Infinite Horizon Value Iteration Algorithm
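A sketch of infinite horizon value iteration on the action-value function, assuming the nested-dict T and R used earlier; the tolerance and iteration cap are illustrative choices, and the greedy policy at the end follows the previous slide.

def q_value_iteration(S, A, T, R, gamma, tol=1e-6, max_iters=10000):
    # Repeatedly apply Q(s,a) <- R[s][a] + gamma * sum_{s'} T[s][a][s'] * max_{a'} Q(s',a')
    # until the largest change falls below tol.
    Q = {s: {a: 0.0 for a in A} for s in S}
    for _ in range(max_iters):
        delta = 0.0
        for s in S:
            for a in A:
                new_q = R[s][a] + gamma * sum(T[s][a][sp] * max(Q[sp].values()) for sp in S)
                delta = max(delta, abs(new_q - Q[s][a]))
                Q[s][a] = new_q
        if delta < tol:
            break
    # Once we have (an approximation of) Q*, act greedily: pi*(s) = argmax_a Q(s, a).
    pi_star = {s: max(Q[s], key=Q[s].get) for s in S}
    return Q, pi_star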
