Markov Decision Processes Overview

The document discusses Markov Decision Processes (MDPs), which provide a framework for sequential decision making in uncertain environments. An MDP is modeled as a 4-tuple of states, actions, transition probabilities, and rewards. Transition probabilities define the likelihood of moving between states based on actions. The goal is to find an optimal policy that maximizes expected rewards over time. MDPs can be used for problems involving uncertainty like robotics, resource allocation, and more. A policy maps each state to an action, and the value is the expected utility or reward when following that policy over time.

Markov Decision Processes (MDP)

Sudeshna Sarkar
Department of Computer Science & Engineering
IIT Kharagpur
6-7 Sep 2017
How would you get to the airport in the
least amount of time?
 Metro
 Uber
 Taxi
 Airport Express

2
Uncertainty in the real world
 Randomness shows up in many places.
 Could be caused by limitations of the sensors and actuators of the
robot
 Could be caused by market forces or nature, which we have no
control over.

 Taking action a in state s can lead to any of several next states s1’, s2’, …

 How can we hope to act optimally in the face of randomness?


 Certainly we can't just have a single deterministic plan, and
talking about a minimum cost path doesn't make sense.

3
Applications
 Robotics: decide where to move, but actuators can fail, hit unseen obstacles, etc.

 Resource allocation: decide what to produce, but don't know the customer demand for various products.

 Agriculture: decide what to plant, but don't know the weather and thus the crop yield.

4
Volcano crossing

5
Dice Game
For each round r = 1, 2, …
 You choose stay or quit.
 If quit, you get $10 and we end the game.
 If stay, you get $4 and then I roll a 6-sided die.
 If the die shows 1 or 2, we end the game.
 Otherwise, you continue to the next round.

6
MDP for Dice Game
For each round r = 1, 2, …
 You choose stay or quit.
 If quit, you get $10 and we end the game.
 If stay, you get $4 and then I roll a 6-sided die.
 If the die shows 1 or 2, we end the game.
 Otherwise, you continue to the next round.

7
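To make the game above concrete, here is a minimal Python sketch of one play; the function name play_dice_game and its structure are illustrative, not part of the original slides.

```python
import random

def play_dice_game(action: str) -> int:
    """Simulate one play of the game, always choosing `action` ('stay' or 'quit')."""
    total = 0
    while True:
        if action == "quit":
            return total + 10            # quit: collect $10 and the game ends
        total += 4                       # stay: collect $4, then the die is rolled
        if random.randint(1, 6) <= 2:    # a roll of 1 or 2 ends the game
            return total

print(play_dice_game("quit"))   # always 10
print(play_dice_game("stay"))   # 4, 8, 12, ... depending on the die
```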
MDP
Markov Decision Processes

Decision Theoretic Planning

 Markov Property: The transition probabilities depend only on the current state, not on previous history (how that state was reached).

8
MDP Model
MDP Model <S, A, T, R>
 State set S
 Action set A
 Markov transition function T(s,a,s’) = Pr(s’|s,a)
 Bounded real-valued reward function R(s)
  • Can be generalized to include action costs: R(s,a)
  • Can be generalized to be a stochastic function

[Diagram: the Agent sends an Action to the Environment and receives back a State and a Reward; a sample trajectory s0 →a0→ s1 →a1→ s2 →a2→ s3 with rewards r0, r1, r2.]

Process:
• Observe state st ∈ S
• Choose action at ∈ At
• Receive immediate reward rt
• State changes to st+1
9
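A sketch of how the <S, A, T, R> model and the observe/act/reward loop on this slide might be represented in Python; the MDP dataclass and run_episode function are illustrative names, not a standard API.

```python
import random
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class MDP:
    states: List[str]                                    # state set S
    actions: Dict[str, List[str]]                        # actions available in each state
    T: Dict[Tuple[str, str], List[Tuple[str, float]]]    # (s, a) -> [(s', Pr(s'|s,a)), ...]
    R: Callable[[str], float]                            # bounded real-valued reward R(s)

def run_episode(mdp: MDP, policy: Callable[[str], str], s: str, horizon: int) -> float:
    """Follow the observe / act / reward / transition loop from the slide above."""
    total = 0.0
    for _ in range(horizon):
        if not mdp.actions.get(s):                       # no actions left: terminal state
            break
        a = policy(s)                                    # choose action a_t
        total += mdp.R(s)                                # receive immediate reward r_t
        successors, probs = zip(*mdp.T[(s, a)])
        s = random.choices(successors, weights=probs)[0] # state changes to s_{t+1}
    return total
```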
Similarities of MDP with Search?

10
Transitions
 The transition probability T(s, a, s’) specifies the probability of ending up in state s’ if action a is taken in state s.
s a s’ T(s,a,s’)
in quit end 1
in stay in 2/3
in stay end 1/3
 For each state s and action a:

Σ_{s’∈S} T(s, a, s’) = 1

Successors: s’ such that T(s, a, s’) > 0
11
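The transition table for the dice game can be written down directly as data, and the normalization condition checked programmatically; the dictionary layout below is an illustrative choice, not from the slides.

```python
# T[(s, a)] maps each successor s' to T(s, a, s') for the dice game above.
T = {
    ("in", "quit"): {"end": 1.0},
    ("in", "stay"): {"in": 2 / 3, "end": 1 / 3},
}

# For each state s and action a, the probabilities over successors must sum to 1.
for (s, a), successors in T.items():
    assert abs(sum(successors.values()) - 1.0) < 1e-9, (s, a)

# Successors of (s, a) are exactly the states with positive probability.
print([s2 for s2, p in T[("in", "stay")].items() if p > 0])   # ['in', 'end']
```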
Exercise: Transportation problem
 Street with blocks numbered 1 to n.
 Walking from s to s + 1 takes 1 minute.
 Taking a magic tram from s to 2s takes 2 minutes.
 How to travel from 1 to n in the least time?
 Tram fails with probability 0.5.

12
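One possible encoding of the exercise as an MDP; the slide does not say what happens when the tram fails, so this sketch assumes a failed tram leaves you at the current block.

```python
def tram_mdp(n: int):
    """Transitions and costs (in minutes) for the street-and-tram exercise."""
    T = {}      # (s, action) -> list of (s', probability)
    cost = {}   # (s, action) -> minutes spent
    for s in range(1, n):
        T[(s, "walk")] = [(s + 1, 1.0)]                 # walking always reaches s + 1
        cost[(s, "walk")] = 1
        if 2 * s <= n:
            T[(s, "tram")] = [(2 * s, 0.5), (s, 0.5)]   # tram fails with probability 0.5
            cost[(s, "tram")] = 2
    return T, cost

T, cost = tram_mdp(10)
print(T[(3, "tram")])   # [(6, 0.5), (3, 0.5)]
```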
What is a solution?
 Search problem: path (sequence of actions)
 MDP: ??
 MDP: Policy
 A Policy π is a mapping from each state s ∈ States to an action a ∈ Actions(s)

13
Evaluating a policy
 Following a policy yields a random path.
 The utility of a policy is the (discounted) sum of the
rewards on the path (this is a random quantity).
Path                                                        Utility
[in; stay, 4, end]                                          4
[in; stay, 4, in; stay, 4, in; stay, 4, end]                12
[in; stay, 4, in; stay, 4, end]                             8
[in; stay, 4, in; stay, 4, in; stay, 4, in; stay, 4, end]   16
...
The value of a policy is the expected utility.

14
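For the dice game, the value of the "always stay" policy satisfies V = 4 + (2/3)·V, so V = 12. The Monte Carlo sketch below (illustrative, not from the slides) estimates the same value by averaging the utilities of sampled paths.

```python
import random

# Dice-game model and the "always stay" policy (rewards and transitions as on the slides).
policy = {"in": "stay"}
reward = {("in", "stay"): 4, ("in", "quit"): 10}
T = {("in", "stay"): [("in", 2 / 3), ("end", 1 / 3)],
     ("in", "quit"): [("end", 1.0)]}

def sample_utility(start: str = "in") -> float:
    """Utility (undiscounted sum of rewards) of one random path under the policy."""
    s, utility = start, 0.0
    while s != "end":
        a = policy[s]
        utility += reward[(s, a)]
        successors, probs = zip(*T[(s, a)])
        s = random.choices(successors, weights=probs)[0]
    return utility

# The value of the policy is the expected utility; the estimate should come out near 12.
print(sum(sample_utility() for _ in range(100_000)) / 100_000)
```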