CS 188 Introduction to Artificial Intelligence

Fall 2018 Note 4


These lecture notes are heavily based on notes originally written by Nikhil Sharma.

Non-Deterministic Search
Picture a runner, coming to the end of his first ever marathon. Though it seems likely he will complete
the race and claim the accompanying everlasting glory, it’s by no means guaranteed. He may pass out
from exhaustion or misstep and slip and fall, tragically breaking both of his legs. Even more unlikely, a
literally earth-shattering earthquake may spontaneously occur, swallowing up the runner mere inches before
he crosses the finish line. Such possibilities add a degree of uncertainty to the runner’s actions, and it’s this
uncertainty that will be the subject of the following discussion. In the first note, we talked about traditional
search problems and how to solve them; then, in the third note, we changed our model to account for
adversaries and other agents in the world that influenced our path to goal states. Now, we’ll change our
model again to account for another influencing factor – the dynamics of the world itself. The environment in
which an agent is placed may make the agent's actions nondeterministic, which means that there
are multiple possible successor states that can result from an action taken in some state. This is, in fact, the
case in many card games such as poker or blackjack, where there exists an inherent uncertainty from the
randomness of card dealing. Such problems where the world poses a degree of uncertainty are known as
nondeterministic search problems, and can be solved with models known as Markov decision processes,
or MDPs.

Markov Decision Processes


A Markov Decision Process is defined by several properties:
• A set of states S. States in MDPs are represented in the same way as states in traditional search
problems.

• A set of actions A. Actions in MDPs are also represented in the same way as in traditional search
problems.

• A start state.

• Possibly one or more terminal states.

• Possibly a discount factor γ. We’ll cover discount factors shortly.

• A transition function T (s, a, s′). Since we have introduced the possibility of nondeterministic actions,
we need a way to delineate the likelihood of the possible outcomes after taking any given action from
any given state. The transition function for an MDP does exactly this - it's a probability function which
represents the probability that an agent taking an action a ∈ A from a state s ∈ S ends up in a state
s′ ∈ S.

• A reward function R(s, a, s′). Typically, MDPs are modeled with small "living" rewards at each step
to reward an agent’s survival, along with large rewards for arriving at a terminal state. Rewards may
be positive or negative depending on whether or not they benefit the agent in question, and the agent’s
objective is naturally to acquire the maximum reward possible before arriving at some terminal state.

Constructing an MDP for a situation is quite similar to constructing a state-space graph for a search problem, with a couple of additional caveats. Consider the motivating example of a racecar:

There are three possible states, S = {cool, warm, overheated}, and two possible actions A = {slow, fast}.
Just like in a state-space graph, each of the three states is represented by a node, with edges representing
actions. Overheated is a terminal state, since once a racecar agent arrives at this state, it can no longer
perform any actions for further rewards (it’s a sink state in the MDP and has no outgoing edges). Notably,
for nondeterministic actions, there are multiple edges representing the same action from the same state with
differing successor states. Each edge is annotated not only with the action it represents, but also a transition
probability and corresponding reward. These are summarized below:

• Transition Function T(s, a, s′):

– T(cool, slow, cool) = 1
– T(warm, slow, cool) = 0.5
– T(warm, slow, warm) = 0.5
– T(cool, fast, cool) = 0.5
– T(cool, fast, warm) = 0.5
– T(warm, fast, overheated) = 1

• Reward Function R(s, a, s′):

– R(cool, slow, cool) = 1
– R(warm, slow, cool) = 1
– R(warm, slow, warm) = 1
– R(cool, fast, cool) = 2
– R(cool, fast, warm) = 2
– R(warm, fast, overheated) = −10
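
If you want to experiment with this MDP in code, here is one minimal way the racecar's transition and reward functions could be written down in Python; the dictionary-of-(state, action, next_state)-tuples encoding and the names racecar_T and racecar_R are illustrative choices for this sketch, not something prescribed by the note.

    # Sketch: the racecar MDP's T and R as Python dictionaries keyed by
    # (state, action, next_state). Terminal states such as 'overheated'
    # simply have no outgoing entries.
    racecar_T = {
        ('cool', 'slow', 'cool'): 1.0,
        ('warm', 'slow', 'cool'): 0.5,
        ('warm', 'slow', 'warm'): 0.5,
        ('cool', 'fast', 'cool'): 0.5,
        ('cool', 'fast', 'warm'): 0.5,
        ('warm', 'fast', 'overheated'): 1.0,
    }
    racecar_R = {
        ('cool', 'slow', 'cool'): 1,
        ('warm', 'slow', 'cool'): 1,
        ('warm', 'slow', 'warm'): 1,
        ('cool', 'fast', 'cool'): 2,
        ('cool', 'fast', 'warm'): 2,
        ('warm', 'fast', 'overheated'): -10,
    }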

We represent the movement of an agent through different MDP states over time with discrete timesteps,
defining st ∈ S and at ∈ A as the state in which an agent exists and the action which an agent takes at
timestep t, respectively. An agent starts in state s0 at timestep 0, and takes an action at every timestep. The
movement of an agent through an MDP can thus be modeled as follows:
s0 −a0→ s1 −a1→ s2 −a2→ s3 −a3→ ...

Additionally, knowing that an agent's goal is to maximize its reward across all timesteps, we can correspondingly express this mathematically as a maximization of the following utility function:

U([s0 , a0 , s1 , a1 , s2 , ...]) = R(s0 , a0 , s1 ) + R(s1 , a1 , s2 ) + R(s2 , a2 , s3 ) + ...

Markov decision processes, like state-space graphs, can be unraveled into search trees. Uncertainty is modeled in these search trees with q-states, also known as action states, essentially identical to expectimax chance nodes. This is a fitting choice, as q-states use probabilities to model the uncertainty that the environment will land an agent in a given state just as expectimax chance nodes use probabilities to model the uncertainty that adversarial agents will land our agent in a given state through the move these agents select.
The q-state represented by having taken action a from state s is notated as the tuple (s, a).
Observe the unraveled search tree for our racecar, truncated to depth-2:

The green nodes represent q-states, where an action has been taken from a state but has yet to be resolved
into a successor state. It’s important to understand that agents spend zero timesteps in q-states, and that they
are simply a construct created for ease of representation and development of MDP algorithms.

Finite Horizons and Discounting


There is an inherent problem with our racecar MDP - we haven’t placed any time constraints on the number
of timesteps for which a racecar can take actions and collect rewards. With our current formulation, it could
routinely choose a = slow at every timestep forever, safely and effectively obtaining infinite reward without
any risk of overheating. This is prevented by the introduction of finite horizons and/or discount factors.
An MDP enforcing a finite horizon is simple - it essentially defines a "lifetime" for agents, which gives them
some set number of timesteps n to accrue as much reward as they can before being automatically terminated.
We’ll return to this concept shortly.
Discount factors are slightly more complicated, and are introduced to model an exponential decay in the
value of rewards over time. Concretely, with a discount factor of γ, taking action at from state st at timestep
t and ending up in state st+1 results in a reward of γ^t R(st , at , st+1 ) instead of just R(st , at , st+1 ). Now, instead
of maximizing the additive utility

U([s0 , a0 , s1 , a1 , s2 , ...]) = R(s0 , a0 , s1 ) + R(s1 , a1 , s2 ) + R(s2 , a2 , s3 ) + ...

we attempt to maximize discounted utility

U([s0 , a0 , s1 , a1 , s2 , ...]) = R(s0 , a0 , s1 ) + γR(s1 , a1 , s2 ) + γ^2 R(s2 , a2 , s3 ) + ...

Noting that the above definition of a discounted utility function looks dangerously close to a geometric
series with ratio γ, we can prove that it's guaranteed to be finite-valued as long as the constraint |γ| < 1
(where |n| denotes the absolute value operator) is met, through the following logic:

U([s0 , s1 , s2 , ...]) = R(s0 , a0 , s1 ) + γR(s1 , a1 , s2 ) + γ^2 R(s2 , a2 , s3 ) + ...
                       = ∑_{t=0}^∞ γ^t R(st , at , st+1 ) ≤ ∑_{t=0}^∞ γ^t Rmax = Rmax / (1 − γ)

where Rmax is the maximum possible reward attainable at any given timestep in the MDP. Typically, γ is
selected strictly from the range 0 < γ < 1 since values in the range −1 < γ ≤ 0 are simply not
meaningful in most real-world situations - a negative value for γ means the reward for a state s would
flip-flop between positive and negative values at alternating timesteps.
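
As a quick sanity check on the bound above, discounted utility is just a γ-weighted sum of per-step rewards, which is easy to compute directly; the following sketch and its function name are purely illustrative.

    # Sketch: discounted utility of a reward sequence [R(s0,a0,s1), R(s1,a1,s2), ...].
    def discounted_utility(rewards, gamma):
        return sum((gamma ** t) * r for t, r in enumerate(rewards))

    # With gamma = 0.5 and a constant reward of 1 per step, the utility approaches
    # Rmax / (1 - gamma) = 2 no matter how long the sequence runs.
    print(discounted_utility([1] * 50, 0.5))  # prints a value just under 2.0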

Markovianness
Markov decision processes are "markovian" in the sense that they satisfy the Markov property, or memoryless property, which states that the future and the past are conditionally independent, given the present. Intuitively, this means that, if we know the present state, knowing the past doesn't give us any more information about the future. To express this mathematically, consider an agent that has visited states s0 , s1 , ..., st
after taking actions a0 , a1 , ..., at−1 in some MDP, and has just taken action at . The probability that this agent
then arrives at state st+1 given their history of previous states visited and actions taken can be written as
follows:
P(St+1 = st+1 |St = st , At = at , St−1 = st−1 , At−1 = at−1 , ..., S0 = s0 )
where each St denotes the random variable representing our agent’s state and At denotes the random variable
representing the action our agent takes at time t. The Markov property states that the above probability can
be simplified as follows:
P(St+1 = st+1 |St = st , At = at , St−1 = st−1 , At−1 = at−1 , ..., S0 = s0 ) = P(St+1 = st+1 |St = st , At = at )
which is "memoryless" in the sense that the probability of arriving in a state s′ at time t + 1 depends only on the state s and action a taken at time t, not on any earlier states or actions. In fact, it is these memoryless probabilities which are encoded by the transition function: T (s, a, s′) = P(s′ | s, a).
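
Because T(s, a, s′) = P(s′ | s, a) is a conditional probability distribution over successor states, the transition probabilities out of any (s, a) pair must sum to 1. A small sanity check along these lines, written against the hypothetical dictionary encoding sketched earlier:

    # Sketch: check that the transition probabilities for each (state, action)
    # pair sum to 1, as required of a conditional probability distribution.
    from collections import defaultdict

    def transition_function_is_valid(T, tol=1e-9):
        totals = defaultdict(float)
        for (s, a, s_next), prob in T.items():
            totals[(s, a)] += prob
        return all(abs(total - 1.0) < tol for total in totals.values())

    print(transition_function_is_valid(racecar_T))  # True for the racecar sketch above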

Solving Markov Decision Processes


Recall that in deterministic, non-adversarial search, solving a search problem means finding an optimal plan
to arrive at a goal state. Solving a Markov decision process, on the other hand, means finding an optimal
policy π ∗ : S → A, a function mapping each state s ∈ S to an action a ∈ A. An explicit policy π defines a
reflex agent - given a state s, an agent at s implementing π will select a = π(s) as the appropriate action to take, without considering future consequences of its actions. An optimal policy is one that, if followed by
the implementing agent, will yield the maximum expected total reward or utility.
Consider the following MDP with S = {a, b, c, d, e}, A = {East,West, Exit} (with Exit being a valid
action only in states a and e and yielding rewards of 10 and 1 respectively), a discount factor γ = 0.1, and
deterministic transitions:

Two potential policies for this MDP are as follows:

(a) Policy 1 (b) Policy 2

With some investigation, it’s not hard to determine that Policy 2 is optimal. Following the policy until
making action a = Exit yields the following rewards for each start state:

Start State Reward


a 10
b 1
c 0.1
d 0.1
e 1
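
To see where these numbers come from, take the five states to be laid out in a row from a to e (which is consistent with the rewards in the table) and assume, as the table implies, that the East and West moves themselves yield no reward. Starting from c, Policy 2 moves two steps toward a for no reward and then exits, so its discounted utility is 0 + γ · 0 + γ^2 · 10 = (0.1)^2 · 10 = 0.1. From d, heading East and exiting from e is worth γ · 1 = 0.1, which beats walking three steps to a for only γ^3 · 10 = 0.01.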

We’ll now learn how to solve such MDPs (and much more complex ones!) algorithmically using the
Bellman equation for Markov decision processes.

The Bellman Equation


In order to talk about the Bellman equation for MDPs, we must first introduce two new mathematical quantities:

• The optimal value of a state s, V ∗ (s) – the optimal value of s is the expected value of the utility an
optimally-behaving agent that starts in s will receive, over the rest of the agent’s lifetime.

• The optimal value of a q-state (s, a), Q∗ (s, a) - the optimal value of (s, a) is the expected value of the
utility an agent receives after starting in s, taking a, and acting optimally henceforth.

Using these two new quantities and the other MDP quantities discussed earlier, the Bellman equation is
defined as follows:
V ∗ (s) = max_a ∑_{s′} T (s, a, s′)[R(s, a, s′) + γV ∗ (s′)]
Before we begin interpreting what this means, let’s also define the equation for the optimal value of a q-state
(more commonly known as an optimal q-value):

Q∗ (s, a) = ∑_{s′} T (s, a, s′)[R(s, a, s′) + γV ∗ (s′)]

Note that this second definition allows us to reexpress the Bellman equation as

V ∗ (s) = max_a Q∗ (s, a)

which is a dramatically simpler quantity. The Bellman equation is an example of a dynamic programming equation, an equation that decomposes a problem into smaller subproblems via an inherent recursive structure. We can see this inherent recursion in the equation for the q-value of a state, in the term

[R(s, a, s′) + γV ∗ (s′)]. This term represents the total utility an agent receives by first taking a from s and arriving at s′ and then acting optimally henceforth. The immediate reward from the action a taken, R(s, a, s′), is added to the optimal reward attainable from s′, V ∗ (s′), which is discounted by γ to account for the passage of the timestep in taking a. Though in most cases there exists a vast number of possible sequences of states and actions from s′ to some terminal state, all this detail is abstracted away and encapsulated in a single recursive value, V ∗ (s′).
We can now take another step outwards and consider the full equation for the q-value. Knowing that [R(s, a, s′) + γV ∗ (s′)] represents the utility attained by acting optimally after arriving in state s′ from q-state (s, a), it becomes evident that the quantity

∑_{s′} T (s, a, s′)[R(s, a, s′) + γV ∗ (s′)]

is simply a weighted sum of utilities, with each utility weighted by its probability of occurrence. This is definitionally the expected utility of acting optimally from q-state (s, a) onwards! This completes our analysis
and gives us enough insight to interpret the full Bellman equation - the optimal value of a state, V ∗ (s), is
simply the maximum expected utility over all possible actions from s. Computing maximum expected utility
for a state s is essentially the same as running expectimax - we first compute the expected utility from each
q-state (s, a) (equivalent to computing the value of chance nodes), then compute the maximum over these
nodes to compute the maximum expected utility (equivalent to computing the value of a maximizer node).
One final note on the Bellman equation – its usage is as a condition for optimality. In other words, if we can
somehow determine a value V (s) for every state s ∈ S such that the Bellman equation holds true for each
of these states, we can conclude that these values are the optimal values for their respective states. Indeed,
satisfying this condition implies ∀s ∈ S, V (s) = V ∗ (s).
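
In code, the right-hand side of the Bellman equation is just a one-step lookahead over a table of values. The sketch below computes q-values from a value function and checks whether a candidate V satisfies the Bellman equation at every state; it assumes the hypothetical dictionary encoding of T and R from the earlier racecar sketch, and the helper names are made up for illustration.

    # Sketch: one-step lookahead (Q from V) and a Bellman-consistency check.
    def q_value(T, R, gamma, V, s, a):
        # Q(s, a) = sum over s' of T(s, a, s') * [R(s, a, s') + gamma * V(s')]
        return sum(prob * (R[(s, a, s_next)] + gamma * V[s_next])
                   for (si, ai, s_next), prob in T.items() if si == s and ai == a)

    def satisfies_bellman(states, actions, T, R, gamma, V, tol=1e-9):
        for s in states:
            qs = [q_value(T, R, gamma, V, s, a) for a in actions
                  if any(si == s and ai == a for (si, ai, s_next) in T)]
            best = max(qs) if qs else 0.0     # terminal states have optimal value 0
            if abs(V[s] - best) > tol:
                return False
        return True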

Value Iteration
Now that we have a framework to test for optimality of the values of states in an MDP, the natural follow-up question to ask is how to actually compute these optimal values. To answer this question, we need time-limited values (the natural result of enforcing finite horizons). The time-limited value for a state s with a time-limit of k timesteps is denoted Vk (s), and represents the maximum expected utility attainable from s given that the Markov decision process under consideration terminates in k timesteps. Equivalently, this is what a depth-k expectimax run on the search tree for an MDP returns.
Value iteration is a dynamic programming algorithm that uses an iteratively longer time limit to compute
time-limited values until convergence (that is, until the V values are the same for each state as they were in
the past iteration: ∀s,Vk+1 (s) = Vk (s)). It operates as follows:

1. ∀s ∈ S, initialize V0 (s) = 0. This should be intuitive, since setting a time limit of 0 timesteps means
no actions can be taken before termination, and so no rewards can be acquired.

2. Repeat the following update rule until convergence:

∀s ∈ S, Vk+1 (s) ← max_a ∑_{s′} T (s, a, s′)[R(s, a, s′) + γVk (s′)]

At iteration k of value iteration, we use the time-limited values with limit k for each state to generate the time-limited values with limit (k + 1). In essence, we use computed solutions to subproblems (all the Vk (s)) to iteratively build up solutions to larger subproblems (all the Vk+1 (s)); this is what makes value iteration a dynamic programming algorithm. A short code sketch of this update loop follows below.
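
The sketch below is a direct transcription of the update rule into Python, again assuming the hypothetical (state, action, next_state) dictionary encoding of T and R from the racecar example; the function name and convergence tolerance are arbitrary choices.

    # Sketch of value iteration: initialize V0 to zero everywhere, then apply the
    # update to every state until the values stop changing.
    def value_iteration(states, actions, T, R, gamma, tol=1e-9):
        V = {s: 0.0 for s in states}                      # step 1: V0(s) = 0 for every state
        while True:
            V_new = {}
            for s in states:
                q_values = []
                for a in actions:
                    outcomes = [(s_next, prob) for (si, ai, s_next), prob in T.items()
                                if si == s and ai == a]
                    if outcomes:                          # action a is available in s
                        q_values.append(sum(prob * (R[(s, a, s_next)] + gamma * V[s_next])
                                            for s_next, prob in outcomes))
                V_new[s] = max(q_values) if q_values else 0.0   # terminal states stay at 0
            if all(abs(V_new[s] - V[s]) < tol for s in states):  # convergence: Vk+1 = Vk
                return V_new
            V = V_new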

Note that though the Bellman equation looks essentially identical in construction to the update rule above,
they are not the same. The Bellman equation gives a condition for optimality, while the update rule gives a
method to iteratively update values until convergence. When convergence is reached, the Bellman equation
will hold for every state: ∀s ∈ S, Vk (s) = Vk+1 (s) = V ∗ (s).
Let’s see a few updates of value iteration in practice by revisiting our racecar MDP from earlier, introducing
a discount factor of γ = 0.5:

We begin value iteration by initialization of all V0 (s) = 0:

cool warm overheated


V0 0 0 0

In our first round of updates, we can compute ∀s ∈ S, V1 (s) as follows:

V1 (cool) = max{1 · [1 + 0.5 · 0], 0.5 · [2 + 0.5 · 0] + 0.5 · [2 + 0.5 · 0]}


= max{1, 2}
= 2
V1 (warm) = max{0.5 · [1 + 0.5 · 0] + 0.5 · [1 + 0.5 · 0], 1 · [−10 + 0.5 · 0]}
= max{1, −10}
= 1
V1 (overheated) = max{}
= 0

cool warm overheated


V0 0 0 0
V1 2 1 0

Similarly, we can repeat the procedure to compute a second round of updates with our newfound values for V1 (s) to compute V2 (s).

V2 (cool) = max{1 · [1 + 0.5 · 2], 0.5 · [2 + 0.5 · 2] + 0.5 · [2 + 0.5 · 1]}


= max{2, 2.75}
= 2.75
V2 (warm) = max{0.5 · [1 + 0.5 · 2] + 0.5 · [1 + 0.5 · 1], 1 · [−10 + 0.5 · 0]}
= max{1.75, −10}
= 1.75
V2 (overheated) = max{}
= 0

cool warm overheated


V0 0 0 0
V1 2 1 0
V2 2.75 1.75 0

It’s worthwhile to observe that V ∗ (s) for any terminal state must be 0, since no actions can ever be taken
from any terminal state to reap any rewards.
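
Running the hypothetical value_iteration sketch from earlier on the racecar dictionaries with γ = 0.5 reproduces these hand computations along the way to convergence:

    states = ['cool', 'warm', 'overheated']
    actions = ['slow', 'fast']
    V_star = value_iteration(states, actions, racecar_T, racecar_R, gamma=0.5)
    # The intermediate iterates match the tables above:
    #   V1 = {cool: 2, warm: 1, overheated: 0}
    #   V2 = {cool: 2.75, warm: 1.75, overheated: 0}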

Policy Extraction
Recall that our ultimate goal in solving an MDP is to determine an optimal policy. This can be done once
all optimal values for states are determined using a method called policy extraction. The intuition behind
policy extraction is very simple: if you’re in a state s, you should take the action a which yields the maximum
expected utility. Not surprisingly, a is the action which takes us to the q-state with maximum q-value,
allowing for a formal definition of the optimal policy:

∀s ∈ S, π ∗ (s) = argmax_a Q∗ (s, a) = argmax_a ∑_{s′} T (s, a, s′)[R(s, a, s′) + γV ∗ (s′)]

It’s useful to keep in mind for performance reasons that it’s better for policy extraction to have the optimal
q-values of states, in which case a single argmax operation is all that is required to determine the optimal
action from a state. Storing only each V ∗ (s) means that we must recompute all necessary q-values with the
Bellman equation before applying argmax, equivalent to performing a depth-1 expectimax.
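
As a sketch, policy extraction is just an argmax over the q-values of each state; the code below reuses the hypothetical q_value helper from the Bellman-equation sketch and returns None for terminal states, which have no actions to choose.

    # Sketch: extract a policy from a table of state values via argmax over q-values.
    def extract_policy(states, actions, T, R, gamma, V):
        policy = {}
        for s in states:
            available = [a for a in actions
                         if any(si == s and ai == a for (si, ai, s_next) in T)]
            if not available:                 # terminal state: nothing to extract
                policy[s] = None
            else:
                policy[s] = max(available, key=lambda a: q_value(T, R, gamma, V, s, a))
        return policy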

Policy Iteration
Value iteration can be quite slow. At each iteration, we must update the values of all |S| states (where |n| refers to the cardinality operator), each of which requires iteration over all |A| actions as we compute the q-value for each action. The computation of each of these q-values, in turn, requires iteration over each of the |S| states again, leading to a poor runtime of O(|S|^2 |A|) per iteration. Additionally, when all we want to determine
is the optimal policy for the MDP, value iteration tends to do a lot of overcomputation since the policy as
computed by policy extraction generally converges significantly faster than the values themselves. The fix
for these flaws is to use policy iteration as an alternative, an algorithm that maintains the optimality of value
iteration while providing significant performance gains. Policy iteration operates as follows:

1. Define an initial policy. This can be arbitrary, but policy iteration will converge faster the closer the
initial policy is to the eventual optimal policy.

2. Repeat the following until convergence:

• Evaluate the current policy with policy evaluation. For a policy π, policy evaluation means computing V π (s) for all states s, where V π (s) is the expected utility of starting in state s when following π:
V π (s) = ∑_{s′} T (s, π(s), s′)[R(s, π(s), s′) + γV π (s′)]

Define the policy at iteration i of policy iteration as πi . Since we are fixing a single action for
each state, we no longer need the max operator which effectively leaves us with a system of |S|
equations generated by the above rule. Each V πi (s) can then be computed by simply solving
this system. Alternatively, we can also compute V πi (s) by using the following update rule until
convergence, just like in value iteration:
V_{k+1}^{πi} (s) ← ∑_{s′} T (s, πi (s), s′)[R(s, πi (s), s′) + γV_k^{πi} (s′)]

However, this second method is typically slower in practice.


• Once we’ve evaluated the current policy, use policy improvement to generate a better policy.
Policy improvement uses policy extraction on the values of states generated by policy evaluation
to generate this new and improved policy:

πi+1 (s) = argmax_a ∑_{s′} T (s, a, s′)[R(s, a, s′) + γV πi (s′)]

If πi+1 = πi , the algorithm has converged, and we can conclude that πi+1 = πi = π ∗ .
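
Putting the loop above into code, a minimal sketch of policy iteration looks like the following; it evaluates each policy with the iterative update (the second option described above) and reuses the hypothetical extract_policy helper for the improvement step.

    # Sketch of policy iteration: alternate policy evaluation and policy improvement
    # until the policy stops changing.
    def policy_evaluation(states, policy, T, R, gamma, tol=1e-9):
        V = {s: 0.0 for s in states}
        while True:
            V_new = {}
            for s in states:
                a = policy.get(s)             # None for terminal states
                outcomes = [(s_next, prob) for (si, ai, s_next), prob in T.items()
                            if si == s and ai == a]
                V_new[s] = sum(prob * (R[(s, a, s_next)] + gamma * V[s_next])
                               for s_next, prob in outcomes)   # 0.0 when no outcomes
            if all(abs(V_new[s] - V[s]) < tol for s in states):
                return V_new
            V = V_new

    def policy_iteration(states, actions, T, R, gamma, initial_policy):
        policy = dict(initial_policy)
        while True:
            V = policy_evaluation(states, policy, T, R, gamma)          # evaluate pi_i
            improved = extract_policy(states, actions, T, R, gamma, V)  # improve to pi_{i+1}
            if improved == policy:            # converged: pi_{i+1} = pi_i = pi*
                return policy
            policy = improved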

Let’s run through our racecar example one last time (getting tired of it yet?) to see if we get the same policy
using policy iteration as we did with value iteration. Recall that we were using a discount factor of γ = 0.5.

We start with an initial policy of Always go slow:

cool warm overheated


π0 slow slow −

Because terminal states have no outgoing actions, no policy can assign a value to one. Hence, it’s reasonable
to disregard the state overheated from consideration as we have done, and simply assign ∀i, V πi (s) = 0 for

any terminal state s. The next step is to run a round of policy evaluation on π0 :

V π0 (cool) = 1 · [1 + 0.5 ·V π0 (cool)]


V π0 (warm) = 0.5 · [1 + 0.5 ·V π0 (cool)] + 0.5 · [1 + 0.5 ·V π0 (warm)]

Solving this system of equations for V π0 (cool) and V π0 (warm) yields:

cool warm overheated


V π0 2 2 0

We can now run policy extraction with these values:

π1 (cool) = argmax{slow : 1 · [1 + 0.5 · 2], fast : 0.5 · [2 + 0.5 · 2] + 0.5 · [2 + 0.5 · 2]}
          = argmax{slow : 2, fast : 3}
          = fast
π1 (warm) = argmax{slow : 0.5 · [1 + 0.5 · 2] + 0.5 · [1 + 0.5 · 2], fast : 1 · [−10 + 0.5 · 0]}
          = argmax{slow : 2, fast : −10}
          = slow

Running policy iteration for a second round yields π2 (cool) = fast and π2 (warm) = slow. Since this is the same policy as π1 , we can conclude that π1 = π2 = π ∗ . Verify this for practice!

cool warm
π0 slow slow
π1 fast slow
π2 fast slow

This example shows the true power of policy iteration: with only two iterations, we’ve already arrived at the
optimal policy for our racecar MDP! This is more than we can say for when we ran value iteration on the
same MDP, which was still several iterations from convergence after the two updates we performed.
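
For the racecar, the same sketches from earlier (purely illustrative code, using the hypothetical dictionaries and helpers defined above) recover this optimal policy as well:

    initial_policy = {'cool': 'slow', 'warm': 'slow', 'overheated': None}
    pi_star = policy_iteration(states, actions, racecar_T, racecar_R,
                               gamma=0.5, initial_policy=initial_policy)
    # pi_star comes out as {'cool': 'fast', 'warm': 'slow', 'overheated': None},
    # matching the optimal policy found above.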

Summary
The material presented above offers plenty of opportunity for confusion. We covered value iteration, policy iteration, policy extraction, and policy evaluation, all of which look similar, since each uses the Bellman equation with some subtle variation. Below is a summary of when to use each algorithm:

• Value iteration: Used for computing the optimal values of states, by iterative updates until conver-
gence.

• Policy evaluation: Used for computing the values of states under a specific policy.

• Policy extraction: Used for determining a policy given some state value function. If the state values
are optimal, this policy will be optimal. This method is used after running value iteration, to compute
an optimal policy from the optimal state values; or as a subroutine in policy iteration, to compute the
best policy for the currently estimated state values.

• Policy iteration: A technique that encapsulates both policy evaluation and policy extraction and is
used for iterative convergence to an optimal policy. It tends to outperform value iteration, by virtue of
the fact that policies usually converge much faster than the values of states.
