Dynamic Programming For Partially Observable Stochastic Games
Background

As background, we review the POSG model and two algorithms that we generalize to create a dynamic programming algorithm for POSGs: dynamic programming for POMDPs and elimination of dominated strategies in solving normal form games.

Partially observable stochastic games

A partially observable stochastic game (POSG) is a tuple ⟨I, S, {b^0}, {A_i}, {O_i}, P, {R_i}⟩, where:

• I is a finite set of agents (or controllers) indexed 1, . . . , n
• S is a finite set of states
• b^0 ∈ ∆(S) represents the initial state distribution
• A_i is a finite set of actions available to agent i, and A⃗ = ×_{i∈I} A_i is the set of joint actions (i.e., action profiles), where a⃗ = ⟨a_1, . . . , a_n⟩ denotes a joint action
• O_i is a finite set of observations for agent i, and O⃗ = ×_{i∈I} O_i is the set of joint observations, where o⃗ = ⟨o_1, . . . , o_n⟩ denotes a joint observation
• P is a set of Markovian state transition and observation probabilities, where P(s′, o⃗ | s, a⃗) denotes the probability that taking joint action a⃗ in state s results in a transition to state s′ and joint observation o⃗
• R_i : S × A⃗ → ℝ is a reward function for agent i

A game unfolds over a finite or infinite sequence of stages, where the number of stages is called the horizon of the game. In this paper, we consider finite-horizon POSGs; some of the challenges involved in solving the infinite-horizon case are discussed at the end of the paper. At each stage, all agents simultaneously select an action and receive a reward and observation. The objective, for each agent, is to maximize the expected sum of rewards it receives during the game.

Whether agents compete or cooperate in seeking reward depends on their reward functions. The case in which the agents share the same reward function has been called a decentralized partially observable Markov decision process (DEC-POMDP) (Bernstein et al. 2002).

Dynamic programming for POMDPs

A POSG with a single agent corresponds to a POMDP. We briefly review an exact dynamic programming algorithm for POMDPs that provides a foundation for our exact dynamic programming algorithm for POSGs. We use the same notation for POMDPs as for POSGs, but omit the subscript that indexes an agent.

The first step in solving a POMDP by dynamic programming (DP) is to convert it into a completely observable MDP with a state set B = ∆(S) that consists of all possible beliefs about the current state. Let b^{a,o} denote the belief state that results from belief state b, after action a and observation o. The DP operator can be written in the form

    V^{t+1}(b) = max_{a∈A} Σ_{s∈S} b(s) [ R(s,a) + Σ_{o∈O} P(o|s,a) V^t(b^{a,o}) ],        (1)

where P(o|s,a) = Σ_{s′∈S} P(s′, o|s,a), and the updated value function is computed for all belief states b ∈ B. Exact DP algorithms for POMDPs rely on Smallwood and Sondik's (1973) proof that the DP operator preserves the piecewise linearity and convexity of the value function. This means that the value function can be represented exactly by a finite set of |S|-dimensional value vectors, denoted V = {v_1, v_2, . . . , v_k}, where

    V(b) = max_{1≤j≤k} Σ_{s∈S} b(s) v_j(s).        (2)

As elucidated by Kaelbling et al. (1998), each value vector corresponds to a complete conditional plan that specifies an action for every sequence of observations. Adopting the terminology of game theory, we often refer to a complete conditional plan as a strategy. We use this interchangeably with policy tree, because a conditional plan for a finite-horizon POMDP can be viewed as a tree.

The DP operator of Equation (1) computes an updated value function, but can also be interpreted as computing an updated set of policy trees. In fact, the simplest algorithm for computing the DP update has two steps, which are described below.

The DP operator is given a set Q^t of depth-t policy trees and a corresponding set V^t of value vectors representing the horizon-t value function, and it computes Q^{t+1} and V^{t+1}. In the first step, a set of depth-(t+1) policy trees, Q^{t+1}, is created by generating every possible depth-(t+1) policy tree that makes a transition, after an action and observation, to the root node of some depth-t policy tree in Q^t. This operation will hereafter be called an exhaustive backup. Note that |Q^{t+1}| = |A| |Q^t|^{|O|}. For each policy tree q_j ∈ Q^{t+1}, it is straightforward to compute a corresponding value vector, v_j ∈ V^{t+1}.

The second step is to eliminate policy trees that need not be followed by a decision maker that is maximizing expected value. This is accomplished by eliminating (i.e., pruning) any policy tree when this can be done without decreasing the value of any belief state. Formally, a policy tree q_j ∈ Q^{t+1} with corresponding value vector v_j ∈ V^{t+1} is considered dominated if for all b ∈ B there exists a v_k ∈ V^{t+1} \ v_j such that b · v_k ≥ b · v_j. This test for dominance is performed using linear programming. When q_j is removed from the set Q^{t+1}, its corresponding value vector v_j is also removed from V^{t+1}.

The dual of this linear program can also be used as a test for dominance. In this case, a policy tree q_j with corresponding value vector v_j is dominated when there is a probability distribution p over the other policy trees such that

    Σ_{k≠j} p(k) v_k(s) ≥ v_j(s),  ∀s ∈ S.        (3)

This alternative, and equivalent, test for dominance plays a role in iterated strategy elimination, as we will see in the next section, and was recently applied in the context of POMDPs (Poupart & Boutilier 2004).
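To make the dual test of Equation (3) concrete, the following is a minimal Python sketch of pruning a set of value vectors with an off-the-shelf LP solver. It is our own illustration, not code from the paper; the function names and the use of SciPy are assumptions.

```python
# Hedged sketch of the Equation (3) dominance test: value vector j is pruned
# if some distribution p over the other vectors matches or beats it in every state.
import numpy as np
from scipy.optimize import linprog

def is_dominated(j, vectors):
    """True if vectors[j] passes the Equation (3) test against the other vectors."""
    v_j = vectors[j]
    others = [v for k, v in enumerate(vectors) if k != j]
    if not others:
        return False
    M = np.column_stack(others)           # |S| x K matrix, one column per other vector
    K = M.shape[1]
    # Feasibility LP: find p >= 0 with sum(p) = 1 and M @ p >= v_j componentwise.
    res = linprog(c=np.zeros(K),
                  A_ub=-M, b_ub=-v_j,     # -M p <= -v_j  is the same as  M p >= v_j
                  A_eq=np.ones((1, K)), b_eq=[1.0],
                  bounds=[(0, None)] * K,
                  method="highs")
    return res.success

def prune(vectors):
    """Remove dominated vectors (and, implicitly, their policy trees) until none remain."""
    vecs = list(vectors)
    removed = True
    while removed:
        removed = False
        for j in range(len(vecs)):
            if is_dominated(j, vecs):
                del vecs[j]
                removed = True
                break
    return vecs

# Example: prune([np.array([1., 0.]), np.array([0., 1.]), np.array([.4, .4])])
# drops the third vector, since mixing the first two with p = (0.5, 0.5) dominates it.
```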
Iterated elimination of dominated strategies

Techniques for eliminating dominated strategies in solving a POMDP are very closely related to techniques for eliminating dominated strategies in solving games in normal form. A game in normal form is a tuple G = {I, {D_i}, {V_i}}, where I is a finite set of agents, D_i is a finite set of strategies available to agent i, and V_i : D⃗ → ℝ is the value (or payoff) function for agent i. Unlike a stochastic game, there are no states or state transitions in this model.

Every strategy d_i ∈ D_i is a pure strategy. Let δ_i ∈ ∆(D_i) denote a mixed strategy, that is, a probability distribution over the pure strategies available to agent i, where δ_i(d_i) denotes the probability assigned to strategy d_i ∈ D_i. Let d_{−i} denote a profile of pure strategies for the other agents (i.e., all the agents except agent i), and let δ_{−i} denote a profile of mixed strategies for the other agents. Since agents select strategies simultaneously, δ_{−i} can also represent agent i's belief about the other agents' likely strategies. If we define V_i(d_i, δ_{−i}) = Σ_{d_{−i}} δ_{−i}(d_{−i}) V_i(d_i, d_{−i}), then

    B_i(δ_{−i}) = { d_i ∈ D_i | V_i(d_i, δ_{−i}) ≥ V_i(d′_i, δ_{−i})  ∀d′_i ∈ D_i }        (4)

denotes the best response function of agent i, which is the set of strategies for agent i that maximize the value of some belief about the strategies of the other agents. Any strategy that is not a best response to some belief can be deleted.

A dominated strategy d_i is identified by using linear programming. The linear program identifies a probability distribution σ_i over the other strategies such that

    V_i(σ_i, d_{−i}) > V_i(d_i, d_{−i}),  ∀d_{−i} ∈ D_{−i}.        (5)

This test for dominance is very similar to the test for dominance used to prune strategies in solving a POMDP. It differs in using strict inequality, which is called strict dominance. Game theorists also use weak dominance to prune strategies. A strategy d_i is weakly dominated if V_i(σ_i, d_{−i}) ≥ V_i(d_i, d_{−i}) for all d_{−i} ∈ D_{−i}, and V_i(σ_i, d_{−i}) > V_i(d_i, d_{−i}) for some d_{−i} ∈ D_{−i}. The test for dominance which does not require any strict inequality is sometimes called very weak dominance, and corresponds exactly to the test for dominance in POMDPs, as given in Equation (3). Because a strategy that is very weakly dominated but not weakly dominated must be payoff equivalent to a strategy that very weakly dominates it, eliminating very weakly dominated strategies may have the same effect as eliminating weakly dominated strategies in the reduced normal form representation of a game, where the reduced normal form representation is created by combining any set of payoff-equivalent strategies into a single strategy.

There are a couple other interesting differences between the tests for dominance in Equations (3) and (5). First, there is a difference in beliefs. In normal-form games, beliefs are about the strategies of other agents, whereas in POMDPs, beliefs are about the underlying state. Second, elimination of dominated strategies is iterative when there are multiple agents. When one agent eliminates its dominated strategies, this can affect the best-response function of other agents (assuming common knowledge of rationality). After all agents take a turn in eliminating their dominated strategies, they can consider eliminating additional strategies that may only have been best responses to strategies of other agents that have since been eliminated. The procedure of alternating between agents until no agent can eliminate another strategy is called iterated elimination of dominated strategies.

In solving normal-form games, iterated elimination of dominated strategies is a somewhat weak solution concept, in that it does not (usually) identify a specific strategy for an agent to play, but rather a set of possible strategies. To select a specific strategy requires additional reasoning, and introduces the concept of a Nash equilibrium, which is a profile of strategies (possibly mixed) such that δ_i ∈ B_i(δ_{−i}) for all agents i. Since there are often multiple equilibria, the problem of equilibrium selection is important. (It has a more straightforward solution for cooperative games than for general-sum games.) But in this paper, we focus on the issue of elimination of dominated strategies.

Dynamic programming for POSGs

In the rest of the paper, we develop a dynamic programming algorithm for POSGs that is a synthesis of dynamic programming for POMDPs and iterated elimination of dominated strategies in normal-form games. We begin by introducing the concept of a normal-form game with hidden state, which provides a way of relating the POSG and normal-form representations of a game. We describe a method for eliminating dominated strategies in such games, and then show how to generalize this method in order to develop a dynamic programming algorithm for finite-horizon POSGs.

Normal-form games with hidden state

Consider a game that takes the form of a tuple G = {I, S, {D_i}, {V_i}}, where I is a finite set of agents, S is a finite set of states, D_i is a finite set of strategies available to agent i, and V_i : S × D⃗ → ℝ is the value (or payoff) function for agent i. This definition resembles the definition of a POSG in that the payoff received by each agent is a function of the state of the game, as well as the joint strategies of all agents. But it resembles a normal-form game in that there is no state-transition model. In place of one-step actions and rewards, the payoff function specifies the value of a strategy, which is a complete conditional plan.

In a normal-form game with hidden state, we define an agent's belief in a way that synthesizes the definition of belief for POMDPs (a distribution over possible states) and the definition of belief in iterated elimination of dominated strategies (a distribution over the possible strategies of the other agents). For each agent i, a belief is defined as a distribution over S × D_{−i}, where the distribution is denoted b_i. The value of a belief of agent i is defined as

    V_i(b_i) = max_{d_i∈D_i} Σ_{s∈S, d_{−i}∈D_{−i}} b_i(s, d_{−i}) V_i(s, d_i, d_{−i}).

A strategy d_i for agent i is very weakly dominated if eliminating it does not decrease the value of any belief. The test for very weak dominance is a linear program that determines whether there is a mixed strategy σ_i ∈ ∆(D_i \ d_i) such that

    V_i(s, σ_i, d_{−i}) ≥ V_i(s, d_i, d_{−i}),  ∀s ∈ S, ∀d_{−i} ∈ D_{−i}.        (6)
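The test in Equation (6) has the same linear-programming form as the POMDP test sketched earlier, except that the constraints are indexed by pairs (s, d_{−i}) rather than by states alone. The sketch below is again our own illustration, not code from the paper; the payoff-array layout and names are assumptions.

```python
# Hedged sketch of belief value and the Equation (6) test for agent i in a
# two-agent normal-form game with hidden state.  V_i[d_i, s, d_other] is
# assumed to hold V_i(s, d_i, d_-i).
import numpy as np
from scipy.optimize import linprog

def belief_value(V_i, b_i):
    """Value of belief b_i[s, d_other]: the best strategy's expected payoff."""
    return max(float(np.sum(b_i * V_i[d])) for d in range(V_i.shape[0]))

def very_weakly_dominated(V_i, d):
    """True if strategy d can be removed without lowering the value of any belief."""
    target = V_i[d].reshape(-1)                        # constraints over S x D_-i
    others = np.delete(V_i, d, axis=0).reshape(V_i.shape[0] - 1, -1)
    if len(others) == 0:
        return False
    # Find sigma >= 0 with sum(sigma) = 1 and, for every (s, d_-i),
    # sum_k sigma_k V_i(s, d_k, d_-i) >= V_i(s, d, d_-i).
    res = linprog(c=np.zeros(len(others)),
                  A_ub=-others.T, b_ub=-target,
                  A_eq=np.ones((1, len(others))), b_eq=[1.0],
                  bounds=[(0, None)] * len(others),
                  method="highs")
    return res.success
```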
These generalizations of the key concepts of belief, value of belief, and dominance play a central role in our development of a DP algorithm for POSGs in the rest of this paper.

In our definition of a normal form game with hidden state, we do not include an initial state probability distribution. As a result, each strategy profile is associated with an |S|-dimensional vector that can be used to compute the value of this strategy profile for any state probability distribution. This differs from a standard normal form game in which each strategy profile is associated with a scalar value. By assuming an initial state probability distribution, we could convert our representation to a standard normal form game in which each strategy profile has a scalar value. But our representation is more in keeping with the approach taken by the DP algorithm for POMDPs, and lends itself more easily to development of a DP algorithm for POSGs. The initial state probability distribution given in the definition of a POMDP is not used by the DP algorithm for POMDPs; it is only used to select a policy after the algorithm finishes. The same holds in the DP algorithm for POSGs we develop. Like the POMDP algorithm, it computes a solution for all possible initial state probability distributions.

Normal form of finite-horizon POSGs

Disregarding the initial state probability distribution, a finite-horizon POSG can be converted to a normal-form game with hidden state. When the horizon of a POSG is one, the two representations of the game are identical, since a strategy corresponds to a single action, and the payoff functions for the normal-form game correspond to the reward functions of the POSG. When the horizon of a POSG is greater than one, the POSG representation of the game can be converted to a normal form representation with hidden state, by a recursive construction. Given the sets of strategies and the value (or payoff) functions for a horizon-t game, the sets of strategies and value functions for the horizon-(t+1) game are constructed by exhaustive backup, as in the case of POMDPs. When a horizon-t POSG is represented in normal form with hidden state, the strategy sets include all depth-t policy trees, and the value function is piecewise linear and convex; each strategy profile is associated with an |S|-vector that represents the expected t-step cumulative reward achieved for each potential start state (and so any start state distribution) by following this joint strategy.

If a finite-horizon POSG is represented this way, iterated elimination of dominated strategies can be used in solving the game, after the horizon-t normal form game is constructed. The problem is that this representation can be much larger than the original representation of a POSG. In fact, the size of the strategy set for each agent i is greater than |A_i|^{|O_i|^{t−1}}, which is doubly exponential in the horizon t. Because of the large sizes of the strategy sets, it is usually not feasible to work directly with this representation. The dynamic programming algorithm we develop partially alleviates this problem by performing iterated elimination of dominated strategies at each stage in the construction of the normal form representation, rather than waiting until the construction is finished.

Multi-agent dynamic programming operator

The key step of our algorithm is a multi-agent dynamic programming operator that generalizes the DP operator for POMDPs. As for POMDPs, the operator has two steps. The first is a backup step that creates new policy trees and vectors. The second is a pruning step.

In the backup step, the DP operator is given a set of depth-t policy trees Q_i^t for each agent i, and corresponding sets of value vectors V_i^t of dimension |S × Q_{−i}^t|.¹ Based on the action, transition, observation, and reward model of the POSG, it performs an exhaustive backup on each of the sets of trees, to form Q_i^{t+1} for each agent i. It also recursively computes the value vectors in V_i^{t+1} for each agent i. Note that this step corresponds to recursively creating a normal form with hidden state representation of a horizon-(t+1) POSG, given a normal form with hidden state representation of the horizon-t POSG.

    ¹ The value function V_i^t of agent i can be represented as a set V_i^t of value vectors of dimension |S × Q_{−i}^t|, with one for each strategy in Q_i^t, or as a set of value vectors of dimension |S|, with one for each strategy profile in Q_i^t × Q_{−i}^t. The two representations are equivalent. The latter is more useful in terms of implementation, since it means the size of vectors does not change during iterated elimination of dominated strategies; only the number of vectors changes. (Using this representation, multiple |S|-vectors must be deleted for each strategy deleted.) The former representation is more useful in explaining the algorithm, since it entails a one-to-one correspondence between strategies and value vectors, and so we adopt it in this section.
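To make the backup step concrete, here is a minimal Python sketch of the exhaustive backup for a single agent (our own illustration and tree representation, not the authors' implementation). It also shows why the per-agent set grows as |A_i| |Q_i^t|^{|O_i|}.

```python
# Hedged sketch of the exhaustive-backup step for one agent.  A policy tree is
# represented as a tuple (action, {observation: subtree}); the representation
# and names are assumptions made for illustration.
from itertools import product

def exhaustive_backup(Q_t, actions, observations):
    """All depth-(t+1) trees: a root action plus one depth-t subtree from Q_t
    per observation, so len(result) == len(actions) * len(Q_t) ** len(observations)."""
    Q_next = []
    for a in actions:
        for subtrees in product(Q_t, repeat=len(observations)):
            Q_next.append((a, dict(zip(observations, subtrees))))
    return Q_next

# Tiny check with 2 actions and 2 observations: 2 depth-1 trees become
# 2 * 2**2 = 8 depth-2 trees.
if __name__ == "__main__":
    A, O = ["a0", "a1"], ["o0", "o1"]
    Q1 = [(a, {}) for a in A]
    Q2 = exhaustive_backup(Q1, A, O)
    assert len(Q2) == len(A) * len(Q1) ** len(O) == 8
```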
The second step of the multi-agent DP operator consists of pruning dominated policy trees. As in the single-agent case, an agent i policy tree can be pruned if its removal does not decrease the value of any belief for agent i. As with normal form games, removal of a policy tree reduces the dimensionality of the other agents' belief space, and it can be repeated until no more policy trees can be pruned from any agent's set. (Note that different agent orderings may lead to different sets of policy trees and value vectors. The question of order dependence in eliminating dominated strategies has been extensively studied in game theory, and we do not consider it here.) Pseudocode for the multi-agent DP operator is given in Table 1.

Input: Sets of depth-t policy trees Q_i^t and corresponding value vectors V_i^t for each agent i.
1. Perform exhaustive backups to get Q_i^{t+1} for each i.
2. Recursively compute V_i^{t+1} for each i.
3. Repeat until no more pruning is possible:
   (a) Choose an agent i, and find a policy tree q_j ∈ Q_i^{t+1} for which the following condition is satisfied:
       ∀b ∈ ∆(S × Q_{−i}^{t+1}), ∃v_k ∈ V_i^{t+1} \ v_j s.t. b · v_k ≥ b · v_j.
   (b) Q_i^{t+1} ← Q_i^{t+1} \ q_j.
   (c) V_i^{t+1} ← V_i^{t+1} \ v_j.
Output: Sets of depth-(t+1) policy trees Q_i^{t+1} and corresponding value vectors V_i^{t+1} for each agent i.

Table 1: The multi-agent dynamic programming operator.

The validity of the pruning step follows from a version of the optimality principle of dynamic programming, which we prove for a single iteration of the multi-agent DP operator. By induction, it follows for any number of iterations.

Theorem 1  Consider a set Q_i^t of depth-t policy trees for agent i, and consider the set Q_i^{t+1} of depth-(t+1) policy trees created by exhaustive backup, in the first step of the multi-agent DP operator. If any policy tree q_j ∈ Q_i^t is very weakly dominated, then any policy tree q′ ∈ Q_i^{t+1} that contains q_j as a subtree is also very weakly dominated.

Proof: Consider a very weakly dominated policy tree q_j ∈ Q_i^t. According to the dual formulation of the test for dominance, there exists a distribution p over policy trees in Q_i^t \ q_j such that Σ_{k≠j} p(k) v_k(s, q_{−i}) ≥ v_j(s, q_{−i}) for all s ∈ S and q_{−i} ∈ Q_{−i}^t. (Recall that v_j ∈ V_i^t is the value vector corresponding to policy tree q_j.) Now consider any policy tree q′ ∈ Q_i^{t+1} with q_j as a subtree. We can replace instances of q_j in q′ with the distribution p to get a behavioral strategy, which is a stochastic policy tree. From the test for dominance, it follows that the value of this behavioral strategy is at least as high as that of q′, for any distribution over states and strategies of the other agents. Since any behavioral strategy can be represented by a distribution over pure strategies, it follows that q′ is very weakly dominated. □

Thus, pruning very weakly dominated strategies from the sets Q_i^t before using the dynamic programming operator is equivalent to performing the dynamic programming operator without first pruning Q_i^t. The advantage of first pruning very weakly dominated strategies from the sets Q_i^t is that it improves the efficiency of dynamic programming by reducing the initial size of the sets Q_i^{t+1} generated by exhaustive backup.

It is possible to define a multi-agent DP operator that prunes strongly dominated strategies. However, sometimes a strategy that is not strongly dominated will have a strongly dominated subtree. This is referred to as an incredible threat in the literature. Thus it is an open question whether we can define a multi-agent DP operator that prunes only strongly dominated strategies. In this paper, we focus on pruning very weakly dominated strategies. As already noted, this is identical to the form of pruning used for POMDPs.

There is an important difference between this algorithm and the dynamic programming operator for single-agent POMDPs, in terms of implementation. In the single-agent case, only the value vectors need to be kept in memory. At execution time, an optimal action can be extracted from the value function using one-step lookahead, at each time step. We do not currently have a way of doing this when there are multiple agents. In the multi-agent case, instead of selecting an action at each time step, each agent must select a policy tree (i.e., a complete strategy) at the beginning of the game. Thus, the policy tree sets must also be remembered. Of course, some memory savings is possible by realizing that the policy trees for an agent share subtrees.

Solving finite-horizon POSGs

As we have described, any finite-horizon POSG can be given a normal form representation. The process of computing the normal form representation is recursive. Given the definition of a POSG, we successively compute normal form games with hidden state for horizons one, two, and so on, up to horizon T. Instead of computing all possible strategies for each horizon, we have defined a multi-agent dynamic programming operator that performs iterated elimination of very weakly dominated strategies at each stage. This improves the efficiency of the algorithm because if a policy tree is pruned by the multi-agent DP operator at one stage, every policy tree containing it as a subtree is effectively eliminated, in the sense that it will not be created at a later stage.

We now show that performing iterated elimination of very weakly dominated strategies at each stage in the construction of the normal form game is equivalent to waiting until the final stage to perform iterated elimination of very weakly dominated strategies.

Theorem 2  Dynamic programming applied to a finite-horizon POSG corresponds to iterated elimination of very weakly dominated strategies in the normal form of the POSG.

Proof: Let T be the horizon of the POSG. If the initial state distribution of the POSG is not fixed, then the POSG can be thought of as a normal form game with hidden state. Theorem 1 implies that each time a policy tree is pruned by the DP algorithm, every strategy containing it as a subtree is very weakly dominated in this game. And if a strategy is very weakly dominated when the initial state distribution is not fixed, then it is certainly very weakly dominated for a fixed initial state distribution. Thus, the DP algorithm can be viewed as iteratively eliminating very weakly dominated strategies in the POSG. □

In the case of cooperative games, also known as DEC-POMDPs, removing very weakly dominated strategies preserves at least one optimal strategy profile. Thus, the multi-agent DP operator can be used to solve finite-horizon DEC-POMDPs optimally. When the DP algorithm reaches step T, we can simply extract the highest-valued strategy profile for the start state distribution.

Corollary 1  Dynamic programming applied to a finite-horizon DEC-POMDP yields an optimal strategy profile.

For general-sum POSGs, the DP algorithm converts the POSG to a normal form representation with reduced sets of strategies in which there are no very weakly dominated strategies. Although selecting an equilibrium presents a challenging problem in the general-sum case, standard techniques for selecting an equilibrium in a normal form game can be used.
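For the cooperative case, the final extraction step can be as simple as the following sketch (ours, not the authors' code), which assumes the implementation-oriented representation of footnote 1: one |S|-dimensional value vector per joint strategy profile, with made-up names and numbers.

```python
# Hedged sketch of selecting the best joint policy-tree profile for a
# DEC-POMDP once the DP algorithm reaches the horizon.  The data layout
# (a dict from joint profiles to |S|-vectors) and the toy numbers are assumptions.
import numpy as np

def best_joint_profile(b0, profile_values):
    """Return the (profile, value vector) pair maximizing b0 . v."""
    return max(profile_values.items(), key=lambda kv: float(np.dot(b0, kv[1])))

if __name__ == "__main__":
    b0 = np.array([0.5, 0.5])                        # start state distribution
    values = {("q1a", "q2a"): np.array([3.0, 1.0]),  # hypothetical joint profiles
              ("q1b", "q2b"): np.array([2.0, 2.5])}
    profile, v = best_joint_profile(b0, values)
    print(profile, float(np.dot(b0, v)))             # ('q1b', 'q2b') 2.25
```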
Example

We ran initial tests on a cooperative game involving control of a multi-access broadcast channel (Ooi & Wornell 1996). In this problem, nodes need to broadcast messages to each other over a channel, but only one node may broadcast at a time, otherwise a collision occurs. The nodes share the common goal of maximizing the throughput of the channel.

The process proceeds in discrete time steps. At the start of each time step, each node decides whether or not to send a message. The nodes receive a reward of 1 when a message is successfully broadcast and a reward of 0 otherwise. At the end of the time step, each node receives a noisy observation of whether or not a message got through.

The message buffer for each agent has space for only one message. If a node is unable to broadcast a message, the message remains in the buffer for the next time step. If a node i is able to send its message, the probability that its buffer will fill up on the next step is p_i. Our problem has two nodes, with p_1 = 0.9 and p_2 = 0.1. There are 4 states, 2 actions per agent, and 2 observations per agent.

We compared our DP algorithm with a brute-force algorithm, which also builds sets of policy trees, but never prunes any of them. On a machine with 2 gigabytes of memory, the brute-force algorithm was able to complete iteration 3 before running out of memory, while the DP algorithm was able to complete iteration 4. At the end of iteration 4, the number of policy trees for the DP algorithm was less than 1% of the number that would have been produced by the brute-force algorithm, had it been able to complete the iteration. This result, shown in Table 2, indicates that the multi-agent DP operator can prune a significant number of trees.

    Horizon    Brute force         Dynamic programming
       1       (2, 2)              (2, 2)
       2       (8, 8)              (6, 6)
       3       (128, 128)          (20, 20)
       4       (32768, 32768)      (300, 300)

Table 2: Performance of both algorithms on the multi-access broadcast channel problem. Each cell displays the number of policy trees produced for each agent. The brute-force algorithm could not compute iteration 4; the numbers shown in that cell reflect how many policy trees it would need to create for each agent.

However, even with pruning, the number of policy trees grows quickly with the horizon. At the end of the fourth iteration, each agent has 300 policy trees that are not dominated. Because the piecewise linear and convex value function consists of one |S|-vector for each pair of policy trees from the two agents, the representation of the value function requires 300² |S|-vectors. In the fifth iteration, an exhaustive backup would create a value function that consists of 2·300⁴ |S|-vectors, or more than 16 billion |S|-vectors, before beginning the process of pruning. This illustrates how the algorithm can run out of memory. In the next section, we discuss possible ways to avoid the explosion in size of the value function.

Figure 1 shows a pair of depth-4 policy trees constructed by the DP algorithm. In the case where the message buffers both start out full, this pair is optimal, yielding a total reward of 3.89.

[Figure 1: A pair of policy trees, one for Agent 1 and one for Agent 2, that is optimal for the horizon-4 problem when both message buffers start out full. Node labels: s = send message, d = don't send message; branch labels: c = collision, n = no collision.]
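The counts above follow directly from the backup recurrence |Q^{t+1}| = |A| |Q^t|^{|O|} with two actions and two observations per agent; the short script below (our own check, not from the paper) reproduces the brute-force column of Table 2.

```python
# Reproduce the unpruned policy-tree counts of Table 2 from the backup
# recurrence |Q^{t+1}| = |A| * |Q^t| ** |O| (2 actions, 2 observations per agent).
NUM_ACTIONS, NUM_OBS = 2, 2

count = NUM_ACTIONS                                    # depth-1 trees per agent
for horizon in range(1, 5):
    print(f"horizon {horizon}: {count} trees per agent without pruning")
    count = NUM_ACTIONS * count ** NUM_OBS
# Prints 2, 8, 128, 32768 -- the brute-force column of Table 2.

# With pruning, each agent keeps 300 trees at horizon 4, so the value function
# stores one |S|-vector per pair of trees:
print(300 ** 2, "value vectors at horizon 4")          # 90000
# and one more exhaustive backup yields 2 * 300**2 = 180000 trees per agent
# before any pruning begins.
print(NUM_ACTIONS * 300 ** NUM_OBS, "trees per agent after one more backup")
```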
Future work

Development of an exact dynamic programming approach to solving POSGs suggests several avenues for future research, and we briefly describe some possibilities.

Improving efficiency

A major scalability bottleneck is the fact that the number of policy trees grows rapidly with the horizon and can quickly consume a large amount of memory. There are several possible ways to address this. One technique that provides computational leverage in solving POMDPs is to prune policy trees incrementally, so that an exhaustive backup never has to be done (Cassandra, Littman, & Zhang 1997). Whether this can be extended to the multi-agent case is an open problem. Other techniques seem easier to extend. More aggressive pruning, such as pruning strategies that are almost very weakly dominated, can reduce the number of policy trees in exchange for bounded sub-optimality (Feng & Hansen 2001). The number of policy trees may be reduced by allowing stochastic policies, as in (Poupart & Boutilier 2004). Work on compactly represented POMDPs and value functions may be extended to the multi-agent case (Hansen & Feng 2000).

In addition, there exist POMDP algorithms that leverage a known start state distribution for greater efficiency. These algorithms perform a forward search from the start state and are able to avoid unreachable belief states. Whether some kind of forward search can be done in the multi-agent case is an important open problem.

Extension to infinite-horizon POSGs

It should be possible to extend our dynamic programming algorithm to infinite-horizon, discounted POSGs, and we are currently exploring this. In the infinite-horizon case, the multi-agent DP operator is applied to infinite trees. A finite set of infinite trees can be represented by a finite-state controller, and policy iteration algorithms for single-agent POMDPs have been developed based on this representation (Hansen 1998; Poupart & Boutilier 2004). We believe that they can be extended to develop a policy iteration algorithm for infinite-horizon POSGs. Because our definition of belief depends on explicit representation of a policy as a policy tree or finite-state controller, it is not obvious that a value iteration algorithm for infinite-horizon POSGs is possible.

Conclusion

We have presented an algorithm for solving POSGs that generalizes both dynamic programming for POMDPs and iterated elimination of dominated strategies for normal form games. It is the first exact algorithm for general POSGs, and we have shown that it can be used to find optimal solutions for cooperative POSGs. Although currently limited to solving very small problems, its development helps to clarify the relationship between POMDPs and game-theoretic models. There are many avenues for future research, in both making the algorithm more time and space efficient and extending it beyond finite-horizon POSGs.

Acknowledgments  We thank the anonymous reviewers for helpful comments. This work was supported in part by the National Science Foundation under grants IIS-0219606 and IIS-9984952, by NASA under cooperative agreement NCC 2-1311, and by the Air Force Office of Scientific Research under grant F49620-03-1-0090. Daniel Bernstein was supported by a NASA GSRP Fellowship. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of the NSF, NASA or AFOSR.

References

Becker, R.; Zilberstein, S.; Lesser, V.; and Goldman, C. V. 2003. Transition-independent decentralized Markov decision processes. In Proceedings of the 2nd International Conference on Autonomous Agents and Multi-agent Systems, 41–48.

Bernstein, D.; Givan, R.; Immerman, N.; and Zilberstein, S. 2002. The complexity of decentralized control of Markov decision processes. Mathematics of Operations Research 27(4):819–840.

Boutilier, C. 1999. Sequential optimality and coordination in multiagent systems. In Proceedings of the 16th International Joint Conference on Artificial Intelligence, 478–485.

Brafman, R., and Tennenholtz, M. 2002. R-MAX - a general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research 3:213–231.

Cassandra, A.; Littman, M. L.; and Zhang, N. L. 1997. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proceedings of the 13th Annual Conference on Uncertainty in Artificial Intelligence, 54–61.

Feng, Z., and Hansen, E. 2001. Approximate planning for factored POMDPs. In Proceedings of the 6th European Conference on Planning.

Filar, J., and Vrieze, K. 1997. Competitive Markov Decision Processes. Springer-Verlag.

Hansen, E., and Feng, Z. 2000. Dynamic programming for POMDPs using a factored state representation. In Proceedings of the 5th International Conference on Artificial Intelligence Planning and Scheduling, 130–139.

Hansen, E. 1998. Solving POMDPs by searching in policy space. In Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI-98), 211–219.

Hsu, K., and Marcus, S. I. 1982. Decentralized control of finite state Markov processes. IEEE Transactions on Automatic Control AC-27(2):426–431.

Hu, J., and Wellman, M. 2003. Nash Q-learning for general-sum stochastic games. Journal of Machine Learning Research 4:1039–1069.

Kaelbling, L.; Littman, M.; and Cassandra, A. 1998. Planning and acting in partially observable stochastic domains. Artificial Intelligence 101:99–134.

Kearns, M.; Mansour, Y.; and Singh, S. 2000. Fast planning in stochastic games. In Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence (UAI-00), 309–316.

Koller, D., and Pfeffer, A. 1997. Representations and solutions for game-theoretic problems. Artificial Intelligence 94(1):167–215.

Koller, D.; Megiddo, N.; and von Stengel, B. 1994. Fast algorithms for finding randomized strategies in game trees. In Proceedings of the 26th ACM Symposium on Theory of Computing, 750–759.

Kuhn, H. 1953. Extensive games and the problem of information. In Kuhn, H., and Tucker, A., eds., Contributions to the Theory of Games II. Princeton University Press. 193–216.

Littman, M. 1994. Markov games as a framework for multi-agent reinforcement learning. In Proceedings of the 11th International Conference on Machine Learning, 157–163.

Nair, R.; Pynadath, D.; Yokoo, M.; Tambe, M.; and Marsella, S. 2003. Taming decentralized POMDPs: Towards efficient policy computation for multiagent settings. In Proceedings of the 18th International Joint Conference on Artificial Intelligence, 705–711.

Ooi, J. M., and Wornell, G. W. 1996. Decentralized control of a multiple access broadcast channel: Performance bounds. In Proceedings of the 35th Conference on Decision and Control, 293–298.

Peshkin, L.; Kim, K.-E.; Meuleau, N.; and Kaelbling, L. P. 2000. Learning to cooperate via policy search. In Proceedings of the 16th International Conference on Uncertainty in Artificial Intelligence, 489–496.

Poupart, P., and Boutilier, C. 2004. Bounded finite state controllers. In Advances in Neural Information Processing Systems 16: Proceedings of the 2003 Conference. MIT Press.

Shapley, L. 1953. Stochastic games. Proceedings of the National Academy of Sciences of the United States of America 39:1095–1100.

Smallwood, R., and Sondik, E. 1973. The optimal control of partially observable Markov processes over a finite horizon. Operations Research 21:1071–1088.